I was recently surprised by a 404 in Google's Webmaster Tools
which pointed to a truncated URL (http://www.wdl.org/en/item/) that is
never actually linked to. This happens frequently but in this case the failure was
interesting because the source link was an unlinked URL on a Persian blog and the
source link actually worked:
The canonical URL for that page is
and the author had presumably pasted the URL into a page where their software had
helpfully converted the Latin digits (e.g. 0123456789) into what are variously known as
Eastern Arabic, Hindi or Indic-Arabic numerals:
٠١٢٣٤٥٦٧٨٩ (a closer look shows that this is the Persian variant as the
6 is actually ۶ instead of
Presumably Googlebot scans HTML for text which looks like a URL but uses a simple
parser which stops as soon as it finds a character outside a restricted set,
perhaps only ISO-8859-1, truncating the URL while other services
(e.g. Facebook, Google+, GitHub, etc.) extract the full URL.
So … mystery solved, we're done, right?
Wait, why does that URL work in the first place? We never added that as a supported
feature and, being security-aware, all of our URLs are carefully validated to ensure
that the IDs are valid numeric values – both as part of
Django's URL dispatching
and when the item is retrieved from the database. Besides, since the item IDs aren't
assigned using Eastern-Arabic digits how does it actually match the record?
Python's documentation for int() doesn't mention anything about this but
the documentation for isdecimal() gives us a clue:
Decimal characters include digit characters, and all characters that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO.
What's happening here is that Python uses the Unicode definition of “digit”, which is actually a fairly long list.
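The behaviour is easy to confirm interactively. The following is a minimal sketch, assuming Python 3 (where str is always Unicode); the variable names are only for illustration. It checks the Unicode category of each character, converts the string with int(), and shows that \d in a regular expression accepts the same characters unless the re.ASCII flag is set.

```python
import re
import unicodedata

# The Persian (Extended Arabic-Indic) digits from the URL discussed above
persian_id = "۲۶۷۹"

# Every character falls in the Unicode "Nd" (Number, decimal digit) category
categories = {unicodedata.category(ch) for ch in persian_id}

# int() and str.isdecimal() both accept any Nd character
as_int = int(persian_id)
is_decimal = persian_id.isdecimal()

# For str patterns, \d matches any Unicode decimal digit by default...
unicode_match = re.fullmatch(r"\d+", persian_id) is not None

# ...while the re.ASCII flag restricts \d to the characters 0-9
ascii_match = re.fullmatch(r"\d+", persian_id, flags=re.ASCII) is not None

print(categories, as_int, is_decimal, unicode_match, ascii_match)
```

If only Latin digits should be accepted, an explicit [0-9]+ character class or the re.ASCII flag is one way to say so.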
The specification uses the
“Nd = Number, decimal digit” classification
and a quick look at the
Unidata script extensions list
shows quite a few different characters flagged with Nd. Python should treat all of those
as valid digits for the purposes of calling int(), isdecimal(),
and even the regular expressions used to validate our URLs. ۲۶۷۹ will match \d+
and will have been converted to the number 2679 by the time the database
Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr
Digitizing printed material has become an industrial process for large collections. Modern scanning equipment makes it easy to process millions of pages and concerted engineering effort has even produced options at the high-end for fragile rare items while innovative open-source projects like Project Gado make continual progress reducing the cost of reliable, batch scanning to fit almost any organization's budget.
Such efficiencies are great for our goals of preserving history and making it available but they start making painfully obvious the degree to which digitization capacity outstrips our ability to create metadata. This is a big problem because most of the ways we find information involve searching for text and a large TIFF file is effectively invisible to a full-text search engine. The classic library solution to this challenge has been cataloging but the required labor is well beyond most budgets and runs into philosophical challenges when users want to search on something which wasn't considered noteworthy at the time an item was cataloged.
In the spirit of finding the simplest thing that could possibly work I've been experimenting with a completely automated approach to perform OCR on new items and offering combined full-text search over both the available metadata and OCR text, as can be seen in this example:
Generating OCR text
As we receive new items, anything which matches our criteria (books, journals and newspapers created after 1800; see below) is automatically placed into a Celery task queue as a low-priority task. Workers on multiple servers accept OCR tasks from the queue and process the master image with Tesseract, using a simple shell command to generate text and HTML with embedded hOCR metadata.
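As a rough illustration of the worker side, here is a minimal sketch of what the body of such a task might look like. The function names and paths are hypothetical, the Celery wiring is left out, and the exact command used in production is not shown in this post. It assumes a Tesseract release which accepts both the txt and hocr config names in one invocation; older releases may need separate runs and may write the hOCR output with an .html extension.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list asking Tesseract for plain text and hOCR output."""
    return [
        "tesseract",
        str(image_path),
        str(output_base),   # Tesseract appends the file extensions itself
        "-l", language,
        "txt", "hocr",      # assumed: both config names are accepted together
    ]


def ocr_master_image(image_path, output_dir, language="eng"):
    """Run Tesseract on one master image (hypothetical task body).

    Raises subprocess.CalledProcessError if Tesseract exits with an error.
    """
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    command = build_tesseract_command(image_path, output_base, language)
    subprocess.run(command, check=True)
    return Path(str(output_base) + ".txt"), Path(str(output_base) + ".hocr")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual filenames.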
Most people expect a combined search these days where relevant terms are selected from both descriptive metadata and the text contents. Simply combining all of the text into a single document to be indexed is unsuitable, however, because we want to be able to offer the ability to only search metadata in certain cases and we want to be able to return specific pages rather than telling someone to visually scan through a 700 page book. Unfortunately, this approach is incompatible with the normal way search engines determine the most relevant results for a query:
Storing each page separately means that the search score will be determined independently rather than for the entire item. This would prevent books from scoring highly unless all of the words were mentioned on a single page and, far worse, many queries would return pages from a single book mixed throughout the results based on their individual scores! The solution to this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team is working on a similar feature). With field collapsing enabled, Solr will first group all of the matching documents using a specified field and then compute the scores for each combined group. This means that we can group our results by the item ID and receive a list of groups (i.e. items) with one or more documents (i.e. pages or metadata) which we can use to build exact links into a large book.
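A grouped query of this kind can be sketched as follows. This is only an illustration: the item_id field name and the helper names are assumptions rather than the actual schema, while group, group.field, group.limit and group.ngroups are Solr's standard result grouping parameters. The second function walks the grouped JSON structure which Solr returns for such a query.

```python
from urllib.parse import urlencode


def build_grouped_query(user_query, group_field="item_id", pages_per_item=3):
    """Return a query string which asks Solr to group matches by item."""
    params = {
        "q": user_query,
        "wt": "json",
        "group": "true",                # enable result grouping
        "group.field": group_field,     # hypothetical field holding the item ID
        "group.limit": pages_per_item,  # documents returned per group
        "group.ngroups": "true",        # include the total number of groups
    }
    return urlencode(params)


def items_from_response(response, group_field="item_id"):
    """Convert a grouped Solr JSON response into a list of items with pages."""
    items = []
    for group in response["grouped"][group_field]["groups"]:
        items.append({
            "item_id": group["groupValue"],
            "pages": group["doclist"]["docs"],
        })
    return items
```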
Search results are returned as simple HTML with the embedded data which we'll need to provide the original image segments. Here's what happens when someone searches for guineé:
At this point we have quickly returned search results and can link directly to individual pages but we're showing frequently ugly OCR text directly and not providing as much context as we'd like. The next step is to replace that raw text with an image slice from the scanned page:
An XHR request is made to retrieve the word coordinates for every word on each returned page
The word coordinate list is scanned for each highlighted word and the coordinates are selected. Since we often find words in multiple places on the same page and we want to display an easily readable section of text, the list of word coordinates is coalesced starting from the top of the page and no more than the first third of the page will be returned. For this display, we always use the full width of the page but the same process could be used to generate smaller slices if desired.
A separate request is made to load the relevant image slice. When the image has loaded, we replace the raw OCR text with the image. This way the raw text is visible for as long as it takes to load the image so we avoid showing empty areas until everything has transferred.
Finally, a partially-transparent overlay is displayed over the image for each word coordinate to highlight the matches (see e.g. css-tricks.com if you're not familiar with this form of CSS positioning). Since the OCRed word coordinates aren't consistently tightly cropped around the letters in the word a minor CSS box-shadow is used to make the edges softer and more like a highlighter.
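The coalescing step above runs in the browser, but the logic can be sketched in a few lines of Python. This is one possible reading of the rule rather than the actual implementation: it assumes each matched word is a (left, top, right, bottom) tuple in pixels and that the slice starts at the first match and may grow to at most a third of the page height.

```python
def coalesce_matches(word_boxes, page_height, max_fraction=1 / 3):
    """Pick a vertical slice of the page covering the first matched words.

    word_boxes: list of (left, top, right, bottom) pixel tuples for matches.
    Returns None when there are no matches.
    """
    if not word_boxes:
        return None

    # Work from the top of the page downwards
    ordered = sorted(word_boxes, key=lambda box: box[1])
    top = ordered[0][1]
    bottom = ordered[0][3]

    # Assumed rule: the slice may not extend past a third of the page height
    limit = top + page_height * max_fraction

    included = [ordered[0]]
    for box in ordered[1:]:
        if box[3] > limit:
            break
        bottom = max(bottom, box[3])
        included.append(box)

    return {"top": top, "bottom": bottom, "boxes": included}
```

The boxes which are returned are the ones which would receive a highlight overlay; matches further down the page are still reachable by opening the page itself.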
From a workflow perspective, I highly recommend recording the source of your OCR text and whether it's been reviewed. Since this is a fully automated process it is extremely handy to be able to reprocess items in the future if your software improves without accidentally clobbering any items which have been hand-corrected by humans.
The word coordinates are pixel-level coordinates based on the input file but our requests are made using calculated percentages since it's often the case that the scans are much higher resolution than we would want to display in a web browser and our users wouldn't want to wait for a 600-dpi image to download in any case.
You might be wondering why all of this work is performed on the client side rather than having the server return highlighted images. In addition to reducing server load, this approach is friendlier for caches because a given image segment can be reused for multiple words (rounding the coordinates improves the cache hit ratio significantly) and both the image and word coordinates can thus be cached by CDN edge servers rather than requiring a full round-trip back to the server.
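The percentage-based requests and the coordinate rounding can be illustrated with a short sketch. The 5% step and the function name are assumptions chosen for the example, not the values used on the site. Rounding the top down and the bottom up keeps every matched word inside the slice while letting nearby matches share one cached image.

```python
import math


def slice_to_percentages(top_px, bottom_px, page_height_px, step=5):
    """Convert a pixel slice to percentages rounded outwards to a fixed step."""
    top_pct = 100.0 * top_px / page_height_px
    bottom_pct = 100.0 * bottom_px / page_height_px

    # Round outwards so the slice never cuts off a matched word
    rounded_top = max(0, math.floor(top_pct / step) * step)
    rounded_bottom = min(100, math.ceil(bottom_pct / step) * step)
    return rounded_top, rounded_bottom


# Two slightly different matches on the same page map to the same slice,
# as does the same region measured on a scan with twice the resolution.
print(slice_to_percentages(100, 220, 1200))
print(slice_to_percentages(110, 230, 1200))
print(slice_to_percentages(200, 440, 2400))
```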
One common example of the cache-ability benefit is when you open a result and start reading it: in the viewer, we display full page images rather than the trimmed slices so we must fetch new images but those are likely to be cached because they haven't been customized with the search text and we can reuse the locally-cached word coordinates to immediately display the highlighting. If you change your search text within an item, we can again immediately update the display while the revised page list is retrieved.
Challenges & Future Directions
This was supposed to be the simplest thing which could possibly work and it turned out not to be that simple. As you might imagine, this leaves a number of open questions for where to go next:
OCR results vary considerably based on the quality of the input image. Accuracy can be improved considerably by preprocessing the image to remove borders and noise, or by using a more sophisticated algorithm to convert a full-color scan into the black-and-white image which Tesseract operates on. The trick is either coming up with good presets for your data, perhaps integrated with an image processing tool like ScanTailor, or developing smarter code which can select filters based on the characteristics of the image.
For older items, the OCR process is complicated by the condition of the materials, more primitive printing technology and stylistic choices like the long s (ſ) or ligatures which are no longer in common usage and thus not well supported by common OCR programs. One of my future goals is looking into the tools produced by the Early Modern OCR Project and seeing whether there's a production-ready path for this.
This seems like an area for research which any organization with large digitized collections should be supporting, particularly with an eye towards easier reuse. Ed Summers and I have idly discussed the idea for a generic web application which would display hOCR with the corresponding images for correction with all of the data stored somewhere like GitHub for full change tracking and review. It seems like something along those lines would be particularly valuable as a public service to avoid the expense of everyone reinventing large parts of this process customized for their particular workflow.
The fight against DRM is not worth discarding your integrity. Misrepresenting the W3C's Encrypted Media Extensions will not do anything useful but it will hold the web back and make the EFF less effective.
First, some background: I've been a supporter of and donor to the Electronic Frontier Foundation for a long time – at least 2001, although I believe I started earlier during the 90s Crypto Wars – and opposed to DRM for at least as long. I've also been a fan of Danny O'Brien's reporting and personal blog for a similarly long time.
Unfortunately, today had me reconsidering that support because of O'Brien's recent blog post: Lowering Your Standards: DRM and the Future of the W3C. I feel this marks a dangerous trend of playing very loose with the facts in an attempt to pressure the W3C to drop the Encrypted Media Extensions (EME) spec and that this is not only likely to fail but to actually backfire by ensuring that millions of people continue to access content through proprietary, closed systems.
A little background information: most video played on the web and particularly commercial content uses Adobe's Flash or Microsoft's Silverlight plugins to run a video player inside a webpage. Both Flash and Silverlight are full programming environments with a significant range of capabilities beyond video playback and have significant overlap with the features provided by your browser. They're distributed as browser plugins, which require a hefty download to be installed before viewing anything, and both generally require proprietary tools for developers to create applications.
They're annoying for developers because they require using a completely different set of technologies than you use for everything else on the web but many places will write that off as a cost of doing business. What's more of a concern is that both plugins have a history of security problems and neither Microsoft nor Adobe appears to be particularly motivated to build the kind of fast, reliable, automatic update system which the modern browsers have. So in addition to requiring your users to download something before viewing content, you're contributing to one of the leading sources of security exploits for the average user. It also means that anyone who wishes to publish video on the web is generally subject to the development roadmap of one of two companies.
HTML5 offers a way out of this mess: browsers could play back video directly, avoiding the massive external dependency and allowing them to make improvements for video as quickly as they do anything else rather than hoping a third-party developer wants to make improvements. HTML5 <video> is very easy to use, fast and has a consistent high-quality user experience. Unfortunately anyone looking to use it for commercial content will learn that the licensing rules from all of the major content owners require the use of DRM and thus HTML5 video is not an option.
What is EME, anyway?
DRM does not work and all DRMed content has ended up being available in unencrypted form very quickly because the only way to make DRM work is by completely locking down a device to prevent its owner from running code which can access the unencrypted data and, of course, there's always the Analog Hole. The EFF has a long, laudable history attempting to educate the public and lawmakers about these issues and I completely support those efforts.
Unfortunately, this effort has failed. No significant amount of commercial video on the web is available without DRM and users don't seem to care, as shown by the billions of dollars of sales through iTunes, Amazon, Google Play, etc. Netflix alone uses somewhere around 30% of the total Internet traffic in North America to serve DRM-encumbered video, mostly using Silverlight. Clearly convenience and availability are more important to people.
The EFF has been taking a hard-line position on EME, focused on slippery-slope claims:
This is similar to some of the past claims made by Cory Doctorow:
Both of these are simply wrong: there is no meaningful distinction between what EME proposes and what is already the case with a browser plugin. If Firefox can play Flash or Silverlight content, it can decrypt video using a CDM which is either included in the host operating system, bundled under an agreement similar to Chrome's Flash plugin or installed by the user.
The real problem is that they're arguing the wrong point: those requests have always been made and, in most cases, have already happened. The lack of a W3C standard hasn't stopped the Amazon Kindle app from preventing you from saving unencrypted text, iTunes from blocking you from saving snippets of a rented movie, etc. and it hasn't prevented either Adobe or Microsoft from adding every DRM feature requested by the content owners. What this has done is ensure that the web community hasn't had much say in the process because all of the content is created and played using proprietary closed software.
The EFF is shouting loudly but only Adobe and Microsoft will benefit. There's no indication whatsoever that the studios are going to drop their DRM requirements if this W3C spec is scuttled – we'll just continue to see a lot of opaque plugin content and, of course, more pressure away from the web towards proprietary app stores. Mozilla's Asa Dotzler summed this up perfectly earlier today on Hacker News:
[T]he businesses (Hollywood) with the content that Web users want have done that math and decided that DRM through plug-ins and native apps is an EXCELLENT system and they're happy to keep mandating it forever. If Plug-ins go away, as they're slowly but surely doing, then native apps will be the only place to get this content.
This approach also runs the risk of damaging the reputation of the EFF and making it less effective: beyond basic factual problems, exaggerating the risks will backfire badly if people look and – correctly – see that the situation isn't so terrible (Netflix at $10/month is absurdly popular despite the DRM) and discount future claims made by the EFF. They'll need that credibility as the war on general purpose computing continues — and Cory is not wrong to sound the alarm over that.
What the open web community should be doing now is working to ensure that EME is designed in a way which improves security and reduces the proprietary footprint. If the standard for CDMs includes aggressive sandboxing it's a huge win for security alone even if all it does is turn Flash into a collective bad memory for web users. Additionally, separating the task of building a decryption module from building a high-performance video player with robust networking makes it significantly easier for new vendors to enter the market and increases portability because so much less code needs to be adapted to a new platform.
Obviously a shift away from the DRM obsession won't happen overnight but it's not impossible, either, as content owners are concerned about the market leverage which the major DRM vendors like Apple and Amazon have. There's space for smart players willing to back away from DRM in favor of an approach which works at least as well and doesn't require hardware vendors to sell out their users. As Brendan said, there is hope.
Brendan Eich, Mozilla's CTO, posted his position on the EME issue: The Bridge of Khazad-DRM. Pushing the W3C for CDM-level interoperability is a good call and definitely feels characteristic of Mozilla by balancing the goal of protecting users’ interests with the realistic constraints of the current browser market. I strongly hope they succeed.
IT is one of the bright spots in the US economy – perhaps our government should be more cautious about helping the competition…
This is a great example of how the NSA's rogue actions are going to be endangering US IT companies for years: RSA has a security advisory out for several products, including a widely-used cryptography library, which defaulted to using the Dual EC DRBG random number generator, which we now know was released by the NSA with a backdoor to make it easier to spy on people.
Amidst all of the confusion and concern over an encryption algorithm that may contain an NSA backdoor, RSA Security released an advisory to developer customers today noting that the algorithm is the default algorithm in one of its toolkits and strongly advises them to stop using the algorithm.
This likely makes things weaker in a way which others could exploit – and given the high odds that people in e.g. China and Russia are racing to test that, it's likely that the NSA's actions exposed millions of people to unnecessary additional risk by weakening important software.
It's likely even more damaging, however, to the US IT industry's future. We can ship updates to software relatively quickly but the question of trust is going to be much thornier: almost every RSA customer – and especially foreign ones – must be asking whether RSA was innocently duped or actively collaborating. Given how much business they do with the US government, they're probably never going to be able to convincingly disprove that theory. Every other major security vendor in the US and certain allied countries is going to face a similar question: “How do we know you won't be in the news next?”