Chris Adams

Content search on a budget

Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr

Digitizing printed material has become an industrial process for large collections. Modern scanning equipment makes it easy to process millions of pages, and concerted engineering effort has even produced high-end options for fragile, rare items. Meanwhile, innovative open-source projects like Project Gado keep reducing the cost of reliable batch scanning to fit almost any organization's budget.

Such efficiencies are great for our goals of preserving history and making it available but they start making painfully obvious the degree to which digitization capacity outstrips our ability to create metadata. This is a big problem because most of the ways we find information involve searching for text and a large TIFF file is effectively invisible to a full-text search engine. The classic library solution to this challenge has been cataloging but the required labor is well beyond most budgets and runs into philosophical challenges when users want to search on something which wasn't considered noteworthy at the time an item was cataloged.

In the spirit of finding the simplest thing that could possibly work, I've been experimenting with a completely automated approach which performs OCR on new items and offers combined full-text search over both the available metadata and the OCR text, as can be seen in this example:

Searching for “whales” on the World Digital Library

The Process

Generating OCR text

As we receive new items, anything which matches our criteria (books, journals and newspapers created after 1800; see below) is automatically placed into a Celery task queue as a low-priority task. Workers on multiple servers accept OCR tasks from the queue and process the master image with Tesseract, using a simple shell command to generate text and HTML with embedded hOCR metadata.
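The shell command itself isn't shown here, so the following is only a minimal Python sketch of what a worker might run. It assumes the tesseract binary is on the PATH. The function names, the output layout and the default language are illustrative assumptions, not the production task.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for one Tesseract run.

    "hocr" is a stock Tesseract config name which switches the output to
    HTML with embedded hOCR word boxes; without it Tesseract writes plain text.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", language, "hocr"]


def ocr_page(image_path, output_dir, language="eng"):
    """Run Tesseract on one master image and return the path of the hOCR file."""
    output_base = Path(output_dir) / Path(image_path).stem
    subprocess.run(tesseract_command(image_path, output_base, language), check=True)
    # The extension of the hOCR file depends on the Tesseract version.
    for suffix in (".hocr", ".html"):
        candidate = Path(str(output_base) + suffix)
        if candidate.exists():
            return candidate
    raise FileNotFoundError("Tesseract produced no hOCR output for %s" % image_path)
```

With a queue like Celery, a function along the lines of ocr_page would be the body of the low-priority task, so a failed page raises an error and can be retried without blocking anything else.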

Once we have the OCR text, it's transformed to meet two different needs. A full-text search engine like Apache Solr or ElasticSearch works with the pure text output, but because we want to be able to highlight specific words, the task also converts the hOCR into a word coordinates JSON file with the pixel coordinates for every word on the page.
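As a rough illustration of that conversion, here is a sketch which reads the ocrx_word spans that Tesseract emits in hOCR (each carries a title attribute such as "bbox 10 10 120 40; x_wconf 91") and collects the pixel boxes. The JSON layout with "text" and "bbox" keys is an assumption made for this example, not necessarily the format used on the site.

```python
import json
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect the text and bounding box of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = [int(value) for value in match.groups()]
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append({"text": text, "bbox": self._bbox})
            self._bbox = None


def hocr_to_word_coordinates(hocr_html):
    """Return a list of {"text", "bbox": [x0, y0, x1, y1]} dicts in page order."""
    parser = WordBoxParser()
    parser.feed(hocr_html)
    parser.close()
    return parser.words


def word_coordinates_json(hocr_html):
    """Serialize the word list for storage next to the page image."""
    return json.dumps(hocr_to_word_coordinates(hocr_html))
```

The plain text for the search index can be produced from the same word list by joining the "text" values, which keeps both outputs in sync.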

Indexing the text for search

Most people expect a combined search these days where relevant terms are selected from both descriptive metadata and the text contents. Simply combining all of the text into a single document to be indexed is unsuitable, however, because we want to be able to offer the ability to only search metadata in certain cases and we want to be able to return specific pages rather than telling someone to visually scan through a 700-page book. Unfortunately, indexing each page as its own document is at odds with the normal way search engines determine the most relevant results for a query:

Storing each page separately means that the search score will be determined independently rather than for the entire item. This would prevent books from scoring highly unless all of the words were mentioned on a single page and, far worse, many queries would return pages from a single book mixed throughout the results based on their individual scores! The solution to this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team is working on a similar feature). With field collapsing enabled, Solr will first group all of the matching documents using a specified field and then compute the scores for each combined group. This means that we can group our results by the item ID and receive a list of groups (i.e. items) with one or more documents (i.e. pages or metadata) which we can use to build exact links into a large book.
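To make the grouping concrete, here is a sketch of a grouped query and of reading the grouped response. The group, group.field, group.limit, hl and hl.fl parameters are standard Solr request parameters; the field names item_id and text and the example URL are assumptions for illustration.

```python
from urllib.parse import urlencode


def grouped_search_url(solr_select_url, query, group_field="item_id", text_field="text"):
    """Build a Solr select URL which groups matching documents by item."""
    params = {
        "q": query,
        "wt": "json",
        "group": "true",             # turn on result grouping (field collapsing)
        "group.field": group_field,  # one group per item
        "group.limit": 3,            # pages or metadata documents returned per item
        "hl": "true",                # ask Solr to highlight the matched terms
        "hl.fl": text_field,
    }
    return solr_select_url + "?" + urlencode(params)


def pages_by_item(response, group_field="item_id"):
    """Map each item ID to its matching documents, in ranked order."""
    groups = response["grouped"][group_field]["groups"]
    return {group["groupValue"]: group["doclist"]["docs"] for group in groups}
```

For example, grouped_search_url("http://localhost:8983/solr/collection1/select", "whales") produces a URL whose decoded JSON response can be passed to pages_by_item to build one result entry per book with links to the individual pages.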

Highlighting Results

Search results are returned as simple HTML with the embedded data which we'll need to provide the original image segments. Here's what happens when someone searches for guineé:

  1. Solr performs its normal language analysis and selects relevant documents
  2. All of the documents are grouped by item ID and each group is ranked for relevance
  3. Solr highlights the matched terms in the response
  4. The web site formats all of the results into an HTML document and adds some metadata indicating the type of document which contained each match so it can be enhanced by JavaScript later
The raw search results before JavaScript runs

At this point we have quickly returned search results and can link directly to individual pages but we're showing frequently ugly OCR text directly and not providing as much context as we'd like. The next step is to replace that raw text with an image slice from the scanned page:

  1. JavaScript looks for highlighted results from OCR text and uses the embedded microdata to determine the source volume and page
  2. An XHR request is made to retrieve the word coordinates for every word on each returned page
  3. The word coordinate list is scanned for each highlighted word and the coordinates are selected. Since we often find words in multiple places on the same page and we want to display an easily readable section of text, the list of word coordinates is coalesced starting from the top of the page and no more than the first third of the page will be returned. For this display, we always use the full width of the page but the same process could be used to generate smaller slices if desired.
  4. A separate request is made to load the relevant image slice. When the image has loaded, we replace the raw OCR text with the image. This way the raw text is visible for as long as it takes to load the image so we avoid showing empty areas until everything has transferred.
  5. Finally, a partially-transparent overlay is displayed over the image for each word coordinate to highlight the matches (see e.g. css-tricks.com if you're not familiar with this form of CSS positioning). Since the OCRed word coordinates aren't consistently tightly cropped around the letters in the word, a minor CSS box-shadow is used to make the edges softer and more like a highlighter.
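The coalescing in step 3 runs in the browser as JavaScript, but the logic can be sketched in a few lines of Python. This is one reading of the description, assuming the slice starts at the first match and grows downward until it would exceed a third of the page height. The word list layout ("text" and "bbox" keys) and the function name are assumptions for this example.

```python
import string


def slice_for_matches(words, highlighted, page_height, max_fraction=1 / 3):
    """Pick the vertical slice of the page which shows the highlighted words.

    words: list of {"text": str, "bbox": [x0, y0, x1, y1]} in pixels.
    Returns None when no word matches, otherwise the top and bottom of the
    slice plus the boxes to highlight inside it.
    """
    wanted = {word.lower() for word in highlighted}
    matches = sorted(
        (
            word["bbox"]
            for word in words
            if word["text"].strip(string.punctuation).lower() in wanted
        ),
        key=lambda box: box[1],  # top of the page first
    )
    if not matches:
        return None
    top = matches[0][1]
    limit = top + page_height * max_fraction
    # Always keep the first match, then add later ones while they still fit.
    kept = [matches[0]] + [box for box in matches[1:] if box[3] <= limit]
    bottom = max(box[3] for box in kept)
    return {"top": top, "bottom": bottom, "boxes": kept}
```

Matches further down the page are simply left out of the slice; they are still available once the reader opens the full page.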

Notes

  • From a workflow perspective, I highly recommend recording the source of your OCR text and whether it's been reviewed. Since this is a fully automated process it is extremely handy to be able to reprocess items in the future if your software improves without accidentally clobbering any items which have been hand-corrected by humans.
  • The word coordinates are pixel-level coordinates based on the input file but our requests are made using calculated percentages, since the scans are often much higher resolution than we would want to display in a web browser and our users wouldn't want to wait for a 600-dpi image to download in any case.
  • You might be wondering why all of this work is performed on the client side rather than having the server return highlighted images. In addition to reducing server load, this approach is friendlier for caches because a given image segment can be reused for multiple words (rounding the coordinates improves the cache hit ratio significantly) and both the image and word coordinates can thus be cached by CDN edge servers rather than requiring a full round-trip back to the server.
    One common example of the cache-ability benefit is when you open a result and start reading it: in the viewer, we display full page images rather than the trimmed slices so we must fetch new images but those are likely to be cached because they haven't been customized with the search text and we can reuse the locally-cached word coordinates to immediately display the highlighting. If you change your search text within an item, we can again immediately update the display while the revised page list is retrieved.
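The percentage calculation and rounding mentioned in the notes above could look something like this sketch. The 5% grid is an assumed value chosen for illustration; the point is only that snapping outward to a coarse grid makes nearby matches request the same slice.

```python
import math


def slice_percentages(top, bottom, page_height, step=5):
    """Convert a pixel slice into percentages of the page height.

    The top is rounded down and the bottom rounded up to a multiple of
    `step`, so the slice never cuts into a match and slightly different
    pixel values map to the same cacheable request.
    """
    top_pct = math.floor(100 * top / page_height / step) * step
    bottom_pct = math.ceil(100 * bottom / page_height / step) * step
    return max(0, top_pct), min(100, bottom_pct)
```

On a 3000-pixel-tall scan, a slice from pixel 200 to 640 and another from 230 to 700 both come out as 5% to 25%, so the second search reuses the image which the first one already pulled into the cache.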

Challenges & Future Directions

This was supposed to be the simplest thing which could possibly work and it turned out not to be that simple. As you might imagine, this leaves a number of open questions for where to go next:

  • OCR results vary considerably based on the quality of the input image. Accuracy can be improved considerably by preprocessing the image to remove borders and noise, or by using a more sophisticated algorithm to convert a full-color scan into the black-and-white image which Tesseract operates on. The trick is either coming up with good presets for your data, perhaps integrated with an image processing tool like ScanTailor, or developing smarter code which can select filters based on the characteristics of the image.
  • For older items, the OCR process is complicated by the condition of the materials, more primitive printing technology and stylistic choices like the long s (ſ) or ligatures which are no longer in common usage and thus not well supported by common OCR programs. One of my future goals is looking into the tools produced by the Early Modern OCR Project and seeing whether there's a production-ready path for this.
  • It would be interesting to combine the results of OCR with my earlier figure extraction project for innovative displays like the Mechanical Curator or, with more work, trying to extract full figures with captions.
  • Finally, there's considerable room for integrating crowd-sourcing approaches like direct text correction, as epitomized by the National Library of Australia's wonderful Trove project, and promising improvements on that concept like UMD-MITH's ActiveOCR project.

    This seems like an area for research which any organization with large digitized collections should be supporting, particularly with an eye towards easier reuse. Ed Summers and I have idly discussed the idea for a generic web application which would display hOCR with the corresponding images for correction with all of the data stored somewhere like GitHub for full change tracking and review. It seems like something along those lines would be particularly valuable as a public service to avoid the expense of everyone reinventing large parts of this process customized for their particular workflow.

Dear EFF: please don't pick the wrong fight

The fight against DRM is not worth discarding your integrity. Misrepresenting the W3C's Encrypted Media Extensions will not do anything useful but it will hold the web back and make the EFF less effective.

First, some background: I've been a supporter and donor to the Electronic Frontier Foundation for a long time – at least 2001, although I believe I started earlier during the 90s Crypto Wars – and opposed to DRM for at least as long. I've also been a fan of Danny O'Brien's reporting and personal blog for a similarly long time.

Unfortunately, today had me reconsidering that support because of O'Brien's recent blog post: Lowering Your Standards: DRM and the Future of the W3C. I feel this marks a dangerous trend of playing very loose with the facts in an attempt to pressure the W3C to drop the Encrypted Media Extensions (EME) spec, and that this is not only likely to fail but likely to backfire by ensuring that millions of people continue to access content through proprietary, closed systems.

Background

A little background information: most video played on the web and particularly commercial content uses Adobe's Flash or Microsoft's Silverlight plugins to run a video player inside a webpage. Both Flash and Silverlight are full programming environments with a significant range of capabilities beyond video playback and have significant overlap with the features provided by your browser. They're distributed as browser plugins, which require a hefty download to be installed before viewing anything, and both generally require proprietary tools for developers to create applications.

They're annoying for developers because they require using a completely different set of technologies than you use for everything else on the web, but many places will write that off as a cost of doing business. What's more of a concern is that both plugins have a history of security problems and neither Microsoft nor Adobe appears to be particularly motivated to build the kind of fast, reliable, automatic update system which the modern browsers have. So in addition to requiring your users to download something before viewing content, you're contributing to one of the leading sources of security exploits for the average user. It also means that anyone who wishes to publish video on the web is generally subject to the development roadmap for one of two companies.

HTML5 offers a way out of this mess: browsers could play back video directly, avoiding the massive external dependency and allowing them to make improvements for video as quickly as they do anything else rather than hoping a third-party developer wants to make improvements. HTML5 <video> is very easy to use, fast and has a consistent high-quality user experience. Unfortunately anyone looking to use it for commercial content will learn that the licensing rules from all of the major content owners require the use of DRM and thus HTML5 video is not an option.

What is EME, anyway?

The W3C's EME group is working on a way to reduce this dependency by adding a general mechanism which allows the use of HTML5 video with a little bit of JavaScript to specify a Content Decryption Module (CDM) and a decryption key for the file. This allows content providers to use the entire modern web stack and limit the DRM dependency to a small chunk of code which handles only the actual decryption. That dramatically lowers the attack surface and avoids the need for anywhere near as frequent updates, since the actual decryption mechanism is far less complex than the entire, largely-duplicative platform which Flash or Silverlight provide.

The problem

DRM does not work and all DRMed content has ended up being available in unencrypted form very quickly because the only way to make DRM work is by completely locking down a device to prevent its owner from running code which can access the unencrypted data and, of course, there's always the Analog Hole. The EFF has a long, laudable history attempting to educate the public and lawmakers about these issues and I completely support those efforts.

Unfortunately, this effort has failed. No significant amount of commercial video on the web is available without DRM and users don't seem to care, as shown by the billions of dollars of sales through iTunes, Amazon, Google Play, etc. Netflix alone is using somewhere around 30% of the total Internet traffic in North America to serve DRM-encumbered video, mostly using Silverlight. Clearly convenience and availability are more important to people.

The EFF has been taking a hard-line position on EME, focused on slippery-slope claims:

By approving this idea, the W3C has ceded control of the "user agent" (the term for a Web browser in W3C parlance) to a third-party, the content distributor. That breaks a—perhaps until now unspoken—assurance about who has the final say in your Web experience, and indeed who has ultimate control over your computing device.

A Web where you cannot cut and paste text; where your browser can't "Save As..." an image; where the "allowed" uses of saved files are monitored beyond the browser; where JavaScript is sealed away in opaque tombs; and maybe even where we can no longer effectively "View Source" on some sites, is a very different Web from the one we have today. It's a Web where user agents—browsers—must navigate a nest of enforced duties every time they visit a page. It's a place where the next Tim Berners-Lee or Mozilla, if they were building a new browser from scratch, couldn't just look up the details of all the "Web" technologies. They'd have to negotiate and sign compliance agreements with a raft of DRM providers just to be fully standards-compliant and interoperable.

Lowering Your Standards: DRM and the Future of the W3C, Danny O'Brien, EFF (emphasis mine)

This is similar to some of the past claims made by Cory Doctorow:

The first of these conditions – "robustness" against end-user modification – is a blanket ban on all free/open source software (free/open source software, by definition, can be modified by its users). That means that the two most popular browser technologies on the Web – WebKit (used in Chrome and Safari) and Gecko (used in Firefox and related browsers) – would be legally prohibited from implementing whatever "standard" the W3C emerges.

What I wish Tim Berners-Lee understood about DRM, Cory Doctorow, The Guardian. (emphasis mine)

Both of these are simply wrong: there is no meaningful distinction between what EME proposes and what is already the case with a browser plugin. If Firefox can play Flash or Silverlight content, it can decrypt video using a CDM which is either included in the host operating system, bundled under an agreement similar to Chrome's Flash plugin or installed by the user.

The real problem is that they're arguing the wrong point: those requests have always been made and, in most cases, have already happened. The lack of a W3C standard hasn't prevented the Amazon Kindle app from preventing your ability to save unencrypted text, iTunes from blocking saving snippets of a rented movie, etc. and it hasn't prevented either Adobe or Microsoft from adding every DRM feature requested by the content owners. What this has done is ensured that the web community hasn't had much say in the process because all of the content is created and played using proprietary closed software.

The EFF is shouting loudly but only Adobe and Microsoft will benefit. There's no indication whatsoever that the studios are going to drop their DRM requirements if this W3C spec is scuttled – we'll just continue to see a lot of opaque plugin content and, of course, more pressure away from the web towards proprietary app stores. Mozilla's Asa Dotzler summed this up perfectly earlier today on Hacker News: [T]he businesses (Hollywood) with the content that Web users want have done that math and decided that DRM through plug-ins and native apps is an EXCELLENT system and they're happy to keep mandating it forever. If Plug-ins go away, as they're slowly but surely doing, then native apps will be the only place to get this content.

This approach also runs the risk of damaging the reputation of the EFF and making it less effective: beyond basic factual problems, exaggerating the risks will backfire badly if people look and – correctly – see that the situation isn't so terrible (Netflix at $10/month is absurdly popular despite the DRM) and discount future claims made by the EFF. They'll need that credibility as the war on general purpose computing continues — and Cory is not wrong to sound the alarm over that.

What the open web community should be doing now is working to ensure that EME is designed in a way which improves security and reduces the proprietary footprint. If the standard for CDMs includes aggressive sandboxing it's a huge win for security alone even if all it does is turn Flash into a collective bad memory for web users. Additionally, separating the task of building a decryption module from building a high-performance video player with robust networking makes it significantly easier for new vendors to enter the market and increases portability because so much less code needs to be adapted to a new platform.

There are some interesting long-term trends, as well: more education about the risks of DRMed content is good and reducing what consumers are willing to pay for restricted content may be the best long-term strategy. Some of that effort needs to be directed towards content owners and providers who are thinking about investing in complex, expensive systems which don't actually work. A very interesting approach was highlighted by Mozilla's Brendan Eich earlier this year in the form of OTOY's pure-JavaScript video codec which in addition to avoiding all of the issues with binary plugins has first-class support for watermarking.

Watermarking, not DRM. This could be huge. OTOY’s GPU cloud approach enables individually watermarking every intra-frame, and according to some of its Hollywood supporters including Ari Emanuel, this may be enough to eliminate the need for DRM.

Today I saw the future, Brendan Eich (Mozilla CTO)

Obviously a shift away from the DRM obsession won't happen overnight but it's not impossible, either, as content owners are concerned about the market leverage which the major DRM vendors like Apple and Amazon have. There's space for smart players willing to back away from DRM in favor of an approach which works at least as well and doesn't require hardware vendors to sell out their users. As Brendan said, there is hope.

2013-10-24 update

Brendan Eich, Mozilla's CTO, posted his position on the EME issue: The Bridge of Khazad-DRM. Pushing the W3C for CDM-level interoperability is a good call and definitely feels characteristic of Mozilla by balancing the goal of protecting users’ interests with the realistic constraints of the current browser market. I strongly hope they succeed.

Since Mozilla seems to be the only browser vendor taking a strong position in favor of user rights, now is a great time to support their work with a donation.

The NSA’s recklessness poses a risk to US business

IT is one of the bright spots in the US economy – perhaps our government should be more cautious about helping the competition…

This is a great example of how the NSA's rogue actions are going to be endangering US IT companies for years: RSA has a security advisory out for several products, including a widely-used cryptography library, which defaulted to using the Dual EC DRBG random number generator, which we now know was released by the NSA with a backdoor to make it easier to spy on people.

Amidst all of the confusion and concern over an encryption algorithm that may contain an NSA backdoor, RSA Security released an advisory to developer customers today noting that the algorithm is the default algorithm in one of its toolkits and strongly advises them to stop using the algorithm.

RSA Tells Its Developer Customers: Stop Using NSA-Linked Algorithm, Kim Zetter, Wired

This likely makes things weaker in a way which others could exploit – and given the high odds that people in e.g. China and Russia are racing to test that, it's likely that the NSA's actions exposed millions of people to unnecessary additional risk by weakening important software.

It's likely even more damaging, however, to the US IT industry's future. We can ship updates to software relatively quickly but the question of trust is going to be much thornier: almost every RSA customer – and especially foreign ones – must be asking whether RSA was an innocent dupe or an active collaborator. Given how much business they do with the US government, they're probably never going to be able to convincingly disprove that theory. Every other major security vendor in the US and certain allied countries is going to face a similar question: “How do we know you won't be in the news next?”

Extracting images from scanned book pages

A first step toward building a visual index for books automatically

I work on a project which has placed a number of books online. Over the years we've improved server performance and worked on a fast, responsive viewer for scanned books to make our books as accessible as possible but it's still challenging to help visitors find something of interest out of hundreds of thousands of scanned pages.

Trevor and I have discussed various ways to improve the situation and one idea which seemed promising was seeing how hard it would be to extract the images from digitized pages so we could present a visual index of an item. Trevor’s THATCamp CHNM post on Freeing Images from Inside Digitized Books and Newspapers got a favorable reception and since it kept coming up at work I decided to see how far I could get using OpenCV.

Everything you see below is open-source and comments are highly welcome. I created a book-illustration-detection branch in my image mining project (see my previous experiment reconstructing higher-resolution thumbnails from the masters) so feel free to fork it or open issues.

The current process (locate-figures.py) is rather primitive:

  1. Convert the image to grayscale, which is necessary for some of the algorithms
  2. Apply a binary filter converting the image to black and white
  3. Optionally, apply an erode or dilate filter (see the OpenCV erosion and dilation tutorial)
  4. Optionally, apply Canny edge detection (OpenCV tutorial)
  5. Find contours (i.e. what appear to be lines) (OpenCV tutorial)
  6. Filter out contours which are very small or very large, to avoid extracting small things like defects, letters, etc. or large artifacts like borders from the scanning process which span an entire edge
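The final filtering step is easy to show in isolation. This sketch assumes each contour has already been reduced to an (x, y, width, height) bounding box, for example with OpenCV's cv2.boundingRect. The function name and both thresholds are illustrative guesses, not the values used in locate-figures.py.

```python
def filter_boxes_by_size(boxes, page_width, page_height,
                         min_area_fraction=0.01, max_side_fraction=0.95):
    """Drop bounding boxes which are too small or which span a whole edge.

    boxes: iterable of (x, y, width, height) tuples in pixels.
    """
    page_area = page_width * page_height
    kept = []
    for x, y, width, height in boxes:
        if width * height < min_area_fraction * page_area:
            continue  # too small: specks, defects, individual letters
        if width > max_side_fraction * page_width or height > max_side_fraction * page_height:
            continue  # spans an entire edge: probably a scanning border
        kept.append((x, y, width, height))
    return kept
```

Expressing the thresholds as fractions of the page rather than absolute pixel counts means the same settings can be tried on scans of different resolutions.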

The program requires Python, OpenCV and numpy, all of which should be easy to install on Ubuntu/Debian Linux or using Homebrew on OS X. When the prerequisites are installed, the program can be run like this:

locate-figures.py --interactive 211_1_82.png sn99021999_1913-08-31_1_1.png
Applying filters interactively, with contours and their bounding boxes displayed

Results

The results are quite promising:

Extracted cartoon from the Omaha Daily Bee front page for August 31st, 1913 JPEG-2000 Master (courtesy of Chronicling America)
Extracted illustration from The Amazon and Madeira Rivers: Sketches and Descriptions from the Note-Book of an Explorer
Extracted illustration from The Amazon and Madeira Rivers: Sketches and Descriptions from the Note-Book of an Explorer
Extracted print from Guide to the Great Siberian Railway

There are, of course, some problems:

Multiple contours were detected in multiple points of this illustration but unfortunately they weren't seen as contiguous and both were large enough to be extracted

The full results are worth reviewing – I was surprised at the quality from the initial pass:

There are some obvious areas for improvement such as attempting to prevent the above problem by filtering boxes which are entirely contained within other boxes. It would also be interesting to attempt to examine the surrounding area to see whether there appears to be a caption.
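The containment filter is simple enough to sketch here. This is not yet part of locate-figures.py; it assumes the same (x, y, width, height) bounding-box tuples as above and the function names are made up for the example.

```python
def contains(outer, inner):
    """True when the inner box lies entirely within the outer box."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


def drop_nested_boxes(boxes):
    """Remove every box which is entirely contained within another box."""
    unique = list(dict.fromkeys(boxes))  # drop exact duplicates, keep order
    return [
        box for box in unique
        if not any(other != box and contains(other, box) for other in unique)
    ]
```

This compares every pair of boxes, which is fine for the handful of candidates left on a page after size filtering. It would not help with contours which overlap only partially, so those would still need a merge step.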

Cool ideas? Deep experience with image processing? I'd love to hear what you think.