Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr
Digitizing printed material has become an industrial process for large collections. Modern scanning equipment makes it easy to process millions of pages, and concerted engineering effort has even produced high-end options for fragile rare items. Meanwhile, innovative open-source projects like Project Gado make continual progress reducing the cost of reliable batch scanning to fit almost any organization's budget.
Such efficiencies are great for our goals of preserving history and making it available, but they make it painfully obvious how far digitization capacity outstrips our ability to create metadata. This is a big problem because most of the ways we find information involve searching for text, and a large TIFF file is effectively invisible to a full-text search engine. The classic library solution to this challenge has been cataloging, but the required labor is well beyond most budgets and runs into philosophical challenges when users want to search on something which wasn't considered noteworthy at the time an item was cataloged.
In the spirit of finding the simplest thing that could possibly work, I've been experimenting with a completely automated approach which performs OCR on new items and offers combined full-text search over both the available metadata and the OCR text, as can be seen in this example:
Generating OCR text
As we receive new items, anything which matches our criteria (books, journals and newspapers created after 1800; see below) is automatically placed into a Celery task queue as a low-priority task. Workers on multiple servers accept OCR tasks from the queue and run Tesseract on the master image with a simple shell command to generate plain text and HTML with embedded hOCR metadata.
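As a rough illustration of the kind of shell command a worker might run, here is a minimal Python sketch which wraps the Tesseract command line. The function names, the `eng` default and the output layout are assumptions for this example rather than the code used in production. It relies only on Tesseract's standard `tesseract imagename outputbase [configfile]` invocation and the bundled `hocr` config file.

```python
import subprocess
from pathlib import Path


def build_tesseract_commands(image_path, output_base, lang="eng"):
    """Return the argv lists for two Tesseract runs: plain text, then hOCR."""
    image = str(image_path)
    base = str(output_base)
    return [
        # Default output mode: writes <base>.txt
        ["tesseract", image, base, "-l", lang],
        # The bundled "hocr" config switches the output to hOCR HTML.
        # The file extension (.html or .hocr) depends on the Tesseract version.
        ["tesseract", image, base, "-l", lang, "hocr"],
    ]


def run_ocr(image_path, output_dir, lang="eng"):
    """Run both commands for one master image.

    Raises subprocess.CalledProcessError if Tesseract exits with an error,
    which lets a task queue record the failure and retry later.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    base = output_dir / Path(image_path).stem
    for argv in build_tesseract_commands(image_path, base, lang):
        subprocess.run(argv, check=True)
    return base
```

Keeping the command construction separate from the call to `subprocess.run` makes it easy to log or test the exact command without having Tesseract installed.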
Once we have the OCR text, it's transformed to meet two different needs. A full-text search engine like Apache Solr or ElasticSearch works with the pure text output, but because we want to be able to highlight specific words, the task also converts the hOCR into a word coordinates JSON file with the pixel coordinates for every word on the page.
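The hOCR conversion can be sketched with nothing but the Python standard library. hOCR marks each word with the class `ocrx_word` and stores its bounding box in the `title` attribute as `bbox x0 y0 x1 y1`. The JSON layout below (a list of objects with `text`, `x0`, `y0`, `x1`, `y1`) and the function names are assumptions for illustration, not the format described in this post.

```python
import json
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect the text and pixel bounding box of every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word element we are inside, if any
        self._depth = 0    # tag nesting depth inside that word element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            # Nested markup such as <strong> inside a word
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocrx_word" in classes and match:
            self._bbox = [int(value) for value in match.groups()]
            self._depth = 1
            self._text = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:
            text = "".join(self._text).strip()
            if text:
                x0, y0, x1, y1 = self._bbox
                self.words.append(
                    {"text": text, "x0": x0, "y0": y0, "x1": x1, "y1": y1}
                )
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def hocr_to_word_coordinates(hocr_html):
    """Return a list of word dictionaries in document order."""
    parser = HocrWordParser()
    parser.feed(hocr_html)
    parser.close()
    return parser.words


def word_coordinates_json(hocr_html):
    """Serialize the word list; ensure_ascii=False keeps accented text readable."""
    return json.dumps(hocr_to_word_coordinates(hocr_html), ensure_ascii=False)
```

The page-level and line-level elements are skipped here because only word boxes are needed for highlighting.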
Indexing the text for search
Most people expect a combined search these days where relevant terms are selected from both descriptive metadata and the text contents. Simply combining all of the text into a single document to be indexed is unsuitable, however, because we want to be able to search only the metadata in certain cases and we want to be able to return specific pages rather than telling someone to visually scan through a 700-page book. Unfortunately, this approach is incompatible with the normal way search engines determine the most relevant results for a query:
Storing each page separately means that the search score will be determined independently rather than for the entire item. This would prevent books from scoring highly unless all of the words were mentioned on a single page and, far worse, many queries would return pages from a single book mixed throughout the results based on their individual scores! The solution to this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team is working on a similar feature). With field collapsing enabled, Solr will first group all of the matching documents using a specified field and then rank the groups rather than the individual documents. This means that we can group our results by the item ID and receive a list of groups (i.e. items), each with one or more documents (i.e. pages or metadata), which we can use to build exact links into a large book.
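To make the grouping concrete, here is a hedged sketch of a grouped Solr request built in Python. The `group`, `group.field`, `group.limit`, `hl` and `hl.fl` parameters are standard Solr request parameters. The Solr URL, the `item_id` and `text` field names, and the limit of three pages per item are assumptions for this example.

```python
from urllib.parse import urlencode

# Hypothetical Solr core location; adjust for a real deployment.
DEFAULT_SOLR_URL = "http://localhost:8983/solr/collection1/select"


def build_grouped_query(user_query, solr_url=DEFAULT_SOLR_URL):
    """Build a select URL which groups page and metadata documents by item."""
    params = {
        "q": user_query,
        "wt": "json",
        # Field collapsing / result grouping
        "group": "true",
        "group.field": "item_id",  # assumed name of the item identifier field
        "group.limit": 3,          # documents (pages) returned per group
        # Highlighting of the matched terms
        "hl": "true",
        "hl.fl": "text",           # assumed name of the OCR text field
    }
    return solr_url + "?" + urlencode(params)
```

For example, `build_grouped_query("guinée")` produces a URL whose query string contains `group=true` and `group.field=item_id`, with the search term percent-encoded as UTF-8.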
Search results are returned as simple HTML with the embedded data which we'll need to provide the original image segments. Here's what happens when someone searches for guineé:
- Solr performs its normal language analysis and selects relevant documents
- All of the documents are grouped by item ID and each group is ranked for relevance
- Solr highlights the matched terms in the response
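The three steps above end with a grouped JSON response. As a sketch, a response of that shape can be flattened into one entry per item with its matching pages and highlighted snippets. The `grouped`, `groups`, `groupValue`, `doclist` and `highlighting` keys follow Solr's grouped response format, while the `item_id`, `text`, `id` and `page` field names are assumptions for this example.

```python
def summarize_groups(response, group_field="item_id", hl_field="text"):
    """Turn a grouped Solr response into a list of items with their pages."""
    highlighting = response.get("highlighting", {})
    items = []
    for group in response["grouped"][group_field]["groups"]:
        pages = []
        for doc in group["doclist"]["docs"]:
            # Highlighting is keyed by each document's unique ID
            snippets = highlighting.get(doc["id"], {}).get(hl_field, [])
            pages.append(
                {"id": doc["id"], "page": doc.get("page"), "snippets": snippets}
            )
        items.append({"item_id": group["groupValue"], "pages": pages})
    return items
```

Each entry keeps the order Solr returned, so the first item in the list is the top-ranked group and each page can be turned into a direct link.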
At this point we have quickly returned search results and can link directly to individual pages but we're showing frequently ugly OCR text directly and not providing as much context as we'd like. The next step is to replace that raw text with an image slice from the scanned page:
- An XHR request is made to retrieve the word coordinates for every word on each returned page
- The word coordinate list is scanned for each highlighted word and the coordinates are selected. Since we often find words in multiple places on the same page and we want to display an easily readable section of text, the list of word coordinates is coalesced starting from the top of the page and no more than the first third of the page will be returned. For this display, we always use the full width of the page but the same process could be used to generate smaller slices if desired.
- A separate request is made to load the relevant image slice. When the image has loaded, we replace the raw OCR text with the image. This way the raw text is visible for as long as it takes to load the image so we avoid showing empty areas until everything has transferred.
- Finally, a partially-transparent overlay is displayed over the image for each word coordinate to highlight the matches (see e.g. css-tricks.com if you're not familiar with this form of CSS positioning). Since the OCRed word coordinates aren't consistently tightly cropped around the letters in the word a minor CSS box-shadow is used to make the edges softer and more like a highlighter.
- From a workflow perspective, I highly recommend recording the source of your OCR text and whether it's been reviewed. Since this is a fully automated process it is extremely handy to be able to reprocess items in the future if your software improves without accidentally clobbering any items which have been hand-corrected by humans.
- The word coordinates are pixel-level coordinates based on the input file, but our requests are made using calculated percentages, since the scans are often much higher resolution than we would want to display in a web browser and our users wouldn't want to wait for a 600-dpi image to download in any case.
- You might be wondering why all of this work is performed on the client side rather than having the server return highlighted images. In addition to reducing server load, this approach is friendlier for caches because a given image segment can be reused for multiple words (rounding the coordinates improves the cache hit ratio significantly) and both the image and word coordinates can thus be cached by CDN edge servers rather than requiring a full round-trip back to the server.
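The slice selection described in the list above runs in the browser, but the logic is easy to sketch in Python. This version reflects one reading of the description and makes several assumptions: words are matched by exact case-insensitive comparison (real highlighting would follow Solr's language analysis), the slice starts at the topmost match and grows to include later matches only while it stays within a third of the page height, and percentages are rounded outward to multiples of five to improve cache reuse. The function name, the word dictionary layout and the returned keys are hypothetical.

```python
import math


def select_slice(words, terms, page_width, page_height,
                 max_fraction=1 / 3, step=5):
    """Pick a full-width image slice, in percentages, covering the first matches.

    words: list of dicts with "text", "x0", "y0", "x1", "y1" in pixels.
    Returns None when no word on the page matches any term.
    """
    wanted = {term.lower() for term in terms}
    matches = sorted(
        (word for word in words if word["text"].lower() in wanted),
        key=lambda word: word["y0"],
    )
    if not matches:
        return None

    # Start at the topmost match and coalesce later matches into the slice
    # until it would exceed max_fraction of the page height.
    top = matches[0]["y0"]
    bottom = matches[0]["y1"]
    kept = [matches[0]]
    limit = top + page_height * max_fraction
    for word in matches[1:]:
        if word["y1"] > limit:
            break
        bottom = max(bottom, word["y1"])
        kept.append(word)

    # Convert pixels to percentages, rounding outward to a multiple of `step`
    # so that nearby matches request the same cacheable slice.
    top_pct = math.floor(100 * top / page_height / step) * step
    bottom_pct = min(100, math.ceil(100 * bottom / page_height / step) * step)
    return {
        "x": 0,
        "y": top_pct,
        "w": 100,  # always the full page width for this display
        "h": bottom_pct - top_pct,
        "words": kept,
    }
```

Because the result is expressed in percentages, the same slice request works for any derivative size of the page image, and the kept word boxes can be scaled the same way to position the highlight overlays.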
One common example of the cache-ability benefit is when you open a result and start reading it: in the viewer, we display full page images rather than the trimmed slices so we must fetch new images but those are likely to be cached because they haven't been customized with the search text and we can reuse the locally-cached word coordinates to immediately display the highlighting. If you change your search text within an item, we can again immediately update the display while the revised page list is retrieved.
Challenges & Future Directions
This was supposed to be the simplest thing which could possibly work and it turned out not to be that simple. As you might imagine, this leaves a number of open questions for where to go next:
- OCR results vary considerably based on the quality of the input image. Accuracy can be improved considerably by preprocessing the image to remove borders and noise, or by using a more sophisticated algorithm to convert a full-color scan into the black-and-white image which Tesseract operates on. The trick is either coming up with good presets for your data, perhaps integrated with an image processing tool like ScanTailor, or developing smarter code which can select filters based on the characteristics of the image.
- For older items, the OCR process is complicated by the condition of the materials, more primitive printing technology and stylistic choices like the long s (ſ) or ligatures which are no longer in common usage and thus not well supported by common OCR programs. One of my future goals is looking into the tools produced by the Early Modern OCR Project and seeing whether there's a production-ready path for this.
- It would be interesting to combine the results of OCR with my earlier figure extraction project for innovative displays like the Mechanical Curator or, with more work, to try extracting full figures with captions.
- Finally, there's considerable room for integrating crowd-sourcing approaches such as direct text correction, epitomized by the National Library of Australia's wonderful Trove project, and promising improvements on that concept like UMD-MITH's ActiveOCR project.
This seems like an area of research which any organization with large digitized collections should be supporting, particularly with an eye towards easier reuse. Ed Summers and I have idly discussed the idea of a generic web application which would display hOCR with the corresponding images for correction, with all of the data stored somewhere like GitHub for full change tracking and review. Something along those lines seems particularly valuable as a public service, since it would spare everyone the expense of reinventing large parts of this process customized for their particular workflow.