Extracting images from scanned book pages

A first step toward building a visual index for books automatically

I work on a project which has placed a number of books online. Over the years we've improved server performance and built a fast, responsive viewer for scanned books to make them as accessible as possible, but it's still challenging to help visitors find something of interest among hundreds of thousands of scanned pages.

Trevor and I have discussed various ways to improve the situation, and one promising idea was to see how hard it would be to extract the images from digitized pages so that we could present a visual index of an item. Trevor’s THATCamp CHNM post on Freeing Images from Inside Digitized Books and Newspapers got a favorable reception, and since the idea kept coming up at work I decided to see how far I could get using OpenCV.

Everything you see below is open-source and comments are highly welcome. I created a book-illustration-detection branch in my image mining project (see my previous experiment reconstructing higher-resolution thumbnails from the masters) so feel free to fork it or open issues.

The current process (locate-figures.py) is rather primitive:

  1. Convert the image to grayscale, which is necessary for some of the algorithms
  2. Apply a binary filter, converting the image to black and white
  3. Optionally, apply an erode or dilate filter (see the OpenCV erosion and dilation tutorial)
  4. Optionally, apply Canny edge detection (OpenCV tutorial)
  5. Find contours (i.e. what appear to be lines) (OpenCV tutorial)
  6. Filter out contours which are very small or very large, to avoid extracting small things like defects and letters, or large artifacts like scanning borders which span an entire edge

The program requires Python, OpenCV and numpy, all of which should be easy to install on Ubuntu/Debian Linux or using Homebrew on OS X. When the prerequisites are installed, the program can be run like this:

locate-figures.py --interactive 211_1_82.png sn99021999_1913-08-31_1_1.png
Applying filters interactively, with contours and their bounding boxes displayed

Results

The results are quite promising:

Extracted cartoon from the Omaha Daily Bee front page for August 31st, 1913 JPEG-2000 Master (courtesy of Chronicling America)
Extracted illustration from The Amazon and Madeira Rivers: Sketches and Descriptions from the Note-Book of an Explorer
Extracted illustration from The Amazon and Madeira Rivers: Sketches and Descriptions from the Note-Book of an Explorer
Extracted print from Guide to the Great Siberian Railway

There are, of course, some problems:

Contours were detected at multiple points in this illustration, but unfortunately they weren't seen as contiguous and both were large enough to be extracted separately

The full results are worth reviewing; I was surprised at the quality of the initial pass.

There are some obvious areas for improvement, such as preventing the above problem by filtering out boxes which are entirely contained within other boxes. It would also be interesting to examine the area surrounding each box to see whether there appears to be a caption.
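The containment filter is simple enough to sketch. This is only an illustration of the idea, not code from the project: the helper names contains and drop_nested_boxes are hypothetical, and boxes are assumed to be (x, y, width, height) tuples like the ones OpenCV's boundingRect returns.

```python
def contains(outer, inner):
    """Return True if the box `inner` lies entirely within the box `outer`."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return (ox <= ix and oy <= iy and
            ix + iw <= ox + ow and iy + ih <= oy + oh)


def drop_nested_boxes(boxes):
    """Keep only the boxes which are not contained within another box."""
    # Remove exact duplicates first (preserving order) so that two identical
    # boxes cannot eliminate each other
    unique = list(dict.fromkeys(boxes))
    return [box for box in unique
            if not any(other != box and contains(other, box) for other in unique)]


boxes = [(0, 0, 100, 100), (10, 10, 20, 20), (90, 90, 30, 30), (0, 0, 100, 100)]
print(drop_nested_boxes(boxes))  # [(0, 0, 100, 100), (90, 90, 30, 30)]
```

This compares every pair of boxes, which should be fine for the handful of candidates found on a single page. Note that it only handles full containment: boxes which merely overlap, like the third one in the example, are left alone.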

Cool ideas? Deep experience with image processing? I'd love to hear what you think.