Extracting images from scanned book pages
A first step toward building a visual index for books automatically
I work on a project which has placed a number of books online. Over the years we've improved server performance and built a fast, responsive viewer for scanned books to make them as accessible as possible. It's still challenging, though, to help visitors find something of interest among hundreds of thousands of scanned pages.
Trevor and I have discussed various ways to improve the situation, and one promising idea was to see how hard it would be to extract the images from digitized pages so we could present a visual index of an item. Trevor’s THATCamp CHNM post on Freeing Images from Inside Digitized Books and Newspapers got a favorable reception, and since the idea kept coming up at work I decided to see how far I could get using OpenCV.
Everything you see below is open-source and comments are very welcome. I created a book-illustration-detection branch in my image mining project (see my previous experiment reconstructing higher-resolution thumbnails from the masters), so feel free to fork it or open issues.
The current process (locate-figures.py) is rather primitive:
- Convert the image to grayscale, which is necessary for some of the algorithms
- Apply a binary filter, converting the image to black and white
- Optionally, apply an erode or dilate filter (see the OpenCV erosion and dilation tutorial)
- Optionally, apply Canny edge detection (OpenCV tutorial)
- Find contours, i.e. what appear to be lines (OpenCV tutorial)
- Filter out contours which are very small or very large, to avoid extracting small things like defects and letters, or large artifacts like borders from the scanning process which span an entire edge
The program requires Python, OpenCV and numpy, all of which should be easy to install on Ubuntu/Debian Linux or using Homebrew on OS X. When the prerequisites are installed, the program can be run like this:
The results are quite promising:
There are, of course, some problems:
The full results are worth reviewing – I was surprised at the quality from the initial pass:
- The Amazon and Madeira Rivers: Sketches and Descriptions from the Note-Book of an Explorer
- Guide to the Great Siberian Railway
There are some obvious areas for improvement, such as filtering out boxes which are entirely contained within other boxes to prevent the above problem. It would also be interesting to examine the area surrounding each box to see whether there appears to be a caption.
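As a rough idea of what the containment filter might look like, here is a small sketch in plain Python. It assumes boxes are `(x, y, w, h)` tuples such as those returned by OpenCV's `boundingRect`; the helper names `contains` and `drop_nested_boxes` are hypothetical and not part of the project.

```python
def contains(outer, inner):
    """True if box `inner` lies entirely within box `outer` (both x, y, w, h)."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return (
        ox <= ix
        and oy <= iy
        and ix + iw <= ox + ow
        and iy + ih <= oy + oh
    )


def drop_nested_boxes(boxes):
    """Keep only the boxes that are not entirely inside some other box."""
    # Identical boxes contain each other, so collapse duplicates first
    # (dict.fromkeys preserves the original order).
    unique = list(dict.fromkeys(boxes))
    return [
        box
        for box in unique
        if not any(other != box and contains(other, box) for other in unique)
    ]
```

This compares every pair of boxes, which should be fine for the handful of candidates found on a single page. Boxes that only partially overlap are left alone.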
Cool ideas? Deep experience with image processing? I'd love to hear what you think.