Robust Page Detection for Scanned Historical Books

Rare Books Digital Library, October 2023

U-Net, Semantic Segmentation, TensorFlow, Transfer Learning, OpenCV, Morphological Operations


The Challenge

Automated cropping of historical book scans was highly unreliable, failing 30% of the time due to warped pages and shadows. These failures required constant manual intervention, stalling the library's entire digitization workflow.

Our Solution

We trained a U-Net model to predict page masks on noisy scans drawn from a variety of content collections. We used transfer learning from a pre-trained model and augmented the training data with synthetic distortions. Post-processing with morphological operations and contour refinement then extracted precise page boundaries from the predicted masks.