Robust Page Detection for Scanned Historical Books

The Challenge

Scans of historical books often contain curved pages, shadows, and bleed-through text from opposing pages. On this material, traditional edge-detection methods had a 30% page-detection error rate, and manual cropping was prohibitively slow.

Our Solution

We trained a U-Net model to predict page masks. Gradient-based preprocessing supplies edge hints to the model, and geometric post-processing fits a smooth quadrilateral to each predicted mask. Gutter shadows and folded corners receive special handling.

Technologies Used

  • U-Net
  • OpenCV
  • TensorFlow
  • Morphological Operations

Results & Impact

  • Reduced page detection errors from 30% to 3% on challenging volumes
  • Processed 500K+ pages without manual intervention
  • Enabled seamless integration with OCR pipelines by providing a consistent region of interest (ROI) for every page