Accurate Line Detection for Historical Documents

National Archives Foundation, November 2023

U-Net, PyTorch, Document Image Analysis, Contour Detection, Data Augmentation, Transfer Learning

The Challenge

Faded ink, stains, and complex layouts caused the client's standard OCR to fail on historical documents, leaving vast portions of the collection digitally unusable and inaccessible to researchers.

Our Solution

We developed a hybrid approach that combines traditional computer vision with a custom U-Net architecture trained on 50K+ annotated historical documents. The pipeline includes special handling for curved baselines, interlinear annotations, and degraded text. The training data was augmented to simulate a range of degradation patterns.
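
To make the degradation-simulating augmentation more concrete, here is a minimal sketch in plain Python. It is not the project's actual pipeline: the function names (`fade_ink`, `add_stain`, `add_speckle`, `degrade`) and all parameter ranges are hypothetical, chosen only for illustration. It assumes a grayscale page stored as rows of floats, where 0.0 is black ink and 1.0 is clean paper.

```python
import random

def fade_ink(img, strength):
    """Blend every pixel toward white (1.0) to mimic faded ink."""
    return [[p + (1.0 - p) * strength for p in row] for row in img]

def add_stain(img, cx, cy, radius, darkness):
    """Darken a circular region with soft edges to mimic a stain."""
    out = []
    for y, row in enumerate(img):
        new_row = []
        for x, p in enumerate(row):
            dist = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
            # Weight falls linearly from 1 at the centre to 0 at the radius.
            weight = max(0.0, 1.0 - dist / radius)
            new_row.append(p * (1.0 - darkness * weight))
        out.append(new_row)
    return out

def add_speckle(img, prob, rng):
    """Set random pixels to pure black or white to mimic spots and scan noise."""
    return [
        [rng.choice((0.0, 1.0)) if rng.random() < prob else p for p in row]
        for row in img
    ]

def degrade(img, seed=None):
    """Apply fading, one stain, and speckle with randomly drawn parameters.

    All ranges below are illustrative assumptions, not tuned values.
    """
    rng = random.Random(seed)
    height, width = len(img), len(img[0])
    out = fade_ink(img, rng.uniform(0.1, 0.6))
    out = add_stain(
        out,
        cx=rng.uniform(0, width),
        cy=rng.uniform(0, height),
        radius=rng.uniform(0.2, 0.5) * max(height, width),
        darkness=rng.uniform(0.1, 0.4),
    )
    return add_speckle(out, rng.uniform(0.0, 0.02), rng)

# Usage: a tiny synthetic "page" with a single dark text line on row 3.
page = [[1.0] * 16 for _ in range(8)]
page[3] = [0.0] * 16
degraded = degrade(page, seed=0)
```

Because none of these transforms moves pixels, line annotations made on the clean page remain valid for the degraded copy. Geometric augmentations such as warping would need the annotations transformed as well.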