Precise Newspaper Article Segmentation for Digital Archives

The Challenge

Overlapping articles, irregular column layouts, and advertisements made rule-based segmentation unreliable (50% accuracy). Manual annotation cost $2 per page.

Our Solution

Fine-tuned a Mask R-CNN model on 10K labeled newspaper pages and added rules to filter out ads and classifieds. Added a post-processing step that merges fragmented regions and infers reading order.
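
The merging and reading-order step could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: regions are assumed to be axis-aligned bounding boxes, fragments are linked when they overlap or sit within a pixel gap of each other, and reading order is approximated by sorting into fixed-width columns. The names `merge_fragments`, `reading_order`, `gap`, and `column_width` are hypothetical.

```python
from typing import Dict, List, Tuple

# (x0, y0, x1, y1) in pixels, origin at the top-left of the page (assumed format).
Box = Tuple[float, float, float, float]


def boxes_adjacent(a: Box, b: Box, gap: float) -> bool:
    """True if the two boxes overlap or lie within `gap` pixels on both axes."""
    horizontal_gap = max(a[0], b[0]) - min(a[2], b[2])
    vertical_gap = max(a[1], b[1]) - min(a[3], b[3])
    return horizontal_gap <= gap and vertical_gap <= gap


def merge_fragments(boxes: List[Box], gap: float = 10.0) -> List[Box]:
    """Treat boxes as graph nodes, link adjacent ones, and return one
    enclosing box per connected component (found with union-find)."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_adjacent(boxes[i], boxes[j], gap):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    return [
        (
            min(b[0] for b in group),
            min(b[1] for b in group),
            max(b[2] for b in group),
            max(b[3] for b in group),
        )
        for group in groups.values()
    ]


def reading_order(boxes: List[Box], column_width: float) -> List[Box]:
    """Order regions column by column (left to right), top to bottom within
    each column. Assumes a regular column grid, which real pages often lack."""
    return sorted(boxes, key=lambda b: (int(b[0] // column_width), b[1]))


if __name__ == "__main__":
    fragments = [(300, 0, 400, 80), (0, 55, 100, 120), (0, 0, 100, 50)]
    merged = merge_fragments(fragments, gap=10.0)
    for region in reading_order(merged, column_width=200):
        print(region)
```

Proximity alone would also merge neighboring but unrelated articles, so a production version would likely add further edge conditions (for example, matching predicted class or a shared headline) before linking two fragments.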

Technologies Used

Mask R-CNN, LayoutLM, Computer Vision, Graph-Based Merging

Results & Impact

  • Achieved 94% segmentation accuracy across 19th–20th century newspapers
  • Reduced processing costs from $2/page to $0.05/page
  • Improved downstream OCR accuracy by 22% via clean article isolation