Overlapping articles, irregular column layouts, and advertisements made rule-based segmentation unreliable (50% accuracy). Manual annotation cost $2 per page.
Fine-tuned a Mask R-CNN model on 10K labeled newspaper pages and paired it with rule-based filtering of ads and classifieds. Added a post-processing step to merge fragmented regions and infer reading order.
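
The post-processing step might look roughly like the sketch below. It is illustrative only and not the original implementation: the `Box` type, the `gap` tolerance, and the fixed `column_width` bucketing are all assumptions made for the example. Fragments that overlap or sit within `gap` pixels of each other are unioned into one bounding box, and the merged regions are then ordered column by column, top to bottom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Hypothetical region type: axis-aligned bounding box in page pixels.
    x0: float
    y0: float
    x1: float
    y1: float


def close(a: Box, b: Box, gap: float) -> bool:
    """True if the boxes overlap or lie within `gap` pixels on both axes."""
    return (a.x0 - gap <= b.x1 and b.x0 - gap <= a.x1
            and a.y0 - gap <= b.y1 and b.y0 - gap <= a.y1)


def merge_fragments(boxes, gap=5.0):
    """Union boxes that are transitively linked by `close` (single pass)."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if close(boxes[i], boxes[j], gap):
                parent[find(i)] = find(j)

    groups = {}
    for i, b in enumerate(boxes):
        groups.setdefault(find(i), []).append(b)

    # Replace each group with the bounding box of its members.
    return [Box(min(b.x0 for b in g), min(b.y0 for b in g),
                max(b.x1 for b in g), max(b.y1 for b in g))
            for g in groups.values()]


def reading_order(boxes, column_width):
    """Order regions by column (left edge bucketed), then top to bottom."""
    return sorted(boxes, key=lambda b: (int(b.x0 // column_width), b.y0))


fragments = [
    Box(0, 0, 100, 50),     # top half of an article in column 0
    Box(300, 0, 400, 80),   # article in column 1
    Box(0, 52, 100, 120),   # bottom half of the first article
    Box(0, 200, 100, 260),  # separate article lower in column 0
]
regions = reading_order(merge_fragments(fragments, gap=5.0), column_width=300)
for r in regions:
    print(r)
```

Bucketing by a fixed column width is the simplest possible heuristic and would likely break on the irregular layouts described above; a real pipeline would more plausibly estimate column boundaries per page, and would repeat the merge until no two merged regions remain close.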