Historical Newspaper Article Deduplication

The Challenge

The client had millions of digitized newspaper pages with many duplicate articles appearing across different publications. Manual deduplication was impractical.

Our Solution

We developed a pipeline that computes an embedding for each article image, then clusters articles with similar embeddings so that reviewers can confirm likely duplicates in groups. This reduced the manual deduplication workload by 75%.
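The clustering step can be illustrated with a minimal sketch. It assumes each article has already been turned into an embedding vector, and it groups articles whose cosine similarity meets a threshold. The function names, the 0.9 threshold, and the toy vectors are hypothetical and chosen only for illustration. The brute-force pairwise comparison stands in for an approximate nearest-neighbour index such as FAISS, which would be needed at the scale of millions of pages.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cluster_duplicates(embeddings, threshold=0.9):
    """Group indices of embeddings whose cosine similarity is >= threshold.

    Hypothetical helper: uses brute-force O(n^2) comparison plus union-find,
    so it only suits small inputs. Returns clusters with 2+ members.
    """
    unit = [normalize(v) for v in embeddings]
    parent = list(range(len(unit)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            similarity = sum(a * b for a, b in zip(unit[i], unit[j]))
            if similarity >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    # Merge the two clusters, keeping the smaller index as root.
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = defaultdict(list)
    for i in range(len(unit)):
        groups[find(i)].append(i)
    return [members for members in groups.values() if len(members) > 1]


# Toy embeddings: articles 0 and 1 are near-identical, as are 2 and 3.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],
    [0.0, 0.0, 1.0],
]
candidate_groups = cluster_duplicates(toy_embeddings, threshold=0.9)
```

Each returned group is a list of article indices that a reviewer could inspect together. The threshold controls the trade-off between missed duplicates and false matches, and would need tuning against labelled examples.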

Technologies Used

PyTorch, Embeddings, FAISS, OpenCV

Results & Impact

  • Identified 700M duplicate articles across the collection
  • Reduced manual review and storage needs by 75%
  • Improved search accuracy by eliminating redundant results