Historical Newspaper Article Deduplication

Newspapers.com May 2023

PyTorch, Vision Transformers, Embeddings, FAISS, Vector Search, OpenCV

The Challenge

Duplicate articles across millions of newspaper pages were bloating the client's database and cluttering search results with redundant hits. Manual cleanup was infeasible at that scale, and simple technical solutions failed because of scan quality issues.

Our Solution

We developed an AI pipeline that uses vision transformers to extract semantic embeddings from article images. We implemented FAISS for efficient similarity search at scale, along with a clustering algorithm that groups near-duplicates for review. The system processes 100K+ articles daily with sub-second query times.
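
To make the grouping step concrete, here is a minimal sketch in plain Python. It is not the production code. It assumes embeddings are already available as lists of floats, replaces the FAISS index with a brute-force pairwise comparison, and uses a hypothetical cosine-similarity threshold of 0.95 with union-find to merge matches into groups. The function names and the toy vectors are illustrative only.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def group_near_duplicates(embeddings, threshold=0.95):
    """Group items whose cosine similarity is at least `threshold`.

    Assumption: brute-force pairwise comparison stands in for the
    FAISS similarity search, and the threshold value is hypothetical.
    """
    unit = [normalize(v) for v in embeddings]
    parent = list(range(len(unit)))
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            sim = sum(a * b for a, b in zip(unit[i], unit[j]))
            if sim >= threshold:
                # Merge the two groups that contain i and j.
                parent[find(parent, j)] = find(parent, i)
    clusters = defaultdict(list)
    for i in range(len(unit)):
        clusters[find(parent, i)].append(i)
    # Only groups with more than one member are duplicate candidates.
    return sorted(members for members in clusters.values() if len(members) > 1)


# Toy 3-dimensional vectors standing in for vision-transformer embeddings.
articles = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.03],
    [0.0, 0.0, 1.0],
]
groups = group_near_duplicates(articles)
print(groups)  # [[0, 1], [2, 3]]
```

The pairwise loop is quadratic in the number of articles, so it only works for small batches. At archive scale, an approximate nearest-neighbor index such as FAISS would supply the candidate pairs, and the same union-find merge could then be applied to them.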