The client had millions of digitized newspaper pages, and many articles were duplicated across different publications. Manual deduplication at that scale was impractical.
We developed a pipeline that computes an embedding for each article image, then groups articles with similar embeddings into clusters for review. This reduced the manual deduplication workload by 75%.
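
As an illustration of the clustering step only, here is a minimal sketch. It assumes the embeddings have already been computed and are available as plain lists of floats keyed by an article identifier. The function names, the cosine-similarity measure, the 0.9 threshold, and the sample identifiers are all hypothetical choices made for this example, not details of the actual pipeline.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(embeddings, threshold=0.9):
    """Group article ids whose embeddings are linked by similarity >= threshold.

    `embeddings` maps an article id to its vector. The threshold is an
    assumed value for illustration; a real one would need tuning on data.
    Returns only groups with at least two members (candidate duplicates).
    """
    ids = list(embeddings)
    parent = {i: i for i in ids}  # union-find forest, each id starts alone

    def find(i):
        # Walk to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair and merge the groups of sufficiently similar articles.
    for idx, a in enumerate(ids):
        for b in ids[idx + 1:]:
            if cosine_similarity(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Hypothetical toy embeddings: two near-identical articles and one unrelated.
sample = {
    "gazette_p3": [0.90, 0.10, 0.00],
    "herald_p7": [0.88, 0.12, 0.01],
    "tribune_p2": [0.00, 0.20, 0.95],
}
clusters = cluster_by_similarity(sample)
print(clusters)  # [['gazette_p3', 'herald_p7']]
```

The all-pairs comparison here is quadratic in the number of articles, so it only suits small batches. At the scale of millions of pages, a pipeline like this would presumably rely on an approximate nearest-neighbor index or a similar technique to find candidate pairs.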