Suppose you are working on an NLP project. Your input data are probably a large collection of files in formats like PDF, JPG, XML, or TXT. In large data sets it is not unusual for documents with different names to have exactly the same content, i.e. to be duplicates. There can be various reasons for this; probably the most common is improper storage and archiving of the documents.
Regardless of the cause, it is important to find the duplicates and remove them from the data set before you start labeling the documents.
In this blog post I will briefly demonstrate how the contents of different files can be compared using filecmp, a module from the Python standard library. Once the duplicates have been identified, I will show how to delete them automatically.
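To preview the core idea, here is a minimal sketch: a brute-force pairwise comparison of all files in a directory, keeping the first copy of each duplicate group and deleting the rest. The `data/` directory name and the keep-the-first policy are my assumptions for illustration, not fixed parts of the approach.

```python
import filecmp
import os
from itertools import combinations
from pathlib import Path

# Hypothetical data directory; adjust to your own project layout.
DATA_DIR = Path("data")

# Collect all regular files in the directory.
files = sorted(p for p in DATA_DIR.iterdir() if p.is_file())

duplicates = set()
for a, b in combinations(files, 2):
    # shallow=False forces a byte-by-byte comparison of the file
    # contents instead of comparing only os.stat() signatures.
    if b not in duplicates and filecmp.cmp(a, b, shallow=False):
        duplicates.add(b)  # keep the first file, mark the later one

for dup in duplicates:
    print(f"Removing duplicate: {dup}")
    os.remove(dup)
```

Note that `shallow=False` matters here: with the default `shallow=True`, `filecmp.cmp` compares only `os.stat()` signatures (file type, size, modification time), which is faster but can misjudge files whose contents differ despite matching metadata.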