Suppose you are working on an NLP project. Your input data are probably files like PDF, JPG, XML, TXT or similar and there are a lot of them. It is not unusual that in large data sets some documents with different names have exactly the same content, i.e. they are duplicates. There can be various reasons for this. Probably the most common one is improper storage and archiving of the documents.
Regardless of the cause, it is important to find the duplicates and remove them from the data set before you start labeling the documents.
In this blog post I will briefly demonstrate how the contents of different files can be compared using the Python module filecmp. After the duplicates have been identified, I will show how they can be deleted automatically.
For the purpose of this presentation, let us consider a simple data set containing six documents.
Here is a figure showing the documents:
We see that the documents "doc1.pdf", "doc4.pdf" and "doc5.pdf" have exactly the same content. The same applies to "doc2.jpg" and "doc6.jpg". The goal is therefore to identify and remove the duplicates "doc4.pdf", "doc5.pdf" and "doc6.jpg".
Finding the duplicates
The module filecmp offers a very convenient function for this purpose: filecmp.cmp(f1, f2, shallow=True). It compares the files named f1 and f2 and returns True if they seem to be identical. Otherwise it returns False. The shallow parameter allows the user to specify whether the comparison should be based on the os.stat() signatures of the files (file type, size and modification time) or rather on their contents. With the default shallow=True, files with identical signatures are taken to be equal without their contents being read. The comparison of the contents is ensured by the setting shallow=False.
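To make the effect of the shallow parameter concrete, here is a small sketch. It assumes two throwaway files, a.txt and b.txt (names chosen only for illustration), which have the same size and are given the same modification time but differ in content.

```python
import os
import tempfile
from filecmp import cmp
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.txt"
    b = Path(tmp) / "b.txt"
    a.write_bytes(b"hello world")
    b.write_bytes(b"hello_world")  # same length, different content

    # Give both files the same access and modification time,
    # so that their os.stat() signatures are identical.
    for f in (a, b):
        os.utime(f, (1_000_000_000, 1_000_000_000))

    shallow_result = cmp(a, b, shallow=True)   # signatures match, contents are not read
    deep_result = cmp(a, b, shallow=False)     # contents are compared byte by byte

print(shallow_result, deep_result)
```

In this constructed case the shallow comparison reports the files as equal while the content comparison does not, which is why shallow=False is the safer choice when searching for true duplicates.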
Example Python code for finding the duplicates could therefore look like this:
    import os
    from pathlib import Path
    from filecmp import cmp

    # list all documents
    DATA_DIR = Path('/path/to/your/data/directory')
    files = sorted(os.listdir(DATA_DIR))

    # list containing the classes of documents with the same content
    duplicates = []

    # comparison of the documents
    for file in files:
        is_duplicate = False
        for class_ in duplicates:
            # compare with the first document of the class only
            is_duplicate = cmp(
                DATA_DIR / file,
                DATA_DIR / class_[0],
                shallow=False
            )
            if is_duplicate:
                class_.append(file)
                break
        if not is_duplicate:
            # no match found: the document starts a new class
            duplicates.append([file])

    # show results
    print(duplicates)
[['doc1.pdf', 'doc4.pdf', 'doc5.pdf'], ['doc2.jpg', 'doc6.jpg'], ['doc3.pdf']]
The above output is a list which contains the identified "equivalence classes", i.e. lists of documents with the same content. Note that it is enough to compare a given document with only one representative from each class, e.g. the first one (class_[0]), because all documents within a class have the same content.
We learn, for example, that the document "doc1.pdf" has the same content as the documents "doc4.pdf" and "doc5.pdf". Furthermore, the document "doc2.jpg" has the same content as "doc6.jpg" and the document "doc3.pdf" has no duplicates. All this corresponds to what we have observed in the image above.
The next step is to remove the duplicates "doc4.pdf", "doc5.pdf" and "doc6.jpg". Example Python code that accomplishes this task could look like this:
    # remove the duplicates, keeping the first document of each class
    for class_ in duplicates:
        for file in class_[1:]:
            os.remove(DATA_DIR / file)
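Since os.remove() deletes files permanently, it can be worth checking first what would be deleted. The following is a minimal sketch of that idea; the function name remove_duplicates and its dry_run parameter are my own additions for illustration and not part of any library. It expects the list of classes in the same form as shown in the output above.

```python
import os
from pathlib import Path


def remove_duplicates(duplicates, data_dir, dry_run=True):
    """Remove all but the first file of each class of duplicates.

    With dry_run=True nothing is deleted; the function only returns
    the paths that would be removed.
    """
    removed = []
    for class_ in duplicates:
        for file in class_[1:]:  # class_[0] is kept
            path = Path(data_dir) / file
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

A call such as remove_duplicates(duplicates, DATA_DIR) only lists the candidates, and the same call with dry_run=False actually deletes them.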
There are certainly other ways to write the code or to compare files in general. In this article I simply demonstrated one of the many possibilities.
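One such alternative, sketched here only as an illustration, is to group the files by a hash of their content using the standard library module hashlib. The helper names file_digest and group_by_digest are hypothetical. The sketch assumes that files with equal SHA-256 digests have equal content, which is not a byte-for-byte proof, although an accidental collision is extremely unlikely.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_digest(data_dir):
    """Group the regular files in data_dir by the digest of their content."""
    groups = defaultdict(list)
    for path in sorted(Path(data_dir).iterdir()):
        if path.is_file():
            groups[file_digest(path)].append(path.name)
    return list(groups.values())
```

With this approach every file is read exactly once, instead of being compared against one representative of each class found so far.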
I would also like to encourage you to take a closer look at the filecmp module. In addition to the filecmp.cmp() function, it also offers other functions such as filecmp.cmpfiles(), which can be used to compare files in two directories and may therefore suit your needs even better.
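As a brief illustration of filecmp.cmpfiles(), the sketch below compares a list of file names across two directories. The directory and file names are made up for the example. The function returns three lists: the names that match, the names that differ, and the names that could not be compared (for instance because a file is missing in one of the directories).

```python
import tempfile
from filecmp import cmpfiles
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    dir_a = Path(tmp) / "a"
    dir_b = Path(tmp) / "b"
    dir_a.mkdir()
    dir_b.mkdir()

    # one identical pair, one differing pair, one file without a counterpart
    (dir_a / "same.txt").write_text("identical")
    (dir_b / "same.txt").write_text("identical")
    (dir_a / "changed.txt").write_text("version 1")
    (dir_b / "changed.txt").write_text("version 2")
    (dir_a / "only_in_a.txt").write_text("no counterpart")

    common = ["same.txt", "changed.txt", "only_in_a.txt"]
    match, mismatch, errors = cmpfiles(dir_a, dir_b, common, shallow=False)

print(match, mismatch, errors)
```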