
How to identify duplicate files with Python


Suppose you are working on an NLP project. Your input data are probably files such as PDF, JPG, XML or TXT documents, and there are a lot of them. In large data sets it is not unusual for some documents with different names to have exactly the same content, i.e. they are duplicates. There can be various reasons for this. Probably the most common one is improper storage and archiving of the documents.


Regardless of the cause, it is important to find the duplicates and remove them from the data set before you start labeling the documents.


In this blog post I will briefly demonstrate how the contents of different files can be compared using the Python module filecmp. After the duplicates have been identified, I will show how they can be deleted automatically.

Example documents

For the purpose of this presentation, let us consider a simple data set containing six documents.


Here is a figure showing the documents:

We see that the documents "doc1.pdf", "doc4.pdf" and "doc5.pdf" have exactly the same content. The same applies to "doc2.jpg" and "doc6.jpg". The goal is therefore to identify and remove the duplicates "doc4.pdf", "doc5.pdf" and "doc6.jpg".

Finding the duplicates

The module filecmp offers a very nice function for this purpose: filecmp.cmp(f1, f2, shallow=True). It compares the files named f1 and f2 and returns True if they seem to be identical. Otherwise it returns False. The shallow parameter controls how the comparison is made. With shallow=True (the default), two files whose os.stat() signatures (file type, size and modification time) are identical are taken to be equal without their contents being read. Setting shallow=False ensures that the contents themselves are compared.
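To make the behaviour of filecmp.cmp() concrete, here is a minimal sketch that compares small throwaway files. The file names and contents are made up purely for illustration and are not part of the example data set.

```python
import filecmp
import tempfile
from pathlib import Path

# Create three small files in a temporary directory (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "a.txt").write_bytes(b"same content")
    (tmp_dir / "b.txt").write_bytes(b"same content")
    (tmp_dir / "c.txt").write_bytes(b"other content")

    # shallow=False makes cmp() compare the contents, not just the stat signatures
    same = filecmp.cmp(tmp_dir / "a.txt", tmp_dir / "b.txt", shallow=False)
    different = filecmp.cmp(tmp_dir / "a.txt", tmp_dir / "c.txt", shallow=False)

print(same)       # True
print(different)  # False
```

The file names play no role in the result: "a.txt" and "b.txt" compare equal only because their contents match.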


Example Python code for finding the duplicates could therefore look like this:

import os
from pathlib import Path
from filecmp import cmp


# list all documents (regular files only, subdirectories are skipped)
DATA_DIR = Path('/path/to/your/data/directory')
files = sorted(f for f in os.listdir(DATA_DIR) if (DATA_DIR / f).is_file())

# list containing the classes of documents with the same content
duplicates = []

# comparison of the documents
for file in files:

    is_duplicate = False

    for class_ in duplicates:
        is_duplicate = cmp(
            DATA_DIR / file,
            DATA_DIR / class_[0],
            shallow=False
        )
        if is_duplicate:
            class_.append(file)
            break

    if not is_duplicate:
        duplicates.append([file])

# show results
print(duplicates)

Output:

[['doc1.pdf', 'doc4.pdf', 'doc5.pdf'], ['doc2.jpg', 'doc6.jpg'], ['doc3.pdf']]

The above output is a list containing the identified "equivalence classes", i.e. lists of documents with the same content. Note that it is enough to compare a given document with a single representative of each class, e.g. the first one, class_[0]: all members of a class have the same content, so a document that matches one of them matches all of them.


We learn, for example, that the document "doc1.pdf" has the same content as the documents "doc4.pdf" and "doc5.pdf". Furthermore, the document "doc2.jpg" has the same content as "doc6.jpg" and the document "doc3.pdf" has no duplicates. All this corresponds to what we have observed in the image above.

Removing duplicates

The next step is to remove the duplicates "doc4.pdf", "doc5.pdf" and "doc6.jpg". Example Python code that accomplishes this task could look like this:

# remove the duplicates
for class_ in duplicates:
    for file in class_[1:]:
        os.remove(DATA_DIR / file)
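Since os.remove() deletes files permanently, it can help to list the affected files before deleting anything. The following is only a sketch: the helper remove_duplicates() and its dry_run flag are hypothetical names introduced here, not part of the code above. It expects the same list of classes as produced earlier.

```python
import os
from pathlib import Path


def remove_duplicates(data_dir, duplicates, dry_run=True):
    """Delete all but the first file of each class and return the affected paths.

    With dry_run=True (the default of this hypothetical helper) nothing is
    deleted; the paths that would be removed are only collected.
    """
    affected = []
    for class_ in duplicates:
        # class_[0] is kept, every further member is a duplicate of it
        for file in class_[1:]:
            path = Path(data_dir) / file
            if not dry_run:
                os.remove(path)
            affected.append(path)
    return affected
```

Calling remove_duplicates(DATA_DIR, duplicates) returns the paths that would be deleted, and passing dry_run=False performs the deletion.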

There are certainly other ways to write the code, or to compare files in general. In this article I have simply demonstrated one of the many possibilities.
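One such alternative, sketched here as an illustration and not as the method used in this article, is to group files by a hash of their contents. The helpers file_digest() and group_by_digest() are hypothetical names, and SHA-256 from the standard hashlib module is an assumed choice of hash function. Each file is read only once, instead of being compared against a representative of every class.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(data_dir):
    """Group the regular files in data_dir by the digest of their contents."""
    groups = defaultdict(list)
    for path in sorted(Path(data_dir).iterdir()):
        if path.is_file():
            groups[file_digest(path)].append(path.name)
    return list(groups.values())
```

The result has the same shape as the list of classes above. Equal digests are a very strong indication of equal contents; if you want certainty, the members of a group can still be confirmed with filecmp.cmp(..., shallow=False).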


I would also like to encourage you to take a closer look at the filecmp module. In addition to filecmp.cmp(), it also offers other functions such as filecmp.cmpfiles(), which compares a given list of file names across two directories and may therefore suit your needs even better.
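As a small sketch of filecmp.cmpfiles(dir1, dir2, common, shallow=True), the snippet below compares three file names across two temporary directories. The directory and file names are invented for illustration. The function returns three lists: names that match, names that differ, and names that could not be compared.

```python
import filecmp
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Two throwaway directories with partly overlapping files (hypothetical names)
    dir_a = Path(tmp) / "a"
    dir_b = Path(tmp) / "b"
    dir_a.mkdir()
    dir_b.mkdir()

    (dir_a / "same.txt").write_bytes(b"identical")
    (dir_b / "same.txt").write_bytes(b"identical")
    (dir_a / "changed.txt").write_bytes(b"version 1")
    (dir_b / "changed.txt").write_bytes(b"version 2, longer")
    (dir_a / "only_in_a.txt").write_bytes(b"no counterpart")

    # Names to look up in both directories
    names = ["same.txt", "changed.txt", "only_in_a.txt"]
    match, mismatch, errors = filecmp.cmpfiles(dir_a, dir_b, names, shallow=False)

print(match)     # ['same.txt']
print(mismatch)  # ['changed.txt']
print(errors)    # ['only_in_a.txt']
```

A file that is missing from one of the directories ends up in the errors list and does not raise an exception.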