Compare Two CSV Files for Differences in Python (Example)

 

This post explains how to compare two CSV files and print out differences in Python.

Table of contents:

Let’s get started.

 

Example Data & Add-On Libraries

We first have to import the pandas library:

import pandas as pd                                 # Load pandas

Furthermore, consider the example data below:

data1 = pd.DataFrame({'x1':range(0, 6),             # Create first pandas DataFrame
                      'x2':['x', 'y', 'x', 'x', 'y', 'x'],
                      'x3':range(2, 8)})
print(data1)                                        # Print first pandas DataFrame

 

table 1 DataFrame compare two csv files for differences python

 

data2 = pd.DataFrame({'x1':range(3, 8),             # Create second pandas DataFrame
                      'x2':['x', 'x', 'x', 'y', 'y'],
                      'x3':range(5, 10)})
print(data2)                                        # Print second pandas DataFrame

 

table 2 DataFrame compare two csv files for differences python

 

The output of the previous Python programming syntax is shown in Tables 1 and 2: We have created two pandas DataFrames with the same columns but different values.

Let’s write these pandas DataFrames to two separate CSV files:

data1.to_csv('data1.csv', index = False)            # Export pandas DataFrames to CSV
data2.to_csv('data2.csv', index = False)

After the previous Python syntax has been executed, you should find multiple CSV files in your current working directory. These two files will be used as a basis for the following example.

 

Example: Compare Two CSV Files & Print Out Differences

The following Python programming syntax shows how to compare and find differences between pandas DataFrames in two CSV files in Python.

More precisely, we are searching for rows that do exist in the second pandas DataFrame, but not in the first DataFrame.

In the first step of this example, we are importing the two pandas DataFrames:

with open('data1.csv', 'r') as csv1, open('data2.csv', 'r') as csv2:  # Import CSV files
    import1 = csv1.readlines()
    import2 = csv2.readlines()

Next, we are using a loop to check for each line in the second CSV file if it exists in the first CSV file. Furthermore, we are creating a new CSV file that contains all rows that exist only in the second CSV file.

Let’s do this:

with open('data_diff.csv', 'w') as outFile:         # Create CSV file with differences
    for row in import2:
        if row not in import1:
            outFile.write(row)

 

table 3 DataFrame compare two csv files for differences python

 

Table 3 shows the output of the previous Python syntax – A data set showing the differences between the first and the second pandas DataFrame.

Please note that the code of this tutorial is partly based on this Stack Overflow thread. You may have a look there for more details on how to compare different data sets stored in CSV files in Python.

 

Video & Further Resources

Have a look at the following video on the Statistics Globe YouTube channel. I illustrate the Python programming syntax of this article in the video.

 

The YouTube video will be added soon.

 

In addition, you may read some of the related tutorials on this website. A selection of articles is listed here.

 

This tutorial has demonstrated how to find differences between two CSV files in the Python programming language. If you have additional questions, please let me know in the comments below.

 

Subscribe to the Statistics Globe Newsletter

Get regular updates on the latest tutorials, offers & news at Statistics Globe.
I hate spam & you may opt out anytime: Privacy Policy.


Leave a Reply

Your email address will not be published.

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
Top