Compare Two CSV Files for Differences in Python (Example)
This post explains how to compare two CSV files and print out differences in Python.
Table of contents:
Let’s get started.
Example Data & Add-On Libraries
We first have to import the pandas library:
import pandas as pd # Load pandas |
import pandas as pd # Load pandas
Furthermore, consider the example data below:
data1 = pd.DataFrame({'x1':range(0, 6), # Create first pandas DataFrame 'x2':['x', 'y', 'x', 'x', 'y', 'x'], 'x3':range(2, 8)}) print(data1) # Print first pandas DataFrame |
data1 = pd.DataFrame({'x1':range(0, 6), # Create first pandas DataFrame 'x2':['x', 'y', 'x', 'x', 'y', 'x'], 'x3':range(2, 8)}) print(data1) # Print first pandas DataFrame
data2 = pd.DataFrame({'x1':range(3, 8), # Create second pandas DataFrame 'x2':['x', 'x', 'x', 'y', 'y'], 'x3':range(5, 10)}) print(data2) # Print second pandas DataFrame |
data2 = pd.DataFrame({'x1':range(3, 8), # Create second pandas DataFrame 'x2':['x', 'x', 'x', 'y', 'y'], 'x3':range(5, 10)}) print(data2) # Print second pandas DataFrame
The output of the previous Python programming syntax is shown in Tables 1 and 2: We have created two pandas DataFrames with the same columns but different values.
Let’s write these pandas DataFrames to two separate CSV files:
data1.to_csv('data1.csv', index = False) # Export pandas DataFrames to CSV data2.to_csv('data2.csv', index = False) |
data1.to_csv('data1.csv', index = False) # Export pandas DataFrames to CSV data2.to_csv('data2.csv', index = False)
After the previous Python syntax has been executed, you should find multiple CSV files in your current working directory. These two files will be used as a basis for the following example.
Example: Compare Two CSV Files & Print Out Differences
The following Python programming syntax shows how to compare and find differences between pandas DataFrames in two CSV files in Python.
More precisely, we are searching for rows that do exist in the second pandas DataFrame, but not in the first DataFrame.
In the first step of this example, we are importing the two pandas DataFrames:
with open('data1.csv', 'r') as csv1, open('data2.csv', 'r') as csv2: # Import CSV files import1 = csv1.readlines() import2 = csv2.readlines() |
with open('data1.csv', 'r') as csv1, open('data2.csv', 'r') as csv2: # Import CSV files import1 = csv1.readlines() import2 = csv2.readlines()
Next, we are using a loop to check for each line in the second CSV file if it exists in the first CSV file. Furthermore, we are creating a new CSV file that contains all rows that exist only in the second CSV file.
Let’s do this:
with open('data_diff.csv', 'w') as outFile: # Create CSV file with differences for row in import2: if row not in import1: outFile.write(row) |
with open('data_diff.csv', 'w') as outFile: # Create CSV file with differences for row in import2: if row not in import1: outFile.write(row)
Table 3 shows the output of the previous Python syntax – A data set showing the differences between the first and the second pandas DataFrame.
Please note that the code of this tutorial is partly based on this Stack Overflow thread. You may have a look there for more details on how to compare different data sets stored in CSV files in Python.
Video & Further Resources
Have a look at the following video on the Statistics Globe YouTube channel. I illustrate the Python programming syntax of this article in the video.
The YouTube video will be added soon.
In addition, you may read some of the related tutorials on this website. A selection of articles is listed here.
- Handling DataFrames Using the pandas Library in Python
- Compare Two pandas DataFrames in Python
- Find Differences Between Two Columns of pandas DataFrame
- Compare Headers of Two pandas DataFrames in Python
- Check If Two pandas DataFrames are Equal in Python
- Read CSV File as pandas DataFrame in Python
- Read Only Certain Columns of CSV File as pandas DataFrame
- Skip Rows but Keep Header when Reading CSV File
- Read CSV File Line by Line in Python
- All Python Programming Tutorials
This tutorial has demonstrated how to find differences between two CSV files in the Python programming language. If you have additional questions, please let me know in the comments below.