Drop Duplicates from pandas DataFrame in Python (2 Examples)

 

In this Python tutorial you’ll learn how to remove duplicate rows from a pandas DataFrame.

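The tutorial contains these content blocks:

1) Creating Example Data
2) Example 1: Drop Duplicates from pandas DataFrame
3) Example 2: Drop Duplicates Across Certain Columns of pandas DataFrame
4) Video & Further Resources

Here’s how to do it.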

 

Creating Example Data

To be able to use the functions of the pandas library, we first have to load pandas:

import pandas as pd                                            # Load pandas library

In the next step, we create an example DataFrame in Python:

data = pd.DataFrame({'x1':[1, 1, 1, 2, 2, 3, 4],              # Create example DataFrame
                     'x2':[5, 5, 5, 5, 5, 5, 5],
                     'x3':['a', 'a', 'a', 'b', 'c', 'd', 'e']})
print(data)                                                   # Print example DataFrame

 

Table 1: Example DataFrame

   x1  x2 x3
0   1   5  a
1   1   5  a
2   1   5  a
3   2   5  b
4   2   5  c
5   3   5  d
6   4   5  e

 

Table 1 shows the output of the previous syntax: We have created some example data containing seven rows and three columns. The first three rows are identical in all columns, so two of them are duplicates.
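Before removing anything, it can help to check which rows pandas treats as duplicates. The following is a minimal sketch that rebuilds the example data so it runs on its own and then applies the duplicated method; the names dup_mask and n_duplicates are only illustrative and are not used elsewhere in this tutorial.

```python
import pandas as pd

# Rebuild the example data so that this snippet is self-contained
data = pd.DataFrame({'x1': [1, 1, 1, 2, 2, 3, 4],
                     'x2': [5, 5, 5, 5, 5, 5, 5],
                     'x3': ['a', 'a', 'a', 'b', 'c', 'd', 'e']})

# duplicated() marks every row that repeats an earlier row as True
dup_mask = data.duplicated()

# Summing the boolean mask counts the repeated rows
n_duplicates = int(dup_mask.sum())

print(dup_mask.tolist())   # [False, True, True, False, False, False, False]
print(n_duplicates)        # 2
```

The first occurrence of a row is not flagged, which is why only the rows with the indices 1 and 2 are marked.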

 

Example 1: Drop Duplicates from pandas DataFrame

In this example, I’ll explain how to delete duplicate observations in a pandas DataFrame.

For this task, we can use the drop_duplicates method as shown below:

data_new1 = data.copy()                                       # Create duplicate of example data
data_new1 = data_new1.drop_duplicates()                       # Remove duplicates
print(data_new1)                                              # Print new data

 

Table 2: DataFrame after removing duplicate rows

   x1  x2 x3
0   1   5  a
3   2   5  b
4   2   5  c
5   3   5  d
6   4   5  e

 

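As shown in Table 2, the previous syntax has created a new pandas DataFrame called data_new1, in which all repeated rows have been excluded. By default, drop_duplicates keeps the first occurrence of each row, so the rows with the indices 1 and 2 have been removed and five rows remain.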
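The behavior described above can be adjusted with the keep and ignore_index arguments of drop_duplicates. The sketch below is an optional variation and not part of the original example; the variable names keep_last, keep_none, and reindexed are made up for illustration, and ignore_index assumes pandas 1.0 or later.

```python
import pandas as pd

# Rebuild the example data so that this snippet is self-contained
data = pd.DataFrame({'x1': [1, 1, 1, 2, 2, 3, 4],
                     'x2': [5, 5, 5, 5, 5, 5, 5],
                     'x3': ['a', 'a', 'a', 'b', 'c', 'd', 'e']})

# Keep the last occurrence of each repeated row instead of the first
keep_last = data.drop_duplicates(keep='last')

# Drop every row that has a duplicate, including its first occurrence
keep_none = data.drop_duplicates(keep=False)

# Keep the first occurrence, but renumber the index from 0
reindexed = data.drop_duplicates(ignore_index=True)

print(keep_last.index.tolist())   # [2, 3, 4, 5, 6]
print(keep_none.index.tolist())   # [3, 4, 5, 6]
print(reindexed.index.tolist())   # [0, 1, 2, 3, 4]
```

Which variant fits depends on the data: keep='last' is useful when later rows are more up to date, while keep=False removes every row that is involved in a duplication.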

 

Example 2: Drop Duplicates Across Certain Columns of pandas DataFrame

In this example, I’ll show how to drop rows that are duplicated in only some particular columns.

The following Python code keeps only the first row for each combination of values in the variables x1 and x2:

data_new2 = data.copy()                                       # Create duplicate of example data
data_new2 = data_new2.drop_duplicates(subset = ['x1', 'x2'])  # Remove duplicates in subset
print(data_new2)                                              # Print new data

 

Table 3: DataFrame after removing duplicates in x1 and x2

   x1  x2 x3
0   1   5  a
3   2   5  b
5   3   5  d
6   4   5  e

 

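In Table 3 you can see that we have created another data set that contains even fewer rows by running the previous Python code. The row with the index 4 has been removed as well, because it shares its x1 and x2 values with the row at index 3, even though the x3 values differ.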
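Because a subset comparison ignores the remaining columns, it is worth checking which rows get discarded. The following sketch is an illustrative addition under the same example data; the names subset_mask, dropped, and last_per_pair are hypothetical and only serve this snippet.

```python
import pandas as pd

# Rebuild the example data so that this snippet is self-contained
data = pd.DataFrame({'x1': [1, 1, 1, 2, 2, 3, 4],
                     'x2': [5, 5, 5, 5, 5, 5, 5],
                     'x3': ['a', 'a', 'a', 'b', 'c', 'd', 'e']})

# Flag rows whose x1/x2 combination already appeared in an earlier row
subset_mask = data.duplicated(subset=['x1', 'x2'])

# These are the rows that drop_duplicates(subset=['x1', 'x2']) would remove
dropped = data[subset_mask]

# Alternative: keep the last row of each x1/x2 combination instead of the first
last_per_pair = data.drop_duplicates(subset=['x1', 'x2'], keep='last')

print(dropped.index.tolist())         # [1, 2, 4]
print(last_per_pair['x3'].tolist())   # ['a', 'c', 'd', 'e']
```

Note that the choice of keep changes which x3 value survives for x1 = 2: the default keeps 'b', whereas keep='last' keeps 'c'.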

 

Video & Further Resources

Do you need more explanations on how to remove duplicate rows from a pandas DataFrame? Then you should have a look at the corresponding video on the Statistics Globe YouTube channel, in which I explain the topics of this post.

 

 

As an additional resource, I recommend a video on the Data School YouTube channel, in which the speaker illustrates how to find and remove duplicate rows using another pandas DataFrame example.

 

 

Besides the video, you might want to read the related tutorials that I have published on https://www.statisticsglobe.com/.

 

Summary: In this article you have learned how to drop duplicates from a pandas DataFrame in Python.

 
