Drop Duplicates from pandas DataFrame in Python (2 Examples)
In this Python tutorial you’ll learn how to remove duplicate rows from a pandas DataFrame.
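The tutorial contains these content blocks:
- Creating Example Data
- Example 1: Drop Duplicates from pandas DataFrame
- Example 2: Drop Duplicates Across Certain Columns of pandas DataFrame
- Video & Further Resources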
Here’s how to do it.
Creating Example Data
To be able to use the functions of the pandas library, we first have to load pandas:
import pandas as pd # Load pandas library
In the next step, we have to create an example DataFrame in Python:
data = pd.DataFrame({'x1':[1, 1, 1, 2, 2, 3, 4],    # Create example DataFrame
                     'x2':[5, 5, 5, 5, 5, 5, 5],
                     'x3':['a', 'a', 'a', 'b', 'c', 'd', 'e']})
print(data)                                          # Print example DataFrame
Table 1 shows the output of the previous syntax: We have created some example data containing seven rows and three columns. The first three rows are identical, so our data contains duplicates.
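Before removing anything, it can help to check which rows pandas considers duplicates. The following is a minimal sketch using the duplicated method on the same example data; the names is_dup and n_dup are only illustrative and are not part of the original tutorial.

```python
import pandas as pd

# Same example data as in this tutorial
data = pd.DataFrame({'x1': [1, 1, 1, 2, 2, 3, 4],
                     'x2': [5, 5, 5, 5, 5, 5, 5],
                     'x3': ['a', 'a', 'a', 'b', 'c', 'd', 'e']})

# Flag each row that repeats an earlier row.
# By default, the first occurrence is not flagged.
is_dup = data.duplicated()
print(is_dup)

# Count how many rows are flagged as duplicates
n_dup = int(is_dup.sum())
print(n_dup)
```

In this data, the second and third rows are flagged, because they repeat the first row.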
Example 1: Drop Duplicates from pandas DataFrame
In this example, I’ll explain how to delete duplicate observations in a pandas DataFrame.
For this task, we can use the drop_duplicates function as shown below:
data_new1 = data.copy()                     # Create duplicate of example data
data_new1 = data_new1.drop_duplicates()     # Remove duplicates
print(data_new1)                            # Print new data
As shown in Table 2, the previous syntax has created a new pandas DataFrame called data_new1, in which all repeated rows have been excluded. Only the first of the three identical rows is kept, so five rows remain.
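By default, drop_duplicates keeps the first occurrence of each repeated row. The sketch below illustrates how the keep and ignore_index arguments change this behavior; the variable names are illustrative, and ignore_index assumes pandas 1.0 or later.

```python
import pandas as pd

# Same example data as in this tutorial
data = pd.DataFrame({'x1': [1, 1, 1, 2, 2, 3, 4],
                     'x2': [5, 5, 5, 5, 5, 5, 5],
                     'x3': ['a', 'a', 'a', 'b', 'c', 'd', 'e']})

# Default behavior: keep the first occurrence of each repeated row
first_kept = data.drop_duplicates()

# keep='last': keep the last occurrence instead
last_kept = data.drop_duplicates(keep='last')

# keep=False: drop every row that has a duplicate, including the first one
none_kept = data.drop_duplicates(keep=False)

# ignore_index=True: renumber the remaining rows from 0 (assumes pandas >= 1.0)
renumbered = data.drop_duplicates(ignore_index=True)

print(first_kept)
print(last_kept)
print(none_kept)
print(renumbered)
```

Note that the original row labels are preserved unless ignore_index=True is set, which is why the first three results have gaps in their index.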
Example 2: Drop Duplicates Across Certain Columns of pandas DataFrame
In this example, I’ll show how to drop rows that are duplicated in only some particular columns.
The following Python code keeps only the first row for each combination of values in the variables x1 and x2:
data_new2 = data.copy()                                      # Create duplicate of example data
data_new2 = data_new2.drop_duplicates(subset = ['x1', 'x2']) # Remove duplicates in subset
print(data_new2)                                             # Print new data
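In Table 3 you can see that we have created another data set that contains even fewer rows by running the previous Python code. Only four rows remain, because the rows with x1 equal to 2 share the same values in x1 and x2 and differ only in x3.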
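To see exactly which rows are removed only because of the subset argument, the two results can be compared by their row labels. This is a minimal sketch on the same example data; the names dedup_all, dedup_subset and only_subset are illustrative.

```python
import pandas as pd

# Same example data as in this tutorial
data = pd.DataFrame({'x1': [1, 1, 1, 2, 2, 3, 4],
                     'x2': [5, 5, 5, 5, 5, 5, 5],
                     'x3': ['a', 'a', 'a', 'b', 'c', 'd', 'e']})

# Compare entire rows (as in Example 1)
dedup_all = data.drop_duplicates()

# Compare only the columns x1 and x2 (as in Example 2)
dedup_subset = data.drop_duplicates(subset=['x1', 'x2'])

# Row labels that survive the full-row comparison
# but are dropped when only x1 and x2 are compared
only_subset = dedup_all.index.difference(dedup_subset.index)
print(list(only_subset))
```

The row with label 4 differs from the row with label 3 only in x3, so it is dropped as soon as x3 is left out of the comparison.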
Video & Further Resources
Do you need more explanations on how to remove duplicate rows from a pandas DataFrame? Then you should have a look at the following YouTube video of the Statistics Globe YouTube channel. In the video, I explain the topics of this post.
As an additional resource, I recommend watching the following video on the Data School YouTube channel. In the video, the speaker illustrates how to search, find and eliminate duplicate rows in another pandas DataFrame example.
Besides the video, you might want to read the related tutorials that I have published on https://www.statisticsglobe.com/.
- Add Row to pandas DataFrame in Python
- Delete Rows of pandas DataFrame Conditionally
- Drop Rows with Blank Values from pandas DataFrame
- Drop Infinite Values from pandas DataFrame
- Remove Rows with NaN from pandas DataFrame
- How to Manipulate a pandas DataFrame in Python
- How to Use the pandas Library in Python
- Introduction to Python
Summary: In this article you have learned how to drop duplicates from a pandas DataFrame in Python. If you have additional comments or questions, let me know in the comments section below. Furthermore, please subscribe to my email newsletter to receive updates on the newest tutorials.