Pandas DataFrame - Remove duplicates



In Pandas, you can delete duplicate rows based on all columns, or on specific columns, using the DataFrame drop_duplicates() method.

In this tutorial, we shall go through examples of how to remove duplicate rows in a DataFrame using the drop_duplicates() method.


1. Remove duplicate rows from DataFrame based on all columns using drop_duplicates() method

In this example, we are given a DataFrame in df. We have to remove the duplicate rows from this DataFrame based on values from all the columns. We have to use DataFrame drop_duplicates() method in this example.


Steps

  1. Given a DataFrame in df with three columns 'A', 'B', and 'C', and five rows.
df = pd.DataFrame({
        'A': [1, 2, 1, 1, 3],
        'B': [5, 4, 5, 5, 5],
        'C': [9, 10, 9, 9, 8]})
  2. Call the drop_duplicates() method on the DataFrame df.
df.drop_duplicates()

The method returns a new DataFrame created from the original DataFrame by removing the duplicates. By default, the drop_duplicates() method considers all the columns when searching for duplicate rows.

  3. You may store the returned DataFrame in a new variable.
df_no_duplicates = df.drop_duplicates()
  4. You may print the returned DataFrame to output using the print() built-in function.
print(df_no_duplicates)
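
The steps above say that drop_duplicates() returns a new DataFrame. A minimal sketch, using the same sample data and variable names, that checks the original df is left as it was:

```python
import pandas as pd

df = pd.DataFrame({
        'A': [1, 2, 1, 1, 3],
        'B': [5, 4, 5, 5, 5],
        'C': [9, 10, 9, 9, 8]})

df_no_duplicates = df.drop_duplicates()

# The original DataFrame still has all five rows
print(len(df))                # 5

# The returned DataFrame holds only the three unique rows
print(len(df_no_duplicates))  # 3
```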

Program

The complete program to remove duplicate rows from a given DataFrame df based on the values of all columns.

Python Program

import pandas as pd

# Take a sample DataFrame
df = pd.DataFrame({
        'A': [1, 2, 1, 1, 3],
        'B': [5, 4, 5, 5, 5],
        'C': [9, 10, 9, 9, 8]})

# Remove duplicate rows based on all columns
df_no_duplicates = df.drop_duplicates()

print(f"Original DataFrame\n{df}\n")
print(f"DataFrame with no duplicates\n{df_no_duplicates}")

Output

Original DataFrame
   A  B   C
0  1  5   9
1  2  4  10
2  1  5   9
3  1  5   9
4  3  5   8

DataFrame with no duplicates
   A  B   C
0  1  5   9
1  2  4  10
4  3  5   8

The row [1 5 9] occurred three times, but only the first occurrence is retained, and the duplicates are removed.

Explanation

Original DataFrame
   A  B   C
0  1  5   9   ← original row
1  2  4  10   ← original row
2  1  5   9   ← duplicate of row at index=0
3  1  5   9   ← duplicate of row at index=0
4  3  5   8   ← original row

Remove the duplicates highlighted, and you will get the
DataFrame with no duplicates
   A  B   C
0  1  5   9
1  2  4  10
4  3  5   8
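
The marking of rows shown in the explanation can be reproduced with the DataFrame duplicated() method, which returns True for every row that repeats an earlier one. This is only a sketch to make the explanation concrete; the variable name mask is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
        'A': [1, 2, 1, 1, 3],
        'B': [5, 4, 5, 5, 5],
        'C': [9, 10, 9, 9, 8]})

# True marks a row that repeats an earlier row (all columns are compared)
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, True, False]

# Keeping the rows that are not marked gives the same result as drop_duplicates()
print(df[~mask].equals(df.drop_duplicates()))  # True
```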

2. Remove duplicate rows from DataFrame based on a single column using drop_duplicates() method

In this example, we are given a DataFrame in df. We have to remove the duplicate rows from this DataFrame based on values from the column 'B'. We have to use DataFrame drop_duplicates() method in this example.


Steps

  1. Given a DataFrame in df with three columns 'A', 'B', and 'C', and five rows.
df = pd.DataFrame({
        'A': [1, 2, 1, 1, 3],
        'B': [5, 4, 5, 5, 5],
        'C': [9, 10, 9, 9, 8]})
  2. Call the drop_duplicates() method on the DataFrame df, and pass the list of columns to check for duplicates as the subset parameter. Since we would like to remove the duplicates based on column 'B', pass ['B'] for the subset parameter.
df.drop_duplicates(subset=['B'])

The method returns a new DataFrame created from the original DataFrame by removing the duplicates based on the values of the given column. You may store it in a new variable.

df_no_duplicates = df.drop_duplicates(subset=['B'])
  3. You may print the returned DataFrame to output.
print(df_no_duplicates)

Program

The complete program to remove duplicate rows from a given DataFrame df based on the values of column 'B'.

Python Program

import pandas as pd

# Take a sample DataFrame
df = pd.DataFrame({
        'A': [1, 2, 1, 1, 3],
        'B': [5, 4, 5, 5, 5],
        'C': [9, 10, 9, 9, 8]})

# Remove duplicate rows based on column 'B'
df_no_duplicates = df.drop_duplicates(subset=['B'])

print(f"Original DataFrame\n{df}\n")
print(f"DataFrame with no duplicates\n{df_no_duplicates}")

Output

Original DataFrame
   A  B   C
0  1  5   9
1  2  4  10
2  1  5   9
3  1  5   9
4  3  5   8

DataFrame with no duplicates
   A  B   C
0  1  5   9
1  2  4  10

Explanation

Original DataFrame
   A  B   C
0  1  5   9   ← original row
1  2  4  10   ← original row
2  1  5   9   ← duplicate of 'B' at index=0
3  1  5   9   ← duplicate of 'B' at index=0
4  3  5   8   ← duplicate of 'B' at index=0

Remove the duplicates highlighted, and you will get the
DataFrame with no duplicates
   A  B   C
0  1  5   9
1  2  4  10
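
When duplicates are decided by a single column, the rows being dropped can differ in the other columns, so it matters which occurrence is kept. By default the first occurrence stays, as in the output above. The sketch below uses the keep parameter of drop_duplicates() to retain the last occurrence instead; the variable names first and last are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
        'A': [1, 2, 1, 1, 3],
        'B': [5, 4, 5, 5, 5],
        'C': [9, 10, 9, 9, 8]})

# Default behaviour: keep the first row for each value of 'B'
first = df.drop_duplicates(subset=['B'])
print(first.index.tolist())  # [0, 1]

# keep='last': keep the last row for each value of 'B'
last = df.drop_duplicates(subset=['B'], keep='last')
print(last.index.tolist())   # [1, 4]
```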

3. Remove duplicate rows from DataFrame based on multiple columns using drop_duplicates() method

This scenario extends the previous example, where we considered only one column to remove duplicates from a DataFrame.

In this example, we have to remove duplicates based on two columns: 'A' and 'B'.


Steps

  1. Given a DataFrame in df with three columns 'A', 'B', and 'C', and five rows.
df = pd.DataFrame({
        'A': [1, 2, 1, 1, 3],
        'B': [5, 4, 5, 5, 5],
        'C': [9, 10, 9, 9, 8]})
  2. Call the drop_duplicates() method on the DataFrame df, and pass the list of columns to check for duplicates as the subset parameter. Since we would like to remove the duplicates based on columns 'A' and 'B', pass ['A', 'B'] for the subset parameter.
df.drop_duplicates(subset=['A', 'B'])

The method returns a new DataFrame created from the original DataFrame by removing the duplicates based on the values of the given columns. You may store it in a new variable.

df_no_duplicates = df.drop_duplicates(subset=['A', 'B'])
  3. You may print the returned DataFrame to output.
print(df_no_duplicates)

Program

The complete program to remove duplicate rows from a given DataFrame df based on the values of columns 'A' and 'B'.

Python Program

import pandas as pd

# Take a sample DataFrame
df = pd.DataFrame({
        'A': [1, 2, 1, 1, 3],
        'B': [5, 4, 5, 5, 5],
        'C': [9, 10, 9, 9, 8]})

# Remove duplicate rows based on columns 'A' and 'B'
df_no_duplicates = df.drop_duplicates(subset=['A', 'B'])

print(f"Original DataFrame\n{df}\n")
print(f"DataFrame with no duplicates\n{df_no_duplicates}")

Output

Original DataFrame
   A  B   C
0  1  5   9
1  2  4  10
2  1  5   9
3  1  5   9
4  3  5   8

DataFrame with no duplicates
   A  B   C
0  1  5   9
1  2  4  10
4  3  5   8

Explanation

Original DataFrame
   A  B   C
0  1  5   9   ← original row
1  2  4  10   ← original row
2  1  5   9   ← duplicate of 'A' and 'B' at index=0
3  1  5   9   ← duplicate of 'A' and 'B' at index=0
4  3  5   8   ← original row

Remove the duplicates highlighted, and you will get the
DataFrame with no duplicates
   A  B   C
0  1  5   9
1  2  4  10
4  3  5   8
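
To see why the row at index 4 survives here but not in the single-column example, it can help to compare the two subsets side by side. A minimal sketch on the same sample data; the variable names by_b and by_ab are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
        'A': [1, 2, 1, 1, 3],
        'B': [5, 4, 5, 5, 5],
        'C': [9, 10, 9, 9, 8]})

# Only 'B' is compared: the row at index 4 repeats B=5, so it is dropped
by_b = df.drop_duplicates(subset=['B'])
print(by_b.index.tolist())   # [0, 1]

# 'A' and 'B' are compared together: the pair (3, 5) is new, so index 4 is kept
by_ab = df.drop_duplicates(subset=['A', 'B'])
print(by_ab.index.tolist())  # [0, 1, 4]
```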

Summary

In this Pandas Tutorial, we learned how to remove duplicates from a DataFrame using the drop_duplicates() method. We have covered the use cases of removing duplicate rows based on all columns, a single column, or multiple columns, with example programs.



