Can't remove duplicates from .csv column with pandas

I'm trying to do something very simple to a .csv containing addresses. I want to use the pandas function drop_duplicates() to remove any row that contains a duplicate value in a single column (['Addresses']).

Whenever I use drop_duplicates() and then print or save my data frame to a new .csv, the duplicate rows/values are still there.


import pandas

data = pandas.read_csv(r"C:\Users\markbrd\Desktop\PalmAveAddresses.csv",
                       encoding="ISO-8859-1")

data.drop_duplicates(subset=['Addresses'], keep='first')

print(data['Addresses'])

Results:

0             4834Via Estrella
1             5244Via Patricia
2        11721HIDDEN VALLEY RD
3                  30GARDEN CT
4      1999Fremont Blvd. Bldg.
5          8316Fountainhead Ct
6          8312Fountainhead Ct
7               1013Adella Ave
8               1005Adella Ave
9                 1520Tenth St
10                1536Tenth St

                ...           

607              847Florida St
608                 81212th St
609                 81212th St
610                 81212th St
611                 81212th St
612                 81212th St
613                 81212th St
614                 81212th St
615                 81212th St
616                 81212th St
617                 81212th St
618                 81212th St
619                 81212th St

As you can see, there are still several rows that contain duplicates in Addresses (see rows 609-619). Any help would be greatly appreciated!

2 answers

  • answered 2019-06-11 23:13 Kais Tounsi

    DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

    Return DataFrame with duplicate rows removed, optionally only considering certain columns.

    Parameters:

    subset : column label or sequence of labels, optional
        Only consider certain columns for identifying duplicates; by default use all of the columns.

    keep : {'first', 'last', False}, default 'first'
        first : Drop duplicates except for the first occurrence.
        last : Drop duplicates except for the last occurrence.
        False : Drop all duplicates.

    inplace : boolean, default False
        Whether to drop duplicates in place or to return a copy.

    Returns:

    deduplicated : DataFrame
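
To make the keep options quoted above concrete, here is a minimal sketch on a small made-up frame. The 'Addresses' column name follows the question; the 'Unit' column and the sample rows are hypothetical and only there to show which row survives.

```python
import pandas as pd

# Hypothetical sample data: "81212th St" appears twice.
df = pd.DataFrame({
    "Addresses": ["81212th St", "847Florida St", "81212th St", "1520Tenth St"],
    "Unit": [1, 2, 3, 4],
})

# keep='first' keeps the first row of each duplicated address.
first = df.drop_duplicates(subset=["Addresses"], keep="first")

# keep='last' keeps the last row of each duplicated address.
last = df.drop_duplicates(subset=["Addresses"], keep="last")

# keep=False drops every row whose address occurs more than once.
none_kept = df.drop_duplicates(subset=["Addresses"], keep=False)

print(first["Unit"].tolist())
print(last["Unit"].tolist())
print(none_kept["Unit"].tolist())
```

Each call returns a new DataFrame, so df itself still has all four rows afterwards.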

  • answered 2019-06-12 05:12 Natheer Alabsi

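    drop_duplicates() returns a new DataFrame by default and leaves the original untouched, so you need to either assign the result back to a variable or pass inplace=True.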

    data.drop_duplicates(subset=['Addresses'], keep='first', inplace=True)
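
A small sketch of both options, using a made-up three-row frame in place of the asker's CSV (the sample rows are an assumption; only the 'Addresses' column name comes from the question):

```python
import pandas as pd

# Hypothetical stand-in for the data read from the CSV.
data = pd.DataFrame({"Addresses": ["81212th St", "81212th St", "847Florida St"]})

# What the question does: the returned frame is discarded, so data is unchanged.
data.drop_duplicates(subset=["Addresses"], keep="first")
length_without_assignment = len(data)

# Option 1: assign the returned DataFrame to a name.
deduped = data.drop_duplicates(subset=["Addresses"], keep="first")

# Option 2: modify data itself; with inplace=True the call returns None.
data.drop_duplicates(subset=["Addresses"], keep="first", inplace=True)

print(length_without_assignment, len(deduped), len(data))
```

Either way the surviving rows keep their original index labels; if a fresh 0..n-1 index is wanted, calling reset_index(drop=True) on the result is one way to get it.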