Getting "SettingWithCopyWarning" while performing one hot encoding with pandas

I encountered the SettingWithCopyWarning in Python. I searched online, but none of the solutions I found work for me.

The input data is like this:

       id          genre
0       1        Drama, Romance
1       2        Action, Drama
2       3        Action, Comedy
3       4        Thriller

The expected outcome should be:

       id        Drama    Romance    Action    Comedy    Thriller
0       1          1         1         0         0         0
1       2          1         0         1         0         0
2       3          0         0         1         1         0
3       4          0         0         0         0         1

I want to take a subset of the input data, add some columns, modify the added columns, and return the subset. I do NOT want to modify the original data; the subset should be a brand-new DataFrame:

# the function to deal with the genre
def genre(data):
    subset = data[['id', 'genre']]
    for i, row in subset.iterrows():
        if isinstance(row['genre'], float):
            continue
        genreList = row['genre'].split(', ')
        for genre in genreList:
            if genre in list(subset):
                subset.loc[i][genre] = 1
            else:
                subset.loc[:][genre] = 0
                subset.loc[i][genre] = 1
    return subset

I tried several approaches, but none of them gets rid of the SettingWithCopyWarning:

  1. subset = data[['A', 'B']].copy()
  2. subset = data.filter(['A','B'], axis=1)
  3. subset = pd.DataFrame(data[['A', 'B']])
  4. subset = data.copy()
    subset.drop(columns =['C','D'])
  5. subset = pd.DataFrame({'id': list(data.id), 'genre': list(data.genre)})

Does anyone have any idea how to fix this? Or is there a way to suppress the warning?

Thanks in advance!!

1 answer

  • answered 2018-12-16 07:37 cs95

    Iteration is not needed, and neither is subsetting. You can use str.get_dummies.

    df.drop(columns='genre').join(df['genre'].str.get_dummies(sep=', '))
    
       id  Action  Comedy  Drama  Romance  Thriller
    0   1       0       0      1        1         0
    1   2       1       0      1        0         0
    2   3       1       1      0        0         0
    3   4       0       0      0        0         1
    

    The result is a new DataFrame, which you can assign to another variable (df2 = ...).
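
For reference, here is a minimal, self-contained sketch of the same approach using the sample data from the question. The helper name one_hot_genres and its column and sep parameters are illustrative assumptions, not part of the original answer. Because str.get_dummies builds the indicator columns in one step, nothing is assigned into a slice, so the warning should not come up. Rows with a missing genre should come out as all zeros, which roughly matches the float check in the original loop.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "genre": ["Drama, Romance", "Action, Drama", "Action, Comedy", "Thriller"],
})


def one_hot_genres(data, column="genre", sep=", "):
    """Return a new DataFrame with `column` replaced by 0/1 indicator columns.

    Hypothetical helper for illustration; `data` itself is not modified.
    """
    # One indicator column per distinct value found after splitting on `sep`.
    dummies = data[column].str.get_dummies(sep=sep)
    # drop() and join() both return new objects, so the input is left as-is.
    return data.drop(columns=column).join(dummies)


df2 = one_hot_genres(df)
print(df2)
```

The indicator columns come back in alphabetical order, as in the output shown above. If the order from the question matters, reindex the columns afterwards, e.g. df2[['id', 'Drama', 'Romance', 'Action', 'Comedy', 'Thriller']].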