'Series' object has no attribute 'decode' in pandas

I am trying to decode UTF-8 encoded text in Python. The data is loaded into a pandas DataFrame, and then I try to decode it. This produces an error: AttributeError: 'Series' object has no attribute 'decode'. How can I properly decode the text in a pandas column?

>>> preparedData.head(5).to_dict()
{'id': {0: 1042616899408945154, 1: 1042592536769044487, 2: 1042587702040903680, 3: 1042587263643930626, 4: 1042586780292276230}, 'date': {0: '2018-09-20', 1: '2018-09-20', 2: '2018-09-20', 3: '2018-09-20', 4: '2018-09-20'}, 'time': {0: '03:30:14', 1: '01:53:25', 2: '01:34:13', 3: '01:32:28', 4: '01:30:33'}, 'text': {0: "b'\\xf0\\x9f\\x8c\\xb9 are red, violets are blue, if you want to buy us \\xf0\\x9f\\x92\\x90, here is a CLUE \\xf0\\x9f\\x98\\x89 Our #flowerpowered eye & cheek palette is AL\\xe2\\x80\\xa6 '", 1: "b'\\xf0\\x9f\\x8e\\xb5Is it too late now to say sorry\\xf0\\x9f\\x8e\\xb5 #tartetalk #memes'", 2: "b'@JillianJChase Oh no! Please email your order # to social@tarte.com & we can help \\xf0\\x9f\\x92\\x95'", 3: 'b"@Danikins__ It\'s best applied with our buffer brush! \\xf0\\x9f\\x92\\x9c\\xc2\\xa0"', 4: "b'@AdelaineMorin DEAD \\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3'"}, 'hasMedia': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}, 'hasHashtag': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}, 'followers_count': {0: 801745, 1: 801745, 2: 801745, 3: 801745, 4: 801745}, 'retweet_count': {0: 17, 1: 94, 2: 0, 3: 0, 4: 0}, 'favourite_count': {0: 181, 1: 408, 2: 0, 3: 0, 4: 14}}

My data looks like the above. I want to decode the 'text' column.

ExampleText = b'\xf0\x9f\x8c\xb9 are red, violets are blue, if you want to buy us \xf0\x9f\x92\x90, here is a CLUE \xf0\x9f\x98\x89 Our #flowerpowered eye & cheek palette is AL\xe2\x80\xa6'

I could decode the text above as

ExampleText = ExampleText.decode('utf8')

However, when I try to decode the text in a pandas DataFrame column, I get the error. This is what I tried:

preparedData['text'] = preparedData['text'].decode('utf8')

The error I get is:

Traceback (most recent call last):
  File "F:/Level 4 Research Project/makeViral/main.py", line 23, in <module>
    main()
  File "F:/Level 4 Research Project/makeViral/main.py", line 19, in main
    preprocessedData = preprocessData(preparedData)
  File "F:\Level 4 Research Project\makeViral\preprocess.py", line 34, in preprocessData
    preparedData['text'] = preparedData['text'].decode('utf8')
  File "C:\Users\Kabilesh\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 4376, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'decode'

I also tried

preparedData['text'] = preparedData['text'].str.decode('utf8', errors='strict')

This does not raise an error, but the resulting 'text' column contains only NaN:

'text': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}

1 answer

  • answered 2018-09-24 17:17 Sven Harris

    I could be wrong, but I would guess that what you actually have are byte strings rather than strings containing byte-string literals (that is, b"XXXXX" instead of "b'XXXXX'" as posted in your question). In that case you can do the following (you need to use the string accessor):

    preparedData['text'] = preparedData['text'].str.decode('utf8')
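
One way to tell the two cases apart is to look at the Python type of a single element. The following is only a minimal sketch on a hypothetical one-row frame (the column names `real_bytes` and `bytes_as_text` are made up for illustration and are not from the question):

```python
import pandas as pd

# Hypothetical sample data, not the asker's frame.
# "real_bytes" holds an actual bytes object; "bytes_as_text" holds a str
# whose characters merely spell out a bytes literal.
df = pd.DataFrame({
    "real_bytes": pd.Series([b"\xf0\x9f\x92\x95 hi"], dtype=object),
    "bytes_as_text": pd.Series(["b'\\xf0\\x9f\\x92\\x95 hi'"], dtype=object),
})

real_type = type(df["real_bytes"].iloc[0])      # expected to be bytes
text_type = type(df["bytes_as_text"].iloc[0])   # expected to be str

# The .str accessor decodes element-wise when the elements are bytes.
decoded = df["real_bytes"].str.decode("utf8")
```

If the check reports `str` for the column in question, the elements are not bytes, and `.str.decode` has nothing to decode.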
    

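    Edit: Looks like my assumption was wrong and the column holds strings of the form "b'XXXXX'". In that case you can add a pre-processing step that first parses each string into a real bytes object with ast.literal_eval and then decodes it: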

    import ast
    preparedData['text'] = preparedData['text'].apply(ast.literal_eval).str.decode("utf-8")
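
To make the two steps visible, here is a small self-contained sketch. It assumes the column looks like the sample in the question; the `text` Series below is a hypothetical stand-in built from shortened versions of two of the posted rows, and `as_bytes` and `decoded` are illustrative names only:

```python
import ast

import pandas as pd

# Hypothetical stand-in for preparedData['text']: each value is a str
# containing the source text of a bytes literal, as in the question.
text = pd.Series(
    [
        "b'@AdelaineMorin DEAD \\xf0\\x9f\\xa4\\xa3'",
        'b"It\'s best applied with our buffer brush! \\xf0\\x9f\\x92\\x9c"',
    ],
    dtype=object,
)

# Step 1: parse each literal into a real bytes object.
# ast.literal_eval only evaluates Python literals, not arbitrary code.
as_bytes = text.apply(ast.literal_eval)

# Step 2: decode the bytes element-wise as UTF-8.
decoded = as_bytes.str.decode("utf-8")
```

Note that ast.literal_eval raises an exception on any value that is not a valid Python literal, so rows with missing or malformed text would need to be handled before this step.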