Determine unique dictionary keys across rows

I have a dataframe column where every row holds a dictionary, but the keys in each dictionary vary. I would like to iterate over the rows and get a single list of all unique keys. Does anyone know how to do this?

I tried this code

np.unique(np.array(train.totals.apply(lambda x: ast.literal_eval(x).keys())))

But this yields unique combinations of dict_keys(), not the unique keys themselves.

For example, let's say I have two rows. As mentioned above, the column value for each row is a dictionary. The dictionary keys for row 1 are fruit and vegetable, and the dictionary keys for row 2 are fruit, vegetable and grain.

The code above would produce

dict_keys(['fruit','vegetable']) 

and

dict_keys(['fruit','vegetable','grain']) 

However, what I want the output to be is just a list or array with fruit, vegetable and grain (the unique keys seen across rows).

Edit: screenshot of dataframe added

edit2: Code sample below

import pandas as pd 
import numpy as np
import ast

dummy_data = [['A',str({"pageviews":"1","hits":"1"})],['B',str({"pageviews":"1","visits":"1"})]]
dummy_df = pd.DataFrame(dummy_data,columns = ['ID','totals'])

np.unique(np.array(dummy_df.totals.apply(lambda x: ast.literal_eval(x).keys())))

1 answer

  • answered 2018-11-08 00:07 juanpa.arrivillaga

    Just iterate and add to a set:

    In [1]: import pandas as pd
       ...: import numpy as np
       ...: import ast
       ...:
       ...: dummy_data = [['A',str({"pageviews":"1","hits":"1"})],['B',str({"pageviews":"1","visits":"1"})]]
       ...: dummy_df = pd.DataFrame(dummy_data,columns = ['ID','totals'])
       ...:
       ...:
    
    In [2]: dummy_df
    Out[2]:
      ID                             totals
    0  A    {'pageviews': '1', 'hits': '1'}
    1  B  {'pageviews': '1', 'visits': '1'}
    
    In [3]: uniq = set()
       ...: for x in dummy_df.totals:
       ...:     uniq.update(ast.literal_eval(x))
       ...:
    
    In [4]: uniq
    Out[4]: {'hits', 'pageviews', 'visits'}
    

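    This is probably the best you can do given the structure of your data: each value is a string that has to be parsed with ast.literal_eval before its keys can be read, and updating a set with a dict adds only the dict's keys.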
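
The loop above can also be written as a single expression. The following is a minimal sketch that reuses the dummy data from the question; the names `uniq` and `unique_keys` are only illustrative. It relies on the fact that iterating over a dict yields its keys, so `set.union` can take the parsed dicts directly. Since the question asks for a list, the set is sorted at the end to give a stable order.

```python
import ast

import pandas as pd

# Same dummy data as in the question: each 'totals' value is the
# string form of a dict, so it has to be parsed before use.
dummy_data = [
    ["A", str({"pageviews": "1", "hits": "1"})],
    ["B", str({"pageviews": "1", "visits": "1"})],
]
dummy_df = pd.DataFrame(dummy_data, columns=["ID", "totals"])

# Parse each string into a dict, then union all of them into one set.
# Iterating a dict yields its keys, so only the keys end up in the set.
uniq = set().union(*(ast.literal_eval(x) for x in dummy_df["totals"]))

# Sets are unordered; sort to get a list with a predictable order.
unique_keys = sorted(uniq)
print(unique_keys)
```

This does the same work as the explicit loop (one `ast.literal_eval` call per row), so it is a matter of style rather than speed.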