Nested dicts with empty lists to Pandas dataframe columns

I have some data from an API that I am trying to convert to a Pandas dataframe. I am struggling to extract the 'station_xyz__cr' id from the list inside a nested dict; the list can be empty, as in the middle record.

output = {'data': [{'abc_serial_number__c': 'ABC2020-07571',
       'id': 'V48000000000F79',
       'modified_date__v': '2020-06-15T05:13:14.000Z',
       'name__v': 'VVV-001039',
       'station_xyz__cr': {'data': [{'id': 'V5J000000000B86'}],
                           'responseDetails': {'limit': 250,
                                               'offset': 0,
                                               'size': 1,
                                               'total': 1}}},
      {'abc_serial_number__c': 'ABC2020-09952',
       'id': 'V48000000001B94',
       'modified_date__v': '2020-06-24T11:30:40.000Z',
       'name__v': 'VVV-004040',
       'station_xyz__cr': {'data': [],
                           'responseDetails': {'limit': 250,
                                               'offset': 0,
                                               'size': 1,
                                               'total': 1}}},
      {'abc_serial_number__c': 'ABC2020-09196',
       'id': 'V48000000001B95',
       'modified_date__v': '2020-06-23T09:38:18.000Z',
       'name__v': 'VVV-004041',
       'station_xyz__cr': {'data': [{'id': 'V5J000000000Z10'}],
                           'responseDetails': {'limit': 250,
                                               'offset': 0,
                                               'size': 1,
                                               'total': 1}}}],
 'responseDetails': {'limit': 1000, 'offset': 0, 'size': 3, 'total': 3},
 'responseStatus': 'SUCCESS'}

I'm trying to get the nested id into a dataframe column, something like this:

   station_xyz__cr.data.id
0          V5J000000000B86
1                     None 
2          V5J000000000Z10

I've tried converting to a dataframe with json_normalize (dropping the columns I don't need):

import pandas as pd

df = pd.json_normalize(output['data'])
df = df.loc[:, ~df.columns.str.startswith('station_xyz__cr.responseDetails')]
print(df)

  abc_serial_number__c               id          modified_date__v     name__v  \
0        ABC2020-07571  V48000000000F79  2020-06-15T05:13:14.000Z  VVV-001039   
1        ABC2020-09952  V48000000001B94  2020-06-24T11:30:40.000Z  VVV-004040   
2        ABC2020-09196  V48000000001B95  2020-06-23T09:38:18.000Z  VVV-004041   

          station_xyz__cr.data  
0  [{'id': 'V5J000000000B86'}]  
1                           []  
2  [{'id': 'V5J000000000Z10'}] 

but I'm struggling to convert the 'station_xyz__cr.data' list of dicts to a simple dataframe of the ids:

df2 = pd.DataFrame(df['station_xyz__cr.data'].tolist(), index=df.index)
df2 = df2.rename(columns={0: 'station_xyz__cr.data'})
df2

        station_xyz__cr.data
0  {'id': 'V5J000000000B86'}
1                       None
2  {'id': 'V5J000000000Z10'}

The None causes problems when I try to extract further. I tried replacing it, but the only replacement I could get to work was 0:

df.fillna(0, inplace=True)

1 answer

  • answered 2020-07-29 17:24 skullgoblet1089

    Build a boolean mask of the rows that hold None. Using that mask, set those row/column cells to a default value with the same shape as the column's other values, so that the next stage of the data flow can treat every row the same way.

    isna_idx = pd.isnull(df2['station_xyz__cr.data'])
    df2.loc[isna_idx, ['station_xyz__cr.data']] = {'id': '...'}
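
If the goal is only the id column shown in the question, a more direct route may be to skip the placeholder and map each list straight to its first id. This is a minimal sketch, not the answerer's method. It assumes each list holds at most one dict (or that only the first one matters). The helper name `first_id` and the trimmed `records` list are made up for illustration, and the new column name is taken from the desired output in the question.

```python
import pandas as pd

# Trimmed copy of the records from the question (responseDetails omitted).
records = [
    {"name__v": "VVV-001039", "station_xyz__cr": {"data": [{"id": "V5J000000000B86"}]}},
    {"name__v": "VVV-004040", "station_xyz__cr": {"data": []}},
    {"name__v": "VVV-004041", "station_xyz__cr": {"data": [{"id": "V5J000000000Z10"}]}},
]


def first_id(items):
    """Return the 'id' of the first dict in the list, or None if the list is empty or missing."""
    if isinstance(items, list) and items:
        return items[0].get("id")
    return None


# json_normalize flattens nested dicts but leaves the inner list as a list.
df = pd.json_normalize(records)

# Map each list (possibly empty) to a single id value or None.
df["station_xyz__cr.data.id"] = df["station_xyz__cr.data"].apply(first_id)

print(df[["name__v", "station_xyz__cr.data.id"]])
```

Because the empty list is handled inside the helper, no None or NaN replacement step is needed before extracting the id.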