Doing pandas join with multiple values on the left and one on the right

I'm looking for help joining two datasets in pandas.

My first dataset is a contacts dataset, including name and an ID. The second is a deals dataset which amongst other fields includes one column with a number of comma separated contact IDs. I would like to left join the deals dataset to the contacts dataset where the contact ID matches one of the IDs in the 'associated contacts' field.

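import pandas as pd

# Contacts: one row per contact, with an integer id
contacts_df = pd.DataFrame(
  {'name': ['John Smith', 'Jane Doe', 'James Bond'],
   'id': [1, 2, 3]})

# Deals: associated_contacts holds comma-separated contact ids as a string
deals_df = pd.DataFrame(
  {'deal_name': ['McDonalds', 'KFC'],
   'associated_contacts': ['1,3', '2']})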

I have split the contacts in the deals dataframe into four different columns:

deals_df[['Contact ID 1','Contact ID 2','Contact ID 3', 'Contact ID 4']] = deals_df['associated_contacts'].str.split(',',expand=True)

And tried to join this to the contacts dataset:

merged = contacts_df.merge(deals_df, how='left', left_on='id', 
                    right_on=['Contact ID 1','Contact ID 2','Contact ID 3','Contact ID 4'])

But that returned a ValueError:

ValueError: len(right_on) must equal len(left_on)

Can anyone help me join these two datasets please? I think in my dataset each contact will only be associated with one deal. But a deal could have multiple contacts and I'd like to see the deal associated with each one.

1 answer

  • answered 2020-02-19 12:10 jezrael

    Use DataFrame.explode (pandas 0.25+) to repeat each row once per value of associated_contacts after splitting on the comma. It is also necessary to convert the id column to integers so that it matches the dtype of contacts_df['id']:

    deals_df = (deals_df.assign(id = deals_df.pop('associated_contacts').str.split(','))
                        .explode('id')
                        .assign(id = lambda x: x['id'].astype(int)))
    print (deals_df)
       deal_name  id
    0  McDonalds   1
    0  McDonalds   3
    1        KFC   2
    

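    Alternatively, your split(expand=True) approach can be adapted: reshape the split columns with DataFrame.stack, then use DataFrame.join to attach the result back to the original DataFrame (start again from the original deals_df, because pop has already removed associated_contacts above):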

    deals_df = (deals_df.join(deals_df.pop('associated_contacts')
                                      .str.split(',', expand=True)
                                      .stack()
                                      .astype(int)
                                      .reset_index(level=1, drop=True)
                                      .rename('id')))
    print (deals_df)
       deal_name  id
    0  McDonalds   1
    0  McDonalds   3
    1        KFC   2
    

    Then use merge with only the on parameter, since both DataFrames now share an id column:

    merged = contacts_df.merge(deals_df, how='left', on='id')
    print (merged)
             name  id  deal_name
    0  John Smith   1  McDonalds
    1    Jane Doe   2        KFC
    2  James Bond   3  McDonalds
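
    For convenience, the explode route and the final merge can be put together in one script. This is only a sketch built on the sample data from the question: the name deals_long is made up for illustration, and the str.strip() call is an assumption added in case the real data has spaces after the commas. Unlike the snippets above, it uses drop instead of pop so the original deals_df is left unchanged.

```python
import pandas as pd

# Sample data from the question
contacts_df = pd.DataFrame(
    {'name': ['John Smith', 'Jane Doe', 'James Bond'],
     'id': [1, 2, 3]})

deals_df = pd.DataFrame(
    {'deal_name': ['McDonalds', 'KFC'],
     'associated_contacts': ['1,3', '2']})

# deals_long is a hypothetical name: one row per (deal, contact id) pair
deals_long = (
    deals_df
    .assign(id=deals_df['associated_contacts'].str.split(','))  # lists of id strings
    .explode('id')                                               # one row per id
    .drop(columns='associated_contacts')                         # keep deals_df intact
    # strip() is an assumption: it guards against values like '1, 3'
    .assign(id=lambda x: x['id'].str.strip().astype(int))
)

# Left join keeps every contact, even one that has no deal
merged = contacts_df.merge(deals_long, how='left', on='id')
print(merged)
```

    If a contact ever turns out to be associated with more than one deal, the same merge returns one row per contact and deal pair.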