Parsing a Dataframe multiple ways

I'm trying to blow out the below dataset, so that every email address is a line item, including all the columns.

    PrimaryEmail     ShipToEmail       SoldToEmail    City
0  jack@jill.com            None  bridge@white.com   Maine
1  jill@jack.com            None              None  Jersey
2           None  frank@tony.com   listen@what.com    Blah

The results should be:

    PrimaryEmail     ShipToEmail       SoldToEmail    City             Role             Email                  UserID
0  jack@jill.com            None  bridge@white.com   Maine  Primary Contact     jack@jill.com     jack@jill.com__user
1  jill@jack.com            None              None  Jersey  Primary Contact     jill@jack.com     jill@jack.com__user
2           None  frank@tony.com   listen@what.com    Blah  Ship-To Contact    frank@tony.com    frank@tony.com__user
3  jack@jill.com            None  bridge@white.com   Maine  Sold-To Contact  bridge@white.com  bridge@white.com__user
4           None  frank@tony.com   listen@what.com    Blah  Sold-To Contact   listen@what.com   listen@what.com__user

I've created the function below which works great for a small set of data (like above), however when I get into 250K+ lines it will do the first function call fine, i.e.:

user_modifications(df.copy(deep=True),'PrimaryEmail','Primary Contact')

However, the rest of the function calls, for the other roles, are returning line items where the email address for that function are blank.

import pandas as pd

def user_modifications(__dataframe,__emailcolumn, __Role):
    __df=pd.DataFrame()
    __df.iloc[0:0]
    __df = __dataframe[__dataframe[__emailcolumn].notnull()].copy(deep=True)
    __df['Role'] = __Role
    __df['Email'] = __df[__emailcolumn]
    __df['UserID'] = __df[__emailcolumn] + '__user'
    return(__df)

def main():
    df=pd.DataFrame({'PrimaryEmail':['jack@jill.com','jill@jack.com',None],'ShipToEmail':[None,None,'frank@tony.com'],'SoldToEmail':['bridge@white.com',None,'listen@what.com'], 'City':['Maine','Jersey','Blah']})
    df_users = pd.concat([user_modifications(df.copy(deep=True),'PrimaryEmail','Primary Contact'),user_modifications(df.copy(deep=True),'ShipToEmail','Ship-To Contact'),user_modifications(df.copy(deep=True),'SoldToEmail','Sold-To Contact')], ignore_index=True)
    print(df_users)

main()

Thanks for the help.

2 answers

  • answered 2020-09-14 05:36 Shaido

    It looks like you are doing some additional copying of the dataframe. You can try as simple as possible, split the dataframe depending on the values in the first 3 columns, add the Role and Email columns and the concatenate everything together. Finally, the UserID column can be added.

    primary_emails = df.loc[df['PrimaryEmail'].notnull()].copy()
    primary_emails['Role'] = 'Primary Contact'
    primary_emails['Email'] = df['PrimaryEmail']
    
    ship_emails = df.loc[df['ShipToEmail'].notnull()].copy()
    ship_emails['Role'] = 'Ship-To Contact'
    ship_emails['Email'] = df['ShipToEmail']
    
    sold_emails = df.loc[df['SoldToEmail'].notnull()].copy()
    sold_emails['Role'] = 'Sold-To Contact'
    sold_emails['Email'] = df['SoldToEmail']
    
    df = pd.concat([primary_emails, ship_emails, sold_emails], axis=0, ignore_index=True)
    df['UserId'] = df['Email'] + '__user'
    

    This will give the wanted resulting dataframe.

  • answered 2020-09-14 05:51 Chris

    One way using pandas.DataFrame.filter, melt and merge:

    df = df[df.ne("None")]
    df2 = df.filter(like="Email").reset_index().melt(id_vars="index", 
                                                     var_name="Role", 
                                                     value_name="Email").dropna()
    new_df = df.merge(df2, left_index=True, right_on="index").drop("index", 1)
    

    Output:

        PrimaryEmail     ShipToEmail       SoldToEmail    City          Role  \
    0  jack@jill.com             NaN  bridge@white.com   Maine  PrimaryEmail   
    6  jack@jill.com             NaN  bridge@white.com   Maine   SoldToEmail   
    1  jill@jack.com             NaN               NaN  Jersey  PrimaryEmail   
    5            NaN  frank@tony.com   listen@what.com    Blah   ShipToEmail   
    8            NaN  frank@tony.com   listen@what.com    Blah   SoldToEmail   
    
                  Email  
    0     jack@jill.com  
    6  bridge@white.com  
    1     jill@jack.com  
    5    frank@tony.com  
    8   listen@what.com  
    

    Then modify the new columns as desired:

    new_df["Role"] = new_df["Role"].str.replace("Email", " Contact")
    new_df["UserID"] = new_df["Email"] + "__user"
    print(new_df)
    

    Output:

        PrimaryEmail     ShipToEmail       SoldToEmail    City             Role  \
    0  jack@jill.com             NaN  bridge@white.com   Maine  Primary Contact   
    6  jack@jill.com             NaN  bridge@white.com   Maine   SoldTo Contact   
    1  jill@jack.com             NaN               NaN  Jersey  Primary Contact   
    5            NaN  frank@tony.com   listen@what.com    Blah   ShipTo Contact   
    8            NaN  frank@tony.com   listen@what.com    Blah   SoldTo Contact   
    
                  Email                  UserID  
    0     jack@jill.com     jack@jill.com__user  
    6  bridge@white.com  bridge@white.com__user  
    1     jill@jack.com     jill@jack.com__user  
    5    frank@tony.com    frank@tony.com__user  
    8   listen@what.com   listen@what.com__user