Apply value in column based on conditions while cross-evaluating 2 datasets

I have 2 DataFrames:

PROJECT1

  key   name   deadline     delivered
0 AA1   Tom    01/05/2018   02/05/2018
1 AA2   Sue    01/05/2018   30/04/2018
2 AA4   Jack   01/05/2018   04/05/2018

PROJECT2

  key   name   deadline     delivered
0 AA1   Tom    01/05/2018   30/04/2018
1 AA2   Sue    01/05/2018   30/04/2018
2 AA3   Jim    01/05/2018   03/05/2018

Is it possible to create a column in PROJECT2 named 'In PROJECT1' and fill it based on a condition such as:

Pseudocode:

for row in PROJECT2: 
    if in the same row based on key column PROJECT1['delivered'] >= PROJECT2['deadline']:
        PROJECT2['In PROJECT1'] = 'project delivered before deadline'
    else:
        PROJECT2['In PROJECT1'] = 'Project delayed'

Expected result:

  key   name   deadline     delivered    In PROJECT1
0 AA1   Tom    01/05/2018   30/04/2018   Project delayed
1 AA2   Sue    01/05/2018   30/04/2018   project delivered before deadline
2 AA3   Jim    01/05/2018   03/05/2018   NaN

I am not sure how to approach it (iterrows(), a for loop, df.loc[conditions], np.where(), or perhaps a function to pass to df.apply()). Any help is highly appreciated.

2 answers

  • answered 2018-07-15 14:11 jpp

    You can use numpy.select to add a series with a list of conditions and values.

    Note that I believe your pseudocode has the criteria reversed: a PROJECT1 delivery on or before the deadline should give "project delivered before deadline", not the other way round.

    import numpy as np
    import pandas as pd
    
    # convert series to datetime if necessary
    for col in ['deadline', 'delivered']:
        df1[col] = pd.to_datetime(df1[col], dayfirst=True)
    
    for col in ['deadline', 'delivered']:
        df2[col] = pd.to_datetime(df2[col], dayfirst=True)
    
    # create series mapping key to delivered date in df1
    s = df1.set_index('key')['delivered']
    
    # define conditions and values
    conditions = [~df2['key'].isin(s.index), df2['key'].map(s) <= df2['deadline']]
    values = [np.nan, 'project delivered before deadline']
    
    # apply conditions and values, with fallback value
    df2['In Project1'] = np.select(conditions, values, 'Project delayed')
    
    print(df2)
    
       key name   deadline  delivered                        In Project1
    0  AA1  Tom 2018-05-01 2018-04-30                    Project delayed
    1  AA2  Sue 2018-05-01 2018-04-30  project delivered before deadline
    2  AA3  Jim 2018-05-01 2018-05-03                                nan
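
    Note that the last row above prints as the string 'nan' rather than a true missing value, because the result of np.select was coerced to a string array. Some NumPy versions may instead refuse to combine np.nan with string choices. If a real NaN is needed, one possible variation is sketched below. It swaps np.select for np.where plus Series.mask, and the sample frames are rebuilt from the question's tables; the names p1_delivered, labels and status are mine, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question (df1 = PROJECT1, df2 = PROJECT2)
df1 = pd.DataFrame({
    'key': ['AA1', 'AA2', 'AA4'],
    'name': ['Tom', 'Sue', 'Jack'],
    'deadline': ['01/05/2018'] * 3,
    'delivered': ['02/05/2018', '30/04/2018', '04/05/2018'],
})
df2 = pd.DataFrame({
    'key': ['AA1', 'AA2', 'AA3'],
    'name': ['Tom', 'Sue', 'Jim'],
    'deadline': ['01/05/2018'] * 3,
    'delivered': ['30/04/2018', '30/04/2018', '03/05/2018'],
})

# Parse the dd/mm/yyyy strings into datetimes
for frame in (df1, df2):
    for col in ['deadline', 'delivered']:
        frame[col] = pd.to_datetime(frame[col], dayfirst=True)

# PROJECT1 delivery date for each PROJECT2 key; NaT where the key is absent
p1_delivered = df2['key'].map(df1.set_index('key')['delivered'])

# Label every row first (comparisons against NaT evaluate to False)...
labels = np.where(p1_delivered <= df2['deadline'],
                  'project delivered before deadline', 'Project delayed')
status = pd.Series(labels, index=df2.index, dtype=object)

# ...then blank out the rows whose key is missing from PROJECT1
df2['In Project1'] = status.mask(p1_delivered.isna())
```

    With this variant, df2['In Project1'].isna() is True for AA3, so the column works with dropna() and fillna().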
    

  • answered 2018-07-15 14:30 YOLO

    An alternative is to join the two datasets with a left merge on 'key'. This avoids an explicit loop and is typically faster than iterating row by row.

    import numpy as np
    import pandas as pd
    
    ## join the two data sets
    #  p1 = Project 1
    #  p2 = Project 2
    p3 = p2.merge(p1.loc[:, ['key', 'delivered']], on='key', how='left', suffixes=('_p2', '_p1'))
    
    # compare as dates, not strings: PROJECT1 delivery against the PROJECT2 deadline
    delivered_p1 = pd.to_datetime(p3['delivered_p1'], dayfirst=True)
    deadline = pd.to_datetime(p3['deadline'], dayfirst=True)
    p3['In PROJECT1'] = np.where(delivered_p1 <= deadline, 'project delivered before deadline', 'Project delayed')
    
    # handle keys that are missing from PROJECT1
    p3.loc[delivered_p1.isna(), 'In PROJECT1'] = np.nan
    
    ## remove unwanted columns and rename
    p3 = p3.drop(columns='delivered_p1').rename(columns={'delivered_p2': 'delivered'})
    
    print(p3)
    
       key name    deadline   delivered                        In PROJECT1
    0  AA1  Tom  01/05/2018  30/04/2018                    Project delayed
    1  AA2  Sue  01/05/2018  30/04/2018  project delivered before deadline
    2  AA3  Jim  01/05/2018  03/05/2018                                NaN
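
    One caveat with the frame printed above: its date columns are still dd/mm/yyyy strings, and strings compare character by character rather than chronologically. A small illustration of why dates should be parsed before comparing them (the variable names are hypothetical and only serve this example):

```python
import pandas as pd

# Two delivery dates from the question, stored as dd/mm/yyyy text
as_text = pd.Series(['02/05/2018', '30/04/2018'])
as_dates = pd.to_datetime(as_text, dayfirst=True)

# Lexicographic comparison: '0' sorts before '3', so 2 May looks "earlier" than 30 April
text_says_later = as_text.iloc[0] >= as_text.iloc[1]

# Chronological comparison: 2 May 2018 really is after 30 April 2018
dates_say_later = as_dates.iloc[0] >= as_dates.iloc[1]

print(text_says_later, dates_say_later)
```

    The two comparisons disagree, so a result that happens to look right on string columns should not be trusted for other data.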