How to get the previous event for each row based on condition

I'm gathering event data from different data sources and merging it into a pandas dataframe.

I have two different event types (clicks and purchases) and I want to replicate a "last click attribution" model. This consists of finding the last click the user made before each purchase.

In other words, I think of it as follows: "for each purchase event, get the last click event previous to that purchase (if any)".

import pandas as pd

df = pd.DataFrame({
    'user_id': [1234, 1234, 1234, 1234, 1234, 1234, 1234, 1234],
    'event_type': ['CLICK', 'CLICK', 'PURCHASE', 'PURCHASE', 'CLICK', 'PURCHASE', 'CLICK', 'CLICK'],
    'event_id': [4567, 7891, 11215, 14539, 17863, 21187, 24511, 27835],
    'timestamp': [2, 4, 7, 7, 14, 134, 739, 921]
})
   user_id event_type  event_id  timestamp
0     1234      CLICK      4567          2
1     1234      CLICK      7891          4
2     1234   PURCHASE     11215          7
3     1234   PURCHASE     14539          7
4     1234      CLICK     17863         14
5     1234   PURCHASE     21187        134
6     1234      CLICK     24511        739
7     1234      CLICK     27835        921

I have tried the following:

  1. Sorting values by user_id and timestamp
  2. Adding columns "previous_event_type" and "previous_event_timestamp" with .shift().
  3. Adding a conditional to check whether event_type is "PURCHASE" and previous_event_type is "CLICK"
df['previous_event_type'] = df['event_type'].shift()
df['previous_event_timestamp'] = df['timestamp'].shift()
df['click_to_purchase'] = (df['event_type'] == 'PURCHASE') & (df['previous_event_type'] == 'CLICK')

The main problem with this solution: if the user made two or more consecutive purchases, I can't get the last click previous to the second purchase (and I should).

Is there a way to create a function that, "for each purchase event, gets the last click event previous to that purchase (if any)"?

I can't think of another way to describe it.
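For reference, the rule I'm after ("for each purchase, the most recent click at or before it, per user") looks like a backward as-of join, so something along the lines of pd.merge_asof may be a starting point. This is only a sketch, assuming timestamps are numeric and both frames are sorted by timestamp:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1234] * 8,
    'event_type': ['CLICK', 'CLICK', 'PURCHASE', 'PURCHASE', 'CLICK', 'PURCHASE', 'CLICK', 'CLICK'],
    'event_id': [4567, 7891, 11215, 14539, 17863, 21187, 24511, 27835],
    'timestamp': [2, 4, 7, 7, 14, 134, 739, 921],
})

# split by event type; merge_asof needs both frames sorted on the join key
purchases = df[df['event_type'] == 'PURCHASE'].sort_values('timestamp')
clicks = df[df['event_type'] == 'CLICK'].sort_values('timestamp')

# for each purchase, attach the most recent click at or before it, per user;
# purchases with no prior click get NaN
result = pd.merge_asof(
    purchases,
    clicks[['user_id', 'timestamp', 'event_id']],
    on='timestamp',
    by='user_id',
    direction='backward',
    suffixes=('', '_last_click'),
)
```

On the sample frame, event_id_last_click comes out as 7891, 7891 and 17863 for the three purchases.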

Desired outcome

Thanks, Javier.

5 answers

  • answered 2019-08-18 01:19 Code Different

    I imagine you would like to do that on a per-user basis.

    First, since order is important, sort the dataframe by user_id and timestamp:

    df = df.sort_values(['user_id', 'timestamp']).reset_index(drop=True)
    

    For each user, split the PURCHASE and CLICK events into two separate dataframes, and for each purchase slice the clicks dataframe up to that purchase:

    def summarize(x):
        purchases = x[x['event_type'] == 'PURCHASE']
        clicks = x[x['event_type'] == 'CLICK']
        # for each purchase row, take the last click that precedes it in the
        # sorted frame; note this assumes every purchase has at least one prior click
        last_clicks = purchases.index.to_series().apply(lambda i: clicks.loc[:i].iloc[-1])
        return purchases.join(last_clicks[['event_type', 'event_id', 'timestamp']].add_prefix('last_'))
    
    df.groupby('user_id').apply(summarize) \
        .droplevel(1).drop(columns='user_id')   # drop extra columns
    

    Result:

            event_type  event_id  timestamp last_event_type  last_event_id  last_timestamp
    user_id                                                                               
    1234      PURCHASE     11215          7           CLICK           7891               4
    1234      PURCHASE     14539          7           CLICK           7891               4
    1234      PURCHASE     21187        134           CLICK          17863              14
    

  • answered 2019-08-18 01:34 Ben.T

    I think you can avoid groupby by using some masking with where and mask, plus ffill (equivalent to fillna with method='ffill'). To show that it does not carry the last CLICK over to a new user, I added a row to your dataframe with a new user making a PURCHASE: df.loc[8,:] = [1235, 'PURCHASE', 11, 4]

    #first sort_values
    df = df.sort_values(['user_id', 'timestamp'])
    
    #create the mask of clicks
    mask_click = df.event_type.eq('CLICK')
    
    #mask of rows whose most recent click belongs to a different user
    mask_user = df.user_id.where(mask_click).ffill() != df.user_id
    
    #now create the columns
    df['last_click_id'] = df.event_id.where(mask_click).ffill().mask(mask_click | mask_user)
    df['last_click_timestamp'] = df.timestamp.where(mask_click).ffill().mask(mask_click | mask_user)
    
    print (df)
       user_id event_type  event_id  timestamp  last_click_id  \
    0   1234.0      CLICK    4567.0        2.0            NaN   
    1   1234.0      CLICK    7891.0        4.0            NaN   
    2   1234.0   PURCHASE   11215.0        7.0         7891.0   
    3   1234.0   PURCHASE   14539.0        7.0         7891.0   
    4   1234.0      CLICK   17863.0       14.0            NaN   
    5   1234.0   PURCHASE   21187.0      134.0        17863.0   
    6   1234.0      CLICK   24511.0      739.0            NaN   
    7   1234.0      CLICK   27835.0      921.0            NaN   
    8   1235.0   PURCHASE      11.0        4.0            NaN   # still NaN: new user, no prior click
    
       last_click_timestamp  
    0                   NaN  
    1                   NaN  
    2                   4.0  
    3                   4.0  
    4                   NaN  
    5                  14.0  
    6                   NaN  
    7                   NaN  
    8                   NaN  
    

  • answered 2019-08-18 03:16 manwithfewneeds

    I'd build a mask finding consecutive rows where a CLICK is followed by a PURCHASE, assign the "last" columns with shift, and finally forward-fill across consecutive purchases:

    m = df['event_type'].eq('PURCHASE') & df['event_type'].shift().eq('CLICK')
    df.loc[m, 'last click'] = df['event_id'].shift()
    df.loc[m, 'last time'] = df['timestamp'].shift()
    df.loc[df['event_type'].eq('PURCHASE')]= df.loc[df['event_type'].eq('PURCHASE')].ffill()
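    For reference, a self-contained run of the steps above on the question's sample frame (which is already ordered by timestamp; with several users you would first sort by user_id and timestamp):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1234] * 8,
    'event_type': ['CLICK', 'CLICK', 'PURCHASE', 'PURCHASE', 'CLICK', 'PURCHASE', 'CLICK', 'CLICK'],
    'event_id': [4567, 7891, 11215, 14539, 17863, 21187, 24511, 27835],
    'timestamp': [2, 4, 7, 7, 14, 134, 739, 921],
})

# mark purchases immediately preceded by a click
m = df['event_type'].eq('PURCHASE') & df['event_type'].shift().eq('CLICK')
df.loc[m, 'last click'] = df['event_id'].shift()
df.loc[m, 'last time'] = df['timestamp'].shift()

# carry the values forward across consecutive purchases
purchase = df['event_type'].eq('PURCHASE')
df.loc[purchase] = df.loc[purchase].ffill()
```

    The three purchase rows end up with last click 7891.0, 7891.0 and 17863.0.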
    

  • answered 2019-08-18 03:56 Allen

    Setup

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({
        'user_id': [1234, 1234, 1234, 1234, 1234, 1234, 1234, 1234],
        'event_type': ['CLICK', 'CLICK', 'PURCHASE', 'PURCHASE', 'CLICK', 'PURCHASE', 'CLICK', 'CLICK'],
        'event_id': [4567, 7891, 11215, 14539, 17863, 21187, 24511, 27835],
        'timestamp': [2, 4, 7, 7, 14, 134, 739, 921]
    })
    
    df = pd.concat([df, df.assign(user_id=1235)]).reset_index(drop=True)
    

    Solution:

    # for each row, collect the earlier CLICK rows of the same user
    df['clk_events'] = df.apply(lambda x: df.iloc[0:x.name].loc[lambda y: (y.event_type=='CLICK') & (y.user_id==x.user_id)], axis=1)
    # take the event_id of the last such click (NaN if there is none)
    df['last_clk'] = df.clk_events.apply(lambda x: np.nan if len(x)==0 else x.event_id.tolist()[-1])
    df.loc[df.event_type=='CLICK', 'last_clk'] = np.nan
    df.drop(columns='clk_events', inplace=True)
    
    user_id event_type  event_id    timestamp   last_clk
    0   1234    CLICK       4567    2           NaN
    1   1234    CLICK       7891    4           NaN
    2   1234    PURCHASE    11215   7           7891.0
    3   1234    PURCHASE    14539   7           7891.0
    4   1234    CLICK       17863   14          NaN
    5   1234    PURCHASE    21187   134         17863.0
    6   1234    CLICK       24511   739         NaN
    7   1234    CLICK       27835   921         NaN
    8   1235    CLICK       4567    2           NaN
    9   1235    CLICK       7891    4           NaN
    10  1235    PURCHASE    11215   7           7891.0
    11  1235    PURCHASE    14539   7           7891.0
    12  1235    CLICK       17863   14          NaN
    13  1235    PURCHASE    21187   134         17863.0
    14  1235    CLICK       24511   739         NaN
    15  1235    CLICK       27835   921         NaN
    

  • answered 2019-08-18 04:16 Parijat Bhatt

    I have added it only for last_click_id, but if you need help with the timestamp too, let me know.

    
    df = df.sort_values(by=['timestamp']).reset_index()
    
    def f(x):
        index = x['index']
        event = x['event_type']
        if event == 'PURCHASE':
            # previous row's event_id (NaN if this is the first row)
            return np.nan if index == 0 else df.loc[index - 1, 'event_id']
        else:
            return df.loc[index, 'event_id']
    
    df['last_click_id'] = df[['index', 'event_type']].apply(f, axis=1)