How to create a new boolean column in a dataframe based on multiple conditions from another dataframe in pandas

I have a dataframe

entity  response    date
p   a1  1-Feb-14
p   a2  2-Feb-14
p   a3  3-Feb-14
p   a4  4-Feb-14
p   a5  5-Feb-14
p   a6  6-Feb-14
p   a7  7-Feb-14
p   a8  8-Feb-14
p   a9  9-Feb-14
p   a10 10-Feb-14
p   a11 11-Feb-14
p   a12 12-Feb-14
p   a13 13-Feb-14
p   a14 14-Feb-14
p   a15 15-Feb-14

and another dataframe:

entity  start_date  end_date
p   2-Feb-14    4-Feb-14
p   6-Feb-14    7-Feb-14
p   9-Feb-14    12-Feb-14
q   1-Feb-14    7-Feb-14

Based on the second dataframe, I need to add a True/False column to the first dataframe. For each row, the value should be True if the date lies within any start_date/end_date window of the same entity (here p), and False otherwise.

What is the fastest and shortest way of doing this? I tried iterating over the whole dataframe, but that is slow and makes the code long as well.

2 answers

  • answered 2018-08-09 00:44 RafaelC

    Maybe I'm overthinking, but

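    # Assumes date, start_date and end_date are datetime64 columns.
    def f(s):
        # s holds the dates of one entity; s.name is that entity's key
        windows = df2[df2.entity == s.name]
        # True where the date lies in any of that entity's windows (bounds inclusive)
        return s.apply(lambda d: ((d >= windows.start_date) & (d <= windows.end_date)).any())
    
    df.groupby('entity').date.transform(f)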
    
    0     False
    1      True
    2      True
    3      True
    4     False
    5      True
    6      True
    7     False
    8      True
    9      True
    10     True
    11     True
    12    False
    13    False
    14    False
    15    False
    Name: date, dtype
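
For reference, here is a self-contained sketch of the same per-entity check on the question's sample data. It is not the answer's code: the inner per-date lambda is swapped for NumPy broadcasting within each group. The column name `in_window` and the date format string are assumptions made for illustration.

```python
import pandas as pd

# Sample data from the question, with dates parsed to datetime64.
df = pd.DataFrame({
    "entity": ["p"] * 15,
    "response": [f"a{i}" for i in range(1, 16)],
    "date": pd.date_range("2014-02-01", periods=15),
})
df2 = pd.DataFrame({
    "entity": ["p", "p", "p", "q"],
    "start_date": pd.to_datetime(
        ["2-Feb-14", "6-Feb-14", "9-Feb-14", "1-Feb-14"], format="%d-%b-%y"),
    "end_date": pd.to_datetime(
        ["4-Feb-14", "7-Feb-14", "12-Feb-14", "7-Feb-14"], format="%d-%b-%y"),
})

in_window = pd.Series(False, index=df.index)
for entity, group in df.groupby("entity"):
    windows = df2[df2["entity"] == entity]
    # Column vector of dates against row vectors of bounds:
    # the comparison broadcasts to shape (n_dates, n_windows).
    dates = group["date"].to_numpy()[:, None]
    hit = (dates >= windows["start_date"].to_numpy()) & (
        dates <= windows["end_date"].to_numpy())
    # A date counts if it falls inside at least one window (bounds inclusive).
    in_window.loc[group.index] = hit.any(axis=1)

df["in_window"] = in_window
```

An entity with no windows in `df2` produces an empty comparison, so all of its rows stay False.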
    

    You can also do some preprocessing first to speed things up:

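    # One pd.Interval per window (assumes start_date and end_date are datetimes)
    df2['j'] = df2.apply(lambda k: pd.Interval(k.start_date, k.end_date), axis=1)
    # Map each entity to the list of its intervals
    dic = df2.groupby('entity')['j'].agg(list).to_dict()
    # Row-wise membership test; apply is used because each row reduces to one scalar
    df[['entity', 'date']].apply(lambda x: any(x['date'] in z for z in dic[x['entity']]), axis=1)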
    

    Note that pd.Interval is closed only on the right by default, so a date equal to start_date is not matched; pass closed='both' if both endpoints should count. This should be around 20x faster than the chained transforms.
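
To make the endpoint behaviour concrete, here is a minimal sketch on the question's sample data. It first compares the default `pd.Interval` with one built with `closed="both"`, then repeats the lookup with inclusive intervals. The names `intervals` and `in_window` are illustrative, and the plain list comprehension is a substitution for the row-wise call in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "entity": ["p"] * 15,
    "response": [f"a{i}" for i in range(1, 16)],
    "date": pd.date_range("2014-02-01", periods=15),
})
df2 = pd.DataFrame({
    "entity": ["p", "p", "p", "q"],
    "start_date": pd.to_datetime(
        ["2-Feb-14", "6-Feb-14", "9-Feb-14", "1-Feb-14"], format="%d-%b-%y"),
    "end_date": pd.to_datetime(
        ["4-Feb-14", "7-Feb-14", "12-Feb-14", "7-Feb-14"], format="%d-%b-%y"),
})

start = pd.Timestamp("2014-02-02")
end = pd.Timestamp("2014-02-04")
default_iv = pd.Interval(start, end)               # closed="right": (start, end]
both_iv = pd.Interval(start, end, closed="both")   # [start, end]
start_in_default = start in default_iv  # left endpoint is excluded
start_in_both = start in both_iv        # left endpoint is included

# Map each entity to its inclusive intervals.
intervals = {
    entity: [pd.Interval(s, e, closed="both")
             for s, e in zip(g["start_date"], g["end_date"])]
    for entity, g in df2.groupby("entity")
}

# Entities missing from df2 fall back to an empty list, i.e. False.
df["in_window"] = [
    any(date in iv for iv in intervals.get(entity, []))
    for entity, date in zip(df["entity"], df["date"])
]
```

With `closed="both"`, 2-Feb-14, 6-Feb-14 and 9-Feb-14 are matched, in line with the inclusive comparison in the first snippet of this answer.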

  • answered 2018-08-09 01:22 Sacry

    IMHO, depending on your data, it's sometimes acceptable to expand the date ranges first:

    df2 = pd.concat([
        pd.DataFrame(pd.date_range(start_date, end_date), columns=['date']).assign(entity=entity)
        for _, (entity, start_date, end_date) in df2.iterrows()
    ]).drop_duplicates()
    df.merge(df2, on=['entity', 'date'], how='left', indicator=True)['_merge'] == 'both'
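
If expanding every window into daily rows is too costly (long windows, or timestamps finer than one day), a possible alternative is to merge on `entity` alone and compare the dates afterwards. This is a different technique from the expansion above, sketched here on the question's sample data. The names `pairs`, `hit` and `in_window` are illustrative, and the code assumes `df` has an unnamed index so that `reset_index()` creates a column called `index`.

```python
import pandas as pd

df = pd.DataFrame({
    "entity": ["p"] * 15,
    "response": [f"a{i}" for i in range(1, 16)],
    "date": pd.date_range("2014-02-01", periods=15),
})
df2 = pd.DataFrame({
    "entity": ["p", "p", "p", "q"],
    "start_date": pd.to_datetime(
        ["2-Feb-14", "6-Feb-14", "9-Feb-14", "1-Feb-14"], format="%d-%b-%y"),
    "end_date": pd.to_datetime(
        ["4-Feb-14", "7-Feb-14", "12-Feb-14", "7-Feb-14"], format="%d-%b-%y"),
})

# Pair each row of df with every window of the same entity.
# reset_index() keeps the original row label in the "index" column.
pairs = df.reset_index().merge(df2, on="entity", how="left")

# Series.between is inclusive on both ends by default. Rows whose entity
# has no window get NaT bounds, which compare as False.
hit = pairs["date"].between(pairs["start_date"], pairs["end_date"])

# Collapse back to one value per original row: True if any window matched.
df["in_window"] = hit.groupby(pairs["index"]).any()
```

The intermediate frame has one row per (date, window) pair of the same entity, so this trades memory for a fully vectorized comparison.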