Remove duplicates based on date conditions

I have a dataframe like the one below, and I would like to remove duplicates based on two criteria: 1) if Startdate is later than Month, the row is removed; 2) if Startdate is earlier than Month, keep only the latest record for that Month. The result column marks the outcome I expect for each row.

>       COMP    Month       Startdate   bundle            result
> 0     TD3M    2018-03-01  2015-08-28  01_Essential      keep    
> 1     TD3M    2018-03-01  2018-07-17  04_Complete       remove
> 2     TD3M    2018-04-01  2015-08-28  01_Essential      keep
> 3     TD3M    2018-04-01  2018-07-17  04_Complete       remove
> 4     TD3M    2018-05-01  2015-08-28  01_Essential      keep
> 5     TD3M    2018-05-01  2018-07-17  04_Complete       remove
> 6     TD3M    2018-06-01  2015-08-28  01_Essential      keep
> 7     TD3M    2018-06-01  2018-07-17  04_Complete       remove
> 8     TD3M    2018-08-01  2015-08-28  01_Essential      remove
> 9     TD3M    2018-08-01  2018-07-17  04_Complete       keep
> 10    TD3M    2018-09-01  2015-08-28  01_Essential      remove
> 11    TD3M    2018-09-01  2018-07-17  04_Complete       keep

The expected output would be:

>       COMP    Month       Startdate   bundle            
> 0     TD3M    2018-03-01  2015-08-28  01_Essential      
> 2     TD3M    2018-04-01  2015-08-28  01_Essential     
> 4     TD3M    2018-05-01  2015-08-28  01_Essential     
> 6     TD3M    2018-06-01  2015-08-28  01_Essential     
> 9     TD3M    2018-08-01  2018-07-17  04_Complete  
> 11    TD3M    2018-09-01  2018-07-17  04_Complete          

2 answers

  • answered 2019-07-18 16:11 thomask

    First of all, I drop your column 'result':

    df = df.drop(columns='result')

    Then make sure that your Month and Startdate fields are in datetime format:

    df.Month = pd.to_datetime(df.Month)
    df.Startdate = pd.to_datetime(df.Startdate)

    Then filter and group by COMP and Month, taking the max of each group:

    df = df[df.Startdate <= df.Month]
    df.groupby(['COMP', 'Month'], as_index=False).max()

  • answered 2019-07-18 16:13 WeNYoBen

    Here is one way, using sort_values and drop_duplicates:

    df.query('Startdate<=Month').sort_values('Startdate').drop_duplicates('Month',keep='last')
    Out[892]: 
        COMP      Month  Startdate        bundle result
    0   TD3M 2018-03-01 2015-08-28  01_Essential   keep
    2   TD3M 2018-04-01 2015-08-28  01_Essential   keep
    4   TD3M 2018-05-01 2015-08-28  01_Essential   keep
    6   TD3M 2018-06-01 2015-08-28  01_Essential   keep
    9   TD3M 2018-08-01 2018-07-17   04_Complete   keep
    11  TD3M 2018-09-01 2018-07-17   04_Complete   keep
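
For reference, the two approaches above can be put together in one self-contained sketch. It rebuilds the sample frame from the question by hand (without the result column); the names months, valid, latest, and alt are illustrative only. Two assumptions are worth flagging. First, rows are de-duplicated on COMP and Month together, which is a guess at the intent, since the sample contains a single company. Second, whole rows are selected rather than aggregated, because groupby(...).max() takes the maximum of each column independently and could in general combine values from different rows.

```python
import pandas as pd

# Sample data copied from the question; the 'result' column is left out.
months = ["2018-03-01", "2018-04-01", "2018-05-01",
          "2018-06-01", "2018-08-01", "2018-09-01"]
df = pd.DataFrame({
    "COMP": "TD3M",
    "Month": pd.to_datetime([m for m in months for _ in range(2)]),
    "Startdate": pd.to_datetime(["2015-08-28", "2018-07-17"] * len(months)),
    "bundle": ["01_Essential", "04_Complete"] * len(months),
})

# Rule 1: drop rows whose Startdate falls after Month.
valid = df[df["Startdate"] <= df["Month"]]

# Rule 2, variant A: sort by Startdate and keep the last row per (COMP, Month).
# Assumption: duplicates are defined per company and month, not per month alone.
latest = (
    valid.sort_values("Startdate")
         .drop_duplicates(subset=["COMP", "Month"], keep="last")
         .sort_index()
)

# Rule 2, variant B: pick the row label of the latest Startdate in each group.
alt = valid.loc[valid.groupby(["COMP", "Month"])["Startdate"].idxmax()]

print(latest.index.tolist())  # [0, 2, 4, 6, 9, 11]
```

Both variants return complete original rows, so the bundle value always belongs to the same record as the selected Startdate.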