Remove duplicates based on date conditions
I have a dataframe like the one below, and I would like to remove duplicates based on two criteria. 1) If Startdate is later than Month, the row is removed. 2) Of the rows whose Startdate is earlier than Month, only the latest record for each Month is kept. The result column shows what should happen to each row.
> COMP Month Startdate bundle result
> 0 TD3M 2018-03-01 2015-08-28 01_Essential keep
> 1 TD3M 2018-03-01 2018-07-17 04_Complete remove
> 2 TD3M 2018-04-01 2015-08-28 01_Essential keep
> 3 TD3M 2018-04-01 2018-07-17 04_Complete remove
> 4 TD3M 2018-05-01 2015-08-28 01_Essential keep
> 5 TD3M 2018-05-01 2018-07-17 04_Complete remove
> 6 TD3M 2018-06-01 2015-08-28 01_Essential keep
> 7 TD3M 2018-06-01 2018-07-17 04_Complete remove
> 8 TD3M 2018-08-01 2015-08-28 01_Essential remove
> 9 TD3M 2018-08-01 2018-07-17 04_Complete keep
> 10 TD3M 2018-09-01 2015-08-28 01_Essential remove
> 11 TD3M 2018-09-01 2018-07-17 04_Complete keep
The expected output would be:
> COMP Month Startdate bundle
> 0 TD3M 2018-03-01 2015-08-28 01_Essential
> 2 TD3M 2018-04-01 2015-08-28 01_Essential
> 4 TD3M 2018-05-01 2015-08-28 01_Essential
> 6 TD3M 2018-06-01 2015-08-28 01_Essential
> 9 TD3M 2018-08-01 2018-07-17 04_Complete
> 11 TD3M 2018-09-01 2018-07-17 04_Complete
2 answers
-
answered 2019-07-18 16:11
thomask
First of all, drop your 'result' column:
df = df.drop(columns='result')
Next, make sure that your Month and Startdate columns are in datetime format:
df.Month = pd.to_datetime(df.Month)
df.Startdate = pd.to_datetime(df.Startdate)
Then filter and group by COMP and Month, aggregating with max:
df = df[df.Startdate <= df.Month]
df.groupby(['COMP', 'Month'], as_index=False).max()
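One caveat worth noting: groupby(...).max() takes the maximum of each column independently, so Startdate and bundle in a result row are only guaranteed to come from the same original row when their orderings happen to agree (as they do in this sample). A minimal sketch of a row-preserving variant using idxmax follows; the four-row frame and the names valid and latest are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Small subset of the question's data (two months, two candidate rows each).
df = pd.DataFrame({
    "COMP": ["TD3M"] * 4,
    "Month": ["2018-03-01", "2018-03-01", "2018-08-01", "2018-08-01"],
    "Startdate": ["2015-08-28", "2018-07-17", "2015-08-28", "2018-07-17"],
    "bundle": ["01_Essential", "04_Complete", "01_Essential", "04_Complete"],
})
df["Month"] = pd.to_datetime(df["Month"])
df["Startdate"] = pd.to_datetime(df["Startdate"])

# Keep only rows whose Startdate is on or before Month.
valid = df[df["Startdate"] <= df["Month"]]

# idxmax gives the index label of the latest Startdate in each group,
# so .loc selects whole original rows instead of per-column maxima.
latest = valid.loc[valid.groupby(["COMP", "Month"])["Startdate"].idxmax()]
print(latest)
```

Because whole rows are selected, the original index labels are kept as well, which makes it easy to compare against the expected output in the question.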
-
answered 2019-07-18 16:13
YO and BEN_W
Here is one way, using sort_values and drop_duplicates:
df.query('Startdate<=Month').sort_values('Startdate').drop_duplicates('Month',keep='last')
Out[892]:
> COMP Month Startdate bundle result
> 0 TD3M 2018-03-01 2015-08-28 01_Essential keep
> 2 TD3M 2018-04-01 2015-08-28 01_Essential keep
> 4 TD3M 2018-05-01 2015-08-28 01_Essential keep
> 6 TD3M 2018-06-01 2015-08-28 01_Essential keep
> 9 TD3M 2018-08-01 2018-07-17 04_Complete keep
> 11 TD3M 2018-09-01 2018-07-17 04_Complete keep
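For anyone who wants to try this end to end, here is a self-contained sketch that rebuilds the question's sample frame and applies the same sort_values / drop_duplicates idea. Two things are my own assumptions rather than part of the answer above: deduplicating on both COMP and Month (in case the real data holds more than one COMP), and the final sort_index() to restore the original row order.

```python
import pandas as pd

# Rebuild the sample data from the question (without the 'result' column).
months = ["2018-03-01", "2018-04-01", "2018-05-01",
          "2018-06-01", "2018-08-01", "2018-09-01"]
df = pd.DataFrame({
    "COMP": "TD3M",
    "Month": pd.to_datetime([m for m in months for _ in range(2)]),
    "Startdate": pd.to_datetime(["2015-08-28", "2018-07-17"] * len(months)),
    "bundle": ["01_Essential", "04_Complete"] * len(months),
})

result = (
    df[df["Startdate"] <= df["Month"]]       # drop rows that start after Month
    .sort_values("Startdate")                # oldest first, latest last
    .drop_duplicates(subset=["COMP", "Month"], keep="last")  # latest per group
    .sort_index()                            # back to the original row order
)
print(result)
```

Boolean indexing is used here instead of query, but the two filters should be equivalent once both columns are datetimes.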