How do I implement rank function for nearest values for a column in dataframe?
df.head():
run_time match_datetime country league home_team away_team
0 2021-08-07 00:04:36.326391 2021-08-06 Russia FNL 2 - Group 2 Yenisey 2 Lokomotiv-Kazanka
1 2021-08-07 00:04:36.326391 2021-08-07 Russia Youth League Ural U19 Krylya Sovetov Samara U19
2 2021-08-07 00:04:36.326391 2021-08-08 World Club Friendly Alaves Al Nasr
3 2021-08-07 00:04:36.326391 2021-08-09 China Jia League Chengdu Rongcheng Shenyang Urban FC
4 2021-08-06 00:04:36.326391 2021-08-06 China Super League Wuhan FC Tianjin Jinmen Tiger
5 2021-08-06 00:04:36.326391 2021-08-07 Czech Republic U19 League Sigma Olomouc U19 Karvina U19
6 2021-08-06 00:04:36.326391 2021-08-08 Russia Youth League Konoplev Academy U19 Rubin Kazan U19
7 2021-08-06 00:04:36.326391 2021-08-09 World Club Friendly Real Sociedad Eibar
desired df
run_time match_datetime country league home_team away_team
0 2021-08-07 00:04:36.326391 2021-08-06 Russia FNL 2 - Group 2 Yenisey 2 Lokomotiv-Kazanka
1 2021-08-07 00:04:36.326391 2021-08-07 Russia Youth League Ural U19 Krylya Sovetov Samara U19
4 2021-08-06 00:04:36.326391 2021-08-06 China Super League Wuhan FC Tianjin Jinmen Tiger
5 2021-08-06 00:04:36.326391 2021-08-07 Czech Republic U19 League Sigma Olomouc U19 Karvina U19
How do I use the rank function to keep only the 2 nearest match_datetime dates for every run_time value?
I.e. the desired dataframe is a filtered dataframe that contains, for every run_time value, the 2 rows whose match_datetime is nearest to it.
2 answers
-
answered 2022-05-07 04:37
Corralien
Update
Using rank instead of head:
    diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])).abs()
    out = df.loc[diff.groupby(df['run_time']).rank(method='dense') <= 2]
Output:
>>> out
run_time match_datetime country league home_team away_team
1 2021-08-07 00:04:36.326391 2021-08-07 Russia Youth League Ural U19 Krylya Sovetov Samara U19
2 2021-08-07 00:04:36.326391 2021-08-08 World Club Friendly Alaves Al Nasr
4 2021-08-06 00:04:36.326391 2021-08-06 China Super League Wuhan FC Tianjin Jinmen Tiger
5 2021-08-06 00:04:36.326391 2021-08-07 Czech Republic U19 League Sigma Olomouc U19 Karvina U19
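For anyone who wants to reproduce this without the full table, here is a minimal sketch that keeps only the two datetime columns from the question. The conversion to float seconds via `.dt.total_seconds()` is my own addition so that `rank` works on plain numbers; it is not part of the snippet above.

```python
import pandas as pd

# Reduced copy of the question's data: only the two datetime columns.
df = pd.DataFrame({
    "run_time": ["2021-08-07 00:04:36.326391"] * 4
                + ["2021-08-06 00:04:36.326391"] * 4,
    "match_datetime": ["2021-08-06", "2021-08-07", "2021-08-08", "2021-08-09"] * 2,
})

# Absolute distance between run_time and match_datetime, in seconds
# (assumption: float seconds instead of the raw timedelta).
run = pd.to_datetime(df["run_time"])
match = pd.to_datetime(df["match_datetime"])
diff = (run - match).abs().dt.total_seconds()

# Rank the distances within each run_time group; 1 = nearest.
rank = diff.groupby(df["run_time"]).rank(method="dense")

# Keep the rows whose distance is among the 2 smallest distinct values.
out = df.loc[rank <= 2]
print(out.index.tolist())  # [1, 2, 4, 5]
```

Note that the distance is measured on the full timestamp, so for run_time 2021-08-07 00:04:36 the match on 2021-08-08 (about 23h55m away) is nearer than the match on 2021-08-06 (about 24h05m away).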
Alternative
You can use:
    diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])) \
             .abs().sort_values()
    out = df.loc[diff.groupby(df['run_time']).head(2).index].sort_index()
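The two variants can differ when distances tie. A small sketch with made-up data (one run_time exactly at midnight, so the day before and the day after are equally far away): `rank(method='dense')` keeps every row that shares one of the 2 smallest distances, while `head(2)` always keeps at most 2 rows per group. The `kind="stable"` argument and the conversion to seconds are my own choices to make the result deterministic.

```python
import pandas as pd

# Hypothetical data with a tie: 08-06 and 08-08 are both exactly 1 day away.
df = pd.DataFrame({
    "run_time": ["2021-08-07"] * 3,
    "match_datetime": ["2021-08-06", "2021-08-07", "2021-08-08"],
})

run = pd.to_datetime(df["run_time"])
match = pd.to_datetime(df["match_datetime"])
diff = (run - match).abs().dt.total_seconds()

# Dense rank: the two tied rows both get rank 2, so all 3 rows pass the filter.
by_rank = df.loc[diff.groupby(df["run_time"]).rank(method="dense") <= 2]

# Sort + head(2): exactly 2 rows per group, ties broken by sort order.
nearest = diff.sort_values(kind="stable")
by_head = df.loc[nearest.groupby(df["run_time"]).head(2).index].sort_index()

print(len(by_rank), len(by_head))  # 3 2
```

If you need a hard limit of 2 rows per run_time, the `head(2)` form (or `rank(method='first')`) is the safer choice; if you want all equally near matches, `method='dense'` does that.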
-
answered 2022-05-07 04:48
max
I am somehow afraid that the pandas.DataFrame.rank method can't do this. But pandas.DataFrame.groupby can do this, if you use pandas.DataFrame.head with it.
Assuming you have the following pandas.DataFrame:
    import pandas as pd
    import numpy as np
    np.random.seed(42)
    df = pd.DataFrame(np.array([np.random.randint(0, 3, 10), np.random.rand(10)]).transpose(), columns=['a', 'b'])
And that you want to keep max_num_per_example = 2 representatives of each unique value in the column df['a']:
    max_num_per_example = 2
    df.groupby(['a']).head(max_num_per_example)
yields
     a         b
0  2.0  0.058084
1  0.0  0.866176
2  2.0  0.601115
4  0.0  0.020584
7  1.0  0.212339
This is the same as you would get if you took the naive approach:
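    max_idx_per_example = 2
    idx_to_keep = []
    for el_uq in df['a'].unique():
        lg = el_uq == df['a']  # boolean mask of the rows holding this value
        for i, idx in enumerate(lg[lg].index):
            if i < max_idx_per_example:
                idx_to_keep.append(idx)
            else:
                break
    # idx_to_keep holds index labels, so select with .loc;
    # sorting restores the original row order, as groupby().head() does
    df_new = df.loc[sorted(idx_to_keep)]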
Which underlines the power of pandas =)
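One caveat when applying this to the question: `groupby(...).head(n)` keeps the first n rows of each group in whatever order the frame currently has, so to get the n smallest (nearest) values you would sort first. A small sketch with made-up numbers (the values in `b` are hypothetical and stand in for a distance):

```python
import pandas as pd

# Hypothetical data: 'a' is the group key, 'b' plays the role of a distance.
df = pd.DataFrame({
    "a": [0, 0, 0, 1, 1, 1],
    "b": [0.9, 0.1, 0.5, 0.3, 0.8, 0.2],
})

# Without sorting: simply the first 2 rows of each group.
first_two = df.groupby("a").head(2)

# Sorting by 'b' first: the 2 smallest 'b' values of each group.
smallest_two = df.sort_values("b").groupby("a").head(2).sort_index()

print(first_two.index.tolist())     # [0, 1, 3, 4]
print(smallest_two.index.tolist())  # [1, 2, 3, 5]
```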