How do I implement rank function for nearest values for a column in dataframe?

df.head():

                      run_time match_datetime         country                league             home_team                  away_team
0   2021-08-07 00:04:36.326391     2021-08-06          Russia       FNL 2 - Group 2             Yenisey 2          Lokomotiv-Kazanka
1   2021-08-07 00:04:36.326391     2021-08-07          Russia          Youth League              Ural U19  Krylya Sovetov Samara U19
2   2021-08-07 00:04:36.326391     2021-08-08           World         Club Friendly                Alaves                    Al Nasr
3   2021-08-07 00:04:36.326391     2021-08-09           China            Jia League     Chengdu Rongcheng          Shenyang Urban FC
4   2021-08-06 00:04:36.326391     2021-08-06           China          Super League              Wuhan FC       Tianjin Jinmen Tiger
5   2021-08-06 00:04:36.326391     2021-08-07  Czech Republic            U19 League     Sigma Olomouc U19                Karvina U19
6   2021-08-06 00:04:36.326391     2021-08-08          Russia          Youth League  Konoplev Academy U19            Rubin Kazan U19
7   2021-08-06 00:04:36.326391     2021-08-09           World         Club Friendly         Real Sociedad                      Eibar

desired df

                      run_time match_datetime         country                league             home_team                  away_team
0   2021-08-07 00:04:36.326391     2021-08-06          Russia       FNL 2 - Group 2             Yenisey 2          Lokomotiv-Kazanka
1   2021-08-07 00:04:36.326391     2021-08-07          Russia          Youth League              Ural U19  Krylya Sovetov Samara U19
4   2021-08-06 00:04:36.326391     2021-08-06           China          Super League              Wuhan FC       Tianjin Jinmen Tiger
5   2021-08-06 00:04:36.326391     2021-08-07  Czech Republic            U19 League     Sigma Olomouc U19                Karvina U19

How do i use rank function to filter only the 2 nearest match_datetime dates for every run_time value. i.e. desired dataframe will be a filtered dataframe that will have all the nearest 2 match_datetime values for every run_time

2 answers

  • answered 2022-05-07 04:37 Corralien

    Update

    Using rank instead of head:

    diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])).abs()
    out = df.loc[diff.groupby(df['run_time']).rank(method='dense') <= 2]
    

    Output:

    >>> out
                         run_time match_datetime         country         league          home_team                  away_team
    1  2021-08-07 00:04:36.326391     2021-08-07          Russia   Youth League           Ural U19  Krylya Sovetov Samara U19
    2  2021-08-07 00:04:36.326391     2021-08-08           World  Club Friendly             Alaves                    Al Nasr
    4  2021-08-06 00:04:36.326391     2021-08-06           China   Super League           Wuhan FC       Tianjin Jinmen Tiger
    5  2021-08-06 00:04:36.326391     2021-08-07  Czech Republic     U19 League  Sigma Olomouc U19                Karvina U19
    

    Alternative

    You can use:

    diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])) \
                                  .abs().sort_values()
    out = df.loc[diff.groupby(df['run_time']).head(2).index].sort_index()
    

  • answered 2022-05-07 04:48 max

    I am somehow afraid that the pandas.DataFrame.rank method can't do this. But pandas.DataFrame.groupby can do this, if you use pandas.DataFrame.head with it.

    Assuming you have the following pandas.DataFrame:

    import pandas as pd
    import numpy as np
    
    np.random.seed(42)
    df = pd.DataFrame(np.array([np.random.randint(0, 3, 10), np.random.rand(10)]).transpose(), columns=['a', 'b'])
    

    And that you want to keep max_num_per_example = 2 representatives of each unique values in the column df['a']:

    max_num_per_example = 2
    df.groupby(['a']).head(max_num_per_example)
    

    yields

    a b
    0 2.0 0.058084
    1 0.0 0.866176
    2 2.0 0.601115
    4 0.0 0.020584
    7 1.0 0.212339

    This is the same as you would get if you to the naive approach:

    max_idx_per_example = 2
    idx_to_keep = []
    for el_uq in df['a'].unique():
        lg = el_uq == df['a']
        for i, idx in enumerate(lg[lg].index):
            if i < max_idx_per_example:
                idx_to_keep.append(idx)
            else:
                break
    df_new = df.iloc[idx_to_keep]
    

    Which underlines the power of pandas =)

How many English words
do you know?
Test your English vocabulary size, and measure
how many words do you know
Online Test
Powered by Examplum