Complex mask for dataframe

I have a dataframe with a time series in one single column. The data looks like this chart

input

I would like to create a mask that is TRUE whenever the data is equal to or lower than -0.20. It should also be TRUE before reaching -0.20 while the data is negative, and TRUE after reaching -0.20 while the data is still negative. This version of the chart

output

is my manual attempt to show (in red) the values where the mask would be TRUE. I started creating the mask, but I could only make it TRUE while the data is less than -0.20: mask = (df['data'] < -0.2). I couldn't do any better. Does anybody know how to achieve my goal?

2 answers

  • answered 2022-01-24 17:59 Benjamin Rio

    Idea

    Group by consecutive values of the same sign, and then check whether the minimum of each such group is less than the defined threshold.

    Implementation

    First, we want to separate negative from positive values.

    negative_mask = (df['data']<0)

    We then can create classes (ordered with integers) for each consecutive positive or negative series. The class increases by one each time the data changes sign.

    consecutives = negative_mask.diff().ne(0).cumsum()
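
    To make the labelling step concrete, here is a minimal sketch on a made-up six-value series. It compares each value of the mask with its predecessor via shift() rather than diff(), which is a swapped-in but equivalent way of flagging sign changes; the names toy and run_id are illustrative only.

```python
import pandas as pd

# Toy series (made-up values) whose sign changes at positions 2, 4 and 5.
toy = pd.Series([0.1, 0.3, -0.1, -0.25, 0.2, -0.05])

negative_mask = toy < 0

# True wherever the sign differs from the previous row (the first row
# always counts as the start of a run); cumsum turns that into run labels.
run_id = (negative_mask != negative_mask.shift()).cumsum()

print(run_id.tolist())  # [1, 1, 2, 2, 3, 4]
```

    Rows that share a label belong to the same run of consecutive positive or negative values.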

    We then select only the data where the minimum of the group of consecutive elements is less than -0.2.

    df.groupby(consecutives).filter(lambda g: g['data'].min() < -0.2)
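
    Since the question asks for a boolean mask rather than a filtered frame, one possible variation is to broadcast each run's minimum back onto its rows with transform. This is only a sketch on made-up values; it uses "<=" to follow the question's "equal or lower" wording, and run_min is an illustrative name.

```python
import pandas as pd

# Made-up values: one negative run that dips below -0.2, one that does not.
df = pd.DataFrame({"data": [0.1, -0.05, -0.3, -0.1, 0.2, -0.1, -0.15, 0.05]})

negative_mask = df["data"] < 0
consecutives = (negative_mask != negative_mask.shift()).cumsum()

# Minimum of each run, repeated on every row of that run.
run_min = df["data"].groupby(consecutives).transform("min")

# Positive runs have a positive minimum, so they can never satisfy this.
mask = run_min <= -0.2

print(mask.tolist())
# [False, True, True, True, False, False, False, False]
```

    The resulting mask has the same index as df, so df[mask] selects the same kind of rows as the filter call, and ~mask selects everything else.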

    Example with random data

    We can try our example with random data:

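    import numpy as np
    import pandas as pd

    np.random.seed(42)
    data = np.random.randint(-300, 300, size=1000)/1000
    df = pd.DataFrame(data, columns=["data"])

    # apply the steps described above
    negative_mask = (df['data'] < 0)
    consecutives = negative_mask.diff().ne(0).cumsum()
    df.groupby(consecutives).filter(lambda g: g['data'].min() < -0.2)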
    

    Output

        data
    2   -0.030
    3   -0.194
    4   -0.229
    5   -0.280
    6   -0.179
    ... ...
    991 -0.293
    995 -0.247
    996 -0.062
    997 -0.072
    999 -0.250
    
    363 rows × 1 columns
    

  • answered 2022-01-24 18:05 tlgs

    One approach could be to group segments that are entirely below zero, and then for each group verify whether or not there are any values below -0.2.


    See below for a full reproducible example script:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    
    
    np.random.seed(167)
    
    df = pd.DataFrame(
        {"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(10 ** 5)])}
    )
    plt.plot(df)
    
    # True where the series is below zero
    lt_zero = df["y"] < 0
    # label each run of consecutive same-sign values
    regions = (lt_zero != lt_zero.shift()).cumsum()
    
    # here's your interesting DataFrame with the specified mask
    df_interesting = df.groupby(regions).filter(lambda g: g["y"].min() < -0.2)
    
    # plot individual regions
    for i, grp in df.groupby(regions):
        if grp["y"].min() < -0.2:
            plt.plot(grp, color="tab:red", linewidth=5, alpha=0.6)
    
    plt.axhline(0, linestyle="--", color="tab:gray")
    plt.axhline(-0.2, linestyle="--", color="tab:gray")
    plt.show()
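
    If a boolean mask aligned with the original frame is needed instead of the filtered DataFrame, one option is to test which index labels survived the filter. The following is a sketch under assumptions: it uses a much smaller made-up random walk than the script above, and below_zero and mask are illustrative names.

```python
import numpy as np
import pandas as pd

# Small made-up random walk standing in for the larger one in the script.
rng = np.random.default_rng(0)
df = pd.DataFrame({"y": np.cumsum(rng.uniform(-0.05, 0.05, size=500))})

below_zero = df["y"] < 0
regions = (below_zero != below_zero.shift()).cumsum()
df_interesting = df.groupby(regions).filter(lambda g: g["y"].min() < -0.2)

# Rows kept by the filter are exactly the rows where the mask is True.
mask = pd.Series(df.index.isin(df_interesting.index), index=df.index)

print(mask.sum(), "of", len(mask), "rows selected")
```

    This relies on the index labels being unique, which holds for the default RangeIndex used here.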
    
