Filter a pandas data frame to rows that contain multiple specific values

I'm trying to filter a pandas data frame down to the rows that contain two specific values in any column. In the sample data below, I want to select the rows that contain both 'apple' AND 'grape':

  a     b      c
0 apple orange grape
1 grape apple  banana
2 pear  kiwi   apple

resulting in a filtered data frame that shows:

  a     b      c
0 apple orange grape
1 grape apple  banana

Using the code below, I can select all the rows that have one specific value:

df[(df == 'orange').any(axis=1)]

The result returned, as expected, was:

  a     b      c
0 apple orange grape

Using the following line of code, I expected to select the rows that had both values somewhere in the row, but this returned all the rows that had either apple OR grape as a column value:

df[np.isin(df, ['apple', 'grape']).any(axis=1)]

Since any(axis=1) is true as soon as a single cell matches either value, that clearly isn't the right way to accomplish this. How do I select only the rows that contain both values, in any column?

3 answers

  • answered 2021-05-17 04:32 Henry Ecker

    One option is to count the number of Trues that np.isin produces in each row (sum on axis=1), then check whether that count is greater than or equal to the number of values being checked:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({
        'a': {0: 'apple', 1: 'grape', 2: 'pear'},
        'b': {0: 'orange', 1: 'apple', 2: 'kiwi'},
        'c': {0: 'grape', 1: 'banana', 2: 'apple'}
    })
    
    vals = ['apple', 'grape']
    
    filtered = df[np.isin(df, vals).sum(axis=1) >= len(vals)]
    
    print(filtered)
    

    Another option is to turn the values into a set and apply its issubset method to each row (axis=1):

    filtered = df[df.apply(set(vals).issubset, axis=1)]
    

    Both give:

           a       b       c
    0  apple  orange   grape
    1  grape   apple  banana
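
    The two options agree on the sample data, but they may differ when a value repeats within a row, because the first one counts matches rather than distinct values. Below is a minimal sketch of that difference; the frame df_dup and its third row are hypothetical and not part of the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 contains 'apple' twice and no 'grape'.
df_dup = pd.DataFrame({
    'a': ['apple', 'grape', 'apple'],
    'b': ['orange', 'apple', 'apple'],
    'c': ['grape', 'banana', 'kiwi'],
})

vals = ['apple', 'grape']

# Counting matches: row 2 has two hits (both 'apple'), so it passes the check.
by_count = df_dup[np.isin(df_dup, vals).sum(axis=1) >= len(vals)]

# Set membership: row 2 is rejected because 'grape' is absent.
by_set = df_dup[df_dup.apply(set(vals).issubset, axis=1)]

print(by_count.index.tolist())
print(by_set.index.tolist())
```

    If repeated values within a row are possible in your data, the set-based check is probably the safer of the two.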
    

  • answered 2021-05-17 04:37 Anurag Dabas

    Another way is to create a boolean mask:

    mask = df.isin(['apple', 'grape']).sum(axis=1).eq(2)
    

    Finally:

    result = df[mask]
    

    Output of result:

        a       b       c
    0   apple   orange  grape
    1   grape   apple   banana
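
    To see why the count works where any(axis=1) did not, it can help to look at the intermediate steps. A small sketch on the question's sample data; the names hits, any_hit and hit_count are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['apple', 'grape', 'pear'],
    'b': ['orange', 'apple', 'kiwi'],
    'c': ['grape', 'banana', 'apple'],
})

# Cell-wise membership test: True wherever a cell is 'apple' or 'grape'.
hits = df.isin(['apple', 'grape'])

# any(axis=1) is True if at least one cell matched, which is the OR behaviour.
any_hit = hits.any(axis=1)

# sum(axis=1) counts the matching cells per row.
hit_count = hits.sum(axis=1)

# Keep only rows where exactly two cells matched.
mask = hit_count.eq(2)

print(any_hit.tolist())
print(hit_count.tolist())
print(mask.tolist())
```

    Note that eq(2) assumes each value appears at most once per row, as it does in the sample data.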
    

  • answered 2021-05-17 05:12 RavinderSingh13

    With your shown samples, you could try boolean masking with the pandas .any method, building one mask per value and combining them:

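    # Pass axis by keyword; positional any(1) fails on recent pandas versions.
    m1 = (df == 'apple').any(axis=1)   # rows containing 'apple' in any column
    m2 = (df == 'grape').any(axis=1)   # rows containing 'grape' in any column
    df[m1 & m2]                        # rows containing both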
    

    Output will be as follows:

        a       b       c
    0   apple   orange  grape
    1   grape   apple   banana
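
    The same idea can be extended to an arbitrary list of values by building one mask per value and AND-ing them together. This is only a sketch; the names vals, masks and combined are assumptions, not part of the answer above:

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame({
    'a': ['apple', 'grape', 'pear'],
    'b': ['orange', 'apple', 'kiwi'],
    'c': ['grape', 'banana', 'apple'],
})

vals = ['apple', 'grape']

# One boolean Series per value: does the row contain it in any column?
masks = [(df == v).any(axis=1) for v in vals]

# AND all masks together, so a row must contain every value.
combined = reduce(operator.and_, masks)

result = df[combined]
print(result)
```

    Because each value gets its own mask, a value repeated within a row should not affect the result.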