Drop a column if it exceeds a specific number of NA values

I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:

def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column,axis=1)

Defining the function above raises no error, but calling df.apply(check) produces a ton of errors.

P.S. I know about the thresh argument in df.dropna(thresh, axis).

Any tips? Why isn't my code working?

Thanks

3 answers

  • answered 2018-07-14 06:20 jezrael

    I think the best approach here is to use dropna with the thresh parameter:

    thresh : int, optional

    Require that many non-NA values.

    So for a vectorized solution, subtract the maximum allowed number of NA values (N) from the length of the DataFrame:

    N = 2
    df = df.dropna(thresh=len(df)-N, axis=1)
    print (df)
       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b
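
The relationship between "at most N NA values" and thresh can be checked directly. The snippet below is a minimal sketch that rebuilds the sample DataFrame from this answer; the names na_counts, kept_columns and expected_columns are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": list("abcdef"),
    "B": [np.nan, np.nan, np.nan, 5, 5, np.nan],
    "C": [np.nan, 8, np.nan, np.nan, 2, 3],
    "D": [1, 3, 5, 7, 1, 0],
    "E": [5, 3, 6, 9, 2, np.nan],
    "F": list("aaabbb"),
})

N = 2  # maximum number of NA values a column may contain

# Number of NA values in each column.
na_counts = df.isnull().sum()

# thresh is a minimum count of non-NA values, so "at most N NA values"
# translates to "at least len(df) - N non-NA values".
kept_columns = df.dropna(thresh=len(df) - N, axis=1).columns.tolist()

# The same selection expressed directly in terms of NA counts.
expected_columns = na_counts[na_counts <= N].index.tolist()

print(na_counts.to_dict())
print(kept_columns)
```

Both lists should contain the same columns, which is why subtracting N from len(df) gives the right thresh.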
    

    If you want to keep your own function, I suggest calling it with DataFrame.pipe, which passes the whole DataFrame to the function, and changing df.column to df[column]. Dot notation fails with a column name held in a variable, because it tries to select a column literally named column:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'A':list('abcdef'),
                       'B':[np.nan,np.nan,np.nan,5,5,np.nan],
                       'C':[np.nan,8,np.nan,np.nan,2,3],
                       'D':[1,3,5,7,1,0],
                       'E':[5,3,6,9,2,np.nan],
                       'F':list('aaabbb')})
    
    print (df)
       A    B    C  D    E  F
    0  a  NaN  NaN  1  5.0  a
    1  b  NaN  8.0  3  3.0  a
    2  c  NaN  NaN  5  6.0  a
    3  d  5.0  NaN  7  9.0  b
    4  e  5.0  2.0  1  2.0  b
    5  f  NaN  3.0  0  NaN  b
    
    def check(df):
        for column in df:
            if df[column].isnull().sum() > 2:
                df.drop(column,axis=1, inplace=True)
        return df
    
    print (df.pipe(check))
       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b
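
To illustrate the two points above, here is a small sketch on a made-up two-column frame (the frame and variable names are assumptions for the demo). It shows that bracket indexing uses the value of the variable while attribute access looks up the literal name, and that df.apply hands the function one column (a Series) at a time rather than the whole DataFrame. The errors in the question are most likely this AttributeError, raised once apply calls the function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "B": [np.nan, np.nan, np.nan, 5],
    "D": [1, 3, 5, 7],
})

column = "B"  # a column name held in a variable

# Bracket indexing uses the value of the variable, so this selects column "B".
na_in_b = int(df[column].isnull().sum())

# Attribute access uses the literal name "column"; this frame has no such
# column, so pandas raises AttributeError.
try:
    df.column
    attribute_lookup_failed = False
except AttributeError:
    attribute_lookup_failed = True

# apply calls the function once per column and passes a Series each time.
received_types = df.apply(lambda s: type(s).__name__).tolist()

print(na_in_b, attribute_lookup_failed, received_types)
```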
    

  • answered 2018-07-14 07:47 Anton vBR

    Although jezrael's answer works, I would not recommend that approach. Instead, create a boolean mask with ~df.isnull().sum().gt(2) and apply it with .loc[:, m] to select the columns to keep.

    Full example:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({
        'A':list('abcdef'),
        'B':[np.nan,np.nan,np.nan,5,5,np.nan],
        'C':[np.nan,8,np.nan,np.nan,2,3],
        'D':[1,3,5,7,1,0],
        'E':[5,3,6,9,2,np.nan],
        'F':list('aaabbb')
    })
    
    m = ~df.isnull().sum().gt(2)
    df = df.loc[:,m]
    
    print(df)
    

    Returns:

       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b
    

    Explanation

    Assume we print the columns and the mask before applying it.

    print(df.columns.tolist())
    print(m.tolist())
    

    It would return this:

    ['A', 'B', 'C', 'D', 'E', 'F']
    [True, False, False, True, True, True]
    

    Columns B and C are unwanted (False). They are removed when the mask is applied.
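
If the threshold needs to vary, the mask can be wrapped in a small function. This is only a sketch: drop_sparse_columns and max_na are hypothetical names, not part of pandas. It returns a new DataFrame and leaves the input untouched.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_na):
    """Return a copy of frame without the columns that hold more than max_na NA values."""
    # Boolean Series indexed by column name: True means "keep this column".
    keep = frame.isnull().sum().le(max_na)
    return frame.loc[:, keep].copy()


df = pd.DataFrame({
    "A": list("abcdef"),
    "B": [np.nan, np.nan, np.nan, 5, 5, np.nan],
    "C": [np.nan, 8, np.nan, np.nan, 2, 3],
    "D": [1, 3, 5, 7, 1, 0],
    "E": [5, 3, 6, 9, 2, np.nan],
    "F": list("aaabbb"),
})

result = drop_sparse_columns(df, max_na=2)
print(result.columns.tolist())
print(df.columns.tolist())  # the original frame still has all six columns
```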

  • answered 2018-07-14 09:15 Zero

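    Alternatively, you can use count, which counts the non-null values in each column. Keep the columns whose count is at least len(df.index) - 2, that is, the columns with at most 2 NA values:

    In [23]: df.loc[:, df.count().ge(len(df.index) - 2)]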
    Out[23]:
       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b
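
Since count is just the row count minus the NA count, the comparison at the boundary decides what happens to a column with exactly N NA values. The sketch below uses a made-up frame (the column names full, two_na and three_na are assumptions for the demo) to show the difference between ge and gt.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "full": [1, 2, 3, 4, 5, 6],
    "two_na": [np.nan, np.nan, 3, 4, 5, 6],
    "three_na": [np.nan, np.nan, np.nan, 4, 5, 6],
})

N = 2  # maximum number of NA values a column may contain

# count() returns the number of non-NA values per column.
non_na = df.count()
matches_isnull = non_na.equals(len(df) - df.isnull().sum())

# ge keeps a column with exactly N NA values; gt drops it.
kept_with_ge = df.loc[:, non_na.ge(len(df) - N)].columns.tolist()
kept_with_gt = df.loc[:, non_na.gt(len(df) - N)].columns.tolist()

print(matches_isnull)
print(kept_with_ge)
print(kept_with_gt)
```

To match the condition in the question (drop only when the NA count is greater than 2), ge is the comparison that keeps the two_na column.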