Filter columns by values in a row in Pandas
I have obtained the statistics for my dataframe by df.describe() in Pandas.
statistics = df.describe()
I want to filter the statistics dataframe based on count:
main Meas1 Meas2 Meas3 Meas4 Meas5
sublvl Value Value Value Value Value
count 7.000000 1.0 1.0 582.00 97.000000
mean 30 37.0 26.0 33.03 16.635350
I want to keep only the columns with count > 30 in a new dataframe and drop the rest (or get a list of all main labels that have count > 30).
For the above example, I want:
main Meas4 Meas5
sublvl Value Value
count 582.00 97.000000
mean 33.03 16.635350
and [Meas4, Meas5]
I have tried
thresh = statistics.columns[statistics['count']>30]
And variations thereof.
Thank you!
2 answers

import pandas as pd
df = pd.DataFrame.from_dict({'name': [1, 2, 3, 4, 5], 'val': [1, None, None, None, None]})
df
   name  val
0     1  1.0
1     2  NaN
2     3  NaN
3     4  NaN
4     5  NaN
If you want to use describe(), note that it does not report every column: by default only columns with numerical data types are returned. You can select the columns this way:
statistics = df.describe()
# to describe all columns you can do this
statistics = df.describe(include='all')
[column for column in statistics.columns if statistics.loc['count'][column] > 3]
# output ['name']
As discussed in the comments, your columns are a MultiIndex, so to pick out only the first level of each column label you can do this:
# this code won't work correctly for non-MultiIndex dataframes.
# [column[0] for column in statistics.columns if statistics.loc['count'][column] > 3]
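Alternatively, without describe(): for each column, check whether its count is greater than the threshold and add it to the chosen_columns list:
chosen_columns = []
for column in df.columns:
    # note: len(value_counts()) is the number of distinct non-null values,
    # which equals the non-null count only when the values are all unique
    if len(df[column].value_counts()) > 3:
        chosen_columns.append(column)
# chosen_columns output: ['name']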
OR:
chosen_columns = []
for column in df.columns:
    if df[column].count() > 3:
        chosen_columns.append(column)
# chosen_columns output: ['name']
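The loop above can also be written without explicit iteration. This is a minimal sketch that assumes the same sample df as earlier in this answer; it relies on DataFrame.count() returning the non-null count per column as a Series indexed by column name.

```python
import pandas as pd

# Same sample data as earlier in this answer
df = pd.DataFrame({'name': [1, 2, 3, 4, 5],
                   'val': [1, None, None, None, None]})

# Non-null count per column, as a Series indexed by column name
counts = df.count()

# Keep the labels of the columns whose count exceeds the threshold
chosen_columns = counts[counts > 3].index.tolist()
print(chosen_columns)
```

For this sample, only 'name' has more than 3 non-null values, so it is the only label kept.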

As a direct solution for your dataset, you can build a boolean mask (df here is the output of describe()) using
df.loc['count'] > 30
and then use the resultant values to index again:
In [1066]: df.loc[:, (df.loc['count'] > 30).values]
Out[1066]:
main    Meas4     Meas5
sublvl  Value     Value
count  582.00  97.00000
mean    33.03  16.63535
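To make this reproducible end to end, here is a sketch that rebuilds the statistics table from the question by hand (the values are copied from the question; constructing it with MultiIndex.from_tuples is my assumption about how the columns are laid out) and also extracts the list of main labels the question asked for.

```python
import pandas as pd

# Rebuild the question's describe() output with two-level column labels.
# The level names 'main' and 'sublvl' are taken from the printed table.
columns = pd.MultiIndex.from_tuples(
    [(f"Meas{i}", "Value") for i in range(1, 6)],
    names=["main", "sublvl"],
)
statistics = pd.DataFrame(
    [[7.0, 1.0, 1.0, 582.0, 97.0],
     [30.0, 37.0, 26.0, 33.03, 16.63535]],
    index=["count", "mean"],
    columns=columns,
)

# Boolean mask over the columns: True where count > 30
mask = statistics.loc["count"] > 30

# Keep only the columns where the mask is True
filtered = statistics.loc[:, mask.to_numpy()]

# First-level ('main') labels of the surviving columns
main_names = filtered.columns.get_level_values("main").tolist()

print(filtered)
print(main_names)
```

Using get_level_values("main") avoids indexing each column tuple by position, so the same line keeps working if the order of the column levels changes.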