Why does set() of a pandas DataFrame return the DataFrame's column names?

I was just tinkering around and found this amusing:

>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> x = set(df)
>>> x
{'col2', 'col1'}

Why does pandas return column names as set values?

2 answers

  • answered 2018-10-11 20:16 jpp

    Because that's how __iter__ is defined in the source code for NDFrame, of which pd.DataFrame is a child:

    def __iter__(self):
        """Iterate over info axis"""
        return iter(self._info_axis)

    pd.DataFrame._info_axis is used internally to store column labels:

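    import pandas as pd

    # An empty frame with four column labels
    df = pd.DataFrame(columns=list('abcd'))

    # _info_axis is a private attribute; for a DataFrame it is the columns Index
    df._info_axis  # Index(['a', 'b', 'c', 'd'], dtype='object')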

    set iterates over the pd.DataFrame instance via __iter__, hashes each element it receives, and returns a set of the unique column labels. The cell values are never visited.
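
    The behavior described above can be checked with a minimal sketch. It assumes a reasonably recent pandas (to_numpy needs 0.24 or later), and the variable names are only illustrative. It also shows one possible way to collect the cell values instead of the labels:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Iterating a DataFrame yields its column labels, much like
# iterating a dict yields its keys.
labels_from_iter = list(df)
labels_from_columns = list(df.columns)

# set() consumes the same iterator, so it ends up holding the labels.
label_set = set(df)

# To collect the cell values instead, flatten the underlying array first.
value_set = set(df.to_numpy().ravel().tolist())

print(labels_from_iter == labels_from_columns)
print(label_set)
print(value_set)
```

    Since sets are unordered, the labels may print in any order, which is why the question shows 'col2' before 'col1'.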

  • answered 2018-10-11 20:17 user3483203

    You can find the implementation for __iter__ in DataFrame's parent class NDFrame:

    def __iter__(self):
        """Iterate over info axis"""
        return iter(self._info_axis)

    It's essentially the same as calling keys on a DataFrame, which is defined in the same location. I'm including it here because its docstring is more helpful and describes how _info_axis differs between Series, DataFrame and Panel:

    def keys(self):
        """Get the 'info axis' (see Indexing for more)
        This is index for Series, columns for DataFrame and major_axis for
        Panel.
        """
        return self._info_axis
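
    As a rough check of the equivalence claimed above, the sketch below compares keys() with iteration on a small DataFrame. The Series part is an addition not covered by the quoted source: as far as I know, Series defines its own iteration over values, so the equivalence with keys() holds for DataFrame but not for Series.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# For a DataFrame, keys() returns the columns Index ...
keys = df.keys()
same_as_columns = keys.equals(df.columns)

# ... and iterating the frame walks over those same labels.
iter_matches_keys = list(df) == list(keys)

# A Series behaves differently: keys() gives the index,
# while iteration yields the values.
s = pd.Series([10, 20], index=['a', 'b'])
series_keys = list(s.keys())
series_items = list(s)

print(same_as_columns, iter_matches_keys)
print(series_keys, series_items)
```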