Why does pandas.DataFrame.mean() work but pandas.DataFrame.std() does not on the same data?

I'm trying to figure out why pandas.DataFrame.mean() works over an ndarray of ndarrays, but pandas.DataFrame.std() does not work over the same data. The following is a minimal example.

import numpy as np
import pandas as pd

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
df = pd.DataFrame({"numpy": [x, y]})  # each cell holds a whole ndarray

df["numpy"].mean() #works as expected
Out[231]: array([ 2.5,  3.5,  4.5])

df["numpy"].std() #does not work as expected
Out[231]: TypeError: setting an array element with a sequence.

However, if I do it through

df["numpy"].values.mean() #works as expected
Out[231]: array([ 2.5,  3.5,  4.5])

df["numpy"].values.std() #works as expected
Out[233]: array([ 1.5,  1.5,  1.5])

Debug information:

df["numpy"].dtype
Out[235]: dtype('O')

df["numpy"][0].dtype
Out[236]: dtype('int32')

df["numpy"].describe()
Out[237]: 
count             2
unique            2
top       [1, 2, 3]
freq              1
Name: numpy, dtype: object

df["numpy"]
Out[238]: 
0    [1, 2, 3]
1    [4, 5, 6]
Name: numpy, dtype: object

1 answer

  • answered 2018-01-16 23:42 MaxU

    Assuming you have the following original DF (containing NumPy arrays of the same shape in its cells):

    In [320]: df
    Out[320]:
      file      numpy
    0    x  [1, 2, 3]
    1    y  [4, 5, 6]
    

    Convert it to the following format:

    In [321]: d = pd.DataFrame(df['numpy'].values.tolist(), index=df['file'])
    
    In [322]: d
    Out[322]:
          0  1  2
    file
    x     1  2  3
    y     4  5  6
    

    Now you are free to use all the Pandas/Numpy/Scipy power:

    In [323]: d.sum(axis=1)
    Out[323]:
    file
    x     6
    y    15
    dtype: int64
    
    In [324]: d.sum(axis=0)
    Out[324]:
    0    5
    1    7
    2    9
    dtype: int64
    
    In [325]: d.mean(axis=0)
    Out[325]:
    0    2.5
    1    3.5
    2    4.5
    dtype: float64
    
    In [327]: d.std(axis=0)
    Out[327]:
    0    2.12132
    1    2.12132
    2    2.12132
    dtype: float64
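
Note that the 2.12132 above differs from the 1.5 that df["numpy"].values.std() returned in the question. This is not a conversion error: pandas' std() defaults to the sample standard deviation (ddof=1), while NumPy's std() defaults to the population standard deviation (ddof=0). The following is a minimal sketch of that comparison, assuming the same two arrays and the "file" column used in the answer; the names pandas_std, numpy_std and matched are only illustrative.

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
df = pd.DataFrame({"file": ["x", "y"], "numpy": [x, y]})

# Expand each cell's array into its own row of a numeric frame
# (the same conversion as in the answer).
d = pd.DataFrame(df["numpy"].values.tolist(), index=df["file"])

# pandas divides by N - 1 by default (ddof=1).
pandas_std = d.std(axis=0)

# NumPy divides by N by default (ddof=0).
numpy_std = d.to_numpy().std(axis=0)

# Passing ddof=0 to pandas reproduces the NumPy result.
matched = d.std(axis=0, ddof=0)

print(pandas_std.to_numpy())  # roughly 2.12132 in every column
print(numpy_std)              # 1.5 in every column
print(matched.to_numpy())     # 1.5 in every column
```

With only two rows the difference is large (a factor of sqrt(2)); it shrinks as the number of rows grows.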