Pandas: Relationship between a data frame and the numpy.array used to define it

I just wanted to create two data frames of the same dimensions that were initially empty. I did it this way:

import numpy as np
import pandas as pd

m = np.empty((2, 3))*np.nan
df1 = pd.DataFrame(m)
df2 = pd.DataFrame(m)

But when I change a particular value in one data frame, all three objects are affected:

df2.iloc[1, 2] = 1

print(df2)
    0   1    2
0 NaN NaN  NaN
1 NaN NaN  1.0

print(df1)
    0   1    2
0 NaN NaN  NaN
1 NaN NaN  1.0

print(m)
[[nan nan nan]
 [nan nan  1.]]

So it seems that a data frame is just a wrapper around a NumPy array: no copy is made. I have not seen this behavior documented anywhere and I just wanted to point it out. Any comments?
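
One way to check the sharing directly, rather than inferring it from printed values, is np.shares_memory. This is only a minimal sketch: the outcome may differ between pandas versions, and to_numpy() can itself return a view or a copy, so the code reports the result instead of assuming it.

```python
import numpy as np
import pandas as pd

m = np.full((2, 3), np.nan)
df = pd.DataFrame(m)

# np.shares_memory reports whether two arrays overlap in memory.
# to_numpy() may return a view or a copy depending on the pandas
# version and dtypes, so treat the result as a hint, not a guarantee.
shares = bool(np.shares_memory(m, df.to_numpy()))
print("pandas", pd.__version__, "-> frame shares memory with m:", shares)
```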

3 answers

  • answered 2018-10-22 12:21 A. Wolf

    I think this happens because df1 and df2 both point to the same memory: the buffer of the array m.
    A quick way to solve the problem is to copy the shared numpy array into a new array:

    import numpy as np
    import pandas as pd
    
    m = np.empty((2, 3))*np.nan
    n = m.copy()  # independent copy with its own memory
    df1 = pd.DataFrame(m)
    df2 = pd.DataFrame(n)
    
    df2.iloc[1, 2] = 1  # changes df2 only; m and df1 are untouched
    
    print(df1)
    print(df2)
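
    If the goal is only two empty frames of the same shape, another option is to skip the shared array altogether. The sketch below assumes a scalar fill value passed together with index and columns; the range-based labels are chosen here just to mirror the default 0..n labels in the question.

```python
import numpy as np
import pandas as pd

# Each constructor call allocates its own block of NaN values,
# so there is no shared array to begin with.
df1 = pd.DataFrame(np.nan, index=range(2), columns=range(3))
df2 = pd.DataFrame(np.nan, index=range(2), columns=range(3))

df2.iloc[1, 2] = 1  # only df2 is modified

print(df1)
print(df2)
```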
    

  • answered 2018-10-22 12:51 emirc

    There is an argument to the DataFrame constructor that lets you specify whether to copy the data from the ndarray into the DataFrame.

    See the pandas source code (frame.py, line 405 and later...). By default, copy is False.

    So, you can force copying with something like:

    import numpy as np
    import pandas as pd
    
    m = np.empty((2, 3))*np.nan
    df1 = pd.DataFrame(m, copy=True)  # df1 gets its own copy of the data
    df2 = pd.DataFrame(m)             # df2 may still share memory with m
    
    df2.iloc[1, 2] = 1
    print(df1)
    print(df2)
    

  • answered 2018-10-22 13:18 B. M.

    The idea behind this behavior is that NumPy and pandas are designed for efficiency, so the developers' philosophy is that data is copied only when necessary.

    For example:

    import numpy as np
    import pandas as pd
    
    a = np.ones((2, 3))
    df = pd.DataFrame(a)
    df.iloc[0, 0] = "string"  # assign a string into a float column
    
    In [2]: a
    Out[2]: 
    array([[ 1.,  1.,  1.],
           [ 1.,  1.,  1.]])
    
    In [3]: df
    Out[3]: 
            0    1    2
    0  string  1.0  1.0
    1       1  1.0  1.0
    

    In this case a copy is made, because the dtype changes: the float array a cannot hold a string, so it is left untouched.
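
    The same copy-only-when-needed rule can be seen in NumPy alone. The following is a small sketch (the variable names are made up for illustration) that uses np.shares_memory to tell views from copies:

```python
import numpy as np

a = np.ones((2, 3))

row_view = a[0]          # basic indexing: a view onto a's buffer
row_copy = a[[0]]        # integer-array (fancy) indexing: a new array
as_int = a.astype(int)   # dtype conversion: a new array

print(np.shares_memory(a, row_view))  # True
print(np.shares_memory(a, row_copy))  # False
print(np.shares_memory(a, as_int))    # False
```

    Writing through row_view changes a, while row_copy and as_int are independent of it.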