pandas - Add values of two or more different DataFrames through a list

I want to sum the values of three or more DataFrames by passing them as a list, instead of adding them one at a time.

First, I'll use merge as an example.

The following lines merge the DataFrames (data0, data1, data2) one by one:

final_data = data0.merge(data1, on=['player_id', 'player_name'])
final_data = final_data.merge(data2, on=['player_id', 'player_name'])

Alternatively, I can merge the DataFrames through a list with reduce (from functools), which helps considerably when there are many DataFrames:

data_list = [data0, data1, data2]
final_data = reduce(lambda left, right: pd.merge(left, right, on=['player_id', 'player_name']), data_list)
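
For reference, here is a self-contained sketch of the list-based merge above. The three small frames are made up for illustration (each carries a different stat column so the merged column names do not collide); only the reduce/pd.merge pattern comes from the snippet above.

```python
from functools import reduce  # reduce is not a builtin in Python 3

import pandas as pd

keys = ["player_id", "player_name"]

# Illustrative frames only: each one holds the key columns plus one stat.
data0 = pd.DataFrame({"player_id": [28920, 31097],
                      "player_name": ["S. Smith", "M. Machado"],
                      "ab": [1, 3]})
data1 = pd.DataFrame({"player_id": [28920, 31097],
                      "player_name": ["S. Smith", "M. Machado"],
                      "run": [9, 3]})
data2 = pd.DataFrame({"player_id": [28920, 31097],
                      "player_name": ["S. Smith", "M. Machado"],
                      "hit": [2, 3]})

data_list = [data0, data1, data2]

# Fold the list pairwise: ((data0 merge data1) merge data2)
final_data = reduce(lambda left, right: pd.merge(left, right, on=keys), data_list)
print(final_data)
```

The same call works unchanged for a longer data_list, which is the point of going through a list.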

Now I have the following three DataFrames, and I would like to add up their values, matching rows on player_id and player_name.

data0:

    player_id  player_name  ab  run  hit
0       28920     S. Smith   0    0    0
1       33351   T. Mancini   0    0    0
2       30267    C. Gentry   0    0    0
3       28513     A. Jones   0    0    0
4       31097   M. Machado   0    0    0
5       29170     C. Davis   0    0    0
6       29322    M. Trumbo   0    0    0
7       29564  W. Castillo   0    0    0
8       34885       H. Kim   0    0    0
9       32952   J. Rickard   0    0    0
10      31988    J. Schoop   0    0    0
11       5908   J.J. Hardy   0    0    0

Next,

data1:

   player_id player_name  ab  run  hit
0      28920    S. Smith   1    4    6
1      33351  T. Mancini   0    0    2
2      28513    A. Jones   2    1    0
3      31097  M. Machado   1    8    0
4      34885      H. Kim   1    1    2
5      32952  J. Rickard   0    2    0
6      31988   J. Schoop   5    3    4
7       5908  J.J. Hardy   4    2   10

And next,

data2:

   player_id player_name  ab  run  hit
0      28920    S. Smith   1    9    2
1      31097  M. Machado   3    3    3
2      29170    C. Davis   9    6    4
3      29322   M. Trumbo   3    5    7
4      32952  J. Rickard   1    3    4
5       5908  J.J. Hardy   0    0    5

The final DataFrame I am looking to get should look like this:

final_data:

    player_id  player_name  ab  run  hit
0       28920     S. Smith   2   13    8
1       33351   T. Mancini   0    0    2
2       30267    C. Gentry   0    0    0
3       28513     A. Jones   2    1    0
4       31097   M. Machado   4   11    3
5       29170     C. Davis   9    6    4
6       29322    M. Trumbo   3    5    7
7       29564  W. Castillo   0    0    0
8       34885       H. Kim   1    1    2
9       32952   J. Rickard   1    5    4
10      31988    J. Schoop   5    3    4
11       5908   J.J. Hardy   4    2   15

I can get this result with the following code, but it adds the DataFrames one by one.

data0 = pd.read_csv('initial_df.csv')
data1 = pd.read_csv('add_vals1.csv')
data2 = pd.read_csv('add_vals2.csv')


data0 = data0.set_index(['player_id', 'player_name'])
data1 = data1.set_index(['player_id', 'player_name'])
data2 = data2.set_index(['player_id', 'player_name'])

final_data = data0.add(data1, fill_value=0).astype(int).reset_index()
final_data = final_data.set_index(['player_id', 'player_name'])
final_data = final_data.add(data2, fill_value=0).astype(int).reset_index()

Could anyone help me get the final result through a list, as I did with merge above? Thank you so much!

1 answer

  • answered 2018-05-16 05:37 jezrael

    I believe you need the index_col parameter of read_csv to build a MultiIndex, and then reduce with add:

    from functools import reduce
    import pandas as pd
    
    data0 = pd.read_csv('initial_df.csv', index_col=['player_id', 'player_name'])
    data1 = pd.read_csv('add_vals1.csv', index_col=['player_id', 'player_name'])
    data2 = pd.read_csv('add_vals2.csv', index_col=['player_id', 'player_name'])
    
    data_list = [data0, data1, data2]
    final_data = reduce(lambda x, y: x.add(y, fill_value=0), data_list).reset_index()
    print(final_data)
        player_id  player_name   ab   run   hit
    0        5908   J.J. Hardy  4.0   2.0  15.0
    1       28513     A. Jones  2.0   1.0   0.0
    2       28920     S. Smith  2.0  13.0   8.0
    3       29170     C. Davis  9.0   6.0   4.0
    4       29322    M. Trumbo  3.0   5.0   7.0
    5       29564  W. Castillo  0.0   0.0   0.0
    6       30267    C. Gentry  0.0   0.0   0.0
    7       31097   M. Machado  4.0  11.0   3.0
    8       31988    J. Schoop  5.0   3.0   4.0
    9       32952   J. Rickard  1.0   5.0   4.0
    10      33351   T. Mancini  0.0   0.0   2.0
    11      34885       H. Kim  1.0   1.0   2.0
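
The totals above come back as floats: aligning frames whose indexes differ introduces missing values before fill_value is applied, which switches the columns to a float dtype. If integers are wanted, as in the question's expected output, a cast after the reduce should be enough. This is a minimal sketch with a three-player subset of the question's data built inline instead of read from CSV:

```python
from functools import reduce

import pandas as pd

keys = ["player_id", "player_name"]

# Subset of the question's data, built inline so the sketch is self-contained.
data0 = pd.DataFrame({"player_id": [28920, 33351, 29170],
                      "player_name": ["S. Smith", "T. Mancini", "C. Davis"],
                      "ab": [0, 0, 0], "run": [0, 0, 0], "hit": [0, 0, 0]}).set_index(keys)
data1 = pd.DataFrame({"player_id": [28920, 33351],
                      "player_name": ["S. Smith", "T. Mancini"],
                      "ab": [1, 0], "run": [4, 0], "hit": [6, 2]}).set_index(keys)
data2 = pd.DataFrame({"player_id": [28920, 29170],
                      "player_name": ["S. Smith", "C. Davis"],
                      "ab": [1, 9], "run": [9, 6], "hit": [2, 4]}).set_index(keys)

# Players missing from a frame count as 0 thanks to fill_value.
summed = reduce(lambda x, y: x.add(y, fill_value=0), [data0, data1, data2])

# Cast the totals back to integers before restoring the key columns.
final_data = summed.astype(int).reset_index()
print(final_data)
```

The cast is safe here because fill_value=0 leaves no NaN in the result; if NaN could remain, astype(int) would raise.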
    

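    Another solution uses concat and then sums over both index levels. Older pandas versions accepted sum(level=[0,1]) for this; that argument was removed in pandas 2.0, so group by the levels instead (sort=False keeps the order of first appearance):

    data_list = [data0, data1, data2]
    final_data = pd.concat(data_list).groupby(level=[0, 1], sort=False).sum().reset_index()
    print(final_data)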
        player_id  player_name  ab  run  hit
    0       28920     S. Smith   2   13    8
    1       33351   T. Mancini   0    0    2
    2       30267    C. Gentry   0    0    0
    3       28513     A. Jones   2    1    0
    4       31097   M. Machado   4   11    3
    5       29170     C. Davis   9    6    4
    6       29322    M. Trumbo   3    5    7
    7       29564  W. Castillo   0    0    0
    8       34885       H. Kim   1    1    2
    9       32952   J. Rickard   1    5    4
    10      31988    J. Schoop   5    3    4
    11       5908   J.J. Hardy   4    2   15
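
To check that both approaches agree, here is a small self-contained sketch. The helper make_frame is hypothetical (not part of the answer), the rows are a subset of the question's data, and groupby(level=...).sum() is used for the concat variant because that spelling works on current pandas releases.

```python
from functools import reduce

import pandas as pd

keys = ["player_id", "player_name"]


def make_frame(rows):
    """Hypothetical helper: build a frame indexed by the key columns
    from (player_id, player_name, ab, run, hit) tuples."""
    return pd.DataFrame(rows, columns=keys + ["ab", "run", "hit"]).set_index(keys)


data_list = [
    make_frame([(28920, "S. Smith", 0, 0, 0), (30267, "C. Gentry", 0, 0, 0),
                (31097, "M. Machado", 0, 0, 0), (5908, "J.J. Hardy", 0, 0, 0)]),
    make_frame([(28920, "S. Smith", 1, 4, 6), (31097, "M. Machado", 1, 8, 0),
                (5908, "J.J. Hardy", 4, 2, 10)]),
    make_frame([(28920, "S. Smith", 1, 9, 2), (31097, "M. Machado", 3, 3, 3),
                (5908, "J.J. Hardy", 0, 0, 5)]),
]

# Stack all rows, then total them per (player_id, player_name) pair.
via_concat = pd.concat(data_list).groupby(level=keys, sort=False).sum()

# Pairwise addition; the cast undoes the float conversion caused by alignment.
via_reduce = reduce(lambda x, y: x.add(y, fill_value=0), data_list).astype("int64")

# Row order can differ between the two results, so sort before comparing.
same = via_concat.sort_index().equals(via_reduce.sort_index())
print(via_concat.reset_index())
print("Same totals:", same)
```

The concat variant keeps integer dtypes without a cast, since no alignment (and therefore no NaN) is involved.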