Add sum columns back to pandas dataframe through chain style

I am trying to add a list of column sums back to my pandas dataframe through assign(). But I am not so sure how to do it when there's more than one column. What's the best way to do it or any other way to do it in a chain style given I have had other steps before?

data2.assign(data2[rate_name].abs() / data2.groupby(level = 'date')[rate_name].transform('sum'))
                   rate_water  rate_fire  rate_wood
id     date                                        
apple  2019-01-01   -0.500000  -0.500000   0.000000
orange 2019-01-01   -0.636364  -0.963636   3.000000
melon  2019-01-01   -0.333333   5.666667  27.888889
apple  2020-01-01   -0.100000   7.900000  76.000000
orange 2020-01-01    0.363636  -0.963636  26.500000
melon  2020-01-01    0.166667   6.166667  27.235043
apple  2021-01-01    0.328571  26.261702  84.220779
orange 2021-01-01    0.363636  28.036364  28.683673
melon  2021-01-01    0.611111  39.944444  27.679487

Reproducible:

from pandas import Timestamp
data2 = pd.DataFrame.from_dict({'rate_water': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.6363636363636364, ('melon', Timestamp('2019-01-01 00:00:00')): -0.33333333333333337, ('apple', Timestamp('2020-01-01 00:00:00')): -0.10000000000000009, ('orange', Timestamp('2020-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2020-01-01 00:00:00')): 0.16666666666666663, ('apple', Timestamp('2021-01-01 00:00:00')): 0.3285714285714285, ('orange', Timestamp('2021-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2021-01-01 00:00:00')): 0.611111111111111}, 'rate_fire': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2019-01-01 00:00:00')): 5.666666666666667, ('apple', Timestamp('2020-01-01 00:00:00')): 7.9, ('orange', Timestamp('2020-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2020-01-01 00:00:00')): 6.166666666666667, ('apple', Timestamp('2021-01-01 00:00:00')): 26.261702127659575, ('orange', Timestamp('2021-01-01 00:00:00')): 28.036363636363635, ('melon', Timestamp('2021-01-01 00:00:00')): 39.94444444444444}, 'rate_wood': {('apple', Timestamp('2019-01-01 00:00:00')): 0.0, ('orange', Timestamp('2019-01-01 00:00:00')): 3.0, ('melon', Timestamp('2019-01-01 00:00:00')): 27.88888888888889, ('apple', Timestamp('2020-01-01 00:00:00')): 76.0, ('orange', Timestamp('2020-01-01 00:00:00')): 26.5, ('melon', Timestamp('2020-01-01 00:00:00')): 27.235042735042736, ('apple', Timestamp('2021-01-01 00:00:00')): 84.22077922077922, ('orange', Timestamp('2021-01-01 00:00:00')): 28.683673469387756, ('melon', Timestamp('2021-01-01 00:00:00')): 27.67948717948718}})
                   rate_water  rate_fire  rate_wood  sum_water  sum_fire    sum_wood
id     date                                                                         
apple  2019-01-01   -0.500000  -0.500000   0.000000  -1.469697   4.20303   30.888889
orange 2019-01-01   -0.636364  -0.963636   3.000000  -1.469697   4.20303   30.888889
melon  2019-01-01   -0.333333   5.666667  27.888889  -1.469697   4.20303   30.888889
apple  2020-01-01   -0.100000   7.900000  76.000000   0.430303  13.10303  129.735043
orange 2020-01-01    0.363636  -0.963636  26.500000   0.430303  13.10303  129.735043
melon  2020-01-01    0.166667   6.166667  27.235043   0.430303  13.10303  129.735043
apple  2021-01-01    0.328571  26.261702  84.220779   1.303319  94.24251  140.583940
orange 2021-01-01    0.363636  28.036364  28.683673   1.303319  94.24251  140.583940
melon  2021-01-01    0.611111  39.944444  27.679487   1.303319  94.24251  140.583940

2 answers

  • answered 2022-05-04 09:43 jezrael

    Use dictionary comprehension for dict of Series and add it to dataFrame with unpack **:

    from pandas import Timestamp
    data2 = pd.DataFrame.from_dict({'rate_water': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.6363636363636364, ('melon', Timestamp('2019-01-01 00:00:00')): -0.33333333333333337, ('apple', Timestamp('2020-01-01 00:00:00')): -0.10000000000000009, ('orange', Timestamp('2020-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2020-01-01 00:00:00')): 0.16666666666666663, ('apple', Timestamp('2021-01-01 00:00:00')): 0.3285714285714285, ('orange', Timestamp('2021-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2021-01-01 00:00:00')): 0.611111111111111}, 'rate_fire': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2019-01-01 00:00:00')): 5.666666666666667, ('apple', Timestamp('2020-01-01 00:00:00')): 7.9, ('orange', Timestamp('2020-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2020-01-01 00:00:00')): 6.166666666666667, ('apple', Timestamp('2021-01-01 00:00:00')): 26.261702127659575, ('orange', Timestamp('2021-01-01 00:00:00')): 28.036363636363635, ('melon', Timestamp('2021-01-01 00:00:00')): 39.94444444444444}, 'rate_wood': {('apple', Timestamp('2019-01-01 00:00:00')): 0.0, ('orange', Timestamp('2019-01-01 00:00:00')): 3.0, ('melon', Timestamp('2019-01-01 00:00:00')): 27.88888888888889, ('apple', Timestamp('2020-01-01 00:00:00')): 76.0, ('orange', Timestamp('2020-01-01 00:00:00')): 26.5, ('melon', Timestamp('2020-01-01 00:00:00')): 27.235042735042736, ('apple', Timestamp('2021-01-01 00:00:00')): 84.22077922077922, ('orange', Timestamp('2021-01-01 00:00:00')): 28.683673469387756, ('melon', Timestamp('2021-01-01 00:00:00')): 27.67948717948718}})
    data2.index.names=['id','date']
    
    cols = ['rate_water','rate_fire','rate_wood']
    data2 = data2.assign(**{rate_name.replace('rate','sum'): 
                            data2[rate_name].abs() / data2.groupby(level = 'date')[rate_name].transform('sum') 
                            for rate_name in cols})
    print (data2)
                       rate_water  rate_fire  rate_wood  sum_water  sum_fire  \
    id     date                                                                
    apple  2019-01-01   -0.500000  -0.500000   0.000000  -0.340206  0.118962   
    orange 2019-01-01   -0.636364  -0.963636   3.000000  -0.432990  0.229272   
    melon  2019-01-01   -0.333333   5.666667  27.888889  -0.226804  1.348234   
    apple  2020-01-01   -0.100000   7.900000  76.000000   0.232394  0.602914   
    orange 2020-01-01    0.363636  -0.963636  26.500000   0.845070  0.073543   
    melon  2020-01-01    0.166667   6.166667  27.235043   0.387324  0.470629   
    apple  2021-01-01    0.328571  26.261702  84.220779   0.252104  0.278661   
    orange 2021-01-01    0.363636  28.036364  28.683673   0.279008  0.297492   
    melon  2021-01-01    0.611111  39.944444  27.679487   0.468888  0.423847   
    
                       sum_wood  
    id     date                  
    apple  2019-01-01  0.000000  
    orange 2019-01-01  0.097122  
    melon  2019-01-01  0.902878  
    apple  2020-01-01  0.585809  
    orange 2020-01-01  0.204262  
    melon  2020-01-01  0.209928  
    apple  2021-01-01  0.599078  
    orange 2021-01-01  0.204032  
    melon  2021-01-01  0.196889  
    

    Another way is processing all columns together:

    cols = ['rate_water','rate_fire','rate_wood']
    data2 = data2.join(data2[cols].abs().div(data2.groupby(level = 'date')[cols].transform('sum') )
                         .rename(columns=lambda x: x.replace('rate','sum')))
    

    cols = ['rate_water','rate_fire','rate_wood']
    data2 = data2.assign(**data2[cols].abs().div(data2.groupby(level = 'date')[cols].transform('sum') )
                           .rename(columns=lambda x: x.replace('rate','sum')))
    print (data2)
                       rate_water  rate_fire  rate_wood  sum_water  sum_fire  \
    id     date                                                                
    apple  2019-01-01   -0.500000  -0.500000   0.000000  -0.340206  0.118962   
    orange 2019-01-01   -0.636364  -0.963636   3.000000  -0.432990  0.229272   
    melon  2019-01-01   -0.333333   5.666667  27.888889  -0.226804  1.348234   
    apple  2020-01-01   -0.100000   7.900000  76.000000   0.232394  0.602914   
    orange 2020-01-01    0.363636  -0.963636  26.500000   0.845070  0.073543   
    melon  2020-01-01    0.166667   6.166667  27.235043   0.387324  0.470629   
    apple  2021-01-01    0.328571  26.261702  84.220779   0.252104  0.278661   
    orange 2021-01-01    0.363636  28.036364  28.683673   0.279008  0.297492   
    melon  2021-01-01    0.611111  39.944444  27.679487   0.468888  0.423847   
    
                       sum_wood  
    id     date                  
    apple  2019-01-01  0.000000  
    orange 2019-01-01  0.097122  
    melon  2019-01-01  0.902878  
    apple  2020-01-01  0.585809  
    orange 2020-01-01  0.204262  
    melon  2020-01-01  0.209928  
    apple  2021-01-01  0.599078  
    orange 2021-01-01  0.204032  
    melon  2021-01-01  0.196889  
    

  • answered 2022-05-04 09:52 sammywemmy

    One option is with the assign method and unpacking:

     data2.assign(**data2
                    .abs()
                    .div(
                    data2.groupby('date')
                    .transform('sum'))
                    .rename(columns = lambda df: df.removeprefix('rate_'))
                    .add_prefix('sum_'))
    
                        rate_water  rate_fire  rate_wood  sum_water  sum_fire  sum_wood
    id     date
    apple  2019-01-01   -0.500000  -0.500000   0.000000  -0.340206  0.118962  0.000000
    orange 2019-01-01   -0.636364  -0.963636   3.000000  -0.432990  0.229272  0.097122
    melon  2019-01-01   -0.333333   5.666667  27.888889  -0.226804  1.348234  0.902878
    apple  2020-01-01   -0.100000   7.900000  76.000000   0.232394  0.602914  0.585809
    orange 2020-01-01    0.363636  -0.963636  26.500000   0.845070  0.073543  0.204262
    melon  2020-01-01    0.166667   6.166667  27.235043   0.387324  0.470629  0.209928
    apple  2021-01-01    0.328571  26.261702  84.220779   0.252104  0.278661  0.599078
    orange 2021-01-01    0.363636  28.036364  28.683673   0.279008  0.297492  0.204032
    melon  2021-01-01    0.611111  39.944444  27.679487   0.468888  0.423847  0.196889
    

    Another option would be to concatenate along axis=1:

     pd.concat([data2,
               data2.abs()
                    .div(groupby('date')
                    .transform('sum'))
                    .rename(columns = lambda df: df.removeprefix('rate_'))
                    .add_prefix('sum_')], 
               axis = 1)
    Out[103]:
                        rate_water  rate_fire  rate_wood  sum_water  sum_fire  sum_wood
    id     date
    apple  2019-01-01   -0.500000  -0.500000   0.000000  -0.340206  0.118962  0.000000
    orange 2019-01-01   -0.636364  -0.963636   3.000000  -0.432990  0.229272  0.097122
    melon  2019-01-01   -0.333333   5.666667  27.888889  -0.226804  1.348234  0.902878
    apple  2020-01-01   -0.100000   7.900000  76.000000   0.232394  0.602914  0.585809
    orange 2020-01-01    0.363636  -0.963636  26.500000   0.845070  0.073543  0.204262
    melon  2020-01-01    0.166667   6.166667  27.235043   0.387324  0.470629  0.209928
    apple  2021-01-01    0.328571  26.261702  84.220779   0.252104  0.278661  0.599078
    orange 2021-01-01    0.363636  28.036364  28.683673   0.279008  0.297492  0.204032
    melon  2021-01-01    0.611111  39.944444  27.679487   0.468888  0.423847  0.196889
    

    I feel though that it would probably be cleaner to create temporary variables instead, especially if this is going to be in production code

How many English words
do you know?
Test your English vocabulary size, and measure
how many words do you know
Online Test
Powered by Examplum