Grouping by date and summing the first values from each group in Pyspark

I have a data frame similar to this:

date      | balance|
-------------------|
01/01/2018| 1000   |
01/07/2018| 1200   |
01/01/2019| 900    |
01/07/2019| 1200   |
01/01/2018| 133    |
01/07/2018| 1335   |
01/01/2019| 1244   |
01/07/2019| 124    |

I want to group by date and, maybe using the first method, sum the first rows to get something like:

date      | first(balance)|
--------------------------|
01/01/2018| 1133          |
01/01/2010| 2235          |

I have:

df = df.groupBy("balance").sum(f.first("balance"))

result:

TypeError: Column is not iterable

1 answer

  • answered 2019-11-08 15:14 Vignesh dvp

    Your question and example dataframes don't exactly match.

    From what I infer from your question,

    from pyspark.sql import functions as F

    df = df.groupBy('date').agg(F.first('balance').alias('balance')).agg(F.sum('balance'))
    

    This should ideally work.
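
    If what you actually want is the per-date totals shown in your expected output (one row per date with the balances added up), a plain groupBy plus sum is enough and no first() is needed. A minimal, self-contained sketch, assuming the date/balance column names from your question:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # sample rows mirroring the question
    df = spark.createDataFrame(
        [("01/01/2018", 1000), ("01/07/2018", 1200),
         ("01/01/2018", 133), ("01/07/2018", 1335)],
        ["date", "balance"],
    )

    # one row per date, balances within that date summed
    df.groupBy("date").agg(F.sum("balance").alias("sum_balance")).show()
    # 01/01/2018 -> 1133, 01/07/2018 -> 2535

    As for the TypeError in your attempt: GroupedData.sum() only accepts column name strings, so passing a Column expression such as f.first("balance") into it is most likely what raises "Column is not iterable"; Column expressions have to go through agg() instead.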