Group data by two columns and count it using pandas

I have the following two data sets:

  1. songs
  2. play_event

In songs the data is as below:

song_id  total_plays
1        2000
2        4532
3        9999
4        2343

And in play_event the data is as below:

user_id song_id
102         1
103         4
102         1
102         3
104         2
102         1

Each time a song is played, a new entry is added, even if the same song is played again.

With this data I want to:

  1. Get the total number of times each user played each song. For example, user_id 102 played song_id 1 three times in the data above. I want the result grouped by user_id, with a total count per song. Something like below:

    user_id  song_id  count
    102      1        3
    102      3        1
    103      4        1
    104      2        1
    

I am thinking of using pandas to do this, but I want to know whether pandas is the right choice.

If it is not pandas, what should my way forward be?

If pandas is the right choice, then:

The code below gets me the count grouped either by user_id or by song_id. How do I get the count grouped by both user_id and song_id? Here is the sample code I tried:

import pandas as pd

# Load data from the CSV file (pd.read_csv replaces DataFrame.from_csv, which was removed in pandas 1.0)
data = pd.read_csv('play_events.csv')

# Gives how many entries per user
data['user_id'].value_counts()

# Gives how many entries per song
data['song_id'].value_counts()

1 answer

  • answered 2018-10-11 19:48 sacul

    For your first problem, a simple groupby and value_counts does the trick. Note that everything after value_counts() in the code below is just to get it to an actual dataframe in the same format as your desired output.

    counts = play_events.groupby('user_id')['song_id'].value_counts().to_frame('count').reset_index()
    
    >>> counts
       user_id  song_id  count
    0      102        1      3
    1      102        3      1
    2      103        4      1
    3      104        2      1
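
    As an alternative sketch (not part of the original answer), the same table can be produced by grouping on both columns at once and counting rows with size(). The play_events frame below is rebuilt by hand from the sample data in the question, so treat it as an assumption about how your CSV loads:

```python
import pandas as pd

# Sample rows copied from the question's play_event table (assumed layout).
play_events = pd.DataFrame({
    "user_id": [102, 103, 102, 102, 104, 102],
    "song_id": [1, 4, 1, 3, 2, 1],
})

# Group on both columns; size() counts the rows in each (user_id, song_id) pair.
# reset_index(name="count") turns the result back into a flat dataframe.
counts = (
    play_events.groupby(["user_id", "song_id"])
    .size()
    .reset_index(name="count")
)
print(counts)
```

    Here the groups come back sorted by user_id and then song_id, whereas value_counts orders the songs within each user by count.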
    

    Then for your second problem (which you have deleted in your edited post, but I will leave just in case it is useful to you), you can loop through counts, grouping by user_id, and save each as csv:

    for user, data in counts.groupby('user_id', as_index=False):
        data.to_csv(str(user)+'_events.csv')
    

    For your example dataframes, this gives you 3 csvs: 102_events.csv, 103_events.csv, and 104_events.csv. The first looks like:

       user_id  song_id  count
    0      102        1      3
    1      102        3      1
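
    If you would rather not have the row index written into each file, a self-contained variation of the loop might look like the sketch below. The counts frame is typed in by hand from the output above, and out_dir is a hypothetical temporary folder chosen only for illustration:

```python
import os
import tempfile

import pandas as pd

# Counts table typed in from the answer's output (assumed, for illustration).
counts = pd.DataFrame({
    "user_id": [102, 102, 103, 104],
    "song_id": [1, 3, 4, 2],
    "count": [3, 1, 1, 1],
})

# Hypothetical output folder; replace with wherever the files should go.
out_dir = tempfile.mkdtemp()

written = []
for user, user_rows in counts.groupby("user_id"):
    path = os.path.join(out_dir, f"{user}_events.csv")
    # index=False leaves the dataframe's row index out of the file.
    user_rows.to_csv(path, index=False)
    written.append(os.path.basename(path))

print(sorted(written))
```

    Each file then contains only the user_id, song_id, and count columns for that user.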