Python Export Data to CSV Based on Column Value in Chunks

I am processing a dataset of about 5 million rows in Python. I need to export the data to CSV files based on the value in a specific column. I also want to make sure no file has more than 1 million rows: if the data for one value exceeds 1 million rows, Python should create another CSV file to store the rest.

I tried the following code to export files based on the value in column 'col', but I am not sure how to limit each file to 1 million rows.

for u in df['col'].unique():
    file_name = 'output/{0}.csv'.format(u)
    df[df['col'] == u].to_csv(file_name, encoding='utf-8', index=False)

Example: Let's assume I have the following data. When city = 'new_york' we have 2 million rows, and when city = 'miami' we have 1 million rows.

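import pandas as pd

# both lists must have the same length; the sixth city entry ('miami') is assumed
city = ['new_york', 'new_york', 'new_york', 'miami', 'miami', 'miami']
population = ['8.5', '3.9', '0.25', '0.45', '1.4', '0.87']
df = pd.DataFrame({'city': city, 'population': population})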

In this case, I want three CSV files in total: 'new_york0.csv', 'new_york1.csv' and 'miami.csv'. 'new_york0.csv' and 'new_york1.csv' should contain only the rows where city = 'new_york', with 1 million rows each. 'miami.csv' should contain the rows where city = 'miami'.

1 answer

  • answered 2019-06-11 23:27 Valentino

    Something like this should work:

    maxrow = 1000000
    for i in range(0, len(df), maxrow):
        df.iloc[i:i+maxrow].to_csv(f"test{i//maxrow}.csv") #using formatted string literals.
    

    This works on the full dataframe df, but it is easy to extend to a selection: store the selected rows in their own dataframe first, then run the code above on that selection.

    maxrow = 1000000
    for u in df['col'].unique():
        seldf = df.loc[df['col'] == u]
        for i in range(0, len(seldf), maxrow):
            seldf.iloc[i:i+maxrow].to_csv("{}{:d}.csv".format(u, i//maxrow), encoding='utf-8', index=False)
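
The loop above numbers every file, so a value that fits in a single file is written as e.g. miami0.csv rather than miami.csv. If the exact naming from the question matters, a possible variant is sketched below. It is only a sketch: the helper name export_by_value, its parameters, and the temporary output directory are hypothetical, and it swaps the per-value filter for groupby so the dataframe is split once. The demo uses maxrow=2 on a tiny frame to stand in for the 1 million row limit.

```python
import os
import tempfile

import pandas as pd


def export_by_value(df, col, out_dir, maxrow=1_000_000):
    """Write one or more CSV files per distinct value in `col`.

    Hypothetical helper: the name, parameters and return value are
    illustrative and not part of pandas.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    # groupby splits the frame once; sort=False keeps first-appearance order
    for value, group in df.groupby(col, sort=False):
        n_chunks = -(-len(group) // maxrow)  # ceiling division
        for k in range(n_chunks):
            # add a number to the file name only when the group needs several files
            suffix = str(k) if n_chunks > 1 else ""
            path = os.path.join(out_dir, f"{value}{suffix}.csv")
            group.iloc[k * maxrow:(k + 1) * maxrow].to_csv(
                path, encoding="utf-8", index=False)
            written.append(path)
    return written


# Small demo: maxrow=2 stands in for 1,000,000
demo = pd.DataFrame({
    "city": ["new_york", "new_york", "new_york", "miami", "miami"],
    "population": ["8.5", "3.9", "0.25", "0.45", "1.4"],
})
demo_dir = tempfile.mkdtemp()
demo_files = [os.path.basename(p)
              for p in export_by_value(demo, "city", demo_dir, maxrow=2)]
print(demo_files)
```

Note that groupby skips rows whose value in col is missing by default, so those rows would need separate handling if they should be exported too.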