Different result pandas Groupby in Python for DF based on Excel and CSV

To make my code faster, I want to switch from Excel input to CSV input data. First, I create two df's that are exactly the same.

demand_data = pd.ExcelFile("Input Data\Historical Demand.xlsx")
FY20 = pd.read_excel(demand_data, 'Data FY20')
FY20b = pd.read_csv("Input Data\Historical Demand FY20.csv")

The resulting df's are: Based on Excel Based on CSV

Next, I want to group my df by some columns using pandas groupby and sum over some colums. I use the following code:

FY20 = FY20.groupby(['SKU', 'Material', 'Plant'])[["OrderQuantity","DeliveredQuantity"]].sum().reset_index()
FY20b = FY20b.groupby(['SKU', 'Material', 'Plant'])[["OrderQuantity","DeliveredQuantity"]].sum().reset_index()

This is the result: Result based on Excel DF Result based on CSV DF

This does not make any sense to me, since the two dataframes are exactly the same, but the result is not. How do I get the same groupby result from the CSV based dataframe?

1 answer

  • answered 2021-09-27 16:19 saedx1

    This probably because your numbers in the CSV are strings (e.g. 1,516.0), you can notice that from the commas. You'll need to remove those and then convert the resulting strings to integer columns:

    FY20b["OrderQuantity"] = FY20b["OrderQuantity"].apply(lambda x: x.replace(',', ''))
    FY20b["OrderQuantity"] = pd.to_numeric(FY20b["OrderQuantity"])
    
    FY20b["DeliveredQuantity"] = FY20b["DeliveredQuantity"].apply(lambda x: x.replace(',', ''))
    FY20b["DeliveredQuantity"] = pd.to_numeric(FY20b["DeliveredQuantity"])
    

    Then you can do:

    FY20b = FY20b.groupby(['SKU', 'Material', 'Plant'])[["OrderQuantity","DeliveredQuantity"]].sum().reset_index()
    

How many English words
do you know?
Test your English vocabulary size, and measure
how many words do you know
Online Test
Powered by Examplum