Grouping by several columns and aggregating others in pandas crashes badly, while R's data.table handles the same file easily

I am relatively new to Python and am trying to use it as a second platform for data analysis. I normally use data.table for this kind of work.

The issue is that when I run a group-and-aggregate operation on a big CSV file (randomized, zipped, and uploaded at http://www.filedropper.com/ddataredact_1), Python throws:

return getattr(obj, method)(*args, **kwds)
ValueError: negative dimensions are not allowed

or, on other runs:

File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in cartesian_product
  for i, x in enumerate(X)]
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in
  for i, x in enumerate(X)]
File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 445, in repeat
  return _wrapfunc(a, 'repeat', repeats, axis=axis)
File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 51, in _wrapfunc
  return getattr(obj, method)(*args, **kwds)
MemoryError

I have spent three days on this: I reduced the file size by 89%, added breakpoints, and stepped through the code in a debugger, but I was not able to make any progress.

Surprisingly, when I ran the same group/aggregate operation with data.table in R, it took barely a second. Moreover, I didn't need any of the data-type conversions suggested at https://www.dataquest.io/blog/pandas-big-data/.

I also researched other threads: Avoiding Memory Issues For GroupBy on Large Pandas DataFrame, Pandas: df.groupby() is too slow for big data set. Any alternatives methods?, and pandas groupby with sum() on large csv file?. Those threads seem to be more about matrix multiplication, so I'd appreciate it if this weren't tagged as a duplicate.

Here's my Python code:

import os
import pandas as pd

finaldatapath = r"..\Data_R"
ddata = pd.read_csv(os.path.join(finaldatapath, "ddata_redact.csv"),
                    low_memory=False, encoding="ISO-8859-1")

#before optimization: 353MB
ddata.info(memory_usage="deep")

#optimize file: Object-types are the biggest culprit.
ddata_obj = ddata.select_dtypes(include=['object']).copy()

#Now convert this to category type:
#Float type didn't help much, so I am excluding it here.
for col in ddata_obj:
    ddata[col] = ddata_obj[col].astype('category')

#release memory
del ddata_obj

#after optimization: 39MB
ddata.info(memory_usage="deep")
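As a sanity check on the dtype optimization itself (a toy illustration with synthetic data, not my real file), converting a column of heavily repeated strings to category does cut memory the way the blog post describes:

```python
import pandas as pd

# Toy column with heavily repeated string values (synthetic, for illustration).
toy = pd.DataFrame({"Geo": ["EMEA", "APAC", "AMER"] * 10_000})

before = toy.memory_usage(deep=True)["Geo"]
toy["Geo"] = toy["Geo"].astype("category")
after = toy.memory_usage(deep=True)["Geo"]

# Category storage is one small integer code per row plus one copy of each
# distinct string, so it is far smaller than 30,000 Python string objects.
print(before > after)  # True
```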


#Create a list of grouping variables:
group_column_list = [
    "Business",
    "Device_Family",
    "Geo",
    "Segment",
    "Cust_Name",
    "GID",
    "Device ID",
    "Seller",
    "C9Phone_Margins_Flag",
    "C9Phone_Cust_Y_N",
    "ANDroid_Lic_Type",
    "Type",
    "Term",
    "Cust_ANDroid_Margin_Bucket",
    "Cust_Mobile_Margin_Bucket",
    # "Cust_Android_App_Bucket",
    "ANDroind_App_Cust_Y_N"
]

print("Analyzing data now...")

def ddata_agg(x):
    names = {
        'ANDroid_Margin': x['ANDroid_Margin'].sum(),
        'Margins': x['Margins'].sum(),
        'ANDroid_App_Qty': x['ANDroid_App_Qty'].sum(),
        'Apple_Margin': x['Apple_Margin'].sum(),
        'P_Lic': x['P_Lic'].sum(),
        'Cust_ANDroid_Margins': x['Cust_ANDroid_Margins'].mean(),
        'Cust_Mobile_Margins': x['Cust_Mobile_Margins'].mean(),
        'Cust_ANDroid_App_Qty': x['Cust_ANDroid_App_Qty'].mean()
    }
    return pd.Series(names)

ddata=ddata.reset_index(drop=True)

ddata = ddata.groupby(group_column_list).apply(ddata_agg) 

The code crashes in the .groupby(...).apply(...) step above.
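One thing I suspect but have not verified on the full file (note that cartesian_product appears in the traceback): since I group on columns converted to category, pandas may be enumerating every combination of category levels rather than only the combinations that actually occur, unless observed=True is passed to groupby. A small synthetic sketch of that behavior:

```python
import pandas as pd

# Synthetic frame: two category columns whose observed pairs are a tiny
# subset of the full cross-product of their declared levels.
df = pd.DataFrame({
    "a": pd.Categorical(["x"] * 5, categories=["x", "y", "z"]),
    "b": pd.Categorical(["1"] * 5, categories=["1", "2", "3", "4", "5"]),
    "v": range(5),
})

# observed=False (the long-standing default) emits one row per level
# combination: 3 * 5 = 15 groups, even though only one pair occurs.
full = df.groupby(["a", "b"], observed=False)["v"].sum()

# observed=True keeps only the combinations present in the data: 1 group.
seen = df.groupby(["a", "b"], observed=True)["v"].sum()

print(len(full), len(seen))  # 15 1
```

With sixteen grouping columns, a cross-product of all category levels could easily explain a MemoryError even on a 39MB frame, but I don't know whether that is what is happening here.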

Can someone please help me? Of all my posts, I have probably spent the most time on this one, trying to fix the problem and learn more Python along the way. But I have hit a wall, and it is all the more frustrating because R's data.table processes this file in under two seconds. This post is not about the pros and cons of R versus Python; it is about becoming more productive in Python.

I am completely lost, and I'd appreciate any help.


Here's my data.table R code:

path_r <- "../ddata_redact.csv"
ddata <- data.table::fread(path_r, stringsAsFactors = FALSE, data.table = TRUE, header = TRUE)

group_column_list <-c(
  "Business",
  "Device_Family",
  "Geo",
  "Segment",
  "Cust_Name",
  "GID",
  "Device ID",
  "Seller",
  "C9Phone_Margins_Flag",
  "C9Phone_Cust_Y_N",
  "ANDroid_Lic_Type",
  "Type",
  "Term",
  'Cust_ANDroid_Margin_Bucket',
  'Cust_Mobile_Margin_Bucket',
  # #                'Cust_Android_App_Bucket',
  'ANDroind_App_Cust_Y_N'
  )

ddata <- ddata[, .(ANDroid_Margin = sum(ANDroid_Margin, na.rm = TRUE),
                   Margins = sum(Margins, na.rm = TRUE),
                   Apple_Margin = sum(Apple_Margin, na.rm = TRUE),
                   Cust_ANDroid_Margins = mean(Cust_ANDroid_Margins, na.rm = TRUE),
                   Cust_Mobile_Margins = mean(Cust_Mobile_Margins, na.rm = TRUE),
                   Cust_ANDroid_App_Qty = mean(Cust_ANDroid_App_Qty, na.rm = TRUE),
                   ANDroid_App_Qty = sum(ANDroid_App_Qty, na.rm = TRUE)),
               by = group_column_list]

I am on a 4-core, 16GB-RAM Windows 10 x64 machine, and I am happy to provide any further details.