Convert a Spark dataframe to a R dataframe

I use R on Zeppelin at work to develop machine learning models. I extract the data from Hive tables using %sparkr, sql(Constring, 'select * from table') and in default it generates a spark data frame with 94 Million records.

However, I cannot perform all R data munging tasks on this Spark df, so I try to convert it to an R data frame using Collect(), but I run into memory node/ time-out issues.

I was wondering if stack overflow community is aware of any other way to convert a Spark df to R df by avoiding time-out issues?

1 answer

  • answered 2018-08-09 19:40 nate

    Did you try to cache your spark dataframe first? If you cache your data first, it may help speed up the collect as the data is already in RAM...that could get rid of the timeout problem. At the same time, this would only increase your RAM requirements. I too have seen those timeout issues when you are trying to serialize or deserialize certain data types, or just large amounts of data between R and Spark. Serialization and deserialization for large data sets is far from a "bullet proof" operation with R and Spark. Moreover, 94M records may just be too much for your driver node to handle in the first place, especially if there is a lot of dimensionality to your dataset.

    One workaround I've used, but am not proud of is to use spark to write out the dataframe as a CSV and then have R read that CSV file back in on the next line of the script. Oddly enough, in a few of the cases I did this, the write a file and read the file method actually ended up being faster than a simple collect operation. A lot faster.

    Word of advice- make sure to watch out for partitioning when writing out csv files with spark. You'll get a bunch of csv files and have to do some sort of tmp<- lapply(list_of_csv_files_from_spark, function(x){read.csv(x)}) operation to read in each csv file individually and then maybe a df<-"rbind", tmp) would probably be best to use fread to read in the csvs in place of read.csv as well.

    Perhaps the better question is, what other data munging tasks are you unable to do in Spark that you need R for?

    Good luck. I hope this was helpful. -nate