How to pivot a huge data set in Spark
I have a 15TB data set structured in the following way (the actual data is tab-delimited, but for clarity I've replaced tabs with commas here):
id, date, time, key, value
1, 'date1', 'time1', 'key1', 1
1, 'date1', 'time1', 'key2', 0
2, 'date2', 'time2', 'key2', 10
...
Thankfully, the data appears to be sorted by id, date, and time. There are approximately 100 distinct keys. I would like to pivot the data set into the following format:
id, date, time, key1, key2, ..., keyN
1, 'date1', 'time1', 1, 0, ...
2, 'date2', 'time2', null, 10, ...
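To make the intended transformation concrete: my understanding is that in Spark this corresponds to something like `df.groupBy("id", "date", "time").pivot("key").agg(first("value"))`, where groups that never see a given key get null for that column. Here is a minimal pure-Python sketch of that semantics on a toy sample (the sample rows and the two-key list are made up for illustration; the real data has ~100 keys):

```python
from collections import defaultdict

# Toy sample mimicking the tab-delimited input (id, date, time, key, value).
rows = [
    (1, 'date1', 'time1', 'key1', 1),
    (1, 'date1', 'time1', 'key2', 0),
    (2, 'date2', 'time2', 'key2', 10),
]

keys = ['key1', 'key2']  # in reality ~100 distinct keys

# Group rows by (id, date, time), then spread each key into its own column;
# keys missing from a group become None (i.e. null in the output).
grouped = defaultdict(dict)
for id_, date, time, key, value in rows:
    grouped[(id_, date, time)][key] = value

pivoted = [
    (*group, *(vals.get(k) for k in keys))
    for group, vals in grouped.items()
]
# pivoted == [(1, 'date1', 'time1', 1, 0), (2, 'date2', 'time2', None, 10)]
```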
I have several questions:
Given that I am attempting to do this in Spark, what would be the best high-level approach to accomplish this on EMR?
How would I parallelize this, given that the files are already well sorted?
What would be the best output format? Should I be considering a columnar format like Parquet, or is a flat CSV/TSV file a good option?