Read only the first n columns of a Spark Dataset

I have a dataset with more than 5000 columns, and an OutOfMemoryException is thrown when I try to read it, even when limiting to 10 rows. There is another post on the cause of the exception, so I want to read only the first n columns to avoid the error. I could not find an API call that does that; only the rows can be restricted, with head or limit. Is there a way to restrict reading to only the first few columns? Thanks.

1 answer

  • answered 2018-10-11 20:00 cheseaux

    Given that your Dataset is ds, you can extract the first n columns into an Array:

    val n = 2
    val firstNCols = ds.columns.take(n)
    

    and then select only these columns from the Dataset:

    ds.select(firstNCols.head, firstNCols.tail:_*)
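    Putting it together, a minimal sketch (assuming a running SparkSession named spark and a hypothetical input path):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Select only the first n columns of a DataFrame.
    // select takes a head column plus varargs, hence the head/tail split.
    def firstNColumns(df: DataFrame, n: Int): DataFrame = {
      val cols = df.columns.take(n)
      df.select(cols.head, cols.tail: _*)
    }

    val ds = spark.read.parquet("/path/to/data")  // hypothetical path
    val narrowed = firstNColumns(ds, 10)
    narrowed.show(10)

    Note that select builds a projection, so with a columnar source such as Parquet, Spark can prune the unread columns at scan time, which is what helps with the memory pressure here.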