Appending a longer column to a Spark dataframe

Assume I have a dataframe of length X and I want to add a column with values from a file; the new column can be longer or shorter than the current dataframe. The missing values on either side should be filled with nulls.

For example:

   +-----------+---+----+----------+
   |       Jack|Jan|Alex| Alexander|
   +-----------+---+----+----------+
   | 2342342342| 41| 234|1459204800|
   | 2348972342| 22|   2|1459294200|
   | 1234342342| 31|    |1459595700|
   +-----------+---+----+----------+

and I want to add a column (Victor) with the values (2133, 234, 3345435, 13123, 4345), which is longer by two rows than the dataframe. The result should look as follows:

   +-----------+---+----+----------+---------+
   |       Jack|Jan|Alex| Alexander|   Victor|
   +-----------+---+----+----------+---------+
   | 2342342342| 41| 234|1459204800|     2133|
   | 2348972342| 22|   2|1459294200|      234|
   | 1234342342| 31|    |1459595700|  3345435|
   |           |   |    |          |    13123|
   |           |   |    |          |     4345|
   +-----------+---+----+----------+---------+

I played around with row_number and a window function, but they don't seem efficient to me. Perhaps a simple withColumn would work?
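
For reference, here is roughly what I tried (a sketch only; df stands for the existing dataframe and victor_df for a one-column dataframe built from the file). Since the window has no partition column, Spark moves all rows into a single partition, which is why it felt inefficient:

   from pyspark.sql import Window
   from pyspark.sql.functions import lit, row_number

   # a global window with no partitionBy forces all rows onto one partition
   w = Window.orderBy(lit(1))

   df_idx = df.withColumn("row_id", row_number().over(w))
   victor_idx = victor_df.withColumn("row_id", row_number().over(w))

   joined = df_idx.join(victor_idx, "row_id", "outer").drop("row_id")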

Also, do I gain anything by storing the data like this? Wouldn't it be better to store it in the usual row-based form?

   +-----+-----------+
   | Name|      Value|
   +-----+-----------+
   | Jack| 2342342342|
   | Jack| 2348972342|
   | Jack| 1234342342|
   |  ...|           |

1 answer

  • answered 2018-02-13 10:20 pratiklodha

    You can add an index column to both dataframes like this:

    from pyspark.sql.functions import monotonically_increasing_id

    ddf1 = df1.withColumn("row_id", monotonically_increasing_id())
    ddf2 = df2.withColumn("row_id", monotonically_increasing_id())
    

    and then perform an outer join (or a left join, with the bigger dataframe on the left):

    result = ddf1.join(ddf2, ["row_id"], "outer")
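
    Note that monotonically_increasing_id() only guarantees increasing, not consecutive, IDs, and the generated values depend on how each dataframe is partitioned, so the two row_id columns are not guaranteed to line up row by row. A safer sketch (assuming df1 and df2 are the two dataframes from above) builds the index with zipWithIndex, which is consecutive and 0-based:

    def with_row_id(df):
        # zipWithIndex assigns consecutive 0-based indices regardless of partitioning
        return (
            df.rdd
            .zipWithIndex()
            .map(lambda pair: pair[0] + (pair[1],))  # a Row is a tuple, so append the index
            .toDF(df.columns + ["row_id"])
        )

    result = (
        with_row_id(df1)
        .join(with_row_id(df2), ["row_id"], "outer")
        .orderBy("row_id")
        .drop("row_id")
    )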