Add missing categories in columns in a DataFrame

I have the following Spark DataFrame. The country column has 10 distinct values. I want a new DataFrame like the one shown under Expected Result.

DataFrame
+-------------+--------------+------------------+
|         Code|       country|                t1|
+-------------+--------------+------------------+
|            A|        Canada| 6218.400000000001|
|            A|       Central|              30.4|
|            A|        France|24540.629999999965|
|            A|       Germany|27688.029999999966|
|            A|     Northeast|             51.41|
|            A|     Northwest| 56261.31000000015|
|            A|     Southeast|             55.71|
|            A|     Southwest| 92640.42999999833|
|            A|United Kingdom|              0.64|
|            B|     Australia|145856.31999999806|
|            C|        Canada| 28223.26999999983|
|            C|     Northwest|              0.87|
|            C|     Southwest|              0.44|
+-------------+--------------+------------------+

Distinct values of the country column:
+--------------+
|       country|
+--------------+
|     Australia|
|        Canada|
|       Central|
|        France|
|       Germany|
|     Northeast|
|     Northwest|
|     Southeast|
|     Southwest|
|United Kingdom|
+--------------+

Expected Result :

+-------------+--------------+------------------+
|         Code|       country|                t1|
+-------------+--------------+------------------+
|            A|     Australia|              null|
|            A|        Canada| 6218.400000000001|
|            A|       Central|              30.4|
|            A|        France|24540.629999999965|
|            A|       Germany|27688.029999999966|
|            A|     Northeast|             51.41|
|            A|     Northwest| 56261.31000000015|
|            A|     Southeast|             55.71|
|            A|     Southwest| 92640.42999999833|
|            A|United Kingdom|              0.64|
|            B|     Australia|145856.31999999806|
|            B|        Canada|              null|
|            B|       Central|              null|
|            B|        France|              null|
|            B|       Germany|              null|
|            B|     Northeast|              null|
|            B|     Northwest|              null|
|            B|     Southeast|              null|
|            B|     Southwest|              null|
|            B|United Kingdom|              null|
|            C|     Australia|              null|
|            C|        Canada| 28223.26999999983|
|            C|       Central|              null|
|            C|        France|              null|
|            C|       Germany|              null|
|            C|     Northeast|              null|
|            C|     Northwest|              0.87|
|            C|     Southeast|              null|
|            C|     Southwest|              0.44|
|            C|United Kingdom|              null|
+-------------+--------------+------------------+

How can I achieve this expected output in Scala? I have gone through the Dataset functions/methods but could not find anything that would help me get started.

Note that there could be multiple columns; the same logic applies to each of them, i.e. I want to insert the missing categories against each category in all columns.

I am a beginner with Spark and Scala. Thanks in advance :)

1 answer

  • answered 2019-04-15 06:12 Arnon Rotem-Gal-Oz

    Cross join the distinct codes with the distinct countries, then left join that result to the original table, something like:

    import spark.implicits._  // enables the $"col" syntax; assumes a SparkSession named spark
    val codes = data.select($"Code").distinct
    val countries = data.select($"country").distinct
    val combinations = codes.crossJoin(countries)
    // Join on column names: avoids ambiguous references, since combinations is derived from data
    val result = combinations.join(data, Seq("Code", "country"), "leftouter").orderBy($"Code", $"country")
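The question also mentions that there may be more than one category column. One possible reading is that every combination of several key columns should appear in the output. Under that assumption, the same cross join plus left join idea can be wrapped in a small helper. This is only a sketch: `withMissingCombinations` is a hypothetical name (not a Spark function), the sample rows are a shortened version of the data in the question, and the code is meant to be pasted into spark-shell or a Scala script.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (not part of Spark): pads `df` so that every combination
// of the distinct values of `keyCols` appears, with null in the other columns.
def withMissingCombinations(df: DataFrame, keyCols: Seq[String]): DataFrame = {
  require(keyCols.nonEmpty, "need at least one key column")
  // One single-column DataFrame of distinct values per key column.
  val distincts = keyCols.map(c => df.select(c).distinct())
  // Cross join them to enumerate every possible combination.
  val combinations = distincts.reduce(_ crossJoin _)
  // Left join back to the data; combinations without a match get nulls.
  combinations.join(df, keyCols, "left_outer")
}

// In spark-shell a session already exists; getOrCreate simply returns it.
val spark = SparkSession.builder().master("local[1]").appName("fill-missing").getOrCreate()
import spark.implicits._

// Shortened sample of the data from the question (values rounded).
val data = Seq(
  ("A", "Canada", 6218.4),
  ("B", "Australia", 145856.32),
  ("C", "Canada", 28223.27)
).toDF("Code", "country", "t1")

val result = withMissingCombinations(data, Seq("Code", "country"))
  .orderBy("Code", "country")
result.show()
```

Note that the number of generated rows is the product of the distinct counts of the key columns, so this can grow quickly when several high-cardinality columns are passed in.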