Concatenating unique values from two Spark dataframes into one

I have two Spark dataframes with different values that I would like to concatenate:

df:

c1    c2
A     D
B     E
B     F

df2:

A    B
key1 4
key2 5
key3 6

I would like to concatenate the unique values of certain columns (here c1 from df and A from df2) into a single dataframe, with a second column recording which dataframe each value came from. Thus, the output would be:

res:

values      origin
A           first
B           first
key1        second
key2        second
key3        second

1 answer

  • answered 2022-01-19 17:34 blackbishop

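    A simple union should do the job. Select the distinct values of the relevant column from each dataframe, tag them with their origin, then union the two results: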

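    import pyspark.sql.functions as F
    
    # Distinct values of c1 from the first dataframe (called df in the question)
    first = df.selectExpr("c1 as value").distinct().withColumn("origin", F.lit("first"))
    
    # Distinct values of A from the second dataframe, tagged the same way
    second = df2.selectExpr("A as value").distinct().withColumn("origin", F.lit("second"))
    
    # union matches columns by position and does not deduplicate rows
    res = first.union(second)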
    
