Extract JSON data from a StringType column in Spark SQL

There is a Hive table with a single column of type string.

hive> desc logical_control.test1;
OK
test_field_1          string                  test field 1
val df2 = spark.sql("select * from logical_control.test1")

df2.printSchema()
root
|-- test_field_1: string (nullable = true)
df2.show(false)
+------------------------+
|test_field_1            |
+------------------------+
|[[str0], [str1], [str2]]|
+------------------------+

How can I transform it into a structured column like the one below?

root
|-- A: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- S: string (nullable = true)

I tried to recover it using the original schema that the data had before it was written to HDFS, but json_data is null.

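import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Target schema: a struct with one field A, an array of structs with one string field S
val schema = StructType(
    Seq(
      StructField("A", ArrayType(
        StructType(
          Seq(
            StructField("S", StringType, nullable = true))
        )
      ), nullable = true)
    )
  )

val df3 = df2.withColumn("json_data", from_json(col("test_field_1"), schema))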

df3.printSchema()
root
|-- test_field_1: string (nullable = true)
|-- json_data: struct (nullable = true)
|    |-- A: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- S: string (nullable = true)
df3.show(false)
+------------------------+---------+
|test_field_1            |json_data|
+------------------------+---------+
|[[str0], [str1], [str2]]|null     |
+------------------------+---------+

1 answer

  • answered 2020-03-25 20:09 werner

    If the structure of test_field_1 is fixed and you don't mind "parsing" the field yourself, you can use a UDF to perform the transformation:

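    import org.apache.spark.sql.functions.{col, udf}

    // Wrapper case class; it becomes the struct element with a single field S
    case class S(S: String)
    // Strip all square brackets, split on commas, and wrap each trimmed token
    def toArray: String => Array[S] = _.replaceAll("[\\[\\]]", "").split(",").map(s => S(s.trim))
    val toArrayUdf = udf(toArray)

    val df3 = df2.withColumn("json_data", toArrayUdf(col("test_field_1")))
    df3.printSchema()
    df3.show(false)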
    

    prints

    root
     |-- test_field_1: string (nullable = true)
     |-- json_data: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- S: string (nullable = true)
    
    +------------------------+------------------------+
    |test_field_1            |json_data               |
    +------------------------+------------------------+
    |[[str0], [str1], [str2]]|[[str0], [str1], [str2]]|
    +------------------------+------------------------+
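
Since test_field_1 is nullable, a parsing function that calls replaceAll directly would throw a NullPointerException on a null row. The following is a minimal null-safe sketch of the same parsing idea in plain Scala, with no Spark dependency so it can be tried in a REPL. The name toArraySafe and the choice to return null for null input and to drop empty tokens are assumptions made for illustration, not part of the original answer.

```scala
// Wrapper case class: each parsed token becomes one struct-like element with field S
case class S(S: String)

// Hypothetical null-safe variant of the parsing function:
// returns null for null input, otherwise strips brackets, splits on commas,
// trims each token, and drops empty tokens (so "[]" yields an empty sequence)
def toArraySafe(raw: String): Seq[S] =
  Option(raw)
    .map { s =>
      s.replaceAll("[\\[\\]]", "")
        .split(",")
        .map(_.trim)
        .filter(_.nonEmpty)
        .map(token => S(token))
        .toSeq
    }
    .orNull
```

Assuming Spark's usual handling of case classes in UDF return types, this could probably be registered with udf(toArraySafe _) in the same way as the version above, and null rows would then stay null instead of failing the job.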
    

    The tricky part is creating the second level (element: struct) of the structure. I used the case class S to create this struct.
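
As a side note, the string [[str0], [str1], [str2]] is not valid JSON (the values are unquoted), which is why from_json returns null for it. If a UDF is not desired, the same result can probably be reached with built-in functions alone. The sketch below assumes Spark 2.4 or later (for the SQL higher-order function transform) and builds a one-row stand-in for df2 so that it is self-contained; the local SparkSession, the app name, the intermediate column parts, and the name df4 are all illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, regexp_replace, split}

// Local session only for this sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("bracket-parse-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the Hive table from the question
val df2 = Seq("[[str0], [str1], [str2]]").toDF("test_field_1")

val df4 = df2
  // Remove all square brackets, then split on commas followed by optional whitespace
  .withColumn("parts", split(regexp_replace(col("test_field_1"), "[\\[\\]]", ""), ",\\s*"))
  // Wrap every token in a struct with a single field S
  .withColumn("json_data", expr("transform(parts, x -> named_struct('S', x))"))
  .drop("parts")

df4.printSchema()
df4.show(false)
```

Compared with the UDF, this keeps the whole transformation inside Spark's own expressions, and a null input row simply yields a null json_data value.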