How to join two ARRAY<STRUCT> fields on a join key in Spark SQL 3.2

Situation

I have a nested data structure with multiple ARRAY<STRUCT> fields, and I'd like to flatten it into a single flat table. The challenge is that the structs in those array fields share a reference key, and the values belonging to the same key need to end up in the same row.

Test data

Test data definition

%%sql
CREATE OR REPLACE TEMPORARY VIEW school_class AS (
    SELECT
        'junior' AS student_age
        , array(
            named_struct('name', 'Andy',    'birthday', '2000-01-01')
          , named_struct('name', 'Beth',    'birthday', '2000-01-02')
          , named_struct('name', 'Charley', 'birthday', '2000-01-03')
          , named_struct('name', 'Doris',   'birthday', '2000-01-04')
        ) AS students
        , array(
            named_struct('subject', 'math', 'name', 'Andy',    'grade', 'A', 'favorite_number', 7 )
          , named_struct('subject', 'math', 'name', 'Charley', 'grade', 'C', 'favorite_number', 9)
          , named_struct('subject', 'math', 'name', 'Beth',    'grade', 'B', 'favorite_number', 42)
          
        ) AS students_math
        , array(
            named_struct('subject', 'chemistry', 'name', 'Beth',    'grade', 'B', 'favorite_element', 'Ti')
          , named_struct('subject', 'chemistry', 'name', 'Charley', 'grade', 'A', 'favorite_element', 'Kr')
          , named_struct('subject', 'chemistry', 'name', 'Doris',   'grade', 'A', 'favorite_element', 'Ne')
        ) AS students_chemistry
);

Schema of test data

spark.read.table('school_class').printSchema()
root
 |-- student_age: string (nullable = false)
 |-- students: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- birthday: string (nullable = true)
 |-- students_math: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- subject: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- grade: string (nullable = true)
 |    |    |-- favorite_number: integer (nullable = true)
 |-- students_chemistry: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- subject: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- grade: string (nullable = true)
 |    |    |-- favorite_element: string (nullable = true)

Desired result

I'd like to end up with a flat table as follows:

| student_age | name    | birthday   | grade_math | grade_chemistry |
| ----------- | ------- | ---------- | ---------- | --------------- |
| junior      | Andy    | 2000-01-01 | A          | null            |
| junior      | Beth    | 2000-01-02 | B          | B               |
| junior      | Charley | 2000-01-03 | C          | A               |
| junior      | Doris   | 2000-01-04 | null       | A               |

This can be achieved using CTEs/subqueries and explodes:

%%sql
WITH
  aux_students AS (
      SELECT 
        student_age
        , stud.name
        , stud.birthday
      FROM
        school_class
      LATERAL VIEW explode(students) exploded_students AS stud
  ),
  aux_students_m AS (
      SELECT 
        student_age
        , stud_m.name
        , stud_m.grade  AS grade_math
      FROM
        school_class
      LATERAL VIEW OUTER explode(students_math) exploded_students AS stud_m
  ),
  aux_students_ch AS (
      SELECT 
        student_age
        , stud_ch.name
        , stud_ch.grade AS grade_chemistry
      FROM
        school_class
      LATERAL VIEW OUTER explode(students_chemistry) exploded_students AS stud_ch
  )
SELECT aux_students.*, aux_students_m.grade_math, aux_students_ch.grade_chemistry  FROM aux_students
LEFT JOIN aux_students_m ON (aux_students.student_age = aux_students_m.student_age AND aux_students.name = aux_students_m.name)
LEFT JOIN aux_students_ch ON (aux_students.student_age = aux_students_ch.student_age AND aux_students.name = aux_students_ch.name);

Is there a better way?

I am looking for a more elegant way, since I expect my current solution not to scale well once the original data structure contains thousands of rows. I think that in that case the join would consider more rows than it actually needs to, because each of the views contains all rows for all "student_age" values.

I stumbled across the arrays_zip function. This is nearly what I want: it combines two or more ARRAY<STRUCT> fields. However, it does not take a join criterion (like the ON clause in SQL) into account; it pairs the array elements purely by their position:

%%sql
SELECT
  student_age
  , s.students.name
  , s.students.birthday
  , s.students_math.grade AS grade_math
  , s.students_chemistry.grade AS grade_chemistry
FROM
  school_class
LATERAL VIEW explode(arrays_zip(students, students_math, students_chemistry)) expl AS s
;

The result is wrong 🔴, since not all elements are present in all arrays and the elements are not sorted on the join key ("name"). Result:

| student_age | name    | birthday   | grade_math | grade_chemistry |
| ----------- | ------- | ---------- | ---------- | --------------- |
| junior      | Andy    | 2000-01-01 | A          | B 🔴            |
| junior      | Beth    | 2000-01-02 | C 🔴       | A 🔴            |
| junior      | Charley | 2000-01-03 | B 🔴       | A               |
| junior      | Doris   | 2000-01-04 | null       | null 🔴         |
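
To make the mismatch concrete, here is a minimal plain-Python model of the two pairing strategies. It is not Spark code and only a sketch: the tuples are a simplified stand-in for the structs in school_class, and the function names pair_by_position and pair_by_key are made up for illustration. Pairing by position (which is how arrays_zip behaves, padding shorter arrays with null) reproduces the rows of the table above, while a lookup by name gives the desired result.

```python
from itertools import zip_longest

# Simplified stand-in for the arrays in school_class: (name, grade) pairs.
students = ["Andy", "Beth", "Charley", "Doris"]
students_math = [("Andy", "A"), ("Charley", "C"), ("Beth", "B")]
students_chemistry = [("Beth", "B"), ("Charley", "A"), ("Doris", "A")]


def pair_by_position(names, math, chemistry):
    """Combine the i-th elements of each list; shorter lists are padded with None."""
    rows = []
    for name, m, c in zip_longest(names, math, chemistry):
        rows.append((name, m[1] if m else None, c[1] if c else None))
    return rows


def pair_by_key(names, math, chemistry):
    """Look up each student's grades by name; missing entries become None."""
    math_by_name = dict(math)
    chemistry_by_name = dict(chemistry)
    return [(n, math_by_name.get(n), chemistry_by_name.get(n)) for n in names]


positional = pair_by_position(students, students_math, students_chemistry)
keyed = pair_by_key(students, students_math, students_chemistry)

for row in positional:
    print("by position:", row)
for row in keyed:
    print("by key:     ", row)
```

In this model only the keyed variant assigns Doris her chemistry grade, because her entry sits at position 2 of the chemistry list but at position 3 of the student list.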

Questions

  1. Is there functionality similar to arrays_zip that lets me specify the join condition?
  2. Is there a workaround, e.g. making sure that all elements are present in all arrays (by creating dummy null values) and sorting the arrays before applying arrays_zip?
  3. Is there a completely different approach for solving my problem?
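
The idea behind question 2 can be sketched in plain Python. This is only a model of the logic, not Spark code: the (name, grade) tuples stand in for the structs, and pad_and_sort is a hypothetical helper name. Each subject list is padded with a dummy entry for every student it lacks, all lists are sorted on the name, and only then are they paired by position.

```python
# Simplified stand-in for the arrays in school_class: (name, grade) pairs.
students = ["Andy", "Beth", "Charley", "Doris"]
students_math = [("Andy", "A"), ("Charley", "C"), ("Beth", "B")]
students_chemistry = [("Beth", "B"), ("Charley", "A"), ("Doris", "A")]


def pad_and_sort(names, entries):
    """Add a (name, None) dummy for every missing name, then sort by name."""
    present = {name for name, _ in entries}
    padded = list(entries) + [(n, None) for n in names if n not in present]
    # Sort on the name only, so None grades are never compared.
    return sorted(padded, key=lambda entry: entry[0])


math_aligned = pad_and_sort(students, students_math)
chemistry_aligned = pad_and_sort(students, students_chemistry)

# After padding and sorting, pairing by position lines up with the key.
rows = [
    (name, m[1], c[1])
    for name, m, c in zip(sorted(students), math_aligned, chemistry_aligned)
]

for row in rows:
    print(row)
```

On the test data this yields the desired grades per student. Whether the padding step can be expressed in Spark SQL without an explode and a join is exactly what the question leaves open.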