How does Spark serialization work for case classes?

I've run into something odd in Spark 2.2 and how it deserializes case classes. For these examples, assume this case class:

case class X(a:Int, b:Int) {
  println("in the constructor!!!")
}

If I have the following map operation, I see both my constructor and the value of 'a' messages in the executor logs.

ds.map(x => {
  val x = X(1, 2)
  println(s"a=${x.a})
}

With the following map operation, I do not see my constructor message but I do see the value of 'a' message in the executor logs. The constructor message is in the driver logs.

val x = X(1, 2)
ds.map(x => println(s"a=${x.a}"))

And I get the same behavior if I use a broadcast variable.

val xBcast = sc.broadcast(X(1, 2))
ds.map(x => println(s"a=${xBcast.value.a}"))

Any idea what's going on? Is Spark serializing each field as needed? I would have expected the whole object to be shipped over and deserialized. With that deserialization I'd expect a constructor call.

When I looked at the encoder code for Products it looks like it gets the necessary fields from the constructor. I guess I was assuming it would use those encoders for this kind of stuff.

I even decompiled my case class's class file and the constructor generated seems reasonable.

1 answer

  • answered 2018-10-09 18:16 Levi Ramsey

    Spark is using Java serialization (available because case classes extend Serializable) by default, which does not require the use of a constructor to deserialize. See this StackOverflow question for details on Java serialization/deserialization.

    Note that this reliance on Java serialization can cause issues, as the internal serialization format is not set in stone so JVM version differences can cause deserialization to fail.