Kryo bug with Spark MLlib Recommender

This doesn’t contain anything new, but it took me a while to find the information online.

If you’re using Spark MLlib for recommendations and you’re serializing with Kryo (which makes things faster), you might have run into a Kryo error once the training data gets large, complaining about a HashSet:

com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet

You might have tried to register HashSet in Kryo, and it still happens. Even more frustrating, it doesn’t happen on small datasets.
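For context, the error typically shows up in a job along these lines. This is just a minimal sketch: the app name, input path, and ALS parameters are made up for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object RecommenderJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("recommender")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Hypothetical input: one "user,product,rating" triple per line.
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // Works fine on small samples; throws the KryoException above
    // once the training data gets large enough.
    val model = ALS.train(ratings, 10, 10, 0.01)
  }
}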

This is a known issue: https://issues.apache.org/jira/browse/SPARK-1977

When the data gets larger, additional serialization happens, and that’s what triggers the error. To make things more annoying, HashSet is not the (only?) culprit; it’s actually BitSet. Here’s one way to set up your KryoRegistrator to fix it:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.serializer.KryoRegistrator

import scala.collection.mutable

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Rating])              // MLlib's rating class
    kryo.register(classOf[mutable.HashSet[_]])  // the class from the stack trace
    kryo.register(classOf[mutable.BitSet])      // the actual culprit (SPARK-1977)
  }
}
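To make Spark actually pick up the registrator, point spark.kryo.registrator at it when you build your SparkConf. Something like this, where com.example is whatever package MyKryoRegistrator lives in:

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Use the fully-qualified name of your registrator class here.
  .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")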

Note 1: we use Spark 1.1.0. Note 2: I didn’t actually test whether you need all 3 lines; it works, and I stopped touching it 😉

Hope it helps, cheers!