Kryo bug with Spark MLlib Recommender

This post doesn't contain new information, but it took me a while to find it online.

If you're using Spark MLlib to compute recommendations and you're serializing with Kryo (which makes things faster), you may have run into a Kryo error once the training data gets large, complaining about a HashSet:

com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet

You might have tried to register HashSet in Kryo, and it still happens. Even more frustrating, it doesn’t happen on small datasets.
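
For context, this typically blows up during ALS training on a large ratings RDD, along these lines (a minimal sketch; sc, the input path, and the hyperparameters are placeholders, not from the original setup):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse "user,item,rating" lines into Rating objects (placeholder input path)
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, item, rating) = line.split(",")
  Rating(user.toInt, item.toInt, rating.toDouble)
}
// Training is the step that triggers the serialization error on large inputs
val model = ALS.train(ratings, 10, 10, 0.01)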

This is a known issue: https://issues.apache.org/jira/browse/SPARK-1977

When the data gets larger, additional serialization kicks in, and that's what triggers the error. To make things more annoying, HashSet is not the (only?) culprit; it's actually BitSet. Here's one way to set up your KryoRegistrator to fix it:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.serializer.KryoRegistrator
import scala.collection.mutable

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Rating])
    kryo.register(classOf[mutable.HashSet[_]])
    kryo.register(classOf[mutable.BitSet])  // the actual culprit, per SPARK-1977
  }
}
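
And for completeness, this is roughly how the registrator gets wired into the Spark configuration (a sketch; the app name and the registrator's fully qualified class name are placeholders for your own):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("recommender")  // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")  // adjust to your package
val sc = new SparkContext(conf)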

Note 1: we use Spark 1.1.0. Note 2: I actually didn't test whether you need all 3 lines; it works, so I stopped touching it 😉

Hope it helps, cheers!

One thought on “Kryo bug with Spark MLlib Recommender”

  1. Hi,

    Thanks a lot for your solution.
    In fact, I think you only need to register Rating and BitSet:

    kryo.register(classOf[Rating])
    kryo.register(classOf[mutable.BitSet])

    However, the Spark developers have already reverted the default setting of PySpark, so there will be no such bug in the next version of Spark.

    Cheers
