This post doesn't contain new information, but it took me a while to find it online.
If you’re using Spark MLlib to do recommendations and you’re serializing using Kryo (which makes things faster), you might have run into a Kryo error when the training data gets large, something about a HashSet:
com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet
You might have tried to register HashSet in Kryo, and it still happens. Even more frustrating, it doesn’t happen on small datasets.
This is a known issue: https://issues.apache.org/jira/browse/SPARK-1977
When the data gets larger, additional serialization happens, and that's what triggers the error. To make things more annoying, HashSet is not the (only?) culprit; the real one is BitSet. Here's one way to set up your KryoRegistrator to fix it:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.serializer.KryoRegistrator
import scala.collection.mutable

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Register the classes that Kryo trips over on large ALS training sets
    kryo.register(classOf[Rating])
    kryo.register(classOf[mutable.HashSet[_]])
    kryo.register(classOf[mutable.BitSet])
  }
}
Note 1: we use Spark 1.1.0. Note 2: I actually didn't test whether you need all 3 register calls; it works and I stopped touching it 😉
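In case it's useful, here's roughly how that registrator gets wired into the Spark config (a minimal sketch; the app name and the com.example package are placeholders, use the fully qualified name of your own registrator class):

import org.apache.spark.{SparkConf, SparkContext}

// Tell Spark to serialize with Kryo and to use our registrator.
val conf = new SparkConf()
  .setAppName("ALSWithKryo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")
val sc = new SparkContext(conf)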
Hope it helps, cheers!
Hi,
Thanks a lot for your solution.
In fact, I think we just need to register Rating and BitSet:
kryo.register(classOf[Rating])
kryo.register(classOf[mutable.BitSet])
However, the Spark guys have already reverted the default setting in PySpark. Therefore, there will be no such bug in the next version of Spark.
Cheers