It’s been a while since I’ve updated this blog, and being more verbose is definitely a resolution for 2017.

In the meantime, I’m happy to say I’m joining YouTube (within Google) and moving to Paris to work on metadata (ingestion, knowledge management, …).

Working on a music dataset, I’ve experienced firsthand the difficulty (hell hole) of matching different sources of data. I’m looking forward to trying again, but this time at a giant scale!

If you’re in Paris, come say hi, or if there are cool meetups / workshops / etc. to attend, let me know. Cheers!

Kryo bug with Spark MLlib Recommender

This does not contain new information, but it took me a while to find the info online.

If you’re using Spark MLlib to do recommendations and you’re serializing using Kryo (which makes things faster), you might have run into a Kryo error when the training data gets large, something about a HashSet:

com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet

You might have tried to register HashSet in Kryo, and it still happens. Even more frustrating, it doesn’t happen on small datasets.

This is a known issue:

When the data gets larger, additional serialization happens, and that’s what triggers the error. To make things more annoying, HashSet is not the (only?) culprit; it’s actually BitSet. Here’s one way to set up your KryoRegistrator to fix it:

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {

Note 1: we use Spark 1.1.0.

Note 2: I didn’t actually test whether you need all 3 lines; it works and I stopped touching it 😉

Hope it helps, cheers!

Collaborative Filtering with Implicit Feedback: Faster Python Development Code

Say this is your situation (it was mine a few days ago):

  • You’re playing around with CF.
  • You have implicit feedback (views, clicks, …).
  • You want to use Hu et al. 2008, one of the most famous matrix factorization algorithms for this case.
  • You want to experiment with this in Python.

You’ve found this great development implementation by Chris Johnson and it’s enough to get you started. But as soon as your matrix gets a little big (e.g. 20k x 50k), you start spending a lot of time waiting for the result.

Now the proper solution is: move to a production implementation! Something like MyMediaLite would speed things up, or my new favorite, Spark MLlib.

But assuming you still want to play around with Python for a little longer before you commit to something (slightly) more involved, I’ve made a multi-threaded version of Chris Johnson’s implementation. You can find the fork here:

It contains the original code and the new one. To use the new file, simply set the parameter num_threads when you create the ImplicitMF instance. A few notes:

  • because of Python’s GIL, I don’t actually use threads, but processes, via the multiprocessing library;
  • the code is not tested (beyond my own use case), and is not optimized for memory, or even for speed really;
  • it does help speed-wise: going from the regular version to 4 threads, run time for my dataset (on a 2014 MacBook Pro) went from 25 minutes to 9, a ~2.8x speedup.
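
To illustrate the processes-instead-of-threads point from the notes above, here is a minimal sketch of the general pattern. It is not the code from the fork. In alternating least squares, each user's factors can be solved independently once the item factors are fixed, so the rows can be handed to a multiprocessing.Pool. To stay dependency-free, the sketch assumes a single latent factor, which turns the per-user solve into a scalar formula. The names solve_user and update_users, and the values of ALPHA and LAMBDA, are hypothetical and chosen only for illustration.

```python
from multiprocessing import Pool

ALPHA = 40.0   # confidence scaling (illustrative value, an assumption)
LAMBDA = 0.1   # regularization strength (illustrative value, an assumption)


def solve_user(task):
    """Closed-form update of one user's factor, for a single latent factor.

    Minimizes sum_i c_ui * (p_ui - x_u * y_i)^2 + LAMBDA * x_u^2, where
    c_ui = 1 + ALPHA * r_ui and p_ui = 1 if r_ui > 0 else 0.
    """
    counts, item_factors = task
    numerator = 0.0
    denominator = LAMBDA
    for r, y in zip(counts, item_factors):
        c = 1.0 + ALPHA * r            # confidence in this observation
        p = 1.0 if r > 0 else 0.0      # binary preference
        numerator += c * p * y
        denominator += c * y * y
    return numerator / denominator


def update_users(count_rows, item_factors, num_processes=1):
    """Update every user factor, optionally spreading rows over processes."""
    tasks = [(row, item_factors) for row in count_rows]
    if num_processes <= 1:
        # Serial path: same computation, no worker processes.
        return [solve_user(task) for task in tasks]
    # Rows are independent, so Pool.map can distribute them across worker
    # processes; results come back in the original row order.
    with Pool(processes=num_processes) as pool:
        return pool.map(solve_user, tasks)
```

On platforms where multiprocessing starts workers by spawning a fresh interpreter, call update_users with num_processes > 1 from under an `if __name__ == "__main__":` guard. Note that every task carries its own copy of the item factors, which is simple but wasteful in memory.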

As I said, using Spark’s MLlib is probably your best solution; it can be called from Scala, Java or Python. But if you want to keep playing with Python locally, my code can help you scale a little bit.


P.S. Want to develop CF solutions in Spark and other large-scale frameworks like Scalding? We’re always hiring!

Actionable Web Analytics

Quick praise for a book I’m currently reading, Actionable Web Analytics: Using Data to Make Smart Business Decisions by Atchison and Burby. It discusses what to measure on a company website and how to leverage it to improve its performance.

It’s a book aimed at managers. It tells you what to do, not how to do it. Still, programmers can learn a lot from it, in particular how to speak the same language as their boss. The examples are interesting, although they are taken from major companies. I do think most of the lessons learned can be applied to startup websites, but the authors assume large resources (money, time, team size) to analyze the data. Anyway, the main goal is to make you think about your own company website and which user flows can be improved.

Overall, if like me you’re new to web analytics, it’s a good place to start and an easy read.