Collaborative Filtering with Implicit Feedback: Faster Python Development Code

Say this is your situation (it was mine a few days ago):

  • You’re playing around with CF.
  • You have implicit feedback (views, clicks, …).
  • You want to use Hu et al. 2008, one of the most famous matrix factorization algorithms for this case.
  • You want to experiment with this in Python.
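
For reference, here is a minimal sketch of the objective that Hu et al. 2008 minimizes: every (user, item) cell gets a binary preference (1 if any interaction was observed) and a confidence of 1 + alpha * count, and the model fits the preferences with a confidence-weighted squared error plus L2 regularization. The function name implicit_loss and the default values of alpha and reg are my own illustrative choices, and the dense computation is only meant for toy inputs.

```python
import numpy as np

def implicit_loss(counts, user_vecs, item_vecs, alpha=40.0, reg=0.1):
    """Confidence-weighted squared error of Hu et al. 2008, computed densely.

    counts    : (num_users, num_items) array of raw interaction counts r_ui
    user_vecs : (num_users, num_factors) array, one row x_u per user
    item_vecs : (num_items, num_factors) array, one row y_i per item
    """
    counts = np.asarray(counts, dtype=float)
    preference = (counts > 0).astype(float)   # p_ui = 1 if r_ui > 0, else 0
    confidence = 1.0 + alpha * counts         # c_ui = 1 + alpha * r_ui
    predicted = user_vecs @ item_vecs.T       # x_u . y_i for every (u, i) cell
    fit = np.sum(confidence * (preference - predicted) ** 2)
    penalty = reg * (np.sum(user_vecs ** 2) + np.sum(item_vecs ** 2))
    return fit + penalty

# Toy example: 2 users, 2 items, 2 factors, all vectors starting at zero.
toy_counts = np.array([[1.0, 0.0], [0.0, 2.0]])
start_loss = implicit_loss(toy_counts, np.zeros((2, 2)), np.zeros((2, 2)))
```

Note that the unobserved cells are part of the sum too (with confidence 1), which is what distinguishes this setup from explicit-rating factorization, where only the known ratings contribute.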

You’ve found this great development implementation by Chris Johnson and it’s enough to get you started. But as soon as your matrix gets a little big (e.g. 20k x 50k), you start spending a lot of time waiting for the result.

Now the proper solution is: move to a production implementation! Something like MyMediaLite would speed things up, as would my new favorite, Spark MLlib.

But assuming you still want to play around with Python for a little longer before you commit to something (slightly) more involved, I’ve made a multi-threaded version of Chris Johnson’s implementation. You can find the fork here:

https://github.com/tbertinmahieux/implicit-mf

It contains the original code, mf.py, and the new one, mf_threaded.py. To use the new file, simply set the parameter num_threads when you create the ImplicitMF instance. A few notes:

  • because of Python’s GIL, I don’t actually use threads, but processes, via the multiprocessing library;
  • the code is not tested (beyond my own use case), and is not optimized for memory, or even for speed really;
  • it does help speed-wise: going from the regular version to 4 threads, the run time for my dataset (on a 2014 MacBook Pro) went from 25 minutes to 9, a ~2.8x speedup.
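
To illustrate the process-based approach, here is a minimal sketch of how one half of an ALS iteration can be split across processes. It is not the code from mf_threaded.py: the function names solve_chunk and update_factors, their parameters, and the chunking strategy are assumptions made for this example. Each row's update is independent of the others once the opposite factor matrix is held fixed, so the rows can be divided into chunks and each chunk solved in its own process.

```python
import numpy as np
from multiprocessing import Pool

def solve_chunk(rows, counts, fixed, alpha, reg):
    """Closed-form update for each row index in `rows`, with `fixed` held constant."""
    num_factors = fixed.shape[1]
    ridge = reg * np.eye(num_factors)
    out = np.empty((len(rows), num_factors))
    for k, r in enumerate(rows):
        c = 1.0 + alpha * counts[r]           # confidence weights for this row
        p = (counts[r] > 0).astype(float)     # binary preferences for this row
        A = fixed.T @ (c[:, None] * fixed) + ridge
        b = fixed.T @ (c * p)
        out[k] = np.linalg.solve(A, b)        # solves (Y^T C Y + reg*I) x = Y^T C p
    return out

def update_factors(counts, fixed, alpha=40.0, reg=0.1, num_procs=1):
    """Recompute one factor matrix, splitting the rows of `counts` across processes."""
    all_rows = np.arange(counts.shape[0])
    chunks = [c for c in np.array_split(all_rows, num_procs) if len(c) > 0]
    # Every task carries the full matrices, which is simple but memory hungry.
    tasks = [(chunk, counts, fixed, alpha, reg) for chunk in chunks]
    if num_procs == 1:
        results = [solve_chunk(*task) for task in tasks]
    else:
        with Pool(processes=num_procs) as pool:
            results = pool.starmap(solve_chunk, tasks)
    return np.vstack(results)                 # chunks are contiguous, so order is kept
```

To update the user vectors you would call something like update_factors(counts, item_vectors, num_procs=4), and for the item vectors update_factors(counts.T, user_vectors, num_procs=4). When num_procs is greater than 1, make the call from under an if __name__ == "__main__": guard, since multiprocessing may re-import the main module in the worker processes on some platforms.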

As I said, using Spark’s MLlib is probably your best solution; it can be called from Scala, Java or Python. But if you want to keep playing with Python locally, my code can help you scale a little bit.

Cheers!

P.S. Want to develop CF solutions in Spark and other large-scale frameworks like Scalding? We’re always hiring!

3 thoughts on “Collaborative Filtering with Implicit Feedback: Faster Python Development Code”

  1. Hi Thierry,

    Thanks for the great article and fantastic local implementation in Python.

    I am working on a school project and would love to use Spark MLlib, more specifically ALS.trainImplicit. However, my data is just {user_id, item_id} purchasing data.

    I have not been able to successfully get started with the project. Would you have any insight into how to create an RDD from my data?

    I would really appreciate any help.

    Thanks,
    Alek

    • Hi Alek, if you only have implicit feedback, just assume all users gave you a rating of 1 for those products. In the implicit setup, it means you have equal confidence in all the entries, and you’ve set that confidence to 1.
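
      A minimal sketch of that idea in plain Python, assuming your data is available as a list of (user_id, item_id) pairs. The helper name pairs_to_triples and the sample purchases are made up for illustration. The ids are remapped to consecutive integers because the RDD-based MLlib ALS expects integer user and product ids.

```python
def pairs_to_triples(pairs):
    """Turn (user_id, item_id) pairs into sorted (user_index, item_index, 1.0) triples."""
    user_index, item_index = {}, {}
    triples = set()                      # a set drops repeated purchases of the same item
    for user_id, item_id in pairs:
        # setdefault assigns the next free integer the first time an id is seen
        u = user_index.setdefault(user_id, len(user_index))
        i = item_index.setdefault(item_id, len(item_index))
        triples.add((u, i, 1.0))         # implicit "rating" of 1 for every observed pair
    return sorted(triples), user_index, item_index

# Hypothetical purchase log, including one repeated purchase.
purchases = [("alek", "book"), ("alek", "pen"), ("bo", "book"), ("alek", "book")]
triples, user_index, item_index = pairs_to_triples(purchases)
```

      In PySpark you could then wrap each triple in a Rating (from pyspark.mllib.recommendation), build the RDD with sc.parallelize, and pass it to ALS.trainImplicit together with a rank of your choosing. If you would rather keep the purchase counts than collapse them to 1, sum them per (user, item) pair instead of using a set.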

  2. Hi,

    Thanks for the fantastic implementation. It’s really helpful. I was going through the code and trying to understand it. Could you possibly tell me the relevance of this line:

    norm_diff = scipy.linalg.norm(user_vectors_old - self.user_vectors) + scipy.linalg.norm(item_vectors_old - self.item_vectors)

    Is the difference between the previously computed vectors and the newly computed vectors supposed to shrink from one iteration to the next?
