Maybe this is your situation; it was mine a few days ago:
- You’re playing around with CF.
- You have implicit feedback (views, clicks, …).
- You want to use Hu et al. 2008, one of the most famous matrix factorization algorithms for this case.
- You want to experiment with this in Python.
You’ve found this great implementation by Chris Johnson, and it’s enough to get you started. But as soon as your matrix gets a little big (e.g. 20k x 50k), you start spending a lot of time waiting for results.
But assuming you still want to play around with Python for a little longer before you commit to something (slightly) more involved, I’ve made a multi-threaded version of Chris Johnson’s implementation. You can find the fork here:
It contains the original code, mf.py, and the new one, mf_threaded.py. To use the new file, simply set the num_threads parameter when you create the ImplicitMF instance. A few notes:
- because of Python’s GIL, I don’t actually use threads, but processes, via the multiprocessing library;
- the code is not tested (beyond my own use case), and is not optimized for memory, or even for speed really;
- it does help speed-wise: going from the regular version to 4 threads, run time for my dataset (on a 2014 MacBook Pro) went from 25 minutes to 9, a ~2.5x speedup.
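The pattern the notes describe — sidestepping the GIL by farming out per-user updates to worker processes — can be sketched roughly as follows. This is a simplified illustration, not the fork’s actual code: it solves a plain ridge regression per user (real implicit-feedback ALS à la Hu et al. weights each entry by a confidence derived from the counts), and the helper names `solve_user` and `update_users_parallel` are mine:

```python
import numpy as np
from multiprocessing import Pool

def solve_user(args):
    """Solve one user's update: x_u = (Y^T Y + reg*I)^{-1} Y^T r_u.

    Simplified: implicit ALS would also fold per-entry confidence
    weights into the normal equations.
    """
    Y, r_u, reg = args
    A = Y.T @ Y + reg * np.eye(Y.shape[1])
    return np.linalg.solve(A, Y.T @ r_u)

def update_users_parallel(Y, R, reg=0.1, num_procs=4):
    """Update all user vectors, one pool task per user.

    Processes (not threads) so the linear algebra for different
    users runs truly in parallel despite the GIL.
    """
    tasks = [(Y, R[u], reg) for u in range(R.shape[0])]
    with Pool(num_procs) as pool:
        rows = pool.map(solve_user, tasks)
    return np.vstack(rows)
```

Within one half-iteration the user updates share no mutable state (each reads the fixed item factors `Y` and its own row of ratings), which is exactly why this splits cleanly across processes; the cost is pickling the arrays to each worker, which is part of why the speedup is ~2.5x on 4 workers rather than 4x.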
As I said, using Spark’s MLlib is probably your best solution; it can be called from Scala, Java, or Python. But if you want to keep playing with Python locally, my code can help you scale a little bit.
P.S. want to develop CF solutions in Spark and other large-scale frameworks like Scalding? We’re always hiring!