New venture

Well, I definitely didn’t keep my promise of writing more often, but I have some news on my end: I’m leaving YouTube to co-found a startup.

It’s in the AI + retail space; I’ll share more as it evolves. I’m still in Paris for now, although there will be more regular trips to NYC.

I haven’t been this excited since joining Oyster, wish me luck!


It’s been a while since I’ve updated this blog, and being more verbose is definitely a resolution for 2017.

In the meantime, I’m happy to say I’m joining YouTube (within Google) and moving to Paris to work on metadata (ingestion, knowledge management, …).

Working on a music dataset, I experienced first-hand the difficulty (hell hole) of matching different sources of data. I’m looking forward to trying again, but this time at giant scale!

If you’re in Paris come say hi, or if there are cool meetups / workshops / etc to attend let me know. Cheers!

Kryo bug with Spark MLlib Recommender

This does not contain new information, but it took me a while to find the info online.

If you’re using Spark MLlib to do recommendations and you’re serializing using Kryo (which makes things faster), you might have run into a Kryo error when the training data gets large, something about a HashSet:

com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet

You might have tried to register HashSet in Kryo, and it still happens. Even more frustrating, it doesn’t happen on small datasets.

This is a known issue

When the data gets larger, additional serialization happens, and that’s what triggers the error. To make things more annoying, the HashSet is not the (only?) culprit; it’s actually BitSet. Here’s one way to set up your KryoRegistrator to fix it:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[scala.collection.mutable.BitSet])
    kryo.register(classOf[scala.collection.mutable.HashSet[_]])
    kryo.register(classOf[org.apache.spark.mllib.recommendation.Rating])
  }
}

Note 1: we use Spark 1.1.0. Note 2: I didn’t actually test whether you need all 3 lines; it works and I stopped touching it 😉
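For completeness, the registrator also has to be declared in your Spark configuration (here via spark-defaults.conf; the same keys work on SparkConf). This assumes MyKryoRegistrator is on the application classpath; use the fully qualified name if it lives in a package:

```properties
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator  MyKryoRegistrator
```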

Hope it helps, cheers!

Collaborative Filtering with Implicit Feedback: Faster Python Development Code

Let’s assume you’re in the following situation (it was mine a few days ago):

  • You’re playing around with CF.
  • You have implicit feedback (views, clicks, …).
  • You want to use Hu et al. 2008, one of the most famous matrix factorization algorithms for this case.
  • You want to experiment with this in Python.

You’ve found this great development implementation by Chris Johnson and it’s enough to get you started. But as soon as your matrix gets a little big (e.g. 20k x 50k), you start spending a lot of time waiting for the result.

Now the proper solution is: move to a production implementation! Something like MyMediaLite would speed things up, or my new favorite, Spark MLlib.

But assuming you still want to play around with Python for a little longer before you commit to something (slightly) more involved, I’ve made a multi-threaded version of Chris Johnson’s implementation. You can find the fork here:

It contains the original code and the new one. To use the new version, simply set the parameter num_threads when you create the ImplicitMF instance. A few notes:

  • because of Python’s GIL, I don’t actually use threads, but processes, via the multiprocessing library;
  • the code is not tested (beyond my own use case), and is not optimized for memory, or even for speed really;
  • it does help speed-wise: going from the regular version to 4 threads, run time for my dataset (on a 2014 MacBook Pro) went from 25 minutes to 9, a ~2.8x speedup.
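The parallelization pattern is simple enough to sketch. This is a minimal illustration of the idea, not the fork’s actual internals; all names here (update_user_block, parallel_update) are hypothetical:

```python
# Because of the GIL, CPU-bound work is split across processes,
# not threads, via the multiprocessing library.
from multiprocessing import Pool

def update_user_block(user_ids):
    # Stand-in for solving one least-squares system per user;
    # the real code would update each user's latent factor vector here.
    return [(u, u * u) for u in user_ids]

def parallel_update(user_ids, num_threads):
    # Split users into num_threads interleaved chunks and
    # update each chunk in its own process.
    chunks = [user_ids[i::num_threads] for i in range(num_threads)]
    with Pool(num_threads) as pool:
        results = pool.map(update_user_block, chunks)
    # Merge the per-process results back into one mapping.
    return dict(pair for block in results for pair in block)
```

Each iteration of the alternating least-squares loop fans out like this and then merges the factors, which is why the speedup is sublinear in the number of processes.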

As I said, using Spark’s MLlib is probably your best solution, it can be called from Scala, Java or Python. But if you want to keep playing with Python locally, my code can help you scale a little bit.


P.S. want to develop CF solutions in Spark and other large-scale frameworks like Scalding? We’re always hiring!

New job: from beauty to books

A few words on a big change for me: I’m leaving Birchbox to join Oyster Books, starting tomorrow.

It’s never easy to leave a position you enjoy, especially one that let me grow so much professionally. In my 15 months at Birchbox, I had the chance to play with search, recommendation, prediction, optimization & integer programming, MySQL, Redis, Flume, Tornado, Mixpanel, MyMediaLite, … and I’m forgetting many more. I use the term “play” because when you’re in a small team with a great boss who encourages you to pursue most of your ideas, there’s no reason not to have fun while doing it. Birchbox is a great place to work, and I’ve never seen a company value its employees more (they are always hiring, including in the Data Science team, take a look!).

So, why leave a company when it’s opening its first store and raising $60M in a round B? Good question; and as always, the answer came from a combination of many things. But a few reasons why I’m so excited to start at Oyster tomorrow include:

  • recommendation and search are part of their DNA: if you can’t find a book to read, you won’t read;
  • a very clear and focused vision: remove any hurdle that prevents someone from reading, so you read more, more often;
  • a great group of core developers that have already proven themselves with the release of the app;
  • the opportunity to help start their Data Science team, so it can keep up with (and fuel) product growth;
  • the belief that the unlimited subscription model is the right model for books;
  • I’ve always loved books;
  • the app is so pretty!

There are opportunities too good to pass up, and joining Oyster Books was one of them. Of course, today I’m both excited and a little scared of what comes next, but “good scared”. Can’t wait!


Good books and Stack Overflow

Quick post, mostly to recommend two books:

  • Search Patterns: as you may know from my previous post, we are brainstorming about search at my work, and this book is a great place to get the discussion started. Not very technical, its clear goal is to make you think about search in novel ways, not to teach you how to copy what has already been done. Great read!
  • Programming Pig: to get started with this high-level language for Hadoop, written by one of the original engineers working on Pig at Yahoo. It’s slowly getting out of date, but still a good place to learn the fundamentals of Pig.

Otherwise, after years of using it, I finally started contributing to Stack Overflow. I hadn’t realized how exciting it is to win reputation… and how competitive it gets when a question has an easy answer: you only have a few minutes to write it!
My only concern: so far, my answers have mostly been hacks (e.g. this format switcher) instead of deep, insightful comments about a language or program. But if it’s useful to someone, great! I’m glad I can finally upvote answers and write comments.

Finally, I’ve discovered Reddit AWW and can’t get enough 😉

Investigating Google Site Search

The company I work at is wondering how search can be improved on our site (the current solution is not state-of-the-art). To explore our options, I spent a few days building a minimum viable product (MVP) with Google Site Search (GSS). This is what I learned in the process: how to get started, how well it works, and its limitations. First takeaway: Google’s documentation is terrible!

What is Google Site Search?

Here is the website; you can think of it as the paid version of Google’s Custom Search Engine (CSE). A CSE is similar to the regular Google search, but you can specify which pages to index (or not to index), and you have some control over the ranking (e.g. by boosting some pages when they’re relevant).

How to get started?

A very simple (free) example: in CSE, create a new custom search engine (give it a cool name, like SearchMan or SuperSearch), then in ‘edit search engine’ / ‘setup’ / ‘basic’, under ‘sites to search’, add your site, and that’s it! You have a search engine that only returns results from that site. Try it by typing ‘shampoo’ in the right panel. You can also get a public link to it where anyone can use the search engine; it looks like a modified Google homepage.

GSS uses your custom search engine; you pay to be able to query it more (and to get rid of ads and get an XML feed, more on this later). So, before going any further, spend time with your CSE to make sure you like the results: play with indexing only parts of your website, or excluding some, specify autocompletion words, etc. Pass it to other people around your organization for feedback. If you don’t like the results, they won’t change with the paid version!

Is GSS a viable option?

I assume you have a CSE you like and you’re thinking “wouldn’t it be great if those were the results on my website, but with a different UI?”. At the same time, GSS is a paid service, as opposed to free, self-hosted solutions such as Solr or ElasticSearch. So… not an obvious choice! GSS requires little maintenance, it’s ready to work from the get-go, and the service should not fail. The highest published price used to be (last week) $12K/year for up to 3M queries. Now it says $2K/year for <500K queries; you have to contact them in case of more traffic. Still, assuming you have a reasonably successful website at 1M queries/year, $12K might be worth it once you factor in the extra engineering time for a good Solr implementation and the additional risks.

Price is not the only factor, though. To make GSS do everything you want, you might have to modify your frontend code (to add tags and labels that Google crawlers will pick up). There are also other customization limits we’ll discuss later. But GSS deserves an MVP!

Getting started with the XML feed

GSS comes with its own search box and search engine result page (SERP) implementations, but chances are you’ll want to fully customize those to fit your website’s look and feel. Don’t bother looking at Google CSE themes, they won’t get you there. You want to call GSS through an API and deal with the results yourself. For that, you need to move away from basic CSE and buy the 200K queries/year option for $100 (using Google Wallet).

Don’t look at the JSON API! Its pricing is different and doesn’t change if you buy GSS. There’s also a hard limit of 10K queries/day that can easily be reached on a high-traffic day.

So, XML API it is. You query your public custom search engine using this call:
where CSE_ID is a set of digits + ‘:’ + letter/digit hash identifying your CSE. You can find it from your public URL. Here is a sample result.

Side note: to deal with XML, I use Hash.from_xml in Ruby (my MVP uses Rails, no comment…) or xmltodict in Python. Note that different Ruby versions seem to return slightly different Hashes from the same XML. So… stick to one version? Don’t use Rails?

Anatomy of an XML GSS response

Read the full documentation; this is just a cheat sheet of the few XML objects I found useful. A sample result is available.

  • TM is the response time in milliseconds.
  • Q is the query.
  • RES contains the results, with the number of results as an attribute.
  • R is an individual result, containing the fields below.
  • U is the page link.
  • T is the page title.
  • S is the summary.
  • PageMap contains info crawled from the website in a DataObjects element.
  • cse_thumbnail, inside a DataObject, contains the thumbnail link hosted by Google.

Title, link, thumbnail, summary… that should be enough to get started with your MVP!
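Extracting those fields is straightforward with Python’s standard library. The sample XML below is hand-made for illustration (the real response has more elements and attributes); only the tag names from the cheat sheet above are assumed:

```python
# Sketch: pulling the useful fields out of a GSS XML response.
import xml.etree.ElementTree as ET

SAMPLE = """
<GSP>
  <TM>0.12</TM>
  <Q>shampoo</Q>
  <RES SN="1" EN="1">
    <R N="1">
      <U>https://example.com/products/shampoo</U>
      <T>Great Shampoo</T>
      <S>A very nice shampoo for all hair types.</S>
    </R>
  </RES>
</GSP>
"""

def parse_results(xml_text):
    # Walk the GSP tree: RES holds the result set, each R is one hit.
    root = ET.fromstring(xml_text)
    results = []
    for r in root.findall("./RES/R"):
        results.append({
            "link": r.findtext("U"),
            "title": r.findtext("T"),
            "summary": r.findtext("S"),
        })
    return {"query": root.findtext("Q"), "results": results}
```

From there, your /search page template just loops over the parsed results.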

Customizing the search box

So far I have assumed that your app is responsible for the search box: you query GSS using the link above and display results based on the XML answer. Problem is, some of the cool GSS features are only available through their widget, the main one being autocompletion. But if you install the widget, it also displays the results…

One way to go is to install the widget using the code provided in the CSE panel. Use version 1, as it is more customizable. It will give you a widget with autocompletion. Now, use the following callback to prevent the widget from actually searching and presenting the results:

customSearchControl.setSearchStartingCallback({}, function() {
  var q = customSearchControl.getInputQuery();
  window.location = '/search?q=' + encodeURIComponent(q);
});

What’s happening is that, when the user presses search, my function is called. It gets the query and loads another page (/search) with the query as an argument. This new page (which can be the one you’re currently on!) will see the ‘q’ param, call GSS, and display the results. That’s it! It’s hacky, but it works: you have the power of the GSS search box with a fully customized SERP.

Another trick: to remove the Google logo from the search box, add this to your CSS


That said, it doesn’t seem that easy to fully modify the look & feel of the search box, but I’m not a CSS expert. Just expect to spend some time on this if it’s important to you.


You should now have an MVP similar to mine, shown below.

(Screenshot: example of a quick Google Site Search MVP)

What’s wrong with it? For me, these are some of the issues:

  • Indexing: if you don’t want to index a whole website, and there is no easy way to tell Google not to index the unwanted parts (e.g. with*), you will want to specify every part of the website you want as a whitelist. You can do that by uploading an annotation file. However, GSS only accepts 5K entries, which can contain wildcards. If your site is well structured and you want to annotate all posts like this:*, it’s perfect. But if your website contains a lot of URL slugs that can vary, and there’s no way to include all of them without including bad ones (and some CMS do create a lot of URLs!), 5K can be reached fast.
  • Autocompletion: it requires using and hacking the Google search box to customize it. Also, the auto-detected words for autocompletion are poor in my case. You can provide your own list, but there’s a size limit. I haven’t pinned it down exactly, but it’s less than 1K (I successfully uploaded ~70 words), which can be limiting.
  • Thumbnail: the thumbnail provided by Google is not always the right picture for a page. I’m sure it’s a website issue (the site not being crawler-friendly), but you don’t necessarily control that.
  • Tag individual pages: I don’t know if tag is the right word, but following the previous issue, if every result could come with a page ID, or a product ID for e-commerce pages, I could build the full SERP out of that. Problem is, there is no way to pass that information to Google as a dictionary of URL->ID. The only way I can think of is to include that information in the webpage in a way Google can pick up: see structured data. I believe it is a great solution, but it might require a lot of frontend engineering time!
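On the indexing point above: the annotation file is just XML. Here is a rough sketch of its shape; the URL pattern is a placeholder, and the real label name is specific to your CSE (you get it from the control panel), so treat every value below as hypothetical:

```xml
<Annotations>
  <Annotation about="example.com/posts/*" score="1">
    <Label name="_cse_yourcseid"/>
  </Annotation>
</Annotations>
```

Each Annotation entry counts toward the 5K limit, wildcards included, which is why messy URL structures hit the ceiling so fast.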

Google Site Search let me build a search MVP in 3 days for $100. It’s great: it shows me what I should expect from our site search, and I can pass the MVP around the company for feedback and guided brainstorming. If you have a small website and need something that works fast, GSS might be your solution. If you find the Google SERP okay, you can be done in a few hours.

But pushing it to a fully customized, professional site search would require serious engineering time, and I’m afraid of the limitations I don’t control. My current thinking: the $12K or more per year would be better invested in a long-term, in-house Solr or ElasticSearch solution.

Actionable Web Analytics

A quick praise for a book I’m currently reading, Actionable Web Analytics: Using Data to Make Smart Business Decisions by Atchison and Burby. It discusses what to measure on a company website and how to leverage it to improve its performance.

It’s a book aimed at managers: it tells you what to do, not how to do it. Still, programmers can learn a lot from it, in particular how to speak the same language as their boss. The examples are interesting, although they are taken from major companies. I do think most of the lessons can be applied to startup websites, but the authors assume large resources (money, time, team size) to analyze the data. Anyway, the main goal is to make you think about your own company website and which user flows can be improved.

Overall, if like me you’re new to web analytics, it’s a good place to start and an easy read.

Encouraging Code Sharing for Research

A great idea was suggested by E. J. Humphrey on the music-ir mailing list today, on how to encourage people to share more code when publishing research at ISMIR (the main music technology conference). You can read his full proposal here: Encouraging Code Sharing at ISMIR.

In short, it recommends letting authors submit code along with their paper, and having that acknowledged in the online proceedings (e.g., by an extra icon). I love this idea because:

  • it is not restrictive, organizations that really can’t share their code (some companies?) won’t be prevented from submitting;
  • it acknowledges the effort of people that do share their code;
  • it improves research reproducibility, which benefits everyone;
  • it sets the right mindset: “this community values code sharing”.

As a bonus, as Eric mentions, developing research code knowing that other people will actually use it might make the code better!
This idea is probably not that new; I’m sure some conferences already implement such a system. But I also know many that would benefit from it.

Code sharing is good for research. Period.