YouTube

It’s been a while since I’ve updated this blog, and being more verbose is definitely a resolution for 2017.

In the meantime, I’m happy to say I’m joining YouTube (within Google) and moving to Paris to work on metadata (ingestion, knowledge management, …).

Working on a music dataset, I’ve experienced first-hand the difficulty (hell hole) of matching different sources of data. I’m looking forward to trying again, but this time at a giant scale!

If you’re in Paris come say hi, or if there are cool meetups / workshops / etc to attend let me know. Cheers!

Kryo bug with Spark MLlib Recommender

This does not contain new information, but it took me a while to find the info online.

If you’re using Spark MLlib to do recommendations and you’re serializing using Kryo (which makes things faster), you might have run into a Kryo error when the training data gets large, something about a HashSet:

com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet

You might have tried to register HashSet in Kryo, and it still happens. Even more frustrating, it doesn’t happen on small datasets.

This is a known issue: https://issues.apache.org/jira/browse/SPARK-1977

When the data gets larger, additional serialization happens, and that’s what triggers the error. To make things more annoying, the HashSet is not the (only?) culprit; it’s actually BitSet. Here’s one way to set up your KryoRegistrator to fix it:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.apache.spark.mllib.recommendation.Rating
import scala.collection.mutable

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Register the MLlib Rating class plus the collections that trigger the error.
    kryo.register(classOf[Rating])
    kryo.register(classOf[mutable.HashSet[_]])
    kryo.register(classOf[mutable.BitSet])
  }
}

Note 1: we use Spark 1.1.0. Note 2: I actually didn’t test whether you need all 3 register calls; it works and I stopped touching it 😉
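
One more wiring step the snippet above doesn’t show: Spark has to be told to use Kryo and this registrator. As a hedged sketch (not taken from our actual setup), here is what that configuration might look like if you drive Spark from PySpark; the two config keys are the same from Scala, and the compiled MyKryoRegistrator class must be on the classpath (e.g. passed via --jars):

# Hedged sketch: point Spark at the Kryo serializer and the registrator above.
# Assumes the compiled MyKryoRegistrator class is on the driver/executor classpath.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("als-recommender")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator", "MyKryoRegistrator"))
sc = SparkContext(conf=conf)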

Hope it helps, cheers!

Collaborative Filtering with Implicit Feedback: Faster Python Development Code

Assume the following is your situation (it was mine a few days ago):

  • You’re playing around with CF.
  • You have implicit feedback (views, clicks, …).
  • You want to use Hu et al. 2008, one of the most famous matrix factorization algorithms for this case.
  • You want to experiment with this in Python.

You’ve found this great development implementation by Chris Johnson and it’s enough to get you started. But as soon as your matrix gets a little big (e.g. 20k x 50k), you start spending a lot of time waiting for the result.

Now, the proper solution is to move to a production implementation! Something like MyMediaLite would speed things up, or my new favorite, Spark MLlib.

But assuming you still want to play around with Python for a little longer before you commit to something (slightly) more involved, I’ve made a multi-threaded version of Chris Johnson’s implementation. You can find the fork here:

https://github.com/tbertinmahieux/implicit-mf

It contains the original code, mf.py, and the new one, mf_threaded.py. To use the new file, simply set the num_threads parameter when you create the ImplicitMF instance (see the sketch after these notes). A few notes:

  • because of Python’s GIL, I don’t actually use threads, but processes, via the multiprocessing library;
  • the code is not tested (beyond my own use case), and is not optimized for memory, or even for speed really;
  • it does help speed-wise: going from the regular version to 4 threads, the run time for my dataset (on a 2014 MacBook Pro) went from 25 minutes to 9, roughly a 2.8x speedup.
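
Here is a hypothetical usage sketch. Only num_threads is specific to the fork; I’m assuming the rest of the constructor mirrors the original mf.py, so check the repo for the exact signature:

# Hypothetical usage sketch: num_threads comes from the fork; the other parameter
# names are assumed to match the original mf.py (check the repo to be sure).
import scipy.sparse as sparse
from mf_threaded import ImplicitMF

# counts: a (users x items) sparse matrix of implicit feedback (views, clicks, ...)
counts = sparse.rand(20000, 50000, density=0.001, format='csr')

mf = ImplicitMF(counts, num_factors=40, num_iterations=15, num_threads=4)
mf.train_model()
# The learned factors should end up in attributes like user_vectors / item_vectors.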

As I said, using Spark’s MLlib is probably your best solution, it can be called from Scala, Java or Python. But if you want to keep playing with Python locally, my code can help you scale a little bit.

Cheers!

P.S. want to develop CF solutions in Spark and other large-scale frameworks like Scalding? We’re always hiring!

New job: from beauty to books

A few words on a big change for me: I’m leaving Birchbox to join Oyster Books, starting tomorrow.

It’s never easy to leave a position you enjoy, especially since it let me grow so much professionally. In my 15 months at Birchbox, I had the chance to play with search, recommendation, prediction, optimization & integer programming, MySQL, Redis, Flume, Tornado, Mixpanel, MyMediaLite, … and I’m forgetting many more. I use the term “play” because when you’re in a small team with a great boss who encourages you to pursue most of your ideas, there’s no reason not to have fun while doing it. Birchbox is a great place to work, and I’ve never seen a company value its employees more (they are always hiring, including in the Data Science team, so take a look!).

So, why leave a company when they’re opening their first store and raising $60M in a Series B? Good question, and as always, the answer came from a combination of many things. But a few reasons why I’m so excited to start at Oyster tomorrow include:

  • recommendation and search are part of their DNA: if you can’t find a book to read, you won’t read;
  • a very clear and focused vision: remove any hurdle that prevents someone from reading, so you read more, more often;
  • a great group of core developers that have already proven themselves with the release of the app;
  • the opportunity to help start their Data Science team, so it can keep up with (and fuel) product growth;
  • the belief that the unlimited subscription model is the right model for books;
  • I’ve always loved books;
  • the app is so pretty!

There are opportunities that are too good to pass up, and joining Oyster Books was one of them. Of course, today I’m both excited and a little scared of what comes next, but “good scared”. Can’t wait!


Good books and Stack Overflow

Quick post, mostly to recommend two books:

  • Search Patterns: as you may know from my previous post, we are brainstorming about search at work, and this book is a great place to get the discussion started. Not very technical, its clear goal is to make you think about search in novel ways, not to teach you how to copy what has already been done. Great read!
  • Programming Pig: to get started with this higher-level language for Hadoop, written by one of the original engineers working on Pig at Yahoo. It’s slowly getting out of date, but it’s still a good place to learn the fundamentals of Pig.

Otherwise, after years of using it, I finally started contributing to Stack Overflow. I hadn’t realized how exciting it is to win reputation… and how competitive it gets when a question has an easy answer: you only have a few minutes to write it!
My only concern: so far my answers have mostly been hacks (e.g. this format switcher) rather than deep, insightful comments about a language or program. But if it’s useful to someone, great! I’m glad I can finally upvote answers and write comments.

Finally, I’ve discovered Reddit AWW and can’t get enough 😉
Cheers!

Investigating Google Site Search

The company I work at is wondering how search can be improved on our site (the current solution is not state-of-the-art). To explore our options, I spent a few days building a minimum viable product (MVP) with Google Site Search (GSS). This is what I learned in the process: how to get started, how well it works, and its limitations. First takeaway: Google’s documentation is terrible!

What is Google Site Search?

Here is the website; you can think of it as the paid version of Google’s Custom Search Engine (CSE). A CSE is similar to regular Google search, but you can specify which pages to index (or not to index), and you have some control over the ranking (e.g. by boosting some pages when they’re relevant).

How to get started?

A very simple (free) example: in CSE, create a new custom search engine (give it a cool name, like SearchMan or SuperSearch), then in ‘edit search engine’ / ‘setup’ / ‘basic’, under sites to search, add http://www.birchbox.com/ and that’s it! You have a search engine that only returns results from birchbox.com. Try it by typing ‘shampoo’ in the right panel. You also get a public link to it where anyone can use the search engine; it looks like a modified Google homepage.

GSS uses your custom search engine; you pay to be able to query it more (and to get rid of ads and get an XML feed, more on this later). So, before going any further, spend time with your CSE to make sure you like the results: play with indexing only parts of your website or excluding others, specify autocompletion words, etc. Pass it around your organization for feedback. If you don’t like the results, they won’t change with the paid version!

Is GSS a viable option?

I assume you have a CSE you like and you’re thinking “wouldn’t it be great if those were the results on my website, but with a different UI?”. At the same time, GSS is a paid service, as opposed to self-hosted solutions such as Solr or ElasticSearch. So… not an obvious choice! GSS requires little maintenance, it’s ready to work from the get-go, and the service should not fail. The highest published price used to be (last week) $12K/year for up to 3M queries; now it says $2K/year for <500K queries, and you have to contact them for more traffic. Still, assuming you have a reasonably successful website at 1M queries/year, $12K might be worth it once you factor in the extra engineering time for a good Solr implementation and the additional risks.

Price is not the only factor, though. To make GSS do everything you want, you might have to modify your frontend code (to add tagging and labels that Google crawlers will pick up). There are also other customization limits we’ll discuss later. But GSS deserves an MVP!

Getting started with the XML feed

GSS comes with its own search box and search engine results page (SERP) implementations, but chances are you’ll want to fully customize those to fit your website’s look and feel. Don’t bother looking at Google CSE themes, they won’t get you there. You want to call GSS through an API and deal with the results yourself. For that, you need to move away from basic CSE and buy the 200K queries/year option for $100 (using Google Wallet).

Don’t look at the JSON API! Its pricing is different and doesn’t change if you buy GSS. There’s also a hard limit of 10K queries/day that can easily be reached on a high-traffic day.

So, XML API it is. You query your public custom search engine using this call:
http://www.google.com/cse?&cx=CSE_ID&q=shampoo&num=20&output=xml_no_dtd&client=google-csbe&gl=us
where CSE_ID is a string of digits, a ‘:’, and a letter/digit hash identifying your CSE. You can find it in your public URL. Here is a sample result.

Side note: to deal with XML, I use Hash.from_xml in Ruby (my MVP uses Rails, no comments…) or xmltodict in Python. Note that different Ruby versions seem to return slightly different Hashes from the same XML. So… stick to one version? Don’t use Rails?
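
To make this concrete, here is a minimal Python sketch of the same call using requests and xmltodict (the CSE_ID value is a placeholder, use your own):

# Minimal sketch: query the GSS XML endpoint and parse the response.
# CSE_ID is a placeholder; the URL parameters mirror the call shown above.
import requests
import xmltodict

CSE_ID = "0123456789:abcdefghij"  # placeholder

resp = requests.get("http://www.google.com/cse", params={
    "cx": CSE_ID,
    "q": "shampoo",
    "num": 20,
    "output": "xml_no_dtd",
    "client": "google-csbe",
    "gl": "us",
})
doc = xmltodict.parse(resp.text)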

Anatomy of an XML GSS response

Read the full documentation; this is just a cheat sheet of the few XML elements I found useful. A sample result is available.

  • TM is the response time in milliseconds.
  • Q is the query.
  • RES contains the results and the number of results as an attribute.
  • R is an individual result, containing the fields below.
      • U is the page link.
      • T is the page title.
      • S is the summary.
      • PageMap contains info crawled from the website in a DataObjects element.
          • cse_thumbnail, inside a DataObject, contains the thumbnail link hosted by Google.

Title, link, thumbnail, summary… that should be enough to get started with your MVP!
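
As a sketch, pulling those fields out of the xmltodict output might look like the following. I’m assuming the root element parses as ‘GSP’; inspect a raw response from your own engine before relying on this:

# Sketch: extract the useful fields from the parsed response ('doc' from the
# previous snippet). Root element assumed to be GSP; verify on a raw response.
results = doc["GSP"]["RES"]["R"]
if not isinstance(results, list):  # a single <R> parses to a dict, not a list
    results = [results]

for r in results:
    print({
        "title": r.get("T"),
        "link": r.get("U"),
        "summary": r.get("S"),
        # For the thumbnail, dig into r.get("PageMap"); the exact nesting of
        # DataObject / cse_thumbnail depends on what Google crawled from your pages.
    })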

Customizing the search box

So far I have assumed that your app is responsible for the search box: you query GSS using the link above and display results based on the XML answer. Problem is, some of the cool GSS features are only available through their widget, the main one being autocompletion. But if you install the widget, it also displays the results…

One way to go is to install the widget using the code provided in the CSE panel. Use version 1 as it is more customizable. It will give you a widget with autocompletion. Now, use the following callback to prevent the widget from actually searching and presenting the results:

customSearchControl.setSearchStartingCallback({}, function() {
  var q = customSearchControl.getInputQuery();
  // Encode the query so special characters survive the redirect.
  window.location = '/search?q=' + encodeURIComponent(q);
});

What’s happening is that, when the user presses search, my function is called. It gets the query and loads another page (/search) with the query as an argument. This new page (which can be the one you’re currently on!) will see the ‘q’ param, call GSS, and display the results. That’s it! It’s hacky, but it works: you have the power of the GSS search box with a fully customized SERP.

Another trick: to remove the Google logo from the search box, add this to your CSS:

.gsc-input {
  background: none;
}

That said, it doesn’t seem that easy to fully modify the look & feel of the search box, but I’m not a CSS expert. Just expect to spend some time on this if it’s important to you.

Limitations

Now you have an MVP similar to mine, shown below:

Google Site Search MVP screenshot

Example of a quick Google Site Search MVP

So, what’s wrong with it? For me, these are some of the issues:

  • Indexing: if you don’t want to index a whole website, and there is no easy way to tell Google not to index the unwanted parts (e.g. with -www.website.com/garbage/*), you will want to specify every part of the website you do want as a whitelist. You can do that by uploading an annotation file. However, GSS only accepts 5K entries that can contain wildcards. If your site is well structured and you want to annotate all posts like this: www.mywebsite.com/posts/*, it’s perfect. But if your website contains a lot of URL slugs that vary and there’s no way to include all of them without including bad ones (and some CRMs do create a lot of URLs!), 5K can be reached fast.
  • Autocompletion: it requires using (and hacking) the Google search box to customize it. Also, the auto-detected words for autocompletion are poor in my case. You can provide your own list, but there’s a size limit. I haven’t pinned down the exact number, but it’s less than 1K (I successfully uploaded ~70 words), which can be restrictive.
  • Thumbnail: the thumbnail provided by Google is not always the right picture for a page. I’m sure it’s a website issue (the pages not being crawler-friendly), but you don’t necessarily control that.
  • Tag individual pages: I don’t know if tag is the right word, but following the previous issue, if every result could come with a page ID, or a product ID for e-commerce pages, I could build the full SERP out of that. Problem is, there is no way of passing that information to Google as a dictionary of URL->ID. The only way I can think of is to include that information in the webpage in a way Google can pick it up: see structured data. I believe it is a great solution, but it might require a lot of frontend engineering time!

Conclusion
Google Site Search let me build a search MVP in 3 days for $100. It’s great: it shows me what I should expect from our site search, and I can pass the MVP around the company for feedback and guided brainstorming. If you have a small website and need something that works fast, GSS might be your solution. If you find the Google SERP OK, you can be done in a few hours.

But to push it to a fully customized, professional site search, it would require serious engineering time, and I’m afraid of limitations I don’t control. My current thinking: the $12K or more per year would be better invested in a long-term, in-house Solr or ElasticSearch solution.

Actionable Web Analytics

A quick praise for a book I’m currently reading, Actionable Web Analytics: Using Data to Make Smart Business Decisions by Atchison and Burby. It discusses what to measure on a company website and how to leverage it to improve its performance.

It’s a book aimed at managers: it tells you what to do, not how to do it. Still, programmers can learn a lot from it, in particular how to speak the same language as their boss. The examples are interesting, although they are taken from major companies. I do think most of the lessons can be applied to startup websites, but the authors assume large resources (money, time, team size) to analyze the data. In any case, the main goal is to make you think about your own company website and which user flows can be improved.

Overall, if like me you’re new to web analytics, it’s a good place to start and an easy read.
Cheers!
T.

Encouraging Code Sharing for Research

A great idea was suggested by E. J. Humphrey on the music-ir mailing list today, on how to encourage people to share more code when publishing research at ISMIR (the main music technology conference). You can read his full proposal here: Encouraging Code Sharing at ISMIR.

In short, it recommends letting authors submit code along with their paper, and acknowledging it in the online proceedings (e.g., with an extra icon). I love this idea because:

  • it is not restrictive, organizations that really can’t share their code (some companies?) won’t be prevented from submitting;
  • it acknowledges the effort of people that do share their code;
  • it improves research reproducibility, which benefits everyone;
  • it sets the right mindset: “this community values code sharing”.

As a bonus, as Eric mentions, developing research code knowing that other people will actually use it might make the code better!
This idea is probably not that new, I’m sure some conferences already implement such a system. But I also know many that would benefit from it.

Code sharing is good for research. Period.
T.

Playing with Neo4j – Part I

I have been interested recently in NoSQL databases as a way to handle large collections (I recommend NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence as an introduction). I have touched key-value stores and BigTable in previous internships, but never a graph database. So here is my attempt at using one on my laptop to create a dummy submission for the Million Song Dataset Challenge (on music recommendation). This post covers installing the database, loading the training data from the challenge, and performing basic queries. The second post will cover walking through the graph and making recommendations based on “people who listened to this also listened to that”. This is not intended to be a fully documented example! My intent is to gather pointers to the relevant information so I (and hopefully you) can get started faster next time.

Neo4j

Why Neo4j? I could have chosen something else, but it’s open source, it works on Linux/Mac/Windows, there is a good amount of documentation online, and for non-intensive use you can interact with it from most languages. Also, now that I understand it better, it’s trivial to install!

Installing Neo4j

It’s so easy! I use Ubuntu 12.04, and it took me 30 seconds to be up and running. Download the tar package, untar it, and launch with

bin/neo4j start

and that’s it! You have a database contained in a folder “data”, a server running that you can interact with through a REST API, and a dashboard that looks like this (counters would be 1 – 0 – 0 – 0):

Neo4j dashboard screenshot

It really takes a matter of seconds to complete the installation on one machine. Of course, the challenge starts now.

Interacting: Python + REST API

Not knowing where to start, I chose my language of choice, Python. I used this Neo4j client, which can be installed with easy_install. Basic commands are as follows:

from neo4jrestclient.client import GraphDatabase
gdb = GraphDatabase("http://localhost:7474/db/data/")
n1 = gdb.node(name='song1', id='123')
n2 = gdb.node(name='user1', id='abc')

The code above connects to the database (the webserver must be running) and creates two nodes, each with two properties: ‘name’ and ‘id’. In our experiments, nodes will be either users or songs. In IPython, you can easily inspect the properties of a node:

In [6]: n1.properties
Out[6]: {u'id': u'123', u'name': u'song1'}

A graph database is made of nodes, relationships (directed edges between nodes), and properties (metadata/info on nodes or relationships). We saw how to make nodes; relationships are just as easy. Below we create, then delete, a relationship of the type ‘listened_to’ between two nodes n1 and n2:

rel = n2.relationships.create("listened_to", n1)
rel.delete()

Once a node has a relationship ‘rel’, you can find the other end of that relationship. In our case, n2 is a user that “has listened to” n1, a song. Remember that relationships are directed. In IPython, basic relationship traversal looks as follows (note ‘outgoing’ versus ‘incoming’):

In [44]: for rel in n2.relationships.outgoing(types=['listened_to'])[:]:
    print rel.end.properties
   ....: 
{u'id': u'123', u'name': u'song1'}
In [46]: for rel in n1.relationships.incoming(types=['listened_to'])[:]:
    print rel.start.properties
   ....: 
{u'id': u'abc', u'name': u'user1'}

Indexing

If you have a node object, you can walk along its relationships to other nodes and find paths with different constraints, great things we will discuss in the second post on this subject. But a more pressing matter is to find a node that you inserted in the database. We assume that you have some form of ID that you put in the properties of that node (e.g., ‘user_id’ for users). We need indexes!

Creating an index is easy, and then you must insert each node you create into it. Note that we haven’t explored Neo4j auto-indexing yet. Making and using an index looks like:

In [160]: n1 = gdb.nodes.create(song_id='123')
In [161]: index1 = gdb.nodes.indexes.create("index1")
In [162]: index1['song_id']['123'] = n1
In [163]: for n in index1.get('song_id', '123'):
.....:     print n.properties
{u'song_id': u'123'}

Cypher

Cypher is a graph query language for Neo4j. It is inspired by SQL and borrows from SPARQL. It sounds like the proper way to use Neo4j if you get serious about it, but we haven’t explored it much yet. However, one command that’s useful when experimenting is the one that deletes everything in the database (except the root node):

START n=node(*)
MATCH n-[r?]-()
WHERE ID(n) <> 0
DELETE n,r

and that query made through Python is:

query = gdb.query("START n=node(*) MATCH n-[r?]-() WHERE ID(n) <> 0 DELETE n,r;")
response = query.get_response()

Batch Inserting the MSD Challenge data

We know the basic commands; the goal is now to upload the training file of the Million Song Dataset Challenge. The direct link is here (careful, it’s 500MB). Unzipped, it’s more than 2GB. It contains info about >1M users that have listened to >300K songs. Each of the 50M lines is a triplet (user, song, playcount):

b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOAKIMP12A8C130995	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOAPDEY12A81C210A9	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBBMDR12A8C13253B	2
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBFNSP12AF72A0E22	1
...

We want to insert the data with the following structure (a small insertion sketch follows the list):

  • each user is a node, with a property ‘user_id’
  • each song is a node, with a property ‘song_id’
  • we have one type of relationship, ‘listened_to’, that goes from a user to a song
  • ‘listened_to’ has one property, playcount
  • we have two indexes: song_index and user_index (they record the song/user ID)
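
Putting the commands from earlier together with this structure, the naive per-triplet insert looks roughly like the sketch below (the get_or_create helper is mine, not part of neo4jrestclient):

# Naive per-triplet insertion sketch, using neo4jrestclient as shown earlier.
# get_or_create is my own helper; this is the slow approach discussed just after.
from neo4jrestclient.client import GraphDatabase

gdb = GraphDatabase("http://localhost:7474/db/data/")
user_index = gdb.nodes.indexes.create("user_index")
song_index = gdb.nodes.indexes.create("song_index")

def get_or_create(index, key, value):
    """Return the node indexed under (key, value), creating and indexing it if missing."""
    hits = list(index.get(key, value))
    if hits:
        return hits[0]
    node = gdb.nodes.create(**{key: value})
    index[key][value] = node
    return node

with open("train_triplets.txt") as f:
    for line in f:
        user_id, song_id, playcount = line.strip().split("\t")
        user = get_or_create(user_index, "user_id", user_id)
        song = get_or_create(song_index, "song_id", song_id)
        user.relationships.create("listened_to", song, playcount=int(playcount))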

We could insert all that data (1.5M nodes, 50M relationships) with a loop like the one above, but after running it for an hour, it was clearly a dead end in terms of speed.
Therefore, we have to use batch inserting! It is intended to jump-start a database with a lot of data, which is exactly our case. Long story short, Neo4j is developed in Java, and through Java you can interact directly with the database files; there is no more efficient way.

Following the example Neo4j provides, we created a Java program that goes through the data file and (in order):

  1. insert all 1M users (with indexing)
  2. insert all 300K songs (with indexing)
  3. enter all relationships (by querying the indexes to find the proper nodes)

It is not the most efficient approach: we should have inserted only the songs first, then entered all relationships one user at a time so we don’t have to search for that user in the index. But hey, we tried, it worked! The full code is available here: MSDCBatchInsert.java. You will need to stop the Neo4j server before launching the insert code, as it needs an exclusive lock on the database. FYI, the section that inserts songs looks like this:

// We insert song IDs.
BatchInserterIndex insert_idx_songs =
    indexProvider.nodeIndex("songs", MapUtil.stringMap("type", "exact"));
Map<String, Object> song_properties = new HashMap<String, Object>();
Iterator song_it = data.song_ids.iterator();
while (song_it.hasNext()) {
    song_properties.put("song_id", song_it.next());
    long node = inserter.createNode(song_properties);
    insert_idx_songs.add(node, song_properties);
}
System.out.println("All songs inserted.");

To compile and run the code, include all the .jar files that come with your Neo4j installation (that’s what the ‘echo’ nonsense below is for):

javac -cp $(echo neo4j-community-1.9.M04/lib/*.jar | tr ' ' ':') MSDCBatchInsert.java
time java -cp .:$(echo neo4j-community-1.9.M04/lib/*.jar | tr ' ' ':') MSDCBatchInsert your_path/train_triplets.txt

On my Ubuntu laptop (4GB of RAM, Core i3 CPU), inserting all the nodes took 3 minutes and inserting all the relationships took 90 minutes. Not bad! Now my dashboard looks like this:

Neo4j dashboard after loading the full dataset

The MSD Challenge training data is loaded in the graph database, now we need to use it. We’ll do so in our second post on the subject, where we’ll try to make simple recommendations by walking along the edges. Stay tuned!
T.