Playing with Neo4j – Part I

I have been interested recently in NoSQL databases as a way to handle large collections (I recommend NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence as an introduction). I have touched key-value stores and BigTable in previous internships, but never a graph database. Therefore, here is my attempt at using one on my laptop to create a dummy submission for the Million Song Dataset Challenge (on music recommendation). This post covers installing the database, loading the training data from the challenge, and performing basic queries. The second post will cover walking through the graph and making recommendations based on “people who listened to this also listened to that”. This is not intended to be a fully documented example! My intent is to gather pointers to the relevant information so I (and hopefully, you) can get started faster the next time.

Neo4j

Why Neo4j? I could have chosen something else, but it’s open-source, works on Linux/Mac/Windows, there is a good amount of documentation online, and for non-intensive use you can interact with it from most languages. Also, now that I understand it better, it’s trivial to install!

Installing Neo4j

It’s so easy! I use Ubuntu 12.04, and it took me 30 seconds to be up and running. Download the tar package, untar it, and launch with

bin/neo4j start

and that’s it! You have a database contained in a “data” folder, a running server you can interact with through a REST API, and a dashboard that looks like this (your counters would read 1 – 0 – 0 – 0):

From Neo4j

It really takes a matter of seconds to complete the installation on one machine. Of course, the challenge starts now.

Interacting: Python + REST API

Not knowing where to start, I chose my language of choice, Python. I used this Neo4j client, which can be installed using easy_install. Basic commands are as follows:

from neo4jrestclient.client import GraphDatabase
gdb = GraphDatabase("http://localhost:7474/db/data/")
n1 = gdb.node(name='song1', id='123')
n2 = gdb.node(name='user1', id='abc')

The code above connects to the database (the webserver must be running) and creates two nodes, each with two properties: ‘name’ and ‘id’. In our experiments, nodes will be either users or songs. In IPython, you can easily inspect the properties of a node:

In [6]: n1.properties
Out[6]: {u'id': u'123', u'name': u'song1'}

A graph database is made of nodes, relationships (directed edges between nodes), and properties (metadata on nodes or relationships). We saw how to make nodes; relationships are just as easy. Below we create, then delete, a relationship of the type ‘listened_to’ between the two nodes n1 and n2:

rel = n2.relationships.create("listened_to", n1)
rel.delete()

Once a node has a relationship ‘rel’, you can find the other end of that relationship. In our case, n2 is a user that “has listened to” n1, a song. Remember that relationships are directed. In IPython, basic relationship traversal is as follows (note ‘outgoing’ versus ‘incoming’):

In [44]: for rel in n2.relationships.outgoing(types=['listened_to'])[:]:
    print rel.end.properties
   ....: 
{u'id': u'123', u'name': u'song1'}
In [46]: for rel in n1.relationships.incoming(types=['listened_to'])[:]:
    print rel.start.properties
   ....: 
{u'id': u'abc', u'name': u'user1'}

Indexing

If you have a node object, you can walk along its relationships to other nodes and find paths under different constraints, great things we will discuss in the second post on this subject. But a more pressing matter is finding a node that you inserted in the database. We assume that you put some form of ID in the properties of each node (e.g., ‘user_id’ for users). We need indexes!

Creating an index is easy, and then you must insert each node you create into it. Note that we haven’t explored Neo4j auto-indexing yet. Making and using an index looks like:

In [160]: n1 = gdb.nodes.create(song_id='123')
In [161]: index1 = gdb.nodes.indexes.create("index1")
In [162]: index1['song_id']['123'] = n1
In [163]: for n in index1.get('song_id', '123'):
.....:     print n.properties
{u'song_id': u'123'}

Cypher

Cypher is a graph query language for Neo4j. It is inspired by SQL and borrows from SPARQL. It sounds like the proper way to use Neo4j if you get serious about it, but we haven’t explored it much yet. However, one command that’s useful when experimenting is how to delete everything in the database (except the root node):

START n=node(*)
MATCH n-[r?]-()
WHERE ID(n) <> 0
DELETE n,r

and that query made through Python is:

query = gdb.query("START n=node(*) MATCH n-[r?]-() WHERE ID(n) <> 0 DELETE n,r;")
response = query.get_response()

Batch Inserting the MSD Challenge data

We know the basic commands; the goal now is to upload the training file of the Million Song Dataset Challenge. The direct link is here (careful, it’s 500MB). Unzipped, it’s more than 2GB. It contains information about more than 1M users who have listened to more than 300K songs. Each of the 50M lines is a triplet (user, song, playcount):

b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOAKIMP12A8C130995	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOAPDEY12A81C210A9	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBBMDR12A8C13253B	2
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBFNSP12AF72A0E22	1
...
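
Before committing to a full insert, it is handy to stream the file one triplet at a time. A minimal sketch (the generator name `iter_triplets` is mine):

```python
def iter_triplets(fileobj):
    """Yield (user_id, song_id, playcount) tuples from an open
    tab-separated triplets file, with playcount converted to an int."""
    for line in fileobj:
        user_id, song_id, playcount = line.rstrip('\n').split('\t')
        yield user_id, song_id, int(playcount)
```

You would then do something like `for user, song, count in iter_triplets(open('train_triplets.txt')): ...` without ever holding the 2GB file in memory.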

We want to insert the data with the following structure:

  • each user is a node, with a property ‘user_id’
  • each song is a node, with a property ‘song_id’
  • we have one type of relationship, ‘listened_to’, that goes from a user to a song
  • ‘listened_to’ has one property, playcount
  • we have two indexes: song_index and user_index (they record the song/user ID)
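
Putting the pieces together, here is a hedged sketch of inserting one triplet with the Python client. The helper name `load_triplet` is mine; it assumes the neo4jrestclient index and relationship APIs shown above, and that relationship properties can be passed as keyword arguments to `create`:

```python
def load_triplet(gdb, user_index, song_index, user_id, song_id, playcount):
    """Insert one (user, song, playcount) triplet following the structure
    above: look up (or create and index) the user and song nodes, then add
    a 'listened_to' relationship carrying the playcount."""
    users = user_index.get('user_id', user_id)
    if users:
        user = users[0]
    else:
        user = gdb.nodes.create(user_id=user_id)
        user_index['user_id'][user_id] = user

    songs = song_index.get('song_id', song_id)
    if songs:
        song = songs[0]
    else:
        song = gdb.nodes.create(song_id=song_id)
        song_index['song_id'][song_id] = song

    # The relationship goes from the user to the song.
    return user.relationships.create('listened_to', song, playcount=playcount)
```

With a live server, the two indexes would come from `gdb.nodes.indexes.create("user_index")` and `gdb.nodes.indexes.create("song_index")`. As we will see, calling this once per line is far too slow for 50M triplets, but it makes the target structure concrete.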

We could insert all that data (1.5M nodes, 50M relationships) using the Python commands we just saw, but after trying that for an hour, it is clearly a dead end in terms of speed.
Therefore, we have to use batch inserting! It is intended to jump-start a database with a lot of data, which is exactly our case. Long story short, Neo4j is developed in Java, and through Java you can interact directly with the database files; there is no more efficient way.

Following the example Neo4j provides, we created a Java script to go through the data file, and (in order):

  1. insert all 1M users (with indexing)
  2. insert all 300K songs (with indexing)
  3. enter all relationships (by querying the indexes to find the proper nodes)

This is not the most efficient approach: we should have inserted only the songs, then entered all relationships one user at a time so we don’t have to search for each user in the index. But hey, we tried it, and it worked! The full code is available here: MSDCBatchInsert.java. You will need to stop the Neo4j server before launching the insert code, since it needs an exclusive lock on the database. FYI, the section that inserts songs looks like this:

// We insert song IDs.
BatchInserterIndex insert_idx_songs =
    indexProvider.nodeIndex( "songs",
		             MapUtil.stringMap( "type", "exact" ) );
Map<String, Object> song_properties = new HashMap<String, Object>();
Iterator song_it = data.song_ids.iterator();
while(song_it.hasNext())
    {
	song_properties.put( "song_id", song_it.next() );
	long node = inserter.createNode( song_properties );
	insert_idx_songs.add(node, song_properties);
    }
System.out.println("All songs inserted.");

To compile and run the code, include all .jar files that come with your Neo4j installation (that’s what the ‘echo’ nonsense below is for):

javac -cp $(echo neo4j-community-1.9.M04/lib/*.jar | tr ' ' ':') MSDCBatchInsert.java
time java -cp .:$(echo neo4j-community-1.9.M04/lib/*.jar | tr ' ' ':') MSDCBatchInsert your_path/train_triplets.txt

On my Ubuntu laptop with 4GB of RAM, core i3 CPU, inserting all the nodes took 3 minutes, inserting all the relationships took 90 minutes. Not bad! Now my dashboard looks like:

dashboard_fullyloaded

The MSD Challenge training data is loaded in the graph database, now we need to use it. We’ll do so in our second post on the subject, where we’ll try to make simple recommendations by walking along the edges. Stay tuned!
T.

Large-Scale Pattern Discovery in Music

This is a quick overview of my Ph.D. thesis. It tries to answer two questions: what is it about? and, is it worth it for you to read? You can get the PDF here.

Quick background: I defended my thesis in January 2013. The work was done in collaboration mostly with my adviser, Dan Ellis, at LabROSA, Columbia University. A list of my publications can be found here.

My thesis is split into two parts: 1) the Million Song Dataset (MSD), and 2) large-scale cover song recognition using the dataset. The MSD is a very useful resource, and the thesis gives a good and complete overview of it. That being said, you might be better off starting with my post (the MSD in 250 words), the MSD website, and the original paper.

The second part can be summarized as: we have this awesome resource (the MSD), so what cool stuff can we do with it? Tons of tasks can be performed on the dataset (tagging, metadata analysis, year prediction, recommendation, etc.), but we decided to focus on cover song recognition. Why?

  • This task has never been studied on a large scale (more than a few thousand songs).
  • It needs a second wind: MIREX 2011 results did not present any improvement, and the task wasn’t run in 2012.
  • It is a difficult problem, a lot can change between covers! Thus, a good cover song recognition solution should be helpful for other tasks as well (e.g. segmentation).

Note that if you do any work on cover song recognition, make sure to start with Serrà’s thesis! (PDF). It is the reference work.

We mostly start from scratch on this task. The main reason is that we have a new dataset (18K covers out of 1M songs) with only The Echo Nest chroma features, which were not used in previous systems. Therefore, our main goals are: 1) to show that the task can be tackled at that scale, and 2) to provide lessons learned and a reference point for other researchers. Our first solution is inspired by the Shazam algorithm for audio fingerprinting. It was presented at WASPAA ’11 (PDF), and nothing really new was added in the thesis.

Our second solution is of more interest. The idea is to take the magnitude of the 2D Fourier transform of a chromagram (2DFTM). This higher-level feature was first introduced by Marolt (PDF). Our experiments show that it works much better than our fingerprinting-like solution; those first results were presented at ISMIR ’12 (PDF). Two songs that have similar 2DFTMs are likely covers, and you can reduce the feature’s dimension with PCA without sacrificing much accuracy.

In the thesis, we further analyze this feature. In particular, we show that:

  • As expected, it is more robust to small time offsets than regular chroma features.
  • The phase does not seem to add value to the magnitude as a feature, or at least we did not find a proper way to include it.
  • Encoding a set of patches using a simple distribution (mean and variance for each bin) does not seem to work better in practice than taking the median across all patches.
  • Computing a set of 2DFTM per song (instead of 1), with different beat-per-frame for the underlying beat-aligned chromagram, can improve results at the cost of more data to handle.
  • Our original normalization before PCA was wrong; z-scoring the bins helps.
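
The core of the feature, as described above, can be sketched in a few lines of numpy. This is my simplified illustration, not the thesis code: the patch length and the median pooling follow the description above, while details like beat alignment and the later z-scoring/PCA are left out.

```python
import numpy as np

def twodftm_feature(chromagram, patch_len=75):
    """Compute a 2DFTM feature for one song.

    `chromagram` is a (12, n_frames) chroma matrix. We cut it into
    non-overlapping patches of `patch_len` frames, take the magnitude of the
    2D Fourier transform of each patch (which discards phase and is
    invariant to circular shifts in time and pitch), and take the median
    across patches as the song-level feature.
    """
    n = chromagram.shape[1]
    patches = [chromagram[:, i:i + patch_len]
               for i in range(0, n - patch_len + 1, patch_len)]
    mags = np.array([np.abs(np.fft.fft2(p)) for p in patches])
    return np.median(mags, axis=0).ravel()
```

The resulting vectors would then be z-scored per bin across the collection and reduced with PCA before comparing songs.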

So, should you read the thesis? If you are working on large-scale cover song recognition, probably: it will give you a nice reference and help you implement our solution (we can provide some of the code, too). Otherwise, if you are still interested in our work, my “regular publications” (available here) are probably shorter and more to the point.

Cheers!
T.

The Million Song Dataset in 250 words

MSD logo

As I am finishing my PhD developing and working on the Million Song Dataset (MSD), I thought it would be interesting to try to summarize the project. The goal is to give a quick grasp on what the MSD is and what can be done with it. For more information, visit the MSD website or read the original paper.

 

The MSD is a very large collection of music data aimed at researchers. It was created by LabROSA and The Echo Nest in 2011. The goals of the MSD include: 1) encouraging music technologists to work at a commercial-like scale, 2) creating a reference dataset for evaluating research, and 3) helping new researchers get started in MIR.

The core of the MSD is information about one million songs gathered from The Echo Nest API. It includes identifiers (artist name, albums, titles, Musicbrainz IDs, …), audio features (loudness, timbre, pitches, beats, …) and relationship data (similar artists, artist tags).

Other organizations have joined the project: SecondHandSongs for identifying cover songs, musiXmatch to provide lyrics, Last.fm for song-level tags and similarity, and The Echo Nest again for user data. Other audio features for 30s snippets were computed by Austrian researchers. All these collections are matched to the 1M songs, making research involving connected data easy. For instance, McFee et al. investigated what information can be used to make playlists. Serrà et al. looked at audio features over time. We also organized a very large, open music recommendation contest.

A lot of the MSD information is gathered or computed automatically (e.g., audio features) and not by human experts. This implies a certain level of noise and errors. However, that is unavoidable when working at this scale, and the size largely makes up for it. Real music data for 1M songs: start exploring!

First post – Welcome to my blog!

Hello world!

As I’m leaving academia for good (here is my old website), it is time to start fresh with a brand new website. It is here to stay (the URL and hosting do not depend on anyone but me), it has a great blog (thanks WordPress!), and it can be expanded at will!

This blog will mostly let me talk about technologies I’m investigating for my work as a data scientist. That said, I love NYC (especially the restaurants) and Montreal, I have a soft spot for music technology, which I studied in grad school, I think the Canadiens are the best NHL team ever, and I might comment on those too!

Finally, if something I post is useful to you, don’t hesitate to send me an email! I’m always interested in hearing about people working on similar problems. For instance, I try to share my code (often Python) whenever it makes sense.

Cheers!
T.