This is a quick overview of my Ph.D. thesis. It tries to answer two questions: what is it about, and is it worth your time to read it? You can get the PDF here.
Quick background: I defended my thesis in January 2013. The work was done in collaboration mostly with my adviser, Dan Ellis, at LabROSA, Columbia University. A list of my publications can be found here.
My thesis is split into two parts: 1) the Million Song Dataset (MSD), and 2) large-scale cover song recognition using the dataset. The MSD is a very useful resource, and the thesis gives a good and complete overview. That being said, you might be better off starting with my post (the MSD in 250 words), the MSD website, and the original paper.
The second part can be summarized as: we have this awesome resource (the MSD), so what cool stuff can we do with it? Tons of tasks can be performed on the dataset (tagging, metadata analysis, year prediction, recommendation, etc.), but we decided to focus on cover song recognition. Why?
- This task has never been studied on a large scale (more than a few thousand songs).
- It needs a second wind: MIREX 2011 results did not present any improvement, and the task wasn’t run in 2012.
- It is a difficult problem: a lot can change between covers! Thus, a good cover song recognition solution should be helpful for other tasks as well (e.g. segmentation).
Note that if you do any work on cover song recognition, make sure to start with Serrà’s thesis (PDF)! It is the reference work.
We mostly start from scratch on this task. The main reason is that we have a new dataset (18K covers out of 1M songs) with only The Echo Nest chroma features, which were not used in previous systems. Therefore, our main goals are: 1) to show that the task can be tackled at that scale, and 2) to provide lessons learned and a reference point for other researchers. Our first solution is inspired by the Shazam algorithm for audio fingerprinting. It was presented at WASPAA ’11 (PDF), and nothing really new was added in the thesis.
Our second solution is of more interest. The idea is to take the magnitude of the 2D Fourier transform of a chromagram (2DFTM). This higher-level feature was first introduced by Marolt (PDF). Our experiments show that it works much better than our fingerprinting-like solution, and those first results were presented at ISMIR ’12 (PDF). Two songs that have similar 2DFTMs are likely covers, and the feature's dimensionality can be reduced with PCA without sacrificing much accuracy.
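To make the idea concrete, here is a minimal sketch, assuming NumPy and random data standing in for real beat-aligned chromagrams. The function names, the 12 x 75 patch size, and the 10 PCA components are illustrative choices, not values taken from the thesis code. It computes one 2DFTM per "song", reduces it with PCA, and ranks all songs by Euclidean distance to a query.

```python
import numpy as np

def two_dftm(chroma_patch):
    """Magnitude of the 2D Fourier transform of a chroma patch (12 x n_frames)."""
    return np.abs(np.fft.fft2(chroma_patch))

def pca_fit(features, n_components):
    """Fit PCA on the rows of `features`; return (mean, components)."""
    mean = features.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_project(features, mean, components):
    """Project the rows of `features` onto the principal directions."""
    return (features - mean) @ components.T

rng = np.random.default_rng(0)

# Hypothetical data: 50 "songs", each reduced to a single 12 x 75 chroma patch.
patches = rng.random((50, 12, 75))

# One flattened 2DFTM vector per song (12 * 75 = 900 bins).
features = np.array([two_dftm(p).ravel() for p in patches])

# Reduce the dimensionality with PCA (10 components is an arbitrary choice).
mean, components = pca_fit(features, n_components=10)
reduced = pca_project(features, mean, components)

# Rank every song by distance to a query: closest songs are cover candidates.
query = reduced[0]
distances = np.linalg.norm(reduced - query, axis=1)
ranking = np.argsort(distances)
```

With real data, `patches` would come from the MSD chroma features, and the nearest neighbors in `ranking` would be the candidate covers of the query.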
In the thesis, we further analyze this feature. In particular, we show that:
- As expected, it is more robust to small time offsets than regular chroma features.
- The phase does not seem to add value to the magnitude as a feature, or at least we did not find a proper way to include it.
- Encoding a set of patches using a simple distribution (mean and variance for each bin) does not seem to work better in practice than taking the median across all patches.
- Computing a set of 2DFTMs per song (instead of one), each with a different number of beats per frame for the underlying beat-aligned chromagram, can improve results at the cost of more data to handle.
- Our original normalization, before PCA, was wrong; z-scoring the bins helps.
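Some of these points can be illustrated with a rough sketch, again assuming NumPy and random stand-in chromagrams. The helper names, the patch length of 8 beats, and the small `eps` guard are hypothetical and not from the thesis. It shows why the magnitude ignores time shifts (exactly so for a circular shift, which suggests robustness to small real offsets), how the median across patches gives one vector per song, and what z-scoring the bins before PCA looks like.

```python
import numpy as np

def two_dftm(patch):
    """Magnitude of the 2D Fourier transform of a chroma patch."""
    return np.abs(np.fft.fft2(patch))

def extract_patches(chroma, patch_len, hop=1):
    """Slice a 12 x n_beats chromagram into overlapping 12 x patch_len patches."""
    n_beats = chroma.shape[1]
    return [chroma[:, s:s + patch_len]
            for s in range(0, n_beats - patch_len + 1, hop)]

def song_feature(chroma, patch_len=8):
    """Median of the 2DFTM across all patches, flattened to one vector per song."""
    mags = np.array([two_dftm(p) for p in extract_patches(chroma, patch_len)])
    return np.median(mags, axis=0).ravel()

def zscore_bins(features, eps=1e-12):
    """Z-score each bin (column) across songs; eps guards against zero variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

rng = np.random.default_rng(0)

# A circular time shift only changes the phase, so the magnitude is unchanged.
patch = rng.random((12, 8))
shifted = np.roll(patch, 3, axis=1)
shift_invariant = np.allclose(two_dftm(patch), two_dftm(shifted))

# Hypothetical collection: 20 "songs" of 40 beats each.
songs = [rng.random((12, 40)) for _ in range(20)]
features = np.array([song_feature(c) for c in songs])

# Normalize every bin across the collection before applying PCA.
normalized = zscore_bins(features)
```

A real offset inside a longer song is not circular, so the invariance is only approximate there, which matches the "more robust" wording rather than a guarantee.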
So, should you read the thesis? If you are working on large-scale cover song recognition, probably yes: it will give you a nice reference and help you implement our solution (we can provide some of the code, too). Otherwise, if you are still interested in our work, my “regular publications” (available here) are probably shorter and more to the point.