As I am finishing my PhD, during which I developed and worked on the Million Song Dataset (MSD), I thought it would be interesting to summarize the project. The goal is to give a quick grasp of what the MSD is and what can be done with it. For more information, visit the MSD website or read the original paper.
The MSD is a very large collection of music data aimed at researchers. It was created by LabROSA and The Echo Nest in 2011. The goals of the MSD include: 1) encourage music technologists to work on a commercial-like scale, 2) create a reference dataset for evaluating research, 3) help new researchers get started in MIR (music information retrieval).
The core of the MSD is information about one million songs gathered from The Echo Nest API. It includes identifiers (artist names, albums, titles, MusicBrainz IDs, …), audio features (loudness, timbre, pitches, beats, …) and relationship data (similar artists, artist tags).
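To make the three groups of information concrete, here is a minimal sketch of how one song's entry could be modeled in Python. This is only an illustration: the class name `SongRecord`, the field names, and the example values are all hypothetical and do not reflect the actual file layout or field names of the MSD.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SongRecord:
    """Hypothetical container mirroring the three groups described above."""

    # Identifiers
    track_id: str
    artist_name: str
    title: str
    # Audio features (only two shown; timbre and pitches are omitted)
    loudness: float
    beat_times: List[float] = field(default_factory=list)
    # Relationship data
    similar_artists: List[str] = field(default_factory=list)
    artist_tags: List[str] = field(default_factory=list)


# Made-up values, for illustration only.
example = SongRecord(
    track_id="TRACK_0001",
    artist_name="Example Artist",
    title="Example Title",
    loudness=-9.5,
    beat_times=[0.5, 1.0, 1.5],
    artist_tags=["rock"],
)
print(example.title, len(example.beat_times), example.artist_tags)
```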
Other organizations have joined the project: SecondHandSongs for identifying cover songs, musiXmatch to provide lyrics, Last.fm for song-level tags and similarity, and The Echo Nest again for user data. Other audio features for 30s snippets were computed by Austrian researchers. All these collections are matched to the 1M songs, making research involving connected data easy. For instance, McFee et al. investigated what information can be used to make playlists. Serrà et al. looked at audio features over time. We also organized a very large, open music recommendation contest.
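Because every collection is matched to the same songs, combining them comes down to a lookup on a shared identifier. The sketch below illustrates the idea with plain dictionaries. The track IDs, the collection contents, and the helper `join_by_track_id` are all made up for this example; the real collections come in their own file formats and would need to be loaded first.

```python
# Hypothetical stand-ins for the core data and two matched collections,
# each keyed by the same (made-up) track identifier.
core = {
    "TRACK_A": {"title": "Song A"},
    "TRACK_B": {"title": "Song B"},
    "TRACK_C": {"title": "Song C"},
}
lyrics = {"TRACK_A": {"love": 3, "night": 1}, "TRACK_B": {"road": 2}}
tags = {"TRACK_A": ["pop"], "TRACK_C": ["jazz"]}


def join_by_track_id(core, **collections):
    """Attach each collection's entry to the matching core song.

    Songs missing from a collection get None for that collection,
    since not every song is covered by every collection.
    """
    joined = {}
    for track_id, metadata in core.items():
        row = dict(metadata)  # copy so the core data is left untouched
        for name, collection in collections.items():
            row[name] = collection.get(track_id)
        joined[track_id] = row
    return joined


joined = join_by_track_id(core, lyrics=lyrics, tags=tags)
for track_id, row in joined.items():
    print(track_id, row)
```

Keeping missing entries as `None` rather than dropping the song makes it easy to see how much each collection overlaps with the others before deciding which subset to work on.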
A lot of the MSD information is gathered or computed automatically (e.g., audio features) rather than by human experts. This implies a certain level of noise and errors. However, that is unavoidable when working at this scale, and the size of the dataset largely makes up for it. Real music data for 1M songs: start exploring!