Post 3

Data
In our previous blog post, we noted a few issues that arose with our data. More specifically, we realized that our initial pruning of the This Is My Jam dataset cut out many of the jams tied to each song. This drastically reduced the number of jams/likes a song would get, which led to lackluster results in our machine learning attempts. We decided to switch to a different dataset, the Million Song Dataset, which is a collection of audio features and metadata for a million popular music tracks. Each of these songs has a 'hotness' level tied to it, similar to the popularity metric of a song on Spotify. We also chose this dataset because the songs in the This Is My Jam dataset were fairly niche.
Readjusting Our Goal
We are no longer trying to classify/predict a song's popularity lifespan using date-based data from the This Is My Jam dataset. Instead, we are looking to classify a song's popularity using audio features (e.g. danceability, loudness, key) from the Million Song Dataset. As a first test of the viability of this approach, we labeled the songs with two categories: '1' if the song is popular and '2' if the song is unpopular. We determined these labels with a strict cutoff: if the song has a hotness metric of 60 or greater, it is labeled as popular; otherwise, it is labeled as not popular. With this new approach, we reached an accuracy of 82% on a simple binary classification (popular vs. not popular) based on audio features. We then hypothesized that further dividing song popularity into four categories would yield improved accuracy. We tested this hypothesis with a four-way division of hotness values:
- 0-25: very unpopular,
- 25-50: unpopular,
- 50-75: popular,
- 75-100: very popular.
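
To make the two labeling schemes concrete, here is a minimal sketch of how they could be written in Python. The function and constant names are our own illustrative choices, not taken from any library. The sketch assumes hotness is on a 0-100 scale, as in the ranges above. Since the listed ranges share their endpoints (25, 50, 75), it also assumes that a boundary value falls into the higher category.

```python
import bisect

# Cutoff from the post: hotness of 60 or greater counts as popular.
POPULAR_CUTOFF = 60

# Inner edges of the four-way split. Assumption: a value sitting exactly
# on an edge (25, 50, 75) is assigned to the higher category.
BIN_EDGES = [25, 50, 75]
BIN_NAMES = ["very unpopular", "unpopular", "popular", "very popular"]


def binary_label(hotness):
    """Return 1 if the song is popular, 2 if it is unpopular."""
    return 1 if hotness >= POPULAR_CUTOFF else 2


def four_way_label(hotness):
    """Return one of the four popularity categories for a hotness value."""
    # bisect_right counts how many edges are <= hotness, which is
    # exactly the index of the category the value belongs to.
    return BIN_NAMES[bisect.bisect_right(BIN_EDGES, hotness)]


if __name__ == "__main__":
    for value in (10, 25, 59.9, 60, 80):
        print(value, binary_label(value), four_way_label(value))
```

Note that the two schemes do not line up: a song with a hotness of 55 is 'popular' in the four-way split but unpopular under the binary cutoff of 60.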