heyjessi/spotifys-angels

spotifys-angels

CS 109 project for Harvard's data science class: a Spotify recommendation system for top hits

project abstract

Our project uses intrinsic audio features to characterize the most popular songs on Spotify from the past decade. As a team, we noticed that many of Spotify's recommendation playlists feature the same artists again and again. We wanted to see whether intrinsic audio characteristics, separate from name recognition, could be used to generate playlists of the best hits. This question develops over the course of our exploration and selection of predictive popularity models. After querying Spotify for tracks from 2010 to 2019 and training models on a subset of popular and relatively unpopular songs (according to Spotify's autogenerated popularity metric), we deploy our top models to create new playlists of popular songs over time!

Our predictive model appeals to a wide audience, from students hoping to impress their friends with the latest hits, to nostalgic working adults who grew up jamming to the 2010s' best tunes, to people throwing a decades party who want to choose oldies hits that will appeal to a modern audience.

conclusion

In this project, we set out to model the intrinsic audio features that characterize the most popular songs on Spotify from the past decade. We created models that use audio features automatically generated by Spotify to determine the expected popularity of input tracks. The ultimate goal is to apply our best models to new data and generate playlists that recommend top hits to listeners from any period of time. Since listeners may be interested both in popular playlists with tracks in shuffled order and in playlists with tracks ordered from most to least popular, we used both binary classification and regression frameworks throughout this project.
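As a minimal sketch of the two framings, the same popularity score can be turned into either a binary label or a numeric target. The cutoff of 50, the `Track` class, and the sample tracks are assumptions made for illustration; the threshold we actually used is not stated here.

```python
from dataclasses import dataclass

# Hypothetical cutoff: illustrative only, not the threshold used in the project.
POPULAR_THRESHOLD = 50


@dataclass
class Track:
    name: str
    popularity: int  # Spotify's popularity score, between 0 and 100


def classification_target(track, threshold=POPULAR_THRESHOLD):
    """Binary label: 1 if the track counts as popular, else 0."""
    return 1 if track.popularity >= threshold else 0


def regression_target(track):
    """Numeric label: the raw popularity score."""
    return float(track.popularity)


# Made-up tracks, used only to show the two label types side by side.
tracks = [Track("song_a", 82), Track("song_b", 17), Track("song_c", 50)]
class_labels = [classification_target(t) for t in tracks]
reg_labels = [regression_target(t) for t in tracks]

print(class_labels)
print(reg_labels)
```

The binary labels suit a shuffled playlist of predicted hits, while the numeric targets allow tracks to be ordered from most to least popular.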

We stepped through increasingly complex models, each time fitting on a random training subset of songs from 2010 to 2019 and evaluating on a separate test subset of songs with known popularity. For the classification task of predicting whether or not a track is popular, a bagging model performs best, achieving a training accuracy of 0.9928 and a test accuracy of 0.8273. For the regression task of predicting a track's numerical popularity score between 0 and 100, a random forest model performs best, achieving a training R-squared of 0.8811 and a test R-squared of 0.2467. The test accuracy under the classification framework suggests that intrinsic audio features capture a meaningful portion of what makes songs popular on Spotify. The regression result is weaker: a test R-squared of 0.2467 means the model explains about a quarter of the variance in popularity scores on held-out tracks, and the gap between training and test scores in both tasks points to overfitting. In particular, traits including energy and danceability seem to be strong positive predictors of popularity, in accordance with some of the findings of our initial literature review.
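For readers less familiar with the two metrics, the sketch below shows how accuracy and R-squared are computed, using the standard definitions in plain Python. The toy label and score lists are invented for illustration; only the four train and test figures at the end come from the results reported above.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Toy data (invented): 4 of 5 labels match.
toy_accuracy = accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])

# Toy data (invented): SS_res = 26, SS_tot = 500.
toy_r2 = r_squared([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])

# Train/test gaps computed from the scores reported in the text.
bagging_gap = round(0.9928 - 0.8273, 4)
forest_gap = round(0.8811 - 0.2467, 4)

print(toy_accuracy, round(toy_r2, 3))
print(bagging_gap, forest_gap)
```

A large difference between training and test scores, as in the last two lines, is the usual signal that a model fits its training data more closely than it generalizes.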

Armed with these two best models, we queried two new datasets from Spotify, one of new releases from the past week and one of oldies from the 1970s and 1980s, and used our bagging and random forest models to generate new playlists for listeners looking to discover the most popular songs of today and yesterday.
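The playlist step can be sketched as follows: a classifier selects tracks for a shuffled playlist, and a regressor ranks tracks for an ordered one. Everything here is a stand-in for illustration. `toy_score` and `toy_popular` are hypothetical placeholders for the trained random forest and bagging models, and the feature values and weights are made up.

```python
import random


def shuffled_playlist(tracks, predict_popular, seed=None):
    """Keep tracks the classifier flags as popular, in shuffled order."""
    picked = [t for t in tracks if predict_popular(t) == 1]
    random.Random(seed).shuffle(picked)
    return picked


def ranked_playlist(tracks, predict_score, size=None):
    """Order tracks from highest to lowest predicted popularity."""
    ranked = sorted(tracks, key=predict_score, reverse=True)
    return ranked if size is None else ranked[:size]


# Made-up audio features for three hypothetical tracks.
features = {
    "track_a": {"energy": 0.9, "danceability": 0.8},
    "track_b": {"energy": 0.3, "danceability": 0.4},
    "track_c": {"energy": 0.6, "danceability": 0.7},
}


def toy_score(name):
    """Placeholder regressor: NOT the trained random forest."""
    f = features[name]
    return 50 * (f["energy"] + f["danceability"])


def toy_popular(name):
    """Placeholder classifier: NOT the trained bagging model."""
    return 1 if toy_score(name) >= 60 else 0


names = list(features)
shuffled = shuffled_playlist(names, toy_popular, seed=0)
ranked = ranked_playlist(names, toy_score)

print(shuffled)
print(ranked)
```

In practice the two placeholder functions would be replaced by the prediction methods of the fitted models, applied to the audio features of the newly queried tracks.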
