Deep Learning and Music: Mood Classification of Spotify Songs

Creating Spotify Playlists Using Spotipy and Keras

By Jonathan Wallach, Brendan Corr, Michael Moschitto

Introduction

After completing data science projects in common sectors such as sports and finance, our team set out to tackle something a little less conventional for our project in Cal Poly’s Knowledge Discovery from Data course (CSC 466). We landed on the combination of mood and music. Can we teach a computer to learn how music will make people feel? So we set out to see if we could scrape songs from our streaming platform of choice, Spotify, predict how those songs would make someone feel, and compose a playlist of similar-feeling songs.

Background

Before we began any data collection, the first task was to decide how we would label each song and which moods we would predict. We found an article from Tufts University on music mood classification built around Robert Thayer’s traditional model of mood, which states that the four most common feelings are Happy, Sad, Calm, and Energetic.

Gathering the Data

As with any data science project, our first task was to collect data. To get access to the raw song data, we needed to leverage the Spotify Web API. This API interaction was a huge part of our project, as there are very few mood-labeled datasets containing the information we needed, let alone listener-specific ones. We used an existing Python library called Spotipy, which let us focus less on endpoints and status codes and more on data collection.

In order to make the classification user specific, we first obtained a list of Michael’s public playlists.
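
A minimal sketch of the Spotipy setup and the playlist listing; the credentials and username are placeholders, not the values we actually used:

    import spotipy
    from spotipy.oauth2 import SpotifyOAuth

    # Authenticate against the Spotify Web API (credentials are placeholders).
    sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        redirect_uri="http://localhost:8888/callback",
        scope="playlist-read-private playlist-modify-public",
    ))

    # List a user's public playlists (returned in pages of up to 50).
    playlists = sp.user_playlists("michaels_username")
    for playlist in playlists["items"]:
        print(playlist["name"], playlist["id"])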

Next, we scraped songs that would make up our testing and training datasets.

This resulted in data frames with name, uri (identifier), genre, artist, and playlist columns. The API response body contains more information, but for our purposes this was all we needed.
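
A rough sketch of how one playlist’s tracks can be pulled into such a data frame (our actual SpotifyWrapper code differs in the details, and genre, which we later took from the artist, is omitted here):

    import pandas as pd
    # sp is the authenticated spotipy.Spotify client from the setup sketch above.

    def playlist_to_df(sp, playlist_id, playlist_name):
        """Pull the first page of a playlist's tracks into a DataFrame."""
        response = sp.playlist_items(playlist_id)
        rows = []
        for item in response["items"]:
            track = item["track"]
            if track is None:          # skip local/unavailable tracks
                continue
            rows.append({
                "name": track["name"],
                "uri": track["uri"],
                "artist": track["artists"][0]["name"],
                "playlist": playlist_name,
            })
        return pd.DataFrame(rows)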

The features we planned to train the network on captured how the music sounds, since a song’s sound has a large effect on the way it makes us feel and lets us classify without any language processing. The API has an endpoint for audio analysis, and from it we took the following features (a short Spotipy sketch follows the list):

  • Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
  • Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live.
  • Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
  • Speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audiobook, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  • Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
  • Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, the tempo is the speed or pace of a given piece and derives directly from the average beat duration.
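
Spotipy exposes this endpoint as audio_features, which accepts up to 100 tracks per call. A minimal sketch of how these columns might be pulled into a data frame:

    import pandas as pd
    # sp is the authenticated spotipy.Spotify client; uris is a list of track URIs.

    def get_audio_features(sp, uris):
        """Fetch Spotify audio-analysis features in batches of 100 URIs."""
        features = []
        for start in range(0, len(uris), 100):
            features.extend(sp.audio_features(uris[start:start + 100]))
        keep = ["danceability", "energy", "instrumentalness", "liveness",
                "loudness", "speechiness", "valence", "tempo", "acousticness"]
        return pd.DataFrame(features)[["uri"] + keep]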

Data Exploration and Cleaning

Because we used an established API, there was little cleaning that needed to be done. The only real cleaning was sorting through responses, picking out relevant values, and dealing with pagination. The way songs are stored was an early blocker to our data engineering: we couldn’t figure out why only the first 100 songs of a given playlist were being returned. The reason is that pages of songs act like a linked list; once one page has been traversed, you have to follow its pointer to the next page for more data.
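
In Spotipy terms, each page of results carries a pointer to the next page, and sp.next() follows it. A minimal sketch of how the full playlist can be walked (the helper name is ours, not Spotipy’s):

    # sp is the authenticated spotipy.Spotify client from the earlier sketch.
    def all_playlist_items(sp, playlist_id):
        """Walk every page of a playlist instead of stopping at the first 100 songs."""
        response = sp.playlist_items(playlist_id)
        items = response["items"]
        while response["next"]:              # follow the "next" pointer until it is None
            response = sp.next(response)
            items.extend(response["items"])
        return items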

Exploration was a slightly larger challenge. We took means, medians, standard deviations, line charts, bar charts, you name it… and still couldn’t find a good way to visualize what we were seeing (and hearing). Then came the idea of using radar plots to chart the features. Radar plots give shapes to data, which let us visually compare the sound of different songs; exactly what we wanted! We were also able to chart the means of each mood (happy, sad, calm, energetic) and get a feel for those as well.
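
For reference, a radar (polar) plot like the ones below can be drawn with matplotlib in just a few lines; the feature values here are made up for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    features = ["danceability", "energy", "liveness", "speechiness", "valence", "acousticness"]
    values = [0.62, 0.71, 0.12, 0.05, 0.45, 0.20]   # example 0-1 scaled values for one song

    # Space the axes evenly around the circle and close the polygon.
    angles = np.linspace(0, 2 * np.pi, len(features), endpoint=False).tolist()
    values = values + values[:1]
    angles = angles + angles[:1]

    ax = plt.subplot(polar=True)
    ax.plot(angles, values)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(features)
    plt.show()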

Mean Plots by Mood

Above are the plots for the individual moods; below are those for a subset of songs in Michael’s country playlist.

Country Playlist Plots

The Training Data

Once we had an idea of what the data looked like, all that was left before classification was to create our training set. We accomplished this by pulling down playlists that were already labeled by mood, which resulted in over 1,700 songs.

The calmDF, happyDF, energeticDF, and sadDF were each scraped using our SpotifyWrapper.py class function getSongsFromPlaylist, while trainingFeatures was the result of getFeatures.
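
In outline, the labeling step looks something like the sketch below. The dataframe names come from the project, but the mood column, the concatenation, and the merge on uri are our assumptions about how the pieces fit together:

    import pandas as pd
    # calmDF, happyDF, energeticDF, sadDF: outputs of getSongsFromPlaylist for each mood playlist.
    # trainingFeatures: the audio-feature rows returned by getFeatures, one per uri.

    calmDF["mood"] = "calm"
    happyDF["mood"] = "happy"
    energeticDF["mood"] = "energetic"
    sadDF["mood"] = "sad"

    trainingSongs = pd.concat([calmDF, happyDF, energeticDF, sadDF], ignore_index=True)
    trainingData = trainingSongs.merge(trainingFeatures, on="uri")   # assumed join key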

Final Training Dataset

Classification

When trying to accurately predict moods, we attempted KMeans, Random Forest, and Neural Network classifiers. In the end, the neural net edged out Random Forest with ~76% accuracy.

Preprocessing:

In an effort to fine-tune our model, we ran permutation feature importance. Permutation feature importance is the process of shuffling all values for one feature (one column) at a time and measuring the impact the shuffling has on the accuracy of the model. The more the accuracy of the model decreases when you shuffle a column, the more important that column must be. The column must be shuffled and tested multiple times to ensure the impact is not a coincidence. After determining which features are the most important, you can reduce the number of dimensions your model trains on and, as a result, potentially see accuracy improvements.
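
Since our classifier was a Keras model rather than a Scikit-Learn estimator, the shuffling can be done by hand. A minimal sketch, assuming X_val and y_val are a held-out feature matrix and integer-encoded labels:

    import numpy as np

    def permutation_importance(model, X_val, y_val, n_repeats=5):
        """Shuffle one column at a time and measure the drop in accuracy."""
        def accuracy(X):
            preds = np.argmax(model.predict(X, verbose=0), axis=1)
            return np.mean(preds == y_val)

        baseline = accuracy(X_val)
        rng = np.random.default_rng(0)
        importances = {}
        for col in range(X_val.shape[1]):
            drops = []
            for _ in range(n_repeats):   # repeat so one lucky shuffle can't mislead us
                X_shuffled = X_val.copy()
                X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
                drops.append(baseline - accuracy(X_shuffled))
            importances[col] = np.mean(drops)
        return importances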

Feature Importance

Below are the results we received when running our permutation based feature importance. All values represent the average difference between the original accuracy and the accuracy with the feature shuffled.

  • Acousticness: 4.29%
  • Danceability: 0.64%
  • Energy: 1.84%
  • Liveness: 3.43%
  • Loudness: 0.53%
  • Speechiness: 3.78%
  • Tempo: 1.28%
  • Valence: 6.10%

The most important features determined by this test are valence and acousticness, while the least important are loudness and danceability. When we removed these less important features from our dataset, the accuracy of our model decreased by ~3%, so we decided to keep all features present.

Scaling and Encoding:

Finally, we used Scikit-Learn’s MinMaxScaler and LabelEncoder to normalize the audio analysis values and give a numerical assignment to each mood.
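
A minimal sketch of that step, reusing the trainingData frame from the earlier sketch (the exact column names are assumptions):

    from sklearn.preprocessing import MinMaxScaler, LabelEncoder

    feature_cols = ["danceability", "energy", "instrumentalness", "liveness",
                    "loudness", "speechiness", "valence", "tempo", "acousticness"]

    scaler = MinMaxScaler()
    X = scaler.fit_transform(trainingData[feature_cols])   # squeeze every feature into [0, 1]

    encoder = LabelEncoder()
    y = encoder.fit_transform(trainingData["mood"])        # e.g. calm=0, energetic=1, happy=2, sad=3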

Encoded Moods for our Classifier

Model Creation:

We used the deep learning library Keras, which is a fast and powerful library for getting networks up and running. As the goal was to classify 4 different moods, our model was a multi-class network: 9 input features feeding a dense layer with a rectified linear unit (ReLU) activation function, connected to a dense output layer, this time using a softmax.

Once we had our model, we used K-Fold Cross Validation with 10 splits to evaluate our classifier.
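
Below is a rough sketch of both the network and the evaluation loop, using the scaled features X and encoded labels y from the preprocessing sketch. The hidden-layer width, epochs, and batch size are assumptions; the 9-input dense ReLU layer, 4-way softmax output, and 10 splits follow the description above.

    import numpy as np
    from sklearn.model_selection import KFold
    from tensorflow import keras
    from tensorflow.keras import layers

    def build_model():
        model = keras.Sequential([
            keras.Input(shape=(9,)),                 # 9 audio features in
            layers.Dense(32, activation="relu"),     # hidden width is an assumption
            layers.Dense(4, activation="softmax"),   # one probability per mood
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",   # integer labels from LabelEncoder
                      metrics=["accuracy"])
        return model

    # 10-fold cross validation over the scaled features X and encoded labels y.
    scores = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=50, batch_size=32, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(acc)
    print("mean accuracy:", np.mean(scores))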

Results:

The final model proved to be fairly accurate, with an accuracy score of about 76%: roughly 76% of the time, our model correctly predicts the mood of a song. Considering that the mood of a song is itself subjective, predicting correctly about 76% of the time is good. We also looked at the F1 score, a metric that combines precision and recall (their harmonic mean). F1 is often regarded as a better measure than accuracy because it copes better with imbalanced classes. Our F1 score for the model was about 73%, which is relatively good as far as F1 scores go. Our model performed well and, most importantly, passes the “ear test”: almost all of the songs in each playlist we output seem to fall under the given mood.

Application

Finally, once we had a trained model, the last step was to apply what we had made! Using testing data, we made predictions for about 600 songs from 4 of Michael’s playlists and sorted those results into DataFrames.

Energetic Song Predictions

It’s important to notice that for each mood there are decimal values (pctCalm, pctEnergetic, pctHappy, pctSad) corresponding to how likely the classifier thought each song was to elicit that mood. Any one song can make different people feel different ways, which made classification difficult, so we used these percentage rankings to sort the outputs and keep only the top 50 songs for each mood.
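
A rough sketch of how those percentage columns and the top-50 cut can be produced, assuming testDF holds the scraped songs and X_test their scaled features:

    # model and encoder come from the earlier sketches; row order of testDF and X_test must match.
    probs = model.predict(X_test, verbose=0)          # one row of four probabilities per song
    predictions = testDF[["name", "uri"]].copy()
    for i, mood in enumerate(encoder.classes_):       # e.g. calm, energetic, happy, sad
        predictions["pct" + mood.capitalize()] = probs[:, i]

    # Keep only the 50 songs the model is most confident are energetic.
    top_energetic = predictions.sort_values("pctEnergetic", ascending=False).head(50)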

Lastly, we matched the unique Spotify identifier (URI) to each song name and wrote a playlist back to Michael’s account.
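
With Spotipy, writing the playlist back takes two calls; the username and description here are placeholders:

    # sp is an authenticated spotipy.Spotify client with a playlist-modify scope.
    playlist = sp.user_playlist_create(
        user="michaels_username",                 # placeholder username
        name="DrAsEnergeticMix",
        public=True,
        description="Songs our classifier thinks feel energetic",
    )
    # Spotify accepts at most 100 tracks per add call; our top 50 fit in one request.
    sp.playlist_add_items(playlist["id"], top_energetic["uri"].tolist())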

DrAsEnergeticMix (shoutout to our professor, Dr. Anderson) is a collection of songs for when you’re feeling excited; it is generally high in liveness, loudness, and energy.

Dr. A’s Mix, on Michael’s Account!

Future Work

While we are highly satisfied with the progress we made on this project, there are a few remaining steps we would like to take.

First off, in the original planning of this project we intended to have our model classify both genre AND mood, so the recommended playlists could be better tailored to a user’s interests. Unfortunately, individual songs are not tagged with genres, which left a hole in our data. We were able to derive a genre from the general one assigned to a song’s artist, but felt it was not accurate enough to use in this project. In the future, we would love to dive deeper into this possibility to see if we could incorporate this additional specification.

Second, we believe that this model can provide a great benefit to listeners and would love to provide an interface for people to take advantage of. In the future, we look forward to leveraging our web experience to develop an interface allowing users to connect to their own account and create their own playlists.

Lastly, another aspect we would like to add is lyrical language analysis, as what a song says often has as much effect, if not more, on how it makes us feel as its auditory components.

Final Thoughts

It was really fun completing a project that taught us a ton about the data science pipeline. Throughout the process, we learned about everything from data scraping, preparation, and visualization, to machine learning and API usage. The most rewarding part was that we were able to see and hear our results in tangible ways useful to us and others. If you’re interested you can check out Dr. A’s happy, sad, calm, and energetic playlists through Michael’s Spotify account, and the rest of our code on our Github. Happy listening!
