TikTok is a rapidly growing social media platform where users commonly create videos of themselves performing “dance challenges” to songs. Each song has a particular choreography devised by the initial creator of the challenge. TikTok is now the 7th largest social media platform and boasts more users than Snapchat, Pinterest, and Twitter¹.
The top TikTok songs used in these dance challenges have been featured in a staggering number of videos. For instance, Laxed by Jawsh 685 has been used as the music in 49.40M videos.
As TikTok's user base has skyrocketed, the top creators on the platform have become real celebrities. For instance, Charli D’Amelio has 111M followers and Addison Rae has 78M followers. Charli D’Amelio has become so famous that she even has her own signature drink at Dunkin’, served throughout the United States.
However, these TikTok celebrities are different from the celebrities of old.
“In the old model of celebrity, stars were propped up by studios and agencies with a stake in their enduring appeal. TikTok’s young stars have grown up in a world where fame can arrive in an instant, but also disappear overnight.” — Rachel Monroe, The Atlantic
It has become harder for these new stars to stay relevant and keep putting out great content.
Given how competitive TikTok has become, I wanted to create a solution that suggests potential hit songs to TikTok creators. I did this by building a classification model that predicts whether a song will be successful on TikTok.
- Collecting top TikTok Songs: I collected the top 1,000 daily songs on TikTok through Chartmetric. Chartmetric was the only company I found that was tracking this type of information. Chartmetric's API is expensive, so I instead used Selenium to download all of the top historical charts through their premium offering. This left me with ~15,000 total songs.
- Feature Extraction: Once I had a list of songs on TikTok, I extracted song features using the Spotify API. After extracting these features, I was left with ~7,500 songs, because a large number of songs were not on Spotify or were simply sounds rather than music.
Labels / Features
Chartmetric ranks top songs on two measures: 1) Top daily songs by video count 2) Top 7 day trailing songs by video count. I broke labels out into two groups:
- Hit: if a song ever breaks the top 50 for daily video counts or 7 day trailing video counts
- Not a Hit: if a song never breaks the top 50 for daily video counts or 7 day trailing video counts
We can take a look at the below example. The first song will be labelled a “hit” given its peak rank is 30, but the second song will be labelled “not a hit” since its peak position is 144.
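The labeling rule can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the project's actual code: the function name `label_song`, the constant `HIT_RANK_CUTOFF`, and the rank histories are all made up for the example, with the histories chosen so their peaks match the two songs above (30 and 144).

```python
HIT_RANK_CUTOFF = 50  # "top 50" threshold from the labeling rule


def label_song(daily_ranks, trailing_7d_ranks, cutoff=HIT_RANK_CUTOFF):
    """Label a song from its chart history on both Chartmetric measures.

    A song is a "hit" if its best (numerically lowest) rank on either
    chart is within the cutoff; otherwise it is "not a hit".
    """
    all_ranks = list(daily_ranks) + list(trailing_7d_ranks)
    if not all_ranks:
        return "not a hit"  # the song never charted
    peak_rank = min(all_ranks)
    return "hit" if peak_rank <= cutoff else "not a hit"


# Hypothetical rank histories whose peaks match the two example songs.
song_a = {"daily": [210, 85, 30, 47], "trailing_7d": [190, 120, 64]}
song_b = {"daily": [300, 144, 250], "trailing_7d": [410, 275]}

print(label_song(song_a["daily"], song_a["trailing_7d"]))  # hit
print(label_song(song_b["daily"], song_b["trailing_7d"]))  # not a hit
```

Taking the minimum across both charts means a single day inside the top 50 is enough to label a song a hit, which matches the "ever breaks the top 50" wording.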
Features for this model were broken out between audio related features and metadata (13 total features).
Song features: danceability, energy, key, acousticness, mode, speechiness, liveness, valence, loudness, tempo, instrumentalness
Metadata: year released, artists
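To make the 13 features concrete, here is a rough sketch of how one row of the dataset could be assembled. It assumes the audio features arrive as a dictionary shaped like a Spotify audio-features response (which also carries extra fields such as `duration_ms`). The function name `build_feature_row`, the output column names, and every value in the example are placeholders, and the artist is kept as a plain string because the encoding used in the project is not described here.

```python
# The 11 audio features listed above, using Spotify's field names.
AUDIO_FEATURES = [
    "danceability", "energy", "key", "acousticness", "mode", "speechiness",
    "liveness", "valence", "loudness", "tempo", "instrumentalness",
]


def build_feature_row(audio_features, year_released, artist):
    """Combine audio features and metadata into one 13-feature row.

    Returns None when no audio features are available, e.g. for a sound
    that is not a track on Spotify, so the caller can drop that song.
    """
    if audio_features is None:
        return None
    # Keep only the 11 modeled features and ignore any extra fields.
    row = {name: audio_features[name] for name in AUDIO_FEATURES}
    row["year_released"] = year_released
    row["artist"] = artist
    return row


# Made-up values for illustration only.
example_audio_features = {
    "danceability": 0.81, "energy": 0.62, "key": 5, "acousticness": 0.12,
    "mode": 1, "speechiness": 0.07, "liveness": 0.10, "valence": 0.55,
    "loudness": -6.3, "tempo": 118.0, "instrumentalness": 0.0,
    "duration_ms": 180000,
}

row = build_feature_row(example_audio_features, 2020, "Example Artist")
print(len(row))  # 13
```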
I trained and optimized multiple classification models on the dataset, including K-Nearest Neighbors, Logistic Regression, Random Forest, Balanced Random Forest, XGBoost, and a neural network. Because the dataset was imbalanced (~10% of songs were hits), I handled this in each model by oversampling with ADASYN, altering class weights, or selecting models that are designed for label imbalance.
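As an illustration of the class-weighting option, the sketch below computes "balanced" class weights, where each class is weighted by `n_samples / (n_classes * class_count)`. This is the same heuristic scikit-learn applies for `class_weight="balanced"`. Whether this exact scheme was used in the project is an assumption, and the 100/900 label split is made up to mirror the ~10% hit rate.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    weight(c) = n_samples / (n_classes * count(c)), so the rare class
    contributes as much to the loss overall as the common one.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label vector with ~10% hits (1 = hit, 0 = not a hit).
labels = [1] * 100 + [0] * 900
weights = balanced_class_weights(labels)
print(weights)  # hits get weight 5.0, non-hits roughly 0.56
```

With a 10% hit rate, each hit counts about nine times as much as each non-hit, which discourages a model from simply predicting "not a hit" for everything.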
I selected my top model based on the F1 score, given I wanted to equally maximize precision and recall. Ultimately, the balanced random forest had the highest F1 score at 80%.
The model performed well with only 4 false negatives and 88 false positives.
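For reference, F1 is the harmonic mean of precision and recall, and all three can be computed from confusion-matrix counts. The sketch below uses the 88 false positives and 4 false negatives reported above, but the true-positive count is not reported in this post, so `tp=184` is purely a placeholder picked so that the result lands near the 80% F1. Treat the precision and recall values it prints as illustrative, not as reported results.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    if tp == 0:
        return 0.0, 0.0, 0.0  # no true positives: all three are zero
    precision = tp / (tp + fp)  # share of predicted hits that are real hits
    recall = tp / (tp + fn)     # share of real hits the model found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# fp and fn come from the text; tp is a hypothetical placeholder.
precision, recall, f1 = f1_from_counts(tp=184, fp=88, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Under that placeholder, recall is much higher than precision: the model would miss very few real hits at the cost of flagging a fair number of songs that never charted, which is a reasonable trade-off for a tool that suggests candidate songs.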
The top features in the model were artist, year released, and danceability. Other features like mode and key had a limited impact. I also trained and tested models excluding the artist feature, as I did not want the artist to overly bias the model; however, these models all resulted in significantly lower F1 scores of ~30%.
Predicting on Outside Data
After training and testing these classification models, I then predicted whether or not songs from this 127,000 song Kaggle dataset would be hits. After predicting these results, I utilized Streamlit to display my results. This allows TikTok users to filter for songs and find potential songs they may want to use in a new video. You can check out the web application here.
The model appears to perform well. When filtering the dataset to songs from 1920 to 1971, no songs are predicted to be hits. This makes sense given that TikTok's younger users prefer more modern music.
When filtering songs to 2018–2021, the first 5 predicted hits return promising results:
- 92 Explorer by Post Malone: has yet to be a hit on TikTok, but Post Malone is a popular artist on TikTok with other hits
- Walk by Comethazine: has yet to be a hit
- Ride it by Regard: hit, peak position on the charts was 5th
- Mine by Bazzi: hit, peak position on the charts was 2nd
- May I by Flo Milli: hit, peak position on the charts was 1st
This project was completed as part of a 3-month Metis data science bootcamp program.
For more information on this project, including code and presentation slides, please check out my GitHub repository here.