About this Dataset

Featured Spotify Artists/tracks With Metadata

This is a sample of Spotify artist and track metadata for artists featured on Spotify's curated Editorial playlists from April 1st to May 9th, 2024. The artist metadata includes Monthly Listener counts, follower counts, genres, dates of first and most recent releases, total number of releases and total number of tracks, as of 2024. The track metadata includes the number of times featured, markets, popularity and Spotify audio features.

The dataset is designed to be representative of the distribution of artists and tracks featured by Spotify's editorial team. It can be compared/contrasted with this large random sample across all Spotify artists.

There are three files:

1. featured_Spotify_artist_info.csv (~10,000 unique artists over ~28,000 rows) contains the featured artists, including repetitions across playlists and dates: each appearance is assigned its own row. If a track is a collaboration, only one of its artists is listed, chosen at random.
2. featured_Spotify_track_info.csv (~15,000 unique featured tracks) contains the featured tracks and their metadata, where multiple artists, dates and playlists are collapsed into comma-separated strings.
3. CLEANED_featured_Spotify_artist_info.csv is similar to (1), but removes any rows with null values in any column, and supplements the genre data with data scraped from Spotify biographies (see below).
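
Because featured_Spotify_artist_info.csv has one row per appearance, artist-level analyses usually need to collapse repeated rows first. The following is a minimal sketch using only the Python standard library. It relies on the `ids` column from the column description below; the inline sample rows are invented for illustration and stand in for the real file.

```python
import csv
import io
from collections import Counter


def count_appearances(csv_file):
    """Return a Counter mapping each artist ID to its number of rows."""
    reader = csv.DictReader(csv_file)
    return Counter(row["ids"] for row in reader)


# Invented sample rows. For the real data, pass
# open("featured_Spotify_artist_info.csv", newline="", encoding="utf-8").
sample = io.StringIO(
    "dates,ids,names\n"
    "2024-04-01,artist_a,Artist A\n"
    "2024-04-02,artist_a,Artist A\n"
    "2024-04-02,artist_b,Artist B\n"
)

appearances = count_appearances(sample)
unique_artists = len(appearances)          # number of distinct artists
total_rows = sum(appearances.values())     # number of appearances (rows)
```

Comparing `unique_artists` with `total_rows` gives a quick check against the approximate figures quoted for the file.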

All the code associated with the creation of this dataset can be found on my GitHub. For further details, see the data collection notes and the Provenance section (below).

All suggestions for improvement are very welcome!

Possible usage

See this Medium article. This dataset is well-suited to a demographic study of the types of artists and music that are favored by Spotify's editorial team. In particular, it can be compared directly to this random sample of Spotify artists, providing two populations: featured artists vs. typical artists on Spotify. For example:

The genre split of featured music vs. music on Spotify
The prior popularity and Monthly Listener counts of featured artists (at the time of featuring) vs. artists on Spotify
A causal exploration of the factors (and biases) that might lead an artist to be featured in the Editorial playlists
Any other ideas you have - please discuss! 🙏

Details and scope of data collection

The data were collected each morning at 8:00 ET via the Spotify Web API, as follows:

1. Featured editorial playlists were pulled for the US market. Playlists were excluded if they were in the 'TV & Movies' category, or had titles containing the words "This is...", "Official", "Hits" or "Top", as these are explicitly biased towards popular artists.
2. Unique tracks, added in the last 24 hours, were pulled. The time limit was used to avoid unnecessary duplicates, and to ensure that artist/track popularity and Monthly Listener counts are representative of the values at the time the tracks were featured.
3. Associated unique artists were found.
4. If >1,000 unique artists were found, a random sample of 1,000 was selected, to respect the API rate limits.
5. Metadata was pulled for the artist sample, plus the associated featured tracks.
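
The filtering and sampling rules above can be sketched as pure functions, without any API calls. This is an illustration only and not the author's collection script (which is on GitHub). The function names are hypothetical, the whole-word, case-insensitive title matching is an assumption, and the timestamp is assumed to be an ISO 8601 "added at" string for each playlist item.

```python
import random
import re
from datetime import datetime, timedelta, timezone

# Assumption: titles are matched on whole words, ignoring case.
EXCLUDED_TITLE = re.compile(r"\b(this is|official|hits|top)\b", re.IGNORECASE)
EXCLUDED_CATEGORY = "TV & Movies"
MAX_ARTISTS = 1000  # cap used to respect the API rate limits


def keep_playlist(title, category):
    """Return True if a playlist passes the exclusion rules."""
    if category == EXCLUDED_CATEGORY:
        return False
    return EXCLUDED_TITLE.search(title) is None


def added_in_last_24h(added_at, now):
    """Return True if an ISO 8601 timestamp falls within 24 hours before `now`."""
    added = datetime.fromisoformat(added_at.replace("Z", "+00:00"))
    return timedelta(0) <= now - added <= timedelta(hours=24)


def sample_artists(artist_ids, cap=MAX_ARTISTS, seed=None):
    """Deduplicate artist IDs and randomly sample `cap` of them if there are more."""
    unique = sorted(set(artist_ids))
    if len(unique) <= cap:
        return unique
    return random.Random(seed).sample(unique, cap)


now = datetime(2024, 4, 2, 12, 0, tzinfo=timezone.utc)
fresh = added_in_last_24h("2024-04-01T13:00:00Z", now)   # added 23 hours earlier
stale = added_in_last_24h("2024-03-31T12:00:00Z", now)   # added 48 hours earlier
```

Passing a fixed `seed` to `sample_artists` makes the random sample reproducible, which is useful when checking an analysis against a rerun.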

NOTE: Four days are missing (3 Saturdays and 1 Sunday) due to crashes in the scheduled scraping script. See the Weekday Comparison notebook for a crude check that this missing data does not bias the other features. You may want to check the impact of weekday in causal analyses, to be sure.

Column description

`featured_Spotify_artist_info.csv`

dates: Date that an artist was featured, str
ids: Unique Spotify IDs of each artist, str
names: Spotify artist name, str
monthly_listeners: The number of unique monthly listeners for each artist, collected during April and May 2024. This is the most reliable publicly available measure of an artist's popularity on Spotify, float, 0 if absent
popularity: The Spotify-defined popularity metric, int
- Note that Spotify does not actually disclose how this is calculated, and so this metric should be used with caution. In broad terms, it's calculated from the popularity of an artist's tracks, which in turn is "based, in the most part, on the total number of plays the track has had and how recent those plays are."
followers: The number of followers the artist has, int
genres: The musical genres associated with each artist: where more than one genre is associated with an artist, separate genres are contained within quotation marks and separated by commas; where only one genre is present, no quotation marks are used, str, empty if no genres
- Note that genres are often absent from the Spotify metadata, so in CLEANED_featured_Spotify_artist_info.csv, we've done an additional scrape of the Spotify biographies of artists with missing genres (see Provenance for details)
- To use only the genres from the official Spotify metadata, you can apply your own cleaning to featured_Spotify_artist_info.csv
first_release: The year of an artist's first release, int, -1 if no releases
last_release: The year of an artist's most recent release, as of May 2024, int, -1 if no releases
num_releases: The total number of releases an artist has made, as of May 2024, capped at 20 (all numbers >20 are set to 20) int, -1 if no releases
num_tracks: The number of tracks in the most recent album/single that an artist has released, as of May 2024, int, -1 if no tracks
playlists_found: The Editorial playlists in which the artist was featured, on the date in question, str, follows the same formatting as genres
feat_track_ids: Spotify track IDs of the featured tracks
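
Several of the columns above use placeholder values (0 or -1) and a multi-value string format, which are easy to mistake for real data. The sketch below shows one way to normalize a single row. It assumes double quotation marks around multi-value entries (adjust the pattern if the file uses single quotes) and assumes the row values have already been converted to numbers; the function names and the `num_releases_capped` key are hypothetical.

```python
import re

# Assumption: multi-value fields look like '"indie pop", "dream pop"'.
_QUOTED = re.compile(r'"([^"]*)"')


def parse_multi_value(field):
    """Split a genres / playlists_found string into a list of values."""
    if not isinstance(field, str):
        return []
    text = field.strip()
    if not text:
        return []
    quoted = [item.strip() for item in _QUOTED.findall(text)]
    if quoted:
        return [item for item in quoted if item]
    return [text]  # a single value is stored without quotation marks


def none_if_sentinel(value, sentinel):
    """Replace a placeholder value with None."""
    return None if value == sentinel else value


def clean_artist_row(row):
    """Return a copy of one artist row with placeholders made explicit."""
    cleaned = dict(row)
    cleaned["genres"] = parse_multi_value(row.get("genres"))
    # monthly_listeners is 0 when the count is absent.
    cleaned["monthly_listeners"] = none_if_sentinel(row["monthly_listeners"], 0)
    # The release and track columns use -1 when there is nothing to report.
    for column in ("first_release", "last_release", "num_releases", "num_tracks"):
        cleaned[column] = none_if_sentinel(row[column], -1)
    # num_releases is capped at 20, so a value of 20 means "20 or more".
    cleaned["num_releases_capped"] = cleaned["num_releases"] == 20
    return cleaned
```

Keeping the cap as a separate flag avoids treating 20 as an exact release count in later statistics.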

`featured_Spotify_track_info.csv`

ids: Unique Spotify IDs of each track, str
names: Name of the track, str
popularity: The Spotify-defined popularity metric, int
- Note that Spotify does not actually disclose how this is calculated, and so this metric should be used with caution. In broad terms, it's "based, in the most part, on the total number of plays the track has had and how recent those plays are."
markets: The number of markets in which the track is available, int
artists: The Spotify IDs of the artists that created the track: where more than one artist has collaborated on a track, separate artists are contained within quotation marks and separated by commas; where only one artist is present, no quotation marks are used, str
release_date: The date on which the track was released, as provided by Spotify. Sometimes this is just a year, other times a specific day, str
count: The number of separate instances (dates and Editorial playlists) in which the track was featured, int
dates: The dates that the track was featured on any playlist, str
playlists_found: The Editorial playlists in which the track was featured, str, follows the same formatting as in featured_Spotify_artist_info.csv
The following are copied directly from the Spotify Web API documentation
duration_ms: The duration of the track in milliseconds, int
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic, float, range 0-1
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity, float, range 0-1
energy: Represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy, float, range 0-1
instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0, float, range 0-1
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live, float, range 0-1
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB, float
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks, float, range 0-1
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration, float
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry), float
musicalkey: Equivalent to the 'key' field in the Spotify Web API syntax. The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1, int
musicalmode: Equivalent to the 'mode' field in the Spotify Web API syntax. Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0, int
time_signature: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of "3/4", to "7/4", int, only 5 time signatures are represented
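
The integer encodings of `musicalkey` and `musicalmode`, and the speechiness thresholds quoted above, can be turned into readable labels. This is a minimal sketch with hypothetical function names; it accepts floats because the table schema further down stores these columns as DOUBLE.

```python
import math

# Standard Pitch Class notation: index 0 = C, 1 = C♯/D♭, 2 = D, and so on.
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]
MODES = {1: "major", 0: "minor"}


def describe_key(musicalkey, musicalmode):
    """Return a label such as 'A minor', or None if no key was detected."""
    if musicalkey is None or math.isnan(musicalkey):
        return None
    key = int(musicalkey)
    if key == -1:  # -1 means that no key was detected
        return None
    if not 0 <= key <= 11:
        raise ValueError(f"unexpected pitch class: {musicalkey}")
    mode = MODES.get(int(musicalmode), "unknown mode")
    return f"{PITCH_CLASSES[key]} {mode}"


def speechiness_band(value):
    """Bucket speechiness using the 0.33 and 0.66 thresholds."""
    if value > 0.66:
        return "speech"  # probably made entirely of spoken words
    if value >= 0.33:
        return "mixed"   # may contain both music and speech
    return "music"       # most likely music or other non-speech audio
```

For example, `describe_key(9.0, 0.0)` labels a track in pitch class 9 with mode 0 as A minor.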

Tables

Cleaned Featured Spotify Artist Info

@kaggle.sarahjeffreson_featured_spotify_artiststracks_with_metadata.cleaned_featured_spotify_artist_info

1.37 MB
20251 rows
13 columns


CREATE TABLE cleaned_featured_spotify_artist_info (
  "dates" TIMESTAMP,
  "ids" VARCHAR,
  "names" VARCHAR,
  "monthly_listeners" DOUBLE,
  "popularity" BIGINT,
  "followers" BIGINT,
  "genres" VARCHAR,
  "first_release" BIGINT,
  "last_release" BIGINT,
  "num_releases" BIGINT,
  "num_tracks" BIGINT,
  "playlists_found" VARCHAR,
  "feat_track_ids" VARCHAR
);

Featured Spotify Artist Info

@kaggle.sarahjeffreson_featured_spotify_artiststracks_with_metadata.featured_spotify_artist_info

1.76 MB
27782 rows
13 columns


CREATE TABLE featured_spotify_artist_info (
  "dates" TIMESTAMP,
  "ids" VARCHAR,
  "names" VARCHAR,
  "monthly_listeners" DOUBLE,
  "popularity" BIGINT,
  "followers" BIGINT,
  "genres" VARCHAR,
  "first_release" BIGINT,
  "last_release" BIGINT,
  "num_releases" BIGINT,
  "num_tracks" BIGINT,
  "playlists_found" VARCHAR,
  "feat_track_ids" VARCHAR
);

Featured Spotify Track Info

@kaggle.sarahjeffreson_featured_spotify_artiststracks_with_metadata.featured_spotify_track_info

1.6 MB
15052 rows
22 columns


CREATE TABLE featured_spotify_track_info (
  "ids" VARCHAR,
  "names" VARCHAR,
  "popularity" DOUBLE,
  "markets" DOUBLE,
  "artists" VARCHAR,
  "release_date" TIMESTAMP,
  "duration_ms" DOUBLE,
  "acousticness" DOUBLE,
  "danceability" DOUBLE,
  "energy" DOUBLE,
  "instrumentalness" DOUBLE,
  "liveness" DOUBLE,
  "loudness" DOUBLE,
  "speechiness" DOUBLE,
  "tempo" DOUBLE,
  "valence" DOUBLE,
  "musicalkey" DOUBLE,
  "musicalmode" DOUBLE,
  "time_signature" DOUBLE,
  "count" DOUBLE,
  "dates" VARCHAR,
  "playlists_found" VARCHAR
);