About this Dataset

Large Random Spotify Artist Sample With Metadata

This is a pseudo-random sample of Spotify artist metadata as of 2024, including monthly listener counts, follower counts, genres, years of first and most recent releases, total number of releases, and total number of tracks. Version 3 also includes track metadata for one randomly selected track per artist.

The dataset is designed to be representative of the distribution of all artists on Spotify, without bias towards better-known or more popular artists. This sets it apart from similar datasets retrieved using the Spotify Web API.

There are three files. CLEANED_Spotify_artist_info.csv (~15,000 artists) contains only the artists (rows) with non-null values in every column except popularity and followers (monthly listeners is more informative than either of these). The larger file, Spotify_artist_info.csv (~37,000 artists), retains the artists (rows) with null values in these columns. Lastly, CLEANED_Spotify_artist_info_tracks.csv (~15,000 tracks) contains the track metadata for one randomly selected track for each artist in CLEANED_Spotify_artist_info.csv.
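The filtering rule described above can be approximated with the standard library. This is only a sketch: it assumes null values appear as empty cells in Spotify_artist_info.csv, the sample rows are made up, and the helper names are hypothetical. It will not reproduce the published cleaned file exactly, because that file also fills in some missing genres from artist biographies.

```python
import csv
import io

# Columns that may be empty in the cleaned file, per the description above.
OPTIONAL_COLUMNS = {"popularity", "followers"}

def has_required_values(row):
    """Return True if every column except the optional ones is non-empty."""
    return all(
        (value or "").strip() != ""
        for column, value in row.items()
        if column not in OPTIONAL_COLUMNS
    )

def filter_complete_rows(csv_text):
    """Keep only the rows that have values in all required columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if has_required_values(row)]

# Made-up rows standing in for Spotify_artist_info.csv (assumption: nulls are empty cells).
SAMPLE = """ids,names,popularity,followers,genres,first_release,last_release,num_releases,num_tracks,monthly_listeners
a1,Artist One,,,"indie pop",2015,2023,4,1,1200.0
a2,Artist Two,3,10,,2019,2019,1,2,
a3,Artist Three,12,55,rock,2001,2024,20,10,50321.0
"""

complete_rows = filter_complete_rows(SAMPLE)
print([row["ids"] for row in complete_rows])
```

To run this on the real file, pass the text of Spotify_artist_info.csv to `filter_complete_rows` instead of `SAMPLE`.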

All the code associated with the creation of this dataset can be found on my GitHub. See data Provenance (below) for further details.

All suggestions for improvement are very welcome!

Possible usage

See this Medium article. This dataset is particularly useful as a baseline/comparison for the exploration of bias, or for any demographic study of artists on Spotify, for example:

Studying how many artists on Spotify are actively producing music, and at what rate
Comparing Monthly Listener distributions across different genres
In combination with this dataset of artists featured on Editorial playlists, a causal exploration of the factors (and biases) that might lead an artist to be featured on the Editorial playlists
Any other interesting ideas you have - please discuss! 🙏
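As a starting point for the genre comparison mentioned above, here is a minimal sketch that computes the median monthly listeners per genre. The rows are made up, the function name is hypothetical, and genres are assumed to be parsed into lists already. The median is used on the assumption that listener counts are heavily skewed, so that a few very popular artists do not dominate the summary.

```python
from collections import defaultdict
from statistics import median

def median_listeners_by_genre(artists):
    """Map each genre to the median monthly_listeners of its artists."""
    listeners = defaultdict(list)
    for artist in artists:
        # An artist with several genres contributes to each of them.
        for genre in artist["genres"]:
            listeners[genre].append(artist["monthly_listeners"])
    return {genre: median(values) for genre, values in listeners.items()}

# Made-up rows, for illustration only.
ARTISTS = [
    {"genres": ["rock"], "monthly_listeners": 100.0},
    {"genres": ["rock", "indie"], "monthly_listeners": 300.0},
    {"genres": ["indie"], "monthly_listeners": 50.0},
    {"genres": ["rock"], "monthly_listeners": 10000.0},
]

genre_medians = median_listeners_by_genre(ARTISTS)
print(genre_medians)
```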

Column description

Artist metadata

ids: Unique Spotify IDs of each artist, str
names: Spotify artist name, str
popularity: The Spotify-defined popularity metric, int
- Note that Spotify does not actually disclose how this is calculated, and so this metric should be used with caution. In broad terms, it's calculated from the popularity of an artist's tracks, which in turn is "based, in the most part, on the total number of plays the track has had and how recent those plays are."
followers: The number of followers the artist has, int
genres: The musical genres associated with each artist: where more than one genre is associated with an artist, separate genres are contained within quotation marks and separated by commas; where only one genre is present, no quotation marks are used, str, empty if no genres
- Note that genres are often absent from the Spotify metadata, so for CLEANED_Spotify_artist_info.csv we additionally scraped the Spotify biographies of artists with missing genres, which avoids throwing out an additional 3,000 rows (see Provenance for details)
- To use only the genres from the official Spotify metadata, you can apply your own cleaning to Spotify_artist_info.csv
first_release: The year of an artist's first release, int, -1 if no releases
last_release: The year of an artist's most recent release, as of May 2024, int, -1 if no releases
num_releases: The total number of releases an artist has made, as of May 2024, capped at 20 (all numbers >20 are set to 20), int, -1 if no releases
num_tracks: The total number of tracks in the artist's most recent album or single, not compilation, as of May 2024, int, -1 if no tracks
monthly_listeners: The number of unique monthly listeners for each artist, collected during April and May 2024. This is the most reliable publicly available measure of an artist's popularity on Spotify, float, 0 if absent
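The encodings in the artist columns above (the genres string, -1 for no releases, the cap of 20 on num_releases, and 0 for absent monthly listeners) can be decoded along these lines. This is a sketch under assumptions: the exact quoting of the genres cell (single or double quotes, with or without square brackets) is a guess, the helper names are hypothetical, and the code targets rows without empty cells such as those in the cleaned file.

```python
import re

# Sentinel values documented in the column list.
NO_RELEASE = -1
RELEASE_CAP = 20

def parse_genres(raw):
    """Split a raw genres cell into a list of genre names.

    Assumed format: several genres appear as quoted, comma-separated items
    (optionally inside square brackets); a single genre is unquoted.
    """
    text = (raw or "").strip().strip("[]").strip()
    if not text:
        return []
    quoted = re.findall(r"""(['"])(.*?)\1""", text)
    if quoted:
        return [genre for _, genre in quoted]
    return [text]

def year_or_none(value):
    """Convert a year cell to int, mapping the -1 sentinel to None."""
    year = int(float(value))
    return None if year == NO_RELEASE else year

def decode_artist_row(row):
    """Replace sentinel values with None and flag capped release counts."""
    num_releases = int(float(row["num_releases"]))
    listeners = float(row["monthly_listeners"])
    return {
        "genres": parse_genres(row["genres"]),
        "first_release": year_or_none(row["first_release"]),
        "last_release": year_or_none(row["last_release"]),
        "num_releases": None if num_releases == NO_RELEASE else num_releases,
        # A stored value of 20 means "20 or more", so it is only a lower bound.
        "releases_capped": num_releases == RELEASE_CAP,
        # 0 marks an absent listener count rather than a measured zero.
        "monthly_listeners": None if listeners == 0 else listeners,
    }

# Made-up row, for illustration only.
example = decode_artist_row({
    "genres": "'indie pop', 'dream pop'",
    "first_release": "2015",
    "last_release": "2024",
    "num_releases": "20",
    "monthly_listeners": "0.0",
})
print(example)
```

Keeping a separate `releases_capped` flag makes it harder to mistake the capped value for an exact count when computing release rates.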

Track metadata

ids: Unique Spotify IDs of each track, str
names: Name of the track, str
popularity: The Spotify-defined popularity metric, int
- Note that Spotify does not actually disclose how this is calculated, and so this metric should be used with caution. In broad terms, it's "based, in the most part, on the total number of plays the track has had and how recent those plays are."
markets: The market code of the markets in which the track is available, int
artists: The Spotify IDs of the artists that created the track: where more than one artist has collaborated on a track, separate artists are contained within quotation marks and separated by commas; where only one artist is present, no quotation marks are used, str
release_date: The date on which the track was released, as provided by Spotify. Sometimes this is just a year, other times a specific day, str
The following are copied directly from the Spotify Web API documentation:
duration_ms: The duration of the track in milliseconds, int
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic, float, range 0-1
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity, float, range 0-1
energy: Represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy, float, range 0-1
instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0, float, range 0-1
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live, float, range 0-1
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB, float
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks, float, range 0-1
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration, float
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry), float, range 0-1
musicalkey: Equivalent to the 'key' field in the Spotify Web API syntax. The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1, int
musicalmode: Equivalent to the 'mode' field in the Spotify Web API syntax. Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0, int
time_signature: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of "3/4" to "7/4", int, only 5 time signatures are represented
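The numeric encodings of musicalkey, musicalmode and speechiness described above can be turned into readable labels as follows. This is a small sketch: the function names and label strings are hypothetical, and values are passed through `float` first because the table schema stores musicalkey and musicalmode as doubles.

```python
# Pitch Class notation: index 0 is C, 1 is C♯/D♭, and so on up to 11 for B.
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]

def describe_key(musicalkey, musicalmode):
    """Return a label such as 'A minor', or None if no key was detected."""
    key = int(float(musicalkey))
    if key == -1:
        return None
    mode = "major" if int(float(musicalmode)) == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {mode}"

def classify_speechiness(value):
    """Bucket speechiness using the 0.33 and 0.66 thresholds from the description."""
    if value > 0.66:
        return "mostly speech"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"

print(describe_key(0, 1))
print(describe_key("9.0", "0.0"))
print(classify_speechiness(0.05))
```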

Tables

Cleaned Spotify Artist Info

@kaggle.sarahjeffreson_large_random_spotify_artist_sample_with_metadata.cleaned_spotify_artist_info

943.85 KB
15027 rows
10 columns


CREATE TABLE cleaned_spotify_artist_info (
  "ids" VARCHAR,
  "names" VARCHAR,
  "popularity" BIGINT,
  "followers" BIGINT,
  "genres" VARCHAR,
  "first_release" BIGINT,
  "last_release" BIGINT,
  "num_releases" BIGINT,
  "num_tracks" BIGINT,
  "monthly_listeners" DOUBLE
);

Cleaned Spotify Artist Info Tracks

@kaggle.sarahjeffreson_large_random_spotify_artist_sample_with_metadata.cleaned_spotify_artist_info_tracks

1.7 MB
15013 rows
19 columns


CREATE TABLE cleaned_spotify_artist_info_tracks (
  "ids" VARCHAR,
  "names" VARCHAR,
  "popularity" BIGINT,
  "markets" BIGINT,
  "artists" VARCHAR,
  "release_date" TIMESTAMP,
  "duration_ms" BIGINT,
  "acousticness" DOUBLE,
  "danceability" DOUBLE,
  "energy" DOUBLE,
  "instrumentalness" DOUBLE,
  "liveness" DOUBLE,
  "loudness" DOUBLE,
  "speechiness" DOUBLE,
  "tempo" DOUBLE,
  "valence" DOUBLE,
  "musicalkey" DOUBLE,
  "musicalmode" DOUBLE,
  "time_signature" DOUBLE
);
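The two cleaned tables can be linked through the artists column of the tracks table, which holds artist IDs. The sketch below uses Python's built-in sqlite3 module as a stand-in for whichever SQL engine you query the tables with, keeps only a few of the columns, and uses made-up rows and IDs. Matching with `instr` (substring search) is an assumption chosen so that multi-artist cells also match; an exact comparison would only work for single-artist tracks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced versions of the two cleaned tables (subset of columns only).
conn.executescript("""
CREATE TABLE cleaned_spotify_artist_info (
  "ids" VARCHAR,
  "names" VARCHAR,
  "monthly_listeners" DOUBLE
);
CREATE TABLE cleaned_spotify_artist_info_tracks (
  "ids" VARCHAR,
  "names" VARCHAR,
  "artists" VARCHAR,
  "tempo" DOUBLE
);
""")

# Made-up rows; real Spotify IDs are longer strings.
conn.executemany(
    "INSERT INTO cleaned_spotify_artist_info VALUES (?, ?, ?)",
    [("artistA", "Artist A", 1200.0), ("artistB", "Artist B", 50.0)],
)
conn.executemany(
    "INSERT INTO cleaned_spotify_artist_info_tracks VALUES (?, ?, ?, ?)",
    [
        ("track1", "Song One", "artistA", 120.0),
        ("track2", "Song Two", "'artistB', 'artistX'", 95.5),
    ],
)

# Join each artist to the sampled track whose artists cell contains the artist ID.
JOIN_SQL = """
SELECT a.names, t.names, t.tempo
FROM cleaned_spotify_artist_info AS a
JOIN cleaned_spotify_artist_info_tracks AS t
  ON instr(t.artists, a.ids) > 0
ORDER BY a.names
"""
joined = conn.execute(JOIN_SQL).fetchall()
print(joined)
```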

Spotify Artist Info

@kaggle.sarahjeffreson_large_random_spotify_artist_sample_with_metadata.spotify_artist_info

1.97 MB
37000 rows
10 columns


CREATE TABLE spotify_artist_info (
  "ids" VARCHAR,
  "names" VARCHAR,
  "popularity" BIGINT,
  "followers" BIGINT,
  "genres" VARCHAR,
  "first_release" BIGINT,
  "last_release" BIGINT,
  "num_releases" BIGINT,
  "num_tracks" BIGINT,
  "monthly_listeners" DOUBLE
);