Objective: explore songs using the Spotify API
This dataset comes from the TidyTuesday project, where each week a contributor submits a dataset that participants can explore and analyze. The data are publicly available but were transformed to meet the tidy principles. Here, we are looking at the Spotify dataset.
The already pre-processed dataset can be directly obtained here; it is named spotify_songs.
Compute the number of songs per album and plot the density of that count. Hints:
- geom_density() needs only one mapped aesthetic (the plot is univariate), so use geom_density(aes(x = n)) once you have computed the number of songs per album.
- fill = "grey", alpha = 0.5 gives a feel for the density with some transparency.
- scale_x_log10() and annotation_logticks(sides = "b") help to better visualize the values and the distribution.
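A minimal sketch of the suggested plot. A small toy tibble stands in for spotify_songs here (only the track_album_id column is assumed from the real data); with the real dataset, start the pipe from spotify_songs instead.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for spotify_songs; replace by the real tibble when available
set.seed(123)
toy_songs <- tibble(track_album_id = sample(letters, 500, replace = TRUE))

p_density <- toy_songs %>%
  count(track_album_id) %>%                   # number of songs per album -> column n
  ggplot(aes(x = n)) +                        # univariate: one mapped aesthetic
  geom_density(fill = "grey", alpha = 0.5) +  # transparent grey density
  scale_x_log10() +                           # counts are right-skewed
  annotation_logticks(sides = "b")            # log ticks at the bottom only
p_density
```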
Even if you don’t know much about music, you know that most songs are written in one key associated with a mode, such as B minor. In the columns key and mode, Spotify encoded this information following those correspondences.
Optionally, one could add the percentages of the mode inside the bars.
The plot will look nicer if the bars are sorted; fct_reorder() can help for that matter. recode() can help you get the true note instead of the numerical encoding. See an example below:
library(dplyr)
library(ggplot2)

mtcars %>%
  mutate(transmission = recode(am,
                               `0` = "automatic",
                               `1` = "manual")) %>%
  ggplot(aes(y = transmission)) +
  geom_bar()
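Applied to the Spotify columns, a sketch could look like the following. Toy data stand in for spotify_songs; the note labels assume the pitch-class encoding used by the Spotify API (key: 0 = C, 1 = C#/Db, ...; mode: 1 = major, 0 = minor).

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Toy stand-in for spotify_songs with integer-encoded key and mode
set.seed(123)
toy_songs <- tibble(
  key  = sample(0:11, 300, replace = TRUE),
  mode = sample(0:1, 300, replace = TRUE)
)

notes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
           "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

p_keys <- toy_songs %>%
  count(key, mode) %>%
  mutate(key  = factor(key, levels = 0:11, labels = notes),
         mode = recode(mode, `0` = "minor", `1` = "major"),
         # reorder the notes by their total count across both modes
         key  = fct_reorder(key, n, .fun = sum)) %>%
  ggplot(aes(x = n, y = key, fill = mode)) +
  geom_col()
p_keys
```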
It was suggested that the arrival of streaming services has changed the way artists work, especially the length of tracks in rap music. Since we have the data, we can check this assumption.
Hints:
- use geom_vline() to highlight a milestone, for example when Spotify reached 1 million users.
- with this many points, overplotting is an issue; you could play with transparency (alpha), but the recent package ggpointdensity proposes a neat solution.
- geom_smooth(method = "loess") displays the trend.
- use as.Date(track_album_release_date) to coerce characters to R dates.
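A sketch of this check, with toy data standing in for the rap subset of spotify_songs (the vertical-line date is purely illustrative; geom_pointdensity() from ggpointdensity could replace the alpha trick here):

```r
library(dplyr)
library(ggplot2)

# Toy stand-in: release dates stored as character, as in spotify_songs
set.seed(123)
toy_rap <- tibble(
  track_album_release_date = as.character(
    sample(seq(as.Date("2000-01-01"), as.Date("2020-01-01"), by = "day"),
           300, replace = TRUE)),
  duration_ms = rnorm(300, mean = 240000, sd = 40000)
)

p_duration <- toy_rap %>%
  mutate(date = as.Date(track_album_release_date)) %>%  # coerce characters to Date
  ggplot(aes(x = date, y = duration_ms / 1000)) +       # duration in seconds
  geom_point(alpha = 0.3) +                             # transparency vs overplotting
  geom_smooth(method = "loess") +                       # trend of track length
  geom_vline(xintercept = as.Date("2013-03-01"),        # illustrative milestone date
             linetype = "dashed")
p_duration
```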
On the same page, Spotify data scientists created several parameters, like valence, speechiness or energy, and assigned a score to each song. Let’s explore those parameters and compare their distributions to, for example, one of my playlists. You can of course use your own playlist if you wish; I can help you with fetching your data.
col_to_keep <- c("track_name", "track_artist", "track_popularity", "track_album_name",
                 "track_album_release_date", "danceability", "speechiness", "acousticness",
                 "instrumentalness", "liveness", "valence", "tempo", "duration_ms")
Select those columns from the spotify_songs tibble. Assign the result to the name sub_spotify_songs.
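A minimal sketch of the selection, using all_of() so that a typo in col_to_keep raises an error instead of being silently ignored (a one-row toy tibble stands in for spotify_songs):

```r
library(dplyr)

col_to_keep <- c("track_name", "track_artist", "track_popularity", "track_album_name",
                 "track_album_release_date", "danceability", "speechiness", "acousticness",
                 "instrumentalness", "liveness", "valence", "tempo", "duration_ms")

# Toy stand-in with the right columns plus one extra; with the real data,
# select directly from spotify_songs
toy_songs <- as_tibble(setNames(as.list(rep(1, length(col_to_keep) + 1)),
                                c(col_to_keep, "extra_column")))

sub_spotify_songs <- select(toy_songs, all_of(col_to_keep))
```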
If you wish to extract the song features to compare with the 33k songs above: the Spotify data you can download about yourself don’t contain them. They do however track the number of plays etc., which would also be a nice project to explore.
To get the features, we need to use the spotifyr package. You need to get your client ID and secret; follow the docs on the website. Here is the code I used once that information was obtained:
# package not on CRAN
# remotes::install_github("charlie86/spotifyr")
library(spotifyr)

Sys.setenv(SPOTIFY_CLIENT_ID = 'xxxxxxxxxxxxxxxxxxxxxxxxxx')
Sys.setenv(SPOTIFY_CLIENT_SECRET = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')

access_token <- get_spotify_access_token()
my_plists <- get_my_playlists()
# I used only one playlist, get the id of yours
ginolhac <- get_playlist_audio_features(playlist_uris = "xxxxxxxxxxxxxxxx")

filter(ginolhac, !is_local) %>%
  select(track_id = track.id, track_name = track.name, track_album_name = track.album.name,
         track_popularity = track.popularity, danceability,
         key = key_mode, loudness:tempo, duration_ms = track.duration_ms) %>%
  vroom::vroom_write("data/yourname_spotify.tsv.gz")
Read the exported file back in and assign it the name spogino.
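A sketch of the round trip: a toy tibble is written out the way the spotifyr chunk above does, then read back. With the real file, point vroom::vroom() at "data/yourname_spotify.tsv.gz" instead.

```r
library(tibble)

# Toy tibble standing in for the exported playlist
toy <- tibble(track_name = c("a", "b"), danceability = c(0.5, 0.7))
path <- file.path(tempdir(), "yourname_spotify.tsv.gz")
vroom::vroom_write(toy, path)   # tab-separated, gzip-compressed

# with the real data: spogino <- vroom::vroom("data/yourname_spotify.tsv.gz")
spogino <- vroom::vroom(path)
```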
Bind the rows of spotify_songs and spogino and assign the result the name spomerge. You could join the tables (inner_join()), but then all columns would be bound side by side, and even if renamed meaningfully, the result wouldn’t be tidy. A smarter way is to add an id column to each tibble before binding the rows. This works only because ALL columns exist in both tables and are named the same.
# see this toy example
(t1 <- tibble(id = 1,
              a = 1:3,
              b = c("a", "a", "a")))
## # A tibble: 3 x 3
## id a b
## <dbl> <int> <chr>
## 1 1 1 a
## 2 1 2 a
## 3 1 3 a
(t2 <- tibble(id = 2,
              a = 4:6,
              b = c("b", "b", "b")))
## # A tibble: 3 x 3
## id a b
## <dbl> <int> <chr>
## 1 2 4 b
## 2 2 5 b
## 3 2 6 b
bind_rows(t1, t2)
## # A tibble: 6 x 3
## id a b
## <dbl> <int> <chr>
## 1 1 1 a
## 2 1 2 a
## 3 1 3 a
## 4 2 4 b
## 5 2 5 b
## 6 2 6 b
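As a shortcut, bind_rows() can create that id column itself through its .id argument, filling it with the names given to the inputs. For the exercise this would read spomerge <- bind_rows(tidytuesday = spotify_songs, gino = spogino, .id = "id") (the input names here are illustrative).

```r
library(dplyr)

# Same toy tables as above, without pre-added id columns
t1 <- tibble(a = 1:3, b = "a")
t2 <- tibble(a = 4:6, b = "b")

# .id adds a first column filled with the names given to the inputs
spomerge_toy <- bind_rows(t1 = t1, t2 = t2, .id = "id")
```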
Using the id column, compare the distributions between the two sources for the parameters c("danceability", "speechiness", "acousticness", "duration_ms", "track_popularity", "liveness", "valence", "tempo"). Use scales = "free" as we have very different distributions.
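One way to build that comparison: pivot the parameter columns into long form, then facet per parameter. A toy spomerge holding only two of the eight parameters stands in here; with the real data, pivot all eight columns listed above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for spomerge: an id column plus two of the Spotify parameters
set.seed(42)
toy_merge <- tibble(
  id           = rep(c("tidytuesday", "gino"), each = 100),
  danceability = runif(200),
  tempo        = rnorm(200, mean = 120, sd = 20)
)

p_params <- toy_merge %>%
  pivot_longer(cols = c(danceability, tempo),
               names_to = "parameter", values_to = "score") %>%
  ggplot(aes(x = score, fill = id)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ parameter, scales = "free")  # free scales: very different ranges
p_params
```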