(April-2018: updated to use ggridges package instead of deprecated ggjoy)

Hello! Those who know me well know that my favorite band is Thrice! For those who aren’t familiar with them, they are a post-hardcore rock band from California, specifically the area around where I went to college (OC/Irvine area). This article is Part 1 of a series that will cover data analysis of Thrice’s lyrics. In Part 1, however, we will just be doing some exploratory analysis with all of the non-lyrics data, so we can all get an understanding of the context of what we are dealing with before we deep-dive into the lyrics!

# Packages:
library(tidyverse)     # for dplyr, tidyr and ggplot2
library(lubridate)     # measuring and calculating time periods
library(scales)        # fiddling with scales on our plots
library(stringr)       # detecting string patterns
library(gridExtra)     # arranging multiple plots in a single output

# Load and tidy  ----------------------------------------------------------

df <- read.csv('~/R_materials/ThriceLyrics/thrice.df.csv', header = TRUE, stringsAsFactors = FALSE)
str(df, list.len = 3)
## 'data.frame':    103 obs. of  9 variables:
##  $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ album   : chr  "Identity Crisis" "Identity Crisis" "Identity Crisis" "Identity Crisis" ...
##  $ year    : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##   [list output truncated]

One important thing to note when reading files into R: if the data set you are importing already has a header row, header needs to be TRUE (this is the default for read.csv(), though not for read.table()), or else your column/variable names will appear as the first row of data in each column, as shown below:

df2 <- read.csv('~/R_materials/ThriceLyrics/thrice.df.csv', header = FALSE, stringsAsFactors = FALSE)

str(df2, list.len = 3)
## 'data.frame':    104 obs. of  9 variables:
##  $ V1: chr  "ID" "1" "2" "3" ...
##  $ V2: chr  "album" "Identity Crisis" "Identity Crisis" "Identity Crisis" ...
##  $ V3: chr  "year" "2000" "2000" "2000" ...
##   [list output truncated]
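If you don’t have the CSV at hand, the same effect can be reproduced with a few lines of inline text. This is only a minimal sketch: csv_text is a made-up three-column extract that I am assuming looks like the real file, not the real file itself.

```r
# Made-up two-row extract standing in for the real CSV file (an assumption)
csv_text <- "ID,album,year
1,Identity Crisis,2000
2,Identity Crisis,2000"

with_header <- read.csv(text = csv_text, header = TRUE,  stringsAsFactors = FALSE)
no_header   <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

nrow(with_header)     # 2 rows, with columns named ID / album / year
nrow(no_header)       # 3 rows: the header line has become a row of data
names(no_header)      # default names V1, V2, V3
class(no_header$V3)   # "character", because the text "year" sits among the numbers
```

Note how the stray header row also forces every column to character, which is exactly what happened to V1 and V3 above.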

Let’s get a “glimpse” of our data frame!

glimpse(df)
## Observations: 103
## Variables: 9
## $ ID       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ album    <chr> "Identity Crisis", "Identity Crisis", "Identity Crisi...
## $ year     <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,...
## $ tracknum <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, ...
## $ title    <chr> "Identity Crisis", "Phoenix Ignition", "In Your Hands...
## $ writers  <chr> "Dustin Kensrue", "Dustin Kensrue", "Riley Breckenrid...
## $ length   <chr> "2M 58S", "3M 31S", "2M 47S", "3M 4S", "3M 2S", "2M 4...
## $ lengthS  <chr> "178S", "211S", "167S", "184S", "182S", "124S", "57S"...
## $ lyrics   <chr> "Image marred by self-infliction <br>  Private wars o...

As we can see, the song ID, year, and tracknum variables are all integers, while all the others are character types, even the length and lengthS variables. To transform these last two variables we can use the lubridate package. The ms() and seconds() functions in this package transform character or numeric types into a Period, a class that tracks time spans in units such as minutes and seconds. At the same time, we can turn the album and year variables into factors!

library(lubridate)

df <- df %>% 
  mutate(album = factor(album, levels = unique(album)),
         year = factor(year, levels = unique(year)),
         length = ms(length),
         lengthS = seconds(length))

glimpse(df)
## Observations: 103
## Variables: 9
## $ ID       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ album    <fct> Identity Crisis, Identity Crisis, Identity Crisis, Id...
## $ year     <fct> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,...
## $ tracknum <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, ...
## $ title    <chr> "Identity Crisis", "Phoenix Ignition", "In Your Hands...
## $ writers  <chr> "Dustin Kensrue", "Dustin Kensrue", "Riley Breckenrid...
## $ length   <S4: Period> 2M 58S, 3M 31S, 2M 47S, 3M 4S, 3M 2S, 2M 4S, 5...
## $ lengthS  <S4: Period> 178S, 211S, 167S, 184S, 182S, 124S, 57S, 250S,...
## $ lyrics   <chr> "Image marred by self-infliction <br>  Private wars o...

Both length and lengthS are now Period type variables, and album and year are factors!
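To see what these conversions do in isolation, here is a small sketch on made-up vectors (albums, len and secs are illustrative names, not columns from the data set). It shows why levels = unique(...) is used for the factors (it keeps the order of appearance instead of sorting alphabetically) and how a Period keeps its minutes and seconds as separate components.

```r
library(lubridate)

# Factor levels: the default is sorted, levels = unique() keeps order of appearance
albums <- c("Vheissu", "Beggars", "Vheissu", "Major Minor")
levels(factor(albums))                            # alphabetical order
levels(factor(albums, levels = unique(albums)))   # order in which they first appear

# ms() parses minutes and seconds from text; seconds() builds a Period from a number
len  <- ms(c("2M 58S", "3M 31S"))
secs <- seconds(c(178, 211))

minute(len)              # minutes component of each Period
second(len)              # seconds component of each Period
period_to_seconds(len)   # total length in seconds
```

Keeping the albums in order of appearance matters later on, because the data is sorted by release date and the plots will follow the factor levels.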

Now let’s take a closer look at our data! First, how many albums has Thrice released in total?

# Explore our data -----------------------------------------------

length(unique(df$album))
## [1] 11

11 albums so far! Do note that in reality, The Alchemy Index albums (divided into the four elements of Fire, Water, Air, and Earth) were released as two albums of two elements each (in 2007 and 2008 respectively). I treated each element as its own album because they’re stylistically very different from one another, and for the purposes of the lyrics analysis later on I thought it would be better to categorize them as distinct albums.

Another, more readable way to do the above is to use the n_distinct() function from the dplyr package while also taking advantage of the magrittr pipe:

df %>% select(album) %>% n_distinct()  
## [1] 11

How many total songs have Thrice released?

df %>% select(title) %>% n_distinct()
## [1] 103
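As a quick check that n_distinct() and length(unique()) agree, here is a minimal sketch on a hypothetical three-song data frame (toy is made up for illustration). Keep in mind that counting distinct titles only equals the number of songs as long as no two songs share a title.

```r
library(dplyr)

# Hypothetical mini track list, just for illustration
toy <- data.frame(
  album = c("Identity Crisis", "Identity Crisis", "Beggars"),
  title = c("Identity Crisis", "Phoenix Ignition", "All The World Is Mad"),
  stringsAsFactors = FALSE
)

n_distinct(toy$album)        # dplyr: number of different albums
length(unique(toy$album))    # base R equivalent

# Both counts in one summary row
counts <- toy %>% summarise(albums = n_distinct(album), songs = n_distinct(title))
counts
```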

Now let’s list all of the Thrice albums by name:

df %>% select(album, year) %>% unique() 
##                                 album year
## 1                     Identity Crisis 2000
## 12             The Illusion Of Safety 2002
## 25        The Artist In The Ambulance 2003
## 37                            Vheissu 2005
## 48             The Alchemy Index Fire 2007
## 54            The Alchemy Index Water 2007
## 60              The Alchemy Index Air 2008
## 66            The Alchemy Index Earth 2008
## 72                            Beggars 2009
## 82                        Major Minor 2011
## 93 To Be Everywhere And To Be Nowhere 2016

What is the length in seconds and minutes of each album?

df %>% 
  group_by(album, year) %>% 
  summarise(num_songs = n(),    # Number of songs in each album
            duration = as.duration(sum(lengthS))) %>% 
  arrange(desc(duration))
## # A tibble: 11 x 4
## # Groups:   album [11]
##    album                            year  num_songs duration              
##    <fct>                            <fct>     <int> <S4: Duration>        
##  1 Major Minor                      2011         11 2962s (~49.37 minutes)
##  2 Vheissu                          2005         11 2960s (~49.33 minutes)
##  3 Beggars                          2009         10 2624s (~43.73 minutes)
##  4 To Be Everywhere And To Be Nowh~ 2016         11 2496s (~41.6 minutes) 
##  5 The Artist In The Ambulance      2003         12 2374s (~39.57 minutes)
##  6 The Illusion Of Safety           2002         13 2307s (~38.45 minutes)
##  7 Identity Crisis                  2000         11 2142s (~35.7 minutes) 
##  8 The Alchemy Index Water          2007          6 1627s (~27.12 minutes)
##  9 The Alchemy Index Air            2008          6 1454s (~24.23 minutes)
## 10 The Alchemy Index Fire           2007          6 1327s (~22.12 minutes)
## 11 The Alchemy Index Earth          2008          6 1256s (~20.93 minutes)

Major/Minor and Vheissu are the longest albums, both totaling up to a bit over 49 mins!

How about the length of each song?

df %>% 
  group_by(title) %>% 
  summarise(duration = as.duration(sum(lengthS))) %>% 
  arrange(desc(duration))
## # A tibble: 103 x 2
##    title                       duration            
##    <chr>                       <S4: Duration>      
##  1 Words In The Water          386s (~6.43 minutes)
##  2 Salt And Shadow             368s (~6.13 minutes)
##  3 Night Diving                362s (~6.03 minutes)
##  4 Daedalus                    360s (~6 minutes)   
##  5 Stand And Feel Your Worth   352s (~5.87 minutes)
##  6 Beggars                     324s (~5.4 minutes) 
##  7 A Song For Milly Michaelson 307s (~5.12 minutes)
##  8 The Weight                  300s (~5 minutes)   
##  9 The Earth Isn't Humming     298s (~4.97 minutes)
## 10 Kings Upon The Main         296s (~4.93 minutes)
## # ... with 93 more rows

Besides grouping with group_by() and summarizing with summarize(), we can also narrow down our data with filter(). For example, let’s say we want to see the total duration of The Alchemy Index (Fire, Water, Air, and Earth). We could use the grepl() function to keep all albums with the term “Index” in their name:

df %>% 
  filter(grepl("Index", album)) %>% 
  summarise(duration_minutes = seconds_to_period(sum(lengthS)))
##   duration_minutes
## 1       1H 34M 24S

Or we can use the stringr package’s str_detect() function to find all entries in album that contain the term “Index”:

library(stringr)

df %>% 
  filter(str_detect(album, "Index")) %>% 
  summarise(duration_minutes = seconds_to_period(sum(lengthS)))
##   duration_minutes
## 1       1H 34M 24S

Both do practically the same thing. The seconds_to_period() function here essentially allows us to create a Period output (Days/Hours/Minutes/Seconds) from the variable lengthS (which is in seconds).
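Here is a small sketch of both ideas on a made-up vector of album names. The one practical difference to remember is the argument order: grepl() takes the pattern first, str_detect() takes the string first. The sum of seconds uses the four Alchemy Index totals from the album table earlier in this post.

```r
library(stringr)
library(lubridate)

albums <- c("The Alchemy Index Fire", "Vheissu", "The Alchemy Index Water")

grepl("Index", albums)                # base R: pattern first, then the vector
str_detect(albums, "Index")           # stringr: vector first, then the pattern
str_detect(albums, fixed("Index"))    # fixed() matches literal text rather than a regex

# Fire + Water + Air + Earth, in seconds
total <- seconds_to_period(1327 + 1627 + 1454 + 1256)
hour(total)     # 1
minute(total)   # 34
second(total)   # 24
```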

If we wanted to look at a specific album:

df %>% 
  filter(album == "Vheissu") %>% 
  summarise(duration_minutes = seconds_to_period(sum(lengthS)))
##   duration_minutes
## 1          49M 20S

How about we try summarizing as we did a few code chunks back but use the seconds_to_period() function instead?

df %>% 
  group_by(album) %>% 
  summarize(duration_minutes = seconds_to_period(sum(lengthS))) %>% 
  arrange(desc(duration_minutes))
## # A tibble: 11 x 2
##    album                              duration_minutes
##    <fct>                              <S4: Period>    
##  1 The Alchemy Index Earth            35M 56S         
##  2 Beggars                            44S             
##  3 Identity Crisis                    42S             
##  4 To Be Everywhere And To Be Nowhere 36S             
##  5 The Artist In The Ambulance        34S             
##  6 The Illusion Of Safety             27S             
##  7 Major Minor                        22S             
##  8 Vheissu                            20S             
##  9 The Alchemy Index Air              14S             
## 10 The Alchemy Index Fire             7S              
## 11 The Alchemy Index Water            7S

And the same summary for each song:

df %>% 
  group_by(title) %>% 
  summarize(duration_song = seconds_to_period(sum(lengthS))) %>% 
  arrange(desc(duration_song))
## # A tibble: 103 x 2
##    title                            duration_song
##    <chr>                            <S4: Period> 
##  1 All The World Is Mad             3M 59S       
##  2 Black Honey                      59S          
##  3 Don't Tell And We Won't Ask      59S          
##  4 Paper Tigers                     59S          
##  5 Identity Crisis                  58S          
##  6 The Earth Isn't Humming          58S          
##  7 Yellow Belly                     58S          
##  8 The Next Day                     57S          
##  9 Between The End And Where We Lie 56S          
## 10 Kings Upon The Main              56S          
## # ... with 93 more rows

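Unfortunately, the seconds_to_period() conversion doesn’t seem to work well with summarize() once we group by album or song. Comparing with the as.duration() totals from earlier, only the seconds component of each Period seems to survive: Vheissu should be 49M 20S but shows up as 20S, and the rows are then sorted by that leftover seconds value. I find this very weird, as it worked perfectly fine when we summarized all the Alchemy Index albums together into a single row. I’ll have to look into this later…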
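In the meantime, one possible workaround is to keep the totals as plain numbers while grouping and sorting, and only convert to a Period at the very end. I have not confirmed the root cause, so treat this as a sketch under the assumption that the trouble comes from combining Period objects across groups. The songs data frame and its secs column are made up for illustration.

```r
library(dplyr)
library(lubridate)

# Made-up song lengths in seconds for two hypothetical albums
songs <- tibble(
  album = c("A", "A", "B", "B"),
  secs  = c(178, 211, 300, 296)
)

totals <- songs %>%
  group_by(album) %>%
  summarise(total_secs = sum(secs)) %>%            # plain numeric total per album
  arrange(desc(total_secs)) %>%                    # sort on the number, not on a Period
  mutate(total = seconds_to_period(total_secs))    # convert once, after sorting

totals
```

Because the sorting happens on total_secs, the order no longer depends on how a Period column is compared.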

Leaving that aside for now (especially since we can still calculate the sums just fine using as.duration()), let’s start plotting to visualize the song lengths for Thrice!

Plot song lengths!

# Plotting! ---------------------------------------------------------------

df %>% 
  ggplot(aes(x = as.numeric(lengthS))) + 
  geom_histogram(binwidth = 10, 
                 color = 'white',
                 fill = 'darkgreen') +
  scale_y_continuous(breaks = pretty_breaks(), 
                      limits = c(0, 13), expand = c(0, 0)) +  # expand 0,0 to reduce space
  scale_x_continuous(breaks = pretty_breaks(10), 
                     limits = c(0, 420), expand = c(0, 0)) +  # set limits manually
  xlab('Seconds') +
  ylab('# of Songs') +
  labs(title = 'Distribution of Thrice Songs by Length') +
  theme_bw() +
  theme(axis.text = element_text(size = 14, face = "bold", color = "#252525"))

Let’s try plotting in minutes as well by dividing lengthS (length in seconds) by 60. This gives decimal minutes rather than minutes and seconds (3.5 means 3 minutes 30 seconds, not 3 minutes 50 seconds), but it’s good enough for our purposes. Also, the Period type that we created doesn’t seem to work with ggplot2 directly as far as I know, which is why we convert it with as.numeric() inside aes().
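If the decimal minutes bother you, an alternative is to keep the x-axis in seconds and only change how the tick labels are printed. The sketch below uses a hypothetical helper, fmt_mmss() (my own name, not part of any package), and a made-up toy data frame.

```r
library(ggplot2)

# Hypothetical helper: turn a number of seconds into a "m:ss" label
fmt_mmss <- function(secs) {
  secs <- round(secs)
  sprintf("%d:%02d", as.integer(secs %/% 60), as.integer(secs %% 60))
}

178 / 60                 # 2.966667 decimal minutes
fmt_mmss(c(178, 360))    # "2:58" "6:00"

# Made-up song lengths in seconds
toy <- data.frame(secs = c(57, 124, 178, 211, 250, 386))

p <- ggplot(toy, aes(x = secs)) +
  geom_histogram(binwidth = 60) +
  scale_x_continuous(labels = fmt_mmss)   # bins stay in seconds, labels read as m:ss
```

The histogram itself is unchanged; only the axis text is reformatted.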

df %>% 
  ggplot(aes(x = as.numeric(lengthS)/60)) + 
  geom_histogram(binwidth = 0.5, 
                 color = 'white',
                 fill = 'darkgreen') +
  scale_y_continuous(breaks = pretty_breaks(10), 
                     expand = c(0,0), limits = c(0, 30)) +
  scale_x_continuous(breaks = pretty_breaks(5)) +
  xlab('Minutes') +
  ylab('# of Songs') +
  labs(title = 'Distribution of Thrice Songs by Length') +
  theme_bw() +
  theme(axis.text = element_text(size = 14, color = "#252525"),
        axis.title = element_text(size = 14))

Next, we save the same minutes histogram as an object (with smaller text and a tighter y-axis) so that we can reuse it in the faceted plots below:

```r

histogram <- df %>% 
  ggplot(aes(x = as.numeric(lengthS)/60)) + 
  geom_histogram(binwidth = 0.5, 
                 color = "#FFFFFF",
                 fill = "#006400") +
  scale_y_continuous(breaks = pretty_breaks(), expand = c(0, 0), limits = c(0, 7)) +
  scale_x_continuous(breaks = pretty_breaks()) +
  xlab('Minutes') +
  ylab('# of Songs') +
  labs(title = 'Distribution of Thrice Songs by Length') +
  theme_bw() +
  theme(axis.text = element_text(size = 8, color = "#252525"),
        axis.title = element_text(size = 8)) 
        
```

How can we see differences between albums? We can subset our data to create mini-plots for each individual level of our variable (album in our case) using facets. First let’s try the facet_wrap() function:

histogram + facet_wrap(~album)

With this setup we can see the distribution for an individual album quite well, however it’s hard to compare across different albums unless they are situated in the same column.

How about we try it the other way around, with the plot of each album being a row instead, while also adding some trend lines? This time we’ll use the facet_grid() function with the levels of the variable (album) being distributed vertically:

histogram + facet_grid(album ~.) + 
  geom_smooth(se = FALSE, stat = "bin", bins = 10, col = "#FF3030")

That looks really bad. On one hand, we can compare the histograms against each other easily, but the bars are all squished and that makes it hard to discern any differences. There are just way too many albums and not enough screen space to take advantage of faceting like this.

If there weren’t so many albums it would look better, but even then the trend lines aren’t very smooth in the first place. What we can do is try out a different plotting method altogether, so now let’s introduce…

Joy plots!

Joy plots engulfed the data science/visualization community during the past summer. First popularized in a post by Henrik Lindberg on “peak times for sports and leisure”, joy plots are useful for visualizing changes in distribution over time or space and were made to be an alternative to heat maps. Amidst much debate on the various advantages and disadvantages of this visualization method all across social media, Claus Wilke released the ggjoy package, which allows you to easily make joy plots on top of the existing ggplot2 package.

(April-2018: updated to use ggridges package)

I finally have a chance to put this to practice with my own data so let’s try it out here!

# Joy Plots ---------------------------------------------------------------
library(ggridges)

df %>% 
  ggplot(aes(x = as.numeric(lengthS)/60, y = album)) +
  geom_density_ridges() +
  xlab('Minutes') +
  scale_x_continuous(breaks = pretty_breaks(7))

You can see that the ridge lines are drawn from the densities of the data along time (x-axis). The more songs there are of any particular duration, the higher the ridge appears, with the overall effect being that of a mountain range that can be compared across different groups, in this case Thrice’s albums.
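To get a feel for what a single ridge represents, here is a minimal base R sketch of a density estimate on made-up song lengths. It is conceptually similar to what is drawn for each album, although the bandwidth that ggridges picks may well differ from the base R default used here.

```r
# Made-up song lengths in minutes for one hypothetical album
minutes <- c(2.8, 3.0, 3.1, 3.2, 3.4, 6.0)

d <- density(minutes)           # kernel density estimate from base R (stats)
peak <- d$x[which.max(d$y)]     # the duration where the "ridge" is tallest
peak

plot(d, main = "Density of toy song lengths", xlab = "Minutes")
```

Most of the songs cluster around 3 minutes, so that is where the curve peaks, while the lone 6 minute song only produces a small bump on the right.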

Now let’s add some color (dark green = #006400, dark grey = #404040) and tinker with the scales a bit…

joyplot <- df %>% 
  mutate(group = reorder(album, desc(lengthS))) %>%   # reorder based on lengthS (descending)
  ggplot(aes(x = as.numeric(lengthS)/60, y = group, fill = group)) +   
  geom_density_ridges(scale = 2) +                       # scale to set amount of overlap between ridges
  xlab('Minutes') +
  scale_x_continuous(breaks = pretty_breaks(10)) +
  scale_y_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = rep(c("#006400", "#404040"), n_distinct(df$album))) +
  theme_bw() +
  theme(legend.position = "none")

joyplot

From the joy plot you can clearly see the density of songs shift from around 3 minutes in The Illusion of Safety to around 4 minutes or more in the bottom few albums. The only thing really setting apart the longer albums is the number of songs that are 6 minutes or longer; otherwise most of the songs in an album are around the 4-5 minute mark. A note about The Alchemy Index: Water: its fourth track, Night Diving, is a 6+ minute instrumental which, although really nice to listen to on long drives or on a plane, inflates the album’s position in the joy plot! In contrast, both To Be Everywhere And To Be Nowhere and Identity Crisis also have instrumentals, but those are only around a minute long each!

Finally, let’s compare our histogram with the joy plot!

We can use the grid package to customize layouts:

library(grid)

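grid.newpage()   # start from a blank page before setting up the layout
pushViewport(viewport(layout = grid.layout(1, 2)))

# the saved plot object is called `histogram` (`hist` is a base R function)
print(joyplot, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(histogram, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))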

Or you could use the gridExtra package and its grid.arrange() function, which is a lot faster to set up:

library(gridExtra)

# joyplot2 is the joy plot with its y-axis labels removed (its code is not shown here)
grid.arrange(joyplot2, histogram, nrow = 1)

We can see that the joy plots make the data a lot more understandable (for the final comparison I took out the y-axis labels so we can see the joy plot better).
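The code that removes the y-axis labels isn’t shown in this post, so here is one way it could be done, purely as an assumption about how such a version might be built. The toy data frame and the joy_toy objects are made up, and the theme() call is a standard ggplot2 approach rather than necessarily the exact one used for the final figure.

```r
library(ggplot2)
library(ggridges)

# Made-up song lengths in minutes for two hypothetical albums
toy <- data.frame(
  album   = rep(c("Album A", "Album B"), each = 5),
  minutes = c(2.9, 3.1, 3.4, 3.0, 3.6, 4.1, 4.4, 3.9, 6.0, 4.6)
)

joy_toy <- ggplot(toy, aes(x = minutes, y = album)) +
  geom_density_ridges()

# Same plot, with the y-axis text and title blanked out
joy_toy2 <- joy_toy +
  theme(axis.text.y  = element_blank(),
        axis.title.y = element_blank())
```

Applied to the real plot, the same theme() call could be added to joyplot and the result passed to grid.arrange().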

And that concludes Part 1! Next we will be getting into the real meat of sentiment analysis using the tidytext package!