I recently came across a cool waffle viz for the top 20 shot-creating action players in the big 5 European soccer leagues done by Harsh Krishna on Twitter, see original Tweet here. He also posted the code in this Github gist with a call for help in solving an issue he was having with the preserving the sum of a group of metrics after rounding the individual metrics to integers. Below is the code he posted in full. The problem in particular is after the second for loop where he rounds the metrics, he finds that the sum of the metrics don’t add up to 100(%). His way of resolving this was choosing the Sh_SCA variable and trying to manipulate the values in that metric until the total summed back up to 100.

# Harsh Krishna `@placeholder2004`
# https://github.com/harshkrishna17/R-Code/blob/main/Waffle.R
# Libraries

library(tidyverse)
library(worldfootballR)
library(ggtext)
library(extrafont)
library(waffle)
library(MetBrewer)

# Scraping

data <- fb_big5_advanced_season_stats(season_end_year = 2022, stat_type = "gca", team_or_player = "player")

# Data Wrangling 

data1 <- data %>%
  filter(Mins_Per_90 >= 9) %>%
  select(Player, Mins_Per_90, SCA90_SCA, SCA_SCA, PassLive_SCA, PassDead_SCA, Drib_SCA, Sh_SCA, Fld_SCA, Def_SCA)

data1 <- data1[order(as.numeric(data1$SCA90_SCA),decreasing = TRUE),]
data1 <- data1[c(1:20),]

df <- data1[order(as.numeric(data1$SCA90_SCA),decreasing = TRUE),]
df <- df[c(1:20),]

Player <- data1$Player
Mins <- data1$Mins_Per_90
data1 <- subset(data1, select = -c(Player, Mins_Per_90, SCA90_SCA))

for(i in 1:ncol(data1)) {
  data1[, i] <- data1[, i] / Mins
}

SCA <- data1$SCA_SCA 

for(i in 1:ncol(data1)) {
  data1[, i] <- round((data1[, i] / SCA) * 100, 0)
}

data1 <- data1 %>%
  mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA)

## run this ifelse statement as many times as necessary until the Total comes out to be a 100 for all rows.
data1 <- data1 %>% mutate(Sh_SCA = ifelse(Total == 100, Sh_SCA,
                                          ifelse(Total < 100, Sh_SCA + 1,
                                                 ifelse(Total > 100, Sh_SCA - 1, NA)))) %>% 
  mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA)

data1$Player <- Player 

data1 <- data1 %>%
  pivot_longer(!Player, values_to = "SCAp90", names_to = "SCATypes") %>%
  filter(!SCATypes == "SCA_SCA") %>%
  filter(!SCATypes == "Total") %>%
  count(Player, SCATypes, wt = SCAp90)

data1$Player <- factor(data1$Player, levels = print(df$Player))

# Custom theme function

theme_athletic <- function() {
  theme_minimal() +
    theme(plot.background = element_rect(colour = "#151515", fill = "#151515"),
          panel.background = element_rect(colour = "#151515", fill = "#151515")) +
    theme(plot.title = element_text(colour = "white", size = 24, family = "Fried Chicken Bold", hjust = 0.5),
          plot.subtitle = element_markdown(colour = "#525252", size = 18, hjust = 0.5),
          plot.caption = element_text(colour = "white", size = 15, hjust = 1),
          axis.title.x = element_text(colour = "white", face = "bold", size = 14),
          axis.title.y = element_text(colour = "white", face = "bold", size = 14),
          axis.text.x = element_blank(),
          axis.text.y = element_blank()) +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank()) +
    theme(legend.title = element_text(colour = "white"),
          legend.text = element_text(colour = "white"))
}

# Plotting

data1 %>%
  ggplot(aes(fill = SCATypes, values = n)) +
  geom_waffle(nrows = 10, size = 1.5, colour = "#151515", flip = TRUE) +
  scale_fill_manual(values = met.brewer(name = "Gauguin", n = 6, type = "discrete")) +
  facet_wrap(~Player) +
  labs(title = "Big 5 Leagues Shot-Creating Actions Share [2021/22]",
       subtitle = "Top 20 Players with the most SCA per 90 so far",
       caption = "Minimum 9 90's Played\nData from FBref\nCreated by @placeholder2004") +
  theme_athletic() +
  theme(aspect.ratio = 1,
        strip.background = element_blank(),
        strip.text = element_text(colour = "white", size = 14),
        legend.position = "bottom",
        legend.text = element_text(size = 14))

# Save

setwd("C:/Users/harsh_1mwi2o4/Downloads")
ggsave("wafflebig5.png", width = 3100, height = 3500, units = "px")

This is a problem I’ve faced in the past and I’m sure many others have too, whether in the context of soccer analysis or otherwise. So I decided to tackle this problem as a fun coding challenge using the tidyverse!

Let’s get started!

Packages

I prefer typing out the specific tidyverse packages rather than loading everything in at once.

library(dplyr)
library(tidyr)
library(worldfootballR)
library(ggtext)
library(extrafont)
library(waffle)
library(MetBrewer)

Scrape data

Get data from FBref/StatsBomb with the {worldfootballR} package.

# Scraping
data_raw <- fb_big5_advanced_season_stats(season_end_year = 2022, stat_type = "gca", team_or_player = "player")

glimpse(data_raw)

## Rows: 2,540
## Columns: 27
## $ X               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,~
## $ Season_End_Year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, ~
## $ Squad           <chr> "Alavés", "Alavés", "Alavés", "Alavés", "Alavés", "Ala~
## $ Comp            <chr> "La Liga", "La Liga", "La Liga", "La Liga", "La Liga",~
## $ Player          <chr> "Martin Agirregabiria", "Mircea Alexandru Tirlea", "Ru~
## $ Nation          <chr> "ESP", "ROU", "ESP", "ARG", "ESP", "ESP", "SWE", "ESP"~
## $ Pos             <chr> "DF", "DF", "DF", "MF", "MF", "FW", "FW,MF", "MF", "FW~
## $ Age             <chr> "25-249", "21-292", "26-088", "28-293", "24-012", "27-~
## $ Born            <int> 1996, 2000, 1995, 1993, 1998, 1994, 1992, 1994, 1990, ~
## $ Mins_Per_90     <dbl> 13.6, 0.2, 16.4, 0.6, 7.6, 0.0, 3.0, 0.4, 18.5, 17.9, ~
## $ SCA_SCA         <int> 22, 0, 23, 0, 23, 0, 10, 1, 40, 3, 4, 5, 2, 12, 3, 11,~
## $ SCA90_SCA       <dbl> 1.62, 0.00, 1.40, 0.00, 3.02, 0.00, 3.28, 2.25, 2.16, ~
## $ PassLive_SCA    <int> 16, 0, 11, 0, 12, 0, 5, 1, 24, 0, 4, 3, 2, 10, 1, 6, 4~
## $ PassDead_SCA    <int> 2, 0, 9, 0, 8, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 18, 0,~
## $ Drib_SCA        <int> 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 2, 0, 1, 1, 0, ~
## $ Sh_SCA          <int> 0, 0, 0, 0, 1, 0, 0, 0, 6, 1, 0, 0, 0, 0, 2, 0, 1, 2, ~
## $ Fld_SCA         <int> 3, 0, 2, 0, 1, 0, 2, 0, 6, 0, 0, 0, 0, 0, 0, 3, 0, 1, ~
## $ Def_SCA         <int> 0, 0, 1, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, ~
## $ GCA_GCA         <int> 2, 0, 3, 0, 0, 0, 3, 0, 5, 1, 0, 1, 0, 0, 0, 3, 3, 0, ~
## $ GCA90_GCA       <dbl> 0.15, 0.00, 0.18, 0.00, 0.00, 0.00, 0.99, 0.00, 0.27, ~
## $ PassLive_GCA    <int> 1, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 1, 1, 0, ~
## $ PassDead_GCA    <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ~
## $ Drib_GCA        <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Sh_GCA          <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, ~
## $ Fld_GCA         <int> 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ Def_GCA         <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ Url             <chr> "https://fbref.com/en/players/355c883a/Martin-Agirrega~

Instead of ordering and subsetting with base functions in the original, I used arrange() and slice() to grab the top 20 players by SCA per 90.

# Data Wrangling 
df1 <- data_raw %>%
  filter(Mins_Per_90 >= 9) %>%
  ## use contains() so I don't have to type out every `SCA` variable out
  select(Player, Mins_Per_90, contains("SCA")) %>% 
  ## arrange by SCA per 90 (descending) then take top 20 rows
  arrange(desc(SCA90_SCA)) %>%
  slice(1:20)

glimpse(df1)

## Rows: 20
## Columns: 10
## $ Player       <chr> "Dimitri Payet", "Lorenzo Pellegrini", "Hakan Çalhanoglu"~
## $ Mins_Per_90  <dbl> 15.4, 13.8, 12.7, 9.2, 12.2, 9.2, 10.3, 9.6, 11.8, 17.3, ~
## $ SCA_SCA      <int> 115, 83, 76, 55, 72, 53, 59, 55, 66, 94, 66, 89, 56, 63, ~
## $ SCA90_SCA    <dbl> 7.48, 6.02, 5.97, 5.95, 5.88, 5.77, 5.75, 5.72, 5.59, 5.4~
## $ PassLive_SCA <int> 71, 34, 41, 35, 40, 34, 44, 37, 52, 50, 45, 43, 43, 36, 3~
## $ PassDead_SCA <int> 23, 26, 26, 10, 25, 5, 11, 4, 1, 28, 8, 25, 4, 15, 8, 2, ~
## $ Drib_SCA     <int> 7, 7, 1, 2, 2, 3, 1, 4, 3, 6, 4, 4, 5, 5, 5, 1, 13, 1, 1,~
## $ Sh_SCA       <int> 3, 10, 3, 4, 1, 4, 2, 5, 4, 1, 6, 4, 1, 4, 2, 3, 6, 3, 1,~
## $ Fld_SCA      <int> 11, 5, 3, 3, 2, 6, 1, 4, 6, 8, 2, 9, 2, 3, 8, 0, 7, 1, 1,~
## $ Def_SCA      <int> 0, 1, 2, 1, 2, 1, 0, 1, 0, 1, 1, 4, 1, 0, 3, 2, 2, 0, 2, 2

Instead of using a for loop to perform the calculations on each column, I used mutate() and then across() to specify the columns I wanted to run the same operation on, dividing the metrics by the number of minutes that each player played. Also since we are using mutate() we don’t need to pull out the Mins variable anymore as we can just refer to that specific column in the data.frame.

df2 <- df1 %>% 
  ## use across() to specify which vars to perform operations on
  ## ex. all cols EXCEPT `Player`, `Mins_Per_90`, and `SCA90_SCA`
  mutate(across(-c(Player, Mins_Per_90, SCA90_SCA), ~ . / Mins_Per_90)) 

glimpse(df2)

## Rows: 20
## Columns: 10
## $ Player       <chr> "Dimitri Payet", "Lorenzo Pellegrini", "Hakan Çalhanoglu"~
## $ Mins_Per_90  <dbl> 15.4, 13.8, 12.7, 9.2, 12.2, 9.2, 10.3, 9.6, 11.8, 17.3, ~
## $ SCA_SCA      <dbl> 7.467532, 6.014493, 5.984252, 5.978261, 5.901639, 5.76087~
## $ SCA90_SCA    <dbl> 7.48, 6.02, 5.97, 5.95, 5.88, 5.77, 5.75, 5.72, 5.59, 5.4~
## $ PassLive_SCA <dbl> 4.610390, 2.463768, 3.228346, 3.804348, 3.278689, 3.69565~
## $ PassDead_SCA <dbl> 1.49350649, 1.88405797, 2.04724409, 1.08695652, 2.0491803~
## $ Drib_SCA     <dbl> 0.45454545, 0.50724638, 0.07874016, 0.21739130, 0.1639344~
## $ Sh_SCA       <dbl> 0.19480519, 0.72463768, 0.23622047, 0.43478261, 0.0819672~
## $ Fld_SCA      <dbl> 0.71428571, 0.36231884, 0.23622047, 0.32608696, 0.1639344~
## $ Def_SCA      <dbl> 0.00000000, 0.07246377, 0.15748031, 0.10869565, 0.1639344~

Again, here instead of using a for loop I used the same mutate() + across() trick again to divide the values by each of the total SCA types and then multiply that by 100.

What you get now is that all the per 90 stats are in terms of percentages of the total SCA.

df3 <- df2 %>% 
  ## use across() to specify which vars to perform operations on
  ## ex. all cols EXCEPT `Player`, `Mins_Per_90`, and `SCA90_SCA`
  mutate(across(-c(Player, Mins_Per_90, SCA90_SCA), ~ (. / SCA_SCA) * 100 )) %>% 
  mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA) %>% 
  select(Player, Mins_Per_90, Total, contains("SCA"))

glimpse(df3)

## Rows: 20
## Columns: 11
## $ Player       <chr> "Dimitri Payet", "Lorenzo Pellegrini", "Hakan Çalhanoglu"~
## $ Mins_Per_90  <dbl> 15.4, 13.8, 12.7, 9.2, 12.2, 9.2, 10.3, 9.6, 11.8, 17.3, ~
## $ Total        <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 10~
## $ SCA_SCA      <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 10~
## $ SCA90_SCA    <dbl> 7.48, 6.02, 5.97, 5.95, 5.88, 5.77, 5.75, 5.72, 5.59, 5.4~
## $ PassLive_SCA <dbl> 61.73913, 40.96386, 53.94737, 63.63636, 55.55556, 64.1509~
## $ PassDead_SCA <dbl> 20.000000, 31.325301, 34.210526, 18.181818, 34.722222, 9.~
## $ Drib_SCA     <dbl> 6.086957, 8.433735, 1.315789, 3.636364, 2.777778, 5.66037~
## $ Sh_SCA       <dbl> 2.608696, 12.048193, 3.947368, 7.272727, 1.388889, 7.5471~
## $ Fld_SCA      <dbl> 9.565217, 6.024096, 3.947368, 5.454545, 2.777778, 11.3207~
## $ Def_SCA      <dbl> 0.000000, 1.204819, 2.631579, 1.818182, 2.777778, 1.88679~

Then I check using Total column if the numbers sum to 100. They do, so we know that these per 90 numbers are good and everything adds up to 100% properly.

For the purpose of making a waffle chart, we have to turn all of these values into integers. However, the problem with this is that due to rounding the individual values, the sum doesn’t equal 100 after the calculation! Some sums might equal to 99 or 98, some to 100, and some to 101!

Preserve sum after rounding function

I looked around and tried out a lot of different things, not just different functions but also ways to re-shape the data so that the algorithm would work correctly. The first function I tried was the largeRem() function here but this only worked for when the sums would add up to -1 (99) or +1 (101) from 100. So I then started manually adding more branches to the if-else statements but since I knew there had to be something better out there I moved on. Eventually I found the function below, hoisted from the {JLutils} package that has a pretty good implementation of the largest remainder algorithm:

round_preserve_sum <- function(x, digits = 0) {
  up <- 10^digits
  x <- x * up
  y <- floor(x)
  indices <- tail(order(x - y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y / up
}

I honestly wanted to show how to do row-wise operations using the tidyverse in this section because I got tired of having to pivot back and forth and back and forth. See here for how row-wise operations work in the tidyverse. However, the custom function I used to preserve the sum wasn’t created to support the tidyverse’s row-wise method, so I ended up having to transpose the data and un-transpose it with the pivot functions anyway. At the end, we can check our work by creating a Total variable again to see that everything sums up to 100 even after rounding all the metrics into integers!

df4 <- df3 %>% 
  select(-Mins_Per_90, -SCA90_SCA, -SCA_SCA, -Total) %>% 
  ## transpose
  pivot_longer(cols = -Player) %>% 
  pivot_wider(names_from = Player, values_from = value) %>% 
  ## run function over player column
  mutate(across(-name, ~ round_preserve_sum(.))) %>% 
  ## transpose back to original shape
  pivot_longer(names_to = "Player", values_to = "thing", cols = -name) %>% 
  pivot_wider(names_from = name, values_from = thing) %>% 
  mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA)

glimpse(df4)

## Rows: 20
## Columns: 8
## $ Player       <chr> "Dimitri Payet", "Lorenzo Pellegrini", "Hakan Çalhanoglu"~
## $ PassLive_SCA <dbl> 62, 41, 54, 64, 55, 64, 74, 67, 79, 53, 68, 48, 77, 57, 5~
## $ PassDead_SCA <dbl> 20, 31, 34, 18, 35, 9, 19, 7, 1, 30, 12, 28, 7, 24, 13, 2~
## $ Drib_SCA     <dbl> 6, 9, 1, 4, 3, 6, 2, 7, 5, 6, 6, 4, 9, 8, 8, 1, 14, 1, 1,~
## $ Sh_SCA       <dbl> 3, 12, 4, 7, 1, 8, 3, 9, 6, 1, 9, 5, 2, 6, 3, 4, 7, 3, 2,~
## $ Fld_SCA      <dbl> 9, 6, 4, 5, 3, 11, 2, 8, 9, 9, 3, 10, 3, 5, 13, 0, 8, 1, ~
## $ Def_SCA      <dbl> 0, 1, 3, 2, 3, 2, 0, 2, 0, 1, 2, 5, 2, 0, 5, 2, 2, 0, 3, 3
## $ Total        <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 10~

Now all that’s left is to pivot the data based on the different Shot-Creating Action types so that the data is re-shaped into the format needed for the waffle plot. I also cleaned up the SCA types so that it reads nicer on the final plot using the case_when() function, which is like a Super Saiyan version of if-else statements.

df5 <- df4 %>%
  ## we don't need the `Total` column anymore...
  select(-Total) %>% 
  ## Pivot so that we get the all the SCA types collapsed into a single column
  pivot_longer(!Player, values_to = "SCAp90", names_to = "SCATypes") %>%
  mutate(SCATypes = case_when(
    SCATypes == "Def_SCA" ~ "Defensive Action SCA",
    SCATypes == "Drib_SCA" ~ "Dribble SCA",
    SCATypes == "Fld_SCA" ~ "Fouled SCA",
    SCATypes == "PassDead_SCA" ~ "Dead-ball Pass SCA",
    SCATypes == "PassLive_SCA" ~ "Pass SCA",
    SCATypes == "Sh_SCA" ~ "Shot SCA",
    TRUE ~ SCATypes
  )) %>% 
  count(Player, SCATypes, wt = SCAp90)

glimpse(df5)

## Rows: 120
## Columns: 3
## $ Player   <chr> "Ángel Di María", "Ángel Di María", "Ángel Di María", "Ángel ~
## $ SCATypes <chr> "Dead-ball Pass SCA", "Defensive Action SCA", "Dribble SCA", ~
## $ n        <dbl> 7, 2, 9, 3, 77, 2, 20, 0, 6, 9, 62, 3, 29, 3, 1, 2, 63, 2, 34~

‘The Athletic’ Theme

I do feel that the subtitle should pop out a bit more, but I want to preserve the original as much as possible so I won’t change anything here…!

theme_athletic <- function() {
  theme_minimal() +
    theme(plot.background = element_rect(colour = "#151515", fill = "#151515"),
          panel.background = element_rect(colour = "#151515", fill = "#151515")) +
    theme(plot.title = element_text(colour = "white", size = 24, family = "Fried Chicken Bold", hjust = 0.5),
          plot.subtitle = element_markdown(colour = "#525252", size = 18, hjust = 0.5),
          plot.caption = element_text(colour = "white", size = 15, hjust = 1),
          axis.title.x = element_text(colour = "white", face = "bold", size = 14),
          axis.title.y = element_text(colour = "white", face = "bold", size = 14),
          axis.text.x = element_blank(),
          axis.text.y = element_blank()) +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank()) +
    theme(legend.title = element_text(colour = "white"),
          legend.text = element_text(colour = "white"))
}

Plot

Not much needs to be changed here again, I just cleaned the legend title a bit.

NOTE: The players in the plot changed because, well, there’s been plenty of football since the original viz was posted!

wafplot <- df5 %>%
  ggplot(aes(fill = SCATypes, values = n)) +
  geom_waffle(nrows = 10, size = 1.5, colour = "#151515", flip = TRUE) +
  scale_fill_manual(values = met.brewer(name = "Gauguin", n = 6, type = "discrete"),
                    name = "Shot-Creating Action Type") +
  facet_wrap(~Player) +
  labs(title = "Big 5 Leagues Shot-Creating Actions Share [2021/22]",
       subtitle = "Top 20 Players with the most SCA per 90 so far",
       caption = "Minimum 9 90's Played\nData from FBref\nCreated by @placeholder2004") +
  theme_athletic() +
  theme(aspect.ratio = 1,
        strip.background = element_blank(),
        strip.text = element_text(colour = "white", size = 14),
        legend.position = "bottom",
        legend.text = element_text(size = 14))
wafplot

I’ve just added the call to here::here(), a force of habit to keep file paths relative to the project root directory. Typing out the entire path is an annoyance so you can just do this to avoid it. Also helps when you move projects to a different computer or you’re collaborating using somebody else’s code, you don’t have to re-type C:/Users/blahblah/blahblah/... all the time.

ggsave(plot = wafplot, filename = here::here("Europe 2021-2022/output/big5_SCA_waffle_plot.png"), 
       width = 3100, height = 3500, units = "px")

…and done!

So this was a very short blog post on finding a solution to the “preserve sum after rounding” problem as well as re-writing some of the code to fit my coding style using the tidyverse. Just to be clear, aside from the main problem, none of the code in the original script was wrong! At the end of the day, it worked and a great viz was created, so there obviously was no problem at all. I hope the way I did it shows people there are different ways to approach a problem and that you learned about a couple of new functions and tricks along the way!

I spent about 1~2 hours on this and most of it was googling for different solutions and finding documentation which goes to show how important being able to search for the right things on the internet is for programming in general. You’re never going to be able to memorize everything you’ve ever done, so being good at googling is paramount. Being able to look up stuff efficiently is a skill that needs to be mastered, so things like taking on coding challenges I see in the wild (like the problem presented in this blog post) or taking part in community challenges like #TidyTuesday can help you gain valuable experience.

Of course, there are also other ways to crystallize knowledge such as … writing a blog post about an interesting problem!

Special thanks to Harsh Krishna for sharing the code to his beautiful data viz. Hope this blog post was useful to everyone!