Plotting

Author

Aurélien Ginolhac

Published

February 8, 2023

Note

This practical aims at performing exploratory plots and how-to build layer by layer to be familiar with the grammar of graphics. In the last part, an optional exercise will focus on plotting genome-wide CNV.

Scatter plots of penguins

The penguins dataset is provided by the palmerpenguins R package. As for every function, most data-sets shipped with a package contain also a useful help page (?).

If not done already, install the package palmerpenguins and load it.
# install.packages("palmerpenguins")
library(palmerpenguins)
Plot the body mass on the y axis and the bill length on the x axis.
penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g)) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

Plot again the body mass on the y axis and the bill length on the x axis, but with colour by species
penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g, 
             colour = species)) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

The geom_smooth() layer can be used to add a trend line. Try to overlay it to your scatter plot.
Tip

geom_smooth is using a loess regression by default for < 1,000 points and adds standard error intervals.

  • The method argument can be used to change the regression to a linear one: method = "lm"
  • to disable the ribbon of standard errors, set se = FALSE

Be careful where the aesthetics are located, so the trend linear lines are also colored per species.

penguins |>
  ggplot(aes(x = bill_length_mm, 
             y = body_mass_g, 
             colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

Adjust the aesthetics of point in order to
  • The shape map to the originated island
  • A fixed size of 3
  • A transparency of 40%
Tip

You should still have only 3 coloured linear trend lines. Otherwise check to which layer your are adding the aesthetic shape. Remember that fixed parameters are to be defined outside aes()

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g, 
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

Ajust the colour aesthetic to the ggplot() call to propagate it to both point and regression line.

Try the scale colour viridis for discrete scale (scale_colour_viridis_d()). Try to change the default theme to theme_bw()

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g, 
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_viridis_d() +
  theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

Find a way to produce the following plot:

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g, 
             colour = species)) +
  geom_point(aes(shape = island), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_viridis_d() +
  theme_bw(14) +
  theme(plot.caption.position = "plot",
        plot.caption = element_text(face = "italic"),
        plot.subtitle = element_text(size = 11)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Penguin bill length and body mass",
       caption = "Horst AM, Hill AP, Gorman KB (2020)",
       subtitle = "Dimensions for male/female Adelie, Chinstrap and Gentoo Penguins\nat Palmer Station LTER",
        x = "Bill length (mm)",
       y = "Body mass (g)",
       color = "Penguin species")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

Remember that
  • All aesthetics defined in the ggplot(aes()) command will be inherited by all following layers
  • aes() of individual geoms are specific (and overwrite the global definition if present).
  • labs() controls of plot annotations
  • theme() allows to tweak the plot like theme(plot.caption = element_text(face = "italic")) to render in italic the caption

Categorical data

We are going to use a dataset from the TidyTuesday initiative. Several dataset about the theme deforestation on April 2021 were released, we will focus on the csv called brazil_loss.csv. The dataset columns are described in the linked README and the csv is directly available at this url

Load the brazil_loss.csv file, remove the 2 first columns (entity and code since it is all Brazil) and assign the name brazil_loss
Note

Set the data type of year to character().

brazil_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/brazil_loss.csv"
brazil_loss <- read_csv(brazil_url, 
                        col_select = -c(entity, code),
                        col_types = cols(year = col_character()))
Is this dataset tidy?
# No, the reason for deforestation are in the wide format. Columns commercial_crops to small_scale_clearing should be in the long format
Pivot the deforestation reasons (columns commercial_crops to small_scale_clearing) to the long format. Values are areas in hectares (area_ha is a good column name). Save as brazil_loss_long
pivot_longer(brazil_loss,
             cols = commercial_crops:small_scale_clearing,
             names_to = "reasons",
             values_to = "area_ha") -> brazil_loss_long
Plot the deforestation areas per year as bars
Tip
  • year needs to be a categorical datum. If you didn’t read the data as character for this column, you can convert it with factor()
  • geom_col() requires 2 aesthetics
    • x must be categorical / discrete (see first item)
    • y must be continuous
brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha)) +
  geom_col()

Same as the plot above but bar filled by the reasons for deforestation
brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_lump_n(reasons, n = 5, w = area_ha))) +
  geom_col() +
  labs(title = "The 5 main reasons for deforestation in Brazil",
       fill = NULL)

Even if we have too many categories, we can appreciate the amount of natural_disturbances versus the reasons induced by humans.

Lump the reasons for deforestations, keeping only the top 5 reasons, lumping as “Other” the rest
Tip
  • The function fct_lump_n() is very useful for this operation. Be careful to weight the categories with the appropriate continuous variable.
  • The legend of filled colours could be renamed and suppress if the title is explicit
brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_infreq(reasons, w = area_ha, ordered = TRUE))) +
  geom_col() +
  labs(title = "Sorted reasons for deforestation in Brazil",
       fill = NULL)

Optimize the previous plot by sorting the main deforestation reasons
forcats

Since v1.0.0, fct_infreq() does have a weight argument.

you can play with the ordered argument to get a viridis binned color scale

brazil_loss_long |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_infreq(reasons, w = area_ha, ordered = TRUE))) +
  geom_col() +
  labs(title = "Sorted reasons for deforestation in Brazil",
       fill = NULL)

Optimize the previous plot by sorting the 5 main deforestation reasons
Tip

One solution would be extract the top 5 main reasons using dplyr statements.

Then use this vector to recode the reasons with the reason name when part of the top5 or other if not. Then fct_reorder(reasons2, area_ha) does the correct reordering. You might want to use fct_rev() to have the sorting from top to bottom in the legend.

brazil_loss_long |>
  group_by(reasons) |>
  summarise(sum = sum(area_ha)) |>
  arrange(desc(sum)) |>
  slice_head(n = 5) |>
  pull(reasons) -> top_5_reasons

brazil_loss_long |>
  mutate(reasons2 = if_else(reasons %in% top_5_reasons, reasons, "other")) |>
  ggplot(aes(x = year, y = area_ha, 
             fill = fct_reorder(reasons2, area_ha) |> fct_rev())) +
  geom_col() +
  scale_fill_brewer(type = "qual", palette = "Set1") +
  labs(title = "The 5 main reasons for deforestation in Brazil",
       fill = NULL)

Supplementary exercises

Note

These questions are optional and not required to be completed.

Genome-wide copy number variants (CNV) detection

Let’s have a look at a real output file for CNV detection. The used tool is called Reference Coverage Profiles: RCP.

It was developed by analyzing the depth of coverage in over 6000 high quality (>40×) genomes.

In the end, for every kb a state is assigned and similar states are merged eventually.

state means:

  • 0, no coverage
  • 1, deletion
  • 2, expected diploidy
  • 3, duplication
  • 4, > 3 copies

Reading data

The file is accessible here.

CNV.seg has 5 columns and the first 10 lines look like:

CNV
Load the file CNV.seg in R, assign to the name cnv
Several issues must be fixed!
  • Lines starting by a hash (comment) should be discarded. You can specify in vroom comment = '#' or skip = 2L
  • The first and last column names are unclean. #chrom contains a hash and length (kb). Would be neater to fix this upfront.

By skipping the 2 first lines, we need to redo the header manually, avoiding vroom to use a data line as header.

  • Specify a character vector of length 5 to col_names = to create the header
cnv <- read_tsv("http://biostat2.uni.lu/practicals/data/CNV.seg", comment = "#",
             col_names = c("chr", "start", "end", "state", "length_kb"))
Rows: 19288 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): chr
dbl (4): start, end, state, length_kb

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cnv
# A tibble: 19,288 × 5
   chr    start    end state length_kb
   <chr>  <dbl>  <dbl> <dbl>     <dbl>
 1 1          0  11000     0        11
 2 1      11000 178000     2       167
 3 1     178000 227000     0        49
 4 1     227000 268000     2        41
 5 1     268000 317000     0        49
 6 1     317000 472000     2       155
 7 1     472000 522000     0        50
 8 1     522000 606000     2        84
 9 1     606000 609000     0         3
10 1     609000 611000     2         2
# ℹ 19,278 more rows

Exploratory plots

Plot the counts of the different states. We expect a majority of diploid states.
Plot the counts of the different states per chromosome. Might be worth freeing the count scale.
cnv |>
  ggplot(aes(x = state)) +
  geom_bar()

Using the previous plot, reorder the levels of chromosomes to let them appear in the karyotype order (1:22, X, Y)
Tip

We could provide the full levels lists in the desired order explicitly. In the tibble, the chromosomes appear in the right order - essentially we need to tell ggplot to keep this order.

See the fct_inorder() function in the forcats package to this end.

cnv |>
  mutate(chr = fct_inorder(chr)) |>
  ggplot(aes(x = state)) +
  geom_bar() +
  facet_wrap(~ chr, scales = "free_y")

Sexual chromosomes are not informative, collapse them into a gonosomes level
Tip

See the fct_collapse() function in the forcats

cnv |>
  mutate(chr = fct_inorder(chr) |> 
           fct_collapse(gonosomes = c("X", "Y"))) |>
  ggplot(aes(x = state)) +
  geom_bar() +
  facet_wrap(~ chr, scales = "free_y")

Plot the genomic segments length per state
Tip
  • The distributions are completely skewed: transform to log-scale to get a decent plot.
  • Add the summary mean and median with colored hyphens using the ToothGrowth example

Of note, shape = 95 for dots is an hyphen shape.

cnv |>
  ggplot(aes(x = factor(state), y = length_kb)) +
  geom_violin() +
  geom_point(stat = "summary", fun = "mean", shape = 95,
             aes(colour = "mean"), size = 10) +
  geom_point(stat = "summary", fun = "median", 
             aes(colour = "median"), size = 10, shape = 95) +
  scale_y_log10() +
  annotation_logticks(sides = "l") +
  labs(x = "State",
       y = "Length (kb)",
       colour = NULL)

Count gain / loss summarising events per chromosome

Filter the tibble only for autosomes and remove segments with no coverage and diploid (i.e states 0 and 2 respectively). Save as cnv_auto.
cnv_auto <- cnv |>
  filter(state == 1 | state > 2,
         !chr %in% c("X", "Y"))
We are left with state 1 and 3 and 4. Rename 1 as loss and the others as gain
Count the events per chromosome and per state
For loss counts, set them to negative so the barplot will be display up / down. Save as cnv_auto_chr
cnv_auto |>
  mutate(state = if_else(state == 1, "loss", "gain")) |>
  mutate(chr = fct_inorder(chr)) |>
  count(chr, state) |>
  mutate(n = if_else(state == "loss", -n, n)) -> cnv_auto_chr
Plot cnv_auto_chr using the count as the y variable
cnv_auto_chr |>
  ggplot(aes(x = chr, fill = state, y = n)) +
  geom_col() +
  scale_y_continuous(labels = abs) + # absolute values for negative counts
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = c("springgreen3", "steelblue2")) + 
  theme_classic(14) +
  theme(legend.position = c(1, 1),
        legend.justification = c(1, 1)) +
    labs(x = NULL,
         y = "count")

This is the final plot, where the following changes were made:

  • Labels of the y axis in absolute numbers
  • Set expand = c(0, 0) on the x axis. see stackoverflow’s answer
  • Use theme_classic()
  • Set the legend on the top right corner. Use a mix of legend.position and legend.justification in a theme() call.
  • Remove the label of the x axis, you could use chromosomes if you prefer
  • Change the color of the fill argument with c("springgreen3", "steelblue2")

It is now obvious that we have mainly huge deletions on chromosome 10 and amplifications on chromosome 11.

In order to plot the genomic localisations of these events, we want to focus on the main chromosomes that were affected by amplifications/deletions.

Lump the chromsomes by the number of CNV events (states 1, 3 or 4) keeping the 5 top ones and plot the counts
read_tsv("http://biostat2.uni.lu/practicals/data/CNV.seg", comment = "#",
      col_types = cols(),
             col_names = c("chr", "start", "end", "state", "length_kb")) |>
  filter(state == 1 | state > 2,
         !chr %in% c("X", "Y")) |>
  mutate(state = if_else(state == 1, "loss", "gain")) |>
  mutate(chr = fct_inorder(chr)) |>
  count(chr, state) |>
  mutate(n = if_else(state == "loss", -n, n)) |>
  ggplot(aes(x = chr, fill = state, y = n)) +
  geom_col() +
  scale_y_continuous(labels = abs) + # absolute values for negative counts
  scale_x_discrete(expand = c(0, 0)) +
  scale_fill_manual(values = c("springgreen3", "steelblue2")) + 
  theme_classic(14) +
  theme(legend.position = c(1, 1),
        legend.justification = c(1, 1)) +
    labs(x = NULL,
         y = "count")

Genomic plot of top 5 chromosomes

Plot the genomic localisations of CNV on the 5 main chromosomes
forcats

The function fct_lump_n() from forcats ease lumping. Just pick n = 5 to get the top 5 chromosomes

cnv_auto |>
  mutate(top_chr = fct_lump_n(chr, n = 5)) |>
  ggplot(aes(x = top_chr)) +
  geom_bar()

# the 5 top chromosomes are then, 10, 11, 3, 12 & 5