Plotting
This practical aims at performing exploratory plots and how-to build layer by layer to be familiar with the grammar of graphics. In the last part, an optional exercise will focus on plotting genome-wide CNV.
Scatter plots of penguins
The penguins
dataset is provided by the palmerpenguins
R package. As for every function, most data-sets shipped with a package contain also a useful help page (?
).
If not done already, install the package palmerpenguins
and load it.
Plot the body mass on the y axis and the bill length on the x axis.
Plot again the body mass on the y axis and the bill length on the x axis, but with colour by species
The geom_smooth()
layer can be used to add a trend line. Try to overlay it to your scatter plot.
geom_smooth
is using a loess regression by default for < 1,000 points and adds standard error intervals.
- The
method
argument can be used to change the regression to a linear one:method = "lm"
- to disable the ribbon of standard errors, set
se = FALSE
Be careful where the aesthetics are located, so the trend linear lines are also colored per species.
Adjust the aesthetics of point in order to
- The
shape
map to the originatedisland
- A fixed size of
3
- A transparency of 40%
You should still have only 3 coloured linear trend lines. Otherwise check to which layer your are adding the aesthetic shape
. Remember that fixed parameters are to be defined outside aes()
Ajust the colour aesthetic to the ggplot()
call to propagate it to both point and regression line.
Try the scale colour viridis for discrete scale (scale_colour_viridis_d()
). Try to change the default theme to theme_bw()
Find a way to produce the following plot:
- All aesthetics defined in the
ggplot(aes())
command will be inherited by all following layers aes()
of individual geoms are specific (and overwrite the global definition if present).labs()
controls of plot annotationstheme()
allows to tweak the plot liketheme(plot.caption = element_text(face = "italic"))
to render in italic the caption
Categorical data
We are going to use a dataset from the TidyTuesday initiative. Several dataset about the theme deforestation on April 2021 were released, we will focus on the csv called brazil_loss.csv
. The dataset columns are described in the linked README and the csv is directly available at this url
Load the brazil_loss.csv
file, remove the 2 first columns (entity
and code
since it is all Brazil) and assign the name brazil_loss
Set the data type of year
to character()
.
Is this dataset tidy?
Pivot the deforestation reasons (columns commercial_crops
to small_scale_clearing
) to the long format. Values are areas in hectares (area_ha
is a good column name). Save as brazil_loss_long
Plot the deforestation areas per year as bars
year
needs to be a categorical datum. If you didn’t read the data as character for this column, you can convert it withfactor()
geom_col()
requires 2 aestheticsx
must be categorical / discrete (see first item)y
must be continuous
Same as the plot above but bar filled by the reasons for deforestation
Even if we have too many categories, we can appreciate the amount of natural_disturbances
versus the reasons induced by humans.
Lump the reasons for deforestations, keeping only the top 5 reasons, lumping as “Other” the rest
- The function
fct_lump_n()
is very useful for this operation. Be careful to weight the categories with the appropriate continuous variable. - The legend of filled colours could be renamed and suppress if the title is explicit
Optimize the previous plot by sorting the main deforestation reasons
Since v1.0.0, fct_infreq()
does have a weight argument.
you can play with the ordered
argument to get a viridis binned color scale
Optimize the previous plot by sorting the 5 main deforestation reasons
One solution would be extract the top 5 main reasons using dplyr
statements.
Then use this vector to recode the reasons
with the reason name when part of the top5 or other
if not. Then fct_reorder(reasons2, area_ha)
does the correct reordering. You might want to use fct_rev()
to have the sorting from top to bottom in the legend.
Supplementary exercises
These questions are optional and not required to be completed.
Genome-wide copy number variants (CNV) detection
Let’s have a look at a real output file for CNV detection. The used tool is called Reference Coverage Profiles: RCP.
It was developed by analyzing the depth of coverage in over 6000 high quality (>40×) genomes.
In the end, for every kb a state is assigned and similar states are merged eventually.
state means:
- 0, no coverage
- 1, deletion
- 2, expected diploidy
- 3, duplication
- 4, > 3 copies
Reading data
The file is accessible here.
CNV.seg
has 5 columns and the first 10 lines look like:
Load the file CNV.seg
in R, assign to the name cnv
- Lines starting by a hash (comment) should be discarded. You can specify in
vroom
comment = '#'
orskip = 2L
- The first and last column names are unclean.
#chrom
contains a hash andlength (kb)
. Would be neater to fix this upfront.
By skipping the 2 first lines, we need to redo the header manually, avoiding vroom
to use a data line as header.
- Specify a character vector of length 5 to
col_names =
to create the header
Exploratory plots
Plot the counts of the different states. We expect a majority of diploid states.
Plot the counts of the different states per chromosome. Might be worth freeing the count scale.
Using the previous plot, reorder the levels of chromosomes to let them appear in the karyotype order (1:22, X, Y)
We could provide the full levels lists in the desired order explicitly. In the tibble, the chromosomes appear in the right order - essentially we need to tell ggplot
to keep this order.
See the fct_inorder()
function in the forcats package to this end.
Sexual chromosomes are not informative, collapse them into a gonosomes level
See the fct_collapse()
function in the forcats
Plot the genomic segments length per state
- The distributions are completely skewed: transform to log-scale to get a decent plot.
- Add the summary
mean
andmedian
with colored hyphens using the ToothGrowth example
Of note, shape = 95
for dots is an hyphen shape.
Count gain / loss summarising events per chromosome
Filter the tibble only for autosomes and remove segments with no coverage and diploid (i.e states 0 and 2 respectively). Save as cnv_auto
.
We are left with state 1 and 3 and 4. Rename 1 as loss and the others as gain
Count the events per chromosome and per state
For loss counts, set them to negative so the barplot will be display up / down. Save as cnv_auto_chr
Plot cnv_auto_chr
using the count as the y
variable
This is the final plot, where the following changes were made:
- Labels of the
y
axis in absolute numbers - Set
expand = c(0, 0)
on thex
axis. see stackoverflow’s answer - Use
theme_classic()
- Set the legend on the top right corner. Use a mix of
legend.position
andlegend.justification
in atheme()
call. - Remove the label of the
x
axis, you could use chromosomes if you prefer - Change the color of the
fill
argument withc("springgreen3", "steelblue2")
It is now obvious that we have mainly huge deletions on chromosome 10 and amplifications on chromosome 11.
In order to plot the genomic localisations of these events, we want to focus on the main chromosomes that were affected by amplifications/deletions.
Lump the chromsomes by the number of CNV events (states 1, 3 or 4) keeping the 5 top ones and plot the counts
Genomic plot of top 5 chromosomes
Plot the genomic localisations of CNV on the 5 main chromosomes
forcats
The function fct_lump_n()
from forcats ease lumping. Just pick n = 5
to get the top 5 chromosomes