
This practical aims at performing exploratory plots and how-to build layer by layer to be familiar with the grammar of graphics. In the last part, a supplementary exercise will focus on plotting genome-wide CNV.
We are going to use the built-in dataset called ToothGrowth. This dataset contains the teeth length of 60 guinea pigs which received 3 different doses of vitamin C (in mg/day), delivered either by orange juice (OJ) or ascorbic acid (VC). As for every (or near to every) function, most datasets shipped with a library contain also a useful help page (?ToothGrowth).
dose type receiveddose needs to be a categorical data. You can convert it with factor()geom_boxplot() requires 2 aesthetics
x must be categorical / discrete (see first item)y must be continuous
supp column)When the dataset is tidy, it is easy to draw a plot telling us the story: vitamin C affects the teeth growth and the delivery method is only important for lower concentrations.
Boxplots are nice but misleading. The size of the dataset is not visible and the shapes of distributions could be better represented.
violins allow a better view of the distribution shapes.
dose type receivedsupp variable, we want only 3 violins, one for each dose.trim = FALSE to the geom_violin() to display distribution extremities
Now we are missing summary values like the medians which are shown on boxplots. We should add them.
geom_point(stat = "summary") to the previous plotSpecify the fun = "median" and appropriate size, colour to get a big red dot representing the median
aes(colour = "median") can be used, and specifying a manual red color with scale_color_manual(values = "red") will enforce the red colour to the median dots.
Of note, a ggplot extension named ggbeeswarm proposes a very neat dotplot that fits the distribution.

The penguins dataset is provided by the palmerpenguins R package. As for every function, most data-sets shipped with a package contain also a useful help page (?).
palmerpenguins and load it.speciesgeom_smooth() layer can be used to add a trend line. Try to overlay it to your scatter plot.by default geom_smooth is using a loess regression (< 1,000 points) and adds standard error intervals.
method argument can be used to change the regression to a linear one: method = "lm"se = FALSEshape map to the originated island3shape. Remember that fixed parameters are to be defined outside aes()
ggplot() call to propagate it to both point and regression line.Try the scale colour viridis for discrete scale (scale_colour_viridis_d()). Try to change the default theme to theme_bw()

Remember that:
ggplot(aes()) command will be inherited by all following layersaes() of individual geoms are specific (and overwrite the global definition if present).labs() controls of plot annotationstheme() allows to tweak the plot like theme(plot.caption = element_text(face = "italic")) to render in italic the caption
Let’s have a look at a real output file for CNV detection. The used tool is called Reference Coverage Profiles: RCP.
It was developed by analyzing the depth of coverage in over 6000 high quality (>40×) genomes.
In the end, for every kb a state is assigned and similar states are merged eventually.
state means:
The file is accessible here.
CNV.seg has 5 columns and the first 10 lines look like:
CNV
CNV.seg in R, assign to the name cnvseveral issues must be fixed:
vroom comment = '#' or skip = 2L#chrom contains a hash and length (kb). Would be neater to fix this upfront.By skipping the 2 first lines, we need to redo the header manually, avoiding vroom to use a data line as header.
col_names = to create the header
we could explicitly provide the full levels lists in the desired order.
However, in the tibble, the chromosomes appear in the wanted order.
See thefct_inorder() function in the forcats package to take advantage of this.
fct_collapse() function in the forcats
mean and median with colored hyphens using the ToothGrowth exampleshape = 95 for dots is an hyphen shape.
cnv_auto.cnv_auto_chrcnv_auto_chr using the count as the y variablethis is the final plot, where the following changes were made:
y axis in absolute numbersexpand = c(0, 0) on the x axis. see stackoverflow’s answertheme_classic()legend.position and legend.justification in a theme() call.x axis, you could use chromosomes if you preferc("springgreen3", "steelblue2")
It is now obvious that we have mainly huge deletions on chromosome 10 and amplifications on chromosome 11.
In order to plot the genomic localisations of these events, we want to focus on the main chromosomes that were affected by amplifications/deletions.
fct_lump from forcats ease lumping. Just pick n = 5 to get the top 5 chromosomes