11  Intro to ggplot2

The material below is adapted from the book “R for Data Science, second edition” (r4ds2e), which is available both online and in print. It was written by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. Wickham is in large part the driving force behind the tidyverse packages.
The online version of r4ds2e is here: https://r4ds.hadley.nz/
The intro to ggplot2 chapter appears here: https://r4ds.hadley.nz/data-visualize

In class we covered sections 1.1 through 1.3 of chapter 1, “Data visualization”, from r4ds2e. In that context we discussed the different “geometries” of a graph (e.g. dot plot, histogram, bar plot, box plot) and the “aesthetics” of a particular geometry (e.g. x position, y position, color, shape, size). We also discussed how variables in your data can be “mapped” to particular aesthetics.

11.1 Code from sections 1.1 through 1.3 of r4ds2e

Below is a summary of the code that we went through.
Please see the following webpage for more info: https://r4ds.hadley.nz/data-visualize

if(!require(palmerpenguins)){install.packages("palmerpenguins");require(palmerpenguins);}
if(!require(dplyr)){install.packages("dplyr");require(dplyr);} # needed for glimpse
if(!require(ggplot2)){install.packages("ggplot2");require(ggplot2);}
# Here are the first few rows of the data.
# When viewing a tibble, you may not see all the columns if your screen is too narrow.

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# You can use the dplyr::glimpse function to 
# view the names and datatypes of ALL the columns as well as 
# view the first few values of each column.
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# You can also use the following command to "View" the entire tibble in the 
# RStudio viewer window:
#
#View(penguins)  # It is a capital "V" in "View"

# Set up the "aesthetics", i.e. which variables go on the x-axis and y-axis.
# This draws the axes but doesn't display any actual data yet.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

# You will start seeing "data" on the plot once you set the "geometry".
# Here we set the "geometry" to be geom_point().
# Each "dot" on the plot represents a row of data from the tibble, i.e. one penguin.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).
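
The warning above appears because geom_point() cannot draw a penguin whose flipper length or body mass is missing (row 4 in the printout of the tibble is one such penguin). As a minimal sketch, assuming the same penguins tibble from palmerpenguins, the following counts the affected rows and then redraws the plot with na.rm = TRUE, which drops those rows without a warning. The names n_missing and p_no_warning are only illustrative.

```r
library(palmerpenguins)
library(ggplot2)

# Count the rows where either of the two plotted variables is missing.
# This count should match the number of rows reported in the warning.
n_missing <- sum(is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g))
n_missing

# na.rm = TRUE tells geom_point() to drop the incomplete rows silently.
p_no_warning <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(na.rm = TRUE)
p_no_warning
```

Whether to silence the warning is a judgment call. Leaving it on is a useful reminder that some rows are not shown on the plot.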

# We can add some color and shapes to each dot on the plot based on the species.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species, shape=species)
) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).

# The function call geom_smooth(method = "lm")
# adds linear regression lines, one for each species.
#
# Since "color=species, shape=species" was mapped in the ggplot function, the data was
# divided into 3 different subsets, one for each species. 
# That is why there are 3 different linear regression lines, one for
# each species (compare this with the next plot).
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species, shape=species)
) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Removed 2 rows containing missing values (`geom_point()`).

# In the following plot, "color=species, shape=species" was moved
# from the ggplot function to the geom_point function.
# Mappings set in the ggplot function apply to every layer of the plot, while
# mappings set in a geom function apply only to that one layer.
# Since geom_smooth no longer sees the species mapping, it treats the data
# as a single group and we get a single linear regression line for the
# entire set of data.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Removed 2 rows containing missing values (`geom_point()`).
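
The plot above moved the species mapping into geom_point. Going in the other direction, here is a sketch (not taken from the class session) that keeps x and y in the ggplot function but maps color only inside geom_smooth. Under that setup the points should all be drawn in one color, while geom_smooth splits the data by species and draws three regression lines. The name p_local is only illustrative, and na.rm = TRUE is added just to keep the output free of missing-value warnings.

```r
library(palmerpenguins)
library(ggplot2)

p_local <- ggplot(
  data = penguins,
  # x and y are set here, so they are shared by every layer
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  # no species mapping in this layer: all points get the same color
  geom_point(na.rm = TRUE) +
  # color is mapped only in this layer, so only the lines are split by species
  geom_smooth(
    mapping = aes(color = species),
    method = "lm", formula = y ~ x, na.rm = TRUE
  )
p_local
```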

# Finally, the following plot adds a title and subtitle to the graph
# and labels for the x-axis, y-axis and legend.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", y = "Body mass (g)",
    color = "Penguin Species", shape = "Penguin Species"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Removed 2 rows containing missing values (`geom_point()`).

11.2 Other stuff

The info above goes through the main ideas of how to use ggplot2. Using that knowledge you should be in good shape for learning on your own how to use other more advanced features of ggplot2.

The rest of the webpage, https://r4ds.hadley.nz/data-visualize, shows how to use several other features of ggplot2. The following topics are described on the rest of that webpage:

  • Other geometries (histograms, box plots, etc.)

    How to use several other geometries (i.e. bar plots, histograms, density plots, box plots, stacked bar plots).

  • Facets

    How to break up a graph into several smaller graphs using the facet_wrap function.

  • How to save a plot to an image file

    You can use the ggsave function to save a copy of the last plot that you created to an image file. You can then insert the image file into other documents, e.g. a Word document, a PowerPoint presentation, etc.

    See ?ggsave for more info.
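
As a rough sketch of the three topics listed above, assuming the same penguins data, the code below draws a histogram and a box plot, splits the scatterplot into one panel per island with facet_wrap, and saves the faceted plot with ggsave. The object names, the file name penguin-facets.png, the binwidth, and the width and height values are arbitrary choices for illustration. The r4ds2e chapter has fuller examples.

```r
library(palmerpenguins)
library(ggplot2)

# Another geometry: a histogram of a single numeric variable.
p_hist <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, na.rm = TRUE)
p_hist

# Another geometry: box plots of body mass, one box per species.
p_box <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot(na.rm = TRUE)
p_box

# Facets: one small scatterplot per island.
p_facets <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species), na.rm = TRUE) +
  facet_wrap(~island)
p_facets

# Save the faceted plot as a PNG file in the current working directory.
# If the plot argument is left out, ggsave saves the last plot displayed.
# With the default units, width and height are measured in inches.
ggsave(filename = "penguin-facets.png", plot = p_facets, width = 6, height = 4)
```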