Tidyverse - Part 1
All the packages share a common design:
All packages can be loaded with library(tidyverse), but you can also load single packages one by one.
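As a minimal sketch of the two loading styles (assuming the packages are already installed; dplyr is picked here only as an example of a single package):

```r
# Option 1: attach all the core tidyverse packages in one go
# (uncomment if the tidyverse meta-package is installed).
# library(tidyverse)

# Option 2: attach only the package you need, here dplyr.
library(dplyr)

# Check that dplyr is now on the search path.
"package:dplyr" %in% search()
```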
[Photo credits: Arturo de Frias Marques]
The penguins dataset stores real data about penguins from the Palmer Archipelago, Antarctica. This R data package was developed and is maintained by Allison Horst, Alison Hill and Kristen Gorman for teaching purposes.
palmerpenguins exports two datasets. The raw data, penguins_raw:
# A tibble: 344 × 17
studyName Sample Num…¹ Species Region Island Stage Indiv…² Clutc…³ `Date Egg`
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <date>
1 PAL0708 1 Adelie… Anvers Torge… Adul… N1A1 Yes 2007-11-11
2 PAL0708 2 Adelie… Anvers Torge… Adul… N1A2 Yes 2007-11-11
3 PAL0708 3 Adelie… Anvers Torge… Adul… N2A1 Yes 2007-11-16
4 PAL0708 4 Adelie… Anvers Torge… Adul… N2A2 Yes 2007-11-16
5 PAL0708 5 Adelie… Anvers Torge… Adul… N3A1 Yes 2007-11-16
6 PAL0708 6 Adelie… Anvers Torge… Adul… N3A2 Yes 2007-11-16
7 PAL0708 7 Adelie… Anvers Torge… Adul… N4A1 No 2007-11-15
8 PAL0708 8 Adelie… Anvers Torge… Adul… N4A2 No 2007-11-15
9 PAL0708 9 Adelie… Anvers Torge… Adul… N5A1 Yes 2007-11-09
10 PAL0708 10 Adelie… Anvers Torge… Adul… N5A2 Yes 2007-11-09
# … with 334 more rows, 8 more variables: `Culmen Length (mm)` <dbl>,
# `Culmen Depth (mm)` <dbl>, `Flipper Length (mm)` <dbl>,
# `Body Mass (g)` <dbl>, Sex <chr>, `Delta 15 N (o/oo)` <dbl>,
# `Delta 13 C (o/oo)` <dbl>, Comments <chr>, and abbreviated variable names
# ¹`Sample Number`, ²`Individual ID`, ³`Clutch Completion`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
The cleaned data, penguins:
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
# ℹ Use `print(n = ...)` to see more rows
We will use penguins, which has already been cleaned.
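A short sketch of how the two datasets can be inspected, assuming palmerpenguins and dplyr are installed (glimpse() is one of several ways to get an overview):

```r
library(palmerpenguins)  # provides penguins and penguins_raw
library(dplyr)           # provides glimpse()

# Dimensions (rows, columns) of the two exported datasets.
dim(penguins_raw)
dim(penguins)

# A compact, column-by-column overview of the cleaned data.
glimpse(penguins)
```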
The print method for a tibble gives you a reasonable overview of the data stored in it.
Can you get more details with the package skimr?
Check its documentation, install it, and try it out on the penguins dataset. Comment on the output: is it useful? How?
dplyr provides a grammar for manipulating data, with many useful verbs:
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
group_by() performs operations by group.
# A tibble: 344 × 9
species island bill_length_mm bill_d…¹ flipp…² body_…³ sex year bill_…⁴
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 0.0391
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 0.0395
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 0.0403
4 Adelie Torgersen NA NA NA NA <NA> 2007 NA
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 0.0367
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 0.0393
7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007 0.0389
8 Adelie Torgersen 39.2 19.6 195 4675 male 2007 0.0392
9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007 0.0341
10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007 0.042
# … with 334 more rows, and abbreviated variable names ¹bill_depth_mm,
# ²flipper_length_mm, ³body_mass_g, ⁴bill_length_meters
# ℹ Use `print(n = ...)` to see more rows
# A tibble: 344 × 2
bill_length_mm bill_length_meters
<dbl> <dbl>
1 39.1 0.0391
2 39.5 0.0395
3 40.3 0.0403
4 NA NA
5 36.7 0.0367
6 39.3 0.0393
7 38.9 0.0389
8 39.2 0.0392
9 34.1 0.0341
10 42 0.042
# … with 334 more rows
# ℹ Use `print(n = ...)` to see more rows
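The calls behind the two printouts above are not shown. A plausible reconstruction, assuming the new column is computed as bill_length_mm / 1000 (as in the pipelines that follow) and that the second printout simply keeps the two bill length columns; the object names with_meters and only_lengths are illustrative:

```r
library(dplyr)
library(palmerpenguins)

# mutate() appends a new column; the 8 original columns are kept (344 x 9).
with_meters <- mutate(penguins, bill_length_meters = bill_length_mm / 1000)

# select() keeps only the named columns (344 x 2).
only_lengths <- select(with_meters, bill_length_mm, bill_length_meters)

with_meters
only_lengths
```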
penguins %>%
  select(species, island, bill_length_mm) %>%
  mutate(bill_length_meters = bill_length_mm / 1000)
# A tibble: 344 × 4
species island bill_length_mm bill_length_meters
<fct> <fct> <dbl> <dbl>
1 Adelie Torgersen 39.1 0.0391
2 Adelie Torgersen 39.5 0.0395
3 Adelie Torgersen 40.3 0.0403
4 Adelie Torgersen NA NA
5 Adelie Torgersen 36.7 0.0367
6 Adelie Torgersen 39.3 0.0393
7 Adelie Torgersen 38.9 0.0389
8 Adelie Torgersen 39.2 0.0392
9 Adelie Torgersen 34.1 0.0341
10 Adelie Torgersen 42 0.042
# … with 334 more rows
# ℹ Use `print(n = ...)` to see more rows
penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == 'Dream') %>%
  mutate(bill_length_meters = bill_length_mm / 1000)
# A tibble: 124 × 4
species island bill_length_mm bill_length_meters
<fct> <fct> <dbl> <dbl>
1 Adelie Dream 39.5 0.0395
2 Adelie Dream 37.2 0.0372
3 Adelie Dream 39.5 0.0395
4 Adelie Dream 40.9 0.0409
5 Adelie Dream 36.4 0.0364
6 Adelie Dream 39.2 0.0392
7 Adelie Dream 38.8 0.0388
8 Adelie Dream 42.2 0.0422
9 Adelie Dream 37.6 0.0376
10 Adelie Dream 39.8 0.0398
# … with 114 more rows
# ℹ Use `print(n = ...)` to see more rows
penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == 'Dream') %>%
  mutate(bill_length_meters = bill_length_mm / 1000) %>%
  group_by(species)
# A tibble: 124 × 4
# Groups: species [2]
species island bill_length_mm bill_length_meters
<fct> <fct> <dbl> <dbl>
1 Adelie Dream 39.5 0.0395
2 Adelie Dream 37.2 0.0372
3 Adelie Dream 39.5 0.0395
4 Adelie Dream 40.9 0.0409
5 Adelie Dream 36.4 0.0364
6 Adelie Dream 39.2 0.0392
7 Adelie Dream 38.8 0.0388
8 Adelie Dream 42.2 0.0422
9 Adelie Dream 37.6 0.0376
10 Adelie Dream 39.8 0.0398
# … with 114 more rows
# ℹ Use `print(n = ...)` to see more rows
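Note that group_by() on its own does not change the rows or columns: the printout above matches the previous one except for the Groups line. A small sketch of how that grouping metadata can be inspected, using the dplyr helpers group_vars() and n_groups() (the name dream_grouped is only illustrative):

```r
library(dplyr)
library(palmerpenguins)

dream_grouped <- penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == "Dream") %>%
  group_by(species)

group_vars(dream_grouped)  # the grouping column(s)
n_groups(dream_grouped)    # the number of groups found in the data
nrow(dream_grouped)        # the rows themselves are untouched
```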
penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == 'Dream') %>%
  mutate(bill_length_meters = bill_length_mm / 1000) %>%
  group_by(species) %>%
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            sd_bill_length_mm = sd(bill_length_mm))
# A tibble: 2 × 3
species mean_bill_length_mm sd_bill_length_mm
<fct> <dbl> <dbl>
1 Adelie 38.5 2.47
2 Chinstrap 48.8 3.34
Let’s assign the output to a new variable, dream_summary.
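A minimal sketch of that assignment, reusing the summarising pipeline unchanged and storing its result under the name dream_summary:

```r
library(dplyr)
library(palmerpenguins)

dream_summary <- penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == "Dream") %>%
  mutate(bill_length_meters = bill_length_mm / 1000) %>%
  group_by(species) %>%
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            sd_bill_length_mm = sd(bill_length_mm))

# Nothing is printed on assignment; print the object explicitly.
dream_summary
```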
In the previous code we have also seen two additional aspects that feature heavily in the tidyverse: the pipe %>% and non-standard evaluation.
The pipe %>%
The pipe is provided by the package magrittr. It is a forwarding operator: it takes the output of what comes before (LHS) and sends it to the first argument of the function that comes after (RHS).
LHS %>% RHS
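As a small illustration of this forwarding (a sketch using head() on the penguins data, chosen here only as an example), the two calls below return the same result:

```r
library(dplyr)           # re-exports the magrittr pipe %>%
library(palmerpenguins)

# Piped form: penguins (LHS) becomes the first argument of head() (RHS).
piped_head <- penguins %>% head(3)

# Equivalent ordinary call.
direct_head <- head(penguins, 3)

identical(piped_head, direct_head)
```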
For example, you could write:
…but if you use the pipe, your code is easier to read…
…especially if you have to perform many operations one after the other…
# A tibble: 152 × 2
species body_mass_g
<fct> <int>
1 Adelie 3750
2 Adelie 3800
3 Adelie 3250
4 Adelie NA
5 Adelie 3450
6 Adelie 3650
7 Adelie 3625
8 Adelie 4675
9 Adelie 3475
10 Adelie 4250
# … with 142 more rows
# ℹ Use `print(n = ...)` to see more rows
…that otherwise would force you to nest your code horribly.
# A tibble: 152 × 2
species body_mass_g
<fct> <int>
1 Adelie 3750
2 Adelie 3800
3 Adelie 3250
4 Adelie NA
5 Adelie 3450
6 Adelie 3650
7 Adelie 3625
8 Adelie 4675
9 Adelie 3475
10 Adelie 4250
# … with 142 more rows
# ℹ Use `print(n = ...)` to see more rows
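The code for this comparison is not shown. Judging from the printed result (152 rows, columns species and body_mass_g), a plausible reconstruction, offered as an assumption, filters the Adelie penguins and keeps two columns. Written first with the pipe and then nested:

```r
library(dplyr)
library(palmerpenguins)

# With the pipe: read top to bottom, one operation per line.
piped <- penguins %>%
  filter(species == "Adelie") %>%
  select(species, body_mass_g)

# Without the pipe: the same operations, read from the inside out.
nested <- select(filter(penguins, species == "Adelie"), species, body_mass_g)

identical(piped, nested)
```

With only two steps the nested form is still readable; each additional step adds another level of parentheses, while the piped form just grows by one line.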
This one is difficult…
Which arguments does the function select() take? Let’s check its help page with ?select.
Under the Usage section it says:
select(.data, ...)
And then, in the section Arguments, the help page says:
.data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.
...: <tidy-select> One or more unquoted expressions separated by commas. Variable names can be used as if they were positions in the data frame, so expressions like x:y can be used to select a range of variables.
So, what do we mean if we write:
penguins %>% select(species, island)
The penguins tibble fills the .data parameter through the pipe %>%.
The unquoted names species, island fill the argument ...; they represent the names of the columns to be selected.
Through non-standard evaluation, inside the function select() we can refer to the column names of the data frame as if they were variables, without quoting them.
The variables species and island don’t exist outside of the dplyr function select().
With non-standard evaluation we can write names without quoting them. This makes writing code for iterative data exploration faster.
If you come from a stricter programming language, it can be hard to get used to this behaviour.
Most functions of the tidyverse use non-standard evaluation.
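A minimal sketch of this behaviour. The comparison with quoted names uses the tidyselect helper all_of(), which is not mentioned above and is added here only for contrast:

```r
library(dplyr)
library(palmerpenguins)

# Check whether an object called "species" exists outside the data frame
# (in a fresh session it is not expected to).
exists("species")

# select() still accepts the bare names, because it looks them up
# among the columns of the data frame.
bare <- select(penguins, species, island)

# The same selection with quoted names, via all_of().
quoted <- select(penguins, all_of(c("species", "island")))

identical(bare, quoted)
```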
With the penguins dataset:
Select all numeric variables (columns).
Convert all variables that are expressed in millimeters into meters and rename them accordingly.
Get help from:
How many penguins have bill_length_mm above average?
Repeat the analysis for each species.