Manipulate Data

Tidyverse - Part 1

Otho Mantegazza

Intro to the Tidyverse

The Tidyverse is an ecosystem of packages for Data Science

All the packages share a common design:

  • One function does one thing, well.
  • Designed for pipes.
  • Extensive user-friendly documentation.
  • Non-standard evaluation, to write code quickly and easily.

All packages can be loaded with library(tidyverse), but you can also load single packages one by one.
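As a minimal sketch of the two loading styles, assuming the packages are already installed (the choice of dplyr and ggplot2 in the second option is only an illustration):

```r
# Option 1: attach all core tidyverse packages with one call
library(tidyverse)

# Option 2: attach only the packages you need
library(dplyr)
library(ggplot2)
```

The second style keeps the list of attached packages short, which can make it easier to see where a function comes from.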

Data: Palmer Penguins

[Photo credits: Arturo de Frias Marques]

We first explore the Palmer Penguins Dataset

The Palmer Penguins dataset stores real measurements of penguins from the Palmer Archipelago, Antarctica. This R data package was developed and is maintained by Allison Horst, Alison Hill and Kristen Gorman for teaching purposes.

Let’s install the package…

install.packages('palmerpenguins')

…and load it in R.

library(palmerpenguins)

palmerpenguins exports two datasets

penguins_raw
# A tibble: 344 × 17
   studyName Sample Num…¹ Species Region Island Stage Indiv…² Clutc…³ `Date Egg`
   <chr>            <dbl> <chr>   <chr>  <chr>  <chr> <chr>   <chr>   <date>    
 1 PAL0708              1 Adelie… Anvers Torge… Adul… N1A1    Yes     2007-11-11
 2 PAL0708              2 Adelie… Anvers Torge… Adul… N1A2    Yes     2007-11-11
 3 PAL0708              3 Adelie… Anvers Torge… Adul… N2A1    Yes     2007-11-16
 4 PAL0708              4 Adelie… Anvers Torge… Adul… N2A2    Yes     2007-11-16
 5 PAL0708              5 Adelie… Anvers Torge… Adul… N3A1    Yes     2007-11-16
 6 PAL0708              6 Adelie… Anvers Torge… Adul… N3A2    Yes     2007-11-16
 7 PAL0708              7 Adelie… Anvers Torge… Adul… N4A1    No      2007-11-15
 8 PAL0708              8 Adelie… Anvers Torge… Adul… N4A2    No      2007-11-15
 9 PAL0708              9 Adelie… Anvers Torge… Adul… N5A1    Yes     2007-11-09
10 PAL0708             10 Adelie… Anvers Torge… Adul… N5A2    Yes     2007-11-09
# … with 334 more rows, 8 more variables: `Culmen Length (mm)` <dbl>,
#   `Culmen Depth (mm)` <dbl>, `Flipper Length (mm)` <dbl>,
#   `Body Mass (g)` <dbl>, Sex <chr>, `Delta 15 N (o/oo)` <dbl>,
#   `Delta 13 C (o/oo)` <dbl>, Comments <chr>, and abbreviated variable names
#   ¹​`Sample Number`, ²​`Individual ID`, ³​`Clutch Completion`
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
# ℹ Use `print(n = ...)` to see more rows

We will use the second one, penguins, which has already been cleaned.

Exercise

The print method for a tibble gives you a reasonable overview of the data stored in it.

Can you get more details with the package skimr?

Check its documentation, install it, and try it out on the penguins dataset. Comment on the output: is it useful, and how?

Tools: dplyr

A grammar for data manipulation

The dplyr package provides a grammar for manipulating data, with many useful verbs:

  • mutate() adds new variables that are functions of existing variables.
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • group_by() performs operations by group.

We can apply those verbs to manipulate data

penguins %>%
  mutate(bill_length_meters = bill_length_mm/1000)
# A tibble: 344 × 9
   species island    bill_length_mm bill_d…¹ flipp…² body_…³ sex    year bill_…⁴
   <fct>   <fct>              <dbl>    <dbl>   <int>   <int> <fct> <int>   <dbl>
 1 Adelie  Torgersen           39.1     18.7     181    3750 male   2007  0.0391
 2 Adelie  Torgersen           39.5     17.4     186    3800 fema…  2007  0.0395
 3 Adelie  Torgersen           40.3     18       195    3250 fema…  2007  0.0403
 4 Adelie  Torgersen           NA       NA        NA      NA <NA>   2007 NA     
 5 Adelie  Torgersen           36.7     19.3     193    3450 fema…  2007  0.0367
 6 Adelie  Torgersen           39.3     20.6     190    3650 male   2007  0.0393
 7 Adelie  Torgersen           38.9     17.8     181    3625 fema…  2007  0.0389
 8 Adelie  Torgersen           39.2     19.6     195    4675 male   2007  0.0392
 9 Adelie  Torgersen           34.1     18.1     193    3475 <NA>   2007  0.0341
10 Adelie  Torgersen           42       20.2     190    4250 <NA>   2007  0.042 
# … with 334 more rows, and abbreviated variable names ¹​bill_depth_mm,
#   ²​flipper_length_mm, ³​body_mass_g, ⁴​bill_length_meters
# ℹ Use `print(n = ...)` to see more rows

penguins %>%
  select(bill_length_mm) %>% 
  mutate(bill_length_meters = bill_length_mm/1000)
# A tibble: 344 × 2
   bill_length_mm bill_length_meters
            <dbl>              <dbl>
 1           39.1             0.0391
 2           39.5             0.0395
 3           40.3             0.0403
 4           NA              NA     
 5           36.7             0.0367
 6           39.3             0.0393
 7           38.9             0.0389
 8           39.2             0.0392
 9           34.1             0.0341
10           42               0.042 
# … with 334 more rows
# ℹ Use `print(n = ...)` to see more rows

penguins %>%
  select(species, island, bill_length_mm) %>% 
  mutate(bill_length_meters = bill_length_mm/1000)
# A tibble: 344 × 4
   species island    bill_length_mm bill_length_meters
   <fct>   <fct>              <dbl>              <dbl>
 1 Adelie  Torgersen           39.1             0.0391
 2 Adelie  Torgersen           39.5             0.0395
 3 Adelie  Torgersen           40.3             0.0403
 4 Adelie  Torgersen           NA              NA     
 5 Adelie  Torgersen           36.7             0.0367
 6 Adelie  Torgersen           39.3             0.0393
 7 Adelie  Torgersen           38.9             0.0389
 8 Adelie  Torgersen           39.2             0.0392
 9 Adelie  Torgersen           34.1             0.0341
10 Adelie  Torgersen           42               0.042 
# … with 334 more rows
# ℹ Use `print(n = ...)` to see more rows

penguins %>%
  count(island)
# A tibble: 3 × 2
  island        n
  <fct>     <int>
1 Biscoe      168
2 Dream       124
3 Torgersen    52
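
count() can be read as a shortcut. The sketch below, which assumes dplyr and palmerpenguins are installed, builds the same table with group_by() and summarise(); n() is the dplyr helper that counts the rows in the current group, and the variable names are chosen only for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Shortcut: count the rows for each island
island_counts <- penguins %>%
  count(island)

# Longer form: group by island, then count the rows in each group
island_counts_long <- penguins %>%
  group_by(island) %>%
  summarise(n = n())
```

Both objects hold one row per island with the same counts.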

penguins %>%
  select(species, island, bill_length_mm) %>% 
  filter(island == 'Dream') %>% 
  mutate(bill_length_meters = bill_length_mm/1000)
# A tibble: 124 × 4
   species island bill_length_mm bill_length_meters
   <fct>   <fct>           <dbl>              <dbl>
 1 Adelie  Dream            39.5             0.0395
 2 Adelie  Dream            37.2             0.0372
 3 Adelie  Dream            39.5             0.0395
 4 Adelie  Dream            40.9             0.0409
 5 Adelie  Dream            36.4             0.0364
 6 Adelie  Dream            39.2             0.0392
 7 Adelie  Dream            38.8             0.0388
 8 Adelie  Dream            42.2             0.0422
 9 Adelie  Dream            37.6             0.0376
10 Adelie  Dream            39.8             0.0398
# … with 114 more rows
# ℹ Use `print(n = ...)` to see more rows
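
filter() can also take several conditions at once. A minimal sketch, assuming dplyr and palmerpenguins are installed; the 40 mm threshold and the name long_bills_dream are arbitrary choices for illustration. Conditions separated by commas are combined with AND, and rows where a condition evaluates to NA are dropped.

```r
library(dplyr)
library(palmerpenguins)

# Keep only penguins from Dream island whose bill is longer than 40 mm
long_bills_dream <- penguins %>%
  select(species, island, bill_length_mm) %>%
  filter(island == 'Dream', bill_length_mm > 40)
```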

penguins %>%
  select(species, island, bill_length_mm) %>% 
  filter(island == 'Dream') %>% 
  mutate(bill_length_meters = bill_length_mm/1000) %>% 
  group_by(species)
# A tibble: 124 × 4
# Groups:   species [2]
   species island bill_length_mm bill_length_meters
   <fct>   <fct>           <dbl>              <dbl>
 1 Adelie  Dream            39.5             0.0395
 2 Adelie  Dream            37.2             0.0372
 3 Adelie  Dream            39.5             0.0395
 4 Adelie  Dream            40.9             0.0409
 5 Adelie  Dream            36.4             0.0364
 6 Adelie  Dream            39.2             0.0392
 7 Adelie  Dream            38.8             0.0388
 8 Adelie  Dream            42.2             0.0422
 9 Adelie  Dream            37.6             0.0376
10 Adelie  Dream            39.8             0.0398
# … with 114 more rows
# ℹ Use `print(n = ...)` to see more rows
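
Note that group_by() on its own does not change the rows; it only records which groups later verbs should work on. A small sketch of this, assuming dplyr and palmerpenguins are installed (the names grouped and ungrouped are illustrative):

```r
library(dplyr)
library(palmerpenguins)

grouped <- penguins %>%
  group_by(species)

# The rows are untouched; only the grouping metadata changes
nrow(grouped) == nrow(penguins)
group_vars(grouped)

# ungroup() removes the grouping again
ungrouped <- grouped %>%
  ungroup()
group_vars(ungrouped)
```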

penguins %>%
  select(species, island, bill_length_mm) %>% 
  filter(island == 'Dream') %>% 
  mutate(bill_length_meters = bill_length_mm/1000) %>% 
  group_by(species) %>% 
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            sd_bill_length_mm = sd(bill_length_mm))
# A tibble: 2 × 3
  species   mean_bill_length_mm sd_bill_length_mm
  <fct>                   <dbl>             <dbl>
1 Adelie                   38.5              2.47
2 Chinstrap                48.8              3.34
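
The Dream island subset has no missing bill lengths, but the full dataset does (see row 4 of the printed penguins tibble). The sketch below, assuming dplyr and palmerpenguins are installed, shows one common way to handle this with the na.rm argument of mean(); the column names mean_with_na and mean_without_na are illustrative.

```r
library(dplyr)
library(palmerpenguins)

bill_summary <- penguins %>%
  group_by(species) %>%
  summarise(
    # Without na.rm, a single NA in a group makes the whole mean NA
    mean_with_na = mean(bill_length_mm),
    # With na.rm = TRUE, missing values are dropped before averaging
    mean_without_na = mean(bill_length_mm, na.rm = TRUE)
  )
```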

dream_summary <- 
  penguins %>%
  select(species, island, bill_length_mm) %>% 
  filter(island == 'Dream') %>% 
  mutate(bill_length_meters = bill_length_mm/1000) %>% 
  group_by(species) %>% 
  summarise(mean_bill_length_mm = mean(bill_length_mm),
            sd_bill_length_mm = sd(bill_length_mm))

Here we assign the output to a new variable, dream_summary.

The previous code also shows two additional aspects that feature heavily in the tidyverse:

  • The Pipe %>%.
  • Non-Standard Evaluation.

The Pipe %>%

The pipe is provided by the package magrittr. It is a forwarding operator:


It takes the output of what comes before it (the left-hand side, LHS) and passes it as the first argument of the function that comes after it (the right-hand side, RHS).


LHS %>% RHS
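
A minimal sketch of this rule with a base R function, assuming magrittr is installed; the vector x is made up for illustration:

```r
library(magrittr)

x <- c(1.234, 5.678)

# The LHS becomes the first argument of the RHS function
piped  <- x %>% round(1)
direct <- round(x, 1)

# Both calls compute the same thing
identical(piped, direct)
```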

For example, you could write:

select(penguins, species, body_mass_g)
# A tibble: 344 × 2
   species body_mass_g
   <fct>         <int>
 1 Adelie         3750
 2 Adelie         3800
 3 Adelie         3250
 4 Adelie           NA
 5 Adelie         3450
 6 Adelie         3650
 7 Adelie         3625
 8 Adelie         4675
 9 Adelie         3475
10 Adelie         4250
# … with 334 more rows
# ℹ Use `print(n = ...)` to see more rows

…but if you use the pipe, your code is easier to read…

penguins %>% select(species, body_mass_g)
# A tibble: 344 × 2
   species body_mass_g
   <fct>         <int>
 1 Adelie         3750
 2 Adelie         3800
 3 Adelie         3250
 4 Adelie           NA
 5 Adelie         3450
 6 Adelie         3650
 7 Adelie         3625
 8 Adelie         4675
 9 Adelie         3475
10 Adelie         4250
# … with 334 more rows
# ℹ Use `print(n = ...)` to see more rows

…especially if you have to perform many operations one after the other…

penguins %>%
  select(species, body_mass_g) %>% 
  filter(species == 'Adelie')
# A tibble: 152 × 2
   species body_mass_g
   <fct>         <int>
 1 Adelie         3750
 2 Adelie         3800
 3 Adelie         3250
 4 Adelie           NA
 5 Adelie         3450
 6 Adelie         3650
 7 Adelie         3625
 8 Adelie         4675
 9 Adelie         3475
10 Adelie         4250
# … with 142 more rows
# ℹ Use `print(n = ...)` to see more rows

…that otherwise would force you to nest your code horribly.

filter(
  select(
    penguins,
    species, body_mass_g
  ),
  species == 'Adelie'
)
# A tibble: 152 × 2
   species body_mass_g
   <fct>         <int>
 1 Adelie         3750
 2 Adelie         3800
 3 Adelie         3250
 4 Adelie           NA
 5 Adelie         3450
 6 Adelie         3650
 7 Adelie         3625
 8 Adelie         4675
 9 Adelie         3475
10 Adelie         4250
# … with 142 more rows
# ℹ Use `print(n = ...)` to see more rows
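
A third option, shown here only as a sketch for comparison, is to store each step in an intermediate variable. It avoids nesting but leaves extra objects in your environment. The names penguins_small and adelie_mass are hypothetical, and the code assumes dplyr and palmerpenguins are installed.

```r
library(dplyr)
library(palmerpenguins)

# Step 1: keep two columns
penguins_small <- select(penguins, species, body_mass_g)

# Step 2: keep only the Adelie rows
adelie_mass <- filter(penguins_small, species == 'Adelie')
```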

Non-Standard Evaluation

This one is difficult…

Which arguments does the function select() take? Let’s check its help page with ?select.

?select

Under the Usage section it says:

select(.data, ...)

And then in the section Arguments the help page says:

.data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.

...: <tidy-select> One or more unquoted expressions separated by commas. Variable names can be used as if they were positions in the data frame, so expressions like x:y can be used to select a range of variables.
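
The x:y range mentioned in the help page can be sketched on the penguins data as follows, assuming dplyr and palmerpenguins are installed (the name measurements is illustrative):

```r
library(dplyr)
library(palmerpenguins)

# Select every column from bill_length_mm through body_mass_g
measurements <- penguins %>%
  select(bill_length_mm:body_mass_g)

colnames(measurements)
```

The range follows the column order of the tibble, so it picks up the four measurement columns that sit between the two names.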

So, what do we mean when we write:

penguins %>% 
  select(species, island)

The penguins tibble fills the .data parameter through the pipe %>%.

The unquoted names species and island fill the ... argument; they represent the names of the columns to be selected.

But the column names of a tibble are stored as a character vector.

colnames(penguins)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

Through non-standard evaluation, the function select() lets us refer to elements of that character vector as if they were variables, without quoting them.

penguins %>% 
  select(species, island)

The variables species and island don’t exist outside of the dplyr function select().

species
Error in eval(expr, envir, enclos): object 'species' not found

With non-standard evaluation we can write names without quoting them. This makes writing code for iterative data exploration faster.

If you come from a stricter programming language, it can be hard to get used to this behaviour.

Most functions in the Tidyverse use non-standard evaluation.
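
When the column names are already held in a character vector, select() can still use them. A minimal sketch, assuming dplyr and palmerpenguins are installed; all_of() is the tidyselect helper re-exported by dplyr, and wanted_columns is a hypothetical variable introduced for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Column names stored as ordinary strings (hypothetical helper vector)
wanted_columns <- c("species", "island")

# all_of() tells select() to look the names up in the character vector
selected <- penguins %>%
  select(all_of(wanted_columns))
```

This is handy when the names come from elsewhere in your code rather than being typed by hand.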

Exercise

With the penguins dataset:

  • Select all numeric variables (columns).

  • Convert all variables that are expressed in millimeters into meters, and rename them accordingly.

Get help from:

Exercise

How many penguins have bill_length_mm above average?

Repeat the analysis for each species.