Get Data into R

Tidyverse - Part 3

Otho Mantegazza

Datasets

Dataframes are rectangular datasets

More often than not when we speak about datasets, we speak about rectangular data, i.e., data in a two-dimensional table, made of values organized in rows and columns.

  • Each cell stores a value.
  • Each value belongs to one column and one row.

Rectangular data are the easiest to use: when we get data that are not rectangular, we try to reshape them into that form.
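A tibble (the tidyverse's flavour of data frame) is rectangular data. A minimal sketch with made-up values:

```r
library(tibble)

# two rows (observations), two columns (variables); each cell holds one value
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  mass_g  = c(3750, 5000)
)

toy
```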

Tools: Read Data with Readr

Readr

Readr is a package that loads (reads) rectangular text data into R.

It’s fast, it guesses column types and reports them explicitly, and it’s pipe friendly.

You can use it to read both local data and online data from a URL.

For example, we can use it to read data in CSV, TSV, and many other delimited formats.
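As a quick sketch (with made-up values), read_csv() can even parse text passed directly as a string, wrapped in I() to mark it as literal data rather than a file path:

```r
library(readr)

# I() tells readr the string is the data itself, not a path or URL
small_tbl <- read_csv(
  I("species,mass_g\nAdelie,3750\nGentoo,5000"),
  show_col_types = FALSE
)

small_tbl
```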

Read the Palmer Penguins dataset

We can practice again on the Palmer Penguins dataset.

The source code of this package is on GitHub; we can find the tidy CSV data in the inst/extdata folder.

penguin_csv_url <- 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'
 
read_csv(penguin_csv_url)
Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <chr>   <chr>              <dbl>         <dbl>      <dbl>   <dbl> <chr> <dbl>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
# ℹ Use `print(n = ...)` to see more rows

The tibble that we have loaded and generated from the CSV is not identical to the one that comes already loaded with the palmerpenguins package:

penguins_from_csv <- 
  penguin_csv_url %>% 
  read_csv()

identical(
  penguins_from_csv,
  palmerpenguins::penguins
)
[1] FALSE
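identical() only answers TRUE or FALSE; base R's all.equal() describes *why* two objects differ. A self-contained sketch of the same situation (a character column versus a factor column with the same values):

```r
library(tibble)

a <- tibble(sex = c("male", "female"))          # character, as read_csv() guesses
b <- tibble(sex = factor(c("male", "female")))  # factor, as in palmerpenguins

identical(a, b)          # FALSE
isTRUE(all.equal(a, b))  # also FALSE, but all.equal(a, b) reports the mismatch
```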

Let’s compare them side by side

palmerpenguins::penguins %>% 
  glimpse(width = 40)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adel…
$ island            <fct> Torgersen, T…
$ bill_length_mm    <dbl> 39.1, 39.5, …
$ bill_depth_mm     <dbl> 18.7, 17.4, …
$ flipper_length_mm <int> 181, 186, 19…
$ body_mass_g       <int> 3750, 3800, …
$ sex               <fct> male, female…
$ year              <int> 2007, 2007, …
penguins_from_csv %>% 
  glimpse(width = 40)
Rows: 344
Columns: 8
$ species           <chr> "Adelie", "A…
$ island            <chr> "Torgersen",…
$ bill_length_mm    <dbl> 39.1, 39.5, …
$ bill_depth_mm     <dbl> 18.7, 17.4, …
$ flipper_length_mm <dbl> 181, 186, 19…
$ body_mass_g       <dbl> 3750, 3800, …
$ sex               <chr> "male", "fem…
$ year              <dbl> 2007, 2007, …

Can you spot that the column types are different?



Parsing

When we read data from text-encoded “delimited” files, such as CSV, we use functions that parse the file.

When we parse something, we formalize its structure by applying a set of grammatical rules.

No parsing rule is perfect, thus we must often review the results and “fix” parsing “mistakes”.
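readr exposes its parsers as parse_*() functions, which makes it easy to see the parsing rules, and their failures, in isolation:

```r
library(readr)

parse_number("$1,234.5")   # drops the currency sign and the grouping comma
parse_date("2007-11-12")   # ISO dates parse out of the box

# a failed parse becomes NA, and the failure is recorded with the result
x <- parse_integer(c("1", "2", "three"))
problems(x)                # a tibble describing what went wrong, and where
```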

# specify column types manually

penguins_from_csv <-
  penguin_csv_url %>% 
  read_csv(
    col_types = cols(
      species = col_factor(),
      island = col_factor(),
      flipper_length_mm = col_integer(),
      body_mass_g = col_integer(),
      sex = col_factor(),
      year = col_integer()
    )
  )

penguins_from_csv %>% glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
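Rather than writing the full cols() specification from scratch, one convenient workflow (sketched here on made-up data) is to read the file once, print spec(), and copy the result into col_types before editing it:

```r
library(readr)

tbl <- read_csv(
  I("species,year\nAdelie,2007\nGentoo,2008"),
  show_col_types = FALSE
)

spec(tbl)  # prints a ready-made cols(...) call to copy into col_types
```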

Exercise

Find the source code of the readr package.

In the inst/extdata folder you can find 10 datasets that present different challenges that you might encounter when you have to load data from an external file.

Load at least 3 of those datasets into R using functions from readr.

Get help from readr’s documentation and the data import chapter of r4ds.

Which function did you use? Did you encounter any parsing failure? How did you fix it?

Read a dataset from PANGAEA

PANGAEA is a data repository for the environmental sciences.

For our exercise we will use this dataset from Wu et al:


“Effect of barite-bound Sr on detrital Sr isotope systematics in marine sediments with pertinent Mediterranean examples”.


https://doi.pangaea.de/10.1594/PANGAEA.946258

pangaea_filename <- 'Dataset_S2_HCl-leaching.tab'

pangaea_path <- here('data-int/Wu-etal_2022/datasets', pangaea_filename)

Let’s try to read the data file Dataset_S2_HCl-leaching.tab.

It’s a .tab file.

pangaea_data <- 
  pangaea_path %>% 
  read_delim()
Warning: One or more parsing issues, see `problems()` for details
Rows: 177 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: " "
chr (3): /*, DATA, DESCRIPTION:

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If we call problems() readr tells us what went wrong.

pangaea_data
# A tibble: 177 × 3
   `/*`                   DATA           `DESCRIPTION:`                         
   <chr>                  <chr>          <chr>                                  
 1 "Citation:\tWu,"       "Jiawang;"     "Liu, Zhifei; Michard, Annie; Tachikaw…
 2 "\tIn:"                "Wu,"          "J et al. (2022): Effect of barite-bou…
 3 "Keyword(s):\tBarite;" "Barium"       "(Ba); marine sediments; Mediterranean…
 4 "Related"              "to:\tWu,"     "Jiawang; Liu, Zhifei; Michard, Annie;…
 5 "Funding:\tChina"      "Postdoctoral" "Science Foundation (URI: http://res.c…
 6 "\tChina"              "Postdoctoral" "Science Foundation (URI: http://res.c…
 7 "\tEuropean"           "Commission"   "(EC), grant/award no. MAS2-CT93-0051:…
 8 "\tEuropean"           "Commission"   "(EC), grant/award no. MAS3 CT97-0137:…
 9 "\tEuropean"           "Commission"   "(EC), grant/award no. MAST1-90022C: E…
10 "\tNational"           "Natural"      "Science Foundation of China (NSFC) (U…
# … with 167 more rows
# ℹ Use `print(n = ...)` to see more rows

We can use the arguments:

  • delim = '\t' to tell read_delim() that the file is delimited by tabs (\t).
  • skip = 49 to skip the first 49 rows, which contain the metadata header.
pangaea_data <- 
  pangaea_path %>% 
  read_delim(delim = '\t',
             skip = 49)
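Hard-coding skip = 49 is fragile. A sketch, assuming the PANGAEA convention that the metadata header is closed by a line containing only `*/`, computes the skip value instead; find_skip() is a hypothetical helper, demonstrated here on a fake file:

```r
# hypothetical helper: line number of the header-closing marker (NA if absent)
find_skip <- function(path, marker = "*/") {
  match(marker, readLines(path, n = 500))
}

# demo on a temporary file that mimics the PANGAEA layout
tmp <- tempfile(fileext = ".tab")
writeLines(c("/* DATA DESCRIPTION:", "Citation: ...", "*/",
             "Depth [m]\tAge [ka BP]", "0.0045\t1.66"), tmp)

find_skip(tmp)  # use as: read_delim(tmp, delim = '\t', skip = find_skip(tmp))
```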

Now the data that we’ve imported into R looks fine.

pangaea_data %>% glimpse()
Rows: 128
Columns: 19
$ Event                      <chr> "SL73BC", "SL73BC", "SL73BC", "SL73BC", "SL…
$ Latitude                   <dbl> 39.6617, 39.6617, 39.6617, 39.6617, 39.6617…
$ Longitude                  <dbl> 24.5117, 24.5117, 24.5117, 24.5117, 24.5117…
$ `Elevation [m]`            <dbl> -339, -339, -339, -339, -339, -339, -339, -…
$ `Sample label (barite-Sr)` <chr> "SL73-1", "SL73-2", "SL73-3", "SL73-4", "SL…
$ `Samp type`                <chr> "non-S1", "non-S1", "non-S1", "S1b", "S1b",…
$ `Depth [m]`                <dbl> 0.0045, 0.0465, 0.0665, 0.1215, 0.1765, 0.1…
$ `Age [ka BP]`              <dbl> 1.66, 3.13, 4.13, 5.75, 7.30, 7.78, 8.65, 9…
$ `CaCO3 [%]`                <dbl> 61.5, 55.1, 53.0, 43.4, 41.8, 42.3, 42.6, 3…
$ `Ba [µg/g] (Leachate)`     <dbl> 72.6, 64.3, 37.4, 63.5, 101.0, 141.0, 75.2,…
$ `Sr [µg/g] (Leachate)`     <dbl> 767, 681, 690, 552, 527, 528, 551, 482, 391…
$ `Ca [µg/g] (Leachate)`     <dbl> 188951.9, 163260.4, 162188.7, 125937.6, 124…
$ `Al [µg/g] (Leachate)`     <dbl> 10612.7, 11428.4, 5463.0, 3261.5, 2121.7, 1…
$ `Fe [µg/g] (Leachate)`     <dbl> 5935.0, 6814.3, 2465.7, 3936.8, 189.6, 7711…
$ `Ba [µg/g] (Residue)`      <dbl> 171.0, 198.0, 251.0, 290.0, 315.0, 259.0, 3…
$ `Sr [µg/g] (Residue)`      <dbl> 66.6, 71.3, 90.3, 99.8, 106.0, 90.7, 108.0,…
$ `Ca [µg/g] (Residue)`      <dbl> 2315.5, 2369.6, 3007.7, 3447.9, 3713.2, 331…
$ `Al [µg/g] (Residue)`      <dbl> 29262.5, 35561.4, 45862.4, 52485.6, 55083.0…
$ `Fe [µg/g] (Residue)`      <dbl> 14834.8, 18301.4, 24534.5, 30745.3, 28532.2…

Let’s also look at the summary statistics with skimr::skim().

pangaea_data %>% skimr::skim()
Data summary
Name Piped data
Number of rows 128
Number of columns 19
_______________________
Column type frequency:
character 3
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Event 0 1 5 7 0 11 0
Sample label (barite-Sr) 0 1 5 7 0 128 0
Samp type 0 1 3 6 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Latitude 0 1 35.32 2.24 33.38 33.67 34.55 35.29 39.66 ▇▂▁▁▃
Longitude 0 1 22.79 8.17 4.51 17.86 24.51 28.61 32.67 ▂▁▆▆▇
Elevation [m] 0 1 -1807.65 829.55 -3390.00 -2373.00 -1587.00 -1375.00 -339.00 ▂▇▁▅▅
Depth [m] 0 1 0.30 0.25 0.00 0.17 0.26 0.34 1.40 ▇▅▁▁▁
Age [ka BP] 0 1 7.88 3.49 0.30 5.83 7.86 9.82 19.61 ▃▇▇▂▁
CaCO3 [%] 0 1 57.94 8.29 33.70 52.58 57.80 64.45 74.60 ▁▃▇▇▅
Ba [µg/g] (Leachate) 0 1 216.49 223.26 0.00 37.15 122.00 398.75 954.00 ▇▂▂▁▁
Sr [µg/g] (Leachate) 0 1 784.98 207.08 341.00 635.15 767.50 886.25 1419.00 ▃▇▇▂▁
Ca [µg/g] (Leachate) 0 1 179277.91 34034.64 82460.20 158421.22 182153.85 204214.33 240253.00 ▁▃▆▇▅
Al [µg/g] (Leachate) 0 1 9075.10 3739.72 122.20 6036.95 9034.50 11691.80 21011.90 ▂▇▇▅▁
Fe [µg/g] (Leachate) 0 1 7673.07 4235.50 189.60 4937.08 6918.25 9513.88 21428.70 ▃▇▃▁▁
Ba [µg/g] (Residue) 0 1 184.58 109.97 54.80 98.30 142.00 251.00 507.00 ▇▃▂▁▁
Sr [µg/g] (Residue) 0 1 46.15 17.69 20.50 33.53 41.10 51.85 120.00 ▇▇▂▁▁
Ca [µg/g] (Residue) 0 1 1316.60 930.42 475.50 825.50 1057.45 1446.40 8343.50 ▇▁▁▁▁
Al [µg/g] (Residue) 0 1 36668.75 8937.30 19227.20 29424.80 37297.75 43806.80 60033.10 ▆▇▇▆▁
Fe [µg/g] (Residue) 0 1 21446.97 6866.21 10823.30 16197.70 20393.90 24408.27 48057.60 ▇▇▃▁▁

Always check for missing values

skimr::skim() shows you how many values are missing in your dataset:

  • How many missing values are there?
  • Where do they occur?

A more formal way to check for missing values:

pangaea_data %>% 
  summarise(
    across(
      .cols = everything(),
      .fns = ~ sum(is.na(.x))
    )
  ) %>% 
  glimpse()
Rows: 1
Columns: 19
$ Event                      <int> 0
$ Latitude                   <int> 0
$ Longitude                  <int> 0
$ `Elevation [m]`            <int> 0
$ `Sample label (barite-Sr)` <int> 0
$ `Samp type`                <int> 0
$ `Depth [m]`                <int> 0
$ `Age [ka BP]`              <int> 0
$ `CaCO3 [%]`                <int> 0
$ `Ba [µg/g] (Leachate)`     <int> 0
$ `Sr [µg/g] (Leachate)`     <int> 0
$ `Ca [µg/g] (Leachate)`     <int> 0
$ `Al [µg/g] (Leachate)`     <int> 0
$ `Fe [µg/g] (Leachate)`     <int> 0
$ `Ba [µg/g] (Residue)`      <int> 0
$ `Sr [µg/g] (Residue)`      <int> 0
$ `Ca [µg/g] (Residue)`      <int> 0
$ `Al [µg/g] (Residue)`      <int> 0
$ `Fe [µg/g] (Residue)`      <int> 0

Quick checklist when you read new data into R

  • Check for missing values.
  • Check the column types: are they what you expect?
  • Check the number of rows and the column names.
  • Optionally, check the head() and the tail() of the data.
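Sketched in code on a toy tibble (made-up values), the checklist might look like:

```r
library(readr)

tbl <- read_csv(
  I("species,mass_g\nAdelie,3750\nGentoo,NA"),
  show_col_types = FALSE
)

colSums(is.na(tbl))   # 1. missing values per column
sapply(tbl, class)    # 2. column types
dim(tbl)              # 3. number of rows and columns
names(tbl)            #    ... and column names
head(tbl); tail(tbl)  # 4. first and last rows
```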

Now let’s tidy the data.

Exercise

TidyTuesday is a weekly data project aimed at learning, collaborating, and networking in the R ecosystem.

Find this week’s dataset and read it into R. Run the checklist from the previous slide on the data that you’ve read.

If you are done early, read the data from a previous week or find a colleague to help.

Check TidyTuesday submissions on Twitter with the hashtag #TidyTuesday.