class: center, middle, inverse, title-slide

# The Tidyverse, Part 3
## “Without R, the crudeness of reality would make the world unbearable”
### Otho Mantegazza
2019-11-20

---
class: blueblue, middle

.big[A couple of .orange[Data Cleaning] steps.]

---

# NA encoded as question marks

```r
squirrels_tidy %>%
  pull(age) %>%
  unique()
```

```
## [1] NA         "Adult"    "Juvenile" "?"
```

--

# Two columns in one

```r
squirrels_tidy %>%
  pull(hectare) %>%
  unique()
```

```
##  [1] "37F" "37E" "02E" "05D" "39B" "33H" "06G" "35C" "07B" "32E" "13E" "11H"
## [13] "36H" "33F" "21C" "11D" "20B" "36I" "05C" "07H" "16C" "14E" "32A" "17F"
## [25] "16I" "12I" "32F" "25A" "15E" "39G" "29I" "07E" "17C" "10A" "28A" "22F"
## [37] "12B" "18A" "29C" "38C"
##  [ reached getOption("max.print") -- omitted 299 entries ]
```

---

# TIDYR

.pull-left[
.middle[

TIDYR is a package dedicated to tidying data. It provides functions for:

- Pivoting,
- Rectangling,
- Nesting,
- Separating and combining columns,
- Dealing with missing data.

For now, we need the last two.

]
]

.pull-right[
.center[
</p>
<a href="https://tidyr.tidyverse.org" class="imagelink">
<img src="img/SVG/tidyr.svg" alt="hex-tidyr" height="445" width="400"></a>
<p>
]
]

---

# How many NAs

```r
# Count the missing values in each column
squirrels_tidy %>%
  map(~is.na(.) %>% sum()) %>%
  unlist()
```

```
##                    long                     lat      unique_squirrel_id
##                       0                       0                       0
##                 hectare                   shift                    date
##                       0                       0                       0
## hectare_squirrel_number                     age       primary_fur_color
##                       0                     121                      55
##     highlight_fur_color
##                    1086
##  [ reached getOption("max.print") -- omitted 26 entries ]
```

---

# How many question marks

```r
# Count the values containing a literal "?" in each column
squirrels_tidy %>%
  map(~str_detect(., "\\?") %>% sum(na.rm = TRUE)) %>%
  unlist()
```

```
##                    long                     lat      unique_squirrel_id
##                       0                       0                       0
##                 hectare                   shift                    date
##                       0                       0                       0
## hectare_squirrel_number                     age       primary_fur_color
##                       0                       4                       0
##     highlight_fur_color
##                       0
##  [ reached getOption("max.print") -- omitted 26 entries ]
```

---

# Transform question mark to NA

```r
# Replace "?" with NA in every character column
squirrels_tidy <- squirrels_tidy %>%
  mutate_if(is.character, ~na_if(., "?"))
```

No more NAs coded as question marks.

```r
squirrels_tidy %>%
  map(~str_detect(., "\\?") %>% sum(na.rm = TRUE)) %>%
  unlist()
```

```
##                    long                     lat      unique_squirrel_id
##                       0                       0                       0
##                 hectare                   shift                    date
##                       0                       0                       0
## hectare_squirrel_number                     age       primary_fur_color
##                       0                       0                       0
##     highlight_fur_color
##                       0
##  [ reached getOption("max.print") -- omitted 26 entries ]
```

---

# Separate columns

```r
# Split hectare after its second character: "37F" becomes "37" and "F"
squirrels_tidy <- squirrels_tidy %>%
  separate(hectare, c("hectare_lat", "hectare_lon"), sep = 2)

squirrels_tidy %>%
  select(hectare_lat, hectare_lon)
```

```
## # A tibble: 3,023 x 2
##    hectare_lat hectare_lon
##    <chr>       <chr>
##  1 37          F
##  2 37          E
##  3 02          E
##  4 05          D
##  5 39          B
##  6 33          H
##  7 06          G
##  8 35          C
##  9 07          B
## 10 32          E
## # … with 3,013 more rows
```

---

# Janitor

.pull-left[
.middle[

Janitor provides many useful routines for cleaning data that have been collected in spreadsheets.

According to the package README, dirtiness includes:

- Dreadful column names.
- Rows and columns containing Excel formatting but no data.
- Dates stored as numbers.
- Values spread inconsistently over [...] columns.

]
]

.pull-right[
.center[
</p>
<a href="https://sfirke.github.io/janitor/" class="imagelink">
<img src="img/janitor.png" alt="hex-tidyr" height="230" width="200"></a>
<p>
]
]

---

.center[
.middle[
<img src="https://media.giphy.com/media/VzwrrgjLyRzJS/giphy.gif" width=75%>
]
]
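
---

# Both steps on a toy table

To see the two tidyr/dplyr cleaning steps in isolation, here is a minimal sketch on a made-up three-row tibble. The `toy` data and the name `toy_clean` are assumptions for illustration only; they are not part of the squirrel census.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data that mimics the two problems seen in the census
toy <- tibble(
  hectare = c("37F", "02E", "05D"),
  age     = c("Adult", "?", NA)
)

toy_clean <- toy %>%
  # Turn every "?" in a character column into a real NA
  mutate_if(is.character, ~na_if(., "?")) %>%
  # Split hectare after its second character into two new columns
  separate(hectare, c("hectare_lat", "hectare_lon"), sep = 2)

toy_clean
```

Splitting at a fixed position with `sep = 2` is safe here only because every hectare code is a two-digit number followed by a single letter.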