The Tidyverse, Part 4

# The Tidyverse, Part 4
## “Science is what we understand well enough to explain to a computer; R is everything else.”
### Otho Mantegazza<br>2019-11-20

---

---

# Milano Open Data

The city of Milan releases carefully curated and well organized open data.

Check the website: https://dati.comune.milano.it/

Also Regione Lombardia has a great open data portal: https://www.dati.lombardia.it/

Did you know that, in **Europe**, you can ask for **Open Data** to the of public administration?

]
]

</p>
<a href="https://dati.comune.milano.it/" class="imagelink">
<img src="img/milano.svg" alt="stemma-milano" height="500" width="400"></a>
<p>

]
]

---

# Let's download some...

Air pollution in Milan: https://dati.comune.milano.it/dataset/ds417-rilevazione-qualita-aria-2019

```r
mil_url <- 
  paste0("http://dati.comune.milano.it/dataset/",
         "3e752fec-06fd-421b-ae9b-4d5d7a177640/",
         "resource/698a58e6-f276-44e1-92b1-3d2b81a4ad47/download/",
         "qaria_datoariagiornostazione_2019-11-12.csv")

mil_path <- here::here("data/qaria_datoariagiornostazione_2019-11-12.csv")
```

```r
download.file(url = mil_url,
              destfile = mil_path)
```

---

# Let's read the data...

```r
dat_mil <- read_csv(mil_path)
```

```
## Parsed with column specification:
## cols(
##   stazione_id = col_double(),
##   data = col_date(format = ""),
##   inquinante = col_character(),
##   valore = col_double()
## )
```

```r
dat_mil %>% print(n = 6)
```

---

Describe the data:

- How are they structured? 
- What's stored in the columns? What's their type?
- Are there any NA?

]

---

# Six Pollutants over Eight Stations

```r
dat_mil %>% 
  count(stazione_id)
```

```
## # A tibble: 8 x 2
##   stazione_id     n
##         <dbl> <int>
## 1           1   192
## 2           2  1152
## 3           3   384
## 4           4   768
## 5           5   384
## 6           6   960
## 7           7   576
## 8           8   576
```

]

```r
dat_mil %>%
  count(inquinante)
```

```
## # A tibble: 7 x 2
##   inquinante     n
##   <chr>      <int>
## 1 C6H6         768
## 2 CO_8h        768
## 3 NO2         1536
## 4 O3           576
## 5 PM10         768
## 6 PM25         384
## 7 SO2          192
```

]

---

Use a visualization to learn when were the records taken?

]

---

```r
dat_mil %>% 
  group_by(stazione_id, inquinante) %>% 
  count()
```

```
## # A tibble: 26 x 3
## # Groups:   stazione_id, inquinante [26]
##    stazione_id inquinante     n
##          <dbl> <chr>      <int>
##  1           1 NO2          192
##  2           2 C6H6         192
##  3           2 NO2          192
##  4           2 O3           192
##  5           2 PM10         192
##  6           2 PM25         192
##  7           2 SO2          192
##  8           3 CO_8h        192
##  9           3 NO2          192
## 10           4 C6H6         192
## # … with 16 more rows
```

---

Count the NAs per column of the dataset.

]

---

```r
*dat_mil
```
]
.pull-right-reveal[

```
## # A tibble: 4,992 x 4
##    stazione_id data       inquinante valore
##          <dbl> <date>     <chr>       <dbl>
##  1           1 2019-01-03 NO2          NA  
##  2           3 2019-01-03 NO2          51  
##  3           3 2019-01-03 CO_8h         1.2
##  4           4 2019-01-03 PM10         29  
##  5           4 2019-01-03 NO2         139  
##  6           4 2019-01-03 CO_8h         1.3
##  7           4 2019-01-03 C6H6          1.5
##  8           5 2019-01-03 NO2          NA  
##  9           5 2019-01-03 O3           NA  
## 10           2 2019-01-03 SO2           2.5
## # … with 4,982 more rows
```
]
---
count: false
.pull-left-reveal[

```r
dat_mil %>%  
* ggplot(aes(x = stazione_id,
*            y = inquinante,
*            colour = is.na(valore)))
```
]
.pull-right-reveal[
![](05-intro-to-the-tidyverse_files/figure-html/output_count_na_4-1.svg)
]
---
count: false
.pull-left-reveal[

```r
dat_mil %>%  
  ggplot(aes(x = stazione_id,  
             y = inquinante,  
             colour = is.na(valore))) +  
* geom_count(alpha = .8)
```
]
.pull-right-reveal[
![](05-intro-to-the-tidyverse_files/figure-html/output_count_na_5-1.svg)
]
---
count: false
.pull-left-reveal[

```r
dat_mil %>%  
  ggplot(aes(x = stazione_id,  
             y = inquinante,  
             colour = is.na(valore))) +  
  geom_count(alpha = .8)  +  
* theme_bw()
```
]
.pull-right-reveal[
![](05-intro-to-the-tidyverse_files/figure-html/output_count_na_6-1.svg)
]
---
count: false
.pull-left-reveal[

```r
dat_mil %>%  
  ggplot(aes(x = stazione_id,  
             y = inquinante,  
             colour = is.na(valore))) +  
  geom_count(alpha = .8)  +  
  theme_bw() +  
* scale_size_continuous(
*   limits = c(0, NA)
* )
```
]
.pull-right-reveal[
![](05-intro-to-the-tidyverse_files/figure-html/output_count_na_9-1.svg)
]

---

# Some missing data are implicit

https://tidyr.tidyverse.org/reference/complete.html

```r
# complete
dat_mil_all <- 
  dat_mil %>%
  complete(stazione_id, nesting(data, inquinante))
```

```r
dat_mil_all %>% print(n = 6)
```

---

```r
*dat_mil_all
```
]
.pull-right-reveal[

```
## # A tibble: 10,602 x 4
##    stazione_id data       inquinante valore
##          <dbl> <date>     <chr>       <dbl>
##  1           1 2019-01-03 C6H6           NA
##  2           1 2019-01-03 CO_8h          NA
##  3           1 2019-01-03 NO2            NA
##  4           1 2019-01-03 O3             NA
##  5           1 2019-01-03 PM10           NA
##  6           1 2019-01-03 PM25           NA
##  7           1 2019-01-03 SO2            NA
##  8           1 2019-01-04 C6H6           NA
##  9           1 2019-01-04 CO_8h          NA
## 10           1 2019-01-04 NO2            NA
## # … with 10,592 more rows
```
]
---
count: false
.pull-left-reveal[

```r
dat_mil_all %>%  
* ggplot(aes(x = stazione_id,
*            y = inquinante))
```
]
.pull-right-reveal[
![](05-intro-to-the-tidyverse_files/figure-html/output_count_na_exp_3-1.svg)
]
---
count: false
.pull-left-reveal[

```r
dat_mil_all %>%  
  ggplot(aes(x = stazione_id,  
             y = inquinante)) +  
* geom_count()
```
]
.pull-right-reveal[
![](05-intro-to-the-tidyverse_files/figure-html/output_count_na_exp_4-1.svg)
]
---
count: false
.pull-left-reveal[

```r
dat_mil_all %>%  
  ggplot(aes(x = stazione_id,  
             y = inquinante)) +  
  geom_count() +  
* theme_bw()
```
]
.pull-right-reveal[
![](05-intro-to-the-tidyverse_files/figure-html/output_count_na_exp_5-1.svg)
]
---
count: false
.pull-left-reveal[

```r
dat_mil_all %>%  
  ggplot(aes(x = stazione_id,  
             y = inquinante)) +  
  geom_count() +  
  theme_bw() +  
* scale_size_continuous(
*   limits = c(0, NA)
* )
```
]
.pull-right-reveal[
![](05-intro-to-the-tidyverse_files/figure-html/output_count_na_exp_8-1.svg)
]

---

Some rows might be duplicated. How would you remove them?

]

---

How would you visualize the amount of pollutants in the air by month?

Draw some sketches by hand.

What value would you map to what object of your sketch?

]

---

```r
dat_mil_all %>% 
  mutate(month = month(data, label = T)) %>% 
  ggplot(aes(x = month, 
             y = valore)) +
  geom_boxplot() +
  aes(fill = inquinante)
```

![](05-intro-to-the-tidyverse_files/figure-html/unnamed-chunk-14-1.svg)

---

```r
dat_mil_all %>% 
  ggplot(aes(x = data, 
             y = valore)) +
  geom_point(alpha = .2) +
  aes(colour = inquinante) +
  geom_smooth()
```

![](05-intro-to-the-tidyverse_files/figure-html/unnamed-chunk-15-1.png)

---

# Pivot

Pivoting: https://tidyr.tidyverse.org/articles/pivot.html

![](https://d33wubrfki0l68.cloudfront.net/3aea19108d39606bbe49981acda07696c0c7fcd8/2de65/images/tidy-9.png)

Image Credits: https://r4ds.had.co.nz/tidy-data.html

---

```r
*dat_mil_all
```
]
.pull-right-reveal[

```
## # A tibble: 10,472 x 4
##    stazione_id data       inquinante valore
##          <dbl> <date>     <chr>       <dbl>
##  1           1 2019-01-03 C6H6           NA
##  2           1 2019-01-03 CO_8h          NA
##  3           1 2019-01-03 NO2            NA
##  4           1 2019-01-03 O3             NA
##  5           1 2019-01-03 PM10           NA
##  6           1 2019-01-03 PM25           NA
##  7           1 2019-01-03 SO2            NA
##  8           1 2019-01-04 C6H6           NA
##  9           1 2019-01-04 CO_8h          NA
## 10           1 2019-01-04 NO2            NA
## 11           1 2019-01-04 O3             NA
## 12           1 2019-01-04 PM10           NA
## 13           1 2019-01-04 PM25           NA
## 14           1 2019-01-04 SO2            NA
## 15           1 2019-01-07 C6H6           NA
## 16           1 2019-01-07 CO_8h          NA
## 17           1 2019-01-07 NO2            NA
## 18           1 2019-01-07 O3             NA
## 19           1 2019-01-07 PM10           NA
## 20           1 2019-01-07 PM25           NA
## # … with 1.045e+04 more rows
```
]
---
count: false
.pull-left-reveal[

```r
dat_mil_all %>%  
* pivot_wider(
*   names_from = inquinante,
*   values_from = valore
* )
```
]
.pull-right-reveal[

```
## # A tibble: 1,496 x 9
##    stazione_id data        C6H6 CO_8h   NO2    O3  PM10
##          <dbl> <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
##  1           1 2019-01-03    NA    NA    NA    NA    NA
##  2           1 2019-01-04    NA    NA    NA    NA    NA
##  3           1 2019-01-07    NA    NA    NA    NA    NA
##  4           1 2019-01-08    NA    NA    NA    NA    NA
##  5           1 2019-01-09    NA    NA    NA    NA    NA
##  6           1 2019-01-10    NA    NA    NA    NA    NA
##  7           1 2019-01-11    NA    NA    NA    NA    NA
##  8           1 2019-01-14    NA    NA    NA    NA    NA
##  9           1 2019-01-15    NA    NA    NA    NA    NA
## 10           1 2019-01-17    NA    NA    NA    NA    NA
## 11           1 2019-01-18    NA    NA    NA    NA    NA
## 12           1 2019-01-21    NA    NA    NA    NA    NA
## 13           1 2019-01-22    NA    NA    NA    NA    NA
## 14           1 2019-01-23    NA    NA    NA    NA    NA
## 15           1 2019-01-25    NA    NA    NA    NA    NA
## 16           1 2019-01-28    NA    NA    NA    NA    NA
## 17           1 2019-01-29    NA    NA    NA    NA    NA
## 18           1 2019-01-30    NA    NA    NA    NA    NA
## 19           1 2019-02-01    NA    NA    NA    NA    NA
## 20           1 2019-02-04    NA    NA    NA    NA    NA
## # … with 1,476 more rows, and 2 more variables:
## #   PM25 <dbl>, SO2 <dbl>
```
]

---

```r
* # Pivot back
*dat_mil_wide
```
]
.pull-right-reveal[

```r
  # Pivot back
dat_mil_wide %>%  
* pivot_longer(
*   cols = C6H6:SO2,
*   names_to = "inquinante",
*   values_to = "valore"
* )
```
]
.pull-right-reveal[

---

# Extend GGPLOT2

https://www.ggplot2-exts.org/gallery/

```r
# install.packages("GGally")

library(GGally)

ggpairs(dat_mil_wide)
```

---

![](05-intro-to-the-tidyverse_files/figure-html/unnamed-chunk-19-1.png)

]