Missing Values

Otho Mantegazza

What is a missing value?

A missing value is a data point whose value is unknown. Because no one knows what the value was, an operation on it cannot give a known result.

In R, missing values are written as NA.

Operations on missing values

Most operations on missing values return NA.

Ask yourself: what is one plus “a number that I don’t know” (NA)? The answer is “I don’t know” (NA).

NA + 1
[1] NA
NA / 4
[1] NA
NA == 1
[1] NA
NA == NA
[1] NA
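
The word "most" matters: a few operations return a known result even when one input is NA, because the answer would be the same whatever the unknown value is. A minimal sketch in base R, showing standard R behaviour with no extra packages:

```r
# The result is the same whatever the missing value is, so it is known.
NA | TRUE    # TRUE: "anything OR TRUE" is always TRUE
NA & FALSE   # FALSE: "anything AND FALSE" is always FALSE
NA^0         # 1: any number raised to the power 0 is 1

# The result depends on the missing value, so it stays NA.
NA | FALSE   # NA
NA & TRUE    # NA
```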

Testing if a value is missing

To test whether an element is a missing value, we must use the function is.na(). Comparing with == NA does not work, because the comparison itself returns NA.

1 %>% is.na()
[1] FALSE
'hello' %>% is.na()
[1] FALSE
TRUE %>% is.na()
[1] FALSE
NA %>% is.na()
[1] TRUE

Missing values in a vector

We can assign missing values to a variable and place them in a vector.

missing_value <- NA
missing_value %>% is.na()
[1] TRUE

When we use is.na() on a vector, it returns a logical vector, with TRUE in the positions where values are missing.

vector_with_na <- c(1, 5, NA, 10)
vector_with_na %>% is.na()
[1] FALSE FALSE  TRUE FALSE
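
Because is.na() returns a logical vector, its result can also be used to locate or drop the missing elements. A small sketch in base R, re-creating the vector so that the example runs on its own:

```r
vector_with_na <- c(1, 5, NA, 10)

# Positions of the missing values
which(is.na(vector_with_na))
# [1] 3

# Keep only the values that are not missing
vector_with_na[!is.na(vector_with_na)]
# [1]  1  5 10
```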

We can count missing values in a vector with sum().

vector_with_na %>% is.na() %>% sum()
[1] 1
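
Counting with sum() works because R treats TRUE as 1 and FALSE as 0. By the same logic, mean() gives the share of missing values rather than the count. A minimal sketch in base R:

```r
vector_with_na <- c(1, 5, NA, 10)

# TRUE counts as 1 and FALSE as 0, so the mean of the logical
# vector is the proportion of missing values
mean(is.na(vector_with_na))
# [1] 0.25
```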

Operating on missing values

Many summary functions, such as mean(), median(), and sd(), let you remove missing values before computing, with the argument na.rm = TRUE.

vector_with_na %>% mean()
[1] NA
vector_with_na %>% mean(na.rm = TRUE)
[1] 5.333333
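
The same argument is available in other base R summary functions. The sketch below re-creates the vector and shows a few of them. In each one, na.rm defaults to FALSE, so you have to ask for the removal explicitly:

```r
vector_with_na <- c(1, 5, NA, 10)

sum(vector_with_na, na.rm = TRUE)     # 16
median(vector_with_na, na.rm = TRUE)  # 5
max(vector_with_na, na.rm = TRUE)     # 10

# Without na.rm = TRUE the missing value propagates
sd(vector_with_na)                    # NA
```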

Also, ggplot2 functions remove NAs automatically, usually with a warning, when they would hinder computations.

Data with missing values

Data often have NAs in them; the penguins dataset, for example, does.

To count the missing values in each column, use these lines of code:

penguins %>% 
  summarise(
    across(
      everything(),
      ~is.na(.) %>% sum()
    )
  )
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…²   sex  year
    <int>  <int>          <int>         <int>          <int>   <int> <int> <int>
1       0      0              2             2              2       2    11     0
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
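
If dplyr is not loaded, a similar per-column count is possible in base R with colSums(). The sketch below uses a small made-up data frame so that it runs on its own; toy_data is hypothetical and is not the penguins data:

```r
# Hypothetical toy data, standing in for a real dataset
toy_data <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, NA, 46.5),
  sex = c(NA, "female", NA)
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column
na_per_column <- colSums(is.na(toy_data))
na_per_column
# species: 0, bill_length_mm: 1, sex: 2
```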

Data with missing values

We can also count missing values stratified by a fixed variable, for example by island:

penguins %>% 
  group_by(island) %>% 
  summarise(
    across(
      everything(),
      ~is.na(.) %>% sum()
    )
  )
# A tibble: 3 × 8
  island    species bill_length_mm bill_depth_mm flipper_l…¹ body_…²   sex  year
  <fct>       <int>          <int>         <int>       <int>   <int> <int> <int>
1 Biscoe          0              1             1           1       1     5     0
2 Dream           0              0             0           0       0     1     0
3 Torgersen       0              1             1           1       1     5     0
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
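
For a single column, a grouped count can also be sketched in base R with tapply(). The data frame below is made up for illustration; toy_data and its values are assumptions, not the penguins data:

```r
# Hypothetical toy data with one grouping column and one column with NAs
toy_data <- data.frame(
  island = c("Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"),
  sex = c(NA, "male", "female", "female", NA)
)

# Count the missing values of sex within each island
na_by_island <- tapply(is.na(toy_data$sex), toy_data$island, sum)
na_by_island
# Biscoe: 1, Dream: 0, Torgersen: 1
```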

Strategy for missing values

Often data have missing values.

The most important thing when you get a new dataset is to figure out how many missing values it contains and where they are.
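
One possible way to see where the missing values are, sketched in base R on a made-up data frame (toy_data is hypothetical):

```r
# Hypothetical toy data
toy_data <- data.frame(
  id = 1:4,
  body_mass_g = c(3750, NA, 3250, NA),
  sex = c("male", NA, "female", "female")
)

# Rows that contain at least one missing value
incomplete_rows <- toy_data[!complete.cases(toy_data), ]
incomplete_rows

# Row and column position of every NA
na_positions <- which(is.na(toy_data), arr.ind = TRUE)
na_positions
```

Here complete.cases() flags the rows without any NA, so negating it keeps the rows that need attention.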

Afterward, you can decide whether to remove them or to impute them.
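
Both options can be sketched in base R on a made-up vector. Mean imputation is used here only because it is simple to show. It is an assumption of this example, not a recommendation, and the right method depends on the data:

```r
# Hypothetical measurements with one missing value
body_mass_g <- c(3750, NA, 3250, 4000)

# Option 1: remove the missing values
removed <- body_mass_g[!is.na(body_mass_g)]

# Option 2: impute them, here with the mean of the observed values
observed_mean <- mean(body_mass_g, na.rm = TRUE)
imputed <- replace(body_mass_g, is.na(body_mass_g), observed_mean)

removed
imputed
```

Removing shortens the vector to three values, while imputing keeps all four positions and fills the gap.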

Learn more about missing values at R4DS.