[1] NA
A missing value is a data point that is missing: no one knows what the value was, so you can’t conduct any operation on it.
In R, missing values are written NA.
Most operations on missing values return NA.
Ask yourself: what is one plus “a number that I don’t know” (NA)? The answer is “I don’t know” (NA).
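As a small illustration of this reasoning, the sketch below runs a few operations on NA. The variable names are made up for the example. The last line shows one of the few cases where the result is known even though one input is missing.

```r
# A value we do not know
unknown <- NA

# Arithmetic and comparisons with an unknown value are themselves unknown
sum_result        <- 1 + unknown    # NA
product_result    <- unknown * 10   # NA
comparison_result <- unknown > 5    # NA

sum_result
product_result
comparison_result

# An exception: TRUE | NA is TRUE, because the answer is TRUE
# whatever the unknown value turns out to be
or_result <- TRUE | unknown
or_result
```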
To test whether an element is a missing value, we must use the function is.na().
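To see why is.na() is needed, it may help to compare it with the == operator. In this minimal sketch (the variable names are only illustrative), comparing with NA gives NA rather than TRUE, because asking "is this unknown value equal to another unknown value?" has no known answer.

```r
x <- NA

# Comparing with == does not work: the result is NA, not TRUE
equality_check <- x == NA
equality_check

# is.na() gives a definite answer
na_check_missing <- is.na(x)   # TRUE
na_check_number  <- is.na(3)   # FALSE

na_check_missing
na_check_number
```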
We can assign a missing value to a variable, and we can place missing values in a vector.
When we use is.na() on a vector, it returns a vector of booleans (a logical vector), with TRUE in the positions where the values are missing.
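For instance, with a made-up vector of ages (the values are hypothetical and chosen only for illustration):

```r
# Assign a missing value to a variable
missing_age <- NA

# Place missing values in a vector (hypothetical data)
ages <- c(23, NA, 31, NA, 45)

# is.na() checks every element and returns a logical vector
ages_missing <- is.na(ages)
ages_missing
# FALSE  TRUE FALSE  TRUE FALSE
```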
We can count the missing values in a vector by combining sum() with is.na().
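This works because sum() treats TRUE as 1 and FALSE as 0. A short sketch, again on a hypothetical vector of ages:

```r
ages <- c(23, NA, 31, NA, 45)   # hypothetical data

# Count the missing values: each TRUE from is.na() counts as 1
n_missing <- sum(is.na(ages))
n_missing
# 2

# Find where they are
missing_positions <- which(is.na(ages))
missing_positions
# 2 4
```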
Most summary functions, such as mean(), median() and sd(), give you the chance to remove missing values before computing, with the argument na.rm = TRUE.
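The difference is easy to see on a small example. The vector below is hypothetical. Without na.rm = TRUE the result is NA, while with it the missing values are dropped before the computation.

```r
ages <- c(23, NA, 31, NA, 45)   # hypothetical data

# By default the missing values propagate to the result
mean_with_na <- mean(ages)
mean_with_na
# NA

# With na.rm = TRUE only 23, 31 and 45 are used
mean_no_na   <- mean(ages, na.rm = TRUE)     # 33
median_no_na <- median(ages, na.rm = TRUE)   # 31
sd_no_na     <- sd(ages, na.rm = TRUE)       # about 11.14

mean_no_na
median_no_na
sd_no_na
```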
Also, ggplot2's functions remove NAs automatically for us, usually with a warning, if they would hinder computations.
Data often have NAs in them; the penguins dataset, for example, does.
To count the missing values in each column, use these lines of code:
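One way to write this is sketched here. It assumes that the penguins data come from the palmerpenguins package and that dplyr is installed. The name na_per_column is only illustrative.

```r
library(dplyr)
library(palmerpenguins)   # assumed source of the penguins dataset

# For every column, count how many values are NA
na_per_column <- penguins %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_per_column
```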
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…²   sex  year
    <int>  <int>          <int>         <int>          <int>   <int> <int> <int>
1       0      0              2             2              2       2    11     0
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
We can also count them stratified by one of the fixed variables, for example by island:
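A sketch of the grouped version, under the same assumptions (penguins from the palmerpenguins package, dplyr installed, illustrative object name):

```r
library(dplyr)
library(palmerpenguins)   # assumed source of the penguins dataset

# Group by island, then count the NAs of every other column within each island
na_per_island <- penguins %>%
  group_by(island) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_per_island
```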
# A tibble: 3 × 8
  island    species bill_length_mm bill_depth_mm flipper_l…¹ body_…²   sex  year
  <fct>       <int>          <int>         <int>       <int>   <int> <int> <int>
1 Biscoe          0              1             1           1       1     5     0
2 Dream           0              0             0           0       0     1     0
3 Torgersen       0              1             1           1       1     5     0
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
Often data have missing values.
The most important thing when you get new data is to figure out how many missing values they contain and where they are.
Afterward, you can decide whether to remove them or to impute them.
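As a rough sketch of the two options on a hypothetical vector: removing keeps only the observed values, while imputing replaces each NA with a plausible value. Replacing with the mean is used here only because it is the simplest to show. It is an assumption of this example, not a recommendation, and the right choice depends on your data.

```r
ages <- c(23, NA, 31, NA, 45)   # hypothetical data

# Option 1: remove the missing values
ages_complete <- ages[!is.na(ages)]
ages_complete
# 23 31 45

# Option 2: impute them, here with the mean of the observed values (33)
ages_imputed <- ifelse(is.na(ages), mean(ages, na.rm = TRUE), ages)
ages_imputed
# 23 33 31 33 45
```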
Learn more about missing values at R4DS.