When we mentioned the boxplot, we talked about robust summary statistics. First of all, what is a summary statistic?
A summary statistic describes a property of a distribution with a single number.
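For example, the mean describes the central tendency of a distribution and the standard deviation (SD) describes its dispersion:
\(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}\)
\(SD = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}\)
Neither of these statistics is robust to outliers.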
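To connect the two formulas to R, here is a minimal sketch that computes the mean and the SD by hand and compares them with the built-in `mean()` and `sd()`. The vector `x` is a hypothetical toy sample invented for illustration; it is not taken from the penguins data.

```r
# Hypothetical toy sample, invented only for illustration
x <- c(17.5, 18.1, 18.4, 18.9, 19.3)
n <- length(x)

# Mean: sum of the values divided by their number
mean_by_hand <- sum(x) / n

# SD: square root of the sum of squared deviations divided by n - 1
sd_by_hand <- sqrt(sum((x - mean_by_hand)^2) / (n - 1))

# The built-in functions should agree with the formulas
all.equal(mean_by_hand, mean(x))
all.equal(sd_by_hand, sd(x))
```

Note that `sd()` divides by n - 1, not by n, which matches the formula for the sample standard deviation.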
What if I’m distracted and forget to write the decimal point when I record the data?
With this snippet we multiply the maximum value of bill_depth_mm by 10.
With this chunk of code I’ve added an outlier of 215 mm to the variable bill_depth_mm. Let’s see what happens.
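The original chunk is not shown here, so the following is only a sketch of how such an outlier could be introduced. It assumes the `penguins` data come from the palmerpenguins package and that dplyr is available; the name `penguins_outlier` is a hypothetical choice for the modified copy.

```r
library(dplyr)
library(palmerpenguins)

# Copy of the data in which the largest bill depth is multiplied by 10,
# as if the decimal point had been forgotten (assumed reconstruction)
penguins_outlier <- penguins |>
  mutate(
    bill_depth_mm = replace(
      bill_depth_mm,
      which.max(bill_depth_mm),             # position of the maximum, NAs ignored
      max(bill_depth_mm, na.rm = TRUE) * 10 # the same value without its decimal point
    )
  )

# Compare the maxima before and after the change
max(penguins$bill_depth_mm, na.rm = TRUE)
max(penguins_outlier$bill_depth_mm, na.rm = TRUE)
```

Using `which.max()` changes exactly one observation, even if the maximum value happened to appear more than once.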
I’ve added the outlier to the Adelie species; let’s see what happens graphically.
In the next plots I will not represent the outlier graphically because it stretches the x axis too much.
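One possible way to keep the outlier in the data while leaving it out of the picture is to zoom the axis. The sketch below assumes ggplot2, dplyr and palmerpenguins are installed; the boxplot, the axis limits and the name `penguins_outlier` are illustrative assumptions, not necessarily what was used for the plots in this section.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Hypothetical copy of the data with the largest bill depth multiplied by 10
penguins_outlier <- penguins |>
  mutate(bill_depth_mm = replace(
    bill_depth_mm,
    which.max(bill_depth_mm),
    max(bill_depth_mm, na.rm = TRUE) * 10
  ))

# coord_cartesian() zooms the view without dropping observations,
# so the boxplot statistics are still computed on all the data
p <- penguins_outlier |>
  filter(!is.na(bill_depth_mm)) |>
  ggplot(aes(x = bill_depth_mm, y = species)) +
  geom_boxplot() +
  coord_cartesian(xlim = c(13, 22)) # assumed limits covering the typical values

p
```

Setting limits through `scale_x_continuous(limits = ...)` would instead remove the outlier before the statistics are computed, which is a different choice.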
Mean and SD are heavily influenced by the outlier.
The median represents the central tendency of a distribution: it is the value that splits the larger 50% (half) of the data from the smaller half.
The interquartile range (IQR) represents how dispersed a distribution is: it is the difference between the 3rd and the 1st quartile.
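A small sketch can make the contrast concrete. The two vectors below are hypothetical toy samples invented for illustration: `dirty` is the same as `clean`, except that the largest value was entered without its decimal point.

```r
# Hypothetical toy sample, invented only for illustration
clean <- c(17.5, 18.1, 18.4, 18.9, 19.3)
# Same sample, with the largest value entered without its decimal point
dirty <- c(17.5, 18.1, 18.4, 18.9, 193)

# One row per sample, one column per summary statistic
compare <- rbind(
  clean = c(mean = mean(clean), sd = sd(clean),
            median = median(clean), iqr = IQR(clean)),
  dirty = c(mean = mean(dirty), sd = sd(dirty),
            median = median(dirty), iqr = IQR(dirty))
)
round(compare, 2)
```

In this toy example the mean moves from 18.44 to 53.18 and the SD grows by roughly two orders of magnitude, while the median (18.4) and the IQR (0.8) stay exactly the same.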
Summary statistics without the outlier.
# A tibble: 3 × 8
  species       m    sd    med    q1    q3   iqr   mad
  <fct>     <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 Adelie     18.3 1.22    18.4  17.5  19    1.5   1.19
2 Chinstrap  18.4 1.14    18.4  17.5  19.4  1.90  1.41
3 Gentoo     15.0 0.981   15    14.2  15.7  1.5   1.19
Summary statistics with an outlier at 215 mm.
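The table for the data with the outlier is not reproduced here. As a sketch, both tables could be computed as follows, assuming dplyr and palmerpenguins are installed. The helper `summarise_bill_depth()` and the object `penguins_outlier` are hypothetical names; the column names mirror the table without the outlier.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical helper: one row per species, one column per statistic
summarise_bill_depth <- function(data) {
  data |>
    group_by(species) |>
    summarise(
      m   = mean(bill_depth_mm, na.rm = TRUE),
      sd  = sd(bill_depth_mm, na.rm = TRUE),
      med = median(bill_depth_mm, na.rm = TRUE),
      q1  = quantile(bill_depth_mm, 0.25, na.rm = TRUE),
      q3  = quantile(bill_depth_mm, 0.75, na.rm = TRUE),
      iqr = IQR(bill_depth_mm, na.rm = TRUE),
      mad = mad(bill_depth_mm, na.rm = TRUE)
    )
}

# Hypothetical copy of the data with the largest bill depth multiplied by 10
penguins_outlier <- penguins |>
  mutate(bill_depth_mm = replace(
    bill_depth_mm,
    which.max(bill_depth_mm),
    max(bill_depth_mm, na.rm = TRUE) * 10
  ))

summarise_bill_depth(penguins)         # without the outlier
summarise_bill_depth(penguins_outlier) # with the outlier
```

Comparing the two outputs row by row shows which columns react to a single extreme value and which do not.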
In R, these summary statistics are computed with the following functions:
mean()
sd()
median() or quantile()['50%']
quantile()['25%']
quantile()['75%']
IQR()
Use na.rm = TRUE with all of these functions to ignore missing values.
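As an illustration of the effect of `na.rm = TRUE`, here is a minimal sketch on a hypothetical toy vector containing one missing value (the numbers are invented for the example).

```r
# Hypothetical toy sample with one missing value
x <- c(17.5, 18.1, NA, 18.9, 19.3)

# Without na.rm the missing value propagates and the result is NA
mean(x)

# With na.rm = TRUE the statistics are computed on the observed values only
mean(x, na.rm = TRUE)
sd(x, na.rm = TRUE)
median(x, na.rm = TRUE)
quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE)
IQR(x, na.rm = TRUE)
```

Unlike `mean()`, which returns `NA`, `quantile()` and `IQR()` stop with an error when the data contain missing values and `na.rm` is left at its default of `FALSE`.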
When we have a lot of data, we often have to reduce them to simple indicators with summary statistics.
No statistic is perfect. Some, such as the mean and the standard deviation, are more precise, but they “break” easily with “non-ideal” data.
Others, such as the median and the IQR, are simpler and less sophisticated, but are better at capturing the essence of the data in “non-ideal” situations, i.e. they are robust.
The word robust is often used to describe statistical methods. It means that the method is unlikely to break when the data are not perfect and the conditions are worse than ideal.
Read: Introduction to data science | Chapter 12 - Robust Summaries
The storms dataset is loaded in R with the tidyverse. It stores attributes of hurricanes in the US over the years 1975-2020.
wind speed?
pressure in the storm center?
Investigate the wind and pressure variables with histograms and boxplots. How are the data distributed? Are there any outliers?
What if you stratify those two variables by status? Do you see any interesting patterns?
Stratify the data by name, then extract the 5 hurricanes with the highest mean wind speed and the 5 hurricanes with the highest median wind speed. Are they the same?