
When we mentioned the boxplot, we talked about robust summary statistics. First of all, what is a summary statistic?
A summary statistic describes a property of a distribution in a single number.
For example, the mean and the standard deviation (SD), where \(\bar{x}\) denotes the mean:

\(mean(x) = \frac{1}{n} \sum_{i=1}^{n} x_{i}\)

\(SD = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}\)

Neither of these statistics is robust to outliers.
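As a quick check of the two formulas, here is a minimal sketch in base R that computes the mean and the SD by hand and compares them with the built-in `mean()` and `sd()`. The vector `x` holds made-up values used only for illustration.

```r
# Hypothetical sample of bill depths in mm (made-up values for illustration)
x <- c(18.7, 17.4, 18.0, 19.3, 20.6, 17.8)
n <- length(x)

# Mean: sum of the values divided by how many there are
mean_manual <- sum(x) / n

# SD: square root of the sum of squared deviations from the mean, over n - 1
sd_manual <- sqrt(sum((x - mean_manual)^2) / (n - 1))

# Both comparisons should return TRUE
all.equal(mean_manual, mean(x))
all.equal(sd_manual, sd(x))
```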

What if I’m distracted and forget to write the decimal comma when I collect the data?
With this snippet we multiply the maximum value of bill_depth_mm by 10.
With this chunk of code I’ve added an outlier of 215 mm to the variable bill_depth_mm. Let’s see what happens.
I’ve added the outlier to the Adelie species; let’s see what happens graphically.
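The original snippet is not reproduced here, so the following is only a sketch of how such an outlier could be introduced in base R. The data frame `penguins_small` is a hypothetical stand-in for the penguins data: apart from the 21.5 mm maximum implied by the 215 mm outlier, its values are made up.

```r
# Hypothetical stand-in for the penguins data (illustrative values only)
penguins_small <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_depth_mm = c(18.7, 21.5, 17.4, 15.0, NA)
)

# Work on a copy so the original data stay untouched
penguins_outlier <- penguins_small

# Position of the largest bill depth; which.max() discards missing values
i_max <- which.max(penguins_outlier$bill_depth_mm)

# Simulate the forgotten decimal separator: 21.5 becomes 215
penguins_outlier$bill_depth_mm[i_max] <- penguins_outlier$bill_depth_mm[i_max] * 10

# Inspect the modified row
penguins_outlier[i_max, ]
```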

In the next plots I will not represent the outlier graphically because it stretches the x axis too much.
Mean and SD are heavily influenced by the outlier.


The median represents the central tendency of a distribution: it is the value that splits the larger half (50%) of the data from the smaller half.
The IQR (interquartile range) represents how dispersed a distribution is: it is the range between the 1st and the 3rd quartile.
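To see the difference in robustness on a small scale, the sketch below compares the four statistics on a made-up vector with and without a 215 mm outlier. The values and the helper `summarise_x()` are assumptions introduced only for this illustration.

```r
# Hypothetical bill depths in mm (made-up values)
x_clean <- c(17.4, 18.0, 18.7, 19.3, 21.5)

# Same data, with the largest value replaced by the 215 mm outlier
x_outlier <- replace(x_clean, which.max(x_clean), 215)

# Illustrative helper: the four summary statistics discussed above
summarise_x <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
}

# One row per data set, rounded for readability
round(rbind(clean = summarise_x(x_clean), outlier = summarise_x(x_outlier)), 2)
```

In this toy example the median and the IQR are identical in both rows, while the mean and the SD shift markedly.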




Summary statistics without the outlier.
# A tibble: 3 × 8
  species       m    sd   med    q1    q3   iqr   mad
  <fct>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Adelie     18.3 1.22   18.4  17.5  19    1.5   1.19
2 Chinstrap  18.4 1.14   18.4  17.5  19.4  1.90  1.41
3 Gentoo     15.0 0.981  15    14.2  15.7  1.5   1.19

Summary statistics with an outlier at 215 mm.
mean: mean()
standard deviation: sd()
median: median() or quantile()['50%']
1st quartile: quantile()['25%']
3rd quartile: quantile()['75%']
IQR: IQR()

Use na.rm = TRUE with all those functions to ignore missing values.
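A minimal sketch of these calls on a vector with one missing value (the values are made up for illustration):

```r
# Hypothetical measurements with one missing value
x <- c(18.7, 17.4, NA, 19.3, 20.6)

# Without na.rm = TRUE a single NA makes the result NA
mean(x)

# With na.rm = TRUE the missing value is dropped before computing
mean(x, na.rm = TRUE)
sd(x, na.rm = TRUE)
median(x, na.rm = TRUE)
quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
IQR(x, na.rm = TRUE)
```

Note that `quantile()` stops with an error, rather than returning `NA`, when the data contain missing values and `na.rm` is left at `FALSE`.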
When we have a lot of data, we often have to reduce them to simple indicators with summary statistics.
No statistic is perfect. Some, such as the mean and the standard deviation, are more precise, but they “break” easily with “non-ideal” data.
Others, such as the median and the IQR, are simpler and less sophisticated, but they are better at capturing the essence of the data in “non-ideal” situations, i.e. they are robust.
The word robust is often used to describe statistical methods. It means that the method is unlikely to break when the data are not perfect and the conditions are worse than ideal.
Read: Introduction to data science | Chapter 12 - Robust Summaries
The storms dataset is loaded in R with the tidyverse. It stores attributes of hurricanes in the US over the years 1975-2020.
wind speed?
pressure in the storm center?
Investigate the wind and pressure variables with histograms and boxplots. How are the data distributed? Are there any outliers?
What if you stratify those two variables by status? Do you see any interesting patterns?
Stratify the data by name, then extract the 5 hurricanes with the highest mean wind speed and the 5 hurricanes with the highest median wind speed. Are they the same?