
Empirical quantiles

One way of obtaining resistant statistics is to use the empirical quantiles (percentiles/fractiles). The quantile (this term was first used by Kendall, 1940) of a distribution is the number $ x_p$ such that a proportion $ p$ of the values are less than or equal to $ x_p$. For example, the 0.25 quantile $ x_{0.25}$ (also referred to as the 25th percentile or lower quartile) is the value at or below which 25% of all the values fall.

Empirical quantiles are most easily constructed by sorting (ranking) the data into ascending order to obtain a sequence of order statistics $ x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$ as shown in Figure 2.1b. The $ p$'th quantile $ x_p$ is then obtained by taking the rank $ r=(n+1)p$'th order statistic $ x_{((n+1)p)}$ (or an average of the two neighbouring values if $ (n+1)p$ is not an integer):

$\displaystyle x_p = \left\{ \begin{array}{ll} x_{((n+1)p)} & \mbox{if $(n+1)p$ is an integer} \\ 0.5\,(x_{([(n+1)p])}+x_{([(n+1)p]+1)}) & \mbox{otherwise} \end{array}\right.$ (2.5)

where $ p$ is the probability $ \Pr(X\leq x_p)=r/(n+1)$ and $ [a]$ is the greatest integer not exceeding $ a$. Note that the empirical probability $ p=r/(n+1)$ is only defined at the discrete values $ r=1,2,\ldots,n$; quantiles for other values of $ p$ can be obtained either by interpolation ( $ 1\leq p(n+1)\leq n$) or by extrapolation ($ p<1/(n+1)$ or $ p>n/(n+1)$). The use of $ (n+1)$ rather than $ n$ in the denominator of $ p=r/(n+1)$ prevents issuing probabilities that are either zero or one (i.e. perfect certainty) based on only a finite sample of data. As an example, the quartiles of the height example are given by $ x_{0.25}=x_{(3)}=171$ (lower quartile), $ x_{0.5}=x_{(6)}=175$ (median), and $ x_{0.75}=x_{(9)}=180$ (upper quartile).
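The rule in Equation 2.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `empirical_quantile` and the sample `heights` are hypothetical (the values are not those of Table 2.1), and values of $ p$ that would require extrapolation are simply rejected.

```python
import math

def empirical_quantile(data, p):
    """Empirical quantile following Eq. 2.5, with rank r = (n + 1) * p."""
    x = sorted(data)                 # order statistics x_(1) <= ... <= x_(n)
    n = len(x)
    r = (n + 1) * p                  # rank, possibly fractional
    k = round(r)
    if math.isclose(r, k, abs_tol=1e-9):
        r = k                        # treat tiny floating-point error as an integer rank
    if r < 1 or r > n:
        # p < 1/(n+1) or p > n/(n+1): extrapolation is not attempted in this sketch
        raise ValueError("p lies outside the range covered by the sample")
    if r == k:
        return x[k - 1]              # integer rank: take x_(r) (Python lists are 0-based)
    k = math.floor(r)                # [a] = greatest integer not exceeding a
    return 0.5 * (x[k - 1] + x[k])   # average of x_([r]) and x_([r]+1)

# Hypothetical sample of n = 11 heights in cm (NOT the data of Table 2.1)
heights = [168, 171, 165, 180, 175, 172, 183, 177, 170, 174, 186]

for p in (0.25, 0.5, 0.75):
    print(p, empirical_quantile(heights, p))
```

With $ n=11$ the quartile ranks are $ (n+1)p = 3, 6, 9$, so each quartile is a single order statistic. Note that Equation 2.5 averages the two neighbouring order statistics for any non-integer rank, so library routines that interpolate linearly or use other conventions may return slightly different values.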

Figure: Diagram showing how the empirical distribution is obtained for the heights given in Table 2.1. All heights are relative to a reference height of 150cm in order to make the differences more apparent.

Unlike the arithmetic mean, the median $ x_{0.5}$ is not at all influenced by the exact value of the largest objects and so provides a resistant measure of the central location. Likewise, a resistant measure of the scale can be obtained using the Inter-Quartile Range (IQR) given by the difference between the upper and lower quartiles $ x_{0.75}-x_{0.25}$. In the asymptotic limit of large sample size ( $ n\rightarrow\infty$), for normally (Gaussian) distributed variables (see Chapter 4), the sample median tends to the sample mean and the sample IQR tends to approximately 1.35 times the sample standard deviation (more precisely $ 2\times 0.6745\approx 1.349$). Resistant measures of skewness and kurtosis also exist such as the dimensionless Yule-Kendall skewness statistic

$\displaystyle \gamma_{YK}$ $\displaystyle =$ $\displaystyle \frac{x_{0.25}-2x_{0.5}+x_{0.75}}{x_{0.75}-x_{0.25}}$ (2.6)

and Moors kurtosis statistic

$\displaystyle \tau_M = \frac{(x_{0.875} - x_{0.625}) + (x_{0.375} - x_{0.125})}{x_{0.75}-x_{0.25}}$
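As a rough illustration of the resistant measures above, the following Python sketch computes the IQR, the Yule-Kendall skewness and the Moors kurtosis from empirical quantiles, and shows that they do not change when the largest value is replaced by an extreme one. The function names and the 15-value sample are hypothetical, chosen so that all the required ranks $ (n+1)p$ are integers; the quantile rule is the $ (n+1)p$ rank rule of this section.

```python
import math
from statistics import NormalDist, mean

def quantile(data, p):
    """Empirical quantile using rank r = (n + 1) * p (see Eq. 2.5)."""
    x = sorted(data)
    n = len(x)
    r = (n + 1) * p
    k = round(r)
    if math.isclose(r, k, abs_tol=1e-9):
        r = k
    if r < 1 or r > n:
        raise ValueError("p lies outside the range covered by the sample")
    if r == k:
        return x[k - 1]
    k = math.floor(r)
    return 0.5 * (x[k - 1] + x[k])

def iqr(data):
    """Inter-quartile range x_0.75 - x_0.25."""
    return quantile(data, 0.75) - quantile(data, 0.25)

def yule_kendall(data):
    """Yule-Kendall skewness (Eq. 2.6)."""
    q1, q2, q3 = (quantile(data, p) for p in (0.25, 0.5, 0.75))
    return (q1 - 2 * q2 + q3) / (q3 - q1)

def moors(data):
    """Moors kurtosis, built from the octiles."""
    e1, e3, e5, e7 = (quantile(data, p) for p in (0.125, 0.375, 0.625, 0.875))
    return ((e7 - e5) + (e3 - e1)) / iqr(data)

# Hypothetical positively skewed sample, n = 15, so that (n+1)p is an integer
sample = [2, 3, 3, 4, 5, 5, 6, 7, 8, 10, 12, 15, 19, 24, 40]
contaminated = sample[:-1] + [400]   # replace the largest value by an outlier

for name, data in (("original", sample), ("contaminated", contaminated)):
    print(name, mean(data), quantile(data, 0.5),
          iqr(data), yule_kendall(data), moors(data))

# Gaussian reference value: IQR divided by standard deviation
z = NormalDist()                     # standard normal distribution
print(z.inv_cdf(0.75) - z.inv_cdf(0.25))
```

Only the arithmetic mean differs between the two printed lines; the median, IQR, skewness and kurtosis statistics depend on order statistics up to $ x_{(14)}$ at most, so the outlier in $ x_{(15)}$ leaves them unchanged. The last line evaluates the Gaussian IQR-to-standard-deviation ratio mentioned earlier from the theoretical quantiles of the standard normal distribution.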

There also exist other resistant measures based on all the quantiles, such as L-moments, but these are beyond the scope of this course; refer to Wilks (1995) and von Storch and Zwiers (1999) for more discussion.


David Stephenson 2005-09-30