
Multiple regression

It is often the case that a response variable depends on more than one explanatory variable. For example, human weight could reasonably be expected to depend on both the height and the age of the person. Furthermore, the explanatory variables themselves often co-vary with one another (e.g. sea surface temperatures and sea-level pressures). Rather than subtracting out the effect of each factor separately by performing successive linear regressions, it is better in such cases to perform a single multiple regression defined by an extended linear model. For example, a multiple regression model having two explanatory factors is given by


$y_i = \beta_0+\beta_1 x_{i1}+\beta_2 x_{i2} + \epsilon_i$   (8.1)

This model can be fit to the data using least squares in order to estimate the three $ \beta$ parameters. It can be viewed geometrically as fitting a $ q=2$ dimensional hyperplane to a cloud of points in $ (x_1,x_2,y)$ space.
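As an illustrative sketch (not part of the original text), the following Python snippet fits such a two-factor model by least squares; the data are synthetic and the "true" parameter values are hypothetical.

import numpy as np

rng = np.random.default_rng(42)

# Synthetic example: n observations of two explanatory variables.
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept beta_0.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates of the three beta parameters.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -1.5]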

The multiple regression equation can be rewritten more concisely in matrix notation as


${\bf Y} = {\bf X}\boldsymbol{\beta}+{\bf E}$   (8.2)

where $ {\bf Y}$ is a $ (n\times 1)$ data matrix (vector) containing the response variable, $ {\bf X}$ is a $ (n\times q)$ data matrix containing the $ q$ factors, $ \boldsymbol{\beta}$ is a $ (q\times 1)$ matrix (vector) containing the factor coefficients (model parameters), and $ {\bf E}$ is a $ (n\times 1)$ data matrix (vector) containing the noise terms. The intercept $ \beta_0$ can be included by augmenting $ {\bf X}$ with a leading column of ones, in which case $ {\bf X}$ is $ (n\times (q+1))$ and $ \boldsymbol{\beta}$ is $ ((q+1)\times 1)$.

The least squares solution is then given by the set of normal equations

$({\bf X}'{\bf X})\boldsymbol{\beta} = {\bf X}'{\bf Y}$   (8.3)

where $ '$ denotes the transpose of the matrix. When $ {\bf X}'{\bf X}$ is non-singular, these linear equations can easily be solved for the $ \beta$ parameters. The estimated $ \beta$ parameters, together with their standard errors, can then be used to assess which variables contribute significantly to the response.
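Continuing the hypothetical sketch above, the normal equations can be solved directly with standard linear algebra routines; when $ {\bf X}'{\bf X}$ is non-singular this reproduces the least-squares estimates.

# Solve the normal equations (X'X) beta = X'Y directly.
XtX = X.T @ X
Xty = X.T @ y
beta_normal = np.linalg.solve(XtX, Xty)

# Same estimates as np.linalg.lstsq when X'X is non-singular.
print(np.allclose(beta_normal, beta_hat))  # True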

As with many multivariate methods, a good understanding can be obtained by considering the bivariate case with two factors ($ q=2$). To make matters even simpler, consider the unit-scaled case in which $ x_1$, $ x_2$, and $ y$ have all been standardized (mean removed and divided by the standard deviation) before performing the regression, so that the intercept vanishes. By solving the two normal equations, the best estimates for the beta parameters can easily be shown to be given by

$\beta_1 = \frac{r_{1y}-r_{12}r_{2y}}{1-r_{12}^2}$   (8.4)
$\beta_2 = \frac{r_{2y}-r_{12}r_{1y}}{1-r_{12}^2}$   (8.5)

where $ r_{12}=\mathrm{cor}(x_1,x_2)$ is the mutual correlation between the two $ x$ variables, $ r_{1y}=\mathrm{cor}(x_1,y)$ is the correlation between $ x_1$ and $ y$, and $ r_{2y}=\mathrm{cor}(x_2,y)$ is the correlation between $ x_2$ and $ y$. By rewriting the correlations in terms of the beta parameters
$r_{1y} = \beta_1+\beta_2 r_{12}$   (8.6)
$r_{2y} = \beta_2+\beta_1 r_{12}$   (8.7)

it can be seen that each correlation with the response is the sum of two parts: a direct effect (e.g. $ \beta_1$) and an indirect effect (e.g. $ \beta_2 r_{12}$) mediated by the mutual correlation between the explanatory variables. Unlike descriptive correlation analysis, multiple regression is model-based and so allows one to determine the relative contributions of these two parts. Progress can then be made in discriminating important direct factors from factors that are only indirectly correlated with the response.
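A quick numerical check of equations (8.4)-(8.7), again continuing the hypothetical synthetic-data sketch above:

# Standardize the variables (remove mean, divide by standard deviation).
z1 = (x1 - x1.mean()) / x1.std()
z2 = (x2 - x2.mean()) / x2.std()
zy = (y - y.mean()) / y.std()

r12 = np.corrcoef(z1, z2)[0, 1]
r1y = np.corrcoef(z1, zy)[0, 1]
r2y = np.corrcoef(z2, zy)[0, 1]

# Equations (8.4) and (8.5): betas from the three correlations.
b1 = (r1y - r12 * r2y) / (1 - r12**2)
b2 = (r2y - r12 * r1y) / (1 - r12**2)

# Compare with a direct least-squares fit on the standardized data
# (no intercept column is needed after standardizing).
Z = np.column_stack([z1, z2])
b_fit, *_ = np.linalg.lstsq(Z, zy, rcond=None)
print(np.allclose([b1, b2], b_fit))  # True

# Equations (8.6) and (8.7): direct plus indirect parts recover the correlations.
print(np.isclose(r1y, b1 + b2 * r12), np.isclose(r2y, b2 + b1 * r12))  # True True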

The MINITAB output below shows the results of a multiple regression of weight on height and age for the sample of meteorologists at Reading:

The regression equation is
Weight = - 40.4 + 0.517 Age + 0.577 Height

Predictor       Coef    StDev       T       P
Constant      -40.36    49.20   -0.82   0.436
Age           0.5167   0.5552    0.93   0.379
Height        0.5769   0.2671    2.16   0.063

S = 6.655       R-Sq = 41.7%     R-Sq(adj) = 27.1%

Analysis of Variance

Source          DF      SS      MS      F       P
Regression       2      253.66  126.83  2.86    0.115
Residual Error   8      354.34   44.29  
Total           10      608.00

It can be seen from the p-values and the coefficient of determination that including age does not improve the fit compared with the earlier regression that used height alone to explain weight. Based on this small sample, age is not a significant factor in determining body weight at the 10% level (p-value $ 0.379>0.10$), whereas height is a significant factor (p-value $ 0.063<0.10$).
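The quantities in such a table are related by simple arithmetic, which can be verified directly from the printed values (a Python sketch; only the numbers shown above are used):

# t-statistic for Height: coefficient divided by its standard error.
print(0.5769 / 0.2671)             # ~2.16

# ANOVA quantities: mean squares, F-ratio, R-squared and S.
ss_reg, ss_res, df_reg, df_res = 253.66, 354.34, 2, 8
ms_reg, ms_res = ss_reg / df_reg, ss_res / df_res
print(ms_reg / ms_res)             # ~2.86, the overall F-statistic
print(ss_reg / (ss_reg + ss_res))  # ~0.417, i.e. R-Sq = 41.7%
print(ms_res ** 0.5)               # ~6.655, the residual standard error S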

