15.2.7 Algorithm (Multiple Linear Regression)
The Multiple Linear Regression Model
Multiple Linear Regression Model
Multiple linear regression is an extension of the simple linear regression where multiple independent variables exist. It is used to analyze the effect of more than one independent variable \(x_1, x_2, \dots, x_k\) on the dependent variable y. For a given dataset \((y, x_1, x_2, \dots, x_k)\), the multiple linear regression fits the dataset to the model:
| \[y_i=\beta _0+\beta _1x_{1_i}+\beta _2x_{2_i}+\ldots +\beta _kx_{k_i}+\varepsilon_i\] |
(1) |
|---|
where \(\beta _0\,\!\) is the y-intercept and the parameters \(\beta _1\,\!\), \(\beta _2\,\!\), ... , \(\beta _k\,\!\) are called the partial coefficients. It can be written in matrix form:
| \[Y=XB+E\,\!\] |
(2) |
|---|
where
|
\(Y=\begin{bmatrix} y_1\\ y_2\\ \vdots \\ y_n \end{bmatrix} \) , \(X=\begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k}\\ 1 & x_{21} & x_{22} & \cdots & x_{2k}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix} \) \(B=\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} \), \(E=\begin{bmatrix} \varepsilon _1\\ \varepsilon _2\\ \vdots \\ \varepsilon _n \end{bmatrix} \) |
Assuming that \(\varepsilon_i\) are independent and identically distributed as normal random variables with \(\bar{E}=0\) and \(Var[E]=\sigma^2\). In Order to minimize the \(\left \|E\right \|\) with respect to \(B\), we solve the function:
| \[\frac{\partial E'E}{\partial B}=0\] |
(3) |
|---|
The results \(\hat B\) is the least square estimate of the vector B, and it is the solution to the linear equations, which can be expressed as:
| \[\hat B=\begin{bmatrix} \hat \beta_0 \\ \hat \beta_1 \\ \vdots \\ \hat \beta_k \end{bmatrix}=(X'X)^{-1}X^{\prime }Y \] |
(4) |
|---|
where X' is the transpose of X. The predicted value of Y for a given X is:
| \[\hat{Y}=X\hat{B}\] |
(5) |
|---|
By substituting \(\hat{B}\) into (4), we can and defined matrix \(P\).
| \[\hat{Y}=[X(X'X)^{-1}X']Y=PY\] |
(6) |
|---|
The residuals is defined as:
| \[res_i=Y-\hat{Y}\] |
(7) |
|---|
and the residual sum of squares can be written by:
| \[RSS=\left \|E \right \|^2={Y}'Y-\hat{B}'X'X\hat{B}\] |
(8) |
|---|
Fit Control
Errors as Weight
We can give weight to each \(y_i\) in fitting process, the yEr± error column \(\sigma_i\) is treated as weight \(w_i\) for each \(y_i\), when yEr± is abscent, \(w_i\) should be 1 for all \(i\).
The solution \(\hat{B}\) for fitting with weight can be written as:
| \[\hat{B}=(X'WX)^{-1}X'WY\] |
(9) |
|---|
where
\[W=\begin{bmatrix} w_1& 0 & \dots &0 \\ 0 & w_2 & \dots &0 \\ \vdots& \vdots &\ \ddots &\vdots \\ 0& 0 &\dots & w_n \end{bmatrix}\]
No Weighting
The error bar will not be treated as weight in calculation.
Direct Weighting
| \[w_i=\sigma_i \] |
(10) |
|---|
Instrumental
| \[w_i=\frac 1{\sigma_i^2}\] |
(11) |
|---|
Fix Intercept (at)
Fix intercept will set the y-intercept \(\beta_0\) to a fixed value, meanwhile, the total degree of freedom will be n*=n-1 due to the intercept fixed.
Scale Error with sqrt(Reduced Chi-Sqr)
Scale Error with sqrt(Reduced Chi-Sqr) is available when fitting with weight. This option only affects the error on the parameters reported from the fitting process, and does not affect the fitting process or the data in any way. By default, it is checked, and \(\sigma^2\), which is the variance of \(E\) is taken into account for calculating error on the parameters, otherwise, variance of \(E\) will not be taken into account for error calculation. Take Covariance Matrix as example:
Scale Error with sqrt(Reduced Chi-Sqr)
| \[Cov(\beta _i,\beta _j)=\sigma^2 (X^{\prime }X)^{-1}\] |
(12) |
|---|---|
| \[\sigma^2=\frac{RSS}{n^{*}-k}\] |
Do not Scale Error with sqrt(Reduced Chi-Sqr)
| \[Cov(\beta _i,\beta _j)=(X'X)^{-1}\,\!\] |
(13) |
|---|
For weighted fitting, \((X'WX)^{-1}\,\!\) is used instead of \((X'X)^{-1}\,\!\).
Fitting Results
Fit Parameters
The Fitted Values
Formula (4)
The Parameter Standard Errors
For each parameter, the standard error can be obtained by:
| \[s_{\hat \beta _j}=s_\varepsilon \sqrt{C_{jj}}\] |
(14) |
|---|
where \(C_{jj}\) is the jth diagonal element of \((X'X)^{-1}\) (note that \((X'WX)^{-1}\) is used for weight fitting). The residual standard deviation \(s_\varepsilon\) (also called "std dev", "standard error of estimate", or "root MSE") is computed as:
| \[s_\varepsilon =\sqrt{\frac{RSS}{df_{Error}}}=\sqrt{\frac{RSS}{n^{*}-k}}\] |
(15) |
|---|
\(s_\varepsilon^2\) is an estimate of \(\sigma ^2\), which is variance of \(E\)
| Note: Please read the ANOVA Table for more details about the degree of freedom (df), dfError. |
t-Value and Confidence Level
If the regression assumptions hold, we can perform the t-tests for the regression coefficients with the null hypotheses and the alternative hypotheses:
\[H_0:\beta _j=0\,\!\]
\[H_\alpha :\beta _j\neq 0\]
The t-values can be computed as:
| \[t=\frac{\hat \beta _j-0}{s_{\hat \beta _j}}\] |
(16) |
|---|
With the t-value, we can decide whether or not to reject the corresponding null hypothesis. Usually, for a given Confidence Level for Parameters: \(\alpha\,\!\) , we can reject \(H_0\,\!\) when \(|t|>t_{\frac \alpha 2}\). Additionally, the p-value is less than \(\alpha\,\!\).
Prob>|t|
The probability that \(H_0\) in the t test is true.
| \[prob=2(1-tcdf(|t|,df_{Error}))\,\!\] |
(17) |
|---|
where \(tcdf(|t|,df_{Error})\) computes the cumulative distribution function of the Student's t distribution at the values |t|, with degree of freedom of error \((df_{Error})\).
LCL and UCL
From the t-value, we can calculate the \((1-\alpha )\times 100\%\) Confidence Interval for each parameter by:
| \[\hat \beta _j-t_{(\frac \alpha 2,n^{*}-k)}\varepsilon _{\hat \beta _j}\leq \hat \beta _j\leq \hat \beta _j+t_{(\frac \alpha 2,n^{*}-k)}\varepsilon _{\hat \beta _j}\] |
(18) |
|---|
where \(UCL\) and \(LCL\) is short for the Upper Confidence Interval and Lower Confidence Interval, respectively.
CI Half Width
The Confidence Interval Half Width is:
| \[CI=\frac{UCL-LCL}2\] |
(19) |
|---|
VIF
The Variance Inflation Factor is:
| \[VIF_i=\frac{1}{1-R^2_i}\] |
(20) |
|---|
Where: \(R^2_i\) is unadjusted coefficient of determination for regression the ith independent variable on the remaining ones.
Fit Statistics
Some fit statistics formulas are summary here:
Degree of Freedom
The degree of freedom for (Error) variation. Please refer to the ANOVA table for more details.
Reduced Chi-Sqr
| \[\sigma^2=\frac{RSS}{n^{*}-k}\] |
(21) |
|---|
Residual Sum of Squares
The residual sum of squares, see formula (8).
R-Square (COD)
The goodness of fit can be evaluated by Coefficient of Determination (COD), \(R^2\), which is given by:
| \[R^2=\frac{Explained\, variation}{Total \, variation}=1-\frac{RSS}{TSS}\] |
(22) |
|---|
Adj. R-Square
The adjusted \(R^2\) is used to adjust the \(R^2\) value for the degree of freedom. It can be computed as:
| \[\bar R^2=1-\frac{RSS/df_{Error}}{TSS/df_{Total}}\] |
(23) |
|---|
R Value
Then we can compute the R-value, which is simply the square root of \(R^2\):
| \[R=\sqrt{R^2}\] |
(24) |
|---|
Root-MSE (SD)
Root Mean Square of the Error, or residual standard deviation, which equals to:
| \[RootMSE=\sqrt{\frac{RSS}{n^*-k}}\] |
(25) |
|---|
Norm of Residuals
Equals to square root of RSS:
| \[Norm \,of \,Residuals=\sqrt{RSS}\] |
(26) |
|---|
ANOVA Table
The ANOVA table of linear fitting is:
| df | Sum of Squares | Mean Square | F Value | Prob > F | |
|---|---|---|---|---|---|
| Model | k | \[SS_{reg} = TSS-RSS\] | \[MS_{reg} = SS_{reg} / k \] | \[MS_{reg} / MSE \] | p-value |
| Error | n* - k | \[RSS\] | \[MSE = RSS / (n^* - k)\] | ||
| Total | n* | \[TSS\] |
| Note: If intercept is included in the model, n*=n-1. Otherwise, n*=n and the total sum of squares is uncorrected. |
Where the total sum of square, TSS, is:
| \(TSS =\sum_{i=1}^nw_i(y_i -\frac{\sum_{i=1}^n w_i y_i} {\sum_{i=1}^n w_i})^2\) (corrected) | (27) |
|---|---|
| \(TSS=\sum_{i=1}^n w_iy_i^2\) (uncorrected) |
The F value here is a test of whether the fitting model differs significantly from the model y=constant.
Additionally, the p-value, or significance level, is reported with an F-test. We can reject the null hypothesis if the p-value is less than \(\alpha\,\!\), which means that fitting model differs significantly from the model y=constant.
If fixing the intercept at a certain value, the p value for F-test is not meaningful, and it is different from that in multiple linear regression without the intercept constraint.
Lack of fit table
To run the lack of fit test, you need to have repeated observations, namely, "replicate data" , so that at least one of the X values is repeated within the dataset, or within multiple datasets when concatenate fit mode is selected.
Notations used for fit with replicates data:
| \(y_{ij}\) is the jth measurement made at the ith x-value in the data set |
|---|
| \(\bar{y}_{i}\) is the average of all of the y values at the ith x-value |
| \(\hat{y}_{ij}\) is the predicted response for the jth measurement made at the ith x-value |
The sum of square in table below is expressed by:
| \[RSS=\sum_{i}\sum_{j}(y_{ij}-\hat{y}_{ij})^2\] |
|---|
| \[LFSS=\sum_{i}\sum_{j}(\bar{y}_{i}-\hat{y}_{ij})^2\] |
| \[PESS=\sum_{i}\sum_{j}(y_{ij}-\bar{y}_{i})^2\] |
The Lack of fit table of linear fitting is:
| DF | Sum of Squares | Mean Square | F Value | Prob > F | |
|---|---|---|---|---|---|
| Lack of Fit | c-k-1 | LFSS | MSLF = LFSS / (c - k - 1) | MSLF / MSPE | p-value |
| Pure Error | n - c | PESS | MSPE = PESS / (n - c) | ||
| Error | n*-k | RSS |
| Note: If intercept is included in the model, n*=n-1. Otherwise, n*=n and the total sum of squares is uncorrected. If the slope is fixed, \[df_{Model}\] = 0. c denotes the number of distinct x values. If intercept is fixed, DF for Lack of Fit is c-k. |
Covariance and Correlation Matrix
The covariance matrix for the multiple linear regression can be calculated as
| \[Cov(\beta _i,\beta _j)=\sigma ^2(X^{\prime }X)^{-1}\] |
(28) |
|---|
The correlation between any two parameters is:
| \[\rho (\beta _i,\beta _j)=\frac{Cov(\beta _i,\beta _j)}{\sqrt{Cov(\beta _i,\beta _i)}\sqrt{Cov(\beta _j,\beta _j)}}\] |
(29) |
|---|
Residual Analysis
\(r_i\) stands for the Regular Residual \(res_i\).
Standardized
| \[r_i^{\prime }=\frac{r_i}s_\varepsilon\] |
(30) |
|---|
Studentized
Also known as internally studentized residual.
| \[r_i^{\prime }=\frac{r_i}{s_\varepsilon\sqrt{1-h_i}}\] |
(31) |
|---|
Studentized deleted
Also known as externally studentized residual.
| \[r_i^{\prime }=\frac{r_i}{s_{\varepsilon-i}\sqrt{1-h_i}}\] |
(32) |
|---|
In the equations for the Studentized and Studentized deleted residuals, \(h_i\) is the ith diagonal element of the matrix \(P\):
| \[P=X(X'X)^{-1}X^{\prime }\] |
(33) |
|---|
\(s_{\varepsilon-i}\) means the variance is calculated based on all points but exclude the ith.
Plots
Partial Leverage Plots
In multiple regression, partial leverage plots can be used to study the relationship between the independent variable and a given dependent variable. In the plot, the partial residual of Y is plotted against the partial residual of X, or the intercept. The partial residual of a certain variable is the regression residual with that variable omitted in the model.
Take the model \(y=\beta _0+\beta _1x_1+\beta _2x_2\,\!\) for example: the partial leverage plot for \(x_1\,\!\) is created by plotting the regression residual of \(y=\beta _0+\beta _2x_2\,\!\) against the residual of \(x_1=\beta _0+\beta _2x_2\,\!\).
Resudial Type
Select one residual type among Regular, Standardized, Studentized, Studentized Deleted for Plots.
Residual vs. Independent
Scatter plot of residual \(res\) vs. indenpendent variable \(x_1,x_2,\dots,x_k\), each plot is locate in a seperate graphs.
Residual vs. Predicted Value
Scatter plot of residual \(res\) vs. fitted results \(\hat{Y}\).
Residual vs. Order of the Data
\(res_i\) vs. sequence number \(i\)
Histogram of the Residual
The Histogram plot of the Residual \(res\)
Residual Lag Plot
Residuals \(res_i\) vs. lagged residual \(res_{(i–1)}\).
Normal Probability Plot of Residuals
A normal probability plot of the residuals can be used to check whether the variance is normally distributed as well. If the resulting plot is approximately linear, we proceed to assume that the error terms are normally distributed. The plot is based on the percentiles versus ordered residual, the percentiles is estimated by
\[\frac{(i-\frac{3}{8})}{(n+\frac{1}{4})}\]
where n is the total number of dataset and i is the i th data. Also refer to Probability Plot and Q-Q Plot

