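For a given dataset $(x_i, y_i),\ i = 1, 2, \ldots, n$, where $x$ is the independent variable and $y$ is the dependent variable, $\beta_0$ and $\beta_1$ are parameters, and $\varepsilon_i$ is a random error term with mean $E\{\varepsilon_i\} = 0$ and variance $Var\{\varepsilon_i\} = \sigma^2$, linear regression fits the data to a model of the following form:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \tag{1}$$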
Least squares estimation is used to minimize the sum of the $n$ squared deviations:

$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \tag{2}$$
The estimated parameters of the linear model can be computed as:

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x \tag{3}$$

$$\hat\beta_1 = \frac{S_{XY}}{S_{XX}} \tag{4}$$
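where:

$$\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar y = \frac{1}{n}\sum_{i=1}^{n} y_i \tag{5}$$

and

$$S_{XY} = \sum_{i=1}^{n} x_i y_i, \qquad S_{XX} = \sum_{i=1}^{n} x_i^2 \quad \text{(uncorrected)} \tag{6}$$

$$S_{XY} = \sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y), \qquad S_{XX} = \sum_{i=1}^{n} (x_i - \bar x)^2 \quad \text{(corrected)} \tag{7}$$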
| Note: When the intercept is excluded from the model, the coefficients are calculated using the uncorrected formula. |
Therefore, we estimate the regression function as follows:

$$\hat y = \hat\beta_0 + \hat\beta_1 x \tag{8}$$
The residual $res_i$ is defined as:

$$res_i = y_i - \hat y_i \tag{9}$$
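When the least squares estimators $\hat\beta_0$ and $\hat\beta_1$ are used to estimate $\beta_0$ and $\beta_1$, the quantity in formula (2) reaches its minimum, which equals the residual sum of squares:

$$RSS = \sum_{i=1}^{n} res_i^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2 \tag{10}$$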
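As an illustration of formulas (3) to (10), the following minimal Python sketch computes the intercept, slope, residuals, and RSS using the corrected sums in formula (7). The function name `fit_line` and the sample data are hypothetical and chosen only for this example; this is not the software's own implementation.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x (intercept included)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sums of cross-products and squares, formula (7)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx                 # slope, formula (4)
    b0 = y_bar - b1 * x_bar          # intercept, formula (3)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # formula (9)
    rss = sum(r * r for r in residuals)                        # formula (10)
    return b0, b1, residuals, rss


# Hypothetical sample data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1, residuals, rss = fit_line(x, y)
print(f"intercept={b0:.4f}, slope={b1:.4f}, RSS={rss:.4f}")
```

With an intercept in the model, the residuals sum to zero (up to rounding), which is a quick sanity check on any implementation.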
The section above assumes that the errors have constant variance. However, when fitting experimental data, we may need to take the instrument error (which reflects the accuracy and precision of the measuring instrument) into account in the fitting process. The assumption of constant error variance is then violated, so we instead assume $\varepsilon_i$ to be normally distributed with nonconstant variance, where the error $\sigma_i$ of each point can be used as a weight in fitting. The weight is defined as:

$$W = \mathrm{diag}(w_1, w_2, \ldots, w_n)$$
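The fitting model is changed to minimize the weighted sum of squares:

$$\sum_{i=1}^{n} w_i (y_i - \hat y_i)^2 = \sum_{i=1}^{n} w_i \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2 \tag{11}$$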
The weight factors $w_i$ can be given by three formulas:

No weighting: the error bar is not treated as a weight in the calculation.

Direct weighting:

$$w_i = \sigma_i \tag{12}$$

Instrumental weighting: the weight is inversely proportional to the square of the instrumental error, so a trial with small errors receives a larger weight because it is more precise than trials with larger errors.

$$w_i = \frac{1}{\sigma_i^2} \tag{13}$$
| Note: The errors used as weights should be designated as a "YError" column in the worksheet. |
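The weighted fit can be sketched in a few lines of Python. The sketch below minimizes the weighted sum of squares of formula (11) and assumes instrumental weights of the form $w_i = 1/\sigma_i^2$; the function names and data are hypothetical, and the closed-form solution with weighted means is the standard weighted least squares result rather than a documented implementation detail.

```python
def weighted_fit(x, y, w):
    """Minimize sum(w_i * (y_i - b0 - b1*x_i)**2) for a straight line."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def instrumental_weights(y_error):
    """Assumed instrumental weighting: w_i = 1 / sigma_i**2."""
    return [1.0 / (s * s) for s in y_error]


# Hypothetical data: the last point has a large error bar, so it counts less
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_error = [0.1, 0.1, 0.1, 0.1, 1.0]
b0, b1 = weighted_fit(x, y, instrumental_weights(y_error))
print(f"intercept={b0:.4f}, slope={b1:.4f}")
```

Passing equal weights reproduces the unweighted fit, which makes the effect of the error bars easy to compare.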
Fix Intercept sets the y-intercept $\beta_0$ to a fixed value. Because the intercept is fixed, the total degrees of freedom become n*=n-1.
Scale Error with sqrt(Reduced Chi-Sqr) is available when fitting with weight. This option only affects the error on the parameters reported from the fitting process, and does not affect the fitting process or the data in any way.
By default, it is checked, and $\sigma^2$ is taken into account when calculating the errors on the parameters; otherwise, $\sigma^2$ is not taken into account in the error calculation.
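Take the covariance matrix as an example, where $X$ is the $n \times 2$ design matrix whose $i$th row is $(1, x_i)$.

Scale Error with sqrt(Reduced Chi-Sqr):

$$Cov(\beta_i, \beta_j) = \sigma^2 (X'X)^{-1}, \qquad \sigma^2 = \frac{RSS}{n^* - 1} \tag{14}$$

Do not Scale Error with sqrt(Reduced Chi-Sqr):

$$Cov(\beta_i, \beta_j) = (X'X)^{-1} \tag{15}$$

For weighted fitting, $(X'WX)^{-1}$ is used instead of $(X'X)^{-1}$.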
When you perform a linear fit, you generate an analysis report sheet listing computed quantities. The Parameters table reports model slope and intercept (numbers in parentheses show how the quantities are derived):
The fitted values are given by formulas (3) and (4).
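For each parameter, the standard error can be obtained by:

$$s_{\hat\beta_0} = s_\varepsilon \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n\, S_{XX}}} \tag{16}$$

$$s_{\hat\beta_1} = \frac{s_\varepsilon}{\sqrt{S_{XX}}} \tag{17}$$

where the sample variance $s_\varepsilon^2$ (or error mean square, $MSE$) can be estimated as follows:

$$s_\varepsilon^2 = \frac{RSS}{df_{Error}} = \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{n^* - 1} \tag{18}$$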
RSS is the residual sum of squares (or error sum of squares, SSE), which is the sum of the squared vertical deviations from each data point to the fitted line. It can be computed as:

$$RSS = \sum_{i=1}^{n} w_i \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2 \tag{19}$$

with $w_i = 1$ when no weighting is used.

Note: Regarding $n^*$, if the intercept is included in the model, $n^* = n - 1$. Otherwise, $n^* = n$.
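The standard errors in formulas (16) to (18) can be checked numerically with a short Python sketch. It assumes an unweighted fit with the intercept included (so the error degrees of freedom are $n - 2$); the function name and data are hypothetical.

```python
import math


def parameter_standard_errors(x, y):
    """Return (b0, b1, se_b0, se_b1, s2) for an unweighted straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    df_error = n - 2                      # n* - 1 with n* = n - 1
    s2 = rss / df_error                   # error mean square, formula (18)
    se_b0 = math.sqrt(s2 * sum(xi * xi for xi in x) / (n * s_xx))  # formula (16)
    se_b1 = math.sqrt(s2 / s_xx)                                   # formula (17)
    return b0, b1, se_b0, se_b1, s2


# Hypothetical sample data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1, se_b0, se_b1, s2 = parameter_standard_errors(x, y)
print(f"b0={b0:.4f} +/- {se_b0:.4f}, b1={b1:.4f} +/- {se_b1:.4f}")
```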
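If the regression assumptions hold, we have:

$$\frac{\hat\beta_0 - \beta_0}{s_{\hat\beta_0}} \sim t_{n^*-1} \quad \text{and} \quad \frac{\hat\beta_1 - \beta_1}{s_{\hat\beta_1}} \sim t_{n^*-1} \tag{20}$$

where $s_{\hat\beta_0}$ and $s_{\hat\beta_1}$ are the standard errors of the intercept and slope.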
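The t-test can be used to examine whether the fitting parameters are significantly different from zero, which means that we can test whether $\beta_0 = 0$ (if true, this means that the fitted line passes through the origin) or $\beta_1 = 0$. The hypotheses of the t-tests are:

$$H_0: \beta_0 = 0 \quad \text{vs.} \quad H_\alpha: \beta_0 \neq 0, \qquad H_0: \beta_1 = 0 \quad \text{vs.} \quad H_\alpha: \beta_1 \neq 0$$

The t-values can be computed by:

$$t = \frac{\hat\beta_0 - 0}{s_{\hat\beta_0}} \quad \text{and} \quad t = \frac{\hat\beta_1 - 0}{s_{\hat\beta_1}} \tag{21}$$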
With the computed t-value, we can decide whether or not to reject the corresponding null hypothesis. Usually, for a given significance level $\alpha$, we can reject $H_0$ when $|t| > t_{\alpha/2}$. Additionally, the p-value is reported with the t-test. We also reject the null hypothesis $H_0$ if the p-value is less than $\alpha$.
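The p-value is the probability, assuming $H_0$ in the t-test above is true, of obtaining a t-value at least as extreme as the one observed:

$$prob = 2\left(1 - tcdf(|t|, df_{Error})\right) \tag{22}$$

where $tcdf(t, df)$ computes the lower tail probability for the Student's t distribution with $df$ degrees of freedom.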
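From the t-value, we can calculate the $(1-\alpha) \times 100\%$ Confidence Interval for each parameter by:

$$\hat\beta_j - t_{(\alpha/2,\, n^*-1)}\, s_{\hat\beta_j} \le \beta_j \le \hat\beta_j + t_{(\alpha/2,\, n^*-1)}\, s_{\hat\beta_j} \tag{23}$$

where $s_{\hat\beta_j}$ is the standard error of $\hat\beta_j$. The upper and lower bounds of this interval are the Upper Confidence Limit (UCL) and Lower Confidence Limit (LCL), respectively.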
The Confidence Interval Half Width is:

$$CI = \frac{UCL - LCL}{2} \tag{24}$$

where UCL and LCL are the upper and lower confidence limits, respectively.
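The t-value and confidence limits can be illustrated with a small Python sketch. Because the Python standard library has no Student's t quantile function, the critical value is passed in by the caller; the value 3.182 used below is the two-sided 95% critical value for 3 degrees of freedom taken from a standard t table. The function name and inputs are hypothetical.

```python
def t_value_and_limits(estimate, std_err, t_crit):
    """Return (t, LCL, UCL, half_width) for one fitted parameter."""
    t = estimate / std_err                 # t-value for H0: parameter = 0
    lcl = estimate - t_crit * std_err      # lower confidence limit
    ucl = estimate + t_crit * std_err      # upper confidence limit
    half_width = (ucl - lcl) / 2.0         # formula (24)
    return t, lcl, ucl, half_width


# Hypothetical slope estimate and standard error from a fit with 3 error
# degrees of freedom; 3.182 is the tabulated t critical value (alpha = 0.05).
t, lcl, ucl, half_width = t_value_and_limits(0.6, 0.2828, 3.182)
print(f"t={t:.3f}, LCL={lcl:.3f}, UCL={ucl:.3f}, half width={half_width:.3f}")
```

In this hypothetical case the interval contains zero, so the slope would not be judged significantly different from zero at the 5% level.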
Key linear fit statistics are summarized in the Statistics table (numbers in parentheses show how quantities are computed):

The Error degrees of freedom. Please refer to the ANOVA table for more details.
The residual sum of squares, see formula (19).
The reduced chi-square, see formula (14).
The quality of linear regression can be measured by the coefficient of determination (COD), or $R^2$, which can be computed as:

$$R^2 = 1 - \frac{RSS}{TSS} \tag{25}$$

where TSS is the total sum of squares, and RSS is the residual sum of squares. $R^2$ is a value between 0 and 1. Generally speaking, if it is close to 1, the relationship between X and Y is regarded as very strong and we can have a high degree of confidence in our regression model.
We can further calculate the adjusted $R^2$ as:

$$\bar R^2 = 1 - \frac{RSS / df_{Error}}{TSS / df_{Total}} \tag{26}$$
The R value is the square root of $R^2$:

$$R = \sqrt{R^2} \tag{27}$$
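In simple linear regression, the correlation coefficient between x and y, denoted by r, equals:

$$r = \begin{cases} \sqrt{R^2} & \text{if } \hat\beta_1 \text{ is positive} \\ -\sqrt{R^2} & \text{if } \hat\beta_1 \text{ is negative} \end{cases} \tag{28}$$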
The root mean square of the error, or residual standard deviation, equals:

$$RootMSE = \sqrt{\frac{RSS}{df_{Error}}} \tag{29}$$
The norm of the residuals equals the square root of RSS:

$$Norm\ of\ Residuals = \sqrt{RSS} \tag{30}$$
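The statistics in formulas (25) to (30) all follow from RSS and TSS, as the following Python sketch shows. It assumes an unweighted fit with the intercept included and the corrected total sum of squares; the function name, dictionary keys, and data are hypothetical.

```python
import math


def fit_statistics(y, y_hat, slope):
    """Goodness-of-fit statistics for a straight-line fit with intercept."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)        # corrected TSS
    df_error = n - 2                                # two fitted parameters
    df_total = n - 1
    r2 = 1.0 - rss / tss                            # formula (25)
    adj_r2 = 1.0 - (rss / df_error) / (tss / df_total)   # formula (26)
    r = math.sqrt(r2)                               # formula (27)
    return {
        "R2": r2,
        "AdjR2": adj_r2,
        "R": r,
        "PearsonR": math.copysign(r, slope),        # sign follows the slope, (28)
        "RootMSE": math.sqrt(rss / df_error),       # formula (29)
        "NormResiduals": math.sqrt(rss),            # formula (30)
    }


# Hypothetical data and fitted values from the line y = 2.2 + 0.6 x
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
for name, value in fit_statistics(y, y_hat, slope=0.6).items():
    print(f"{name}: {value:.4f}")
```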
The ANOVA table of linear fitting is:
| | DF | Sum of Squares | Mean Square | F Value | Prob > F |
|---|---|---|---|---|---|
| Model | 1 | $SS_{reg} = TSS - RSS$ | $MS_{reg} = SS_{reg} / 1$ | $MS_{reg} / MSE$ | p-value |
| Error | n* - 1 | RSS | MSE = RSS / (n* - 1) | | |
| Total | n* | TSS | | | |

Note: If the intercept is included in the model, n*=n-1. Otherwise, n*=n and the total sum of squares is uncorrected. If the slope is fixed, $df_{Model} = 0$.
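where the total sum of squares, TSS, is:

$$TSS = \begin{cases} \displaystyle\sum_{i=1}^{n} w_i \left( y_i - \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \right)^2 & \text{(corrected)} \\[2ex] \displaystyle\sum_{i=1}^{n} w_i y_i^2 & \text{(uncorrected)} \end{cases} \tag{31}$$

with $w_i = 1$ when no weighting is used.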
The F value here is a test of whether the fitting model differs significantly from the model y=constant.
The p-value is reported with the F-test. If the p-value is less than the chosen significance level $\alpha$, the fitting model differs significantly from the model y=constant.
If the intercept is fixed at a certain value, the p-value for the F-test is not meaningful, and it differs from that in linear regression without the intercept constraint.
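The entries of the ANOVA table can be reproduced with a brief Python sketch. It assumes an unweighted fit with the intercept included, so the model has 1 degree of freedom and the error has $n - 2$; the p-value is omitted because the standard library has no F distribution. The function name and data are hypothetical.

```python
def anova_table(y, y_hat):
    """Return (ss_reg, ms_reg, mse, f_value) for a straight-line fit."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)    # corrected total sum of squares
    ss_reg = tss - rss                          # model sum of squares
    df_model = 1
    df_error = n - 2
    ms_reg = ss_reg / df_model
    mse = rss / df_error
    return ss_reg, ms_reg, mse, ms_reg / mse


# Hypothetical data and fitted values from the line y = 2.2 + 0.6 x
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
ss_reg, ms_reg, mse, f_value = anova_table(y, y_hat)
print(f"SSreg={ss_reg:.3f}, MSE={mse:.3f}, F={f_value:.3f}")
```

For a straight-line fit, the F value equals the square of the slope's t-value, which offers a simple cross-check.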
To run the lack of fit test, you need repeated observations ("replicate data"), so that at least one of the X values is repeated within the dataset, or within multiple datasets when the concatenate fit mode is selected.
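Notation used for fitting with replicate data:

$y_{ij}$ is the $j$th measurement made at the $i$th x-value in the data set.
$\bar y_i$ is the average of all of the y values at the $i$th x-value.
$\hat y_{ij}$ is the predicted response for the $j$th measurement made at the $i$th x-value.

The sums of squares in the table below are expressed by:

$$RSS = \sum_i \sum_j (y_{ij} - \hat y_{ij})^2$$

$$LFSS = \sum_i \sum_j (\bar y_i - \hat y_{ij})^2$$

$$PESS = \sum_i \sum_j (y_{ij} - \bar y_i)^2$$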
The Lack of Fit table of linear fitting is:
| | DF | Sum of Squares | Mean Square | F Value | Prob > F |
|---|---|---|---|---|---|
| Lack of Fit | c-2 | LFSS | MSLF = LFSS / (c - 2) | MSLF / MSPE | p-value |
| Pure Error | n - c | PESS | MSPE = PESS / (n - c) | | |
| Error | n*-1 | RSS | | | |

| Note: If the intercept is included in the model, n*=n-1. Otherwise, n*=n and the total sum of squares is uncorrected. If the slope is fixed, $df_{Model} = 0$. c denotes the number of distinct x values. If the intercept is fixed, DF for Lack of Fit is c-1. |
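The decomposition of RSS into lack-of-fit and pure-error parts can be sketched as follows in Python. The sketch assumes an unweighted fit with the intercept included (so the lack-of-fit degrees of freedom are $c - 2$) and groups replicates by exactly equal x values; the function name and data are hypothetical.

```python
from collections import defaultdict


def lack_of_fit(x, y):
    """Return (lfss, pess, f_value) for a straight-line fit with replicates."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    groups = defaultdict(list)              # y values observed at each distinct x
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)                         # number of distinct x values

    lfss = 0.0
    pess = 0.0
    for xi, ys in groups.items():
        mean_i = sum(ys) / len(ys)          # average response at this x
        fit_i = b0 + b1 * xi                # predicted response at this x
        lfss += len(ys) * (mean_i - fit_i) ** 2
        pess += sum((yij - mean_i) ** 2 for yij in ys)

    mslf = lfss / (c - 2)
    mspe = pess / (n - c)
    return lfss, pess, mslf / mspe


# Hypothetical replicate data: two measurements at each of three x values
x = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
y = [1.0, 1.2, 2.1, 1.9, 3.2, 2.8]
lfss, pess, f_value = lack_of_fit(x, y)
print(f"LFSS={lfss:.5f}, PESS={pess:.5f}, F={f_value:.5f}")
```

LFSS and PESS add up to RSS, so a small F value indicates that the straight line explains the group means about as well as the replicate scatter allows.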
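The covariance matrix of linear regression is calculated by:

$$\begin{pmatrix} Cov(\beta_0, \beta_0) & Cov(\beta_0, \beta_1) \\ Cov(\beta_1, \beta_0) & Cov(\beta_1, \beta_1) \end{pmatrix} = \sigma^2 \frac{1}{S_{XX}} \begin{pmatrix} \dfrac{\sum x_i^2}{n} & -\bar x \\ -\bar x & 1 \end{pmatrix} \tag{32}$$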
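The correlation between any two parameters is:

$$\rho(\beta_i, \beta_j) = \frac{Cov(\beta_i, \beta_j)}{\sqrt{Cov(\beta_i, \beta_i)}\,\sqrt{Cov(\beta_j, \beta_j)}} \tag{33}$$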
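For a straight-line fit, the matrix $\sigma^2 (X'X)^{-1}$ has a simple closed form, which the following Python sketch evaluates together with the correlation between intercept and slope. It assumes an unweighted fit with the intercept included; the `scale_error` flag is a hypothetical stand-in for the Scale Error with sqrt(Reduced Chi-Sqr) option, and the function name and data are likewise illustrative.

```python
import math


def parameter_covariance(x, y, scale_error=True):
    """Return (2x2 covariance matrix, correlation of intercept and slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Reduced chi-square RSS / (n - 2); use 1.0 when errors are not scaled
    s2 = rss / (n - 2) if scale_error else 1.0
    cov00 = s2 * sum(xi * xi for xi in x) / (n * s_xx)   # variance of intercept
    cov01 = -s2 * x_bar / s_xx                           # covariance term
    cov11 = s2 / s_xx                                    # variance of slope
    corr01 = cov01 / math.sqrt(cov00 * cov11)
    return [[cov00, cov01], [cov01, cov11]], corr01


# Hypothetical sample data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
cov, corr = parameter_covariance(x, y)
print(cov, f"correlation={corr:.4f}")
```

The correlation does not depend on the scaling choice, since the factor $\sigma^2$ cancels in the ratio.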
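Outliers are points whose Studentized Residual has an absolute value larger than 2.
Studentized Residual is introduced in Detecting outliers by transforming residuals.

Regular: $r_i$ stands for the regular residual $res_i$.

Standardized, where $s_\varepsilon$ is the residual standard deviation:

$$r_i' = \frac{r_i}{s_\varepsilon} \tag{34}$$

Studentized, also known as the internally studentized residual:

$$r_i' = \frac{r_i}{s_\varepsilon \sqrt{1 - h_i}} \tag{35}$$

Studentized deleted, also known as the externally studentized residual:

$$r_i' = \frac{r_i}{s_{\varepsilon - i} \sqrt{1 - h_i}} \tag{36}$$

In the equations for the Studentized and Studentized deleted residuals, $h_i$ is the $i$th diagonal element of the matrix $P$:

$$P = X (X'X)^{-1} X' \tag{37}$$

$s_{\varepsilon - i}^2$ means the variance is calculated based on all points except the $i$th.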
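The residual types can be compared side by side with a short Python sketch. It assumes an unweighted straight-line fit with the intercept included, for which the leverage has the closed form $h_i = 1/n + (x_i - \bar x)^2 / S_{XX}$, and it uses the standard identity $s_{\varepsilon-i}^2 = (RSS - r_i^2/(1-h_i)) / (df_{Error} - 1)$ to avoid refitting without each point. The function name and data are hypothetical.

```python
import math


def transformed_residuals(x, y):
    """Return (standardized, studentized, studentized_deleted, leverage)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    rss = sum(r * r for r in res)
    df_error = n - 2
    s2 = rss / df_error
    # Leverage: diagonal of P = X (X'X)^-1 X' for a straight-line model
    h = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

    standardized = [r / math.sqrt(s2) for r in res]
    studentized = [r / math.sqrt(s2 * (1.0 - hi)) for r, hi in zip(res, h)]
    deleted = []
    for r, hi in zip(res, h):
        # Error variance estimated with the i-th point left out
        s2_minus_i = (rss - r * r / (1.0 - hi)) / (df_error - 1)
        deleted.append(r / math.sqrt(s2_minus_i * (1.0 - hi)))
    return standardized, studentized, deleted, h


# Hypothetical sample data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
standardized, studentized, deleted, h = transformed_residuals(x, y)
outliers = [i for i, t in enumerate(studentized) if abs(t) > 2.0]
print("leverage:", h)
print("outlier indices:", outliers)
```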
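For a particular value $x_p$, the $100(1-\alpha)\%$ confidence interval for the mean value of $y$ at $x = x_p$ is:

$$\hat y \pm t_{(\alpha/2,\, n^*-1)}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar x)^2}{S_{XX}}} \tag{38}$$

And the $100(1-\alpha)\%$ prediction interval for a new observation of $y$ at $x = x_p$ is:

$$\hat y \pm t_{(\alpha/2,\, n^*-1)}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar x)^2}{S_{XX}}} \tag{39}$$

where $\hat y = \hat\beta_0 + \hat\beta_1 x_p$ and $S_{XX}$ is the corrected sum of squares of $x$.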
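A small Python sketch makes the difference between the two bands concrete. It assumes an unweighted fit with the intercept included and takes the t critical value as an argument, because the standard library has no Student's t quantile function; 3.182 is the tabulated two-sided 95% value for 3 degrees of freedom. The function name and data are hypothetical.

```python
import math


def band_half_widths(x, y, x_p, t_crit):
    """Return (y_p, confidence half width, prediction half width) at x_p."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))                     # residual standard deviation
    y_p = b0 + b1 * x_p                              # fitted value at x_p
    core = 1.0 / n + (x_p - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * math.sqrt(core)           # mean response band
    pi_half = t_crit * s * math.sqrt(1.0 + core)     # new observation band
    return y_p, ci_half, pi_half


# Hypothetical sample data; 3.182 is the tabulated t value for 3 df, alpha = 0.05
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
for x_p in (3.0, 5.0):
    y_p, ci_half, pi_half = band_half_widths(x, y, x_p, 3.182)
    print(f"x={x_p}: y={y_p:.3f}, CI +/- {ci_half:.3f}, PI +/- {pi_half:.3f}")
```

Both bands are narrowest at $x_p = \bar x$ and widen away from it, and the prediction band is always the wider of the two.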
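Assuming the pair of variables (X, Y) conforms to a bivariate normal distribution, we can examine the correlation between the two variables using a confidence ellipse. The confidence ellipse is centered at ($\bar x$, $\bar y$), and the major semiaxis $a$ and minor semiaxis $b$ can be expressed as follows:

$$a = c \sqrt{\frac{\sigma_x^2 + \sigma_y^2 + \sqrt{(\sigma_x^2 - \sigma_y^2)^2 + 4 r^2 \sigma_x^2 \sigma_y^2}}{2}}, \qquad b = c \sqrt{\frac{\sigma_x^2 + \sigma_y^2 - \sqrt{(\sigma_x^2 - \sigma_y^2)^2 + 4 r^2 \sigma_x^2 \sigma_y^2}}{2}} \tag{40}$$

where $\sigma_x^2$ and $\sigma_y^2$ are the sample variances of X and Y, $r$ is their correlation coefficient, and $c$ is a scale factor that depends on the confidence level.

For a given confidence level of $(1-\alpha)$: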
|
(41) |
|---|
|
(42) |
|---|
|
(43) |
|---|
Select one residual type among Regular, Standardized, Studentized, and Studentized Deleted for the plots.

Scatter plot of the residual $res_i$ vs. the independent variable $x_i$; each plot is located in a separate graph.
Scatter plot of the residual $res_i$ vs. the fitted results $\hat y_i$.
Residual $res_i$ vs. sequence number $i$.
The histogram plot of the residual $res_i$.
Residual $res_i$ vs. lagged residual $res_{i-1}$.
A normal probability plot of the residuals can be used to check whether the error terms are normally distributed. If the resulting plot is approximately linear, we proceed on the assumption that the error terms are normally distributed. The plot is based on the percentiles versus the ordered residuals, and the percentiles are estimated by

$$\frac{i - \frac{3}{8}}{n + \frac{1}{4}}$$

where $n$ is the total number of data points and $i$ is the rank of the $i$th ordered residual. Also refer to Probability Plot and Q-Q Plot.
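The points of a normal probability plot can be generated with the Python standard library. This sketch assumes the plotting position $(i - 3/8)/(n + 1/4)$ (often called Blom's formula) and converts each percentile to a standard normal quantile with `statistics.NormalDist`; the function name and residual values are hypothetical.

```python
from statistics import NormalDist


def normal_probability_points(residuals):
    """Return (normal quantile, ordered residual) pairs for plotting."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Assumed plotting position for the i-th smallest residual, i = 1..n
    percentiles = [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]
    standard_normal = NormalDist()          # mean 0, standard deviation 1
    quantiles = [standard_normal.inv_cdf(p) for p in percentiles]
    return list(zip(quantiles, ordered))


# Hypothetical residuals from a straight-line fit
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
for z, r in normal_probability_points(residuals):
    print(f"z={z:+.3f}  residual={r:+.2f}")
```

If the residuals are close to normal, the printed pairs fall near a straight line when plotted.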