Simple Linear Regression and Correlation
From the statistics 1B curriculum · Updated May 29, 2026
1. Introduction & Overview
- The Mental Model: Imagine predicting where a projectile will land from its launch settings: small changes in an observable, continuous input shift the predicted outcome in a systematic way. Simple linear regression does the same with a straight line rather than a parabola, and it treats the scatter of actual outcomes around that line as random error rather than expecting exact predictions.
- Significance:
- Financial Forecasting: Predicting stock prices, commodity futures, or economic indicators based on historical data and related variables (e.g., GDP, interest rates).
- Biomedical Research: Modeling drug dosage response curves (e.g., concentration of drug vs. physiological effect) or correlating genetic markers with disease susceptibility.
- Engineering Diagnostics: Predicting material fatigue life based on stress cycles, or estimating energy consumption from ambient temperature and operational load.
- Environmental Science: Relating pollutant concentrations to emission sources, or predicting agricultural yields based on rainfall and fertilizer application.
- Quality Control: Establishing relationships between manufacturing process parameters (e.g., temperature, pressure) and product quality metrics (e.g., tensile strength, purity).
mindmap
root((Simple Linear Regression & Correlation))
"Fundamentals"
"Deterministic vs. Stochastic"
"Population vs. Sample Regression Function"
"Assumptions (Gauss-Markov)"
"Regression Analysis"
"Model Specification"
"Y_i = beta_0 + beta_1 * X_i + epsilon_i"
"Parameter Estimation (OLS)"
"Normal Equations"
"Beta hats"
"Goodness-of-Fit"
"R-squared"
"Standard Error of Regression"
"Correlation Analysis"
"Pearson Product-Moment Coefficient (r)"
"Properties of r"
"Covariance"
"Inference"
"Hypothesis Testing (t-tests, F-tests)"
"Confidence Intervals"
"Prediction Intervals"
"Diagnostics"
"Residual Analysis"
"Homoscedasticity"
"Normality"
"Independence"
"Outliers & Influential Points"
2. In-Depth Theory, Equations & Mechanisms
Simple Linear Regression (SLR) models the relationship between two continuous quantitative variables: a dependent variable, $Y$, and an independent variable, $X$. This relationship is assumed to be linear in its parameters. Correlation quantifies the strength and direction of the linear association between these variables.
2.1 The Simple Linear Regression Model
The population regression function (PRF) describes the true, unknown relationship:
$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$
Where:
* $Y_i$: The $i$-th observation of the dependent variable.
* $X_i$: The $i$-th observation of the independent variable.
* $\beta_0$: The population Y-intercept, representing the expected value of $Y$ when $X=0$.
* $\beta_1$: The population slope coefficient, representing the expected change in $Y$ for a one-unit change in $X$.
* $\epsilon_i$: The $i$-th error term (or disturbance), representing all unobserved factors affecting $Y$ and the inherent randomness in the relationship. $\epsilon_i$ is a random variable.
2.2 Assumptions of the Classical Linear Regression Model (CLRM) for OLS Estimation
The validity and efficiency of Ordinary Least Squares (OLS) estimators depend critically on these assumptions (Gauss-Markov assumptions):
1. Linearity in Parameters: The model is linear in the coefficients $\beta_0$ and $\beta_1$.
* Equation: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$
2. Random Sampling: The data $(X_i, Y_i)$ are a random sample from the population. This ensures the observations are independent.
3. No Perfect Collinearity of $X$: The independent variable $X$ must exhibit some variation in the sample (i.e., the $X_i$ values are not all identical). If the sample variance of $X$ is zero, the slope estimator $\hat{\beta}_1$ cannot be computed because its denominator is zero.
* Condition: $\sum_{i=1}^{n} (X_i - \bar{X})^2 > 0$
4. Zero Conditional Mean of Error Term: The expected value of the error term, conditional on $X$, is zero. This implies that $X$ is exogenous; it is not correlated with the error term.
* Equation: $E(\epsilon_i | X_i) = 0$ for all $i$.
* Direct Implication: $E(Y_i | X_i) = \beta_0 + \beta_1 X_i$. This is the PRF.
5. Homoscedasticity (Constant Variance of Error Term): The variance of the error term, conditional on $X$, is constant for all observations.
* Equation: $Var(\epsilon_i | X_i) = \sigma^2$ (a constant) for all $i$.
* Violation is called heteroscedasticity.
6. No Autocorrelation (No Serial Correlation): The error terms for different observations are uncorrelated.
* Equation: $Cov(\epsilon_i, \epsilon_j | X_i, X_j) = 0$ for $i \neq j$.
7. Normality of Error Term (for Inference): The error terms are normally distributed. This assumption is crucial for hypothesis testing and constructing confidence intervals, particularly in small samples. For large samples, the Central Limit Theorem helps ensure estimators are approximately normally distributed even if errors are not.
* Equation: $\epsilon_i \sim N(0, \sigma^2)$
2.3 Ordinary Least Squares (OLS) Estimation
The objective of OLS is to find the sample regression function (SRF):
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
Where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the OLS estimators of $\beta_0$ and $\beta_1$, respectively. The "hat" denotes an estimated value.
The OLS principle minimizes the sum of squared residuals ($SSR$; the same quantity is denoted $RSS$ in Section 2.5):
$SSR = \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i))^2$
To find $\hat{\beta}_0$ and $\hat{\beta}_1$, we take partial derivatives of $SSR$ with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$, set them to zero, and solve the resulting system of "normal equations."
$\frac{\partial SSR}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$
$\frac{\partial SSR}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$
Solving these equations yields:
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{Cov(X, Y)}{Var(X)}$
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
Where:
* $\bar{X} = \frac{1}{n} \sum X_i$ is the sample mean of $X$.
* $\bar{Y} = \frac{1}{n} \sum Y_i$ is the sample mean of $Y$.
* $Cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$ is the sample covariance.
* $Var(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$ is the sample variance.
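The closed-form estimators above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `ols_fit` and the five-point dataset are assumptions made up for this example.

```python
def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form OLS solution."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with n >= 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squared deviations and cross-products. The 1/(n-1) factors
    # in Cov(X, Y) and Var(X) cancel in the ratio, so they are omitted.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xx == 0:
        raise ValueError("X has no variation, so the slope cannot be computed")
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical toy data, chosen only so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
beta0_hat, beta1_hat = ols_fit(x, y)
print(beta0_hat, beta1_hat)  # approximately 2.2 and 0.6

# The two normal equations say the residuals sum to zero and are
# orthogonal to X; both sums should be zero up to rounding error.
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
sum_resid = sum(residuals)
sum_x_resid = sum(xi * ei for xi, ei in zip(x, residuals))
print(sum_resid, sum_x_resid)
```

Checking the two residual sums is a quick way to confirm that a hand calculation actually satisfies the normal equations.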
2.4 Properties of OLS Estimators (Gauss-Markov Theorem)
Under assumptions 1-6 (normality, assumption 7, is not required), the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are the Best Linear Unbiased Estimators (BLUE).
* Linear: They are linear functions of the observed $Y_i$ values.
* Unbiased: Their expected values are equal to the true population parameters: $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$.
* Best: They have the minimum variance among all linear unbiased estimators.
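Unbiasedness can be illustrated with a small Monte Carlo experiment. The sketch below assumes a made-up population ($\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$), fixed $X$ values, and normal errors; all of these, and the helper name `ols_fit`, are illustrative choices. It shows only that the estimates average out near the true parameters; it does not demonstrate the "Best" (minimum variance) property.

```python
import random


def ols_fit(x, y):
    """Closed-form OLS estimates (beta0_hat, beta1_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    return y_bar - beta1 * x_bar, beta1


random.seed(0)  # fixed seed so the experiment is repeatable

# Assumed "true" population parameters for the simulation.
TRUE_BETA0, TRUE_BETA1, SIGMA = 2.0, 0.5, 1.0
x = list(range(30))  # X values held fixed across replications
n_reps = 2000

intercepts, slopes = [], []
for _ in range(n_reps):
    # Draw a fresh sample of Y from the population regression function.
    y = [TRUE_BETA0 + TRUE_BETA1 * xi + random.gauss(0.0, SIGMA) for xi in x]
    b0, b1 = ols_fit(x, y)
    intercepts.append(b0)
    slopes.append(b1)

mean_intercept = sum(intercepts) / n_reps
mean_slope = sum(slopes) / n_reps
print(mean_intercept, mean_slope)  # both should be close to 2.0 and 0.5
```

Individual estimates vary from sample to sample; it is their average over many samples that sits at the true value.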
2.5 Goodness-of-Fit: R-squared ($R^2$) and Standard Error of the Regression ($s_e$)
- Total Sum of Squares (TSS): Measures the total variation in the dependent variable.
$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
- Explained Sum of Squares (ESS): Measures the variation in $Y$ explained by the regression model.
$ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$
- Residual Sum of Squares (RSS): Measures the unexplained variation in $Y$ (sum of squared residuals).
$RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{\epsilon}_i^2$
Crucially, $TSS = ESS + RSS$.
- Coefficient of Determination ($R^2$): Represents the proportion of the total variation in $Y$ that is explained by the independent variable $X$.
$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$
- Properties: $0 \le R^2 \le 1$.
- An $R^2$ of 0 means the model explains none of the variation in $Y$. An $R^2$ of 1 means the model explains all the variation in $Y$.
- Standard Error of the Regression ($s_e$): An estimate of the standard deviation of the error term ($\sigma$). It measures the average distance that the observed values fall from the regression line.
$s_e = \sqrt{\frac{RSS}{n-k}} = \sqrt{\frac{\sum_{i=1}^{n} \hat{\epsilon}_i^2}{n-2}}$
(where $k=2$ for SLR, as there are 2 parameters: $\beta_0, \beta_1$)
- Also known as the Root Mean Squared Error (RMSE).
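The decomposition $TSS = ESS + RSS$ and the two fit measures can be checked numerically. The following is a small sketch on a hypothetical five-point dataset; the function name `goodness_of_fit` and the dictionary keys are assumptions for illustration.

```python
import math


def goodness_of_fit(x, y):
    """Fit an SLR by OLS and return TSS, ESS, RSS, R^2 and s_e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    fitted = [beta0 + beta1 * xi for xi in x]

    tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ess = sum((fi - y_bar) ** 2 for fi in fitted)          # explained variation
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # unexplained variation
    return {
        "TSS": tss,
        "ESS": ess,
        "RSS": rss,
        "R2": 1 - rss / tss,
        "s_e": math.sqrt(rss / (n - 2)),  # n - 2 degrees of freedom in SLR
    }


# Hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fit = goodness_of_fit(x, y)
print(fit)  # TSS = 6, ESS = 3.6, RSS = 2.4 (up to rounding), so R2 = 0.6
```

Here $R^2 = 0.6$ means the line accounts for 60% of the variation in $Y$, and $s_e = \sqrt{2.4/3} \approx 0.89$ is in the units of $Y$.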
2.6 Simple Linear Correlation: Pearson Product-Moment Correlation Coefficient ($r$)
The Pearson correlation coefficient quantifies the linear association between two variables $X$ and $Y$.
$r = \frac{Cov(X, Y)}{s_X s_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$
- Properties of $r$:
- Range: $-1 \le r \le 1$.
- Sign: Indicates the direction of the linear relationship (positive for direct, negative for inverse).
- Magnitude: Indicates the strength of the linear relationship (closer to $\pm 1$ indicates stronger).
- Symmetry: $r_{XY} = r_{YX}$.
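- Scale Invariance: $r$ is unaffected by a change of origin or by a positive rescaling of either variable (e.g., converting units). Multiplying one variable by a negative constant reverses the sign of $r$ but leaves its magnitude unchanged.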
- $r^2 = R^2$ in Simple Linear Regression. This relationship is specific to SLR.
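The listed properties are easy to verify numerically. The sketch below uses a hypothetical five-point dataset and an illustrative helper named `pearson_r`; the unit conversion applied to $X$ is an arbitrary example of a positive linear rescaling.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation from sums of deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                      # symmetry: same value

# Shift the origin and rescale X by a positive factor: r is unchanged.
x_rescaled = [32 + 1.8 * xi for xi in x]
r_rescaled = pearson_r(x_rescaled, y)

# Multiply X by a negative factor: the sign of r flips.
x_flipped = [-xi for xi in x]
r_flipped = pearson_r(x_flipped, y)

print(r_xy, r_yx, r_rescaled, r_flipped)
print(r_xy ** 2)  # equals R^2 of the regression of y on x (0.6 for these data)
```

For these data $S_{XY} = 6$, $S_{XX} = 10$, and $S_{YY} = 6$, so $r = 6/\sqrt{60} \approx 0.775$.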
radar-beta
title OLS Estimator Properties Matrix
series
name "Unbiasedness"
data [100, 70, 85, 90, 60]
series
name "Efficiency (Minimum Variance)"
data [75, 100, 80, 70, 95]
series
name "Consistency"
data [90, 80, 100, 85, 75]
series
name "Distributional Normality (asymptotic)"
data [60, 65, 70, 100, 80]
labels
"CLRM Assumptions"
"Robust Standard Errors (Heteroskedasticity-consistent)"
"Large Sample Size (n -> infinity)"
"Normality of Errors"
"Absence of Outliers"
The radar chart above visualizes how key properties of OLS estimators, essential for valid inference, are influenced by various conditions. For instance, "Unbiasedness" is largely dependent on CLRM assumptions, especially $E(\epsilon_i|X_i)=0$. "Efficiency" (minimum variance) is maximally attained under all CLRM assumptions (BLUE property). "Consistency" (estimators converging to true parameters as $n \to \infty$) is a large-sample property, robust to some CLRM violations. "Distributional Normality" for hypothesis testing is directly enhanced by the normality of errors or by large sample sizes (Central Limit Theorem). "Absence of Outliers" impacts all properties, as outliers can bias estimators and inflate variance.
3. Technical Procedures & Applications
3.1 Procedure for Conducting a Simple Linear Regression Analysis
This procedure outlines the steps from data collection to interpretation and diagnostic checking, emphasizing rigorous statistical practice.
sequenceDiagram
participant Analyst as "Statistical Analyst"
participant Data as "Raw Data Set (X, Y)"
participant Model as "Regression Model (Y = β0 + β1X + ε)"
participant Software as "Statistical Software (R, Python, SAS)"
participant Report as "Analysis Report"
Analyst->Data: 1. Acquire raw data (n observations)
Analyst->Analyst: 2. Visualize data (Scatter plot of Y vs. X)
Note over Analyst: Identify potential linearity, outliers, heteroscedasticity.
Analyst->Software: 3. Specify OLS model formula
Software->Model: 4. Estimate parameters (β̂₀, β̂₁)
Note over Software: Applies Normal Equations using matrix algebra: <br/>β̂ = (X'X)⁻¹X'Y
Software->Model: 5. Calculate residuals (ε̂ᵢ = Yᵢ - Ŷᵢ)
Software->Model: 6. Calculate RSS, ESS, TSS, R²
Software->Model: 7. Compute standard errors for β̂₀, β̂₁ (SE(β̂₀), SE(β̂₁))
Software->Model: 8. Calculate t-statistics for β̂₀, β̂₁
Analyst->Software: 9. Request diagnostic plots
Software-->Analyst: 10. Generate Residuals vs. Fitted, Normal Q-Q, Scale-Location plots
Analyst->Analyst: 11. Interpret model coefficients: β̂₀, β̂₁
Note over Analyst: β̂₁ represents the estimated average change in Y for a one-unit increase in X.
Analyst->Analyst: 12. Evaluate goodness-of-fit: R², s_e
Analyst->Analyst: 13. Perform hypothesis tests for significance (t-tests for β̂₀, β̂₁)
Note over Analyst: H₀: β₁ = 0 vs. H₁: β₁ ≠ 0. Compare p-value to α.
Analyst->Analyst: 14. Construct confidence intervals for β₀, β₁
Note over Analyst: CI for β₁: β̂₁ ± t(α/2, n-2) * SE(β̂₁)
Analyst->Analyst: 15. Assess model assumptions using diagnostic plots
Analyst->Analyst: Consistency Check (Homoscedasticity, Normality, Independence)
Analyst->Report: 16. Compile results, interpretations, and diagnostics.
Analyst->Report: 17. Formulate predictions for new X values (point and interval forecasts).
3.2 Detailed Calculation of Key Statistics
Given a dataset $(X_i, Y_i)$ for $i=1, \dots, n$:
- Sample Means:
$\bar{X} = \frac{1}{n} \sum X_i$
$\bar{Y} = \frac{1}{n} \sum Y_i$
- Sums of Squares and Cross-Products (divide by $n-1$ to obtain the sample variances and covariance):
$S_{XX} = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n}$
$S_{YY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}$
$S_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}$
- OLS Estimators:
$\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}}$
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
- Predicted Values and Residuals:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
$\hat{\epsilon}_i = Y_i - \hat{Y}_i$
- Sums of Squares:
$RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^2 = S_{YY} - \hat{\beta}_1 S_{XY}$ (This is an important computational shortcut)
$TSS = S_{YY}$
$ESS = TSS - RSS = \hat{\beta}_1 S_{XY}$
- Coefficient of Determination:
$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$
- Standard Error of the Regression:
$s_e = \sqrt{\frac{RSS}{n-2}}$
- Standard Errors of the Estimators:
$SE(\hat{\beta}_1) = \frac{s_e}{\sqrt{S_{XX}}}$
$SE(\hat{\beta}_0) = s_e \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{S_{XX}}}$
- Test Statistics for Hypothesis Testing:
For $H_0: \beta_1 = 0$: $t_{\hat{\beta}_1} = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$ (follows a $t$-distribution with $n-2$ degrees of freedom under $H_0$)
For $H_0: \beta_0 = 0$: $t_{\hat{\beta}_0} = \frac{\hat{\beta}_0}{SE(\hat{\beta}_0)}$ (follows a $t$-distribution with $n-2$ degrees of freedom under $H_0$)
- Pearson Correlation Coefficient:
$r = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}$
Note: $r^2 = R^2$ for SLR.
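The calculation chain above, from the sums $S_{XX}$, $S_{YY}$, $S_{XY}$ through to the $t$-statistics, can be followed in code. This is a sketch on a hypothetical five-point dataset; the function name `slr_summary` and its dictionary keys are illustrative assumptions. It uses the computational forms of the sums and the shortcut $RSS = S_{YY} - \hat{\beta}_1 S_{XY}$.

```python
import math


def slr_summary(x, y):
    """Compute the key SLR statistics using the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    x_bar, y_bar = sum_x / n, sum_y / n

    # Computational (raw-sum) forms of the sums of squares and cross-products.
    s_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    s_yy = sum(yi ** 2 for yi in y) - sum_y ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n

    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar

    rss = s_yy - beta1 * s_xy            # shortcut, no residuals needed
    s_e = math.sqrt(rss / (n - 2))

    se_beta1 = s_e / math.sqrt(s_xx)
    se_beta0 = s_e * math.sqrt(1 / n + x_bar ** 2 / s_xx)

    return {
        "beta0": beta0, "beta1": beta1,
        "RSS": rss, "R2": 1 - rss / s_yy, "s_e": s_e,
        "SE_beta0": se_beta0, "SE_beta1": se_beta1,
        "t_beta0": beta0 / se_beta0, "t_beta1": beta1 / se_beta1,
        "r": s_xy / math.sqrt(s_xx * s_yy),
        "df": n - 2,
    }


# Hypothetical toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
stats = slr_summary(x, y)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Each $t$-statistic is then compared with a critical value from the $t$-distribution with `df` degrees of freedom (here 3), taken from a table or statistical software.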
3.3 Prediction Intervals versus Confidence Intervals
- Confidence Interval for the Mean Response $E(Y|X_0)$: Provides an interval estimate for the average value of $Y$ for a given $X_0$.
$\hat{Y}_0 \pm t_{(n-2, \alpha/2)} \cdot s_e \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_{XX}}}$
- Prediction Interval for a New Observation $Y_0$: Provides an interval estimate for a single future observation $Y_0$ for a given $X_0$. It is wider than the confidence interval for the mean response because it accounts for both the uncertainty in estimating the mean and the inherent variability of individual observations.
$\hat{Y}_0 \pm t_{(n-2, \alpha/2)} \cdot s_e \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_{XX}}}$
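The difference in width between the two intervals can be seen numerically. The sketch below assumes a hypothetical five-point dataset and takes the critical value as an input rather than computing it: 3.182 is the tabulated two-sided 95% value of the $t$-distribution with 3 degrees of freedom. The function name `interval_half_widths` is an illustrative choice.

```python
import math


def interval_half_widths(x, y, x0, t_crit):
    """Return (y0_hat, CI half-width, PI half-width) at X = x0.

    t_crit is the critical value t_(n-2, alpha/2), supplied by the caller
    from a t table or statistical software.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar

    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(rss / (n - 2))

    leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
    ci_half = t_crit * s_e * math.sqrt(leverage)        # mean response
    pi_half = t_crit * s_e * math.sqrt(1 + leverage)    # single new observation
    return beta0 + beta1 * x0, ci_half, pi_half


# Hypothetical toy data; n = 5, so there are 3 degrees of freedom.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
T_CRIT_95_DF3 = 3.182  # tabulated value, assumed here rather than computed

y0_hat, ci_half, pi_half = interval_half_widths(x, y, x0=4, t_crit=T_CRIT_95_DF3)
print(f"point prediction: {y0_hat:.3f}")
print(f"95% CI for mean response: +/- {ci_half:.3f}")
print(f"95% PI for new observation: +/- {pi_half:.3f}")
```

Both half-widths grow as $X_0$ moves away from $\bar{X}$, and the prediction interval is always the wider of the two because of the extra 1 under the square root.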
3.4 Conditions During Application
- Data Type: Both X and Y must be quantitative and continuous or nearly continuous. For categorical data, specific transformations or different regression models are required.
- Sample Size: Larger samples are preferred (a common rule of thumb is $n$ of at least 20 to 30) so that the asymptotic properties of OLS estimators and the approximate normality given by the Central Limit Theorem can be relied on. For small $n$, the normality of errors assumption becomes critical.
- Absence of Extreme Outliers: Outliers can disproportionately influence OLS estimates and inflate standard errors, leading to misleading conclusions. Robust regression methods may be necessary in such cases.
- Domain Expertise: The relationship explored should be logically plausible based on theoretical considerations or prior empirical evidence. Blindly fitting a line without domain knowledge can lead to spurious correlations.
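The sensitivity of OLS to a single extreme point, noted in the conditions above, can be shown with a tiny example. The data below are invented for illustration: five points lying exactly on $Y = 2X$, plus one hypothetical outlier at $(6, 0)$. The helper name `ols_slope` is also an assumption.

```python
def ols_slope(x, y):
    """OLS slope estimate S_XY / S_XX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / s_xx


# Five invented points lying exactly on the line Y = 2X.
x_clean = [1, 2, 3, 4, 5]
y_clean = [2, 4, 6, 8, 10]
slope_clean = ols_slope(x_clean, y_clean)

# The same data with one hypothetical outlier appended at (6, 0).
x_outlier = x_clean + [6]
y_outlier = y_clean + [0]
slope_outlier = ols_slope(x_outlier, y_outlier)

print(slope_clean)    # 2.0
print(slope_outlier)  # about 0.286, far from 2
```

One point out of six pulls the estimated slope from 2 to roughly 0.29, which is why a scatter plot should be inspected before the fitted line is interpreted.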
4. Examiner's Breakdown
4.1 Comparative Analysis
| Feature | Simple Linear Regression (SLR) | Simple Linear Correlation (SLC) |
|---|---|---|
| Primary Objective | Prediction and estimation of cause-effect (causal if assumptions met) relationship; quantifying change in Y for unit change in X. Asymmetric in X and Y. | Quantification of linear association between two variables; strength and direction. Symmetric in X and Y. |
| Model Equation | $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ where $\epsilon_i$ is explicitly modeled with assumptions. | None explicitly. Focus on $r$ (Pearson product-moment correlation coefficient). |
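| Assumptions | Strict: linearity in parameters, random sampling, $E(\epsilon_i \mid X_i)=0$, homoscedasticity, no autocorrelation, and (for exact small-sample inference) $\epsilon_i \sim N(0, \sigma^2)$. | Fewer: both variables quantitative and the association linear; inference on $r$ conventionally assumes the pairs are a random sample from a bivariate normal population. |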
| Output Metrics | $\hat{\beta}_0, \hat{\beta}_1$ (coefficients), $s_e$ (standard error of regression), $R^2$ (coefficient of determination), SEs of coefficients, t-statistics, p-values. | $r$ (correlation coefficient). Potentially p-value for testing $r=0$. |
| Interpretation | $\hat{\beta}_1$ is the expected change in $Y$ for a one-unit change in $X$, holding other factors (captured in $\epsilon$) constant. $R^2$ is % variance in Y explained by X. | $r$ indicates strength and direction of linear association. $r=0$ implies no linear association. $r=\pm 1$ implies perfect positive/negative linear association. Does NOT imply causation. |
| Causality Implication | Can support a causal interpretation only if all CLRM assumptions are met, especially exogeneity ($E(\epsilon_i \mid X_i)=0$), and the model is correctly specified, which is difficult to prove. | None: correlation alone does not imply causation. |
| Predictive Power | High. Provides a functional form for prediction of $Y$ given $X$. Allows for prediction intervals and confidence intervals. | Limited. While 'r' indicates relationship strength, correlation itself does not provide a direct framework for predicting specific values of $Y$ based on $X$ values in the same way a regression equation does (though $r$ is a component of prediction). |
| Relationship to each other | For SLR, $R^2 = r^2$. Correlation measures the strength of the linear relationship that SLR then models. | Correlation is a prerequisite or a summary statistic that motivates or accompanies a regression analysis. |
4.2 High-Yield Marking Keywords
- Ordinary Least Squares (OLS) Estimators: $\hat{\beta}_0, \hat{\beta}_1$ derived by minimizing the Sum of Squared Residuals.
- Gauss-Markov Assumptions: Specifically, linearity, random sampling, zero conditional mean of error, homoscedasticity, no autocorrelation.
- BLUE (Best Linear Unbiased Estimator): Property of OLS estimators under the Gauss-Markov assumptions.
- Coefficient of Determination ($R^2$): Proportion of total variation in $Y$ explained by $X$.
- Pearson Product-Moment Correlation Coefficient ($r$): Quantifies strength and direction of linear association; ranges from -1 to 1.
- Homoscedasticity: Constant variance of the error term across all levels of $X$.
- Exogeneity: The independent variable $X$ is uncorrelated with the error term ($\text{Cov}(X, \epsilon) = 0$).
- Prediction Interval vs. Confidence Interval: Crucial distinction in purpose and width for single-point forecasting vs. mean response estimation.
4.3 Trapdoor Mistakes
- Inferring Causation from Correlation: Students frequently state or imply that a strong correlation ($|r|$ close to 1) means $X$ causes $Y$.
- Correct Answer: Emphasize that correlation only measures linear association and does NOT imply causation. Acknowledge possible confounding variables, reverse causality, or mere coincidence. State that establishing causation is complex and requires rigorous experimental design or satisfying very strict econometric conditions beyond mere statistical association.
- Misinterpreting $R^2$ as a measure of model adequacy or superiority: Students often assume that a high $R^2$ automatically implies a good model or a causally significant relationship.
- Correct Answer: A high $R^2$ simply means the model explains a large proportion of variance in $Y$. It does not guarantee that the model is correctly specified, unbiased, or free of assumption violations (e.g., heteroscedasticity, omitted variable bias). A low $R^2$ might still represent a statistically significant and important relationship.
- Confusing Confidence Interval for Mean Response with Prediction Interval for a New Observation: These are distinct and have different widths.
- Correct Answer: Clearly state that the confidence interval estimates the average value of $Y$ for a given $X_0$, while the prediction interval estimates a single, new observation of $Y$ for a given $X_0$. Explain that the prediction interval is always wider due to the additional uncertainty associated with individual variation. Write out both formulas to highlight the $\sqrt{1+\dots}$ term in the prediction interval.
- Ignoring or improperly analyzing diagnostic plots: Students often report regression results without checking underlying OLS assumptions.
- Correct Answer: Discuss the systematic examination of residual plots:
- Residuals vs. Fitted Values Plot: To check for homoscedasticity (should show a random scatter around zero, no discernible pattern or fanning/funneling) and linearity (no obvious curves).
- Normal Q-Q Plot of Residuals: To assess the normality of error terms (points should lie approximately along a straight diagonal line).
- Scale-Location Plot (Sqrt(|Residuals|) vs. Fitted Values): A variant for detecting heteroscedasticity more clearly, where points should be randomly scattered without any trend.
- Emphasize that violations of these assumptions (e.g., heteroscedasticity, non-normality) invalidate standard error estimates and p-values, making inference unreliable, requiring robust standard errors or transformations.
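Two of the mistakes above, trusting a high $R^2$ and skipping the residual check, can be seen together in one small example. The sketch below fits a straight line to invented data that are exactly quadratic ($Y = X^2$ for $X = 0, \dots, 10$); the helper name `fit_line` is an illustrative assumption, and printing residual signs is only a crude numeric stand-in for a Residuals vs. Fitted plot.

```python
def fit_line(x, y):
    """Closed-form OLS estimates (beta0_hat, beta1_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    return y_bar - beta1 * x_bar, beta1


# Invented data with an exactly quadratic relationship.
x = list(range(11))
y = [xi ** 2 for xi in x]

beta0, beta1 = fit_line(x, y)
fitted = [beta0 + beta1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

y_bar = sum(y) / len(y)
tss = sum((yi - y_bar) ** 2 for yi in y)
rss = sum(e ** 2 for e in residuals)
r_squared = 1 - rss / tss

# Sign of each residual in order of increasing X.
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(round(r_squared, 3))  # 0.928, which looks like a good fit
print(signs)                # ++-------++, a systematic U-shaped pattern
```

An $R^2$ near 0.93 would pass a casual inspection, yet the residuals are positive at both ends and negative in the middle, which is the curved pattern that signals a violated linearity assumption.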