CAPE PENINSULA UNIVERSITY OF TECHNOLOGY STAT151X intermediate

Statistics 1B

Comprehensive AI-generated study curriculum with 6 detailed note modules.

Course Syllabus

  1. Probability Distributions
  2. Sampling Distributions and Estimation
  3. Hypothesis Testing - One Sample
  4. Hypothesis Testing - Two Samples
  5. Simple Linear Regression and Correlation
  6. Chi-Square Tests and Non-Parametric Methods

Study Notes

Probability Distributions

1. Introduction & Overview

  • The Mental Model: A probability distribution is the blueprint of a random variable. It lists every possible outcome together with its likelihood, and in doing so fully describes how the underlying random process behaves.
  • Significance:
    • Inferential Statistics: Foundation for hypothesis testing and confidence interval estimation in varied disciplines such as biostatistics, econometrics, and quality control.
    • Risk Management: Quantifying financial risk (e.g., Value at Risk via Extreme Value Distributions) and actuarial science.
    • Machine Learning: Bayesian inference, generative models (e.g., Gaussian Mixture Models), and regularization techniques.
    • Engineering & Physics: Modeling noise, system reliability, and quantum phenomena (e.g., Bose-Einstein or Fermi-Dirac distributions).
    • Operations Research: Stochastic optimization and queuing theory (e.g., Erlang distribution).
mindmap
    root((Probability Distributions))
        Discrete Distributions
            Bernoulli(p)
                "P(X=k) = p^k (1-p)^(1-k)"
            Binomial(n, p)
                "P(X=k) = C(n,k) p^k (1-p)^(n-k)"
            Poisson(λ)
                "P(X=k) = (e^(-λ) λ^k) / k!"
            Geometric(p)
                "P(X=k) = (1-p)^(k-1) p"
            Hypergeometric(N, K, n)
                "P(X=k) = (C(K,k) C(N-K, n-k)) / C(N,n)"
        Continuous Distributions
            Uniform(a, b)
                "f(x) = 1/(b-a)"
            Normal(μ, σ^2)
                "f(x) = (1/(σ√(2π))) e^(-(x-μ)^2 / (2σ^2))"
            Exponential(λ)
                "f(x) = λ e^(-λx)"
            Gamma(α, β)
                "f(x) = (β^α / Γ(α)) x^(α-1) e^(-βx)"
            Beta(α, β)
                "f(x) = (x^(α-1) (1-x)^(β-1)) / B(α,β)"
            Chi-squared(k)
                "f(x) = (1 / (2^(k/2) Γ(k/2))) x^((k/2)-1) e^(-x/2)"
            Student's T(ν)
                "f(t) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) (1 + t^2/ν)^(- (ν+1)/2)"
        Key Concepts
            Probability Mass Function (PMF)
            Probability Density Function (PDF)
            Cumulative Distribution Function (CDF)
            Expected Value (Mean)
            Variance
            Moment Generating Function (MGF)
            Characteristic Function (CF)
            Quantile Function
            Parameters
            Support
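The discrete formulas in the mindmap can be checked numerically. The sketch below is a minimal illustration using only Python's standard library; the function names and the example values (n = 10, p = 0.3) are assumptions chosen for the demonstration, not part of the course notes. It evaluates the Binomial and Poisson PMFs and confirms that the Binomial probabilities sum to 1, with mean np and variance np(1-p).

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): e^(-lam) lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Illustrative parameters (assumed for this example).
n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(probs)                                   # should be 1
mean = sum(k * pk for k, pk in enumerate(probs))     # should equal n * p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))  # n * p * (1 - p)

print(f"sum of PMF = {total:.6f}")
print(f"mean = {mean:.4f} (n*p = {n * p:.4f})")
print(f"variance = {variance:.4f} (n*p*(1-p) = {n * p * (1 - p):.4f})")
print(f"Poisson(2): P(X=0) = {poisson_pmf(0, 2.0):.4f}")
```

Computing the mean and variance directly from the PMF, rather than from the closed-form expressions, shows how the "Expected Value" and "Variance" entries under Key Concepts follow from the distribution itself.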

2. In-Depth Theory, Equations

Sampling Distributions and Estimation

1. Introduction & Overview

  • The Mental Model: Imagine an infinite ocean of individual particles (the population), from which we draw finite buckets of water (samples). A sampling distribution is the theoretical probability distribution of a statistic (e.g., the average salinity of the water), calculated from all possible such finite buckets.
  • Significance:
    • Quantifies the uncertainty associated with sample statistics.
    • Provides the theoretical foundation for inferential statistics, including hypothesis testing and confidence intervals.
    • Enables estimation of unknown population parameters using sample data.
    • Crucial for experimental design, determining necessary sample sizes to achieve desired statistical power.
    • Underpins quality control processes in manufacturing by providing probabilistic guarantees.
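The "buckets from an ocean" picture can be simulated. The following is a rough sketch, assuming an Exponential(1) population (mean 1, standard deviation 1), 5000 repetitions per sample size, and a fixed seed; all of these are choices made for the illustration. It draws many samples, records each sample mean, and compares the spread of those means with the theoretical standard error σ/√n.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

POP_MEAN = 1.0  # an Exponential(rate=1) population has mean 1 ...
POP_SD = 1.0    # ... and standard deviation 1


def simulate_sample_means(n, reps):
    """Draw `reps` samples of size `n` and return the mean of each sample."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]


results = {}
for n in (5, 30, 100):
    means = simulate_sample_means(n, reps=5000)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(
        f"n={n:3d}  mean of sample means={results[n][0]:.3f}  "
        f"sd of sample means={results[n][1]:.3f}  "
        f"sigma/sqrt(n)={POP_SD / math.sqrt(n):.3f}"
    )
```

The simulated standard deviation of the sample means should track σ/√n closely, even though the population itself is strongly skewed.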
mindmap
  root((Sampling Distributions & Estimation))
    Sampling Distribution
      "Definition (Population Parameter vs. Sample Statistic)"
      "Role of Random Sampling"
      "Central Limit Theorem (CLT)"
        "Conditions (n, independence)"
        "Implications (Normality)"
      "Types of Sampling Distributions"
        "Sample Mean (X̄)"
        "Sample Proportion (P̂)"
        "Sample Variance (S²)"
        "Difference of Means (X̄₁ - X̄₂)"
        "Difference of Proportions (P̂₁ - P̂₂)"
    Estimation
      "Point Estimation"
        "Estimator Properties"
          "Unbiasedness"
          "Efficiency (Minimum Variance)"
          "Consistency"
          "Sufficiency"
        "Maximum Likelihood Estimation (MLE)"
          "Likelihood Function L(θ|x)"
          "Log-Likelihood ln L(θ|x)"
          "Score Function"
          "Fisher Information"
          "Cramér-Rao Lower Bound (CRLB)"
      "Interval Estimation (Confidence Intervals)"
        "Confidence Level (1-α)"
        "Margin of Error"
        "Interpretation (Frequentist vs. Bayesian)"
        "Intervals for Specific Parameters"
          "Mean (σ known, unknown)"
          "Proportion"
          "Variance"
          "Difference of Means"
          "Difference of Proportions"
    "Applications"
      "Statistical Inference"
      "Hypothesis Testing (Foundation)"
      "Quality Control"
      "Survey Design"

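As a small illustration of the interval-estimation branch of the mindmap, the sketch below computes two-sided confidence intervals for a mean with known σ and for a proportion (the normal-approximation, or Wald, interval). The function names and the example numbers are assumptions made for this illustration; the critical value comes from `statistics.NormalDist` in the standard library.

```python
import math
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided standard normal critical value z_(alpha/2)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def mean_ci_known_sigma(sample_mean, sigma, n, confidence=0.95):
    """CI for a population mean when sigma is known: xbar +/- z * sigma / sqrt(n)."""
    margin = z_critical(confidence) * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) CI for a proportion; reasonable for large n."""
    p_hat = successes / n
    margin = z_critical(confidence) * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Illustrative inputs (assumed): xbar = 50, sigma = 10, n = 100; 56 successes in 100 trials.
mean_lo, mean_hi = mean_ci_known_sigma(50.0, 10.0, 100)
prop_lo, prop_hi = proportion_ci(56, 100)

print(f"95% CI for the mean: ({mean_lo:.2f}, {mean_hi:.2f})")
print(f"95% CI for the proportion: ({prop_lo:.3f}, {prop_hi:.3f})")
```

The half-width of each interval is the margin of error. Quadrupling n halves it, because the standard error scales with 1/√n.
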
2. In-Depth Theory, Equations & Mechanisms

2.1 Populations and Samples

The population ($\mathcal{P}$) refers to the entire collection of objects or individual

Hypothesis Testing - One Sample

1. Introduction & Overview

  • The Mental Model: Hypothesis testing functions as the judicial system of statistical inference, where sample data acts as evidence presented to evaluate a pre-defined claim (null hypothesis) against an alternative, much like a prosecutor presents evidence to challenge the presumption of innocence.
  • Significance:
    • Quality Control: Determining if a manufacturing process consistently produces items within specified tolerance limits (e.g., mean weight of a product).
    • Environmental Monitoring: Assessing if pollutant levels in a water body exceed a regulatory maximum threshold.
    • Clinical Trials: Evaluating if a new drug significantly alters a physiological parameter compared to a known baseline or placebo.
    • Economic Analysis: Testing if average household income in a region deviates from a national average.
    • Scientific Research: Validating experimental observations against theoretical predictions or existing knowledge.
mindmap
    root((Hypothesis Testing - One Sample))
        Central Concepts
            Null Hypothesis (H₀)
            Alternative Hypothesis (H₁)
            Test Statistic
            "P-value Calculation"
            "Significance Level (α)"
            "Decision Rule (Reject/Fail to Reject)"
        Assumptions
            Random Sampling
            "Independence of Observations"
            "Population Distribution (Normal, or Sufficient N for CLT)"
            "Known/Unknown Population Variance"
        Test Types
            "Z-test (for means, known σ)"
            "t-test (for means, unknown σ)"
            "Z-test (for proportions)"
        Errors
            Type I Error (α)
            Type II Error (β)
            "Power (1-β)"
        Applications
            "Quality Control"
            "Clinical Research"
            "Policy Evaluation"
            "Scientific Validation"

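The relationship between α, β, and power in the mindmap can be made concrete. The sketch below is a minimal example for an upper-tailed Z-test (H₀: μ = μ₀ against H₁: μ > μ₀) with known σ; the function name and the numerical inputs are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def power_upper_tailed_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the upper-tailed Z-test of H0: mu = mu0 when the true mean is mu1.

    Power = P(reject H0 | mu = mu1) = 1 - Phi(z_alpha - (mu1 - mu0) / (sigma / sqrt(n))).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)          # critical value for level alpha
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))     # true effect in standard-error units
    return 1 - std_normal.cdf(z_alpha - shift)


# Illustrative scenario (assumed): mu0 = 100, true mean 103, sigma = 10.
for n in (25, 50, 100):
    power = power_upper_tailed_z(100.0, 103.0, 10.0, n)
    print(f"n={n:3d}  power={power:.3f}  beta={1 - power:.3f}")
```

When the true mean equals μ₀ the function returns α, since rejecting a true null is by definition a Type I error. Increasing n raises power and therefore lowers β.
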
2. In-Depth Theory, Equations & Mechanisms

Hypothesis testing, at its core, is a formal procedure for making an informed decision about a population parameter based on sample data. The framework begins with the formulation of two competing hypotheses: the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$).
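The procedure can be sketched for the simplest case, a one-sample Z-test for a mean with known σ. The function name, the `alternative` argument, and the example numbers are assumptions made for this illustration, not the notes' own notation.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alternative="two-sided"):
    """Z-test of H0: mu = mu0 when the population sd `sigma` is known.

    Returns the test statistic z and the p-value for the chosen alternative.
    """
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    cdf = NormalDist().cdf
    if alternative == "two-sided":
        p_value = 2 * (1 - cdf(abs(z)))
    elif alternative == "greater":
        p_value = 1 - cdf(z)
    elif alternative == "less":
        p_value = cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value


# Illustrative data (assumed): xbar = 52 from n = 25 observations, H0: mu = 50, sigma = 5.
alpha = 0.05
z, p_value = one_sample_z_test(sample_mean=52.0, mu0=50.0, sigma=5.0, n=25)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.3f}, p-value = {p_value:.4f}, decision at alpha={alpha}: {decision}")
```

When σ is unknown, the same structure applies with the sample standard deviation and a t distribution with n - 1 degrees of freedom in place of the standard normal.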

2.1. Hypotheses Formulation

  • Null Hypothesis ($H_0$): A statement of no effect, no difference, or no relationship. It represents the status quo, the unchallenged assumption, or th

Hypothesis Testing - Two Samples

1. Introduction & Overview

  • The Mental Model: Hypothesis testing for two samples is akin to a forensic comparison, meticulously evaluating whether observed differences between two distinct sets of evidence (data) are genuine and statistically significant, or merely artifacts of random variability, thereby determining if distinct underlying processes are at play.
  • Significance:
    • Medical Research: Comparing efficacy of two drugs (Drug A vs. Drug B), comparing incidence rates of disease in treated vs. control groups.
    • Engineering Quality Control: Assessing if two production lines yield products with significantly different defect rates or tensile strengths.
    • Social Sciences: Determining if two demographic groups (Gender A vs. Gender B, Age Group X vs. Age Group Y) exhibit statistically different mean scores on a psychological construct.
    • Business Analytics: Evaluating if a new marketing strategy (Strategy A) results in significantly higher sales conversion rates than an old one (Strategy B).
    • Environmental Science: Comparing pollutant levels in two different geographical regions or at two different time points.
mindmap
  root((Hypothesis Testing - Two Samples))
    Objectives
      "Compare means (Quantitative Data)"
        "Independent Samples"
          "Known Variance"
          "Unknown Variance (Pooled)"
          "Unknown Variance (Welch's)"
        "Paired Samples"
      "Compare proportions (Categorical Data)"
        "Independent Samples"
        "Known N, P"
      Compare variances
        "F-test"
    Assumptions
      "Independence"
      "Normality"
      "Homoscedasticity"
      "Random Sampling"
    Test Statistics
      "t-statistic"
      "z-statistic"
      "F-statistic"
    Decision Rule
      "p-value approach"
      "Critical value approach"
    "Type I Error (α)"
    "Type II Error (β)"
    "Power (1-β)"

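For the "Unknown Variance (Welch's)" case in the mindmap, the test statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula) can be computed directly. This is a minimal sketch with made-up data and an assumed function name. It stops at the statistic and degrees of freedom, because the standard library has no t-distribution CDF; the p-value would come from a t table or a statistics package.

```python
import math
import statistics


def welch_t(sample1, sample2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Does not assume equal population variances.
    """
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.fmean(sample1), statistics.fmean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)  # n-1 divisor

    se_sq = var1 / n1 + var2 / n2                 # squared standard error of the difference
    t_stat = (mean1 - mean2) / math.sqrt(se_sq)
    df = se_sq**2 / ((var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1))
    return t_stat, df


# Illustrative samples (invented for this example).
line_a = [1.0, 2.0, 3.0, 4.0, 5.0]
line_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, df = welch_t(line_a, line_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

With equal sample sizes and equal sample variances, as in this toy data, the Welch degrees of freedom reduce to n₁ + n₂ - 2, the same value the pooled test uses.
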
2. In-Depth Theory, Equations & Mechanisms

Hypothesis testing for two samples primarily involves comparing parameters (means, proportions, variances) from two distinct populations based on sample data. The fundamental principle remains the construction of a null hypothesis ($H_0$), representing no difference, and an alternative hypothesis ($H_1$ or $H_a$), representing a significant difference.
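As one concrete instance, the comparison of two proportions under H₀: p₁ = p₂ can be sketched with a pooled-proportion Z-test. The function name and the counts below are assumptions chosen for illustration, and the normal approximation is only reasonable when both samples are large.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided Z-test of H0: p1 = p2 using the pooled sample proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                       # estimate of the common p under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative counts (assumed): 60 of 100 conversions vs. 45 of 100.
z, p_value = two_proportion_z_test(60, 100, 45, 100)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

The pooled estimate is used in the standard error because, under the null hypothesis of no difference, both samples estimate the same proportion.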

2.1 Comparison of Two Population Means ($\mu_1 - \mu_2$)

2.1.1 Independent Samples, Po

Simple Linear Regression and Correlation

1. Introduction & Overview

  • The Mental Model: Imagine drawing the single straight line that passes as close as possible to a cloud of scattered data points. The slope tells you how much the response changes per unit change in the continuous input, the scatter around the line reflects random variation the model cannot explain, and the fitted line gives predictions for new input values.
  • Significance:
    • Financial Forecasting: Predicting stock prices, commodity futures, or economic indicators based on historical data and related variables (e.g., GDP, interest rates).
    • Biomedical Research: Modeling drug dosage response curves (e.g., concentration of drug vs. physiological effect) or correlating genetic markers with disease susceptibility.
    • Engineering Diagnostics: Predicting material fatigue life based on stress cycles, or estimating energy consumption from ambient temperature and operational load.
    • Environmental Science: Relating pollutant concentrations to emission sources, or predicting agricultural yields based on rainfall and fertilizer application.
    • Quality Control: Establishing relationships between manufacturing process parameters (e.g., temperature, pressure) and product quality metrics (e.g., tensile strength, purity).
mindmap
  root((Simple Linear Regression & Correlation))
    "Fundamentals"
      "Deterministic vs. Stochastic"
      "Population vs. Sample Regression Function"
      "Assumptions (Gauss-Markov)"
    "Regression Analysis"
      "Model Specification"
        "Y_i = beta_0 + beta_1 * X_i + epsilon_i"
      "Parameter Estimation (OLS)"
        "Normal Equations"
        "Beta hats"
      "Goodness-of-Fit"
        "R-squared"
        "Standard Error of Regression"
    "Correlation Analysis"
      "Pearson Product-Moment Coefficient (r)"
      "Properties of r"
      "Covariance"
    "Inference"
      "Hypothesis Testing (t-tests, F-tests)"
      "Confidence Intervals"
      "Prediction Intervals"
    "Diagnostics"
      "Residual Analysis"
        "Homoscedasticity"
        "Normality"
        "Independence"
      "Outliers & Influential Points"

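The OLS branch of the mindmap (normal equations, beta hats, R-squared, standard error of regression) can be sketched in a few lines. The function name, the returned dictionary keys, and the data are assumptions for this illustration. The estimates use the closed-form solution of the normal equations: slope = Sxy / Sxx and intercept = ȳ - slope · x̄.

```python
import math
import statistics


def fit_simple_linear_regression(x, y):
    """OLS fit of y = b0 + b1 * x, with Pearson r, R-squared and standard error."""
    n = len(x)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)

    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx                      # beta_1 hat
    intercept = y_bar - slope * x_bar      # beta_0 hat
    r = sxy / math.sqrt(sxx * syy)         # Pearson correlation coefficient

    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return {
        "slope": slope,
        "intercept": intercept,
        "r": r,
        "r_squared": 1 - sse / syy,
        "std_error": math.sqrt(sse / (n - 2)),  # standard error of regression
    }


# Illustrative data (invented): roughly y = 2x with small noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = fit_simple_linear_regression(x, y)
print({name: round(value, 4) for name, value in fit.items()})
```

In simple linear regression R-squared equals the square of Pearson's r, which the output can be used to confirm.
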
2. In-Depth Theory, Equations & Mechanisms

Simple Linear Regression (SLR) models the relationship between two continuous quantitative variables: a dependent variable, $Y$, and an independent variable, $X$. This relationship is assumed to be linear in its parameters. Correlation quantifies the streng

Chi-Square Tests and Non-Parametric Methods

1. Introduction & Overview

  • The Mental Model: Chi-square tests quantify the discrepancy between observed frequencies and those expected under a null hypothesis, while non-parametric methods provide robust inferential conclusions without stringent distributional assumptions, particularly concerning population parameters like means or variances.
  • Significance:
    • Categorical Data Analysis: Essential for analyzing qualitative data, market research, epidemiological studies, and social sciences where observations fall into discrete categories.
    • Distributional Assumptions: Crucial when data deviate substantially from normality, when sample sizes are small, or when working with ordinal scales. In these situations parametric tests can have inflated Type I or Type II error rates because their assumptions are violated.
    • Robustness: Offers statistical inference when population distributions are unknown or highly skewed, enhancing external validity in fields like psychometrics or medical trials.
    • Hypothesis Testing Versatility: Applicable across a wide range of hypothesis testing scenarios, from association between variables to comparing medians or distributions.
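The idea of measuring the discrepancy between observed and expected frequencies can be illustrated with a goodness-of-fit statistic, χ² = Σ (Oᵢ - Eᵢ)² / Eᵢ. The sketch below uses invented die-roll counts and an assumed function name. The critical value 11.070 is the standard χ² table value for 5 degrees of freedom at α = 0.05, hard-coded because the standard library has no chi-square distribution.

```python
def chi_square_gof(observed, expected_probs):
    """Chi-square goodness-of-fit statistic, degrees of freedom and expected counts."""
    n = sum(observed)
    expected = [n * p for p in expected_probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # no parameters estimated from the data
    return statistic, df, expected


# Illustrative data (invented): 60 rolls of a die, H0: the die is fair.
observed = [8, 9, 12, 11, 6, 14]
statistic, df, expected = chi_square_gof(observed, [1 / 6] * 6)

CRITICAL_VALUE_DF5_ALPHA05 = 11.070  # chi-square table value for df = 5, alpha = 0.05
all_expected_ok = all(e >= 5 for e in expected)  # common rule of thumb for validity
decision = "reject H0" if statistic > CRITICAL_VALUE_DF5_ALPHA05 else "fail to reject H0"

print(f"chi-square = {statistic:.2f}, df = {df}, expected counts ok: {all_expected_ok}")
print(f"decision at alpha = 0.05: {decision}")
```

Here the statistic is well below the critical value, so these counts give no evidence against a fair die.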

```mermaid
mindmap
  root((Chi-Square Tests & Non-Parametric Methods))
    Chi-Square Tests
      "Goodness-of-Fit Test (GOF)"
        "Univariate Categorical Data"
        "Compares Observed vs. Expected Frequencies"
        "Hypothesis: H0: Data fits specified distribution"
        "Formula: χ² = Σ [(Oi - Ei)² / Ei]"
      "Test of Independence"
        "Bivariate Categorical Data"
        "Compares Observed vs. Expected Frequencies in Contingency Tables"
        "Hypothesis: H0: Variables are independent"
        "Formula: χ² = Σ Σ [(Oij - Eij)² / Eij]"
      "Test of Homogeneity"
        "Compares Distribution across Groups"
        "Similar to Independence, but one variable is fixed"
        "Hypothesis: H0: Distributions are homogeneous"
      "Assumptions (Chi-Square)"
        "Expected Frequencies ≥ 5 (in at least 80% of cells)"
        "Independence of Observations"
        "Random Sampling"
        "Nominal or Ordinal Data"
    "Non-Parametric Methods"
      "Alternatives to Parametric Tests"
      "Robustness to Outliers"
      "No Distributional Assumptions (e.g., Normality)"
```