Regression and Big Data Basics

Simple linear regression output, assumption checks, and introductory big-data concepts for Level I.

Level I regression is not a machine-learning competition. It is a structured way to describe how one variable changes with another and how much confidence you should place in that relationship. The exam usually tests interpretation: what the slope says, what the intercept means, whether the fit is useful, and what kind of assumption problem is present.

The Core Model

In simple linear regression, the basic model is:

$$ Y = b_0 + b_1 X + \varepsilon $$

You do not need to treat this as abstract notation. It simply says:

  • $b_0$ is the intercept, the model's estimate of $Y$ when $X = 0$
  • $b_1$ is the estimated change in $Y$ for a one-unit change in $X$
  • $\varepsilon$ is the error term: the part of $Y$ the model leaves unexplained, which the fitted residuals estimate
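The two coefficients above can be estimated with the usual least-squares formulas. A minimal sketch, using made-up illustrative data (the `x` and `y` values are hypothetical, not from the curriculum):

```python
# Ordinary least squares for y = b0 + b1*x, closed-form.

def ols_simple(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations.
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    b1 = cov_xy / var_x
    b0 = mean_y - b1 * mean_x  # the fitted line passes through the means
    return b0, b1

# Hypothetical sample: y rises by roughly 2 per unit of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols_simple(x, y)
print(round(b0, 3), round(b1, 3))  # → 0.05 1.99
```

The slope comes out near 2 and the intercept near 0, matching how the sample was constructed.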

What The Output Is Really Saying

| Output item | Practical interpretation | Common trap |
| --- | --- | --- |
| Intercept | Estimated value of $Y$ when $X = 0$ | Assuming it is always economically meaningful |
| Slope coefficient | Direction and magnitude of the estimated relationship | Calling it causal automatically |
| $R^2$ | Fraction of variation in $Y$ explained by the model | Reading a high $R^2$ as proof of correctness |
| Standard error of estimate | Typical size of residual error | Ignoring it when the fit looks visually neat |
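Both fit statistics in the table come straight from the residuals. A sketch, assuming a hypothetical fitted line with $b_0 = 0.05$ and $b_1 = 1.99$ (all numbers illustrative):

```python
# R^2 and standard error of estimate (SEE) from observed vs fitted values.

y     = [2.1, 3.9, 6.2, 7.8, 10.1]       # observed (hypothetical)
y_hat = [2.04, 4.03, 6.02, 8.01, 10.00]  # fitted from b0=0.05, b1=1.99
n, k = len(y), 1                          # k = one independent variable

mean_y = sum(y) / n
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
sst = sum((yi - mean_y) ** 2 for yi in y)              # total variation
r_squared = 1 - sse / sst
see = (sse / (n - k - 1)) ** 0.5  # residual standard error, df = n - 2
print(round(r_squared, 4), round(see, 4))
```

Even with $R^2$ near 1, the SEE still tells you the typical size of a miss, which is why the table warns against ignoring it when the fit looks visually neat.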

Level I is especially likely to test the difference between association and causation. A regression can describe a relationship without proving why it exists.

Assumptions Matter Because Residual Problems Change The Conclusion

When assumptions break, the result may become less reliable even if the coefficient sign looks attractive. At Level I, the important habit is to recognize the type of problem:

  • heteroskedasticity means error variance is not constant
  • serial correlation means residuals are related across time
  • omitted-variable bias means the model left out a relevant driver
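Serial correlation, for example, is commonly screened with a Durbin-Watson style statistic on the residuals: values near 2 suggest little first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation. A minimal sketch with a made-up residual series:

```python
# Durbin-Watson statistic on a residual series.

def durbin_watson(residuals):
    """Ratio of summed squared first differences to summed squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals (hypothetical) push the statistic well above 2,
# a sign of negative serial correlation.
dw = durbin_watson([0.5, -0.5, 0.5, -0.5, 0.5, -0.5])
print(round(dw, 2))  # → 3.33
```

The point for Level I is diagnostic, not corrective: a statistic far from 2 flags that the usual standard errors, and hence the reported significance, may be unreliable.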

You are usually not being asked to fix the full model. You are being asked to identify why the reported inference may be weaker than it first appears.

Where Big Data Enters At Level I

The curriculum’s big-data introduction is intentionally light. The exam usually wants vocabulary and judgment, not production data engineering. You should be ready to distinguish:

  • structured from unstructured data
  • supervised from unsupervised learning at a high level
  • more data from better data
  • prediction accuracy from economic usefulness

The strongest answers stay skeptical. Large data sets can reveal patterns, but they can also amplify bad assumptions, poor labels, and overfitting.
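Overfitting, the last item on that list, is easy to see in a toy setting. A sketch with entirely synthetic data: a "model" that memorizes its training sample fits it perfectly but generalizes poorly, while a plain linear rule does fine out of sample.

```python
# Toy overfitting illustration; all data points are synthetic.

train = {1.0: 2.0, 2.0: 4.1, 3.0: 5.9}  # x -> y pairs seen in training
test  = {4.0: 8.0, 5.0: 10.1}           # unseen x -> y pairs

def memorizer(x):
    # Looks up the memorized training answer; returns 0 for anything new.
    return train.get(x, 0.0)

def simple_rule(x):
    # Plain linear rule y ~ 2x, applied everywhere.
    return 2.0 * x

def mse(model, data):
    """Mean squared error of a model over a dict of x -> y pairs."""
    return sum((model(x) - y) ** 2 for x, y in data.items()) / len(data)

print(mse(memorizer, train), mse(simple_rule, train))  # memorizer wins in-sample
print(mse(memorizer, test), mse(simple_rule, test))    # but loses badly out of sample
```

In-sample error alone rewards memorization; judging a model on data it has not seen is what exposes the problem.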

How CFA-Style Questions Usually Test This

  • by asking what the slope coefficient means in plain language
  • by checking whether you confuse explanatory power with causation
  • by presenting a residual problem and asking what reliability concern it creates
  • by contrasting big-data promise with model-risk reality

Mini-Case

An analyst regresses stock returns on market returns and reports a positive slope with a high $R^2$. A weak candidate concludes the market return fully causes the stock return. A stronger candidate says the market return explains a meaningful share of observed variation in the sample, but that statement alone does not prove a complete causal mechanism.
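A useful fact behind this case: in simple linear regression, $R^2$ is just the squared sample correlation between $X$ and $Y$, a symmetric measure that carries no direction of causation. A sketch with hypothetical return series:

```python
# R^2 in simple regression equals the squared correlation of X and Y.

def correlation(x, y):
    """Sample correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical periodic returns; the stock tracks the market closely.
market = [0.01, -0.02, 0.03, 0.015, -0.01]
stock  = [0.012, -0.025, 0.035, 0.02, -0.008]
r = correlation(market, stock)
print(round(r ** 2, 3))  # squared correlation = the regression R^2
```

The statistic is identical whether you regress stock on market or market on stock, which is exactly why a high $R^2$ cannot by itself establish a causal direction.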

Common Traps

  • turning correlation into causation without justification
  • treating a high $R^2$ as proof that the model is correct
  • ignoring residual-pattern warnings because the coefficient sign matches intuition
  • assuming that more data automatically means less model risk

Sample CFA-Style Question

A candidate says a regression with a statistically significant positive slope proves that increases in $X$ cause increases in $Y$. What is the strongest reply?

Best answer: Statistical significance shows evidence of an association in the sample, not automatic proof of causation.

Why: Level I often checks whether you can interpret regression output without overstating what the model demonstrates.
