Simple linear regression output, assumption checks, and introductory big-data concepts for Level I.
Level I regression is not a machine-learning competition. It is a structured way to describe how one variable changes with another and how much confidence you should place in that relationship. The exam usually tests interpretation: what the slope says, what the intercept means, whether the fit is useful, and what kind of assumption problem is present.
In simple linear regression, the basic model is:
$$ Y = b_0 + b_1 X + \varepsilon $$
You do not need to treat this as abstract notation. It simply says: the dependent variable $Y$ equals a baseline level $b_0$ (the intercept), plus $b_1$ units of change for each one-unit change in the independent variable $X$ (the slope), plus an error term $\varepsilon$. The exam focuses on reading the standard output:
| Output item | Practical interpretation | Common trap |
|---|---|---|
| Intercept | Estimated value of $Y$ when $X = 0$ | Assuming it is always economically meaningful |
| Slope coefficient | Direction and magnitude of the estimated relationship | Calling it causal automatically |
| $R^2$ | Fraction of variation in $Y$ explained by the model | Reading a high $R^2$ as proof of correctness |
| Standard error of estimate | Typical size of residual error | Ignoring it when the fit looks visually neat |
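The output items in the table can be computed directly from the least-squares formulas. The sketch below uses NumPy with made-up data values (the numbers are purely illustrative, not from the curriculum):

```python
import numpy as np

# Illustrative data (hypothetical values chosen for a clean example)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # independent variable X
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])   # dependent variable Y
n = len(x)

# Slope = Cov(X, Y) / Var(X); intercept from the sample means
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
sse = np.sum(residuals ** 2)              # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)         # total sum of squares
r_squared = 1 - sse / sst                 # fraction of variation explained
see = np.sqrt(sse / (n - 2))              # standard error of estimate

print(f"intercept b0 = {b0:.3f}")
print(f"slope     b1 = {b1:.3f}")
print(f"R^2          = {r_squared:.3f}")
print(f"SEE          = {see:.3f}")
```

Note the `n - 2` in the standard error of estimate: two degrees of freedom are used up estimating the intercept and the slope.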
Level I is especially likely to test the difference between association and causation. A regression can describe a relationship without proving why it exists.
When assumptions break, the results may become less reliable even if the coefficient sign looks attractive. At Level I, the important habit is to recognize the type of problem:

- Nonlinearity: the true relationship between $X$ and $Y$ is not a straight line.
- Heteroskedasticity: the variance of the residuals changes with the level of $X$.
- Serial correlation: residuals are correlated across observations, which is common in time series.
- Non-normal residuals: small-sample inference based on t-statistics becomes less dependable.

You are usually not being asked to fix the full model. You are being asked to identify why the reported inference may be weaker than it first appears.
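A minimal sketch of this "recognize the problem" habit: fit a line to simulated data, then compute two rough residual diagnostics. The thresholds and data here are hypothetical illustrations, not formal curriculum tests:

```python
import numpy as np

# Simulated well-behaved linear data (hypothetical, seeded for repeatability)
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, size=x.size)

# Least-squares fit
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# Rough heteroskedasticity signal: do residual magnitudes grow with X?
het_corr = np.corrcoef(x, np.abs(resid))[0, 1]

# Rough serial-correlation signal: lag-1 autocorrelation of residuals
ac1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

print(f"corr(|resid|, x) = {het_corr:.3f}  (near 0 suggests constant variance)")
print(f"lag-1 autocorr   = {ac1:.3f}  (near 0 suggests independent errors)")
```

On well-behaved data both diagnostics sit near zero; a clear pattern in either one is the kind of assumption problem the exam asks you to name, not to repair.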
The curriculum’s big-data introduction is intentionally light. The exam usually wants vocabulary and judgment, not production data engineering. You should be ready to distinguish:

- Structured data (organized in rows and columns) from unstructured data (text, images, audio).
- Traditional data sources (financial statements, market prices) from alternative data (web traffic, satellite imagery, social media).
- The defining characteristics of big data: volume, velocity, and variety.
The strongest answers stay skeptical. Large data sets can reveal patterns, but they can also amplify bad assumptions, poor labels, and overfitting.
An analyst regresses stock returns on market returns and reports a positive slope with a high $R^2$. A weak candidate concludes the market return fully causes the stock return. A stronger candidate says the market return explains a meaningful share of observed variation in the sample, but that statement alone does not prove a complete causal mechanism.
A candidate says a regression with a statistically significant positive slope proves that increases in $X$ cause increases in $Y$. What is the strongest reply?
Best answer: Statistical significance shows evidence of an association in the sample, not automatic proof of causation.
Why: Level I often checks whether you can interpret regression output without overstating what the model demonstrates.