Machine Learning, Overfitting, and Big Data Workflow

How Level II tests supervised and unsupervised machine learning, overfitting control, algorithm choice, and big-data project workflow.

The machine-learning portion of Level II is not there to turn candidates into data scientists. It is there to make sure they can identify which learning approach fits which problem, recognize overfitting risk, and evaluate whether a data project actually improves the investment decision.

Why This Lesson Matters

Machine-learning language can make weak analysis sound sophisticated. Level II pushes back against that.

  • A more complex algorithm is not automatically a better model.
  • Training accuracy is not the same thing as decision quality.
  • Feature engineering can add signal or noise.
  • A data project fails if the workflow is weak, even when the model label sounds impressive.

Separate The Main Learning Families

  • Supervised learning: learns from labeled outcomes. Typical curriculum use: prediction, classification, and forecasting tasks.
  • Unsupervised learning: finds structure without labeled targets. Typical curriculum use: clustering, dimensionality reduction, and pattern discovery.
  • Deep learning: flexible layered modeling for complex patterns. Typical curriculum use: broader conceptual awareness rather than detailed implementation.

The exam usually asks what problem a method is best suited for, not how to code it.

Algorithm Choice Should Follow The Problem

  • Penalized regression: prediction with many variables and a risk of overfitting.
  • Support vector machine: classification or boundary-finding problems.
  • k-nearest neighbor: local pattern-based classification or prediction.
  • Classification and regression tree (CART): nonlinear decision rules and interpretable splitting logic.
  • Ensemble learning / random forest: strong predictive performance through model combination.
  • Principal components analysis: dimensionality reduction.
  • k-means or hierarchical clustering: group discovery without labeled targets.

Level II often tests whether the candidate can match the algorithm family to the problem rather than recite algorithm names.
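To make the penalized-regression row concrete, here is a minimal sketch of ridge regression's closed-form solution, showing how a larger penalty shrinks coefficient size, which is the mechanism that curbs overfitting when there are many variables. The design matrix, target, and lambda values below are invented for illustration; this is not curriculum code.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^(-1) X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Small made-up design matrix and target (illustration only).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 3.0, 7.0, 7.0])

beta_light = ridge_coefficients(X, y, lam=0.01)   # weak penalty
beta_heavy = ridge_coefficients(X, y, lam=100.0)  # strong penalty

# A heavier penalty pulls the coefficients toward zero.
print(np.linalg.norm(beta_light), np.linalg.norm(beta_heavy))
```

The shrinkage is the whole point: as the penalty grows, the model is forced toward simpler fits even when many candidate variables are available.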

Overfitting Is A Core Risk

  • Excellent training performance but weak test performance: the model learned noise rather than a durable pattern.
  • Too many features relative to data depth: noise can overwhelm signal.
  • Excess model complexity: interpretability and stability can deteriorate.
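The first signal can be reproduced with a deliberately overfit toy model. The sketch below (made-up one-dimensional data, not from the curriculum) uses a 1-nearest-neighbor classifier, which memorizes the training set, including one mislabeled point, so in-sample accuracy is perfect while out-of-sample accuracy suffers.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x (1-NN)."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]

# Made-up data: the true rule is "A below 3.5, B above", but the
# training point at x=3 carries a noisy (wrong) label.
train = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "B"), (6, "B")]
test = [(1.5, "A"), (2.9, "A"), (3.1, "B"), (5.5, "B")]

def accuracy(model_data, points):
    hits = sum(predict_1nn(model_data, x) == label for x, label in points)
    return hits / len(points)

train_acc = accuracy(train, train)  # 1-NN always recalls its own points
test_acc = accuracy(train, test)    # the memorized noise now hurts

print(train_acc, test_acc)  # prints 1.0 0.75
```

Perfect training accuracy here says nothing about predictive quality; the memorized noisy label is exactly what drags down the test result.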

The curriculum expects you to know that cross-validation, penalization, train-test discipline, and simpler model choices are all ways to fight overfitting.
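Of those defenses, train-test discipline and cross-validation are the easiest to sketch. The helper below is a simplified illustration, not a library API: it partitions sample indices into k folds so that every observation is held out exactly once.

```python
def kfold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous (train, test) pairs."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        # Early folds absorb the remainder so all samples are used.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        folds.append((train, test))
        start = stop
    return folds

for train_idx, test_idx in kfold_indices(10, 3):
    print(len(train_idx), len(test_idx))  # prints 6 4, then 7 3, then 7 3
```

Training on each fold's train set and scoring on its held-out test set gives k out-of-sample estimates instead of one optimistic in-sample number.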

Big Data Work Is A Workflow, Not Just A Model Choice

    flowchart LR
        A["Define the decision problem"] --> B["Acquire and prepare data"]
        B --> C["Wrangle, clean, and explore"]
        C --> D["Engineer and select features"]
        D --> E["Train candidate models"]
        E --> F["Evaluate fit and out-of-sample usefulness"]
        F --> G["Interpret results for the investment problem"]

This is why the curriculum includes wrangling, exploration, text handling, and evaluation alongside algorithms.
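The same workflow can be sketched as a chain of small functions. Everything below is invented for illustration: toy records, a single engineered feature, and no real model (the training and evaluation stages are elided). The point is that each workflow stage is an explicit, inspectable step rather than something hidden inside one model call.

```python
# Toy workflow sketch; the records and the "above-median" feature are invented.
raw = [{"ticker": "AAA", "revenue_growth": " 0.12 "},
       {"ticker": "BBB", "revenue_growth": "0.02"},
       {"ticker": "CCC", "revenue_growth": None},   # missing value
       {"ticker": "DDD", "revenue_growth": "0.30"}]

def wrangle(records):
    """Drop rows with missing values and coerce strings to floats."""
    return [{"ticker": r["ticker"], "growth": float(r["revenue_growth"])}
            for r in records if r["revenue_growth"] is not None]

def engineer(records):
    """Engineer one binary feature: above-median revenue growth."""
    growths = sorted(r["growth"] for r in records)
    median = growths[len(growths) // 2]
    return [(r["ticker"], r["growth"] > median) for r in records]

def interpret(features):
    """Map the feature back to the investment question."""
    return [ticker for ticker, high_growth in features if high_growth]

clean = wrangle(raw)
flags = engineer(clean)
print(interpret(flags))  # prints ['DDD']
```

If the wrangling step silently kept the bad row, every downstream stage would inherit the error, which is why the curriculum treats preparation as part of the project rather than an afterthought.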

Text Data Needs Preparation Before It Becomes Signal

  • Cleaning and wrangling: raw text is messy and inconsistent.
  • Feature extraction: the model needs a numeric or structured representation.
  • Feature selection or engineering: too many weak textual features can worsen noise.
  • Forecast interpretation: text-based signals must still be tied back to the financial question.

The exam is usually testing process quality here, not programming detail.
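Feature extraction is the step candidates most often find abstract, so here is a minimal bag-of-words sketch that turns raw text into the numeric representation a model can consume. The two sentences are made up, and a real project would also handle stop words, stemming, and n-grams.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace after stripping basic punctuation."""
    return [w.strip(".,;:!?").lower() for w in text.split()]

def bag_of_words(docs):
    """Build a shared vocabulary, then count-vectorize each document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["Margins improved, guidance raised.",
        "Guidance cut; margins under pressure."]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Note that the vocabulary grows with every distinct word, which is exactly why the feature-selection row above matters: hundreds of weak textual features can bury the few that carry signal.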

How CFA-Style Questions Usually Test This

  • by asking which learning family matches a labeled versus unlabeled problem
  • by asking which algorithm best suits a particular classification or forecasting task
  • by describing a model with high training accuracy and weak test performance
  • by testing whether the candidate understands that data preparation and evaluation are part of the project, not afterthoughts

Mini-Case

A research team builds a very accurate training-set classifier for credit downgrade risk using hundreds of raw text features from earnings calls. Test-set performance drops sharply.

A weak answer praises the strong training accuracy.

A stronger answer identifies likely overfitting and asks whether the feature set, training workflow, and validation process are disciplined enough for out-of-sample use.

Common Traps

  • choosing a sophisticated model because it sounds advanced
  • treating unsupervised learning as if it required labeled outcomes
  • evaluating only in-sample fit
  • ignoring data wrangling and feature engineering when judging model credibility

Sample CFA-Style Question

Which observation most strongly suggests that a supervised learning model may be overfit?

Best answer: It performs very well on the training data but materially worse on the test data.

Why: Level II often tests whether the candidate can distinguish genuine predictive signal from historical noise fitting.


Revised on Thursday, April 9, 2026