Machine Learning, Overfitting, and Big Data Workflow

How Level II tests supervised and unsupervised machine learning, overfitting control, algorithm choice, and big-data project workflow.

The machine-learning portion of Level II is not there to turn candidates into data scientists. It is there to make sure they can identify which learning approach fits which problem, recognize overfitting risk, and evaluate whether a data project actually improves the investment decision.

Why This Lesson Matters

Machine-learning language can make weak analysis sound sophisticated. Level II pushes back against that.

  • A more complex algorithm is not automatically a better model.
  • Training accuracy is not the same thing as decision quality.
  • Feature engineering can add signal or noise.
  • A data project fails if the workflow is weak, even when the model label sounds impressive.

Separate The Main Learning Families

  • Supervised learning: learns from labeled outcomes. Typical curriculum use: prediction, classification, and forecasting tasks.
  • Unsupervised learning: finds structure without labeled targets. Typical curriculum use: clustering, dimensionality reduction, and pattern discovery.
  • Deep learning: flexible layered modeling for complex patterns. Typical curriculum use: broader conceptual awareness rather than detailed implementation.

The exam usually asks what problem a method is best suited for, not how to code it.

Algorithm Choice Should Follow The Problem

  • Penalized regression: prediction with many variables and a risk of overfitting.
  • Support vector machine: classification or boundary-finding problems.
  • k-nearest neighbor: local pattern-based classification or prediction.
  • Classification and regression tree (CART): nonlinear decision rules and interpretable splitting logic.
  • Ensemble learning / random forest: strong predictive performance through model combination.
  • Principal components analysis: dimensionality reduction.
  • k-means or hierarchical clustering: group discovery without labeled targets.

Level II often tests whether the candidate can match the algorithm family to the problem rather than recite algorithm names.
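To make the penalized-regression row concrete, here is a minimal sketch of ridge regression's closed-form solution, showing how a larger penalty shrinks coefficient size, which is the mechanism that curbs overfitting when there are many variables. The design matrix, target, and lambda values below are invented for illustration; this is not curriculum code.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^(-1) X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Small made-up design matrix and target (illustration only).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 3.0, 7.0, 7.0])

beta_light = ridge_coefficients(X, y, lam=0.01)   # weak penalty
beta_heavy = ridge_coefficients(X, y, lam=100.0)  # strong penalty

# A heavier penalty pulls the coefficients toward zero.
print(np.linalg.norm(beta_light), np.linalg.norm(beta_heavy))
```

The shrinkage is the whole point: as the penalty grows, the model is forced toward simpler fits even when many candidate variables are available.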

Overfitting Is A Core Risk

  • Excellent training performance but weak test performance: the model learned noise rather than a durable pattern.
  • Too many features relative to data depth: noise can overwhelm signal.
  • Excess model complexity: interpretability and stability can deteriorate.
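The first signal can be reproduced with a deliberately overfit toy model. The sketch below (made-up one-dimensional data, not from the curriculum) uses a 1-nearest-neighbor classifier, which memorizes the training set, including one mislabeled point, so in-sample accuracy is perfect while out-of-sample accuracy suffers.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x (1-NN)."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]

# Made-up data: the true rule is "A below 3.5, B above", but the
# training point at x=3 carries a noisy (wrong) label.
train = [(1, "A"), (2, "A"), (3, "B"), (4, "B"), (5, "B"), (6, "B")]
test = [(1.5, "A"), (2.9, "A"), (3.1, "B"), (5.5, "B")]

def accuracy(model_data, points):
    hits = sum(predict_1nn(model_data, x) == label for x, label in points)
    return hits / len(points)

train_acc = accuracy(train, train)  # 1-NN always recalls its own points
test_acc = accuracy(train, test)    # the memorized noise now hurts

print(train_acc, test_acc)  # prints 1.0 0.75
```

Perfect training accuracy here says nothing about predictive quality; the memorized noisy label is exactly what drags down the test result.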

The curriculum expects you to know that cross-validation, penalization, train-test discipline, and simpler model choices are all ways to fight overfitting.
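Of those defenses, train-test discipline and cross-validation are the easiest to sketch. The helper below is a simplified illustration, not a library API: it partitions sample indices into k folds so that every observation is held out exactly once.

```python
def kfold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous (train, test) pairs."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        # Early folds absorb the remainder so all samples are used.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        folds.append((train, test))
        start = stop
    return folds

for train_idx, test_idx in kfold_indices(10, 3):
    print(len(train_idx), len(test_idx))  # prints 6 4, then 7 3, then 7 3
```

Training on each fold's train set and scoring on its held-out test set gives k out-of-sample estimates instead of one optimistic in-sample number.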

Big Data Work Is A Workflow, Not Just A Model Choice

    flowchart LR
        A["Define the decision problem"] --> B["Acquire and prepare data"]
        B --> C["Wrangle, clean, and explore"]
        C --> D["Engineer and select features"]
        D --> E["Train candidate models"]
        E --> F["Evaluate fit and out-of-sample usefulness"]
        F --> G["Interpret results for the investment problem"]

This is why the curriculum includes wrangling, exploration, text handling, and evaluation alongside algorithms.
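The same workflow can be sketched as a chain of small functions. Everything below is invented for illustration: toy records, a single engineered feature, and no real model (the training and evaluation stages are elided). The point is that each workflow stage is an explicit, inspectable step rather than something hidden inside one model call.

```python
# Toy workflow sketch; the records and the "above-median" feature are invented.
raw = [{"ticker": "AAA", "revenue_growth": " 0.12 "},
       {"ticker": "BBB", "revenue_growth": "0.02"},
       {"ticker": "CCC", "revenue_growth": None},   # missing value
       {"ticker": "DDD", "revenue_growth": "0.30"}]

def wrangle(records):
    """Drop rows with missing values and coerce strings to floats."""
    return [{"ticker": r["ticker"], "growth": float(r["revenue_growth"])}
            for r in records if r["revenue_growth"] is not None]

def engineer(records):
    """Engineer one binary feature: above-median revenue growth."""
    growths = sorted(r["growth"] for r in records)
    median = growths[len(growths) // 2]
    return [(r["ticker"], r["growth"] > median) for r in records]

def interpret(features):
    """Map the feature back to the investment question."""
    return [ticker for ticker, high_growth in features if high_growth]

clean = wrangle(raw)
flags = engineer(clean)
print(interpret(flags))  # prints ['DDD']
```

If the wrangling step silently kept the bad row, every downstream stage would inherit the error, which is why the curriculum treats preparation as part of the project rather than an afterthought.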

Text Data Needs Preparation Before It Becomes Signal

  • Cleaning and wrangling: raw text is messy and inconsistent.
  • Feature extraction: the model needs a numeric or structured representation.
  • Feature selection or engineering: too many weak textual features can worsen noise.
  • Forecast interpretation: text-based signals must still be tied back to the financial question.

The exam is usually testing process quality here, not programming detail.
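Feature extraction is the step candidates most often find abstract, so here is a minimal bag-of-words sketch that turns raw text into the numeric representation a model can consume. The two sentences are made up, and a real project would also handle stop words, stemming, and n-grams.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace after stripping basic punctuation."""
    return [w.strip(".,;:!?").lower() for w in text.split()]

def bag_of_words(docs):
    """Build a shared vocabulary, then count-vectorize each document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["Margins improved, guidance raised.",
        "Guidance cut; margins under pressure."]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Note that the vocabulary grows with every distinct word, which is exactly why the feature-selection row above matters: hundreds of weak textual features can bury the few that carry signal.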

How CFA-Style Questions Usually Test This

  • by asking which learning family matches a labeled versus unlabeled problem
  • by asking which algorithm best suits a particular classification or forecasting task
  • by describing a model with high training accuracy and weak test performance
  • by testing whether the candidate understands that data preparation and evaluation are part of the project, not afterthoughts

Mini-Case

A research team builds a very accurate training-set classifier for credit downgrade risk using hundreds of raw text features from earnings calls. Test-set performance drops sharply.

A weak answer praises the strong training accuracy.

A stronger answer identifies likely overfitting and asks whether the feature set, training workflow, and validation process are disciplined enough for out-of-sample use.

Common Traps

  • choosing a sophisticated model because it sounds advanced
  • treating unsupervised learning as if it required labeled outcomes
  • evaluating only in-sample fit
  • ignoring data wrangling and feature engineering when judging model credibility

Sample CFA-Style Question

Which observation most strongly suggests that a supervised learning model may be overfit?

Best answer: It performs very well on the training data but materially worse on the test data.

Why: Level II often tests whether the candidate can distinguish genuine predictive signal from historical noise fitting.


Revised on Thursday, April 9, 2026