Module 09

Validation & Selection

Just because your model fits the past doesn't mean it predicts the future. How to select the winner.

You have run your Bayesian sampling (Module 8) or Ridge Regression (Module 7). You don't just get one model; you often run thousands of iterations with different Adstock and Saturation parameters.

Which one is "True"? To decide, we need to act like a Referee. We judge models on three criteria: Accuracy, Stability, and Calibration.

1. The Vanity Metric: R-Squared (R²)

R² tells you how well your model's line fits the historical data dots.
An R² of 0.95 means you explained 95% of the variance. Amazing, right?

Wrong. A high R² is often a sign of Overfitting. If you have enough variables (Holiday flags, 10 media channels), you can memorize the past perfectly but fail completely at predicting next week's sales. R² is useful, but never trust it alone.
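To see the failure mode, here is a minimal sketch (assuming NumPy and scikit-learn; the data is simulated): one real driver plus 30 pure-noise "channels" produces a flattering in-sample R² and a much weaker out-of-sample one.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng()

# 52 weeks of sales driven by one real signal plus noise
signal = rng.normal(size=(52, 1))
sales = 3 * signal[:, 0] + rng.normal(scale=2, size=52)

# Pad the design matrix with 30 pure-noise "channels"
noise = rng.normal(size=(52, 30))
X = np.hstack([signal, noise])

# Train on the first 40 weeks, hold out the last 12
model = LinearRegression().fit(X[:40], sales[:40])
print("Train R²:", r2_score(sales[:40], model.predict(X[:40])))  # typically very high
print("Test  R²:", r2_score(sales[40:], model.predict(X[40:])))  # typically far lower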

2. The Business Metric: MAPE

Mean Absolute Percentage Error (MAPE) is what you tell your CFO. It answers: "On average, by what percentage is our prediction wrong?"

MAPE        Verdict
< 5%        Excellent Model
5% - 10%    Acceptable
> 15%       Needs Rework

from sklearn.metrics import mean_absolute_percentage_error

y_true = [100, 120, 130]
y_pred = [105, 115, 140]

# MAPE = mean(|actual - predicted| / |actual|)
mape = mean_absolute_percentage_error(y_true, y_pred)
print(mape)  # ≈ 0.056, i.e. roughly a 5.6% average error

3. The Stress Test: Time-Series Cross-Validation

In standard Machine Learning, you split data randomly into Train/Test. You cannot do this with Time Series because the order matters. You cannot use next week's data to predict last week's sales.

We use a Rolling Window approach: train on an initial block of weeks, test on the weeks immediately after it, then slide the window forward and repeat, averaging the error across folds.
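A minimal sketch using scikit-learn's TimeSeriesSplit (the week counts and split sizes here are illustrative assumptions): each fold trains only on weeks that come before its test window, and max_train_size keeps the training window rolling rather than expanding.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Illustrative: 104 weeks of features (media spend, etc.) and sales
X = np.random.rand(104, 5)
y = np.random.rand(104)

# max_train_size turns the default expanding window into a rolling one
tscv = TimeSeriesSplit(n_splits=5, test_size=8, max_train_size=52)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Each fold trains only on weeks that come BEFORE its test window
    print(f"Fold {fold}: train weeks {train_idx[0]}-{train_idx[-1]}, "
          f"test weeks {test_idx[0]}-{test_idx[-1]}")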

4. The Gold Standard: Lift Calibration

This is what separates "Junior" Data Scientists from "Senior" Econometricians.

Imagine your model says Facebook has an ROI of 4.0.
But last month, you ran a Geo-Lift Test (turning off Facebook ads in Ohio) and the result was an incremental ROI of 1.5.

Your model is wrong. It doesn't matter what the R² is.

Calibration involves filtering out any model iteration that deviates too far from ground-truth experiments. We calculate a "Calibration Error" metric:

def calc_calibration_error(model_roi, experimental_roi):
    # Penalize the model's distance from the experimental ground truth
    return abs(model_roi - experimental_roi) / experimental_roi

# Logic inside your selection loop (using the Facebook example above):
calibration_error = calc_calibration_error(model_roi=4.0, experimental_roi=1.5)

if calibration_error < 0.20:
    keep_model()      # within 20% of the experiment
else:
    discard_model()   # too far from ground truth (here: error ≈ 1.67, so discard)

The Selection Hierarchy
When selecting the final model to show the client, prioritize in this order (a short sketch follows the list):
1. Calibration: Does it match known experiments?
2. Stability: Do coefficients change wildly if we add 1 week of data?
3. Accuracy (MAPE): Does it predict sales well?
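A hypothetical sketch of that hierarchy in code, assuming each candidate iteration has already been scored with a calibration error, a coefficient-drift measure for stability, and an out-of-sample MAPE (the metric names and values are made up for illustration):

# Hypothetical candidate models with pre-computed metrics
candidates = [
    {"id": "A", "calibration_error": 0.10, "coef_drift": 0.05, "mape": 0.08},
    {"id": "B", "calibration_error": 0.35, "coef_drift": 0.02, "mape": 0.04},
    {"id": "C", "calibration_error": 0.15, "coef_drift": 0.20, "mape": 0.06},
]

# 1. Calibration gate: drop anything too far from the experiments
calibrated = [m for m in candidates if m["calibration_error"] < 0.20]

# 2. Stability, then 3. MAPE (ties on drift are broken by accuracy)
champion = min(calibrated, key=lambda m: (m["coef_drift"], m["mape"]))
print(champion["id"])  # "A": calibrated, stable, and accurate enough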

Once we have selected the single best model (The "Champion" Model), we are ready to interpret the results and decompose the sales.
