Module 08

The Bayesian Revolution

Stop letting the model tell you that "TV has zero impact." Injecting business logic using Priors.

In Module 7, we saw that Ridge Regression can reduce noise, but it can still produce results that defy logic. If your TV spend was flat for two years while sales grew, Ridge Regression might say: "TV contributed $0."

But you know TV works. You have seen the brand lift studies. You have 30 years of marketing theory that says it works.

This is where Bayesian Statistics wins. Instead of asking the data to speak for itself (Frequentist), we combine the data with our "Prior Beliefs."

1. The Formula for Common Sense

Bayes' Theorem can be summarized for MMM as:

Posterior ∝ Likelihood × Prior
1. Prior: "I believe TV ROI is likely between 0.5 and 1.5 based on last year."
2. Likelihood: "The current data suggests TV ROI is 0.1 because spend was flat."
3. Posterior: "The updated result: TV ROI is probably 0.6. The data pulled it down, but the Prior kept it realistic."
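
To see that arithmetic without any library, here is a minimal grid-approximation sketch in plain NumPy. The specific centres and spreads (prior around 1.0, data evidence around 0.1) are illustrative assumptions chosen to mirror the example above, not outputs of a real model.

import numpy as np

# Grid approximation of: Posterior ∝ Likelihood × Prior
# All numbers below are illustrative assumptions mirroring the example above.
roi_grid = np.linspace(-1, 3, 801)

# Prior belief: ROI is around 1.0, with most mass between 0.5 and 1.5
prior = np.exp(-0.5 * ((roi_grid - 1.0) / 0.4) ** 2)

# Evidence from the flat-spend data: a weak signal centred near 0.1
likelihood = np.exp(-0.5 * ((roi_grid - 0.1) / 0.45) ** 2)

# Multiply and normalise to get the posterior
posterior = prior * likelihood
posterior /= posterior.sum()

print("Posterior mean ROI:", round(float((roi_grid * posterior).sum()), 2))  # roughly 0.6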

2. Defining Priors

A "Prior" is a distribution. It is how you tell the model what is possible before it sees a single row of data.

In Bayesian MMM, we almost always force Positive Priors for media (using a Half-Normal or Gamma distribution), because spending money on ads should not destroy sales.
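
Before trusting a prior, it helps to check what it actually implies. The snippet below is a small sketch using PyMC's draw utility; the sigma value is an illustrative assumption, not a recommendation.

import numpy as np
import pymc as pm

# Sample from a HalfNormal(sigma=1) prior to see what it allows.
# sigma=1 is an illustrative choice, not a recommendation.
samples = pm.draw(pm.HalfNormal.dist(sigma=1), draws=10_000, random_seed=1)

print("All draws are non-negative:", bool((samples >= 0).all()))            # True: media can't hurt sales
print("Share of mass below 2.0:", round(float((samples < 2.0).mean()), 2))  # about 0.95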

3. Python Implementation (PyMC)

We use probabilistic programming libraries like PyMC (used by HelloFresh, Google) to build these models. The workflow looks different from scikit-learn: instead of calling .fit(), you write down the full probability model yourself.

import pymc as pm

# fb_spend_data, tv_spend_data and sales_data are assumed to be
# NumPy arrays (or pandas Series) of spend and sales prepared earlier.

# Define the "Probabilistic Context"
with pm.Model() as mmm_model:
    
    # 1. Define Priors (The Beliefs)
    # We use HalfNormal to force positive coefficients for media
    beta_fb = pm.HalfNormal('beta_fb', sigma=1) 
    beta_tv = pm.HalfNormal('beta_tv', sigma=2) # We think TV might have higher variance
    intercept = pm.Normal('alpha', mu=0, sigma=10)

    # 2. Define the Linear Relationship
    # Sales = Intercept + Beta * Spend
    mu = intercept + (beta_fb * fb_spend_data) + (beta_tv * tv_spend_data)

    # 3. Define the Likelihood (The Data Fit)
    # Assuming sales are normally distributed around the prediction
    sigma = pm.HalfNormal('sigma', sigma=1)
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=sales_data)

    # 4. Hit the "Magic Button" (MCMC Sampling)
    # The computer runs thousands of simulations to find the Posterior
    trace = pm.sample(draws=2000, chains=4)
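
Once sampling finishes, the trace holds thousands of posterior draws for every parameter. A minimal way to inspect them is ArviZ, PyMC's companion library; the call below summarizes the media coefficients defined in the sketch above.

import arviz as az

# Posterior mean, standard deviation and 95% highest-density interval
# for the media coefficients from the model above.
print(az.summary(trace, var_names=["beta_fb", "beta_tv"], hdi_prob=0.95))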

4. The Result: Distributions, Not Numbers

When you run model.predict() in scikit-learn, you get a single number.
When you run a Bayesian model, you get a Distribution.

It won't say: "ROI is 1.5."
It will say: "There is a 95% probability that ROI is between 1.3 and 1.7."

This is incredibly powerful for executive presentations. It allows you to talk about credible intervals (the Bayesian counterpart of confidence intervals) and risk: "We are confident that spending more on Facebook will yield an ROI of at least 1.3."
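
Here is a hedged sketch of how those statements fall straight out of the posterior draws, treating the Facebook coefficient from the model above as a rough ROI proxy and using 1.3 as an illustrative threshold.

import numpy as np

# Flatten the posterior draws for the Facebook coefficient across all chains
posterior_fb = trace.posterior["beta_fb"].values.flatten()

# "There is a 95% probability that ROI is between X and Y"
low, high = np.percentile(posterior_fb, [2.5, 97.5])
print(f"95% interval: {low:.2f} to {high:.2f}")

# "We are confident the ROI is at least 1.3" (illustrative threshold)
print("P(ROI > 1.3):", round(float((posterior_fb > 1.3).mean()), 2))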
