Module 03

Visualizing the Pulse

Exploratory Data Analysis (EDA): Finding the signal before we ever train the model.

You have your data in an Analytical Base Table (ABT) format (Module 2). The temptation is to throw it straight into a regression model to get "results." Do not do this.

MMM models are incredibly sensitive to noise. If you don't know what your data looks like—where the spikes are, which channels move together, and what the seasonality looks like—you will build a model that lies to you. This module covers the three essential Python visualizations you must run first.

1. Decomposing Time (Trend vs. Seasonality)

Sales data is a combination of three invisible forces:
1. Trend: Is the business generally growing or shrinking?
2. Seasonality: Does revenue always peak in December or on Pay-Day?
3. Residuals: The random noise (or the impact of marketing).

We use the statsmodels library to mathematically separate these layers.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load data and set Date as index
df = pd.read_csv('mmm_abt.csv', parse_dates=['date'], index_col='date')

# Decompose the Sales column (period=52 assumes weekly data with a yearly cycle)
result = seasonal_decompose(df['sales'], model='additive', period=52)

# Plot the decomposition
result.plot()
plt.show()

What to look for: If your "Residuals" graph still has a clear wavy pattern, it means your model missed a seasonality variable (e.g., you forgot to flag Easter). The residuals should look like random static.

2. The Multicollinearity Trap

This is the #1 killer of MMM models. Multicollinearity happens when two or more variables move in near-perfect sync.

Example: You always increase TV spend and Search spend at the exact same time (e.g., for a Black Friday campaign). The model cannot mathematically tell which one caused the sales spike. It might give TV huge credit and Search zero credit, or vice versa.
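To see why, here is a small synthetic demonstration (all numbers invented): two channels that move in near-lockstep, a known ground truth, and an ordinary least squares fit on each half of the data. The blended effect is recovered fine, but the individual channel coefficients are unstable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 104  # two years of weekly observations

# Search spend shadows TV spend almost perfectly (correlation > 0.99)
tv = rng.uniform(50, 150, n)
search = 0.5 * tv + rng.normal(0, 0.2, n)

# Ground truth: each unit of TV adds 2 to sales, each unit of Search adds 4
sales = 2 * tv + 4 * search + rng.normal(0, 10, n)

X = np.column_stack([np.ones(n), tv, search])

# Fit OLS separately on each half of the data: the per-channel
# coefficients can swing around, even though the world never changed
coefs = []
for half in (slice(0, n // 2), slice(n // 2, n)):
    c, *_ = np.linalg.lstsq(X[half], sales[half], rcond=None)
    coefs.append(c)
    print(f"TV coef: {c[1]:.2f}, Search coef: {c[2]:.2f}")
```

Note that the combined effect per unit of TV (TV coef + 0.5 × Search coef) stays close to the true value of 4 in both fits; it is only the split between channels that the model cannot pin down.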

The Correlation Matrix

We visualize this using a Seaborn heatmap.

import seaborn as sns

# Select only numeric media columns
media_cols = ['tv_spend', 'fb_spend', 'search_spend', 'tiktok_spend']
corr_matrix = df[media_cols].corr()

# Plot Heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

The Danger Zone

If any two channels have a correlation > 0.7, you have a problem. You may need to:
1. Drop one of the channels.
2. Combine them (e.g., "FB + Insta" = "Social").
3. Use Ridge Regression (covered in Module 7) to handle the correlation.

3. Visualizing Saturation (Scatter Plots)

Finally, we need to check whether the relationship between Spend and Sales is actually linear. We do this by plotting a simple scatter plot for each channel.

for channel in media_cols:
    plt.scatter(df[channel], df['sales'])
    plt.title(f"Sales vs {channel}")
    plt.xlabel(channel)
    plt.ylabel("sales")
    plt.show()

Interpretation:
- If you see a straight line going up ↗️, you haven't hit saturation yet. Spend more!
- If you see a curve flattening out (like a logarithmic curve), you are hitting diminishing returns. This confirms we need to use Hill Functions in our feature engineering.
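As a preview of what that flattening curve looks like mathematically, here is a minimal sketch of a Hill function (the half-saturation and shape parameters below are arbitrary illustration values):

```python
import numpy as np

def hill(spend, half_sat, shape):
    """Hill saturation curve: response rises steeply at low spend,
    then flattens as spend passes the half-saturation point."""
    return spend ** shape / (spend ** shape + half_sat ** shape)

spend = np.linspace(0, 200, 5)  # 0, 50, 100, 150, 200
response = hill(spend, half_sat=50.0, shape=2.0)
print(np.round(response, 2))  # climbs 0 -> 0.5 -> 0.8 -> 0.9 -> 0.94, flattening toward 1
```

Doubling spend from 50 to 100 only moves the response from 0.5 to 0.8, and the next doubling adds even less: exactly the diminishing-returns shape the scatter plots reveal.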

Now that we understand the shape of our data, we are ready to transform it mathematically to better represent reality.
