You have built a perfect dataset. You have engineered features for Adstock and Saturation. You are ready to run `model.fit()`.
If you use standard Ordinary Least Squares (OLS) regression, you will likely get a result that says: "Facebook has a coefficient of -0.5."
-0.5? That implies that for every dollar you spend on Facebook, you lose 50 cents in revenue. Unless your ads are actively offensive, this is not plausible. It happens because OLS is "unbiased": it minimizes error on the training data with no constraint on the size or sign of the coefficients, even if the result makes no business sense.
Marketing data is messy. TV spend and Search spend often move together (multicollinearity), and OLS struggles to separate their effects. It might assign a huge positive number to TV (+5.0) and a negative number to Search (-2.0) just to balance the equation mathematically.
This is called Overfitting. The model has low "Bias" (it fits the training data well) but high "Variance" (its predictions degrade on new data and its coefficients fail basic sanity checks).
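To make this concrete, here is a small synthetic sketch (the channel names and numbers are invented for illustration): two spend series that move almost in lockstep are fed into plain OLS, and with this much collinearity the fitted coefficients can land far from the truth, sometimes flipping one channel negative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n_weeks = 104

# Two channels that move almost in lockstep (hypothetical weekly spend).
tv_spend = rng.uniform(50, 150, n_weeks)
search_spend = 0.8 * tv_spend + rng.normal(0, 2, n_weeks)   # nearly a copy of TV

# In this simulated world, both channels genuinely help revenue.
revenue = 2.0 * tv_spend + 1.5 * search_spend + rng.normal(0, 80, n_weeks)

X = np.column_stack([tv_spend, search_spend])
ols = LinearRegression().fit(X, revenue)

# With near-duplicate predictors the estimates are unstable: rerun with a
# different seed and the split between TV and Search changes dramatically.
print(f"TV coefficient:     {ols.coef_[0]:.2f}   (true value: 2.0)")
print(f"Search coefficient: {ols.coef_[1]:.2f}   (true value: 1.5)")
```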
To fix this, we need to introduce a "Penalty." We tell the model: "I want you to fit the data, BUT I will punish you if you use coefficients that are too large."
This is Ridge Regression (L2 Regularization). It changes the goal of the model from minimizing the sum of squared errors:

$$\min_{\beta} \; \sum_{i=1}^{n} \left(y_i - X_i \beta\right)^2$$

To minimizing the squared errors plus a penalty on the size of the coefficients:

$$\min_{\beta} \; \sum_{i=1}^{n} \left(y_i - X_i \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
Lambda (λ) is the penalty strength.
- If λ is 0, it behaves like OLS.
- If λ is high, it shrinks all coefficients towards zero, reducing the wild swings caused by multicollinearity.
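Here is a quick sketch of that shrinkage on the same kind of hypothetical collinear TV/Search data as above (the features are standardised first so a single alpha applies comparably to both channels):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
tv = rng.uniform(50, 150, 104)                      # hypothetical weekly TV spend
search = 0.8 * tv + rng.normal(0, 2, 104)           # near-duplicate Search spend
revenue = 2.0 * tv + 1.5 * search + rng.normal(0, 80, 104)

X = StandardScaler().fit_transform(np.column_stack([tv, search]))

# As alpha (the lambda penalty) grows, the coefficients are pulled toward zero
# and toward each other instead of swinging to extreme positive/negative values.
for alpha in [0.01, 1, 100, 10_000]:
    coefs = Ridge(alpha=alpha).fit(X, revenue).coef_
    print(f"alpha={alpha:>8}: TV={coefs[0]:8.2f}  Search={coefs[1]:8.2f}")
```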
We use scikit-learn to implement Ridge Regression. We also need to enforce Positive Coefficients (because marketing impact should almost always be positive).
```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Define the model with a positivity constraint
# (Note: standard Ridge in sklearn doesn't force positivity easily,
# so we often use libraries like CVXPY or specialized args in newer versions)
model = Ridge(alpha=1.0)  # alpha is the Lambda parameter

# We usually run a Grid Search to find the best Alpha/Lambda
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print(f"Best Penalty Strength: {grid_search.best_params_}")
```
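As a concrete option for the positivity constraint: scikit-learn 1.0 and later expose a `positive=True` flag on Ridge (it switches to the 'lbfgs' solver internally), which avoids reaching for CVXPY. A minimal sketch, assuming the same X_train and y_train as above:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# positive=True constrains every coefficient to be >= 0
# (available in scikit-learn >= 1.0; uses the 'lbfgs' solver under the hood).
model = Ridge(alpha=1.0, positive=True)

# Tune the penalty strength exactly as before.
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print(f"Best Penalty Strength: {grid_search.best_params_}")
print(f"Any negative coefficients? {(grid_search.best_estimator_.coef_ < 0).any()}")
```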
By using Ridge Regression, we are accepting a small amount of Bias (our model is slightly "damped") in exchange for a massive reduction in Variance.
The result is a model that is more stable, more predictive on future data, and less likely to give you crazy results like "-$50 ROI on Search."
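One way to check that trade-off numerically is to simulate the same kind of collinear dataset many times and track how far each model's Search coefficient wanders. The sketch below uses purely invented numbers, and the alpha value is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

def simulate_dataset():
    """Hypothetical MMM data: two near-duplicate channels plus noisy revenue."""
    tv = rng.uniform(50, 150, 104)
    search = 0.8 * tv + rng.normal(0, 2, 104)
    revenue = 2.0 * tv + 1.5 * search + rng.normal(0, 80, 104)
    return StandardScaler().fit_transform(np.column_stack([tv, search])), revenue

# Refit both models on many simulated datasets and compare coefficient stability.
ols_search, ridge_search = [], []
for _ in range(200):
    X, y = simulate_dataset()
    ols_search.append(LinearRegression().fit(X, y).coef_[1])
    ridge_search.append(Ridge(alpha=10).fit(X, y).coef_[1])

# Ridge shifts the average estimate a little (bias) but cuts the spread sharply (variance).
print(f"OLS   Search coef: mean={np.mean(ols_search):7.2f}  std={np.std(ols_search):6.2f}")
print(f"Ridge Search coef: mean={np.mean(ridge_search):7.2f}  std={np.std(ridge_search):6.2f}")
```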
But Ridge is still purely mathematical. It doesn't know that "Brand Search" should have a higher ROI than "Display Ads." For that, we need to inject human knowledge. That brings us to Bayesian methods.