In a standard A/B test, you send 50% of your traffic to the loser for the entire duration of the test. That is a lot of lost revenue (or "Regret").
What if you could dynamically shift traffic? What if, as soon as Variant B started looking good, the system automatically sent 60%, then 70%, then 90% of traffic to it?
This is the Multi-Armed Bandit (MAB) problem. It is named after slot machines ("one-armed bandits"). If you have 10 slot machines, and one pays out more often, how fast can you figure it out and switch all your coins to that one machine?
Every algorithm faces this fundamental dilemma:
- Explore: Pulling a lever I haven't tried much, just to gather information. I might lose money, but I gain knowledge.
- Exploit: Pulling the lever I currently think is best. I maximize immediate revenue, but I learn nothing new.
A/B Testing is 100% Explore for 2 weeks, then 100% Exploit forever.
Bandit Testing blends them together.
The simplest approach is Epsilon-Greedy: you flip a weighted coin.
- 90% of the time, choose the current winner (Exploit).
- 10% of the time, choose a random option (Explore).
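Here is a minimal sketch of that rule in Python. The class name, variant labels, and reward bookkeeping are illustrative, not from any particular library:

```python
import random

class EpsilonGreedy:
    def __init__(self, variants, epsilon=0.1):
        self.epsilon = epsilon                      # fraction of traffic used to explore
        self.counts = {v: 0 for v in variants}      # times each variant was shown
        self.rewards = {v: 0.0 for v in variants}   # total reward (e.g. conversions) per variant

    def choose(self):
        # Explore: with probability epsilon, pick a variant at random
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        # Exploit: otherwise pick the variant with the best observed rate so far
        return max(self.counts,
                   key=lambda v: self.rewards[v] / self.counts[v] if self.counts[v] else 0.0)

    def update(self, variant, reward):
        self.counts[variant] += 1
        self.rewards[variant] += reward
```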
Thompson Sampling uses the probability distributions we learned in Module 10. The algorithm samples a random value from each variant's posterior distribution and picks the highest one.
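A sketch of that sampling step, assuming binary (converted / didn't convert) rewards and a Beta posterior for each variant's conversion rate:

```python
import random

class ThompsonSampling:
    def __init__(self, variants):
        # Beta(1, 1) prior: one imaginary success and one imaginary failure per variant
        self.successes = {v: 1 for v in variants}
        self.failures = {v: 1 for v in variants}

    def choose(self):
        # Draw one random conversion rate from each variant's posterior,
        # then serve whichever variant drew the highest value.
        samples = {v: random.betavariate(self.successes[v], self.failures[v])
                   for v in self.successes}
        return max(samples, key=samples.get)

    def update(self, variant, converted):
        if converted:
            self.successes[variant] += 1
        else:
            self.failures[variant] += 1
```

The more data a variant accumulates, the narrower its posterior becomes, so exploration fades out naturally as the winner becomes clear.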
Bandits sound perfect. Why do we still use A/B tests? Because Bandits optimize, but they don't teach: the algorithm quickly converges on a winner, but because traffic keeps shifting mid-flight, you end up with less clean data about how much better the winner actually was, or why.
Standard Bandits treat everyone the same. Contextual Bandits use machine learning to personalize.
"For User X (iOS, Evening), Button A is best. For User Y (Android, Morning), Button B is best." This is the technology behind Netflix recommendations and TikTok feeds.