The Hypothesis Framework
Why "Let's test it" is a bad strategy. Structuring a falsifiable hypothesis using the "If, Then, Because" model.
From Hypothesis to Significance: The rigorous science of making decisions with data.
The Protagonist: Alex, a Product Analyst at an e-commerce fashion app called StyleStream.
The Problem: Mobile users were browsing products endlessly but dropping off at the last second. Mobile conversion was stuck at 2.5%.
Alex noticed a behavioral pattern in the session recordings. Users on mobile would scroll deep down a long product page to read reviews. When they finally decided to buy, they had to scroll all the way back up to find the "Add to Cart" button. Friction.
Alex wrote down a formal hypothesis: If we keep the "Add to Cart" button pinned ("sticky") at the bottom of the mobile product page, then mobile conversion will lift by at least 5%, because shoppers will no longer have to scroll back to the top of the page to buy.
Alex chose the OEC (Overall Evaluation Criterion): Revenue Per Session. Just getting more clicks wasn't enough; they needed to make sure the friction reduction actually led to paid orders.
Before writing a single line of code, Alex had to answer: "How long do we run this?" This wasn't a guess—it was a negotiation of risk. Alex opened the Power Analysis calculator to set the "Error Budget."
The Result: To detect that specific 5% lift while keeping errors inside those Alpha/Beta guardrails, the math demanded 180,000 visitors per variation. Based on their traffic, this meant the test had to run for exactly 14 days.
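A back-of-the-envelope version of that power calculation can be sketched in Python with statsmodels. The 2.5% baseline and the 5% relative lift come from the story; the alpha and power values below are conventional defaults chosen for illustration, so the output lands in the same ballpark as Alex's 180,000 rather than matching it exactly.

```python
# Sketch of a sample-size (power) calculation for a two-proportion test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.025                       # current mobile conversion rate
mde_relative = 0.05                    # minimum detectable effect: 5% relative lift
target = baseline * (1 + mde_relative)

effect_size = proportion_effectsize(target, baseline)   # Cohen's h
n_per_variation = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,        # tolerated false-positive rate
    power=0.80,        # 1 - beta, tolerated false-negative rate
    ratio=1.0,         # equal traffic split
    alternative="two-sided",
)
print(f"Visitors needed per variation: {n_per_variation:,.0f}")
```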
The engineers built the "Sticky Button." On Launch Day, traffic was split 50/50 between the Control (Old Button) and the Variant (Sticky Button).
Alex didn't look at sales yet. Alex checked for SRM (Sample Ratio Mismatch) first.
The ratio was nearly perfect (50.08% vs 49.92%). The randomization engine was working. The data was clean.
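In practice, an SRM check is a chi-square goodness-of-fit test on the assignment counts. The day-one counts below are hypothetical, chosen only to match the 50.08% / 49.92% split in the story.

```python
# Sketch of a Sample Ratio Mismatch (SRM) check with a chi-square test.
from scipy.stats import chisquare

observed = [12_871, 12_829]            # hypothetical day-one assignments: control, variant
expected = [sum(observed) / 2] * 2     # what a true 50/50 split would produce

stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
# A tiny p-value (commonly < 0.001) would flag an SRM and invalidate the test;
# here the split is consistent with 50/50, so the randomizer looks healthy.
```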
On Day 4, the VP of Marketing rushed to Alex’s desk.
"Alex! The dashboard shows the Sticky Button is up 15%! Let’s stop the test and launch it to everyone now!"
Alex had to be the buzzkill. "We can't," Alex explained. "If we stop now, our False Positive rate isn't 5% anymore—it jumps to over 30% because we are 'peeking'. Plus, it's Friday. We haven't seen how weekend users behave yet."
They held the line. As the weekend passed, that massive 15% spike started to drop, settling down as the data regressed to the mean.
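The "peeking" penalty is easy to demonstrate with a quick simulation: run many A/A experiments (no real effect), test the cumulative data every day, and count how often any day crosses p < 0.05. The traffic numbers below are rough approximations of StyleStream's volume; the exact inflated rate depends on traffic and the number of looks, but it lands far above the nominal 5%.

```python
# Simulation: how daily peeking inflates the false-positive rate in an A/A test.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
n_sims, n_days, daily_n, p_true = 1_000, 14, 12_850, 0.025

false_positives = 0
for _ in range(n_sims):
    # Both arms share the same true conversion rate, so any "win" is pure noise.
    conv_a = rng.binomial(daily_n, p_true, n_days).cumsum()
    conv_b = rng.binomial(daily_n, p_true, n_days).cumsum()
    nobs = daily_n * np.arange(1, n_days + 1)
    for day in range(n_days):
        _, p_val = proportions_ztest([conv_a[day], conv_b[day]], [nobs[day], nobs[day]])
        if p_val < 0.05:               # stopping the test the first day it "looks significant"
            false_positives += 1
            break

print(f"False-positive rate with daily peeking: {false_positives / n_sims:.1%}")
```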
Day 14 arrived. The test ended. Alex ran the final numbers against the "Null Hypothesis" (the assumption that the button did nothing).
What did P=0.03 mean?
It meant that if the sticky button actually did nothing, random chance would produce a lift at least this large only 3% of the time.
The Decision:
Since the observed risk (3%) was lower than the safety limit Alex set during the power analysis (5%), the result was declared Statistically Significant. The "ball" had landed in the "Zone of Rarity."
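The final readout is a standard two-proportion z-test against the null. The conversion counts below are hypothetical, picked only so the output lands near the story's p = 0.03 with 180,000 sessions per arm; for simplicity this sketch tests conversion rather than the full Revenue Per Session OEC.

```python
# Sketch of the end-of-test significance check (two-proportion z-test).
from statsmodels.stats.proportion import proportions_ztest

conversions = [4_703, 4_500]          # hypothetical totals: variant, control
sessions = [180_000, 180_000]         # the sample size fixed by the power analysis

z_stat, p_value = proportions_ztest(conversions, sessions, alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# p < 0.05 (the alpha chosen before launch), so the null hypothesis is rejected.
```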
Alex remembered the Golden Rule: Averages lie. Alex broke the "Win" down by device type: iPhone conversion had jumped, but Android conversion had actually dropped.
Why? Alex grabbed an Android phone to test. On the smaller Android screen, the new "Sticky Button" was covering up the "Customer Chat" icon. Android users couldn't ask questions, so they weren't buying.
Simpson’s Paradox had struck. The massive win on iPhones was hiding a broken experience on Android.
If Alex had just "launched" based on the overall P-value, they would have shipped a broken experience to the entire Android user base.
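Segmenting the result is just a group-by on the experiment data. The numbers below are hypothetical and only illustrate the pattern Alex found: pooled together, the variant looks like a clear winner, but splitting by device exposes the Android regression.

```python
# Illustration: a pooled "win" hiding a losing segment (device-level breakdown).
import pandas as pd

rows = [                                # hypothetical session/order counts
    ("iOS",     "control", 60_000, 1_500),
    ("iOS",     "variant", 60_000, 1_980),
    ("Android", "control", 30_000,   900),
    ("Android", "variant", 30_000,   630),
]
df = pd.DataFrame(rows, columns=["device", "arm", "sessions", "orders"])

pooled = df.groupby("arm")[["sessions", "orders"]].sum()
print((pooled["orders"] / pooled["sessions"]).rename("pooled_cvr"))   # variant looks better overall

by_device = df.assign(cvr=df["orders"] / df["sessions"])
print(by_device.pivot(index="device", columns="arm", values="cvr"))   # Android is actually worse
```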
The Decision:
Ship the Sticky Button to iPhone users, but hold the Android rollout until the layout was fixed so the button no longer covered the Customer Chat icon.
The Loop Closed. The data didn't just say "Win" or "Loss." It told a story about user interface spacing, leading to a smarter product.
Why "Let's test it" is a bad strategy. Structuring a falsifiable hypothesis using the "If, Then, Because" model.
The OEC (Overall Evaluation Criterion). Balancing Primary metrics (Conversion) with Guardrail metrics (Latency, Cancellation).
Defining MDE. How small of a change matters? Why seeking a 0.1% lift requires impossible sample sizes.
Calculating Sample Size. Understanding Alpha (False Positives) and Beta (False Negatives) risks before you start.
What P-Values actually mean (and what they don't). Why P < 0.05 is not a guarantee of truth.
Calibrating the engine. Running a test where the Control and Treatment are identical to check for bias in your randomization.
The "Check Engine" light of A/B testing. Why a 50/50 split that ends up 48/52 invalidates your entire experiment.
Why checking your results every day increases your error rate. Using Sequential Testing (SPRT) to peek safely.
Time-based bias. Why users click new things just because they are new, and how to filter out this noise.
Moving from "Is this significant?" to "What is the probability this version is better?" A more intuitive approach for business.
Explore vs. Exploit. Automating dynamic traffic allocation to maximize conversions while the test is still running.
Scaling experimentation. How to design an internal Experimentation Platform (XP) using Python, SQL, and Airflow.
Apply the theory with these hands-on resources.
A concise overview of the statistical framework and execution steps.
Walkthrough of a Checkout Redesign experiment with real data.
Input your conversion rates to calculate Power & Significance live.
All content combined into a single long-form masterclass.