The Hypothesis Framework
Why "Let's test it" is a bad strategy. Structuring a falsifiable hypothesis using the "If, Then, Because" model.
From Hypothesis to Significance: The rigorous science of making decisions with data.
The Protagonist: Alex, a Product Analyst at an e-commerce fashion app called StyleStream.
The Problem: Mobile users were browsing products endlessly but dropping off at the last second. Mobile conversion was stuck at 2.5%.
Alex noticed a behavioral pattern in the session recordings. Users on mobile would scroll deep down a long product page to read reviews. When they finally decided to buy, they had to scroll all the way back up to find the "Add to Cart" button. Friction.
Alex wrote down a formal hypothesis: If we keep the "Add to Cart" button pinned ("sticky") at the bottom of the mobile product page, then mobile conversion will lift by at least 5%, because shoppers will no longer have to scroll back to the top of the page to buy.
Alex chose the OEC (Overall Evaluation Criterion): Revenue Per Session. Just getting more clicks wasn't enough; they needed to make sure the friction reduction actually led to paid orders.
Before writing a single line of code, Alex had to answer: "How long do we run this?" This wasn't a guess—it was a negotiation of risk. Alex opened the Power Analysis calculator to set the "Error Budget."
The Result: To detect that specific 5% lift while keeping errors inside those Alpha/Beta guardrails, the math demanded 180,000 visitors per variation. Based on their traffic, this meant the test had to run for exactly 14 days.
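A back-of-the-envelope version of that power calculation can be sketched in Python with statsmodels. The 2.5% baseline and the 5% relative lift come from the story; the alpha and power values below are conventional defaults chosen for illustration, so the output lands in the same ballpark as Alex's 180,000 rather than matching it exactly.

```python
# Sketch of a sample-size (power) calculation for a two-proportion test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.025                       # current mobile conversion rate
mde_relative = 0.05                    # minimum detectable effect: 5% relative lift
target = baseline * (1 + mde_relative)

effect_size = proportion_effectsize(target, baseline)   # Cohen's h
n_per_variation = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,        # tolerated false-positive rate
    power=0.80,        # 1 - beta, tolerated false-negative rate
    ratio=1.0,         # equal traffic split
    alternative="two-sided",
)
print(f"Visitors needed per variation: {n_per_variation:,.0f}")
```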
The engineers built the "Sticky Button." On Launch Day, traffic was split 50/50 between the Control (Old Button) and the Variant (Sticky Button).
Alex didn't look at sales yet. Alex checked for SRM (Sample Ratio Mismatch) first.
The ratio was nearly perfect (50.08% vs 49.92%). The randomization engine was working. The data was clean.
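In practice, an SRM check is a chi-square goodness-of-fit test on the assignment counts. The day-one counts below are hypothetical, chosen only to match the 50.08% / 49.92% split in the story.

```python
# Sketch of a Sample Ratio Mismatch (SRM) check with a chi-square test.
from scipy.stats import chisquare

observed = [12_871, 12_829]            # hypothetical day-one assignments: control, variant
expected = [sum(observed) / 2] * 2     # what a true 50/50 split would produce

stat, p_value = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
# A tiny p-value (commonly < 0.001) would flag an SRM and invalidate the test;
# here the split is consistent with 50/50, so the randomizer looks healthy.
```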
On Day 4, the VP of Marketing rushed to Alex’s desk.
"Alex! The dashboard shows the Sticky Button is up 15%! Let’s stop the test and launch it to everyone now!"
Alex had to be the buzzkill. "We can't," Alex explained. "If we stop now, our False Positive rate isn't 5% anymore—it jumps to over 30% because we are 'peeking'. Plus, it's Friday. We haven't seen how weekend users behave yet."
They held the line. As the weekend passed, that massive 15% spike started to drop, settling down as the data regressed to the mean.
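The "peeking" penalty is easy to demonstrate with a quick simulation: run many A/A experiments (no real effect), test the cumulative data every day, and count how often any day crosses p < 0.05. The traffic numbers below are rough approximations of StyleStream's volume; the exact inflated rate depends on traffic and the number of looks, but it lands far above the nominal 5%.

```python
# Simulation: how daily peeking inflates the false-positive rate in an A/A test.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
n_sims, n_days, daily_n, p_true = 1_000, 14, 12_850, 0.025

false_positives = 0
for _ in range(n_sims):
    # Both arms share the same true conversion rate, so any "win" is pure noise.
    conv_a = rng.binomial(daily_n, p_true, n_days).cumsum()
    conv_b = rng.binomial(daily_n, p_true, n_days).cumsum()
    nobs = daily_n * np.arange(1, n_days + 1)
    for day in range(n_days):
        _, p_val = proportions_ztest([conv_a[day], conv_b[day]], [nobs[day], nobs[day]])
        if p_val < 0.05:               # stopping the test the first day it "looks significant"
            false_positives += 1
            break

print(f"False-positive rate with daily peeking: {false_positives / n_sims:.1%}")
```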
Day 14 arrived. The test ended. Alex ran the final numbers against the "Null Hypothesis" (the assumption that the button did nothing).
What did P=0.03 mean?
It meant that if the sticky button actually did nothing, random chance would produce a lift at least this large only 3% of the time.
The Decision:
Since the observed risk (3%) was lower than the safety limit Alex set during the power analysis (5%), the result was declared Statistically Significant. The "ball" had landed in the "Zone of Rarity."
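The final readout is a standard two-proportion z-test against the null. The conversion counts below are hypothetical, picked only so the output lands near the story's p = 0.03 with 180,000 sessions per arm; for simplicity this sketch tests conversion rather than the full Revenue Per Session OEC.

```python
# Sketch of the end-of-test significance check (two-proportion z-test).
from statsmodels.stats.proportion import proportions_ztest

conversions = [4_703, 4_500]          # hypothetical totals: variant, control
sessions = [180_000, 180_000]         # the sample size fixed by the power analysis

z_stat, p_value = proportions_ztest(conversions, sessions, alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# p < 0.05 (the alpha chosen before launch), so the null hypothesis is rejected.
```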
Alex remembered the Golden Rule: Averages lie. Alex broke the "Win" down by device type: iPhone conversion had jumped, but Android conversion had actually dropped.
Why? Alex grabbed an Android phone to test. On the smaller Android screen, the new "Sticky Button" was covering up the "Customer Chat" icon. Android users couldn't ask questions, so they weren't buying.
Simpson’s Paradox had struck. The massive win on iPhones was hiding a broken experience on Android.
If Alex had just "launched" based on the overall P-value, they would have shipped a broken experience to the entire Android user base.
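Segmenting the result is just a group-by on the experiment data. The numbers below are hypothetical and only illustrate the pattern Alex found: pooled together, the variant looks like a clear winner, but splitting by device exposes the Android regression.

```python
# Illustration: a pooled "win" hiding a losing segment (device-level breakdown).
import pandas as pd

rows = [                                # hypothetical session/order counts
    ("iOS",     "control", 60_000, 1_500),
    ("iOS",     "variant", 60_000, 1_980),
    ("Android", "control", 30_000,   900),
    ("Android", "variant", 30_000,   630),
]
df = pd.DataFrame(rows, columns=["device", "arm", "sessions", "orders"])

pooled = df.groupby("arm")[["sessions", "orders"]].sum()
print((pooled["orders"] / pooled["sessions"]).rename("pooled_cvr"))   # variant looks better overall

by_device = df.assign(cvr=df["orders"] / df["sessions"])
print(by_device.pivot(index="device", columns="arm", values="cvr"))   # Android is actually worse
```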
The Decision:
Ship the Sticky Button to iPhone users, but hold the Android rollout until the layout was fixed so the button no longer covered the Customer Chat icon.
The Loop Closed. The data didn't just say "Win" or "Loss." It told a story about user interface spacing, leading to a smarter product.
Why "Let's test it" is a bad strategy. Structuring a falsifiable hypothesis using the "If, Then, Because" model.
The OEC (Overall Evaluation Criterion). Balancing Primary metrics (Conversion) with Guardrail metrics (Latency, Cancellation).
Defining MDE. How small of a change matters? Why seeking a 0.1% lift requires impossible sample sizes.
Calculating Sample Size. Understanding Alpha (False Positives) and Beta (False Negatives) risks before you start.
What P-Values actually mean (and what they don't). Why P < 0.05 is not a guarantee of truth.
Calibrating the engine. Running a test where the Control and Treatment are identical to check for bias in your randomization.
The "Check Engine" light of A/B testing. Why a 50/50 split that ends up 48/52 invalidates your entire experiment.
Why checking your results every day increases your error rate. Using Sequential Testing (SPRT) to peek safely.
Time-based bias. Why users click new things just because they are new, and how to filter out this noise.
Moving from "Is this significant?" to "What is the probability this version is better?" A more intuitive approach for business.
Explore vs. Exploit. Automating dynamic traffic allocation to maximize conversions while the test is still running.
Scaling experimentation. How to design an internal Experimentation Platform (XP) using Python, SQL, and Airflow.
Apply the theory with these hands-on resources.
A concise overview of the statistical framework and execution steps.
Walkthrough of a Checkout Redesign experiment with real data.
Input your conversion rates to calculate Power & Significance live.
All content combined into a single long-form masterclass.