Module 08

The Peeking Problem

Why checking your results every day destroys your statistical validity, and how to fix it.

You launch a test on Monday.
On Tuesday, you check the results. "Variant B is up 15%! It's significant!"
On Wednesday, it drops to 2%. "Not significant anymore."
On Thursday, it's up 8%. "Significant again! Let's stop the test and declare a winner."

This is called Peeking (or "Continuous Monitoring"), and it is statistical malpractice. If you stop a test the moment it becomes significant, you are cherry-picking the data.

1. The Math of Repeated Peeking

Every time you calculate a P-Value, you are rolling the dice. Even when there is no real difference between variants, each check carries a 5% chance of a False Positive (α = 0.05).

If you check the test once at the end, your error rate is 5%.
If you check the test every day, you are rolling the dice over and over again. Your cumulative probability of finding a "fake winner" explodes.

Error Rate = 1 - (1 - α)^k

(Where k is the number of times you peek)

Number of Peeks      False Positive Probability
1 (Fixed Horizon)    5.0%
2                    9.8%
5                    22.6%
10                   40.1%
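
These figures fall straight out of the formula above. A quick check in Python (the language used for all sketches in this module):

```python
alpha = 0.05
for k in (1, 2, 5, 10):
    print(f"{k:>2} peek(s): {1 - (1 - alpha) ** k:.1%} chance of a false positive")
```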

If you check your dashboard 10 times during a test, there is up to a 40% chance you will see a significant result at some point, even if the variants are actually identical. If you stop the test then, you have deployed a loser.
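
One caveat: the formula treats every peek as an independent roll of the dice, which is the worst case. In a real test, each peek reuses all the data you have already seen, so consecutive checks are correlated and the true inflation is lower (classic sequential-analysis results put it near 19% for 10 evenly spaced looks), but still roughly four times the 5% you signed up for. A minimal A/A simulation, with made-up sample sizes and a unit-variance metric, shows both numbers:

```python
import numpy as np

rng = np.random.default_rng(42)
ALPHA, PEEKS, N_PER_PEEK, TRIALS = 0.05, 10, 200, 5_000

false_positives = 0
for _ in range(TRIALS):
    # A/A test: both arms draw from the same distribution, so every
    # "significant" result is a false positive by construction.
    a = rng.normal(0.0, 1.0, PEEKS * N_PER_PEEK)
    b = rng.normal(0.0, 1.0, PEEKS * N_PER_PEEK)
    for k in range(1, PEEKS + 1):
        n = k * N_PER_PEEK                       # cumulative sample size at peek k
        z = (a[:n].mean() - b[:n].mean()) / np.sqrt(2.0 / n)
        if abs(z) > 1.96:                        # "significant!" -> stop the test
            false_positives += 1
            break

print(f"Independence bound:  {1 - (1 - ALPHA) ** PEEKS:.1%}")    # 40.1%
print(f"Simulated peek rate: {false_positives / TRIALS:.1%}")    # roughly 19%
```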

2. Solution A: Fixed Horizon Testing

The simplest solution requires discipline.

The Rules

  1. Calculate Sample Size up front (e.g., 50,000 users); see the power calculation sketch after this list.
  2. Launch the test.
  3. Do not look at the results until you hit 50,000 users.
  4. Make a decision once.
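
A minimal sketch of step 1, using the standard two-proportion z-test power formula (scipy only for the normal quantiles). The 10% baseline and 11% target rates are hypothetical inputs, not figures from this module:

```python
from scipy.stats import norm

def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.80):
    """Users needed per arm for a two-sided, two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return (z_alpha + z_beta) ** 2 * variance / (p_base - p_variant) ** 2

# Hypothetical: 10% baseline conversion, and we care about a lift to 11%.
print(round(sample_size_per_arm(0.10, 0.11)))   # ~= 14,748 users per arm
```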

The Pros & Cons

Pros: Statistically valid. Easy to explain.

Cons: Painful. Even if a test is a huge winner (a +50% lift), you still have to wait the full 2 weeks to "prove" it.

3. Solution B: Sequential Testing (SPRT)

What if you genuinely need to peek (e.g., to stop a harmful test early)? You can use the Sequential Probability Ratio Test (SPRT).

This is the family of methods modern tools like Optimizely and Eppo use. Instead of keeping the significance bar flat at 95% (Z = 1.96), they raise the bar sharply at the start of the test and lower it as evidence accumulates.

This "Moving Goalpost" (or Alpha Spending Function) allows you to check the results every day without inflating your error rate. It essentially "spends" a little bit of your error budget each time you peek.

4. Next Steps

You ran the test without peeking. You found a winner! But... is it a real winner, or is it just the Novelty Effect?

In the next module, we discuss why users click on bright shiny new things, and why those clicks often disappear after 2 weeks.
