In Module 08, we learned not to peek. But even if you wait until you hit your target sample size, the data itself can still be skewed by user psychology.
Users are creatures of habit. When you change an interface, they don't react neutrally. They react emotionally.
If you change a grey button to a bright orange button, clicks will almost certainly go up in the first week. Is it because orange is better? No.
It's because returning users noticed something changed, and their curiosity drove the click. This is the Novelty Effect. Once the novelty wears off (usually after 1-2 weeks), their behavior regresses to the mean.
The Danger: If you run a 1-week test, you will declare the Orange Button a winner. You will roll it out, and next month revenue will drop back to baseline.
The Primacy Effect is the opposite problem. If you redesign a complex dashboard (like Salesforce or Gmail), productivity will drop immediately.
Users have "muscle memory." They click where the button used to be. When you move it, they get frustrated. The new design might be objectively better, but the initial data will show a massive loss.
What it looks like (Novelty Effect): Huge positive lift early on, then a slow decline toward zero.
Common in: Retail, Media, Simple UI changes.
What it looks like (Primacy Effect): Huge negative drop early on, followed by a slow recovery and climb.
Common in: SaaS, B2B, Workflow tools.
To detect these effects, you should plot the Cumulative Lift over Time.
If the line is zig-zagging or sloping heavily downward after 7 days, your test has not stabilized. You cannot call a winner yet.
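Here is a minimal sketch of that plot in Python. The file name and the column names (date, variant, visitors, conversions) are hypothetical; adapt them to however your experiment data is actually logged.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily log: one row per variant per day with columns
# "date", "variant" ("control" / "treatment"), "visitors", "conversions".
df = pd.read_csv("experiment_daily.csv", parse_dates=["date"])

daily = (
    df.groupby(["variant", "date"], as_index=False)[["visitors", "conversions"]]
    .sum()
    .sort_values(["variant", "date"])
)

# Running totals, so each point reflects all data collected up to that day.
g = daily.groupby("variant")
daily["cum_visitors"] = g["visitors"].cumsum()
daily["cum_conversions"] = g["conversions"].cumsum()
daily["cum_rate"] = daily["cum_conversions"] / daily["cum_visitors"]

# Cumulative lift of treatment over control, day by day.
rates = daily.pivot(index="date", columns="variant", values="cum_rate")
lift = (rates["treatment"] / rates["control"] - 1) * 100

lift.plot(marker="o")
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Date")
plt.ylabel("Cumulative lift (%)")
plt.title("Cumulative lift over time")
plt.show()
```

A line that is still swinging or trending means the effects above have not washed out yet; a line that has flattened is much safer to read.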
New Users do not have muscle memory. They have never seen the old site. They are immune to Primacy and Novelty effects.
Strategy: Segment your test results.
- If New Users love the design (positive lift) but Returning Users hate it (negative lift), it is likely a Primacy Effect. The Returning Users will eventually learn the new layout.
- If Returning Users show a big lift but New Users are flat, it is likely a Novelty Effect: curiosity, not a genuinely better design. A quick way to compute this split is sketched below.
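One way you might compute the segmented lift with pandas. The per-user file and the column names (user_type, variant, converted) are assumptions for illustration, not a prescribed schema.

```python
import pandas as pd

# Hypothetical per-user results: "user_type" is "new" or "returning",
# "variant" is "control" or "treatment", "converted" is 0 or 1.
users = pd.read_csv("experiment_users.csv")

# Conversion rate for every (segment, variant) combination.
rates = (
    users.groupby(["user_type", "variant"])["converted"]
    .mean()
    .unstack("variant")
)
rates["lift_pct"] = (rates["treatment"] / rates["control"] - 1) * 100
print(rates)

# Reading the output: negative lift for returning users with positive lift
# for new users suggests a Primacy Effect; a big lift for returning users
# with a flat result for new users suggests a Novelty Effect.
```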
Always run tests for at least two full business cycles (usually 2 weeks). This allows the novelty to wear off.
Some advanced teams ignore the first 3-5 days of data entirely. They treat it as a "warm-up" period and only calculate significance on data collected from Day 6 onwards.
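A sketch of that warm-up approach, using a standard two-proportion z-test from statsmodels on the post-warm-up data only. The file, the day/variant/visitors/conversions columns, and the 5-day cutoff are illustrative assumptions; the cutoff should match whatever warm-up window your team chooses.

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical daily aggregates: "day" (1, 2, 3, ...), "variant",
# "visitors", "conversions".
daily = pd.read_csv("experiment_daily.csv")

WARMUP_DAYS = 5                          # treat Days 1-5 as warm-up
stable = daily[daily["day"] > WARMUP_DAYS]

# Pool the post-warm-up data and run a two-proportion z-test.
totals = stable.groupby("variant")[["conversions", "visitors"]].sum()
counts = totals.loc[["treatment", "control"], "conversions"].to_numpy()
nobs = totals.loc[["treatment", "control"], "visitors"].to_numpy()

z_stat, p_value = proportions_ztest(counts, nobs)
print(f"z = {z_stat:.2f}, p = {p_value:.4f} (Days {WARMUP_DAYS + 1}+ only)")
```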
We have now covered the core Frequentist approach to A/B testing. But there is another way—a way that allows us to speak in probabilities rather than P-Values. This brings us to Bayesian A/B Testing.