Statistical Significance for Skeptics: When Flat Wins


Jul 17, 2026


The most common outcome of a well-run A/B test is nothing. No significant difference, intervals overlapping, the dashboard politely refusing to declare a winner. Teams experience this as failure, report it apologetically, and quietly stop testing. All three reactions misread what the data said, and the misreading costs more than the flat test did.

This post is the skeptic's guide to flat results: what the base rates say you should expect, what "no significant difference" actually asserts, and the specific situations where a flat readout is the success signal you were hoping for.

First, calibrate: flat is the norm, not the anomaly

The strongest public numbers come from Ronny Kohavi, who built experimentation platforms at Microsoft (including Bing) and Airbnb. His findings from Microsoft's program: roughly one third of ideas produced a statistically significant positive result, one third were flat, and one third were significantly negative, with Bing's success rate lower still. More recently he's put the broader figure at around 10% of experiments moving their target metric for the median organization.

Hold onto that second number, because it has a nasty corollary. At a 10% base success rate with standard thresholds (a two-sided alpha of 0.05 and 80% power), the false positive risk works out to roughly 22%, meaning about one in five "significant wins" is not a real win. The skeptic's first lesson cuts both ways: be less devastated by flat results and less euphoric about wins.
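If you want to check that arithmetic yourself, here is a minimal sketch. It assumes the usual conventions: a two-sided alpha of 0.05, where only the favorable tail counts as a "win", and 80% power. The function name and defaults are mine, not from any particular library.

```python
def false_positive_risk(success_rate, alpha=0.05, power=0.80):
    """Share of statistically significant wins that are false positives."""
    # Only the favorable tail of a two-sided test gets reported as a win.
    one_sided_alpha = alpha / 2
    # Ideas with no real effect that still cross the threshold.
    false_wins = one_sided_alpha * (1 - success_rate)
    # Ideas with a real effect that the test actually detects.
    true_wins = power * success_rate
    return false_wins / (false_wins + true_wins)


# 10% is the median-organization figure; one third is the Microsoft figure.
for rate in (0.10, 1 / 3):
    print(f"success rate {rate:.0%}: false positive risk {false_positive_risk(rate):.1%}")
```

The lower your base success rate, the larger the share of your wins that are noise, which is why the same p < 0.05 deserves more suspicion in a mature, heavily optimized product.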

And the upside of all that failure is real. The famous Bing ad-headline experiment, a change so unpromising it sat in the backlog for months, produced a 12% revenue increase worth over $100 million a year. You only find those by running enough tests to eat a lot of flat ones on the way.

What a flat result actually says (and doesn't)

"Not statistically significant" means: if there were truly no difference, data like yours wouldn't be surprising. It does not mean the variants are equal. At most it suggests that any true difference is probably smaller than what your sample size could reliably detect. That detection threshold (your minimum detectable effect, or MDE) was set, explicitly or accidentally, when you decided how long to run the test. A flat result from an underpowered test mostly reports on your traffic, a dynamic I walked through with real numbers in "how an A/B test really looks when you don't have enough data".
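To see how traffic sets the detection threshold, here is a rough sketch of the standard normal-approximation formula for the MDE of a two-proportion test. It assumes equal arm sizes and approximates both arms' variance at the baseline rate. The baseline and sample sizes in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, users_per_arm, alpha=0.05, power=0.80):
    """Approximate relative MDE for a two-sided, two-proportion z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    # Standard error of the difference, with both arms approximated at the baseline rate.
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / users_per_arm)
    absolute_mde = (z_alpha + z_power) * se
    # Express the threshold relative to the baseline conversion rate.
    return absolute_mde / baseline_rate


# Hypothetical 3% baseline conversion rate at two traffic levels.
for n in (5_000, 50_000):
    mde = minimum_detectable_effect(0.03, n)
    print(f"{n:>6} users per arm: relative MDE of about {mde:.0%}")
```

With 5,000 users per arm at that baseline, the relative MDE comes out at roughly 32%, so a flat readout says very little about a plausible 5% lift.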

Two reading rules follow. One: never interpret flatness without stating the MDE next to it. "No difference detected, and we were powered to detect anything above 8%" is information; "no difference" alone is noise wearing a suit. Two: respect the stopping rule you set in advance. Checking daily and stopping the moment significance appears inflates your false positive rate dramatically. That is the trap Evan Miller documented years ago in "How Not To Run an A/B Test", and it manufactures fake winners out of genuinely flat tests.
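The peeking problem is easy to demonstrate with a small simulation. The sketch below runs A/A tests (both arms share the same true conversion rate, so every "winner" is fake) and compares two policies: stopping the first day the test looks significant, versus reading the result once at the planned end. The traffic numbers, test length, and seed are arbitrary assumptions, and exact results will vary with them.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05


def is_significant(conv_a, conv_b, n):
    """Two-proportion z-test with n users in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return False  # no variance yet, nothing to test
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return abs(conv_a - conv_b) / n / se > Z_CRIT


def simulate(n_tests=1000, days=20, users_per_day=100, rate=0.10, seed=7):
    """Return (false positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        ever_significant = False
        significant = False
        for day in range(1, days + 1):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            significant = is_significant(conv_a, conv_b, day * users_per_day)
            ever_significant = ever_significant or significant
        peeking_hits += ever_significant  # would have stopped early on some day
        final_hits += significant         # significant at the planned end only
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, final_rate = simulate()
print(f"stop at first significant day: {peeking_rate:.1%} false positives")
print(f"single look at planned end:    {final_rate:.1%} false positives")
```

The single-look rate should land near the nominal 5%, while the peeking rate comes out several times higher, even though nothing about the product changed.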

The cases where flat is the win

Now the productive part. There's a class of experiments where the hypothesis is non-inferiority: you want the new version to be no worse, because it's better on some axis the primary metric doesn't capture. There, flat-on-primary is exactly what success looks like:

The cheaper variant. You removed a third-party widget, a heavy script, or a step that costs money to operate. Conversion holds flat: ship it, you just bought the same revenue for less cost and less complexity.
The simplification. The redesigned form has four fields instead of nine. Flat conversion with lower maintenance burden and a cleaner codebase is a win that compounds quietly.
The risk check. Legal copy, a pricing display change, a rebrand element. The test was insurance: you wanted evidence the change doesn't hurt before rolling it out. Flat is the all-clear.
The killed myth. The HiPPO was certain the carousel/video/badge was carrying conversions. Flat says it wasn't. You can now delete it, and you've made the next argument shorter.
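For any of these cases, the non-inferiority check can be sketched as a one-sided test against a margin: you conclude "no worse" only if the lower confidence bound on the difference stays above the largest drop you said you could tolerate. This is a minimal sketch using a normal approximation. The function name, the margin, and the counts in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def non_inferiority_test(conv_control, n_control, conv_variant, n_variant,
                         margin, alpha=0.05):
    """One-sided z-test that the variant is not worse than control by more than `margin`.

    `margin` is an absolute drop in conversion rate that you are willing to tolerate.
    """
    p_c = conv_control / n_control
    p_v = conv_variant / n_variant
    diff = p_v - p_c
    se = sqrt(p_c * (1 - p_c) / n_control + p_v * (1 - p_v) / n_variant)
    # Lower bound of the one-sided (1 - alpha) confidence interval for the difference.
    lower_bound = diff - NormalDist().inv_cdf(1 - alpha) * se
    return {"diff": diff, "lower_bound": lower_bound, "non_inferior": lower_bound > -margin}


# Hypothetical: the cheaper variant converts at 4.95% vs 5.00%, tolerated drop 0.5 points.
print(non_inferiority_test(1000, 20_000, 990, 20_000, margin=0.005))
# Same flat picture with a quarter of the traffic.
print(non_inferiority_test(250, 5_000, 248, 5_000, margin=0.005))
```

In the first call the lower bound clears the margin, so flat really is the all-clear. In the second, the observed rates are just as flat but the interval is too wide to rule out a drop bigger than the margin, which is the underpowered-test problem again.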

One discipline keeps this honest: declare the non-inferiority framing, including the margin you will tolerate, before the test, not after. Kohavi's caution against shipping on flat results exists precisely because "it didn't hurt anything" becomes an excuse to ship pet features without evidence. The difference between the two is whether "flat = success" was the stated hypothesis or the post-hoc consolation. Guardrails matter here too; a flat primary with a degraded guardrail metric is a loss, which is the territory of "the metrics you're not watching".

Reporting flat without apologizing

A flat result reported well includes the MDE, the non-inferiority framing if there was one, the guardrail readout, and the decision (ship the cheaper variant, keep control, redesign with a bigger swing). Reported that way, it reads as what it is: the experiment did its job, the job was answering a question, and the answer arrived. I wrote a full template for this in "the professional maturity of reporting an inconclusive test", and covered the low-traffic strategy that prevents most doomed-to-be-flat tests in "CRO when you don't have the traffic".

The implication

If roughly a third of honest tests come back flat, then an experimentation program's real output metric can't be its win rate; chasing win rate just teaches teams to test only safe, tiny, unfalsifiable things. The better KPI is validated decisions per quarter, where a flat result that kills a bad idea, de-risks a launch, or licenses a simplification counts at full value. Significance is a filter on evidence, and a filter is not a verdict on whether the work was worth doing.