- Decide how to design tests that reach statistical significance faster
- Determine the cause of long-running or inconclusive tests
If you've been running A/B tests, you've probably wondered: why isn't my test reaching statistical significance?
Statistical significance is the likelihood that the difference in conversion rates between a given variation and the baseline is not due to random chance. In other words, it's a good indicator of how well the results of the sample you tested will reflect reality. On the journey of experience optimization, your speed of travel is tied to your success in getting statistically significant results.
Fortunately, savvy experiment design and an understanding of how statistical significance works under the hood will help you reach conclusive results.
This article provides a few tips on reaching statistical significance. We also touch on related concepts in other articles: How long to run a test and Use minimum detectable effect to prioritize a test. Even though we repeat some of the same principles here, we recommend that you read those as well.
Read on to learn why your test isn't reaching statistical significance.
Changes are too small
Sometimes, a small change can make a huge difference. A new call-to-action (CTA) can help a charity raise $1.5m more, for example. Other times, modest adjustments don't make big enough waves to push your test to statistical significance.
If your revision is minor, its impact on your baseline conversion rate is likely to be small too. Stats Engine picks up this small difference, but takes longer to decide whether it's a chance fluctuation or a lasting change in visitor behavior.
Check out the chart below to see how smaller improvements over the baseline require larger sample sizes (and more time) to declare a statistically significant result.
When you design a test, consider making changes that will have a significant impact on your visitors' experience, whether the change itself is big or small.
A text change to a CTA can drive more clicks if the initial text doesn't properly reflect the purpose of the CTA. Adjusting the copy to match the visitor's intent can be a significant change. If the purpose of the CTA is already clear (like a "buy" button on a product page), changes to the text are less likely to drive noticeable improvements.
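To put rough numbers on the relationship between the size of an improvement and the traffic it needs, here is a minimal sketch. It assumes a classic fixed-horizon, two-sided two-proportion z-test, which is not how Stats Engine computes significance, so treat the output as an order-of-magnitude guide only. The function name, the 3% baseline, and the default significance level and power are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(baseline, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation to detect a relative lift.

    Assumption: fixed-horizon two-proportion z-test (two-sided).
    This is a textbook approximation, not Stats Engine's calculation.
    """
    p1 = baseline                        # baseline conversion rate
    p2 = baseline * (1 + relative_lift)  # conversion rate of the variation
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical 3% baseline; compare small and large relative lifts.
for lift in (0.05, 0.10, 0.20):
    needed = visitors_per_variation(0.03, lift)
    print(f"{lift:.0%} relative lift: about {needed:,} visitors per variation")
```

Because the required sample grows with the inverse square of the difference, halving the lift you want to detect roughly quadruples the traffic you need.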
Low baseline conversion rate
The most important metrics to a business sometimes have relatively low baseline conversion rates. In e-commerce, for example, the "purchase" conversion rate is a relatively low-frequency event: often below 3%.
Low baseline conversion rates affect the time it takes to reach statistical significance. In the chart above, note the difference in traffic required to reach significance for a 1% versus a 5% baseline.
While it's important to track how tests affect key metrics, it's not always possible to directly capture the impact on that infrequent event in a timely manner. When this is the case, use a metric with a higher baseline conversion rate as a stand-in to measure success.
Imagine that you're optimizing the homepage of an e-commerce site with a banner that prompts visitors to visit the electronics category. You expect more visitors to click the banner, view electronics, and purchase. But the baseline conversion rate for purchases is relatively low. You have a limited amount of time to run this test, and waiting for the purchase metric to reach statistical significance would take too long.
Instead of measuring success in purchases, you set your primary metric to track clicks on the banner. That way, you don't have to wait for significance to travel all the way down the funnel to decide whether the variation wins or loses. You measure the impact of your test directly in clicks, where you made the change. You can then extrapolate that win to estimate your test's impact on revenue.
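The banner example can be sketched numerically. The snippet below estimates how many days a test would need for a low-baseline metric (purchases) versus a higher-baseline one (banner clicks). It uses the same textbook fixed-horizon approximation rather than Stats Engine's method, and every number in it is hypothetical: the 2% and 20% baselines, the 10% relative lift assumed for both metrics, and the 2,000 daily visitors per variation.

```python
from math import ceil
from statistics import NormalDist


def days_to_run(baseline, relative_lift, daily_visitors_per_variation,
                alpha=0.05, power=0.8):
    """Rough number of days needed to detect a relative lift.

    Assumption: fixed-horizon two-proportion z-test (two-sided),
    with traffic split so each variation gets the given daily visitors.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    visitors = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return ceil(visitors / daily_visitors_per_variation)


# Hypothetical inputs for the homepage banner scenario.
purchase_days = days_to_run(0.02, 0.10, 2000)  # low-baseline metric
click_days = days_to_run(0.20, 0.10, 2000)     # higher-baseline stand-in

print(f"Purchases:     about {purchase_days} days")
print(f"Banner clicks: about {click_days} days")
```

Under these assumptions the click metric resolves in a fraction of the time. In practice the two metrics rarely move by the same relative amount, so use the comparison only to decide which metric is feasible to measure in the time available.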
Too many goals
Be strategic when deciding what metrics to track in a test. Add all the goals that are critical to measure, even if it's 10 or more. But don't track goals that aren't crucial in deciding whether a test is a success or failure for your business needs.
Imagine that you're optimizing the search bar on your homepage, and you're tempted to track how your changes affect clicks on your customer support widget. While customer support is a valid consideration, it may not be crucial to measure for this particular test.
Return to your hypothesis. Does the impact on support tell you whether the hypothesis is valid or not? If not, this goal may just get in the way of reaching statistical significance. Avoid adding it to this particular experiment.
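One way to see why an extra goal can slow things down is through a multiple-comparisons correction. The sketch below uses the Benjamini-Hochberg procedure as a generic stand-in; this article does not describe how Stats Engine accounts for multiple goals, so the method, the function name, and the p-values are all illustrative assumptions.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where a goal is declared significant.

    Assumption: generic Benjamini-Hochberg false discovery rate control,
    used only to illustrate the cost of tracking extra goals.
    """
    m = len(p_values)
    # Indices of the goals, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank whose p-value clears its threshold.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Every goal at or below that rank is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values: a search goal alone, then with a support-widget goal.
focused = benjamini_hochberg([0.04])
with_extra_goal = benjamini_hochberg([0.04, 0.60])

print(focused)          # [True]
print(with_extra_goal)  # [False, False]
```

With only the search goal, a p-value of 0.04 clears the 0.05 threshold. Adding an unrelated goal with a p-value of 0.60 tightens the threshold for the best-ranked goal to 0.025, so the same search result is no longer declared significant and the test has to keep collecting data.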