Statistical significance in Optimizely

  • Use statistical significance to analyze results 

Optimizely won't declare a variation a winner or loser until your experiment meets specific criteria for visitors and conversions. These criteria are different for experiments using numeric metrics and those using binary metrics.

Numeric metrics (such as revenue) do not require a specific number of conversions, but they do require 100 visitors/sessions in the variation. Binary metrics, on the other hand, require at least 100 visitors/sessions and 25 conversions in both the variation and the baseline before a winner can be declared.
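
The minimum-sample criteria above can be sketched as a small check. This is only an illustration of the thresholds stated in this article: the function name `can_declare_result` and its parameters are hypothetical and are not part of any Optimizely API.

```python
def can_declare_result(metric_type, variation_visitors, variation_conversions=0,
                       baseline_visitors=0, baseline_conversions=0):
    """Return True when the minimum-sample criteria from the article are met.

    Hypothetical helper for illustration only; not an Optimizely function.
    """
    min_visitors = 100      # visitors/sessions threshold from the article
    min_conversions = 25    # conversions threshold (binary metrics only)

    if metric_type == "numeric":
        # Numeric metrics need enough visitors in the variation,
        # but no specific number of conversions.
        return variation_visitors >= min_visitors

    if metric_type == "binary":
        # Binary metrics need both thresholds in the variation AND the baseline.
        return (variation_visitors >= min_visitors
                and baseline_visitors >= min_visitors
                and variation_conversions >= min_conversions
                and baseline_conversions >= min_conversions)

    raise ValueError(f"unknown metric type: {metric_type!r}")


print(can_declare_result("numeric", variation_visitors=120))
print(can_declare_result("binary", 500, 24, 500, 40))
```

Meeting these minimums is necessary but not sufficient: the result must also reach your statistical significance level.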

More often, you'll see results once Optimizely has determined they are statistically significant.

Statistical significance represents the likelihood that the difference in conversion rates between a given variation and the baseline is not due to chance. The significance level you choose reflects your risk tolerance and the confidence you want in your results.

For example, if your results are significant at a 90% significance level, you can be 90% confident that the results you see are due to an actual underlying change in behavior, not just random chance.


A significance calculation is necessary because, in statistics, you observe a sample of the population and use it to make inferences about the total population. Optimizely uses statistical significance to infer whether your variation caused movement in the Improvement metric.

There's always a chance that the lift you observed was a result of typical fluctuation in conversion rates instead of an actual change in underlying behavior. For example, if you set an 80% significance level and you see a winning variation, there’s a 20% chance that what you’re seeing is not actually a winning variation. At a 90% significance level, the chance of error decreases to 10%.

The higher your significance, the more visitors your experiment will require. The highest significance that Optimizely will display is >99%: it is technically impossible for results to be 100% significant.

Statistical significance helps Optimizely control the rate of errors in experiments. In any controlled experiment, you should anticipate three possible outcomes:

  • Accurate results. When there is an underlying, positive (negative) difference between your original and your variation, the data shows a winner (loser), and when there isn’t a difference, the data shows an inconclusive result.

  • False positive. Your test data shows a significant difference between your original and your variation, but it’s actually random noise in the data—there is no underlying difference between your original and your variation.

  • False negative. Your test shows an inconclusive result, but your variation is actually different from your baseline.

Statistical significance is a measure of how likely it is that your improvement comes from an actual change in underlying behavior, instead of a false positive.
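
To make the idea of a false positive concrete, here is a minimal simulation sketch. It runs many "A/A" experiments in which both groups share the same true conversion rate, so every significant result is a false positive by construction. It assumes a classical fixed-sample two-proportion z-test, which is not the method Stats Engine uses; the function names, traffic numbers, and conversion rate are all made up for illustration.

```python
import math
import random
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_false_positive_rate(rate=0.10, visitors=500, trials=1000,
                                 significance=0.90, seed=0):
    """Share of A/A experiments that wrongly show a significant difference."""
    rng = random.Random(seed)
    alpha = 1 - significance
    false_positives = 0
    for _ in range(trials):
        # Both groups draw from the SAME true conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(visitors))
        conv_b = sum(rng.random() < rate for _ in range(visitors))
        if two_sided_p_value(conv_a, visitors, conv_b, visitors) < alpha:
            false_positives += 1
    return false_positives / trials


fp_rate = simulate_false_positive_rate()
print(f"False positive rate at 90% significance: {fp_rate:.3f}")
```

Under these assumptions the printed rate should land near 10%, the error rate implied by a 90% significance level.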

By default, we set significance at 90%, which means there’s a 90% chance that the observed effect is real and not due to chance. In other words, you will declare 9 out of 10 winning or losing variations correctly. If you want to use a different significance threshold, you can set a significance level at which you would like Optimizely to declare winners and losers for your project.

Lower significance levels may increase the likelihood of error but can also help you test more hypotheses and iterate faster. Higher significance levels decrease the error probability, but require a larger sample.
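
The trade-off between significance level and sample size can be illustrated with the classical fixed-sample formula for comparing two conversion rates. This is a rough sketch only: Stats Engine uses a different (sequential) approach, so the visitor counts it requires will not match these numbers. The function name, the 80% power setting, and the example rates are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_lift,
                           significance=0.90, power=0.80):
    """Approximate visitors needed per variation for a two-tailed,
    fixed-sample two-proportion test (textbook approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    alpha = 1 - significance
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 10% baseline conversion rate, looking for a 10% relative lift.
for level in (0.80, 0.90, 0.95, 0.99):
    needed = visitors_per_variation(0.10, 0.10, significance=level)
    print(f"{level:.0%} significance: about {needed} visitors per variation")
```

The required sample grows as the significance level rises, which is why a stricter threshold slows down experiments on low-traffic pages.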

Choosing the right significance level should balance the types of tests you are running, the confidence you want to have in the tests, and the amount of traffic you actually receive.

One-tailed and two-tailed tests in Optimizely

When you run a test, you can run a one-tailed or two-tailed test. Two-tailed tests are designed to detect differences between your original and your variation in both directions: they tell you if your variation is a winner and if your variation is a loser. One-tailed tests are designed to detect differences in only one direction: they tell you whether your variation is a winner or whether it is a loser, but not both.

With the introduction of the Stats Engine, Optimizely uses two-tailed tests because they are required for the false discovery rate control that we have implemented in our Stats Engine.

In practice, false discovery rate control matters more to your ability to make business decisions than the choice between a one-tailed and a two-tailed test. When you make a business decision, your main goal is to avoid implementing a false positive or a false negative.

Switching from a two-tailed to a one-tailed test will typically change error rates by a factor of two, but requires the additional overhead of specifying in advance whether you are looking for winners or losers. If you know you're looking for a winner, you can increase your statistical significance setting from 90% to 95%. On the other hand, not using false discovery rates can inflate error rates by a factor of five or more.
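
The factor of two can be seen with a standard normal test statistic. The sketch below assumes a plain z-test (not Stats Engine's calculation) and uses a hypothetical helper name: for the same observed z-score, the one-tailed p-value is half the two-tailed p-value when the effect is in the expected direction.

```python
from statistics import NormalDist


def tail_p_values(z):
    """Return (one_tailed, two_tailed) p-values for a z-score.

    The one-tailed value tests only for a winner (a positive difference).
    """
    std_normal = NormalDist()
    one_tailed = 1 - std_normal.cdf(z)
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    return one_tailed, two_tailed


# z-score at the 95th percentile of the standard normal (about 1.645).
z = NormalDist().inv_cdf(0.95)
one_tailed, two_tailed = tail_p_values(z)
print(f"z = {z:.3f}")
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

A result that just reaches 90% significance in a two-tailed test (p of about 0.10) corresponds to about 95% significance in a one-tailed test (p of about 0.05), which is where the 90% versus 95% comparison comes from.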

It’s more helpful to know the actual chance of implementing false results and to make sure that your results aren’t compromised by adding multiple goals.
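
For readers unfamiliar with false discovery rate control, the following sketch shows the textbook Benjamini-Hochberg procedure applied to several goals at once. This article does not describe the exact procedure Stats Engine implements, so treat this as a generic illustration of the idea rather than Optimizely's method; the function name and the p-values are made up.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Flag which results stay significant under Benjamini-Hochberg.

    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at most (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Everything ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values for five goals in one experiment.
goal_p_values = [0.01, 0.04, 0.03, 0.20, 0.50]
print(benjamini_hochberg(goal_p_values, fdr=0.10))
```

Because each threshold depends on how many goals are being tested, adding more goals makes every individual result harder to call significant, which keeps the share of false discoveries in check.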

Segmentation and statistical significance

Optimizely lets you segment your results so you can see if certain groups of visitors behave differently from your visitors overall. However, Optimizely doesn't control the false discovery rate for segments. This means it's much more likely that significant results in segments are false positives, and the false discovery rate will be higher.

You can limit the risk of false positives if you only test the segments that are the most meaningful. The higher false discovery rate arises when you're searching for significant results among many segments.
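
A quick calculation shows why searching many segments is risky. The sketch below assumes each segment is an independent test with no real effect and a false positive chance equal to one minus the significance level; real segments often overlap, so this is only a rough guide. The function name is hypothetical.

```python
def chance_of_any_false_positive(num_segments, significance=0.90):
    """Chance that at least one of several no-effect segments looks significant,
    assuming independent tests without false discovery rate control."""
    alpha = 1 - significance
    return 1 - (1 - alpha) ** num_segments


for segments in (1, 5, 10, 20):
    chance = chance_of_any_false_positive(segments)
    print(f"{segments:>2} segments: {chance:.0%} chance of at least one false positive")
```

Under these assumptions, checking 5 segments at a 90% significance level already carries roughly a 41% chance of at least one spurious "significant" segment, which is why limiting yourself to the most meaningful segments helps.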

Novelty effect and statistical significance

Currently, statistical significance that results from a novelty effect can persist for a long time. In the future, statistical significance calculations will self-correct by taking into account how long the test has been running, not just the sample size.