- Use Optimizely's Sample Size Calculator to calculate the sample size for your test
- Calculate the length of time an experiment will likely run based on sample size
- Choose a Baseline Conversion Rate and Minimum Detectable Effect for your estimates
You have a theory about how to improve your conversion rate, you've built your test, and you’re ready to turn it on. Congratulations! So, how long do you have to wait until you know if your theory was right or not? Traditionally, you’ve needed to figure out the total sample size you need and divide it by your daily traffic, then stop the test at the exact sample size that you calculated. Doesn’t sound very simple, does it?
Good news: Optimizely’s Stats Engine removes the need to calculate a sample size in advance by using a methodology called sequential testing, which collects evidence as your test runs to declare significant results and show you winners and losers as quickly and accurately as possible.
Need to calculate the sample size or duration of your experiment? Check out our Sample Size Calculator!
The importance of sample size
Even though you no longer need to calculate sample size as an experiment runs, you should understand why it's important to have a healthy sample size when making decisions.
A healthy sample size is at the heart of making accurate statistical conclusions and a strong motivation behind why we created Stats Engine. When your test has a low conversion rate for a given sample size, it means that there is not yet enough evidence to conclude that the effect you're seeing is due to a real difference between the baseline and variation instead of due to chance: in statistical terms, your test is underpowered.
The table below provides an estimate of the sample size you would need to accurately detect different levels of Improvement (relative difference in conversion rates) across a few different baseline conversion rates, based on Optimizely’s Sample Size Calculator/Stats Engine. It takes fewer visitors to detect large differences in conversion rates; just look across any row to see how this works.
The same is true for higher baseline conversion rates: as your baseline conversion rate gets higher, you need a smaller sample size to measure Improvement. Read each column from top to bottom to see how this works.
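To get a feel for both trends, here is a minimal sketch that uses the textbook fixed-horizon formula for comparing two proportions. This is an assumption made for illustration only: it is not the calculation behind Optimizely's Sample Size Calculator or Stats Engine, so its numbers will not match Optimizely's estimates. The function name and the 80% power default are hypothetical choices.

```python
import math
from statistics import NormalDist


def fixed_horizon_sample_size(baseline_rate, relative_mde,
                              significance=0.90, power=0.80):
    """Visitors needed per variation under a classical two-sided
    two-proportion z-test (NOT Optimizely's sequential method)."""
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    if relative_mde <= 0:
        raise ValueError("relative_mde must be positive")

    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # relative lift over baseline
    if p2 >= 1:
        raise ValueError("baseline_rate * (1 + relative_mde) must be below 1")

    alpha = 1 - significance
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Print a small grid: rows are baseline rates, columns are relative MDEs.
for baseline in (0.05, 0.10, 0.20):
    row = [fixed_horizon_sample_size(baseline, mde) for mde in (0.05, 0.10, 0.20)]
    print(f"baseline {baseline:.0%}: {row}")
```

Within each printed row the required sample size shrinks as the detectable effect grows, and from one row to the next it shrinks as the baseline conversion rate rises.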
Stats Engine allows you to evaluate results as they come in and avoid making decisions on tests with low, underpowered sample sizes (a “weak conclusion”) without having to commit to predetermined sample sizes before running a test. You want to avoid making business decisions based on underpowered tests because any improvement that you see is unlikely to hold up when you implement your variation, potentially causing you to spend valuable resources and realize no benefit.
As you're running experiments, Optimizely shows you an estimate of how many visitors you'll need to reach significant results.
When your variation reaches a statistical significance greater than your desired significance level (by default, 90%), Optimizely will declare it a winner or loser. You can stop your test once your variations reach significance.
If some of your variations haven't reached significance, decide whether you can afford to wait for the number of visitors needed to reach significance or use the sample size calculator described below to calculate how many visitors you would need if the Improvement percentage were to change.
Why would you see a high Improvement percentage but a Statistical Significance of 0%? It's because your experiment is underpowered and hasn't had enough visitors. As more visitors encounter your variations, and convert, you'll start to see Statistical Significance increase because Optimizely is collecting evidence to declare winners and losers. Learn more in our article on the Stats Engine.
Even with Stats Engine in place, you probably still want to know how long you can expect your experiments to take, so that you can plan and roadmap them accurately. This article will walk you through the process for doing exactly that.
Have a question about test duration or your results? Head over to the Optimizely Community to post a discussion and see what others are talking about.
Use the online calculator to calculate sample sizes
Use Optimizely’s Sample Size Calculator to determine the traffic you will need for your conversion rate tests.
The calculator takes two inputs and then tells you what the sample size for both your original and your variation should be to meet your statistical goals. There are also options to modify the recommended statistical significance level, which should reflect the statistical significance level you choose for your Optimizely project. The values you choose for the calculator will be unique to each experiment and goal.
Here’s what the calculator looks like. To help you use the calculator for your conversion rate test, we’ll walk you through the inputs in more detail below.
Great, I’m done calculating sample size! Now, how long will it take to run the test?
This part is easy! The last step is to translate sample size into estimated time. Take the sample size and multiply it by the number of variations you have in your experiment. This gives you the total number of visitors you need. Divide that by your average number of visitors per day, and you will have the estimated number of days you need to run your test.
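The arithmetic above can be sketched in a few lines of Python. The function name and the example numbers are hypothetical, and the sketch assumes that the variation count includes the original, since the calculator reports a sample size for both the original and the variation.

```python
import math


def estimated_days(sample_size_per_variation, num_variations, avg_daily_visitors):
    """Estimate how many days a test needs to run.

    Assumes num_variations counts the original as well as each variation.
    """
    if avg_daily_visitors <= 0:
        raise ValueError("avg_daily_visitors must be positive")
    total_visitors = sample_size_per_variation * num_variations
    # Round up, because a partial day still means running the test that day.
    return math.ceil(total_visitors / avg_daily_visitors)


# Hypothetical inputs: 10,000 visitors per variation, an original plus
# one variation, and 1,500 visitors per day on average.
print(estimated_days(10_000, 2, 1_500))  # 20,000 / 1,500 rounds up to 14
```

If your traffic varies a lot from day to day, you may want to use a conservative daily average so that the estimate is not overly optimistic.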
With the introduction of Optimizely’s Stats Engine, you no longer need to use this sample size calculator to determine an experiment’s “stopping point.” Now, you can use the calculator mainly to estimate test duration in advance. Also note that other calculators, which account for traditional fixed-horizon testing, will no longer give you a correct estimate of Optimizely’s test duration.
Are you trying to calculate experiment length, but your site has low traffic? Check out some strategies in our article on testing low-traffic sites!
Baseline Conversion Rate
This is the current conversion rate (number of successful actions divided by the number of visitors who saw the page) for the page you’re testing. Baseline conversion rates can usually be calculated using data found in analytics platforms like Google Analytics or from a previous Optimizely experiment. If you don't have a previous Optimizely experiment, you can always run a "monitoring campaign" in Optimizely, which is an experiment that has only an Original, no Variations, just to measure baseline conversions.
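As a quick illustration, the baseline conversion rate is just a ratio of two counts that you would pull from your analytics platform. The function name and the counts below are hypothetical.

```python
def baseline_conversion_rate(conversions, visitors):
    """Successful actions divided by visitors who saw the page."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    if not 0 <= conversions <= visitors:
        raise ValueError("conversions must be between 0 and visitors")
    return conversions / visitors


# Hypothetical counts: 300 conversions from 1,500 visitors.
rate = baseline_conversion_rate(300, 1_500)
print(f"{rate:.1%}")  # 20.0%
```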
Minimum detectable effect (MDE)
This is a simple idea but a long explanation—play with the calculator and it will become clear pretty quickly, and then you can skip this long explanation.
Once you’ve entered your baseline conversion rate, the next step is to decide how much change from the baseline (how big or small a lift) you want to be able to detect. You’ll need less traffic to detect big changes and more traffic to detect small changes. The Optimizely Results page and sample size calculator are set to measure change relative to the baseline conversion rate.
Let’s use an example to help with this, with a 20% baseline conversion rate and a 5% MDE.
With these inputs, your test will be able to detect 80% of the time when a variation's underlying conversion rate is actually 19% or 21%, i.e. 20% +/- (5% x 20%). If you try to detect differences smaller than 5%, your test is said to be underpowered. Power is a measure of how well you can distinguish the difference you are detecting from no difference at all. So running an underpowered test is the equivalent of not being able to make strong declarations of whether your variations are actually winning or losing.
In Optimizely, Effect, or lift, is labeled Improvement on the Optimizely Results page. It is always presented as relative, not absolute.
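Because the MDE is relative, it can help to convert it into the absolute conversion rates it corresponds to. The sketch below reproduces the 20% baseline and 5% MDE example; the function name is hypothetical.

```python
def detectable_range(baseline_rate, relative_mde):
    """Return the (lower, upper) conversion rates implied by a relative MDE."""
    delta = baseline_rate * relative_mde  # absolute change, e.g. 0.20 * 0.05
    return baseline_rate - delta, baseline_rate + delta


low, high = detectable_range(0.20, 0.05)
print(f"{low:.0%} to {high:.0%}")  # 19% to 21%
```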
Once you input these two numbers, the calculator will tell you what sample size you need for your original and also for your variation. The calculator is by default set to the recommended level for statistical significance for your tests. You have the option to change these inputs depending on the level of risk you are comfortable with for your experiments. To learn more about each of these options, we’ve provided some more detail below.
You can also use MDE to benchmark how long to run a test and the impact you're likely to see. This approach can help provide a guideline through the uncertainty of testing, so you can prioritize experiments according to expected ROI. To learn more, read about using MDE to gauge the effort and potential impact of a test.
Statistical Significance level
Statistical significance answers the question, “How likely is it that my test results will say I have a winner when I actually don’t?” Generally we talk about this as 90% statistical significance. A different way to say the same thing is that we will accept a 10% false positive rate, where the result is not real (100% - 10% = 90%). This calculator defaults to 90% statistical significance, which is how tests are generally run. If you would like to increase or decrease the level of statistical significance for your test, you can edit this input. Note that you can also change the significance level that Optimizely uses to declare winners and losers from the Settings tab of the Home page.
Does Optimizely use 1-tailed or 2-tailed tests?
In the context of A/B testing, a 1-tailed test checks for statistical significance in one direction only (for example, whether a variation is a “winner”), whereas a 2-tailed test checks for statistical significance in both directions. Previously, Optimizely used 1-tailed tests because we believe in giving you actionable business results, but we now solve this for you even more accurately through false discovery rate control. Find out more in our Stats Engine article.
The right level of risk for you
When running an experiment, you may need to consider the trade-off between running tests quickly and reducing the chance of inaccuracy in your results (false positives and false negatives). Generally, tests are run at 90% statistical significance. You can adjust this threshold based on how much risk you are willing to take.
For example, imagine a scenario in which your experiment requires a large sample size to reach statistical significance, but you need to make a business decision within the next two weeks. Based on your traffic levels, your test may not reach statistical significance within that timeframe. What do you do? If your organization feels that the impact of a false positive (a winner that is incorrectly called) is low, you may decide to decrease the statistical significance to see results declared more quickly.
At the end of the day, you should be aware of the tradeoff between accurate data and available data when making time-sensitive business decisions based on your experiments.
Why isn’t my test reaching significance?
In general, smaller differences take longer to detect, because you need more data to confirm that Optimizely observed an actual, statistically significant difference, not random changes in conversion patterns.
If you find that your test has been running for a considerable amount of time and you still need more unique visitors to reach significance, this could be because Optimizely is observing scattered data - or conversions that are erratic and inconsistent over time. If your data has high variability, Stats Engine will require more data before showing significance.
When you are measuring impulse-driven goals (such as video plays or e-mail sign-ups), data tends to be more scattered, as visitor behavior tends to be erratic and easily affected by many small impulses. However, when you are measuring goals that involve carefully weighed decisions (such as a high-value purchase), you will see more stable, less variable data. Optimizely’s Stats Engine automatically calculates variability and adjusts accordingly.
See below for an example of data variability:
Low Variability Data: The blue line shows a data set for which the baseline conversion rate varies from 3.2% to 4.8%. If a variation raises this metric to 5%, we can tell that it is significant.
High Variability Data: The green line shows a data set whose baseline conversion rate varies between 2% and 6%. If a variation raises this metric to 5%, we would need additional data to call results significant, because 5% falls within the baseline conversion range.
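The intuition behind these two examples can be sketched as a simple range check. This is a deliberately simplified illustration, not how Stats Engine computes significance, and the function name is hypothetical.

```python
def falls_within_baseline_range(variation_rate, baseline_low, baseline_high):
    """True if the variation's rate lies inside the range the baseline already covers."""
    return baseline_low <= variation_rate <= baseline_high


# Low variability: baseline moves between 3.2% and 4.8%, variation is at 5%.
print(falls_within_baseline_range(0.05, 0.032, 0.048))  # False

# High variability: baseline moves between 2% and 6%, variation is at 5%.
print(falls_within_baseline_range(0.05, 0.02, 0.06))  # True
```

When the check returns True, the variation's rate cannot be told apart from the baseline's normal fluctuation, which is why more data is needed before results can be called significant.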