How long to run a test

This article will help you:

  • Use Optimizely's Sample Size Calculator to calculate the sample size for your test
  • Calculate the length an experiment will likely run based on sample size
  • Choose a Baseline Conversion Rate and Minimum Detectable Effect for your estimates
 

Tip:

Need to calculate the sample size or duration of your experiment? Check out our Sample Size Calculator!

You have a theory about how to improve your conversion rate, you've built your test, and you’re ready to turn it on. Congratulations! So, how long do you have to wait until you know if your theory was right or not? Traditionally, you’ve needed to figure out the total sample size you need and divide it by your daily traffic, then stop the test at the exact sample size that you calculated. Doesn’t sound very simple, does it?

Here’s the good news: Optimizely’s Stats Engine removes the need to calculate a sample size in advance by using a methodology called sequential testing, which collects evidence as your test runs to declare significant results and show you winners and losers as quickly and accurately as possible.

Optimizely also shows you an estimate of how many visitors you'll need to reach significant results.

That said, you probably still want to know how long you can expect your experiments to take, so that you can plan and roadmap them accurately. This article will walk you through the process for doing exactly that.

 

Tip:

Looking for the definitive guide to getting up and running with Optimizely? Learn and practice your Optimizely skills and web optimization strategy, from beginner to advanced, in Optimizely Academy.

Have a question about test duration or your results? Head over to the Optimizely Community to post a discussion and see what others are talking about.

Using an online calculator to calculate sample sizes

Use Optimizely’s Sample Size Calculator to determine the traffic you will need for your conversion rate tests.

The calculator takes two inputs and tells you the sample size you need for your original and for each variation to meet your statistical goals. You can also adjust the recommended statistical significance level, which should match the statistical significance level you have chosen for your Optimizely project. The values you choose for the calculator will be unique to each experiment and goal.

 

Note:

With the introduction of Optimizely’s Stats Engine, you no longer need to use this sample size calculator to determine an experiment’s “stopping point.” Instead, use the calculator mainly to estimate test duration in advance. Also note that other calculators, which are built for traditional fixed-horizon testing, will not give you an accurate estimate of Optimizely’s test duration.

To help you use the calculator for your conversion rate test, we’ll walk through each of its inputs in more detail below.

Baseline Conversion Rate

This is the current conversion rate (number of successful actions divided by the number of visitors who saw the page) for the page you’re testing. Baseline conversion rates can usually be calculated from data found in analytics platforms like Google Analytics or from a previous Optimizely experiment. If you don't have a previous Optimizely experiment, you can always run a "monitoring campaign" in Optimizely: an experiment with only an original and no variations, used solely to measure baseline conversions.
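For illustration, here is a minimal sketch of that calculation, assuming hypothetical visitor and conversion counts pulled from your analytics platform (the numbers below are made up):

```python
# Minimal sketch: computing a baseline conversion rate.
# The counts are hypothetical, e.g. pulled from Google Analytics.
visitors = 52_000      # visitors who saw the page
conversions = 10_400   # visitors who completed the success action

baseline_conversion_rate = conversions / visitors
print(f"Baseline conversion rate: {baseline_conversion_rate:.1%}")  # 20.0%
```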

Minimum detectable effect (MDE)

This is a simple idea with a longer explanation. Play with the calculator and it will become clear quickly, at which point you can skip the rest of this section.

Once you’ve entered your baseline conversion rate, the next step is to decide how much change from the baseline (how big or small a lift) you want to be able to detect. You’ll need less traffic to detect big changes and more traffic to detect small changes. The Optimizely results page and sample size calculator are set to measure change relative to the baseline conversion rate.

Let’s use an example to help with this, with a 20% baseline conversion rate and a 5% MDE.

With these inputs, your test will be able to detect, 80% of the time, when a variation's underlying conversion rate is actually 19% or 21%, i.e. 20% +/- (5% x 20%). If you try to detect differences smaller than 5%, your test is said to be underpowered. Power is a measure of how well you can distinguish the difference you are detecting from no difference at all. Running an underpowered test means you cannot make strong declarations about whether your variations are actually winning or losing.
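To make the arithmetic concrete, here is a short sketch of the example above, using the 20% baseline and 5% relative MDE as the only inputs:

```python
# Sketch: what a 5% relative MDE means at a 20% baseline.
baseline = 0.20  # baseline conversion rate
mde = 0.05       # minimum detectable effect, relative to the baseline

absolute_change = baseline * mde    # 0.01, i.e. one percentage point
lower = baseline - absolute_change  # 0.19
upper = baseline + absolute_change  # 0.21
print(f"Detectable conversion rates: {lower:.0%} or {upper:.0%}")  # 19% or 21%
```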

 

Note:

In Optimizely, effect (or lift) is labeled Improvement on the results page. It is always presented as a relative change, not an absolute one.

Once you input these two numbers, the calculator tells you the sample size you need for your original and for each variation. By default, the calculator is set to the recommended statistical significance level for your tests. You can change these inputs depending on the level of risk you are comfortable with for your experiments. We describe each of these options in more detail below.
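Optimizely’s calculator and Stats Engine use their own sequential-testing math, so the sketch below is not Optimizely’s exact formula. It is the textbook fixed-horizon sample-size formula for comparing two proportions, shown only to illustrate how the baseline, MDE, significance, and power inputs interact:

```python
import math
from scipy.stats import norm

def sample_size_per_variation(baseline, mde, significance=0.95, power=0.80):
    """Textbook fixed-horizon estimate of visitors needed per variation.

    Not Optimizely's exact formula: a classical two-sided, two-proportion
    approximation, for illustration only.
    """
    p1 = baseline
    p2 = baseline * (1 + mde)  # the MDE is relative to the baseline
    z_alpha = norm.ppf(1 - (1 - significance) / 2)  # ~1.96 at 95%
    z_beta = norm.ppf(power)                        # ~0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# 20% baseline, 5% relative MDE -> roughly 25,600 visitors per variation
print(sample_size_per_variation(baseline=0.20, mde=0.05))
```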

Statistical Significance level

Statistical significance answers the question, “How likely is it that my test results will say I have a winner when I actually don’t?” Generally we talk about this as 95% statistical significance. Another way to say the same thing is that we will accept a 5% false positive rate: a 5% chance of declaring a winner when the result is not real (100% - 5% = 95%). The calculator defaults to 95% statistical significance, which is how tests are generally run. If you would like to increase or decrease the level of statistical significance for your test, you can edit this input. Note that you can also change the significance level that Optimizely uses to declare winners and losers from the Settings tab of the Home page.

The right level of risk for you

When running your test, you need to make a trade-off between getting results quickly and reducing the chances of inaccurate results (false positives and false negatives). Generally, tests are run at 95% statistical significance, but you can adjust this based on how much risk you are willing to take. For example, if your organization feels that the impact of a false positive (falsely picking a winner) is low, you might want to decrease your statistical significance, and you’ll generally see results declared more quickly. At the end of the day, just be aware of these trade-offs, so you can make business decisions based on data that is both accurate and available when you need it.

 

Note:

Does Optimizely use 1-tailed or 2-tailed tests?

In the context of A/B testing, a 1-tailed test checks for statistical significance in one direction only (for example, whether a variation is a “winner”), whereas a 2-tailed test checks for significance in both directions (winners and losers). Previously, Optimizely used 1-tailed tests because we believe in giving you actionable business results; we now solve this even more accurately through false discovery rate control. Find out more in our Stats Engine article.
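As a rough illustration of the difference, here is a sketch using a standard two-proportion z-test with hypothetical counts. This is not Stats Engine’s actual math; it only shows how the same observed data yields different p-values under 1-tailed and 2-tailed tests:

```python
import math
from scipy.stats import norm

def z_statistic(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two observed conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical results: original converts at 20.0%, variation at 21.5%.
z = z_statistic(conv_a=2_000, n_a=10_000, conv_b=2_150, n_b=10_000)
p_one_tailed = 1 - norm.cdf(z)             # "is the variation better?"
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))  # "is the variation different?"
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```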

Great, you’ve calculated your sample size! Now, how long will it take to run the test?

This part is easy! The last step is to translate sample size into estimated time. Take the per-variation sample size from the calculator and multiply it by the total number of variations in your experiment, including the original. This gives you the total number of visitors you need. Divide that by your average number of visitors per day, and you have the estimated number of days you need to run your test.
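Here is that arithmetic as a short sketch, with assumed numbers (your sample size, variation count, and traffic will differ):

```python
# Sketch of the duration estimate described above; all numbers are assumptions.
sample_size_per_variation = 25_600  # from the sample size calculator
num_variations = 2                  # original + one variation
daily_visitors = 3_000              # average visitors per day entering the test

total_visitors_needed = sample_size_per_variation * num_variations
estimated_days = total_visitors_needed / daily_visitors
print(f"Estimated duration: {estimated_days:.0f} days")  # ~17 days
```

Keep in mind this is an estimate: with Stats Engine’s sequential testing, Optimizely may declare significance earlier or later depending on the size of the effect it actually observes.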
