
Why experiment results sometimes don't match live metrics

THIS ARTICLE WILL HELP YOU:
  • Understand a possible discrepancy between test results and analytics data after rolling out a change
  • Design experiments that reliably predict true lift when published to your live site

In A/B testing, it's important to be able to accurately predict how wins translate into real-life value. Getting a winning result is great, but true value comes from implementing winning changes and watching your metrics go up.

Sometimes, there's a discrepancy between lift in a test and what you see in analytics after rolling out the change. On occasion, lift from live changes can even wear off over time.

Understanding where gaps between experiment results and live metrics come from - and designing experiments to mitigate those gaps - is key to a successful testing strategy. Read on for our take on why live metrics sometimes diverge from experiment data and what to do about it.

Pay attention to difference intervals

A/B testing is based on a sample: the visitors who happen to enter your experiment. As such, there's always the possibility of a discrepancy, because a test sample can never represent the full population of all visitors across time. If you ran the exact same test a few times, you'd see slightly different results even if there were no false positives.

Implementing the change on your site is, in essence, like running another experiment, so you can expect some difference between the test result and the metrics you see after pushing the change live. But in experience optimization, it's important to be able to predict the value a given test will deliver; that kind of uncertainty won't do.

The best solution is to report results using difference intervals rather than just the observed improvement. The difference interval gives extra visibility into the lift you can expect, since it shows the range where your true improvement lies (with a statistical likelihood of 90%, or whatever you set your statistical significance threshold to).

Curious how the difference interval is calculated? Consider an example: a test where the baseline conversion rate is 9.51%, the variation shows a 10.2% improvement, and the difference interval runs from 0.4% to 1.59%.

As opposed to the 10.2% improvement, which is calculated relative to the baseline conversion rate, the numbers in the difference interval are absolute. This means the variation's conversion rate is expected to sit between (9.51% + 0.4% =) 9.91% and (9.51% + 1.59% =) 11.1%.

Report this range to your stakeholders, or report just the lower bound. That way you'll be able to estimate the increase in value from the test more reliably.
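If you'd like to sanity-check a difference interval yourself from raw counts, the sketch below shows one way to do it. It uses a generic normal-approximation interval and hypothetical visitor and conversion counts; it is not Optimizely's Stats Engine, which uses sequential testing and will produce somewhat different numbers.

```python
# Minimal sketch: a 90% interval for the absolute difference in conversion
# rate, using a normal approximation. All counts below are hypothetical.
from statistics import NormalDist

def difference_interval(conv_a, n_a, conv_b, n_b, confidence=0.90):
    """Return (low, high) for the absolute lift of variation (b) over baseline (a)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.645 for 90%
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Baseline converts at ~9.51%, variation at ~10.48% (a ~10.2% relative lift).
low, high = difference_interval(conv_a=2853, n_a=30000, conv_b=3144, n_b=30000)
print(f"Absolute difference interval: {low:.2%} to {high:.2%}")
```

The lower bound is the conservative number to plan around: even in the less optimistic case, that is roughly the absolute lift you can expect.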

Account for seasonality

Seasonality is a big factor in the volatility of results. Most businesses see highly visible weekly and daily cycles in their traffic and conversions.

For example, the daily session count for our own Knowledge Base dips sharply on Saturdays and Sundays, when people (rightfully) would rather do things other than work.

Short cycles like these are very common and easy to spot. The best way to mitigate the uncertainty they introduce is to run tests for whole weeks, or whatever time frame makes sense for your business.
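As a quick check on your own data, a sketch like the one below can surface weekly cycles. It assumes a CSV export from your analytics tool; the file name and the `date` and `sessions` columns are assumptions, not a specific Optimizely export.

```python
# Minimal sketch: average sessions per weekday to reveal weekly cycles.
# The file name and column names are assumptions about your analytics export.
import pandas as pd

daily = pd.read_csv("daily_sessions.csv", parse_dates=["date"])
daily["weekday"] = daily["date"].dt.day_name()

weekday_avg = (
    daily.groupby("weekday")["sessions"]
    .mean()
    .reindex(["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"])
)
print(weekday_avg)  # pronounced dips suggest testing in whole-week increments
```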

Other types of seasonality may be more difficult to recognize. Holidays, political events, financial events, and the seasons themselves can all interfere with how your test results hold up. It's often better to schedule your tests to account for this kind of seasonality.

For example, assume you run a website for a flower shop and you'd like to test a new product page layout. At Optimizely, we'd argue that Valentine's Day would not be the best time to test this type of change, since visitor behavior is unusual around that time. Unless you're testing a concept that is directly tied to Valentine's Day, like a special promotion or a festive homepage background, it's probably best to run your experiment before or after. Otherwise, the results you see in the experiment aren't likely to materialize in the same way once the change is implemented.

Changes in your audience

In A/B testing, you test a hypothesis on a sample of visitors and extrapolate the results to the rest of your user base. But if you make changes to your customer acquisition channels, you may see a difference in visitor behavior.

If you acquire visitors with a new channel or target different customer segments, the changes you tested may not apply to them in the same way. You'll see this difference in your analytics.

Please note that discrepancies due to changes in your audience are more likely to affect smaller businesses, where a new channel or segment can quickly become a large share of total traffic.

Novelty effect

When you add new functionality or make a visual change, your visitors may react with some initial interest. Over time, this interest may wane; this is a novelty effect. When you observe lift that decreases over time, you may be seeing a novelty effect.

For example, imagine you test a new call-to-action (CTA) to encourage more visitors to click through to a "Special Offers" page. If you make the CTA stand out with a new bright color, visitors who are used to the original design may notice it; the variation may look like a win. But over time, the novelty will wear off on repeat visitors and that initial improvement may regress.

For this reason, it's important to design tests that generate true insights about your customers, so you can create experiences that help them achieve their goals. Tests that simply prompt a new, short-term behavior don't always serve your long-term business goals.

Different types of experiments have different likelihoods of being affected by regression to the mean. Removing an unnecessary step in your cart checkout funnel will yield a more durable result, but the lift from simply making an element more noticeable is more likely to wear off over time.

Novelty only affects returning visitors, since new visitors aren't aware that anything has changed. One way to check for a novelty effect is to segment your results by new versus returning visitors, as in the sketch below.
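Here's a minimal sketch of that check, using hypothetical visitor and conversion counts for each segment; in practice you would pull these numbers from your experiment results segmented by visitor type.

```python
# Minimal sketch: compare lift for new vs. returning visitors.
# All counts are hypothetical (visitors, conversions) per segment.
segments = {
    ("new", "original"):        (12000, 1150),
    ("new", "variation"):       (12100, 1165),
    ("returning", "original"):  (8000, 760),
    ("returning", "variation"): (8050, 900),
}

for visitor_type in ("new", "returning"):
    n_o, c_o = segments[(visitor_type, "original")]
    n_v, c_v = segments[(visitor_type, "variation")]
    rate_o, rate_v = c_o / n_o, c_v / n_v
    lift = (rate_v - rate_o) / rate_o
    print(f"{visitor_type:>9}: original {rate_o:.2%}, "
          f"variation {rate_v:.2%}, relative lift {lift:+.1%}")

# If the lift is concentrated in returning visitors, part of it may be
# novelty and could fade once the change stops being new.
```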