- Optimizely X Web Personalization
- Optimizely X Web Recommendations
THIS ARTICLE WILL HELP YOU:
- Understand the difference between the concepts of false positive rate and false discovery rate, and why Optimizely's Stats Engine uses one over the other.
Optimizely’s Stats Engine focuses on false discovery rate instead of false positive rate, which helps you make business decisions based on reliable results.
You may already be familiar with the idea of a false positive rate. It can be calculated as the ratio between:
- the number of negative events incorrectly categorized as positive, and
- the overall number of actual negative events.
On every test, there is a risk of getting a false positive result. This happens when a test reports a conclusive winner but there is in fact no real difference in visitor behavior between your variations.
With traditional statistics, the risk of generating at least one false positive result increases as you add more metrics and variations to your experiment. This is true even though the false positive rate stays the same for each individual metric or variation.
This may sound like a theoretical problem, and it is (it’s known as the multiple testing problem). But it can also have significant real-world impact. The reason is that even if the false positive rate for each individual metric or variation stays the same, the chances that you will make a critical business decision based on a false positive result grow very quickly.
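To make "grow very quickly" concrete, here is a minimal sketch of the multiple testing problem. It assumes a five percent false positive rate per comparison and that the comparisons are independent of one another; neither figure comes from Optimizely, and the function name is purely illustrative.

```python
def chance_of_any_false_positive(alpha: float, comparisons: int) -> float:
    """Probability of at least one false positive across `comparisons`
    independent tests, each run with false positive rate `alpha`."""
    # The chance that every single test avoids a false positive is
    # (1 - alpha) ** comparisons; the complement is the chance of at least one.
    return 1.0 - (1.0 - alpha) ** comparisons


# Assumed per-comparison false positive rate of 5%.
for m in (1, 2, 5, 10, 20):
    risk = chance_of_any_false_positive(0.05, m)
    print(f"{m:>2} metric/variation comparisons: {risk:.3f}")
```

Under these assumptions, ten comparisons already carry roughly a 40 percent chance of at least one false positive, even though each individual comparison still has only a 5 percent rate.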
False discovery rate
Optimizely helps you avoid this by taking a more stringent approach to controlling errors. Instead of focusing on the false positive rate, Optimizely uses procedures that manage the false discovery rate. These procedures are designed to control the expected proportion of conclusive results that are incorrect.
In statistical language, this is the expected proportion of rejections of the null hypothesis that are incorrect (the null hypothesis being the claim that a particular change to your website produced no change in visitor behavior).
If you’re not familiar with the concept of false discovery rate control, it’s a modern statistical procedure that has been demonstrated to be more accurate for testing multiple hypotheses at once, which is exactly what you’re doing when you run a test with more than one variation or more than one metric. You can learn more about how false discovery rate control works by reading this article or scrolling through this slide deck (they’re both pretty technical reads).
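For readers who want a feel for what a false discovery rate procedure looks like, the sketch below implements the textbook Benjamini-Hochberg procedure, one well-known way to control the false discovery rate. This article does not spell out the exact procedure Stats Engine uses, so treat this as an illustration of the general idea rather than Optimizely's implementation. The p-values and the ten percent target are made up for the example.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], fdr_level: float = 0.10) -> List[bool]:
    """Return one flag per p-value: True if that result is declared conclusive
    under the Benjamini-Hochberg step-up procedure at the given FDR level."""
    m = len(p_values)
    # Rank the hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at most (k / m) * fdr_level.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr_level:
            cutoff_rank = rank

    # Everything ranked at or below the cutoff is declared conclusive.
    conclusive = [False] * m
    for idx in order[:cutoff_rank]:
        conclusive[idx] = True
    return conclusive


# Hypothetical p-values for ten variation / metric combinations.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.09, 0.12, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p_values, fdr_level=0.10)
print(sum(flags), "of", len(flags), "results declared conclusive")
```

The threshold each result must clear depends on its rank among all the results, which is how the procedure accounts for the number of variations and metrics being tested at once.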
Here is an example of how false discovery rate control delivers better results in an experiment using multiple variations and metrics. Imagine a hypothetical experiment with five variations and two distinct metrics:
In this experiment, there are ten different opportunities for a conclusive result. There are two winners reported; however, one of them (the one labeled “false winner”) is actually a false positive: the variation made no real difference to that metric.
If we were to (incorrectly) use the false positive rate as our metric, we would think the likelihood of choosing the false winner is ten percent, because only one of the ten potential results is incorrect. We would likely consider this to be an acceptable rate of risk.
But looking at the false discovery rate, we see that our chances of selecting a false winner are actually fifty percent. That’s because the false discovery rate looks only at the conclusive results, rather than at every opportunity for a result.
If you were running this experiment, the first thing you probably would do is discard all the inconclusive variation / metric combinations. You would then have to decide which of the two winning variations to implement. In doing so, you would have no better than a 50-50 chance of selecting the variation that would actually help drive the visitor behavior you wanted to encourage.
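The two percentages in this example come from dividing the same single false winner by two different totals. The short sketch below reproduces that arithmetic; the variable names are illustrative and the counts are taken directly from the hypothetical experiment described above.

```python
# Counts from the hypothetical experiment: 5 variations x 2 metrics.
variations = 5
metrics = 2
opportunities = variations * metrics   # every variation / metric combination
reported_winners = 2                   # conclusive results shown to you
false_winners = 1                      # conclusive results that are wrong

# Dividing by every opportunity for a result gives the reassuring figure.
share_of_all_results = false_winners / opportunities

# Dividing by the conclusive results only gives the false discovery proportion.
false_discovery_proportion = false_winners / reported_winners

print(f"Share of all results that are wrong: {share_of_all_results:.0%}")            # 10%
print(f"Share of reported winners that are wrong: {false_discovery_proportion:.0%}")  # 50%
```

The second figure is the one that matters when you are choosing which reported winner to implement.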
A false discovery rate of fifty percent would definitely be alarming. But because Optimizely uses techniques that work to keep the false discovery rate low—approximately ten percent—your chances of selecting a true winning variation to implement are much higher than if you were using a tool that relied on more traditional statistical methods.