- Understand the different approaches used by Optimizely's Stats Engine and classical statistics
Sometimes, Optimizely will declare a winning variation in a situation where a traditional t-test would fail to find any statistically significant difference between it and the other variations. This is because Optimizely’s Stats Engine uses an approach that differs from those used in classical statistics-based models, one which is simultaneously more conservative in declaring a winner and less likely to reverse that declaration as more data accumulates.
In classical statistics, a t-test is commonly used to measure differences in responses between two events, often in a “before-and-after” context. For example, a t-test might be used to measure the efficacy of a particular medical treatment by comparing the health outcomes of a sample group before and after receiving the treatment.
Because it measures changes in response to an event, the t-test is also widely used in A/B testing. However, this approach has some inherent weaknesses, which is why Optimizely uses its proprietary Stats Engine instead.
Stats Engine vs. classical statistics
Rather than using those classical statistics tools to determine significance and declare a winning variation, Optimizely evaluates the results at a series of 100 successive intervals over the course of an experiment. Each of those intervals receives its own p-value and confidence interval.
The p-value that appears on the Results page is the smallest p-value that Optimizely saw across all of these sequential intervals. It is not an average p-value for the entire experiment.
Similarly, the confidence interval that you see on the Results page is the intersection of all of the confidence intervals that Optimizely created across those sequential intervals.
Because of that, the p-value and confidence interval that you see may not exactly match the currently observed means in the experiment.
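To make this bookkeeping concrete, here is a minimal Python sketch. It assumes the per-interval p-values and confidence intervals are already available as plain lists; the function name `running_summary` and the sample numbers are hypothetical, and the sketch does not attempt to reproduce how Stats Engine computes each interval's values.

```python
def running_summary(p_values, intervals):
    """Return the reported (p-value, confidence interval) after each interval.

    p_values:  per-interval p-values in chronological order (assumed given)
    intervals: per-interval (lower, upper) confidence bounds, same order
    """
    reported = []
    min_p = 1.0
    lower, upper = float("-inf"), float("inf")
    for p, (lo, hi) in zip(p_values, intervals):
        min_p = min(min_p, p)      # smallest p-value seen so far
        lower = max(lower, lo)     # intersecting intervals keeps the
        upper = min(upper, hi)     # tightest bound seen on each side
        # Note: an empty intersection (lower > upper) is not handled here.
        reported.append((min_p, (lower, upper)))
    return reported


# Hypothetical data: strong evidence at the second interval, weaker afterwards.
p_values = [0.40, 0.03, 0.08, 0.20]
intervals = [(-0.02, 0.10), (0.01, 0.09), (0.00, 0.08), (-0.01, 0.07)]

for step, (p, ci) in enumerate(running_summary(p_values, intervals), start=1):
    print(step, p, ci)
```

In this made-up series, the reported p-value stays at 0.03 after the second interval even though the later per-interval p-values are larger, and the reported interval only ever narrows.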
A classical test such as the t-test or z-test, on the other hand, uses only the currently observed means and the difference between them to compute a p-value and confidence interval.
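For comparison, a classical fixed-horizon calculation can be sketched as follows. This is an illustrative two-proportion z-test, assuming the metric is a binary conversion and that only the current totals are known; the function name and the visitor counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Fixed-horizon z-test on the current totals of two variations.

    Returns (difference, p_value, (lower, upper)) for rate_b - rate_a.
    """
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a

    # Pooled standard error for the test (null hypothesis: no difference).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

    # Unpooled standard error for the interval around the difference.
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * se
    return diff, p_value, (diff - margin, diff + margin)


# Hypothetical totals: 100/1000 conversions for A, 130/1000 for B.
diff, p_value, ci = two_proportion_z_test(100, 1000, 130, 1000)
print(diff, p_value, ci)
```

Because the inputs are just the running totals, rerunning this calculation as more visitors arrive produces a fresh p-value each time, with no memory of earlier results.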
If Optimizely detected strong evidence of a difference between two variations at first, but the evidence then weakened over time, the p-value and confidence interval shown on the Results page could still reflect the strong evidence that was detected at the start of the experiment.
In effect, the approach described above makes Optimizely more conservative about both declaring a winner and “un-declaring” a winning variation. A z-test, on the other hand, is much more likely to result in an experiment flipping into, and then back out of, significance.
The Stats Engine approach means that users are far less likely to see results where Optimizely has spuriously declared a winner (for a brief period of time) than they would be with the t-test.
If you are unsure whether Optimizely is likely to “un-declare” a winner, look at the currently observed mean (i.e., the ‘tick mark’). If it is at the edge of the confidence interval, then it’s possible Optimizely is accumulating evidence against the conclusion it has already drawn. In those cases, it may be worth waiting for a while. But if the observed mean is closer to the center of the confidence interval, you may feel more secure in declaring it a winner.
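The check described above can be expressed as a rough sketch. The helper below is hypothetical and not part of any Optimizely API; it simply measures how far the observed mean sits from the center of the confidence interval, with 0.0 meaning dead center and 1.0 meaning on an edge. The 0.8 cutoff used to flag "near the edge" is an arbitrary assumption chosen for illustration.

```python
def position_in_interval(observed, lower, upper):
    """Distance of `observed` from the interval's center, scaled so that
    0.0 is the center and 1.0 is either edge (values above 1.0 fall outside).
    """
    center = (lower + upper) / 2
    half_width = (upper - lower) / 2
    return abs(observed - center) / half_width


NEAR_EDGE = 0.8  # arbitrary, illustrative cutoff (an assumption)

# Hypothetical readings taken from a results page.
for observed in (5.0, 9.5):
    position = position_in_interval(observed, lower=0.0, upper=10.0)
    advice = "consider waiting" if position >= NEAR_EDGE else "looks stable"
    print(observed, round(position, 2), advice)
```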