Optimizely Knowledge Base

Resolving discrepancies between data pipelines

This article is about Optimizely X. If you're using Optimizely Classic, check this article out instead.
 
relevant products:
  • Optimizely X Web Experimentation
  • Optimizely X Web Personalization
  • Optimizely X Full Stack

THIS ARTICLE WILL HELP YOU:
  • Understand why Optimizely may appear to be delivering inconsistent data via the results and data export pipelines
  • Resolve any discrepancies you may encounter 

Optimizely’s data export service gives you access to all your Optimizely events. In some cases, you may observe differences between the output of your data export query and the results you get in Optimizely. This article will help you understand these differences and provide some practical steps to mitigate them. 

Overview of results and data export pipelines

Optimizely’s results page is powered by a real-time data pipeline. This pipeline (henceforth called the results pipeline) processes events as soon as it receives them by running several data enrichments, such as adding session-level information and applying Optimizely’s event attribution rules. Afterwards, the events are available for querying via the Results API and the results page. This usually happens within a few minutes.

On the other hand, data export is a static data pipeline that processes all events received over the past 24 hours and stores them in raw event format in S3. These raw events are organized in folders formatted in this way:

/optimizely-export-ng/{account_id}/{project_id}/2.0/yyyy/mm/dd/{experiment_id}/{file_name

The date in the folder path refers to the date the event was received by the Optimizely server (i.e. the server timestamp in PST), and not to the timestamp of the event itself. For example, if an event has a timestamp date of "2017-12-01" but is not received by the server until the following day, it will be saved in the "2017-12-02" folder partition. This has important implications for raw data querying requirements.
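As a rough illustration of this partitioning rule, the sketch below derives the daily folder for an event from its server receipt time. The function name `partition_folder`, its arguments, and the fixed UTC-8 offset used for PST are assumptions made for this example; they are not part of Optimizely's tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PST" is treated as a fixed UTC-8 offset (no daylight saving).
PST = timezone(timedelta(hours=-8), "PST")


def partition_folder(account_id, project_id, received_ms):
    """Return the daily folder prefix for an event received at `received_ms`.

    `received_ms` is the server receipt time in epoch milliseconds. The
    event's own timestamp plays no part in choosing the folder.
    """
    received = datetime.fromtimestamp(received_ms / 1000, tz=PST)
    return "/optimizely-export-ng/{}/{}/2.0/{:%Y/%m/%d}/".format(
        account_id, project_id, received
    )


# Received on 2017-12-02 at 03:00 UTC, which is still 2017-12-01 in PST.
print(partition_folder(123, 456, 1512183600000))
```

The account and project IDs here are placeholders. The point of the sketch is that only the receipt time decides the folder, so an event can land in a partition that does not match its own timestamp.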

data_discrepancies.png

Key differences between pipelines

This table summarizes the most important differences between the two pipelines.

Optimizely considers discrepancies within 5% to be acceptable for most customer scenarios.

 

| | Results | Data export |
| --- | --- | --- |
| Data availability | Near real-time (<10 min). | Next day (usually ~8am PST). |
| Event attribution | Event attribution rules are applied automatically. | Event attribution rules must be applied manually. |
| Time zone | The Results page uses the browser’s time zone setting by default for the results query. For example, a results query for ‘2017-12-01’ to ‘2017-12-07’ from San Francisco will first convert the date range to PST and then consider all events with a timestamp within the converted date range. | Event timestamps use UTC epoch time. |
| Delayed events | Included in the results a few seconds after they are received by Optimizely’s servers. | Included in the raw data. These events end up in the daily partition for the event’s arrival date. |
| Out of bounds events | If an event has a timestamp outside of the experiment’s valid running range, it is automatically excluded from the results. | Included in the daily raw data partition. Must be manually excluded from the query using proper begin/end timestamp conditions. |
| Duplicate events | Duplicated events are automatically de-duplicated using the event UUID. | Events must be manually de-duplicated using the event_uuid field. |
| Holdback events | Holdback traffic does not appear in the results. | Must be manually excluded using the isHoldbackTraffic filter. |
| Sessionization | For personalization, sessionization is performed automatically by the results pipeline to calculate total sessions and other session metrics. | Events must be manually grouped into sessions. |
| IP filtering | The IP filters (in web project settings) are applied in real time. | Events from filtered IPs must be manually excluded. |
| Results resetting | Removes all data from the start of the experiment until the time of the reset. | Not applicable. Resetting does not erase raw data. |
| Traffic reallocation | If visitors are allocated (or re-allocated) to more than one variation, Optimizely's event attribution model does not behave as expected, and results are invalid. | Raw events are attributed to the appropriate variation. Because the event attribution model does not behave predictably, any comparison with the results is very difficult. This is not a supported scenario. |
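To check a comparison against the 5% guideline mentioned above, a small helper like the following may be useful. This is only a sketch: the article does not state how the percentage is computed, so using the Results count as the baseline is an assumption, and both function names are hypothetical.

```python
def relative_discrepancy(results_count, raw_count):
    """Return the gap between the two counts as a fraction of the Results count."""
    if results_count == 0:
        raise ValueError("results_count must be non-zero")
    return abs(raw_count - results_count) / results_count


def within_tolerance(results_count, raw_count, tolerance=0.05):
    """True when the discrepancy is at or below the tolerance (5% by default)."""
    return relative_discrepancy(results_count, raw_count) <= tolerance


print(within_tolerance(10_000, 10_320))  # counts are 3.2% apart
print(within_tolerance(10_000, 10_800))  # counts are 8.0% apart
```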

Event attribution

Optimizely uses specific counting rules to count visitors, sessions, and conversions in the Results page. These rules make up Optimizely’s event attribution model. These rules must be applied manually when querying raw data and matching that data with results. The table below summarizes the rules used by each Optimizely product.

 

| | Unique Users/Sessions | Unique Conversions | Total Conversions |
| --- | --- | --- | --- |
| Web Experimentation | Count distinct visitorIDs that sent >=1 decision or conversion. | Count distinct visitorIDs that sent >=1 conversion. | Count total conversions sent. |
| Full Stack | Count distinct visitorIDs that sent >=1 decision. | Count distinct visitorIDs that sent >=1 conversion after the first decision. | Count total conversions after the first decision. |
| Web Personalization | Count total sessions that had >=1 decision. These are called qualified sessions. | Count qualified sessions with >=1 conversion after the decision. | Count total conversions sent from qualified sessions after the decision. |
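The Web Experimentation and Full Stack rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Optimizely's implementation: the event keys (`visitor_id`, `type`, `timestamp`) are hypothetical, the events are assumed to belong to a single variation, and a conversion at the same millisecond as the first decision is treated as "after" it.

```python
from collections import defaultdict


def web_experimentation_counts(events):
    """Apply the Web Experimentation rules to the events of one variation."""
    tracked = [e for e in events if e["type"] in ("decision", "conversion")]
    conversions = [e for e in tracked if e["type"] == "conversion"]
    return {
        "unique_users": len({e["visitor_id"] for e in tracked}),
        "unique_conversions": len({e["visitor_id"] for e in conversions}),
        "total_conversions": len(conversions),
    }


def full_stack_counts(events):
    """Apply the Full Stack rules to the events of one variation."""
    # Earliest decision timestamp per visitor.
    first_decision = {}
    for e in events:
        if e["type"] == "decision":
            v = e["visitor_id"]
            first_decision[v] = min(e["timestamp"], first_decision.get(v, e["timestamp"]))

    # Only conversions at or after the visitor's first decision are counted.
    conversions = defaultdict(int)
    for e in events:
        v = e["visitor_id"]
        if e["type"] == "conversion" and v in first_decision:
            if e["timestamp"] >= first_decision[v]:
                conversions[v] += 1

    return {
        "unique_users": len(first_decision),
        "unique_conversions": len(conversions),
        "total_conversions": sum(conversions.values()),
    }


events = [
    {"visitor_id": "a", "type": "conversion", "timestamp": 100},  # before a's decision
    {"visitor_id": "a", "type": "decision", "timestamp": 200},
    {"visitor_id": "a", "type": "conversion", "timestamp": 300},
    {"visitor_id": "a", "type": "conversion", "timestamp": 400},
    {"visitor_id": "b", "type": "decision", "timestamp": 150},
    {"visitor_id": "c", "type": "conversion", "timestamp": 500},  # never sent a decision
]
print(web_experimentation_counts(events))
print(full_stack_counts(events))
```

The same events produce different counts under the two rule sets, which is why the rules matching your product must be applied before comparing raw data with results.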

Time zone

By default, the Results page queries the results pipeline using the time zone of the user’s browser. For example, if a user visits the results page from San Francisco and queries results for 2017-12-01 to 2017-12-07, the results pipeline will return results from 2017-12-01 12:00am Pacific time to 2017-12-07 11:59pm Pacific time. If a collaborator now visits the same results page from New York and queries the same date range, then the results pipeline will return results from 2017-12-01 12:00am Eastern time to 2017-12-07 11:59pm Eastern time. So the results will differ for these two users, even when all else is equal.

The raw data does not use time zones. By default, all event timestamps use epoch time in UTC/GMT. You must manually ensure that the beginning and ending times for any raw data query match those used in the Results page. To do this, follow these steps:

  1. Navigate to the Results page and select the desired date range.

  2. Look at the Results page’s URL. Copy the timestamp values for the &beginDate and &endDate URL parameters. Each should be a 13-digit integer.

  3. In the raw data query, use these values for your timestamp filter:

SELECT *
FROM events
WHERE timestamp >= {beginDate} AND timestamp <= {endDate}


If the &beginDate and &endDate do not appear in the results URL, manually select the time range from the date picker. The results page will load and the URL will now contain the parameters.
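If you need to build the window yourself rather than copy it from the URL, the conversion can be sketched as follows. The function name is hypothetical, and the fixed UTC-8 and UTC-5 offsets are assumptions that only hold for dates in standard time, such as the December range used in this article.

```python
from datetime import date, datetime, time, timedelta, timezone

# Assumption: fixed offsets stand in for Pacific and Eastern standard time.
PACIFIC = timezone(timedelta(hours=-8))
EASTERN = timezone(timedelta(hours=-5))


def local_range_to_epoch_ms(first_day, last_day, tz):
    """Convert an inclusive local date range to (begin_ms, end_ms) in epoch milliseconds."""
    begin = datetime.combine(first_day, time(0, 0, 0), tzinfo=tz)
    end = datetime.combine(last_day, time(23, 59, 59), tzinfo=tz)
    return int(begin.timestamp()) * 1000, int(end.timestamp()) * 1000


pacific = local_range_to_epoch_ms(date(2017, 12, 1), date(2017, 12, 7), PACIFIC)
eastern = local_range_to_epoch_ms(date(2017, 12, 1), date(2017, 12, 7), EASTERN)
print(pacific)
print(eastern)
```

The two windows are three hours apart, which is the difference the San Francisco and New York collaborators would see for the same date range.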

Delayed events

In some situations, events may arrive after a considerable delay. These include mobile scenarios where visitors might suspend the browser app before an event is fired and then resume it hours or days later, as well as Full Stack scenarios where the developer might intentionally queue events and batch-upload them in the future.

Optimizely’s results pipeline relies on the event’s timestamp—instead of its arrival time—to attribute the event to the correct time range, while data export simply stores all events to daily partitions according to the time they were received. For example, an event with a timestamp of 2017-12-01 that is received on 2017-12-07 will be stored in the 2017-12-07 partition.

This difference can create an artificial discrepancy between results and raw data. To reconcile these discrepancies, follow these steps:

  1. In your raw data query, select all daily partitions from the time the experiment started until a few days after the experiment ended, even if you are only interested in a sub-range of the experiment’s running time.

    For example, if the experiment ran from 2017-12-01 to 2017-12-07 and you are interested in a comparison between 2017-12-01 and 2017-12-02, you would select the daily partitions for 2017-12-01 to 2017-12-15 (roughly a week after the experiment’s completion). This ensures that the majority of delayed events will be captured by your raw event query.

  2. Next, select events for a particular time window by following the steps described earlier (in the time zone section).  Here’s how it would look for a sub-range of 2017-12-01 to 2017-12-02:

SELECT *
FROM events -- includes data from the '12-01' to '12-15' partitions
WHERE
timestamp >= 1512086400000      -- epoch ms for 2017-12-01 12:00am UTC
AND timestamp <= 1512259199000  -- epoch ms for 2017-12-02 11:59pm UTC
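The two steps can also be sketched outside SQL. In the example below, the helper names and the event keys (`timestamp` for the event time, `partition` for the arrival date) are hypothetical; the window bounds are the same epoch values used in the query above.

```python
from datetime import date, timedelta


def daily_partitions(first_day, last_day):
    """List every daily partition date from first_day to last_day, inclusive."""
    days = (last_day - first_day).days
    return [first_day + timedelta(days=n) for n in range(days + 1)]


def events_in_window(events, begin_ms, end_ms):
    """Keep events whose own timestamp is inside the window, whatever partition they arrived in."""
    return [e for e in events if begin_ms <= e["timestamp"] <= end_ms]


# Step 1: read every partition from the experiment start to several days after its end.
partitions = daily_partitions(date(2017, 12, 1), date(2017, 12, 15))

# Step 2: filter on the event timestamp, not on the partition date.
events = [
    {"timestamp": 1512090000000, "partition": date(2017, 12, 1)},  # arrived on time
    {"timestamp": 1512093600000, "partition": date(2017, 12, 7)},  # arrived six days late
    {"timestamp": 1512345600000, "partition": date(2017, 12, 4)},  # outside the window
]
selected = events_in_window(events, 1512086400000, 1512259199000)
print(len(partitions), len(selected))
```

A query limited to the 2017-12-01 and 2017-12-02 partitions would miss the second event, even though its timestamp falls inside the window.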

Out of bounds events

Out of bounds events have a timestamp outside the experiment’s valid time range. For example, an event that has a timestamp of 2010-10-01 would be considered out of bounds for an experiment that is running between 2012-12-01 and 2012-12-07. Out of bounds events can occur if the clock settings on the browser sending the events were incorrect. The developer might also have used incorrect timestamps during event tracking.

The results pipeline will automatically exclude out of bounds events. However, in the raw data, these events will still be present in the daily partition, organized by the date they were received by Optimizely’s servers. You will have to manually exclude them using appropriate time filters, as described in the previous two sections (time zone and delayed events).

Duplicate events

Duplicate events occur when the same event is sent multiple times by the client. Optimizely's results pipeline relies on a unique event identifier (UUID) that should be present in the event payload to de-duplicate events. The web client and all Full Stack SDKs automatically generate event UUIDs during event tracking. Event API developers are responsible for generating event UUIDs on their own.

While de-duplication in the results pipeline is automatic, it must be performed manually when querying raw data:

SELECT count(distinct event_uuid)
FROM events
WHERE
event_name = 'eventName'
AND timestamp >= 1512086400000
AND timestamp <= 1512259199000
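For Event API developers who generate their own UUIDs, the idea can be sketched as follows: assign the UUID once when the event is created so that any retry carries the same value, then de-duplicate on that value. The payload keys (`event_uuid`, `event_name`, `visitor_id`, `timestamp`) and both function names are assumptions for illustration; use whatever field names your export and payloads actually contain.

```python
import uuid


def new_event(event_name, visitor_id, timestamp_ms):
    """Build an event payload whose UUID is assigned once, at creation time."""
    return {
        "event_uuid": str(uuid.uuid4()),
        "event_name": event_name,
        "visitor_id": visitor_id,
        "timestamp": timestamp_ms,
    }


def deduplicate(events, key="event_uuid"):
    """Keep the first event seen for each UUID, preserving the original order."""
    seen = set()
    unique = []
    for e in events:
        if e[key] not in seen:
            seen.add(e[key])
            unique.append(e)
    return unique


purchase = new_event("purchase", "visitor-1", 1512090000000)
signup = new_event("signup", "visitor-2", 1512093600000)

# The same purchase payload was sent twice, for example after a network retry.
received = [purchase, purchase, signup]
print(len(deduplicate(received)))
```

Generating a fresh UUID on every retry would defeat de-duplication, because each resend would look like a distinct event.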

Holdback traffic

The holdback visitor group is traffic that is not bucketed into a variation; instead, it is bucketed into an empty or generic experience that is similar to a control. One important difference is that the holdback traffic does not appear in the results for an A/B experiment, while control traffic does. When querying raw events, you must manually exclude holdback traffic using the available IsHoldback filter:

SELECT count(distinct event_uuid)
FROM events
WHERE
event_name = 'eventName'
AND LAYER_HOLDBACK = FALSE
AND timestamp >= 1512086400000
AND timestamp <= 1512259199000

IP filtering

If the project is using IP filtering (available for X Web only), the Results page will automatically exclude any visitors on the filtered IP list. However, this is not true for data export. VisitorIDs and their conversions must be manually excluded using the same IP filter conditions in the SQL query.
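If you post-process exported events outside SQL, the exclusion might look like the sketch below. It assumes each event carries a hypothetical `ip` field and that your filters can be expressed as CIDR ranges; the filter format configured in your project settings may differ, in which case the matching logic would need to change accordingly.

```python
import ipaddress


def exclude_filtered_ips(events, filtered_networks):
    """Drop events whose IP address falls inside any of the filtered networks."""
    networks = [ipaddress.ip_network(n) for n in filtered_networks]

    def is_filtered(ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    return [e for e in events if not is_filtered(e["ip"])]


events = [
    {"visitor_id": "a", "ip": "10.1.2.3"},      # internal office range
    {"visitor_id": "b", "ip": "203.0.113.7"},   # single filtered address
    {"visitor_id": "c", "ip": "198.51.100.4"},  # regular visitor
]
kept = exclude_filtered_ips(events, ["10.0.0.0/8", "203.0.113.7/32"])
print([e["visitor_id"] for e in kept])
```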

Results resetting

When results are reset, the underlying raw event data is not deleted. To match the Results page, make sure your raw data query does not count events from before the reset. You can achieve this using the time filters described in previous sections.