Handling Time Splay
A common problem in event analytics is what I'll call "time splay". Your log pipeline collects a batch of events during the 24-hour period of Feb 17, 2017 PST, so you expect every one of those events to have a timestamp that falls on Feb 17, 2017 PST.
But when you take a closer look, you find that some of the events collected on Feb 17, 2017 PST have timestamps that are far older, stretching back days or even weeks. Others have timestamps that are newer than the collection time, even though an event from the future should be impossible. This is pretty common, and it has two usual causes: collection delays in your log pipeline, or event timestamps that are set by the client and are simply not accurate.
The Impact of Time Splay
There are a couple of cases where this might impact you:
- You run a query over the last month and get a certain result. Then you wait a day and run the same query again, and suddenly you see a spike in activity from 3 weeks ago that you could swear wasn't there yesterday. If you have a lot of time splay, the likely explanation is that events ingested in the last day carried timestamps from 3 weeks ago.
- You decide to delete and reingest data from a certain time period because you realize you want to apply a different transformation to get more value out of the data. If you have a lot of time splay, then it's hard to match up a time range of data you want to delete against the data files you need to reingest to repopulate that time range.
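To make the first case concrete, here is a minimal sketch in plain Python of how one day's ingest can change the counts for a day three weeks in the past. The event layout (a dict with a `ts` field), the sample data, and the `daily_counts` helper are all made up for illustration; this is not how Interana stores or queries events.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset; enough for this February example (no daylight saving).
PST = timezone(timedelta(hours=-8))


def daily_counts(events):
    """Count events per calendar day (PST), keyed by each event's own timestamp."""
    return Counter(e["ts"].astimezone(PST).date().isoformat() for e in events)


# Hypothetical events that were already ingested when the query ran yesterday.
ingested_before = [
    {"ts": datetime(2017, 1, 27, 9, 0, tzinfo=PST)},
    {"ts": datetime(2017, 2, 16, 12, 0, tzinfo=PST)},
]

# Hypothetical events collected on Feb 17. Two of them carry
# timestamps from three weeks earlier (time splay).
collected_feb_17 = [
    {"ts": datetime(2017, 2, 17, 8, 0, tzinfo=PST)},
    {"ts": datetime(2017, 1, 27, 10, 0, tzinfo=PST)},
    {"ts": datetime(2017, 1, 27, 11, 0, tzinfo=PST)},
]

yesterday = daily_counts(ingested_before)
today = daily_counts(ingested_before + collected_feb_17)

# Days whose count differs between the two runs: (count yesterday, count today).
changed = {
    day: (yesterday[day], today[day])
    for day in today
    if today[day] != yesterday[day]
}
print(changed)
```

In this made-up data set, the count for Jan 27 changes between the two runs even though no new events happened on Jan 27; they were only collected late.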
Approaches For Handling Time Splay
So what can you do about time splay? You have a few options:
- Investigate it to determine how much time splay you have. The Transformer Library provides an add_file_date transform that inserts the timestamp of the origin data file into each event. You can then run a query in Interana that compares the two values: for events that came from files timestamped Feb 17, 2017 PST, did all of the event timestamps actually fall within Feb 17, 2017 PST?
- Truncate it. The Transformer Library provides a time_convert transform that processes the timestamp of each event and discards events that fall too far outside a window you configure (presumably based on the timestamp of your data file). Throwing events away isn't ideal, but it bounds how far back new data can reach: with a tight enough window, the only results that can change between yesterday and today are those for the last day, not those for days or weeks ago.
- Do time correction upstream in your logging pipeline. If events are sent from clients and you cannot guarantee that their timestamps are correct, you can apply a rule like "when I receive this event, I set or adjust its timestamp based on the (reliable) server time at which I received it, and ignore the client timestamp".
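The three options above can be sketched in plain Python. This is only an illustration of the logic, not the Transformer Library's API or configuration: the field names (`ts`, `file_date`, `client_ts`), the function names, the window sizes, and the sample events are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # fixed UTC-8 offset for this example
DAY = timedelta(days=1)


def offset_days(event):
    """Whole days between the file's day and the event timestamp.

    0 means the event falls on the file's day, negative means older,
    positive means newer. Assumes file_date is midnight of the file's day.
    """
    return (event["ts"] - event["file_date"]) // DAY


def truncate(events, max_days_old=2, max_days_ahead=0):
    """Keep only events whose offset falls inside the configured window."""
    return [e for e in events if -max_days_old <= offset_days(e) <= max_days_ahead]


def stamp_with_server_time(event, received_at):
    """Return a copy of the event whose timestamp is the server receive time.

    The client value is kept under client_ts purely for debugging; that is a
    choice made for this sketch, and dropping it works just as well.
    """
    corrected = dict(event)
    corrected["client_ts"] = event["ts"]
    corrected["ts"] = received_at
    return corrected


# Hypothetical events from a file dated Feb 17, 2017 PST.
file_date = datetime(2017, 2, 17, tzinfo=PST)
events = [
    {"id": "a", "ts": datetime(2017, 2, 17, 8, 0, tzinfo=PST), "file_date": file_date},
    {"id": "b", "ts": datetime(2017, 1, 27, 10, 0, tzinfo=PST), "file_date": file_date},
    {"id": "c", "ts": datetime(2017, 2, 18, 3, 0, tzinfo=PST), "file_date": file_date},
    {"id": "d", "ts": datetime(2017, 2, 16, 23, 30, tzinfo=PST), "file_date": file_date},
]

# 1. Investigate: how many events land at each day offset from the file date?
histogram = Counter(offset_days(e) for e in events)

# 2. Truncate: drop events more than 2 days old or any number of days ahead.
kept_ids = [e["id"] for e in truncate(events)]

# 3. Correct upstream: replace the client timestamp with the server receive time.
received_at = datetime(2017, 2, 17, 9, 0, tzinfo=PST)
fixed = stamp_with_server_time(events[1], received_at)

print(dict(histogram), kept_ids, offset_days(fixed))
```

Here the histogram shows one event each at offsets 0, -1, 1, and -21 days; truncation keeps only events `a` and `d`; and the corrected copy of event `b` lands on the file's own day. A wider or narrower window is a trade-off between how many late events you discard and how far back your query results can still change.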