Skip to main content

How does Interana perform data sampling?


Like all modern analytics platforms, Interana is architected using a scale-out clustered approach. There are multiple nodes (machines) in the cluster, and the number scales to match the demands of the data and concurrent users. Each machine has one or more jobs within the overall solution. The import nodes are tasked with ingesting event data from event logs, traces, etc. The data nodes manage efficiently storing and scanning the event data. The string nodes compress and deduplicate all the strings in the event data, handling storage and lookup for query results so that all data operations can take place efficiently using compact integers. For smaller clusters, nodes can take on several of these roles as needed. These machines can be bare metal, virtualized, or securely running inside of a public cloud service.

Figure 1 shows an example of this.

Figure 1: Example architectural diagram of an Interana cluster

The end result is that no matter the scale, Interana can match the requirements. From smaller deployments that can be satisfied with a single efficiently utilized machine, to huge clusters with hundreds of nodes, the Interana application can scale to meet demands. Our customers have some of the busiest services in the world, with clusters hosting over a trillion events.

Consistent organization and representative populations

One mechanism to support accurate behavioral sampling is to make sure all the data for an actor is stored together on the same node. And that the storage and placement of individual actors is fair and even for all actors in the population. We do that mathematically by using a hash with appropriate properties, followed by selecting one of many shards (containers) to hold that particular actor. All events associated with that actor will be held within the same shard. That shard will be managed by one node at a time, so all the events for a single actor are managed efficiently on a single machine. Because the actors are evenly distributed among shards, every shard contains a representative slice of the overall population.

Sampling actors to get entire time stream

We can take advantage of the fact that a single shard holds a representative cross-section of the population when it comes to sampling. When we sample, each data node processes a single shard and scans the requested time range for all the actors in that shard. Because even a single shard may hold more actors than necessary for a statistically valid sample, we can sample with progressively larger subsets of the shard until we have sufficient confidence in the sampled result.

Scanning the sample shard is usually very quick, because the common time ranges tend to be cached in memory. Even when the data is scanned directly from disk, the process usually takes just a few seconds. We only scan the columns required for that particular query, and those columns are stored very efficiently and processed with a highly optimized C++ backend. The end result is quick, accurate sampling for behavioral analytics and the ability to interactively explore vast amounts of data.

Parallelized queries, sane results

You might be wondering how we know that the sample from a single shard is actually representative of the population. That’s where the scale-out nature of the Interana solution comes in. The query isn’t just run on a single shard. It’s run on a shard for each data node of the cluster in parallel. The algorithm balances actors evenly across all nodes and shards, so each node should complete the query on its shard at roughly the same time. 

The results from each data node are merged together and compared. If the statistical profile of the actors in each shard is roughly the same then the sampling is held as valid. The results are scaled up based on the size of the sample relative to the overall population, and the results returned to the user. Any strings in the results are translated on the string servers at this time, generally eliminating string operations during the query processing. 

The product is a fast, accurately sampled result that reflects behavioral information across billions of events in mere seconds.

Knowing when not to sample

Of course, sampling isn’t always appropriate. Certain data isn’t going to be evenly distributed among the shards. Some events are very rare and unlikely to show up in a sampled result. Sometimes you’re looking for a tiny specific set of events but aren’t sure when they occurred. Sometimes the selection filters leave too few events to sample accurately. Interana detects situations where sampling isn’t appropriate and returns a clear warning with the results (for example, Figure 2).

The user can then either reformulate their query, or choose to disable sampling and compute the query across the full set of data. It’s up to the user which they prefer.

In general, unsampled queries are only needed when exploring a very small or highly irregular data set. For example, when the scope of the query is so narrow that it covers fewer than a few thousand users or events. In all of these cases, Interana will recommend that the query be rerun unsampled. But even when running a query on the full set of events, Interana returns results in seconds. The entire system was designed around the proposition that scanning billions of events is the common case and needs to complete quickly. And we deliver.

Conclusion and more information

So let’s wrap up. Interana is a purpose-built solution for behavioral analytics of event data at scale. The solution consists of a highly scalable cluster which is combined with an intuitive visual interface to interactively explore trillions of events in seconds. Part of how that’s possible is the architecture of the solution, and part is the integral role of sampling.

We’ve explained how sampling for event data needs to be handled correctly. And that sampling got a bad rap because in the past big data solutions didn’t always do the right thing. Interana handles sampling for behavioral analytics correctly, and uses this powerful technique to give our users the freedom to rapidly explore their data without fear of asking the wrong questions. Enabling them to zero in on the right questions to ask and significantly reducing time to discovery.

  • Was this article helpful?