Adaptive sampling in Interana

When you run a query that contains a count unique operation, Interana uses an adaptive sampling algorithm. This applies whether the count unique is a top-level aggregator in the query or part of the definition of a named expression used in the query. Adaptive sampling allows Interana to return statistically significant results quickly, even when a query references shard keys with a high number of unique values.

Using the adaptive sampling approach, each shard sends a sample of values to the merge server. The merge server aggregates the truncated set of values, which limits the network, memory, and CPU resources required for the computation.

Adaptive sampling can introduce a small amount of inaccuracy when computing a count unique on high cardinality columns (columns with a large number of unique values). In practice, our users rarely reach the default sampling limit (8192 unique values).
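The following Python sketch illustrates the general adaptive sampling idea described above. It is not Interana's implementation: the hash function, the function names, and the merge logic are assumptions chosen for illustration. Each shard keeps only hashed values whose lowest bits are zero, doubling the selectivity whenever the sample grows past the limit, and the merge server combines the bounded samples.

```python
import hashlib

LIMIT = 8192  # default sampling limit stated in this article


def _hash64(value):
    """Map a value to a stable 64-bit integer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def _keep(h, depth):
    """A hash survives at a given depth if its lowest `depth` bits are all zero."""
    return h & ((1 << depth) - 1) == 0


def _shrink(sample, depth, limit):
    """Raise the depth until the sample fits within the limit."""
    while len(sample) > limit:
        depth += 1
        sample = {h for h in sample if _keep(h, depth)}
    return sample, depth


def shard_sample(values, limit=LIMIT):
    """Build the bounded sample that one shard would send to the merge server."""
    sample, depth = set(), 0
    for v in values:
        h = _hash64(v)
        if _keep(h, depth):
            sample.add(h)
            if len(sample) > limit:
                sample, depth = _shrink(sample, depth, limit)
    return sample, depth


def merge_and_estimate(shard_results, limit=LIMIT):
    """Merge per-shard samples and estimate the overall unique count."""
    # Align every shard to the deepest sampling depth before taking the union.
    depth = max((d for _, d in shard_results), default=0)
    merged = {h for s, _ in shard_results for h in s if _keep(h, depth)}
    merged, depth = _shrink(merged, depth, limit)
    # Each surviving hash stands for 2**depth distinct values.
    return len(merged) * (1 << depth)
```

In this sketch, a column with fewer unique values than the limit never triggers sampling (the depth stays at 0), so the count is exact. Only columns that exceed the limit yield an estimate.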

Adaptive sampling and population sampling

Interana uses the adaptive sampling algorithm even when running an unsampled query. The Sampled Query setting determines whether Interana performs population sampling, and is independent of adaptive sampling. See How does Interana perform data sampling? for detailed information about how Interana performs population sampling. 

If Interana did not use adaptive sampling, each shard in your cluster would have to return the entire list of unique values in the shard. Interana would then compute the union of all unique values and perform the count operation. This is a resource-intensive operation: Interana would need to perform operations on each shard, send the (large) results over the network, and then perform the count operation, requiring a large amount of CPU and memory resources on the merge server.

Why are we using this limit?

The sampling strategy that Interana uses is taken from the paper "On Adaptive Sampling" by P. Flajolet (Le Chesnay, April 11, 1989). We determined that the default limit of 8192 yields an error rate of at most 1%, and that this error can occur only when the columns being analyzed contain more unique values than the limit.

We added the ability to configure this limit in Interana version 2.24.2. You can now configure the adaptive sampling limits for shard key and non-shard key columns.  

Configuring the adaptive sampling limits

If you are a Growth Edition customer, you can change the adaptive sampling limits on your Interana cluster through the Interana command line interface.

Use the following CLI command to set the adaptive sampling values: 

ia settings update query_api adaptive_sampling_limits '{"<table_copy_id>": [<shard limit>, <non-shard limit>]}'

The configuration string is a JSON object, with string keys representing the table_copy_id, and each key value is a list of two numbers:

  • The first number is the limit to use when this count unique is running on a shard key on its own table copy.
  • The second number is the limit to use in all other cases.

The default sampling limit in both cases is 8192. If you want to change only one of the limits, you must set the other value to 8192 to preserve the default value. For example, if you want to lower your shard key limit to 4096, but leave the non-shard key limit at the default value, run the command with the values [4096, 8192].
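As a minimal sketch, the Python helper below builds the command string shown earlier from a mapping of table copy IDs to limit pairs, filling in 8192 for any limit left as None. The helper name and the table copy ID "1" are hypothetical. The script only assembles the text of the command and does not run it.

```python
import json
import shlex

DEFAULT_LIMIT = 8192  # default for both limits, per this article


def build_limits_command(limits):
    """Return the CLI command text for the given {table_copy_id: (shard, non_shard)}.

    A limit of None is replaced with the default so that the other
    value in the pair can be changed on its own.
    """
    payload = {}
    for table_copy_id, (shard, non_shard) in limits.items():
        payload[str(table_copy_id)] = [
            DEFAULT_LIMIT if shard is None else shard,
            DEFAULT_LIMIT if non_shard is None else non_shard,
        ]
    # shlex.quote wraps the JSON in single quotes for the shell.
    return ("ia settings update query_api adaptive_sampling_limits "
            + shlex.quote(json.dumps(payload)))


# Lower only the shard key limit for a hypothetical table copy ID "1".
print(build_limits_command({"1": (4096, None)}))
```

Generating the JSON with json.dumps instead of typing it by hand avoids quoting mistakes, such as leaving the table copy ID key unquoted.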

Interana automatically rounds any value that is not a power of 2 up to the next power of 2. For example, if you specify 10000 as a limit, Interana rounds it up to 16384.
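To check in advance which limit will take effect, the rounding rule can be reproduced with a short Python function. This is an illustrative sketch of the rule as described here, not Interana's code, and the function name is hypothetical.

```python
def round_up_to_power_of_two(n):
    """Return the smallest power of 2 that is greater than or equal to n."""
    if n < 1:
        raise ValueError("limit must be a positive integer")
    # (n - 1).bit_length() is the exponent of the next power of 2 at or above n.
    return 1 << (n - 1).bit_length()


print(round_up_to_power_of_two(10000))  # 16384
print(round_up_to_power_of_two(8192))   # 8192 (already a power of 2)
```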

See the Interana CLI reference for more information about Interana command line parameters.

System performance considerations

Increasing the adaptive sampling limits can significantly affect system performance. You may need to increase the size of your Interana cluster to preserve query performance. Before increasing the limits, see the Admin Guide to review information about resizing your Interana cluster.
