Adaptive sampling in Interana

Neal

When you run a query that contains a count unique operation, Interana might use an adaptive sampling algorithm if the cardinality of the column is high. The count unique can be a top-level aggregator in the query, or it can be part of the definition of a named expression used in the query. Adaptive sampling allows Interana to return statistically significant results quickly, even for queries that reference shard keys with a high number of unique values.

Using the adaptive sampling approach, each shard sends a sample of values to the merge server. The merge server aggregates the truncated set of values, which limits the network, memory, and CPU resources required for the computation.
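The shard-and-merge flow described above can be sketched in a few lines of Python. This is a minimal illustration of the general adaptive sampling idea, not Interana's implementation: the class name AdaptiveSample, the choice of hash function, and the "low bits are zero" sampling rule are all assumptions made for the example.

```python
import hashlib

LIMIT = 8192  # default sampling limit quoted in this article


def _hash64(value):
    """Stable 64-bit hash of a value (assumption: any well-mixed hash works)."""
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def _keep(h, depth):
    """A hash survives at a given depth if its low `depth` bits are all zero."""
    return h & ((1 << depth) - 1) == 0


class AdaptiveSample:
    """Hypothetical per-shard sample that never holds more than `limit` hashes."""

    def __init__(self, limit=LIMIT):
        self.limit = limit
        self.depth = 0
        self.hashes = set()

    def _shrink(self):
        # Raise the depth until the sample fits; each step keeps roughly half.
        while len(self.hashes) > self.limit:
            self.depth += 1
            self.hashes = {h for h in self.hashes if _keep(h, self.depth)}

    def add(self, value):
        h = _hash64(value)
        if _keep(h, self.depth):
            self.hashes.add(h)
            self._shrink()

    def merge(self, other):
        """What a merge server might do: union two samples at the deeper depth."""
        merged = AdaptiveSample(self.limit)
        merged.depth = max(self.depth, other.depth)
        union = self.hashes | other.hashes
        merged.hashes = {h for h in union if _keep(h, merged.depth)}
        merged._shrink()
        return merged

    def estimate(self):
        # Each surviving hash stands in for 2**depth distinct values.
        return len(self.hashes) * (1 << self.depth)
```

In this sketch, a column with fewer distinct values than the limit stays at depth 0, so the count is exact. Only when the limit is exceeded does the sample start discarding values and the result become an estimate.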

Adaptive sampling can introduce a small amount of inaccuracy when computing a count unique on high-cardinality columns (columns with a large number of unique values). In practice, our users rarely reach the default sampling limit (8192 unique values).

Adaptive sampling and population sampling

The Sampled Query setting determines whether Interana performs population sampling, and is independent of adaptive sampling. Interana might use the adaptive sampling algorithm for count unique queries even when running an unsampled query. See How does Interana perform data sampling? for detailed information about how Interana performs population sampling.

If Interana did not use adaptive sampling, each shard in your cluster would have to return the entire list of unique values in the shard. Interana would then compute the union of all unique values and perform the count operation. This is a resource-intensive operation: Interana would need to perform operations on each shard, send the (large) results over the network, and then perform the count operation, requiring a large amount of CPU and memory resources on the merge server.

Why are we using this limit?

The sampling strategy that Interana uses is taken from the paper "On Adaptive Sampling" (Flajolet 1990). We determined that the default limit of 8192 keeps the error rate at 1% or less, and that any error occurs only when the columns being analyzed contain more than 8192 unique values.

We added the ability to configure this limit in Interana version 2.24.2. You can now configure the adaptive sampling limits for shard key and non-shard key columns.  

When does adaptive sampling kick in?

As of Interana 3.13, adaptive sampling activates when you run a query that either:

  • uses a count unique aggregation on a non-actor column; or
  • uses a count unique aggregation on an actor and also uses a time offset or split by.

For example, if the shard key (actor) is "user," running count unique user unsampled does not activate adaptive sampling. However, count unique user group by platform unsampled does activate adaptive sampling, because the query also splits the results by platform.

On Interana versions before 3.13 (including 2.x), adaptive sampling activates for any count unique on a column with a cardinality higher than the limits set on the system. By default, this limit is 8192 for both shard keys and non-shard keys.
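The two sets of activation rules above can be written as simple predicates. This is only a restatement of the rules in this article as code. The function and parameter names are hypothetical and do not correspond to any Interana API.

```python
DEFAULT_LIMIT = 8192  # default limit for shard keys and non-shard keys


def activates_3_13(on_actor_column, has_time_offset=False, has_split_by=False):
    """Rule for Interana 3.13 and later, as described in this article."""
    if not on_actor_column:
        # Count unique on a non-actor column always qualifies.
        return True
    # Count unique on an actor qualifies only with a time offset or split by.
    return has_time_offset or has_split_by


def activates_pre_3_13(column_cardinality, limit=DEFAULT_LIMIT):
    """Rule for versions before 3.13: cardinality above the configured limit."""
    return column_cardinality > limit
```

With these definitions, activates_3_13(True) models count unique user unsampled, and activates_3_13(True, has_split_by=True) models count unique user group by platform unsampled.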

Contact your Technical Account Manager or Interana Support to learn more.

Configuring the adaptive sampling limits

If you have access to the Interana command line interface and the appropriate permissions level, you can configure the adaptive sampling limits. If you do not have the required access, contact your Technical Account Manager or Interana Support to change these settings.

Use the following CLI command to set the adaptive sampling values: 

ia settings update query_api adaptive_sampling_limits '{"<table_copy_id>": [<shard limit>, <non-shard limit>]}'

The configuration string is a JSON object. Each key is a string representing the table_copy_id, and each value is a list of two numbers:

  • The first number is the limit to use when this count unique is running on a shard key on its own table copy.
  • The second number is the limit to use in all other cases.

The default sampling limit in both cases is 8192. If you want to change only one of the limits, you must set the other value to 8192 to preserve the default value. For example, if you want to lower your shard key limit to 4096, but leave the non-shard key limit at the default value, run the command with the values [4096, 8192].
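To avoid quoting mistakes in the JSON argument, you could generate the command string with a small helper. The sketch below is an illustration only: the helper name build_limits_command is hypothetical, and the table_copy_id value "1" is a placeholder, not a real ID from your cluster.

```python
import json
import shlex

DEFAULT_LIMIT = 8192  # default for both the shard key and non-shard key limits


def build_limits_command(table_copy_id,
                         shard_limit=DEFAULT_LIMIT,
                         non_shard_limit=DEFAULT_LIMIT):
    """Return the CLI command string for one table copy (hypothetical helper)."""
    # Keys must be strings; each value is [shard limit, non-shard limit].
    config = {str(table_copy_id): [shard_limit, non_shard_limit]}
    return ("ia settings update query_api adaptive_sampling_limits "
            + shlex.quote(json.dumps(config)))


# Lower only the shard key limit; the non-shard limit keeps its default.
print(build_limits_command("1", shard_limit=4096))
# ia settings update query_api adaptive_sampling_limits '{"1": [4096, 8192]}'
```

Because both limits are always written out, leaving one argument unset preserves the default of 8192 for that position.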

Interana automatically rounds any value that is not a power of 2 up to the next power of 2. For example, if you specify 10000 as a limit, Interana rounds it up to 16384.
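If you want to check in advance what limit a given value turns into, the rounding can be reproduced with a one-line calculation. This is a sketch of the arithmetic only, with a hypothetical function name. It is not taken from Interana's code.

```python
def round_up_to_power_of_two(limit):
    """Smallest power of 2 that is greater than or equal to `limit`."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # (limit - 1).bit_length() is the exponent of the next power of 2.
    return 1 << (limit - 1).bit_length()


print(round_up_to_power_of_two(10000))  # 16384
print(round_up_to_power_of_two(8192))   # 8192 (already a power of 2)
```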

See the Interana CLI reference for more information about Interana command line parameters.

System performance considerations

Increasing the adaptive sampling limits can significantly affect system performance. You might need to increase the size of your Interana cluster to preserve query performance. Before increasing the limits, see the Admin Guide to review information about resizing your Interana cluster.
