Balancing data for efficient sampling

This document explains data whales, the problems that unbalanced data can cause in a data tier, and how to remedy them.

Data whales and the solution

In a perfect data tier, events would be allocated equally across all shards, but this isn't always the case. There may be times when an actor creates a disproportionate volume of events compared to other actors. An extreme volume of events for any one actor can result in the following:

  • Unbalanced sampled query results
  • Poor-performing shards that are bigger than average
  • Disk space issues for the data node

Whales create unbalanced data

A person who is an exceptionally big spender in a gambling casino is known as a whale. Likewise, in behavioral data analytics, an actor that creates a disproportionate volume of events compared to the other actors in a dataset is a whale. The following illustration is a simplified example of a data node with six shards. The purple shard is a whale.

[Figure: WhaleShard_simple_illo.png (a data node with six shards; the purple shard is a whale)]

Whale actors can be any of the following:

  • Real actor—users who are an order of magnitude more active than most
  • System actor—services, notifications, and other automated events
  • Bot—a utility that automatically generates events

The imbalance a whale creates can cause sampling errors, especially if the shard containing the whale happens to be picked as a sampling shard. For unsampled queries, the whale shard impedes performance. If the whale continues to grow unchecked, its node runs out of disk space much faster than the other nodes in the cluster. How do you resolve the imbalance a whale creates? You splash the whale, of course.

Splashing the whale

Interana disperses the data of an actor detected as a whale. In effect, the whale actor's data is splashed across the other shards in the dataset. Splashing balances the distribution of data to ensure efficient sampling, and is accomplished with shard function exceptions. Later, when behavioral queries are run on the dataset, whale actors are filtered out.

Splashing results in efficient sampling, but also has the following effects:

  • The actor associated with the whale data goes away; you cannot run a behavioral query on that actor.
  • The actor is filtered out of certain types of queries, since its data was effectively spread across all shards in the dataset.

Whale events are filtered out only under specific circumstances, as described in Filtering out whales from queries below.

Detecting whales

The Whale Detector script runs on the config node of the Interana cluster and analyzes the volume of actor events across all shards. Fine-tuning the script for your data is an iterative process with four phases: query, analysis, reporting, and generation.

Query—In this phase, you run an unsampled count event group query (by shard key, with a maximum of 1000 queries) on a specified table copy, and the results are stored. You configure the number of time intervals to scan, how far back in time to go, and how long each time period should be. You can also specify an initial delay, to allow the pipeline to finish importing the event stream for the time range. The default settings for the script arguments are sufficient for most cases. If your data requires values other than the defaults, include those options in the unsampled query so you can quickly iterate on the settings to fine-tune the results.
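
In its simplest form, the query phase can be invoked with only the required arguments. The following is a sketch that uses the script path shown in the workflow examples later in this document; <table>, <shard_key>, and <min_sample_mean> are placeholders for your own values:

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t <table> -s <shard_key> -m <min_sample_mean>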

Analysis—In this phase, scan results for the specified time period(s) are loaded and examined. If there are fewer than 500 unique actors in a time period, or the mean count of events per actor is less than --min-sample-mean, the time period is excluded from analysis. Each remaining time interval is analyzed for whale actors: if an actor's event count is more than --outlier-threshold-stddev standard deviations above the mean, the actor is marked as a candidate whale for that time period. You can modify the analysis argument values and rerun the query with the --analyze-only argument to quickly view the new results. Using the --analyze-only flag to view the results of different thresholds saves time, because you don't have to reissue the query.
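
For example, to re-analyze saved scan results with a stricter outlier threshold (8.0 is a hypothetical value; the placeholders are the same as in the sketch above) without rerunning the unsampled query:

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t <table> -s <shard_key> -m <min_sample_mean> --outlier-threshold-stddev 8.0 --analyze-only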

Reporting—In this phase, a table is printed listing the candidate whales, the total event count for each, and the number of time periods in which each appeared. You can choose to only include candidate whales that appeared in at least N periods, then re-run the analysis and view a report with different thresholds. Through this process, you determine which actors in the dataset are valid whales.

Generation—In this phase, you include the --auto-add flag so that candidate whales are automatically added to the shard function exceptions list of the table being analyzed. This causes the whale actor events to be distributed across all shards in the dataset and excluded from behavioral and sharded count unique queries. After these parameters are set, you can use the whale_detector.pyc script with the --auto-add argument in a cron job.

Using the Whale Detector script

You run the whale_detector.pyc script on the command line, specifying arguments as needed for your data. The following sections explain the whale_detector.pyc arguments, then outline the recommended workflow for using the script effectively.

Whale Detector script syntax

You run the whale_detector.pyc script on the config node of the Interana cluster. The output is automatically stored in a separate table in the Interana configDB. Shard exceptions in configDB are updated based on threshold settings. When a data rolloff is performed, whales in the configDB are included in the rolloff.

The Whale Detector script ignores events belonging to the *null* actor, as they are automatically treated as whales by the purifier and query API.

whale_detector.pyc: Generates a list of candidate whales based on query groups.
Usage: whale_detector.pyc [-h] [--customer-id CUSTOMER_ID] [-t TABLE]
                                [-s SHARD_KEY] [-m MIN_SAMPLE_MEAN]
                                [--period-length PERIOD_LENGTH]
                                [--number-of-periods NUMBER_OF_PERIODS]
                                [--delay DELAY]
                                [--outlier-threshold-stddev OUTLIER_THRESHOLD_STDDEV]
                                [--outlier-threshold-periods OUTLIER_THRESHOLD_PERIODS]
                                [--auto-add] [--analyze-only | --exclude-saved-scans]
                                [--delete-saved-scans]
Required Arguments  
-t TABLE, --table TABLE
The name of the Interana table to scan. (default: None)

-s SHARD_KEY, --shard-key SHARD_KEY
The shard key in which to look for whales. (default: None)

-m MIN_SAMPLE_MEAN, --min-sample-mean MIN_SAMPLE_MEAN
Minimum mean event count across all actors. Scan periods with a mean event count less than this value are excluded from analysis. (default: None)

Each actor has an event count in the scan results; the mean event count is the average of the event counts of all actors. This control ensures that the time period being analyzed contains enough data to determine whether an actor is a whale.

Tip: The point of this flag is to exclude periods that have an abnormally low event count for whatever reason—perhaps an influx of actors with very low event counts, a logging change, or downtime in your logging pipeline. Set the value so that normal periods clear the threshold by only a small margin.
Optional Arguments  
-h, --help
Shows the help for the script.

--customer-id CUSTOMER_ID
The Interana customer ID. (default: 1)

--period-length PERIOD_LENGTH
Time range in seconds of each scan. Use a smaller value if scans cannot complete (due to running out of memory or other resources, or to timeouts). Use a larger value if periods contain too few actors or events. (default in seconds: 21600)

--number-of-periods NUMBER_OF_PERIODS
Number of periods to scan. This is multiplied by the period length to determine how far back to scan overall. For example, the defaults of 32 periods of 21600 seconds each cover 8 days of data. (default: 32)

--delay DELAY
The most recent (last) scan period ends at NOW minus DELAY. This ensures that all, or most, of the events for the scan periods have been imported.

Use a larger delay if your pipeline has time splay or other sources of latency. Use a smaller delay if your pipeline is low latency or near real-time; a smaller delay is beneficial because whales are detected sooner. (default in seconds: 7200)

--outlier-threshold-stddev OUTLIER_THRESHOLD_STDDEV
Minimum number of standard deviations above the mean for an actor to be marked as a candidate whale. (default: 6.0)

This is the measure used to determine whether an actor is a whale during a given time period: an actor is flagged when its event count exceeds the period mean by more than this many standard deviations, which normalizes for the natural variance in event counts.

--outlier-threshold-periods OUTLIER_THRESHOLD_PERIODS
Minimum number of periods in which an actor must appear as an outlier to be marked as a candidate whale. (default: 1)

--auto-add
Automatically add all candidate whales to the shard function exceptions list for the specified table. (default: False)

IMPORTANT! Once a whale actor is added as a shard function exception, the action cannot be reversed.

--analyze-only
Do not run scans; list candidate whales from previous scans of this table with the same period length. May not be used with --exclude-saved-scans. (default: False)

--exclude-saved-scans
Do not use scan results from previous invocations of this script. May not be used with --analyze-only. (default: False)

--delete-saved-scans
Delete saved scan results from previous invocations of this script for the specified shard key, then exit. (default: False)

Whale Detector workflow

Fine-tuning the whale_detector.pyc script for your data is an iterative process. This section provides examples for each phase of the workflow: query, analysis, reporting, and generation.

1. Query—Run an unsampled count event group query

You first use the whale_detector.pyc script to run an unsampled query, which generates results for the specified table and shard key. The script defaults are acceptable in most cases; however, review each argument and specify values that better suit your data where necessary. Depending on the size of your data, running an unsampled query may take several minutes.

Specify all necessary arguments in the unsampled query, so you can leverage the results with the --analyze-only argument. Every time you add an argument to the whale_detector.pyc script command, you must rerun the unsampled query.

The following example runs a whale_detector.pyc unsampled query on the AppServerEvents table for the edge.clientid shard key, with a minimum sample mean of 2000, across two time periods of 8 hours each (28800 seconds).

The default delay of 2 hours (7200 seconds) is sufficient for the pipeline in the example. If your pipeline requires a longer or allows for a shorter lead time, specify the required time in seconds with the --delay argument.

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t AppServerEvents -s edge.clientid -m 2000 --period-length 28800 --number-of-periods 2

Depending on the size of your data and hardware configuration, you may need to decrease the length of the specified time period. The query will fail if you run out of memory.
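
For example, the following sketch halves the period length and doubles the number of periods (hypothetical values; 4 periods of 14400 seconds cover the same overall range as 2 periods of 28800 seconds), so each individual scan is smaller:

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t AppServerEvents -s edge.clientid -m 2000 --period-length 14400 --number-of-periods 4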

2. Analysis—Iterate and analyze results

The results from the whale_detector.pyc unsampled query are stored in a table in the configDB. To view the results, enter the command again, this time adding the --analyze-only argument, which displays the query results in a table.

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t AppServerEvents -s edge.clientid -m 2000 --period-length 28800 --number-of-periods 2 --analyze-only

After reviewing the results, you can modify the argument values and rerun the query with the --analyze-only argument to view the new results.
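
For example, to see how a stricter minimum sample mean changes the analysis without rescanning (4000 is a hypothetical value):

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t AppServerEvents -s edge.clientid -m 4000 --period-length 28800 --number-of-periods 2 --analyze-only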

To delete previous scans so they won't be included in analysis results, use the --delete-saved-scans argument. To ignore previous scans without deleting them, use the --exclude-saved-scans argument.
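
Continuing the same example, the following sketches show each option; adjust the values for your data. The first command deletes the saved scans for the shard key and exits, and the second runs a fresh scan that ignores any saved results:

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t AppServerEvents -s edge.clientid -m 2000 --delete-saved-scans

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t AppServerEvents -s edge.clientid -m 2000 --period-length 28800 --number-of-periods 2 --exclude-saved-scans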

3. Reporting—Review final results and select whales to be excluded

When you are happy with the whale_detector.pyc query results, print a table for final review using the --analyze-only argument. The table lists the candidate whales, the total event count for each, and (if more than one time period was scanned) the number of time periods in which each appeared.

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t AppServerEvents -s edge.clientid -m 2000 --period-length 28800 --number-of-periods 2 --analyze-only

If you included the --number-of-periods argument in the whale_detector.pyc unsampled query, you can choose to include only candidate whales that appeared in at least N periods, then re-run the analysis for a report with different thresholds.
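
For example, the following sketch reports only candidate whales that were outliers in both of the two scanned periods (--outlier-threshold-periods 2 is a hypothetical threshold):

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t AppServerEvents -s edge.clientid -m 2000 --period-length 28800 --number-of-periods 2 --outlier-threshold-periods 2 --analyze-only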

If you didn't scan multiple time periods in the original unsampled query, you can run another unsampled query using the --number-of-periods argument, as described in 1. Query.

4. Generation—Exclude whales automatically

When you have determined the whales in your data, you can include the --auto-add flag with the whale_detector.pyc query to automatically add them to the shard function exceptions list of the table. This causes the whale actor events to be distributed across all shards in the table and excluded from behavioral and sharded count unique queries.

Once a whale actor is added as a shard function exception, the action cannot be reversed.

sudo -u interana /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t AppServerEvents -s edge.clientid -m 2000 --period-length 28800 --number-of-periods 2 --auto-add

You can now use this command in a cron job, so that it runs automatically at the time interval that best suits your data.
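
For example, a hypothetical entry in the interana user's crontab (edit with crontab -e as the interana user, so sudo is not needed) that runs the detector daily at 03:00 might look like the following; choose a schedule and arguments that match your data:

0 3 * * * /usr/share/python/interana-python/bin/python /opt/interana/backend/precacher/whale_detector.pyc -t AppServerEvents -s edge.clientid -m 2000 --period-length 28800 --number-of-periods 2 --auto-add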

Filtering out whales from queries

Whale actor events are distributed across all shards, and therefore excluded from behavioral and sharded count unique queries. The following diagram illustrates the process for filtering out whales in per-actor metric, per-flow metric, and shard lookup table queries.

[Figure: WhaleDetector (2).png (filtering out whales in per-actor metric, per-flow metric, and shard lookup table queries)]
