Troubleshooting overloaded string tiers

Data on the string tier is stored both on disk and in memory. All string columns are stored on disk, but not every string column is stored in memory. Which columns are kept in memory depends on the "data pattern" of the environment: in general, the most commonly used columns are loaded into memory, while the less commonly used (or, in some cases, never used) columns stay on disk.

What causes overloaded string tiers?

Memory is much faster than disk, and this management of string data storage is part of why most queries run quickly on an Interana environment. A query that uses a less common column takes longer to run, since its data has to be loaded into memory from disk. However, if users include the column in more queries, its popularity rises and the column becomes persistent in memory. The inverse is true as well: string columns whose usage decreases are removed from memory and exist only on disk.

Columns are not always being added to or dropped from memory (you may have a dataset with a small number of columns that are all popular, so all of them stay persistent), and string data can therefore take up a large amount of memory on the node. When the disk space used by string data gets close to the total amount of memory on the node, problems arise when adding new data to the string columns in memory. Messages such as the following appear in the import pipeline:

[Mon Aug 15 06:49:50 2016] P17 J44 S0/2 B251263885801617: E0815 06:49:50.246578 56066 Client-inl.h:320] StringServerAggregator::addStrings request to localhost:8600 failed indirectly due to: socket open() error: Connection refused at 10.0.0.6:2000
[Mon Aug 15 06:49:50 2016] P17 J44 S0/2 B251263885801617: E0815 06:49:50.246913 56066 Translation.cpp:126] String tier is overloaded, back off.
[Mon Aug 15 06:49:50 2016] P17 J44 S0/2 B251263885801617: ERROR processing batch: ERROR purifier failed on batch 251263885801617: 1
[Mon Aug 15 06:49:50 2016] P17 J44 S0/2 B251263885801617: Traceback (most recent call last):
[Mon Aug 15 06:49:50 2016]   File "/opt/interana/backend/import_server/data_source_import_sharded.py", line 917, in process_batch
[Mon Aug 15 06:49:50 2016]     file_pipeline.pop(0)(file_batch)
[Mon Aug 15 06:49:50 2016]   File "/opt/interana/backend/import_server/data_source_import_sharded.py", line 654, in run_purifier
[Mon Aug 15 06:49:50 2016]     status=purifier_process.returncode)

Checking string node memory and disk space

Once disk usage on the string tier exceeds the total amount of available memory, imports across the cluster cease altogether, because new events can no longer be added to the string tier.

To check the memory and disk usage on a string node, do the following:
  1. Log in to the string node.
  2. Enter the following command.
free -m
             total       used       free     shared    buffers     cached
Mem:        112807     112428        379          5         40      88844
-/+ buffers/cache:      23543      89264
Swap:        39999      39700        299
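
To see how much disk the string data itself is using (for comparison against the memory numbers above), a standard filesystem check is enough. The following is a minimal sketch; <string-data-directory> is a placeholder for the actual location of the string data on your deployment.

# Overall filesystem usage on the string node
df -h

# Disk used by the string data directory (substitute the real path)
sudo du -sh <string-data-directory>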

In this example the node has 112807 MB of memory and the string data is currently using 101948 MB of disk. Most of the swap space has been used up and the node is fast approaching the memory threshold. There are several options to prevent Import from becoming impaired:

Deleting high-cardinality columns

This is usually the quickest and easiest way to stay below the memory threshold, and high-cardinality columns are usually the reason the string tier is overloaded to begin with. A high-cardinality string column is a column with many unique values, meaning more disk and memory are required to store the data it contains. The Resource Usage page (<cluster-url>/?resourceusage) shows a list of all the string/data columns in the environment, along with the amount of disk space each column is using in kilobytes and more metadata about the column, such as which table it belongs to, the disk space used as a percentage, and examples of data in the column.

Above is an example of the column table on the Resource Usage page, sorted by "Size (KB)". The biggest string column in this environment, "data_baseData.properties_GcmNotificationSubscriptionId", belongs to the "Android" table and is occupying over 69 GB across the entire string tier, or nearly 14 GB on every node (there are five string nodes in this cluster). You can use the "Show servers" checkbox to show disk usage per column per node. This column would be a good candidate for deletion should the environment develop import problems as a result of an overloaded string tier.

You can delete the data for a specified column with the CLI command ia column delete. The default is dry-run mode, and you must use the -r option to actually delete the column. By default the delete preserves metadata, so the column will still exist in the UI and data for the column may be ingested in the future. Use the --delete-metadata option to remove the column from the UI and prevent further data for the column from being ingested. For more information, see How to delete data from an overloaded string tier.

ia column delete [-h] [-v] [--unsafe] [--instance-name handle] [--version]
[--output {json,text,table}] [--match-pattern MATCH_PATTERN] [--delete-metadata] [--run] table_name [column_name]
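
As a sketch of a typical workflow (reusing the table and column from the Resource Usage example above), run the dry run first, then repeat with --run, adding --delete-metadata only if the column should also be removed from the UI:

# Dry run (default): reports what would be deleted without changing anything
ia column delete Android data_baseData.properties_GcmNotificationSubscriptionId

# Delete the column's data, keeping its metadata
ia column delete --run Android data_baseData.properties_GcmNotificationSubscriptionId

# Delete the column's data and remove the column definition from the UI
ia column delete --run --delete-metadata Android data_baseData.properties_GcmNotificationSubscriptionId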

Resizing and rebalancing a cluster

If your company's data consumption is expanding, the solution may be to increase the size of your string tier. You can resize a cluster by increasing the number of string nodes; rebalancing then redistributes shards evenly across all eligible (non-excluded) nodes. Expand the cluster's capacity to match the growth of your data.

For more information, see Planning your Interana deployment and Resize a Cluster.

Deleting an entire table

This is normally not an option, but you may have unused event tables on the environment with "dead" import pipelines that are still consuming a lot of resources.  

You can use the CLI ia table delete to remove any table from your cluster, including lookup tables. Auto completion displays both types of tables (event and lookup) when applicable. 

ia table delete [-h] [-v] [--unsafe] [--instance-name handle] [--version]
[--output {json,text,table}] [-y] [--delete-metadata] [--run] table_name

The ia table delete default is dry-run mode, so you can verify that the table can be deleted. Use the -r option to execute the command. The --delete-metadata parameter removes the table definition, in addition to deleting all of the existing data (event and strings) and import records. For more information, see How to delete data from an overloaded string tier.
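
As a sketch, the same dry-run-first workflow applies (the table name below is hypothetical):

# Dry run (default): verify that the table can be deleted (table name is hypothetical)
ia table delete my_unused_table

# Delete the table's existing data, keeping the table definition
ia table delete --run my_unused_table

# Delete the data and also remove the table definition and import records
ia table delete --run --delete-metadata my_unused_table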