Troubleshooting overloaded string tiers

Data on the string tier is stored both on disk and in memory. All string columns are stored on disk, but not every string column is kept in memory (the caveat, according to one of our engineers, is that all string columns are loaded into memory during certain parts of the import process). Which columns are kept in memory depends on the "data pattern" of the environment: generally the most commonly used columns are put into memory, while the less used (or in some cases never used) columns stay on disk. Memory is much faster than disk, and this clever management of string data storage is part of why most queries run quickly on an Interana environment. A query using a less common column will take more time to run, since its data has to be loaded into memory from disk, but if users start including that column in more queries its popularity will rise and the column will become persistent in memory. The inverse is true as well: string columns that see a decrease in usage will be removed from memory and exist only on disk.

Since columns are not always being added to or dropped from memory (you may have a dataset with a small number of columns that are all popular, so every column stays persistent in memory), string data can come to occupy a large amount of memory on the node. Once the string data's footprint on disk gets close to the total amount of memory on the node, the environment will start to have problems adding new data to the string columns in memory, as there is little or no room left for the new data. You will start to see messages like this in import-pipeline:

[Mon Aug 15 06:49:50 2016] P17 J44 S0/2 B251263885801617: E0815 06:49:50.246578 56066 Client-inl.h:320] StringServerAggregator::addStrings request to localhost:8600 failed indirectly due to: socket open() error: Connection refused at 10.0.0.6:2000
[Mon Aug 15 06:49:50 2016] P17 J44 S0/2 B251263885801617: E0815 06:49:50.246913 56066 Translation.cpp:126] String tier is overloaded, back off.
[Mon Aug 15 06:49:50 2016] P17 J44 S0/2 B251263885801617: ERROR processing batch: ERROR purifier failed on batch 251263885801617: 1
[Mon Aug 15 06:49:50 2016] P17 J44 S0/2 B251263885801617: Traceback (most recent call last):
[Mon Aug 15 06:49:50 2016]   File "/opt/interana/backend/import_server/data_source_import_sharded.py", line 917, in process_batch
[Mon Aug 15 06:49:50 2016]     file_pipeline.pop(0)(file_batch)
[Mon Aug 15 06:49:50 2016]   File "/opt/interana/backend/import_server/data_source_import_sharded.py", line 654, in run_purifier
[Mon Aug 15 06:49:50 2016]     status=purifier_process.returncode)
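
A quick way to confirm how often import is backing off is to count these messages in the import-pipeline log. A minimal sketch, assuming the log is available as a file (the path below is a placeholder; substitute the actual import-pipeline log location on your environment):

# Count string-tier backoff messages (log path is a placeholder)
grep -c "String tier is overloaded" /path/to/import-pipeline.log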

Once disk usage on the string tier exceeds the total amount of available memory, import across the entire environment will cease altogether, because new events can no longer be added to the string tier. Here are the commands to check memory and disk usage on a string node:

interana@string000:~$ free -m
             total       used       free     shared    buffers     cached
Mem:        112807     112428        379          5         40      88844
-/+ buffers/cache:      23543      89264
Swap:            0          0          0

interana@string000:~$ du -sm /mnt/iafs/data/ss_data_1471035433.37/
101948 /mnt/iafs/data/ss_data_1471035433.37/
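
If you want to automate this comparison, below is a minimal sketch that compares string data on disk to total memory on the node. The ss_data_* path pattern and the 90% warning threshold are assumptions, not official limits; adjust them for your environment:

#!/bin/bash
# Sketch: warn when string data on disk approaches total memory on this string node.
total_mem_mb=$(free -m | awk '/^Mem:/ {print $2}')
string_data_mb=$(du -sm /mnt/iafs/data/ss_data_* | awk '{sum += $1} END {print sum}')
echo "total memory: ${total_mem_mb} MB, string data: ${string_data_mb} MB"
# Warn at 90% of total memory (example threshold only).
if [ "${string_data_mb}" -ge $(( total_mem_mb * 90 / 100 )) ]; then
    echo "WARNING: string data is within 10% of total memory on this node"
fi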

In this example the node has ~112807MB of memory and the string data is currently using ~101948MB of disk. We are below the memory threshold, but we are approaching it. We have several options to prevent import from going over the edge:

Delete high-cardinality columns

This is usually the quickest and easiest way to stay below the memory threshold, and high-cardinality columns are usually the reason the string tier is overloaded to begin with. A high-cardinality string column is a column with a large number of unique values, meaning more disk and memory are required to store the data it contains. The Resource Usage page (<cluster-url>/?resourceusage) shows a list of all the string/data columns in the environment along with the amount of disk space each column is using in kilobytes, plus more metadata about the column, such as which table it belongs to, disk space used as a percentage, and examples of the data in the column.

Above is an example of the column table on the Resource Usage page sorted by "Size (KB)". The biggest string column in this environment, "data_baseData.properties_GcmNotificationSubscriptionId", belongs to the "Android" table and is occupying over 69GB across the entire string tier, or nearly 14GB on every node (there are five string nodes on this cluster). You can use the "Show servers" checkbox to show disk usage per column per node. This column would be a good candidate for deletion should the environment develop import problems as a result of an overloaded string tier. Remember to check with the customer and the Customer Success person responsible for the account before deleting any columns, and to notify the OPs person on call for that week. Here is the command to fully delete a column:

/opt/interana/backend/import_server/delete_table_data.py -t table_name -o column_name -f

The -f flag tells the script to fully delete the column's data, including the column metadata, and to prevent new data for that column from coming in during import.
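
For example, to delete the large column from the Resource Usage example above, the invocation would look something like this (a hypothetical run; verify the table and column names on your environment and get the sign-offs described above before running it):

/opt/interana/backend/import_server/delete_table_data.py -t Android -o data_baseData.properties_GcmNotificationSubscriptionId -f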

Enable disk cache on the string tier

Note for Azure customers: this feature is only available on VMs with premium storage.

According to the Backend team, this feature can increase the IOPS rate on disk by 5x-10x. Premium storage provides about 5k IOPS, so setting up the disk cache in read-only mode brings the rate to 25k-50k; initial tests allowed disk usage to rise to around 300GB per node. Enabling this feature is outside the scope of Customer Support and is a joint Backend/OPs task, so consult with those teams to see whether this option will meet the needs of the customer. The author of this article does not know whether this feature has been enabled on any AWS or on-prem environment, or whether it is even possible to do so on those environments.

Increase the amount of memory across the string tier

This is another OPs-only task. It may incur additional costs to the customer, as a hardware upgrade usually requires upgrading the entire node. Work with Customer Support and the OPs team if you choose to go this route.

Delete the entire table

This is normally not an option, but there may be unused event tables in the environment with "dead" import pipelines that are still consuming a lot of resources. As of 2.20 there is no way to remove string data for a table other than deleting individual columns or the entire table. 2.20 did introduce a feature that timestamps the string columns in memory, which will allow string data roll-off in a future version. Here is the command to delete a table:

/opt/interana/backend/import_server/delete_table_data.py -t table_name -i

The -i flag tells the script to delete the import records for the table so new data can be imported. Pass the -f option if you want to completely delete the table and make it inaccessible for any and all use. 
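
For example, a hypothetical run against an unused table named "old_events" (the table name is a placeholder; use the actual table on your environment, and get the same sign-offs as for column deletion):

/opt/interana/backend/import_server/delete_table_data.py -t old_events -i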
