Skip to main content
Interania

Admin Guide: Resize a Cluster

1votes
51updates
171views

About resizing and rebalancing

Interana stores ingest data on data nodes and string nodes. Collectively, we refer to all of the data nodes within an Interana cluster as the data tier, and all the string nodes as the string tier. Typically, both of these node types use premium storage for fast access. However, data nodes support a tiered storage option that allows you to keep newer data on premium (faster) storage and older data on less expensive (slower) devices. 

You can resize a cluster by increasing or reducing the number of nodes. Rebalancing then redistributes shards evenly across all eligible (non-excluded) nodes. In most cases, you would want to expand a cluster, adding nodes so the cluster capacity matches the growth of your data. However, you can also choose to reduce the number of nodes in a cluster.

There are two other Interana tier types:

  • Interana import nodes are responsible for routing data from a data source, transforming it, and loading it into the data and string nodes.
  • Interana api nodes are responsible for issuing queries to the data and string nodes and returning query responses to the requestor.

The import and API node types can be resized without rebalancing. You just have to restart the tier after resizing, so all nodes in the tier are used correctly. 

This document covers the following topics:

Resizing guidelines

Take snapshots of all nodes prior to resizing to ensure your data can be restored in case an error occurs. 

We recommend that you follow these guidelines when resizing a cluster:

  • All nodes need access to the MySQL server, which requires that the nodes be on the same subnet. This is typically configured during an Interana installation (.deb installer). For more information, see your MySQL administrator. 
  • Plan the shards on cluster nodes in anticipation of future (data tier) growth. For example, if you start with a single node cluster and plan to later expand to a 10 node cluster in the future, you should provision the first node with 10x the number of shards you currently need to accommodate for future growth. 
  • You define the shards for a table when the table is created (ia table create --shards-per-node). The default is 24 shards per node, unless you specify otherwise. If future data tier growth is not anticipated when the tables were created, resizing can decrease the shard effectiveness.
  • Sampled queries look at one shard per data node, so the number of shards affects the sampling rate. Sampled queries may return slightly different results before and after cluster resize (which will still be a statistically accurate sample) for the following reasons: 

Sampling based on a different number of shards —  If you have 10 nodes and sample one shard per node, then downsize the cluster to 5 nodes where one shard per node is sampled, 5 shards (instead of 10) are your baseline. If the data is perfectly balanced across all shards (meaning all shards are equivalent for sampling purposes), then sampling results won't be affected. However, if the data is unbalanced across the shards, you will see slightly different results.

Sampling based on a different set of shards — If you have 10 nodes and sample one shard per node, we'll call "A" sampled shards. Then you downside to 5 nodes and sample one shard per node, we'll call "B" sampled shards. The new sampling schema (B) is not guaranteed to contain, or be a part of the original set (A), due to the shards moving after rebalancing. If the data is unbalanced, this results in a slight difference in results.

  • It is important to have the proper number of import nodes in your cluster. Each import node can process between 25K and 100K events per second, depending on the size of each event (both raw size and number of columns) and the amount of custom data transformation in the data import pipeline (decompression, custom processing). For example, if you need a throughput of 200K events per second to keep up with real time data, then you'll want between 4-8 import nodes in the cluster.
  • If sharding degrades after rebalancing, delete the table and recreate it with optimum sharding, then re-ingest the data.
  • The resize process will stop if there is not enough room for the data on the remaining nodes. We recommend that you allow for 5% additional overhead for a data tier rebalance and an additional 20% overhead for string tier rebalance. The additional space is needed to temporarily host the same data in two places while it is in transit. Data nodes require less overhead, because the increments of data being moved at one time is smaller than that for string nodes.  
  • We strongly recommend that if you are using tiered storage, set up tiered storage on the target node and start tiering data from the primary to the secondary tier as the resize progresses. The resize always moves the data into the target node's primary storage, even if the data is coming from the secondary storage of the source node. If you don't structure tiering in this manner, the target node's primary storage may fill up, causing the resize to stop unnecessarily from lack of space. 
  • You cannot pause or stop a data tier resize once it's begun. However, if you change your mind about a data tier resize while it is in progress, you can run another rebalance command which will override the prior rebalance instruction and start routing shards to new destinations. With this approach you could potentially send shards back to their original locations without waiting for the original rebalance command to fully complete. 
  • You can pause a string tier resize once it's begun, using the ia tier stop-rebalance command. Since this will leave the string shards only partially transferred to the new layout, you will likely want to follow this up with another ia tier rebalance command to either continue the original resize, reset back to the original configuration, or choose a new resize configuration.

What happens during a rebalance 

Understanding what happens during a rebalance can help you effectively plan for a resize. What happens when rebalancing a data tier is quite different from that of a string tier. 

Data tier rebalance

A data tier rebalance overwrites the existing shard layout with the new one, breaking queries, and gradually moving the shards to the new locations. There are three basic phases to a data tier rebalance:

  1. Write the new shard layout to the config database (DB). When you issue the ia tier rebalance data command, the shard layout in the config DB is overwritten with the new shard layout. Any queries you run (until the rebalance is complete) will only be able to access a subset of shards — whichever ones don't need to change between the old and new layout. While the rebalance is in progress, queries will return incorrect results (including "no matched results") or fail outright if insufficient shards are available.
  2. Issue the rebalance instructions. Instructions are issued to the individual data nodes to move the shards. 
  3. Move the shards to their new locations. There is a temporary overhead of up to 5% of disk space on the system while copying is underway. Once the shards fully reach their target host, the old copy on the original host is removed. As the shards fully arrive at their new host, they once again become available for queries. 

String tier rebalance

A string tier rebalance utilizes two shard layouts during the process — the original layout and the new shard layout. There are two basic phases to a string tier rebalance:

  1. The new shard layout is created. When you issue the ia tier rebalance string command, the new shard layout is created and the original shard layout remains intact. This allows queries to continue to run against the original shard layout, ensuring that queries remain valid during the rebalance.
  2. Shards are moved individually to their new locations. As each shard is moved, the original location is updated for that shard, which allows for incremental resizing without breaking queries. To accommodate the two active shard layouts during the rebalance, an additional 20% overhead is required. This provides for temporarily hosting the same data in two places while it is in transit. The shards are removed from the original location once the data has reached its new location.

Resizing a cluster

To increase the size of a cluster, perform tasks 1-3, as necessary. To reduce the size of a cluster, rebalance the node (step 2 or 3) and then perform task 4.

  1. Provision a new node.
  2. Rebalance a data node.
  3. Rebalance a string node.
  4. Optional: Remove a node.

1. Provision a new node

To increase the size of a cluster, you must first provision new nodes. 

To provision a new node on a cluster:

  1. Provision a new node, as described in the Admin Guide. The new node should automatically be added to cluster node map and should have access to the MySQL server. 
  2. Log in to the cluster and verify that the node has been added with the following command.
ia node list

In the following example, the new data node 10.1.61.224 was installed on a stacked single node cluster. You can see that data001 is new and has the IP address of the new cluster.

ia node list
Nodes (uid,public_ip,private_ip): [('api000','10.1.63.7','10.1.63.7'),('config000','10.1.63.7'),

('data000', '10.1.63.7', '10.1.63.7'), ('data001', '10.1.61.224', '10.1.61.224'), ('import000', '10.1.63.7', '10.1.63.7'), ('string000', '10.1.63.7', '10.1.63.7'), ('string001', '10.1.58.129', '10.1.58.129')]
  1. If the cluster does not appear on the node list, add it to the cluster with the following command, specifying the tier and the public IP and private IP addresses of the push node (security access node for the cluster).
ia node add <tier:public_ip:private_ip>

In the following example, the data tier public IP and private IP addresses are the same. A prompt is shown telling you the node was added successfully.

ia node add data:10.0.0.9:10.0.0.9

Successfully added node(s) to the cluster.
  1. Do one of the following:
  • If the new node is an api node or import node, restart the tier and the procedure is complete.
  • If the new node is a data node, continue with rebalance a data node.
  • If the new node is a string node, continue with rebalance a string node.

2. Rebalance a data node

There are two basic rebalance scenarios:

  • Expanding a single-node cluster to a multi-node cluster.
  • Adding additional nodes to a multi-node cluster to expand an existing data tier.

In the following example, we expand a single-node cluster and rebalance the data to a dedicated node. Rebalance automatically calculates destination hosts that have not been excluded with the --exclude option.

The ia tier rebalance command defaults to dry-run mode. This means, that if you do not use of the  -r or --run option, the results show what would happen when you actually execute the command. 

The sizes of the data on the disk before shards are moved across devices will not be the same after the shards are moved. This is expected, as file sizes change when the data server performs its sort and split behavior.

To rebalance a data node:

  1. Before initiating a rebalance, check to make sure that a rebalance is not already in progress by entering the following command.
ia tier rebalance data --status

In the following example, ok in the results means that a rebalance is not in progress. The Events column shows the events that still need to be migrated. Since we haven't started a migration, no events (0) need to be migrated.

If you do not see an ok Status, you can check data-server logs for problems, such as not enough space in the destination. Then contact Interana Support immediately.

 

ia tier rebalance data --status
Host Status Events Shard Path Destination Host
10.1.61.224 ok 0    
10.1.63.7 ok 0    

 

  1. To rebalance data to a dedicated data node, use the following command to exclude any other nodes. This ensures that data will only be transferred to the desired node or nodes. For multi-node clusters, you can specify multiple nodes to be excluded using a comma-separated list with no spaces in between.
ia tier rebalance --exclude <node_ip1,node_ip2,node_ip3,...>

In the following example, we exclude the original config node (data000) of the single-node cluster so the data goes to a dedicated node. This allows us to view a list of available nodes, and verify that the new data node appears on the list. We omitted the --run option, to use dry-run mode and view the potential results.

ia tier rebalance data --exclude 10.1.63.7
Shard Layout ID Shard ID Old Host New Host
1 0 10.1.63.7 10.1.61.224
1 1 10.1.63.7 10.1.61.224
1 2 10.1.63.7 10.1.61.224
  1. Enter the same command (as in step 2), this time with the --run option to execute the rebalance, as shown in the following example.
ia tier rebalance data --exclude 10.1.63.7 --run
Shard Layout ID Shard ID Old Host New Host
1 0 10.1.63.7 10.1.61.224
1 1 10.1.63.7 10.1.61.224
1 2 10.1.63.7 10.1.61.224
The changes have been committed. Run with --status option to monitor the progress.
  1. Verify the status of the rebalance by entering the following command. 
ia tier rebalanace data --status

In the following example, the new node 10.1.61.224 (data001) has 0 events left to transfer, which is correct since it had none to begin with. The original node, 10.1.63.7 (data000) is moving 4 shards to data001. The Events column indicates the number of events that are left to move, not what was on data000 to begin with. 

ia tier rebalance data --status
Host Status Events Shard Path Destination Host
10.1.61.224 ok 0    
10.1.63.7 ok 30210    
    7323 /var/lib/data/datasets/f/1/1/1/music-userID-data-8 10.1.61.224
    7690 /var/lib/data/datasets/f/1/3/1/music2-userID-data-7 10.1.61.224
    6663 /var/lib/data/datasets/f/1/2/1/music-artist-data-19 10.1.61.224
    8534 /var/lib/data/datasets/f/1/3/1/music1-userID-data-15 10.1.61.224

When the rebalance is complete, the status should look something like the following example. Your new-data node is now ready for use.

ia tier rebalance data --status
Host Status Events Shard Path Destination Host
10.1.61.224 ok 0    
10.1.63.7 ok 0    

3. Rebalance a string node

The steps for rebalancing a string node are similar to those used to rebalance a data node

To rebalance a string node:

  1. Before initiating a rebalance, first check to make sure that a rebalance is not already in progress by entering the following command.
ia tier rebalance string --status

The ok in the results means there is no rebalance is in progress. The Events column shows the events that still need to be migrated. Since we haven't started a migration, there should be no events (0) that need to be migrated.

If you do not see an ok Status, you can check data-server logs for problems, such as not enough space in the destination. Then contact Interana Support immediately.

 

  1. To rebalance strings to a dedicated string node, use the following command. For multi-node clusters, you can specify multiple nodes to be excluded using a comma-separated list, with no spaces in between.
ia tier rebalance string --exclude <ip_add1,ip_add2,ip_add3,...>

In the following example, we exclude the original config node and omit the --run option to specify dry-run mode.

ia tier rebalance string --exclude 10.1.63.7
Nunber of Target Hosts Rounds of Transfer Needed Status
4 4 Not running yet.
  1. Enter the same command as in step 2, this time using the --run option to execute the rebalance, as shown in the following example.
ia tier rebalance string --exclude 10.1.63.7 --run
Number of Target Hosts Rounds of Transfer Needed Rounds Left Number of Shards to Move Status
4 4 4 10 Started
The changes have been committed. Run with --status option fo monitor the progress.
  1. To monitor the progress of the rebalance, enter the following command.
ia tier rebalance string --status

The following example shows the following results.

Total Number of Shards to Move Rounds Left Number of Shards Moving Status
10 out of 12 on this cluster. 3 1 out of 5 for this cluster. Rebalance is in progress.

At this point there are a total of 10 shards left to move. Out of the first round of 5 shards, there is still 1 shard left to move. Once this shard has finished moving, the Total Number of Shards to Move will be 5 out of 12 on this cluster and the Number of Shards Moving will be 5 out of 5 for this cluster. This is because we have moved 5 out of 10 shards and we have 5 out 5 shards left to move.

When the rebalance completes successfully, you should see the following message.

Status
No rebalance is in progress

4. Optional: Remove a node

The commands to remove a node default to dry-run mode. To execute the command, use the -r or --run option.

You must remove nodes in the order in which they were added to the cluster.

The procedure you use for removing a node depends on the node type:

  • api and import nodes — only requires the ia node remove <node_uid> command to remove the node from the cluster. You must use the uid or name of the node (such as data000). Note that when adding or removing import nodes, you must restart the import-pipeline service on ALL import nodes so that import takes the change into account when deciding which import node will handle which data. To do this, ssh into each of the import nodes and enter this command: sudo service interana import-pipeline restart
  • data and string nodes —  first exclude the node from the tier, then use the ia node remove <node_uid> command to remove the node from the cluster.

Prerequisite

Before removing a node from an Interana cluster, rebalance the node to clear any active data as described in Rebalance a data node and Rebalance a string node.

To remove an api or import node:
  • Log in to the cluster and enter the following command.
ia node remove <node_uid>

The following example shows the prompt that displays as a result.

ia node remove api000 -r

Successfully removed node(sO from the cluster.

To remove a data or string node:
  1. Log in to the config node of the cluster, and enter the following command to --exclude the node from the tier.
ia tier rebalance <node> --exclude '<node_ip>' -r

The following example excludes a data node from the tier, in preparation for its removal from the cluster.

ia tier rebalance data --exclude '10.0.0.2' -r

#wait for it to finish... check using --status
  1. Remove the node from the cluster with the following command.
ia node remove <node_uid> -r

The following example removes the node from the cluster.

ia node remove data002 -r

Pro tips for monitoring a rebalance

Monitoring a rebalance with the CLI is covered in Rebalance a data node and Rebalance a string node. This section explains how to monitor a rebalance by reviewing a query-server.log.

To monitor a rebalance by reviewing a log file:

  1. On the config node, tail the query-server.log
  2. Next, grep for replicate messages. If you see replicate messages, the rebalance is still in progress.
  3. If you do not see any more replicate messages, it means one of the following:
  • The rebalance completed. 
  • The rebalance stalled with a problem and you should contact help@interana.com.
  • Was this article helpful?