Planning your Interana deployment

This document offers guidelines for planning an Interana deployment that meets your company's current data analytics needs, as well as scaled growth for the future. 

Interana—behavioral analytics for your whole team

Interana is full-stack behavioral analytics software with a web-based visual interface and a scalable, distributed back-end datastore that processes queries on event data.

Interana's intuitive graphical user interface (GUI) provides interactive data exploration for a wide range of users across the spectrum of digital businesses. The Visual Explorer encourages rapid iteration with point-and-click query building and interactive visualizations, all without complicated query syntax. You can go from any Living Dashboard to explore the underlying data, change parameters, and drill down to the granular details behind the summary.

To plan an Interana deployment that is optimized for your company, review the following topics:

High-level overview—how Interana works

Interana is full-stack behavioral analytics software that allows users to explore the activity of digital services. Interana includes both its own web-based visual interface and a highly scalable, distributed back-end database that stores the data and processes queries. Interana supports Ubuntu 14.04.x in cloud environments, as well as on virtual machines or bare-metal systems. Interana enables you to ingest data in a variety of ways, including live data streams.

The following image shows the flow of data into the Interana cluster, from imported files (single and batch) to live data streams from HTTP and Kafka sources. Ingested data is transformed and stored on the appropriate node (data or string), then processed in queries, with the results delivered to the requesting user.

The basics—system requirements

It is important to understand the hardware, software, and networking infrastructure requirements to properly plan your Interana deployment. Before making a resource investment, read through this entire document. The following sections explain the sizing guidelines, data formats, and platform-specific information needed to accurately size an Interana production system for your company.

An Interana cluster consists of the following node types, which can be installed on a single server (single-node cluster) or across multiple servers (multi-node cluster).

  • config node — Node from which you administer the cluster. The MySQL database (DB) that stores Interana metadata is installed only on this node. Configure this node first.
  • api node — Serves the Interana application, merges query results from the data and string nodes, and then presents those results to the user. Nginx is installed only on the api node.
  • ingest node — Connects to data repositories (cloud, live streaming, remote or local file system), streams live data, downloads new files, processes the data, and then sends it to the data and string tiers, as appropriate.
  • data node — Stores event data. Must have enough space to accommodate all events and stream simultaneous query results.
  • string node — Stores the active strings in the dataset in compressed format. Requires sufficient memory to hold the working set of strings accessed during queries.
  • listener node — Streams live data from the web or cloud, also known as streaming ingest. Optional during installation.

The following table outlines basic software, hardware, and network requirements for an Interana deployment.

Software

  Operating system (OS): Ubuntu 14.04

  Linux desktop and server systems in a private cloud, or locally at your company site.

Hardware

  Minimum requirements for each node, unless otherwise noted:

  • 4 CPU cores
  • 8 GB RAM
  • 100 GB storage

Browser

  • Microsoft Edge—Latest version
  • Chrome—Latest version

Networking

  • Gigabit network connection
  • All nodes must be on the same subnet.
  • Each node must have a routable IP address that has not been remapped (via NAT).
  • There should be no restriction on local port connections.
  • Required open ports across the cluster are specified in Node configurations.

Guidelines for calculating the number of each type of node for your deployment are covered in the following sections.

Cluster configurations—the big picture

We recommend that you initially deploy a single-node sandbox cluster on which to test sample data. The sandbox test cluster will help you accurately estimate the necessary size and capacity of a production cluster. This section provides an overview of several typical cluster configurations:

Stacked vs. single nodes

The configuration of your cluster depends on the amount of data you have, its source, and projected growth. To validate the quality of your data and the analytics you wish to perform, we recommend you install a test sandbox single-node cluster prior to going into production. 

Stacked nodes are not recommended for large workloads, or if you are planning to scale your cluster at a later date.  

For smaller deployments, it is possible to stack nodes—install multiple services on one system—in the following ways (illustrated in the sketch after this list):

  • API, config, ingest—admin 
  • string, data—worker
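For illustration, the following minimal Python sketch shows how services might map to hosts in a stacked deployment. The host names are invented for the example; only the admin/worker grouping comes from the list above.

# A hypothetical stacked layout; host names are invented for the example.
cluster_layout = {
    "admin-1":  ["api", "config", "ingest"],  # admin stack
    "worker-1": ["string", "data"],           # worker stack
}

for host, services in cluster_layout.items():
    print(f"{host}: {', '.join(services)}")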

Review the following configuration examples and sizing guidelines, followed by data format and platform-specific issues.

Single-node cluster

The stacked single-node cluster is recommended for sandbox test environments that are used to validate data and assess the requirements for a production cluster.

Two-node cluster

The stacked two-node cluster configuration is for deployments with low data volumes. In this configuration, the API, config, and ingest services are stacked on one node, and the data and string services on the other. A Listener node for streaming ingest from a live data source is not included.

Five-node cluster

The following is an example of a basic five-node cluster configuration, in which each functional component is installed on a separate system. A Listener node for streaming ingest from a live data source is not included, as this tier is optional.

Ten-node cluster

The ten-node cluster configuration illustrates how to scale a cluster for increased data capacity with added data and string nodes. To increase disk capacity, scale the cluster linearly to maintain your original performance.

Sizing guidelines—for today and tomorrow

Determining the appropriate Interana cluster size for your company is largely based on the quantity of daily events that are generated. First, determine the current number of generated events, then estimate your company's projected data growth over a year, and multiply by the number of years in your expected growth cycle.
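For example, the following Python sketch walks through that arithmetic. The daily volume, growth rate, and growth cycle are invented inputs, not recommendations.

# Hypothetical inputs -- replace with your own measurements.
events_per_day = 200_000_000   # current daily event volume
annual_growth = 0.5            # projected data growth over a year (50%)
years = 3                      # number of years in the expected growth cycle

# Grow the current volume by one year of projected growth,
# then multiply by the number of years in the growth cycle.
projected_daily = events_per_day * (1 + annual_growth)
total_events = projected_daily * 365 * years

print(f"Plan for roughly {total_events / 1e9:.0f}B events")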

Consider the following when planning the size and capacity of your production Interana cluster:

Node configurations

The following table provides guidelines for Interana cluster node configurations.

Data nodes (#)

  • 8 CPU cores (1 CPU core scans 1 billion rows of data)
  • 64 GB RAM (8 GB per core on the node)
  • 1:1 ratio of disk space to total data size for the initial test cluster. For example, 1 GB of data requires 1 GB of disk space. Adjust the data-to-disk-space ratio after you know the compression rate of your data.

  Open ports across the cluster: 8500, 8050, 7070

Ingest nodes (#)

  The ingest node can be deployed on its own for intensive ingest requirements, or co-located (stacked) with other nodes (such as a data node) for optimal performance.

  • 8 CPU cores
  • 64 GB RAM (8 GB per core on the node)
  • 80 GB SSDs

  Open ports across the cluster: 8500, 8600, 8000

String nodes (3 or more)

  • 8 CPU cores (1 CPU core scans 1 billion rows of data)
  • 64 GB RAM (8 GB per core on the node)
  • 1:1 ratio of disk space to total data size. For example, 1 GB of data requires 1 GB of disk space.

  Open ports across the cluster: 8050, 2000-4000

API node

  60 GB disk space (per CPU core)

  The API node includes the following:

  • Analytics server (processes incoming query requests and returns the query results)
  • Web service
  • Front-end GUI

  The API node can be deployed on its own, or stacked with the config node.

  Open ports across the cluster: 80, 443, 8600, 8200, 8050, 8500, 8400

Config node

  60 GB disk space (per CPU core)

  The config node can be deployed on its own, or stacked with the API node.

  Open ports across the cluster: 3306, 8050, 8700

Listener node

  Manages streaming ingest of live data (data-in-motion) from a web or cloud source. The three streaming ingest utilities (Listener, Kafka, and Zookeeper) can be deployed separately or on a single node. For more information, see Install multi-node Interana.

  Open ports across the cluster: 2181, 2888, 3888, 9092

Capacity planning

Use the following guidelines to estimate the capacity for an Interana production cluster.

How to calculate capacity for each resource:

  • CPU: billions of events / number of cores per data node
  • Storage: bytes per event × number of events / available disk per data node
  • Memory: 0.35 × disk requirements (to hold half the events in memory)
  • Data nodes (#): max(disk, CPU, memory)
  • Ingest nodes (#): number of data nodes / 4 during bulk import, then 1 after
  • String nodes (#): number of data nodes / 4, rounded to the closest odd number. Use 3 to 5 string nodes; do not use fewer than 3.

It is recommended that you use SSD storage for the string nodes. Bytes per event can be computed after importing a sample of data; 40 bytes is a conservative estimate.
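The following Python sketch strings the formulas above together. The per-node defaults (8 cores, 64 GB RAM, 40 bytes per event) come from this document; the disk size per data node is an assumption for illustration.

import math

def plan_capacity(total_events, bytes_per_event=40, cores_per_node=8,
                  disk_per_node_gb=1000, memory_per_node_gb=64):
    """Apply the capacity formulas from the table above."""
    # CPU: billions of events / number of cores per data node.
    nodes_for_cpu = (total_events / 1e9) / cores_per_node

    # Storage: bytes per event x number of events / available disk per node.
    disk_gb = bytes_per_event * total_events / 1e9
    nodes_for_disk = disk_gb / disk_per_node_gb

    # Memory = 0.35 * disk requirements (half the events in memory).
    nodes_for_memory = 0.35 * disk_gb / memory_per_node_gb

    # Data nodes: max(disk, CPU, memory).
    data_nodes = math.ceil(max(nodes_for_cpu, nodes_for_disk, nodes_for_memory))

    # Ingest nodes: data nodes / 4 during bulk import, then 1 after.
    ingest_nodes = max(1, math.ceil(data_nodes / 4))

    # String nodes: data nodes / 4, rounded to the closest odd number,
    # kept between 3 and 5.
    string_nodes = round(data_nodes / 4)
    if string_nodes % 2 == 0:
        string_nodes += 1
    string_nodes = min(5, max(3, string_nodes))

    return {"data": data_nodes, "ingest": ingest_nodes, "string": string_nodes}

# Hypothetical workload of 24B events.
print(plan_capacity(24e9))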

Planning the data tier

The main constraint for the data tier is disk space. However, beware of under-provisioning CPU and memory resources, as doing so can result in reduced performance. Consider the following guidelines when planning a production data tier:

CPU

  • The number of rows and columns in a table directly affects performance. All columns matter, even if they are not utilized. Prior to importing data, omit columns that are not used.
  • If you have excessively large tables, consider increasing the CPUs of the data nodes.
  • Read performance scales linearly in proportion with the number of CPU cores.
  • For older-generation CPUs, 8 MB of L3 cache is the required minimum. Pre-2010 CPUs are not supported.

Memory and disk space

  • A good estimate is that a dataset with 1B events fits in 25 GB of memory. Memory requirements scale linearly in proportion with the number of events and nodes.
  • The Interana compression rate is approximately 15-20x (raw data bytes to bytes stored on the data servers).
  • DO NOT store data on the root partition.
  • AWS—Use either an ephemeral drive (SSD) or a separate EBS volume to store data. GP2 is the recommended minimum.
  • Azure—Use premium locally redundant storage (LRS).
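As a quick sanity check on these numbers, here is a sketch of the arithmetic; the raw data size and event count are invented inputs.

# Hypothetical inputs.
raw_gb = 2_000     # raw event data, in GB
events = 10e9      # number of events

# Compression: roughly 15-20x from raw bytes to bytes on the data servers.
disk_low, disk_high = raw_gb / 20, raw_gb / 15

# Memory: a dataset with 1B events fits in about 25 GB of memory.
memory_gb = events / 1e9 * 25

print(f"disk: {disk_low:.0f}-{disk_high:.0f} GB, memory: ~{memory_gb:.0f} GB")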

Planning the ingest tier

Consider the following guidelines when planning a production ingest tier:

Capacity

  • 50 GB of raw data / hour / core is a conservative estimate. A single 8-core ingest node is sufficient in most instances.
  • SSDs are recommended.

Sizing

  • In general, 1 ingest node per 4 data nodes is recommended. You can add ingest nodes to accommodate bulk imports, then remove the extra nodes after the bulk import is done. Performance scales linearly with the number of nodes.
  • Beware of over-provisioning. If the data or string tiers are over capacity, the ingest nodes can be affected, causing a drop in performance.
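For example, a rough bulk-import time estimate based on the 50 GB / hour / core figure; the data volume and node count are invented inputs.

# Hypothetical bulk import of 10 TB of raw data.
raw_gb = 10_000
cores_per_node = 8
ingest_nodes = 2

# Conservative throughput: 50 GB of raw data / hour / core,
# scaling linearly with the number of nodes.
gb_per_hour = 50 * cores_per_node * ingest_nodes

print(f"Bulk import takes about {raw_gb / gb_per_hour:.1f} hours")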

Planning the string tier

Consider the following guidelines when planning a production string tier:

Capacity

  • String columns typically have a lot of duplicate values, but each unique string is stored only once, verbatim.
  • High-cardinality string columns impede performance, so it's fine to omit duplicate columns.

Sizing

  • The main constraint is disk space. SSDs are strongly recommended.
  • An odd number of string nodes is required. Use 3 or 5 string nodes and keep an eye on disk usage.

Cluster configuration guidelines

It is recommended that you follow the production workflow described later in this document: first deploy a sandbox test cluster with sample data to properly assess your data and accurately estimate a production cluster.

Use the following guidelines in planning your cluster:

  • For a sandbox test cluster, start with one main dataset on the cluster, with 1 shard key. A separate table copy is required for each shard key, which means that if you want N shard keys for your dataset, you need N times the storage on your data tier. Defining the appropriate number and type of shard keys may be an iterative process for your production cluster. It is recommended that you start small and add shard keys as necessary, since each shard key requires a full copy of the dataset.
  • Log data usually ranges from 50-150 bytes per event (the storage required for 1 compressed event on the data tier); 80 bytes per event is a reasonable density estimate. High-cardinality integer data, such as timestamps, unique identifiers, and set columns, takes up the most space in a dataset. If your data contains a lot of this type of data, estimate a few more data nodes for your production deployment.
  • String data should be stored efficiently. The following estimates account for data with several high-cardinality string columns: 1 or 2 parsed URL columns, 1 user agent column, and 1 IP address column. Hex identifier columns should be stored as integers through the use of a hex transform (applied by default in Interana). Other "string" identifiers (like base64-encoded session IDs) are hashed into integers and stored on the data tier.
  • Consider your expected data retention and adjust the configuration accordingly. You will need more data nodes for a 90-day retention than for 30 days.
  • Data input volume is another factor to consider when planning a cluster. For instance, a single-node cluster with a total capacity of 24B events is adequate for 267M (24B / 90 days) events per day. To import a higher volume of events per day, larger import and data tiers are required to keep up with the ingest rate and service queries in a timely manner. The sketch after this list combines these density and retention guidelines.
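A minimal sketch that combines the density, retention, and shard-key guidelines above; all input values are invented for the example.

# Hypothetical dataset parameters.
events_per_day = 267_000_000   # daily event volume
retention_days = 90            # expected data retention
bytes_per_event = 80           # reasonable density estimate for log data
shard_keys = 2                 # each shard key requires a full copy

stored_events = events_per_day * retention_days
storage_gb = stored_events * bytes_per_event * shard_keys / 1e9

print(f"{stored_events / 1e9:.1f}B events, ~{storage_gb:,.0f} GB on the data tier")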

Given these guidelines, the following table lists some cluster sizing examples, based on the number of events stored in the cluster. If you want to store more than 80 billion events, it is recommended that you build one of the following clusters and import data first, so that you can more accurately size your larger cluster.

The AWS estimated costs in the following table are for backing up data and string nodes, and do not take into account organizational discounts. The estimates also assume the use of 1-year partial upfront reserved instances to reduce costs.

 
Single-node cluster

  Events: 4B
  Nodes: 1 stacked node: API, ingest, config, data, string
  AWS estimate: https://calculator.s3.amazonaws.com/...6-6DA91054DDC1
  $4,713.76 / year (partial upfront RIs)

Two-node cluster

  Events: 8B
  Nodes:
  • 1 stacked: API, ingest, config
  • 1 stacked: data, string
  AWS estimate: https://calculator.s3.amazonaws.com/...B-4A5B3B99E467
  $8,388.40 / year (partial upfront RIs)

Individual-node cluster

  Events: 24B
  Nodes:
  • 1 API
  • 1 ingest
  • 1 config
  • 3 data
  • 1 string
  AWS estimate: https://calculator.s3.amazonaws.com/...7-B517FD5B0856
  $18,274.96 / year (partial upfront RIs)

Mid-size cluster

  Events: 80B
  Nodes:
  • 1 API
  • 1 config
  • 10 data
  • 3 string
  AWS estimate: https://calculator.s3.amazonaws.com/...0-48230B20BD62
  $52,372.77 / year

Data types and formats—consider your source

It's important to consider the source of your data, as well as the data type and how it's structured. Streaming ingest requires a special cluster configuration and installation (see Install multi-node Interana). Be aware that some data types may require transformation for optimum analytics.

Data types

Interana accepts the following data types:

  • JSON—Interana's preferred data format. The JSON format is a flat set of name-value pairs (without any nesting), which is easy for Interana to parse and interpret. A minimal example follows this list.
  • Apache log format—Log files generated using mod_log_config. For details, see http://httpd.apache.org/docs/current/mod/mod_log_config.html. It's helpful to provide the mod_log_config format string used to generate the logs, as Interana can use that same format string to ingest the logs.
  • CSV—Interana accepts CSV format with some exceptions. First, ensure you have a complete header row. Next, ensure you are using a supported separator character (tab, comma, semicolon, or ASCII character 001). Finally, ensure your separator is clean and well-escaped—for example, if you use a comma as your separator, quote or escape any commas within your actual data set.
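To illustrate the preferred flat JSON shape, here is a hypothetical event; the field names are invented for the example.

import json

# A flat set of name-value pairs -- no nested objects or arrays.
event = {
    "timestamp": "2017-03-01T12:34:56Z",
    "user_id": 1842123,
    "action": "purchase",
    "country": "US",
}
print(json.dumps(event))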

For more information, see the Data types reference. For best practices in logging your data, see Log your data for Interana.

Data sources

Interana can ingest event data in a variety of ways, including single and batch file imports and live data streams from HTTP and Kafka sources.

Logging and adding data

How your data is structured is important for optimum analytics results and performance. Review the following topics:

Platform-specific planning—how much is enough?

The following table compares system recommendations for cluster configurations on AWS and Azure.

On-premise installations should only use local SSDs, no Storage Area Networks (SAN).

The Azure recommendations in the following table allow you to add premium storage.

 
 
Single-node: 1 node, 4 CPU, 8 GB memory

  AWS—t2.xlarge or m3.xlarge
  Azure—DS3_v2

Stacked nodes: 2 nodes, 4 CPU, 16 GB memory

  AWS:
  • API/config/ingest—c3.2xlarge
  • data/string—i3.xlarge

  Azure: DS3_v2

  Blob storage differs between the 2 nodes:
  • The stacked API/config/ingest node can use P1 LRS storage.
  • The stacked data/string node can use P2 LRS storage.

Small multi-node: 4 nodes, 4 CPU, 64 GB memory

  Depends on the stacked configuration, but in general the following configurations are recommended.

  AWS:
  • API/config—c3.2xlarge
  • ingest—r3.xlarge
  • data/string—i3.xlarge

  Azure:
  • API/config—DS3_v2
  • ingest—DS4_v2
  • data/string—DS12_v2

Large multi-node: 10 nodes, 4 CPU, 800 GB memory

  AWS:
  • API—c3.2xlarge
  • config—m3.large
  • ingest—r3.xlarge
  • string—i3.xlarge
  • data—i3.xlarge

  Azure:
  • API—DS13_v2
  • config—DS12_v2
  • ingest—DS12_v2
  • data—DS13_v2
  • string—DS13_v2

 

Production workflow—test, review, revise, and go

It is strongly recommended that you first set up a sandbox test cluster with a sampling of your data, so you can make any necessary adjustments before deploying a production environment. You will be able to assess the quality of your data and better determine the appropriate size for a production cluster.

Follow these steps:
  1. Install a single-node cluster in a sandbox test environment, as described in Install single-node Interana.
  2. Load a week's worth of data.
  3. Modify your data formats until you get the desired results.
  4. Go to the Resource Usage page: https://<cluster_ip_address>/?resourceusage
  5. Review the usage for the string and data nodes. For more information, see Track your data usage.
  6. From the usage for one week's worth of data, you can estimate usage for one month, and then one year.
  7. Factor in the estimated growth percentage for your company to create a multi-year usage projection (see the sketch after these steps).
  8. Install your Interana production environment and add your data.
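Steps 6 and 7 are straightforward extrapolation. Here is a sketch with invented numbers; the measured weekly usage and growth rate are hypothetical.

# Hypothetical usage measured in step 5 for one week of data.
week_disk_gb = 40
growth_per_year = 0.30         # estimated company growth percentage
years = 3

# Step 6: extrapolate one week to one month and one year.
month_gb = week_disk_gb * 52 / 12
year_gb = week_disk_gb * 52

# Step 7: factor in growth for a multi-year projection.
projected_gb = sum(year_gb * (1 + growth_per_year) ** y for y in range(years))

print(f"~{month_gb:.0f} GB/month, ~{year_gb:.0f} GB/year, "
      f"~{projected_gb:.0f} GB over {years} years")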

 
