You are viewing the documentation for Interana version 2. For documentation on the most recent version of Interana, go to docs.interana.com.

Understanding string data

Alex Muller

Interana supports storing and analyzing string data (strings are sequences of characters, such as words). However, there are limits to Interana's analytic capabilities for strings. This article explains the different categories of string data and their implications for the performance and storage of your system. A better understanding of your string data will help you decide how to use Interana most efficiently for analyzing strings.

Cardinality and Why It Matters

Cardinality refers to the number of unique elements in a set. Cardinality is an important concept for strings in Interana, because Interana stores one copy of every unique string ingested into the system*, discarding duplicates. Therefore the amount of storage used by the string tier is determined by the cardinality of the string set ingested, and the length of the strings themselves.

Another useful concept is cardinality percentage, which refers to the ratio of cardinality (unique strings) to total strings. When this percentage is low, the string tier will use relatively little space compared to the volume of the data set. As this value increases, the string tier will use more space relative to the data set.
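
As an illustration of the two definitions above, here is a minimal Python sketch that computes cardinality, cardinality percentage, and the bytes needed to hold one copy of each unique string. The function name `cardinality_stats` and the sample values are hypothetical, and the byte count is only a rough stand-in for string tier usage, not a description of how Interana actually stores strings.

```python
from collections import Counter


def cardinality_stats(values):
    """Summarize one string column (hypothetical helper, not an Interana API)."""
    counts = Counter(values)
    total = sum(counts.values())      # total strings seen
    cardinality = len(counts)         # unique strings
    # Bytes needed if each unique string is kept exactly once (UTF-8 assumed).
    unique_bytes = sum(len(s.encode("utf-8")) for s in counts)
    pct = 100.0 * cardinality / total if total else 0.0
    return {
        "cardinality": cardinality,
        "total": total,
        "cardinality_pct": pct,
        "unique_bytes": unique_bytes,
    }


# Sample data, made up for illustration.
countries = ["US", "DE", "US", "FR", "US", "DE", "US", "US"]
session_ids = ["a1f3", "b7c2", "9e04", "77aa"]

print(cardinality_stats(countries))
print(cardinality_stats(session_ids))
```

In this sample, the country column has a cardinality percentage of 37.5, while the session ID column sits at 100: every value is unique, so deduplication saves nothing.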

* Sets of unique strings are stored per table, so the same string appearing in two tables will be stored twice in the string tier. However, only one copy of each string is stored per table copy.

Categories of String Data

We classify string data into four categories, ordered by increasing string volume:

Category 1: Low cardinality enums

Finite sets of strings, for example the countries of the world. They are small, enumerable sets (cardinality < 10,000) whose values change infrequently in the data.

Category 2: Medium cardinality sets that evolve slowly

Larger string sets that change gradually over time (cardinality < 1,000,000). Datestamps and Twitter hashtags fall into this category. The string values in this category tend to repeat heavily, and the set of values changes slowly over time.

Category 3: High cardinality sets with a large amount of variation

Huge, varied string sets (cardinality >= 1,000,000). This category is characterized by a large volume of infrequently appearing strings, perhaps in addition to a set of commonly repeated strings. URLs, IP addresses, and search query strings are examples of Category 3 data sets.

Category 4: Extreme cardinality identifiers

Extremely high cardinality sets with a high cardinality percentage. Most commonly this category includes unique identifiers (e.g., transaction IDs and session IDs). Specific values appear infrequently in the data, but new values appear constantly.
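
The category boundaries above can be sketched as a small Python function. The cardinality thresholds (10,000 and 1,000,000) come from the descriptions above. The article gives no number for a "high" cardinality percentage, so the `high_pct_threshold` default of 50 is purely an assumption for illustration, as is the function name.

```python
def classify_string_column(cardinality, total, high_pct_threshold=50.0):
    """Return the category (1-4) for a string column.

    high_pct_threshold is an assumed cutoff; the article does not define one.
    """
    if total <= 0 or cardinality > total:
        raise ValueError("expected 0 < cardinality <= total")
    cardinality_pct = 100.0 * cardinality / total
    if cardinality < 10_000:
        return 1  # low cardinality enums
    if cardinality < 1_000_000:
        return 2  # medium cardinality, slowly evolving
    if cardinality_pct >= high_pct_threshold:
        return 4  # extreme cardinality identifiers
    return 3      # high cardinality, large variation


# Made-up column statistics.
print(classify_string_column(200, 10_000_000))             # countries
print(classify_string_column(5_000_000, 1_000_000_000))    # URLs
print(classify_string_column(900_000_000, 1_000_000_000))  # session IDs
```

Treat the result as a starting point only: a column of identifiers on a small data set can fall below the cardinality cutoffs today and still grow into Category 4 as new values keep arriving.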

Interana provides several query functionalities for string values, such as string comparison and regex matching. The interface can also provide typeahead for string values. These functionalities become unavailable when the size of a string set exceeds certain thresholds (these thresholds depend on the configuration and resources of your cluster).

The categories of string data are useful as guidelines for typical system functionality:

[Table: rows Category 1, Category 2, Category 3, Category 4; surviving column heading "Starts with/Ends with/Arbitrary Regex". The icons in the table cells were not preserved.]

This table provides guidelines for typical performance; the actual performance of your cluster will vary depending on the specifics of the hardware and the data set you use.

• Green check marks indicate the functionality typically performs well with that category of string data.
• Yellow caution signs indicate performance will vary.
• Red x marks indicate the functionality will typically perform poorly or not at all with that category of string data.

Notice that Category 4 has all red x marks. We do not recommend ingesting Category 4 data as strings in Interana. Instead, this type of data can often be ingested as an identifier type through a process called hexing. This process can be applied to columns during the setup of your ingest pipeline. Columns that are hexed are not available for typeahead, string matching, or regex. However, these columns can still be used in group by, filtered to specific values, and joined with lookup tables. See Apply a Hex Transform to a Shard Key Column for more information.
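
This article does not describe how hexing works internally. The Python sketch below assumes it amounts to interpreting a hexadecimal identifier as an integer, which would explain why hexed columns support exact-value filters, group by, and joins but not substring or regex matching. The function name `hex_to_int` and the sample ID are hypothetical; this is not Interana's implementation.

```python
def hex_to_int(identifier):
    """Interpret a hexadecimal identifier as an integer (assumed behavior)."""
    # Strip dashes so UUID-style values parse as plain hexadecimal.
    return int(identifier.replace("-", ""), 16)


# Made-up session ID in UUID layout.
session_id = "3f2a9c1e-7b4d-4e8a-9f10-2c6d8e5a1b07"
as_int = hex_to_int(session_id)

# Equality survives the conversion, so exact filters and joins still work.
print(as_int == hex_to_int(session_id.upper()))
# A 32-digit hex value fits in 128 bits.
print(as_int.bit_length() <= 128)
```

Once the value is a number, there are no characters left to match against, which is consistent with typeahead, string matching, and regex being unavailable on hexed columns.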

Aside from query and UX performance, the string data you put into your system also affects the resources consumed on the string tier. Intuitively, the higher the category of string data ingested, the more storage tends to be used on the cluster. The hexing process described above stores the data on the data tier, so it does not use string tier resources. The total storage used on the string tier can affect system performance because it competes for other available resources, such as memory and I/O. These resources are shared across all string columns and tables, so it is sometimes possible to trade off the number of strings stored for system performance.

What You Can Do

Having an understanding of your string data allows you to optimize the configuration of your cluster to meet your needs. It is always recommended to hex Category 4 columns, rather than ingesting them as strings. When faced with fixed system resources, consider which large Category 3 columns are essential to your workflow. You can make tradeoffs on the number of large string columns stored or the retention window of your data. There are also strategies for reducing the cardinality of your string data, such as applying a transformer during ingest to split a string column into multiple columns. One final tip:

There is a tool called the cardinality monitor (`cardinality_monitor.py`) that returns statistics for your ingested string data. It outputs string cardinality, memory, and storage consumption for each column, as well as rollups for each table and the overall system. Using this tool, you can easily determine which high cardinality string columns use the most system resources.
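
To illustrate the column-splitting strategy mentioned above, here is a minimal Python sketch that splits a URL column into host, path, and query columns and compares their cardinalities. It uses the standard library rather than an Interana transformer, and the column names (`url_host`, `url_path`, `url_query`) and sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split one URL into lower-cardinality parts (illustrative column names)."""
    parts = urlsplit(url)
    return {
        "url_host": parts.netloc,
        "url_path": parts.path,
        "url_query": parts.query,
    }


# Made-up sample data.
urls = [
    "https://example.com/search?q=red+shoes",
    "https://example.com/search?q=blue+hats",
    "https://example.com/cart?session=abc123",
    "https://shop.example.com/search?q=green+socks",
]

rows = [split_url(u) for u in urls]
print("full url", len(set(urls)))
for column in ("url_host", "url_path", "url_query"):
    print(column, len({row[column] for row in rows}))
```

The host and path columns repeat heavily even when every full URL is unique, so they remain usable for string matching, while the high cardinality query column can be dropped or handled separately if it is not essential to your workflow.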

I hope this guide has helped you better understand how strings are handled in Interana. Have fun!