Skip to main content

What to think about before you import data

Interana’s most powerful query patterns analyze the behavior of many actors in parallel, connecting multiple events to characterize the actor’s behavior over time. Unlike a traditional data warehouse system, Interana computes these per-actor metrics on the fly during the query, allowing an extraordinary degree of interactivity when defining and refining them. This is especially important when the log data is “raw” and likely to be contaminated in ways that are not fully understood up front. To get the most benefit out of this dynamic filter/metric architecture, we recommend that you follow a few principles in your logging.

Events can be related only if they’re in the same table

Every event will be placed in one or more tables in your Interana instance. Each property of the event will go in the table column corresponding to the name of the property. Typically, an event has several columns that describe the event itself, and several that allow it to be placed in the context of other events. For example, an event representing a click on a web page might have properties describing the click target and the state of the user’s interaction with the page, plus properties that identify the user account and server-side session that went into creating the web page.

A table may contain events of very different kinds: client and server, interactive and batch processes, logged “live” and delayed/backdated. What matters is not the uniformity of the columns on the events, but combining properties of multiple events at query time to understand the behavior of the actors they have in common. Focus on the layers of your system from which events can be logged with enough context to relate them to one another.

Every event must have a timestamp

That timestamp doesn’t always correspond to the exact time that the event is logged; it might be backdated to reflect an estimate of when the event “really happened,” or filled in based on data extracted from another source.

When thinking about what timestamp to put on an event, it’s a good idea to put it in sequence with related events, and to ensure that this sequence is consistent for each actor, even if the related events are logged by different systems. Events can only be related to one another efficiently if they are nearby in time, and some Interana features are much easier and more efficient to use if the events they relate are reliably in a specific order.

Favor a consistent ordering of timestamps on related events over absolute accuracy, even to the extent of time-shifting some of them and correcting for this later in the analysis.

Associate events with actors

Many queries require that an event be associated with an actor. Depending on the nature of your data and the complexity of your analysis needs, you may have one or more classes of actors whose behavior you are interested in analyzing. You may have vendors and purchasers, or users and advertisers, or locations and personnel.

Each class of actor corresponds to a shard column, designated when the event table is created. If a table has several shard columns, there will be several copies of its event data in your Interana instance, each organized by one shard column; each step of computation within a query will be routed to the appropriate copy of the event data. All copies of an event table share other resources, such as column definitions and descriptions, the list of strings found in each non-numeric column, and user access controls.

Not every event in a table has to have a value for every shard column. For example, a table may contain some events associated solely with vendor behavior, and some events associated solely with purchaser behavior, as well as events representing interactions between vendors and purchasers. There are even valid uses for events with no associated actor; for instance, some of a web site’s traffic may arrive through an anonymized channel, and it may be appropriate to include this traffic when computing overall statistics about site performance. However, any query that analyzes the behavior over time of a class of actors will automatically exclude events that are not associated with an actor in that class. 

Organize your logging around the classes of actors that participate in the most important events, and label each event consistently with the unique IDs of the actors involved.

Plan your shard keys

An actor has a permanent identifier, called a shard key, which is a specific value of a specific shard column. If you want to analyze any relationship between multiple events, those events must have a shard key in common. A table can have multiple shard columns - at some cost in system resources - but these columns are part of the definition of the table, and changing them is a major administrative operation. Where possible, use multiple shard columns in the same table to associate events with completely different classes of actors, not with different kinds of identifiers for the same actor.

Think carefully about the potential shard keys that occur in your data, especially if an actor may have different identifiers in different layers of your logging architecture (for example, browser session ID vs. user account ID vs. billing account ID). A good rule of thumb is to identify actors by a property that is constant for the duration of one sequence of connected events. For example, if you run an e-commerce site, and you want to analyze a customer acquisition process that starts before the customer has logged into (or even created) a user account, you will need a shard column containing a unique identifier assigned at the beginning of the process. You may also want to analyze sequences of purchases by the same user account, but this analysis will rely on a different shard column, perhaps in a completely different table (which does not need to contain any of the events preceding the login step of each visit to the site).

Remember that the same real-world entity may be known by different IDs in different parts of your system, which are different “actors” for analysis purposes; plan your analysis accordingly.

Queries look at events within a time window

Every query has a time window, and only looks at the event data for events within (or, in some cases, close to) the time window. An actor may have other permanent properties, but for query purposes, most interesting properties of an actor can be extracted from events in or near the query’s time window. Event streams that involve long-lived user sessions, or other kinds of event context that isn’t always in the immediate time neighborhood, may need to be supplemented with periodic “keep-alive” logging. This can be added to the main log stream or supplied as a separate source of events to be folded into the same table.

Interana also supports “lookup tables” for permanent properties of an actor, and for mapping values of any column (for example, ZIP code) to properties not logged directly on events (for example, country, state, median income). This is analogous to a SQL left outer join. Lookup tables can be useful for anonymizing values of an event column or filling in properties that are not normally accessible from the main logging system; but almost any mutable property of an actor can be more efficiently and accurately captured by applying Interana’s “cohort” and “per-actor metric” capabilities to the event table itself. Give preference to categorizing actors by their recent behavior - and logging extra events as necessary to provide this context - over static auxiliary lookup tables.

Identify clusters of related events in as many ways as you can

You may also have smaller clusters of events for which you want to measure a cluster-level property. For example, the delay between delivering a message to a user’s inbox and the first time the user opened it for reading, or the number of times that the message was forwarded to other users. This kind of analysis works well when each event cluster is confined to a single shard key and is clearly separated from other clusters with the same shard key, in one of four ways:

  • A gap in time between adjacent clusters of events: this is one classic definition of a “session” in relatively unstructured log streams, and corresponds to an Interana “session,” with new sessions started based on an “idle time” parameter.
  • A specific event that reliably occurs between adjacent clusters of events: for example, a mobile app might record an event each time the app is started, which implies that the user switched away from the app long enough for it to be stopped by the mobile OS. This is another way of defining a “session,” and corresponds to “separator events” in Interana sessions.
  • One or more events within the cluster that occur in a known order, beginning with a specific “start event” with known properties (for example, an “event_name” column with the value “session_start”). Each occurrence of this event marks the start of a new cluster for the associated shard key. This corresponds to an Interana “funnel,” which is one of the more powerful mechanisms in Interana for analyzing relatively structured log streams.
  • A unique identifier for the cluster of events, in a different column from the shard key column: each value found in this colocated column should be associated with a single value of the shard key. For example, a “logged-in browser session ID” column could be colocated with a “user account ID” shard key, or an “ad impression ID” column could be colocated with an “app installation ID” shard key. This is particularly useful when several clusters of events for the same shard key may overlap in time - as in the example of a user receiving, opening, and forwarding a specific message, which may be interleaved with activities associated with other messages. Unlike adding a shard column, adding a colocated column is a trivial operation, as long as it’s present in your data. Colocated columns can be used together with both sessions and funnels, and a query may compute properties for each value of a colocated column and then summarize them as properties of the actor (shard key) to which they correspond.

Analysis of clustered events can provide great insights into users’ and customers’ behavior; but it can be a challenge to make this analysis robust against data loss and contamination due to software bugs, data handling errors, and unexpected or even malicious user behavior. Structuring your log data to make effective use of colocated columns may take some effort, but pays off in query speed and confidence in the results.

Identify clusters of related events in as many ways as possible - grouping in time, separator events, sequences with a known order, and per-cluster identifiers in colocated columns. Then back up your preferred analysis with cross-checks based on alternate clustering criteria.

Use transformations and derived columns, when necessary

Any column that is defined in an Interana table, but not present on a specific event within the table, is considered to have a null value for that event in that column. Nulls are represented more compactly in Interana than any other value. If there is a column in your data that usually has a particular value (either fixed, or predictable from other properties of the same event), and only occasionally has some other value - for instance, an “affected account ID” that is almost always the same as the event’s main account ID, but is occasionally the account of another family member - it may be worthwhile to represent the predictable value with a null, and only record values that differ from the predictable value.

This transformation can be done as data is imported into Interana, and undone at query time using a derived column. Derived columns are an Interana feature that lets you compute additional event properties at query time, usually by combining several properties of the raw event. Other common use cases for derived columns include arithmetic (for example, divide the total price of a shopping cart’s contents by the number of items) and complex filter expressions (for example, exclude events that appear to be fraudulent by at least two out of eight criteria).

Log events as “raw” as possible, but consider using import-time transformations and derived columns to eliminate redundancy, store events more compactly, and postpone complex decisions to query time.

Build helper columns

There are situations where it may be appropriate to represent one action in the system doing the logging by multiple events in Interana. For instance, if an action is associated with two or more different actors in the same class - perhaps the sender and receiver of a message, both of which are user accounts - and you want to put it in sequence with each, you may need to log several copies of it, each with a different actor ID in the shard column. The copies may also differ in other columns, such as the role of the actor ID in the shard column (for example, “sender” vs. “receiver”).

If you do this, you should have:

  • a column containing an “action ID” that is the same for all copies (but different from the IDs of other multiple-event actions), omitted from non-multiple events
  • a column containing the “event multiplicity”, the number of events that were logged for this same action, also omitted from non-multiple events
  • a column containing a “duplicate index”, which should be null for exactly one event associated with each action (and non-null, possibly an integer, for the others)

These three columns, with the help of derived columns (to synthesize appropriate values for non-multiple events), can be used to build efficient queries that don’t over-count actions represented by multiple events.

Depending on the goal of the query, you might count unique action IDs, divide the weight of each event by its multiplicity, or filter out events whose duplicate index is non-null.

This illustrates a broader principle, worth keeping in mind as you refine your logging strategy: It’s better to have multiple helper columns, each mostly nulls, than to have one helper column that tries to account for all of the use cases. Columns are cheap as long as they’re mostly null, and practically free if they’re not used in queries. Think about how each event should contribute to summary statistics, and annotate it in all the ways that might help weight its contribution accurately.

Helper column example

It’s better to have multiple helper columns, each mostly nulls, than to have a single helper column that tries to cover every use case. When thinking about how to analyze your data, start from the query and work backwards.

For example, my company wants to track actions between customers (let’s call these actions “asks”). I want to count the number of events that happened yesterday where at least one of the accounts is a known bot, and I don’t want to double-count cases where both accounts are bots.

I have a table with shard key “either_user_id,” and customer events are imported twice: once with the “asking” customer in the “either_user_id” column and once with the “asked” customer in the “either_user_id” column.

The same table also has “bot classification” events, with the user ID in the “either_user_id” column.

I can create a "user_is_knownbot" cohort on either_user_id, consisting of IDs for which there has been a “bot classification” event with account_type = ‘knownbot’ within the last 48 hours. Then I can create a query that says “count unique either_user_id, filter: either_user_id in cohort ‘user_is_knownbot’, event_type = ‘ask’”. This gives me the number of knownbot _users_ that participated in “ask” events, but it doesn't give me the number of asks.

To get the number of asks, I can create the query “count unique action_id, filter: either_user_id in cohort ‘user_is_knownbot’, event_type = ‘ask’”.

I don’t want to put a non-null action_id on every event because it would be very expensive (in terms of storage and performance requirements). Instead, I can put a non-null raw_action_id only on asks, and make action_id a derived column.

long action_id () {
 if (raw_action_id == -1) return hash(either_user_id, timestamp);
 return raw_action_id;

(Where hash is a function that’s just strong enough to make collisions rare.)

This is much less expensive than having a non-null action_id on rows that don’t need it.

Strategic use of nulls and derived columns can help you avoid having massively duplicated, high-cardinality columns in your data.

  • Was this article helpful?