Use the Transformer Library to transform data before ingest

This topic refers to several transformer "steps." For a full list of steps, see Transformer Library Steps and Examples.

Interana users often want to transform their data before ingesting it into Interana. There are many reasons for applying a data transformation: you may want to add a category to all events coming from a certain data source, parse out a high-cardinality string to improve usability and string tier performance, or add a file name to assist in troubleshooting logging errors.

You can perform data transformations in more than one place. You may want to transform your data before it ever hits the Interana import server. If you have a sophisticated data pipeline in place at your organization that allows you to transform data, by all means go ahead and use it, as this approach will allow complete control over your transformations as well as a faster import to Interana. You can also perform complex transformations, such as looking up values in a relational data store and supplementing your event data with high-value information not accessible from the Interana import server.

But sometimes this is not possible. Perhaps you don't have the infrastructure to perform complex transformations on your side, or want to apply Interana-specific transformations to data that is used by a variety of analytics tools. In this case, we've got you covered with the Interana Transformer Library. The Interana Transformer Library allows you to transform your data as a part of the data ingest process, in a fully-supported, blazing fast, and best-practice framework.

This howto provides everything you need to know to get started transforming your data with the Interana Transformer Library.

Transformer Library example

For the purposes of this howto, I've constructed a simple example that we can use to understand the Interana Transformer Library application process.

I'll be assuming the role of a data analyst tasked with importing a data set into Interana for my team to use. For the past two months, we've been working with Interana to understand how our users interact with our desktop application. Our engineering team has put a lot of work into a new mobile application, which we're rolling out to a select group of customers next week, and I need to integrate the events from the mobile application into our main events table so that we can understand our mobile usage alongside our desktop usage.

Our desktop events look like this:

{
"username": "jdrahos@cool-tech.com",
"timestamp": "2016-09-17 19:34:12.254",
"event_name": "send_message",
"source": "desktop",
"version": "3.4.2",
"message_id": 38452212,
"is_private": true
}

And the same event, from our mobile application, would look like this:

{
"username": "jdrahos@cool-tech.com",
"client_timestamp": "2016-09-17T19:34:12.254Z",
"event_name": "send_message",
"mobileversion": "0.8.1",
"message_info": "38452212/true"
}

Since we'll be sending our mobile events directly into our S3 bucket, I've decided to use the Interana Transformer Library to make sure that our mobile events integrate seamlessly with our desktop events.

Step 1: Identify the transformations you want to apply

First, I need a complete understanding of the problems I want to solve with my transformations. Since I want the mobile data to integrate seamlessly with my desktop data, I'll want to make the mobile events as similar as possible to the desktop events, while retaining the ability to separate mobile and desktop usage when I need to.

Given these requirements, here are the transformations I'd like to apply:

  1. My mobile timestamp field has a different name and a different format than the desktop timestamp field. Since the time key on our events table is the timestamp column, I'll need to resolve both differences.
  2. We don't pass a source field in our mobile events. I know that we'll want to use the source field to be able to differentiate between application sources, so I'd like to add that field to our mobile events.
  3. The desktop event attributes message_id and is_private are contained in one mobile event attribute: message_info. I'll want to separate these two pieces of information into separate columns, just as in our desktop events.
  4. Since these mobile events are being generated from a new code path, I'll want to be able to easily debug errors in these events. Since we're using a client timestamp for our mobile events, there's no guarantee that we'll be able to track down the file an erroneous event came from in S3. I'll make sure to add the file name that contained the event so that we can diagnose issues faster.
  5. Our mobile events are gzipped to save storage costs, so I'll need to decompress these files before import.

Now that I understand the transformers required, I can start developing my transformer configuration.

Step 2: Write the transformations

There are two places I can work to implement my transformers: the Interana Import Wizard (currently being tested internally), or manually on the import server. Since the Interana Import Wizard is currently in development, I'll be writing my transformations from scratch in this example. Don't worry, it's not too tough!

An Interana Transformer Library configuration file is composed of a list of transformer "steps." Every file (and every event within that file) passes through these steps on its way to an output file.

I start by constructing a very basic configuration that decompresses the file, decodes it, loads each JSON event into a Python dictionary (where the actual transformations will happen), and dumps the event back into JSON. We do this with a list of four steps, each of which is itself a list:

[
["gunzip"],
["decode"],
["json_load"],
["json_dump"],
]

Now that I have that done, it's time to start implementing the actual data transformations. I'll start with the first transformation, the timestamp.

Reformat and rename the timestamp

We'll be using the time_convert step to transform the mobile timestamp format into the desktop timestamp format. We'll be using three time_convert arguments for this transformation:

  • read_formats to specify the time formats we want to convert
  • write_format to specify the format we want to convert to
  • column to specify the Python dictionary key that contains the value we want to convert (remember, the json_load transformer converts the JSON into a Python dictionary for us to work with)

Putting it all together, our transformer step looks like this:

["time_convert", {"read_formats": ["%Y-%m-%dT%H:%M:%S.%fZ"], "write_format": "%Y-%m-%d %H:%M:%S.%f", "column": "client_timestamp"}],

Now we see why our transformer configuration file is a list of lists: the more complex transformer steps consist of a string indicating the transformer name followed by a JSON object containing the transformer parameters.

Next, we'll need to rename client_timestamp. We'll use the rename step for this:

["rename", {"column": "client_timestamp", "new_name": "timestamp"}],

A quick note on rename: if you rename a column to an already existing column, you will replace the value of that column with the value of the renamed column.
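To make the effect concrete, the sample mobile event's field

"client_timestamp": "2016-09-17T19:34:12.254Z"

should come out of these two steps looking like its desktop counterpart (the exact sub-second padding depends on how the step renders %f):

"timestamp": "2016-09-17 19:34:12.254"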

Add the source field to mobile events

We'll use the add_label step to add the source key (with the value mobile) to each of our mobile events:

["add_label", {"column": "source", "label": "mobile"}],

Just like rename, if you add_label to an existing column, you'll overwrite the value of the existing column.

Parse out message_info

We'll be using my favorite transformer step, regex_extract, to parse out message_id and is_private from mobile event field message_info. We'll use the following arguments of regex_extract:

  • column to specify what column we want to parse (message_info in our case)
  • output_columns to specify the columns that we want to place our regex capturing groups in (message_id and is_private in our case)
  • regex to specify the regular expression we wish to use to parse our column ((\d*)/([a-z]*) in our case)

Putting it all together, our transformer step looks like this:

["regex_extract", {"column": "message_info", "output_columns": ["message_id", "is_private"], "regex": "(?:(\d*)/([a-z]*)|.*)"}],

I've surrounded the regex above in a non-capturing group with a general second alternative because I don't want to get warnings in the logs if the regex does not match (if you want to receive warnings, just use the raw regular expression).

Next, I'll go ahead and omit the message_info column, as it's no longer needed:

["omit", {"columns": ["message_info"]}],

You may notice that the value for the columns key is a list – if we want to, we can specify a list of columns to omit.
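Taken together, regex_extract and omit turn the sample mobile event's field

"message_info": "38452212/true"

into the two separate fields our desktop events already have (note that the captured values are strings at this point; more on that in the transformer notes below):

"message_id": "38452212",
"is_private": "true"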

Add the filename

Finally, I'll add the source filename to each event. This is done using add_filename:

["add_filename", {"column": "filename"}],

Whether your file is on the local file system or in your favorite cloud storage, like S3 or an Azure blob, the filename will now be added to each event to assist in troubleshooting.
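For example, an event read from a file named mobile_events.gz (an illustrative name; whether the value is a bare file name or a full path or URI depends on where the file lives) would now carry:

"filename": "mobile_events.gz"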

Step 3: Assemble and test the transformations

Now that we've written our transformers, it's time to test them! First, assemble your different transformers between your json_load and json_dump steps:

[
["gunzip"],
["decode"],
["json_load"],
["time_convert", {"read_formats": ["%Y-%m-%dT%H:%M:%S.%fZ"], "write_format": "%Y-%m-%d %H:%M:%S.%f", "column": "client_timestamp"}],
["rename", {"column": "client_timestamp", "new_name": "timestamp"}],
["add_label", {"column": "source", "label": "mobile"}],
["regex_extract", {"column": "message_info", "output_columns": ["message_id", "is_private"], "regex": "(?:(\d*)/([a-z]*)|.*)"}],
["omit", {"columns": ["message_info"]}],
["add_filename", {"column": "filename"}],
["json_dump"],
]

Remember that events pass through the transformer configuration sequentially, so the order is important! For instance, you would not want to omit the field message_info before you extract its contents.

Now it's time to test out your configuration. Go ahead and place some test events in a file on your import node, along with your configuration file. I'll put my test events into the file test.gz and my configuration in config.

Next, run generators.py to test your transformers. Use this command to test config on test.gz and place output in out:

/opt/interana/backend/import_server/generators.py -i test.gz -o out -c config

You'll see a message reporting the success or failure of your transformer, with specific error messages if something went wrong. For instance, if we misspell the column parameter in the rename step:

["rename", {"colum": "client_timestamp", "new_name": "timestamp"}],

We'll get an error message pointing us to the problem.

Once you see that your transformation has completed successfully, inspect the output file and ensure that everything turned out all right.
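As a rough sketch of what to expect (the de-duplication token, the filename value, and the exact sub-second padding of the timestamp shown here are illustrative), the transformed version of our sample mobile event should look something like this:

{
"username": "jdrahos@cool-tech.com",
"timestamp": "2016-09-17 19:34:12.254",
"event_name": "send_message",
"mobileversion": "0.8.1",
"source": "mobile",
"message_id": "38452212",
"is_private": "true",
"filename": "mobile_events.gz",
"__ia__unique__token__": "<generated token>"
}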

Transformer notes

As you can see, our transformation was successful! A couple of notes:

  •  __ia__unique__token__ is a unique identifier that we use to perform de-duplication. This field will be generated any time you use the json_load step, and by default is calculated using the entire event contents. If you want to indicate specific columns to calculate this de-duplication token from, you can do so using the optional argument unique_columns (see the sketch after this list).
  • You may have noticed that is_private and message_id are strings, not booleans and ints, respectively. This is fine; because of how the import purifier works, booleans are converted to strings upon import, and since message_id is currently being stored as an int in the dataset, our quoted int will be converted into an integer upon import.
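For example, here's a minimal sketch of restricting the token to a few columns. This assumes unique_columns is passed as an argument to the json_load step, and that in our mobile data these three fields are enough to identify an event uniquely (note that they are the pre-rename column names, since json_load runs before rename):

["json_load", {"unique_columns": ["username", "client_timestamp", "message_info"]}],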

Step 4: Apply the transformer configuration to the pipeline

Now that we're happy with the results of our transformer configuration, it's time to apply it to our import pipeline so that every file we import is transformed. If creating a new pipeline, specify that you want to apply the transformer configuration as follows:

/opt/interana/backend/import_server/setup_pipeline.py -t cooltech_events -p mobile_events --transformers ~/config

You can also apply the config file to the existing pipeline mobile_events for the table cooltech_events using the same command.

The transformer configuration will be stored in the Interana configuration database in the data_sources.transformer table, if you ever need to retrieve it.

At this point, we're done! I can now create an import job with the updated pipeline and have both mobile and desktop events in the same table, integrated despite different formats for seamless querying.

What's next

Now it's time to analyze that freshly transformed data!
