
Use Transformer Library to Transform Data Before Ingest


Note: I will be referring to several transformer "steps" in this guide. For a full list of steps with examples included, see Transformer Library Steps and Examples.


Goal

It's not uncommon for Interana users to want to transform their data before ingesting it into Interana. There are many reasons for applying a data transformation: you may want to add a category to all events coming from a certain data source, parse out a high-cardinality string to improve usability and string tier performance, or add a file name to assist in troubleshooting logging errors, to name just a few.

Data transformations can happen in more than one place. You may want to transform your data on your side, before it ever hits the Interana import server. If you have a sophisticated data pipeline in place at your organization that allows you to transform data, by all means go ahead and use it; this approach gives you complete control over your transformations as well as a faster Interana import. You can also perform complex transformations, such as looking up values in a relational data store and supplementing your event data with high-value information not accessible from the Interana import server.

Sometimes this is not possible, however. Perhaps you don't have the infrastructure to perform complex transformations on your side, or you want to apply Interana-specific transformations to data that is used by a variety of analytics tools. In this case, we've got you covered with the Interana Transformer Library, which allows you to transform your data as part of the data ingest process, in a fully supported, blazing fast, best-practice framework. The purpose of this how-to is to give you everything you need to know to get started transforming your data with the Interana Transformer Library.

Exposition

For the purposes of this how-to, I've constructed a simple example that we can use to understand the Interana Transformer Library application process.  

I'll be assuming the role of a data analyst tasked with importing a data set into Interana for my team to work with.  For the past two months, we've been working with Interana to understand how our users interact with our desktop application.  Our engineering team has put a lot of work into a new mobile application, which we're rolling out to a select group of customers next week, and I need to integrate the events from the mobile application into our main events table so that we can understand our mobile usage alongside our desktop usage.

Our desktop events look like this:

{
    "username": "jdrahos@cool-tech.com",
    "timestamp": "2016-09-17 19:34:12.254",
    "event_name": "send_message",
    "source": "desktop",
    "version": "3.4.2",
    "message_id": 38452212,
    "is_private": true
}

And the same event, fired from our mobile application, would look like this:

{
    "username": "jdrahos@cool-tech.com",
    "client_timestamp": "2016-09-17T19:34:12.254Z",
    "event_name": "send_message",
    "mobileversion": "0.8.1",
    "message_info": "38452212/true"
}

Since we'll be sending our mobile events directly into our S3 bucket, I've decided to use the Interana Transformer Library to make sure that our mobile events integrate seamlessly with our desktop events.

First Step - Identify the Transformations You Wish to Apply

Before I start, it is important to have a complete understanding of the problems I want to solve with my transformations.  Since I want the mobile data to integrate seamlessly with my desktop data, I'll want to make the mobile events as similar as possible to the desktop events, while retaining the ability to separate mobile and desktop usage when I need to.  Given these requirements, I come up with the following list of transformations that I'd like to apply:

  1. My mobile timestamp field is different in name and format from the desktop timestamp field.  Since the time key on our events table is the "timestamp" column, I'll need to make sure that I resolve these differences.
  2. We don't pass a "source" field in our mobile events. I know we'll want to use the "source" field to differentiate between application sources, so I'd like to add that field to our mobile events.
  3. The desktop event attributes "message_id" and "is_private" are contained in one mobile event attribute - "message_info".  I'll want to separate these two pieces of information into separate columns, just as in our desktop events.
  4. Since these mobile events are generated from a new code path, I'll want to be able to easily debug errors in them. Because we're using a client timestamp for our mobile events, there's no guarantee that we'll be able to track down the file an erroneous event came from in S3. I'll make sure to add the name of the file that contained the event so that we can diagnose issues faster.
  5. Our mobile events are gzipped to save storage costs, so I'll need to decompress these files before import.

Now that I have a firm understanding of the transformers required, I can get started developing my transformer configuration.

Second Step - Write Transformations

There are two places where I can implement my transformers: the Interana Import Wizard (currently being tested internally), or manually on the import server.  Since the Interana Import Wizard is still in development, I'll be writing my transformations from scratch in this example.  Don't worry, it's not too tough!

An Interana Transformer Library configuration file is composed of a list of transformer "steps".  Every file (and every event within that file) passes through these steps on its way to an output file.

I start by constructing a very basic configuration that will simply decompress the file, decode it, load each json event into a python dictionary (where the actual transformations happen), and dump the event back into json.  We do this with a list of 4 simple steps, each of which is itself a list:

[
["gunzip"],
["decode"],
["json_load"],
["json_dump"],
]

Now that I have that out of the way, it's time to start implementing the actual data transformations.  I'll start with transformation (1), the timestamp.

Reformat and Rename Timestamp

We'll be using the time_convert step to transform the mobile timestamp format into the desktop timestamp format.  We'll use three time_convert arguments for this transformation:

  1. "read_formats" to specify the time formats we want to convert,
  2. "write_format" to specify the format we want to convert to,
  3. and "column" to specify the python dictionary key that contains the value we want to convert (remember, the json_load transformer converts the json into a python dictionary for us to work with).

Putting it all together, our transformer step looks like this:

["time_convert", {"read_formats": ["%Y-%m-%dT%H:%M:%S.%fZ"], "write_format": "%Y-%m-%d %H:%M:%S.%f", "column": "client_timestamp"}],

Now we see why our transformer configuration file is a list of lists: more complex transformer steps contain both a string indicating the transformer name and a json object containing the transformer parameters.
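
If you want to sanity-check the format strings before running anything, the conversion is easy to reproduce in plain Python. This is just an illustration of what the format strings mean, not how the step itself is implemented:

from datetime import datetime

# Parse the mobile timestamp using the read format, then write it back out
# using the desktop format. %f is microseconds, so the printed result carries
# six fractional digits: "2016-09-17 19:34:12.254000".
parsed = datetime.strptime("2016-09-17T19:34:12.254Z", "%Y-%m-%dT%H:%M:%S.%fZ")
print(parsed.strftime("%Y-%m-%d %H:%M:%S.%f"))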

Next, we'll need to rename "client_timestamp".  This is simple, using the rename step:

["rename", {"column": "client_timestamp", "new_name": "timestamp"}],

A quick note on rename: if you rename a column to an already existing column, you will replace the value of that column with the value of the renamed column.
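
Conceptually, the rename behaves like moving a key in the event dictionary, which is why an existing column gets overwritten. Here's a rough sketch of the idea in plain Python (not the library's actual implementation):

# Illustrative only: roughly what renaming "client_timestamp" to "timestamp" does.
# If the event already had a "timestamp" key, its old value would be replaced here.
event = {"client_timestamp": "2016-09-17 19:34:12.254", "event_name": "send_message"}
event["timestamp"] = event.pop("client_timestamp")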

Add "source" Field to Mobile Events

We'll use the add_label step to add the "source" key (with value "mobile") to each of our mobile events:

["add_label", {"column": "source", "label": "mobile"}],

Just like rename, if you add_label to an existing column, you will overwrite the value of the existing column.

Parse Out "message_info"

We'll be using my favorite transformer step, regex_extract, to parse "message_id" and "is_private" out of the mobile event field "message_info".  To do so, we'll use the following arguments of regex_extract:

  1. "column" to specify what column we want to parse ("message_info" in our case)
  2. "output_columns" to specify the columns that we want to place our regex capturing groups in ("message_id" and "is_private" in our case)
  3. "regex" to specify the regular expression we wish to use to parse our "column" ("(\d*)/([a-z]*)" in our case)

Putting it all together, our transformer step looks like this:

["regex_extract", {"column": "message_info", "output_columns": ["message_id", "is_private"], "regex": "(?:(\d*)/([a-z]*)|.*)"}],

I've surrounded the regex above in a non-capturing group with a catch-all second alternative because I don't want to get warnings in the logs if the regex does not match; if you do wish to receive warnings, just use the raw regular expression.
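
If you'd like to convince yourself that the expression behaves as expected, you can try it against the sample value in plain Python. This is only an illustrative check; regex_extract itself takes care of reading the column and writing the output columns:

import re

# The first alternative captures the two fields we want; the catch-all ".*"
# alternative matches anything else, leaving both groups empty.
match = re.match(r"(?:(\d*)/([a-z]*)|.*)", "38452212/true")
print(match.groups())   # ('38452212', 'true')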

Next, I'll go ahead and omit the "message_info" column, as it is no longer useful to me:

["omit", {"columns": ["message_info"]}],

You may notice that the value for the "columns" key is a list: if we want to, we can specify several columns to omit at once.

Add Filename

Finally, I'll add the source filename to each event.  This is easily done using add_filename:

["add_filename", {"column": "filename"}],

Whether your file is on the local file system or in your favorite cloud storage like S3 or an Azure blob, the filename will now be added to each event to assist in troubleshooting.

Third Step - Assemble and Test Transformations

Now that we've written our transformers, it's time to test them!  First, assemble your different transformers between your json_load and json_dump steps:

[
["gunzip"],
["decode"],
["json_load"],
["time_convert", {"read_formats": ["%Y-%m-%dT%H:%M:%S.%fZ"], "write_format": "%Y-%m-%d %H:%M:%S.%f", "column": "client_timestamp"}],
["rename", {"column": "client_timestamp", "new_name": "timestamp"}],
["add_label", {"column": "source", "label": "mobile"}],
["regex_extract", {"column": "message_info", "output_columns": ["message_id", "is_private"], "regex": "(?:(\d*)/([a-z]*)|.*)"}],
["omit", {"columns": ["message_info"]}],
["add_filename", {"column": "filename"}],
["json_dump"],
]

Remember that events pass through the transformer configuration sequentially, so order is important!  For instance, you would not want to omit the field "message_info" before you extract its contents.

Now it's time to test out your configuration.  Go ahead and place some test events in a file on your import node, along with your configuration file. I'll put my test event into the file "test.gz" and my configuration in "config".
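
If you need to assemble the gzipped test file yourself, something like the following will do. This is an illustrative sketch that assumes one json event per line; use whatever event layout your pipeline actually expects:

import gzip
import json

# Write our sample mobile event into a gzipped file named "test.gz".
event = {
    "username": "jdrahos@cool-tech.com",
    "client_timestamp": "2016-09-17T19:34:12.254Z",
    "event_name": "send_message",
    "mobileversion": "0.8.1",
    "message_info": "38452212/true",
}
with gzip.open("test.gz", "wt") as f:
    f.write(json.dumps(event) + "\n")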

Next, run generators.py to test your transformers.  To test "config" on "test.gz" and place the output in "out", we issue the following command:

/opt/interana/backend/import_server/generators.py -i test.gz -o out -c config

You'll see a message that communicates the success or failure of your transformer, with specific error messages if something went wrong.  For instance, if we misspell the "column" parameter in the rename step:

["rename", {"colum": "client_timestamp", "new_name": "timestamp"}],

We'll get a specific error message describing the problem.

Once you see that your transformation has completed successfully, inspect the output and ensure that everything turned out all right.
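
For our sample mobile event, the transformed output should look roughly like the following. The filename, the de-duplication token, and the exact fractional-second padding shown here are illustrative:

{
    "username": "jdrahos@cool-tech.com",
    "timestamp": "2016-09-17 19:34:12.254000",
    "event_name": "send_message",
    "source": "mobile",
    "mobileversion": "0.8.1",
    "message_id": "38452212",
    "is_private": "true",
    "filename": "test.gz",
    "__ia__unique__token__": "..."
}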

If your output looks as expected, the transformation has been successful!  A couple of notes:

  1.  "__ia__unique__token__" is a unique identifier that we use to perform de-duplication.  This field will be generated any time you use the json_load step, and by default is calculated using the entire event contents.  If you wish to indicate specific columns to calculate this de-duplication token from, you can do so using the optional argument unique_columns.
  2. You may have noticed that is_private and message_id are strings, not a boolean and an int, respectively.  This is fine: because of how the import purifier works, booleans are converted to strings anyway upon import, and since message_id is already stored as an int in the dataset, our quoted int will be converted into an integer upon import.
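
For example, a json_load step that calculates the token from just the username, timestamp, and event name might look like the line below. The argument format shown is an assumption based on the pattern the other steps use, so check Transformer Library Steps and Examples before relying on it:

["json_load", {"unique_columns": ["username", "timestamp", "event_name"]}],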

Final Step - Apply Transformer Configuration to Pipeline

Now that we're happy with the results of our transformer configuration, it's time to apply it to our import pipeline so that every file we import is transformed.  If you're creating a new pipeline, specify the transformer configuration as follows:

/opt/interana/backend/import_server/setup_pipeline.py -t cooltech_events -p mobile_events --transformers ~/config

Note that you can also apply the "config" configuration to an existing pipeline "mobile_events" for table "cooltech_events" using the same command above.

The transformer configuration will be stored in the i|a configuration database in the data_sources.transformer table, if you ever need to retrieve it.

At this point, we're done!  I can now create an import job with the updated pipeline and enjoy both mobile and desktop events in the same table, integrated despite different formats for seamless querying.

What's Next

Get to analyzing that freshly transformed data!
