Use the Transformer Library to transform data before ingest
This topic refers to several transformer "steps." For a full list of steps, see Transformer library reference.
Interana users often want to transform their data before ingesting it into Interana. There are many reasons for applying a data transformation: you may want to add a category to all events coming from a certain data source, parse out a high-cardinality string to improve usability and string tier performance, or add a file name to assist in troubleshooting logging errors.
You can perform data transformations in more than one place. You may want to transform your data before it ever hits the Interana import server. If you have a sophisticated data pipeline in place at your organization that allows you to transform data, by all means go ahead and use it, as this approach will allow complete control over your transformations as well as a faster import to Interana. You can also perform complex transformations, such as looking up values in a relational data store and supplementing your event data with high-value information not accessible from the Interana import server.
But sometimes this is not possible. Perhaps you don't have the infrastructure to perform complex transformations on your side, or want to apply Interana-specific transformations to data that is used by a variety of analytics tools. In this case, we've got you covered with the Interana Transformer Library. The Interana Transformer Library allows you to transform your data as a part of the data ingest process, in a fully-supported, blazing fast, and best-practice framework.
This howto provides everything you need to know to get started transforming your data with the Interana Transformer Library.
Transformer Library example
For the purposes of this howto, I've constructed a simple example that we can use to understand the Interana Transformer Library application process.
I'll be assuming the role of a data analyst tasked with importing a data set into Interana for my team to use. For the past two months, we've been working with Interana to understand how our users interact with our desktop application. Our engineering team has put a lot of work into a new mobile application, which we're rolling out to a select group of customers next week, and I need to integrate the events from the mobile application into our main events table so that we can understand our mobile usage alongside our desktop usage.
Our desktop events look like this:
{
"username": "jdrahos@cool-tech.com",
"timestamp": "2016-09-17 19:34:12.254",
"event_name": "send_message",
"source": "desktop",
"version": "3.4.2",
"message_id": 38452212,
"is_private": true
}
And the same event, from our mobile application, would look like this:
{
"username": "jdrahos@cool-tech.com",
"client_timestamp": "2016-09-17T19:34:12.254Z",
"event_name": "send_message",
"mobileversion": "0.8.1",
"message_info": "38452212/true"
}
Since we'll be sending our mobile events directly into our S3 bucket, I've decided to use the Interana Transformer Library to make sure that our mobile events integrate seamlessly with our desktop events.
Step 1: Identify the transformations you want to apply
It's important to have a complete understanding of the problems I want to solve with my transformations. Since I want the mobile data to integrate seamlessly with my desktop data, I'll want to make the mobile events as similar as possible to the desktop events, while retaining the ability to separate mobile and desktop usage when I need to.
Given these requirements, here are the transformations I'd like to apply:
- My mobile timestamp field has a different name and format than the desktop timestamp field. Since the time key on our events table is the timestamp column, I'll need to resolve these differences.
- We don't pass a source field in our mobile events. I know that we'll want to use the source field to be able to differentiate between application sources, so I'd like to add that field to our mobile events.
- The desktop event attributes message_id and is_private are contained in one mobile event attribute: message_info. I'll want to separate these two pieces of information into separate columns, just as in our desktop events.
- Since these mobile events are being generated from a new code path, I'll want to be able to easily debug errors in these events. Since we're using a client timestamp for our mobile events, there's no guarantee that we'll be able to track down the file an erroneous event came from in S3. I'll make sure to add the file name that contained the event so that we can diagnose issues faster.
- Our mobile events are gzipped to save storage costs, so I'll need to decompress these files before import.
Now that I understand the transformers required, I can start developing my transformer configuration.
Step 2: Write the transformations
Transformers are implemented on the import server. In this example, we'll write the transformer configuration from scratch. Don't worry, it's not too tough!
An Interana Transformer Library configuration file is composed of a list of transformer "steps." Every file (and every event within that file) passes through these steps on its way to an output file.
I start by constructing a very basic configuration that will decompress the file, decode it, load each JSON event into a Python dictionary (for the actual transformations), and dump the event back into JSON. We do this in a list of four steps, which are lists themselves:
[
["gunzip"],
["decode"],
["json_load"],
["json_dump"],
]
Now that I have that done, it's time to start implementing the actual data transformations. I'll start with the first transformation, the timestamp.
Reformat and rename the timestamp
We'll be using the time_convert step to transform the mobile timestamp format into the desktop timestamp format. We'll be using three time_convert arguments for this transformation:
- read_formats to specify the time formats we want to convert from
- write_format to specify the format we want to convert to
- column to specify the Python dictionary key that contains the value we want to convert (remember, the json_load transformer converts the JSON into a Python dictionary for us to work with)
Putting it all together, our transformer step looks like this:
["time_convert", {"read_formats": ["%Y-%m-%dT%H:%M:%S.%fZ"], "write_format": "%Y-%m-%d %H:%M:%S.%f", "column": "client_timestamp"}],
Now we see why our transformer configuration file is a list of lists: each of the more complex transformer steps is itself a list containing a string indicating the transformer name and a JSON object that contains the transformer parameters.
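The format strings above look like standard Python strptime/strftime directives, so if you want to sanity-check them before running the transformer, a quick standalone check in Python (just an illustration outside of Interana, not part of the transformer configuration) might look like this:
from datetime import datetime

# Parse the mobile (ISO 8601-style) timestamp with the read format...
mobile_ts = datetime.strptime("2016-09-17T19:34:12.254Z", "%Y-%m-%dT%H:%M:%S.%fZ")

# ...and render it in the desktop format with the write format.
print(mobile_ts.strftime("%Y-%m-%d %H:%M:%S.%f"))
# 2016-09-17 19:34:12.254000 (Python pads %f to six digits; the transformer's
# exact sub-second formatting may differ)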
Next, we'll need to rename client_timestamp. We'll use the rename step for this:
["rename", {"column": "client_timestamp", "new_name": "timestamp"}],
A quick note on rename: if you rename a column to an already existing column, you will replace the value of that column with the value of the renamed column.
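Conceptually, that overwrite behaves like this on the Python dictionary that json_load produces (just an illustration of the behavior, not the step's actual implementation):
# Suppose an event somehow arrived with both keys present.
event = {"client_timestamp": "2016-09-17 19:34:12.254", "timestamp": "old value"}

# Renaming client_timestamp to timestamp discards the old value and keeps the renamed one.
event["timestamp"] = event.pop("client_timestamp")
print(event)  # {'timestamp': '2016-09-17 19:34:12.254'}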
Add the source field to mobile events
We'll use the add_label step to add the source key (with the value mobile) to our transformer configuration:
["add_label", {"column": "source", "label": "mobile"}],
Just like rename, if you add_label to an existing column, you'll overwrite the value of the existing column.
Parse out message_info
We'll be using my favorite transformer step, regex_extract, to parse out message_id and is_private from the mobile event field message_info. We'll use the following arguments of regex_extract:
- column to specify what column we want to parse (message_info in our case)
- output_columns to specify the columns that we want to place our regex capturing groups in (message_id and is_private in our case)
- regex to specify the regular expression we wish to use to parse our column ((\d*)/([a-z]*) in our case)
Putting it all together, our transformer step looks like this:
["regex_extract", {"column": "message_info", "output_columns": ["message_id", "is_private"], "regex": "(?:(\d*)/([a-z]*)|.*)"}],
I've surrounded the regex above in a non-capturing group with a general second alternative because I don't want to get warnings in the logs if the regex does not match (if you want to receive warnings, just use the raw regular expression).
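If you want to see what the capturing groups will produce before wiring the step into the pipeline, you can try the pattern with Python's re module (just an illustration; the transformer applies the regex for you):
import re

pattern = re.compile(r"(?:(\d*)/([a-z]*)|.*)")

# A well-formed message_info value: both groups are captured.
print(pattern.match("38452212/true").groups())  # ('38452212', 'true')

# A malformed value falls through to the .* alternative: the overall pattern
# still matches, the groups are None, and no warning is logged.
print(pattern.match("not-what-we-expected").groups())  # (None, None)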
Next, I'll go ahead and omit the message_info column, as it's no longer needed:
["omit", {"columns": ["message_info"]}],
You may notice that the value for the columns key is a list – if we want to, we can specify a list of columns to omit.
Add the filename
Finally, I'll add the source filename to each event. This is done using add_filename:
["add_filename", {"column": "filename"}],
Whether your file is on the local file system or in your favorite cloud storage, like S3 or an Azure blob, the filename will now be added to each event to assist in troubleshooting.
Step 3: Assemble and test the transformations
Now that we've written our transformers, it's time to test them! First, assemble your different transformers between your json_load and json_dump steps:
[
["gunzip"],
["decode"],
["json_load"],
["time_convert", {"read_formats": ["%Y-%m-%dT%H:%M:%S.%fZ"], "write_format": "%Y-%m-%d %H:%M:%S.%f", "column": "client_timestamp"}],
["rename", {"column": "client_timestamp", "new_name": "timestamp"}],
["add_label", {"column": "source", "label": "mobile"}],
["regex_extract", {"column": "message_info", "output_columns": ["message_id", "is_private"], "regex": "(?:(\d*)/([a-z]*)|.*)"}],
["omit", {"columns": ["message_info"]}],
["add_filename", {"column": "filename"}],
["json_dump"],
]
Remember that events pass through the transformer configuration sequentially, so the order is important! For instance, you would not want to omit the field message_info before you extract its contents.
Now it's time to test out your configuration. Go ahead and place some test events in a file on your import node, along with your configuration file. I'll put my test events into the file test.gz and my configuration in config.
Next, run generators.py to test your transformers. Use this command to test config on test.gz and place the output in out:
/opt/interana/backend/import_server/generators.py -i test.gz -o out -c config
You'll see a message reporting the success or failure of your transformer, with specific error messages if something went wrong. For instance, misspelling the column parameter in the rename step, like this:
["rename", {"colum": "client_timestamp", "new_name": "timestamp"}],
results in an error message describing the problem.
Once you see that your transformation has completed successfully, inspect the output and ensure that everything turned out all right.
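For reference, here's roughly what I'd expect one of the transformed mobile events to look like. The filename, the token value, and the exact sub-second formatting below are illustrative, so your output will differ:
{
"username": "jdrahos@cool-tech.com",
"timestamp": "2016-09-17 19:34:12.254000",
"event_name": "send_message",
"source": "mobile",
"mobileversion": "0.8.1",
"message_id": "38452212",
"is_private": "true",
"filename": "mobile_events.json.gz",
"__ia__unique__token__": "<generated hash>"
}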
Transformer notes
Our transformation was successful! A couple of notes:
- __ia__unique__token__ is a unique identifier that we use to perform de-duplication. This field is generated any time you use the json_load step, and by default it is calculated from the entire event contents. If you want to indicate specific columns to calculate this de-duplication token from, you can do so using the optional argument unique_columns (see the sketch after these notes).
- You may have noticed that is_private and message_id are strings, not a boolean and an int, respectively. This is fine; because of how the import purifier works, booleans are converted to strings upon import, and since message_id is currently being stored as an int in the dataset, our quoted int will be converted into an integer upon import.
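For example, if I wanted the de-duplication token to be calculated only from the user, the client timestamp, and the event name, the json_load step might look something like the following. This is a sketch that assumes unique_columns takes a list of column names; see the Transformer library reference for the exact syntax:
["json_load", {"unique_columns": ["username", "client_timestamp", "event_name"]}],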
Step 4: Apply the transformer configuration to the pipeline
Now that we're happy with the results of our transformer configuration, it's time to apply it to our import pipeline so that every file we import is transformed. If creating a new pipeline, specify that you want to apply the transformer configuration as follows:
/opt/interana/backend/import_server/setup_pipeline.py -t cooltech_events -p mobile_events --transformers ~/config
You can also apply the config configuration to an existing pipeline mobile_events for table cooltech_events using the same command above.
The transformer configuration will be stored in the Interana configuration database in the data_sources.transformer table, if you ever need to retrieve it.
At this point, we're done! I can now create an import job with the updated pipeline and have both mobile and desktop events in the same table, integrated despite different formats for seamless querying.
What's next
Now it's time to analyze that freshly transformed data!