
Transformer Library Steps and Examples

John Drahos

Introduction

This page details the transformations available in the Interana Transformer Library. For information on how to put these to use, please read the HowTo article here: https://docs.interania.com/HowTos/UseTransformerLibrary


Steps: Before Data Load

gunzip

Decompress gzip files.

Example: Decompress test.json.gz

Config:
[
["gunzip"],
["decode"]
]

In:
test.json.gz

Out:
{"a": 1, "b": 2}
{"c": 3, "d": 4}
{"e": 1, "f": 2}

bunzip2

Decompress bz2 files.

Example: Decompress test.json.bz2

Config:
[
["bunzip2"],
["decode"]
]

In:
test.json.bz2

Out:
{"a": 1, "b": 2}
{"c": 3, "d": 4}
{"e": 1, "f": 2}

unpack_json_array

Split an array of objects into individual objects.
This is useful when event logs contain multiple events in a single JSON array.

Example: Unpack 3 JSON objects within a single JSON array into line-separated objects

Config:
[
["unpack_json_array"],
["decode"]
]

In:
[{"a": 1, "b": 2}, {"c": 3, "d": 4}, {"e": 1, "f": 2}]

Out:
{"a": 1, "b": 2}
{"c": 3, "d": 4}
{"e": 1, "f": 2}
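The behavior can be sketched in a few lines of Python (a simplified illustration, not the library's actual implementation):

```python
import json

def unpack_json_array(raw_line):
    # parse the array, then re-serialize each element as its own JSON line
    return [json.dumps(obj) for obj in json.loads(raw_line)]
```

Each element of the input array becomes one output line, ready for per-event steps downstream.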

sample

Take a random sample of each input file.

Args:
sample_size (int): Number of lines to sample.
random_seed (int): Seed for random number generator. Used to sample same random events across multiple runs.
        
Example: Sample 2 lines from 3 total

Config:
[
["sample", {"sample_size": 2}],
["decode"]
]

In:
{"a": 1, "b": 1}
{"a": 2, "b": 2}
{"a": 3, "b": 3}

Out:
{"a": 3, "b": 3}
{"a": 2, "b": 2}
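Conceptually, this step behaves like Python's seeded random sampling. A sketch (not the actual implementation, which streams files rather than holding all lines in memory):

```python
import random

def sample(lines, sample_size, random_seed=None):
    # the same random_seed yields the same sample across runs
    rng = random.Random(random_seed)
    return rng.sample(lines, sample_size)
```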

head

Read beginning of each input file.

Args:
sample_size (int): Number of lines to read from beginning of each file.

Example: Take first 2 lines of input files

Config:
[
["head", {"sample_size": 2}],
["decode"]
]

In:
{"a": 1, "b": 1}
{"a": 2, "b": 2}
{"a": 3, "b": 3}

Out:
{"a": 1, "b": 1}
{"a": 2, "b": 2}

tail

Read end of each input file.

Args:
sample_size (int): Number of lines to read from the end of each file.

Example: Take last 2 lines of input files

Config:
[
["tail", {"sample_size": 2}],
["decode"]
]

In:
{"a": 1, "b": 1}
{"a": 2, "b": 2}
{"a": 3, "b": 3}

Out:
{"a": 2, "b": 2}
{"a": 3, "b": 3}

regex_replace_line

Perform regular expression replacement on the line.
Replacement string uses Python-style backreferences.

Args:
regex (str): Regular expression to match.
repl_string (str): Replacement string to substitute.

Example: Fix unquoted json key (user) before json load step

Config:
[
["decode"],
["regex_replace_line", {"regex": "user:", "repl_string": "\"user\":"}]
]

In:
{user: "u1", "trans": "x1", "val": 12}
{user: "u1", "trans": "x2", "val": 9}
{user: "u2", "trans": "x1", "val": 21}

Out:
{"user": "u1", "trans": "x1", "val": 12}
{"user": "u1", "trans": "x2", "val": 9}
{"user": "u2", "trans": "x1", "val": 21}
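This is, in effect, ordinary Python `re.sub`, so standard Python regex and backreference syntax applies (a sketch):

```python
import re

def regex_replace_line(line, regex, repl_string):
    # repl_string may contain Python-style backreferences like \1
    return re.sub(regex, repl_string, line)
```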

decode

Decode file using specified character encoding.
`encoding` defaults to "utf-8"

Args:
encoding (Optional[str]): Character encoding to use.

Example: Decode file

Config:
[
["decode"]
]

Load Data into Python Dictionary

json_load

Deserialize JSON and calculate a unique token that Interana uses to deduplicate lines.

Args:
unique_columns (Optional[list]): List of columns to use to generate unique token. If not specified, original line will be used.

Example: Specify columns user and trans to generate unique token

Config:
[
["decode"],
["json_load", {"unique_columns": ["user", "trans"]}],
["json_dump"]
]

In:
{"user": "u1", "trans": "x1", "val": 12}
{"user": "u1", "trans": "x2", "val": 9}
{"user": "u2", "trans": "x1", "val": 21}

Out:
{"user":"u1","__ia__unique__token__":6073597307750309666,"trans":"x1","val":12}
{"user":"u1","__ia__unique__token__":8385417071786987290,"trans":"x2","val":9}
{"user":"u2","__ia__unique__token__":2558069477487871788,"trans":"x1","val":21}
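The exact token algorithm is internal to Interana, but the idea can be sketched: hash the selected columns (or the whole line) into an integer so that two lines agreeing on `unique_columns` receive the same token. The sha1-based digest below is purely illustrative:

```python
import hashlib
import json

def unique_token(event, unique_columns=None):
    # illustrative only: the real token function is internal to Interana
    if unique_columns:
        basis = "|".join(str(event[c]) for c in unique_columns)
    else:
        basis = json.dumps(event, sort_keys=True)
    return int(hashlib.sha1(basis.encode("utf-8")).hexdigest()[:15], 16)
```

Note that with `unique_columns` set, two events that differ only in other columns hash to the same token and would be deduplicated.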

csv_load

Load CSV data into dictionaries.

By default, csv_load assumes the first line of the file is a primary header row specifying field names. If the CSV file does not contain a header row, or you wish to override the existing header, you can configure a custom "primary header" row.  

In the case where one or more CSV lines do not match the number of fields in the primary header, csv_load will use a best-effort algorithm as follows.  If a row of the CSV file has fewer fields than the header, csv_load will fill as many fields as possible in order. If a row of the CSV file has more fields than the header, csv_load will generate extra header names like "field_N", "field_N+1" starting with the number of fields already present in the header.

If you desire even more control over handling non-standard CSV lines, it is possible to specify additional header rows (each with a distinct number of fields), in which case csv_load will attempt to find a header with exactly the same number of fields as the current line and use that one. If neither the primary header nor any alternate header matches the number of fields in the current line, csv_load will use the "primary header" and fall back to its best-effort algorithm.

Args:
unique_columns (Optional[list]): List of columns to use to generate unique token. If not specified, original line will be used.
delimiter (Optional[str]): Character to use as delimiter, e.g. a comma or a tab.
headers (Optional[list]): List of list of headers. Headers are chosen by number of elements in each CSV line.
header_row (Optional[str]): One of 'use', 'skip', or by default 'no_header'. If 'use', the header in the file is used if no `headers` match. If no `primary_header` is set, the file header is used as the primary.
primary_header (Optional[list]): Headers to use if no other headers match.
doublequote (Optional[bool]): Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.
escapechar (Optional[str]): Removes any special meaning from the following character. It defaults to None, which disables escaping.
quotechar (Optional[str]): A one-character string used to quote fields containing special characters, such as the delimiter or quotechar, or which contain new-line characters. It defaults to '"'.
skipinitialspace (Optional[bool]): When True, whitespace immediately following the delimiter is ignored. The default is False.

Example: Load CSV data into dictionary, dump into json

Config:
[
["decode"],
["csv_load"],
["json_dump"]
]

In:
User,Event,Value
John,Login,0
Todd,Logout,1

Out:
{"Event":"Login","Value":"0","User":"John","__ia__unique__token__":6739147714641650777}
{"Event":"Logout","Value":"1","User":"Todd","__ia__unique__token__":311502209663632137}
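The header-matching rules above can be sketched as follows (a simplified model of the best-effort algorithm; token generation and the header_row modes are omitted):

```python
import csv
import io

def csv_load(text, headers=None):
    rows = list(csv.reader(io.StringIO(text)))
    primary, alternates = rows[0], (headers or [])
    out = []
    for row in rows[1:]:
        # prefer an alternate header with exactly the same field count
        header = next((h for h in alternates if len(h) == len(row)), primary)
        if len(row) > len(header):
            # extra fields get generated names field_N, field_N+1, ...
            header = header + ["field_%d" % i for i in range(len(header), len(row))]
        out.append(dict(zip(header, row)))  # short rows fill what they can, in order
    return out
```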

After Data Load (Dictionary-to-Dictionary Transformations)

hash

Calculate a non-cryptographic MurmurHash3 hash of `column`.

Args:
column (str): Column to hash.

Example: Hash username

Config:
[
["decode"],
["json_load"],
["hash", {"column": "username"}],
["json_dump"]
]

In:
{"username": "BenHogan", "event": "putt", "dist": 3}
{"username": "ArnoldPalmer", "event": "chip", "dist": 15}
{"username": "JackNicklaus", "event": "drive", "dist": 267}

Out:
{"username":8545539730387850342,"dist":3,"__ia__unique__token__":4345827873645831930,"event":"putt"}
{"username":4800646705760021970,"dist":15,"__ia__unique__token__":7955354759641172706,"event":"chip"}
{"username":1304984085549885728,"dist":267,"__ia__unique__token__":3709088310454520588,"event":"drive"}

anonymize

Calculate a 52-bit SHA-1 hash of `column`.

Args:
column (str): Column to anonymize.

Example: Anonymize username

Config:
[
["decode"],
["json_load"],
["anonymize", {"column": "username"}],
["json_dump"]
]

In:
{"username": "BenHogan", "event": "putt", "dist": 3}
{"username": "ArnoldPalmer", "event": "chip", "dist": 15}
{"username": "JackNicklaus", "event": "drive", "dist": 267}

Out:
{"event":"putt","username":2528885808796705,"dist":3,"__ia__unique__token__":4345827873645831930}
{"event":"chip","username":1922980560516075,"dist":15,"__ia__unique__token__":7955354759641172706}
{"event":"drive","username":1542261379125033,"dist":267,"__ia__unique__token__":3709088310454520588}

lowercase

Convert all or some column names to lowercase strings.
If a name in `columns` is not present, a lowercased column name will be created and set to null.
Not to be confused with lowercase_value().

Args:
columns (Optional[list]): Columns to be renamed lowercase. Default is all columns.

Example: Change uppercase characters in "USERNAME" to lowercase

Config:
[
["decode"],
["json_load"],
["lowercase"],
["json_dump"]
]

In:
{"USERNAME": "BenHogan", "event": "putt", "dist": 3}
{"USERNAME": "ArnoldPalmer", "event": "chip", "dist": 15}
{"USERNAME": "JackNicklaus", "event": "drive", "dist": 267}

Out:
{"dist":3,"username":"BenHogan","__ia__unique__token__":8497066893068604652,"event":"putt"}
{"dist":15,"username":"ArnoldPalmer","__ia__unique__token__":587546423006921598,"event":"chip"}
{"dist":267,"username":"JackNicklaus","__ia__unique__token__":5859163929152062153,"event":"drive"}

lowercase_value

Convert `column` to a lowercase string.

Args:
column (str): Column that contains values to be lowercased.

Example: Convert values of 'text' field to lowercase

Config:
[
["decode"],
["json_load"],
["lowercase_value", {"column": "text"}],
["json_dump"]
]

In:
{"user": "Sam", "text": "Wet Paint"}
{"user": "John", "text": "STOP"}
{"user": "Elyse", "text": "Do Not Enter"}

Out:
{"__ia__unique__token__":2233568130467114876,"text":"wet paint","user":"Sam"}
{"__ia__unique__token__":4048437015924438854,"text":"stop","user":"John"}
{"__ia__unique__token__":5175482217519506709,"text":"do not enter","user":"Elyse"}

add_label

Set `column` to the value `label` for every event.

Args:
column (str): Name of column to put label in.
label (str): Value to put in column.

Example: Add "category":"electronics" to every event

Config:
[
["decode"],
["json_load"],
["add_label", {"column": "category", "label": "electronics"}],
["json_dump"]
]

In:
{"user": "Sam", "amt": 1}
{"user": "John", "amt": 2}
{"user": "Elyse", "amt": 3}

Out:
{"__ia__unique__token__":6792123659669767925,"category":"electronics","user":"Sam","amt":1}
{"__ia__unique__token__":371460198962001743,"category":"electronics","user":"John","amt":2}
{"__ia__unique__token__":2941371116376991439,"category":"electronics","user":"Elyse","amt":3}

deep_copy

Make a complete copy of a column.

Args:
column (str): Column name to copy from.
new_name (str): Column name to copy to.

Example: Copy "user" values to new column "usercopy"

Config:
[
["decode"],
["json_load"],
["deep_copy", {"column": "user", "new_name": "usercopy"}],
["json_dump"]
]

In:
{"user": "Sam", "amt": 1}
{"user": "John", "amt": 2}
{"user": "Elyse", "amt": 3}

Out:
{"amt":1,"user":"Sam","__ia__unique__token__":6792123659669767925,"usercopy":"Sam"}
{"amt":2,"user":"John","__ia__unique__token__":371460198962001743,"usercopy":"John"}
{"amt":3,"user":"Elyse","__ia__unique__token__":2941371116376991439,"usercopy":"Elyse"}

rename

Rename a column.

Args:
column (str): Column to rename.
new_name (str): New name to give column.

Example: Rename column "blah" to "event_type"

Config:
[
["decode"],
["json_load"],
["rename", {"column": "blah", "new_name": "event_type"}],
["json_dump"]
]

In:
{"user": "Sam", "amt": 1, "blah": "blah"}
{"user": "John", "amt": 2, "blah": "blah2"}
{"user": "Elyse", "amt": 3, "blah": "blah3"}

Out:
{"amt":1,"event_type":"blah","user":"Sam","__ia__unique__token__":3893872012624762980}
{"amt":2,"event_type":"blah2","user":"John","__ia__unique__token__":4673211942423344710}
{"amt":3,"event_type":"blah3","user":"Elyse","__ia__unique__token__":4791908289683167877}

omit

Omit specified columns.

Args:
columns (list): List of column names to omit.

Example: Omit column "blah" from events

Config:
[
["decode"],
["json_load"],
["omit", {"columns": ["blah"]}],
["json_dump"]
]

In:
{"user": "Sam", "amt": 1, "blah": "blah"}
{"user": "John", "amt": 2, "blah": "blah2"}
{"user": "Elyse", "amt": 3, "blah": "blah3"}

Out:
{"user":"Sam","__ia__unique__token__":3893872012624762980,"amt":1}
{"user":"John","__ia__unique__token__":4673211942423344710,"amt":2}
{"user":"Elyse","__ia__unique__token__":4791908289683167877,"amt":3}

add_filename

Set `column` to the name of the file the event is from.

Args:
column (str): Column to put file name in.

Example: Add filename "in.json" to events

Config:
[
["decode"],
["json_load"],
["add_filename", {"column": "filename"}],
["json_dump"]
]

In (in.json):
{"user": "Sam", "amt": 1}
{"user": "John", "amt": 2}
{"user": "Elyse", "amt": 3}

Out:
{"__ia__unique__token__":6792123659669767925,"filename":"in.json","user":"Sam","amt":1}
{"__ia__unique__token__":371460198962001743,"filename":"in.json","user":"John","amt":2}
{"__ia__unique__token__":2941371116376991439,"filename":"in.json","user":"Elyse","amt":3}

add_file_date

Set `column` to the datestamp of the file the event is from.
When specifying input files as command-line arguments, this step requires passing the --file-date flag in '%Y-%m-%dT%H:%M:%S.%fZ' format.

Args:
column (str): Column to put file date in.

Example: Add file date to events

Config:
[
["decode"],
["json_load"],
["add_file_date", {"column": "filedate"}],
["json_dump"]
]

In:
{"user": "Sam", "amt": 1}
{"user": "John", "amt": 2}
{"user": "Elyse", "amt": 3}

Out:
{"amt":1,"filedate":"2016-05-05T12:12:12.123Z","user":"Sam","__ia__unique__token__":6792123659669767925}
{"amt":2,"filedate":"2016-05-05T12:12:12.123Z","user":"John","__ia__unique__token__":371460198962001743}
{"amt":3,"filedate":"2016-05-05T12:12:12.123Z","user":"Elyse","__ia__unique__token__":2941371116376991439}

flatten_dict

Flatten a nested dictionary by joining successive keys with '.'.

Args:
columns (Optional[list]): List of column names to flatten. Defaults to all columns.
depth_limit (Optional[int]): Recursion limit. Defaults to no limit.

Example: Flatten "user" column 1 level

Config:
[
["decode"],
["json_load"],
["flatten_dict", {"columns": ["user"], "depth_limit": 1}],
["json_dump"]
]

In:
{"user": {"name": "John", "info": {"phone": 5551234, "city": "Redwood City"}}, "event": "login"}

Out:
{"user.name":"John","event":"login","__ia__unique__token__":4975147593049633177,"user.info":{"phone":5551234,"city":"Redwood City"}}
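The recursion can be sketched as follows (simplified: this version flattens every column rather than honoring a `columns` list):

```python
def flatten_dict(event, depth_limit=None, _prefix=""):
    flat = {}
    for key, value in event.items():
        name = _prefix + key
        if isinstance(value, dict) and (depth_limit is None or depth_limit > 0):
            remaining = None if depth_limit is None else depth_limit - 1
            flat.update(flatten_dict(value, remaining, name + "."))
        else:
            flat[name] = value  # depth limit reached, or value is not a dict
    return flat
```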

shred_array

Change array of objects into an object of arrays.

Args:
column (str): Column containing the array of objects to shred.

Example: Shred array "exp"

Config:
[
["decode"],
["json_load"],
["shred_array", {"column": "exp"}],
["json_dump"]
]

In:
{"user": "John", "exp": [{"var": "a", "val": "1"}, {"var": "b", "val": "4"}]}
{"user": "Fred", "exp": [{"var": "b", "val": "5"}, {"var": "c", "val": "4"}]}

Out:
{"__ia__unique__token__":7370096177795160827,"user":"John","exp":{"val":["1","4"],"var":["a","b"]}}
{"__ia__unique__token__":2876036297403557858,"user":"Fred","exp":{"val":["5","4"],"var":["b","c"]}}
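Shredding can be sketched as a transpose of a list of dicts into a dict of lists:

```python
def shred_array(records):
    # collect every field name, then gather each field's values in order
    keys = {key for record in records for key in record}
    return {key: [record.get(key) for record in records] for key in keys}
```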

split_array

Split a column containing an array into multiple columns.

Args:
column (str): Split this column into new columns if it is an array.
output_columns (Optional[list]): Column names to use. Defaults to `column` suffixed with '.N'.

Example: Split values in array "exp" into 3 new columns

Config:
[
["decode"],
["json_load"],
["split_array", {"column": "exp", "output_columns": ["animal1", "animal2", "animal3"]}],
["json_dump"]
]

In:
{"user": "John", "exp": ["cat","dog","hamster"]}
{"user": "Fred", "exp": ["lion","tiger","bear"]}

Out:
{"exp":["cat","dog","hamster"],"animal1":"cat","animal3":"hamster","user":"John","__ia__unique__token__":2032956903660321772,"animal2":"dog"}
{"exp":["lion","tiger","bear"],"animal1":"lion","animal3":"bear","user":"Fred","__ia__unique__token__":4045256004748629181,"animal2":"tiger"}

convert_to_array

Split `column` on `separator` into a list of values in `output_column`.

Args:
column (str): Column to split.
separator (str): String to split on, e.g. ', '.
output_column (Optional[str]): Column to put the resulting array in. Defaults to `column`.

Example: Split "blurb" into an array containing individual words

Config:
[
["decode"],
["json_load"],
["convert_to_array", {"column": "blurb", "separator": " "}],
["json_dump"]
]

In:
{"user": "John", "blurb": "I like trains"}
{"user": "Sam", "blurb": "I hate pizza"}
{"user": "Dave", "blurb": "I like pizza"}

Out:
{"__ia__unique__token__":8715638442104830642,"user":"John","blurb":["I","like","trains"]}
{"__ia__unique__token__":3445418821193718572,"user":"Sam","blurb":["I","hate","pizza"]}
{"__ia__unique__token__":7497843545051763820,"user":"Dave","blurb":["I","like","pizza"]}

keep

Keep only the specified columns, dropping all others.
Does not drop the Interana deduplication token.

Args:
columns (list): List of column names to keep.

Example: Keep only "user" and "event" columns

Config:
[
["decode"],
["json_load"],
["keep", {"columns": ["user", "event"]}],
["json_dump"]
]

In:
{"user": "Sam", "event": "login", "blah": "blah"}
{"user": "John", "event": "expand", "blah": "blah2", "junk": 1}
{"user": "Elyse", "event": "sendmsg", "blah": "blah3"}

Out:
{"event":"login","user":"Sam","__ia__unique__token__":8468468302936289505}
{"event":"expand","user":"John","__ia__unique__token__":2346607208574296553}
{"event":"sendmsg","user":"Elyse","__ia__unique__token__":7644083907636303077}

whitelist_string_values

Remove values from a string column not in the given list.

Args:
column (str): Column to operate on.
whitelist (list): List of allowed string values.

Example: Allow only accepted values in 'source' column

Config:
[
["decode"],
["json_load"],
["whitelist_string_values", {"whitelist": ["frontend", "backend"], "column": "source"}],
["json_dump"]
]

In:
{"user": "Sam", "event": "click", "source": "frontend"}
{"user": "John", "event": "purchase", "source": "fronted"}
{"user": "John", "event": "load", "source": "backend"}
{"user": "Elyse", "event": "flush", "source": "backeend"}

Out:
{"event":"click","user":"Sam","__ia__unique__token__":6897873284472877507,"source":"frontend"}
{"event":"purchase","user":"John","__ia__unique__token__":5406739099689257143,"source":null}
{"event":"load","user":"John","__ia__unique__token__":7635827136580490088,"source":"backend"}
{"event":"flush","user":"Elyse","__ia__unique__token__":2880053427446368098,"source":null}

merge_keys

Merge `column_1` and `column_2` into `output_column` using sha1.
Combine two mutually exclusive shard key columns into one, allowing them to share a table copy.
If column_1 is None and column_2 is not None, output_column will be set to digest of column_2 with LSB set to 0.
If column_2 is None and column_1 is not None, output_column will be set to digest of column_1 with LSB set to 1.
If both are set, or both are None, output_column will be set to None.

Args:
column_1 (str): Column to merge, setting the LSB of the digest to 1.
column_2 (str): Column to merge, setting the LSB of the digest to 0.
output_column (str): Column to put result in.

Example: Merge columns "web1" and "web2" into column "merge"
            
Config:
[
["decode", {"encoding": "utf8"}],
["json_load"],
["merge_keys", {"column_1": "web1", "column_2": "web2", "output_column": "merge"}],
["json_dump"]
]

In:
{"web1": null, "web2": "w1", "val": 12}
{"web1": "u1", "web2": null, "val": 9}
{"web1": "u2", "web2": "w1", "val": 10}

Out:
{"val":12,"__ia__unique__token__":7297373526901799213,"merge":724008111296322,"web2":"w1","web1":null}
{"val":9,"__ia__unique__token__":7187483173479303532,"merge":3481771430954869,"web2":null,"web1":"u1"}
{"val":10,"__ia__unique__token__":2818846474846197375,"merge":null,"web2":"w1","web1":"u2"}
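The LSB rule makes the two key spaces disjoint: merged values derived from `column_1` are always odd, and values derived from `column_2` always even, so the keys never collide. A sketch of the logic (the 52-bit sha1 digest here is an assumption standing in for the real internal digest):

```python
import hashlib

def merge_keys(column_1, column_2):
    # illustrative digest: low 52 bits of sha1 (the real digest is internal)
    def digest(value):
        return int(hashlib.sha1(str(value).encode()).hexdigest(), 16) & ((1 << 52) - 1)
    if column_1 is None and column_2 is not None:
        return digest(column_2) & ~1  # force LSB to 0
    if column_2 is None and column_1 is not None:
        return digest(column_1) | 1   # force LSB to 1
    return None  # both set, or both missing
```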

join

Join `columns` together with `separator` between each.
Null columns are skipped.

Args:
columns (list): List of columns names to join together.
output_column (str): Column to put resulting string.
separator (Optional[str]): String to put between each column. Defaults to the empty string.

Example: Join first and last name into fullname column

Config:
[
["decode"],
["json_load"],
["join", {"columns": ["first", "last"], "output_column": "fullname", "separator": " "}],
["json_dump"]
]

In:
{"first": "Sam", "last": "Ryan"}
{"first": "John", "last": "Johnson"}
{"first": "Elyse", "last": "Jackson"}

Out:
{"__ia__unique__token__":6631810923594382537,"fullname":"Sam Ryan","first":"Sam","last":"Ryan"}
{"__ia__unique__token__":520329767215318629,"fullname":"John Johnson","first":"John","last":"Johnson"}
{"__ia__unique__token__":3271708847906864678,"fullname":"Elyse Jackson","first":"Elyse","last":"Jackson"}

multiply

Perform floating point multiplication.
If `multiplier` is a string, treat it as a column name.
If `multiplier` column is null, result will be null.

Args:
column (str): Column to use as multiplicand.
multiplier (str or number): Numeric value, or name of a column to use as multiplier.
output_column (Optional[str]): Column name to put output in, defaults to `column`.

Example: Multiply 'price' column by 'quantity' column into 'total'

Config:
[
["decode"],
["json_load"],
["multiply", {"column": "price", "multiplier": "quantity", "output_column": "total"}],
["json_dump"]
]

In:
{"first": "Sam", "event": "purchase", "price": 2.13, "quantity": 4}
{"first": "John", "event": "purchase", "price": 5, "quantity": 2}
{"first": "Elyse", "event": "purchase", "price": 8.56, "quantity": 12}

Out:
{"event":"purchase","quantity":4,"price":2.13,"first":"Sam","total":8.52,"__ia__unique__token__":2308855450660799997}
{"event":"purchase","quantity":2,"price":5,"first":"John","total":10.0,"__ia__unique__token__":2479674756660365726}
{"event":"purchase","quantity":12,"price":8.56,"first":"Elyse","total":102.72,"__ia__unique__token__":5658422012269400335}

divide

Perform floating point division.
If `divisor` is a string, treat it as a column name.
If `divisor` column is null, result will be null.

Args:
column (str): Column to use as dividend.
divisor (str or number): Numeric value, or name of a column to use as divisor.
output_column (Optional[str]): Column name to put output in, defaults to `column`.

Example: Divide 'CPM' column by one million

Config:
[
["decode"],
["json_load"],
["divide", {"column": "CPM", "divisor": 1000000, "output_column": "actual_cost"}],
["json_dump"]
]

In:
{"first": "Sam", "event": "viewad", "CPM": 2385}
{"first": "John", "event": "viewad", "CPM": 4729}
{"first": "Elyse", "event": "viewad", "CPM": 1122}

Out:
{"event":"viewad","actual_cost":0.002385,"first":"Sam","CPM":2385,"__ia__unique__token__":2322298538924664253}
{"event":"viewad","actual_cost":0.004729,"first":"John","CPM":4729,"__ia__unique__token__":3128606018585001222}
{"event":"viewad","actual_cost":0.001122,"first":"Elyse","CPM":1122,"__ia__unique__token__":8493916172843408209}

remove_punctuation

Remove all punctuation from a column.
Assumes unicode characters.

Args:
column (str): The column to remove punctuation from.
output_column (Optional[str]): Column to put result in. Defaults to `column`.

Example: Strip punctuation from "message"

Config:
[
["decode"],
["json_load"],
["remove_punctuation", {"column": "message"}],
["json_dump"]
]

In:
{"user": "Punctuation", "message": "!@#%&*()_-{}[]:;,.?"}
{"user": "NotPunctuation", "message": "$^+=<>"}
{"user": "Combined", "message": "!@#%&*()_-{}[]:;,.?$^+=<>"}

Out:
{"message":"","user":"Punctuation","__ia__unique__token__":1267513633946236610}
{"message":"$^+=<>","user":"NotPunctuation","__ia__unique__token__":1684850124204466571}
{"message":"$^+=<>","user":"Combined","__ia__unique__token__":4132463954692381155}
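The example above hinges on the Unicode definition of punctuation: characters in the "P" general categories are removed, while symbols like $ ^ + = < > belong to the "S" categories and survive. A sketch of that distinction, assuming the step follows Unicode categories:

```python
import unicodedata

def remove_punctuation(text):
    # drop characters whose Unicode category starts with "P" (punctuation);
    # currency/math symbols are category "S" and are kept
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
```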

time_convert

Reformat time column, or do nothing if none of the read formats match.
Uses Python-style time format strings (https://docs.python.org/3.4/library/datetime.html), or one of "epoch_seconds", "epoch_millis", "epoch_micros".
If testing on the command line, the max_time_ago and max_time_hence args require passing the --file-date flag.

Args:
column (str): The column to perform time conversion on.
output_column (Optional[str]): The column to put result in.
read_formats (list): List of datetime format strings to try in order.
write_format (Optional[str]): Datetime format string for output. Defaults to "%Y-%m-%dT%H:%M:%S.%fZ"
max_time_ago (Optional[str]): Events older than the file date minus `max_time_ago` hours will be silently dropped.
max_time_hence (Optional[str]): Events more recent than the file date plus `max_time_hence` hours will be silently dropped.
max_abs_time (Optional[str]): Events stamped after this ISO-8601 timestamp will be silently dropped.
min_abs_time (Optional[str]): Events stamped before this ISO-8601 timestamp will be silently dropped.
    
Example: Convert a variety of time formats into %Y-%m-%d %H:%M:%S

Config:
[
["decode"],
["json_load"],
["time_convert", {"read_formats": ["%A, %B %d %Y", "%m/%d/%Y", "%b %d, %Y", "%Y-%m-%d %H:%M:%S"], "write_format": "%Y-%m-%d %H:%M:%S", "column": "ts"}],
["json_dump"]
]

In:
{"ts": "Tuesday, March 22 2016"}
{"ts": "3/22/2016"}
{"ts": "Mar 22, 2016"}
{"ts": "2016-03-22 12:43:30"}

Out:
{"__ia__unique__token__":1033586384520328482,"ts":"2016-03-22 00:00:00"}
{"__ia__unique__token__":6737880881032506082,"ts":"2016-03-22 00:00:00"}
{"__ia__unique__token__":803264224078215379,"ts":"2016-03-22 00:00:00"}
{"__ia__unique__token__":2708765703165006780,"ts":"2016-03-22 12:43:30"}
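The read_formats fallback can be sketched as follows (simplified: the epoch formats and the time-window dropping args are omitted):

```python
from datetime import datetime

def time_convert(value, read_formats, write_format="%Y-%m-%dT%H:%M:%S.%fZ"):
    for fmt in read_formats:
        try:
            return datetime.strptime(value, fmt).strftime(write_format)
        except ValueError:
            continue  # try the next read format
    return value  # nothing matched: leave the value untouched
```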

url_parse

Parse url into dictionary.

Args:
column (str): Column containing url to be parsed
default_scheme (Optional[str]): Scheme to prepend if the url has no scheme. Defaults to 'http'.

Example: Parse 'referrer' url into dictionary

Config:
[
["decode"],
["json_load"],
["url_parse", {"column": "referrer", "default_scheme": "https"}],
["json_dump"]
]

In:
{"user": "John", "referrer": "john.interana.co.uk:80/videoplay?docid=-7246927612831078230&hl=en&view=dark&loggedin=yes#00h02m30s"}

Out:
{"__ia__unique__token__":7912114277290934867,"referrer":{"fragment":"00h02m30s","query":"docid=-7246927612831078230&hl=en&view=dark&loggedin=yes","scheme":"https","path":"/videoplay","params":"","port":80,"hostname":"john.interana.co.uk"},"user":"John"}

url_query_string_parse

Parse a url query string of the form 'key1=value1&key2=value2'.

Args:
column (str): Column containing the query string to parse.
output_column (Optional[str]): Column to put result in. Defaults to `column`.
keep_parameters (list): Only keep parameters in the query string from this list.

Example: Further parse 'referrer.query' into selected parameters

Config:
[
["decode"],
["json_load"],
["url_parse", {"column": "referrer", "default_scheme": "https"}],
["flatten_dict", {"columns": ["referrer"]}],
["url_query_string_parse", {"column": "referrer.query", "keep_parameters": ["hl", "view", "loggedin"]}],
["json_dump"]
]

In:
{"user": "John", "referrer": "john.interana.co.uk:80/videoplay?docid=-7246927612831078230&hl=en&view=dark&loggedin=yes#00h02m30s"}

Out:
{"user":"John","__ia__unique__token__":7912114277290934867,"referrer.port":80,"referrer.params":"","referrer.path":"/videoplay","referrer.fragment":"00h02m30s","referrer.scheme":"https","referrer.hostname":"john.interana.co.uk","referrer.query":{"hl":"en","view":"dark","loggedin":"yes"}}
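A sketch of the query-string parsing with the keep_parameters filter, using Python's standard urllib:

```python
from urllib.parse import parse_qs

def url_query_string_parse(query, keep_parameters=None):
    # parse_qs returns lists of values; keep the first value of each parameter
    parsed = {k: v[0] for k, v in parse_qs(query).items()}
    if keep_parameters is not None:
        parsed = {k: v for k, v in parsed.items() if k in keep_parameters}
    return parsed
```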

name_value_extract

Extract a list of generic name/value properties into a single object.

Args:
column (str): The list column to extract values from.
name_field (str): Name of field to treat as generic property name.
value_field (str): Name of field to treat as generic property value.

Example: Extract price and category info from item field

Config:
[
["decode"],
["json_load"],
["name_value_extract", {"column": "items", "name_field": "category", "value_field": "price"}],
["json_dump"]
]

In:
{"user": "John", "items": [{"label":"tab.abc","category":"abc","price":17.80},{"label":"tab.f","category":"food","price":41.00}]}

Out:
{"items":{"food":41.0,"abc":17.8},"user":"John","__ia__unique__token__":1918698498990518500}
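The extraction amounts to building one object keyed by the name field (a sketch):

```python
def name_value_extract(items, name_field, value_field):
    # one {name: value} entry per element; other fields (e.g. "label") are dropped
    return {item[name_field]: item[value_field] for item in items}
```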

regex_extract

Extract matching regular expression subgroups into new columns.

Args:
column (str): Name of column to extract from.
regex (str): Regular expression to match with.  Regex must match entire string.
output_columns (list): List of column names to put regex subgroups into in order.

Example: Extract "plan" query parameter from non-url field 'uri'

Config:
[
["decode"],
["json_load"],
["regex_extract", {"column": "uri", "output_columns": ["plan"], "regex": "(?:(?:.*)plan=([^&]*)?(?:.*)|.*)"}],
["json_dump"]
]

In:
{"user": "John", "uri": "/channel/videoplay/?vidid=7246927612831078230&plan=pro&view=dark&loggedin=yes"}

Out:
{"user":"John","uri":"/channel/videoplay/?vidid=7246927612831078230&plan=pro&view=dark&loggedin=yes","__ia__unique__token__":6723934658596588009,"plan":"pro"}

regex_replace

Perform regular expression replacement on `column`.
Replacement string uses Python-style backreferences.

Args:
column (str): Column to perform replacement with.
regex (str): Regular expression to match.
repl_string (str): Replacement string to substitute.
output_column (Optional[str]): Column name to put output in, defaults to `column`.

Example: Use regex_replace to prefix 'uri' field with scheme and host

Config:
[
["decode"],
["json_load"],
["regex_replace", {"column": "uri", "regex": "^(\/)", "repl_string": "https://john.interana.com/"}],
["json_dump"]
]

In:
{"user": "John", "uri": "/channel/videoplay/?vidid=7246927612831078230&plan=pro&view=dark&loggedin=yes"}

Out:
{"user":"John","__ia__unique__token__":6723934658596588009,"uri":"https://john.interana.com/channel/vi...k&loggedin=yes"}

regex_array_extract

For each element in array, extract matching regular expression subgroups into new columns.

Args:
column (str): Name of column to extract from.
regex (str): Regular expression to match with.
output_columns (list): List of column names to put regex subgroups into in order.

Example: Extract categories from list of products.

Config:
[
["decode"],
["json_load"],
["regex_array_extract", {"column": "products", "regex": "^([a-z]*)-.*", "output_columns": ["categories"]}],
["json_dump"]
]

In:
{"user": "John", "products": ["produce-hdy438ed", "cleaning-or944dus", "seafood-pd33sk2q"]}

Out:
{"user":"John","__ia__unique__token__":3040505961674288811,"categories":["produce","cleaning","seafood"],"products":["produce-hdy438ed","cleaning-or944dus","seafood-pd33sk2q"]}

regex_array_replace

Perform regular expression replacement on each element in an array.
Replacement string uses Python-style backreferences.
Do nothing if column is not an array.

Args:
column (str): Column to perform replacement with.
regex (str): Regular expression to match.
repl_string (str): Replacement string to substitute.
output_column (Optional[str]): Column name to put output in, defaults to `column`.

Example: Use regex_array_replace to remove trailing identifiers from items in 'products' list

Config:
[
["decode"],
["json_load"],
["regex_array_replace", {"column": "products", "regex": "-[\S]*", "repl_string": ""}],
["json_dump"]
]

In:
{"user": "John", "products": ["produce-hdy438ed", "cleaning-or944dus", "seafood-pd33sk2q"]}

Out:
{"user":"John","products":["produce","cleaning","seafood"],"__ia__unique__token__":3040505961674288811}

dictionary_replace

Apply dictionary replacement to a column.

Args:
column (str): Column to apply the mapping to.
mapping (dict): Dictionary mapping strings to replacements.

Example: Use dictionary_replace to change product codes into product names

Config:
[
["decode"],
["json_load"],
["dictionary_replace", {"mapping": {"xkdruw67": "mp3player", "skr885jc": "dvdcleaner", "kk284jds": "cartridge"},"column": "product"}],
["json_dump"]
]

In:
{"first": "Sam", "product": "xkdruw67", "event": "purchase"}
{"first": "John", "product": "skr885jc", "event": "purchase"}
{"first": "Elyse", "product": "kk284jds", "event": "purchase"}

Out:
{"__ia__unique__token__":6913206265407409391,"event":"purchase","product":"mp3player","first":"Sam"}
{"__ia__unique__token__":496410646764207962,"event":"purchase","product":"dvdcleaner","first":"John"}
{"__ia__unique__token__":9151368650616356276,"event":"purchase","product":"cartridge","first":"Elyse"}

json_load_column

Deserialize a column containing a JSON encoded string.
This is useful for CSV files containing JSON encoded strings.

Args:
column (str): Column containing the JSON encoded string.

Example: Load value of 'embed' as json

Config:
[
["decode"],
["json_load"],
["json_load_column", {"column": "embed"}],
["json_dump"]
]

In:
{"user": "John", "embed": "{\"action\": \"click\"}"}

Out:
{"__ia__unique__token__":673996973727761320,"embed":{"action":"click"},"user":"John"}

if_then_otherwise

Run steps conditionally.
If `condition` evaluates to True for an event, the `then` list of steps is run; otherwise the `otherwise` list of steps is run, if provided.

Args:
condition (step): Condition to determine which steps to run.
then (list): Steps to use if `condition` is True.
otherwise (Optional[list]): Steps to use if `condition` is False.

Conditions:
regex_match: column (str) pattern (str)
regex_search: column (str) pattern (str)
contains: column (str)
starts: column (str) prefix (str)
ends: column (str) suffix (str)
in_list: column (str) value (str)
not_in_list: column (str) value (str)
equals: column (str) value (str)
not_equals: column (str) value (str)
lt (less than): column (str) value (str)
gt (greater than): column (str) value (str)
lte (less than or equal): column (str) value (str)
gte (greater than or equal): column (str) value (str)
bool_eval: code (str)
and_: conditions (list)
or_: conditions (list)

Example: If the url path indicates a payment, add the 'page' = 'pay' label.

Config:
[
["decode"],
["json_load"],
["if_then_otherwise", {"condition": ["regex_match", {"column": "url.path", "pattern": "^\/pay\/.*"}], "then": [["add_label", {"column": "page", "label": "pay"}]]}],
["json_dump"]
]

In:
{"user": "John", "url.path": "/pay/addcard/"}
{"user": "John", "url.path": "/dashboard/configure/"}

Out:
{"user":"John","__ia__unique__token__":6826495190851080943,"page":"pay","url.path":"/pay/addcard/"}
{"user":"John","__ia__unique__token__":4857886812416484204,"url.path":"/dashboard/configure/"}
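The control flow can be sketched as follows (a minimal model supporting only the regex_match condition and the add_label step from the example above):

```python
import re

def if_then_otherwise(event, condition, then, otherwise=None):
    name, args = condition
    if name == "regex_match":
        matched = re.match(args["pattern"], str(event.get(args["column"], ""))) is not None
    else:
        raise NotImplementedError(name)  # other conditions omitted in this sketch
    # run the chosen branch's steps against the event
    for step_name, step_args in (then if matched else (otherwise or [])):
        if step_name == "add_label":
            event[step_args["column"]] = step_args["label"]
    return event
```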

code_snippet

Write custom Python code to transform each event.

Args:
code (str): Python statements operating on the variable "line".
import_list (Optional[list]): Module names to import for use in this code snippet.

Example: Use code_snippet to copy 'event' column to 'action' column

Config:
[
["decode"],
["json_load"],
["code_snippet", {"code": '''
line['action'] = line['event']
'''}],
["json_dump"]
]

In:
{"user": "John", "event": "click"}
{"user": "John", "event": "purchase"}

Out:
{"user":"John","event":"click","__ia__unique__token__":8820840440868449619,"action":"click"}
{"user":"John","event":"purchase","__ia__unique__token__":1537361993526075803,"action":"purchase"}

After Transformations

json_dump

Serialize a dictionary into a JSON-encoded line.
This is the only dictionary-to-line step.
This should generally be the last step.

Example: Transform Python dictionary to JSON for ingest.

Config:
["json_dump"]

 

 
