Skip to main content
Interania

Derived column examples

1votes
24updates
289views
This applies tov2.23

 

Column calculations

One of the most common ways to use derived columns is to return a calculation that uses one or more column values from your dataset. 

For example, let's say your events describe video plays. You have columns called start_time and stop_time that represent when the user started and stopped playing the video, respectively. You can calculate the elapsed time with this column:

# function return_elapsed_time()
# input column(s): long start_time, long stop_time
# output: long
    
long return_elapsed_time(){
    if (stop_time == NULL || start_time == NULL){
        return -1;
    }
    return (stop_time - start_time);
}

Note that this column checks for null values explicitly; it's always a good idea to do this in order to avoid unexpected results.

You can also handle null values safely by using the get() and getd() functions, which accept a default value to return if the column is null. You should use get(string field_name, long on_null) to access integer columns and getd(string field_name, double on_null) to access decimal columns. There's more info on these functions on the Derived Column Syntax Reference

For example, say you have decimal columns screen_width and screen_length, and you want to return the area of the screen. You can use this column:

# function return_screen_area()
# input column(s): double screen_width, double screen_height
# output: double screen_area

double return_screen_area(){
    double screen_area =  getd("screen_width",0.0)*getd("screen_height",0.0); # note the use of getd()
    return screen_area;
} 

In this case, if screen_width is NULL for an event,getd("screen_width",0.0) returns 0.0, so the column returns 0.0.

Derived columns also let you do modular arithmetic on column values. For example, if you want to bucket user_ids into five groups based on their value mod 5, you could use this derived column:

# function user_id_mod_5()
# input column(s): long user_id
# output: long

long user_id_mod_5(){
    return (user_id%5);
}

Check column values

Maybe you want to check whether an event meets a set of criteria. You can use a derived column to check the value of one or more columns and then return a boolean value. 

For example, let's say you want to know which events come from iPhone users in Japan. You could use this column:

# function is_japan_iphone_event()
# input column(s): string country, string device
# output: long is_japan_iphone

long is_japan_iphone_event(){
    long is_japan_iphone = 0;
    if (match_string(country, "Japan") && match_string(device, "iPhone")){
        is_japan_iphone = 1;
    }
    return is_japan_iphone;
}

You can also use regular expression to check column values. For example, this derived column checks for the en-us locale in a URL:

# function is_en_us()
# input column(s): string url
# output: long is_en_us
long is_en_us(){
    long is_en_us = 0;
    if (match_string(url,".*\/en-us\/.*")){
        is_en_us = 1;
    }
    return is_en_us;
}

Compare columns

You can use derived columns to compare column values to each other. 

For example, say I have integer columns num_likes and num_dislikes. I want to know how they compare to each other. I can check that with this derived column:

# function compare_likes()
# input column(s): long num_likes, long num_dislikes
# output: static_string

static_string compare_likes(){
    # if either column is NULL, use 0 in its place
    long likes_or_zero = get("num_likes", 0);
    long dislikes_or_zero = get("num_dislikes", 0);
    
    if (likes_or_zero > dislikes_or_zero){
        return string_id("more likes");
    }
    else if (dislikes_or_zero > likes_or_zero){
        return string_id("more dislikes");
    }
    else{
        return string_id("equal likes and dislikes");
    }
}

Categorize columns

You can also use a derived column to check the value of one or more columns in your dataset and return some categorical value.

Let's say you have a column called device that can take these values: Android, iPhone, Pixel, iPad, MS Surface. 

For each event, you want to know what type of device it came from. You could use this column:

# function return_device_type()
# input column(s): string device
# output: long device_type

long return_device_type(){
   long device_type = -1;
    if (match_string(device, "iPad") || match_string(device, "MS Surface")){
        device_type = 1;
    }
    else if (match_string(device, "Android") || match_string(device, "iPhone") || match_string(device, "Pixel")){
        device_type = 2;
    }
    return device_type;
} 

Even better: in version 2.22 and later, derived columns can return static strings! This means instead of returning an integer that represents a category, you can return a friendlier string:

# function return_device_type()
# input column(s): string device
# output: static_string device_type

static_string return_device_type(){
    if (match_string(device, "iPad") || match_string(device, "MS Surface")){
        return string_id("Tablet");
    }
    else if (match_string(device, "Android") || match_string(device, "iPhone") || match_string(device, "Pixel")){
        return string_id("Smartphone");
    }
    return string_id("None");
}

Check for null values

You can use derived columns to check whether one or more columns are null for each event (and then do something else with those results).

The way that you check for nulls depends on whether the column is a numeric (i.e. integer or decimal) type, a string type, or a set type.

Checking for nulls in numeric columns

Let's say you have an integer column called rating which represents a rating on a scale of 1-10. You want to run a query that sums all of these ratings. However, if a particular event doesn't have a rating, you want to add 5 to the sum instead of 0 for that event. You can define a column that returns the rating if it exists or 5 if it doesn't:

# function return_rating_or_5()
# input column(s): long rating
# output: long
long return_rating_or_5(){
    if (rating == NULL){
        return 5;
    }
    else{
        return rating;
    }
}

You can also use the get(column_name, default) method instead of using an if statement for null checking:

# function return_rating_or_5()
# input column(s): long rating
# output: long
long return_rating_or_5(){
    return get("rating",5);
}

Checking for nulls in string columns

Let's say you have a string column called username and you wan to find the number of events where username exists minus the number of events where username is null. You can sum this derived column over your dataset:

# function check_if_username_exists()
# input column(s): string username
# output: long
long check_if_username_exists(){
    if (username < 0){
        # username is null
        return -1;
    }
    else{
        return 1;
    }
}

Note the syntax for checking if a string column is null: you should reference the string column by its name, without wrapping it in match_string().

Checking for nulls in set columns

Set columns can't be referenced directly in derived columns, so the null-checking methods for regular numeric and string columns won't work on sets. Instead, you can use the set_size(set_column) method, since checking for a null value in a set column is the same as checking whether the set has a length of 0. You can do that like this:

# function check_if_set_exists()
# input column(s): string_set email_addresses
# output: long
long check_if_set_exists(){
    long size = set_size(email_addresses);
    if (size == 0){
        return -1;
    }
    else{
        return 1;
    }
}

Merge columns

Sometimes it makes sense to merge the values of two separate columns into a single derived column for aggregation purposes.

For example, say I have a column called version, but at a typo somewhere in the pipeline cause several events to be sent with a column called versio instead. I can merge version and versio into a single derived column like this:

# function merge_version()
# input column(s): long version, long versio
# output: long merged_version

long merge_version(){
    long merged_version = -1;
    if (version != NULL){
        merged_version = version;
    }
    else if (versio != NULL){
        merged_version = versio;
    }
    return merged_version;
}

Date/time calculations

You can use derived columns to mine information from timestamps. For example, you can write a derived column that takes in an epoch timestamp and returns the month, day of week, or fiscal year.

Here's a derived column that takes an epoch timestamp in 2017 and returns its month as an integer between 1 and 12:

# function returns_month_2017()
# input column(s): millitime timestamp
# output: long

long returns_month_2017(){
    long newyears = 1483228800000;
    long ms_in_day = 1000*60*60*24;
    long seconds_since_newyears = timestamp - newyears;
    long day = seconds_since_newyears / ms_in_day;
    if (timestamp < newyears) { return 0; }
    if (day < 31){return 1;}
    else if (day < 59){ return 2; }
    else if (day < 90){ return 3; }
    else if (day < 120){ return 4; }
    else if (day < 151){ return 5; }
    else if (day < 181){ return 6; }
    else if (day < 212){ return 7; }
    else if (day < 243){ return 8; }
    else if (day < 273){ return 9; }
    else if (day < 304){ return 10; }
    else if (day < 334){ return 11; }
    else if (day < 365){ return 12; }
    else{ return 0; }
}

This one takes an epoch timestamp and returns its day-of-week in the PST timezone, represented as an integer between 0 and 6:

# function return_day_of_week()
# input column(s): millitime timestamp
# output: long day_of_week
    
long return_day_of_week(){    
    long tz_offset = -8;
    long weekday_offset = 4; # this offset (relative to the epoch start day) sets the week's "day 0"
    long hours =(timestamp/3600000)+tz_offset;
    long days = hours/24;
    long day_of_week = (days+weekday_offset)%7;
    return day_of_week;
} 

And this one takes an epoch timestamp and returns its hour-of-day (represented as an integer between 0 and 23) in the PST timezone:

# function hour_of_day_pst()
# input column(s): millitime timestamp
# output: long

long hour_of_day_pst(){    
    long tz_offset = -8; # PST is offset -8 from UTC
    long hours =(timestamp/3600000)+tz_offset;
    return hours%24;
}

Alternatively, you can use strings to return the hour of the day: 

long return_hour_of_day(){    
    long tz_offset = -8;
    long hours =(__time__/3600000)+tz_offset;
    return hours%24;
}

You can also return the day of the week as a string: 

static_string return_day_of_week(){    
    long tz_offset = -8;
    long weekday_offset = 4; 
    long hours =(__time__/3600000)+tz_offset;
    long days = hours/24;
    long day_of_week = (days+weekday_offset)%7;
    switch(day_of_week){
        default:
            return string_id("NULL");
        case 0:
            return string_id("Sunday");
        case 1:
            return string_id("Monday");
        case 2:
            return string_id("Tuesday");
        case 3:
            return string_id("Wednesday");
        case 4:
            return string_id("Thursday");
        case 5:
            return string_id("Friday");
        case 6:
            return string_id("Saturday");
    }
}

Return constants

There are multiple reasons why you might want to use a derived column to return a constant. 

Dividing metrics by constants

Let's say you have a column called ride_duration that indicates the duration of a ride in milliseconds. You want to find the average ride duration across all rides, but you want to display the average in minutes, not milliseconds (so you need to divide the average by 60000). 

The first step is to create a derived column that returns 60000.

# function return_60000()
# input column(s): N/A
# output: long

long return_60000(){
    return 60000;
}

Next, create a custom metric where the numerator calculates the average ride_duration. In order to divide by 60000, you can find the average, minimum, or maximum (it doesn't matter which!) of your derived column in the denominator of your custom metric. 

Complex Counts

Derived columns provide a lot of flexibility to check multiple columns and write complex logic. This makes them ideal for calculating counts of events that meet complex criteria.

A common use case is to write a derived column that returns 1 (or another constant) if an event meets a complex filter condition and returns 0 otherwise. You can then use this derived column like a saved filter; for example, you could filter to events where the derived column = 1 instead of writing out a long filter set. You could also sum the derived column across all events instead of counting events that meet a complex filter set.

For example, let's say each event has two columns called cat_exists and dog_exists. You want to find the total happiness across all events, where each event has a happiness score defined by:

  • if cat_exists = 0 and dog_exists = 0, happiness = 1
  • if cat_exists = 1 and dog_exists = 1, happiness = 0
  • if cat_exists = 1 and dog_exists = 0, happiness = 2
  • if cat_exists = 0 and dog_exists = 1, happiness = 2

You can do this by summing this derived column across all events:

# function return_happiness()
# input column(s): long cat_exists, long dog_exists
# output: long

long return_happiness(){
    if ((cat_exists == 1) && (dog_exists == 1)){
        return 0;
    }
    if ((cat_exists == 1) ^ (dog_exists == 1)){ # note the use of ^ (not available in advanced filters)
        return 2;
    }
    else{
        return 1;
    }
}    

Finally, one of the coolest use cases is calculating DAU/MAU.

Working with Set Columns

Set columns are handled differently from non-set columns in Interana. For example, if you want to filter on a string column in Explorer, you can choose from many filter operations, including "is one of," "text contains," and "starts with."

string filter options

However, if you want to filter on a set column, you only have two choices of operation: "set contains" and "set excludes."

set column operators

Don't worry, though — you can use derived columns to do complex operations on set columns, using three functions:

  • set_size(set_column)
  • get_item(set_column, index)
  • set_match_string(set_column, pattern)

Check out the derived column reference for more info on these function. 

Checking set size

Say you have a set column that contains any error codes associated with a given event. You want to query the number of errors per request. You can do this by writing a derived column that returns the size of the set column:

# function get_error_count()
# input column(s): string_set error_codes
# output: long

long get_error_count(){
    return set_size(error_codes);
    # note: if error_codes is null, set_size will return 0
}

Accessing particular set indices

Advanced filters allow you to check whether or not a set contains a particular value, but you should use a derived column if you want to check the element at a particular position within a set.

Once again, you have a string_set column called error_codes which lists the error codes associated with an event. You want to retrieve the first error code for each event and check whether it is a server error. You can do this by writing a derived column that uses the get_item() function:

# function first_error_is_server()
# input column(s): string_set error_codes
# output: long

long first_error_is_server(){
    if (match_string(get_item(error_codes,0),r"^5\d{2}$")){
        return 1;
    }
    else{
        return 0;
    }
}

Checking set elements against regular expressions

You can use advanced filters to check whether a set contains an exact string, but derived columns also let you compare a regular expression against all elements of a set. 

Say you are doing some A/B testing. You have multiple experiments running at once, so one event can be considered in multiple experiments. To keep track of events' experiment membership, you use a string set column called "experiments" whose elements indicate both the experiment ID and the experimental group. For example, an event that is in the control group of experiment 1 and the variation 3 group of experiment 5 would have
"experiments": ["experiment1control","experiment1variation3"]

You want to filter to all the events that are in experiment 1, regardless of which experimental group they are in. You could use an advanced filter with the contains operator, but you would have to explicitly specify every experiment string that contains experiment1. This would look something like:

(`experiments` contains "experiment1control") or (`experiments` contains "experiment1variation1") or (`experiments` contains "experiment1variation2") or ...

Instead, you can use a derived column that tests a regular expression against all elements of the experiments set and returns 1 if there is a match. 

# function is_in_experiment1()
# input column(s): string_set experiments
# output: long

long is_in_experiment1(){
    return set_match_string(experiments,"experiment1.*");
}

Looping over the elements of a set column

This function creates a derived column that returns the sum for all the elements in an integer set column.

double total_cpu_sec() {
  double total = 0;
  if (set_size(leaf_cpu_ms) == 0) return 0;

  for (int i=0; i < set_size(leaf_cpu_ms); i++)
    total += get_item(leaf_cpu_ms, i);

  return (total/1000.0);
}

This function creates a derived column that returns the maximum value found within an integer set column.

double max_wall_sec() {
  double max = 0;
  if (set_size(leaf_wall_ms) == 0) return 0;

  for (int i=0; i < set_size(leaf_wall_ms); i++)
    if (get_item(leaf_cpu_ms,i) > max)
      max = get_item(leaf_cpu_ms,i);

  return (max/1000.0);
}

Create a derived column from a column syntax

Derive a column from one that contains a "." character (dot)

long item_is_free()
{
  // The value for this column will be 1 if the item is free, and 0 otherwise
  return c("item.cost") == 0;
}

Here's an example of what this would look like in the UI. Notice that the company names are stored in the data server as integers (for example, Apptimize is stored as 2707 and Vamo is  stored as 3059).

Work with negative numbers

Although you can't use negative values with our filters, you can work around this by creating a derived column that converts those values. In this example, the function converts latitude values, changing any negative latitudes to non-negative numbers.

long make_pos()
{
  return lat + 180000000;
}

Interana multiples all stored numeric values by 1,000,000; this function accounts for that by adding 180 million to the value, instead of 180.

View specific events

The example matches on an event type to create a column with all of the error events. 

long is_error_event() 
{
  return match_string(event, ".*error.*") || match_string(event, ".*Error.*");
}

Similarly, this example matches on the event type "signup" to create a column containing all signup events.

long is_signup_event() 
{
  return match_string(event, ".*signup.*") || match_string(event, ".*Signup.*");
}

Looping over the elements of a set column

This function creates a derived column that returns the sum for all the elements in an integer set column.

double total_cpu_sec() {
  double total = 0;
  if (set_size(leaf_cpu_ms) == 0) return 0;

  for (int i=0; i < set_size(leaf_cpu_ms); i++)
    total += get_item(leaf_cpu_ms, i);

  return (total/1000.0);
}

This function creates a derived column that returns the maximum value found within an integer set column.

double max_wall_sec() {
  double max = 0;
  if (set_size(leaf_wall_ms) == 0) return 0;

  for (int i=0; i < set_size(leaf_wall_ms); i++)
    if (get_item(leaf_cpu_ms,i) > max)
      max = get_item(leaf_cpu_ms,i);

  return (max/1000.0);
}
  • Was this article helpful?