In the last section, we learned about Activity metrics: showing how often actors are active during some time period. Here, we'll take a look at Per-Actor Metrics in general and how they can be used to compute metrics and ratios across all the actors in the data. Interana makes it simple to create custom metrics to understand per-actor behavior. These metrics are quickly calculated for all actors in the dataset, at the time of the query. There's no need to pre-aggregate, build summaries, indices, or cubes — which makes it painless to define and refine metric definitions until you get it just right. Define the metric, use it in your queries, and it's quickly evaluated across the selected actors in your data.
Understanding how much data a user changes in Wikipedia
We have some great documentation on per-actor metrics in the Explorer Guide, so check it out. Here, we're going to focus on hands-on doing. The per-actor metrics we'll build relate to the size of Wikipedia articles, and how many bytes each user changes. Our dataset has two columns that represent the size of the article before the change (named length.old) and after the change (named length.new). We've also created a derived column (more on that later) that computes the absolute change in size, named length.abs.delta.
Head over to the Explorer and let's take a quick look at what it looks like to plot the Average length.abs.delta from 7 days ago to now. Use the Query Builder to specify a query that looks like this:
Based on the chart above, it looks like on average the edits are in the range of several hundred bytes. Since there are no filters, this is computed across all Wikis, users, spaces, etc. We'll explore this further and see how those sizes differ across those different factors.
What makes it special: Derived columns are a powerful advanced feature of Interana. They help you work with available data, reducing the need to modify the data collected or perform additional transformations before the data is imported (ingested). They're built to be a natural extension to our schemaless, columnar backend — allowing custom computation that a user can create and adjust after data has already been loaded. Derived columns are typically created and published by your Data Science team or crafted with a little help from other Interana users on our discussion forums. There's a little bit of coding involved, but it's streamlined and simplified. Even without a data team, a user with little coding experience can often find something that comes close to what they need and make it work for them with small tweaks to the code.
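To make the idea concrete, here is a minimal Python sketch of what a derived column like length.abs.delta computes for each event. It is not Interana's derived-column code: the sample events, the `abs_delta` helper, and the choice to treat a missing length as 0 are all assumptions made for illustration. Only the column names come from the tutorial.

```python
# Hypothetical events; only the column names come from the tutorial dataset.
events = [
    {"user": "alice", "length.old": 1200, "length.new": 1350},
    {"user": "bob", "length.old": 5000, "length.new": 4200},
    {"user": "alice", "length.old": None, "length.new": 300},  # e.g. a new page
]


def abs_delta(event):
    """Absolute size change in bytes (assumption: a missing length counts as 0)."""
    old = event.get("length.old") or 0
    new = event.get("length.new") or 0
    return abs(new - old)


# Attach the derived value to every event, the way a derived column would.
for event in events:
    event["length.abs.delta"] = abs_delta(event)

deltas = [event["length.abs.delta"] for event in events]
print(deltas)
```

The point of the sketch is that the value is computed from columns that already exist, so nothing about the ingested data has to change.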
Creating a per-actor metric
We can use the length.abs.delta column to create a per-actor measure for users that aggregates the values in some way. For example, we can compute the maximum to see the largest change submitted by the user. We can also add all the values together to see the total amount of changes a user submitted. Let's use the latter method to better understand how users create and edit Wikipedia articles.
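Conceptually, a per-actor metric groups events by actor and then aggregates one column within each group. The Python sketch below shows the two aggregations mentioned above (maximum and sum) on made-up data. The `per_actor` helper and the sample events are hypothetical, and this is only an illustration of the idea, not how Interana evaluates metrics internally.

```python
from collections import defaultdict

# Hypothetical (user, length.abs.delta) pairs.
events = [
    ("alice", 150),
    ("bob", 800),
    ("alice", 300),
    ("carol", 20),
    ("bob", 50),
]


def per_actor(events, aggregate):
    """Group values by actor, then apply one aggregate function per actor."""
    grouped = defaultdict(list)
    for user, delta in events:
        grouped[user].append(delta)
    return {user: aggregate(values) for user, values in grouped.items()}


total_bytes_changed = per_actor(events, sum)  # the metric built in this section
largest_change = per_actor(events, max)       # the alternative mentioned above

print(total_bytes_changed)
print(largest_change)
```

Swapping the aggregate function is the code equivalent of choosing a different Measure in the metric dialog.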
Head over to the Metrics manager by clicking the calculator icon on the navigation panel:
When there, click the big friendly blue NEW METRIC button on the top right of the Per-Actor Metric area:
When the new metric box pops up, give it a name like User_TotalBytesChanged and make sure that you're defining it for user in the For Each field. Then click the field next to Measure and select Sum. Select the length.abs.delta column as the values to sum together. Lastly, enter a meaningful description like "Sum of bytes changed for each user" and click the blue Save button. The dialog should look something like this:
Once you've saved the new metric, it'll be listed along with all the other published metrics. Head over to the My area to see just those metrics that you've created. Let's take a look at the metric. One click on the compass icon will bring it into the Explorer using an appropriate View:
The resulting chart should look something like this:
Notice that Interana selected the Distribution view by default to visualize per-actor metrics. This makes sense since there are many actors, and one way to understand a metric for the population as a whole is using a distribution. The distribution is interesting in that most of the mass is on the left with small sizes. It looks like the vast majority of users make a very small number of changes in Wikipedia. But there's still a significant bar all the way on the right.
The chart only plots 2 weeks, so perhaps the distribution will change if we look at a longer period of time. Let's rerun the query across all available data by dragging the left side of the time scrubber until it encompasses the whole blue area.
So the distribution shifted a little, but still looks the same. It's even clearer that most users change far fewer than 2000 bytes of content, and almost all users change less than 16KB. But there's still a significant number all the way on the right of the distribution. Hovering over that bar shows it's less than 0.6% of all users... but that's still over 38000 individual users.
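One way to think about the Distribution view is as a histogram of the per-user totals, where the right-most bar collects everyone above some cutoff. The sketch below illustrates that idea in Python. The bucket width, the 16000-byte overflow cutoff, and the sample totals are invented for the example, and Interana's actual bucketing may well differ.

```python
from collections import Counter


def distribution(totals, bucket_width, overflow_start):
    """Share of actors per bucket; values at or above overflow_start share one bucket."""
    counts = Counter()
    for value in totals:
        bucket_start = min((value // bucket_width) * bucket_width, overflow_start)
        counts[bucket_start] += 1
    return {start: counts[start] / len(totals) for start in sorted(counts)}


# Hypothetical per-user totals of bytes changed.
user_totals = [120, 450, 900, 1500, 1800, 2500, 7000, 15000, 40000, 2_000_000]

shares = distribution(user_totals, bucket_width=2000, overflow_start=16000)
for start, share in shares.items():
    print(f"{start:>6}+ bytes: {share:.0%} of users")
```

A small share in the overflow bucket can still be a large absolute number of actors when the population is big, which is what the right-most bar in the chart shows.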
What makes it special: Being able to define custom per-actor metrics offers you powerful insights into the population of actors and their behavior. It lets you follow your intuition and curiosity, defining metrics you can visualize around most aspects of who actors are and what they do. Like everything else, Interana dynamically calculates the metrics on the latest data when the query is run. It always helps to log data with intention, but with Interana you don't have to know exactly what you're looking for when collecting the data or instrumenting the application.
Using per-actor metrics in other named expressions
Who are the users on the right side of the chart and what are they doing? They certainly seem out of the ordinary. Let's take a deeper look. Click on that right-most bar and Interana zooms in, filtering to just those users. Notice that the left-most boundary of the bar is now part of the filter condition:
Interestingly, the chart still looks similar. Let's zoom in on that last bar again. Wow, it's still got a long tail:
More interesting still, even though the left bar represents only around 200 users, the events in this graph account for over 45% of the data in the dataset. Those few users are not only prolific, they are also very busy constantly generating lots of changes.
Let's change the View to Table, set the measure to Count Events, and group by user:
Looks like many of those user names are associated with bots. That would make sense, since bots work tirelessly doing whatever it is they do. You may recall that we have a column in the dataset named bot that represents a flag that's supposed to be set to true when a user is a bot. Let's filter for bot is not one of true.
What happens? There are still tons of users left that haven't set the bot flag but submit more changes than seems possible for a human. Hmmm...suspicious. Some of those user names indicate that the user is likely a bot, even if the flag isn't set. Maybe you recall seeing a published cohort named "Undeclared Bots". Let's filter for events from users not in that cohort and run the query again. The filter stack should look like this:
The table is shrinking (as is the event count in the donut), but something still seems suspicious. It's highly unlikely that an unassisted human would be submitting over 6 million changes to Wikipedia:
That Undeclared Bots cohort doesn't seem to be very good at catching bots. Let's take a look at how it's defined. Click the tool tip next to the cohort name, and then click on the Edit link. The cohort is defined as follows:
So it's checking for the bot flag, but also looking for "bot" (with various capitalizations) in the name of the user.
Not particularly sophisticated. I think we can do better. Click that blue Copy button, and we'll try to improve how that cohort is defined. Name it something like Improved Undeclared Bots. When it shows up under My cohorts, click the pencil icon to edit the definition:
Once there, let's take a more careful look. The cohort is defined across the last 56 days, but what if the bot flag had been set earlier and then reset? Let's redefine the cohort to go all the way back in time by changing the Start date to be "5 years ago". Then add another filter and let's use our per-actor metric to find users who've contributed over 1000KB of changes to Wikipedia. Since the vast majority of users only submit under 16KB of changes, this would seem a strong indication of bot behavior. Finally, change the description to be something meaningful. The cohort definition would look something like this:
...But wait! Something is a bit strange. We don't want to include only users with bot in their name AND when they change lots of stuff. We'd like to include users when they exhibit either of those behaviors without setting the bot flag. We can do that by switching over to the Advanced filters by clicking the little circle next to Advanced:
You can now see that the filter is defined using three "and" operations. We'd like to change one of those to an "or" operation, and group the clauses correctly. We can do that by using parentheses and editing the text field. Edit the definition so that it contains filter text something like that below:
`bot` not in ("true") and ((`user` matches "[bB][oO][tT]") or (`user.User_TotalBytesChanged___miros...firstname.lastname@example.org` >= 1000000))
Be sure to put parentheses around the last two filter conditions, and change the "and" to an "or". The user name in the expression should be yours, and the metric whatever name you selected. Finally, update the description to be something like "Users where their name indicates a bot or they've issued over 1MB in changes, but they haven't declared being a bot via the bot flag." When all done, the definition should look something like this:
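The effect of regrouping the clauses is easy to check outside the product. The Python sketch below compares the original all-"and" filter with the parenthesized "or" version on a few made-up users. The function names and sample users are hypothetical, and the sketch assumes that "matches" finds the pattern anywhere in the user name, which is an assumption about the filter's behavior rather than something this tutorial states.

```python
import re

BOT_NAME = re.compile(r"[bB][oO][tT]")
BYTES_THRESHOLD = 1_000_000  # same threshold as in the filter text above


def all_and(user, bot_flag, total_bytes):
    """Original grouping: undeclared AND bot-like name AND heavy editor."""
    return (
        bot_flag != "true"
        and bool(BOT_NAME.search(user))
        and total_bytes >= BYTES_THRESHOLD
    )


def name_or_volume(user, bot_flag, total_bytes):
    """Regrouped: undeclared AND (bot-like name OR heavy editor)."""
    return bot_flag != "true" and (
        bool(BOT_NAME.search(user)) or total_bytes >= BYTES_THRESHOLD
    )


# Hypothetical users: (name, bot flag, total bytes changed).
users = [
    ("CleanupBot", "", 5_000),        # bot-like name, low volume
    ("QuietEditor", "", 3_000_000),   # ordinary name, very high volume
    ("DeclaredBot", "true", 9_000_000),
    ("Casual", "", 800),
]

strict = [name for name, flag, total in users if all_and(name, flag, total)]
improved = [name for name, flag, total in users if name_or_volume(name, flag, total)]
print(strict)
print(improved)
```

With the all-"and" grouping neither suspicious user is caught, because each shows only one of the two behaviors. The regrouped version catches both while still excluding the declared bot.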
Save the new cohort definition, and let's try it out. Click the compass under Actions for the new cohort to bring it into the Explorer. According to the table, there are over 8000 undeclared bots lurking around:
That's much higher than the value reported on the landing dashboard, even if the dashboard only covered 28 days:
So it looks like our improved bot detection cohort is working better than the original!
Let's take a minute to review what we've done:
- Defined a per-actor metric that aggregates the number of bytes a user has changed in Wikipedia.
- Used the per-actor metric inside of a cohort to identify users who are suspected bots based on their name and behavior.
Both these named expressions are definitions that get evaluated when the query is run, so no time is wasted waiting for pre-aggregation or indexing. Pretty powerful stuff! And you've created very sophisticated queries without having to write any code.
Getting even fancier
Ok, now we know that users changing Wikipedia seem to behave differently depending on whether they are interactive humans or bots. Can we use that insight to create a more accurate per-actor metric that's just looking at the behavior of humans editing articles on English Wikipedia? Let's give that a shot. Head back to your collection of Metrics and make a copy of the User_TotalBytesChanged metric. Name it something like Human_EnglishWikiBytesChanged. Next, add some filters to include just those events we care about. Only include the following:
- Events that happen on English Wikipedia (wiki is one of enwiki).
- Events that are in the Main name space (space is one of Main).
- Events where the bot flag isn't set (bot is not one of true).
- Events where the user isn't in the improved bot detection cohort (user is not in cohort Improved Undeclared Bots).
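The four filters above simply narrow which events contribute to each user's sum. Here is a hedged Python sketch of that logic. The sample events, the "Talk" and "dewiki" values, the cohort membership, and the `keep` helper are all invented for illustration; in Interana the cohort is evaluated at query time rather than stored as a fixed set.

```python
from collections import defaultdict

# Hypothetical cohort membership, standing in for Improved Undeclared Bots.
improved_undeclared_bots = {"SneakyScript"}

# Hypothetical events using the column names from the tutorial.
events = [
    {"user": "alice", "wiki": "enwiki", "space": "Main", "bot": "", "length.abs.delta": 150},
    {"user": "alice", "wiki": "enwiki", "space": "Talk", "bot": "", "length.abs.delta": 999},
    {"user": "alice", "wiki": "dewiki", "space": "Main", "bot": "", "length.abs.delta": 70},
    {"user": "bob", "wiki": "enwiki", "space": "Main", "bot": "true", "length.abs.delta": 5000},
    {"user": "SneakyScript", "wiki": "enwiki", "space": "Main", "bot": "", "length.abs.delta": 900000},
    {"user": "carol", "wiki": "enwiki", "space": "Main", "bot": "", "length.abs.delta": 40},
    {"user": "carol", "wiki": "enwiki", "space": "Main", "bot": "", "length.abs.delta": 60},
]


def keep(event):
    """True only for human edits to Main-space articles on English Wikipedia."""
    return (
        event["wiki"] == "enwiki"
        and event["space"] == "Main"
        and event["bot"] != "true"
        and event["user"] not in improved_undeclared_bots
    )


human_english_bytes = defaultdict(int)
for event in events:
    if keep(event):
        human_english_bytes[event["user"]] += event["length.abs.delta"]

print(dict(human_english_bytes))
```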
The more refined per-actor metric should look something like this:
Save it. Now let's try exploring it to see how this distribution looks. Pretty similar, although the buckets are generally smaller. How about if we look at the data a bit differently? Switch the chart to Time View. Click the Add Measure button twice, and set the measures to be the Average, Median, and 95th Percentile of the metric. The resulting query and chart should look something like this:
Interesting! It looks like the Median is all the way on the bottom, which implies the vast majority of changes are tiny. But the Average is somewhere between 250 and 500 bytes in a 4 hour window. That's much more reasonable for human editors. The 95th percentile is significantly higher, but still within what feels like human limits. Looks like this new per-actor metric is doing what we want. Success!
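A quick Python sketch shows why the three measures can sit so far apart on a skewed metric. The per-user totals below are invented for one time window, and the nearest-rank percentile is just one common definition chosen for the example; the percentile method Interana actually uses isn't stated in this tutorial.

```python
import statistics


def nearest_rank_percentile(values, pct):
    """Smallest value with at least pct percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


# Hypothetical per-user totals (bytes changed) within a single time window.
window_totals = [5, 10, 10, 20, 40, 60, 150, 400, 900, 2500]

average = statistics.mean(window_totals)
median = statistics.median(window_totals)
p95 = nearest_rank_percentile(window_totals, 95)

print(f"average={average}, median={median}, p95={p95}")
```

A handful of large totals pull the average well above the median, while the 95th percentile tracks the heavy editors at the top of the range.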
How can I use it? The technique we learned is widely applicable. Just define what metrics you care to study and the subset of actors you'd like to understand better. The combination of flexibly defined metrics and behavioral cohorts makes for a powerful analytical tool. For example:
- E-commerce companies could study quarterly spending for customers referred by a specific advertising campaign.
- A two-sided marketplace like a ride-sharing service could study the number of hours driven outside of 9am-5pm in a particular geography, and how that maps to the drivers' home zip codes.
- Device companies can use this technique to look at the number of diagnostic alerts per day for failed devices in the month prior to failure.
Custom metrics and ratios
Another type of metric that Interana supports is a custom metric defined over the entire dataset rather than on a per-actor basis. These metrics can also be defined as ratios, which are often more meaningful for behavioral analytics. Let's riff off the chart above and our per-actor metric to define a custom ratio metric. Head back to your Metrics page, and create a new Custom Metric:
One metric that's often used to monitor predictable behavior or catch anomalies is a Peak-to-Mean ratio. We can do something similar with our per-user metric to find situations where the ratio changes significantly. That might mean more undeclared bots made it through or something else strange is happening. Define the metric as:
You'll have to click the "Add Custom Denominator" button to make the denominator area appear. Save the new ratio metric, and then Explore it by clicking on the compass in the metric actions. Notice that Interana opens the metric in Time View, with the selected Measure being the metric you explored:
This is different from when we explored the per-actor metric, since that made more sense as a distribution across all the actors in the dataset. Looking at the resulting chart, it looks like there's plenty of variability with the 4 hour window, but the ratio values are generally between 2 and 4. Let's widen the time range to the last 28 days and use the chart controls to set the Resolution to 1 Day (the time window will automatically adjust):
The resulting chart is much smoother:
Let's widen that out to 26 weeks and see if there are any longer-term trends:
Maybe? Perhaps there's a small quarterly rise and fall, but it generally looks stable around 3.0 within a +/- 0.5 range.
What makes it special: Interana is for digital economy workers of all kinds, not just coders and data scientists; it's especially designed with product managers, UX designers, marketers and growth teams in mind. Being fast and flexible, Interana allows you to rapidly iterate on your questions and ideas. Being able to define and customize metrics per-actor and for the whole population makes Interana adaptable for a range of industries and applications. Interana is accessible to people who are analytical but not technical, since you don't have to write code (or even SQL).
Let's consider what we just did. Our last metric was composed of the ratio of P95 and mean for a per-actor metric. So for every user and every day in the last 26 weeks we've computed the sum of their changes to English Wikipedia. But not just that, we've also verified that the user was human and not included in a cohort that was based on another per-actor metric. We calculated the P95 and mean for the metric across that user population for every day in the last 26 weeks. And then graphed the information. All by using a nested combination of named expressions and not writing any code. If we check the stats:
We see that we ran a bunch of complex behavioral calculations across 460 Million events and came back with a pretty chart in a few seconds. Not bad! This dataset is still tiny compared to those some Interana customers use in production, but it should give you a sense of how quick queries combined with rapidly iterating to refine questions empowers everybody to explore the data.
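The nested calculation described above (a per-user sum for each day, then the P95 and mean across users, then their ratio) can be sketched in a few lines of Python. This only illustrates the shape of the computation: the events, day labels, and helper function are hypothetical, the nearest-rank percentile is an assumed definition, and the filters and cohort check are left out for brevity.

```python
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Smallest value with at least pct percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


# Hypothetical (day, user, length.abs.delta) events, already filtered to humans.
events = [
    ("day-1", "alice", 100),
    ("day-1", "alice", 50),
    ("day-1", "bob", 50),
    ("day-1", "carol", 400),
    ("day-2", "alice", 200),
    ("day-2", "bob", 200),
]

# Step 1: per-actor metric, evaluated separately for each day.
per_day_user = defaultdict(lambda: defaultdict(int))
for day, user, delta in events:
    per_day_user[day][user] += delta

# Step 2: ratio of P95 to mean across the users active on each day.
ratio_by_day = {}
for day, totals in sorted(per_day_user.items()):
    values = list(totals.values())
    mean = sum(values) / len(values)
    ratio_by_day[day] = nearest_rank_percentile(values, 95) / mean

print(ratio_by_day)
```

A day where one user dominates pushes the ratio up, while a day with evenly spread edits keeps it near 1. That sensitivity is what makes the ratio useful for spotting anomalies.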
That's probably more than enough of Metrics for now. Another important behavioral analytics technique is to automatically group actor events into sessions and see what happens in and across sessions. That's our next stop!
Keep in mind that you're using a shared demo system meant for learning by everybody. The dashboards and objects you create will stick around for a while, but we will periodically clean up the system and remove stale accounts.