How cohorts are defined
We used a cohort in the previous section. Now let's see how it's defined. Click the Cohorts icon on the navigation panel to head over to the Cohorts manager:
Once there, locate the Users Registered Last 28 Days cohort that we used before and click on the + sign icon to open a copy of the cohort definition. This is what you'd do if you wanted to customize the cohort for your needs.
The box that pops up should look something like this:
Take a minute to examine the definition. Put into words, it reads roughly as:
- From the Wikipedia dataset, examine the events for each user.
- Include the user in this cohort if the user has at least one event within the last 28 days where the log_type is newusers and the log_action is either create or create2.
We're filtering for events in the Wikipedia dataset that indicate a new user being created. When we see one of the events for a user within the selected time range, we consider that user to be part of the cohort.
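If you think in code, the same rule can be sketched in a few lines of Python. This is only an illustration of the logic, not how Interana evaluates cohorts. The user and time field names, the sample events, and the fixed "now" are assumptions made up for the example; only log_type, log_action, and their values come from the definition above.

```python
from datetime import datetime, timedelta


def users_registered(events, now, days=28):
    """Return the users with at least one new-user event in the last `days` days."""
    window_start = now - timedelta(days=days)
    cohort = set()
    for event in events:
        in_window = window_start <= event["time"] <= now
        is_new_user = (
            event["log_type"] == "newusers"
            and event["log_action"] in {"create", "create2"}
        )
        if in_window and is_new_user:
            cohort.add(event["user"])
    return cohort


# Made-up sample events (hypothetical users and timestamps).
now = datetime(2017, 3, 1)
events = [
    {"user": "alice", "time": datetime(2017, 2, 20),
     "log_type": "newusers", "log_action": "create"},
    {"user": "bob", "time": datetime(2016, 12, 1),
     "log_type": "newusers", "log_action": "create2"},  # outside the window
    {"user": "carol", "time": datetime(2017, 2, 25),
     "log_type": "delete", "log_action": "delete"},     # not a new-user event
]

print(users_registered(events, now))  # {'alice'}
```

One matching event is enough to put a user in the cohort, which is what "at least one event" means in the definition.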
Now, click the Cancel button to close the box. We'll create a similar cohort from scratch in the next step.
How can I use it?
Cohorts are used everywhere:
• eCommerce companies use cohorts to understand retention, cashflow, and lifetime value.
• B2B SaaS companies use cohorts to compare ROI for various customer acquisition channels.
• IoT and manufacturing companies use cohorts to map device batches to errors and defects.
You might compare how actors in different cohorts behave in the same situation, or how the same actors behave in different situations. Wherever you have a group of actors that are similar to each other, you can benefit from cohorts.
Create a new cohort for recently created articles
OK, we've seen what the cohort for newly registered users looks like. Now let's create a cohort of new articles, defined as articles that were created in the last 28 days. Click the big blue NEW COHORT button at the top right of the page. A new cohort definition box appears, similar to the one we just looked at.
Follow these steps to create the cohort:
- Fill in a name for your cohort in the top empty box. Since this will be a cohort for articles created in the last 28 days, call it something like "Articles Created Last 28 Days".
- Since we're looking at articles, click into the For Each field and select article as the key to examine for this cohort.
- You can keep the Measure as the default (Count Events with At Least 1).
- Change the Between date range to start at "last 28 days" and end "today". This includes events that start at midnight 28 days ago and go until midnight today.
- Click the + next to "Add Filter" to open a new filter field. Click into the left field and enter type to choose the type column. Leave the comparison as "is one of". Click the right box and enter new. This will filter for events that indicate something new is being created.
- Click the + next to "Add Filter" again. This time choose the space column, and filter for "is one of" Main. Wikipedia has multiple namespaces, sometimes indicated in the page title by a colon separating the namespace from the article name. We're only looking for articles in the main namespace, which we can find by looking for the value Main in the space column. (An alternative would be to look for a value of 0 in the namespace column, but that's a little harder to read.)
- Click in the Description field and write a short explanation of the cohort. What you enter here can be seen in the interface by clicking the small information circle to get the tooltip description. It's a best practice to document the data and objects by entering descriptions. In the long run, it'll help you and your colleagues better remember and understand what's available.
When you're done, the box should look something like this:
Click the big friendly blue Save button. Congratulations, you've just defined a new cohort! It'll now show up in the list of All cohorts, and if you click over to the "My" cohorts area it should be the only one listed there:
What makes it special: Interana cohorts are a type of named expression that is evaluated (computed) when the query is run. They require no special indexing or pre-aggregation to work quickly and efficiently, so they're easy to define and refine without having to wait after each tiny edit. Cohorts serve as shorthand for a group of actors that you're interested in studying. Instead of constantly having to describe the actors to include in a query, you describe them once in the cohort definition and then use the cohort in the query. If the definition is complex, more experienced team members can craft the definition and publish it for less experienced team members to use in their queries.
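As a rough mental model, the cohort we just saved behaves like a named function that is called each time a query runs, rather than a stored list of articles. The Python sketch below mirrors the settings from the steps above (For Each article, type is one of new, space is one of Main, at least 1 event, last 28 days to today). It is an analogy only, not Interana's implementation. The time field name and the sample events are assumptions, and "today" is treated here as midnight at the start of the current day.

```python
from datetime import datetime, timedelta


def articles_created_last_28_days(events, now):
    """Evaluate the cohort at query time; no membership list is stored."""
    # Assumed reading of the Between range: midnight 28 days ago up to midnight today.
    end = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=28)

    counts = {}  # For Each: article
    for e in events:
        matches = (
            start <= e["time"] < end
            and e["type"] in {"new"}    # filter: type is one of new
            and e["space"] in {"Main"}  # filter: space is one of Main
        )
        if matches:
            counts[e["article"]] = counts.get(e["article"], 0) + 1

    # Measure: Count Events with At Least 1
    return {article for article, n in counts.items() if n >= 1}


# Made-up sample events (hypothetical titles and timestamps).
now = datetime(2017, 3, 1, 15, 30)
events = [
    {"article": "Foo", "time": datetime(2017, 2, 10), "type": "new", "space": "Main"},
    {"article": "Talk:Foo", "time": datetime(2017, 2, 11), "type": "new", "space": "Talk"},
    {"article": "Bar", "time": datetime(2017, 2, 12), "type": "edit", "space": "Main"},
    {"article": "Baz", "time": datetime(2016, 6, 1), "type": "new", "space": "Main"},
]

print(articles_created_last_28_days(events, now))  # {'Foo'}
```

Because the function is re-evaluated on every call, editing a filter takes effect on the next query with nothing to rebuild, which is the point the paragraph above makes about refining definitions.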
Using the New Cohort
Now that we have a shiny new cohort, let's put it to use. Just like in dashboard charts, there's a compass icon under the list of cohort actions. This lets us explore from a cohort with a single click:
Click the compass, and Interana brings the cohort into the Explorer. We create a new query that compares a count of unique key values (the column used in the For Each definition) in the cohort versus all observed unique key values during the time period defined in the cohort:
That's a great start. Let's use some of what we learned in the previous step to go from this to a pie chart comparing new articles with older articles, broken down by the wiki where the article exists. To accomplish that, do the following:
- Click the View field and select Pie.
- Set the Start time as "last 28 days" and the End time as "today".
- Select wiki as the column by which we'll Group.
- Click next to the blue "A" in the list of filters and name the first comparison clause "New Articles".
- Click next to the blue "B" and name the second comparison clause "Older Articles".
- Click in the box under the "B" and select article as the filter column and "is NOT in cohort" as the comparison. Then select your new cohort as the cohort to check.
- When everything is ready, click the big green GO button at the top.
If everything was correct, you should see something like this:
Checking out the chart, it looks like there are lots of new articles created in the Wikidata project (wikidatawiki), a collaboratively edited knowledge base. English Wikipedia (enwiki) comes in second. Older articles that we observed during the same time period also include many articles from Wikimedia Commons (commonswiki), a repository of media files freely available for reuse.
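Conceptually, the query we just built splits the events into two comparison groups by cohort membership and counts unique articles per wiki in each group. The Python sketch below shows that idea on a handful of made-up events. The function name, the sample data, and the assumption that the events are already restricted to the query's time range are all illustrative; this is not how the Explorer computes the chart.

```python
from collections import defaultdict


def unique_articles_by_wiki(events, cohort):
    """Count unique articles per wiki for in-cohort (A) and not-in-cohort (B) events."""
    seen = {
        "New Articles": defaultdict(set),    # A: article is in cohort
        "Older Articles": defaultdict(set),  # B: article is NOT in cohort
    }
    for e in events:
        label = "New Articles" if e["article"] in cohort else "Older Articles"
        seen[label][e["wiki"]].add(e["article"])
    return {
        label: {wiki: len(articles) for wiki, articles in by_wiki.items()}
        for label, by_wiki in seen.items()
    }


# Made-up cohort and events (hypothetical article names).
cohort = {"Q1", "Foo"}
events = [
    {"article": "Q1", "wiki": "wikidatawiki"},
    {"article": "Q1", "wiki": "wikidatawiki"},  # same article again, counted once
    {"article": "Foo", "wiki": "enwiki"},
    {"article": "File:X", "wiki": "commonswiki"},
    {"article": "Bar", "wiki": "enwiki"},
]

result = unique_articles_by_wiki(events, cohort)
print(result["New Articles"])    # {'wikidatawiki': 1, 'enwiki': 1}
print(result["Older Articles"])  # {'commonswiki': 1, 'enwiki': 1}
```

Each pie slice corresponds to one wiki's count within a group, so repeated edits to the same article do not inflate its slice.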
Once you're happy with your query, click the pin icon to save it to your dashboard. You can then come back to it whenever you want.
What makes it special: Interana cohorts are very flexibly defined and focused on behavior. Actors are included in cohorts based on their actions, not just their demographics. As we saw during the initial tour, cohorts can be defined in terms of other complex objects, and in turn used to define other objects.
A deeper dive into the interface (optional)
The rest of this section is optional. However, it is a recommended way to learn more of the interface. If you're eager to get on with learning about the analytical features, you can go straight to learning about measuring activity.
Still here? Great! We're going to check out some other useful ways to interact with the Explorer.
We mentioned tool tips when you were defining the new cohort. In case you haven't tried it yet, click the little circled "i" next to the name of your cohort:
You should see a tool tip come up with a definition of the cohort, including the description you entered at the top. You can even click the "Edit" link in the tool tip to go straight to editing or copying the expression (or just looking at the definition). Click anywhere in the grey area outside the tool tip to close it. Tool tips are a great way to remember what each object does and how to use it.
Similarly, columns in the dataset also have tool tips. These bring up a description of the column (and more!) to help you better understand the data in the dataset.
Column descriptions are set by users with an administrator role via the dataset settings. They would typically come from data dictionaries or be written by the Data Science team at a company. Including descriptions for the most important columns in a dataset is a great way to help new users learn and explore the data.
We've selected the time range for our queries using specific start and end times. That's great when you're looking for precise time ranges, or need to specify a relative time (e.g., 4 weeks ago to now). But sometimes we just want to get an estimate across a larger or smaller time range. We can do that with the time scrubber. Perhaps you've noticed it at the bottom of the charts:
The scrubber shows up as a light blue area chart, with the height proportional to the number of events observed during that period of time. The right-hand side of the chart represents the current time, with older events showing up toward the left. The time range selected for the current query is highlighted with a slightly darker overlay and thin lines that define the start and end times:
Those lines are handles that can be clicked and dragged to quickly select approximate time ranges. Let's try to see what our query looks like if we examine the full history instead of just the last 28 days. Grab the left handle by clicking and holding it, then pull it left until you've selected all the events (the whole area in blue):
When you release the mouse button, the query will refresh using the larger time range. Your query should look something like this:
Notice that the time range now reaches all the way back toward the beginning of 2016.
Comparing charts in the Explorer
Can you spot what's similar and what's different? The New Articles are still the same, since those are defined by the cohort and limited to articles created in the last 28 days. But the Older Articles now include over a year of observations and have shifted. Wikidata makes up less of the pie, and Wikimedia Commons has a larger share.
We can see this more clearly by expanding some of the earlier results in the query history (breadcrumbs) below the main chart. The immediately prior chart should be our query from earlier with the 28-day time range. Click the diagonal arrows on the collapsed chart to expand it:
This will open the chart within the history and make it easier to compare differences between the current and earlier query results:
Now it's more obvious how unique article counts compare between the last 28 days and the whole history.
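To see why widening the time range changes only the Older Articles group, it helps to run the same split over two windows. The Python sketch below is a toy model with made-up events and a hypothetical one-article cohort. It assumes the cohort keeps its own 28-day definition no matter which range the query uses, as described above.

```python
from datetime import datetime, timedelta


def split_counts(events, cohort, start, end):
    """Return (new, older) unique article counts for events in [start, end)."""
    in_range = [e for e in events if start <= e["time"] < end]
    new = {e["article"] for e in in_range if e["article"] in cohort}
    older = {e["article"] for e in in_range if e["article"] not in cohort}
    return len(new), len(older)


now = datetime(2017, 3, 1)
cohort = {"Foo"}  # hypothetical: the only article created in the last 28 days
events = [
    {"article": "Foo", "time": datetime(2017, 2, 10)},
    {"article": "Foo", "time": datetime(2017, 2, 15)},
    {"article": "Bar", "time": datetime(2017, 2, 12)},
    {"article": "Baz", "time": datetime(2016, 5, 1)},
    {"article": "Qux", "time": datetime(2016, 1, 15)},
]

last_28_days = split_counts(events, cohort, now - timedelta(days=28), now)
full_history = split_counts(events, cohort, datetime(2016, 1, 1), now)

print(last_28_days)  # (1, 1)
print(full_history)  # (1, 3)
```

Articles in the cohort have no events before they were created, so pulling the start time back adds nothing to the first group while the second group keeps growing.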
Stats for performance nerds
Lastly, there are some interesting stats available for each query. Have you ever wondered about the circle at the top left of the chart, the one showing the percentage of events that the query matched? It has a superpower: if you hover over it, you get precise event counts and some additional statistics about how long the query took and how that work was distributed among the nodes in the cluster. Put your cursor over the circle and take a look:
This shows that the query scanned about 815 million events, taking 1.45 seconds to return the results. The work was distributed across 5 nodes and used about 18.2 CPU seconds from the underlying cluster. Kind of nerdy, but also occasionally useful. We sometimes refer to that circle as the "donut."
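If you want to turn those numbers into rates, a little arithmetic goes a long way. The Python sketch below uses only the figures quoted above. It assumes the 18.2 CPU seconds is the total across all nodes and, for the per-node figure, that the work was spread evenly, which the tool tip may or may not bear out.

```python
# Figures from the example above.
events_scanned = 815_000_000  # "about 815 million events"
wall_seconds = 1.45           # time to return the results
nodes = 5
cpu_seconds = 18.2            # assumed to be the total across all nodes

events_per_second = events_scanned / wall_seconds
avg_busy_cores = cpu_seconds / wall_seconds   # average parallelism during the query
cpu_seconds_per_node = cpu_seconds / nodes    # assumes an even split across nodes

print(f"{events_per_second / 1e6:.0f} million events scanned per second")
print(f"{avg_busy_cores:.1f} cores busy on average")
print(f"{cpu_seconds_per_node:.2f} CPU seconds per node")
```

The CPU time being much larger than the wall-clock time is what you'd expect when a query is spread across many cores at once.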
We've now learned the mechanics of creating and editing cohort definitions. Next up: understanding how to measure activity in the data.
Please keep in mind that you're using a shared demo system meant for learning by everybody. The dashboards and objects you create will stick around for a while, but we will periodically clean up the system and remove stale accounts.