
How to perform a privacy purge


The European Union (EU) General Data Protection Regulation (GDPR) was designed to protect the data privacy of EU citizens and to reshape the way organizations approach data privacy.

Interana Privacy Purge enables you to comply with GDPR and other privacy regulations, as well as any voluntary privacy policies your company may adhere to. For more information, see How to comply with GDPR.

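This document covers the following topics:

  • What happens in a privacy purge
  • What you should know before scheduling a privacy purge
  • Requirements for a privacy purge
  • How to perform a privacy purge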

What happens in a privacy purge

Interana Privacy Purge enables you to protect the privacy of Interana users and users of services whose data resides in Interana. The following diagrams illustrate the types of information Interana Privacy Purge encompasses.

Behavioral information about your users—Interana as the repository of privacy data

GDPR_scenario1.png

 

Information about Interana users—Interana as the producer of privacy data

GDPR_scenario2.png

A privacy purge is a three-part process:

PrivacyPurge_overview.png

As you might guess, a lot goes on behind the scenes to complete a privacy purge. The following table lists the types of data affected during a purge, with an explanation of what happens to each data type in the process.

Type of purge data  Meaning
Source event data files

You are responsible for purging your source event data files of an individual's behavioral data.

If you do not sanitize your original material, we recommend that you maintain a cumulative set of purged IDs. Then if you have to reingest, you can rerun the purge.

Event data

When an event record stored in Interana matches the purge user in any of the purge identifier columns, the entire event record is purged.

String data

Strings that exactly match a purge identifier are deleted from the string server. Other strings that are associated with deleted events are de-linked and not purged. 

IMPORTANT: The Interana admin must be careful to only pass purge identifier values that are Personally Identifiable Information (PII), such as email addresses and GUIDs. The purge utility removes whatever the admin requests, even a value such as "blue."

Query result history

Dashboard caches are refreshed or aged out within 30 days.

Named expressions, global filters, dashboards, and derived columns

Named expressions, global filters, dashboards, and derived columns created by a purged user are considered the intellectual property of the company for which the purge user worked, and are not removed. 

The purge command entirely deletes any named expressions, global filters, dashboards, and derived columns that reference the purged user in their filters or other query parameters. An Interana admin can run the purge in preview mode (without the --run flag), which prints a list of these references without deleting them.

Interana user account

Named expressions, global filters, dashboards, and derived columns created by the user, and any audit history of changes to these objects edited by the purged user, are considered the intellectual property of the company for which the purge user worked and are left intact. Interana admins can remove them manually, as necessary. 

System backups

For system backups, you can set a policy of not retaining backups for longer than your governing policies allow.

NOTE: If you are subject to the EU's GDPR, a 30-day retention policy ensures compliance.

Human readable and structured system logs

We recommend that you rotate the logs on the Interana cluster within seven days.

If these files are downloaded off the cluster for longer storage, keep a cumulative list of purge user identifiers, so you can rerun privacy requests should the logs be needed for analysis at a later time.
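The cumulative list of purged identifiers recommended above can be kept in the same JSON shape as a purge config file, so it can be reused if data is reingested. The following is a minimal sketch, not part of the Interana tooling: the function names and the file layout are hypothetical, and it assumes each purge config is a JSON object that maps column names to arrays of strings. A cumulative file that grows past the 16 MB config limit would have to be split before it is reused.

```python
import json
from pathlib import Path


def merge_purge_config(cumulative: dict, config: dict) -> dict:
    """Return a new mapping of column name -> sorted, de-duplicated identifiers."""
    merged = {column: set(values) for column, values in cumulative.items()}
    for column, values in config.items():
        merged.setdefault(column, set()).update(values)
    return {column: sorted(values) for column, values in sorted(merged.items())}


def update_cumulative_file(cumulative_path: Path, config_path: Path) -> dict:
    """Fold one purge config into the cumulative file on disk (hypothetical layout)."""
    cumulative = {}
    if cumulative_path.exists():
        cumulative = json.loads(cumulative_path.read_text(encoding="utf-8"))
    config = json.loads(config_path.read_text(encoding="utf-8"))
    merged = merge_purge_config(cumulative, config)
    cumulative_path.write_text(json.dumps(merged, indent=4), encoding="utf-8")
    return merged
```

Calling update_cumulative_file after each purge keeps one record of every identifier that has been purged, per column.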

What you should know before scheduling a privacy purge

Before you schedule a privacy purge, it's important that you are aware of the downstream effects:

  • Deleted references to a purged user can cause query failures and dashboard charts to disappear.
  • A purge scans and deletes specified rows at a speed of 10 GB/hour per data node. Performance may be impacted if a large volume of data is imported while the purge is running, for example, the equivalent of one import node (with 4 CPUs) importing as much as it can into one data node (also 4 CPUs), or about 250 million events/day per data node CPU.
  • A privacy purge is a cluster-wide operation with resource-intensive processes. Although it doesn't affect the performance of most queries, longer running queries can take up to 30% more time to complete while a purge is in progress. For this reason, we recommend that you schedule privacy purges at non-peak hours.
  • You can use query structured logging to determine which objects were deleted in a privacy purge.
  • We recommend that you maintain a cumulative set of purged IDs if you do not purge your raw logs. That way, if the original source files are re-ingested, you can rerun the necessary privacy purges.
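The scan rate and query slowdown quoted in the list above can be turned into a rough scheduling estimate. This is a back-of-the-envelope sketch only: it assumes data is spread evenly across data nodes, that nodes scan in parallel, and that no heavy import is running at the same time. The function names are hypothetical.

```python
PURGE_RATE_GB_PER_HOUR_PER_NODE = 10.0  # scan rate quoted in this document
MAX_QUERY_SLOWDOWN = 0.30  # longer running queries: "up to 30% more time"


def estimate_purge_hours(total_data_gb: float, data_nodes: int) -> float:
    """Estimate purge duration, assuming even data distribution across nodes."""
    if data_nodes < 1:
        raise ValueError("data_nodes must be at least 1")
    if total_data_gb < 0:
        raise ValueError("total_data_gb must not be negative")
    return total_data_gb / (PURGE_RATE_GB_PER_HOUR_PER_NODE * data_nodes)


def worst_case_query_seconds(baseline_seconds: float) -> float:
    """Upper bound on a long-running query's duration while a purge is active."""
    return baseline_seconds * (1.0 + MAX_QUERY_SLOWDOWN)
```

For example, under these assumptions a cluster holding 2,000 GB across 8 data nodes would need about 25 hours, which suggests splitting the work across several off-peak windows is not possible within a single night and should be planned accordingly.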

The dashboard cache is not included in a privacy purge. However, the cache refreshes every week, clearing out old data. For this reason, there may be a short time when dashboards still display privacy information that has been purged. 

Interana Privacy Purge removes the following PII data:
  • Event data
  • String data
  • Query result history
  • Query definition history
  • References in derived columns
  • References in funnels
You are responsible for removing the following PII data:
  • PII data in original source logs
  • PII data in lookup files

Requirements for a privacy purge

This section covers the procedural and data structure requirements you must adhere to for a successful privacy purge, then outlines the information you should have on hand before you begin.

A privacy purge runs across all data available on the cluster at the time of the purge, including data that is in the process of being imported. Data that is imported after the purge pass completes is not scanned unless a new purge is run.

Procedural requirements and limitations

  • DO NOT launch a privacy purge while a cluster rebalance is in progress. Wait until the cluster rebalance is complete before starting a privacy purge.
  • DO NOT run a privacy purge on a lookup table. Privacy purge does not currently support lookup tables.
  • DO NOT attempt to use a config file larger than 16 MB (16,000 KB) in a purge, or the job will hang.

Data structure requirements

You must follow the rules in this section for a successful privacy purge.

  • If a column name exists in multiple tables, the columns must be of the same type. 
  • If columns with the same name have different types, change the column names in the Interana UI so they are unique.
  • Hexadecimal/Identifier columns must be in the format of the original ingested (raw source) data, such as the original value of a GUID: 
    "30dd879c-ee2f-11db-8314-0800200c9a66". A privacy purge requires the original ingested (raw source) value.
  • String and integer sets are not deleted in a privacy purge.
  • Each string must be individually specified for a purge. For example, a "userID" and the "userID@mailaddress" must be individually specified to be deleted.
  • A userID that is in a column description (in the Interana UI) will not be deleted in a privacy purge. You must manually remove any privacy information that appears in column descriptions.
  • If a userID appears in a dashboard title, that dashboard will not be deleted in a privacy purge. However, the dashboard title will appear in the metadata delete preview, flagging it for manual deletion.
  • Advanced filters, titles, descriptions, and derived columns that contain decimal values, or a plain text string that contains a space, are not deleted in a privacy purge.
  • Only exact instances of a string are deleted. If the string appears with a letter or number adjacent, it is considered a different string (because it's not an exact match) and is not deleted.
  • Privacy purge does not currently support deleting non-ASCII (multi-byte UTF-8) characters, such as kanji and emoji.
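Two of the rules above, exact matching and the lack of support for characters such as kanji and emoji, can be checked before a purge is requested. The sketch below is illustrative only and does not reproduce Interana's matching code: the function names are hypothetical, "unsupported" is assumed to mean any non-ASCII character, and the comparison is assumed to be case-sensitive, which this document does not confirm.

```python
def unsupported_identifiers(values):
    """Return the identifiers that contain non-ASCII characters (assumed unsupported)."""
    return [value for value in values if not value.isascii()]


def is_exact_match(stored_string: str, purge_identifier: str) -> bool:
    """Model the exact-match rule: any adjacent letter or digit makes a different string."""
    return stored_string == purge_identifier
```

Under this model, the identifier "jdoe" matches neither "jdoe1" nor "jdoe@example.com", which is why each variant has to be listed individually in the config file.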

Information you'll need 

Before you create a privacy purge job, have the following on hand:

  • Names of the columns that contain data to be purged
  • User identifiers for the PII data to be purged
  • JSON conf file listing the column names and respective identifier values

Config file requirements

The config file should be written in JSON, where each line is a condition, and must conform to the following requirements:

  • Must have Interana user interface (UI) column names and a comma-separated list of the respective filter values.
  • ALL values must be enclosed in double quotes (" "), whether string or integer, and comma-separated lists must be enclosed in brackets [ ].
  • Each file must be a JSON object, where each key is a column name and each value is an array of strings.

Values that contain spaces are not supported by Interana privacy purge.

Config file example

{
    "UI_ColumnName1":["value1", "value2", "value3"],
    "UI_ColumnName2":["1234", "286", "523"]
}
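A config file can be checked against the requirements above before it is passed to the purge command. The following is a minimal sketch and not an Interana utility: validate_purge_config is a hypothetical helper, and the 16 MB limit is read as 16,000,000 bytes, which is an assumption about the units.

```python
import json

# "16 MB (16000 K bytes)" interpreted with decimal units; this is an assumption.
MAX_CONFIG_BYTES = 16 * 1000 * 1000


def validate_purge_config(text: str) -> list:
    """Return a list of problems found in the config text (empty if none)."""
    errors = []
    if len(text.encode("utf-8")) > MAX_CONFIG_BYTES:
        errors.append("file is larger than 16 MB")
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return errors + [f"not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return errors + ["top level must be a JSON object"]
    for column, values in config.items():
        if not isinstance(values, list):
            errors.append(f"{column}: value must be an array")
            continue
        for value in values:
            if not isinstance(value, str):
                errors.append(f"{column}: {value!r} must be a double-quoted string")
            elif " " in value:
                errors.append(f"{column}: {value!r} contains a space")
    return errors
```

An empty result means the file satisfies the structural rules listed here. It does not confirm that the column names exist in the Interana UI or that the values are actually PII.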

How to perform a privacy purge

Perform a privacy purge with the Interana CLI ia purge command, which automatically performs many of the same operations as Selective data deletion. By default, ia purge runs in preview (dry-run) mode. Use the --run option to execute the command, as shown in the example at the end of this section.

The dry-run mode currently only previews metadata, such as named expressions and queries pinned to a dashboard. For a comprehensive list of files to be purged, use the Selective data deletion preview command. However, be aware that the config file format for ia data jobs is different from that used for ia purge run. See the Config file for selective data deletion example.

The following table lists the ia purge arguments, followed by an example.

ia purge  
Positional arguments  
run

Deletes strings, events, and metadata with values specified by the config file.

Defaults to dry-run mode.

interrupt

Halts an active purge.

The interrupt option is new in Release 2.25.1.

Optional arguments

--example-config

Displays a sample privacy purge config file.

--instance-name <cluster-name>

If you have more than one cluster, you can specify on which cluster to run the purge.

NOTE: Use the Interana cluster name, a specific stored credential.

--output {json,text,table}

Sets the output format. The default is table.

--run

Executes the command. The default is dry-run mode.

--verbose, -v

Sets verbose mode, which shows the crash stack trace if an error occurs.

--version

Displays the version of Interana and Interana CLI currently installed.

--unsafe

Use when there is not a valid certificate. To acquire a valid certificate, see How to replace a self-signed certificate.

--help, -h

Prints help for this command and then exits.

Example

To preview a list of files that will be purged, use the selective data deletion ia data preview-delete-job command. This command returns a list of events that match the filters in the specified config file.

In the following example, ia purge run is used with the --run argument to execute the command. The config file contains column names, each with a comma-separated list of user identifier values for the PII data to be purged. For an example of a config file, see the ia purge section of the Interana CLI reference.

ia purge run gdpr011518.config --run
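Because the command only deletes data when --run is present, a script can run the preview first and ask for confirmation before the real purge. The following is a hedged sketch of that pattern: the function names are hypothetical, it assumes the ia CLI is installed and on the PATH, and it uses only the command form shown in this document (ia purge run <config>, with or without --run).

```python
import shlex
import subprocess


def build_purge_command(config_path: str, execute: bool = False) -> list:
    """Build the argument list; without --run the command stays in dry-run mode."""
    command = ["ia", "purge", "run", config_path]
    if execute:
        command.append("--run")
    return command


def preview_then_purge(config_path: str) -> None:
    """Show the dry-run output, then purge only after explicit confirmation."""
    preview = build_purge_command(config_path)
    print("Previewing:", shlex.join(preview))
    subprocess.run(preview, check=True)
    answer = input("Type 'purge' to delete the data listed above: ")
    if answer.strip() == "purge":
        subprocess.run(build_purge_command(config_path, execute=True), check=True)
    else:
        print("Purge cancelled; nothing was deleted.")
```

Note that the preview currently covers metadata only, so the confirmation step does not replace a selective data deletion preview of the matching events.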