Selective data deletion

This feature is new in Release 2.25.

This article explains how to selectively delete data based on specified filter criteria, such as time range, actor, or event type.

Advantages of selectively deleting data

There are a number of reasons why you might want to selectively delete events:

  • When there are garbage records that can be identified with the use of a boolean expression.
  • When you need to delete records for a particular set of actors. To comply with legal requirements, you may need to delete all logs of activity for users who request a privacy purge. There is a grace period between the time a purge request is placed and when the data must be purged (see the EU GDPR web site for time limits for compliance).
  • When there is a long retention period for specific high-value events, and a short retention period for all other events. You can use selective delete to periodically delete events of a particular type that are older than a specified date.
  • When you have ingested data from multiple data sources into a single table, and wish to delete and re-ingest the events from just one of those data sources. 

Overview of selective data deletion

Admins can selectively delete events from Interana that match specific filter criteria, such as the following:

  • A specific time range (x < time < y)
  • A collection of event types (event = x, y, z)
  • Events for a particular user (userid = 123)

Selectively deleting data follows this process:

  1. Events to be deleted are specified by a config file.
  2. Each config file maps to a data deletion job.
  3. Each data deletion job is assigned a unique job ID.
  4. Each job has one of the following statuses: INACTIVE (0), ACTIVE (1), or DONE (2).
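
The job statuses above can be modeled in a few lines of Python. This is only an illustrative sketch, not Interana code: the names DeleteJobStatus and can_remove are hypothetical, and the removal rule mirrors what this article says about ia data remove-delete-job (only inactive or done jobs can be removed unless --force is used).

```python
from enum import IntEnum


class DeleteJobStatus(IntEnum):
    """Status codes as listed in this article (class name is hypothetical)."""
    INACTIVE = 0  # created, not yet scheduled
    ACTIVE = 1    # scheduled for deletion
    DONE = 2      # deletion finished


def can_remove(status: DeleteJobStatus, force: bool = False) -> bool:
    """Only inactive or done jobs can be removed, unless removal is forced."""
    return force or status in (DeleteJobStatus.INACTIVE, DeleteJobStatus.DONE)
```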

Selective data deletion commands

The Interana CLI provides commands that allow fine-grained control over selective data deletion. You use these commands in conjunction with a job that uses a config file. The following table lists the ia data command arguments. For more information, see the Interana CLI Reference.

ia data [--version] [-h] {list-delete-jobs, create-delete-job, preview-delete-job, run-delete-jobs, remove-delete-job, help}
  list-delete-jobs Shows a list of selective data deletion jobs in the database.
  create-delete-job Creates a selective data deletion job based on a specified configuration file.
  preview-delete-job Returns a sampled query of events matching the specified configuration file.
  run-delete-jobs Marks all inactive jobs as active and ready for deletion. The default is dry-run mode.
  remove-delete-job Given an ID, removes a data deletion job from the database. You can specify multiple jobs. Only inactive or done jobs can be removed.
  --version Shows the Interana version number, then exits.
  -h, --help Shows the help for this command, then exits.

List all selective data deletion jobs

You can display a list of all selective data deletion jobs with the ia data list-delete-jobs command, along with information on each job.

ia data list-delete-jobs    
  --output {json,text,table} Sets the output format to json, text, or table. The default is table.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! DEV ONLY!
  -h, --help Shows the help for this command and then exits.

Output time values are shown in human-readable format, as shown in the following example. In a config file, time values must be in Unix epoch time (milliseconds), as shown in Config file for selective data deletion.

To list selective data deletion jobs, enter the following command.
ia data list-delete-jobs

  Job ID  Start time    End time        Filters                                                                            Status    Create Time              Update Time
--------  -----------   ------------  -------------------------------------------------------------------------------     ---------  -----------------------  ----------------------      
       1    ---           ---          [{"column": "model", "table": "query_usage", "values": [9], "column_id": 2}]         Done     2017/12/14 05:08:01 UTC  2017/12/14 10:33:20 UTC
       2    ---           ---          [{"column": "model", "table": "query_usage", "values": [9], "column_id": 2}]         Done     2017/12/14 05:12:18 UTC  2017/12/14 10:33:20 UTC
       3    ---           ---          [{"column": "model", "table": "query_usage", "values": [9], "column_id": 2}]         Done     2017/12/14 05:14:38 UTC  2017/12/14 10:33:20 UTC
       4    ---           ---          [{"column": "model", "table": "query_usage", "values": [9], "column_id": 2}]         Done     2017/12/14 05:16:10 UTC  2017/12/14 10:33:20 UTC
       5    ---           ---          [{"column": "model", "table": "query_usage", "values": [9], "column_id": 2}]         Done     2017/12/14 05:17:33 UTC  2017/12/14 10:33:20 UTC     

Note that when deleting string values, the "values" listed in the output of ia data list-delete-jobs will be integers that are the internal storage IDs of those strings on the data tier. 

Create a selective data deletion job

You create a selective data deletion job with the ia data create-delete-job command that references a config file. The following table lists the command arguments.

ia data create-delete-job    
  config_file A JSON file that specifies the selective data delete configuration.
  --output {json,text,table} Sets the output format to JSON, text, or table. The default is table.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  --example-config Displays an example config file, for reference.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! DEV ONLY!
  -h, --help Shows the help for this command and then exits.

A selective data deletion job uses the details of the specified config file. The job is created with an INACTIVE status, and is not scheduled until you run the ia data run-delete-jobs command with the --run option.

If you are unsure about how to structure a config file for a selective delete job, you can view an example config file with the following command.

ia data create-delete-job --example-config

-------- Sample Data Delete Create Config --------

{
    "table_name": "music",
    "start_time": 0,
    "end_time": 1510016107744,
    "filters": {
        "user_id": ["eccbc87e-cfcd2084-45c48cce-45c48cce", "66e7dff9-28308fd9-66e7dff9-ea1afc51"],
        "anonymous_id": ["6505913639713474836", "8143414483406512381"]
    }
}

NOTE: Filters are AND'd together, meaning an event will only be deleted if user_id is one of
["eccbc87e-cfcd2084-45c48cce-45c48cce", "66e7dff9-28308fd9-66e7dff9-ea1afc51"] AND anonymous_id is
one of ["6505913639713474836", "8143414483406512381"].
To create a selective delete job, use the following command:
ia data create-delete-job [path/to/config_file]
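
To make the AND semantics in the note above concrete, the following Python sketch checks whether a single event would be selected by a config. It is a local illustration only, not how Interana evaluates jobs. The function name event_matches and the time_column parameter are hypothetical, and treating both time bounds as inclusive is an assumption.

```python
from typing import Any, Dict


def event_matches(event: Dict[str, Any], config: Dict[str, Any],
                  time_column: str = "ts") -> bool:
    """Return True if the event would be selected by the config.

    time_column is a hypothetical name for the event timestamp (epoch ms).
    """
    timestamp = event[time_column]
    start = config.get("start_time")
    end = config.get("end_time")
    # Assumption: both time bounds are inclusive.
    if start is not None and timestamp < start:
        return False
    if end is not None and timestamp > end:
        return False
    # Filters are AND'd: every filtered column must hold one of its listed values.
    for column, allowed_values in config["filters"].items():
        if column not in event or str(event[column]) not in allowed_values:
            return False
    return True
```

With the sample config, an event is selected only when its user_id is in the first list and its anonymous_id is in the second list.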

Preview a selective data deletion job

You can use the ia data preview-delete-job command to view how many events match the filters specified by a selective data deletion config file. The query is sampled by default, and returns a close approximation of the event count. There is a limit of 100 filters for ia data preview-delete-job.

For an unsampled query that returns an exact event count, use the --exact option with the ia data preview-delete-job command.

ia data preview-delete-job    
  config_file A JSON file that specifies the selective data delete configuration.
  --exact Runs an unsampled query to get the exact event count. The default sampled query returns a close approximation.
  --output {json,text,table} Sets the output format to JSON, text, or table. The default is table.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! DEV ONLY!
  -h, --help Shows the help for this command and then exits.

You can use this command to see how many events would be deleted using the config file. After a job has run, a result with no matching events confirms that the job associated with the config file completed successfully.

To preview selective delete jobs for a config file, use the following command:
ia data preview-delete-job [path/to/config_file]

Execute selective data deletion jobs

You can use the ia data run-delete-jobs command to do the following:

  • Show the INACTIVE jobs waiting to be scheduled for deletion.
  • Use the --run option to mark all INACTIVE jobs as ACTIVE, thereby scheduling the jobs for deletion.

Selective data deletion jobs are INACTIVE by default. You must use the --run option to set the jobs to ACTIVE.

ia data run-delete-jobs    
  -r, --run Executes the command, marking all selective delete jobs as ACTIVE, effectively scheduling the data deletions. The default is dry-run mode.
  --output {json,text,table} Sets the output format to JSON, text, or table. The default is table.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! For developers ONLY! 
  -h, --help Shows the help for this command and then exits.
To show a list of INACTIVE selective delete jobs, use the following command:
ia data run-delete-jobs

Currently inactive job IDs: 3. Use -r/--run to activate them.
To mark all INACTIVE selective delete jobs as ACTIVE, use the following command:
ia data run-delete-jobs --run
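
The commands described so far can be strung together into a preview, create, dry-run, run sequence. The Python sketch below only builds the argument lists, using subcommands and options that appear in this article. The function name delete_workflow_commands is hypothetical, and the ordering is one reasonable workflow, not a documented requirement.

```python
from typing import List


def delete_workflow_commands(config_path: str, exact: bool = False) -> List[List[str]]:
    """Build the ia data command lines for one selective delete workflow."""
    preview = ["ia", "data", "preview-delete-job", config_path]
    if exact:
        preview.append("--exact")  # unsampled query, exact event count
    return [
        preview,
        ["ia", "data", "create-delete-job", config_path],
        ["ia", "data", "run-delete-jobs"],           # dry run: shows INACTIVE jobs
        ["ia", "data", "run-delete-jobs", "--run"],  # marks all INACTIVE jobs ACTIVE
        ["ia", "data", "list-delete-jobs"],          # check job status afterwards
    ]
```

Each list could be passed to subprocess.run(command, check=True) on a machine where the ia CLI is installed. Keep in mind that --run activates every INACTIVE job, not only the one just created.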

Remove a selective data deletion job

Use the ia data remove-delete-job command to remove a data deletion job associated with the specified job ID.

ia data remove-delete-job    
  job_id Specifies the ID of the job to be removed.
  --output {json,text,table} Sets the output format to json, text, or table. The default is table.
  -f, --force Remove jobs even if they are active.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! DEV ONLY!
  -h, --help Shows the help for this command and then exits.
To remove a selective delete job, use the following command:
ia data remove-delete-job <job_ID>

Config file for selective data deletion

The config file specifies the conditions for the selective data deletion. The config file should be written in JSON, where each line is a condition, and must conform to the following requirements.

Requirements
  • The table_name is required, and at least one filter must be specified.
  • The start_time and end_time are optional, and will default to the data's start and end date if not specified. 
  • Time values must be in Unix epoch time (milliseconds), as shown in the following example. Output time values display in human-readable format, as shown in List all selective data deletion jobs.

Remember that epoch time correlates to UTC, so if you want to delete something from one time to another in a different time zone, make sure the time zone offset is factored in.

  • All keys and filter values must be enclosed in double quotes (" "), whether the value is a string or an integer, and comma-separated lists must be in brackets [ ].
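
Because start_time and end_time are epoch milliseconds in UTC, a wall-clock time in another time zone has to be converted with its offset. The following Python sketch shows one way to do that with the standard library. The function name to_epoch_ms is hypothetical, and the input is assumed to be an ISO 8601 string that carries an explicit UTC offset.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(iso_time: str) -> int:
    """Convert an ISO 8601 time with a UTC offset to Unix epoch milliseconds."""
    moment = datetime.fromisoformat(iso_time)
    if moment.tzinfo is None:
        # Refuse naive times so the time zone offset is never guessed.
        raise ValueError("include a UTC offset, e.g. 2017-11-06T16:00:00-08:00")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)
```

For example, to_epoch_ms("2017-11-06T16:00:00-08:00") and to_epoch_ms("2017-11-07T00:00:00+00:00") both return 1510012800000, because they name the same instant.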

Config file example

The following JSON example specifies these conditions:

  • The table_name is: NewUsers
  • The timestamp (epoch, milliseconds) is in the range: [0, 1510016107744] 
  • UI_ColumnName1 has values: ["1",  "2", "3"]
  • UI_ColumnName2 has values: ["John", "Jacob", "Jessica"] 

Only events that match these parameters will be deleted. Additional column filters can be added as needed.

{
    "table_name": "NewUsers",
    "start_time": 0,
    "end_time": 1510016107744,
    "filters": { 
        "UI_ColumnName1": ["1", "2", "3"],
        "UI_ColumnName2": ["John", "Jacob", "Jessica"]
    }
}
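
If you generate config files from a script, a small helper can enforce the requirements listed above before the file is written. This Python sketch is an assumption about how such a helper might look and is not part of the Interana CLI. The function name build_delete_config is hypothetical.

```python
import json
from typing import Dict, List, Optional


def build_delete_config(
    table_name: str,
    filters: Dict[str, List[object]],
    start_time: Optional[int] = None,
    end_time: Optional[int] = None,
) -> dict:
    """Build a config dict that follows the requirements in this article."""
    if not table_name:
        raise ValueError("table_name is required")
    if not filters:
        raise ValueError("at least one filter must be specified")
    config: dict = {"table_name": table_name}
    # start_time and end_time are optional epoch-millisecond values.
    if start_time is not None:
        config["start_time"] = start_time
    if end_time is not None:
        config["end_time"] = end_time
    # Filter values are written as strings, as in the example config.
    config["filters"] = {
        column: [str(value) for value in values]
        for column, values in filters.items()
    }
    return config


if __name__ == "__main__":
    example = build_delete_config(
        "NewUsers",
        {"UI_ColumnName1": [1, 2, 3], "UI_ColumnName2": ["John", "Jacob", "Jessica"]},
        start_time=0,
        end_time=1510016107744,
    )
    print(json.dumps(example, indent=4))
```

Run as a script, this prints JSON equivalent to the example above, which you could save to a file and pass to ia data create-delete-job.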

You can specify additional config files so that an event is purged if it matches any one of the conditions. For example, the following config files would specify condition1 OR condition2 OR ... conditionx.

{"table_name": "OldUsers", "start_time": 0, …} // file1/condition1
{"end_time": 1510016107744, "filters": [...], …} // file2/condition2
...
{"table_name": "AllUsers", "start_time": 0, …}  // filex/conditionx
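
One way to manage OR'd conditions is to keep them in a list and write one config file per condition, each of which then becomes its own delete job. The Python sketch below illustrates this under stated assumptions: the function name write_condition_files and the condition<N>.json file naming are hypothetical, and each condition is assumed to be a complete, valid config on its own.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_condition_files(conditions: List[Dict], directory: str) -> List[Path]:
    """Write one JSON config file per condition and return the file paths."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, condition in enumerate(conditions, start=1):
        # Hypothetical naming scheme: condition1.json, condition2.json, ...
        path = out_dir / f"condition{index}.json"
        path.write_text(json.dumps(condition, indent=4))
        paths.append(path)
    return paths
```

Each returned path would be passed to a separate ia data create-delete-job call, so an event that matches any one of the files is deleted.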