Skip to main content
Interania

Selective data deletion

0votes
78updates
351views

This feature is new in Release 2.25.

This article explains how to selectively delete data based on specified filter criteria, such as time range, shard key, or event type. Click a link to jump to the topic:

Advantages of selectively deleting data

There are a number of reasons why you might want to selectively delete events:

  • When there are garbage records that can be identified with the use of a boolean expression.
  • To comply with legal requirements, you may need to delete all logs of activity for users who request a privacy purge. There is a grace period between the time a purge request is placed and when the data must purged (see the EU GDPR web site for time limits for compliance). 
  • When there is a long retention period for specific high-value events, and a short retention period for all other events. You can use selective delete to periodically delete events of a particular type that are older than a specified date.
  • When you have ingested data from multiple data sources into a single table with some way to identify the data source (i.e. a column with data source information, specific column(s) only populated by certain data sources, etc), and wish to delete the events from just one of those data sources. 

Overview of selective data deletion

Interana admins can selectively delete events that match filter criteria, such as the following:

  • A specific time range (x < time < y)
  • A collection of event types (event =x,y,z)
  • Events for a particular user (userid=123)

Selective data deletion works as follows:

  1. Events to be deleted are specified by a config file.
  2. Each config file maps to a data deletion job.
  3. Each data deletion job is assigned a unique job ID.
  4. Possible delete job statuses are: INACTIVE (0), ACTIVE (1), DONE (2), or INTERRUPTED (3).

Selective data deletion commands

The Interana CLI commands allow for fine-grained control of selectively deleting data. You use these commands in conjunction with a job that utilizes a config file.

Use the following recommended process to perform a selective delete:

  1. List all selective delete jobs.
  2. Preview selective data deletion.
  3. Create a selective data deletion job.
  4. Execute a selective data deletion jobs.

ia data commands

The following table lists the ia data command arguments. Click a command argument to jump to that section. For more information, see the Interana CLI Reference.

ia data [--version] [-h] {list-delete-jobs, create-delete-job, preview-delete-job, run-delete-jobs, remove-delete-job, help}
  list-delete-jobs Shows a list of selective data deletion jobs in the database.
  preview-delete-job

Returns a sampled query of events matching the specified configuration file.

NOTE: There is a limit of 100 filters for ia data preview-delete-job.

  create-delete-job Creates a selective data deletion job based on a specified configuration file.
  run-delete-jobs Mark all inactive jobs as active and ready for deletion. The default is dry-run mode.
  interrupt-delete-jobs Stops all active data delete jobs, and marks them as INTERRUPTED.
  reactivate-interrupted-jobs Marks interrupted jobs as active. Particular jobs can be specified. If no specific jobs are specified, then all interrupted jobs will be reactivated. Jobs must have status, INTERRUPTED, to be reactivated.
  remove-delete-job Given an ID, remove a data delete job from the database. You can specify multiple jobs. Only inactive/done jobs are removed.
  --version Shows the Interana version number, then exits.
  -h, --help Shows the help for this command, then exits.

List all selective data deletion jobs

You can display a list of all selective data deletion jobs with the ia data list-delete-jobs command, along with information on each job.

ia data list-delete-jobs    
  --output {json,text,table} Sets the output format to JSON, text, or table. The default is table.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! DEV ONLY!
  -h, --help Shows the help for this command and then exits.

 

Output time values are shown in human-readable format, as shown in the following example. In a config file, time values must be in Unix epoch time (milliseconds), as shown in Config file for selective data deletion.

To list selective data deletion jobs, enter the following command.
ia data list-delete-jobs

  Job ID  Start time    End time        Filters                                                                            Status    Create Time              Update Time
--------  -----------   ------------  -------------------------------------------------------------------------------     ---------  -----------------------  ----------------------      
       1    ---           ---          [{"column": "model", "table": "query_usage", "values": [9], "column_id": 2}]         Done     2017/12/14 05:08:01 UTC  2017/12/14 10:33:20 UTC
       2    ---           ---          [{"column": "model", "table": "query_usage", "values": [9], "column_id": 2}]         Done     2017/12/14 05:12:18 UTC  2017/12/14 10:33:20 UTC
       3    ---           ---          [{"column": "model", "table": "query_usage", "values": [9], "column_id": 2}]         Done     2017/12/14 05:14:38 UTC  2017/12/14 10:33:20 UTC
       4    ---           ---          [{"column": "model", "table": "query_usage", "values": [9], "column_id": 2}]         Done     2017/12/14 05:16:10 UTC  2017/12/14 10:33:20 UTC
       5    ---           ---          [{"column": "model", "table": "query_usage", "values": [9], "column_id": 2}]         Done     2017/12/14 05:17:33 UTC  2017/12/14 10:33:20 UTC     

When deleting string values, the "values" listed in the output of ia data list-delete-jobs are the internal storage ID integers of the strings on the data tier

Preview selective data deletion

 

You can use the ia data preview-delete-job command to view how many events match the filters specified by a selective data deletion config file. By default, the query is sampled and returns a close approximation of the event count.

Best Practice: Run a sample query with the exact same criteria in Explorer (UI) to make sure the results match. 

For an unsampled query that returns an exact event count, use the --exact option with ia data preview-delete-job. There is a limit of 100 filters for ia data preview-delete-job.

 

ia data preview-delete-job    
  config_file A JSON file that specifies the table name, start and end times, and filters for the selective data delete job.
  --exact

Runs an unsampled query to get the exact event count.

NOTE: The default sampled query returns a close approximation.

  --output {json,text,table} Sets the output format to JSON, text, or table. The default is table.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! DEV ONLY!
  -h, --help Shows the help for this command and then exits.

You can use the ia data preview-delete-job command to view a list of events will be deleted. Results with no matching events is a confirmation of the successful completion of the job associated with the config file.

To preview selective delete jobs for a config file, use the following command:
ia data preview-delete-job [path/to/config_file]

Create a selective data deletion job

You create a selective data deletion job with the ia data create-delete-job command that references a config file. The following table lists the command arguments.

ia data create-delete-job    
  config_file A JSON file that specifies the selective data delete configuration.
  --output {json,text,table} Sets the output format to JSON, text, or table. The default is table.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  --example-config Displays an example config file, for reference.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! DEV ONLY!
  -h, --help Shows the help for this command and then exits.

A selective delete job is created with an INACTIVE status, and is not scheduled until executed with the
ia data run-delete-jobs command with the --run option.

A selective data deletion job uses the details of a specified config file. If you are unsure about how to structure a config file for a selective delete job, you can view an example config file with the following command.

ia data create-delete-job --example-config

-------- Sample Data Delete Create Config --------

{
    "table_name": "music",
    "start_time": 0,
    "end_time": 1510016107744,
    "filters": {
        "user_id": ["eccbc87e-cfcd2084-45c48cce-45c48cce", "66e7dff9-28308fd9-66e7dff9-ea1afc51"],
        "anonymous_id": ["6505913639713474836", "8143414483406512381"]
    }
}

NOTE: Filters are AND'd together, meaning an event will only be deleted if user_id is one of
["eccbc87e-cfcd2084-45c48cce-45c48cce", "66e7dff9-28308fd9-66e7dff9-ea1afc51"] AND anonymous_id is
one of ["6505913639713474836", "8143414483406512381"].

The ia data create-delete-job command actually creates the purge job. If there is a mistake and you create another job to correct it, be sure to remove the bad job. Otherwise, the bad job automatically becomes ACTIVE when you run the good job.

To create a selective delete job, use the following command:
ia data create-delete-job [path/to/config_file]

Execute selective data deletion jobs

You can use the ia data run-delete-job command to do the following:

  • Show the INACTIVE jobs waiting to be scheduled for deletion.
  • Use the --run option to mark all INACTIVE jobs as ACTIVE, thereby scheduling the jobs for deletion.

Selective data deletion jobs are INACTIVE by default. Use the --run option to set ALL jobs to ACTIVE.

ia data run-delete-jobs    
  -r, --run Executes the command, marking all selective delete jobs as ACTIVE, effectively scheduling the data deletions. The default is dry-run mode.
  --output {json,text,table} Sets the output format to JSON, text, or table. The default is table.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! For developers ONLY! 
  -h, --help Shows the help for this command and then exits.
To show a list of INACTIVE selective delete jobs, use the following command:
ia data run-delete-jobs

Currently inactive job IDs: 3. Use -r/--run to activate them.
To mark all INACTIVE selective delete jobs as ACTIVE, use the following command:
ia data run-delete-jobs --run

Interrupt and reactivate an interrupted data deletion job

Use the ia data interrupt-delete-jobs stops all active delete jobs.

ia data interrupt-delete-jobs    
  --output {json,text,table} Sets the output format to json, text, or table. The default is table.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! DEV ONLY!
  -h, --help Shows the help for this command and then exits.

Use the ia data reactivate-interrupted-jobs command to mark interrupted jobs as active. You can specify particular jobs to be reactivated. If no specific jobs are specified, then all interrupted jobs will be reactivated. Jobs must have an INTERRUPTED status to be reactivated..

ia data reactivate-interrupted-jobs    
  --job_id <job-id> <job-id> ... Specify the ID of the jobs to be reactivated. If no specific job is specified, all interrupted jobs are reactivated.
  --output {json,text,table} Sets the output format to json, text, or table. The default is table.
  -f, --force Remove jobs even if they are active.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! DEV ONLY!
  -h, --help Shows the help for this command and then exits.

Remove a selective data deletion job

Use the ia data remove-delete-job command to remove a data deletion job associated with the specified job ID.

ia data run-delete-jobs    
  job_id Specify the ID of job to be removed.
  --output {json,text,table} Sets the output format to JSON, text, or table. The default is table.
  -f, --force Remove jobs even if they are active.
  --instance-name <cluster_name> Specify the cluster name, if you are using multiple clusters.
  -v, --verbose Displays more information, such as stack traces on errors.
  --version Shows the Interana version number.
  --unsafe Does not verify SSL certificates. DANGER! DEV ONLY!
  -h, --help Shows the help for this command and then exits.
To remove a selective delete job, use the following command:
ia data remove-delete-job <job_ID>

Config file for selective data deletion

The config file specifies the conditions for selective data deletion. The config file should written in JSON, where each line is a condition, and conform to the following requirements. 

Requirements
  • The table_name is required, and at least one filter must be specified.
  • The start_time and end_time are optional, and will default to the data's start and end date if not specified. 
  • Time values must be in Unix epoch time (milliseconds), as shown in the following example. Output time values display in human-readable format, as shown in List all selective data deletion jobs.

Remember that epoch time correlates to UTC, so if you want to delete something from one time to another in a different time zone, make sure the time zone offset is factored in.

  • ALL variables must be enclosed in double quotes (" ") whether string or integer, and comma-separated lists must be in brackets [ ].

Config file example

The following JSON example specifies the following:

  • The table_name is: NewUsers
  • The timestamp (epoch, milliseconds) is in the range: [0, 1510016107744] 
  • UI_ColumnName1 has values: ["1",  "2", "3"]
  • UI_ColumnName2 has values: ["John", "Jacob", "Jessica"] 

Only events that match these parameters will be deleted. Additional column filters can be added as needed.

Filters are AND'd together. In the following example, an event will only be deleted if UI_ColumnName1 is one of ["1", "2", "3"] AND UI_ColumnName2 is one of ["John", "Jacob", "Jessica"].

{
    "table_name": "NewUsers",
    "start_time": 0,
    "end_time": 1510016107744,
    "filters": { 
        "UI_ColumnName1": ["1", "2", "3"],
        "UI_ColumnName2": ["John", "Jacob", "Jessica"]
    }
}

You can specify multiple config files so the event can match any one of the conditions to be purged.

Config files are OR'd together. In the following example, an event will be deleted if its conditions match the config file condition1 OR config file condition2 OR ... config file conditionx.

{“table_name”: "OldUsers", “start_time”: 0, …} // file1/condition1
{“end_time”: 1510016107744, “filters”: [...], …} // file2/condition2
...
{“table_name”: AllUsers, “start_time”: 0, …}  // filex/conditionx