Perform an automated privacy purge

To purge data from Interana, for example to comply with GDPR, you have two options, which you can also combine:

  • An ad hoc privacy purge, designed to be run infrequently, perhaps once per month. See Perform an ad hoc privacy purge for details.
  • An automated purge pipeline, which runs continuously, in a similar way to the input pipeline: it watches for purge requests and picks them up as they appear. This article describes how to use an automated purge pipeline.

For an overview of privacy purges in Interana, see About Interana privacy purges.

Create an automated privacy purge

Creating an automated privacy purge is analogous to ingesting data (see Get data in using the CLI). The steps are as follows:

  1. Create a purge pipeline that identifies the data to delete.
  2. Create a purge job that runs the pipeline.

Create a purge pipeline

Only someone with the ia_admin permission bundle can create a purge pipeline.

Each purge request includes a command_id and, optionally, a path for the receipt. Create the pipeline with the ia pipeline create command. For example:

ia pipeline create IA_PURGE_PIPELINE_1 IA_EVENTS file_system \
  -p file_pattern '/home/ubuntu/kramamurthy/requests/{year:04d}-{month:02d}-{day:02d}/*' \
  -t /home/ubuntu/import/transformation.jsonl --pipeline-type purge

In this example, IA_PURGE_PIPELINE_1 is the pipeline name and IA_EVENTS is a table name. The --pipeline-type must be purge.

A table name is mandatory with the pipeline command, but it has no bearing on the purge itself: the purge runs across all tables, regardless of which table is named.
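
What a request file contains depends on your transformation, but conceptually each line is one purge request carrying the command_id and, optionally, a receipt path. A minimal sketch of one request line, with hypothetical field names (only command_id is confirmed by this article; check the request schema used by your deployment):

{"command_id": "00079317-3657-41bc-b68a-e89b107c3182", "receipt_path": "/home/ubuntu/receipts/"}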

Create a one-time (ad hoc) purge job

For example, if pipeline_id 1 is a purge pipeline, then

ia job 1 onetime 2 today

creates an import job for that purge pipeline. The job does a one-time search of the folders in the file pattern, from the folder representing data from 2 days ago through the folder representing today's data. Once the search completes and all the files found have been downloaded, transformed, and added to the request queue, the job is marked as Done.
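
To make the date window concrete, suppose today is 2020-03-03 and the pipeline uses the file_pattern shown earlier. A one-time job with the arguments 2 today then searches the following folders; this expansion is a sketch based on the {year}-{month}-{day} placeholders in the pattern:

/home/ubuntu/kramamurthy/requests/2020-03-01/*
/home/ubuntu/kramamurthy/requests/2020-03-02/*
/home/ubuntu/kramamurthy/requests/2020-03-03/*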

Create a continuous purge job

To run the pipeline, start a continuous purge pipeline import job that searches for requests over a range of iteration_dates, in the same manner as a regular import pipeline.

For example, if pipeline_id 1 is a purge pipeline, then

ia job 1 continuous 2 today

creates an import job for that purge pipeline that searches the folders in the file pattern, from the folder representing data from 2 days ago through the folder representing today's data.
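
Judging from the one-time and continuous examples above, the job command takes the following general shape; the placeholder names are descriptive labels, not documented parameter names:

ia job <pipeline_id> <onetime|continuous> <days_ago> <end_date>

Unlike a one-time job, a continuous job does not finish after the initial search; it keeps watching the window and picks up new requests as they appear.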

Interana searches the pipeline file path for requests, downloads them, transforms them if necessary, and sends them to the configDB to be stored in a queue of request jobs.

If a purge job is already running, the new job is set to inactive status. A new wave kicks off as follows: the schedule-server service on the config node runs the metadata delete, the data-manager service on the config node runs the string tier delete, and the data-manager service runs the data delete. Once a wave completes, those jobs are marked as completed, and any inactive jobs become part of the next purge wave, which then kicks off.

See the cluster lexicon entry for a summary of the nodes in an Interana deployment.

Receipt generation

The purge pipeline also generates receipts once per day. By default, the receipts appear in the object store, in a separate folder.

Every day, two receipt files are uploaded: the receipt and the summary receipt.

The format for the receipt filename is as follows:

receipt_{cluster_name}_{today date}.jsonl

The summary receipt filename follows the same pattern with a summary_ prefix:

summary_{cluster_name}_{today date}.jsonl

The cluster name comes from the configDB; the default cluster name is 'interana'. An example receipt filename and contents for Mar 03 2020 are as follows:

receipt_testcluster_2020-03-03.jsonl
{
  'Timestamp': '2020-03-03',
  'CommandId': '00079317-3657-41bc-b68a-e89b107c3182',
  'Cluster': 'testcluster'
}

An example of the receipt filename and contents for Mar 04 2020 is as follows:

receipt_testcluster_2020-03-04.jsonl
{
  'Timestamp': '2020-03-04',
  'CommandId': '00079317-3657-41bc-b68a-e89b107c3183',
  'Cluster': 'testcluster'
}

An example of a summary receipt for Mar 03 2020 is as follows:

summary_testcluster_2020-03-03.jsonl
{
  'cluster_index': 0,
  'command_ids_processed': 1,
  'folders_processed': [
    {
      'folder_date': '2020-02-29',  # iteration date: purging data for Feb 29 2020
      'command_ids_processed': 1
    }
  ]
}
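
If your object store is S3 (an assumption; deployments vary), you can spot-check that a given day's receipts were uploaded with a standard listing command. The bucket and prefix shown here are hypothetical:

# Hypothetical bucket and prefix; substitute your deployment's receipt folder.
aws s3 ls s3://my-interana-bucket/purge-receipts/ | grep 2020-03-03
# Expect two files: receipt_testcluster_2020-03-03.jsonl and summary_testcluster_2020-03-03.jsonl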

Troubleshoot an automated privacy purge

To find the status of your automated privacy purge pipeline, use the following CLI command:

ia purge status

The output shows the number of jobs in each status (such as receipts_generated, active, interrupted, and inactive), along with the job start and end dates. For example:

Status                Number of jobs  Start date    End date
------------------  ----------------  ------------  ----------
receipts_generated               499  2020-03-10    2020-03-10
active                          1000  2020-03-09    2020-03-10
interrupted                      250  2020-03-09    2020-03-09
inactive                         251  2020-03-09    2020-03-09

In addition, use the following commands to check pipeline and job status:

ia pipeline list
ia job list