Skip to main content
Interania

How to analyze Interana logs with Datadog

0votes
15updates
63views

You can monitor Interana with your monitoring system, using a short python script to parse events from the Interana syslog (/var/log/syslog).

This document provides examples for monitoring Interana syslog with Datadog, and lists other logs that you may want to monitor. Click a link to jump directly to the topic:

Datadog check file for Interana syslog

The following example script uses Datadog to monitor Interana by checking the Interana syslog for errors.

import sys
import os
import json
import subprocess
sys.path.append(os.path.dirname(__file__))
if __name__ == "__main__":
   from ddcommon import AgentCheck
else:
   from checks import AgentCheck
class SyslogCheck(AgentCheck):
   def check(self, instance):
       '''
       Very basic check showing how we could use the new syslog for datadog monitoring.
       The permissions on interana_syslog at the moment require us to use sudo.
       We could also open the file and parse that rather than using grep to get what we want out of the syslog, this requires changing permissions on the file.
       '''
       try:
           error_count = 0
           filepath = instance['log_path']
           tags = ['syslog_test', instance['cluster_name']]
           output = subprocess.check_output("sudo grep -oP '\K{.*}' /var/log/interana_syslog", shell=True)
           for l in output.split('\n'):
               try:
                   line = json.loads(l)
               except Exception as e:
                   # Not valid json
                   continue
               print line
               if line['severity'] == 'ERROR':
                   error_count += 1
           self.gauge('{}.{}'.format('interana.error-check',
                                     'Number of errors detected'), error_count, tags=tags)
           
           if error_count > 10:
               self.service_check(
                   check_name=instance['name'],
                   status=AgentCheck.WARNING,
                   message='High number of errors in /var/log/interana_syslog. Error Count: {}'.format(error_count))
       except Exception, e:
           print e
if __name__ == "__main__":
   i = SyslogCheck(None, None, None)
   i.check(None)

Parsing Interana syslog for purification events

The following example continuously parses each line of the interana_syslog file for purification events to catch errors in the system.

import logging
logging.basicConfig(level=logging.DEBUG)
import json
import re

def lines_parsed(logger, line):
   if 'git_describe' in line:
       try:
           log = json.loads(line)
       except ValueError:
           return
       ia_version = log['git_describe']
       if '2.xx' in ia_version:
           result = event_count(logger, log) if 'event_name' in log else None
       else:
           result = lines_read(logger, log) if 'lines_read' in log else None
   else:
       try:
           log = json.loads(line)
       except ValueError:
           return
       result = event_count(logger, log) if 'event_name' in log else None
   return result

def event_count(logger, log):
   try:
       sev = log['severity']
   except KeyError:
       sev = 'INFO'
   event_name = log['event_name']
   if sev == 'INFO' and event_name == 'purifier_parse_finish':
       date = int(log['__time__']) / 1000.0
       dot = '.'
       process = log['process']
       event_class = log['event_class']
       event_count = int(log['event_count'])
       metric_event_count = ['interana', event_class, process, 'event_count']
       metric_event_count = dot.join(metric_event_count)
       attr_dict = {'metric_type': 'counter', 'unit': 'events'}
       return (metric_event_count, date, event_count, attr_dict)
   else:
       return None

def lines_read(logger, log):
   try:
       sev = log['severity']
   except KeyError:
       sev = 'INFO'
   activity_name = log['activity_name']
   event_name = log['event_name']
   if sev == 'INFO' and activity_name == 'threaded_tasks' and event_name == 'purifier_worker_progress':
       date = int(log['__time__']) / 1000.0
       dot = '.'
       process = log['process']
       event_class = log['event_class']
       event_count = int(log['event_count'])
       lines_read = int(log['lines_read'])
       lines_skipped = lines_read - event_count
       metric_event_count = ['interana', event_class, process, 'event_count']
       metric_event_count = dot.join(metric_event_count)
       metric_lines_read = ['interana', event_class, process, 'lines_read']
       metric_lines_read = dot.join(metric_lines_read)
       metric_lines_skipped = ['interana', event_class, process, 'lines_skipped']
       metric_lines_skipped = dot.join(metric_lines_skipped)
       attr_dict = {'metric_type': 'counter', 'unit': 'events'}
       return [(metric_event_count, date, event_count, attr_dict),
               (metric_lines_read, date, lines_read, attr_dict),
               (metric_lines_skipped, date, lines_skipped, attr_dict)]
   else:
       return None

def test():
   # Set up the known log entries for getting the error
   test_file = open('/tmp/purifier.log', 'r')
   for line in test_file:
       line = line.strip()
       # log rotation will add brackets
       # which messes with the json parser
       # so need to make sure json is pure
       b = '{}'
       if b in line:
           line = re.sub(b, '', line)
       parsed = lines_parsed(logging, line)
       if parsed:
           logging.debug("LINES: {}".format(parsed, ))

if __name__ == '__main__':
   # For local testing on the command line
   test()

Other Interana logs to monitor

This section lists other Interana logs that you may want to monitor:

backup.log

Monitors Interana backups. The following events are useful:

  • event_type — backup_succeeded
  • __time__ (seconds, epoch timestamp)
  • mountpoint

cardinality_monitor.log

Monitors column size and whether or not the number of strings is excessive. The following events are useful:

  • __time__
  • column_id
  • num_strings
  • column_size
  • column_name
  • event_name

import-pipeline.log

Monitors ingest on a pipeline by pipeline basis. The following events are useful:

  • pipeline_id
  • event_name
  • __time__
  • table_id
  • job_id

merge-server.log

Monitors activity of query responses. Should requests fail, you can determine the node or nodes that are affected. The following events are useful:

  • __time__
  • activity_name
  • event_name

precacher.log 

Decides what charts to keep in the cache to load quickly. The following events are useful:

  • chart
  • __time__
  • event_class
  • cache_key
  • dashboard_id

purifier.log

Monitors data transformation processes. The following events are useful:

  • event_name
  • job_id
  • pipeline_id
  • __time__
  • table_id

query_api_server.log

Monitors individual queries and what their responses. The results will be specific to your queries.

query-server.log

Monitors the query server. The following events are useful:

  • __time__
  • process — read-server
  • event_class
  • activity_name
  • user_id
  • event_name — activity_begin, activity_end
  • query_api_id

string-server.log

Monitors the string server. The following events are useful:

  • __time__
  • activity_name
  • process_id
  • event_name — activity_begin, activity_end
  • purifier_filename
  • table_id
  • string_leaf_strings
  • string_leaf_appends
  • pipeline_id
  • job_id

For more information

See the following topics:

  • Was this article helpful?