Set up an S3 bucket

The purpose of this how-to is to provide best practices for S3 bucket layout.  If you follow these guidelines, you can be confident that your Interana ingest is as efficient as possible with respect to your S3 bucket.

 

How Interana looks for new files to import in the S3 bucket

Understanding how Interana looks for files to import is key to understanding why we suggest the S3 bucket layout described in the following sections.

Interana imports from S3 data sources using two types of jobs:

  • One-time backfill jobs: import all files available between two dates/times 
  • Continuous import jobs: continuously scan your bucket for new files to import

Both job types require a way to identify the dates of the events contained within each file.  And since we identify which files we need to import by listing the S3 bucket, we must set up the bucket so that we can find those files with as few list requests as possible.

You can only list an S3 bucket by key name prefix (not by update time, file size, and so on), and wildcard characters are not supported.
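
For example, here's a minimal sketch of an S3 list request using boto3 (the AWS SDK for Python). The bucket and prefix follow the layout recommended below; note that the only way to narrow the results is by key prefix:

    # A sketch of listing an S3 bucket with boto3: results can only be
    # narrowed by key prefix, never by wildcard, update time, or size.
    import boto3

    s3 = boto3.client("s3")

    # List everything under one day's folder (layout recommended below).
    response = s3.list_objects_v2(
        Bucket="interana-s3-logs",
        Prefix="clientlogs/2016/01/15/",
    )
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])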

The following points further support our recommended S3 bucket structure:

  • All data is brought into S3 via "import pipelines." Each pipeline has the following characteristics (see the sketch after this list):
    • The table the data will be imported into
    • The pattern of the file names in S3 that should be imported (like bucketname/tableid/year/month/day/hour/file.gz)
    • A transformer configuration that specifies any transformations to be made before ingest
  • When we run a one-time backfill or a continuous job, we associate a pipeline with that job so that we know what files to import and where to import them.
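
To make that concrete, here's a hypothetical sketch of the three pieces a pipeline ties together. It is purely illustrative; this is not Interana's actual configuration format:

    # Hypothetical illustration only; not Interana's actual config format.
    client_logs_pipeline = {
        # The table the data will be imported into.
        "table": "client_events",
        # The pattern of the file names in S3 that should be imported.
        "file_pattern": "interana-s3-logs/clientlogs/{year}/{month}/{day}/",
        # The transformer configuration applied before ingest.
        "transformer_config": "client_events_transformer.json",
    }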

Step 1: Organize by dataset

Now we're ready to work on our structure!  Let's say that we have a bucket named interana-s3-logs ready to go.  For organizational purposes, it's a good idea to separate your bucket by the datasets you will be importing into:

s3://interana-s3-logs/clientlogs/

and

s3://interana-s3-logs/serverlogs/
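
For instance, here's a sketch (using boto3; the file names are hypothetical) of uploading each file under the prefix of the dataset it belongs to. The date folders shown in the keys are explained in Step 2:

    # A sketch of uploading files under per-dataset prefixes (boto3).
    # File names are hypothetical; the date folders are covered in Step 2.
    import boto3

    s3 = boto3.client("s3")

    # Client-side logs go under the clientlogs/ dataset prefix ...
    s3.upload_file("client-events-001.gz", "interana-s3-logs",
                   "clientlogs/2016/01/15/client-events-001.gz")

    # ... and server-side logs go under serverlogs/.
    s3.upload_file("server-events-001.gz", "interana-s3-logs",
                   "serverlogs/2016/01/15/server-events-001.gz")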

Step 2: Organize by time of events in files

Immediately after this preliminary organization, you'll want to identify the year/month/day/(optional) hour that each file corresponds to (in UTC).  So, if a file contains timestamps from 3:00 AM UTC on January 15, 2016, you would put that file in:

s3://interana-s3-logs/clientlogs/2016/01/15/

or 

s3://interana-s3-logs/clientlogs/2016/01/15/03/

Including the hour allows us to import files from a specific hour of the day, which is occasionally useful but not necessary. Running an hourly pipeline in a continuous job uses 24 times as many list requests as a daily pipeline, because we issue a list request for each hour.  List requests aren't all that expensive, so it's not a big issue, but if you're trying to minimize costs, we recommend the daily structure.  It's fairly rare that we require an hourly pipeline.
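
As a sketch of the convention, here's one way to compute a file's key from the UTC time of its events (the helper below is ours for illustration; it isn't part of Interana's tooling):

    # A sketch of mapping a UTC event time to the recommended key layout.
    # The helper name is illustrative; it is not part of Interana's tooling.
    from datetime import datetime, timezone

    def s3_key_for_file(dataset, event_time_utc, filename, hourly=False):
        """Build a key like 'clientlogs/2016/01/15/file.gz' (daily) or
        'clientlogs/2016/01/15/03/file.gz' (hourly)."""
        parts = [dataset,
                 f"{event_time_utc:%Y}",
                 f"{event_time_utc:%m}",
                 f"{event_time_utc:%d}"]
        if hourly:
            parts.append(f"{event_time_utc:%H}")
        parts.append(filename)
        return "/".join(parts)

    ts = datetime(2016, 1, 15, 3, 0, tzinfo=timezone.utc)
    print(s3_key_for_file("clientlogs", ts, "file.gz"))               # daily folder
    print(s3_key_for_file("clientlogs", ts, "file.gz", hourly=True))  # hourly folder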

Step 3: Organize by file source

Some datasets are made up of events from multiple data sources. Now's the time to add a folder for each type of file we are importing into the dataset. Since you never know whether you'll later want to import different types of events into the dataset, you may want to make a folder specifying the source even if you currently have just one source.

s3://interana-s3-logs/eventlogs/2016...source/file.gz

s3://interana-s3-logs/eventlogs/2016...source/file.gz

This will allow us to set up separate pipelines if the different file types require separate transformer configurations. If they don't, we can set up one pipeline for everything after the day folder.
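
As an illustration (boto3 again, with whatever source folder names you created), listing the day folder with a delimiter shows each source as its own prefix, so a pipeline can target a single source or the whole day:

    # A sketch of discovering source folders under a day prefix (boto3).
    import boto3

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(
        Bucket="interana-s3-logs",
        Prefix="eventlogs/2016/01/15/",
        Delimiter="/",
    )
    for folder in response.get("CommonPrefixes", []):
        print(folder["Prefix"])  # e.g. eventlogs/2016/01/15/<source>/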

Step 4: Other considerations

Here are a few more things to keep in mind when putting files into S3:

  1. For the best import speed and flexibility, aim for file sizes of 100-200 MB uncompressed.  This allows Interana to maximize import parallelism!
  2. While S3 list requests do not support wildcards, we can approximate wildcard matching by issuing a lot of list requests. If we needed to import s3://interana-s3-logs/eventlogs/*/20...source/file.gz, we could do so by issuing a list request for every single variation of *, as shown in the sketch below.  This is inefficient and we don't recommend it!
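
Here's what emulating that wildcard actually means: one list request per possible value of *, with hypothetical values standing in for whatever the * segment could be:

    # A sketch of emulating a wildcard with many list requests (boto3).
    # The candidate values for '*' are hypothetical.
    import boto3

    s3 = boto3.client("s3")
    wildcard_values = ["value_a", "value_b", "value_c"]  # every possible '*'

    keys = []
    for value in wildcard_values:
        response = s3.list_objects_v2(
            Bucket="interana-s3-logs",
            Prefix=f"eventlogs/{value}/2016/01/15/",
        )
        keys.extend(obj["Key"] for obj in response.get("Contents", []))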

 
