The purpose of this how-to is to share best practices for s3 bucket layout. If you follow these guidelines, you can live your life fully confident that your Interana ingest is as efficient as possible, at least with respect to your s3 bucket!
Background: understand how Interana looks for new files to import in the s3 bucket
Understanding how i|a looks for files to import is key to understanding why we suggest the following bucket layout.
Interana imports from s3 data sources using two types of jobs: one-time backfill jobs and continuous import jobs. A one-time job imports all files available between two dates/times, and a continuous job continuously scans your bucket for new files to import.
Both of these job types require a way to identify the dates of the events contained within each file! And since we identify which files to import by listing the s3 bucket, we must set up the bucket so that we can identify the files we need with the fewest list requests possible.
A quick note: you can only list an s3 bucket by key name (not by attributes like last-modified time or file size), and listing does not support wildcards. These points further inform our recommended s3 bucket structure.
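To make this concrete, here's a minimal sketch of why prefix-based listing matters, using a plain Python list of hypothetical keys to stand in for the bucket: a list request can only be narrowed by key prefix, so the event date has to live in the key itself.

```python
# Hypothetical keys in a bucket. An s3 list request can only filter
# by key prefix, so the event date must be encoded in the key name.
keys = [
    "eventlogs/2017/01/15/file1.gz",
    "eventlogs/2017/01/15/file2.gz",
    "eventlogs/2017/01/16/file1.gz",
]

def list_by_prefix(keys, prefix):
    """Mimics an s3 list request: return keys starting with prefix."""
    return [k for k in keys if k.startswith(prefix)]

# One list request finds everything for January 15 -- no wildcards needed.
jan15 = list_by_prefix(keys, "eventlogs/2017/01/15/")
```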
The other important thing to understand is that all data is brought into s3 via "import pipelines". Each pipeline has the following important characteristics:
- The table the data will be imported into
- The pattern of the file names in s3 that should be imported (like bucketname/tableid/year/month/day/hour/file.gz)
- A transformer configuration that specifies any transformations to be made before ingest
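As a rough sketch of how these three pieces fit together (the dict layout and field names here are illustrative, not Interana's actual configuration schema):

```python
# Illustrative only -- not Interana's real pipeline configuration format.
pipeline = {
    "table": "eventlogs",
    "file_pattern": "interana-s3-logs/eventlogs/{year}/{month}/{day}/{hour}/",
    "transformer": {"rename": {"ts": "timestamp"}},  # hypothetical transform
}

def prefix_for(pipeline, year, month, day, hour):
    """Expand the file pattern into a concrete listing prefix for one hour."""
    return pipeline["file_pattern"].format(
        year=year, month=f"{month:02d}", day=f"{day:02d}", hour=f"{hour:02d}"
    )
```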
When we run a one-time backfill or a continuous job, we associate a pipeline with that job so that we know what files to import and where to import them.
Step 1 - Organize by dataset
Now we're ready to work on our structure! Let's say that we have a bucket named interana-s3-logs ready to go. For organizational purposes, it's a good idea to separate your bucket into the datasets that you will be importing into:
Step 2 - Organize by time of events in files
Immediately after some preliminary organization, you'll want to identify the year / month / day / (optional) hour that each file corresponds to (in UTC). So, if a file contains timestamps from 3:00am UTC on January 15, 2017, you would want to put that file in:
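As a sketch (the bucket and dataset names are just examples carried over from above), deriving the folder from an event timestamp in UTC looks like this:

```python
from datetime import datetime, timezone

def key_prefix_for(event_time, bucket="interana-s3-logs", dataset="eventlogs"):
    """Place a file under year/month/day/hour folders, always in UTC."""
    t = event_time.astimezone(timezone.utc)
    return f"{bucket}/{dataset}/{t:%Y/%m/%d/%H}/"

# A file with timestamps from 3:00am UTC, January 15, 2017:
prefix = key_prefix_for(datetime(2017, 1, 15, 3, tzinfo=timezone.utc))
# -> "interana-s3-logs/eventlogs/2017/01/15/03/"
```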
Including the hour will allow us to import files from a specific hour of the day, which is occasionally useful but not necessary. Running an hourly pipeline in a continuous job requires 24x as many list requests as a daily pipeline, because we issue a list request for each hour. List requests aren't all that expensive, so it's not a big issue, but if you're trying to minimize costs I recommend the daily structure. It's fairly rare that we require an hourly pipeline.
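The arithmetic behind that 24x figure, assuming (as a simplification) one list request per date prefix scanned:

```python
# Rough cost comparison, assuming one list request per date prefix.
days = 30
daily_requests = days        # one prefix (and one list request) per day
hourly_requests = days * 24  # one prefix per hour of each day
```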
Step 3 - Organize by file source
Some datasets are made up of events from multiple data sources! Now's the time to add a folder for each type of file we are importing into the dataset. Since you never know whether you'll later want to import different types of events into the dataset, you may want to make a folder specifying the source even if you have just one source today.
This will allow us to set up separate pipelines if the different file types require separate transformer configurations; if they do not, we can set up one pipeline for everything after the day folder.
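For example (the source folder names here are hypothetical), the choice between one pipeline per source and a single shared pipeline comes down to which prefix each pipeline lists:

```python
day_prefix = "interana-s3-logs/eventlogs/2017/01/15/"
sources = ["web", "mobile"]  # hypothetical source folders

# Separate pipelines: one prefix (and transformer config) per source.
per_source_prefixes = [day_prefix + s + "/" for s in sources]

# Single pipeline: one shared transformer over everything after the day folder.
shared_prefix = day_prefix
```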
Step 4 - Miscellaneous Considerations
Here are a few more things to keep in mind when putting files into s3:
- For best import speed and flexibility, aim for file sizes of 100-200 MB uncompressed. This will allow i|a to maximize import parallelism!
- While s3 list does not support wildcards, we can approximate wildcarding by issuing many list requests. If we needed to import s3://interana-s3-logs/eventlogs/*/20...source/file.gz, we could do so by issuing a list request for every single variation of *. This is inefficient, and I do not recommend it!
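To see why this is inefficient, here's a sketch (with a hypothetical layout where the source folder sits where the * is) of simulating a wildcard by enumerating prefixes: each candidate value of * costs its own list request, so requests scale linearly with the number of variants.

```python
def list_by_prefix(keys, prefix):
    """Stand-in for an s3 list request filtered by key prefix."""
    return [k for k in keys if k.startswith(prefix)]

# Hypothetical keys where a source folder occupies the wildcard position.
keys = [
    "eventlogs/web/2017/01/15/file.gz",
    "eventlogs/mobile/2017/01/15/file.gz",
]

# To "wildcard" eventlogs/*/2017/01/15/ we must know every possible
# value of * up front and issue one list request per value.
variants = ["web", "mobile", "backend"]  # hypothetical
matches = []
for v in variants:
    matches += list_by_prefix(keys, f"eventlogs/{v}/2017/01/15/")
# Three list requests to find two files -- and more variants mean more requests.
```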