Building a serverless data warehouse on AWS with Firehose, Glue and Athena
I work at a company with fewer than 10 people on the engineering team. Everyone wears multiple hats and is responsible for their piece of the product end to end. There are no BAs, QAs, DBAs or DevOps engineers. There is no fence to throw anything over.
This presented a challenge when we determined the aggregate data we were getting from Google Analytics/Firebase was not sufficient and that we needed a new system to track user engagement. It needed to be able to collect large volumes of data from all of our products and expose this for both ad-hoc querying and BI reporting – a data warehouse.
Since we use AWS, the obvious approach would have been to use Redshift. However it’s expensive and realistically the data we needed to store was going to be in the order of gigabytes per year. Certainly not the terabytes or petabytes Redshift boasts of being able to handle. This was late 2018 and so Athena and Glue had been released, but not Lake Formation. I was vaguely aware that these services existed, but hadn’t really worked out what they might be useful for. To be fair, the AWS FAQs don’t exactly clarify things:
Q: What can I do with Amazon Athena?
Amazon Athena helps you analyze data stored in Amazon S3. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. Amazon Athena can process unstructured, semi-structured, and structured data sets. Examples include CSV, JSON, Avro or columnar data formats such as Apache Parquet and Apache ORC…
Source: https://aws.amazon.com/athena/faqs/?nc=sn&loc=6
Not the most motivating bit of sales copy.
Two things helped it all fall into place for me:
- Reading the book Designing Data-Intensive Applications. This gives a contemporary overview of the database ecosystem and underlying technologies. It helped me understand the Parquet file format (a compressed columnar alternative to CSVs/JSON) and that Athena is actually just AWS hosting the open source Presto query engine for you. Additionally, AWS Glue is just a way of defining Apache Hive tables.
- Reading the engineering blog posts of a few big tech companies (Netflix, Airbnb, Uber). It turns out even these big players are not using commercial proprietary data warehouse systems, but are instead composing their own from similar sets of open source technologies for streaming, storage and querying.
It turns out the answers were all there in the AWS docs if you dig hard enough, but it took a while to work out how to combine the various services. We got there eventually and now have a cheap, low maintenance and simple ‘data warehouse for the little guy’. The moving parts of our solution are:
- A simple custom API that accepts requests from our web applications and mobile apps, adds some additional data (a timestamp and a unique UUID) and sends the JSON on to AWS Kinesis Firehose as a base64-encoded string.
- An AWS Kinesis Firehose delivery stream which is configured to:
- Match the JSON against a specific Glue schema
- Convert the record to Apache Parquet format
- Buffer 15 minutes' worth of events and then write them all to a specific S3 bucket in a year/month/day/hour folder structure.
- AWS Glue configured with a database, table and columns to match the format of events being sent. Additionally, a Glue crawler is configured to run several times a day to discover the new folders being added to the S3 bucket and to update the schema with this partition metadata.
- AWS Athena to query the database as defined in Glue.
- External SQL query tools and Microsoft Power BI also query Athena using the ODBC driver.
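The first moving part, the ingestion API, can be sketched in a few lines of Python with boto3. The stream name and event fields below are illustrative assumptions, not our production names:

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical delivery stream name -- substitute your own.
STREAM_NAME = "engagement-events"


def build_event(name: str, properties: dict) -> dict:
    """Wrap a raw client event with the fields the API adds server-side:
    an ISO-8601 timestamp and a unique UUID."""
    return {
        "event_id": str(uuid.uuid4()),
        "received_at": datetime.now(timezone.utc).isoformat(),
        "name": name,
        **properties,
    }


def send_event(firehose, event: dict) -> None:
    """Ship one event to the Firehose delivery stream. The SDK handles
    the base64 encoding of the record on the wire; we just pass JSON
    bytes. A trailing newline keeps records separable if format
    conversion is ever switched off."""
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )


if __name__ == "__main__":
    import boto3  # only needed when actually sending

    firehose = boto3.client("firehose")
    send_event(firehose, build_event("screen_view", {"screen": "home"}))
```

In practice our API sits behind an HTTP endpoint and validates the payload before forwarding it, but the Firehose call itself is this simple.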
At the time of writing, we have 6 GB of Parquet files stored in S3 using this system. That's ~130 million events. The monthly costs to run this are wonderfully low:
- S3 storage – $0.80
- Athena – $0.30 (varies based on your query volume)
- Glue – $0.50
- Kinesis Firehose – $0.50
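Part of what keeps the Athena line so small is that queries filtering on the partition columns only scan the relevant date folders. Here's a rough sketch of running such a query with boto3; the database, table and results bucket are made-up names, and note that with the default Firehose prefix a crawler may register the partition keys as partition_0…partition_3 rather than year/month:

```python
import time


def partition_pruned_query(database: str, table: str, year: int, month: int) -> str:
    """Restrict the scan to one month's folders so Athena only reads
    those Parquet files, keeping the data scanned (and the cost) down.
    Crawler-created partition columns are strings, hence the quotes."""
    return (
        f'SELECT name, count(*) AS events '
        f'FROM "{database}"."{table}" '
        f"WHERE year = '{year}' AND month = '{month:02d}' "
        f"GROUP BY name ORDER BY events DESC"
    )


def run_query(athena, sql: str, output_s3: str) -> list:
    """Submit the query, poll until it finishes, then fetch the rows."""
    qid = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query finished in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]


if __name__ == "__main__":
    import boto3

    athena = boto3.client("athena")
    sql = partition_pruned_query("analytics", "events", 2019, 6)
    for row in run_query(athena, sql, "s3://my-athena-results/"):
        print(row)
```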
Some gotchas we ran into along the way:
- Modifying the schema in Glue (e.g. adding a new column) is not automatically reflected in Firehose. You need to modify the stream config to use the new schema version.
- The Glue crawlers have a number of configuration options. It turns out the settings you want if you wish to be in full control of the schema and not have deleted columns re-appear are:
- Schema updates in the data store: Ignore the change and don’t update the table in the data catalog.
- Object deletion in the data store: Ignore the change and don’t update the table in the data catalog.
- I read that Parquet files perform best when they are ~1 GB in size. Ours are way smaller than that, but to get as close as possible we have the buffer settings on our Firehose stream at the maximum. One consequence of this is that when testing new events you have to remember to wait for the buffer to flush and to run the Glue crawler before trying to query the data.
- Finally, and this is a biggie – Athena is a read only service. If you want to modify or delete data, you need to work with the underlying files in S3. Our data is append only so this hasn’t been much of an issue, but definitely something to consider.
- There is a one-to-one relationship between a Firehose stream and an AWS Glue table.
- At some point you or an upstream events producer will send through bad data.
- Firehose emits a FailedConversion.Records CloudWatch metric. It’s useful to put some monitoring around this so you can react quickly.
- If an event does not match the configured schema, Firehose will put it in an S3 folder named ‘format-conversion-failed’. The raw data will be there in base64-encoded form.
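One way to put monitoring around that metric is a CloudWatch alarm that fires as soon as any record fails conversion, so the bad payloads can be fished out of the format-conversion-failed folder while they're fresh. A sketch with boto3; the stream name and SNS topic ARN are placeholders:

```python
def failed_conversion_alarm_params(stream_name: str, sns_topic_arn: str) -> dict:
    """Alarm parameters: trigger if any record fails Parquet conversion
    within a five-minute window on the given delivery stream."""
    return {
        "AlarmName": f"{stream_name}-failed-conversions",
        "Namespace": "AWS/Firehose",
        "MetricName": "FailedConversion.Records",
        "Dimensions": [{"Name": "DeliveryStreamName", "Value": stream_name}],
        "Statistic": "Sum",
        "Period": 300,                       # five-minute buckets
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # quiet periods are not an error
        "AlarmActions": [sns_topic_arn],
    }


if __name__ == "__main__":
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        **failed_conversion_alarm_params(
            "engagement-events",  # hypothetical stream name
            "arn:aws:sns:eu-west-1:123456789012:data-alerts",  # placeholder topic
        )
    )
```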
Overall, this system is working really well for us. A few things we’re considering for the future:
- Investigating the new options in Lake Formation and Glue for ingesting changes directly from RDS databases. This would open up options for transitioning to more of a ‘data lake’ with data coming from multiple places.
- Experimenting with reading and analyzing data with Databricks/Spark as an alternative to Athena – might be good for more involved queries.
- Looking at using more complicated data structures such as struct and array columns.
- Adding a downstream process to query the system and aggregate certain events into a format better suited for quick reporting.