Exactly, and same as my comment below (parquet+iceberg+s3).
And yes, Athena is part of that. We also use dbt, but mostly as a place to commit and push queries. And I agree with the other question about Glue: it is the ugliest part.
I guess a +1 is not up to Hacker News standards, but I still want to lend it some weight, given that we arrived at the same solution independently.
- an established convention for project organization
- a tool to run lots of SQL queries at scale
- a tool to create and update views in the correct graph order to avoid dependency issues (e.g. removing a column from a child view that a parent still depends on).
- SQL codegen / templating using Jinja
- an ecosystem of packages that provide useful utility macros. E.g. every project eventually needs a calendar. Just look at that SQL statement to generate one. It’s gnarly.
- a test runner on data to ensure quality and contract adherence to avoid breakage upstream.
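To make the "correct graph order" point concrete, here is a minimal sketch of dependency ordering using Python's standard `graphlib`. It is not how dbt is implemented internally; the view names and the `build_order` helper are made up for illustration, and the only assumption is that each view lists the views it selects from.

```python
from graphlib import TopologicalSorter

# Hypothetical view names. Each view maps to the set of views it selects from.
dependencies = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "revenue_by_day": {"orders_enriched"},
}


def build_order(deps):
    """Return view names so that every view comes after the views it reads from."""
    # TopologicalSorter treats the dict values as predecessors, so
    # static_order() yields dependencies before their dependents.
    # It raises graphlib.CycleError if the views depend on each other in a loop.
    return list(TopologicalSorter(deps).static_order())


order = build_order(dependencies)
print(order)
```

Creating or updating views in this order means a view is never built before the views it depends on. The relative order of unrelated views (here the two `stg_` views) is not guaranteed.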
- S3 for storage
- Glue catalog to describe / define source data shapes
- Athena to query the above
- dbt for business data modelling (has Athena and Glue adapters)
The only difficult part I always struggle with is getting partitioning right.
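For reference, one common starting point is Hive-style `key=value` prefixes on S3, which Athena can treat as partitions. A minimal sketch follows; the bucket name, the `partition_prefix` helper, and the choice of daily granularity are assumptions for illustration, not a recommendation for any particular dataset.

```python
from datetime import date


def partition_prefix(table_root: str, day: date) -> str:
    """Build a Hive-style key=value prefix for one daily partition."""
    root = table_root.rstrip("/")
    # Zero-padded values keep prefixes sorted lexicographically by date.
    return f"{root}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


# Hypothetical bucket and table root.
prefix = partition_prefix("s3://example-bucket/events", date(2024, 1, 5))
print(prefix)
```

The hard part is picking the granularity: partitions that are too fine produce lots of small files, while partitions that are too coarse make queries scan more data than they need.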