
In my experience, an easy-to-maintain stack that almost anyone can run:

- S3 for storage

- Glue catalog to describe / define source data shapes

- Athena to query the above

- dbt for business data modelling (it has Athena and Glue adapters)

The one part I always struggle with is getting partitioning right.
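
To make the partitioning point concrete, here is a minimal sketch of a Hive-style key layout (`key=value` path segments), which is one layout Athena can use for partition pruning. The bucket name, table root, and the choice of year/month/day as partition keys are assumptions for illustration; the right keys depend on how the data is actually filtered.

```python
from datetime import date


def partition_prefix(table_root: str, day: date) -> str:
    """Return a Hive-style partition prefix (key=value path segments)."""
    root = table_root.rstrip("/")
    # Zero-padded values keep prefixes sortable and consistent.
    return f"{root}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


# "example-bucket" and "events" are hypothetical names.
prefix = partition_prefix("s3://example-bucket/events", date(2024, 1, 15))
print(prefix)
```

Partitioning on the columns that queries filter by most often is usually the goal; too many fine-grained partitions means lots of small files, too few means full scans.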



Every time I look at S3/Glue/Athena I can’t help but feel that the Glue layer shouldn’t be necessary and should instead just be part of Athena’s DDL.


Athena is a query engine and can use multiple catalogs. It forwards DDL queries to the catalog. Glue is the default catalog.


Glue catalogs can be used by other query engines as well. Separating schema from compute is the foundational concept behind a data lake.


Exactly, and same as my comment below (parquet+iceberg+s3).

And yes, Athena is part of that. We also use dbt, but mostly as a place to commit and push queries. I agree with the other comment about Glue: it is the ugliest part.

I guess a +1 is not up to Hacker News standards, but I still want to give it some weight, given that we came up with the same solution independently.


Is dbt really necessary? (serious question) If so, why? What would go wrong by skipping it?


No, not necessary at all. You can write queries and CTEs and create views in Athena/Glue by hand, if that’s what you prefer.


I mean, what does dbt offer me here that makes it worthwhile?


- an established convention for project organization

- a tool to run lots of SQL queries at scale

- a tool to create and update views in the correct graph order to avoid dependency issues (e.g. removing a column from a view that another view still depends on).

- SQL codegen / templating using Jinja

- an ecosystem of packages that provide useful utility macros. E.g. every project eventually needs a calendar. Just look at the SQL statement needed to generate one; it’s gnarly.

- a test runner on data to ensure quality and contract adherence to avoid breakage upstream.
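
The "correct graph order" point can be sketched as a topological sort over view dependencies. This only illustrates the idea and is not dbt's implementation; the view names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever the tool does internally.

```python
from graphlib import TopologicalSorter

# Hypothetical view names; each maps to the set of views it selects from.
depends_on = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_revenue": {"orders_enriched"},
}


def build_order(graph):
    """Return view names ordered so every view comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = build_order(depends_on)
print(order)
```

Creating or updating views in this order means a view is never rebuilt before the views it reads from; a cycle in the graph raises `graphlib.CycleError` instead of silently producing a bad order.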


It offers a well organized SQL project. You have to store the scripts somewhere.

You may not need it, but I find it really useful.



