Hacker News | abrazensunset's comments

This has loose overlap with:

- Materialize
- Flink SQL
- Arroyo
- ReadySet
- RisingWave
- Timeplus
- Pathway
- Dozer
- Snowflake dynamic tables
- Native materialized views in OLTP databases
- Just having a stack of views in your db
- Poor man's MVs with triggers

All subtly different along every axis: consistency, UDF support, operator support, latency, scaling/state limits, source/sink integrations, and compatibility with existing protocols.

What seems unique is the focus on "writebacks to the source without Kafka/Connect in between", instead of having either a built-in cache, serving as a stream processor, or both. It looks like the built-in cache is still available through the FDW deployment pattern.

They note that, relative to the source tables, they are eventually consistent (unavoidably, unless you're willing to delay transaction writes), but it's not clear which other consistency properties they respect (such as preserving transaction boundaries end to end).

Overall this looks like it's designed to overcome materialized view limitations (which in popular OLTP dbs are pretty severe, in terms of supported operations, latency, or both), compared to other solutions that basically move the action downstream...curious if it will see much use, or if they'll inevitably introduce sinks and direct access to compete in the "live ODS" segment with Materialize and RisingWave.

edit: to make my comment more clear: this is a new entrant in a crowded space with several sophisticated, established players and the main differentiation is the deployment pattern. I'd be curious to know if anything else sets them apart
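For anyone unfamiliar with what all of these systems have in common, the core trick is incremental view maintenance: consume a change feed and apply deltas to the stored result instead of re-running the query. A toy sketch (the event shape and names are made up for illustration, not any vendor's API):

```python
from collections import defaultdict

# Materialized result of: SELECT key, SUM(amount) ... GROUP BY key
view = defaultdict(int)

def apply_change(event):
    """Apply one CDC-style change event: (op, key, amount)."""
    op, key, amount = event
    delta = amount if op == "insert" else -amount  # deletes retract
    view[key] += delta

# A change feed from the source tables, rather than a full re-scan.
feed = [("insert", "a", 3), ("insert", "a", 4),
        ("insert", "b", 1), ("delete", "a", 3)]
for ev in feed:
    apply_change(ev)

print(dict(view))  # {'a': 4, 'b': 1}
```

The hard parts the real systems differ on are exactly the axes above: doing this for joins and arbitrary operators, bounding state, and deciding what consistency to promise mid-feed.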


I'm with Arroyo [0] — thanks for the mention! I'd be interested to see someone from Epsio chime in with where exactly they're positioning, but you're right that this is (recently) a very crowded space.

I think you can somewhat arbitrarily draw a line between systems like Materialize/RisingWave that are focused on materialized view maintenance (often reading change feeds from your OLTP database) and stream processors like Flink/Arroyo that are focused on supporting more operational use cases and work with a larger variety of sources and sinks.

Epsio seems like it's working primarily in the former mold, with fast incremental computation of materialized views. Unlike Materialize/RisingWave it seems to be designed to run in front of your database, with all queries going through it.

ReadySet is a really cool project in the query caching/materialized view space that I think doesn't get enough attention. Rather than making you define your materialized views ahead of time, it acts as an automated caching layer on top of postgres/mysql that performs incremental computation for components of query graphs.

As someone who's been in the streaming space for years, it's really exciting to see so much energy here in the past couple of years after a long period of stagnation, with everyone trying to figure out the programming and deployment models that make the most sense.

Most folks right now are gravitating towards materialized views as the model, largely because it's easy and familiar for users. But ultimately I think this approach will end up too limited for most use cases and will remain valuable but somewhat niche.

[0] https://github.com/ArroyoSystems/arroyo


Agree with this. Snowflake has best-in-class dev experience and performance for Spark-like workloads (so ETL or unconstrained analytics queries).

It has close to worst-in-class performance as a serving layer.

If you're creating an environment to serve analysts and cached BI tools, you'll have a great time.

If you're trying to drive anything from Snowflake where you care about operations measured in ms or single digit seconds, you'll have a bad time and probably set a lot of money on fire in the process.


Please don't do this.

If you want an Airflow-ish approach without punishing your future self, pick Prefect. Otherwise go with Temporal. Above all, do not adopt Airflow in 2023 for the use cases you describe.


Prefect seems like a really good suggestion, thank you.

We're pretty much committing to Python as our language of choice in the infra layer. Most of the team is being sent on courses over the next month, too. So I have a whole lot of Python scripts popping up across the infrastructure.

And this approach of slapping some @task and some @flow onto scripts or helper functions seems to work really well with what the team is doing. It took me like 30-40 minutes to convert one of those scripts into what seems like a decently fine workflow. Very intrigued.


If you use Julia, Makie crushes this use case and comes with great Python interop.

https://github.com/holoviz/datashader is a good one in the Python ecosystem.


What distinguishes this from the more well-known OpenMetadata project?


"Lakehouse" usually means a data lake (bunch of files in object storage with some arbitrary structure) that has an open source "table format" making it act like a database. E.g. using Iceberg or Delta Lake to handle deletes, transactions, concurrency control on top of parquet (the "file format").

The advantage is that various query engines will make it quack like a database, but you have a completely open interop layer that lets any combination of query engines (or just SDKs that implement the table format, or whatever) coexist. And in addition, you can feel good about "owning" your data and not being overly locked in to Snowflake or Databricks.
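A toy sketch of the "table format" idea: data lives in immutable files, and a small manifest records which files make up the current snapshot, so deletes and transactions become an atomic manifest swap. This uses JSON files in place of parquet and a single manifest in place of Iceberg/Delta metadata; all names are made up for illustration:

```python
import json
import os
import tempfile

root = tempfile.mkdtemp()

def write_file(name, rows):
    # Data files are immutable once written.
    with open(os.path.join(root, name), "w") as f:
        json.dump(rows, f)
    return name

def commit(files):
    # Atomic rename is the "transaction": readers see the old
    # snapshot or the new one, never a mix.
    tmp = os.path.join(root, "manifest.tmp")
    with open(tmp, "w") as f:
        json.dump({"files": files}, f)
    os.replace(tmp, os.path.join(root, "manifest.json"))

def scan():
    # Any engine that reads the manifest sees the same table.
    with open(os.path.join(root, "manifest.json")) as f:
        manifest = json.load(f)
    rows = []
    for name in manifest["files"]:
        with open(os.path.join(root, name)) as f:
            rows.extend(json.load(f))
    return rows

commit([write_file("part-0.json", [{"id": 1}, {"id": 2}])])
# "Delete" id=2 by rewriting the file and swapping the manifest.
commit([write_file("part-1.json", [{"id": 1}])])
print(scan())  # [{'id': 1}]
```

The real formats add snapshots, schema evolution, and optimistic concurrency on top, but the layering is the same: an open metadata convention over plain files that any engine can implement.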


There is a huge crop of "data observability" startups that address exactly this. As a category it was overfunded prior to the VC squeeze. Some of them are actually good.

They all have various strengths and weaknesses with respect to anomaly detection, schema change alerts, rules-based approaches, sampled diffs on PRs, incident management, tracking lineage for impact analysis, and providing usage/performance monitoring.

Datafold, Metaplane, Validio, Monte Carlo, Bigeye

Great Expectations has always been an open source standby as well and is being turned into a product.


Thanks for the recommendations, I'm going to check some of them out.


I think it's more a matter of comparing minivans (cloud "DWH" engines) to sports cars (Clickhouse et al) here.

Snowflake's performance characteristics & ops paradigm have always been more consistent with managed Spark than anything else. Thus the competition with Databricks. They have only recently started pretending to be anything other than a low-maintenance batch processor with a nice managed storage abstraction, and their pricing model reinforces this.

That being said, for now it's pretty hard to find something that gives you:

- Bottomless storage
- Always "OK" performance
- Complete consistency without surprises (synchronous updates, cross-table transactions, snapshot isolation)
- The ability to happily chew through any size join and always return results
- Complete workload isolation

...all in one place, so people will probably be buying Snowflake credits for a few years yet.

I'm excited about the coming generation (cf. StarRocks and the Clickhouse roadmap), but the workloads and query patterns for OLAP and DWH only overlap due to marketing and the "I have a hammer" effect.

I don't think the slight misuse of either type of engine is bad at small-to-medium scale, either. It's healthy to make "get it done" stacks with fewer query engines, fewer integration points, and already-known system limitations.


"I want to write my orchestration in Python and I'm comfortable hosting my own compute" -> Prefect (lightweight) or Dagster (heavier but featureful)

"My team already knows Airflow and/or I want to pay Astronomer a lot of money" -> Airflow

"I love YAML and everything is on k8s anyway" -> Argo

"I just want something that works out of the box and don't want to host my own compute" -> Shipyard, maybe Orchest

"I want a more flexible, generic workflow engine and don't care about writing orchestration in Python" -> Temporal/Cadence

"I am very nostalgic" -> Azkaban, Oozie, Luigi

"I love clunky Java solutions to data problems" -> Nifi et al

"I like to pay for half-managed solutions and late upgrades to a first-generation technology" -> AWS/GCP hosted Airflow options

"I am on AWS and it doesn't need to be complicated" -> AWS Step Functions


This was a really useful comment. Thank you.

I know there's a degree of oversimplification going on here, but there's something to be said for having a simple bullet-list breakdown of all the use-cases - alongside the best tool for each use-case.

It serves as a practical starting point for narrowing down the list of tools (of which there are so many), before one proceeds with a deeper dive into the best-fitting tool.

Would be great if there were a site that did this sort of thing for all the common architectural needs.


+1 to AWS Step Functions. In my last three companies I have built fairly complicated workflows with them, and once you get used to them they are very powerful, reliable, and cheap. I just wish there were a bit more monitoring on top of them, but nothing you can't build yourself.


I ended up choosing NodeRed at current job. What does that make me? :)


I'm a heavy Prefect user and was also very confused about the initial rewrite, even after reading several summaries. My best advice is to just try using 2.0 (Orion). Here's how I'd summarize the difference:

Prefect 1.0 feels like second-gen Airflow: less boilerplate, easy dynamic DAGs, better execution defaults, great local dev, and so on. It's more sane, but you still feel the impedance mismatch of working with an orchestrator.

Prefect 2.0 is a first-principles rewrite that removes most of the friction from interacting with an orchestrator in the first place. Finally, your code can breathe.

