The great thing about HN is that it consistently shoves into my face how many seemingly common dev tools, frameworks, etc. I've never heard of.
Event sourcing isn't anything I've ever heard of, let alone something for which broad marketing promises need debunking.
How common is this framework/architecture/product? Google isn't helping me determine how widely used it is.
It's become a fairly widely known concept in data engineering circles, expounded upon in Martin Kleppmann's Designing Data-Intensive Applications book. (Buy this book if you want to get up to speed on modern ideas around distributed systems and data architecture.)
This became popular as people were trying to figure out how to use Kafka as a persisted log store that could be "replayed" into various other databases. This meant that you could potentially stream all the deltas (well, more accurately the operations that create the deltas, e.g. insert, update, delete) in your data -- through a mechanism called Change Data Capture (CDC) [1] -- into a single platform (Kafka) and consistently replicate that data into SQL databases, NoSQL databases, object stores, etc. Because these are deltas, this lets you reconstruct your data at any point in history on any kind of back-end database or storage (it's database agnostic).
Event sourcing, to my understanding, is a term used among DDD practitioners and Martin Fowler disciples, but with a different nuance. This article explains what it is:
[1] Debezium is an open-source CDC tool for common open-source databases. Side note: A valid (but potentially expensive) way of implementing CDC is by defining database triggers in your SQL database.
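To make the replay idea concrete, here's a toy sketch (nothing to do with Debezium's actual API, just the concept): apply a stream of CDC-style deltas in order, and stop early to reconstruct the table as it was at any point in history.

    # Toy sketch of replaying CDC-style deltas (insert/update/delete)
    # to rebuild a table's state; event shapes are invented for illustration.
    events = [
        {"op": "insert", "key": 1, "row": {"name": "alice", "balance": 100}},
        {"op": "update", "key": 1, "row": {"balance": 80}},
        {"op": "insert", "key": 2, "row": {"name": "bob", "balance": 50}},
        {"op": "delete", "key": 2},
    ]

    def replay(events, upto=None):
        """Apply deltas in order; stop early to reconstruct any point in history."""
        table = {}
        for event in events[:upto]:
            if event["op"] == "insert":
                table[event["key"]] = dict(event["row"])
            elif event["op"] == "update":
                table[event["key"]].update(event["row"])
            elif event["op"] == "delete":
                del table[event["key"]]
        return table

    print(replay(events))          # state after all deltas
    print(replay(events, upto=2))  # state as of the second delta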
In fact, git stores a full snapshot of your entire repo with every commit. It does not store diffs from the previous commit. When you do "git show <COMMIT_SHA>" it generates a diff from the parent commit on the fly.
There's a huge optimization though: it uses a content-addressed blob store, where everything is referenced by the sha1 of its contents. So if a file's contents are exactly the same between two commits, it ends up using the same blob. They don't have to be sequential commits; it could even be two different file paths in the same commit. Git doesn't care - it's a "dumb content tracker". If one character of a file is different, git stores a whole separate copy of the file. But every once in a while it packs all blobs into a single pack file and compresses the whole thing at once, and the compression can take advantage of blobs which are very similar.
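The content-addressing part is easy to see for yourself: a blob's id is just the SHA-1 of a small header plus the file contents, so identical contents always collapse to a single object, regardless of path or commit. A minimal reproduction (equivalent to what git hash-object computes):

    import hashlib

    def git_blob_id(data: bytes) -> str:
        # Git's blob id: sha1 over "blob <size>\0" followed by the contents.
        header = b"blob %d\0" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    # Identical contents -> identical id, no matter the path or commit.
    print(git_blob_id(b"hello\n"))
    print(git_blob_id(b"hello\n") == git_blob_id(b"hello\n"))  # True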
Git's bi-modal nature is a wonderful representation of a sanely architected Event Sourced system. When needed it can create a delta-by-delta view of the world for processing, but for most things, most of the time, it shows a file-centric view of the world.
IMO a well-factored event sourced system isn't going to feel 'event sourced' for most integrated components, APIs, and services, because it's working predominantly with snapshots and materialized views. For complex process logic, or post-facto analysis, the event core keeps all the changes available for processing.
Done right it should feel like a massive win/win. Done wrong and it's going to feel hard in all the wrong places :)
Also directory structures are content-addressed, so if a commit changes nothing in a given directory, the commit data won't duplicate the listing of that directory.
Then I've just found out I've worked on a 28-year-old event-sourced code base. In Clipper. An old loan management system, which had to control how installments changed over their lifetimes.
Basic event sourcing is quite simple to implement. All the bells and whistles people sell alongside event sourcing are hard - whether you do event sourcing or not.
Your typical application presents a user interface based on data in a set of database tables (or equivalent), the user takes some action, the database tables get updated.
The equivalent event-sourced application presents a user interface based on data in a set of database tables (or equivalent), the user takes some action, the outcome of that action is written to one table, the other database tables get updated.
For git, the "or equivalent" is the working copy. You could easily imagine a source code management system similar to git, but without storing history - every commit and pull is a merge resulting in only the working copy, and every push replaces the remote working copy with your working copy.
But man, wouldn't it suck to be limited to only understanding the most recent state of your source code…
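To make the event-sourced version described above concrete, here's a minimal sketch of that write path (table and field names are invented for illustration, not a prescription): the action's outcome is appended to an events table, and the read-model table is updated in the same transaction.

    import json, sqlite3

    # Minimal sketch of the event-sourced write path: append the outcome
    # of the action to an events table, then update the read-model
    # table(s) in the same transaction.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE events   (id INTEGER PRIMARY KEY, type TEXT, payload TEXT);
        CREATE TABLE accounts (account_id TEXT PRIMARY KEY, balance INTEGER);
    """)

    def handle_deposit(account_id: str, amount: int) -> None:
        with db:  # one transaction: event append + projection update together
            db.execute("INSERT INTO events (type, payload) VALUES (?, ?)",
                       ("deposited", json.dumps({"account": account_id, "amount": amount})))
            db.execute("""INSERT INTO accounts (account_id, balance) VALUES (?, ?)
                          ON CONFLICT(account_id)
                          DO UPDATE SET balance = balance + excluded.balance""",
                       (account_id, amount))

    handle_deposit("acct-1", 100)
    handle_deposit("acct-1", 50)
    print(db.execute("SELECT balance FROM accounts WHERE account_id = ?", ("acct-1",)).fetchone())
    print(db.execute("SELECT type, payload FROM events").fetchall())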
You need to know beforehand how exactly you’re going to pull out the data, how it’s going to be indexed, updated, etc. You can get nice benefits out of it especially if you’re doing transaction states, but your query times are going to suffer unless you’re caching the end-state. This can make “simple” things take a lot of effort. It’s a really significant departure in terms of effort to deal with your data.
Event-based systems that run on message queues are the backbone of money. These often co-exist with batch systems that process huge files (which also can be thought of as a journal of events to be processed in a batch).
Things like Redux or Bitcoin are more or less event sourced systems. It's basically the idea of deriving your application state from a series of business events and storing those as a single source of truth. It's very appealing in theory but as the article explains, it's a bit more complicated in practice (e.g. dealing with consistency).
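A toy illustration of that idea (a Redux-style reducer, written in Python just for brevity): the event list is the only source of truth, and the current state is derived by folding over it.

    from functools import reduce

    # Toy illustration of deriving application state from a series of events.
    events = [
        {"type": "item_added",   "name": "apple"},
        {"type": "item_added",   "name": "bread"},
        {"type": "item_removed", "name": "apple"},
    ]

    def reducer(state, event):
        if event["type"] == "item_added":
            return state | {event["name"]}
        if event["type"] == "item_removed":
            return state - {event["name"]}
        return state

    state = reduce(reducer, events, frozenset())
    print(state)  # frozenset({'bread'})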
Event sourcing is really a loose set of concepts, and each application of it will look very different. That's also why there are almost no useful frameworks here.
There are some good talks on YouTube about it.
The concept is used all over, though: Redux (JS) is basically a lightweight form of event sourcing (with redux-saga being the "process managers" mentioned in the post).
Events are the cornerstone of product analytics. If you want to understand what your users are doing on your platform, and to look for opportunities to improve the user experience, events are a big part of that.
I've been working with user event funnels for years, with tools like Mixpanel and others, but it seems like this is a state-building tool for the development workflow.
The basic idea of event sourcing is to store the actions rather than the end state of the actions. This allows actions to interleave, and systems calculate the end state from all of the actions.
Think of ATMs. They don't update the balance of your bank account directly. They just record a debit against it, and then the sum of all of your credits and debits is your balance. This avoids having to hold some kind of lock on your account during the transaction, and even allows significant delays from various transaction sources.
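A toy version of that bookkeeping (assuming nothing about how real core-banking systems are built): every source just appends an entry, and the balance is only ever computed from the entries.

    # Toy version of the ATM example: record debits/credits as events,
    # never update a stored balance in place.
    ledger = []  # append-only

    def record(account, amount):          # positive = credit, negative = debit
        ledger.append({"account": account, "amount": amount})

    def balance(account):
        return sum(e["amount"] for e in ledger if e["account"] == account)

    record("alice", 500)   # deposit
    record("alice", -120)  # ATM withdrawal
    record("alice", -30)   # card payment arriving later from another source
    print(balance("alice"))  # 350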
In my view most of the problems with event sourcing that people run into boil down to overdoing it and modelling large parts of their systems as sets of events, instead of maintaining an event log just for those things that fall out naturally as discrete events from the rest of your design.
Having certain critical events, especially complex ones, logged as events that can be replayed and reconstructed, and modelling transformations as operations over the state of those logs, can be invaluable. Building the system around modelling all changes as events is amazingly painful.
The first thing ATMs (at least over here) do is check if you have enough balance for your transaction. If you just stored events, you'd need to collapse all the previous delta operations at that point, which would be arbitrarily slow if you never persisted data.
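The usual mitigation (already hinted at elsewhere in the thread as "snapshots" or "caching the end-state") is to persist a snapshot periodically and only fold the events recorded since. A rough sketch, with invented field names:

    # Rough sketch: keep a periodic snapshot and only fold the events
    # recorded after it, so the balance check stays fast.
    snapshot = {"balance": 350, "last_event": 41}   # persisted every N events
    tail = [                                        # events after the snapshot
        {"seq": 42, "amount": -20},
        {"seq": 43, "amount": 100},
    ]

    def current_balance(snapshot, tail):
        return snapshot["balance"] + sum(e["amount"] for e in tail
                                         if e["seq"] > snapshot["last_event"])

    print(current_balance(snapshot, tail))  # 430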
Storing event records vs. materializing changes is sort of an old hat trick in databases. People did that in the early 90s for better TPC-C scores. It has its uses, and in some contexts storing deltas can have huge advantages (e.g. Bigtable with a log-structured file system), but it's no silver bullet.
Databases that try to only reflect the entire truth as-is are like everyone working from a shared whiteboard.
Databases set up to event source, i.e. reflect the truth as it happened at each step, are like everyone working from a shared spreadsheet with change tracking.
It's an entire extra dimension that makes reconciling conflicting actions across disparate systems possible.
It seems to be one of those things that you probably don't need. And when you do need it, it becomes obvious that you need it. Specifically, in my research, you really shouldn't use it until it becomes painfully necessary to horizontally scale writes.
Parts of it can be useful, but you don't need to split out an event bus to get auditing for example. As you say, you can avoid that until/unless you need to scale writes.
In the meantime, you can look for inbound data that naturally corresponds to immutable events and apply some of the ideas to that. E.g. that form a user submits? It's reasonably an immutable event. Many of them won't matter to you, because you'll never care to audit them. But some might.
E.g. we have projections of financials being submitted by third parties. Being able to go back and audit how original form submissions relate to changes in other system state is useful, as is being able to re-run old reports after fixing bugs and confirming that the reports show what they should before/after certain events. So instead of just storing the end state, we're increasingly looking to store the original external signals that triggered those changes, build transformations as views over that event log, and then, where we need to, drive transformations into tables we don't treat as event-sourced in the same way, often with a suitable reference to the source event(s).
It avoids the problems in the article for the most part (some, such as changes in the structure of the events, will always be an issue), but gets enough of the benefits to be worth it, because it's only applied to data it genuinely fits (where we have clear, natural event sources, often but not always external submissions of data) and where we have a need (whether for complexity reasons or because of external auditing requirements) to be able to get at past views of the data.
I think you also need it when scaling reads becomes painful.
Reads that have different patterns, specifically, the kinds of patterns that can't be indexed easily because they need denormalization to generate all the indexed expressions. Or you need to read a time series, a snapshot at a point in time, or the latest version of the data, all from different places under different loads - analytic, machine learning, transactional.
One user needs to read across all the data over all time; another user wants super-fast scrollable access to user-customized sorts of a subset of the latest data. The user-configurability of the sort is what defeats the kinds of indexing you get in a traditional RDBMS. The obvious way to get this is a lambda architecture: have an immutable append-only system of record which contains all the data, and build the other views out of it. It's a small step from there to event sourcing.
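A sketch of that shape (names invented): one append-only log as the system of record, with two purpose-built views derived from it, one for the "all data over all time" user and one for the "latest data, re-sortable any way I like" user.

    from collections import defaultdict

    # One append-only log, two derived views: a full-history time series
    # and a "latest value per key" index for fast, user-customized sorting.
    log = [
        {"ts": 1, "key": "sensor-a", "value": 10},
        {"ts": 2, "key": "sensor-b", "value": 7},
        {"ts": 3, "key": "sensor-a", "value": 12},
    ]

    def time_series_view(log):
        series = defaultdict(list)
        for e in log:
            series[e["key"]].append((e["ts"], e["value"]))
        return dict(series)

    def latest_view(log):
        latest = {}
        for e in log:                  # log is ordered, so last write wins
            latest[e["key"]] = e["value"]
        return latest

    print(time_series_view(log))  # everything over all time
    print(latest_view(log))       # just the current state, ready to sort/filter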