I've worked with several event sourcing systems and was even seduced into implementing one out of sheer hubris once. These problems are ever-present in every ES project I've had the misfortune of coming into contact with. The article doesn't even mention the worst part, which comes afterwards: when you realize, after all that pain, that the system is only used by a single person in the company to generate a noncritical report comparing a meaningless KPI, something a different intern could have done manually in four hours every quarter. By the time the tooling is up and running enough to make a stable system, no one will trust it enough to use ES's landmark features, except for a few developers still coming off the Kool-Aid.
99% of the time when ES sounds like a good idea, the answer is to just use Postgres. Use wal2json and subscribe to the WAL stream - if you really need time travel or audit logs or whatever, they'll be much cheaper to implement using WAL. If you need something more enterprisey to sell to the VP-suite, use Debezium.
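For the curious, subscribing is only a few lines with psycopg2's logical replication support. A minimal sketch, assuming wal2json is installed on the server and wal_level = logical (the slot and database names here are made up):

    # Stream wal2json change events from Postgres (sketch, not production code).
    import json
    import psycopg2
    from psycopg2.extras import LogicalReplicationConnection

    conn = psycopg2.connect("dbname=app",
                            connection_factory=LogicalReplicationConnection)
    cur = conn.cursor()
    cur.create_replication_slot("audit_slot", output_plugin="wal2json")
    cur.start_replication(slot_name="audit_slot", decode=True)

    def consume(msg):
        change = json.loads(msg.payload)            # one JSON doc per transaction
        print(change)                               # e.g. append to an audit log
        msg.cursor.send_feedback(flush_lsn=msg.data_start)  # let the slot advance

    cur.consume_stream(consume)                     # blocks, invoking consume()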
Event sourcing sounds so awesome in theory and it is used to great effect in many demanding applications (like Postgres! WAL = event sourcing with fewer steps) but it's just too complex for non-infrastructure software without resources measured in the 10s or 100s of man years.
This is a common problem I see across many things.
We had a guy who spent two weeks writing a report script that collated data uploaded into S3, did some post-processing, wrote another data lump back out to S3, and then emailed that data lump to a guy. The whole thing had a bunch of Lambda layers to pull in dependencies, a build pipeline in Jenkins, and Terraform to deploy it. The Python script itself was about 200 lines of what was basically spreadsheet cell manipulation.
Only after it was implemented did we find out it was only run every 3 months, and that it took the original dude 5 minutes to paste the data into his existing Excel workbook, where it popped out in another sheet instantly. This was of course a “massive business risk” and justified a project to replace it. What it should have been was a one-page thing in Confluence with the business process in it and the sheet attached. But no, it now needs a GitHub repo full of shite and 2 separate people to understand every facet of it, each requiring 3x the salary of the original dude. And every 3 months something breaks or it pukes an error, which costs someone 3 hours of debugging and reverse engineering the stack to work out why.
Fundamentally, event sourcing, like the above, is the outcome of seeing something shiny and covered in marketing and wanting to play with it on any available business case, rather than doing a rational evaluation of suitability. There are use cases for this stuff, but sometimes people don't know when to apply the knowledge they have.
As for ES itself, I have resisted this one big time, settling on CQRS-class functionality, which is quite frankly no different from the query/mutate database APIs using stored procedures in PL/SQL that I dealt with in the dark ages. People also generally don't have the architectural skills or conceptual ability to reason about the change in data-consistency semantics on top of that, or about the recovery concerns. I would rather find another way to skin the scalability cat than use that architecture.
You don't give people enough credit. I know damn well that the "complicated" solution is at best a monumental waste of money but doing the same trivial CRUD shit for years on end is neither intellectually stimulating nor good for my career. Until "the business" finds a way to change that I will use every available opportunity for "resume-driven development". Really, I don't give a crap if the shareholders make money or not. I only care about my own paycheck.
I wouldn't let the overengineering of a single script influence your view of the technologies you named (S3 data lakes, Lambda pipelines, builds in Jenkins, Terraform to deploy). Those are well-understood concepts and technologies for doing reporting. The problem here is that these tools were not in use at the company. Once you have 50 scripts run by different people on different schedules in different Excel versions, having everything go through S3 (immutable, cheap), ingested by Lambdas (scalable, well versioned, monitored), built and deployed in Jenkins using Terraform (vs., say, AWS command-line scripts) significantly reduces the complexity of the problem.
Event sourcing can be a very powerful pattern if used correctly. You don't need to combine ES with Eventual Consistency. ES can be implemented in a totally synchronous manner, and it works very elegantly, capturing all the business events you need. You get a very detailed audit trail for free, and you don't lose important business data. You can travel back in time, construct different views of the data (projections), etc. Most of the complications arise when you bring in Eventual Consistency, but you absolutely don't have to.
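To make "totally synchronous" concrete, here's a minimal sketch of the idea, with made-up schema and event names (sqlite3 only so it's self-contained): the event append and the projection update commit in the same transaction, so readers never see a stale view.

    # Synchronous event sourcing sketch: event + projection in one transaction.
    import json, sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE events (
            seq       INTEGER PRIMARY KEY AUTOINCREMENT,
            stream_id TEXT NOT NULL,
            type      TEXT NOT NULL,
            payload   TEXT NOT NULL
        );
        -- a projection, derivable at any time by replaying events
        CREATE TABLE account_balance (
            stream_id TEXT PRIMARY KEY,
            balance   INTEGER NOT NULL
        );
    """)

    def deposit(stream_id, amount):
        with db:  # one transaction: append the event AND update the projection
            db.execute(
                "INSERT INTO events (stream_id, type, payload) VALUES (?, ?, ?)",
                (stream_id, "Deposited", json.dumps({"amount": amount})))
            db.execute(
                """INSERT INTO account_balance (stream_id, balance) VALUES (?, ?)
                   ON CONFLICT(stream_id) DO UPDATE
                   SET balance = balance + excluded.balance""",
                (stream_id, amount))

    deposit("acct-1", 100)
    deposit("acct-1", 50)
    print(db.execute("SELECT balance FROM account_balance").fetchone())  # (150,)
    # Time travel = replay the events table up to any seq you like.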
I agree. Audit and history functionality have been the motivating features for me in building systems that are "ES-lite," where the event stream is not a general-purpose API and is only consumed by code from the same project.
Sometimes the ability for an engineer to fetch the history of a business object out of a datastore and explain what it means checks an "audit trail" requirement box. Sometimes showing the history of a business process in a UI, with back-in-time functionality, is a game-changing feature. If so, using event sourcing internally within the service is a great way to ensure that these features use the same source of truth as other functionality.
Where you get into trouble is when you realize that event sourcing will let you distribute the business logic related to a single domain object across a bunch of different codebases. A devil on your shoulder will use prima facie sound engineering logic to tell you it is the right thing to do. The functionality relates to different features, different contexts, different operational domains, so keeping it all in the same service starts to feel a bit monolithic.
But in practice you probably don't have enough people and enough organizational complexity to justify separating it. What happens is, a product manager will design an enhancement to one feature, they'll work out the changes needed to keep the product experience consistent, and the work will get assigned to one engineer because the changes are reasonable to get done in one or two sprints. Then the engineer finds out they have to make changes in four different systems and test six other systems for forward compatibility. Oops. Now you have nanoservices-level problems, but only microservices-level capabilities.
If you aren't tempted down that path, you'll be fine.
For an interactive application it's an absolute nightmare when you don't have read-your-own-writes support (including synchronization for "indexes"). Async background updates aren't always appropriate or easy to do.
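For reference, the usual workaround looks like the toy sketch below (every name in it is illustrative): the write returns the event's global position, and the read path blocks until the async projection has caught up to that position - which is exactly the machinery a plain database hands you for free.

    # Read-your-own-writes over an async projection (toy, in-memory sketch).
    import threading, time

    events, applied = [], {"position": 0}
    cond = threading.Condition()

    def append(event):
        with cond:
            events.append(event)
            cond.notify_all()
            return len(events)                    # global position of this event

    def projector():                              # async consumer / read model
        while True:
            with cond:
                cond.wait_for(lambda: applied["position"] < len(events))
            time.sleep(0.05)                      # simulate projection lag
            with cond:
                applied["position"] += 1          # (apply the event here)
                cond.notify_all()

    def read_after(position, timeout=2.0):        # block until our write is visible
        with cond:
            if not cond.wait_for(lambda: applied["position"] >= position, timeout):
                raise TimeoutError("projection did not catch up")
        return applied["position"]

    threading.Thread(target=projector, daemon=True).start()
    pos = append({"type": "Renamed", "name": "x"})
    print(read_after(pos))                        # waits ~50 ms, then returns 1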
A few questions:
- what was your biggest ES system that you worked on?
- how many people worked on it?
- how did the ES system in particular solve your problem?
- what was the tech stack?
No, I'm not a consultant. Maybe lead developer is the most accurate title for me :) By what criteria do you mean 'biggest'? They certainly weren't toy projects; these are real systems in active use, solving real-world problems for thousands of users. The ES part of these systems is mainly implemented in .NET, using the excellent Marten DB as the ES store on top of PostgreSQL. I would say that ES drastically changes how you model things and see the business. It forces you to identify and define meaningful events. These are often things that non-programmers also understand, so as a by-product it greatly improves communication with your clients (or, in DDD terms, domain experts :)). Scaling has not been a real issue either; these systems can handle thousands of operations per second without exotic hardware or setups, all of it mostly synchronously.
And I must add: use the right tool for the job; there are many cases where ES is not a good fit. Also, if you choose to use ES in your project, you don't have to use ES for everything in that project. The same thing actually applies to asynchronous processing: if something in the system doesn't scale synchronously, that doesn't mean you should now do everything asynchronously.
Not the OP, and not a consultant. The biggest ES system I work on is an order management system at one of the biggest investment banks in APAC, which processes orders from clients to 13 stock exchanges.
Roughly 40 people work on it in Asia, maybe around 100 globally at the raw dev level.
I feel it's a sort of false good idea for our particular problems. Clients trade either at ~10 ms latency for high-value orders or sub-millisecond for latency-sensitive ones. They conceptualise what they want to buy along with a bunch of changes they want to apply to that bulk (for instance, changes in quantity, price limits, time expiry), and they try to make money by matching a target price, either by very closely following a prediction curve or by spending as little time as possible doing so (giving us only a constraint and asking us to fit it with our own curve-fitting algos).
The tech stack is pure Java with a kernel of C++ for networking, with as few external dependencies as possible. And no GC beyond the first few minutes after startup (to prealloc all the caches). Everything is self-built.
How does it solve the problem: it simply is the only way we can think of to optimize processing this aggressively. If you want to hit a millisecond or sub-millisecond target (we use FPGAs for that), you must cut the fat as much as possible. Our events are 1 KB max, sent over raw TCP; the queue is the network switch's send queue; the ordering has to be centrally managed per exchange, since we have only one final output stream, but we can sort of scale out the intermediary processing (basically, for one input event, how to slice it into multiple output events in the right order).
I'd say it doesn't work: we run into the terrifying issues the author of the main link pointed out so well (the UI, God, it's hard; the impossible replays nobody can do; the useless noise; solving event-stream problems more than business problems, etc.), but I'd also say I can't imagine any heavier system fitting the constraints better. We need to count the cycles of processing, so we can't have a vendor library we can't just change arbitrarily. We can't use a database; we can't use anything heavier than TCP or multicast. I'll definitely try another bank one day to see how others do it, because I'm so curious.
Are you really using event sourcing, or just message queues?
I worked for a large US bank, and had close contact with those who worked on the asset settlement system. I don't have as deep insight as you do, obviously. But the general architecture you describe sounds very similar. Except they clearly used message queues and treated them as message queues, and not event streams / sourcing / whatever. They used a specific commercial message broker that specializes in low latency.
My experience also. I've worked with 3 different ones, and I think the domain fit was good for only one of them. In that one domain model the fit was spot on, because that domain was just a bunch of events, and every question you asked of the domain model was really saying "create me an object with the events between time X and Y". That was it, though; all the other domains I've worked on would not suit ES, even though they would suit CQRS without the ES piece.
Thanks a lot for mentioning Debezium. Agreed that a CDC-based approach will be simpler to implement and reason about than ES in many cases. The semantics are a bit different, of course (e.g. ES events capturing "intent"); we had a post discussing the two things a while ago on the blog [1]. Speaking of capturing intent, we just added support for Postgres's pg_logical_emit_message() function in today's release of Debezium 1.8.0.Beta1 [2].
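For anyone who hasn't used it: pg_logical_emit_message(transactional, prefix, content) writes an arbitrary application-defined message into the WAL, so the intent can ride through the decoding stream alongside the row changes. A small sketch (the table, prefix, and payload here are made up):

    # Record intent next to the row change, in the same transaction.
    import json
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("UPDATE orders SET status = 'cancelled' WHERE id = %s", (42,))
        cur.execute("SELECT pg_logical_emit_message(true, %s, %s)",
                    ("order-intent", json.dumps({"order_id": 42, "action": "cancel"})))
    # Both the row change and the intent message now appear in the logical
    # decoding stream, in commit order.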
Am I crazy for wanting a standardized WAL format that can be treated as an event stream for anything from replicas to search services to OLAP? Why can't we drink straight from the spigot instead of adding abstractions on abstractions?
It's honestly not that hard to build a WAL-style solution exactly the way you want it for your own application. You just have to get away from the "you shouldn't write your own xyz" boogeyman long enough to figure it out.
Do you know how to model your events using a type system?
Can you be bothered to implement a Serialize and Deserialize method for each event type, or simply use a JSON serializer?
Do you know how to make your favorite language compress and uncompress things to disk on a streaming basis?
Do you know how to seek file streams and/or manage a collection of them all at once?
Can you manage small caches of things in memory using Dictionary<TK,TV> and friends?
Are you interested in the exotic performance possibilities that open up to those who leverage the ring buffer and serialized busy wait consumer?
If you responded yes to most or all of the above, you are now officially granted permission to implement your own WAL/event source/streaming magic unicorn solutions.
Seriously, this stuff is really easy to play with. If you follow the rules, you would actually have a really hard time fucking it up. Even if you do, no one gets hurt. The hard part happens after you get the events to disk: recovery, snapshots, garbage collection - that's where the pain kicks in. But none of these areas is impossible. Recovery and snapshots can again be handily defeated by the mighty JSON serializer, if one is lazy enough to succumb to its power. Garbage collection can be a game of segmenting log files and slowly rewriting old events to the front of the log on a background thread. The nuance is in tuning all of these things in a way that makes the business and the computer happy at the same time.
Plan to do it wrong like 15 times. Don't invest a whole lot into each attempt and you can pick this up really fast. Try to just write it all yourself. Aside from the base language libraries (file IO, threading, et al.), a JSON serializer, and GZIP, you really should just do it by hand, because anything more complex is almost certainly wrong.
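To show how small "write it all yourself" can start, here's a toy version of the whole thing - JSON lines, GZIP, recovery by replay. Everything in it is illustrative, and a real one still needs the fsync policy, log segmenting, and snapshotting described above.

    # Toy append-only event log: gzip'd JSON lines; recovery = replay on open.
    import gzip, json

    class EventLog:
        def __init__(self, path):
            self.state = {}                        # in-memory view of the world
            try:
                with gzip.open(path, "rt") as f:   # recovery: replay every event
                    for line in f:
                        self._apply(json.loads(line))
            except FileNotFoundError:
                pass
            self._out = gzip.open(path, "at")      # append new events at the end

        def append(self, event):
            self._out.write(json.dumps(event) + "\n")
            self._out.flush()                      # real code: pick an fsync policy
            self._apply(event)

        def _apply(self, event):                   # the "projection"
            if event["type"] == "set":
                self.state[event["key"]] = event["value"]
            elif event["type"] == "delete":
                self.state.pop(event["key"], None)

    log = EventLog("events.jsonl.gz")
    log.append({"type": "set", "key": "user:1", "value": "alice"})
    print(log.state)                               # survives restarts via replay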
100% agree. I only used wal2json because my risk tolerance for the project sat between "can't be bothered to pay the onboarding/maintenance cost of Kafka for Debezium" and "can't be bothered to implement a reader for the (well documented, IIRC) stable binary WAL format", which is a weird spot to be in. This was a PoC meant to demonstrate how we could implement the features we needed from ES using no more than standard Postgres tooling and a tiny service, and it was good enough to roll right into production with minor changes. It took under a week, though ideally I would have taken the time to decode the binary WAL format directly, after saving the raw stream for safety's sake. Rust wasn't even an option at the time; nowadays it'd be mostly a bunch of macros and annotations with a sprinkling of hand-written FromBytes implementations and a tiny bit of IO + serde glue code.
IIRC it took another data scientist and an engineer under a month to turn the raw WAL logs into an audit log interface with pretty SSO avatars and weekly reporting. Someone from the devops team with DBA experience implemented time-traveling staging DBs, with continuous archiving and point-in-time recovery from production, in the same time. Someone else later improved it so PITR used full backups created from a filtered WAL log, letting devs select which parts of the production DB they copied over instead of each babying their own staging cluster that took days to rebuild. The whole project ended up giving us all of the benefits of event sourcing using standard, well-tested tooling, for a fraction of the cost.
We've thrown around the phrase "not invented here syndrome" so much that we've overcorrected - as humans are wont to do - and now the younger generation thinks architectures like event sourcing or infrastructure like Kafka are a better solution than just consuming replication logs over a TCP connection to one of the most popular open source databases on the planet (not directed at the GP, but at my former coworkers :)). I'm starting to wonder if I've reached the age where I sound like the adults in the Peanuts cartoons, except the sound vaguely resembles "Get your resume-driven development off my lawn!"
> "can't be bothered to pay the onboarding/maintenance cost of Kafka for Debezium"
Debezium can also be used without Kafka. One option is Debezium Engine [1], where you embed it as a library into your JVM-based application and it invokes a callback method you registered for every change event it receives. That way, you can react to change events in any way you want within your application itself, no messaging infrastructure required. The other option is Debezium Server [2], which takes the embedded engine and connects Debezium to all sorts of messaging/streaming systems, such as Apache Pulsar, Google Cloud Pub/Sub, Amazon Kinesis, Redis Streams, etc.
> 99% of the time when ES sounds like a good idea, the answer is to just use Postgres.
I like some of the things in Event Sourcing very much.
However, whenever I ask "Why not just use PostgreSQL as the event store until I'm 5 orders of magnitude bigger?" I never seem to get a really good answer.
In your experience, is there no framework or set of libraries that would bring the advantages at a lower cost, other than wal2json? Is building from scratch still the way to go, with common, more generic tools as the shortcut?