The naming strikes me as quite unfortunate. Kdb+ is a (time-series) database that supports the languages K and SQL. Naming a product in a similar space KSQL was already a questionable idea. Jumping through the hoops of renaming only to choose ksqlDB (which, arguably, is even more confusing) is, well, unfortunate.
It’s a long list of caveats about how your queries will silently fail if you don’t guarantee all sorts of properties of the data.
Anyone considering making Kafka the center of their data infrastructure should also consider using a conventional column-store database instead. The only advantage of Kafka in this comparison is better latency (seconds vs minutes). If you can live with a minute or two of latency, the advantages of a real database are MANY.
There is a major difference between Kafka and databases: Kafka is for passing data through, and databases are for storing it. Also, KSQL/ksqlDB is not intended to be a general tool for transforming all your data, but to help you avoid writing custom services (unless I'm getting things wrong).
Now, should Kafka be the central place for routing data? I think that's the main reason it exists. Should it be used for permanently storing data like a conventional database? Absolutely not.
By design, it will store data temporarily in case a consumer goes offline, but you can also use it as a backpressure buffer/queue for slow receivers that can't keep up with fast data ingestion (e.g. syslog -> kafka -> logstash).
I really think most people want a traditional SQL database that has a streaming query that gives real-time updates. I honestly don't know enough about databases to understand why that isn't a simple enough feature to implement.
I don't think that KDB does, either?
My understanding of this in KDB, and I could be wrong (I'm not an expert), is that you write to the tickerplant. Clients subscribing to the real-time process will see the data in real time, and then eventually it gets asynchronously written to another DB which allows querying.
I guess you'd think of it as eventual consistency.
(Please correct me if I'm wrong or have misunderstood the above)
You’re right. You don’t get read-your-own-writes guarantees in a typical Kdb tickerplant architecture, as the event streams are propagated to the real-time nodes asynchronously.
AIUI you can only write to the database indirectly, through Kafka messages which ksqlDB then consumes and materializes into tables with SQL-like syntax.
So if you emit an event (like "New User Created") and have a ksqlDB table that summarizes the user count, then Kafka having accepted the event does not mean the ksqlDB table has also been updated.
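Roughly this setup, sketched in ksqlDB syntax (the topic and column names here are made up for illustration, so treat it as a sketch rather than something copy-pasteable):

    -- Stream backed by a Kafka topic
    CREATE STREAM user_events (user_id VARCHAR, event_type VARCHAR)
      WITH (KAFKA_TOPIC='user_events', VALUE_FORMAT='JSON');

    -- Continuously maintained aggregate over that stream
    CREATE TABLE user_counts AS
      SELECT event_type, COUNT(*) AS cnt
      FROM user_events
      GROUP BY event_type;

    -- A pull query against the table may not yet reflect an event
    -- the broker has already acknowledged to the producer.
    SELECT cnt FROM user_counts WHERE event_type = 'New User Created';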
Contrast that with a traditional database, where a COMMIT returning in one session means a second session can immediately read the data.
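Something like the following, which you'd expect to just work on any read-committed SQL database (table name made up):

    -- Session 1
    BEGIN;
    INSERT INTO users (name) VALUES ('alice');
    COMMIT;

    -- Session 2, immediately afterwards: the committed row is visible
    SELECT COUNT(*) FROM users;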
That's standard behaviour in every analytics database that talks to kafka that I can think of. Kafka doesn't really encourage ACK/NACK type processing patterns. Accepting an event usually means the consumer has successfully read it and staged it for whatever is meant to happen next, not that the operation is completed.
Now, if it were possible for it to accept an event from Kafka without guaranteeing that the event eventually makes it into the materialized view (i.e. it may be lost), that'd be a problem.
> Now, if it were possible for it to accept an event from Kafka without guaranteeing that the event eventually makes it into the materialized view (i.e. it may be lost), that'd be a problem.
This is usually not a problem these days, as it's possible to guarantee exactly-once ingestion using Kafka offsets.
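The usual trick, as I understand it, is to commit the consumed offset in the same transaction as the derived data, so a replay after a crash can't double-apply a batch. A plain-SQL sketch (table and column names are hypothetical):

    BEGIN;
    INSERT INTO events (user_id, event_type) VALUES (42, 'New User Created');
    -- Record how far we've read, atomically with the data itself
    UPDATE kafka_offsets
       SET last_offset = 1337
     WHERE topic = 'user_events' AND kafka_partition = 0;
    COMMIT;
    -- On restart, resume from kafka_offsets; a replayed batch either
    -- commits once or not at all, so ingestion is effectively exactly-once.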
IIRC ClickHouse still doesn't have this guarantee. It guarantees exactly-once ingestion, but not that the ingested event gets processed all the way through to the view. And if the processing fails, that event won't be retried and is now gone.
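For context, the ClickHouse setup being described is roughly this (my own sketch, names made up): a Kafka engine table acts as the consumer, and a materialized view moves rows into a MergeTree table. If the view's SELECT fails mid-batch, the offsets may already have been committed and those rows aren't retried.

    CREATE TABLE events_queue (user_id UInt64, event_type String)
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'broker:9092',
             kafka_topic_list = 'user_events',
             kafka_group_name = 'clickhouse',
             kafka_format = 'JSONEachRow';

    CREATE TABLE events (user_id UInt64, event_type String)
    ENGINE = MergeTree ORDER BY user_id;

    CREATE MATERIALIZED VIEW events_mv TO events AS
      SELECT user_id, event_type FROM events_queue;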