We wait for both the Bss data write and the journal's DirectIO write to complete, and only then do the acking (sending the response back to the api_server) in the callback function. What you are implying is what S3 is actually doing, as you can see from their paper [1], and our guarantee is stronger than that.
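For concreteness, a minimal sketch of that ordering, using asyncio; all the names here (`handle_put`, `write_blob`, `write_journal`, `ack`) are hypothetical, not our actual API:

```python
import asyncio

async def handle_put(blob, entry, write_blob, write_journal, ack):
    # Acknowledge back to the api_server only after BOTH the data
    # write and the journal's DirectIO write have completed.
    await asyncio.gather(write_blob(blob), write_journal(entry))
    await ack()
```
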
Well said and there are some bitter lessons in the storage industry.
At my last company we needed to disable the disk write cache on each reboot, and we also heard a lot of industry stories about underlying firmware implementations from the Oxide Computer podcast [1]. Yes, to provide a truly reliable service, we need to evaluate the underlying hardware settings case by case.
That's a lot of valuable information, thanks for the input. Yes, the original blog post mainly focuses on reducing the metadata overhead caused by fsync(). I got a lot of good feedback here, and much of the discussion goes beyond our original scenario settings. We would like to incorporate all these enhancement suggestions without re-introducing fsync(), and make the design work in more general environments.
Thanks for the encouragement! Another author here. If you are interested, you can check our other blog post [1] about the internal storage engine. Yes, we limit the delimiter to "/" to better support POSIX FS semantics. I have just finished the fs feature branch, which has passed all the POSIX fstests [2].
Yes, that's right. We could go even further and use raw devices without relying on any filesystem. We would then need to allocate/format raw disk space ourselves and could no longer just open files as simply as we do now. It would take some extra effort, but we would like to explore that in the future.
It would also make system initialization faster, since right now we need to write all zeros to make ext4/xfs actually initialize the extents as "allocated".
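For illustration, a rough sketch of that zero-filling initialization step (the file name and sizes are made up). The point is that space-reservation calls like posix_fallocate() leave ext4/xfs extents in an "unwritten" state, so the first real write still triggers a metadata update; writing zeros once pays that cost up front:

```python
import os
import tempfile

def preallocate(path, size, chunk=1 << 20):
    """Zero-fill a file so ext4/xfs mark its extents as written.

    fallocate()/posix_fallocate() only reserve space; the extents
    stay 'unwritten', so the first real write still updates
    filesystem metadata. Writing zeros once at init time avoids
    that on the hot path.
    """
    zeros = b"\0" * chunk
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = f.write(zeros[: min(chunk, remaining)])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())

path = os.path.join(tempfile.mkdtemp(), "prealloc_demo.dat")
preallocate(path, 4 << 20)  # 4 MiB data file, fully zero-filled
```
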
Yes, that has also been pointed out in other threads. This can be a very important setting: some common Linux filesystems don't actually do that every time, and we had to disable the disk write cache during boot-up to make sure data was truly persistent (as at my previous storage company).
So instead of saying "We removed fsync" you should say: "We redesigned the database write path to avoid paying the full fsync durability cost on every write"
Yes, as we mentioned in the post, it is targeted at virtualized NVMe disks, and we don't have control over actually issuing the FUA command. We are also changing to open data files with O_DATA_SYNC to make them work in normal on-prem deployment environments.
Even then, I also share the confusion of the poster you're replying to.
I don't see how a virtualised NVMe disk is different from a physical one.
Especially if you don't have control over the underlying hardware (so you don't know whether it has SSDs with power-loss protection, PLP), you should send the FUA.
> O_DATA_SYNC
You mean `O_DSYNC`?
Why would you need `O_DSYNC` on-premise, but not on cloud VMs? (Or are you saying you'd include it everywhere?) Similar to my above point, surely it is the task of the VM to pass through any FUA commands the VM guest issues to the actual storage?
Further: Is `O_DSYNC` actually substantially different from writing and then `fdatasync()`ing yourself?
My understanding is that no, it's the same. In particular, the same amount of data gets written. So if you believe that you avoid the "can trigger an order of magnitude more I/O" by dropping `fdatasync()`, you would just re-introduce it with `O_DSYNC`.
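For concreteness, here are the two supposedly equivalent write paths side by side (temp-file paths only, nothing here is from the original post):

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
data = b"payload"

# Path A: plain write, then fdatasync() -- the data (and the file
# size, if it changed) are durable once fdatasync() returns.
fd = os.open(os.path.join(tmp, "a.log"),
             os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, data)
os.fdatasync(fd)
os.close(fd)

# Path B: O_DSYNC -- every write() returns only once the data is
# durable, giving the same guarantee without the extra syscall.
fd = os.open(os.path.join(tmp, "b.log"),
             os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
os.write(fd, data)
os.close(fd)
```

Either way the device sees the same flush/FUA traffic; path B just folds the sync into the write() call.
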
However, I suspect that that whole consideration is pointless:
The only thing that makes your O_DIRECT+preallocated-only-overwrites writes safe is enterprise SSDs with Power Loss Protection (PLP), usually capacitors.
On those SSDs, NVMe Flush/FUA are no-ops [1]. So you might as well `fdatasync()`/`O_DSYNC`, always. This is simpler, and also better because you do not need to assume/hope that your underlying SSDs have PLP: Doing the safe thing is fast on PLP [2], and safe on non-PLP.
So the only remaining benefit of `O_DSYNC` over `fdatasync()` is that you save a syscall. That's an OK optimisation given they are equivalent, but it would surprise me if it had any noticeable impact at the latencies you are reporting ("413 us"), because [2] reports the difference being 6 us.
Let me know if I got anything wrong.
The only remaining question is: Why do you then see any difference in your benchmark?
Thanks for the feedback. I have already replied in another thread about O_DSYNC, which a lot of folks have suggested, so I will not repeat it here.
As for the benchmark results, the differences were mainly due to metadata management. We have implemented our own KV store, see the internals here [1], which is more efficient than ext4 namespace management, even after very aggressive fs tuning [2] (plus sharding each leveled dir 65536 ways).
Fsync on PLP drives isn't strictly a NOP - you still take a latency hit from the round trip of the command to the NVMe device, where it is implemented as a NOP.
Thanks for pointing out the mistake. We should make it clearer: when you fsync an open file descriptor, it only syncs that file's own data and metadata. To make a newly created file truly persistent, you also need to issue another fsync on the parent directory's fd, which makes it more expensive.
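A minimal sketch of the full sequence for a newly created file (the helper name and paths are made up for illustration):

```python
import os
import tempfile

def create_durably(path, data):
    # 1) Write the file and fsync it: this persists the file's
    #    data and its own inode metadata.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    # 2) fsync the parent directory: this persists the directory
    #    entry (the file's *name*), which the file fsync above
    #    does not cover. Without it, the file can vanish on crash.
    dfd = os.open(os.path.dirname(path) or ".",
                  os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

dest = os.path.join(tempfile.mkdtemp(), "journal.seg")
create_durably(dest, b"record")
```

This second directory fsync is exactly the extra cost being discussed: it is only needed for creates/renames, not for overwrites of preallocated files.
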
Congrats to the TigerBeetle team on the new feature! It looks like TB has already moved from a shared-nothing to a partially shared-disk (object storage) architecture. We have always been big fans of TigerBeetle's engineering and are actually using TB's excellent io_uring runtime [1] to build a new object store, so this connection feels amazing to me.
[1] https://www.amazon.science/publications/using-lightweight-fo....