We wait for both the Bss data write and the journal's DirectIO write to complete, and only then do the acking (sending the response back to the api_server) in the callback function. What you are implying is what S3 is actually doing, as you can see from their paper [1], and our guarantee is stronger than that.
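For concreteness, a minimal sketch of that ordering, using asyncio; all the names here (`handle_put`, `write_blob`, `write_journal`, `ack`) are hypothetical, not our actual API:

```python
import asyncio

async def handle_put(blob, entry, write_blob, write_journal, ack):
    # Acknowledge back to the api_server only after BOTH the data
    # write and the journal's DirectIO write have completed.
    await asyncio.gather(write_blob(blob), write_journal(entry))
    await ack()
```
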
Well said and there are some bitter lessons in the storage industry.
At my last company we needed to disable the disk write cache on each reboot, and we also heard a lot of industry stories about underlying firmware implementations from the Oxide Computer podcast [1]. Yes, to provide a truly reliable service, we need to evaluate the underlying hardware settings case by case.
That's a lot of valuable information, thanks for the input. Yes, the original blog post mainly focuses on reducing the metadata overhead caused by fsync(). I got a lot of good feedback here, and much of the discussion goes beyond our original scenario settings. We would like to incorporate all these enhancement suggestions without re-introducing fsync(), and make the design work in more general environments.
Thanks for the encouragement! Another author here. If you are interested, you can check our other blog post [1] about the internal storage engine. Yes, we limit the delimiter to "/" to better support POSIX FS semantics. I have just finished the fs feature branch, which has passed all the POSIX fstests [2].
Yes, that's right. We could go even further and use raw devices without relying on any filesystem. We would then need to allocate/format raw disk space ourselves and could no longer just open files as simply as we do now. It would take some extra effort, but we would like to explore that in the future.
It would also make system initialization faster, since right now we need to write all zeros to make ext4/xfs actually initialize the extents as "allocated".
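For illustration, a rough sketch of that zero-filling initialization step (the file name and sizes are made up). The point is that space-reservation calls like posix_fallocate() leave ext4/xfs extents in an "unwritten" state, so the first real write still triggers a metadata update; writing zeros once pays that cost up front:

```python
import os
import tempfile

def preallocate(path, size, chunk=1 << 20):
    """Zero-fill a file so ext4/xfs mark its extents as written.

    fallocate()/posix_fallocate() only reserve space; the extents
    stay 'unwritten', so the first real write still updates
    filesystem metadata. Writing zeros once at init time avoids
    that on the hot path.
    """
    zeros = b"\0" * chunk
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = f.write(zeros[: min(chunk, remaining)])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())

path = os.path.join(tempfile.mkdtemp(), "prealloc_demo.dat")
preallocate(path, 4 << 20)  # 4 MiB data file, fully zero-filled
```
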
Yes, that has also been pointed out in other threads. This can be a very important setting: some common Linux filesystems don't actually do that every time, and we had to disable the disk write cache during boot-up to make sure data was truly persistent (as at my previous storage company).
So instead of saying "We removed fsync" you should say: "We redesigned the database write path to avoid paying the full fsync durability cost on every write"
Yes, as we mentioned in the post, it is targeted at virtualized NVMe disks, and we don't have control over actually issuing the FUA command. We are also changing to open data files with O_DATA_SYNC to make them work in normal on-prem deployment environments.
Even then, I also share the confusion of the poster you're replying to.
I don't see how a virtualised NVMe disk is different from a physical one.
Especially if you don't have control over the underlying hardware (so you don't know whether it has SSDs with power-loss protection, PLP), you should send the FUA.
> O_DATA_SYNC
You mean `O_DSYNC`?
Why would you need `O_DSYNC` on-premise, but not on cloud VMs? (Or are you saying you'd include it everywhere?) Similar to my above point, surely it is the task of the VM to pass through any FUA commands the VM guest issues to the actual storage?
Further: Is `O_DSYNC` actually substantially different from writing and then `fdatasync()`ing yourself?
My understanding is that no, it's the same. In particular, the same amount of data gets written. So if you believe that you avoid the "can trigger an order of magnitude more I/O" by dropping `fdatasync()`, you would just re-introduce it with `O_DSYNC`.
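For concreteness, here are the two supposedly equivalent write paths side by side (temp-file paths only, nothing here is from the original post):

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
data = b"payload"

# Path A: plain write, then fdatasync() -- the data (and the file
# size, if it changed) are durable once fdatasync() returns.
fd = os.open(os.path.join(tmp, "a.log"),
             os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, data)
os.fdatasync(fd)
os.close(fd)

# Path B: O_DSYNC -- every write() returns only once the data is
# durable, giving the same guarantee without the extra syscall.
fd = os.open(os.path.join(tmp, "b.log"),
             os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
os.write(fd, data)
os.close(fd)
```

Either way the device sees the same flush/FUA traffic; path B just folds the sync into the write() call.
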
However, I suspect that that whole consideration is pointless:
The only thing that makes your O_DIRECT+preallocated-only-overwrites writes safe is enterprise SSDs with Power Loss Protection (PLP), usually capacitors.
On those SSDs, NVMe Flush/FUA are no-ops [1]. So you might as well `fdatasync()`/`O_DSYNC`, always. This is simpler, and also better because you do not need to assume/hope that your underlying SSDs have PLP: Doing the safe thing is fast on PLP [2], and safe on non-PLP.
So the only remaining benefit of `O_DSYNC` over `fdatasync()` is that you save a syscall. That's an OK optimisation given they are equivalent, but it would surprise me if it had any noticeable impact at the latencies you are reporting ("413 us"), because [2] reports the difference being 6 us.
Let me know if I got anything wrong.
The only remaining question is: Why do you then see any difference in your benchmark?
Thanks for the feedback. I have already replied in another thread about O_DSYNC, which a lot of folks have suggested, so I will not repeat it here.
As for the benchmark results, the differences were mainly due to metadata management. We have implemented our own KV store, see the internals here [1], which is more efficient than ext4 namespace management, even after very aggressive fs tuning [2] (plus sharding each leveled dir 65536 ways).
Fsync on PLP drives isn't strictly a NOP - you still take a latency hit from the round trip of the command to the NVMe device, where it is implemented as a NOP.
Thanks for pointing out the mistake. We should make it clearer: when you fsync an open file descriptor, it only syncs that file's own data and metadata. To make a newly created file truly persistent, you also need to issue another fsync on the parent directory's fd, which makes it more expensive.
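A minimal sketch of the full sequence for a newly created file (the helper name and paths are made up for illustration):

```python
import os
import tempfile

def create_durably(path, data):
    # 1) Write the file and fsync it: this persists the file's
    #    data and its own inode metadata.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    # 2) fsync the parent directory: this persists the directory
    #    entry (the file's *name*), which the file fsync above
    #    does not cover. Without it, the file can vanish on crash.
    dfd = os.open(os.path.dirname(path) or ".",
                  os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

dest = os.path.join(tempfile.mkdtemp(), "journal.seg")
create_durably(dest, b"record")
```

This second directory fsync is exactly the extra cost being discussed: it is only needed for creates/renames, not for overwrites of preallocated files.
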
Congrats to the TigerBeetle team on the new feature! It looks like TB has already moved from a shared-nothing to a partially shared-disk (object storage) architecture. We have always been big fans of TigerBeetle's engineering and are actually using TB's excellent io_uring runtime [1] to build a new object store, so this connection feels amazing to me.
[1] https://www.amazon.science/publications/using-lightweight-fo....