
fsync waits for the drive to report back that the write succeeded. When you do a ton of small writes, fsync becomes a bottleneck. It's an issue of context switching and pipelining with fsync.
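
For illustration, a minimal sketch (hypothetical helper, not from any particular database) of the pattern being described as the bottleneck: one fsync per small record, so every iteration stalls until the drive acknowledges.

    #include <fcntl.h>
    #include <unistd.h>

    void append_records(int fd, const void *rec, size_t len, int n) {
        for (int i = 0; i < n; i++) {
            if (write(fd, rec, len) != (ssize_t)len)
                return;        /* short write / error: bail out              */
            fsync(fd);         /* block until the drive reports durability   */
        }
    }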

When you write data asynchronously, you do not need to wait for this confirmation. So by double-writing with two async requests, you make better use of all your CPU cores, as they are not being stalled waiting for that I/O response. Seeing a 10x performance gain is not uncommon with a method like this.

Yes, you do need to check whether both records are written and only then report back to the client. But that is a non-fsync request and does not tax your system the way fsync writes do.
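
A rough sketch of that flow, assuming liburing (the async API is my assumption, and error handling is mostly omitted): submit the same record to two files, wait for both completions, then acknowledge the client. Note that without fsync, O_DIRECT, or RWF_SYNC these completions only mean the kernel accepted the writes.

    #include <liburing.h>

    int double_write(struct io_uring *ring, int fd_a, int fd_b,
                     const void *rec, unsigned len, off_t off) {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd_a, rec, len, off);   /* copy #1 */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd_b, rec, len, off);   /* copy #2 */
        io_uring_submit(ring);

        for (int i = 0; i < 2; i++) {                    /* wait for both CQEs */
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(ring, &cqe) < 0 || cqe->res < 0)
                return -1;
            io_uring_cqe_seen(ring, cqe);
        }
        return 0;                                        /* now ack the client */
    }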

It has literally the same durability as an fsync write. You need to take into account that most databases were written 30, 40... years ago, in a time when HDDs ruled and things like NVMe drives were a pipe dream. But most DBs still work the same way and treat NVMe drives like they are HDDs.

Doing the above operation on an HDD will cost you 2x the performance, because you barely get 80 to 120 IOPS. But a cheap NVMe drive easily does 100,000 like it's nothing.

If you have ever monitored an NVMe drive under database write load, you will have noticed that those NVMe drives are simply underutilized. This is why you see a lot more work on new data storage layers being developed for databases that better utilize NVMe capabilities (and try to bypass old HDD-era bottlenecks).



> It has literally the same durability as a fsync write

I don't think we can ensure this without knowing what fsync() maps to in the NVMe standard, and somehow replicating that. Just reading back is not enough, e.g. the hardware might be reading from a volatile cache that will be lost in a crash.


Unless you're running cheap consumer NVMe drives, that is not an issue; enterprise SSDs/NVMe drives have their own capacitors to ensure data in flight always gets written.

On cheaper NVMe drives, your point is valid. But we also need to ask how much risk you are actually taking on. What is the chance of the system doing something funky at exactly the moment you happened to have sent X confirmations to clients for data that never got written?

Certain companies will not cheap out and will spend heavily on enterprise-level hardware. But for the rest of us? I mean, have you seen the German Hetzner, where 97% of their hardware is mostly consumer-level? Yes, there is a risk, but nobody complains about that risk.

And frankly, everything can be a risk if you think about it. I have had EXT3 partitions corrupt on a production DB server. That is why you have replication and backups ;)

TiDB, or maybe it was another distributed DB, also does not guarantee strict consistency, if I remember correctly. For performance, they give you eventual consistency.


Forget about consumer drives for a moment: unless you are explicitly doing O_DIRECT, why would you expect a notification that your I/O has completed to mean that it has reached the disk at all? The data might still just be sitting in the kernel page cache and not have gotten anywhere near the disk.

You mention you need to wait for the completion record to be written. But how do you do that without fsync or O_DIRECT? A notification that the write has completed is not that.

Edit: maybe you are using RWF_SYNC in your write call. That could work.
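
For reference, a quick sketch of that idea on Linux via pwritev2() (the write_durable name is mine): RWF_SYNC makes the single write behave like write()+fsync() for the written range, and RWF_DSYNC would be the fdatasync-like variant.

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Per-write sync: no separate fsync() call needed for this range. */
    ssize_t write_durable(int fd, const void *buf, size_t len, off_t off) {
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        return pwritev2(fd, &iov, 1, off, RWF_SYNC);
    }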


> Yes, you do need to check if both records are written and then report it back to the client. But that is a non-fsync request and does not tax your system the same as fsync writes.

What mechanism can be used to check that the writes are complete, if not fsync (or the related fdatasync)? What specific io_uring operation or system call?
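
As far as I know, even the io_uring counterpart is itself an fsync-type submission, roughly (a sketch, error handling omitted):

    #include <liburing.h>

    /* io_uring's durability barrier is IORING_OP_FSYNC;
       IORING_FSYNC_DATASYNC gives fdatasync semantics. */
    void queue_fdatasync(struct io_uring *ring, int fd) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
        io_uring_submit(ring);  /* CQE arrives once data is on stable storage */
    }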



