Hi,
When using multiple Fio threads without specifying an offset for each of them, Fio will read the same data in every thread. Once one thread has read the data, it is cached in ARC, so every subsequent thread that asks for that data gets it straight from RAM. Fio still registers it as a normal IO operation, though, which greatly inflates the reported throughput.
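A rough sketch of one way around that (the file name and sizes here are just placeholders) is to give every job its own non-overlapping region with fio's offset_increment:

```
# each job reads its own 8G slice of the file, so job N can't be served
# out of ARC by blocks that job N-1 already pulled in
fio --name=distinct-regions --filename=/tank/testfile \
    --rw=randread --bs=128k --direct=1 \
    --numjobs=8 --size=8g --offset_increment=8g \
    --group_reporting
```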
Something seems wrong here. No doubt this configuration rocks, but it should be physically impossible to get better bandwidth than the 8 x 3.5 GB/s = ~28 GB/s that the raw drives offer, which is also well matched to the PCIe x4 slots' bandwidth.
Maybe fio is producing trivially compressible data? Dedup? Directio not working?
I am not sure; I think some of it is cached in RAM. I did a longer sustained test and I am getting about 70 GB/s with a 200 GB file - definitely larger than the 64 GB of RAM. Is it something to do with lz4 compression? Fetching smaller compressed blocks from disk and then decompressing them, thus inflating the apparent bandwidth?
The initial file was created with fio rw=randwrite, so the dummy data is random.
I think it's even more impressive that ZFS can do this. I verified with ext4 on a single NVMe drive, and no matter what knobs I turn, I am getting around 3.5 GB/s.
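For reference, a single-drive direct read test looks something like this (mount point and sizes are placeholders, not the exact command I ran):

```
# buffered-cache-bypassing read test on one NVMe drive with ext4
fio --name=single-nvme --directory=/mnt/ext4 --rw=read --bs=1m \
    --direct=1 --size=20g --numjobs=4 --group_reporting
```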
Edit: See note 2 at the end of the article, I posted the raw results of another 100GB file test.
Edit 2: See note 3. I rebooted the VM with 2 GB of RAM (instead of 64 GB) and the result almost exactly matches your prediction of ~20 GB/s. But it still doesn't explain the absurd 80 GB/s sustained reads of a 100 GB file with only 64 GB of RAM.
I do think it sounds like compression might be part of it.
I'm thinking that a) fio might create test data that is "trivially" compressible, and b) the compression might happen before directio gets to say "ignore the buffer cache and write directly to the drive", so what actually gets written is 10x or 100x smaller.
Another possibility is that somehow directio doesn't work when zfs+compression is turned on?
I'd do a test where everything else is the same, but turn lz4 compression off. In theory, with a real-world dataset, compression will cause CPU load to go up, but I would not expect the data size at the disk to go down by more than 2x to 3x, and therefore the bandwidth should not go up by any more than that, and only if the CPU isn't the limiting factor. In most cases, compressed (or encrypted) filesystems should be slower, not faster, because of this.
So a benchmark without compression would be the first step. Then maybe as a second step, a "real world" benchmark - copy 10-100GB of representative data or something like that - with and without compression to see if bandwidth goes up or down and how bad CPU utilization goes up.
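Turning compression off for the test dataset is a one-liner (the dataset name here is just an example):

```
# disable compression for the benchmark dataset, then restore it afterwards
zfs set compression=off tank/bench
# ... run the fio test ...
zfs set compression=lz4 tank/bench
```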
Another estimate is to look at the stats that tell you what the compression ratio is. If it's really, really high, that just means fio created all-zeros, really fast.
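Checking that is also quick (dataset name is an example):

```
# a compressratio close to 1.00x means fio's data isn't trivially compressible
zfs get compressratio,compression tank/bench
```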
Interesting. I turned off lz4 compression and it didn't make any difference - still getting around 80 GB/s on a 30-second test. I also checked the content of the dummy file; it is random, and it takes about the same time as using `dd` to pipe data from `/dev/random`. It's a mystery and it's related to RAM (see note 3). I think this would be an interesting question to ask on SO!
Well that's mysterious. I have a similar but smaller system - two EVO 970s on a 4x4x4x4 PCIe card just like you have (I planned on adding more if it went well). I went through similar issues benchmarking it about a year ago.
At the end of the day, I found that with ZFS it was just very difficult to bypass, avoid, turn-off, etc. all the caching. The machine I was using has a lot of RAM. I had to reduce the RAM to like 64GB with a boot option to even see what was happening. I was still never thrilled with the numbers I got.
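For reference (and not necessarily the exact options from back then - it was a while ago), capping RAM can be done with a kernel boot parameter, or the ARC can be pinned directly via a module parameter:

```
# cap visible RAM at boot by appending to the kernel command line:
#   mem=64G
# or cap just the ARC at runtime, e.g. to 4 GiB:
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
```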
Then, finally it dawned on me: with enough RAM, and an efficient filesystem like ZFS, the underlying storage is only a small part of the equation. The important thing is that your benchmark is accurate to the real world workload. If the real world can use a lot of buffer cache and RAM, and it makes it go fast - then good, use it like that. Which I did, and was happy, and forgot all about it.
What we want to avoid is fio saying you have 100 GB/s and then in the real world you don't see anywhere near that and are disappointed. To me benchmarks serve two purposes: first, comparison and prediction of performance with a reproducible test case, for example comparing the effect of the ARC cache size setting; second, diagnosing whether the hardware is working correctly and performing as close to ideal as possible.
So I did see some stuff vaguely like you are seeing. I remember being impressed at ZFS and how hard it is to benchmark because it tries so hard to cache efficiently (or whatever). Unfortunately it was like a year ago so I don't remember more details.
OK, I found part of the problem. I resurrected my old setup and switched to using fio (was using iozone and some others). When you specify --filename= in fio but have multiple jobs, they all share the same file (https://fio.readthedocs.io/en/latest/fio_doc.html). So instead of 28 jobs reading 28 distinct files, you have 28 jobs reading the same file. So even when dropping the buffer cache, rebooting clean, etc. before the benchmark, this is 27 parts a test of the read caches and 1 part a test of the actual disk, which will obviously seriously inflate the results. I'm thinking --directory= would be the best option.
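Something like this (directory path and sizes are placeholders) gives every job its own file:

```
# fio creates one file per job under the given directory instead of
# 28 jobs hammering the same (cached) file
fio --name=per-job-files --directory=/tank/fio --rw=read --bs=1m \
    --direct=1 --size=10g --numjobs=28 --group_reporting
```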
I also did the following to drop (most) cached data before the test:
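(Roughly; the pool name is a placeholder.)

```
sync
echo 3 > /proc/sys/vm/drop_caches        # flush the Linux page cache
zpool export tank && zpool import tank   # re-importing empties the ARC for that pool
```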
I still have some time to test - let me try this out! One thing I already proved is that with 2 GB of RAM the system performs as predicted, ~20 GB/s. But your definition of benchmarking resonates with me - test ultimate performance, but also real-world usage.
Update:
OK, so I used the ':' syntax in the --filename option and got something that makes sense. Fio created individual files for each thread. I tested numjobs of 4, 8 and 16, and it plateaus around 8 jobs. Adding additional files+jobs actually makes things worse.
--directory makes those files automatically for you
Actually, I think those numbers are pretty good. You can `zpool export` to unmount and then run a non-destructive read test with --filename=/dev/nvme0n1:/dev/nvme0n2... and check that the raw devices get, say, 80%-99% of the advertised speed.
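E.g. something like this (device names will differ per system; export the pool first so nothing else is using the disks):

```
zpool export tank
# --readonly guards against accidental writes to the raw devices
fio --name=raw-read --filename=/dev/nvme0n1:/dev/nvme1n1 \
    --rw=read --bs=1m --direct=1 --runtime=30 --time_based \
    --numjobs=2 --group_reporting --readonly
```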
Then you can create a "raw" zvol (no filesystem) with:
`zfs create -b 128K -V 100G tank/testme`
You can do a benchmark with --filename=/dev/zvol/tank/testme and it should only be testing the striping and raidz stuff. I could get like 65% of the ideal.
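E.g. (block size and runtime here are arbitrary):

```
# read test straight against the zvol -- exercises the striping/raidz layer
# without the ZFS POSIX filesystem layer on top
fio --name=zvol-read --filename=/dev/zvol/tank/testme \
    --rw=read --bs=128k --direct=1 --runtime=30 --time_based \
    --numjobs=8 --group_reporting
```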
Running through the ZFS filesystem seemed to slow down by a good 50% of the ideal. None of this has been tuned.
You are using a PCIe 4.0 SSD, compared to the PCIe 3.0 SSDs used in this RAIDZ2.
10 PCIe 4.0 SSDs give you a maximum of around 70 GB/s.
The test shows sustained throughput of 80 GB/s without compression, which is still higher than the maximum allowed by 8 PCIe 3.0 x4 slots, a total of 32 GB/s.
Waiting for some ZFS expert to give a plausible explanation.
Curious, did you create the zpool and dataset with default parameters? You used LZ4, which should be on by default. Any other tweaks, or did you let ZFS detect the drives and choose "sane" defaults?
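(If it helps, the settings the pool and dataset actually ended up with can be dumped with something like the following; pool/dataset names are examples.)

```
zpool get ashift tank
zfs get compression,recordsize,atime,primarycache tank
```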