OK, I found part of the problem. I resurrected my old setup and switched to using fio (was using iozone and some others). When you specify --filename= in fio but have multiple jobs, they all share the same file (https://fio.readthedocs.io/en/latest/fio_doc.html). So instead of 28 jobs reading 28 distinct files, you have 28 jobs all reading the same file. So even when dropping the buffer cache, rebooting clean, etc. before the benchmark, 27 of those reads are really a test of the read caches and only 1 is a test of the actual disk, which obviously seriously inflates the results. I'm thinking --directory= would be the best option.
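For example, a job file that uses directory= instead of filename= (a sketch with a hypothetical mount point) makes fio generate a distinct file per job:

```ini
; distinct-files.fio -- 28 jobs, 28 distinct files
[global]
directory=/tank/fio-test   ; hypothetical mount point on the pool
rw=read
bs=1M
size=4G
numjobs=28
group_reporting=1

[distinct-files]
; no filename= here, so fio creates per-job files named
; distinct-files.0.0 through distinct-files.27.0 under the directory
```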
I also did the following to drop (most) cached data before the test:
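A typical recipe for that on Linux (a sketch, assuming a pool named tank; exporting and re-importing the pool also empties its ARC entries) would be something like:

```shell
# flush dirty pages, then drop page cache, dentries and inodes (needs root)
sync
echo 3 > /proc/sys/vm/drop_caches

# tank is a hypothetical pool name; export/import evicts its data from the ARC
zpool export tank && zpool import tank
```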
I still have some time to test - let me try this out! One thing I already proved: with 2GB of RAM, the system performs as predicted, ~20 GB/s. But I resonate with your definition of benchmarking - testing ultimate performance but also real-world usage.
Update:
Ok, so when I used the ':' syntax in the --filename option, I got something that makes sense. fio created individual files for each thread. I tested numjobs at 4, 8, and 16, and it plateaus around 8 jobs. Adding additional files+jobs actually makes things worse.
--directory makes those files automatically for you
Actually, I think those numbers are pretty good. You can zpool export to unmount, then run a non-destructive read test with --filename=/dev/nvme0n1:/dev/nvme0n2... and check that the raw devices get, say, 80%-99% of the advertised speed.
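A sketch of that raw-device read test as a fio job file (device names assumed from above; run it as `fio --readonly raw-read.fio` so fio refuses to issue any writes):

```ini
; raw-read.fio -- sequential read across the raw NVMe devices (pool exported)
[global]
rw=read
bs=1M
direct=1
ioengine=libaio
iodepth=32
runtime=60
time_based=1
group_reporting=1

[raw-read]
filename=/dev/nvme0n1:/dev/nvme0n2
```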
Then you can create a "raw" zvol (no filesystem) with:
zfs create -b 128K -V 100G tank/testme
You can do a benchmark with --filename=/dev/zvol/tank/testme and it should only be testing the striping and raidz code paths, not the filesystem layer. I could get like 65% of the ideal.
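For instance, something like this (a sketch; the write pass fills the fresh zvol so the read pass measures real data rather than holes):

```ini
; zvol.fio -- fill the zvol once, then measure sequential reads
[global]
filename=/dev/zvol/tank/testme
bs=1M
direct=1
ioengine=libaio
iodepth=16

[fill]
rw=write

[read]
stonewall       ; wait for the fill job to finish before starting reads
rw=read
```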
Running through the ZFS filesystem seemed to slow things down to a good 50% of the ideal. None of this has been tuned.