Block storage performance of software RAID (mdadm, LVM and ZFS) over iSER and NVMe-oF

The first article (https://habr.com/ru/articles/753322/) was devoted to performance measured from the local system.

The point of this article remains the same: to show the maximum achievable performance when data safety is handled by replicas or backups.

The purpose of this article is to test the performance of three systems for combining physical devices into a single logical device, this time accessed over iSER and NVMe-oF.

The comparison covers the three systems that showed the highest performance in the tests from the first part:

  1. mdadm raid0

  2. LVM stripe

  3. ZFS stripe (default lz4)

As well as the most feature-rich of the freely available backends:

  1. ZFS stripe with compression and deduplication

Each of these is exported over both iSER and NVMe-oF.

Test bench:

Virtualization server 1:

  • Motherboard: Supermicro H11SSL-i

  • CPU: EPYC 7302

  • RAM: 4x64GB Micron 2933MHz

  • Network: 40GbE ConnectX-3 Pro, 10GbE/25GbE ConnectX-4 LN EN

  • OS: ESXi 7U3 build 20036586

Virtualization server 2:

  • Motherboard: Tyan S8030 (ver 1GbE)

  • CPU: EPYC 7302

  • RAM: 4x64GB Micron 2933MHz

  • Network: 40GbE ConnectX-3 Pro x2 (one adapter passed through to the VM)

  • OS: ESXi 7U3 build 21930508

Storage VM on virtualization server #2, which provides block access for host 1 and host 2:

  • OS: Ubuntu 22.04, kernel 6.2.0-26 (mitigations=off)

  • vCPU: 8

  • RAM: 64GB (expanded to 128GB for ZFS)

  • Drives: 6x PM9A3 1.92TB (firmware upgraded to GDC5902Q)

Connection diagram:

Note: HCIBench has no documentation on how to change the test code to use /dev/nvme0nX instead of /dev/sdX, so the tests are run with pvscsi.
According to the tests, the difference between the pvscsi and NVMe controllers lies within the 5% measurement error (more details in the test table) for version 7U3, since end-to-end NVMe support was implemented only in 8U2, which no longer supports ConnectX-3 🙁

The Prepare Virtual Disk Before Testing parameter is set to Random. Tests are run with 4 VMs, each with two 100GB disks, 8 vCPUs, and 16GB RAM.

The reference parameter values for the tests were taken from this source (a sample VDbench workload definition is sketched after the list).

  • Test 1:

    VDbench – 4k – 80% rng – 50/50 r/w – 8 threads per disk

  • Test 2:

    VDbench – 8k – 80% rng – 75/25 r/w – 8 threads per disk

  • Test 3:

    VDbench – 64k – 80% seq – 75/25 r/w – 8 threads per disk

  • Test 4:

    VDbench – max_iops_read – 4k – 100% rng – 100/0 r/w – 8 threads per disk

  • Test 5:

    VDbench – max_iops_write – 4k – 100% rng – 0/100 r/w – 8 threads per disk
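
For illustration, Test 1 corresponds roughly to a VDbench workload definition like the following (a sketch of standard vdbench parameter-file syntax only; the sd/wd/rd names, device paths and durations are illustrative and not the actual HCIBench-generated file):

# storage definitions: one raw device per sd, 8 threads each
sd=sd1,lun=/dev/sda,openflags=o_direct,threads=8
sd=sd2,lun=/dev/sdb,openflags=o_direct,threads=8
# workload: 4k transfers, 50% reads, 80% random (seekpct), rest sequential
wd=wd1,sd=(sd1,sd2),xfersize=4k,rdpct=50,seekpct=80
# run definition: unthrottled I/O rate with warmup
rd=run1,wd=wd1,iorate=max,elapsed=3600,warmup=1800,interval=1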

To measure the drop in speed caused by adding the VM layer, the tests from the first part were rerun; the results are in the table and graph below (error ±5%), and a sketch of the corresponding fio invocations follows the table:

| Test | SCSI controller (pvscsi) | NVMe controller | % of pvscsi (pvscsi = 100%) |
| --- | --- | --- | --- |
| Sequential write 4M qd=32 | 4603.11 MB/s | 4594.44 MB/s | 99.81% |
| Sequential read 4M qd=32 | 4618.22 MB/s | 4618.33 MB/s | 100.00% |
| Random write 4k qd=128 jobs=16 | 558.56 MB/s | 541.06 MB/s | 96.87% |
| Random read 4k qd=128 jobs=16 | 589.72 MB/s | 581.89 MB/s | 98.67% |
| Random write 4k qd=1 fsync=1 | 54.61 MB/s | 60.02 MB/s | 109.90% |
| Random read 4k qd=1 fsync=1 | 28.28 MB/s | 29.76 MB/s | 105.21% |
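
For reference, the fio tests from the first part map roughly onto invocations like these (a sketch only; /dev/sdb and the runtime are illustrative, not the exact job files used, and the read variants simply swap write/randwrite for read/randread):

# sequential write, 4M blocks, queue depth 32
fio --name=seq-write --filename=/dev/sdb --rw=write --bs=4M --iodepth=32 --ioengine=libaio --direct=1 --time_based --runtime=60
# random write, 4k blocks, queue depth 128, 16 jobs
fio --name=rand-write --filename=/dev/sdb --rw=randwrite --bs=4k --iodepth=128 --numjobs=16 --group_reporting --ioengine=libaio --direct=1 --time_based --runtime=60
# latency-sensitive case: 4k, queue depth 1, fsync after each write
fio --name=sync-write --filename=/dev/sdb --rw=randwrite --bs=4k --iodepth=1 --fsync=1 --ioengine=libaio --direct=1 --time_based --runtime=60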

The iperf3 tests between the ConnectX-3 Pro adapters were performed by passing a ConnectX-3 Pro through to a VM on each of the hosts, i.e. the diagram looks like this:

The result was:

|  | Bitrate |
| --- | --- |
| Sender | 34.93 Gbits/sec |
| Receiver | 34.21 Gbits/sec |

Rounding down, we get 34 Gbit/s (or 4.25 GB/s).

Regarding OFED: Ubuntu 20.04 with kernel 5.5 and the matching OFED was also tried, but Ubuntu 22.04.3 with kernel 6.2 showed higher performance and better repeatability in iperf3.

iperf3

Tests were performed in accordance with these recommendations (https://fasterdata.es.net/performance-testing/network-troubleshooting-tools/iperf/multi-stream-iperf3/)
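
The three parallel streams shown below were started roughly like this (a sketch following the linked guide; the receiver IP and ports are illustrative, and -T prefixes each stream's output with s1/s2/s3):

# on the receiving VM
iperf3 -s -p 5101 &
iperf3 -s -p 5102 &
iperf3 -s -p 5103 &
# on the sending VM
iperf3 -c 10.20.0.2 -T s1 -p 5101 &
iperf3 -c 10.20.0.2 -T s2 -p 5102 &
iperf3 -c 10.20.0.2 -T s3 -p 5103 &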

s1:  [ ID] Interval           Transfer     Bitrate         Retr
s1:  [  5]   0.00-10.00  sec  15.2 GBytes  13.0 Gbits/sec  1062    sender
s1:
s1:  iperf Done.
s2:  - - - - - - - - - - - - - - - - - - - - - - - - -
s2:  [ ID] Interval           Transfer     Bitrate         Retr
s2:  [  5]   0.00-10.00  sec  14.8 GBytes  12.7 Gbits/sec  1856    sender
s2:
s2:  iperf Done.
s3:  - - - - - - - - - - - - - - - - - - - - - - - - -
s3:  [ ID] Interval           Transfer     Bitrate         Retr
s3:  [  5]   0.00-10.00  sec  10.0 GBytes  8.59 Gbits/sec  1027    sender

Total 34.29 Gbits/sec

And with the --bidir flag:

s1:  [ ID] Interval           Transfer     Bitrate         Retr
s1:  [  5]   0.00-10.00  sec  13.4 GBytes  11.5 Gbits/sec  1705    sender
s1:  [  5]   0.00-10.04  sec  13.2 GBytes  11.3 Gbits/sec          receiver
s1:
s1:  iperf Done.
s2:  - - - - - - - - - - - - - - - - - - - - - - - - -
s2:  [ ID] Interval           Transfer     Bitrate         Retr
s2:  [  5]   0.00-10.00  sec  10.4 GBytes  8.93 Gbits/sec  2171    sender
s2:  [  5]   0.00-10.05  sec  10.1 GBytes  8.61 Gbits/sec          receiver
s3:  - - - - - - - - - - - - - - - - - - - - - - - - -
s3:  [ ID] Interval           Transfer     Bitrate         Retr
s3:  [  5]   0.00-10.00  sec  16.8 GBytes  14.5 Gbits/sec  1513    sender
s3:  [  5]   0.00-10.04  sec  16.7 GBytes  14.3 Gbits/sec          receiver
s3:
s3:  iperf Done.

Comparing the performance of NVMe drives connected directly to the OS with the same drives passed through to a VM via PCIe passthrough, performance is essentially unchanged: the results are within the 2% margin of error.

Excel table for the tests of disks connected directly to a physical host and passed through to the virtualization host:

| PM1725 | Physical | Via PCIe passthrough |
| --- | --- | --- |
| Sequential write 4M qd=32 | 1820.50 MB/s | 1826.50 MB/s |
| Sequential read 4M qd=32 | 4518.00 MB/s | 4553.50 MB/s |
| Random write 4k qd=128 jobs=16 | 1504.00 MB/s | 1502.50 MB/s |
| Random read 4k qd=128 jobs=16 | 3488.00 MB/s | 3514.50 MB/s |
| Random write 4k qd=1 fsync=1 | 148.50 MB/s | 172.50 MB/s |
| Random read 4k qd=1 fsync=1 | 42.90 MB/s | 45.45 MB/s |

| PM9A3 (0369) | Physical | Via PCIe passthrough |
| --- | --- | --- |
| Sequential write 4M qd=32 | 2811.00 MB/s | 2811.00 MB/s |
| Sequential read 4M qd=32 | 5822.50 MB/s | 5845.00 MB/s |
| Random write 4k qd=128 jobs=16 | 2807.00 MB/s | 2806.00 MB/s |
| Random read 4k qd=128 jobs=16 | 4601.00 MB/s | 4746.50 MB/s |
| Random write 4k qd=1 fsync=1 | 211.50 MB/s | 175.50 MB/s |
| Random read 4k qd=1 fsync=1 | 98.25 MB/s | 87.80 MB/s |

| PM9A3 (6310) | Physical | Via PCIe passthrough |
| --- | --- | --- |
| Sequential write 4M qd=32 | 2811.00 MB/s | 2811.00 MB/s |
| Sequential read 4M qd=32 | 5846.50 MB/s | 5835.50 MB/s |
| Random write 4k qd=128 jobs=16 | 2806.50 MB/s | 2807.00 MB/s |
| Random read 4k qd=128 jobs=16 | 4610.00 MB/s | 4743.00 MB/s |
| Random write 4k qd=1 fsync=1 | 211.50 MB/s | 178.00 MB/s |
| Random read 4k qd=1 fsync=1 | 98.30 MB/s | 87.85 MB/s |

| PM9A3 (6314) | Physical | Via PCIe passthrough |
| --- | --- | --- |
| Sequential write 4M qd=32 | 2811.00 MB/s | 2811.00 MB/s |
| Sequential read 4M qd=32 | 5839.00 MB/s | 5844.00 MB/s |
| Random write 4k qd=128 jobs=16 | 2806.00 MB/s | 2807.00 MB/s |
| Random read 4k qd=128 jobs=16 | 4602.50 MB/s | 4746.00 MB/s |
| Random write 4k qd=1 fsync=1 | 209.00 MB/s | 174.50 MB/s |
| Random read 4k qd=1 fsync=1 | 98.65 MB/s | 87.20 MB/s |

| PM9A3 (3349) | Physical | Via PCIe passthrough |
| --- | --- | --- |
| Sequential write 4M qd=32 | 2811.00 MB/s | 2811.00 MB/s |
| Sequential read 4M qd=32 | 5844.00 MB/s | 5835.50 MB/s |
| Random write 4k qd=128 jobs=16 | 2807.00 MB/s | 2806.50 MB/s |
| Random read 4k qd=128 jobs=16 | 4598.50 MB/s | 2328.00 MB/s |
| Random write 4k qd=1 fsync=1 | 208.00 MB/s | 201.50 MB/s |
| Random read 4k qd=1 fsync=1 | 98.10 MB/s | 94.90 MB/s |

| PM9A3 (1091) | Physical | Via PCIe passthrough |
| --- | --- | --- |
| Sequential write 4M qd=32 | 2811.00 MB/s | 2811.00 MB/s |
| Sequential read 4M qd=32 | 5827.00 MB/s | 5842.50 MB/s |
| Random write 4k qd=128 jobs=16 | 2806.50 MB/s | 2807.50 MB/s |
| Random read 4k qd=128 jobs=16 | 4514.00 MB/s | 4651.00 MB/s |
| Random write 4k qd=1 fsync=1 | 205.50 MB/s | 172.50 MB/s |
| Random read 4k qd=1 fsync=1 | 98.00 MB/s | 87.40 MB/s |

| PM9A3 (0990) | Physical | Via PCIe passthrough |
| --- | --- | --- |
| Sequential write 4M qd=32 | 2810.50 MB/s | 2811.00 MB/s |
| Sequential read 4M qd=32 | 5849.50 MB/s | 5857.00 MB/s |
| Random write 4k qd=128 jobs=16 | 2797.50 MB/s | 2807.00 MB/s |
| Random read 4k qd=128 jobs=16 | 4489.00 MB/s | 4646.00 MB/s |
| Random write 4k qd=1 fsync=1 | 206.00 MB/s | 183.50 MB/s |
| Random read 4k qd=1 fsync=1 | 97.20 MB/s | 88.05 MB/s |

Two scenarios will be considered:

  1. Exporting a block device over iSER using LIO (configured with targetcli-fb from the master branch of its repository)

  2. Exporting a block device over NVMe-oF using SPDK.

    A small digression: the original plan was to use the native kernel implementation, nvmet (https://enterprise-support.nvidia.com/s/article/howto-configure-nvme-over-fabrics–nvme-of–target-offload), but it is not supported by VMware (https://koutoupis.com/2022/04/22/vmware-lightbits-labs-and-nvme-over-tcp/, https://communities.vmware.com/t5/ESXi-Discussions/NVMEof-Datastore-Issues/td-p/2301440), so SPDK was used instead (https://spdk.io/doc/nvmf.html, https://spdk.io/doc/bdev.html)

iSER

From the Linux side:

In the LIO configuration, enable_iser boolean=true is set at the portal level, and the following attributes at the TPG level:

set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1

Additionally, the following options are specified for the device (https://documentation.suse.com/ses/7/html/ses-all/deploy-additional.html):

set attribute emulate_3pc=1 emulate_tpu=1 emulate_caw=1 max_write_same_len=65535 emulate_tpws=1 is_nonrot=1
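
Put together, the LIO configuration via targetcli-fb looks roughly like this (a sketch only: the backstore name, IQN and portal IP are illustrative, and /dev/md0 is replaced with the LVM or ZFS device for the other backends):

targetcli /backstores/block create name=md0 dev=/dev/md0
targetcli /iscsi create iqn.2003-01.org.linux-iscsi.storage:md0
targetcli /iscsi/iqn.2003-01.org.linux-iscsi.storage:md0/tpg1/luns create /backstores/block/md0
# replace the default catch-all portal with one bound to the RDMA-capable interface
targetcli /iscsi/iqn.2003-01.org.linux-iscsi.storage:md0/tpg1/portals delete 0.0.0.0 3260
targetcli /iscsi/iqn.2003-01.org.linux-iscsi.storage:md0/tpg1/portals create 10.20.0.1 3260
targetcli /iscsi/iqn.2003-01.org.linux-iscsi.storage:md0/tpg1/portals/10.20.0.1:3260 enable_iser boolean=true
targetcli /iscsi/iqn.2003-01.org.linux-iscsi.storage:md0/tpg1 set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
targetcli /backstores/block/md0 set attribute emulate_3pc=1 emulate_tpu=1 emulate_caw=1 max_write_same_len=65535 emulate_tpws=1 is_nonrot=1
targetcli saveconfig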

From the VMware side:

esxcli rdma iser add

After that, the configuration is the same as for a regular iSCSI array.
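
After esxcli rdma iser add, the remaining steps are roughly the following (a sketch; the vmhba and vmk names are illustrative and depend on the host):

# bind the VMkernel port of the RDMA-capable uplink to the new iSER adapter
esxcli iscsi networkportal add -A vmhba67 -n vmk1
# add the LIO portal as a dynamic discovery target and rescan
esxcli iscsi adapter discovery sendtarget add -A vmhba67 -a 10.20.0.1:3260
esxcli storage core adapter rescan -A vmhba67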

The performance of mdadm over iSER with the fio tests from the first part, on a single VM, is as follows:

| mdadm raid0 1VM | MB/s | IOPS |
| --- | --- | --- |
| Sequential write 4M qd=32 | 3632.00 | 866.20 |
| Sequential read 4M qd=32 | 3796.50 | 902.14 |
| Random write 4k qd=128 jobs=16 | 352.00 | 85736.44 |
| Random read 4k qd=128 jobs=16 | 559.50 | 136797.19 |
| Random write 4k qd=1 fsync=1 | 39.80 | 9712.06 |
| Random read 4k qd=1 fsync=1 | 26.10 | 6382.21 |

The final graph looks like this:

MDADM

mdadm --create --verbose /dev/md0 --level=0 --raid-devices=6 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1

| mdadm-iSER | MB/s | IOPS |
| --- | --- | --- |
| 4k-50rdpct-80randompct | 486.78 | 124615.40 |
| 8k-75rdpct-80randompct | 941.52 | 120514.50 |
| 64k-75rdpct-80randompct | 3445.29 | 55124.70 |
| 4k-0rdpct-100randompct | 495.13 | 126752.90 |
| 4k-100rdpct-100randompct | 483.90 | 123877.10 |

LVM

lvcreate -i6 -I64 --type striped -l 100%VG -n nvme_stripe nvme
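
The nvme volume group referenced above is assumed to have been created beforehand, roughly like this:

pvcreate /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
vgcreate nvme /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1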

| LVM-iSER | MB/s | IOPS |
| --- | --- | --- |
| 4k-50rdpct-80randompct | 514.27 | 131652.90 |
| 8k-75rdpct-80randompct | 957.10 | 122509.10 |
| 64k-75rdpct-80randompct | 3710.36 | 59365.80 |
| 4k-0rdpct-100randompct | 535.90 | 137192.80 |
| 4k-100rdpct-100randompct | 505.42 | 129386.40 |

ZFS

v1 (without dedup)

zpool create -o ashift=12 -O compression=lz4 -O atime=off -O recordsize=128k nvme /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
zfs create -s -V 10T -o volblocksize=16k -o compression=lz4 nvme/iser
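
The resulting zvol is then exported through LIO just like the other backends, by pointing a block backstore at its device node (the backstore name here is illustrative):

targetcli /backstores/block create name=zfs_iser dev=/dev/zvol/nvme/iser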

| ZFS – iSER | MB/s | IOPS |
| --- | --- | --- |
| 4k-50rdpct-80randompct | 131.77 | 33734.00 |
| 8k-75rdpct-80randompct | 343.34 | 43947.40 |
| 64k-75rdpct-80randompct | 1418.62 | 22698.20 |
| 4k-0rdpct-100randompct | 80.81 | 20690.60 |
| 4k-100rdpct-100randompct | 251.38 | 64354.60 |

v2 (with dedup)

zpool create -o ashift=12 -O compression=lz4 -O atime=off -O dedup=on -O recordsize=128k nvme /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
zfs create -s -V 10T -o volblocksize=16k -o compression=lz4 nvme/iser

| ZFS – iSER (dedup) | MB/s | IOPS |
| --- | --- | --- |
| 4k-50rdpct-80randompct | 56.75 | 14525.60 |
| 8k-75rdpct-80randompct | 174.89 | 22385.60 |
| 64k-75rdpct-80randompct | 384.76 | 6156.20 |
| 4k-0rdpct-100randompct | 28.73 | 7355.00 |
| 4k-100rdpct-100randompct | 200.54 | 51337.80 |

NVMe-oF:

From the Linux side:

Beyond the straightforward installation of SPDK (https://spdk.io/doc/getting_started.html), there is one caveat. At the time of writing, SPDK built from the master branch does not work with VMware because of a check on the responder_resources == 0 parameter, while on the VMware side this parameter is equal to 1 (https://github.com/spdk/spdk/issues/3115). You therefore need to build version 23.05.x rather than master (the issue has since been fixed in commit c8b9bba).

git clone https://github.com/spdk/spdk --recursive
modprobe nvme-rdma
modprobe rdma_ucm
modprobe rdma_cm
scripts/setup.sh
screen - or another session; alternatively you can write a daemon that starts it automatically
build/bin/nvmf_tgt
ctrl+a d - detach from the screen session and return to the console
scripts/rpc.py nvmf_create_transport -t RDMA -u 8192 -i 131072 -c 8192
scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller
scripts/rpc.py bdev_aio_create /dev/md0 md0
scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 md0
scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t rdma -a 10.20.0.1 -s 4420
scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t rdma -a 10.20.0.5 -s 4420
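
Before pointing ESXi at the target, it can be sanity-checked from any Linux host with nvme-cli installed and the rdma modules loaded (a quick sketch; the address matches the listener added above):

nvme discover -t rdma -a 10.20.0.1 -s 4420
nvme connect -t rdma -a 10.20.0.1 -s 4420 -n nqn.2016-06.io.spdk:cnode1
nvme list
nvme disconnect -n nqn.2016-06.io.spdk:cnode1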

From the VMware side:

esxcli system module parameters set -m nmlx4_core -p "enable_rocev2=1" (otherwise you get the error "Underlying device does not support requested gid/RoCE type.")

Then a VMware NVMe over RDMA storage adapter is created in the web UI, and in the controller settings the IP address of the VM configured above is specified, with port 4420.
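
The same can also be done from the CLI instead of the web UI (a sketch; the vmhba name is illustrative and depends on the host):

esxcli nvme fabrics discover -a vmhba65 -i 10.20.0.1 -p 4420
esxcli nvme fabrics connect -a vmhba65 -i 10.20.0.1 -p 4420 -s nqn.2016-06.io.spdk:cnode1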

The performance of mdadm over NVMe-oF with the fio tests from the first part, on a single VM, is as follows:

| mdadm raid0 1VM | MB/s | IOPS |
| --- | --- | --- |
| Sequential write 4M qd=32 | 4593.00 | 1095.55 |
| Sequential read 4M qd=32 | 4618.00 | 1101.67 |
| Random write 4k qd=128 jobs=16 | 542.00 | 132417.04 |
| Random read 4k qd=128 jobs=16 | 575.50 | 140674.20 |
| Random write 4k qd=1 fsync=1 | 61.50 | 15013.28 |
| Random read 4k qd=1 fsync=1 | 30.35 | 7418.52 |

The final graph looks like this:

MDADM

mdadm --create --verbose /dev/md0 --level=0 --raid-devices=6 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1

| mdadm – NVMe-oF | MB/s | IOPS |
| --- | --- | --- |
| 4k-50rdpct-80randompct | 570.59 | 146070.90 |
| 8k-75rdpct-80randompct | 1113.46 | 142522.10 |
| 64k-75rdpct-80randompct | 5775.06 | 92401.00 |
| 4k-0rdpct-100randompct | 593.87 | 152033.00 |
| 4k-100rdpct-100randompct | 540.54 | 138378.60 |

LVM

lvcreate -i6 -I64 --type striped -l 100%VG -n lvm_stripe stripe

| LVM – NVMe-oF | MB/s | IOPS |
| --- | --- | --- |
| 4k-50rdpct-80randompct | 596.39 | 152675.70 |
| 8k-75rdpct-80randompct | 1160.72 | 148572.30 |
| 64k-75rdpct-80randompct | 5826.39 | 93222.20 |
| 4k-0rdpct-100randompct | 618.22 | 158267.10 |
| 4k-100rdpct-100randompct | 559.45 | 143219.80 |

ZFS

v1 (without dedup)

zpool create -o ashift=12 -O compression=lz4 -O atime=off -O recordsize=128k nvme /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
zfs create -s -V 10T -o volblocksize=16k -o compression=lz4 nvme/iser

| NVMe-oF ZFS | MB/s | IOPS |
| --- | --- | --- |
| 4k-50rdpct-80randompct | 131.18 | 33585.30 |
| 8k-75rdpct-80randompct | 344.21 | 44057.10 |
| 64k-75rdpct-80randompct | 1409.13 | 22546.10 |
| 4k-0rdpct-100randompct | 78.63 | 20129.20 |
| 4k-100rdpct-100randompct | 267.21 | 68406.10 |

v2 (with dedup)

zpool create -o ashift=12 -O compression=lz4 -O atime=off -O dedup=on -O recordsize=128k nvme /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
zfs create -s -V 10T -o volblocksize=16k -o compression=lz4 nvme/iser

| NVMe-oF ZFS (dedup) | MB/s | IOPS |
| --- | --- | --- |
| 4k-50rdpct-80randompct | 56.10 | 14362.50 |
| 8k-75rdpct-80randompct | 183.32 | 23464.70 |
| 64k-75rdpct-80randompct | 355.83 | 5693.00 |
| 4k-0rdpct-100randompct | 28.07 | 7185.10 |
| 4k-100rdpct-100randompct | 211.70 | 54195.40 |

Conclusion

LVM shows the highest results, so if we take it as 100%, the test results will look like this:

| iSER | LVM | mdadm | ZFS | ZFS dedup |
| --- | --- | --- | --- | --- |
| 4k-50rdpct-80randompct | 100.00% | 94.65% | 25.62% | 11.04% |
| 8k-75rdpct-80randompct | 100.00% | 98.37% | 35.87% | 18.27% |
| 64k-75rdpct-80randompct | 100.00% | 92.86% | 38.23% | 10.37% |
| 4k-0rdpct-100randompct | 100.00% | 92.39% | 15.08% | 5.36% |
| 4k-100rdpct-100randompct | 100.00% | 95.74% | 49.74% | 39.68% |

| NVMe-oF | LVM | mdadm | ZFS | ZFS dedup |
| --- | --- | --- | --- | --- |
| 4k-50rdpct-80randompct | 100.00% | 95.67% | 22.00% | 9.41% |
| 8k-75rdpct-80randompct | 100.00% | 95.93% | 29.65% | 15.79% |
| 64k-75rdpct-80randompct | 100.00% | 99.12% | 24.19% | 6.11% |
| 4k-0rdpct-100randompct | 100.00% | 96.06% | 12.72% | 4.54% |
| 4k-100rdpct-100randompct | 100.00% | 96.62% | 47.76% | 37.84% |

ZFS, as expected, does not show the best results. This is partly explained by CPU load, but more generally by the fact that ZFS is not well suited to fast NVMe workloads.

All test results are available here, and the Excel spreadsheet here (yes, all the tables are in the archive, since Google Sheets breaks the formulas).

The graph of all tests grouped by the connection protocol used is as follows:

mdadm – blue, LVM – orange, ZFS – gray, ZFS-dedup – yellow

Graph of all tests grouped by the software RAID used:
Blue – NVMe-oF, Orange – iSER

P.S. If you think these tests are incomplete, that something is missing, or that they are incorrect, I am open to any suggestions for extending them.
