Setting up iSCSI on an L3 network to effectively utilize network and storage capabilities

After testing NVMe over TCP (described here: https://habr.com/ru/companies/beeline_tech/articles/770174/), we decided to check how well iSCSI works on an L3 network compared to a specialized FC solution.

iSCSI Settings

TL;DR

- name: Setup iSCSI
  ini_file:
    dest: /etc/iscsi/iscsid.conf
    no_extra_spaces: false
    option: "{{ item.name }}"
    value: "{{ item.value }}"
    backup: no
  loop:
    - { name: "node.session.iscsi.InitialR2T", value: "Yes" }
    - { name: "node.session.cmds_max", value: "256" }
    - { name: "node.session.queue_depth", value: "256" }
    - { name: "node.session.nr_sessions", value: "8" }

- name: Setup /etc/sysctl.conf
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.val }}"
  loop:
    # iSCSI-related tunables
    - { key: net.ipv4.fib_multipath_hash_policy, val: "1" }
    - { key: net.ipv4.fib_multipath_use_neigh, val: "1" }
    - { key: net.ipv4.tcp_slow_start_after_idle, val: "0" }
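
For reference, the same tuning can be applied by hand; a minimal shell sketch (the Ansible tasks above are the authoritative version):

# iscsid tuning (same values as in the playbook above)
sed -i -e 's/^#\?node.session.iscsi.InitialR2T = .*/node.session.iscsi.InitialR2T = Yes/' \
       -e 's/^#\?node.session.cmds_max = .*/node.session.cmds_max = 256/' \
       -e 's/^#\?node.session.queue_depth = .*/node.session.queue_depth = 256/' \
       -e 's/^#\?node.session.nr_sessions = .*/node.session.nr_sessions = 8/' \
       /etc/iscsi/iscsid.conf

# multipath-friendly L3 routing and TCP behaviour
sysctl -w net.ipv4.fib_multipath_hash_policy=1
sysctl -w net.ipv4.fib_multipath_use_neigh=1
sysctl -w net.ipv4.tcp_slow_start_after_idle=0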

We consider the achieved 12.3 GB/s on a 100 Gbit/s network a good result.

Initial test

We took a dump of a production database for a certain day, plus several hours of traffic captured against it.

On the prospective platform, with the multipathd settings recommended by the storage vendor, we configured connections to several dozen LUNs.
The iscsid settings were left at their out-of-the-box defaults.

We transferred the data to the future platform.

We started replaying the traffic.

The graphs showed that network utilization reached only 50%.

We started figuring out why that was, and what needed to be done to saturate the network link.

Device-level optimization

By default, Oracle Linux merges multiple read and write requests to a device into one.
This did not suit some of our devices, so the merging had to be turned off for them.

ls -l /dev/mapper/*my_mask* | awk -F "-> ../" '{print $2}' | while read dev; do echo "$dev"; echo 2 > /sys/devices/virtual/block/$dev/queue/nomerges; cat /sys/devices/virtual/block/$dev/queue/nomerges; done
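
For reference, the meaning of the nomerges values according to the kernel's block-layer queue sysfs documentation (dm-0 below is just a placeholder device name):

# 0 - all merging enabled (the default)
# 1 - only simple one-hit merges are attempted
# 2 - no merging is attempted at all
cat /sys/devices/virtual/block/dm-0/queue/nomerges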

Initial analysis

Using the DBMS, we launched table validation, which created a load of up to 50 Gbit/s and set records for the server's power consumption.

htop showed a high percentage of waiting in the kernel.

We decided to look at the iscsid settings.

In it we found the following:

################################
# session and device queue depth
################################

# To control how many commands the session will queue, set
# node.session.cmds_max to an integer between 2 and 2048 that is also
# a power of 2. The default is 128.
node.session.cmds_max = 128

# To control the device's queue depth, set node.session.queue_depth
# to a value between 1 and 1024. The default is 32.
node.session.queue_depth = 32

We changed node.session.queue_depth to 256.

After that we restarted iscsid and repeated the table-validation test.
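
If the new value also needs to reach node records that were already created during discovery, iscsiadm can update them; a rough sketch (the target IQN and portal below are placeholders):

# push the new queue depth into an existing node record
iscsiadm -m node -T iqn.2000-01.com.example:target1 -p 192.168.1.10 \
         -o update -n node.session.queue_depth -v 256
systemctl restart iscsid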

CPU (and power) consumption returned to normal, and throughput increased to approximately 75 Gbit/s.

Further increases of node.session.queue_depth had no effect, so we decided to use synthetic tests to find out what else needed tweaking. The ultimate goal was to saturate the network with disk I/O.

Synthetic tests

We decided to run the synthetic tests with FIO on a machine of the same configuration: the same model, processor and memory, with its network ports in the same switch.

For the tests, 48 disks from the storage system were already available, which is slightly worse than on the original stand with the database, but acceptable.

The disks were given names via multipath.conf, under which they were visible in /dev/mapper.

A simple shell one-liner like this:

cat 1.txt | grep " ->" | awk -F">" '{print $2}' | tr "." "_" | awk '{print "\tmultipath {\n\t\twwid\t3"$3"\n\t\talias\t"$1"\n\t}"}'

allows you to quickly generate the necessary entries for /etc/multipath.conf.
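
The generated stanzas go into the multipaths section of /etc/multipath.conf and look roughly like this (the WWID and alias below are made up for illustration):

multipaths {
    multipath {
        wwid    360060160aabbccdd0000112233445566
        alias   testdata_01
    }
}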

The '3' in front of '$3' (which holds the WWID) is protocol-specific: it is the identifier-type prefix that scsi_id adds.

This issue has been explained in the man page of scsi_id:

- scsi_id queries a SCSI device via the SCSI INQUIRY vital product data (VPD) page 0x80 or 0x83 and uses the resulting data to generate a value that is unique across all SCSI devices that properly support page 0x80 or page 0x83.
- If a result is generated it is sent to standard output, and the program exits with a zero value. If no identifier is output, the program exits with a non-zero value.
- scsi_id is primarily for use by other utilities such as udev that require a unique SCSI identifier.
- By default all devices are assumed blacklisted, the --whitelisted option must be specified on the command line or in the config file for any useful behavior.
- SCSI commands are sent directly to the device via the SG_IO ioctl interface.
- In order to generate unique values for either page 0x80 or page 0x83, the serial numbers or worldwide names are prefixed as follows.

Identifiers based on page 0x80 are prefixed by the character 'S', the SCSI vendor, the SCSI product (model) and then the serial number returned by page 0x80. For example:

# /lib/udev/scsi_id --page=0x80 --whitelisted --device=/dev/sda
SIBM 3542 1T05078453


Identifiers based on page 0x83 are prefixed by the identifier type followed by the page 0x83 identifier. For example, a device with a NAA (Name Address Authority) type of 3 (also in this case the page 0x83 identifier starts with the NAA value of 6):
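
The man page follows this with an example of the same form as the page 0x80 one above; roughly like this (the identifier value below is illustrative, not from our setup):

# /lib/udev/scsi_id --page=0x83 --whitelisted --device=/dev/sda
360060160aabbccdd0000112233445566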

You can assemble a test volume from the 48 disks like this:

ls -1 /dev/mapper/*testdata* | grep exp | xargs vgcreate vg_testdata
lvcreate -i 4 -I 64 -n testdata -l 100%FREE vg_testdata

For our tests we used a 256k disk stripe, hence -I 256.
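
With that stripe size, the volume creation would look roughly like this (the -i 48 stripe count is an assumption that the stripe spans all 48 PVs):

lvcreate -i 48 -I 256 -n testdata -l 100%FREE vg_testdata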

If you don't need to build LVM, but just need to hand the partitions that occupy 100% of each disk over to the DBMS, you can create the following udev rule for them.

ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_NAME}=="prefix*", ENV{DM_NAME}=="*p1", ENV{DM_NAME}=="*pattern*p1", OWNER="rdbmuser", GROUP="rdbmgroup", MODE="0660", SYMLINK+="prefix_disk/pattern_$env{DM_NAME}"
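
After adding such a rule, it can be applied and checked roughly like this (the symlink directory comes from the rule above):

udevadm control --reload-rules               # re-read the rules files
udevadm trigger --subsystem-match=block      # re-run the rules for block devices
ls -l /dev/prefix_disk/                      # symlinks created by the rule above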

The tests themselves

Having assembled the stand, we began running FIO and got fairly stable numbers, watching sar -n DEV 3 in a parallel terminal to observe how traffic was distributed across the adapters and how it reacted to taking down 1 of the 4 adapters, then 2 of the 4, and bringing them back.

The distribution was uneven.
We have two dual-port cards, and the second port on each was underloaded.

The idea arose to check the power-management settings in the BIOS. No profile was selected there, so we set the HPC profile.

The speed increased a little.

Acting in the same spirit, we decided to update the network cards to the latest stable firmware. The speed was not affected.

We decided to find out how MTU 9000 instead of 1500 affects speed.
The distribution across the adapters became more even, and just by changing the MTU we gained about 2% of additional channel utilization, so we decided to keep it.
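
To make the MTU persistent, it can be set per interface, for example via NetworkManager (the connection names below are placeholders):

for con in iscsi-a1 iscsi-a2 iscsi-b1 iscsi-b2; do    # placeholder connection names
    nmcli connection modify "$con" 802-3-ethernet.mtu 9000
    nmcli connection up "$con"
done
ip -o link | grep -o 'mtu [0-9]*'                     # quick check of the applied MTU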

It was not possible to play with NUMA settings on the test server:

 pwd;cat  en*np{0..1}/device/num*
/sys/class/net
-1
-1
-1
-1

All adapters reported numa_node = -1 (no NUMA affinity exposed), so this configuration step was skipped.

According to the manufacturer's recommendations, 4 ports were configured on the storage system. The default iscsid settings establish only one session (one path) per portal.

We started experimenting with the number of parallel sessions per portal.

To do this, you need to clear the existing node records in /var/lib/iscsi/nodes, restart iscsid, repeat discovery via sendtargets, and log in to the storage system again.
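
A rough outline of that sequence with iscsiadm (the portal address below is a placeholder):

iscsiadm -m node -u                                    # log out of all existing sessions
rm -rf /var/lib/iscsi/nodes/*                          # drop the cached node records
systemctl restart iscsid
iscsiadm -m discovery -t sendtargets -p 192.168.1.10   # rediscover the targets
iscsiadm -m node -l                                    # log back in; nr_sessions from iscsid.conf now applies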

There is no point in increasing the number of paths endlessly: each path creates a block device, their total number is limited to 10 thousand for a reason, and beyond a certain point performance starts to decline.

We stopped at 8.

That gave 4×8 paths to each device. With this configuration, traffic began to spread perfectly evenly across the 4 network ports and tended towards 100 Gbit/s.

To further improve speed, we divided our disks into several logical groups:

  1. Disks for file systems

  2. Disks for database logs

  3. Disks for database data.

Each group of disks was served by its own 4 ports on the storage system.

Finally we achieved almost 100% network load.

Conclusions

After spending the first two working weeks of May on this research (in practice only about 5 working days), we arrived at settings that proved to be up to 5 times better than the existing FC infrastructure.

  1. With small blocks, we pushed the array to its IOPS limit for both writes and reads (1 million write IOPS and 1.5 million read IOPS)

  2. With a medium block size, we almost saturated the network throughput (12.3 GB/s)

Write test

# fio --rw=write --bs=4k --numjobs=136 --iodepth=256 --size=10T --ioengine=libaio --direct=1 -group_reporting --name=journal-test --filename=/dev/mapper/vg_exp_data-exp_data
journal-test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.19
Starting 136 processes
^Cbs: 136 (f=136): [W(136)][0.0%][w=4575MiB/s][w=1171k IOPS][eta 04d:00h:33m:27s]
fio: terminating on signal 2

Read test

# fio --rw=read --bs=4k --numjobs=136 --iodepth=256 --size=10T --ioengine=libaio --direct=1 -group_reporting --name=journal-test --filename=/dev/mapper/vg_exp_data-exp_data
journal-test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.19
Starting 136 processes
^Cbs: 136 (f=136): [R(136)][0.0%][r=5731MiB/s][r=1467k IOPS][eta 02d:21h:33m:50s]

Bonus: Comparison with NVMe over TCP

I suddenly wanted to compare the resulting performance with NVMe over TCP.

I set it up according to the instructions from the article mentioned above.
Naturally, I first ran the tests on a single disk.
That first test was limited by the speed of the single drive.

Realizing that the test was not representative, I assembled an LVM volume from 4 NVMe over TCP disks on the local machine, with (in theory) 128 connections to each.
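
A rough sketch of how such a setup can be assembled with nvme-cli and LVM (the subsystem NQN, portal addresses and device names below are placeholders, not the ones from the original setup):

# connect 4 NVMe/TCP namespaces, 128 I/O queues each
for ip in 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4; do
    nvme connect -t tcp -a "$ip" -s 4420 -n nqn.2023-01.com.example:subsys1 -i 128
done

# stripe an LVM volume across the resulting devices
vgcreate vg_nvme /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
lvcreate -i 4 -I 256 -n nvme_test -l 100%FREE vg_nvme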

But with the same test parameters that produced 12.3 GB/s over iSCSI, I got only 9 GB/s.

Conclusion

Maybe we are doing it all wrong. If that is the case, please share in the comments your recipes for effectively utilizing iSCSI network bandwidth.
