ESXi by Broadcom

For the league of the lazy: a bit of rambling about something nobody really needs, because for normal people all applications moved to the cloud and onto microservices long ago, and they work great there.

Slow virtualization on x86. A little attempt to figure it out. Part 1: General overview

Part 2. What follows from this, and how the scheduler works in Broadcom ESXi. There will be nothing new here for anyone who has already opened the documentation on the change of scheduler model to the side-channel aware scheduler (SCA) – SCAv2, plus Performance Optimizations in VMware vSphere 7.0 U2 CPU Scheduler for AMD EPYC Processors and Optimizing Networking and Security Performance Using VMware vSphere and NVIDIA BlueField DPU with BWI.

Also, back in 2017 a free book was published, VMware vSphere 6.5 Host Resources Deep Dive, which covers the topic well enough that there is no point in retelling it. The book is not in Russian; once again, it is free and you can download it here.
A shorter version, literally 3 pages of basic terms, is given in the 2019 article Understanding ESXi CPU scheduling.
ESXi 6.7 was released back in 2018 (build 8169922) – by the way, do not forget to apply the latest patch to it, and to the entire line from 5.5 to 8.0.2, otherwise you will get it via USB. Because of the processor-level vulnerabilities (Spectre, Meltdown, L1TF, MDS), as many as three scheduler options had to be added:
– default
– Side-Channel Aware Scheduler v1 (SCAv1) – safe, but "a little slow"
– Side-Channel Aware Scheduler v2 (SCAv2), added in 6.7 U2 – less secure, but with almost no slowdown
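If you want to see, without clicking through the host UI, which of the three modes a host is actually in, here is a minimal sketch using pyVmomi. It assumes an already-connected vim.HostSystem object named host, and it assumes that the two advanced boot options from VMware's L1TF mitigation guidance (VMkernel.Boot.hyperthreadingMitigation and VMkernel.Boot.hyperthreadingMitigationIntraVM) map to SCAv1/SCAv2 on your build – verify against the documentation for your exact version.

# A sketch, not gospel: read the SCA-related boot options from a host via pyVmomi.
# Assumes `host` is an already-retrieved vim.HostSystem and that the option names
# below (from the L1TF mitigation guidance) apply to your ESXi build.
from pyVmomi import vim

def sca_mode(host: vim.HostSystem) -> str:
    opt_mgr = host.configManager.advancedOption

    def read(key: str):
        try:
            return bool(opt_mgr.QueryOptions(key)[0].value)
        except vim.fault.InvalidName:
            return None  # option not present on very old builds

    mitigation = read("VMkernel.Boot.hyperthreadingMitigation")
    intra_vm = read("VMkernel.Boot.hyperthreadingMitigationIntraVM")
    if not mitigation:
        return "default scheduler (HT threads shared across VMs)"
    if intra_vm:
        return "SCAv1 (one thread per core)"
    return "SCAv2 (HT shared only between vCPUs of the same VM)"

Both are boot-time options, so flipping them only takes effect after a host reboot.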
There is a 2019 article on "what to choose and when" – Which vSphere CPU Scheduler to Choose – and there was an online assistant, the vSphere 6.7 CPU Scheduler Advisor (now hidden away somewhere).
There is also the article Performance of vSphere 6.7 Scheduling Options, which explains the differences between the schedulers:
Quote (on L1TF itself, see Resources and Response to Side Channel L1 Terminal Fault, in particular the graph in the paragraph "For a specific subset of environments where it cannot be guaranteed that all virtualized guest operating systems are trusted"):
The ESXi-provided patches included a Side-Channel Aware Scheduler (SCAv1) that mitigates the concurrent-context attack vector for L1TF. Once this mode is enabled, the scheduler will only schedule processes on one thread for each core. This mode impacts performance mostly from a capacity standpoint because the system is no longer able to use both hyper-threads on a core. A server that was already fully utilized and running at maximum capacity would see a decrease in capacity of up to approximately 30%. A server that was running at 75% of capacity would see a much smaller impact to performance, but CPU use would rise.

In vSphere 6.7 U2, the side-channel aware scheduler has been enhanced (SCAv2) with a new policy to allow hyper-threads to be utilized concurrently if both threads are running vCPU contexts from the same VM. In this way, L1TF side channels are constrained to not expose information across VM/VM or VM/hypervisor boundaries

To further illustrate how SCAv1 and SCAv2 work, the diagrams below show generally how different schedulers might schedule vCPUs from VMs on the same core. When there is only one VM in the picture (figure 1), the default placement, SCAv1 and SCAv2 will assign one vCPU on each core. When there are two VMs in the picture (figure 2), the default scheduler might schedule vCPUs from two different VMs on the same core, but SCAv1 and SCAv2 will ensure that only the vCPUs from the same VM will be scheduled on the same core. In the case of SCAv1, only one thread is used. In the case of SCAv2, both threads are used, but this always occurs with the vCPUs from the same VM running on a single core.

Further on, the article has graphs showing how the different schedulers affect the performance of different types of VMs: up to 30% loss (70% of performance remains), which is very good.

What else you need to know about the scheduler in ESXi.
The scheduler operates on "worlds", and there are separate worlds for vCPUs (vmx-vcpu), virtual SVGA (vmx-svga), keyboard/video/mouse (vmx-mks), networking, and the system's own processes.
You can start reading about this (apart from the above-mentioned Deep Dive) in the article vSphere 6.7 U2 & later CPU Scheduler modes: Default, SCA v1 and SCA v2.
Every 2-30 milliseconds the scheduler checks the load on the physical cores and hands out a time slice for vCPU-to-pCPU execution. You can also dive into NUMA, after which it becomes even harder to understand what is going on underneath in the physical hardware.
On top of that, the scheduler applies execution priorities that decide who gets more CPU (and memory), described in the section Resource Allocation Reservation.
Resources in this case are counted not in gigahertz but in shares, which depend on priority.
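To make "shares instead of gigahertz" a bit more tangible, here is a minimal pyVmomi sketch that puts a CPU reservation and a custom share value on a VM. The vm object, the function name and the numbers are my own illustration, not anything from the linked documents.

# Sketch: set CPU reservation and shares on a VM through the vSphere API (pyVmomi).
# `vm` is assumed to be a vim.VirtualMachine you have already looked up; values are examples.
from pyVmomi import vim

def set_cpu_allocation(vm: vim.VirtualMachine,
                       reservation_mhz: int = 1000,
                       shares: int = 4000) -> vim.Task:
    alloc = vim.ResourceAllocationInfo()
    alloc.reservation = reservation_mhz      # guaranteed MHz, taken off the host's budget
    alloc.limit = -1                         # -1 means "no limit"
    alloc.shares = vim.SharesInfo(
        level=vim.SharesInfo.Level.custom,   # low/normal/high correspond to 500/1000/2000 per vCPU
        shares=shares)                       # relative weight, only matters under contention
    spec = vim.vm.ConfigSpec(cpuAllocation=alloc)
    return vm.ReconfigVM_Task(spec=spec)     # returns a task object you can wait on

Shares only matter under contention; on an idle host changing them has no visible effect.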
And, as mentioned in the first part, there is a separate setting – Latency Sensitivity – with the following restrictions:
Original text, clear enough here without translation:
You can turn on latency sensitivity for each VM at the VM level or at the host level.
This feature:
Gives exclusive access to physical resources to avoid resource contention due to sharing
With exclusive physical CPU (pCPU) access given, each vCPU entirely owns a specific pCPU; no other vCPUs and threads (including VMkernel I/O threads) are allowed to run on it. This achieves nearly zero ready time and no interruption from other VMkernel threads, improving response time and jitter under CPU contention.
Bypasses virtualization layers to eliminate the overhead of extra processing
Once exclusive access to pCPUs is obtained, the feature allows the vCPUs to bypass the VMkernel's CPU scheduling layer and directly halt in the virtual machine monitor (VMM), since there are no other contexts that need to be scheduled. That way, the cost of running the CPU scheduler code and the cost of switching between the VMkernel and VMM are avoided, leading to much faster vCPU halt/wake-up operations.
Tunes virtualization layers to reduce the overhead
When the VMXNET3 paravirtualized device is used for VNICs in the VM, VNIC interrupt coalescing and LRO support for the VNICs are automatically disabled to reduce response time and jitter.
Although these and other features are disabled automatically when setting latency sensitivity to high, we recommend disabling each of these features independently to avoid any cascading effects when a single parameter is altered when tuning your virtual machines.
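If you prefer to flip Latency Sensitivity from code rather than from the UI, here is a minimal pyVmomi sketch; vm is again an assumed, already-retrieved vim.VirtualMachine. Keep in mind that "high" also expects the VM to have full memory (and ideally CPU) reservations, which this sketch does not set.

# Sketch: set a VM's Latency Sensitivity to "high" via pyVmomi.
# `vm` is an already-retrieved vim.VirtualMachine; reservations are not handled here.
from pyVmomi import vim

def set_latency_sensitivity_high(vm: vim.VirtualMachine) -> vim.Task:
    sens = vim.LatencySensitivity(
        level=vim.LatencySensitivity.SensitivityLevel.high)  # other values: low, normal, medium
    spec = vim.vm.ConfigSpec(latencySensitivity=sens)
    return vm.ReconfigVM_Task(spec=spec)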

On this (Latency Sensitivity) topic you can read the following:
Shares explanation
Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5 (very old but still useful text)
vSphere Resource Management, Update 2, 14 AUG 2020, VMware vSphere 6.7
vSphere Resource Management, Update 3, VMware vSphere 7.0
vCenter Server and Host Management, Update 3, modified on 08 JAN 2024, VMware vSphere 7.0

It should be noted that ESXi 8.0 GA was only released at the end of 2022 – and it is effectively an official public beta, per the New Release Model for vSphere 8, as the list of bugs closed in ESXi 8.0 Update 2 shows (this paragraph is dedicated to one of my friends, now deceased), while the documentation is still being updated for 7.

Among other things, performance is affected by turning CPU Hot Add on or off, described in detail back in 2017 in Virtual Machine vCPU and vNUMA Rightsizing – Guidelines, in the chapter Impact of CPU Hot Add on NUMA scheduling, and in the article CPU Hot Add Performance in vSphere 6.7.
Translation with additions from 2019: Why is CPU Hot Add disabled by default in VMware vSphere, and how does it affect performance?
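A quick way to find the VMs where someone has left CPU Hot Add on is to walk them through the API; a hedged pyVmomi sketch is below. vms is assumed to be an iterable of vim.VirtualMachine objects you have already collected (for example from a container view), and note that the flag can only be changed while a VM is powered off.

# Sketch: report VMs with CPU Hot Add enabled and switch it off where the VM is powered off.
# `vms` is assumed to be an iterable of vim.VirtualMachine objects you already looked up.
from pyVmomi import vim

def disable_cpu_hot_add(vms) -> None:
    for vm in vms:
        cfg = vm.config
        if cfg is None or not cfg.cpuHotAddEnabled:
            continue
        print(f"{vm.name}: CPU Hot Add is enabled")
        if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOff:
            vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(cpuHotAddEnabled=False))
        else:
            print(f"{vm.name}: power the VM off first, the flag cannot be changed while it runs")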
You can also pin a VM directly to specific cores – Assign a Virtual Machine to a Specific Processor.
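In API terms this pinning is the cpuAffinity field of the VM config; a sketch under the usual assumptions (an already-retrieved vm object, host-specific logical CPU numbers) is below. Remember that affinity is generally not available when the VM lives in a DRS-automated cluster and it gets in the way of vMotion.

# Sketch: pin a VM's vCPUs to specific logical CPUs on the host (the "Assign a Virtual
# Machine to a Specific Processor" setting) via pyVmomi.
# `vm` is an already-retrieved vim.VirtualMachine; the CPU numbers are host-specific examples.
from pyVmomi import vim

def pin_vm_to_cpus(vm: vim.VirtualMachine, cpu_ids: list[int]) -> vim.Task:
    affinity = vim.vm.AffinityInfo(affinitySet=cpu_ids)   # e.g. [2, 3]
    spec = vim.vm.ConfigSpec(cpuAffinity=affinity)
    return vm.ReconfigVM_Task(spec=spec)

# Clearing the pinning is usually done by writing an empty affinitySet back:
# pin_vm_to_cpus(vm, [])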

What does all this mean "in general"?
The performance of virtualization in ESXi by Broadcom depends on dozens of factors, the main ones being:
Correct power saving settings in BIOS and ESXi itself – Host Power Management Policies in ESXi
Installing the latest BIOS/CPU updates and choosing the right scheduler
Installing the latest updates for ESXi by Broadcom
Turning off CPU/RAM hot add if it was enabled for some reason
Reducing (yes, exactly reducing) the number of vCPUs to "as many as needed" rather than "as many as were asked for" – more in this case does NOT mean better
Monitoring with esxtop: one, two (Using esxtop to identify storage performance issues for ESX / ESXi (multiple versions) (1008205)), three, four, five, six – a small parsing sketch follows this list
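esxtop in batch mode (esxtop -b -d 10 -n 60 > cpu.csv) writes a perfmon-style CSV which you can then sift for the counters that matter. A small parsing sketch is below; the column naming ("\\host\Group Cpu(id:vmname)\% Ready") is what such captures usually look like, so check the header of your own file before trusting it.

# Sketch: pull the worst-case "% Ready" per scheduling group out of an esxtop batch capture.
# Assumes the usual perfmon-style header with columns like
# "\\hostname\Group Cpu(1234567:vmname)\% Ready"; verify against your own capture.
import csv

def worst_ready(path: str) -> dict[str, float]:
    worst: dict[str, float] = {}
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        ready_cols = {i: name for i, name in enumerate(header)
                      if "Group Cpu(" in name and name.endswith("\\% Ready")}
        for row in reader:
            for i, name in ready_cols.items():
                try:
                    value = float(row[i])
                except (ValueError, IndexError):
                    continue
                group = name.split("Group Cpu(")[1].split(")")[0]  # "1234567:vmname"
                worst[group] = max(worst.get(group, 0.0), value)
    return worst

# Top 10 offenders:
# for grp, rdy in sorted(worst_ready("cpu.csv").items(), key=lambda kv: -kv[1])[:10]:
#     print(f"{grp:40s} worst %RDY {rdy:6.2f}")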

And look for the video (hidden away somewhere) 60 Minutes of Non-Uniform Memory Architecture [HBI2278BE] by Frank Denneman, from 2019.

For dessert, dive into NUMA – The 2016 NUMA Deep Dive Series from the same author: Frank Denneman
Part 0: Introduction NUMA Deep Dive Series
Part 1: From UMA to NUMA
Part 2: System Architecture
Part 3: Cache Coherency
Part 4: Local Memory Optimization
Part 5: ESXi VMkernel NUMA Constructs
Part 6: NUMA Initial Placement and Load Balancing Operations (unpublished)
Part 7: From NUMA to UMA (unpublished)

Summing up.
Virtualization without tuning, out of the box with default settings, is of course slow compared to pure bare metal running 1 (one) program that uses 100% of the CPU and 100% of the RAM and neither writes to nor reads from disk. Because if the program writes, and especially if it reads, then the local and remote disk subsystems also have to be configured, along with the buffers and switching modes of every device along the path and the Ethernet speed – especially if DCB is bolted on somewhere along the way and TSO and the like are bolted on locally.
However, using a modern server, even the simplest 32-core ARM box, for 1-2 workloads that, thanks to their foundation of crutches (1C), are slow on any hardware and cannot use all 32 cores anyway, is somewhat economically questionable. Possible, though – what to buy and how to use it is the business's choice. Some of my colleagues used servers to optimize airflow in the rack: blanking panels have to be purchased through a tender, there were already servers in the rack, and pulling a server out of the data center is a half-day affair because it requires the presence of the financially responsible person. So the servers worked as blanking panels. It helped a lot, by the way, against hot air being sucked from the back of the rack around to the front.

To move from a gut feeling of "it is slow" to actual numbers, there is esxtop, there is CPU ready and a dozen other counters. There is the article Determining if multiple virtual CPUs are causing performance issues (1005362), along with co-stop (Maximizing VMware Performance and CPU Utilization) and iowait (Using esxtop to identify storage performance issues for ESX / ESXi (multiple versions) (1008205)). Maybe it really is "slowing down". But most likely you need to go inside the VM and look at the application running there – and figure out what it, the bastard, actually needs.
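For the vCenter charts specifically, CPU Ready comes back as a "summation" in milliseconds rather than a percentage; the usual conversion is ready time divided by the sample interval, so here is a tiny calculator. The 20-second real-time interval and the per-vCPU division are the common convention; treat the numbers as illustrative.

# Sketch: convert the CPU Ready "summation" value from a vCenter chart into a percentage.
# Real-time charts sample every 20 s; daily roll-ups every 300 s. Dividing by the vCPU
# count gives the average per vCPU.

def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    return ready_ms / (interval_s * 1000.0) / vcpus * 100.0

# Example: 1600 ms of ready time in a 20 s real-time sample on a 4-vCPU VM
print(cpu_ready_percent(1600, interval_s=20, vcpus=4))   # -> 2.0 (% per vCPU on average)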

THERE IS NO UNIVERSAL ANSWER AS TO WHAT EXACTLY IS SLOW IN YOUR SETUP AND WHY IT IS SLOW.
More precisely, there is an answer, but you won’t like it.

Bonus.
Performance Best Practices for VMware vSphere 8.0 Update 2
Performance Best Practices for VMware vSphere 8.0 Update 1
Meanwhile, Best Practices for vSAN Networking so far exists only for 7.
How ESXi NUMA Scheduling Works
VMware NUMA Optimization Algorithms and Settings
Intermittent CPU ready time due to NUMA action affinity on VMware ESXi (starting with the part about LocalityWeightActionAffinity)
NUMA nodes are heavily load imbalanced causing high contention for some virtual machines (2097369)

And yet, if you suddenly decide that you understand everything, you can dive further – into Reddit, into the article Setting the number of cores per CPU in a virtual machine (1010184) and into the previously mentioned Virtual Machine vCPU and vNUMA Rightsizing – Guidelines (which is not only about NUMA) – and then go deeper still, into the virtual machine itself and how the program inside the VM works with memory – Soft-NUMA (SQL Server).
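Before digging into Soft-NUMA inside the guest, it is worth at least looking at what topology your VMs actually present; a small pyVmomi sketch for dumping vCPU counts and cores-per-socket is below, again assuming you already have the vim.VirtualMachine objects in hand.

# Sketch: print each VM's vCPU topology (sockets x cores per socket) and memory size,
# i.e. the inputs the vNUMA rightsizing guidelines talk about.
# `vms` is assumed to be an iterable of vim.VirtualMachine objects you already looked up.
from pyVmomi import vim

def print_vcpu_topology(vms) -> None:
    for vm in vms:
        hw = vm.config.hardware if vm.config else None
        if hw is None:
            continue
        cores = hw.numCoresPerSocket or 1
        sockets = hw.numCPU // cores
        print(f"{vm.name}: {hw.numCPU} vCPU = {sockets} socket(s) x {cores} core(s), "
              f"{hw.memoryMB} MB RAM")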
