Introduction to SSD. Part 5. Controller

In the previous parts of the series, we covered the history of drives, the interfaces and form factors in use, and how storage is organized at the physical level. This fifth part is devoted to the "brain" of a modern solid-state drive.

A modern storage controller is a small computer that accepts standardized commands and performs the appropriate actions on the storage it manages. Its internal design, meanwhile, can be anything at all.

Intel's 6.4 TB P4618 drive, for instance, appears to the system as two 3.2 TB drives. The same applies to hard drives: Seagate drives with MACH.2 technology are two drives in a single enclosure, tied together by a single controller.

The controller is a rather complex device that performs various data-management tasks depending on the drive's purpose. For example, databases often require the drive to write straight to non-volatile memory, bypassing the cache; in such a scenario a server SATA SSD can be faster than a consumer NVMe drive. Because controllers vary so widely, we will not go into the details of specific devices, but instead discuss the general principles behind a modern solid-state drive.

How Writing Works

Blocks and pages in NAND memory. Source

SSD storage is built from a multitude of interconnected field-effect transistors. With this design, reads and writes are performed in pages of data, typically 4 KiB in size. Changing even a single bit on disk therefore requires rewriting an entire page. This effect is called write amplification.

An SSD also cannot update data within a page in place. A page update takes four steps:

  1. Reading data from a page into a buffer.
  2. Changing data in the page.
  3. Erasing the old page.
  4. Writing updated data from the buffer.

The controller can write individual pages, but it can erase only whole blocks, i.e. sequences of pages; typically a block consists of 64 data pages. Given these constraints and the finite rewrite endurance of the cells, the controller has to carry out write and erase operations carefully.
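
To get a feel for the scale of the problem, here is a toy calculation using the sizes mentioned above (4 KiB pages, 64 pages per block); it is an illustration, not a model of any real controller:

```python
# A toy calculation of write amplification (sizes assumed, not a real FTL).
PAGE = 4 * 1024        # page size in bytes, typical for NAND
BLOCK = 64 * PAGE      # erase unit: 64 pages

def write_amplification(host_bytes, nand_bytes):
    """Ratio of bytes physically written to NAND vs. bytes the host asked for."""
    return nand_bytes / host_bytes

# Changing a single byte still programs a whole page:
print(write_amplification(1, PAGE))       # 4096.0
# Rewriting one page in place would mean erasing and reprogramming the block:
print(write_amplification(PAGE, BLOCK))   # 64.0
```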

Wear Leveling

Wear leveling. Source

Modern drives are built on TLC cells, whose endurance is several times lower than that of SLC and MLC cells. If some program in the OS constantly rewrote a small file and the controller "naively" updated the same data page each time, the block containing that page would soon exhaust its endurance. That exhaustion would show up in the drive's health indicators and inevitably worry the system administrator.

To avoid wearing out individual blocks, controllers use wear leveling. With wear leveling, data is updated without erasing the old page:

  1. Reading data from a page into a buffer.
  2. Changing data in the page.
  3. Writing updated data from the buffer to a “blank” page.
  4. Marking the old page as "dirty".
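
The four steps above can be sketched as a toy flash translation layer; every name and structure here is hypothetical and vastly simpler than a real controller:

```python
# A toy out-of-place page update (hypothetical names, not a real FTL).
CLEAN, LIVE, DIRTY = "clean", "live", "dirty"

class ToyFTL:
    def __init__(self, num_pages):
        self.state = [CLEAN] * num_pages  # physical page states
        self.data = [None] * num_pages    # physical page contents
        self.mapping = {}                 # logical page -> physical page

    def write(self, logical, payload):
        phys = self.state.index(CLEAN)    # step 3 target: any clean page
        self.data[phys] = payload         # program the new data into it
        self.state[phys] = LIVE
        old = self.mapping.get(logical)
        if old is not None:
            self.state[old] = DIRTY       # step 4: the old page becomes dirty
        self.mapping[logical] = phys

ftl = ToyFTL(num_pages=4)
ftl.write(0, b"v1")
ftl.write(0, b"v2")                       # the update lands on a new page
print(ftl.state)                          # ['dirty', 'live', 'clean', 'clean']
```

Note that `write` never modifies the old physical page: the update only lands on a fresh page, and the old one waits to be erased along with its block.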

A block is erased once all of its pages are marked "dirty". A natural question arises: what happens when no "clean" pages are left and no block is ready to be erased? The answer is simple: over-provisioning and garbage collection.

Over-provisioning and Garbage Collection

In a way, the drive manufacturer shortchanges us twice. First, decimal prefixes are used instead of binary ones: 480 GB is only 447 GiB. Second, the drive's actual capacity is larger than what is available to the user: part of it is reserved by the manufacturer for the controller's internal needs. This reserve is called the spare area.

This way, the controller always has some free space to use for its internal processes. Although there are no exact figures, various sources claim that 7 to 28% of the raw capacity is reserved for the controller.
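
Both "tricks" are easy to check with arithmetic. The 512 GiB raw-capacity figure below is an assumption for illustration, not a published specification:

```python
# First "trick": decimal vs. binary prefixes.
labeled = 480 * 10**9          # 480 GB as printed on the label
visible = labeled / 2**30      # the same number of bytes expressed in GiB
print(round(visible))          # 447

# Second "trick": hidden spare area. The 512 GiB of raw NAND here is an
# assumed figure for illustration only.
raw = 512 * 2**30
spare_percent = (raw - labeled) / labeled * 100
print(round(spare_percent, 1)) # 14.5
```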

Increasing the reserved area decreases the available space but usually improves drive performance. To enlarge the spare area, it is enough to leave part of the drive unpartitioned. If you want to do it properly, however, you can shrink the capacity visible to the host with the -N switch of the hdparm utility.

Either way, the area reserved by the manufacturer cannot be reclaimed for your own use.

Garbage collection process. Source

In addition to leveling wear, controllers often run a background process called garbage collection. During garbage collection, the valid pages from several blocks are gathered into a single block; the original blocks can then be erased, since no live data pages remain in them.
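
A minimal sketch of one garbage-collection pass (hypothetical page layout, far simpler than anything a real controller does):

```python
# A toy garbage-collection pass: live pages from partially dirty blocks are
# packed into as few blocks as possible, freeing the rest for erasure.
def collect_garbage(blocks, pages_per_block=4):
    """blocks: list of blocks; a block is a list of pages, where a page is
    either data or None (dirty). Returns (compacted_blocks, erased_count)."""
    live = [p for block in blocks for p in block if p is not None]
    compacted = [live[i:i + pages_per_block]
                 for i in range(0, len(live), pages_per_block)]
    erased = len(blocks) - len(compacted)
    return compacted, erased

blocks = [["a", None, "b", None],   # half dirty
          [None, "c", None, None],  # mostly dirty
          [None, None, None, None]] # fully dirty, erasable as-is
compacted, erased = collect_garbage(blocks)
print(compacted)   # [['a', 'b', 'c']]
print(erased)      # 2
```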

It is important to note that the garbage collector only shuffles data around so that as many clean blocks as possible are available. It cannot know that some file is marked as deleted on the file system, because the drive controller does not operate in file-system terms.

To solve this problem, each protocol has a command for notifying the controller that data has been deleted. For NVMe it is Deallocate, for SATA it is TRIM, and for SCSI it is UNMAP. The essence of all three is the same: mark the pages holding the deleted file as "dirty".

The controller constantly tracks the state of every page, which suggests an obvious optimization: if the operating system reads from pages that hold no data, the controller can simply return the required number of zeros instead of performing a real read.
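
The optimization can be sketched as follows (hypothetical mapping structure): a logical page with no physical mapping is served as zeros without touching the NAND at all:

```python
# Sketch of the "read from nowhere" optimization (hypothetical controller
# state): deallocated pages are synthesized as zeros, not read from NAND.
PAGE = 4096

def read_page(mapping, nand, logical):
    phys = mapping.get(logical)
    if phys is None:                  # page was trimmed or never written
        return bytes(PAGE)            # synthesize zeros, no NAND access
    return nand[phys]                 # a real read of the mapped page

nand = {0: b"\x11" * PAGE}
mapping = {7: 0}                      # logical page 7 -> physical page 0
print(read_page(mapping, nand, 7) == b"\x11" * PAGE)  # True
print(read_page(mapping, nand, 3) == bytes(PAGE))     # True (all zeros)
```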

This is easy to confirm experimentally. Perform a Secure Erase on the drive and run random-read tests with a queue depth of 64. Then fill the drive with sequential writes, preferably twice over, and repeat the tests.

Block size    Clean drive    Filled drive
4M            3400 MiB/s     3376 MiB/s
8M            3399 MiB/s     3336 MiB/s

In our tests we used a Micron 7300 1.92 TB SSD connected via PCIe 3.0 x4. Four lanes of PCI Express 3.0 can carry 3940 MB/s, or 3757 MiB/s. We did not reach that limit, presumably because of NVMe protocol overhead. Nevertheless, you can see that reads from an empty drive level off at about 3400 MiB/s, and that the results became measurably worse after the drive was filled.
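
The PCIe figure is easy to reproduce: PCIe 3.0 runs at 8 GT/s per lane with 128b/130b line coding, which for four lanes gives roughly the 3940 MB/s quoted above (the small difference is rounding):

```python
# Effective PCIe 3.0 x4 bandwidth from first principles.
gts = 8e9                     # transfers per second per lane (8 GT/s)
lanes = 4
efficiency = 128 / 130        # 128b/130b line-code efficiency
bytes_per_sec = gts * lanes * efficiency / 8

print(round(bytes_per_sec / 1e6))    # 3938  (MB/s, decimal)
print(round(bytes_per_sec / 2**20))  # 3756  (MiB/s, binary)
```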

Even though the drive controller always tries its best, the system administrator should occasionally review the drive's metrics firsthand.

Indicators

Regardless of the interface, SSDs expose a set of health indicators that the system administrator can read. SATA drives use SMART attributes, which are not standardized. The lack of a standard leads to different interpretations of the same indicator.

Let’s look at the output of the smartctl utility, using an Intel S4510 as an example.

# smartctl -iA /dev/sda
smartctl 7.3 (build date Jan  1 2021) [x86_64-linux-4.15.0-51-generic] (local build)
Copyright (C) 2002-21, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel S4510/S4610/S4500/S4600 Series SSDs
Device Model:     INTEL SSDSC2KB480G8
Serial Number:    ThereIsNoSerialHere
LU WWN Device Id: 5 5cd2e4 14fd823b7
Firmware Version: XCV10100
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jul  5 14:30:43 2021 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       21345
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       76
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       48
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2625 (76 65535)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error_Count  0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Drive_Temperature       0x0022   080   080   000    Old_age   Always       -       20 (Min/Max 11/21)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       48
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       20
197 Pending_Sector_Count    0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       67625
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       276
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       84
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       1280413
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
234 Thermal_Throttle_Status 0x0032   100   100   000    Old_age   Always       -       0/0
235 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       2625 (76 65535)
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       67625
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       359222
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       477803

The following attributes are of interest for our drive:

  • 5 Reallocated_Sector_Ct – the number of reallocated sectors whose endurance is exhausted.
  • 187 Uncorrectable_Error_Cnt – the number of errors that could not be fixed by error-correction codes.
  • 233 Media_Wearout_Indicator – the drive's wear indicator.
  • 243 NAND_Writes_32MiB – the total amount of data written to the NAND over the drive's lifetime, in 32 MiB units.

This list already shows the problems caused by the lack of a standard: the Host_Writes_32MiB attribute appears twice, as ID 225 and as ID 241.

For tracking the wear of a healthy drive, attribute 233 Media_Wearout_Indicator is the interesting one: once it reaches 1023, the drive is locked by the firmware and becomes available in read-only mode.
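
Since the column layout of the smartctl table is fairly stable, the attributes of interest can be pulled out with a few lines of Python. This is a sketch matched to the Intel table above, not a general-purpose parser; production scripts should use smartctl's JSON output instead:

```python
import re

# Extract raw values of selected SMART attributes from `smartctl -A` text.
# Attribute IDs are vendor-specific; this set matches the Intel table above.
INTERESTING = {5, 187, 233, 243}

def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} from a smartctl attribute table."""
    result = {}
    for line in text.splitlines():
        m = re.match(
            r"\s*(\d+)\s+\S+\s+0x[0-9a-f]{4}\s+\d+\s+\d+\s+\d+"
            r"\s+\S+\s+\S+\s+\S+\s+(\d+)", line)
        if m and int(m.group(1)) in INTERESTING:
            result[int(m.group(1))] = int(m.group(2))
    return result

sample = """\
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       477803
"""
attrs = parse_smart_attributes(sample)
print(attrs)   # {5: 0, 233: 0, 243: 477803}
```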

SMART metrics are a feature of the SATA protocol. NVMe drives instead provide an NVMe health log, which smartctl can also read. Similar output can be obtained with the nvme smart-log command.

# smartctl -iA /dev/nvme0n1
smartctl 7.3 (build date Jan  1 2021) [x86_64-linux-4.15.0-51-generic] (local build)
Copyright (C) 2002-21, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPF2NV153TZ
Serial Number:                      ThereIsNoSerialToo
Firmware Version:                   ACV10024
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Total NVM Capacity:                 15,362,991,415,296 [15.3 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.3
Number of Namespaces:               128
Namespace 1 Size/Capacity:          15,362,991,415,296 [15.3 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            5cd2e4 71434b0400
Local Time is:                      Mon Jul  5 14:31:34 2021 MSK

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    77,871,643 [39.8 TB]
Data Units Written:                 148,394,810 [75.9 TB]
Host Read Commands:                 5,737,206,704
Host Write Commands:                1,802,643,030
Controller Busy Time:               81
Power Cycles:                       5
Power On Hours:                     146
Unsafe Shutdowns:                   1
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
# nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 29 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 77,871,643
data_units_written                  : 148,394,810
host_read_commands                  : 5,737,206,704
host_write_commands                 : 1,802,643,030
controller_busy_time                : 81
power_cycles                        : 5
power_on_hours                      : 148
unsafe_shutdowns                    : 1
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0

The NVMe output has fewer obscure indicators, but there is still room for confusion. It is tempting to assume that percentage_used reflects how much disk space the user has filled, but it does not: this parameter is the equivalent of Media_Wearout_Indicator and indicates drive wear.

Keep in mind that these indicators, and the controller's behavior in general, are implemented in firmware, which can be updated.

Firmware Updates

Firmware for solid-state drives rarely gets much thought. At best, a fresh version is flashed right after purchase and then forgotten for the rest of the drive's life.

Be that as it may, firmware updates rarely bring significant, user-visible changes. Like any other software, firmware may contain bugs, including critical ones. Fortunately, this happens rarely, so there is no need to constantly keep the firmware up to date on every drive in use.

Although an NVMe drive can be flashed with the fw-download and fw-commit commands, firmware upgrades are most often performed with utilities provided by the drive manufacturer. To avoid potentially destructive actions, we will not publish exact commands; instead, refer to the official instructions from your manufacturer.

Conclusion

Drive controllers are complex devices that control the equally complex processes that take place inside solid state drives. We have considered only the most interesting processes in general terms.

If you want to dive deeper into the specifics of working with NVMe, we recommend the article on NVMe namespaces.
