Linux Kernel 5.0: Writing a Simple Block Device under blk-mq


Good News, Everyone!
Linux kernel 5.0 is already here and is showing up in bleeding-edge distributions such as Arch, openSUSE Tumbleweed, and Fedora.

And if you look at the release candidates of Ubuntu Disco Dingo and Red Hat 8, it becomes clear: kernel 5.0 will soon move from enthusiasts' desktops to serious servers.
Someone will say: so what? Just another release, nothing special. Linus Torvalds himself said:

I’d like to point out (yet again) that we don’t do feature-based releases, and that “5.0” doesn’t mean anything more than that the 4.x numbers started getting big enough that I ran out of fingers and toes.

However, the floppy disk driver (for those who don't know, these are disks about the size of a shirt's breast pocket, with a capacity of 1.44 MB) did get fixed …
And here is why:

It's all about the multi-queue block layer (blk-mq). There are plenty of introductory articles about it on the Internet, so let's get straight to the point. The transition to blk-mq started a long time ago and advanced slowly. Multi-queue SCSI appeared (kernel parameter scsi_mod.use_blk_mq), then new schedulers such as mq-deadline and bfq, and so on …

[root@fedora-29 sblkdev]# cat /sys/block/sda/queue/scheduler
[mq-deadline] none

By the way, what's yours?
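If you would rather answer that question from a program than from the shell, here is a minimal userspace sketch. It assumes the sysfs file has the format shown above, with the active scheduler in square brackets. The function names active_scheduler() and read_active_scheduler() are my own for illustration and are not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the name found between '[' and ']' into out.
 * Returns 0 on success, -1 if there is no bracketed name
 * or it does not fit into out.
 */
int active_scheduler(const char *line, char *out, size_t out_size)
{
    const char *open;
    const char *close;
    size_t len;

    assert(line && out);

    open = strchr(line, '[');
    close = open ? strchr(open, ']') : NULL;
    if (!open || !close)
        return -1;

    len = (size_t)(close - open - 1);
    if (len + 1 > out_size)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read /sys/block/<disk>/queue/scheduler and extract the active scheduler. */
int read_active_scheduler(const char *disk, char *out, size_t out_size)
{
    char path[256];
    char line[256];
    FILE *f;
    int ret = -1;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", disk);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fgets(line, sizeof(line), f))
        ret = active_scheduler(line, out, out_size);
    fclose(f);
    return ret;
}
```

Calling read_active_scheduler("sda", buf, sizeof(buf)) on the machine above should leave "mq-deadline" in buf. If the line contains no bracketed entry, the function returns -1.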

The number of block device drivers that work the old way has been shrinking. And in 5.0, the blk_init_queue() function was removed as no longer needed. So the glorious old code from lwn.net/Articles/58720, dating back to 2003, not only fails to build but has also lost its relevance. Moreover, new distributions being prepared for release this year use the multi-queue block layer in their default configuration. For example, Manjaro 18 ships a 4.19 kernel, yet blk-mq is enabled by default.

Therefore, we can consider the transition to blk-mq complete in 5.0. For me this is an important event that will require rewriting code and additional testing. That in itself promises bugs large and small, as well as a few crashed servers ("It must be done, Fedya, it must be done!" (c)).

By the way, if anyone thinks this turning point has not arrived for RHEL 8 because its kernel was “frozen” at version 4.18, they are mistaken. In the fresh RHEL 8 RC, novelties from 5.0 have already been backported, and the blk_init_queue() function has been cut out as well (probably while pulling the next commit from github.com/torvalds/linux into their sources).
In general, the kernel version “freeze” at Linux distributors such as SUSE and Red Hat has long been a marketing concept. The system reports, say, version 4.4, while the functionality actually matches a fresh vanilla 4.8. Meanwhile the official website carries a line like: “In the new distribution we have kept a stable 4.4 kernel for you.”
But we digress …

So, we need a new simple block device driver to make it clearer how all this works.
The source is at github.com/CodeImp/sblkdev. Feel free to discuss it, send pull requests, and open issues; I will fix things. QA has not verified it yet.

In the rest of the article I will try to explain what is what, so from here on there is a lot more code.
I apologize in advance that the Linux kernel coding style is not fully followed, and yes, I do not like goto.

So let's start with entry points.

static int __init sblkdev_init(void)
{
    int ret = SUCCESS;

    // If _sblkdev_major is 0, the kernel picks a free major number and returns it
    _sblkdev_major = register_blkdev(_sblkdev_major, _sblkdev_name);
    if (_sblkdev_major <= 0) {
        printk(KERN_WARNING "sblkdev: unable to get major number\n");
        return -EBUSY;
    }

    ret = sblkdev_add_device();
    if (ret)
        unregister_blkdev(_sblkdev_major, _sblkdev_name);

    return ret;
}

static void __exit sblkdev_exit(void)
{
    sblkdev_remove_device();

    if (_sblkdev_major > 0)
        unregister_blkdev(_sblkdev_major, _sblkdev_name);
}

module_init(sblkdev_init);
module_exit(sblkdev_exit);

Obviously, sblkdev_init() is called when the module is loaded, and sblkdev_exit() when it is unloaded.
The register_blkdev() function registers a block device and assigns it a major number; unregister_blkdev() frees this number.

The key structure of our module is sblkdev_device_t.

// The internal representation of our device
typedef struct sblkdev_device_s
{
    sector_t capacity; // Device size in sectors (passed to set_capacity())
    u8 *data; // The data array; u8 is an 8-bit unsigned type
    atomic_t open_counter; // How many openers

    struct blk_mq_tag_set tag_set;
    struct request_queue *queue; // The request queue

    struct gendisk *disk; // The gendisk structure
} sblkdev_device_t;

It holds all the information about the device that the kernel module needs, in particular: the capacity of the block device, the data itself (this is a simple driver, after all), and pointers to the disk and the queue.

All initialization of the block device is done in the sblkdev_add_device () function.

static int sblkdev_add_device(void)
{
    int ret = SUCCESS;

    sblkdev_device_t *dev = kzalloc(sizeof(sblkdev_device_t), GFP_KERNEL);
    if (dev == NULL) {
        printk(KERN_WARNING "sblkdev: unable to allocate %zu bytes\n", sizeof(sblkdev_device_t));
        return -ENOMEM;
    }
    _sblkdev_device = dev;

    do {
        ret = sblkdev_allocate_buffer(dev);
        if (ret)
            break;

#if 0 // simpler variant with the helper function blk_mq_init_sq_queue(). It is available since kernel 4.20 (vanilla).
        { // configure tag_set
            struct request_queue *queue;

            dev->tag_set.cmd_size = sizeof(sblkdev_cmd_t);
            dev->tag_set.driver_data = dev;

            queue = blk_mq_init_sq_queue(&dev->tag_set, &_mq_ops, 128, BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE);
            if (IS_ERR(queue)) {
                ret = PTR_ERR(queue);
                printk(KERN_WARNING "sblkdev: unable to allocate and initialize tag set\n");
                break;
            }
            dev->queue = queue;
        }
#else // more flexible variant
        { // configure tag_set
            dev->tag_set.ops = &_mq_ops;
            dev->tag_set.nr_hw_queues = 1;
            dev->tag_set.queue_depth = 128;
            dev->tag_set.numa_node = NUMA_NO_NODE;
            dev->tag_set.cmd_size = sizeof(sblkdev_cmd_t);
            dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
            dev->tag_set.driver_data = dev;

            ret = blk_mq_alloc_tag_set(&dev->tag_set);
            if (ret) {
                printk(KERN_WARNING "sblkdev: unable to allocate tag set\n");
                break;
            }
        }

        { // configure queue
            struct request_queue *queue = blk_mq_init_queue(&dev->tag_set);
            if (IS_ERR(queue)) {
                ret = PTR_ERR(queue);
                printk(KERN_WARNING "sblkdev: Failed to allocate queue\n");
                break;
            }
            dev->queue = queue;
        }
#endif
        dev->queue->queuedata = dev;

        { // configure disk
            struct gendisk *disk = alloc_disk(1); // only one partition
            if (disk == NULL) {
                printk(KERN_WARNING "sblkdev: Failed to allocate disk\n");
                ret = -ENOMEM;
                break;
            }

            disk->flags |= GENHD_FL_NO_PART_SCAN; // only one partition
            // disk->flags |= GENHD_FL_EXT_DEVT;
            disk->flags |= GENHD_FL_REMOVABLE;

            disk->major = _sblkdev_major;
            disk->first_minor = 0;
            disk->fops = &_fops;
            disk->private_data = dev;
            disk->queue = dev->queue;
            sprintf(disk->disk_name, "sblkdev%d", 0);
            set_capacity(disk, dev->capacity); // capacity is in sectors

            dev->disk = disk;
            add_disk(disk);
        }

        printk(KERN_WARNING "sblkdev: simple block device was created\n");
    } while (false);

    if (ret) {
        sblkdev_remove_device();
        printk(KERN_WARNING "sblkdev: Failed to add block device\n");
    }

    return ret;
}

We allocate memory for the structure and a buffer for storing the data. Nothing special here.
Next, we initialize the request queue, either with the single function blk_mq_init_sq_queue() or with two calls: blk_mq_alloc_tag_set() + blk_mq_init_queue().
By the way, if you look at the source of blk_mq_init_sq_queue(), which appeared in kernel 4.20, you will see that it is just a wrapper around blk_mq_alloc_tag_set() and blk_mq_init_queue(). It hides many of the queue parameters, but it looks much simpler. Choose whichever option you like; I prefer the more explicit one.

The key element in this code is the global variable _mq_ops.

static struct blk_mq_ops _mq_ops = {
    .queue_rq = queue_rq,
};

This is where the function that processes requests is registered, but more about it later. The main thing is that we have designated the entry point to the request handler.

Now that we have created a queue, we can create an instance of the disk.
Not much has changed here. The disk is allocated, its parameters are set, and the disk is added to the system. A note on the disk->flags field: it lets us tell the system that the disk is removable or, for example, that it contains no partitions and there is no need to look for them.

Disk management goes through the _fops structure.

static const struct block_device_operations _fops = {
    .owner = THIS_MODULE,
    .open = _open,
    .release = _release,
    .ioctl = _ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl = _compat_ioctl,
#endif
};

The _open and _release entry points are not very interesting for a simple block device module: apart from the atomic increment and decrement of the counter, there is nothing there. I also left compat_ioctl without an implementation, since a setup with a 64-bit kernel and a 32-bit user space does not seem promising to me.

_ioctl, however, lets you handle system requests for this disk. When a disk appears, the system tries to learn more about it. You can answer some requests as you see fit (for example, to pretend to be a new CD), but the general rule is this: if you do not want to respond to requests that are of no interest to you, simply return the error code -ENOTTY. If necessary, this is also the place to add your own request handlers for this particular disk.
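The "answer what you care about, return -ENOTTY for the rest" rule can be sketched in plain userspace C. This is only a model of the dispatch logic, not the real _ioctl from the module. The command codes, the geometry structure, and the function name sketch_ioctl() are hypothetical. A real handler would take its codes from the kernel headers and would copy results to user space with copy_to_user() instead of writing through a pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical command codes, used by this sketch only. */
enum {
    SKETCH_GET_GEOMETRY = 1, /* a request we choose to answer */
    SKETCH_PRIVATE_PING = 2  /* our own, driver-specific request */
};

/* Hypothetical geometry reply. */
struct sketch_geometry {
    unsigned char heads;
    unsigned char sectors;
    unsigned short cylinders;
    unsigned long start;
};

/* Dispatch one command for a disk of nr_sectors sectors. */
int sketch_ioctl(unsigned long nr_sectors, unsigned int cmd, void *arg)
{
    switch (cmd) {
    case SKETCH_GET_GEOMETRY: {
        struct sketch_geometry *geo = arg;

        if (!geo)
            return -EINVAL;
        /* Invent a plausible geometry: 4 heads, 16 sectors per track. */
        geo->heads = 4;
        geo->sectors = 16;
        geo->cylinders = (unsigned short)(nr_sectors / (4 * 16));
        geo->start = 0;
        return 0;
    }
    case SKETCH_PRIVATE_PING:
        return 0;
    default:
        /* Not interested: tell the caller this ioctl is not supported. */
        return -ENOTTY;
    }
}
```

The default branch is the important part: anything the driver does not recognize gets -ENOTTY, and the caller is expected to take that as "this device does not support the command".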

So, the device has been added; now we need to take care of releasing its resources. This is not Rust, after all.

static void sblkdev_remove_device(void)
{
    sblkdev_device_t *dev = _sblkdev_device;
    if (dev) {
        if (dev->disk)
            del_gendisk(dev->disk);

        if (dev->queue) {
            blk_cleanup_queue(dev->queue);
            dev->queue = NULL;
        }

        if (dev->tag_set.tags)
            blk_mq_free_tag_set(&dev->tag_set);

        if (dev->disk) {
            put_disk(dev->disk);
            dev->disk = NULL;
        }

        sblkdev_free_buffer(dev);

        kfree(dev);
        _sblkdev_device = NULL;

        printk(KERN_WARNING "sblkdev: simple block device was removed\n");
    }
}

In principle, everything is obvious: we remove the disk object from the system and release the queue, after which we free our buffers (data areas).

And now the most important part: request processing in the queue_rq() function.

static blk_status_t queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd)
{
    blk_status_t status = BLK_STS_OK;
    struct request *rq = bd->rq;

    blk_mq_start_request(rq);

    // we must not sleep in this context
    {
        unsigned int nr_bytes = 0;

        if (do_simple_request(rq, &nr_bytes) != SUCCESS)
            status = BLK_STS_IOERR;

        printk(KERN_WARNING "sblkdev: request processed %u bytes\n", nr_bytes);

#if 0 // proprietary module
        blk_mq_end_request(rq, status);
#else // can report the real number of processed bytes
        if (blk_update_request(rq, status, nr_bytes)) // GPL-only symbol
            BUG();
        __blk_mq_end_request(rq, status);
#endif
    }

    return BLK_STS_OK; // always return ok
}

Let's start with the parameters. The first, struct blk_mq_hw_ctx *hctx, is the state of the hardware queue. In our case we manage without a hardware queue, so it is unused.
The second parameter, const struct blk_mq_queue_data *bd, has a structure so laconic that I am not afraid to show it to you in its entirety:

struct blk_mq_queue_data {
    struct request *rq;
    bool last;
};

So in essence it is the same old request that has been with us since times that even the chronicler elixir.bootlin.com no longer remembers. We take the request and start processing it, notifying the kernel by calling blk_mq_start_request(). When processing is complete, we notify the kernel by calling blk_mq_end_request().

A small note here: blk_mq_end_request() is, in essence, a wrapper around blk_update_request() + __blk_mq_end_request(). With blk_mq_end_request() you cannot specify how many bytes were actually processed; it assumes that everything was.

The alternative has a catch of its own: blk_update_request() is exported only to GPL modules. That is, if you want to create a proprietary kernel module (may your PM spare you this thorny path), you cannot use blk_update_request(). So the choice is yours.
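To make the "how many bytes were actually processed" part more tangible, here is a small userspace model of the bookkeeping. It relies on blk_update_request() returning true while the request still has data pending and false once it is fully completed, which is why the listing above calls BUG() on a true result. The sketch_request structure and sketch_update_request() are hypothetical stand-ins and not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct request: tracks only the pending bytes. */
struct sketch_request {
    unsigned int remaining;
};

/*
 * Account for nr_bytes of completed data.
 * Returns true while the request still has data pending,
 * false once it is fully completed.
 */
bool sketch_update_request(struct sketch_request *rq, unsigned int nr_bytes)
{
    assert(rq);

    if (nr_bytes >= rq->remaining) {
        rq->remaining = 0;
        return false;
    }
    rq->remaining -= nr_bytes;
    return true;
}
```

In this model, an 8192-byte request for which only 4096 bytes are reported stays pending. In the driver above, the equivalent situation would end up in BUG(), so the byte count handed to blk_update_request() has to cover the whole request.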

The actual transfer of bytes between the request and the buffer I moved into the do_simple_request() function.

static int do_simple_request(struct request *rq, unsigned int *nr_bytes)
{
    int ret = SUCCESS;
    struct bio_vec bvec;
    struct req_iterator iter;
    sblkdev_device_t *dev = rq->q->queuedata;
    loff_t pos = blk_rq_pos(rq) << SECTOR_SHIFT; // start offset in bytes
    loff_t dev_size = (loff_t)(dev->capacity << SECTOR_SHIFT); // device size in bytes

    printk(KERN_WARNING "sblkdev: request start from sector %llu\n", (unsigned long long)blk_rq_pos(rq));

    rq_for_each_segment(bvec, rq, iter)
    {
        unsigned long b_len = bvec.bv_len;

        void *b_buf = page_address(bvec.bv_page) + bvec.bv_offset;

        // do not run past the end of the device
        if ((pos + b_len) > dev_size)
            b_len = (unsigned long)(dev_size - pos);

        if (rq_data_dir(rq)) // WRITE
            memcpy(dev->data + pos, b_buf, b_len);
        else // READ
            memcpy(b_buf, dev->data + pos, b_len);

        pos += b_len;
        *nr_bytes += b_len;
    }

    return ret;
}

There is nothing new here: rq_for_each_segment walks through every bio, and within each bio through every bio_vec structure, giving us access to the pages with the request data.
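The copy loop is easy to try out in user space. Below is a minimal sketch of the same idea, with a plain array of segments standing in for the bio_vec iteration. The struct seg type, the dir enum, and transfer_segments() are hypothetical names for illustration. Unlike the driver code, the sketch also stops when the start offset is already past the end of the store.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct bio_vec: one contiguous chunk of data. */
struct seg {
    unsigned char *buf;
    size_t len;
};

enum dir { DIR_READ, DIR_WRITE };

/*
 * Copy every segment to (WRITE) or from (READ) the backing store,
 * starting at byte offset pos. Copies are clamped at store_size.
 * Returns the number of bytes transferred.
 */
size_t transfer_segments(unsigned char *store, size_t store_size, size_t pos,
                         struct seg *segs, size_t nsegs, enum dir dir)
{
    size_t done = 0;
    size_t i;

    assert(store && segs);

    for (i = 0; i < nsegs && pos < store_size; i++) {
        size_t len = segs[i].len;

        if (len > store_size - pos)
            len = store_size - pos; /* clamp at the end of the device */

        if (dir == DIR_WRITE)
            memcpy(store + pos, segs[i].buf, len);
        else
            memcpy(segs[i].buf, store + pos, len);

        pos += len;
        done += len;
    }
    return done;
}
```

The return value plays the role of nr_bytes in the driver: it is the number of bytes that were really moved, which can be less than the sum of the segment lengths when the transfer hits the end of the store.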

What are your impressions? Looks simple, doesn't it? Processing a request is really just copying data between the request pages and the internal buffer. Quite worthy of a simple block device driver, right?
But there is a problem: this is not for real use!

The essence of the problem is that the request processing function queue_rq() is called in a loop that processes requests from a list. I do not know which kind of locking protects this list, spinlock or RCU (I do not want to lie; whoever knows, please correct me), but when I try to use, say, a mutex in the request processing function, a debug kernel complains and warns that sleeping is not allowed here. That is, you cannot use ordinary synchronization primitives or virtually contiguous memory (the kind that is allocated with vmalloc and can be swapped out, with all that follows), because the process must not go into a waiting state.

So the options are: either only spinlocks or RCU plus a buffer organized as an array, list, or tree of pages, as in linux/drivers/block/brd.c, or deferred processing in another thread, as in linux/drivers/block/loop.c.
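The "buffer as an array of pages" option can be illustrated in user space too. The sketch below keeps the data in independently allocated pages and splits each copy at page boundaries. It is a simplified model under my own assumptions (a fixed 4096-byte page and a plain array for lookup), and the page_store names are hypothetical; it is not how brd.c is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for this sketch */

/* Hypothetical store: separately allocated pages instead of one big buffer. */
struct page_store {
    unsigned char **pages;
    size_t nr_pages;
};

/* Allocate nr_pages zeroed pages. On failure, call page_store_free(). */
int page_store_init(struct page_store *ps, size_t nr_pages)
{
    size_t i;

    ps->nr_pages = 0;
    ps->pages = calloc(nr_pages, sizeof(*ps->pages));
    if (!ps->pages)
        return -1;
    ps->nr_pages = nr_pages;
    for (i = 0; i < nr_pages; i++) {
        ps->pages[i] = calloc(1, SKETCH_PAGE_SIZE);
        if (!ps->pages[i])
            return -1;
    }
    return 0;
}

void page_store_free(struct page_store *ps)
{
    size_t i;

    if (!ps->pages)
        return;
    for (i = 0; i < ps->nr_pages; i++)
        free(ps->pages[i]);
    free(ps->pages);
    ps->pages = NULL;
    ps->nr_pages = 0;
}

/*
 * Copy len bytes between buf and the store at byte offset pos,
 * one page-sized chunk at a time. Returns the number of bytes copied.
 */
size_t page_store_copy(struct page_store *ps, size_t pos,
                       unsigned char *buf, size_t len, int to_store)
{
    size_t done = 0;

    assert(ps && buf);

    while (done < len) {
        size_t idx = pos / SKETCH_PAGE_SIZE;   /* which page */
        size_t off = pos % SKETCH_PAGE_SIZE;   /* offset inside it */
        size_t chunk = SKETCH_PAGE_SIZE - off; /* room left in the page */

        if (idx >= ps->nr_pages)
            break; /* past the end of the store */
        if (chunk > len - done)
            chunk = len - done;

        if (to_store)
            memcpy(ps->pages[idx] + off, buf + done, chunk);
        else
            memcpy(buf + done, ps->pages[idx] + off, chunk);

        pos += chunk;
        done += chunk;
    }
    return done;
}
```

The point of this layout is that no single large contiguous allocation is needed: each page is found by index, so a transfer that crosses a page boundary simply becomes two smaller copies.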

I don't think it is necessary to describe how to build the module, load it into the system, and unload it. There are no innovations on that front, and thanks for that 🙂 So anyone who wants to try it out will surely figure it out. Just do not do it right away on your favorite laptop! Spin up a virtual machine, or at least make a backup to a network share.
By the way, Veeam Backup for Linux 3.0.1.1046 is already available. Just do not try to run VAL 3.0.1.1046 on kernel 5.0 or later: veeamsnap will not build. And some of the multi-queue innovations are still in testing.


One Comment

  1. Thank you for the write-up. A couple of notes.

    The vmalloc function allocates memory that is virtually contiguous, but it does not swap. So you can use vmalloc memory inside of the queue call, but you must have allocated it earlier. Only user space allocated memory can be used for swap.