How to write your own filesystem in Rust

The initial data and results of the programs must be stored somewhere for further use. Their storage needs to be organized so that we can quickly get the information we need. The File System (FS) is responsible for this task: it provides an abstraction for the devices on which data is physically stored.

In this post, we will learn more about the concepts used by filesystems and how, knowing them, you can write your filesystem in Rust.


File system and disk

When data needs to be stored on a hard disk drive (HDD) or solid state drive (SSD), it is written in small sectors (or pages in the case of an SSD). Disks do not know what a particular piece of data is. The disk only stores an array of sectors. And the file system must manage and interact with them.

The file system divides the disk into fixed-size blocks. FS uses most of these blocks to store user data, but some blocks are used to store metadata that is required for the filesystem to function.

The following figure shows an example of how the file system structures information on disk:

Next, we will figure out what types of blocks are.

Superblocks and bitmaps

The superblock stores most of the metadata about the filesystem: block size, timestamps, how many blocks and files are in use, how much free, and so on.

Bitmaps are one way to keep track of which data blocks and inodes are free. Each sector in the file system has one map bit. If the sector is busy, then the value of the corresponding bit in the bitmap is set to 1, if free, to 0.

Inode

An inode is a structure that stores metadata about a file: permissions, size, location of data blocks that make up a file, and much more.

Inodes are stored together with blocks, which together form an inode table, as shown in the following figure.

For each sector in the inode bitmap with index set to 1, the table will have a corresponding inode. The sector index is the same as the index in the table.

Data blocks

As the name suggests, data blocks are blocks that contain data that belongs to a file. These blocks are also used for other purposes, which we will discuss below.

Pointers to Data Blocks

The inode must be able to store pointers to the data blocks that make up the file. The easiest way is to store direct pointers. In this case, each block corresponds to a pointer to a particular block in the file. The problem is that with large files (the number of blocks exceeds the number of direct pointers that an inode can have) this scheme does not work. One way to solve the problem is to use indirect pointers. They do not point to data, but to other pointers to blocks containing user data. For large files, another level of indirection is added with double indirection and so on.

To get an idea of ​​the maximum file size for each level, let’s take a fixed block size of 4KiB. Most files are small, so 12 forward pointers can store files up to 48KiB. Considering that each pointer is 4 bytes, a single indirect pointer will increase the file size up to 4 MiB:

(12 + 1024) * 4 KiB

With the introduction of double indirection pointers, the file size grows to 4 GiB:

(12 + 1024 + 10242) * 4 KiB

With the addition of triple indirect pointers, we get 4 TiB:

(12 + 1024 + 10242 + 10243) * 4 KiB

Therefore, this approach may not be very efficient for handling large files.

For example, a 100 MB file would require 25600 blocks. If blocks are fragmented on disk, performance can be severely affected.

Some filesystems use extents that allow multiple logical blocks to be combined. In this case, we use one pointer and set the length of the range of the combined blocks. In our example above, we will use a single 100 MB extent to describe the file. Multiple extents can be used to work with larger files.

Catalogs

You may have noticed that directories do not have their own structure. The point is that an inode represents both files and directories. The difference lies in what is stored in the respective data blocks. A directory is essentially a list of all the files it includes. Each entry has the form (name, sequence number), so when searching for a specific file (or other directory), the system uses the name to find the corresponding inode.

Finding a file can be slow if the directory contains a large number of files. This problem can be solved by maintaining a sorted list and using binary search. Or, instead of a list, you can also use a hash table or balanced search tree.

Read and write

Reading

To read from a file, the file system needs to iterate over all inodes that will occur on the path to the desired file. Assuming the user has permission to access the file, the file system checks which blocks are associated with it and then reads the requested data from them.

Recording

To write to a file, you need to do the same search to find the corresponding inode. If a write requires a new block, the file system must allocate that block, update the appropriate information (bitmap and inode), and write to it. Thus, one write operation requires five I / O operations: one read to the bitmap, one write to mark the new block as busy, two reads and writes to the inode, and one write to the block. And when creating a new file, this number may increase, since now it is necessary to update the data of its directory and create new inodes.

My filesystem (GotenksFS)

I studied the Rust language for a while and decided to write my own filesystem in it. I relied heavily on ext4and also used FUSE, a free module for UNIX-like kernels. Instead of a disk, I used a regular file. The block size can be set to 1 KiB, 2 KiB, or 4 KiB. Files can be up to 4 GiB in size for 4KiB blocks. The file system can potentially be up to 16 TiB in size.

Getting started

The first step was to create an image with configuration values ​​for the file system. This is achieved with the mkfs command:

$ ./gotenksfs mkfs disk.img -s "10 GiB" -b 4096

After executing the command, an image is created with a total size of 10 GiB, and each block in the file system is 4 KiB in size.

At this point, configuration values ​​and other structures such as the root directory are written to the image in the first block — that is, the superblock. In addition, the bitmap is filled and data is written. These values ​​will be required for the next step – mounting the file system.

Mounting

After creating the image, we need to mount it so that we can start using it. To do this, use the mount command:

$ ./gotenksfs mount disk.img gotenks

Basic structures

The superblock is written in the first 1024 bytes and contains the configuration values ​​specified in the mkfs command.

pub struct Superblock {
    pub magic: u32,
    pub block_size: u32,
    pub created_at: u64,
    pub modified_at: Option,
    pub last_mounted_at: Option,
    pub block_count: u32,
    pub inode_count: u32,
    pub free_blocks: u32,
    pub free_inodes: u32,
    pub groups: u32,
    pub data_blocks_per_group: u32,
    pub uid: u32,
    pub gid: u32,
    pub checksum: u32,
}

The next two blocks are bitmaps for data and for inode. A set of n blocks is used for the inode table. And in the subsequent blocks user data will be written.

I formed the inode like this:

pub struct Inode {
    pub mode: libc::mode_t,
    pub hard_links: u16,
    pub user_id: libc::uid_t,
    pub group_id: libc::gid_t,
    pub block_count: u32, // should be in 512 bytes blocks
    pub size: u64,
    pub created_at: u64,
    pub accessed_at: Option,
    pub modified_at: Option,
    pub changed_at: Option,
    pub direct_blocks: [u32; DIRECT_POINTERS as usize],
    pub indirect_block: u32,
    pub double_indirect_block: u32,
    pub checksum: u32,
}

I have implemented support for double indirection – that is, for a disk with a block size of 4 KiB, the maximum file capacity is 4 GiB. I limited the number of direct pointers to twelve:

pub const DIRECT_POINTERS: u64 = 12;

When FS first starts, it creates a root directory:

pub struct Directory {
    pub entries: BTreeMap,
    checksum: u32,
}

Form groups of blocks

The bitmap size for an inode is 4KiB, which means that 32,768 inodes can be placed in each block. Let’s round this number down to 128 bytes: the corresponding inode table will require 4 MB of free space. One way to structure them looks like this: there are many blocks allocated for bitmaps, then corresponding numeric blocks for storing inodes and the remaining blocks for user data.

But instead, we can create block groups that will always have one block for the data bitmap and one for the inode bitmap. The next 1024 blocks contain the inode table, followed by 32768 blocks that are used for user data.

Implementing reading and writing

When our “disk” is created, we can do file operations. Let’s create a new directory using mkdir:

The process of creating a directory looks like this: the system looks for the inode of the root directory, then looks for which data block the contents are written to, allocates a new inode and data block for the new directory, writes to the root directory, and finally writes the new directory to its data block.

A new file is created in a similar way: the system follows the specified path until the destination directory is reached and adds an entry for the new file.

The arguments to the write function are the path, the data buffer, the offset, and the structure with information about the file, which can contain the file descriptor along with other, additional information about it:

fn write(&mut self, path: &Path, buf: &[u8], offset: u64, file_info: &mut fuse_rs::fs::WriteFileInfo)

In this case, instead of traversing the path again to find the inode, the system uses the file descriptor (FS pre-initializes it when creating the file). And then FS will be able to assemble the structure by calculating the exact address of the inode:

fn inode_offsets(&self, index: u32) -> (u64, u64) {
    let inodes_per_group = self.superblock().data_blocks_per_group as u64;
    let inode_bg = (index as u64 - 1) / inodes_per_group;
    let bitmap_index = (index as u64 - 1) & (inodes_per_group - 1);
    (inode_bg, bitmap_index)
}

fn inode_seek_position(&self, index: u32) -> u64 {
    let (group_index, bitmap_index) = self.inode_offsets(index);
    let block_size = self.superblock().block_size;
    group_index * util::block_group_size(block_size)
        + 2 * block_size as u64
        + bitmap_index * INODE_SIZE
        + SUPERBLOCK_SIZE
}

Now that the FS has information about which data blocks are allocated for the inode, it can use them to find the exact address to write data there. New blocks are first added to the array of direct pointers, and if the file size exceeds (12 * BLOCK_SIZE), FS issues an indirect pointer (the indirect_block field). For very large files, the system adds a double indirection using the double_indirect_block field. Reading from a file is similar.

Below you can see how reading and writing works for me:

Output

The main idea behind implementing your own filesystem is to define your own content structure on “disk”, which will then allow the system to work with this structure when creating, writing, and reading from files and directories.

An important point is the implementation of indirect pointers for storing large files. If you are interested in learning more about filesystems, I would recommend reading about things like journaling, Copy-On-Write, Log-structured filesystems.

The code for my GotenksFS file system project is located at https://github.com/carlosgaldino/gotenksfs


Advertising

Epic servers – this is reliable servers on Linux or Windows with powerful processors of the AMD EPYC family and a very fast file system, we exclusively use NVMe disks from Intel. Try it as soon as possible!

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *