We create our own analogue of sqlite from scratch. Part 1

As a web developer, I regularly use relational databases in my work, but I don't understand how they work. The following questions arise:

  • In what format will the data be saved (in memory or on disk)

  • When should they be saved to disk?

  • Why is the primary key the only one for one table?

  • How is a transaction canceled?

  • How are indices formed?

  • When and how does a full table scan occur?

  • What is a prepared statement?

In other words, how do databases work? data?

To figure this out, I'll create a database from scratch. I'm using sqlite as a sample because it's designed to be small with fewer features than mysql or postgresql, so I hope I can understand how it works. The entire database is stored in one file!

Sqlite

sqlite architecture

More detailed internal workings of sqlite can be found on theirwebsite.

sqlite architecture

The request passes through the components in order to retrieve or modify data. The frontend consists of:

The input is an sql query, the output is byte code for the sqlite virtual machine (essentially a compiled program that can work on a database).

The backend consists of:

  • Virtual machine

  • B-tree

  • Pager

  • System interface (os interface)

Virtual machine perceives the byte code generated by the frontend as instructions. It can then perform operations on one or more tables or indexes, each of which is stored in a data structure called a b-tree. Essentially, this machine is a large switch, where each case is responsible for a specific instruction.

Each Bi-tree consists of many nodes. Each node is one page long. The bi-tree can get pages from a file or store it back to a file by issuing commands to the pager.

Pager receives commands to read or write data. It is responsible for reading/writing at the corresponding offsets in the database file. It also caches recently opened pages in memory, and determines when those pages should be written back to a file.

System interface This is a layer that differs depending on which operating system sqlite was compiled for. In this tutorial, I'm not going to add support for multiple devices.

A journey of a thousand miles begins with the first stepso let's start with something less simple: REPL

Creating a simple REPL

Sqlite runs its own repl when we run it from the command line:

~ sqlite3
SQLite version 3.16.0 2016-11-04 19:09:39
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> create table users (id int, username varchar(255), email varchar(255));
sqlite> .tables
users
sqlite> .exit
~

To do the same, our main function will have an infinite that prints the command line, receives the line from the user, then processes that line:

int main(int argc, char* argv[]) {
  InputBuffer* input_buffer = new_input_buffer();
  while (true) {
    print_prompt();
    read_input(input_buffer);

    if (strcmp(input_buffer->buffer, ".exit") == 0) {
      close_input_buffer(input_buffer);
      exit(EXIT_SUCCESS);
    } else {
      printf("Unrecognized command '%s'.\n", input_buffer->buffer);
    }
  }
}

We'll define an InputBuffer as a small wrapper around the state we need to store to interact with getline(). (more on this in a minute)

typedef struct {
  char* buffer;
  size_t buffer_length;
  ssize_t input_length;
} InputBuffer;

InputBuffer* new_input_buffer() {
  InputBuffer* input_buffer = (InputBuffer*)malloc(sizeof(InputBuffer));
  input_buffer->buffer = NULL;
  input_buffer->buffer_length = 0;
  input_buffer->input_length = 0;

  return input_buffer;
}

Next, print_prompt() prints the string to the user. We do this before reading each line of input.

void print_prompt() { printf("db > "); }

To read the input line, use getline()

ssize_t getline(char **lineptr, size_t *n, FILE *stream);

lineptr: a pointer to a variable that we use as a buffer containing the read line. If it is NULL, then getline allocates memory for it and thus the memory must be freed by the user, even if the command fails.

n: Pointer to a variable where the buffer size is stored.

stream: What this function will read. We will read the standard stream.

return value: The number of bytes read, which may be less than the buffer size.

We tell getline to store the line read in input_buffer->buffer and the size of the allocated buffer in input_buffer->buffer_length. We store the return value in input_buffer->input_length

buffer starts as null, so getline allocates enough memory to hold the input string and makes buffer a pointer to it.

void read_input(InputBuffer* input_buffer) {
  ssize_t bytes_read =
      getline(&(input_buffer->buffer), &(input_buffer->buffer_length), stdin);

  if (bytes_read <= 0) {
    printf("Error reading input\n");
    exit(EXIT_FAILURE);
  }

  // Игнорировать символ новой строки
  input_buffer->input_length = bytes_read - 1;
  input_buffer->buffer[bytes_read - 1] = 0;
}

Now it’s best to create a function that frees up allocated memory for an instance of InputBuffer and a buffer element of the same structure (getline allocates memory for input_buffer->buffer in read_input).

void close_input_buffer(InputBuffer* input_buffer) {
    free(input_buffer->buffer);
    free(input_buffer);
}

At the end, we parse and execute the command. So far, only one command is recognized: .exit, which closes the program. Otherwise, we print an error message and continue the loop.

if (strcmp(input_buffer->buffer, ".exit") == 0) {
  close_input_buffer(input_buffer);
  exit(EXIT_SUCCESS);
} else {
  printf("Unrecognized command '%s'.\n", input_buffer->buffer);
}

Let's try what we wrote!

~ ./db
db > .tables
Unrecognized command '.tables'.
db > .exit
~

Okay, we have a working REPL. In the next part, we will begin developing our command language. In the meantime, here is the entire program for this part:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  char* buffer;
  size_t buffer_length;
  ssize_t input_length;
} InputBuffer;

InputBuffer* new_input_buffer() {
  InputBuffer* input_buffer = malloc(sizeof(InputBuffer));
  input_buffer->buffer = NULL;
  input_buffer->buffer_length = 0;
  input_buffer->input_length = 0;

  return input_buffer;
}

void print_prompt() { printf("db > "); }

void read_input(InputBuffer* input_buffer) {
  ssize_t bytes_read =
      getline(&(input_buffer->buffer), &(input_buffer->buffer_length), stdin);

  if (bytes_read <= 0) {
    printf("Error reading input\n");
    exit(EXIT_FAILURE);
  }

  // Ignore trailing newline
  input_buffer->input_length = bytes_read - 1;
  input_buffer->buffer[bytes_read - 1] = 0;
}

void close_input_buffer(InputBuffer* input_buffer) {
    free(input_buffer->buffer);
    free(input_buffer);
}

int main(int argc, char* argv[]) {
  InputBuffer* input_buffer = new_input_buffer();
  while (true) {
    print_prompt();
    read_input(input_buffer);

    if (strcmp(input_buffer->buffer, ".exit") == 0) {
      close_input_buffer(input_buffer);
      exit(EXIT_SUCCESS);
    } else {
      printf("Unrecognized command '%s'.\n", input_buffer->buffer);
    }
  }
}

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *