Creating your own compiler

At some fairly close point in time, I became interested in the design of compilers in general, and in particular, translators from high-level languages ​​to assembler. And it turned out that there weren’t that many articles on this topic, but with examples and at least some detailed description, I actually found one, the choice of languages ​​in which was somewhat strange.

I’m not saying that this article is bad, on the contrary, but the main problem, in my opinion, was that the code is translated into the language of its own virtual machine, and not into assembly code. Therefore, I decided to try to write my own article to highlight this problem.

Selection of programming languages

I chose C++ as the language of the compiler itself, because, according to Google, it is mainly used for compilers, or pure “C”. We will translate into old asm, intended for DOS.

With the choice of assembler, it is worth making a reservation about why such an exotic solution for 2023. In fact, everything is simple: in my opinion, anyone who works with modern versions of assembler, as well as students who simply touched this language at university, will understand it.

What will and will not be discussed in this article

I’ll make a disclaimer right away: the images in this article will be cited from “Book of the Dragon-2,” but my acquaintance with it was too fleeting (get the pictures for the article and skim diagonally through the introduction).

Let’s return to the program compilation phases. To describe them, you can give this image with an example of each phase:

In the framework of this article, as in the article mentioned in the title, I will somewhat simplify the scheme, namely, the steps of semantic analysis (in the proposed model there is no point in it) of the intermediate code generator and its optimizer will be skipped. At the same time, the article will cover the issue of constructing lexical and syntactic analyzers, as well as generating assembly code.

Defining Lexical Analyzer Grammar

The lexical analyzer can be said to be one of the simplest parts of the compiler. At this stage, the program text is converted into a set of tokens, without checking for the correctness of their sequence.

First, let’s decide which operations will be available in our compiler. Within the framework of this article, we will consider a slightly complicated calculator “from the market”:

  • Basic arithmetic (+, -, *, / (integer)) is available. Arithmetic operations do not take precedence over each other, but you can specify the order of operations using parentheses.

  • For the convenience of some calculations, conditional operators and loop operators if, if/else, while have been added.

  • The results of calculations are output to the console using the print statement.

  • To compare values, a single operator < is overloaded (which returns 1 if the comparison is true, 0 otherwise).

  • To improve the user’s convenience, he is provided with 26 global variables, which are designated in lowercase Latin letters in the program text.

  • One data type is available to the user: unsigned short.

  • The user can group statements using curly braces.

  • As a result of the operation of assigning a value to a variable, the new value of the variable is returned.

  • Spaces, tabs, and newlines are not taken into account when parsing input text.

The grammar of the language is as follows (a modified version of the grammar from the article indicated at the beginning of this one):

<program> ::= <statement>
<statement> ::=  "if" <parenExpr> <statement> |
                 "if" <parenExpr> <statement> "else" <statement> |
                 "while" <parenExpr> <statement> |
                 "print" <parenExpr> |
                 "{" { <statement> } "}" |
                 <expr> ";" |
                 ";"
<parenExpr> ::= "(" <expr> ")"
<expr> ::= <test> | <id> "=" <expr>
<test> ::= <arExpr> | <arExpr> "<" <arExpr>
<arExpr>  ::= <term> | <arExpr> "+" <term> |
              <arExpr> "*" <term> | <arExpr> "-" <term> |
              <arExpr> "/" <term>
<term> ::= <id> | <int> | <parenExpr>
<id>   ::= "a" | "b" | ... | "z"
<int>  ::= <digit>, { <digit> }
<digit> ::= "0" | "1" | ... | "9" 

The entire program consists of one statement. This single statement can be either a set of the same statements enclosed in curly braces, or a conditional/cyclic statement (if, if/else, while), or a number output statement (print), or just an expression.

Conditional and cyclic operators contain a condition expression (written in parentheses. If the value of the expression is 0, then the compiler interprets it as false, otherwise as true) and a body (which is a full-fledged operator).

Regular expressions end with a semicolon.

The expressions used in the conditions and print statement are arithmetic expressions, consisting of the arithmetic assignment operators available to us (which, as stated earlier, return the new value of a variable).

An arithmetic expression consists of arithmetic operations on terms, which in turn are a variable name, an integer, or an expression in parentheses.

Having defined the grammar, we can move on to creating a lexical analyzer.

Lexical analyzer

Let us describe an enumeration that will contain all types of tokens obtained when parsing the source code, as well as dictionaries that will be used to match reserved words and special characters with tokens.

//compiler.h

enum class tokensTypes
{
    NUM, VAR, IF, ELSE, WHILE, LBRA, RBRA, LPAR, RPAR,
    ADD, SUB, MUL, DIV, LESS, ASSIG,
    SEMICOL, ENDFILE, ERROR, PRINT
};

//compiler.cpp

std::map<char, compiler::tokensTypes> compiler::specialSymbols
{
	std::pair<char, tokensTypes>{'{', tokensTypes::LBRA},
	std::pair<char, tokensTypes>{'}', tokensTypes::RBRA},
	std::pair<char, tokensTypes>{'(', tokensTypes::LPAR},
	std::pair<char, tokensTypes>{')', tokensTypes::RPAR},
	std::pair<char, tokensTypes>{'+', tokensTypes::ADD},
	std::pair<char, tokensTypes>{'-', tokensTypes::SUB},
	std::pair<char, tokensTypes>{'*', tokensTypes::MUL},
	std::pair<char, tokensTypes>{'/', tokensTypes::DIV},
	std::pair<char, tokensTypes>{'<', tokensTypes::LESS},
	std::pair<char, tokensTypes>{'=', tokensTypes::ASSIG},
	std::pair<char, tokensTypes>{';', tokensTypes::SEMICOL},
};

std::map<std::string, compiler::tokensTypes> compiler::specialWords
{
	std::pair<std::string, tokensTypes>{"if", tokensTypes::IF},
	std::pair<std::string, tokensTypes>{"else", tokensTypes::ELSE},
	std::pair<std::string, tokensTypes>{"while", tokensTypes::WHILE},
	std::pair<std::string, tokensTypes>{"print", tokensTypes::PRINT},
};

We will also define a method that, based on the line and the current cursor position in it, will determine the following token:

// compiler.h

class token
{
public:
    tokensTypes type;
    unsigned short value;
    char variable;

    token(tokensTypes type, int value = 0, char variable="a");
};

// compiler.cpp

compiler::token compiler::getNexttoken()
{
	if (nowPos == textLen) // Если файл прочитан, возвращаем конец файла
		return token(compiler::tokensTypes::ENDFILE);

	std::string str = ""; // В эту строку будет записываться текущий токен
	char now = text[nowPos++];

    // Пропускаем все незначимые символы
	while ((now == ' ' || now == '\r' || now == '\n' || now == '\t')&& nowPos < textLen)
		now = text[nowPos++];
    // И снова проверяем на конец файла
	if (nowPos == textLen)
		return token(compiler::tokensTypes::ENDFILE);

    // Пока не подойшли к символу, который однозначно говорит о конце токена, считываем
	while (!(now == ' ' || now == '\r' || now == '\n' || now == '\t'))
	{
        // Если в строке нет символов, проверяем на причастностьк специальным символам или числам
		if (str.size() == 0) 
		{
			if (specialSymbols.count(now))
				return token(specialSymbols1708872916);
			else if (isdigit(now))
			{
				do
				{
					str += now;
					now = text[nowPos++];
				} while (isdigit(now));
				nowPos--;
				break;
			}
			else
				str += now;
		}
		else
		{
			if (specialSymbols.count(now)) // Если был встречен специальный символ
			{                              // Также заканчиваем парсинг токена
				nowPos--;
				break;
			}
			str += now;
		}

		if (specialWords.count(str)) //Проверяем на нахождение в списке зарезервированных слов
			return token(specialWords[str]);

		if (nowPos == textLen)
			break;

		now = text[nowPos++];
	}

    // Если в токене единственная строчная буква, то интерпретируем как имя переменной
	if (str.size() == 1 && str[0] >= 'a' && str[0] <= 'z')
		return  token(tokensTypes::VAR, 0, str[0]);

	unsigned short value = 0;
    // Если токен - число, распознаем его 
	for (int i = 0; i < str.size(); i++)
	{
		if (isdigit(str[i]))
			value = value * 10 + (str[i] - '0');
		else
			return token(compiler::tokensTypes::ERROR);
	}

	return token(compiler::tokensTypes::NUM, value);
}

In the main compilation method, we fill the token vector:

//compiler.cpp

bool compiler::compile(std::string& result)
{
	token token = getNexttoken();

	while (token.type != tokensTypes::ENDFILE)
	{
		if (token.type == tokensTypes::ERROR) // Если токен не распознан
		{
			result = "Lexical error"; // Возвращаем лексическую ошибку
			return false;
		}
		tokens.push_back(token);
		token = getNexttoken();
	}
	tokens.push_back(token);
}

At this point, the lexical analyzer can be considered complete.

Parsing

During parsing, both the correctness of the sequence of tokens is checked and a syntax tree is built. To build a tree, you also need to determine the nodes and their types:

// compiler.h

enum class nodeTypes
{
    CONST, VAR, ADD, SUB, MUL, DIV,
    LESSTHEN, SET, IF, IFELSE, WHILE,
    EMPTY, SEQ, EXPR, PROG, PRINT
};

class node
{
public:
    nodeTypes type;
    int value;
    node* op1;
    node* op2;
    node* op3;

    node(nodeTypes type, int value = 0, node* op1 = nullptr, node* op2 = nullptr, node* op3 = nullptr);
    ~node();
};

With these inputs, we will define methods that convert a sequence of tokens into a syntax tree (there is a lot of code, so I hid it under a spoiler):

Hidden text
// compiler.cpp

bool compiler::parser::parse(std::vector<token>& tokens, compiler::node& result)
{
	int count = 0;
	compiler::node* node = new compiler::node(nodeTypes::EMPTY);
	bool res = parser::statement(tokens, count, node);

	if (count != tokens.size() - 1 || !res)
		return false;

	result = *node;
	return true;
}

bool compiler::parser::statement(std::vector<token>& tokens, int& count, compiler::node*& node)
{
	delete node;
	node = new compiler::node(nodeTypes::EMPTY);
	bool parsed = true;
	switch (tokens[count].type)
	{
	case compiler::tokensTypes::IF:
		count++;
		node->type = nodeTypes::IF;
		parsed = compiler::parser::parenExpr(tokens, count, node->op1);

		if (!parsed)
			return false;

		parsed = compiler::parser::statement(tokens, count, node->op2);

		if (!parsed)
			return false;

		if (tokens[count].type == tokensTypes::ELSE)
		{
			count++;
			node->type = nodeTypes::IFELSE;
			bool parsed = compiler::parser::statement(tokens, count, node->op3);

			if (!parsed)
				return false;
		}
		break;

	case compiler::tokensTypes::WHILE:
		count++;
		node->type = nodeTypes::WHILE;
		parsed = compiler::parser::parenExpr(tokens, count, node->op1);

		if (!parsed)
			return false;

		parsed = compiler::parser::statement(tokens, count, node->op2);

		if (!parsed)
			return false;
		break;

	case compiler::tokensTypes::SEMICOL:
		count++;
		break;

	case compiler::tokensTypes::LBRA:
		count++;
		while (tokens[count].type != tokensTypes::RBRA && count < tokens.size())
		{
			node = new compiler::node(nodeTypes::SEQ, 0, node);

			bool parsed = compiler::parser::statement(tokens, count, node->op2);

			if (!parsed)
				return false;
		}
		count++;
		break;

	case compiler::tokensTypes::PRINT:
		count++;
		node->type = nodeTypes::PRINT;
		parsed = compiler::parser::parenExpr(tokens, count, node->op1);

		if (!parsed)
			return false;
		break;

		if (!parsed)
			return false;

	default:
		node->type = nodeTypes::EXPR;
		parsed = compiler::parser::expr(tokens, count, node->op1);

		if (tokens[count++].type != tokensTypes::SEMICOL)
			return false;

		if (!parsed)
			return false;
	}

	return true;
}

bool compiler::parser::parenExpr(std::vector<token>& tokens, int& count, compiler::node*& node)
{
	delete node;
	if (tokens[count].type != tokensTypes::LPAR)
		return false;

	count++;

	bool res = expr(tokens, count, node);

	if (tokens[count].type != tokensTypes::RPAR)
		return false;

	count++;
	return true;
}

bool compiler::parser::expr(std::vector<token>& tokens, int& count, compiler::node*& node)
{
	delete node;
	if (tokens[count].type != compiler::tokensTypes::VAR)
		return test(tokens, count, node);

	bool res = test(tokens, count, node);

	if (!res)
		return false;

	if (node->type == nodeTypes::VAR && tokens[count].type == tokensTypes::ASSIG)
	{
		node = new compiler::node(nodeTypes::SET, 0, node);
		count++;
		res = expr(tokens, count, node->op2);
	}

	return res;
}

bool compiler::parser::test(std::vector<token>& tokens, int& count, compiler::node*& node)
{
	delete node;
	bool res = arExpr(tokens, count, node);

	if (!res)
		return false;

	if (tokens[count].type == tokensTypes::LESS)
	{
		count++;
		node = new compiler::node(nodeTypes::LESSTHEN, 0, node);
		res = arExpr(tokens, count, node->op2);
	}

	return res;
}

bool compiler::parser::arExpr(std::vector<token>& tokens, int& count, compiler::node*& node)
{
	delete node;
	bool res = term(tokens, count, node);

	if (!res)
		return false;

	tokensTypes now = tokens[count].type;

	while (now == tokensTypes::ADD || now == tokensTypes::SUB || now == tokensTypes::MUL || now == tokensTypes::DIV)
	{
		nodeTypes nT = nodeTypes::EMPTY;
		switch (now)
		{
		case compiler::tokensTypes::ADD:
			nT = nodeTypes::ADD;
			break;
		case compiler::tokensTypes::SUB:
			nT = nodeTypes::SUB;
			break;
		case compiler::tokensTypes::MUL:
			nT = nodeTypes::MUL;
			break;
		case compiler::tokensTypes::DIV:
			nT = nodeTypes::DIV;
			break;
		}

		node = new compiler::node(nT, 0, node);
		count++;
		bool res = term(tokens, count, node->op2);

		if (!res)
			return false;

		now = tokens[count].type;
	}

	return true;
}

bool compiler::parser::term(std::vector<token>& tokens, int& count, compiler::node*& node)
{
	delete node;
	bool parsed;
	switch(tokens[count].type)
	{
	case tokensTypes::VAR:
		node = new compiler::node(nodeTypes::VAR, tokens[count].variable - 'a');
		count++;
		break;

	case tokensTypes::NUM:
		node = new compiler::node(nodeTypes::CONST, tokens[count].value);
		count++;
		break;

	default:
		bool res = parenExpr(tokens, count, node);

		if (!res)
			return false;

		break;
	}

	return true;
}

Here, each expression is recognized according to the grammar rules defined above. It’s worth taking a closer look at how child nodes are determined. They can be divided into several groups:

  • Nodes of type CONST, VAR: do not contain child nodes, but contain a value that determines the value of a number or the position of an element in an array

  • Nodes of type ADD, SUB, MUL, DIV, LESSTHEN: each contain two child nodes, where op1 is the left operand and op2 is the right operand.

  • Nodes of type SET: contain two child nodes: op1 – a node of type VAR with a variable to which a value is assigned, op2 – a node that collects the value that needs to be assigned to the variable.

  • Nodes of type IF and WHILE: also include two child nodes: op1 is a condition whose truth is checked, op2 is an expression that is executed if true.

  • IFELSE type nodes: contain three child nodes. The first two are the same as the IF or WHILE nodes, the third are expressions that are executed if the expression is false.

  • Nodes of type SEQ: Contains two child nodes. op1 is the previous action being performed, op2 is the current action. (Perhaps it should have been done in the “correct” order: the node stores the next action, but in this form the implementation is somewhat simpler)

  • Nodes of type EXPR and PROG: contain a single child node op1 that defines an expression

  • Nodes of type PRINT: contains a single child node describing the output value

When building a tree, the current token is checked and, based on it, a decision is made about the type of the current node, and a check is made for the correctness of the syntax.

This completes the parsing: in the case of incorrectly compiled code, all that remains is to add it to the code compilation method:

// compiler.cpp

bool compiler::compile(std::string& result)
{
	token token = getNexttoken();

	while (token.type != tokensTypes::ENDFILE)
	{
		if (token.type == tokensTypes::ERROR)
		{
			result = "Lexical error";
			return false;
		}
		tokens.push_back(token);
		token = getNexttoken();
	}
	tokens.push_back(token);

	compiler::node tree = compiler::node(nodeTypes::EMPTY);
	if (!compiler::parser::parse(tokens, tree))
		return false;
}

Code generation

Based on the tree, we need to build assembly code. To begin with, let’s highlight the fragments that will be in any program: the beginning of the program, a macro for displaying numbers in decimal form, and exiting the program:

Hidden text
const std::string compiler::linker::programsStart = R"STR(ASSUME CS:CODE, DS:DATA

DATA SEGMENT
        vars dw 26 dup (0)
DATA ENDS;

CODE SEGMENT
        PRINT MACRO
        local DIVCYC, PRINTNUM;
        push ax;
        push bx;
        push cx;
        push dx;
        mov bx, 10;
        mov cx, 0;
    DIVCYC:
        mov dx, 0;
        div bx;
        push dx;
        inc cx;
        cmp ax, 0;
        jne DIVCYC;
    PRINTNUM:
        pop dx;
        add dl, '0';
        mov ah, 02h;
        int 21h;
        dec cx;
        cmp cx, 0;
        jne PRINTNUM;
		mov dl, 0dh;
        mov ah, 02h;
        int 21h;
		mov dl, 0ah;
        mov ah, 02h;
        int 21h;
        pop dx;
        pop cx;
        pop bx;
        pop ax;
        ENDM;

    begin:
        lea si, vars;
)STR";

const std::string compiler::linker::programsEnd = R"STR(        mov ah, 4Ch;
        int 21h;
CODE ENDS;
end begin)STR";

Some of the interesting things in this code: the allocation of 26 words (memory sections of two bytes each) for variables, as well as the output of a number using the PRINT macro (it outputs the contents of the AX register, saving the state of the registers on the stack before output and restoring it after, and also writes to stacks the digits of a number, and then prints them out one after another).

The next step is to determine which registers to use to perform the operations. In this code, the main register containing and processing data is the AX register (it always contains the left operand and the result of the operation), the BX register is used as the register storing the right operand.

When assembling assembly code, the programsStart line is first used, then code operations are added to it, and this ends with the programsEnd line.

Let’s consider the translation of each type of tree node into assembly code:

Hidden text
std::string compiler::linker::compileNode(compiler::node* block, int& labelCount)
{
	std::string result = "";

	int lCount1, lCount2, firstLabelNum, secondLabelNum, thirdLabelNum;

	switch (block->type)
	{
	case nodeTypes::PROG:
		result += compileNode(block->op1, labelCount);
		break;

	case nodeTypes::PRINT:
		result += compileNode(block->op1, labelCount);
		result += "        PRINT;\n";
		break;

	case nodeTypes::VAR:
		result += "        mov ax, [si + " + std::to_string(block->value * 2) + "];\n";
		break;

	case nodeTypes::CONST:
		result += "        mov ax, " + std::to_string(block->value) + ";\n";
		break;

	case nodeTypes::ADD:
		result += compileNode(block->op1, labelCount);
		result += "        push ax;\n";
		result += compileNode(block->op2, labelCount);
		result += "        mov bx, ax;\n";
		result += "        pop ax;\n";
		result += "        add ax, bx;\n";
		break;

	case nodeTypes::SUB:
		result += compileNode(block->op1, labelCount);
		result += "        push ax;\n";
		result += compileNode(block->op2, labelCount);
		result += "        mov bx, ax;\n";
		result += "        pop ax;\n";
		result += "        sub ax, bx;\n";
		break;

	case nodeTypes::MUL:
		result += compileNode(block->op1, labelCount);
		result += "        push ax;\n";
		result += compileNode(block->op2, labelCount);
		result += "        mov bx, ax;\n";
		result += "        pop ax;\n";
		result += "        mul bx;\n";
		break;

	case nodeTypes::DIV:
		result += compileNode(block->op1, labelCount);
		result += "        push ax;\n";
		result += compileNode(block->op2, labelCount);
		result += "        mov bx, ax;\n";
		result += "        pop ax;\n";
		result += "        xor dx, dx;\n";
		result += "        div bx;\n";
		break;

	case nodeTypes::SET:
		result += compileNode(block->op2, labelCount);
		result += "        mov [si + " + std::to_string(block->op1->value * 2) + "], ax;\n";
		break;

	case nodeTypes::EXPR:
		result += compileNode(block->op1, labelCount) + '\n';
		break;

	case nodeTypes::LESSTHEN:
		result += compileNode(block->op1, labelCount);
		result += "        push ax;\n";
		result += compileNode(block->op2, labelCount);
		result += "        mov bx, ax;\n";
		result += "        pop ax;\n";
		result += "        cmp ax, bx;\n";
		result += "        jae LBL" + std::to_string(labelCount++) + ";\n";
		result += "        mov ax, 1;\n";
		result += "        jmp LBL" + std::to_string(labelCount++) + ";\n";
		result += "    LBL" + std::to_string(labelCount - 2) + ":\n";
		result += "        mov ax, 0;\n";
		result += "    LBL" + std::to_string(labelCount - 1) + ": nop\n";
		break;

	case nodeTypes::IF:
		result += compileNode(block->op1, labelCount);
		result += "        cmp ax, 0;\n";
		lCount1 = labelCount;
		result += "        jne LBL" + std::to_string(labelCount++) + ";\n";
		lCount2 = labelCount;
		result += "        jmp LBL" + std::to_string(labelCount++) + ";\n";
		result += "    LBL" + std::to_string(lCount1) + ": nop\n";
		result += compileNode(block->op2, labelCount);
		result += "    LBL" + std::to_string(lCount2) + ": nop\n";
		break;

	case nodeTypes::SEQ:
		result += compileNode(block->op1, labelCount);
		result += compileNode(block->op2, labelCount);
		break;

	case nodeTypes::IFELSE:
		result += compileNode(block->op1, labelCount);
		result += "        cmp ax, 0;\n";
		firstLabelNum = labelCount;
		result += "        je LBL" + std::to_string(labelCount++) + ";\n";
		result += compileNode(block->op2, labelCount);
		secondLabelNum = labelCount;
		result += "        jmp LBL" + std::to_string(labelCount++) + ";\n";
		result += "    LBL" + std::to_string(firstLabelNum) + ": nop\n";
		result += compileNode(block->op3, labelCount);
		result += "    LBL" + std::to_string(secondLabelNum) + ": nop\n";
		break;

	case nodeTypes::WHILE:
		secondLabelNum = labelCount;
		result += "        jmp LBL" + std::to_string(labelCount++) + "; \n";
		firstLabelNum = labelCount;
		result += "    LBL" + std::to_string(labelCount++) + ": nop\n";
		result += compileNode(block->op2, labelCount);
		result += "    LBL" + std::to_string(secondLabelNum) + ": nop\n";
		result += compileNode(block->op1, labelCount);
		result += "        cmp ax, 0;\n";
		thirdLabelNum = labelCount;
		result += "        je LBL" + std::to_string(thirdLabelNum) + "\n";
		result += "        jmp LBL" + std::to_string(firstLabelNum) + "\n";
		result += "    LBL" + std::to_string(thirdLabelNum) + ": nop\n";
		break;

	default:
		break;
	}

	return result;
}

In this code you should pay attention to the following:

  1. For operations that require the participation of two operands, the left one, which is located in the AX register, is first calculated, after which it is sent to the stack, and then the second operand is formed in its place. Before the operation is performed, the right operand is moved to the BX register and the left operand is moved to AX, after which the result remains in AX.

  2. Addresses of memory cells are partially calculated at the compilation stage: multiplication by shift occurs in the compiler because this shift is known in advance.

  3. The LESSTHEN operator leaves the AX register at 1 if true and 0 otherwise.

  4. The IF, IFELSE and WHERE operators additionally compare a number with 0 (see point 3 and language requirements).

  5. Instead of conditional jumps to an unknown distance, the negation of the condition for jumping through one operator is used, and the jump to an unknown distance is carried out using the jmp operator (to avoid the “Relative jump out of range by xxxxh bytes” error).

All that remains is to add all this to the compilation method:

// compiler.css

std::string compiler::linker::compile()
{
	std::string program = "";
	program += programsStart;
	int a = 0;
	program += compileNode(tree, a);
	program += programsEnd;
	return  program;
}

bool compiler::compile(std::string& result)
{
	token token = getNexttoken();

	while (token.type != tokensTypes::ENDFILE)
	{
		if (token.type == tokensTypes::ERROR)
		{
			result = "Lexical error";
			return false;
		}
		tokens.push_back(token);
		token = getNexttoken();
	}
	tokens.push_back(token);

	compiler::node tree = compiler::node(nodeTypes::EMPTY);
	if (!compiler::parser::parse(tokens, tree))
		return false;

	compiler:linker linker = compiler::linker(&tree);

	result = linker.compile();
	return true;
}

All that remains is to check the functionality of this compiler.

Compiler functionality check

Set aside easy checks! Only hardcore. To check, we will solve the problem of finding the reciprocal number in the ring of prime numbers (hello, Euler’s theorem), as well as an algorithm for finding the gcd of two numbers.

To solve the first problem, use the following code in “our language”:

#include <iostream>
#include "compiler.h"

int main()
{
    char* data = new char[166];

    memcpy(data, "{a = 6; m = 283; s = m - 2; b = a; c = 1; while(0 < s) { if (0 < s - (s/2*2)){c = c * b; c = c - (c / m * m);} s = s / 2; b = b * b; b = b - (b / m * m);} print(c);}", 166);

    compiler comp = compiler(data, 166);

    std::string result;

    comp.compile(result);

    std::cout << result;
}

As you can see, the line with the program code is written without any formatting; this is done to check the correctness of code recognition even in conditions of weak formatting. Let’s look at the assembly code generated by our compiler:

Hidden text
ASSUME CS:CODE, DS:DATA

DATA SEGMENT
        vars dw 26 dup (0)
DATA ENDS;

CODE SEGMENT
        PRINT MACRO
        local DIVCYC, PRINTNUM;
        push ax;
        push bx;
        push cx;
        push dx;
        mov bx, 10;
        mov cx, 0;
    DIVCYC:
        mov dx, 0;
        div bx;
        push dx;
        inc cx;
        cmp ax, 0;
        jne DIVCYC;
    PRINTNUM:
        pop dx;
        add dl, '0';
        mov ah, 02h;
        int 21h;
        dec cx;
        cmp cx, 0;
        jne PRINTNUM;
                mov dl, 0dh;
        mov ah, 02h;
        int 21h;
                mov dl, 0ah;
        mov ah, 02h;
        int 21h;
        pop dx;
        pop cx;
        pop bx;
        pop ax;
        ENDM;

    begin:
        lea si, vars;
        mov ax, 6;
        mov [si + 0], ax;

        mov ax, 283;
        mov [si + 24], ax;

        mov ax, [si + 24];
        push ax;
        mov ax, 2;
        mov bx, ax;
        pop ax;
        sub ax, bx;
        mov [si + 36], ax;

        mov ax, [si + 0];
        mov [si + 2], ax;

        mov ax, 1;
        mov [si + 4], ax;

        jmp LBL0;
    LBL1: nop
        mov ax, 0;
        push ax;
        mov ax, [si + 36];
        push ax;
        mov ax, [si + 36];
        push ax;
        mov ax, 2;
        mov bx, ax;
        pop ax;
        xor dx, dx;
        div bx;
        push ax;
        mov ax, 2;
        mov bx, ax;
        pop ax;
        mul bx;
        mov bx, ax;
        pop ax;
        sub ax, bx;
        mov bx, ax;
        pop ax;
        cmp ax, bx;
        jae LBL2;
        mov ax, 1;
        jmp LBL3;
    LBL2:
        mov ax, 0;
    LBL3: nop
        cmp ax, 0;
        jne LBL4;
        jmp LBL5;
    LBL4: nop
        mov ax, [si + 4];
        push ax;
        mov ax, [si + 2];
        mov bx, ax;
        pop ax;
        mul bx;
        mov [si + 4], ax;

        mov ax, [si + 4];
        push ax;
        mov ax, [si + 4];
        push ax;
        mov ax, [si + 24];
        mov bx, ax;
        pop ax;
        xor dx, dx;
        div bx;
        push ax;
        mov ax, [si + 24];
        mov bx, ax;
        pop ax;
        mul bx;
        mov bx, ax;
        pop ax;
        sub ax, bx;
        mov [si + 4], ax;

    LBL5: nop
        mov ax, [si + 36];
        push ax;
        mov ax, 2;
        mov bx, ax;
        pop ax;
        xor dx, dx;
        div bx;
        mov [si + 36], ax;

        mov ax, [si + 2];
        push ax;
        mov ax, [si + 2];
        mov bx, ax;
        pop ax;
        mul bx;
        mov [si + 2], ax;

        mov ax, [si + 2];
        push ax;
        mov ax, [si + 2];
        push ax;
        mov ax, [si + 24];
        mov bx, ax;
        pop ax;
        xor dx, dx;
        div bx;
        push ax;
        mov ax, [si + 24];
        mov bx, ax;
        pop ax;
        mul bx;
        mov bx, ax;
        pop ax;
        sub ax, bx;
        mov [si + 2], ax;

    LBL0: nop
        mov ax, 0;
        push ax;
        mov ax, [si + 36];
        mov bx, ax;
        pop ax;
        cmp ax, bx;
        jae LBL6;
        mov ax, 1;
        jmp LBL7;
    LBL6:
        mov ax, 0;
    LBL7: nop
        cmp ax, 0;
        je LBL8
        jmp LBL1
    LBL8: nop
        mov ax, [si + 4];
        PRINT;
        mov ah, 4Ch;
        int 21h;
CODE ENDS;
end begin

It turned out to be a lot and not very beautiful (what can you expect from taking a remainder through division, multiplication and subtraction?), Let’s compare the results of the program and the result from the first site we came across:

As you can see, the results are consistent. Let’s move on to the Euclidean algorithm. The code in “our” language for it will be as follows:

#include <iostream>
#include "compiler.h"

int main()
{
    char* data = new char[85];

    memcpy(data, "{a = 11004; b = 10087; while(0 < b)  {c = a - (a / b * b); a = b; b = c;} print(a);}", 85);

    compiler comp = compiler(data, 85);

    std::string result;

    comp.compile(result);

    std::cout << result;
}

We get the following assembly code:

Hidden text
ASSUME CS:CODE, DS:DATA

DATA SEGMENT
        vars dw 26 dup (0)
DATA ENDS;

CODE SEGMENT
        PRINT MACRO
        local DIVCYC, PRINTNUM;
        push ax;
        push bx;
        push cx;
        push dx;
        mov bx, 10;
        mov cx, 0;
    DIVCYC:
        mov dx, 0;
        div bx;
        push dx;
        inc cx;
        cmp ax, 0;
        jne DIVCYC;
    PRINTNUM:
        pop dx;
        add dl, '0';
        mov ah, 02h;
        int 21h;
        dec cx;
        cmp cx, 0;
        jne PRINTNUM;
                mov dl, 0dh;
        mov ah, 02h;
        int 21h;
                mov dl, 0ah;
        mov ah, 02h;
        int 21h;
        pop dx;
        pop cx;
        pop bx;
        pop ax;
        ENDM;

    begin:
        lea si, vars;
        mov ax, 11004;
        mov [si + 0], ax;

        mov ax, 10087;
        mov [si + 2], ax;

        jmp LBL0;
    LBL1: nop
        mov ax, [si + 0];
        push ax;
        mov ax, [si + 0];
        push ax;
        mov ax, [si + 2];
        mov bx, ax;
        pop ax;
        xor dx, dx;
        div bx;
        push ax;
        mov ax, [si + 2];
        mov bx, ax;
        pop ax;
        mul bx;
        mov bx, ax;
        pop ax;
        sub ax, bx;
        mov [si + 4], ax;

        mov ax, [si + 2];
        mov [si + 0], ax;

        mov ax, [si + 4];
        mov [si + 2], ax;

    LBL0: nop
        mov ax, 0;
        push ax;
        mov ax, [si + 2];
        mov bx, ax;
        pop ax;
        cmp ax, bx;
        jae LBL2;
        mov ax, 1;
        jmp LBL3;
    LBL2:
        mov ax, 0;
    LBL3: nop
        cmp ax, 0;
        je LBL4
        jmp LBL1
    LBL4: nop
        mov ax, [si + 0];
        PRINT;
        mov ah, 4Ch;
        int 21h;
CODE ENDS;
end begin

Let’s check the result of the code:

As you can see, everything comes together.

Repository link

Full source code of the program available in the repository.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *