Implementing a simple regex engine in 70 lines of code

This short article owes one interesting test case, in which it was required to implement the basic functionality of the grep utility in PHP without using any built-in regular expression functions.

The template line had to include support for the following metacharacters:

  • ^ – the beginning of the line

  • $ – end of line

  • – any character

  • * – 0 or more times

  • ? – 0 or 1 time

  • + – 1 or more times

It was also necessary to support the escaping of metacharacters with ” so that they could be searched. As a result, the resulting utility takes about a hundred lines of code, of which 70 are regular expression functions.

For those who are interested in a simple implementation of the regular expression mechanism for purely educational purposes, here is its code and a brief analysis.

Final result:

grep.php
#!/bin/env php
<?php

const META_CHAR = 1, LITERAL = 2, META_CHARS = ["*", "?", "+", "."];

function is_metachar(string $re, int $i): bool
{
	return in_array($re[$i], META_CHARS) || ($re[$i] === "^" && $i === 0) || ($re[$i] === "$" && $i === (strlen($re) - 1));
}

function tokenize(string $re, int $i = 0): array
{
	$re_tokens = [];
	$len = strlen($re);
	while ($i < $len) {
		if ($re[$i] === '\' && $i + 1 < $len && (is_metachar($re, $i + 1) || ($i === 0 && $re[$i + 1] === "^")))
			$re_tokens[] = [LITERAL, $re[++$i]];
		else
			$re_tokens[] = [is_metachar($re, $i) ? META_CHAR : LITERAL, $re[$i]];
		$i++;
	}
	return $re_tokens;
}

function match_char(array $re_char, string $str_char): bool
{
	return ($re_char[0] === LITERAL && $re_char[1] === $str_char) || ($re_char[0] === META_CHAR && $re_char[1] === ".");
}

function match_metachar(array $tokens, int $pos, string $char): bool
{
	return isset($tokens[$pos]) && $tokens[$pos][0] === META_CHAR && $tokens[$pos][1] === $char;
}

function is_end($data, int $pos): bool
{
	return !isset($data[$pos]);
}

function re_match(array $re_tokens, string $str, int $str_pos = 0): bool
{
	if (match_metachar($re_tokens, 0, "^"))
		return match_from_pos($re_tokens, 1, $str, 0);

	$match = false;
	while (!$match && !is_end($str, $str_pos))
		$match = match_from_pos($re_tokens, 0, $str, $str_pos++);
	return $match;
}

function match_from_pos(array $re_tokens, int $re_pos, string $str, int $str_pos): bool
{
	if (is_end($re_tokens, $re_pos))
		return true;

	if (match_metachar($re_tokens, $re_pos, "$"))
		return is_end($str, $str_pos);

	if (match_metachar($re_tokens, $re_pos + 1, "*"))
		return match_from_pos($re_tokens, $re_pos + 2, $str, $str_pos) || (!is_end($str, $str_pos) &&
			match_char($re_tokens[$re_pos], $str[$str_pos]) &&
			match_from_pos($re_tokens, $re_pos, $str, $str_pos + 1));

	if (match_metachar($re_tokens, $re_pos + 1, "+"))
		return match_char($re_tokens[$re_pos], $str[$str_pos]) && (match_from_pos($re_tokens, $re_pos + 2, $str, $str_pos + 1) ||
			match_from_pos($re_tokens, $re_pos, $str, $str_pos + 1));

	if (match_metachar($re_tokens, $re_pos + 1, "?"))
		return match_from_pos($re_tokens, $re_pos + 2, $str, $str_pos) || (match_char($re_tokens[$re_pos], $str[$str_pos]) &&
			match_from_pos($re_tokens, $re_pos + 2, $str, $str_pos + 1));

	return !is_end($str, $str_pos) &&
		match_char($re_tokens[$re_pos], $str[$str_pos]) &&
		match_from_pos($re_tokens, $re_pos + 1, $str, $str_pos + 1);
}

function grep($file, array $re_tokens, string $fname = ""): void
{
	$line = 0;
	while (($str = fgets($file)) !== false && ++$line)
		if (re_match($re_tokens, $str = rtrim($str, "rn")))
			echo ($fname ? "{$fname}:" : "") . "{$line}:{$str}n";
}

if ($argc < 2)
	exit("Usage: {$argv[0]} pattern [file1 file2 ...]n");

$re_tokens = tokenize($argv[1]);

if ($argc == 2) {
	grep(STDIN, $re_tokens);
} else {
	for ($i = 2; $i < $argc; $i++) {
		if (is_dir($argv[$i])) {
			fwrite(STDERR, "{$argv[0]}: {$argv[$i]} : Is a directoryn");
			continue;
		}
		$f = @fopen($argv[$i], "r");
		if ($f === false) {
			fwrite(STDERR, "Cannot open file: {$argv[$i]}n");
			continue;
		}
		grep($f, $re_tokens, $argc > 3 ? $argv[$i] : "");
		fclose($f);
	}
}

Example of work:

./grep.php '^funct+is?.*pos.*$.*bool$' grep.php
30:function match_metachar(array $tokens, int $pos, string $char): bool
51:function match_from_pos(array $re_tokens, int $re_pos, string $str, int $str_pos): bool

First of all, we need to parse the regular expression string into metacharacters and literals. This is done by the function tokenize:

<?php

const META_CHAR = 1, LITERAL = 2, META_CHARS = ["*", "?", "+", "."];

function is_metachar(string $re, int $i): bool
{
	return in_array($re[$i], META_CHARS) || ($re[$i] === "^" && $i === 0) || ($re[$i] === "$" && $i === (strlen($re) - 1));
}

function tokenize(string $re, int $i = 0): array
{
	$re_tokens = [];
	$len = strlen($re);
	while ($i < $len) {
		if ($re[$i] === '\' && $i + 1 < $len && (is_metachar($re, $i + 1) || ($i === 0 && $re[$i + 1] === "^")))
			$re_tokens[] = [LITERAL, $re[++$i]];
		else
			$re_tokens[] = [is_metachar($re, $i) ? META_CHAR : LITERAL, $re[$i]];
		$i++;
	}
	return $re_tokens;
}

The function takes a regular expression string, iterates over each character and determines its type based on its membership in the array of metacharacters. If the metacharacter is escaped (preceded by a slash ““), then it is considered a regular literal. It is also worth noting that ^ and $ are considered metacharacters only if they appear at the beginning and end of the pattern, respectively.

<?php

function match_char(array $re_char, string $str_char): bool
{
	return ($re_char[0] === LITERAL && $re_char[1] === $str_char) || ($re_char[0] === META_CHAR && $re_char[1] === ".");
}

function match_metachar(array $tokens, int $pos, string $char): bool
{
	return isset($tokens[$pos]) && $tokens[$pos][0] === META_CHAR && $tokens[$pos][1] === $char;
}

function is_end($data, int $pos): bool
{
	return !isset($data[$pos]);
}

function re_match(array $re_tokens, string $str, int $str_pos = 0): bool
{
	if (match_metachar($re_tokens, 0, "^"))
		return match_from_pos($re_tokens, 1, $str, 0);

	$match = false;
	while (!$match && !is_end($str, $str_pos))
		$match = match_from_pos($re_tokens, 0, $str, $str_pos++);
	return $match;
}

Function re_match takes an array of tokens obtained from a function tokenize and the string in which to search, as well as the position from which to start searching in the string (default 0). If the regular expression starts with the metacharacter “^“, then the match of the remaining pattern is checked strictly from the starting position of the string. Otherwise, it searches for a match in any position (in case of failure, each time shifting the position of the string by one character until the end is reached).

<?php 

function match_from_pos(array $re_tokens, int $re_pos, string $str, int $str_pos): bool
{
	if (is_end($re_tokens, $re_pos))
		return true;

	if (match_metachar($re_tokens, $re_pos, "$"))
		return is_end($str, $str_pos);

	if (match_metachar($re_tokens, $re_pos + 1, "*"))
		return match_from_pos($re_tokens, $re_pos + 2, $str, $str_pos) || (!is_end($str, $str_pos) &&
			match_char($re_tokens[$re_pos], $str[$str_pos]) &&
			match_from_pos($re_tokens, $re_pos, $str, $str_pos + 1));

	if (match_metachar($re_tokens, $re_pos + 1, "+"))
		return match_char($re_tokens[$re_pos], $str[$str_pos]) && (match_from_pos($re_tokens, $re_pos + 2, $str, $str_pos + 1) ||
			match_from_pos($re_tokens, $re_pos, $str, $str_pos + 1));

	if (match_metachar($re_tokens, $re_pos + 1, "?"))
		return match_from_pos($re_tokens, $re_pos + 2, $str, $str_pos) || (match_char($re_tokens[$re_pos], $str[$str_pos]) &&
			match_from_pos($re_tokens, $re_pos + 2, $str, $str_pos + 1));

	return !is_end($str, $str_pos) &&
		match_char($re_tokens[$re_pos], $str[$str_pos]) &&
		match_from_pos($re_tokens, $re_pos + 1, $str, $str_pos + 1);
}

In function match_from_pos(ition) is the main regular expression engine based on recursion. The logic of work can be described as follows:

  1. If the end of the regular expression is reached, then true is returned, since any string matches an empty pattern.

  2. If the current metacharacter “$“, then it is checked whether the end of the line has been reached.

  3. If the current position is followed by the “*“(0 or more repetitions of the current character), then at the beginning an attempt is made to recursively search for the rest of the regular expression starting from the current position of the string (which corresponds to the variant with 0 repetitions). in case of a match, call the recursion from the next position in the string (matches a variant of one or more matches).

  4. If the current position is followed by the “+“(1 or more repetitions of the current character), then at the beginning we check the coincidence of characters in the current position, and then recursively compare the rest of the regular expression with the next position of the line or (if this option did not match) we check the current position of the regular expression with the next position of the line ( matches more than one match).

  5. If the current position is followed by the “?“(0 or 1 repetition of the current character), then as in the variant with”*“, at the beginning we process the case of 0 repetitions, and in case of failure, we check the characters in the current position and, if they match, we recursively go to checking the next position.

  6. The last option corresponds to a simple character comparison – we check that the end of the string has not yet been reached, we compare the characters in the current positions and recursively call the check from the next position.

The rest of the code is related to the utility itself – simply reading lines from files or from the standard input stream and outputting the matched lines.

<?php 

function grep($file, array $re_tokens, string $fname = ""): void
{
	$line = 0;
	while (($str = fgets($file)) !== false && ++$line)
		if (re_match($re_tokens, $str = rtrim($str, "rn")))
			echo ($fname ? "{$fname}:" : "") . "{$line}:{$str}n";
}

if ($argc < 2)
	exit("Usage: {$argv[0]} pattern [file1 file2 ...]n");

$re_tokens = tokenize($argv[1]);

if ($argc == 2) {
	grep(STDIN, $re_tokens);
} else {
	for ($i = 2; $i < $argc; $i++) {
		if (is_dir($argv[$i])) {
			fwrite(STDERR, "{$argv[0]}: {$argv[$i]} : Is a directoryn");
			continue;
		}
		$f = @fopen($argv[$i], "r");
		if ($f === false) {
			fwrite(STDERR, "Cannot open file: {$argv[$i]}n");
			continue;
		}
		grep($f, $re_tokens, $argc > 3 ? $argv[$i] : "");
		fclose($f);
	}
}

In general, this short article is purely educational in nature, a simple implementation of the regular expression mechanism, it may be useful to someone.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *