Recursive Regular Expressions

I decided to add regular expressions to my programming language. At first, I thought that there was absolutely no need for me to understand them, and on the Internet, for sure, there are already a lot of ready-made libraries. I began to search, found some fragments of code in C ++, which give nothing. I had to figure out what regular expressions are. here. For the sake of sport, I decided to make my library in C++.

I began to do it and thought, why don’t I add my cockroaches there. I decided to add two constructs:

{namesubexpression} – call subexpression named “namesubexpression”,
($namesubexpression:BodyExpression) – description of the subexpression named “namesubexpression”.

The subexpression description itself can occur anywhere in the regular expression structure and is ignored in the search, like the commented out ones: (#MeComment).
The problem of infinite recursion immediately arises.
Here is an example of a recursive regular expression that is not valid: ($E:{E}){E}

Of course, I did the validation stage and such search constructs are simply not allowed in the search engine. Also, the validation will not pass an expression that contains a call to an undescribed sub-expression.

Here is an example of text that can be parsed with a recursive regular expression (RRE): [[[[[A]]]]]
And here is his RR: ($RRE:\[({RRE}|A)\]){RRE}

I also decided to add three reserved constructs:
{:String} matches the expression: (("(\\.|[^"])*")|('(\\.|[^'])*'))
{:Digit} matches the expression: (-?[0-9]+.?[0-9]*[Ee]?-?[0-9]*)
{:Name} matches the expression: ([A-Za-z][A-Za-z0-9]*)
But their search engine does not use the structural elements of similar expressions, but is organized by a built-in machine search, which works much faster and returns one whole line of text that contains the entire body of the found match and not parts for each component in similar regular expressions.

Response format:

The search engine, if at least one match is found, returns the answer. Namely, a structure, which is a set of elements, each of which can be either a line of text, or the same set. The answer to a successful search is guaranteed to be many. Each element of which will be a set that is guaranteed to consist of two elements. The first place is guaranteed to be a string of text: either “Text” or “Expression”. This is a sign of a portion of the answer. In the case of “Text”, the second position is guaranteed to be a string of text. In another case, the second place is guaranteed to be the description of the found matches with the RRT elements. Any match found divides the text being searched into parts that are between …

In short, here’s an example:

Code example in Author language:

	Expression = ":";
	Data = "SomePrefixText :: SomeInteriorText : SomeSufixText.";
	Result = Expression.RegularExpression("Parse",Data);
	trace(JSON("stringify",Result));

Answer:

[
	[ "Text", "SomePrefixText " ],
	[ "Expression", [ ":" ] ],
	[ "Expression", [ ":" ] ],
	[ "Text", " SomeInteriorText " ],
	[ "Expression", [ ":" ] ],
	[ "Text", " SomeSufixText." ]
]

Here is a second example:

	Expression = "($RRE:\\(({RRE}|A+)\\)){RRE}";
	Data = "((AAAA))";
	Result = Expression.RegularExpression("Parse",Data);
	trace(JSON("stringify",Result));

Answer:

[
	[
		"Expression",
		[
			[
				"SubExpression:RRE",
				[
					"(",
					[ [ [ "SubExpression:RRE", [ "(", [ [ [ "A", "A", "A", "A" ] ] ], ")" ] ] ] ],
					")"
				]
			]
		]
	]
]

Theoretically, with one well-done PRT, one can describe the syntax of any programming language that does not have a preprocessor.

This is what PHP looks like: ((<\?(php)?.*?\?>)+|(^<\?(php)?.*))
🙂
But the disadvantage of this approach is the inability to track syntax errors.

And here’s another example.
For the sake of sporting interest, I created and tested an RRT that can describe any RRT and itself. I think you’ll be interested to see:

(?x)
($CallSE:\{({:Name}|:(String|Digit|Name))\})
($SpecialChar:\\[DSWdsw])
($SpaceChar:([ \f\n\r\t\v]|\\[fnrtv]))
($x:\\[0-9A-Fa-f]{2})
($O:\\[0-7]{3})
($onec:({x}|{O}|\\.|[^\[\]\\\^]))
($ones:({x}|{O}|\\.|[^\[\]\\\^$|?*+\(\){}]))
($Char:\[\^?({SpecialChar}|({onec}-{onec})|{onec})*?-?\])
($Qantifikator:([\?\*\+]|\{[0-9]*(,[0-9]*)?\})[\?\+]?)
($Grupa:\\[1-9])
($QE:\\Q.*?\\E)
($PosInStr:([\^\$]|\\[Bb]))
($PrefixBody:((\#|\?#)|(\${:Name}:)|(\?[ismx-]*:)|(\?=)|(\?!)|(\?<=)|(\?<!)|(\?=)|(\?>)|(\?)))
($SubE:\({BodyE}\))
($BodyE:(
	(\?[ismx-]*)|
	(
		{PrefixBody}?
		(
			({SubE}|{Char}|{CallSE}|{Grupa}|{QE}|{PosInStr}|{SpecialChar}|{SpaceChar}|(\|)|{ones}){Qantifikator}?
		)*
	)
))
{BodyE}

You can take a look at the code of this PPB library in C++ here.
Who needs it, take it and use it.

Remember: making your own programming language is the most thankless job.
But a programmer user is valued more than just a user.

I look forward to your comments, criticism and interesting suggestions.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *