What if program source code were stored in a binary format?

This article is just an idea; don't judge it too harshly.

TL;DR: I suggest we consider storing program source code in a binary format instead of plain text.

Compiler and IDE

How a compiler works: first comes lexical analysis, i.e. the source code is split into tokens. Then comes syntactic analysis: the resulting tokens are combined into a syntax tree. Then semantic analysis: type inference, variable scope checks, and so on.

Only after that come the stages that ultimately produce the executable file.
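The front-end stages described above can be sketched with Python's standard `tokenize` and `ast` modules. This is only an illustration: the helper names `lex`, `parse`, and `undefined_names` are hypothetical, the "semantic analysis" is a toy check rather than real type inference, and `ast.parse` does its own tokenizing internally instead of consuming the output of `lex`.

```python
import ast
import io
import tokenize

SOURCE = "total = price * 2\n"


def lex(source: str) -> list[tuple[str, str]]:
    """Lexical analysis: split the text into (token kind, token text) pairs."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    skipped = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER)
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokens
        if tok.type not in skipped
    ]


def parse(source: str) -> ast.Module:
    """Syntactic analysis: build a syntax tree from the source text."""
    return ast.parse(source)


def undefined_names(tree: ast.AST) -> set[str]:
    """Toy semantic check: names that are read but never assigned in the tree."""
    names = [node for node in ast.walk(tree) if isinstance(node, ast.Name)]
    assigned = {n.id for n in names if isinstance(n.ctx, ast.Store)}
    loaded = {n.id for n in names if isinstance(n.ctx, ast.Load)}
    return loaded - assigned


if __name__ == "__main__":
    print(lex(SOURCE))
    print(ast.dump(parse(SOURCE)))
    print(undefined_names(parse(SOURCE)))
```

An IDE has to repeat roughly the same three steps on every file before it can highlight or autocomplete anything.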

How a typical IDE works: yes, exactly the same way. Lexical analysis, syntactic analysis, semantic analysis, type inference, and all the rest. So basically, IDE developers write half a compiler just so you can have all the features of a modern IDE.

That is, the program text itself is needed only by a human, at the point of entering and reading code, because a raw AST is not a convenient way for a person to understand what is going on.

But what if you store your source code differently?

What if the source code were stored not as text but directly as a binary file containing an already parsed syntax tree, with types already inferred and so on? The IDE could then generate the source text for a human on the fly.

What are the advantages?
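As a rough sketch of this idea, Python's own tooling can already do a small version of it: parse once, keep the tree as bytes, regenerate text on demand, and compile straight from the stored tree. The use of `pickle` as the "binary format" is purely an assumption for illustration (it is not safe for untrusted files and stores no inferred types), and the function names are hypothetical. `ast.unparse` requires Python 3.9 or later.

```python
import ast
import pickle

MESSY = "def area(w,h):\n        return   w*h\n"


def save_tree(source: str) -> bytes:
    """Parse the text once and return a binary blob holding the syntax tree."""
    return pickle.dumps(ast.parse(source))


def load_tree(blob: bytes) -> ast.Module:
    """Read the tree back without lexing or parsing any text."""
    return pickle.loads(blob)


def render(tree: ast.AST) -> str:
    """What an editor would do: build readable text from the tree on the fly."""
    return ast.unparse(tree)


def run_area(blob: bytes, w: int, h: int) -> int:
    """What a compiler would do: compile directly from the stored tree."""
    namespace: dict = {}
    exec(compile(load_tree(blob), "<stored tree>", "exec"), namespace)
    return namespace["area"](w, h)


if __name__ == "__main__":
    blob = save_tree(MESSY)
    print(render(load_tree(blob)))
    print(run_area(blob, 3, 4))
```

The text that comes back is normalized: the odd indentation and spacing of the original input never reach the stored file.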

  1. Compilation should be noticeably faster, because lexing, parsing, and part of the semantic analysis are already done.
  2. The IDE will not spend a long time “indexing” the project; syntax highlighting and most of the information needed for autocomplete will be available immediately.
  3. There will be no more disputes like “tabs vs. spaces”, because only the tree ends up in the file.
  4. If you really go wild, you could probably even set up your IDE to write in another language. For example, the code is in Java, but you display it to yourself as Kotlin, and simply tell the IDE somewhere to save a Java AST to the file. I'm not sure about this one; it's just an idea of what else becomes possible once you throw off the shackles of text.
  5. What if the editor could change fonts, for example to highlight certain important things in bold red?
  6. The IDE could store its own data in these binaries to speed up some display and autocomplete operations. The files could grow quite large, though.
  7. Most likely, there will be many interesting possibilities that don't even come to mind right now. The approach is completely different.
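Point 3 can be checked with a small experiment, again using Python's `ast` module as a stand-in for the hypothetical binary format: two sources that differ only in tabs versus spaces produce the same tree, so nothing about indentation style would reach the stored file. The helper name `same_tree` is made up for this sketch.

```python
import ast

WITH_TABS = "def f(x):\n\tif x:\n\t\treturn 1\n\treturn 0\n"
WITH_SPACES = "def f(x):\n    if x:\n        return 1\n    return 0\n"
DIFFERENT = "def f(x):\n    return 0\n"


def same_tree(a: str, b: str) -> bool:
    """Compare two sources by syntax tree, ignoring layout and positions."""
    # ast.dump omits line and column attributes by default,
    # so only the structure of the program is compared.
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


if __name__ == "__main__":
    print(same_tree(WITH_TABS, WITH_SPACES))
    print(same_tree(WITH_TABS, DIFFERENT))
```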

There are downsides, of course.

Every tool that edits code or shows diffs would have to be written specifically for the language: IDEs, GitHub, GitLab, and so on. For simple scripting languages, this is an extra complication. Where you used to be able to edit a file with any text editor and needed nothing else, you would now need a special editor at hand. And you can't write such an editor in five minutes; it will inevitably be half a compiler. Today there are monstrous IDEs and there are editors like nano, but with a binary format every editor has to be able to do a lot. Most likely, the language would have to ship with a language server or, say, C libraries that editors can bind to.

In general, to try something like this, even as an experiment, you need control over both the language and a popular IDE. At the moment, probably only Kotlin has a chance, since JetBrains develops both Kotlin and IntelliJ IDEA.

It won't be that easy to grep the project either; you'll need a special grep.
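A "special grep" of this kind might look roughly like the sketch below, which searches a syntax tree instead of lines of text. It assumes Python's `ast` module as the tree format and parses from text only because no real binary format exists here; `grep_calls` is a hypothetical helper name.

```python
import ast

SAMPLE = '''\
print("start")
note = "print is only mentioned here"
def log(msg):
    print(msg)
'''


def grep_calls(source: str, func_name: str) -> list[int]:
    """Return the line numbers where `func_name` is called as a plain name."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == func_name
    )


if __name__ == "__main__":
    print(grep_calls(SAMPLE, "print"))
```

Unlike a textual search for `print`, this skips line 2 of the sample, where the word only appears inside a string literal. The upside of a tree-aware grep is precision; the downside, as noted above, is that it has to be written per language.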

Where did the idea come from?

A long time ago, in Karuna's smoking room, I took part in a “tabs vs. spaces” discussion, and we came to the conclusion that neither tabs nor spaces are logical. Spaces were meant to separate words, and tabs were originally a typewriter mechanism that made it easier to type tables.

In essence, a tab is just alignment to a grid:

Here I put a tab between t and t, which made the second t line up with the z=3 line. The question is, why? No reason; that's just how typewriters used to do it.
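The grid behaviour of a tab can be reproduced with `str.expandtabs`. This is only a sketch: the tab width of 4 and the exact two lines are assumptions made for the example, not something the post specifies.

```python
TAB_WIDTH = 4  # assumed tab-stop spacing; editors commonly use 2, 4, or 8

# A tab does not mean "N spaces"; it means "jump to the next grid column".
first_line = "t\tt".expandtabs(TAB_WIDTH)
second_line = "\tz=3".expandtabs(TAB_WIDTH)

if __name__ == "__main__":
    print(first_line)
    print(second_line)
    # Both the second "t" and the "z" land on the same grid column.
    print(first_line.rindex("t"), second_line.index("z"))
```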

In the modern world, no one programs in .txt files without syntax highlighting; people usually use powerful IDEs that can easily convert tabs to spaces and back. So why do we store this information in the source code at all? Simply because we use text, which means we have to pick something for indentation. But maybe this should be reconsidered?

And some attempts in this direction, by the way, are already being made.

When I posted about this in my Telegram channel, commenters immediately pointed me to the tree-sitter project. It is a bit different, since it is not tied to a compiler, but the idea points in roughly the same direction: it is more convenient for code tools to work with a tree directly and update it on the fly as needed. Moreover, tree-sitter, if I understand correctly, provides a C library that can be bound to any tool. In general, if such a tree were stored instead of the code's text, and in a form that compilers also understand, the problem would be solved.
