Search for constants-“matryoshkas” to reduce the size of data in the program


We will talk about nameless constants in the program, which are often called literals. If such a literal cannot be used as an immediate operand in a machine instruction, the compiler has to allocate – and where to go! – this literal has its own memory and then operate with the address of this memory.

Of course, every self-respecting compiler checks to see if there was already exactly the same literal before, and then uses the already existing reference without re-allocating memory.

Do not be lazy, check that the compiler you are using, for example, in the type operator:

put(’Hello, world’, ’Hello, world’); 

Exactly uses the same instance of the ‘Hello, world’ constant in memory, not two different ones.

Here is my complete example:

test:proc main;
put('Hello world','Hello world');
end test;

It can be seen that in the executable program shown by the debugger, the same constant at address 40CB94 is used twice

using the same instance of a literal twice

using the same instance of a literal twice

I am convinced that almost all compilers will always do just that, since the usefulness of checking for strong constants to reduce data in a program is obvious and such a check is relatively cheap.

But the next check, perhaps, is no longer done by every compiler. I mean checking for the occurrence of one constant as a substring in another constant.

For example, by adding the symbol “!” formally we obtain two different text constants, but again only one instance in memory can be used.

put(’Hello, world!’, ’Hello, world’); 

The differences will be only in the loaded lengths of the constants 0Ch and 0Bh.

Using one literal instance as two nested with the same address

Using one literal instance as two nested with the same address

A little more interesting is the addition of the symbol “!” also the beginning of the first text constant.

put(’!Hello, world!’, ’Hello, world’);
Using one instance of a literal as two nested ones with different addresses

Using one instance of a literal as two nested ones with different addresses

Again, a single instance of the constant in memory can be used, however, since the substring occurrence does not start at the first byte, not only will the lengths differ, but the literal load addresses will also differ by the amount of “skipped” bytes when searching for the substring occurrence. In the example, these are addresses 40CB94 and 40CB95.

Does the search for nested doll constants have disadvantages? Of course it has.

The first drawback (as, by the way, in the case of checks for strict match) is manifested in the independent compilation of modules. The compiler usually does not make assumptions that the next constant can be in another module and that it has already compiled this module or will still compile it. As a result, although there are no repeated literals in all modules, they may remain in the entire compiled program.

Of course, the necessary checks could be performed already at the link editing stage, i.e. when assembling modules into a single program. But then it is necessary to save information about all literals in all modules and generally build a link editing sequence in a different way. Therefore, in the link editor I use, all this is not done for about the same reasons why the hero of the movie “Beware of the Car” Yuri Detochkin refused to use the autogen: it’s such a fuss…

The second drawback appears if the processing of literals takes place in the compiler on a single “pass”, i.e. during the same reading of the original text.

If, according to the law of dirty tricks, the compiler first comes across a shorter literal in the program text, and only then a longer one, including the short one as a substring, then the memory for the short one has already been allocated and the “matryoshka” of constants in memory does not work. But to create another special compiler pass to process literals in such unfortunate cases, as they say, “toad chokes”: the gain may be insignificant, and compilation speed decreases.

However, due to one “pass” of the compiler on literals, operators, for example, of the form:

if s=’abc’ or s=’abcd’ then …
if s=’abcd’ or s=’abc’ then …

may require a different number of bytes to store the ‘abcd’ and ‘abc’ literals.

Finally, the third drawback may be related to alignment. Often, to improve processor performance, data, including literals, are aligned in memory, for example, along eight-byte boundaries.

In the case of a substring constant, it may no longer be aligned. True, for text literals, especially long ones, alignment does not affect performance as much as, for example, for double data, which is always 8 bytes long.

Conclusion:

Searching for “matryoshka” constants in a program during compilation makes it possible to use memory more efficiently to store such constants. Lack of alignment, of course, can adversely affect the speed of memory access when working with such constants. But on the other hand, in a program with “dolls” literals, there will be more memory accesses at the same or close addresses, which will do the job cache-memory is more efficient.

Personally, I am in favor of always looking for not only strictly matching, but also nested literals. This gives a tangible effect. For example, in the program that I am maintaining (and in which there are a lot of literals), even with all the indicated restrictions, the data size has been reduced by 5600 bytes only due to the found “matryoshka” constants, while the compilation speed has practically not changed.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *