String duplicates in python sources – solution

Before offering my view on this phenomenon, I would like to describe the phenomenon itself. I don’t understand why at all, but in python sources, developers very often duplicate lines. For example, there is this piece of code:

def send_notification(respondents: list, message: dict) -> None:
    for resp in respondents:
        to_send = copy(message)
        if "subject" not in to_send:
            to_send["subject"] = "Hello, " + str(resp)
        if "from" not in to_send:
            to_send["from"] = None
        to_send['body'] = to_send['body'].replace('@respondent@', resp)
        to_send['crc'] = crc_code(to_send['body'])

I’ll make a reservation right away – I’m not the author, this code was taken from the report of Grigory Bakunov (and he is not the author), at one of the conferences dedicated to python (available on youtube). Performance issues were discussed there, but I would not like to talk about that. And about lines – lines “subject”, “from”, “body” – and similar, which are duplicated in this source as string literals. And for sure in this project in a bunch of other places they pop up again and again in the form of duplicates… I specially took someone else’s example to draw attention to the fact that this problem exists, not only in my personal experience.

From the very beginning of learning the python language, I constantly meet this in a variety of sources from different developers. And it’s a mystery to me – why no one is fighting this duplication – sort of like DRY does he have to work here? Not?

In general, what is bad about it:

1) Waste memory when creating each instance of the same string – this is actually unlikely in python, since it has a mechanism for automatic interning of strings similar to identifiers (short with certain characters). A dictionary of such strings is created and they are not duplicated in memory and can be quickly compared (a hash is created for them), and not character by character. Also, depending on the implementation of the interpreter, in general, all lines stored in the source can be interned. Therefore, about the memory rather past.

>>> "abc" is "abc"
True
>>> id("abc") == id("abc")
True

Although there is a small catch here – strings received at runtime (user input, reading from a file, expressions, etc.) are not interned automatically. There is a function for forced interning sys.intern().

2) The probability of an error by a typo increases many times over. Instead of one place in the code where a string would be assigned to a constant, we write this string every now and then and a chance to get, say, a character “c” not from the Latin alphabet, but from the Cyrillic alphabet (and which one is here now? – you can’t see it with your eyes at all …) increases many times over. Yes, modern IDEs allow you to search and replace by project, but what will you look for when you don’t know exactly which word is a typo?

3) Difficulty in finding errors – Neither the interpreter nor the linters help us find such errors. If there was a constant subject_str containing “subject”, then if we mistyped when mentioning the constant, we would get an error from the linter, and if not (there is no linter or it went wrong), then at runtime, we would still get a clear error message, because such an identifier (for example, the name of a constant with a typo “subject_strr” – the key is stuck, the finger trembled, etc. ) simply does not exist. But an invalid string literal will simply give runtime bugs, possibly without crashes at all, when compared with an invalid string, the condition may never be met, etc.

4) It’s inconvenient to maintain. If in the example above the tag “subject” For some reason, you will have to replace for example with “topic” – then it just turns into a game with the same search and replace tools in the IDE, while you need to carefully watch each inclusion, because it is not necessary that exactly all “subject” the lines in the project are exactly what will need to be replaced. If the string were declared in one place as a constant, one could simply change its value, provided of course that the constant is used according to the logic of the modules.

By the way, the SonarCube static code analysis tool also considers this a problem. https://rules.sonarsource.com/python/RSPEC-1192

A simple solution suggests itself in the form of a reference class with string constants.

class Colors:
    red   = "red"
    black = "black"
    white = "white"
C = Colors

...

print(C.red)

Of course, it’s better, but still there is a stupid duplication in the class description, which in 90% of cases will be just that and will be an eyesore, and is also a place for an error, although its probability is greatly reduced. It is necessary to use metaclasses – so I told my colleague and after 5 minutes he gave out a simple and logical solution, which I personally still use. Why not? Metaclasses create classes, participate in their creation, they can modify attributes, which means they suit us.

init_copy = None

class CStrProps_Meta(type):
    def __init__(cls, className, baseClasses, dictOfMethods):
        for k, v in dictOfMethods.items():
            if k.startswith("__"):
                continue
            if v is None:
                setattr(cls, k, k)

The operation logic is simple – we skip dunder methods, we skip those attributes that are already initialized, and we assign a value equal to the attribute name to those that are filled with None. Here init_copy is just a hint that the members of the future dictionary class will be initialized by the metaclass, and will not remain None at all, as you might think from a cursory glance at the class.

class Colors(metaclass=CStrProps_Meta):
    red   = init_copy
    black = init_copy
    white = init_copy
    msg   = "Color is "
C = Colors

print(C.msg + C.red)

You can add additional functionality – what is needed in a particular project, for example, block the ability to edit constants already created in the reference class:

class CStrProps_Meta(type):
    def __init__(cls, className, baseClasses, dictOfMethods):
        cls._items = {}
        for k, v in dictOfMethods.items():

            if k.startswith("__"):
                continue

            if v is None:
                setattr(cls, k, k)

            cls._items[k] = getattr(cls, k, k)

        cls.bInited = True
    
    def __setattr__(cls, *args):
        if hasattr(cls, "bInited"):
            raise AttributeError('Cannot reassign members.')
        else:
            super().__setattr__(*args)

Add output of all elements as a dictionary:

    def dict(cls):
        return cls._items

Add an iterator to be traversed by the for loop:

    def __iter__(cls):
        return iter(cls._items)

Of course, you should also write a simple unit test:

import unittest

class Dummy_Str_Consts(metaclass = CStrProps_Meta):
    name  = init_copy
    Error = "[Error:]"
    Error_Message = f"{Error}ERROR!!!"
DSC = Dummy_Str_Consts

class Test_Str_Consts(unittest.TestCase):
    def test_str_consts(self):
        self.assertEqual( Dummy_Str_Consts.name,  "name" )

        self.assertEqual( Dummy_Str_Consts.Error, "[Error:]" )

        self.assertEqual( Dummy_Str_Consts.Error_Message, "[Error:]ERROR!!!" )

        l = ["name", "Error", "Error_Message"]
        l1 = []
        for i in DSC:
            l1.append(i)
        self.assertEqual( l, l1 )
        self.assertEqual( l, list(DSC.dict().keys()) )

        l = ["name", "[Error:]", "[Error:]ERROR!!!"]
        self.assertEqual( l, list(DSC.dict().values()) )

        with self.assertRaises(AttributeError):
            DSC.name = "new name"


        print(C.msg + C.black)

if __name__ == "__main__":
    unittest.main()

In the test, the lines are intentionally duplicated, since this is a test of the described mechanism, and in order to correctly test the operation of the metaclass, we intentionally manually drive in the expected result.

With this approach, the probability of errors is many times lower, their search is simpler, easier and more supportive, you can use refactoring tools (if subject change to topic), or you can simply assign to a variable in only one place what we need there now.

In general, I use this option to deal with these repeated string literals, maybe there is a solution and even better, I suggest sharing it in the comments.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *