Type everything

Hello!
We already have an article about how typing evolved at Ostrovok.ru: it explains why we are moving from pyContracts to typeguard and what we expect to end up with. Today I will go into more detail about how the transition itself happens.

A function declaration with pyContracts generally looks like this:

from contracts import contract, new_contract
import datetime

@new_contract
def User(x):
    from models import User
    return isinstance(x, User)

@new_contract
def dt_datetime(x):
    return isinstance(x, datetime.datetime)

@contract
def func(user_list, amount, dt=None):
    """
    :type user_list: list(User)
    :type amount: int|float
    :type dt: dt_datetime|None
    :rtype: bool
    """
    ...

This is a made-up example: I could not find a real function in our project that is both short and covers a meaningful number of type-checking cases. Typically, pyContracts definitions live in files that contain no other logic. Note that User here is a concrete user class, and it is not imported at module level.

And this is the desired result with typeguard:

from typeguard import typechecked
from typing import List, Optional, Union
from models import User
import datetime

@typechecked
def func(user_list: List[User], amount: Union[int, float], dt: Optional[datetime.datetime] = None) -> bool:
    ...

The project has so many type-checked functions and methods that if you stacked them on top of each other you could reach the moon. Translating them from pyContracts to typeguard by hand is simply not feasible (I tried!), so I decided to write a script.

The script is split into two parts: the first collects a mapping of contract definitions and their imports, and the second does the actual code refactoring.

I want to note that neither part claims to be universal. We did not set out to write a tool that covers every possible case, so I often skipped automatic handling of special cases that are rare in the project: it is quicker to fix those by hand. For example, the script that generates the mapping of contracts and imports collected 90% of the values; the remaining 10% of the mapping was crafted manually.

The logic of the mapping-generation script (a sketch follows the steps):

Step 1. Go through all the files of the project, read them. For each file:

  • if the substring "@new_contract" is not present, skip this file
  • if it is, split the file on the "@new_contract" string. For each chunk:
    – parse out the definition and its imports,
    – if successful, write to the success file,
    – if not, write to the error file.

Step 2. Manually process errors
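Here is a minimal sketch of what such a mapping script might look like. All the names here (build_mapping, the output file paths, the regular expressions) are my illustration, not the real code from the project:

import os
import re

# Rough patterns for a contract definition and the imports inside it
# (an illustration, not the project's real parser).
CONTRACT_RE = re.compile(r"def\s+(\w+)\s*\(")
IMPORT_RE = re.compile(r"^\s*(?:from\s+\S+\s+import\s+.+|import\s+\S+)", re.MULTILINE)

def build_mapping(project_root, ok_path="mapping.txt", err_path="mapping_errors.txt"):
    """Collect new_contract names and the imports they rely on."""
    with open(ok_path, "w") as ok, open(err_path, "w") as err:
        for dirpath, _, filenames in os.walk(project_root):
            for filename in filenames:
                if not filename.endswith(".py"):
                    continue
                path = os.path.join(dirpath, filename)
                with open(path) as f:
                    source = f.read()
                if "@new_contract" not in source:
                    continue  # nothing to collect in this file
                # Every chunk after the decorator starts with one contract definition.
                for chunk in source.split("@new_contract")[1:]:
                    match = CONTRACT_RE.search(chunk)
                    if match:
                        imports = IMPORT_RE.findall(chunk)
                        ok.write("{}: {}\n".format(match.group(1), "; ".join(imports)))
                    else:
                        err.write("{}:\n{}\n\n".format(path, chunk[:200]))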

Now that we have the names of all the types pyContracts uses (the ones defined with the new_contract decorator) and all the imports they need, we can write the refactoring code itself. While translating from pyContracts to typeguard by hand, I figured out what I needed from the script:

  1. it is a command that takes the name of a module as an argument (or several modules), in which the function annotation syntax should be replaced
  2. go through all the module's files and read them. For each file:
    • if there is no "@contract" substring, skip the file
    • if there is, turn the code into an ast (abstract syntax tree)
    • find all functions decorated with @contract, and for each of them:
      • get the docstring, parse it, then delete it (a parsing sketch follows this list)
      • build a dictionary of the form {arg_name: arg_type} and use it to set the function's annotations
      • remember the new imports
    • write the modified tree back to the file via astunparse
    • add the new imports to the top of the file
    • replace the "@contract" lines with "@typechecked", because that is easier with plain text than with ast
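To make the docstring step concrete, here is a minimal sketch of how the {arg_name: arg_type} dictionary could be built. parse_contract_docstring and TYPE_MAP are hypothetical names; in reality the mapping table is exactly what the first script produced:

import re

# A few illustrative entries; the real table comes from the mapping script above.
TYPE_MAP = {
    "list(User)": "List[User]",
    "int|float": "Union[int, float]",
    "dt_datetime|None": "Optional[datetime.datetime]",
}

PARAM_RE = re.compile(r":\s*type\s+(\w+)\s*:\s*(.+)")
RTYPE_RE = re.compile(r":\s*rtype\s*:\s*(.+)")

def parse_contract_docstring(docstring):
    """Turn a pyContracts docstring into {arg_name: annotation_string}."""
    annotations = {}
    for line in docstring.splitlines():
        param = PARAM_RE.search(line)
        if param:
            raw = param.group(2).replace(" ", "")
            annotations[param.group(1)] = TYPE_MAP.get(raw, raw)
            continue
        rtype = RTYPE_RE.search(line)
        if rtype:
            raw = rtype.group(1).replace(" ", "")
            annotations["return"] = TYPE_MAP.get(raw, raw)
    return annotations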

I never intended to answer the question "is this name already imported in this file?" inside the script: an extra pass of the isort library deals with that.

But after running the first version of the script, new problems appeared that still had to be solved. It turned out that 1) ast is not omnipotent, and 2) astunparse is more omnipotent than we would like. This showed up in the following ways:

  • at the moment the code is turned into a syntax tree, all single-line comments disappear;
  • empty lines disappear too;
  • ast does not distinguish between functions and class methods, so we had to add logic for that;
  • going the other way, from a tree back to code, triple-quoted multi-line strings are written out as single-quoted strings on one line, with line breaks replaced by \n;
  • unnecessary brackets appear: for example, if A and B and C or D becomes if ((A and B and C) or D).

The code passed through ast and astunparse remains working, but its readability is reduced.

The most serious drawback among these is the disappearing single-line comments (in the other cases we lose nothing and only gain something, brackets for example). The horast library, built on ast, astunparse, and tokenize, promises to deal with this. It promises, and it delivers.

Now the empty lines. There were two possible solutions:

  1. tokenize can determine the "part of speech" of Python code, and horast takes advantage of this when it extracts comment tokens. But tokenize also has tokens such as NEWLINE and NL. So you need to look at how horast restores comments, copy that, and swap the token type.
    – suggested by Anya, 2 months of development experience
  2. Since horast can restore comments, we first replace every empty line with a special comment, run the code through horast, and then replace our comment with an empty line again (sketched below).
    – suggested by Eugene, 8 years of development experience
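A minimal sketch of the second approach; the marker text is arbitrary and chosen purely for illustration:

EMPTY_LINE_MARKER = "#__empty_line__"  # any comment that horast will preserve

def protect_empty_lines(source):
    """Replace blank lines with a marker comment before parsing with horast."""
    return "\n".join(
        EMPTY_LINE_MARKER if not line.strip() else line
        for line in source.splitlines()
    )

def restore_empty_lines(source):
    """Turn the marker comments back into blank lines after unparsing."""
    return "\n".join(
        "" if line.strip() == EMPTY_LINE_MARKER else line
        for line in source.splitlines()
    )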

I will say more about the triple-quoted strings a little further down; the extra brackets were easy enough to live with, especially since auto-formatting removes some of them.

From horast we use two functions, parse and unparse, and neither is perfect: parse contains odd internal errors and in rare cases cannot parse the source code, while unparse cannot write out anything whose type is type (the kind of object you get from type(some_other_type)).

I decided not to deal with parse: its internal logic is rather convoluted and the exceptions are rare, so the non-universality principle applies here.

unparse, on the other hand, works clearly and quite elegantly. The unparse function creates an instance of the Unparser class, which processes the tree in __init__ and writes the result to a file. horast.Unparser inherits through a chain of other Unparsers, the most basic of which is astunparse.Unparser. The descendant classes only extend the functionality of the base class, the overall logic stays the same, so let us look at astunparse.Unparser. It has five important methods:

  1. write – simply writes something to the file.
  2. fill – calls write, taking the current indentation into account (the indentation level is stored as a class field).
  3. enter – increases the indentation.
  4. leave – decreases the indentation.
  5. dispatch – determines the type of a tree node (say T) and calls the method named after that node type, prefixed with an underscore (i.e. _T). This is a meta method.

All the remaining methods have the form _T, for example _Module or _Str. Each such method can 1) call dispatch recursively for the node's subtrees, and 2) use write to write out the node's contents, adding characters and keywords so that the result is a valid Python expression.

For example, suppose we hit a node of type arg, in which ast stores the argument name and the annotation node. dispatch will call the _arg method, which first writes the argument name, then writes a colon and runs dispatch on the annotation node, and inside that annotation subtree dispatch and write are called again in the same way.
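In code, this looks roughly as follows (a simplified paraphrase of astunparse, not a verbatim copy):

class Unparser:
    # ... the other methods of astunparse.Unparser ...

    def _arg(self, t):
        self.write(t.arg)                # the argument name
        if t.annotation:
            self.write(": ")             # the separating colon
            self.dispatch(t.annotation)  # recurse into the annotation subtree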

Let us return to our problem: unparse cannot handle objects of type type. Now that you understand how unparse works, creating your own type is easy. Let's define one:

class NewType(object):
    def __init__(self, t):
        self.s = t.s

It stores a string, and not by accident: we need to annotate function arguments, and the argument types come to us as strings parsed from the docstring. So instead of the actual types we need, let's put into each argument annotation a NewType object that holds only the name of the desired type.

To do this, extend horast.Unparser: write our own UnparserWithType that inherits from horast.Unparser and adds handling of the new type.

class UnparserWithType(horast.Unparser):
    def _NewType(self, t):
        self.write(t.s)

This fits the spirit of the library. The variable names are in ast style, which is why they are one letter long, not because I cannot come up with names; I believe t stands for tree and s for string. By the way, NewType is not a string: if we wanted it to be interpreted as a string literal, we would have to write quotes before and after the write call.

And now for the monkey-patch magic: replace horast.Unparser with our UnparserWithType.
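A minimal sketch of the patch and of the subsequent unparse call. Where exactly the assignment has to point depends on how horast's unparse resolves the Unparser class internally, so treat this as an outline rather than exact code:

import horast

# Monkey patch: make horast use our subclass instead of its own Unparser.
# Depending on horast internals, the assignment may need to target the module
# in which unparse() actually looks the class up.
horast.Unparser = UnparserWithType

new_source = horast.unparse(tree)  # tree already carries NewType annotation nodes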

How it all works now: we have a syntax tree, in it there are functions, the functions have arguments, the arguments have type annotations, and as in the fairy tale about Koschei the Deathless, the needle that everything hinges on is hidden in the type annotation. Previously there were no annotation nodes at all; we created them, and every such node is an instance of NewType. We call unparse on our tree; for each node it calls dispatch, which classifies the node and calls the matching method. When dispatch reaches an argument node, it writes the argument name, then checks whether there is an annotation (it used to be None, but we put a NewType there); if there is, it writes a colon and calls dispatch on the annotation, which calls our _NewType, which simply writes the string it stores, i.e. the type name. As a result we get the written-out argument: type pair.
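Putting it together, the annotation-replacement step might look roughly like this. annotate_args is a hypothetical helper; the annotations dict is the one parsed from the docstring, and ast.Str is used only because NewType, as defined above, expects a node with an .s attribute:

import ast

def annotate_args(func_node, annotations):
    """Attach NewType annotation nodes to an ast.FunctionDef (a sketch)."""
    for arg in func_node.args.args:
        type_string = annotations.get(arg.arg)
        if type_string is not None:
            arg.annotation = NewType(ast.Str(s=type_string))
    if "return" in annotations:
        func_node.returns = NewType(ast.Str(s=annotations["return"]))
    return func_node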

Strictly speaking, this is not entirely legal. From the interpreter's point of view we have annotated the arguments with names that are not defined anywhere, so when unparse finishes we get broken code: the imports are missing. I simply build an import line of the right form, write it at the beginning of the file and append the unparse result after it, although I could have added the imports as nodes to the syntax tree, since ast has Import and ImportFrom nodes.
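For illustration, something along these lines (prepend_imports is a made-up helper):

def prepend_imports(unparsed_code, import_lines):
    """Write the collected import lines first, then the unparse output."""
    import_block = "\n".join(sorted(set(import_lines)))
    return import_block + "\n\n" + unparsed_code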

Solving the triple-quote problem is no harder than adding the new type. We create a StrType class and a _StrType method. The method is no different from the _NewType method used for annotations, but the class definition changes: we store not only the string itself but also the indentation level at which it should be written. The indentation level is determined simply: one if the string occurs in a function, two if in a method; our project has no cases where a decorated function is defined inside the body of another function.

class StrType(object):
    def __init__(self, s, indent):
        self.s = s
        self.indent = indent

    def __repr__(self):
        return '"""\n' + self.s + '\n' + ' ' * 4 * self.indent + '"""\n'

In __repr__ we define what our string should look like when written out. This is far from the only possible solution, but it works. One could experiment with astunparse's fill and astunparse.Unparser's indent to make it more universal, but that idea only occurred to me while writing this article.

That is where the difficulties I actually solved end. After running my script, circular imports sometimes appear, but that is a question of architecture. I did not find a ready-made third-party solution, and handling such cases inside my script looked like a serious complication of the task. It is probably possible to detect and resolve circular imports with ast, but that idea deserves separate consideration. In any case, such incidents were rare enough in our project that I could safely leave them unhandled.

Another difficulty I ran into was the lack of handling for expressions of the form from A import B as C. The attentive reader already knows that a monkey patch is the cure for all diseases; let that be their homework. I decided to do it differently and simply added such imports to the mapping file, because this construction is usually used to avoid name conflicts and we have few of them.

Despite the imperfections found, the script does what it was intended to do. What is the result:

  1. Project startup time dropped from 10 to 3 seconds;
  2. The number of files went down, because the new_contract definitions were removed. The files themselves got shorter: I did not measure precisely, but on average git showed n added lines against 2n deleted;
  3. Smart IDEs started giving better hints, because the types are no longer comments but honest imports;
  4. Readability improved;
  5. Extra brackets appeared here and there.

Thanks!

Useful links:

  1. Ast: https://docs.python.org/3/library/ast.html
  2. Horast: https://pypi.org/project/horast/
  3. All types of ast nodes and what is stored in them: https://greentreesnakes.readthedocs.io/en/latest/nodes.html#expressions.
  4. Beautifully shows the syntax tree: https://python-ast-explorer.com/
  5. Isort: https://pypi.org/project/isort/
