Preparing SQLAlchemy correctly

And all of this just so the application can then walk through a collection of 1000 heavy objects, look at one field in each, and change (or not) the value of another field. And then we start saving. And here is the next circle of hell.

There are two ways to save model objects in the database:

  • pass the current state of the model object back, as is, to an UPDATE or INSERT statement. Picture it? It is everything we did when extracting data from the DBMS, only in reverse order and in the opposite direction.

  • implement tracking of every field in every instance of the model object and flag them when their values change. At update time this allows generating, for each instance, a minimal and sufficient UPDATE that touches only the fields (columns) that actually changed, and skipping the instances in which nothing changed at all (those that would produce an empty UPDATE). However, it also means that myobject.field = new_value is no longer a trivial operation: you pay for the tracking on every assignment to a model object, even a brand-new one that is not yet in the database.

By the way, SQLAlchemy ORM uses the second approach. Not without flaws, of course. At least until version 2.0, my favorite “bug” was the lack of change tracking for fields holding reference values, such as dictionaries. The ORM generally has no way to know whether the value changed if the reference (id(self.value)) did not. Comparing via == does not always work either. In short, “add a key to a dictionary field and make sure that field gets updated in the DBMS” was a lot of fun.
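To make the problem concrete, here is a minimal sketch of the stock workaround (the Settings model is invented for illustration; MutableDict is the real sqlalchemy.ext.mutable helper): wrapping the column type makes in-place dict mutations visible to the ORM.

from sqlalchemy import JSON
from sqlalchemy.ext.mutable import MutableDict
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Settings(Base):
    __tablename__ = "settings"

    id: Mapped[int] = mapped_column(primary_key=True)
    # plain JSON: obj.plain["key"] = 1 mutates the dict in place, the
    # reference (id()) is unchanged, so the ORM emits no UPDATE for it
    plain: Mapped[dict] = mapped_column(JSON)
    # MutableDict intercepts __setitem__ and friends and flags the
    # attribute as dirty, so the next flush includes this column
    tracked: Mapped[dict] = mapped_column(MutableDict.as_mutable(JSON))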

How to win?

Combine. No, this is not another article about “only raw SQL, only hardcore.” I will not call for completely abandoning import sqlalchemy.orm; however, by combining it with sqlalchemy.sql you can get the advantages of both and, where possible, shed the disadvantages of both.

Though I do, of course, risk awakening a quiet horror in the minds of “pure ORM” adherents.

PoC

First, the project structure, so it is clear what lives where:

>--lib
    >--db
    |   >--connection.py
    |   >--__init__.py
    >--model
    |   >--user.py
    |   >--__init__.py
    >--query
    |   >--compiler
    |   |   >--compile_hacks.py
    |   |   >--query_compiler.py
    |   |   >--__init__.py
    |   |   >--__main__.py
    |   >--user.py
    |   >--user_compiled.py
    |   >--__init__.py
    >--repository
    |   >--user.py
    |   >--__init__.py
    >--__init__.py

Show me the code!

Let's go.

# lib.db.connection
from sqlalchemy import create_engine

engine = create_engine("postgresql+asyncpg:///", future=True, paramstyle="named")

A trivial Engine instance. At the moment we only need it for two things: the dialect, which defines the syntax of the SQL we render, and the driver, which can safely escape literal values (more on both below).

Of course, in production something would live here that takes the URL not from code but from a more secure store of DBMS connection data: KeyVault, an environment variable, whatever. In that case the engine declaration can be moved directly into the query_compiler module, and the production logic can stay here.
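A minimal sketch of that production variant, assuming the URL arrives via an environment variable (DB_URL is a hypothetical name):

import os

from sqlalchemy import create_engine

engine = create_engine(
    # fall back to the PoC URL when the variable is not set
    os.environ.get("DB_URL", "postgresql+asyncpg:///"),
    future=True,
    paramstyle="named",
)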

# lib.model
from sqlalchemy.orm import DeclarativeBase

class Base(DeclarativeBase):
    pass

Straight from the tutorial in the SQLAlchemy documentation; nothing special.

# lib.model.user
from typing import Optional

from sqlalchemy.orm import Mapped, mapped_column

from lib.model import Base


class User(Base):
    __tablename__ = "users"

    id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True)
    name: Mapped[str] = mapped_column()
    enabled: Mapped[bool] = mapped_column()
    address: Mapped[Optional[str]] = mapped_column()

In general, nothing special either. I was a little surprised at how far zzzeek has come in making the ORM declarative: previously I declared both the types and the columns themselves manually via Column(). In any case, this is a PoC, a table divorced from reality (though it contradicts neither the ORM rules nor the rules of SQL and its PostgreSQL dialect). It is here just for reference, so you know what we will be working with next.

# lib.query.compiler.__main__
from lib.query.compiler.query_compiler import query_compiler

if __name__ == '__main__':
    query_compiler()

Just a simple wrapper that lets you invoke the query compiler as python -m lib.query.compiler.

# lib.query.compiler.compile_hacks
from sqlalchemy import BindParameter
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.base import _NoArg


@compiles(BindParameter)
def _render_literal_bindparam(element, compiler, **kw):
    params = dict(**kw)

    # a parameter that already carries a value is a constant baked in at
    # build time: render it as a literal instead of a placeholder
    if element.value is not _NoArg.NO_ARG and element.value is not None:
        params["literal_binds"] = True

    # fall through to the stock compiler visitor with the adjusted flags
    return compiler.visit_bindparam(element, **params)

This is where things start to get interesting. The authors of SQLAlchemy did not intend this particular use of the library (precompiling queries into text form); however, they did not explicitly prohibit it either. The meaning of this “hack” is as follows.

When a user builds a query with the query builder (sa.select, sa.delete, sa.update, sa.insert), in some cases certain constants must be passed into the query: for example, as will be shown later, a constant that should replace an output column's value when certain conditions hold. All such constants are automatically wrapped in BindParameter objects (sa.bindparam). However, as will also be shown later, our queries themselves are built as parameterized ones, so we would like to render the constants as-is and keep the genuine parameters as parameters, so that at runtime we pass them, and only them.

This is a standard mechanism for declaratively intercepting SQLAlchemy's internals. We declare a function responsible for compiling BindParameter objects into SQL query text. If the object has no value set in advance, we leave it as is (the constants and literals in our query, shown below, do carry values). If a value is set, we adjust the arguments of the stock compiler visitor so that literals are rendered as literals rather than as query parameters; we have no desire to drag around at runtime constants that are already baked into the query text.

There is a small scope for bugs here. First, BindParameter's default value (NO_ARG) is replaced with None during creation, so checking for it here is redundant, but I left it in. Second, it may happen that you want to pass None as a literal (though why would you? For NULL, sa.sql has the null() function, which generates a real SQL NULL in the query text, and columns have the is_(null()) and isnot(null()) methods). In short, I warned you, and I showed where to edit this if need be. A dedicated Enum marker whose value would serve as the indicator suggests itself here.
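To make the effect tangible, a minimal sketch (the query is invented; it reuses the User model declared above and compiles against the PostgreSQL dialect directly, so no live engine is needed):

from sqlalchemy import bindparam, select
from sqlalchemy.dialects import postgresql

# noinspection PyUnresolvedReferences
import lib.query.compiler.compile_hacks  # activates the @compiles hook
from lib.model.user import User

query = select(User.name).where(
    User.enabled == True,  # noqa: E712 - a constant, rendered as a literal
    User.id > bindparam("min_id"),  # a real parameter, stays a placeholder
)
print(query.compile(dialect=postgresql.dialect()))
# roughly: SELECT users.name FROM users WHERE users.enabled = true AND users.id > :min_id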

# lib.query.compiler.query_compiler
import logging
import os
from importlib import import_module
from typing import Callable, Dict, Union

from sqlalchemy import Selectable
from sqlalchemy.sql.dml import DMLWhereBase

from lib.db.connection import engine
# noinspection PyUnresolvedReferences
import lib.query.compiler.compile_hacks

def query_compiler():
    dir_path = os.path.sep.join(
        (
            os.path.dirname(__file__),
            "..",
        )
    )
    for q_module in os.listdir(dir_path):
        compiled_queries = dict()
        full_path = os.path.sep.join((dir_path, q_module))
        if (
            not q_module.endswith(".py")
            or q_module.endswith("_compiled.py")
            or q_module.startswith("__")
            or not os.path.isfile(full_path)
        ):
            continue

        mod = import_module(f"..{q_module[:-3]}", __package__)
        fn: Callable[[], Dict[str, Union[Selectable, DMLWhereBase]]]
        try:
            fn = getattr(mod, "generate_queries")
            if not callable(fn):
                continue
        except AttributeError:
            continue

        try:
            for q_name, q_val in fn().items():
                if not q_name.endswith("_query"):
                    continue

                compiled = q_val.compile(engine)
                compiled_queries[
                    q_name.upper()
                ] = str(compiled)
        except Exception as e:
            logging.exception(e)
            continue

        result_module = os.path.sep.join(
            (
                dir_path,
                q_module.replace(".py", "_compiled.py")
            )
        )
        with open(result_module, "w", encoding="utf-8") as fd:
            fd.write("from sqlalchemy import text\n\n")
            fd.writelines(f"{k} = text('''\n{v}\n''')" for k, v in compiled_queries.items())

Our main gem. Meet the query compiler, which renders queries into text form.

What it does:

  • runs through the directory one level up and finds all Python modules that are not directories, do not start with a dunder, and do not carry the _compiled suffix

  • imports each of them separately and looks for a generate_queries() function in the imported module

  • if it finds one, it calls it

  • the result is a dictionary whose keys are strings and whose values are prepared query objects

  • we don't want to limit ourselves, so the following protocol is provided: only query objects whose dictionary keys end in _query are subject to compilation; all other “garbage” (for example, intermediate objects such as CTEs and subqueries) is discarded, provided, of course, that its keys in the returned dictionary do not match the pattern

  • the compiled = q_val.compile(engine) call is where the compilation actually takes place

    • remember the BindParameter compiler interceptor I defined above? It is inside this call that it gets invoked for the corresponding objects

    • a pre-built engine is supplied so that the compiler can learn about the driver and dialect from it: the dialect defines the overall syntax of the future query, and constants and literals can be handed to the driver to safely escape their values (asyncpg at least is invoked this way; psycopg2 also had the corresponding interfaces)

  • then all compiled queries are saved into a module that differs from the original only by the _compiled suffix

    • along the way, every textual query is wrapped in sa.text() so that it is ready to be passed straight to connection.execute(); in SQLAlchemy 2.0 this interface was changed, and a bare string is no longer accepted by that method

    • moreover, later, when these queries are used at runtime, the model no longer participates and is not even imported, which saves a few milliseconds at startup (a small thing, but a pleasant one)

Such an expected interface also imposes certain requirements on the implementation of the modules that actually generate the queries.

# lib.query.user
from sqlalchemy import select, case, null, func, bindparam, update

from lib.model.user import User


def generate_queries():
    users_query = (
        select(
            User.id.label("id"),
            User.name.label("name"),
            case(
                (User.address.isnot(null()), User.address),
                else_="N/A"
            ).label("address"),
            (
                func
                .row_number()
                .over(
                    order_by=func.lower(User.name),
                    range_=(0, 26),
                    partition_by=func.lower(func.left(User.name, 1))
                )
                .label("cohort")
            )
        )
        .where(
            ((User.id % 2) == bindparam("even_odd", type_=User.id.type))
            |
            (
                User.name.contains(bindparam("name_includes"))
                &
                User.enabled.is_(bindparam("enabled_filter"))
            ).self_group()
        )
    )

    some_intermediate_thing = select(User.id).subquery("aliased")

    return locals()

So:

  • because generate_queries() has to be called manually to produce the queries, we can afford to postpone that generation indefinitely, giving the query compiler, say, a chance to inject additional values into the module context first, and to use those constants inside generate_queries(); the patch for the compiler would be trivial (declare a protocol function setup_module(**kw) that enriches the module's global scope with values, have the compiler look it up and call it, then call generate_queries()); a sketch follows after this list

  • in the simplest case it is enough to return locals() from the function and let the compiler pick out what it needs; this is not mandatory, though: you can build the objects in the local scope and then assemble a dictionary with the required keys and their query values by hand; the main thing is to follow the protocol: whatever is to be compiled must end with _query in its key name

  • sa.sql. Yes, as you can see, zzzeek's query builder turned out excellent. A query is written roughly the way pure SQL would be, adapted to Python syntax, of course. Among other things, there are the standard func wrappers, which let you turn an arbitrary function into a window function (by calling .over(), the user automatically converts the function into a window function, which affects how it is rendered in the query; this will also be useful for those who do not shy away from declaring their own functions in the database schema)

  • yes, of course I know about coalesce(); the case() is there intentionally, to demonstrate a more complex construct, compare it with how it would look in SQL, and judge how close the syntax is to pure SQL

  • the sugar in WHERE deserves a separate mention. You can, of course, write the expressions in the classical style, using and_() and or_(); I demonstrated writing them with the binary operators | and & instead (see the comparison sketch after this list). Unfortunately, I cannot think of any way to explicitly wrap a group of expressions in parentheses in SQL other than calling .self_group(); the alternative would be crawling through the AST, which would explode SQLAlchemy's complexity. Either way, the difference from pure SQL is not all that terrible

  • we are in Python, so we can afford to use the available tools to make working with different data types easier; for example, .contains() is far more convenient and pleasant to write than its SQL analogue '%' || :param || '%'; Alchemy intercepts this call on the mapped_column and turns it into valid SQL

  • users_query will be compiled, but some_intermediate_thing will not (its key does not match the compiler's pattern)
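A sketch of the setup_module protocol patch mentioned in the first bullet (the function name follows the protocol described there; the kwargs are hypothetical):

# in query_compiler(), right after import_module(...):
setup = getattr(mod, "setup_module", None)
if callable(setup):
    setup(tenant_schema="public")  # hypothetical compile-time constant

# and in a query module, next to generate_queries():
def setup_module(**kw):
    globals().update(kw)  # enrich the module's global scope with the values

And, for the bullet about WHERE sugar, the same condition written with the classical helpers, equivalent up to the explicit parentheses that .self_group() forces:

from sqlalchemy import and_, bindparam, or_

from lib.model.user import User

where_clause = or_(
    (User.id % 2) == bindparam("even_odd", type_=User.id.type),
    and_(
        User.name.contains(bindparam("name_includes")),
        User.enabled.is_(bindparam("enabled_filter")),
    ),
)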

Result

# lib.query.user_compiled
from sqlalchemy import text

USERS_QUERY = text('''
SELECT users.id AS id, users.name AS name, CASE WHEN (users.address IS NOT NULL) THEN users.address ELSE 'N/A' END AS address, row_number() OVER (PARTITION BY lower(left(users.name, 1)) ORDER BY lower(users.name) RANGE BETWEEN CURRENT ROW AND 26 FOLLOWING) AS cohort 
FROM users 
WHERE users.id % 2 = :even_odd::INTEGER OR ((users.name LIKE '%' || :name_includes::VARCHAR || '%') AND users.enabled IS :enabled_filter::BOOLEAN)
''')

Here it is, syntax-highlighted and lightly formatted for readability:

SELECT
  users.id AS id
  , users.name AS name
  , CASE
    WHEN (users.address IS NOT NULL)
      THEN users.address
    ELSE 'N/A'
    END AS address
  , row_number()
    OVER (
      PARTITION BY lower(left(users.name, 1))
      ORDER BY lower(users.name)
      RANGE BETWEEN CURRENT ROW AND 26 FOLLOWING
    ) AS cohort 
FROM users 
WHERE
  users.id % 2 = :even_odd::INTEGER
  OR (
    (users.name LIKE '%' || :name_includes::VARCHAR || '%')
    AND
    users.enabled IS :enabled_filter::BOOLEAN
  )

  • firstly, this query pastes perfectly, as is, into DataGrip and, I suspect, into many other SQL tools; to run it, it is enough to plug the parameter values straight into DataGrip (it will prompt for them itself)

  • secondly, do not look for a meaning lurking beyond the horizon of this query; there is none, it is not an excerpt from the production implementation, just a demonstration. I will only say that in production we threw around very extensive queries that took JSON as input, expanded it into a table, did an UPSERT on tables based on that data, marked successful and unsuccessful insertions in the table (yes, we had to duplicate CHECK CONSTRAINTS in the query conditions), recording the reason for each failure, and returned JSON that the application then checked, handing the frontend row markers to show the user exactly what went wrong in a specific row

  • thirdly, usage in application code is utterly trivial:

from lib.db.connection import engine
from lib.query.user_compiled import USERS_QUERY


with engine.connect() as connection, connection.begin() as _:
    connection.execute(
        USERS_QUERY,
        dict(
            even_odd=0,
            name_includes="baba",
            enabled_filter=True
        )
    )
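Continuing the snippet above, should you need the rows themselves, the Result can be consumed as dict-like mappings (a sketch; Result.mappings() is the stock SQLAlchemy 1.4+ API):

with engine.connect() as connection:
    result = connection.execute(
        USERS_QUERY,
        dict(even_odd=0, name_includes="baba", enabled_filter=True)
    )
    for row in result.mappings():
        # each row behaves like a dict keyed by the labels from the query
        print(row["id"], row["name"], row["address"], row["cohort"])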

What's the gain?

First, notice that in the query constructor I referenced the model's fields. This guarantees that through a refactoring, or even a simple field rename, my query stays compilable. For more serious changes, say a change of a field's type, I can walk from the field in the model to its usage sites (that is, the queries in the generators) and figure out how to rewrite the queries so they keep working with the new version of the model.

Secondly, as already noted above, the ORM has not gone anywhere; it is simply demoted to the role of a plain class -> table (view) mapper. This means that in supported cases we can still count on alembic's help in generating migrations.

Thirdly, we are, of course, no longer limited in the means of expressing what we want a query to do, as so often happens when using an ORM “head-on.” Queries of any complexity and nesting, with or without CTEs, UNION, UPDATE/DELETE … RETURNING, array_agg: all these delights are right at your feet, just pick them up and use them. Specifically for PostgreSQL, its dialect also ships its own extended version of Insert that lets you describe ON CONFLICT … DO …; MySQL and Oracle, of course, have their own extensions of the standard classes as well.
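A hedged sketch of that dialect-specific Insert, reusing the PoC's User model (the conflict target and update rule are invented):

from sqlalchemy import bindparam
from sqlalchemy.dialects.postgresql import insert

from lib.model.user import User

insert_stmt = insert(User).values(
    id=bindparam("id"),
    name=bindparam("name"),
    enabled=True,
)
# on conflict by primary key, keep the row but refresh its name
upsert_user_query = insert_stmt.on_conflict_do_update(
    index_elements=[User.id],
    set_=dict(name=insert_stmt.excluded.name),
)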

Fourthly, since we are now free to decide for ourselves what to do with the data, how, and in what order, we will naturally stop computing row_number() on the application side, at least where the DBMS can do it for us (it will do it faster and more efficiently). Then we will stop pulling from the database the fields that a particular case does not need. And then, warming to the task, we will stop doing a SELECT at all just to update data, especially in bulk, in any table. After all, we know the conditions: we simply describe them in the query, however complex it gets and however many other tables we have to join to filter the target rows, and immediately do an UPDATE or INSERT … ON CONFLICT DO UPDATE; a sketch follows below.
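A sketch of that “update without selecting” idea against the PoC's User model (the business rule is invented; the shape is what matters: no rows ever travel to the application):

from sqlalchemy import bindparam, func, update

from lib.model.user import User

# flip the enabled flag for every enabled user whose name is blank
disable_blank_users_query = (
    update(User)
    .where(
        User.enabled.is_(True),
        func.length(func.trim(User.name)) == 0,
    )
    .values(enabled=bindparam("new_state", type_=User.enabled.type))
)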

Fifth, the results of query compilation can serve as a detector that something has gone wrong in a new version of Alchemy. After all, if a compiled query differs from its previous version while only the Alchemy version was updated, you urgently need to look at what changed, how it affects the application, and, if necessary, how to fix it. This can be wired up, for example, as a two-phase deployment: first the queries are compiled during the deployment, then the hashes of the files with the compiled queries are checked against the ones stored in the deployment infrastructure (in the same state file), and the rollout of the build is stopped if the hashes differ from the previous values, notifying those responsible for the release.
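A minimal sketch of that hash check (the paths and the state-file handling are hypothetical):

import hashlib
import pathlib
import sys


def compiled_hashes(root="lib/query"):
    # hash every *_compiled.py module produced by the query compiler
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(pathlib.Path(root).glob("*_compiled.py"))
    }


if __name__ == "__main__":
    previous = {}  # in a real pipeline, loaded from the deployment state file
    current = compiled_hashes()
    changed = sorted(k for k, v in current.items() if previous.get(k, v) != v)
    if changed:
        print(f"compiled queries changed: {changed}", file=sys.stderr)
        sys.exit(1)  # stop the rollout and notify the release owners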

Yes, of course, many will have to break their habits before they can appreciate an approach that uses raw SQL instead of the “convenient one-liner” ORM QueryObject. And some never will, because one can always find reasons not to.

What did I forget?

Here is my telegram: it doesn’t exist

Okay, what about YouTube (like, subscribe, bell)? There is none.

What, then, can I advertise, in the best traditions? Well, this article, for example.
