How an intern optimized queries and found a bug in Django

Hello! My name is Vanya, I’m a back-end developer-intern at KTS.

I recently found a bug in Django, filed a fix ticket and got accepted.

In the article I will tell you in more detail – what I worked on, what was the mistake and why it is difficult to meet it. And also about one more bug, which, according to the classics, turned out to be a feature 😊.

What will be in the article:

A little about myself: how I changed my occupation and got an internship at KTS.

I have a higher medical education and 4 years of work as a surgeon.

From school, I was interested in mathematics and computer technology, but for a long time this interest remained only in the form of a hobby – for example, I made graphic patches for phones. In computer science classes, only Word and PowerPoint were taught, and I didn’t have a strong mathematical base or acquaintances related to IT. So I did not perceive IT as my profession.

In my free time, I read development articles. I especially liked the topic of neural networks and the parallels with the work of the human brain. In the 4th year of the medical academy, I realized that I was ripe for learning a programming language for general development. For this purpose, I have long looked after Python, and thanks to Mark Lutz’s book “Learning Python”, my immersion into the world of programming began. No words can describe the thrill that I felt from the first working lines of code! So from year to year, slowly but surely, my knowledge and skills in Python increased. Played around with Unity a little more.

I love medicine, but all the time I have been working, the desire to do what I like even more – programming has not left me. Due to the large number of negative aspects in the work of a surgeon, I finally decided to change my occupation and engaged in self-study. I started with school programming courses, then studied algorithms and solved problems to prepare for school olympiads, read about web development.

In January 2022, by a lucky chance, I found out about the recruitment of beginner back-end developers to the free school and literally on the last day of recruitment I managed to apply. For a month of training, I got acquainted with aiohttp, discovered the power of databases, SQL and ORM. It has become much easier to write asynchronous code. It was very difficult and painful at times, but it was worth it. While writing a course project, I studied working with git.

After completing the course, I was invited for an interview, as a result of which I managed to get an internship.

Taking this opportunity, I want to thank my colleagues from KTS for help and guidance in the work, especially my mentor Sasha 👏.

Background and a little about how Django works

I started getting acquainted with Django literally from the first days of my internship. Previously had experience with aiohttp + SQLAlchemy/GINO.

In my opinion, one of the coolest features Django is Django ORM, which converts a database query from Python code to SQL. This greatly simplifies the development process, but there is one “but”. When retrieving data through relationships, these queries are not optimized and the so-called N+1 query problem occurs. The more queries in the database on the backend, the slower the application works. The solution needs optimization to get the right data in fewer queries.

The main query optimization techniques in Django are:

To create an API, there are 2 main options: REST API and GraphQL. In the working project we use GraphQL.

The REST API has a fairly strict architecture, so database queries can be manually optimized (hardcoded). GraphQL takes a more flexible approach: the client determines which fields to get, so it’s quite difficult to optimize anything manually. You can immediately do select_related and prefetch_related for all fields, but then it’s difficult to provide for all query options, there will be extra “joins” and an excess amount of data will be retrieved from the database. That is why we decided to do query optimization.

Query Optimization

The task of automatic query optimization seemed very interesting to me, so I took on its solution. The idea was to get the requested fields from info (contains all the information about the request), recursively go through the foreign keys of the models and, based on the type of relationship, apply the desired optimization method. All the necessary information about the model is in the attribute _meta. Since model information is static, it can be cached using functools.

AT info all field names are in CamelCase format, so we need to convert them to the format snake_case. It is convenient to do this with regular expressions. Eventually waitingAsyncDjango converted to waiting_async_django:

camel_case_pattern = re.compile(r"(?<!^)(?=[A-Z])")

def string_to_snake_case(value: str) -> str:
	return camel_case_pattern.sub("_", value).lower()

Recursively find out exactly which fields are requested in GraphQL. This would require a function to convert the abstract syntax tree to a dictionary ast_to_dictthe data is taken from info:

from graphql.utils.ast_to_dict import ast_to_dict

def collect_gql_fields(
	node: dict,
	fragments: dict,
	snake_case: bool = True,
) -> dict[str, dict]:
	field = {}
	if node.get("selection_set"):
		for leaf in node["selection_set"]["selections"]:
			if leaf["kind"] == "Field":
				key = leaf["name"]["value"]
				if snake_case:
					key = string_to_snake_case(key)
					field.update({key: collect_gql_fields(leaf, fragments)})
			elif leaf["kind"] == "FragmentSpread":
				field.update(
					collect_gql_fields(fragments[leaf["name"]["value"]], fragments)
				)
	return field

def get_gql_fields(
	info: ResolveInfo,
	snake_case: bool = True,
) -> dict[str, dict]:
	fragments = {}
	node = ast_to_dict(info.field_asts[0])
	for name, value in info.fragments.items():
		fragments[name] = ast_to_dict(value)
	return collect_gql_fields(node, fragments, snake_case=snake_case)

To work with only find the names of all fields in the model:

@lru_cache
def _get_model_all_field_names(model: Type[models.Model]) -> list[str]:
	return [field.name for field in model._meta.fields]

For select_related find the names of the fields with the type of relationship many-to-one and one to oneincluding the reversible types of these relations:

@lru_cache
def _get_model_fields_to_select(
	model: Type[models.Model],
) -> dict[str, Type[models.Model]]:
	fields = {
		field.name: field.related_model
		for field in model._meta.fields
		if field.many_to_one or field.one_to_one
	}
	related_fields = {
		field.name: field.related_model
		for field in model._meta.fields_map.values()
		if field.many_to_one or field.one_to_one
	}
	return fields | related_fields

For prefetch_related and Prefetch find the names of the fields with the type of relationship one-to-many and many-to-manyincluding the reversible types of these relations:

@lru_cache
def _get_model_fields_to_prefetch(
	model: Type[models.Model],
) -> dict[str, Type[models.Model]]:
	one_to_many = {
		field.name: field.related_model
		for field in model._meta.fields
		if field.one_to_many or field.many_to_many
	}
	many_to_many = {
		field.name: field.related_model 
		for field in model._meta.many_to_many
	}
	related_fields = {
		field.name: field.related_model
		for field in model._meta.fields_map.values()
		if field.one_to_many or field.many_to_many
	}
	return one_to_many | many_to_many | related_fields

For the convenience of working with data, we define dataclass:

@dataclass
class ModelFields:
	select_fields: list[str]
	prefetch_fields: list[str | Prefetch]
	only_fields: list[str]

Having learned the requested fields and relationship types, you can start recursively filling the list of fields for each optimization method:

Recursively populating a list of fields for each optimization method
def collect_model_fields(
	model: Type[models.Model],
	gql_fields: dict,
	prefix: str = "",
) -> ModelFields:
	all_field_names = _get_model_all_field_names(model)
	to_select = _get_model_fields_to_select(model)
	to_prefetch = _get_model_fields_to_prefetch(model)

	select_fields, prefetch_fields, only_fields = [], [], []

	# проходимся по всем запрашиваемым полям на текущем уровне
	for field_name in gql_fields.keys():
		# чем глубже уровень вложенности запроса, тем длиннее префикс
		location = prefix + field_name
		next_prefix = location + "__"
    
		if field_name in to_select:
			# текущее поле - внешний ключ с типом отношения many-to-one или one-to-one
			
			# рекурсивно достаем запрашиваемые поля из другой модели
			next_fields = collect_model_fields(
				to_select[field_name],
				gql_fields[field_name],
				prefix=next_prefix,
			)
      
			# и добавляем их к уже известным полям
			if next_fields.only_fields:
				# для текущей локации необходимо делать select_related
				# только в том случае, если из нее берутся поля для .only(),
				# так как если берутся поля ТОЛЬКО для prefetch,
				# будет выброшена ошибка.
				select_fields.append(location)
				only_fields.extend(next_fields.only_fields)
        
			select_fields.extend(next_fields.select_fields)
			prefetch_fields.extend(next_fields.prefetch_fields)
      
		elif field_name in to_prefetch:
			# текущее поле - внешний ключ с типом отношения one-to-many или many-to-many
			# рекурсивно достаем запрашиваемые поля из другой модели
			next_fields = collect_model_fields(
				to_prefetch[field_name],
				gql_fields[field_name],
				prefix=next_prefix,
			)
      
			# и добавляем их к уже известным полям
			prefetch_fields.extend(next_fields.prefetch_fields)
			if next_fields.select_fields:
				# если у другой модели запрашиваются поля по внешнему ключу
				# с типом отношения many-to-one или one-to-one,
				# то для бОльшей оптимизации можно использовать класс Prefetch
				related_select_fields = [
					i.removeprefix(next_prefix)
					for i in next_fields.select_fields
				]
				queryset = to_prefetch[field_name].objects.select_related(
					*related_select_fields
				)
				prefetch_fields.append(Prefetch(location, queryset=queryset))
			else:
				prefetch_fields.append(location)
      
		elif field_name in all_field_names:
			# обычное поле модели
			only_fields.append(location)
      
		else:
			# property и прочее.
			pass

	return ModelFields(
		select_fields=select_fields,
		prefetch_fields=prefetch_fields,
		only_fields=only_fields,
	)


def get_optimization_fields(
	model: Type[models.Model],
	info: ResolveInfo,
) -> ModelFields:
	# узнаем запрашиваемые поля
	gql_fields = get_gql_fields(info, snake_case=True)

	# находим поля для каждого способа оптимизации
	return collect_model_fields(model, gql_fields)

For convenience, we have defined a model manager and an abstract model class:

class OptimizedModelManager(models.Manager):
	def db_prepared(self, info: ResolveInfo) -> models.QuerySet:
		fields = get_optimization_fields(self.model, info)
		return (
			self.get_queryset()
			.select_related(*fields.select_fields)
			.prefetch_related(*fields.prefetch_fields)
			.only(*fields.only_fields)
		)


class OptimizedModel(models.Model):
	class Meta:
		abstract = True

	objects = OptimizedModelManager()

Now you can easily optimize almost any GraphQL query:

# наследуемся от абстрактного класса с реализацией оптимизации запросов
class Book(OptimizedModel):
	"""Обычная модель книги."""

class BookGQLType(DjangoObjectType):
	class Meta:
		model = Book

class Queries(ObjectType):
	books = List(NonNull(BookGQLType), required=True)

	def resolve_books(self, info: ResolveInfo) -> list[Book]:
		# практически любой запрос будет хорошо оптимизирован
		return Book.objects.db_prepared(info)

In the future, this code was overgrown with new functionality, for example, optimization of property models.

django error

Method only allows you to specify a list of fields that you want to get from the database. By getting only the fields you need, you can slightly increase the performance of your application. AT only you can specify the fields of other models that will be joined.

Important that Django automatically fetches the pk (primary key) of the models, even if it’s not specified in only.

Based on the base model, Django allows you to create any number of the same models. They will differ in behavior, but only one base model will exist in the database. This is achieved through proxy models.

And when optimizing requests to the proxy model, the following error sometimes occurred:

File "django\db\models\query.py", line 302, in len
	self._fetch_all()
File "django\db\models\query.py", line 1507, in _fetch_all
	self._result_cache = list(self._iterable_class(self))
File "django\db\models\query.py", line 71, in iter
	related_populators = get_related_populators(klass_info, select, db)
File "django\db\models\query.py", line 2268, in get_related_populators
	rel_cls = RelatedPopulator(rel_klass_info, select, db)
File "django\db\models\query.py", line 2243, in init
	self.pk_idx = self.init_list.index(self.model_cls._meta.pk.attname)
ValueError: 'id' is not in list

The study showed that the error occurred when trying to get the proxy model in select_related using only. The error only occurred with only.

Causes of the bug

Apparently, for some reason, the proxy model did not get pk, although this should happen automatically.

I forcibly added the pk (in our case it is the id) of the proxy model to onlyand the error is gone. It became obvious that the problem was somewhere in Django.

Then I started looking for where Django automatically adds pk to the list of fetched fields. The search led to django/db/models/sql/query.pyClass Queryfunction deffered_to_data.

During the debug, I found that pk is taken from the proxy model, and the requested fields are from the base one.

This code looks suspicious:

opts = cur_model._meta
if not is_reverse_o2o(source):
	must_include[old_model].add(source)
add_to_dict(must_include, cur_model, opts.pk)

The error disappeared when they began to get pk from the base model:

opts = cur_model._meta.concrete_model._meta
if not is_reverse_o2o(source):
	must_include[old_model].add(source)
add_to_dict(must_include, cur_model, opts.pk)

Ticket and Pull Request

After making sure that there were no tickets with a similar bug in the Django Issues section, I created new. The developers accepted it and offered to prepare a patch. It was necessary to create a pull request with a bug fix and a test. Fortunately, Django has great instruction.

So I made my first pull request to an open source project.

In our implementation, this bug was also fixed: we made a forced addition of pk to only when getting the fields of the proxy model.

Our optimization

Interestingly, we would hardly have found this error without custom query optimization. We were looking for ready-made implementations and libraries, and even found one – https://github.com/tfoxy/graphene-django-optimizer. But it was not flexible enough to be used in some scenarios and sometimes made even more queries to the database than our implementation.

We did not find anything else similar, but if you know, share in the comments 😉

A bug that turned out to be a feature

While working, I found another bug that significantly increases the number of queries in the database. It occurs if you do Prefetch through a reverse relationship.

class Author(models.Model):
	name = models.CharField(max_length=32)

class Book(models.Model):
	title = models.CharField(max_length=32)
	author = models.ForeignKey("Author", on_delete=models.CASCADE)

	# для модели Publisher это реверсивный внешний ключ
	publisher = models.ForeignKey(
		"Publisher",
		on_delete=models.CASCADE,
		related_name="books",
	)

class Publisher(models.Model):
	address = models.CharField(max_length=32)

With this GraphQL query…

{
	publishers {
		address
		books {
			title
			author {
				name
			}
		}
	}
}

… optimization should be as follows:

def resolve_publishers(self, info: ResolveInfo) -> list[Publisher]:
	queryset = Book.objects.select_related("author").only("title", "author__name")
	return Publisher.objects.prefetch_related(Prefetch("books", queryset)).only("address")

But in this case, there is a huge number of queries in the database.

The reason for the problem is this line:

Book.objects.select_related("author").only("title", "author__name")

It is worth adding a reverse pk (publisher_id) to only, and queries to the database become 10 times less (depending on the number of records in the database):

Book.objects.select_related("author").only("title", "author__name", "publisher_id")

Having found the cause of the bug and a way to fix it, I created another one ticket. But…

The developers replied that the function prefetch_related_objects should not implicitly add any columns as this can be unexpected and misleading. They added that those who use only and deferare solely responsible for the fields passed in them.

The ticket was not subject to a fix and was marked as New feature.

In our implementation, we simply removed the use only in Prefetch.

Conclusion

PS – Django 4.1

Release expected in August 2022 Django 4.1, which will provide an asynchronous interface for Django ORM. Under the hood, requests are still synchronous, but developers continue to work on implementing full asynchrony. I think when this happens, the choice will be obvious to me.

💻🎓 August 8 — the start of our free school for beginners front-end developers and backenders.

If you are starting to learn development and you are interested in such things, come!

Similar Posts

Leave a Reply