Validate it immediately

We will pass a plain dictionary as input so as not to depend on any web framework; that is, we do not consider how this dictionary was obtained from the request, or how the JSON response will later be built from a dictionary.

Note that it is sometimes tempting to use the validation mechanism to check data against external sources as well. For example, one of the request fields is a key in a database table, so you may want to run an SQL query as part of validation. The intention is commendable, but dragging external dependencies into validation schemas will open a portal to hell, so it is better to do this at a later stage, in specially designated places.
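A rough sketch of that separation, using only the standard library: the first function checks the shape of the input and knows nothing about storage, and a second, later step performs the lookup. The names `validate_shape` and `ensure_user_exists`, the `user_id` field, and the in-memory `KNOWN_USER_IDS` set (standing in for a database table) are assumptions made for illustration.

```python
import uuid

# Stand-in for a database table; a real project would run an SQL query instead.
KNOWN_USER_IDS = {uuid.UUID('12345678-1234-4234-8234-123456789abc')}


def validate_shape(data: dict) -> dict:
    """Stage 1: pure validation with no external dependencies."""
    try:
        user_id = uuid.UUID(str(data['user_id']))
    except (KeyError, ValueError) as exc:
        raise ValueError('user_id must be a valid UUID') from exc
    return {'user_id': user_id}


def ensure_user_exists(validated: dict, known_ids: set[uuid.UUID]) -> dict:
    """Stage 2: the check against an external source, run after validation."""
    if validated['user_id'] not in known_ids:
        raise LookupError('user not found')
    return validated


validated = validate_shape({'user_id': '12345678-1234-4234-8234-123456789abc'})
user = ensure_user_exists(validated, KNOWN_USER_IDS)
```

With this split, the first stage can be tested without a database, and the second stage fails with its own error type, so the two kinds of problem are easy to tell apart.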

Pydantic

Quite a popular library lately, because it is part of the hyped FastAPI.

Schema code
from typing import Annotated, ClassVar, Literal

from pydantic import (
    UUID4,
    BaseModel,
    ConfigDict,
    EmailStr,
    Field,
    HttpUrl,
    NonNegativeInt,
    PastDatetime,
    PositiveInt,
    field_validator,
    model_validator,
)
from pydantic_extra_types.country import CountryAlpha3
from pydantic_extra_types.payment import PaymentCardNumber

from validators.common import Gender


class DocumentSchema(BaseModel):
    number: Annotated[int | str, Field(alias="full_number")]


class PydanticSchema(BaseModel):
    model_config: ClassVar = ConfigDict(extra="ignore")
    
    schema_version: Literal['3.14.15']
    id: UUID4
    created_at: PastDatetime
    name: Annotated[str, Field(min_length=2, max_length=32)]
    age: Annotated[int, Field(ge=0, le=100)]
    is_client: bool
    gender: Gender
    email: EmailStr
    social_profile_url: HttpUrl
    bank_cards: list[PaymentCardNumber] | None
    countries: Annotated[str, Field(min_length=1, max_length=64)]
    document: DocumentSchema
    page_number: NonNegativeInt
    page_size: PositiveInt

    @field_validator('age', mode="after")
    @classmethod
    def check_adults(cls, value: int) -> int:
        if value < 18:
            raise ValueError('only adults')
        return value

    @field_validator('countries')
    @classmethod
    def parse_countries(cls, value: str) -> list[CountryAlpha3]:
        return [CountryAlpha3(c) for c in value.split(',')]

    @model_validator(mode="before")
    @classmethod
    def general_check(cls, data: dict) -> dict:
        if data.get('is_client') and not data.get('bank_cards'):
            raise ValueError('cards are required for clients')
        return data

Here the standard type-annotation mechanism is used to the fullest, which makes the code quite concise and pleasant. Some of the additional types used to be part of Pydantic itself but have since been moved to a separate library (pydantic-extra-types).

Marshmallow

In my practice, it is usually used in conjunction with aiohttp or Flask.

Schema code
import typing
from datetime import datetime

from marshmallow import (
    EXCLUDE,
    Schema,
    ValidationError,
    fields,
    validate,
    validates,
    validates_schema,
)
from marshmallow.utils import missing as missing_
from marshmallow_union import Union as UnionField

from validators.common import Gender


class CommaList(fields.Field):
    def __init__(
        self,
        *,
        load_default: typing.Any = missing_,
        missing: typing.Any = missing_,
        dump_default: typing.Any = missing_,
        default: typing.Any = missing_,
        data_key: str | None = None,
        attribute: str | None = None,
        validate: (
            None
            | typing.Callable[[typing.Any], typing.Any]
            | typing.Iterable[typing.Callable[[typing.Any], typing.Any]]
        ) = None,
        required: bool = False,
        allow_none: bool | None = None,
        load_only: bool = False,
        dump_only: bool = False,
        error_messages: dict[str, str] | None = None,
        metadata: typing.Mapping[str, typing.Any] | None = None,
        **additional_metadata,
    ) -> None:
        super().__init__(
            load_default=load_default,
            missing=missing,
            dump_default=dump_default,
            default=default,
            data_key=data_key,
            attribute=attribute,
            validate=validate,
            required=required,
            allow_none=allow_none,
            load_only=load_only,
            dump_only=dump_only,
            error_messages=error_messages,
            metadata=metadata,
            **additional_metadata,
        )
        marshmallow_type = metadata.get('marshmallow_type') if metadata else None
        self.marshmallow_type = marshmallow_type or (lambda x: x)

    def _deserialize(self, value, attr, data, **kwargs) -> list:
        try:
            return [self.marshmallow_type(x) for x in value.split(',')]
        except (ValueError, AttributeError, TypeError) as exc:
            raise ValidationError('Incorrect list') from exc


class DocumentSchema(Schema):
    class Meta:
        unknown = EXCLUDE

    number = UnionField(
        data_key='full_number',
        fields=[fields.Integer(), fields.Str()],
        required=True,
    )


class MarshmallowSchema(Schema):
    class Meta:
        unknown = EXCLUDE

    schema_version = fields.Str(required=True, validate=validate.Equal('3.14.15'))
    id = fields.UUID(required=True)
    created_at = fields.DateTime(required=True)
    name = fields.Str(required=True, validate=validate.Length(min=2, max=32))
    age = fields.Int(required=True, validate=validate.Range(min=0, max=100))
    is_client = fields.Bool(required=True)
    gender = fields.Enum(Gender, by_value=True, required=True)
    email = fields.Email(required=True)
    social_profile_url = fields.URL(required=True)
    bank_cards = fields.List(
        fields.Str(validate=validate.Length(min=15)),
        required=True,
        validate=validate.Length(min=1),
    )
    countries = CommaList(required=True, metadata={'marshmallow_type': str})
    document = fields.Nested(DocumentSchema)
    page_number = fields.Int(required=True, validate=validate.Range(min=0, max=100))
    page_size = fields.Int(required=True, validate=validate.Range(min=1, max=100))

    @validates('created_at')
    def date_must_be_in_past(self, value: datetime) -> None:
        if value >= datetime.utcnow():
            raise ValidationError('date must be in the past')

    @validates('age')
    def check_adults(self, value: int) -> None:
        if value < 18:
            raise ValidationError('only adults')

    @validates_schema
    def general_check(self, data: dict, **kwargs) -> None:
        if data.get('is_client') and not data.get('bank_cards'):
            raise ValidationError('cards are required for clients')

This uses a different concept: fields are declared by assigning field objects rather than by type annotations, which makes the schema quite verbose. In addition, I had to implement CommaList myself.

Trafaret

A very rare beast with the most exotic syntax.

Schema code
import uuid
from datetime import datetime

import trafaret as t

from validators.common import Gender


class UUID(t.Trafaret):
    def check_and_return(self, value: uuid.UUID | bytes | str | None) -> uuid.UUID | None:
        if value is None:
            return None
        if isinstance(value, uuid.UUID):
            return value
        try:
            if isinstance(value, bytes) and len(value) == 16:
                return uuid.UUID(bytes=value)
            else:
                return uuid.UUID(value)
        except (ValueError, AttributeError, TypeError):
            self._failure('value is not a uuid')


class CommaList(t.Trafaret):
    def __init__(self, *args, trafaret_type: t.Trafaret, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.trafaret_type = trafaret_type

    def check_and_return(self, data: str) -> list:
        return [self.trafaret_type.check_and_return(x) for x in data.split(',')]


def past_date(frmt: str="%Y-%m-%dT%H:%M:%S") -> t.And:
    def check(value: str) -> datetime:
        converted_value = datetime.fromisoformat(value)
        if converted_value >= datetime.utcnow():
            raise t.DataError('date must be in the past')
        return converted_value

    return t.DateTime(format=frmt) >> check


def check_adults(value: int) -> int:
    if value < 18:
        raise t.DataError('only adults')
    return value


def check_schema(data: dict) -> dict:
    if data.get('is_client') and not data.get('bank_cards'):
        raise t.DataError('cards are required for clients')
    return data


document_schema = t.Dict(
    {
        t.Key('full_number') >> 'number': t.Int() | t.String(),
    },
)

trafaret_schema = (
    t.Dict(
        {
            'schema_version': t.Atom('3.14.15'),
            'id': UUID,
            'created_at': past_date(),
            'name': t.String(min_length=2, max_length=32),
            'age': t.Int(gte=0, lte=100) >> check_adults,
            'is_client': t.Bool(),
            'gender': t.Enum(*[i.value for i in Gender]),
            'email': t.Email,
            'social_profile_url': t.URL,
            'bank_cards': t.List(t.Int(gte=10**15), min_length=1),
            'countries': CommaList(trafaret_type=t.String()),
            'document': document_schema,
            'page_number': t.Int(gte=0, lte=100),
            'page_size': t.Int(gte=1, lte=100),
        },
        ignore_extra="*",
    )
    >> check_schema
)

We are back to types again, albeit non-standard ones. The resulting schema looks ascetic, but this library seems to have fewer out-of-the-box types, so some of them (here UUID and CommaList) have to be implemented by hand.

After the two previous libraries, Trafaret's syntax can be a little intimidating, but after using it for a while you realize there is something to it.

Django REST framework (DRF)

It is difficult to separate DRF from Django the way you can separate Pydantic from FastAPI. So you can't simply take a dictionary and validate it; you have to deal with Django being dragged along: for example, monkeypatch the settings (django.conf.settings). In this respect, it is the most problematic of the libraries reviewed.

Schema code
from datetime import datetime

from rest_framework import serializers as s

from validators.common import Gender


def check_schema_version(value: str) -> None:
    if value != '3.14.15':
        raise s.ValidationError('value must be equal "3.14.15"')


class CommaStrListField(s.Field):
    def to_representation(self, value: list[str]) -> list[str]:
        return value

    def to_internal_value(self, data: str) -> list[str]:
        try:
            return [str(x) for x in data.split(',')]
        except (ValueError, AttributeError, TypeError) as exc:
            raise s.ValidationError('Incorrect list') from exc


class IntOrStrField(s.Field):
    def to_representation(self, value: int | str) -> int | str:
        return value

    def to_internal_value(self, data: int | str) -> int | str:
        return data


class DocumentSchema(s.Serializer):
    full_number = IntOrStrField(source="number")


class DRFSchema(s.Serializer):
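    class Meta:
        # Note: `unknown` is a Marshmallow-style option that DRF does not read.
        # A plain Serializer already ignores input keys without a matching field.
        unknown = 'ignore'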

    schema_version = s.CharField(validators=[check_schema_version])
    id = s.UUIDField()
    created_at = s.DateTimeField()
    name = s.CharField(min_length=2, max_length=32)
    age = s.IntegerField(min_value=0, max_value=100)
    is_client = s.BooleanField()
    gender = s.ChoiceField(choices=[(e.value, e.name) for e in Gender])
    email = s.EmailField()
    social_profile_url = s.URLField()
    bank_cards = s.ListField(child=s.CharField(min_length=15))
    countries = CommaStrListField()
    document = DocumentSchema()
    page_number = s.IntegerField(min_value=0, max_value=100)
    page_size = s.IntegerField(min_value=1, max_value=100)

    def validate_created_at(self, value: datetime) -> datetime:
        if value >= datetime.utcnow():
            raise s.ValidationError('date must be in the past')
        return value

    def validate_age(self, value: int) -> int:
        if value < 18:
            raise s.ValidationError('only adults')
        return value

    def validate(self, data: dict) -> dict:
        if data.get('is_client') and not data.get('bank_cards'):
            raise s.ValidationError('cards are required for clients')
        return data

And once again we have moved away from types. We also could not avoid writing custom validators for things that could have been available out of the box.

DRF imposes a single naming convention for custom field validators (validate_<field_name>). I'm not sure whether this is clearly a plus or a minus.

It is also odd that for each kind of Union you have to write your own field type. Perhaps I'm doing something wrong here; otherwise it looks like a questionable design decision.

Performance testing

Every library has its own benchmark tucked away somewhere, showing that this particular library is the fastest, but I'll write my own. It will be very inaccurate, but good enough for a rough estimate.

Generation of test data
from enum import StrEnum, unique

import pytest
from faker import Faker


@pytest.fixture(scope="session")
def faker() -> Faker:
    return Faker('en_GB')


@unique
class Gender(StrEnum):
    MALE = 'male'
    FEMALE = 'female'
    HELICOPTER = 'helicopter'


@pytest.fixture
def data(faker: Faker) -> dict:
    return {
        'schema_version': '3.14.15',
        'id': faker.uuid4(cast_to=None),
        'created_at': faker.past_datetime().isoformat().split('.')[0],
        'name': faker.name(),
        'age': faker.pyint(min_value=18, max_value=100),
        'is_client': faker.pybool(),
        'gender': faker.enum(Gender).value,
        'email': faker.email(),
        'social_profile_url': faker.url(),
        # Always a non-empty list, so the sample passes all four schemas.
        'bank_cards': [
            faker.credit_card_number('visa16')
            for _ in range(faker.pyint(min_value=1, max_value=3))
        ],
        'countries': ','.join(
            [faker.currency_code() for _ in range(faker.pyint(min_value=1, max_value=5))]
        ),
        'document': {
            'full_number': faker.pyint() if faker.pybool() else faker.pystr(),
        },
        'page_number': faker.pyint(min_value=0, max_value=10),
        'page_size': faker.pyint(min_value=1, max_value=100),
    }

Then each schema is run repeatedly on this data. Example code for Pydantic:

import timeit


def test_pydantic(data: dict) -> None:
    count = 10**5
    execution_time = timeit.timeit(stmt=lambda: PydanticSchema(**data), number=count)
    print('pydantic', count, execution_time)

We get the following results:

  1. Pydantic (6.85 sec)

  2. Trafaret (7.23 sec)

  3. Marshmallow (26.43 sec)

  4. DRF (36.42 sec)

Unexpectedly, Trafaret beats Marshmallow by a wide margin (about 3.7 times faster) and is breathing down the leader's neck. The first and last places did not bring any surprises.

If we talk about the subjective ease of use of these libraries, then, in my opinion, the order will be the same.
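Since a single timeit run is noisy, one simple way to make such a comparison a little steadier is to repeat the measurement and keep the fastest run. The sketch below uses only the standard library; `validate` is a hypothetical stand-in for a schema call such as `PydanticSchema(**data)`, and the repeat and call counts are arbitrary.

```python
import timeit


def validate(data: dict) -> dict:
    """Hypothetical stand-in for a real schema call."""
    if not 0 <= data['age'] <= 100:
        raise ValueError('age out of range')
    return data


sample = {'age': 42}
calls = 10**4

# Five independent measurements of `calls` invocations each.
runs = timeit.repeat(stmt=lambda: validate(sample), repeat=5, number=calls)

# The minimum is the run least disturbed by other activity on the machine.
best = min(runs)
print(f'best of {len(runs)} runs: {best:.4f} sec per {calls} calls')
```

This does not turn the benchmark into a rigorous one, but it reduces the chance that a background process decides the ranking.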

Conclusion

Initially, the code for this article was written in December 2022. Something went wrong, so I returned to the idea a year later, in December 2023. Now it is spring 2024, and I hope procrastination will finally be defeated (and the article finished); on the bright side, thanks to it you can see how the libraries under review have evolved in just over a year.

Final library versions used
pydantic[email]==2.6.4
pydantic-extra-types==2.6.0
pycountry==23.12.11
marshmallow==3.21.1
marshmallow-union==0.1.15.post1
trafaret==2.1.1
djangorestframework==3.14.0

For example, during this time Marshmallow gained an Enum field (Union still has to be installed separately), and Pydantic released a new major version (and set new speed records along with it). Trafaret, meanwhile, released exactly one update during this period; they have probably reached zen.

Many useful features were not discussed in this article, but you can read about them in the documentation. For example:

  • Using a JSON library other than the standard one, for example a third-party one such as orjson, which promises faster serialization and deserialization

  • Immutability of validated data (frozen)

  • Inheritance/merging of schemas
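Two of the items above, immutability and inheritance, can be sketched briefly with Pydantic 2. `Pagination` and `UserQuery` are hypothetical models, and the default values are arbitrary.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class Pagination(BaseModel):
    # frozen=True makes instances immutable after validation.
    model_config = ConfigDict(frozen=True)

    page_number: int = 0
    page_size: int = 10


class UserQuery(Pagination):
    # Inherits both the pagination fields and the frozen config.
    name: str


query = UserQuery(name='Alice', page_size=50)

try:
    query.page_size = 100
except ValidationError as exc:
    frozen_error = exc.errors()[0]['type']
```

Inheritance lets common fields such as pagination live in one base schema instead of being repeated in every request model.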

In addition to the libraries discussed, there are many others (Cerberus, jsonschema, WTForms), but I believe they are already in legacy status and are unlikely to be chosen for new projects.

P.S. If you have anything to say about bugs or improvements, feel free to write 🙂
