Movie recommender or how to write your own DIY solution for finding new movies over the weekend

I'll run the model inside the torch.no_grad() context, which disables gradient tracking. Gradients are needed during training to backpropagate the error and adjust the weights, but here I'm only running the model, so I can skip them.

$ pip install transformers torch

import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

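def get_embed_text(text):
    # Tokenize the text, truncating it to the model's 512-token limit.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )
    # Inference only, so gradient tracking is switched off.
    with torch.no_grad():
        out = model(**inputs)
    # Average the token vectors into one 768-dimensional text vector.
    # .cpu() is a no-op on the CPU and is needed if the tensors live on a GPU.
    return out.last_hidden_state.mean(axis=1).squeeze().detach().cpu().numpy()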

df["overview_vec"] = df["overview"].apply(get_embed_text)
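
The last line of get_embed_text is a mean pooling step: the model returns one vector per token, and averaging over the token axis gives a single fixed-size vector for the whole text. Here is a minimal sketch of just that step, with a random NumPy array standing in for last_hidden_state. The array and the 5-token length are made up for illustration; 768 is the hidden size of roberta-base.

```python
import numpy as np

# Stand-in for out.last_hidden_state with shape (batch, tokens, hidden).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1, 5, 768))

# Average over the token axis, then drop the batch axis of size 1.
text_vec = hidden_states.mean(axis=1).squeeze()

print(text_vec.shape)  # (768,)
```

Whatever the length of the input text, the result has the same size, which is what makes the vectors comparable later.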

I have an RTX 3050 GPU and a Ryzen 3200G CPU. With the GPU, processing the texts of two thousand movies takes 2-3 minutes; on the CPU alone, the same amount of text takes 7-10 minutes.

Let's combine the vectors

I vectorized the dataframe columns one by one. Now I'll concatenate their vectors with NumPy into a single vector per movie and print the dimension of the result.

Then I'll save the dataframe with the vectors in pickle format to embedded.pkl:

import numpy as np

def concatenate(row, col_names):
    """Combine the column vectors into a single movie vector."""
    embedding = np.concatenate(row[col_names].values)
    embedding = np.concatenate((embedding, row['genres_vec']))
    embedding = np.concatenate((embedding, row['overview_vec']))
    return embedding

cat_col_names = [col + '_vec' for col, _, _ in categories_cols]
df.loc[:,'embedding'] = df.apply(lambda x: concatenate(x, cat_col_names), axis=1)
print('Embedding shape:', df['embedding'][0].shape)
df.to_pickle('embedded.pkl')

After running the script, I got a 6399-dimensional vector for each movie.
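
The length of the final vector is simply the sum of the lengths of its parts, because np.concatenate places the arrays end to end. Here is a small sketch with made-up sizes. The column names year_vec and country_vec and all the lengths except one are hypothetical; only the 768-dimensional overview vector follows from roberta-base.

```python
import numpy as np

# Hypothetical per-column vectors for one movie; the sizes are illustrative.
row = {
    "year_vec": np.zeros(10),
    "country_vec": np.zeros(25),
    "genres_vec": np.zeros(20),
    "overview_vec": np.zeros(768),  # hidden size of roberta-base
}

names = ("year_vec", "country_vec", "genres_vec", "overview_vec")
parts = [row[name] for name in names]
embedding = np.concatenate(parts)

print(embedding.shape)  # (823,)
```

Every movie must end up with the same total length, since the database column that stores the vector has a fixed dimension.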

Vector Postgres

To store the vectors, I chose plain PostgreSQL. I picked the database based on the following criteria:

  • it must handle both vectors and regular data, for example the "raw" fields: title, description and rating;

  • it must support searching for movies by title.

To work with vectors in Postgres, you need the pgvector extension. It adds a vector data type and implements nearest-neighbor search over vectors.

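Important: in pgvector, a vector column can have at most 16,000 dimensions. Indexes on the vector type are limited to 2,000 dimensions, so a column as wide as mine is searched with a sequential scan.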

Next, I set up a container for the database with Docker and Docker Compose. Here are the contents of the files:

# psql.Dockerfile
FROM postgres:16-alpine3.20

RUN apk update && apk add --no-cache postgresql16-plpython3
RUN apk update; \
    apk add --no-cache --virtual .vector-deps \
      postgresql16-dev \
      git \
      build-base \
      clang15 \
      llvm15-dev \
      llvm15; \
    git clone https://github.com/pgvector/pgvector.git /build/pgvector; \
    cd /build/pgvector; \
    make; \
    make install; \
    apk del .vector-deps

COPY docker-entrypoint-initdb.d/* /docker-entrypoint-initdb.d/

# docker-compose.yml
version: '3'

services:
    postgres:
        build:
            dockerfile: psql.Dockerfile
            context: .
        ports:
            - 5432:5432
        environment:
            - POSTGRES_USER=user
            - POSTGRES_PASSWORD=password
            - POSTGRES_DB=db
        volumes:
            - pgdata:/var/lib/postgresql

volumes:
    pgdata:

I created an initialization SQL script for the database that enables pgvector:

$ mkdir docker-entrypoint-initdb.d/
$ touch docker-entrypoint-initdb.d/init.sql

-- docker-entrypoint-initdb.d/init.sql
CREATE EXTENSION IF NOT EXISTS vector;

To check that everything works, I start the containers with docker-compose and look at the logs.

$ docker-compose up -d
$ docker-compose logs

Now I need a table to store the movie vectors. I'll add columns for the title, rating, description and IMDb ID, since the recommender shows them in the results feed. The vector itself goes into the embedding column. In pgvector, the vector dimension must be declared in advance. Mine is 6399, as printed by the script that processed the movie dataframe.

I'll add table creation to the docker-entrypoint-initdb.d/init.sql file.

-- docker-entrypoint-initdb.d/init.sql
...

CREATE TABLE movies (
    tconst VARCHAR(16) PRIMARY KEY NOT NULL UNIQUE,
    title VARCHAR(64) NOT NULL,
    title_desc VARCHAR(4096) NOT NULL,
    avg_vote NUMERIC NOT NULL DEFAULT 0.0,
    embedding vector(6399)
);

After changing init.sql, you need to rebuild the Docker image and restart the database. This takes no more than a minute, because Docker caches the build layers.

$ docker-compose down && docker-compose build && docker-compose up -d

The movie vectors need to be loaded into the database. For this I wrote a separate script that reads the data from embedded.pkl. I talk to the database through the psycopg library; for it to understand vectors, you also need the pgvector-python library.

$ pip install psycopg pgvector

# filldb.py
import asyncio

from pgvector.psycopg import register_vector_async
import pandas as pd
import psycopg

df = pd.read_pickle("embedded.pkl")

async def fill_db():
    async with await psycopg.AsyncConnection.connect(
        'postgresql://user:password@localhost:5432/db'
    ) as conn:
        await register_vector_async(conn)
        async with conn.cursor() as cur:
            for _, row in df.iterrows():
                await cur.execute(
                    """
                    INSERT INTO movies (
                        tconst,
                        title,
                        title_desc,
                        avg_vote,
                        embedding
                    ) VALUES (%s, %s, %s, %s, %s)
                    """,
                    (
                        row['imdb_id'],
                        row['title'],
                        row['overview'],
                        row['imdb_rating'],
                        row["embedding"],
                    ),
                )

asyncio.run(fill_db())

Once the script finishes, all the vectors are stored in the database and you can search for similar movies.

$ python filldb.py

Interface in Flask

To make the recommender easier to use, I built a web interface with Flask and Jinja.

$ pip install flask

The interface is a single page with an input field for the movie title. After the form is submitted, the recommendation feed appears below it. I pass the query and the page number as GET parameters, and each page shows 20 movies. For the search form I added hints: 20 random titles pulled from the database. Instead of posters, I attached random photos of dogs, which makes the recommender's output look more fun.

The Flask application code and Jinja template are given below. Here is the app.py file:

# app.py
import psycopg
from flask import Flask, render_template, request
from pgvector.psycopg import register_vector

app = Flask(__name__)

@app.route('/')
def main():
    query = request.args.get('q')
    page = max(0, request.args.get('p', 0, type=int))
    with psycopg.connect(
        'postgres://user:password@localhost:5432/db'
    ) as conn:
        register_vector(conn)
        with conn.cursor() as cur:
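            # Fetch the hints right away: execute() returns the cursor itself,
            # and the same cursor is reused for the search query below.
            hints = cur.execute(
                'SELECT title FROM movies ORDER BY random() LIMIT 20;'
            ).fetchall()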
            if query is not None:
                query = query.strip()
                queryset = cur.execute(
                    """
                    WITH selected_movie AS (
                        SELECT *
                        FROM movies
                        WHERE LOWER(title) = LOWER(%s)
                        LIMIT 1
                    )
                    SELECT
                        m2.*,
                        (SELECT COUNT(*) FROM movies) AS total_count,
                        selected_movie.embedding <-> m2.embedding AS euclidean_distance
                    FROM
                        movies m2,
                        selected_movie
                    ORDER BY
                        euclidean_distance ASC
                    LIMIT 20 OFFSET %s;
                    """,
                    (query, 20 * page),
                )
                result = queryset.fetchall()
                num = result[0][5] if result else 0
                return render_template(
                    'search.html',
                    query=query,
                    result=result,
                    page=page,
                    num=num,
                    hints=hints,
                )
            return render_template('search.html', hints=hints)
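
The <-> operator in the query above is pgvector's Euclidean (L2) distance, and LIMIT 20 OFFSET 20 * page cuts one page out of the sorted list. The same logic can be sketched in plain Python on toy data. The movie titles, the 2-dimensional vectors and the helper name similar_movies are invented for illustration and are not part of the application.

```python
import math

# Toy "table": title -> embedding (the real vectors have 6399 dimensions).
movies = {
    "Alien": [1.0, 0.0],
    "Aliens": [0.9, 0.1],
    "Predator": [0.7, 0.4],
    "Amelie": [0.0, 1.0],
}

def similar_movies(title, page=0, page_size=2):
    """Rank all movies by Euclidean distance to the selected one, return one page."""
    selected = movies[title]
    ranked = sorted(movies, key=lambda t: math.dist(selected, movies[t]))
    start = page * page_size  # plays the role of OFFSET in the SQL query
    return ranked[start:start + page_size]

print(similar_movies("Alien", page=0))  # ['Alien', 'Aliens']
print(similar_movies("Alien", page=1))  # ['Predator', 'Amelie']
```

As in the SQL query, the selected movie itself comes first, because its distance to itself is zero.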

Here is the Jinja2 page template file templates/search.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    {% if query %}
        <title>{{ query }} - Movie Recommender ({{ num }})</title>
    {% else %}
        <title>Movie Recommender</title>
    {% endif %}
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/water.css@2/out/light.min.css">
    <style>
        * {
            box-sizing: border-box;
        }
        body {
            max-width: 960px;
        }
        .app {
            margin-top: 30%;
        }
        .app>h1 {
            font-weight: normal;
            font-size: 4.5rem;
            margin-bottom: 1.5rem;
            text-align: center;
        }
        .app>h1>a,
        .app>h1>a:hover,
        .app>h1>a:active,
        .app>h1>a:focus,
        .app>h1>a:visited {
            color: #46178f;
            text-decoration: none;
        }
        #search-box {
            display: block;
            width: 100%;
            max-width: 700px;
            margin: 0 auto;
            padding: 1.25em;
            border-radius: 8px;
            background-color: hsl(0, 0%, 96%);
            border: none !important;
            outline: none !important;
            font-family: sans-serif;
        }
        #search-box:focus {
            box-shadow: 0px 0px 15px -2px #46178f !important;
        }
        .query-result {
            margin-top: 3em;
            width: 100%;
            max-width: 100%;
            overflow-x: auto;
        }
        .query-result>table {
            width: 100%;
        }
        .query-result td {
            padding-top: 1.5em;
            padding-bottom: 1.5em;
        }
        .query-result td.rating {
            vertical-align: middle;
            text-align: center;
            font-size: 1.5em;
        }
        .query-result th.special {
            text-align: center;
            width: 15%;
        }
        .query-result tr:nth-child(2) {
            background-color: #46178f22;
        }
        .pagination {
            font-size: x-large;
            text-align: center;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="app">
            <h1><a href="/">Recommender</a></h1>
            <form action="/" method="get">
                <input id="search-box" name="q" type="text" value="{{ query }}">
            </form>
        </div>
        {% if query %}
            {% if result %}
                <div class="query-result">
                    <table>
                        <tr>
                            <th class="special">Rating</th>
                            <th class="special">Poster</th>
                            <th>Description</th>
                        </tr>
                        {% for movie in result %}
                            <tr>
                                <td class="rating">{{ movie[3] }}</td>
                                <td>
                                    <img loading="lazy" decoding="async" src="https://placedog.net/149/209?id={{ loop.index }}" width="149" height="209" alt="">
                                </td>
                                <td>
                                    <b>{{ movie[1] }}</b>
                                    <p>{{ movie[2] }}</p>
                                    <span>
                                        <a href="/?q={{ movie[1]|urlencode }}">Find similar</a>
                                        |
                                        <a href="https://imdb.com/title/{{ movie[0] }}" target="_blank">IMDb page</a>
                                    </span>
                                </td>
                            </tr>
                        {% endfor %}
                    </table>
                </div>
                <p class="pagination">
                    {% if page and page > 0 %}
                        <a href="/?q={{ query|urlencode }}&p={{ page - 1 }}">{{ page }}</a>
                    {% endif %}
                
                    {{ page + 1 }}
                
                    {% if (result|length) == 20 %}
                        <a href="/?q={{ query|urlencode }}&p={{ page + 1 }}">{{ page + 2 }}</a>
                    {% endif %}
                </p>
            {% else %}
                <p>No results...</p>
            {% endif %}
        {% endif %}
    </div>

    <script type="text/javascript">
        let searchBox = document.getElementById("search-box");
        searchBox.addEventListener("keydown", event => {
            if (event.key != "Enter") return;
            let value = event.target.value;
            if (value.length == 0) {
                event.preventDefault();
                return;
            }
        });
        const examples = [
            {% for hint in hints %}
                {{ hint[0]|tojson }},
            {% endfor %}
        ].map((example) => example += "...");
        let exampleId = 0;
        let letterId = 0;
        let reversed = false;

        function getRandomInt(max) {
            return Math.floor(Math.random() * max);
        }

        function typewriteExample() {
            if (reversed) {
                setTimeout(typewriteExample, 100 - getRandomInt(25));
                if (letterId-- > 0) {
                    searchBox.placeholder = searchBox.placeholder.slice(0, -1);
                    return;
                }
                reversed = false;
                if (++exampleId >= examples.length) {
                    exampleId = 0;
                }
            } else {
                setTimeout(typewriteExample, 150 + (getRandomInt(150) - 75));
                if (letterId < examples[exampleId].length) {
                    searchBox.placeholder += examples[exampleId].charAt(letterId++);
                    return;
                }
                reversed = true;
            }
        }
        if (examples.length > 0) {
            typewriteExample();
        }
</script>
</body>
</html>

The interface can be launched and tested like this:

$ flask run

Once launched, a local link will appear that you can open in your browser.

What can be improved

The movie recommender works, and finding new movies has become easier. But for now it's only an MVP.

To round it out, you could:

  • Show real movie posters. The dogs are cute, but relevance matters.

  • Sort the actors by rating and single out the lead roles, so they can be encoded as categorical features.

  • Add more fields to the database tables to display more information about movies in the recommendation listing.

  • Add fuzzy search by title.

  • Search by actors, genre, director, screenwriters.

  • Make facets for the results page to filter the results.
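
For the fuzzy title search, one simple starting point is Python's standard difflib module, which picks the closest strings from a list. This is only a sketch on an in-memory list: the titles are a made-up sample and find_title is a hypothetical helper. On a real database, a Postgres-side approach such as the pg_trgm extension would likely be a better fit.

```python
import difflib

# Made-up sample; in the app the titles would come from the movies table.
titles = ["The Matrix", "The Matrix Reloaded", "Interstellar", "Inception"]

def find_title(query, candidates, cutoff=0.6):
    """Return the candidate closest to the query, ignoring case, or None."""
    lowered = {title.lower(): title for title in candidates}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None

print(find_title("the matrx", titles))     # The Matrix
print(find_title("intersteller", titles))  # Interstellar
```

The title returned by such a helper could then be passed to the exact-match query that the app already uses.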

Author of the article: Dmitry Sidorov

