Movie recommender or how to write your own DIY solution for finding new movies over the weekend
I'll run the model inside the no_grad context, which disables gradient calculation. Gradients are needed during training, where backpropagation uses them to compute the error and adjust the weights. Here I'm only running the model, so I don't need them.
$ pip install transformers torch
import torch
from transformers import RobertaModel, RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
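def get_embed_text(text):
    """Return one vector for the text by averaging its token vectors."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )
    # Inference only, so gradient tracking is switched off.
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state.mean(axis=1).squeeze().detach().numpy()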
df["overview_vec"] = df["overview"].apply(get_embed_text)
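The last line of get_embed_text averages the per-token vectors in last_hidden_state into a single vector per text (mean pooling). Here is a minimal NumPy sketch of just that step, using a made-up array in place of the model output. The shapes and numbers are illustrative assumptions, not real RoBERTa values:

```python
import numpy as np

# Stand-in for out.last_hidden_state: (batch=1, tokens=3, hidden=4).
# roberta-base has hidden=768; 4 keeps the example readable.
last_hidden_state = np.array([[
    [1.0, 2.0, 3.0, 4.0],   # token 1
    [3.0, 2.0, 1.0, 0.0],   # token 2
    [2.0, 2.0, 2.0, 2.0],   # token 3
]])

# Average over the token axis, then drop the batch axis of size 1.
text_vec = last_hidden_state.mean(axis=1).squeeze()

print(text_vec.shape)  # (4,)
print(text_vec)        # [2. 2. 2. 2.]
```

Whatever the length of the text, the result has one value per hidden dimension, which is why every overview ends up as a vector of the same size.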
I have an RTX 3050 graphics card and a Ryzen 3200G processor, so processing the texts of two thousand movies takes 2-3 minutes. Processing the same amount of text on the CPU alone takes 7-10 minutes.
Let's combine the vectors
I vectorized the dataframe columns individually. Now I'll combine their vectors with numpy and print the dimension of the resulting vector.
Then I'll save the dataframe with the vectors in pickle format to embedded.pkl:
import numpy as np
def concatenate(row, col_names):
    """Combine the vectors from the columns into a single movie vector."""
    embedding = np.concatenate(row[col_names].values)
    embedding = np.concatenate((embedding, row['genres_vec']))
    embedding = np.concatenate((embedding, row['overview_vec']))
    return embedding
cat_col_names = [col + '_vec' for col, _, _ in categories_cols]
df.loc[:,'embedding'] = df.apply(lambda x: concatenate(x, cat_col_names), axis=1)
print('Embedding shape:', df['embedding'][0].shape)
df.to_pickle('embedded.pkl')
After running the script, I got a vector of dimension 6399 for each movie.
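The final dimension is simply the sum of the lengths of the parts being concatenated. Here is a toy sketch with made-up sizes. The column names director_vec and year_vec and all lengths except 768 (the roberta-base hidden size) are hypothetical; the real categorical columns come from categories_cols in the code above:

```python
import numpy as np

# Hypothetical per-movie vectors; real sizes depend on the dataset.
row = {
    "director_vec": np.zeros(5),         # a hypothetical categorical column
    "year_vec": np.zeros(3),             # another hypothetical column
    "genres_vec": np.ones(4),            # genres vector
    "overview_vec": np.full(768, 0.5),   # roberta-base text embedding
}

parts = [row["director_vec"], row["year_vec"],
         row["genres_vec"], row["overview_vec"]]
embedding = np.concatenate(parts)

# The length of the result is the sum of the lengths of the parts.
print(embedding.shape)              # (780,)
print(sum(len(p) for p in parts))   # 780
```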
Vector Postgres
To store the vectors, I chose plain PostgreSQL. I picked the database by the following criteria:
the database must work with both vectors and regular data, for example to store the “raw” fields: title, description and rating;
it must support searching for movies by title.
To work with vectors in Postgres, you need the pgvector extension. It adds a vector data type and implements nearest-neighbor search over vectors.
Important: in pgvector, a vector can have at most 16,000 dimensions.
Next, I set up a container for the database with Docker and Docker Compose. Here are the contents of the files:
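Conceptually, a nearest-vector search ranks the stored vectors by their distance to a query vector. Below is a pure-Python sketch of ranking by Euclidean (L2) distance, which is what pgvector's <-> operator computes. The movie names and 3-dimensional vectors are made up, and pgvector does this work inside the database rather than in Python:

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, movies, k=2):
    """Return the k (title, distance) pairs closest to query_vec."""
    ranked = sorted(
        ((title, l2_distance(query_vec, vec)) for title, vec in movies.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]

# Made-up toy vectors, for illustration only.
movies = {
    "Movie A": [0.0, 0.0, 0.0],
    "Movie B": [3.0, 4.0, 0.0],
    "Movie C": [1.0, 0.0, 0.0],
}

print(nearest([0.0, 0.0, 0.0], movies))
# [('Movie A', 0.0), ('Movie C', 1.0)]
```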
# psql.Dockerfile
FROM postgres:16-alpine3.20
RUN apk update && apk add --no-cache postgresql16-plpython3
RUN apk update; \
    apk add --no-cache --virtual .vector-deps \
        postgresql16-dev \
        git \
        build-base \
        clang15 \
        llvm15-dev \
        llvm15; \
    git clone https://github.com/pgvector/pgvector.git /build/pgvector; \
    cd /build/pgvector; \
    make; \
    make install; \
    apk del .vector-deps
COPY docker-entrypoint-initdb.d/* /docker-entrypoint-initdb.d/
# docker-compose.yml
version: '3'
services:
  postgres:
    build:
      dockerfile: psql.Dockerfile
      context: .
    ports:
      - 5432:5432
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=db
    volumes:
      - pgdata:/var/lib/postgresql
volumes:
  pgdata:
I created an initialization SQL script for the database and loaded pgvector in it:
$ mkdir docker-entrypoint-initdb.d/
$ touch docker-entrypoint-initdb.d/init.sql
-- docker-entrypoint-initdb.d/init.sql
CREATE EXTENSION IF NOT EXISTS vector;
To check that everything works, I start the containers with docker-compose and look at the logs.
$ docker-compose up -d
$ docker-compose logs
Now we need to create a table to store the movie vectors. I'll add columns for the title, rating, description, and IMDb ID, which the recommender will display in the feed. The vector goes into the embedding column. The column type specifies the dimension of the vector, so it must be known in advance. Mine is 6399, as reported by the script that processed the dataframe with the movies.
I'll add the table creation to the docker-entrypoint-initdb.d/init.sql file.
-- docker-entrypoint-initdb.d/init.sql
...
CREATE TABLE movies (
    tconst VARCHAR(16) PRIMARY KEY NOT NULL UNIQUE,
    title VARCHAR(64) NOT NULL,
    title_desc VARCHAR(4096) NOT NULL,
    avg_vote NUMERIC NOT NULL DEFAULT 0.0,
    embedding vector(6399)
);
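Because the column is declared as vector(6399), every embedding has to have exactly that length; to my knowledge, pgvector rejects an insert with a different number of dimensions. Here is a small pre-flight check that could be run before loading the data. It is not part of the original scripts, and the helper name find_bad_dimensions is hypothetical:

```python
EXPECTED_DIM = 6399  # must match vector(6399) in init.sql

def find_bad_dimensions(embeddings, expected=EXPECTED_DIM):
    """Return (index, length) for every embedding of the wrong length."""
    return [
        (i, len(vec))
        for i, vec in enumerate(embeddings)
        if len(vec) != expected
    ]

# Toy data: two vectors of the right size and one that is too short.
embeddings = [[0.0] * 6399, [0.0] * 6398, [0.0] * 6399]

print(find_bad_dimensions(embeddings))  # [(1, 6398)]
```

With the dataframe from embedded.pkl, the same helper could be called as find_bad_dimensions(df['embedding']).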
After changing init.sql, you need to rebuild the Docker image and restart the database. This takes no more than a minute, because Docker caches the image layers.
$ docker-compose down && docker-compose build && docker-compose up -d
The movie vectors need to be loaded into the database. To do this, I wrote a script in a separate file that reads the data from embedded.pkl. I'll work with the database through the psycopg library; for it to handle vectors, the pgvector-python library is also needed.
$ pip install psycopg pgvector
# filldb.py
import asyncio
from pgvector.psycopg import register_vector_async
import pandas as pd
import psycopg
df = pd.read_pickle("embedded.pkl")
async def fill_db():
    async with await psycopg.AsyncConnection.connect(
        'postgresql://user:password@localhost:5432/db'
    ) as conn:
        await register_vector_async(conn)
        async with conn.cursor() as cur:
            for _, row in df.iterrows():
                await cur.execute(
                    """
                    INSERT INTO movies (
                        tconst,
                        title,
                        title_desc,
                        avg_vote,
                        embedding
                    ) VALUES (%s, %s, %s, %s, %s)
                    """,
                    (
                        row['imdb_id'],
                        row['title'],
                        row['overview'],
                        row['imdb_rating'],
                        row["embedding"],
                    ),
                )


asyncio.run(fill_db())
After the script runs, all the vectors are stored in the database and you can search for similar movies.
$ python filldb.py
Interface in Flask
To make the recommender easier to use, I built a web interface with Flask and Jinja.
$ pip install flask
The interface is a single page with an input field for the movie title. After the form is submitted, the recommendations feed appears below it. I pass the query and the page number as GET parameters. Each page shows 20 movies. For the search form I added hints: 20 random titles pulled from the database. Instead of posters, I attached random photos of dogs, which makes the recommender's output look more fun.
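The paging described above maps a zero-based page number to a LIMIT/OFFSET pair. Here is a minimal sketch of that arithmetic. The helper name page_window is hypothetical, while the page size of 20 and the clamping of negative page numbers mirror what app.py does:

```python
PAGE_SIZE = 20  # movies per page, as in the article

def page_window(page, page_size=PAGE_SIZE):
    """Return (limit, offset) for a zero-based page number."""
    page = max(0, page)  # negative pages fall back to the first page
    return page_size, page_size * page

print(page_window(0))   # (20, 0)
print(page_window(2))   # (20, 40)
print(page_window(-5))  # (20, 0)
```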
The Flask application code and Jinja template are given below. Here is the app.py file:
# app.py
import psycopg
from flask import Flask, render_template, request
from pgvector.psycopg import register_vector
app = Flask(__name__)
@app.route('/')
def main():
    query = request.args.get('q')
    page = max(0, request.args.get('p', 0, type=int))
    with psycopg.connect(
        'postgres://user:password@localhost:5432/db'
    ) as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            # execute() returns the cursor itself, so fetch the hints now,
            # before the cursor is reused for the search query.
            hints = cur.execute(
                'SELECT title FROM movies ORDER BY random() LIMIT 20;'
            ).fetchall()
            if query is not None:
                query = query.strip()
                queryset = cur.execute(
                    """
                    WITH selected_movie AS (
                        SELECT *
                        FROM movies
                        WHERE LOWER(title) = LOWER(%s)
                        LIMIT 1
                    )
                    SELECT
                        m2.*,
                        (SELECT COUNT(*) FROM movies) AS total_count,
                        selected_movie.embedding <-> m2.embedding AS euclidean_distance
                    FROM
                        movies m2,
                        selected_movie
                    ORDER BY
                        euclidean_distance ASC
                    LIMIT 20 OFFSET %s;
                    """,
                    (query, 20 * page),
                )
                result = queryset.fetchall()
                # Column 5 is total_count (after the five columns of m2.*).
                num = result[0][5] if result else 0
                return render_template(
                    'search.html',
                    query=query,
                    result=result,
                    page=page,
                    num=num,
                    hints=hints,
                )
    return render_template('search.html', hints=hints)
Here is the Jinja2 page template file templates/search.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
{% if query %}
<title>{{ query }} - Movie Recommender ({{ num }})</title>
{% else %}
<title>Movie Recommender</title>
{% endif %}
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/water.css@2/out/light.min.css">
<style>
* {
box-sizing: border-box;
}
body {
max-width: 960px;
}
.app {
margin-top: 30%;
}
.app>h1 {
font-weight: normal;
font-size: 4.5rem;
margin-bottom: 1.5rem;
text-align: center;
}
.app>h1>a,
.app>h1>a:hover,
.app>h1>a:active,
.app>h1>a:focus,
.app>h1>a:visited {
color: #46178f;
text-decoration: none;
}
#search-box {
display: block;
width: 100%;
max-width: 700px;
margin: 0 auto;
padding: 1.25em;
border-radius: 8px;
background-color: hsl(0, 0%, 96%);
border: none !important;
outline: none !important;
font-family: sans-serif;
}
#search-box:focus {
box-shadow: 0px 0px 15px -2px #46178f !important;
}
.query-result {
margin-top: 3em;
width: 100%;
max-width: 100%;
overflow-x: auto;
}
.query-result>table {
width: 100%;
}
.query-result td {
padding-top: 1.5em;
padding-bottom: 1.5em;
}
.query-result td.rating {
vertical-align: middle;
text-align: center;
font-size: 1.5em;
}
.query-result th.special {
text-align: center;
width: 15%;
}
.query-result tr:nth-child(2) {
background-color: #46178f22;
}
.pagination {
font-size: x-large;
text-align: center;
}
</style>
</head>
<body>
<div class="container">
<div class="app">
<h1><a href="/">Recommender</a></h1>
<form action="/" method="get">
<input id="search-box" name="q" type="text" value="{{ query }}">
</form>
</div>
{% if query %}
{% if result %}
<div class="query-result">
<table>
<tr>
<th class="special">Rating</th>
<th class="special">Poster</th>
<th>Description</th>
</tr>
{% for movie in result %}
<tr>
<td class="rating">{{ movie[3] }}</td>
<td>
<img loading="lazy" decoding="async" src="https://placedog.net/149/209?id={{ loop.index }}" width="149" height="209" alt="">
</td>
<td>
<b>{{ movie[1] }}</b>
<p>{{ movie[2] }}</p>
<span>
<a href="/?q={{ movie[1]|urlencode }}">Find similar</a>
|
<a href="https://imdb.com/title/{{ movie[0] }}" target="_blank">IMDb page</a>
</span>
</td>
</tr>
{% endfor %}
</table>
</div>
<p class="pagination">
{% if page and page > 0 %}
<a href="/?q={{ query|urlencode }}&p={{ page - 1 }}">{{ page }}</a>
{% endif %}
{{ page + 1 }}
{% if (result|length) == 20 %}
<a href="/?q={{ query|urlencode }}&p={{ page + 1 }}">{{ page + 2 }}</a>
{% endif %}
</p>
{% else %}
<p>No results...</p>
{% endif %}
{% endif %}
</div>
<script type="text/javascript">
let searchBox = document.getElementById("search-box");
searchBox.addEventListener("keydown", event => {
if (event.key != "Enter") return;
let value = event.srcElement.value;
if (value.length == 0) {
event.preventDefault();
return;
}
});
const examples = [
{% for hint in hints %}
{{ hint[0]|tojson }},
{% endfor %}
].map((example) => example += "...");
let exampleId = 0;
let letterId = 0;
let reversed = false;
function getRandomInt(max) {
return Math.floor(Math.random() * max);
}
function typewriteExample() {
if (reversed) {
setTimeout(typewriteExample, 100 - getRandomInt(25));
if (letterId-- > 0) {
searchBox.placeholder = searchBox.placeholder.slice(0, -1);
return;
}
reversed = false;
if (++exampleId >= examples.length) {
exampleId = 0;
}
} else {
setTimeout(typewriteExample, 150 + (getRandomInt(150) - 75));
if (letterId < examples[exampleId].length) {
searchBox.placeholder += examples[exampleId].charAt(letterId++);
return;
}
reversed = true;
}
}
if (examples.length > 0) {
typewriteExample();
}
</script>
</body>
</html>
The interface can be launched and tested like this:
$ flask run
Once it starts, Flask prints a local URL that you can open in your browser.
What can be improved
The movie recommender works, and finding new movies has become easier. But for now it's only an MVP.
To round it out, you could:
Add real movie posters. The dogs are cute, but I want relevance.
Sort the actors by rating and single out the lead roles, to encode them as categorical features.
Add more fields to the database table to show more information about each movie in the recommendations feed.
Add fuzzy search by title.
Add search by actors, genre, director, and screenwriters.
Add facets to the results page for filtering the results.
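Of these ideas, fuzzy title search is probably the easiest to prototype. Here is a rough sketch using Python's standard difflib module. The function name, the 0.6 cutoff, and the sample titles are assumptions for illustration, and matching inside the database would likely be a better fit for the real application:

```python
import difflib

def fuzzy_title(query, titles, cutoff=0.6):
    """Return the stored title closest to query, or None if nothing is close."""
    by_lower = {t.lower(): t for t in titles}
    matches = difflib.get_close_matches(
        query.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None

titles = ["The Matrix", "The Godfather", "Alien"]

print(fuzzy_title("The Matrx", titles))  # The Matrix
print(fuzzy_title("zzzz", titles))       # None
```

The matched title could then be passed to the existing exact-match query in app.py instead of the raw user input.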
Author of the article: Dmitry Sidorov