Storing files uploaded by users

Introduction

At the very beginning of my career, I had the honor of single-handedly developing a project aimed at a mass audience. It must be said that nearly every fundamental principle of how to fail a project was followed; nevertheless, it is still alive. The project was intended for mandatory use by a certain category of public-sector workers. Technical specifications, analytics, design documents, Figma mockups, lavender smoothies, and the rest of those buzzwords of yours, without which, N years ago, people somehow managed to build the BAM and the entire Trans-Siberian Railway, were nowhere to be found. But there were processes "on paper" that needed to be digitized. So what passed for a technical specification sounded like: "These (workers) fill out this (papers), then take it to those (inspectors), and then all of it gets stored; make it so they can upload and send it from a computer. We have a whole floor here buried in papers, and if a fire starts, everything is lost." Armed with all my knowledge and experience in building high-load systems (at this point I stepped away from the article to laugh first and then cry), I began the implementation.

Fatal flaw

I have seen how some "large and mature" document management systems implement file storage. By default, document bodies were written to the database, into a special table. Using a special "File Storage" module, you could also configure offloading files to disk, spreading documents across several servers. "If such (serious) systems store files in the database, it will work here too," I thought, underestimating the scale of the tragedy. The people the system was meant for began uploading anything and everything: videos, plump scans of equally plump documents, archives, and so on. The DBMS, which was PostgreSQL, endured this abuse in good faith. The database swelled, but it held. Until one day it became clear that users love to upload, but disk space is still finite. And someone, it turned out, had switched off autovacuum, as if all this were a prank. Autovacuum was turned back on and run, and part of the space came back, though it was clear that not for long.

Actually, when it comes to storing files in a database, I can name only one advantage: everything is in one place. Back up the database and you get a backup of the files too. Lose the database and you lose everything, so there is nothing left to worry about.

An obvious disadvantage is the file download path. To serve a file from the database at the user's request, you have to run a SELECT, which reads the file into your service's memory. If file sizes and the number of simultaneous downloads are not strictly limited, users have every chance of summoning the OOM killer or an OutOfMemoryError, and the container falls into the abyss.
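The memory problem is easy to demonstrate in miniature. In the sketch below (plain Python, with io.BytesIO standing in for a 10 MB blob pulled from the database), reading with .read() holds the whole body in memory at once, while a chunked generator never holds more than one chunk:

```python
import io

# io.BytesIO stands in for a 10 MB file body fetched from the database
blob = io.BytesIO(b"x" * (10 * 1024 * 1024))

# Reading everything at once: the whole body sits in memory
whole = blob.read()
assert len(whole) == 10 * 1024 * 1024


def iter_chunks(stream, chunk_size=64 * 1024):
    # Streaming in chunks: at any moment we hold at most chunk_size bytes
    stream.seek(0)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


# The largest piece held at once is one chunk, not the whole 10 MB
max_held = max(len(c) for c in iter_chunks(blob))
assert max_held == 64 * 1024
```

Multiply that 10 MB by the number of simultaneous downloads and the difference between the two approaches is the difference between a working container and the OOM killer.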

The standard approach is to store files on the file system, but in that case you will have to practice your rsync skills and install additional disks when storage capacity runs out.

The million-dollar idea is to store files in /dev/null: no space needed, no replication needed. There is, however, a slight problem with downloading, so this solution only works if no one ever needs the files again.

Alternative

I would like to describe one option for solving the user-file storage problem: object storage, specifically MinIO.

MinIO is a fairly convenient alternative to the methods above for storing user files, since many issues are resolved out of the box:

  1. Backup

  2. Replication

  3. Load Balancing

  4. API

  5. Some frameworks (for example, Laravel) can work with object storage out of the box.

However, not all frameworks can work with MinIO, so let's launch MinIO and build a simple service for uploading files and then downloading them via a temporary link.

Deploying MinIO

Deployment is most conveniently done using Docker; the MinIO repository has the corresponding documentation with examples.

We use the following script to launch:

docker run \
  -p 9000:9000 \
  -p 9001:9001 \
  --name minio1 \
  -e "MINIO_ROOT_USER=AKIAIOSFODNN7EXAMPLE" \
  -e "MINIO_ROOT_PASSWORD=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -v D:\data:/data \
  quay.io/minio/minio server /data --console-address ":9001"

After executing it, we will see the following output in the console:

Formatting 1st pool, 1 set(s), 1 drives per set.
WARNING: Host local has more than 0 drives of set. A host failure will result in data becoming unavailable.
MinIO Object Storage Server
Copyright: 2015-2024 MinIO, Inc.
License: GNU AGPLv3 <https://www.gnu.org/licenses/agpl-3.0.html>
Version: RELEASE.2024-03-15T01-07-19Z (go1.21.8 linux/amd64)
API: http://172.17.0.4:9000  http://127.0.0.1:9000
WebUI: http://172.17.0.4:9001 http://127.0.0.1:9001
Docs: https://min.io/docs/minio/linux/index.html
Status:         1 Online, 0 Offline.
STARTUP WARNINGS:
- The standard parity is set to 0. This can lead to data loss.

The main thing here is that now there is a WebUI waiting for us on port 9001, and an API for accessing buckets on port 9000.

You can log into WebUI using the login and password specified in the container launch command.
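Incidentally, for repeatable setups the same launch can be captured as a docker-compose file. This is a sketch equivalent to the docker run command above (Compose v2 syntax assumed; the Windows volume path is written with forward slashes and quoted):

```yaml
services:
  minio1:
    image: quay.io/minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: AKIAIOSFODNN7EXAMPLE
      MINIO_ROOT_PASSWORD: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    volumes:
      # same host directory as in the docker run example
      - "D:/data:/data"
```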

We will need to create a bucket and access keys for our application. To do this, go to Administration → Buckets, click Create bucket, set a name, and save.

Creating Buckets

Next come the keys for the application: User → Access keys → Create access key.

Key creation

New key

Fill in the name and description fields (and the expiration date, if necessary), click Create, and copy the Access key and Secret key for yourself.

Great, now let's move on to the demo application.

Application

For implementation we will use Python (FastAPI). Let's implement the following idea:

  • The service will support 4 methods:

    • Uploading a file to storage

    • Getting a list of files in storage

    • Downloading a file using a temporary link

    • Getting a temporary link to a file

  • To implement the temporary link we use JWT (this may not seem like the best solution, but it lets us avoid storing any state: the JWT token itself carries all the information; alternatively, the information could be kept in Redis with a TTL on the record, after which the link disappears)
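The stateless-link idea does not depend on JWT specifically. As an illustration, the same scheme can be sketched with nothing but the standard library: an HMAC-signed payload carrying the file name and an expiry claim (the names and the secret below are illustrative, not part of the service we build next):

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # illustrative; keep the real secret out of the code


def make_link_token(filename: str, ttl_seconds: int) -> str:
    # The payload carries everything the download endpoint needs: no server-side state
    payload = base64.urlsafe_b64encode(
        json.dumps({"filename": filename, "exp": time.time() + ttl_seconds}).encode()
    )
    sig = base64.urlsafe_b64encode(hmac.new(SECRET, payload, hashlib.sha256).digest())
    return (payload + b"." + sig).decode()


def check_link_token(token: str):
    # Returns the file name if the token is genuine and not expired, else None
    payload, sig = token.encode().rsplit(b".", 1)
    expected = base64.urlsafe_b64encode(hmac.new(SECRET, payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] < time.time():
        return None
    return claims["filename"]
```

A tampered or expired token fails the check, while a genuine one yields the file name. This is the same property the JWT variant in the service relies on, just with the signing done by hand.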

First, let's implement a wrapper around MinIO.

from typing import BinaryIO

import minio
from minio import Minio


class MinioHandler:
    def __init__(self, minio_endpoint: str, access_key: str, secret_key: str, bucket: str, secure: bool = False):
        self.client = Minio(
            minio_endpoint,
            access_key=access_key,
            secret_key=secret_key,
            secure=secure
        )
        self.bucket = bucket

    def upload_file(self, name: str, file: BinaryIO, length: int):
        return self.client.put_object(self.bucket, name, file, length=length)

    def list(self):
        objects = list(self.client.list_objects(self.bucket))
        return [{"name": i.object_name, "last_modified": i.last_modified} for i in objects]

    def stats(self, name: str) -> minio.datatypes.Object:
        return self.client.stat_object(self.bucket, name)

    def download_file(self, name: str, chunk_size: int = 2048):
        # A single GET streamed in chunks: one HTTP request instead of one per block,
        # and the connection is released when the generator finishes
        response = self.client.get_object(self.bucket, name)
        try:
            for chunk in response.stream(chunk_size):
                yield chunk
        finally:
            response.close()
            response.release_conn()

Points worth special attention:

  • The upload_file method accepts not bytes but a BinaryIO, which makes it possible to pass the stream from FastAPI's UploadFile instead of the raw bytes. This removes the need for extra gymnastics reading the data into temporary memory.

  • The download_file method streams the object's content in small chunks (2048 bytes here) instead of reading the entire file into memory and handing it to the user in one piece. Because the data travels over HTTP, throughput with such small chunks raises questions, but the memory footprint stays bounded.

The FastAPI service looks like this:

import datetime
import os
import jwt
from typing import Annotated

from dateutil.relativedelta import relativedelta

from fastapi import FastAPI, UploadFile, File, Form
from starlette.responses import StreamingResponse, JSONResponse

from minio_fastapi.minio_handler import MinioHandler

app = FastAPI()

minio_handler = MinioHandler(
    os.getenv('MINIO_URL'),
    os.getenv('MINIO_ACCESS_KEY'),
    os.getenv('MINIO_SECRET_KEY'),
    os.getenv('MINIO_BUCKET'),
    False
)


@app.post('/upload')
async def upload(file: Annotated[UploadFile, File()]):
    minio_handler.upload_file(file.filename, file.file, file.size)
    return {
        "status": "uploaded",
        "name": file.filename
    }


@app.get('/list')
async def list_files():
    return minio_handler.list()


@app.get('/link/{file}')
async def link(file: str):
    obj = minio_handler.stats(file)
    valid_til = datetime.datetime.now(datetime.timezone.utc) + relativedelta(minutes=int(os.getenv('LINK_VALID_MINUTES', 10)))
    payload = {
        "filename": obj.object_name,
        # isoformat() survives the round trip; str(datetime) drops the ".%f"
        # part when microseconds are zero and breaks a fixed strptime format
        "valid_til": valid_til.isoformat()
    }
    encoded_jwt = jwt.encode(payload, os.getenv('JWT_SECRET'), algorithm="HS256")

    return {
        "link": f"/download/{encoded_jwt}"
    }


@app.get('/download/{temp_link}')
async def download(temp_link: str):
    try:
        decoded_jwt = jwt.decode(temp_link, os.getenv('JWT_SECRET'), algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return JSONResponse({
            "status": "failed",
            "reason": "Link expired or invalid"
        }, status_code=400)

    valid_til = datetime.datetime.fromisoformat(decoded_jwt['valid_til'])
    if valid_til > datetime.datetime.now(datetime.timezone.utc):
        filename = decoded_jwt['filename']
        return StreamingResponse(
            minio_handler.download_file(filename),
            media_type="application/octet-stream"
        )
    return JSONResponse({
        "status": "failed",
        "reason": "Link expired or invalid"
    }, status_code=400)

  • The /upload method only uploads the file to storage

  • The /list method returns information about the files and their modification times

  • The /link method takes a file name and returns a link with a JWT token for downloading

  • The /download method extracts the expiry time and file name from the JWT token in the link; if the time has expired, the user does not receive the file
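One subtle pitfall when putting a datetime into the token: Python's str(datetime) silently omits the fractional-seconds part whenever microseconds happen to be zero, so parsing it back with a fixed '%Y-%m-%d %H:%M:%S.%f' format raises ValueError for roughly one request in a million. A quick sketch of the behavior, and why isoformat()/fromisoformat() is the safer round trip:

```python
import datetime

dt = datetime.datetime(2024, 3, 15, 12, 0, 0)  # microseconds == 0

# str() omits the ".%f" part entirely when microseconds are zero
assert str(dt) == "2024-03-15 12:00:00"

# so a format that insists on ".%f" fails on exactly these values
try:
    datetime.datetime.strptime(str(dt), "%Y-%m-%d %H:%M:%S.%f")
    raised = False
except ValueError:
    raised = True
assert raised

# isoformat()/fromisoformat() round-trips regardless of microseconds
assert datetime.datetime.fromisoformat(dt.isoformat()) == dt
```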

Summary

As a result, we have a demo example of a service that uses server resources reasonably carefully when uploading and downloading files. Of course, with load balancing, replication, speed measurements, and backups the article would be a comprehensive guide to all the pitfalls, but… not this time.

