Infrastructure for data engineers: S3

In this article I would like to talk about S3 from the data engineering perspective.

S3 is one of the services used to build a Data Lake and to share files.

Let's start with a definition:
S3 (Simple Storage Service) is an object storage service developed by Amazon; its HTTP API has also become a de facto standard protocol that many other storage systems implement.

What do we have as a result? Conceptually, it is a file storage and sharing service: just as you organize files in folders on your computer, you can organize them in S3 in exactly the same way.

Later in the article we will use S3 to build a Data Lake.

So let's introduce some terms we'll need (a small example combining them follows this list):

  • bucket – a container where you store your files (think of it as a folder)

  • path – a prefix that points to a specific location inside the bucket (like a path in a file explorer)

  • object – a physical file located at a path in a bucket; it can be of any format (a file)

  • access key – the access key for the bucket (a login)

  • secret key – the secret access key for the bucket (a password)
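
To see how these pieces fit together, here is a tiny illustrative snippet; the bucket and path names are made up purely for the example:

# Hypothetical names, only to show how the terms combine into an object reference
bucket = "my-bucket"                    # bucket: the container ("folder")
path = "raw/kaggle/2022-04-01"          # path: a prefix inside the bucket
object_name = "titanic.csv"             # object: the physical file
print(f"s3://{bucket}/{path}/{object_name}")
# s3://my-bucket/raw/kaggle/2022-04-01/titanic.csv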

Since S3 is usually deployed on some remote server, you can connect to it with various GUI clients:

  • CyberDuck

  • Commander One

  • etc

But in our case we will communicate with S3 via Python.

Let's first deploy this service locally in Docker (all the code and sources are available in my repository):

version: "3.9"  
  
services:  
  minio:  
    image: minio/minio:RELEASE.2024-07-04T14-25-45Z  
    restart: always  
    volumes:  
      - ./data:/data  
    environment:  
      - MINIO_ROOT_USER=minioadmin  
      - MINIO_ROOT_PASSWORD=minioadmin  
    command: server /data --console-address ":9001"  
    ports:  
      - "9000:9000"  
      - "9001:9001"

And to start the service you need to run the command: docker-compose up -d.

Then open http://localhost:9001/browser and you will see the Web UI of our object storage (log in with the MINIO_ROOT_USER / MINIO_ROOT_PASSWORD values from the compose file: minioadmin / minioadmin).
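
If you prefer to check from code that the server is actually up before opening the UI, MinIO exposes a liveness endpoint; here is a minimal sketch using only the standard library:

import urllib.request

# MinIO answers 200 on its liveness endpoint when the server is up
with urllib.request.urlopen("http://localhost:9000/minio/health/live") as resp:
    print(resp.status)  # expected: 200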

Let's start by going to Access Keys and creating our first pair of keys: click Create access key + and you will be taken to an interface where your Access Key and Secret Key are generated. Save them, since they are required for all further work with S3.

Now we can work with our S3 via Python.

To do this, first create a local virtual environment and install all the project dependencies:

python3.12 -m venv venv && \
source venv/bin/activate && \
pip install --upgrade pip && \
pip install -r requirements.txt
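
The exact requirements.txt lives in the repository; for the snippets in this article it would need to contain at least something like the following (an illustrative minimum, versions omitted; s3fs is what lets pandas write to s3:// paths):

minio
pandas
s3fs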

Then we'll create a little code that will check for the existence of a bucket in our S3:

from minio import Minio

# Import the secrets from a local module
from cred import s3_minio_access_key, s3_minio_secret_key

# This does not change: the single endpoint for connecting to S3
endpoint = "localhost:9000"
# access key for connecting to the bucket
access_key = s3_minio_access_key
# secret key for connecting to the bucket
secret_key = s3_minio_secret_key

client = Minio(
    endpoint=endpoint,
    access_key=access_key,
    secret_key=secret_key,
    secure=False,  # https://github.com/minio/minio/issues/8161#issuecomment-631120560
)

buckets = client.list_buckets()
for bucket in buckets:
    print(bucket.name, bucket.creation_date)
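
If you only care about one specific bucket rather than the full list, the client also has a bucket_exists method; a small sketch reusing the client created above:

# Check a single bucket instead of listing all of them
print(client.bucket_exists("test-local-bucket"))  # True or False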

If we run it, we won't see anything, because there isn't a single bucket in our S3 yet. So let's create a new bucket with the following code:

from minio import Minio

# Import the secrets from a local module
from cred import s3_minio_access_key, s3_minio_secret_key

# This does not change: the single endpoint for connecting to S3
endpoint = "localhost:9000"
# access key for connecting to the bucket
access_key = s3_minio_access_key
# secret key for connecting to the bucket
secret_key = s3_minio_secret_key

client = Minio(
    endpoint=endpoint,
    access_key=access_key,
    secret_key=secret_key,
    secure=False,  # https://github.com/minio/minio/issues/8161#issuecomment-631120560
)

client.make_bucket(
    bucket_name="test-local-bucket"
)
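
Running this script a second time will raise an error, because the bucket already exists. A common pattern is to guard the call (a small sketch reusing the same client):

# Create the bucket only if it does not exist yet
if not client.bucket_exists("test-local-bucket"):
    client.make_bucket("test-local-bucket")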

And if you run the bucket-check code again, it will now show that the test-local-bucket bucket exists.

It is important to note that underscores are not allowed in S3 bucket names, so dashes must be used instead.

Now let's upload some file to our bucket. To do this, we'll use the following code:

import pandas as pd

# Import the secrets from a local module
from cred import s3_minio_access_key, s3_minio_secret_key

bucket_name = "test-local-bucket"
file_name = "titanic.csv"

df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# pandas writes to s3:// paths through s3fs/fsspec, so s3fs must be installed
df.to_csv(
    path_or_buf=f's3://{bucket_name}/{file_name}',
    index=False,
    escapechar="\\",
    compression='gzip',  # the object is gzip-compressed even though the key ends in .csv
    storage_options={
        "key": s3_minio_access_key,
        "secret": s3_minio_secret_key,
        # https://github.com/mlflow/mlflow/issues/1990#issuecomment-659914180
        "client_kwargs": {"endpoint_url": "http://localhost:9000"},
    },
)
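
Here pandas talks to S3 through s3fs; the same upload could also be done with the MinIO client directly. A sketch, assuming titanic.csv has been saved locally and client is the object from the earlier snippets:

# Upload a local file with the MinIO client instead of pandas/s3fs
client.fput_object(
    bucket_name="test-local-bucket",
    object_name="titanic.csv",
    file_path="titanic.csv",
)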

In the Web UI you can check that the file exists.

Availability of the titanic.csv file

An important property to know when working with paths in a bucket is that you can specify any path manually. In the example above our path looks like this: file_name="titanic.csv", but you can set any path, for example: file_name="raw/kaggle/2022-04-01/titanic.csv", and we will get the following structure in the bucket:

Manually written path to titanic.csv

It is also worth noting that if we delete the file at a given path, the whole path disappears with it: prefixes ("folders") in object storage exist only as part of object keys, so there is nothing empty left to clean up.
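
For example, deleting the object through the MinIO client also makes the manually created prefix disappear from the listing (a sketch reusing client):

# Delete the object; the now-empty prefix no longer shows up in listings
client.remove_object("test-local-bucket", "raw/kaggle/2022-04-01/titanic.csv")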

Now we also need to specify this entire path in order to read the file. Let's use the code below to read our .csv from the bucket:

import pandas as pd

# Import the secrets from a local module
from cred import s3_minio_access_key, s3_minio_secret_key

bucket_name = "test-local-bucket"
file_name = "titanic.csv"
# file_name = "raw/kaggle/2022-04-01/titanic.csv"

df = pd.read_csv(
    filepath_or_buffer=f's3://{bucket_name}/{file_name}',
    escapechar="\\",
    storage_options={
        "key": s3_minio_access_key,
        "secret": s3_minio_secret_key,
        # https://github.com/mlflow/mlflow/issues/1990#issuecomment-659914180
        "client_kwargs": {"endpoint_url": "http://localhost:9000"},
    },
    compression='gzip'
)

print(df)
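
If you don't remember the exact path, you can first list what is stored under a prefix (a sketch using the MinIO client from the earlier snippets):

# List everything under the raw/ prefix, including nested "folders"
for obj in client.list_objects("test-local-bucket", prefix="raw/", recursive=True):
    print(obj.object_name, obj.size)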

In general, all the ways of using the MinIO Python client are covered by the examples in its official GitHub repository.

In this article I have shown only the basics of how you can interact with S3 for your pet projects.

In general, S3 is gaining popularity in data engineering because it is a fairly simple service by design and covers many of a data engineer's tasks.

S3 can also be extended with third-party tools:

  1. Apache Hudi

  2. LakeFS

  3. etc

Write in the comments what else you would like to know about S3 and services for data engineers.

Also, if you need a consultation, mentoring, a mock interview, or have other questions about data engineering, you can contact me. All contacts are listed at the link.
