Open source solutions for managing and working with data in the cloud
Garage
Self-hosted object storage built around the S3 API. The project was launched in 2020 by the non-profit organization Deuxfleurs, which develops highly available, geo-distributed cloud infrastructure. It is therefore not surprising that Garage can store images, videos and documents across multiple nodes and is aimed at hyper-converged systems.
A request to retrieve data through the S3 API resolves to a specific object in a bucket. Objects are distributed across multiple nodes, and to keep them consistent Garage relies on a quorum of nodes. The tool's advantages include simplicity and performance, since it is undemanding in terms of hardware. Garage processes small files much faster than competing solutions such as MinIO.
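To make the quorum idea more concrete, here is a minimal sketch of why overlapping read and write quorums keep replicas consistent. The numbers (three replicas, two acknowledgements per operation) are illustrative assumptions, and the function name is hypothetical; this is not Garage's code.

```python
from itertools import combinations


def quorums_always_overlap(nodes: int, read_quorum: int, write_quorum: int) -> bool:
    """Return True if every possible read quorum shares at least one node
    with every possible write quorum (so a read always sees the latest write)."""
    all_nodes = range(nodes)
    for writers in combinations(all_nodes, write_quorum):
        for readers in combinations(all_nodes, read_quorum):
            if not set(writers) & set(readers):
                return False  # a read could be served by stale nodes only
    return True


# Assumed setup: 3 replicas, 2 acknowledgements for both reads and writes.
print(quorums_always_overlap(nodes=3, read_quorum=2, write_quorum=2))  # True
# With single-node reads, a replica that missed the write can answer alone.
print(quorums_always_overlap(nodes=3, read_quorum=1, write_quorum=2))  # False
```

The general rule the brute-force check confirms is that the read and write quorum sizes must add up to more than the number of replicas.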
However, there are some drawbacks. For example, some users point to the inconvenience of AWS Signature Version 4 (SigV4), the authentication protocol that Garage uses. In their opinion, it adds complexity to projects. In addition, the solution does not provide POSIX compatibility, and synchronization is implemented via network messages, which noticeably constrains how servers can be deployed.
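To give a sense of where the SigV4 complexity comes from, here is a sketch of just one step of the protocol: deriving the signing key through a chain of HMAC-SHA256 operations. The secret key is a placeholder and the region name "garage" is an assumption for illustration. Building the canonical request and the string to sign is omitted, and in practice an S3 client library does all of this for you.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date: str, region: str, service: str = "s3") -> bytes:
    """Derive the SigV4 signing key: each step keys the next HMAC."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)  # date as YYYYMMDD
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Final signature: hex HMAC of the string to sign under the derived key."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder credentials; the region value is an assumption.
key = derive_signing_key("EXAMPLE-SECRET-KEY", date="20240101", region="garage")
print(key.hex())
```

The derived key is only valid for one date, region and service, which is why every client has to repeat the derivation rather than reuse a static token.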
Garage's handling of deletions is also worth noting. Table entries and blocks are deleted with a delay. This waiting period prevents deleted data from reappearing after rebalancing. Improving on this approach would require a more complex mechanism that disables the garbage collector while data is being transferred, but it has not been implemented yet.
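The reason for the delay can be shown with a toy model. The sketch below is not Garage's implementation; the class names, the timestamp-based merge rule and the delay values are all assumptions made for illustration. It shows that if a deletion marker (tombstone) is garbage-collected too early, a replica that missed the delete can bring the data back during synchronization.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Entry:
    value: Optional[str]  # None marks a tombstone (a deleted entry)
    timestamp: float      # time of the last write or delete


class Replica:
    def __init__(self, gc_delay: float) -> None:
        self.gc_delay = gc_delay
        self.entries: Dict[str, Entry] = {}

    def put(self, key: str, value: str, now: float) -> None:
        self.entries[key] = Entry(value, now)

    def delete(self, key: str, now: float) -> None:
        self.entries[key] = Entry(None, now)  # keep a tombstone, not nothing

    def collect_garbage(self, now: float) -> None:
        """Drop tombstones that are older than the configured delay."""
        expired = [k for k, e in self.entries.items()
                   if e.value is None and now - e.timestamp >= self.gc_delay]
        for key in expired:
            del self.entries[key]

    def merge_from(self, other: "Replica") -> None:
        """Synchronize: for each key, the entry with the newer timestamp wins."""
        for key, theirs in other.entries.items():
            mine = self.entries.get(key)
            if mine is None or theirs.timestamp > mine.timestamp:
                self.entries[key] = theirs

    def get(self, key: str) -> Optional[str]:
        entry = self.entries.get(key)
        return entry.value if entry else None


def resurrected_after_sync(gc_delay: float) -> bool:
    stale, fresh = Replica(gc_delay), Replica(gc_delay)
    stale.put("block", "data", now=0.0)
    fresh.put("block", "data", now=0.0)
    fresh.delete("block", now=10.0)   # the stale replica misses this delete
    fresh.collect_garbage(now=11.0)
    fresh.merge_from(stale)           # rebalancing brings the old entry over
    return fresh.get("block") is not None


print(resurrected_after_sync(gc_delay=0.0))     # True: tombstone gone, data returns
print(resurrected_after_sync(gc_delay=3600.0))  # False: tombstone still blocks it
```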
The Garage developers provide detailed documentation and quick start guides to help with installation and setup. For example, they explain how to generate a first configuration file. There are also installation instructions for various operating systems: Alpine Linux, Arch Linux, FreeBSD and NixOS.
LinDB
A distributed time series database management system written in Go. The project was launched in 2016 by ELEME as a replacement for Graphite. Since then, the solution has been rewritten three times. The open source version appeared only in 2019.
LinDB can process over a million write operations per second on a single server, thanks to its data compression method and parallel query processing. The data is split into parts and placed on different shards. Each shard can have multiple replicas, which reduces the risk of data inconsistency. As for the key-value store, its structure is similar to an LSM tree, with some differences. For example, uint32 is chosen as the key type, since strings are converted to it based on time characteristics.
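The general idea of splitting data across shards with several replicas each can be sketched as follows. This is a generic illustration, not LinDB's actual routing logic: the CRC32 hash, the round-robin replica placement and all names here are assumptions.

```python
import zlib
from typing import List


def shard_for(series_key: str, num_shards: int) -> int:
    """Map a series key to a shard using a stable 32-bit hash."""
    return zlib.crc32(series_key.encode("utf-8")) % num_shards


def replicas_for(shard: int, nodes: List[str], replication_factor: int) -> List[str]:
    """Pick distinct nodes for a shard by walking the node list in order."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return [nodes[(shard + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
shard = shard_for("cpu.usage{host=web-1}", num_shards=8)
print(shard, replicas_for(shard, nodes, replication_factor=2))
```

Because the hash is stable, every write for the same series lands on the same shard, and each shard's replicas live on different nodes.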
Overall, LinDB scales easily and is suitable for deployment in a cloud infrastructure with multiple data centers. The solution is distributed free of charge under the Apache License 2.0. Detailed instructions for installing and configuring the system can be found in the official documentation in the project's GitHub repository.
Triplit
A database management system that also runs on the client side and is aimed at web developers, who can define schemas in TypeScript. All queries are updated in real time, both on the server and locally. Triplit also provides incremental updates and conflict resolution at the level of individual object properties, and thanks to CRDTs, clients can work offline and sync their data once they are back online.
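Property-level conflict resolution can be illustrated with a last-writer-wins register per property, one of the simplest CRDT designs. The sketch below is written in Python for consistency with the other examples in this article; it is not Triplit's implementation, and the record layout, timestamps and client names are assumptions.

```python
from typing import Any, Dict, Tuple

# Each property stores (timestamp, client_id, value); the client id breaks ties.
Versioned = Tuple[int, str, Any]
Record = Dict[str, Versioned]


def merge(a: Record, b: Record) -> Record:
    """Merge two copies of a record, keeping the newest write per property."""
    merged = dict(a)
    for prop, candidate in b.items():
        current = merged.get(prop)
        if current is None or candidate[:2] > current[:2]:
            merged[prop] = candidate
    return merged


# A phone edits the title while offline; a laptop marks the task as done.
phone: Record = {"title": (5, "phone", "Buy oat milk"), "done": (1, "phone", False)}
laptop: Record = {"title": (1, "laptop", "Buy milk"), "done": (7, "laptop", True)}

print(merge(phone, laptop))
```

Because the conflict is resolved per property rather than per object, both edits survive: the merged record has the new title from the phone and the completed flag from the laptop, regardless of the order in which the copies are merged.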
The solution is easy to integrate with existing projects: Triplit Server can be hosted independently, or you can use a managed instance in the cloud, which is offered by the developers. They plan to add an API for authentication, file uploads and user monitoring, similar to Supabase or Firebase.
However, there are also disadvantages, chief among them the lack of aggregation functions in the Triplit query language. Users also point to shortcomings in the accompanying documentation, which is hard to follow. For example, it is difficult to find information on generating an authentication token when hosting the server yourself. In any case, you can judge the quality of the documentation for yourself. It is available on the solution's website and includes instructions on setting up and working with Triplit.
By the way, in discussion threads Evolu is mentioned as an alternative to Triplit; it offers a query approach that many specialists find more familiar.
LitData
LitData is a library for scaling data processing tasks (data scraping, image resizing, embedding generation) in machine learning. The project is distributed under the MIT license.
LitData supports working with data stored in the cloud without downloading it locally first. The data is streamed while the ML model is training, and the optimize function converts it into an optimal format on the fly. The parallel processing function, map, deserves a separate mention:
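import os

from lightning_sdk import Machine
from litdata import map

# `download` (the function applied to each input) and `metadata`
# (the list of inputs) are assumed to be defined earlier.
map(
    download,
    inputs=metadata,
    output_dir="/teamspace/datasets/segment-anything/raw",
    num_workers=os.cpu_count(),
    machine=Machine.DATA_PREP,
    num_nodes=8,
)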
This is what makes it possible to run various operations on data, such as resizing images and generating vector representations, on several machines at once. Another key feature of the solution is the ability to pause data streaming during long training runs and resume it, without losses, exactly where it was interrupted.
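The pause-and-resume idea can be sketched with the standard library alone. The example below is a conceptual illustration, not LitData's API: the ResumableStream class, the JSON state file and the chunk names are all hypothetical. The stream records how many items it has handed out, so a new instance can continue from the saved position.

```python
import json
import os
import tempfile
from typing import Iterator, List, Tuple


class ResumableStream:
    """Iterates over items and can persist its position to a state file."""

    def __init__(self, items: List[str], state_path: str) -> None:
        self.items = items
        self.state_path = state_path
        self.position = 0
        if os.path.isfile(state_path):  # resume from a previous run
            with open(state_path) as f:
                self.position = json.load(f)["position"]

    def __iter__(self) -> Iterator[str]:
        while self.position < len(self.items):
            item = self.items[self.position]
            self.position += 1  # advance before handing the item out
            yield item

    def save(self) -> None:
        with open(self.state_path, "w") as f:
            json.dump({"position": self.position}, f)


def demo() -> Tuple[List[str], List[str]]:
    items = [f"chunk-{i}" for i in range(5)]
    with tempfile.TemporaryDirectory() as tmp:
        state_path = os.path.join(tmp, "stream_state.json")
        first_run: List[str] = []
        stream = ResumableStream(items, state_path)
        for item in stream:
            first_run.append(item)
            if len(first_run) == 3:  # simulate an interruption
                stream.save()
                break
        # A fresh instance picks up from the saved position.
        second_run = list(ResumableStream(items, state_path))
    return first_run, second_run


print(demo())
```

No item is processed twice and none is skipped: the first run covers the first three chunks and the second run covers the remaining two.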
More open source collections on our blog: