How we built a bridge between Spark and Greenplum at ITSumma

In this article, ITSumma Lead Database Administrator Alexey Ponomarevsky will talk about how we integrated the popular distributed data processing framework Apache Spark with the powerful massively parallel Greenplum database.

The text will be useful for developers facing similar problems: integrating distributed processing frameworks with relational databases that rely on parallel computing.

Briefly, what the article covers:

  1. Why it was necessary to create a connector for interaction between Spark and Greenplum.

  2. How the connector was developed: architecture and stages.

  3. Further development and optimization of the solution.

  4. Best practices for developing such solutions.

The article is based on Alexey’s talk at our webinar; you can watch it here.

The need to create a connector

Apache Spark out of the box provides a number of connectors for interacting with various data sources, including relational databases, through the JDBC interface. This generic solution limits performance and scalability because data exchange occurs through a single master node of the Spark cluster, without enabling parallel computation on all nodes.
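For reference, here is a minimal sketch of what the out-of-the-box approach looks like: reading a Greenplum table through Spark's generic JDBC data source. The host, database, and table names are made up for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GenericJdbcRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("greenplum-jdbc-read")
                .getOrCreate();

        // Without explicit partitioning options, the generic JDBC source reads
        // the whole table over a single connection to the Greenplum master.
        Dataset<Row> events = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://gp-master:5432/analytics") // hypothetical endpoint
                .option("dbtable", "public.events")                          // hypothetical table
                .option("user", "spark")
                .option("password", "***")
                .load();

        events.show(10);
        spark.stop();
    }
}
```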

Therefore, to enable efficient distributed communication between Spark and Greenplum, a specialized connector is required that can take full advantage of the massively parallel computing capabilities of both systems. The development of this connector was driven by the need for high-performance data transfer between Spark and Greenplum, and our desire to create an open source solution with full control over the internal implementation.

Architecture and development stages

Efficient parallel data exchange between distributed systems such as Spark and Greenplum is a fairly complex engineering problem. In addition to coordinating and synchronizing the work of multiple nodes in the clusters, it is necessary to respect the transactional nature of Greenplum as an ACID-compliant database. This requires data buffering mechanisms and a two-phase transaction commit.
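To make the problem more tangible, here is a purely conceptual sketch (not the connector's actual code) of the contract a write coordinator has to fulfil to keep such an exchange transactional: every partition first buffers and prepares its data, and the transaction is committed only if all partitions succeed.

```java
// Conceptual sketch only: the shape of a two-phase commit driven from the Spark side.
public interface DistributedWriteCoordinator {

    // Phase 1: a partition writer flushes its buffered rows and reports whether it is ready to commit.
    boolean prepare(int partitionId) throws Exception;

    // Phase 2: commit only after every partition has prepared successfully...
    void commitAll() throws Exception;

    // ...otherwise roll everything back so Greenplum stays consistent.
    void rollbackAll() throws Exception;
}
```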

We initially expected to use some of Spark's built-in tools and libraries, but it turned out that this would not be enough to create a full-fledged connector. Therefore, a two-stage development plan was adopted:

1. Creation of our own tools and basic functionality for parallel data exchange.

2. Further development and optimization of the solution, including improving throughput and reducing latency.

In the first stage, the following key components were developed:

  • Classes for synchronizing and coordinating the interaction of Spark nodes (master and slave) based on the RMI remote procedure call mechanism (a minimal interface sketch follows this list).

  • Means of data exchange between nodes, built on top of the same RMI mechanism.

  • Components for efficient buffering of transmitted data using the “zero-copy” approach. A custom implementation of collections and an event model in Java was created.

  • Classes for converting data formats between the Spark, network-protocol, and Greenplum representations.

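As an illustration of the first component, here is a minimal sketch of what an RMI coordination contract between the Spark driver and its executors could look like. The interface name and methods are assumptions made for this article, not the connector's actual API.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical RMI contract: the driver exposes this remote interface,
// and executors call it to coordinate the parallel exchange.
public interface ExchangeCoordinator extends Remote {

    // An executor registers itself and the Greenplum segments it will serve.
    void registerWorker(String workerId, int[] segmentIds) throws RemoteException;

    // An executor asks for the next partition to transfer; -1 means nothing is left.
    int requestPartition(String workerId) throws RemoteException;

    // An executor reports that a partition has been transferred, so the driver
    // can decide when the distributed transaction may be committed.
    void partitionDone(String workerId, int partitionId, long rowsTransferred) throws RemoteException;
}
```
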
As a result, we obtained a basic working connector that provides parallel data exchange and scales both across the Spark cluster and on the Greenplum side. Throughput per Greenplum segment increased from the original 2-5 MB/s to 10 MB/s or more.

Further development and optimization

Having built a functional connector on our own tools, we began analyzing its performance and looking for ways to improve it. We assessed the “ideality” of the algorithm by the proportion of time it spends waiting for I/O to complete.
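As a rough illustration of this metric (our informal notion, not a standard Spark counter): the closer the share of wall-clock time spent blocked on I/O is to 1, the closer the exchange pipeline is to “ideal”, because CPU-side work no longer dominates.

```java
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.List;

// Illustrative sketch: measure what share of a transfer loop is spent blocked on I/O
// versus doing CPU-bound work such as (de)serialization.
public final class IdealityProbe {

    public static double measure(List<ByteBuffer> batches, WritableByteChannel channel) throws Exception {
        long ioWaitNanos = 0;
        long totalStart = System.nanoTime();
        for (ByteBuffer batch : batches) {
            long t0 = System.nanoTime();
            while (batch.hasRemaining()) {
                channel.write(batch);   // blocking network or disk write
            }
            ioWaitNanos += System.nanoTime() - t0;
            // CPU-bound work (e.g. serializing the next batch) stays outside the timed window
        }
        return (double) ioWaitNanos / (System.nanoTime() - totalStart); // close to 1.0 => I/O-bound pipeline
    }
}
```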

The analysis showed that the bottleneck limiting data transfer speed is the serialization and deserialization of data between the internal formats of Spark, Greenplum, and the network protocol. These operations are CPU-intensive and prevent full use of the network and disk subsystem bandwidth.

To reduce this overhead, we developed two approaches:

1. Increasing the number of CPU cores allocated to each Spark executor, or the total number of executors involved. This made it possible to speed up serialization and deserialization by a factor of 1.5-2 (see the configuration sketch after this list).

2. Eliminating intermediate data conversion steps by transmitting data in Java's native binary serialization format instead of a text representation. This will require changes on the Greenplum side, in particular support for Protocol Buffers and the use of the PXF extension.

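A hedged example of the first approach: give each executor more cores (or raise the number of executors) so that serialization and deserialization run in more threads in parallel. The configuration keys are standard Spark settings; the specific values are purely illustrative, not a recommendation.

```java
import org.apache.spark.sql.SparkSession;

public class ExecutorSizing {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("greenplum-connector-tuning")
                .config("spark.executor.cores", "4")      // more CPU cores per executor
                .config("spark.executor.instances", "8")  // or simply more executors
                .config("spark.executor.memory", "8g")
                .getOrCreate();

        // ... read from / write to Greenplum through the connector as usual ...

        spark.stop();
    }
}
```
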
According to our estimates, these optimizations will further increase the connector's throughput several times over, reaching values on the order of 50 MB/s per Greenplum segment. Further growth may not be advisable, as it would begin to negatively impact other users and unduly increase the load on the database.

Factors that helped develop the solution successfully

Developing efficient mechanisms for data exchange between heterogeneous distributed systems is a non-trivial engineering task. Here is what will help you tackle it effectively:

  • A deep understanding of the architecture and features of the interacting systems, at the level of concepts, protocols, data formats, and development practices.

  • Creation of tools that are as universal as possible and not strictly tied to the specific environments and frameworks used. This approach allows the results to be reused for other combinations of systems.

  • Applying the right criteria and metrics to identify performance bottlenecks at all stages of data processing and transmission.

  • Striving for the maximum possible parallelism, load balancing, and minimal data copying between components.

We see further development of the open-source connector between Spark and Greenplum in increasing its baseline performance and scalability, and in implementing additional features specific to these systems.
