Comparison of operating costs

Spoiler alert: Redpanda is 10x faster and delivers 6x savings over Kafka.

Author: Tristan Stevens

In today’s economic climate, spending is a key topic for everyone. Total Cost of Ownership (TCO) should be a primary factor in evaluating the Return on Investment (ROI) of implementing a new software platform.

TCO is an aggregated cost that includes the costs of deploying, configuring, securing, provisioning, and operating software over its expected lifespan. This includes infrastructure costs, staff salaries, training and subscriptions.

For this comparison, we define TCO as a combination of the following components:

  • Infrastructure: Compute and storage costs, in this case on the AWS (Amazon Web Services) platform.

  • Administration: Costs for deployment, installation and support of clusters.

In this article, we explore the overall cost of running Apache Kafka® and Redpanda clusters for real-world streaming workloads under a self-hosted deployment model. We start by defining a cost model, test the physical characteristics of both systems using representative configurations (including security and Disaster Recovery (DR) considerations), and then estimate their infrastructure, administrative, and overall costs.

Let’s look into this.

Comparison of infrastructure costs

Determining throughput at a given latency

Redpanda is built on the premise that software should make full use of the hardware on which it is deployed. It is designed to saturate fast storage devices such as SSDs and NVMe drives, and to take advantage of multi-core processors and machines with large amounts of RAM. This allows it to achieve maximum performance when handling large volumes of data and requests.

We ran over 200 hours of tests with small, medium, and large workloads to build performance profiles for Kafka and Redpanda.

During our performance tests, we aimed for low, predictable end-to-end latency. We adjusted the number of nodes so that latency remained relatively stable, i.e. so that the system did not become overloaded even at high throughput.

Our performance benchmark report revealed the following:

  1. Redpanda’s median and P99+ end-to-end latency profiles remain remarkably stable even at high throughput.

  2. Kafka was unable to handle a load of 500 MB/s or more (a total throughput of 1 GB/s) on three nodes; the tests did not complete at the required performance.

  3. We had to repeatedly provision larger Kafka clusters to keep the latency profiles comparable, and even then P99.9 latency exceeded 200 ms on a cluster three times larger than Redpanda’s.

  4. For light workloads, Redpanda on low-cost AWS Graviton (ARM) instances showed a modest performance advantage, while Kafka was unable to run on them at all.

When developing the cost model, we needed to determine the cluster sizes that give Redpanda and Kafka comparable performance characteristics. Our goal was for 99.9% of operations to complete in under 20 milliseconds (P99.9 latency). The performance tests indicated the required cluster sizes: to match Redpanda’s latency, Kafka needed significantly more compute in some scenarios. At the baseline cluster size, Kafka’s latency was roughly 20 times higher than Redpanda’s, and to equalize latency and reach comparable performance we had to triple the number of nodes in the Kafka cluster relative to Redpanda.
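To make the sizing procedure concrete, here is a minimal Python sketch of how a cluster size can be selected from benchmark results against a latency target. The latency figures in it are invented placeholders for illustration, not our test data.

```python
# Hypothetical benchmark results: node count -> measured P99.9 latency in ms.
# These numbers are placeholders for illustration only.
KAFKA_P999_MS = {3: 400.0, 6: 65.0, 9: 19.0}
REDPANDA_P999_MS = {3: 14.0}

TARGET_P999_MS = 20.0  # 99.9% of operations must complete in under 20 ms


def smallest_cluster(p999_by_nodes: dict, target_ms: float) -> int:
    """Return the smallest node count whose P99.9 latency meets the target."""
    eligible = [nodes for nodes, p999 in sorted(p999_by_nodes.items()) if p999 <= target_ms]
    if not eligible:
        raise ValueError("no tested cluster size meets the latency target")
    return eligible[0]


print("Kafka nodes needed:   ", smallest_cluster(KAFKA_P999_MS, TARGET_P999_MS))     # 9
print("Redpanda nodes needed:", smallest_cluster(REDPANDA_P999_MS, TARGET_P999_MS))  # 3
```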

The table below summarizes the results of our performance analysis under these different conditions.

Figure 1: Comparison of infrastructure requirements for small, medium, and large workloads with a target latency profile.

One of the main advantages of using Redpanda is the ease of deployment. Because Redpanda is deployed as a single executable with no external dependencies, we don’t need any infrastructure for ZooKeeper or Schema Registry (SR). Redpanda also includes automatic partition and leader balancing capabilities so there is no need to run Cruise Control (a tool for managing and optimizing an Apache Kafka cluster).

In the cost model presented here, we account for infrastructure costs as follows. A certain number of brokers is needed to run the data-processing workload itself. In a Kafka environment, additional instances are also required for capabilities such as performance management, schema storage, and fault tolerance: Cruise Control for performance management, Schema Registry (two nodes for high availability), and ZooKeeper (three nodes) to keep these components running reliably.

So the total number of servers includes not only the brokers that process the workload directly, but also the additional nodes that provide this supporting functionality and quality of service.
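As a minimal illustration of this node accounting (the component counts come from the deployment just described; the broker counts are the ones used later in this article), consider the following Python sketch:

```python
def kafka_total_nodes(brokers: int,
                      cruise_control: int = 1,
                      schema_registry: int = 2,
                      zookeeper: int = 3) -> int:
    """Kafka brokers plus the auxiliary services the deployment carries."""
    return brokers + cruise_control + schema_registry + zookeeper


def redpanda_total_nodes(brokers: int) -> int:
    """Redpanda is a single binary: no ZooKeeper, Schema Registry, or Cruise Control nodes."""
    return brokers


# Example: the large (1 GB/s) workload needed nine Kafka brokers vs. three Redpanda brokers.
print("Kafka instances:   ", kafka_total_nodes(brokers=9))     # 15
print("Redpanda instances:", redpanda_total_nodes(brokers=3))  # 3
```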

Note: At the time of writing, Apache Kafka had added support for KRaft as a replacement for ZooKeeper in version 3.3. However, KRaft does not yet provide all the necessary capabilities, and we expect it will be some time before it is widely adopted, which is why we did not include KRaft in our cost model.

Light workload: 50 MB/s

For a small 50 MB/s workload, Redpanda and Kafka showed a similar performance profile on i3en.xlarge instances, although Redpanda outperformed Kafka on the smaller i3en.large instances.

That said, we could not fully utilize the i3en.large machines simply because the load was not heavy enough. By moving to AWS Graviton instances (based on the ARM architecture), we improved Redpanda’s performance at a significantly lower cost. As detailed in the blog post “Kafka performance vs. Redpanda”, Kafka was unable to run on Graviton instances at all.

The table below compares the cost of Kafka running on i3en.xlarge and Redpanda running on is4gen.medium instances.

Figure 2: Comparison of infrastructure costs for a 50 MB/s workload between Kafka and Redpanda.

Compared to running Redpanda on AWS Graviton instances, Kafka costs 3-4 times more. Kafka’s performance on i3en.large instances was worse than on i3en.xlarge, and also worse than Redpanda’s on the same hardware or on Graviton. Using Redpanda for this workload can save up to $12,969 per year.
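For reference, annual infrastructure figures like this one boil down to nodes multiplied by an hourly instance price and the hours in a year. The sketch below shows the arithmetic with placeholder on-demand prices; it is not official AWS pricing and will not reproduce the exact savings figure above.

```python
HOURS_PER_YEAR = 24 * 365


def annual_infra_cost(nodes: int, hourly_price_usd: float) -> float:
    """Yearly on-demand cost of a fleet of identical instances."""
    return nodes * hourly_price_usd * HOURS_PER_YEAR


# Placeholder hourly prices; look up current pricing for your region.
kafka = annual_infra_cost(nodes=3, hourly_price_usd=0.452)     # e.g. i3en.xlarge
redpanda = annual_infra_cost(nodes=3, hourly_price_usd=0.144)  # e.g. is4gen.medium
print(f"Kafka:    ${kafka:,.0f}/year")
print(f"Redpanda: ${redpanda:,.0f}/year")
print(f"Savings:  ${kafka - redpanda:,.0f}/year")
```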

Medium to large workloads: 500 MB/s and 1 GB/s

We got similar results for medium and large workloads. On identical hardware configurations (three nodes), Kafka was unable to handle the load with the required throughput, so we had to add nodes to get comparable results.

To keep Kafka’s tail latency within three times that of Redpanda for medium and large workloads, we had to scale Kafka out to nine nodes, which significantly increased the infrastructure cost.

The following tables compare Redpanda and Kafka with the required number of nodes to maintain throughput at reasonable latency thresholds. All of these tests were done on i3en hardware.

Figure 3: Comparison of infrastructure costs for a 500 MB/s workload between Kafka and Redpanda.

Figure 4: Infrastructure cost comparison for a 1 GB/s workload between Kafka and Redpanda.

In terms of infrastructure costs alone, you can expect savings of $80,000 to $150,000 depending on the scale of the workload, roughly a 3x saving compared to Kafka.

Comparison of administrative costs

Redpanda is designed with convenience and ease of use in mind (along with record-breaking performance). The following features of Redpanda contribute significantly to reducing the administrative burden compared to Kafka:

  1. Autotuning – Automatically determines the optimal settings for your hardware and configures Redpanda to take best advantage of the specific deployment.

  2. Leadership balancing – Improves cluster performance by distributing partition leadership across nodes (and across cores, so that multiple leaders do not overload the same cores at once).

  3. Continuous Data Balancing (new in version 22.2) – Automatically moves data away from nodes that are running out of disk space or have failed, preserving the performance of the cluster as a whole.

  4. Maintenance mode – Allows nodes to be taken out of service in a controlled, stepwise manner, transferring leadership to other nodes before shutdown (for example, to install patches or service disks).

  5. Rolling upgrades – Upgrade the cluster without interrupting producers or consumers.

In addition, Redpanda is designed with data safety in mind, as documented in the Jepsen report. Improved data safety significantly reduces the operational and management burden of running a Redpanda cluster, ultimately lowering overall costs in this area.

The Jepsen report highlights the main differences between Redpanda and Kafka, in particular the shortcomings of Kafka’s ISR (In-Sync Replicas) mechanism, which is responsible for data reliability and consistency in a distributed messaging system. If the ISR mechanism does not function properly, it can lead to data loss or unsafe leader election, affecting the system’s reliability and continuity. Redpanda does not have this vulnerability and is therefore much more stable in failure scenarios, in part because it has a single fault domain (whereas Kafka has two: the ISR and ZooKeeper/KRaft).

To make a rough comparison of the cost of Redpanda and Kafka, we relied on data provided by our customers. Because Redpanda does not need a JVM or ZooKeeper, customers have confirmed that they save time on partition balancing, JVM, ZooKeeper, and operating system tuning, and recovery from ISR failures.

We made the following assumptions based on direct customer feedback:

  • Running a 3-node Redpanda cluster for the small, medium, and large workloads does not add operational complexity and is well within the capabilities of a support team that also manages other platforms in parallel.

  • Running a 9-node Kafka cluster plus three ZooKeeper nodes at high throughput is considerably harder, with a higher likelihood of failures and a regular need for manual intervention.

Figure 5: SRE (Site Reliability Engineering) team cost comparison for a 50 MB/s workload between Kafka and Redpanda.

Site Reliability Engineering (SRE) is a methodology and approach to managing operational reliability and performance in information technology. SRE combines software development practices with systems administration practices to create systems that are reliable, scalable, and resilient.

Figure 6: SRE team cost comparison for a 500 MB/s workload between Kafka and Redpanda.

Figure 7: SRE team cost comparison for a 1 GB/s workload between Kafka and Redpanda.

TCO: Redpanda vs. Kafka

Even with relatively small workloads, using Kafka can be up to 3 times more expensive than using Redpanda. For large and complex workloads, the difference can grow to 5x or more.

In the consolidated cost model, we combine the infrastructure cost of hosting the main cluster with the administration cost, as described above.
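The aggregation itself is simple; a rough Python sketch follows. All dollar inputs below are placeholders for illustration rather than the figures behind the charts.

```python
def annual_tco(infra_usd: float, admin_fte: float, loaded_salary_usd: float) -> float:
    """Annual TCO = infrastructure cost + (fraction of an SRE's time * loaded salary)."""
    return infra_usd + admin_fte * loaded_salary_usd


SALARY = 200_000  # placeholder fully loaded annual cost of one SRE

kafka = annual_tco(infra_usd=300_000, admin_fte=1.0, loaded_salary_usd=SALARY)
redpanda = annual_tco(infra_usd=100_000, admin_fte=0.25, loaded_salary_usd=SALARY)
print(f"Kafka TCO:    ${kafka:,.0f}/year")
print(f"Redpanda TCO: ${redpanda:,.0f}/year")
print(f"Ratio:        {kafka / redpanda:.1f}x")
```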

This model does not take into account the cost of a backup disaster recovery (DR) site or the associated data transfer costs. In fairness, it should be noted that infrastructure costs would at least double in that case, since hosting a MirrorMaker 2 cluster on Kafka Connect requires additional resources. (Redpanda Enterprise, by contrast, can use S3-based replication – see our Highly Available (HA) Deployment blog post.)

Figure 8: Consolidated comparison of Kafka and Redpanda TCO for all workloads.

All of the cost metrics above compare Kafka with Redpanda Community Edition. According to this model, the combined savings in infrastructure and administrative costs range from $76,000 for light workloads up to $552,000 for heavy workloads, i.e. up to 6x.

Figure 9: Consolidated comparison of Kafka and Redpanda TCO for all workloads.

Estimated Additional Savings with Redpanda Enterprise

Redpanda Enterprise includes several features that further reduce the total cost of ownership (TCO) of a Redpanda cluster, even compared to commercial Kafka offerings. Among them are the Redpanda Console and tiered storage.

Tiered storage in Redpanda works by asynchronously publishing closed log segments to an S3-compatible object store such as AWS S3, GCS, Ceph, or MinIO, or a physical appliance such as Dell ECS, PureStorage, or NetApp ONTAP. Redpanda’s tiered storage provides two additional capabilities. First, it lets slow consumers read old offsets freely, at high throughput, without any client changes. Second, it allows read-only topics to be created on other clusters, which can be used for analytics, machine learning, or disaster recovery.

The Kafka community is also working on tiered storage under KIP-405; however, this effort has been ongoing for more than two years and is not yet complete. Some vendors ship their own tiered storage implementations, but these do not provide read-only replicas or the ability to restore a cluster for disaster recovery (DR). As a result, using Kafka in an active/passive DR topology still requires an additional Kafka Connect cluster running MirrorMaker.

The biggest cost savings come from sizing a cluster’s data storage intelligently. It is critical to find the right balance so that the amount of data stored stays within the available disk space: disks should not be overfilled just to keep all the data local, but the cluster should not be undersized either, so that performance holds up even under heavy load. The goal is a middle ground that avoids unnecessary cost while maintaining high performance.
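To get a feel for the numbers behind the next figure, the amount of disk a cluster needs is essentially write throughput times retention times replication factor, plus some free-space headroom. A small Python sketch follows; the 30% headroom is an assumption, not a figure from our tests.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_storage_tib(throughput_mb_s: float, retention_days: float,
                         replication_factor: int = 3, headroom: float = 0.30) -> float:
    """TiB of disk needed to retain the given number of days of data."""
    raw_bytes = (throughput_mb_s * 1e6) * SECONDS_PER_DAY * retention_days * replication_factor
    return raw_bytes * (1 + headroom) / 2**40


for mbps in (50, 500, 1000):
    print(f"{mbps:>5} MB/s, 3 days retention: {required_storage_tib(mbps, 3):8.1f} TiB")
```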

Figure 10: Daily storage requirement for different instance types with sustained workload throughput.

Since storage in S3 (Simple Storage Service, the managed object storage service provided by Amazon Web Services) is significantly cheaper than SSD/NVMe-based instance storage, tiering the data is the better option. This not only reduces cloud or infrastructure costs, but also reduces the operational complexity of running a large Kafka cluster purely for storage.
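A back-of-the-envelope comparison of the two storage options is sketched below. Both per-GB-month prices are placeholders (object storage pricing and the effective cost of instance-attached NVMe vary by region and instance family), so treat the output as illustrative only.

```python
def monthly_storage_cost(gb: float, usd_per_gb_month: float) -> float:
    """Monthly cost of keeping `gb` gigabytes at a flat per-GB-month price."""
    return gb * usd_per_gb_month


RETAINED_GB = 130_000  # e.g. roughly 3 days of a 500 MB/s workload before replication (placeholder)

s3 = monthly_storage_cost(RETAINED_GB, 0.023)   # placeholder object storage price
nvme = monthly_storage_cost(RETAINED_GB, 0.10)  # placeholder effective local NVMe cost
print(f"Object storage tier: ${s3:,.0f}/month")
print(f"Local NVMe:          ${nvme:,.0f}/month")
```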

The following tables compare Redpanda Enterprise and commercial Kafka with tiering (subject to the limitations described above, and bearing in mind that the Kafka cluster needs to be larger anyway to sustain the throughput) against Redpanda Community and Kafka without tiering.

For each workload, we estimate the potential infrastructure cost of retaining data for one, two, and three days, and compare it against Redpanda Enterprise with tiered storage enabled. This calculation shows the cost savings of Redpanda’s tiered storage compared to its open-source peers.

Figure 11: Comparison of annual infrastructure costs for three days of storage for a 50 MB/s workload (considering Kafka, Redpanda Community, Commercial Kafka, and Redpanda Enterprise).

Figure 12: Comparison of annual infrastructure costs for three days of storage for a 500 MB/s workload (considering Kafka, Redpanda Community, Commercial Kafka, and Redpanda Enterprise).

Figure 13: Comparison of annual infrastructure costs for three days of storage for a 1 GB/s workload (considering Kafka, Redpanda Community, Commercial Kafka, and Redpanda Enterprise).

The tables show that, without tiered storage, the additional storage costs in a cluster can be quite significant. The table below summarizes the results for all workload types.

Figure 14: Summary of incremental cost savings for Redpanda Enterprise versus Kafka (infrastructure costs only).

We see that savings from an Enterprise subscription can range from $70,000 to $1,200,000, and even higher for very large workloads or storage requirements. This does not account for the indirect savings from Redpanda Enterprise features such as the Redpanda Console with SSO and RBAC, remote read replicas, Continuous Data Balancing, and hot patching.

Conclusion

In this article, we compared the total cost of ownership (TCO) between Kafka and Redpanda based on our benchmarks on public cloud infrastructure.

Our main results:

  • Redpanda is 3-6 times more economical than Kafka on a comparable infrastructure and team footprint, while still delivering better performance.

  • Redpanda Enterprise offers a number of features designed to simplify cluster management. Redpanda’s tiered storage system provides infrastructure savings ranging from $70,000 to $1,200,000 depending on the load and cluster size.

Bottom line: Redpanda delivers up to 6x cost savings compared to Kafka. In addition, Redpanda’s flexible deployment options keep the deployment process simple: you can run Redpanda in Redpanda Cloud, in your own cloud environment, or on-premises, on physical hardware or in Kubernetes.

This means you have the flexibility to customize Redpanda to your needs and requirements, no matter where you plan to place your cluster. With this flexibility, you can maximize the cost-effectiveness and performance of your data cluster, making Redpanda an attractive solution for a variety of business scenarios.

