Kafka is Costing You Years of Engineering Time

Tom Schutte

Jun 15, 2024

7 Mins Read

[Image: The passage of time in an hourglass]

In today's data-driven world, real-time data streaming is essential for businesses seeking a competitive edge. Apache Kafka has emerged as a powerful tool for building data streaming solutions, enabling organizations to handle vast amounts of data with low latency and high throughput. However, even with managed services, the complexity of Kafka's ecosystem and the challenges of configuring, scaling, and managing Kafka clusters consume years of engineering time over the lifetime of an application. This complexity slows development teams down, forcing them to spend more time on data-streaming pipelines than on their core business applications.

At Ambar, we have heard numerous stories about the sharp edges of managing, configuring, maintaining, and utilizing Kafka. This post highlights the most common pain points.

The Complexity of Running a Kafka Cluster

The core component of Kafka is a distributed system of broker nodes that handle streaming traffic. Fault tolerance and traffic requirements often mean spreading multiple nodes across several server racks, so that network, power, and data redundancy keep your system up through a hardware failure. These nodes must have enough compute and network capacity to handle expected data rates at peak times plus the overhead of data replication. Additionally, expected data rates, retention settings, and replication factors require disks that are large and fast enough (sometimes with older data offloaded to object storage).
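To make the replication and disk math concrete, here is a minimal sketch of creating a topic with the Java AdminClient. The broker addresses, topic name, partition count, and retention values are placeholders rather than recommendations; your own data rates and redundancy needs dictate the real numbers.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed bootstrap addresses; point this at your own brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker-1:9092,broker-2:9092,broker-3:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions and a replication factor of 3 spread the topic across
            // brokers (and, in a rack-aware cluster, across racks).
            NewTopic orders = new NewTopic("orders", 12, (short) 3);

            // Retention and segment settings drive how much disk each broker needs.
            orders.configs(Map.of(
                    "retention.ms", "604800000",   // keep data for 7 days
                    "segment.bytes", "1073741824"  // roll log segments at 1 GiB
            ));

            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```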

Looking at the ends of the Kafka pipeline, i.e., producers and consumers, there are more considerations when writing your application. Do you write using the producer API or the Streams API? Should you configure your producers to batch messages or send them immediately? Do you care about delivery guarantees, and if so, what should your acknowledgment settings from Kafka brokers be? For consumers, you need to consider the number of partitions, how many consumers to put in a group, how long it takes before your application is ready for the next message, and how much lag your system can tolerate. Apache Kafka has a mature ecosystem that can help you arrive at a good-enough solution. Still, you must maintain your cluster, pick the correct sizes (for brokers, producers, and consumers), handle vertical and horizontal scaling, distribute helper libraries, deploy connectors and sinks, address correctness pitfalls, manage access control, and much more! Ultimately, your team will spend years on these problems.
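As an illustration of the consumer-side questions, here is a minimal consumer-group sketch using the standard Java client. The broker address, topic, and group id are assumptions, and the processing step is a stand-in for real business logic; the producer-side choices around batching and acknowledgments are sketched later, in the Data Loss section.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrdersConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        // Every consumer sharing this group id splits the topic's partitions among
        // themselves, so group size is bounded by the partition count chosen at creation.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        // Commit offsets manually, only after processing, to avoid silently skipping records.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // If a batch takes longer than max.poll.interval.ms to process, the broker
                // assumes the consumer is dead and rebalances the group.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // hypothetical business logic
                }
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}
```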

Let's dive deeper.

Management Overheads

Configuration Complexity

There are many considerations when choosing the hardware to run your cluster on, and many of them can also impact the settings you choose when configuring the Kafka software. How many IO and non-IO threads can your machine handle? How much system memory? Are you using SSDs or hard drives? What about JBOD or RAID? Configuring Kafka for optimal performance requires careful tuning of numerous parameters. These parameters influence various aspects of the system, such as replication factors, partitioning, and retention policies. Misconfigurations can lead to performance bottlenecks, data loss, or system instability. Engineers need to deeply understand these settings and their implications to avoid common pitfalls.
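One way to get a feel for how many of these knobs exist is to ask a broker for its live configuration through the AdminClient. In the sketch below, the broker address and broker id are assumptions, and the handful of parameters printed is only a tiny slice of the full list.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class InspectBrokerConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Broker id "0" is an assumption; use the ids from your own cluster.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Map<ConfigResource, Config> configs =
                    admin.describeConfigs(List.of(broker)).all().get();

            // A few of the many settings that must match the underlying hardware.
            for (String name : List.of("num.network.threads", "num.io.threads",
                    "log.dirs", "log.retention.hours", "default.replication.factor")) {
                System.out.println(name + " = " + configs.get(broker).get(name).value());
            }
        }
    }
}
```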

Sizing and Scaling

Scaling Kafka clusters to handle increased data loads is another significant challenge. As data volumes grow, Kafka clusters need to scale horizontally by adding more brokers or vertically by increasing the compute power of nodes. Scaling also requires rebalancing data across brokers, managing resource utilization, and keeping the system performant while those changes are in flight. Improper scaling results in degraded performance, increased latency, and higher operational costs. Worse, it can cause downtime: when brokers become unavailable under an unmanageable load shift, the shifted load can cascade through the remaining brokers and take the entire cluster offline.
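For a sense of what rebalancing looks like in practice, the sketch below moves a single partition onto a newly added broker via the AdminClient's partition-reassignment API. The topic, partition, and broker ids are assumptions; a real migration plans reassignments for many partitions at once and throttles the replication traffic they generate.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class MovePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // After adding broker 4 to the cluster, move one partition of "orders" onto it.
            TopicPartition partition = new TopicPartition("orders", 0);
            NewPartitionReassignment target = new NewPartitionReassignment(List.of(2, 3, 4));

            // The brokers copy the partition's data to the new replicas in the background;
            // that replication traffic competes with production load while it runs.
            admin.alterPartitionReassignments(
                    Map.of(partition, Optional.of(target))).all().get();
        }
    }
}
```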

Maintenance and Monitoring

Keeping a Kafka cluster alive also involves multiple considerations, including continuous monitoring of system health, performance metrics, and resource utilization. Engineers need to set up robust monitoring solutions to detect and resolve issues promptly, which requires expertise in monitoring tools, log analysis, and alerting mechanisms. Engineers must also develop plans and runbooks to update the cluster as security and feature improvements become available.
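Consumer lag is one of the metrics such monitoring has to track. The following sketch computes it with the AdminClient by comparing a group's committed offsets against the latest log offsets; the broker address and group id are assumptions, and a production setup would export this to a metrics and alerting system rather than print it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for an assumed consumer group.
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("order-processors")
                    .partitionsToOffsetAndMetadata().get();

            // Latest log-end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(request).all().get();

            // Lag = end of log minus committed position; this is the number alerts watch.
            committed.forEach((tp, offset) -> System.out.printf("%s lag=%d%n",
                    tp, latest.get(tp).offset() - offset.offset()));
        }
    }
}
```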

Data Loss

Data loss is another critical risk. Kafka's primary function is to ensure the reliable delivery and durable storage of data. However, misconfigurations, hardware failures, or software bugs can lead to data being lost or not delivered to the intended consumers. Ensuring data integrity and reliability requires meticulous configuration and monitoring. Requirements for higher levels of correctness, such as ordering or delivery guarantees, can further complicate system configuration and increase the likelihood of a misconfiguration leading to message delivery failures or out-of-order delivery.
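The producer sketch below shows one hedged combination of settings aimed at durability and ordering: acknowledgments from all in-sync replicas, idempotence so retries cannot duplicate or reorder writes within a partition, and a callback so failed sends are not silently dropped. The broker address, topic, and key are assumptions, and topic-level settings such as min.insync.replicas still have to be configured to match.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableOrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Wait for every in-sync replica to acknowledge before a write counts as successful.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Idempotence keeps retries from duplicating or reordering messages within a partition.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by order id keeps all events for one order in a single partition, in order.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "order-42", "{\"status\":\"created\"}");

            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    // Without a callback (or a blocking get), a failed send is silently lost.
                    System.err.println("send failed: " + exception.getMessage());
                }
            });
            producer.flush();
        }
    }
}
```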

Development Overheads

Java-Centric Library Support

You'll also need to consider that Kafka is designed for the Java ecosystem, which can be a limiting factor for organizations primarily using other programming languages. The Java-centric nature means that engineers must be proficient in Java to utilize Kafka's capabilities fully. Libraries such as librdkafka provide interoperability, but they fall short for languages that struggle with long-lived processes, such as Python, PHP, Ruby, Perl, and JavaScript. For teams without strong Java expertise, this leads to retrofitting subpar third-party (possibly paid) libraries, longer development times, and steeper learning curves.

Extensive Codebases

Implementing a data streaming pipeline with Kafka requires writing a significant amount of code. Engineers need to develop custom producers and consumers, handle data serialization and deserialization, and manage error handling and retries. Even with the help of third-party libraries, the codebase becomes extensive and complex.
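As a small example of that glue code, here is a sketch of a custom JSON deserializer with explicit error handling, assuming the Jackson library is on the classpath. The OrderEvent type is hypothetical; a real pipeline repeats this pattern for many event types and still needs a retry or dead-letter strategy around it.

```java
import java.nio.charset.StandardCharsets;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;

// A hypothetical event type; real pipelines accumulate dozens of these.
class OrderEvent {
    public String orderId;
    public String status;
}

public class OrderEventDeserializer implements Deserializer<OrderEvent> {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public OrderEvent deserialize(String topic, byte[] data) {
        if (data == null) {
            return null;
        }
        try {
            return mapper.readValue(data, OrderEvent.class);
        } catch (Exception e) {
            // A malformed message must not crash the consumer loop; surfacing it here
            // still leaves the retry or dead-letter handling to be written elsewhere.
            throw new SerializationException(
                    "Could not deserialize record from topic " + topic + ": "
                            + new String(data, StandardCharsets.UTF_8), e);
        }
    }
}
```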

Code Maintenance Burden

Maintaining a large codebase adds to the engineering overhead. Teams must manage updates, bug fixes, and feature enhancements effectively. Over time, technical debt accumulates, making the system harder to maintain and evolve. This ongoing maintenance burden slows the delivery of new features and drags down overall productivity.

Conclusion

Apache Kafka is a powerful tool for building real-time data streaming solutions. Still, choosing it presents significant engineering challenges. The complexity of its ecosystem, coupled with the need for precise configuration, scaling, and maintenance, can overload an engineering team, leading to mismanagement, application downtime, and data loss. Moreover, the extensive codebase required to interact with Kafka significantly increases the burden on engineering teams. All in all, teams must budget to spend years of engineering time on these problems.

Ambar addresses these problems out of the box, with the fastest integration experience in the market and no maintenance burden for engineering teams. But that's a story for another post.

Over $3B in Transactions Processed

Discover what Ambar could do for you!