
Insights from Kafka Summit 2024


As a Confluent partner, we could not miss the opportunity to join the other 3,000 participants (virtually and in person) at Kafka Summit 2024 in London. The summit was a chance to get the latest insights into new features and future developments of Kafka as a streaming technology, and to keep learning so that we can offer our customers the best solutions for their Apache Kafka use cases.

Intro and Keynote

The keynote, delivered by a team of speakers from Confluent, covered developments in both Apache Kafka and their own platform offerings. This year, Apache Kafka reached its 1000th KIP, a strong indicator of the vibrancy and commitment of the community to the development of Kafka as a streaming platform. Other Kafka milestones celebrated in the past year include new versions 3.6 and 3.7, the release of an official Apache Kafka Docker image, and new features such as Tiered Storage (KIP-405, in early access) and client-side metrics (KIP-714). Among the upcoming features in development, one of particular interest is Queues for Kafka (KIP-932), still in the voting stage, which will allow consumers to share partitions when consuming data, removing the current parallelism limit of one consumer per partition.
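To see why Queues for Kafka matters, here is a toy sketch of the difference between classic consumer-group assignment and the share-group idea of KIP-932. None of these function names come from a real Kafka client library; this is a pure-Python illustration of the assignment logic only.

```python
# Sketch (not a real Kafka API): classic consumer groups cap parallelism
# at the partition count, while a share group (KIP-932) lets any free
# consumer take records from any partition.

def classic_assignment(partitions, consumers):
    """Classic consumer group: each partition goes to exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    # Consumers beyond the partition count stay idle.
    return assignment

def share_group_dispatch(records, consumers):
    """Share group idea: records are handed to whichever consumer is free,
    so more consumers than partitions can still share the load."""
    dispatched = {c: [] for c in consumers}
    for i, r in enumerate(records):
        dispatched[consumers[i % len(consumers)]].append(r)
    return dispatched

partitions = ["p0", "p1", "p2"]
consumers = ["c0", "c1", "c2", "c3", "c4"]
classic = classic_assignment(partitions, consumers)
print(classic)  # c3 and c4 get nothing: only 3 partitions exist
share = share_group_dispatch(list(range(10)), consumers)
print(share)    # all five consumers receive records
```

With three partitions and five consumers, the classic model leaves two consumers idle; the share-group model keeps all five busy.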

The Confluent team presented their Confluent Cloud platform as a unified warehouse for data products across operations and analytics, with all data stored in Kafka topics.

Several of their demonstrations showed ways to manage this all-Kafka data system, including through a map of stream processing that they provide (the “Stream Lineage” view) and a centralized, searchable hub for both technical and business metadata about topics (the “Stream Catalog”) that allows developers to find which data resource is most relevant to their needs.
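The idea behind a searchable topic catalog can be reduced to a few lines. This toy sketch uses invented topic names and fields and does not touch the actual Stream Catalog API; it only illustrates searching technical and business metadata together.

```python
# Toy sketch of a searchable topic catalog (illustrative data only,
# not the Confluent Stream Catalog API).
topics = [
    {"name": "orders.v1", "owner": "sales", "tags": ["pii", "orders"]},
    {"name": "clicks.raw", "owner": "web", "tags": ["clickstream"]},
    {"name": "orders.enriched", "owner": "analytics", "tags": ["orders"]},
]

def search(catalog, term):
    """Match the term against topic names and business tags alike."""
    return [t["name"] for t in catalog
            if term in t["name"] or term in t["tags"]]

print(search(topics, "orders"))  # ['orders.v1', 'orders.enriched']
```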

Perhaps the most interesting aspect of Confluent Cloud’s development is the integration of business-oriented Kafka data and analytics-oriented Parquet data, consumed and managed by Apache Iceberg. Its “TableFlow” functionality allows Kafka data to be stored in the cloud as Parquet files, accessible as a data lake through Iceberg, providing both table- and stream-based views of the same data. Their goal is to make this fully bi-directional, so that updates from both Iceberg and Kafka are visible on either side, unifying the two data processing regimes and making operational and analytical data widely available across an organization’s data ecosystem.
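The stream/table duality that TableFlow exposes can be sketched in a few lines of plain Python. This is only the concept in miniature, with invented data; the real feature materializes Kafka topics as Parquet files managed by Apache Iceberg.

```python
# Illustrative sketch of stream/table duality: the same keyed changelog
# can be read as a stream (every update) or compacted into a table
# (latest value per key). Data and names are invented for illustration.

events = [  # a keyed changelog, as it would appear in a Kafka topic
    ("user-1", {"plan": "free"}),
    ("user-2", {"plan": "pro"}),
    ("user-1", {"plan": "pro"}),
]

def as_stream(changelog):
    """Stream view: every change is an individual record."""
    return list(changelog)

def as_table(changelog):
    """Table view: only the latest value per key survives."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table

print(len(as_stream(events)))      # 3 records in the stream view
print(as_table(events)["user-1"])  # {'plan': 'pro'} in the table view
```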

Components Customization

The Confluent platform provides a set of components that lets an organization focus its efforts on developing its business logic, using the platform components as a baseline for asynchronous communication. However, the various components within the Confluent platform can be completely replaced with alternatives that provide specific value. The main value propositions are:

Cost Reduction

  • For use cases where certain performance requirements can be relaxed, WarpStream presented its approach to reducing the cost of a Kafka cluster: a cloud-native architecture that stores the data in S3. The Kafka brokers have been refactored to eliminate most of the cost by exploiting the economy, reliability, scalability, and elasticity that S3 offers.
  • Without any trade-off: RedPanda offers a streaming platform built on the same abstractions Confluent uses for Kafka (an infinite log, topics split into partitions, and the pub-sub model). They claim to be “6x more cost-effective than Apache Kafka—and 10x faster”. After all, their tech stack is not JVM-based, and instead of a coordination service like Apache ZooKeeper it has Raft embedded in it (as Kafka itself now does with KIP-500).

Performance Enhancement

  • For particular contexts: Responsive offers a stateful stream processing library that can be swapped in for Kafka Streams. The main difference is that the state stores are remote, allowing the application itself to be stateless. This removes the painful rebalancing operation when scaling a fleet of microservices that implement stateful stream processing.
  • Without any trade-off: Pulsar is a streaming platform comparable to the Confluent platform. It has a more complex architecture than Kafka, with specialized components such as BookKeeper, and it still relies on ZooKeeper. This complexity may yield better performance, but it also comes with greater management effort.
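The remote-state idea Responsive describes can be sketched as follows. Every class and function name here is invented for illustration (this is not Responsive's API): the point is that when state lives in an external store, a processing instance holds nothing locally and can be replaced or scaled without migrating state.

```python
# Illustrative sketch: with state in a remote store, the worker is
# stateless, so scaling it does not require a state rebalance.

class RemoteStateStore:
    """Stands in for an external store (e.g. a remote key-value DB)."""
    def __init__(self):
        self._data = {}
    def get(self, key, default=0):
        return self._data.get(key, default)
    def put(self, key, value):
        self._data[key] = value

def count_events(store, events):
    """A stateless worker: all counters live in the remote store."""
    for key in events:
        store.put(key, store.get(key) + 1)

store = RemoteStateStore()
count_events(store, ["a", "b", "a"])
# A brand-new worker instance resumes from the same remote state:
count_events(store, ["a"])
print(store.get("a"))  # 3
```

Because the second `count_events` call could run in a completely different process, restarts and scale-outs need no local state recovery.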

Most of the customizations presented can be seen as refactorings of Kafka components to take advantage of cloud-native services, and it is likely that Confluent has done the same with Kora: enhancing scalability and lowering the bill for its customers.

Our Favorite Talk: “Attacking (and Defending) Apache Kafka”

Integration systems such as Apache Kafka often take on the burden of transporting private, sensitive data. In the hands of someone with bad intentions, that data could lead to serious damage: not only through the theft of information about identifiable individuals, but also through the manipulation of “command messages” that could allow an attacker to influence or control the systems receiving that information.

As Francesco Tisiot explained in his interesting talk “Attacking (and Defending) Apache Kafka”, which outlined the system’s most susceptible points and the potential consequences of various attacks, proactive steps can be taken during the design phase to safeguard your data.

Kafka is:

  • Distributed: there are many nodes (brokers) that can be attacked, and the network between them as well. For example, it is easy to target the inter-broker network or the nodes themselves with a DDoS attack, or to use a man-in-the-middle to sniff traffic. SSL/TLS, data masking, and encryption should nowadays be included by design.
  • A real-time or near-real-time streaming tool: it means that everything happens fast, including the damage. In the event of an attack, time to react is critical. Monitoring should be enabled at an early stage.
  • Flexible: Kafka does not really care about the structure of the data passing through it, so it is important to define a strict structure for the data and sanitize it at every step. Otherwise, a carefully crafted message can break the soft rules of the target database or issue unwanted commands to the target system by exploiting its business logic.
  • Integratable: Kafka is often so easy to integrate that a configuration file is all that is needed, and that file carries a lot of credentials and infrastructure information worth protecting.
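The advice to define a strict structure and sanitize at each step can be sketched as a minimal validation gate. The schema and field names below are invented for illustration; this is not a real Kafka deserializer or interceptor, just the checking logic.

```python
# Minimal illustration of validating message structure before a record
# reaches downstream systems (hypothetical schema, invented fields).

REQUIRED = {"order_id": int, "amount": float}

def validate(message: dict) -> bool:
    """Accept only messages with exactly the expected, correctly typed fields."""
    if set(message) != set(REQUIRED):
        return False  # missing or extra fields are rejected outright
    return all(isinstance(message[f], t) for f, t in REQUIRED.items())

print(validate({"order_id": 1, "amount": 9.5}))                 # True
print(validate({"order_id": 1, "amount": 9.5, "cmd": "x"}))     # False: extra field
print(validate({"order_id": "1; DROP TABLE", "amount": 9.5}))   # False: wrong type
```

Rejecting extra fields, not just checking the required ones, is what blocks the “unwanted command” class of payloads described above.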

“Event Modeling Anti-patterns”

A complex part of creating any software artifact is modeling its domain. In the case of an integration system in particular, it is important to know how to model the events that will occur and how to handle errors, and often well-designed and well-known design patterns come to the rescue.

In his entertaining talk “Event Modeling Anti-patterns”, Oskar Dudycz described the process of modeling events “by contradiction”: starting from very “static” designs, often used when technology was also “static” (a kind of “database-driven architecture”, as Oskar called it), and arriving at more modern ways of modeling events so that they can flow through an event-driven architecture (which can be easily implemented using Kafka), carrying the information that really matters for subscribers’ business logic.

What should really be modeled, once their semantics are understood, are three types of messages: Commands, Events, and Documents. In general, events cannot simply be the information taken from the database write log (creation, update, …), i.e. CRUD sourcing, as such events do not carry enough information for the business logic to act on. At the other extreme is so-called “property sourcing”, where an event is sent for every property change, flooding the architecture with events. As Oskar explained, it is important to study how to group only the relevant information needed to implement the semantics of a specific type of event.
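The contrast between property sourcing and a well-modeled domain event can be shown in a few lines. The event names and fields below are invented for illustration, not taken from the talk.

```python
# Sketch: property sourcing emits one event per changed field,
# while a semantic event groups the facts subscribers actually need.
# All names here are hypothetical examples.

updates = {"status": "shipped", "carrier": "DHL", "tracking": "XY123"}

# Property sourcing: one fine-grained event per changed field.
property_events = [
    {"type": "OrderPropertyChanged", "field": k, "value": v}
    for k, v in updates.items()
]

# Semantic event: one event carrying everything relevant to the fact
# "the order was shipped", which is what the business logic cares about.
semantic_event = {"type": "OrderShipped", **updates}

print(len(property_events))    # 3 fine-grained events
print(semantic_event["type"])  # OrderShipped
```

Three property-change events force every subscriber to reassemble the fact “the order shipped”; the single semantic event states it directly.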

Another important aspect of implementing these architectures is to enumerate all the errors that can occur and understand how to deal with them. Again, many design patterns are available in the literature. As Cedric Schaller explained in his talk “Error Handling with Kafka: From Patterns to Code”, given the nature of the Apache Kafka architecture, these are the most common errors that need to be handled in a robust implementation (ideally without invalidating Kafka’s good properties by imposing overly strict management):

  • Validation Errors: solving this kind of error is as simple as implementing a data validation mechanism
  • 3rd Party System Errors (database, web services, etc.)
  • Kafka Infrastructural Issues (nodes not available in the cluster, data loss due to insufficient replication, etc.)
  • Write Issues (loss of ordering or transactionality, write failures, etc.)
  • Consumption and Processing Issues (poison pills, application exceptions while processing data, etc.)

As Cedric explained, the first step toward a solid implementation that prevents some of these errors is a strong, reliable, customized infrastructure configuration. After securing the infrastructure, it is important to choose the right pattern for the logical requirements. Cedric explained and demonstrated the patterns in detail: “stop on error”; “dead letter topic”, to discard unrecoverable messages; “retry topic”, to reprocess some messages; and “maintaining the order of events”, to queue broken messages in order so they can be reprocessed later (manually, if need be).
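Two of those patterns, the retry topic for transient failures and the dead letter topic for poison pills, can be sketched with in-memory lists standing in for Kafka topics. All names here are illustrative; this is the routing logic only, not code from the talk or any Kafka client.

```python
# Illustrative sketch of "retry topic" + "dead letter topic" routing.
# Plain lists stand in for Kafka topics.

retry_topic, dead_letter_topic, processed = [], [], []

def handle(record):
    """A hypothetical processing step that can fail two ways."""
    if record == "poison":
        raise ValueError("malformed payload")      # unrecoverable
    if record == "flaky":
        raise TimeoutError("downstream unavailable")  # transient
    return record.upper()

def process(record, attempts=0, max_retries=2):
    """Route a record: success, retry later, or dead letter."""
    try:
        processed.append(handle(record))
    except ValueError:            # poison pill: park it for inspection
        dead_letter_topic.append(record)
    except TimeoutError:          # transient: requeue until retries run out
        if attempts < max_retries:
            retry_topic.append((record, attempts + 1))
        else:
            dead_letter_topic.append(record)

for r in ["ok", "poison", "flaky"]:
    process(r)
print(processed, dead_letter_topic, [r for r, _ in retry_topic])
# ['OK'] ['poison'] ['flaky']
```

Separating unrecoverable from transient failures is what keeps the main topic flowing without silently dropping data.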


Kafka Summit 2024 in London buzzed with advancements in Apache Kafka and Confluent Cloud. Attendees learned about new Kafka versions, features, and the exciting Queues for Kafka. Confluent Cloud impressed with its data unification vision using Stream Lineage, Stream Catalog, and TableFlow.

The summit also highlighted the flexibility of the Confluent platform. Solutions like WarpStream, RedPanda, and Pulsar offer customization for cost reduction and performance enhancement.

Security was a key theme, with experts emphasizing proactive measures like encryption, data validation, and monitoring. Additionally, best practices for event modeling and robust error handling were addressed.

By embracing these advancements, organizations can leverage the power of Kafka and Confluent Cloud to build secure and efficient data pipelines.

Main Authors: Giovanni Morlin, Mario Taccori, Bradley Gersh, Software Engineers @ Bitrock
