The Kafka Summit 2023, held recently, brought together a diverse group of professionals, enthusiasts, and experts in the field of data streaming and event-driven architectures. This year’s summit was an exceptional gathering, filled with insightful discussions, cutting-edge demonstrations, and valuable networking opportunities. Of course, Bitrock’s Engineering team couldn’t miss the chance to attend and share the key insights from the event.
During the keynote presentation, Jay Kreps, Confluent Co-founder & CEO, presented a rundown of enhancements coming to Kafka over the next year and beyond.
After the removal of Zookeeper in favor of KRaft (KIP-866), available from Confluent Platform 7.4.0, another big surprise was the announcement of KIP-932, Queues for Kafka, which allows many consumers to read from the same partition, enabling use cases like classic pub/sub queues. This will be made possible by the introduction of share groups and per-record acknowledgment in the Kafka consumer protocol.
Jay also unveiled Confluent’s Kora Engine, the Apache Kafka engine built for the cloud. Kora is the engine that powers Confluent Cloud as a cloud-native, 10x Kafka service, bringing GBps+ elastic scaling, guaranteed reliability, and predictable low latency to 30K+ clusters worldwide.
Another important announcement made at Kafka Summit 2023 in London is the upcoming Apache Flink-powered stream processing offering in Confluent Cloud, expected in winter 2023. The recent acquisition of Immerok by Confluent has positioned the data streaming giant to offer both streaming storage (via Apache Kafka) and streaming computation (via Apache Flink) capabilities.
The current rebalancing protocol has several issues. One is that most of the logic lives on the client side (a “fat client”): session timeouts and intervals, for example, are defined client side. Its main pain point, however, is that the current protocol stops processing new messages (“stop the world”) while executing a rebalance, and faulty group members can cause issues for the whole consumer group. The new protocol was designed with three goals in mind: server side, consumer protocol, and incremental. The new reconciliation protocol has three main phases, listed below and followed by a small configuration sketch:
The group coordinator, on the server side, receives the partition assignments from the members and computes the new assignment when a new member joins.
The group coordinator communicates which partitions should be revoked, and the consumers acknowledge.
The revoked partitions can then be assigned to the new member of the consumer group.
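As an illustration, here is a minimal sketch of a plain Java consumer. The group.protocol setting shown below is an assumption based on the new consumer group protocol proposed in KIP-848 (it is only available as an opt-in in later Apache Kafka releases, and may not exist in the version you run); everything else is the standard consumer API.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewProtocolConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "orders-group");
        // Assumption: opt into the KIP-848 server-side rebalance protocol
        // (only available in recent Kafka releases).
        props.put("group.protocol", "consumer");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```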
During the evening party, our colleagues enjoyed a beer and international food while attending the performance by Sam Aaron, live coding musician and creator of Sonic Pi, whose futuristic music sets are improvised through the manipulation of live code.
The Kafka Summit 2023 was an outstanding event that showcased the advancements and future directions of the Kafka ecosystem, which continues to be a driving force in enabling real-time data streaming and event-driven architectures in an increasingly data-centric world.
Last month we had the chance to attend the amazing Kafka Summit 2022 event organized by Confluent, one of Bitrock’s key Technology Partners.
Over 1500 people attended the event, which took place at the O2 in east London over two days of workshops, presentations, and networking.
There was plenty of news regarding Kafka, the Confluent Platform, Confluent Cloud, and the ecosystem as a whole. It was an incredible opportunity to meet so many enthusiasts of this technology and discuss what is currently happening and what is on the radar for the near future.
Modern Data Flow: Data Pipelines Done Right
The opening keynote of the event was hosted by Jay Kreps (CEO @ Confluent). The main topic (no pun intended :D) of the talk revolved around modern data flow and the growing need to process and move data in near real time.
From healthcare to grocery delivery, many of the applications and services we use every day are based on streaming data: in this scenario, Kafka stands out as one of the main and most compelling technologies. The growing interest in Kafka is confirmed by the numerous organizations currently using it (more than 100,000 companies) and by the amount of interest and support the project is receiving. The community is growing year after year: Kafka meetups are very popular and interest keeps rising, as proven by the countless questions asked daily on StackOverflow and the large number of Jira tickets opened on the Apache Kafka project.
Of course, this success is far from accidental: while it is true that Kafka is a perfect fit for the requirements of modern architectures, it is also important to remember how many improvements have been introduced in the Kafka ecosystem, helping it earn the reputation of a very mature, reliable tool for building fast, scalable, and correct streaming applications and pipelines.
This can be seen, for instance, in the new features introduced in Confluent Cloud (the Confluent solution for managed Kafka) to enhance the documentation and monitoring of the streaming pipelines running in the environment, namely the new Stream Catalog and Lineage system. These two features provide an easy way to identify and search the different resources and data available in the environment, and to see how this data flows inside the system, improving the governance and monitoring of the platform.
The near future of Kafka - Upcoming features
Among the numerous upcoming features in the ecosystem presented during the event, there are some that we really appreciated and had been awaiting for quite some time.
One of these is KIP-516, which introduces topic IDs to uniquely identify topics. As you may know, since the very beginning - and this still holds today - the identifier for a topic has been its name. This has some drawbacks, such as the fact that a topic cannot be renamed (for instance, when you would like to update your naming strategy), since this would require deleting and recreating the topic, migrating its whole content, and updating all the producers and consumers that refer to that specific topic. An equally annoying issue arises when you want to delete a topic and then recreate another one with the same name, with the goal of dropping its content and creating the new one with different configurations. In this scenario too, we can currently face issues, since Kafka will not delete the topic immediately but will schedule a deletion that has to be propagated through the cluster, with no certainty about when the operation will actually be completed. This makes the operation, as of today, hard to automate (our consultants have often faced this limitation in some of our client projects).
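To make the pain point concrete, here is a minimal sketch (topic name and partition settings are hypothetical) of what “drop and recreate” currently looks like with the Java AdminClient: because deletion is asynchronous, the code has to poll until the topic name disappears before it can safely be recreated.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class RecreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            String topic = "orders";

            // Ask the cluster to delete the topic: this only *schedules* the deletion.
            admin.deleteTopics(Collections.singleton(topic)).all().get();

            // Poll until the name is no longer listed; there is no guarantee on how long this takes.
            while (admin.listTopics().names().get().contains(topic)) {
                Thread.sleep(500);
            }

            // Only now is it safe to recreate the topic with a different configuration.
            NewTopic recreated = new NewTopic(topic, 6, (short) 3);
            admin.createTopics(Collections.singleton(recreated)).all().get();
        }
    }
}
```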
The second long-awaited feature is the possibility to run Kafka without Zookeeper. At first, it was very useful and practical to take advantage of the distributed configuration management capabilities provided by Zookeeper (especially in processes like controller election or partition leader election). Over the past years, however, Kafka has incorporated more and more of these functionalities, and maintaining a Zookeeper cluster alongside the Kafka one now feels like an unnecessary effort, risk, and cost. As of today, this feature is not yet production-ready, but we can say that it’s pretty close. Confluent has shared the plan, and we are all waiting for this architecture simplification to arrive.
The third upcoming feature that we found extremely valuable is the introduction of modular topologies for ksqlDB. ksqlDB is relatively recent in the Kafka ecosystem, but it is gaining momentum thanks to its ability to express stream transformations with minimal effort and a simple SQL-like statement, without the need to create dedicated Kafka Streams applications, which require a good amount of boilerplate that later has to be maintained.
ksqlDB will not be able to replace the most complex Kafka Streams applications but, for a good number of use cases, it will be an excellent solution. The introduction of modular topologies will simplify the management of the streams inside ksqlDB, and it will also simplify its scalability (which is currently limited in some scenarios).
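For comparison, here is a minimal sketch of the kind of boilerplate a dedicated Kafka Streams application needs for even a trivial transformation (topic names and the filter condition are hypothetical); in ksqlDB, the same filter would be a single SQL-like statement over the source stream.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LargeOrdersApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "large-orders-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Keep only the records whose (string) value mentions a "LARGE" order.
        KStream<String, String> orders = builder.stream("orders");
        orders.filter((key, value) -> value != null && value.contains("\"size\":\"LARGE\""))
              .to("large-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```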
Our Insights from Breakout Sessions & Lightning Talks
The inner beauty of tech conferences lies in the talks, and Kafka Summit was no different!
During the event, it was not only the feature announcements that caught our attention, but also what was presented during the various breakout sessions and talks: an amazing variety of topics gave us plenty of opportunities to dig deeper into the Kafka world.
One of the sessions that we particularly enjoyed was, for sure, the one led by New Relic (“Monitoring Kafka Without Instrumentation Using eBPF”). The contribution focused on an interesting way of monitoring Kafka and Kafka-based applications using eBPF, without the need for instrumentation. The speaker, Antón Rodríguez, ran a cool demo of Pixie, in which it was very easy to see what is going on with the applications. It was also easy to get a graphical representation of the actual topology of the streams, with all the links from producers to topics and from topics to consumers, making it easy to answer questions like “Who is producing to topic A?” or “Who is consuming from topic B?”.
Another session that we particularly enjoyed was the talk by LinkedIn (“Geo-replicated Kafka Streams Apps”): Ryanne Dolan outlined some strategies for dealing with geo-replicated Kafka topics, in particular for Kafka Streams applications. Ryanne gave some precious tips on how to manage the replication of Kafka topics to a disaster recovery cluster to guarantee high availability in case of failure, and on how to develop Kafka Streams applications that work almost transparently both in the original cluster and in the DR one. The talk was also a great opportunity to highlight the high scalability of Kafka in a multi-datacenter scenario, where different clusters can coexist, creating a kind of layered architecture composed of a scalable ingestion layer that fans out the data to different geo-replicated clusters in a way that is transparent to the Kafka Streams applications.
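As a small illustration of the “works in both clusters” idea (our own sketch, not necessarily the approach presented in the talk), a Kafka Streams application can subscribe through a regular expression that matches both the local topic and its replicated counterpart, assuming MirrorMaker 2-style remote topic naming such as primary.orders:

```java
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class GeoReplicatedApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "geo-replicated-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Matches "orders" in the primary cluster and "primary.orders" in the DR cluster
        // (assuming MirrorMaker 2 prefixes replicated topics with the source cluster alias),
        // so the same application can be deployed unchanged in either cluster.
        KStream<String, String> orders = builder.stream(Pattern.compile("(primary\\.)?orders"));
        orders.to("orders-processed");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```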
Undoubtedly, the event has been a huge success, bringing the Apache Kafka community together to share best practices, learn how to build next-generation systems, and discuss the future of streaming technologies.
For us, this experience has been a blend of innovation, knowledge, and networking: all the things we missed from in-person conferences were finally back. It was impressive seeing people interact with each other after two years of social distancing, and we could really feel that “sense of community” that online events can only partially deliver.
If you want to know more about the event and its main topics - from real-time analytics to machine learning and event streaming - be sure to also check the dedicated Blog post by our sister-company Radicalbit. You can read it here.
Turning Data at REST into Data in Motion with Kafka Streams
From Confluent Blog
Another great achievement for our Team: we are now on the official Confluent Blog with one of our R&D projects based on Event Stream Processing.
Event stream processing continues to grow among business cases that have relied primarily on batch data processing. In recent years, it has proven especially prominent when the decision-making process must take place within milliseconds (e.g. in cybersecurity and artificial intelligence), when the business value is generated by computations on event-based data sources (e.g. in Industry 4.0 and home automation applications), and – last but not least – when the transformation, aggregation or transfer of data residing in heterogeneous sources involves serious limitations (e.g. in legacy systems and supply chain integration).
Our R&D team decided to start an internal PoC based on Kafka Streams and the Confluent Platform (primarily Confluent Schema Registry and Kafka Connect) to demonstrate the effectiveness of these components in four specific areas:
1. Data refinement: filtering the raw data in order to serve it to targeted consumers, scaling the applications through I/O savings
2. System resiliency: using the Apache Kafka® ecosystem, including monitoring and streaming libraries, in order to deliver a resilient system
3. Data update: getting the most up-to-date data from sources using Kafka
4. Optimize machine resources: decoupling data processing pipelines and exploiting parallel data processing and non-blocking IO in order to maximize hardware capacity (a small producer sketch follows below)
These four areas can impact data ingestion and system efficiency by improving system performance and limiting operational risks as much as possible, which increases profit margin opportunities by providing more flexible and resilient systems.
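As a small illustration of the non-blocking IO point above (a minimal sketch with a hypothetical topic name, not the project’s actual code), the plain Kafka producer API already sends records asynchronously and reports the outcome through a callback, so the calling thread never blocks on the broker round-trip:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NonBlockingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1_000; i++) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("refined-events", "key-" + i, "value-" + i);
                // send() is asynchronous: it buffers the record and returns immediately,
                // while the callback is invoked once the broker acknowledges (or rejects) it.
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("stored at partition=%d offset=%d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
            }
            producer.flush();
        }
    }
}
```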
At Bitrock, we tackle software complexity through domain-driven design, borrowing the concept of bounded contexts and ensuring a modular architecture through loose coupling. Whenever necessary, we commit to a microservice architecture.
Due to their immutable nature, events are a great fit as our unique source of truth. They are self-contained units of business facts and also represent a perfect implementation of a contract amongst components. The Team chose the Confluent Platform for its ability to implement an asynchronous microservice architecture that can evolve over time, backed by a persistent log of immutable events ready to be independently consumed by clients.
This inspired our Team to create a dashboard that uses the practices above to clearly present processed data to an end user—specifically, air traffic, which provides an open, near-real-time stream of ever-updating data.
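As a purely hypothetical sketch of what such a self-contained, immutable business fact could look like in Java (the type and field names below are illustrative, not the project’s actual schema), a record type captures the idea of an event as a contract between components:

```java
import java.time.Instant;

// A hypothetical, immutable event describing a single aircraft position update.
// Being a Java record, it is a self-contained value: once produced, it never changes,
// which is what makes events suitable as a shared source of truth between services.
public record FlightPositionEvent(
        String flightId,
        double latitude,
        double longitude,
        double altitudeMeters,
        Instant observedAt) {

    public static FlightPositionEvent sample() {
        return new FlightPositionEvent("AZ-1234", 45.4642, 9.1900, 10_500.0, Instant.now());
    }
}
```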
If you want to read the full article and discover all project details, architecture, findings and roadmap, click here: https://bit.ly/3c3hQfP.
In this three-day hands-on course, you will learn how to build, manage, and monitor clusters using industry best practices developed by the world’s foremost Apache Kafka experts.
You will learn how Kafka and the Confluent Platform work, their main subsystems, how they interact, and how to set up, manage, monitor, and tune your cluster.
Throughout the course, hands-on exercises reinforce the topics being discussed. Exercises include:
Basic cluster operations
Viewing and interpreting cluster metrics
Recovering from a Broker failure
Performance-tuning the cluster
Securing the cluster
This course is designed for engineers, system administrators, and operations staff responsible for building, managing, monitoring, and tuning Kafka clusters.
Attendees should have a strong knowledge of Linux/Unix, and understand basic TCP/IP networking concepts. Familiarity with the Java Virtual Machine (JVM) is helpful. Prior knowledge of Kafka is helpful, but is not required.
In this three-day hands-on course you will learn how to build an application that can publish data to, and subscribe to data from, an Apache Kafka cluster.
You will learn the role of Kafka in the modern data distribution pipeline, discuss core Kafka architectural concepts and components, and review the Kafka developer APIs. As well as core Kafka, Kafka Connect, and Kafka Streams, the course also covers other components in the broader Confluent Platform, such as the Schema Registry and the REST Proxy.
Throughout the course, hands-on exercises reinforce the topics being discussed. Exercises include:
Using Kafka’s command-line tools
Writing Consumers and Producers
Writing a multi-threaded Consumer
Using the REST Proxy
Storing Avro data in Kafka with the Schema Registry
Ingesting data with Kafka Connect
This course is designed for application developers, ETL (extract, transform, and load) developers, and data scientists who need to interact with Kafka clusters as a source of, or destination for, data.
Attendees should be familiar with developing in Java (preferred) or Python. No prior knowledge of Kafka is required.
The Motivation for Apache Kafka
Real-Time Processing is Becoming Prevalent
Kafka: A Stream Data Platform
An Overview of Kafka
Kafka’s Use of ZooKeeper
Kafka’s Log Files
Replicas for Reliability
Kafka’s Write Path
Kafka’s Read Path
Partitions and Consumer Groups for Scalability
Developing With Kafka
Using Maven for Project Management
Programmatically Accessing Kafka
Writing a Producer in Java