Last month we had the chance to attend the amazing Kafka Summit 2022 event organized by Confluent, one of Bitrock’s key Technology Partners.
Over 1500 people attended the event, which took place at the O2 in east London over two days of workshops, presentations, and networking.
Lots of news was given regarding Kafka, the Confluent Platform, Confluent Cloud, and the ecosystem altogether. An incredible opportunity to meet so many enthusiasts of this technology and discuss what is currently happening and what is on the radar for the upcoming future.
Modern Data Flow: Data Pipelines Done Right
The opening keynote of the event was hosted by Jay Kreps (CEO @ Confluent). The main topic (no pun intended :D) of the talk revolved around modern data flow and the growing need to process and move data in near real time.
From healthcare to grocery delivery, a lot of applications and services we use everyday are based on streaming data: in this scenario, Kafka stands as one of the main and most compelling technologies. The growing interest in Kafka is confirmed by the numerous organizations that are currently using it (more than 100.000 companies) and by the amount of interest and support that the project is receiving. The community is growing year after year: Kafka meetups are very popular and numerous people express a lot of interest about it, as proved by the countless questions asked on a daily basis on StackOverflow and the big amount of Jira tickets opened on the Apache Kafka project.
Of course, this success is far from accidental: if it is true that Kafka is a perfect fit for the requirements of modern architectures, it is also important to remember how many improvements were introduced in the Kafka ecosystem that helped create the image of a very mature, reliable tool when it comes to build fast, scalable, and correct streaming applications and pipelines.
This can be seen, for instance, in the new features introduced in Confluent Cloud (the Confluent solution for managed Kafka) to enhance the documentation and the monitoring of the streaming pipelines running in the environment with the new Stream Catalog and Lineage system. Those two features provide an easy-to-access way to identify and search the different resources and data available in the environment, and how this data flows inside the system improving the governance and monitoring of the platform.
The near future of Kafka - Upcoming features
Among all the numerous upcoming features in the ecosystem presented during the event, there are some that we really appreciated and we had been waiting for quite some time.
One of these is KIP-516, which introduces topic IDs to uniquely identify topics. As you may know since the very beginning - and this holds also today - the identifier for a topic is its name. This has some drawbacks, such as the fact that a topic cannot be renamed (for instance, when you would like to update your naming strategy), since this would be required both to delete and recreate the topic, migrating the whole content, and to update all the producers and consumers that refer to that specific topic. An equally annoying issue is when you want to delete a topic and then recreate another one with the same name, with the goal of dropping its content and creating the new one with different configurations. Also in this scenario, we can currently face issues, since Kafka will not immediately delete the topic, but will plan a deletion that needs to be spread through the cluster without the certainty on when this operation will be actually completed. This makes the operation, as of today, not automatable (our consultants have often faced this limitation in some of our client projects).
The second long-awaited feature is the possibility to run Kafka without Zookeeper. At first, it was very useful and practical to take advantage of the distributed configuration management capabilities provided by Zookeeper (this is specifically important in processes like controller election or partition leader election). During the past years, Kafka has started incorporating more and more functionalities and also maintaining a Zookeeper cluster, instead of just the Kafka one, which feels like an unnecessary effort, risk and cost. As of today, this feature is not yet production-ready, but we can say that it’s pretty close. Indeed, Confluent has shared the plan, and we are all waiting for this architecture simplification to arrive.
The third upcoming feature that we found extremely valuable is the introduction of modular topologies for ksqlDB. ksqlDB is relatively recent in the Kafka ecosystem, but it’s having a good momentum given its capability to easily write stream transformations with minimal effort and just an SQL-like command, without the need to create dedicated Kafka-Stream applications that will require a good amount of boilerplate that, later, have to be maintained.
ksqlDB will not be able to complete the detailed development of some Kafka-streams but, for a good amount of them, it will be an excellent solution. The introduction of modular topologies will simplify the management of the streams inside ksqlDB, and it will simplify its scalability (which is currently limited in some scenarios).
Our Insights from Breakout Sessions & Lightning Talks
The inner beauty of tech conferences lies in the talks, and Kafka Summit was no different!
During the event, indeed, not only the feature announcements caught our attention, but also what was presented during the various breakout sessions and talks: an amazing variety of topics gave us plenty of options to dig more into the Kafka world.
One of the sessions that we particularly enjoyed is, for sure, the one led by New Relic (“Monitoring Kafka Without Instrumentation Using eBPF”). The contribution focused on an interesting way of monitoring Kafka and Kafka-based applications using eBPF without the need for Instrumentation. Antón Rodríguez, as speaker, ran a cool demo of Pixie, in which it was very easy to see what is going on with our applications. It was also easy to get a graphical representation of the actual topology of the streams, and all the links between producers to topics, and topics to consumers, easing answering questions like “Who is producing to topic A?” or “Who is consuming from topic B?”.
Another session that we particularly enjoyed was the talk by LinkedIn (“Geo-replicated Kafka Streams Apps”): Ryanne Dolan outlined some strategies to deal with geo-replicated Kafka topics - in particular in case of Kafka streams applications. Ryanne gave some precious tips on how to manage the replication of Kafka topics in a disaster recovery cluster to guarantee high availability in case of failure, and on how to develop our Kafka streams application to work almost transparently in the original cluster and in the DR one. The talk was also a great opportunity to highlight the high scalability of Kafka in a multi-datacenter scenario, where different clusters can coexist creating some kind of layered architecture composed by a scalable ingestion layer that can fan out the data to different geo-replicated clusters in a transparent way for the Kafka streams applications.
Undoubtedly, the event has been a huge success, bringing the Apache Kafka community together to share best practices, learn how to build next-generation systems, and discuss the future of streaming technologies.
For us, this experience has been a blend of innovation, knowledge, and networking: all the things we missed from in-person conferences were finally back. It was impressive seeing people interact with each other after two years of social distancing, and we could really feel that “sense of community” that online events can only partially deliver.
If you want to know more about the event and its main topics - from real-time analytics to machine learning and event streaming - be sure to also check the dedicated Blog post by our sister-company Radicalbit. You can read it here.
Authors: Simone Esposito, Software Engineer @ Bitrock - Luca Tronchin, Software Engineer @ Bitrock