The recent hype surrounding Apache Flink especially after the Kafka Summit 2023 in London sparked our curiosity and prompted us to better understand the reasons for such enthusiasm.  Specifically, we wanted to know how much Flink differs from Kafka Streams, the learning curve, and the use cases where these technologies can be applied. Both solutions offer powerful tools for processing data in real-time, but they have significant differences in terms of purpose and features.

Processing Data: Apache Flink vs Kafka Streams

Apache Flink is an open-source, unified stream and batch data processing framework. It  is a distributed computing system that can process large amounts of data in real-time with fault tolerance and scalability.

On the other hand, Kafka Streams is a specific library built into Apache Kafka that provides a framework for building different applications and microservices that process data in real-time.

Kafka Streams provides a programming model that allows developers to define transformation operations over data streams using DSL-like functional APIs. This model is based on two types of APIs: the DSL API and the Processor API.  The DSL API is built on top of the Processor API and is recommended especially for beginners. The Processor API is meant for advanced applications development and involves the employment of low-level Kafka capabilities. 

Kafka Streams was created to provide a native option for processing streaming data without the need for external frameworks or libraries. A Kafka Streams job is essentially a standalone application that can be orchestrated at the user’s discretion. 

Main Differences

As previously mentioned, Flink is a running engine on which processing jobs run, while Kafka Streams is a Java library that enables client applications to run streaming jobs without the need for extra distributed systems besides a running Kafka cluster. This implies that if users want to leverage Flink for stream processing, they will need to work with two systems.

In addition, both Apache Flink and Kafka Streams offer high-level APIs (Flink DataStream APIs, Kafka Streams DSL) as well as advanced APIs for more complex implementations, such as the Kafka Streams Processor APIs.

Now, let's take a closer look at the main differences between Apache Kafka and Flink.

  1. Integrations

How do these systems establish connections with the external world? Apache Flink offers native integration with a wide range of technologies, including Hadoop, RDBMS, Elasticsearch, Hive, and more. This integration is made possible through the utilization of the Flink Connectors suite, where these connectors function as sources within Flink pipelines.

Kafka Streams is tightly integrated with Kafka for processing streaming data. The Kafka ecosystem provides Kafka Connect, which allows for the integration of external data sources as events are journaled into topics. For example, using the Kafka Connect Debezium connector, users can stream  Change Data Capture stream events into a Kafka topic. A Kafka Stream topology can then consume this topic and apply processing logic to meet specific business requirements.

  1. Scalability

 Apache Flink is an engine designed to scale out across a cluster of machines, and its scalability is only bound by the cluster definition. On the other hand, while it is possible to scale Kafka Streams applications out horizontally, the potential scalability is limited to the maximum number of partitions owned by the source topics. 

  1. Fault tolerance and reliability

 Both Kafka Streams and  Apache Flink ensure high availability and fault tolerance, but they employ different approaches. Kafka Stream delegates to the capabilities of Kafka brokers. Apache Flink depends  on external systems for persistent state management by using tiered storage and it relies on systems like Zookeeper or Kubernetes for achieving high availability.

  1. Operation

 Kafka Stream as a library, requires users to write their applications and operate them as they would normally. For example, a Kubernetes deployment can be used for this purpose and by adding Horizontal Pod Autoscaling (HPA) , it can enable horizontal scale-out.

 Apache Flink is an engine that needs to be orchestrated in order to enable Flink workloads. Currently, Flink users can leverage a Kubernetes Flink operator developed by the community to integrate Flink executions natively over Kubernetes clusters.

  1. Windowing 

Both Kafka Stream and Flink support windowing (tumbling, sliding, session) with some differences:

  • Kafka Stream manages windowing based on event time and processing time.
  • Apache Flink manages flexible windowing based on event time, processing time, and ingestion time.

Use Cases

While both frameworks offer unique features and benefits, they have different strengths when it comes to specific use cases. 

Apache Flink is the go-to choice for:

  • Real-Time Data Processing: real-time event analysis, performance monitoring, anomaly detection, and IoT sensor data processing.
  • Complex Event Processing: pattern recognition, aggregation, and related event processing, such as detecting sequences of events or managing time windows.
  • Batch Data Processing: report generation and archives data processing.
  • Machine Learning on Streaming Data: train and apply machine learning models on streaming data, enabling real-time processing of machine learning outcomes and predictions.

Kafka Stream is the go-to choice for:

  • Microservices Architectures: particularly leveraged for the implementations of event-driven patterns like event sourcing or CQRS.
  • Kafka Input and Output Data Processing: transform, filter, aggregate or enrich input data and produce output data in real-time.
  • Log Data Processing: analyze website access logs, monitor service performance, or detect significant events from system logs.
  • Real-time Analytics: data aggregation, real-time reporting, and triggering event-based actions
  • Machine Learning:  train and apply machine learning models on streaming data for real-time scoring.

Learning curve and resources

The learning curve of a technology is not always an objective fact and can vary depending on various factors. However, we have attempted to provide a general overview based on factors like resources availability and examples.

The basic concepts of Kafka Streams, such as KStreams (data streams) and KTables (data tables), can be easily grasped. While the mastery of advanced functions such as the aggregation of time windows or the processing of correlated events, may require further exploration. The official Kafka Streams documentation, available on Kafka’s website at https://kafka.apache.org/documentation/streams/ serves as a valuable reference for learning, exploring and leveraging all its capabilities.

To get started with Apache Flink, it is recommended to learn the basic programming model, including working with data streams and data sets. Once again, mastering advanced concepts such as state management, time windows, or grouping, may require additional study and practice time. The official Flink documentation available on Confluent's website at https://nightlies.apache.org/flink/flink-docs-stable/ serves as a comprehensive resource for learning and exploring as well.

Conclusions

To cut a long story short, Apache Flink and Kafka Streams are two open-source frameworks with their strengths and weaknesses for stream processing that can process large amounts of data in real-time. 

Apache Flink is a fully-stateful framework that can store the state of the data during processing, making it ideal for applications that require complex calculations or data consistency. Kafka Streams is a partially-stateful framework and it is ideal for applications that require low latency or to process large amounts of data. Apache Flink is a more generalized framework that can be used for various applications, including log processing, real-time data processing, and data analytics. Kafka Streams is more specific to stream processing.

We can conclude affirming that the best framework for a specific application will depend on the specific needs of the application.

Author: Luigi Cerrato, Software Engineer @ Bitrock

Thanks to the technical team of our sister company Radicalbit for their valuable contributions to this article.

Read More
Apache Flink

Nowadays organizations are facing the challenge to process massive amounts of data. Traditional batch processing systems don't meet the modern data analytics requirements anymore.  And that’s where Apache Flink comes into play.

Apache Flink is an open-source stream processing framework that provides powerful capabilities for processing and analyzing data streams in real-time. Among its key strengths, we can mention:

  • Elastic scalability to handle large-scale workloads 
  • Language flexibility  to provide API for Java, Python and SQL
  • Unified processing to perform streaming, batch and analytics computations 

Apache Flink can rely on a supportive and active community as well as offers seamless integration with Kafka, making it a versatile solution for various use cases. It provides comprehensive support for a wide range of scenarios, including streaming, batching, data normalization, pattern recognition, payment verification, clickstream analysis, log aggregation, and frequency analysis. Additionally, Flink is highly scalable and can efficiently handle workloads of any size.

During the Kafka Summit 2023, Apache Flink received significant attention, highlighting its increasing popularity and relevance. To further demonstrate the growing interest in this technology, Confluent, a leading company in the Kafka ecosystem, presented a roadmap outlining upcoming Flink-powered features for Confluent Cloud:

  • SQL Service (public preview fall 2023)
  • SQL (general availability winter 2023)
  • Java and Python API (2024)

Unleashing the power of Apache Flink: the perfect partner for Kafka

Stream processing is a data processing paradigm that involves continuously analyzing events from one or multiple data sources. It focuses on processing data in motion, as opposed to batch processing. 

Stream processing can be categorized into two types: stateless and stateful. Stateless processing involves filtering or transforming individual messages, while stateful processing involves operations like aggregation or sliding windows.

Managing state in distributed streaming computations is a complex task. However, Apache Flink aims to simplify this challenge by offering stateful processing capabilities for building streaming applications. Apache Flink provides APIs, advanced operators, and low-level control for distributed states. It is designed to be scalable, even for complex streaming JOIN queries.

The scalability and flexibility of Flink's engine are crucial in providing a powerful stream processing framework for handling big data workloads.

Furthermore, Apache Flink offers additional features and capabilities:

  • Unified Streaming and Batch APIs
  • Transactions Across Kafka and Flink
  • Machine Learning with Kafka, Flink, and Python
  • Standard SQL support

Unified Streaming and Batch APIs

Apache Flink's DataStream API combines both batch and streaming capabilities by offering support for various runtime execution modes and by doing so providing a unified programming model. When using the SQL/Table API, the execution mode is automatically determined based on the characteristics of the data sources. If all events are bound, the batch execution mode is utilized. On the other hand, if at least one event is unbounded, the streaming execution mode is employed. This flexibility allows Apache Flink to seamlessly adapt to different data processing scenarios.

Transactions Across Kafka and Flink

Apache Kafka and Apache Flink are widely deployed in robust and essential architectures. The concept of exactly-once semantics (EOS) ensure that stream processing applications  can process data through Kafka without loss or duplication. Many companies have already  adopted EOS in production  using Kafka Streams.  The advantage is that EOS can also be leveraged when combining Kafka and Flink, thanks to Flink's Kafka connector API.   This capability is particularly valuable when using Flink for transactional workloads. This feature is mature and battle-tested in production, however operating separate clusters for transactional workloads can still be challenging and sometimes cloud services with similar aimings can take over this burden in favor of simplicity.

Machine Learning with Kafka, Flink, and Python

The combination of data streaming and machine learning offers a powerful solution to efficiently deploy analytical models for real-time scoring, regardless of the scale of the operation. PyFlink, a Python API for Apache Flink, allows you to build scalable batch and streaming workloads, such as real-time data processing pipelines, large-scale exploratory data analysis, Machine Learning (ML) pipelines, and ETL processes.

Standard SQL Support

Structured Query Language (SQL) is a domain-specific language used for managing data in a relational database management system (RDBMS). However, SQL is not limited to RDBMS and is also adopted by various streaming platforms and technologies. Apache Flink  offers comprehensive support for ANSI SQL, encompassing the Data Definition Language (DDL), Data Manipulation Language (DML), and Query Language.
This is advantageous because SQL is already widely used by different professionals including developers, architects, and business analysts, in their daily work.
The SQL integration  is facilitated through the Flink SQL Gateway, which is a component of the Flink framework  and allows other applications to interact with a Flink cluster through a REST API.   This opens up the possibility of integrating Flink SQL with traditional business intelligence tools.

Additionally, Flink provides a Table API that complements the SQL capabilities by offering declarative features to imperative-like jobs. This means that users can seamlessly combine the DataStream APIs and Table APIs as shown in the following example:

 DataStream APIs and Table APIs

Flink Runtime

As mentioned, Flink supports streaming and batch processing. It’s now time to delve into the analysis of the main difference:

Streaming works with bounded or unbounded streams. In other words:

  • Entire pipeline must always be running  
  • Input must be processed as it arrives   
  • Results are reported as they become ready  
  • Failure recovery resumes from a recent snapshot    
  • Flink can guarantee effectively exactly-once results when properly set up with Kafka

Batch works only with a bounded stream. In other words:

  • Execution proceeds in stages and running as needed
  • Input may be pre-sorted by time and key
  • Results are reported at the end of the job
  • Failure recovery does a reset and full restart 
  • Effectively exactly-once guarantees are more straightforward

Flink Runtime

Conclusions

In a nutshell, Apache Flink offers numerous benefits for stream processing and real-time data analytics. Its seamless integration with Kafka allows for efficient data ingestion and processing. Flink's stateful processing capabilities simplify the management of distributed computations, making it easier to build complex streaming applications. The scalability and flexibility of Flink's engine enable it to handle large-scale workloads with ease. 

Additionally, Flink provides unified streaming and batch APIs, enabling developers to work with both types of data processing paradigms seamlessly. With the support of machine learning and standard SQL, Flink empowers users to perform advanced analytics and leverage familiar query languages. Overall, Apache Flink is a powerful and versatile platform that enables organizations to unlock the full potential of their real-time data.

Author: Luigi Cerrato, Software Engineer @ Bitrock

Thanks to the technical team of our sister's company Radicalbit for their valuable contributions to this article.

Read More