Continuous Deployment

Continuous Deployment of Kafka Connectors


Introduction to the Project

During one of our projects, we worked on a real-time data streaming application using Kafka. After receiving the data from some external services using REST API, we manipulated it using Kafka streams pipelines; then, we called an external REST API service to make the data available to the destination system. In order to develop the input and output components of the ETL flow, we wrote some custom Source Connectors and a Sink Connector for the Kafka Connect module.

Schema


The Problem

Probably, most of you have already figured out the problem: for each of our connectors, we needed to build the “fat Jar” and move it to the Kafka Connect folder, which, for the sake of making things more complicated, was running in Kubernetes. Unfortunately for us, the connectors couldn’t be counted on the fingers of one hand: consequently, the manual procedure was long, repetitive, and very time-consuming. Another issue we encountered was that the creation/deletion of our connectors was “made by hand”, which made it impossible to determine which configurations were being provisioned by looking at the Git repository.


The Solution

In our project, each connector was versioned using a separate git repository. We decided to aim at CI (Continuous Integration) and CD (Continuous Delivery), with Continuous Deployments in the development environment.

Flow

In our scenario, we decided to use Jenkins as a CI/CD daemon and Terraform to manage all the infrastructure and configurations for the connectors, aiming at achieving a full GitOps experience.

In our daily work, whenever we merge a pull request, this event triggers a Jenkins pipeline that builds the artifact and publishes it into a private Artifactory repository. If this phase succeeds, then the next step is to trigger the deployment job.

Code

Given that the deployment is fully automated in the development environment, the latest versions of our connectors are always running in a matter of minutes after having merged code changes. For the other environments, Jenkins's job requires manual approval, in order to avoid having unwanted changes promoted past the development environment. The deployment of connector Jars itself is performed by using a custom Helm chart that fetches the desired connectors artifacts before starting Kafka Connect container.

As seen before, in our project we were using Terraform for managing infrastructure and connectors configuration. We decided to use a single repository for all the Terraform codebase, using different Terraform states to manage different environments. For each connector, we created a Terraform module containing the connector resource definition...

Code

...and the expected configuration variables:

Code

In each environment configuration, we declared which connectors to configure, by instantiating the proper Terraform modules that were previously created. Terraform modules were versioned as well: this means that, in different environments, we could run different configurations of the same connector artifact. In this way, we were able to deploy the jar without doing it manually.

Code

The last missing piece was the creation of all the required topics for our application. We decided to define them into a simple yaml file and, with the help of a simple bash script, they got created when the Jenkins job ran.

Code


Conclusions

In this article we have explored how to improve and engineer the deployment of Kafka connectors in various environments without the need for manual intervention. Developers can focus on enhancing their code and almost ignore the deployment part, giving that they are now able to perform one-click deployments. Enabling connectors or changing configurations are now just a few lines changed in the Terraform repo, without the need of executing Kafka Connect API requests by hand. From our perspective, it makes sense for a lot of companies to invest time, money and resources to automate the deployment of the connectors.


Authors: Alberto Adami, Software Engineer @Bitrock - Daniele Marenco, Software Engineer @Bitrock

Read More
Data Engineering

Data Engineering - Handling Unreliable Sources

Most of you have probably heard the phrase "data is the new oil", and that's because everything in our world produces valuable information. It's up to us to be able to extract the value from all the noisy, messy data that is being produced every instant.

But working with data is not easy: as seen before, real data is always noisy, messy, and often incomplete, and even the process of extraction sometimes is affected by some faults.

It is thus very important to make the data usable via a process known as data wrangling (i.e. the process of cleaning, structuring, and enriching raw data into the desired format) for better decision making. The crucial thing to understand here is that bad data lead to poor decision-making, so it's important to make this process stable, repeatable, and idempotent, in order to ensure that our transformations are improving the quality of the data and not degrading it.

Let's have a look at one of the aspects of the data wrangling process: how to handle data sources that cannot guarantee the quality of the data they are providing.


The Context

In a recent project we have been involved in, we faced the scenario in which the data sources were heavily unreliable.

Given the early definitions, the expected data, coming from a set of sensors, should have been:

  • approximately ten different types of data
  • every type at a fixed pace (every 10 minutes)
  • data will arrive in a landing bucket
  • data will be in CSV, with a predefined schema and a fixed number of rows

Starting from this, we would have performed validation, cleaning, and aggregation, in order to compute some KPIs. Moreover, these KPIs were the starting point of a later Machine Learning based prediction.

On top of this, there was a requirement to produce updated reports and predictions every 10 minutes with the most up-to-date information received.

As in many real-world data projects, the source data was suffering from multiple issues, like missing data in the CSV (sometimes some value missing in some cells, or entire rows were missing, or sometimes there were duplicated rows), or late-arriving data (even not arriving at all).


The Solution

In similar scenarios, it is fundamental to track the transformations that the data pipeline will apply, and to answer questions like these:

  • which are the source values for a given result?
  • does a result value come from real data or imputed data?
  • did all the sources arrive on time?
  • how reliable is a given result?

To be able to answer this type of questions, we first have to isolate three different kinds of data, in at least three areas:

1 Data Engineering

Specifically, the Landing Area is a place in which the external systems (i.e. data sources) will write, the data pipeline can only read from or delete after a safe retention time.

In the Raw Area instead, we are going to copy the CSVs from the Landing Area keeping the data as-is, but enriching the metadata (e.g. labeling the file, or putting it in a better directory structure). This will be our Data Lake, from which we can always retrieve the original data, in case of errors during processing or a new functionality is developed after the data has already been processed by the pipeline.

Finally, in the Processed Area we keep validated and cleaned data. This area will be the starting place for the Visualization part and the Machine Learning part.


After having defined the previous three areas to store the data, we need to introduce another concept that allows us to track the information through the pipeline: the Run Control Value

The Run Control Value is metadata, it's often a serial value or a timestamp, or others, and it gives us the possibility to correlate the data in the different areas with the pipeline executions.

This concept is quite simple to implement, but it's not so obvious to understand. On the other hand, it is easy to be misled; someone could think it is superfluous, and could be removed in favor of information already present in the data, such as a timestamp, but it would be wrong.

Let's now see, with a few examples, the benefit of using the data separation described above, together with the Run Control Value.


Example 1: Tracking data imputation

Let's first consider a scenario in which the output is odd and seems apparently wrong. The RCV column represents the Run Control Value and it's being added by the pipeline.

Here we can see that, if we look only into processed data, for the input at hour 11:00 we are missing the entry with ID=2, and the Counter with ID=1 has a strange zero as its value (let's just assume that our domain expert said that zeros in Counter column are anomalous).

In this case, we can backtrack in the pipeline stages, using the Run Control Value, and see which values have concretely contributed to the output, if all the inputs were available by the time the computation has run, or if some files were missing in the Raw Area and thus they have been fulfilled with the imputed values.

In the image above, we can see that in the Raw Area the inputs with RCV=101 were both negative, and the entity with ID=2 is related to time=12:00. If we then check the original file in the Landing Area we can see that this file was named 1100.csv (in the image represented as a couple of table rows for simplicity), so the entry related to the hour 12:00 was an error; the entry got thus removed in the Processed Area, while the other one was reset to zero by an imputation rule.

The solution of keeping the Landing Area distinct from the Raw Area allows us also to handle the case of Late Arriving Data.

Given the scenario described at the beginning of the article, we receive data in batches with a scheduler that drives the ingestion. So, what if, at the time of the scheduled ingestion, one of the inputs was missing and it has been fulfilled with the imputed values, but, at the time we are going to debug it, we can see that it's available?

In this case, it will be available in the Landing Area but it will be missing in the Raw Area; so, without even opening the file to check the values, we can quickly understand that for that specific run, those values have been imputed.


Example 2: Error from the sources with input data re-submission

In the first example, we discussed about how to retrospectively analyze the processing or how to debug it. We now consider another case: a source with a problem submitted bad data on a given run; after the problem has been fixed, we want to re-ingest the data for the same run to update our output, re-executing it in the same context.

The following image shows the status of the data warehouse when the input at hour 11.00 has a couple of issues: the entry with ID=2 is missing and the entry ID=1 has a negative value and we have a validation rule to convert to zero the negative values. So the Processed Area table contains the validated data.

In the fixed version of the file, there is a valid entry for each entity. The pipeline will use the RCV=101 as a reference to clean up the table from the previous run and ingest the new file.

In this case, the Run Control Value allows us to identify precisely which portion of data has been ingested with the previous execution so we can safely remove it and re-execute it with the correct one.


These are just two simple scenarios that can be tackled in this way, but many other data pipeline issues can benefit from this approach.

Furthermore, this mechanism allows us to have idempotency of the pipeline stages, i.e. being able to track the data flowing at the different stages enables the possibility to re-apply the transformations on the same input and to obtain the same result.


Conclusions

In this article, we have dived a bit into the data engineering world, specifically discovering how to handle data from unreliable sources, most of the cases in real-world projects.

We have seen why the stage separation is important in designing a data pipeline and also which properties every "area" will hold; this helps us better understand what is happening and identify the potential issues.

Another aspect we have highlighted is how this technique facilitates the handling of late-arriving data or re-ingesting corrected data, in case an issue can be recovered at the source side.


Author: Luca Tronchin, Software Engineer @Bitrock

Read More
Interview with Marco Stefani

Software Development @Bitrock

An interview with Marco Stefani (Head of Software Engineering

We met up with Marco Stefani, Head of Software Engineering at Bitrock, in order to understand his point of view and vision related to software development today. A good opportunity to explore the key points of Bitrock technology experience, and to better understand the features that a developer must have to become part of this team.

Marco Stefani (Head of Engineering @Bitrock) interviewed by Luca Lanza (Corporate Strategist @Databiz Group).


1) How has the approach to software development evolved over time, and what does it mean to be a software developer today in this company?

When I started working, in the typical enterprise environment the developer was required to code all day long, test his work a little bit, and then pass it to somebody else who would package and deploy it. The tests were done by some other unit, and if some bug was found it was reported back to the developer. No responsibility was taken by developers after they thought the software was ready ("It works on my machine"). Releases to production were rare and painful, because nobody was really in charge. In the last years we have seen a radical transformation in the way of working in this industry, thanks to the Agile methodologies and practices. XP, Scrum and DevOps shifted the focus of developers on quality, commitment and control of the whole life cycle of software: developers don't just write source code, they own the code. At Bitrock developers want to own the code.

2) What are the strengths of your Unit today? And which are the differentiating values?

Developers at Bitrock range from highly experienced to just graduated. Regardless of the seniority, they all want to improve their skills in this craft. We work both with old style corporate and with innovative companies, and we are always able to provide useful contributions, not just in terms of lines of codes, but also with technical leadership. Despite the fact that we can manage essentially any kind of software requirement, as a company our preference goes to functional programming and event-based, micro services architectures. We believe that in complex systems, these two paradigms are key to creating the most flexible and manageable software: something that can can evolve as easily as possible with the new and often unforeseen needs of the future.

3) What challenges are you prepared to face with your Unit, and with which approach?

Today's expectations of end users are that software applications must be always available and fast; the expectations of our customers are that the same system must be as economical as possible, in terms of development and maintainability, but also in terms of operational costs. That means that we need to build systems that are responsive, resilient and elastic. If you throw in asynchronous messages as the means of communication between components, then you get a "reactive system" as defined by the Reactive Manifesto.

(Discover Reactive Manifesto at: https://www.reactivemanifesto.org/)

Modern software systems development requires a set of skills that are not just technical, but also of understanding the underlying business and organisational: if the we want to model the messages (the events, actually) correctly, or wants to split the application in meaningful and tractable components, we need to understand the business logic and processes; and the definition of the components will have an impact on how the work and the teams are structured. Designing a software architecture requires tools that are able to connect all these different levels. Best practices, design and architectural patterns must be in the developer's toolbox.

4) What skills should a new member of your team have?

Software development is a very complex and challenging discipline. Technical skills need to be improved and change all the time; keeping up with technology advances is a job in itself, and often is not enough. At Bitrock we believe in continuous learning, so for new members it is important to understand that what they know during the interview is just the first step of a never-ending process. Working in a consultant company adds even more challenges, because it requires a constant adaptation to new environments and ways of working that are dictated by the customers. The payoff is that both technical and soft skills can improve in a much shorter time, compared to a "comfortable" product company.

Read More