A Joint Event from Bitrock and HashiCorp

Last week we hosted an exclusive event in Milan dedicated to the exploration of modern tools and technologies for the next-generation enterprise.
The first event of its kind, it was held in collaboration with HashiCorp, the US market leader in multi-cloud infrastructure automation, following the Partnership we signed in May 2020.

HashiCorp's open-source tools Terraform, Vault, Nomad and Consul enable organizations to accelerate their digital evolution, as well as adopt a common cloud operating model for infrastructure, security, networking, and application automation.
As companies scale and increase in complexity, enterprise versions of these products enhance the open-source tools with features that promote collaboration, operations, governance, and multi-data center functionality.
Organizations must also rely on a trusted Partner that is able to guide them through the architectural design phase and that can provide enterprise-grade assistance when it comes to application development, delivery and maintenance. And that’s exactly where Bitrock comes into play.

During the Conference Session, the Speakers had the chance to describe to the audience how large companies can rely on a more agile, flexible and secure infrastructure thanks to HashiCorp’s suite and Bitrock’s expertise and consulting services, especially when it comes to the provisioning, protection and management of services and applications across private, hybrid and public cloud architectures.

“We are ready to offer Italian and European companies the best tools to evolve their infrastructure and digital services. By working closely with HashiCorp, we jointly enable organizations to benefit from a cloud operating model.” – said Leo Pillon, Bitrock CEO.

After the Keynotes, the event continued with a pleasant Dinner & Networking night at the fancy restaurant inside Cascina Cuccagna in Milan.
Take a look at the pictures below to see how the event went, and keep following us on our blog and social media channels to discover what other incredible events we have in store!


Continuous Deployment of Kafka Connectors


Introduction to the Project

During one of our projects, we worked on a real-time data streaming application using Kafka. After receiving the data from some external services via REST APIs, we manipulated it using Kafka Streams pipelines; then, we called an external REST API service to make the data available to the destination system. In order to develop the input and output components of the ETL flow, we wrote some custom Source Connectors and a Sink Connector for the Kafka Connect module.

Diagram: schema of the ETL flow, from the source REST APIs through Kafka Connect and Kafka Streams to the destination system


The Problem

Probably, most of you have already figured out the problem: for each of our connectors, we needed to build the “fat Jar” and move it to the Kafka Connect folder which, to make things even more complicated, was running on Kubernetes. Unfortunately for us, the connectors couldn’t be counted on the fingers of one hand: consequently, the manual procedure was long, repetitive, and very time-consuming. Another issue we encountered was that the creation and deletion of our connectors was done by hand, which made it impossible to determine which configurations were being provisioned by looking at the Git repository.


The Solution

In our project, each connector was versioned in a separate Git repository. We decided to aim for CI (Continuous Integration) and CD (Continuous Delivery), with Continuous Deployment in the development environment.

Diagram: CI/CD flow, from pull request merge to connector deployment

In our scenario, we decided to use Jenkins as the CI/CD daemon and Terraform to manage all the infrastructure and configurations for the connectors, aiming to achieve a full GitOps experience.

In our daily work, whenever we merge a pull request, this event triggers a Jenkins pipeline that builds the artifact and publishes it into a private Artifactory repository. If this phase succeeds, then the next step is to trigger the deployment job.

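As a reference, a minimal declarative pipeline for this flow could look like the following sketch (stage names, the sbt commands and the downstream job name are illustrative assumptions, not the actual project pipeline):

// Jenkinsfile (sketch)
pipeline {
    agent any

    stages {
        stage('Build') {
            steps {
                // build the connector "fat Jar"
                sh 'sbt clean test assembly'
            }
        }
        stage('Publish') {
            steps {
                // publish the artifact to the private Artifactory repository
                sh 'sbt publish'
            }
        }
        stage('Trigger deployment') {
            steps {
                // hand over to the deployment job for the development environment
                build job: 'kafka-connectors-deploy', parameters: [
                    string(name: 'ENVIRONMENT', value: 'dev')
                ]
            }
        }
    }
}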

Given that the deployment is fully automated in the development environment, the latest versions of our connectors are always running within minutes of merging code changes. For the other environments, the Jenkins job requires manual approval, in order to avoid having unwanted changes promoted past the development environment. The deployment of the connector Jars themselves is performed by a custom Helm chart that fetches the desired connector artifacts before starting the Kafka Connect container.

As seen before, in our project we were using Terraform to manage the infrastructure and the connector configurations. We decided to use a single repository for the whole Terraform codebase, using different Terraform states to manage different environments. For each connector, we created a Terraform module containing the connector resource definition...

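A minimal sketch of such a module, assuming a community Kafka Connect Terraform provider (the resource type, connector class and configuration keys below are illustrative and may differ from the actual project):

# modules/http-sink-connector/main.tf (sketch)
resource "kafka-connect_connector" "http_sink" {
  name = var.connector_name

  config = {
    "connector.class" = "com.example.connect.HttpSinkConnector"  # illustrative custom connector
    "tasks.max"       = var.tasks_max
    "topics"          = var.topics
    "http.endpoint"   = var.http_endpoint
  }
}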

...and the expected configuration variables:

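For the hypothetical module above, they could be declared as follows:

# modules/http-sink-connector/variables.tf (sketch)
variable "connector_name" {
  type        = string
  description = "Name of the connector instance registered in Kafka Connect"
}

variable "tasks_max" {
  type        = string
  description = "Maximum number of tasks for the connector"
  default     = "1"
}

variable "topics" {
  type        = string
  description = "Comma-separated list of topics the connector reads from"
}

variable "http_endpoint" {
  type        = string
  description = "REST endpoint of the destination system"
}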

In each environment configuration, we declared which connectors to provision by instantiating the corresponding Terraform modules created earlier. The Terraform modules were versioned as well: this means that, in different environments, we could run different configurations of the same connector artifact. In this way, we were able to deploy the connectors without doing anything manually.

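For example, the development environment could enable the hypothetical connector above like this (the Git source URL, version tag and values are, of course, illustrative):

# environments/dev/connectors.tf (sketch)
module "http_sink_connector" {
  source = "git::https://git.example.com/terraform-modules/http-sink-connector.git?ref=v1.2.0"

  connector_name = "http-sink-dev"
  tasks_max      = "2"
  topics         = "processed-data"
  http_endpoint  = "https://destination.example.com/api"
}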

The last missing piece was the creation of all the topics required by our application. We decided to define them in a simple YAML file and, with the help of a small bash script, they got created whenever the Jenkins job ran.

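The following sketch shows what the topic definitions and the helper script could look like (file names, the yq-based parsing and the environment variable are our assumptions):

# topics.yaml (sketch)
topics:
  - name: raw-data
    partitions: 6
    replication_factor: 3
  - name: processed-data
    partitions: 6
    replication_factor: 3

#!/usr/bin/env bash
# create-topics.sh (sketch): creates every topic declared in topics.yaml
# (requires yq v4 and the Kafka CLI tools on the PATH)
set -euo pipefail

count=$(yq e '.topics | length' topics.yaml)

for i in $(seq 0 $((count - 1))); do
  name=$(yq e ".topics[$i].name" topics.yaml)
  partitions=$(yq e ".topics[$i].partitions" topics.yaml)
  replication=$(yq e ".topics[$i].replication_factor" topics.yaml)

  kafka-topics.sh --bootstrap-server "$KAFKA_BOOTSTRAP_SERVERS" \
    --create --if-not-exists \
    --topic "$name" \
    --partitions "$partitions" \
    --replication-factor "$replication"
done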


Conclusions

In this article we have explored how to improve and engineer the deployment of Kafka connectors in various environments without the need for manual intervention. Developers can focus on enhancing their code and almost ignore the deployment part, given that they are now able to perform one-click deployments. Enabling connectors or changing configurations is now just a matter of a few lines changed in the Terraform repo, without the need to execute Kafka Connect API requests by hand. From our perspective, it makes sense for a lot of companies to invest time, money and resources to automate the deployment of their connectors.


Authors: Alberto Adami, Software Engineer @Bitrock - Daniele Marenco, Software Engineer @Bitrock


Exploring Behavior Driven Development (BDD)


What is BDD

Behavior Driven Development (BDD) is an Agile software development process that encourages collaboration among developers, QA and non-technical or business participants in a software project. It encourages teams to use conversation and concrete examples to formalize a shared understanding of how the application should behave.

The main concept behind BDD is the cooperation between all the stakeholders of a project, in order to share the definition of a set of functionalities and how they should behave through a set of concrete examples.

From this point of view, BDD as a practice involving both technical and business users is closely related to the Agile methodology principles.


Mind the Gap: Business Vs. Developers Perspective

Let's consider the scenario that existed before BDD emerged as a practice. As developers know, the process of translating software requirements into a set of well-defined feature specifications is tedious, frustrating and error prone. When the waterfall methodology was in place, a software requirements document was the typical interaction between business users and developers. It was a static kind of interaction: business users wrote the requirements, and developers then extracted from the document a set of functionalities to implement.

Software requirements documents can contain a lot of unnecessary detail, contradictory descriptions of the same functionalities and insufficient definitions of others. Developers therefore typically need to ask business users to amend and supplement the document several times, yet no version of the document maps one-to-one to the set of functionalities to implement.

The first thing to note is that the process of extracting well-defined functionalities from this kind of document is error prone; therefore, there is no guarantee that we cover all the functionalities, nor that we define them correctly.

The extraction process can produce a considerable and unacceptable gap between the business users’ point of view and the developers’ one.

This gap is mainly due to the fact that those who create software requirements documents and those who use them to build features are two distinct teams. If we consider that QA is yet another team, then we easily understand how problematic this process is.

BDD, as a practice that encourages people with different cultures and mindsets to collaborate and to explore and define feature behavior together, is a way to fill this gap.


BDD as a Way to Describe Features

The central point of BDD is the sharing of tools and competences between technical and non-technical stakeholders, in order to share concepts and reach a common understanding of a set of functionalities.

The first tool we can use is the language: the concrete examples that describe the desired behavior of the system are written in a language that is very close to natural language, and thus within the business users’ domain.

BDD is built upon a three-step iterative process, whose steps are Discovery, Formulation and Automation:

Diagram: the three-phase BDD cycle (Discovery, Formulation, Automation)


1. Discovery

BDD helps teams have the right conversations at the right time, so that they can minimise the amount of time spent in meetings and maximise the amount of valuable code they produce.

In this phase, both technical and non-technical team members talk about the requirements related to one or more functionalities (user stories), in order to reach a shared understanding of the expected behavior through a set of concrete examples that describe how the system should work in different scenarios.

This phase is based on structured conversations, called discovery workshops, where team members focus on real-world examples that describe the features from the user’s perspective.


2. Formulation

In this phase, every example is expressed in a way that can be documented and then checked: the examples are written using a medium that can be read both by humans and by automated processes.

A widely adopted language is Gherkin: it is close to natural language and allows you to describe features through one or more scenarios.

Every scenario is a concrete example that explains how the feature should behave in a particular circumstance.

A typical Gherkin scenario describes three things: 1) the preconditions to meet before starting to use a feature, 2) the actions to take in order to use the feature, and 3) the assertions to check that the feature is correctly implemented.

Here’s the structure of a typical BDD scenario:

Given a precondition
And another precondition
And ...

When I do something
And something else
And ...

Then I expect something
And something else
And ...

Here’s an example of a Gherkin definition of the feature for withdrawing money from an ATM:

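As a purely illustrative sketch (account, amounts and wording are ours, not the original example), such a feature could be written like this:

Feature: Cash withdrawal

  Scenario: Successful withdrawal from an account in credit
    Given my account has a balance of 100 euros
    And the ATM contains enough cash
    When I withdraw 20 euros
    Then the ATM should dispense 20 euros
    And my account balance should be 80 euros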


3. Automation

In this phase we take one scenario at a time and write a test that sets up the preconditions (expressed in the Given clauses), performs the actions (expressed in the When clauses) and then verifies the assertions (expressed in the Then clauses).

The test is an automated way to verify if a functionality behaves as described in the corresponding scenario.

As in TDD, the test is written before the implementation: it fails at first, and we then implement the feature in order to make it pass.

Since this kind of test is defined by a team that includes business people, it has a recognized business value and can therefore be used as part of the acceptance tests.

BDD is supported by several open source and commercial tools; a couple of them are:

  • Cucumber: https://cucumber.io/

  • JBehave: https://jbehave.org/


In the following example you can see how the tests related to the first scenario of the ATM feature can be implemented:

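Here is a minimal sketch of step definitions for the scenario above, using the cucumber-scala DSL; the in-memory Account and Atm classes are purely illustrative stand-ins for the real domain code:

import io.cucumber.scala.{EN, ScalaDsl}

// Illustrative domain stubs, just to make the sketch self-contained
class Account(var balance: Int)
class Atm(var cash: Int) {
  def withdraw(account: Account, amount: Int): Int = {
    account.balance -= amount
    cash -= amount
    amount
  }
}

class WithdrawalSteps extends ScalaDsl with EN {

  private var account: Account = _
  private var atm: Atm = _
  private var dispensed: Int = 0

  Given("""my account has a balance of {int} euros""") { (balance: Int) =>
    account = new Account(balance)
  }

  Given("""the ATM contains enough cash""") { () =>
    atm = new Atm(cash = 1000)
  }

  When("""I withdraw {int} euros""") { (amount: Int) =>
    dispensed = atm.withdraw(account, amount)
  }

  Then("""the ATM should dispense {int} euros""") { (amount: Int) =>
    assert(dispensed == amount)
  }

  Then("""my account balance should be {int} euros""") { (balance: Int) =>
    assert(account.balance == balance)
  }
}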


As you can see from the report example below, the automation process produces a set of reports that are very useful for team members to verify, throughout development, whether the features are implemented, whether they behave as expected or whether, for some reason, they need to be investigated further (for instance because a refactoring or some other change broke some of them).

Another key point is that BDD can be viewed as a sort of documentation: indeed, it explains what the features are, how they should behave, and how you can verify them.



Report example: execution results of the ATM withdrawal scenarios

BDD with TDD

BDD does not replace TDD. The automation phase produces a set of automated tests, but compared to those created by TDD, they are at a higher abstraction layer: they are actually used to verify a scenario related to a feature or to a user story.

A BDD test will guide the implementation of a feature as a whole and how to meet the expectations described in the related scenario. The implementation phase of a single feature involves many low level components and, embracing TDD practices, that implementation will start with a failing unit test of a small component; then, the component will be implemented, resulting in a green test. The cycle is then repeated for other components, as described in the following image:

BDD - Scheme 2

Since every feature is made of many small components, every BDD test corresponds to many TDD tests.

Both BDD and TDD are iterative processes: they start from a failing test, the implementation then makes the test pass and, when a refactoring causes a test to fail, the iteration starts again.


TDD Vs. BDD

BDD tests are part of a shared understanding of a set of features between all the stakeholders of a project; they have a recognized business value and can be used as acceptance tests in order to verify if the system is behaving as expected.

They can be used to decide whether it is safe to release to production (go / no-go); typically, they can tell that something is broken, but not exactly what.

TDD tests are part of the development process; they are useful for developers to gain confidence in the quality of the software and to refactor without the fear of breaking something, but they do not have an immediately understandable business value.

TDD tests can tell exactly which small piece of the software is broken; however, even when one of these tests fails, it may still be safe to release to production.


Conclusions

Although tests are a fundamental part of it, defining BDD as a test practice is highly reductive.

BDD is intimately related to the concept of Agile: with its procedures and tools, it facilitates the collaboration between people with different backgrounds and roles, in order to define a common point of view on the features to be implemented.

BDD allows you to verify the correct behavior of your software at any time during the development process, and helps provide a structured documentation on how your software should work.

BDD and TDD are not in competition with each other: each practice completes the other and is a fundamental aspect of the development process.


References

https://cucumber.io/

https://jbehave.org/

https://cucumber.io/docs/bdd/

https://gojko.net/2020/03/17/sbe-10-years.html


Author: Massimo Da Ros, Lead Software Engineer @ Bitrock


Data Engineering - Handling Unreliable Sources

Most of you have probably heard the phrase "data is the new oil", and that's because everything in our world produces valuable information. It's up to us to be able to extract the value from all the noisy, messy data that is being produced every instant.

But working with data is not easy: as mentioned above, real data is always noisy, messy, and often incomplete, and even the extraction process itself can be affected by faults.

It is thus very important to make the data usable via a process known as data wrangling (i.e. the process of cleaning, structuring, and enriching raw data into the desired format) for better decision making. The crucial thing to understand here is that bad data leads to poor decision-making, so it's important to make this process stable, repeatable, and idempotent, in order to ensure that our transformations improve the quality of the data rather than degrade it.

Let's have a look at one of the aspects of the data wrangling process: how to handle data sources that cannot guarantee the quality of the data they are providing.


The Context

In a recent project we have been involved in, we faced the scenario in which the data sources were heavily unreliable.

According to the initial definitions, the data, coming from a set of sensors, was expected to:

  • include approximately ten different types of data
  • arrive for every type at a fixed pace (every 10 minutes)
  • land in a dedicated landing bucket
  • be in CSV format, with a predefined schema and a fixed number of rows

Starting from this, we would then perform validation, cleaning, and aggregation in order to compute some KPIs. Moreover, these KPIs were the starting point of a later Machine Learning based prediction.

On top of this, there was a requirement to produce updated reports and predictions every 10 minutes with the most up-to-date information received.

As in many real-world data projects, the source data suffered from multiple issues: missing data in the CSVs (values missing from some cells, entire rows missing, or duplicated rows), and late-arriving data (or data not arriving at all).


The Solution

In similar scenarios, it is fundamental to track the transformations that the data pipeline will apply, and to answer questions like these:

  • which are the source values for a given result?
  • does a result value come from real data or imputed data?
  • did all the sources arrive on time?
  • how reliable is a given result?

To be able to answer these kinds of questions, we first have to isolate three different kinds of data, in at least three areas:

Diagram: the three areas of the data pipeline (Landing Area, Raw Area, Processed Area)

Specifically, the Landing Area is a place where the external systems (i.e. the data sources) write; the data pipeline can only read from it, or delete files after a safe retention time.

In the Raw Area, instead, we copy the CSVs from the Landing Area keeping the data as-is, but enriching the metadata (e.g. labeling the file, or putting it in a better directory structure). This will be our Data Lake, from which we can always retrieve the original data in case of errors during processing, or whenever a new functionality is developed after the data has already been processed by the pipeline.

Finally, in the Processed Area we keep validated and cleaned data. This area will be the starting place for the Visualization part and the Machine Learning part.


After having defined the previous three areas to store the data, we need to introduce another concept that allows us to track the information through the pipeline: the Run Control Value.

The Run Control Value is a piece of metadata, often a serial value or a timestamp, that gives us the possibility to correlate the data in the different areas with the pipeline executions.

This concept is quite simple to implement, but not so obvious to grasp, and it is easy to be misled: one could think it is superfluous and could be replaced by information already present in the data, such as a timestamp, but that would be wrong.
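To make the idea more concrete, here is a purely illustrative sketch (record types, field names and the RCV generation strategy are our assumptions, not the project's actual code):

import java.time.Instant

// Hypothetical record types, for illustration only
final case class SensorReading(id: Int, counter: Int, timestamp: Instant)
final case class TrackedReading(rcv: Long, reading: SensorReading)

object RunControl {

  // A new Run Control Value is generated once per pipeline execution,
  // e.g. a monotonically increasing serial or an epoch timestamp
  def newRunControlValue(): Long = Instant.now().toEpochMilli

  // Every record ingested during a run is stamped with that run's RCV,
  // so data in the Raw and Processed Areas can be traced back to the execution
  def stamp(rcv: Long)(readings: Seq[SensorReading]): Seq[TrackedReading] =
    readings.map(TrackedReading(rcv, _))

  // Re-ingesting a run (see Example 2 below) boils down to dropping everything
  // carrying the old RCV before loading the corrected input for the same run
  def removeRun(rcv: Long, processed: Seq[TrackedReading]): Seq[TrackedReading] =
    processed.filterNot(_.rcv == rcv)
}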

Let's now see, with a few examples, the benefit of using the data separation described above, together with the Run Control Value.


Example 1: Tracking data imputation

Let's first consider a scenario in which the output looks odd and apparently wrong. The RCV column represents the Run Control Value and is added by the pipeline.

Here we can see that, if we look only at the processed data, for the input at hour 11:00 we are missing the entry with ID=2, and the Counter with ID=1 has a suspicious zero as its value (let's just assume that our domain expert told us that zeros in the Counter column are anomalous).

In this case, we can backtrack through the pipeline stages using the Run Control Value, and see which values concretely contributed to the output, whether all the inputs were available by the time the computation ran, or whether some files were missing in the Raw Area and were therefore filled with imputed values.

In the image above, we can see that in the Raw Area the inputs with RCV=101 were both negative, and the entity with ID=2 is related to time=12:00. If we then check the original file in the Landing Area, we can see that it was named 1100.csv (represented in the image as a couple of table rows for simplicity), so the entry related to hour 12:00 was an error; that entry was therefore removed in the Processed Area, while the other one was reset to zero by an imputation rule.

Keeping the Landing Area distinct from the Raw Area also allows us to handle the case of Late Arriving Data.

Given the scenario described at the beginning of the article, we receive data in batches, with a scheduler that drives the ingestion. So, what if, at the time of the scheduled ingestion, one of the inputs was missing and was filled with imputed values, but, by the time we debug it, it has become available?

In this case, it will be available in the Landing Area but it will be missing in the Raw Area; so, without even opening the file to check the values, we can quickly understand that for that specific run, those values have been imputed.


Example 2: Error from the sources with input data re-submission

In the first example, we discussed how to retrospectively analyze the processing and how to debug it. We now consider another case: a source with a problem submitted bad data on a given run; after the problem has been fixed, we want to re-ingest the data for that same run to update our output, re-executing the pipeline in the same context.

The following image shows the status of the data warehouse when the input at hour 11:00 has a couple of issues: the entry with ID=2 is missing and the entry with ID=1 has a negative value; since we have a validation rule that converts negative values to zero, the Processed Area table contains the validated data.

In the fixed version of the file, there is a valid entry for each entity. The pipeline will use the RCV=101 as a reference to clean up the table from the previous run and ingest the new file.

In this case, the Run Control Value allows us to identify precisely which portion of data was ingested by the previous execution, so we can safely remove it and re-execute the run with the corrected input.


These are just two simple scenarios that can be tackled in this way, but many other data pipeline issues can benefit from this approach.

Furthermore, this mechanism allows us to make the pipeline stages idempotent: being able to track the data flowing through the different stages makes it possible to re-apply the transformations on the same input and obtain the same result.


Conclusions

In this article, we have dived a bit into the data engineering world, specifically discovering how to handle data from unreliable sources, which is the case in most real-world projects.

We have seen why the stage separation is important in designing a data pipeline and also which properties every "area" will hold; this helps us better understand what is happening and identify the potential issues.

Another aspect we have highlighted is how this technique facilitates the handling of late-arriving data or re-ingesting corrected data, in case an issue can be recovered at the source side.


Author: Luca Tronchin, Software Engineer @Bitrock


Polymorphic Messages in Kafka Streams


Things usually start simple...

You are designing a Kafka Streams application which must read commands and produce the corresponding business event.
The Avro models you’re expecting to read look like this:
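As a purely illustrative sketch (protocol, record and field names here are assumptions of ours), the command protocol could be an Avro IDL definition along these lines:

@namespace("com.example.commands")
protocol CommandsProtocol {
  record CreateAssetCommand {
    string assetId;
    string description;
  }
}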

While the output messages you’re required to produce look like this:
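Again as an illustrative sketch, the corresponding event protocol could be:

@namespace("com.example.events")
protocol EventsProtocol {
  record AssetCreatedEvent {
    string assetId;
    string description;
  }
}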

You know you can leverage the sbt-avrohugger plugin to generate the corresponding Scala class for each Avro schema, so that you can focus only on designing the business logic.

Since the messages themselves are pretty straightforward, you decide to create a monomorphic function to map properties between each command and the corresponding event.
The resulting topology ends up looking like this:
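A minimal sketch of such a topology, using the Kafka Streams Scala DSL (topic names are assumptions, and the implicit Serdes for the generated classes are taken for granted at this point):

import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._ // implicit Serde[String] for the keys

val builder = new StreamsBuilder()

builder
  .stream[String, CreateAssetCommand]("asset-commands")
  .mapValues(command => AssetCreatedEvent(command.assetId, command.description))
  .to("asset-events")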


...But then the domain widens

Today new functional requirements have emerged: your application must now handle multiple types of assets, each with its own unique properties.
You are pondering how to implement this requirement and make your application more resilient to further changes in behavior.


Multiple streams

You could split both commands and events into multiple topics, one per asset type, so that the corresponding Avro schema stays consistent and its compatibility is ensured.
This solution, however, would have you replicate pretty much the same topology multiple times, so it’s not recommended unless the business logic has to be customized for each asset type.


“All-and-none” messages

Avro doesn’t support inheritance between records, so any OOP strategy to have assets inherit properties from a common ancestor is unfortunately not viable.
You could however create a “Frankenstein” object with all the properties of each and every asset and fill in only those required for each type of asset.
This is definitely the worst solution from an evolutionary and maintainability point of view.


Union types

Luckily for you, Avro offers an interesting feature named union types: you could express the diversity in each asset’s properties via a union of multiple payloads, still relying on one single message as wrapper.
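Sticking to the illustrative schema used so far, the command protocol could evolve into something like this, with one payload record per asset type wrapped in a union:

@namespace("com.example.commands")
protocol CommandsProtocol {
  record CarPayload {
    int horsepower;
  }

  record BuildingPayload {
    int floors;
  }

  record CreateAssetCommand {
    string assetId;
    union { CarPayload, BuildingPayload } payload;
  }
}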


Enter polymorphic streams

Objects with no shape

To cope with this advanced polymorphism, you leverage the shapeless library, which introduces the Coproduct type, the perfect companion for the Avro union type.
First of all, you update the custom types mapping of sbt-avrohugger, so that it generates an additional sealed trait for each Avro protocol containing multiple records:
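In build.sbt, this boils down to a setting along these lines (the exact keys may vary across plugin versions, so treat this as a sketch):

// make sbt-avrohugger emit a sealed trait for every protocol with multiple records
Compile / avroScalaCustomTypes := {
  avrohugger.format.Standard.defaultTypes.copy(
    protocol = avrohugger.types.ScalaADT
  )
}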

The generated command class ends up looking like this:
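Based on the illustrative schema above, something like this (the exact generated code may differ slightly):

import shapeless.{:+:, CNil}

sealed trait CommandsProtocol
case class CarPayload(horsepower: Int) extends CommandsProtocol
case class BuildingPayload(floors: Int) extends CommandsProtocol
case class CreateAssetCommand(
  assetId: String,
  payload: CarPayload :+: BuildingPayload :+: CNil // the Avro union becomes a Coproduct
) extends CommandsProtocol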


Updating the business logic

Thanks to shapeless’ Poly1 trait you then write the updated business logic in a single class:
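A sketch of what this could look like, assuming a simplified event model of ours (an AssetCreatedEvent carrying a type label and a description) rather than the original one:

import shapeless._

// illustrative event model, not the original one
case class AssetCreatedEvent(assetId: String, assetType: String, details: String)

// one Poly1 case per member of the payload Coproduct: supporting a new asset type
// later only requires adding a new implicit case here
object payloadDescription extends Poly1 {
  implicit val carCase      = at[CarPayload](p => ("car", s"horsepower=${p.horsepower}"))
  implicit val buildingCase = at[BuildingPayload](p => ("building", s"floors=${p.floors}"))
}

object CommandToEvent {
  def apply(command: CreateAssetCommand): AssetCreatedEvent = {
    // fold applies the case matching the concrete payload held by the Coproduct
    val (assetType, details) = command.payload.fold(payloadDescription)
    AssetCreatedEvent(command.assetId, assetType, details)
  }
}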

Changes to the topology are minimal, as you’d expect:
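The value mapper now simply delegates to the polymorphic logic sketched above:

builder
  .stream[String, CreateAssetCommand]("asset-commands")
  .mapValues(CommandToEvent(_))
  .to("asset-events")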


A special kind of Serde

Now for the final piece of the puzzle, Serdes. Introducing the avro4s library, which takes Avro GenericRecords above and beyond.
You create a type class to extend a plain old Serde providing a brand new method:
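As a rough sketch of the idea (method and object names are our assumptions), a plain Serde[GenericRecord] can be enriched with a method that derives a typed Serde for any case class avro4s can convert to and from a GenericRecord:

import com.sksamuel.avro4s.RecordFormat
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.common.serialization.Serde
import org.apache.kafka.streams.scala.serialization.Serdes

object GenericSerdeOps {

  implicit class RichGenericSerde(val inner: Serde[GenericRecord]) extends AnyVal {

    // the "brand new method": derive a typed Serde from the underlying GenericRecord one
    def forType[T >: Null](implicit format: RecordFormat[T]): Serde[T] =
      Serdes.fromFn[T](
        (topic: String, data: T) =>
          inner.serializer().serialize(topic, format.to(data)),
        (topic: String, bytes: Array[Byte]) =>
          Option(format.from(inner.deserializer().deserialize(topic, bytes)))
      )
  }
}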

Now each generated class has its own Serde, tailored to the corresponding Avro schema.


Putting everything together

Finally, the main program where you combine all ingredients:
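As a final, end-to-end sketch (application id, bootstrap servers, topic names and the Schema Registry URL are all illustrative), the pieces above could be wired together like this:

import java.util.Properties
import scala.jdk.CollectionConverters._

import com.sksamuel.avro4s.RecordFormat
import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde
import org.apache.kafka.common.serialization.Serde
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

import GenericSerdeOps._

object AssetCommandProcessorApp extends App {

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "asset-command-processor")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  // underlying GenericRecord Serde backed by the Schema Registry
  val genericSerde = new GenericAvroSerde()
  genericSerde.configure(Map("schema.registry.url" -> "http://localhost:8081").asJava, false)

  // avro4s conversions and typed Serdes for the generated classes
  implicit val commandFormat: RecordFormat[CreateAssetCommand] = RecordFormat[CreateAssetCommand]
  implicit val eventFormat: RecordFormat[AssetCreatedEvent]    = RecordFormat[AssetCreatedEvent]
  implicit val commandSerde: Serde[CreateAssetCommand]         = genericSerde.forType[CreateAssetCommand]
  implicit val eventSerde: Serde[AssetCreatedEvent]            = genericSerde.forType[AssetCreatedEvent]

  val builder = new StreamsBuilder()
  builder
    .stream[String, CreateAssetCommand]("asset-commands")
    .mapValues(CommandToEvent(_))
    .to("asset-events")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}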



Conclusions

When multiple use cases share (almost) the same business logic, you can create a stream processing application with ad-hoc polymorphism and reduce the duplication of code to the minimum, while making your application even more future-proof.
