Exploring BDD

Exploring Behavior Driven Development (BDD)

What is BDD

Behavior Driven Development (BDD) is an Agile software development process that encourages collaboration among developers, QA and non-technical or business participants in a software project. It encourages teams to use conversation and concrete examples to formalize a shared understanding of how the application should behave.

The main concept behind BDD is the cooperation between all the stakeholders of a project, in order to share the definition of a set of functionalities and how they should behave through a set of concrete examples.

From this point of view, BDD as a practice involving both technical and business users is strictly related to the Agile methodology principles.

Mind the Gap: Business Vs. Developers Perspective

Let's consider the existing scenario before BDD emerged as a practice. As developers know, the process of translating software requirements into a set of well defined feature specifications is tedious, frustrating and error prone. A software requirements document was the typical interaction between business users and developers when the waterfall methodology was in place. This is typically a static kind of interaction: business users wrote the requirements, and then developers extracted from it a set of functionalities to implement.

Software requirements documents can contain many unnecessary details, contradictory descriptions of the same functionality, and insufficient definitions of other functionalities. Developers therefore typically need to ask business users to amend the document several times, and even then no version of the document maps one-to-one to the set of functionalities to implement.

The first thing to note is that the process of extraction of well defined functionalities from this kind of document is error prone; therefore, there is no guarantee that we cover all the functionalities nor that we define them correctly.

The extraction process can produce a considerable and unacceptable gap between the business users’ point of view and the developers’.

This gap is mainly due to the fact that the people who write software requirements documents and the people who turn them into features belong to two distinct teams. If we consider that QA is yet another team, it is easy to see that this process is quite problematic.

BDD, as a practice that encourages people with different cultures and mindsets to collaborate and to explore and define feature behavior together, is a way to fill this gap.

BDD as a Way to Describe Features

The central point of BDD is the sharing of tools and competences between technical and non-technical stakeholders, in order to share concepts and meet a common understanding of a set of functionalities.

The first tool we can use is language: the concrete examples that describe the desired behavior of the system are written in a language that is very close to natural language, and therefore within the domain of business users.

BDD is built on a three-step iterative process: Discovery, Formulation and Automation.

BDD - Scheme

1. Discovery

BDD helps teams have the right conversations at the right time, so that you can minimise the amount of time spent in meetings and maximise the amount of valuable code you produce.

In this phase team members, both technical and non-technical, talk about the requirements related to one or more functionalities (user stories), in order to reach a shared understanding of the expected behavior through a set of concrete examples that describe how the system should work in different scenarios.

This phase is based on structured conversations called discovery workshops, where team members focus on real-world examples that describe the features from the user’s perspective.

2. Formulation

In this phase every example is expressed in a way that can be documented and then checked. This is done by expressing the examples in a medium that can be read both by humans and by automated processes.

A widely adopted language is Gherkin: it is similar to natural language and allows teams to describe features through one or more scenarios.

Every scenario is a concrete example that explains how the feature should behave in a particular circumstance.

A typical scenario expressed in Gherkin describes three things: 1) the preconditions to meet before using a feature, 2) the actions to take in order to use the feature, and 3) the assertions to check in order to verify that the feature is correctly implemented.

Here’s the structure of a typical BDD scenario:

  Given a precondition
  And another precondition
  When I do something
  And something else
  And ...
  Then I expect something
  And something else
  And ...

Here’s an example Gherkin definition of a feature for withdrawing money from an ATM:

BDD - Code

3. Automation

In this phase we take one scenario at a time and write a test that sets up the preconditions (expressed in the Given clause), performs the actions (expressed in the When clause) and then verifies the assertions (expressed in the Then clause).

The test is an automated way to verify if a functionality behaves as described in the corresponding scenario.

As in TDD, the test is written before the implementation: it fails at the beginning, and then we implement the feature in order to make it pass.
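As a minimal sketch of the Given/When/Then mapping, the test below uses plain Python instead of a BDD tool such as Cucumber or JBehave. The `Account` and `Atm` classes, the amounts and the scenario wording are all hypothetical and only serve to show how each clause becomes a section of the test.

```python
class Account:
    """Hypothetical bank account holding a balance."""

    def __init__(self, balance):
        self.balance = balance


class Atm:
    """Hypothetical ATM that dispenses cash from its own reserve."""

    def __init__(self, cash):
        self.cash = cash

    def withdraw(self, account, amount):
        # Refuse the request when either the account or the ATM is short.
        if amount > account.balance or amount > self.cash:
            return 0
        account.balance -= amount
        self.cash -= amount
        return amount


def test_withdraw_with_sufficient_funds():
    # Given an account with a balance of 100
    account = Account(balance=100)
    # And an ATM that contains 500 in cash
    atm = Atm(cash=500)

    # When the account holder requests 20
    dispensed = atm.withdraw(account, 20)

    # Then the ATM dispenses 20
    assert dispensed == 20
    # And the account balance is 80
    assert account.balance == 80


test_withdraw_with_sufficient_funds()
```

In a test-first workflow the test function would exist before `Atm.withdraw` does, so the first run fails and the implementation is written to make it pass.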

Since this kind of test is defined by a team having business people in it, it has a recognized business value; therefore, it can be used as part of the acceptance tests.

BDD is supported by several open source and commercial tools; a couple of them are:

Cucumber https://cucumber.io/

JBehave https://jbehave.org/

In the following example you can see how the tests related to the first scenario of the ATM feature can be implemented:

BDD - Example

As the following example shows, the automation process produces a set of reports that help team members verify, throughout development, whether the features are implemented, whether they behave as expected, or whether they need further investigation (for instance, because a refactoring or some other change broke them).

Another key point is that BDD can be viewed as a sort of documentation: indeed, it explains what the features are, how they should behave, and how you can verify them.

BDD - Scenario example (ATM)

BDD with TDD

BDD does not replace TDD. The automation phase produces a set of automated tests, but compared to those created by TDD, they are at a higher abstraction layer: they are actually used to verify a scenario related to a feature or to a user story.

A BDD test will guide the implementation of a feature as a whole and how to meet the expectations described in the related scenario. The implementation phase of a single feature involves many low level components and, embracing TDD practices, that implementation will start with a failing unit test of a small component; then, the component will be implemented, resulting in a green test. The cycle is then repeated for other components, as described in the following image:

BDD - Scheme 2

Since every feature is made of many small components, every BDD test corresponds to many TDD tests.

Both BDD and TDD are iterative processes: they start from a failing test, the implementation then makes the test pass and, when a refactoring causes a test to fail, the iteration starts again.


BDD tests are part of a shared understanding of a set of features between all the stakeholders of a project; they have a recognized business value and can be used as acceptance tests in order to verify if the system is behaving as expected.

They can be used to decide if it is safe to install in production (go / no go); typically, they can tell if something is broken, but not exactly what.

TDD tests are part of the development process. They help developers gain confidence in the quality of the software and refactor without the fear of breaking something, but they do not have an immediately understandable business value.

TDD tests can tell exactly what small piece of the software is broken but, when one of these tests fails, it is usually safe to install in production.


Although tests are a fundamental part of it, defining BDD as a test practice is highly reductive.

BDD is intimately related to the concept of Agile: with its procedures and tools, it facilitates the collaboration between people with different backgrounds and roles, in order to define a common point of view on the features to be implemented.

BDD allows you to verify the correct behavior of your software at any time during the development process, and helps provide a structured documentation on how your software should work.

BDD and TDD are not in competition with each other: each practice completes the other and is a fundamental aspect of the development process.






Author: Massimo Da Ros, Lead Software Engineer @ Bitrock


Data Engineering - Handling Unreliable Sources

Most of you have probably heard the phrase "data is the new oil", and that's because everything in our world produces valuable information. It's up to us to be able to extract the value from all the noisy, messy data that is being produced every instant.

But working with data is not easy: real data is always noisy, messy, and often incomplete, and even the extraction process is sometimes faulty.

It is thus very important to make the data usable via a process known as data wrangling (i.e. the process of cleaning, structuring, and enriching raw data into the desired format) for better decision making. The crucial thing to understand here is that bad data leads to poor decision-making, so it's important to make this process stable, repeatable, and idempotent, in order to ensure that our transformations improve the quality of the data rather than degrade it.

Let's have a look at one of the aspects of the data wrangling process: how to handle data sources that cannot guarantee the quality of the data they are providing.

The Context

In a recent project, we faced a scenario in which the data sources were highly unreliable.

Given the early definitions, the data, coming from a set of sensors, was expected to have these characteristics:

  • approximately ten different types of data
  • each type arriving at a fixed pace (every 10 minutes)
  • delivered to a landing bucket
  • in CSV format, with a predefined schema and a fixed number of rows

Starting from this, we would have performed validation, cleaning, and aggregation, in order to compute some KPIs. Moreover, these KPIs were the starting point of a later Machine Learning based prediction.

On top of this, there was a requirement to produce updated reports and predictions every 10 minutes with the most up-to-date information received.

As in many real-world data projects, the source data suffered from multiple issues, such as missing data in the CSV (values missing from some cells, entire rows missing, or duplicated rows) and late-arriving data (or data not arriving at all).
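The kinds of issues listed above can be detected with a small inspection step. The sketch below is only an illustration: the column names (`id`, `time`, `counter`), the sample file and the list of expected entity IDs are assumptions, not the project's actual schema.

```python
import csv
import io


def inspect_csv(csv_text, expected_ids):
    """Report duplicated rows, rows with empty cells and missing entities."""
    seen_rows = set()
    seen_ids = set()
    report = {"duplicates": [], "incomplete": [], "missing_ids": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = tuple(row.values())
        if key in seen_rows:
            report["duplicates"].append(row["id"])
            continue
        seen_rows.add(key)
        seen_ids.add(row["id"])
        # Empty strings and absent trailing cells (None) both count as missing.
        if any(not value for value in row.values()):
            report["incomplete"].append(row["id"])
    report["missing_ids"] = sorted(set(expected_ids) - seen_ids)
    return report


# Hypothetical file: entity 1 is duplicated, entity 2 lacks a value, entity 3 is absent.
sample = "id,time,counter\n1,11:00,5\n1,11:00,5\n2,11:00,\n"
report = inspect_csv(sample, expected_ids=["1", "2", "3"])
```

Here `report` lists entity 1 as duplicated, entity 2 as incomplete and entity 3 as missing. Late-arriving files need a different check, because they concern whole files rather than rows.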

The Solution

In similar scenarios, it is fundamental to track the transformations that the data pipeline will apply, and to answer questions like these:

  • which are the source values for a given result?
  • does a result value come from real data or imputed data?
  • did all the sources arrive on time?
  • how reliable is a given result?

To be able to answer this type of question, we first have to isolate three different kinds of data in at least three areas: the Landing Area, the Raw Area and the Processed Area.

Specifically, the Landing Area is the place where the external systems (i.e. the data sources) write; the data pipeline can only read from it, or delete files after a safe retention time.

In the Raw Area, instead, we copy the CSVs from the Landing Area keeping the data as-is, but enriching the metadata (e.g. labeling the file, or putting it in a better directory structure). This will be our Data Lake, from which we can always retrieve the original data, whether because of errors during processing or because a new functionality is developed after the data has already been processed by the pipeline.

Finally, in the Processed Area we keep validated and cleaned data. This area will be the starting place for the Visualization part and the Machine Learning part.

Having defined the three areas in which to store the data, we need to introduce another concept that allows us to track the information through the pipeline: the Run Control Value (RCV).

The Run Control Value is metadata, often a serial value or a timestamp, that lets us correlate the data in the different areas with the pipeline executions.

This concept is quite simple to implement, but its value is not so obvious. It is easy to be misled into thinking that it is superfluous and could be replaced by information already present in the data, such as a timestamp; that would be a mistake.
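A minimal sketch of how a pipeline might stamp rows with a Run Control Value is shown here. The `ingest` function, the column names and the values are hypothetical; the only point is that the RCV is added by the pipeline per execution and does not come from the source file.

```python
import csv
import io


def ingest(csv_text, rcv):
    """Parse one CSV file and stamp every row with the run's RCV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["rcv"] = rcv  # metadata added by the pipeline, not by the source
    return rows


# Hypothetical content of a file found in the Landing Area.
landing_file = "id,time,counter\n1,11:00,-5\n2,12:00,-3\n"

first_run = ingest(landing_file, rcv=101)
second_run = ingest(landing_file, rcv=102)  # same file, later execution
```

Both runs carry identical `time` values, yet their rows remain distinguishable through `rcv`, which is something a timestamp inside the data cannot provide on its own.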

Let's now see, with a few examples, the benefit of using the data separation described above, together with the Run Control Value.

Example 1: Tracking data imputation

Let's first consider a scenario in which the output looks odd and apparently wrong. The RCV column represents the Run Control Value and is added by the pipeline.

Here we can see that, if we look only into processed data, for the input at hour 11:00 we are missing the entry with ID=2, and the Counter with ID=1 has a strange zero as its value (let's just assume that our domain expert said that zeros in Counter column are anomalous).

In this case, we can backtrack through the pipeline stages using the Run Control Value and see which values actually contributed to the output, whether all the inputs were available by the time the computation ran, or whether some files were missing in the Raw Area and were therefore filled in with imputed values.

In the image above, we can see that in the Raw Area the inputs with RCV=101 were both negative, and the entity with ID=2 is related to time=12:00. If we then check the original file in the Landing Area, we can see that this file was named 1100.csv (represented in the image as a couple of table rows for simplicity), so the entry related to 12:00 was an error; that entry was therefore removed in the Processed Area, while the other one was reset to zero by an imputation rule.

Keeping the Landing Area distinct from the Raw Area also allows us to handle the case of late-arriving data.

Given the scenario described at the beginning of the article, we receive data in batches, with a scheduler that drives the ingestion. So what if, at the time of the scheduled ingestion, one of the inputs was missing and was filled in with imputed values, but by the time we debug the run the file has become available?

In this case, it will be available in the Landing Area but it will be missing in the Raw Area; so, without even opening the file to check the values, we can quickly understand that for that specific run, those values have been imputed.
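This check can be expressed as a comparison between the file listings of the two areas. The sketch below assumes that each area can be listed as a set of file names for a given run; the function and the file names are hypothetical.

```python
def classify_inputs(expected, landing_files, raw_files):
    """Tell, for one run, which inputs were ingested and which were imputed."""
    status = {}
    for name in expected:
        if name in raw_files:
            status[name] = "ingested"
        elif name in landing_files:
            # Present now, but absent when the run copied files to the Raw Area.
            status[name] = "late, imputed for this run"
        else:
            status[name] = "never arrived, imputed for this run"
    return status


# Hypothetical listings for the 11:00 run.
expected = ["1100_a.csv", "1100_b.csv", "1100_c.csv"]
landing = {"1100_a.csv", "1100_b.csv"}
raw = {"1100_a.csv"}
status = classify_inputs(expected, landing, raw)
```

No file needs to be opened: the presence or absence of a name in each area is enough to classify the input.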

Example 2: Error from the sources with input data re-submission

In the first example, we discussed how to retrospectively analyze the processing and how to debug it. We now consider another case: a source with a problem submitted bad data on a given run; after the problem has been fixed, we want to re-ingest the data for the same run to update our output, re-executing it in the same context.

The following image shows the status of the data warehouse when the input at 11:00 has a couple of issues: the entry with ID=2 is missing, and the entry with ID=1 has a negative value, which a validation rule converts to zero. The Processed Area table therefore contains the validated data.

In the fixed version of the file, there is a valid entry for each entity. The pipeline will use the RCV=101 as a reference to clean up the table from the previous run and ingest the new file.

In this case, the Run Control Value allows us to identify precisely which portion of data has been ingested with the previous execution so we can safely remove it and re-execute it with the correct one.
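A re-submission of this kind can be sketched with an in-memory SQLite table standing in for the Processed Area. The table layout, the `reingest` helper and the corrected values (7 and 4) are assumptions made for illustration; only RCV=101 and the negative-to-zero rule come from the example.

```python
import sqlite3


def validate(value):
    """Validation rule from the example: negative counters become zero."""
    return max(value, 0)


def reingest(conn, rcv, rows):
    """Replace everything a run produced with the output of the given file."""
    with conn:  # one transaction: delete and insert succeed or fail together
        conn.execute("DELETE FROM processed WHERE rcv = ?", (rcv,))
        conn.executemany(
            "INSERT INTO processed (id, time, counter, rcv) VALUES (?, ?, ?, ?)",
            [(r["id"], r["time"], validate(r["counter"]), rcv) for r in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed (id INTEGER, time TEXT, counter INTEGER, rcv INTEGER)"
)

# First submission: entry 2 is missing and entry 1 is negative.
reingest(conn, 101, [{"id": 1, "time": "11:00", "counter": -5}])

# Corrected submission for the same run (hypothetical values).
fixed = [
    {"id": 1, "time": "11:00", "counter": 7},
    {"id": 2, "time": "11:00", "counter": 4},
]
reingest(conn, 101, fixed)
reingest(conn, 101, fixed)  # repeating the run leaves the table unchanged

result = conn.execute(
    "SELECT id, counter, rcv FROM processed ORDER BY id"
).fetchall()
```

Because the delete is scoped by RCV, rows written by other runs are untouched, and applying the same file twice produces the same table.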

These are just two simple scenarios that can be tackled in this way, but many other data pipeline issues can benefit from this approach.

Furthermore, this mechanism allows us to have idempotency of the pipeline stages, i.e. being able to track the data flowing at the different stages enables the possibility to re-apply the transformations on the same input and to obtain the same result.


In this article, we have dived a bit into the data engineering world, specifically looking at how to handle data from unreliable sources, which is the case in most real-world projects.

We have seen why the stage separation is important in designing a data pipeline and also which properties every "area" will hold; this helps us better understand what is happening and identify the potential issues.

Another aspect we have highlighted is how this technique facilitates the handling of late-arriving data or re-ingesting corrected data, in case an issue can be recovered at the source side.

Author: Luca Tronchin, Software Engineer @Bitrock


Getting Started with Prometheus

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit written in Go. Released by SoundCloud in 2012, it joined the Cloud Native Computing Foundation (CNCF) in 2016 and in 2018 became the second project to graduate, after Kubernetes.

Based on metrics and not on logs, Prometheus uses its own time series database called TSDB and its own query language (PromQL).

The CNCF community loves Prometheus because:

  • it’s easy to configure, deploy, and maintain
  • it’s designed as multiple services, aiming at modularity
  • it’s container ready, “docker run” is enough to have it started
  • it’s orchestrator ready, supporting dynamic configurations
  • it’s an ecosystem: many client libraries and exporters maintained both by Prometheus team and the community

  • Prometheus collects data
  • Exporters expose data
  • Applications expose data
  • Grafana displays data
  • Alertmanager dispatches alerts

Prometheus is a pull-based monitoring system that scrapes metrics from configured endpoints, stores them efficiently and supports a powerful query language to compose dynamic information from a variety of otherwise unrelated data points.

To monitor your services using Prometheus, your services need to expose a Prometheus endpoint. This endpoint is an HTTP interface that exposes a list of metrics and the respective current values. Prometheus has a wide range of service discovery options to find your services and start collecting metrics data. The Prometheus server continuously polls the metrics interface on your services and stores the data. This provides a standardized way for metrics gathering.
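To give an idea of what such an endpoint returns, the sketch below renders one metric in the Prometheus text exposition format using only the standard library. In practice an official client library would normally do this; the `render_metric` helper, the metric name and the sample values are hypothetical.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the current value.
    Label values are not escaped here, which a real client library would do.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests served.",
    {
        (("method", "get"), ("status", "200")): 1027,
        (("method", "post"), ("status", "500")): 3,
    },
)
print(body)
```

Each scrape reads a plain-text body like this one over HTTP, one line per time series, and stores the values with the scrape timestamp.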

Prometheus is designed to fetch data at intervals measured in seconds. Although Prometheus 2.x can handle somewhere north of ten million series over a time window, which is rather generous, unwise label choices can consume that capacity surprisingly quickly.

Every 2 hours Prometheus compacts the data that has been buffered up in memory onto blocks on disk.

To reduce the disk footprint, TSDB can be configured with a shorter metrics retention period or with a disk space limit. The data can be compacted and the WAL compressed as well.

The data structure is self-sufficient and can be moved from one instance to another independently, given that each time series is atomic and uniquely identified by its metric name and labels (1). In recent Prometheus versions, remote storage support has been introduced in order to provide long-term storage.

The core Prometheus server is a single binary, and each Prometheus server is an independent process with its own storage. One of the downsides of this core implementation is the lack of clustering and of backfilling for “missing” data when a scrape fails.

Prometheus is not meant to be used only with standard exporters (2): you can instrument your own code to capture the metrics that matter to you, business ones for example. Prometheus comes with support for a wide range of languages (Go, Java or Scala, Python, Ruby, etc). Many upstream libraries are already instrumented by their maintainers, so you will get that for free!

What is a metric?

A metric is any numeric value that tells you something about how your system is operating. For example:

  • How much memory it is using
  • How long the last operation took
  • How many requests were served today
In Prometheus there are 4 types of metrics: counter, gauge, histogram and summary.

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests.

A histogram samples observations, for example request durations or response sizes, and counts them in configurable buckets. It also provides a sum of all observed values. A histogram with a base metric name of <basename> exposes multiple time series during a scrape:

  • cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
  • the total sum of all observed values, exposed as <basename>_sum
  • the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)
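The cumulative nature of the buckets is easier to see with numbers. The following sketch builds the three kinds of series from a list of observations; the function name, the bucket bounds and the observed values are hypothetical.

```python
import math


def observe_all(observations, bounds):
    """Build the series a histogram exposes: cumulative buckets, sum and count."""
    uppers = sorted(bounds) + [math.inf]
    buckets = {le: 0 for le in uppers}
    for value in observations:
        for le in uppers:
            if value <= le:  # cumulative: a value lands in every bucket it fits
                buckets[le] += 1
    return {
        "bucket": buckets,
        "sum": sum(observations),
        "count": len(observations),
    }


# Four request durations in seconds, with bucket bounds at 0.1, 0.5 and 1.0.
series = observe_all([0.05, 0.2, 0.3, 1.5], bounds=[0.1, 0.5, 1.0])
```

With these inputs the bucket counters are 1, 3, 3 and 4, and the last one (the `+Inf` bucket) equals the total count, as the list above states.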

Similar to a histogram, a summary samples observations, for example request durations and response sizes. While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.

A summary with a base metric name of <basename> exposes multiple time series during a scrape:

  • streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as <basename>{quantile="<φ>"}
  • the total sum of all observed values, exposed as <basename>_sum
  • the count of events that have been observed, exposed as <basename>_count

The essential difference between summaries and histograms is that summaries calculate streaming φ-quantiles on the client side and expose them directly, while histograms expose bucketed observation counts and the calculation of quantiles from the buckets of a histogram happens on the server side using the histogram_quantile() function.
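The server-side estimate can be approximated by linear interpolation inside the bucket that contains the requested rank. The sketch below is a simplified illustration of that idea and not the Prometheus implementation; `estimate_quantile` and the sample buckets are hypothetical, and observations are assumed to be evenly spread within each bucket.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # No upper edge to interpolate towards in the +Inf bucket.
                return lower_bound
            width = count - lower_count
            share = (rank - lower_count) / width if width else 0.0
            return lower_bound + (upper_bound - lower_bound) * share
        lower_bound, lower_count = upper_bound, count
    return lower_bound


buckets = [(0.1, 1), (0.5, 3), (1.0, 3), (math.inf, 4)]
median = estimate_quantile(0.5, buckets)
p95 = estimate_quantile(0.95, buckets)
```

For these buckets the median falls halfway through the 0.1 to 0.5 bucket, so the estimate is 0.3, while the 95th percentile falls in the `+Inf` bucket and is reported as 1.0, the highest finite bound. The accuracy of such estimates therefore depends on how well the bucket bounds match the real distribution.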



Understanding metrics

Prometheus metrics have a name and may have an arbitrary number of labels.

A metric has metadata (labels), and there are many functions to filter, change or remove labels while metrics are fetched from the targets. The name “node_cpu_seconds_total” consists of a prefix for the namespace (node metrics) and a suffix for the unit of the value (seconds of CPU time in total).


promtool can lint metrics for consistency and correctness.


Prometheus Query Language (PromQL) supports a wide range of functions for interacting with scraped metrics. Some examples:

  • Filtering by label: http_requests_total{status=~"5.."}
  • Calculating rates: rate(http_requests_total[5m])
  • Arithmetic ( +, *, /, -, %, ^ ) and Comparison ( >, <, >=, <=, ==, != ) operations
  • Aggregation and Grouping: sum(rate(node_network_receive_bytes_total[5m])) by (instance)
  • Quantile: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
  • Recording Rule: precompute frequently needed or computationally expensive expressions, in order to make recurring queries much faster to compute
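To make the rate calculation in the list above more concrete, here is a simplified sketch of a per-second rate over counter samples. It treats any drop in value as a counter reset but, unlike the real `rate()` function, it does not extrapolate to the edges of the time window; `simple_rate` and the samples are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset, so the value seen after
    the reset counts as the increase for that step.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    return increase / elapsed


# The counter restarts between t=60 and t=120 (it drops from 160 to 10).
samples = [(0, 100), (60, 160), (120, 10), (180, 70)]
rate = simple_rate(samples)
```

The total increase here is 60 + 10 + 60 = 130 over 180 seconds, which is why a counter can be reset on restart without producing negative rates.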


Our motto is: if you can graph it, you can alert on it! Setting up alerts in Prometheus is really easy: it is just a matter of defining which query to evaluate and what the range of safe values is.

Prometheus evaluates the alerting rule regularly and marks it as firing when the rule matches. However, the Prometheus core component does not send alerts to end users directly; Alertmanager takes care of alert-related operations instead.

Alertmanager:

  • Receives alerts from Prometheus
  • Groups them
  • Inhibits them, for example in case of false positives
  • Dispatches them to downstream services, such as Slack, PagerDuty and many more
  • Provides built-in high availability by leveraging a gossip protocol
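The grouping step can be pictured as bucketing alerts by the values of a chosen set of labels, so that related alerts reach the receiver as one notification. This is a rough sketch of the idea, not Alertmanager code; the `group_alerts` function, the alert names and the labels are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts that share the same values for the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert["labels"]["alertname"])
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighLatency", "cluster": "eu", "instance": "a"}},
    {"labels": {"alertname": "HighLatency", "cluster": "eu", "instance": "b"}},
    {"labels": {"alertname": "DiskFull", "cluster": "us", "instance": "c"}},
]
grouped = group_alerts(alerts, group_by=["cluster"])
```

Grouping by `cluster` turns the three alerts into two notifications, one per cluster, which keeps a widespread outage from flooding the downstream channel.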




(1) https://github.com/bitnami/kube-prod-runtime/blob/master/docs/migration-guides/prometheus-migration.md

(2) https://prometheus.io/docs/instrumenting/exporters and https://github.com/prometheus/prometheus/wiki/default-port-allocations

Author: Matteo Gazzetta, DevOps Engineer @Bitrock


Terraform Community Tools

Despite not having reached version 1.0 yet, Terraform has become the de facto tool for cloud infrastructure management. One of its major winning points is definitely the extensive cross-cloud support, which allows projects to span from one cloud vendor to another with minimal operational effort. Moreover, its popularity means that the community continuously releases reusable infrastructure components, the Terraform modules, which make it easy to bootstrap new projects with a fully functional setup right from the start.

In order to address all the different use cases of Terraform, whether it is executed as part of a GitOps pipeline or directly from developers' machines, the community has built a set of tools to enhance the developer experience.

In this blog post we will describe some of them, focusing on those that might not be that popular or widely adopted, but certainly deserve some attention.

Pull Request Automation



Atlantis is a Go application that listens for Terraform pull request events via webhooks. It allows users to remotely execute "terraform plan" and "terraform apply" according to the pull request content, and comments back with the result. Atlantis is a good starting point for making infrastructure changes visible to all teams, allowing even non-operations teams to contribute to the Terraform infrastructure codebase. If you want to see Atlantis in action, check this walkthrough video.

If you want to restrict and audit the execution of Terraform changes while still providing a friendly interface, Terraform Cloud and Enterprise support invoking remote operations through the UI, VCS, CLI and API. The offering includes an extensive set of capabilities for integrating infrastructure changes into CI pipelines.

Importing Existing Cloud Resources

Importing existing resources into a Terraform codebase is a long and tedious process. Terraform can import an existing resource into its state through the "import" command; however, the responsibility for writing the HCL that describes the resource lies with the developer. The community has come up with tools that automate this process.


Terraforming supports the export of existing AWS resources into Terraform resources, importing them to Terraform state and writing the configuration to a file.



Terraformer supports the export of existing resources from many different providers, such as AWS, Azure and GCP. The tool leverages Terraform providers to map resource attributes to Terraform ones, which makes it more resilient to API upgrades. Terraformer was developed by Waze and is now maintained by the Google Cloud Platform team.

Version Management



When working with projects that are based on different Terraform versions, it is tedious to switch from one version to another, and the risk of accidentally upgrading a state to a newer Terraform version is high. tfenv helps here, making it easy to have different Terraform versions installed on the same machine.

Security and Compliance Scanning




tfsec performs static analysis of your Terraform code in order to detect potential vulnerabilities in the resulting infrastructure configuration. It comes with a set of rules that work across providers and a set of provider-specific ones, with support for AWS, Azure and GCP. It supports disabling checks on specific resources, which makes it easy to include the tool in a CI pipeline.



Terrascan detects security and compliance violations in your Terraform codebase, mitigating the risk of provisioning insecure cloud infrastructure. The tool supports AWS, Azure, GCP and Kubernetes, and comes with a set of more than 500 policies for security best practices. It is possible to write custom policies in Rego, the Open Policy Agent policy language.



Regula is a tool that inspects Terraform code looking for security misconfigurations and compliance violations. It supports AWS, Azure and GCP, and includes a library of rules written in Rego, the Open Policy Agent policy language. Regula consists of two parts: the first generates a Terraform plan in JSON, which is then consumed by the Rego framework, which in turn evaluates the rules and produces a report.
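The "plan as JSON, then evaluate rules" flow can be illustrated without Rego. The sketch below is written in Python purely as a stand-in for a Rego rule: the `public_buckets` check and the trimmed-down plan are hypothetical, although the `resource_changes` structure follows the JSON that `terraform show -json` produces for a plan file.

```python
import json


def public_buckets(plan):
    """Return addresses of planned S3 buckets whose ACL allows public access."""
    findings = []
    for change in plan.get("resource_changes", []):
        after = (change.get("change") or {}).get("after") or {}
        is_bucket = change.get("type") == "aws_s3_bucket"
        if is_bucket and after.get("acl") in ("public-read", "public-read-write"):
            findings.append(change["address"])
    return findings


# Trimmed-down, hypothetical plan with two buckets.
plan = json.loads("""
{"resource_changes": [
  {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
   "change": {"actions": ["create"], "after": {"acl": "private"}}},
  {"address": "aws_s3_bucket.site", "type": "aws_s3_bucket",
   "change": {"actions": ["create"], "after": {"acl": "public-read"}}}
]}
""")
report = public_buckets(plan)
```

Working on the plan rather than on the HCL source means the rule sees resolved attribute values, at the cost of needing a plan to be generated first.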

Terraform Compliance


Terraform Compliance approaches the problem from a different perspective, allowing you to write compliance rules in a Behaviour Driven Development (BDD) fashion. An extensive set of examples provides an overview of the capabilities of the tool. It is easy to bring Terraform Compliance into your CI chain and validate infrastructure before deployment.

While Terraform Compliance is free to use and easy to get started with, a much wider set of policies can be defined using HashiCorp Sentinel, which is part of the HashiCorp Enterprise offering. Sentinel supports fine-grained condition-based policies, with different enforcing levels, that are evaluated as part of a Terraform remote execution.




TFLint is a Terraform linter that focuses on potential errors and best practices. The tool comes with a general-purpose rule set and an AWS rule set, while rules for other cloud providers such as Azure and GCP are being added. It does not focus on security or compliance issues, but rather on validating configuration values such as instance types, which might cause a runtime error when applying the changes. TFLint tries to fill the gap left by “terraform validate”, which cannot validate variable values beyond syntax and internal consistency checks.

Cost Estimation



Keeping track of infrastructure pricing is quite a mess, and one usually discovers the actual cost of a deployment after running it for days, if not weeks. infracost helps by providing a way to estimate how much the resources you are going to deploy will cost. At the moment the tool supports only AWS, providing insights into the costs of both hourly priced resources and usage-based resources such as AWS Lambda functions. For the latter, it requires the use of the infracost Terraform provider, which allows you to describe usage estimates for a more realistic cost estimate. This enables quick “what-if” analysis such as “what if this month my Lambda gets 2 times more requests?”. The ability to output a “diff” of the costs is useful when integrating infracost into your CI pipeline.

Terraform Enterprise provides a Cost Estimation feature that goes beyond what infracost offers by supporting the three major public cloud providers: AWS, Azure and GCP. Moreover, Sentinel policies can be applied, for example, to prevent the execution of Terraform changes depending on the resulting increase in costs.

Author: Simone Ripamonti, DevOps Engineer @Bitrock


Monitoring Kafka Connector with Kubernetes

The Problem

The popularity of microservice architecture has increased enormously in recent years, but this comes with new challenges.

One of these is monitoring. In one of our projects, we used a Kafka connector to intercept changes in our database and write data to a topic. This was a very important component of the system, so we needed to consider its health status carefully.


In our first version, we created a Kubernetes CronJob with a simple shell script that checked the status of the connector and, if it had failed, deleted and restarted it.

This worked quite well; however, it differs from how the other services are health checked in Kubernetes.

The connector was deployed with Kubernetes, so the most natural thing to do is to use k8s to monitor the pod and restart it when needed.

The Kafka Connect framework comes with a REST API, and one of its endpoints gives you the state of a connector:

i.e : https://bitrock.it/blog/monitoring-kafka-connector-with-kubernetes/

This seems to resolve our problem... But is it really the case?

A Kubernetes HTTP health check only looks at the HTTP status code; the problem is that the Kafka Connect API returns HTTP status 200 even when the connector is not healthy.

For instance, if the task has failed, the API will still return:

HTTP/1.1 200 OK


In this case, from the Kubernetes point of view, everything is ok.

The solution that worked well for us consisted of adding a sidecar container that takes responsibility for exposing the state of the connector task.

The sidecar pattern allows you to extract some functionality of your application into a separate component. For example, we can separate the authentication layer from our “main” component that contains the business logic or - as in our case - extract the monitoring part.

Our goal is to obtain something like this:

First of all, we created a simple application that takes care of calling the connector API and exposes an API for Kubernetes (we used a simple Python application using Flask - but you can use whatever you want). Something like this:

As you can see, the code is very simple.

The application does two different things: first of all, it exposes an endpoint at the “/health” path that will be called periodically by Kubernetes; secondly, it checks the status of a task and returns an Internal Server Error if the HTTP status of the connector API was not 200 or if the task state was not “RUNNING”.
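The decision described above can be sketched in a few lines of Python. This is only an illustration, not the original application: the function name health_response is hypothetical, and the JSON layout (a "connector" object and a "tasks" list, each carrying a "state" field) is assumed to match what the Kafka Connect status endpoint returns. It uses only the standard library, so the Flask wiring is left out.

```python
import json
from typing import Tuple


def health_response(connect_http_status: int, connect_body: str) -> Tuple[int, str]:
    """Map the reply of the Connect status API to the reply given to Kubernetes.

    Hypothetical helper: returns (HTTP status code, short message).
    """
    if connect_http_status != 200:
        return 500, f"connector API returned HTTP {connect_http_status}"
    try:
        status = json.loads(connect_body)
    except json.JSONDecodeError:
        return 500, "connector API returned an unreadable body"
    if not isinstance(status, dict):
        return 500, "connector API returned an unexpected body"
    # Assumed layout: {"connector": {"state": ...}, "tasks": [{"state": ...}]}
    states = [status.get("connector", {}).get("state")]
    states += [task.get("state") for task in status.get("tasks", [])]
    if any(state != "RUNNING" for state in states):
        return 500, f"not running: {states}"
    return 200, "RUNNING"
```

In a Flask application, the “/health” route would call the connector API, pass the status code and body to a function like this one, and return the resulting pair, so that Kubernetes sees a failing probe whenever the connector or one of its tasks is not running.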

Now, this application needs to be deployed in the same pod as the connector. This can be done by adding the container that runs our Python application to our deployment.yaml file:


The logical result?

Both containers use the health check exposed by the sidecar. Kubernetes does not restart the entire pod as long as one container is up, so by pointing both containers at the same API we make sure they share the same fate.

Once the connector is in FAILED state, Kubernetes will restart the pod.

Some cloud providers may provide a built-in solution for problems like this; but if you can’t use it - for whatever reason - this can be a possible solution.

Author: Marco Tosini, Principal Engineer @Bitrock

Getting Started with React Push-based Architecture

When approaching the React world, using Redux or MobX for state management is almost automatic. The libraries may change, but the basic architecture doesn’t: it is always something similar to the Redux pattern with reducers, actions, selectors, middleware, etc.

But is it possible to use a different architecture? Something built on RxJS, as with Angular? After some research, it seems so. Let's look in more detail at what we are talking about.

First of all, we need to think outside the classic pull-based pattern and move to something new for those coming from the React world: a push-based architecture.

With data-push architectures, view components simply react to asynchronous data change notifications and render the current data values.

The library that allows us to manage the store in this way is Akita:

“Akita is a state management pattern, built on top of RxJS, which takes the idea of multiple data stores from Flux and the immutable updates from Redux, along with the concept of streaming data, to create the Observable Data Stores model.”

So basically, Akita enables us to easily build reactive, asynchronous, data-push solutions for our state management needs.
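To make the push-based idea concrete without depending on any library, here is a minimal, language-neutral sketch written in Python. It is not Akita or RxJS: the ObservableStore class and its method names are hypothetical, and it only illustrates how subscribers are notified with a fresh state whenever the store changes, instead of pulling the state themselves.

```python
from typing import Any, Callable, Dict, List

Listener = Callable[[Dict[str, Any]], None]


class ObservableStore:
    """A tiny push-based store: subscribers are called on every state change."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        self._state = dict(initial)
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> Callable[[], None]:
        self._listeners.append(listener)
        listener(dict(self._state))  # push the current value immediately

        def unsubscribe() -> None:
            self._listeners.remove(listener)

        return unsubscribe

    def update(self, **changes: Any) -> None:
        # Build a new state object instead of mutating the old one.
        self._state = {**self._state, **changes}
        for listener in list(self._listeners):
            listener(dict(self._state))


seen: List[Dict[str, Any]] = []
store = ObservableStore({"users": []})
unsubscribe = store.subscribe(seen.append)
store.update(users=["ada"])
unsubscribe()
store.update(users=["ada", "grace"])  # no longer delivered to `seen`
```

The unsubscribe function returned by subscribe plays the role that subscription cleanup plays in a React Hook: once it is called, the component stops receiving state changes.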

Another important concept to add is the one related to the Facades. Facades are a programming pattern in which a simpler public interface is provided to mask a composition of internal, more-complex, component usages.

In order to build our application, we rely on RxJS and React Hooks; nothing else is needed.

Let's now consider a very simple example built on the ideas found in some articles.

In our case we need to have a list of users and to be able to interface through the classic CRUD functions.

Starting from the well-known create-react-app with the addition of TypeScript, we create a folder that will contain our entities; in this case, it will only have a "user" folder as a child.

Inside, we define a simple interface of our "user" entity in the model.ts file:

Let's now start by initializing the store of our entity, creating a "UsersState" interface and then creating a "UsersStore" store by extending the Akita store, and finally exporting it:

At this point, we can create services to manipulate the store, also relying on the methods that an Akita store provides.

This is where we can use all our knowledge of RxJS in order to be able to create more complex flows to act on the store.

Finally, through the "QueryEntity", we can take the whole store - or just a filtered part - and channel it into an observable stream of RxJS.

Last but not least, we create a custom Hook that will internally manage everything related to RxJS, Facades, and Akita.

First, we map and expose the services of our "userService" (in this case, all of them). Then, we create the internal state of our custom Hook. Finally, we need to build the selectors for "users" and "active" state changes and manage subscriptions with auto-cleanup.

Now our user entity should have everything needed. We import our custom Hook, and that's it.

To play a little bit, let’s divide the application into several components in order to test it. The result? Well, it works!

And here’s the child component:

Here’s how the application works in the browser:


Although this example is quite simple, the outcomes are pretty surprising. It was really easy - and also quite logical - to connect all the pieces to compose the state management and, as we have seen, no configurations (of any kind) were needed.

For those approaching an architecture like this for the first time, the greatest difficulty is certainly RxJS. To write simple services or queries, it may be enough to know the basics of RxJS; however, in large applications with complex services, a good knowledge of the technology makes a huge (positive) difference and really gives you an edge. Furthermore, you need to be very careful about where and how you use the various facades in your application. In a push pattern, any change of state triggers the React lifecycle in every component that uses our hooks; watching and controlling performance is thus very important.

Obviously, this is just the beginning: there is a world of things to say about Akita, RxJS, push-patterns etc, and it would take much more than one simple article to explore all of them.

The aim of this contribution was to give you just a little idea of this "new" architecture for state management with React. I hope I’ve hit the target.

Author: Mattia Ripamonti, UX/UI Engineer @Bitrock

Useful Resources:

1 - React Facade Best Practices

2 - React Hooks RxJs Facades

3 - Push Based Architectures with RxJs

4 - Managing State in React with Akita

Polymorphic Messages in Kafka Streams

Things usually start simple...

You are designing a Kafka Streams application which must read commands and produce the corresponding business event.
The Avro models you’re expecting to read look like this:

While the output messages you’re required to produce look like this:

You know you can leverage the sbt-avrohugger plugin to generate the corresponding Scala class for each Avro schema, so that you can focus only on designing the business logic.

Since the messages themselves are pretty straightforward, you decide to create a monomorphic function to map properties between each command and the corresponding event.
The resulting topology ends up looking like this:

...But then the domain widens

Today new functional requirements have emerged: your application must now handle multiple types of assets, each with its own unique properties.
You are pondering how to implement this requirement and make your application more resilient to further changes in behavior.

Multiple streams

You could split both commands and events into multiple topics, one per asset type, so that the corresponding Avro schema stays consistent and its compatibility is ensured.
This solution, however, would have you replicate pretty much the same topology multiple times, so it’s not recommended unless the business logic has to be customized for each asset type.

“All-and-none” messages

Avro doesn’t support inheritance between records, so any OOP strategy to have assets inherit properties from a common ancestor is unfortunately not viable.
You could however create a “Frankenstein” object with all the properties of each and every asset and fill in only those required for each type of asset.
This is definitely the worst solution from an evolutionary and maintainability point of view.

Union types

Luckily for you, Avro offers an interesting feature named union types: you could express the diversity in each asset’s properties via a union of multiple payloads, still relying on one single message as wrapper.

Enter polymorphic streams

Objects with no shape

To cope with this advanced polymorphism, you leverage the shapeless library, which introduces the Coproduct type, the perfect companion for the Avro union type.
First of all, you update the custom types mapping of sbt-avrohugger, so that it generates an additional sealed trait for each Avro protocol containing multiple records:

The generated command class ends up looking like this:

Updating the business logic

Thanks to shapeless’ Poly1 trait you then write the updated business logic in a single class:

Changes to the topology are minimal, as you’d expect:

A special kind of Serde

Now for the final piece of the puzzle, Serdes. Introducing the avro4s library, which takes Avro GenericRecords above and beyond.
You create a type class to extend a plain old Serde providing a brand new method:

Now each generated class has its own Serde, tailored to the corresponding Avro schema.

Putting everything together

Finally, the main program where you combine all ingredients:


When multiple use cases share (almost) the same business logic, you can create a stream processing application with ad-hoc polymorphism and reduce the duplication of code to the minimum, while making your application even more future-proof.

Bringing GDPR in Kafka with Vault

Part 1: Concepts

GDPR introduced the “right to be forgotten”, which allows individuals to make verbal or written requests for personal data erasure. One of the common challenges when trying to comply with this requirement in an Apache Kafka based application infrastructure is being able to selectively delete all the Kafka records related to one of the application users.

Kafka’s data model was never supposed to support such a selective delete feature, so businesses had to find and implement workarounds. At the time of writing, the only way to delete messages in Kafka is to wait for the message retention to expire or to use compact topics that expect tombstone messages to be published, which isn't feasible in all environments and just doesn't fit all the use cases.

HashiCorp Vault provides Encryption as a Service, and as it happens, can help us implement a solution without workarounds, either in application code or Kafka data model.

Vault Encryption as a Service

Vault Transit secrets engine handles cryptographic operations on in-transit data without persisting any information. This allows a straightforward introduction of cryptography in existing or new applications by performing a simple HTTP request.

Vault fully and transparently manages the lifecycle of encryption keys, so neither developers nor operators have to worry about key compliance and rotation, while the securely stored data can always be encrypted and decrypted as long as Vault is accessible.

Kafka Integration

What if, instead of trying to selectively eliminate the data the application is not allowed to keep, we just made sure the application (or anyone, for that matter) cannot read the data under any circumstances? This would be equivalent to physically removing the data, just as requested by GDPR compliance. Such a result can be achieved by selectively encrypting the information that we might want to be able to delete, and throwing away the key when the deletion is requested.

However, it is necessary to perform encryption and decryption in a transparent way for the application, to reduce refactoring and integration effort for each of the applications that are using Kafka, and unlock this functionality for the applications that cannot be adapted at all.

Kafka APIs support interceptors on message production and consumption, which makes them the natural link in the chain where Vault’s encryption as a service can be leveraged. Inside the interceptor, we can perform the needed message transformation:

  • before a record is sent to Kafka, the interceptor performs encryption and adjusts the record content with the encrypted data
  • before a record is returned to a consumer client, the interceptor performs decryption and adjusts the record content with the decrypted data

Logical Deletion

Does this allow us to delete all the Kafka messages related to a single user? Yes, and it is really simple. If the encryption key that we use for encrypting data in Kafka messages is different for each of our application’s users, we can go ahead and delete the encryption key to guarantee that it is no longer possible to read the user data.
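The idea can be illustrated with a small, self-contained Python sketch. Everything in it is an assumption made for illustration: the PerUserKeys class stands in for Vault, the list named topic stands in for a Kafka topic, and toy_cipher is a deliberately simple XOR construction that is NOT secure and only exists so the example runs with the standard library. What it shows is the effect of dropping one user's key: that user's records stay in the log but can no longer be read, while other users are unaffected.

```python
import hashlib
import secrets
from typing import Dict


def _keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from `key` (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the input.

    Illustration only: this is not a secure cipher.
    """
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


class PerUserKeys:
    """Hypothetical stand-in for a key service holding one key per user."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def encrypt(self, user_id: str, plaintext: bytes) -> bytes:
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        return toy_cipher(key, plaintext)

    def decrypt(self, user_id: str, ciphertext: bytes) -> bytes:
        # Raises KeyError once the user's key has been forgotten.
        return toy_cipher(self._keys[user_id], ciphertext)

    def forget(self, user_id: str) -> None:
        self._keys.pop(user_id, None)


keys = PerUserKeys()
topic = [
    ("alice", keys.encrypt("alice", b"alice@example.com")),
    ("bob", keys.encrypt("bob", b"bob@example.com")),
]
keys.forget("alice")  # the "right to be forgotten" request
```

After the call to forget, decrypting the first record fails while the second one still decrypts. With Vault Transit, a key can only be deleted after deletion has been explicitly allowed in the key's configuration, so that step would need to be part of the erasure procedure.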

Replication Outside EU

Given that now the sensitive data stored in our Kafka cluster is encrypted at rest, it is possible to replicate our Kafka cluster outside the EU, for example for disaster recovery purposes. The data will only be accessible by those users that have the right permissions to perform the cryptographic operations in Vault.

Part 2: Technicalities

In the previous part we drafted the general idea behind the integration of HashiCorp Vault and Apache Kafka for performing fine-grained encryption at rest of the messages, in order to address GDPR compliance requirements within Kafka. In this part, instead, we take a deep dive into how to bring this idea to life.

Vault Transit Secrets Engine

Vault Transit secrets engine is part of Vault Open Source, and it is really easy to get started with. Setting the engine up is just a matter of enabling it and creating some encryption keys:

Crypto operations are just as simple to perform; it’s just a matter of providing base64-encoded plaintext data:

The resulting ciphertext will look like vault:v1: - where v1 represents the first key generation, given it has not been rotated yet.
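As a rough sketch of what such a request involves, the Python snippet below builds the JSON body for an encryption call and reads the key version back from a ciphertext. The helper names are hypothetical; the body shape ({"plaintext": <base64>}) is assumed to match the Transit HTTP API, where the request would be sent to /v1/transit/encrypt/<key-name> if the engine is mounted at the default "transit" path. No request is actually sent here.

```python
import base64
import json


def encrypt_request_body(plaintext: bytes) -> str:
    """Build the JSON body for a Transit encrypt call (hypothetical helper)."""
    encoded = base64.b64encode(plaintext).decode("ascii")
    return json.dumps({"plaintext": encoded})


def key_version(ciphertext: str) -> int:
    """Extract the key version from a ciphertext of the form vault:v<N>:<data>."""
    prefix, version, _payload = ciphertext.split(":", 2)
    if prefix != "vault" or not version.startswith("v"):
        raise ValueError("not a Vault Transit ciphertext")
    return int(version[1:])
```

Reading the version prefix can be handy after a key rotation, for example to spot records that were still encrypted with an older key version.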

What about decryption? Well, it’s just another API call:

Integrating Vault’s Encryption as a Service within your application becomes really easy to implement and requires little to no refactoring of the existing codebase.

Kafka Producer Interceptor

The Producer Interceptor API can intercept and possibly mutate the records received by the producer before they are published to the Kafka cluster. In this scenario, the goal is to perform encryption within this interceptor, in order to avoid sending plaintext data to the Kafka cluster...

Integrating encryption in the Producer Interceptor is straightforward, given that the onSend method is invoked one message at a time.

Kafka Consumer Interceptor

The Consumer Interceptor API can intercept and possibly mutate the records received by the consumer. In this scenario, we want to perform decryption of the data received from Kafka cluster and return plaintext data to the consumer.

Integrating decryption with Consumer Interceptor is a bit trickier because we wanted to leverage the batch decryption capabilities of Vault, in order to minimize Vault API calls.
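A minimal Python sketch of the batching idea follows. It is not the interceptor itself: the function name is hypothetical, the way a key name is associated with each record is an assumption (here, one key per user as in Part 1), and the "batch_input" body shape and the /v1/transit/decrypt/<key-name> path are assumed to match the Transit HTTP API with the default mount path. It only shows how the records of one poll can be grouped so that each key needs a single Vault call.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def batch_decrypt_requests(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, List[Dict[str, str]]]]:
    """Group (key_name, ciphertext) pairs into one request body per key.

    Returns a mapping from request path to JSON-serializable body.
    """
    grouped: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for key_name, ciphertext in records:
        grouped[key_name].append({"ciphertext": ciphertext})
    return {
        f"/v1/transit/decrypt/{key_name}": {"batch_input": items}
        for key_name, items in grouped.items()
    }


requests_by_path = batch_decrypt_requests(
    [
        ("user-1", "vault:v1:aaa"),
        ("user-2", "vault:v1:bbb"),
        ("user-1", "vault:v1:ccc"),
    ]
)
```

Three records become two requests here. The order of items within each batch is preserved, which matters when mapping the decrypted results back onto the consumed records.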


Once you have built your interceptors, enabling them is just a matter of configuring your Consumer or Producer client:


Notice that value and key serializer class must be set to the StringSerializer, since Vault Transit can only handle strings containing base64 data. The client invoking Kafka Producer and Consumer API, however, is able to process any supported type of data, according to the serializer or deserializer configured in the interceptor.value.serializer or interceptor.value.deserializer properties.


HashiCorp Vault Transit secrets engine is definitely the technological component you may want to leverage when addressing cryptographic requirements in your application, even when dealing with legacy components. The entire set of capabilities offered by HashiCorp Vault makes it easy to modernize applications from a security perspective, allowing developers to focus on the business logic rather than spending time finding a way to properly manage secrets.

Author: Simone Ripamonti, DevOps Engineer @Bitrock

The JAMStack Proposition

With the surge in popularity of JavaScript frameworks, Node and container technologies, the past years have seen the rise of microservices as the leading pattern in the architecture of distributed applications on the Web; the lingua franca of these applications being, of course, APIs. Developers adopting these modern tools for frontend environments have, however, faced new challenges in search engine optimization, content rendering and application serving compared to the common LAMP stack, where PHP does the bulk of the work and JavaScript only provides the interactions and dynamic elements.

From Client Side to Hybrid

While in the past using frameworks on the client side meant single page applications, using “hacks” such as the hashbang to provide navigation, in recent years the leading JavaScript frameworks embraced the hybrid approach in rendering, where both the server and the client would run the same virtual DOM and reconnect on the browser later, “rehydrating” the application on the client. This way, applications supported both the common navigation controls of the browser and provided accessibility to users with older browsers or even no JavaScript, since the page is readily available on the server. This provided improved performance on the first load, and supported the traditional spiders from search engines.

However, this meant:

  • having an improved developer experience as the entire application uses only JavaScript and HTML, with a single code base...
  • ...but Node doesn’t actually support the same modules and feature sets as a browser
  • taking a hit on a number of metrics, such as Time To First Byte and Time To Interactive, as the code runs on both ends
  • relying on an increasingly complex deployment on the server compared to traditional shared hosting
  • using “brute force” solutions to prerendering and caching applications, such as headless browsers

Limits of CMSs

Many modern web applications are still nothing more than glorified lists of contents, sometimes enabling modest interactions - such as filtering or sorting contents -, providing taxonomies and interacting with limited components - such as forms for comments or the search bar. As most of the content is static, a complete frontend solution ends up being considered an added cost compared to existing monolithic CMSs, which still offer big communities, a plethora of themes and plugins, and well-tested interfaces for content creators.

Yet, these CMSs still do not provide the same speed or developer experience as applications written for Node, which can be started on any machine with nothing more than the Node runtime and have very fast cycles for changes. Moreover, they usually have limited support for components, or restrict them to the content side, while requiring the developer to code additional custom parts for the rest of the page, often in a different language than the one spoken by the browser. Interaction is still tacked on, with JavaScript being unable to cross the boundaries of the single static page.

Last but not least, popular CMSs such as WordPress come with a larger attack surface, as they are both incredibly popular and expose the very same endpoint for the backend and the frontend. Both are hosted on the same machine at the same address, and are subject to varying loads that might require horizontal scaling, which creates issues for the cache.

Enter the Static Sites Generators

Even if CMSs can definitely render pages quickly and avoid the lengthy reconciliation with the browser’s context, they still require a server and dedicated support, with a plurality of codebases and a bad experience for the developer that does not have everything on hand on the local machine, or might not even have the required knowledge to deal with all issues in this tightly coupled project.

Static websites, instead, have no server loading times; require no session on the server, no instance of Linux running, and no real requirements other than a web server to deliver the resources. They can even live off incredibly cheap storage, such as S3 from Amazon.

A Modern Solution to Static Content

While traditional frameworks such as Jekyll or Hugo are fast and still a good solution, the new frameworks that have entered the space in recent years (such as Gatsby, Gridsome, Nuxt and Next.js) have taken static sites into the future. Learning the lessons from hybrid applications and SPAs, they rely on improved tooling running on Node; modern web frameworks are now first class - improving both UX and developer experience. They feature:

  • complete SEO support - as pages are just HTML and CSS as before
  • no need for a running service; deploy simply delivers the static files to the CDN
  • improved performance on all metrics, and well engineered solutions for smaller JavaScript bundles and pipelines for other resources
  • the same complex interactions as SPAs, such as transitions and persistent state, and the same tools (like Webpack, Parcel, etc.)
  • support for content from many sources: static files, version repositories, headless CMSs, APIs and more

Frameworks such as Gatsby and Gridsome take a page from the CMS playbook by offering solutions to different needs in the form of themes and plugins, yet retain a single cohesive codebase with dependencies handled through the very same JavaScript ecosystem, which is familiar to developers already used to working with modern frameworks. They also come with configurations for older browsers, solving the puzzle of bundlers, and with simple commands to either build or start a development service.

The reduction in complexity is significant. A streamlined frontend solution enables developers to focus on the core experience of users instead of wasting time on configuration. The absence of an actual service running the pages allows the website to run just about everywhere, letting backend developers focus on the business logic, and in between, APIs representing the contract between the two sides. DevOps have one less thing to be concerned about. It is the JAMstack: JavaScript, APIs, and markup languages.

Challenges of the JAM Stack

The massive improvements brought forth by the stack come with issues that many other solutions don’t face, and they stem from the very nature of static content; its strong points are therefore also its main pain points.

First of all, static content is never entirely static: content will probably be updated over time and might even be real time. While recreating the bundle every time the content changes is the quickest solution, it takes time - more than real time allows, at least. There have been big improvements in the speed of the tools that generate this bundle; it used to take far longer than today and support only up to a few thousand pages, while today the problem can be mitigated with solutions such as partial builds. It is still not real time; real-time content can only be handled through dynamic components on the client side.

Secondly, by relying on delivery through services such as Netlify, S3, Vercel and so on, we’re leaving it to the middleman to handle security and performance optimization for static files. We can also do it on our own, of course.

Third and last (but also probably the trickiest part), by having static files we move the concern for sessions and authentication/authorization from the server to external microservices. With careful reliance on Service Workers, and/or solutions such as Firebase, we can solve this. The JAM stack also strongly favours a serverless approach to server interactions: by writing simple functions in Node, to be deployed with the same hands-off approach from the same codebase (possibly even sharing code), we can handle just about everything as before with traditional AJAX requests.

Both the serverless approach and the delivery of static files are a big reduction in costs compared to deploying virtual servers, as we only use what we need and scale naturally as more resources are required. However, rarely accessed contents or functions do require some extra time, as the provider has to “boot up” the context of the functions for us.

A common Use-Case: a Blog and its Pages

Static websites are really suited for delivering the contents of an editorial product. Text and images are usually not updated in real time, and the business requirements are more often aligned with the value proposition of static site generators:

  • A safe environment with low attack surface for the public.
  • Fast performance on all metrics, to boost the SEO and mobile performances.
  • Low costs, in development, deployment and maintenance

This environment has been, for the past decades, very much dominated by WordPress and its themes, which provide a good enough solution for most companies and, since they are so commonly used, editors do not need to learn again how to do things. As WordPress developed its own API in the last decade, we can rely on it to provide the base for our contents and access the existing ecosystem and know-how, deploying it on a low-cost solution such as WordPress.com or perhaps a small VPS. All that we really need is a safe way to get our content from our install to our static site generator, which will generate the pages at build time by calling the APIs. We could just as easily deploy a headless CMS or even a completely custom solution - on the frontend side it doesn’t really matter much.

Our choice of static site generator can also be driven by the skills of the developers working on the project. On the React side, both Gatsby and Next.js are very popular solutions, with Gatsby having an already established set of plugins and starters, very similar to WordPress, that can speed up development. On the Vue side, Vuepress and Gridsome are two common solutions: the first is the simpler of the two in terms of features and approach to content (it uses Markdown files), while the second is more similar to Gatsby, providing plugins and starters. Both Gridsome and Gatsby, in fact, use GraphQL as a lingua franca for our contents, so that we can integrate many sources and use them in a common way.

Last but not least, we can decide where to deploy our contents. There’s a huge number of possibilities, from CDNs to storages (such as S3) or many services that pride themselves on simplicity like Netlify and Heroku. Anyway, what we really need is a channel to deliver the bundle of the contents to our users; whenever we update our contents, we will simply call an API to trigger again the build process and reload the files.

An Example with Gatsby

To build an example solution, we’re going to use DigitalOcean to host our WordPress installation. Our generator will be Gatsby for this very specific example, but the concepts are quite similar for many of them. Note that we will just be using a function to build the pages, but many of these generators offer integration with external CMSs and might use GraphQL and such to do the queries; this is just a generic example. To begin, we created a droplet on DigitalOcean using their image for WordPress on Ubuntu 18.04. You can find more information about this on their website, as their wizard will do the bulk of the work for you. Don’t forget to complete the installation of WordPress itself. For this example, we’re not even going to configure a domain for our install, but you definitely want to use a proper configuration. Many hosting providers also offer simple solutions to host applications such as WordPress, and will do the job nicely.

Now that our WordPress is set up, we can start working on our frontend. First things first, we create the project using the command line interface for Gatsby.

npm install -g gatsby-cli

gatsby new example-wordpress

This creates our base project using the default starter kit from Gatsby. Inside the "example-wordpress" folder we can find the needed modules already preinstalled, some configuration for styling the code (with Prettier), and the source code folder ("src"), which contains both the folder for the React "components" and the folder for the "pages". Files inside the "pages" folder will be accessible by default through their filename (for example, "page-2" will be located at "example.com/page-2").

What we want to do is to hook into the build process of Gatsby and generate our pages from the WordPress API. You can find more information about the APIs in the REST API Handbook, but the gist of it is that we’re requesting the posts resource using the correct endpoint. You can preview the available resources by visiting "example.com/wp-json"; we will be accessing "wp/v2", under which we have the editorial contents, and querying for the posts. Our URL will be something like "example.com/wp-json/wp/v2/posts".

Now we just have to pull it into Gatsby and build the pages. To do this, we open our project and navigate to the "gatsby-node.js" file that should be in the root. We install and import the module "node-fetch", so that we have an easy interface to get our resource, by using:

npm install --save node-fetch

And putting our import at the top of the file:

const fetch = require("node-fetch");

Next, we hook into the "createPages" step of Gatsby. In order to do so, we will export an asynchronous method called "createPages" from our file, which receives an object with the "actions" available to us and a "reporter" object that can tell Gatsby if something went wrong. Inside this function, we fetch our posts and create a page for each of them.

Let’s create the template page for the blog posts. We create a file in the "pages" folder named "post.js", and access the post data by reading it from the props of our page component.

We should now have a corresponding page in our frontend:

This is of course just the beginning.

  • To host our content online, we could rely on something like Netlify. We just push our project to Github, then add Netlify as an application.
  • To trigger the rebuilding of our contents, we could for example make a call using cURL to our services on the \save_post\ hook.
  • We could build our taxonomy pages, either by using the API from WordPress or the posts JSON. To better integrate it into Gatsby (or Gridsome perhaps), we could add our posts as GraphQL nodes.
  • A common criticism of this kind of solution is that editors don’t really have an idea of how the content will end up looking on the frontend. We can build a simple WYSIWYG editor on the frontend side by relying on a good library like Draft.js. Of course this also requires authentication and so forth. We could also share the same CSS and major HTML between both the WordPress environment and Gatsby.
  • We could lock our WordPress APIs behind simple authentication using either Apache or Nginx, as they’re quite common in this kind of setup. Logging in through Node is trivial.
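
The idea of adding posts as GraphQL nodes, mentioned in the list above, could be sketched with Gatsby's `sourceNodes` hook. This is only an illustration: the `makeSourceNodes` helper, the node type name `WordPressPost` and the selection of fields are assumptions, while `createNode`, `createNodeId` and `createContentDigest` are the helpers Gatsby passes to the hook.

```javascript
// gatsby-node.js (sketch): expose WordPress posts as GraphQL nodes.
function makeSourceNodes(fetchImpl, postsUrl) {
  return async function sourceNodes({ actions, createNodeId, createContentDigest }) {
    const { createNode } = actions;

    const response = await fetchImpl(postsUrl);
    const posts = await response.json();

    for (const post of posts) {
      // Keep only the data we want to query later (assumed selection).
      const data = {
        wordpressId: post.id,
        slug: post.slug,
        title: post.title.rendered,
        content: post.content.rendered,
      };

      createNode({
        ...data,
        id: createNodeId(`wordpress-post-${post.id}`),
        parent: null,
        children: [],
        internal: {
          type: "WordPressPost", // assumed type name
          contentDigest: createContentDigest(data),
        },
      });
    }
  };
}

// In gatsby-node.js:
// exports.sourceNodes = makeSourceNodes(fetch, "https://example.com/wp-json/wp/v2/posts");
```

With nodes in place, pages and taxonomy listings can read the posts through GraphQL queries instead of receiving the raw JSON through the page context.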


Static site generators enable us to provide a good user experience and good performance, at the cost of a bit more effort than using a common CMS such as WordPress as a monolith. We can integrate different sources and create highly customized solutions using modern scaffolding and tooling. However, the extra effort is real, and the disconnect between frontend and backend can be a pain point for our editors.

Author: Federico Muzzo, UX/UI Engineer @Bitrock

From Layered to Hexagonal Architecture

From layered to Hexagonal Architecture (Hands On)


The hexagonal architecture (also called “ports and adapters”) is an architectural pattern for software design, introduced in 2005 by Alistair Cockburn.

The hexagonal architecture is often said to be at the origin of the microservices architecture.

What it Brings to the Table

The most widely used service architecture is the layered one. Often, this type of architecture makes the business logic depend on external contracts (e.g., a database, an external service, and so on). This brings rigidity and coupling to the system, forcing us to recompile the classes that contain the business logic whenever an API changes.

Loose coupling

In the hexagonal architecture, components communicate with each other using a number of exposed ports, which are simple interfaces. This is an application of the Dependency Inversion Principle (the “D” in SOLID).

Exchangeable components

An adapter is a software component that allows a technology to interact with a port of the hexagon. Adapters make it easy to exchange a certain layer of the application without impacting business logic. This is a core concept of evolutionary architectures.

Maximum isolation

Components can be tested in isolation from the outside environment or you can use dependency injection and other techniques (e.g., mocks, stubs) to enable easier testing.

Contract testing supersedes integration testing for a faster and easier development flow.
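
To make the idea of testing in isolation concrete, here is a minimal sketch. The `PriceConverter` class and the `currentRate` method are hypothetical names; the point is only that the business logic talks to a port, so a stub with canned answers can stand in for the real external service.

```javascript
// Business logic that depends on a port: any object exposing a
// currentRate(currency) method (hypothetical contract).
class PriceConverter {
  constructor(rateProvider) {
    this.rateProvider = rateProvider;
  }

  toCurrency(amountInEuro, currency) {
    return amountInEuro * this.rateProvider.currentRate(currency);
  }
}

// A stub adapter with canned answers replaces the real external service,
// so no network or running dependency is needed to exercise the logic.
const stubRateProvider = {
  currentRate: (currency) => (currency === "USD" ? 2 : 1),
};

const converter = new PriceConverter(stubRateProvider);
console.log(converter.toCurrency(10, "USD")); // 20
```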

The domain at the center

Domain objects can contain both state and behavior. The closer the behavior is to the state, the easier the code will be to understand, reason about, and maintain.

Since domain objects have no dependencies on other layers of the application, changes in other layers don’t affect them. This is a prime example of the Single Responsibility Principle (the “S” in “SOLID”).

How to Implement it

Let's now look at what it means to build a project following the hexagonal architecture, to better understand how it differs from a more common plain layered architecture and what benefits it brings.

Project layout

In a layered architecture project, the package structure usually contains a package for each application layer:

  • the one responsible for exposing the service for external communication (e.g., REST APIs);
  • the one where the core business logic is defined;
  • the one with all the database integration code;
  • the one responsible for communicating with other external services;
  • and more...

Layers Coupling

At first glance, this could look like a nice and clean way to keep the different pieces of the application separated and well organized but, if we dive a bit deeper into the code, we can find some code smells that should alert us. In fact, a quick inspection of the core business logic immediately reveals something in clear contrast with our idea of a clean and well-defined separation of components: the business logic that we'd like to keep isolated from all the external layers directly depends on classes from the database and external service packages.

These dependencies imply that in case of changes in the database code or in the external service communication, we'll need to recompile the main logic and probably change and adapt it, in order to make it compatible with the new database and external service versions. This means that we need to spend time on this new integration, test it properly and, during this process, we expose ourselves to the introduction of some bugs.
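
The coupling described above might look like the following sketch. All class and method names are hypothetical, and the database and the payment service are faked in memory; what matters is that the business logic creates and calls the concrete classes of the outer layers itself.

```javascript
// database/OrderDatabase.js (hypothetical): a concrete persistence class.
class OrderDatabase {
  constructor() {
    this.rows = new Map(); // stands in for a real database table
  }

  insert(order) {
    this.rows.set(order.id, order);
  }
}

// external/PaymentClient.js (hypothetical): a concrete client for another service.
class PaymentClient {
  charge(orderId, amount) {
    // A real client would call a remote API here.
    return { orderId, amount, status: "CHARGED" };
  }
}

// logic/OrderService.js: the business logic instantiates the concrete
// classes, so it depends directly on the two outer packages.
class OrderService {
  constructor() {
    this.database = new OrderDatabase();
    this.payments = new PaymentClient();
  }

  placeOrder(order) {
    if (order.amount <= 0) {
      throw new Error("Order amount must be positive");
    }
    const receipt = this.payments.charge(order.id, order.amount);
    this.database.insert({ ...order, status: receipt.status });
    return receipt;
  }
}
```

Any change to the `insert` or `charge` signatures forces a change in `OrderService`, and the service cannot be exercised without those two concrete classes.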

Interfaces to the Rescue

This is where the hexagonal architecture really shines and helps us avoid all of this. First we need to decouple the business logic from its database dependencies: this can be easily achieved with the introduction of a simple interface (also called “port”) that will define the behavior that a certain database class needs to implement to be compatible with our main logic.

Then we can use this contract in the actual database implementation to be sure that it's compliant with the defined behavior.

Now we can come back to our main logic class and, thanks to the changes described above, we can finally get rid of the database dependency and have the business logic completely decoupled from the persistence details.

It's important to note that the new interface we introduced is defined inside the business logic package and, therefore, it’s part of it and not of the database layer. This trick allows us to apply the Dependency Inversion Principle and keep our application core pure and isolated from all the external details.
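
A minimal sketch of these three steps follows. The names (`OrderRepository`, `InMemoryOrderRepository`, `OrderService`) are hypothetical, and the adapter stores data in a `Map` instead of a real database. Since JavaScript has no interfaces, the port is expressed here as a base class whose methods must be overridden, which is one possible convention rather than the only one.

```javascript
// logic/OrderRepository.js: the port, owned by the business logic package.
class OrderRepository {
  save(order) {
    throw new Error("OrderRepository.save is not implemented");
  }

  findById(id) {
    throw new Error("OrderRepository.findById is not implemented");
  }
}

// database/InMemoryOrderRepository.js: an adapter that fulfils the port.
// A real adapter would talk to an actual database instead of a Map.
class InMemoryOrderRepository extends OrderRepository {
  constructor() {
    super();
    this.rows = new Map();
  }

  save(order) {
    this.rows.set(order.id, { ...order });
  }

  findById(id) {
    return this.rows.get(id) || null;
  }
}

// logic/OrderService.js: depends only on the port it receives.
class OrderService {
  constructor(orderRepository) {
    this.orderRepository = orderRepository;
  }

  placeOrder(order) {
    if (order.amount <= 0) {
      throw new Error("Order amount must be positive");
    }
    const placed = { ...order, status: "PLACED" };
    this.orderRepository.save(placed);
    return placed;
  }
}

// The concrete adapter is chosen at the edge of the application.
const service = new OrderService(new InMemoryOrderRepository());
```

Swapping the persistence technology now means writing another class that extends `OrderRepository`; `OrderService` stays untouched.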

We can then apply the same approach to the external service dependency and finally clear the whole logic class of all its dependencies on the other layers of the application.

DTO for model abstraction

This already gives us a nice level of separation, but there is still room for improvement. In fact, if you look at the definition of the Database class, you will notice that we are using the same model from our main logic to operate on the persistence layer. While this is not a problem for the isolation of our core logic, it could be a good idea to create a separate model for the persistence layer so that, if we need to change the structure of a table, for example, we are not forced to propagate the change to the business logic layer. This can be achieved by introducing a DTO (Data Transfer Object).

A DTO is nothing more than a new external model with a pair of mapping functions that allow us to transform our internal business model into the external one and the other way around. First of all, we need to define the new private model for our database and external service layers.

Then we need to create a proper function to transform this new database model into the internal business logic model (and vice versa based on the application needs).

Now we can finally change the Database class to work with the newly introduced model and transform it into the logic one when it communicates with the business logic layer.
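
Put together, the DTO steps could be sketched as follows. The `Order` and `OrderRow` models, their fields and the `toRow`/`toOrder` mapping functions are hypothetical, and the `Database` class keeps its rows in a `Map` rather than in a real table.

```javascript
// logic/Order.js: the internal business model (hypothetical fields).
class Order {
  constructor(id, amountInCents, status) {
    this.id = id;
    this.amountInCents = amountInCents;
    this.status = status;
  }
}

// database/OrderRow.js: the persistence model, private to the database
// layer. Its column-style names can change without touching Order.
class OrderRow {
  constructor(orderId, amountCents, orderStatus) {
    this.order_id = orderId;
    this.amount_cents = amountCents;
    this.order_status = orderStatus;
  }
}

// Mapping functions between the two models.
function toRow(order) {
  return new OrderRow(order.id, order.amountInCents, order.status);
}

function toOrder(row) {
  return new Order(row.order_id, row.amount_cents, row.order_status);
}

// database/Database.js: works with OrderRow internally and converts at
// the boundary, so the business logic only ever sees Order.
class Database {
  constructor() {
    this.table = new Map();
  }

  save(order) {
    const row = toRow(order);
    this.table.set(row.order_id, row);
  }

  findById(id) {
    const row = this.table.get(id);
    return row ? toOrder(row) : null;
  }
}
```

A round trip through `toRow` and `toOrder` should give back an equal `Order`, which makes the mapping functions a natural target for small unit tests.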

This approach works very well to protect our logic from external interference, but it has some consequences. The main one is an explosion in the number of models, even though most of the time the models are the same; the other is that the model-mapping logic can be tedious to write and always needs to be properly tested to avoid errors. One compromise is to start with only the business models (defining them in the correct package) and introduce the external models only when the two diverge.

When to embrace it

Hexagonal architecture is no silver bullet. If you’re building an application with rich business rules that can be expressed in a rich domain model that combines state with behavior, then this architecture really shines because it puts the domain model in the center.

Combine it with microservices architecture and you’ll get a future-proof evolutionary architecture.
