Bitrock & CNCF

We are thrilled to announce that Bitrock has recently joined the Cloud Native Computing Foundation (CNCF), vendor-neutral home for many of the fastest-growing projects on GitHub (including Kubernetes and Prometheus) that fosters the collaboration between the industry’s top developers, end users, and vendors.

Kubernetes and other CNCF projects have quickly gained adoption and diverse community support in recent years, becoming some of the most rapidly-moving projects in the open-source history.

The CNCF, part of the non-profit Linux Foundation, creates long-term ecosystems and fosters a community around a collection of high-quality projects that orchestrate containers as part of microservice architectures.

Cloud Native

Organizations can use cloud native technologies to build and run scalable applications in modern, dynamic environments including public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.

These techniques enable resilient, manageable, and observable loosely-coupled systems. They allow developers to make high-impact changes frequently and predictably with minimal effort when combined with robust automation.

The CNCF aims to accelerate adoption of this paradigm by cultivating and maintaining an ecosystem of open source, vendor-neutral initiatives that democratize cutting-edge patterns and make them available to everybody.

Bitrock’s New Membership

Bitrock has recently joined the CNCF community with a Silver Membership: a community that includes the world’s major public cloud and enterprise software firms as well as innovative startups. 

The choice to join this community was based on Bitrock Team's intention to build and influence the cloud native ecosystem alongside industry peers.

By providing governance, thought leadership, and engineering resources to shape and influence the development of the cloud-native ecosystem, Bitrock helps CNCF achieve its mission of making cloud-native computing ubiquitous.

Bitrock also upholds the community's key values as a member - find out more by accessing the CNCF charter:

  • Fast is better than slow 
  • Open 
  • Fair
  • Strong technical identity
  • Clear boundaries 
  • Scalable 
  • Platform agnostic

As a result of this new membership, Bitrock will be able to demonstrate thought leadership in the cloud native space, provide a first-class experience with Kubernetes and other CNCF projects in collaboration with the CNCF and its community, and design and maintain the technologies that are at the forefront of the industry.

Read More
Caravan Series - GitOps

This is the second entry in our article series about Caravan, Bitrock’s Cloud-Native Platform based on the HashiCorp stack. Click here for the first part.

What is GitOps

GitOps is "a paradigm or a set of practices that empowers developers to perform tasks that typically fall under the purview of IT operations. GitOps requires us to describe and observe systems with declarative specifications that eventually form the basis of continuous everything" (source: Cloudbees).

GitOps upholds the principle that Git is the only source of truth. GitOps requires the system’s desired state to be stored in version control such that anyone can view the entire audit trail of changes. All changes to the desired state are fully traceable commits, associated with committer information, commit IDs, and time stamps.

Together with Terraform, GitOps allows the creation of Immutable Infrastructure as Code. When we need to add or perform an update, we have to modify our code and create a Merge/Pull Request to let our colleagues review our changes. After validating our changes we merge to our main branch and let our CI/CD pipelines apply the changes to our infrastructure environments.

Another approach in GitOps avoids triggering a CI/CD pipeline after a new change is merged. Instead, the system automatically pulls the new changes from the source code, and executes the needed actions to align the current state of the system to the new desired state declared in the source code.

Caravan Logo

How GitOps helped us build Caravan

GitOps provides us with the ability and framework to automate Caravan provisioning. In practice, GitOps is achieved by combining IAC, Git repositories, MRs/PRs, and CI/CD pipelines.

First of all we define our infra resources as code. Each layer of the Caravan stack is built following GitOps principles, and the first one is of course the Infrastructure layer that allows declaring the required building block for the major cloud provider. Networking, Compute resources and Security rules are all tracked in the Git repository.

Then, the following layer is the Platform one where we bring online the needed components with the required configuration. Finally, we declare the Application Support components deployed on top of the Platform.

Currently, the applications are deployed using a simpler approach leveraging standard Terraform files that we called “Carts”. Nomad itself can pull configuration files from git repository but lacks a solution like ArgoCD for automatically pulling all the nomad job descriptors from git.


Want to know more about Caravan? Visit the dedicated website, check our GitHub repository and explore our documentation.

Authors: Matteo Gazzetta, DevOps Engineer @ Bitrock - Simone Ripamonti, DevOps Engineer @ Bitrock

Read More
Caravan Series Part 1

Introduction

The current IT industry is characterised by multiple needs, often addressed by an heterogeneous number of products and services. To help professionals adopt the best performing solutions for sustainable development, the Cloud Native Computing Foundation was created in 2015 with the aim of advancing container technology and aligning the IT industry around its evolution.

We conceived Bitrock's Caravan project following the Cloud Native principles defined by the CNCF:

  • leverage the Cloud
  • be designed to tolerate Failure and be Observable
  • be built using modern SW engineering practices
  • base the Architecture on containers and service meshes

The HashiCorp stack fulfills these needs, enabling developers to build and run applications faster and more efficiently.

The Caravan Project

Caravan is your open-source platform builder based on the HashiCorp stack. Terraform and Packer are used to build and deploy a cloud-native and ready-to-use platform composed of Vault, Consul and Nomad.

Vault allows you to keep secrets, credentials and certificates safe across the Company. Consul allows the Service Discovery and, with Consul Connect, a Service Mesh to get the power of a truly dynamic communication among your next gen and legacy applications. Nomad allows powerful placing, scaling and balancing of your workloads that may be containerized or legacy, services or batches.

Thanks to Terraform and Ansible, the Infrastructure and Configuration as Code lie at the core of Caravan.

The rationale behind Caravan is to provide a one-click experience to deploy an entire infrastructure and the configuration needed to run the full HashiCorp stack in your preferred cloud environment.

Caravan’s codebase is modular and layered to achieve maximum flexibility and cover the most common use cases. Multiple cloud providers and optional components can be mixed to achieve specific goals.

Caravan supports both Open Source and Enterprise versions of HashiCorp products.

Caravan Project Functioning

Caravan in a nutshell

Caravan is the perfect modern platform for your containerized and legacy applications:

  • Security by default
  • Service mesh out of the box
  • Scheduling & Orchestration
  • Observability
  • Fully automated

Want to know more about Caravan? Visit the dedicated website, check our GitHub repository and explore our documentation.

Authors: Matteo Gazzetta, DevOps Engineer @ Bitrock - Simone Ripamonti, DevOps Engineer @ Bitrock

Read More

Let's Encrypt with Terraform

Today’s web traffic is virtually impossible without encryption. The need to cryptographically protect the data in transit whenever real or not has become a norm and a requirement for any kind of service to be properly implemented. From a simple portfolio website that has its ranking downgraded by the search engines to public API gateways that move around sensitive data. Everything has to be verified and encrypted.

This increase in the usage however has to deal with the complexity of the technological implementation. SSL and later TLS, with public CA signed certificates and cross-signed private PKI implementations were always something many IT professionals struggled to comprehend and use properly. It just seemed to add a hardly justifiable overhead.

Then the automation came. With the “automate all the things” approach the TLS certificates were given another push with all kinds of APIs and scripts that allowed for dynamic creation, distribution, and maintenance of certificates and complete in-house Certificate Authorities.

But as it always goes with automation the tool that solves one problem isn’t always good for solving another just because it was tagged with the same words in the ticket. So the scripts and services should be chosen to satisfy the specific need. There is however a simple case that will cover most of the uses, i.e. a humble HTTPS certificate. Bring up a website, a REST API or your installation packages download endpoint and you need a certificate to go with it. And if it is a public service you need it to be signed by a public CA. And if it is in the cloud you have to manage it dynamically. And if you do then it is better to manage it as code.

Here at Bitrock when it comes to automation we start with terraform first and see what can we drop on top of it to achieve the goal with all the things IaC as much as we can. And this is where we start with the certificates too. Once the use case is identified, analyzed and solved we can easily reuse it using terraform in other projects. Which given the flexibility of the tool and similarities between cloud platforms should work most of the time. This article illustrates our approach at automating the certificate as code management in the specific case of public HTTP service behind an in cloud load balancer.

A bit of context

First a refresher on the details before the implementation of the process can start.

Let’s start with the Certificate Authority (CA) which, for the sake of simplifying, is a provider of digital certificates. There are many components in a CA but we are only interested in one. As a service consumer you ask CA to certify that you own a property on the internet. In most cases it will be a domain name. Such as “bitrock.best”. The result of this certification is a signed TLS certificate, usually a file you keep in reach of your web server. The standard process is performed in three iterations:

  1. Consumer generates a private key and a certificate signing request
  2. Consumer sends the certificate signing request to the CA
  3. CA verifies the ownership of the property described in the requests and issues the certificate to the consumer

What the consumer is left with are at least two items: the private key and the certificate. The certificate can be read by anyone but can only be used for encryption by the private key owner. And the private key is what should be kept private.

20 years ago... I was there Gandalf when they sent faxes
20 years ago… I was there Gandalf when they sent faxes

The process of issuing a certificate by itself is simple but the verification of the property ownership is what usually complicates it. Since the 90s having a certificate that was signed and trusted by any client meant to pay for the service and service provider used to verify via email, fax, phone calls and even in person that the consumer owns a domain name or a business name.

The way of Let’s Encrypt

Then came the free
Then came the free

While it still makes sense today for banks or large e-commerce companies, for a simple website or service everything changed a few years ago when the Let’s Encrypt project went public. The project has built a protocol and a service provider which together allow having a certificate signed by a publicly trusted CA with a couple of API calls.

Having a certificate issued and signed by Let’s Encrypt on your “normal” server is extremely easy. You just install the “certbot” package using your package manager and run it. If you are using a supported web server software such as apache or nginx the certbot will even set it up for you. Otherwise you can get the certificate by just pointing certbot to where your web root is and then point the web server configuration to your freshly signed certificate and its private key.

The “normal” usage of the certbot however implies the “normal” server which doesn’t match the “cattle vs pets” model of modern infrastructure. In a modern architecture the node where your web server is running should be an immutable and disposable element of your architecture. The certificate and the key then should be configured on an external entity. Think a cloud compute instance and a cloud load balancer. The load balancer accepts the client requests, does all the TLS termination heavy lifting and forwards the request to any compute instance there available.

The above use case eliminates the possibility of using certbot as easily as with a “normal” server. The verification process is trickier to implement using the web server files and the certbot process does not have access to where the certificate and key files are stored. This forces a different verification usage approach based on DNS. In the case of files the ownership verification relies on the consumer owning the web server responsible for serving the content of the domain name. A file with cryptographic content is stored by the certbot on the server and Let’s Encrypt servers reach for it to verify that indeed the cerrtbot is running on the domain name’s web server. The DNS verification uses the same cryptography verification but the consumer has to publish a TXT record for the domain name which will be verified by Let’s Encrypt to certify the ownership of the domain name.

Hashicorp Terraform, GCP and ... Let’s Encrypt

The above looks very much like technical requirements: deploy a web service in the cloud to provide public services using HTTPS. The TLS certificate should be issued by Let’s Encrypt using DNS verification and the termination should be handled by the cloud provider’s load balancer. The deployment must be performed using terraform with no manual operations that interrupt the process.

To satisfy the requirements we are going to use the GCP services and the HashiCorp’s google provider to provision the infrastructure. Then we will use GCP’s Cloud DNS to configure the records using an excellent terraform ACME protocol provider. Terraform Cloud will take care of the state so it can be kept separated from the infrastructure it describes.

The domain name registrar used has its own API implemented but a terraform provider doesn’t seem to exist for it. So we can use a bash script that leverages curl to configure nameservers of the domain name to point to a freshly created zone in GCP’s Cloud DNS.

The resulting terraform code and all the scripts are available on Bitrock’s github.

./
├── cert-gcp.tf
├── domain.tf
├── gcp.tf
├── LICENSE
├── providers.tf
├── README.md
├── scripts
│   └── startup-script.sh
├── terraform.tfvars
└── variables.tf
1 directory, 9 files

What we did

We have separated the cloud infrastructure into a straightforward terraform file that contains all the resources specific to google. This takes the solution closer to the multi cloud pattern making the infrastructure easily replaceable. The exact layout certainly should be built on the modules pattern. Which shouldn’t be an issue to refactor and integrate. To summarize the infrastructure here is what is being provisioned as resources in our GCP Project:

  • network, subnet and firewall
  • an instance group manager with an instance template and a startup
  • script that prepares our web service
  • a managed DNS zone
  • load balancer that uses the instance group as backend
  • the certificate resource used by the balancer

# terraform.tfvars
# Domain name
domainname = "your-domain-name"
# GCP access
project_id = "GCP project id"
google_account_file = "path to the GCP credentials json"
# Registrar login
domain_user = "login"
domain_password = "password"
# Let's encrypt registration and production endpoint
email_address = "you+acme@gmail.com"
le_endpoint = "https://acme-v02.api.letsencrypt.org/directory"

When a domain name is being registered one has to provide valid nameservers that are supposed to be authoritative for it. With GCP and some other cloud providers it can be a problem since every zone created has its own authoritative servers assigned to it. So after the zone is created its authoritative servers have to be set through the registrar and everything has to wait until the change. We manage it with a single HTTP request and a DNS resolving test in a loop. Both implemented as local-exec provisioners of a null resource in the domain.tf file.

# This is how our registrar can be called to update the nameservers. YMMV
curl 'https://coreapi.1api.net/api/call.cgi?s_login=login&s_pw=password&command=ModifyDomain&domain=your-domain-name&nameservers'
# And now we wait
while true; do
dig +trace ns your-domain-name | grep '^your-domain-name\.' | grep your-new-namserver && exit 0
echo Waiting for nameservers to be updated ...
sleep 15
done
# Checkhout domain.tf to see the complete usage

Once the zone is up and nameservers are updated the ACME provider can proceed with the certificate request. The certificate’s generation is described in the gcp-cert.tf file that includes the creation of two keys, one for Let’s encrypt registration and the other for the certificate itself. Being resources and passed as arguments the keys will be kept in the secure remote state on Terraform Cloud. Small details to keep in mind:

  • the TTL of the records you create (SOA, NS, A, etc.) should be low to avoid waiting for propagation and to reach the service sooner
  • Let’s Encrypt has rate limits in place so play with the staging endpoint first
  • to configure your LB’s TLS properly don’t forget to add the certificate chain (your issuers certificates)
Honest Work

Once all is in place point your browser to the https://your-domain-name should result in a happy lock icon and your smiling face.

Authors: Michael Tabolsky & Francesco Bartolini, DevOps @ Bitrock

Read More
Continuous Deployment

Continuous Deployment of Kafka Connectors


Introduction to the Project

During one of our projects, we worked on a real-time data streaming application using Kafka. After receiving the data from some external services using REST API, we manipulated it using Kafka streams pipelines; then, we called an external REST API service to make the data available to the destination system. In order to develop the input and output components of the ETL flow, we wrote some custom Source Connectors and a Sink Connector for the Kafka Connect module.

Schema


The Problem

Probably, most of you have already figured out the problem: for each of our connectors, we needed to build the “fat Jar” and move it to the Kafka Connect folder, which, for the sake of making things more complicated, was running in Kubernetes. Unfortunately for us, the connectors couldn’t be counted on the fingers of one hand: consequently, the manual procedure was long, repetitive, and very time-consuming. Another issue we encountered was that the creation/deletion of our connectors was “made by hand”, which made it impossible to determine which configurations were being provisioned by looking at the Git repository.


The Solution

In our project, each connector was versioned using a separate git repository. We decided to aim at CI (Continuous Integration) and CD (Continuous Delivery), with Continuous Deployments in the development environment.

Flow

In our scenario, we decided to use Jenkins as a CI/CD daemon and Terraform to manage all the infrastructure and configurations for the connectors, aiming at achieving a full GitOps experience.

In our daily work, whenever we merge a pull request, this event triggers a Jenkins pipeline that builds the artifact and publishes it into a private Artifactory repository. If this phase succeeds, then the next step is to trigger the deployment job.

Code

Given that the deployment is fully automated in the development environment, the latest versions of our connectors are always running in a matter of minutes after having merged code changes. For the other environments, Jenkins's job requires manual approval, in order to avoid having unwanted changes promoted past the development environment. The deployment of connector Jars itself is performed by using a custom Helm chart that fetches the desired connectors artifacts before starting Kafka Connect container.

As seen before, in our project we were using Terraform for managing infrastructure and connectors configuration. We decided to use a single repository for all the Terraform codebase, using different Terraform states to manage different environments. For each connector, we created a Terraform module containing the connector resource definition...

Code

...and the expected configuration variables:

Code

In each environment configuration, we declared which connectors to configure, by instantiating the proper Terraform modules that were previously created. Terraform modules were versioned as well: this means that, in different environments, we could run different configurations of the same connector artifact. In this way, we were able to deploy the jar without doing it manually.

Code

The last missing piece was the creation of all the required topics for our application. We decided to define them into a simple yaml file and, with the help of a simple bash script, they got created when the Jenkins job ran.

Code


Conclusions

In this article we have explored how to improve and engineer the deployment of Kafka connectors in various environments without the need for manual intervention. Developers can focus on enhancing their code and almost ignore the deployment part, giving that they are now able to perform one-click deployments. Enabling connectors or changing configurations are now just a few lines changed in the Terraform repo, without the need of executing Kafka Connect API requests by hand. From our perspective, it makes sense for a lot of companies to invest time, money and resources to automate the deployment of the connectors.


Authors: Alberto Adami, Software Engineer @Bitrock - Daniele Marenco, Software Engineer @Bitrock

Read More
Exploring BDD

Exploring Behavior Driven Development (BDD)


What is BDD

Behavior Driven Development (BDD) is an Agile software development process that encourages collaboration among developers, QA and non-technical or business participants in a software project. It encourages teams to use conversation and concrete examples to formalize a shared understanding of how the application should behave.

The main concept behind BDD is the cooperation between all the stakeholders of a project, in order to share the definition of a set of functionalities and how they should behave through a set of concrete examples.

From this point of view, BDD as a practice involving both technical and business users is strictly related to the Agile methodology principles.


Mind the Gap: Business Vs. Developers Perspective

Let's consider the existing scenario before BDD emerged as a practice. As developers know, the process of translating software requirements into a set of well defined feature specifications is tedious, frustrating and error prone. A software requirements document was the typical interaction between business users and developers when the waterfall methodology was in place. This is typically a static kind of interaction: business users wrote the requirements, and then developers extracted from it a set of functionalities to implement.

Software requirements documents can contain a lot of unnecessary details, a lot of contradictory descriptions of the same functionalities and also a lot of insufficient definitions of some other functionalities. So developers typically need to ask business users to integrate the document several times, but every version of the document is not a 1 to 1 mapping to the set of functionalities to implement.

The first thing to note is that the process of extraction of well defined functionalities from this kind of document is error prone; therefore, there is no guarantee that we cover all the functionalities nor that we define them correctly.

The extraction process can produce a considerably and not acceptable gap between the business users’ point of view and the developers one.

This gap is mainly due to the fact that who creates software requirements documents and who uses them to create features are two distinct teams. If we consider that QA is another team, then we easily understand that this process is quite problematic.

BDD as a practice to encourage the collaboration between people having different cultures and mindsets, and together explore and define features behavior, is a way to fill this gap.


BDD as a Way to Describe Features

The central point of BDD is the sharing of tools and competences between technical and non-technical stakeholders, in order to share concepts and meet a common understanding of a set of functionalities.

The first tool we can use is the language: the concrete examples that describe the desired behavior of the system are written in a language that is very close to the natural language, so in the domain of business users.

BDD is made upon a three-step iterative process, where the steps are: Discovery, Formulation and Automation:

BDD - Scheme


1. Discovery

BDD helps teams have the right conversations at the right time, so that you can minimise the amount of time spent in meetings and maximise the amount of valuable code you produce.

In this phase team members, both technical and non technical users talk about the requirements related to one or more functionalities (user stories), in order to obtain a shared understanding of the expected behavior through a set of concrete examples that describes how the system should work in different scenarios.

This phase is based on structured conversations called discovery workshops, where team members focus around real world examples that describe the features from the user’s perspective.


2. Formulation

In this phase every example is expressed in a way that can be documented and then checked. The way is to express those examples using a medium that can be read both by humans and by automated processes.

A widely adopted language is gherkin: this is similar to a natural language and allows to describe the features through one or more scenarios.

Every scenario is a concrete example that explains how the feature should behave in a particular circumstance.

A typical scenario can be expressed in gherkin describing three things: 1) what are the preconditions to meet before beginning to use a feature, 2) what are the actions to be taken in order to use the feature, 3) and then what are the assertions to check if the feature is correctly implemented.

Here’s the structure of a typical BDD scenario:

Given a precondition

And another one precondition

And

When I do something

And something else

And ...

Then I expect something

And something else

And ...

Here’s an example of definition in gherkin of the feature related to money withdraw using an ATM:

BDD - Code


3. Automation

In this phase we take one scenario at a time and we make a test that satisfies the preconditions (expressed in the given clause), make the actions (expressed in the when clause) and then verify the assertions (expressed in the then clause).

The test is an automated way to verify if a functionality behaves as described in the corresponding scenario.

As we do in TDD, the test is made before the implementation: it is thus a failing test at the beginning, and then we implement the feature in order to make it pass.

Since this kind of test is defined by a team having business people in it, it has a recognized business value; therefore, it can be used as part of the acceptance tests.

BDD is supported by several open source and commercial tools; a couple of them are:

Cucumber https://cucumber.io/

JBehave https://jbehave.org/


In the following example you can see how the tests related to the first scenario of the ATM feature can be implemented:

BDD - Example


As you can see from the following example, the automation process produces a set of reports that are very useful for the team members to verify, during all the phases of the development, if the features are implemented, if they behave as expected or, for some reason, they need to be investigated further (in case a refactoring or some other change broke some of them).

Another key point is that BDD can be viewed as a sort of documentation: indeed, it explains what the features are, how they should behave, and how you can verify them.



BDD - Scenario example (ATM)

BDD with TDD

BDD does not replace TDD. The automation phase produces a set of automated tests, but compared to those created by TDD, they are at a higher abstraction layer: they are actually used to verify a scenario related to a feature or to a user story.

A BDD test will guide the implementation of a feature as a whole and how to meet the expectations described in the related scenario. The implementation phase of a single feature involves many low level components and, embracing TDD practices, that implementation will start with a failing unit test of a small component; then, the component will be implemented, resulting in a green test. The cycle is then repeated for other components, as described in the following image:

BDD - Scheme 2

Since every feature is made of many small components, then every BDD test corresponds to many TDD tests.

Both BDD and TDD are iterative processes: they start from a failing test, then the implementation will have the side effect to fix the test and, when a refactoring causes a test to fail, the iteration will start again.


TDD Vs. BDD

BDD tests are part of a shared understanding of a set of features between all the stakeholders of a project; they have a recognized business value and can be used as acceptance tests in order to verify if the system is behaving as expected.

They can be used to decide if it is safe to install in production (go / no go); typically, they can tell if something is broken, but not exactly what.

TDD tests are part of the development process, they are useful for developers to gain confidence about the quality of the software and also to do refactoring without the fear of breaking something, but they have not an immediately understandable business value.

TDD tests can tell exactly what small piece of the software is broken but, when one of these tests fails, it is usually safe to install in production.


Conclusions

Although tests are a fundamental part of it, defining BDD as a test practice is highly reductive.

BDD is intimately related to the concept of Agile: with its procedures and tools, it facilitates the collaboration between people with different backgrounds and roles, in order to define a common point of view on the features to be implemented.

BDD allows you to verify the correct behavior of your software at any time during the development process, and helps provide a structured documentation on how your software should work.

BDD and TDD are not in competition with each other: each practice completes the other and is a fundamental aspect of the development process.


References

https://cucumber.io/

https://jbehave.org/

https://cucumber.io/docs/bdd/

https://gojko.net/2020/03/17/sbe-10-years.html


Author: Massimo Da Ros, Lead Software Engineer @ Bitrock

Read More
Data Engineering

Data Engineering - Handling Unreliable Sources

Most of you have probably heard the phrase "data is the new oil", and that's because everything in our world produces valuable information. It's up to us to be able to extract the value from all the noisy, messy data that is being produced every instant.

But working with data is not easy: as seen before, real data is always noisy, messy, and often incomplete, and even the process of extraction sometimes is affected by some faults.

It is thus very important to make the data usable via a process known as data wrangling (i.e. the process of cleaning, structuring, and enriching raw data into the desired format) for better decision making. The crucial thing to understand here is that bad data lead to poor decision-making, so it's important to make this process stable, repeatable, and idempotent, in order to ensure that our transformations are improving the quality of the data and not degrading it.

Let's have a look at one of the aspects of the data wrangling process: how to handle data sources that cannot guarantee the quality of the data they are providing.


The Context

In a recent project we have been involved in, we faced the scenario in which the data sources were heavily unreliable.

Given the early definitions, the expected data, coming from a set of sensors, should have been:

  • approximately ten different types of data
  • every type at a fixed pace (every 10 minutes)
  • data will arrive in a landing bucket
  • data will be in CSV, with a predefined schema and a fixed number of rows

Starting from this, we would have performed validation, cleaning, and aggregation, in order to compute some KPIs. Moreover, these KPIs were the starting point of a later Machine Learning based prediction.

On top of this, there was a requirement to produce updated reports and predictions every 10 minutes with the most up-to-date information received.

As in many real-world data projects, the source data was suffering from multiple issues, like missing data in the CSV (sometimes some value missing in some cells, or entire rows were missing, or sometimes there were duplicated rows), or late-arriving data (even not arriving at all).


The Solution

In similar scenarios, it is fundamental to track the transformations that the data pipeline will apply, and to answer questions like these:

  • which are the source values for a given result?
  • does a result value come from real data or imputed data?
  • did all the sources arrive on time?
  • how reliable is a given result?

To be able to answer this type of questions, we first have to isolate three different kinds of data, in at least three areas:

1 Data Engineering

Specifically, the Landing Area is a place in which the external systems (i.e. data sources) will write, the data pipeline can only read from or delete after a safe retention time.

In the Raw Area instead, we are going to copy the CSVs from the Landing Area keeping the data as-is, but enriching the metadata (e.g. labeling the file, or putting it in a better directory structure). This will be our Data Lake, from which we can always retrieve the original data, in case of errors during processing or a new functionality is developed after the data has already been processed by the pipeline.

Finally, in the Processed Area we keep validated and cleaned data. This area will be the starting place for the Visualization part and the Machine Learning part.


After having defined the previous three areas to store the data, we need to introduce another concept that allows us to track the information through the pipeline: the Run Control Value

The Run Control Value is metadata, it's often a serial value or a timestamp, or others, and it gives us the possibility to correlate the data in the different areas with the pipeline executions.

This concept is quite simple to implement, but it's not so obvious to understand. On the other hand, it is easy to be misled; someone could think it is superfluous, and could be removed in favor of information already present in the data, such as a timestamp, but it would be wrong.

Let's now see, with a few examples, the benefit of using the data separation described above, together with the Run Control Value.


Example 1: Tracking data imputation

Let's first consider a scenario in which the output is odd and seems apparently wrong. The RCV column represents the Run Control Value and it's being added by the pipeline.

Here we can see that, if we look only into processed data, for the input at hour 11:00 we are missing the entry with ID=2, and the Counter with ID=1 has a strange zero as its value (let's just assume that our domain expert said that zeros in Counter column are anomalous).

In this case, we can backtrack in the pipeline stages, using the Run Control Value, and see which values have concretely contributed to the output, if all the inputs were available by the time the computation has run, or if some files were missing in the Raw Area and thus they have been fulfilled with the imputed values.

In the image above, we can see that in the Raw Area the inputs with RCV=101 were both negative, and the entity with ID=2 is related to time=12:00. If we then check the original file in the Landing Area we can see that this file was named 1100.csv (in the image represented as a couple of table rows for simplicity), so the entry related to the hour 12:00 was an error; the entry got thus removed in the Processed Area, while the other one was reset to zero by an imputation rule.

The solution of keeping the Landing Area distinct from the Raw Area allows us also to handle the case of Late Arriving Data.

Given the scenario described at the beginning of the article, we receive data in batches with a scheduler that drives the ingestion. So, what if, at the time of the scheduled ingestion, one of the inputs was missing and it has been fulfilled with the imputed values, but, at the time we are going to debug it, we can see that it's available?

In this case, it will be available in the Landing Area but it will be missing in the Raw Area; so, without even opening the file to check the values, we can quickly understand that for that specific run, those values have been imputed.


Example 2: Error from the sources with input data re-submission

In the first example, we discussed about how to retrospectively analyze the processing or how to debug it. We now consider another case: a source with a problem submitted bad data on a given run; after the problem has been fixed, we want to re-ingest the data for the same run to update our output, re-executing it in the same context.

The following image shows the status of the data warehouse when the input at hour 11.00 has a couple of issues: the entry with ID=2 is missing and the entry ID=1 has a negative value and we have a validation rule to convert to zero the negative values. So the Processed Area table contains the validated data.

In the fixed version of the file, there is a valid entry for each entity. The pipeline will use the RCV=101 as a reference to clean up the table from the previous run and ingest the new file.

In this case, the Run Control Value allows us to identify precisely which portion of data has been ingested with the previous execution so we can safely remove it and re-execute it with the correct one.


These are just two simple scenarios that can be tackled in this way, but many other data pipeline issues can benefit from this approach.

Furthermore, this mechanism allows us to have idempotency of the pipeline stages, i.e. being able to track the data flowing at the different stages enables the possibility to re-apply the transformations on the same input and to obtain the same result.


Conclusions

In this article, we have dived a bit into the data engineering world, specifically discovering how to handle data from unreliable sources, most of the cases in real-world projects.

We have seen why the stage separation is important in designing a data pipeline and also which properties every "area" will hold; this helps us better understand what is happening and identify the potential issues.

Another aspect we have highlighted is how this technique facilitates the handling of late-arriving data or re-ingesting corrected data, in case an issue can be recovered at the source side.


Author: Luca Tronchin, Software Engineer @Bitrock

Read More
Prometheus

Getting Started with Prometheus


What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit written in Go. Released by SoundCloud in 2012, it joined Cloud Native Computing Foundation in 2016 and in 2018 became the second graduated project alongside Kubernetes.

Based on metrics and not on logs, Prometheus uses its own time series database called TSDB and its own query language (PromQL).

The CNCF community loves Prometheus because:

  • it’s easy to configure, deploy, and maintain
  • it’s designed in multiple services, aiming at modularity
  • it’s container ready, “docker run” is enough to have it started
  • it’s orchestrator ready, supporting dynamic configurations
  • it’s an ecosystem: many client libraries and exporters maintained both by Prometheus team and the community



1 Prometheus


  • Prometheus collects data
  • Exporters expose data
  • Applications expose data
  • Grafana displays data
  • Alertmanager dispatches alerts

Prometheus is a pull-based monitoring system that scrapes metrics from configured endpoints, stores them efficiently and supports a powerful query language to compose dynamic information from a variety of otherwise unrelated data points.

To monitor your services using Prometheus, your services need to expose a Prometheus endpoint. This endpoint is an HTTP interface that exposes a list of metrics and the respective current values. Prometheus has a wide range of service discovery options to find your services and start collecting metrics data. The Prometheus server continuously polls the metrics interface on your services and stores the data. This provides a standardized way for metrics gathering.

Prometheus is designed to fetch data in intervals measured in seconds. And while Prometheus 2.x can handle somewhere north of ten millions series over a time window, which is rather generous, some unwise label choices can eat that surprisingly quickly.

Every 2 hours Prometheus compacts the data that has been buffered up in memory onto blocks on disk.

To reduce disk footprint, TSDB can have a shorter metrics retention period of the metrics or it can be configured to have a disk space limit. The data can be compacted and the WAL compressed as well.

The data structure is self-sufficient and can be moved from one instance to another independently given each time series is atomic and uniquely identified by its metric name (1). In recent Prometheus versions, remote storage support has been introduced in order to provide long term storage.

Core Prometheus server is a single binary and each Prometheus server is an independent process with its own storage. One of the downsides of this core implementation is the lack of clustering or backfilling “missing” data when a scrape fails.

Prometheus is not supposed to only be used with standard exporters (2), you can instrument your own code to capture the metrics that matter to you, business ones for example. Prometheus comes with the support for a wide range of languages (Go, Java or Scala, Python, Ruby, etc). Many upstream libraries are already instrumented by the maintainers, so you will get that for free!


What is a metric?

A metric is any numeric value that tells you something about how your system is operating. For example:

  • How much memory it is using
  • How long the last operation took* How many request were served today


3 Prometheus


In Prometheus there are 4 types of metrics: counter, gauge, histogram and summary.

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests.

A histogram samples observations, for example request durations or response sizes, and counts them in configurable buckets. It also provides a sum of all observed values. A histogram with a base metric name of exposes multiple time series during a scrape:

  • cumulative counters for the observation buckets, exposed as _bucket{le=""}
  • the total sum of all observed values, exposed as _sum
  • the count of events that have been observed, exposed as _count (identical to _bucket{le="+Inf"} above)

Similar to a histogram, a summary samples observations, for example request durations and response sizes. While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.

A summary with a base metric name of exposes multiple time series during a scrape:

  • streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as {quantile="<φ>"}
  • the total sum of all observed values, exposed as _sum
  • the count of events that have been observed, exposed as _count

The essential difference between summaries and histograms is that summaries calculate streaming φ-quantiles on the client side and expose them directly, while histograms expose bucketed observation counts and the calculation of quantiles from the buckets of a histogram happens on the server side using the histogram_quantile() function.

https://prometheus.io/docs/concepts/metric_types/

https://prometheus.io/docs/practices/histograms/


Understanding metrics

Prometheus metrics have a name and might have any arbitrary number of labels:

A Metric has metadata (labels) and lots of functions to filter, change, remove those while fetching them from the targets. The name “node_cpu_seconds_total” consist of a prefix for the namespace (node metrics) and suffix for the unit of the value ( Seconds of CPU time in total )

https://prometheus.io/docs/practices/naming/

promtool allows to lint them for consistency and correctness.

Examples:

5 Prometheus

PromQL

Prometheus Query Language (PromQL) supports a wide range of functions for interacting with scraped metrics. Some examples:

  • Filtering by label: _http_requeststotal{status=\~"5.."}
  • Calculating rates: _rate(http_requeststotal[5m])
  • Arithmetic ( +, *, /, -, %, ^) and Comparison ( >, <, >=, <=, ==, != ) operations
  • Aggregation and Grouping: _sum(rate(node_network_receive_bytestotal[5m])) by (instance)* Quantile: _histogram_quantile(0.95, sum(rate(http_request_duration_secondsbucket[5m])) by (le))
  • Recording Rule: precompute frequently needed or computationally expensive expressions, in order to make recurring queries much faster to compute


Alerting

Our motto is: if you can graph it, you can alert on it! It’s really easy to set up alerts in Prometheus, it’s just a matter of defining which query to evaluate and which is the range of safe values:

6 Prometehus

Prometheus will evaluate the alerting rule regularly and will mark it as firing in case the rule matches. However, Prometheus core component will not take care directly of sending alerts to final users. Alertmanager instead will take care of performing alert related operations.

Alertmanager :

  • Receives alerts from Prometheus
  • Groups them
  • Inhibits them, for example in case of false positives
  • Dispatches them to downstream services, such as Slack or PagerDuty and many more
  • Built In HA leveraging gossip protocol


7 Prometehus


References


Notes

(1) https://github.com/bitnami/kube-prod-runtime/blob/master/docs/migration-guides/prometheus-migration.md

(2) [https://prometheus.io/docs/instrumenting/exporters] https://github.com/prometheus/prometheus/wiki/default-port-allocations


Author: Matteo Gazzetta, DevOps Engineer @Bitrock

Read More
Terraform Community Tools

Terraform Community Tools

Despite not having reached version 1.0 yet, Terraform has become the de facto tool for cloud infrastructure management. One of its major winning points is definitely the extensive cross cloud support, which allows projects to span from one cloud vendor to another with a minimal operational effort. Moreover, the popularity in the community continuously releasing reusable infrastructure components, the Terraform modules, makes it easy to bootstrap new projects with a fully functional setup right from the start.

In order to address all the different use cases of Terraform, whether it is executed as part of a GitOps pipeline or right from developers machines, the community has built a set of tools to enhance the developers experience.

In this blog post we will describe some of them, focusing on those that might not be that popular or widely adopted, but certainly deserve some attention.

Pull Request Automation

Atlantis

GitHub Website

Atlantis

Atlantis is a golang application that listens for Terraform pull request events via webhooks. It allows users to remotely execute "terraform plan" and "terraform apply" according to the pull request content commenting back the result. Atlantis is a good starting point for making infrastructure changes visible to all teams, allowing even non-operations ones to contribute to Terraform infrastructure codebase. If you want to see Atlantis in action, check this walkthrough video.

If you want to restrict and audit the execution of Terraform changes still providing a friendly interface, Terraform Cloud and Enterprise support invoking remote operations by UI, VCS, CLI and API. The offering includes an extensive set of capabilities for integrating infrastructure changes in CI pipelines.


Importing Existing Cloud Resources

Importing existing resources into a Terraform codebase is a long and tedious process. Terraform is capable of importing an existing resource into its state through "import" command, however the responsibility of writing the HCL describing the resource is on the developer. The community has come up with tools that are able to automate this process.

Terraforming

GitHub Website

Terraforming supports the export of existing AWS resources into Terraform resources, importing them to Terraform state and writing the configuration to a file.

Terraformer

GitHub

Terraformer supports the export of existing resources from many different providers, such as AWS, Azure and GCP. The tool leverages Terraform providers for performing the mapping of resource attributes to Terraform ones, which makes it more resilient to API upgrades. Terraformer has been developed by Waze and now maintained by Google Cloud Platform team.

Version Management

tfenv

GitHub

When working with projects that are based on different Terraform versions, it is tedious to switch from one version to another and the risk of updating the states’ Terraform version to a new one is high. tfenv comes in support and makes it easy to have different Terraform versions installed on the same machine.

Security and Compliance Scanning

tfsec

GitHub

Logo

tfsec performs static analysis of your Terraform code in order to detect potential vulnerabilities in the resulting infrastructure configuration. It comes with a set of rules that work cross provider and a set of provider specific ones, with support for AWS, Azure and GCP. It supports disabling checks on specific resources making it easy to include the tool in a CI pipeline.

Terrascan

GitHub Website

Terrascan

Terrascan detects security and compliance violations in your Terraform codebase, mitigating the risk of provisioning unsecure cloud infrastructures. The tool supports AWS, Azure, GCP and Kubernetes, and comes with a set of more than 500 policies for security best practices. It is possible to write custom policies with Open Policy Agent Rego language.

Regula

GitHub

Regula is a tool that inspects Terraform code looking for security misconfigurations and compliance violations. It supports AWS, Azure and GCP, and includes a library of rules written in Open Policy Agent language Rego. Regula consists of two parts, the first one generates a Terraform plan in JSON that is then consumed by the Rego framework which in turn evaluates the rules and produces a report.

Terraform Compliance

GitHub Website

Logo

Terraform Compliance approaches the problem from a different perspective, allowing to write compliance rules in a Behaviour Driven Development (BDD) fashion. An extensive set of examples provides an overview of the capabilities of the tool. It is easy to bring Terraform Compliance into your CI chain and validate infrastructure before deployment.

While Terraform Compliance is free to use and easy to get started with, a much wider set of policies can be defined using HashiCorp Sentinel, which is part of the HashiCorp Enterprise offering. Sentinel supports fine-grained condition-based policies, with different enforcing levels, that are evaluated as part of a Terraform remote execution.

Linting

TFLint

GitHub

TFLint is a Terraform linter that focuses on potential errors and best practices. The tool comes with a general purpose and AWS rule set while the rules for other cloud providers such as Azure and GCP are being added. It does not focus on security or compliance issues, rather on validating configuration variables such as instance types, which might cause a runtime error when applying the changes. TFLint tries to fill the gap of “terraform validate”, which is not able to validate variable values beside syntax and internal consistency checks.


Cost Estimation

infracost

GitHub Website

Infracost

Keeping track of infrastructure pricing is quite a mess and one usually discovers the actual cost of a deployment after running it for days if not weeks. infracost comes in help providing a way to estimate how much the resources you are going to deploy will cost. At the moment the tool supports only AWS, providing insights for the costs of both hourly priced resources and usage based resources such as AWS Lambda Functions. For the latter, it requires the usage of infracost Terraform provider which allows describing usage estimates for a more realistic cost estimate. This enables quick “what-if” analysis like “what if this month my Lambda gets 2 times more requests?”. The ability to output a “diff” of the costs is useful when integrating infracost in your CI pipeline.

Terraform Enterprise provides a Cost Estimation feature that extends infracost offering with the support for the three major public cloud providers: AWS, Azure and GCP. Moreover, Sentinel policies can be applied for example to prevent the execution of Terraform changes according to the increment of costs.


Author: Simone Ripamonti, DevOps Engineer @Bitrock

Read More
Monitoring Kafka Connector with Kubernetes

Monitoring Kafka Connector with Kubernetes


The Problem

The popularity of microservice architecture has enormously increased recently; but this comes with new challenges.

One of these is monitoring. In one of our projects, we used a Kafka connector to intercept changes in our database and write data to a topic. This was a very important component of the system, so we needed to consider its health status carefully.


Solution

In our first version, we created a Kubernetes’ CronnJob with a simple shell script that checks the status of the connector and, eventually, deletes the failed and restarts it.

This worked quite well; however, this is different from how the other services are health checked with the Kubernetes.

The connector was deployed with Kubernetes; the most natural thing to do is thus using k8s for monitoring pods and eventually restarting it.

The Kafka Connect framework comes with Rest API, and one of these gives you the state of the connectors:

i.e : https://bitrock.it/blog/monitoring-kafka-connector-with-kubernetes/

This seems to resolve our problem... But is it really the case?

Kubernetes health check controls the HTTP status code; the problem is that the Kafka connector API returns 200 HTTP status.

For instance, if the task is failed, the API will return:

HTTP/1.1 200 OK

{"state":"FAILED","id":1,"worker_id":"192.168.86.101:8083"}

In this case, from the Kubernetes point of view, everything is ok.

The solution that worked well for us consisted in adding a sidecar container that takes responsibility for exposing the state of the connector task.

The sidecar pattern allows you to extract some functionalities of your application in a different component. For example, we can separate the authentication layer from our “main” component that contains the business logic or - as in our case - extracts the monitoring part.

Our goal is to obtain something like this:

First of all, we created a simple application that takes care of calling the connector API and exposes an API for Kubernetes (we used a simple Python application using Flask - but you can use whatever you want). Something like this:

As you can see, the code is very simple.

The application does two different things: first of all, it exposes an endpoint at “/health” paths that will be called periodically by Kubernetes; secondly, it checks the status of a task and eventually returns an Internal Server Error, in case the HTTP status of the connector was not 200 or if the status was not “RUNNING”.

Now, this application needs to be deployed in the same pods of the connector. This can be done by adding to our deployment.yaml file the container that contains our Python application:


Conclusions

The logical result?

Both containers expose the health check of the sidecar, since Kubernetes does not restart the entire pods if one container is up; exposing the same API, the destiny of both containers would be the same.

Once the connector is in FAILED state, Kubernetes will restart the pod.

Some cloud providers may provide a built-in solution for problems like this; but if you can’t use it - for whatever reason - this can be a possible solution.


Author: Marco Tosini, Principal Engineer @Bitrock

Read More