Platform Engineering

Platform Engineering is one of the biggest trends in the industry and it is undoubtedly here to stay, with a community and a tooling ecosystem that are both expanding quickly. But, as with any (relatively) new trend, there are still many open questions.

In this blog post, we’ll try to understand the concept from scratch and answer some basic questions that people both within and outside the community are curious about.

Without further ado, here’s the essential information you should be aware of regarding Platform Engineering. Agreeing on a common definition is rather difficult; however, we can define Platform Engineering as the discipline of designing, building, and maintaining the supporting infrastructure and tooling necessary for the efficient operation of software applications.

In order to facilitate the development, deployment, and management of apps across various environments, Platform Engineering focuses on building scalable, reliable, and efficient platforms. 

Platform Engineering combines Software Engineering, Infrastructure Engineering, and DevOps. Its main objective is to provide faster delivery of software applications and services while also enhancing agility and flexibility in response to changing corporate needs.

Or, in other words, to standardize processes in order to enhance the developer experience, quicken innovation cycles, and reduce the engineering organization’s time to market. 

Platform Engineering allows development teams to build upon a strong foundation, which enables businesses to innovate more quickly and stay one step ahead of the competition.

The Platform team can gain a holistic understanding of developers’ pain points and common difficulties inside the company by conducting user research, requesting user feedback, and getting internal buy-in from stakeholders. It can then identify the features developers require and provide a golden path that addresses them.

However, the Platform journey does not end there. In order to ensure that developers are using the Platform and that it is actually improving their lives, successful Platform teams maintain open lines of communication with developers and track technical KPIs.

DevOps vs. Platform Engineering

DevOps and Platform Engineering are closely related topics.

DevOps brings development and operations teams closer together, and focuses on using tools, systems and iterative processes to shorten feedback cycles. It represents a philosophical and methodological approach to software lifecycle management that contributes to the creation of the Platform and is, in turn, one of the services delivered by it.

Platform Engineering, on the other hand, can be considered the discipline focused on the design, development and management of the technology infrastructures and platforms on which digital services and software applications are delivered. Through the Platform, IT teams get self-service access to the digital assets they need, and digital business services are delivered.

Platform Engineering is also, so to speak, an evolution of DevOps. DevOps outlines guidelines for simplifying development through automation, autonomy, and collaboration. These qualities are also crucial to Platform Engineering, so the discipline helps you achieve good DevOps performance.

DevOps is one of the methodologies that compose Platform Engineering, and one that those who use the Platform find available out of the box.

An Internal Corporate Service

Platform Engineering is definitely an internal corporate service that is addressed to all IT figures involved in creating digital initiatives within an organization.

Platform Engineering, in fact, can be seen as a technological and organizational layer that gives development teams access to services like monitoring and security in a “self-service” mode, making use of automation and the above-mentioned DevOps practices.

It is a kind of large distributor of digital services within the organization, essential to any new initiative that needs access to digital assets.

Externally, however, Platform Engineering becomes visible in the tangible results of the services built through the Platform itself.

Impacts on IT Governance

The centralization of digital services within the Platform has significant impacts. 

First, the Platform enables considerable cost management and control, especially when it integrates elements usually external to typical corporate governance, as in the case of data centers and cloud services.

Another relevant effect of Platform Engineering is the harmonization, within a company’s organizational set-up, of all the product and service offerings that arrive from suppliers. These have to adapt to the organization’s security standards and to its methodologies for updating and maintaining infrastructure and applications.

Therefore, IT Governance has to integrate into the Platform’s context, embedding itself in a self-service system that represents the backbone of the enterprise’s technology assets.

We can affirm that Platform Engineering can be seen as the answer to every CTO’s dreams!

Conclusions

The Platform Engineering community started in 2021 with a handful of meetup groups in the USA (Austin) and Europe (Berlin). Over 10,000 Platform developers are now engaged, spread over 19 meetup groups worldwide. In light of this community movement, Platform Engineering should be taken more seriously by organizations.

And you? What are you waiting for? Discover more about Platform Engineering by listening to the latest episode of our Bitrock Tech Radio podcast, or get in contact with one of our experienced engineers and consultants!

Author: Franco Geraci, Head Of Engineering @ Bitrock

Read More
Introduction to HC Boundary

Secure your access in a dynamic world

The current IT landscape is characterized by multiple challenges, and quite a few of them are related to the increasing dynamism of the environments IT professionals are working in. One of these challenges is securing access to private services. The dynamic nature of the access manifests itself on multiple levels:

  • Services: they tend to be deployed in multiple instances per environment
  • Environments: the hosts where the workload is deployed can change transparently to the end user
  • Users: people change roles, people come and go from a team
  • Credentials: the more often they are changed, the more secure they are

Tools developed when this kind of dynamism was not foreseeable are starting to show their limitations. For example, accessing a service often means providing network access to a subnet, where careful network and firewall policies need to be set up. The resulting access is granted to a user independently of their current role.

Zero trust in a dynamic environment

A zero trust approach is highly desirable in every environment. Being able to assume zero trust and to grant access to resources granularly, with role-based rules and without having to touch delicate resources like networks and firewalls, is paramount in a modern IT architecture.

This is even more true in a dynamic environment, where the rate of change can put security teams and their toolchains under pressure as they try to keep access configurations up to date.

Boundary to the rescue

HashiCorp’s Boundary is designed to fulfill the requirements of granting secure access in a zero trust environment. Access to a remote resource is granted by defining policies on high-level constructs that encapsulate the dynamic nature of the access.

The main components are:

  • Controller (control plane): the admin user interacts with the controller to configure access to resources; regular users interact with it to request authentication and authorization.
  • Worker (data plane): the connection between the local agent and the remote host passes through this gateway, which permits the connection based on what the controller authorizes.
  • Local Agent: interacts with the controller and the worker to establish the connection.

Boundary concepts

Identity is a core concept in Boundary. Identity is represented by two types of resources, mapping to common security principals:

  • Users, which represent distinct entities that can be tied to authentication accounts
  • Groups, which are collections of Users that allow easier access management

Roles map users and groups to a set of grants, which provide the ability to perform actions within the system.

Boundary’s permissions model is based on RBAC and each grant string is a mapping that describes a resource or set of resources and the permissions that should be granted to them.

A scope is a permission boundary modeled as a container. There are three types of scopes in Boundary: 

  • a single global scope, which is the outermost container
  • organizations, which are contained by the global scope
  • projects, which are contained by organizations

Each scope is itself a resource.

Boundary administrators define host catalogs that contain information about hosts. These hosts are then collected into host sets which represent sets of equivalent hosts. Finally, targets tie together host sets with connection information.
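As a minimal sketch of what this looks like from the user’s point of view, assuming an admin has already configured a password auth method and a target (the IDs below are placeholders):

# authenticate against the controller
boundary authenticate password -auth-method-id ampw_1234567890 -login-name jim
# open an SSH session towards a target: the worker proxies the connection
# according to what the controller authorizes
boundary connect ssh -target-id ttcp_1234567890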

Boundary interfaces

Boundary offers multiple interfaces to interact with the tool:

  • a CLI that we DevOps engineers love
  • a user friendly Desktop application
  • a Web UI for the server

Integration is key

So how can this be kept up to date with the current dynamic environments?

The answer lies in the integrations that are available to add flexibility to the tool. Specifically, when it comes to user authentication, the integration with an identity provider through the standard OIDC protocol can be leveraged. When it comes to credentials, the integration with HashiCorp Vault surely (pun intended) covers the need for correctly managed secrets and their lifecycle (Vault Credential Brokering). Finally, when we talk about the list of hosts and services, we can leverage the so-called Dynamic Host Catalogs, which can be kept up to date in push mode through the integration with HashiCorp Terraform or in pull mode by interacting with HashiCorp Consul.

Want to get your feet wet?

It seems like this tool provides a lot of value, so why not integrate it into your environment? We are already planning to add it to our open-source Caravan tool.

There’s a good chance for you to get your feet wet playing with Boundary and other cool technologies. Don’t be shy: join us on (the) Caravan!


Discover more on Zero Trust in our upcoming Webinar in collaboration with HashiCorp

When: Thursday, 31st March 2022
Where: Virtual Event
More details available soon – Follow us on our Social Media channels to find out more!

Read More
PNRR Bitrock

“There is no alternative”, went an old political slogan that led to the creation of the longest-lasting government of the twentieth century. Today, the same mantra applies to the many innovation and digitalization needs of the Italian productive fabric, which is (re)entering the global market at the hoped-for end of the pandemic crisis, already burdened by decades of declining productivity and competitiveness. For those who want to thrive in the new scenario, technological change is a binding imperative.

In this sense, the Italian National Recovery and Resilience Plan (PNRR) represents a significant opportunity. Drawn up in response to the severe economic and social crisis triggered by Covid-19, it allocates 191.5 billion euros to a series of measures aimed at relaunching Italy’s fragile economy and stimulating employment. The areas range from the development of sustainable mobility to the ecological transition and the inclusion of social groups further marginalized by job insecurity.

Digital Transition 4.0 for the Italian System

The first mission of the PNRR focuses on “Digitalization, Innovation, Competitiveness, Culture and Tourism”, highlighting the key concepts that serve as the leitmotiv of the entire Recovery Plan. It allocates 40.32 billion euros to a digital transition program that covers both the public and the private sector.

The goal is to support the development and competitiveness of a country that currently ranks 25th (out of 28) in the Digital Economy and Society Index (DESI). As the PNRR notes (p. 83), this lag goes hand in hand with the decline in productivity that has characterized the Italian economy over the last twenty years, against a positive trend in the rest of Europe. This contraction is often linked to the limited digital innovation of small and medium-sized enterprises, which account for 92% of companies and employ 82% of workers in Italy (Il Sole 24 Ore).

The mission is divided into three components:

  1. Digitalization, Innovation and Security in the Public Administration (€9.75 billion)
  2. Digitalization, Innovation and Competitiveness of the Production System (€23.89 billion)
  3. Tourism and Culture (€6.68 billion)

Let’s look in detail at the second component, which receives one of the largest investments of the PNRR.

Digitalization, Innovation and Competitiveness of the Production System: how it works

In the words of the document, the program for the private sector aims to strengthen “the tax incentive policy already in place (designed to close the ‘digital intensity’ gap of our production system with the rest of Europe – a shortfall in investments estimated at two points of GDP – especially in manufacturing and SMEs), which has had positive effects both on the digitalization of companies and on employment, especially among young people and in new professions” (p. 98).

It includes a series of investments and reforms aimed at strengthening the digitalization, technological innovation and internationalization of the production and entrepreneurial fabric, with a specific focus on the SMEs that suffer most from today’s climate of volatility.

Within the PNRR, the “Transition 4.0” investment plan is an evolution of the well-known Industria 4.0 program of 2017, whose pool of potentially eligible companies has been broadened. Among other things, it provides a tax credit for companies that decide to invest in:

  1. Capital goods, tangible and intangible
  2. Research, development and innovation
  3. Training activities for digitalization and the development of related skills

The first item concerns investment in tools “directly connected to the digital transformation of production processes” – the so-called 4.0 goods already listed in annexes A and B of Law 232 of 2016 – and “intangible assets of a different nature, but instrumental to the company’s activity” (p. 99).

While the first annex details a series of hardware components, including machinery, tools and monitoring systems, the second focuses on high-tech software solutions that can support companies on a path of scalable and sustainable growth.

Possible Applications

Integrated into a strategic vision, the hardware and software solutions mentioned in the PNRR can be applied in a number of areas, including:

  • The transition to the Cloud Native paradigm, an approach that leverages Cloud Computing technologies to design and implement applications based on the principles of flexibility, adaptability, efficiency and resilience. Thanks to methodological and technological tools such as DevOps, containers and microservices, Cloud Native makes it possible to reduce time to market and support the agile evolution of the entire corporate ecosystem.
  • The valorization of corporate information assets through the implementation of real-time Data Analysis, IIoT (Industrial Internet of Things) and Data Streaming systems which, combined with Machine Learning and Artificial Intelligence, can be exploited for predictive maintenance, with clear cost optimization. This area also includes Digital Twins, virtual copies of industrial assets or processes that make it possible to test new solutions in vitro and prevent malfunctions.
  • Cybersecurity, increasingly central in a context of growing digitalization of processes and services, and of growing interdependence of national and foreign, public and private actors within the digital value chain.

These digital maturation paths can be relevant both for large enterprises and for the SMEs that struggle most to keep pace with technological evolution and international competition. The effort pays off: according to the SME digital innovation observatory of the Politecnico di Milano, digitalized small and medium-sized companies report on average a 28% increase in net profit, with a profit margin 18% higher (La Repubblica).

So why don’t companies digitalize? The problem is often the lack of qualified personnel. The shortage of qualified staff affects 42% of Italian SMEs (La Repubblica), and the figure rises to 70% if we consider the entire European productive fabric (European Commission). Another possible blocking factor is the reluctance to abandon or evolve legacy systems already consolidated within business processes.

These are just some of the reasons why it is essential to work alongside a qualified partner who can guide the company in planning the technological and digital investments made possible by the PNRR (and beyond).

Bitrock has certified skills and international experience to offer tailor-made solutions that innovate the technological and digital ecosystem while preserving the client’s legacy investments. Specialized know-how in DevOps, Software Engineering, UX&Front-End and Data&Analytics is the key to tackling the digital evolution journey, centered on the values of simplification and automation that generate lasting value.

To find out in detail how we can support your company, contact us now!

Read More

A Joint Event from Bitrock and HashiCorp

Last week we hosted an exclusive event in Milan dedicated to the exploration of modern tools and technologies for the next-generation enterprise.
The first event of its kind, it was held in collaboration with HashiCorp, US market leader in multi-cloud infrastructure automation, after the Partnership we signed in May 2020.

HashiCorp’s open-source tools Terraform, Vault, Nomad and Consul enable organizations to accelerate their digital evolution, as well as adopt a common cloud operating model for infrastructure, security, networking, and application automation.
As companies scale and increase in complexity, enterprise versions of these products enhance the open-source tools with features that promote collaboration, operations, governance, and multi-data center functionality.
Organizations must also rely on a trusted Partner that is able to guide them through the architectural design phase and can provide enterprise-grade assistance when it comes to application development, delivery and maintenance. And that’s exactly where Bitrock comes into play.

During the Conference Session, the Speakers had the chance to describe to the audience how large companies can rely on more agile, flexible and secure infrastructure thanks to HashiCorp’s suite and Bitrock’s expertise and consulting services, especially when it comes to the provisioning, protection and management of services and applications across private, hybrid and public cloud architectures.

“We are ready to offer Italian and European companies the best tools to evolve their infrastructure and digital services. By working closely with HashiCorp, we jointly enable organizations to benefit from a cloud operating model.” – said Leo Pillon, Bitrock CEO.

After the Keynotes, the event continued with a pleasant Dinner & Networking night at the fancy restaurant by Cascina Cuccagna in Milan.
Take a look at the pictures below to see how the event went on, and keep following us on our blog and social media channels to discover what other incredible events we have in store!

Read More
Caravan Series Part 1

Introduction

The current IT industry is characterised by multiple needs, often addressed by a heterogeneous set of products and services. To help professionals adopt the best performing solutions for sustainable development, the Cloud Native Computing Foundation was created in 2015 with the aim of advancing container technology and aligning the IT industry around its evolution.

We conceived Bitrock’s Caravan project following the Cloud Native principles defined by the CNCF:

  • leverage the Cloud
  • be designed to tolerate Failure and be Observable
  • be built using modern SW engineering practices
  • base the Architecture on containers and service meshes

The HashiCorp stack fulfills these needs, enabling developers to build and run applications faster and more efficiently.

The Caravan Project

Caravan is your open-source platform builder based on the HashiCorp stack. Terraform and Packer are used to build and deploy a cloud-native and ready-to-use platform composed of Vault, Consul and Nomad.

Vault allows you to keep secrets, credentials and certificates safe across the company. Consul provides Service Discovery and, with Consul Connect, a Service Mesh that brings truly dynamic communication to your next-gen and legacy applications. Nomad provides powerful placement, scaling and balancing of your workloads, whether they are containerized or legacy, services or batch jobs.

Thanks to Terraform and Ansible, Infrastructure and Configuration as Code lie at the core of Caravan.

The rationale behind Caravan is to provide a one-click experience to deploy an entire infrastructure and the configuration needed to run the full HashiCorp stack in your preferred cloud environment.

Caravan’s codebase is modular and layered to achieve maximum flexibility and cover the most common use cases. Multiple cloud providers and optional components can be mixed to achieve specific goals.

Caravan supports both Open Source and Enterprise versions of HashiCorp products.


Caravan in a nutshell

Caravan is the perfect modern platform for your containerized and legacy applications:

  • Security by default
  • Service mesh out of the box
  • Scheduling & Orchestration
  • Observability
  • Fully automated

Want to know more about Caravan? Visit the dedicated website, check our GitHub repository and explore our documentation.

Authors: Matteo Gazzetta, DevOps Engineer @ Bitrock – Simone Ripamonti, DevOps Engineer @ Bitrock

Read More
Intro to HashiCorp Vault

A Hands-on Workshop by Bitrock and HashiCorp


On June 16th, 2021 we held our virtual HashiCorp Vault Hands-On Workshop, an important event in collaboration with our partner HashiCorp, during which attendees had the opportunity to get a thorough presentation of the HashiCorp stack before starting a hands-on labs session to learn how to secure sensitive data with Vault.

Do you already know all the secrets of HashiCorp Vault?

HashiCorp Vault is an API-driven, cloud agnostic Secrets Management System, which allows you to safely store and manage sensitive data in hybrid cloud environments. You can also use Vault to generate dynamic short-lived credentials, or encrypt application data on the fly.

Vault was designed to address the security needs of modern applications. It differs from the traditional approach by using:

  • Identity based rules allowing security to stretch across network perimeters
  • Dynamic, short lived credentials that are rotated frequently
  • Individual accounts to maintain provenance (tie action back to entity)
  • Credentials and Entities that can easily be invalidated

Vault can be used in untrusted networks. It can authenticate users and applications against many systems, and it runs in highly available clusters that can be replicated across regions.
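As a minimal sketch of the dynamic credentials idea, assuming an OIDC auth method and a database secrets engine with a "readonly" role have already been configured:

# authenticate through the configured identity provider
vault login -method=oidc
# ask Vault for short-lived database credentials
vault read database/creds/readonly
# Vault returns a freshly generated username/password pair with a lease
# and revokes it automatically when the lease expires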

Thanks to our experts Gianluca Mascolo (Senior DevOps Engineer at Bitrock) and Luca Bolli (Senior Solution Engineer at HashiCorp) for the overview of the HashiCorp toolset and the unmissable labs session.

If you’d like to learn more about our enterprise offerings or if you want to receive the presentation slides, please reach out to us at info@bitrock.it

To access the workshop recording, simply click here

We look forward to seeing you at a future Bitrock event!

Read More

Let’s Encrypt with Terraform

Today’s web traffic is virtually impossible without encryption. The need to cryptographically protect data in transit, whether real or perceived, has become the norm and a requirement for any kind of service to be considered properly implemented. From a simple portfolio website, whose ranking gets downgraded by search engines without HTTPS, to public API gateways that move around sensitive data: everything has to be verified and encrypted.

This increase in usage, however, has to deal with the complexity of the technological implementation. SSL and later TLS, with public CA-signed certificates and cross-signed private PKI implementations, were always something many IT professionals struggled to comprehend and use properly. It just seemed to add a hardly justifiable overhead.

Then automation came. With the “automate all the things” approach, TLS certificates were given another push with all kinds of APIs and scripts that allowed for dynamic creation, distribution, and maintenance of certificates and complete in-house Certificate Authorities.

But, as it always goes with automation, the tool that solves one problem isn’t always good for solving another just because it was tagged with the same words in the ticket. So the scripts and services should be chosen to satisfy the specific need. There is, however, a simple case that covers most of the uses: the humble HTTPS certificate. Bring up a website, a REST API or a download endpoint for your installation packages, and you need a certificate to go with it. If it is a public service, you need it to be signed by a public CA. If it is in the cloud, you have to manage it dynamically. And if you do, then it is better to manage it as code.

Here at Bitrock, when it comes to automation we start with Terraform and see what we can drop on top of it to achieve the goal, keeping everything as IaC as much as we can. And this is where we start with certificates too. Once the use case is identified, analyzed and solved, we can easily reuse it with Terraform in other projects, which, given the flexibility of the tool and the similarities between cloud platforms, should work most of the time. This article illustrates our approach to automating certificate-as-code management in the specific case of a public HTTP service behind an in-cloud load balancer.

A bit of context

First a refresher on the details before the implementation of the process can start.

Let’s start with the Certificate Authority (CA) which, for the sake of simplicity, is a provider of digital certificates. There are many components in a CA, but we are only interested in one. As a service consumer, you ask the CA to certify that you own a property on the internet, in most cases a domain name, such as “bitrock.best”. The result of this certification is a signed TLS certificate, usually a file you keep within reach of your web server. The standard process is performed in three steps:

  1. Consumer generates a private key and a certificate signing request
  2. Consumer sends the certificate signing request to the CA
  3. CA verifies the ownership of the property described in the requests and issues the certificate to the consumer

What the consumer is left with are at least two items: the private key and the certificate. The certificate can be read by anyone but can only be used for encryption by the private key owner. And the private key is what should be kept private.
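As a reference, steps 1 and 2 are what a consumer would traditionally do with openssl (a minimal sketch, using the domain name above):

# generate the private key and the certificate signing request in one go
openssl req -new -newkey rsa:2048 -nodes \
  -keyout bitrock.best.key \
  -out bitrock.best.csr \
  -subj "/CN=bitrock.best"
# the .csr file is what gets sent to the CA; the .key file never leaves your side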

20 years ago… I was there Gandalf when they sent faxes

The process of issuing a certificate is simple in itself, but the verification of property ownership is what usually complicates it. Since the 90s, having a certificate that was signed and trusted by any client meant paying for the service, and the service provider used to verify via email, fax, phone calls and even in person that the consumer owned the domain name or the business name.

The way of Let’s Encrypt

Then came the free

While it still makes sense today for banks or large e-commerce companies, for a simple website or service everything changed a few years ago when the Let’s Encrypt project went public. The project has built a protocol and a service provider which together allow having a certificate signed by a publicly trusted CA with a couple of API calls.

Having a certificate issued and signed by Let’s Encrypt on your “normal” server is extremely easy. You just install the “certbot” package using your package manager and run it. If you are using a supported web server such as Apache or nginx, certbot will even set it up for you. Otherwise you can get the certificate by just pointing certbot to where your web root is, and then point the web server configuration to your freshly signed certificate and its private key.
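A minimal sketch on a Debian-like host (package names and paths vary by distribution):

apt install certbot python3-certbot-nginx
# let certbot obtain the certificate and patch the nginx configuration
certbot --nginx -d your-domain-name
# or obtain the certificate only, pointing certbot at the web root
certbot certonly --webroot -w /var/www/html -d your-domain-name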

The “normal” usage of certbot, however, implies a “normal” server, which doesn’t match the “cattle vs. pets” model of modern infrastructure. In a modern architecture the node where your web server is running should be an immutable and disposable element, so the certificate and the key should be configured on an external entity. Think of a cloud compute instance and a cloud load balancer: the load balancer accepts the client requests, does all the TLS termination heavy lifting and forwards the request to any available compute instance.

The above use case eliminates the possibility of using certbot as easily as with a “normal” server: the verification process is trickier to implement using the web server files, and the certbot process does not have access to where the certificate and key files are stored. This forces a different verification approach, based on DNS. In the case of files, the ownership verification relies on the consumer owning the web server responsible for serving the content of the domain name: a file with cryptographic content is stored by certbot on the server, and Let’s Encrypt servers reach for it to verify that certbot is indeed running on the domain name’s web server. The DNS verification uses the same cryptographic verification, but the consumer has to publish a TXT record for the domain name, which will be checked by Let’s Encrypt to certify the ownership of the domain name.
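For reference, the record Let’s Encrypt looks up during DNS verification has this shape (the value, shown here as a placeholder, is a digest derived from the ACME challenge):

_acme-challenge.your-domain-name.  300  IN  TXT  "<base64url-encoded-challenge-digest>"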

Hashicorp Terraform, GCP and … Let’s Encrypt

The above looks very much like technical requirements: deploy a web service in the cloud to provide public services using HTTPS. The TLS certificate should be issued by Let’s Encrypt using DNS verification and the termination should be handled by the cloud provider’s load balancer. The deployment must be performed using terraform with no manual operations that interrupt the process.

To satisfy the requirements we are going to use GCP services and HashiCorp’s google provider to provision the infrastructure. Then we will use GCP’s Cloud DNS to configure the records, through an excellent Terraform ACME protocol provider. Terraform Cloud will take care of the state, so it can be kept separated from the infrastructure it describes.

The domain name registrar we used has its own API, but a Terraform provider doesn’t seem to exist for it. So we use a bash script that leverages curl to configure the nameservers of the domain name to point to a freshly created zone in GCP’s Cloud DNS.

The resulting terraform code and all the scripts are available on Bitrock’s github.

./
├── cert-gcp.tf
├── domain.tf
├── gcp.tf
├── LICENSE
├── providers.tf
├── README.md
├── scripts
│   └── startup-script.sh
├── terraform.tfvars
└── variables.tf
1 directory, 9 files

What we did

We have separated the cloud infrastructure into a straightforward Terraform file that contains all the resources specific to Google. This takes the solution closer to the multi-cloud pattern, making the infrastructure easily replaceable. The exact layout should certainly be built on the modules pattern, which shouldn’t be an issue to refactor and integrate. To summarize the infrastructure, here is what is being provisioned as resources in our GCP project:

  • network, subnet and firewall
  • an instance group manager with an instance template and a startup script that prepares our web service
  • a managed DNS zone
  • a load balancer that uses the instance group as backend
  • the certificate resource used by the balancer

The input variables are provided through terraform.tfvars:
# terraform.tfvars
# Domain name
domainname = "your-domain-name"
# GCP access
project_id = "GCP project id"
google_account_file = "path to the GCP credentials json"
# Registrar login
domain_user = "login"
domain_password = "password"
# Let's encrypt registration and production endpoint
email_address = "you+acme@gmail.com"
le_endpoint = "https://acme-v02.api.letsencrypt.org/directory"

When a domain name is registered, one has to provide valid nameservers that are supposed to be authoritative for it. With GCP and some other cloud providers this can be a problem, since every zone created has its own authoritative servers assigned to it. So after the zone is created, its authoritative servers have to be set through the registrar and everything has to wait until the change propagates. We manage this with a single HTTP request and a DNS resolution test in a loop, both implemented as local-exec provisioners of a null resource in the domain.tf file.

# This is how our registrar can be called to update the nameservers. YMMV
curl 'https://coreapi.1api.net/api/call.cgi?s_login=login&s_pw=password&command=ModifyDomain&domain=your-domain-name&nameservers'
# And now we wait
while true; do
dig +trace ns your-domain-name | grep '^your-domain-name\.' | grep your-new-nameserver && exit 0
echo Waiting for nameservers to be updated ...
sleep 15
done
# Check out domain.tf to see the complete usage

Once the zone is up and the nameservers are updated, the ACME provider can proceed with the certificate request. The certificate generation is described in the cert-gcp.tf file, which includes the creation of two keys, one for the Let’s Encrypt registration and the other for the certificate itself (a condensed sketch follows the list below). Being resources passed as arguments, the keys will be kept in the secure remote state on Terraform Cloud. Small details to keep in mind:

  • the TTL of the records you create (SOA, NS, A, etc.) should be low to avoid waiting for propagation and to reach the service sooner
  • Let’s Encrypt has rate limits in place so play with the staging endpoint first
  • to configure your LB’s TLS properly, don’t forget to add the certificate chain (your issuer’s certificates)
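As promised, a condensed and purely illustrative sketch of the ACME part: resource names are made up and the variables match the terraform.tfvars above, but refer to the repository for the actual code.

provider "acme" {
  server_url = var.le_endpoint
}

resource "tls_private_key" "registration" {
  algorithm = "RSA"
}

resource "acme_registration" "this" {
  account_key_pem = tls_private_key.registration.private_key_pem
  email_address   = var.email_address
}

resource "acme_certificate" "this" {
  account_key_pem = acme_registration.this.account_key_pem
  common_name     = var.domainname

  # DNS verification through the freshly created Cloud DNS zone
  dns_challenge {
    provider = "gcloud"
    config = {
      GCE_PROJECT              = var.project_id
      GCE_SERVICE_ACCOUNT_FILE = var.google_account_file
    }
  }
}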

Once all is in place, pointing your browser to https://your-domain-name should result in a happy lock icon and your smiling face.

Authors: Michael Tabolsky & Francesco Bartolini, DevOps @ Bitrock

Read More
Prometheus

Getting Started with Prometheus


What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit written in Go. Born at SoundCloud in 2012, it joined the Cloud Native Computing Foundation in 2016 and in 2018 became the second graduated project after Kubernetes.

Based on metrics and not on logs, Prometheus uses its own time series database called TSDB and its own query language (PromQL).

The CNCF community loves Prometheus because:

  • it’s easy to configure, deploy, and maintain
  • it’s designed in multiple services, aiming at modularity
  • it’s container ready, “docker run” is enough to have it started
  • it’s orchestrator ready, supporting dynamic configurations
  • it’s an ecosystem: many client libraries and exporters maintained both by Prometheus team and the community





  • Prometheus collects data
  • Exporters expose data
  • Applications expose data
  • Grafana displays data
  • Alertmanager dispatches alerts

Prometheus is a pull-based monitoring system that scrapes metrics from configured endpoints, stores them efficiently and supports a powerful query language to compose dynamic information from a variety of otherwise unrelated data points.

To monitor your services using Prometheus, your services need to expose a Prometheus endpoint. This endpoint is an HTTP interface that exposes a list of metrics and the respective current values. Prometheus has a wide range of service discovery options to find your services and start collecting metrics data. The Prometheus server continuously polls the metrics interface on your services and stores the data. This provides a standardized way for metrics gathering.

Prometheus is designed to fetch data in intervals measured in seconds. And while Prometheus 2.x can handle somewhere north of ten million series over a time window, which is rather generous, some unwise label choices can eat that surprisingly quickly.

Every 2 hours Prometheus compacts the data that has been buffered up in memory onto blocks on disk.

To reduce the disk footprint, the TSDB can be configured with a shorter metrics retention period or with a disk space limit. The data can be compacted and the WAL compressed as well.

The data structure is self-sufficient and can be moved from one instance to another independently given each time series is atomic and uniquely identified by its metric name (1). In recent Prometheus versions, remote storage support has been introduced in order to provide long term storage.

Core Prometheus server is a single binary and each Prometheus server is an independent process with its own storage. One of the downsides of this core implementation is the lack of clustering or backfilling “missing” data when a scrape fails.

Prometheus is not supposed to be used only with standard exporters (2): you can instrument your own code to capture the metrics that matter to you, business ones for example. Prometheus comes with support for a wide range of languages (Go, Java or Scala, Python, Ruby, etc.). Many upstream libraries are already instrumented by the maintainers, so you get that for free!


What is a metric?

A metric is any numeric value that tells you something about how your system is operating. For example:

  • How much memory it is using
  • How long the last operation took
  • How many requests were served today




In Prometheus there are 4 types of metrics: counter, gauge, histogram and summary.

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests.

A histogram samples observations, for example request durations or response sizes, and counts them in configurable buckets. It also provides a sum of all observed values. A histogram with a base metric name of <basename> exposes multiple time series during a scrape:

  • cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
  • the total sum of all observed values, exposed as <basename>_sum
  • the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)

Similar to a histogram, a summary samples observations, for example request durations and response sizes. While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.

A summary with a base metric name of <basename> exposes multiple time series during a scrape:

  • streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as <basename>{quantile="<φ>"}
  • the total sum of all observed values, exposed as <basename>_sum
  • the count of events that have been observed, exposed as <basename>_count

The essential difference between summaries and histograms is that summaries calculate streaming φ-quantiles on the client side and expose them directly, while histograms expose bucketed observation counts and the calculation of quantiles from the buckets of a histogram happens on the server side using the histogram_quantile() function.

https://prometheus.io/docs/concepts/metric_types/

https://prometheus.io/docs/practices/histograms/


Understanding metrics

Prometheus metrics have a name and may carry an arbitrary number of labels.

A metric has metadata (labels), and PromQL offers lots of functions to filter, change and remove those while fetching the metric from the targets. The name “node_cpu_seconds_total” consists of a prefix for the namespace (node metrics) and a suffix for the unit of the value (seconds of CPU time in total).

https://prometheus.io/docs/practices/naming/

promtool allows you to lint metrics for consistency and correctness.

Examples:
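For instance, the node exporter exposes CPU time as a counter broken down by cpu and mode labels (the values here are illustrative):

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 362812.7
node_cpu_seconds_total{cpu="0",mode="system"} 1518.03

Piping an exposition like this through promtool check metrics is an easy way to lint it in CI.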


PromQL

Prometheus Query Language (PromQL) supports a wide range of functions for interacting with scraped metrics. Some examples:

  • Filtering by label: http_requests_total{status=~"5.."}
  • Calculating rates: rate(http_requests_total[5m])
  • Arithmetic ( +, *, /, -, %, ^ ) and comparison ( >, <, >=, <=, ==, != ) operations
  • Aggregation and grouping: sum(rate(node_network_receive_bytes_total[5m])) by (instance)
  • Quantiles: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
  • Recording rules: precompute frequently needed or computationally expensive expressions, in order to make recurring queries much faster to compute (see the sketch below)
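As a minimal sketch, the aggregation above can be turned into a recording rule in a rule file loaded by Prometheus (the rule name is illustrative):

groups:
  - name: node-recording-rules
    rules:
      - record: instance:node_network_receive_bytes:rate5m
        expr: sum(rate(node_network_receive_bytes_total[5m])) by (instance)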


Alerting

Our motto is: if you can graph it, you can alert on it! It’s really easy to set up alerts in Prometheus: it’s just a matter of defining which query to evaluate and what the safe range of values is. For example:
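A minimal sketch of an alerting rule, assuming the node exporter memory metrics are being scraped (the threshold and names are illustrative):

groups:
  - name: node-alerting-rules
    rules:
      - alert: HostLowMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% of memory available on {{ $labels.instance }}"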


Prometheus evaluates the alerting rule regularly and marks it as firing in case the rule matches. However, the Prometheus core component does not directly take care of sending alerts to end users; Alertmanager instead takes care of performing alert-related operations.

Alertmanager:

  • Receives alerts from Prometheus
  • Groups them
  • Inhibits them, for example in case of false positives
  • Dispatches them to downstream services, such as Slack or PagerDuty and many more
  • Provides built-in HA by leveraging a gossip protocol






Notes

(1) https://github.com/bitnami/kube-prod-runtime/blob/master/docs/migration-guides/prometheus-migration.md

(2) https://prometheus.io/docs/instrumenting/exporters/ and https://github.com/prometheus/prometheus/wiki/default-port-allocations


Author: Matteo Gazzetta, DevOps Engineer @Bitrock

Read More
Terraform Community Tools


Despite not having reached version 1.0 yet, Terraform has become the de facto tool for cloud infrastructure management. One of its major winning points is definitely the extensive cross-cloud support, which allows projects to span from one cloud vendor to another with minimal operational effort. Moreover, the community continuously releases reusable infrastructure components, the Terraform modules, which make it easy to bootstrap new projects with a fully functional setup right from the start.

In order to address all the different use cases of Terraform, whether it is executed as part of a GitOps pipeline or right from developers’ machines, the community has built a set of tools to enhance the developer experience.

In this blog post we will describe some of them, focusing on those that might not be that popular or widely adopted, but certainly deserve some attention.

Pull Request Automation

Atlantis

GitHub Website


Atlantis is a golang application that listens for Terraform pull request events via webhooks. It allows users to remotely execute "terraform plan" and "terraform apply" according to the pull request content, commenting back the result. Atlantis is a good starting point for making infrastructure changes visible to all teams, allowing even non-operations ones to contribute to the Terraform infrastructure codebase. If you want to see Atlantis in action, check this walkthrough video.
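As a rough sketch of the workflow, once Atlantis receives the webhook events for a pull request, reviewers drive it with comments on the PR itself (the directory name is illustrative):

# comment on the pull request to run a plan for a specific directory
atlantis plan -d environments/production
# once the plan output posted back by Atlantis looks good, apply it
atlantis apply -d environments/production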

If you want to restrict and audit the execution of Terraform changes still providing a friendly interface, Terraform Cloud and Enterprise support invoking remote operations by UI, VCS, CLI and API. The offering includes an extensive set of capabilities for integrating infrastructure changes in CI pipelines.


Importing Existing Cloud Resources

Importing existing resources into a Terraform codebase is a long and tedious process. Terraform is capable of importing an existing resource into its state through the "import" command; however, the responsibility of writing the HCL describing the resource is on the developer. The community has come up with tools that are able to automate this process.
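To make the pain concrete, this is roughly what the manual flow looks like for a single resource (names are illustrative):

# 1. write an (initially minimal) resource block by hand
resource "aws_s3_bucket" "logs" {}
# 2. attach the real resource to it in the state
terraform import aws_s3_bucket.logs my-existing-logs-bucket
# 3. run "terraform plan" and keep filling in the HCL until the plan shows no changes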

Terraforming

GitHub Website

Terraforming supports the export of existing AWS resources into Terraform resources, importing them to Terraform state and writing the configuration to a file.

Terraformer

GitHub

Terraformer supports the export of existing resources from many different providers, such as AWS, Azure and GCP. The tool leverages Terraform providers for performing the mapping of resource attributes to Terraform ones, which makes it more resilient to API upgrades. Terraformer was developed by Waze and is now maintained by the Google Cloud Platform team.

Version Management

tfenv

GitHub

When working with projects that are based on different Terraform versions, it is tedious to switch from one version to another, and the risk of accidentally upgrading a state’s Terraform version is high. tfenv comes to the rescue and makes it easy to have different Terraform versions installed on the same machine.
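A minimal sketch of the day-to-day usage (the version number is illustrative):

tfenv install 0.14.11
tfenv use 0.14.11
terraform version
# a .terraform-version file in the repository root makes the switch automatic
echo "0.14.11" > .terraform-version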

Security and Compliance Scanning

tfsec

GitHub


tfsec performs static analysis of your Terraform code in order to detect potential vulnerabilities in the resulting infrastructure configuration. It comes with a set of rules that work across providers and a set of provider-specific ones, with support for AWS, Azure and GCP. It supports disabling checks on specific resources, making it easy to include the tool in a CI pipeline.
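A minimal sketch of how it fits in a pipeline:

# scan the working directory; a non-zero exit code fails the CI job
tfsec .
# individual findings can be silenced with an inline comment on the offending resource,
# in the form  #tfsec:ignore:<rule-id>  placed next to the flagged attribute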

Terrascan

GitHub Website


Terrascan detects security and compliance violations in your Terraform codebase, mitigating the risk of provisioning insecure cloud infrastructure. The tool supports AWS, Azure, GCP and Kubernetes, and comes with a set of more than 500 policies for security best practices. It is possible to write custom policies with the Open Policy Agent Rego language.

Regula

GitHub

Regula is a tool that inspects Terraform code looking for security misconfigurations and compliance violations. It supports AWS, Azure and GCP, and includes a library of rules written in the Open Policy Agent language Rego. Regula consists of two parts: the first generates a Terraform plan in JSON that is then consumed by the Rego framework, which in turn evaluates the rules and produces a report.

Terraform Compliance

GitHub Website


Terraform Compliance approaches the problem from a different perspective, allowing you to write compliance rules in a Behaviour Driven Development (BDD) fashion. An extensive set of examples provides an overview of the capabilities of the tool. It is easy to bring Terraform Compliance into your CI chain and validate infrastructure before deployment.

While Terraform Compliance is free to use and easy to get started with, a much wider set of policies can be defined using HashiCorp Sentinel, which is part of the HashiCorp Enterprise offering. Sentinel supports fine-grained condition-based policies, with different enforcing levels, that are evaluated as part of a Terraform remote execution.

Linting

TFLint

GitHub

TFLint is a Terraform linter that focuses on potential errors and best practices. The tool comes with a general-purpose and an AWS rule set, while the rules for other cloud providers such as Azure and GCP are being added. It does not focus on security or compliance issues; rather, it validates configuration values such as instance types, which might cause a runtime error when applying the changes. TFLint tries to fill the gap left by "terraform validate", which is not able to validate variable values besides syntax and internal consistency checks.
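A minimal sketch of the kind of mistake it catches (the AMI id is a placeholder):

# main.tf: "terraform validate" accepts this configuration
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0"   # placeholder AMI id
  instance_type = "t2.nano2"                # not an existing instance type
}

Running tflint in the same directory flags the invalid instance type before any change is applied.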


Cost Estimation

infracost

GitHub Website


Keeping track of infrastructure pricing is quite a mess, and one usually discovers the actual cost of a deployment only after running it for days, if not weeks. infracost comes to help, providing a way to estimate how much the resources you are going to deploy will cost. At the moment the tool supports only AWS, providing insights into the costs of both hourly priced resources and usage-based resources such as AWS Lambda functions. For the latter, it requires the infracost Terraform provider, which allows describing usage estimates for a more realistic cost estimate. This enables quick “what-if” analyses like “what if this month my Lambda gets 2 times more requests?”. The ability to output a “diff” of the costs is useful when integrating infracost in your CI pipeline.
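A minimal sketch with a recent version of the CLI (command names and flags may differ in older releases):

# estimate the monthly cost of the Terraform project in the current directory
infracost breakdown --path .
# show how the pending changes affect the estimate, e.g. as a step of a CI pipeline
infracost diff --path .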

Terraform Enterprise provides a Cost Estimation feature that extends infracost offering with the support for the three major public cloud providers: AWS, Azure and GCP. Moreover, Sentinel policies can be applied for example to prevent the execution of Terraform changes according to the increment of costs.


Author: Simone Ripamonti, DevOps Engineer @Bitrock

Read More
Bringing GDPR in Kafka with Vault



Part 1: Concepts

GDPR introduced the “right to be forgotten”, which allows individuals to make verbal or written requests for personal data erasure. One of the common challenges when trying to comply with this requirement in an Apache Kafka based application infrastructure is being able to selectively delete all the Kafka records related to one of the application users.

Kafka’s data model was never supposed to support such a selective delete feature, so businesses had to find and implement workarounds. At the time of writing, the only way to delete messages in Kafka is to wait for the message retention to expire or to use compact topics that expect tombstone messages to be published, which isn’t feasible in all environments and just doesn’t fit all the use cases.

HashiCorp Vault provides Encryption as a Service, and as it happens, can help us implement a solution without workarounds, either in application code or Kafka data model.


Vault Encryption as a Service

Vault Transit secrets engine handles cryptographic operations on in-transit data without persisting any information. This allows a straightforward introduction of cryptography in existing or new applications by performing a simple HTTP request.

Vault fully and transparently manages the lifecycle of encryption keys, so neither developers nor operators have to worry about key compliance and rotation, while the securely stored data can always be encrypted and decrypted as long as Vault is accessible.


Kafka Integration

What if, instead of trying to selectively eliminate the data the application is not allowed to keep, we just made sure the application (or anyone, for that matter) could not read the data under any circumstances? This would be equivalent to physically removing the data, just as requested by GDPR compliance. Such a result can be achieved by selectively encrypting the information that we might want to be able to delete, and throwing away the key when deletion is requested.

However, it is necessary to perform encryption and decryption in a transparent way for the application, to reduce refactoring and integration effort for each of the applications that are using Kafka, and unlock this functionality for the applications that cannot be adapted at all.

Kafka APIs support interceptors on message production and consumption, which is the candidate link in the chain where to leverage Vault’s encryption as a service. Inside the interceptor, we can perform the needed message transformation:

  • before a record is sent to Kafka, the interceptor performs encryption and adjusts the record content with the encrypted data
  • before a record is returned to a consumer client, the interceptor performs decryption and adjusts the record content with the decrypted data


Logical Deletion

Does this allow us to delete all the Kafka messages related to a single user? Yes, and it is really simple. If the encryption key that we use for encrypting data in Kafka messages is different for each of our application’s users, we can go ahead and delete the encryption key to guarantee that it is no longer possible to read the user data.


Replication Outside EU

Given that now the sensitive data stored in our Kafka cluster is encrypted at rest, it is possible to replicate our Kafka cluster outside the EU, for example for disaster recovery purposes. The data will only be accessible by those users that have the right permissions to perform the cryptographic operations in Vault.



Part 2: Technicalities

In the previous part we drafted the general idea behind the integration of HashiCorp Vault and Apache Kafka for performing a fine grained encryption at rest of the messages, in order to address GDPR compliance requirements within Kafka. In this part, instead, we do a deep dive on how to bring this idea alive.


Vault Transit Secrets Engine

Vault Transit secrets engine is part of Vault Open Source, and it is really easy to get started with. Setting the engine up is just a matter of enabling it and creating some encryption keys:
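With the Vault CLI it boils down to a couple of commands (the key name, one per application user as discussed in the first part, is illustrative):

vault secrets enable transit
vault write -f transit/keys/user-12345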

Crypto operations can be performed as well in a really simple way, it’s just a matter of providing base64 encoded plaintext data:
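For example, using the per-user key created above:

vault write transit/encrypt/user-12345 plaintext=$(base64 <<< "my sensitive payload")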

The resulting ciphertext will look like vault:v1:<base64 ciphertext>, where v1 represents the first version of the key, given that it has not been rotated yet.

What about decryption? Well, it’s just another API call:
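Again with the same illustrative key (the ciphertext placeholder is whatever the encrypt call returned):

vault write -field=plaintext transit/decrypt/user-12345 ciphertext="vault:v1:<ciphertext>" | base64 --decode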

Integrating Vault’s Encryption as a Service within your application becomes really easy to implement and requires little to no refactoring of the existing codebase.


Kafka Producer Interceptor

The Producer Interceptor API can intercept and possibly mutate the records received by the producer before they are published to the Kafka cluster. In this scenario, the goal is to perform encryption within this interceptor, in order to avoid sending plaintext data to the Kafka cluster.

Integrating encryption in the Producer Interceptor is straightforward, given that the onSend method is invoked one message at a time.


Kafka Consumer Interceptor

The Consumer Interceptor API can intercept and possibly mutate the records received by the consumer. In this scenario, we want to perform decryption of the data received from Kafka cluster and return plaintext data to the consumer.

Integrating decryption with Consumer Interceptor is a bit trickier because we wanted to leverage the batch decryption capabilities of Vault, in order to minimize Vault API calls.

Usage

Once you have built your interceptors, enabling them is just a matter of configuring your Consumer or Producer client:
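For the producer, the configuration might look along these lines; the interceptor class name is a placeholder for your own implementation:

interceptor.classes=com.example.kafka.vault.EncryptingProducerInterceptor
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
# the serializer for the actual payload type, used inside the interceptor
interceptor.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer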

or
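for the consumer, symmetrically (again, the class name is a placeholder):

interceptor.classes=com.example.kafka.vault.DecryptingConsumerInterceptor
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
# the deserializer for the actual payload type, used inside the interceptor
interceptor.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer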

Notice that the value and key serializer classes must be set to StringSerializer, since Vault Transit can only handle strings containing base64 data. The client invoking the Kafka Producer and Consumer APIs, however, is able to process any supported type of data, according to the serializer or deserializer configured in the interceptor.value.serializer or interceptor.value.deserializer properties.


Conclusions

HashiCorp Vault’s Transit secrets engine is definitely the technological component you may want to leverage when addressing cryptographic requirements in your application, even when dealing with legacy components. The entire set of capabilities offered by HashiCorp Vault makes it easy to modernize applications from a security perspective, allowing developers to focus on the business logic rather than spending time finding a way to properly manage secrets.



Author: Simone Ripamonti, DevOps Engineer @Bitrock

Read More