
Getting Started with Prometheus


What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit written in Go. Created at SoundCloud in 2012, it joined the Cloud Native Computing Foundation in 2016 and in 2018 became the second project to graduate, after Kubernetes.

Based on metrics rather than logs, Prometheus uses its own time series database, TSDB, and its own query language, PromQL.

The CNCF community loves Prometheus because:

  • it’s easy to configure, deploy, and maintain
  • it’s designed as multiple services, aiming at modularity
  • it’s container ready: “docker run” is enough to get it started
  • it’s orchestrator ready, supporting dynamic configurations
  • it’s an ecosystem: many client libraries and exporters are maintained both by the Prometheus team and by the community





  • Prometheus collects data
  • Exporters expose data
  • Applications expose data
  • Grafana displays data
  • Alertmanager dispatches alerts

Prometheus is a pull-based monitoring system that scrapes metrics from configured endpoints, stores them efficiently and supports a powerful query language to compose dynamic information from a variety of otherwise unrelated data points.

To monitor your services using Prometheus, your services need to expose a Prometheus endpoint. This endpoint is an HTTP interface that exposes a list of metrics and their current values. Prometheus has a wide range of service discovery options to find your services and start collecting metrics data. The Prometheus server continuously polls the metrics interface on your services and stores the data. This provides a standardized way of gathering metrics.
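
For reference, a minimal scrape configuration might look like the sketch below (job name and target address are made up for illustration):

    # prometheus.yml - minimal sketch of a scrape configuration
    global:
      scrape_interval: 15s              # how often targets are polled

    scrape_configs:
      - job_name: "my-service"          # hypothetical job name
        metrics_path: /metrics          # default path of the Prometheus endpoint
        static_configs:
          - targets: ["my-service:8080"]   # hypothetical host:port to scrape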

Prometheus is designed to fetch data at intervals measured in seconds. And while Prometheus 2.x can handle somewhere north of ten million series over a time window, which is rather generous, some unwise label choices can eat that budget surprisingly quickly.

Every 2 hours Prometheus compacts the data that has been buffered up in memory onto blocks on disk.

To reduce the disk footprint, TSDB can be configured with a shorter metrics retention period or with a disk space limit. The data can be compacted and the WAL compressed as well.
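
In a typical deployment these limits are set through command-line flags when starting the server; a hedged sketch (the retention values are arbitrary):

    # Cap retention by time and by disk usage, and compress the write-ahead log
    prometheus \
      --config.file=prometheus.yml \
      --storage.tsdb.retention.time=15d \
      --storage.tsdb.retention.size=50GB \
      --storage.tsdb.wal-compression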

The data structure is self-sufficient and can be moved from one instance to another independently, given that each time series is atomic and uniquely identified by its metric name (1). In recent Prometheus versions, remote storage support has been introduced in order to provide long-term storage.

The core Prometheus server is a single binary, and each Prometheus server is an independent process with its own storage. One of the downsides of this core implementation is the lack of clustering and of backfilling “missing” data when a scrape fails.

Prometheus is not meant to be used only with standard exporters (2): you can instrument your own code to capture the metrics that matter to you, business ones for example. Prometheus comes with support for a wide range of languages (Go, Java, Scala, Python, Ruby, etc.). Many upstream libraries are already instrumented by their maintainers, so you get that for free!
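
As an illustration, a minimal sketch of custom instrumentation with the Prometheus Java client library (metric name, label and port are made up):

    package com.example.metrics;

    import io.prometheus.client.Counter;
    import io.prometheus.client.exporter.HTTPServer;

    public class InstrumentedApp {

        // A business-level counter; "orders_processed_total" is a hypothetical name
        static final Counter ordersProcessed = Counter.build()
                .name("orders_processed_total")
                .help("Total number of processed orders.")
                .labelNames("status")
                .register();

        public static void main(String[] args) throws Exception {
            // Expose the default registry on http://localhost:8081/metrics
            HTTPServer server = new HTTPServer(8081);
            ordersProcessed.labels("ok").inc();   // increment on every processed order
        }
    }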


What is a metric?

A metric is any numeric value that tells you something about how your system is operating. For example:

  • How much memory it is using
  • How long the last operation took
  • How many requests were served today




In Prometheus there are 4 types of metrics: counter, gauge, histogram and summary.

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests.

A histogram samples observations, for example request durations or response sizes, and counts them in configurable buckets. It also provides a sum of all observed values. A histogram with a base metric name of <basename> exposes multiple time series during a scrape (a sample exposition follows the list):

  • cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
  • the total sum of all observed values, exposed as <basename>_sum
  • the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)
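
For instance, a hypothetical histogram named http_request_duration_seconds would appear on the /metrics endpoint roughly as follows (values are illustrative):

    http_request_duration_seconds_bucket{le="0.1"} 3243
    http_request_duration_seconds_bucket{le="0.5"} 4032
    http_request_duration_seconds_bucket{le="+Inf"} 4096
    http_request_duration_seconds_sum 1532.7
    http_request_duration_seconds_count 4096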

Similar to a histogram, a summary samples observations, for example request durations and response sizes. While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.

A summary with a base metric name of <basename> exposes multiple time series during a scrape:

  • streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events, exposed as <basename>{quantile="<φ>"}
  • the total sum of all observed values, exposed as <basename>_sum
  • the count of events that have been observed, exposed as <basename>_count

The essential difference between summaries and histograms is that summaries calculate streaming φ-quantiles on the client side and expose them directly, while histograms expose bucketed observation counts and the calculation of quantiles from the buckets of a histogram happens on the server side using the histogram_quantile() function.

https://prometheus.io/docs/concepts/metric_types/

https://prometheus.io/docs/practices/histograms/


Understanding metrics

Prometheus metrics have a name and can have an arbitrary number of labels:
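
For example, a couple of samples exposed by node_exporter look roughly like this (values are illustrative):

    node_cpu_seconds_total{cpu="0",mode="idle"} 362320.01
    node_cpu_seconds_total{cpu="0",mode="user"} 9268.52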

A metric has metadata (labels), and there are plenty of functions to filter, change, and remove those labels while fetching metrics from the targets. The name “node_cpu_seconds_total” consists of a prefix for the namespace (node metrics) and a suffix for the unit of the value (seconds of CPU time in total).

https://prometheus.io/docs/practices/naming/

promtool allows you to lint metrics for consistency and correctness.

Examples:
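
A couple of well-formed names that follow these conventions, plus a hedged sketch of linting what a target exposes with promtool (the endpoint address is made up):

    process_cpu_seconds_total        # namespace prefix + unit + _total suffix for a counter
    http_request_duration_seconds    # unit as suffix, base name of a histogram

    # Pipe the exposed metrics through promtool's linter
    curl -s http://localhost:8080/metrics | promtool check metrics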


PromQL

Prometheus Query Language (PromQL) supports a wide range of functions for interacting with scraped metrics. Some examples:

  • Filtering by label: http_requests_total{status=~"5.."}
  • Calculating rates: rate(http_requests_total[5m])
  • Arithmetic (+, *, /, -, %, ^) and comparison (>, <, >=, <=, ==, !=) operations
  • Aggregation and grouping: sum(rate(node_network_receive_bytes_total[5m])) by (instance)
  • Quantile: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
  • Recording rule: precompute frequently needed or computationally expensive expressions, in order to make recurring queries much faster to compute (a minimal sketch follows this list)
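
Recording rules live in rule files loaded by the Prometheus server; a minimal sketch (the rule name below follows the level:metric:operation convention and is made up):

    groups:
      - name: example-recording-rules
        rules:
          - record: instance:node_network_receive_bytes:rate5m
            expr: sum(rate(node_network_receive_bytes_total[5m])) by (instance)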


Alerting

Our motto is: if you can graph it, you can alert on it! It’s really easy to set up alerts in Prometheus; it’s just a matter of defining which query to evaluate and what the range of safe values is:
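
A minimal alerting rule sketch (metric, threshold and labels are made up for illustration):

    groups:
      - name: example-alerts
        rules:
          - alert: HighErrorRate
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "More than 5% of requests are failing with 5xx errors"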


Prometheus will evaluate the alerting rule regularly and will mark it as firing in case the rule matches. However, the Prometheus core component does not directly take care of sending alerts to the final users; Alertmanager takes care of performing alert-related operations (a minimal configuration sketch follows the list below).

Alertmanager:

  • Receives alerts from Prometheus
  • Groups them
  • Inhibits them, for example in case of false positives
  • Dispatches them to downstream services, such as Slack, PagerDuty and many more
  • Provides built-in HA leveraging a gossip protocol
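
A minimal Alertmanager configuration sketch (the Slack webhook and channel are placeholders):

    route:
      receiver: slack-notifications
      group_by: ['alertname', 'instance']
      group_wait: 30s
      repeat_interval: 4h

    receivers:
      - name: slack-notifications
        slack_configs:
          - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
            channel: '#alerts'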




References


Notes

(1) https://github.com/bitnami/kube-prod-runtime/blob/master/docs/migration-guides/prometheus-migration.md

(2) https://prometheus.io/docs/instrumenting/exporters and https://github.com/prometheus/prometheus/wiki/default-port-allocations


Author: Matteo Gazzetta, DevOps Engineer @Bitrock


Terraform Community Tools

Despite not having reached version 1.0 yet, Terraform has become the de facto tool for cloud infrastructure management. One of its major winning points is definitely its extensive cross-cloud support, which allows projects to span from one cloud vendor to another with minimal operational effort. Moreover, the community keeps releasing reusable infrastructure components, the Terraform modules, which make it easy to bootstrap new projects with a fully functional setup right from the start. In order to address all the different use cases of Terraform, whether it is executed as part of a GitOps pipeline or right from developers’ machines, the community has built a set of tools to enhance the developer experience. In this blog post we will describe some of them, focusing on those that might not be that popular or widely adopted, but certainly deserve some attention.

Pull Request Automation

Atlantis

[ GitHub ][ Website ] Atlantis is a golang application that listens for Terraform pull request events via webhooks. It allows users to remotely execute terraform plan and terraform apply according to the pull request content, commenting back the result. Atlantis is a good starting point for making infrastructure changes visible to all teams, allowing even non-operations ones to contribute to the Terraform infrastructure codebase. If you want to see Atlantis in action, check this walkthrough video [ Youtube ]. If you want to restrict and audit the execution of Terraform changes while still providing a friendly interface, Terraform Cloud and Enterprise support invoking remote operations by UI, VCS, CLI and API. The offering includes an extensive set of capabilities for integrating infrastructure changes in CI pipelines.

Importing Existing Cloud Resources

Importing existing resources into a Terraform codebase is a long and tedious process. Terraform is capable of importing an existing resource into its state through the import command; however, the responsibility of writing the HCL describing the resource is on the developer. The community has come up with tools that are able to automate this process.
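
For reference, this is roughly what the manual workflow looks like: the HCL block has to be written by hand first, and only then can the existing cloud resource be attached to the state (resource address and ID are hypothetical):

    # after writing a matching "aws_instance" "web" block in the codebase
    terraform import aws_instance.web i-0a1b2c3d4e5f67890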

Terraforming

[ GitHub ] [ Web ] Terraforming supports the export of existing AWS resources into Terraform resources, importing them to Terraform state and writing the configuration to a file.

Terraformer

[ GitHub ] Terraformer supports the export of existing resources from many different providers, such as AWS, Azure and GCP. The tool leverages Terraform providers for performing the mapping of resource attributes to Terraform ones, which makes it more resilient to API upgrades. Terraformer was developed by Waze and is now maintained by the Google Cloud Platform team.

Version Management

tfenv

[ GitHub ] When working with projects that are based on different Terraform versions, it is tedious to switch from one version to another, and the risk of accidentally upgrading the states’ Terraform version is high. tfenv comes to the rescue and makes it easy to have different Terraform versions installed on the same machine.

Security and Compliance Scanning

tfsec

[ GitHub ] tfsec performs static analysis of your Terraform code in order to detect potential vulnerabilities in the resulting infrastructure configuration. It comes with a set of rules that work cross-provider and a set of provider-specific ones, with support for AWS, Azure and GCP. It supports disabling checks on specific resources, making it easy to include the tool in a CI pipeline.

Terrascan

[ GitHub ] [ Website ] Terrascan detects security and compliance violations in your Terraform codebase, mitigating the risk of provisioning insecure cloud infrastructures. The tool supports AWS, Azure, GCP and Kubernetes, and comes with a set of more than 500 policies for security best practices. It is possible to write custom policies with the Open Policy Agent Rego language.

Regula

[ GitHub ] Regula is a tool that inspects Terraform code looking for security misconfigurations and compliance violations. It supports AWS, Azure and GCP, and includes a library of rules written in the Open Policy Agent language Rego. Regula consists of two parts: the first one generates a Terraform plan in JSON that is then consumed by the Rego framework, which in turn evaluates the rules and produces a report.

Terraform Compliance

[ GitHub ] [ Website ] Terraform Compliance approaches the problem from a different perspective, allowing you to write compliance rules in a Behaviour Driven Development (BDD) fashion. An extensive set of examples provides an overview of the capabilities of the tool. It is easy to bring Terraform Compliance into your CI chain and validate infrastructure before deployment. While Terraform Compliance is free to use and easy to get started with, a much wider set of policies can be defined using HashiCorp Sentinel, which is part of the HashiCorp Enterprise offering. Sentinel supports fine-grained condition-based policies, with different enforcement levels, that are evaluated as part of a Terraform remote execution.

Linting

TFLint

[ GitHub ] TFLint is a Terraform linter that focuses on potential errors and best practices. The tool comes with a general-purpose rule set and an AWS one, while rules for other cloud providers such as Azure and GCP are being added. It does not focus on security or compliance issues, but rather on validating configuration variables such as instance types, which might cause a runtime error when applying the changes. TFLint tries to fill the gap left by “terraform validate”, which is not able to validate variable values besides syntax and internal consistency checks.

Cost Estimation

infracost

[ GitHub ][ Website ] Keeping track of infrastructure pricing is quite a mess, and one usually discovers the actual cost of a deployment only after running it for days if not weeks. infracost comes to help, providing a way to estimate how much the resources you are going to deploy will cost. At the moment the tool supports only AWS, providing insights into the costs of both hourly priced resources and usage-based resources such as AWS Lambda functions. For the latter, it requires the usage of the infracost Terraform provider, which allows describing usage estimates for a more realistic cost estimate. This enables quick "what-if" analyses like "what if this month my Lambda gets 2 times more requests?". The ability to output a "diff" of the costs is useful when integrating infracost in your CI pipeline. Terraform Enterprise provides a Cost Estimation feature that extends the infracost offering with support for the three major public cloud providers: AWS, Azure and GCP. Moreover, Sentinel policies can be applied, for example, to prevent the execution of Terraform changes according to the increase in costs.

Author: Simone Ripamonti, DevOps Engineer @Bitrock

Bringing GDPR in Kafka with Vault


Part 1: Concepts

GDPR introduced the “right to be forgotten”, which allows individuals to make verbal or written requests for personal data erasure. One of the common challenges when trying to comply with this requirement in an Apache Kafka based application infrastructure is being able to selectively delete all the Kafka records related to one of the application users.

Kafka’s data model was never supposed to support such a selective delete feature, so businesses had to find and implement workarounds. At the time of writing, the only way to delete messages in Kafka is to wait for the message retention to expire or to use compact topics that expect tombstone messages to be published, which isn’t feasible in all environments and just doesn’t fit all the use cases.

HashiCorp Vault provides Encryption as a Service, and, as it happens, can help us implement a solution without workarounds, either in application code or in the Kafka data model.


Vault Encryption as a Service

Vault Transit secrets engine handles cryptographic operations on in-transit data without persisting any information. This allows a straightforward introduction of cryptography in existing or new applications by performing a simple HTTP request.

Vault fully and transparently manages the lifecycle of encryption keys, so neither developers nor operators have to worry about key compliance and rotation, while the securely stored data can always be encrypted and decrypted as long as Vault is accessible.


Kafka Integration

What if, instead of trying to selectively eliminate the data the application is not allowed to keep, we just made sure that the application (or anyone, for that matter) cannot read the data under any circumstances? This would be equivalent to the physical removal of data, just as requested by GDPR compliance. Such a result can be achieved by selectively encrypting the information that we might want to be able to delete and throwing away the key when deletion is requested.

However, it is necessary to perform encryption and decryption in a transparent way for the application, to reduce refactoring and integration effort for each of the applications that are using Kafka, and unlock this functionality for the applications that cannot be adapted at all.

Kafka APIs support interceptors on message production and consumption, which is the candidate link in the chain where to leverage Vault’s encryption as a service. Inside the interceptor, we can perform the needed message transformation:

  • before a record is sent to Kafka, the interceptor performs encryption and adjusts the record content with the encrypted data
  • before a record is returned to a consumer client, the interceptor performs decryption and adjusts the record content with the decrypted data


Logical Deletion

Does this allow us to delete all the Kafka messages related to a single user? Yes, and it is really simple. If the encryption key that we use for encrypting data in Kafka messages is different for each of our application’s users, we can go ahead and delete the encryption key to guarantee that it is no longer possible to read the user data.


Replication Outside EU

Given that now the sensitive data stored in our Kafka cluster is encrypted at rest, it is possible to replicate our Kafka cluster outside the EU, for example for disaster recovery purposes. The data will only be accessible by those users that have the right permissions to perform the cryptographic operations in Vault.



Part 2: Technicalities

In the previous part we drafted the general idea behind the integration of HashiCorp Vault and Apache Kafka for performing fine-grained encryption at rest of the messages, in order to address GDPR compliance requirements within Kafka. In this part, instead, we do a deep dive on how to bring this idea to life.


Vault Transit Secrets Engine

Vault Transit secrets engine is part of Vault Open Source, and it is really easy to get started with. Setting the engine up is just a matter of enabling it and creating some encryption keys:
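
A minimal sketch of the setup, assuming one encryption key per application user (the key name is made up):

    # Enable the Transit secrets engine and create a per-user encryption key
    vault secrets enable transit
    vault write -f transit/keys/user-12345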

Crypto operations can be performed in a really simple way as well; it’s just a matter of providing base64-encoded plaintext data:
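
For example, encrypting a value with the hypothetical per-user key created above:

    vault write transit/encrypt/user-12345 \
        plaintext=$(echo -n "my sensitive data" | base64)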

The resulting ciphertext will look like vault:v1:<base64 ciphertext>, where v1 represents the first version of the key, given that it has not been rotated yet.

What about decryption? Well, it’s just another API call:
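
For example (the ciphertext below is a placeholder):

    # Returns the base64-encoded plaintext, ready to be decoded by the caller
    vault write transit/decrypt/user-12345 \
        ciphertext="vault:v1:AbCdEf..."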

Integrating Vault’s Encryption as a Service within your application becomes really easy to implement and requires little to no refactoring of the existing codebase.


Kafka Producer Interceptor

The Producer Interceptor API can intercept and possibly mutate the records received by the producer before they are published to the Kafka cluster. In this scenario, the goal is to perform encryption within this interceptor, in order to avoid sending plaintext data to the Kafka cluster.

Integrating encryption in the Producer Interceptor is straightforward, given that the onSend method is invoked one message at a time.
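
A simplified sketch of such an interceptor, assuming a hypothetical VaultTransitClient helper that wraps the Transit encrypt API (this is a sketch, not the actual implementation described here):

    package com.example.kafka;

    import java.util.Map;
    import org.apache.kafka.clients.producer.ProducerInterceptor;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class EncryptingProducerInterceptor implements ProducerInterceptor<String, String> {

        private VaultTransitClient vault;   // hypothetical helper wrapping transit/encrypt

        @Override
        public void configure(Map<String, ?> configs) {
            vault = new VaultTransitClient(configs);   // e.g. Vault address, token, key name
        }

        @Override
        public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
            // Replace the plaintext value with the Vault ciphertext before it leaves the client
            String ciphertext = vault.encrypt(record.value());
            return new ProducerRecord<>(record.topic(), record.partition(),
                    record.timestamp(), record.key(), ciphertext, record.headers());
        }

        @Override
        public void onAcknowledgement(RecordMetadata metadata, Exception exception) { }

        @Override
        public void close() { }
    }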


Kafka Consumer Interceptor

The Consumer Interceptor API can intercept and possibly mutate the records received by the consumer. In this scenario, we want to perform decryption of the data received from Kafka cluster and return plaintext data to the consumer.

Integrating decryption with Consumer Interceptor is a bit trickier because we wanted to leverage the batch decryption capabilities of Vault, in order to minimize Vault API calls.
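
A simplified sketch, again assuming a hypothetical VaultTransitClient helper with a batch decrypt call (headers and timestamps are dropped here for brevity):

    package com.example.kafka;

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerInterceptor;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class DecryptingConsumerInterceptor implements ConsumerInterceptor<String, String> {

        private VaultTransitClient vault;   // hypothetical helper wrapping transit/decrypt

        @Override
        public void configure(Map<String, ?> configs) {
            vault = new VaultTransitClient(configs);
        }

        @Override
        public ConsumerRecords<String, String> onConsume(ConsumerRecords<String, String> records) {
            Map<TopicPartition, List<ConsumerRecord<String, String>>> decrypted = new HashMap<>();
            for (TopicPartition tp : records.partitions()) {
                List<ConsumerRecord<String, String>> batch = records.records(tp);
                // One Vault call per partition batch instead of one per record (hypothetical API)
                List<String> plaintexts = vault.decryptBatch(batch);
                List<ConsumerRecord<String, String>> out = new ArrayList<>();
                for (int i = 0; i < batch.size(); i++) {
                    ConsumerRecord<String, String> r = batch.get(i);
                    out.add(new ConsumerRecord<>(r.topic(), r.partition(), r.offset(),
                            r.key(), plaintexts.get(i)));
                }
                decrypted.put(tp, out);
            }
            return new ConsumerRecords<>(decrypted);
        }

        @Override
        public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) { }

        @Override
        public void close() { }
    }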

Usage

Once you have built your interceptors, enabling them is just a matter of configuring your Consumer or Producer client:
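
For a producer, the configuration might look like this (the interceptor class name refers to the hypothetical sketch above):

    interceptor.classes=com.example.kafka.EncryptingProducerInterceptor
    key.serializer=org.apache.kafka.common.serialization.StringSerializer
    value.serializer=org.apache.kafka.common.serialization.StringSerializer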

or
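
for a consumer (again, with the hypothetical interceptor class):

    interceptor.classes=com.example.kafka.DecryptingConsumerInterceptor
    key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
    value.deserializer=org.apache.kafka.common.serialization.StringDeserializer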

Notice that the value and key serializer classes must be set to the StringSerializer, since Vault Transit can only handle strings containing base64 data. The client invoking the Kafka Producer and Consumer APIs, however, is able to process any supported type of data, according to the serializer or deserializer configured in the interceptor.value.serializer or interceptor.value.deserializer properties.


Conclusions

HashiCorp Vault Transit secrets engine is definitely the technological component you may want to leverage when addressing cryptographic requirements in your applications, even when dealing with legacy components. The entire set of capabilities offered by HashiCorp Vault makes it easy to modernize applications from a security perspective, allowing developers to focus on the business logic rather than spending time finding a way to properly manage secrets.



Author: Simone Ripamonti, DevOps Engineer @Bitrock


HashiCorp and Bitrock sign Partnership to boost IT Infrastructure Innovation

The product suite of the American leader combined with the expertise of the Italian system integrator are now at the service of companies

Bitrock, an Italian system integrator specialized in delivering innovation and evolutionary architecture to companies, has signed a high-value strategic partnership with HashiCorp, a market leader in multi-cloud infrastructure automation and member of the Cloud Native Computing Foundation (CNCF).

HashiCorp is well-known in the IT infrastructure environment: their open source tools Terraform, Vault, Nomad and Consul are downloaded tens of millions of times each year and enable organizations to accelerate their digital transformation, as well as adopt a common cloud operating model through HashiCorp’s portfolio of multi-cloud infrastructure automation products for infrastructure, security, networking, and application automation.

As companies scale and increase in complexity, enterprise versions of these products enhance the open-source tools with features that promote collaboration, operations, governance, and multi-data center functionality. Companies must also rely on a trusted partner that is able to guide them through the architectural design phase and who can grant enterprise-grade assistance when it comes to application development, delivery and maintenance.

Due to the highly technical nature of the HashiCorp portfolio, being a HashiCorp partner means, above all, that the Bitrock DevOps Team has the expertise and know-how required to manage the complexity of large infrastructures. Composed of highly skilled professionals who can already count on several “Associate” certifications and who attended the first Vault Bootcamp in Europe, Bitrock is proudly one of the most certified HashiCorp partners in Italy. The partnership with HashiCorp represents not only the result of Bitrock’s investments in the DevOps area, but also the start of a new journey that will allow large Italian companies to rely on more agile, flexible and secure infrastructure, especially when it comes to the provisioning, protection and management of services and applications across private, hybrid and public cloud architectures.

“We are very proud of this new partnership, which not only rewards the hard work of our DevOps Team, but also allows us to offer Italian and European companies the best tools to evolve their infrastructure and digital services” – says Leo Pillon, Bitrock CEO.

“With its dedication to delivering reliable innovation through the design and development of business-driven IT solutions, Bitrock is an ideal partner for HashiCorp in Italy. We look forward to working closely with the Bitrock team to jointly enable organizations across the country to benefit from a cloud operating model. With Bitrock’s expertise around DevOps, we are confident in the results we can jointly deliver to organizations leveraging our suite of products” – says Michelle Graff, Global Partner Chief for HashiCorp.


Bitrock DevOps Team joining HashiCorp EMEA Vault CHIP Virtual Bootcamp

Another great achievement for our DevOps Team: the opportunity to take part in the HashiCorp EMEA Vault CHIP Virtual Bootcamp.

The Bootcamp – coming for the first time to the EMEA region – involves the participation of highly skilled professionals who already have experience with Vault and want to get Vault CHIP (Certified HashiCorp Implementation Partner) certified for delivering on Vault Enterprise.

Our DevOps Team will be challenged with a series of highly technical tasks to demonstrate their expertise in the field: a full 3-day training that will get them ready to deliver in a customer engagement.

This comes after the great success of last week, which saw our DevOps Team members Matteo Gazzetta, Michael Tabolsky, Gianluca Mascolo, Francesco Bartolini and Simone Ripamonti successfully obtaining HashiCorp Certification as Vault Associate. A source of pride for the Bitrock community and a remarkable recognition of our DevOps Team expertise and know-how worldwide.

With the Virtual Bootcamp, the Team is now ready to raise the bar and take on a new challenge, proving that there’s no limit to self-improvement and continuous learning.


HashiCorp EMEA Vault CHIP Virtual Bootcamp

May 5–May 8, 2020

https://www.hashicorp.com/

Bitrock at "Jenkins World"

Jenkins World 2018

Jenkins World brings together the DevOps community in two locations, providing opportunities to learn, explore, network and help shape the future of DevOps and Jenkins. DevOps World | Jenkins World is designed specifically for IT executives, DevOps practitioners, Jenkins users and partners.

2,500 attendees from all over the globe will attend this year. They’ll get access to 100+ workshops, training opportunities and sessions covering software automation, DevOps culture, performance measurement, security and more.

Bitrock is present with Matteo Gazzetta, Simone Ripamonti, and Andrea Simonini, members of Bitrock’s DevOps Team. But it’s not only DevOps attending: our Backend Developer Simone Esposito is attending too.


[Photo] Nice, France | Palace of Congresses and Exhibitions Nice Acropolis, October 22-25

[Photo] Our Team at the Event
