New partnership announcement with Databricks

Artificial intelligence stopped being merely the future some time ago: it already plays a key role in the technological development of businesses today. Bitrock is aware of this and, starting this year, has decided to launch its new Data, AI & ML Engineering area, precisely to apply next-generation technologies to one of the activities that most affect business growth: the proper management of data. It is a topic dear to the 100% Made-in-Italy consulting company, and to its whole reference Group (Fortitude), whose sister company Radicalbit has been working on streaming data analysis for several years now.

Collecting the information at hand, cataloging and exploring it, training a model, running it and maintaining it are all steps in an extremely complex cycle that, when completed correctly, yields countless benefits: from the ability to make timely decisions to minimizing the waste of energy and raw materials, with the related impact on business costs. With this in mind, and to give the new unit greater momentum, a partnership has been signed with Databricks, the company known for creating Apache Spark, MLflow and the data lakehouse with Delta Lake: a combination of data warehouse and data lake in a single, simple platform to better manage all types of structured, semi-structured and unstructured data.

Antonio Barbuzzi has been appointed to head the unit. The manager, who holds a degree in telecommunications engineering and a Ph.D. in electrical engineering, has spent his career in data analytics, in Italy and abroad, working for both large companies and emerging startups. After several years in France and the UK, he returned to Italy at the end of 2019, joining Unicredit Services as Head of GCC CBK Branch Tools and Head of ICT CRM, and later serving as technical manager for the integration of the bank's new CRM. He joined Bitrock as Head of Data, AI & ML Engineering in September of last year.

"I am delighted to have joined such an innovative company as Bitrock. Helping the company in this new path will certainly be a difficult challenge but also a very compelling one. - declares Barbuzzi, Head of Data, AI & ML Engineering Area at Bitrock - Artificial intelligence and Machine Learning technologies, together with the Cloud, are crucial for our clients' business development, particularly when applied to data management and analysis. The goal will, therefore, be to provide them with tools and skills that can support them in the most congenial way, creating tailor-made services from time to time."

"Automation, simplification, and Artificial Intelligence are in our view the pillars of the future on which we base our work to ensure speed of development, cost reduction, and overall increase in efficiency for businesses. - Adds Leo Pillon, CEO of Bitrock - This is the vision of the entire Fortitude Group, as well as of Bitrock as it begins this new journey. The hope is that in a short time we can become an authoritative reference in a specific sector that is becoming more and more important day by day."

Scenario

According to recent estimates by Expert Market Research (2022), investment in data management-related activities amounts to about $70 billion - one-fifth of total enterprise infrastructure spending in 2021, according to Gartner. This fast-growing trend is also reflected in the job market, where data scientists, data engineers and machine learning engineers are among the most sought-after roles globally. A similar scenario is expected for the future: according to McKinsey, by 2025 companies will base all kinds of decisions on data analysis, relying on real-time processing for increasingly precise insights.


Vision & Offering

This is the second part of our article which introduces Bitrock’s vision and offering in the Data, AI & ML Engineering area. The first part delimits the context where we focus and operate, while this one defines our vision and the proposition that follows.

Vision

Artificial Intelligence (AI) is shaping the future of mankind in nearly all industries, and it is driving advancements in heterogeneous fields such as big data, robotics, and the Internet of Things. We have a strong conviction that AI will continue to be a driving force of innovation and progress in the future. As a company, we recognize the vital importance of AI and ML for organizations to not just survive but thrive in the market.

That's why we're committed to providing our customers with the platform, tools, and expertise to harness the full potential of AI: we help them create innovative solutions and operationalize robust and reliable AI-based systems, tailoring our offering to their needs in this field.

AI/ML is the last piece of the puzzle, the last stretch in a race. It needs strong pillars to build upon: a reliable and scalable data platform, designed to evolve rather than just for the latest delivery, where security and governance are central, and with automated tests and continuous integration/deployment in place. Indeed, the motto "garbage in, garbage out" applies to data even more than elsewhere.

Data platforms should be tailored to customer needs: there is no one-size-fits-all approach to data engineering problems; rather, there are companies, customers and partners with different backgrounds and needs, requiring different solutions. Paraphrasing Maslow's hammer: not everything is a nail to be pounded with a hammer.

We believe in bespoke solutions for our clients, guiding them through the intricacies of the current data landscape and designing the platform that best fits their existing infrastructure and needs.

Our ambition is also to help our clients define a clear and effective data strategy that aligns with their overall business objectives. Organizations should define goals, processes and business targets, and provide a data governance framework and processes that balance security and privacy concerns while simplifying the discovery, access and use of data.

In order to provide the best services, we value our partnerships: as of today, we’re partners with Databricks, Confluent and HashiCorp.

Design Principles

Our solutions follow specific design principles that drive our choices:

Cloud first

Cloud first means prioritizing cloud over on-premise solutions. In other words, teams must justify picking on-premise solutions, rather than make a case for cloud ones.

We're aware of some companies' reluctance towards cloud solutions; nevertheless, nowadays there are very few reasons not to embrace the cloud. Its advantages are too many: faster time to market, easy scaling, no upfront license or hardware costs, lower operating costs. Basically, the cloud allows a business to outsource non-core processes and focus on what matters most.

ML/AI from the beginning

Machine Learning (ML) and Artificial Intelligence (AI) have taken a tremendous leap forward in recent years, mainly due to the increased availability of data and computing resources (faster GPUs, larger memory). Artificial intelligence has reached or surpassed human-level performance in many complex tasks: autonomous driving is now a reality, social networks use ML extensively to detect harmful content and target advertisements, and generative models such as OpenAI's GPT-3 or Google's Imagen could be game changers in the quest toward artificial general intelligence (AGI).

AI/ML is no longer the future to look at, it’s the present. 

Some organizations will use it as a competitive advantage over their competitors; others will see it as homework required to keep up and remain competitive on the market. For sure, no one can really afford to ignore it anymore (or maybe just monopolies and the public administration?).

AI and ML have a central role in our vision and shape our architectural and technological choices.

In this context, the continuous interpretation of data - discovering patterns and making timely decisions based on historical and real-time data, the so-called Continuous Intelligence - will play a crucial role in defining business strategies and will be one of the most widespread applications of machine learning. Indeed, Gartner estimates that within three years more than 50% of all business initiatives will require continuous intelligence and that, by 2023, more than one-third of enterprises will have analysts practising decision intelligence, including decision modelling.

MLOps and AI Engineering

MLOps, or Machine Learning Operations, is a field in the ML community that is rapidly gaining momentum. It advocates for the need to manage the ML lifecycle following software-inspired best practices and DevOps philosophy. This approach aims to make ML-powered software reproducible, testable, and evolvable, ensuring that models are deployed and updated in a controlled and efficient manner. The importance of MLOps lies in the ability to improve the speed and reliability of ML model deployment, while reducing the risk of errors and improving the overall performance of models.
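As a minimal illustration of these practices, the sketch below uses MLflow (one of the tools in the Databricks ecosystem mentioned earlier) to track an experiment, so that parameters, metrics and the model artifact are recorded with the run; the dataset, model and names are illustrative, not a prescription.

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Illustrative data and model: the point here is the tracking
    X, y = make_classification(n_samples=1000, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    with mlflow.start_run(run_name="rf-baseline"):
        n_estimators = 100
        mlflow.log_param("n_estimators", n_estimators)

        model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
        model.fit(X_train, y_train)

        # Metrics and the serialized model are versioned with the run
        mlflow.log_metric("accuracy",
                          accuracy_score(y_test, model.predict(X_test)))
        mlflow.sklearn.log_model(model, "model")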

Data democratization

We've already underlined the importance of data democratization. Achieving it requires several key elements to be in place. Firstly, it requires a data culture in which data is seen as a strategic asset, valued and leveraged throughout the company. This in turn requires buy-in and commitment from top management.

Widespread access to data calls for the widespread adoption of more robust Data Governance solutions, with data discoverability features, to effectively manage complex data processes and make data available and usable by everybody who needs it.

Making data accessible also means lowering the entry barrier to it, and therefore providing more user-friendly platforms that can be used autonomously, without advanced knowledge (the so-called self-service platforms).

Data Mesh is an approach oriented towards large-scale environments that goes in this direction. It addresses silos and bottlenecks in large companies and emphasises the decentralization of data ownership, moving it to the business domain teams.

Data Mesh increases overall complexity and introduces new challenges for the organizations adopting it, but it may help when scalability and data silos effectively represent a barrier to company-wide data usage.

Reference Architecture

We at Bitrock refrain from providing a one-size-fits-all solution; rather, we provide a reference data architecture modelled after technology stacks used across multiple companies and updated with more recent innovations.

We focus on a multimodal data processing architecture, specialized in AI/ML and operational use cases, and able to support the analytical needs typical of data warehouses. As previously explained, this is an alternative to a Business Intelligence-oriented architecture based on data warehouses.

At the core of the system there are the concepts of data lake and data lakehouse.

A data lake is a centralised repository that allows you to store and manage all your structured and unstructured data at any scale. Data lakes are traditionally oriented towards advanced processing of operational data and ML/AI. The data lakehouse concept adds to them a robust storage layer paired with a processing engine (Spark, Presto, …) to provide data-warehousing capabilities, making data lakes suitable for analytical workloads too.

There is growing recognition for this architecture, which is supported by a wide range of vendors, including Databricks, AWS, Google Cloud, Starburst, and Dremio - and by data warehouse vendors like Snowflake too.

For a more detailed introduction to it, please refer to a previous article on our Blog (Data Lakehouse, beyond the hype).

Our processing engine of choice is Apache Spark, the de-facto standard for operational workloads - paired with the battle-tested and reliable Apache Airflow, or Astronomer, its managed SaaS offering. In the orchestration world, Dagster and Prefect are alternatives to Airflow that are gaining a lot of traction; they foster a switch to a higher-level abstraction, from managing workflows to handling dataflows.

Spark is suitable for both batch and real-time workloads; for real-time data processing, though, Apache Flink and Kafka Streams may be good alternatives, especially for applications with more stringent latency requirements.
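As a sketch of this unified model, the snippet below applies the same transformation to a batch job and to a Structured Streaming job; the paths and event schema are hypothetical, and a PySpark environment is assumed.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("unified-example").getOrCreate()

    def enrich(df):
        # Business logic shared by the batch and streaming pipelines
        return df.withColumn("ingested_at", F.current_timestamp())

    schema = "user_id STRING, action STRING, ts TIMESTAMP"  # illustrative

    # Batch: process a static directory of JSON events
    batch_df = enrich(spark.read.schema(schema).json("s3://my-bucket/events/"))
    batch_df.write.mode("append").parquet("s3://my-bucket/output/")

    # Streaming: the same logic applied to continuously arriving events
    stream_df = enrich(
        spark.readStream.schema(schema).json("s3://my-bucket/events/"))
    (stream_df.writeStream
        .format("parquet")
        .option("path", "s3://my-bucket/output/")
        .option("checkpointLocation", "s3://my-bucket/checkpoints/")
        .start())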

In the streaming world applied to AI and ML, another option is Helicon from Radicalbit, a solution aimed at reducing the gap between data science and data engineering with a no-code/low-code approach. There is revived interest in no-code/low-code solutions, which are bringing new users (i.e. analysts and software developers) into the ML market, pushed by new low-code ML solutions like Databricks AutoML, H2O, DataRobot, etc.

Quick data exploration may be achieved either with ad-hoc query engines like Trino/Presto/Starburst/Databricks SQL, or with notebooks like Jupyter and their managed versions.

Integration is the boring homework preceding the fun part. Yet it represents the largest fraction of the cost of most data projects, ranging from 20-30% on average up to 70% in some pessimistic cases.

From a technical point of view, the ingestion layer is quite diversified, and it is generally shaped by the organization's data sources and infrastructure.

Traditionally, data is extracted from operational data sources and transformed before being loaded into a data warehouse - the so-called ETL. Cheap cloud storage and the separation of storage and computing laid the foundation for a paradigm shift that moves the loading phase before the transformation phase (ELT). This pattern, actually not totally new for data lakes, shines as it removes the business logic from the loading phase of the ingestion layer, making it possible to simplify the integration by outsourcing it.
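A minimal sketch of this ELT flow, assuming a Spark-based platform with a table catalog (table names, paths and fields are hypothetical): the ingestion layer lands the raw records unchanged, and the business logic runs afterwards inside the platform.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("elt-example").getOrCreate()

    # Load: land raw data as-is; no business logic in the ingestion layer
    raw = spark.read.json("s3://my-bucket/landing/orders/")
    raw.write.mode("append").saveAsTable("bronze_orders")

    # Transform: business logic applied later, inside the platform
    spark.sql("""
        CREATE OR REPLACE TABLE silver_orders AS
        SELECT order_id, customer_id, CAST(amount AS DECIMAL(10, 2)) AS amount
        FROM bronze_orders
        WHERE amount IS NOT NULL
    """)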

Fivetran, along with Airbyte, Matillion and many others, are examples of ELT tools. Strictly speaking, the term ETL is generally used in a data-warehousing context; however, those integration tools are beneficial to lake and lakehouse architectures too - Fivetran, for example, has recently become a Databricks partner.

In the ingestion layer, Confluent is also playing a more and more important role with Kafka Connectors, which allow data to be pulled from (and pushed to) a variety of sources. The combination of Kafka and CDC (Change Data Capture), with software like Debezium, Qlik or Fivetran, is an increasingly common integration pattern in this context.

The following figure, based on the unified data platform from Andreessen Horowitz (Bornstein, Li, and Casado 2020), exemplifies our architecture, in particular the boxes highlighted in yellow:

Emerging Architectures for Modern Data Infrastructure

ML-platform

A central role in our platform is reserved to the operationalization of ML models and AI-based software.

As discussed under our design principles, MLOps applies software-inspired best practices and the DevOps philosophy to the ML lifecycle, making ML-powered software reproducible, testable, and evolvable. Our idea of a generic platform for machine learning, providing all the tools to operationalize the ML lifecycle, is best described by the following figure, based on (Bornstein, Li, and Casado 2020).

Emerging Architectures for Modern Data Infrastructure

Conclusions

We believe AI and ML are crucial for any organization and will be fundamental to succeed and thrive in the market.

Bitrock is committed to providing customers with the platform, tools, and expertise to harness the full potential of Artificial Intelligence (AI) and Machine Learning (ML) and operationalize it through AI engineering and MLOps.

We tailor our offering to meet the unique needs of our customers and believe in providing bespoke solutions for our clients. Our ambition is to jointly define a clear and effective data strategy that aligns with their overall business objectives. 

If you have any questions, doubts or just want to discuss data-related topics, please feel free to get in touch: we’d be more than happy to help or just chat!

References

Author: Antonio Barbuzzi, Head of Data, AI & ML Engineering @Bitrock


Vision & Offering

In this blog post we’re introducing Bitrock’s vision and offering in the Data, AI & ML Engineering area. We’ll provide an overview of the current data landscape, delimit the context where we focus and operate, and define our proposition.

This first part describes the technical and cultural landscape of the data and AI world, with an emphasis on the market and technology trends. The second part that defines our vision and technical offering is available here.

A Cambrian Explosion

The Data & AI landscape is rapidly evolving, with heavy investments in data infrastructure and an increasing recognition of the importance of data and AI in driving business growth.

Investment in managing data has been estimated to be worth over $70B [Expert Market Research 2022], accounting for over one-fifth of all enterprise infrastructure spend in 2021 according to (Gartner 2021).

This trend is tangible in the job market too: indeed, data scientists, data engineers, and machine learning engineers are listed among LinkedIn's fastest-growing roles globally (LinkedIn 2022).

And this trend doesn't seem to be slowing down. According to (McKinsey 2022), by 2025 organizations will leverage data for every decision, interaction, and process, shifting towards real-time processing to get faster and more powerful insights.

This growth is also reflected in the number of tools, applications, and companies in this area, in what is generally called a "Cambrian explosion" - comparing this growth to the explosion of diverse life forms during the Cambrian period, when many new types of organisms appeared in a relatively short period of time. This is clearly depicted in the following figure, based on (Turk 2021).

A Cambrian Explosion

The Technological Scenario

Data architectures serve two main objectives: helping the business make better decisions by exploiting and analyzing data - the so-called analytical plane - and providing intelligence to customer-facing applications - the so-called operational plane.

These two use-cases have led to two different architectures and ecosystems around them: analytical systems, based on data warehouses, and operational systems, based on data lakes.

The former, built upon data warehouses, have grown rapidly. They focus on Business Intelligence and on business users and analysts, typically familiar with SQL. Cloud warehouses, like Snowflake, are driving this growth; the shift from on-prem to cloud is at this point relentless.

Operational systems have grown too. Based on data lakes, their growth is driven by the emerging lakehouse pattern and the huge interest in AI/ML. They specialize in dealing with unstructured and structured data, while supporting BI use cases too.

In the last few years, a path towards the convergence of the two technologies has emerged: data lakehouses added ACID transactions and data-warehousing capabilities to data lakes, while warehouses have become capable of handling unstructured data and AI/ML workloads. Still, the two ecosystems remain quite different, and may or may not converge in the future.

On the ingestion and transformation side, there is a clear architectural shift from ETL to ELT (that is, data is first ingested and then transformed). This trend, made possible by the separation of storage and computing brought by the cloud, is pushed by the rise of CDC technologies and the promise of offloading non-business details to external vendors.

In this context, Fivetran and dbt shine in the analytical world (along with new players like Airbyte and Matillion), while Databricks/Spark, Confluent/Kafka and Astronomer/Airflow are the de-facto standards in the operational world.

It is also noteworthy that there has been an increase in the use of stream processing for real-time data analysis. For instance, the usage of stream processing products from companies such as Databricks and Confluent has gained momentum.

Artificial Intelligence (AI) topics are gaining momentum too, and Gartner, in its annual report on strategic technology trends (Gartner 2021), lists Decision Intelligence, AI Engineering and Generative AI as priorities to accelerate growth and innovation.

Decision Intelligence involves the use of machine learning, natural language processing, and decision modelling to extract insights and inform decision-making. According to the report, in the next two years, a third of large organisations will be using it as a competitive advantage.

AI Engineering focuses on the operationalization of AI models, integrating them with the software development lifecycle to make them robust and reliable. According to Gartner analysts, enterprises adopting it will generate at least three times more value from their AI efforts than those that don't.

Generative AI is one of the most exciting and powerful examples of AI. It learns the context from training data and uses it to generate brand-new, completely original, realistic artefacts, and it will be used for a multitude of applications. According to Gartner, it will account for 10% of all data produced by 2025.

Data-driven Culture and Democratization

Despite the clear importance of data, it's a common experience that many data initiatives fail. Gartner has estimated that 85% of big data projects fail (O'Neill 2019) and that through 2022 only 20% of analytic insights will deliver business outcomes (White 2019).

What goes wrong? Problems rarely lie in the inadequacy of the technical solutions; technical problems are probably the simplest ones. Indeed, over the last ten years technologies have evolved tremendously fast, and Big Data technologies have matured a lot. More often, the problems are cultural.

It's no mystery that a data lake by itself does not provide any business value. Collecting, storing, and managing data is a cost. Data become (incredibly) valuable when they are used to produce knowledge, insights, and actions. To make the magic happen, data should be accessible and available to everybody in the company. In other words, organizations should invest in a company-wide data-driven culture and aim at true data democratization.

Data should be considered a strategic asset that is valued and leveraged throughout the organization. Managers, starting from the C-level, should create the conditions for people who need data to access them, by removing obstacles and bottlenecks and by simplifying processes.

Creating a data culture and democratizing data allows organizations to fully leverage their data assets and make better use of data-driven insights. By empowering employees with data, organizations can improve decision-making, foster innovation, and drive business growth.

Last but not least, Big Data's power does not erase the need for vision or human insight (Waller 2020). It is fundamental to have a data strategy that defines how the company needs to use data and how this links to the business strategy. And, of course, buy-in and commitment from all management levels, starting from the top.

The second part of this article can be found here.

References

Author: Antonio Barbuzzi, Head of Data, AI & ML Engineering @ Bitrock


Why the Lakehouse is here to stay

Introduction

The past few years have witnessed a contraposition between two different ecosystems, data warehouses and data lakes - the former designed as the core for analytics and business intelligence, generally SQL-centred; the latter providing the backbone for advanced processing and AI/ML, operating on a wide variety of languages ranging from Scala to Python, R and SQL.

Despite the contraposition between the respective market leaders - think, for example, of Snowflake vs Databricks - the emerging pattern also shows a convergence between these two core architectural patterns [Bor20].

The lakehouse is the new concept that moves data lakes closer to data warehouses, making them able to compete in the BI and analytical world.

Of course, as with any emerging technical innovation, it is hard to separate the marketing hype from the actual technological value, which, ultimately, only time and adoption can prove. While it is undeniable that marketing is playing an important role in spreading the concept, there is a lot more to it than just buzzwords.

Indeed, the lakehouse architecture has been introduced separately, and basically in parallel, by three important and trustworthy companies, with three different implementations.

Databricks published its seminal paper on the lakehouse [Zah21], followed by the open-sourcing of the Delta Lake framework [Delta, Arm20].

In parallel, Netflix, in collaboration with Apple, introduced Iceberg [Iceberg], while Uber introduced Hudi [Hudi] (pronounced "hoodie"); both became top-tier Apache projects in May 2020.

Moreover, all major data companies are competing to support it, from AWS to Google Cloud, passing through Dremio, Snowflake and Cloudera, and the list is growing.

In this article, I will try to explain, in plain language, what a lakehouse is, why it is generating so much hype, and why it is rapidly becoming a centerpiece of modern data platform architectures.

What is a Lakehouse?

In a single sentence, a lakehouse is a “data lake” on steroids, unifying the concept of “data lake” and “data warehouse”.

In practice, the lakehouse leverages a new metadata layer providing a “table abstraction” and some features typical of data warehouses on top of a classical Data Lake.

This new layer is built on top of existing technologies in particular on a binary, often columnar, file format, which can be either Parquet, ORC or Avro, and on a storage layer.

Therefore, the main building blocks of a lakehouse platform, from a bottom-up perspective, are:

  • A file storage layer, generally cloud-based, for example AWS S3, GCP Cloud Storage or Azure Data Lake Storage Gen2.
  • A binary file format like Parquet or ORC, used to store data and metadata.
  • The new table format layer: Delta Lake, Apache Iceberg or Apache Hudi.
  • A processing engine supporting the above table format, for example Spark, Presto or Athena.
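As a minimal sketch of these blocks working together - assuming PySpark with the Delta Lake table format over cloud object storage, with hypothetical paths - writing a table produces Parquet data files plus the table-format metadata on the storage layer, with Spark acting as the processing engine:

    from pyspark.sql import SparkSession

    # Enable the Delta table format on a plain Spark session
    spark = (SparkSession.builder
             .appName("lakehouse-example")
             .config("spark.sql.extensions",
                     "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

    # Writing creates Parquet files plus a _delta_log/ directory of metadata
    df.write.format("delta").save("s3://my-bucket/tables/users")

    # Any Delta-aware engine can now read the table with table semantics
    spark.read.format("delta").load("s3://my-bucket/tables/users").show()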

To better understand the idea behind the lakehouse and the evolution towards it, let’s start with the background.

First generation, the data warehouse

Data Warehouses have been around for 40+ years now. 

They were invented to answer business questions that were too computationally intensive for the operational databases, and to join datasets coming from multiple sources.

The idea was to extract data from the operational systems, transform them into the most suitable format to answer those questions and, finally, load them into a single specialised database. Incidentally, this process is called ETL (Extract, Transform, Load).

This is sometimes also referred to as the first generation.

To complete the concept, a data mart is a portion of a data warehouse focused on a specific line of business or department.

The second generation, data lakes

The growing volume of data to handle, along with the need to deal with unstructured data (e.g. images, videos, text documents, logs, etc.), made data warehouses more and more expensive and inefficient.

To overcome these problems, the second generation data analytics platforms started offloading all the raw data into data lakes, low-cost storage systems providing a file-like API. 

Data lakes started with MapReduce and Hadoop (even if the name "data lake" came later) and were later followed by cloud data lakes, such as those based on S3, ADLS and GCS.

Lakes feature low-cost storage, higher speed, and greater scalability but, on the other hand, they give up many of the advantages of warehouses.

Data Lakes and Warehouses

Lakes did not replace warehouses: they were complementary, each of them addressed different needs and use cases. Indeed, raw data was initially imported into data lakes, manipulated, transformed and possibly aggregated. Then, a small subset of it would later be ETLed to a downstream data warehouse for decision intelligence and BI applications.

This two-tier data lake + warehouse architecture is now largely used in the industry, as you can see in the figure below:

Source: Martin Fowler

Problems with two-tiered Data Architectures

A two-tier architecture comes with additional complexity and in particular it suffers from the following problems:

  • Reliability and redundancy, as more copies of the same data exist in different systems and they need to be kept available and consistent across each other;
  • Staleness, as the data needs to be loaded first in the data lakes and, only later, into the data warehouse, introducing additional delays from the initial load to when the data is available for BI;
  • Limited support for AI/ML on top of BI data: businesses require more and more predictive analysis - for example, "which customers should we offer discounts to?". AI/ML libraries do not run on top of warehouses, so vendors often suggest offloading data back to the lakes, adding further steps and complexity to the pipelines.
    Modern data warehouses are adding some support for AI/ML, but they are still not ideal for binary formats (video, audio, etc.).
  • Cost: of course, keeping up two different systems increases the total cost of ownership, which includes administration, licence costs, and additional expertise.

The third generation, the Data Lakehouse

A data lakehouse is an architectural paradigm adding a table layer backed up by file-metadata to a data lake, in order to provide traditional analytical DB features such as ACID transactions, data versioning, auditing, indexing, caching and query optimization.

In practice, it may be considered as a data lake on steroids, a combination of both data lakes and data warehouses.

This pattern makes it possible to move many of the use cases traditionally handled by data warehouses into data lakes, and it simplifies implementations by moving from a two-tier pipeline to a single-tier one.

In the following figure you can see a summary of the three different architectures.

Source: Databricks

Additionally, lakehouses move the implementation and support of data warehouse features from the processing engine to the underlying file format. As such, more and more processing engines are able to capitalise on the new features: indeed, most engines are adding support for the lakehouse formats (Presto, Starburst, Athena, …), contributing to the hype. The benefit for users is that the existence of multiple engines featuring data warehouse capabilities allows them to pick the best solution for each use case - for example, Spark for more generic data processing and AI/ML problems, or Trino/Starburst/Athena/Photon for quick SQL queries.

Characteristics of Data Lakehouse

For those who may be interested, let's dig (slightly) deeper into the features provided by lakehouses and their role.

ACID

The most important feature, available across all the different lakehouse implementations, is the support of ACID transactions.

ACID, standing for atomicity, consistency, isolation, durability, is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps.

Indeed, cloud object stores haven't always provided strong consistency, so stale reads were possible - this is called eventual consistency.

Moreover, there are no mutual-exclusion guarantees, so multiple writers can update the same file without external coordination, and there is no atomic update support across multiple keys, so updates to multiple files may become visible at different times.

Lakehouse implementations guarantee ACID transactions on a single table, regardless of the underlying storage and of the number of files involved.

This is achieved in different ways in the three major players, but generally speaking, they all use metadata files to identify which files are part of a table snapshot and some WAL-like file to track all the changes applied to a table.

Note that there are alternative ways to provide ACID consistency, in particular by using an external ACID-consistent metadata store, such as an external DB. This is what Hive 3 ACID does, for example, or Snowflake. However, not depending on an external system removes a bottleneck and a single point of failure, and allows multiple processing engines to leverage the data structure.

Partitioning

Automatic partitioning is another fundamental feature, used to reduce queries' processing requirements and simplify table maintenance. It is implemented by partitioning data into multiple folders and, while it could be implemented at the application level, the lakehouse provides it transparently. Moreover, some lakehouses (see Iceberg) can support partition evolution automatically.
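Continuing the Delta sketch from earlier (the column and path are illustrative), declaring a partition column at write time is all that is needed; the engine then prunes partitions transparently at query time:

    from pyspark.sql import functions as F

    # Data is laid out in country=<value>/ sub-folders; queries filtering
    # on "country" read only the relevant folders
    (df.withColumn("country", F.lit("IT"))
       .write.format("delta")
       .partitionBy("country")
       .save("s3://my-bucket/tables/users_by_country"))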

Time Travel

Time Travel is the ability to query/restore a table to a previous state in time.

This is achieved by keeping metadata containing snapshot information for longer time periods.

Time travel is a feature also provided by traditional DBs oriented to OLAP workloads, as it may be implemented on top of write-ahead logs: it was available, for example, in PostgreSQL until version 6.2, and in SQL Server. The separation between storage and processing makes this feature easier to support in lakehouses, as they rely on cheap underlying storage.

Of course, to reduce cost/space usage, you may want to periodically clean up past metadata, so that time travel is possible up to the oldest available snapshot.
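Continuing the same sketch, time travel on a Delta table is a read option (Iceberg and Hudi expose the same idea with their own syntax); the version and timestamp are illustrative:

    # Read the table as it was at a given snapshot version...
    v0 = (spark.read.format("delta")
          .option("versionAsOf", 0)
          .load("s3://my-bucket/tables/users"))

    # ...or as it was at a given point in time
    jan = (spark.read.format("delta")
           .option("timestampAsOf", "2023-01-01")
           .load("s3://my-bucket/tables/users"))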

Schema Evolution and Enforcement

Under the hood, Iceberg, Delta and Hudi rely on binary file formats (Parquet/ORC/Avro), which are compatible with most of the data processing frameworks.

Lakehouses provide an additional abstraction layer - a mapping between the underlying files' schemas and the table schema - so that schema evolution can be done in place, without rewriting the entire dataset.
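In the Delta sketch used so far, for example, appending rows that carry a new column is an explicit opt-in, and the table schema evolves on write without rewriting existing files (names are illustrative):

    new_rows = spark.createDataFrame([(3, "carol", "IT")],
                                     ["id", "name", "country"])

    # mergeSchema asks Delta to evolve the table schema in place
    (new_rows.write.format("delta")
       .mode("append")
       .option("mergeSchema", "true")
       .save("s3://my-bucket/tables/users"))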

Streaming support

Data lakes are not well suited to streaming applications, for multiple reasons: object stores do not offer an "append" feature, and for a long time they did not provide a consistent view across multiple files. Yet this is a common need: offloading Kafka data into a storage layer, for example, is a fundamental part of the lambda architecture.

Lakehouses make it possible to use their tables as both streaming input and output. This is achieved by an abstraction layer masking the use of multiple files and a background compaction process joining small files into larger ones, in addition to "Exactly-Once Streaming Writes" and "efficient log tailing". For details, please see [Arm20].
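A sketch in the same Delta setting - assuming the Spark-Kafka integration package is on the classpath, with hypothetical broker, topic and paths - where the table serves as an exactly-once streaming sink for a Kafka topic and, in turn, as a streaming source for downstream jobs:

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())

    # Exactly-once sink: the checkpoint plus the Delta transaction log
    # make the stream's writes idempotent
    (events.selectExpr("CAST(value AS STRING) AS payload")
       .writeStream
       .format("delta")
       .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
       .start("s3://my-bucket/tables/events"))

    # The same table can then be tailed as a streaming source
    downstream = spark.readStream.format("delta").load("s3://my-bucket/tables/events")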

The great convergence

Will lakehouse-based platforms completely get rid of data warehouses? I believe this is unlikely. What is sure at the moment is that the boundaries between the two technologies are becoming more and more blurred.

Indeed, while data lakes, thanks to Delta Lake, Apache Iceberg and Apache Hudi, are moving into data warehouse territory, the opposite is true as well.

Indeed, Snowflake has added support for the lakehouse table layer (Apache Iceberg/Delta at the time of writing), becoming one of the possible processing engines supporting the lakehouse table layer.

At the same time, warehouses are moving into AI/ML applications, traditionally a monopoly of data lakes: Snowflake released Snowpark, an AI/ML Python library that allows writing data pipelines and ML workflows directly in Snowflake. Of course, it will take a bit of time for the data science community to accept and master yet another library, but the direction is set.

But what’s interesting is that warehouses and lakes are becoming more and more similar: they both rely on commodity storage, offer native horizontal scaling, support semi-structured data types, ACID transactions, interactive SQL queries, and so on.

Will they converge to the point where they become interchangeable in data stacks? This is hard to tell, and experts have different opinions: while the direction is undeniable, differences in languages, use cases or even marketing will play an important role in defining what future data stacks will look like. Anyway, it is a safe bet to say that the lakehouse is here to stay.

References

Author: Antonio Barbuzzi, Head of Data, AI & ML Engineering @ Bitrock

Mobile Application Development

Interview with Samantha Giro, Team Lead Mobile Engineering @ Bitrock

A few months ago, we decided to further invest in Bitrock’s User-Experience & Front-end Engineering area by creating a vertical unit dedicated to Mobile Application Development.

The decision stemmed from several inputs: first of all, we perceived high demand from organizations looking for professionals specialized in mobile app development who could support them in their digital evolution journey.

Secondly, we already had the chance to implement successful projects related to mobile app development for some of our clients, primarily in the fintech and banking sectors.

Furthermore, since we are a team of young entrepreneurs and technicians continuously looking for new opportunities and challenges, we deeply wanted to explore this area within front-end engineering, which we found extremely interesting and a perfect fit for the 360° technology consulting approach offered by Bitrock.

Creating a unit specifically dedicated to mobile programming was thus a natural step towards continuous improvement and growth.

We are now ready to delve deeper into the world of mobile application development by asking a few questions to Samantha Giro, Team Lead Mobile Engineering at Bitrock. 

What is mobile application development? And what are the main advantages of investing in an app?

Mobile application development is basically the set of processes and procedures involved in writing software for small, wireless computing devices, such as smartphones, tablets and other hand-held devices.

Mobile applications offer a wide range of opportunities. First of all, they are installed on mobile devices - smartphones, iPhones, tablets, iPads - that users can easily bring with them, anywhere they go. 

It is thus possible to use them in work environments, such as manufacturing industries (just think about when workers control loading and unloading operations from a single device), to manage sales workflows, or events. Many solutions work even offline, allowing people to use them continuously and without interruption.

Moreover, mobile apps give users the opportunity to interact with the product readily and effectively. Through a push-notification campaign, for instance, it is possible to activate typical marketing affiliation mechanisms. These allow companies to run advertising campaigns and retain clients through continuous usage - for example by inviting users to discover new product functionalities or other related solutions.

Mobile technologies can also be associated with other external hardware solutions through bluetooth or Wifi connection, thus widening the range of usage possibilities.

The sensors and hardware of the device, such as the integrated camera, increase the number and type of functionalities that a product can perform. This brings great advantages and comfort to our daily lives: for example, if you need the digital copy of a signed paper, with the camera of your mobile device you can easily scan it and have the document online in real-time, ready to be shared with other users.

Last but not least, the interaction and integration with AI systems, augmented reality and vocal assistants grant easier access and an up-to-date user experience. 

Users can for instance “rule” their houses remotely: as we all know, nowadays we can easily turn on the lights of our house or activate the warning system simply by accessing an app on our mobile device. 

What types of mobile applications are there?

There are different ways to develop a product dedicated to mobile: native, hybrid and web applications. 

Native mobile app development implies dedicating some resources to devices that run the Android operating system and others to those that run Apple systems. There are no limits to the customization of the product, apart from those defined by the operating systems themselves (Android and iOS).

Native apps have to be approved by the stores before being published, and require a different knowledge base, since each platform has its specific operating system, integrated development environment (IDE) and language that must be taken into account.

They imply higher costs in terms of update and maintenance than hybrid apps since they usually require at least two different developers dedicated to the product. 

Native apps can take full advantage of the hardware with which they interact, and they are usually faster and more performant than their respective hybrid versions. The final app size also benefits from the absence of a conversion framework. Compatibility is granted over time, net of the updates that need to be executed following the guidelines issued by the parent company.

Thanks to specific frameworks, hybrid mobile app development allows the creation of applications for both operating systems from one shared codebase, thus reducing maintenance costs. Development is subject to the limitations of the framework and to its updates, which must be frequent and follow the native ones. Complex functionalities still require customization of the native part. Lastly, hybrid apps must also undergo approval from the stores before being published.

The most popular development frameworks are React Native and Flutter.

Based on JavaScript, React Native is widely used and known by many web developers. It is well suited to the development of hybrid applications, and it is fast and performant. Its nature as an interface for native development makes applications less performant than purely native ones; nevertheless, it is a good product, since it also facilitates sharing code with web applications. The community is wide, and the large number of open-source libraries makes up for any functionality you may need to integrate. It offers two different development modes, allowing applications to be written entirely in JavaScript or with native customizations.

Flutter is a more recent product than React Native, and it is based on the Dart language developed by the Google team. The ease of the language and of the tool is convincing more and more developers to use it. Differently from React Native, Flutter's components do not depend on native ones: for this reason, when the operating systems are updated, the product keeps working well. The plugins for specific functionalities, such as localization and maps, are created and managed by the Google team, which ensures reliability, compatibility and constant updates. Dart, however, is still a little-known language compared to JavaScript, and it requires very specific knowledge in the field.

Last but not least, there are web applications, which are similar to mobile apps but developed using web technologies. They are not installed through the stores, but accessed like a website: the user can add a shortcut link to the mobile home screen and launch the web application from there. In this case, offline usability is limited and not guaranteed. Moreover, it is not possible to take full advantage of the hardware resources.

How is a mobile app created? And what are the most widely used programming languages?

The development of a mobile app generally starts with a study of its key functionalities and an analysis of what the customer needs, along with the creation of a dedicated design. Another preliminary step consists in working on the user experience: for this reason, close collaboration with a UI/UX expert is essential.

When all this is defined, the team decides what the best technologies and solutions to develop that specific application are. Then, developers write code, recreate the design and functionality, and run some tests to prove that everything works properly. 

Once the whole package is created and approved by the customer, it is published on the different stores (Google or Apple) - when store distribution is required, of course.

Let’s now have a quick look at the main programming languages.

If we're talking about hybrid development, the main tools are Ionic (JavaScript), Flutter (Dart) and React Native (JavaScript). As for native apps, the top iOS development language is Swift or, alternatively, its predecessor Objective-C, while the most popular Android development language is Kotlin, though some still use Java. Of course, developers must also rely on an IDE. Although there are many other alternatives, the ones mentioned above can be considered the most widely used.

What are the main market differences compared to website development?

Let me start by saying that websites will never die, for the simple reason that when you need information rapidly and for a very specific circumstance, the product or service’s website is always available. However, websites cannot take advantage of all the instruments and hardware available for a mobile application. 

Mobile apps can store a certain amount of data locally - something that a website cannot always provide (unless you have an account and always work online).

They can quickly access information from hardware such as the accelerometer, the gyroscope and others, and they enable customer-retention strategies (even though nowadays push notifications exist for websites, too). With a mobile app, it is thus possible to grow customer loyalty through specific features and functionalities.

Furthermore, mobile applications are specifically designed to grant ease of use, while websites traditionally provide a different kind of usability. 

Most of the time, web and mobile can perfectly work together (see, for example, what happens with Amazon: users can buy an item via the website or the mobile app); other times, a mobile app can outdo its web counterpart, especially when users have to manage specific data or use specific technologies. For example, you will hardly do biometric authentication via a website.

At Bitrock, we always put clients’ needs first: the creation of the brand-new Mobile Application Development unit within our User Experience & Front-end Engineering team has the goal to widen our technology offering in the field. In this way, we can provide a broad range of cutting-edge, versatile and scalable mobile technology solutions to a variety of markets and businesses. 

We always collaborate closely with our clients to plan, create, and deploy solutions that are tailored to their specific requirements and in line with their contingent needs.

If you want to find out more about our proposition, visit our dedicated page on Bitrock’s website or contact us by sending an email!

Thanks to Samantha Giro > Team Lead Mobile Engineering @ Bitrock

Apache Airflow

Introduction

Apache Airflow is one of the most used workflow management tools for data pipelines - both AWS and GCP have a managed Airflow solution in addition to other SaaS offerings (notably Astronomer).

It allows developers to programmatically define, schedule, and monitor data workflows using Python. It is based on the directed acyclic graph (DAG) concept, where all the different steps (tasks) of the data processing (wait for a file, transform it, ingest it, join it with other datasets, process it, etc.) are represented as the graph's nodes.

Each node can be either an "operator", i.e. a task doing some actual job (transforming data, loading it, etc.), or a "sensor", i.e. a task waiting for some event to happen (a file arrival, a REST API call, etc.).
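A minimal DAG sketching the two kinds of node, written for Airflow 2.x; the file path, schedule and callable are illustrative:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.filesystem import FileSensor

    def transform():
        print("transforming the file...")

    with DAG(dag_id="example_pipeline",
             start_date=datetime(2023, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        # Sensor: wait for an event (a file landing on disk)
        wait_for_file = FileSensor(task_id="wait_for_file",
                                   filepath="/data/incoming/events.csv",
                                   poke_interval=60)
        # Operator: do some actual job
        process = PythonOperator(task_id="process", python_callable=transform)

        wait_for_file >> process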

In this article we will discuss sensors and tasks controlling external systems and, in particular, the internals of some of the most interesting (relatively) new features: reschedule sensors, Smart Sensors and Deferrable Operators.

Sensors are synchronous by default

Sensors are a special type of Operators designed to wait for an event to occur and then succeed so that their downstream tasks can run.

Sensors are a fundamental building block to create pipelines in Airflow; however, historically, as they share the Operator’s main execution method, they were (and by default still are) synchronous. 

By default, they busy-wait for an event to occur consuming a worker’s slot.

Too many sensors busy-waiting may, if pools are not well dimensioned, use up all the workers' slots and lead to starvation and deadlocks (if ExternalTaskSensor is used, for example). Even when enough slots are available, workers may be hogged by tons of sleeping processes.

Working around it

The first countermeasure is to confine sensors in separate pools. This only partially limits the problems.

A more efficient workaround exploits Airflow's ability to retry failed tasks. Basically, the idea is to make the sensor fail when its sensing condition is unmet, and to set the sensor's number of retries and retry delay accordingly; in particular, number_of_retries * retry_delay should equal the sensor's intended timeout. This frees the worker's slot, making it possible to run other tasks.
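A sketch of the idea with a hypothetical fail-fast file sensor: the poke raises as soon as the condition is unmet, freeing the worker slot, while twelve retries ten minutes apart stand in for a two-hour sensing timeout.

    import os
    from datetime import timedelta

    from airflow.exceptions import AirflowException
    from airflow.sensors.base import BaseSensorOperator

    class FailFastFileSensor(BaseSensorOperator):
        """Hypothetical sensor that fails fast instead of busy-waiting."""

        def __init__(self, filepath, **kwargs):
            super().__init__(**kwargs)
            self.filepath = filepath

        def poke(self, context):
            if not os.path.exists(self.filepath):
                # Fail immediately: the retry machinery provides the wait
                raise AirflowException(f"{self.filepath} not found yet")
            return True

    wait_for_file = FailFastFileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/events.csv",
        retries=12,                         # number_of_retries * retry_delay
        retry_delay=timedelta(minutes=10),  # = the intended overall timeout
    )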

This solution works like a charm, it doesn’t require any Airflow code change.

Main drawbacks are:

  • Bugs and errors in the sensors may be masked by timeouts; this can, however, be mitigated by properly written unit tests.
  • Some overhead is added to the scheduler, so polling intervals should not be too frequent - and a separate process is spawned for each check.

Reschedule mode

Sensor’s reschedule mode is quite similar to the previous workaround.

In practice, sensors have a new "mode" attribute which may take two values: "poke", the default, providing the old synchronous behaviour, and "reschedule".
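In user code the switch is a single argument; a sketch with illustrative values:

    from airflow.sensors.filesystem import FileSensor

    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/events.csv",
        mode="reschedule",   # free the worker slot between checks
        poke_interval=600,   # re-check every 10 minutes
        timeout=7200,        # give up after two hours
    )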

When mode is set to reschedule:

  • BaseSensorOperator’s “execute” method raises an AirflowRescheduleException when the sensing condition is unmet, containing the reschedule_date
  • This exception is caught by the TaskInstance run method, which persists it in the TaskReschedule table along with the id of the associated task, and updates the task state to "UP_FOR_RESCHEDULE"
  • When the TaskInstance run method is called again and the task is in the "UP_FOR_RESCHEDULE" state, the task is run if the reschedule_date allows it

This approach improves over the above-mentioned workaround, as it allows distinguishing between actual errors and an unmet sensing condition; otherwise, it shares the same limitations: even lightweight checks remain quite resource-intensive.

Smart sensors

In parallel to the "reschedule" mode, a different approach, called Smart Sensors, was proposed in AIP-17: merged in release 2.0.0, it has already been deprecated and is planned for removal in the Airflow 2.4.0 release (Smart Sensors are not in the main branch anymore).

All smart sensor poke-contexts are serialised in the DB and picked up by a separate process, running in special built-in smart sensor DAGs.

I won’t add any additional details on them, as they’ve been replaced by Deferrable Operators.

Smart Sensors were a sensible solution; however, despite the considerable changes to the Airflow code, they have two main pitfalls:

  • No High Availability support
  • Sensors' suspension is a subset of a more generic problem, the suspension of tasks - and this solution can't easily be extended to it.

For reference, please refer to AIP-17 here and here.

Deferrable Operators

Deferrable Operators, introduced in AIP-40, are a more generic solution: they are a superset of Smart Sensors, supporting broader task suspension, and designed from the start to be highly available. No surprise, then, that they have replaced Smart Sensors.

Albeit quite elegant, this solution is slightly more complex. To fully understand it, let's start from a use case and work through the details.

A typical Airflow use case is orchestrating jobs running on external systems (for example, a Spark job running on YARN/EMR/…). More and more frequently, those systems offer an asynchronous API returning a job id and a way to poll its status.

Without Deferrable Operators, a common way to implement this is a custom operator that triggers the job in its execute method, gets the job id, and polls for it until it finishes, in a busy-wait loop. One may be tempted to use two separate operators, one for the "trigger" call and one for the "poll" calls; however, this would break Airflow's retry mechanism.

Deferrable Operators solve this problem by giving tasks the ability to suspend themselves. If the polling condition is unmet, task execution may be suspended and resumed after a configurable delay.

Suspension of tasks is achieved by raising a TaskDeferred exception in a deferrable operator; a handy "defer" method is added to the BaseOperator to do it. This exception contains the following information:

  • The function to resume, along with the needed arguments.
  • A Trigger object, containing the details on when to trigger the next run.

The function arguments are a simple way to keep the task state, for example the job_id of the triggered spark job to poll.

The most useful trigger objects are generally time-based, and the most common ones are already provided by Airflow - DateTimeTrigger, triggering at a specific time, and TimeDeltaTrigger, triggering after a delay - so it is generally not necessary to implement them.
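A sketch of a deferrable sensor for the job-polling use case above; check_job_status stands in for a hypothetical external status API, while defer and TimeDeltaTrigger are the actual Airflow building blocks:

    from datetime import timedelta

    from airflow.sensors.base import BaseSensorOperator
    from airflow.triggers.temporal import TimeDeltaTrigger

    def check_job_status(job_id):
        """Hypothetical client for the external system's status API."""
        ...

    class JobDoneSensor(BaseSensorOperator):
        def __init__(self, job_id, **kwargs):
            super().__init__(**kwargs)
            self.job_id = job_id

        def execute(self, context):
            # Check once; if the job isn't done, suspend instead of waiting
            if check_job_status(self.job_id) != "SUCCESS":
                self.defer(trigger=TimeDeltaTrigger(timedelta(minutes=5)),
                           method_name="execute_complete")

        def execute_complete(self, context, event=None):
            # Resumed by the triggerer: re-check, possibly defer again
            if check_job_status(self.job_id) != "SUCCESS":
                self.defer(trigger=TimeDeltaTrigger(timedelta(minutes=5)),
                           method_name="execute_complete")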

The Trigger and Triggerer implementation leverages Python's asyncio library and the async/await syntax introduced with Python 3.5 (Airflow 2.0.0 requires Python 3.6 or higher). A trigger extends BaseTrigger and provides an async-compatible "run" method, which yields control when idle.

Time-based triggers are implemented as a loop that uses await asyncio.sleep rather than time.sleep.

This allows them to coexist with thousands of other Triggers within one process.

Note that, to limit the number of triggers, there is a one-to-many relationship between Triggers and TaskInstances; in particular, the same trigger may be shared by multiple tasks.

Let’s see how everything is orchestrated.

When a TaskDeferred exception is caught in the run method of TaskInstance, these steps are followed:

  • TaskInstance state is updated to DEFERRED.
  • The method and the arguments needed to resume the execution of the task are serialised in the TaskInstance (not in the Trigger), in the next_method and next_kwargs columns of its table. The task instance is linked to the trigger through a trigger_id attribute on the TaskInstance.
  • The Trigger is persisted in the DB, in a separate table, Trigger.

A separate Airflow component, the Triggerer - a new continuously-running process that is now part of an Airflow installation - is in charge of executing the triggers.

This process runs an async event loop which picks up all the triggers serialised in the DB, creates the not-yet-created ones, and runs the coroutines concurrently. Thousands of triggers may run at once efficiently.

A trigger does some lightweight check. For example, the DateTimeTrigger verifies that the triggering date is passed; if so, it yields a “TriggerEvent”. 

All events are handled by the Triggerer: for each TriggerEvent, all the corresponding TaskInstances to schedule are picked up, and their state is updated from DEFERRED to SCHEDULED.

The TaskInstance run method has been updated to check whether the task should resume (it checks if "next_method" is set); if so, it resumes it, otherwise it proceeds as usual.

The availability of the system is increased by allowing multiple Triggerers to run in parallel - this is implemented by adding to each Trigger the id of the Triggerer in charge of it - and by adding a heartbeat to each Triggerer, serialised in the DB. Each Triggerer will pick up only its assigned triggers.

Author: Antonio Barbuzzi, Head of Data Engineering @ Bitrock


It has been around for almost 30 years, and still shows no signs of retiring. Java was there when the web was taking its first steps, and has accompanied it throughout the decades. It has steadily changed, evolving based on the protean needs of internet users and developers, from the early applets to today’s blockchain and Web3. We can only imagine where it will be 30 years from now. 

In this four-part retrospective, written in collaboration with Danilo Ventura, Senior Software Engineer at ProActivity (our sister company, part of Fortitude Group), we attempt to trace the history of Java and the role it has played in the development of the web as we know it. The aim is to identify the reasons why Java succeeded in lieu of other languages and technologies, especially in the early, hectic, experimental days of the internet, when the programming language and the web were influencing each other to fully unlock the potential of the technology that has changed the world forever.


It all started in the mid-1990s. It was the best of times, it was the worst of times. The World Wide Web was still in its infancy and not readily accessible to the general public. Only tech-savvy enthusiasts connected their computers to the internet to share content and talk with strangers on message boards.

The birth and development of the web had been made possible by the creation of a simple protocol called HTTP (Hypertext Transfer Protocol), first introduced by Tim Berners-Lee and his team in 1991 and revised as HTTP/1.0 five years later. Since then the protocol has continuously evolved to become more efficient and secure - 2022 saw the launch of HTTP/3 - but the underlying principles are still valid and constitute the foundation of today's web applications.

HTTP works as a straightforward request–response protocol: the client submits a request to the server on the internet, which in turn provides a resource such as a document, content or a piece of information. This conceptual simplicity of HTTP has ensured its resilience throughout the years. We can see a sort of Darwinian principle at play, by which only the simple, useful and evolvable technologies pass the test of time.
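
As a simplified sketch of such an exchange (the host and path are placeholders), the client sends a request like:

GET /index.html HTTP/1.0
Host: www.example.com

and the server answers with a status line, a few headers and the resource itself:

HTTP/1.0 200 OK
Content-Type: text/html

<html>...</html>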

The documents exchanged between computers via the HTTP protocol are written in HTML, i.e. HyperText Markup Language. Since its introduction in 1991, HTML has been used to describe the structure and content of web pages. At first, these were crude text documents with some basic formatting, such as bold and italic. Later on, the evolution of HTML and the addition of accompanying technologies such as CSS enabled more formatting and content options, such as images, tables, animations, etc.

In order to be accessed by human readers, HTML web pages need to be decoded by a web browser, namely the other great technology that enabled the birth and development of the internet. Browsers were created as simple programs capable of requesting resources via the HTTP protocol, receiving HTML documents and rendering them as a readable web page.

At the time the Web was primitive, with very few people accessing it for leisure or work. It is reported that in 1995 only 44M people had access to the internet globally, with half of them being located in the United States (source: Our World in Data). Business applications were scarce, but some pioneers were experimenting with home banking and electronic commerce services. In 1995, Wells Fargo allowed customers to access their bank account from a computer, while Amazon and AuctionWeb - later known as eBay - took their first steps in the world of online shopping. 

The main limiting factors to the web’s democratization as a business tool were technological. Needs were changing, with users reclaiming an active role in their online presence. At the same time, website creators wanted easier ways to offer more speed, more flexibility, and the possibility to interact with an online document or web page. In this regard, the introduction of Java was about to give a big boost to the evolution of the online landscape.

The first public implementation of Java was released by Sun Microsystems in January 1996. It was designed by frustrated programmers who were tired of fighting with the complexity of the solutions available at the time. The aim was to create a simple, robust, object-oriented language that would not generate operating system-specific code.

That was arguably its most revolutionary feature. Before Java, programmers wrote code in their preferred language and then used an OS-specific compiler to translate the source code into object code, thus creating an OS-specific program. In order to make the same program compatible with other systems, the code had to be rewritten and recompiled with the appropriate compiler.

Java instead allowed programmers to “write once, run anywhere” - that was its motto. Developers could write code on any device and generate an intermediate language, called bytecode, that could be run on any operating system and platform equipped with a Java Virtual Machine. It was a game changer for web developers, for they no longer had to worry about the machine and OS running the program.

This flexibility guaranteed Java's success as a tool to create multi-OS desktop applications with user-friendly interfaces. Supported by the contemporaneous spread of the first mass-market OS for lay people (Windows 95), it helped codify the visual grammar of the classic computer program, still relevant today. Java also became one of the preferred standards for programs running on household appliances, such as washing machines or TV sets.

The birth of applets can be seen as a milestone in the development and popularization of Java. These were small Java applications that could be launched directly from a webpage. A specific HTML tag indicated the server location of the bytecode, which was downloaded and executed on the fly in the browser window itself.
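
For example, a page could embed an applet with a few lines of HTML such as the following (the codebase URL and class name are illustrative):

<applet codebase="http://www.example.com/applets/" code="MyApplet.class" width="300" height="200">
  Your browser does not support Java applets.
</applet>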

Applets allowed a higher degree of interactivity than HTML, and were for instance used for games and data visualization. The first web browser supporting applets was Sun Microsystems' own HotJava, released in 1997, with all major competitors following soon thereafter.

A Java Applet used for data visualization (source: Wikimedia)

Java Applets were pioneering attempts at transforming the web into an interactive space. Yet, they had security issues that contributed to their gradual demise in the early 2010s, when all major browsers started to drop support for the underlying technology. One of the last great applets was Minecraft, which was first introduced as a Java-based browser game in 2009. Java Applets were officially discontinued in 2017.

We can say that the goal of making HTML web pages more interactive has been fully achieved thanks to JavaScript, another great creation of the mid-nineties. Despite the name, it has nothing to do with Java, apart from some similarities in syntax and libraries. It was actually introduced in 1995 by Netscape as LiveScript and then rebranded JavaScript for marketing purposes. Rather than a compiled, general-purpose language like Java, it is a scripting language that runs in the browser and enhances web pages with interactive elements. JavaScript has now become dominant, being used in 2022 by 98% of all websites (source: w3techs).

At the same time, another Java technology, RMI (Remote Method Invocation), and later RMI-IIOP (RMI over the Internet Inter-ORB Protocol), enabled distributed computing based on the object-oriented paradigm in a Java Virtual Machine. In the early 2000s, it was possible to develop web applications with Applets that, thanks to RMI services, could retrieve data from a server, all based on JVMs.

The next step in the evolutionary path were Java servlets, which paved the way for the typical Web 1.0 applications. Servlets allowed the creation of server-side apps interacting with the HTTP protocol: the browser could request a resource from the server, which in turn provided it as an HTML page. It was finally possible to write server-side programs that could interact with web browsers, a real game changer for the time. As servlets' popularity increased, that of Applets started to wane, for it was easier to adopt pure HTML as the user interface and build pages server-side.

In the following entry of this retrospective we will be focusing on the early 2000s developments of Java, which culminated in the groundbreaking Java 5.0 version. Follow us on LinkedIn and don’t miss the upcoming episode!

Thanks to Danilo Ventura for the valuable contribution to this article series.


Today, mainframes are still widely used in data-centric industries such as Banking, Finance and Insurance. 92 of the world's top 100 banks rely on these legacy technologies, and it is believed that they account for 90% of all global credit card transactions (source: Skillsoft).

This is suboptimal, since relying on mainframes generates high operational costs, typically measured in MIPS (million instructions per second). A large institution can spend more than $16 million per year, based on the estimated cost of a 15,200 MIPS mainframe (source: Amazon Web Services).

In addition, mainframes come with technical complexities, like the reliance on the 60-year-old COBOL programming language. For organizations, this means not only reduced data accessibility and infrastructure scalability, but also the problem of finding skilled COBOL programmers at a reasonable cost - more info here.

Moreover, consumers are now used to sophisticated on-demand digital services - we could call it the “Netflix effect”, by which everything must be available immediately and everywhere. Banking services, such as trading, home banking, and financial reports, need to keep pace and offer reliability and high performance. In order to do that, large volumes of data must be quickly accessed and processed by web and mobile applications: mainframes may not be the answer.

Mainframe Offloading to the rescue

Mainframe Offloading can solve the conundrum. It entails replicating the mainframe data to a parallel database, possibly open source, that can be accessed in a more agile way while saving expensive MIPS. Acting as a sort of “Digital Twin” of the mainframe, the replicated data store can be used for data analysis, applications, cloud services and more.

This form of database replication provides significant advantages in both flexibility and cost reduction. Whenever an application or a service needs to read customers' data, it can access the parallel database without having to pay for expensive mainframe MIPS. Moreover, offloading alone paves the way for a progressive migration to the cloud, e.g. by enabling bidirectional replication of information between the open source cloud database and the data center.

Offloading data from the mainframe requires middleware tools for migration and integration. Apache Kafka can be leveraged as a reliable solution for event streaming and data storage, thanks to its distributed and replicated log capabilities. It can integrate different data sources into a scalable architecture with loosely coupled components. 

Alongside the event streaming platform, CDC (Change Data Capture) tools are also to be considered, to push data modifications from the mainframe into the streaming platform. CDC is a software process that automatically identifies and tracks updates in a database, making it possible to overcome the limitations of batch data processing in favour of near-real-time transfer. While IBM and Oracle offer proprietary CDC tools, such as InfoSphere Data Replication and Oracle GoldenGate, third-party and open-source solutions are also available, like Qlik Data Integration (formerly known as Attunity) and Debezium.
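
To illustrate the pattern, here is a sketch of a small Python consumer that reads Debezium-style change events from a Kafka topic and applies them to the offloaded store; the topic name, connection settings and replica-side writes are hypothetical placeholders:

import json

from confluent_kafka import Consumer


def upsert_into_replica(row):
    # Hypothetical: insert or overwrite the row in the parallel database
    pass


def delete_from_replica(row):
    # Hypothetical: remove the row from the parallel database
    pass


consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # hypothetical brokers
    "group.id": "mainframe-offload",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["mainframe.db2.customers"])  # hypothetical CDC topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # Debezium envelope (schema wrapper omitted): "before"/"after" hold
        # the row state; "op" is c(reate), u(pdate), d(elete) or r(ead/snapshot)
        event = json.loads(msg.value())
        if event["op"] in ("c", "u", "r"):
            upsert_into_replica(event["after"])
        elif event["op"] == "d":
            delete_from_replica(event["before"])
finally:
    consumer.close()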

From Offloading to Replacement

Mainframe Offloading can also be seen as the starting point of a proper mainframe replacement, with both applications and mission-critical core banking systems running in the cloud. This would mean that the expensive monolithic architecture gives way to modern, future-proof, cloud-native solutions.

Yet, replacing a mainframe is neither an easy nor a quick task. In his blog article “Mainframe Integration, Offloading and Replacement with Apache Kafka”, Kai Waehner hypothesizes a gradual 5-year plan. First, Kafka is used for decoupling the mainframe from the already-existing applications. Then, new cloud-based applications and microservices are built and integrated into the infrastructure. Finally, some or even all mainframe applications and mission-critical functionalities are replaced with modern technology.

It must be said that it is often not possible to switch off mainframes altogether. For larger institutions, such as major banks, the costs and inconveniences of a full migration may be just too high. Realistically speaking, the most effective scenario would be a hybrid infrastructure in which certain core banking functionalities remain tied to the mainframe, and others are migrated to a multi-cloud infrastructure.

How Bitrock can help

Given the complexity of the operation, it is fundamental to work with a specialized partner with thorough expertise in offloading and legacy migration. At Bitrock, we have worked alongside major organizations to help them modernize their infrastructure, save costs and support their cloud-native transition. By way of example, we carried out a mainframe offloading project for a leading consumer credit company, transferring data from a legacy DB2 to a newer Elastic database. Thanks to the Confluent platform and a CDC system, data is now intercepted and pushed in real time from the core system to the front-end database, enabling advanced use cases.

If you want to know more about this success story or how we can help you with your journey from legacy to cloud, please do not hesitate to contact us!


The Covid-19 pandemic has changed healthcare forever. Among other things, it demonstrated the importance of data in the health sector. The global response to the pandemic showed that offering high-quality patient care depends on accessing and sharing large amounts of sensitive information in a secure manner - let us think of the logistic complexities of carrying out clinical trials, mass testing, and vaccinations under strict time constraints.

Today, accessing reliable data allows medical practitioners and institutions to pursue a patient-centered approach, i.e. to ensure adequate, personalized services at any time. We are witnessing a paradigm shift that impacts both patient care quality and financial sustainability. Regardless of each nation's welfare system, residual or institutional, individualized healthcare is indeed one of the most effective ways to reduce costs and redirect money where it matters most - R&D, hiring, technology, facilities.

Blockchain technology has the potential to play a critical role in supporting the data-driven evolution of healthcare. Thanks to its immutability and decentralization, the distributed ledger can ensure secure information exchange between healthcare providers and stakeholders. Especially in highly fragmented healthcare systems, it may offer interoperability and disintermediation of trust in the collection and management of data. This in turn enables greater agency for patients, who are empowered to access personal information in a simple, transparent manner.

Blockchain and its applications - Smart Contracts, NFTs - can thus have a disruptive impact on some critical areas of contemporary healthcare, which the paper “Applications of Blockchain Within Healthcare”, in the peer-reviewed journal Blockchain in Healthcare Today, identifies as:

  • Drug Tracking - necessary to prevent diversion, counterfeiting and overprescription throughout the supply chain
  • Healthcare Data Interchange - the integration of health data among different stakeholders, such as hospitals, insurers and national health systems
  • Nationwide Interoperability - i.e. ensuring access to health records across different, incompatible service providers
  • Medical Device Tracking - to increase the efficiency of inventories and save money spent repurchasing unnecessary devices

Considering the centrality of these areas of intervention, it is easy to see why this is a booming business. The global market size of blockchain applications in healthcare was valued at $1.5B in 2020, and is estimated to reach $7.3B by 2028 (source: Verified Market Research).

Let us take a look at 4 exciting use cases for blockchain-powered patient care.

Pharmaceutical Supply Chain Management

Counterfeit drugs are a significant problem, especially in developing countries. While figures are somewhat difficult to come by, a 2019 OECD/EUIPO report estimated that the global market value of counterfeit pharmaceuticals amounted to $4.4B in 2016 - about 0.84% of global imports in medicines.

In this regard, the implementation of blockchain-enabled services to track pharmaceuticals can offer transparency and security through the entire chain of custody. The immutability of the distributed ledger guarantees the authenticity of medical products from manufacturers to the pharmacist and patient.

In addition to the increase in traceability, blockchain-powered supply chain solutions can also increase efficiency and reduce costs thanks to AI/ML. Advanced streaming event analysis can detect anomalies in real-time and ensure the timely delivery of pharmaceuticals. 

Electronic Prescriptions

One of the most crucial stages of the pharmaceutical chain of custody is the prescription to the patient. Here, errors and communication mishaps can have devastating effects on a treatment plan. Let us also consider that in the US, 62% of prescriptions for controlled drugs are handwritten (source: Statista), which increases the risk of mistakes and prevents any automated safety feature. 

Blockchain-enabled electronic prescription systems can support healthcare providers in delivering a tailored service that takes into account patients' specific needs and clinical history. By integrating data from health records into a shared, secure database, blockchain can help prescribers check for allergies, interactions, and overprescriptions - also to avoid drug abuse or diversion.

The paper “Use of Blockchain Technology for Electronic Prescriptions” (2021), published in Blockchain in Healthcare Today, recounts an e-prescription pilot programme carried out in Tennessee clinics between 2021 and 2022. Switching to a blockchain-based electronic system automated instantaneous patient safety checks (interactions, allergies), which resulted in practitioners changing the prescription 28% of the time. It also allowed them to save significant time - a mean of 1 minute 48 seconds per written prescription.

Electronic Health Record

No effective real-time patient safety check can be carried out without a reliable Electronic Health Record (EHR) system. Allowing patients and practitioners to securely access health records is fundamental both for transparency and for clinical reasons. According to a Johns Hopkins study, medical errors, often resulting from uncoordinated or conflicting care, are currently the third leading cause of death in the US.

Yet, that is one of the very countries in which the theoretical effectiveness of EHR systems is hampered by the fragmentation and lack of interoperability of service providers. It is estimated that there currently exist at least 500 vendors of EHR products - other sources claim more than a thousand! - with the average hospital running 16 platforms simultaneously.

The blockchain technology can thus be used as a way to connect different data sources and create a secure, decentralized ledger of patient records. In the words of a 2021 study carried out by the US Department of Health and Human Services, "Blockchain-based medical record systems can be linked into existing medical record software and act as an overarching, single view of a patient’s record".

Device Tracking & IoMT

The transparency and security of blockchain can benefit the management of medical devices throughout the chain of custody, from the manufacturer to the hospital and patient. Thorough tracking of medical assets helps identification and verification against counterfeits and unapproved products, as with drugs. It also offers significant financial advantages, saving hospitals money otherwise spent repurchasing devices.

Blockchain-based solutions for tracking medical devices can integrate with the widespread RFID (radio-frequency identification) technology. These cost-effective tags, both active and passive, are currently employed to digitalize the inventory of medical items and drive the effectiveness of resource management. RFID-generated data can thus be transferred to the immutable ledger, to store the history, lifecycle, and salient features of tracked devices in a secure and compliant way.

Blockchain can also provide data integrity for the messages exchanged via sensors and devices employed by patients at home. Remote monitoring is becoming more and more important for telehealth, also thanks to the increased availability of high-speed wireless connectivity (5G) - and this raises concerns about cybersecurity. The blockchain technology can limit unauthorized data access and manipulation, guaranteeing at the same time high transparency and agency for patients. 


These are just a few use cases for blockchain in the healthcare sector. Other potential applications leverage smart contracts for service payment or trustless communication with insurance companies, to name a few. And more use cases will certainly appear in the near future, as the technology continues to develop and spread. What is certain is that blockchain has the potential to transform healthcare for good, and help improve all stages of the patient journey.

Bitrock has global expertise and proven experience in developing blockchain-based solutions and applications. If you want to know more about our consultancy services, and learn how we can jumpstart your blockchain project, book a call with us and we'll be happy to talk!

Author: Daniele Croci, Digital Marketing Specialist @ Bitrock


It all started with a horse. In 2006, Bethesda popularized the notion of microtransactions in gaming with the launch of the (in)famous horse armor DLC - i.e. downloadable content - for The Elder Scrolls 4: Oblivion. For the somewhat affordable cost of $2.50, PC and Xbox 360 players could unlock a cosmetic add-on for an in-game asset, namely the horse ridden by the player's diegetic surrogate. It did not provide any significant gameplay advantage, just a shiny metal armor to show off to oneself in a single-player game.

It was not the first time players had had the chance to buy in-game items for real-world money. Microtransactions had been around since the democratization of internet F2P (free-to-play) gaming, featuring for instance in Nexon's MapleStory (2003), or in proto-metaverse experiences such as Habbo (2001) and Second Life (2003). The last two, in particular, pioneered the offering of purchasable cosmetic items for players who wanted to differentiate themselves in crowded online multiplayer spaces.

And let us not forget Expansion Packs, full-fledged additional experiences that could be bought and added to videogames for more plot, quests, items and hours of entertainment, and that first came on physical media and only later via digital download. Some notable examples include Warcraft 2: Beyond the Dark Portal, a 1996 expansion to the wildly popular Warcraft 2: Tides of Darkness (1995), and The Sims: Livin' Large (2000), released in the same year as the original life simulation game. 

Even though we cannot underestimate Expansion Packs' role in transitioning the gaming industry from a Game-as-a-Product to a Game-as-a-Service (GaaS) business model, today they have somewhat waned in favor of parcelized microtransactions and DLCs. These forms of continuous content have now become dominant, with add-ons - both functional and cosmetic - coming at a lower, more affordable price for players and providing a consistent revenue stream for publishers. Even Bethesda's scandalous Horse Armor proved successful at the end of the day.

The financial advantages of the GaaS model are even more evident with F2P games, where the ongoing sale of digital goods constitutes the sole source of revenue for the publisher. These addictive mobile games have often turned into viral phenomena that generate far more money than many conventional $70 AAA products - we are talking about games like Fortnite, which generated $9 billion in revenue in 2018 and 2019, League of Legends, $1.75 billion in 2020, or newcomer Genshin Impact, which is estimated to have totalled $3.5 billion in its first year. Seeing these figures, it is easy to understand how the global gaming industry generated a whopping $54 billion in 2020 with in-game purchases alone - and numbers are only projected to increase (source: Statista).


NFTs to overcome the limitations of DLCs

However, microtransactions and in-game purchases as we know them have a major limitation. When a horse armor is bought in a game, it stays in the game. It is not really an asset owned by the player, but rather a service that can be accessed only in the context of the title that originated it. In the case of online-only multiplayer games, as many F2P titles are, the purchase practically ceases to exist when the game servers are shut down. Furthermore, digital assets bought or earned via gameplay cannot normally be exchanged on secondary markets for real-world money - while there is some under-the-table reselling of items in some MMORPGs, like World of Warcraft, it is a risky practice that tends to violate End-User License Agreements and leads to inglorious bans.

This is where blockchain and NFTs come into play. Non-Fungible Tokens allow players to acquire true ownership of the assets they have bought or earned in game, opening up collecting, exchanging and reselling. In a word: stronger player engagement, fuelled by a Copernican revolution in the flow of value. Companies are no longer the sole beneficiaries of the gaming economy, as players are empowered to (re)claim the value of their money or time investments.

All this is possible thanks to tokenization, enabled by blockchain technology. The term refers to the process of converting an asset, digital or physical, into a virtual token that exists and circulates on the blockchain. In this sense, tokens are representations of assets (money, real estate, art - you name it) that store information in a transparent, efficient, and secure way via the blockchain's immutable ledger. This allows all users to trace not only a token's provenance, but also the history of transactions carried out on it.

NFTs are a special kind of token characterized by being - well - non-fungible, meaning that each one is unique and cannot be interchanged with another. A Bitcoin is the same as every other Bitcoin, just like a dollar is the same as every other dollar. An NFT, by contrast, has unique and permanent metadata that identify it unequivocally. Like a sort of authenticity certificate, this record details the item's nature and ownership history. Another feature that differentiates NFTs from Bitcoin is indivisibility: it is possible to own a fraction of a Bitcoin, while it is not possible to own a quarter of a tokenized work of art or gaming item.
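
As a toy illustration of these two properties - our own sketch, not any real token standard - one can think of an NFT ledger as a map from unique token ids to immutable metadata plus an append-only ownership history:

from dataclasses import dataclass, field


@dataclass
class NFT:
    token_id: int      # unique: no two tokens are interchangeable
    metadata: dict     # permanent record describing the asset
    owners: list = field(default_factory=list)  # full provenance


class ToyLedger:
    """A toy, in-memory stand-in for an on-chain ledger."""

    def __init__(self):
        self.tokens = {}

    def mint(self, token_id, metadata, owner):
        assert token_id not in self.tokens, "token ids must be unique"
        self.tokens[token_id] = NFT(token_id, metadata, [owner])

    def transfer(self, token_id, new_owner):
        # Ownership changes are appended, never rewritten,
        # so provenance can always be traced
        self.tokens[token_id].owners.append(new_owner)


ledger = ToyLedger()
ledger.mint(1, {"item": "horse armor", "game": "SomeGame"}, "alice")
ledger.transfer(1, "bob")
print(ledger.tokens[1].owners)  # ['alice', 'bob']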

All these features suggest why tokenized game assets can offer significant benefits to players. Unlocking real ownership of unique items earned and bought redefines a user's relationship with the game, creating a greater sense of engagement that can even exceed the boundaries of the game itself. Indeed, the interoperable nature of NFTs means that gaming items can also be virtually transferred to, and reused in, other connected games, provided that the game engines and frameworks support such functionality. In addition, blockchain-enabled games offer the chance to monetize item ownership in a legitimate way via reselling. We are witnessing the rise of the play-to-earn model, where gaming leads to the acquisition of NFTs that can later be sold for legitimate income.

And the benefits are not limited to players. Secondary trading of gaming NFTs may also generate immediate revenues for gaming companies via royalties inscribed within the tokens themselves. This is one of the most exciting features of NFTs in general, with huge applications for the art world. In a nutshell, it is technically possible to mint a token in a way that automatically guarantees the payment of royalties to the original creator whenever the token is traded between third parties. The system still needs perfecting, as there are currently some limitations due to interoperability between different platforms - more info here - but it is nonetheless a great way to potentially ensure a fair distribution of profits between owners and creators.


NFT Games & dApps to Know

To understand the impact of NFTs on the gaming industry, we need to consider at least two different applications: on the one hand, play-to-earn games that are structured upon blockchain technology and NFTs, and which are often Decentralized Apps, or dApps; on the other hand, conventional games that variously adopt and integrate NFTs as part of the videoludic experience, without depending on them.

The most popular dApp game is arguably Axie Infinity (2018), developed by Vietnamese studio Sky Mavis. It is a Pokémon-inspired online RPG where players can breed and fight their NFT creatures, called Axies. Available for mobile and PC, the game was initially based on Ethereum; Sky Mavis later launched its own sidechain, Ronin, optimized for NFT gaming thanks to lower gas fees and transaction times (more info here). Axie Infinity can be said to fully leverage the possibilities of the blockchain by also integrating fungible tokens called AXS and SLP, which serve as in-game currency and can be traded like any other cryptocurrency on the market.

Despite the steep entry price - 3 competitive starting Axies can cost the player $300 or more - Axie Infinity has quickly become a huge phenomenon. It is played each month by 2.8M users, with 10M total players estimated in December 2021. Even more staggering is the overall value of Axie NFT transactions carried out on Ethereum and Ronin, which in March 2022 reached $4.17 billion!

Due to the features of NFTs, collecting and trading are central in many - if not all - dApp games. CryptoKitties (2017) is another example of a game that focuses on breeding and exchanging NFT pets - there is no other discernible gameplay feature. It is often mentioned as the gateway to blockchain gaming for many players. Immutable's Gods Unchained (2021) is a trading card game à la Magic: The Gathering that offers real ownership of the NFT virtual cards. It leverages Immutable X, the company's own Layer 2 scaling solution for NFTs, which allows reducing gas fees. Gods Unchained also features its own homonymous cryptocurrency, which enables players to buy card packs and vote on governance proposals that influence the game's own development. It is a growing phenomenon: the company reports 80k weekly players as of January 2022, with $25 million in Gods Unchained assets traded on Immutable X.

Compared to the huge success of dApps, the relationship of traditional gaming companies and players with NFTs has been less straightforward. Quartz, Ubisoft's proprietary platform for NFTs - or Digits, as they rebranded them - has been met with mixed feelings since its launch in December 2021. The platform, now in beta, allows players to buy or earn cosmetic items for the PC version of Tom Clancy's Ghost Recon Breakpoint, which in turn can be resold on third-party marketplaces. Quartz is based on Tezos, a proof-of-stake blockchain that, according to the publisher, “needs significantly less energy to operate. As an example, one transaction on Tezos consumes as much energy as 30 seconds of video streaming while a transaction on Bitcoin consumes the equivalent of one year of video streaming”.

Quartz's lukewarm reception can be attributed to several factors. First of all, the project was inaugurated with only one game, the PC version of a poorly received 2019 shooter - 58 on Metacritic, with overwhelmingly negative user reviews. Secondly, the acquisition of NFTs was limited to Ghost Recon Breakpoint players who had reached a certain level in the game, effectively leaving out collectors and enthusiasts. Third, the published sets of cosmetic items all look the same and are merely differentiated by a serial number. Despite all this, all Breakpoint NFTs appear to be sold out as of late March 2022.

Another gaming company that has struggled with NFT implementation is GSC Game World, the Kyiv-based developer behind the renowned S.T.A.L.K.E.R. series. On 15 December 2021, it announced that the upcoming S.T.A.L.K.E.R. 2: Heart of Chernobyl would include tokens to be purchased on DMarket, with the most prized one allowing its owner to actually become an NPC in the game. The announcement garnered negative feedback from the community, which prompted GSC to backpedal on the very next day: “we've made a decision to cancel anything NFT-related in S.T.A.L.K.E.R. 2”.

Konami had greater success with the launch of an NFT collection dedicated to the Castlevania series. The tokens were not items to be used in games, but 14 unique pictures and videos commemorating the 35-year-old franchise. Players seem to have appreciated the nostalgic set, which was auctioned off for ~$160k in total. This achievement prompted Konami to plan more NFTs for the future, as mentioned in its Outlook for the Fiscal Year Ending March 31, 2022.

Phenomena like Axie Infinity and Gods Unchained, and Ubisoft's, GSC's and Konami's varied experiences with the blockchain, demonstrate one thing: the gaming world is interested in NFTs when these enhance the experience and provide players with unique, valuable prizes that reflect their passion and dedication. Gaming is a sophisticated medium, and gamers are sophisticated audiences. We have come a long way since the horse armor, and slapping a serial number on mass-produced virtual items may not be enough. Today, integrating NFTs within a videoludic experience must be underpinned by a well-designed strategy that takes into account the specific features of the medium.

Within this strategy, technological choices take on a primary importance. The variety of standards - paired with the lack of well-established business models - may hinder gaming companies’ efforts at creating scalable, flexible and environmentally sustainable blockchain solutions. This is why working with a reliable partner is increasingly important.


As a high-end tech consulting company, Bitrock has global expertise in supporting gaming and non-gaming companies in NFT, Smart Contract, cryptocurrency and blockchain-enabled projects. Thanks to our integrated offering, we can accompany you from project definition to the deployment of the last line of code and the optimization of the user interface.

To know more about how we can help you, contact us now!

Author: Daniele Croci, Digital Marketing Specialist @ Bitrock
