Problems with your data? You need DataOps*

Articles, 26 January 2021 |

[*DataOps = Proactive and preventative Data Governance]

The DataOps manifesto brings together a series of practices that were published in 2017 to try to solve problems related to the inefficiency of data generation and processing processes, as well as the quality of data in relation to inconsistency errors and inconsistencies between data. Despite what one might think at first, DataOps is not just DevOps for data, although the idea is to apply this concept, which is very widespread and implemented in the field of software development and operation, to the field of data.What the DataOps manifesto says, complemented by the initiative that came out in 2018 called ‘The DataOps Philosophy’, is that DataOps is a combination of agile methodologies, DevOps concepts and everything known as Lean Manufacturing. In this way, in addition to the concepts of DevOps, it also incorporates management concepts more related to agile methodologies and other concepts closer to the world of industry and manufacturing and production processes.

Thus, the purpose of DataOps is to manage in an agile way the entire DevOps part (idea, development, production of the software) together with the value chain and the life cycle of the data. In this context, DataOps is a series of techniques, methodologies, tools and processes that together help the organization, or a specific project, to extract greater value from the data thanks to the automation of the processes that occur in the data lifecycle. All of this with the aim of achieving greater profitability from data analytics projects and initiatives.

From a functional architecture point of view, a DataOps platform aims to:

  • Integrate data from different sources, automating their ingestion, loading and availability.
  • Control the storage of data with its different versions over time, historifying the information and data transformation processes.
  • To have a centralised management of metadata that serves not only to know the available information but also to activate and configure the platform’s processes.
  • To have under control all the management related to the request, authorisation and access permissions to data for consumption and exploitation.
  • And, finally, to apply analytical, reporting and dashboarding mechanisms and techniques to monitor and track what is happening throughout the platform.

All this, operated by a DataOps team to facilitate the sharing of data and the development of projects and initiatives on the platform for producers and consumers of information.Additionally, on the one hand, it is necessary to seek the fit with the automation of software deployments to achieve continuous integration, while on the other hand, in relation to data flow, data pipelines must be orchestrated, tested, automated and monitored within the platform itself: as always, moving data from areas where the information is raw, until it is treated, refined and enriched in subsequent layers to finally be able to perform analytics on the information in the layers of exploitation.

In this context, Data Governance is often positioned at the end of the value chain and the data lifecycle, which is totally wrong and we will explain why.

Throughout the entire data value chain, a multitude of roles coexist that will have to collaborate with each other, from developers to business users, including architects, operations teams, systems technicians, etc. Therefore, it is necessary to implement management methodologies (in this case agile) as a basic principle and with continuous change management, fundamental parts of Data Governance. Only in this way can we finally achieve the automation of technical processes that provide us with greater efficiency and security, which is one of the main objectives pursued by DataOps.

In addition, to achieve DataOps objectives, it is necessary to have complete control over the data lifecycle, to separate the information into logical layers and to know exactly how it flows through the data ecosystem as a whole (lineage and traceability), as well as to consider how the data lifecycle fits in with the software lifecycle. Finally, the importance of achieving full integration with the technical architecture should not be forgotten either, as without integrating the different technical parts of the architecture, it will not be possible to automate the corresponding data processes. And, after all, these are all parts where Data Governance has a lot to contribute.

Finally, to achieve a complete integration and automation of processes, three essential management layers must be incorporated:

  • Demand management in data projects: changes to existing information, new use cases and exploitation of information, capture and ingestion of new data, …
  • Management of metadata and versions of data structures, since metadata will be the central piece that will allow the automation of many functions around this DataOps concept. 
  • Management of data access permissions in a centralised way, so that in the end it is possible to know who consumes what information and for what purpose.

Therefore, after having unpacked all that is necessary to achieve the implementation of a DataOps model, we can understand it as the natural evolution of Data Governance. A Data Governance by design (governance-first or governance-by-design), proactive and preventive, which is located at the beginning of the data value chain and which accompanies the different stakeholders throughout the data life cycle, being the central axis of the processes, and providing at all times a global vision that will allow us to achieve the desired effectiveness and efficiency.

Despite what one might think at first, the proactive and preventive Data Governance located at the beginning of the data value chain tries to reduce bureaucracy as much as possible with clear and concise policies and procedures, providing flexibility and agility to the processes (especially management) and thus maximising the synergies between the different projects and use cases, promoting the reuse of both information and existing processes in the platform. In short, thanks to this vision of Data Governance, we are enabling a single entry point to the data ecosystem for all data stakeholders by applying a collaborative approach to:

  • Integration with demand management.
  • Integration between technologies and pieces of the ecosystem.
  • Automation of common technical processes.
  • Incremental and iterative approach by use cases.
  • Democratisation and data-driven self-service.
  • Monitoring for continuous improvement.

All this is only possible by building a single, centralised metadata repository as the centrepiece of the data ecosystem and abstracting data governance and management from the underlying technologies and platforms, as Data Governance has a very different focus than data capture, storage, treatment, processing and exploitation technologies, which are much more concerned with process performance than with proper data management.

Finally, complementing this repository with a common and unique language based on the knowledge of the business and the organization (known as Business Glossary) will in turn allow us to build a semantic map of information assets (incorporating the corresponding taxonomies) that will enable us to bring the data ever closer to the business users and continue to build bridges with the technical teams.

If you want to know some practical cases of DataOps you can see this presentation and if you want to know how you can implement a DataOps model thanks to Anjana Data, write to us and we will be happy to help you 🙂


About the Author: Angela Miñana Francés