AI (Artificial Intelligence) for better data management

While artificial intelligence (AI) continues to grow in prominence, priority and capabilities, and is moving beyond its own hype, more and more organizations struggle with the question of ‘how and where to start’. Systems are outdated, data is fragmented and/or irretrievable, the scope of critical data is increasing, governance is lacking, and one important aspect keeps coming back: the quality of (structured) data is not good enough, and the data is neither labeled nor classified correctly. The need for solutions that improve data quality is rising. So, how do you deal with this process? And which role can AI itself play in data management?

In a series of articles, we take a deep dive into the interplay between AI and data management: the role data management and governance can play for AI, but also the other way around, namely how AI can contribute to higher-quality and more manageable data. Looking at data management, we can discern a split in how you may position AI. There is data management for AI and there is AI for data management. The former involves data management practices that enable your AI solutions, while the latter is about leveraging AI to improve your data management practices. This series of articles will focus on the latter. Many data management practices would benefit from AI solutions, such as metadata management, data lineage and data classification, and they will be discussed in later articles. In this article we will dive into practical solutions regarding data quality (monitoring) for structured data.

Data (management) in the era of AI

Data is the fuel of AI. It is the raw material that powers the engines of intelligence, learning, and insight. Data is what makes AI systems capable of amazing features, such as translating languages, diagnosing diseases, or detecting fraud. Data is also what fuels AI innovation and competitiveness. The more data an organization has, the more it can train and improve its AI models, and the more value it can extract from them. Data is not only a resource but also a strategic asset in the era of AI.

As data becomes more valuable because of AI, the way we manage data is changing dramatically. Traditional methods of handling data are being transformed by AI's ability to process and analyze enormous amounts of data. This change isn't just about handling more data, but also about doing it much better. AI brings a new level of precision to data management that wasn't possible before. The way AI can profile, clean, and improve data quality makes data management a proactive and important part of what organizations do. Instead of relying heavily on manual processes, AI can now handle much of the work with great accuracy, making sure that the data is not only plentiful but also reliable, accurate, and highly valuable. That’s why we see that many data management tools, of which Collibra and Informatica are some of the most used by enterprise organizations, are heavily investing in incorporating AI in their solutions to assist you in your day-to-day activities.

Value of data quality

Now that data is more important than ever and organizations are finally able to turn their massively collected data into value, the quality of data becomes crucial. Where we used to say “Garbage in, garbage out”, the accessibility that AI brings changes the saying into “Garbage in, garbage everywhere ...”

On the positive side, once organizations internalize a high-performing culture around data, they can turn the risk of working with data into an asset that differentiates them from other organizations. To cultivate a high-performing culture around data, organizations should start by embedding data-driven decision-making at every level. This involves providing employees with the right tools and training to handle data effectively, fostering an environment of continual learning and improvement. Leaders should prioritize data quality and accountability, setting clear expectations and encouraging collaboration across departments. By leveraging AI for data management, organizations can ensure data integrity and empower employees to focus on strategic initiatives rather than manual processes.

Let’s dive a bit deeper into tangible steps for data quality in which AI can help you. To reach a higher level of data quality and efficient monitoring, we distinguish three crucial steps that repeat in a recurring cycle:

Data profiling
The first step in enhancing data quality is data profiling, in which we analyze and evaluate our datasets. Traditional methods rely on reactive calculations of metrics such as the completeness, consistency, and uniqueness of columns. Now imagine a world where we leverage AI to swiftly detect patterns both within and across datasets, paving the way for proactive improvements. It's like embracing a data-driven approach to data quality enhancement, and believe it or not, AI excels at this.
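To make the traditional profiling metrics concrete, here is a minimal sketch in Python. It only covers completeness and uniqueness per column; the sample rows, the column names and the `profile_column` helper are hypothetical and not taken from any specific tool.

```python
from collections import Counter

def profile_column(values):
    """Return simple profiling metrics for one column (a list of raw values)."""
    total = len(values)
    # Assumption: None and empty or whitespace-only strings count as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        # Share of rows that have a value at all.
        "completeness": len(present) / total if total else 0.0,
        # Share of non-missing values that occur exactly once.
        "uniqueness": singletons / len(present) if present else 0.0,
        # Number of different non-missing values.
        "distinct": len(counts),
    }

# Hypothetical sample data, for illustration only.
rows = [
    {"client_id": "C001", "country": "NL"},
    {"client_id": "C002", "country": "NL"},
    {"client_id": "C002", "country": ""},
    {"client_id": None, "country": "DE"},
]

profile = {col: profile_column([r[col] for r in rows]) for col in rows[0]}
for col, metrics in profile.items():
    print(col, metrics)
```

Metrics like these describe the data after the fact. The pattern detection discussed next goes one step further by proposing what the data should look like.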

To give an example, an algorithm can easily learn that a date has to follow a standardized pattern (DD-MM-YYYY). You may be unimpressed: “Any data steward could have come up with this as well.” True, but AI can also learn that it is logical for a shipment date to fall after an order date, or that there is a pattern across three, four or even more columns within your data that data stewards would have taken a lot longer to find. AI does this by scoring patterns within your dataset: the higher the score, the more likely it is that the pattern genuinely holds. Let AI be the hyperintelligent sidekick of your data stewards, enabling them to do their work much faster and much more consistently across the organization. Identifying these types of patterns is crucial for good data quality, because each pattern can be turned into a data quality rule against which you can monitor and steer your data quality.
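The scoring idea can be sketched in a few lines of Python. This is a simplification under stated assumptions: real AI tooling would discover candidate patterns by itself, whereas here the two candidate rules are written by hand and only the scoring step is shown. The score is simply the fraction of rows for which a rule holds, and all names and sample data are hypothetical.

```python
from datetime import datetime

def parse_date(text):
    """Parse a DD-MM-YYYY string; return None if it does not match."""
    try:
        # Note: strptime also accepts days and months without a leading zero.
        return datetime.strptime(text, "%d-%m-%Y")
    except (TypeError, ValueError):
        return None

def score_rule(rows, rule):
    """Fraction of rows for which a candidate rule holds (0.0 to 1.0)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if rule(r)) / len(rows)

def shipment_not_before_order(row):
    """Cross-column rule. Assumption: shipping on the order date is allowed."""
    order = parse_date(row["order_date"])
    shipment = parse_date(row["shipment_date"])
    return order is not None and shipment is not None and shipment >= order

# Hypothetical sample data, for illustration only.
orders = [
    {"order_date": "01-03-2024", "shipment_date": "04-03-2024"},
    {"order_date": "02-03-2024", "shipment_date": "02-03-2024"},
    {"order_date": "05-03-2024", "shipment_date": "03-03-2024"},  # shipped before ordered
    {"order_date": "2024-03-06", "shipment_date": "08-03-2024"},  # wrong date format
]

candidate_rules = {
    "order_date matches DD-MM-YYYY": lambda r: parse_date(r["order_date"]) is not None,
    "shipment_date is not before order_date": shipment_not_before_order,
}

scores = {name: score_rule(orders, rule) for name, rule in candidate_rules.items()}
# Highest-scoring patterns first, ready for review by a data steward.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{score:.2f}  {name}")
```

The rows that lower a score are exactly the rows a data steward would want to inspect before accepting the pattern as a data quality rule.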

Implementing data quality rules
Now that we have an optimized set of data quality rules for our data, let’s leverage AI for the implementation as well. Organizations typically implement data quality rules per data product or data domain. But as our data products and domains multiply, AI can come to our aid, identifying similarities between them and suggesting unified data quality rules. This means no more endless handovers to data quality boards! Instead, data stewards can easily accept suggested (and internally approved) rules for data objects like ‘Client identifier’, streamlining this entire process.
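As a rough illustration of suggesting unified rules, the Python sketch below matches a new column name against a catalog of already approved rules. Plain string similarity from the standard library stands in for the AI here; actual tooling would likely use richer signals such as metadata and value distributions. The catalog contents, the `suggest_rules` helper and the 0.6 threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical catalog of internally approved rules, keyed by normalized
# data object name (lowercase, words separated by spaces).
approved_rules = {
    "client identifier": ["not null", "unique"],
    "order date": ["not null", "matches DD-MM-YYYY"],
}

def normalize(name):
    """Lowercase a column name and turn separators into spaces."""
    return name.lower().replace("_", " ").replace("-", " ").strip()

def suggest_rules(column_name, catalog, threshold=0.6):
    """Return the best-matching data object and its rules, or (None, [])."""
    best_name, best_score = None, 0.0
    for known in catalog:
        score = SequenceMatcher(None, normalize(column_name), known).ratio()
        if score > best_score:
            best_name, best_score = known, score
    if best_name is not None and best_score >= threshold:
        return best_name, catalog[best_name]
    return None, []

# A data steward still accepts or rejects each suggestion.
for column in ["Client_ID", "invoice_amount"]:
    print(column, "->", suggest_rules(column, approved_rules))
```

A column without a sufficiently similar counterpart gets no suggestion, so it falls back to the regular rule definition process.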

By implementing these data quality rules, you can estimate whether the data at hand is of sufficient quality to be used as training data for other models. This creates an upward spiral in which AI helps to get good-quality data into your AI solutions. Besides the traditional data quality rules mentioned, there should be rules to ensure objectivity and reduce bias as much as possible.

Data quality remediation
And that's not all: we cannot overlook the role of AI in data quality remediation. Once you are monitoring your data quality through (AI-optimized) data quality rules, issues that you should act on will inevitably become apparent in your data. These require attention in the form of remediation. There are multiple data remediation strategies, and AI can help you automate them.

Based on the data quality rules assigned to your data asset, the AI tooling will start to detect anomalies in the data and suggest how to resolve them. These suggestions include, among others, removing outliers, removing duplicates, standardizing formats and imputing missing values. It is important that people who understand the business assess the proposed remediation and either accept or reject it. But let’s also be realistic: not every data quality issue can or should be resolved. Especially with historical data, it is not always possible to fix the issue. However, AI can learn from the reasoning behind leaving an issue unresolved and take it into account in future suggestions. As with other AI solutions that learn from feedback, this lets the model increase its accuracy over time.
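The accept-or-reject step can be sketched as follows in Python. The example proposes three of the remediation types mentioned above (duplicate removal, format standardization and imputation) and applies only what a reviewer accepted. Outlier removal is left out for brevity, and the field names, the median-based imputation and both helper functions are assumptions for illustration.

```python
from statistics import median

def propose_remediations(rows):
    """Return a list of suggested fixes; nothing is applied automatically."""
    suggestions = []
    seen = set()
    amounts = [r["amount"] for r in rows if r["amount"] is not None]
    # Assumption: impute missing amounts with the median of the known amounts.
    fill = median(amounts) if amounts else None
    for i, r in enumerate(rows):
        key = (r["client_id"], r["order_date"], r["amount"])
        if key in seen:
            suggestions.append({"row": i, "action": "drop_duplicate"})
            continue
        seen.add(key)
        if "/" in r["order_date"]:
            # Standardize DD/MM/YYYY to DD-MM-YYYY.
            suggestions.append({"row": i, "action": "set", "field": "order_date",
                                "value": r["order_date"].replace("/", "-")})
        if r["amount"] is None and fill is not None:
            suggestions.append({"row": i, "action": "set", "field": "amount",
                                "value": fill})
    return suggestions

def apply_accepted(rows, suggestions, accepted):
    """Apply only the suggestions a business reviewer accepted (by index)."""
    result = [dict(r) for r in rows]  # work on copies, keep the input intact
    dropped = set()
    for idx in accepted:
        s = suggestions[idx]
        if s["action"] == "drop_duplicate":
            dropped.add(s["row"])
        else:
            result[s["row"]][s["field"]] = s["value"]
    return [r for i, r in enumerate(result) if i not in dropped]

# Hypothetical sample data, for illustration only.
rows = [
    {"client_id": "C001", "order_date": "01-03-2024", "amount": 100.0},
    {"client_id": "C001", "order_date": "01-03-2024", "amount": 100.0},  # duplicate
    {"client_id": "C002", "order_date": "02/03/2024", "amount": 250.0},  # wrong separator
    {"client_id": "C003", "order_date": "03-03-2024", "amount": None},   # missing value
]

suggestions = propose_remediations(rows)
# The reviewer accepts the first two suggestions and rejects the imputation.
cleaned = apply_accepted(rows, suggestions, accepted=[0, 1])
```

Keeping suggestions separate from their application is a deliberate choice in this sketch: the rejected indices are exactly the feedback a model could learn from.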

The symbiotic relationship between data quality and AI is undeniable: High-quality data fuels effective AI, while advanced AI techniques and technologies are pivotal in driving and maintaining superior data quality.

Applying AI for improving data quality

Where to start? Below are the actions you can take tomorrow to start improving your data quality.

1. Determine the scope within which you want to start the implementation of data quality management. From a business perspective, set realistic expectations on the data at hand (in terms of expected profiling scores, expected data quality requirements and remediation efforts). Consider the type of data that you are handling (e.g., training data, historical data, unstructured data, ungoverned data) and make an informed decision on the expected effort versus impact. This is a crucial step, because you should always be able to challenge what the AI tooling gives you. We call this ‘human in the loop’.
2. Choose a tool with which you want to monitor whether your data fulfills the requirements you set as an organization (e.g., KPMG’s Sofy or KPMG’s AI data classification tooling).
3. Upload your dataset into the tooling and let AI find patterns in your dataset that are logical to monitor from a data quality perspective.
4. Review the suggested data quality requirements to create a feedback loop in the AI model.
5. Implement the data quality requirements and monitor your data quality to identify potential issues.
6. Explain the reason for those data quality issues and identify the ones that need fixing.
7. Let AI assist you in fixing data quality issues by, for instance, imputing logical values based on the dataset, using information from similar datasets or transforming data formats.
8. Scale up! Implement your generated data quality rules consistently across the organization wherever synergies can be achieved.
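The review and monitoring actions above can be sketched together in Python. Only rules that a data steward accepted are monitored, and a rule is flagged as an issue when its pass rate drops below a threshold. The rule names, the review outcomes and the 95% threshold are hypothetical and would differ per organization and tool.

```python
def monitor(rows, rules, reviews, threshold=0.95):
    """Evaluate steward-accepted rules and flag those below the threshold."""
    report = {}
    for name, check in rules.items():
        if reviews.get(name) != "accepted":
            continue  # human in the loop: rejected or unreviewed rules are skipped
        passed = sum(1 for r in rows if check(r))
        rate = passed / len(rows) if rows else 1.0
        report[name] = {"pass_rate": rate, "issue": rate < threshold}
    return report

# Hypothetical sample data, for illustration only.
clients = [
    {"client_id": "C001", "country": "NL"},
    {"client_id": "C002", "country": "nl"},  # lowercase country code
    {"client_id": "C003", "country": "DE"},
    {"client_id": "C004", "country": "BE"},
]

# Rules as suggested by the tooling (hand-written here for the sketch).
rules = {
    "client_id is filled": lambda r: bool(r["client_id"]),
    "country is a two-letter uppercase code": lambda r: (
        len(r["country"]) == 2 and r["country"].isalpha() and r["country"].isupper()
    ),
    "client_id starts with X": lambda r: r["client_id"].startswith("X"),
}

# Outcome of the steward's review of the suggested rules.
reviews = {
    "client_id is filled": "accepted",
    "country is a two-letter uppercase code": "accepted",
    "client_id starts with X": "rejected",
}

report = monitor(clients, rules, reviews)
for name, result in report.items():
    print(name, result)
```

Running the same accepted rules across several datasets is one simple way to apply them consistently when scaling up.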

The sections above describe some aspects of data quality (monitoring) to inspire you to think about what AI can be used for in this field. These are just some examples; there are many more use cases in which AI can help improve your data quality. Should you want to know more about this topic, please reach out to KPMG’s Data & Analytics team, so we can assess how we can tailor these solutions to your specific needs!