Data is gold, but what is dirty data?
Many people are starting to talk about data and about how important it is: "data is an asset", "data is gold", and so on. But many businesses still seem unsure what it takes to treat data as an asset, and perhaps also why it matters.
In a nutshell, when data is dirty (e.g. missing, redundant, inaccurate, inconsistent or just plain wrong), business processes don't work, analyses and reports are unreliable or flawed, and "good" decisions made on the basis of dirty data can ultimately turn out to be bad decisions.
Let me elaborate on this with a few examples I have come across, which are by no means unique.
For example, if you have a lot of duplicates in your systems, whether it's customers, suppliers, products or bookkeeping accounts, it affects your processes, your analytics and your decision-making.
Imagine you need to purchase a product from a supplier, but the supplier has been typed into the system several times. Which one should you choose? What do your spend analyses look like if the supplier exists several times, perhaps under different names? What does it mean for your price negotiation if you do not know what you have bought from the supplier? How do you decide which suppliers to use in the future if your analytics are unreliable? You can customise this thought experiment with, for instance, customers, products, accounts or something else.
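To make the thought experiment concrete, here is a minimal sketch in Python of how duplicate supplier entries split a spend analysis, and how a simple name normalisation can pull them back together. The supplier names, amounts, the list of legal-form suffixes and the helper names `normalize_name` and `spend_by_supplier` are all made up for illustration. Real supplier matching usually needs more than this (addresses, VAT numbers, manual review).

```python
import re
from collections import defaultdict

# Hypothetical legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"ltd", "inc", "llc", "gmbh", "aps"}


def normalize_name(name: str) -> str:
    """Lowercase the name, drop punctuation and legal-form suffixes."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def spend_by_supplier(rows):
    """Sum purchase amounts per normalised supplier name."""
    totals = defaultdict(float)
    for name, amount in rows:
        totals[normalize_name(name)] += amount
    return dict(totals)


# Invented purchase records: the same supplier typed in three ways.
purchases = [
    ("Acme Ltd.", 1200.0),
    ("ACME LTD", 800.0),
    ("Acme", 500.0),
    ("Nordic Paper ApS", 300.0),
]

totals = spend_by_supplier(purchases)
print(totals)
```

Grouped on the raw names, the first supplier would show up as three small suppliers instead of one larger one, which is exactly the kind of distortion that weakens a price negotiation.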
Data is the raw material of all technologies
Today, many businesses invest in business development through technology that promises great potential. Common to all of these technologies is that data is the raw material they need in order to produce the promised results. When data is poor, the technology cannot deliver its full value. For example, if you have set up an RPA (robotic process automation) solution to save on manual work and your data is poor, you will inevitably fail to achieve its full potential. The robot will stall or require guidance every time it encounters data that is not as it should be.
Everyday heroes with tape and glue
In many businesses, bright minds are putting out fires created by dirty data on a daily basis. For many employees, hunting for the right information or double-checking which of two values is correct is an everyday occurrence. Perhaps you are manually consolidating data because there is no real mapping between the data in one system and the data in another. It happens so often that nobody really thinks about the fact that the underlying data is not good enough, or about how much effort is spent every day correcting it. It's just the way it is. Or you are almost painfully aware that the data is so bad that you can't see where to start fixing the problems, so you just concentrate on getting it reasonably right by enriching and manipulating the data, often in Excel.
Big losses, imprisonment and newspaper headlines
Sometimes things go wrong, and businesses and institutions end up on the front pages of newspapers, facing huge fines, imprisonment or heavy losses because their data has, probably inadvertently, gotten them into trouble. It is not always obvious that dirty data was the problem, but it is often data that drives a business's processes and decisions. For example, when the Danish authorities made mistakes in paying out heating cheques to citizens, it later turned out that dirty data was to blame.1
Nobody likes to clean again and again
In my opinion, the challenge is that many people don't realise that you must invest in getting your dirty data under control. Not just as a one-off investment in a major data wash, but as an ongoing commitment of staff time to proactively maintaining good data, rather than having staff spend that time on quick fixes, stop-gap solutions and data washing. I promise that it's so much more fun to work proactively to maintain good data than it is, day after day, month after month, to clean data, compensate for dirty data or work overtime just to make sense of the underlying data.