Are you building data science skyscrapers on quicksand?
Data quality has always been important but is increasingly critical given the continuous expansion and evolution of data and analytics. We share two approaches to improving data quality that will strengthen a data-driven organization's reporting and decision-making.
Every day seems to deliver new solutions and promises in data analytics, data science and artificial intelligence. Before investing in that shiny new proposition or exciting emerging technology, however, ask yourself one question: is my data quality fit for purpose? We all know the phrase 'garbage in, garbage out' and that robust data is critical to achieving the right outcomes, but how should you ensure it is of an appropriately high quality?
Setting reliable values
First, let us define what we mean by data quality. In simple terms, it is a measure of how reliable the values and content in the data fields used by an information system are. We know from experience that it is not always easy to define what a reliable value looks like, and that a single wrong value can have a severe impact on other aspects of a business's activities or processes. For instance, a lack of reliable values can produce incorrect information on product weights and storage needs, or on the profitability of an investment.
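To make this concrete, here is a minimal sketch, using hypothetical product data, of how a single unreliable value can distort a downstream calculation before anyone notices:

```python
# Hypothetical product records: one data-entry error (grams recorded as
# kilograms) is enough to wreck a storage-capacity estimate.
products = [
    {"sku": "A-100", "weight_kg": 2.5},
    {"sku": "A-101", "weight_kg": 2.7},
    {"sku": "A-102", "weight_kg": 2500},  # should have been 2.5
]

# Storage need derived naively from the raw data.
total_weight = sum(p["weight_kg"] for p in products)
print(f"Planned storage capacity: {total_weight:.1f} kg")  # 2505.2 kg, wildly inflated

# A simple plausibility check catches the outlier before it reaches a report.
suspect = [p for p in products if not 0 < p["weight_kg"] < 100]
print(f"Values needing review: {suspect}")
```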
Defining quality rules
The building blocks of good data quality are the quality rules you define, which are specific to your organization. Each rule specifies a precise condition for a given field; evaluating these conditions generates a tabular outcome, which serves as the basis for your reports and dashboards. Keep in mind that quality rules are classified along seven dimensions that are often used to tailor dedicated reporting: relevance, accuracy, credibility, timeliness, accessibility, interpretability and coherence.
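As a minimal sketch, assuming a pandas-based workflow, quality rules can be expressed as field-level conditions tagged with the dimension they address. The rule names, fields and thresholds below are hypothetical examples, not a prescribed rule set:

```python
import pandas as pd

# Each rule: (name, field it applies to, precise condition, quality dimension).
rules = [
    ("weight_is_plausible", "weight_kg", lambda s: s.between(0.01, 100), "accuracy"),
    ("sku_is_present",      "sku",       lambda s: s.notna(),            "relevance"),
    ("price_is_coherent",   "price_eur", lambda s: s > 0,                "coherence"),
]

data = pd.DataFrame({
    "sku": ["A-100", "A-101", None],
    "weight_kg": [2.5, 2500.0, 2.7],
    "price_eur": [9.99, 0.0, 12.50],
})

# Evaluate every rule and collect a tabular outcome that can feed
# reports and dashboards directly.
outcome = pd.DataFrame(
    {
        "rule": name,
        "dimension": dimension,
        "passed": int(check(data[field]).sum()),
        "failed": int((~check(data[field])).sum()),
    }
    for name, field, check, dimension in rules
)
print(outcome)
```

Because the outcome is itself a table, the same structure can be grouped by dimension to produce the dedicated reporting mentioned above.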
Not a one-off activity
We would also note that getting your approach to data quality right is not a one-off activity; rather, it should be part of your daily routines. Apply it by consistently repeating automated analyses at a frequency suited to the target data and the subject of each rule.
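As one illustrative sketch of that routine, a simple loop can rerun the checks at a fixed interval. Here run_quality_checks is a hypothetical stand-in for the rule evaluation shown earlier; in production, a scheduler such as cron or an orchestration tool would replace this loop:

```python
import time

# Daily by default; tune the interval per rule and per data set.
CHECK_INTERVAL_SECONDS = 24 * 60 * 60

def run_quality_checks() -> None:
    # Placeholder: evaluate each quality rule and persist the tabular outcome.
    print("Quality checks executed.")

while True:
    run_quality_checks()
    time.sleep(CHECK_INTERVAL_SECONDS)
```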