Regardless of your niche, you need reliable information to make sound business decisions. To get it, you need to collect data, analyze it, and transform it into a form you can use. This information is so valuable that most businesses don't rely on their own data collection alone but also turn to third-party providers for the data they need.
However, data of poor quality will result in incorrect information, regardless of your data source. If your business uses low-quality information, it is likely to make poor decisions. These ill-informed decisions, in turn, will result in a dip in performance and competitiveness.
If your organization treats information as an asset or product, you need to apply stringent measures to protect its quality. These measures begin with establishing a set of data quality metrics that you will closely monitor to ensure that you’re working only with reliable information.
This article will discuss data quality, the importance of measuring data quality, and a few data quality metrics your business should monitor.
What are data quality metrics?
Data scientists have many different definitions of data quality. Still, most agree on one thing: data is of high quality if it is fit to be used for decision making, planning, and operations. Data quality also refers to the degree of closeness of the data to the real-world environment or system that it represents.
These two data quality dimensions reflect the two contexts in which people work with data. First, it needs to meet business requirements. Second, it needs to be correct in itself. In other words, data quality metrics are either subjective or objective.
It’s not uncommon for a business to use both categories of data quality metrics, mainly if it collects customer data. The secret to maintaining data quality lies in recognizing how different metrics influence each other and contribute to an overall picture of your business.
Why should you measure data quality metrics?
Just as faulty products result in plummeting sales, poor data quality results in more than a drop in revenue. It also drives extra costs from rework, missed business opportunities, and poorly allocated resources. According to Gartner, businesses lose an average of $12 million each year because of poor data quality.
Because of the risks associated with poor data quality, organizations take data quality metrics seriously. Gartner also predicts that 70% of companies will monitor data quality metrics by 2022, which will significantly reduce costs and risks across industries.
We cannot overstate the importance of maintaining data quality. Aside from reducing production or operational costs, good quality data can help your business identify high-quality leads, create better products, and build more robust customer experiences.
But before you can use data to improve the way you do business, you need to understand how to measure data quality metrics and identify quality data sources.
How do you regulate data quality metrics in your organization?
Your data will come from different sources: accounting software, CRM solutions, user feedback sites such as Capterra, email marketing software, websites, Google Analytics, or survey services like SurveyMonkey. With so much data being generated at any given time, it’s practically impossible to collect and measure all your potential data points and keep your data free of errors.
However, you don't have to strive for 100% error-free data. Even if you take every step to prevent errors, some bad or unusable data will fall through the cracks. What you can do, though, is specify a set of parameters within which you collect data and accept it as correct and valid.
Most businesses use a specific set of data governance rules to ensure the quality of the data they collect and measure. Some have a dedicated data governance office or data steward that sets standards for data quality, while smaller organizations fold data quality metrics and management into their IT department.
In either case, whoever is in charge of data quality should know how to spot faulty data and emphasize the effects of poor data quality to whoever handles it, whether end-users or stakeholders.
13 essential data quality metrics you should monitor
While different businesses and industries have different data quality standards, this shouldn’t stop you from creating a data quality policy that you can implement across your business.
Before you can create that policy, you also need to accept that you cannot monitor everything. Your data quality policy should include a list of standard metrics your data governance team should follow. Here are 13 metrics that you need to track to get a good idea of your data quality posture:
1. Accuracy

To people outside data science, "accuracy" refers to the closeness of a quantity to a known value. However, to data scientists, it refers to the proportion of correctly predicted observations to the total number of observations.
In other words, accuracy doesn’t just refer to how correct your observations are but also the consistency with which your data is correct. Even if your data model has a high degree of precision 50% of the time, it’s a lot less accurate than another data model that gets it right 80% of the time.
How do you define the benchmarks you can use to measure data quality? More often than not, you need a generally accepted data source to serve as a proxy for ground truth. These standards change over time. The meter was formerly defined as a certain proportion of the earth's circumference. Now, scientists define it as the distance traveled by light in 1/299,792,458 of a second.
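To make the definition above concrete, here is a minimal sketch of accuracy as the proportion of observations that match a trusted reference source. The reference values and both "data models" are invented purely for illustration.

```python
# Hypothetical example: accuracy as the proportion of correct observations.
# The reference values and both model outputs are invented for illustration.

def accuracy(observed, reference):
    """Fraction of observations that match a trusted reference source."""
    if len(observed) != len(reference):
        raise ValueError("observed and reference must be the same length")
    correct = sum(1 for o, r in zip(observed, reference) if o == r)
    return correct / len(observed)

# Two hypothetical data models measured against the same reference:
reference = [10, 12, 11, 13, 12, 14, 11, 12, 13, 12]
model_a   = [10, 12, 15, 13, 18, 14, 11, 19, 13, 17]  # right 6 of 10 times
model_b   = [10, 12, 11, 13, 12, 14, 11, 19, 13, 17]  # right 8 of 10 times

print(accuracy(model_a, reference))  # 0.6
print(accuracy(model_b, reference))  # 0.8
```

As the article notes, a model that is right 80% of the time is more accurate than one that is right only 60% of the time, even if both are very precise when they do get it right.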
2. Completeness

It's not enough for your data to be accurate. Your database should also be complete, so it faithfully reflects the world or environment you're trying to describe.
What exactly do we mean by "complete"? According to Towards Data Science, completeness indicates the degree to which your data set has the data you need. Your data set might have missing records yet still be complete if those records don't affect your ability to answer your questions correctly. Completeness also considers whether your data set is biased towards a specific segment.
For example, suppose you run a survey about the workplace experience during the pandemic. If your respondents include only those who work from home (and not those who still report to a physical office), your data isn't complete. If your target, though, is to survey only those who work from home, then your data might be complete.
3. Coverage

Many tend to confuse completeness with coverage. After all, both data quality metrics have something to do with ensuring that your database has data points for all people or objects that your study will cover. However, the similarities end there.
While completeness means that there are records for all subjects, coverage means that all those records have values. Here’s an example:
While both datasets contain the same number of records and are considered complete, only the first one has 100% coverage, as 20 out of 20 possible fields have values. The second one has only 14 out of 20 fields with values, which yields 70% coverage.
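The coverage calculation described above can be sketched in a few lines. The records below are hypothetical (built around names the article mentions, plus one invented entry), with `None` marking an empty field.

```python
# Hypothetical sketch: coverage as the share of fields that actually hold values.
# Records use None to mark an empty field; the values are invented for illustration.

def coverage(records):
    """Ratio of non-empty fields to total fields across all records."""
    total = sum(len(r) for r in records)
    filled = sum(1 for r in records for v in r.values() if v is not None)
    return filled / total

dataset = [
    {"name": "Nikita Grossman", "gender": "Female", "nationality": "Australian", "age": 34},
    {"name": "Wilson Cox", "gender": "Male", "nationality": "American", "age": None},
    {"name": "Mia Lee", "gender": None, "nationality": None, "age": 29},
]

print(coverage(dataset))  # 9 of 12 fields have values -> 0.75
```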
4. Consistency

Aside from ensuring that your data is complete and fully covered, you need to be sure that it is also consistent. For your data to be consistent, you need to measure it the same way every time.
For example, you cannot mix meters and feet or Celsius and Fahrenheit. You need to choose one unit and stick to it. If your data uses the wrong units, you need to convert it to the correct one. Otherwise, you risk making faulty calculations due to incorrect numbers.
The data set above is an excellent example of inconsistent data. While the altitude row is supposed to be in meters, the values for Lemuria and Atlantis are both in feet. The temperature row is supposed to be in Celsius, but the values for Mu and Gondwanaland are in Fahrenheit. As a result, the average value for each row is not correct.
Now, let’s take a look at the same data set, but with all the values converted into the correct units:
When we convert the values to the correct units, it affects the values themselves and the calculations that involve those values. In this example, the correct average altitude and temperature are much lower in reality than what the initial data implies.
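Here is a small sketch of the consistency problem described above: altitude readings recorded in feet inside a column that is supposed to be in meters. The place names come from the article's example, but the numeric values are invented for illustration.

```python
# Hypothetical illustration of inconsistent units: two altitude readings are in
# feet although the column is supposed to be in meters. Place names come from
# the article's example; the numeric values are invented.

FT_TO_M = 0.3048  # exact definition of the international foot

altitudes = {  # (value, unit); intended unit for the column is meters
    "Lemuria": (5000, "ft"),
    "Atlantis": (3000, "ft"),
    "Mu": (900, "m"),
    "Gondwanaland": (1200, "m"),
}

def to_meters(value, unit):
    return value * FT_TO_M if unit == "ft" else value

raw_avg = sum(v for v, _ in altitudes.values()) / len(altitudes)
clean_avg = sum(to_meters(v, u) for v, u in altitudes.values()) / len(altitudes)

print(raw_avg)    # 2525.0 -- inflated by the unconverted feet values
print(clean_avg)  # about 1134.6 -- the true average in meters
```

As in the article's example, the corrected average is much lower than what the inconsistent data implied.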
5. Data integrity

Data points rarely exist in isolation. Most pieces of data are linked to other values recorded in the same data set. For example, the total of a set of numbers should always equal the sum of the individual numbers. Likewise, foreign keys, or data values that point to external records, should point to records that actually exist.
If you run a check on your data and see that the numbers fail to add up, it might be a sign of manual manipulation or faulty formulas. To ensure the integrity of your data, you need to ensure that the process of generating and collecting data is transparent and that you are aware of the relationships between records and values.
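The two integrity checks mentioned above (totals matching their parts, and foreign keys pointing at existing records) can be sketched as follows. The table shape, field names, and values are all invented for illustration.

```python
# Hypothetical integrity checks: a stored total should equal the sum of its
# line items, and every foreign key should point at an existing record.
# The table shape, field names, and values are invented for illustration.

orders = {
    101: {"customer_id": 1, "line_items": [20.0, 15.5], "total": 35.5},
    102: {"customer_id": 2, "line_items": [10.0, 10.0], "total": 25.0},  # bad total
    103: {"customer_id": 9, "line_items": [5.0], "total": 5.0},          # bad foreign key
}
customers = {1: "Ada", 2: "Grace"}

def integrity_errors(orders, customers):
    errors = []
    for order_id, order in orders.items():
        if abs(sum(order["line_items"]) - order["total"]) > 1e-9:
            errors.append((order_id, "total does not match line items"))
        if order["customer_id"] not in customers:
            errors.append((order_id, "customer_id points to a missing record"))
    return errors

print(integrity_errors(orders, customers))
# [(102, 'total does not match line items'), (103, 'customer_id points to a missing record')]
```

A check like this flags exactly the kind of manual manipulation or faulty formulas the article warns about.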
6. Timeliness

Most people think of timeliness as a metric that tracks whether data arrives on time. While this is certainly part of timeliness, it only matters if you do your calculations in real time. Timeliness in a data science context requires only that the data is accurate as of the moment it was generated (as opposed to when it is displayed).
The numbers above tell us the temperature for specific hours. For example, it’s 12 degrees Celsius at 9 PM and 11 degrees at 6 AM. For the data to be timely, it should be 12 degrees at 9 PM and 11 degrees at 6 AM, whether you’re looking at it now or tomorrow at 7 PM.
7. Duplication

For data to be truly useful, you need to eliminate duplicate records. These are records that contain precisely the same values. Duplicate records can affect your computations by skewing them towards a specific outcome.
Let’s look at our example for the section on data coverage, but with duplicates:
Because the entries for Nikita Grossman and Wilson Cox were duplicated in the example above, someone who isn’t aware of the duplication might think that there are more males and Australians in the data set than there actually are. When you remove the duplicates, however, there will be the same number of males and females as well as Americans and Australians.
You have to watch out for data duplication, especially if you’re using data from CRM software. Removing duplicates from your CRM database will allow you to create a more accurate customer persona that can help you create more effective marketing campaigns.
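A minimal deduplication sketch for the example above might look like this. The names Nikita Grossman and Wilson Cox come from the article; the other names and the gender and nationality values are invented for illustration.

```python
# A minimal dedup sketch. Two names come from the article's example; the other
# records and all gender/nationality values are invented for illustration.

def deduplicate(records):
    """Keep the first occurrence of each exact-duplicate record."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"name": "Nikita Grossman", "gender": "Male",   "nationality": "Australian"},
    {"name": "Wilson Cox",      "gender": "Male",   "nationality": "Australian"},
    {"name": "Ann Doe",         "gender": "Female", "nationality": "American"},
    {"name": "Jane Roe",        "gender": "Female", "nationality": "American"},
    # exact duplicates that skew the counts towards males and Australians:
    {"name": "Nikita Grossman", "gender": "Male",   "nationality": "Australian"},
    {"name": "Wilson Cox",      "gender": "Male",   "nationality": "Australian"},
]

print(len(deduplicate(records)))  # 4
```

After deduplication, the counts of males and females (and Americans and Australians) are equal again, just as the article describes.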
8. Data storage costs
The cost of storing data is tied closely to completeness, consistency, timeliness, and duplication, mainly if you use the same amount of data each time you make calculations. This is one of the more unusual data quality metrics, as the storage cost per GB doesn't depend on the data itself but on the provider's quoted rate.
If your data storage costs keep rising, it is a sign that you might be collecting too much unusable data. Duplicate data, for example, is not usable, but it still takes up valuable storage space. Incomplete data also plays a role in rising data storage costs as it still consumes space, even if it’s not usable.
To reduce data storage costs, you need to make an inventory of your data storage needs. If your calculations use data only for a specific time frame, such as the past year, it might be time to remove older records. It might also be time to remove duplicate or incomplete records. Doing so will lower your data storage costs and make your reports more timely and accurate.
9. Accessibility

Accessibility is another data quality metric that is influenced heavily by its users. In the context of data quality, it refers to the number of users who access the data over a specific period. It can also refer to the number of systems where the data is available.
For example, if five users consistently access the data over 30 days through five different platforms, you can say that the accessibility rate is five users/month. Alternatively, you may also say that the data is accessible through five different systems.
You can calculate availability, a relative of accessibility, by dividing the number of times the data was accessed by the number of times users needed it. For instance, if a user needed to extract the data to produce five different reports in one day and was able to extract data four times, the data availability is 4/5 or 80%.
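The availability calculation described above reduces to a simple ratio. The sketch below encodes it, with the edge case of data that was never needed handled explicitly (an assumption on my part, not something the article specifies).

```python
# Availability as described above: successful accesses divided by the number
# of times the data was needed. The zero-needs behavior is an assumption.

def availability(successful_accesses, times_needed):
    if times_needed == 0:
        return 1.0  # assumption: data never needed means it was never unavailable
    return successful_accesses / times_needed

print(availability(4, 5))  # 0.8, i.e. 80% availability
```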
Next, we move on to what we call “subjective metrics”, which only make sense in the context of users’ perception of the data and are measured through user surveys. While they don’t directly impact the accuracy or integrity of the data, they have a significant impact on the way your business processes use them.
10. Objectivity

Oddly enough, the metric we call "objectivity" is one of the most subjective on this list. This data quality metric measures the impartiality of a data set in the eyes of its users. Different factors influence this metric, such as the data source or the user's previous experience with similar data sets from the same source.
This perception can change over time. For instance, a source that users once found objective can lose that status if a critical mass of users find that it is inaccurate, incomplete, untimely, or inaccessible.
A database’s objectivity can also depend on the user’s perception of the source. The Brookings Institution argues that despite people’s high regard for public opinion polling, some interest groups commission pollsters to include specific questions that lean towards a particular agenda. These questions (and the poll results) then lead to the public forming a less favorable opinion on the pollster.
It’s not possible to appear objective to everyone. All data users have their own biases, but as long as you stick to the basics, such as accuracy, coverage, consistency, timeliness, and integrity, your audience will tend to view your data as objective.
11. Believability

Aside from objectivity, believability is another metric that measures end users' trust in the data. A data set might be accurate, but if you continuously find yourself substituting it with data from another source, it might not have high believability.
What exactly is believability? One study conducted at MIT proposes that believability has three dimensions: trustworthiness, reasonableness, and temporality. These may correspond to objectivity, accuracy, and timeliness. While the MIT believability model is quite complicated, it tends to reward sources that have been accurate for a long time and sources that provide data close to the time an analyst observes it.
In addition, the model looks at the proximity of the data reporter to the source of the data itself. One example used in the MIT study compares the believability scores of different data sources for the Indian population in Malaysia and Singapore. According to the study, the Malaysian and Singaporean state departments were more believable than the CIA in terms of population and demographic estimates.
Of course, there are times when an external party might be more believable. An external auditor might produce more believable data than an internal auditor or an established end user. In the long run, believability comes down to what most users have found believable over a sustained period.
12. Interpretability

Even if your data is raw, you should still be able to understand it, especially if you are constructing formulas to extract usable insights from it. A data extract with clearly defined labels or column headers, for example, is more understandable than one with labels or column headers that don't describe the data.
For example, Data Set #1 is not very interpretable:
In contrast, Data Set #2 is a lot more interpretable:
Aside from adding labels to the data set itself, the existence of documentation can improve its interpretability. Data documentation usually includes a discussion of the system used to collect the data, a data dictionary containing the name, format, and unit of each data point, and the relationships between these data points.
You may also look into the way you gather data. Your online form builder should be able to accept and export data in a structured format, with clear headers. If your form builder churns out data that looks more like Data Set #1, you might want to consider another solution.
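One way to improve interpretability, in the spirit of the data-dictionary documentation described above, is to map cryptic export headers to descriptive names. Everything in this sketch, from the column codes to the dictionary entries, is invented for illustration.

```python
# Hypothetical sketch: using a data dictionary to replace cryptic export
# headers with descriptive names. All column codes and names are invented.

DATA_DICTIONARY = {
    "c1": "customer_name",
    "c2": "signup_date",
    "c3": "monthly_spend_usd",
}

def relabel(rows, dictionary):
    """Replace cryptic column keys with descriptive ones; keep unknown keys."""
    return [{dictionary.get(k, k): v for k, v in row.items()} for row in rows]

raw = [{"c1": "Ada Lovelace", "c2": "2021-03-01", "c3": 49.99}]
print(relabel(raw, DATA_DICTIONARY))
# [{'customer_name': 'Ada Lovelace', 'signup_date': '2021-03-01', 'monthly_spend_usd': 49.99}]
```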
13. Usability

Finally, usability is a data quality metric that measures how easily you can manipulate the data to perform a business function. For example, if you need to perform a series of complicated calculations to get your data to say something, it has low usability. On the other hand, if most of your formulas are straightforward, then it has high usability.
Usability also includes findability, which refers to the ease with which a user can find a data set within a larger pool of data. By creating a data catalog that includes a description of a data set, a list of its contents, and its location, you can increase the usability of your data.
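A data catalog of the kind described above can be as simple as a searchable list of entries, each with a description, contents, and location. The entries, dataset names, and storage paths below are all invented for illustration.

```python
# A minimal, hypothetical data catalog to improve findability. Every entry,
# dataset name, and storage path here is invented for illustration.

catalog = [
    {"name": "sales_2021", "description": "Monthly sales by region",
     "contents": ["region", "month", "revenue"],
     "location": "warehouse/sales/2021/"},
    {"name": "survey_wfh", "description": "Work-from-home survey responses",
     "contents": ["respondent_id", "question", "answer"],
     "location": "warehouse/surveys/wfh/"},
]

def find_datasets(catalog, term):
    """Return the names of catalog entries whose name or description matches."""
    term = term.lower()
    return [e["name"] for e in catalog
            if term in e["description"].lower() or term in e["name"].lower()]

print(find_datasets(catalog, "survey"))  # ['survey_wfh']
```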
Your business relies on quality data to make informed decisions. It can make the difference between your business thriving and sinking. If you're entering a competitive market, you need all the data you can get, but you must also ensure it is of high quality.
Data quality, though, is more than just accuracy. What we usually call “accurate data” is composed of data that passes multiple tests, such as its closeness to accepted or observed values, coverage, consistency, timeliness, and integrity. It is also heavily influenced by your opinion of the data source and the way you use and manipulate data sets.
By establishing data quality metrics for your business, you can ensure that the data you get tells you the whole story. The insights you gain from analyzing quality data will help you make informed decisions that will improve the way you do business and increase your revenue flow.