Data Quality: What It Is and Why It’s Important

With the advent of AI, edge computing and big data, data quality is more vital than ever.

Written by Jenny Lyons-Cunha
Published on Sep. 16, 2024

Data quality refers to how well data performs in areas like accuracy, validity, completeness and consistency.

What Is Data Quality?

Data quality is defined as the degree to which data meets standards of consistency, validity, accuracy and completeness.

High-quality data leads to informed decision-making and optimized operations, and it’s a key piece of data governance.

On the flip side, poor data quality costs organizations roughly 15 to 25 percent of their annual revenue, according to a report by MIT Sloan. Beyond lost income, subpar quality fragments data ecosystems and damages business outcomes.

 

Why Is Data Quality Important?

With the rise of artificial intelligence, hybrid cloud, edge computing and the Internet of Things, data quality has become more vital than ever. High-quality data helps drive key benefits, including: 

Improved Decision Making 

“Data quality is the foundation for better decision making,” Mitch Zink, senior manager of data engineering at Atrium, told Built In. “With the explosion of data and advent of AI in the last few years, it’s become clear that data drives better [decisions], but only if your data is good.” 

Indeed, good data gives companies visibility into their operations. The unknowns that come with poor data quality, like unreliable outliers and undefined fields, should be a concern to organizations, according to Mike Flaxman, VP of product at Heavy.AI and former MIT professor.

“If you’re making decisions based on your data, you don’t want to be thrown by something unexpected,” Flaxman told Built In. “An anomalous spike in data value is going to skew data one way or the other — [poor data quality] can make you think sales went up when they actually went down.”
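To make Flaxman's point concrete, here is a minimal Python sketch, not from the article, that flags a single anomalous spike in hypothetical daily sales figures before it distorts a weekly total. The data, the median-based rule and the factor of three are all illustrative assumptions.

```python
# Hypothetical daily sales figures; 9,500 is a data-entry spike, not a real jump.
daily_sales = [1020, 980, 1010, 9500, 990, 1005, 995]

def flag_spikes(values, factor=3.0):
    """Flag values more than `factor` times the median as suspect."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    return [v for v in values if v > factor * median]

suspects = flag_spikes(daily_sales)
print("Review before reporting:", suspects)
# Left unchecked, the spike would make weekly sales look roughly 8,500 higher than they were.
```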

Better Analysis

Analyses are only as good as their input data, so companies get the most out of their analytics efforts when that data is clean. 

Armed with tools like AI and machine learning, many organizations assume that automation will take care of lapses in quality during analysis, but Parker Ziegler, a computer science PhD candidate at the University of California, Berkeley, argues that these assumptions can lead to disaster. 

“In general, data analyses build in assumptions about data quality,” Ziegler told Built In. “When invariants don’t hold, the analysis can go awry.” 
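As a toy illustration of Ziegler's point, the sketch below shows an analysis that assumes one row per order; when that invariant breaks, revenue is silently overstated unless the assumption is checked first. The records and the invariant are hypothetical, not something Ziegler described.

```python
# Hypothetical order records; the duplicate row violates the "one row per order" assumption.
orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": 40.0},
    {"order_id": 2, "amount": 40.0},  # accidental duplicate from a second data source
]

# Surface the violated invariant before running the analysis; this assert fails on purpose.
ids = [o["order_id"] for o in orders]
assert len(ids) == len(set(ids)), "invariant violated: duplicate order_ids"

# Without the check above, this total would quietly include the duplicate.
total_revenue = sum(o["amount"] for o in orders)
```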

Elevated Business Outcomes

Improved analysis and decision-making help combat the losses caused by poor-quality data, which IBM estimates cost the US roughly $3.1 trillion per year. In industries like healthcare and emergency services, good data also reduces risk to human lives by improving processes and bolstering efficiency.

 

Data Quality Metrics 

Data quality can be evaluated based on a number of metrics, which differ based on the source and nature of the data. These metrics, several of which are measured in the short code sketch after this list, include:

  • Uniqueness: This measures the amount of duplicate information in a dataset. Each record should be uniquely identified; for example, each user might be assigned a unique ID number. 
  • Completeness: Completeness measures the proportion of data that is present and usable. Missing values and missing metadata lower this metric. 
  • Timeliness: The data should be current and readily available. For example, customers might expect to receive a tracking number as soon as an item ships, so this data must be generated in real time. 
  • Validity: This metric accounts for how much of the data conforms to defined business parameters and rules. 
  • Accuracy: Accurate data correctly reflects real-world values based on a designated source of truth or primary data source. 
  • Consistency: This measures the uniformity of data across datasets and systems. The same data values should not conflict across different datasets. 
  • Fitness: Fitness for purpose ensures that the data meets the business need at hand. 
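As noted above, here is a minimal sketch, using the pandas library and hypothetical customer records, of how a few of these metrics might be scored. The column names, rules and thresholds are illustrative assumptions, not a standard.

```python
import pandas as pd

# Hypothetical customer records with a duplicate ID, a missing email and an invalid age.
df = pd.DataFrame({
    "user_id": [101, 102, 102, 104],
    "email":   ["a@example.com", None, "b@example.com", "not-an-email"],
    "age":     [34, 29, 29, -3],
})

uniqueness   = 1 - df.duplicated(subset="user_id").mean()  # share of unique IDs
completeness = 1 - df["email"].isna().mean()               # share of non-missing emails
validity     = df["age"].between(0, 120).mean()            # share passing a business rule

print(f"uniqueness={uniqueness:.2f}, completeness={completeness:.2f}, validity={validity:.2f}")
```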

 

Data Integrity vs. Data Quality

While data quality and data integrity may seem interchangeable, they have unique — but interrelated — definitions and implications. Data’s integrity is determined by its reliability, accuracy and consistency across its lifecycle. While data quality considers these factors, it refers to the fitness of data for its intended purpose. 

Simply put, data integrity refers to the accuracy and cleanliness of a dataset, while data quality is defined by how well a business can analyze and utilize it. 

“You can have 100 percent data integrity but still have room for improvement in data quality,” Zink said.

In an example Zink shared, customers might fail to fill out an optional field denoting why they canceled their service. In this case, the data would have full integrity but would be useless to the business and, therefore, have low quality.  
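A small Python sketch of the same idea, with hypothetical field names: every record is structurally sound, so integrity is intact, but the optional reason field is rarely filled in, so the data answers little about churn.

```python
# Hypothetical cancellation records: valid and consistent (full integrity),
# but the optional cancellation_reason field is mostly empty (low quality for churn analysis).
cancellations = [
    {"customer_id": 1, "cancelled": True, "cancellation_reason": None},
    {"customer_id": 2, "cancelled": True, "cancellation_reason": None},
    {"customer_id": 3, "cancelled": True, "cancellation_reason": "price"},
]

with_reason = sum(1 for c in cancellations if c["cancellation_reason"])
coverage = with_reason / len(cancellations)
print(f"Only {coverage:.0%} of cancellations explain why, so the field is of limited business use.")
```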

 

How to Improve Data Quality

When it comes to improving data quality, there is no one-size-fits-all approach. In the age of big data, the furious pace of data creation, projected to reach 463 exabytes per day in 2025, has shifted much of the burden of quality assurance onto automation.

“The cadence of data has increased quite a bit,” Flaxman said. “You don’t have human eyeballs and brains spending years going over data to make sure it’s clean before you put it into a decision-making process anymore.” 

Flaxman noted that companies should balance automated tools with human safeguards. Potential strategies include: 

Data Profiling

Sometimes called data quality assessment, data profiling is the process of auditing the current state of an organization's data. Profiling unearths errors, gaps, inaccuracies, inconsistencies, duplications and barriers to access. Data quality tools can be used to uncover anomalies. 
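As a rough sketch of what a profiling pass can surface, the snippet below uses pandas on a hypothetical orders table to count missing values and duplicates and to flag suspicious ranges. Real profiling tools go much further; the columns and checks here are illustrative assumptions.

```python
import pandas as pd

# Hypothetical orders table with a duplicate ID, a missing amount, a negative amount
# and inconsistent casing in the status column.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4, 5],
    "amount":   [20.0, None, 35.0, -10.0, 5000.0],
    "status":   ["shipped", "shipped", "SHIPPED", None, "pending"],
})

profile = {
    "rows": len(df),
    "missing_per_column": df.isna().sum().to_dict(),
    "duplicate_order_ids": int(df.duplicated(subset="order_id").sum()),
    "amount_range": (df["amount"].min(), df["amount"].max()),
    "status_values": df["status"].dropna().unique().tolist(),
}
print(profile)  # negative and extreme amounts, nulls and mixed casing all stand out for review
```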

Data Cleansing

Once data quality issues have been identified, data cleansing remediates the errors at hand. This process might include matching, merging or resolving redundant data. 
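A minimal cleansing pass over a similar hypothetical orders table might look like the sketch below: normalize casing, merge redundant rows and quarantine records that fail a sanity check. The specific rules are assumptions for illustration.

```python
import pandas as pd

# Hypothetical orders table with a casing inconsistency, a redundant row and a bad amount.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount":   [20.0, 35.0, 35.0, -10.0],
    "status":   ["shipped", "SHIPPED", "shipped", "pending"],
})

df["status"] = df["status"].str.lower()                 # resolve casing inconsistencies
df = df.drop_duplicates(subset=["order_id", "amount"])  # merge redundant records

quarantine = df[df["amount"] < 0]                       # hold invalid rows for manual review
clean = df[df["amount"] >= 0]
print(len(clean), "clean rows;", len(quarantine), "row(s) sent for review")
```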

Data Standardization

Data standardization brings unstructured big data and disparate assets into a consistent format. Standardization is applied via business rules designed to serve the organization’s needs. 

“Many data quality challenges for organizations arise from difficulty managing related data across many different parts of the org,” Ziegler said. “With multiple data sources contributing small pieces of a full data picture, you often run into issues of data duplication or data drift over time.” 

Normalization — or attributing all data to a single source of truth — is a common technique for standardization, Ziegler said. 
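A simple rule-based standardization step, sketched below with hypothetical input formats and a made-up country mapping, converts dates and country labels arriving from different systems into one canonical form.

```python
from datetime import datetime

# Hypothetical business rules: accepted input date formats and a country-name mapping.
COUNTRY_MAP = {"usa": "US", "u.s.": "US", "united states": "US"}
DATE_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%d %b %Y"]

def standardize_date(raw: str) -> str:
    """Convert any accepted input format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def standardize_country(raw: str) -> str:
    """Map common variants to a canonical two-letter code."""
    return COUNTRY_MAP.get(raw.strip().lower(), raw.strip().upper())

print(standardize_date("09/16/2024"), standardize_country("u.s."))  # 2024-09-16 US
```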

Geocoding

Geocoding is the practice of adding location metadata to datasets, such as coordinates denoting where data originated, where it has been and where it is stored. This method helps organizations stay compliant with location-specific standards and maintain data privacy; for example, knowing where personal data originates and resides is necessary for sustaining GDPR compliance. 
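In the simplest case, attaching location metadata can be sketched as below: each record is tagged with a region via a lookup, and EU records are flagged for GDPR handling. The lookup table and routing rule are hypothetical stand-ins for a real geocoding service or reference dataset.

```python
# Hypothetical country-to-region lookup standing in for a real geocoding service.
COUNTRY_TO_REGION = {"DE": "EU", "FR": "EU", "US": "NA", "JP": "APAC"}

records = [
    {"customer_id": 1, "country": "DE"},
    {"customer_id": 2, "country": "US"},
]

# Tag each record with location metadata and flag records subject to EU rules.
for rec in records:
    rec["region"] = COUNTRY_TO_REGION.get(rec["country"], "UNKNOWN")
    rec["gdpr_scope"] = rec["region"] == "EU"

print(records)
```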

Visualization 

Visualization is the process of creating a graphical representation of data. It allows data scientists and domain experts alike to quickly spot data problems. Various visualization tools and automation can help organizations unearth data errors, large and small. 
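Even a quick histogram can expose problems that summary statistics hide. The sketch below uses matplotlib on hypothetical order amounts in which two out-of-range values are immediately visible.

```python
import matplotlib.pyplot as plt

# Hypothetical order amounts; the two values near 500 are likely data errors.
amounts = [20, 22, 19, 21, 25, 23, 18, 24, 500, 510]

plt.hist(amounts, bins=20)
plt.xlabel("Order amount")
plt.ylabel("Count")
plt.title("Distribution check for anomalous values")
plt.show()
```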

Real-Time Validation 

Batch and real-time validation applies data validation rules across all of an organization’s data. The rules can run as a scheduled batch job or continuously, as records arrive in real time. 
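One way to picture this, with hypothetical rules and records, is a single set of validation functions applied both to individual events as they arrive and to an accumulated batch.

```python
# Hypothetical validation rules shared by the streaming and batch paths.
RULES = [
    ("non_empty_id",    lambda r: bool(r.get("user_id"))),
    ("positive_amount", lambda r: r.get("amount", 0) > 0),
]

def validate(record):
    """Return the names of the rules a record fails."""
    return [name for name, rule in RULES if not rule(record)]

# Real-time path: check each event as it is ingested.
event = {"user_id": "u42", "amount": -3}
print("stream failures:", validate(event))          # ['positive_amount']

# Batch path: check everything collected overnight in one pass.
batch = [{"user_id": "u1", "amount": 10}, {"user_id": None, "amount": 5}]
print("batch failures:", [(i, validate(r)) for i, r in enumerate(batch)])
```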

Master Data Management

Master data management is the process of creating a centralized data registry in which all data is cataloged. MDM gives an organization a single place to quickly view and reconcile data that lives across multiple systems and locations. 
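A toy version of such a registry, with made-up dataset names and systems, is just a catalog that records where the authoritative copy of each dataset lives and which systems hold copies.

```python
# Hypothetical master registry: one catalog of golden sources and known copies.
registry = {
    "customers": {"golden_source": "crm_db", "copies": ["warehouse", "mailing_tool"]},
    "orders":    {"golden_source": "orders_db", "copies": ["warehouse"]},
}

def lookup(dataset: str) -> dict:
    """Single place to ask where the authoritative copy of a dataset lives."""
    return registry.get(dataset, {"golden_source": None, "copies": []})

print(lookup("customers")["golden_source"])  # crm_db
```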

Data Quality Monitoring

Once data quality has been improved, ongoing monitoring should be put in place to keep it that way. This process might include revisiting previously cleansed datasets or setting data quality KPIs. 
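A bare-bones monitoring loop might recompute a few quality KPIs on a schedule and alert when they fall below agreed thresholds, as in the sketch below. The metric names and thresholds are hypothetical.

```python
# Hypothetical KPI thresholds agreed with the business.
THRESHOLDS = {"completeness": 0.95, "uniqueness": 0.99}

def check_kpis(metrics: dict) -> list:
    """Return the KPIs that have fallen below their thresholds."""
    return [kpi for kpi, floor in THRESHOLDS.items() if metrics.get(kpi, 0) < floor]

# In practice these numbers would be recomputed from the monitored datasets each run.
todays_metrics = {"completeness": 0.91, "uniqueness": 0.995}
breaches = check_kpis(todays_metrics)
if breaches:
    print("Data quality alert:", breaches)  # ['completeness']
```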

No matter where organizations are in their data quality journey, the best place to focus their attention is on high-impact datasets, Zink said. 

“To maintain data hygiene,” he added, “make sure you have a solid system, good architecture and a use-case-centric focus.”

Frequently Asked Questions

What is data quality?

Data quality is the degree to which data meets standards like consistency, validity, accuracy and completeness.

What are examples of data quality metrics?

Examples of data quality metrics include accuracy, completeness, consistency, timeliness, uniqueness and validity. 
