Data integrity means that data has been collected and stored accurately and is contextually accurate for the model at hand. To maintain integrity, data must be collected and stored in an ethical, law-abiding way and must have a complete structure in which all defining characteristics are correct and can be validated.
Data integrity is used to assess the health and maintenance of any piece of digital information throughout its lifecycle. It can be viewed either as a state, meaning that the data set is valid, or as a process, meaning the measures taken to ensure the data set’s accuracy. In database management, data integrity also falls into one of four categories: entity integrity, referential integrity, domain integrity and user-defined integrity.
Why Is Data Integrity Important?
Data integrity is crucial to ensuring the validity, recoverability, traceability, connectivity, reusability and maintainability of data.
Data is one of the largest driving factors in decision making for organizations of all sizes. To create the insights that drive these decisions, raw data must be transformed through a series of processes that organize it and identify the relationships within it. Data integrity exists to ensure the data remains accurate and uncompromised throughout this process. Poor data integrity can lead to incorrect business decisions and distrust in the data-driven decision-making process, potentially causing critical harm to a company’s future.
Lack of data integrity may also have legal ramifications if data is not collected and stored in a legal manner, as outlined by international and national laws such as the General Data Protection Regulation (GDPR) and the U.S. Privacy Act.
Data can become compromised in a variety of ways:
- Human error, such as unintended alterations
- Errors in transferring
- Malware/hacker interference
- Disk crashes
- Bugs and physical device damage
- Illegal data collection
A thorough data integrity process is crucial. It should include strict data security measures, regular data backups and automated duplication, as well as input validation, access control and encryption.
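As a minimal sketch of the input validation mentioned above, the Python snippet below rejects malformed records before they reach storage. The `Order` record, its fields and the specific rules are hypothetical and chosen only for illustration; a real system would apply its own rules.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Order:
    # Hypothetical record type, used only for this illustration.
    order_id: int
    quantity: int
    order_date: date


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_order(raw: dict) -> Order:
    """Return an Order, or raise ValueError listing every rule that failed."""
    errors = []

    order_id = raw.get("order_id")
    if not _is_int(order_id) or order_id <= 0:
        errors.append("order_id must be a positive integer")

    quantity = raw.get("quantity")
    if not _is_int(quantity) or quantity < 1:
        errors.append("quantity must be an integer >= 1")

    try:
        order_date = date.fromisoformat(str(raw.get("order_date")))
    except ValueError:
        order_date = None
        errors.append("order_date must be an ISO date (YYYY-MM-DD)")

    if errors:
        raise ValueError("; ".join(errors))
    return Order(order_id, quantity, order_date)
```

Collecting all failures before raising, rather than stopping at the first one, lets whoever entered the data correct every problem in a single pass.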
What Are the Different Types of Data Integrity?
Physical integrity and logical integrity are the primary types of data integrity.
Physical Integrity
Physical integrity is the overall protection of the wholeness of a data set as it is stored and retrieved. Anything that impedes the ability to retrieve this data, such as power disruption, malicious disruption, storage erosion and a slew of additional issues, may cause a lack of physical integrity.
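One common way to detect this kind of damage, though not one the text above prescribes, is to record a cryptographic digest when data is written and compare it when the data is read back. The sketch below uses SHA-256 from Python's standard library; the function names are hypothetical.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_digest: str) -> bool:
    """Check retrieved data against the digest recorded at write time."""
    return sha256_digest(data) == expected_digest
```

A mismatch only reveals that the bytes changed. Recovering the original still depends on backups or replicas.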
Many companies outsource their data storage to cloud providers, such as AWS, to manage the physical integrity of the data. This is particularly useful for small companies that benefit from offloading data storage to spend more time focusing on their business.
Logical Integrity
Logical integrity ensures that data remains unchanged as it is used in a relational database. Like physical integrity, it helps protect against human error and malicious intervention, but it does so in different ways depending on its form.
Databases use four variations of logical integrity:
- Entity integrity
- Referential integrity
- Domain integrity
- User-defined integrity
Entity integrity relies on primary keys to identify each piece of data as a distinct entity, ensuring that no record is listed more than once and that no key value is null. This allows data to be linked to and used in a variety of ways.
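A minimal sketch of entity integrity, assuming an in-memory SQLite database accessed through Python's `sqlite3` module. The `customers` table and the helper names are hypothetical. The key column is declared `NOT NULL` explicitly because SQLite, unlike most database engines, otherwise tolerates null values in non-integer primary keys.

```python
import sqlite3


def make_customers_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        " customer_id TEXT PRIMARY KEY NOT NULL,"  # unique and never null
        " name TEXT NOT NULL)"
    )
    return conn


def try_insert(conn: sqlite3.Connection, customer_id, name) -> bool:
    """Return True if the row was stored, False if the database rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customers (customer_id, name) VALUES (?, ?)",
                (customer_id, name),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, inserting the same `customer_id` twice or passing `None` as the key is rejected by the database itself, so application code does not have to remember to check.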
Referential integrity is the series of processes used to store and access data uniformly. It allows rules about the use of foreign keys to be embedded in a database’s structure, so that data sets are combined consistently and meaningfully across the database. Critically, referential integrity makes it possible to combine tables within a relational database while keeping insertion and deletion practices uniform.
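The foreign key rules described above can be sketched as follows, again assuming SQLite through Python's `sqlite3` module with hypothetical `customers` and `orders` tables. SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3


def make_shop_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
    conn.execute(
        "CREATE TABLE customers ("
        " customer_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL"
        "   REFERENCES customers(customer_id))"  # every order needs a customer
    )
    return conn


def violates(conn: sqlite3.Connection, sql: str, params=()) -> bool:
    """Return True if the statement is rejected by an integrity constraint."""
    try:
        conn.execute(sql, params)
        return False
    except sqlite3.IntegrityError:
        return True
```

In this sketch, inserting an order for a customer that does not exist is rejected, and so is deleting a customer who still has orders, which is the uniform insertion and deletion behavior the paragraph describes.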
Domain integrity refers to the collection of processes that ensure the accuracy of each piece of data in a domain, where a domain is the set of acceptable values that a column may contain.
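One way to express a domain in a relational database is a `CHECK` constraint on the column. The sketch below assumes SQLite through Python's `sqlite3` module. The `products` table, its columns and the allowed values are hypothetical.

```python
import sqlite3


def make_products_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products ("
        " sku TEXT PRIMARY KEY NOT NULL,"
        # Domain: non-negative whole numbers only.
        " price_cents INTEGER NOT NULL"
        "   CHECK (typeof(price_cents) = 'integer' AND price_cents >= 0),"
        # Domain: one of a fixed set of labels.
        " status TEXT NOT NULL"
        "   CHECK (status IN ('active', 'discontinued')))"
    )
    return conn


def accepts(conn: sqlite3.Connection, sku, price_cents, status) -> bool:
    """Return True if the row falls inside every column's domain."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO products VALUES (?, ?, ?)",
                (sku, price_cents, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The `typeof` test is included because SQLite columns are loosely typed. Engines with strict column types would need only the range check.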
User-defined integrity consists of rules and constraints that users create so the data fits their specific purposes.
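As an illustration of a user-defined rule, the sketch below uses a database trigger to enforce a hypothetical business policy: an order can no longer be modified once it has shipped. It assumes SQLite through Python's `sqlite3` module. The table, trigger and policy are invented for this example.

```python
import sqlite3


def make_orders_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            status   TEXT NOT NULL,
            quantity INTEGER NOT NULL
        );
        -- Business rule: rows whose current status is 'shipped' are frozen.
        CREATE TRIGGER lock_shipped_orders
        BEFORE UPDATE ON orders
        WHEN OLD.status = 'shipped'
        BEGIN
            SELECT RAISE(ABORT, 'shipped orders cannot be modified');
        END;
        """
    )
    return conn


def update_quantity(conn: sqlite3.Connection, order_id: int, quantity: int) -> bool:
    """Return True if the update was applied, False if the rule blocked it."""
    try:
        with conn:
            conn.execute(
                "UPDATE orders SET quantity = ? WHERE order_id = ?",
                (quantity, order_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Unlike the other three categories, this rule says nothing about keys or column domains. It encodes a policy specific to one organization, which is what makes it user-defined.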