In my work as a data engineer, one of the most challenging and interesting problems I’ve faced is resolving divergence in data. Divergence refers to differences between the data results generated by old and new versions of our data architecture. Measuring and minimizing it is a simple heuristic that, when applied intelligently, allows data architectures to migrate quickly and with confidence.
In solving problems prior to my current work, such as developing CRUD applications or ETL frameworks, I primarily worked with code. Code is nice in some ways. It gives you things: stack traces that pinpoint exactly where you made a mistake, logs that tell you the historical activity of a service, control primitives to describe control flow exactly the way you want it, and unit tests and mocks that let you quickly and decisively evaluate the logic you are describing before deploying a change. All these features help create general-purpose, scalable, and reliable products.
What Is Data Divergence?
Now, however, I primarily deal with data, and these concepts have broken down for me a bit. At work, I primarily use GCP BigQuery and SQL, with data coming in at different points in our tech stack and through different frameworks. We’ve standardized on SQL for now because multiple departments can work with it, allowing for higher levels of collaboration. Prototyping is also much easier this way, allowing for shorter lead times in development.
SQL does come with trade-offs, though. As a domain-specific language, it’s tightly coupled with the data it’s meant to represent, and that makes it difficult to run integration tests and nigh-impossible to run unit tests, especially since BigQuery is cloud-native. If you get your SQL query wrong, you won’t see an error message, but rather faulty results. Since there are no hard errors, I’ve found myself relying on log messages to tease out what has happened more than I might like.
We’re still in the early days of designing our data architecture, and I’m mindful of keeping stakeholder confidence in our data as high as possible while also evolving this architecture to suit the needs of tomorrow. So, I’ve taken it upon myself to push for a “tick-tock” model for our data architecture development. In this approach, we first evolve the data architecture while maintaining the exact same data results (“tick”), then update the data results to meet updated requirements (“tock”). This approach balances the needs of stakeholders, who derive insights from the data and make other requests of the data engineering team, against internal engineering needs around infrastructure and operations.
Spotting Divergence
Now we get to the problem of divergence. In this context, I use the word “divergence” to denote differences between data results generated from old and new versions of our data architecture. For instance, when migrating from a batch processing pipeline to a streaming version that lets different engineering services read the data, we still want to preserve the reliability of our batch analytics, which maintains a high level of stakeholder trust and hence our own freedom of action in making trustworthy improvements. Minimizing this divergence is our goal for completing the “tick” portion of development.
In my experience, data will always exhibit at least some divergence, which is the result of multiple factors. Data refreshes at different rates depending on configuration options for the framework. We also collect data at different points in our tech stack, meaning that it contains different information depending on the application state. Finally, data may be lost if it passes through lossy stages of a data pipeline. For example, BigQuery SQL keeps timestamp values to microsecond precision (six subsecond digits). If the data gets piped through a JavaScript lambda that only keeps millisecond precision (three subsecond digits), the result will have millisecond precision, losing the last three digits.
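When comparing across a lossy stage like that, it helps to normalize both sides to the coarser precision before flagging differences. Here’s a minimal sketch of that idea in BigQuery SQL; the table and column names (`events_old`, `events_new`, `id`, `created_at`) are illustrative, not from our actual architecture:

```sql
-- Illustrative comparison that truncates both sides to millisecond precision,
-- so rows differing only in the microsecond digits dropped by the lossy stage
-- are not flagged as divergent.
SELECT old_events.id
FROM `project.dataset.events_old` AS old_events
JOIN `project.dataset.events_new` AS new_events
  USING (id)
WHERE TIMESTAMP_TRUNC(old_events.created_at, MILLISECOND)
   != TIMESTAMP_TRUNC(new_events.created_at, MILLISECOND);
```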
To document all this, we write divergence reports that detail how many records differ between the two systems, which columns or fields are affected by those differences, and what explanations we can provide to stakeholders regarding these changes for their review and approval. This adheres to our internal culture of documentation, allowing stakeholders to review whether changes to the data results are acceptable for their purposes and providing an avenue of communication and a platform for further discussion and collaboration around their needs. In practice, the act of creating these reports has kept our divergence to low levels and communicates exactly how low those levels are.
Calculating and Solving Divergence
So how do we calculate divergence? This calculation has three preprocessing steps:
- Select a window of data to compare against.
- Select the fields for comparison.
- Deduplicate each record, guaranteeing uniqueness.
First, we may want to get a window of data from both data sets to compare differences. Although this may not be necessary if your data set is small enough to query in its entirety, it becomes more important as the data set grows: we want to keep the comparison logic quick, and feasible without additional configuration in BigQuery. Filtering both result sets with a where clause over selected timestamp columns is how we carve out these windows.
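As a rough sketch of what such a window looks like, assuming both result sets carry a `created_at` timestamp column (the table names and the seven-day window are illustrative):

```sql
-- Restrict both result sets to the same recent slice (here, the last 7 days)
-- so the comparison stays cheap as the underlying tables grow.
SELECT *
FROM `project.dataset.results_old`
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);

SELECT *
FROM `project.dataset.results_new`
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);
```

If the tables are partitioned on that timestamp column, the same filter also prunes partitions, which keeps the amount of data scanned small.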
Simple things, like correlating timestamps, can get tricky. During certain backfill jobs, when I had to transform data from raw XML strings stored in the database to denormalized JSON for renormalizing in our data warehouse, the timestamps for the created objects matched the time of the backfill job rather than the time the data was first loaded. I had to join the original timestamps to the newly created table in order to provide accurate data. The implications of this issue, which include being able to join this data to other data sets, speak to the importance of other data engineering best practices, like “always append fields, never overwrite them” (which disambiguates creation and update times for backfills versus the original ingest), and how they intersect with this kind of work.
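A sketch of that repair, assuming the backfilled table and the original source share a stable business key (`record_id`) and the source still carries the original load timestamp (`loaded_at`); both names are hypothetical:

```sql
-- Re-attach the original load timestamps that the backfill job replaced,
-- appending them as a new field rather than overwriting the backfill time.
SELECT
  backfilled.*,
  source.loaded_at AS original_loaded_at
FROM `project.dataset.backfilled_records` AS backfilled
LEFT JOIN `project.dataset.source_records` AS source
  USING (record_id);
```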
We can’t compare every single column, since some data is almost certain to be different. For example, I’ve found that if you’re ingesting data using Stitch Data, it will create special `_sdc_*` columns within your BigQuery sink. This stems from Stitch Data’s policy of copying records at least once, but not exactly once, leaving it to you to deduplicate your own data. This differs from the BigQuery integration for Segment, where data is deduplicated by the integration itself. Therefore, we may filter for the set of columns we want to ensure are equal.
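In BigQuery, `SELECT * EXCEPT(...)` is a convenient way to express that filter. A sketch, with the caveat that the exact `_sdc_*` column names depend on your Stitch integration:

```sql
-- Drop the Stitch bookkeeping columns so they don't register as divergence;
-- adjust the EXCEPT list to match the _sdc_* columns in your sink.
SELECT * EXCEPT (_sdc_received_at, _sdc_sequence, _sdc_batched_at, _sdc_table_version)
FROM `project.dataset.stitch_sink_table`;
```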
One of the keys to checking whether two records are the same or different is to ensure you’re comparing one record with another, and not a group of records against another, slightly different group. I’ve found the concept of uniqueness quite powerful in making record comparison simple and quick to check. Keeping a unique index may be easy or difficult depending on your data. For example, when deduplicating tables managed by an ORM and copied via Stitch Data, it’s a matter of identifying the replication key, removing the Stitch Data-specific columns, and running a select distinct. When I had to create an index from scratch to match the key logic in an external datastore, uniqueness had to be guaranteed by hand in all join conditions, which took a lot longer and was harder to guarantee.
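For the easy case, the sketch below extends the previous one: strip the Stitch columns, and a `select distinct` over the remaining fields collapses the at-least-once copies so each replication key appears exactly once (again, the `_sdc_*` names and table are illustrative):

```sql
-- Deduplicate the Stitch copy: once the bookkeeping columns are removed,
-- repeated deliveries of the same record become identical rows, and
-- SELECT DISTINCT reduces them to one row per replication key.
SELECT DISTINCT * EXCEPT (_sdc_received_at, _sdc_sequence, _sdc_batched_at, _sdc_table_version)
FROM `project.dataset.stitch_sink_table`;
```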
After these preprocessing steps have been completed, the rest of the comparison is fairly easy (a sketch of the full query follows the list):
- Order the columns the same way for both sets of data.
- Union distinct the two datasets.
- Run an aggregate group-by over the unique index and filter for index values that appear twice (thereby indicating some difference in the following values) using `having count(*) > 1`.
- Left join the old and new datasets with the processed list of indexes into two separate tables.
- Union all the two datasets together, and order all columns so that duplicate data is easy to identify all at once and by eye.
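Putting the steps together, here’s a minimal sketch of what the whole report can look like, assuming a unique key `id` and two comparison fields, `field_a` and `field_b` (all names are illustrative):

```sql
WITH old_results AS (
  SELECT id, field_a, field_b            -- columns in a fixed order
  FROM `project.dataset.results_old`
),
new_results AS (
  SELECT id, field_a, field_b            -- same column order as old_results
  FROM `project.dataset.results_new`
),
combined AS (
  SELECT * FROM old_results
  UNION DISTINCT                         -- identical rows collapse to one
  SELECT * FROM new_results
),
diverged_ids AS (
  SELECT id
  FROM combined
  GROUP BY id
  HAVING COUNT(*) > 1                    -- an id appearing twice means its rows differ
)
SELECT 'old' AS source, o.*
FROM diverged_ids
LEFT JOIN old_results AS o USING (id)
UNION ALL
SELECT 'new' AS source, n.*
FROM diverged_ids
LEFT JOIN new_results AS n USING (id)
ORDER BY id, source;                     -- pairs of differing rows line up for eyeballing
```

Ordering by the key and then by source puts each diverging old/new pair next to each other, which is what makes the eyeball check in the last step quick.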
And voila! Your divergence report, as SQL, is complete.
Ultimately, this divergence calculation process is only relevant if you are using SQL. If your data sets grow to the point where it takes more time to run a `select *` than to warm up a virtual machine (from my experience, around 25 minutes), or if you are fetching miscellaneous data sources like videos or PDFs, then you may consider a fully fledged solution like Apache Beam/GCP Dataflow, or another solution that wraps a general-purpose language with an SDK and a tools ecosystem. I’m sure we’ll re-evaluate our approach in time; for now, though, this approach works very well for me.