If — pardon the cliché — data is the new oil, then data mapping is all about making sure the stuff actually flows through the pipeline correctly, and to the right destinations.
Traditionally, that refers to mapping source data to a target system, such as a database or data warehouse. But with the advent of laws like the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA) and Virginia’s new Consumer Data Protection Act (CDPA), data mapping now also has a privacy compliance context.
To know what data you have as an organization, you also need to know how it gets moved. So a back-end flow map that helps an organization comply with, say, GDPR might look similar to the old-school diagram of arrows pointing from source field to target field across data models.
“The goal of data mapping, loosely, is understanding what types of information we collect, what we do with it, where it resides in our systems and how long we have it for,” said Cillian Kieran, CEO and founder of Ethyca.
What Is Data Mapping?
Whether it’s privacy compliance or just good data-ingestion governance, a few good maps can clear up a lot of confusion.
“How do we ensure, when we define a policy for what we want to do with information, that we’re enforcing it across our systems? In order to enforce it, we’ve got to know where data exists,” Kieran said.
But that takes time and effort.
Data-lineage products have made life easier, as have tools that can define even the most bizarrely named data fields. But nothing’s fully replaced the task of getting under the hood and learning exactly how things connect.
“You just have to take the time to understand what the data actually represents within that field,” said David Wilmer, technical product marketing manager at Talend. “If the data that’s supposed to be in the field is there, and it’s just about connecting those dots, then it’s really about taking the time to understand what data is where and how it relates.”
What else makes for good data mapping? We asked three experts to chart the dos and don’ts.
- Dimitri Sirota: CEO and co-founder of BigID
- David Wilmer: technical product marketing manager at Talend
- Cillian Kieran: CEO and founder of Ethyca
Data Quality Is Key
David Wilmer: There are a number of tools that make it easier to profile data: Are these [fields] cities? Are these first names? Semantic dictionaries help define, for instance, what a city looks like, or what an address format looks like or — the simplest form — what an email address looks like. And if it doesn’t match those very basic criteria, then don’t classify this field or this data element a certain way.
That’s really what it comes down to — being able to dig into and profile the data. It starts with data quality: garbage in, garbage out. If your data is poor on the way in, and you’re not doing anything to address that, you’ll be making critical decisions off really bad data. So, for me, it really starts with creating that quality data, cleaning it up, standardizing it. That alone can also help improve and simplify the mapping process.
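Wilmer’s profiling idea can be sketched in a few lines. The patterns below are illustrative assumptions, not any vendor’s actual semantic dictionary: a field gets classified only when nearly all of its values match one rule, and otherwise stays unclassified rather than being guessed at.

```python
import re

# Hypothetical "semantic dictionary": a few regex rules describing what a
# value of each type should look like. Real profiling tools ship far
# richer rule sets; these patterns are illustrative only.
SEMANTIC_DICTIONARY = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
}

def profile_field(values, threshold=0.9):
    """Classify a field only if at least `threshold` of its non-empty
    values match one semantic pattern; otherwise leave it unclassified."""
    sample = [v for v in values if v]
    if not sample:
        return None
    for label, pattern in SEMANTIC_DICTIONARY.items():
        hits = sum(1 for v in sample if pattern.match(v))
        if hits / len(sample) >= threshold:
            return label
    return None  # don't classify the field if nothing clears the bar

print(profile_field(["a@b.com", "c@d.org", "e@f.io"]))       # 'email'
print(profile_field(["a@b.com", "c@d.org", "not-an-email"]))  # None
```

The threshold is the important design choice: it encodes “garbage in, garbage out” by refusing to label a dirty field rather than mislabeling it.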
Provide the Metadata
Cillian Kieran: What often doesn’t happen in [the extract, transform, load process (ETL)] is defining metadata around the transformation. For instance, moving “Name” to a field called “FName” along a new record set that’s identified differently — the translation of labels and fields. That’s problematic because it breaks the linear thread and makes it very hard to identify the relationship with, for instance, names that flow from a registration service into a data warehouse that are used to enrich some of the behavior data.
So, a data-engineering team could do a huge amount to mitigate problems if they simply provide metadata to the structure of the inputs and outputs. That’s essentially describing each of the fields, the types of information that each of the fields represents, on the input and on the output. One reason that’s really difficult for businesses to do is there’s no universally agreed standard in the engineering community for what personal information is and how it’s defined.
Describing what constitutes an email versus a name versus a Social Security number — those are obvious. When businesses get into custom data, those definitions become really difficult because they’re not universally understood. An engineer might have trouble deciding how to describe the data they’re collecting so that it can be appropriately labeled.
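The metadata Kieran describes can be as simple as a mapping table carried alongside the transform. The field and type names below are hypothetical; the point is that a rename like “Name” to “FName” stays traceable because the semantic type travels with the output.

```python
# Hypothetical field map carried through the transform. The rename
# ("Name" -> "FName") mirrors the example above; the semantic types
# are assumptions.
FIELD_MAP = [
    {"source": "Name",  "target": "FName",      "semantic_type": "person_name"},
    {"source": "Email", "target": "contact_em", "semantic_type": "email"},
]

def transform(record):
    """Rename fields per FIELD_MAP and return the output record together
    with metadata describing what each output field contains."""
    out = {m["target"]: record[m["source"]] for m in FIELD_MAP}
    meta = {m["target"]: m["semantic_type"] for m in FIELD_MAP}
    return out, meta

row, meta = transform({"Name": "Ada", "Email": "ada@example.com"})
print(row)   # {'FName': 'Ada', 'contact_em': 'ada@example.com'}
print(meta)  # {'FName': 'person_name', 'contact_em': 'email'}
```

With the metadata emitted alongside every output, a downstream warehouse can still tell that `FName` is a person’s name, which is exactly the “linear thread” the quote says gets broken.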
Maintain the Data Dictionary...
Kieran: You need a uniform understanding of what personal data types are. A dictionary needs to be available to engineers to understand easily. That might be like a glossary. It could also be a service — something they can use while they’re coding to pin a type of data or maybe a couple of sample records. Then the system responds with recommendations for the type of data that is there.
“You’ve sent us 100 sample records of data, and we think, based on the analysis, that it looks like the last four digits of credit card numbers,” for example. So the engineer doesn’t have to guess — the service makes the recommendation.
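A minimal sketch of such a recommendation service, assuming simple pattern rules (a real system would use much richer analysis than regexes): given sample records, it returns a suggested label and a confidence instead of forcing the engineer to guess.

```python
import re

# Hypothetical rules the service might check samples against; real
# classifiers would go well beyond regexes.
RULES = [
    ("cc_last4", re.compile(r"^\d{4}$")),
    ("email",    re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("ssn",      re.compile(r"^\d{3}-\d{2}-\d{4}$")),
]

def recommend(samples, min_ratio=0.95):
    """Return (label, match_ratio) for the best-fitting rule, or
    (None, 0.0) when nothing matches confidently enough."""
    best = (None, 0.0)
    for label, pattern in RULES:
        ratio = sum(bool(pattern.match(s)) for s in samples) / len(samples)
        if ratio >= min_ratio and ratio > best[1]:
            best = (label, ratio)
    return best

# "It looks like the last four digits of credit card numbers."
print(recommend(["4242", "0005", "9911", "1881"]))  # ('cc_last4', 1.0)
```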
... but Automate When Possible
Dimitri Sirota: [Taxonomies] created a huge burden on people who don’t necessarily know the data. They’re saying, “Well, we’re gonna start with the taxonomy and try to lump everything into it.” What if everything doesn’t naturally fit within these guardrails?
We’re one proponent, among others, of saying: “Why start with some mythical, almost Olympian model: Here are all the gods, and they have to fit into this? [Instead, I’m] combining the god of wine and the god of war because I didn’t create a separation in the beginning.” Let’s start with the reality, the messy complexity of the data.
If you can know your data up front, let machines recommend what those data definitions should be. Also, reconcile places where you see collisions or potential discrepancies — m.mail is not the same as email, and so forth.
To make data governance less reliant on these supernatural data stewards with a mythical ability to understand the data correctly, you’re going to need to start from the data and build up, using machine learning and other methodologies to prescribe the right set of data definitions. You’re going to see this kind of inversion, where now it’s possible to truly build a data inventory, not just at a metadata level, but at a data value level — literally know what’s in every database, MongoDB, SQL Server, Salesforce. It’s going to become easier to remove that burden on data stewards to come up with these lexicons. Then use those stewards more for validation and less for creation.
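One way to reconcile collisions like m.mail versus email is fuzzy matching against a canonical lexicon, with the steward validating suggestions rather than inventing them. This is a rough sketch using Python’s standard library; the canonical names and the similarity cutoff are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical canonical lexicon of agreed data definitions.
CANONICAL = ["email", "first_name", "postal_code"]

def normalize(name):
    # 'm.mail' -> 'mail', 'first_name' -> 'firstname'
    return name.lower().split(".")[-1].replace("_", "").replace("-", "")

def suggest(field, cutoff=0.6):
    """Suggest the closest canonical definition for a raw field name,
    or None if nothing is close enough; a steward validates the result."""
    scored = [(SequenceMatcher(None, normalize(field), normalize(c)).ratio(), c)
              for c in CANONICAL]
    score, best = max(scored)
    return best if score >= cutoff else None

print(suggest("m.mail"))  # 'email' -- flagged for steward validation
print(suggest("fname"))   # 'first_name'
```

Note that `m.mail` and `email` are not treated as the same field; the machine only recommends the merge, and the human confirms it, which is the validation-over-creation role described above.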
Make It a Habit
Wilmer: [Adding new data sources] can definitely have more downstream effects, [for example, if] it’s a new field that you need to include, [you’d need to] proliferate that down the [system]. Data mapping is not a one-and-done proposition. It has to be a living, breathing, evolving activity that the data-integration team has to continue to do. You can’t just write it down on paper and tuck it away in a filing cabinet. Even existing sources may add a field or change the way they store the data. So it always has to be top of mind.
Maintain the semantic dictionary and the dynamic schema that automatically take that new field into account and can adjust. Those technologies really aid the data mapper, but it still needs to be addressed on a regular basis — not an exercise, but more of just a habit.
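A dynamic-schema sketch along those lines: when a source starts sending a field the registry hasn’t seen, capture it and queue it for review instead of silently dropping it. The registry shape here is an assumption, not any particular product’s API.

```python
# Hypothetical schema registry. "Dynamic" here just means new fields
# are captured and flagged for the data team rather than lost.
SCHEMA = {"order_id": "string", "total": "float"}

def ingest(record, schema=SCHEMA):
    """Accept the record, registering any fields the schema hasn't seen."""
    unknown = set(record) - set(schema)
    for field in unknown:
        schema[field] = "unclassified"  # queue for the data team to map
    return record, sorted(unknown)

_, new_fields = ingest({"order_id": "A1", "total": 9.99, "coupon": "SPRING"})
print(new_fields)        # ['coupon'] -- the source added a field
print(SCHEMA["coupon"])  # 'unclassified'
```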
Don’t Collect Too Much Too Fast
Wilmer: I would hope people are wiser to the fact that you can’t bring all these sources in at once. With a small team, you can’t bite off more than you can chew. Get that data source that’s really going to drive your business, and then bring in supplementals that maybe give you more insight into your customers or just help simplify your end-to-end product delivery. It can definitely be a wake-up call for an organization that’s just starting out to say, “Maybe we need to take a step back.” You can’t just go all in with a team of three, start bringing in 12 different sources of data and try to map it all.
Kieran: The most common issue is a business that hasn’t thought about this because it’s just trying to scale, and it suddenly accumulates a lot of customer data. And now you have to go backwards — a huge amount of refactoring to understand what you’ve collected and where. By punting, you turn it into a very difficult thing to fix.
Consider, say, a small e-commerce website that collects registered user data and a bit of analytics information. It scales, brings in a business-intelligence team and wants to collect more event data through the checkout flow. So it starts to collect the location of the user, maybe mouse location on the screen and propensity to click on certain products. It mashes that into some other data warehouse, enriches it with data that maybe comes from a third-party provider. You have all of these data sources.
The average D2C commerce business in the U.S. has more than 60 data sources flowing into its data warehouse.
ETL and Data Lineage
Sirota: Historically, you would try to get a lineage of how your data progresses by examining the ETL. This goes back to the Informatica days, when they first innovated around load and transform. The problem with that was twofold. One, there’s not always an ETL. In the enterprise, sometimes files just move from system to system without a formalized ETL process. You do if you’re moving to a data warehouse and various data stores. But that’s not universal. And typically, when you’re thinking about data mapping, you want something more comprehensive.
The other problem with ETL is [organizations are] very dependent on whoever builds that ETL, so not everybody is able to mine those systems to understand exactly what they’re doing. There have been attempts and efforts to improve upon data lineage. I would say that it’s still a work in progress.
There are data-lineage products attempting to do a variety of things like mining ETL. Some rely on metadata. ASG has a product that mines code to try and interpret [lineage]. We’re introducing a product in the second half of the year that does something completely different around genetic algorithms.
Limit Access and Mask Personally Identifiable Information
Wilmer: The first and simplest thing is to implement some sort of role-based responsibility. Limit the number of people with access to that important information. That’s your first line of defense.
Next, I would suggest making sure you have some sort of ability to mask that data, so that you can give it to your data scientists, and they can do whatever analysis they need to do without knowing [personally identifiable information (PII)]. Those are the two basic, pretty easily implementable things.
From our standpoint at Talend, we think the best approach would be to implement a data catalog. That gives you the full breadth of those capabilities — it gives you the lineage of data, what’s happening, who’s touching it, where it’s going. Those are the big three. And as you grow as an organization and get more data, that’s the natural progression of approaching governance.
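Wilmer’s masking suggestion might look like the sketch below: replace identifiers with a deterministic pseudonym so analysts can still join records without ever seeing the raw PII. The field names are assumptions, and a production system would use a keyed hash (HMAC) or tokenization, since an unsalted digest of low-entropy data can be reversed by guessing inputs.

```python
import hashlib

# Sketch of a masking pass; field names are hypothetical. A deterministic
# digest lets analysts join on the masked value without seeing raw PII.
# Production: prefer a keyed hash (hmac) or tokenization over a bare digest.
def mask_pii(record, pii_fields=("email", "full_name")):
    masked = dict(record)
    for field in pii_fields:
        if masked.get(field) is not None:
            digest = hashlib.sha256(masked[field].encode("utf-8")).hexdigest()
            masked[field] = digest[:12]  # stable, non-reversible pseudonym
    return masked

row = {"email": "ada@example.com", "full_name": "Ada Lovelace", "clicks": 7}
print(mask_pii(row))  # clicks survive; email and full_name become pseudonyms
```

Because the pseudonym is deterministic, the same customer hashes to the same value across tables, so the analytics described here still work after masking.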
Kieran: The way we try to help businesses that are trying to grok data privacy compliance for the first time is to break it down into a set of simple modules. The first thing you want to do is data mapping. Once you have that map, you want to maintain it. With that in place, you then want to be able to offer a user their rights over their information — data subject rights requests.
The second thing is to implement a process or automated tooling to simplify that — either an automated solution, or at least a process that’s documented that you can support and scale as requested.
And the third thing is consent. This is the idea that the state may decide that a user has a right to opt in or opt out. If someone opts out of behavioral analysis being done on their data, the company doing that analysis needs to identify where their data is, how it’s flowing into those warehousing and analytics tools, and suppress it from those processes. That’s not cookie consent banners; that’s deep technical integration.
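The suppression Kieran describes could be a filter stage ahead of the warehouse load. The consent store and event shape here are hypothetical; in practice the opted-out set would come from a consent service backed by the data map.

```python
# Hypothetical consent check ahead of the analytics load. In practice the
# opted-out set would come from a consent service, not a literal.
OPTED_OUT = {"user_42"}

def suppress_opted_out(events):
    """Yield only events whose user has not opted out of behavioral analysis."""
    for event in events:
        if event["user_id"] not in OPTED_OUT:
            yield event

events = [
    {"user_id": "user_41", "event": "checkout_click"},
    {"user_id": "user_42", "event": "checkout_click"},
]
print(list(suppress_opted_out(events)))  # only user_41's event remains
```

This is the “deep technical integration” point: the filter can only exist if the data map already tells you which pipelines carry behavioral data in the first place.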
Realize You Might Have More PII Than You Think
Kieran: It’s very easy to naively assume a business doesn’t collect a lot of personal information. Data drives competitive advantage, so you’d be surprised by the number of businesses that assume that they have accumulated a relatively small amount of personal information and therefore a small risk exposure.
The truth is, most businesses, whether they’re just using Google Analytics or email marketing, are actually accumulating a substantial amount of personal information. They might not realize just the degree of risk that that exposes them to. And that’s common to everybody. Most businesses other than very large, late-stage tech companies or enterprises are undereducated [about] that risk.