How many different ways can people plausibly misspell “Chicago”? At least 30, it turns out. What about “Philadelphia”? Nearly double.
Steven Rich, the database editor for investigations at the Washington Post, pointed out the multitude of misspellings, scattered within the Paycheck Protection Program loan data released earlier this year, by way of warning: Anyone hoping to analyze the data set had also better set some time aside for cleaning.
But data cleaning — also known as data wrangling, data munging or whichever often interchangeable label you prefer — is far from just glorified spell-checking.
What Is Data Wrangling?
Data quality issues in business contexts range from incorrect formats to null or missing values, which may need to be imputed or dropped altogether. In machine learning, you also have to be careful not to over-clean the data, lest the signal get scrubbed away. Data divergence causes headaches too.
“Engineering might make changes to a schema and how data comes through,” said Connor Carreras, director of adoption and enablement at Trifacta. “So in order to standardize that data for analysis, you sometimes have to deal with maybe three different versions of the data in the database and, essentially, consolidate that.”
That’s right. Even a company that makes data preparation software isn’t immune to such challenges.
We reached out to Carreras and three additional data science experts for some tips to keep in mind — and common pitfalls to avoid — when wrangling messy source data into model-ready shape.
- Connor Carreras: Director of adoption and enablement at Trifacta.
- Daniel Grzenda: Data scientist at the University of Chicago’s Center for Data and Computing.
- Javed Ahmed: Senior data scientist at Metis.
- Michael Guadarrama: Founder and CEO at Coegil.
Profile Your Data for Gaps
Connor Carreras: One issue we didn’t realize we had until we started the data-wrangling process at Trifacta was gaps in terms of features that are represented in the data, just based on the year or the month of the data — because some of these features were introduced later on in the pipeline, but that historical data is missing. Do you drop those features entirely? Do you guess or impute what it should be? Engineering orgs churn, so sometimes there’s nobody who knows exactly why the data looks like it does — or even that it does look the way it does — which was the biggest issue for us. So being able to profile the data as part of the data-wrangling or data-preparation flow is really critical.
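Profiling for this kind of gap can be sketched in a few lines of plain Python. The records, field names and grouping key below are hypothetical; a real pipeline would run this kind of check inside a profiling tool rather than by hand. The idea is simply to report, per year, what fraction of rows actually carry each feature:

```python
from collections import defaultdict

# Hypothetical usage records: a feature ("plan_tier") introduced
# mid-stream simply doesn't exist in older rows.
records = [
    {"year": 2019, "user": "a", "clicks": 10},
    {"year": 2019, "user": "b", "clicks": 7},
    {"year": 2020, "user": "a", "clicks": 12, "plan_tier": "free"},
    {"year": 2020, "user": "c", "clicks": 3, "plan_tier": "pro"},
]

def profile_feature_coverage(rows, group_key):
    """For each group, report what fraction of rows carry each feature."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    all_fields = set()
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        for field in row:
            counts[group][field] += 1
            all_fields.add(field)
    return {
        group: {f: counts[group][f] / totals[group] for f in sorted(all_fields)}
        for group in totals
    }

coverage = profile_feature_coverage(records, "year")
# plan_tier is absent from every 2019 row and present in every 2020 row
print(coverage[2019]["plan_tier"])  # 0.0
print(coverage[2020]["plan_tier"])  # 1.0
```

A coverage of 0.0 for an entire year is exactly the signal that a feature was introduced later in the pipeline, which then forces the drop-or-impute decision Carreras describes.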
Nulls: To Drop or Not to Drop?
Carreras: Depending on the column and type of data, this is where you get into more of the art of data wrangling than the science. Our particular missing values were primarily in categorical columns. And there’s just not a great way to automatically infer what category should be placed into that column — beyond eyeballing the data and essentially making an educated guess for what these values should be.
If you’re dealing with numerical data, there are better methods for filling those missing values. One way is to leverage statistics — medians, averages, standard deviations. But when it comes to missing a label, you essentially have to figure out: Do we assign a label arbitrarily? Do we leave the nulls in the data and essentially assign a meaning to the nulls? Or do we drop the records entirely?
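Both options Carreras mentions can be sketched with the standard library alone; the column names and sentinel label below are made up for illustration. For a numeric column, fill gaps with a statistic like the median; for a categorical column, one defensible choice is to assign the nulls an explicit meaning rather than guessing a category:

```python
from statistics import median

def impute_numeric(values):
    """Fill numeric gaps with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def label_nulls(values, sentinel="unknown"):
    """Categorical columns: give nulls an explicit meaning instead of guessing."""
    return [sentinel if v is None else v for v in values]

ages = [34, None, 29, 41, None, 38]
print(impute_numeric(ages))              # median of observed values is 36.0
print(label_nulls(["pro", None, "free"]))
```

The third option — dropping the records entirely — is a one-liner filter, which is part of why it is so tempting.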
In some of our earliest data, we ended up dropping nulls. Most of our data quality issues originated from a beta version of a product that sent usage data — missing features, deep format problems. We ended up determining those records weren’t going to add any value to our analysis, so let’s just drop them entirely.
Daniel Grzenda: Missing, null and zero are usually things you want to address. Is zero a reasonable value? Is missing the same as null? Sometimes, as in sensor data, null can be “made a reading, didn’t record it,” versus missing, which could have been “sensor’s completely off and didn’t even speak to the internet.” So think about those three and how they work in what you’re trying to do.
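The three-way distinction Grzenda draws can be made concrete with a hypothetical sensor feed, where an absent field, an explicit null and a literal zero each mean something different. The field name and records here are invented for illustration:

```python
def classify_reading(record, field="temp_c"):
    """Three situations a naive cleaner would collapse into one:
    the sensor reported a value (zero counts!), the sensor reported
    but recorded nothing (null), or the sensor never phoned home (missing)."""
    if field not in record:
        return "missing"   # device offline: no contact at all
    if record[field] is None:
        return "null"      # device reported, but didn't record a reading
    return "value"         # includes a perfectly legitimate zero

readings = [{"temp_c": 0.0}, {"temp_c": None}, {}]
print([classify_reading(r) for r in readings])  # ['value', 'null', 'missing']
```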
Institute Daily Conditional Checks
Carreras: We have a two-part pipeline. First, the data loads into intermediate staging tables — there, it also produces a table of quality checks. Some checks are automatic — things like the percentage of missing values in a column or the number of mismatches based on data type. And some rules are more custom — for example, if a user is on the free tier, then one of two particular products should be listed in another column. Based on the output of those quality checks, you can go in daily, eyeball that table and verify everything is within the expected threshold. Then, based on those quality checks, the actual load of that data-wrangling process can continue.
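A single row of such a quality-check table might be computed like this. The column name, staged records and thresholds are hypothetical; the point is that each check produces a value plus a pass/fail flag that gates the downstream load:

```python
def quality_checks(rows, column, expected_type, max_missing_pct=5.0):
    """Build one row of a quality-check table for a staged column:
    percent missing, percent failing a type check, and a gate flag."""
    total = len(rows)
    missing = sum(1 for r in rows if r.get(column) is None)
    mismatched = sum(
        1 for r in rows
        if r.get(column) is not None and not isinstance(r[column], expected_type)
    )
    missing_pct = 100.0 * missing / total
    return {
        "column": column,
        "missing_pct": missing_pct,
        "mismatch_pct": 100.0 * mismatched / total,
        "ok": missing_pct <= max_missing_pct,
    }

staged = [{"amount": 10.0}, {"amount": "12.5"}, {"amount": None}, {"amount": 7.0}]
check = quality_checks(staged, "amount", float, max_missing_pct=30.0)
print(check)  # 25% missing, 25% type-mismatched, within the 30% threshold
```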
Clean, but Don’t Over-Scrub
Carreras: One way to make sure you don’t over-scrub is by not over-aggregating data. Retaining a more granular level, at least as an initial pass, can give you enough information to start doing analytics and machine learning. If you’re making a dashboard or report, you want that aggregated data; you’re looking at broader trends over time. But to do something more predictive, you want more details.
We bumped into that pitfall. We have product usage data at the level of the individual user, and, at one point, we were interested in rolling up to an account level [view]. When you do that, you lose all the per-user interactions. Now you’re dealing with an entire company; you have these aggregated metrics. For our project, that was less valuable than actually having the very granular per-day, per-user data.
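The pitfall is easy to see in miniature. With hypothetical per-user, per-day usage events, rolling up to the account level produces one tidy dashboard number, and every per-user interaction is gone and cannot be recovered from it:

```python
from collections import defaultdict

events = [
    {"account": "acme", "user": "u1", "day": "2024-01-01", "actions": 5},
    {"account": "acme", "user": "u2", "day": "2024-01-01", "actions": 1},
    {"account": "acme", "user": "u1", "day": "2024-01-02", "actions": 8},
]

def roll_up(rows, key):
    """Aggregate action counts up to the given grain."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[key]] += r["actions"]
    return dict(totals)

# Account level: fine for a report on broad trends...
print(roll_up(events, "account"))  # {'acme': 14}
# ...but the per-user signal only exists at the finer grain.
print(roll_up(events, "user"))     # {'u1': 13, 'u2': 1}
```

Keeping the granular table and aggregating on demand preserves both uses; aggregating first forecloses one of them.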
Grzenda: There is a signal that you can’t lose. You have to be very careful about what data you drop. We think a lot about this when we’re dropping data.
If you’re going to drop a column, make sure that whatever signal might have been carried in that column is somewhere else in the data, whether that’s a covariate column that carries that same variance, or you’ve generated columns that decompose the variance.
This happens a lot when data is missing. What do you fill in? People talk about the mean, the median, the mode. But that random mean that shows up a bunch of times isn’t real data, but actually imputed [data]. So you also have to have another column that captures all that was missing. So sometimes we decompose columns to make the variance more available to our model, and then we drop the original.
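Grzenda's point — that imputed values aren't real data, so the "was missing" signal needs its own column — can be sketched as a mean-impute paired with an indicator column. The column and values below are invented for illustration (libraries like scikit-learn offer the same pattern via an `add_indicator` option):

```python
from statistics import mean

def impute_with_indicator(values):
    """Mean-impute a numeric column, but keep a sibling indicator column
    so the model can still see which values were real and which were filled."""
    was_missing = [v is None for v in values]
    fill = mean(v for v in values if v is not None)
    filled = [fill if v is None else v for v in values]
    return filled, was_missing

revenue = [100.0, None, 300.0, None]
filled, was_missing = impute_with_indicator(revenue)
print(filled)       # [100.0, 200.0, 300.0, 200.0]
print(was_missing)  # [False, True, False, True]
```

With the indicator retained, the original column can safely be dropped or transformed without losing the variance it carried.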
Ask: Who’s Doing the Cleaning?
Carreras: The degree of plug and play versus custom code depends on the user profile. We view data wrangling as something that spans a variety of different personas in an enterprise, with different levels of technical skill — business analyst, data analyst, data engineer and data scientist. Generally, the degree of custom code maps closely to the level of technical skill of the end user. When you start moving down that technical skill level, it becomes more important to have everything out of the box and easily available to quickly produce your recommendation reports.
So if an analyst can have a button that says, “calculate change over time for X group,” that’s awesome. You’re going to choose that instead of writing five, six, seven lines to group the data, do aggregations, roll it back up and figure out the difference. For more technical users, strike the right balance between the freedom of code — SQL and Python — and the convenience of, “I’m doing the same thing over and over; I just want to pop something in, get my job done and move on to more interesting work.”
Map Who Does What in Your Pipeline
Carreras: When you have a variety of people all working with the same data, sometimes the cross-functional collaboration can be difficult, especially when everybody is working in different tools and has different levels of skill or knowledge of data operations. So when you’re composing your data strategy, figure out who’s going to own which part of the data-wrangling pipeline. And how much overlap do you anticipate having between functions?
Otherwise, it’s much more difficult to untangle the mess later. We had stuff in Excel, we had stuff in our own Trifacta tool, and we had views of views of views in BigQuery. Then, when each department’s analytics lead wanted to reuse someone else’s work — if that lead is used to working in Tableau and they’re handed a set of 11 interlocking views, they’ll essentially throw their hands up in the air and redo it all in a tool they understand.
Automate, but Audit
Grzenda: We think about auditing our automation all the time while working with Inclusive Development International, which monitors projects for human rights violations and tracks who invested in those projects. We built a scraper that supplies them with a database they search through. But how do we know when something breaks? A lot of that is record linkage.
Once the data is clean, and you’re trying to match what you know in your database against new information, you can measure which rows exist now that also existed in the previous version. If rows were updated, did they gain cells? Did they lose cells? Are you gaining information overall with this update? If not, something’s probably broken.
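That audit reduces to counting populated cells before and after a refresh. A minimal sketch, with invented records standing in for the scraped database:

```python
def nonnull_cells(rows):
    """Count populated cells across a batch of records."""
    return sum(1 for row in rows for v in row.values() if v is not None)

def audit_update(old_rows, new_rows):
    """Did a refresh gain information overall? If the new batch carries
    fewer populated cells, something upstream probably broke."""
    before, after = nonnull_cells(old_rows), nonnull_cells(new_rows)
    return {"cells_before": before, "cells_after": after, "gained": after >= before}

old = [{"id": 1, "investor": "X"}, {"id": 2, "investor": None}]
new = [{"id": 1, "investor": "X"}, {"id": 2, "investor": "Y"},
       {"id": 3, "investor": None}]
print(audit_update(old, new))  # {'cells_before': 3, 'cells_after': 5, 'gained': True}
```

A `gained: False` on a source that only ever grows is the kind of automated red flag that catches a broken scraper before anyone queries bad data.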
As far as record linkage, you’re scraping a really rough data source. They could put a space in their HTML table, and it literally offsets everything below that space. You have to catch things like that.
Javed Ahmed: Everybody wants to automate. It’s tricky, because the more you automate, the further you get from the manual validation that needs to happen. On the other hand, you have to automate to some extent. So it’s hard to find a one-size-fits-all solution.
Using a tool that streamlines automation doesn’t replace the validation process. But once you’ve figured out the quirks in the data and understand how your tool processes the data — and you can compare a sample of cleaned data to the raw, messy source data and do the right testing — then a tool like that can really streamline a lot of the intermediate steps. But the trick is also getting the data into the tool in the right format.
Let Your Model Dictate How You Handle Outliers
Ahmed: Whether an outlier is legitimate or a data error, it can have bad consequences for many models either way. So you may incorporate a function that smooths out historical shocks at one particular part of the workflow — but at another point in the pipeline, you may not want to apply it.
For example, if you’re doing fraud detection, you probably don’t want to smooth your data so much, because you want to really identify the outliers. But if you’re doing forecasting, sometimes those outliers can really hamper the forecast because they’re not representative. They may have been driven by some other process. So smoothing data is really helpful for forecasting models — because you’re trying to forecast the predictable part, and, by nature, the outliers are less predictable.
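One common smoothing function of the kind Ahmed describes is a rolling median; the window size and the toy sales series below are illustrative choices, not a prescription:

```python
from statistics import median

def rolling_median(series, window=3):
    """Smooth historical shocks for a forecasting input. For fraud
    detection you'd skip this step and inspect the raw spikes instead."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(median(series[lo:hi]))
    return out

daily_sales = [10, 11, 10, 95, 12, 11]  # 95 is a one-off shock
print(rolling_median(daily_sales))      # the spike is flattened away
```

Running the same series through both branches of a pipeline — raw for anomaly detection, smoothed for forecasting — is exactly the "apply it here, not there" distinction.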
When and How to Scale Up
Michael Guadarrama: You can go pretty far with Pandas. But there’s a point at which you’re limited by the box. If you can’t scale your box enough to load the data frame into memory, you have to move to a distributed MapReduce-style solution — something like PySpark, Databricks or EMR. And once you cross that bridge, it kind of sucks, because the code’s not the same. Even the way you process the data sets isn’t the same.
So we encourage people to generalize as much of the processing as possible and use EMR/MapReduce solutions only for heavy lifting. We tried to limit MapReduce or PySpark to just the very beginning of the process, and only when we needed to do a big aggregation, and then get back to the Pandas processing.
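The shape of that advice — one streaming "heavy lifting" step up front, ordinary in-memory processing afterward — can be sketched in plain Python. The chunked iterator stands in for whatever distributed engine does the real work; the records and keys are invented:

```python
from collections import defaultdict
from itertools import islice

def iter_chunks(rows, size):
    """Yield fixed-size chunks so we never hold the full data set at once."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def big_aggregation(rows, size=2):
    """The one heavy-lifting step: stream through data too big for one box,
    producing a small aggregate you can take back to in-memory processing."""
    totals = defaultdict(int)
    for chunk in iter_chunks(rows, size):
        for r in chunk:
            totals[r["region"]] += r["sales"]
    return dict(totals)

raw = [{"region": "east", "sales": 5}, {"region": "west", "sales": 2},
       {"region": "east", "sales": 3}, {"region": "west", "sales": 4}]
print(big_aggregation(raw))  # {'east': 8, 'west': 6}
```

The aggregate that comes out the other side is small enough for Pandas, which is the point: keep the distributed code confined to this one step.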
Create Segments and Modules
Guadarrama: I see this all the time: a big hairball of code that’s pulling data, cleaning up missing values and making sure everything is going to be nice and normal before it hits your algorithm. But if you don’t segment responsibilities, when it comes time to make a change, you don’t know where to make it.
Imagine processing a petabyte of data and making a mistake. It would suck to have to start from the very beginning and reprocess everything. If you segment the processing and store intermediate results, you can just rewind to the last step.
Create distinct modules for each step in the process: acquisition, cleansing, normalization, synchronization, transformation and enrichment.
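A minimal sketch of that structure, assuming JSON-serializable intermediate results and using invented module names: each step persists its output, so a failure in a later module lets you rewind to the last checkpoint instead of reprocessing from the start:

```python
import json
import os
import tempfile

def checkpointed(step_name, func, data, workdir):
    """Run one pipeline module, persisting its output. On rerun, a step
    whose checkpoint already exists is loaded instead of recomputed."""
    path = os.path.join(workdir, f"{step_name}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = func(data)
    with open(path, "w") as f:
        json.dump(result, f)
    return result

# Two toy modules: cleansing, then transformation.
def cleanse(rows):
    return [r for r in rows if r.get("value") is not None]

def transform(rows):
    return [{**r, "value": r["value"] * 2} for r in rows]

with tempfile.TemporaryDirectory() as workdir:
    raw = [{"value": 1}, {"value": None}, {"value": 3}]
    cleaned = checkpointed("cleansing", cleanse, raw, workdir)
    final = checkpointed("transformation", transform, cleaned, workdir)
    print(final)  # [{'value': 2}, {'value': 6}]
```

At petabyte scale the checkpoints would live in object storage rather than a temp directory, but the control flow is the same.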
Embrace the Encoding Nuance
Ahmed: Problems with encoding usually surface late in the process, when you start to put data into the model.
A recommendation system might have a categorical variable with values one through five. Those values are categorical, but also ordinal. That can be tricky, because some models will assume the difference between a one and a two is the same as the difference between a three and a four.
If you have a categorical variable like red, green and yellow, the solution is straightforward. You create indicator variables for each category. You can do the same for ordinal variables. But you sometimes may want to leave it as one variable, as opposed to creating a bunch of indicator variables. There can be an art to that — there’s not necessarily an objective answer.
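Both encodings can be sketched in a few lines; the category and level names are hypothetical. Indicator (one-hot) variables suit an unordered categorical like color, while an ordered categorical can stay a single integer column — with the caveat Ahmed notes, that the model then treats adjacent gaps as equal:

```python
def one_hot(values, categories):
    """Indicator variables for an unordered categorical."""
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

def ordinal(values, order):
    """Keep an ordered categorical as one integer variable. Note the model
    will treat the gap between adjacent levels as equal-sized."""
    rank = {v: i for i, v in enumerate(order)}
    return [rank[v] for v in values]

colors = ["red", "green", "red"]
print(one_hot(colors, ["red", "green", "yellow"]))
print(ordinal(["low", "high", "medium"], ["low", "medium", "high"]))  # [0, 2, 1]
```

Whether a one-to-five rating becomes five indicator columns or one ordinal column is exactly the judgment call with no objective answer.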
Beware the Black Box
Guadarrama: Figuring out where a process may have gone wrong is impossible if you don’t document your code. You have to comment, and you have to avoid things that make the processing opaque. For example, when working in finance, we would never use portfolio optimizers. Not because it was a bad mathematical technique, but because of the low explainability factor.
It’s tempting to use these things, because they’re kind of cool. And they seem to give you answers that you like. In our world, we try to avoid using neural networks except for basic classifications, because, for the same reason, you lose a lot of your explanatory power.
Remember: It’s All About Quality Assurance
Grzenda: You’re trying to guarantee to people downstream that what you did worked in a predictable and replicable way, so that if they want to make a tweak, it will have the expected effect on that data set before it gets passed to the modeler. In all our projects, we stress double checking. It’s like test-driven development but for data. What checks can you run at the end of your data wrangling to know that what you did worked as expected? You can’t look over every piece of data, so you want to be sure that edge cases are caught.
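A sketch of what such end-of-wrangle checks might look like, with invented field names and ranges: a validator that returns a list of failures, so an empty list is the downstream guarantee that the pipeline behaved as expected:

```python
def validate_output(rows):
    """Test-driven development, but for data: checks run at the end of
    wrangling so downstream users get predictable, replicable output."""
    failures = []
    if not rows:
        failures.append("output is empty")
    for i, r in enumerate(rows):
        if r.get("user_id") is None:
            failures.append(f"row {i}: null user_id")
        if not (0 <= r.get("score", -1) <= 100):
            failures.append(f"row {i}: score out of range")
    return failures

good = [{"user_id": 1, "score": 88}, {"user_id": 2, "score": 0}]
bad = [{"user_id": None, "score": 200}]
print(validate_output(good))  # [] — safe to hand downstream
print(validate_output(bad))   # two failures caught before the modeler sees them
```

Checks like these can't inspect every value by eye, but they do catch the edge cases automatically — which is the whole point.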