If such a thing as exaggerated humility can exist, that seems to be a defining characteristic of data engineering teams.
Every metaphor to describe the work employs a wind-beneath-my-wings construction. The data engineer is the modest plumber whose piping allows analysis and insights to flow freely. Or they’re the dutiful race-car mechanic, building and maintaining the engine that powers the more glamorous driver. And there’s also the classic comparison to the hierarchy of needs, with AI representing self-actualization and infrastructure being just the basics: food, water and shelter.
There is, of course, plenty of truth to that. Data engineers, broadly speaking, are responsible for maintaining data systems and frameworks, and they do often build out the pipelines that data scientists utilize. But their work can have a big impact.
Consider Shopify’s star turn.
The commerce platform has seen its stock skyrocket after businesses had to pivot to online sales and incorporate a variety of new features to respond to the pandemic. The Ottawa-based upstart “saved Main Street,” as The Markup put it, not simply by being in the right racket at the right (read: terrible) time; it also represented “by far the most comprehensive and streamlined” option for payment processing and sales and inventory management.
A not-insignificant part of that success stems from the company’s data engineering practices. Those notably include unit testing on every data pipeline job, company-wide query-ability of data, a rigorous approach to data modeling and a safeguarding system that verifies every input and output.
Erik Wright isn’t a data engineer by title at Shopify — he’s a data development manager. But his work intersects with the overall data engineering ecosystem. Lately, that work means adapting the playbook to help so many merchants survive and thrive.
“There are many groups within Shopify trying to launch either new features or accelerate [existing] features that will help things, like curbside pick-up,” he said with some (on brand) understatement.
“Many of those, when they’re powered by data, can be risky and challenging to build,” he added. “Data pipelines are quite complicated to get right.”
Here’s how they do it.
Data Pipelines and ETL
Shopify updates its data science and engineering blog only about once per month, but people in the industry pay attention to these posts. As an ever-expanding, data-heavy enterprise, the company’s experience in how to scale resiliently — like moving from sharded databases to “pods” — has plenty of educational value.
Data types certainly took notice in June, when Marc-Olivier Arsenault, data science manager at Shopify, outlined 10 of the company’s foundational data science and engineering principles.
One foundation is the company’s rigorous ETL practices, specifically the fact that every data pipeline job is unit tested. We’ll circle back to the testing aspect, but first let’s dive into Shopify’s ahead-of-the-curve approach to the ETL workflow. To understand what makes it prescient, you have to know how ETL was traditionally implemented.
What Is ETL?
ETL, for the uninitiated, stands for extract, transform and load. Depending on who’s doing the framing, it’s either essentially synonymous with data pipelines or it’s a subcategorical example thereof, specifically if referring to data pipelines as simply moving data from one location to another.
Here’s an ETL breakdown:
- Extract: Pull raw data from various locations.
- Transform: Manipulate and clean the data in accordance with any business requirements or regulations. “This manipulation usually involves cleaning up messy data, creating derived metrics (e.g., sale_amount is a product of quantity * unit_price), joining related data and aggregating data (e.g., total sales across all stores or by region),” explains Chartio’s handy business-intelligence buzzword dictionary.
- Load: Plop the extracted, transformed, analysis-friendly data into a data warehouse — or perhaps a data lake or data mart.
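The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the table name sales and the function names are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical raw rows, standing in for data pulled from source systems.
RAW_SALES = [
    {"region": "east", "quantity": "2", "unit_price": "9.50"},
    {"region": "East ", "quantity": "1", "unit_price": "20.00"},
    {"region": "west", "quantity": "3", "unit_price": "5.00"},
    {"region": "west", "quantity": None, "unit_price": "5.00"},  # messy row
]


def extract():
    """Extract: pull raw records from a source (here, an in-memory list)."""
    return list(RAW_SALES)


def transform(rows):
    """Transform: clean messy values and derive sale_amount."""
    cleaned = []
    for row in rows:
        if row["quantity"] is None:
            continue  # drop rows that fail a basic quality check
        quantity = int(row["quantity"])
        unit_price = float(row["unit_price"])
        cleaned.append(
            {
                "region": row["region"].strip().lower(),
                "quantity": quantity,
                "unit_price": unit_price,
                "sale_amount": quantity * unit_price,  # derived metric
            }
        )
    return cleaned


def load(rows, conn):
    """Load: write the analysis-friendly rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE sales (region TEXT, quantity INTEGER, "
        "unit_price REAL, sale_amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (:region, :quantity, :unit_price, :sale_amount)",
        rows,
    )
    conn.commit()


def total_sales_by_region(conn):
    """Aggregate step: total sales per region, as in the quoted example."""
    cursor = conn.execute(
        "SELECT region, SUM(sale_amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(cursor.fetchall())


conn = sqlite3.connect(":memory:")  # SQLite stands in for the warehouse
load(transform(extract()), conn)
totals = total_sales_by_region(conn)
```

Keeping each step as its own function is a design choice for the sketch, not a requirement of ETL; it simply makes the transform easy to test without touching a database.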
ETL as a concept remains one of the cornerstones of data engineering. Robert Chang, product manager of Airbnb’s data platform, was sure to include an outline of ETL best practices in his Beginner’s Guide to Data Engineering, which offered an inside look at how Airbnb helped establish a new way of building software with its Airflow pipeline automation and scheduling tool.
That said, ETL (or as some do it, ELT) is a malleable thing. For one, there’s seemingly endless debate as to whether or not ETL still even exists. But for the majority that answers yes, the nature of the architecture depends on a lot of variables, perhaps most notably the scale of the business.
A newborn startup, for instance, probably doesn’t require anything quite so advanced. It can get by with “a set of SQL scripts that run as a cron job against the production data at a low traffic period and a spreadsheet,” wrote Christian Heinzmann, former director of engineering at Grubhub, which uses a flow that might be best described as ELETL.
It’s malleable in terms of workflow too. At Shopify, it doesn’t even fall under the duties of data engineering. There, data scientists handle all the typical ETL processes related to their data modeling.
“The data scientists are the ones that are most familiar with the work they’ll be doing, and in terms of the data sets they’ll be working with,” said Miqdad Jaffer, senior lead of data product management at Shopify. “Our role from a data engineering perspective is to enable their processes.”
Of course, all data scientists need some programming chops, but that goes doubly in an environment where they’re building out their own pipelines. “Our data scientists come from a very strong engineering background,” Jaffer said. “The tools that we create are systematically set up so that we can be opinionated about what they build — but how they build it is still something entirely up to them.”
Such a workflow might not be as unique as it once was. To be sure, more and more SQL tools are popping up to support ETL at the data science level. It just took most teams some time to get there.
“I think that’s kind of the sweet spot where people have landed,” Jaffer added. “We’ve just always had that as the default.”
Put to the Test
That’s the responsibility breakdown, but how do you actually make sure the pipelines are resilient? That’s where those unit tests come in. As Shopify pointed out in June, every data pipeline job is unit tested.
“This may slow down development a bit, but it also prevents many pitfalls,” Arsenault wrote. “It’s easy to lose track of a JOIN that occasionally doubles the number of rows under a specific scenario.”
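The row-doubling JOIN that Arsenault describes is easy to reproduce in miniature. The sketch below is not Shopify’s code: the function names, the sample orders and the customer records are all hypothetical, and a plain Python nested-loop join stands in for a Spark or SQL join. It shows the kind of row-count assertion that would catch the problem.

```python
def join_orders_to_customers(orders, customers):
    """Inner-join orders to customers on customer_id (nested-loop join)."""
    return [
        {**order, "customer_name": customer["name"]}
        for order in orders
        for customer in customers
        if order["customer_id"] == customer["customer_id"]
    ]


def dedupe_customers(customers):
    """Keep one row per customer_id; the last row seen wins."""
    latest = {}
    for customer in customers:
        latest[customer["customer_id"]] = customer
    return list(latest.values())


# Hypothetical fixtures: customer c2 appears twice in the lookup data.
ORDERS = [
    {"order_id": 1, "customer_id": "c1", "amount": 10.0},
    {"order_id": 2, "customer_id": "c2", "amount": 4.0},
]
CUSTOMERS_WITH_DUPLICATE = [
    {"customer_id": "c1", "name": "Ada"},
    {"customer_id": "c2", "name": "Grace"},
    {"customer_id": "c2", "name": "Grace H."},
]


def test_duplicate_customer_rows_fan_out_the_join():
    # Documents the pitfall: two orders become three joined rows.
    joined = join_orders_to_customers(ORDERS, CUSTOMERS_WITH_DUPLICATE)
    assert len(joined) == 3


def test_join_preserves_order_count_after_dedup():
    # The guard: one output row per order, and totals are unchanged.
    customers = dedupe_customers(CUSTOMERS_WITH_DUPLICATE)
    joined = join_orders_to_customers(ORDERS, customers)
    assert len(joined) == len(ORDERS)
    assert sum(row["amount"] for row in joined) == 14.0
```

The functions follow the pytest naming convention, so a test runner can pick them up, but they can also be called directly.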
Of course that’s easier said than done. As Wright points out, in elaborate systems, unit tests can become as complex as, or even more complex than, the code being tested. Measured development for the sake of diligence is great, but slowing to a snail’s crawl isn’t exactly viable.
In order to strike the balance, Wright takes an approach he calls “minimal testing.” That includes creating DSLs for code that has become too unwieldy, refactoring duplicate code and building solutions from smaller, targeted classes and functions. You can also look at decoupling algorithms from specific schemas and data sources in order to decouple the tests as well.
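One way to read the decoupling advice, sketched here with hypothetical names: write the algorithm against accessor functions rather than a concrete schema, so its test needs only a handful of tuples. The function latest_per_key and the shop-settings wrapper are illustrative assumptions, not anything Wright describes in detail.

```python
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

Row = TypeVar("Row")


def latest_per_key(
    rows: Iterable[Row],
    key_of: Callable[[Row], Hashable],
    version_of: Callable[[Row], int],
) -> List[Row]:
    """Keep the highest-version row for each key.

    The algorithm knows nothing about column names or data sources;
    callers describe their schema through the two accessor functions.
    """
    latest: Dict[Hashable, Row] = {}
    for row in rows:
        key = key_of(row)
        if key not in latest or version_of(row) > version_of(latest[key]):
            latest[key] = row
    return list(latest.values())


def latest_shop_settings(rows):
    """Thin, schema-specific wrapper (hypothetical column names)."""
    return latest_per_key(
        rows,
        key_of=lambda r: r["shop_id"],
        version_of=lambda r: r["updated_at"],
    )


# The core logic can be tested with bare tuples, with no schema in sight.
sample = [("a", 1), ("b", 1), ("a", 3), ("a", 2)]
result = latest_per_key(sample, key_of=lambda r: r[0], version_of=lambda r: r[1])
```

Because the wrapper is a single call, most of the testing effort goes into the small generic function, which is the point of keeping the tests minimal.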
“It’s really the same stuff that applies to engineering practices in any piece of code,” he said. “But these can apply to data pipelines as well.”
Of course the task of enabling analysis goes beyond smoothing out unit tests. How else does the data engineering side play facilitator?
This is probably a good time to mention Ralph Kimball.
A Model Approach
It might seem counterintuitive when discussing a field that’s seen a lot of transformation over the last several years, but the go-to text for dimensional modeling techniques remains Ralph Kimball’s The Data Warehouse Toolkit, published nearly a quarter century ago.
It’s most famous as the place where the star was born (the star schema, that is). That modeling schema is still the most popular thanks to its intuitive layout: multiple “dimension” tables branching off a central “fact” table. More advanced iterations get more complicated, but the basic structure stays essentially the same.
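A small retail example makes the layout concrete. The sketch below builds a toy star schema in an in-memory SQLite database; the table names, columns and sample rows are hypothetical and chosen only to show one fact table surrounded by three dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive context, one row per member.
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT);
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store (
        store_key INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: one row per sale, with keys pointing at each dimension.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        store_key INTEGER REFERENCES dim_store(store_key),
        quantity INTEGER,
        sale_amount REAL
    );

    INSERT INTO dim_date VALUES (20200601, '2020-06-01', '2020-06');
    INSERT INTO dim_product VALUES (1, 'Mug', 'Kitchen'), (2, 'T-shirt', 'Apparel');
    INSERT INTO dim_store VALUES (1, 'Ottawa'), (2, 'Toronto');
    INSERT INTO fact_sales VALUES
        (20200601, 1, 1, 2, 20.0),
        (20200601, 2, 1, 1, 25.0),
        (20200601, 1, 2, 3, 30.0);
    """
)

# A typical query: start at the fact table and join out to the dimensions.
rows = conn.execute(
    """
    SELECT p.category, s.city, SUM(f.sale_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
    """
).fetchall()
```

Every analytical question follows the same shape: measures come from the fact table, and the labels used to slice them come from the dimensions.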
Still, the Kimball methodology goes far deeper, with detailed best-practice advice around modeling those tables and structures, like “Ensure that every fact table has an associated date dimension table,” and “Resolve many-to-many relationships in fact tables.”
Shopify is closely aligned to Kimball. Because the entire house follows the guidelines, it’s possible to “easily surf through data models produced by another team,” wrote Arsenault in June. “I understand when to switch between dimension and fact tables. I know that I can safely join on dimensions because they handle unresolved rows in a standard way — with no sneaky nulls silently destroying rows after joining.”
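The article does not spell out how Shopify’s dimensions handle unresolved rows. One common convention in dimensional modeling, shown here purely as an assumption, is to point unmatched or null keys at a placeholder “Unknown” dimension row so that an inner join never drops facts. The table names, the key value -1 and the sample data are hypothetical.

```python
import sqlite3

UNKNOWN_KEY = -1  # assumed convention for the placeholder dimension row

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_orders_raw (
        order_id INTEGER, customer_key INTEGER, amount REAL);
    CREATE TABLE fact_orders (
        order_id INTEGER, customer_key INTEGER, amount REAL);

    -- Row -1 is the placeholder member matching UNKNOWN_KEY above.
    INSERT INTO dim_customer VALUES (-1, 'Unknown'), (1, 'Ada');
    -- Order 101 has a null key; order 102 points at a missing customer.
    INSERT INTO fact_orders_raw VALUES
        (100, 1, 10.0), (101, NULL, 5.0), (102, 7, 2.5);
    """
)

JOIN_COUNT = """
    SELECT COUNT(*)
    FROM {fact} AS f
    JOIN dim_customer AS d ON d.customer_key = f.customer_key
"""

# Joining the raw facts silently loses the two unresolved orders.
naive_count = conn.execute(JOIN_COUNT.format(fact="fact_orders_raw")).fetchone()[0]

# Resolve keys once, at load time: anything unmatched maps to UNKNOWN_KEY.
conn.execute(
    """
    INSERT INTO fact_orders
    SELECT f.order_id, COALESCE(d.customer_key, ?), f.amount
    FROM fact_orders_raw AS f
    LEFT JOIN dim_customer AS d ON d.customer_key = f.customer_key
    """,
    (UNKNOWN_KEY,),
)

safe_count = conn.execute(JOIN_COUNT.format(fact="fact_orders")).fetchone()[0]
resolved = conn.execute(
    """
    SELECT f.order_id, d.name
    FROM fact_orders AS f
    JOIN dim_customer AS d ON d.customer_key = f.customer_key
    ORDER BY f.order_id
    """
).fetchall()
```

With the keys resolved up front, downstream users can join on the dimension without first checking for nulls, which is the property Arsenault highlights.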
Also key is the fact that the Kimball method splits the warehouse architecture into a back room, for metadata, and a front room, where the high-quality data sets and reusable dimensions end up.
As Arsenault noted, Shopify’s streamlined approach encourages openness: Anyone at Shopify can query data. (The company’s lone data modeling platform is built on Spark, and the modeled data lives on a Presto cluster.) That said, it’s vital that the good stuff be kept up front.
“Not every query or transform that’s ever been output is suitable for blind reuse,” Wright said. “So keeping metadata sets in one place, and intermediate ones in another — it doesn’t mean you physically block somebody from using them. But you want the first thing they find to be the best-quality data. If they decide to go deeper, you want there to be a signal that there may be dragons.”
Shopify has steps along the way to define the base model plus both “rooms.” Testing and documentation are both baked into that process.
Wright compares schema and table best practices to developing APIs, which should be designed to be easy to use correctly and hard to misuse. You can achieve something similar in warehouses by having consistent naming conventions and a plethora of information at the ready.
“When you find a data set, you don’t just find a data set,” he said. “You also find the documentation and the cross references to how that data set has been used by others, which will guide you how to use that data correctly.”
Peer Review and Filing Contracts
Jaffer and Wright point to two other important tenets of the data process. One is peer review. At least two other data scientists — or two others with ability within that repo — are brought in to make sure a pipeline’s code is tight and that all the above-mentioned processes are accounted for in data models.
Because everything gets at least two fresh pairs of eyes, “we make sure that whatever we’re producing to our end customers from a query perspective, is a gold standard,” Jaffer said. He also pointed to a data onboarding sequence and a Data 101 course, where product and UX get a basic understanding of the data engineering side, as helpful safeguards.
A bit more technical, but among the most vital, is Shopify’s concept of contracts. This happens during the transformation process, which runs on big data processing engine Apache Spark. As data scientists transform their data (the T in ETL) to make it presentable for the front room, every input is passed through a contract and every output is checked against its corresponding contract.
Wright explains: “The idea is that when you have a wide-open world where you allow anybody to write a cron job, that job can pick up inputs and write outputs anywhere in your data warehouse. So when things go wrong, your platform doesn’t have much leverage for how to help data scientists.”
Without the contracts, a lot of errors — an incorrectly written field name, an unexpected null, data sent to the wrong place — could slip right through.
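To make the idea concrete, here is a minimal sketch of contract checking in plain Python. It is an assumption-laden illustration, not Shopify’s implementation, which runs on Spark and is not described at this level of detail: the Field, Contract, ContractError and run_job names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Field:
    name: str
    type_: type
    nullable: bool = False


@dataclass(frozen=True)
class Contract:
    dataset: str
    fields: Tuple[Field, ...]

    def violations(self, rows):
        """Return a list of human-readable problems found in rows."""
        problems = []
        expected = {f.name for f in self.fields}
        for i, row in enumerate(rows):
            for extra in sorted(set(row) - expected):
                problems.append(f"{self.dataset}[{i}]: unexpected field {extra!r}")
            for f in self.fields:
                if f.name not in row:
                    problems.append(f"{self.dataset}[{i}]: missing field {f.name!r}")
                elif row[f.name] is None:
                    if not f.nullable:
                        problems.append(f"{self.dataset}[{i}]: null in {f.name!r}")
                elif not isinstance(row[f.name], f.type_):
                    problems.append(f"{self.dataset}[{i}]: bad type for {f.name!r}")
        return problems


class ContractError(ValueError):
    pass


def run_job(transform, input_contract, output_contract, rows):
    """Check inputs, run the transform, then check outputs."""
    rows = list(rows)
    for contract, data in ((input_contract, rows), (output_contract, None)):
        if data is None:
            data = rows = list(transform(rows))
        problems = contract.violations(data)
        if problems:
            raise ContractError("; ".join(problems))
    return rows


ORDERS_IN = Contract("orders", (Field("quantity", int), Field("unit_price", float)))
SALES_OUT = Contract("sales", (Field("sale_amount", float),))


def good_transform(rows):
    return [{"sale_amount": r["quantity"] * r["unit_price"]} for r in rows]


def typo_transform(rows):
    # Misspelled output field: the output contract should reject this.
    return [{"sale_amuont": r["quantity"] * r["unit_price"]} for r in rows]


output = run_job(good_transform, ORDERS_IN, SALES_OUT, [{"quantity": 2, "unit_price": 4.5}])
```

The data scientist still decides how the transform works; the platform only checks what goes in and what comes out, which matches the division of labor described in the article.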
Having some higher-level visibility on the metadata makes the platform more robust. “It doesn’t mean that you tell them exactly how to do their jobs,” Wright said. “In fact, that’s really where the sweet spot is: to find a way to let [data scientists] look for the best way to do their transformation, then find a way to allow the platform to help them do that.”
Finally, you might be wondering how we’ve come this far discussing the importance of data engineering with only passing mention of Apache Spark and none of, say, Hadoop or Kafka. (For the record, Shopify utilizes all three. You can explore its stack here, including some reflection from the company’s engineering lead on Shopify’s famous early bet on Ruby on Rails.)
That’s not to say these big-data powerhouses are beside the point; it’s more about the important intersection of engineering. Wright, for instance, had “literally zero” experience in big data before joining Shopify in 2015, coming from a development background.
“I was looking at it really from a systems engineering point of view,” he said. “That brought different ideas than maybe if you’re just thinking in terms of Redis versus CSV versus RDD — all these big data tech names.”
He continued: “Data structures, algorithms, object-oriented design. Concurrency, distributed systems engineering — those are skills you can build in any domain. But I think they provide a fantastic foundation for a data engineering career.”
Jaffer sees a similar path, with colleagues often starting from a software engineering practice and feeling the pull of data, particularly as an opportunity to tackle scale problems. Indeed, Shopify now works with more than a million merchants — some small, some major enterprises — across some 175 countries.
“That’s millions and millions of data coming in at a regular interval,” Jaffer said. “How do we deal with those things? It becomes a scale software engineering problem.”