Data Lake vs. Data Warehouse: Will the Already-Blurred Line Between Them Disappear?
How much does it cost to get medicine to market? It’s a fiercely debated figure, but one oft-cited estimate puts the average cost of developing a single Food and Drug Administration-approved drug at a whopping $2.7 billion.
Whatever the true cost, there’s no disagreement about the overall challenge: Drug development is spectacularly expensive and time-consuming — and those costs invariably roll down to patients. So anything drug companies can do to safely accelerate clinical trials is paramount.
What does this have to do with data lakes? In the case of AstraZeneca, quite a lot.
What Is a Data Lake?
In 2014, the pharmaceutical company made the kind of sweeping data architectural change that phrases like “data journey” do little to capture. Before the revamp, the multinational company’s data was scattered, siloed and slow-moving. It was particularly noticeable on the finance and administrative side, where a glut of CRMs and ERPs directed data this way and that.
“We were spending more time discussing the quality of the data than the business strategy,” Andy McPhee, data engineering director at AstraZeneca, said last year. “We wanted to consolidate and get a single set of global metrics so we could monitor activity across divisions and markets and do comparisons that were not previously possible.”
In pursuit of more streamlined integration, AstraZeneca decided to leave behind its legacy, on-premises data processing and storage set-ups and move to the cloud — specifically to Amazon S3. Its team set up an AWS data lake and used a Talend API to direct the data coming from that mess of systems.
AstraZeneca eventually engineered more and more tributaries to and from the lake, including early and late-science data and metadata. That feeds algorithms that do complex tasks ranging from identifying disease based on stored medical-imaging metadata to aggregating large data sets needed for trials.
“Data lakes give you a lot of speed,” said Mark Balkenende, Talend’s director of product marketing. “You can do things faster than you would in a traditional data warehousing world where things have to be modeled, transformed and fit.”
With its new data lake and ingestion tools, AstraZeneca was able to cut a month off clinical trial times. That may not sound like a lot, given that the full drug-discovery pipeline — from protein targeting to regulatory stamp of approval — can stretch well over a decade. But AstraZeneca estimates that trim-down translates to $1 billion saved per year.
Savings aside, the most remarkable thing about the pharma case study is how unremarkable it is in some ways. The transition to the cloud is perhaps the biggest tech trend of the previous decade. And the data lakes that fuel what was once called Big Data and is now just business as usual are considered must-haves for companies vastly smaller than AstraZeneca.
But even though everyone is wading in a lake, that doesn’t mean the waters are entirely clear. A renewed focus on governance and discovery, plus complicating new wrinkles in the cloud paradigm, are redefining how teams secure and leverage data lakes going forward.
A Brief History of Data Lakes
The concept of the data lake arrived hot on the heels of the original Big Data boom. As Databricks pointed out, it’s still not uncommon for people to closely associate “data lakes” with the old-school Hadoop framework that first brought along the ability to store and process massive amounts of unstructured data, some 10 years ago.
Indeed, the concept was coined in 2010 by James Dixon, founder of Pentaho, in a blog post that outlined his company’s first Hadoop-based release. He explained the data lake’s essential focus on raw data — accessible to a large group of users — like so:
“If you think of a datamart as a store of bottled water — cleansed and packaged and structured for easy consumption — the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
A lot has changed since those early days. Under Cloudera, Hadoop is undergoing big, existential changes, moving away from the on-premises-centric framework that first launched Big Data: HDFS for storage and MapReduce for processing.
Put simply, Big Data drifted up to the cloud — and so did data lakes.
But the fundamental definition that Dixon sketched out a decade ago has largely persisted. The main difference between data lakes and data warehouses is structure. Data warehouses are highly modeled and geared toward more regular, repeated jobs. And data that’s piped into warehouses needs to be molded and transformed to conform to whatever parameters have been set.
A data lake, however, requires no such massaging. It’s the landing pad for massive amounts of data — sometimes on the exabyte scale — in its raw, untamed nature.
Notably, that speeds up access for data scientists.
“No longer will analysts have to wait for months to even begin exploring a data set, only to discover that the essential data they need has been aggregated away into the ether,” wrote Matt How. “Now they can dive straight into the data lake, doing as much cleaning as necessary, and once a proven value has been asserted, a proper process can be built to funnel the data into a warehouse.”
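The structural difference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields, the `WAREHOUSE_SCHEMA` mapping and all three function names are hypothetical, and plain Python lists stand in for real storage. The warehouse path validates every record against a fixed model before accepting it, while the lake path stores the raw text untouched and applies structure only when someone reads it.

```python
import json

# Hypothetical warehouse model: column name -> required Python type.
WAREHOUSE_SCHEMA = {"trial_id": str, "site": str, "enrolled": int}


def load_into_warehouse(table, record):
    """Schema-on-write: reject any record that does not fit the model."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)


def land_in_lake(lake, raw_line):
    """Schema-on-read: keep the raw text as-is and interpret it later."""
    lake.append(raw_line)


def read_from_lake(lake, field):
    """Apply structure at query time, skipping records without the field."""
    values = []
    for raw_line in lake:
        record = json.loads(raw_line)
        if field in record:
            values.append(record[field])
    return values


warehouse, lake = [], []
raw_events = [
    '{"trial_id": "T-1", "site": "Boston", "enrolled": 40}',
    '{"trial_id": "T-2", "scan": {"modality": "MRI"}}',
]
for line in raw_events:
    land_in_lake(lake, line)  # always succeeds
    try:
        load_into_warehouse(warehouse, json.loads(line))
    except ValueError:
        pass  # this record would need transformation before it fits
```

Both raw events land in the lake, but only the first one conforms to the warehouse model. The second is still there for an analyst to explore, which is the speed advantage the sources describe.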
How’s detail about baseline cleaning proves a point: Drawing a strict raw-versus-modeled binary between lakes and warehouses is a bit reductive. It’s common to transform some data and run a bit of analysis in a separate area within the data lake. Also, some modicum of structural organization — such as foldering data along department functions — is necessary for at least two reasons.
First, you need to be able to actually find stuff. (After data lakes went mainstream, a slew of aquatic metaphors with varying degrees of usefulness and staying power followed; the cautionary tale of the “data swamp” remains instructive.) Second, the rise of domestic and international data-privacy regulations means you had better have a good understanding of what’s swimming in your lake, structured or unstructured.
That said, some would argue that the line between warehouses and lakes has grown fuzzier. Exhibit A: the spectacular rise of Snowflake.
An Artificial Line?
Snowflake’s IPO was an absolute jaw-dropper. When the data platform went public in September of 2020, it reached a $69 billion valuation, spurred in part by investment from Warren Buffett. It was the largest IPO for a software company, and the kind of share-price climb tech hadn’t witnessed in decades.
If the first data-era stage was file-based systems, followed by the cloud, Snowflake represents the new, third model: of the cloud, but distinct. Snowflake’s great selling point is its vendor agnosticism. Its data warehouse platform can run on any of the three major cloud-vendor services: Amazon Web Services, Microsoft Azure and Google Cloud Platform.
“They’re the Switzerland of cloud data warehouses,” said Balkenende of Talend, a Snowflake partner that’s making a similar Swiss-style play for provider neutrality, but in ETL.
It’s a perhaps counterintuitive position — partnered with the cloud giants, but competing against their warehouse architectures, namely Microsoft’s Synapse, Google’s BigQuery and Amazon’s Redshift. But clearly it’s working.
Snowflake set itself up for success on a more granular, architectural level years ago. It modernized the data warehouse by absorbing several key attributes of data lakes: comparable costs, the ability to place different kinds of schemas alongside one another, and the all-important separation of compute and storage, which allows users to scale operations based on what they need most at a given time.
Torsten Grabs, director of product management at Snowflake, explained: “It gives you the ability to spin up as much compute as you want over the same data asset without introducing unnecessary, artificial copies of the data,” which can not only slow things down, but also lead to governance issues.
Snowflake’s big, early innovation was another lake/warehouse line-blurrer. Unlike Big Data 1.0 systems, Snowflake allowed users to load unstructured and semi-structured data files (such as JSON) into a relational database without going through a mess of modeling work. It introduced a new data type, VARIANT, to make it easier still. In essence, data teams could now not only combine, but also query, raw data without the major pre-processing hoop-jumps of yore.
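To give a feel for what querying raw, semi-structured data inside a relational table looks like, here is a rough Python sketch. It is not Snowflake's syntax or implementation: SQLite stands in for the warehouse, a plain text column stands in for a VARIANT-style column, and the `json_get` helper, the `events` table and the sample records are all hypothetical names invented for the example.

```python
import json
import sqlite3


def json_get(document, dotted_path):
    """Hypothetical helper: walk a dotted path through a JSON document."""
    value = json.loads(document)
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # missing paths become SQL NULL
        value = value[key]
    # SQLite functions return scalars, so re-serialize nested values.
    return json.dumps(value) if isinstance(value, (dict, list)) else value


conn = sqlite3.connect(":memory:")
# Expose the helper to SQL as a two-argument function.
conn.create_function("json_get", 2, json_get)
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

# Raw records with differing shapes, loaded without any modeling step.
raw_records = [
    '{"patient": {"id": "P-7"}, "scan": {"modality": "MRI", "slices": 120}}',
    '{"patient": {"id": "P-9"}, "scan": {"modality": "CT"}}',
    '{"patient": {"id": "P-3"}, "notes": "no scan yet"}',
]
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(record,) for record in raw_records],
)

# Query into the raw documents with ordinary SQL.
rows = conn.execute(
    "SELECT json_get(payload, 'patient.id'), json_get(payload, 'scan.modality') "
    "FROM events "
    "WHERE json_get(payload, 'scan.modality') = 'MRI' "
    "ORDER BY id"
).fetchall()
```

The three records have different shapes, yet none needed a schema change to load, and the query simply ignores the record that has no scan at all.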
All of which means Grabs considers the division between data lakes and warehouses to be “artificial, already today.”
“The industry going forward will more and more think of this as a continuum, as a spectrum, where essentially you look at the level of structure, curation and confidence that you have in data as it progresses through the data value chain,” he said.
This auto-transformation of sorts is a step forward not only because it lowers data engineers’ stress levels, but also because it kicks open the door for data collaboration across an organization.
“It’s that very generic operation that takes a piece of unstructured or semi-structured data, applies more structure to it, or promotes characteristics out of that data asset, into a shape that more and more business users can access easily,” he said.
But even as powerhouses like Snowflake blur traditional lines, the distinction won’t be obliterated any time soon, according to Balkenende. He points to financial data, which needs the tight structure of a warehouse. “Very few companies are going to use a data lake to do their financial reporting,” he said. “So you’ll continue to have your very highly structured, very conformed data warehouses.”
Swamp Monsters and Legal Liabilities
At the dawn of the Big Data era, data was coming in at higher speeds and greater volumes than could be made sense of. “When data lakes originally came out, there was very little talk about governance or discovery,” Balkenende said. “It was all about just pumping all this shit into your Hadoop or your cloud.”
Soon, however, people realized this approach made data fairly useless from a business perspective. It also opened up compliance issues.
Those issues have only intensified now that privacy regulations like GDPR and CCPA are in full effect. Companies that do business in California or the European Union are now obligated to honor residents’ requests to delete all the personally identifiable information they hold on those residents. Swamps aren’t just bad business; they’re a potential legal quagmire.
Not surprisingly, data catalogs and metadata managers, tools devoted to data governance, security, lineage, discovery and indexing, have taken off in recent years. The metadata manager market is expected to climb from $2.35 billion in 2016 to $16.72 billion by 2025, according to a recent projection by Kenneth Research. And a steady stream of high-profile acquisitions by Dell, Hitachi Vantara, Informatica and OneTrust (which has raised $410 million in funding) points toward continued growth, especially for tools that can address all the disparate compliance needs at once.
Data catalog technology has been around for decades, but such demand is unprecedented. Balkenende thinks the boom will only intensify.
“In the last five or six years, catalogs came out of nowhere and now everybody needs to have a catalog all of a sudden,” he said. “And it’s because these data lakes — they’re just a swamp of chaos data [coming] from everywhere.”
“The catalog resurgence is really proof of how governance of your data lake is critical,” he added, noting Talend’s own compliance offering.
A Future for All
Snowflake’s runaway IPO led to some inevitable comparisons to the (ultimately doomed) tech bubble of 1999. However, the overall rosy outlook for cloud computing, plus Snowflake’s ample reserves, make the comparison look a bit strained — as do decades of lessons learned about viable revenue models.
Still, it made me wonder: What kinds of large-scale potential disruptions keep Grabs up at night? After all, Hadoop once made not-dissimilar big splashes before being overshadowed by the cloud.
“Disruptions are hard to anticipate,” he said. “If we could anticipate them, they wouldn’t be disruptions.”
He added: “I’m hopeful and confident that we’ll use those disruptive moments for a better Snowflake going forward.”
He pointed to two major trends he sees expanding in the future for data lakes and data warehouses: simpler, more automated data analysis tools sitting atop canonical data assets, and organization-wide access to meaningful data.
“Data sharing can be very eye-opening for users who previously didn’t have the opportunity to work on a very secluded data warehousing environment, where you always had to go to your IT team,” he said. “It’s empowering.”
In other words, whatever the future of data lakes looks like, more and more people will be dipping their toes in.