What Is Hadoop? A Guide

Think back to early November of 2007.

No Country for Old Men was about to debut in movie theaters ahead of its eventual Best Picture win. Boston Red Sox fans were days removed from celebrating the team’s latest World Series win with a flipped car and small fire or two — more subdued, however, than the destruction following the curse-breaking win three years prior. And the New York Times had just announced that articles published between 1851 and 1980 — some 11 million in total, encompassing some four terabytes of data — would be available online as PDFs.

Derek Gottfrid, then a senior software architect for the Times, accomplished the seemingly herculean conversion task in less than 24 hours. His success was thanks in large part to a new, but still obscure, parallel-programming breakthrough belied by a childlike name: Hadoop.

Within months, Hadoop would become a so-called top-level project for the open-source Apache Software Foundation; in the years thereafter, it saw a meteoric ascent. Together with Hadoop’s large-scale storage capabilities, the MapReduce model that underpinned its data processing muscle represented a genuine breakthrough — “it made me weep,” wrote Gottfrid in 2007 of the Google paper that first outlined the technique.

What is Hadoop?

Hadoop is an open-source big data framework co-created by Doug Cutting and Mike Cafarella and launched in 2006. It combined a distributed file storage system (HDFS), a model for large-scale data processing (MapReduce) and, as of its second major release, a cluster resource management platform called YARN. Hadoop also came to refer to the broader collection of open-source tools that sits atop that base. Cloudera now stands as the foremost Hadoop vendor after merging with Hortonworks in 2018. The company’s Cloudera Data Platform (CDP), released in 2019, marked a major renovation of the Hadoop distribution as it faced increased competition from cloud providers.
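
For readers who have never seen the MapReduce model in action, here is a minimal single-machine sketch of the idea in plain Python. It is not Hadoop's API and does not run on a cluster; the function names (map_phase, shuffle, reduce_phase, word_count) are invented for illustration. It only shows the three-step shape that a real framework spreads across many machines: map each record to key-value pairs, group the pairs by key, then reduce each group to a result.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

In this toy run, "big" and "data" are each counted twice and "cluster" and "node" once. In a real deployment the map and reduce steps would run in parallel on many nodes, with the input read from a distributed file system rather than a Python list.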

As noted in Hadoop: The Definitive Guide, by early Hadoop contributor Tom White, the New York Times use case stood as a notable public win in the early Hadoop days. But for Doug Cutting, the co-creator of Hadoop, it was a lagging indicator. Big things had already been happening behind the scenes. User groups had popped up to nurture the system, and companies like Facebook and LinkedIn had added it to their stacks after contributing to its codebase.

“When VC started sniffing around, saying, ‘We should start a company around this,’ that to me was the sign,” Cutting told Built In. “You've got people dependent on it, which is going to ensure its survival. A neat thing about open source is, if you depend on it, you can make it survive. You can weigh in with your own efforts — and that’s what we saw from these early adopters.”

Hadoop did indeed survive — and thrive. It quickly became synonymous with the heady early days of Big Data, which it helped usher in. It could easily handle the gluttonous ingestion of unstructured data in a way that conventional databases never could. It was, as a 2011 Wired headline proclaimed, “the future of Big Data.” Data-drenched enterprises like Twitter and eBay followed Facebook and LinkedIn’s lead, and a parade of Hadoop-related products emerged.

But the party didn’t last forever.

In time, new data storage and processing options appeared and matured. Meanwhile, startups slowly began to realize they probably wouldn't require the same data payloads as behemoths like Facebook. The “Is Hadoop Dead?” chin-stroking proliferated to such an extent that even Arun Murthy, chief product officer of Hadoop vendor Cloudera, asked the question last year.

His conclusion? Not dead. But even he yielded that, yes, “MapReduce is in decline” — while also stressing that Hadoop is not simply MapReduce.

Murthy, of course, has skin in the game, which might lead one to ask again, is Hadoop dead? Or is it the same powerhouse as ever, just now post hype? Before we try to answer that, let’s quickly look at how Hadoop got to where it is.

Hadoop's combination of HDFS, MapReduce and YARN was a watershed in the early days of Big Data. | Image: Shutterstock

The Birth of Hadoop

Hadoop grew from what was initially a search engine project. In the early aughts, Doug Cutting had already built a search indexer (Lucene) and co-created a crawler (Nutch) in order to run search. But there was a problem.

“We realized it was a bigger project than a couple of people could do part time,” Cutting told Built In. “Building distributed systems is complicated.”

Two Google research papers changed all that: a 2003 outline of Google’s distributed file system, called Google File System (GFS), and the aforementioned MapReduce paper, published in 2004. Inspired by both, Cutting effectively created and merged his own open-source versions of each to create Hadoop. He named it after his kid’s stuffed elephant — “short, relatively easy to spell and pronounce, meaningless, and not used elsewhere,” Cutting explained, according to White’s Hadoop.

Developers at Yahoo had been tracking Cutting’s progress and eventually recruited him to refine Hadoop’s open-source code as an employee of the established search engine. But Hadoop’s twin powers of big-time storage (HDFS) and lightning-quick compute (MapReduce), paired later with YARN’s cluster resource management, would prove more revolutionary in big data warehousing than in search.

Thumbtack began migrating workloads away from Hadoop and Cloudera toward a cloud provider in 2016. The move is emblematic of the challenges Cloudera faces. | Image: Thumbtack

Why Some Companies Moved Away From Hadoop

Data infrastructure expert Nate Kupp’s relationship with Hadoop follows an archetypal pattern: intoxicating first encounter followed years later by a pragmatic separation.

From 2012 to 2014, Kupp worked as a battery-life analytics technical lead at Apple. That means it was his job to understand how exactly an iPhone’s myriad parts — hardware components, operating system and standard and third-party apps — affected the device’s battery life.

The job required an infrastructure that could handle vast amounts of data — more specifically, tens of terabytes. Kupp’s tenure also coincided with the biggest spike in interest Hadoop would enjoy, according to Google Trends. The team eventually rolled out its first Hadoop cluster — an install Kupp recalled as his Hadoop a-ha moment.

“It was just really cool to be able to crunch the data and — for the first time — be able to look at distributions of different statistics,” said Kupp, now of data orchestrator Elementl.

The team established a process around the system, which allowed them to analyze the data, better understand battery life distributions in broad strokes and drill down into bugs that hadn’t been visible.

“All these insights, which were previously inaccessible, were now at our fingertips,” he said.

When Kupp left Apple for a role heading up infrastructure and data science at Thumbtack, Hadoop was still part of the job. For years, most workloads at the freelance-labor service ran on a large production Cloudera cluster. But by 2016, a significant amount of resource contention was straining the infrastructure.

“All the analysts would come in on Monday morning, kick off and refresh all their dashboards, and bring the cluster to its knees with a ton of SQL workloads,” he said.

Kupp and Thumbtack ultimately decided to migrate away from Cloudera and switch to Google. SQL data warehousing moved to BigQuery, Spark — which had been running on CDH at Thumbtack — shifted to Google Cloud’s Dataproc, and data storage moved from HDFS to Google Cloud Storage (GCS).

The move quickly paid dividends.

“Our ops burden went from getting alerts all the time to going weeks without a PagerDuty notification,” Kupp said.

Cloudera offices in Palo Alto, California. The Hadoop vendor rolled out an ambitious new distribution in 2019. | Photo: Cloudera

From Under a Dark Cloud

Kupp’s experience is emblematic of the challenges that face Cloudera and, by extension, Hadoop. These days, the company faces stiff competition from cloud providers such as Amazon Web Services, Microsoft Azure and Google Cloud. These services have reputations for ease of use, low maintenance overhead and cost control that Cloudera has struggled to match, Gartner data management analyst Merv Adrian told Built In.

According to Adrian, the two keys to Cloudera’s long-term success are: 1) how well it transitions to its new product architecture, and 2) how well it’s able to keep clients if and when they move to the cloud.

Indeed, Cloudera has long had a cloud problem. When it merged with Hadoop-centric Hortonworks in 2018 — a move the industry largely took as a sign of Hadoop’s diminishing market share — VentureBeat sneered, “Ironically, there has been no Cloud Era for Cloudera.”

The company has worked hard to push back on that characterization, admitting, in effect, that it partially stems from Cloudera’s own marketing failures. As Murthy pointed out in a blog post last year, the first connector between Hadoop and Amazon’s cloud storage service S3 was written way back in 2006.

“Unfortunately, as an industry, we have done a poor job of helping the market (especially financial markets) understand how ‘Hadoop’ differs from legacy technologies in terms of our ability to embrace the public cloud,” he wrote. “Something to ponder, and fix.”

Financial services and telecommunications represent the two biggest sectors in Cloudera’s client base. Those industries are highly regulated and very security conscious, which means they’re more inclined toward on-premises software than others. But even tightly regulated industries are looking to the cloud for some operations — a fact that Cutting readily admits, and one on which Cloudera remains uniquely focused.

“There are huge advantages to throwing things in the cloud, and we’re not going to hem our customers in,” Cutting said. “If you need to burst small loads, [the cloud] gives an organization the ability to chart its own destiny much more rapidly.”

Hybrid, multi-cloud setups are indeed Cloudera’s long-term lane, Cutting said. That’s what Cloudera promises with Cloudera Data Platform (CDP), the new product architecture Adrian references above.

CDP is, in essence, a management layer. Through it, users are able to shift loads between multiple cloud systems and on-premises while keeping data security and lineage consistent. The ambitious, months-in-the-making new system was put together in the wake of the Hortonworks merger. It replaced both companies’ legacy systems and was released last September.

“They could have done a simple smash-the-two-things-together play, but they actually re-architected the product fundamentally for a new marketplace,” Adrian said.

Cloudera continues to enjoy some objects-at-rest advantage when it comes to maintaining its enterprise-heavy client base (it’s always a royal pain to switch data vendors, so why bother?). Even so, the company clearly realized that any go-along-get-along complacency among clients has a ceiling.

CDP notably eliminated YARN and swapped in Kubernetes, seemingly in response to many companies’ increased willingness to run their Spark workloads on the container system instead of through Hadoop. Cloudera also recently made Ozone, its new distributed object store, open to general use. Ozone is an HDFS replacement candidate, with potential to be highly portable across different environments. That was a necessary improvement since, as Adrian bluntly put it, “HDFS just ran out of steam.”

So, with its new look, does a post-pivot Cloudera stand a better chance against so many cloud provider Goliaths? It’s too soon to render a verdict on CDP or, certainly, Ozone. But Adrian, for one, is “cautiously optimistic.” And the company’s market summary certainly looks rosier than it did in June of last year, when shares nosedived before ticking back upward.

Cutting sounds cautiously optimistic, too, even while facing off against some of the most powerful names in tech.

“They’re massive, [so] it’s still frightening,” he said with a laugh. “But we’ve been there before.”

An Open-Source Legacy

On the day we spoke, Adrian shared a blast from the past. He tweeted out a list he compiled in 2015 of 10 Hadoop-affiliated projects. Some, like Crunch, never found any real purchase; others, like Presto, very much did.

His point was multifold: Hadoop has almost always encompassed far more than MapReduce and HDFS.

“Hadoop isn’t a thing; Hadoop is a set of things,” he said. And some of those things — Presto, Hive, Kudu, Impala, etc. — endured atop the system for a long time.

Also, everything on the list is open source.

That might seem unremarkable in 2020, but Doug Cutting had to fight tooth and nail to keep the Hadoop bedrock open source. It took six months of wrangling with legal, plus a change of CEO, before he could actually develop Hadoop as open source at Yahoo without any roadblocks.

At the time, data systems had been the realm of private software vendors, and that’s just how things worked. Cutting’s employment agreement ultimately sported a handwritten addendum that allowed him to contribute to open source, which was otherwise prohibited at the search company.

“Hadoop is a symbol of that explosion, that shift from the traditional enterprise software world — where the guys in the blue suits were the only ones you could trust to hold your core corporate data — to where we are now,” he said.

So perhaps it’s no surprise that Ozone, though spearheaded by Cloudera, is also open source. Sink, swim or tread water, Cloudera will do so by carrying on the open-source ethos that Cutting helped cement.
