There are several models for collecting, storing and organizing a company’s data, but one of the most talked about in the last decade has been the data lake: a huge repository that accepts massive amounts of any kind of data, in raw form, at high speed.
The concept of the data lake originated with a technology called Hadoop, in which a few engineers developed a data storage model designed for the immense input of data required by search platforms. It was eventually spun into an open source project run by the Apache Software Foundation, which made Hadoop’s code available to the world for free.
It’s been more than a decade since the data lake went mainstream, and since then enterprise software vendors like Microsoft and Amazon have also stepped up with offerings like Azure Data Lake and AWS Lake Formation.
Yet, companies are still trying to figure out if and when a data lake is right for their data.
Why Was the Data Lake Invented?
To decide whether your company needs a data lake, it’s helpful to think about why it was created in the first place. In some ways, a data lake can be seen as a response to an earlier concept: the data warehouse.
A data warehouse provides a systematized process and tightly organized repository for data that might be needed for a company's operations and analytics. This strict structure keeps data ready for analysis, but at the expense of collection and processing speed. That makes it impractical for companies that collect massive volumes of data at high speed, or that collect a mix of structured and unstructured data — otherwise known as big data.
In contrast, a data lake’s ability to handle any data type makes it attractive to companies that deal with big data. Beyond its flexibility with data types, it can also handle high volumes of data at high speed. It does this largely through “clustering,” in which multiple servers, rather than a single one, share the work of storage, processing and computation.
This technique, also known as distributed computing, lets the data infrastructure expand as needed, scaling to the kind of immense demands that a company like Yahoo, Facebook or eBay might experience.
A Preliminary Question: Do We Really Have Big Data?
If you’re considering a data lake, the first and most obvious question is: “Do we really deal with big data?” In terms of volume, what counts as “big” is somewhat subjective, but it is generally taken to mean data too large or complex for ordinary data storage and processing technologies to handle.
Think not just about the amount of data you collect, but also about its type. What if you're mostly handling relational data — highly structured data that fits in a standard column-and-row table (like sales and transaction data)? Even in large volumes, a data lake probably isn’t ideal, because it isn’t designed with the structure and safeguards of a relational database management system (RDBMS), and therefore requires additional engineering to provide those features. But if you are managing high volumes of semi-structured and unstructured data, you’re right to consider a data lake.
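To make the distinction concrete, here is a minimal Python sketch (the field names and records are invented for illustration) contrasting a rigidly structured sales record with the kind of semi-structured event a data lake ingests as-is:

```python
import json

# A relational-style record: fixed columns, uniform types — a natural
# fit for an RDBMS table like sales(order_id, amount, region).
relational_row = {"order_id": 1001, "amount": 49.99, "region": "EMEA"}

# Semi-structured events: each record can carry different, nested fields.
# An RDBMS schema would need constant migration to keep up; a data lake
# stores the raw payload and defers structure to read time ("schema on read").
events = [
    {"user": "a1", "action": "click", "meta": {"device": "ios"}},
    {"user": "b2", "action": "search", "terms": ["lake", "warehouse"]},
]

# A lake-style ingest is just an append of the raw payload, one line per event.
raw = "\n".join(json.dumps(e) for e in events)

# Structure is imposed later, at query time.
actions = [json.loads(line)["action"] for line in raw.splitlines()]
print(actions)  # ['click', 'search']
```

The point is the order of operations: the lake accepts the raw lines immediately, and the schema work happens only when someone queries.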
Assuming you answered “yes” to the gatekeeper question, here are some additional questions you should ask as you plan your data lake implementation.
Questions to Help You Plan Data Lake Implementation
- What’s our plan for dealing with small data?
- Can my data science team easily work in the lake?
- How are we going to keep track of the data once we put it in the data lake?
- Can I integrate a data lake with current data infrastructure? And if so, how?
What’s Our Plan for Dealing With Small Data?
As mentioned, data lakes aren’t good for every scenario. Hadoop, for example, has trouble with smaller datasets, and is better suited to handling a single large file than the same file split into many smaller ones.
If you’re looking to create a system that can handle small datasets in addition to big data, you don’t necessarily need to scrap your data lake plans. For instance, you might consider using Apache’s Hadoop Ozone software, an alternative to Hadoop’s native file system (HDFS) that can manage both large and small files.
You also might consider a few workarounds, such as configuring your data pipeline to bundle many files into one container, like a SequenceFile, an Avro file, or a Hadoop archive (.har) file.
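The bundling idea can be sketched with standard-library Python alone. Here a tar archive stands in for a SequenceFile or .har file (the file names and record counts are invented; this is an analogy, not Hadoop code):

```python
import io
import tarfile

# Thousands of tiny files are poison for HDFS: each one costs a block's
# worth of NameNode metadata. Packing them into one container keeps the
# metadata footprint constant — the same idea behind SequenceFiles, Avro
# container files, and .har archives. tarfile stands in for those here.
small_files = {f"event-{i}.json": f'{{"id": {i}}}'.encode() for i in range(1000)}

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as archive:
    for name, payload in small_files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

# One container object now holds all 1,000 records.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as archive:
    names = archive.getnames()
print(len(names))  # 1000
```

In a real pipeline the bundling step would run at ingest time, so the lake never sees the individual small files at all.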
Some commercial enterprise-grade ETL (extract, transform, load) solutions, such as Xplenty, automatically optimize small files for storage in a data lake as they load new data into your system. Alternatively, HBase, which is part of the Hadoop ecosystem, can be deployed to handle small files.
Can My Data Science Team Easily Work in the Lake?
Data scientists can be tough to hire, so the last thing you want to do is implement a data architecture that your data science team can’t work with.
Hadoop is notorious for posing challenges for data science teams, in that some of its technologies, like MapReduce, are difficult to use. In particular, the programming languages data scientists favor, like Python and R, can’t work directly on Hadoop’s distributed datasets. You can address this, to a degree, with Apache Spark, open source software designed to increase performance and simplify operations on a data lake.
Among other things, Spark offers machine learning libraries that let data scientists build models that run on distributed datasets. PySpark and SparkR, for example, let data scientists work in interfaces very close to Python and R, with access to an extensive machine learning library.
Before embarking on a Spark implementation, however, consider the skill sets that will be required, such as the ability to integrate and manage open source software code.
Though Spark simplifies the process of working with data lakes like Hadoop or Microsoft Azure Data Lake, it does require some specialized knowledge for proper integration. If your developer team finds the task of integrating Spark overwhelming, you could leverage one of the commercial versions of the product, which, though still requiring some level of management, will get you up and running on a data lake in less than an hour.
With that said, keep in mind that PySpark and SparkR require significant additional training — even for data science teams adept with Python and R. Their syntax is still different enough that there’s a steep learning curve.
How Are We Going to Keep Track of the Data Once We Put It in the Data Lake?
This point can’t be stressed enough. The data lake’s ability to quickly ingest any data type comes at the expense of providing the structure needed to make the data useful. If your data lake strategy doesn’t include a plan for dealing with this, you may find that you’ve got something like a roach motel on your hands — datasets check in, but they don’t check out.
This challenge has given rise to solutions designed to make data lakes more manageable, like data catalogs, which attempt to provide a data warehouse-like level of structure in a data lake.
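At its core, a catalog is an index over otherwise unindexed raw storage. Here is a toy Python sketch of the kind of metadata one tracks — the field names, dataset name and path are invented, not any vendor's schema:

```python
# A data lake without a catalog is write-only storage: data checks in
# but can't be found again. A catalog records, at ingest time, enough
# metadata to make each dataset discoverable later.
catalog = {}

def register(name, location, schema, owner, tags):
    catalog[name] = {
        "location": location,   # where the raw files live
        "schema": schema,       # columns discovered or declared at ingest
        "owner": owner,         # who to ask about this dataset
        "tags": tags,           # free-form labels for discovery
    }

register(
    "clickstream_raw",
    "s3://lake/clickstream/2024/",          # invented path
    {"user": "string", "action": "string", "ts": "timestamp"},
    "web-analytics",
    ["events", "raw"],
)

def find(tag):
    # Discovery by tag — the operation raw lake storage can't do on its own.
    return [name for name, meta in catalog.items() if tag in meta["tags"]]

print(find("events"))  # ['clickstream_raw']
```

Commercial catalogs add lineage, access control and automated schema inference on top, but the discoverability problem they solve is exactly this one.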
If you’re shopping for a data catalog, there are a lot of solutions to choose from. But before you start, note that this product category has evolved greatly over the past several years. Be sure that whatever you pick can take you to the next level.
But what exactly is the “next level”? To answer that question, think about what has happened to modern data architecture over the past decade. To be blunt, for most companies, it’s a tangled web of different kinds of repositories — relational databases, NoSQL databases, warehouses, marts, operational data stores — often from different vendors, owned by different departments, and existing in different locations.
So, before purchasing a data catalog, you need to ask how it is going to deal with not just your data lake, but all your other existing data infrastructure as well.
Can I Integrate a Data Lake With Current Data Infrastructure? And If So, How?
While it’s true that a data lake can handle pretty much any type of data imaginable, you’re still not going to want to put all of your existing data in it.
For starters, moving data is a massive undertaking, and unless there’s a compelling reason to move it — like AWS pulling the plug on you — it’s generally not something you’ll want to do.
So what do you do when you need to integrate a data architecture that includes, say, Hadoop, Oracle, and Teradata? Let’s take a step back and look at some of the technologies and models that have emerged to address this need.
Matt Aslett of 451 Research coined the term enterprise intelligence platform (EIP) to describe technology that enables analytics projects to be run simultaneously on datasets that may exist in many different data repositories. Similarly, John Santaferraro of Enterprise Management Associates (EMA) has described a unified analytics warehouse (UAW) that unifies all interactions with both data and analytics tools through a singular system.
That’s all well and good, you may say, but how are analysts’ forward-thinking ideas going to translate into a working data infrastructure at my company?
When my company was working on creating a system that would automate SQL queries over a dispersed, distributed data architecture, we asked this very question. Researching the literature on the issue led us to an open source technology called Trino. A key benefit of Trino, we learned, is that integrating complicated architecture — think Hadoop combined with Oracle, MongoDB and Teradata — doesn’t require moving, or even changing, any data. Rather, it leverages virtualization to create an abstraction layer that lets you look at a mess of different systems and see them as a single data source.
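The shape of that abstraction layer can be illustrated with a toy in plain Python — this is the general idea behind query federation, not Trino's actual API, and all the names and records are invented:

```python
# Two "back ends" with different native shapes: a relational-style table
# and a document-style store, standing in for, say, Oracle and MongoDB.
oracle_like = [("c1", "Acme", "EMEA"), ("c2", "Globex", "APAC")]
mongo_like = [{"_id": "c1", "orders": 12}, {"_id": "c2", "orders": 7}]

def scan_relational(rows):
    # Connector/adapter: normalize tuples to a shared logical schema.
    for cid, name, region in rows:
        yield {"customer_id": cid, "name": name, "region": region}

def scan_documents(docs):
    # Connector/adapter: normalize documents to the same logical schema.
    for doc in docs:
        yield {"customer_id": doc["_id"], "orders": doc["orders"]}

def federated_join(left, right, key):
    # The "engine": joins across sources in place, without moving or
    # rewriting the underlying data in either system.
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

result = federated_join(
    list(scan_relational(oracle_like)), list(scan_documents(mongo_like)),
    "customer_id",
)
print(result[0])
# {'customer_id': 'c1', 'name': 'Acme', 'region': 'EMEA', 'orders': 12}
```

Each connector translates its back end into one logical schema, and the query layer only ever sees that single unified source — which is the virtualization idea in miniature.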
So in assessing data catalogs, look for capabilities that give you an abstracted access layer over multiple data repositories. If it requires you to move data, you’re going to have to deal with an additional set of challenges that are going to set you back.
Ultimately, if you’re considering a data lake, you need to think through how you’re going to manage small datasets, maintain order, and integrate with existing data infrastructure. While each of these obstacles, if unaddressed, can derail your ability to gain value from your data lake investment, they’re all surmountable when the approaches I’ve described are applied.