What Is a Data Lake?
A data lake is a digital storage area where businesses hold structured and unstructured data, usually in its raw format. Like other kinds of storage repositories, data lakes house data sets until they are ready for analysis or some other use case.
In the same way a natural lake is fed by various water sources, a data lake receives many types of data. Data flowing into a data lake may include data tables, social media data, log files and unstructured content such as emails, images or videos.
What Are the Advantages of a Data Lake?
- Data lakes are a flexible storage technology that can store data of any format or size.
- Data lakes have a strong focus on rapid and flexible data ingestion.
- A variety of data types and data sources are available in one location, which supports statistical discovery.
- Data lakes are designed for low-cost storage so they often house a high volume of data at a relatively low price.
Data Lake vs. Data Warehouse: What’s the Difference?
Data lakes were created to provide more agile, flexible storage options that allow for all types of data. Data lakes focus on rapid ingestion of data no matter the type or size and solve the need to have all types of data in one place for discovery, thereby enabling cross-source analysis. For example, a company might use a data lake to store a vast amount of marketing data, including media data files, social media comments and ad engagement metrics, before processing and analysis.
Data warehouses, on the other hand, are more rigid. Unlike a data lake, a data warehouse doesn’t ingest data files in their raw format. Instead, it defines data tables and data format structures up front. Data warehouses ingest mostly structured data and store it in preset formats. Data that’s not structured appropriately, or doesn’t adhere to predetermined requirements, won’t be permitted in the data warehouse. That said, data warehouses are preferable when data is in a consistent, preprocessed, structured format. An example might be a structured daily media data file that adheres to a set table format and consistent update schedule.
Data lakes came about in response to the rigidity of data warehouses. Businesses needed a way to store all their data quickly, in a variety of formats, to enable discovery and analysis of the data regardless of structure. Some businesses resisted the constraints imposed by data warehouses’ requirement for highly structured data. These constraints could often delay storage and analysis because every data source had to be reformatted into a rigid tabular format with predetermined data field structures (e.g., column A must be a number, column B must be a date, and so on).
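To make the contrast concrete, here is a minimal sketch of the two ingestion styles. It assumes a hypothetical two-column schema (`impressions`, `report_date`) and uses plain Python lists to stand in for the warehouse table and the lake; none of these names come from a real product.

```python
import json
from datetime import date

# Hypothetical warehouse schema: column name -> parser that enforces the type.
WAREHOUSE_SCHEMA = {
    "impressions": int,                 # "column A must be a number"
    "report_date": date.fromisoformat,  # "column B must be a date"
}


def warehouse_insert(table, row):
    """Structure is enforced on write: reject rows that do not fit the schema."""
    if set(row) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    # Each parser raises ValueError when a value does not match its column type.
    table.append({col: parse(row[col]) for col, parse in WAREHOUSE_SCHEMA.items()})


def lake_ingest(lake, source, payload):
    """Raw ingestion: keep the bytes as received, with no validation."""
    lake.append({"source": source, "raw": payload})


warehouse, lake = [], []
records = [
    {"impressions": "1200", "report_date": "2024-03-01"},
    {"impressions": "n/a", "report_date": "yesterday"},
]

rejected = 0
for record in records:
    # The lake accepts every record in its original form.
    lake_ingest(lake, "ad_platform", json.dumps(record).encode("utf-8"))
    try:
        warehouse_insert(warehouse, record)
    except ValueError:
        rejected += 1  # the malformed record never reaches the warehouse
```

In this sketch the lake ends up holding both records while the warehouse holds only the one that parsed cleanly. The malformed record would have to be cleaned and reformatted before the warehouse could accept it, which is the delay described above.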
Why Do Organizations Use Data Lakes?
Organizations primarily use data lakes for their flexibility and speed. Data lakes process and store data quickly, regardless of data format. For these reasons, the best use case for a data lake is any situation in which we create data in a variety of formats and will find value in having all that data in one location. An industry or organization that focuses on discovering new themes or patterns across various sources of data would benefit from data lakes because one of the key advantages of a data lake is that all data (regardless of type) is in one place.
Another major benefit of data lakes is that they store a wide variety of data in its native format. We don’t need to restructure the data in order to enter it into storage. Therefore, the data retains all its original information, including metadata. However, this also means data lake users need to be familiar with ways to analyze data with different structures. For example, a data lake user may need to know how to analyze unstructured text, extract metadata from stored images and analyze large structured data tables. The ability to analyze data in all formats (beyond the standard data table format) used to be a rare skill set but is becoming more necessary among data scientists as businesses look to their data teams to extract value from all formats of data.
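As a rough illustration of what working with several formats can look like, the sketch below picks a reader based on file extension and applies structure only when the data is read. The `LAKE` dictionary, its paths and its contents are invented for the example, and a real lake would sit on object or file storage rather than in memory.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical raw objects as they might sit in a lake, keyed by path.
LAKE = {
    "social/comments.txt": b"Love the new ad. The ad made me laugh.",
    "web/events.jsonl": b'{"page": "/home", "ms": 120}\n{"page": "/buy", "ms": 340}\n',
    "media/daily.csv": b"channel,clicks\nsearch,40\nsocial,25\n",
}


def read_text(raw):
    """Unstructured text: count word frequencies."""
    words = raw.decode("utf-8").lower().replace(".", " ").split()
    return Counter(words)


def read_jsonl(raw):
    """Log-style data: one JSON object per line."""
    return [json.loads(line) for line in raw.decode("utf-8").splitlines() if line]


def read_csv(raw):
    """Structured table: rows as dictionaries keyed by column header."""
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


READERS = {".txt": read_text, ".jsonl": read_jsonl, ".csv": read_csv}


def load(path):
    """Choose a reader from the file extension and parse the raw bytes."""
    suffix = path[path.rindex("."):]
    return READERS[suffix](LAKE[path])


word_counts = load("social/comments.txt")
events = load("web/events.jsonl")
clicks = load("media/daily.csv")
total_clicks = sum(int(row["clicks"]) for row in clicks)
```

Each reader returns a different shape (word counts, a list of event dictionaries, a list of table rows), so the analyst has to know which technique fits which format.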
Data Lake Example
Let’s look at an omnichannel marketing data lake.
Marketing departments that support larger businesses will often source their intelligence from an incredibly wide mix of data sources, such as:
- Web or mobile analytics data
- Media impressions, clicks and ad viewership data
- Data from cookies or marketing attribution data
- Market-level information
- Consumer trending data and benchmarks
- Demographics and market profiling or segmentation data
- Web or mobile customer journey and data around customer touch points
- Social media account analytics data
- Social media listening data
- Public relations data and press mentions
All of these data sources may be housed in a data lake, ready to be analyzed either as standalone data sources or in relation to other data sources stored in the data lake.
What Are the Disadvantages of a Data Lake?
- Varying data types mean users need to be versed in ways to analyze and process a wide variety of data.
- As the size of the data lake increases, some systems can have trouble scaling or there may be unexpected costs.
- Data lakes need a high level of stewardship and administration to avoid becoming data swamps, or dumping grounds for undocumented data sources.
- Given the wide variety of data structures, data lakes can be harder to secure than more structured data locations like data warehouses.
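The stewardship point above can be illustrated with a small sketch of a data catalog: every stored object is expected to have a documented owner and description, and anything without an entry is flagged. The `Catalog` and `CatalogEntry` classes, the field names and the file paths are all hypothetical stand-ins for whatever catalog tool a team actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    path: str
    owner: str
    description: str
    data_format: str


class Catalog:
    """Hypothetical in-memory catalog used only for illustration."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse entries whose documentation fields are blank.
        if not entry.owner.strip() or not entry.description.strip():
            raise ValueError(f"{entry.path}: owner and description are required")
        self._entries[entry.path] = entry

    def undocumented(self, stored_paths):
        """Return stored paths that have no catalog entry."""
        return sorted(p for p in stored_paths if p not in self._entries)


catalog = Catalog()
catalog.register(
    CatalogEntry("media/daily.csv", "media-team", "Daily ad clicks by channel", "csv")
)

# Paths actually present in storage (invented for the example).
stored = ["media/daily.csv", "tmp/export_final_v2.bin"]
orphans = catalog.undocumented(stored)
```

Running a check like this on a schedule is one way to spot undocumented files before they accumulate into a data swamp.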