Recently I’ve been thinking about the common themes we encounter in our data management client work, especially early on in the client relationship. One of the biggest themes parallels a challenge faced by clinical psychologists — the problem the patient brings you is rarely the problem you end up addressing. This seems to be especially true with all the hype surrounding data science-related projects over the past few years.
Over the past 24-36 months, the majority of our long-term client engagements have started with clients wanting to address what they perceived to be a “data science” or “big data” problem. In nearly every case, we found that they lacked the data management infrastructure necessary for serious analytical work, and that they would be better served, and able to address a wider set of stakeholders in the organization, by first tying their various data sources together into an integrated data lake/data warehouse. This approach not only provides a richer data set for data science work, it also automates many of the manual processes required to cleanse and stitch the data together, and it enables operational business intelligence (BI) and self-service reporting. Finance/accounting, operations, sales/marketing, the C-suite, and so on are then able to produce reports and analytics from a single version of the truth.
At the heart of these solutions is the data lake/data warehouse. While the concept of a data warehouse is not new — I’ve been implementing them for nearly 30 years now — it still blows me away that we often find ourselves helping a client build consensus within the organization around the concept of a data warehouse and the overall business case. Many folks simply can’t understand why they need to build yet another database and start feeding it data.
In the context of business intelligence, a data warehouse is a core repository that serves as a “single version of truth,” integrating and rolling up data across the various data sources within an organization. The stored data is both historical and current, and supports analytical reporting, executive dashboards, “self-service BI,” and data science. Many organizations don’t have a clear understanding of how a data warehouse differs from the various transactional databases that power things like customer relationship management, enterprise resource planning, accounting, and other core infrastructure within an organization’s data landscape. The purpose of this post is to help the reader understand the differences between a data warehouse and a traditional, transactional database.
Database Architecture: 3NF vs. Dimensional Modeling
The primary difference between a data warehouse and a transactional database is that the underlying table structures for a transactional database are designed for fast and efficient data inserts and updates (it’s all about getting data into the database). For a data warehouse, the underlying table structures are designed for efficient data reads, allowing users to develop simple and powerful queries for retrieving data for analytical purposes.
The rest of this post goes into greater detail on the differences between these two database designs.
Transactional databases are designed and built using a data modeling technique called third normal form (3NF). The purpose of using this technique is to build a database that is optimized for the transactional applications that are running on top of the database. The overarching goal of this design technique is to reduce data duplication/redundancy, and it follows a database design consisting of a set of atomic tables that are optimized for fast inserts and updates.
The disadvantage of this design technique is that a database of smaller, normalized tables delivers poor query performance when you start performing analytics and business intelligence, because you have to join many small tables together to answer the questions you are trying to address. The overhead of these “joins” greatly impacts database performance, resulting in long query times; sometimes the queries time out altogether and never finish.
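To make the join overhead concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The schema and data are hypothetical (the table and column names are ours, not from any particular client system), but they show how even a simple analytical question against a 3NF design already requires chaining several joins:

```python
import sqlite3

# Hypothetical 3NF schema: each entity lives in its own atomic table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions   (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT,
                        region_id INTEGER REFERENCES regions(region_id));
CREATE TABLE products  (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers(customer_id),
                        product_id INTEGER REFERENCES products(product_id),
                        amount REAL);
""")
conn.executemany("INSERT INTO regions VALUES (?, ?)", [(1, "West"), (2, "East")])
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Acme", 1), (2, "Globex", 2)])
conn.executemany("INSERT INTO products VALUES (?, ?)", [(1, "Widget")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 100.0), (2, 2, 1, 250.0), (3, 1, 1, 50.0)])

# A simple analytical question ("total sales by region") already needs
# to join across the normalized tables; a realistic 3NF schema would
# require many more joins than this toy example.
rows = conn.execute("""
    SELECT r.name, SUM(o.amount)
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN regions   r ON c.region_id   = r.region_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
print(rows)  # [('East', 250.0), ('West', 150.0)]
```

With only four tables the query is still manageable; in a production transactional schema the same question can easily touch a dozen tables, which is where analytical query times start to balloon.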
The other disadvantage of this design technique is that the database overwrites previous values with the most current version of a transaction, making it hard to track changes over time — and that history is important for analytics. As mentioned earlier, companies without data warehouses often attempt to perform analytics and business intelligence on copies of their transactional databases. The poor query performance, the lack of historical perspective, and the lack of integrated datasets across multiple applications and data sources are the primary reasons that companies invest in a data warehouse for powering their analytics and providing a single version of truth within the organization.
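The “overwriting history” problem can be sketched in a few lines. The example below is a simplified illustration (the tables and dates are invented): a transactional UPDATE destroys the prior value, while a warehouse-style dimension that keeps one row per version — the classic “type 2” slowly changing dimension pattern — preserves it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Transactional style: an UPDATE overwrites the old value in place.
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'West')")
conn.execute("UPDATE customer SET region = 'East' WHERE customer_id = 1")
current = conn.execute("SELECT region FROM customer").fetchall()
print(current)  # [('East',)] -- no trace that this customer was ever 'West'

# Warehouse style ("type 2" dimension): keep one row per version,
# with validity dates, so history survives for analytics.
conn.execute("""CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key, one per version
    customer_id  INTEGER,               -- the original business key
    region       TEXT,
    valid_from   TEXT,
    valid_to     TEXT)""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 'West', '2023-01-01', '2024-06-30'),
                  (2, 1, 'East', '2024-07-01', '9999-12-31')])
history = conn.execute(
    "SELECT region FROM dim_customer WHERE customer_id = 1 ORDER BY valid_from"
).fetchall()
print(history)  # [('West',), ('East',)] -- both versions are retained
```

This is exactly the historical perspective that copies of transactional databases cannot provide: once the source system overwrites a value, no downstream copy can recover it.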
In contrast to 3NF data modeling, data warehouses use a data modeling concept called dimensional modeling. This design technique calls for organizing your data into facts and dimensions. Fact tables record measurements or metrics for a specific event (e.g., sales). Dimension tables contain the data by which you would slice a fact table. For example, you would slice a fact table containing sales numbers by region, time, product, employee, etc. In this example, region, time, product, and employee are all dimensions. Organizing the data in this manner allows business users to easily slice and dice data for analytical purposes and helps support organizational goals such as enabling business users with “self-service BI” capabilities. This is critical for unlocking the power of BI tools such as Tableau, Looker, and Power BI.
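A minimal star-schema sketch makes the fact/dimension split tangible. Again the schema is hypothetical (our own table and column names, not a prescribed layout), built with Python’s sqlite3 module: one fact table of sales events surrounded by the dimensions used to slice it:

```python
import sqlite3

# Hypothetical star schema: a central fact table keyed to its dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_region  (region_key INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE fact_sales  (region_key INTEGER, product_key INTEGER,
                          date_key INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO dim_region VALUES (?, ?)", [(1, "West"), (2, "East")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [(1, 2024, 1), (2, 2024, 2)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 100.0), (1, 2, 1, 75.0), (2, 1, 2, 200.0)])

# "Slicing" the fact by any dimension is one join plus a GROUP BY --
# the same simple shape whether you slice by region, product, or date.
by_region = conn.execute("""
    SELECT r.region_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_region r ON f.region_key = r.region_key
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(by_region)  # [('East', 200.0), ('West', 175.0)]
```

That uniform one-join-per-dimension query shape is what lets BI tools generate fast, predictable queries behind drag-and-drop interfaces, which is the foundation of “self-service BI.”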
Having a data warehouse is a critical component of a modern analytics environment. It differs from existing transactional database systems in that it is organized for integrated reporting across all of your transactional systems and data sources, and it is designed using the dimensional modeling technique described above. Application developers are typically more focused on third normal form modeling, which is why it is important to have a data warehouse architect skilled in dimensional modeling design and develop your data warehouse. In doing so, the organization ensures that it has a solid foundation for realizing the full value of BI tools such as Tableau, Power BI, and Looker, as well as the ability to produce clean, integrated data for data science purposes.
This post originally appeared on the WaterlooData blog.