
Ingredients of a Data Warehouse

Cutting through the noise around data warehousing

Kovid Rathee · Published in Cognizant Servian · Nov 8, 2022 · 9 min read

Signal vs. Noise

The term data warehousing has been appropriated to mean a whole lot of things it shouldn’t, so the general understanding of what a data warehouse is has become a nomenclature problem more than anything else. Many people believe that the only distinction between a data warehouse and a database is the underlying infrastructure.

Although infrastructure plays a significant role in building a data warehouse, data warehousing has another aspect which, I think, supersedes infrastructure in importance: data modeling. Because data modeling is rarely acknowledged as one of the core ideas behind data warehousing, the common understanding of data warehouses is flawed.

This article will take you through the three core ideas behind data warehousing and also discuss how data modeling is central to creating a data warehouse:

  • Data modeling
  • Storage architecture
  • Processing architecture

The article will also discuss how data warehouse platforms are pre-packaged products combining storage and processing, and why those platforms are not, by themselves, data warehouses. For instance, contrary to popular belief, dumping data into a platform like Redshift or Snowflake doesn’t make it a data warehouse. There’s more to it.

Let’s dive straight in.

Three Core Ideas Behind Data Warehouses

TL;DR — data models, storage architecture, and processing methodology.

Data Models

First, the database data model fed by applications, third-party integrations, IoT devices, etc., is designed to optimize the application’s performance. You set yourself up for failure when you start overloading those databases with analytical workloads. That is not to say that these databases can’t support analytical workloads at all, but because they’re modeled a certain way, they should only be used sparingly for such workloads, directed mainly by the application layer.

How a database is modeled depends on how the application wants to access the data. With relational databases, for instance, applications want to get all the data for an application entity like customers, products, brands, etc. Each entity has extensions based on its actions, such as customers having transactions, brands having products, and products having prices. The database model is created in such a way that the application can access any of these entities efficiently.

Let’s look at a simple relational database model for an application that allows customers to purchase products belonging to various brands. The different entities are created to cater to specific features of the application, such as a customer signing up, a customer ordering a product, and a customer adding a new address.
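Here’s a rough sketch of what such a schema could look like, written as PostgreSQL-style DDL. All table and column names are illustrative assumptions, not prescriptions:

```sql
-- Illustrative application schema for customers buying products from brands.
CREATE TABLE customers (
    customer_id BIGINT PRIMARY KEY,
    full_name   VARCHAR(255),
    email       VARCHAR(255),
    created_at  TIMESTAMP
);

CREATE TABLE addresses (
    address_id   BIGINT PRIMARY KEY,
    customer_id  BIGINT REFERENCES customers (customer_id),
    address_line VARCHAR(500),
    country      VARCHAR(100)
);

CREATE TABLE brands (
    brand_id   BIGINT PRIMARY KEY,
    brand_name VARCHAR(255)
);

CREATE TABLE products (
    product_id   BIGINT PRIMARY KEY,
    brand_id     BIGINT REFERENCES brands (brand_id),
    product_name VARCHAR(255),
    price        NUMERIC(10, 2)
);

CREATE TABLE orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT REFERENCES customers (customer_id),
    address_id  BIGINT REFERENCES addresses (address_id),
    ordered_at  TIMESTAMP
);

CREATE TABLE order_items (
    order_item_id BIGINT PRIMARY KEY,
    order_id      BIGINT REFERENCES orders (order_id),
    product_id    BIGINT REFERENCES products (product_id),
    quantity      INT,
    unit_price    NUMERIC(10, 2)
);

CREATE TABLE transactions (
    transaction_id BIGINT PRIMARY KEY,
    order_id       BIGINT REFERENCES orders (order_id),
    amount         NUMERIC(10, 2),
    status         VARCHAR(50),    -- e.g. 'successful', 'failed'
    transacted_at  TIMESTAMP
);
```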

The design is based on the actions that can be performed on the application. When you start querying the database with this data model not to serve the application but to serve data analysts, ML engineers, and data scientists (and remember that this example is ultra-simplistic; in real life, you’ll find much more complicated data models), you’ll start facing two significant problems really quickly:

  • Slow queries — this shouldn’t surprise you as you will have to join many tables to reach the desired output. You’ll need to take care of indexes, partitions, data types, disk-intensive operations, etc. Remember that slow queries have a genuine cost attached — they cost a lot more in resource time and block other analysts from doing their work.
  • SQL verbosity — not just the joins but overall unpreparedness for handling such workloads will make your SQL queries much more verbose than they need to be. You’ll have data types that are great for the consumption of the application but not for data analysis (at least not in their native form) — array, JSON, etc.

To understand this better, say you want to calculate a month-wise summary of successful transactions for every product in the catalog. This data model works completely fine for the application, because it is built to serve application features, not aggregates; but for the summary, notice how you have to navigate through several tables to get each piece of information the query needs.
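A sketch of such a query, in PostgreSQL syntax, against the illustrative schema above:

```sql
-- Month-wise summary of successful transactions per product: the query
-- has to hop from transactions to orders to order_items to products
-- just to tie an amount back to a product.
SELECT
    DATE_TRUNC('month', t.transacted_at) AS txn_month,
    p.product_name,
    COUNT(DISTINCT t.transaction_id)     AS successful_transactions,
    SUM(oi.quantity * oi.unit_price)     AS total_amount
FROM transactions t
JOIN orders      o  ON o.order_id   = t.order_id
JOIN order_items oi ON oi.order_id  = o.order_id
JOIN products    p  ON p.product_id = oi.product_id
WHERE t.status = 'successful'
GROUP BY 1, 2
ORDER BY 1, 2;
```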

Now, suppose you have hundreds and hundreds of such tables to navigate and want to simplify the process of answering business questions. In that case, you must create another data model that serves the purpose. We’ll talk about many data models later, but here’s the simplest one that broadly categorizes all your source tables into facts and dimensions.

Facts are the things you want to measure, such as sales, response time, etc., and dimensions are the things you want to measure across, such as sales by store or sales by country. The structure below isn’t an exact description of a data warehouse model, but it loosely demonstrates how you would reorganize your data into these two types of tables in a data warehouse.
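Here’s one minimal sketch of that reorganization, with a single fact table and a few denormalized dimension tables (all names are illustrative assumptions):

```sql
-- One fact table holding the measures, surrounded by denormalized dimensions.
CREATE TABLE dim_product (
    product_key  BIGINT PRIMARY KEY,
    product_name VARCHAR(255),
    brand_name   VARCHAR(255)      -- denormalized from the brands table
);

CREATE TABLE dim_customer (
    customer_key BIGINT PRIMARY KEY,
    full_name    VARCHAR(255),
    country      VARCHAR(100),     -- denormalized from the addresses table
    region       VARCHAR(100)
);

CREATE TABLE dim_date (
    date_key      INT PRIMARY KEY,  -- e.g. 20221108
    calendar_date DATE,
    month         INT,
    year          INT
);

CREATE TABLE fact_transactions (
    transaction_key BIGINT PRIMARY KEY,
    date_key        INT    REFERENCES dim_date (date_key),
    product_key     BIGINT REFERENCES dim_product (product_key),
    customer_key    BIGINT REFERENCES dim_customer (customer_key),
    quantity        INT,
    amount          NUMERIC(10, 2),
    is_successful   BOOLEAN
);
```

Notice how the brand name now lives directly on the product dimension and the country on the customer dimension, so neither needs an extra join.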

When you deal with hundreds of facts and dimensions, you want a direct way to measure your metrics across those dimensions, and a dimensional model gives you exactly that. If you just want to look at transactions by product, you don’t really have to know which brand the products belong to. If you have to measure sales by the customer’s region, you don’t even have to know who the customer is.

Remodeling a relational database into a data warehouse data model lets you connect the measures directly with the dimensions instead of taking a complex route through many joins. Here’s a simple example of measuring a month-wise summary of transactions by product and by brand separately.
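A sketch against the illustrative dimensional model above:

```sql
-- Month-wise successful transactions by product ...
SELECT d.year, d.month, p.product_name,
       COUNT(*)      AS successful_transactions,
       SUM(f.amount) AS total_amount
FROM fact_transactions f
JOIN dim_date    d ON d.date_key    = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
WHERE f.is_successful
GROUP BY 1, 2, 3;

-- ... and by brand: structurally the same query, just a different
-- column picked from the same dimension table.
SELECT d.year, d.month, p.brand_name,
       COUNT(*)      AS successful_transactions,
       SUM(f.amount) AS total_amount
FROM fact_transactions f
JOIN dim_date    d ON d.date_key    = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
WHERE f.is_successful
GROUP BY 1, 2, 3;
```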

Remember that the queries above remain structurally almost the same, with more or less the same number of lines, irrespective of the dimension you want to see a metric across. The story changes slightly when you want to see a metric across multiple dimensions, but you’ll still be far better off than querying a relational data model.

To summarize, you can use one of several data modeling techniques when creating a data warehouse. I’ll briefly touch upon the most common ones:

  • 3NF (third normal form) modeling — owing its origins to entity-relationship modeling, 3NF is also widely used in data warehousing as a normalized layer feeding the layers further downstream. It provides tremendous flexibility but can lead to really verbose queries.
  • Dimensional modeling — a technique that restructures the data into denormalized fact and dimension tables to make analytical workloads easier to run. The best-known expressions of this method are the star schema and the snowflake schema.
  • Data Vault modeling — a hybrid between 3NF and dimensional modeling, though much closer to 3NF. It tries to keep the best features of 3NF, such as ease of querying highly granular, historical data, while restructuring the data into new types of tables, such as hubs, links, satellites, bridges, and PITs (a rough sketch follows this list).
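For a rough idea of what that restructuring looks like, here is a minimal Data Vault sketch for the customer and transaction data used earlier; the hub, link, and satellite names are illustrative assumptions:

```sql
-- Hubs hold only business keys and load metadata.
CREATE TABLE hub_customer (
    customer_hk   CHAR(32) PRIMARY KEY,   -- hash of the business key
    customer_bk   VARCHAR(255),           -- business key, e.g. customer number
    load_date     TIMESTAMP,
    record_source VARCHAR(100)
);

CREATE TABLE hub_product (
    product_hk    CHAR(32) PRIMARY KEY,
    product_bk    VARCHAR(255),
    load_date     TIMESTAMP,
    record_source VARCHAR(100)
);

-- Links capture relationships between hubs.
CREATE TABLE link_transaction (
    transaction_hk CHAR(32) PRIMARY KEY,
    customer_hk    CHAR(32) REFERENCES hub_customer (customer_hk),
    product_hk     CHAR(32) REFERENCES hub_product (product_hk),
    load_date      TIMESTAMP,
    record_source  VARCHAR(100)
);

-- Satellites hold the descriptive, historized attributes.
CREATE TABLE sat_customer_details (
    customer_hk   CHAR(32) REFERENCES hub_customer (customer_hk),
    load_date     TIMESTAMP,
    full_name     VARCHAR(255),
    email         VARCHAR(255),
    record_source VARCHAR(100),
    PRIMARY KEY (customer_hk, load_date)
);
```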

The choice of data model should be driven by the type of workload you expect, the scale and speed at which you want the data warehouse to operate, the agility with which you want to adapt to changes in the source systems, and the cost you’re willing to pay for all of this. But the model alone isn’t the whole story; it’s just one piece (albeit a very important piece) of the puzzle. Once you choose the model, you’ll have to store the data in memory or on disk. Let’s talk about the different storage architectures and their impact on your data warehousing endeavor.

Storage Architecture

Second, the application database’s storage architecture (both on disk and in memory) is built to serve a specific purpose. Applications usually use several databases to account for ACID compliance, schema flexibility, in-memory storage, etc., each best designed for a particular workload. For instance, as mentioned in the previous section, MySQL is designed to support applications that read, write, or update a record or a set of records at a time, which means it makes sense to keep whole records physically close together on disk, stored contiguously.

In contrast, analytical workloads usually need to look through specific columns all at once to calculate aggregates, or to peek at values of the same column for the next record or a defined window of records to calculate moving averages, cumulative sums, and so on. Row-based storage doesn’t make much sense for those queries: in a row-based structure (as in MySQL or PostgreSQL), whole records sit next to each other on disk, so reading a single column means scanning past every other column of every row.

Relational databases such as MySQL and PostgreSQL store data in pages, which continuously move in and out of memory based on the workload. Even if you need a single record, the whole page gets read into memory. Now imagine an analytical query that needs a single column spread across many database pages. That can’t be efficient! It isn’t.

To cater to analytical workloads, it makes much more sense to store the data to be aggregated at the same physical location, the most granular representation of which is a database page. In column-based storage, a page conceptually holds many values of a single column rather than a handful of complete rows.

This storage arrangement should make it clear how, by accessing a small block of disk or memory, you can calculate aggregates, peek into specific columns of leading or lagging rows, and so on for your analytical workloads. The benefit of storing columns contiguously doesn’t end with a smaller search area for specific queries; it also helps compress the column data more efficiently, because you’re now applying compression to a block of storage that contains values of a single data type, and compression algorithms save more space on some data types than others.
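To make the contrast concrete, consider an aggregate like the following, run against the illustrative fact table from earlier. A row store has to read every page of the table even though only two columns are referenced, while a column store reads just the pages holding those two columns:

```sql
-- Total amount per product: only product_key and amount are needed.
-- Row store:    every page of fact_transactions gets scanned.
-- Column store: only the product_key and amount column pages are read,
--               and each page compresses well because it holds one data type.
SELECT product_key, SUM(amount) AS total_amount
FROM fact_transactions
GROUP BY product_key;
```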

Processing Methodology

Third, the processing methodology and the processing infrastructure make a significant difference in how your workloads are handled. Traditionally, relational databases have served mid-sized applications using a single-node deployment with appropriate provisions for read-intensive workloads, point-in-time backups, and failovers.

The philosophy of single-node databases is based on symmetric multiprocessing (SMP): with most processors being multi-core, the operating system can split the load between those cores while they share all other resources, such as I/O bandwidth and the system’s memory.

This obviously doesn’t work at a scale where you have to process enormous amounts of data in a very short time. Enter massively parallel processing (MPP) systems. With MPP systems, you can use the age-old computer science method of divide and conquer to split a processing task across many nodes, each with its own processors and memory, which do the work in parallel and consequently save a lot of time for the business. A typical MPP system has a leader (or coordinator) node that parses and plans queries, and a fleet of compute nodes that each hold and process a slice of the data.

Most modern cloud data warehouses are architected as massively parallel processing systems. Popular examples of MPP systems are Snowflake, Databricks (Spark), Redshift, and so on. When choosing between SMPs and MPPs, you need to understand the kind of read-and-write workloads you’re going to have, how you’ll store the data, and the money you’re willing to spend to get the results of your queries faster.
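To see how an MPP warehouse exposes this divide-and-conquer to you, here is a sketch of the illustrative fact table declared with Amazon Redshift’s distribution and sort keys, which control how rows are spread across compute nodes and ordered within each node’s slice:

```sql
-- Redshift-specific DDL: DISTKEY decides which compute node a row lands on,
-- SORTKEY keeps each node's slice of the data ordered for faster scans.
CREATE TABLE fact_transactions (
    transaction_key BIGINT,
    date_key        INT,
    product_key     BIGINT,
    customer_key    BIGINT,
    quantity        INT,
    amount          DECIMAL(10, 2),
    is_successful   BOOLEAN
)
DISTSTYLE KEY
DISTKEY (product_key)
SORTKEY (date_key);
```

With rows distributed by product_key, an aggregate grouped by product can be computed mostly within each node before the leader combines the partial results.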

Epilogue

Having gone through the core ideas of data modeling, storage, and processing, it’s clear that the identifying feature of a data warehouse is its data model. The storage and processing architecture depend on scale and hence come later. For a business with low volumes of data, you can start your data warehousing journey by modeling a warehouse and exposing it through a traditional relational database such as MySQL or PostgreSQL. Nothing says you can’t do that.

Besides the volume of data, at the heart of the data warehousing question also lies the variety of access patterns you expect. Both data volume and access patterns help you navigate the data warehouse storage and processing space much better. The choice of data model, in contrast, is usually driven by how fast you want to deliver and how agile you want to be when the systems feeding the data warehouse change.

How you understand data modeling, storage, and processing will decide how you understand data warehouses. Are data warehouses just data platforms, i.e., pre-packaged infrastructure for data processing, or are they more than that? Data modeling is the key to answering this question.
