Overview #
A deep dive into designing a data warehouse for analytical workloads — from choosing a modeling approach to handling slowly changing dimensions and optimizing for query performance.
This project documents the decisions behind a warehouse I built to consolidate multiple operational databases into a single source of truth for reporting.
Problem #
Multiple teams were pulling data from different operational databases and getting different numbers. There was no agreed-upon definition of core metrics, no historical snapshots, and no way to run cross-domain analytics without painful joins across systems.
Modeling Approach #
I chose a Kimball-style dimensional model over Data Vault for this use case. The reasoning:
- Team was small, iteration speed mattered
- Consumers were primarily BI analysts who understood dimensional modeling
- Data Vault made more sense at a larger scale with more sources
Layer structure:
- Staging — raw source data, typed and renamed, no business logic
- Intermediate — source-specific transformations and joins
- Dimensions — slowly changing dimension tables (SCD Type 2 where needed)
- Facts — grain-level transactional tables
- Marts — pre-aggregated tables for specific business domains
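To make the layering concrete, a staging model under this convention might look like the sketch below — typing and renaming only, no business logic. Source, table, and column names here are hypothetical:

```sql
-- models/staging/stg_orders.sql (hypothetical source and columns)
-- Staging layer: rename and cast only; no joins, no business logic
select
    id                              as order_id,
    customer_id                     as customer_id,
    cast(amount as numeric(12, 2))  as order_amount,
    cast(created_at as timestamp)   as ordered_at
from {{ source('app_db', 'orders') }}
```

Keeping staging models this thin makes it easy to audit against the source system, since every column maps one-to-one to a raw field.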
Handling Slowly Changing Dimensions #
SCD Type 2 was required for customer and product dimensions where historical accuracy mattered for reporting.
Implementation approach:
- Used dbt snapshots to track row-level changes with `valid_from`/`valid_to` date columns
- Surrogate keys on all dimension tables (not business keys)
- Grain documented in YAML and enforced with dbt tests
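A minimal dbt snapshot for the customer dimension might look like this (schema and column names are assumptions; note that dbt names the generated validity columns `dbt_valid_from`/`dbt_valid_to` by default):

```sql
-- snapshots/customers_snapshot.sql (hypothetical source and key)
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- dbt compares rows on updated_at and closes out the old version
-- by setting dbt_valid_to when a change is detected
select * from {{ source('app_db', 'customers') }}

{% endsnapshot %}
```

The `timestamp` strategy works when the source reliably maintains an `updated_at` column; otherwise dbt's `check` strategy, which compares a configured column list, is the fallback.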
Performance Optimization #
Clustering keys on fact tables by date, the most common filter.
Materialization strategy:
- Heavy intermediate models → incremental
- Dimensions → table (full refresh nightly)
- Marts → table (full refresh nightly)
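An incremental intermediate model under this strategy could be sketched as follows. The model name, columns, and the three-day reprocessing window are all assumptions for illustration:

```sql
-- models/intermediate/int_order_events.sql (hypothetical)
{{
    config(
        materialized='incremental',
        unique_key='event_id'
    )
}}

select
    event_id,
    order_id,
    event_type,
    event_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
-- Reprocess a trailing window rather than strictly new rows,
-- so late-arriving events still get picked up (window is an assumption)
where event_at >= dateadd('day', -3, (select max(event_at) from {{ this }}))
{% endif %}
```

The `unique_key` lets dbt merge reprocessed rows instead of duplicating them; the trailing window is the simplest guard against late-arriving data, at the cost of some redundant work per run.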
Query patterns: common joins are pre-aggregated into mart tables rather than left to analysts to reconstruct at query time.
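A mart model of this shape might pre-join facts to their dimensions and aggregate to the grain analysts actually query (all table and column names hypothetical):

```sql
-- models/marts/fct_daily_revenue.sql (hypothetical)
-- Pre-aggregated to one row per date and customer segment,
-- so BI queries never repeat the fact-to-dimension joins
select
    d.calendar_date,
    c.customer_segment,
    sum(f.order_amount) as total_revenue,
    count(*)            as order_count
from {{ ref('fct_orders') }}   f
join {{ ref('dim_date') }}     d on f.date_key     = d.date_key
join {{ ref('dim_customer') }} c on f.customer_key = c.customer_key
group by 1, 2
```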
What I Learned #
- Agree on metric definitions before building anything — retroactive changes are expensive
- SCD Type 2 sounds simple until you need to query point-in-time correctly
- Incremental models need careful thought about late-arriving data
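On the point-in-time pitfall: the correct pattern is to join each fact row to the dimension version that was valid when the event occurred, not the current version. A sketch, with hypothetical tables and columns:

```sql
-- Join each order to the customer record valid at order time,
-- not the customer's current attributes
select
    o.order_id,
    o.ordered_at,
    c.customer_segment   -- segment as of the order date
from fct_orders o
join dim_customer c
  on  c.customer_id = o.customer_id
  and o.ordered_at >= c.valid_from
  and o.ordered_at <  coalesce(c.valid_to, '9999-12-31'::timestamp)
```

Forgetting the validity-range predicates silently fans out the join (one order matches every historical version of the customer), which is exactly the class of bug that makes SCD Type 2 harder than it sounds.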
Tech Stack #
| Component | Tool |
|---|---|
| Warehouse | Snowflake |
| Transformation | dbt |
| Orchestration | Airflow |
| Source systems | PostgreSQL (multiple) |
| BI | Metabase |