Overview #
A deep dive into designing a data warehouse for analytical workloads — from choosing a modeling approach to handling slowly changing dimensions and optimizing for query performance.
This project documents the decisions behind a warehouse I built to consolidate multiple operational databases into a single source of truth for reporting.
Problem #
Multiple teams were pulling data from different operational databases and getting different numbers. There was no agreed-upon definition of core metrics, no historical snapshots, and no way to run cross-domain analytics without painful joins across systems.
Modeling Approach #
I chose a Kimball-style dimensional model over Data Vault for this use case. The reasoning:
- Team was small, iteration speed mattered
- Consumers were primarily BI analysts who understood dimensional modeling
- Data Vault made more sense at a larger scale with more sources
Layer structure:
- Staging — raw source data, typed and renamed, no business logic
- Intermediate — source-specific transformations and joins
- Dimensions — slowly changing dimension tables (SCD Type 2 where needed)
- Facts — grain-level transactional tables
- Marts — pre-aggregated tables for specific business domains
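To make the layering concrete, a staging model under this convention might look like the sketch below — typing and renaming only, no business logic. Source, table, and column names here are hypothetical:

```sql
-- models/staging/stg_orders.sql (hypothetical source and columns)
-- Staging layer: rename and cast only; no joins, no business logic
select
    id                              as order_id,
    customer_id                     as customer_id,
    cast(amount as numeric(12, 2))  as order_amount,
    cast(created_at as timestamp)   as ordered_at
from {{ source('app_db', 'orders') }}
```

Keeping staging models this thin makes it easy to audit against the source system, since every column maps one-to-one to a raw field.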
Handling Slowly Changing Dimensions #
SCD Type 2 was required for customer and product dimensions where historical accuracy mattered for reporting.
Implementation approach:
- Used dbt snapshots to track row-level changes with `valid_from`/`valid_to` date columns
- Surrogate keys on all dimension tables (not business keys)
- Grain documented in YAML and enforced with dbt tests
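A minimal dbt snapshot for the customer dimension might look like this (schema and column names are assumptions; note that dbt names the generated validity columns `dbt_valid_from`/`dbt_valid_to` by default):

```sql
-- snapshots/customers_snapshot.sql (hypothetical source and key)
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- dbt compares rows on updated_at and closes out the old version
-- by setting dbt_valid_to when a change is detected
select * from {{ source('app_db', 'customers') }}

{% endsnapshot %}
```

The `timestamp` strategy works when the source reliably maintains an `updated_at` column; otherwise dbt's `check` strategy, which compares a configured column list, is the fallback.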
Performance Optimization #
Clustering keys on fact tables by date, the most common filter.
Materialization strategy:
- Heavy intermediate models → incremental
- Dimensions → table (full refresh nightly)
- Marts → table (full refresh nightly)
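An incremental intermediate model under this strategy could be sketched as follows. The model name, columns, and the three-day reprocessing window are all assumptions for illustration:

```sql
-- models/intermediate/int_order_events.sql (hypothetical)
{{
    config(
        materialized='incremental',
        unique_key='event_id'
    )
}}

select
    event_id,
    order_id,
    event_type,
    event_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
-- Reprocess a trailing window rather than strictly new rows,
-- so late-arriving events still get picked up (window is an assumption)
where event_at >= dateadd('day', -3, (select max(event_at) from {{ this }}))
{% endif %}
```

The `unique_key` lets dbt merge reprocessed rows instead of duplicating them; the trailing window is the simplest guard against late-arriving data, at the cost of some redundant work per run.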
Query patterns: common joins are pre-aggregated into mart tables rather than left to analysts to reconstruct at query time.
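A mart model of this shape might pre-join facts to their dimensions and aggregate to the grain analysts actually query (all table and column names hypothetical):

```sql
-- models/marts/fct_daily_revenue.sql (hypothetical)
-- Pre-aggregated to one row per date and customer segment,
-- so BI queries never repeat the fact-to-dimension joins
select
    d.calendar_date,
    c.customer_segment,
    sum(f.order_amount) as total_revenue,
    count(*)            as order_count
from {{ ref('fct_orders') }}   f
join {{ ref('dim_date') }}     d on f.date_key     = d.date_key
join {{ ref('dim_customer') }} c on f.customer_key = c.customer_key
group by 1, 2
```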
What I Learned #
- Agree on metric definitions before building anything — retroactive changes are expensive
- SCD Type 2 sounds simple until you need to query point-in-time correctly
- Incremental models need careful thought about late-arriving data
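On the point-in-time pitfall: the correct pattern is to join each fact row to the dimension version that was valid when the event occurred, not the current version. A sketch, with hypothetical tables and columns:

```sql
-- Join each order to the customer record valid at order time,
-- not the customer's current attributes
select
    o.order_id,
    o.ordered_at,
    c.customer_segment   -- segment as of the order date
from fct_orders o
join dim_customer c
  on  c.customer_id = o.customer_id
  and o.ordered_at >= c.valid_from
  and o.ordered_at <  coalesce(c.valid_to, '9999-12-31'::timestamp)
```

Forgetting the validity-range predicates silently fans out the join (one order matches every historical version of the customer), which is exactly the class of bug that makes SCD Type 2 harder than it sounds.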
Tech Stack #
| Component | Tool |
|---|---|
| Warehouse | Snowflake |
| Transformation | dbt |
| Orchestration | Airflow |
| Source systems | PostgreSQL (multiple) |
| BI | Metabase |