Data Warehouse Design & Modeling

Author
Albin Cikaj
Building data pipelines and writing about what I learn along the way.

Overview

A deep dive into designing a data warehouse for analytical workloads — from choosing a modeling approach to handling slowly changing dimensions and optimizing for query performance.

This project documents the decisions behind a warehouse I built to consolidate multiple operational databases into a single source of truth for reporting.


Problem

Multiple teams were pulling data from different operational databases and getting different numbers. There was no agreed-upon definition of core metrics, no historical snapshots, and no way to run cross-domain analytics without painful joins across systems.


Modeling Approach

I chose a Kimball-style dimensional model over a Data Vault for this use case. The reasoning:

  • The team was small, so iteration speed mattered
  • Consumers were primarily BI analysts who understood dimensional modeling
  • Data Vault made more sense at a larger scale with more sources

Layer structure:

  1. Staging — raw source data, typed and renamed, no business logic
  2. Intermediate — source-specific transformations and joins
  3. Dimensions — slowly changing dimension tables (SCD Type 2 where needed)
  4. Facts — grain-level transactional tables
  5. Marts — pre-aggregated tables for specific business domains
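To make the staging layer concrete, a staging model in this style typically does nothing but cast types and rename columns. A minimal sketch (source, table, and column names are hypothetical, not from the actual project):

```sql
-- models/staging/stg_orders.sql (hypothetical names)
-- Staging layer: typed and renamed only -- no business logic.
select
    id::integer           as order_id,
    cust_id::integer      as customer_id,
    created::timestamp    as ordered_at,
    amt::number(12, 2)    as order_amount
from {{ source('app_db', 'orders') }}
```

Keeping business logic out of staging means every downstream model starts from the same cleanly typed inputs.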

Handling Slowly Changing Dimensions

SCD Type 2 was required for customer and product dimensions where historical accuracy mattered for reporting.

Implementation approach:

  • Used dbt snapshots to track row-level changes with valid_from / valid_to date columns
  • Surrogate keys on all dimension tables (not business keys)
  • Grain documented in YAML and enforced with dbt tests
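A dbt snapshot for the customer dimension might be configured like this (schema and column names are hypothetical). Note that dbt generates its validity columns as dbt_valid_from / dbt_valid_to, which can be aliased to valid_from / valid_to in the downstream dimension model:

```sql
-- snapshots/customers_snapshot.sql (hypothetical names)
{% snapshot customers_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}
select * from {{ source('app_db', 'customers') }}
{% endsnapshot %}
```

The downstream dimension model can then derive a surrogate key from the business key plus the version's start timestamp, e.g. with {{ dbt_utils.generate_surrogate_key(['customer_id', 'dbt_valid_from']) }}, so that fact tables join on a key that identifies a specific version of the row.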

Performance Optimization

Clustering keys on fact tables by date (the most common filter).

Materialization strategy:

  • Heavy intermediate models → incremental
  • Dimensions → table (full refresh nightly)
  • Marts → table (full refresh nightly)
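The clustering and incremental pieces can both be expressed in a single dbt model config. A sketch of a fact model, assuming dbt on Snowflake (table and column names hypothetical); the lookback window addresses late-arriving data, discussed below:

```sql
-- models/facts/fct_orders.sql (hypothetical names)
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        cluster_by=['order_date']
    )
}}
select
    order_id,
    order_date,
    customer_id,
    order_amount
from {{ ref('stg_orders') }}
{% if is_incremental() %}
  -- Reprocess a 3-day lookback window so late-arriving rows
  -- are picked up on subsequent runs.
  where order_date > dateadd(day, -3, (select max(order_date) from {{ this }}))
{% endif %}
```

On the first run dbt builds the full table; on later runs the is_incremental() branch restricts the scan to recent rows and merges them in on the unique key.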

Query patterns: common joins are pre-aggregated into mart tables rather than left for analysts to repeat at query time.
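A mart table of this kind is just a fact-to-dimension join rolled up ahead of time. A sketch, with hypothetical model and column names:

```sql
-- models/marts/mart_daily_revenue.sql (hypothetical names)
select
    f.order_date,
    c.customer_region,
    sum(f.order_amount) as total_revenue,
    count(*)            as order_count
from {{ ref('fct_orders') }} f
join {{ ref('dim_customers') }} c
  on f.customer_sk = c.customer_sk
group by 1, 2
```

Analysts then query one flat table instead of rebuilding the join (and risking different numbers) in every dashboard.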


What I Learned

  • Agree on metric definitions before building anything — retroactive changes are expensive
  • SCD Type 2 sounds simple until you need to query point-in-time correctly
  • Incremental models need careful thought about late-arriving data
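The point-in-time lesson usually surfaces as a range join: each fact row must match the dimension version that was current when the event happened, not the latest version. A sketch, assuming valid_to is null on the current row (table and column names hypothetical):

```sql
-- Join each order to the customer version in effect at order time.
select
    f.order_id,
    f.ordered_at,
    c.customer_segment
from fct_orders f
join dim_customers c
  on  f.customer_id = c.customer_id
  and f.ordered_at >= c.valid_from
  and f.ordered_at <  coalesce(c.valid_to, '9999-12-31'::timestamp)
```

Getting the boundary conditions right (inclusive start, exclusive end, open-ended current row) is exactly where "sounds simple" breaks down.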

Tech Stack

Component        Tool
Warehouse        Snowflake
Transformation   dbt
Orchestration    Airflow
Source systems   PostgreSQL (multiple)
BI               Metabase