Overview #
This project covers building a production-grade ELT pipeline that ingests raw data from multiple sources, transforms it through a layered data model, and serves it to downstream consumers with tested data quality guarantees.
The goal was to move away from fragile, one-off scripts to a reproducible, observable pipeline that could be maintained and extended without fear.
Problem #
The existing data workflow was a collection of ad-hoc Python scripts run manually on a schedule. There was no lineage, no alerting, and no way to know if the data was stale or wrong until someone noticed a dashboard looked off.
Architecture #
Sources (APIs, DB)
│
▼
[Extraction Layer]
Python + custom connectors
│
▼
[Raw Storage]
PostgreSQL / S3 staging
│
▼
[Transformation Layer]
dbt (staging → intermediate → marts)
│
▼
[Serving Layer]
Analytical views / BI tools

- Orchestration: Apache Airflow with DAGs per source
- Transformation: dbt with full lineage and tests
- Data Quality: dbt tests (not-null, unique, referential integrity) + custom macros
- Monitoring: Airflow alerts on failure, dbt run results logged
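To make the three data quality checks named above concrete, here is a minimal sketch in plain Python of what each one asserts. This is not how dbt implements them (dbt compiles its tests to SQL that runs in the warehouse); the function names and the sample `orders` and `customers` rows are hypothetical and exist only for illustration.

```python
from typing import Any

Row = dict[str, Any]


def not_null_failures(rows: list[Row], column: str) -> list[Row]:
    """Return rows where `column` is missing or None."""
    return [r for r in rows if r.get(column) is None]


def unique_failures(rows: list[Row], column: str) -> list[Any]:
    """Return values of `column` that appear more than once (None is ignored)."""
    seen: set[Any] = set()
    dupes: list[Any] = []
    for r in rows:
        value = r.get(column)
        if value is None:
            continue
        if value in seen and value not in dupes:
            dupes.append(value)
        seen.add(value)
    return dupes


def relationship_failures(
    child: list[Row], child_col: str, parent: list[Row], parent_col: str
) -> list[Row]:
    """Return child rows whose non-null foreign key has no match in the parent."""
    parent_keys = {r.get(parent_col) for r in parent}
    return [
        r
        for r in child
        if r.get(child_col) is not None and r.get(child_col) not in parent_keys
    ]


# Hypothetical sample data: one orphaned order, one duplicate id, one null key.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 11, "customer_id": None},
]

null_rows = not_null_failures(orders, "customer_id")
duplicate_ids = unique_failures(orders, "order_id")
orphans = relationship_failures(orders, "customer_id", customers, "id")
```

Each function returns the failing rows or values rather than a boolean, which mirrors the general idea that a test passes when the set of failures is empty.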
Key Decisions #
Why dbt for transformation? SQL-first transformations with built-in testing, documentation generation, and lineage. The learning curve is low for anyone who knows SQL, which makes it easier to hand off.
Why Airflow over simpler schedulers? Complex dependency graphs between DAGs, retry logic, and the need for a UI to inspect historical runs. For a simpler project, Prefect or even cron would be fine.
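The two scheduler features named above, dependency resolution and retry logic, can be illustrated without Airflow itself. The sketch below uses only the Python standard library and is not Airflow's API; the task names and the `run_with_retries` helper are hypothetical.

```python
import time
from graphlib import TopologicalSorter
from typing import Callable


def run_with_retries(
    task: Callable[[], None], retries: int = 2, delay_s: float = 0.0
) -> int:
    """Run `task`, retrying on any exception. Returns the attempts used."""
    max_attempts = retries + 1
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure
            time.sleep(delay_s)
    raise ValueError("retries must be >= 0")


# Map each task to the set of tasks it depends on (hypothetical names).
dependencies = {
    "extract_api": set(),
    "extract_db": set(),
    "dbt_staging": {"extract_api", "extract_db"},
    "dbt_marts": {"dbt_staging"},
}

# A valid run order: every task appears after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

A real scheduler adds what this sketch leaves out: persistence of run history, parallel execution of independent tasks, and a UI, which is the part that cron cannot offer.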
Why a layered data model (staging → intermediate → marts)? It keeps raw data untouched, makes transformations auditable, and prevents the “who touched this?” problem.
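As a rough illustration of the layering idea, each layer below is a pure function that builds new rows from the previous layer and never modifies the raw input. In the project itself these layers are dbt SQL models; the Python version, the sample rows, and the column names are assumptions made for the example.

```python
from collections import defaultdict
from decimal import Decimal

# Raw layer: stored exactly as extracted (hypothetical sample rows).
RAW_ORDERS = (
    {"ID": "1", "Cust": " alice ", "amt": "10.50"},
    {"ID": "2", "Cust": "BOB", "amt": "4.00"},
    {"ID": "3", "Cust": "Alice", "amt": "2.25"},
)


def staging(raw):
    """Rename columns, trim strings, cast types; one output row per raw row."""
    return [
        {
            "order_id": int(r["ID"]),
            "customer": r["Cust"].strip().lower(),
            "amount": Decimal(r["amt"]),
        }
        for r in raw
    ]


def intermediate(stg):
    """Apply business logic: total order amount per customer."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in stg:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def mart(totals):
    """Shape for consumers: one row per customer, highest spend first."""
    return sorted(
        ({"customer": c, "total_spend": t} for c, t in totals.items()),
        key=lambda r: r["total_spend"],
        reverse=True,
    )


customer_spend = mart(intermediate(staging(RAW_ORDERS)))
```

Because the raw rows are never rewritten, any mart value can be traced back through the intermediate totals to the exact source rows that produced it.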
What I Learned #
- Data quality tests need to run at every layer, not just the final mart
- Backfilling is always harder than you think — design for it from the start
- Documentation in dbt is surprisingly useful when you return to a project six months later
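One way to design for backfills from the start, sketched here under assumptions, is to have each run load exactly one date partition and overwrite it, so that re-running any date range is safe. The in-memory `warehouse` dict and the `fetch` callable are hypothetical stand-ins for a real table and extraction step.

```python
from datetime import date, timedelta
from typing import Callable, Iterator

Warehouse = dict[date, list[str]]


def date_partitions(start: date, end: date) -> Iterator[date]:
    """Yield each day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def load_partition(warehouse: Warehouse, day: date, rows: list[str]) -> None:
    """Overwrite (not append to) the partition for `day`, so reruns are idempotent."""
    warehouse[day] = list(rows)


def backfill(
    warehouse: Warehouse,
    start: date,
    end: date,
    fetch: Callable[[date], list[str]],
) -> int:
    """Reload every daily partition in [start, end]; returns partitions written."""
    written = 0
    for day in date_partitions(start, end):
        load_partition(warehouse, day, fetch(day))
        written += 1
    return written


# Usage: running the same backfill twice leaves the warehouse unchanged.
warehouse: Warehouse = {}
fetch_day = lambda day: [f"row-{day.isoformat()}"]
first_run = backfill(warehouse, date(2024, 1, 30), date(2024, 2, 2), fetch_day)
snapshot = {day: list(rows) for day, rows in warehouse.items()}
second_run = backfill(warehouse, date(2024, 1, 30), date(2024, 2, 2), fetch_day)
```

The overwrite in `load_partition` is the important design choice: an append-based load would duplicate rows on every rerun.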
Tech Stack #
| Layer | Tool |
|---|---|
| Orchestration | Apache Airflow |
| Transformation | dbt |
| Storage | PostgreSQL, S3 (staging) |
| Language | Python 3.11 |
| Containerization | Docker |
| CI | GitHub Actions |