End-to-End Data Pipeline Orchestration

Author: Albin Cikaj, building data pipelines and writing about what I learn along the way.

Overview

This project covers building a production-grade ELT pipeline that ingests raw data from multiple sources, transforms it through a layered data model, and serves it to downstream consumers with quality guarantees enforced at each layer.

The goal was to move away from fragile, one-off scripts and into a reproducible, observable pipeline that could be maintained and extended without fear.


Problem

The existing data workflow was a collection of ad-hoc Python scripts run manually on a schedule. There was no lineage, no alerting, and no way to know if the data was stale or wrong until someone noticed a dashboard looked off.


Architecture

Sources (APIs, DB)
  ↓
[Extraction Layer]      Python + custom connectors
  ↓
[Raw Storage]           PostgreSQL / S3 staging
  ↓
[Transformation Layer]  dbt (staging → intermediate → marts)
  ↓
[Serving Layer]         Analytical views / BI tools

  • Orchestration: Apache Airflow with DAGs per source
  • Transformation: dbt with full lineage and tests
  • Data Quality: dbt tests (not-null, unique, referential integrity) + custom macros
  • Monitoring: Airflow alerts on failure, dbt run results logged
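The dbt tests listed above are declared next to the models they check. A minimal sketch of such a configuration, with hypothetical model and column names:

```yaml
# models/staging/schema.yml (illustrative)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - relationships:          # referential integrity check
              to: ref('stg_customers')
              field: customer_id
```

Running `dbt test` then fails the build if any order has a null or duplicated key, or references a customer that doesn't exist.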


Key Decisions

Why dbt for transformation? SQL-first transformations with built-in testing, documentation generation, and lineage. The learning curve is low for anyone who knows SQL, which makes it easier to hand off.
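To illustrate what "SQL-first" means in practice, a dbt staging model is usually just a thin, testable SELECT over the raw source (model and column names below are hypothetical, not from this project):

```sql
-- models/staging/stg_orders.sql (illustrative)
with source as (
    select * from {{ source('raw', 'orders') }}
)

select
    id                             as order_id,
    customer_id,
    cast(amount as numeric(12, 2)) as order_amount,
    created_at                     as ordered_at
from source
```

Because the model is plain SQL plus `source()`/`ref()` calls, dbt can derive lineage and documentation from it automatically.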

Why Airflow over simpler schedulers? Complex dependency graphs between DAGs, retry logic, and the need for a UI to inspect historical runs. For a simpler project, Prefect or even cron would be fine.
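Airflow expresses retry logic declaratively (e.g. `retries` and `retry_delay` on a task). The semantics it provides can be sketched in plain Python; the function below is illustrative, not Airflow's API:

```python
import time

def run_with_retries(task, retries=3, retry_delay=1.0):
    """Call `task` until it succeeds or the retry budget is spent,
    sleeping `retry_delay` seconds between attempts. Airflow does the
    same per task instance, with richer state tracking and a UI."""
    last_error = None
    for attempt in range(retries + 1):  # first try + `retries` retries
        try:
            return task()
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(retry_delay)
    raise last_error

# A flaky extraction step that fails twice, then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return "rows"

result = run_with_retries(flaky_extract, retries=3, retry_delay=0)
# result == "rows", reached on the third attempt
```

With cron you would have to hand-roll this for every script; with Airflow it is one task argument, plus a history of every attempt in the UI.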

Why a layered data model (staging → intermediate → marts)? It keeps raw data untouched, makes transformations auditable, and prevents the “who touched this?” problem.


What I Learned

  • Data quality tests need to run at every layer, not just the final mart
  • Backfilling is always harder than you think — design for it from the start
  • Documentation in dbt is surprisingly useful when you return to a project six months later
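One dbt pattern that eases backfills is an incremental model with an explicit cutoff, so routine runs only touch new rows while a full rebuild stays one flag away (a sketch with hypothetical model names):

```sql
-- models/marts/fct_orders.sql (illustrative)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_amount,
    ordered_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- incremental runs process only rows newer than what's already loaded;
  -- `dbt run --full-refresh` rebuilds the whole table for backfills
  where ordered_at > (select max(ordered_at) from {{ this }})
{% endif %}
```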

Tech Stack

Layer              Tool
-----------------  ----------------
Orchestration      Apache Airflow
Transformation     dbt
Storage            PostgreSQL
Language           Python 3.11
Containerization   Docker
CI                 GitHub Actions