Data Engineering

🔧 Why Do We Split Airflow into init, scheduler, and webserver?

If you start working with Airflow a bit more seriously, you’ll quickly notice that it’s usually split into multiple services: airflow-init airflow-scheduler airflow-webserver At first, you may wonder: “Why do we need to split them up like this?” Well — this is actually the standard production architecture. Let’s break it down in simple, practical terms. 1️⃣ airflow-init — Preparation Step Also sometimes called airflow-db-migrate or airflow-bootstrap. This runs only once when you initialize Airflow. ...

📊 What dbt Does Well vs What Python Does Better

Role dbt Does Well Python Does Better Structured data cleaning (staging) ✅ Possible, but inconvenient Designing mart table structures ✅ Also possible User-specific calculations ❌ Inconvenient ✅ Super flexible Scoring, conditional matching, if-else logic ❌ Very cumbersome ✅ Ideal Filtering based on user input ❌ Not possible ✅ Core feature Explaining recommendations, tuning logic ❌ ✅ Fully customizable For Example -- This kind of logic is painful in dbt... SELECT CASE WHEN user.age BETWEEN policy.min_age AND policy.max_age THEN 30 ELSE 0 END + CASE WHEN user.income < policy.income_ceiling THE_ ELSE 0 END + ... In dbt, the concept of a “user” doesn’t even exist dbt is built for models that apply the same logic to everyone Python, on the other hand, can generate different recommendations per user based on input 👉 dbt is great for static modeling, but dynamic, user-input-driven recommender systems are better suited for Python. ...

🚀 Building a Batch Data Pipeline with AWS, Airflow, and Spark

✨ Project Summary Assuming I am working for a fintech company, I built a batch pipeline that automatically aggregates → transforms → analyzes credit card data. Since I couldn’t use real data, I used synthetic transaction data generated using Faker, but I believe it was sufficient for the purpose of designing the overall data flow and structure. 🎯 Goal “Build an Airflow pipeline that processes realistic financial data with Spark, analyzes and stores them.” ...