Data Engineering

📚 SQL Structs: When Your Database Learns to Think Like a Bookshelf

Imagine your database as a giant library. For decades, we’ve been organizing it like a traditional library catalog system – every book detail gets its own card, filed in separate drawers. Title cards in one drawer, author cards in another, publication year in yet another. This is what we call normalization in database terms. But what if your library could store complete book information in a single, smart envelope? That envelope could contain the title, author, publication details, and even reviews – all tucked neatly together. This is essentially what a struct does in modern SQL databases. ...
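The envelope idea can be sketched outside SQL as well. The snippet below is a minimal Python illustration, not database code: the `Book` and `Publication` classes, the `assemble` helper, and all sample values are hypothetical and exist only to contrast "one drawer per attribute" with one nested record.

```python
from dataclasses import dataclass, field


@dataclass
class Publication:
    publisher: str
    year: int


@dataclass
class Book:
    """One 'envelope' holding everything about a book, like a struct."""
    title: str
    author: str
    publication: Publication
    reviews: list[str] = field(default_factory=list)


# Normalized layout: one "drawer" per attribute, linked by book_id.
# All values here are made-up sample data.
titles = {1: "Sample Book"}
authors = {1: "Sample Author"}
publications = {1: ("Sample Press", 1999)}
reviews = [(1, "A classic."), (1, "Dense but rewarding.")]


def assemble(book_id: int) -> Book:
    """Gather the scattered cards for one book into a single nested record."""
    publisher, year = publications[book_id]
    return Book(
        title=titles[book_id],
        author=authors[book_id],
        publication=Publication(publisher, year),
        reviews=[text for bid, text in reviews if bid == book_id],
    )


book = assemble(1)
print(book.publication.year, len(book.reviews))
```

Reading `book.publication.year` needs no further lookups once the record is assembled, which is the convenience the envelope analogy points at.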

💯 From Basic to Intermediate: Understanding dbt Tests

If you’re using dbt to transform data, you’re already winning. But did you know dbt has powerful testing features to keep your data clean, reliable, and trustworthy? In this post, we’ll walk through: ✅ basic dbt tests (the quick wins) and 🚀 intermediate tests (custom logic and reusable macros).

✅ Basic dbt Tests (Built-in)

dbt has out-of-the-box tests you can define in your .yml files under your models. Here’s an example: ...

How PostgreSQL Surprises You: Booleans, Text I/O, and ETL Gotchas

PostgreSQL is a powerful, standards-compliant database, but it has its quirks. One of those is how it handles boolean values, especially when exporting data in text format.

🧠 PostgreSQL Boolean Behavior: It’s Not What You Think

Internally, PostgreSQL stores each boolean value compactly in 1 byte. But when you convert those values to text, say in a query or an export via COPY, things look… different: ...
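On the consuming side of an ETL job, text-format booleans usually need an explicit conversion. The following is a minimal sketch, assuming the values arrive as strings from a text-format COPY export (where booleans are written as `t` and `f`, and NULL as `\N` by default). The function name `parse_pg_bool` is hypothetical; the accepted spellings follow PostgreSQL's documented boolean input literals.

```python
# Spellings PostgreSQL accepts as boolean input; text output uses only t / f.
_TRUE = {"t", "true", "y", "yes", "on", "1"}
_FALSE = {"f", "false", "n", "no", "off", "0"}


def parse_pg_bool(value, null_marker="\\N"):
    """Convert a PostgreSQL text-format boolean to a Python bool.

    Returns None for NULL (Python None or the COPY null marker) and
    raises ValueError for anything that is not a recognised literal.
    """
    if value is None or value == null_marker:
        return None
    text = value.strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    raise ValueError(f"not a PostgreSQL boolean literal: {value!r}")


row = ["42", "t", "f", "\\N"]  # made-up sample row from a COPY export
print([parse_pg_bool(cell) for cell in row[1:]])
```

Raising on unknown input, rather than defaulting to False, keeps a malformed export from silently turning into wrong data downstream.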

UUID Pitfalls in Spark → Kafka → Postgres Pipelines

I was building a fully containerized data pipeline using Kafka and Spark Structured Streaming. The stack: Kafka for streaming transaction data, Spark Structured Streaming for real-time processing and fraud detection, and Postgres as the data warehouse. Everything was smooth, until one tiny villain showed up: UUID fields. Yes, UUIDs. Here’s exactly what happened (so you can avoid the same headache).

✅ The Original Design

I designed the tables in Postgres like this: ...

🛡️ Solving the Kerberos User Authentication Issue in Spark Docker Streaming

While building my real-time streaming pipeline using Spark, Kafka, and Docker, I ran into a Spark error related to Kerberos authentication, even though I wasn’t using Kerberos at all:

org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name

❓ What triggered the problem?

I was using the official apache/spark:3.5.0 Docker image. Spark inside Docker was trying to resolve Hadoop’s default authentication mechanism, and Hadoop tries to retrieve the current OS user via UnixPrincipal(name). Inside the Docker container, my app was running as a UID/GID that had no username mapping. This caused the invalid null input: name error, because UnixPrincipal() received null. ...
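A quick way to confirm this diagnosis is to check whether the container's UID resolves to a username at all. The snippet below is a minimal Python sketch of that check, not Hadoop's own lookup code; the helper name `username_for_uid` is hypothetical, and it relies on the Unix-only `pwd` module from the standard library.

```python
import os
import pwd


def username_for_uid(uid):
    """Return the username mapped to uid, or None if the user database has no entry."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No passwd entry for this UID: the situation described above.
        return None


current_uid = os.getuid()
current_name = username_for_uid(current_uid)
if current_name is None:
    print(f"UID {current_uid} has no username mapping")
else:
    print(f"UID {current_uid} maps to {current_name}")
```

Running this inside the container as the same user as the Spark process shows whether the "no username" condition applies before digging into Hadoop configuration.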

🫙 The Final Spark Streaming Hurdle: When --jars Isn't Enough for Kafka

As a data engineer, there’s nothing quite like the satisfaction of reaching the “final hurdle” in a complex distributed system setup. Today, I want to share a frustrating but very common issue with Apache Spark Structured Streaming + Kafka: 👉 the dreaded Failed to find data source: kafka error.

🧨 The Problem: Everything Seems Right, But Kafka Won’t Load

Picture this: the Spark cluster is up and running, the Postgres connection works, Kafka is producing events, and your code calls .readStream.format("kafka")… And then: ...

🧚 Why Run dbt Inside Airflow Docker Container

In modern data engineering pipelines, dbt and Airflow often work side by side. One common design decision is how to run dbt alongside Airflow: Should dbt run in its own container, orchestrated via API or CLI call? Or should dbt run directly inside Airflow’s Docker container as part of the DAG? After experimenting with both, I prefer running dbt inside Airflow’s Docker container. ...

🧹 Data Cleansing: Why You Should Always Clean at the Staging Layer

In real-world data engineering pipelines, one of the most common mistakes is postponing data cleansing until too late in the pipeline. The cleaner your upstream data is, the simpler and more maintainable your downstream models will be. Let’s break it down.

✅ The Principle

Cleanse your data as early as possible, ideally at the staging layer.

✅ The Why

1️⃣ Clear Separation of Responsibilities

Staging models are responsible for: ...

🔧 ARM Mac + Docker + dbt: Troubleshooting Startup Issues

While setting up Airflow + dbt projects with Docker, you may run into the following common errors. Here are the problems and their solutions.

🔍 Problem 1: Platform Architecture Mismatch

Error message: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8)

My Mac is running on ARM (Apple Silicon: M1/M2/M3). The official dbt Docker image is built for amd64 (x86-based). As a result, Docker tries to run it cross-architecture using QEMU emulation, which sometimes leads to internal Python path issues that surface as an error when running dbt --version. This is not a simple dbt bug: the root cause is the platform mismatch. ...

🔧 Solving Airflow Docker Startup Issues

Common issues you will often encounter when running Airflow with Docker.

❗ Issue 1: .env file is not visible inside Airflow container

🔍 Symptom Summary

The .env file exists at the project root, but inside the Airflow container, load_dotenv() fails to read it. The reason: Docker Compose reads .env automatically, but only to substitute variables in docker-compose.yml. It does not copy or mount the file itself into the container. Therefore, load_dotenv() has no file to read.

✅ Solution

1️⃣ Add a volume mount for .env in docker-compose.yml. This way, the .env file becomes available inside the container at the correct path. ...
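Independently of the volume mount, application code can be written so that a missing .env file is not fatal. The following is a minimal standard-library sketch of that defensive pattern, not the post's solution and not python-dotenv's implementation: the helpers `read_env_file` and `get_setting` and the variable name `DEMO_ONLY_DB_HOST` are hypothetical. It prefers real environment variables and falls back to the file only when it exists.

```python
import os
import tempfile
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines; return {} when the file is missing."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = {}
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_setting(name, env_path=".env", default=None):
    """Prefer the process environment, then the .env file, then the default."""
    if name in os.environ:
        return os.environ[name]
    return read_env_file(env_path).get(name, default)


# Demo with a temporary directory standing in for the project root.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    missing = get_setting("DEMO_ONLY_DB_HOST", env_file)  # file not present yet
    env_file.write_text("# sample value\nDEMO_ONLY_DB_HOST=postgres\n")
    from_file = get_setting("DEMO_ONLY_DB_HOST", env_file)

print(missing, from_file)
```

With this shape, the same code works whether the value arrives through the container's environment or through a mounted .env file.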