Imagine your database as a giant library. For decades, we’ve been organizing it like a traditional library catalog system – every book detail gets its own card, filed in separate drawers. Title cards in one drawer, author cards in another, publication year in yet another. This is what we call normalization in database terms. But what if your library could store complete book information in a single, smart envelope? That envelope could contain the title, author, publication details, and even reviews – all tucked neatly together. This is essentially what a struct does in modern SQL databases. ...
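To make the envelope concrete, here is a minimal sketch using DuckDB’s Python API. DuckDB is purely an assumption for illustration (any engine with STRUCT support, such as BigQuery or Spark SQL, behaves similarly), and the book data is invented:

```python
# A minimal sketch of the "smart envelope" idea: one STRUCT column holding
# title, author, nested publication details, and reviews together.
# DuckDB is used only for illustration; the book data is made up.
import duckdb

duckdb.sql("""
    SELECT {
        'title': 'Dune',
        'author': 'Frank Herbert',
        'publication': {'year': 1965, 'publisher': 'Chilton Books'},
        'reviews': ['a classic', 'epic world-building']
    } AS book
""").show()

# Fields inside the struct stay individually addressable with dot notation:
duckdb.sql("""
    WITH shelf AS (
        SELECT {'title': 'Dune',
                'author': 'Frank Herbert',
                'publication': {'year': 1965}} AS book
    )
    SELECT book.title, book.publication.year FROM shelf
""").show()
```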
💯 From Basic to Intermediate: Understanding dbt Tests
If you’re using dbt to transform data, you’re already winning. But did you know dbt has powerful testing features to keep your data clean, reliable, and trustworthy? In this post, we’ll walk through:
✅ Basic dbt tests — the quick wins
🚀 Intermediate tests — custom logic and reusable macros

✅ Basic dbt Tests (Built-in)
dbt has out-of-the-box tests you can define in your .yml files under your models. Here’s an example: ...
How PostgreSQL Surprises You: Booleans, Text I/O, and ETL Gotchas
PostgreSQL is a powerful, standards-compliant database — but it has its quirks. One of those is how it handles boolean values, especially when exporting data in text format.

🧠 PostgreSQL Boolean Behavior: It’s Not What You Think
Internally, PostgreSQL stores boolean values compactly, in a single byte per value. But when you convert those values to text, say in a query or an export via COPY, things look… different: ...
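As a quick illustration of the point (psycopg2 and the connection details are assumptions, not part of the original post): through the driver you get Python booleans back, but a text-format COPY writes the single letters t and f, which is what lands in export files.

```python
# Hedged sketch: PostgreSQL booleans through a driver vs. in a text-format COPY.
# Connection parameters are placeholders; psycopg2 is an assumed client library.
import io
import psycopg2

conn = psycopg2.connect("host=localhost dbname=demo user=postgres password=postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT true AS is_active, false AS is_deleted")
    print(cur.fetchone())            # -> (True, False): the driver adapts to Python bools

    buf = io.StringIO()
    cur.copy_expert("COPY (SELECT true, false) TO STDOUT WITH CSV", buf)
    print(buf.getvalue().strip())    # -> t,f : the text form PostgreSQL actually emits
conn.close()
```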
UUID Pitfalls in Spark → Kafka → Postgres Pipelines
I was building a data pipeline using Kafka and Spark Structured Streaming. Fully containerized. The stack:
Kafka for streaming transaction data
Spark Structured Streaming for real-time processing and fraud detection
Postgres as the data warehouse
Everything was smooth. Until one tiny villain showed up: UUID fields. Yes — UUIDs. Here’s exactly what happened (so you can avoid the same headache).

✅ The Original Design
I designed the tables in Postgres like this: ...
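For context on where UUIDs typically bite, here is a hedged sketch of one common workaround: keep the UUID as a plain string in Spark and let Postgres cast it on insert by adding stringtype=unspecified to the JDBC URL. The table, columns, and credentials below are placeholders, not the post’s actual schema.

```python
# Sketch only: writing a UUID-keyed DataFrame from Spark into a Postgres table
# whose key column is of type uuid. The stringtype=unspecified JDBC parameter
# lets the server cast the string to uuid on insert. All names are placeholders,
# and the Postgres JDBC driver must be on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uuid-to-postgres").getOrCreate()

df = spark.createDataFrame(
    [("4c0f6cbe-53c8-4b4a-8d7f-6c8f1a2b3c4d", 42.50)],
    ["transaction_id", "amount"],   # transaction_id is a UUID kept as a string in Spark
)

(
    df.write.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/warehouse?stringtype=unspecified")
    .option("dbtable", "transactions")
    .option("user", "postgres")
    .option("password", "postgres")
    .mode("append")
    .save()
)
```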
🛡️ Solving the Kerberos User Authentication Issue in Spark Docker Streaming
While building my real-time streaming pipeline using Spark, Kafka, and Docker, I ran into a Spark error related to Kerberos authentication - when I wasn’t even using Kerberos.
org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name

❓ What triggered the problem?
I was using the official apache/spark:3.5.0 Docker image. Spark inside Docker was trying to resolve Hadoop’s default authentication mechanism. Hadoop tries to retrieve the current OS user via UnixPrincipal(name). Inside Docker containers, my app was running as a UID/GID that had no proper username mapping. This caused invalid null input: name because UnixPrincipal() received null. ...
🫙 The Final Spark Streaming Hurdle: When --jars Isn't Enough for Kafka
As a data engineer, there’s nothing quite like the satisfaction of reaching the “final hurdle” in a complex distributed system setup. Today, I want to share a frustrating but very common issue with Apache Spark Structured Streaming + Kafka: 👉 the dreaded Failed to find data source: kafka error.

🧨 The Problem: Everything Seems Right — But Kafka Won’t Load
Picture this:
Spark cluster is up and running.
Postgres connection works.
Kafka is producing events.
Your code calls .readStream.format("kafka")…
And then: ...
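The usual way out of this error, sketched below with assumed versions: let Spark pull the Kafka connector and its transitive dependencies (kafka-clients, commons-pool2, the token provider) via spark.jars.packages rather than handing a single jar to --jars.

```python
# Hedged sketch: resolving the Kafka source via spark.jars.packages so the
# connector and its transitive dependencies are all pulled in. The artifact
# version must match your Spark/Scala build; 3.5.0 / 2.12 is an assumption.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-structured-streaming")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0",
    )
    .getOrCreate()
)

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # placeholder broker address
    .option("subscribe", "transactions")               # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)
```

The same idea works on the command line by passing the coordinate to spark-submit’s --packages flag instead of --jars.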
🧚 Why Run dbt Inside Airflow Docker Container
In modern data engineering pipelines, dbt and Airflow often work side by side. One common design decision is how to run dbt alongside Airflow:
Should dbt run in its own container, orchestrated via an API or CLI call?
Or should dbt run directly inside Airflow’s Docker container as part of the DAG?
After experimenting with both, I prefer running dbt inside Airflow’s Docker container. ...
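To make the “inside the Airflow container” option concrete, here is a minimal sketch of what such a DAG can look like when dbt is installed in the same image as Airflow. The project and profiles paths, and the Airflow 2.4+ style schedule argument, are assumptions rather than the post’s actual setup.

```python
# Minimal sketch: dbt invoked from a BashOperator inside the Airflow container.
# Assumes dbt is installed in the Airflow image; paths below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt"  # hypothetical location of the dbt project

with DAG(
    dag_id="dbt_inside_airflow",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # Airflow 2.4+ keyword
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )
    dbt_run >> dbt_test
```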
🧹 Data Cleansing: Why You Should Always Clean at the Staging Layer
In real-world data engineering pipelines, one of the most common mistakes is postponing data cleansing until too late in the pipeline. The cleaner your upstream data is, the simpler and more maintainable your downstream models will be. Let’s break it down.

✅ The Principle
Cleanse your data as early as possible — ideally at the staging layer.

✅ The Why
1️⃣ Clear Separation of Responsibilities
Staging models are responsible for: ...
🐳 How I Dockerized My GitHub Pages Jekyll Site — The Clean Setup That Works
😩 The Problem
Setting up Jekyll with Docker sounds easy, but I ran into:
platform issues (arm64 vs amd64) - I use an Apple Silicon MacBook (M1)
bundle install headaches
Since I was building this for my personal GitHub Pages site, I also had to make sure it stays compatible with the GitHub Pages gem versions while remaining easy to develop locally.

🛠 My Clean Solution
I ended up building this Docker setup, and it finally works reliably for me. ...
🔧 ARM Mac + Docker + dbt: Troubleshooting Startup Issues
While setting up Airflow + dbt projects with Docker, you may run into the common errors below; here are their causes and solutions.

🔍 Problem 1: Platform Architecture Mismatch
Error message:
The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8)
My Mac runs on ARM (Apple Silicon - M1/M2/M3), while the official dbt Docker image is built for amd64 (x86). As a result, Docker runs it cross-architecture under QEMU emulation, which sometimes leads to internal Python path issues that surface as the dbt dbt --version error. This is not a simple dbt bug — the root cause is the platform mismatch.