Spark

📚 SQL Structs: When Your Database Learns to Think Like a Bookshelf

Imagine your database as a giant library. For decades, we’ve been organizing it like a traditional library catalog system – every book detail gets its own card, filed in separate drawers. Title cards in one drawer, author cards in another, publication year in yet another. This is what we call normalization in database terms. But what if your library could store complete book information in a single, smart envelope? That envelope could contain the title, author, publication details, and even reviews – all tucked neatly together. This is essentially what a struct does in modern SQL databases. ...
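To make the envelope idea concrete, here is a rough sketch in plain Python rather than SQL. It contrasts the "separate drawers" layout with a single struct-like record. The `Book` class, the `assemble` helper, and the sample values are all hypothetical names chosen for illustration; they are not from the original post.

```python
from dataclasses import dataclass, field

# Hypothetical sample data, for illustration only.
# Normalized layout: each detail lives in its own "drawer", keyed by book_id.
titles = {1: "Dune"}
authors = {1: "Frank Herbert"}
years = {1: 1965}


@dataclass
class Book:
    """Struct-like layout: one record carries all of its fields together."""
    title: str
    author: str
    year: int
    reviews: list[str] = field(default_factory=list)


def assemble(book_id: int) -> Book:
    """Rebuild one nested record from the separate drawers (the 'join')."""
    return Book(titles[book_id], authors[book_id], years[book_id])


book = assemble(1)
book.reviews.append("A classic.")
print(book.author, book.year, book.reviews)
```

In Spark SQL, a column holding such a record would be declared with a type along the lines of `STRUCT<title: STRING, author: STRING, year: INT>`, and its fields are read with dot notation.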

UUID Pitfalls in Spark → Kafka → Postgres Pipelines

I was building a fully containerized data pipeline with Kafka and Spark Structured Streaming. The stack: Kafka for streaming transaction data; Spark Structured Streaming for real-time processing and fraud detection; Postgres as the data warehouse. Everything was smooth until one tiny villain showed up: UUID fields. Here’s exactly what happened, so you can avoid the same headache.

✅ The Original Design

I designed the tables in Postgres like this: ...

🛡️ Solving the Kerberos User Authentication Issue in Spark Docker Streaming

While building my real-time streaming pipeline with Spark, Kafka, and Docker, I ran into a Spark error related to Kerberos authentication, even though I wasn’t using Kerberos at all:

org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name

❓ What triggered the problem?

I was using the official apache/spark:3.5.0 Docker image. Spark inside Docker was trying to resolve Hadoop’s default authentication mechanism, and Hadoop tries to retrieve the current OS user via UnixPrincipal(name). Inside the Docker container, my app was running as a UID/GID that had no username mapping, so UnixPrincipal() received null, which produced the "invalid null input: name" error. ...
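A quick way to confirm this kind of missing username mapping from inside a container is to ask the OS user database directly. The sketch below is a diagnostic only, written with Python's standard `pwd` module (Unix only); the helper name `username_for_uid` is hypothetical, and this is not presented as the fix the author applied.

```python
import os
import pwd


def username_for_uid(uid: int):
    """Return the passwd-database name for uid, or None if there is no entry."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No passwd entry for this UID: the situation described above.
        return None


current = username_for_uid(os.getuid())
if current is None:
    print(f"UID {os.getuid()} has no username mapping")
else:
    print(f"UID {os.getuid()} maps to {current!r}")
```

If this reports no mapping for the UID the Spark process runs as, the container user is a likely cause of the `invalid null input: name` failure.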

🫙 The Final Spark Streaming Hurdle: When --jars Isn't Enough for Kafka

As a data engineer, there’s nothing quite like the satisfaction of reaching the “final hurdle” in a complex distributed system setup. Today, I want to share a frustrating but very common issue with Apache Spark Structured Streaming + Kafka: 👉 the dreaded Failed to find data source: kafka error.

🧨 The Problem: Everything Seems Right, But Kafka Won’t Load

Picture this: the Spark cluster is up and running, the Postgres connection works, Kafka is producing events, and your code calls .readStream.format("kafka")… And then: ...
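The excerpt stops before the author's fix, so what follows is not their solution. It is a sketch of one commonly documented approach: letting `spark-submit --packages` resolve the `spark-sql-kafka-0-10` connector together with its transitive dependencies, rather than passing a single jar. The helper names, the script name `stream_job.py`, and the versions (Spark 3.5.0, Scala 2.12) are assumptions for illustration.

```python
def kafka_connector_coordinate(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate of Spark's Structured Streaming Kafka connector."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


def spark_submit_args(app: str, spark_version: str) -> list[str]:
    """Assemble a spark-submit command line that pulls the connector via --packages."""
    return [
        "spark-submit",
        "--packages", kafka_connector_coordinate(spark_version),
        app,
    ]


# Assumed values: adjust to the Spark and Scala versions of your own cluster.
args = spark_submit_args("stream_job.py", "3.5.0")
print(" ".join(args))
```

The connector version should match the Spark version running on the cluster, and the Scala suffix should match the Scala build of that Spark distribution.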

🚀 Building a Batch Data Pipeline with AWS, Airflow, and Spark

✨ Project Summary

Working on the premise that I am at a fintech company, I built a batch pipeline that automatically aggregates → transforms → analyzes credit card data. Since I couldn’t use real data, I used synthetic transaction data generated with Faker, which I believe was sufficient for designing the overall data flow and structure.

🎯 Goal

“Build an Airflow pipeline that processes realistic financial data with Spark, then analyzes and stores it.” ...