Streaming

UUID Pitfalls in Spark → Kafka → Postgres Pipelines

I was building a data pipeline using Kafka and Spark structured streaming. Fully containerized. The stack: Kafka for streaming transaction data Spark Structured Streaming for real-time processing and fraud detection Postgres as the data warehouse Everything was smooth. Until one tiny villain showed up: UUID fields. Yes — UUIDs. Here’s exactly what happened (so you can avoid the same headache). ✅ The Original Design I designed the tables in Postgres like this: ...

🛡️ Solving the Kerberos User Authentication Issue in Spark Docker Streaming

Solving the Kerberos User Authentication Issue in Spark Docker Streaming While building my real-time streaming pipeline using Spark, Kafka, and Docker, I ran into a Spark error related to Kerberos authentication - when I wasn’t even using Kerberosa. org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name ❓ What triggered the problem? I was using the official apache/spark:3.5.0 Docker image. Spark inside Docker was trying to resolve Hadoop’s default authentication mechanism. Hadoop tries to retrieve the current OS user via: UnixPrincipal(name) Inside Docker containers, my app was running as UID/GID that had no proper username mapping. This caused: invalid null input: name because UnixPrincipal() received null. ...

🫙 The Final Spark Streaming Hurdle: When --jars Isn't Enough for Kafka

As a data engineer, there’s nothing quite like the satisfaction of reaching the “final hurdle” in a complex distributed system setup. Today, I want to share a frustrating but very common issue with Apache Spark Structured Streaming + Kafka: 👉 the dreaded Failed to find data source: kafka error. 🧨 The Problem: Everything Seems Right — But Kafka Won’t Load Picture this: Spark cluster is up and running. Postgres connection works. Kafka is producing events. Your code calls .readStream.format("kafka")… And then: ...