Solving the Kerberos User Authentication Issue in Spark Docker Streaming
While building my real-time streaming pipeline with Spark, Kafka, and Docker, I ran into a Spark error related to Kerberos authentication, even though I wasn't using Kerberos at all.
org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
❓ What triggered the problem?
- I was using the official apache/spark:3.5.0 Docker image.
- Spark inside Docker was trying to resolve Hadoop’s default authentication mechanism.
- Hadoop tries to retrieve the current OS user via:
UnixPrincipal(name)
- Inside the Docker container, my app was running as a UID/GID with no matching entry in /etc/passwd, so the UID could not be mapped to a username.
- This caused:
invalid null input: name
because UnixPrincipal() received null instead of a username (a quick check for this condition is sketched below).
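You can check for this condition directly; the UID below is just an example of an unmapped ID, and the exact messages may vary slightly between base images:

```bash
# Hypothetical check: run the image as a UID with no /etc/passwd entry
docker run --rm --user 9999:9999 apache/spark:3.5.0 \
  sh -c 'id; whoami; getent passwd "$(id -u)" || echo "no passwd entry"'
# whoami typically reports "cannot find name for user ID 9999" here -
# the same condition that leaves UnixPrincipal() with a null name.
```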
🔎 Root Cause
- Spark uses Hadoop’s UserGroupInformation internally.
- Hadoop falls back to the system user if no explicit user is configured.
- But Docker containers don’t always have valid /etc/passwd entries for dynamically assigned users.
- No valid username → Hadoop crashes → Kerberos exception.
⚠️ Note: This issue happens even if you're not using Kerberos! The exception name is misleading: it's simply Hadoop failing to resolve the current OS user.
🔧 The Fix: Set a Valid Username Inside Docker
Instead of patching Spark or Hadoop config, I solved this cleanly by:
1️⃣ Explicitly creating a valid user inside the Dockerfile
```dockerfile
FROM apache/spark:3.5.0

USER root

# Create airflow user (or any valid user)
RUN useradd --create-home --shell /bin/bash airflow

USER airflow
WORKDIR /opt/spark-app
```
2️⃣ Setting the HOME and HADOOP_USER_NAME environment variables in docker-compose.yml

```yaml
environment:
  - HOME=/home/airflow
  - HADOOP_USER_NAME=airflow
```
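For context, here is roughly where those variables sit in the compose service definition; the service name and build context are placeholders for your own setup:

```yaml
# Sketch of the relevant compose service (names are illustrative)
services:
  spark-app:
    build: .                      # the Dockerfile shown above
    environment:
      - HOME=/home/airflow
      - HADOOP_USER_NAME=airflow
```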
3️⃣ ✅ Now Hadoop’s UserGroupInformation can successfully resolve user identity inside the container.
🔬 Why this fix works
- Hadoop doesn’t care about UID or GID.
- It only needs a valid username string.
- Creating a named user inside Docker satisfies this requirement.
- No need to configure full Kerberos, kinit, or any secure ticket infrastructure.
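A quick way to confirm that the identity now resolves inside the running container (the service name spark-app is a placeholder for whatever your compose file uses):

```bash
# Sanity check inside the running service
docker compose exec spark-app sh -c 'whoami; echo "$HOME"; echo "$HADOOP_USER_NAME"'
# Expected output: airflow, /home/airflow, airflow
```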
✅ Key Takeaways
- Spark relies on Hadoop's UserGroupInformation, which needs a resolvable OS user.
- Docker containers can run as UIDs with no /etc/passwd entry → always create a named user explicitly.
- Set HOME and HADOOP_USER_NAME to match that user.
- Avoid wildcard SPARK_CLASSPATH=* style configs; prefer passing JARs explicitly (see the spark-submit sketch after this list).
- Debug systematically: read logs top-down. The root cause is usually much earlier than you think.
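For example, rather than relying on a wildcard classpath, the Kafka connector can be passed explicitly at submit time; the package coordinate and application path below are illustrative:

```bash
# Sketch: declare dependencies explicitly instead of SPARK_CLASSPATH=*
/opt/spark/bin/spark-submit \
  --master local[*] \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  /opt/spark-app/streaming_job.py
```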
This tiny Dockerfile tweak saved me hours of painful debugging.