Solving the Kerberos User Authentication Issue in Spark Docker Streaming

While building my real-time streaming pipeline with Spark, Kafka, and Docker, I ran into a Spark error about Kerberos authentication, even though I wasn’t using Kerberos at all.

org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name

❓ What triggered the problem?

  • I was using the official apache/spark:3.5.0 Docker image.
  • Spark inside Docker was trying to resolve Hadoop’s default authentication mechanism.
  • Hadoop tries to retrieve the current OS user via:
UnixPrincipal(name)
  • Inside the Docker container, my app was running as a UID/GID that had no matching username entry in /etc/passwd.
  • This caused:
invalid null input: name

because UnixPrincipal() received null.
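The lookup Hadoop performs can be mimicked in a few lines of Python (a rough analogue using the standard `pwd` module, not Hadoop’s actual code):

```python
import pwd

def username_for_uid(uid):
    """Mimic the UID -> username lookup Hadoop performs against /etc/passwd.
    Returns None when the UID has no passwd entry -- the same 'null name'
    that ends up inside UnixPrincipal in an unmapped container."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None

print(username_for_uid(0))          # UID 0 is mapped to root on virtually every image
print(username_for_uid(123456789))  # an unmapped UID resolves to nothing
```

When the second case is what your container process sees, Hadoop has nothing to hand to `UnixPrincipal(name)`.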

🔎 Root Cause

  • Spark uses Hadoop’s UserGroupInformation internally.
  • Hadoop falls back to the system user if no explicit user is configured.
  • But Docker containers don’t always have valid /etc/passwd entries for dynamically assigned users.
  • No valid username → Hadoop crashes → Kerberos exception.

⚠️ Note: This issue happens even if you’re not using Kerberos! The exception name is misleading — it’s simply Hadoop not finding the current user.
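In pseudo-Python, the resolution order looks roughly like this. This is a simplified sketch of Hadoop’s simple-auth behavior, not the real `UserGroupInformation` implementation, but Hadoop really does honor the `HADOOP_USER_NAME` environment variable before falling back to the OS user:

```python
import os
import pwd

def resolve_hadoop_user():
    """Sketch of Hadoop's simple-auth user resolution order:
    HADOOP_USER_NAME wins; otherwise fall back to the OS-level user."""
    explicit = os.environ.get("HADOOP_USER_NAME")
    if explicit:
        return explicit
    try:
        return pwd.getpwuid(os.getuid()).pw_name
    except KeyError:
        # No passwd entry for this UID -> this is what surfaces as the
        # misleading "KerberosAuthException ... invalid null input: name"
        raise RuntimeError("invalid null input: name")

os.environ["HADOOP_USER_NAME"] = "airflow"
print(resolve_hadoop_user())
```

This is also why the fix below works from two directions at once: a real user entry fixes the fallback path, and `HADOOP_USER_NAME` short-circuits it entirely.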

🔧 The Fix: Set Valid Username Inside Docker

Instead of patching Spark or Hadoop config, I solved this cleanly by:

1️⃣ Explicitly creating a valid user inside the Dockerfile

FROM apache/spark:3.5.0
USER root
# Create airflow user (or any valid user)
RUN useradd --create-home --shell /bin/bash airflow
USER airflow
WORKDIR /opt/spark-app

2️⃣ Set HOME and HADOOP_USER_NAME environment variables in docker-compose.yml

environment:
  - HOME=/home/airflow
  - HADOOP_USER_NAME=airflow
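For context, these variables belong under the service that runs Spark. A minimal sketch (the service and image names here are placeholders, not from my actual setup):

```yaml
services:
  spark-app:                      # placeholder service name
    image: my-spark-app:latest    # image built from the Dockerfile above
    environment:
      - HOME=/home/airflow        # must match the home dir created in the Dockerfile
      - HADOOP_USER_NAME=airflow  # the username Hadoop should resolve to
```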

3️⃣ ✅ Now Hadoop’s UserGroupInformation can successfully resolve user identity inside the container.

🔬 Why this fix works

  • Hadoop doesn’t care about UID or GID.
  • It only needs a valid username string.
  • Creating a named user inside Docker satisfies this requirement.
  • No need to configure full Kerberos, run kinit, or stand up any ticket infrastructure.
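A quick way to convince yourself that only the name matters (illustrative Python, not Hadoop code):

```python
import pwd

# Hadoop's simple auth only needs a username the OS can resolve. The lookup
# is by name, and the numeric UID/GID behind the entry plays no role in
# authentication itself.
entry = pwd.getpwnam("root")        # any valid /etc/passwd name resolves
print(entry.pw_name, entry.pw_uid)  # the UID value is incidental
```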

✅ Key Takeaways

  • Hadoop inside Spark relies on valid OS users.
  • Docker containers sometimes run under anonymous UIDs with no passwd entry; always create a named user explicitly.
  • Set HOME and HADOOP_USER_NAME to match.
  • Avoid wildcard SPARK_CLASSPATH=* style configs — prefer explicit JAR mounting.
  • Debug systematically: read logs top-down. The root cause is usually much earlier than you think.

This tiny Dockerfile tweak saved me hours of painful debugging.