I was building a data pipeline using Kafka and Spark Structured Streaming, fully containerized.
The stack:
- Kafka for streaming transaction data
- Spark Structured Streaming for real-time processing and fraud detection
- Postgres as the data warehouse
Everything was smooth. Until one tiny villain showed up:
UUID fields.
Yes — UUIDs.
Here’s exactly what happened (so you can avoid the same headache).
✅ The Original Design
I designed the tables in Postgres like this:
```sql
CREATE TABLE fact_transaction (
    transaction_id UUID PRIMARY KEY,
    customer_id    UUID REFERENCES dim_customer(customer_id),
    merchant_id    UUID REFERENCES dim_merchant(merchant_id),
    ...
);
```
Kafka was happily publishing events with UUIDs serialized as strings (since JSON doesn’t do native UUIDs).
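For context, the producer side looked roughly like this. It's an illustrative sketch using the kafka-python client (the topic name, broker address, and payload fields are assumptions, not the exact pipeline code):

```python
import json
import uuid

from kafka import KafkaProducer  # kafka-python client

# JSON has no native UUID type, so every UUID gets str()-ed before serialization
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "transaction_id": str(uuid.uuid4()),
    "customer_id": str(uuid.uuid4()),
    "merchant_id": str(uuid.uuid4()),
    "amount": 42.50,
}
producer.send("transactions", event)  # "transactions" is a placeholder topic name
producer.flush()
```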
Spark was happily reading them using:
```python
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("merchant_id", StringType()),
    ...
])
```
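The streaming read then parsed those JSON payloads with that schema. A minimal sketch of what that looks like, assuming a topic called "transactions" and a broker reachable at kafka:9092 (both placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.appName("fraud-pipeline").getOrCreate()

# Read the raw Kafka stream; the payload arrives as bytes in the `value` column
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "transactions")
    .load()
)

# Cast the bytes to a string and parse with the schema above:
# every UUID column comes out as a plain StringType
transactions = (
    raw.select(from_json(col("value").cast("string"), schema).alias("data"))
       .select("data.*")
)
```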
All green lights. Until…
❌ The “WHY IS THIS FAILING?!” Moment
As Spark tried to write into Postgres via JDBC, kaboom:
```
ERROR: column "transaction_id" is of type uuid but expression is of type character varying
Hint: You will need to rewrite or cast the expression.
```
But wait… I am passing valid UUID strings!
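For reference, the write that blew up was a plain foreachBatch JDBC append, roughly like this (connection URL, database name, and credentials are placeholders; the actual sink code may differ slightly):

```python
def write_to_postgres(batch_df, batch_id):
    # Plain JDBC append: Spark's string columns go out as VARCHAR parameters,
    # which Postgres refuses to coerce into UUID columns
    (
        batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/warehouse")
        .option("dbtable", "fact_transaction")
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )

query = (
    transactions.writeStream
    .foreachBatch(write_to_postgres)
    .option("checkpointLocation", "/tmp/checkpoints/fact_transaction")
    .start()
)
```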
🔬 Where It Actually Broke (Spoiler: JDBC Shenanigans)
Here’s what was really happening under the hood:
- Spark reads UUIDs as strings (`StringType()`)
- JDBC sends them to Postgres as `VARCHAR`
- Postgres (strict little thing) says: “I’m not auto-casting VARCHAR to UUID when you’re using prepared statements.”
In plain SQL, Postgres might cast for you. In JDBC prepared statements? Nope. Strict mode activated.
And to make things worse: Spark doesn’t even have a native `UUIDType()`, so you’re stuck with `StringType()`.
🚧 False Solutions I Tried (Don’t Do This)
I tried all sorts of hacks:
- Cast UUIDs again in Spark? → Made no difference.
- Play with JDBC options like `stringtype=unspecified`? → Unreliable (sketch below).
- Write ugly casting logic in Spark SQL? → Works today, breaks tomorrow.
Nothing worked cleanly.
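For the curious, the JDBC option attempt from the list above amounted to appending stringtype=unspecified to the connection URL, so the driver sends strings with no declared type and lets Postgres infer. A sketch of that attempt (same placeholder connection details as before):

```python
# "False solution": let the Postgres JDBC driver send strings as untyped parameters
jdbc_url = "jdbc:postgresql://postgres:5432/warehouse?stringtype=unspecified"

(
    batch_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "fact_transaction")
    .option("user", "postgres")
    .option("password", "postgres")
    .mode("append")
    .save()
)
```

In my setup this behaved inconsistently, so I dropped it.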
🔨 The Real Fix (Simple and Effective)
Finally, I stopped fighting and embraced reality:
Store UUIDs as `TEXT` in Postgres.
```sql
ALTER TABLE fact_transaction
    ALTER COLUMN transaction_id TYPE TEXT
    USING transaction_id::TEXT;
```
I updated both fact and dimension tables to store UUIDs as text.
Why? Because:
- Kafka produces UUIDs as strings ✅
- Spark reads them as `StringType()` ✅
- JDBC sends them as `VARCHAR` ✅
- Postgres accepts `TEXT` ✅
Everything finally just worked.
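A quick sanity check, if you want one: read the table back over the same JDBC connection and confirm the IDs round-trip as plain strings (again, connection details are placeholders):

```python
# Read the fact table back and eyeball a few IDs; they should be ordinary strings now
check = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/warehouse")
    .option("dbtable", "fact_transaction")
    .option("user", "postgres")
    .option("password", "postgres")
    .load()
)
check.select("transaction_id", "customer_id", "merchant_id").show(5, truncate=False)
```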
🧠 Industry Secret: TEXT is Totally Fine
Here’s the thing most people won’t tell you:
- Snowflake, BigQuery, Redshift — no native UUID types.
- BI tools? They love strings.
- Debugging? Easier with TEXT.
- Schema evolution? Smoother.
Unless you’re building OLTP-level transactional systems, TEXT for UUIDs is often the more pragmatic choice in data pipelines.
🔄 The Final Architecture
- Kafka → Spark Streaming → Postgres
- No more casting hacks
- Fully containerized via Docker Compose
- Automated dimension seeding
- Just `docker compose up` and relax
🚀 Lessons Learned
- UUID + Spark + JDBC + Postgres = ticking time bomb ⚠️
- TEXT > UUID for analytical pipelines (most of the time)
- Simplify your type system across the stack
- Optimize for operational smoothness, not theoretical purity
At the end of the day:
Real-world data engineering = hitting weird walls.
Surviving them = becoming a better engineer.