🧭 Situation

I’m setting up a Metabase dashboard to view the status of my data stored in Supabase.

Options for a Secure Connection to Metabase

Option 1. IP Allowlist via EC2 Security Group
Allow inbound access only from your home or office IP.
Effect: Only you can reach the login page.
Tradeoff: You’ll need to update the rule if you travel or use a new network.

Option 2. Put Metabase Behind a Reverse Proxy + Basic Auth
Run an Nginx reverse proxy on the same EC2 instance. ...
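To make Option 1 concrete, here is a minimal sketch using boto3, assuming Metabase listens on its default port 3000; the security group ID and the way the current IP is looked up are illustrative placeholders, not details from the post.

```python
import boto3
import requests

# Look up the current public IP (placeholder approach) and allow only it
# to reach Metabase's default port 3000 on the instance's security group.
my_ip = requests.get("https://checkip.amazonaws.com", timeout=10).text.strip()

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder security group ID
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3000,
        "ToPort": 3000,
        "IpRanges": [{"CidrIp": f"{my_ip}/32", "Description": "home IP only"}],
    }],
)
```

When your IP changes, revoking the old rule and re-running the same call keeps the allowlist current.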
Making Postgres Search Fast and Accurate with FTS and dbt Indexes
Here’s how I added search to my KBooks site. All my book data was already flowing into Supabase from the National Library of Korea API through a dbt pipeline, so the challenge was: How do I make this table searchable efficiently without adding another service at this stage? This is how I got database-level search ready, step by step, and why each decision made sense.

1. Start from a clean source: silver_books

My raw data lands as JSON in raw_nl_books. From there, I built a silver_books table using dbt. Each row represents one unique book identified by its ISBN-13, and only valid ISBNs are kept. ...
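The excerpt cuts off before the search-specific steps, but the database-level technique the title points at (Postgres full-text search backed by an index) can be sketched. This is a hedged illustration, not the post's dbt model: the column names (title, author, isbn13) and the connection string are assumptions.

```python
import psycopg2

conn = psycopg2.connect("postgresql://user:password@localhost:5432/postgres")
with conn, conn.cursor() as cur:
    # Generated tsvector column: kept in sync by Postgres itself.
    cur.execute("""
        ALTER TABLE silver_books
            ADD COLUMN IF NOT EXISTS search_vector tsvector
            GENERATED ALWAYS AS (
                to_tsvector('simple',
                            coalesce(title, '') || ' ' || coalesce(author, ''))
            ) STORED
    """)
    # GIN index so @@ searches don't scan the whole table.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS idx_silver_books_search
            ON silver_books USING gin (search_vector)
    """)
    # Query side: match against the indexed column.
    cur.execute("""
        SELECT isbn13, title
        FROM silver_books
        WHERE search_vector @@ websearch_to_tsquery('simple', %s)
        LIMIT 20
    """, ("science fiction",))
    print(cur.fetchall())
```

In a dbt setup, the index would more likely be created by a post-hook on the model than by ad-hoc SQL like this.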
📚 Building a 25-Year Backfill Pipeline for the National Library of Korea API
How I Designed a Reliable, Auto-Resuming ETL to Collect Decades of Book Data — Without Airflow

1. Why I Built This

The National Library of Korea (NLK) provides a public API called Seoji — a bibliographic catalog of all registered books in Korea. I wanted to collect the entire dataset, from January 2000 to December 2024, and store it in my PostgreSQL database (Supabase). It sounded simple at first — just a loop over API pages. But in practice, I had to solve: ...
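The list of problems is truncated above, but the auto-resuming core is easy to sketch. This is a hedged, illustrative loop, not the author's pipeline: the endpoint URL, parameter names, and response shape stand in for the real Seoji API, and the load step is left out.

```python
import json
import pathlib
import time

import requests

CHECKPOINT = pathlib.Path("checkpoint.json")

def load_checkpoint():
    # Resume from the last saved (month, page); otherwise start at Jan 2000.
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"month": "2000-01", "page": 1}

def save_checkpoint(state):
    CHECKPOINT.write_text(json.dumps(state))

state = load_checkpoint()
while state["month"] <= "2024-12":
    resp = requests.get(
        "https://example.org/seoji",                    # placeholder endpoint
        params={"month": state["month"], "page": state["page"]},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json().get("docs", [])                  # placeholder response shape

    if rows:
        # load_rows(rows)  -> insert into raw_nl_books (not shown)
        state["page"] += 1
    else:
        # Month exhausted: move to the next month, back to page 1.
        year, month = map(int, state["month"].split("-"))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        state = {"month": f"{year:04d}-{month:02d}", "page": 1}

    save_checkpoint(state)                              # crash-safe resume point
    time.sleep(0.5)                                     # stay polite to the API
```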
🚀 Building a Fintech Batch ETL Pipeline — the Modular Way
👉 Code, Portfolio, Blog, and LinkedIn

🎯 Batch Pipeline for Transaction Data

Imagine: K-pop demon hunters launches a fintech startup for the fans. Now they have to deal with millions of credit card transactions every day — and they need to make sense of them. ...
📚 SQL Structs: When Your Database Learns to Think Like a Bookshelf
Imagine your database as a giant library. For decades, we’ve been organizing it like a traditional library catalog system – every book detail gets its own card, filed in separate drawers. Title cards in one drawer, author cards in another, publication year in yet another. This is what we call normalization in database terms. But what if your library could store complete book information in a single, smart envelope? That envelope could contain the title, author, publication details, and even reviews – all tucked neatly together. This is essentially what a struct does in modern SQL databases. ...
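The post doesn't name an engine at this point, so as a hedged sketch, here is the "smart envelope" idea in DuckDB (one SQL engine with struct support), driven from Python; the table and field names are made up.

```python
import duckdb  # pip install duckdb

con = duckdb.connect()

# One column holds the whole nested record instead of spreading its fields
# across normalized tables.
con.execute("""
    CREATE TABLE library (
        isbn VARCHAR,
        book STRUCT(title VARCHAR, author VARCHAR, published_year INTEGER)
    )
""")
con.execute("""
    INSERT INTO library VALUES
        ('9780441172719',
         {'title': 'Dune', 'author': 'Frank Herbert', 'published_year': 1965})
""")

# Dot notation reaches inside the struct.
print(con.execute("SELECT book.title, book.published_year FROM library").fetchall())
# [('Dune', 1965)]
```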
💯 From Basic to Intermediate: Understanding dbt Tests
If you’re using dbt to transform data, you’re already winning. But did you know dbt has powerful testing features to keep your data clean, reliable, and trustworthy? In this post, we’ll walk through:

✅ Basic dbt tests — the quick wins
🚀 Intermediate tests — custom logic and reusable macros

✅ Basic dbt Tests (Built-in)

dbt has out-of-the-box tests you can define in your .yml files under your models. Here’s an example: ...
How PostgreSQL Surprises You: Booleans, Text I/O, and ETL Gotchas
PostgreSQL is a powerful, standards-compliant database — but it has its quirks. One of those is how it handles boolean values, especially when exporting data in text format.

🧠 PostgreSQL Boolean Behavior: It’s Not What You Think

Internally, PostgreSQL stores boolean values compactly, in a single byte per value. But when you convert those values to text, say in a query or an export via COPY, things look… different: ...
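To see why this bites ETL code, here is a small hedged illustration: in text output (for example COPY ... TO STDOUT), Postgres renders booleans as 't' and 'f', and naive Python truthiness gets the false case wrong because any non-empty string is truthy.

```python
raw = "f"  # what Postgres text output gives you for FALSE

print(bool(raw))     # True  -- wrong: a non-empty string is always truthy

# Map the textual form explicitly instead.
PG_BOOL = {"t": True, "f": False}
print(PG_BOOL[raw])  # False -- correct
```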
UUID Pitfalls in Spark → Kafka → Postgres Pipelines
I was building a data pipeline using Kafka and Spark Structured Streaming. Fully containerized. The stack:

Kafka for streaming transaction data
Spark Structured Streaming for real-time processing and fraud detection
Postgres as the data warehouse

Everything was smooth. Until one tiny villain showed up: UUID fields. Yes — UUIDs. Here’s exactly what happened (so you can avoid the same headache).

✅ The Original Design

I designed the tables in Postgres like this: ...
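The table design and the eventual fix are cut off above, so as a hedged aside: one common workaround for this class of mismatch (Spark writing string-typed UUIDs into Postgres uuid columns over JDBC) is the pgjdbc stringtype=unspecified URL parameter, which lets Postgres cast the incoming text itself. The sketch below uses placeholder connection details and is not necessarily the approach the post settles on.

```python
def write_batch(batch_df, batch_id):
    # Called from Structured Streaming via foreachBatch; batch_df is a normal
    # DataFrame whose UUID columns are plain strings on the Spark side.
    (batch_df.write
        .format("jdbc")
        .option("url",
                "jdbc:postgresql://postgres:5432/warehouse?stringtype=unspecified")
        .option("dbtable", "transactions")      # placeholder table
        .option("user", "spark")                # placeholder credentials
        .option("password", "secret")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save())

# Wiring it up (sketch):
# query = parsed_stream.writeStream.foreachBatch(write_batch).start()
```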
🛡️ Solving the Kerberos User Authentication Issue in Spark Docker Streaming
While building my real-time streaming pipeline using Spark, Kafka, and Docker, I ran into a Spark error related to Kerberos authentication, even though I wasn’t using Kerberos at all.

org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name

❓ What triggered the problem?

I was using the official apache/spark:3.5.0 Docker image. Spark inside Docker was trying to resolve Hadoop’s default authentication mechanism. Hadoop tries to retrieve the current OS user via UnixPrincipal(name). Inside Docker containers, my app was running as a UID/GID that had no proper username mapping. This caused invalid null input: name, because UnixPrincipal() received null. ...
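A small hedged diagnostic (not the post's fix) reproduces the root condition from Python before Spark ever starts: the container's UID has no passwd entry, so the JVM's UnixPrincipal ends up with a null name.

```python
import os
import pwd

uid = os.getuid()
try:
    # If this succeeds, Hadoop's login can resolve a username too.
    print("running as", pwd.getpwuid(uid).pw_name)
except KeyError:
    print(f"UID {uid} has no /etc/passwd entry -- "
          "Hadoop's UnixPrincipal will receive null")
```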
🫙 The Final Spark Streaming Hurdle: When --jars Isn't Enough for Kafka
As a data engineer, there’s nothing quite like the satisfaction of reaching the “final hurdle” in a complex distributed system setup. Today, I want to share a frustrating but very common issue with Apache Spark Structured Streaming + Kafka: 👉 the dreaded Failed to find data source: kafka error.

🧨 The Problem: Everything Seems Right — But Kafka Won’t Load

Picture this: Spark cluster is up and running. Postgres connection works. Kafka is producing events. Your code calls .readStream.format("kafka")… And then: ...
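The error output is cut off above, but the usual shape of the fix is worth sketching: pull the Kafka source in via spark.jars.packages so Ivy resolves its transitive dependencies (kafka-clients, the token provider, commons-pool2), which a lone spark-sql-kafka jar passed through --jars does not bring along. This assumes PySpark 3.5.0 with Scala 2.12; the broker and topic names are placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-stream")
         # Resolves the connector AND its transitive dependencies at launch.
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
         .getOrCreate())

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder broker
          .option("subscribe", "transactions")              # placeholder topic
          .load())
```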