πŸ“š Building a 25-Year Backfill Pipeline for the National Library of Korea API

How I Designed a Reliable, Auto-Resuming ETL to Collect Decades of Book Data (Without Airflow)

1. Why I Built This

The National Library of Korea (NLK) provides a public API called Seoji, a bibliographic catalog of all registered books in Korea. I wanted to collect the entire dataset, from January 2000 to December 2024, and store it in my PostgreSQL database (Supabase). It sounded simple at first: just a loop over API pages. But in practice, I had to solve: ...
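The heart of the pipeline this excerpt hints at, a page-by-page fetch loop that can resume after a failure, might look roughly like the sketch below. The endpoint URL, query parameters, and checkpoint format are placeholders invented for illustration, not the real Seoji API or the post's actual code.

```python
import json
import pathlib

import requests

CHECKPOINT = pathlib.Path("checkpoint.json")

def load_checkpoint():
    # Resume from the last saved cursor, or start at the beginning of the backfill.
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"month": "2000-01", "page": 1}

def fetch_page(month, page):
    # Placeholder endpoint and parameter names, not the real Seoji API spec.
    resp = requests.get(
        "https://api.example.com/seoji/search",
        params={"month": month, "page": page, "page_size": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

state = load_checkpoint()
while True:
    payload = fetch_page(state["month"], state["page"])
    docs = payload.get("docs", [])
    if not docs:
        break  # this month is exhausted; a full run would advance to the next month
    # ... upsert `docs` into the PostgreSQL table here ...
    state["page"] += 1
    CHECKPOINT.write_text(json.dumps(state))
```

Persisting the month/page cursor after every successful page is what lets an interrupted backfill resume where it stopped instead of starting over from January 2000.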

October 22, 2025

πŸš€ Building a Fintech Batch ETL Pipeline: the Modular Way

πŸ‘‰ Code, Portfolio, Blog, and LinkedIn

🎯 Batch Pipeline for Transaction Data

Imagine: K-pop demon hunters launch a fintech startup for the fans. Now they have to deal with millions of credit card transactions every day, and they need to make sense of them. ...

September 12, 2025

🧹 Data Cleansing: Why You Should Always Clean at the Staging Layer

In real-world data engineering pipelines, one of the most common mistakes is postponing data cleansing until too late in the pipeline. The cleaner your upstream data is, the simpler and more maintainable your downstream models will be. Let's break it down.

βœ… The Principle

Cleanse your data as early as possible, ideally at the staging layer.

βœ… The Why

1️⃣ Clear Separation of Responsibilities

Staging models are responsible for: ...
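The excerpt doesn't show the post's actual models, but a staging-layer cleanse in a Spark-based pipeline could look something like this sketch; the paths and column names are hypothetical, chosen only to illustrate the principle.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stg_transactions").getOrCreate()

# Hypothetical raw source; column names are made up for illustration.
raw = spark.read.parquet("s3://my-bucket/raw/transactions/")

stg = (
    raw
    # Normalize messy string fields once, up front.
    .withColumn("merchant_name", F.trim(F.lower(F.col("merchant_name"))))
    # Enforce types early so downstream models never have to re-cast.
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumn("transacted_at", F.to_timestamp("transacted_at"))
    # Drop rows that can never be analyzed, plus obvious duplicates.
    .filter(F.col("transaction_id").isNotNull())
    .dropDuplicates(["transaction_id"])
)

stg.write.mode("overwrite").parquet("s3://my-bucket/staging/transactions/")
```

Every model that reads the staging output now inherits clean types and deduplicated rows for free, which is exactly the separation of responsibilities the post argues for.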

June 4, 2025

πŸ”§ Why Do We Split Airflow into init, scheduler, and webserver?

If you start working with Airflow a bit more seriously, you'll quickly notice that it's usually split into multiple services: airflow-init, airflow-scheduler, and airflow-webserver. At first, you may wonder: "Why do we need to split them up like this?" Well, this is actually the standard production architecture. Let's break it down in simple, practical terms.

1️⃣ airflow-init: Preparation Step

Also sometimes called airflow-db-migrate or airflow-bootstrap, this runs only once, when you initialize Airflow. ...

May 30, 2025

πŸš€ Building a Batch Data Pipeline with AWS, Airflow, and Spark

✨ Project Summary

Assuming I am working for a fintech company, I built a batch pipeline that automatically aggregates → transforms → analyzes credit card data. Since I couldn't use real data, I used synthetic transaction data generated with Faker, which I believe was sufficient for designing the overall data flow and structure.

🎯 Goal

"Build an Airflow pipeline that processes realistic financial data with Spark, then analyzes and stores the results." ...
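As a rough idea of what that Faker step might look like, here is a minimal sketch; the schema, field names, and row count are placeholders, not the project's actual generator.

```python
import csv
import random

from faker import Faker

fake = Faker()

# Hypothetical schema for synthetic credit card transactions.
FIELDS = ["transaction_id", "card_number", "merchant", "amount", "currency", "transacted_at"]

with open("transactions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    for _ in range(10_000):
        writer.writerow({
            "transaction_id": fake.uuid4(),
            "card_number": fake.credit_card_number(),
            "merchant": fake.company(),
            "amount": round(random.uniform(1.0, 500.0), 2),
            "currency": "USD",
            "transacted_at": fake.date_time_between(start_date="-30d", end_date="now").isoformat(),
        })
```

A file like this is enough for the downstream Spark jobs to exercise realistic aggregation and transformation logic without ever touching real cardholder data.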

May 1, 2025