How I Designed a Reliable, Auto-Resuming ETL to Collect Decades of Book Data, Without Airflow
1. Why I Built This
The National Library of Korea (NLK) provides a public API called Seoji, a bibliographic catalog of all registered books in Korea.
I wanted to collect the entire dataset, from January 2000 to December 2024, and store it in my PostgreSQL database (Supabase).
It sounded simple at first: just a loop over API pages. But in practice, I had to solve:
- Rate limits and connection timeouts,
- 25 years × 12 months of data (300+ runs),
- Occasional EC2 disconnects,
- Safe deduplication for millions of JSON records.
2. Understanding the NLK API
Reference: NLK Seoji API Docs
- Requires an API key (`cert_key`).
- Request parameter: `PUBLISH_PREDATE`.
- Iterate in small chunks, `page_size = 100`.
That meant I couldn't pull everything at once. I decided to partition by month, fetching one month at a time with clear start and end dates.
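In outline, a single page fetch looks something like this. Only `cert_key`, the publish-date window, and `page_size = 100` come from the points above; the endpoint URL, the remaining parameter names, and the `docs` response key are placeholders to verify against the Seoji docs.

```python
import requests

# Endpoint and parameter names below are assumptions; verify against the NLK Seoji API docs.
SEOJI_URL = "https://www.nl.go.kr/seoji/SearchApi.do"

def fetch_month_page(cert_key, start, end, page_no, page_size=100):
    """Fetch one page of records published between `start` and `end` (YYYYMMDD strings)."""
    params = {
        "cert_key": cert_key,
        "result_style": "json",
        "page_no": page_no,
        "page_size": page_size,
        "start_publish_date": start,
        "end_publish_date": end,
    }
    resp = requests.get(SEOJI_URL, params=params, timeout=(10, 60))  # (connect, read) timeouts
    resp.raise_for_status()
    return resp.json().get("docs", [])  # assumed key; an empty list means the month is exhausted
```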
3. The First Prototype β A Simple Fetcher
I began with fetch_pages.py, which:
- Pulled pages sequentially,
- Inserted JSON into `raw_nl_books`,
- Logged progress to the console.
It worked, but it wasn't sustainable:
- No resume logic: if the job crashed, I lost progress.
- Couldn't survive an EC2 reboot.
- I'd have to manually change the month each time.
4. From Manual to Autonomous
I needed the system to:
- Fetch all months automatically, from 2000 to 2024.
- Recover from network drops or timeouts.
- Avoid duplicates even if re-run.
- Run safely for days without supervision.
That meant I needed something more resilient, but still lightweight.
5. Orchestration Options I Considered
| Option | Pros | Cons | Decision |
|---|---|---|---|
| Cron | Easy to schedule | No retry or backoff; hard to monitor | ❌ |
| Airflow / Kestra | Full orchestration, UI | Overkill for one sequential task | ❌ |
| systemd | Built-in restart, logs, no external deps | Linux-only | ✅ Chosen |
I realized systemd already gave me what I needed:
- Auto-restart on failure,
- Persistent logs (`journalctl`),
- Easy enable/disable,
- Zero setup beyond one service file.
6. Designing a Checkpoint System
I created a simple folder:
~/nlk-state/
Each month has two small files:
```
2000-01.page   # last fetched page
2000-01.done   # month completed
```
If the system crashes, the next run just reads .page and resumes that month.
Why file-based checkpoints?
- No DB lockups.
- Transparent: I can open and see progress.
- Works even if DB is temporarily unavailable.
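In code, the checkpoint layer is only a handful of helpers. This is a minimal sketch assuming the layout above; the function names are illustrative, not the ones in the actual scripts.

```python
from pathlib import Path

STATE_DIR = Path.home() / "nlk-state"          # ~/nlk-state/
STATE_DIR.mkdir(exist_ok=True)

def last_page(month: str) -> int:
    """Last successfully fetched page for a month like '2000-01' (0 if none yet)."""
    f = STATE_DIR / f"{month}.page"
    return int(f.read_text().strip()) if f.exists() else 0

def save_page(month: str, page: int) -> None:
    """Record progress after every successful page so a crash resumes mid-month."""
    (STATE_DIR / f"{month}.page").write_text(str(page))

def mark_done(month: str) -> None:
    """Drop the .done marker once the month is fully fetched."""
    (STATE_DIR / f"{month}.done").touch()

def is_done(month: str) -> bool:
    return (STATE_DIR / f"{month}.done").exists()
```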
7. The Monthly Runner
I extended fetch_pages_month.py to:
- Fetch a specific month window (`start_publish_date` to `end_publish_date`).
- Write `.page` after each success.
- Exit non-zero on timeout (so systemd restarts it).
- Use `rec_hash = md5(source_record::text)` for idempotent inserts.
Each month now runs cleanly and stops when NLK returns zero results.
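Roughly, the month runner combines the fetch and checkpoint sketches from earlier sections like this. The psycopg2 usage and the exact `ON CONFLICT` statement are assumptions layered on top of the `rec_hash` idea described above.

```python
import json
import sys

import psycopg2  # assumed driver; the real script may use another client
import requests

# rec_hash is a generated UNIQUE column (md5 of source_record::text),
# so the database silently skips rows it has already stored.
INSERT_SQL = """
    INSERT INTO raw_nl_books (source_record)
    VALUES (%s::jsonb)
    ON CONFLICT (rec_hash) DO NOTHING
"""

def run_month(conn, cert_key, month, start, end):
    """Fetch every page for one month, checkpointing after each page."""
    page = last_page(month) + 1                  # resume where the last run stopped
    while True:
        try:
            records = fetch_month_page(cert_key, start, end, page)
        except requests.RequestException as exc:
            print(f"{month} page {page} failed: {exc}", file=sys.stderr)
            sys.exit(1)                          # non-zero exit -> systemd restarts and resumes here
        if not records:                          # NLK returned zero results: the month is finished
            return
        with conn.cursor() as cur:
            cur.executemany(INSERT_SQL, [(json.dumps(r),) for r in records])
        conn.commit()
        save_page(month, page)                   # checkpoint only after the page is safely stored
        page += 1
```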
8. The Master Manager: run_all_months.py
This script drives the entire 25-year loop:
2000-01 → 2000-02 → … → 2024-12
For each month:
- If `.done` exists → skip.
- Else → run `fetch_pages_month.py` as a subprocess.
- On success → create `.done` and move on.
- On error → exit (systemd restarts it and resumes the same month; sketched below).
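Here is a minimal sketch of that loop, reusing `is_done` and `mark_done` from the checkpoint helpers above; the `--month` flag and the script path are illustrative, not the real interface.

```python
import subprocess
import sys

def month_range(start=(2000, 1), end=(2024, 12)):
    """Yield 'YYYY-MM' strings from start to end, inclusive."""
    y, m = start
    while (y, m) <= end:
        yield f"{y:04d}-{m:02d}"
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

def main():
    for month in month_range():
        if is_done(month):                      # .done marker => month already finished, skip it
            continue
        proc = subprocess.run(
            [sys.executable, "scripts/fetch_pages_month.py", "--month", month]
        )
        if proc.returncode != 0:                # bubble the failure up; systemd restarts the manager
            sys.exit(proc.returncode)
        mark_done(month)                        # month succeeded: create .done and move on
    # Falling out of the loop means 2024-12 finished: exit 0 and systemd stops the unit.

if __name__ == "__main__":
    main()
```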
Why a "manager" instead of a single giant loop?
Because this keeps boundaries clear:
- The month script focuses on API + DB work.
- The manager focuses on flow control and checkpointing.
9. Supervisor: systemd
My nlk-history.service file runs the manager in the background.
```ini
[Service]
User=ec2-user
WorkingDirectory=/home/ec2-user/kbook-data-pipeline
ExecStart=/home/ec2-user/kbook-data-pipeline/venv/bin/python /home/ec2-user/kbook-data-pipeline/scripts/run_all_months.py
Restart=on-failure
RestartSec=30s
```
Why systemd over Airflow or Cron?
- Restarts automatically on non-zero exit.
- Doesnβt require any UI or database.
- Clean logs via `journalctl`.
- Perfect for long EC2 jobs.
When the final month finishes, the script exits 0 and systemd stops gracefully.
10. Verifying the Pipeline
Sanity checks:
- `.page` updates every few minutes.
- `.done` appears when each month completes.
- Restarted EC2 → resumed the same page automatically.
- No duplicate inserts, thanks to `rec_hash` uniqueness (a quick check is sketched below).
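That last check can be made concrete with a one-off query. This sketch assumes psycopg2 and a `DATABASE_URL` environment variable, neither of which is part of the actual pipeline scripts.

```python
import os

import psycopg2  # assumed driver and connection style

# Duplicates exist iff the row count exceeds the number of distinct rec_hash values.
with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) - count(DISTINCT rec_hash) FROM raw_nl_books")
        duplicates, = cur.fetchone()
        print(f"duplicate rows: {duplicates}")  # expected: 0
```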
Full runtime: ≈30 minutes per month × 300 months ≈ 6 days of continuous crawling.
11. Challenges Solved
| Problem | Solution |
|---|---|
| Duplicate IDs | Reseeded identity column (ALTER SEQUENCE … RESTART) |
| Duplicate rows | Added rec_hash as generated unique column |
| Timeout errors | Added exponential backoff + longer read timeout (sketch below) |
| EC2 disconnects | Moved job under systemd supervision |
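The backoff fix, in sketch form: wrap the page fetch in a retry helper that doubles the wait between attempts. The retry count and delays are illustrative, not the values used in the real script.

```python
import time

import requests

def fetch_with_backoff(fetch, *args, retries=5, base_delay=2.0):
    """Retry a fetch callable with exponential backoff: 2s, 4s, 8s, ... between attempts."""
    for attempt in range(retries):
        try:
            return fetch(*args)
        except requests.RequestException:
            if attempt == retries - 1:
                raise                 # give up; the caller exits non-zero and systemd takes over
            time.sleep(base_delay * 2 ** attempt)

# usage: records = fetch_with_backoff(fetch_month_page, cert_key, start, end, page)
```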
12. Results
After a week of unattended runtime:
- ✅ Full dataset from 2000–2024 stored in Postgres
- ✅ Zero duplicate rows
- ✅ Automatic restarts during network hiccups
- ✅ Completely hands-off process
In total: ~6 million book records collected and stored safely.
13. Lessons Learned
- Simplicity scales: small, reliable scripts outperform heavy frameworks when you only need sequential control.
- Explicit checkpoints > retries.
- systemd is underrated as a production-grade job runner.
- Design for resumability first, optimization later.