Architecture & Setup for a Dashboard with Hundreds of Millions of Records - Powered by DuckDB
Published: February 5, 2026

Part 1 - Architecture & Setup for a Dashboard with Hundreds of Millions of Records. Fully customized user interface. Powered by DuckDB.
Custom movie analytics dashboard - 230 million rows, 16GB database file. Runs on a €8/month Hetzner VPS with 8GB RAM.
Built with AI-assisted coding (currently Claude Code). Handles drill-down, side-by-side comparisons, instant search across 15 million people, similar movie recommendations using Jaccard similarity, career timelines spanning decades.
If you're building a custom dashboard, especially one backed by a multi-GB database, the patterns here apply and the components can be extracted and adapted. In Part 2 I'll go deeper into query optimization - how queries that took 5-9s were brought down to 500ms-3s.
The Backend: FastAPI + DuckDB
Why DuckDB?
Embedded database - no server process to manage. Analytical queries work out of the box and are fast.
Database Setup
The DuckDB file sits on my Hetzner VPS at /data/imdb.duckdb. I use duckdb.connect() directly - no ORM. Each query opens a connection, runs, and closes. DuckDB is embedded, so connection overhead is negligible - about 1ms.
Two Endpoint Types
1. Read-only endpoint (/api/query/{filename})
- SQL sanitization strips DELETE, DROP, INSERT, UPDATE, ALTER
- Returns JSON results
- Rate limited: 20/second burst, 500/minute sustained
- Query timeout: 30 seconds
- In-memory cache: 2000 queries per file, 30-day TTL
2. Admin endpoint (/api/admin/query/{filename})
- Separate API key
- No SQL restrictions
- No timeout (allows long CREATE TABLE, bulk operations)
- Used for building optimization tables and data pipeline tasks directly on server
The database is 16GB. Admin endpoint lets me run transformations in place - create denormalized tables, compute aggregates, build indexes - all server-side.
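A sketch of the kind of keyword check the read-only endpoint performs (my own minimal version - it rejects rather than strips, and a denylist alone is not a security boundary, so pair it with a read-only connection):

```python
import re

BLOCKED = ("DELETE", "DROP", "INSERT", "UPDATE", "ALTER")

def check_read_only(sql: str) -> None:
    """Reject statements containing write/DDL keywords.
    Word-boundary match so identifiers like 'dropped_count' pass."""
    for kw in BLOCKED:
        if re.search(rf"\b{kw}\b", sql, re.IGNORECASE):
            raise ValueError(f"Statement blocked: contains {kw}")
```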
Rate Limiting
Per-IP limiting using slowapi. Burst limit is 20/second - high because dashboards send 10-15 parallel queries on page load. Sustained limit is 500/minute - prevents abuse while allowing heavy usage. For IP tracking I use a custom header (X-Original-Client-IP) because Cloudflare was overwriting standard headers like X-Forwarded-For. Custom headers pass through unchanged.
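The custom-header trick boils down to a key function handed to slowapi's Limiter. A minimal version (function name mine; in slowapi you'd register it as something like `Limiter(key_func=client_ip)` and decorate routes with the burst/sustained limits):

```python
def client_ip(request) -> str:
    """Rate-limit key: Cloudflare rewrites X-Forwarded-For, so trust the
    custom pass-through header first, then fall back to the socket peer."""
    return request.headers.get("x-original-client-ip") or request.client.host
```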
Frontend: React + Vercel
React app deployed on Vercel.
Mix of React Context and local component state. SettingsContext holds global preferences (backend selection, theme). Page-level data (search results, filmography, stats) lives in local useState - each page manages its own data.
Query Composition
SQL queries are template strings built in the frontend. The component constructs the query, sends it to the serverless proxy, proxy forwards to backend, results come back as JSON arrays.
const query = `
  SELECT primaryTitle, startYear, averageRating
  FROM person_filmography
  WHERE nconst = '${personId}'
  LIMIT 5000
`
const result = await executeImdbQuery(query)
Data Flow
User interaction → Component builds SQL → Serverless proxy call → Backend executes → Results stored in component state → Render
All backend calls route through /api/duckdb serverless function. No direct database access from browser.
Frontend-to-Backend: The Serverless Proxy
React frontend on Vercel. Backend URL never exposed in browser code. All API calls route through a Vercel serverless function (/api/duckdb) that acts as a proxy.
- Frontend calls /api/duckdb?action=imdb-query on its own domain
- Serverless function holds the backend API key in server-side env vars
- Function forwards request to FastAPI backend
- Backend returns results through serverless to frontend
Three layers of protection:
- Backend URL hidden from browser
- Backend API key never exposed client-side
- Optional: Clerk authentication at serverless layer (JWT validation before forwarding)
For low-security apps (public demos, personal projects), the API key alone is sufficient. For high-security needs, add JWT validation at both serverless and backend layers. Full details in Part 3 of my 2026 infra guide.
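The proxy's core job is to attach the secret server-side and resolve the backend URL. A sketch of that step (the real function is a Vercel serverless function, quite possibly in Node; `BACKEND_URL` and `BACKEND_API_KEY` are hypothetical env var names I'm using for illustration - the point is they exist only in the function's environment, never in browser code):

```python
import json
import os
import urllib.request

def build_upstream_request(sql: str) -> urllib.request.Request:
    """Build the forwarded request entirely server-side: resolve the hidden
    backend URL and attach the API key the browser never sees."""
    url = f"{os.environ['BACKEND_URL']}/api/query/imdb.duckdb"
    return urllib.request.Request(
        url,
        data=json.dumps({"sql": sql}).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['BACKEND_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```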
Auth Toggle
I built Clerk authentication with an on/off switch (AUTH_ENABLED=true/false env var). When disabled, app loads ~250ms faster - no Clerk SDK bootstrap, no token verification. Useful for public demos.
Frontend Components
Reusable for other dashboards. Download the repo, point your AI coder at a component and have it adapt it for your requirements.
Dashboard Overview
Static stat cards, charts, and pre-computed aggregates pulled from a single JSON blob.
Search
- Type-as-you-search autocomplete
- Smart Search: keeps the default search fast. IMDb has 15 million people, but most are one-time appearances. The toggle filters down to the 2.3 million people with 3 or more credits, which I expect to cover 80-90% of searches. Search loads in 200ms instead of 2 seconds. If someone doesn't show up, turn off Smart Search to search everyone.
- Advanced filters (era, genre, votes threshold) with auto-reset when typing
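The Smart Search toggle amounts to swapping which table the autocomplete query hits. A sketch (table names are illustrative: `prolific_people` stands in for the pre-filtered 2.3M-row table, `name_basics` for the full 15M-row one; `nconst`/`primaryName` are the standard IMDb columns):

```python
def search_sql(term: str, smart: bool) -> str:
    """Build the autocomplete query; smart=True targets the small
    pre-filtered table, smart=False searches everyone."""
    table = "prolific_people" if smart else "name_basics"
    safe = term.replace("'", "''")  # escape quotes before interpolating
    return (
        f"SELECT nconst, primaryName FROM {table} "
        f"WHERE primaryName ILIKE '%{safe}%' LIMIT 20"
    )
```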
Profile Analysis
- Complete filmography table: up to 5000 titles, client-side filter/sort (instant response)
- Career stats: total works, average rating, career span, highest rated, most voted
- Decade breakdown with average ratings
I load all data once (up to 5000 rows), then filter and sort client-side. No database round-trip for UI interactions.
Comparison Tools
- Side-by-side comparison (up to 4 people)
- Collaborators analysis (directors worked with, co-stars)
- Similar movies (Jaccard similarity on genre/keyword vectors)
- Drill-down modals: click any title for full metadata, hyperlinks to IMDb
These patterns would work for many other customer and product analytics dashboards.
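The Jaccard similarity behind the similar-movies feature is simple to state: intersection over union of two movies' genre/keyword sets. A minimal version:

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For example, {Drama, Crime, Thriller} vs {Crime, Thriller, Action} share 2 of 4 distinct genres, giving 0.5.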
Dual Backend Setup
The app runs on two backends simultaneously - Hetzner (Germany) and Oracle Cloud (Singapore). Frontend has a toggle to switch between them.
Primary reason: performance testing and failover. If Hetzner goes down, I flip the toggle to Oracle (this could also be automated on the backend). It's the same FastAPI code and the same database, just on an additional VPS.
- Frontend sends backend=hetzner or backend=oci parameter
- Serverless proxy routes to appropriate backend URL
- In production, this toggle could move to backend config and trigger automatically on health check failures
Useful for failover, A/B testing of infra changes and zero-downtime migrations.
Feature Toggles
I built several toggles into the architecture:
- Auth toggle (AUTH_ENABLED env var) - enable/disable Clerk authentication
- Backend toggle - switch between Hetzner/Oracle (exposed in UI for testing, would be backend-only in production)
- Smart Search toggle - user-facing checkbox to filter prolific people vs search all
- Query timer toggle - show/hide execution times for each query
The query timer pairs with the backend toggle. Turn on the timer, switch backends, run the same query twice. You can see exactly how Hetzner compares to Oracle. I built it for debugging but kept it in the UI so I can make realistic comparisons.
Key point - build these toggles into the architecture. You can decide later whether it's a user-facing feature or admin config. Easier to hide a toggle than to add one after launch.
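The env-var toggles all reduce to one tiny helper. A sketch (helper name mine; the env var name is the one from the article):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Read a toggle like AUTH_ENABLED=true/false from the environment."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")
```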
Data Pipeline
The dashboard is fast because I pre-compute tables via the data pipeline, creating them directly on the server through the admin endpoint.
Pipeline scripts (in scripts-dataprocessing/ folder):
- Download IMDb datasets (TSV files, ~10GB compressed)
- Import to DuckDB base tables (title_basics, title_principals, name_basics, title_ratings)
- Build optimization tables:
- person_filmography - denormalized (eliminates JOINs at query time)
- person_stats - one row per person with pre-computed aggregates
- prolific_summary - leaderboards for all era/category combinations
- dashboard_cache - entire dashboard response as single JSON blob
Most processing happens server-side via admin endpoint API calls. I send CREATE TABLE queries through the scripts to /api/admin/query/imdb.duckdb - SQL runs on the server. No need to download 10GB locally. I run the pipeline from my laptop, execution happens on the VPS.
Example - creating person_filmography (91M rows):
import requests

response = requests.post(
    f"{BACKEND_URL}/api/admin/query/imdb.duckdb",
    headers={"Authorization": f"Bearer {ADMIN_API_KEY}"},
    json={"sql": """
        CREATE TABLE person_filmography AS
        SELECT tp.nconst, tp.tconst, tp.category,
               tb.primaryTitle, tb.startYear, tb.titleType,
               tr.averageRating, tr.numVotes
        FROM title_principals tp
        JOIN title_basics tb ON tp.tconst = tb.tconst
        LEFT JOIN title_ratings tr ON tb.tconst = tr.tconst
    """},
)
Runs on the server, writes directly to the DuckDB file.
Refresh cycle: I haven't done a refresh yet. The plan is to run the re-download and all table builds directly on the server rather than from my laptop.
Reusability
The patterns here aren't specific to movies and can be reused across dashboards. Frontend components extract cleanly. Backend endpoints are generic (POST SQL, get JSON). The optimization strategy - pre-compute what's expensive, filter what's big, cache what's static.
Built for 230 million rows. Works the same for 230 thousand.
Container Memory Mismatch
This one was new to me. The initial Oracle Cloud deployment was 4x slower than Hetzner - same code, same database, same container size (8GB). At first I blamed storage: Oracle uses network-attached storage, Hetzner uses NVMe. The real problem was memory. Here's the issue:
Oracle's host machine has 24GB RAM. My container is limited to 8GB. When DuckDB starts, it checks how much RAM is available by reading /proc/meminfo. But containers don't have their own /proc/meminfo - they see the host's file. So DuckDB saw 24GB available, not 8GB.
DuckDB's default is to use 80% of available RAM. 80% of 24GB is about 19GB. So DuckDB tried to allocate 19GB in an 8GB container.
The container didn't crash immediately. Instead, it started swapping - moving data between RAM and disk constantly. That's why queries were slow, not failing. Memory thrashing looks like disk latency from the outside.
On Hetzner, no problem. The host has 8GB RAM, container has no limit. DuckDB sees 8GB, uses about 6GB. Everything fits.
Fix: Set DUCKDB_MEMORY_LIMIT=7GB explicitly on Oracle. DuckDB ignores /proc/meminfo and uses your number instead. Leave 1GB for the OS and FastAPI.
Lesson: In containers, always specify DuckDB memory explicitly.
Resources
- Live Dashboard: https://www.tigzig.com/movie-explorer
- Frontend Code: github.com/amararun/shared-imdb-dashboards
- Backend Code: github.com/amararun/shared-duckdb-dashboards-backend