# CinePro - IMDb Analytics Dashboard

An interactive movie and TV analytics dashboard exploring 12M+ titles from the IMDb dataset. The backend serves 230M+ rows across pre-computed optimization tables in a 10GB DuckDB database, delivering sub-second query responses.

## Links
- App: https://imdb-dashboards.tigzig.com
- Docs: https://tigzig.com/app-documentation/movie-explorer.html
- GitHub (Frontend): https://github.com/amararun/shared-imdb-dashboards
- GitHub (Backend): https://github.com/amararun/shared-duckdb-dashboards-backend
- Full DuckDB Database (16GB): https://duckdb-upload.tigzig.com/s/x73-0B1PtnYW1-qwSNobVQ

## Tags
database-ai, duckdb, dashboards, imdb, react, fastapi

## Architecture

```
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│  React Frontend │ ──── │ Vercel Serverless│ ──── │ FastAPI Backend │
│  (Vite + TS)    │      │    (Proxy)       │      │ (DuckDB Server) │
└─────────────────┘      └──────────────────┘      └─────────────────┘
```

- Frontend: React + TypeScript + Vite + TailwindCSS (deployed on Vercel)
- Backend: FastAPI + DuckDB (self-hosted on Hetzner/Oracle Cloud servers)
- Proxy: Vercel serverless function forwards requests to backend
- Auth: Clerk authentication (optional, can be disabled)

### Backend API Endpoints
- `POST /api/query/{database}` - Execute SQL query against DuckDB
- `GET /api/admin/cache/stats` - Cache statistics

Backend repo: https://github.com/amararun/shared-duckdb-dashboards-backend
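As a rough sketch of how a client might call the query endpoint: the snippet below POSTs SQL to `/api/query/{database}`. The JSON body shape (`{"query": ...}`) and the `X-API-Key` header name are assumptions, not confirmed by the backend repo; check the FastAPI code for the actual contract.

```python
# Hypothetical client for POST /api/query/{database}. Body shape and
# auth header name are assumed -- verify against the backend repo.
import json
import urllib.request

BACKEND_URL = "https://your-backend.example.com"  # placeholder
API_KEY = "your-api-key"                          # placeholder


def build_query_request(database: str, sql: str) -> urllib.request.Request:
    """Build a POST /api/query/{database} request with a JSON body."""
    url = f"{BACKEND_URL}/api/query/{database}"
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-API-Key": API_KEY,  # assumed header name
        },
    )


def run_query(database: str, sql: str) -> dict:
    """Execute the query and parse the JSON response."""
    req = build_query_request(database, sql)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(run_query("imdb", "SELECT COUNT(*) AS n FROM person_filmography"))
```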

### Database Optimization Tables
The dashboard uses pre-computed tables for sub-second queries:
- `person_filmography` - Denormalized filmography (91M rows)
- `person_stats` - Pre-computed career statistics
- `dashboard_cache` - Single JSON blob for instant dashboard load (~650ms)
- `movie_tokens` - Jaccard similarity vectors for "Similar Movies" feature
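The `movie_tokens` table drives "Similar Movies" via Jaccard similarity. As a minimal sketch of the idea (the real table schema and token choice are not shown here; assume each movie maps to a set of tokens such as genres and keywords):

```python
# Jaccard similarity over token sets -- an illustration of the technique,
# not the actual movie_tokens implementation.

def jaccard(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(target: set[str], catalog: dict[str, set[str]], k: int = 5):
    """Rank catalog entries by Jaccard similarity to the target token set."""
    ranked = sorted(
        catalog.items(), key=lambda kv: jaccard(target, kv[1]), reverse=True
    )
    return ranked[:k]
```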

### Performance
- Dashboard load: ~650ms (single cached JSON)
- Deep Profile: ~300-600ms
- Search: ~100-200ms

## Features

- At a Glance: Database statistics, rating distributions, top movies/TV by genre
- Explore: Browse top-rated movies, TV series, mini-series, hidden gems
- Star Profiles: Deep dive into any actor's, actress's, or director's career: filmography, career stats, timeline, collaborator analysis, and side-by-side comparisons
- Through the Decades: Top rated and most prolific by era with adaptive thresholds
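The "adaptive thresholds" mentioned above address a sparsity problem: early decades have far fewer votes per title, so a fixed minimum-vote cutoff would leave them empty. The actual logic lives in `scripts-dataprocessing/create_toprated_byera.py`; the formula below (a fixed floor plus a fraction of the era's median vote count) is a hypothetical illustration, not the project's implementation.

```python
# Hypothetical per-era vote threshold -- the floor and fraction values
# are assumptions for illustration only.

def era_vote_threshold(median_votes: int, floor: int = 1000, fraction: float = 0.1) -> int:
    """Minimum votes a title needs to qualify within its era."""
    return max(floor, int(median_votes * fraction))
```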

## Data Source

IMDb Non-Commercial Datasets: https://datasets.imdbws.com/

## Local Development

### Prerequisites
- Node.js 18+
- Backend API access (or run your own DuckDB server)

### Setup
```bash
git clone https://github.com/amararun/shared-imdb-dashboards.git
cd shared-imdb-dashboards/frontend
npm install
cp .env.example .env.local
# Edit .env.local with backend URL and API key
npm run dev
```

### Environment Variables
- `VITE_DUCKDB_BACKEND_URL` (required) - DuckDB backend API URL
- `VITE_DUCKDB_BACKEND_API_KEY` (required) - API key for backend auth
- `VITE_AUTH_ENABLED` - Set to `false` to disable Clerk auth (default: enabled)
- `VITE_CLERK_PUBLISHABLE_KEY` - Clerk key (if auth enabled)
- `VITE_STATCOUNTER_PROJECT` / `VITE_STATCOUNTER_SECURITY` - StatCounter analytics (optional)
- `VITE_POSTHOG_KEY` / `VITE_POSTHOG_HOST` - PostHog analytics (optional)
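A minimal `.env.local` might look like the following; all values are placeholders, and the optional analytics keys are simply omitted:

```
VITE_DUCKDB_BACKEND_URL=https://your-backend.example.com
VITE_DUCKDB_BACKEND_API_KEY=replace-with-your-key
VITE_AUTH_ENABLED=false
# VITE_CLERK_PUBLISHABLE_KEY=pk_test_...   # only needed when auth is enabled
```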

### Project Structure
```
├── api/                       # Vercel serverless functions
│   └── duckdb.ts             # Proxy to backend API
├── frontend/
│   ├── src/
│   │   ├── components/       # Shared UI components
│   │   ├── features/         # Feature modules (imdb/)
│   │   ├── services/         # API client
│   │   └── contexts/         # React contexts
│   └── public/               # Static assets
├── scripts-dataprocessing/   # Data pipeline scripts
│   ├── download_and_import.py   # Download IMDb data & create DuckDB
│   ├── create_toprated_byera.py # Build optimization tables
│   └── ...                      # Analysis & EDA scripts
└── vercel.json               # Deployment config
```

### Building the IMDb Database from Scratch

```bash
cd scripts-dataprocessing
python download_and_import.py         # Download IMDb data, create base DuckDB
python create_toprated_byera.py       # Create optimization tables (needs running backend)
python update_prolific_summary_v2.py  # Update prolific person summaries
```

See `scripts-dataprocessing/README.md` for the full data pipeline documentation.
