# Security Checklist for Web Apps
React + FastAPI + Postgres + DuckDB + Cloudflare — 95 items

## 1. React Frontend (Vite + Vercel)

### 1.1. Security Headers

**THE RISK:** Without security headers, browsers don't enforce basic protections. Attackers can embed your site in an iframe to steal clicks, trick browsers into executing files as the wrong type, or downgrade your HTTPS connection. These headers are free to add and stop entire categories of attacks.

**THE SOLUTION:** Add a few lines to your hosting configuration that tell every browser visiting your site: always use HTTPS, don't guess file types, don't allow framing, and don't leak the full URL when navigating away. These are one-time settings that protect every page on your site automatically.

**THE FIX:**
```
// vercel.json — add to headers array
{
  "source": "/(.*)",
  "headers": [
    { "key": "Strict-Transport-Security",
      "value": "max-age=31536000; includeSubDomains" },
    { "key": "X-Content-Type-Options", "value": "nosniff" },
    { "key": "X-Frame-Options", "value": "SAMEORIGIN" },
    { "key": "Referrer-Policy",
      "value": "strict-origin-when-cross-origin" }
  ]
}
```

*WARNING — EMBEDDING GOTCHA: Use DENY if your app should never be embedded. But if your app IS embedded by your own parent site (e.g., myapp.com embeds dashboard.myapp.com), REMOVE X-Frame-Options entirely and use CSP frame-ancestors instead (see 1.2). X-Frame-Options only supports DENY or SAMEORIGIN — it cannot whitelist specific external domains. Mixing X-Frame-Options: DENY with frame-ancestors is unreliable: modern browsers ignore X-Frame-Options when frame-ancestors is present, but older ones may enforce whichever is stricter — so remove X-Frame-Options rather than relying on the combination.*

### 1.2. Content Security Policy (CSP)

**THE RISK:** Without CSP, any XSS vulnerability can load external scripts, send data to attacker-controlled servers, or embed your page in malicious iframes. CSP is the most powerful browser-side defense — it tells the browser exactly which sources are allowed for scripts, styles, images, and connections.

**THE SOLUTION:** You add a Content Security Policy header that acts like a whitelist for your browser. You tell it: only run scripts from my own domain, only load fonts from Google Fonts, only connect to my API servers, and don't let anyone embed my page in a frame. If anything else tries to load, the browser blocks it automatically.

**THE FIX:**
```
// vercel.json — add alongside other security headers
// (real JSON requires "value" to be one single-line string; wrapped here for readability)
{ "key": "Content-Security-Policy",
  "value": "default-src 'self'; script-src 'self';
  style-src 'self' 'unsafe-inline';
  img-src 'self' data: https:;
  font-src 'self' https://fonts.gstatic.com;
  connect-src 'self' https://*.yourdomain.com;
  frame-ancestors 'self'" }
```

*Start with Content-Security-Policy-Report-Only to find violations without breaking your site, then switch to enforcing mode. IMPORTANT: If this app is embedded by a parent site on a different subdomain, set frame-ancestors to whitelist the parent (e.g., frame-ancestors 'self' https://*.yourdomain.com https://yourdomain.com http://localhost:*) AND remove X-Frame-Options from your security headers (item 1.1). X-Frame-Options cannot whitelist specific domains — it will override frame-ancestors and block your own parent from embedding. CSP MAINTENANCE: Every time you add a new API endpoint, analytics tool, CDN resource, or self-host an asset, update connect-src/script-src. Missing entries cause SILENT fetch failures — no console errors, just broken features. Review CSP after every architecture change. INLINE ANALYTICS: If your app uses inline analytics scripts (StatCounter, PostHog, GA), strict script-src 'self' will break them. Options: nonces (complex for static Vite builds), hashes (brittle), or skip script-src and rely on other protections. For public read-only dashboards with no user input, CSP is lower priority than rate limiting and SQL validation.*
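The Report-Only rollout mentioned above is just another header entry in vercel.json. A sketch (the /api/csp-report endpoint name is illustrative — point report-uri at whatever collector you use; in real JSON the value must be a single line):

```
{ "key": "Content-Security-Policy-Report-Only",
  "value": "default-src 'self'; script-src 'self'; report-uri /api/csp-report" }
```

Browsers apply the policy virtually, report violations, and block nothing — run it for a week, review the reports, then rename the key to Content-Security-Policy.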

### 1.3. CORS Origin Whitelisting

**THE RISK:** Using Access-Control-Allow-Origin: * on your API endpoints means any website on the internet can call your API from a user's browser. An attacker's site can make requests to your feedback endpoint, booking system, or any other API using the victim's browser — and your server will happily respond.

**THE SOLUTION:** Instead of allowing every website to call your API, maintain a list of your own domains that are allowed. When a request comes in, check if the caller's website is on your list. If it is, allow it. If not, reject it. This way only your own frontend can talk to your backend.

**THE FIX:**
```
// Vercel serverless function
const ALLOWED_ORIGINS = [
  'https://yourdomain.com',
  'http://localhost:5173',  // dev only
];

function getCorsOrigin(req) {
  const origin = req.headers.origin;
  if (ALLOWED_ORIGINS.includes(origin)) return origin;
  return ALLOWED_ORIGINS[0]; // safe default — never echo unknown origins
}

// Vary: Origin keeps shared caches from serving one origin's
// CORS response to a different origin
res.setHeader('Vary', 'Origin');
res.setHeader('Access-Control-Allow-Origin', getCorsOrigin(req));
```

### 1.4. Rate Limiting (Serverless)

**THE RISK:** Without rate limiting, anyone can call your API endpoints thousands of times per second. This burns your serverless function quota, overloads your backend, and can be used to brute-force codes or spam your feedback/email systems. Serverless functions have no built-in rate limiting.

**THE SOLUTION:** Use a fast counter (like Redis) to track how many requests each visitor has made in the last few minutes. If someone exceeds the limit — say 10 requests per hour for a feedback form — reject additional requests with a "slow down" message. Each endpoint can have its own limit based on how sensitive it is.

**THE FIX:**
```
// Using Upstash Redis (@upstash/redis)
import { Redis } from '@upstash/redis';
const redis = new Redis({
  url: process.env.KV_REST_API_URL,
  token: process.env.KV_REST_API_TOKEN
});

// Key per endpoint + IP; limit and window are the 10/hour feedback example
const ip = req.headers['x-forwarded-for']?.split(',')[0] || 'unknown';
const rateKey = `rate:${req.url}:${ip}`;

const count = await redis.incr(rateKey);
if (count === 1) await redis.expire(rateKey, 3600); // first hit starts the window
if (count > 10) {
  return res.status(429).json({ error: 'Too many requests — slow down' });
}
```

*Typical limits: 5/15min for auth, 10/hour for feedback, 30/min for data APIs. RACE CONDITION: incr and expire are two separate HTTP calls. If the serverless function crashes between them, the Redis key has no TTL and lives forever — that IP is permanently rate-limited. Safety check: if count > 1 and TTL is -1, re-set the expiry. Or use @upstash/ratelimit which handles this atomically via Lua scripts.*
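The atomic alternative mentioned above, sketched with @upstash/ratelimit — not a drop-in: Redis.fromEnv() expects UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN to be set, and the 10-per-hour limit mirrors the feedback example:

```
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, '1 h'), // 10 requests per rolling hour
});

// Identify callers by IP (or user ID for authenticated routes)
const { success } = await ratelimit.limit(ip);
if (!success) return res.status(429).json({ error: 'Too many requests' });
```

The library runs increment-and-expire as a single Lua script, so a crashed function can never leave a counter without a TTL.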

### 1.5. Login Brute Force Protection

**THE RISK:** General rate limiting still allows many password guesses per minute. A login endpoint needs a separate pattern: count only failed attempts, lock out after 5 failures, and reset the counter on a successful login. Without this, an attacker can systematically try passwords within your general rate limit.

**THE SOLUTION:** Keep a separate counter just for failed login attempts per visitor. After 5 wrong passwords, lock them out for 15 minutes. When they log in successfully, reset their failure count to zero. This is different from general rate limiting because only failures count, and a correct password clears the slate.

**THE FIX:**
```
const failKey = `loginfail:${ip}`;

// Already locked out? Reject before touching credentials
if (((await redis.get(failKey)) ?? 0) >= 5) {
  return res.status(429).json({ error: 'Too many failed attempts — try again in 15 minutes' });
}

const ok = await verifyPassword(email, password); // your real auth check
if (!ok) {
  const fails = await redis.incr(failKey);            // only failures count
  if (fails === 1) await redis.expire(failKey, 900);  // 15-minute lockout window
  return res.status(401).json({ error: 'Invalid credentials' });
}
await redis.del(failKey); // success resets the counter
```

*The key difference from general rate limiting: only failures count, and success resets the counter. Alternative: check-first pattern (redis.get before processing) avoids incrementing on already-blocked requests, but costs an extra Redis call (~20ms). Both patterns are valid.*

### 1.6. Environment Variable Safety

**THE RISK:** In Vite, any environment variable starting with VITE_ gets bundled into your frontend JavaScript — visible to anyone who views source. If you put an API key, database URL, or secret in a VITE_ variable, it's public. This is the most common accidental secret exposure in React apps.

**THE SOLUTION:** Only use the VITE_ prefix for values that are meant to be public (like your login page domain or API URL). Keep all secrets — API keys, database passwords, private tokens — as regular environment variables without the VITE_ prefix. Those stay on the server and never reach the browser.

**THE FIX:**
```
# Safe as VITE_ (these are public anyway):
VITE_AUTH0_DOMAIN=yourapp.auth0.com
VITE_API_BASE_URL=https://api.yourdomain.com

# NEVER use VITE_ prefix for these:
BREVO_API_KEY=xkeysib-...     # server-side only
DATABASE_URL=postgres://...    # server-side only
KV_REST_API_TOKEN=...          # server-side only
```

*Also ensure no .env files are committed to git. Add .env, .env.local, .env.*.local to .gitignore. VITE GOTCHA: When using loadEnv() in vite.config.ts to read non-VITE_ env vars (for dev server proxy config), pass empty string as third parameter: loadEnv(mode, process.cwd(), ''). Default prefix is 'VITE_' which silently filters out your secrets. The loaded vars only exist in vite.config.ts (server-side), not in the frontend bundle.*
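The loadEnv() pattern from the note, as a sketch (BACKEND_URL is an illustrative variable name — any non-VITE_ secret works the same way):

```
// vite.config.ts — dev-only proxy driven by a non-VITE_ env var
import { defineConfig, loadEnv } from 'vite';

export default defineConfig(({ mode }) => {
  const env = loadEnv(mode, process.cwd(), ''); // '' = load ALL vars, not just VITE_*
  return {
    server: {
      proxy: {
        // BACKEND_URL is used here at config time only — it never enters the bundle
        '/api': { target: env.BACKEND_URL, changeOrigin: true },
      },
    },
  };
});
```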

### 1.7. Dependency Audit

**THE RISK:** Outdated npm packages contain known vulnerabilities that attackers actively exploit. For example, older versions of react-router-dom had an XSS vulnerability via open redirects. Running npm audit before every deploy takes 5 seconds and catches these.

**THE SOLUTION:** Run a single command before every deployment that checks all your installed packages against a database of known vulnerabilities. If anything is flagged, update that specific package. It takes 5 seconds and catches issues that would otherwise require a security researcher to find.

**THE FIX:**
```
npm audit                            # check for vulnerabilities
npm install react-router-dom@latest  # fix specific packages
```

*Watch for: react-router-dom (XSS), @babel/helpers (ReDoS), ajv (ReDoS). @vercel/node typically has transitive vulnerabilities (undici, path-to-regexp) that require breaking version bumps to fix. Assess whether the specific vulns affect your code path before force-upgrading — most are mitigated by Vercel's infrastructure layer. Don't blindly npm audit fix --force.*

### 1.8. Source Maps and Console Stripping

**THE RISK:** Source maps expose your entire unminified source code to anyone using browser DevTools. Console.log statements in production can leak sensitive data like user IDs, API responses, and internal state. Vite disables source maps by default — don't re-enable them.

**THE SOLUTION:** Add a build setting that automatically strips out all console.log and debugger statements when your app is built for production. This removes any accidental data leaks from debug messages. Also make sure source maps are turned off so nobody can read your original source code in the browser.

**THE FIX:**
```
// vite.config.ts — strip debug logs in production
export default defineConfig({
  esbuild: {
    drop: ['debugger'],
    pure: ['console.log', 'console.info',
           'console.debug', 'console.trace']
  }
});
```

*Do NOT use drop: ['console'] — it kills console.error and console.warn too, silently swallowing all errors in production. Users see blank screens with zero feedback. Use pure: [...] for selective removal — it strips debug messages but keeps console.error and console.warn alive for real errors.*

### 1.9. Iframe Embedding Security

**THE RISK:** If your app embeds other apps in iframes without sandbox restrictions, those embedded apps get full access to your page. They can read cookies, modify the DOM, or navigate your page. Without postMessage origin validation, any page can send messages that your app trusts.

**THE SOLUTION:** When you embed another app inside your page using an iframe, add a sandbox attribute that restricts what it can do — like allowing scripts but blocking access to your cookies or navigation. Also, when receiving messages from embedded apps, always check where the message came from before acting on it.

**THE FIX:**
```
// Restrict iframe capabilities
<iframe
  src="https://embedded-app.com"
  sandbox="allow-same-origin allow-scripts
           allow-popups allow-forms"
  allow="clipboard-read; clipboard-write"
/>

// Always validate postMessage origin
window.addEventListener('message', (e) => {
  if (!ALLOWED_ORIGINS.includes(e.origin)) return;
  // handle message
});
```

*Never grant allow-popups-to-escape-sandbox or camera/microphone unless explicitly needed.*

### 1.10. URL Parameter Validation

**THE RISK:** If your React app uses URL parameters in fetch() calls or navigation without validation, an attacker can craft URLs with path traversal sequences (../../etc/passwd) or inject arbitrary paths. The browser will send whatever you put in the fetch URL.

**THE SOLUTION:** Before using any value from the URL (like a page slug or ID), check that it only contains safe characters — letters, numbers, and hyphens. If it contains anything unexpected like dots, slashes, or special characters, reject it immediately. This prevents attackers from tricking your app into loading unintended files.

**THE FIX:**
```
const { slug } = useParams();
// Only allow alphanumeric and hyphens
if (!slug || !/^[a-z0-9-]+$/i.test(slug)) return <NotFound />;
fetch(`/api/pages/${slug}`); // endpoint path is illustrative
```

### 1.11. DOMPurify for Dynamic HTML

**THE RISK:** Using dangerouslySetInnerHTML without sanitization is the #1 XSS vector in React apps. If you render HTML from blog posts, API responses, or markdown-to-HTML conversion, an attacker can inject <script> tags, <img onerror=> handlers, or javascript: links that execute in the user's browser.

**THE SOLUTION:** Use a library called DOMPurify to clean any HTML before displaying it. DOMPurify strips out dangerous elements like script tags and event handlers, keeping only safe content like paragraphs, headings, links, and images. You can also specify exactly which HTML tags and attributes are allowed.

**THE FIX:**
```
import DOMPurify from 'dompurify';

// Sanitize before rendering
<div dangerouslySetInnerHTML={{
  __html: DOMPurify.sanitize(htmlContent)
}} />

// Or restrict to specific tags
DOMPurify.sanitize(html, {
  ALLOWED_TAGS: ['p','br','strong','em','a',
                 'ul','ol','li','h1','h2','h3',
                 'code','pre','img'],
  ALLOWED_ATTR: ['href','src','alt','class'],
});
```

### 1.12. Auth0 Configuration

**THE RISK:** Misconfigured Auth0 can lead to session fixation, stale tokens, or users getting stuck in login loops. Common mistakes: not using refresh tokens (sessions expire abruptly), not handling access_denied errors (infinite redirects), and not clearing stale auth state on retry.

**THE SOLUTION:** Configure your Auth0 login to use refresh tokens so sessions don't expire unexpectedly. Store the login state in localStorage so it survives page refreshes. When a user retries login, clear any leftover state first to avoid getting stuck in a loop. Always force a fresh login prompt so users can't accidentally reuse an old session.

**THE FIX:**
```
// Auth0Provider configuration
<Auth0Provider
  domain={import.meta.env.VITE_AUTH0_DOMAIN}
  clientId={import.meta.env.VITE_AUTH0_CLIENT_ID}
  cacheLocation="localstorage"
  useRefreshTokens={true}
>

// Force fresh login
loginWithPopup({ authorizationParams: { prompt: 'login' } });

// On retry — clear stale state
localStorage.removeItem('auth0...');
sessionStorage.clear();
```

### 1.13. JWT Validation in Serverless

**THE RISK:** If your Vercel serverless function accepts a JWT token but doesn't validate it, anyone can forge a token and access protected endpoints. The token must be verified against Auth0's public keys, with issuer and audience checked. Without this, authentication is just theater.

**THE SOLUTION:** When your serverless function receives a login token, verify it against Auth0's public keys to confirm it's genuine. Also check that the token was issued by your Auth0 account (issuer) and intended for your app (audience). If any check fails, reject the request. This ensures nobody can create a fake token to access your protected endpoints.

**THE FIX:**
```
import { jwtVerify, createRemoteJWKSet } from 'jose';

const JWKS = createRemoteJWKSet(
  new URL(`https://${process.env.AUTH0_DOMAIN}/.well-known/jwks.json`)
);

const token = req.headers.authorization?.replace('Bearer ', '');
if (!token) return res.status(401).json({ error: 'Missing token' });

// Verifies signature against Auth0's public keys, plus issuer and audience
const { payload } = await jwtVerify(token, JWKS, {
  issuer: `https://${process.env.AUTH0_DOMAIN}/`,
  audience: process.env.AUTH0_AUDIENCE,
});
```

*AUTH0_DOMAIN and AUTH0_AUDIENCE must be non-VITE_ server-side env vars. Real-world learning: during a security audit, one endpoint was found that checked for the Bearer prefix in the Authorization header but never actually verified the JWT signature. The code did `if (!token) return unauthorized()` — confirming a token string existed — but skipped the jwtVerify() call for certain request paths (like cron jobs and webhooks that shared the same handler). An attacker could send `Authorization: Bearer anything` and bypass auth entirely on those paths. Lesson: every code path that accepts a Bearer token must call jwtVerify() — checking that the header exists is not the same as validating the token. Audit every conditional branch in multi-purpose handlers (endpoints that serve cron, webhook, AND user requests).*

### 1.14. PII Cleanup in Server Logs

**THE RISK:** Console.log statements in serverless functions persist in Vercel's log dashboard. If you log email addresses, full request bodies, or API response data, that personal information sits in plain text in your hosting provider's logs — accessible to anyone with dashboard access and potentially subpoenaed or leaked.

**THE SOLUTION:** Only log operational information like which endpoint was called, the HTTP method, status code, and how long it took. Never log personal data like email addresses, phone numbers, or full request bodies. If you add debug logging temporarily during development, remove it before deploying to production.

**THE FIX:**
```
// WRONG — leaks PII
console.log('User:', user.email, 'Body:', JSON.stringify(req.body));

// RIGHT — log only operational data
console.log('Request received:', {
  endpoint: req.url,
  method: req.method,
  status: 200,
  ms: Date.now() - start
});
```

*If you add debug logging temporarily, remove it before committing.*
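One way to make the "operational data only" rule harder to violate is a small redaction helper — a sketch, with an illustrative name and field list:

```
// Hypothetical helper: strip known PII fields before anything reaches console.log
const PII_FIELDS = new Set(['email', 'phone', 'name', 'address', 'body']);

function redactForLog(obj) {
  return Object.fromEntries(
    Object.entries(obj).map(([key, value]) =>
      PII_FIELDS.has(key) ? [key, '[REDACTED]'] : [key, value]
    )
  );
}

console.log(redactForLog({ endpoint: '/api/feedback', email: 'a@b.com' }));
// → { endpoint: '/api/feedback', email: '[REDACTED]' }
```

Routing all serverless logging through one such helper turns "remember not to log PII" into a default.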

### 1.15. Cookie Auth Hardening

**THE RISK:** Cookies without proper attributes are vulnerable to theft and misuse. Without HttpOnly, any XSS can steal the cookie via JavaScript. Without Secure, the cookie is sent over plain HTTP. Without SameSite, the cookie is sent on cross-site requests (CSRF). Never store auth tokens in localStorage — any XSS can read them.

**THE SOLUTION:** When setting a cookie, add three protective flags: HttpOnly (JavaScript cannot read it), Secure (only sent over HTTPS), and SameSite (not sent when requests come from other websites). Also set an expiry time so the cookie doesn't last forever. Never store login tokens in localStorage — cookies with these flags are much safer.

**THE FIX:**
```
// sessionToken is whatever value your auth flow issues
res.setHeader('Set-Cookie', [
  `session=${sessionToken}; HttpOnly; Secure; SameSite=Lax; Path=/; Max-Age=86400`
```

*SameSite=Lax is the sensible default for most apps. With Strict, the cookie is NOT sent when users click links TO your site from email, Slack, or search results — they appear logged out on first click and must navigate/refresh. Lax sends the cookie on top-level navigations (clicking links) but blocks cross-site POST (CSRF protection). Use Strict only for high-security scenarios (banking, admin-only tools) where the UX tradeoff is acceptable.*
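The attribute set above can be centralized in one helper so no endpoint forgets a flag — a sketch with an illustrative name:

```
// Hypothetical helper: every cookie gets the full hardening set by default
function buildSessionCookie(name, value, { maxAge = 86400, sameSite = 'Lax' } = {}) {
  return [
    `${name}=${encodeURIComponent(value)}`,
    'HttpOnly',
    'Secure',
    `SameSite=${sameSite}`,
    'Path=/',
    `Max-Age=${maxAge}`,
  ].join('; ');
}

console.log(buildSessionCookie('session', 'abc123'));
// → session=abc123; HttpOnly; Secure; SameSite=Lax; Path=/; Max-Age=86400
```

Pass `{ sameSite: 'Strict' }` only for the high-security cases discussed above.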

### 1.16. Backend URL Exposure in Frontend

**THE RISK:** After migrating to a serverless proxy architecture, leftover backend URLs in your frontend code reveal your infrastructure. Even unused URLs are visible in DevTools or bundled JavaScript. Attackers use this to map your endpoints, probe for vulnerabilities, and replay requests with captured headers.

**THE SOLUTION:** After any architecture change (like switching from direct API calls to a serverless proxy), search your frontend code for old backend URLs and remove them. Your frontend should only reference its own domain — the serverless proxy handles forwarding to the actual backend. Also check for old config files, unused imports, and commented-out fetch calls.

**THE FIX:**
```
// WRONG — exposes backend infrastructure
const API_URL = "https://api.mybackend.coolify.app/query";

// RIGHT — calls own domain, proxy forwards
const API_URL = "/api/query";

// After architecture changes, search for leaked URLs:
// grep -rE "coolify|hetzner|backend" src/
```

*Backend URLs in vite.config.ts (dev server proxy config) are NOT bundled into frontend JS — Vite strips configureServer() during build. But they ARE committed to git. For public repos, use env vars for backend URLs in vite.config.ts too. For private repos, this is acceptable. BUNDLE AUDIT: Don't just grep source files — build and grep the PRODUCTION bundle (the minified JS output). Bundlers can inline values from config files, env vars, or imported constants that look clean in source but appear in the final output. After any architecture change: npm run build, then grep the output JS for infrastructure names (AWS, RDS, Hetzner, Coolify, internal IPs, backend subdomains). Source-level cleanup alone gives false confidence.*

### 1.17. Self-Host Critical CDN Assets

**THE RISK:** Loading JavaScript, WASM binaries, or CSS from external CDNs (sql.js.org, cdnjs, unpkg) means a compromised CDN serves malicious code to all your users. The code executes in your domain's security context with full access to cookies, localStorage, and the DOM. This is a supply chain attack that bypasses all your other security — CSP allows the CDN domain, so the malicious script runs unchallenged.

**THE SOLUTION:** Copy critical assets (WASM binaries, JS libraries, fonts) to your own public/ folder and reference them locally. This eliminates the external dependency entirely — a CDN compromise cannot affect you. If self-hosting isn't practical (large files, frequent updates), use Subresource Integrity (SRI) hashes on the script/link tag. SRI tells the browser: only execute this file if its hash matches what I expect. Any tampering causes the browser to reject it.

**THE FIX:**
```
// WRONG — external CDN dependency
<script src="https://cdn.example.com/sql-wasm.js"></script>

// RIGHT — self-hosted copy
// 1. Download: curl -o public/sql-wasm.js https://cdn.example.com/sql-wasm.js
// 2. Reference locally:
<script src="/sql-wasm.js"></script>

// ALTERNATIVE — SRI hash (if self-hosting isn't practical)
<script src="https://cdn.example.com/lib.js"
  integrity="sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8w"
  crossorigin="anonymous"></script>

// Generate SRI hash:
// cat file.js | openssl dgst -sha384 -binary | openssl base64 -A
```

*After self-hosting a CDN asset, remember to remove the old CDN domain from your CSP connect-src/script-src directives (see 1.2 maintenance note). Stale CSP entries are harmless but indicate incomplete cleanup.*

### 1.18. Vite Dev Server Network Exposure

**THE RISK:** Setting server.host = true in vite.config.ts (or using --host flag) binds the dev server to 0.0.0.0, exposing it to everyone on your network. Combined with dev auth stubs that bypass login (always return authenticated), anyone on the same WiFi — coffee shops, hotel networks, coworking spaces — can access your app including admin panels.

**THE SOLUTION:** Only use host: true when you actively need to test on a mobile device on your local network. Remove it when you're done. If your dev environment stubs out authentication (common for faster development), the combination of network exposure + no auth means your app is wide open to anyone nearby. Default to host: false (localhost only) which is safe even on shared networks.

**THE FIX:**
```
// vite.config.ts
export default defineConfig({
  server: {
    // host: true,  // ONLY enable for mobile testing
    host: false,    // default — localhost only (safe)
  }
});

// If you MUST use host: true on shared networks:
// 1. Don't stub auth in dev (use real Auth0 dev tenant)
// 2. Or add a simple dev password middleware
// 3. Disable when done — don't leave it as default
```

*This is especially dangerous when combined with dev auth stubs (if (isDev) return { ok: true }). The fix is simple: default to host: false, enable only when needed.*

### 1.19. Serverless Proxy Architecture

**THE RISK:** When your frontend calls a backend directly (fetch("https://api.mybackend.com/query")), you expose your backend URL to anyone who opens DevTools. Attackers can bypass your frontend entirely — calling your backend directly, replaying requests, probing for vulnerabilities, and brute-forcing endpoints. Any API key or secret needed for that call must be in the frontend code (or absent), and you have no server-side layer to add authentication, rate limiting, input validation, or logging before the request reaches your backend.

**THE SOLUTION:** Route all frontend API calls through serverless functions on your own domain (e.g., Vercel API routes at /api/query). The frontend only ever calls its own domain. The serverless function receives the request, adds the real API key from environment variables, validates input, applies rate limiting (see 1.4), and forwards to the actual backend. The backend URL and API key never reach the browser. This gives you a server-side security layer (auth, rate limiting, input validation, logging, error sanitization) without running your own server.

This pattern works for calling your own backends AND third-party APIs — the serverless function can call any external API server-side, adding your API keys and handling auth without exposing anything to the browser. It also solves CORS issues: many third-party APIs don't set CORS headers, so browsers block direct calls from frontend JavaScript. Since serverless functions run server-side, there are no CORS restrictions — the function calls the API freely and returns the result to your frontend on your own domain.

Vercel free tier allows up to 300 seconds (5 minutes) per function invocation, which covers most API calls. For operations that exceed this — large file uploads, heavy database imports, long-running data processing — use one-time tokens or signed URLs (see 10.6): the serverless function issues a short-lived token, and the frontend uses it to call the backend directly for that single operation.

**THE FIX:**
```
// WRONG — frontend calls backend directly
const resp = await fetch("https://api.mybackend.com/query", {
  headers: { "X-API-Key": "sk-secret-123" }  // exposed in browser
});

// RIGHT — frontend calls own domain
const resp = await fetch("/api/query", {
  method: "POST",
  body: JSON.stringify({ sql: query })
});

// api/query.ts (Vercel serverless function)
import { Redis } from '@upstash/redis';
const redis = new Redis({ url: process.env.UPSTASH_URL, token: process.env.UPSTASH_TOKEN });

export default async function handler(req, res) {
  // 1. Rate limit (see item 1.4)
  const ip = req.headers['x-forwarded-for']?.split(',')[0] || 'unknown';
  const key = `rate:query:${ip}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, 60);  // 30/min example limit
  if (count > 30) return res.status(429).json({ error: 'Too many requests' });

  // 2. Forward to the real backend, adding the server-side API key
  //    (BACKEND_URL / BACKEND_API_KEY are illustrative env var names)
  const resp = await fetch(`${process.env.BACKEND_URL}/query`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': process.env.BACKEND_API_KEY,  // never reaches the browser
    },
    body: JSON.stringify(req.body),
  });
  res.status(resp.status).json(await resp.json());
}
```

*This is the foundational security pattern for frontend apps. Items 1.3 (CORS), 1.4 (rate limiting), 1.13 (JWT validation), 1.14 (PII logging), and 9.3 (API keys never in browser) all build on this architecture. Vercel free tier: 300s timeout, 4.5MB payload limit. For large uploads exceeding these limits, use the one-time token pattern from item 9.6 — the serverless function issues a short-lived token, the frontend calls the backend directly with that token for the single operation. The serverless function is also where you add Upstash rate limiting (1.4), CORS origin checking (1.3), authentication verification (1.13), and error sanitization (2.6) — all server-side, all invisible to the browser.*

### 1.20. Cloudflare Turnstile (Invisible Bot Protection)

**THE RISK:** ADDITIONAL HARDENING — This is an extra layer of protection beyond what most apps implement. Standard rate limiting (items 1.4, 5.2) and authentication (section 9) already provide solid security for most use cases. Turnstile adds value specifically for high-security apps, admin dashboards, or apps where you want to make absolutely sure no automated tool can interact with your sensitive endpoints. If your app already has rate limiting + authentication and you're not dealing with targeted attacks, you can skip this.

Your login page, access code form, or any sensitive endpoint can be hit directly by scripts — someone writes a Python script or uses curl to send thousands of POST requests to your API, trying different passwords or access codes. They never open a browser, never see your UI. Your rate limiting helps, but a clever attacker can slow down their requests to stay under the limit and still brute-force their way in over time. Browser-based protections like Cloudflare's Browser Integrity Check don't help here because the attacker isn't using a browser at all.

**THE SOLUTION:** Add Cloudflare Turnstile to your sensitive forms. Turnstile is an invisible captcha — your users never see a puzzle, never click "I'm not a robot", nothing changes in their experience. Behind the scenes, when your page loads, Cloudflare's script silently checks if the visitor is using a real browser (it looks at JavaScript execution, browser APIs, mouse behavior, and dozens of other signals). If it's a real browser, it generates a one-time cryptographic token. Your frontend sends this token along with the form data. Your server then calls Cloudflare's verification API to confirm the token is real before processing anything.

The key thing to understand: this is completely invisible to real users. They fill in the form and click submit — same as before. But a script running curl or Python requests can't generate a valid Turnstile token because it's not a real browser. So the server rejects the request before it even looks at the password or access code. The attacker gets "Bot verification failed" and your actual security logic (code checking, IP blocking) is never reached.

**THE FIX:**
```
// 1. Add Turnstile script to your HTML <head>
<script src="https://challenges.cloudflare.com/turnstile/v0/api.js?render=explicit"
  async defer></script>

// 2. Create a Turnstile widget via Cloudflare API or dashboard
//    Mode: "invisible" (no UI shown to users)
//    Domains: your production domain + localhost for testing
//    You get back: a Site Key (public) and Secret Key (private)

// 3. Frontend — render invisible widget + send token with form
const widgetId = window.turnstile.render(container, {
  sitekey: 'YOUR_SITE_KEY',
  callback: (token) => { turnstileToken = token; },
  size: 'invisible',
});

// Include token when submitting the form
fetch('/api/your-endpoint', {
  method: 'POST',
  body: JSON.stringify({
    code: accessCode,
    turnstileToken: turnstileToken  // <-- add this
  })
});

// 4. Server — verify token with Cloudflare BEFORE processing
const resp = await fetch(
  'https://challenges.cloudflare.com/turnstile/v0/siteverify',
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      secret: process.env.TURNSTILE_SECRET_KEY,
      response: turnstileToken,
      remoteip: clientIp,
    })
  }
);
const data = await resp.json();
if (!data.success) {
  return res.status(403).json({ error: 'Bot verification failed' });
}
// Only NOW check the actual password/code/form data
```

*Turnstile has three modes: Managed (shows a checkbox only when suspicious), Invisible (never shows anything), and Non-Interactive (shows a loading spinner briefly). For login forms and access gates, Invisible is the best choice — zero friction for users, full protection against scripts. The token expires after a few minutes, so if a user leaves the page open for a while and then submits, the token may be stale. Handle this by resetting the widget on error. Free tier: unlimited verifications, up to 1M siteverify calls per month. Create widgets via the Cloudflare dashboard (Turnstile section) or via their API.*
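The stale-token reset mentioned above can be handled at submit time — a sketch, where widgetId is the handle returned by the earlier turnstile.render call and the endpoint/error shape mirror the fix above:

```
// On rejection, assume the token may have expired and mint a fresh one
const resp = await fetch('/api/your-endpoint', {
  method: 'POST',
  body: JSON.stringify({ code: accessCode, turnstileToken }),
});
if (resp.status === 403) {
  window.turnstile.reset(widgetId); // re-runs the challenge; callback delivers a new token
}
```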

### 1.21. Application-Level IP Blocking

**THE RISK:** ADDITIONAL HARDENING — This goes beyond standard security practices. Most apps handle failed logins with temporary lockouts (item 1.5) and rely on Cloudflare WAF for IP-level blocking (item 5.2). Application-level permanent blocking is useful when you want zero tolerance for unauthorized access attempts — typically for private admin tools, internal dashboards, or single-user systems where any wrong attempt is suspicious by definition. For consumer-facing apps with thousands of users who might genuinely mistype their password, this would be too aggressive.

Cloudflare's WAF rate limiting (item 5.2) blocks IPs that send too many requests in a short window — but the block is temporary (typically 10 seconds to 10 minutes). fail2ban (item 8.2) works at the SSH level, not at the application level. Neither of these can permanently block an IP based on application-specific logic, like "this person entered the wrong access code" or "this account failed login 5 times." You need a way to make blocking decisions inside your application code and have those blocks persist across deploys and server restarts.

**THE SOLUTION:** Use a fast key-value store like Upstash Redis to maintain a blocklist of IP addresses. When something suspicious happens — a wrong access code, too many failed logins, a detected attack pattern — write the IP to Redis with a key like "blocked:{ip}". On every page load, check Redis first: if the IP is blocked, show a "you're blocked" screen immediately without even rendering the login form. This all happens in milliseconds because Redis is fast.

The blocking can be permanent (stays until an admin removes it) or time-limited (set a Redis expiry of 24 hours, 7 days, etc.). You decide based on your use case. For a private admin tool, permanent blocking on the first wrong attempt makes sense — any wrong guess is suspicious by definition. For a consumer app, you'd want temporary blocks after multiple failures (see item 1.5). The key difference from Cloudflare or fail2ban: this runs inside your application logic, so you can block based on business rules, not just request volume.

**THE FIX:**
```
// Redis setup (Upstash — works in serverless)
import { Redis } from '@upstash/redis';
const redis = new Redis({
  url: process.env.KV_REST_API_URL,
  token: process.env.KV_REST_API_TOKEN,
});

const BLOCKED_KEY = (ip) => `blocked:${ip}`;

// Block an IP — permanent by default, or pass a TTL in seconds
async function blockIp(ip, ttlSeconds) {
  if (ttlSeconds) await redis.set(BLOCKED_KEY(ip), '1', { ex: ttlSeconds });
  else await redis.set(BLOCKED_KEY(ip), '1');
}

// Check on every page load, before rendering the login form
async function isBlocked(ip) {
  return (await redis.get(BLOCKED_KEY(ip))) !== null;
}
```

*Important UX detail: while the block-check API call is in-flight, show a blank screen (not the login form). Otherwise, blocked visitors see the login page flash for a split second before the block screen appears — this looks unprofessional and tells attackers "there's a login form here, I just need to get unblocked." Also: always get the real visitor IP from the cf-connecting-ip header when behind Cloudflare (see item 2.8). The default request IP will be Cloudflare's proxy IP, not the actual visitor. For the admin unblock endpoint, protect it with a separate admin key — don't reuse your main API key.*

### 1.22. Blocking Direct Access via .vercel.app URLs

**THE RISK:** ADDITIONAL HARDENING — This matters specifically for apps hosted on Vercel behind Cloudflare. If your app isn't on Vercel, or isn't behind Cloudflare, you can skip this.

Every app deployed on Vercel gets a public .vercel.app URL in addition to your custom domain. For example, if your custom domain is myapp.com (which goes through Cloudflare), Vercel also creates something like myapp-abc123-yourteam.vercel.app. This second URL goes directly to Vercel, completely bypassing Cloudflare. That means all the Cloudflare protections you set up — WAF rate limiting (item 5.2), Browser Integrity Check (item 5.3), Bot Fight Mode, security level challenges — none of them apply when someone uses the .vercel.app URL. They're talking directly to your server with no bouncer at the door.

Anyone can discover this URL. It appears in your Vercel dashboard, in deployment logs, and can sometimes be found through DNS enumeration or by guessing the pattern (project-name.vercel.app). Once they have it, they can hit your API endpoints directly, bypassing your entire Cloudflare defense layer.

**THE SOLUTION:** Enable Standard Protection in your Vercel project settings. This is available on ALL plans, including Hobby (free). It blocks all .vercel.app URLs at Vercel's infrastructure level — visitors hitting the .vercel.app URL get redirected to Vercel's auth page (307), and only your custom domain remains publicly accessible. No code changes needed, no workarounds, no spoofing possible.

To enable: Vercel Dashboard > Your Project > Settings > Deployment Protection > select "Standard Protection". That's it. One toggle, problem solved.

After enabling, .vercel.app URLs return 307 (redirect to Vercel auth). Your custom domain (e.g., myapp.com via Cloudflare) continues to work normally. All your Cloudflare protections (WAF, rate limiting, bot protection) remain in the path since traffic goes through your custom domain.

**THE FIX:**
```
// Enable Standard Protection:
// Vercel Dashboard > Project > Settings > Deployment Protection
// Select "Standard Protection"
//
// Before: .vercel.app URLs → 200 (bypasses Cloudflare entirely)
// After:  .vercel.app URLs → 307 (redirected to Vercel auth page)
//         Custom domain    → works normally through Cloudflare
//
// This is free on all plans including Hobby.
// No code changes needed — it's an infrastructure-level block.
```

*This used to require a Pro plan ($20/month) but is now free on all plans. It's the clean, definitive solution — no redirect hacks in vercel.json, no Host header checks in your code, no spoofing possible. Just enable it and the .vercel.app bypass problem goes away entirely.*

### 1.23. SPA Catch-All Path Blocking

**THE RISK:** ADDITIONAL HARDENING — Discovered during penetration testing. This applies to any Single Page Application (React, Vue, Angular) hosted on Vercel, Netlify, or similar platforms.

SPAs use a catch-all rewrite rule that sends every URL to index.html — that is how client-side routing works. The problem is that this applies to ALL paths, including paths that automated scanners and attackers love to probe: /.env (looking for your secrets file), /.git/config (looking for your git repo), /wp-admin (looking for WordPress admin panels), /phpmyadmin, /xmlrpc.php, and so on.

None of your actual files are exposed — the server just returns your app's HTML page with a 200 OK status. But from the scanner's perspective, it sees 200 OK and thinks it found something. This creates noise in security reports and makes your app look like it might be running WordPress or has exposed secrets, when it doesn't. More importantly, a human attacker probing your site sees 200 OK responses everywhere, which encourages them to dig deeper rather than move on.

**THE SOLUTION:** Add explicit redirect rules that return 404 (Not Found) or 403 (Forbidden) for known-bad paths BEFORE the SPA catch-all rule kicks in. The key insight is that redirects in vercel.json (or _redirects on Netlify) are evaluated before rewrites, so you can intercept these scanner paths first.

You don't need to block every possible bad path — just the most commonly probed ones. The usual suspects are: /.env and variants, /.git/*, /wp-admin, /wp-login.php, /wp-content, /wp-includes, /xmlrpc.php, /phpmyadmin, and /administrator.

Bonus: if you have Cloudflare in front, its WAF may catch some of these first (like /wp-admin returning 403 automatically). The vercel.json rules act as a second layer for paths Cloudflare doesn't catch.

**THE FIX:**
```
// vercel.json — add these BEFORE the SPA catch-all rewrite
// Redirects are evaluated before rewrites, so these take priority
{
  "redirects": [
    { "source": "/.env:path(.*)", "destination": "/", "statusCode": 404 },
    { "source": "/.git/:path(.*)", "destination": "/", "statusCode": 404 },
    { "source": "/wp-admin/:path(.*)", "destination": "/", "statusCode": 404 },
    { "source": "/wp-login.php", "destination": "/", "statusCode": 404 },
    { "source": "/wp-content/:path(.*)", "destination": "/", "statusCode": 404 },
    { "source": "/wp-includes/:path(.*)", "destination": "/", "statusCode": 404 },
    { "source": "/xmlrpc.php", "destination": "/", "statusCode": 404 },
    { "source": "/phpmyadmin/:path(.*)", "destination": "/", "statusCode": 404 },
    { "source": "/administrator/:path(.*)", "destination": "/", "statusCode": 404 }
  ],
  "rewrites": [
    { "source": "/api/(.*)", "destination": "/api/$1" },
    { "source": "/((?!api/).*)", "destination": "/index.html" }
  ]
}

// For Netlify, use _redirects file instead:
// /.env*    /  404
// /.git/*   /  404
// /wp-*     /  404
// /* /index.html 200

// WHAT THIS LOOKS LIKE IN PRACTICE:
// Before: GET /.env → 200 (index.html) — scanner thinks it found secrets
// After:  GET /.env → 404 — scanner moves on, nothing to see
// Before: GET /.git/config → 200 — scanner thinks repo is exposed
// After:  GET /.git/config → 404 — correctly signals "not here"
// Normal SPA routes still work: GET /dashboard → 200 (index.html)
```

*This is a cosmetic hardening measure — no actual files are leaked even without these rules, because the SPA catch-all only returns your app's HTML page, never real file contents. The value is reducing noise in security scans, discouraging manual probing, and presenting a cleaner security posture. If Cloudflare WAF is in front of your app, it may catch some of these paths first (returning 403 instead of your 404), which is equally good — the result is the same: the scanner sees a rejection, not a 200 OK.*

### 1.24. waitUntil for Reliable Background Work in Serverless

**THE RISK:** Serverless functions on Vercel (and similar platforms) terminate immediately after the response is sent. Any "fire-and-forget" work — logging to an external service, sending analytics, writing audit trails — that runs as an unawaited Promise will be killed mid-flight when the function exits. The platform doesn't wait for background Promises to resolve.

Real-world impact: during a security audit, centralized logging was found to be dropping 83% of log entries. The serverless function sent the response to the client, then tried to POST the log event to an external logging service — but the function was killed before that fetch() completed. Out of 6 test requests, only 1 was actually logged. This created a massive security blind spot: attacks, errors, and access patterns were invisible because the logs never arrived.

This is particularly dangerous for security-critical logging (access attempts, blocked IPs, auth failures) where missing logs means missing evidence of an ongoing attack.

**THE SOLUTION:** Use the waitUntil() function from @vercel/functions to tell the platform: "I've sent the response, but keep the function alive until this Promise resolves." The function continues running in the background until the logging fetch() completes, then shuts down. This is the official Vercel mechanism for background work after response.

The pattern: instead of just calling fetch() and hoping it completes, wrap it in waitUntil(). The function sends the response immediately (no latency impact for the user) but stays alive long enough for the background work to finish.

After deploying this fix, the same 6-request test showed 100% log capture — all 6 entries appeared in the database. The fix is a single function call wrapping existing code.

**THE FIX:**
```
// WRONG — fire-and-forget drops ~83% of logs
function logEvent(data) {
  // This Promise is NOT awaited — function may exit before it resolves
  fetch('https://logger.example.com/log', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(data)
  }).catch(() => {});
}

export default async function handler(req, res) {
  logEvent({ endpoint: '/api/data', status: 200 });
  return res.json({ data: 'response' });
  // Function exits HERE — logEvent fetch is killed mid-flight
}

// RIGHT — waitUntil keeps function alive for background work
import { waitUntil } from '@vercel/functions';

function logEvent(data) {
  const logPromise = fetch('https://logger.example.com/log', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(data)
  }).catch(() => {});

  // Tell Vercel: keep function alive until this resolves
  waitUntil(logPromise);
}

export default async function handler(req, res) {
  logEvent({ endpoint: '/api/data', status: 200 });
  return res.json({ data: 'response' });
  // Function sends response but stays alive for logPromise
}

// RESULT: 83% drop rate → 0% drop rate (100% log capture)
```

*waitUntil() is imported from @vercel/functions (install: npm i @vercel/functions). It works in both serverless functions and Edge Middleware. The background work does NOT delay the response — the user gets their response immediately, and the function continues running in the background. There is no extra cost for the background execution time on Vercel's Hobby plan. This pattern applies to any background work after response: logging, analytics, cache warming, webhook notifications, cleanup tasks. If you're using fire-and-forget fetch() calls anywhere in your serverless functions, they are almost certainly being dropped silently — wrap them in waitUntil().*

### 1.25. RPC API Surface Obfuscation

**THE RISK:** In a typical serverless architecture, each API endpoint has its own URL (/api/dashboard, /api/aws-costs, /api/duckdb). These URLs are visible in the frontend JavaScript bundle — anyone can open DevTools, read the minified JS, and build a complete map of your API surface. Even with authentication, exposed endpoint names reveal what services you use, what infrastructure you run, and which endpoints to probe for vulnerabilities. For portfolio or showcase apps published on social media, this is especially problematic — people WILL inspect the bundle.

**THE SOLUTION:** Route all API calls through a single POST endpoint (like /api/rpc) using numeric action codes. The frontend sends { a: 2, p: { type: "status" } } instead of GET /api/aws-costs?type=status. The server-side mapping from numbers to handler functions never reaches the browser. Move handler files to a directory that the hosting platform ignores (Vercel skips underscore-prefixed directories like api/_h/). After migration, delete the old endpoint files — they remain reachable even if no frontend code calls them.

The result: anyone inspecting the bundle sees only /api/rpc and a set of meaningless numbers. They can't tell what action 2 does without access to the server code.

**THE FIX:**
```
// Frontend: rpc helper (only this file knows the URL)
const RPC_URL = '/api/rpc';

export async function rpc(token, action, params, body, method) {
  return fetch(RPC_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${token}`,
    },
    // Wire format from the text: { a: <numeric action>, p: <params> }.
    // The "b" and "m" keys for body/method are illustrative names.
    body: JSON.stringify({ a: action, p: params, b: body, m: method }),
  }).then((r) => r.json());
}
```

*After migrating to RPC, verify the bundle is clean: build and grep the minified JS for old endpoint names. React Router paths (/aws-costs as a page URL) and UI labels ("Cloudflare Analytics" as a heading) will still be visible — that's fine, they're disconnected from API endpoints. File uploads can't go through a JSON RPC endpoint — use the one-time token pattern (item 10.6): get an upload token via RPC, then upload directly to the backend server. Similarly, file downloads via <a href> links can't send JWT headers — use programmatic fetch + blob URL instead (see item 10.7 note).*

### 1.26. Noscript Fallback for SPA

**THE RISK:** Single Page Applications render entirely in JavaScript. Bots, scrapers, and users with JavaScript disabled see a blank page. Search engines may index nothing. Security scanners may report the site as empty or broken, triggering false alarms.

**THE SOLUTION:** Add a <noscript> block in your index.html that displays key information about the site — what it is, what it does, and a link to any static/server-rendered version. This gives bots and no-JS users something meaningful instead of a blank page.

**THE FIX:**
```
<!-- index.html — inside <body>, after the root div -->
<noscript>
  <div style="padding: 2rem; font-family: sans-serif;">
    <h1>My App — Description</h1>
    <p>This application requires JavaScript to run.</p>
    <p>Key features: ...</p>
    <p>Contact: email@example.com</p>
  </div>
</noscript>
```

*This is a defense-in-depth measure, not a replacement for proper SSR or pre-rendering. If SEO is critical, consider server-side rendering or static generation for key pages.*

### 1.27. Global Daily Rate Limit

**THE RISK:** Per-IP rate limiting stops a single attacker but not a distributed attack. If 100 different IPs each send 10 submissions, they all stay under per-IP limits while flooding your system with 1,000 entries. Feedback forms, email endpoints, and file upload APIs are especially vulnerable.

**THE SOLUTION:** Add a global daily counter (across all IPs) for sensitive submission endpoints. When the total exceeds a threshold (e.g., 200 submissions per day), reject further requests and send an alert email to the site owner. This catches distributed attacks that per-IP limits miss.

**THE FIX:**
```
// Serverless function — Upstash Redis global counter
const globalKey = `global:${new Date().toISOString().slice(0, 10)}`;
const count = await redis.incr(globalKey);
if (count === 1) await redis.expire(globalKey, 86400); // first hit of the day
if (count > 200) {  // daily threshold — tune to your actual traffic
  if (count === 201) await sendAlertEmail(); // alert once, on first breach
  return res.status(429).json({ error: 'Daily limit reached' });
}
```

*The daily cap should be generous enough for normal usage but low enough to catch abuse. Monitor actual daily volumes for a week before setting the threshold. The alert email on first breach is critical — you want to know about distributed attacks immediately.*

### 1.28. HTML Entity Escaping in Email Bodies

**THE RISK:** When user-supplied content (feedback form text, names, URLs) is inserted directly into an HTML email template, an attacker can inject HTML or JavaScript. If the recipient's email client renders HTML, this can lead to phishing links, invisible tracking pixels, or layout manipulation.

**THE SOLUTION:** Escape all user-supplied content before inserting it into HTML email templates. Replace &, <, >, and " with their HTML entity equivalents. This is a simple string transformation that prevents any HTML injection.

**THE FIX:**
```
// Simple HTML entity escaping function
function escHtml(str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Usage in email template
const emailBody = `
  <p><b>From:</b> ${escHtml(name)} (${escHtml(email)})</p>
  <p>${escHtml(message)}</p>
`;  // name, email, message: the user-supplied form fields
```

*This applies to any system that sends HTML emails with user content — feedback forms, notification emails, admin alerts. Plain-text emails don't need this, but most transactional email services render HTML.*

## 2. FastAPI Backend

### 2.1. CORS Configuration

**THE RISK:** Using allow_origins=["*"] with allow_credentials=True in FastAPI is especially dangerous — it lets any website make authenticated requests to your API using a visitor's cookies. The real question is: does your API use cookies or header-based auth? This determines whether wildcard origins are safe or dangerous.

**THE SOLUTION:** If your API uses header-based auth (X-API-Key or Bearer token) and never reads cookies, allow_origins=["*"] with allow_credentials=False is safe — the auth header is what protects it, not the origin. This is the practical standard for backends with 30+ frontends. If your API uses cookies for auth, you MUST use an explicit origin allowlist. The key rule: allow_credentials=True + wildcard origins is always dangerous.

**THE FIX:**
```
# PATTERN 1: Header-based auth (X-API-Key / Bearer)
# Wildcard is SAFE here — auth header is the real security
app.add_middleware(CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=False,
    allow_methods=["*"],
    allow_headers=["Content-Type", "Authorization", "X-API-Key"],
)

# PATTERN 2: Cookie-based auth (sessions, JWTs in cookies)
# MUST use explicit origin list — cookies are sent automatically
origins = os.getenv("CORS_ORIGINS", "").split(",")
origins = [o.strip() for o in origins if o.strip()]
app.add_middleware(CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,  # sends cookies
    allow_methods=["*"],
    allow_headers=["Content-Type", "Authorization"],
)
```

*Starlette gotcha: allow_origins=["https://*.yourdomain.com"] is treated as a LITERAL string, not a pattern. For subdomain wildcards, use allow_origin_regex=r"https://.*\.yourdomain\.com$" — keep the trailing $ so "https://x.yourdomain.com.attacker.com" can't slip past. DANGER: Never use *.vercel.app, *.netlify.app, or *.railway.app as allowed origins — anyone can deploy a page on these platforms for free and call your API cross-origin.*

### 2.2. Rate Limiting (SlowAPI)

**THE RISK:** Without rate limiting, a single attacker can exhaust your server's CPU, memory, and database connections by sending thousands of requests. This is especially critical for endpoints that run SQL queries or call external APIs — each request consumes real resources.

**THE SOLUTION:** Use a rate limiting library (SlowAPI for FastAPI) that tracks requests per IP address. Set a limit for each endpoint — for example, 30 requests per minute for a data query API. When someone exceeds the limit, they get a "too many requests" response and have to wait.

**THE FIX:**
```
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded

RATE_LIMIT = os.getenv("RATE_LIMIT", "30/minute")

# Use get_client_ip (item 2.8), NOT get_remote_address
# get_remote_address reads proxy IP behind Cloudflare
limiter = Limiter(key_func=get_client_ip)

# CRITICAL: without this, SlowAPI returns an HTML error page
# instead of a JSON 429 response
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/api/endpoint")
@limiter.limit(RATE_LIMIT)
async def endpoint(request: Request):  # request param is REQUIRED
    ...
```

*SILENT FAILURE: If your endpoint signature doesn't include request: Request, the @limiter.limit() decorator does nothing — no rate limiting, no error, no warning. The endpoint works fine, you think it's rate limited, but it isn't. This is the #1 SlowAPI gotcha. Also: without the exception handler, clients get an HTML page instead of a clean 429 JSON.*

### 2.3. Concurrency Limiting

**THE RISK:** Rate limiting caps requests per minute, but an attacker can still send 30 simultaneous requests that all start executing at once. Each might run a heavy SQL query or allocate memory. Without concurrency limits, a burst of parallel requests can crash your server even within rate limits.

**THE SOLUTION:** Set a cap on how many requests from the same IP address can be processing at the same time — for example, 4 concurrent requests per visitor and 10 across all visitors. If someone already has 4 queries running, their 5th request waits or gets rejected. This prevents one person from hogging all your server's resources at once.

**THE FIX:**
```
MAX_PER_IP = int(os.getenv("MAX_CONCURRENT_PER_IP", "4"))
MAX_GLOBAL = int(os.getenv("MAX_CONCURRENT_GLOBAL", "10"))

# Critical: use asyncio.shield() on release
# Without shield, if the ASGI middleware cancels
# the request, the counter never decrements (leaks)
async def release_concurrency(ip):
    try:
        await asyncio.shield(_do_release(ip))
    except asyncio.CancelledError:
        pass  # shield caught it — counter is safe
```

*Load limits from env vars so you can tune without redeploying. The CancelledError catch is required — shield() re-raises it after protecting the inner coroutine. Log ACQUIRED/RELEASED/DENIED events for diagnosing counter leaks in production.*
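
The fix above shows only the shielded release. A minimal in-memory sketch of the full acquire/release pair might look like this — names are illustrative, and a production version would keep the counters in Redis so they survive restarts and are shared across workers:

```python
# In-memory sketch of per-IP + global concurrency caps (illustrative names)
import asyncio
from collections import defaultdict

MAX_PER_IP = 4
MAX_GLOBAL = 10

_per_ip = defaultdict(int)
_global_count = 0
_lock = asyncio.Lock()

async def acquire_concurrency(ip: str) -> bool:
    """Reserve a slot; False means reject the request with a 429."""
    global _global_count
    async with _lock:
        if _per_ip[ip] >= MAX_PER_IP or _global_count >= MAX_GLOBAL:
            return False  # DENIED — log this for diagnosing abuse
        _per_ip[ip] += 1
        _global_count += 1
        return True  # ACQUIRED

async def _do_release(ip: str) -> None:
    global _global_count
    async with _lock:
        _per_ip[ip] = max(0, _per_ip[ip] - 1)
        _global_count = max(0, _global_count - 1)

async def release_concurrency(ip: str) -> None:
    # shield() keeps the decrement alive even if the request is cancelled
    try:
        await asyncio.shield(_do_release(ip))
    except asyncio.CancelledError:
        pass  # RELEASED despite cancellation — counter stays accurate
```

Call acquire_concurrency at the top of the endpoint, and release_concurrency in a finally block so the slot is always returned.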

### 2.4. SQL Validation Stack

**THE RISK:** If your API accepts SQL queries from users (text-to-SQL, data exploration tools), a single malicious query can drop tables, read password hashes, or generate billions of rows to crash your database. Row limits alone don't protect you — CROSS JOIN creates trillion-row intermediate results before LIMIT applies, REPEAT() generates single-cell 95MB responses, and aggregate functions like STRING_AGG collapse 1M rows into one row bypassing LIMIT entirely. This requires a multi-layer defense — no single check is sufficient.

**THE SOLUTION:** Apply multiple layers of checks before running any user-submitted SQL. First, only allow queries that start with SELECT or similar read-only commands. Then block dangerous keywords like DELETE, DROP, or ALTER. Block resource exhaustion functions — CROSS JOIN (CPU bomb via cartesian products), REPEAT (bandwidth bomb via inflated strings), MD5/SHA256/REGEXP_REPLACE (CPU bombs on computed large strings). Add a SQL parser (sqlglot) for structural validation that catches attacks regex cannot see — CTE alias bypass, implicit cartesian products, nested blocked functions. Block data-generating functions, strip out SQL comments (used to bypass filters), hide system tables, and enforce a row limit. No single check is enough — you need all of them together.

**THE FIX:**
```
# Apply ALL of these layers (regex + parser):
1. Prefix allowlist: SELECT, SHOW, DESCRIBE, EXPLAIN, WITH
2. Keyword blocklist: INSERT, UPDATE, DELETE, DROP, ALTER,
   CREATE, TRUNCATE, EXEC, COPY, GRANT, REVOKE, SET...
3. Block GENERATE_SERIES, RANGE, UNNEST (billion-row attacks)
4. Block CROSS JOIN, REPEAT, REGEXP_REPLACE, MD5, SHA256
   (CPU and bandwidth bombs — small output, massive compute)
5. SQL parser structural check (see item 2.4b for details)
   (catches CTE bypass, aliased self-joins, implicit cartesian)
6. Reject queries containing SQL comments (/* or --)
   (reject entirely — don't strip, stripping changes query meaning)
7. Block system catalog: pg_shadow, pg_authid, pg_roles
8. Auto-append LIMIT 1000 if missing
9. Enforce response size limit (e.g. 1MB) — catches STRING_AGG
   and other aggregate dumps that bypass row limits
10. Block computed ORDER BY (prevents function injection)
11. Block subqueries in ORDER BY (O(n^2) attacks)
12. Limit subquery depth (max 3 SELECT keywords)
```

*WHOLE-WORD MATCHING: Keyword blocklist needs word boundaries or you get false positives — a naive substring check flags "OFFSET" because it contains "SET", and "created_at" because it contains "CREATE". Use: if f" {kw} " in f" {upper} " or f" {kw}(" in f" {upper}(" or upper.startswith(kw). MCP GOTCHA: MCP clients send multi-line SQL. Use re.DOTALL in ORDER BY regex or function detection will miss matches across newlines. SUBQUERY ORDER: Check original SQL for ORDER BY subqueries BEFORE stripping content inside parentheses (for LIMIT detection) — stripping parens removes the subqueries you need to catch. REGEX VS PARSER: Regex handles 95% of attacks (fast, simple). But regex cannot understand SQL structure — CTEs, aliases, and subqueries can disguise cartesian products. Use a SQL parser like sqlglot (item 2.4b) as a second structural layer. ROW LIMITS VS RESPONSE SIZE: Auto-append LIMIT protects against SELECT * but not against aggregates (STRING_AGG collapses 1M rows into 1 row) or CROSS JOIN (computation before LIMIT). The response size check is the catch-all.*
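
A minimal Python sketch of a few of these layers — prefix allowlist (layer 1), whole-word keyword blocklist (layer 2), comment rejection (layer 6), and auto-LIMIT (layer 8). Function names are hypothetical, and this is deliberately not the full 12-layer stack:

```python
import re
from typing import Optional

ALLOWED_PREFIXES = ("SELECT", "SHOW", "DESCRIBE", "EXPLAIN", "WITH")
BLOCKED_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER",
                    "CREATE", "TRUNCATE", "EXEC", "COPY", "GRANT",
                    "REVOKE", "SET"}

def validate_sql(sql: str) -> Optional[str]:
    """Return an error message, or None if these layers pass."""
    upper = sql.strip().upper()
    if not upper.startswith(ALLOWED_PREFIXES):       # layer 1
        return "Only read-only queries are allowed"
    if "--" in sql or "/*" in sql:                   # layer 6: reject, don't strip
        return "SQL comments are not allowed"
    # Layer 2 with whole-word matching (punctuation normalized to spaces
    # so "OFFSET" or "created_at" don't trip the blocklist)
    padded = f" {re.sub(r'[(),;]', ' ', upper)} "
    for kw in BLOCKED_KEYWORDS:
        if f" {kw} " in padded:
            return f"Blocked keyword: {kw}"
    return None

def enforce_limit(sql: str, max_rows: int = 1000) -> str:
    """Layer 8: auto-append LIMIT when the query has none."""
    if re.search(r"\bLIMIT\s+\d+", sql, re.IGNORECASE):
        return sql
    return f"{sql.rstrip().rstrip(';')} LIMIT {max_rows}"
```

The remaining layers (function blocklists, parser checks, response size cap) stack on top of this in the same pass/fail style.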

### 2.4b. SQL Parser Structural Validation (sqlglot)

**THE RISK:** Regex-based SQL validation works on raw text — it cannot understand SQL structure. Attackers exploit this gap with CTEs (WITH t AS (...) SELECT FROM t a, t b), table aliases, nested function calls, and implicit joins that look different in text but produce the same dangerous execution plan. Every regex bypass the pen tester found in Round 3 was structural — the SQL was semantically identical to a blocked pattern but syntactically different enough to pass text-based checks.

**THE SOLUTION:** Add a SQL parser (sqlglot) as a structural validation layer on top of your regex checks. The parser builds an Abstract Syntax Tree (AST) and validates the actual query structure: how many tables appear in each SELECT's FROM clause, which functions are called regardless of nesting or aliasing. Keep the regex layer as the first line of defense (fast, catches obvious attacks) and use the parser as a second pass for structural checks that regex cannot handle. sqlglot supports both Postgres and DuckDB dialects and is pure Python with no C dependencies.

**THE FIX:**
```
import sqlglot
from sqlglot import exp as sqlglot_exp

def validate_sql_structure(sql, engine="postgres"):
    dialect = "duckdb" if engine == "duckdb" else "postgres"
    try:
        parsed = sqlglot.parse_one(sql, dialect=dialect)
    except sqlglot.errors.ParseError:
        return "Invalid SQL syntax"

    # Block multiple table sources per SELECT (catches cartesian products,
    # CTE alias bypass, implicit joins via FROM a, b)
    for select in parsed.find_all(sqlglot_exp.Select):
        from_clause = select.find(sqlglot_exp.From)
        if not from_clause:
            continue
        from_tables = list(from_clause.find_all(sqlglot_exp.Table))
        join_tables = []
        for join in select.find_all(sqlglot_exp.Join):
            if join.find_ancestor(sqlglot_exp.Select) is select:
                join_tables.extend(join.find_all(sqlglot_exp.Table))
        if len(from_tables) + len(join_tables) > 1:
            return "Multiple table sources not allowed"

    # Block dangerous functions via AST (catches nested/aliased usage)
    BLOCKED = {"REPEAT", "GENERATE_SERIES", "MD5", "SHA256"}
    for func in parsed.find_all(sqlglot_exp.Func):
        name = func.name.upper() if hasattr(func, 'name') else ""
        if name in BLOCKED:
            return f"Blocked function: {name}"
    return None
```

*HYBRID APPROACH: Do not replace regex with the parser — use both. Regex is fast and catches 95% of attacks. The parser catches the remaining 5% that require structural understanding. GRACEFUL FALLBACK: Wrap the import in try/except so the server still works (regex-only) if sqlglot is not installed. DIALECT MATTERS: sqlglot.parse_one(sql, dialect="duckdb") handles DuckDB-specific syntax that would fail with the default dialect. PER-SELECT CHECK: Count tables per individual SELECT, not globally across the whole query — otherwise legitimate correlated subqueries (WHERE x IN (SELECT y FROM same_table)) get blocked. See the Database MCP Server repo (github.com/amararun/shared-fastapi-database-mcp) for the full working implementation.*
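
The graceful fallback mentioned in the note can be sketched like this — if sqlglot isn't installed, the structural layer degrades to a no-op and the regex layer carries the load alone. Only the parse-error check is shown; the structural table/function checks from the example above would slot in where indicated:

```python
try:
    import sqlglot
    import sqlglot.errors
    SQLGLOT_AVAILABLE = True
except ImportError:
    SQLGLOT_AVAILABLE = False  # regex-only mode — log a warning at startup

def structural_check(sql: str, engine: str = "postgres"):
    """Parser layer; returns an error string or None."""
    if not SQLGLOT_AVAILABLE:
        return None
    dialect = "duckdb" if engine == "duckdb" else "postgres"
    try:
        sqlglot.parse_one(sql, dialect=dialect)
    except sqlglot.errors.ParseError:
        return "Invalid SQL syntax"
    # ...structural table-count and function checks would go here...
    return None
```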

### 2.5. Admin Write Table Whitelist

**THE RISK:** Even with API key authentication, a compromised client or misconfigured admin tool can issue DELETE FROM critical_table. If your endpoint allows write operations through a proxy-to-database pattern, restricting which tables are writable is a last-resort guard.

**THE SOLUTION:** Maintain a list of table names that are allowed for write operations (insert, update, delete). Before running any write query, extract the table name from the SQL and check it against your list. If the table isn't on the list, reject the query. This is a safety net — even if someone has valid credentials, they can only modify approved tables.

**THE FIX:**
```
ALLOWED_TABLES = {"users", "sessions", "audit_log"}

import re
match = re.search(
    r'\b(?:INTO|UPDATE|FROM|DELETE\s+FROM)\s+["\'`]?(\w+)',
    sql, re.IGNORECASE
)
if match and match.group(1).lower() not in ALLOWED_TABLES:
    raise HTTPException(400, detail="Operation not allowed")
```

*Only needed for admin/write endpoints. Read-only APIs should use the full SQL validation stack (2.4) instead. DuckDB gotcha: DuckDB supports INSERT OR REPLACE INTO and INSERT OR IGNORE INTO which the standard INTO regex won't match. Use a broader pattern: /(?:INTO|UPDATE|FROM|DELETE\s+FROM|OR\s+REPLACE\s+INTO|OR\s+IGNORE\s+INTO)\s+(\w+)/. Also ensure table name case in your Set matches the normalized SQL case.*
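
The broader pattern from the note above, as a runnable sketch — the table names are placeholders, and the check fails closed when no write target can be extracted:

```python
import re

ALLOWED_TABLES = {"users", "sessions", "audit_log"}

WRITE_TABLE_RE = re.compile(
    r'\b(?:INTO|UPDATE|FROM|DELETE\s+FROM'
    r'|OR\s+REPLACE\s+INTO|OR\s+IGNORE\s+INTO)\s+["\'`]?(\w+)',
    re.IGNORECASE,
)

def check_write_target(sql: str) -> bool:
    """True only if every extracted table is on the whitelist."""
    tables = WRITE_TABLE_RE.findall(sql)
    # No match at all → reject rather than allow (fail closed)
    return bool(tables) and all(t.lower() in ALLOWED_TABLES for t in tables)
```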

### 2.6. Error Message Sanitization

**THE RISK:** Returning str(e) to clients in error responses leaks internal details: file paths, table names, database versions, library versions. Attackers use this information to map your infrastructure and craft targeted attacks. Health endpoints that return version info are equally dangerous.

**THE SOLUTION:** Never send raw error messages to the user. Log the full error details on your server for debugging, but return only a generic, helpful message to the client like "Query failed — check your SQL syntax." Health check endpoints should return just "status: ok" without version numbers, file paths, or configuration details.

**THE FIX:**
```
# WRONG — leaks internals
raise HTTPException(400, detail=f"Error: {str(e)}")

# RIGHT — log internally, return generic message
logger.error(f"Query error: {e}")
raise HTTPException(400, detail="Query failed. Check SQL syntax.")

# SAFETY NET — global exception handler catches anything
# that slips through individual try/except blocks
@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
    logger.error(f"Unhandled: {request.url.path}: {exc}", exc_info=True)
    return JSONResponse(status_code=500,
        content={"detail": "Internal server error"})

# Health endpoints — only return status
@app.get("/health")
async def health():
    return {"status": "ok"}  # No versions, paths, or config
```

*Serverless proxy gotcha: when forwarding backend responses, wrap response.json() in its own try/catch. Backends behind Cloudflare can return HTML challenge pages (Bot Fight Mode) or HTML error pages with a 200 status, and response.json() throws SyntaxError on non-JSON. Handle it: try { data = await response.json(); } catch { return res.status(502).json({ error: "Invalid backend response" }); }*

*STATUS CODE UNIFORMITY: Beyond sanitizing error messages, make ALL rejection paths return the same HTTP status code AND body. If your endpoint returns 405 for wrong method, 400 for missing params, and 403 for bad auth, an attacker can distinguish these cases and map your internal routing logic. Use a single response (e.g., 403 + {"error":"Forbidden"}) for every rejection regardless of reason. This turns the endpoint into a black box — wrong method, invalid action, bad token all look identical from outside.*

*INPUT ECHO: Never echo user input in error messages (e.g., "Unknown action: ${action}"). Echoed input lets attackers probe for valid values: any input that comes back differently from the others reveals it matched something internally.*
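A minimal sketch of the uniform-rejection idea, with hypothetical method/action/token checks; every failure path returns the identical status and body:

```python
# One rejection response for every failure mode. The status code and body
# are the same whether the method, the action, or the token was wrong,
# so probing the endpoint reveals nothing about internal routing.
REJECTION = (403, {"error": "Forbidden"})

def handle(method: str, action: str, token: str, valid_token: str):
    if method != "POST":
        return REJECTION          # wrong method
    if action not in {"create", "delete"}:
        return REJECTION          # unknown action (never echo it back)
    if token != valid_token:
        return REJECTION          # bad auth
    return (200, {"ok": True})
```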

### 2.7. API Key Authentication

**THE RISK:** Non-public endpoints without authentication are discoverable and callable by anyone. If the API key defaults to an empty string when the environment variable isn't set, every request passes validation — this is a common deployment mistake that silently disables auth.

**THE SOLUTION:** Require an API key in a request header for every non-public endpoint. At startup, check that the API key environment variable is actually set — if it's missing, refuse to start the server. This "fail-secure" approach ensures you never accidentally run without authentication because someone forgot to set the variable during deployment.

**THE FIX:**
```
API_KEY = os.getenv("API_KEY")
if not API_KEY:
    raise RuntimeError("API_KEY not configured")  # fail-secure

import secrets

async def verify_api_key(request: Request):
    key = request.headers.get("X-API-Key", "")
    if not secrets.compare_digest(key, API_KEY):  # constant-time compare
        raise HTTPException(401, detail="Invalid API key")

@app.post("/api/endpoint", dependencies=[Depends(verify_api_key)])
```

*If the env var is not set, REJECT all requests. Never default to empty string. X-API-Key vs Bearer: use X-API-Key for server-to-server calls (simpler, no token lifecycle). Use Bearer tokens (JWT) when you need user identity, token expiry, or OAuth flows. Don't mix both on the same endpoint — pick one auth scheme per endpoint.*

### 2.8. Client IP Extraction

**THE RISK:** Behind Cloudflare or nginx, request.client.host returns the proxy's IP, not the actual user's IP. This means all users share one "IP" for rate limiting — either everyone gets blocked or nobody does. You need to read the correct proxy header to get the real client IP.

**THE SOLUTION:** When your app sits behind Cloudflare or a reverse proxy, the visitor's real IP is passed in a special header (like cf-connecting-ip or x-forwarded-for), not in the standard connection info. Write a helper function that checks these headers in order and extracts the actual IP. Without this, your rate limiting and logging will treat all visitors as one person.

**THE FIX:**
```
def get_client_ip(request: Request) -> str:
    for header in ['cf-connecting-ip',
                   'x-forwarded-for',
                   'x-real-ip']:
        val = request.headers.get(header, '')
        if val:
            return val.split(',')[0].strip()
    return request.client.host if request.client else 'unknown'

# CRITICAL: SlowAPI ignores this unless you override key_func
limiter = Limiter(key_func=get_client_ip)  # NOT get_remote_address
```

*SlowAPI's default get_remote_address reads request.client.host, which is the proxy IP behind Cloudflare. You MUST pass your custom IP extractor as key_func when creating the Limiter. This is a silent failure — rate limiting appears to work but applies to one shared IP instead of per-user.*

### 2.9. Webhook Signature Verification

**THE RISK:** Without signature verification, anyone who discovers your webhook URL can POST fake events — fake payment confirmations, fake user signups, fake booking notifications. Your backend processes them as real, leading to data corruption or unauthorized actions.

**THE SOLUTION:** When receiving a webhook, verify the signature that the sender attached. The sender (Stripe, GitHub, Brevo, etc.) signs each webhook with a shared secret — your server recalculates the signature and compares. If they don't match, the webhook is fake. Also reject webhooks older than 5 minutes to prevent replay attacks where someone resends a captured webhook later.

**THE FIX:**
```
import hmac, hashlib, time

def verify_webhook(payload: bytes, signature: str, secret: str):
    expected = hmac.new(
        secret.encode(), payload, hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature)

raw_body = await request.body()
sig = request.headers.get("x-webhook-signature", "")
if not verify_webhook(raw_body, sig, os.getenv("WEBHOOK_SECRET")):
    raise HTTPException(401, detail="Invalid signature")

# Reject old webhooks (replay protection)
timestamp = int(request.headers.get("x-webhook-timestamp", "0"))
if abs(time.time() - timestamp) > 300:  # 5 minutes
    raise HTTPException(401, detail="Webhook expired")
```

### 2.10. SSRF Protection

**THE RISK:** If any endpoint fetches a URL supplied by the client (proxy endpoints, URL preview generators), attackers can use your server to reach internal services: AWS metadata at 169.254.169.254 (leaks credentials), localhost admin panels, or private network scanning. Your server becomes a weapon against your own infrastructure.

**THE SOLUTION:** Maintain a list of approved domains and URL paths that your server is allowed to fetch. When a user provides a URL, check that the domain and path are on your approved list before making the request. Block any requests to internal addresses like localhost or cloud metadata endpoints. This prevents attackers from using your server as a proxy to reach things they shouldn't.

**THE FIX:**
```
# Python — domain allowlist + redirect validation
import httpx
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"api.example.com", "cdn.example.com"}

def validate_url(url: str):
    parsed = urlparse(url)
    if parsed.scheme not in ("https",):
        raise HTTPException(400, "HTTPS only")
    if parsed.hostname not in ALLOWED_DOMAINS:
        raise HTTPException(400, "Domain not allowed")

validate_url(user_url)  # check initial URL

# CRITICAL: disable auto-redirect, validate each hop
resp = httpx.get(user_url, follow_redirects=False)
while resp.is_redirect:
    redirect_url = str(resp.next_request.url)
    validate_url(redirect_url)  # check EVERY redirect target
    resp = httpx.get(redirect_url, follow_redirects=False)

# Also enforce response size cap
MAX_SIZE = 50 * 1024 * 1024  # 50MB
```

*Most HTTP clients follow redirects by default. An attacker hosts a redirect on an allowed domain that points to 169.254.169.254 (cloud metadata). Also block file:// URIs and IP addresses in hostnames. PREFERRED PATTERN: If your proxy uses structured parameters (not raw paths), use a value-based whitelist — client sends a parameter, server maps to a fixed URL: ALLOWED = ['stocks', 'users']; url = f"{BACKEND}/api/{param}". This is safer because the user never controls the URL path directly.*
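A sketch of both ideas from the note above: the value-based whitelist, plus blocking file:// schemes and IP-literal hostnames for endpoints that must accept raw URLs. The domain and resource names are placeholders:

```python
import ipaddress
from urllib.parse import urlparse

# Value-based whitelist: the client names a resource, the server maps it
# to a fixed URL. User input never forms the path directly.
BACKEND = "https://api.example.com"     # assumed backend base URL
ALLOWED_RESOURCES = {"stocks", "users"}

def build_proxy_url(resource: str) -> str:
    if resource not in ALLOWED_RESOURCES:
        raise ValueError("resource not allowed")
    return f"{BACKEND}/api/{resource}"

def reject_raw_url(url: str) -> None:
    """Extra checks for endpoints that must accept full URLs."""
    parsed = urlparse(url)
    if parsed.scheme != "https":        # also blocks file:// and http://
        raise ValueError("HTTPS only")
    try:
        ipaddress.ip_address(parsed.hostname or "")
    except ValueError:
        return                          # hostname, not an IP literal: OK
    raise ValueError("IP-literal hosts not allowed")
```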

### 2.11. API Monitoring

**THE RISK:** Without request logging, you can't detect attacks, debug failures, or understand usage patterns. When something goes wrong, you're flying blind. Monitoring should log endpoints, status codes, and response times — not request bodies or PII.

**THE SOLUTION:** Add a monitoring middleware that automatically logs every API request — which endpoint was called, whether it succeeded or failed, and how long it took. Send these logs to a centralized dashboard so you can spot unusual patterns (sudden spike in errors, slow queries, unknown endpoints being probed). Only log operational data, never personal information.

**THE FIX:**
```
pip install your-api-monitor  # or any monitoring middleware

# Use try/except — if package isn't installed, app still starts
try:
    from your_api_monitor import APIMonitorMiddleware
    app.add_middleware(APIMonitorMiddleware,
        app_name="YOUR_APP_NAME",
        include_prefixes=("/api/", "/analyze"),
    )
except ImportError:
    pass  # monitoring unavailable, app runs without it

# Env vars: API_MONITOR_URL, API_MONITOR_KEY
```

*Without try/except, a missing package crashes the entire app on startup. This happened on fresh deploys where pip install failed silently. Monitoring should degrade gracefully — a monitoring failure shouldn't take down your API.*

### 2.12. File Upload Validation

**THE RISK:** If your backend accepts file uploads without validation, attackers can upload executable files (.py, .sh, .exe), oversized files that fill your disk, or files disguised with fake extensions (.jpg that is actually a .html with embedded scripts). If uploaded files are served back to users from your domain, an uploaded HTML file executes in your site's security context.

**THE SOLUTION:** Validate every uploaded file on three levels: check the file extension against an allowlist (only accept the types your app actually needs), enforce a maximum file size so nobody fills your disk, and store uploaded files in a location where they cannot be executed by the server. Never trust the browser's Content-Type header — check the actual file contents. Serve downloads with Content-Disposition: attachment so browsers download instead of rendering.

**THE FIX:**
```
import os
ALLOWED_EXTENSIONS = {".csv", ".xlsx", ".parquet", ".duckdb"}
MAX_FILE_SIZE_MB = 50

async def validate_upload(file: UploadFile):
    # 1. Check extension
    ext = os.path.splitext(file.filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise HTTPException(400, "File type not allowed")

    # 2. Check size (read in chunks, don't load all into memory)
    size = 0
    while chunk := await file.read(8192):
        size += len(chunk)
        if size > MAX_FILE_SIZE_MB * 1024 * 1024:
            raise HTTPException(413, "File too large")
    await file.seek(0)  # reset for actual processing

    # 3. Serve with safe headers
    # Content-Disposition: attachment (download, don't render)
    # X-Content-Type-Options: nosniff (don't guess type)
```

*Never store uploads in a publicly accessible folder. Use signed URLs with expiry for download access.*
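To complement the extension and size checks, a hedged sketch of the "check the actual file contents" step: compare the first bytes of the upload against the known signature for the claimed type. The signatures shown cover the binary formats on the allowlist above:

```python
# File signatures (magic bytes) for the allowlisted binary formats.
# xlsx is a ZIP container; parquet files begin (and end) with PAR1.
MAGIC_BYTES = {
    ".xlsx": b"PK\x03\x04",
    ".parquet": b"PAR1",
}

def content_matches_extension(ext: str, head: bytes) -> bool:
    """head = first few bytes of the file (e.g. await file.read(8))."""
    sig = MAGIC_BYTES.get(ext)
    if sig is None:
        return True   # no signature known (e.g. .csv is plain text)
    return head.startswith(sig)
```

An .html file renamed to report.xlsx fails this check even though the extension check passes.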

### 2.13. SSL Certificate Verification in Backend Calls

**THE RISK:** Python's requests.post(..., verify=False) disables SSL certificate checking, enabling man-in-the-middle attacks. An attacker on the network path can intercept and modify traffic between your backend and other services — reading API keys, injecting malicious responses, or stealing data in transit. This is especially common in backends that were set up during development and never fixed for production.

**THE SOLUTION:** Always use verify=True (the default) for external API calls. When the target is behind Cloudflare or has a valid certificate, there is no excuse for verify=False. For Docker-internal calls where services communicate without SSL, use an internal URL variable that explicitly skips SSL only for known-internal addresses. This way, internal calls use Docker networking (no SSL needed) and external calls always verify certificates.

**THE FIX:**
```
# WRONG — disables all SSL verification
resp = requests.post(url, json=data, verify=False)

# RIGHT — verify certificates for external calls
resp = requests.post(url, json=data)  # verify=True is default

# For Docker-internal services (no SSL):
BACKEND_URL = os.getenv("BACKEND_URL")         # external, verified
BACKEND_URL_INTERNAL = os.getenv("BACKEND_URL_INTERNAL")  # Docker alias

async def call_backend(path, data):
    url = BACKEND_URL_INTERNAL or BACKEND_URL
    verify = BACKEND_URL_INTERNAL is None  # skip SSL only for internal
    return requests.post(f"{url}{path}", json=data, verify=verify)
```

*verify=False and empty API_KEY defaults (item 2.7) are correlated — backends with one usually have both. Audit them together. If you see verify=False in code review, check API_KEY handling too.*

### 2.14. Centralized Logging & PII Retention

**THE RISK:** Item 1.14 says "don't log PII in console.log" — that's about careless logging in serverless functions visible to anyone with dashboard access. But for incident response, you NEED request bodies and client IPs. Without them, you can't answer "what happened?" after an attack — you can see that endpoints were called, but not what payloads were sent (SQL injection attempts, credential stuffing patterns, malformed requests from scanning tools). The question isn't whether to capture PII, it's how to capture it safely with controls and auto-deletion.

**THE SOLUTION:** Set up a centralized logging pipeline: every backend sends request metadata (endpoint, status, response time, client IP, request body) to a dedicated logging service, which stores it in a database with a retention policy. Keep identifiable data (IP, request body) for 30 days for incident response, then auto-NULL it via a daily cron job. Keep permanent non-identifiable hashes (SHA256 of request body) for pattern analysis. All backends flow to one dashboard so you can correlate attacks across services — during an incident, you need one place to look, not 30 containers to SSH into.

**THE FIX:**
```
# The pipeline:
# Backend → monitoring middleware → Logger Service → DB → Dashboard
#                                                        ↓
#                                          Daily cron: NULL PII > 30 days

# Each backend: 3 lines + 2 env vars
try:
    from your_api_monitor import APIMonitorMiddleware
    app.add_middleware(APIMonitorMiddleware,
        app_name="APP_NAME",
        include_prefixes=("/api/",),
    )
except ImportError:
    pass
# Env vars: API_MONITOR_URL, API_MONITOR_KEY

# PII cleanup endpoint (called by daily cron)
@app.post("/admin/cleanup-raw-data")
async def cleanup_raw_data():
    retention = int(os.getenv("PII_RETENTION_DAYS", "30"))
    # asyncpg-style $1 placeholder; INTERVAL '... days' inside quotes is
    # not parameterizable, so build the interval server-side instead
    status = await db.execute(
        "UPDATE api_logs SET client_ip = NULL, request_body = NULL "
        "WHERE created_at < NOW() - make_interval(days => $1) "
        "AND client_ip IS NOT NULL", retention
    )
    return {"status": status}  # asyncpg returns e.g. "UPDATE 42"
# request_body_hash stays — count patterns without identifying sender
```

*This is the opposite of item 1.14 and that's intentional. 1.14 is about careless logging (console.log in Vercel). This is deliberate capture with access controls and auto-deletion. The middleware should extract client IP from cf-connecting-ip automatically (ties into item 2.8) — don't rely on individual endpoints to log it. Truncate request bodies to 10KB max to prevent log storage abuse.*
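A sketch of the capture step implied above: truncate the stored body to 10KB and keep a permanent SHA256 hash for pattern analysis. The field names are illustrative:

```python
import hashlib

MAX_BODY_BYTES = 10 * 1024  # cap stored bodies to prevent log storage abuse

def prepare_log_fields(body: bytes) -> dict:
    """Hash the full body (permanent, non-identifying); store a capped copy."""
    return {
        "request_body": body[:MAX_BODY_BYTES],   # NULLed after 30 days
        "request_body_hash": hashlib.sha256(body).hexdigest(),  # kept forever
    }
```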

### 2.15. Old Report / Temp File Cleanup

**THE RISK:** Endpoints that generate reports (PDF, HTML, CSV) create files on disk. Without cleanup, these accumulate indefinitely — filling the disk, slowing the server, and potentially exposing old reports to anyone who guesses the filename pattern.

**THE SOLUTION:** Add cleanup logic that deletes generated files older than a set period (e.g., 2 hours). Run cleanup on each new request (lazy cleanup) or via a scheduled cron job. Use a dedicated output directory that's easy to sweep.

**THE FIX:**
```
import os, time
from pathlib import Path

OUTPUT_DIR = Path("./output")
MAX_AGE_HOURS = 2

def cleanup_old_files():
    """Delete generated files older than MAX_AGE_HOURS"""
    cutoff = time.time() - (MAX_AGE_HOURS * 3600)
    for f in OUTPUT_DIR.glob("*"):
        if f.is_file() and f.stat().st_mtime < cutoff:
            f.unlink()

# Call at the start of each report generation endpoint
@app.post("/generate-report")
async def generate_report(...):
    cleanup_old_files()  # sweep old files first
    # ... generate new report
```

*This applies to any endpoint that creates files: PDF generators, CSV exports, chart images, temp uploads. Never store generated files in a publicly accessible static directory — use a non-routable path and serve via a download endpoint with proper auth.*

### 2.16. Read-Only Enforcement

**THE RISK:** SQL validation (item 2.4) blocks write keywords in queries, but regex-based validation can have gaps — clever encoding, dialect-specific syntax, or new SQL features might bypass it. If the database connection itself allows writes, a validation bypass means data can be modified or deleted.

**THE SOLUTION:** Enforce read-only mode at both the SQL validation level AND the database connection level. This is defense in depth — even if SQL validation misses something, the database connection itself will reject any write operation.

**THE FIX:**
```
# Postgres: use a read-only database role
CREATE ROLE readonly_user WITH LOGIN PASSWORD 'xxx';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly_user;
# Connect with this role for all query endpoints

# DuckDB: open in read-only mode
conn = duckdb.connect("data.duckdb", read_only=True)

# Application level: validate BEFORE database level
validated_sql = validate_sql(raw_sql)  # regex + parser
result = readonly_conn.execute(validated_sql)  # read-only conn
```

*Write operations (data imports, admin updates) should use a separate connection with a different role — never the same connection used for user-facing queries. Keep write endpoints behind additional auth and rate limiting.*

## 3. Database Layer (Postgres)

### 3.1. Database Indexes

**THE RISK:** Without indexes, every query scans the entire table row by row. On a table with 10K+ rows, this makes queries slow enough that an attacker can tie up your database connections with a few concurrent requests. Indexes turn full scans into near-instant lookups.

**THE SOLUTION:** Create indexes on columns that you frequently filter, sort, or group by. Think of an index like a book's table of contents — instead of reading every page to find a topic, you jump straight to it. This makes queries dramatically faster and harder to abuse, because even a malicious query resolves quickly instead of tying up the database.

**THE FIX:**
```
-- Add indexes on columns used in WHERE, GROUP BY, ORDER BY, JOIN
CREATE INDEX idx_col ON table_name(column_name);

-- For composite queries
CREATE INDEX idx_multi ON table_name(col1, col2);
```

### 3.2. Statement Timeout

**THE RISK:** A single runaway query (or a deliberately crafted one) can consume 100% CPU for minutes, blocking all other queries. Without a timeout, your database becomes unresponsive until you manually kill the query. This is an easy denial-of-service vector.

**THE SOLUTION:** Set a maximum time limit on every database user role. 15 seconds is a good starting point for read-only analytics APIs — most legitimate queries finish in under 5 seconds. If any query takes longer than that, the database automatically kills it. This prevents a single slow or malicious query from locking up your entire database and affecting all other users. Also set the timeout at the application level (e.g. PG_STATEMENT_TIMEOUT_MS environment variable) so it applies to every connection in the pool.

**THE FIX:**
```
-- Set on every role that connects
ALTER ROLE myuser SET statement_timeout = '15s';
```

### 3.3. Read-Only Database Role

**THE RISK:** If your application connects as the database owner (the postgres superuser), a SQL injection vulnerability gives the attacker full control: drop tables, read password hashes, modify data. A restricted read-only role limits the damage — even a successful injection can only read data from allowed tables.

**THE SOLUTION:** Create a separate database user that can only read data (SELECT) from specific tables you choose. Connect your application using this restricted user instead of the all-powerful admin account. Also block access to system tables that contain password hashes and configuration. Even if an attacker finds a way to run SQL, they can only read — not delete, modify, or access sensitive system data.

**THE FIX:**
```
CREATE ROLE app_readonly WITH LOGIN PASSWORD '...';
GRANT SELECT ON specific_table TO app_readonly;
ALTER ROLE app_readonly SET default_transaction_read_only = on;

-- Revoke access to system schemas
REVOKE ALL ON SCHEMA pg_catalog FROM app_readonly;
REVOKE ALL ON SCHEMA information_schema FROM app_readonly;
```

### 3.4. Supabase Row Level Security

**THE RISK:** Supabase exposes a PostgREST API publicly. Your anon key is visible in frontend code. Without Row Level Security (RLS), anyone with that key can INSERT, UPDATE, and DELETE any row in any table. The anon key is not a secret — RLS is your real access control. But enabling RLS alone is not enough — a policy like "Users can update their own profile" with USING (auth.uid() = user_id) applies to ALL columns. If the table has sensitive columns (is_admin, status, subscription_tier, limits), the user can update those on their own row. One API call = privilege escalation.

**THE SOLUTION:** Enable Row Level Security on every table and create policies that define who can do what. For read-only data, allow SELECT only. For user-writable tables, be extremely careful about UPDATE policies — they grant write access to every column on matching rows. Keep sensitive columns (roles, subscription status, permissions, usage limits) in separate tables that users can never write to directly. Billing webhooks and server-side logic should write to those tables via service_role. If you cannot restructure the schema, add BEFORE UPDATE triggers to reject changes to protected columns. If tables are backend-only, revoke all access from the anonymous role entirely.

**THE FIX:**
```
ALTER TABLE mytable ENABLE ROW LEVEL SECURITY;
CREATE POLICY "read_only" ON mytable FOR SELECT USING (true);

-- If backend-only (no frontend needs anon access):
REVOKE ALL ON ALL TABLES IN SCHEMA public FROM anon;

-- DANGER: This policy lets users update ANY column on their row
-- including is_admin, role, subscription_tier, limits!
CREATE POLICY "update_own" ON profiles
  FOR UPDATE USING (auth.uid() = user_id);

-- FIX: Protect sensitive columns with a trigger
CREATE OR REPLACE FUNCTION protect_sensitive_columns()
RETURNS TRIGGER AS $$
BEGIN
  IF NEW.status IS DISTINCT FROM OLD.status
     OR NEW.limits IS DISTINCT FROM OLD.limits THEN
    RAISE EXCEPTION 'Cannot modify protected columns';
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER guard_sensitive
  BEFORE UPDATE ON app_users
  FOR EACH ROW EXECUTE FUNCTION protect_sensitive_columns();
```

*Enabling RLS is necessary but nowhere near sufficient. The vulnerabilities are in the policies themselves — not whether RLS is on or off. Audit every UPDATE and INSERT policy for column-level exposure.*

### 3.5. Connection Pool Configuration

**THE RISK:** Without a properly configured connection pool, your app creates a new database connection for every request (slow) or runs out of connections under load (crash). Cold-start connections add latency. Without an acquire timeout, requests queue forever during connection exhaustion.

**THE SOLUTION:** Use a connection pool that keeps a few database connections ready at all times (pre-warmed) so requests don't wait for new connections to be created. Set a maximum number of connections so your database doesn't get overwhelmed, and set a timeout so requests fail quickly rather than hanging forever if all connections are busy.

**THE FIX:**
```
# asyncpg — pre-warm and configure limits
pool = await asyncpg.create_pool(
    dsn=DATABASE_URL,
    min_size=3,       # pre-warmed connections
    max_size=6,       # upper limit
    command_timeout=15,
    init=init_conn    # set read-only + statement timeout
)
```

### 3.6. SECURITY DEFINER Function Audit

**THE RISK:** Postgres functions marked SECURITY DEFINER execute with the function owner's privileges, completely bypassing Row Level Security. If such a function accepts user-supplied identity parameters (email, user_id, org_id), any authenticated user can impersonate anyone else. For example, a function increment_usage(p_email text, ...) SECURITY DEFINER lets any user insert or update usage records for ANY email — complete identity spoofing that RLS cannot prevent.

**THE SOLUTION:** Audit all SECURITY DEFINER functions in your database. Custom functions should almost never need SECURITY DEFINER — use the default SECURITY INVOKER instead. If a function genuinely needs elevated privileges, never accept user-supplied identity parameters. Pull identity from the JWT using auth.email() or auth.uid() inside the function body. After fixing a function signature, remember that Postgres treats different parameter counts as separate overloads — you must explicitly DROP the old vulnerable version.

**THE FIX:**
```
-- Find all SECURITY DEFINER functions
SELECT n.nspname, p.proname, pg_get_functiondef(p.oid)
FROM pg_proc p JOIN pg_namespace n ON p.pronamespace = n.oid
WHERE p.prosecdef = true
  AND n.nspname NOT IN ('pg_catalog','information_schema');

-- BAD: accepts user-supplied email (spoofable)
CREATE FUNCTION increment_usage(p_email text, p_app text)
  SECURITY DEFINER AS $$ ...
  INSERT INTO usage (email, ...) VALUES (p_email, ...) $$;

-- GOOD: pulls email from JWT (unforgeable)
CREATE FUNCTION increment_usage(p_app text)
  SECURITY DEFINER AS $$ ...
  INSERT INTO usage (email, ...) VALUES (auth.email(), ...) $$;

-- After fixing: DROP the old overload explicitly
DROP FUNCTION increment_usage(text, text, text, text, numeric);
```

*Supabase system functions (graphql, pgbouncer, pgsodium, vault, storage) use SECURITY DEFINER legitimately — these are safe. Focus your audit on custom functions in the public schema.*

### 3.7. Least Privilege Table Grants

**THE RISK:** Supabase defaults grant the authenticated role full CRUD permissions (INSERT, UPDATE, DELETE, TRUNCATE, TRIGGER, REFERENCES) on every table — regardless of your RLS policies. This means if RLS is accidentally disabled on any table, or a permissive policy is added carelessly, authenticated users instantly have full write access. RLS is one layer. Grants are another. You need both locked down.

**THE SOLUTION:** After setting up RLS policies, audit and tighten the underlying table grants so the authenticated role only has the specific permissions that match your actual RLS policies. Read-only data tables (reference data, public datasets, lookup tables) should have SELECT-only grants. User-writable tables should have only the operations their RLS policies allow (e.g., SELECT + INSERT + UPDATE for a files table, not DELETE + TRUNCATE). This is defense-in-depth — even without RLS, damage is limited to what the grants allow.

**THE FIX:**
```
-- Audit current grants
SELECT grantee, table_name, privilege_type
FROM information_schema.table_privileges
WHERE table_schema = 'public'
  AND grantee IN ('anon', 'authenticated')
ORDER BY table_name;

-- Lock down read-only data tables
REVOKE INSERT, UPDATE, DELETE, TRUNCATE, TRIGGER, REFERENCES
  ON my_readonly_table FROM authenticated;

-- Lock down user-writable tables (keep only needed ops)
REVOKE TRUNCATE, TRIGGER, REFERENCES
  ON user_files FROM authenticated;

-- Lock down tables users should never write to
REVOKE UPDATE, DELETE, TRUNCATE, TRIGGER, REFERENCES
  ON app_users FROM authenticated;
```

*Run the audit query after every schema change. New tables inherit default grants, which are usually too permissive.*

## 4. DuckDB-Specific

### 4.1. Read-Only + External Access Disabled

**THE RISK:** DuckDB can read and write files on the server's filesystem by default. A crafted query can read /etc/passwd, write malicious files, or load external data from the network. In a web-facing application, this is equivalent to giving users shell access to your server.

**THE SOLUTION:** Open the DuckDB database in read-only mode so no data can be modified, and disable external access so queries cannot read files from disk or fetch data from the internet. These two settings together ensure that DuckDB can only query the data already inside the database file — nothing else on your server.

**THE FIX:**
```
conn = duckdb.connect(filepath, read_only=True)
conn.execute("SET enable_external_access=false")
```

### 4.2. Resource Limits

**THE RISK:** DuckDB runs in-process — not in a separate server. Without limits, a single query can consume all available RAM and CPU cores, starving your web server and every other process on the machine. Unlike Postgres, there's no separate process to kill.

**THE SOLUTION:** Set explicit limits on how much memory and how many CPU cores DuckDB is allowed to use. For example, cap it at 512MB of RAM and 2 CPU threads. This ensures that even a heavy query leaves enough resources for your web server and other services to keep running normally.

**THE FIX:**
```
conn.execute("SET memory_limit='512MB'")
conn.execute("SET threads=2")
```

### 4.3. Query Interrupt on Timeout

**THE RISK:** When you cancel an asyncio task in Python, it does NOT stop the DuckDB C++ engine running underneath. The query keeps consuming CPU and memory even though the Python coroutine has been cancelled. You must explicitly call conn.interrupt() to kill the query.

**THE SOLUTION:** When a DuckDB query times out, you must explicitly tell the DuckDB engine to stop by calling its interrupt method, then close the connection. Simply cancelling the Python task is not enough — the database engine runs in C++ underneath and keeps going unless you directly tell it to stop.

**THE FIX:**
```
# On timeout:
conn.interrupt()          # kills the C++ query engine
await asyncio.sleep(0.1)  # let engine acknowledge interrupt
conn.close()              # release resources
```

*The sleep between interrupt() and close() is not optional — calling close() immediately after interrupt() can hang or crash because the C++ engine hasn't finished aborting. 100ms is enough for the engine to acknowledge the interrupt signal.*

### 4.4. Container Resource Limits

**THE RISK:** Even with DuckDB's memory_limit set, a bug or edge case can cause the process to exceed it. Without Docker container limits, one runaway DuckDB container can starve every other service on the same server — databases, web servers, monitoring.

**THE SOLUTION:** Set memory and CPU limits at the Docker container level as a second safety net. Even if DuckDB's own memory limit is bypassed due to a bug, the container will be capped. This protects every other service running on the same server from being starved by one runaway container.

**THE FIX:**
```
# Docker / Coolify container settings
memory: 1GB
swap: 2GB
cpus: 2
```

*Coolify gotcha: if you set container limits via the Coolify API (PATCH), the config is updated but the running container keeps its old limits. You must trigger a redeploy for new limits to take effect. Don't assume the API response means limits are active.*

### 4.5. DuckDB File Locking

**THE RISK:** DuckDB uses file-level locking. If a process crashes or is killed while holding a write lock, it can leave behind a stale .wal (write-ahead log) file that blocks all subsequent connections. In read-only mode, multiple connections work fine — but if ANY process opens the file in write mode (even briefly, like a data import script), it blocks all other connections until it releases the lock.

**THE SOLUTION:** Always open DuckDB in read-only mode for web-facing applications (see 4.1). Keep write access limited to offline processes (data imports, migrations) that run one at a time. If a stale .wal file blocks connections after a crash, you may need to open the database in write mode once to recover it, then switch back to read-only. Monitor for .wal files as part of your health checks.

**THE FIX:**
```
# Web-facing app: always read-only (no lock contention)
conn = duckdb.connect(filepath, read_only=True)

# Data import script: write mode, one process at a time
conn = duckdb.connect(filepath, read_only=False)
# ... import data ...
conn.close()  # releases lock

# Recovery: if stale .wal blocks read-only connections
# Option 1: open write mode briefly to flush WAL
conn = duckdb.connect(filepath, read_only=False)
conn.close()
# Option 2: delete .wal file (LAST RESORT — may lose data)
```

*Unlike Postgres, DuckDB has no connection pooler or lock manager. File locking is the only concurrency control. Plan your architecture so the web server never needs write access.*
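The .wal monitoring mentioned in the solution can be a few lines in your health check. A minimal sketch — the database path is an assumption, and DuckDB names the WAL file by appending `.wal` to the database filename:

```python
import os

def wal_is_stale(db_path: str) -> bool:
    """A .wal file next to the database while the web app only holds
    read-only connections suggests a crashed writer left a stale lock."""
    return os.path.exists(db_path + ".wal")

# In the /health endpoint (illustrative path):
# if wal_is_stale("/data/analytics.duckdb"):
#     return degraded status / fire an alert
```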

### 4.6. DuckDB Blocks the Asyncio Event Loop

**THE RISK:** Unlike Postgres (separate server process), DuckDB runs inside your FastAPI process. A heavy DuckDB query blocks the Python asyncio event loop, freezing ALL concurrent handling — health checks stop responding, rate limit checks can't execute, and timeout logic can't fire. Your entire API becomes unresponsive for the duration of the query, not just the endpoint that triggered it.

**THE SOLUTION:** Always run DuckDB queries in a ThreadPoolExecutor so they execute in a separate thread, leaving the asyncio event loop free to handle other requests. Without this, your timeout and interrupt logic (item 4.3) can't even fire — the event loop is blocked, so the timeout coroutine never gets a chance to run. This is also why concurrent request limits (item 2.3) are critical for DuckDB backends.

**THE FIX:**
```
import asyncio

# WRONG — blocks the event loop
@app.get("/api/query")
async def query(request: Request):
    result = conn.execute(sql).fetchall()  # blocks everything

# RIGHT — run in thread, event loop stays responsive
@app.get("/api/query")
async def query(request: Request):
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(
        None, execute_duckdb_query, sql
    )
    return result
```

*This is why DuckDB backends need ALL three protections together: run_in_executor (so timeout can fire), interrupt() on timeout (so the C++ engine actually stops), and concurrency limits (so multiple heavy queries don't pile up). Without run_in_executor, the other two are useless.*
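The three protections can be combined in one helper. A sketch using only the standard library — the on-timeout callback is where you pass `conn.interrupt` from item 4.3, and the executor's `max_workers` doubles as a crude concurrency cap:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Bounded pool: also limits how many heavy queries run at once
_executor = ThreadPoolExecutor(max_workers=4)

async def run_with_timeout(blocking_fn, timeout_s: float, on_timeout):
    """Run blocking_fn in a worker thread; on timeout, fire on_timeout
    (e.g. conn.interrupt) so the underlying work actually stops."""
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(_executor, blocking_fn)
    try:
        # shield: don't cancel the wrapper — the thread can't be
        # cancelled anyway; interrupt() is what stops the engine
        return await asyncio.wait_for(asyncio.shield(future), timeout_s)
    except asyncio.TimeoutError:
        on_timeout()  # e.g. conn.interrupt()
        raise
```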

## 6. Infrastructure

### 6.1. Credential Management

**THE RISK:** API keys in committed files are permanently in git history — even if you delete the file later. Bots scan GitHub continuously (including private repos via leaked PATs) and can find your keys within minutes. A single committed Brevo API key lets attackers send email as your domain.

**THE SOLUTION:** Store all secrets (API keys, database passwords, tokens) as environment variables on your hosting platform — never in code files. If a secret is accidentally committed to git, rotate it immediately (generate a new one and revoke the old one). Simply deleting the file doesn't help because git keeps the full history. Scan your repo periodically for accidental exposures.

**THE FIX:**
```
# Use environment variables only
API_KEY = os.getenv("API_KEY")
if not API_KEY:
    raise RuntimeError("API_KEY not configured")

# Scan for accidentally committed secrets
git ls-files | grep -iE "env|secret|key|token"

# If found, rotate immediately — don't just delete the file
```

### 6.2. .gitignore Essentials

**THE RISK:** Missing .gitignore entries cause secrets and large files to be committed accidentally. A .env file with database URLs, an SSL private key (.pem), or a node_modules folder can all end up in your repo. Once committed, removing them from the working tree doesn't remove them from git history.

**THE SOLUTION:** Add a .gitignore file at the root of every repository that lists all files and folders that should never be committed: .env files (secrets), private keys (.pem, .key), dependency folders (node_modules, .venv), and build outputs (dist, build). This is a one-time setup that prevents accidental exposure of sensitive files.

**THE FIX:**
```
# Minimum .gitignore for every repo
.env
.env.local
.env.*.local
.env.vercel.check
node_modules/
__pycache__/
.venv/
dist/
build/
*.pem
*.key
```

### 6.3. Dependency Supply Chain Security

**THE RISK:** AI coding tools suggest packages constantly, and typosquatted packages (names that look like popular ones but are malicious) specifically target developers who install without verifying. A single malicious dependency can exfiltrate environment variables, inject backdoors, or compromise your entire build pipeline. This risk multiplies when AI assistants generate install commands — they can hallucinate package names that happen to be typosquats.

**THE SOLUTION:** Before installing any package, verify it is the official one — check the npm/PyPI page, look at download counts, check the publisher. Run npm audit and pip audit regularly and fix HIGH/CRITICAL vulnerabilities before deploying. Pin your dependency versions so updates don't silently introduce malicious code. When an AI suggests a package you haven't used before, verify it exists and is legitimate before running the install command.

**THE FIX:**
```
# NPM: audit and fix
npm audit
npm audit fix
# Pin versions in package.json (no ^ or ~)
"dependencies": { "express": "4.18.2" }  # exact

# Python: audit with pip-audit
pip install pip-audit
pip-audit
# Pin versions in requirements.txt
fastapi==0.115.0  # exact, no >=

# Before installing ANY new package:
# 1. Check npm/PyPI page - verify publisher, downloads
# 2. Compare name character-by-character (no typosquats)
# 3. Check: is this the package the AI meant?
# 4. Review package README and repo link
```

*AI coding assistants can hallucinate package names. Always verify before installing — a package that sounds right may not exist or may be a typosquat.*
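One cheap local guard is to compare a requested package name against the list you actually depend on before installing. A sketch using stdlib `difflib` — the package set is illustrative, and this only catches near-miss names, not all typosquats:

```python
from difflib import get_close_matches
from typing import Optional

# Illustrative: the packages this project actually uses
KNOWN_PACKAGES = {"fastapi", "uvicorn", "httpx", "duckdb", "pydantic"}

def typosquat_warning(name: str) -> Optional[str]:
    """Flag names that are close to, but not exactly, a known package."""
    if name in KNOWN_PACKAGES:
        return None  # exact match, fine
    close = get_close_matches(name, KNOWN_PACKAGES, n=1, cutoff=0.8)
    if close:
        return f"'{name}' looks like '{close[0]}' — possible typosquat"
    return None  # unknown name: verify on PyPI before installing
```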

### 6.4. Shared Backend Blast Radius

**THE RISK:** When one backend serves multiple apps (e.g., a shared API behind App A, App B, and a dashboard), changing CORS origins, auth behavior, or rate limits for one app can break the others. You push a CORS fix for App A and App B's fetch calls start failing with opaque CORS errors. This is especially dangerous because the broken app shows no error in server logs — only the browser console reveals the CORS failure.

**THE SOLUTION:** Before changing any shared backend configuration: list all apps that depend on it (check CORS origins list, API monitor logs, cron jobs, MCP endpoints). Test the change against each app's endpoints. If possible, use environment variables so CORS origins and rate limits are configurable per deployment. When in doubt, add the new origin to the allow list rather than replacing the existing ones.

**THE FIX:**
```
# Before changing a shared backend:
# 1. List all consumers
grep -r "BACKEND_URL" ../*/  # find all apps calling this backend
# Check API monitor: which apps hit which endpoints?

# 2. Check CORS origin list covers all consumers
origins = os.getenv("CORS_ORIGINS", "").split(",")
# Verify: every frontend that calls this backend is in the list

# 3. Test each app after deploying changes
# CORS failures show in browser console, NOT server logs

# 4. Use env vars for flexibility (per-deployment config)
RATE_LIMIT = os.getenv("RATE_LIMIT", "30/minute")
# Different deployments can have different limits
```

*CORS errors are invisible server-side. The backend returns 200 but the browser blocks the response. Always test from the actual frontend, not just curl.*
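Parsing the origins list from the environment deserves care — stray whitespace, empty entries, or trailing slashes in the allow list silently break origin matching. A sketch; the env var name follows the example above, and the middleware wiring is shown as comments to keep the snippet self-contained:

```python
import os

def parse_cors_origins(raw: str) -> list:
    """Split a comma-separated CORS_ORIGINS value, dropping blanks
    and trailing slashes (origins never end with '/')."""
    return [o.strip().rstrip("/") for o in raw.split(",") if o.strip()]

origins = parse_cors_origins(os.getenv("CORS_ORIGINS", ""))

# Wire into FastAPI:
# from fastapi.middleware.cors import CORSMiddleware
# app.add_middleware(CORSMiddleware, allow_origins=origins,
#                    allow_credentials=True, allow_methods=["*"])
```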

### 6.5. Deploy Gap — Verify Security Fixes Are Live

**THE RISK:** Unlike Vercel (auto-deploys on git push), self-hosted platforms like Coolify often require manual deployment triggers. After pushing a security fix, the old vulnerable code keeps running until you explicitly deploy. This gap can be minutes or hours — during which your fix exists in git but not in production.

**THE SOLUTION:** For security-critical changes: verify whether auto-deploy is enabled BEFORE pushing. If not, immediately trigger deployment after push (via Coolify API, dashboard, or webhook). After deployment, verify the new version is running — hit the /health endpoint, check container logs, or look for a version indicator. Don't assume pushing means deploying.

**THE FIX:**
```
# After pushing a security fix:

# 1. Check if auto-deploy is enabled
# Coolify: Settings > Auto Deploy toggle
# If OFF, trigger manually:
curl -X POST "https://coolify.example.com/api/v1/deploy" \
  -H "Authorization: Bearer $COOLIFY_TOKEN" \
  -d '{"uuid": "app-uuid"}'

# 2. Verify new version is running
curl https://app.example.com/health
# Check: response should reflect the fix

# 3. Check container logs for successful startup
docker logs <container> --tail 20
```

*Deploy (rebuild from source) vs Restart (same image) — for code changes, you need Deploy. Restart reuses the old built image and your fix won't be applied.*
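One way to make step 2 verifiable is to bake the commit SHA into the image at build time and echo it from /health. A sketch — the `GIT_SHA` env var name is an assumption; set it however your build pipeline allows:

```python
import os

def health_payload() -> dict:
    """Health info including the deployed commit, so 'is my fix live?'
    becomes a one-line curl comparison instead of guesswork."""
    return {
        "status": "ok",
        "version": os.getenv("GIT_SHA", "unknown"),  # set at build time
    }

# After deploy, compare:
#   curl https://app.example.com/health     → "version" field
#   git rev-parse --short HEAD              → your local HEAD
```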

### 6.6. Dead Code Path Audit After Architecture Changes

**THE RISK:** After major refactors — consolidating endpoints, migrating auth systems, switching from direct API calls to a proxy — code paths that were once reachable may become dead. This seems harmless, but dead code with auth exemptions is a latent vulnerability. If someone later creates a route or endpoint that accidentally hits that old code path, the auth bypass activates silently.

Real-world example: after consolidating 11 API endpoints into a single RPC endpoint with JWT auth, a Calendly webhook handler inside one of the old handlers still had an auth exemption (it used webhook signature verification instead of JWT). The old endpoint no longer existed, but the code was still there inside the handler function. If someone later re-created that endpoint path for any reason, the auth bypass would have been active.

**THE SOLUTION:** After any major architecture change, audit the codebase for: (1) Auth bypass patterns — search for code that skips JWT/auth verification for specific actions or paths, then verify those paths are still reachable and intentional. (2) Hardcoded URLs referencing old endpoint structure — webhook callback URLs, redirect URLs, API base URLs that point to deleted paths. (3) Conditional auth logic — if/else branches that grant different auth treatment based on action type or path. Remove unreachable branches entirely rather than leaving them as dead code.

**THE FIX:**
```
# After consolidating endpoints or changing auth flow:

# 1. Search for auth bypass patterns
grep -rn "skip.*auth\|no.*auth\|bypass\|exempt\|without.*verify" api/
grep -rn "action.*===.*webhook\|action.*===.*public" api/

# 2. Search for hardcoded URLs to old endpoints
grep -rn "/api/old-endpoint\|api/calendar\|api/webhook" .

# 3. Search for conditional auth (different paths get different treatment)
grep -rn "if.*action.*===\|switch.*action" api/
# Then verify each branch is still reachable

# 4. After cleanup, verify nothing broke:
npm run build  # frontend compiles
# Hit each endpoint to confirm auth still works
```

*This is a maintenance item, not a one-time setup. Run this audit after every major refactor: endpoint consolidation, auth provider migration, proxy architecture changes, webhook reconfiguration. The goal is zero dead code with elevated privileges. Dead code that has no auth implications (unused utility functions, old formatting helpers) is low priority — focus on code paths that touch authentication or authorization.*

## 7. Incident Response

### 7.1. Incident Response Protocol

**THE RISK:** The instinct during an attack is to immediately restart containers, redeploy, or change code. This destroys logs and evidence. But blindly preserving logs while data is being stolen is equally wrong. The correct response depends on what's happening — is there active damage right now, or are you investigating something that already happened?

**THE SOLUTION:** First, assess: is the attack ACTIVE (data being exfiltrated, records being deleted, resources being drained right now) or DISCOVERED (you found evidence of a past breach, probing, or vulnerability)? For active damage, stop the bleeding first — shut down the service or block at the firewall immediately. Logs are worthless if the attacker is still inside deleting data. For discovered incidents (post-breach investigation, reconnaissance detected in logs, vulnerability found), preserve evidence first — don't restart anything until you've saved the logs.

**THE FIX:**
```
ACTIVE ATTACK (data theft, deletion, resource drain):
1. KILL   — shut down the service or block at Cloudflare firewall
             (stop the damage NOW — logs are secondary)
2. DUMP   — save container logs, API monitor, CF analytics,
             pg_stat_activity (from the stopped state)
3. ASSESS — what was accessed, what was taken
4. HARDEN — fix the vulnerability before restarting

DISCOVERED INCIDENT (past breach, probing, vuln found):
1. STOP   — do NOT restart or redeploy (preserves logs)
2. DUMP   — container logs, API monitor, CF analytics,
             pg_stat_activity
3. BLOCK  — Cloudflare firewall rules per IP (faster than app-level)
4. HARDEN — fix the vulnerability, then deploy
```

*The difference matters: if someone is actively draining your database, don't sit there downloading logs while they work. Pull the plug, then investigate. But if you found a suspicious log entry from yesterday, restarting containers before saving logs destroys your only evidence of what happened.*

### 7.2. Credential Rotation After Exposure

**THE RISK:** A secret committed to git — even briefly, even in a private repo — should be considered compromised. Bots that scrape GitHub for secrets operate within minutes. Simply deleting the file doesn't help because the secret remains in git history forever unless explicitly scrubbed.

**THE SOLUTION:** The moment you discover a leaked credential, generate a new one immediately — don't wait to investigate first. Update the new credential in all your environments (Vercel, Coolify, local), then revoke the old one from the provider's dashboard. Check the provider's access logs to see if anyone used the leaked credential while it was exposed. If the repo might go public, scrub the secret from git history too.

**THE FIX:**
```
1. Rotate the credential IMMEDIATELY (generate new)
2. Update env vars in all environments
   (Vercel, Coolify, local)
3. Revoke the old credential from provider dashboard
4. Check git history for other exposures:
   git log --all --oneline -- "**/.*env*"
5. Audit provider access logs for unauthorized usage
   during the exposure window

# Scrub from git history if repo may go public:
# Use BFG Repo-Cleaner or git filter-repo
```

### 7.3. Secret Rotation Plan

**THE RISK:** Most teams only rotate credentials after a breach. But secrets accumulate risk over time — they get shared in Slack, copied to local machines, cached in CI pipelines, and stored in browser password managers. The longer a secret lives, the more places it exists and the more likely it has leaked somewhere you don't know about.

**THE SOLUTION:** Have a documented plan for how to rotate each credential your app depends on: database passwords, API keys, OAuth client secrets, signing keys. Know the steps BEFORE you need them — during a breach is the wrong time to figure out the process. Ideally, rotate proactively on a schedule (quarterly for high-value secrets). At minimum, know which environment variables need updating when a key changes, and test that your app handles the rotation without downtime.

**THE FIX:**
```
# Secret rotation checklist per credential:
# 1. Where is it stored? (Vercel, Coolify, .env, CI)
# 2. What breaks if I change it? (which services)
# 3. Can I rotate without downtime?
#    (some services support two active keys)
# 4. How do I generate a new one?
#    (provider dashboard, CLI, API)
# 5. What environments need updating?
#    (prod, staging, local, CI)

# Practical steps:
# - Document rotation procedure for each key
# - Keep a list: which secrets exist, where, last rotated
# - After any team member leaves: rotate shared secrets
# - After any security incident: rotate everything
```

*This is one of those things most solo developers and small teams skip. Having the plan documented matters more than perfect execution — when a breach happens, you need to move fast, not figure out the steps.*

### 7.4. Incident Investigation — Use Your Centralized Logs

**THE RISK:** When an incident happens, your first instinct is to SSH into the affected container and grep logs. With 30+ backends, this takes hours — and by the time you find the relevant entries, the attacker may have moved to another service. Container logs are also lost on redeploy. If you don't have centralized logging set up BEFORE the incident, you're doing forensics blind.

**THE SOLUTION:** Your centralized API logging dashboard (item 2.14) is the first place to go during any incident. Filter by the suspicious IP across ALL backends to see the full attack timeline — what they probed first, which endpoints they hit, what payloads they sent. The 30-day window of request bodies and client IPs gives you the raw evidence. Cross-reference with Cloudflare analytics (which shows blocked requests the backend never saw) and pg_stat_activity (for active database queries). This is why centralized logging must be set up before you need it — during an incident is too late.

**THE FIX:**
```
# Incident investigation sequence:

# 1. Centralized API dashboard — filter by suspicious IP
#    See: all endpoints hit, payloads sent, status codes, timing
#    Cross-backend: did this IP probe other services?

# 2. Cloudflare analytics — what was BLOCKED
#    Dashboard > Analytics > filter by IP
#    Shows requests that never reached your backend
#    (WAF blocks, rate limit blocks, challenge failures)

# 3. Database activity — active/recent queries
#    SELECT * FROM pg_stat_activity WHERE state = 'active';
#    Check for: long-running queries, unusual query patterns

# 4. Container logs — only if centralized logs are incomplete
#    docker logs <container> --since 2h | grep <ip>
#    WARNING: lost on redeploy, this is your last resort

# 5. Timeline reconstruction
#    Combine all sources into a single timeline:
#    IP → first probe → escalation → data access → blocking
```

*The centralized logging pipeline (item 2.14) must be set up and running BEFORE an incident. During a breach you need answers in minutes, not hours. If you're setting up logging after discovering an attack, you've already lost the evidence from the attack window.*

## 8. MCP Security (fastapi-mcp)

### 8.1. MCP Client IP Forwarding

**THE RISK:** fastapi-mcp makes internal HTTP calls to your FastAPI endpoints. Without configuration, ALL MCP queries appear to come from the server's own IP. This means per-IP rate limiting and concurrency limits treat every MCP user as the same person — either everyone is rate-limited or nobody is.

**THE SOLUTION:** Configure the MCP server to forward the original client's IP address when it makes internal calls to your API. By default, all MCP requests appear to come from localhost, so your rate limiting can't tell users apart. Pass through headers like cf-connecting-ip and x-forwarded-for. CRITICAL: the base_url MUST be localhost — if you use the external domain, the request goes through Cloudflare on the way back in, which overwrites cf-connecting-ip with your server's own IP, destroying the real client IP.

**THE FIX:**
```
import httpx
from fastapi_mcp import FastApiMCP

mcp = FastApiMCP(
    app,
    headers=["authorization", "cf-connecting-ip",
             "x-forwarded-for", "x-real-ip"],
    http_client=httpx.AsyncClient(
        # MUST be localhost — external URL goes through Cloudflare
        # and overwrites cf-connecting-ip with server's own IP
        base_url="http://localhost:8000",
        timeout=60.0,
    ),
)
```

*This is the #1 MCP misconfiguration. Using the public URL (https://app.example.com) instead of localhost looks correct but silently breaks per-user rate limiting for all MCP clients.*

### 8.2. Open vs Secured MCP Endpoints

**THE RISK:** A single MCP endpoint forces a bad choice: open to all (insecure for anything sensitive) or locked down behind auth (too much friction for demos and public testing). Whichever you pick, you lose a real use case.

**THE SOLUTION:** Run two separate MCP endpoints: one open (no login required) for demos and testing, and one secured with authentication (OAuth + JWT + email whitelist) for production use. Both endpoints still get all the other protections — rate limiting, SQL validation, concurrency limits. The difference is just who's allowed to connect.

**THE FIX:**
```
# Two MCP endpoints:
/mcp        — open, no auth (demo/testing)
/mcp-secure — Auth0 OAuth + JWT + email whitelist

# Both get: rate limiting, SQL validation, concurrency
# /mcp-secure adds: identity verification via OAuth
```

*Auth setup (Auth0/OAuth, JWT verification, email whitelists) is covered in detail in Section 10 (Authentication & Authorization) of this checklist. For a working implementation with both open and secured MCP endpoints, see the example repos in the Live Examples section.*

### 8.3. MCP Transport Compatibility

**THE RISK:** fastapi-mcp 0.4.0 uses SSE transport. Some newer MCP clients (like Claude Code) use Streamable HTTP, which sends a POST to /mcp — and your SSE endpoint returns 405 Method Not Allowed. This silently breaks connectivity for newer clients.

**THE SOLUTION:** Be aware that the MCP protocol has two transport methods — SSE (older) and Streamable HTTP (newer). If your MCP library only supports SSE, newer clients that expect Streamable HTTP will get errors when connecting. Keep your MCP library updated and watch for releases that add support for the newer transport method.

**THE FIX:**
```
# Monitor fastapi-mcp releases for Streamable HTTP support
# Current workaround: clients must use SSE transport
# Future: upgrade fastapi-mcp when POST transport is supported
```

*This is a library limitation, not a configuration issue. Keep fastapi-mcp updated.*

### 8.4. Failed-Auth Rate Limiting on Secured Endpoint

**THE RISK:** Your secured MCP endpoint validates JWT tokens on every connection. Without rate limiting on failed attempts, an attacker can send thousands of fake tokens probing for valid ones, timing differences, or error message leaks. Your application-level rate limits (SlowAPI) protect query endpoints but not the SSE connection handshake where JWT validation happens.

**THE SOLUTION:** Track failed auth attempts per IP in memory. After a configurable threshold, block that IP from further attempts with a 429 response. Choose thresholds based on your use case — a public API with many users might allow 20 failures per hour, while a private internal tool might allow only 3 failures with a 24-hour block. This stops brute-force probing without affecting legitimate users who authenticate successfully on first try.

**THE FIX:**
```
import os
import time

from fastapi import HTTPException, Request
from jose import JWTError  # python-jose, as in item 10.2

# In-memory failed-auth tracker (per-process; resets on restart)
_auth_failures: dict[str, list[float]] = {}

# Tune these to your use case:
#   Public API with many users:  AUTH_FAIL_MAX=20, WINDOW=3600 (1hr)
#   Internal tool, few users:    AUTH_FAIL_MAX=3,  WINDOW=86400 (24hr)
#   General-purpose default:     AUTH_FAIL_MAX=5,  WINDOW=86400 (24hr)
AUTH_FAIL_MAX = int(os.getenv("AUTH_FAIL_MAX", "5"))
AUTH_FAIL_WINDOW = int(os.getenv("AUTH_FAIL_WINDOW", "86400"))
AUTH_FAIL_MAX_IPS = 1000  # cap tracked IPs to bound memory

def _is_auth_blocked(ip: str) -> bool:
    if ip not in _auth_failures:
        return False
    now = time.time()
    cutoff = now - AUTH_FAIL_WINDOW
    # Prune expired entries
    _auth_failures[ip] = [t for t in _auth_failures[ip] if t > cutoff]
    if not _auth_failures[ip]:
        del _auth_failures[ip]
        return False
    return len(_auth_failures[ip]) >= AUTH_FAIL_MAX

def _record_auth_failure(ip: str):
    now = time.time()
    _auth_failures.setdefault(ip, []).append(now)
    # Periodic cleanup: evict expired IPs when dict grows too large
    if len(_auth_failures) > AUTH_FAIL_MAX_IPS:
        cutoff = now - AUTH_FAIL_WINDOW
        expired = [k for k, v in _auth_failures.items()
                   if not v or v[-1] < cutoff]
        for k in expired:
            del _auth_failures[k]

async def verify_oauth_token(request: Request):
    client_ip = get_client_ip(request)
    if _is_auth_blocked(client_ip):
        raise HTTPException(429, "Too many failed attempts")
    try:
        # ... validate JWT ...
        pass
    except JWTError:
        _record_auth_failure(client_ip)
        raise HTTPException(401, "Invalid token")
```

*SEPARATE FROM SLOWAPI: SlowAPI limits total requests per IP; this specifically targets failed authentication only. A user who authenticates successfully is never affected. MEMORY BOUNDED: Cap the dictionary size (e.g. 1000 IPs) and evict expired entries on overflow — otherwise a distributed attack from thousands of IPs could grow the dict unboundedly. MAKE IT CONFIGURABLE: Load thresholds from env vars so you can tighten or loosen without redeploying. WINDOW STRATEGY: Short windows (5-15 min) with moderate limits suit high-traffic public APIs. Long windows (24hr) with low limits suit private or internal tools where legitimate users rarely fail auth.*

## 9. Server Security (VPS / Self-Hosted)

### 9.1. SSH Key-Only Access

**THE RISK:** When you spin up a new server, password login is often enabled by default. Bots start scanning for SSH access within hours. If password login is enabled, they will brute-force it — even strong passwords are vulnerable to sustained automated attacks with millions of combinations.

**THE SOLUTION:** Disable password login entirely and only allow SSH key authentication. SSH keys are effectively impossible to guess (they are thousands of characters long). Once configured, anyone trying to log in with a password is rejected instantly — they never get a chance to guess. This is a one-time setup during server provisioning.

**THE FIX:**
```
# /etc/ssh/sshd_config
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password

# Validate config before applying — a typo here can lock you out
sudo sshd -t && sudo systemctl restart sshd
```

### 9.2. fail2ban

**THE RISK:** Even with SSH key-only authentication, bots will keep trying to connect. Each failed attempt consumes CPU for the handshake rejection. Thousands of simultaneous attempts can spike CPU to 100% and crash your server. This happened in production — server went down from the sheer volume of rejected connections.

**THE SOLUTION:** Install fail2ban, which monitors your SSH log for failed login attempts. After a set number of failures (say 5) from the same IP within an hour, it bans that IP at the firewall level for 24 hours. The banned IP's connections are dropped before they even reach SSH, so there is zero CPU cost. Real-world results: 150+ IPs banned at any time, 6,000+ blocked attempts per week.

**THE FIX:**
```
sudo apt install fail2ban

# /etc/fail2ban/jail.local
[sshd]
enabled = true
maxretry = 5
# 1-hour observation window
findtime = 3600
# 24-hour ban (comments on their own lines — inline comments
# can trip fail2ban's config parser in some versions)
bantime = 86400
```

### 9.3. Firewall — Close Unnecessary Ports

**THE RISK:** By default, many ports may be open on your server. Each open port is a potential entry point for attackers. Database ports (5432 for Postgres, 3306 for MySQL), admin panels, and debug servers should never be reachable from the internet.

**THE SOLUTION:** Only three ports need to be open: 22 (SSH for server management), 80 (HTTP), and 443 (HTTPS). Close everything else. Configure the firewall at the cloud provider level (Hetzner Firewall, OCI Security Lists) so traffic is dropped before it reaches your server — zero CPU cost. Services that need to talk to each other should use Docker internal networking, not exposed ports.

**THE FIX:**
```
# OS-level firewall (ufw)
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable

# Also configure at cloud provider level
# (Hetzner Firewall / OCI Security Lists)
```

### 9.4. Non-Root User

**THE RISK:** When you first set up a server, you're logged in as root with full control over everything. If an application vulnerability is exploited while running as root, the attacker owns your entire system — every file, every service, every secret on the machine.

**THE SOLUTION:** Create a dedicated non-root user with sudo access and run all your applications under that user. This limits the damage if something goes wrong — an attacker can only access what that user has permission to, not the entire system. Docker, Coolify, and your applications should all run under this restricted user, not root.

**THE FIX:**
```
# Create a deploy user with sudo access
adduser deploy
usermod -aG sudo deploy

# Disable root password login
passwd -l root

# Run apps as this user, not root
```

### 9.5. Docker Image Cleanup on Deploy Servers

**THE RISK:** Every Nixpacks/Docker build creates a 1-1.5GB image, and old images are never cleaned up automatically. On a typical VPS with 75GB disk, you can fill the entire disk in a single batch deployment session (40+ apps). When the disk fills up, everything stops — containers can't write logs, databases can't write WAL files, and new deploys fail silently.

**THE SOLUTION:** Run docker image prune -f between deployments to remove unused images. For batch redeploy sessions (like updating all apps at once), prune after every 5-10 deploys. Set up a cron job or deploy hook to prune weekly. Monitor disk usage — if your server has less than 20% disk free after a deploy, prune immediately.

**THE FIX:**
```
# After each deploy or between batch deploys
docker image prune -f

# Check disk usage
df -h /

# Nuclear option: remove ALL unused images, volumes, networks
docker system prune -af --volumes

# Cron job: weekly cleanup (add to deploy user's crontab)
0 3 * * 0 docker image prune -f >> /var/log/docker-prune.log 2>&1
```

*This is especially critical on Coolify/Hetzner setups with limited disk. A 75GB disk filled to 100% from 43 images in one session — learned the hard way.*

## 10. Authentication & Authorization

### 10.1. When OAuth is Mandatory vs Optional

**THE RISK:** Without clear rules for when to require authentication, developers either add friction to apps that should be frictionless (killing adoption of public demos) or skip auth on apps that handle sensitive data (creating a wide-open door). Both are expensive mistakes — one costs users, the other costs security.

**THE SOLUTION:** OAuth is mandatory when your app handles private or sensitive data, when it's for internal use or a small specific audience, when you need per-user rate limiting or audit trails, or when you need role-based access control (who can see what). It's optional for public demo apps where frictionless access drives adoption. Even for public apps, OAuth gives you benefits: user behavior monitoring, a contact list for marketing, and abuse tracking tied to real identities. The trade-off is real friction — login walls reduce casual usage. For public apps, it's your call. For private apps, there is no choice.

**THE FIX:**
```
# OAuth is MANDATORY when:
# - App handles private/sensitive data
# - Internal tools for specific team/audience
# - Per-user rate limits needed
# - Role-based access control required
#   (e.g., only certain users query certain DBs)
# - Audit trail of who did what is required

# OAuth is OPTIONAL (developer choice) when:
# - Public demo app — frictionless access matters
# - Open data — nothing sensitive exposed
# - Portfolio/showcase apps

# Even for public apps, OAuth gives you:
# - User behavior analytics
# - Marketing: email list of users
# - Abuse tracking: ban by identity, not just IP
# - Rate limiting per user (more accurate than IP)

# If public app + optional OAuth:
# Rely on the REST of the hardening stack
# (rate limits, SQL validation, CORS, etc.)
```

### 10.2. Backend Token Verification

**THE RISK:** Frontend authentication is just a UI convenience — it tells the browser who is logged in. It is NOT security. Anyone with curl, Postman, or browser DevTools can call your API directly, bypassing all frontend auth checks. If your backend doesn't independently verify the token, your API is effectively public.

**THE SOLUTION:** Your backend must verify the JWT on every single request that needs authentication. Extract the Bearer token from the Authorization header, verify its cryptographic signature against the OAuth provider's public keys (JWKS endpoint), and check the audience, issuer, and expiry claims. If any of these checks fail, reject the request with a 401. The backend should never trust the frontend — it should trust only the cryptographic proof in the token.

**THE FIX:**
```
# FastAPI — verify Auth0 JWT on every request
from jose import jwt as jose_jwt, JWTError
import httpx

async def verify_token(request: Request):
    auth = request.headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise HTTPException(401, "Missing bearer token")
    token = auth.split(" ", 1)[1]

    # Fetch provider's public keys (cache in production)
    jwks_url = f"https://{AUTH0_DOMAIN}/.well-known/jwks.json"
    async with httpx.AsyncClient() as client:
        jwks = (await client.get(jwks_url)).json()

    try:
        payload = jose_jwt.decode(
            token, jwks,
            algorithms=["RS256"],
            audience=AUTH0_AUDIENCE,
            issuer=f"https://{AUTH0_DOMAIN}/",
        )
        return payload  # contains user info (sub, email)
    except JWTError:
        raise HTTPException(401, "Invalid or expired token")

# Apply to protected endpoints:
@app.get("/api/data", dependencies=[Depends(verify_token)])
```

*Cache the JWKS response — don't fetch it on every request. Refresh on restart or on a schedule.*
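The caching the note above recommends can be isolated into a small TTL cache. A minimal sketch — `fetch_jwks` is a hypothetical stand-in for the httpx call in the block above, and the `now` parameter exists only to make the cache testable:

```
# Minimal TTL cache for the JWKS document — sketch, not production code.
import time

_JWKS_CACHE = {"keys": None, "fetched_at": 0.0}
JWKS_TTL = 3600  # seconds — refetch at most once per hour

def get_jwks(fetch_jwks, now=None):
    """Return cached JWKS, refetching only when the TTL has elapsed."""
    now = time.time() if now is None else now
    stale = now - _JWKS_CACHE["fetched_at"] > JWKS_TTL
    if _JWKS_CACHE["keys"] is None or stale:
        _JWKS_CACHE["keys"] = fetch_jwks()
        _JWKS_CACHE["fetched_at"] = now
    return _JWKS_CACHE["keys"]
```

In `verify_token` you would call `get_jwks(...)` instead of fetching the JWKS URL on every request.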

### 10.3. API Keys — Never in the Browser

**THE RISK:** Any API key in your frontend JavaScript is visible to everyone. Browser DevTools, View Source, or a simple network tab inspection reveals keys embedded in bundled code or sent in fetch headers. Once someone has your key, they can use your API quota, access your data, or impersonate your app — and you'll pay the bill.

**THE SOLUTION:** API keys must live server-side only — in environment variables on your hosting platform (Vercel, Coolify). Use a serverless proxy pattern: your frontend calls a Vercel API route on your own domain, and that server-side function adds the API key before forwarding the request to the actual backend. The key never reaches the browser. This is the same pattern for any secret: database URLs, signing keys, OAuth client secrets.

**THE FIX:**
```
// Frontend — calls own domain, no keys visible
const res = await fetch("/api/query", {
    method: "POST",
    body: JSON.stringify({ sql: query }),
});

// Vercel API route (api/query.ts) — adds key server-side
export default async function handler(req, res) {
    const API_KEY = process.env.BACKEND_API_KEY; // server-side only
    const response = await fetch(BACKEND_URL + "/query", {
        method: "POST",
        headers: {
            "Authorization": `Bearer ${API_KEY}`,
            "Content-Type": "application/json",
        },
        body: JSON.stringify(req.body),
    });
    res.status(response.status).json(await response.json());
}
```

*Remember: VITE_ prefix variables are bundled into frontend JS and visible to everyone. Only use VITE_ for non-secret config like Auth0 domain or public API base URLs.*

### 10.4. Role-Based Access Control (RBAC)

**THE RISK:** Authentication tells you WHO someone is. Without authorization, every authenticated user can do everything — query any database, delete any file, access any admin panel. A logged-in user with no role restrictions has the same power as an admin.

**THE SOLUTION:** After verifying identity, check what each user is allowed to do. This can be as simple as an admin vs user role, or as granular as per-file or per-table permissions. Store roles in your auth provider (Auth0 app_metadata, Clerk publicMetadata) or your own database. Check the role on every request — not just in the UI. The backend must enforce roles independently, because frontend role checks are just UI cosmetics.

**THE FIX:**
```
# Simple RBAC pattern in FastAPI
async def require_admin(request: Request):
    payload = await verify_token(request)  # from 10.2
    role = payload.get("app_metadata", {}).get("role", "user")
    if role != "admin":
        raise HTTPException(403, "Admin access required")

# Granular: file-level access control
async def check_file_access(user_email: str, filename: str):
    # Query your DB: does this user have access to this file?
    allowed_files = await get_user_files(user_email)
    if filename not in allowed_files:
        raise HTTPException(403, "Access denied to this file")

# Apply to endpoints:
@app.delete("/api/files/{name}", dependencies=[Depends(require_admin)])
async def delete_file(name: str):
    ...

@app.get("/api/query/{file}")
async def query_file(file: str, user: dict = Depends(verify_token)):
    # verify_token (10.2) returns the JWT payload (sub, email)
    await check_file_access(user["email"], file)
```

*Roles in the UI (showing/hiding buttons) are for UX only. The backend must enforce the same rules — never trust the frontend to restrict access.*
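If you have more than one role, a small factory avoids writing a separate checker per role. A dependency-free sketch — `PermissionError` stands in for FastAPI's `HTTPException(403, ...)`:

```
# Role-checker factory — one checker per role without copy-paste.
# In FastAPI, wrap this in a dependency that first calls verify_token.
def require_role(required: str):
    def checker(payload: dict) -> dict:
        role = payload.get("app_metadata", {}).get("role", "user")
        if role != required:
            raise PermissionError(f"{required} access required")
        return payload
    return checker

require_admin = require_role("admin")
require_editor = require_role("editor")
```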

### 10.5. OAuth for MCP Servers

**THE RISK:** MCP servers give AI clients direct access to your databases and tools. An open MCP server means any AI client on the internet can query your data. For demo servers with public data, this may be acceptable (with other hardening). But for servers with private data, this is a direct path to data exfiltration.

**THE SOLUTION:** fastapi-mcp supports OAuth out of the box via its AuthConfig. The practical pattern is to run two MCP endpoints: an open one at /mcp for public demos (protected by rate limiting, SQL validation, and other hardening), and a secured one at /mcp-secure that requires Auth0 JWT authentication. The secured endpoint verifies the token on every request and can restrict access by user identity. For team use, combine this with role-based access to control which team members can query which databases.

**THE FIX:**
```
from fastapi_mcp import FastApiMCP, AuthConfig
from fastapi import Depends

# Open MCP — for public demos (relies on other hardening)
mcp_open = FastApiMCP(app, name="MCP (Open)", ...)
mcp_open.mount()  # /mcp

# Secured MCP — requires Auth0 OAuth
mcp_secure = FastApiMCP(
    app,
    name="MCP (Secured)",
    auth_config=AuthConfig(
        issuer=f"https://{AUTH0_DOMAIN}/",
        authorize_url=f"https://{AUTH0_DOMAIN}/authorize",
        audience=AUTH0_AUDIENCE,
        client_id=AUTH0_CLIENT_ID,
        client_secret=AUTH0_CLIENT_SECRET,
        dependencies=[Depends(verify_oauth_token)],
        setup_proxies=True,
    ),
)
mcp_secure.mount(mount_path="/mcp-secure")
```

*For open MCP servers: all your other hardening (rate limits, SQL validation, concurrency caps, read-only mode) still applies. OAuth is an additional layer, not a replacement.*

### 10.6. One-Time Tokens for Sensitive Operations

**THE RISK:** Some operations need to bypass the normal API flow — like large file uploads that exceed your serverless platform's payload limit (Vercel caps at 4.5MB). If you issue a long-lived or reusable token for these operations, a leaked token can be replayed indefinitely. If you skip tokens entirely and let the frontend call the backend directly, you expose your API key.

**THE SOLUTION:** Generate short-lived, single-use tokens server-side for sensitive operations. The frontend requests a token through your serverless proxy (which has the API key), receives a random token, and uses it to call the backend directly for that one operation. The backend verifies the token exists, hasn't expired, and then deletes it after use. The token works once and is gone — even if intercepted, it cannot be replayed.

**THE FIX:**
```
import secrets, time

UPLOAD_TOKENS = {}  # In production, use Redis

@app.post("/upload-token")  # called via serverless proxy
async def create_upload_token(request: Request):
    verify_api_key(request)  # only proxy has the key
    token = secrets.token_urlsafe(32)
    UPLOAD_TOKENS[token] = {
        "created": time.time(),
        "expires_at": time.time() + 600,  # 10 min
    }
    return {"token": token}

@app.post("/upload-direct/{token}")  # frontend calls directly
async def upload_direct(token: str, file: UploadFile):
    if token not in UPLOAD_TOKENS:
        raise HTTPException(401, "Invalid or expired token")
    if time.time() > UPLOAD_TOKENS[token]["expires_at"]:
        del UPLOAD_TOKENS[token]
        raise HTTPException(401, "Token expired")
    # Process upload...
    del UPLOAD_TOKENS[token]  # consume — one-time use
    return {"status": "uploaded"}
```

*For production, store tokens in Redis with TTL instead of an in-memory dict. The dict approach works but is lost on server restart.*
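The consume-once semantics can be isolated into a small class — a stdlib-only stand-in for the Redis version, sketched here for clarity (the `now` parameter is only for testing):

```
# One-time token store with TTL — stdlib sketch; use Redis in production.
import secrets, time

class OneTimeTokens:
    def __init__(self, ttl: int = 600):
        self.ttl = ttl
        self._tokens = {}  # token -> expiry timestamp

    def issue(self, now=None) -> str:
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._tokens[token] = now + self.ttl
        return token

    def consume(self, token: str, now=None) -> bool:
        now = time.time() if now is None else now
        expires_at = self._tokens.pop(token, None)  # pop = single use
        return expires_at is not None and now <= expires_at
```

The `pop()` is the key move: validity check and consumption happen in one step, so the token cannot be replayed even under concurrent requests to the same process.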

### 10.7. Signed URLs for File Downloads

**THE RISK:** If uploaded files are served at predictable URLs (like /files/report.csv), anyone who guesses or discovers the filename can download it. Directory listing attacks, URL enumeration, and shared links that never expire are all common vectors for unauthorized file access.

**THE SOLUTION:** Generate cryptographically signed URLs that embed the filename and an expiry timestamp. The signature prevents tampering — changing the filename or extending the expiry invalidates the URL. Set a reasonable expiry (hours to days depending on use case) so shared links don't work forever. When a user requests a file, generate a fresh signed URL and redirect them to it.

**THE FIX:**
```
import os
from itsdangerous import URLSafeTimedSerializer

SIGNING_SECRET = os.getenv("SIGNING_SECRET")
url_signer = URLSafeTimedSerializer(SIGNING_SECRET)

def generate_signed_url(filename: str, base_url: str) -> str:
    token = url_signer.dumps({"file": filename})
    return f"{base_url}/secure-download/{token}"

@app.get("/secure-download/{token}")
async def secure_download(token: str):
    try:
        data = url_signer.loads(token, max_age=172800)  # 48h
    except Exception:
        raise HTTPException(403, "Invalid or expired link")

    filepath = UPLOAD_DIR / data["file"]
    if not filepath.exists():
        raise HTTPException(404, "File not found")

    return FileResponse(
        filepath,
        headers={
            "Content-Disposition": f'attachment; filename="{data["file"]}"',
            "X-Content-Type-Options": "nosniff",
        },
    )
```

*Always serve with Content-Disposition: attachment so the browser downloads instead of rendering. This prevents uploaded HTML/SVG files from executing in your domain's security context.*

*JWT AUTH GOTCHA: If your download endpoint requires a Bearer token (JWT auth), you cannot use `<a href>` links or window.open() — browsers send plain GET requests with no Authorization header, so the download silently fails with 403. Instead, use programmatic download: fetch the file with auth headers, convert the response to a blob, create a temporary object URL, trigger a click on a hidden `<a>` element, then revoke the URL. Example: `const resp = await fetch(url, { headers: { Authorization: "Bearer " + token } }); const blob = await resp.blob(); const a = document.createElement("a"); a.href = URL.createObjectURL(blob); a.download = filename; a.click(); URL.revokeObjectURL(a.href);`*
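If you'd rather avoid the itsdangerous dependency, the same sign-and-expire idea can be approximated with the stdlib. A hedged sketch, not a drop-in replacement — `SECRET` is a placeholder for your SIGNING_SECRET:

```
# Stdlib signed tokens: HMAC over a base64 payload plus an expiry check.
import base64, hashlib, hmac, json, time

SECRET = b"replace-with-SIGNING_SECRET"

def sign_file(filename: str, now=None) -> str:
    now = int(time.time()) if now is None else int(now)
    payload = base64.urlsafe_b64encode(
        json.dumps({"file": filename, "ts": now}).encode()
    ).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"

def verify_file(token: str, max_age: int = 172800, now=None):
    """Return the filename if the token is valid and fresh, else None."""
    now = int(time.time()) if now is None else int(now)
    try:
        payload, sig = token.rsplit(".", 1)
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return None  # tampered signature or payload
        data = json.loads(base64.urlsafe_b64decode(payload))
        if now - data["ts"] > max_age:
            return None  # expired
        return data["file"]
    except Exception:
        return None
```

The signature covers the encoded payload, so changing either the filename or the timestamp invalidates the token — the same property itsdangerous gives you.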

### 10.8. Choosing an OAuth Provider

**THE RISK:** Picking the wrong auth provider costs you twice — once during setup and again when you migrate. Some providers charge for basic features (Clerk charges $25/mo for email allowlists). Some have great developer experience but less flexibility. Some are rock-solid but take longer to wire up. The wrong choice can lock you into a provider that doesn't fit your needs or charge you for features that others offer free.

**THE SOLUTION:** Auth0 is the long-term standard — 12+ years old, owned by Okta, most features, free allowlist via Actions, HIPAA-capable. It takes longer to set up (more config spread across Dashboard, Actions, APIs) but it's the most flexible and the most likely to exist in 10 years. Clerk is the fastest to set up — AI coders wire it up in minutes, excellent React components, but allowlists require the paid Pro plan. Supabase Auth is good if you're already on Supabase (50K free MAU, allowlist via RLS). Neon Auth is new (built on Better Auth), interesting for branchable auth in dev workflows, but still Beta. For quick temporary lockdowns, Clerk works fast. For everything else, standardize on Auth0.

**THE FIX:**
```
# Decision framework:

# Auth0 — default choice for production apps
# + Free allowlist (via Actions)
# + 7,500-25,000 free MAU
# + Most features, extensive customization
# + HIPAA, enterprise compliance
# + 12+ years stable (owned by Okta)
# - More wiring (15-30 min vs 5 min for Clerk)
# - Config spread across Dashboard/Actions/APIs
# - AI coders need 2-4 iterations to get it right

# Clerk — fastest setup, great DX
# + AI coders set it up in minutes
# + Excellent pre-built React components
# + 10,000 free MAU
# - Allowlist requires Pro ($25/mo)
# - Less customization than Auth0
# Good for: quick lockdowns, prototypes

# Supabase Auth — if already on Supabase
# + 50,000 free MAU (most generous)
# + Allowlist via Row Level Security
# + Integrated with Supabase DB
# - Tied to Supabase ecosystem

# Neon Auth — watch but wait
# + Branchable auth (auth branches with DB)
# + Built on Better Auth (open source)
# - Still in Beta
# - No allowlist feature yet
# - Tied to Neon ecosystem
```

*Standardize on one provider across your apps. Mixing providers creates maintenance overhead — different dashboards, different token formats, different debugging patterns. Use a second provider (like Clerk) only for quick temporary lockdowns.*

### 10.9. Pre-Authentication Access Gate (Frontend-Only — Cosmetic, Easily Bypassed)

**THE RISK:** WARNING: This item describes a FRONTEND-ONLY access gate pattern. It is cosmetic and can be easily bypassed by anyone who opens browser DevTools and fakes the API response (e.g., changing a 403 into a 200 with {granted: true}). The frontend JavaScript trusts whatever the fetch() call returns — an attacker simply intercepts the response in the Network tab. This provides zero real security against anyone with basic browser knowledge.

The recommended approach is to enforce the access gate server-side using Routing Middleware — see item 10.11 (Server-Side Access Gate via Routing Middleware). That approach checks the gate at the infrastructure level before your API code even runs. No amount of DevTools trickery can get around it. If you only implement one, implement 10.11.

If you choose not to go with the server-side approach, or if you want a lightweight cosmetic layer on top of it, the frontend pattern below still has some value. It hides the login form from casual visitors, makes your app look like a normal Google login page, and adds a psychological speed bump. It deters unsophisticated visitors who don't know about DevTools — but it stops nobody who does. Think of it as a curtain, not a wall.

**THE SOLUTION:** Add a code barrier before the OAuth login. The user sees a text input asking for an access code. They type the code and click the button. If the code is correct, they're forwarded to the normal OAuth login (Google, etc.). If the code is wrong, they're immediately blocked — no second chance, no "try again" message, just a permanent block screen.

The clever part: the button looks like a normal "Sign in with Google" button, so casual visitors think it's just a regular Google login with a password field. They don't realize the access code is checked first. This means even if someone screenshots or shares your login page, it looks like a standard Google login — not a secret access gate.

How the flow works: User enters code → frontend sends code to your serverless function → server checks code using constant-time comparison (prevents timing attacks where an attacker can guess characters based on how fast the server responds) → if correct, returns "granted" and frontend triggers the real OAuth redirect → if wrong, server blocks the IP in Redis (see item 1.21) and frontend shows a permanent block screen.

Combine this with Turnstile (item 1.20) so bots can't even attempt the code, and IP blocking (item 1.21) so humans get one shot. The result: an attacker needs to (1) be a real browser to pass Turnstile, (2) know the access code on the first try, (3) have an allowed Google account, and (4) pass Auth0's email allowlist — all without any indication of what's expected at each step.

**THE FIX:**
```
// Frontend — looks like a normal Google login with a code field
const handleAuthenticate = async () => {
  // 1. Send code + Turnstile token to server
  const resp = await fetch('/api/access-gate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      code: accessCode.trim(),
      turnstileToken: turnstileToken
    })
  });
  const data = await resp.json();

  if (data.granted) {
    // 2. Code correct — NOW trigger real OAuth
    loginWithRedirect();
  } else {
    // 3. Wrong code — show block screen
    setBlocked(true);
  }
};

// Server (Vercel serverless function)
const ACCESS_CODE = process.env.ACCESS_GATE_CODE;
const TURNSTILE_SECRET = process.env.TURNSTILE_SECRET_KEY;

export default async function handler(req, res) {
  const clientIp = getClientIp(req); // cf-connecting-ip

  // Step 1: Check if already blocked (see item 1.21)
  const blocked = await redis.get(`gate:blocked:${clientIp}`);
  if (blocked) return res.status(403).json({ granted: false });

  // Step 2: Verify the Turnstile token (helper posts to
  // Cloudflare's siteverify endpoint — see item 1.20)
  const human = await verifyTurnstile(req.body.turnstileToken, clientIp);
  if (!human) return res.status(403).json({ granted: false });

  // Step 3: Constant-time comparison of the access code
  const { timingSafeEqual } = await import("node:crypto");
  const submitted = Buffer.from(String(req.body.code || ""));
  const expected = Buffer.from(ACCESS_CODE);
  if (submitted.length === expected.length &&
      timingSafeEqual(submitted, expected)) {
    return res.status(200).json({ granted: true });
  }

  // Wrong code — permanently block this IP in Redis (see item 1.21)
  await redis.set(`gate:blocked:${clientIp}`, "1");
  return res.status(403).json({ granted: false });
}
```

*This pattern works best for single-user or small-team apps where any unauthorized access attempt is inherently suspicious. For consumer-facing apps, permanent blocking on one wrong attempt would be too aggressive — use temporary lockouts instead (item 1.5). The access code should be stored as a server-side environment variable (never in frontend code) and should be strong enough that it can't be guessed — alphanumeric, 8+ characters. Constant-time comparison (crypto.timingSafeEqual() in Node.js, hmac.compare_digest() in Python) is important because without it, an attacker can recover the code one character at a time by measuring how long the server takes to respond. Also add an admin DELETE endpoint protected by a separate admin key so you can unblock your own IP if you mistype the code during testing.*
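On a Python backend, the constant-time check described above is one stdlib call:

```
# Constant-time access-code comparison (Python side).
import hmac

def code_matches(submitted: str, expected: str) -> bool:
    # compare_digest's runtime doesn't depend on where the inputs differ,
    # so response latency leaks nothing about partial matches.
    return hmac.compare_digest(submitted.encode(), expected.encode())
```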

### 10.10. Multi-Factor Authentication (MFA / Google Authenticator)

**THE RISK:** ADDITIONAL HARDENING — MFA is strongly recommended for admin tools and apps with sensitive data, but it's an extra step that adds friction to every login. For public-facing apps with many users, consider making it optional or risk-based (only trigger MFA for suspicious logins). For private dashboards and admin panels, MFA is one of the most effective security measures you can add — it stops an attacker even if they have your password.

Passwords and access codes can be stolen — through phishing, shoulder surfing, keyloggers, or data breaches. If someone gets hold of your Google account password (through a phishing email, a leaked database, or someone watching you type), they can log into your app as you. OAuth helps because you're relying on Google's security, but if your Google account itself is compromised, every app that uses "Sign in with Google" is wide open. A password alone — no matter how strong — is a single point of failure.

**THE SOLUTION:** Add a second factor: something you physically have, not just something you know. After signing in with Google, Auth0 asks for a 6-digit code from an authenticator app on your phone (Google Authenticator, Authy, etc.). This code changes every 30 seconds and can only be generated by the app on your physical device. Even if an attacker has your Google password, they can't log in without your phone.

The setup is a one-time process: the first time you log in after MFA is enabled, you see a QR code on screen. You open Google Authenticator on your phone, tap the "+" button, scan the QR code with your camera — done. From that point on, the app shows a new 6-digit code every 30 seconds. Nobody else can generate these codes unless they scanned that same QR code at that exact moment (and the QR code is only shown once, behind your Google login, so an attacker would need access to your Google account AND be looking at your screen during setup).

The best part: if you're using Auth0, this requires zero code changes in your app. Auth0 handles the entire MFA flow — the QR code screen, the code verification, the recovery codes — all on their hosted login page. You just enable it in Auth0's settings or via an Action, and it works.

With allowRememberBrowser enabled, Auth0 remembers your browser for 30 days after a successful MFA verification. So you're not typing a 6-digit code every single login — just on new browsers, new devices, or after 30 days. Minimal friction for you, maximum friction for attackers.

**THE FIX:**
```
// OPTION 1: Enable MFA for a specific app via Auth0 Action
// (recommended — doesn't affect your other apps on the same tenant)

// In your existing post-login Action, add one line:
exports.onExecutePostLogin = async (event, api) => {
  const clientId = event.client.client_id;

  // Only require MFA for your admin dashboard
  if (clientId === 'YOUR_APP_CLIENT_ID') {
    // Require MFA — shows Google Authenticator prompt
    // allowRememberBrowser: remembers this browser for 30 days
    api.multifactor.enable('any', { allowRememberBrowser: true });
  }
};

// OPTION 2: Enable MFA for ALL apps on the tenant
// Auth0 Dashboard > Security > Multi-factor Auth
// Toggle "One-time Password" to ON
// Set policy to "Always"
// (simpler but applies to every app on this tenant)

// PREREQUISITE: Enable the OTP factor on your Auth0 tenant
// Via Auth0 Management API:
// PUT /api/v2/guardian/factors/otp  {"enabled": true}
// PUT /api/v2/guardian/factors/recovery-code  {"enabled": true}

// Or via Auth0 CLI:
// auth0 api put "guardian/factors/otp" --data '{"enabled": true}'
// auth0 api put "guardian/factors/recovery-code" --data '{"enabled": true}'

// FIRST-TIME USER EXPERIENCE:
// 1. User logs in normally (Google, email, etc.)
// 2. Auth0 shows a QR code on screen
// 3. User opens Google Authenticator > tap "+" > "Scan QR code"
// 4. Points phone camera at QR code — app starts showing 6-digit codes
// 5. User types current 6-digit code to confirm setup
// 6. Auth0 shows recovery codes (SAVE THESE — needed if phone is lost)
// 7. Done — future logins ask for the 6-digit code after OAuth

// RECOVERY: If user loses their phone
// Admin can reset MFA via Auth0 Dashboard > User Management > user > MFA
// Or via API: DELETE /api/v2/guardian/enrollments/{enrollment_id}
```

*Recovery codes are critical — if you lose your phone and don't have recovery codes, you'll need to reset MFA from the Auth0 dashboard as an admin. Save recovery codes in a password manager or print them. For the Auth0 Action approach (Option 1), the MFA line can go in the same Action as your email whitelist — no need for a separate Action. The 'any' parameter means Auth0 accepts any enabled factor (OTP, push notification, etc.). If you only want Google Authenticator specifically, keep OTP as the only enabled factor on your tenant. Auth0 free tier supports MFA with up to 7,500 MAU.*
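For intuition about what the authenticator app is doing, here is the TOTP algorithm (RFC 6238) in stdlib Python — a sketch for understanding, not something to implement yourself in production. Both sides (Auth0 and your phone) compute this independently from the shared secret in the QR code; nothing is transmitted:

```
# TOTP (RFC 6238): HMAC-SHA1 over the 30-second time counter,
# dynamically truncated to a 6-digit code.
import base64, hashlib, hmac, struct, time

def totp(secret_b32: str, now=None, step: int = 30, digits: int = 6) -> str:
    now = int(time.time()) if now is None else int(now)
    key = base64.b32decode(secret_b32, casefold=True)
    counter = struct.pack(">Q", now // step)           # 8-byte big-endian
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                         # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

Because the code is derived from `now // step`, it changes every 30 seconds, and anyone without the secret can only guess among 10^6 values within that window.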

### 10.11. Server-Side Access Gate via Routing Middleware (The Real Gate)

**THE RISK:** A frontend-only access gate (item 10.9) is cosmetic — anyone who opens DevTools can fake the API response and bypass it. The frontend JavaScript trusts whatever fetch() returns, so an attacker intercepts the response in the Network tab, changes the status to 200, and they're in. This was confirmed during penetration testing: the tester bypassed the frontend gate in under a minute using Chrome DevTools response overrides.

The real security gap: even after bypassing the frontend gate, the attacker reaches your actual API endpoints. If those endpoints rely on JWT auth alone, the attacker still can't read data — but they can probe endpoints, discover your API surface, and test for misconfigurations. A server-side gate adds a layer that runs before your API code even executes, blocking unauthorized requests at the infrastructure level.

Real-world learning: during a security audit, a frontend-only gate was bypassed trivially. The fix was adding server-side enforcement via Vercel Routing Middleware — a middleware layer that intercepts every request before it reaches serverless functions. After deployment, the same pen tester confirmed they could no longer reach any API endpoint without passing the gate.

**THE SOLUTION:** Use Vercel Routing Middleware (middleware.ts at the project root) to enforce the access gate server-side. This middleware runs on Vercel's Edge network before every request reaches your serverless functions. It checks Redis for two keys per client IP:

1. gate:blocked:{ip} — if present, return 403 immediately (permanent block from wrong access code)
2. gate:passed:{ip} — if NOT present, return 403 (hasn't passed the access gate yet)

Only if the IP has passed the gate AND is not blocked does the request continue to your serverless function. This cannot be bypassed from the browser because the middleware runs at Vercel's infrastructure level — the request never reaches your JavaScript code unless the IP is pre-authorized in Redis.

Key design decisions:
- Exempt the gate endpoint itself (/api/access-gate) so users can submit their code
- Exempt cron job requests (valid CRON_SECRET in Authorization header)
- If your app receives webhooks, exempt only specific webhook paths (see note below)
- Use Upstash Redis REST API directly (no SDK) for Edge compatibility
- Extract real client IP from cf-connecting-ip header when behind Cloudflare (not x-forwarded-for, which contains the Cloudflare proxy IP)
- Fail open if Redis is unreachable (let the request through to the JWT auth layer rather than locking everyone out)

The middleware does NOT count toward Vercel's 12-serverless-function limit on the Hobby plan — it runs on the Edge runtime, separate from your serverless functions.

Combined with the frontend gate (item 10.9), this creates defense-in-depth: frontend gate handles UX (hides login, shows block screen, deters casual visitors), server-side gate handles security (blocks unauthorized API access at the infrastructure level).

**THE FIX:**
```
// middleware.ts — at project root (not in /api/)
// Vercel Routing Middleware — runs before every request
import { next } from '@vercel/functions'

const KV_REST_API_URL = process.env.KV_REST_API_URL || ''
const KV_REST_API_TOKEN = process.env.KV_REST_API_TOKEN || ''
const CRON_SECRET = process.env.CRON_SECRET || ''

// Upstash REST API (Edge-compatible, no SDK needed)
async function redisGet(key: string): Promise<string | null> {
  if (!KV_REST_API_URL || !KV_REST_API_TOKEN) return null
  try {
    const resp = await fetch(
      `${KV_REST_API_URL}/get/${encodeURIComponent(key)}`,
      { headers: { Authorization: `Bearer ${KV_REST_API_TOKEN}` } }
    )
    if (!resp.ok) return null
    return (await resp.json()).result ?? null
  } catch {
    return null // fail open — the JWT auth layer still protects data
  }
}

export default async function middleware(req: Request) {
  const { pathname } = new URL(req.url)
  if (pathname === '/api/access-gate') return next() // users must reach the gate
  if (CRON_SECRET && req.headers.get('authorization') === `Bearer ${CRON_SECRET}`) return next()

  // Real visitor IP — cf-connecting-ip when behind Cloudflare
  const ip = req.headers.get('cf-connecting-ip')
    ?? req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() ?? ''

  if (await redisGet(`gate:blocked:${ip}`)) return new Response('Forbidden', { status: 403 })
  if (!(await redisGet(`gate:passed:${ip}`))) return new Response('Forbidden', { status: 403 })
  return next() // gate passed — continue to serverless functions
}
```

*Real-world gotcha: Vercel's built-in ipAddress() function from @vercel/functions reads the wrong header when behind Cloudflare — it returns the Cloudflare proxy IP (172.70.x.x range) instead of the real visitor IP. Always extract the IP manually from cf-connecting-ip first. This was discovered during testing when the middleware was blocking the wrong IPs. Also: do NOT add localhost exemptions (127.0.0.1, ::1) — an attacker who gains any form of server-side access could exploit a localhost exemption to bypass the gate entirely. Test directly against the remote deployment. The access gate endpoint (/api/access-gate) handles setting the Redis keys when a user enters the correct code — see item 10.9 for the code verification flow.*

*Webhook consideration: if your app receives webhooks from external services (Stripe, Calendly, GitHub), those requests come from the provider's servers — they have no user IP context and will fail the gate check. You'll need to exempt those paths, but do it by specific path only — never use a generic query parameter like ?action=webhook to bypass the gate, because an attacker could append that parameter to any endpoint URL and skip the gate entirely. Example: `const WEBHOOK_PATHS = ['/api/webhooks/stripe', '/api/webhooks/calendly']; if (WEBHOOK_PATHS.includes(pathname)) return next()`. The webhook endpoint itself must have its own authentication (provider signature verification) — the gate exemption only gets the request to the endpoint, it doesn't grant access to data.*

---
Source: tigzig.com/security