# Incident Response - Security Checklist

Being ready when something goes wrong: logging, alerting, an incident playbook, and safe rollback.

Part of the TIGZIG security checklist (112 items across 12 categories, distilled from hardening 20+ live microservices). Full checklist: https://www.tigzig.com/security

### 7.1. Incident Response Protocol

**THE RISK:** The instinct during an attack is to immediately restart containers, redeploy, or change code. This destroys logs and evidence. But blindly preserving logs while data is being stolen is equally wrong. The correct response depends on what's happening - is there active damage right now, or are you investigating something that already happened?

**THE SOLUTION:** First, assess: is the attack ACTIVE (data being exfiltrated, records being deleted, resources being drained right now) or DISCOVERED (you found evidence of a past breach, probing, or vulnerability)? For active damage, stop the bleeding first - shut down the service or block at the firewall immediately. Logs are worthless if the attacker is still inside deleting data. For discovered incidents (post-breach investigation, reconnaissance detected in logs, vulnerability found), preserve evidence first - don't restart anything until you've saved the logs.

**THE FIX:**
```
ACTIVE ATTACK (data theft, deletion, resource drain):
1. KILL   - shut down the service or block at Cloudflare firewall
             (stop the damage NOW - logs are secondary)
2. DUMP   - save container logs, API monitor, CF analytics,
             pg_stat_activity (stop containers, don't remove them:
             removal or redeploy discards their logs)
3. ASSESS - what was accessed, what was taken
4. HARDEN - fix the vulnerability before restarting

DISCOVERED INCIDENT (past breach, probing, vuln found):
1. STOP   - do NOT restart or redeploy (preserves logs)
2. DUMP   - container logs, API monitor, CF analytics,
             pg_stat_activity
3. BLOCK  - Cloudflare firewall rules per IP (faster than app-level)
4. HARDEN - fix the vulnerability, then deploy
```

*The difference matters: if someone is actively draining your database, don't sit there downloading logs while they work. Pull the plug, then investigate. But if you found a suspicious log entry from yesterday, restarting containers before saving logs destroys your only evidence of what happened.*
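The two branches can be encoded as a small lookup so a runbook script prints the right order under pressure. This is a minimal sketch: `IncidentKind`, `PLAYBOOK`, and `response_steps` are hypothetical names, and the step labels simply mirror the checklist.

```python
from enum import Enum


class IncidentKind(Enum):
    ACTIVE = "active"          # damage is happening right now
    DISCOVERED = "discovered"  # evidence of something that already happened


# Step order mirrors the checklist; the labels are abbreviated.
PLAYBOOK = {
    IncidentKind.ACTIVE: ["KILL", "DUMP", "ASSESS", "HARDEN"],
    IncidentKind.DISCOVERED: ["STOP", "DUMP", "BLOCK", "HARDEN"],
}


def classify(damage_in_progress: bool) -> IncidentKind:
    """Pick the branch: the only question is whether damage is ongoing."""
    if damage_in_progress:
        return IncidentKind.ACTIVE
    return IncidentKind.DISCOVERED


def response_steps(damage_in_progress: bool) -> list[str]:
    """Return a copy of the ordered steps for the matching branch."""
    return list(PLAYBOOK[classify(damage_in_progress)])


if __name__ == "__main__":
    print(response_steps(damage_in_progress=True))
    print(response_steps(damage_in_progress=False))
```

Keeping the decision to a single boolean is deliberate: the only thing to settle in the first minute is whether damage is still in progress.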

### 7.2. Credential Rotation After Exposure

**THE RISK:** A secret committed to git - even briefly, even in a private repo - should be considered compromised. On public repos, bots that scrape GitHub for secrets pick them up within minutes; on private repos, the secret still reaches every clone, fork, and CI runner with read access. Deleting the file doesn't help, because the secret stays in git history until it is explicitly scrubbed.

**THE SOLUTION:** The moment you discover a leaked credential, generate a new one immediately - don't wait to investigate first. Put the new credential into every environment (Vercel, Coolify, local), then revoke the old one from the provider's dashboard. Check the provider's access logs to see whether anyone used the leaked credential while it was exposed. If the repo might go public, scrub the secret from git history too.

**THE FIX:**
```
1. Rotate the credential IMMEDIATELY (generate new)
2. Update env vars in all environments
   (Vercel, Coolify, local)
3. Revoke the old credential from provider dashboard
4. Check git history for other exposures:
   git log --all --oneline -- ":(glob)**/.*env*"
   (":(glob)" makes **/ match the repo root as well)
5. Audit provider access logs for unauthorized usage
   during the exposure window

# Scrub from git history if repo may go public:
# Use BFG Repo-Cleaner or git filter-repo
```
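Auditing the exposure window comes down to filtering the provider's access log by time and by caller. The sketch below assumes you have already exported the log into simple records. `AccessEvent` and `suspicious_usage` are hypothetical names, and the timestamps and IPs (taken from documentation ranges) are made-up sample data, not real provider output.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessEvent:
    timestamp: datetime  # when the provider saw the credential used
    source_ip: str       # caller IP as reported by the provider


def suspicious_usage(events, exposed_at, revoked_at, known_ips):
    """Return events inside the exposure window that came from unknown IPs."""
    return [
        e for e in events
        if exposed_at <= e.timestamp <= revoked_at
        and e.source_ip not in known_ips
    ]


if __name__ == "__main__":
    utc = timezone.utc
    # Illustrative sample data only.
    events = [
        AccessEvent(datetime(2024, 1, 10, 9, 0, tzinfo=utc), "203.0.113.7"),
        AccessEvent(datetime(2024, 1, 10, 9, 5, tzinfo=utc), "198.51.100.20"),
        AccessEvent(datetime(2024, 1, 12, 9, 0, tzinfo=utc), "198.51.100.20"),
    ]
    hits = suspicious_usage(
        events,
        exposed_at=datetime(2024, 1, 10, 8, 0, tzinfo=utc),
        revoked_at=datetime(2024, 1, 10, 10, 0, tzinfo=utc),
        known_ips={"203.0.113.7"},  # your own servers and CI
    )
    for e in hits:
        print(e.timestamp.isoformat(), e.source_ip)
```

Use timezone-aware timestamps throughout. Provider logs are often in UTC while commit times are local, and mixing the two shifts the window.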

### 7.3. Secret Rotation Plan

**THE RISK:** Most teams only rotate credentials after a breach. But secrets accumulate risk over time - they get shared in Slack, copied to local machines, cached in CI pipelines, and stored in browser password managers. The longer a secret lives, the more places it exists and the more likely it has leaked somewhere you don't know about.

**THE SOLUTION:** Have a documented plan for how to rotate each credential your app depends on: database passwords, API keys, OAuth client secrets, signing keys. Know the steps BEFORE you need them - during a breach is the wrong time to figure out the process. Ideally, rotate proactively on a schedule (quarterly for high-value secrets). At minimum, know which environment variables need updating when a key changes, and test that your app handles the rotation without downtime.

**THE FIX:**
```
# Secret rotation checklist per credential:
# 1. Where is it stored? (Vercel, Coolify, .env, CI)
# 2. What breaks if I change it? (which services)
# 3. Can I rotate without downtime?
#    (some services support two active keys)
# 4. How do I generate a new one?
#    (provider dashboard, CLI, API)
# 5. What environments need updating?
#    (prod, staging, local, CI)

# Practical steps:
# - Document rotation procedure for each key
# - Keep a list: which secrets exist, where, last rotated
# - After any team member leaves: rotate shared secrets
# - After any security incident: rotate everything
```

*This is one of those things most solo developers and small teams skip. Having the plan documented matters more than perfect execution - when a breach happens, you need to move fast, not figure out the steps.*
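The "which secrets exist, where, last rotated" list can live in a small script as easily as in a spreadsheet. This is a minimal sketch under stated assumptions: `SecretRecord` and `overdue` are hypothetical names, the secret names and dates are invented examples, and "quarterly" is approximated as 90 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SecretRecord:
    name: str           # e.g. the environment variable name
    stored_in: tuple    # every place that holds a copy
    last_rotated: date


def overdue(records, today, max_age=timedelta(days=90)):
    """Return names of secrets whose last rotation is older than max_age."""
    return [r.name for r in records if today - r.last_rotated > max_age]


if __name__ == "__main__":
    # Illustrative inventory only.
    inventory = [
        SecretRecord("DATABASE_URL", ("Coolify", "local"), date(2024, 1, 5)),
        SecretRecord("PAYMENTS_API_KEY", ("Vercel", "CI"), date(2024, 5, 20)),
    ]
    print(overdue(inventory, today=date(2024, 6, 1)))
```

Store only names and locations in a file like this, never the secret values themselves.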

### 7.4. Incident Investigation - Use Your Centralized Logs

**THE RISK:** When an incident happens, your first instinct is to SSH into the affected container and grep logs. With 30+ backends, this takes hours - and by the time you find the relevant entries, the attacker may have moved to another service. Container logs are also lost on redeploy. If you don't have centralized logging set up BEFORE the incident, you're doing forensics blind.

**THE SOLUTION:** Your centralized API logging dashboard (item 2.14) is the first place to go during any incident. Filter by the suspicious IP across ALL backends to see the full attack timeline - what they probed first, which endpoints they hit, what payloads they sent. The 30-day window of request bodies and client IPs gives you the raw evidence. Cross-reference with Cloudflare analytics (which shows blocked requests the backend never saw) and pg_stat_activity (for active database queries). This is why centralized logging must be set up before you need it - during an incident is too late.

**THE FIX:**
```
# Incident investigation sequence:

# 1. Centralized API dashboard - filter by suspicious IP
#    See: all endpoints hit, payloads sent, status codes, timing
#    Cross-backend: did this IP probe other services?

# 2. Cloudflare analytics - what was BLOCKED
#    Dashboard > Analytics > filter by IP
#    Shows requests that never reached your backend
#    (WAF blocks, rate limit blocks, challenge failures)

# 3. Database activity - live sessions only, not history
#    SELECT * FROM pg_stat_activity WHERE state = 'active';
#    (drop the WHERE to see each idle session's last query)
#    Check for: long-running queries, unusual query patterns

# 4. Container logs - only if centralized logs are incomplete
#    docker logs --since 2h <container> 2>&1 | grep -F '<ip>'
#    WARNING: lost on redeploy, this is your last resort

# 5. Timeline reconstruction
#    Combine all sources into a single timeline:
#    IP → first probe → escalation → data access → blocking
```

*The centralized logging pipeline (item 2.14) must be set up and running BEFORE an incident. During a breach you need answers in minutes, not hours. If you're setting up logging after discovering an attack, you've already lost the evidence from the attack window.*
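Timeline reconstruction is a merge-and-sort over whatever each source exports. The sketch below assumes each source has already been parsed into records with a timestamp and client IP. `LogEvent` and `build_timeline` are hypothetical names, and the events are invented samples.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEvent:
    timestamp: datetime  # timezone-aware, so sources compare correctly
    source: str          # "api", "cloudflare", "postgres", "container"
    ip: str
    detail: str


def build_timeline(ip, *sources):
    """Merge events from every source, keep one IP, sort oldest first."""
    merged = [e for events in sources for e in events if e.ip == ip]
    return sorted(merged, key=lambda e: e.timestamp)


if __name__ == "__main__":
    utc = timezone.utc
    # Illustrative sample data only.
    api = [
        LogEvent(datetime(2024, 3, 1, 12, 4, tzinfo=utc), "api",
                 "198.51.100.20", "POST /login 401"),
        LogEvent(datetime(2024, 3, 1, 12, 9, tzinfo=utc), "api",
                 "198.51.100.20", "GET /export 200"),
    ]
    cloudflare = [
        LogEvent(datetime(2024, 3, 1, 12, 1, tzinfo=utc), "cloudflare",
                 "198.51.100.20", "WAF block /.env"),
        LogEvent(datetime(2024, 3, 1, 12, 2, tzinfo=utc), "cloudflare",
                 "203.0.113.7", "rate limit"),
    ]
    for e in build_timeline("198.51.100.20", api, cloudflare):
        print(e.timestamp.isoformat(), e.source, e.detail)
```

Sorting across sources only works if every timestamp is in the same timezone, so convert to UTC before merging.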
