# Perimeter and Edge Protection (Cloudflare) - Security Checklist

Cloudflare edge protection: WAF, rate limiting, bot and abuse controls, and keeping the origin off the open internet.

Part of the TIGZIG security checklist (112 items across 12 categories, distilled from hardening 20+ live microservices). Full checklist: https://www.tigzig.com/security

### 5.1. Cloudflare Proxy (Orange Cloud)

**THE RISK:** DNS records set to "grey cloud" (DNS-only) expose your server's real IP address. Once an attacker knows the IP, they can bypass Cloudflare entirely - all DDoS protection, WAF rules, and rate limiting become useless. They hit your server directly. This is the #1 real-world IP spoofing risk: without Cloudflare in the path, CF-Connecting-IP can be trivially spoofed (it's just an HTTP header), and your get_client_ip function that trusts it first will accept any value the attacker sends.

**THE SOLUTION:** In your Cloudflare DNS settings, make sure every record that points at your own servers has the orange cloud icon (proxied mode) turned on - not the grey cloud (DNS-only). When proxied, all traffic goes through Cloudflare first, which hides your server's real IP address and applies all your security rules. With the grey cloud, your server's IP is visible to anyone who resolves the name. For defense in depth: configure your server firewall to ONLY accept connections from Cloudflare's published IP ranges (see 5.11), and enable Cloudflare Authenticated Origin Pulls. This way, even if the origin IP leaks, direct connections are rejected.

**THE FIX:**
```
Every record that points at your own servers
must be orange cloud (proxied). In Cloudflare
DNS settings, toggle the proxy icon to orange
for each such A/AAAA/CNAME record.
```
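
The risk above mentions a `get_client_ip` function that trusts `CF-Connecting-IP` first. A minimal sketch of the safer pattern follows: only honour the header when the TCP peer is itself inside a Cloudflare range. The function names and the two-entry range list are assumptions for illustration (load the full, current list from Cloudflare in practice), and the sketch handles IPv4 only.

```javascript
// Illustrative subset only - load the full, current list from https://www.cloudflare.com/ips
const CLOUDFLARE_RANGES = ["173.245.48.0/20", "103.21.244.0/22"];

// Convert dotted IPv4 to an unsigned 32-bit integer; returns null if malformed.
function ipv4ToInt(ip) {
  const parts = String(ip).split(".");
  if (parts.length !== 4) return null;
  let n = 0;
  for (const p of parts) {
    if (!/^\d{1,3}$/.test(p) || Number(p) > 255) return null;
    n = n * 256 + Number(p);
  }
  return n;
}

// True when `ip` falls inside `cidr` (IPv4 only in this sketch).
function inCidr(ip, cidr) {
  const [base, bitsStr] = String(cidr).split("/");
  const bits = Number(bitsStr);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  const size = 2 ** (32 - bits); // number of addresses in the block
  return Math.floor(ipInt / size) === Math.floor(baseInt / size);
}

// peerIp: the address of the TCP connection; headers: lowercase-keyed object.
// The header is believed only when the peer is a Cloudflare address.
function getClientIp(peerIp, headers, trustedRanges = CLOUDFLARE_RANGES) {
  const fromCloudflare = trustedRanges.some((cidr) => inCidr(peerIp, cidr));
  const claimed = headers["cf-connecting-ip"];
  return fromCloudflare && claimed ? claimed : peerIp;
}
```

In a Node server this might be called as `getClientIp(req.socket.remoteAddress, req.headers)`. Anything the sketch cannot parse (IPv6, IPv4-mapped addresses) is treated as untrusted, so it fails towards using the peer address.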

### 5.2. Grey Cloud Audit - Know What Bypasses Cloudflare

**THE RISK:** Some subdomains must be grey cloud (DNS-only) because third-party services like Auth0, Clerk, or AWS RDS require direct CNAME resolution. These subdomains bypass all Cloudflare protection - no rate limiting, no WAF, no IP blocking. If you don't track which domains are grey and why, you lose visibility into your actual attack surface.

**THE SOLUTION:** Maintain a list of every grey-clouded subdomain and the reason it must be grey (e.g., "auth.example.com - Auth0 requires direct CNAME"). Periodically audit this list: if a service is decommissioned, either delete the DNS record or switch it to orange cloud. Ensure no origin servers (your VPS, your hosting) are ever exposed through grey-cloud records - only third-party CNAMEs should be grey.

**THE FIX:**
```
# Audit grey-cloud records via Cloudflare API
# List all DNS records and filter for proxied=false
curl -s "https://api.cloudflare.com/client/v4/zones/{zone_id}/dns_records" \
  -H "Authorization: Bearer {token}" | \
  jq '.result[] | select(.proxied==false) | {name, type, content}'

# Expected grey cloud: third-party CNAMEs only
# auth.example.com    -> Auth0 tenant (required)
# clerk.example.com   -> Clerk frontend (required)
# db.example.com      -> managed DB (required)

# NOT acceptable: your VPS IP exposed via grey cloud
# app.example.com -> 1.2.3.4  (must be orange!)
```

*After decommissioning a service, delete its DNS record entirely rather than leaving a grey-cloud orphan pointing to a defunct server. Stale DNS records pointing to recycled IPs can be hijacked (subdomain takeover).*
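
The audit rule above (grey is acceptable only for documented third-party CNAMEs) can be sketched as a small classifier over the records the API returns. The `APPROVED_GREY` list and the severity labels are hypothetical; the record shape (`name`, `type`, `content`, `proxied`) follows the fields used in the `jq` filter.

```javascript
// Hypothetical register of reviewed grey-cloud names and why each must stay grey.
const APPROVED_GREY = new Map([
  ["auth.example.com", "Auth0 requires direct CNAME"],
  ["clerk.example.com", "Clerk frontend requires direct CNAME"],
]);

// records: array shaped like the `result` entries of the dns_records API call.
function auditGreyRecords(records, approved = APPROVED_GREY) {
  const findings = [];
  for (const r of records) {
    if (r.proxied !== false) continue; // already orange
    if (r.type === "A" || r.type === "AAAA") {
      // A grey address record publishes an IP directly - never acceptable for your origin.
      findings.push({ name: r.name, severity: "high", reason: `grey ${r.type} exposes ${r.content}` });
    } else if (r.type === "CNAME" && !approved.has(r.name)) {
      findings.push({ name: r.name, severity: "review", reason: "grey CNAME with no documented reason" });
    }
    // Other types (MX, TXT, ...) cannot be proxied, so they are out of scope here.
  }
  return findings;
}
```

Feeding it the parsed API response and failing the job when any `high` finding appears turns the periodic audit into something a cron can enforce.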

### 5.3. Browser Integrity Check + Security Level

**THE RISK:** Bots and automated tools use fake or missing User-Agent headers. Without Browser Integrity Check, these requests reach your application unchallenged. The Security Level setting controls how aggressively Cloudflare challenges suspicious IPs - the default is too permissive for sites under active attack.

**THE SOLUTION:** Turn on three settings in Cloudflare: Browser Integrity Check (blocks requests with suspicious or missing browser headers), Bot Fight Mode (challenges known bot fingerprints), and set the Security Level to Medium or High (challenges visitors from IP addresses with a bad reputation). These are toggle switches in your Cloudflare dashboard - no code needed.

**THE FIX:**
```
Cloudflare dashboard > Security > Settings:

- Browser Integrity Check: ON
  (blocks suspicious User-Agent headers)
- Security Level: "Medium" or "High"
  ("I'm Under Attack" mode for active DDoS only)
- Bot Fight Mode: ON
  (challenges known bot fingerprints)
```

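*For API endpoints called by legitimate bots or AI agents, create a WAF custom rule with the Skip action matching the API path so they aren't blocked. You get 5 free WAF custom rules - use one for bot exemptions on specific paths. One caveat: per Cloudflare's documentation, the free Bot Fight Mode cannot be skipped by custom rules, so if it blocks a client you need, you may have to turn it off for the zone (Super Bot Fight Mode on paid plans can be skipped).*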

### 5.4. JS Challenge WAF Rule on Frontends

**THE RISK:** Browser Integrity Check and Bot Fight Mode catch known bad signatures, but many automated tools (scrapers, vulnerability scanners, credential stuffers) use real browser headers. They pass basic checks and hit your frontends unchallenged. Without a JS challenge, any script that can send HTTP requests can access your application pages.

**THE SOLUTION:** Create a Cloudflare WAF custom rule that applies a JS Challenge to all your frontend domains. Real browsers solve the challenge transparently (invisible to users). Plain HTTP scripts and simple bots cannot execute JavaScript, so they never get past the challenge page (Cloudflare answers them with a 403). A full headless browser can still solve it, so treat this as a filter for cheap automation, not a guarantee. List all your frontend hostnames in the rule expression. If you have API paths that legitimate bots need to access (AI agents, webhooks, health checks), add path exclusions to the same rule so those paths are served normally without a challenge.

**THE FIX:**
```
Cloudflare dashboard > Security > WAF > Custom rules

Action: JS Challenge
Expression example (multiple frontends):
  (http.host eq "app.example.com") or
  (http.host eq "dashboard.example.com") or
  (http.host eq "www.example.com" and
    not starts_with(http.request.uri.path, "/api/webhook") and
    not starts_with(http.request.uri.path, "/robots.txt"))

# Browsers: pass transparently (no visible CAPTCHA)
# Bots/scrapers that cannot execute JS: stopped at the challenge
# Excluded paths: skip the challenge and are served normally
```

*Free plan gives you 5 custom WAF rule slots. A single rule can cover dozens of frontends using OR expressions. Add path exclusions for any URLs that need to be accessible to bots - sitemaps, robots.txt, API webhooks, AI agent endpoints. Keep backend domains out of this rule - they don't serve HTML and JS challenges break API clients.*
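
With dozens of frontends, writing the OR expression by hand gets error-prone. A minimal sketch of generating it from a host list follows. The function name is an assumption, and unlike the example above it applies the path exclusions to every listed host, not just one.

```javascript
// Build a WAF custom-rule expression: challenge all listed hosts,
// except requests whose path starts with one of the excluded prefixes.
function buildChallengeExpression(hosts, excludedPrefixes = []) {
  for (const h of hosts) {
    if (!/^[a-z0-9.-]+$/.test(h)) throw new Error(`unexpected hostname: ${h}`);
  }
  for (const p of excludedPrefixes) {
    if (!p.startsWith("/") || /["\\]/.test(p)) throw new Error(`unexpected path prefix: ${p}`);
  }
  const hostClause = hosts.map((h) => `(http.host eq "${h}")`).join(" or ");
  const exclusions = excludedPrefixes
    .map((p) => `not starts_with(http.request.uri.path, "${p}")`)
    .join(" and ");
  return exclusions ? `(${hostClause}) and ${exclusions}` : `(${hostClause})`;
}
```

The output is pasted into the rule's expression editor (or sent through the API); the validation is deliberately strict so a stray quote cannot change the meaning of the rule.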

### 5.5. Native WAF Rate Limit Rule

**THE RISK:** Cloudflare's free tier provides DDoS protection at the network level but NOT HTTP-level (L7) rate limiting by default. Without a WAF rate limit rule, an attacker can flood your application endpoints with legitimate-looking HTTP requests that pass through Cloudflare unblocked.

**THE SOLUTION:** Create a rate limiting rule in Cloudflare's WAF that blocks any IP address exceeding a threshold (e.g., 50 requests per 10 seconds). This stops HTTP-level floods before they reach your server. The free tier allows one rate limit rule with a minimum 10-second window. Make it zone-wide so it covers all subdomains.

**THE FIX:**
```
Cloudflare dashboard > Security > WAF > Rate limiting rules

Rule: 50 requests per 10 seconds per IP
Action: Block for 10 seconds
Characteristics: MUST include cf.colo.id

Expression: (true)  - applies to all subdomains

Note: Free tier = 1 rule only, 10s minimum period.
```

*cf.colo.id is required because rate counting is per Cloudflare data center (PoP), not global. Make the rule zone-wide (expression: true) so every subdomain is covered. If you need per-domain thresholds, you'll need Workers (see next item). This native rule is your safety net - it catches volumetric floods that bypass more granular controls.*

### 5.6. Edge Rate Limiting via Cloudflare Workers

**THE RISK:** The free WAF rate limit (one rule, one threshold) treats all domains equally. But frontends need higher limits than backends - a React app sends 30-40 requests on page load (JS, CSS, images, API calls), while a backend API might only need 10 requests per minute per user. A single threshold either blocks legitimate frontend users or leaves backends wide open.

**THE SOLUTION:** Deploy a Cloudflare Worker that intercepts traffic at the edge and applies different rate limit thresholds per domain. Use Cloudflare's native rate limit bindings (defined in wrangler.toml) so counters are managed at the edge with zero additional latency. Map each domain to a tier - for example, frontends at 150 requests/60 seconds, backends at 10 requests/60 seconds, sensitive endpoints at 20 requests/60 seconds. Add a wildcard catch-all so new domains automatically get a moderate default.

**THE FIX:**
```
# wrangler.toml - define rate limit tiers as bindings
[[unsafe.bindings]]
name = "BACKEND_STRICT"
type = "ratelimit"
namespace_id = "1"
simple = { limit = 10, period = 60 }

[[unsafe.bindings]]
name = "FRONTEND_STANDARD"
type = "ratelimit"
namespace_id = "2"
simple = { limit = 150, period = 60 }

# src/index.js - map domains to tiers
const DOMAIN_CONFIG = {
  "api.example.com":       { limiter: "BACKEND_STRICT" },
  "dashboard.example.com": { limiter: "FRONTEND_STANDARD" },
  // ... add each domain
};

// Worker checks rate limit, returns 429 if exceeded
// Domains not in config get a moderate catch-all tier
// Deploy: npx wrangler deploy
// Add Worker Route per domain in Cloudflare dashboard
```

*Counters are per-IP, per-domain, per Cloudflare PoP (data center) - not globally synced. This means a determined attacker hitting from multiple locations gets the limit at each PoP separately. For most use cases this is fine. The catch-all wildcard route ensures no new subdomain goes unprotected - it gets a moderate default immediately. Free plan allows 100,000 Worker invocations/day across all Workers.*
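
The fix above leaves the Worker body as comments. A minimal sketch of what it could look like follows, assuming the two bindings from the `wrangler.toml` snippet and the rate limit binding's `limit({ key })` call. Using `FRONTEND_STANDARD` as the catch-all tier is an assumption; a dedicated moderate tier would need its own binding.

```javascript
const DOMAIN_CONFIG = {
  "api.example.com": { limiter: "BACKEND_STRICT" },
  "dashboard.example.com": { limiter: "FRONTEND_STANDARD" },
};
const DEFAULT_LIMITER = "FRONTEND_STANDARD"; // assumption: catch-all tier for unlisted domains

// Map a hostname to the name of the binding that should count it.
function pickLimiter(host) {
  const known = Object.prototype.hasOwnProperty.call(DOMAIN_CONFIG, host);
  return known ? DOMAIN_CONFIG[host].limiter : DEFAULT_LIMITER;
}

// Ask the binding whether this (domain, IP) pair is still under its limit.
async function isAllowed(env, host, ip) {
  const { success } = await env[pickLimiter(host)].limit({ key: `${host}:${ip}` });
  return success;
}

const worker = {
  async fetch(request, env) {
    const host = new URL(request.url).hostname;
    const ip = request.headers.get("CF-Connecting-IP") || "unknown";
    if (!(await isAllowed(env, host, ip))) {
      return new Response("Too Many Requests", { status: 429 });
    }
    return fetch(request); // under the limit: pass through to the origin
  },
};
// In the deployed module this object would be the default export: `export default worker;`
```

Keying the counter on `host:ip` is what makes the limit per-IP and per-domain, matching the note above.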

### 5.7. Zone-Level IP Blocking

**THE RISK:** When you identify a malicious IP or subnet (from attack logs, vulnerability scanners, or brute-force attempts), blocking it at the application level still lets the traffic reach your server - consuming bandwidth and CPU for every rejected request. The attacker can also try other subdomains or endpoints.

**THE SOLUTION:** Block malicious IPs at the Cloudflare zone level using IP Access Rules. Traffic from blocked IPs is dropped at the edge before it reaches your server - zero bandwidth, zero CPU cost. Block entire /24 ranges when an attacker operates from a known bulletproof hosting provider. Document each block with the reason and evidence so you can audit and clean up later.

Deciding WHICH networks to block, and for how long, matters as much as the mechanics. Before blocking a whole network, sort it by what it actually is. A network on a vetted criminal list (see 5.10) is safe to block wholesale. A mainstream cloud (AWS, Azure, Google, big VPS hosts) is the opposite: real users and the AI/search bots you want live there too, so block only the individual offending IP, never the whole network. A consumer ISP is never a whole-network block, or you wall off real people on infected home machines. That leaves grey, abuse-tolerant cheap hosts, where you use judgment and verify first: check your own logs for any legitimate (2xx) traffic ever from that network. Zero legit traffic means blocking it costs you nothing; any real traffic means drop down to per-IP.

Match the block durability to where the badness lives, because a public IP is not permanently tied to one person. Datacenter and VPS IPs are pooled and reassigned: an attacker holds an IP today, gives it up, and weeks later an innocent customer inherits it. So a single IP should only ever be blocked temporarily and auto-expire, or you eventually wall out a stranger who did nothing. A whole network can stay blocked far longer, because there the badness belongs to the provider, not one renter, and innocent people do not host on abuse-run networks. The ladder: a repeat single-IP offender gets escalating but still auto-expiring blocks; a grey subnet under review gets a medium, auto-expiring block so a wrong call self-corrects; only a confirmed bad provider earns a standing, indefinite block with periodic review.

**THE FIX:**
```
# Block an IP range at Cloudflare edge
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/firewall/access_rules/rules" \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "block",
    "configuration": {
      "target": "ip_range",
      "value": "185.177.72.0/24"
    },
    "notes": "Bulletproof hosting. Path traversal attack on 2026-03-01."
  }'

# List all blocked IPs/ranges
curl -s "https://api.cloudflare.com/client/v4/zones/{zone_id}/firewall/access_rules/rules" \
  -H "Authorization: Bearer {token}" | jq '.result[] | {value: .configuration.value, notes}'
```

*Always document the reason in the notes field - date, what was attacked, evidence. Periodically review blocked ranges: if you have dozens of stale blocks from years ago, clean them up. Blocking a /24 is appropriate for bulletproof hosting providers (known for harboring malicious actors) but avoid blocking large ISP ranges as you'll catch legitimate users. Two rules learned here. First, the zero-legit-traffic test makes a grey-network block safe: if a network has only ever sent you errors and never a single successful request, you can block it with no collateral. Second, never permanently block a single datacenter IP, because it will be reassigned and a permanent block becomes a trap for whoever inherits it. Reserve indefinite blocks for whole networks you have confirmed are abuse-run; everything at the single-IP level should auto-expire.*
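
The decision ladder described in this item can be written down as a small function, which makes the policy reviewable and testable. This is only a sketch: the category names, the 1-day starting block, the doubling, and the 30-day cap are all assumptions chosen for illustration, not values from our setup.

```javascript
const DAY = 24 * 60 * 60 * 1000;

// kind: "criminal-list" | "mainstream-cloud" | "consumer-isp" | "grey-host"
// legitRequests: count of 2xx responses ever served to that network (from your logs)
// priorOffenses: how many times this single IP has been blocked before
function decideBlock({ kind, confirmedAbuseRun = false, legitRequests = 0, priorOffenses = 0 }) {
  // Vetted criminal networks and confirmed abuse-run providers: standing network block.
  if (kind === "criminal-list" || (kind === "grey-host" && confirmedAbuseRun)) {
    return { scope: "network", ttlMs: null, periodicReview: true };
  }
  // Grey host with zero legitimate traffic: medium, auto-expiring network block.
  if (kind === "grey-host" && legitRequests === 0) {
    return { scope: "network", ttlMs: 30 * DAY, periodicReview: true };
  }
  // Everything else (clouds, ISPs, grey hosts with real users): per-IP only,
  // escalating for repeat offenders but always auto-expiring.
  const ttlMs = Math.min(DAY * 2 ** priorOffenses, 30 * DAY);
  return { scope: "ip", ttlMs, periodicReview: false };
}
```

A `ttlMs` of `null` stands for the indefinite block; every other result carries an expiry that a cleanup job can act on.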

### 5.8. Honeypot Deception Endpoints

**THE RISK:** Automated scanners hammer every site looking for leaked secrets and misconfigurations - requesting paths like /.env, /wp-config.php, /.git/config, /.aws/credentials. A normal 404 tells them nothing and they keep probing. The day one of those files is genuinely exposed, it is game over. You want to catch the scanner the instant it reveals itself, not after it finds something.

**THE SOLUTION:** Serve deception endpoints at well-known scanner paths. No human or legitimate app ever requests /.env - so a single hit is a 100%-confidence attack signal. On a hit, block the source IP at the edge immediately and return convincing-but-fake bait (so the attacker wastes time analysing junk instead of realising they were detected). Run the check before your rate limiter so a scanner is caught on request number one, not after it has burned through a rate-limit window.

**THE FIX:**
```
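// At the edge (Cloudflare Worker) or in middleware, before rate-limiting:
const BAIT_PATHS = ["/.env", "/wp-config.php", "/.git/config",
                    "/.aws/credentials", "/phpinfo.php"]; // keep your real list private

const url = new URL(request.url);
const clientIp = request.headers.get("CF-Connecting-IP"); // set by Cloudflare
const path = url.pathname.toLowerCase();

// Exact bait paths, plus any .env variant (/.env.bak, /app/.env, ...)
if (BAIT_PATHS.includes(path) || path.includes("/.env")) {
  // 1) block the IP at the edge (fire-and-forget; blockIpAtEdge is your own helper.
  //    In a Worker, wrap it in ctx.waitUntil(...) so it finishes after the response)
  blockIpAtEdge(clientIp, "honeypot: " + url.pathname);
  // 2) return fake bait with 200 (not 403 - don't confirm it's special)
  return new Response("DB_PASSWORD=REDACTED\nSECRET_KEY=REDACTED",
                      { status: 200, headers: { "Content-Type": "text/plain" } });
}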
```

*Return 200 with fake content, not 403 - a 403 confirms the path is "interesting" and tells the scanner to look harder. Keep your full bait-path list private: publishing it lets scanners route around it. Start from the universally-known targets (.env, wp-config.php, .git/config, phpMyAdmin, common cloud-credential files). Because a hit is unambiguous, you can block on the first request with high confidence - no threshold needed.*

### 5.9. Automated Velocity IP Jail (and Where to Read the Signal)

**THE RISK:** A scanner that avoids your honeypot paths still floods you, enumerating hundreds of URLs and racking up 4xx responses. Edge rate limits slow each burst, but the same IP keeps coming back, and every request still costs you an edge-Worker invocation or a server hit. You want repeat offenders promoted from "throttled every time" to "dropped before they reach anything". But there is a subtler trap in how you DETECT them: not every abusive request even reaches your application logs. Anything your CDN already blocked at the edge never touches your server, so it is never logged. Some backends do not write logs at all. And a flood of perfectly "successful" 200s leaves no 4xx trail to spot. A jail that reads only your own app logs is blind to all three.

**THE SOLUTION:** Run a small scheduled job that finds IPs which crossed a velocity threshold within a short rolling window and promotes them to an edge IP block, so the next requests from that IP are dropped at the perimeter, not merely rate-limited. The choice that matters most is WHERE you read the signal from. Prefer your CDN's analytics API over your own application logs. The CDN sees every single request, its real status code, and the real client IP, including the traffic it blocked at the edge and the traffic to backends that never log a line. Reading from there instead of your logs closes all three blind spots above in one move. Auto-expire each block after a set period so stale rules don't pile up, and lengthen the block for repeat offenders.

**THE FIX:**
```
# Cron job (every minute or two). Read your signal from the CDN's analytics
# API, NOT just your app logs. The CDN also sees edge-blocked traffic,
# non-logging backends, and floods of "successful" 200s:
#
#   query the CDN analytics API: group the last N minutes of requests by
#   client IP, keep the IPs over your threshold (4xx velocity, and/or raw
#   request volume to catch 200-floods), then push each to the edge blocklist.
#
# If you must fall back to app logs (blind to edge-blocked / non-logging traffic):
SELECT client_ip, count(*) AS hits
FROM request_logs
WHERE created_at > now() - interval '5 minutes'
  AND status_code >= 400 AND status_code < 500
  AND NOT EXISTS (SELECT 1 FROM allowlist a
                  WHERE client_ip::inet <<= a.cidr)   -- see note; assumes PostgreSQL inet/cidr types
GROUP BY client_ip
HAVING count(*) >= :threshold;

# For each offender: create an edge IP block (with an expiry you track),
# then a second pass deletes edge rules whose expiry has passed.
```

*CRITICAL: allowlist your OWN infrastructure before enabling this. Your servers' outbound IPs, your office/home IP, and internal private ranges. Otherwise your own cron jobs, health checks, or simply browsing your asset-heavy app fast can trip the threshold and jail you out of your own platform. (We learned this the practical way.) Count 4xx broadly, including 429s: an attacker who only ever gets rate-limited must still eventually be jailed, or they ping your edge forever. Two more guards are worth building in. First, exempt verified search crawlers. A raw volume rule will eventually catch Googlebot or Bingbot on a deep crawl and deindex you. Verify a crawler by doing a reverse-DNS lookup on its IP and then forward-confirming that the hostname resolves back to the same IP. Never trust the user-agent string, which anyone can fake. Second, observe before you enforce. Run any new volume rule in log-only mode for a few days and look at what it WOULD have blocked. A legitimate heavy user or a partner integration can look exactly like a flood until you have seen the real traffic distribution. Switch on blocking only once you are confident the threshold catches abuse and nothing else. One more subtlety: requests you already block at the edge still appear in CDN analytics as 4xx (an edge block returns a 403), so an already-blocked attacker who keeps hammering gets re-counted and re-flagged as a fresh offender. It is harmless (they are blocked either way) but it inflates your roster with actors that are already contained, so when triaging discount any IP or network that already has a standing block.*
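
A minimal sketch of the two pieces of logic described above: the velocity check itself, and the reverse-then-forward DNS verification for search crawlers. The request shape (`ip`, `status`, `ts`), the option names, and the thresholds you pass in are assumptions; the allowlist here matches exact IPs only. The DNS lookups are injected as functions so the logic can be exercised without network access.

```javascript
// requests: [{ ip, status, ts }] as pulled from CDN analytics (ts in ms since epoch).
function findOffenders(requests, { now, windowMs, max4xx, maxTotal, allowlist }) {
  const counts = new Map();
  for (const r of requests) {
    if (r.ts <= now - windowMs || r.ts > now) continue; // outside the rolling window
    if (allowlist.has(r.ip)) continue;                  // your own infrastructure
    const c = counts.get(r.ip) || { total: 0, errors: 0 };
    c.total += 1;
    if (r.status >= 400 && r.status < 500) c.errors += 1; // counts 429s too
    counts.set(r.ip, c);
  }
  const offenders = [];
  for (const [ip, c] of counts) {
    if (c.errors >= max4xx) offenders.push({ ip, reason: "4xx-velocity", ...c });
    else if (c.total >= maxTotal) offenders.push({ ip, reason: "volume", ...c });
  }
  return offenders;
}

// Hostname suffixes used by Googlebot and Bingbot reverse DNS records.
const CRAWLER_SUFFIXES = [".googlebot.com", ".google.com", ".search.msn.com"];

// reverse(ip) -> Promise<string[]> of hostnames; forward(host) -> Promise<string[]> of IPs.
async function isVerifiedCrawler(ip, reverse, forward, suffixes = CRAWLER_SUFFIXES) {
  let names;
  try { names = await reverse(ip); } catch { return false; }
  for (const name of names) {
    const host = name.toLowerCase();
    if (!suffixes.some((s) => host.endsWith(s))) continue;
    let addrs;
    try { addrs = await forward(host); } catch { continue; }
    if (addrs.includes(ip)) return true; // forward-confirmed: hostname resolves back to the IP
  }
  return false;
}
```

In Node the resolvers could be `(ip) => dns.promises.reverse(ip)` and `(h) => dns.promises.resolve4(h)` from `node:dns`. Run `isVerifiedCrawler` on each "volume" offender before blocking, and log rather than block while the rule is in observe-only mode.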

### 5.10. Threat-Intel Edge Blocklists (and Reporting Back)

**THE RISK:** You can only block IPs you have personally watched attack you - but the same hosts are hitting thousands of sites simultaneously. Reacting one IP at a time keeps you permanently a step behind the networks that exist purely to host abuse.

**THE SOLUTION:** Sync vetted, free threat-intelligence feeds to your edge as block rules on a daily schedule - Spamhaus DROP ("Do not Route or Peer") lists entire networks so malicious they should be dropped wholesale, and it is conservative enough to trust for blocking. Optionally close the loop: report the scanners your honeypot catches to a community abuse database so other operators benefit from your detections.

**THE FIX:**
```
# Daily cron: fetch a vetted feed and sync to your edge blocklist
# (the feed is one JSON object per line; "// empty" skips any line
#  without a .cidr field, such as a trailing metadata record)
curl -s https://www.spamhaus.org/drop/drop_v4.json \
  | jq -r '.cidr // empty' \
  | while read -r net; do
      upsert_edge_block "$net" "Spamhaus DROP - daily sync"   # your own helper
    done

# (Optional) report a honeypot-caught scanner back to the community
curl -s https://api.abuseipdb.com/api/v2/report \
  -H "Key: $ABUSEIPDB_KEY" -H "Accept: application/json" \
  --data-urlencode "ip=$SCANNER_IP" --data "categories=21"  # web app attack
```

*Distinguish vetted from noisy. Spamhaus DROP is conservative (only the worst networks) - safe to block inbound. Community-reputation lists (e.g. AbuseIPDB blacklist) are far noisier and frequently flag legitimate AI/search crawlers (Googlebot, GPTBot) that the community reports as aggressive - great to report TO, risky to auto-block FROM. We submit our honeypot catches to the community but deliberately do not block inbound from community lists, to avoid harming our own search indexing.*
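
A daily sync also has to remove networks that have dropped off the feed, otherwise stale blocks accumulate. A minimal sketch of the parse-and-diff step follows, assuming the feed is newline-delimited JSON where network entries carry a `cidr` field; the function names are hypothetical.

```javascript
// Extract CIDRs from a newline-delimited JSON feed, skipping blank,
// malformed, and metadata lines (anything without a string `cidr`).
function parseDropFeed(ndjson) {
  const cidrs = [];
  for (const line of String(ndjson).split("\n")) {
    const text = line.trim();
    if (!text) continue;
    let rec;
    try { rec = JSON.parse(text); } catch { continue; }
    if (rec && typeof rec.cidr === "string") cidrs.push(rec.cidr);
  }
  return cidrs;
}

// Compare what is currently blocked (from this feed) with what the feed lists today.
function planSync(currentlyBlocked, feedCidrs) {
  const current = new Set(currentlyBlocked);
  const desired = new Set(feedCidrs);
  return {
    toAdd: [...desired].filter((c) => !current.has(c)),
    toRemove: [...current].filter((c) => !desired.has(c)),
  };
}
```

Only apply `toRemove` to rules this job created (for example, by matching the "Spamhaus DROP" note), so it never deletes a block you added by hand.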

### 5.11. Lock Your Origin to Cloudflare (Close the Direct-IP Bypass)

**THE RISK:** Every edge protection - WAF, rate limits, bot rules, IP blocks - only applies to traffic that actually goes THROUGH Cloudflare. But your origin server still has a public IP, and that IP is NOT a secret: services like Shodan and Censys scan the whole internet and catalogue every server from its TLS certificate, and old DNS records leak it. An attacker who connects straight to the origin IP - just putting a valid hostname in the TLS SNI - reaches your application with Cloudflare entirely out of the path. Every edge defense is skipped, including the rate limits that would otherwise stop a flood. A common false-comfort here: a bare-IP request with no/blank hostname often returns a connection error, which looks like "the origin refuses direct access" - but that is just the web server rejecting an unmatched hostname, NOT a firewall. With a valid hostname it answers anyone.

**THE SOLUTION:** Restrict your origin's inbound HTTPS port to Cloudflare's published IP ranges ONLY, so any direct-to-IP connection is dropped before it reaches your app - forcing all traffic back through Cloudflare where your layers live. Cloudflare publishes the official list of its ranges; load it into your firewall (a cloud-provider network firewall is cleanest, or an on-box rule) and refresh it on a schedule so new ranges are picked up automatically. Keep SSH reachable (ideally restricted to your own admin IPs) and leave the HTTP port (80) open: port 80 carries the Let's Encrypt renewal challenge, which is only a public proof-of-control token, never a secret. Verify the fix from a non-Cloudflare host: a direct connection to the origin IP must now time out, while the normal hostname path still returns 200.

**THE FIX:**
```
# Cloud network firewall (preferred): allow 443 only from Cloudflare ranges.
# Cloudflare publishes them at https://www.cloudflare.com/ips (v4 + v6).
#   inbound  tcp/443  source = <cloudflare ranges>      # app traffic, CF only
#   inbound  tcp/80   source = 0.0.0.0/0, ::/0          # ACME renewal (no secrets)
#   inbound  tcp/22   source = <your admin IPs>         # SSH
#   (everything else: default deny)

# On-box alternative (Docker example) - drop non-Cloudflare hits on 443:
ipset create cf4 hash:net family inet
curl -s https://api.cloudflare.com/client/v4/ips \
  | jq -r '.result.ipv4_cidrs[]' | while read -r c; do ipset add cf4 "$c"; done
iptables -I DOCKER-USER -i eth0 -p tcp --dport 443 -m set ! --match-set cf4 src -j DROP
# iptables covers IPv4 only: repeat with an inet6 ipset and ip6tables for .result.ipv6_cidrs[]

# Verify from an OUTSIDE (non-Cloudflare) host - must fail now:
curl --connect-to app.example.com:443:<ORIGIN_IP>:443 https://app.example.com/  # -> timeout
```

*Residual gap (be honest about it): an IP allowlist trusts ALL of Cloudflare's ranges, which everyone shares - so a determined attacker using their OWN Cloudflare account could still route to your origin. They would, however, be funnelled through Cloudflare and could no longer hit the raw box directly, so the volumetric/exhaustion risk is largely neutralised. To close the residual completely, add Authenticated Origin Pulls (mTLS): your origin only completes the TLS handshake for a client certificate that ONLY your Cloudflare zone presents - and unlike a shared secret, a certificate can't be copied by an observer (it proves possession of a private key that never leaves Cloudflare). One practical caveat from our own assessment: on managed reverse-proxy platforms that generate their config from container labels (e.g. Caddy-based PaaS), wiring mTLS in cleanly - and making it survive redeploys - takes care and testing, so we treat it as a deliberate, tested follow-up rather than a quick toggle. Start with the IP allowlist (closes the critical exposure today); layer mTLS on when you can test it properly. And don't stop at your app port: the same firewall should audit EVERY port the origin exposes - admin dashboards, realtime/websocket ports, and metrics/monitoring endpoints are routinely left open to the whole internet and bypass your edge entirely. Default-deny everything except the ports you actually serve through the CDN (usually just 443) plus SSH locked to your own IPs.*
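
For the scheduled refresh, it can help to generate the whole inbound rule table from the published ranges and push it to whichever firewall you use. A minimal sketch follows, assuming the response shape of the `/client/v4/ips` endpoint used above (`result.ipv4_cidrs`, `result.ipv6_cidrs`). The rule object format is hypothetical and would need translating to your provider's API.

```javascript
// cfIps: parsed JSON from https://api.cloudflare.com/client/v4/ips
// adminIps: CIDRs allowed to reach SSH. Anything not listed is default-deny.
function buildOriginRules(cfIps, adminIps) {
  const ranges = [...cfIps.result.ipv4_cidrs, ...cfIps.result.ipv6_cidrs];
  if (ranges.length === 0) {
    // Refuse to produce a table that would lock out all app traffic.
    throw new Error("empty Cloudflare range list - keeping existing rules");
  }
  return [
    ...ranges.map((source) => ({ proto: "tcp", port: 443, source, note: "app traffic, Cloudflare only" })),
    { proto: "tcp", port: 80, source: "0.0.0.0/0", note: "ACME renewal" },
    { proto: "tcp", port: 80, source: "::/0", note: "ACME renewal" },
    ...adminIps.map((source) => ({ proto: "tcp", port: 22, source, note: "SSH, admin only" })),
  ];
}
```

The empty-list guard matters for an unattended job: a failed or truncated download should leave yesterday's rules in place, not replace them with nothing.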

### 5.12. Serve Bulk / Large Downloads from Edge Object Storage, Not Your Origin

**THE RISK:** If people download big files (a whole-database export, a large spreadsheet, a video) straight from your own server, you're exposed two ways. First, your server can only push data out so fast - if a few people, or a script, keep pulling a big file over and over, they clog your pipe and everyone else's experience slows to a crawl. It's a cheap way for someone to grind your site down: each download looks like a normal, successful request, so your usual alarms (which watch for errors) never notice. Second, big files often just fail anyway: the systems in front of your server usually give up on a slow transfer after about a minute and a half, and many "serverless" setups refuse to send large responses at all. So it's both unreliable AND an easy target. (Technical: origin egress saturation + the ~100s edge timeout + serverless response caps.)

**THE SOLUTION:** Don't hand out big files from your own server at all. Put them on a service built for storing and serving files - like Cloudflare R2 or Amazon S3 - which sits close to users around the world and has a huge, cheap pipe (with R2, downloads are free). Build the files ahead of time on a timer, copy them onto that "shelf", and let people download straight from there, so your own server is never part of the download. Keep a simple list (a manifest) of what's available and when it was last refreshed, so people and AI agents can find the current files. Now if a download flood comes, it hits the storage service - which is built to shrug it off - instead of your app. (Technical: pre-generate artifacts on a cron, serve them as static objects from R2/S3 via a small edge worker, origin out of the path.)

**THE FIX:**
```
// 1. Pre-generate artifacts on a cron, mirror them to the bucket (mtime-gated).
// 2. A small edge worker serves them from the bucket, edge-cached, origin untouched:
export default {
  async fetch(request, env) {
    // mapPathToObjectKey is your own helper: URL path -> bucket key, ignoring junk params
    const key = mapPathToObjectKey(new URL(request.url).pathname);
    const obj = await env.BUCKET.get(key);            // from object storage, at the edge
    if (!obj) return fetch(request);                  // rare miss -> origin fallback
    return new Response(obj.body, { headers: { "cache-control": "public, max-age=7200" }});
  }
};
// Keep a manifest.json in the bucket so consumers see available files + freshness.
```

*Two bonus wins: (1) this removes the heavy-download path from your origin entirely, so it can't be used to exhaust the box - it pairs directly with the asymmetric-cost-endpoints item in the Backend section. (2) Pre-generation + a short edge cache means freshness is bounded by your generation schedule, not rebuilt per request. One pitfall to watch: the lag between "file regenerated" and "copied to the bucket" - trigger the upload right after generation, or poll frequently, so the edge copy never serves stale (or, worse, a never-yet-uploaded file silently falls back to the slow origin).*
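
Two helpers from this item are worth sketching: the path-to-key mapping and the manifest. Everything specific here is an assumption for illustration - the `/downloads/` URL prefix, the `exports/` key prefix, and the manifest fields.

```javascript
// Map a request path to a bucket key, or null when the path is not a download.
// Only the path is used, so query-string junk can never produce a new key.
function mapPathToObjectKey(pathname) {
  const prefix = "/downloads/";
  if (!pathname.startsWith(prefix)) return null;
  const rest = pathname.slice(prefix.length);
  const safe = /^[A-Za-z0-9._-]+(\/[A-Za-z0-9._-]+)*$/.test(rest) && !rest.includes("..");
  return safe ? `exports/${rest}` : null;
}

// objects: [{ key, bytes, updatedAt: Date }] describing what was just uploaded.
function buildManifest(objects, generatedAt) {
  const files = objects
    .map((o) => ({ key: o.key, bytes: o.bytes, updatedAt: o.updatedAt.toISOString() }))
    .sort((a, b) => a.key.localeCompare(b.key));
  return { generatedAt: generatedAt.toISOString(), files };
}
```

With this mapping the worker should treat a `null` key as "not ours" and pass the request through without touching the bucket. Writing the manifest as the last step of the upload job means consumers never see a file listed before it exists.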

### 5.13. Bot Policy - Welcome the Good Bots, Shed the Freeloaders

**THE RISK:** Not every visitor to your site is a person. A big slice is automated "bots", and they are not all the same. Some are the good kind: a search engine like Google sending real readers your way, or an AI assistant like ChatGPT or Claude that reads your page and quotes you to its users with a link back. Some are freeloaders: commercial "SEO" crawlers (names like SemrushBot, AhrefsBot, DotBot) that copy your whole site to sell competitor-research to other people, send you nothing back, and crawl around the clock - quietly eating your server capacity and your monthly free allowance. Two traps people fall into: (1) "I will just block all bots" - which also blocks Google and the AI assistants, making you invisible to the exact audiences you want. (2) Flipping on Cloudflare's one-click "Block AI bots" switch - which slams the door on ChatGPT, Claude, and Perplexity precisely when you are trying to be found by them. (Technical: SEO backlink crawlers and scrapers consume origin CPU and Worker/edge quota; blanket bot-blocking and the managed AI-bots rule both catch the crawlers you actually want.)

**THE SOLUTION:** Sort bots into "want" and "do not want", and treat each group differently - do not paint them all with one brush. WANT (let them through): search engines (Googlebot, Bingbot) and AI agents (GPTBot, ClaudeBot, PerplexityBot). DO NOT WANT (turn away): SEO backlink crawlers (SemrushBot, AhrefsBot, DotBot, MJ12bot) and aggressive scrapers (Bytespider). The good news: the freeloaders are honest - they announce their name in every request - and the reputable ones obey a "no entry" sign. So: (1) Add a robots.txt "no entry" list naming those crawlers; the polite ones stop on their own. (2) For any that ignore the sign, add ONE firewall rule that blocks just those names - it only matches a request that literally calls itself "SemrushBot", so a real person on Chrome is never touched. Two cautions: do NOT use the blanket "Block AI bots" toggle (it blocks the AI agents you want), and do NOT assume "verified bots equals good" - Semrush is itself a verified bot, so "allow all verified bots" lets the freeloader straight in. You exclude them by NAMING them, never by trusting a category. (Technical: robots.txt Disallow groups for the named user-agents plus a single WAF custom rule matching http.user_agent; Cloudflare Verified Bots includes SEO vendors so it is not an allowlist; the managed AI-scrapers rule blocks GPTBot/ClaudeBot/etc.)

**THE FIX:**
```
# 1. robots.txt - name the freeloader crawlers (the honest ones obey this)
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MJ12bot
User-agent: Bytespider
Disallow: /
# Do NOT list Googlebot / Bingbot / GPTBot / ClaudeBot / PerplexityBot here -
# they stay under "User-agent: *" and remain fully welcome.

# 2. One Cloudflare WAF rule for any that ignore robots.txt. It blocks by NAME,
#    so a real browser (whose user-agent says "Chrome", not "SemrushBot") is safe:
(http.user_agent contains "SemrushBot") or
(http.user_agent contains "AhrefsBot")  or
(http.user_agent contains "Bytespider")     ->  Action: Block
# Runs at the edge, BEFORE your app, so a blocked crawler costs you nothing.
```

*Know what this can and cannot do. A name-based block (robots.txt or the firewall rule) only stops bots that tell the truth about who they are - and the SEO companies do, because they are real businesses with a reputation to protect. It will NOT stop a disguised scraper that pretends to be an ordinary browser. For those, the next signal is "a browser arriving from a data-center network" (no real person browses from an AWS or Azure server) - block on that and you catch most of them; the truly determined ones routing through home-internet proxies are effectively uncatchable, so do not over-engineer - they rarely bother a small site. Also worth knowing: the robots.txt list is a polite request (good crawlers honor it), while the firewall rule is the hard stop - and the firewall rule is the one that actually reclaims your server and Worker quota, because it turns the crawler away at the edge before any work happens.*
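
Keeping the want / do-not-want lists in one place lets you generate the robots.txt block and apply the same policy in a Worker. A minimal sketch follows; the function names and the three-way result are assumptions. Matching is case-insensitive here because some crawlers write their token in lowercase, and a "welcome" result is only a claim, not proof: the user-agent can be faked.

```javascript
const UNWANTED_BOTS = ["SemrushBot", "AhrefsBot", "DotBot", "MJ12bot", "Bytespider"];
const WANTED_BOTS = ["Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "PerplexityBot"];

// Classify a request by the name it announces in its user-agent header.
function classifyUserAgent(userAgent) {
  const ua = String(userAgent || "").toLowerCase();
  if (UNWANTED_BOTS.some((n) => ua.includes(n.toLowerCase()))) return "block";
  if (WANTED_BOTS.some((n) => ua.includes(n.toLowerCase()))) return "welcome";
  return "default"; // ordinary browsers and everything unnamed
}

// Emit the robots.txt group that tells the named crawlers to stay out.
function buildRobotsTxt(names = UNWANTED_BOTS) {
  return names.map((n) => `User-agent: ${n}`).join("\n") + "\nDisallow: /\n";
}
```

Generating both the robots.txt group and the firewall expression from `UNWANTED_BOTS` keeps the polite request and the hard stop from drifting apart.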

### 5.14. Block Path Traversal / LFI at the Edge

**THE RISK:** A file-serving endpoint (a download or report URL) is like a clerk who fetches a file by name from a back room. An attacker asks for a file with a "climb out of the room" name - like ../../../etc/passwd (the Linux password file) or the Windows equivalent - hoping the clerk walks out of the allowed folder and hands over a system file. They try dozens of disguised spellings (URL-encoded slashes like ..%2f, doubled dots like ....//, backslashes) to slip past a naive filter. If even one works, they read the secrets on your server; even when it fails, every file-serving backend is a constant target for these scans. (Technical: path traversal / Local File Inclusion - ../ sequences plus encodings that try to escape the served directory and read /etc/passwd, /proc/self/environ, etc.)

**THE SOLUTION:** The safest place to stop this is at your edge (Cloudflare), before it reaches any backend - one rule then protects all your apps at once. The trick is knowing WHICH patterns to block. Some strings are 100% certain to be an attack and never appear in normal traffic - block those with total confidence: the actual target-file names (/etc/passwd, /etc/shadow, win.ini, boot.ini, /proc/self/) and the "disguise" encodings whose only purpose is to dodge a filter (..%2f, ....//, the backslash form of dot-dot, %2e%2e/). A real browser or app never sends those. But do NOT block loose tokens: a bare ".." (version strings and relative paths use it), a bare "%2f" (some APIs legitimately accept an encoded slash), or plain words like "shadow"/"etc"/"windows" (your own content may contain them - a blog post about "shadow lending" would get blocked). Block loose and you take your own site down. One more safety lever: match the URL PATH only, never the query string, so user input in ?params is never affected. Start with the certain list; if new traversal shapes show up in your logs, append them. (Technical: at the edge - a Cloudflare Worker or WAF rule - substring-match the request path against a tight set of evasion-and-target signatures; on hit, block the IP and log. Backends should still resolve and confirm the final file path stays inside the intended folder as a last resort.)

**THE FIX:**
```
// Edge worker / WAF - block ONLY 100%-certain traversal signatures.
// Match the URL PATH only (never the query string, so ?params stay safe).
const TRAVERSAL = [
  "..%2f", "..%5c", "..\\", "....//", "%2e%2e/", "%2e%2e%2f",   // evasion-only encodings ("..\\" = backslash dot-dot)
  "/etc/passwd", "/etc/shadow", "win.ini", "boot.ini", "/proc/self/",  // target files
];
export default {
  async fetch(request) {
    const p = new URL(request.url).pathname.toLowerCase();
    if (TRAVERSAL.some(sig => p.includes(sig))) {
      // Dead-end: empty reply, nothing reaches the origin.
      // Also block the IP and log the hit (not shown here).
      return new Response("", { status: 200 });
    }
    return fetch(request);   // clean request: pass through to the origin
  },
};
// NEVER add bare "..", bare "%2f", or words like "shadow"/"etc" - they break real traffic.
```
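
To see why the tight list is safe, it helps to run it against a few paths. The sketch below reuses the signature list from the rule above, plus the literal backslash form of dot-dot; the `matchTraversal` helper and the sample paths are hypothetical, chosen only for illustration.

```javascript
// Signature list from the rule above, plus the literal backslash form of dot-dot.
const TRAVERSAL = [
  "..%2f", "..%5c", "..\\", "....//", "%2e%2e/", "%2e%2e%2f",          // evasion-only encodings
  "/etc/passwd", "/etc/shadow", "win.ini", "boot.ini", "/proc/self/",  // target files
];

// Returns the first signature found in the path, or null when the path is clean.
function matchTraversal(pathname) {
  const p = pathname.toLowerCase();
  return TRAVERSAL.find(sig => p.includes(sig)) ?? null;
}

// Hypothetical sample paths: two attacks, three innocent look-alikes.
const samples = [
  "/download/..%2F..%2Fetc/passwd",      // encoded-slash traversal
  "/files/....//....//windows/win.ini",  // doubled-dot traversal
  "/blog/shadow-lending-explained",      // plain word "shadow"
  "/api/v1..v2/diff",                    // bare ".." in a version range
  "/docs/etc/overview",                  // plain word "etc"
];
for (const s of samples) {
  console.log(s, "->", matchTraversal(s) ?? "allowed");
}
```

The first two paths hit a signature; the last three pass, even though they contain "shadow", a bare "..", and "etc". One caveat worth checking in your own setup: the standard `URL` parser resolves plain `..` and `%2e%2e` segments and turns backslashes into forward slashes before `pathname` is read, so some raw forms may already be normalized away by the time a check on `url.pathname` runs.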

*Why "tight" beats "thorough" here: the instinct is to block anything containing "..", but real URLs and API parameters contain ".." and encoded slashes for innocent reasons, so a broad rule blocks legitimate users and brings the site down. Each signature above was chosen because it has zero legitimate use, verified against real traffic. Pair this with the honeypot (item 5.8): treat a traversal hit exactly like a fake-secret-file probe - return a dead-end, block the IP, log it - so a scanner that tries traversal is jailed on its first attempt and cannot move on to your other backends. And keep the backend safety net: even with the edge rule, your file-serving code should confirm the resolved path stays inside the intended folder and reject anything that escapes it.*
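
The backend safety net mentioned above can be sketched as follows, assuming a Node.js backend. The `resolveInside` helper and the `/srv/app/reports` folder are hypothetical names used for illustration; the idea is to resolve the requested name against the served folder and reject any result that lands outside it.

```javascript
const path = require("node:path");

// Returns the absolute path if `requestedName` stays inside `baseDir`,
// or null if the resolved path escapes it.
function resolveInside(baseDir, requestedName) {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, requestedName);

  // Express the target relative to the base folder. An escape shows up as a
  // relative path that is ".." or starts with "../" (or is absolute, which can
  // happen across drives on Windows).
  const rel = path.relative(base, target);
  const escapes =
    rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);

  return escapes ? null : target;
}

// Hypothetical usage inside a download handler:
const safePath = resolveInside("/srv/app/reports", "../../etc/passwd");
if (safePath === null) {
  console.log("rejected: path escapes the served folder");
}
```

Comparing via `path.relative` rather than a plain prefix check avoids accepting a sibling folder such as `/srv/app/reports-private`, and it does not reject a harmless name like `..data/file.csv`. This sketch does not follow symlinks; if the served folder can contain them, resolving the real path first (for example with `fs.realpathSync`) is one option to consider.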
