Large File Upload for Database AI Text-to-SQL Apps: A Practical Guide

Published: December 12, 2025

Everything I learned uploading 1.6GB files through a FastAPI backend without crashing the server

Large File Upload for Database AI

The Use Case

Upload large CSV/data files to be made available to a Text-to-SQL agent for natural language querying and analysis.

The Challenge

Uploading large files to PostgreSQL (or MySQL) through a web API sounds simple until you try it with a 70MB compressed file that expands to 1.6GB. Your app crashes, memory explodes, connections time out, and users see cryptic 500 errors. And it gets worse with multiple concurrent users.

This guide documents 30+ patterns I implemented to handle large file uploads reliably.

Live App and Source Code

App live here: app.tigzig.com/analyzer. Pick a dataset, click "Use Temporary Database". That's it. After upload, the schema is automatically sent to the AI Agent. Start querying in natural language. Each dataset has a pre-built prompt. Copy, paste, run. Or use your own. The more specific, the better.

The app has sample datasets ranging from 64 rows to 11.8 million rows (1.6 GB). Setup takes 30 seconds for small files, around 2 minutes for the 1.6 GB file. 9 LLMs are available, from Gemini 2.0 Flash (cheapest) to Claude 4.5 Sonnet (best quality).

For more info on the datasets, how to explore them, and other app capabilities:

Try Text-to-SQL on Real Data: GB Files, Multi-Million Rows

The app is open source. For the source code, just hit "Docs" on the app site.


BACKEND FILE HANDLING

▸ Don't Decompress Large Files in Memory

A 70MB .gz file decompresses to 1.6GB. Loading it into RAM crashes your server. Stream it to disk with chunked reads (32MB chunks): read a chunk, write it to a temp file, repeat.

▸ File Size Threshold Detection

You don't want chunked-processing overhead for small files. Detect file size from the Content-Length header. Use 100MB as the threshold: below 100MB use memory, above it use disk streaming. For compressed files, always use chunked processing (you don't know the final size).
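A minimal sketch of that routing decision. The function name and the specific extensions checked are illustrative, not the app's actual code:

```python
from typing import Optional

MEMORY_THRESHOLD = 100 * 1024 * 1024  # 100MB cutoff, as above

def pick_processing_path(content_length: Optional[int], filename: str) -> str:
    """Decide between in-memory and disk-streaming processing."""
    # Compressed files: final size is unknown, so always stream in chunks.
    if filename.endswith((".gz", ".zip")):
        return "disk"
    # Missing or oversized Content-Length: play it safe and stream.
    if content_length is None or content_length > MEMORY_THRESHOLD:
        return "disk"
    return "memory"
```

Note the defensive default: a missing Content-Length header is treated like a large file, so the worst case is a little extra overhead rather than an out-of-memory crash.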

▸ Gzip Decompression Streaming

Standard gzip.decompress() loads entire file to memory. Stream decompress in 32MB chunks. Track metrics: original size, decompressed size, compression ratio, speed.
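A stdlib-only sketch of the streaming decompression (32MB chunks as above; metrics beyond the decompressed byte count are omitted for brevity):

```python
import gzip
import tempfile

CHUNK_SIZE = 32 * 1024 * 1024  # 32MB decompressed per read

def stream_decompress_gz(gz_path: str):
    """Stream-decompress a .gz file to a temp file without loading it into RAM.

    Returns (temp_file_path, decompressed_byte_count).
    """
    total = 0
    out = tempfile.NamedTemporaryFile(delete=False, suffix=".csv")
    with gzip.open(gz_path, "rb") as src, out:
        while True:
            chunk = src.read(CHUNK_SIZE)  # decompress at most CHUNK_SIZE at a time
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
    return out.name, total
```

With the original compressed size in hand, the compression ratio and throughput metrics mentioned above fall out of `total` and a couple of timestamps.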


DATABASE INSTANCE CREATION (NEON)

▸ Instant PostgreSQL with Neon API

User needs a fresh database. Traditional provisioning takes minutes. Use Neon's serverless Postgres API - creates databases in seconds.

▸ Race Condition in Database Creation

When creating a Neon database, you create a role, then a database. Sometimes database creation fires before the role is ready. Add a 500ms delay between role and database creation. Implement retry with exponential backoff (1s, 2s, 4s, 8s), max total wait 15 seconds. Or use polling for an even more robust process.
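A generic retry helper sketching the backoff schedule above (the Neon API call itself is not shown; `sleep` is injectable so the delays can be tested without waiting):

```python
import logging
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait 1s, 2s, 4s, 8s between attempts (15s total cap)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the real error
            delay = base_delay * (2 ** attempt)  # 1, 2, 4, 8
            logging.warning("Attempt %d failed (%s); retrying in %.0fs",
                            attempt + 1, exc, delay)
            sleep(delay)
```

In practice `fn` would be the database-creation call that sometimes races the role creation; the same helper also fits the HTTP 423 handling in the next section with `max_attempts=4`.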

▸ Project Lock Handling

Neon returns HTTP 423 (Locked) when a project is temporarily busy. Detect 423, wait with exponential backoff, retry up to 4 times.


ASYNC HANDLING

▸ Make Your Endpoints Async

Synchronous endpoints block the event loop. One slow upload blocks all other requests. Use async def for all upload endpoints.

▸ Async File Reading

file.read() is blocking even in async endpoints. Use await file.read(chunk_size) for chunked async reads.

▸ Thread Pool for Blocking Database Operations

psycopg2 and mysql.connector, like much file I/O and many CPU-bound tasks, are synchronous. Calling them in async endpoints doesn't help and still blocks the event loop. Run blocking operations in a thread pool executor using asyncio.get_event_loop().run_in_executor().
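A sketch of the pattern. `insert_rows_sync` stands in for a real psycopg2/mysql.connector call; on Python 3.10+ `asyncio.get_running_loop()` is the preferred spelling, but the shape is the same:

```python
import asyncio

def insert_rows_sync(rows):
    """Stand-in for a blocking DB call (cursor.executemany + commit in reality)."""
    return len(rows)

async def insert_rows(rows):
    """Run the blocking insert in the default thread pool; the event loop stays free."""
    loop = asyncio.get_event_loop()
    # None = default ThreadPoolExecutor; the coroutine suspends until the thread finishes
    return await loop.run_in_executor(None, insert_rows_sync, rows)
```

While the thread grinds through the insert, the event loop keeps serving other requests, which is the whole point.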


REMOVE BLOCKING OPERATIONS

▸ Pandas to Polars

Pandas operations are slow. pd.read_csv() on a 500MB file freezes your server. Replace Pandas with Polars. Benefits: lazy evaluation with scan_csv(), batched reading with read_csv_batched(), Rust-based so significantly faster, better memory efficiency.

▸ Identify Every Blocking Operation

Even a single blocking operation in an otherwise async endpoint chokes the entire application. Audit your upload/processing flow for significant blocking calls - file I/O, database operations, external API calls. A 100ms blocking call in a high-traffic endpoint kills concurrency. For low-traffic internal tools, prioritize the big blockers first.


CONNECTION POOLING

▸ Why Connection Pooling

Creating a new database connection per request is slow (100-300ms) and exhausts server resources. Use ThreadedConnectionPool from psycopg2. Set minconn=1, maxconn=10.

▸ Per-Database Pools

Different users connect to different databases, so you need one pool per database. Create pools keyed by connection details. Key format: host:port:database:user.
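A sketch of a per-database pool registry. The pool factory is injected so the example stays self-contained; in the real thing it would wrap `psycopg2.pool.ThreadedConnectionPool(minconn=1, maxconn=10, ...)`. The `asyncio.Lock()` also covers the thread-safe access pattern described below:

```python
import asyncio

class PoolManager:
    """Connection pools keyed by host:port:database:user."""

    def __init__(self, pool_factory):
        self._pools = {}
        self._lock = asyncio.Lock()  # serialize pool creation across async requests
        self._factory = pool_factory

    @staticmethod
    def key(host, port, database, user):
        return f"{host}:{port}:{database}:{user}"

    async def get_pool(self, host, port, database, user):
        k = self.key(host, port, database, user)
        async with self._lock:
            if k not in self._pools:  # first request for this database creates the pool
                self._pools[k] = self._factory(host, port, database, user)
            return self._pools[k]
```

Without the lock, two concurrent requests for the same database can both see a missing key and create duplicate pools, leaking connections.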

▸ Stale Connection Detection

Pooled connections go stale (timeout, network issues, server restart). Using a stale connection causes cryptic errors. Validate connection before use with SELECT 1 query.

▸ Stale Connection Retry

Even with validation, you might get multiple stale connections in a row. Retry up to 3 times with validation. If all retries fail, create fresh connection outside pool.
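The validation-plus-retry flow, sketched against the psycopg2-style `getconn()`/`putconn()` pool interface. `fresh_connect` is an injected callable (e.g. a `psycopg2.connect` wrapper) for the out-of-pool fallback:

```python
def get_valid_connection(pool, max_retries=3, fresh_connect=None):
    """Fetch a pooled connection, validate with SELECT 1, discard stale ones."""
    for _ in range(max_retries):
        conn = pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")  # cheap liveness probe
            return conn
        except Exception:
            pool.putconn(conn, close=True)  # hand back stale connection, closed
    # All pooled attempts were stale: open a fresh connection outside the pool.
    return fresh_connect() if fresh_connect else None
```

The `close=True` flag matters: returning a dead connection to the pool without closing it just recycles the problem.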

▸ Thread-Safe Pool Access

Multiple async requests accessing pools simultaneously causes race conditions. Use asyncio.Lock() for pool operations.


TIMEOUTS

▸ Connection Timeout

Connecting to an unavailable database hangs forever. Set connect_timeout=30 seconds.

▸ Statement Timeout

Long-running queries block connections and cause cascading failures. Set statement_timeout=900000 (15 minutes in milliseconds) at connection level.

▸ HTTP Request Timeout

External API calls (OpenAI, Neon) can hang. Set timeout=60 seconds on HTTP clients.

▸ Upload Timeout

Large file uploads take time. Default timeouts kill them. Extended timeouts for upload endpoints. Use 15 minutes (900 seconds) for large uploads.
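The timeout values from the four sections above, collected in one place. Parameter names follow psycopg2/libpq conventions; `statement_timeout` rides in the libpq `options` string so it applies at connection level:

```python
# Passed as **kwargs to psycopg2.connect(...)
PG_CONNECT_KWARGS = {
    "connect_timeout": 30,  # seconds before giving up on an unreachable server
    # server-side statement timeout, in milliseconds (15 minutes)
    "options": "-c statement_timeout=900000",
}

HTTP_TIMEOUT_SECONDS = 60     # external API calls (OpenAI, Neon)
UPLOAD_TIMEOUT_SECONDS = 900  # extended timeout for large upload endpoints
```

Adjust per workload; the point is that every layer (TCP connect, SQL statement, outbound HTTP, inbound upload) has an explicit ceiling instead of hanging forever.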


REQUEST BODY LIMITS

Default body size limits reject large uploads. Add custom middleware to allow bodies up to 1.5GB on upload endpoints only.
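One way to sketch this is as a raw ASGI middleware that rejects oversized bodies early using the Content-Length header. The route paths are illustrative, and a production version would also guard against chunked bodies with no Content-Length:

```python
MAX_BODY_BYTES = 1_500_000_000  # ~1.5GB, upload endpoints only
UPLOAD_PATHS = {"/upload", "/upload-gz"}  # hypothetical route paths

class BodySizeLimitMiddleware:
    """Pure-ASGI sketch: reject oversized request bodies before reading them."""

    def __init__(self, app, max_body=MAX_BODY_BYTES, upload_paths=UPLOAD_PATHS):
        self.app = app
        self.max_body = max_body
        self.upload_paths = upload_paths

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http" and scope.get("path") in self.upload_paths:
            headers = dict(scope.get("headers") or [])  # list of (bytes, bytes) pairs
            raw = headers.get(b"content-length")
            length = int(raw.decode()) if raw else 0
            if length > self.max_body:
                await send({"type": "http.response.start", "status": 413,
                            "headers": [(b"content-type", b"text/plain")]})
                await send({"type": "http.response.body",
                            "body": b"Request body too large"})
                return  # never touches the app, so no memory is spent on the body
        await self.app(scope, receive, send)
```

Scoping the check to upload paths keeps the default limits on every other endpoint.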


RATE LIMITING

▸ IP-Based Rate Limiting

One client can overwhelm your server with too many requests. Use slowapi library for IP-based rate limiting. Set 300/hour (1 request per 12 seconds average).

▸ Rate Limit Hit Logging

You need visibility into who's hitting rate limits. Middleware that logs rate limit hits with IP, path, and timestamp when HTTP 429 returned.


API KEY SECURITY

Sensitive endpoints (database creation) need protection. API key in Authorization header with Bearer scheme. Use secrets.compare_digest() for constant-time comparison to prevent timing attacks.
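A sketch of the check. In practice the key comes from an environment variable; the constant here is a placeholder:

```python
import secrets

API_KEY = "replace-with-env-var"  # placeholder; load from the environment in reality

def verify_bearer(authorization_header) -> bool:
    """Check 'Authorization: Bearer <key>' using a constant-time comparison."""
    if not authorization_header or not authorization_header.startswith("Bearer "):
        return False
    supplied = authorization_header[len("Bearer "):]
    # compare_digest takes the same time whether the first or last byte differs,
    # so an attacker can't probe the key character by character
    return secrets.compare_digest(supplied, API_KEY)
```

A naive `supplied == API_KEY` short-circuits at the first mismatched byte, which is exactly the timing signal `compare_digest` exists to remove.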


BATCH PROCESSING FOR MYSQL

▸ Chunk Size Selection

Inserting millions of rows one-by-one is slow. Inserting all at once exhausts memory. Batch insert in chunks of 25,000 rows. Commit per chunk.

▸ Disable Constraints During Bulk Insert

Foreign key checks, unique checks, and autocommit slow down bulk inserts significantly. Disable foreign_key_checks, unique_checks, autocommit, and sql_log_bin before insert. Re-enable after completion.
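The two bullets above, chunking and constraint toggling, combine into one sketch. The SET statements are standard MySQL (`sql_log_bin` needs SUPER/appropriate privileges and is skipped on managed hosts that forbid it); the connection object is anything DB-API shaped:

```python
import itertools

CHUNK_ROWS = 25_000

def chunked(rows, size=CHUNK_ROWS):
    """Yield lists of up to `size` rows from any iterable."""
    it = iter(rows)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

BULK_OFF = ["SET foreign_key_checks = 0", "SET unique_checks = 0",
            "SET autocommit = 0"]
BULK_ON = ["SET foreign_key_checks = 1", "SET unique_checks = 1",
           "SET autocommit = 1"]

def bulk_insert(conn, insert_sql, rows):
    """Batch insert with constraints disabled; commit per chunk; re-enable after."""
    cur = conn.cursor()
    try:
        for stmt in BULK_OFF:
            cur.execute(stmt)
        total = 0
        for chunk in chunked(rows):
            cur.executemany(insert_sql, chunk)
            conn.commit()  # commit per chunk keeps transaction size bounded
            total += len(chunk)
        return total
    finally:
        for stmt in BULK_ON:  # always restore constraints, even on failure
            cur.execute(stmt)
        conn.commit()
```

The per-chunk commit is the memory valve: neither the client nor the server ever holds more than 25,000 uncommitted rows.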

▸ MySQL Connection Optimization

Use allow_local_infile=True, use_pure=False (C extension for speed), pool_size=32, max_allowed_packet=1GB.


POSTGRESQL COPY COMMAND

INSERT statements are slow for bulk data. Use PostgreSQL's COPY command with copy_expert(). Performance is 10-100x faster than equivalent INSERT statements.
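A sketch of the COPY path with psycopg2's `copy_expert()`. The table name is interpolated into the SQL here for brevity; in real code it must be sanitized or quoted first, since COPY targets can't be parameterized:

```python
def copy_csv_to_table(conn, table, csv_path, delimiter=","):
    """Bulk-load a CSV into `table` via COPY FROM STDIN.

    `table` is assumed to be pre-sanitized; COPY is typically 10-100x faster
    than row-by-row INSERTs because parsing happens server-side in one stream.
    """
    sql = (f"COPY {table} FROM STDIN WITH "
           f"(FORMAT csv, HEADER true, DELIMITER '{delimiter}')")
    with open(csv_path, "r") as f, conn.cursor() as cur:
        cur.copy_expert(sql, f)  # psycopg2 streams the file object to the server
    conn.commit()
```

This pairs naturally with the temp file produced by the streaming decompression earlier: decompress to disk, then COPY straight from that file.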


SSL/TLS CONFIGURATION

▸ Neon Requires SSL

Always set sslmode='require' for Neon connections.

▸ Optional SSL for Custom MySQL

Support optional ssl_verify_cert and ssl_verify_identity parameters for custom databases.


TEMPORARY FILE MANAGEMENT

▸ Temporary File Creation

Use tempfile.NamedTemporaryFile with delete=False and suffix='.csv'. Manual cleanup required.

▸ Guaranteed Cleanup

Temp files pile up on errors. Always clean up in a finally block. Check if the file exists before deleting. Log cleanup errors but don't crash.

▸ Middleware Cleanup for Query Results

Middleware that automatically deletes temp files after FileResponse is sent.


ERROR HANDLING PATTERNS

▸ Try-Except-Finally Structure

Catch HTTPException and re-raise as-is. Catch specific database errors (psycopg2.Error). Catch general Exception as fallback. Always clean up in finally block.

▸ Specific Exception Handling

Handle gzip.BadGzipFile with HTTP 400. Handle psycopg2.OperationalError for connection issues. Provide meaningful error messages.

▸ Fallback Mechanisms

Try primary method, log warning on failure, try fallback method. Example: Polars primary, Pandas fallback.


PROGRESS TRACKING AND LOGGING

▸ Multi-Level Timing

Track overall time, file processing time, schema detection time, database insert time. Log breakdown at completion.

▸ Chunk Progress Logging

Log chunk number, rows in chunk, and running total after each chunk processed.

▸ Log Tags for Filtering

Use consistent tags like [TIMER] for performance, [POOL] for connections, [GZ] for compression. Makes log filtering easy.


DELIMITER DETECTION

Files come with different delimiters (comma, tab, pipe). Read first few lines, detect delimiter. Priority order: Tab, Comma, Pipe. Default to comma.
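A sketch of the priority-ordered detection, checking the header line for each candidate in turn:

```python
def detect_delimiter(sample: str) -> str:
    """Pick the delimiter from the first few lines. Priority: tab, comma, pipe."""
    lines = [l for l in sample.splitlines()[:5] if l.strip()]
    header = lines[0] if lines else ""
    for delim in ("\t", ",", "|"):  # priority order as described above
        if delim in header:
            return delim
    return ","  # default to comma
```

Tab wins the priority race on purpose: a tab-separated file can legitimately contain commas inside field values, but a comma-separated file almost never contains literal tabs.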


COLUMN NAME SANITIZATION

Column names with spaces, special characters, or SQL keywords break queries. Remove special characters, replace spaces with underscores, handle SQL keywords, ensure name starts with letter.
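A sketch covering all four rules. The keyword set is abbreviated for illustration; a real list would cover the full reserved-word set for the target database:

```python
import re

SQL_KEYWORDS = {"select", "from", "where", "table", "order", "group"}  # abbreviated

def sanitize_column(name: str) -> str:
    """Make a raw CSV header safe to use as a SQL identifier."""
    col = name.strip().lower().replace(" ", "_")
    col = re.sub(r"[^a-z0-9_]", "", col)  # strip special characters
    if not col or not col[0].isalpha():
        col = f"col_{col}" if col else "col_unnamed"  # must start with a letter
    if col in SQL_KEYWORDS:
        col = f"{col}_col"  # dodge reserved words
    return col
```

Run this over every header before CREATE TABLE, and the Text-to-SQL agent never has to quote weird identifiers in its generated queries.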


STREAMING RESPONSE FOR EXPORTS

Exporting large query results to file loads everything to memory. Stream results with a generator function. Fetch in chunks (10,000 rows). Yield header row first, then data rows. Clean up cursor and connection in finally block of generator.
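The generator shape, sketched against a DB-API cursor. In the app this would feed a FastAPI StreamingResponse; here the cursor and connection are anything with `execute`/`fetchmany`/`close`:

```python
import csv
import io

FETCH_ROWS = 10_000

def _csv_line(values):
    buf = io.StringIO()
    csv.writer(buf).writerow(values)
    return buf.getvalue()

def stream_query_results(conn, query):
    """Yield CSV text chunk by chunk so results never sit fully in memory."""
    cur = conn.cursor()
    try:
        cur.execute(query)
        header = [col[0] for col in cur.description]
        yield _csv_line(header)                    # header row first
        while rows := cur.fetchmany(FETCH_ROWS):   # then data, 10k rows at a time
            buf = io.StringIO()
            csv.writer(buf).writerows(rows)
            yield buf.getvalue()
    finally:
        cur.close()   # runs even if the client disconnects mid-stream
        conn.close()
```

The finally block inside the generator is the subtle part: when the consumer abandons the stream, Python closes the generator, which triggers the cleanup.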


LLM SCHEMA DETECTION WITH FALLBACK

OpenAI API might be down or rate-limited. Try OpenAI first, fall back to OpenRouter on failure. Log warning when using fallback.


APPLICATION SHUTDOWN CLEANUP

Connection pools need proper cleanup on shutdown. Register shutdown event handler that calls close_all_pools() on the pool manager.


UVICORN WORKERS AND TIMEOUT CONFIG

Default uvicorn runs a single worker - one slow upload blocks everyone. Use multiple workers for concurrent requests; I use 4, depending on server capacity. Set timeout-keep-alive high for large uploads - the default 5 seconds kills long-running connections. I use 1800 seconds (30 min).


POSTGRES OVER MYSQL FOR LARGE FILE OPERATIONS

In my testing, Postgres handles large file uploads significantly better than MySQL. COPY command is faster and more reliable than batched INSERT. Connection handling is smoother. MySQL was fragile with large files - frequent timeouts, connection drops, inconsistent behavior. If you have a choice, use Postgres for bulk data operations.


Quick Reference

Configuration I Use (adjust based on your use case):

Memory Benchmarks with Chunked Processing (Postgres):


Conclusion

Large file upload is death by a thousand cuts. Each optimization solves a specific problem.

No single solution works for everything. The key is layered defenses: detect the file characteristics, select the right processing path, handle errors gracefully, and always clean up.

Notes:

  1. My tools are deployed with small businesses and small teams. For enterprise grade with hundreds or thousands of users, additional considerations would apply.
  2. Public app routes all DB credentials and API calls through my backend. It's a sandbox only for testing. Deploy on your own servers for live work.