Mistakes I Made Building Text-to-SQL Agents in Live Projects: My 2025 Learnings

Published: December 26, 2025

I've been setting up Text-to-SQL apps and agents for over a year now. Multi-user systems. Live databases. I made many mistakes. Here's what broke, what it cost, and how I fixed it.


STRATEGIC & ARCHITECTURE

▸ Built for analytics when main use was operations

Designed for analysts running complex queries. Actual users: customer service, finance & operations teams pulling raw data for reconciliations and responses.

Most queries are straightforward - ID lookups, transaction history, ad-hoc filters. Don't over-engineer for complexity that doesn't exist.

▸ Using multi-step agents where simple one would do

Built a reasoning-heavy agent: plan → execute → validate. Cost went from <$1 per 100 queries to $20+. Queries took 1-3 minutes. Single-step agents work for 80% of SQL tasks. Use multi-step only when genuinely required.


DATA & CONTEXT SHARING

▸ Didn't share schema with the agent

User asks a question. Agent runs blind - wrong joins, hallucinated columns, missed filters.

Always include: table names, columns, data types, primary/foreign keys, relationships.

▸ Did not provide business context

Agent doesn't know what fields mean, how they're computed, what edge cases exist.

Share: variable definitions, computation logic, common query patterns, odd values to expect, business rules. Schema alone isn't enough.

▸ Used plain text dumps for schema

Dumped the schema as big text blocks, adding things in no particular order. Agent got confused.

Use YAML instead - human-readable, with a cleaner hierarchy.
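A minimal sketch of what that YAML can look like. Table and column names here are illustrative, not from any real project:

```yaml
# Hypothetical schema snippet - adapt names and fields to your own DB
tables:
  orders:
    description: One row per customer order
    columns:
      order_id:    {type: bigint, primary_key: true}
      customer_id: {type: bigint, foreign_key: customers.customer_id}
      status:      {type: varchar, note: "See categorical values list"}
      amount:      {type: numeric, note: "Gross amount, incl. tax"}
    relationships:
      - joins customers on customer_id
```

The hierarchy makes joins and key columns obvious at a glance, and it's easy to append business notes per column without the dump turning into soup.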

▸ Didn't share categorical variable distributions

Agent didn't know valid values for status fields, product categories, regions. Made up values. Spelling inconsistencies broke queries.

Share distributions for categorical variables upfront. Agent needs to know: what values exist, common misspellings, edge cases. Gives it X-ray vision into the data. Basically share a condensed EDA output with the agent.
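A minimal sketch of building that condensed EDA, assuming SQLite and illustrative table/column names - swap in your own driver and tables:

```python
import sqlite3
from collections import Counter

# Build a condensed value-count summary of categorical columns so it can
# be pasted into the agent's context. Names here are illustrative.

def categorical_summary(conn, table, columns, top_n=20):
    """Return {column: [(value, count), ...]} for low-cardinality columns."""
    summary = {}
    for col in columns:
        rows = conn.execute(f"SELECT {col} FROM {table}").fetchall()
        summary[col] = Counter(r[0] for r in rows).most_common(top_n)
    return summary

# Demo with an in-memory table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?)",
                 [("shipped",), ("shipped",), ("SHIPPED",), ("pending",)])
print(categorical_summary(conn, "orders", ["status"]))
# The mixed-case 'SHIPPED' shows up as a separate value - exactly the
# kind of inconsistency the agent needs to be told about.
```

Run it once per table, prune to the top values, and paste the result into the schema context alongside the YAML.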


AGENT INSTRUCTIONS & CONSTRAINTS

▸ No LIMIT clause rules

Agent would download full datasets for 'exploration'. 1M rows pulled for simple queries.

Rule I added: Always LIMIT 100 on raw SELECT queries. No LIMIT on aggregations (COUNT, SUM, GROUP BY). Explicitly tell the agent when and when not to use LIMIT. For stricter control, apply at DB level (if supported) or API layer. This is a tricky one though.
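One way to back the prompt rule with an API-layer guardrail. This is a regex heuristic, not a SQL parser - a sketch of the idea, not production code:

```python
import re

# Append a LIMIT to raw SELECTs that have neither a LIMIT nor an
# aggregation. For real workloads, prefer a proper SQL parser.

AGG_PATTERN = re.compile(r"\b(count|sum|avg|min|max|group\s+by)\b", re.I)

def enforce_limit(sql: str, default_limit: int = 100) -> str:
    stripped = sql.rstrip().rstrip(";")
    if re.search(r"\blimit\s+\d+\b", stripped, re.I):
        return stripped          # already limited
    if AGG_PATTERN.search(stripped):
        return stripped          # aggregations return few rows anyway
    return f"{stripped} LIMIT {default_limit}"

print(enforce_limit("SELECT * FROM orders"))
# SELECT * FROM orders LIMIT 100
print(enforce_limit("SELECT COUNT(*) FROM orders"))
# SELECT COUNT(*) FROM orders
```

Even with the agent told the rule, this layer catches the cases where it forgets.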

▸ Unlimited query execution

Agent would run, say, 10 SQL queries for a simple question. Cost and latency go up.

Cap it. Simple agent: 2-3 queries max. Complex reasoning agent: 7-10. Specify execution limit upfront based on use case.
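A tiny sketch of enforcing that cap in code rather than trusting the prompt. The executor callable and limits are illustrative - wire this into your own agent loop:

```python
# Per-question query budget: count executions and refuse past the cap.

class QueryBudgetExceeded(Exception):
    pass

class BudgetedExecutor:
    def __init__(self, run_sql, max_queries=3):
        self.run_sql = run_sql
        self.max_queries = max_queries
        self.used = 0

    def execute(self, sql):
        if self.used >= self.max_queries:
            raise QueryBudgetExceeded(f"budget of {self.max_queries} spent")
        self.used += 1
        return self.run_sql(sql)

ex = BudgetedExecutor(run_sql=lambda sql: f"ran: {sql}", max_queries=2)
print(ex.execute("SELECT 1"))  # ran: SELECT 1
print(ex.execute("SELECT 2"))  # ran: SELECT 2
# A third call raises QueryBudgetExceeded.
```

Reset the counter per user question, and use a bigger budget for the reasoning-heavy agent.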

▸ No debug protocol

Query failed. Agent didn't self-diagnose.

Give the agent a debug protocol: on errors, debug iteratively, with a budget of, say, up to 5 additional queries.
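One way to phrase that protocol in the system prompt. The wording and the 5-query budget are illustrative - tune both to your agent:

```python
# Illustrative system-prompt fragment for the debug protocol.

DEBUG_PROTOCOL = """\
If a query fails:
1. Read the database error message carefully.
2. Re-check table and column names against the schema you were given.
3. Fix the query and retry.
4. You may spend at most 5 additional queries on debugging.
5. If still failing, stop and report the error to the user.
"""

print(DEBUG_PROTOCOL)
```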

▸ Assumed agents write perfect SQL

They don't. Division-by-zero errors are very common.

Tell the agent: wrap divisors with NULLIF(x, 0). Use COALESCE() for fallback values. Identify common error patterns and suggest fixes upfront.
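The two guardrails in action, runnable against SQLite as a demo. Many engines raise an error on division by zero; NULLIF(y, 0) turns the divisor into NULL so the division yields NULL instead, and COALESCE supplies the fallback:

```python
import sqlite3

# Demo of the NULLIF + COALESCE pattern for safe division.
conn = sqlite3.connect(":memory:")

# Zero divisor: NULLIF(0, 0) -> NULL, division -> NULL, COALESCE -> 0.0
safe_zero = conn.execute(
    "SELECT COALESCE(1.0 * 10 / NULLIF(0, 0), 0.0)"
).fetchone()[0]
print(safe_zero)   # 0.0

# Normal divisor: NULLIF(2, 0) -> 2, so the division proceeds as usual
safe_norm = conn.execute(
    "SELECT COALESCE(1.0 * 10 / NULLIF(2, 0), 0.0)"
).fetchone()[0]
print(safe_norm)   # 5.0
```

Put the pattern itself in the agent's instructions so it wraps every divisor by default.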


AGENT BEHAVIOR REALITY

▸ Assumed agents follow rules 100%

Set LIMIT rules, query caps, table restrictions. Agents still break them occasionally. Non-deterministic behavior.

You will never get 100% compliance. You can add additional checks at DB level or API level. But that's the nature of the beast. See what you can live with.


MODEL SELECTION & COST OPTIMIZATION

▸ Used expensive models everywhere

Used GPT-4.1 for everything. 5-10x cost vs cheaper models.

GPT-4.1-mini and even GPT-4o-mini are great for routine tasks. Reserve GPT-4.1 / 5.1 / 5.2 for harder tasks.

▸ Upgraded to latest model immediately

Switched from GPT-4.1 to GPT-5 the day it launched. Cost blew up - reasoning tokens added massive overhead.

Let models stabilize. Test cost and performance before upgrading. Don't chase 'latest' by default.

▸ Didn't specify reasoning parameters

Many models have reasoning controls. Didn't use them. Agent applied full reasoning to simple queries.

For simple tasks, set reasoning to zero or minimal. Saves tokens and latency. Only enable reasoning when genuinely needed.

▸ Didn't monitor cache hits

System instructions and schemas ran 20K+ tokens. No cache hits = repeated full charge.

Monitor cache hit rates in logs. Key rules: exact prompt match (byte-for-byte), no dynamic content at start, proper cache configuration. This is a complex area with major cost impact - detailed cache optimization guide coming soon.


SECURITY & ACCESS CONTROL

▸ Provided write access inadvertently

User truncated live tables. My fault - the test ID's write permissions carried over to production.

Fix: separate DB users with fine-grained permissions.

▸ No CREATE table functionality

User pulls 5,000 records. Can't print them. Can't export them.

Added: write access to 'workdir' schema. Added export functionality. Gave users control over larger result sets.


PERFORMANCE & LIMITS

▸ No rate limits

One user could trigger 100+ queries in a minute.

Rate limits at API level (SlowAPI for FastAPI). Rate limits at agent level.
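At the API level, SlowAPI handles this for FastAPI. For the agent level, a sliding-window limiter is a few lines. This is a sketch with illustrative limits, not production code (no locking, unbounded user map):

```python
import time
from collections import defaultdict, deque

# Per-user sliding-window rate limit: allow at most max_calls per window.

class RateLimiter:
    def __init__(self, max_calls=10, window_s=60.0):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls = defaultdict(deque)   # user_id -> recent call times

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.calls[user_id]
        while q and now - q[0] > self.window_s:
            q.popleft()                   # drop calls outside the window
        if len(q) >= self.max_calls:
            return False
        q.append(now)
        return True

rl = RateLimiter(max_calls=2, window_s=60.0)
print(rl.allow("u1", now=0.0))   # True
print(rl.allow("u1", now=1.0))   # True
print(rl.allow("u1", now=2.0))   # False - limit hit
print(rl.allow("u1", now=62.0))  # True - window rolled over
```

Check `allow()` before each agent query execution; the API-level limit then only has to catch what slips through.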

▸ Async in name only

Wrote async def everywhere. One blocking file read killed concurrency for all users.

Audit for hidden blocking calls. Wrap blocking DB operations in thread pools. Replace blocking file reads. Async must be end-to-end or it doesn't work.
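The fix for a hidden blocking call is usually one line with asyncio.to_thread. Here read_schema_file is an illustrative stand-in for any blocking I/O (file reads, sync DB drivers):

```python
import asyncio

def read_schema_file():
    # pretend this is a slow, blocking read
    return "tables: ..."

async def handler():
    # BAD:  schema = read_schema_file()   # blocks the event loop for everyone
    # GOOD: run it in a worker thread instead
    schema = await asyncio.to_thread(read_schema_file)
    return schema

print(asyncio.run(handler()))  # tables: ...
```

Same idea for sync DB calls: push them through to_thread (or an executor) so one user's slow query doesn't stall every other user's request.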

▸ Didn't test edge cases

Worked fine with clean data. Broke with null values, special characters in column names, mixed delimiters, inconsistent date formats.

Push it to limits. Test with messy real-world data. Observe failures. Adjust instructions.

▸ Temporary tables without cleanup

Allowed agent to create intermediate tables. Never dropped them. Database bloated over time.

Have the agent use clear naming (e.g. temp_agent_xxxx) and purge regularly.
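A sketch of the periodic purge, shown against SQLite for the demo - the prefix and schedule are yours to choose; run it via cron or a background task:

```python
import sqlite3

# Drop all agent-created scratch tables, keyed off their naming prefix.

def purge_agent_tables(conn, prefix="temp_agent_"):
    names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name LIKE ?",
        (prefix + "%",),
    )]
    for name in names:
        conn.execute(f'DROP TABLE "{name}"')
    return names

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_agent_0001 (x INT)")
conn.execute("CREATE TABLE orders (x INT)")
print(purge_agent_tables(conn))
# ['temp_agent_0001'] - 'orders' is left alone
```

On Postgres and friends, query information_schema.tables instead of sqlite_master; the prefix-match-and-drop logic is the same.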


Result

Cost per 100 queries: $20+ → under $2 for most tasks
Latency: 2+ minutes → 10-30 seconds
Error rate: ~30% → under 5%
User satisfaction: measurable improvement


Test it