
DSPy evaluates and refines Datasette Agent prompts

By LDS Team
Relevance Score: 5.6

Developer Simon Willison used DSPy's GEPA optimizer to evaluate and rewrite the production system prompt behind Datasette Agent's read-only SQL question-answering feature, publishing the results on July 2, 2026. Testing with gpt-4.1-nano as the task model against a 30-question benchmark, the GEPA-rewritten prompt raised training accuracy from 90% to 95% but dropped held-out test accuracy from 95% to 85% - a regression traced to the optimizer's own added advice conflicting with the prompt's existing display-mode rules. For practitioners building LLM agents, the case is a concrete reminder that automated prompt optimizers can overfit small training sets and that an optimized prompt still needs regression review before it ships, the same as any code change.

For practitioners tuning LLM agent prompts with automated optimizers, this case study is a concrete warning: a GEPA-optimized prompt that improved training accuracy by 5 points cost 10 points on held-out test questions, and the failure mode is instructive rather than random - the optimizer's own advice collided with unrelated rules already in the prompt.

What happened

Simon Willison used GEPA (Genetic-Pareto), the reflective prompt optimizer in the DSPy framework, to evaluate and rewrite the production system prompt used by Datasette Agent's read-only SQL question-answering tool. Willison commissioned the work as an asynchronous Claude Code research task using Claude Fable 5. That task built a harness (harness.py) that runs DSPy against Datasette Agent's actual tool implementations and system prompt, not a paraphrase, on a real in-process Datasette instance, using datasette 1.0a35, datasette-agent 0.3a0, and dspy 3.2.1.

Technical context

The evaluation used a deterministically generated bookstore database and 30 natural-language questions (20 train, 10 held-out test), with gold answers computed directly from SQL so they are correct by construction. Baseline gpt-4.1-mini scored 95-97% and was near ceiling; gpt-4.1-nano had more headroom (90% train, 95% test with a corrected metric) and was used as the optimization target, with gpt-5-mini as GEPA's reflection model. GEPA rewrote the roughly 2,400-character production prompt into an approximately 8,800-character rulebook, lifting train accuracy to 95% but dropping test accuracy to 85%.
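The "correct by construction" idea can be illustrated with a minimal sketch. It is not Willison's harness: the table layout, the seed, the two sample questions, and the names build_bookstore and gold_answers are all assumptions made for illustration. The point is only that a seeded generator plus a gold SQL query per question yields the same reference answer on every run.

```python
import random
import sqlite3


def build_bookstore(seed: int = 0) -> sqlite3.Connection:
    """Build a small in-memory bookstore database from a fixed seed (hypothetical schema)."""
    rng = random.Random(seed)  # seeded, so every run produces identical rows
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, price_cents INTEGER);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            book_id INTEGER REFERENCES books(id),
            quantity INTEGER,
            status TEXT
        );
        """
    )
    books = [(i, f"Book {i}", rng.randrange(500, 3000)) for i in range(1, 11)]
    conn.executemany("INSERT INTO books VALUES (?, ?, ?)", books)
    statuses = ["completed", "cancelled", "refunded"]
    orders = [
        (i, rng.randint(1, 10), rng.randint(1, 5), rng.choice(statuses))
        for i in range(1, 101)
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)
    conn.commit()
    return conn


# Each question carries the SQL that defines its gold answer (illustrative examples).
QUESTIONS = [
    {
        "question": "How many orders were completed?",
        "gold_sql": "SELECT COUNT(*) FROM orders WHERE status = 'completed'",
    },
    {
        "question": "What is the total revenue in cents from completed orders?",
        "gold_sql": (
            "SELECT SUM(o.quantity * b.price_cents) "
            "FROM orders o JOIN books b ON b.id = o.book_id "
            "WHERE o.status = 'completed'"
        ),
    },
]


def gold_answers(conn: sqlite3.Connection) -> list:
    """Compute the reference answers by running each question's gold SQL."""
    return [conn.execute(q["gold_sql"]).fetchone()[0] for q in QUESTIONS]
```

Because the answers come from the database rather than from hand labeling, rebuilding with the same seed reproduces them exactly, which makes a scoring disagreement easier to attribute to either the agent or the metric.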

For practitioners

The regression is the most useful part of the writeup. GEPA's added advice told the model to run SELECT DISTINCT status FROM orders when unsure - using display='user', a mode that renders results for the end user but hides the rows from the model itself. On a held-out revenue question, gpt-4.1-nano followed that advice literally, saw only a row count, re-ran the identical query three times, and exhausted its iteration budget instead of answering. Willison also found that two of three apparent baseline failures were bugs in his own scoring metric, not agent errors - his stated takeaway is that eval quality has to be debugged before optimizer quality is trusted. The broader lessons generalize beyond this project: prompt optimizers can overfit on small training sets (20 questions here), and an optimized prompt needs the same regression testing as a code diff, not automatic adoption.
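The display-mode failure described above can be reproduced in miniature. The sketch below is a simulation, not Datasette Agent's tool code: run_sql, the "model" display value, and detect_stuck_calls are hypothetical names, and only display='user' and the SELECT DISTINCT query come from the writeup. It shows why a model that sees only a row count has nothing new to act on, and how a harness might flag identical repeated calls.

```python
import sqlite3


def run_sql(conn, sql, display="model"):
    """Hypothetical tool: return what the *model* gets back after the query runs."""
    rows = conn.execute(sql).fetchall()
    if display == "user":
        # Rows are rendered for the end user; the model only learns how many there were.
        return {"row_count": len(rows)}
    # "model" is an assumed name for a mode in which the model can read the rows.
    return {"row_count": len(rows), "rows": rows}


def detect_stuck_calls(calls):
    """Return the indexes of calls that exactly repeat an earlier (sql, display) pair."""
    seen = set()
    repeats = []
    for index, call in enumerate(calls):
        key = (call["sql"].strip(), call["display"])
        if key in seen:
            repeats.append(index)
        seen.add(key)
    return repeats


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("completed",), ("cancelled",), ("completed",)],
)

sql = "SELECT DISTINCT status FROM orders"
hidden = run_sql(conn, sql, display="user")    # the model sees a count only
visible = run_sql(conn, sql, display="model")  # the model sees the status values

# Three identical calls, as in the held-out revenue question.
trace = [{"sql": sql, "display": "user"}] * 3
stuck = detect_stuck_calls(trace)
```

A check like detect_stuck_calls would not fix the prompt, but it could turn a silently exhausted iteration budget into an explicit signal during evaluation.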

What to watch

Willison flags follow-up prompt candidates the run surfaced, including adding column names to the schema listing (the current table-only listing caused column-guessing errors) and adding an explicit rule against using display='user' for data the model needs to read itself - concrete, testable changes for anyone maintaining a similar tool-calling SQL agent.
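The schema-listing suggestion is easy to prototype against plain SQLite. The following is a minimal sketch, assuming the listing is assembled as text for the prompt; schema_listing and list_tables are hypothetical helpers, not functions from Datasette or Datasette Agent. It contrasts a table-only listing with one that includes column names and declared types.

```python
import sqlite3


def list_tables(conn):
    """Return user table names, skipping SQLite's internal tables."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def schema_listing(conn, include_columns=True):
    """Render one line per table, optionally with its columns and declared types."""
    lines = []
    for table in list_tables(conn):
        if not include_columns:
            lines.append(table)
            continue
        escaped = table.replace('"', '""')  # quote the identifier safely
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{escaped}")').fetchall()
        described = ", ".join(
            f"{name} {col_type}".strip() for _, name, col_type, *_ in columns
        )
        lines.append(f"{table}({described})")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")

table_only = schema_listing(conn, include_columns=False)
with_columns = schema_listing(conn)
```

With the column-aware listing in the prompt, the model no longer has to guess whether a column is called status or state, which is the class of error the writeup attributes to the table-only version.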

Key Points

  • Simon Willison used DSPy's GEPA optimizer to rewrite Datasette Agent's production SQL-answering system prompt, tested against a 30-question benchmark.
  • The optimized prompt lifted training accuracy from 90% to 95% but caused a regression on held-out test questions, dropping accuracy from 95% to 85%.
  • The case shows automated prompt optimizers can overfit tiny training sets and that optimized prompts need the same regression review as code changes.
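One way to make the regression review in the last point mechanical is a small gate that compares baseline and candidate scores before a prompt is adopted. This is a sketch under assumptions: Scores, review_candidate, and the zero-tolerance default are illustrative choices, not part of DSPy or of Willison's harness. The example values are the accuracies reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    train: float  # accuracy on the training questions, 0.0 to 1.0
    test: float   # accuracy on the held-out questions, 0.0 to 1.0


def review_candidate(baseline, candidate, max_test_drop=0.0):
    """Return (verdict, train_delta, test_delta) for a candidate prompt."""
    train_delta = round(candidate.train - baseline.train, 4)
    test_delta = round(candidate.test - baseline.test, 4)
    if test_delta < -max_test_drop:
        # A held-out regression blocks adoption no matter how training improved.
        verdict = "reject"
    elif train_delta > 0 or test_delta > 0:
        verdict = "accept"
    else:
        verdict = "no_change"
    return verdict, train_delta, test_delta


# Reported numbers: train 90% -> 95%, held-out test 95% -> 85%.
result = review_candidate(Scores(train=0.90, test=0.95), Scores(train=0.95, test=0.85))
```

On these numbers the gate rejects the rewritten prompt despite the training gain. With only 10 held-out questions the threshold is coarse, so a rejection is a prompt to read the failing transcripts rather than a final judgment.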

Scoring Rationale

Solid, well-documented practitioner case study on prompt optimization with concrete, reproducible findings (GEPA overfitting, a specific display-mode regression), but it is a single independent researcher's blog-published side project on a niche developer tool, not an industry-moving event, so it sits mid-pack rather than in the notable tier.
