Need-based Context Engineering: Let tests tell you what your AI agent actually needs

Andrew Joia
Oct 15, 2025
In which I discover that starting with zero context produces better agents than front-loading every instruction I can think of.
TL;DR
Context bloat kills performance: Most prompts are 10x longer than they need to be
Start with nothing: An empty string, not minimal instructions
Let failures guide you: Add context only when scenario tests demand it
Result: Faster, cheaper, more maintainable agents that actually work better
This is the natural extension of domain-driven TDD for AI agents—applying YAGNI to prompt engineering through systematic testing with Scenario.
The anxiety-driven prompt: how context bloats in AI agents
Here's what happens when you build AI agents without tests:
You are a helpful flight booking assistant with access to our booking system.
You should be professional, courteous, and efficient. Always greet users warmly.
When booking flights, collect the following information:
- Passenger name (first and last)
- Departure airport (use IATA codes like JFK, LAX)
- Destination airport (use IATA codes like LHR, CDG)
- Departure date (YYYY-MM-DD format)
- Return date (YYYY-MM-DD format, if applicable)
- Number of passengers (validate > 0)
- Class of service (economy, business, first)
- Special requests or dietary requirements
Important guidelines:
- Always validate airport codes against our database
- Confirm all details before making the booking
- Handle errors gracefully with helpful messages
- If the user seems confused, offer examples
- Never make assumptions about dates
- Ask for clarification on ambiguous requests
- Preserve user context across the conversation
[... another 150 lines of "what ifs" ...]
This prompt is 50+ lines and growing. Every time something breaks in production, someone adds another instruction. Nobody knows which lines actually matter. Everyone's afraid to remove anything.
Sound familiar?
YAGNI meets AI: the need-based context engineering methodology
Need-based context engineering applies a simple principle from software engineering to AI agents: You Aren't Gonna Need It until tests prove otherwise.
Here's the discipline:
Start with an empty string (or blank system message)
Write scenario tests that define business outcomes
Run tests, watch failures
Add only the context needed to make tests pass
Repeat
Every line in your prompt exists because a test failure demanded it. Nothing more.
The experiment: Flight booking from zero
I validated this approach while building the flight booking agent from my previous article on domain-driven TDD. But this time, I started from zero.
Step 1: The empty prompt
Here's my initial system prompt:
const systemPrompt = "";
Zero characters. Not "You are a flight booking agent." Not even a role description. Nothing.
Your instinct is screaming at you right now. Mine was too.
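For reference, here's a rough sketch of the kind of call that sits behind the agentAdapter used in the tests below. The OpenAI wiring and the flightAgent name are my illustration, not Scenario's adapter API; the point is that an empty systemPrompt means no system message reaches the model at all.
import OpenAI from "openai";

const client = new OpenAI();

// Illustrative agent: the system prompt is only included when it's
// non-empty, so "" really means zero instructions reach the model.
async function flightAgent(systemPrompt: string, userMessage: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      ...(systemPrompt ? [{ role: "system" as const, content: systemPrompt }] : []),
      { role: "user" as const, content: userMessage },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
In the full agent, the conversation history and tool calls go into messages as well; this sketch keeps only the prompt handling, which is the part that matters here.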
Step 2: The First Scenario
Before adding any context, I wrote a test:
const result = await scenario.run({
  name: "Basic flight booking interaction",
  description: "User wants to book a flight from NYC to London",
  agents: [
    scenario.userSimulatorAgent(),
    agentAdapter, // uses empty system prompt
    scenario.judgeAgent({
      criteria: [
        "The agent responds to the user",
        "The agent understands this is about booking flights",
      ],
    }),
  ],
  script: [scenario.proceed(10)],
});
I ran it expecting immediate failure.
Result: PASS ✅
Wait, what?
Step 3: The First Real Insight
Modern LLMs already know what a flight booking conversation looks like. They've seen thousands of examples in training. With an empty system prompt, GPT-4o still:
Responded coherently to booking requests
Asked reasonable questions
Collected relevant information
Maintained conversation context
This challenges everything we assume about prompt engineering. Most context is redundant.
Step 4: Push until something breaks
I scaled up to a complete booking journey:
scenario.judgeAgent({
  criteria: [
    "The agent collects departure airport",
    "The agent collects destination airport",
    "The agent collects departure date",
    "The agent collects passenger name",
    "The agent collects number of passengers",
    "The agent collects class of service",
  ],
}),
Result: PASS ✅
Still passing with an empty prompt!
Step 5: The first real failure
I added this criterion:
"The agent uses valid IATA codes for airports";
Result: FAIL ❌
Finally! The agent was accepting "NYC" instead of requiring "JFK" or "LGA". More importantly, it wasn't validating codes at all.
This failure revealed a genuine gap—not something I guessed at, but something tests proved was missing.
Step 6: Surgical context addition
I updated the system prompt:
const systemPrompt = "Use valid IATA airport codes (e.g., JFK, LAX, LHR).";
Result: PASS ✅
One line. One instruction. Directly addressing the test failure.
The pattern emerges
I continued this cycle, test by test (one of these checks is sketched in code after the list):
Test: Agent should use tools to create bookings
Status: FAIL ❌
Added: "Use the available tools to complete bookings."
Test: Agent should handle unavailable flights
Status: FAIL ❌
Added: "If a flight isn't available, suggest alternatives."
Test: Agent should collect special requests
Status: PASS ✅
Added: Nothing—already handled
Test: Agent should confirm before booking
Status: FAIL ❌
Added: "Confirm all details before finalizing."
The final result
After completing the full domain-driven TDD cycle, my production prompt was:
const systemPrompt = `Use valid IATA airport codes (e.g., JFK, LAX, LHR).
Use the available tools to complete bookings.
If a flight isn't available, suggest alternatives.
Confirm all details before finalizing.`;
Four lines. Each one exists because a test demanded it. Each one provably affects behavior.
Compare that to the 50+ line anxiety-driven prompt at the beginning.
Why this works: The science
Irrelevant context degrades LLM performance
Research on long-context LLMs shows that irrelevant information actively harms performance. The "lost in the middle" problem is real: models perform worse when they have to find relevant information buried in noise.
By keeping context minimal and test-driven, you're improving signal-to-noise ratio, making it easier for the model to focus on what matters.
Frontier models already know the basics
Frontier models have been trained on massive datasets that include customer service conversations, booking systems, technical documentation. They've already internalized most patterns you're trying to teach them.
When you write "Always greet users warmly," you're probably not teaching the model anything new—you're adding redundant context. Need-based context engineering respects the model's existing capabilities.
Tests align intent with context
Every line of your prompt is proven to affect behavior. When someone reads your prompt six months from now, they can trace each instruction back to a specific test. Want to remove a line? Run the tests. If they pass, it wasn't necessary.
This is documentation that traditional prompts lack: every line has a clear, validated purpose.
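That pruning check is easy to automate. Here's a rough sketch, assuming a hypothetical runAllScenarios(prompt) helper that runs your scenario suite against a given system prompt and reports whether everything passed.
// Hypothetical helper: runs the full scenario suite with the given
// system prompt and resolves to true when every scenario passes.
declare function runAllScenarios(systemPrompt: string): Promise<boolean>;

// Drop one line at a time and re-run the suite; any line the tests
// don't miss has no proven reason to exist.
async function findUnprovenLines(systemPrompt: string): Promise<string[]> {
  const lines = systemPrompt.split("\n").filter((line) => line.trim() !== "");
  const unproven: string[] = [];

  for (const line of lines) {
    const pruned = lines.filter((l) => l !== line).join("\n");
    if (await runAllScenarios(pruned)) {
      unproven.push(line);
    }
  }
  return unproven;
}
Because judge verdicts can vary between runs, treat a single pass as a hint and re-run the suite before actually deleting a line.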
Practical benefits at scale
Cost savings
If you're processing 10,000 conversations per day, reducing your prompt from 500 tokens to 50 tokens saves 4.5 million tokens daily. At current pricing, that's real money compounding over time.
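The arithmetic is simple enough to sanity-check yourself; the per-million-token price below is a placeholder, so substitute your model's current input rate.
// Back-of-the-envelope savings: 10,000 conversations/day, prompt trimmed
// from 500 tokens to 50. The price per million input tokens is a
// placeholder; plug in your model's current rate.
const conversationsPerDay = 10_000;
const tokensSavedPerConversation = 500 - 50; // 450
const tokensSavedPerDay = conversationsPerDay * tokensSavedPerConversation; // 4,500,000

const assumedPricePerMillionTokens = 2.5; // USD, placeholder
const dollarsSavedPerDay = (tokensSavedPerDay / 1_000_000) * assumedPricePerMillionTokens;
console.log(tokensSavedPerDay, dollarsSavedPerDay); // 4500000, ~11.25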
Performance improvements
Shorter contexts mean faster response times. In customer-facing applications, trimming prompt bloat can shave hundreds of milliseconds off every interaction.
Easier maintenance
Six months from now, when someone needs to update the agent, a lean test-driven prompt is infinitely easier to modify. They can see exactly what each line does because tests prove it.
Better agent performance
Counterintuitively, agents often perform better with less context. Removing noise makes it easier for the model to understand its actual job.
From theory to practice
Ready to try need-based context engineering? Start here:
// Start with nothing
const systemPrompt = "";

// Write a scenario test
const result = await scenario.run({
  name: "Core capability test",
  description: "User performs basic task",
  agents: [
    scenario.userSimulatorAgent(),
    yourAgent(systemPrompt),
    scenario.judgeAgent({
      criteria: ["Define your business outcome"],
    }),
  ],
  script: [scenario.proceed(20)],
});
// Run it. Be surprised. Add context only when failures demand it.
Then follow the cycle from my domain-driven TDD article:
Add more complex scenarios
Note failures
Add minimal context to fix each failure
Repeat
Learn more about simulation-based testing and how to integrate Scenario into your CI/CD pipeline.
Conclusion
Traditional prompt engineering is guesswork dressed up as craft. Need-based context engineering replaces guesses with evidence.
Start with nothing. Let tests reveal what's actually needed. Ship agents that are leaner, faster, and more aligned with real requirements.
This is the natural extension of domain-driven TDD: not just testing AI agents, but systematically discovering what context they actually need.
Try Scenario and start building agents with need-based context engineering.
❓ FAQ
Should I really start with an empty string?
Yes. Run the test. You'll be surprised how much works without any instructions. Add context only when evidence demands it.
What if I know certain context is needed from experience?
Add it but write a test that validates it actually does what you think. Without the test, you're guessing.
Isn't this slower than just writing a comprehensive prompt?
It's slower in the first hour, faster over the project lifetime. Upfront prompting feels fast because you're making quick decisions, but you're deferring the cost to debugging, maintenance, and production optimization.
Can I use this with RAG systems?
Yes. RAG systems benefit even more because they already have dynamic context from retrieval. Use scenarios to test whether retrieval provides enough context, or whether you need additional base instructions.
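For instance, a scenario for a RAG-backed agent can judge groundedness directly. The criteria and the ragAgentAdapter name here are illustrative placeholders for whatever wraps your retrieval-augmented agent.
// Illustrative RAG scenario: checks whether retrieval alone gives the
// agent enough context before any base instructions are added.
// ragAgentAdapter is a placeholder for your retrieval-augmented agent.
const ragResult = await scenario.run({
  name: "Policy question answered from retrieved docs",
  description: "User asks about the airline's baggage policy",
  agents: [
    scenario.userSimulatorAgent(),
    ragAgentAdapter,
    scenario.judgeAgent({
      criteria: [
        "The agent's answer is consistent with the retrieved documents",
        "The agent does not invent policy details that are not in the documents",
      ],
    }),
  ],
  script: [scenario.proceed(10)],
});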
Does this work with smaller models?
Absolutely. Smaller models will need more explicit context than frontier models, but the methodology is identical: start minimal, run tests, add context as failures demand. The balance point will just be different. You might be surprised by how well a cheaper, smaller, and faster model performs!
👉 Start building with confidence:
Start with zero. Let tests drive context. Ship with confidence.