Systematically Improving RAG Agents

Tahmid Tapadar
Apr 24, 2025
At first glance, Retrieval-Augmented Generation (RAG) seems deceptively simple. You take a user’s question, retrieve a few relevant documents, and let a language model generate the answer. But as anyone who’s built one knows, getting a RAG system to work reliably - across edge cases, domains, and time - requires more than wiring up a retriever and an LLM. The difference between a fragile prototype and a production-ready RAG agent isn’t the framework or the model. It’s iteration.
Let’s start there.
The RAG Flywheel
Improving RAG agents is a process, not a checklist. We’ve found the most reliable systems emerge through a flywheel of continuous improvement:
Build a basic system: Start simple—maybe even naive. Use public embeddings, default chunking, no reranker. Get something working end-to-end.
Create evaluation data: This is critical. Without it, you’re flying blind.
Run experiments: Adjust retrieval methods, embeddings, chunking strategies.
Analyse errors: Look at real outputs. Where are things going wrong?
Incorporate feedback: From users, metrics, or failures.
Repeat.
This loop is how good RAG agents are built - not by adding more components, but by learning which ones actually matter for your task.
But the flywheel doesn’t turn unless you have data. And at the beginning, you often don’t.
Why Synthetic Data Matters
In many domains - enterprise tools, legal, finance, internal docs - you don’t start with a benchmark dataset. You don’t have thousands of labeled questions or relevance scores. You barely have users.
But you do have access to your own corpus. And with a bit of prompting, you can start generating useful data from it.
Here’s what that looks like:
For each document chunk, use an LLM to generate a set of plausible questions it might answer.
Optionally, generate ranking data by asking the model to rate or compare which chunks are more relevant to which questions.
Use this as evaluation (and later as training) data to tune your embeddings, rerankers, or retrieval strategies.
The point isn’t to generate perfect labels. It’s to give yourself enough signal to start iterating.
And unlike the old days of machine learning, this process is fast, cheap, and scalable. You don’t need a labeling firm. You need a few good prompts and a basic eval loop.
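As a minimal sketch, assuming the OpenAI Python client and a `chunks` list built from your own corpus (both are illustrative choices, not requirements), the question-generation step can be as short as this:

```python
# Minimal sketch of synthetic question generation. The model name and the
# `chunks` list are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def generate_questions(chunk: str, n: int = 3) -> list[str]:
    """Ask an LLM for n plausible questions this chunk could answer."""
    prompt = (
        f"Here is a passage from our documentation:\n\n{chunk}\n\n"
        f"Write {n} distinct questions a user might ask that this passage answers. "
        "Return one question per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [q.strip() for q in lines if q.strip()]

# Each (question, chunk) pair becomes a labeled example: the chunk is the
# "relevant document" for the questions generated from it.
eval_set = [
    {"question": q, "expected_chunk_id": i}
    for i, chunk in enumerate(chunks)
    for q in generate_questions(chunk)
]
```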
There are many synthetic data generation libraries out there nowadays. My personal favourite is data_simulator by LangWatch, which first filters the documents to ensure the generated queries are relevant.
Why Embeddings Need to Reflect Meaning (Not Just Language)
Imagine you’re building a dating app. Two users fill in their bios:
“I love coffee.”
“I hate coffee.”
From a linguistic standpoint, these statements are opposites. But from a recommendation perspective, there’s a case to be made that they belong together. Both express strong food-and-drink preferences. Both users might be ‘foodies’, which is why they mentioned coffee in the first place.
The point here is subtle, but important: semantic similarity is not the same as task relevance.
Most off-the-shelf embedding models are trained on broad internet data. They encode assumptions about what similarity means - assumptions that may not apply to your use case. This is especially problematic when building search and retrieval systems where nuance matters.
That’s why fine-tuning your embedding model, even on a small number of labeled pairs, can make a noticeable difference. I’ve often seen teams improve their recall by 10-15% by fine-tuning their embedding models with just a couple hundred examples. Whether you’re recommending dating profiles or retrieving security policies, what matters is that the notion of “relevance” is aligned with your application’s goal.
Sometimes that means “similar meaning.” Sometimes it means “answers the same question.”
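For illustration, here is a rough sketch of that fine-tuning step using sentence-transformers on the synthetic (question, chunk) pairs from the earlier sketch. The base model, batch size, and epoch count are placeholder assumptions to adapt to your own data.

```python
# Rough sketch: fine-tune an off-the-shelf embedding model on a few hundred
# (question, relevant_chunk) pairs. Assumes `eval_set` and `chunks` from the
# synthetic data sketch above.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Positive pairs only: MultipleNegativesRankingLoss treats the other chunks in
# the batch as negatives, which works well for retrieval-style training.
train_examples = [
    InputExample(texts=[ex["question"], chunks[ex["expected_chunk_id"]]])
    for ex in eval_set  # in practice, hold out a split for evaluation
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("finetuned-embeddings")
```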
Chunking: Breaking Information into Pieces That Still Make Sense
Once you’ve decided how to represent your data (through embeddings), the next challenge is deciding how to split it.
Chunking is often treated as a preprocessing detail—something you do once, with a default length, and move on. But it can have an outsized impact on retrieval quality.
Here’s a simple example. Imagine you’re watching a cooking show and someone says:
“First, coat the pan with olive oil. Then add the chopped garlic and sauté for a minute until fragrant. Don’t let it brown.”
If you chunk this naively - say, by every 100 characters - you might end up splitting the instructions halfway through the garlic step. The retriever then returns something like:
“…coat the pan with olive oil. Then add the chopped…”
Which is both unhelpful and, in a cooking context, potentially misleading.
The issue is that retrieval models work best when each chunk represents a complete idea. When chunks contain disconnected thoughts, the embedding becomes muddled and the context retrieved becomes less useful - especially when passed into an LLM for answer synthesis.
Semantic chunking techniques - like embedding-based splitting or topic-aware segmentation - can help here. They preserve topical boundaries, detect shifts in meaning, and ensure that chunks feel more like self-contained thoughts than arbitrary blocks of text.
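As a simplified illustration of embedding-based splitting, a chunker can split wherever adjacent sentences stop looking alike. The sentence splitter, embedding model, and similarity threshold below are all assumptions to tune for your own corpus.

```python
# Simplified embedding-based semantic chunking: split where the cosine
# similarity between neighbouring sentences drops, signalling a topic shift.
import re
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(text: str, threshold: float = 0.5) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # A drop in similarity between neighbours suggests a new topic starts here.
        if float(cos_sim(embeddings[i - 1], embeddings[i])) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```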
In practice, chunking is one of the easiest levers to pull, and one of the most overlooked.
Querying: The Art of Asking Better Questions
Another under-appreciated piece of the puzzle: queries don’t always represent what a user means. Especially in customer-facing applications, users will ask vague or shorthand questions:
“What happened with the Q3 report?”
This doesn’t give the retriever much to work with. There’s no mention of the department, the exact metric, or the event being referenced.
Instead of taking these queries at face value, we can reframe them automatically using query transformation techniques:
Multi-query generation: Create multiple rewordings of the original question.
HyDE (Hypothetical Document Embeddings): Generate a “fake” answer and embed that for retrieval.
Query decomposition: Break complex questions into simpler sub-questions.
These techniques improve recall not by changing your database, but by being more intentional about how you ask it.
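As a hedged sketch of HyDE in particular, assuming an OpenAI chat model for the hypothetical answer and a sentence-transformers model for embedding (both placeholders), the idea fits in a single function:

```python
# HyDE sketch: generate a hypothetical answer to the query and embed that for
# retrieval instead of the raw question. Model names are placeholders.
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_embedding(query: str):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that plausibly answers: {query}",
        }],
    )
    hypothetical_answer = response.choices[0].message.content
    # The hypothetical answer usually lives closer to real documents in
    # embedding space than a terse, underspecified query does.
    return embedder.encode(hypothetical_answer)

# Use this vector for nearest-neighbour search instead of embedding the query itself.
```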
Reranking: The Final Filter
Retrievers give you candidates. But candidates aren’t answers.
Cross-encoders - models that take in both the query and a candidate chunk together - can score these pairs more precisely than simple embedding similarity. They’re slower, but more accurate, and are particularly useful when you want to ensure that the top few results are as relevant as possible.
Used wisely, rerankers can replace quantity with quality. Instead of sending ten partially relevant chunks to the LLM, send three that are near-perfect.
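A minimal reranking sketch with a public cross-encoder checkpoint from sentence-transformers (the model name is a common default, not a recommendation):

```python
# Rerank retrieved candidates with a cross-encoder, keeping only the top few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # Score each (query, chunk) pair jointly, then sort by relevance score.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```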
Evaluation: You Can’t Improve What You Don’t Measure
It’s surprisingly common to iterate on RAG systems without a clear sense of how well they’re working. In the absence of ground truth, we rely on intuition - or worse, isolated examples that look good in demos.
A more grounded approach is to treat RAG like any other data pipeline: define metrics, test systematically, and trace errors to root causes.
For example:
If answers are vague or incomplete, check recall: are the right documents even being found?
If irrelevant content is showing up, look at precision and chunk filtering.
If you’re evaluating in production, tools like Ragas provide structured ways to measure these breakdowns without needing human-labeled answers.
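Even without a dedicated framework, a recall@k check against the synthetic eval set from earlier fits in a few lines. `retrieve` here is a stand-in for whatever retrieval function your system exposes, returning chunk IDs.

```python
# Minimal recall@k: did the retriever surface the chunk each synthetic
# question was generated from? `retrieve(question, top_k)` is hypothetical.
def recall_at_k(eval_set, retrieve, k: int = 5) -> float:
    hits = 0
    for example in eval_set:
        retrieved_ids = retrieve(example["question"], top_k=k)
        if example["expected_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)
```

Track this number across experiments - chunking, embeddings, query rewriting, reranking. If it doesn’t move, the change probably didn’t matter.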
A Flywheel, Not a Checklist
Improving a RAG system isn’t a one-off effort. It’s a flywheel:
Start with a simple system.
Add synthetic or real user queries.
Evaluate failure cases.
Adjust chunking, query rewriting, embeddings, or rerankers.
Repeat.
Each step feeds the next, and the system gets better—not because any one tool is perfect, but because you’re systematically identifying and fixing what’s broken. Platforms like LangWatch are useful here, not as silver bullets, but as tools that help you observe where your system is falling short. They let you trace retrieval quality, flag hallucinations and correlate performance with specific design choices.
Improving RAG - Closing thoughts
The promise of RAG is clear: to combine the reasoning power of language models with the precision of structured knowledge. But the path to that promise runs through careful design.
Better RAG agents aren’t built by swapping tools, but by understanding what each part of the system does, where it can go wrong, and how to make it better. It’s part engineering, part information retrieval, and part empathy—understanding both what users need, and how models interpret that need.
It’s in this process - of systematically identifying and addressing what isn’t working - that RAG systems become reliable.
If you’re interested in putting this into practice, I’ve created a notebook series that follows this same philosophy. It walks through building and evaluating a RAG agent from the ground up, with a focus on iteration, measurement, and improvement. You’ll find examples of synthetic data generation, agent evaluation, and strategies for tuning retrieval step by step.
Check it out here.