Introduction: The Real Issue Isn’t RL

Tahmid, AI Researcher @ LangWatch

May 30, 2025

There’s been a lot of chatter lately about how reinforcement learning (RL) doesn’t really teach language models anything new. The argument goes something like this: RL just brings out behaviours that were already latent in the model. I’ve seen papers like:

  • “RL from a single example works”

  • “RL with random rewards still shifts outputs”

  • “Base model pass@256 can match RL model pass@1”

The underlying message is that RL isn’t doing anything magical. It’s just reshaping what the model already knows.
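
For context, pass@k is the probability that at least one of k sampled completions solves the task, so “base model pass@256 matches RL model pass@1” means the base model can already produce the right answer, just rarely. Here is a minimal sketch of the standard unbiased estimator; the sample counts in the example are made up purely for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k completions,
    drawn from n samples of which c are correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A base model that solves a task in only 4 of 256 samples has a low pass@1
# but a pass@256 of 1.0 -- the capability is there, just rarely sampled.
print(pass_at_k(n=256, c=4, k=1))    # ~0.016
print(pass_at_k(n=256, c=4, k=256))  # 1.0
```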

I don’t disagree with the observations. But I think the conclusions being drawn from them are too rigid. These results don’t mean RL is fundamentally limited. They just reflect the narrow contexts in which it’s being used.

Most RL setups today are designed to fine-tune already capable models with shallow reward functions. If all you’re doing is nudging a model that already knows the answer, then sure, it’s going to look like RL is just sharpening the distribution. That’s expected.

But that doesn’t mean this is all RL is good for. It means we need to take a closer look at the setups we’re using to evaluate it.

How we evaluate shapes what we see

Most recent RL work follows a very specific recipe:

  • Start with a strong base model trained on massive amounts of data.

  • Add a lightweight reward function, often designed around human preferences or correctness signals.

  • Fine-tune the model using RL on tasks it has already seen examples of during pre-training.

In that setup, it's no surprise that RL doesn’t seem to create entirely new skills. The base model already contains a huge amount of knowledge. RL just shifts the model's output distribution to make desirable completions more likely.

This is exactly what we’d expect when RL is used to reinforce patterns that are already present but underrepresented. But again, that says more about the setup than the method.
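
Here is a toy sketch of that “sharpening” effect, with a made-up categorical policy standing in for a language model (this is not meant to represent any production RLHF setup): a REINFORCE-style update only reweights completions the policy could already produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy policy: a distribution over five completions the "base model" can already produce.
logits = np.array([2.0, 1.0, 0.5, 0.0, -1.0])
reward = np.array([0.0, 1.0, 0.0, 1.0, 0.0])   # which completions the reward function likes

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(len(probs), p=probs)        # sample a completion
    advantage = reward[a] - probs @ reward     # simple value baseline
    # REINFORCE: d log softmax(a) / d logits = one_hot(a) - probs
    logits += lr * advantage * (np.eye(len(probs))[a] - probs)

print(np.round(softmax(logits), 3))
# Probability mass moves onto the rewarded completions that were already in
# the support; nothing the base policy couldn't produce ever shows up.
```

That is an oversimplification, of course, but it captures why “reshaping what the model already knows” is the expected outcome in this kind of setup.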

The Myth of Emergence

There’s another idea floating around: that pre-training alone is responsible for all the interesting capabilities we see in models. According to this view, RL doesn’t add anything; it just helps retrieve what pre-training already encoded.

That sounds good on paper. But if you dig into the details of many recent papers, you’ll find that these capabilities often don’t emerge spontaneously. They show up when the pre-training or intermediate training data is designed in a very specific way.

In other words, these skills are not just hiding in the model by accident. They are cultivated. RL helps bring them out, but only if they were already seeded in the first place.

A better way to think about RL

At this point, it might help to shift how we think about RL. Instead of seeing it as a tool to teach models completely new behaviours, maybe it's more accurate to treat it as a form of targeted refinement.

You can think of it like onboarding a new hire. You don’t teach someone how to reason from scratch. You hire someone who is already smart and capable, then give them feedback to align them with the needs of your organisation. That feedback makes them more effective, not more intelligent.

RL plays a similar role. It doesn’t turn a weak model into a strong one. It takes a capable model and helps it behave in ways that are more aligned with your goals.

If we want more from RL, we need to ask more of it

If the goal is to use RL to create new capabilities, then we need to change the context in which we apply it. That starts with designing different kinds of experiments.

A few ideas:

  • Create reward functions that provide clearer and more meaningful signals, especially for harder tasks where correctness is not binary (a rough sketch follows below).

  • Use environments that go beyond what the model has already seen in pre-training. That way, we can actually observe generalisation or problem-solving.

  • Try running RL earlier in the pipeline, before the model has already learned so much from unsupervised data.

Most current results are based on running RL after the model has already seen everything. Of course the improvements look small and surface-level.
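
To make the first idea concrete, here’s a hypothetical partial-credit reward for a code-generation task; the weights and the `run_tests` stub are illustrative assumptions, not values anyone should copy:

```python
from dataclasses import dataclass

@dataclass
class TestReport:
    compiled: bool
    passed: int
    total: int

def run_tests(completion: str) -> TestReport:
    """Stand-in for a real sandboxed test runner (hypothetical)."""
    compiled = "def " in completion
    return TestReport(compiled=compiled, passed=1 if compiled else 0, total=4)

def shaped_reward(completion: str) -> float:
    """Partial-credit reward in [0, 1] instead of a single pass/fail bit."""
    report = run_tests(completion)
    r = 0.0
    r += 0.2 if report.compiled else 0.0          # it at least runs
    r += 0.7 * (report.passed / report.total)     # fraction of tests passed
    r += 0.1 if len(completion) < 2000 else 0.0   # mild verbosity penalty
    return r

print(shaped_reward("def add(a, b): return a + b"))  # 0.475
```

A graded signal like this gives the policy something to climb even when it can’t yet solve the full task, which matters most on problems outside the model’s pre-training comfort zone.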

Conclusion: Don’t blame the method

Saying RL can’t create new capabilities because it doesn’t seem to in current experiments is like saying a car can’t go uphill because you’ve only driven it on flat roads.

If all your evaluations are built on top of models that already contain the answer, and your rewards are tuned to reinforce what’s already there, then of course RL will look like it’s just nudging outputs around. And in that context, it’s doing its job.

But if we want RL to go beyond this, we need better eval setups. And that means we need better tools to see what’s really happening.

At LangWatch, this is exactly what we care about. We’re building infrastructure to trace, inspect, and understand LLM behavior at a much deeper level. Whether you're aligning a model with RLHF, experimenting with tool use, or probing for emergent behavior, you need to know more than just pass@1. You need visibility into how the model is solving problems, what changes between versions, and whether your interventions are actually moving the needle.

So no, RL isn’t broken, but our ability to evaluate it has a long way to go. That’s the part we’re working on.

I’m Tahmid, AI researcher @ LangWatch, with previous AI/ML experience at the secret service, at corporate financial institutions, and building my own AI start-up.

Schedule a call with me via this link to learn more about Evaluating Agents
