The Five Stages of Reinforcement Learning

agents
llms
reinforcement-learning
The five stages of a reinforcement-learning setup for LLM agents — tasks, harness, rollout, reward, trainer — mapped onto my own structured-extraction work, and why reward design is the part that interests me most.
Author

Alex Strick van Linschoten

Published

June 16, 2026

Last time I wrote at a high level about the different possible approaches to improving performance, especially for more complex agentic workflows. Since then I’ve learned more about what exactly is involved in the reinforcement learning process, and I’m going to lay that out today.

While for supervised fine-tuning you have your (fixed) dataset, a model (to update either in part or more substantially) and a trainer, there are five broad pieces or stages that we can talk about for RL:

  1. Tasks – the problems or input that you attempt to solve / handle
  2. Harness – the tools you use to attempt to solve the problems
  3. Rollout – a recorded attempt of solving the problem (with the full ‘trajectory’ of traces captured)
  4. Reward – a score for the attempt (how well did it solve it? which may or may not be whether it was a ‘correct’ answer)
  5. Trainer – a way to nudge the model’s weights to achieve higher scores using some algorithm (as of June 2026, this is usually something called GRPO)
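To see how the five pieces fit together, here’s a minimal toy sketch. Everything in it is an assumption for illustration: the names (`Task`, `Rollout`, `harness`, `trainer_step`) are hypothetical, the ‘policy’ is a random guesser instead of an LLM, and the trainer is a placeholder that only reports the mean reward and doesn’t update any weights.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Task:
    prompt: str
    answer: str  # a golden answer; not every RL task has one


@dataclass
class Rollout:
    task: Task
    steps: list[str] = field(default_factory=list)  # the recorded trajectory
    final_answer: str = ""


def toy_policy(prompt: str, rng: random.Random) -> str:
    """Stand-in for the model: guesses one of two answers at random."""
    return rng.choice(["4", "5"])


def harness(task: Task, rng: random.Random) -> Rollout:
    """Run one attempt at the task and record what happened."""
    rollout = Rollout(task=task)
    rollout.steps.append(f"prompt: {task.prompt}")
    rollout.final_answer = toy_policy(task.prompt, rng)
    rollout.steps.append(f"answer: {rollout.final_answer}")
    return rollout


def reward(rollout: Rollout) -> float:
    """Score the attempt: here, a plain exact match on the final answer."""
    return 1.0 if rollout.final_answer == rollout.task.answer else 0.0


def trainer_step(scored: list[tuple[Rollout, float]]) -> float:
    """Placeholder: a real trainer would update weights; this returns mean reward."""
    return sum(score for _, score in scored) / len(scored)


def run(tasks: list[Task], attempts_per_task: int = 4, seed: int = 0):
    rng = random.Random(seed)
    scored = []
    for task in tasks:
        for _ in range(attempts_per_task):  # several attempts per task
            rollout = harness(task, rng)
            scored.append((rollout, reward(rollout)))
    return scored, trainer_step(scored)
```

Calling `run([Task(prompt="What is 2 + 2?", answer="4")])` gives back the scored rollouts plus the mean reward. In a real setup the harness would expose tools to the model and the trainer would use those scores to update the policy.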

Note that there’s a lot of jargon here, and some of the above terms go by other names as well. (For example, the model that you update is also known as a ‘policy’.) The parts can also be grouped. For example, we can think of a split between the static parts (tasks, harness, rewards) and those which are more about process and touch training infrastructure and GPUs (the trainer and the rollout). This split in turn gives us yet another (popular) term: ‘environment’, which is the combination of tasks, harness and rewards. There is quite a large set of new terminology in the world of RL which you have to navigate in order to understand the frameworks used and the research being conducted. Here’s a helpful way of decomposing that, drawn from a case study associated with Snorkel.

FinQA RL environment breakdown

It can be easy to think of ‘tasks’ as being a bit like the datasets used in SFT, but they’re still quite different. In SFT each example comes with the answer you want the model to copy; an RL task comes with a way to score an attempt, which may or may not involve a golden answer. (Sometimes there’s a correct answer to check against, sometimes only a rubric or a judge.) The model learns by maximising that score across many attempts, not by copying a single target. During training, the policy attempts each problem multiple times; the GRPO algorithm then uses the reward to differentiate between those attempts and, based on which attempts were relatively stronger, updates the weights of the model/policy accordingly.
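The ‘relatively stronger’ part can be made concrete. As I understand it, GRPO scores each attempt against the other attempts in its group by subtracting the group’s mean reward and dividing by the group’s standard deviation. Here’s a minimal sketch of just that step. The function name, the use of the population standard deviation and the `eps` term are my assumptions for illustration, and real trainers do a lot more than this (clipping, KL penalties and so on).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each attempt's reward against the other attempts on the same task.

    Attempts above the group mean get a positive advantage, attempts below
    it get a negative one. `eps` avoids dividing by zero when every attempt
    received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

One consequence is worth noticing: if every attempt in a group gets the same reward (all right or all wrong), every advantage comes out as zero, so that group gives the trainer nothing to learn from.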

Perhaps we can ground this in an actual example. If you recall, I previously worked on a fine-tuning case study where I extracted structured data from press releases. I had labelled some data where I manually extracted the relevant parts from press releases, and then I fine-tuned some models on that data. A very naive translation of that work into the reinforcement learning template would look something like this:

The interesting part in all of this for a domain expert (or a recovering domain expert like myself) is the reward. It seems there’s a lot of skill in determining how to test how well the model is doing at solving a problem, because for many problems it won’t be as ‘easy’ as just extracting data out of some short paragraph (where there actually is a correct answer). In many long-running tasks, where you might be attempting to make your agent better at conducting biological research or at coming up with nuanced responses to complicated legal queries, you are going to need quite nuanced ways of grading this.

From what I read, it seems that in 2025–2026 there was a shift away from hand-crafted step-by-step reward shaping – scoring the model at every intermediate step of its attempt, known as ‘dense’ or ‘process’ rewards – towards ‘sparse’ or outcome rewards that only score the final result. (One thing I had to untangle: ‘sparse’ doesn’t mean ‘simple’. The multi-part reward I sketched above for the extraction task – valid JSON and fields match and no invented provinces – is still a sparse reward, because it’s all computed once on the final answer. Dense vs sparse is about when you score across an attempt, not how many checks you bundle in.) Part of why outcome rewards seem to work (I think?) is that GRPO compares a whole group of attempts against each other, so even a coarse final score is enough to separate the stronger attempts from the weaker ones. There’s a lot of nuance here, and I’m interested to learn why this happened and where it still makes sense to have domain experts in the loop creating these reward functions. It seems to me that even if models are getting stronger, and even if the algorithm is strong enough to update the weights based on much vaguer outcome rewards, there needs to be something more to it than that…
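Here’s a rough sketch of what a multi-part but still sparse reward for the extraction task could look like, with all three checks (valid JSON, fields match, no invented provinces) computed once on the final answer. The weights, the `province` field name and the function signature are hypothetical choices of mine for illustration, not the reward actually used anywhere.

```python
import json


def extraction_reward(completion: str, gold: dict, known_provinces: set[str]) -> float:
    """Score a final answer once, bundling three checks into one number.

    Assumed weights: 0.2 for valid JSON, up to 0.6 for matching fields,
    0.2 for not inventing a province.
    """
    try:
        predicted = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # not even parseable: no credit at all
    if not isinstance(predicted, dict):
        return 0.0  # valid JSON, but not the object we asked for

    score = 0.2  # check 1: the output is a valid JSON object

    # Check 2: fraction of labelled fields reproduced exactly.
    if gold:
        matches = sum(1 for key, value in gold.items() if predicted.get(key) == value)
        score += 0.6 * matches / len(gold)

    # Check 3: any province named must be one we know exists.
    province = predicted.get("province")  # hypothetical field name
    if province is None or province in known_provinces:
        score += 0.2

    return score
```

Note that nothing here looks at intermediate steps. However many checks get bundled in, the function only ever sees the finished completion, which is what keeps it an outcome reward.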

If it were that simple, after all, then effectively we’ve found a way to achieve self-improvement from fairly vaguely defined outcomes. It would be almost as if the best way to teach children at school were just to define the curriculum they had to master by the end of the school year, and then every day school just consisted of the teacher saying, “you know, you’re smart, you can figure it out. I think you should just try harder. I know you have it in you!” And perhaps that’s a little bit the argument! That is, maybe that tuition approach wouldn’t work for a 2-year-old, but it might well work for a 16-year-old whose brain is much more developed and who can probably respond quite well to such a prompt/nudge. But I’m looking forward to learning more about that soon!

Some more questions that occurred to me as I went through all of this: