The Five Stages of Reinforcement Learning
Last time I wrote we were talking at a high level about the different possible approaches to improving performance, especially when talking about more complex agentic workflows. Since then I’ve learned more about what exactly is involved in the reinforcement learning process and I’m going to lay that out today.
While for supervised fine-tuning you have your (fixed) dataset, a model (to either update in part of more substantially) and a trainer, there are five broad pieces or stages that we can talk about for RL:
- Tasks – the problems or input that you attempt to solve / handle
- Harness – the tools you use to attempt to solve the problems
- Rollout – a recorded attempt of solving the problem (with the full ‘trajectory’ of traces captured)
- Reward – a score for the attempt (how well did it solve it? which may or may not be whether it was a ‘correct’ answer)
- Trainer – a way to nudge the model’s weights to achieve higher scores using some algorithm (as of June 2026, this is usually something called GRPO)
Note that there’s a lot of jargon here, and even some of the above terms have more equivalents. (For example, the model that you update is also known as a ‘policy’.) There are also some groupings of parts. For example, we can think of a split between the static parts (tasks, harness, rewards) and those which are more process or that touch training infrastructure and GPUs (like the trainer and the rollout). This split in turn is where you can find yet another (popular) term: ‘environment’, which is the combination of tasks, harness and rewards. It seems there is quite a large set of new terminology associated with the world of RL which you have to navigate in order to understand the frameworks used and the research being conducted. Here’s a helpful way of decomposing that, drawn from a case study associated with Snorkel.

It can be easy to think of ‘tasks’ as being a bit like the datasets used in SFT, but it’s still quite different. In SFT each example comes with the answer you want the model to copy; an RL task comes with a way to score an attempt, which may or may not involve a golden answer. (Sometimes there’s a correct answer to check against, sometimes only a rubric or a judge). The model learns by maximising that score across many attempts, not by copying a single target. The approach with RL is that the GRPO algorithm (during training) will attempt to solve those problems multiple times, and then it will use the reward to differentiate between those attempts and then based on which attempts were relatively stronger it will update weights of the model/policy accordingly.
Perhaps we can ground this in an actual example. If you recall, previously I worked on a fine-tuning case study where I extracted structured data from press release. I had labelled some data where I manually extracted the relevant parts from press releases, and then I fine-tuned some models on that data. A very naive translation of that work into the reinforcement learning template would look something like this:
- Tasks: this is literally the inputs of the dataset. As an example, that’d be one of the press releases out of which we wanted to extract some sort of structured data (i.e. where events took place, some quantitative numbers associated with those events and so on)
- Harness: in our most simplistic harness, we could simply have one tool (“extract_data”) available to our model where we passed the press release along with a prompt saying “please extract the data in so and so format”. A slightly more extended version could add tools for validating or double-checking the extracted values (checking the JSON formatting was correct and so on), potentially, as part of a longer agentic loop. The important thing is that these tools are defined in code and that they would be available to the model to use. When we’re talking about RL for these kinds of agent loops, what we really seem to be talking about is improving its ability to use the tools it has available to it.
- Rollout: Not much to say about this one. Fairly straightforward.
- Reward: in our naive version, we could simply check whether the extracted data matched the golden dataset’s correct equivalent for any particular press release. (Side note: it’s unclear to me whether RL would actually improve the ability of the model to do the extraction itself, or just to make sure that it would use the
extract_datatool/function. I hope to answer this question soon! But let’s say we were doing the slightly extended version which had some validation and ‘checking’ tools available to it, then we can just say that we’d hope that the model would use these tools productively to check and correct its work before returning values back to the user.) We could think of many other parts to the reward function which aren’t just ‘was this correct’, too. We could deterministically check that the value returned was valid JSON, or check that we don’t have province names returned in the answer that weren’t in the press release, or that the numbers don’t add up to something non-sensical for that domain and so on. You can then weight these various parts to the reward function depending on their importance. - Trainer: Not much to say about this except to say that this is usually done by different frameworks than what one is used to with standard machine learning or fine-tuning. Part of the difference is that you are doing actual inference during the training (multiple times over each problem so you can compare between them) so you are both serving the model and updating weights at the same time. (I think this basically means you have the model loaded into GPU memory (VRAM) roughly two times – one copy generating attempts, one copy training – which is why people quote a rough 50/50 memory split. Newer frameworks like Unsloth dodge the double-load by sharing the weights between the two, which is how GRPO now fits in as little as ~5GB of VRAM for a small model.)
The interesting part in all of this for a domain expert (or a recovering domain expert like myself) is the reward. It seems there’s a lot of skill in determining how to test how well the model is doing at solving a problem. Because note that for many problems it won’t be as ‘easy’ as just extracting data out of some short paragraph (where there actually is a correct answer). In many long-running tasks where you might be attempting to make your agent better at conducting biological research, or better at coming up with nuanced responses to complicated legal queries etc, you are going to need quite nuanced ways of grading this.
From what I read, it seems in 2025-2026 there was a shift away from hand-crafted step-by-step reward shaping – scoring the model at every intermediate step of its attempt, known as ‘dense’ or ‘process’ rewards – towards ‘sparse’ or outcome rewards that only score the final result. (One thing I had to untangle: ‘sparse’ doesn’t mean ‘simple’. The multi-part reward I sketched above for the extraction task – valid JSON and fields match and no invented provinces – is still a sparse reward, because it’s all computed once on the final answer. Dense vs sparse is about when you score across an attempt, not how many checks you bundle in.) Part of why outcome rewards seem to work (I think?) is that GRPO compares a whole group of attempts against each other, so even a coarse final score is enough to separate the stronger attempts from the weaker ones. There’s a lot of nuance here and I’m interested to learn why this happened and where it still makes sense to have domain experts in the loop creating these reward functions. It seems to me even if models are getting stronger and even if the algorithm is strong enough to update the weights based on much vaguer outcome rewards, there needs to be some more to it beyond just that…
If it was that simple, after all, then effectively we’ve found a way to achieve self-improvement from fairly vaguely defined outcomes. It would be almost like if the best way to teach children at school would just be to define the curriculum they had to master by the end of the school year, and then every day school just consisted of the teacher saying, “you know, you’re smart, you can figure it out. I think you should just try harder. I know you have it in you!” And perhaps that’s a little bit the argument! i.e. maybe that tuition approach wouldn’t work for a 2-year old, but it might well work for a 16-year old whose brain is much more developed and can actually probably respond quite well to such a prompt/nudge. But looking forward to learning more about that soon!
Some more questions that occurred to me as I went through all of this:
- what are the hard parts? for people working on using RL to improve their system, is the hard part setting up the training infrastructure and system in such a way or is it finding the right way to specify rewards or what? (My reading suggests that it’s all about the rewards and less about the narrow technical details, but I haven’t gone particularly deep so I might be wrong there.)
- How does this process work with teams doing RL? What are the personas involved? Is it possible (or common?) for a single person to be doing this, or if it’s more often a team, then what kinds of labels are given to the various people? Are they all engineers? Are there domain experts mixed in with the team?
- What tools map on to the different stages? (This is maybe the easiest question to answer and I assume during the course of my study I’ll build up a much more detailed picture of this. It’s also somehow the least interesting question.)
- For terminology, it occurred to me that I need to keep an ongoing list of terms and their equivalents, perhaps in a table form, so that I can track all the different ways that researchers and practitioners refer to the same thing in slightly different ways :)
- I’m very curious about the best practices for reward design. I assume there are some, and I assume that the very best practices are probably being hoarded by frontier labs, but I’m looking forward to seeing what I can discover.
- How do domain experts distinguish themselves in this field? I imagine it’s quite easy for mathematicians and scientists to work with the RL teams since they are already at least partly ‘technical’, but I’d be curious how experts from the humanities are being brought into the RL work. For example, for something like OpenAI’s “Deep Research” feature, were there non-engineers on the team that developed this?
- Connected to my previous work on hinbox, I’m very interested in how we define (via the reward function) what a perfect profile is. This is a very qualitative evaluation and has many different facets to it. Do we give our answer based on the diff (i.e. how we just updated part of a particular profile) or do we give our answer based on the profile as a whole? Or both? or some blended answer?
- (How do RLMs fit into all of this? I’ve previously written about RLMS in the context of ZenML) but it’s unclear to me how the two are related, if at all).
- How much of the RL energy and craze at the moment is big model hype? Will all of these techniques last? Who are the people who are challenging the rise of RL? Who are the people doing weird things in the space? Who are the people who really say it’s a waste of time? Really interested in all of this, since for sure it does feel like there’s a bit too much attention and effort being paid to it.