Critical Rollout Size
Noise and Data
I started my career in data as an opto-mechanical engineer at an ultra-high-precision spectroscopy company. The group I was in developed sensors to measure specific air molecules down to parts-per-trillion precision. The work was amazing, but you quickly realize how important the measurement device is to the measurement. Signal lives and dies by the noise you allow.
Recently, I’ve been thinking about Allan-Werle plots again (https://link.springer.com/article/10.1007/s00340-024-08254-5 - great example here). These plots let you calculate the optimal integration time of your measurement. That optimum is set by non-white noise (drift, etc.); white noise, on the other hand, continues to average out the more samples you take. If you are in machine learning (ML), this might sound familiar.
The key difference in ML is the assumption that your dataset is a representation of what you want to model. Under this assumption (technically, i.i.d. sampling), the noise behaves like random (white) noise. This is made explicit in critical batch size (CBS - https://arxiv.org/pdf/1812.06162). ML has separated the measurement device (dataset curation) from the signal integration (modeling). This is powerful.
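For reference, the CBS paper frames this with a "simple" gradient noise scale, comparing the trace of the per-example gradient covariance \(\Sigma\) to the squared norm of the true gradient \(G\):

```latex
B_{\mathrm{simple}} = \frac{\operatorname{tr}(\Sigma)}{\lVert G \rVert^{2}}
```

Batches much smaller than this are noise-dominated; batches much larger see diminishing returns. The rollout analysis later in this post borrows that framing but drops the independence assumption.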
Recently, I had an idea that improved SFT. I used a common research-paper trope: denoise the update to your model. Unfortunately, these ideas never scale, because the batch sizes are too small. When batch sizes are small, you can throw out updates which are otherwise noisy. You stop fitting noise, and this is good. However, if you increase the batch size of your experiment, you quickly realize that scale is also a denoiser. The noise you were throwing out contains non-negligible signal.
This leads us back to CBS. There is a signal-to-noise ratio (SNR) beyond which larger batches stop paying off, given your other training constraints. Assuming your dataset is representative of the population you want to model, you optimize by increasing the batch size up to the CBS limit. My overview is not intended to be rigorous, and I am glossing over much of what makes CBS a good paper. Yet we’ve reached the point I’ve wanted to make: reinforcement learning (RL) can break the assumption that the data is uncorrelated/representative of the population.
In spectroscopy, if we could assume all our measurements were independent (as in ML), then we could have a similar CBS-like theory relating signal quality to measurement time. However, the non-random components limit us. RL has the exact same mechanism as spectroscopy: if we are producing rollouts, we can’t assume the data models our population, nor that it is uncorrelated. In fact, the covariance of our updates matters. We can then use this covariance frame to develop a critical rollout size and compare it against the literature.
Deriving Critical Rollout Size
If we want to track the non-random effects of our updates, we can look at our gradient updates, remove the known signal, and then determine the covariance of the noise. If there is no covariance, we are back in CBS territory. If there is covariance, we determine whether the covariance dominates (we reach a saturation point) or random variation dominates (we still reach a critical point). To be explicit, I will refer to the dominating point (critical or saturation) as the critical rollout size (CRS). To determine which component of variance dominates, we will define a value rho.
Rho estimates rollout-to-rollout gradient correlation inside a prompt group. We first split the prompt group's rollouts into equal reference and measurement halves. We compute the mean gradient from the reference half, then compute residuals for each rollout in the measurement half. From the residuals we estimate the noise power and the cross covariance. Finally, rho is defined as the cross covariance divided by the noise power. This gives us a measure of gradient correlation.
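A sketch of the procedure above (function names and the fixed first-half/second-half split are my own choices here; a randomized split would also be reasonable):

```python
import numpy as np

def estimate_rho(grads):
    """Estimate rollout-to-rollout gradient correlation (rho) within one
    prompt group. `grads` is a (K, D) array: one flattened gradient per
    rollout. Sketch only: uses a fixed first-half/second-half split."""
    K = grads.shape[0]
    ref, meas = grads[: K // 2], grads[K // 2 :]

    g_ref = ref.mean(axis=0)        # mean gradient from the reference half
    resid = meas - g_ref            # residual per measurement rollout

    # Noise power: average squared norm of the residuals.
    noise_power = np.mean(np.sum(resid**2, axis=1))

    # Cross covariance: average dot product over distinct residual pairs.
    M = resid.shape[0]
    gram = resid @ resid.T
    cross_cov = (gram.sum() - np.trace(gram)) / (M * (M - 1))

    return cross_cov / noise_power

def k_sat(rho):
    """Saturation rollout count: past ~1/rho, extra rollouts per prompt
    stop reducing the variance of the group-mean gradient."""
    return 1.0 / max(rho, 1e-8)
```

For independent rollouts the estimate sits near zero (with a small positive bias of order 1/K from sharing the reference mean); when the measurement residuals all point the same way it approaches 1, giving a K_sat of ~1.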
With these definitions we can define our saturation point as the rollout count equal to the inverse of rho. That is, variance keeps improving up to a critical saturation point, beyond which further rollouts no longer help with variance reduction. This follows from an equicovariance formula. We do still need to check that our saturation point is lower than our traditional critical point. Either way, we now have a way to calculate the saturation point.
The intuition here is that uncorrelated noise (as in CBS) still decreases with increased batch size. Before, we assumed that our samples were uncorrelated but noisy representations of our target distribution, so our noise scaled with the inverse of batch size. When we assume correlated noise, the noise still scales with the inverse of batch size within a linear regime, but the equicovariance formula drives us to a point where the effective batch size approaches the inverse of rho.
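Written out, the equicovariance formula looks like this: if each rollout gradient has noise power \(\sigma^2\) and pairwise correlation \(\rho\), the variance of the mean over \(K\) rollouts is

```latex
\operatorname{Var}(\bar g_K)
  = \frac{\sigma^{2}}{K}\bigl[1 + (K-1)\rho\bigr]
  = \sigma^{2}\Bigl(\frac{1-\rho}{K} + \rho\Bigr)
  \;\xrightarrow{K \to \infty}\; \rho\,\sigma^{2},
\qquad
K_{\mathrm{eff}} = \frac{K}{1 + (K-1)\rho} \;\to\; \frac{1}{\rho}.
```

The variance floors at \(\rho\sigma^{2}\), and the effective number of independent rollouts never exceeds \(1/\rho\): that asymptote is the saturation point.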
The rest of this post explores this idea. I ran a quick experiment with a DAPO-like setup to empirically validate critical rollout sizing. I ran this with rollout sizes of both 32 and 16 to measure rho, then swept rollout sizes 4 through 16 to verify the measurement. The experimental results follow.
Experimental Results
For the setup, I used GSM8K as the dataset and Qwen3 0.6B (instruct), with the Prime Intellect verifiers environment. For all of the rollout tests, I run with 1536 total examples per batch, so higher rollout counts mean fewer unique prompts. 1536 was taken directly from initial tests, which showed a critical batch size of 1-3k examples. Those initial tests were a 16 prompt by 16 rollout run and a 48 prompt by 32 rollout run; both showed similar results for critical batch size and critical rollout size.
All of these tests were run on Hyperbolic A6000 single-GPU instances rented through Prime Intellect. I have a custom single-instance vLLM inference/training pipeline. It follows standard practices but runs on a single GPU. Additionally, to save memory, I use LoRA (r=16, alpha=32, dropout=0) and gradient accumulation. The learning rate was set to 1e-5 and gradient-norm clipping to 0.2. These values were not optimized in this experiment but were set in accordance with prior experiments. I limited traces to 2048 new tokens and pruned overlong traces.
Initial testing was run to see what the gradient noise would look like and to determine whether different rollout/prompt sizing would yield different results. Mostly, I wanted to get a feel for what I was about to do. Both the 16 and 32 rollout runs agreed. Below I present the stats from my experiment with and without IQR filtering of the rollout saturation value (K_sat).
Unfiltered
Mean: 56.1
Median: 8.67
Std Dev: 1,055
Max: 63,392
Filtered
Mean: 8.21
Median: 7.92
Std Dev: 4.95
Max: 24.70
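The exact filtering parameters aren't stated above, but a standard Tukey-fence IQR filter of this kind looks like the following (k=1.5 is the common default; my choice for illustration, not necessarily the one used for these stats):

```python
import numpy as np

def iqr_filter(values, k=1.5):
    """Tukey fences: keep values within k * IQR of the quartiles,
    discarding extreme outliers like the >10,000 K_sat estimates."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return values[(values >= lo) & (values <= hi)]
```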
Empirically, we see that the rollout saturation point (K_sat) tends to be lower than the rollout critical point (K_crit). K_crit is the number of rollouts per prompt needed for the mean gradient estimate to stop being noise-dominated once you account for rollout-to-rollout correlation; it is derived from a correlated CBS formula. I did not provide K_crit values above because they were never the dominating (lower) value. I refer to the lower of K_crit and K_sat as the critical rollout size; in this experiment, the critical rollout size was always K_sat. We also see that our ideal rollout size (~8 per prompt) matches what often shows up in the literature (https://arxiv.org/html/2509.26209v1).
Below, I provide a chart of 4, 8, 16 and 32 rollouts per prompt, all with the same number of traces per batch. The 32 rollout run was part of the rho-measurement run but used the same number of total samples. It was also cut short due to time constraints and the added computation. However, I found it an interesting addition since it was run with equivalent test hyperparameters.
In the chart, group size 8 was the best performing and the most stable. Group size 4 degraded at the tail end of training. Group size 16 was showing signs of degradation as well by the stopping point of the run. Group size 32 lagged behind every other group.
Our predicted rollout size (~8) matched our most performant rollout size (8), which is a cool result! The implications will be discussed in the next section. However, there was more data from our experiments that agreed with prior literature, and I will present those results before the discussion.
In the literature (https://arxiv.org/abs/2504.03380), we see that filtering math datasets toward a ~0.5 pass rate makes models perform better in RL training. There are entropy- and variance-based arguments for this result. I will provide variance-based data below.
Plot 1. A minimally filtered plot of K_sat versus pass rate. A few data points near the 1.0 pass-rate level had a K_sat of >10,000, which made the visualization difficult, so they were trimmed.
The minimally filtered chart was the first I saw on WandB. I was struck that most noisy updates clustered towards both the 0 and 1 pass rates. "Noisier" is a loaded term here, because K_sat is influenced by both covariance and absolute variance. However, given our rollout sizes, examples where K_sat >> K will appear noisy as updates. I remembered the prior studies on difficulty filtering near a 0.5 pass rate and took this as supporting evidence: the band those studies found most effective is the one our noise statistics single out.
Plot 2. K_sat versus pass rate after filtering outliers.
I then used IQR filtering to remove outliers and saw something even more striking. In the same band where the cited study shows improved learning dynamics, no rollouts have a saturation point of 1. As we get closer to a 0.5 pass rate, the minimum saturation point increases. When we break K_sat out by pass-rate quintile, we see the same trend.
K_sat-Only IQR Filtered
| Pass-Rate Quintile | Count | Mean KSAT | Median KSAT | Std Dev |
|---|---|---|---|---|
| Q0 | 915 | 8.45 | 8.09 | 4.82 |
| Q1 | 804 | 9.25 | 8.51 | 4.15 |
| Q2 | 976 | 9.17 | 8.57 | 4.76 |
| Q3 | 673 | 7.92 | 7.62 | 5.37 |
| Q4 | 720 | 5.68 | 6.12 | 4.87 |
Quintile Cutpoints: 0.03125, 0.375, 0.625, 0.8125, 0.928571, 0.96875
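A sketch of how a quintile breakout like the one above can be computed (hypothetical helper, assuming per-prompt arrays of pass rates and K_sat estimates; cutpoints come from the empirical pass-rate distribution, as the listed cutpoints suggest):

```python
import numpy as np

def quintile_stats(pass_rate, ksat, n_bins=5):
    """Bucket prompts into pass-rate quintiles and summarize K_sat per
    bucket: count, mean, median, sample std dev."""
    pass_rate = np.asarray(pass_rate, dtype=float)
    ksat = np.asarray(ksat, dtype=float)
    # Interior cutpoints from the empirical pass-rate distribution.
    cuts = np.quantile(pass_rate, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(pass_rate, cuts)   # bin index 0..n_bins-1 per prompt
    rows = []
    for q in range(n_bins):
        vals = ksat[bins == q]
        rows.append({
            "quintile": f"Q{q}",
            "count": int(vals.size),
            "mean": float(vals.mean()) if vals.size else float("nan"),
            "median": float(np.median(vals)) if vals.size else float("nan"),
            "std": float(vals.std(ddof=1)) if vals.size > 1 else float("nan"),
        })
    return rows
```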
What is evident from this quintile breakout is that the mid-range pass rates have the tightest K_sat band and the highest K_sat values. In the higher pass-rate quintiles, the rollout signal saturates faster; these are less informative rollouts. There is some indication of a similar trend at the low pass-rate end, but it is statistically weaker: we do see low-saturation rollouts there, even if the mean/median are not affected as much as in the high pass-rate quintiles.
KL divergence showed minimal deviation between runs. Like the original DAPO, I did not use a KL penalty.
I did use overlong filtering. I left the 32 rollout run off this chart because it did not have enough steps to adequately separate itself. However, we can see that runs on either side of the predicted rollout size had more overlong filtering, and thus fewer trajectories per batch, as training progressed.
Discussion of results
Taking written inspiration from the critical batch size paper, I’m going to start with the intuition behind a rollout saturation point. When we generate rollouts in a GRPO-like manner, we are generating signal relative to the other rollouts. The logical extremes are zero rollouts with zero signal and critical batch size with optimal signal. However, since rollouts generate signal through comparison, there can be a point before the critical batch size where each subsequent rollout adds negligible extra signal, because it is highly correlated with prior rollouts.
Intuitively, CBS would only be reached if the correlated noise were negligible at CBS scale. Otherwise we get a regime similar to CBS but with correlated noise: scaling rollouts denoises until the correlated-noise limit is reached, rather than the signal limit. Under our CRS, we would see nearly linear benefit from increasing rollouts. Over our CRS, increasing rollouts is wasted compute.
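To put rough numbers on this, using rho ≈ 1/8 (matching the measured K_sat ~ 8) and the equicorrelation effective-sample-size formula:

```python
# Effective independent sample count under equicorrelated rollout noise.
# With rho ~ 1/8, returns are near-linear for small K and flatten toward
# the asymptote 1/rho = 8.
def k_eff(K, rho):
    return K / (1 + (K - 1) * rho)

rho = 1 / 8
for K in (4, 8, 16, 32):
    print(f"K={K:2d}  K_eff={k_eff(K, rho):.2f}")
```

Doubling rollouts from 4 to 8 buys ~1.4 effective samples, while doubling from 16 to 32 buys only ~1.0: past the CRS, extra rollouts are mostly wasted compute.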
Within the limits of my experiment, the link between pass rate and learning is supported. The empirical result of 8 rollouts per prompt is supported by the theoretical results: we predicted this value from rollouts of 16 and 32, and we show that 8 rollouts give better learning performance on our toy samples.
In our toy training example, we see instability on either side of the critical rollout size. We keep the sequences per batch equal across all runs, so low rollout runs have more prompts per batch and vice versa. In the low rollout regime (in our case 4 rollouts per prompt), we are stable early in training because of the large number of prompts. However, in later stages of training, the per-prompt signal is poor, because we are either saturating (high pass rate with low signal) or not exploiting the full signal per prompt (mid pass rate with good signal). As such, the effective batch size significantly decreases as training progresses. We can see the 4 rollout run catastrophically begin to fail on its final updates.
In the inverse case, we have fewer prompts but more rollouts. Visually, the 16 and 32 rollout runs look less stable. That’s because they are! With fewer prompts, the batches begin with lower overall signal per update step, because rollouts past the saturation point contribute little. In the 16 rollout case, effectively 50% of the batch does not improve our signal. Since we lose 50% and don’t compensate with more prompts (which would reduce variance), the 16 rollout run sits in a higher-variance regime. The 32 rollout run learns the slowest: it gains no extra variance reduction from the added rollouts, has even fewer prompt samples, and so has the least signal per step.
There is another consideration for the collapse of the 4 rollout run: overlong filtering. Prior experiments found that overlong samples were detrimental, so we employ overlong filtering (as in the DAPO paper). However, those prior experiments also showed that too much overlong filtering is detrimental. We see that same phenomenon here when the 4 rollout run collapses: most of the batch is filtered, and what remains carries very little signal, leading to a dangerously low effective batch size. It is not the filtering itself that is dangerous but filtering that leads to an unstable batch. The 16 rollout run is more stable but again suffers from a lower batch size in both real and effective traces.
In GRPO studies with GSM8K, traces tend to increase in length and then stabilize around 5-6k tokens. In my toy sample, the model is restricted to below this value, which appears detrimental. It also means that if there is not enough learning signal in our traces, we will eventually get reinforcing degradation instead of reinforcement learning. A valid criticism would be our limited tokens per rollout. However, the 8 rollout run shows resilience to model collapse, which implies that increasing the tokens per rollout would not drastically change our findings, even if it improved learning across rollout runs. This should be addressed in future experiments.
Without intending to, we also lend noise-based evidence to difficulty filtering. Our high-saturation and low-saturation rollouts follow a distinct pattern that matches the empirical studies of difficulty filtering. Specifically, we show that the ~0.5 pass rate rollouts had the tightest standard deviation and most consistent signal. Although difficulty filtering has been applied to math, the nature of our saturation metric means we should be able to apply it to non-math RL. The pass-rate metric is not strictly needed: we can use rewards and our saturation statistics to determine optimal difficulty for our models, applying the filtering at the saturation limits of reward bands.
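As a sketch of what saturation-based difficulty filtering could look like (the function, names, and threshold here are hypothetical, not something run in these experiments):

```python
import numpy as np

def saturation_filter(prompts, ksat_estimates, k_min=4.0):
    """Hypothetical curriculum filter: keep only prompts whose measured
    K_sat stays above a floor, i.e. prompts whose rollouts have not yet
    saturated and still carry per-rollout signal. The k_min threshold is
    illustrative."""
    ksat = np.asarray(ksat_estimates, dtype=float)
    return [p for p, k in zip(prompts, ksat) if k >= k_min]
```

Because K_sat is computed from gradients rather than verifier pass rates, the same filter would apply to reward bands in non-math environments.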
We do break some key assumptions of CBS, starting with the very nature of the estimator. Technically, the estimator in RLVR depends on rollout size, while the critical point in CBS relies on its estimator being independent of batch size. Despite this, we still end up with saturation points, which act as inflection points and which I am therefore calling critical. I’ve seen that, despite the drifting nature of the objective, we may still be able to use this rollout saturation point. In the following, I address some other points brought up in the original CBS paper, titled as they were in the paper but extended for CRS.
- Larger for difficult tasks - In the original paper, they expect CBS to increase for difficult tasks. However, I wouldn’t expect rollout saturation to increase for difficult tasks. Although it is an industry trend to use high-temperature sampling in RLVR environments, the model is restricted in its exploration by pretraining’s local objectives. This is what makes LLMs effective in RL when initial pass rates are strong enough, but hinders them on difficult tasks: the local constraint hinders global exploration.
- Growth over training - CBS is expected to grow over training; CRS is expected to shrink. CRS depends on entropy and rollout diversity, and we expect both to decrease throughout training. In addition, higher pass-rate rollouts tend to have lower rollout saturation points, which leads to a decreasing saturation point as training improves the model.
- Weak dependence on model size - CBS has weak dependence on model size because, at fixed loss, the model parameters cancel in the noise scale. However, we would expect CRS to have a strong dependence on model size, because rollout correlation and entropy would not cancel.
- Learning rate tuning - CBS and CRS align here. Too low a learning rate will artificially inflate noise, which would inflate the ratio of uncorrelated to correlated noise. Thus we would expect a higher rollout saturation point without a tuned learning rate.
Now, one might quickly point to Dota 2 and the original OpenAI Five experiments as a counter to what I have laid out (https://arxiv.org/pdf/1912.06680). There were reinforcement learning examples in the original CBS paper as well. I don’t find this a compelling argument. In the game environment, we would likely find that each rollout contributes little to rho. If the covariance-to-variance ratio is small enough, the rollouts behave very nearly as in CBS directly. In fact, that is what we see in non-LLM RL research, and it is addressed in the original CBS paper.
Additionally, we can look at the challenges laid out in the Dota 2 paper: long time horizons, partially observed state, and high-dimensional action/observation spaces. The characteristics of Dota 2 gameplay required the researchers to add environmental restrictions: they only allowed a subset of characters, and some game mechanics were controlled by hand-scripted logic. Combining these characteristics, you end up with a system with low covariance relative to variance. This is not because covariance is low per se, but because variance is high and dominating. Traditional RL in this environment can use millions of samples because of the huge gradient noise scale.
Diversity is also a key consideration in the Dota 2 paper (and in RL generally). They played against past opponents in slightly perturbed sampling regimes, and the researchers scaled the environment so they could use “reasonably diverse” large batches. Finally, PPO in their experiments results in lower intra-group coupling than GRPO. Non-LLM RL treats diversity as an explicit goal and recommendation for controlling the observed variance.
What I have laid out is not groundbreaking and follows from intuition: critical batch size directly follows from covariance. Here I wanted to explore what we see in LLM-based RLVR, which is high covariance, and what this means for dataset noise when treating the RL model as a measuring device. What I have laid out follows my initial lines of inquiry. I found this a useful exercise for myself and hope that others may find it useful as well.
Future Areas of Inquiry
This is a limited experiment: it is small, uses small models, and uses an overused dataset. However, I have shown that the theory appears to be supported in my toy model. Right now, I don’t have the access or capital for the GPUs to push this much further at pace. I’m sure I will make progress on the future work below over time. However, if anyone would be willing to sponsor my work with credits, I would be very gracious. My goals would be to push this more formally towards a CBS-like framework and show how we could use this method effectively in practice.
Below are some questions that are on my mind.
Does CRS hold at various scales and datasets?
- This is pretty self-explanatory. Does this only hold as a product of my experimental setup, or do the indications hold true more broadly?
What’s causing the noisy samples?
- I have not yet pulled the noisy samples. It would be interesting to look at what they express in linguistic space, or at the properties of their prompts. I would also like to filter out noisy samples entirely and see how that affects training.
How effective is difficulty filtering using our noise metrics?
- We have indications that the noise metrics match closely with results from difficulty filtering. This indicates that noise based approaches may be effective in curriculum learning.
Can we correlate our covariance noise with anything else (entropy, variety, noise)?
- The goal here would be to completely remove the need for gradient calculations. Alternatively, I’m sure that my implementation can be optimized. The ability to quickly and efficiently calculate the saturation point of rollouts will be useful.
What are the implications on a multi-turn environment?
- Like Dota 2, I would assume a multi-turn environment will have even noisier updates, which pushes us closer to CBS-like dynamics. This raises many follow-up questions around scaling the environment.
How can we increase diversity?
- Without scaling the environment, how can we increase diversity? Using CRS as a metric, we can directly look at strategies to increase diversity without the need to train a model with swept rollout sizes.
Is it optimal to increase diversity only of pivot tokens?
- o3 and o1 by OpenAI were groundbreaking moments in RLVR for LLMs. However, they had quite a few quirks, likely from OpenAI increasing entropy in any way possible during RL. GPT-5 variants do not have these quirks. Could this be related to increasing entropy on pivot tokens instead of across entire traces? Non-pivot tokens tend to be completion-focused as opposed to redirecting (potentially a result of the local-global dynamics of pretraining).
If we bound LLMs in game-like settings does that help or hurt?
- This is essentially why we build current RL environments. How much does our restriction of the model impact our CRS? What kind of environments allow us to generate viable signal? These are questions that a more robust CRS analysis allows us to address.