How the same evidence can yield different answers when you only change the order.
- The phenomenon: same chunks, different outcomes
- What we can observe during decoding
- A simple instability signal
- Measuring order sensitivity
- What this diagnostic is and is not
- Practical notes
Retrieval order is often treated as a formatting choice. In practice, changing context order can alter early next token distributions and move decoding onto a different trajectory. This post presents a practical way to measure that sensitivity using token probabilities.
The phenomenon: same chunks, different outcomes
In many retrieval augmented setups, the model receives a question and a fixed set of retrieved chunks. When only the chunk order is permuted, final answers often change.
Two observations matter. First, order effects are not rare edge cases. They appear even with greedy decoding because the model still makes a sequence of local choices.
Second, the effect is process level: drift is often visible before the final answer diverges.

What we can observe during decoding
At each decoding step t, an autoregressive model emits a next token distribution p_t over the vocabulary.
Many APIs expose only top k candidates with log probabilities. That is sufficient: renormalize on the logged support and compute lightweight diagnostics.
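As a toy sketch of that renormalization (the logprob values here are made up, not from any particular API):

```python
import math

# Hypothetical top-3 logprobs as an API might log them for one step.
topk_logprobs = {"Paris": -0.1, " Lyon": -3.0, " Nice": -4.2}

# Exponentiate, then renormalize on the logged support so the mass sums to 1.
unnorm = {tok: math.exp(lp) for tok, lp in topk_logprobs.items()}
z = sum(unnorm.values())
p_t = {tok: p / z for tok, p in unnorm.items()}

assert abs(sum(p_t.values()) - 1.0) < 1e-9
```

The result is a proper distribution over the logged support only, which is the approximation used throughout this post.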
We track two quantities at each decoding step.
- Uncertainty at a step, measured by the entropy of p_t.
- Distributional change from one step to the next, measured by a divergence between p_t and p_{t-1}.

These diagnostics turn the logged top k token probabilities into an instability score and summary statistics.
A simple instability signal
Define H_t as the entropy of p_t and D_t as the Jensen-Shannon divergence between p_t and p_{t-1}.
Then define the instability index as I_t = D_t + lambda * H_t, where lambda is a mixing weight; this post uses lambda = 1 by default.
For trace level summaries, use the peak statistic S = max_t I_t. For early warning, use the prefix statistic S_w = max_{t <= w} I_t, the peak over the first w steps.
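For concreteness, here is a minimal sketch of the two summaries on an illustrative I_t series (the numbers are invented):

```python
# Illustrative per step instability values I_t for one decoded run.
I = [0.2, 0.5, 1.8, 0.4, 0.3, 2.1, 0.6]

S = max(I)        # peak statistic over the full trace
w = 3             # early warning window; the choice of w is up to the user
S_w = max(I[:w])  # prefix statistic over the first w steps
```

Here S = 2.1 while S_w = 1.8: the prefix statistic already flags the early spike before the larger one arrives later in the trace.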
The construction is intentionally simple. The objective is not a full theory of decoding, but a consistent metric that can be computed across runs.
Measuring order sensitivity
The protocol below keeps model, prompt, and decoding settings fixed, and varies only chunk order. This isolates order sensitivity and keeps runs directly comparable.
- Fix a question and its retrieved chunk set.
- Generate multiple permutations of the chunk order.
- Decode with the same settings for each permutation and log top token probabilities.
- Compute the instability index I_t at each step, then summarize each run by S or S_w.
- Quantify sensitivity from variability across permutations.
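The loop behind this protocol can be sketched as follows. Here `decode_and_log` is a hypothetical callable, not part of any real API: it runs one decode for a given chunk order and returns the per step top k logprobs.

```python
import random

def order_sensitivity_runs(question, chunks, decode_and_log, n_perms=8, seed=0):
    """Run the protocol: same question and chunk set, permuted order only.

    decode_and_log is a hypothetical callable taking (question, chunks) and
    returning per step top k logprobs as a list of {token: log_prob} dicts.
    """
    rng = random.Random(seed)  # fixed seed for reproducible permutations
    runs = []
    for _ in range(n_perms):
        perm = chunks[:]
        rng.shuffle(perm)      # vary only the chunk order
        runs.append(decode_and_log(question, perm))
    return runs
```

Because the seed is fixed, rerunning the protocol reproduces the same permutations, which keeps runs comparable across experiments.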

Sensitivity can be reported in several equivalent forms, depending on whether you want dispersion, pairwise gaps, or threshold crossing rates.
- The dispersion (for example, the standard deviation) of S across permutations.
- The largest gap in S between any two permutations.
- The fraction of permutations that cross a chosen risk threshold.
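A minimal sketch of these summaries, computed over illustrative per permutation peak scores (the values are invented):

```python
# Illustrative peak scores S, one per permutation.
S_values = [0.9, 1.4, 2.6, 1.1, 2.8, 1.0]

mean = sum(S_values) / len(S_values)
# Dispersion: standard deviation of S across permutations.
dispersion = (sum((s - mean) ** 2 for s in S_values) / len(S_values)) ** 0.5
# Pairwise gap: spread between the most and least unstable orderings.
pairwise_gap = max(S_values) - min(S_values)
# Crossing rate: fraction of permutations above a chosen risk threshold.
threshold = 2.0
crossing_rate = sum(s > threshold for s in S_values) / len(S_values)
```

All three describe the same per permutation scores; which one to report depends on whether you care about typical spread, worst case gaps, or how often a risk line is crossed.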
With correctness labels, you can test whether larger sensitivity is associated with higher failure probability.
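As a toy illustration of that check (labels and scores invented; a real test needs many questions and a proper statistical model such as logistic regression):

```python
# Per run peak scores paired with made up correctness labels.
S_values = [0.8, 1.1, 2.7, 0.9, 2.9, 2.5]
correct = [True, True, False, True, False, False]

mean_fail = sum(s for s, c in zip(S_values, correct) if not c) / correct.count(False)
mean_ok = sum(s for s, c in zip(S_values, correct) if c) / correct.count(True)

# A positive gap is consistent with higher sensitivity accompanying failures.
gap = mean_fail - mean_ok
```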
What this diagnostic is and is not
This diagnostic is useful in three practical ways.
- A way to measure how much the decoding path changes under context permutations.
- A method that works with logged token probabilities.
- A tool to compare retrieval and prompting strategies by process dynamics.
It also has clear limits, summarized by the two points below.
- Not an intervention: it measures order sensitivity but does not reduce it.
- Not a claim that order effects can be eliminated with a single heuristic.
Practical notes
A few implementation details matter for reproducibility.
- Keep decoding settings fixed.
- Keep the chunk set fixed.
- Use a fixed random seed for permutation sampling.
- Use the same stopping rules across runs.
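One way to honor this checklist is to freeze everything except chunk order in a single configuration object. The field names below are hypothetical, not tied to any specific API:

```python
# Everything except chunk order is held fixed across runs.
RUN_CONFIG = {
    "temperature": 0.0,        # fixed decoding settings (greedy here)
    "top_logprobs": 5,         # k logged candidates per step
    "max_tokens": 256,         # same stopping rule for every run
    "stop": ["\n\n"],          # same stop sequences for every run
    "permutation_seed": 1234,  # fixed seed for permutation sampling
}
```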
When full vocabulary logits are available, compute entropy and divergence on the full distribution. When only top k logs are available, renormalize on the logged support and treat the result as a consistent approximation.
import math

# Inputs: per step top k logprobs: logp[t] = {token: log_prob}
# Output: instability strength S (peak of I_t = D_t + lambda * H_t)

def renormalize(logp_dict):
    # Exponentiate and renormalize on the logged top k support.
    probs = {tok: math.exp(lp) for tok, lp in logp_dict.items()}
    z = sum(probs.values())
    return {tok: p / z for tok, p in probs.items()}

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p.values() if pi > 0.0)

def jsd(p, q):
    # Jensen-Shannon divergence on the union of the two supports.
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(ai * math.log(ai / b[k]) for k, ai in a.items() if ai > 0.0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def instability(logp, lam=1.0):
    p_prev = None
    I = []
    for logp_t in logp:
        p_t = renormalize(logp_t)
        H_t = entropy(p_t)
        D_t = 0.0 if p_prev is None else jsd(p_t, p_prev)
        I.append(D_t + lam * H_t)
        p_prev = p_t
    return max(I)  # S, the peak statistic