a cost model for multi-step llm agent workflows

strong model first or weak model first? find the crossover. how bad does the weak model have to be before strategy a becomes worth its generation premium?
when building an agent that generates code through multiple llm calls, you choose between two strategies:
strategy a (strong → weak): pay upfront for high-quality generation, then fix residual bugs cheaply.
strategy b (weak → strong): generate cheaply, then deploy the strong model to fix what broke.
| variable | meaning |
|---|---|
| $c_m^{in},\; c_m^{out}$ | input / output cost per token for model $m$ |
| $L_0$ | initial prompt tokens |
| $G_0$ | output tokens for initial generation |
| $G$ | output tokens per fix attempt |
| $E$ | error trace tokens added per iteration |
| $q_m, \;\phi_m$ | bug count and hard-bug fraction after generation by model $m$ |
| $p_m^e, \;p_m^h$ | probability model $m$ fixes an easy / hard bug in one attempt |
each bug of difficulty $d$ takes $1 / p_{fix}(m,d)$ attempts in expectation (geometric distribution). writing $g$ for the generating model and $f$ for the fixing model, the total expected fix iterations are:

$$I = q_g \left[ \frac{1 - \phi_g}{p_f^e} + \frac{\phi_g}{p_f^h} \right]$$
and the variance of total attempts, since each geometric attempt count contributes $(1-p)/p^2$:

$$V = q_g \left[ (1 - \phi_g)\,\frac{1 - p_f^e}{(p_f^e)^2} + \phi_g\,\frac{1 - p_f^h}{(p_f^h)^2} \right]$$
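a minimal sketch of these two quantities in code. the parameter values below (bug count, hard fraction, fix rates) are invented for illustration, not taken from the model:

```python
# I and V from the formulas above; all parameter values are illustrative.

def expected_iterations(q, phi, p_easy, p_hard):
    """I = q * [(1 - phi)/p_e + phi/p_h], summing geometric means per bug."""
    return q * ((1 - phi) / p_easy + phi / p_hard)

def iteration_variance(q, phi, p_easy, p_hard):
    """V = q * [(1 - phi)(1 - p_e)/p_e^2 + phi(1 - p_h)/p_h^2]."""
    return q * ((1 - phi) * (1 - p_easy) / p_easy**2
                + phi * (1 - p_hard) / p_hard**2)

# e.g. 6 bugs, 30% hard; fixer resolves easy bugs 80% / hard bugs 40% per try
print(expected_iterations(6, 0.3, 0.8, 0.4))  # ≈ 9.75
print(iteration_variance(6, 0.3, 0.8, 0.4))   # ≈ 8.06
```

note that hard bugs dominate $V$: a 40% fix rate contributes $0.6/0.16 = 3.75$ per bug, over 4x the easy-bug term.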
each attempt adds $\Delta = G + E$ tokens. in a shared conversation, all attempts accumulate in one thread. attempt $t$ sees context $L_1 + (t-1)\Delta$ where $L_1 = L_0 + G_0$.
the total fix cost sums over all $T$ attempts:

$$C_{\text{fix}} = \sum_{t=1}^{T} \left[ c^{in}\bigl(L_1 + (t-1)\Delta\bigr) + c^{out} G \right] = T\bigl(c^{in} L_1 + c^{out} G\bigr) + c^{in} \Delta\, \frac{T(T-1)}{2}$$
taking expectations ($T$ is random, a sum of geometric variables, so $\mathbb{E}[T] = I$ and $\mathbb{E}[T^2] = I^2 + V$):

$$\mathbb{E}[C_{\text{fix}}] = I\bigl(c^{in} L_1 + c^{out} G\bigr) + c^{in}\,\frac{\Delta}{2}\,\bigl(I^2 + V - I\bigr)$$
the $I^2$ term means cost grows quadratically with total iterations. bug 10's context includes all attempts from bugs 1–9.
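a monte-carlo sanity check of the shared-conversation accounting: simulate geometric retries accumulating in one thread and compare against the closed form $I(c^{in}L_1 + c^{out}G) + c^{in}\frac{\Delta}{2}(I^2 + V - I)$. every price and token count below is invented:

```python
# monte-carlo check of the shared-conversation cost model; toy numbers only.
import random

def simulate_shared(num_bugs, p_fix, c_in, c_out, L1, G, E,
                    trials=100_000, seed=0):
    """average fix cost when every attempt lands in one growing thread."""
    rng = random.Random(seed)
    delta = G + E
    total = 0.0
    for _ in range(trials):
        cost, t = 0.0, 0                   # t counts attempts so far
        for _ in range(num_bugs):
            while True:                    # geometric retries per bug
                cost += c_in * (L1 + t * delta) + c_out * G
                t += 1
                if rng.random() < p_fix:
                    break
        total += cost
    return total / trials

def closed_form(num_bugs, p_fix, c_in, c_out, L1, G, E):
    """I*(c_in*L1 + c_out*G) + c_in*(delta/2)*(I^2 + V - I)."""
    delta = G + E
    I = num_bugs / p_fix
    V = num_bugs * (1 - p_fix) / p_fix**2
    return I * (c_in * L1 + c_out * G) + c_in * delta / 2 * (I**2 + V - I)

sim = simulate_shared(3, 0.5, c_in=1.0, c_out=2.0, L1=10, G=3, E=2)
exact = closed_form(3, 0.5, c_in=1.0, c_out=2.0, L1=10, G=3, E=2)
print(sim, exact)   # the two agree to well within 1%
```

the key detail is that `t` never resets between bugs: that global counter is exactly what produces the $I^2$ term.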
a well-designed agent resets context after each bug is fixed. bug $i$ gets a fresh conversation starting from $L_1$ (prompt + current code). only retries within that bug accumulate context.
the expected cost for bug $i$ with fix probability $p_i$, using $\mathbb{E}[T_i(T_i - 1)]/2 = (1 - p_i)/p_i^2$ for a geometric $T_i$:

$$\mathbb{E}[C_i] = \frac{1}{p_i}\bigl(c^{in} L_1 + c^{out} G\bigr) + c^{in} \Delta\, \frac{1 - p_i}{p_i^2}$$
summing across all bugs, with $\sum_i 1/p_i = I$ and $\sum_i (1 - p_i)/p_i^2 = V$:

$$\mathbb{E}[C_{\text{fix}}] = I\bigl(c^{in} L_1 + c^{out} G\bigr) + c^{in} \Delta V$$
the $I^2$ cross-bug term vanishes. the penalty depends only on $V$ (within-bug retry variance), not the square of total iterations. this makes the strategy comparison much tighter.
| context model | penalty term |
|---|---|
| shared conversation | $c^{in} \cdot \frac{\Delta}{2} \cdot (I^2 + V - I)$ |
| fresh per bug | $c^{in} \cdot \Delta \cdot V$ |
the shared model has the $I^2$ term — cross-bug context accumulation. the fresh model eliminates it entirely. in practice, most well-designed agents reset context between bugs, making the fresh model more realistic.
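to see the scale difference, plug illustrative numbers into the two penalty terms from the table. none of these values ($c^{in}$, $\Delta$, $I$, $V$) come from the model itself; they are assumptions chosen to be plausible:

```python
# the two penalty terms from the table above; all inputs are invented.

def shared_penalty(c_in, delta, I, V):
    return c_in * delta / 2 * (I**2 + V - I)   # quadratic in I

def fresh_penalty(c_in, delta, V):
    return c_in * delta * V                    # linear in V

c_in = 3e-6        # $/input token
delta = 1500       # tokens added per attempt (G + E)
I, V = 12.0, 10.0  # expected total attempts, retry variance

print(shared_penalty(c_in, delta, I, V))  # ≈ $0.32
print(fresh_penalty(c_in, delta, V))      # ≈ $0.045
```

with these numbers the shared-conversation penalty is about 7x the fresh-per-bug penalty, and the gap widens quadratically as $I$ grows.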
strategy b forces the expensive model to read the most context. its penalty is multiplied by $c_s^{in}$ (the strong model's input price).
strategy a's penalty is multiplied by $c_w^{in}$ (the cheap model's input price). even if the weak model needs more retries, the dollar cost of that context is small.
the penalty is doubly bad for strategy b: higher coefficient ($c_s^{in}$) and the weak model produces enough bugs to keep $I_B$ large despite the strong model's better fix rate. this holds under both context models, but the effect is dramatic in shared conversation mode.
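the crossover question from the intro can be sketched directly: sweep how many bugs weak-model generation leaves behind and compare total expected cost under the fresh-per-bug model. every price, token count, and fix rate below is an invented illustration:

```python
# crossover sweep: strategy a (strong gen, weak fix) vs strategy b
# (weak gen, strong fix), fresh-per-bug context model. toy parameters only.

def fix_cost(c_in, c_out, L1, G, E, q, phi, p_e, p_h):
    """fresh-per-bug expected fix cost: I*(c_in*L1 + c_out*G) + c_in*delta*V."""
    delta = G + E
    I = q * ((1 - phi) / p_e + phi / p_h)
    V = q * ((1 - phi) * (1 - p_e) / p_e**2 + phi * (1 - p_h) / p_h**2)
    return I * (c_in * L1 + c_out * G) + c_in * delta * V

cs_in, cs_out = 3e-6, 15e-6      # strong model $/token
cw_in, cw_out = 0.3e-6, 1.5e-6   # weak model $/token (10x cheaper)
L0, G0, G, E = 2000, 1500, 400, 300
L1 = L0 + G0
phi = 0.3

# strategy a: strong generation leaves 3 bugs; weak model fixes them slowly
cost_a = cs_in * L0 + cs_out * G0 + fix_cost(
    cw_in, cw_out, L1, G, E, q=3, phi=phi, p_e=0.6, p_h=0.25)

for q_w in range(1, 9):  # bugs left behind by weak generation
    # strategy b: weak generation; strong model fixes with better rates
    cost_b = cw_in * L0 + cw_out * G0 + fix_cost(
        cs_in, cs_out, L1, G, E, q=q_w, phi=phi, p_e=0.9, p_h=0.6)
    print(q_w, round(cost_a, 4), round(cost_b, 4),
          "a wins" if cost_a < cost_b else "b wins")
```

with these particular numbers the strong model's per-attempt input cost dominates, so strategy b only wins when weak generation leaves a single bug; changing the price ratio or fix rates moves the crossover, which is the point of having the model in closed form.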
llm routing & cascading: de koninck et al. (ICLR 2025) unify routing (pick one model) and cascading (try cheap first, escalate) into a single framework achieving 97% of gpt-4 accuracy at 24% cost.
budget reallocation: the larger the better? (2024) shows that given the same compute budget, running a smaller model multiple times can match or surpass a larger model.
the advisor pattern: anthropic uses a cheap model as executor with opus as an on-demand advisor. sonnet + opus advisor gains 2.7 points on swe-bench at 11.9% less cost than opus end-to-end.
existing work focuses on per-query routing with fixed costs. this model adds the context growth penalty — the compounding cost of accumulated conversation across iterations — which existing frameworks don't capture.
constant $\Delta$: every attempt adds the same tokens. in practice, error traces vary and later attempts may produce longer outputs.
no regressions: fixing a bug never introduces a new one. real agents have regression rates that would add a branching factor.
no caching: prompt caching (which discounts repeated input prefixes) would reduce the context penalty. switching models breaks the cache.
uniform bug difficulty: bugs are either easy or hard. a continuous difficulty distribution would be more realistic but doesn't change the qualitative result.