strong model first or weak model first? find the crossover.
at the same number of bugs, which fixer handles them cheaper?
a cost model for multi-step llm agent workflows
when building an agent that generates code through multiple llm calls, you choose between two strategies:
strategy a (strong → weak): pay upfront for high-quality generation, then fix residual bugs cheaply.
strategy b (weak → strong): generate cheaply, then deploy the strong model to fix what broke.
| variable | meaning |
|---|---|
| $c_m^{in},\; c_m^{out}$ | input / output cost per token for model $m$ |
| $L_0$ | initial prompt tokens |
| $G_0$ | output tokens for initial generation |
| $G$ | output tokens per fix attempt |
| $E$ | error trace tokens added per iteration |
| $q, \;\phi$ | bug count and hard-bug fraction |
| $p^e, \;p^h$ | probability model fixes an easy / hard bug |
| $\alpha$ | prompt caching discount rate (e.g., 0.1) |
each bug of difficulty $d$ is a geometric trial: it takes $1 / p_{fix}(d)$ attempts in expectation, with variance $(1 - p_{fix}(d)) / p_{fix}(d)^2$. summing over $q$ bugs, the total expected fix iterations $I$ and variance $V$ are:

$$I = q\left[\frac{1-\phi}{p^e} + \frac{\phi}{p^h}\right] \qquad V = q\left[\frac{(1-\phi)(1-p^e)}{(p^e)^2} + \frac{\phi(1-p^h)}{(p^h)^2}\right]$$
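these expectations can be evaluated directly. a minimal python sketch; the bug counts and fix rates are illustrative values, not measured numbers:

```python
def expected_iterations(q: int, phi: float, p_easy: float, p_hard: float) -> float:
    """Expected total fix attempts: each bug is a geometric trial, so a bug
    with per-attempt fix probability p takes 1/p attempts on average."""
    return q * ((1 - phi) / p_easy + phi / p_hard)

def iteration_variance(q: int, phi: float, p_easy: float, p_hard: float) -> float:
    """Variance of the total attempt count: geometric variance (1-p)/p^2,
    summed over q independent bugs."""
    geo_var = lambda p: (1 - p) / p**2
    return q * ((1 - phi) * geo_var(p_easy) + phi * geo_var(p_hard))

# illustrative: 10 bugs, 30% hard; the fixer resolves easy bugs 80% of the
# time per attempt and hard bugs 40% of the time
I = expected_iterations(q=10, phi=0.3, p_easy=0.8, p_hard=0.4)
V = iteration_variance(q=10, phi=0.3, p_easy=0.8, p_hard=0.4)
print(I, V)
```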
all attempts accumulate in one thread. the total expected input cost accounts for full-price new tokens and discounted cached tokens:

$$C^{in}_{shared} = c^{in}\bigl[L_1 + (I-1)\Delta\bigr] + \alpha\, c^{in}\left[(I-1)L_1 + \frac{\Delta}{2}\bigl(I^2 + V - 3I + 2\bigr)\right]$$

where $L_1 = L_0 + G_0$, $\Delta = G + E$, and the $I^2 + V$ term comes from $\mathbb{E}[N^2] = I^2 + V$ for the random attempt count $N$. the $I^2$ term drives quadratic cost growth, though $\alpha$ mitigates it.
context resets after each bug. full price is paid for $L_1$ on the first attempt of each of the $q$ bugs; retries within a bug pay $\Delta$ new tokens plus that bug's own cached context:

$$C^{in}_{fresh} = q\, c^{in} L_1 + (I - q)\bigl(c^{in}\Delta + \alpha\, c^{in} L_1\bigr) + \alpha\, c^{in} \Delta\, V_{cached}$$

where $V_{cached} = \sum_j (1-p_j)V_j$ and $V_j = (1-p_j)/p_j^2$ is bug $j$'s attempt variance. this eliminates cross-bug quadratic growth. when $\alpha = 1$ (no caching discount), this simplifies to the linear model: $I c^{in} L_1 + c^{in} \Delta V$.
shared mode's cost scales with the square of total iterations across all bugs. fresh mode's cost scales linearly with the number of bugs. for complex tasks with many bugs, resetting context is almost always the dominant strategy.
caching divides every attempt's input into new and cached tokens. attempt $t$ in a thread sees context $L_1 + (t-1)\Delta$.
| type | tokens | cost |
|---|---|---|
| first attempt | $L_1$ | $c^{in} L_1$ |
| subsequent | $\Delta$ (new) $+ (L_1 + (t-2)\Delta)$ (cached) | $c^{in} \Delta + \alpha c^{in} (L_1 + (t-2)\Delta)$ |
summing these expectations yields the shared- and fresh-mode formulas above. caching significantly lowers the cost of retries but cannot discount the initial generation-to-fix handoff.
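the row-by-row sum can be checked against its closed form. a sketch for a deterministic thread of $T$ attempts, with illustrative token counts and prices:

```python
def thread_input_cost(T, L1, delta, c_in, alpha):
    """Input cost of one thread of T attempts, summing the table rows:
    attempt 1 pays full price for L1; attempt t >= 2 pays full price for
    delta new tokens plus the discounted rate for L1 + (t-2)*delta cached."""
    if T <= 0:
        return 0.0
    cost = c_in * L1
    for t in range(2, T + 1):
        cost += c_in * delta + alpha * c_in * (L1 + (t - 2) * delta)
    return cost

T, L1, delta, c_in, alpha = 5, 3000, 500, 1e-6, 0.1
looped = thread_input_cost(T, L1, delta, c_in, alpha)
# closed form of the same sum: T-1 full-price retries plus a discounted
# arithmetic series over the growing cached prefix
closed = (c_in * L1 + (T - 1) * c_in * delta
          + alpha * c_in * ((T - 1) * L1 + delta * (T - 1) * (T - 2) / 2))
print(looped, closed)
```

the arithmetic series over the cached prefix is exactly where the quadratic term comes from: each retry re-reads everything the thread has accumulated so far.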
strategy b forces the expensive model to read the most context. its penalty is multiplied by $c_s^{in}$ (the strong model's input price).
strategy a's penalty is multiplied by $c_w^{in}$ (the cheap model's input price). even if the weak model needs more retries, the dollar cost of that context is small.
the penalty is doubly bad for strategy b: higher coefficient ($c_s^{in}$) and the weak model produces enough bugs to keep $I_B$ large despite the strong model's better fix rate. this holds under both context models, but the effect is dramatic in shared conversation mode.
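putting the pieces together: a sketch comparing the two strategies end to end under the fresh-context model. every price, token count, bug count, and fix rate below is an invented illustrative value, not a measurement:

```python
def expected_fix_iters(q, phi, p_easy, p_hard):
    return q * ((1 - phi) / p_easy + phi / p_hard)

def fresh_input_cost(q, phi, p_easy, p_hard, L1, delta, c_in, alpha):
    """Expected fresh-mode input cost: full price for L1 once per bug;
    retries pay new tokens plus the bug's own cached context."""
    I = expected_fix_iters(q, phi, p_easy, p_hard)
    v_cached = q * ((1 - phi) * ((1 - p_easy) / p_easy) ** 2
                    + phi * ((1 - p_hard) / p_hard) ** 2)
    return (q * c_in * L1 + (I - q) * (c_in * delta + alpha * c_in * L1)
            + alpha * c_in * delta * v_cached)

def strategy_cost(gen_in, gen_out, fix_in, fix_out,
                  q, phi, p_easy, p_hard, L0, G0, G, E, alpha=0.1):
    """Generation cost plus expected fix cost (input and output tokens)."""
    L1, delta = L0 + G0, G + E
    I = expected_fix_iters(q, phi, p_easy, p_hard)
    gen = gen_in * L0 + gen_out * G0
    fix_in_cost = fresh_input_cost(q, phi, p_easy, p_hard,
                                   L1, delta, fix_in, alpha)
    return gen + fix_in_cost + fix_out * G * I

tok = dict(L0=2000, G0=1500, G=400, E=300)
strong = dict(c_in=3e-6, c_out=15e-6)    # illustrative $/token prices
weak = dict(c_in=0.25e-6, c_out=1.25e-6)

# strategy a: strong generates (few bugs), weak fixes (lower fix rates)
cost_a = strategy_cost(strong["c_in"], strong["c_out"],
                       weak["c_in"], weak["c_out"],
                       q=3, phi=0.3, p_easy=0.7, p_hard=0.3, **tok)
# strategy b: weak generates (many bugs), strong fixes (higher fix rates)
cost_b = strategy_cost(weak["c_in"], weak["c_out"],
                       strong["c_in"], strong["c_out"],
                       q=10, phi=0.3, p_easy=0.9, p_hard=0.6, **tok)
print(f"strategy a ~ ${cost_a:.4f}, strategy b ~ ${cost_b:.4f}")
```

with these made-up numbers strategy a wins comfortably, driven by the $q \cdot c_s^{in} L_1$ term that strategy b pays; the crossover shifts as the weak model's bug count, the fix rates, or $\alpha$ change.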
llm routing & cascading: de koninck et al. (ICLR 2025) unify routing (pick one model) and cascading (try cheap first, escalate) into a single framework achieving 97% of gpt-4 accuracy at 24% cost.
budget reallocation: the larger the better? (2024) shows that given the same compute budget, running a smaller model multiple times can match or surpass a larger model.
the advisor pattern: anthropic uses a cheap model as executor with opus as an on-demand advisor. sonnet + opus advisor gains 2.7 points on swe-bench at 11.9% less cost than opus end-to-end.
existing work focuses on per-query routing with fixed costs. this model adds the context growth penalty — the compounding cost of accumulated conversation across iterations — which existing frameworks don't capture.
constant $\Delta$: every attempt adds the same tokens. in practice, error traces vary and later attempts may produce longer outputs.
no regressions: fixing a bug never introduces a new one. real agents have regression rates that would add a branching factor.
fixed cache rate: the model uses a single $\alpha = 0.1$ discount. real caching has a TTL (e.g. 5 min) and the discount may vary by provider.
uniform bug difficulty: bugs are either easy or hard. a continuous difficulty distribution would be more realistic but doesn't change the qualitative result.