strong model first or weak model first? find the crossover.
a cost model for multi-step llm agent workflows
when building an agent that generates code through multiple llm calls, you choose between two strategies:
strategy a (strong → weak): pay upfront for high-quality generation, then fix residual bugs cheaply.
strategy b (weak → strong): generate cheaply, then deploy the strong model to fix what broke.
| variable | meaning |
|---|---|
| $c_m^{in},\; c_m^{out}$ | input / output cost per token for model $m$ |
| $L_0$ | initial prompt tokens |
| $G_0$ | output tokens for initial generation |
| $G$ | output tokens per fix attempt |
| $E$ | error trace tokens added per iteration |
| $q_m, \;\phi_m$ | bug count and hard-bug fraction after generation by model $m$ |
| $p_m^e, \;p_m^h$ | probability model $m$ fixes an easy / hard bug in one attempt |
with generator model $g$ and fixer model $f$, each bug of difficulty $d \in \{e, h\}$ takes $1 / p_f^d$ attempts in expectation (geometric distribution). the total expected fix iterations:

$$I = \mathbb{E}[T] = q_g\left(\frac{1-\phi_g}{p_f^e} + \frac{\phi_g}{p_f^h}\right)$$

and, since the attempt counts for different bugs are independent geometrics, the variance of total attempts:

$$\operatorname{Var}(T) = q_g\left[(1-\phi_g)\,\frac{1-p_f^e}{(p_f^e)^2} + \phi_g\,\frac{1-p_f^h}{(p_f^h)^2}\right]$$
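a quick numeric sketch of these two quantities; the function name and all parameter values here are made up for illustration, not measurements:

```python
def fix_iteration_stats(q_g, phi_g, p_e, p_h):
    """q_g: bug count and phi_g: hard-bug fraction left by the generator;
    p_e / p_h: fixer's per-attempt success probability on easy / hard bugs."""
    n_easy = q_g * (1 - phi_g)
    n_hard = q_g * phi_g
    # geometric distribution: mean 1/p, variance (1-p)/p^2, summed over bugs
    mean = n_easy / p_e + n_hard / p_h
    var = n_easy * (1 - p_e) / p_e**2 + n_hard * (1 - p_h) / p_h**2
    return mean, var

# hypothetical scenario: generator leaves 8 bugs, 40% of them hard;
# fixer resolves easy bugs 90% of the time per attempt, hard ones 60%
I, V = fix_iteration_stats(q_g=8, phi_g=0.4, p_e=0.9, p_h=0.6)
```

note how the hard bugs dominate: they are a minority of the bug count but contribute most of the variance, which matters once variance enters the cost formula below.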
each attempt adds $\Delta = G + E$ tokens to the context. labelling all attempts globally as $t = 1, 2, \ldots, T$, attempt $t$ sees context $L_1 + (t-1)\Delta$ where $L_1 = L_0 + G_0$.
the total cost of the fix phase sums over all attempts:

$$C_{\text{fix}} = \sum_{t=1}^{T}\Bigl[c_f^{in}\bigl(L_1 + (t-1)\Delta\bigr) + c_f^{out}\,G\Bigr]$$

taking expectations over the random total attempt count $T$, using $\mathbb{E}\bigl[\sum_{t=1}^{T}(t-1)\bigr] = \tfrac{1}{2}\bigl(\mathbb{E}[T^2] - \mathbb{E}[T]\bigr) = \tfrac{1}{2}\bigl(I^2 - I + \operatorname{Var}(T)\bigr)$:

$$\mathbb{E}[C_{\text{fix}}] = I\bigl(c_f^{in} L_1 + c_f^{out} G\bigr) + c_f^{in}\,\Delta\,\frac{I^2 - I + \operatorname{Var}(T)}{2}$$
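the closed form is easy to sanity-check by hand against a fixed number of attempts. a minimal sketch (function name and test values are mine):

```python
def expected_fix_cost(I, var_T, c_in, c_out, L1, G, E):
    """expected fix-phase cost: I is the expected attempt count,
    var_T its variance, L1 the starting context, G + E the tokens
    appended to context per attempt."""
    delta = G + E
    base = I * (c_in * L1 + c_out * G)               # linear in I
    penalty = c_in * delta * (I**2 - I + var_T) / 2  # quadratic in I
    return base + penalty

# sanity check with exactly 2 attempts (variance 0): attempt 1 reads
# context L1, attempt 2 reads L1 + delta, each emits G output tokens
assert expected_fix_cost(2, 0.0, 2.0, 5.0, 10, 3, 4) == (2*10 + 5*3) + (2*17 + 5*3)
```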
the first term is the base fix cost (linear in $I$). the second term is the context growth penalty — and it is quadratic in $I$.
strategy b forces the expensive model to read the most context. its context growth penalty is multiplied by $c_s^{in}$ (the strong model's input price) and grows as $I^2$.
strategy a's penalty is multiplied by $c_w^{in}$ (the cheap model's input price). even if strategy a needs more iterations, the dollar cost of that context bloat is small.
the quadratic penalty is doubly bad for strategy b: a higher coefficient ($c_s^{in}$), and the weak model's generation often leaves enough bugs to keep strategy b's iteration count $I$ large despite the strong model's better per-attempt fix rate.
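putting the pieces together, the crossover can be located by sweeping the weak model's bug count and comparing full pipeline costs. everything below is a hedged sketch: the prices, bug counts, and fix rates are hypothetical stand-ins, not measurements of any real model:

```python
def pipeline_cost(c_gen, c_fix, q, phi, p_e, p_h,
                  L0=2000, G0=800, G=400, E=600):
    """c_gen / c_fix: (input, output) $/token for generator and fixer;
    q, phi: bugs and hard-bug fraction left by the generator;
    p_e, p_h: fixer's per-attempt fix probabilities."""
    delta, L1 = G + E, L0 + G0
    n_easy, n_hard = q * (1 - phi), q * phi
    I = n_easy / p_e + n_hard / p_h
    var = n_easy * (1 - p_e) / p_e**2 + n_hard * (1 - p_h) / p_h**2
    gen = c_gen[0] * L0 + c_gen[1] * G0
    fix = I * (c_fix[0] * L1 + c_fix[1] * G) \
        + c_fix[0] * delta * (I**2 - I + var) / 2
    return gen + fix

STRONG = (3e-6, 15e-6)    # hypothetical $/token prices
WEAK = (0.15e-6, 0.6e-6)

def crossover():
    # strategy a: strong generates (few bugs), weak fixes (worse fix rates)
    a = pipeline_cost(STRONG, WEAK, q=2, phi=0.3, p_e=0.6, p_h=0.25)
    # strategy b: weak generates (q_w bugs), strong fixes (better fix rates)
    for q_w in range(1, 40):
        b = pipeline_cost(WEAK, STRONG, q=q_w, phi=0.5, p_e=0.9, p_h=0.6)
        if a < b:
            return q_w  # smallest weak-model bug count where a wins
    return None
```

with these made-up numbers strategy b only wins while the weak model leaves a single bug; as soon as its bug count grows, the strong model's $c_s^{in}$ multiplying the quadratic context term hands the win to strategy a.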