a cost model for multi-step llm agent workflows

strong model first or weak model first? find the crossover. how bad does the weak model have to be before strategy a becomes worth its generation premium?
when building an agent that generates code through multiple llm calls, you choose between two strategies:
strategy a (strong → weak): pay upfront for high-quality generation, then fix residual bugs cheaply.
strategy b (weak → strong): generate cheaply, then deploy the strong model to fix what broke.
| variable | meaning |
|---|---|
| $c_m^{in},\; c_m^{out}$ | input / output cost per token for model $m$ |
| $L_0$ | initial prompt tokens |
| $G_0$ | output tokens for initial generation |
| $G$ | output tokens per fix attempt |
| $E$ | error trace tokens added per iteration |
| $q_m, \;\phi_m$ | bug count and hard-bug fraction after generation by model $m$ |
| $p_m^e, \;p_m^h$ | probability model $m$ fixes an easy / hard bug in one attempt |
each bug of difficulty $d$ takes $1 / p_{fix}(m,d)$ attempts in expectation (geometric distribution). writing $g$ for the generating model and $f$ for the fixing model, the total expected fix iterations are:

$$I = q_g \left[ \frac{1 - \phi_g}{p_f^e} + \frac{\phi_g}{p_f^h} \right]$$
and the variance of total attempts, since each geometric attempt count contributes $(1-p)/p^2$:

$$V = q_g \left[ (1 - \phi_g)\,\frac{1 - p_f^e}{(p_f^e)^2} + \phi_g\,\frac{1 - p_f^h}{(p_f^h)^2} \right]$$
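a minimal sketch of these two quantities in code. the parameter values below (bug count, hard fraction, fix rates) are invented for illustration, not taken from the model:

```python
# I and V from the formulas above; all parameter values are illustrative.

def expected_iterations(q, phi, p_easy, p_hard):
    """I = q * [(1 - phi)/p_e + phi/p_h], summing geometric means per bug."""
    return q * ((1 - phi) / p_easy + phi / p_hard)

def iteration_variance(q, phi, p_easy, p_hard):
    """V = q * [(1 - phi)(1 - p_e)/p_e^2 + phi(1 - p_h)/p_h^2]."""
    return q * ((1 - phi) * (1 - p_easy) / p_easy**2
                + phi * (1 - p_hard) / p_hard**2)

# e.g. 6 bugs, 30% hard; fixer resolves easy bugs 80% / hard bugs 40% per try
print(expected_iterations(6, 0.3, 0.8, 0.4))  # ≈ 9.75
print(iteration_variance(6, 0.3, 0.8, 0.4))   # ≈ 8.06
```

note that hard bugs dominate $V$: a 40% fix rate contributes $0.6/0.16 = 3.75$ per bug, over 4x the easy-bug term.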
each attempt adds $\Delta = G + E$ tokens. in a shared conversation, all attempts accumulate in one thread. attempt $t$ sees context $L_1 + (t-1)\Delta$ where $L_1 = L_0 + G_0$.
the total fix cost sums over all $T$ attempts:

$$C_{\text{fix}} = \sum_{t=1}^{T} \left[ c^{in}\bigl(L_1 + (t-1)\Delta\bigr) + c^{out} G \right] = T\bigl(c^{in} L_1 + c^{out} G\bigr) + c^{in} \Delta\, \frac{T(T-1)}{2}$$
taking expectations ($T$ is random, a sum of geometric variables, so $\mathbb{E}[T] = I$ and $\mathbb{E}[T^2] = I^2 + V$):

$$\mathbb{E}[C_{\text{fix}}] = I\bigl(c^{in} L_1 + c^{out} G\bigr) + c^{in}\,\frac{\Delta}{2}\,\bigl(I^2 + V - I\bigr)$$
the $I^2$ term means cost grows quadratically with total iterations. bug 10's context includes all attempts from bugs 1–9.
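a monte-carlo sanity check of the shared-conversation accounting: simulate geometric retries accumulating in one thread and compare against the closed form $I(c^{in}L_1 + c^{out}G) + c^{in}\frac{\Delta}{2}(I^2 + V - I)$. every price and token count below is invented:

```python
# monte-carlo check of the shared-conversation cost model; toy numbers only.
import random

def simulate_shared(num_bugs, p_fix, c_in, c_out, L1, G, E,
                    trials=100_000, seed=0):
    """average fix cost when every attempt lands in one growing thread."""
    rng = random.Random(seed)
    delta = G + E
    total = 0.0
    for _ in range(trials):
        cost, t = 0.0, 0                   # t counts attempts so far
        for _ in range(num_bugs):
            while True:                    # geometric retries per bug
                cost += c_in * (L1 + t * delta) + c_out * G
                t += 1
                if rng.random() < p_fix:
                    break
        total += cost
    return total / trials

def closed_form(num_bugs, p_fix, c_in, c_out, L1, G, E):
    """I*(c_in*L1 + c_out*G) + c_in*(delta/2)*(I^2 + V - I)."""
    delta = G + E
    I = num_bugs / p_fix
    V = num_bugs * (1 - p_fix) / p_fix**2
    return I * (c_in * L1 + c_out * G) + c_in * delta / 2 * (I**2 + V - I)

sim = simulate_shared(3, 0.5, c_in=1.0, c_out=2.0, L1=10, G=3, E=2)
exact = closed_form(3, 0.5, c_in=1.0, c_out=2.0, L1=10, G=3, E=2)
print(sim, exact)   # the two agree to well within 1%
```

the key detail is that `t` never resets between bugs: that global counter is exactly what produces the $I^2$ term.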
a well-designed agent resets context after each bug is fixed. bug $i$ gets a fresh conversation starting from $L_1$ (prompt + current code). only retries within that bug accumulate context.
the expected cost for bug $i$ with fix probability $p_i$, using $\mathbb{E}[T_i(T_i - 1)]/2 = (1 - p_i)/p_i^2$ for a geometric $T_i$:

$$\mathbb{E}[C_i] = \frac{1}{p_i}\bigl(c^{in} L_1 + c^{out} G\bigr) + c^{in} \Delta\, \frac{1 - p_i}{p_i^2}$$
summing across all bugs, with $\sum_i 1/p_i = I$ and $\sum_i (1 - p_i)/p_i^2 = V$:

$$\mathbb{E}[C_{\text{fix}}] = I\bigl(c^{in} L_1 + c^{out} G\bigr) + c^{in} \Delta V$$
the $I^2$ cross-bug term vanishes. the penalty depends only on $V$ (within-bug retry variance), not the square of total iterations. this makes the strategy comparison much tighter.
| context model | penalty term |
|---|---|
| shared conversation | $c^{in} \cdot \frac{\Delta}{2} \cdot (I^2 + V - I)$ |
| fresh per bug | $c^{in} \cdot \Delta \cdot V$ |
the shared model has the $I^2$ term — cross-bug context accumulation. the fresh model eliminates it entirely. in practice, most well-designed agents reset context between bugs, making the fresh model more realistic.
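to see the scale difference, plug illustrative numbers into the two penalty terms from the table. none of these values ($c^{in}$, $\Delta$, $I$, $V$) come from the model itself; they are assumptions chosen to be plausible:

```python
# the two penalty terms from the table above; all inputs are invented.

def shared_penalty(c_in, delta, I, V):
    return c_in * delta / 2 * (I**2 + V - I)   # quadratic in I

def fresh_penalty(c_in, delta, V):
    return c_in * delta * V                    # linear in V

c_in = 3e-6        # $/input token
delta = 1500       # tokens added per attempt (G + E)
I, V = 12.0, 10.0  # expected total attempts, retry variance

print(shared_penalty(c_in, delta, I, V))  # ≈ $0.32
print(fresh_penalty(c_in, delta, V))      # ≈ $0.045
```

with these numbers the shared-conversation penalty is about 7x the fresh-per-bug penalty, and the gap widens quadratically as $I$ grows.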
strategy b forces the expensive model to read the most context. its penalty is multiplied by $c_s^{in}$ (the strong model's input price).
strategy a's penalty is multiplied by $c_w^{in}$ (the cheap model's input price). even if the weak model needs more retries, the dollar cost of that context is small.
the penalty is doubly bad for strategy b: higher coefficient ($c_s^{in}$) and the weak model produces enough bugs to keep $I_B$ large despite the strong model's better fix rate. this holds under both context models, but the effect is dramatic in shared conversation mode.
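the crossover question from the intro can be sketched directly: sweep how many bugs weak-model generation leaves behind and compare total expected cost under the fresh-per-bug model. every price, token count, and fix rate below is an invented illustration:

```python
# crossover sweep: strategy a (strong gen, weak fix) vs strategy b
# (weak gen, strong fix), fresh-per-bug context model. toy parameters only.

def fix_cost(c_in, c_out, L1, G, E, q, phi, p_e, p_h):
    """fresh-per-bug expected fix cost: I*(c_in*L1 + c_out*G) + c_in*delta*V."""
    delta = G + E
    I = q * ((1 - phi) / p_e + phi / p_h)
    V = q * ((1 - phi) * (1 - p_e) / p_e**2 + phi * (1 - p_h) / p_h**2)
    return I * (c_in * L1 + c_out * G) + c_in * delta * V

cs_in, cs_out = 3e-6, 15e-6      # strong model $/token
cw_in, cw_out = 0.3e-6, 1.5e-6   # weak model $/token (10x cheaper)
L0, G0, G, E = 2000, 1500, 400, 300
L1 = L0 + G0
phi = 0.3

# strategy a: strong generation leaves 3 bugs; weak model fixes them slowly
cost_a = cs_in * L0 + cs_out * G0 + fix_cost(
    cw_in, cw_out, L1, G, E, q=3, phi=phi, p_e=0.6, p_h=0.25)

for q_w in range(1, 9):  # bugs left behind by weak generation
    # strategy b: weak generation; strong model fixes with better rates
    cost_b = cw_in * L0 + cw_out * G0 + fix_cost(
        cs_in, cs_out, L1, G, E, q=q_w, phi=phi, p_e=0.9, p_h=0.6)
    print(q_w, round(cost_a, 4), round(cost_b, 4),
          "a wins" if cost_a < cost_b else "b wins")
```

with these particular numbers the strong model's per-attempt input cost dominates, so strategy b only wins when weak generation leaves a single bug; changing the price ratio or fix rates moves the crossover, which is the point of having the model in closed form.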
llm routing & cascading: de koninck et al. (ICLR 2025) unify routing (pick one model) and cascading (try cheap first, escalate) into a single framework achieving 97% of gpt-4 accuracy at 24% cost.
budget reallocation: the larger the better? (2024) shows that given the same compute budget, running a smaller model multiple times can match or surpass a larger model.
the advisor pattern: anthropic uses a cheap model as executor with opus as an on-demand advisor. sonnet + opus advisor gains 2.7 points on swe-bench at 11.9% less cost than opus end-to-end.
existing work focuses on per-query routing with fixed costs. this model adds the context growth penalty — the compounding cost of accumulated conversation across iterations — which existing frameworks don't capture.
constant $\Delta$: every attempt adds the same tokens. in practice, error traces vary and later attempts may produce longer outputs.
no regressions: fixing a bug never introduces a new one. real agents have regression rates that would add a branching factor.
no caching: prompt caching (which discounts repeated input prefixes) would reduce the context penalty. switching models breaks the cache.
uniform bug difficulty: bugs are either easy or hard. a continuous difficulty distribution would be more realistic but doesn't change the qualitative result.