> someone corrected me above, it does seem to matter more then I thought if you ...

> someone corrected me above, it does seem to matter more then I thought

if you llm agent takes different decisions from the same prompt, then you have to deal with it

1) your benchmarks become stochastic so you need multiple samples to get confidence for your AB testing

2) if your system assumes at least once completion you have to implement single record and replay so you dont get multiple rollout of with different actions