
Everybody loves building agents; nobody likes debugging them. Agents hit the classic LLM app lifecycle problem: at first it feels magical. It nails the first few tasks, doing things you didn't even think were possible. You get excited and start pushing it further. Then you run it and it fails on step 17, then step 41, then step 9.

Now you can't reproduce it because it's probabilistic. Each step takes half a second, so you sit there for 10–20 minutes just waiting for a chance to see what went wrong.



That's why you build extensive tooling to run your change hundreds of times in parallel against the context you're trying to fix, and then re-run hundreds of past scenarios in parallel to verify none of them breaks.
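
Roughly the shape that tooling takes, as a minimal sketch (run_agent, the scenario dicts, and the thresholds are placeholders for illustration, not anyone's real harness):

  import random
  from concurrent.futures import ThreadPoolExecutor

  def run_agent(scenario: dict) -> bool:
      # Placeholder for "run the agent once and grade the result".
      # Simulated here so the sketch runs; real code would call the agent.
      return random.random() < scenario.get("expected_pass_rate", 0.9)

  def pass_rate(scenario: dict, trials: int, workers: int = 32) -> float:
      # Fan the same scenario out across a thread pool and count passes.
      with ThreadPoolExecutor(max_workers=workers) as pool:
          results = list(pool.map(lambda _: run_agent(scenario), range(trials)))
      return sum(results) / trials

  # 1) Hammer the context you're trying to fix with the candidate change.
  failing_case = {"id": "step-17-repro", "expected_pass_rate": 0.6}
  print("candidate fix:", pass_rate(failing_case, trials=200))

  # 2) Re-run previously passing scenarios to check none of them broke.
  past_scenarios = [{"id": f"scenario-{i}", "expected_pass_rate": 0.95} for i in range(50)]
  regressions = {s["id"]: pass_rate(s, trials=20) for s in past_scenarios}
  print("possible regressions:", {k: v for k, v in regressions.items() if v < 0.9})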


In the event this comment is slathered in sarcasm:

  Well done!  :-D


Do you use a tool for this? Is there some sort of tool that collects evals from live inferences (especially those that fail)?


There is no way to prove the correctness of non-deterministic (a.k.a. probabilistic) results for any interesting generative algorithm. All one can do is validate against a known set of tests, with the understanding that the set is unbounded over time.
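
To make that concrete: repeated trials only buy you a confidence interval on the pass rate, never a proof. A small sketch using a Wilson lower bound (the trial counts and thresholds here are arbitrary, for illustration only):

  import math

  def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
      # Lower bound of the Wilson score interval for a binomial proportion.
      if trials == 0:
          return 0.0
      p = passes / trials
      denom = 1 + z * z / trials
      centre = p + z * z / (2 * trials)
      margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
      return (centre - margin) / denom

  # 195 passes out of 200 trials: "very likely above ~94%" is the strongest claim available.
  print(wilson_lower_bound(195, 200))  # ~0.94

And every new failure mode found in the wild just grows the test set, which is what "unbounded over time" means in practice.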


https://x.com/rerundotio/status/1968806896959402144

This is a use of Rerun that I haven't seen before, and it's pretty fascinating!

Typically people use Rerun to visualize robotics data. If I'm following along correctly, what's fascinating here is that Adam, for his master's thesis, is using Rerun to visualize the state of a software/LLM agent.

https://github.com/gustofied/P2Engine
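
For anyone curious what that could look like, here is a hedged sketch (not taken from the thesis or the repo above; it assumes the rerun-sdk Python package, and API details may differ between SDK versions): log each agent step as text and scrub through the run in the Rerun viewer.

  import rerun as rr

  rr.init("agent_trace", spawn=True)  # opens the Rerun viewer

  steps = [
      ("plan", "decompose the task into 3 subtasks"),
      ("tool_call", "search(query='...')"),
      ("observe", "tool returned 12 results"),
  ]

  for i, (kind, detail) in enumerate(steps):
      # One entity path per step type; the viewer lays them out on a shared timeline.
      rr.log(f"agent/{kind}", rr.TextLog(f"step {i}: {detail}"))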


For sure. For instance, Google has the ADK Eval framework: you write tests and can easily run them against a given input. I'd say it's a bit unpolished, as is the rest of the rapidly developing ADK framework, but it does exist.
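
Not the ADK API, but the general "write tests, run them against a given input" pattern looks something like this pytest-style sketch (run_agent and the expected substrings are hypothetical placeholders):

  import pytest

  EVAL_CASES = [
      {"input": "refund order #123", "must_contain": "refund"},
      {"input": "what's my order status?", "must_contain": "status"},
  ]

  def run_agent(prompt: str) -> str:
      # Placeholder: call your real agent here.
      return f"Handling request: {prompt}"

  @pytest.mark.parametrize("case", EVAL_CASES)
  def test_agent_response_contains_expected(case):
      response = run_agent(case["input"])
      assert case["must_contain"] in response.lower()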


Heya, I'm building this. It's been used in prod for a month now and has saved my customer's ass while building general workflow automation agents. Happy to chat if you're interested.

darin@mcptesting.com

(gist: evals as a service)


That everybody seems to love building these things while people like you harbor deep skepticism about them is itself a reason to get your hands dirty with an agent: the cost is 30-45 minutes of your time, and doing so will arm you with an understanding you can use to make better arguments against them.

For the problem domains I care about at the moment, I'm quite bullish about agents. I think they're going to be huge wins for vulnerability analysis and for operations/SRE work (not for actually turning dials, but for making telemetry more interpretable). There are lots of domains where I'm less confident in them. But you could reasonably call me an optimist.

But the point of the article is that its arguments work both ways.



