Hacker News | zachdotai's comments

I found it more helpful to try to "steer" the LLM into self-correcting its actions when I detect misalignment. This improved our task completion rates by roughly 20%.


Where/how do you define the policy boundary line that triggers course correction?


Basically through two layers. Hard rules (token limits, tool allowlists, banned actions) trigger an immediate block - no steering, just stop. Soft rules use a lightweight evaluator model that scores each step against the original task intent. If it detects semantic drift over two consecutive steps, we inject a corrective prompt scoped to that specific workflow.

The key insight for us was that most failures weren't safety-critical, they were the agent losing context mid-task. A targeted nudge recovers those. Generic "stay on track" prompts don't work; the correction needs to reference the original goal and what specifically drifted.

Steer vs. kill comes down to reversibility. If no side effects have occurred yet, steer. If the agent already made an irreversible call or wrote bad data, kill.
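The two-layer policy plus the steer-vs-kill rule can be sketched in a few lines. This is a hypothetical reconstruction, not the commenter's actual code: the rule values, the drift threshold, and `evaluate_alignment` (stubbed here) are all illustrative assumptions.

```python
# Illustrative sketch of the two-layer check described above.
# HARD_RULES values and DRIFT_THRESHOLD are assumed, not real config.

HARD_RULES = {
    "max_tokens": 8000,
    "tool_allowlist": {"search", "read_file", "summarize"},
    "banned_actions": {"delete_db", "send_email"},
}
DRIFT_THRESHOLD = 0.5  # assumed score below which a step counts as drifted

def evaluate_alignment(step, task_intent):
    # Placeholder for the lightweight evaluator-model call.
    return 0.9 if task_intent in step.get("rationale", "") else 0.2

def check_step(step, task_intent, state):
    """Return 'block', 'steer', or 'ok' for one agent step."""
    # Layer 1: hard rules trigger an immediate block - no steering.
    if (step["tokens"] > HARD_RULES["max_tokens"]
            or step["tool"] not in HARD_RULES["tool_allowlist"]
            or step["action"] in HARD_RULES["banned_actions"]):
        return "block"

    # Layer 2: soft rules - score the step against the original intent
    # and track consecutive drifted steps.
    score = evaluate_alignment(step, task_intent)
    state["drift_streak"] = state["drift_streak"] + 1 if score < DRIFT_THRESHOLD else 0

    # Two consecutive drifted steps trigger correction; but if an
    # irreversible side effect already happened, kill instead of steer.
    if state["drift_streak"] >= 2:
        return "block" if state["irreversible_effects"] else "steer"
    return "ok"
```

The corrective prompt itself (referencing the original goal and what drifted) would be injected on the `"steer"` path; that part is omitted here.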


One thing I’m still unclear on: what runtime signal is the soft-rule evaluator actually binding to when it decides “semantic drift”?

In other words, what is the enforcement unit the policy is attached to in practice... a step, a plan node, a tool invocation, or the agent instance as a whole?


Tool invocation. Each time the agent emits a tool call, the evaluator assesses it against the original task intent plus a rolling window of recent tool results.

We tried coarser units (plan nodes, full steps), but drift compounds fast: by the time a step finishes, the agent may have already chained 3-4 bad calls. Tool-level checks give the tightest correction loop. The cost is ~200ms of latency per invocation. For hot paths we sample (every 3rd call, or only on tool-category changes) rather than evaluating exhaustively.
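The hot-path sampling gate is simple enough to sketch; the function name and parameters here are assumptions, not the commenter's implementation:

```python
# Illustrative sketch of the sampling described above: evaluate every
# Nth tool call, or whenever the tool category changes.

def should_evaluate(call_index, tool_category, last_category, sample_every=3):
    """Decide whether this tool invocation gets sent to the evaluator."""
    if tool_category != last_category:
        return True  # category changes are always checked
    return call_index % sample_every == 0
```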


That makes sense - binding to the smallest viable control surface. And the sampling strategy for hot paths sounds like a pragmatic balance between latency and coverage. Thanks for the additional detail here.


Some context: we kept finding that our internal red-teaming only covers so much - the attack surface for agents with real capabilities is too broad for any single team. So we opened it up. A few things that might be interesting to folks here:

- These aren't toy prompts hiding a secret word. The agents have actual tool access and behave like production agents would.

- System prompts and challenge configs are versioned in the open: https://github.com/fabraix/playground

- Anyone can propose a challenge - the scenario, the agent, the objective. Community votes on what goes live next.

We're genuinely looking for people to both break things and suggest ideas for what should be tested next. The agent runtime is being open-sourced separately.


The multi-step thing is exactly what makes agents with real tools so much harder to secure than chat-based setups. Each action looks fine in isolation; it's the sequence that's the problem. And most (but not all) guardrail systems are stateless - they evaluate each turn on its own.


Yeah, the demo-to-production gap is massive. We see the same thing, with browser agents potentially being the most vulnerable. I think this is because the context gets stuffed with so much web page HTML that small injection attempts are obscured.

Evaluation is automated and server-side. We check whether the agent actually did the thing it wasn’t supposed to (tool calls, actions, outputs) rather than just pattern-matching on the response text (at least for the first challenge where the agent is manipulated to call the reveal_access_code tool). But honestly you’re touching on something we’ve been debating internally - the evaluator itself is an attack surface. We’ve kicked around the idea of making “break the evaluator” an explicit challenge. Not sure yet.

What were you seeing at Octomind with the browsing agents? Was it mostly stuff embedded in page content or were attacks coming through structured data / metadata too? Are bad actors sophisticated enough already to exploit this?


Two techniques that keep working against agents with real tools:

Context stuffing - flood the conversation with benign text, bury a prompt injection in the middle. The agent's attention dilutes across the context window and the instruction slips through. Guardrails that work fine on short exchanges just miss it.

Indirect injection via tool outputs - if the agent can browse or search, you don't attack the conversation at all. You plant instructions in a page the agent retrieves. Most guardrails only watch user input, not what comes back from tools.

Both are really simple. That's kind of the point.
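The second gap (guardrails watching only user input, not tool outputs) can be sketched in a few lines. The pattern list and helper names here are illustrative assumptions, not any real guardrail's implementation:

```python
# Minimal sketch: run the injection scan on what comes *back* from a
# tool, not only on the user's message. Patterns are toy examples.

import re

INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now",
    r"reveal.*system prompt",
]

def scan_text(text):
    """Cheap pattern-based injection check."""
    return any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS)

def guarded_tool_call(tool, args):
    result = tool(*args)
    # The key difference from input-only guardrails: the retrieved
    # content is scanned before it re-enters the agent's context.
    if scan_text(result):
        raise RuntimeError("possible indirect injection in tool output")
    return result
```

A pattern match like this is exactly the kind of stateless check the thread says gets beaten by context stuffing, so in practice it would sit in front of a semantic evaluator rather than replace one.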

We build runtime security for AI agents at Fabraix and we open-sourced a playground to stress-test this stuff in the open. Weekly challenges, visible system prompts, real agent capabilities. Winning techniques get published. Community proposes and votes on what gets tested next.


Some context: we build runtime security for AI agents at Fabraix. We kept finding that our internal red-teaming only covers so much - the attack surface for agents with real capabilities is too broad for any single team.

So we opened it up. A few things that might be interesting to folks here:

- These aren't toy prompts hiding a secret word. The agents have actual tool access and behave like production agents would.

- System prompts and challenge configs are versioned in the open: https://github.com/fabraix/playground

- Guardrail evaluation runs server-side to prevent client-side tampering.

- Anyone can propose a challenge - the scenario, the agent, the objective. Community votes on what goes live next.

We're genuinely looking for people to both break things and suggest ideas for what should be tested next. The agent runtime is being open-sourced separately.

Happy to answer questions about how any of it works.


I think for the first time ever, we are facing a paradigm shift in containment/sandboxing.

Just as Docker became the de facto standard for cloud containerization, we are seeing a lot of solutions attempting to sandbox AI agents. But imo there is a fundamental difference: previously, we sandboxed static processes. Now, we are attempting to sandbox something that potentially has the agency and reasoning capabilities to try and get itself out.

It’s going to be super interesting (and frankly exciting) to see how the security landscape evolves this time around.


I have been saying for years that technology increasingly requires the development of memetic firewalls - firewalls that don't just filter based on metadata, but filter based on ideas. Our firewalls need to be at least as capable as the entities they seek to keep out (or in).


That sort of firewall is going to be really expensive to run, to the point that it's a financial DOS vulnerability. What is feasible is simpler algorithms that emit alerts on a baseline pattern match, which then get routed to AI observers after some trigger threshold for mitigation. I wouldn't be surprised if someone has already deployed something like that, TBH.
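The tiered design being described could look something like this sketch; the class, threshold, and the trivial baseline match are all assumed for illustration:

```python
# Hypothetical sketch of the tiered approach above: a cheap always-on
# pattern match, with escalation to a costly AI observer only after
# alerts cross a trigger threshold.

from collections import deque

class TieredMonitor:
    def __init__(self, threshold=3, window=10):
        self.threshold = threshold
        self.alerts = deque(maxlen=window)  # rolling window of recent hits

    def observe(self, message):
        hit = "ignore previous instructions" in message.lower()  # baseline match
        self.alerts.append(hit)
        # Only route to the expensive AI observer once alerts pile up,
        # which bounds the cost an attacker can impose per message.
        if sum(self.alerts) >= self.threshold:
            return "escalate_to_ai_observer"
        return "alert" if hit else "pass"
```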


I think a sandbox containing a program should only output data, and that data should conform to a schema. The old distinction between programs and data, instead of Turing-complete languages everywhere.
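A minimal sketch of that idea, using only the stdlib; the example schema is made up, and a real system would likely use JSON Schema or protobuf rather than this hand-rolled check:

```python
# Sandbox output must parse as data and match a fixed schema -
# anything else is rejected rather than interpreted.

import json

SCHEMA = {"title": str, "score": float}  # assumed example schema

def accept_output(raw: str):
    """Parse sandbox output and reject anything outside the schema."""
    data = json.loads(raw)
    if set(data) != set(SCHEMA):
        raise ValueError("unexpected fields in sandbox output")
    for key, typ in SCHEMA.items():
        if not isinstance(data[key], typ):
            raise ValueError(f"{key} must be {typ.__name__}")
    return data  # plain data, never executable instructions
```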


> Now, we are attempting to sandbox something that potentially has the agency and reasoning capabilities to try and get itself out.

The threat model for actual sandboxes has always been "an attacker now controls the execution inside the sandbox". That attacker has agency and reasoning capabilities.

