I'd be curious about the eval methodology. In production coding tasks, the gap between benchmark scores and actual workflow integration can be significant. What does the error recovery loop look like?
Absolutely, benchmarks are a different breed. Forge's eval is deliberately scoped as a stress test of the recovery loop, not a measure of end-to-end agentic quality.
Scenarios range from basic 2-step workflows, to more complex ones with dead ends, breadcrumbs, misleading names.
Concrete example:
Task: get, analyze and report on Q3 sales data.
Model emits: analyze_sales(quarter="Q3"). This skipped the fetch step. Forge's response validator catches it before the tool function runs. Instead of letting the bad call hit the real impl (which would error or hallucinate), forge replies on the canonical tool-result channel.
We send this to the model:
tool_result: [PrereqError] analyze_sales requires fetch_sales_data
to be called first. Available next steps: fetch_sales_data
Model emits a corrected fetch_sales_data(...) on the next turn.
Three enforcement paths use this same channel: prerequisite violations, premature terminal calls, unknown-tool retries.
We also have rescue parsing for known templates (Jason OpenAI style, XML like granite, etc) where we try to parse tool calls that might be malformed.
And lastly bare text response nudges. Small models love to chat, we need them to call tools!
Interesting approach. The key question for adoption is usually about the migration path — how painful is it for existing teams to switch, and what does the intermediate state look like?
The operational complexity is worth comparing here. The migration path and schema evolution story often matter more than raw performance numbers for teams choosing between these options.
reply