The RLM framing basically turns long-context into an RL problem over what to remember and where to route it: main model context vs Python vs sub-LLMs. That’s a nice instantiation of The Bitter Lesson, but it also means performance is now tightly coupled to whatever reward signal you happen to define in those environments. Do you have any evidence yet that policies learned on DeepDive / Oolong-style tasks transfer to “messy” real workloads (multi-week code refactors, research over evolving corpora, etc.), or are we still in the “per-benchmark policy” regime?
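For concreteness, here is roughly how I read that routing action space. This is a minimal hand-rolled sketch; the `Route` enum and `route` function are my own hypothetical names, not anything from the post:

```python
from enum import Enum, auto

class Route(Enum):
    """Hypothetical action space for the controller policy."""
    KEEP_IN_CONTEXT = auto()      # pay attention cost, keep verbatim
    OFFLOAD_TO_PYTHON = auto()    # bind to a REPL variable, query later
    DELEGATE_TO_SUB_LLM = auto()  # hand off to a cheaper sub-model call

def route(chunk: str, remaining_budget: int) -> Route:
    # A deliberately dumb hand-written policy; the RLM pitch, as I
    # understand it, is that RL learns this mapping from task reward
    # instead of us hard-coding it.
    if len(chunk) < 2_000 and remaining_budget > len(chunk):
        return Route.KEEP_IN_CONTEXT
    if chunk.lstrip().startswith(("{", "[", "<")):  # looks like structured data
        return Route.OFFLOAD_TO_PYTHON
    return Route.DELEGATE_TO_SUB_LLM
```

If the learned policy only ever beats this kind of if/else baseline on the benchmarks it was trained on, that's the "per-benchmark policy" regime I'm worried about.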
The split between main model tokens and sub-LLM tokens is clever for cost and context rot, but it also hides the true economic story. For many users the cost that matters is total tokens across all calls, not just the controller’s context. Some of your plots celebrate higher “main model token efficiency” while total tokens rise substantially. Do you have scenarios where RLM is strictly more cost-efficient at equal or better quality, or is the current regime basically “pay more total tokens to get around context limits”?
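To pin down what I mean by the accounting, here is a toy calculation with made-up token counts and rates (nothing here is from your plots):

```python
def total_cost(main_tokens: int, sub_calls: list[int],
               main_rate: float, sub_rate: float) -> float:
    """Total spend across the controller and all sub-LLM calls."""
    return main_tokens * main_rate + sum(sub_calls) * sub_rate

# "Main model token efficiency" improves (10k -> 4k controller tokens),
# but total tokens grow from 10k to 54k, and even with a 5x cheaper
# sub-model the bill goes up:
plain = total_cost(10_000, [], main_rate=1.0, sub_rate=0.2)              # 10000.0
rlm   = total_cost(4_000, [30_000, 20_000], main_rate=1.0, sub_rate=0.2)  # 14000.0
```

A plot of quality vs. this total-cost number, rather than vs. controller tokens, is the comparison I'd want to see.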
math-python is the most damning data point: same capabilities, but the RLM harness makes models worse and slower. That feels like a warning that a “more flexible scaffold” is not automatically a win; you’re introducing an extra layer of indirection that the model has not been optimized for. The claim that RL training over the RLM will fix this is plausible, but also unfalsifiable until you actually show a model that beats a strong plain-tool baseline on math with less wall-clock time and fewer tokens.
Oolong and verbatim-copy are more encouraging: the controller treating large inputs as opaque blobs and then using Python + sub-LLMs to scan/aggregate is exactly the kind of pattern humans write by hand in agents today. One thing I’d love to see is a comparison vs a well-engineered non-RL agent baseline that does essentially the same thing but with hand-written heuristics (chunk + batch + regex/SQL/etc.). Right now the RLM looks like a principled way to let the model learn those heuristics, but the post doesn’t really separate “benefit from architecture” vs “benefit from just having more structure/tools than a vanilla single call.”
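Concretely, the baseline I have in mind is something this dumb. Everything here is hypothetical hand-rolled heuristics, not the RLM's method:

```python
import re

def chunk_scan(corpus: str, pattern: str,
               chunk_size: int = 8_000, overlap: int = 200) -> list[str]:
    """Hand-written chunk + scan + aggregate: no learning, just heuristics."""
    step = chunk_size - overlap
    # Overlapping chunks so matches spanning a boundary aren't lost.
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), step)]
    hits: list[str] = []
    for chunk in chunks:  # the "batch" step; each chunk could also go to a cheap LLM call
        hits.extend(re.findall(pattern, chunk))
    return sorted(set(hits))  # aggregate: dedupe before one final synthesis call
```

If the RLM beats that by a wide margin at comparable cost, the architecture is doing real work; if it roughly ties, the win came from structure and tools, not from learning.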
On safety / robustness: giving the model a persistent Python REPL and arbitrary pip is powerful, but it also dramatically expands the attack surface if this ever runs on untrusted inputs. Are you treating RLM as strictly a research/eval harness, or do you envision this being exposed in production agent systems? If the latter, sandboxing guarantees and resource controls probably matter as much as reward curves.
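Even cheap OS-level caps would be a start. This sketch only uses the standard-library `resource` and `subprocess` modules and is obviously not a real sandbox (a production setup would add containers/seccomp and a network-less environment):

```python
import resource
import subprocess

def run_untrusted(code: str, timeout_s: int = 10) -> str:
    """Run generated code in a child process with hard resource caps (POSIX only)."""
    def limits() -> None:
        # Cap CPU seconds and address space before the child exec's.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))  # 512 MiB
    proc = subprocess.run(
        ["python3", "-c", code],
        preexec_fn=limits, capture_output=True, text=True, timeout=timeout_s,
    )
    return proc.stdout
```

None of that addresses prompt injection through the inputs themselves, which is the part that scares me most once sub-LLMs are summarizing untrusted documents.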
The beauty of Suno, at least for me, was the opportunity to turn my original lyrics into listenable music for free, without having it attached in any way to any of the big labels, who are evil to the core. I really hope they keep the existing user experience intact.
Hello, I'm one of the original evangelists for Ruby on Rails and the author of The Rails Way as well as Patterns of Application Development Using AI. Over the past three decades, I’ve led teams and built products at every scale — from early-stage startups to global platforms — combining deep technical expertise with a creative, forward-looking approach to software craftsmanship.
I bring 30 years of hands-on engineering experience, including senior leadership in architecture, AI integration, and product strategy. Whether working as an individual contributor or guiding organizations through transformation, I focus on delivering clarity, velocity, and sustainable innovation. My last gig was leading AI strategy related to Developer Experience at Shopify.
Currently evaluating consulting and permanent opportunities, with a preference for an executive leadership position at a larger company, although I will consider consulting and fractional-CTO roles for startups and smaller ventures if the project and team are interesting enough.
Big news in the music AI space. Interesting and potentially worrying implications for Suno, which has pulled far ahead in the race and recently announced a $150M ARR milestone.
My biggest TIL takeaway from that article was an "oh wow" moment:
> The other sound that ‘ȝ’ once spelled is the “harsh” or “guttural” sound made in the back of the mouth, which you hear in Scots loch or German Bach. This sound is actually the reason for the most famous bit of English spelling chaos: the sometimes-silent, sometimes-not sequence ‘gh’ that you see in laugh, cough, night, and daughter. Maybe one day I’ll tell you that story too.
I personally am more fond of provoking an "angstschreeuw" (a scream of fear) in English speakers by asking them to pronounce "slechtstschrijvend" ("worst-writing") or "zachtstschrijdend" ("softest-striding") and watching them recoil in horror at the consonant clusters[0][1].
> I'd love to integrate with whatever model subscription is available but it seems using Max outside of Claude products is against their terms.

I suggest reaching out to Anthropic and letting them know you would like to use your Max subscription with other coding agents.
Like it's probably a good thing for humanity if the USA does not feel the need to go to war with China over Taiwan.