It put the chart title directly on top of Australia.
Which just about sums up my experience with using LLMs to code, really (though not with these state-of-the-art models, admittedly) - it's amazing what they can do, but left to their own devices they'll make boneheaded decisions.
> it's amazing what they can do, but left to their own devices they'll make boneheaded decisions.
Yeah, the whole "can run for 9 hours on a task" thing is not a positive to me.
I tend to find if Opus 4.8 runs for ~15 mins on a task, then the end result has gone off in a weird direction at some point, and it needs winding back a fair bit.
And that's with extremely clear direction, literal specification docs to follow, etc.
That being said, having functional code already created beforehand (i.e. by a human) goes a long way to ensuring the AI model has a path it can build on without making too many dumb architectural choices by itself. Generally.