They're tuned (and it's part of their nature) to be convincing to people who don't already know the answer. Yesterday I couldn't get one to figure out how to substitute peanut butter for butter in a cookie recipe.
I ended up spending an hour on it and dumping the context twice. I asked it to evaluate its own performance and it gave itself a D-. It came up with the measurements for a decent recipe once, then promptly forgot it when asked to summarize.
Good luck trying to use them as a search engine (or a lawyer), because they fabricate a third of the references on average (for me), unless the question is difficult, in which case they fabricate all of them. They also give bad, nearly unrelated references and ignore obvious ones. I had a case, when talking about the Mexican-American War, where the hallucinations crowded out good references. I assume it liked the sound of the things it made up more than the things that were available.
edit: I find it baffling that GPT-5 and Qwen3 often have identical hallucinations. The convergence makes me think that either there's a hard limit to how good these things can get and it has been reached, or they're just directly ripping each other off.