I feel there is a point when all these benchmarks are meaningless. What I care a...

razster · 2025-12-12T05:25:04 1765517104

The latest of the big three... OpenAI, Claude, and Google, none of their models are good. I've spent too much time monitoring them than just enjoying them. I've found it easier to run my own local LLM. The latest Gemini release, I gave it another go but only for it to misspell words and drift off into a fantasy world after a few chats with help restructuring guides. ChatGPT has become lazy for some reason and changes things I told it to ignore, randomly too. Claude was doing great until the latest release, then it started getting lazy after 20+k tokens. I tried making sure to keep a guide to refresh it if it started forgetting, but that didn't help.

Locals are better; I can script and have them script for me to build a guide creation process. They don't forget because that is all they're trained on. I'm done paying for 'AI'.

marcosscriven · 2025-12-12T08:59:10 1765529950

What are your best local models, and what hardware do you run them on?

balder1991 · 2025-12-12T15:30:49 1765553449

I have this impression that LLMs are so complicated and entangled (in comparison to previous machine learning models) that they’re just too difficult to tune all around.

What I mean is, it seems they try to tune them to a few certain things, that will make them worse on a thousand other things they’re not paying attention to.

striking · 2025-12-12T06:35:09 1765521309

What's to stop you from using the APIs the way you'd like?

joshribakoff · 2025-12-12T13:23:17 1765545797

The API is a way to access a model, he is criticizing the model not the access the method (at least until the last sentence where he incorrectly implied you can only script a local model, but I don’t think thats a silver bullet, in my experience that is even more challenging than starting with a working agent)

fleischhauf · 2025-12-12T12:02:37 1765540957

I'm always impressed how fast people get used to new things. couple of years ago something like chatgpt was completely impossible, and now people complain it something's does mit do what you told it to and sometimes lies. (not saying your points are not valid or you should not raise them) Some of the points are just not fixable at this point due to tech limitations. A language model currently simply has no way to give an estimate of its confidence. Also there is no way to completely do away with hallucinations (lies). there need to be some more fundamental improvements for this to work reliably.

davebren · 2025-12-12T14:33:43 1765550023

Your point would stand if the entire economy wasn't shifted around this product and employees weren't being told to use it or lose their jobs.

empiko · 2025-12-12T08:51:22 1765529482

Consider using structured output. You can define a JSON with specific fields, and LLMs are only used to fill in the values.

https://ai.google.dev/gemini-api/docs/structured-output

ifwinterco · 2025-12-12T08:12:22 1765527142

I'm not an expert but my understanding is transformers based models simply can't do some of those things, it isn't really how they work.

Especially something like expressing a certainty %, you might be able to get it to output one but it's just making it up. LLMs are incredibly useful (I use them every day) but you'll always have to check important output

carsoon · 2025-12-12T20:27:48 1765571268

Yeah I have seen multiple people use this certainty % thing but its terrible. A percentage is something calculated mathemtatically and these models cannot do that.

Potentially they could figure it out if they looks into a comparison of next token probabilites, but this is not exposed in any modern model and especially not fed back into the chat/output.

Instead people should just ask it to explain BOTH sides of an argument or explain why something is BOTH correct and incorrect. This way you see how it can halluciate either way and get to make up your own mind about the correct outcome.

nullbound · 2025-12-11T21:31:57 1765488717

<< I feel there is a point when all these benchmarks are meaningless.

I am relatively certain you are not alone in this sentiment. The issue is that the moment we move past seemingly objective measurements, it is harder to convince people that what we measure is appropriate, but the measurable stuff can be somewhat gamed, which adds a fascinating layer of cat and mouse game to this.

delifue · 2025-12-12T01:31:29 1765503089

Once a metric becomes optimization target, it ceases to become good metric.

hnfong · 2025-12-12T04:07:29 1765512449

There's a leaderboard that measures user experience, the "lmsys" Chatbot Arena Leaderboard ( https://huggingface.co/spaces/lmarena-ai/lmarena-leaderboard ). Main issue with it these days are that it kinda measures sycophancy and user preferred tone more than substance.

Some issues you mentioned like length of response might be user preference. Other issues like "hallucination" are areas of active research (and there are benchmarks for these).

carsoon · 2025-12-12T20:31:03 1765571463

I have a kinda strange chatgpt personalization prompt but it's been working well for me. The focus is me to get the model to analyze 2 sides and the extremes on both ends so it explains both and lets me decide. This is much better than asking it to make up accuracy percentages.

I think we align on what we want out of models:

""" Don't add useless babelling before the chats, just give the information direct and explain the info.

DO NOT USE ENGAGEMENT BAITING QUESTIONS AT THE END OF EVERY RESPONSE OR I WILL USE GROK FROM NOW ON FOREVER AND CANCEL MY GPT SUBSCRIPTION PERMANENTLY ONLY. GIVE USEFUL FACTUAL INFORMATION AND FOLLOW UPS which are grounded in first principles thinking and logic. Do not take a side and look at think about the extreme on both ends of a point before taking a side. Do not take a side just because the user has chosen that but provide infomration on both extremes. Respond with raw facts and do not add opinions.

Do not use random emojis. Prefer proper marks for lists etc. """

Those spelling/grammar errors are actually there and I don't want to change it as its working well for me.

dontlikeyoueith · 2025-12-12T17:09:27 1765559367

> Refuse to express uncertainty or nuance (i asked ChatGPT to give me certainty %s which it did for a while but then just forgot...?)

They're literally incapable of this. Any number they give you is bullshit.