Hacker News | eterm's comments

Have you considered picking a new name for a different concept?

Or have ctrl+o cycle between "Info, Verbose, Trace"?

Or give us full control over what gets logged through config?

Ideally we would get a new tab where we could pick logging levels on:

  - Thoughts
  - Files read / written
  - Bashes
  - Subagents
etc.

That's what ethics are. If you don't make sacrifices for them, they aren't ethics, they're just conveniences.

This is easy to say until you're an immigrant worker in a foreign country - something one probably worked for their entire life up to that point - risking it all (and potentially wrecking the life of their entire family) just to stop some random utility from having a Copilot button. It's not "this software will be used to kill people", it's more like "there's this extra toolbar which nobody uses".

In life you have to choose your battles.


I hadn't made more solid connections between the current state of software and industry, the subjugation of immigrants, and the death of the American neoliberal order until this comment thread, but here it lies bare, naked, and essentially impossible to ignore. With regards to the whole picture, there's no good or moral place to "RETVRN" to in a nostalgic sense. The one question that keeps ringing through my head as I see the world in constant upheaval, and my one refuge in meaning, technical craftsmanship, tumbling, is: Why did I not see this coming?

"why won't other people make sacrifices for me?"

Because society in the US is arranged as a competition with no safety net, where your employer has a disproportionate amount of influence on your well-being and the happiness of your kids.

I'm not going to give up $1M in total comp and excellent insurance for my family because you and I don't like where AI is going.


Just having the option of giving up $1 million in compensation puts one far, far above meaningful worries about your well-being and the happiness of your kids.

Not really. We would have to downsize our life.

I'll have to explain it to the wife: "well, you see, we can't live in this house anymore because AI in Notepad was just too much".

I'll dial my ethical and moral stance on software up to 11 when I see a proper social safety net in this country, with free healthcare and free education.

And if we can't all agree on having even those vital things for free, then relying on collective agreement on software issues will never work in practice, so my sacrifice would be for nothing. I would just end up being the dumb idealist.


I use that a lot, but I find it is useful to avoid purity spirals. ( https://en.wikipedia.org/wiki/Purity_spiral )

I didn't see it as closing down discussion, so I'll be mindful of that in future.

There is a real danger, when presented with a problem, of discarding a partial solution because it fails to tackle a much larger problem.

It's a call for pragmatism over idealism.


Companies do it with email unsubscribe categories too, which is skirting laws for sure.

These days there is Winget, which I'd rather use than either of those.

It's the opposite of free, it's valuable.

Even for features that stay on the cutting-room floor. Especially for features that stay on the cutting-room floor.


I find that 80% of the time the assumptions I made doing detailed planning are invalidated when doing the actual work.

Usually whole subtasks need to be junked and others created.


4. The graph starts January 8.

Why January 8? Was that an outlier high point?

IIRC, Opus 4.5 was released late November.


Right after the holiday double-token promotion, users perceived a huge regression in capabilities. I bet that triggered the idea.

People were away for the holidays. What do you want them to do?

Or maybe, just maybe, that's when they started testing…

The Wayback Machine has nothing for this site before today, and the article is "last updated Jan 29".

A benchmark like this ought to start fresh from when it is published.

I don't entirely doubt the degradation, but the choice of where they went back to feels a bit cherry-picked to demonstrate the value of the benchmark.


Which makes sense; you gotta wait until you have enough data before you can communicate about said data…

If anything, it's consistent with the fact that they very likely didn't have data earlier than January 8th.


I'm in a similar but more ridiculous situation. My reasonably modern hardware should support Windows 11, but I get "disk not supported" because apparently I once picked the "wrong" bootloader?

I can't be arsed; if I'm going to have to fiddle around getting that working, I might as well move to Linux.


Much greater than now, given the open discoverability of the original post here, versus the walled-off content we have today, locked away in discord servers and the like.

Furthermore, the act of replying to that post will have bumped it right back to the top for everyone to see.


I agree with this. We sorely miss these forums with their civil replies, now clouded behind "influencer" culture, which is optimized for incentives. Pure discussions like this example are such stalwarts of the open web.

On the other hand, small websites and forums can disappear, but that openness allows platforms like archive.org to capture and "fossilize" them.


These forums still exist. Typically with much older and mature discussions, as the users have aged alongside the forums. Nothing is stopping you from joining them now.

My Something Awful forums account is over 25 years old at this point. The software, standards, and moderation style are approximately unchanged, complete with the 10-dollar sign-up fee to keep out the spam.


Like mosquitos trapped in amber, preserving hidden blocks of knowledge


That is a well recognised part of the LLM cycle.

A model or new model version X is released, and everyone is really impressed.

3 months later, "Did they nerf X?"

It's been this way since the original chatGPT release.

The answer is typically no; it's just that your expectations have risen. What was previously a mind-blowing improvement is now expected, and any missteps feel amplified.


This is not always true. LLMs do get nerfed, and quite regularly: usually because the provider discovers that users are using them more than expected, because of user abuse, or simply because the model attracts a larger user base. One recent nerf is the Gemini context window, which was drastically reduced.

What we need is an open and independent way of testing LLMs, and stricter regulation on the disclosure of product changes when the product is paid for under a subscription or prepaid plan.


There's at least one site doing this: https://aistupidlevel.info/

Unfortunately, it has paywalled most of the historical data since I last looked at it, but it's interesting that Opus has dipped below Sonnet on overall performance.


Interesting! I was just thinking about pinging the creator of simple-bench.com and asking them if they intend to re-benchmark models after 3 months. I've noticed Gemini models in particular dramatically dropping in quality after the initial hype cycle. Gemini 3 Pro _was_ my top performer and has slowly declined to 'is it worth asking', complete with gpt-4o style glazing. It's been frustrating. I had been working on a very custom benchmark, and over the course of it Gemini 3 Pro and Flash both started underperforming by 20% or more. I wondered if I had subtly broken my benchmark, but ultimately started seeing the same behavior in general online queries (Google AI Studio).


> What we need is an open and independent way of testing LLMs

I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try and make the case that the regressions aren't being captured by the benchmarks somehow, but until we have a repeatable benchmark which shows the regression, none of these companies are going to give you a refund based on your vibes.


How hard is benchmarking models actually?

We've got a lot of available benchmarks & modifying at least some of those benchmarks doesn't seem particularly difficult: https://arc.markbarney.net/re-arc

To reduce cost & maintain credibility, we could have the benchmarks run through a public CI system.

What am I missing here?
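
For what it's worth, the mechanical part of a repeatable harness is small; the hard part is cost and statistical power. Here's a rough sketch of what a CI-run harness could look like in Python. Note that query_model, tasks.jsonl, and the exact-match grading are placeholder assumptions for illustration, not any particular benchmark's API:

    import json
    import statistics
    from dataclasses import dataclass

    @dataclass
    class Task:
        prompt: str
        expected: str  # exact-match grading keeps the harness auditable

    def query_model(prompt: str) -> str:
        """Placeholder: swap in the provider API call you actually use,
        with the model version and temperature pinned."""
        raise NotImplementedError

    def run_suite(tasks: list, samples_per_task: int = 5) -> dict:
        # Sample each task several times so the score reflects the
        # probabilistic nature of the model, not a single lucky roll.
        per_task_scores = []
        for task in tasks:
            passes = sum(
                1
                for _ in range(samples_per_task)
                if query_model(task.prompt).strip() == task.expected
            )
            per_task_scores.append(passes / samples_per_task)
        return {
            "mean_score": statistics.mean(per_task_scores),
            "stdev": statistics.pstdev(per_task_scores),
            "samples_per_task": samples_per_task,
            "n_tasks": len(tasks),
        }

    if __name__ == "__main__":
        with open("tasks.jsonl") as f:
            tasks = [Task(**json.loads(line)) for line in f]
        # Commit this JSON artifact from every CI run; a regression then
        # shows up as a diff in version control rather than as a vibe.
        print(json.dumps(run_suite(tasks), indent=2))

The part that sketch doesn't solve is the statistics: enough samples per task to distinguish a real regression from noise, and a task set that stays useful once the providers could plausibly have trained on it.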


Except the time it got to the point that Anthropic had to acknowledge it? Which also revealed they don't have monitoring?

https://www.anthropic.com/engineering/a-postmortem-of-three-...


I usually agree with this. But I am using the same workflows and skills that were a breeze for Claude, and they are now causing it to run in cycles and require intervention.

This is not the same thing as "omg, the vibes are off"; it's reproducible. I am using the same prompts and files and getting way worse results than with any other model.


When I once had that happen in a really bad way, I discovered I had written something wildly incorrect into the readme.

It has a habit of trusting documentation over the actual code itself, causing no end of trouble.

Check your claude.md files (both local and ~user) too; there could be something lurking there.

Or maybe it has horribly regressed, but that hasn't been my experience, certainly not back to Sonnet levels of needing constant babysitting.


I’m a x20 Max user who’s on it daily. Unusable the last 2 days. GLM in OpenCode and my local Qwen were more reliable. I wish I was exaggerating.


Also, people who were lucky and had lots of success early on, but then start to run into the actual problems of LLMs, will experience that as "it was good and then it got worse", even when it didn't actually get worse.

If LLMs have a 90% chance of working, there will be some who have only success and some who have only failure.

People are really failing to understand the probabilistic nature of all of this.

"You have a radically different experience with the same model" is perfectly possible with less than hundreds of thousands of interactions, even when you both interact in comparable ways.


Just because it's been true in the past doesn't mean it will always be the case.


Opus was a non-deterministic probability machine in the past, present and the foreseeable future. The variance eventually shows up when you push it hard.


Eh, I've definitely had issues where Claude can no longer easily do what it's previously done. That's with constantly documenting things well in appropriate markdown files and resetting context here and there to keep confusion minimal.

