I checked the link; it never says that the model's predictions get lower quality due to batching, just that they become nondeterministic. I don't understand why people conflate these things. Also, it's unlikely that they use smaller batch sizes when load is lower. More likely they spin GPU servers up and down based on demand, or reallocate servers and GPUs between different roles and tasks.
It can be made deterministic. It's not trivial and can slow things down a bit (not much), but there are environment variables you can set to make your GPU computations bitwise reproducible. I have done this when training models with PyTorch.
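
In case anyone wants to try it, a minimal sketch of what I mean (the helper name is mine; the environment variable and the torch calls are the real ones):

    # Must be set before CUDA initializes, or cuBLAS won't be deterministic.
    import os
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    import random
    import numpy as np
    import torch

    def make_deterministic(seed: int = 0) -> None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)  # seeds CPU and all CUDA devices
        torch.use_deterministic_algorithms(True)  # raise on nondeterministic ops
        torch.backends.cudnn.benchmark = False  # cuDNN autotuning varies run to run

With use_deterministic_algorithms(True), any op without a deterministic implementation throws instead of silently producing run-to-run differences, which is how you find the culprits.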
How is this related to overloading? The nondeterminism should not be a function of overloading; an overloaded server should just time out or reply more slowly. It will only be dumber if the request gets rerouted to a dumber, faster model, e.g. a quantized one.
This has nothing to do with overloading. The suspicion is that when there is too much demand (or they just want to save costs), Anthropic sometimes serves a less capable (quantized, distilled, etc.) version of the model. People want to measure this so there is concrete evidence instead of hunches and feelings.
To say that this measurement is bad because the server might just be overloaded completely misses the point. The point is to see whether the model sometimes silently performs worse. If I get a response from "Opus", I want a response from Opus. Or at least I want to be told that I'm getting slightly-dumber-Opus this hour because the server load is too high.
We are in the comment section about an AI conference, and up until the last few years, material/hardware costs for computer science research were very cheap compared to other sciences like medicine or biology, where they use bespoke instruments and materials. In CS, until very recently, all you needed was a good consumer PC for each grad student, and it lasted for many years. Nowadays GPU clusters are needed more, but funding is generally not keeping up with that, so even good university labs are badly under-resourced on this front.
Universities are not really motivated to slow down the research careers of their employees; on the contrary, they are very much interested in their employees producing novel, highly cited publications and bringing in the grants those publications can lead to.
Prerequisite required by whom, and why is that entity motivated to design such a requirement? Universities want more novel breakthrough papers to boast about and to outshine other universities in the rankings. And if one is honest, other researchers also get more excited about new ideas than about a failed replication, which may have failed for a thousand different reasons; the original authors will argue that you did something wrong or evaluated unfairly, and publicly accusing other researchers of doing bad work won't help your career much. It's a small world: you'd be making enemies of people who will sit on your funding evaluation committees and hiring committees, and it generally just leads to drama.

Also, papers are superseded so fast that people don't even care that a no-longer-state-of-the-art paper may have been wrong. There are five newer ones that perform better and nobody uses the old one. I'm just stating how things actually are, not saying this is good, but when you say something "should" happen, think about who exactly is motivated to drive such a change.
> Prerequisite required by who, and why is that entity motivated to design such a requirement?
Grant awarding institutions like the NIH and NSF presumably? The NSF has as one of its functions, “to develop and encourage the pursuit of a national policy for the promotion of basic research and education in the sciences”. Encouraging the replication of research as part of graduate degree curricula seems to fall within bounds. And the government’s interest in science isn’t novelty per se, it’s the creation and dissemination of factually correct information that can be useful to its constituents.
The commenter I was replying to wanted it to be a prerequisite for a degree, not for a grant. Grant-awarding institutions also have to justify their spending to other parts of government and/or parliament (specifically, politicians). Both politicians and the public want to see breakthrough results that have the potential to cure cancer and whatnot. They want to boast that their funding contributed to winning some big-name prize, and so on. You have to think with the minds of specific people in specific positions: what makes them look good, what gets them praise, promotions, and friends.
> And the government’s interest in science isn’t novelty per se, it’s the creation and dissemination of factually correct information that can be useful to its constituents.
curl|bash isn't any less safe than installing from a random PPA, or a random npm or pip package, or a random browser extension, or anything else. The problem is the "random", not the shell script. If you don't trust it, don't install it. Also, thinking that sudo is the big danger nowadays is a red herring: your personal files getting stolen or encrypted by ransomware is often worse than having to reinstall the OS.
Recently asked Claude Code how to do more thorough tests, described how I imagined it, and it set up Hypothesis and mutmut. The latter is quite cool: mutation testing introduces bugs into the code, like swapping values and relational operators, and checks whether any test catches each bug. If not, your tests are probably not thorough enough. Better than just line-coverage checks. Roughly what that looks like is sketched below.
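
A toy sketch of the combination (the clamp function and test are hypothetical examples of mine; Hypothesis and mutmut are the actual tools):

    from hypothesis import given, strategies as st

    def clamp(value: int, low: int, high: int) -> int:
        # Function under test: clamp value into the range [low, high].
        return max(low, min(high, value))

    @given(value=st.integers(), low=st.integers(), high=st.integers())
    def test_clamp_stays_in_range(value, low, high):
        if low > high:  # Hypothesis generates arbitrary ints, so normalize the range
            low, high = high, low
        result = clamp(value, low, high)
        assert low <= result <= high

    # Then, from the shell:
    #   mutmut run       # mutates the source and reruns the tests per mutant
    #   mutmut results   # lists surviving mutants, i.e. bugs no test caught

If mutmut swaps max for min or flips a <= to <, the property test above should kill that mutant; a mutant that survives points at behavior your tests never pin down.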