
This is a very weird type of paper. They take a specific approach, then make arguments about a broad class of approaches that are under constant development. The finding that distilled LLMs must be more specialized than the giant LLMs that train them is unsurprising; nobody at this point expects a 13B parameter model to succeed with the same accuracy at the broad range of tasks supported by what may be a 1T parameter model.


> nobody at this point expects a 13B parameter model to succeed with the same accuracy at the broad range of tasks supported by what may be a 1T parameter model

I think a lot of people believe exactly that. To take one example from the "We Have No Moat" essay:

"It doesn’t take long before the cumulative effect of all of these fine-tunings overcomes starting off at a size disadvantage. Indeed, in terms of engineer-hours, the pace of improvement from these models vastly outstrips what we can do with our largest variants, and the best are already largely indistinguishable from ChatGPT." - https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...


That essay works in a context of specific datasets and tasks, which are referenced in the surrounding sentences and paragraphs. They are saying that for a particular "emergent" capability you might reach with a giant LLM, you might get there more efficiently with distillation / LoRA.

My comment is about generality, which is the remaining advantage of giant models.
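
For concreteness, the distillation / LoRA route the essay is describing looks roughly like the sketch below. This is a minimal illustration assuming the Hugging Face transformers + peft stack; the base model name, target modules, and hyperparameters are placeholders, not anything taken from the paper or the essay.

    # Minimal LoRA fine-tuning sketch (illustrative model name and settings).
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-2-13b-hf"  # hypothetical 13B base model
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # Attach low-rank adapters to the attention projections; only the
    # adapter weights are trained, which keeps the per-task cost small.
    lora = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()
    # ...then train on the task-specific instruction data with an ordinary
    # causal-LM loss (e.g. transformers.Trainer or trl's SFTTrainer).

The point of the sketch is the cost profile: because only the adapters are trained, each narrow capability is cheap to add, which is the essay's "cumulative fine-tunings" argument.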


That is exactly what people are expecting, largely because of misleading metrics thrown around to back ridiculous claims, e.g. that Vicuna-13B is nearly as good as GPT-3.5. It even shows up in the comments here, and if you go to any tangentially related subreddit, that's the kind of claim passed along as "everybody knows" to people setting up a local LLM for the first time.


The Vicuna headline was definitely an overreach, although the main text admits pretty readily that their performance test (asking ChatGPT to evaluate quality) is not rigorous [1]. I'm sure that has set a lot of people pontificating about AGI with tiny models, but I can't imagine anyone who has worked directly with fine-tuning having that impression.

The comments I see here are not about that. They are about small models succeeding at specific tasks, which this paper affirms. Most applications of LLMs are not general-purpose chatbots, so this is not bad news for most of the distill/fine-tune community.

[1] https://lmsys.org/blog/2023-03-30-vicuna/


Even if they don't start out expecting that, people might be fooled by how an imitation model behaves when they try it out. So it seems useful to point out that initial impressions based on crowdsourced evaluations are misleading:

> We were initially surprised by how much imitation models improve over their base models: they are far better at following instructions, and their outputs appear similar to ChatGPT’s. This was further supported by both human and GPT-4 evaluations, where the outputs of our best imitation model were rated as competitive with ChatGPT (e.g., Figure 1, left).

("Competitive" meaning that 70% outputs seemed about as good.)



