RWKV is def. better than TinyStories 125MB. Unfortunately, I have only seen 3 mo...

jasonjmcghee · on Jan 2, 2025

These tiny models in general have really weird failure modes. I tried the tiny stories prompt about asking mom for a dog who said no, and it output an incredibly dark story about how she asked her dad and they got a dog but it had pancreatic cancer (paraphrasing, it went into detail about the surgery etc.) and then starting writing an informational PSA about who is at risk of pancreatic cancer etc.

nerdponx · on Jan 2, 2025

Lest we forget that this stream-of-consciousness confusion was state of the art just a few years ago.

It makes sense if you think about it: a small model's "internal state" isn't rich enough to keep track of whatever it was supposed to be talking about.

It makes me think that the reason LLMs need to be so large is that the internal state needs to be bigger than a typical human "idea", whatever that might mean.

acchow · on Jan 3, 2025

The way we do LLMs now is that the program and the data are one and the same. The program mutates itself as it "executes". This is probably also how the brain works since there is no hard separation between "memory" neurons and "data processing" neurons. (biology has no hard separation in general).

kube-system · on Jan 2, 2025

What I find fascinating is how ML models hallucinate in a way that is sometimes reminiscent of a fever dream.

ethbr1 · on Jan 2, 2025

It makes sense that the failure modes of language prediction look a lot like ADD.

p0w3n3d · on Jan 2, 2025

It's because they are precisely lacking attention

jdiff · on Jan 3, 2025

Don't fall into the trap of applying human psychology to LLMs. Bag-of-chemistry quirks do not translate to matrix-multiplication quirks.

ethbr1 · on Jan 3, 2025

Why not? In both cases the result is losing the thread of thought.

hobs · on Jan 3, 2025

Because analogy can be useful in explaining things, or it can be worse than useless - it ties our thinking up into side quests that have nothing to do with the matter at hand.

jdiff · on Jan 3, 2025

...No, no that's not how ADHD works. It's difficult to sum up how wrong this is concisely, but I invite you do to some serious research into ADHD, how it functions, and the great variety of ways in which it can present in different people. It's quite a poor analogy.

ethbr1 · on Jan 3, 2025

I'm aware that anything to do with the brain has a variety of presentations.

Could you try to put a couple sentences down on how ADHD is an inapt metaphor for failure modes in this case?

It's lazy to claim something is wrong without offering a useful point as to how it's wrong. I trust in your ability to summarize.

jdiff · on Jan 3, 2025

For additional context/discussion, I feel this comment[0] elsewhere in the thread put it well.

The reply to that comment also has some information I feel is helpful to show the breakdown here. It mentions that lack of attention presents in only 15-20% of cases. This isn't ADHD, it is something new, the fundamental underpinnings do not relate, and so the analogy/metaphor does not facilitate a better understanding of the situation.

On the contrary, it makes LLM "attention" out to be something entirely different from what it actually is. Without attention, models don't become easily distracted. They are easily distracted regardless. Without attention, LLMs primarily fail to disambiguate between different meanings of identical words, they fail to take context of the sentence structure into account when assigning meaning.

I hopefully don't have to dive into psychological and chemical specifics of ADHD to have demonstrated that this is fundamentally just not at all what ADHD is. Again, there is no underlying harmony between this mechanism and how ADHD affects human attention in 15-20% of cases, and there is no analogy.

The only similarity is that they both use the word "attention". If they'd used a different label, we wouldn't even be having this conversation right now.

[0] https://news.ycombinator.com/item?id=42585600

marxisttemp · on Jan 3, 2025

It’s lazier to claim something is correct without offering a useful point as to how it’s correct. I trust in your ability to theorize.

soulofmischief · on Jan 3, 2025

ADHD is an actively-researched dopaminergic disorder with a host of possible symptoms completely unrelated to attention or hyperactivity.

It is ill-named and thus one often encounters comments such as yours in the real world, which while not meant to be negative, can be marginalizing to those with ADHD who see their disorder as misunderstood and the term misused much like people who say "I'm depressed" or "They're acting schizo again".

LLMs do not have dopamine pathways and therefore we should avoid comparing them to human-specific brain disorders, or marginalizing ADHD folk by trivializing the disorder or spreading misinformation about the presentation of ADHD. LLM hallucination does not "look a lot like ADD", that's such a vague and unsupported claim. Furthermore, "lacking attention" doesn't even make sense with respect to attention models. The "attention" in ADHD and "attention" in transformers share a semantic basis but are two very different phenomena.

robwwilliams · on Jan 3, 2025

For a good overview in ADHD see

https://www.ncbi.nlm.nih.giv/books/NBK441838/

It is not “a dopaminergic disorder” any more than many other neuropsychiatric disorders. Nothing much happens in CNS without some level of modulation by dopaminergic receptors, and to the best of my knowledge variants in these receptors are not known to contribute strongly to ADHD (I just confirmed by reviewed the GWAS Catalog: ebi.ac.uk/gwas/efotraits/EFI_oo3888 ).

Furthermoe lack of attention is considered an important facet of ADHD—-common to about 15-20% of cases.

Humans tend to think in terms of metaphors. Similes and metaphors are crucial in learning and thinking. And yes, sometimes problematic.

Explaining what is wrong with a particular metaphor can help.

taneq · on Jan 3, 2025

A fever dream looks nothing like ADD. If anything it's like a very mild mushroom trip. Did you base this on anything or did it just sound good in your head?

ethbr1 · on Jan 3, 2025

Your fever dreams and/or mushroom trips must be a lot more narratively stable and consistent than mine...

seattleeng · on Jan 3, 2025

As is usually the case, check the data! A lot of the dataset used has fairly morbid scenarios, so the model is working as expected. All the data was synthetically created with GPT4

jmward01 · on Jan 2, 2025

I plan on checking out RWKV and seeing if I can add my sacrifical training techniques to it this weekend. There is a reason quantization works, it is because models are very badly trained right now. I think we can get really good performance on .1b and 1b models which opens up the world to fine-tuning again. I was playing with fine-tuning llama 7b and 13b a while back but the HW/SW stack made it so unwieldy and the ROI was terrible compared to just adjusting prompts on gpt-4o-mini and the like. I have hope that we are about to see single GPU, very simple, fine-tuning again as models shrink and GPUs grow.

jmward01 · on Jan 6, 2025

I doubt anyone is still looking at this thread but I did actually start playing with RWKV by adding sacrificial training techniques to it and the results look promising, at least for early training.

daxfohl · on Jan 2, 2025

Would there be any way to distribute RAG across multiple smaller models? Rather than one giant model handling your entire document base, have it be more of a tree where the top level classifies the docs into top-level categories and sends it to submodels to subclassify, etc? (Doesn't have to be 1:1 classification). And same for q/a search?

These could all presumably be the same physical instance, just each query would use a different system prompt and perhaps different embeddings. (I'm guessing; I don't actually know how RAG works). So, a little slower and clunkier, but presumably way more efficient. And match could be anywhere between horrible to better-than-one-large-model. This would be more like how businesses organize docs.

Or maybe there's no real benefit to this, and each subclassifier would require just as big of a model as if you were to throw all docs into a single model anyway. I assume it's probably been tried before.

groby_b · on Jan 2, 2025

There's just been a twitter post by Omar Khattab (@lateinteraction) on encoding documents into a scoring function instead of a simple vector for the work on ColBERT - and maybe at some point using a DNN as scoring function.

So, yes, maybe there's a way to "distribute" RAG. (I still wonder if that isn't just MoE taken to its logical conclusion)

So, dig for ColBERT papers, might be helpful. (I wish I had the time to do that)

ankit219 · on Jan 2, 2025

Short answer: Yes, there are ways it can be done. Multiple. Needs to be custom built though, given no one has explored it deeply yet.

One simple way is what Omar Khattab (ColBert) mentioned about scoring function instead of a simple vector.

Another is to use a classifier at the start directing queries to the right model. You will have to train the classifier though. (I mean a language model kind of does this implicitly, you are just taking more control by making it explicit.)

Another is how you index your docs. Today, most RAG approaches do not encode enough information. If you have defined domains/models already, you can encode the same in metadata for your docs at the time of indexing, and you pick the model based on the metadata.

These approaches would work pretty well, given a model as small as 100M size can regurgitate what is in your docs. And is faster compared to your larger models.

Benefit wise, I don't see a lot of benefit except preserving privacy and gaining more control.

daxfohl · on Jan 2, 2025

I was originally thinking about it as like a bazel plugin for large codebases. Each module would have its own LLM context, and it might make it easier to put whole modules into the context, plus summaries of the dependencies. That could work better than a single huge context attempting to summarize the whole monorepo.

The general idea is probably be better for the code use case too, since having the module's whole codebase in context likely allows for more precise edits. Whereas RAG is just search, not edit.

That said, probably code assistants must somewhat do this already, though it must be more ad-hoc. Obviously they wouldn't be able to do any completions if they don't have detailed context of the adjacent code.

Pamar · on Jan 3, 2025

Another is how you index your docs. Today, most RAG approaches do not encode enough information....

Could you please provide some more info (or maybe links) about this, please?

ankit219 · on Jan 3, 2025

I don't have links unfortunately.

What I meant was that at the time of indexing, you can add more information to any chunk. This[1] is a simple example by Anthropic where they add more relevant context. In our case, say you have two models, D1 and D2. At the time of creating a vector store, you can add which model is more suitable to a chunk, so that when you retrieve it, you use the same model for inference. This is custom built, very dependent on datasets, but would get you to the functionality described. I suggest this approach when there are linkages between various docs (eg: financial statements/earning calls etc.).

[1]: https://www.anthropic.com/news/contextual-retrieval

Pamar · on Jan 4, 2025

Thanks... I also have another lingering doubt about the ability of RAG to make sense of "history", i.e. how to make sure that a more recent document on a given topic has more "weight" than older documents on the same issue.

I try to explain this a bit better here: https://pa-mar.net/Study/AiKiDo/VirtualBudoPass.html

ankit219 · on Jan 5, 2025

This is done at a reranking step. It's again custom. You have two variables - 1/ relevance (which most algos focus on) 2/ Date. Create a new score based on some combination of weights for relevance and date. Eg; Could be 50% of date. If the document has 70% relevance, but was published yesterday, it's overall score would be 85%. (A conceptual idea). This is similar to how you do weighted sorting anywhere.

Pamar · on Jan 7, 2025

Thank you!

Btw, I notice only now that the link that was supposed to explain my question better is completely wrong.

https://news.ycombinator.com/item?id=42589014

(But you still provided much needed clarification).

ankit219 · on Jan 7, 2025

No worries. If you need any specific help with the use case, please feel free to reach out on my email. ankit at clioapp.ai

antman · on Jan 3, 2025

I think he might be saying, have metadata in your vector retrieval that describe the domain of the retrieved chunk and use that as a decision on which model to use downstream. Sounds like very interesting improvement of RAG

refulgentis · on Jan 2, 2025

TL;DR: It's a very interesting line of thought that as late as Q2 2024, there were a couple thought leaders who pushed the idea we'd have, like 16 specialized local models.

I could see that in the very long term, but as it stands, it works the way you intuited: 2 turkeys don't make an eagle, i.e. there's some critical size where its speaking coherently, and its at least an OOM bigger than it needs to be in order to be interesting for products

fwiw RAG for me in this case is: - user asks q.

- llm generates search queries.

- search api returns urls.

- web view downloads urls.

- app turns html to text.

- local embedding model turns text into chunks.

- app decides, based on "character" limit configured by user, how many chunks to send.

- LLM gets all the chunks, instructions + original question, and answers.

It's incredibly interesting how many models fail this simple test, there's been multiple Google releases in the last year that just couldn't handle it.

- Some of it is basic too small to be coherent, bigcos don't make that mistake though.

- There's another critical threshold where the model doesn't wander off doing the traditional LLM task of completing rather than answering. What I mean is, throwing in 6 pages worth of retrieved webpages will cause some models to just start rambling like its writing more web pages, i.e. they're not able to "identify the context" of the web page snippets, and they ignore the instructions.

wolfgangK · on Jan 2, 2025

«Unfortunately, I have only seen 3 models, 3B or over, handle RAG.»

I would love to know which are these 3 models, especially if they can perform grounded RAG. If you have models (and their grounded RAG prompt formats) to share, I'm very interested !

Thx.

raegis · on Jan 2, 2025

> Unfortunately, I have only seen 3 models, 3B or over, handle RAG.

What's the unit "B" in "3B"? I can search for acronyms like "RAG" just fine, but you experts aren't making it easy for us beginners :)

Edit: Apologies, this is obvious. My brain needed a reboot for the new year.

cauliflower2718 · on Jan 2, 2025

You can ask an LLM exactly this question and it will tell you.

(The answer is billions of parameters)

SketchySeaBeast · on Jan 2, 2025

But what if they want to know they are finding the correct answer?

cauliflower2718 · on Jan 3, 2025

I think basic definitions for LLMs are solidly within the bounds of what we would expect e.g. chatgpt to be competent at. The task (defining terms) is simple and the specific content (basic LLM stuff) is easy to check by anyone who works on the LLM.

I agree with the general sentiment that we should not just blindly trust LLMs though.

elliotto · on Jan 2, 2025

Asking anonymous people on a forum would be much better.

SketchySeaBeast · on Jan 2, 2025

At least a forum with domain-specific knowledge.

gpm · on Jan 2, 2025

And people to go "no, that's wrong" if someone posts something that's wrong.

greesil · on Jan 3, 2025

No, any answer will do.

jasonjmcghee · on Jan 2, 2025

tbf, the gp comment said 125MB and then 3B, which would be pretty confusing, as it's a typo and should be 125M.

jmward01 · on Jan 2, 2025

(B)illion. It indicates the rough number of parameters in the model. Higher is generally more capable. 1B models are currently at the top end of 'easy' to deal with for playing around fine tuning and the like for most home lab setups.

jwineinger · on Jan 2, 2025

The number of parameters the model is trained on, in billions

a1o · on Jan 2, 2025

What is tiny and what is big?

Can I have a model that is like 100MB in weights and run with llama.cpp in my MacBook M2?

refulgentis · on Jan 2, 2025

Yeah, absolutely -- you'll probably pull 100+ token/s.

Here's a good range of model sizes that run just fine with llama.cpp on mac: https://huggingface.co/telosnex/fllama/tree/main.

I recommend trying the Telosnex* app, it uses llama.cpp and abstracts over LLMs so you can i.e. switch between local/servers at will.

The important part for you is its free, accelerated on macOS, and very easy to use local LLMs with (Settings > AI > LLM > On Device, tap Get)

Prepare to be underwhelmed, slightly: its only when you start hitting 3B that its coherent, anything under that will feel more like a markov chain than an LLM.

Depending on how geeked out you'll be to have it running locally, you might have fun with that Telosnex can run local models on every platform, i.e. you can run local models on iOS/Android/web too.

* because it's mine :3 It is quietly released currently. I want to get one more major update before widely announcing it in Jan 2025

a1o · on Jan 2, 2025

I have no interest in that, I would like small models that I can integrate and run offline in software that I make it myself be IDEs or games. CLion has a nice predictive model for single line C++ completion that has 400 MBs.

refulgentis · on Jan 2, 2025

Ah, totally possible, but wrapping llama.cpp will likely take a week to spike out and a month to stabilize across models.

The biggest problem for relying on it for local software is there's just too much latency for ex. game use cases currently. (among other UX bugaboos) (https://news.ycombinator.com/item?id=42561095)

qskousen · on Jan 2, 2025

Sorry to side track, but question about Telosnex - would you consider a Linux release with something other than Snap? Maybe Flatpak or appimage?

refulgentis · on Jan 2, 2025

If its a (mostly) CI-able process, I'm totally open to it ---

I looked into "What should I do besides Snap?" about 4 months ago; got quickly overwhelmed, because I don't have enough knowledge to understand what's fringe vs. common.

I'll definitely take a look at Flatpak again in the next month, 30 second Google says its possible (h/t /u/ damiano-ferrari at https://www.reddit.com/r/FlutterDev/comments/z35gdo/can_you_...)

(thanks for your interest btw, been working on this for ~year and this is my first outside feature request :) may there be many more)

jki275 · on Jan 2, 2025

LM Studio on Mac is your friend. You can choose any model you want, run a server for other tools, or chat direct with the model. It can use either MLX or just plain llama.cpp.