Show HN: RAGatouille, a simple lib to use&train top retrieval models in RAG apps (github.com/bclavie)
15 points by bclavie on Jan 4, 2024 | 5 comments
Hey HN!

If you’re at all interested in LLMs/LLM apps, you’ve probably heard of RAG: Retrieval-Augmented Generation, i.e. retrieving relevant documents to give to your LLM as context to answer user queries.

Today, I’m releasing RAGatouille v0.0.1, whose aim is to make it as easy as can be to improve your RAG pipelines by leveraging state-of-the-art Information Retrieval research.

As of right now, there’s quite a big gap between common everyday practice and the IR literature, and a lot of that gap exists because there just aren’t good ways to quickly try out and leverage SotA IR techniques. RAGatouille aims to help close it! We do have a bit of a roadmap to support more IR papers, like UDAPDR [1], but for now, we focus on integrating ColBERT [2]/ColBERTv2 [3]: super strong retrieval models which are particularly good at generalising to new data (i.e. your dataset!)

RAGatouille can train and fine-tune ColBERT models, index documents, and search those indexes, all in just a few lines of code (a rough sketch of the core API is below). We also include an example in the repo on how to use GPT-4 to create fine-tuning data when you don’t have any annotated user queries, which works really well in practice.
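
To give a flavour, here’s roughly what indexing and searching look like. This is a sketch based on the repo’s README at the time of writing, so exact argument names may differ:

    # Minimal sketch based on RAGatouille's README; check the repo for exact signatures.
    from ragatouille import RAGPretrainedModel

    # Load a pretrained ColBERTv2 checkpoint from the Hugging Face Hub.
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

    # Build an on-disk index over your documents.
    RAG.index(
        collection=["The first document's full text...", "The second one..."],
        index_name="my_index",
    )

    # Retrieve the 3 best-matching documents for a query.
    results = RAG.search(query="What does the first document say?", k=3)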

Feel free to also check out the thread and discussion on Twitter/X[4] if you're interested!

I hope some of you find this useful. Please feel free to reach out and report any bugs: this is essentially a beta release, and any feedback would be much appreciated.

[1] https://arxiv.org/abs/2303.00807
[2] https://arxiv.org/abs/2004.12832
[3] https://arxiv.org/abs/2112.01488
[4] https://twitter.com/bclavie/status/1742950315278672040



Longer Background/Explanation:

I’ve been working on RAG problems for quite a while now, and it’s very apparent that solving real-life problems with RAG is very, very different from what the basic tutorials out there cover.

There are a million moving parts, but a huge one is obviously the model you use to retrieve the data. The most common approach relies on just using dense embeddings (like OpenAI’s embedding models), and getting the documents whose embedding vectors are closest to the query’s own embedding.

The problem is that in practice, it’s a bit of a Sisyphean task: you’re asking a model to compress a document into a tiny vector. Then, it must also be able to encode a very differently worded query into another tiny vector that looks similar to the first one. And it must do all this in a way that can represent any specific aspect of the document that could be requested.
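
To make that concrete, single-vector dense retrieval boils down to something like the toy numpy sketch below (the embed function is a stand-in for a real embedding model, not any particular library’s API):

    # Toy sketch of single-vector dense retrieval; embed() stands in for a real model.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Placeholder: a real model maps the whole text to one ~768-dim vector.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(768)
        return v / np.linalg.norm(v)  # normalise so dot product == cosine similarity

    docs = ["first document...", "second document..."]
    doc_vecs = np.stack([embed(d) for d in docs])  # each whole doc -> one tiny vector

    query_vec = embed("a very differently worded query")
    scores = doc_vecs @ query_vec  # one similarity score per document
    ranking = np.argsort(-scores)  # highest-scoring documents first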

The result is that dense embeddings require tons of data to train (billions of pretraining pairs), are relatively hard to fine-tune (there’s a hard-to-strike balance to find), and have been shown many times in the Information Retrieval (IR) literature to generalise worse outside of known benchmarks. This doesn’t mean they’re not a very useful tool, but there might be more suitable tools for retrieving your data.

In the IR literature again, late-interaction models like ColBERT, and “sparse embedding” approaches like SparseEmbed, are clear winners. They train quickly, need less data, fine-tune relatively easily, and generalise very well (their zero-shot performance is never far behind fine-tuned performance!)

This is because these models don’t encode full documents: they create bags-of-embeddings! It’s a twist on the old-timey keyword-based retrieval, except instead of hardcoded keywords, we now use contextualised semantic keywords. The models capture the meaning of all the “small units of content” within their context.

From there, a document is represented as the sum of its parts. At retrieval time, all you need to do is match your query’s “semantic keywords” to the ones in your documents. It’s much easier for the model to learn representations for these tiny units, and much easier to match them.

So what’s the catch? Why is this not everywhere? Because IR is not quite NLP: it hasn’t gone fully mainstream, and a lot of the IR frameworks are, quite frankly, a bit of a pain to work with in production. Some solid efforts to bridge the gap, like Vespa [1], are gathering steam, but it’s not quite there yet.

[1] https://vespa.ai
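
For intuition, the late-interaction matching described above is ColBERT’s MaxSim operation: each query token embedding picks its best-matching document token embedding, and those best scores are summed. A toy numpy sketch, not the actual implementation:

    # Toy numpy sketch of ColBERT-style MaxSim scoring; not the real implementation.
    import numpy as np

    def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
        # query_embs: (num_query_tokens, dim); doc_embs: (num_doc_tokens, dim).
        # Each row is one contextualised token embedding (a "semantic keyword").
        sim = query_embs @ doc_embs.T  # all query tokens vs. all doc tokens
        best_per_query_token = sim.max(axis=1)  # each query token keeps its best match
        return float(best_per_query_token.sum())  # doc score = sum over query tokens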


I looked at this on Twitter and will try it out using the integration with LlamaIndex! The idea of late interaction sounds like an improvement either on top of, or in place of, a vector db approach. I was looking at the BERT family in general, so while ColBERT is a great implementation, it would be interesting to have this same "everything in the one gumbo pot" type library for working with robert(a)/tinybert/qbert/distilbert/...

Great project, and I hope the post gets on the HN second chance loop!


ColBERT: Contextualized Late Interaction over BERT. That's just the name, though: the approach can be fine-tuned for retrieval using any encoder-only model like the ones you mention.


Yes, thanks! Anyone looking to swap out ColBERTv2 should look at example 2 ("for any BERT/RoBERTa-like model") and then test against your use cases to your heart's desire, or at least until you run out of time to optimize lol!
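
For reference, swapping in a different base encoder when fine-tuning looks something like the sketch below. The RAGTrainer arguments here are assumptions based on the repo's examples, so double-check against the README:

    # Hypothetical sketch; RAGTrainer argument names are assumptions from the repo's examples.
    from ragatouille import RAGTrainer

    trainer = RAGTrainer(
        model_name="MyDistilColBERT",                     # name for the model you're training
        pretrained_model_name="distilbert-base-uncased",  # any BERT/RoBERTa-like encoder
    )

    # Pairs of (query, relevant passage); see the repo for how negatives are handled.
    trainer.prepare_training_data(
        raw_data=[("what is late interaction?", "Late interaction models score...")],
    )
    trainer.train()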


I'll admit even I sometimes wish ColBERT was more user-friendly. I'll probably start using ColBERT through RAGatouille now.



