Hey HN!
If you’re at all interested in LLMs/LLM-apps, you’ve probably heard of RAG: Retrieval-Augmented Generation, i.e. retrieving relevant documents to give to your LLM as context to answer user queries.
Today, I’m releasing RAGatouille v0.0.1, whose aim is to make it as easy as can be to improve your RAG pipelines by leveraging state-of-the-art Information Retrieval research.
As of right now, there’s quite a big gap between common everyday practice and the IR literature, and a lot of that gap exists because there just aren’t good ways to quickly try out and leverage SotA IR techniques. RAGatouille aims to help close that gap! We do have a bit of a roadmap to support more IR papers, like UDAPDR [1], but for now, we focus on integrating ColBERT [2]/ColBERTv2 [3], super strong retrieval methods which are particularly good at generalising to new data (i.e. your dataset!)
RAGatouille can train and fine-tune ColBERT models, index documents and search those indexes, all in just a few lines of code. We also include an example in the repo on how to use GPT-4 to create fine-tuning data when you don’t have any annotated user queries, which works really well in practice.
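To give you an idea, loading a pretrained ColBERTv2 checkpoint, indexing your documents and querying them looks roughly like this (sketched from the current docs; exact argument names may shift a little as the API stabilises):

    from ragatouille import RAGPretrainedModel

    # Load a pretrained ColBERTv2 checkpoint
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

    # Build an index over your documents (a plain list of strings)
    RAG.index(
        collection=["ColBERT is a late-interaction retrieval model...", "..."],
        index_name="my_documents",
    )

    # Retrieve the top 3 most relevant documents for a query
    results = RAG.search(query="What is late interaction?", k=3)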
Feel free to also check out the thread and discussion on Twitter/X[4] if you're interested!
I hope some of you find this useful, and please feel free to reach out and report any bugs: this is essentially a beta release, and any feedback would be much appreciated.
[1] https://arxiv.org/abs/2303.00807
[2] https://arxiv.org/abs/2004.12832
[3] https://arxiv.org/abs/2112.01488
[4] https://twitter.com/bclavie/status/1742950315278672040
I’ve been working on RAG problems for quite a while now, and it’s very apparent that solving real-life problems with it is very, very different from what the basic tutorials cover.
There are a million moving parts, but a huge one is obviously the model you use to retrieve the data. The most common approach relies on just using dense embeddings (like OpenAI’s embedding models), and retrieving the documents whose embedding vectors are closest to the query’s own embedding.
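In code, that standard dense-retrieval setup looks something like this (the embed() function here is just a stand-in for whatever embedding model or API you’d actually use):

    import numpy as np

    documents = ["Doc about resetting passwords...", "Doc about billing...", "Doc about 2FA..."]

    # Stand-in for your embedding model (in practice, an API call or a local encoder)
    def embed(texts):
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(texts), 768))  # one dense vector per text

    doc_vectors = embed(documents)                     # shape: (n_docs, dim)
    query_vector = embed(["how do I reset my password?"])[0]

    # Cosine similarity between the query vector and every document vector,
    # then keep the documents whose vectors are closest to the query's
    scores = (doc_vectors @ query_vector) / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    top_k = np.argsort(scores)[::-1][:2]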
The problem is that in practice, it’s a bit of a Sisyphean task: you’re asking a model to compress an entire document into a tiny vector. It must then also encode a very differently worded query into another tiny vector that ends up close to the document’s. And it must do all of this in a way that can represent any specific aspect of the document that could be requested.
The result is that dense embeddings require tons of data to be trained (billions of pre-training examples), are relatively hard to fine-tune (there’s a hard-to-strike balance to find), and have been shown many times in the Information Retrieval (IR) literature to generalise worse outside of known benchmarks. This doesn’t mean they’re not a very useful tool, but there might be more suitable tools for retrieving your data.
In the IR literature again, late-interaction models and “sparse embedding” approaches, like ColBERT or SparseEmbed, are clear winners. They train quickly, need less data, fine-tune relatively easily, and generalise very well (their zero-shot performance is never far behind fine-tuned performance!)
This is because these models don’t encode full documents: they create bags-of-embeddings! It’s a twist on the old-timey keyword-based retrieval, except instead of hardcoded keywords, we now use contextualised semantic keywords. The models capture the meaning of all the “small units of content” within their context.
From there, a document is represented as the sum of its parts. At retrieval time, “all you need to do” is match your query’s “semantic keywords” to the ones in your documents. It’s much easier for the model to learn representations for these tiny units, and much easier to match them.

So what’s the catch? Why is this not everywhere? Because IR is not quite NLP: it hasn’t gone fully mainstream, and a lot of the IR frameworks are, quite frankly, a bit of a pain to work with in production. Some solid efforts to bridge the gap, like Vespa [1], are gathering steam, but it’s not quite there yet.
[1] https://vespa.ai
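For the curious, the core scoring mechanism in ColBERT (the “MaxSim” late interaction described in the papers above) boils down to something like this toy numpy version; it’s nothing like the optimised real implementation, but it shows the idea:

    import numpy as np

    # Toy "bags of embeddings": one vector per token, rather than one vector per text.
    # In ColBERT these come from a BERT-style encoder; random placeholders here.
    rng = np.random.default_rng(0)
    query_embeddings = rng.normal(size=(8, 128))     # 8 query tokens, 128-dim each
    doc_embeddings = rng.normal(size=(300, 128))     # 300 document tokens

    # Normalise so dot products are cosine similarities
    query_embeddings /= np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

    # Late interaction (MaxSim): each query token is matched to its best document token,
    # and the document's relevance score is the sum of those best matches.
    similarity_matrix = query_embeddings @ doc_embeddings.T   # shape: (8, 300)
    score = similarity_matrix.max(axis=1).sum()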