Does this relate to http://elasticlunr.com/? mdBook uses it, and I was wondering if there's a similar library for Python with index compatibility so I can provide my own search from those indexes.
If you want to build search in Python you can get a very long way using the full text search engine built into SQLite, which is available in the Python standard library.
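For anyone who wants to try that, here is a minimal sketch using the standard-library `sqlite3` module. It assumes the SQLite build bundled with your Python has the FTS5 extension compiled in (common, but not guaranteed); the `pages` table and the `build_index` and `search` helpers are made-up names for illustration.

```python
import sqlite3


def build_index(docs):
    """Create an in-memory FTS5 table and load (title, body) pairs into it."""
    conn = sqlite3.connect(":memory:")
    # Raises sqlite3.OperationalError if this SQLite build lacks FTS5.
    conn.execute("CREATE VIRTUAL TABLE pages USING fts5(title, body)")
    conn.executemany("INSERT INTO pages (title, body) VALUES (?, ?)", docs)
    return conn


def search(conn, query, limit=10):
    """Return titles of matching rows, best match first."""
    rows = conn.execute(
        # `rank` is FTS5's built-in relevance ordering column.
        "SELECT title FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [title for (title,) in rows]


conn = build_index([
    ("Install", "how to install mdbook"),
    ("Search", "full text search with sqlite"),
])
print(search(conn, "sqlite"))
```

Note that the query string is parsed with FTS5's own query syntax, so raw user input containing quotes or operators may need escaping first.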
Thanks, but I didn't explain it well: I don't want to build search in general. I have a static site using mdBook, and I want to use the mdBook indexes from Python to generate search results for that site.
You might want to consider mapping an index to an ETS table-based data structure instead of an immutable object managed by a GenServer. It will give you a way to share it between processes without having to awkwardly copy a potentially huge data structure all over the place.
This makes sense, and I think you've taken the correct route. I look forward to trying this in one of my projects and comparing it to my current Postgres-only search strategy. For my use case, losing the index between restarts isn't a deal breaker, so hopefully I'll have some useful feedback.
I don't understand how this works. Is data read from ETS somehow shared more efficiently than data shared via a regular message? (which iirc is always copied)
It's still copied, but if you are using an ETS table you're likely only copying a small subset of the data per query instead of schlepping the whole index every time.
Have you experimented with stemming for the full text search?
I've built (much simpler) query DSLs in Node.js and Go. Adding stemming and also boosting rare words (TF-IDF stands for “Term Frequency-Inverse Document Frequency”) helped my use case of recalling favorite tweets.
I see that the original JS version has TF-IDF, so perhaps your port does as well.
Yes. This library already includes stemming and TF-IDF. Everything provided by the JS version is included in this library, with improvements made where applicable.
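To make the "boosting rare words" part concrete, here is a rough Python sketch of TF-IDF scoring. It is not how Elasticlunr or the JS version scores documents; the whitespace tokenizer, the smoothed `log(1 + N/df)` weighting, and the function names are my own assumptions, and stemming is left out entirely.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace (no stemming)."""
    return text.lower().split()


def tf_idf_scores(docs, query):
    """Score each document against the query with a simple TF-IDF sum."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Rare terms (low df) get a larger weight than common ones.
                idf = math.log(1 + n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


docs = ["the cat sat", "the dog sat", "the rare axolotl"]
scores = tf_idf_scores(docs, "the axolotl")
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Every document contains "the", so it contributes the same small amount to each score, while "axolotl" appears in only one document and pushes that one to the top.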
I just published an S3 storage provider for Elasticlunr. You can now store your indexes in an S3 bucket, in addition to the disk storage provider included in the base project.
The storage API is flexible, so writing to any storage provider (Google Cloud Storage, a database, and so on) shouldn't be a problem. It's just a matter of grabbing the right provider or implementing one yourself.
Nice work! For certain storage media, e.g. S3, it might be useful to have some sort of delta-based updates where you can enqueue deltas that accrue over time. It might also be interesting to solicit volunteers to help implement distributed in-memory or disk persistence.
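Roughly what I have in mind, as a Python sketch rather than anything tied to the library's actual storage API: changes are queued as small deltas and only folded into the full snapshot once enough have accrued. The `DeltaLog` class, the `put`/`delete` operation names, and the dict-based snapshot are all hypothetical, and the actual upload to S3 is left out.

```python
class DeltaLog:
    """Hypothetical buffer of index changes that are applied in batches."""

    def __init__(self, flush_threshold=3):
        self.pending = []
        self.flush_threshold = flush_threshold

    def enqueue(self, op, doc_id, doc=None):
        """Queue a change; return True once enough deltas have accrued."""
        self.pending.append((op, doc_id, doc))
        return len(self.pending) >= self.flush_threshold

    def flush(self, snapshot):
        """Apply queued deltas in order and return a new snapshot dict."""
        merged = dict(snapshot)
        for op, doc_id, doc in self.pending:
            if op == "put":
                merged[doc_id] = doc
            elif op == "delete":
                merged.pop(doc_id, None)
        self.pending = []
        return merged


log = DeltaLog(flush_threshold=2)
snapshot = {2: "old document"}
log.enqueue("put", 1, "new document")
if log.enqueue("delete", 2):
    # This is the point where the merged snapshot would be written out once.
    snapshot = log.flush(snapshot)
print(snapshot)
```

The point is that the expensive write of the whole index happens once per batch instead of once per change.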