Elasticlunr, a full-text search library for Elixir

stavros · on Jan 9, 2022

Does this relate to http://elasticlunr.com/? mdBook uses the latter, and I was wondering if there's a similar library for Python with index compatibility so I can provide my own search from them.

simonw · on Jan 9, 2022

If you want to build search in Python you can get a very long way using the full text search engine built into SQLite, which is available in the Python standard library.

I have some tools for using that here: https://sqlite-utils.datasette.io/en/stable/python-api.html#...
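
For reference, a minimal sketch of that approach using only the standard library `sqlite3` module. It assumes the SQLite build bundled with your Python has the FTS5 extension compiled in (common, but not guaranteed); the table name, column names, and sample documents are made up for illustration.

```python
import sqlite3

def build_index(docs):
    """Create an in-memory FTS5 table and load (title, body) pairs into it."""
    conn = sqlite3.connect(":memory:")
    # Assumption: FTS5 is available in this SQLite build.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    conn.executemany("INSERT INTO docs (title, body) VALUES (?, ?)", docs)
    return conn

def search(conn, query):
    """Return the titles of matching rows, ordered by FTS5's built-in rank."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
    )
    return [title for (title,) in rows]

# Hypothetical sample documents.
conn = build_index([
    ("Intro", "mdBook builds a static site from Markdown files"),
    ("Search", "full text search over a static site index"),
    ("Other", "nothing relevant here"),
])
```

Calling `search(conn, "static")` matches the first two documents and skips the third. This builds its own index; it does not read an elasticlunr or mdBook index.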

stavros · on Jan 9, 2022

Thanks, but I didn't explain it well: I don't want to build search in general. I have a static site using mdBook, and I want to use the mdBook indexes from Python to generate search results for that site.

heywhy · on Jan 9, 2022

Yes, it is a port of that library with some improvements.

dnautics · on Jan 9, 2022

You might want to consider mapping an index to an ETS table-based data structure instead of an immutable object managed by a GenServer. It will give you a way to share it between processes without having to awkwardly copy a potentially huge data structure all over the place.

heywhy · on Jan 9, 2022

I do have thoughts about performance too but I was following the "get it working then make improvements" route :). Thank you for the suggestions.

kuzee · on Jan 9, 2022

This makes sense, and I think you've taken the correct route. I look forward to trying this in one of my projects and comparing it to my current Postgres-only search strategy. For my use case, losing the index between restarts isn't a deal breaker, so hopefully I'll have some useful feedback.

heywhy · on Jan 9, 2022

That's great. I will be looking forward to this.

dnautics · on Jan 9, 2022

Love it. You're doing exactly the right thing.

skrebbel · on Jan 9, 2022

I don't understand how this works. Is data read from ETS somehow shared more efficiently than data shared via a regular message? (which iirc is always copied)

dnautics · on Jan 9, 2022

It's still copied but if you are using an ets table you're likely only copying a small subset of the data per query instead of schlepping the whole index every time.

eproxus · on Jan 9, 2022

It’s still copied, but a process can quickly become a bottleneck in parallel code (every request to a process is sequential).

An ETS table can be concurrently read (and tweaked even further for that use case if desired).

heywhy · on Jan 9, 2022

Like eproxus mentioned, it's still being shared through normal process messaging, but improvements will be made regarding this.

linkdd · on Jan 9, 2022

I'd even suggest offering Mnesia as an option, for disc copies.

dnautics · on Jan 9, 2022

Mnesia had consistency issues that could crop up and were very difficult to debug. Have these been fixed?

anotherjesse · on Jan 9, 2022

I'm excited to look into this deeper.

Have you experimented with stemming for the full text search?

I've built (much simpler) query DSLs in Node.js & Go. Adding stemming and also boosting rare words with TF-IDF ("Term Frequency-Inverse Document Frequency") helped my use case of recalling favorite tweets.

I see that the original JS version has TF-IDF, so perhaps your port does as well.
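
To make "boosting rare words" concrete, here is a minimal sketch of textbook TF-IDF. It assumes whitespace tokenization, no stemming, and the plain `tf * log(N / df)` weighting; elasticlunr's actual scoring formula may differ, and the function name is made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # Term frequency is normalized by document length.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical sample documents.
docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
```

A term that appears in every document ("the") scores zero, while a term unique to one document ("dog") outscores one shared by two ("cat").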

heywhy · on Jan 9, 2022

Yes. This library already includes stemming and TF-IDF. Everything provided by the JS version is included in this library, and improvements are made where applicable.

heywhy · on Jan 11, 2022

I just published an S3 storage provider for Elasticlunr. You can now store your indexes in an S3 bucket, in addition to the disk storage provider included in the base project.

The storage API is flexible, so writing to any storage provider (Google Cloud Storage, a DB and so on) shouldn't be a problem. It's just a matter of grabbing the right provider or implementing one yourself.

https://github.com/heywhy/ex_elasticlunr_s3

dnautics · on Jan 9, 2022

Does anyone know how this stores the indices?

heywhy · on Jan 9, 2022

Hello. I'm the author of the library. You should use the IndexManager (https://github.com/heywhy/ex_elasticlunr/blob/master/lib/ela...) to store your index after making changes to it, but note that the indexes will be lost on application shutdown.

I'm currently working on a configurable storage mechanism so that you can use the storage provider of your choice. See https://github.com/heywhy/ex_elasticlunr/pull/9

dnautics · on Jan 9, 2022

Nice work! For certain storage media, e.g. S3 it might be useful to have some sort of delta-based updates where you can enqueue deltas that accrue over time. It might also be interesting to solicit volunteers to help implement distributed in-memory or disk persistence.
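
A rough sketch of the delta idea, assuming the stored index is a snapshot keyed by document id and each delta is a small `(op, doc_id, doc)` record appended to a queue. All names here are hypothetical; this is not the library's storage API.

```python
def apply_deltas(snapshot, deltas):
    """Replay queued deltas, oldest first, over a copy of the snapshot."""
    index = dict(snapshot)  # leave the stored snapshot untouched
    for op, doc_id, doc in deltas:
        if op == "put":
            index[doc_id] = doc
        elif op == "delete":
            index.pop(doc_id, None)
        else:
            raise ValueError(f"unknown delta op: {op!r}")
    return index

# Hypothetical snapshot and queued deltas.
snapshot = {1: "first doc", 2: "second doc"}
deltas = [
    ("put", 3, "third doc"),
    ("delete", 1, None),
    ("put", 2, "second doc, edited"),
]
current = apply_deltas(snapshot, deltas)
```

With this shape, only the small delta records need to be uploaded on each change, and the full snapshot can be rewritten occasionally to keep the queue short.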

heywhy · on Jan 9, 2022

Thank you for the suggestions. I have the same direction in mind for the library. I don't mind if you recommend volunteers.

And don't forget to share the project with friends and colleagues who might be interested in contributing.

dnautics · on Jan 9, 2022

Post it on elixirforum (elixirforum.com).... Did a quick search and couldn't find it there.

rockwotj · on Jan 9, 2022

From a glance at the source, it looks like it's in-memory only. See: https://github.com/heywhy/ex_elasticlunr/blob/master/lib/ela...

Note: I'm not familiar with Elixir, so there may be some magic I'm missing.

afegbua · on Jan 11, 2022

Awesome