Hacker News

Nice idea, but this approach does not handle out-of-vocabulary words well, which is one major motivation for using vector-based search. It might not perform significantly better than lexical matching like tf-idf or BM25, while also being slower because of its linear complexity. But cool regardless.
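For context on the lexical baseline, a minimal BM25 scorer fits in a few lines. This is an illustrative sketch, not OP's code; k1=1.5 and b=0.75 are just the commonly used defaults:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each doc (a list of tokens) against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: in how many docs each term appears
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue  # absent (or out-of-vocabulary) terms contribute nothing
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["simple", "search", "engine"], ["vector", "search"], ["cat"]]
print(bm25_scores(["search", "engine"], docs))
```

Note that scoring every document per query is also linear in the corpus size; real engines avoid that with an inverted index.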


It is supposed to be a simple search engine. Keyword: simple.

As long as it does what it is meant to as a simple search engine, it seems fine.


Using tf-idf or BM25 would actually be simpler than a vector search.

I understand this is just for fun, just wanted to point that out.


TF/IDF does not support out-of-vocabulary keywords as far as I know.


Or since OP has both the cosine similarity matching and naive matching, a heuristic combination of the two since they address each other's weaknesses.
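A hedged sketch of such a combination, assuming the cosine similarity is already normalized to [0, 1] and using a simple exact-keyword overlap as the lexical score. The function names and the 0.7 weight are illustrative, not from OP's code:

```python
def keyword_overlap(query_terms, doc_terms):
    """Fraction of query terms that appear verbatim in the document."""
    if not query_terms:
        return 0.0
    doc_set = set(doc_terms)
    return sum(t in doc_set for t in query_terms) / len(query_terms)

def hybrid_score(cosine_sim, query_terms, doc_terms, alpha=0.7):
    """Blend semantic and lexical scores.

    The lexical component rescues out-of-vocabulary terms the
    embedding never saw; the vector component rescues synonyms
    the exact match misses.
    """
    return alpha * cosine_sim + (1 - alpha) * keyword_overlap(query_terms, doc_terms)

# "automobile" misses the exact match but the embedding should score it high.
print(hybrid_score(0.9, ["fast", "car"], ["fast", "automobile"]))
```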


Vector-based approaches either don't handle OOV terms at all or handle them poorly, depending on the implementation. If you limit yourself to alphanumeric trigrams, for example, you can technically cover all terms, but how well depends on the training data.
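For reference, the trigram idea looks roughly like this: any string, including an OOV word, decomposes into character trigrams that can be hashed into a fixed feature space (fastText-style padding; the bucket count is arbitrary, and this is a sketch, not any particular library's implementation):

```python
def char_trigrams(word):
    """Pad a word with boundary markers and split it into character trigrams."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_vector(word, buckets=1024):
    """Hash trigrams into a fixed-size count vector.

    No word is ever OOV this way, but the quality of the resulting
    representation still depends on what the model saw in training.
    """
    vec = [0] * buckets
    for tri in char_trigrams(word):
        vec[hash(tri) % buckets] += 1
    return vec

print(char_trigrams("query"))  # ['<qu', 'que', 'uer', 'ery', 'ry>']
```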


How would you handle those in word2vec?

And isn't a big advantage that synonyms are handled correctly? This implementation still has that advantage.



