
I wrote this blog :). Good to see it still getting use.

FYI for folks just skimming this: shards can affect scoring, but they don't have to. Two mitigations:

1. The default in Elasticsearch has been 1 shard per index for a while, and many people (if not most) probably don't need more than 1 shard.

2. You can do a dfs_query_then_fetch query, which adds a small amount of latency but solves this problem.
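To make the shard effect concrete, here is a rough sketch of why per-shard statistics change scores. It uses the IDF formula from Lucene's BM25 similarity; the shard sizes and document frequencies are made-up numbers for illustration, not measurements from a real cluster.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """IDF as computed by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical statistics for one term on two shards:
# (documents in the shard, documents containing the term)
shard_stats = [(1000, 10), (1000, 200)]

# Default query_then_fetch: each shard scores with its own local statistics,
# so the same term gets a different weight depending on where a doc lives.
local_idfs = [bm25_idf(n, df) for n, df in shard_stats]

# dfs_query_then_fetch: term statistics are gathered from all shards first,
# so every shard scores with the same global value.
total_docs = sum(n for n, _ in shard_stats)
total_df = sum(df for _, df in shard_stats)
global_idf = bm25_idf(total_docs, total_df)

print("per-shard IDF:", [round(x, 3) for x in local_idfs])
print("global IDF:   ", round(global_idf, 3))
```

On a real cluster you would request this behavior with the search_type=dfs_query_then_fetch parameter on the search request; the extra round trip to collect statistics is where the added latency comes from.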

The fundamental tenet here is accurate: any time you want to break up term statistics (e.g. if you want each user to experience different relevance results for their own term stats), then yes, you need a separate index for that. I'd say that's not all that common in practice, though.

A more common problem that warrants splitting indices is when you have mixed-language content: the term "LA" in Spanish content adds very little information, while it adds a reasonable amount of information in most English-language documents (where it can mean Los Angeles). If you mix the two together, it can pollute term statistics for both. Considering how your segments of users and data will affect scoring is absolutely important as a general rule though, and part of why I'm super excited to be working on neural retrieval now.
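A quick back-of-the-envelope sketch of the "LA" case, again using the BM25 IDF formula from Lucene. The document counts are invented for illustration: assume "la" appears in almost every Spanish document and in only a handful of English ones.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """IDF as computed by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical counts: (documents in the index, documents containing "la")
spanish = (1000, 950)  # "la" is an article, so it is nearly everywhere
english = (1000, 20)   # "LA" is rare and usually means Los Angeles

# Separate indices: each language keeps its own term statistics.
idf_spanish = bm25_idf(*spanish)
idf_english = bm25_idf(*english)

# One mixed index: the statistics are pooled across both languages.
idf_mixed = bm25_idf(spanish[0] + english[0], spanish[1] + english[1])

print("Spanish-only IDF:", round(idf_spanish, 3))
print("English-only IDF:", round(idf_english, 3))
print("Mixed-index IDF: ", round(idf_mixed, 3))
```

With these numbers the mixed index lands in between: the term is weighted too heavily for Spanish queries and too lightly for English ones, which is the pollution described above.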



Thanks for the clarifications! I've spent the last 3 weeks deep in the weeds of TF/IDF scoring and was about to give up on Elasticsearch when this got posted. The article has been eye-opening!



