IDF calculation based on filtered documents?
I am designing a multi-tenant search platform using Elasticsearch. One option is to share a single index across tenants. The problem is that documents belonging to different tenants in the same index might influence each other's scoring, because the IDF part of the score is calculated across all documents in the index. Is there a way to make ES calculate IDF over filtered documents only? For example, filter documents by tenant, so that one tenant's documents don't influence the scores of another tenant's documents in the same index.
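To make the concern concrete, here is a small sketch of how the IDF of one term changes depending on which documents are counted. It assumes the BM25-style IDF formula ln(1 + (N - n + 0.5) / (n + 0.5)); the tenants, the documents, and the helper names are made up for illustration, and no Elasticsearch calls are involved.

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """BM25-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical shared index: (tenant, text) pairs.
DOCS = [
    ("tenant_a", "invoice overdue"),
    ("tenant_a", "invoice paid"),
    ("tenant_a", "invoice draft"),
    ("tenant_b", "invoice for holiday rental"),
    ("tenant_b", "beach holiday"),
    ("tenant_b", "mountain holiday"),
    ("tenant_b", "city holiday"),
]

def idf_for(term, tenant=None):
    """IDF of `term` over all documents, or over one tenant's documents only."""
    pool = [text.split() for t, text in DOCS if tenant is None or t == tenant]
    doc_freq = sum(1 for tokens in pool if term in tokens)
    return bm25_idf(len(pool), doc_freq)

if __name__ == "__main__":
    for scope in (None, "tenant_a", "tenant_b"):
        print(scope or "whole index", round(idf_for("invoice", scope), 4))
```

In this toy data "invoice" is rare for tenant_b but appears in every tenant_a document, so the index-wide IDF sits between the two per-tenant values and is not the value either tenant would get alone.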
There is no built-in way to calculate IDF over anything other than the documents in the index (or in the shard), depending on the search type.
The options you have:
Implement a custom Similarity that calculates IDF the way you need (I'm not sure it will be efficient enough, and it also requires custom code and a customized deployment).
Route documents for tenant X to shard X and documents for tenant Y to shard Y by using the _routing field. More information about this approach: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html. Then do not use global IDF but rather local, per-shard IDF (the default parameters), and it should do the trick. The problem is that you have very limited control with this approach: you get only one shard per tenant, which makes it impossible to scale properly later.
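The routing option could look roughly like the sketch below. It only builds the REST requests (method, path, JSON body) and does not send them. The index name, the tenant_id field, and the helper functions are hypothetical; the routing and search_type query parameters are the standard ones on the index and search APIs.

```python
import json
from urllib.parse import quote, urlencode

INDEX = "shared-index"      # hypothetical index name
TENANT_FIELD = "tenant_id"  # hypothetical keyword field holding the tenant

def index_request(tenant, doc_id, doc):
    """Build a request that indexes `doc` using the tenant as the routing value."""
    body = {**doc, TENANT_FIELD: tenant}
    query = urlencode({"routing": tenant})
    path = f"/{INDEX}/_doc/{quote(doc_id, safe='')}?{query}"
    return "PUT", path, json.dumps(body)

def search_request(tenant, text_field, text):
    """Build a search that only visits the tenant's shard and uses shard-local IDF."""
    # query_then_fetch is the default search type; it is spelled out here to
    # contrast with dfs_query_then_fetch, which uses index-wide term statistics.
    query = urlencode({"routing": tenant, "search_type": "query_then_fetch"})
    body = {
        "query": {
            "bool": {
                "must": [{"match": {text_field: text}}],
                # Still filter by tenant: the shard may hold other tenants' documents.
                "filter": [{"term": {TENANT_FIELD: tenant}}],
            }
        }
    }
    return "POST", f"/{INDEX}/_search?{query}", json.dumps(body)

if __name__ == "__main__":
    print(index_request("tenant_x", "1", {"title": "invoice overdue"}))
    print(search_request("tenant_x", "title", "invoice"))
```

One caveat worth checking for your setup: Elasticsearch hashes the routing value to pick a shard, so two tenants can end up on the same shard. In that case the filter keeps the results separate, but the shard-local IDF is still computed over both tenants' documents.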