Draft: use derived fields to implement per-ingredient recipe scoring #121
Describe the reason for these changes and the problem that they solve
During recipe search, we currently calculate matching scores for each ingredient against each recipe. The existing implementation sums these scores, each represented by a `constant_score` within an independent power-of-ten numeric range, and then infers the number of exact and inexact matches by, essentially, inspecting the digits of that single floating-point number.

This functions as expected, but it breaks when a user enters more than 38 ingredients: at that point the floating-point value overflows.
This changeset implements a different approach: when a user query is performed, the query dynamically constructs a derived field (a field that doesn't exist in the indexed recipe documents) containing a list of boolean values with the same length as the list of query ingredients. For each recipe, each of the boolean values may be `null` (no match for that ingredient), `false` (matched, but not exactly), or `true` (matched exactly). Inexact matches occur for query terms such as `tofu` matching against a recipe that mentions `silken tofu` as an ingredient.

The derived field should provide a much more intuitive way to represent the match status of each ingredient, and it is also a convenient data structure for calculating the total exact-match and inexact-match counts, which are the features needed to `sort` (rank) the recipe results.

Unfortunately, scoring and sorting using derived fields isn't supported in OpenSearch yet, but it may be soon.
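
To illustrate how the per-ingredient list maps to ranking features, here is a minimal client-side sketch in Python. It is not the OpenSearch query itself. The function names, the sample recipes, and the ordering (exact-match count first, then inexact-match count) are assumptions made for the example; only the `_found` field name and the `null`/`false`/`true` meanings come from this changeset.

```python
from typing import List, Optional, Tuple

# One entry per query ingredient:
# None = no match, False = inexact match, True = exact match.
MatchStatus = Optional[bool]


def match_counts(found: List[MatchStatus]) -> Tuple[int, int]:
    """Return (exact_count, inexact_count) for one recipe's match list."""
    exact = sum(1 for status in found if status is True)
    inexact = sum(1 for status in found if status is False)
    return exact, inexact


def rank(recipes: List[dict]) -> List[dict]:
    """Order recipes by exact matches, then inexact matches (assumed order)."""
    return sorted(recipes, key=lambda r: match_counts(r["_found"]), reverse=True)


# Hypothetical results for the query ingredients ["tofu", "chili", "leek"].
recipes = [
    {"title": "Mapo tofu", "_found": [False, True, None]},   # silken tofu, chili
    {"title": "Leek soup", "_found": [None, None, True]},
    {"title": "Chili tofu", "_found": [True, True, None]},
]

if __name__ == "__main__":
    for recipe in rank(recipes):
        print(recipe["title"], match_counts(recipe["_found"]))
```

Because the list length tracks the number of query ingredients, the counts stay exact however many ingredients are entered, unlike digits packed into a single float.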
Briefly summarize the changes
Construct a derived `_found` field at query time, to replace the existing implementation that multiplexes power-of-ten scores into the floating-point `_score` value.

How have the changes been tested?
List any issues that this change relates to
Resolves #114
Relates to opensearch-project/OpenSearch#12281