The MLLM lexical backend (as well as STWFSA) tries to match subject labels to document text, but both are quite strict in the matching. I think it could help in some cases to be able to perform fuzzy matching as well, for example matching subject labels even if there are small differences in spelling (e.g. color vs colour, or Chehov vs Chekhov).
This could either be its own backend (maybe called "flm", for fuzzy lexical matching?), or perhaps just an option in the MLLM backend that lets the user choose between traditional crisp matching and fuzzy matching. When finding fuzzy matches, the match similarity could be included as one of the features used for candidate selection.
One question is how to efficiently implement the matching. There are libraries like TheFuzz (formerly known as FuzzyWuzzy) and fuzzysearch which could perhaps be used. The most promising one I found is RapidFuzz, which seems to be in active development (in fact extremely active), promises to be very fast, and is MIT licensed. This could be an ideal library for the purpose. However, it relies on C++ code, so we would have to consider making this an optional feature instead of a core dependency.
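To make the idea a bit more concrete, here is a rough sketch of what a selectable crisp/fuzzy matcher might look like. It is only an illustration: it uses `difflib` from the standard library as a stand-in for a faster library such as RapidFuzz, it only handles single-word labels, and the names `similarity`, `match_labels`, `method` and `threshold` (as well as the 0.8 default) are made up for this example and are not part of the MLLM backend.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_labels(tokens, labels, method="fuzzy", threshold=0.8):
    """Match single-word subject labels against document tokens.

    method="crisp" accepts only exact (case-insensitive) matches;
    method="fuzzy" accepts any pair whose similarity is >= threshold.
    Returns (label, token, score) tuples, so the score could later be
    used as a feature for candidate selection.
    """
    if method not in ("crisp", "fuzzy"):
        raise ValueError(f"unknown matching method: {method}")
    matches = []
    for label in labels:
        for token in tokens:
            score = similarity(label, token)
            if method == "crisp":
                accepted = label.lower() == token.lower()
            else:
                accepted = score >= threshold
            if accepted:
                matches.append((label, token, score))
    return matches


# Hypothetical example data: a tokenized text and a tiny vocabulary.
tokens = ["the", "colour", "of", "Chekhov"]
labels = ["color", "Chehov", "music"]

for label, token, score in match_labels(tokens, labels, method="fuzzy"):
    print(f"{label} ~ {token} ({score:.2f})")
```

With this toy data, crisp matching finds nothing, while fuzzy matching pairs color with colour and Chehov with Chekhov. The nested loop compares every label with every token, which would be far too slow for a real vocabulary, so an actual implementation would need a faster library or some kind of candidate pre-filtering.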
Naturally, some benchmarking would be needed to find out whether this is actually a good idea at all. It's also possible that fuzzy matching doesn't give any benefit over the current matching.