Fuzzy lexical matching backend #629

osma · 2022-10-04T10:21:58Z

The MLLM lexical backend (as well as STWFSA) tries to match subject labels to document text, but the matching is quite strict. I think it could help in some cases to support fuzzy matching as well, for example matching subject labels despite small differences in spelling (e.g. color vs colour, or Chehov vs Chekhov).

This could either be its own backend (maybe called "flm", for fuzzy lexical matching?), or perhaps just an option in the MLLM backend that lets the user choose between traditional crisp matching and fuzzy matching. When finding fuzzy matches, the match similarity could be included as one of the features used for candidate selection.
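To make the idea concrete, here is a minimal sketch of how crisp and fuzzy matching could sit behind one switch, with the similarity score returned alongside each candidate so that it could serve as a feature. It is not taken from the MLLM code: the function names, the `fuzzy` flag and the 0.85 threshold are all hypothetical, and the standard library's `difflib` stands in for whatever fuzzy matching library would actually be chosen.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (difflib used as a stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_labels(tokens, labels, threshold=0.85, fuzzy=True):
    """Return (label, best_token, score) for each label that matches a token.

    With fuzzy=False only exact (case-insensitive) matches get a nonzero score.
    The threshold value is an arbitrary assumption, not a tuned setting.
    """
    matches = []
    for label in labels:
        best_token, best_score = None, 0.0
        for token in tokens:
            if fuzzy:
                score = similarity(label, token)
            else:
                score = 1.0 if label.lower() == token.lower() else 0.0
            if score > best_score:
                best_token, best_score = token, score
        if best_token is not None and best_score >= threshold:
            # The score is kept so it can be used as a candidate feature.
            matches.append((label, best_token, best_score))
    return matches


tokens = "The colour palette in Chehov plays".split()
labels = ["color", "Chekhov", "music"]

fuzzy_matches = match_labels(tokens, labels)               # finds color and Chekhov
crisp_matches = match_labels(tokens, labels, fuzzy=False)  # finds nothing
```

This compares every label against every token, which would be far too slow for a real vocabulary; it only illustrates the interface, not an efficient implementation.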

One question is how to implement the matching efficiently. There are libraries like TheFuzz (formerly known as FuzzyWuzzy) and fuzzysearch which could perhaps be used. The most promising one I found is RapidFuzz, which seems to be in active development (in fact extremely active), promises to be very fast, and is MIT licensed. This could be an ideal library for the purpose. However, it relies on C++ code, so we would have to consider making this an optional feature instead of a core dependency.
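One way the optional dependency could look is sketched below: RapidFuzz is used when it is installed, and otherwise the code falls back to the standard library. The helper name `label_similarity` and the fallback to `difflib` are assumptions made for illustration; the only RapidFuzz call used is `fuzz.ratio`, which returns a score between 0 and 100.

```python
from difflib import SequenceMatcher

try:
    # Optional dependency: only used if the user has installed it.
    from rapidfuzz import fuzz
    HAVE_RAPIDFUZZ = True
except ImportError:
    fuzz = None
    HAVE_RAPIDFUZZ = False


def label_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two strings.

    Uses rapidfuzz.fuzz.ratio (scaled from 0..100) when available,
    otherwise difflib.SequenceMatcher as a slower pure-Python fallback.
    """
    if HAVE_RAPIDFUZZ:
        return fuzz.ratio(a, b) / 100.0
    return SequenceMatcher(None, a, b).ratio()


score = label_similarity("color", "colour")
```

The two backends use different algorithms and will not always give identical scores, so a real implementation might prefer to disable fuzzy matching entirely when RapidFuzz is missing rather than fall back silently.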

Naturally, some benchmarking would be needed to find out whether this is actually a good idea at all. It's also possible that fuzzy matching doesn't give any benefit over the current matching.

osma added the enhancement label Oct 4, 2022

osma added this to the Long term milestone Oct 4, 2022

nwagner84 mentioned this issue Feb 22, 2024

Performance optimization with Rust or C extensions #746

Open
