Releases: MaartenGr/PolyFuzz
v0.4.2
v0.4.1
v0.4.0
- Added new models (SentenceTransformers, Gensim, USE, Spacy)
- Added `.fit`, `.transform`, and `.fit_transform` methods
- Added `.save` and `PolyFuzz.load()`
SentenceTransformers, Gensim, USE, and Spacy
SentenceTransformers
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model)
```
Gensim
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import GensimEmbeddings

distance_model = GensimEmbeddings("glove-twitter-25")
model = PolyFuzz(distance_model)
```
USE
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import USEEmbeddings

distance_model = USEEmbeddings("https://tfhub.dev/google/universal-sentence-encoder/4")
model = PolyFuzz(distance_model)
```
Spacy
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import SpacyEmbeddings

distance_model = SpacyEmbeddings("en_core_web_md")
model = PolyFuzz(distance_model)
```
fit, transform, fit_transform
Add `fit`, `transform`, and `fit_transform` in order to use PolyFuzz in production (#34):
```python
from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit a TF-IDF model on the training words
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Match unseen words against the words seen during fitting
results = model.transform(unseen_words)
```
In the code above, we fit our TF-IDF model on `train_words` and use `.transform()` to match the words in `unseen_words` to the words that we trained on in `train_words`.
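To get a feel for what comes back, you can simply print the result. Note that the exact return type of `.transform()` (a pandas DataFrame of matches, or a dict of DataFrames keyed by model id) is an assumption here and may depend on the PolyFuzz version:

```python
# Inspect the matches returned by .transform(); depending on the version
# this is a (dict of) pandas DataFrame(s) with "From", "To", and
# "Similarity" columns
print(results)
```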
After fitting our model, we can save it as follows:
```python
model.save("my_model")
```
Then, we can load our model to be used elsewhere:
```python
from polyfuzz import PolyFuzz

model = PolyFuzz.load("my_model")
```
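As a small usage sketch (assuming `PolyFuzz.load` restores the fitted TF-IDF state from the example above), the reloaded model can then match new, unseen words without refitting:

```python
# Match new words with the reloaded model; no refitting needed
# (assumes the fitted state was saved along with the model)
new_words = ["recall", "houses"]
results = model.transform(new_words)
```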
v0.3.4
- Make sure that when you pass two lists that are exactly the same, identical terms are matched with a similarity score of 1:
```python
from polyfuzz import PolyFuzz

from_list = ["apple", "house"]
model = PolyFuzz("TF-IDF")
model.match(from_list, from_list)
```
This will match each word in `from_list` to itself and give it a score of 1. Thus, `apple` will be matched to `apple` and `house` will be mapped to `house`. However, if you input just a single list, it will try to map the words within the list without mapping them to themselves:
```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples"]
model = PolyFuzz("TF-IDF")
model.match(from_list)
```
In the example above, `apple` will be mapped to `apples` and not to `apple`. Here, we assume that the user wants to find the most similar words within a list without mapping a word to itself.
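To see this behaviour concretely, the matches can be inspected as a dataframe; this assumes `.get_matches()` returns a pandas DataFrame with `From`, `To`, and `Similarity` columns:

```python
# Inspect the matches; "apple" should map to "apples" rather than to itself
matches = model.get_matches()
print(matches)
```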
v0.3.3
v0.3.2
v0.3
You can now specify the `top_n` matches for each string. This option allows you to get a selection of matches that best suit the input. It is implemented in `polyfuzz.models.TFIDF` and `polyfuzz.models.Embeddings`, since this is computationally quite heavy and these models are best suited for making those calculations.
Usage:
```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
model = PolyFuzz("TF-IDF")
model.match(from_list, to_list, top_n=3)
```
Or, when using custom models:
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF, Embeddings
from flair.embeddings import TransformerWordEmbeddings

# from_list and to_list as defined in the example above
embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
bert = Embeddings(embeddings, min_similarity=0, model_id="BERT", top_n=3)
tfidf = TFIDF(min_similarity=0, top_n=3)

string_models = [bert, tfidf]
model = PolyFuzz(string_models)
model.match(from_list, to_list)
```
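When multiple models are passed, each set of matches is stored under its `model_id`. As a sketch (assuming `.get_matches()` accepts a model id, as in the PolyFuzz documentation), the top 3 matches per model can then be retrieved separately:

```python
# Retrieve the top_n matches per model via its model_id
bert_matches = model.get_matches("BERT")
tfidf_matches = model.get_matches("TF-IDF")
```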
BibTeX
First Release
First public release. Includes:
Features:
- Edit Distance
- TF-IDF
- Embeddings
- Custom models
- Grouping of results with custom models
- Evaluation through precision-recall curves
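For context, a minimal sketch of how the matching, grouping, and precision-recall evaluation features fit together (parameter names such as `link_min_similarity` follow the PolyFuzz documentation and may have changed in later versions):

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

# Match the two lists and group similar words in to_list together
model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)
model.group(link_min_similarity=0.75)

# Inspect the (grouped) matches and evaluate via a precision-recall curve
print(model.get_matches())
model.visualize_precision_recall()
```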
Fixes:
- Update naming convention matcher --> model
- Add basic models to grouper
- Fix issues with vector order in cosine similarity
- Update naming of cosine similarity function