Releases: MaartenGr/PolyFuzz
v0.4.2
v0.4.1
v0.4.0
- Added new models (SentenceTransformers, Gensim, USE, Spacy)
- Added `.fit`, `.transform`, and `.fit_transform` methods
- Added `.save` and `PolyFuzz.load()`
SentenceTransformers, Gensim, USE, and Spacy
SentenceTransformers
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model)
```
Gensim
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import GensimEmbeddings

distance_model = GensimEmbeddings("glove-twitter-25")
model = PolyFuzz(distance_model)
```
USE
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import USEEmbeddings

distance_model = USEEmbeddings("https://tfhub.dev/google/universal-sentence-encoder/4")
model = PolyFuzz(distance_model)
```
Spacy
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import SpacyEmbeddings

distance_model = SpacyEmbeddings("en_core_web_md")
model = PolyFuzz(distance_model)
```
fit, transform, fit_transform
Add `fit`, `transform`, and `fit_transform` in order to use PolyFuzz in production (#34):
```python
from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit a TF-IDF model on the training words
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Match unseen words against the words seen during fitting
results = model.transform(unseen_words)
```
In the code above, we fit our TF-IDF model on `train_words` and use `.transform()` to match the words in `unseen_words` to the words that we trained on in `train_words`.
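To get a feel for what comes back, you can simply print the result. Note that the exact return type of `.transform()` (a pandas DataFrame of matches, or a dict of DataFrames keyed by model id) is an assumption here and may depend on the PolyFuzz version:

```python
# Inspect the matches returned by .transform(); depending on the version
# this is a (dict of) pandas DataFrame(s) with "From", "To", and
# "Similarity" columns
print(results)
```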
After fitting our model, we can save it as follows:
```python
model.save("my_model")
```
Then, we can load our model to be used elsewhere:
```python
from polyfuzz import PolyFuzz

model = PolyFuzz.load("my_model")
```
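As a small usage sketch (assuming `PolyFuzz.load` restores the fitted TF-IDF state from the example above), the reloaded model can then match new, unseen words without refitting:

```python
# Match new words with the reloaded model; no refitting needed
# (assumes the fitted state was saved along with the model)
new_words = ["recall", "houses"]
results = model.transform(new_words)
```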
v0.3.4
- Make sure that when you pass two lists that are exactly the same, identical terms are matched with a similarity score of 1:
```python
from polyfuzz import PolyFuzz

from_list = ["apple", "house"]
model = PolyFuzz("TF-IDF")
model.match(from_list, from_list)
```
This will match each word in `from_list` to itself and give it a score of 1. Thus, `apple` will be matched to `apple` and `house` will be mapped to `house`. However, if you input just a single list, it will try to map the words within the list without mapping them to themselves:
```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples"]
model = PolyFuzz("TF-IDF")
model.match(from_list)
```
In the example above, `apple` will be mapped to `apples` and not to `apple`. Here, we assume that the user wants to find the most similar words within a list without mapping a word to itself.
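To see this behaviour concretely, the matches can be inspected as a dataframe; this assumes `.get_matches()` returns a pandas DataFrame with `From`, `To`, and `Similarity` columns:

```python
# Inspect the matches; "apple" should map to "apples" rather than to itself
matches = model.get_matches()
print(matches)
```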
v0.3.3
v0.3.2
v0.3
You can now specify the `top_n` matches for each string. This option allows you to get a selection of matches that best suit the input. It is implemented in `polyfuzz.models.TFIDF` and `polyfuzz.models.Embeddings`, since this is computationally quite heavy and these models are best suited for making those calculations.
Usage:
```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
model = PolyFuzz("TF-IDF")
model.match(from_list, to_list, top_n=3)
```
Or, when using custom models:
```python
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF, Embeddings
from flair.embeddings import TransformerWordEmbeddings

# from_list and to_list as defined in the example above
embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
bert = Embeddings(embeddings, min_similarity=0, model_id="BERT", top_n=3)
tfidf = TFIDF(min_similarity=0, top_n=3)

string_models = [bert, tfidf]
model = PolyFuzz(string_models)
model.match(from_list, to_list)
```
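When multiple models are passed, each set of matches is stored under its `model_id`. As a sketch (assuming `.get_matches()` accepts a model id, as in the PolyFuzz documentation), the top 3 matches per model can then be retrieved separately:

```python
# Retrieve the top_n matches per model via its model_id
bert_matches = model.get_matches("BERT")
tfidf_matches = model.get_matches("TF-IDF")
```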
BibTeX
First Release
First public release. Includes:
Features:
- Edit Distance
- TF-IDF
- Embeddings
- Custom models
- Grouping of results with custom models
- Evaluation through precision-recall curves
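For context, a minimal sketch of how the matching, grouping, and precision-recall evaluation features fit together (parameter names such as `link_min_similarity` follow the PolyFuzz documentation and may have changed in later versions):

```python
from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

# Match the two lists and group similar words in to_list together
model = PolyFuzz("TF-IDF")
model.match(from_list, to_list)
model.group(link_min_similarity=0.75)

# Inspect the (grouped) matches and evaluate via a precision-recall curve
print(model.get_matches())
model.visualize_precision_recall()
```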
Fixes:
- Update naming convention matcher --> model
- Add basic models to grouper
- Fix issues with vector order in cosine similarity
- Update naming of cosine similarity function