Releases: MaartenGr/PolyFuzz

v0.4.2

03 Sep 07:52
b36ffa6

Removed restrictive pytorch dependencies for Flair

v0.4.1

03 Sep 07:32

v0.4.0

07 May 07:12
9d9754d
  • Added new models (SentenceTransformers, Gensim, USE, Spacy)
  • Added .fit, .transform, and .fit_transform methods
  • Added .save and PolyFuzz.load()

SentenceTransformers, Gensim, USE, and Spacy

SentenceTransformers

from polyfuzz import PolyFuzz
from polyfuzz.models import SentenceEmbeddings

distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
model = PolyFuzz(distance_model)

Gensim

from polyfuzz import PolyFuzz
from polyfuzz.models import GensimEmbeddings

distance_model = GensimEmbeddings("glove-twitter-25")
model = PolyFuzz(distance_model)

USE

from polyfuzz import PolyFuzz
from polyfuzz.models import USEEmbeddings

distance_model = USEEmbeddings("https://tfhub.dev/google/universal-sentence-encoder/4")
model = PolyFuzz(distance_model)

Spacy

from polyfuzz import PolyFuzz
from polyfuzz.models import SpacyEmbeddings

distance_model = SpacyEmbeddings("en_core_web_md")
model = PolyFuzz(distance_model)

fit, transform, fit_transform

Added fit, transform, and fit_transform so that PolyFuzz can be used in production (#34):

from polyfuzz import PolyFuzz

train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

# Fit
model = PolyFuzz("TF-IDF")
model.fit(train_words)

# Transform
results = model.transform(unseen_words)

In the code above, we fit our TF-IDF model on train_words and use .transform() to match the words in unseen_words to the words that we trained on in train_words.
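To make the fit/transform split concrete, here is a minimal standard-library sketch of the idea. It is not PolyFuzz's implementation: the class name TinyTfidfMatcher, the character trigram choice, and the smoothed IDF formula are all assumptions made for illustration. The point it shows is that the vocabulary and weights are learned once in fit, and transform only scores new words against what was stored.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word (the whole word if shorter than n)."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


class TinyTfidfMatcher:
    """Hypothetical stand-in that mimics a fit/transform split; not PolyFuzz code."""

    def fit(self, train_words):
        self.train_words = list(train_words)
        docs = [Counter(char_ngrams(w)) for w in self.train_words]
        n_docs = len(docs)
        doc_freq = Counter(gram for doc in docs for gram in doc)
        # Smoothed inverse document frequency, learned from the training words only
        self.idf = {g: math.log((1 + n_docs) / (1 + c)) + 1 for g, c in doc_freq.items()}
        self.train_vectors = [self._vectorize(w) for w in self.train_words]
        return self

    def _vectorize(self, word):
        counts = Counter(char_ngrams(word))
        # n-grams that were not seen during fit are ignored
        vec = {g: c * self.idf[g] for g, c in counts.items() if g in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {g: v / norm for g, v in vec.items()} if norm else {}

    def transform(self, unseen_words):
        results = []
        for word in unseen_words:
            vec = self._vectorize(word)
            # Cosine similarity against every stored training vector
            scores = [sum(v * tv.get(g, 0.0) for g, v in vec.items())
                      for tv in self.train_vectors]
            best = max(range(len(scores)), key=scores.__getitem__)
            match = self.train_words[best] if scores[best] > 0 else None
            results.append((word, match, round(scores[best], 3)))
        return results


train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
matcher = TinyTfidfMatcher().fit(train_words)
results = matcher.transform(["apple", "apples", "mouse"])
```

Each entry in results is a (word, best match, score) tuple. Because transform never updates the stored vocabulary, the same fitted object can score new inputs repeatedly, which is the property that makes this pattern convenient in production.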

After fitting our model, we can save it as follows:

model.save("my_model")

Then, we can load our model to be used elsewhere:

from polyfuzz import PolyFuzz

model = PolyFuzz.load("my_model")

v0.3.4

05 Nov 10:41
241d7d3
  • When the two input lists are exactly the same, identical terms now receive a score of 1:

from polyfuzz import PolyFuzz
from_list = ["apple", "house"]
model = PolyFuzz("TF-IDF")
model.match(from_list, from_list)

This matches each word in from_list to itself with a score of 1: apple is matched to apple and house to house. However, if you pass only a single list, PolyFuzz matches the words within that list without matching a word to itself:

from polyfuzz import PolyFuzz
from_list = ["apple", "apples"]
model = PolyFuzz("TF-IDF")
model.match(from_list)

In the example above, apple is mapped to apples and not to apple. Here, we assume that the user wants to find the most similar words within a list without mapping a word to itself.
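The two behaviours can be pictured with a small standard-library sketch. This is only an illustration of the rule described above, not PolyFuzz's code: the function name best_matches is hypothetical, and difflib's SequenceMatcher ratio is swapped in for the TF-IDF similarity.

```python
from difflib import SequenceMatcher


def best_matches(from_list, to_list=None):
    """Match each word to its most similar candidate.

    With two lists, every candidate is considered, so identical terms score 1.
    With a single list, a word is never compared against its own position.
    """
    same_list = to_list is None
    candidates = from_list if same_list else to_list
    matches = []
    for i, word in enumerate(from_list):
        best_word, best_score = None, 0.0
        for j, candidate in enumerate(candidates):
            if same_list and i == j:
                continue  # skip the word itself when matching within one list
            score = SequenceMatcher(None, word, candidate).ratio()
            if score > best_score:
                best_word, best_score = candidate, score
        matches.append((word, best_word, round(best_score, 3)))
    return matches


two_lists = best_matches(["apple", "house"], ["apple", "house"])
one_list = best_matches(["apple", "apples"])
```

Here two_lists pairs each word with itself, while one_list pairs apple with apples and apples with apple, since the self-comparison is skipped.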

v0.3.3

16 Jun 05:32
672c90e

Quickfix for issues #21 and #23

v0.3.2

08 Jun 06:48

Fixed an issue with sparse_dot_topn exploding memory usage when trying to access the top_n of a sparse matrix.

v0.3

30 Apr 06:19
a60dfc6

You can now specify the number of top matches (top_n) to return for each string. This gives you a selection of the candidates that best suit each input instead of a single match. It is implemented in polyfuzz.models.TFIDF and polyfuzz.models.Embeddings, since computing multiple matches is computationally quite heavy and these models are best suited for those calculations.

Usage:

from polyfuzz import PolyFuzz

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

model = PolyFuzz("TF-IDF")
model.match(from_list, to_list, top_n=3)

Or usage in custom models:

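from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF, Embeddings
from flair.embeddings import TransformerWordEmbeddings

# from_list and to_list are the lists defined in the usage example above

# Flair word embeddings wrapped in a PolyFuzz embedding model
embeddings = TransformerWordEmbeddings('bert-base-multilingual-cased')
bert = Embeddings(embeddings, min_similarity=0, model_id="BERT", top_n=3)
tfidf = TFIDF(min_similarity=0, top_n=3)

# Run both models on the same input
string_models = [bert, tfidf]
model = PolyFuzz(string_models)
model.match(from_list, to_list)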
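As a rough picture of what a top_n option does, the sketch below keeps the n highest-scoring candidates for one input string. It uses only the standard library and is not how PolyFuzz computes its matches: the function name top_n_matches is hypothetical, and difflib's SequenceMatcher ratio stands in for the TF-IDF or embedding similarity.

```python
import heapq
from difflib import SequenceMatcher


def top_n_matches(word, candidates, top_n=3):
    """Return the top_n candidates for a word as (candidate, score) pairs, best first."""
    scored = ((SequenceMatcher(None, word, c).ratio(), c) for c in candidates)
    # nlargest keeps only the top_n highest (score, candidate) pairs
    best = heapq.nlargest(top_n, scored)
    return [(candidate, round(score, 3)) for score, candidate in best]


to_list = ["apple", "apples", "mouse"]
matches = top_n_matches("appl", to_list, top_n=2)
```

With top_n=2, matches holds the two closest candidates for appl in descending order of score, and the remaining candidate is dropped.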

BibTeX

25 Jan 07:41
74945f8

This release is meant as a way to create a DOI through Zenodo.

First Release

29 Nov 06:47
b01e426

First public release. Includes:

Features:

  • Edit Distance
  • TF-IDF
  • Embeddings
  • Custom models
  • Grouping of results with custom models
  • Evaluation through precision-recall curves

Fixes:

  • Update naming convention matcher --> model
  • Add basic models to grouper
  • Fix issues with vector order in cosine similarity
  • Update naming of cosine similarity function