# 🍺IPA: import, preprocess, accelerate


## How to use

### Install

Install the library from PyPI:

```bash
pip install ipa-core
```

### Usage

IPA is a Python library that wraps Stanza and spaCy behind a unified preprocessing API, making the two libraries interchangeable.

Let's start with a simple example. Here we use the `SpacyTokenizer` wrapper to preprocess a text:

```python
from ipa import SpacyTokenizer

spacy_tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
"""
```

You can load any spaCy model, either by its canonical name (e.g. `en_core_web_sm`) or by a simple alias like `en`, as we did here. By default, an alias loads the smallest version of the corresponding model. For a complete list of available models, see the spaCy documentation.
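For instance, a minimal sketch that passes the canonical model name instead of the alias (assuming, as described above, that the canonical name goes through the same `language` argument):

```python
from ipa import SpacyTokenizer

# Same wrapper, but loading the model by its canonical spaCy name
# rather than the "en" alias.
spacy_tokenizer = SpacyTokenizer(language="en_core_web_sm", return_pos_tags=True)
```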

In the very same way, you can load any Stanza model using the `StanzaTokenizer` wrapper:

```python
from ipa import StanzaTokenizer

stanza_tokenizer = StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
"""
```

For simpler scenarios, you can use the `WhitespaceTokenizer` wrapper, which just splits the text on whitespace:

```python
from ipa import WhitespaceTokenizer

whitespace_tokenizer = WhitespaceTokenizer()
tokenized = whitespace_tokenizer("Mary sold the car to John .")
for word in tokenized:
    print("{:<5} {:<10}".format(word.index, word.text))

"""
0    Mary
1    sold
2    the
3    car
4    to
5    John
6    .
"""
```

## Features

### Complete preprocessing pipeline

`SpacyTokenizer` and `StanzaTokenizer` provide a unified API for both libraries, exposing most of their features, such as tokenization, part-of-speech tagging, lemmatization, and dependency parsing. You can enable or disable each of these with the `return_pos_tags`, `return_lemmas`, and `return_deps` flags. So, for example,

```python
StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
```

will return a list of `Token` objects with the `pos` and `lemma` fields filled, while

```python
StanzaTokenizer(language="en")
```

will return a list of `Token` objects with only the `text` field filled.
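For instance, a minimal sketch of the tokenization-only configuration, reusing the fields shown in the examples above:

```python
from ipa import StanzaTokenizer

# Tokenization only: each Token carries just its position and surface form.
plain_tokenizer = StanzaTokenizer(language="en")
for word in plain_tokenizer("Mary sold the car to John."):
    print(word.index, word.text)
```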

### GPU support

With `use_gpu=True`, the library will run the underlying models on the GPU if one is available. To set up your environment for GPU usage, refer to the Stanza documentation and the spaCy documentation.
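For example, a minimal sketch enabling GPU execution (whether the GPU is actually used depends on your environment setup):

```python
from ipa import SpacyTokenizer

# Uses the GPU when one is available; otherwise the model runs on CPU.
spacy_tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, use_gpu=True)
```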

## API

### Tokenizers

#### SpacyTokenizer

```python
class SpacyTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):
```
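As an illustration, a hypothetical sketch combining several of these flags; the exact semantics of `split_on_spaces` are an assumption here (we read it as treating the input as already split on spaces):

```python
from ipa import SpacyTokenizer

tokenizer = SpacyTokenizer(
    language="en",
    return_pos_tags=True,
    return_lemmas=True,
    split_on_spaces=True,  # assumption: tokenize by splitting on spaces only
)
# Input is pre-tokenized, with punctuation already separated by spaces.
tokenized = tokenizer("Mary sold the car to John .")
```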

#### StanzaTokenizer

```python
class StanzaTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):
```

#### WhitespaceTokenizer

```python
class WhitespaceTokenizer(BaseTokenizer):
    def __init__(self):
```

### Sentence Splitter

#### SpacySentenceSplitter

```python
class SpacySentenceSplitter(BaseSentenceSplitter):
    def __init__(self, language: str = "en", model_type: str = "statistical"):
```
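A hypothetical usage sketch, assuming the splitter is callable on raw text like the tokenizers and returns the text split into sentences:

```python
from ipa import SpacySentenceSplitter

# Assumption: calling the splitter returns one entry per sentence.
splitter = SpacySentenceSplitter(language="en", model_type="statistical")
for sentence in splitter("Mary sold the car. John drove it home."):
    print(sentence)
```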