Exclude parser when running Spacy model
Don't load unnecessary components (here, the parser) when loading the spaCy sentence segmentation model. This should improve performance.

> The SentenceRecognizer is a simple statistical component that only provides sentence boundaries. Along with being faster and smaller than the parser, its primary advantage is that it’s easier to train because it only requires annotated sentence boundaries rather than full dependency parses. spaCy’s trained pipelines include both a parser and a trained sentence segmenter, which is disabled by default. If you only need sentence boundaries and no parser, you can use the exclude or disable argument on spacy.load

https://spacy.io/usage/linguistic-features/#sbd-senter
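The difference between the `exclude` and `disable` arguments mentioned in the quote can be checked by inspecting the loaded pipeline. The following is a minimal sketch, not part of this commit: `describe_pipeline` and its `load` parameter are hypothetical names introduced for illustration, and it assumes spaCy v3 and an installed `xx_sent_ud_sm` model when no loader is passed in.

```python
from typing import Any, Callable, Dict, List, Optional


def describe_pipeline(
    model_name: str = "xx_sent_ud_sm",
    load: Optional[Callable[..., Any]] = None,
    **load_kwargs: Any,
) -> Dict[str, List[str]]:
    """Load a pipeline and report which components are active, loaded, or disabled.

    `load` is a hypothetical injection point (defaults to spacy.load) so the
    helper can be exercised without spaCy installed.
    """
    if load is None:
        import spacy  # lazy import: needs spaCy and the model to be installed

        load = spacy.load
    nlp = load(model_name, **load_kwargs)
    return {
        "active": list(nlp.pipe_names),       # components run by nlp(text)
        "loaded": list(nlp.component_names),  # also includes disabled components
        "disabled": list(nlp.disabled),       # loaded but switched off
    }
```

Comparing `describe_pipeline(exclude=["parser"])` with `describe_pipeline(disable=["parser"])` shows the distinction: an excluded component is not loaded at all, while a disabled one stays in `loaded` and can be re-enabled later with `nlp.enable_pipe`. Running it against `xx_sent_ud_sm` also shows which components that model actually ships.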
PJ-Finlay authored Sep 26, 2024
1 parent 275e7f7 commit 95c6f33
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions argostranslate/sbd.py
@@ -18,16 +18,18 @@ def split_sentences(self, text: str, lang_code: Optional[str] = None) -> List[st

 # Spacy sentence boundary detection Sentencizer
 # https://community.libretranslate.com/t/sentence-boundary-detection-for-machine-translation/606/3
 # https://spacy.io/usage/linguistic-features/#sbd

 # Download model:
 # python -m spacy download xx_sent_ud_sm
 class SpacySentencizerSmall(ISentenceBoundaryDetectionModel):
     def __init__(self):
         try:
-            self.nlp = spacy.load("xx_sent_ud_sm")
+            self.nlp = spacy.load("xx_sent_ud_sm", exclude=["parser"])
         except OSError:
             # Automatically download the model if it doesn't exist
             spacy.cli.download("xx_sent_ud_sm")
-            self.nlp = spacy.load("xx_sent_ud_sm")
+            self.nlp = spacy.load("xx_sent_ud_sm", exclude=["parser"])
         self.nlp.add_pipe("sentencizer")

     def split_sentences(self, text: str, lang_code: Optional[str] = None) -> List[str]:
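The load-with-fallback pattern in this hunk repeats the `spacy.load` call, so the `exclude` argument has to be kept in sync in two places. One way to avoid that is sketched below as a standalone illustration rather than the project's code: `load_sentence_model` and the free function `split_sentences` are hypothetical names, and the `doc.sents` iteration is an assumption, since the body of the real method is not shown in this diff.

```python
from typing import Any, List


def load_sentence_model(model_name: str = "xx_sent_ud_sm") -> Any:
    """Load the model without the parser, downloading it first if it is missing."""
    import spacy  # lazy import: only needed when a real model is loaded
    from spacy.cli import download

    def _load() -> Any:
        # Single place where the load arguments are defined.
        return spacy.load(model_name, exclude=["parser"])

    try:
        nlp = _load()
    except OSError:
        # Model package not installed yet: fetch it, then retry once.
        download(model_name)
        nlp = _load()
    nlp.add_pipe("sentencizer")  # mirrors the commit
    return nlp


def split_sentences(nlp: Any, text: str) -> List[str]:
    """Return the sentence strings that the pipeline assigns to `text`."""
    return [sent.text for sent in nlp(text).sents]
```

With spaCy and the model installed, `split_sentences(load_sentence_model(), some_text)` would return one string per detected sentence.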