Backend: YAKE
The `yake` backend is a wrapper around the YAKE library, which performs unsupervised automatic keyword extraction.
In the backend, the keywords found by YAKE are looked up in an index that is built from the SKOS vocabulary labels. The index can include `prefLabels`, `altLabels` and/or `hiddenLabels`. Keywords and labels in the index are lemmatized and sorted alphabetically for matching.
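The lookup described above can be illustrated with a minimal sketch. It is not Annif's implementation: the `normalize`, `build_index` and `lookup` helpers and the example URIs are hypothetical, lowercasing stands in for real lemmatization by the project analyzer, and the sketch assumes that "sorted alphabetically" refers to the words within each phrase.

```python
from collections import defaultdict


def normalize(phrase, lemmatize=str.lower):
    """Lemmatize each word, then sort the words alphabetically.

    Assumption: lowercasing is only a placeholder for a real lemmatizer.
    """
    words = [lemmatize(word) for word in phrase.split()]
    return " ".join(sorted(words))


def build_index(labels):
    """Map each normalized label to the set of concept URIs that carry it."""
    index = defaultdict(set)
    for uri, label in labels:
        index[normalize(label)].add(uri)
    return index


# Hypothetical (URI, label) pairs; not real vocabulary data.
labels = [
    ("http://example.org/c1", "information retrieval"),
    ("http://example.org/c2", "libraries"),
]
index = build_index(labels)


def lookup(keyword):
    """Return the concepts whose normalized label equals the normalized keyword."""
    return index.get(normalize(keyword), set())


matched = lookup("Retrieval Information")   # same words, different order and case
unmatched = lookup("quantum computing")     # no such label in the index
print(matched, unmatched)
```

Because both sides are normalized the same way, word order and inflection differences do not prevent a match; keywords with an empty result are the ones not found in the vocabulary.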
The YAKE backend follows a lexical approach, but currently it does not perform as well as the other lexical backends (MLLM or STWFSA), which were designed from the beginning to utilize SKOS vocabulary features. However, the (free) keyword extraction operation makes it possible to add new features to Annif, especially suggesting new terms for a vocabulary (the keywords not found in the vocabulary).
Currently the keywords not found in the vocabulary are shown in the debug log. However, at the moment it is not possible to set the log level for server mode (command `annif run`).
The unsupervised approach can also be useful in some cases, since no training data is needed.
Please note that the YAKE library is licensed under GPLv3, while Annif is licensed under the Apache License 2.0. The licenses are compatible, but depending on legal interpretation, the terms of the GPLv3 (for example the requirement to publish corresponding source code when publishing an executable application) may be considered to apply to the whole of Annif+YAKE if you decide to install the optional YAKE dependency.
For installation see Optional features and dependencies.
A minimal configuration that relies on default values:
```
[yso-yake-en]
language=en
backend=yake
analyzer=snowball(english)
vocab=yso
```
For long texts it can be advantageous to use the `limit` transformation project setting to truncate the documents before passing them to YAKE. For Finnish theses and dissertations good results can be achieved with `transform=limit(20000)`.
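As a rough illustration of what such truncation does, here is a minimal sketch. It assumes that the limit counts characters from the start of the document; the `limit` function below is a hypothetical stand-in, not Annif's own transform code.

```python
def limit(text: str, max_chars: int) -> str:
    """Keep only the first max_chars characters of the text (assumed behaviour)."""
    return text[:max_chars]


document = "word " * 10000            # a synthetic 50,000-character document
truncated = limit(document, 20000)    # mirrors transform=limit(20000)
print(len(document), len(truncated))
```

Shorter documents pass through unchanged, so the setting only affects texts longer than the limit.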
The `label_types` and `remove_parentheses` parameters are used for constructing the label index.
Note that the label index is created on the first `suggest` call for a project. If these parameters are changed after that, the change does not update the index; the project needs to be reset with `annif clear <project>`.
Resetting is also needed after a vocabulary update.
The other parameters are passed to YAKE when extracting keywords; for a detailed description of them and of the YAKE algorithm, see the article by R. Campos et al.
Parameter | Description |
---|---|
label_types | SKOS label types to use in matching. Values are given in a comma-separated list of `prefLabel`, `altLabel`, and/or `hiddenLabel`. Defaults to `prefLabel, altLabel`. |
remove_parentheses | Whether to remove parts of SKOS labels inside parentheses (a specifier for a label, e.g. `(photography)` in `films (photography)`). The value needs to be interpretable as a boolean, e.g. `True`/`False`. Defaults to `False`. |
window_size | Distance (in number of tokens) considered when computing co-occurrences of tokens. Defaults to 1. |
max_ngram_size | Maximum number of consecutive words to use in forming candidate keywords. Defaults to 4. |
deduplication_algo | Algorithm to measure the similarity of candidate keywords for deduplication: `levs`, `jaro` or `seqm`. Defaults to `levs`. |
deduplication_threshold | Threshold for the value of the similarity measure for deduplication. Defaults to 0.9. |
num_keywords | Limit for the number of keywords that YAKE extracts. Defaults to 100. |
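To make the value formats in the table concrete, the following sketch reads a project section with Python's standard `configparser` and interprets three of the parameters. It is only an illustration: the parameter values in the embedded configuration are made up, the `strip_specifier` helper and its regular expression are hypothetical, and Annif's own parsing may differ in detail.

```python
import configparser
import re

# Hypothetical project configuration that overrides some defaults.
CONFIG = """
[yso-yake-en]
language=en
backend=yake
analyzer=snowball(english)
vocab=yso
label_types=prefLabel,altLabel,hiddenLabel
remove_parentheses=True
max_ngram_size=3
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG)
params = parser["yso-yake-en"]

# Comma-separated list; falls back to the documented default.
label_types = [
    item.strip()
    for item in params.get("label_types", fallback="prefLabel,altLabel").split(",")
]
# Boolean and integer values; fall back to the documented defaults.
remove_parentheses = params.getboolean("remove_parentheses", fallback=False)
max_ngram_size = params.getint("max_ngram_size", fallback=4)


def strip_specifier(label: str) -> str:
    """Drop a parenthesised specifier such as ' (photography)' from a label."""
    return re.sub(r"\s*\([^)]*\)", "", label).strip()


label = "films (photography)"
print(label_types, max_ngram_size)
print(strip_specifier(label) if remove_parentheses else label)
```

With `remove_parentheses` enabled, the label `films (photography)` is indexed as `films`, so a keyword extracted without the specifier can still match it.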
Load a vocabulary:
```
annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
```
Training is not necessary or possible. Test the model with a single document:
```
cat document.txt | annif suggest yso-yake-en
```
Evaluate a directory full of files in fulltext document corpus format:
```
annif eval yso-yake-en /path/to/documents/
```