This library provides the functionality to find SKOS thesaurus concepts in a text. It is a reimplementation in Python of stwfsa combined with the concept scoring from [1]. A deterministic finite automaton is constructed from the labels of the thesaurus concepts to perform the matching. In addition, a classifier is trained to score the matched concept occurrences.
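The matching idea can be illustrated with a character trie, which is a simple special case of a deterministic finite automaton. The following is a minimal sketch only, not the construction stwfsapy actually uses: the labels, the example URIs, and the helper names `build_trie` and `find_concepts` are hypothetical, and the word-boundary handling is deliberately simplistic.

```python
from typing import Dict, List, Tuple

# Hypothetical label-to-concept mapping; real labels come from the thesaurus.
LABELS = {
    "inflation": "http://example.org/concept/inflation",
    "interest rate": "http://example.org/concept/interest-rate",
    "interest": "http://example.org/concept/interest",
}

END = "$concept"  # marker key storing the concept URI at an accepting state


def build_trie(labels: Dict[str, str]) -> dict:
    """Build a character trie: each node is a state, each edge a transition."""
    root: dict = {}
    for label, uri in labels.items():
        node = root
        for ch in label.lower():
            node = node.setdefault(ch, {})
        node[END] = uri
    return root


def find_concepts(text: str, trie: dict) -> List[Tuple[str, int, int]]:
    """Return (uri, start, end) for the longest label match at each word start."""
    text = text.lower()
    matches = []
    for start in range(len(text)):
        # Only begin matching at the start of a word.
        if start > 0 and text[start - 1].isalnum():
            continue
        node = trie
        longest = None
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break
            # Accept only if the label ends at a word boundary.
            at_boundary = pos + 1 == len(text) or not text[pos + 1].isalnum()
            if END in node and at_boundary:
                longest = (node[END], start, pos + 1)
        if longest is not None:
            matches.append(longest)
    return matches


trie = build_trie(LABELS)
for uri, start, end in find_concepts("The interest rate affects inflation.", trie):
    print(uri, start, end)
```

Because the sketch prefers the longest match, "interest rate" is reported instead of the shorter label "interest". A match found this way says nothing about whether the concept is actually relevant to the text, which is why the matched occurrences are scored by a trained classifier.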
The construction of the automaton requires a SKOS thesaurus represented as an rdflib Graph.
Concepts should be related to labels by skos:prefLabel, skos:altLabel, zbwext:altLabelNarrower, zbwext:altLabelRelated or skos:hiddenLabel.
Concepts have to be identifiable by rdf:type.
The training of the predictor requires annotated text.
Each training sample should be annotated with one or more concepts from the thesaurus.
Python >= 3.9 is required.
stwfsapy is available on PyPI. You can install stwfsapy using pip:
pip install stwfsapy
This will install a Python package called stwfsapy.
Note that it is generally recommended to use a virtual environment to avoid conflicting behaviour with the system package manager.
Alternatively, you can check out the repository and install the package from source. This requires poetry:
# call inside the project directory
poetry install --without ci
First load your thesaurus.
from rdflib import Graph
g = Graph()
g.parse('/path/to/your/thesaurus')
Next, define the type URI for descriptors.
If your thesaurus is structured into sub-thesauri by providing categories for the concepts of the thesaurus using, e.g., skos:Collection, you can optionally specify the type of these categories via a URI.
In this case you should also specify the relation that relates concepts to categories.
Furthermore you can indicate whether this relation is a specialisation relation (as opposed to a generalisation relation, which is the default).
For the STW this would be:
descriptor_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Descriptor'
thsys_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Thsys'
thesaurus_relation_type_uri = 'http://www.w3.org/2004/02/skos/core#broader'
is_specialisation = False
Then create the predictor:
from stwfsapy.predictor import StwfsapyPredictor
p = StwfsapyPredictor(
    g,
    descriptor_type_uri,
    thsys_type_uri,
    thesaurus_relation_type_uri,
    is_specialisation,
    langs={'en'},
    simple_english_plural_rules=True)
The next step assumes you have loaded your texts into a list X and your labels into a list of lists y, such that for every index 0 <= i < len(X), the list y[i] contains the URIs of the correct concepts for X[i].
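As an illustration of this layout, the sketch below builds X and y from annotated samples held as (text, URIs) pairs. The sample data, the example URIs, and the helper name `to_training_data` are hypothetical; how your annotations are actually stored will differ.

```python
from typing import List, Sequence, Tuple

# Hypothetical annotated samples: a text and the URIs of its correct concepts.
SAMPLES: List[Tuple[str, Sequence[str]]] = [
    ("Central banks raise interest rates.",
     ["http://example.org/concept/interest-rate"]),
    ("Inflation and interest rates in the euro area.",
     ["http://example.org/concept/inflation",
      "http://example.org/concept/interest-rate"]),
]


def to_training_data(
    samples: Sequence[Tuple[str, Sequence[str]]],
) -> Tuple[List[str], List[List[str]]]:
    """Split samples into aligned lists X (texts) and y (lists of concept URIs)."""
    X: List[str] = []
    y: List[List[str]] = []
    for i, (text, uris) in enumerate(samples):
        # Each training sample needs at least one annotated concept.
        if not uris:
            raise ValueError(f"sample {i} has no annotated concepts")
        X.append(text)
        y.append(list(uris))
    return X, y


X, y = to_training_data(SAMPLES)
```

Keeping both lists in a single loop guarantees that y[i] always belongs to X[i].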
Then you can train the classifier:
p.fit(X, y)
Afterwards you can get the predicted concepts and scores:
p.suggest_proba(['one input text', 'A completely different input text.'])
Alternatively you can get a sparse matrix of scores by calling
p.predict_proba(['one input text', 'Another input text.'])
The indices of the concepts are stored in p.concept_map_.
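To read concept URIs back out of a row of scores, the column indices have to be mapped to concepts. The sketch below assumes that the concept map is a dict from concept URI to column index; this is an assumption that should be checked against the documentation. It uses a plain Python list as a stand-in for one row of the score matrix, and the helper name `top_concepts` and the example URIs are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def top_concepts(
    scores: Sequence[float],
    concept_map: Dict[str, int],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to k (uri, score) pairs with non-zero score, best first."""
    # Assumption: concept_map maps concept URI -> column index, so invert it.
    index_to_uri = {index: uri for uri, index in concept_map.items()}
    ranked = sorted(
        ((index_to_uri[i], s) for i, s in enumerate(scores) if s > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Stand-in data; real values would come from the trained predictor.
example_map = {
    "http://example.org/concept/inflation": 0,
    "http://example.org/concept/interest": 1,
    "http://example.org/concept/interest-rate": 2,
}
example_row = [0.2, 0.0, 0.9]
best = top_concepts(example_row, example_map, k=2)
```

For a SciPy sparse matrix, a single row can typically be turned into such a flat sequence with `matrix[i].toarray().ravel()` before passing it to the helper.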
All options for the predictor are documented at https://stwfsapy.readthedocs.io/.
A trained predictor p can be stored by calling p.store('/path/to/storage/location').
Afterwards it can be loaded as follows:
from stwfsapy.predictor import StwfsapyPredictor
p = StwfsapyPredictor.load('/path/to/storage/location')
Contributions via pull requests are welcome. Please create an issue beforehand to explain and discuss the reasons for the respective contribution.
This code was created as part of the subject indexing automation effort at ZBW – Leibniz Information Centre for Economics. See our homepage for more information, publications, and contact details.