stwfsapy

About

This library provides the functionality to find SKOS thesaurus concepts in a text. It is a reimplementation in Python of stwfsa combined with the concept scoring from [1]. A deterministic finite automaton is constructed from the labels of the thesaurus concepts to perform the matching. In addition, a classifier is trained to score the matched concept occurrences.
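The label-matching idea can be illustrated with a small sketch. This is not stwfsapy's actual automaton: it uses a plain character trie (a simple deterministic automaton over characters), and the label table, the URIs, and the function names `build_trie` and `find_concepts` are assumptions made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical label-to-concept table; the URIs are placeholders.
LABELS = {
    "inflation": "http://example.org/concept/inflation",
    "monetary policy": "http://example.org/concept/monetary-policy",
    "money": "http://example.org/concept/money",
}

# Multi-character key, so it can never collide with a single-character edge.
URI_KEY = "__uri__"


def build_trie(labels: Dict[str, str]) -> dict:
    """Build a character trie; accepting nodes store the concept URI."""
    root: dict = {}
    for label, uri in labels.items():
        node = root
        for ch in label.lower():
            node = node.setdefault(ch, {})
        node[URI_KEY] = uri
    return root


def find_concepts(text: str, trie: dict) -> List[Tuple[int, int, str]]:
    """Return (start, end, uri) for the longest label starting at each word start."""
    lowered = text.lower()
    matches = []
    for start in range(len(lowered)):
        # Only begin a match at a word boundary.
        if start > 0 and lowered[start - 1].isalnum():
            continue
        node = trie
        best = None
        for pos in range(start, len(lowered)):
            node = node.get(lowered[pos])
            if node is None:
                break
            at_word_end = pos + 1 == len(lowered) or not lowered[pos + 1].isalnum()
            if URI_KEY in node and at_word_end:
                best = (start, pos + 1, node[URI_KEY])
        if best is not None:
            matches.append(best)
    return matches


trie = build_trie(LABELS)
found = find_concepts("Monetary policy affects inflation.", trie)
```

In the library, the matched occurrences are additionally scored by a trained classifier; the sketch stops at the matching step.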

Data Requirements

The construction of the automaton requires a SKOS thesaurus represented as an rdflib Graph. Concepts should be related to their labels by skos:prefLabel, skos:altLabel, zbwext:altLabelNarrower, zbwext:altLabelRelated or skos:hiddenLabel. Concepts have to be identifiable by their rdf:type. The training of the predictor requires annotated text: each training sample should be annotated with one or more concepts from the thesaurus.

Installation

Requirements

Python >= 3.9 is required.

With pip

stwfsapy is available on PyPI. You can install it using pip:

pip install stwfsapy

This will install a Python package called stwfsapy.

Installing into a virtual environment is generally recommended, as it avoids conflicts with packages managed by the system package manager.

From source

You can also check out the repository and install the package from source. This requires poetry:

# call inside the project directory
poetry install --without ci

Usage

Create predictor

First load your thesaurus.

from rdflib import Graph

g = Graph()
g.parse('/path/to/your/thesaurus')

Next, define the type URI for descriptors. If your thesaurus is structured into sub-thesauri that categorise its concepts, e.g., via skos:Collection, you can optionally specify the type of these categories by a URI. In this case you should also specify the relation that links concepts to categories. Furthermore, you can indicate whether this relation is a specialisation relation (as opposed to a generalisation relation, which is the default). For the STW this would be

descriptor_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Descriptor'
thsys_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Thsys'
thesaurus_relation_type_uri = 'http://www.w3.org/2004/02/skos/core#broader'
is_specialisation = False

Then create the predictor:

from stwfsapy.predictor import StwfsapyPredictor
p = StwfsapyPredictor(
    g,
    descriptor_type_uri,
    thsys_type_uri,
    thesaurus_relation_type_uri,
    is_specialisation,
    langs={'en'},
    simple_english_plural_rules=True)

The next step assumes that you have loaded your texts into a list X and your labels into a list of lists y, such that for every index 0 <= i < len(X), the list y[i] contains the URIs of the correct concepts for X[i]. Then you can train the classifier:

p.fit(X, y)
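The expected layout of X and y can be sketched as follows. The texts and the concept URIs are placeholders, and `check_training_data` is a hypothetical helper that is not part of stwfsapy; it only verifies the conditions described above before the data is handed to fit.

```python
from typing import List

# Placeholder texts and concept URIs, for illustration only.
X: List[str] = [
    "Central banks raise interest rates to curb inflation.",
    "Unemployment fell for the third quarter in a row.",
]
y: List[List[str]] = [
    [
        "http://example.org/concept/inflation",
        "http://example.org/concept/monetary-policy",
    ],
    ["http://example.org/concept/unemployment"],
]


def check_training_data(texts: List[str], labels: List[List[str]]) -> None:
    """Raise ValueError if texts and labels do not line up (hypothetical helper)."""
    if len(texts) != len(labels):
        raise ValueError("X and y must have the same length")
    for i, uris in enumerate(labels):
        # Each sample should carry one or more concept URIs.
        if not uris:
            raise ValueError(f"sample {i} has no annotated concepts")
        if not all(isinstance(uri, str) for uri in uris):
            raise ValueError(f"sample {i} contains a non-string concept URI")


check_training_data(X, y)
```

Running such a check before training makes misaligned or empty annotations visible early, rather than during fitting.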

Afterwards you can get the predicted concepts and scores:

p.suggest_proba(['one input text', 'A completely different input text.'])

Alternatively you can get a sparse matrix of scores by calling

p.predict_proba(['one input text', 'Another input text.'])

The indices of the concepts are stored in p.concept_map_.

Options

All options for the predictor are documented at https://stwfsapy.readthedocs.io/.

Save Model

A trained predictor p can be stored by calling p.store('/path/to/storage/location'). Afterwards it can be loaded as follows:

from stwfsapy.predictor import StwfsapyPredictor

StwfsapyPredictor.load('/path/to/storage/location')

Contribute

Contributions via pull requests are welcome. Please create an issue beforehand to explain and discuss the reasons for the respective contribution.

References

[1] Toepfer, Martin, and Christin Seifert. "Fusion architectures for automatic subject indexing under concept drift." International Journal on Digital Libraries (IJDL), 2018.

Context information

This code was created as part of the subject indexing automation effort at ZBW – Leibniz Information Centre for Economics. See our homepage for more information, publications, and contact details.
