-
Notifications
You must be signed in to change notification settings - Fork 41
Hugging Face Hub integration
Annif integration with Hugging Face Hub allows to easily upload and download projects and their associated vocabularies to and from Hugging Face Hub model repositories. This integration provides a convenient way to share Annif projects and collaborate on text classification tasks within the Hugging Face ecosystem.
Downloads are possible from public repositories without logging in to Hugging Face Hub, but for all uploads and for downloads from private repositories you need to have logged in using the huggingface-cli login
command or to give a User Access token with the --token
option of the annif upload
or annif download
commands.
The integration utilizes the Hugging Face Hub cache-system; to explore the cache, use huggingface-cli scan-cache
command, and to delete items from the cache, use huggingface-cli delete-cache
.
Note that using glob wildcards (e.g. *
or ?
) in the projects id patterns for the below commands can make the shell to expand the wildcards to matching filenames, instead of the intended projects, because the shell expansion happens before Annif processes the command arguments. At least when targeting all projects with *
pattern the shell expands it to all filenames in the directory. To avoid shell expansion, use quotion marks around the pattern, e.g. "*"
.
The download command enables to download selected projects and their associated vocabularies from a specified Hugging Face Hub repository. This command retrieves project and vocabulary archives along with their configuration files from the designated repository and extracts them to the local data/
directory and projects.d/
directory, respectively. After download the projects are directly usable by Annif (if projects.{cfg,toml}
does not exists or by using the --projects
option to override them).
Caution
The annif download
command has a --trust-repo
option, which needs to be used if the repository to download from has not been used previously (that is if the repository does not appear in the local Hugging Face Hub cache). This option is to remind of the risks of using models from the internet; the project downloads should only be done from trusted sources. For more information of the risks see the Hugging Face Hub documentation.
For example, download the yso-mllm-en project from NatLibFi/FintoAI-data-YSO repository and show the fetched files:
$ annif download yso-mllm-en NatLibFi/FintoAI-data-YSO # --trust-repo is needed if this is the first download from the repo
Downloading project(s): yso-mllm-en
projects/yso-mllm-en.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.11M/2.11M [00:00<00:00, 8.72MB/s]
yso-mllm-en.cfg: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 165/165 [00:00<00:00, 256kB/s]
vocabs/yso.zip: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61.6M/61.6M [00:05<00:00, 11.2MB/s]
$ tree data projects.d
data
├── projects
│ └── yso-mllm-en
│ └── mllm-model.gz
└── vocabs
└── yso
├── subjects.csv
├── subjects.dump.gz
└── subjects.ttl
projects.d
└── yso-mllm-en.cfg
To download all YSO projects with English language, assuming their project ids end with "en", use projects id pattern "yso-*en"
in place of yso-mllm-en
.
If any file contained in the archive already exists locally, the extraction does not overwrite the local file, but skips its extraction. To force overwrite of the existing local file, you can use the --force/-f
option.
A specific revision (commit hash, branch name or tag) to download can be selected using the --revision
option; see below for versioning projects.
The upload command allows to upload selected projects and their vocabularies to a specified Hugging Face Hub repository. The command zips the project directories and vocabularies of the projects whose project IDs match the given pattern to archive files, and uploads these archives along with the project configurations to the designated repository.
The yso-mllm-en project can be uploaded to NatLibFi/FintoAI-data-YSO repository like this:
$ annif upload yso-mllm-en NatLibFi/FintoAI-data-YSO
Uploading project(s): yso-mllm-en
Also the upload command targets the projects with a pattern of project IDs, so "yso-*en"
could be used to upload all local YSO English projects.
The commit message can be specified with --commit-message
option; the default commit message is "Upload project(s) <project-ID-pattern> with Annif".
The branch to upload to can set by the --revision
option, the default branch is main
. Note that the branch needs to exist before upload, see below how to create branches.
The Hugging Face Hub has good support for metadata of the models by using Model Cards.
Annif automatically creates or updates some information of the Model Cards on project uploads. By default when running annif upload
(without --no-modelcard
option):
- if Model Card does not exist in the destination repository, it is created with default contents,
- if Model Card exists, its project list and metadata are updated as necessary.
The metadata of the Model Card that are automatically set include model task "text-classification", a custom tag "annif" and the languages of the projects. The automatically inserted text content consist of a heading, Usage section and Projects section with a list of the projects that are stored in the repository (detected by the .cfg
files in the repository); the project list is updated on every upload.
Git branches and tags can be used for versioning Annif projects in Hugging Face Hub. Currently (by release 0.22) the Hugging Face Hub commandline tool does not support accessing git branches or tags, but this will change in the future. However, the tags can be accessed and manipulated using the Hugging Face Hub Python client. For example, to list branches and tags in the NatLibFi/FintoAI-data-YSO repository, use the command
$ python -c "from huggingface_hub import HfApi; client=HfApi(); refs=client.list_repo_refs(repo_id='NatLibFi/FintoAI-data-YSO'); print([t.ref for t in refs.branches]); print([t.ref for t in refs.tags])"
A new branch, e.g. release-2024-04
, can be created with command
$ python -c "from huggingface_hub import HfApi; client=HfApi(); client.create_branch(repo_id='NatLibFi/FintoAI-data-YSO', branch='release-2024-04')"
A new tag, e.g. 2024-04
, can be created with command
$ python -c "from huggingface_hub import HfApi; client=HfApi(); client.create_tag(repo_id='NatLibFi/FintoAI-data-YSO', tag='2024-04')"
- Home
- Getting started
- System requirements
- Optional features and dependencies
- Usage with Docker
- Architecture
- Commands
- Web user interface
- REST API
- Corpus formats
- Project configuration
- Analyzers
- Transforms
- Language detection
- Hugging Face Hub integration
- Achieving good results
- Reusing preprocessed training data
- Running as a WSGI service
- Backward compatibility between Annif releases
- Backends
- Development flow, branches and tags
- Release process
- Creating a new backend