MetadataExtractor

MetadataExtractor is a Flask-based web service that extracts metadata as RDF triples from various file types. It exposes a REST API that receives files and returns the extracted metadata in multiple formats.

Requirements

Python 3.10+
Additional dependencies are listed in the installDependencies.sh file:
- Apache Tika
- Tesseract
- OpenCV
- YOLO-NAS
Additional Python package requirements are listed in the requirements.txt file.

Installation

To install MetadataExtractor, follow these steps:

Clone the repository:

```
git clone https://github.com/BenediktHeinrichs/MetadataExtractor.git
cd MetadataExtractor
```

Run the installDependencies.sh script to install the required dependencies and Python packages (Linux):
```
./installDependencies.sh
```
If you have Docker installed, you can build and run the service using the provided Dockerfile.

Running the Service

To start the service:

Using Python directly:
```
python server.py
```

Using Docker:

Build the Docker image:
```
docker build -t metadataextractor .
```

Run the Docker container:

```
docker run -p 36541:36541 metadataextractor
```

Configuration

The service can be configured using the defaultConfigs.py module.
- This configuration can be overridden on each metadata extraction request.
Logging is set up via the setDefaultLogging() function.
Environment variables such as MAX_CONTENT_LENGTH, METADATA_EXTRACTOR_HOST, and METADATA_EXTRACTOR_PORT can be adjusted as needed.
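
As a rough illustration of how these three variables could be read, here is a minimal sketch. The helper `read_settings` and all fallback values are hypothetical and are not taken from defaultConfigs.py; the port fallback simply reuses the port from the Docker example, and the size limit is assumed to be in bytes (as Flask's own MAX_CONTENT_LENGTH setting is).

```python
import os

# Assumed fallbacks for illustration only; the service's real defaults may differ.
ASSUMED_HOST = "0.0.0.0"
ASSUMED_PORT = 36541  # port used in the Docker example above
ASSUMED_MAX_CONTENT_LENGTH = 100 * 1024 * 1024  # hypothetical 100 MiB cap, in bytes


def read_settings(env=None):
    """Collect the three documented variables into a plain dict."""
    env = os.environ if env is None else env
    return {
        "host": env.get("METADATA_EXTRACTOR_HOST", ASSUMED_HOST),
        "port": int(env.get("METADATA_EXTRACTOR_PORT", ASSUMED_PORT)),
        "max_content_length": int(
            env.get("MAX_CONTENT_LENGTH", ASSUMED_MAX_CONTENT_LENGTH)
        ),
    }


if __name__ == "__main__":
    print(read_settings())
```

Passing a dict instead of relying on `os.environ` keeps the helper easy to try out without touching the real environment.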

Version

The current API version is defined by the __version__ attribute of the MetadataExtractor module.

Usage

The service exposes several endpoints:

POST /

This endpoint accepts form-data containing either a download URL or a file, along with optional parameters:
- identifier: A unique identifier for the file.
- config: Configuration object for extraction settings. (Example value: { "Extractors": { "Text": [ "SummaryExtract" ] } })
- creation_date: File's creation date.
- modification_date: File's modification date.
- url: Download URL of the file.
- file: The file to be processed.
- accept: The Accept HTTP header selects the response format (default is JSON, Turtle is recommended).
Returns the extracted metadata in the requested format (JSON, Turtle, RDF/XML, JSON-LD, or TriG).
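
The request described above could be assembled along the following lines. This is a sketch using only the Python standard library, not the project's own client: the helper `build_extraction_request`, the `MEDIA_TYPES` table, and the sample file name are hypothetical. The media types are the standard ones for each format, but whether the service matches exactly these strings, and whether it expects `config` as a JSON-encoded form field, are assumptions.

```python
import json
import urllib.request
import uuid

# Standard media types for the listed formats; whether the service matches on
# exactly these strings is an assumption.
MEDIA_TYPES = {
    "json": "application/json",
    "turtle": "text/turtle",
    "rdfxml": "application/rdf+xml",
    "jsonld": "application/ld+json",
    "trig": "application/trig",
}


def build_extraction_request(base_url, filename, content, identifier,
                             config=None, fmt="turtle"):
    """Build (but do not send) a multipart/form-data POST to the root endpoint."""
    boundary = uuid.uuid4().hex
    fields = {"identifier": identifier}
    if config is not None:
        # Assumption: the config object is sent as a JSON string.
        fields["config"] = json.dumps(config)

    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode()
        )
    # The uploaded file goes into the "file" field as raw bytes.
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode()
        + content
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        base_url.rstrip("/") + "/",
        data=b"".join(parts),
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Accept": MEDIA_TYPES[fmt],
        },
    )


if __name__ == "__main__":
    request = build_extraction_request(
        "http://localhost:36541",  # assumed local instance
        "notes.txt",
        b"Example text to summarise.",
        identifier="example-1",
        config={"Extractors": {"Text": ["SummaryExtract"]}},
    )
    print(request.get_method(), request.full_url, request.get_header("Accept"))
    # To send it against a running instance:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode())
```

Building the request separately from sending it makes it possible to inspect the headers and body before pointing it at a running service.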

GET /defaultConfig

Returns the default configuration JSON object for the Metadata Extractor.

GET /version

Returns the current version of the Metadata Extractor.
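
Both read-only endpoints can be queried in the same way. The sketch below uses only the standard library; the helpers `build_get_request` and `fetch_json` are hypothetical names, and the base URL assumes a local instance on the port from the Docker example. The shape of the returned JSON is not documented here, so the code just decodes whatever comes back.

```python
import json
import urllib.request


def build_get_request(base_url, path):
    """Build a GET request for one of the read-only endpoints."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/" + path.lstrip("/"),
        method="GET",
        headers={"Accept": "application/json"},
    )


def fetch_json(request, timeout=10):
    """Send the request and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    base_url = "http://localhost:36541"  # assumed local instance
    for path in ("defaultConfig", "version"):
        request = build_get_request(base_url, path)
        print(request.get_method(), request.full_url)
        # Uncomment with the service running:
        # print(fetch_json(request))
```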

API Response Models

The server uses defined response models to structure the JSON response. This includes the MetadataOutput model for the main endpoint and the Version model for the version endpoint.

Contributing

Contributions are welcome. Please read the contribution guidelines (CONTRIBUTING.md) and feel free to submit a pull request.

Linting & Fixing

```
pip install ruff
ruff check --fix .
ruff format .
```
