API
This document describes an application programming interface to the input and output format used for processes within the OCR-D project. The format itself is based on METS as a container and for descriptive metadata and PAGE XML for the content.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Input can be either a single METS XML file or a ZIP container with a single `mets.xml` plus the referenced files.
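The two input forms could be handled by one loader. The following is only a sketch: the function name `load_mets` is hypothetical, and it assumes that the `mets.xml` sits at the top level of the ZIP container, which this document does not state.

```python
import zipfile
import xml.etree.ElementTree as ET


def load_mets(path):
    """Return the METS root element from a plain METS file or a ZIP container.

    `path` is a filesystem path. Assumption: inside a ZIP, the METS file
    is a top-level member named 'mets.xml'.
    """
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # Raises KeyError if the container has no top-level mets.xml.
            with archive.open("mets.xml") as handle:
                return ET.parse(handle).getroot()
    # Not a ZIP container: treat the path as a single METS XML file.
    return ET.parse(path).getroot()
```

Resolving the files referenced from the METS is left out here; that is the job of the resolver described further down.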
A METS file can have 1..n `<fileGrp>` elements. Their `USE` attribute MUST be one of:
| `@USE` | Type of use for OCR-D |
|---|---|
| `OCR-D-IMG` | The unmanipulated source images |
| `OCR-D-IMG-BIN` | Black-and-White images |
| `OCR-D-IMG-GRAY` | Gray images |
| `OCR-D-IMG-CROP` | Cropped images |
| `OCR-D-IMG-DESKEW` | Deskewed images |
| `OCR-D-IMG-DESPECK` | Despeckled images |
| `OCR-D-IMG-DEWARP` | Dewarped images |
| `OCR-D-SEG-PAGE` | Page segmentation |
| `OCR-D-SEG-BLOCK` | Block segmentation |
| `OCR-D-SEG-LINE` | Line segmentation |
| `OCR-D-OCR-TESS3` | Tesseract 3.04 OCR |
| `OCR-D-OCR-TESS4` | Tesseract 4.00 OCR |
| `OCR-D-OCR-ANY` | AnyOCR |
| `OCR-D-COR-CIS` | CIS post-correction |
| `OCR-D-COR-ASV` | ASV post-correction |
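A check of the `USE` values against the table above might look like the sketch below. The function name `invalid_file_groups` is hypothetical; the set of allowed values is copied from the table, and the namespace is the standard METS one.

```python
import xml.etree.ElementTree as ET

METS_FILEGRP = "{http://www.loc.gov/METS/}fileGrp"

# The USE values permitted by the table above.
ALLOWED_USES = frozenset({
    "OCR-D-IMG", "OCR-D-IMG-BIN", "OCR-D-IMG-GRAY", "OCR-D-IMG-CROP",
    "OCR-D-IMG-DESKEW", "OCR-D-IMG-DESPECK", "OCR-D-IMG-DEWARP",
    "OCR-D-SEG-PAGE", "OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE",
    "OCR-D-OCR-TESS3", "OCR-D-OCR-TESS4", "OCR-D-OCR-ANY",
    "OCR-D-COR-CIS", "OCR-D-COR-ASV",
})


def invalid_file_groups(mets_root):
    """List the USE value of every fileGrp that is not in the table.

    A fileGrp without any USE attribute shows up as None.
    """
    return [
        group.get("USE")
        for group in mets_root.iter(METS_FILEGRP)
        if group.get("USE") not in ALLOWED_USES
    ]
```

An empty result means every file group in the document uses one of the permitted values.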
The `ID` of the files produced SHOULD be `<USE>_<INDEX>`, where `<USE>` is the `USE` of the surrounding `<mets:fileGrp>` and `<INDEX>` is the zero-padded four-digit index of the file within the group. This way, file `ID`s are unique within the document.
Example:
<mets:fileGrp USE="OCR-D-SEG-LINE">
<mets:file ID="OCR-D-SEG-LINE_0001">[...]</mets:file>
</mets:fileGrp>
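Building an `ID` of the form `<USE>_<INDEX>` is a matter of string formatting. A minimal sketch, with the hypothetical helper name `make_file_id`; whether indices start at 0 or 1 is not specified here, so the helper accepts both:

```python
def make_file_id(use: str, index: int) -> str:
    """Build '<USE>_<INDEX>' with a zero-padded four-digit index."""
    if not 0 <= index <= 9999:
        # Anything larger would no longer fit into four digits.
        raise ValueError(f"index {index} does not fit into four digits")
    return f"{use}_{index:04d}"


print(make_file_id("OCR-D-SEG-LINE", 1))  # prints OCR-D-SEG-LINE_0001
```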
A single PAGE XML file represents one page in the original document.
Every `<pc:Page>` element MUST have an attribute `image` which MUST always be the source image.
The PAGE XML root element `<pc:PcGts>` MUST have exactly one `<pc:Page>`.
Coordinates are always absolute, i.e. relative to the extent defined in the `imageWidth`/`imageHeight` attributes of the nearest `<pc:Page>`.
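The single-page rule and the absolute-coordinate rule can be checked together. The sketch below is illustrative only: the function names are hypothetical, and it assumes coordinates are written as a `points` attribute of the form `x1,y1 x2,y2 ...` with integer values. Tags are matched by local name so that no particular PAGE namespace version has to be assumed.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def out_of_bounds_points(pcgts_root):
    """Return all coordinate pairs that fall outside the page extent."""
    pages = [el for el in pcgts_root if local_name(el.tag) == "Page"]
    if len(pages) != 1:
        raise ValueError(f"expected exactly one Page, found {len(pages)}")
    page = pages[0]
    width = int(page.get("imageWidth"))
    height = int(page.get("imageHeight"))
    offenders = []
    for element in page.iter():
        if local_name(element.tag) != "Coords":
            continue
        for x, y in parse_points(element.get("points", "")):
            if not (0 <= x <= width and 0 <= y <= height):
                offenders.append((x, y))
    return offenders
```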
When a processor wants to access the image of a layout element like a TextRegion or TextLine, the algorithm should be:
- If the element in question has an attribute `imageFilename`, resolve this value
- If the element has a `<pc:Coords>` subelement, resolve by passing the attribute `imageFilename` of the nearest `<pc:Page>` and the `points` attribute of the `<pc:Coords>` element
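The two steps above can be sketched as follows. This is not the project's implementation: the function names are hypothetical, and the sketch only decides which image and which outline to hand to a resolver, without loading or cropping anything. Tags are matched by local name so that no PAGE namespace version has to be assumed.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def build_parent_map(root):
    """ElementTree has no parent pointers, so record them once up front."""
    return {child: parent for parent in root.iter() for child in parent}


def resolve_image_reference(element, parent_map):
    """Return (image filename, points); points is None if no cropping is needed."""
    # Step 1: the element carries its own image.
    own_image = element.get("imageFilename")
    if own_image is not None:
        return own_image, None

    # Step 2: fall back to the page image plus the element's outline.
    coords = next((c for c in element if local_name(c.tag) == "Coords"), None)
    if coords is None:
        raise ValueError("element has neither imageFilename nor Coords")
    node = element
    while node is not None and local_name(node.tag) != "Page":
        node = parent_map.get(node)  # walk up towards the nearest Page
    if node is None:
        raise ValueError("element is not inside a Page")
    return node.get("imageFilename"), coords.get("points")
```

Returning the pair instead of an image keeps the sketch free of any imaging library; the caller would pass both values on to whatever performs the actual crop.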
📦TODO📦 https://github.com/PRImA-Research-Lab/prima-core-libs and its apidocs.
📦TODO📦 Describe
- Data Repository
- backend for the transparency in handling input and output
- cutting out images
- etc.
Creates a resolver and configures it, e.g. with the ZIP container in which file URLs should be resolved.
Resolve a URL to an `OcrdPage`.
Resolve a URL to an `OcrdMets`.
Resolve a URL to an `OcrdImage`.
Resolve a URL to an image, then crop it to the coordinates provided.
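How the crop is performed is not specified here. One simple reading is to cut out the axis-aligned bounding box of the outline. The sketch below computes that box from a `points` string of the form `x1,y1 x2,y2 ...`; both the helper name `bounding_box` and the choice of a rectangular crop are assumptions.

```python
def bounding_box(points: str):
    """Return (left, top, right, bottom) enclosing all points in the string."""
    pairs = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    if not pairs:
        raise ValueError("no coordinates given")
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return min(xs), min(ys), max(xs), max(ys)


print(bounding_box("10,20 110,20 110,60 10,60"))  # prints (10, 20, 110, 60)
```

The resulting 4-tuple follows the (left, upper, right, lower) order that common imaging libraries expect for a rectangular crop.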
Represents the METS file as used for input and output of the processors.
If `fileGrp USE="INPUT"` contains `file mimetype="text/xml"`, parse them (`OcrdPage`) and list them.
Otherwise, if `fileGrp USE="INPUT"` contains `file mimetype="image/*"`, generate empty PAGE XML from these by
- Creating a `pc:PcGts` and therein
  - an empty `pc:Page` element with `image="<URL>"`
📦TODO📦 Wrong here
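The fallback for image-only input (an empty `pc:Page` with `image="<URL>"` inside a `pc:PcGts`) might be sketched like this. The function name is hypothetical, and the PAGE namespace version used below is an assumption, since this document does not name one.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2017-07-15 PAGE content schema; substitute the version in use.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15"


def empty_page_for_image(url):
    """Create a pc:PcGts holding one empty pc:Page that points at `url`."""
    ET.register_namespace("pc", PAGE_NS)  # serialize with the 'pc' prefix
    root = ET.Element(f"{{{PAGE_NS}}}PcGts")
    ET.SubElement(root, f"{{{PAGE_NS}}}Page", {"image": url})
    return root


page = empty_page_for_image("file:///path/to/0001.tif")
print(ET.tostring(page, encoding="unicode"))
```

The URL in the usage line is a placeholder; in practice it would be the location of the image file listed in the METS.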
Lists all variants, i.e. nested METS files used as `INPUT`. In the common case that there is no nesting, this will return just one variant with all the files listed in `INPUT`.
Should be generated by the resolver.
A processor is a tool that accepts METS/PAGE input and produces METS/PAGE output according to this spec.