Architecture
- Data lifecycle (this should delegate to a, potentially local, repository)
- Create a local workspace directory for tools to run in
- Transparent translation of file access to URL access/in-archive-access
- Handling input
- Handling output
- Capture STDERR of tools
- Capture STDOUT of tools
- Keep track of mapping input-image <-> output image
- Produce either
- METS-XML with files POSTed to data repository or
- OCRD-ZIP with files and METS combined
- Implement the CLI
- Represent the command > implementation CLI structure in code
- Represent the
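The workspace and output-capturing items above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `run_tool` and the `ocrd-ws-` directory prefix are hypothetical, and a Python one-liner stands in for a real tool.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tool(argv, workspace=None):
    """Run a tool inside a workspace directory and capture its output streams.

    `run_tool` is a hypothetical helper, not part of any existing API.
    """
    # Create a fresh local workspace unless the caller supplies one.
    if workspace is None:
        workdir = Path(tempfile.mkdtemp(prefix="ocrd-ws-"))
    else:
        workdir = Path(workspace)
        workdir.mkdir(parents=True, exist_ok=True)

    result = subprocess.run(
        argv,
        cwd=workdir,          # the tool sees the workspace as its current directory
        capture_output=True,  # collect STDOUT and STDERR separately
        text=True,            # decode both streams to str
    )
    return {
        "workspace": workdir,
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


if __name__ == "__main__":
    # Stand-in for a real tool: writes one line to each stream.
    report = run_tool(
        [sys.executable, "-c",
         "import sys; print('done'); print('warning', file=sys.stderr)"]
    )
    print(report["returncode"], report["stdout"].strip(), report["stderr"].strip())
```

Keeping the two streams separate makes it possible to store them next to the output files later, for example as provenance information.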
It makes sense to extract the format handling altogether into a library (e.g. ocrd-mets), to reuse as much as possible from there and to abstract away as much of the XML API as possible.
Provide functionality to convert between OCRD-ZIP and METS-XML with a data repository.
Support FContent? Extremely inefficient encoding-wise (base64) but compact and standards-compliant.
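To put a number on the base64 overhead: every 3 input bytes become 4 output characters, so embedded files grow by about a third. A minimal sketch, where `base64_size` is a hypothetical helper and the byte count is the size of 00000001.tif in the listing further down:

```python
import base64


def base64_size(n_bytes: int) -> int:
    """Length of the base64 encoding (with padding) of n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


# Cross-check the formula against the real encoder on a small payload.
sample = bytes(3000)
assert len(base64.b64encode(sample)) == base64_size(3000)

# Size of one 5263814-byte TIFF once embedded as base64.
tiff_bytes = 5263814
print(base64_size(tiff_bytes), round(base64_size(tiff_bytes) / tiff_bytes, 3))
```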
Mapping between files must be by order within fileGrp (cf. https://github.com/OCR-D/spec/issues/9 and https://github.com/OCR-D/spec/issues/7). What to do if processing breaks for a single file?
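Order-based mapping could look roughly like the sketch below. The function names and the sample document are assumptions made for illustration. Raising an error when the two groups differ in length is just one possible policy and does not settle the open question above.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Reduced sample document; IDs follow the pattern used in the example manifest.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="IMAGE">
      <mets:file ID="FILE_0001_IMAGE"/>
      <mets:file ID="FILE_0002_IMAGE"/>
    </mets:fileGrp>
    <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="FILE_0001_FULLTEXT"/>
      <mets:file ID="FILE_0002_FULLTEXT"/>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def file_ids(root, use):
    """Return the file IDs of the fileGrp with the given USE, in document order."""
    group = root.find(f".//mets:fileGrp[@USE='{use}']", NS)
    if group is None:
        raise KeyError(f"no fileGrp with USE={use!r}")
    return [f.get("ID") for f in group.findall("mets:file", NS)]


def pair_by_order(root, input_use, output_use):
    """Pair the n-th input file with the n-th output file."""
    inputs = file_ids(root, input_use)
    outputs = file_ids(root, output_use)
    if len(inputs) != len(outputs):
        # Assumed policy: refuse to guess which file is missing.
        raise ValueError(f"{len(inputs)} inputs but {len(outputs)} outputs")
    return list(zip(inputs, outputs))


if __name__ == "__main__":
    for source, target in pair_by_order(ET.fromstring(SAMPLE), "IMAGE", "FULLTEXT"):
        print(source, "->", target)
```

The length check shows why a purely positional mapping is fragile: a single missing output shifts every later pair unless the gap is detected.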
A ZIP archive with a METS file serving as the manifest:
$> zipinfo test.ocrd.zip
Archive: test.ocrd.zip
Zip file size: 16384306 bytes, number of entries: 11
-rw-rw-r-- 3.0 unx 388 tx defN 18-Mar-20 22:56 00000002.xml
-rw-rw-r-- 3.0 unx 389 tx defN 18-Mar-20 22:56 00000004.xml
-rw-rw-r-- 3.0 unx 5263814 bx defN 18-Mar-20 22:56 00000001.tif
-rw-rw-r-- 3.0 unx 3113906 bx defN 18-Mar-20 22:56 00000002.tif
-rw-rw-r-- 3.0 unx 3461046 bx defN 18-Mar-20 22:56 00000005.tif
-rw-rw-r-- 3.0 unx 3080122 bx defN 18-Mar-20 22:56 00000003.tif
-rw-rw-r-- 3.0 unx 388 tx defN 18-Mar-20 22:56 00000001.xml
-rw-rw-r-- 3.0 unx 388 tx defN 18-Mar-20 22:56 00000003.xml
-rw-rw-r-- 3.0 unx 1717616 bx defN 18-Mar-20 22:56 00000004.tif
-rw-rw-r-- 3.0 unx 8648 tx defN 18-Mar-20 22:56 mets.xml
-rw-rw-r-- 3.0 unx 389 tx defN 18-Mar-20 22:56 00000005.xml
11 files, 16647094 bytes uncompressed, 16382620 bytes compressed: 1.6%
$> unzip -p test.ocrd.zip mets.xml
[...]
<mets:fileSec xmlns:xlink="http://www.w3.org/1999/xlink">
<mets:fileGrp USE="IMAGE">
<mets:file ID="FILE_0001_IMAGE" MIMETYPE="image/tif"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000001.tif" /></mets:file>
<mets:file ID="FILE_0002_IMAGE" MIMETYPE="image/tif"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000002.tif" /></mets:file>
<mets:file ID="FILE_0003_IMAGE" MIMETYPE="image/tif"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000003.tif" /></mets:file>
<mets:file ID="FILE_0004_IMAGE" MIMETYPE="image/tif"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000004.tif" /></mets:file>
<mets:file ID="FILE_0005_IMAGE" MIMETYPE="image/tif"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000005.tif" /></mets:file>
</mets:fileGrp>
<mets:fileGrp USE="FULLTEXT">
<mets:file ID="FILE_0001_FULLTEXT" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000001.xml" /></mets:file>
<mets:file ID="FILE_0002_FULLTEXT" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000002.xml" /></mets:file>
<mets:file ID="FILE_0003_FULLTEXT" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000003.xml" /></mets:file>
<mets:file ID="FILE_0004_FULLTEXT" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000004.xml" /></mets:file>
<mets:file ID="FILE_0005_FULLTEXT" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000005.xml" /></mets:file>
</mets:fileGrp>
</mets:fileSec>
</mets:mets>
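Since the METS file acts as the manifest, a consumer can check an archive against it. The following sketch assumes that every `file://` reference is relative to the archive root, as in the listing above. `missing_members` and `build_demo_archive` are hypothetical names, and the demo archive is built in memory with an empty placeholder instead of a real TIFF.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def missing_members(archive, manifest="mets.xml"):
    """List files referenced by the manifest that are absent from the archive."""
    root = ET.fromstring(archive.read(manifest))
    names = set(archive.namelist())
    missing = []
    for flocat in root.iter(f"{{{METS_NS}}}FLocat"):
        href = flocat.get(f"{{{XLINK_NS}}}href", "")
        # Assumption: file:// references are paths relative to the archive root.
        member = href[len("file://"):] if href.startswith("file://") else href
        if member not in names:
            missing.append(member)
    return missing


def build_demo_archive():
    """Build a tiny in-memory archive whose manifest names one missing file."""
    mets = (
        f'<mets:mets xmlns:mets="{METS_NS}" xmlns:xlink="{XLINK_NS}">'
        '<mets:fileSec><mets:fileGrp USE="IMAGE">'
        '<mets:file ID="FILE_0001_IMAGE" MIMETYPE="image/tif">'
        '<mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000001.tif"/>'
        '</mets:file>'
        '<mets:file ID="FILE_0002_IMAGE" MIMETYPE="image/tif">'
        '<mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000002.tif"/>'
        '</mets:file>'
        '</mets:fileGrp></mets:fileSec></mets:mets>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("mets.xml", mets)
        zf.writestr("00000001.tif", b"")  # placeholder, not a real image
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


if __name__ == "__main__":
    print(missing_members(build_demo_archive()))
```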
We recommend structuring the files by their use. Resulting files may be labeled like this:
FILE_NAME := 'Basename of the original image' + WORKFLOW_STEP + ("-" + PROCESSOR)? + "." + 'Extension depending on the MIME type'
WORKFLOW_STEP := ("IMG" | "SEG" | "OCR" | "COR")
PROCESSOR := [A-Z0-9-]{3,}
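Read literally, the grammar can be turned into a small builder with validation. This is a sketch only: `build_file_name` is a hypothetical helper, the extension is passed in rather than derived from the MIME type, and the basename and workflow step are concatenated without a separator because the grammar as written specifies none.

```python
import re

WORKFLOW_STEPS = ("IMG", "SEG", "OCR", "COR")
PROCESSOR_RE = re.compile(r"[A-Z0-9-]{3,}")


def build_file_name(basename, step, extension, processor=None):
    """Assemble FILE_NAME following the grammar literally."""
    if step not in WORKFLOW_STEPS:
        raise ValueError(f"unknown workflow step: {step!r}")
    if processor is not None and not PROCESSOR_RE.fullmatch(processor):
        raise ValueError(f"invalid processor name: {processor!r}")
    # The processor part is optional and introduced by a hyphen.
    suffix = f"-{processor}" if processor is not None else ""
    return f"{basename}{step}{suffix}.{extension}"


if __name__ == "__main__":
    print(build_file_name("00000001", "IMG", "tif", processor="BIN"))
    print(build_file_name("00000001", "OCR", "xml"))
```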
$> zipinfo groupByUSE.zip
Archive: groupByUSE.zip
Zip file size: 4397 bytes, number of entries: 14
-rw-r----- 3.0 unx 7664 tx defN 18-Apr-16 12:51 mets.xml
drwxr-xr-x 3.0 unx 0 bx stor 18-Apr-16 12:51 OCR-D-IMG/
-rw-r--r-- 3.0 unx 5263814 bx stor 18-Apr-16 12:50 OCR-D-IMG/00000001.tif
-rw-r--r-- 3.0 unx 3113906 bx stor 18-Apr-16 12:50 OCR-D-IMG/00000002.tif
drwxr-xr-x 3.0 unx 0 bx stor 18-Apr-16 13:04 OCR-D-IMG-BIN/
-rw-r--r-- 3.0 unx 3461046 bx stor 18-Apr-16 12:51 OCR-D-IMG-BIN/00000001_BIN.tif
-rw-r--r-- 3.0 unx 3080122 bx stor 18-Apr-16 12:51 OCR-D-IMG-BIN/00000002_BIN.tif
-rw-r--r-- 3.0 unx 236 bx stor 18-Apr-16 13:04 OCR-D-IMG-BIN/00000001_BIN_PROV.json
-rw-r--r-- 3.0 unx 236 bx stor 18-Apr-16 13:04 OCR-D-IMG-BIN/00000002_BIN_PROV.json
drwxr-xr-x 3.0 unx 0 bx stor 18-Apr-16 13:05 OCR-D-SEG-PAGE/
-rw-r--r-- 3.0 unx 1324 bx stor 18-Apr-16 12:51 OCR-D-SEG-PAGE/00000001_SEG_PAGE.xml
-rw-r--r-- 3.0 unx 1578 bx stor 18-Apr-16 12:51 OCR-D-SEG-PAGE/00000002_SEG_PAGE.xml
-rw-r--r-- 3.0 unx 236 bx stor 18-Apr-16 13:05 OCR-D-SEG-PAGE/00000001_SEG_PAGE_PROV.json
-rw-r--r-- 3.0 unx 236 bx stor 18-Apr-16 13:05 OCR-D-SEG-PAGE/00000002_SEG_PAGE_PROV.json
14 files, 14910398 bytes uncompressed, 14772734 bytes compressed: 1.0%
$> unzip -p groupByUSE.zip mets.xml
[...]
<mets:fileSec xmlns:xlink="http://www.w3.org/1999/xlink">
<mets:fileGrp USE="IMAGE">
<mets:file ID="FILE_0001_IMAGE" MIMETYPE="image/tif"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG/00000001.tif" /></mets:file>
<mets:file ID="FILE_0002_IMAGE" MIMETYPE="image/tif"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG/00000002.tif" /></mets:file>
</mets:fileGrp>
<mets:fileGrp USE="OCR-D-IMG-BIN">
<mets:file ID="FILE_0001_IMG_BIN" MIMETYPE="image/tif"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG-BIN/00000001_BIN.tif" /></mets:file>
<mets:file ID="FILE_0001_IMG_BIN_PROV" MIMETYPE="application/json"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG-BIN/00000001_BIN_PROV.json" /></mets:file>
<mets:file ID="FILE_0002_IMG_BIN" MIMETYPE="image/tif"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG-BIN/00000002_BIN.tif" /></mets:file>
<mets:file ID="FILE_0002_IMG_BIN_PROV" MIMETYPE="application/json"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG-BIN/00000002_BIN_PROV.json" /></mets:file>
</mets:fileGrp>
<mets:fileGrp USE="OCR-D-SEG-PAGE">
<mets:file ID="FILE_0001_SEG_PAGE" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-SEG-PAGE/00000001_SEG_PAGE.xml" /></mets:file>
<mets:file ID="FILE_0001_SEG_PAGE_PROV" MIMETYPE="application/json"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-SEG-PAGE/00000001_SEG_PAGE_PROV.json" /></mets:file>
<mets:file ID="FILE_0002_SEG_PAGE" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-SEG-PAGE/00000002_SEG_PAGE.xml" /></mets:file>
<mets:file ID="FILE_0002_SEG_PAGE_PROV" MIMETYPE="application/json"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-SEG-PAGE/00000002_SEG_PAGE_PROV.json" /></mets:file>
</mets:fileGrp>
</mets:fileSec>
</mets:mets>