Architecture

VolkerHartmann edited this page Apr 16, 2018 · 4 revisions

pyocrd Architecture

  • Data lifecycle (this should be delegated to a, potentially local, repository)
    • Create a local workspace directory for tools to run in
    • Transparent translation of file access to URL access/in-archive-access
  • Handling input
    • METS-XML with URLs to PAGE XML and images
    • OCRD-ZIP: ZIP with METS-XML as manifest and files included
  • Handling output
    • Capture STDERR of tools
    • Capture STDOUT of tools
    • Keep track of the mapping input image <-> output image
    • Produce either
      • METS-XML with files POSTed to data repository or
      • OCRD-ZIP with files and METS combined
  • Implement the CLI
    • Represent the command > implementation CLI structure in code
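
The workspace and output-capturing items above could be sketched as follows. This is a minimal illustration under assumptions, not the pyocrd implementation: `run_tool` is a hypothetical helper, and the "tool" is just a Python one-liner standing in for a real processor.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tool(command, workspace):
    """Run a tool inside the workspace and capture both output streams.

    `run_tool` is a hypothetical name used for illustration only.
    """
    result = subprocess.run(
        command,
        cwd=workspace,        # the tool runs inside the local workspace
        capture_output=True,  # capture STDOUT and STDERR separately
        text=True,            # decode both streams to str
    )
    return result.returncode, result.stdout, result.stderr


# Create a throwaway local workspace directory for the tool to run in.
with tempfile.TemporaryDirectory(prefix="ocrd-workspace-") as workspace:
    # Stand-in for a real tool: writes one line to each stream.
    fake_tool = [
        sys.executable,
        "-c",
        "import sys; print('done'); print('warn', file=sys.stderr)",
    ]
    code, out, err = run_tool(fake_tool, Path(workspace))
```

Keeping the two streams separate means the caller can decide later whether STDERR is logged, attached to the output, or treated as a failure signal.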

Format handling lib

It makes sense to extract the format handling altogether into a library (e.g. ocrd-mets), so that as much as possible can be reused from there and as much of the XML API as possible is abstracted away.

The library should provide functionality to convert between OCRD-ZIP and METS-XML with a data repository.
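
One direction of such a conversion, between a local workspace directory and an OCRD-ZIP, might look like the sketch below. It assumes the manifest is called `mets.xml` and sits at the archive root, as in the listings on this page. The function names `pack_ocrd_zip` and `unpack_ocrd_zip` are hypothetical, and the data repository side is not modelled at all.

```python
import tempfile
import zipfile
from pathlib import Path


def pack_ocrd_zip(workspace, zip_path):
    """Write every file below `workspace` into a ZIP, manifest first."""
    workspace = Path(workspace)
    mets = workspace / "mets.xml"
    if not mets.is_file():
        raise FileNotFoundError("workspace has no mets.xml manifest")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(mets, "mets.xml")
        for path in sorted(workspace.rglob("*")):
            if path.is_file() and path != mets:
                archive.write(path, path.relative_to(workspace).as_posix())


def unpack_ocrd_zip(zip_path, workspace):
    """Extract the archive and return the path of the extracted manifest."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(workspace)
    return Path(workspace) / "mets.xml"


# Round trip with placeholder content (not a real METS or TIFF file).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "workspace"
    source.mkdir()
    (source / "mets.xml").write_text(
        "<mets:mets xmlns:mets='http://www.loc.gov/METS/'/>"
    )
    (source / "00000001.tif").write_bytes(b"placeholder image bytes")
    zip_path = Path(tmp) / "test.ocrd.zip"  # kept outside the workspace
    pack_ocrd_zip(source, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    restored_ok = unpack_ocrd_zip(zip_path, Path(tmp) / "restored").is_file()
```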

Formats

METS-XML

Should FContent be supported? Embedding files as base64 is extremely inefficient encoding-wise, but it keeps everything in one file and is standards-compliant.
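
To put a number on the encoding overhead: base64 turns every 3 input bytes into 4 output characters. The sketch below applies that to the size of the largest image in the listing further down (5263814 bytes). `base64_size` is a hypothetical helper, and the calculation ignores the surrounding XML markup.

```python
import base64


def base64_size(num_bytes):
    """Length of the padded base64 encoding of `num_bytes` bytes."""
    return 4 * ((num_bytes + 2) // 3)


image_size = 5263814                  # size of 00000001.tif in the listing
embedded_size = base64_size(image_size)
overhead = embedded_size / image_size - 1

# Cross-check the formula against the real encoder on a small input.
sample_length = len(base64.b64encode(bytes(300)))

print(embedded_size)       # 7018420
print(f"{overhead:.1%}")   # 33.3%
```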

The mapping between files must be by their order within the fileGrp (cf. https://github.com/OCR-D/spec/issues/9 and https://github.com/OCR-D/spec/issues/7). Open question: what should happen if processing breaks for a single file?
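
A sketch of what order-based mapping means in practice, using only the standard library. The inline METS fragment is a shortened copy of the example on this page. `hrefs_by_use` and `pair_by_order` are hypothetical names, and raising an error on a length mismatch is just one possible policy.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="IMAGE">
      <mets:file ID="FILE_0001_IMAGE"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000001.tif"/></mets:file>
      <mets:file ID="FILE_0002_IMAGE"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000002.tif"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="FILE_0001_FULLTEXT"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000001.xml"/></mets:file>
      <mets:file ID="FILE_0002_FULLTEXT"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000002.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def hrefs_by_use(root, use):
    """Return the hrefs of one fileGrp in document order."""
    xpath = f"./mets:fileSec/mets:fileGrp[@USE='{use}']/mets:file/mets:FLocat"
    href_attr = f"{{{NS['xlink']}}}href"
    return [flocat.get(href_attr) for flocat in root.findall(xpath, NS)]


def pair_by_order(root, input_use, output_use):
    """Pair the n-th input file with the n-th output file."""
    inputs = hrefs_by_use(root, input_use)
    outputs = hrefs_by_use(root, output_use)
    if len(inputs) != len(outputs):
        # Order alone cannot tell which file is missing.
        raise ValueError(f"{len(inputs)} inputs but {len(outputs)} outputs")
    return list(zip(inputs, outputs))


pairs = pair_by_order(ET.fromstring(METS), "IMAGE", "FULLTEXT")
```

The length check shows why a single failed file is a problem for this scheme: once one output is missing, every later pair is shifted, and position alone cannot reveal where the gap is.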

OCRD-ZIP

A ZIP archive with a METS file serving as the manifest:

$> zipinfo test.ocrd.zip
Archive:  test.ocrd.zip
Zip file size: 16384306 bytes, number of entries: 11
-rw-rw-r--  3.0 unx      388 tx defN 18-Mar-20 22:56 00000002.xml
-rw-rw-r--  3.0 unx      389 tx defN 18-Mar-20 22:56 00000004.xml
-rw-rw-r--  3.0 unx  5263814 bx defN 18-Mar-20 22:56 00000001.tif
-rw-rw-r--  3.0 unx  3113906 bx defN 18-Mar-20 22:56 00000002.tif
-rw-rw-r--  3.0 unx  3461046 bx defN 18-Mar-20 22:56 00000005.tif
-rw-rw-r--  3.0 unx  3080122 bx defN 18-Mar-20 22:56 00000003.tif
-rw-rw-r--  3.0 unx      388 tx defN 18-Mar-20 22:56 00000001.xml
-rw-rw-r--  3.0 unx      388 tx defN 18-Mar-20 22:56 00000003.xml
-rw-rw-r--  3.0 unx  1717616 bx defN 18-Mar-20 22:56 00000004.tif
-rw-rw-r--  3.0 unx     8648 tx defN 18-Mar-20 22:56 mets.xml
-rw-rw-r--  3.0 unx      389 tx defN 18-Mar-20 22:56 00000005.xml
11 files, 16647094 bytes uncompressed, 16382620 bytes compressed:  1.6%
$> unzip -p test.ocrd.zip mets.xml

[...]
    <mets:fileSec xmlns:xlink="http://www.w3.org/1999/xlink">
        <mets:fileGrp USE="IMAGE">
            <mets:file ID="FILE_0001_IMAGE" MIMETYPE="image/tiff"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000001.tif" /></mets:file>
            <mets:file ID="FILE_0002_IMAGE" MIMETYPE="image/tiff"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000002.tif" /></mets:file>
            <mets:file ID="FILE_0003_IMAGE" MIMETYPE="image/tiff"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000003.tif" /></mets:file>
            <mets:file ID="FILE_0004_IMAGE" MIMETYPE="image/tiff"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000004.tif" /></mets:file>
            <mets:file ID="FILE_0005_IMAGE" MIMETYPE="image/tiff"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000005.tif" /></mets:file>
        </mets:fileGrp>
        <mets:fileGrp USE="FULLTEXT">
            <mets:file ID="FILE_0001_FULLTEXT" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000001.xml" /></mets:file>
            <mets:file ID="FILE_0002_FULLTEXT" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000002.xml" /></mets:file>
            <mets:file ID="FILE_0003_FULLTEXT" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000003.xml" /></mets:file>
            <mets:file ID="FILE_0004_FULLTEXT" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000004.xml" /></mets:file>
            <mets:file ID="FILE_0005_FULLTEXT" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000005.xml" /></mets:file>
        </mets:fileGrp>
    </mets:fileSec>
</mets:mets>
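
Because the METS file acts as the manifest, a consumer can check that every file it references is actually present in the archive. The following is a sketch under assumptions: `missing_members` is a hypothetical helper, and it treats `file://NAME` as a path relative to the archive root, which is how the example above uses it.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def missing_members(archive):
    """Return file:// hrefs from mets.xml that have no matching ZIP entry."""
    root = ET.fromstring(archive.read("mets.xml"))
    names = set(archive.namelist())
    missing = []
    for flocat in root.iter(f"{{{METS_NS}}}FLocat"):
        href = flocat.get(f"{{{XLINK_NS}}}href", "")
        if not href.startswith("file://"):
            continue  # remote URLs are not expected inside the archive
        if href[len("file://"):] not in names:
            missing.append(href)
    return missing


# Build a small in-memory archive whose manifest lists two images,
# only one of which is actually included.
manifest = f"""\
<mets:mets xmlns:mets="{METS_NS}" xmlns:xlink="{XLINK_NS}">
  <mets:fileSec>
    <mets:fileGrp USE="IMAGE">
      <mets:file ID="FILE_0001_IMAGE"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000001.tif"/></mets:file>
      <mets:file ID="FILE_0002_IMAGE"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://00000002.tif"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mets.xml", manifest)
    archive.writestr("00000001.tif", b"placeholder image bytes")

with zipfile.ZipFile(buffer) as archive:
    missing = missing_members(archive)
```
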

Example: Zip file grouped by USE

We recommend structuring the files by their USE. The resulting files may be named like this:

FILE_NAME := 'Basename of the original image' + WORKFLOW_STEP + ("-" + PROCESSOR)? + "." + 'Extension depending on the MIME type'

WORKFLOW_STEP := ("IMG" | "SEG" | "OCR" | "COR")

PROCESSOR := [A-Z0-9-]{3,}
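
Read literally, the grammar above translates into a small builder like the one below. This is a sketch, not an agreed convention: `build_file_name` is a hypothetical helper, and since the grammar defines no separator between the basename and the workflow step, none is inserted here.

```python
import re

WORKFLOW_STEPS = ("IMG", "SEG", "OCR", "COR")
PROCESSOR_RE = re.compile(r"[A-Z0-9-]{3,}")


def build_file_name(basename, workflow_step, extension, processor=None):
    """Assemble FILE_NAME from its parts, validating step and processor."""
    if workflow_step not in WORKFLOW_STEPS:
        raise ValueError(f"unknown WORKFLOW_STEP: {workflow_step!r}")
    name = basename + workflow_step
    if processor is not None:
        if not PROCESSOR_RE.fullmatch(processor):
            raise ValueError(f"invalid PROCESSOR: {processor!r}")
        name += "-" + processor
    return name + "." + extension


with_processor = build_file_name("00000001", "IMG", "tif", processor="BIN")
without_processor = build_file_name("00000001", "OCR", "xml")
```

The file names in the listing that follows use underscores between the parts, so the grammar and the example do not yet agree on separators.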

$> zipinfo groupByUSE.zip 
Archive:  groupByUSE.zip
Zip file size: 4397 bytes, number of entries: 14
-rw-r-----  3.0 unx     7664 tx defN 18-Apr-16 12:51 mets.xml
drwxr-xr-x  3.0 unx        0 bx stor 18-Apr-16 12:51 OCR-D-IMG/
-rw-r--r--  3.0 unx  5263814 bx stor 18-Apr-16 12:50 OCR-D-IMG/00000001.tif
-rw-r--r--  3.0 unx  3113906 bx stor 18-Apr-16 12:50 OCR-D-IMG/00000002.tif
drwxr-xr-x  3.0 unx        0 bx stor 18-Apr-16 13:04 OCR-D-IMG-BIN/
-rw-r--r--  3.0 unx  3461046 bx stor 18-Apr-16 12:51 OCR-D-IMG-BIN/00000001_BIN.tif
-rw-r--r--  3.0 unx  3080122 bx stor 18-Apr-16 12:51 OCR-D-IMG-BIN/00000002_BIN.tif
-rw-r--r--  3.0 unx      236 bx stor 18-Apr-16 13:04 OCR-D-IMG-BIN/00000001_BIN_PROV.json
-rw-r--r--  3.0 unx      236 bx stor 18-Apr-16 13:04 OCR-D-IMG-BIN/00000002_BIN_PROV.json
drwxr-xr-x  3.0 unx        0 bx stor 18-Apr-16 13:05 OCR-D-SEG-PAGE/
-rw-r--r--  3.0 unx     1324 bx stor 18-Apr-16 12:51 OCR-D-SEG-PAGE/00000001_SEG_PAGE.xml
-rw-r--r--  3.0 unx     1578 bx stor 18-Apr-16 12:51 OCR-D-SEG-PAGE/00000002_SEG_PAGE.xml
-rw-r--r--  3.0 unx      236 bx stor 18-Apr-16 13:05 OCR-D-SEG-PAGE/00000001_SEG_PAGE_PROV.json
-rw-r--r--  3.0 unx      236 bx stor 18-Apr-16 13:05 OCR-D-SEG-PAGE/00000002_SEG_PAGE_PROV.json
14 files, 14910398 bytes uncompressed, 14772734 bytes compressed:  1.0%
$> unzip -p groupByUSE.zip mets.xml

[...]
    <mets:fileSec xmlns:xlink="http://www.w3.org/1999/xlink">
        <mets:fileGrp USE="OCR-D-IMG">
            <mets:file ID="FILE_0001_IMAGE" MIMETYPE="image/tiff"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG/00000001.tif" /></mets:file>
            <mets:file ID="FILE_0002_IMAGE" MIMETYPE="image/tiff"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG/00000002.tif" /></mets:file>
        </mets:fileGrp>
        <mets:fileGrp USE="OCR-D-IMG-BIN">
            <mets:file ID="FILE_0001_IMG_BIN" MIMETYPE="image/tiff"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG-BIN/00000001_BIN.tif" /></mets:file>
            <mets:file ID="FILE_0001_IMG_BIN_PROV" MIMETYPE="application/json"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG-BIN/00000001_BIN_PROV.json" /></mets:file>
            <mets:file ID="FILE_0002_IMG_BIN" MIMETYPE="image/tiff"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG-BIN/00000002_BIN.tif" /></mets:file>
            <mets:file ID="FILE_0002_IMG_BIN_PROV" MIMETYPE="application/json"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG-BIN/00000002_BIN_PROV.json" /></mets:file>
        </mets:fileGrp>
        <mets:fileGrp USE="OCR-D-SEG-PAGE">
            <mets:file ID="FILE_0001_SEG_PAGE" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-SEG-PAGE/00000001_SEG_PAGE.xml" /></mets:file>
            <mets:file ID="FILE_0001_SEG_PAGE_PROV" MIMETYPE="application/json"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-SEG-PAGE/00000001_SEG_PAGE_PROV.json" /></mets:file>
            <mets:file ID="FILE_0002_SEG_PAGE" MIMETYPE="text/xml"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-SEG-PAGE/00000002_SEG_PAGE.xml" /></mets:file>
            <mets:file ID="FILE_0002_SEG_PAGE_PROV" MIMETYPE="application/json"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-SEG-PAGE/00000002_SEG_PAGE_PROV.json" /></mets:file>
        </mets:fileGrp>
    </mets:fileSec>
</mets:mets>
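
If the grouping is meant as a convention, it can be checked mechanically. The sketch below assumes that each directory is named after the USE of its fileGrp, as in the OCR-D-IMG-BIN and OCR-D-SEG-PAGE groups above. `misplaced_files` is a hypothetical helper, and the sample manifest deliberately contains one file in the wrong directory.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def misplaced_files(mets_xml):
    """Return (USE, href) pairs whose href is not below a USE-named directory."""
    root = ET.fromstring(mets_xml)
    problems = []
    for group in root.iter(f"{{{METS_NS}}}fileGrp"):
        use = group.get("USE")
        for flocat in group.iter(f"{{{METS_NS}}}FLocat"):
            href = flocat.get(f"{{{XLINK_NS}}}href", "")
            if not href.startswith(f"file://{use}/"):
                problems.append((use, href))
    return problems


SAMPLE = f"""\
<mets:mets xmlns:mets="{METS_NS}" xmlns:xlink="{XLINK_NS}">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG-BIN">
      <mets:file ID="FILE_0001_IMG_BIN"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG-BIN/00000001_BIN.tif"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-SEG-PAGE">
      <mets:file ID="FILE_0001_SEG_PAGE"><mets:FLocat LOCTYPE="OTHER" xlink:href="file://OCR-D-IMG-BIN/00000001_SEG_PAGE.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""

problems = misplaced_files(SAMPLE)
```
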