This script takes one or more multi-page PDFs or Tiffs and generates the directory structure necessary to ingest each of them into an Islandora instance as a book object.
It assumes that each source file contains the entirety of a single book.
This script requires Python 3.
- Clone this repository
- Install dependencies
pip (or pip3) install -r requirements.txt
- Run
This script requires the following to be available in the working PATH:
- ghostscript
- ImageMagick convert
- ImageMagick identify
It also needs:
- tesseract, unless you specify the --skip-hocr-ocr option
- kdu_compress, unless you specify the --skip-jp2 option
If you specify the --skip-derivatives option, neither is required.
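The requirements above can be checked before a long run. The following is a minimal sketch, not part of this repository: the function name missing_tools is hypothetical, and the executable names (in particular "gs" for ghostscript) are assumptions that may differ on your platform.

```python
import shutil

# Assumed executable names; ghostscript is usually installed as "gs" on
# Unix-like systems, but this is not confirmed by the script itself.
ALWAYS_REQUIRED = ["gs", "convert", "identify"]


def missing_tools(skip_derivatives=False, skip_hocr_ocr=False, skip_jp2=False,
                  which=shutil.which):
    """Return the required executables that cannot be found on PATH.

    The `which` parameter exists only so the lookup can be replaced in tests.
    """
    required = list(ALWAYS_REQUIRED)
    if not skip_derivatives:
        # tesseract and kdu_compress are only needed when derivatives are built.
        if not skip_hocr_ocr:
            required.append("tesseract")
        if not skip_jp2:
            required.append("kdu_compress")
    return [tool for tool in required if which(tool) is None]
```

Calling missing_tools(skip_jp2=True) returns an empty list when everything needed for that set of options is installed.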
This is the main script, which does the bulk of the work in generating your book object.
The script takes a file or a directory of files. For each file, it creates a clean directory name from the file name, with spaces replaced by underscores and _dir appended.
e.g. The Heart of the Continent.tiff --> The_Heart_of_the_Continent_dir
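The naming rule can be sketched in a few lines. This is an illustration only, assuming the extension is dropped and nothing other than spaces is altered; the helper name book_dir_name is hypothetical and the real script may clean other characters as well.

```python
from pathlib import Path


def book_dir_name(source_file):
    """Build the book directory name for a source file (illustrative only)."""
    # Drop any leading directories and the extension, keeping the bare name.
    stem = Path(source_file).stem
    # Replace spaces with underscores and append the _dir suffix.
    return stem.replace(" ", "_") + "_dir"


print(book_dir_name("The Heart of the Continent.tiff"))
# prints: The_Heart_of_the_Continent_dir
```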
If you provide the --mods-dir option, it should point to a directory containing MODS files with the same name as the source file but with a .mods extension (e.g. The_Heart_of_the_Continent.mods).
Note: You can alter the MODS file extension with the --mods-extension argument.
If you don't provide a --mods-dir option but your files argument is a directory, then that same directory will be checked for MODS files.
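The matching rule might look roughly like the sketch below. The function find_mods is hypothetical and simplifies the behaviour described above: when no MODS directory is given it always looks next to the source file, whereas the script only does so when the files argument is a directory.

```python
from pathlib import Path


def find_mods(source_file, mods_dir=None, extension="mods"):
    """Return the matching MODS path for a source file, or None if absent."""
    source = Path(source_file)
    # Simplification: fall back to the source file's own directory.
    search_dir = Path(mods_dir) if mods_dir is not None else source.parent
    # Same base name as the source, but with the MODS extension.
    candidate = search_dir / f"{source.stem}.{extension.lstrip('.')}"
    return candidate if candidate.is_file() else None
```

For example, find_mods("INPUT/MyPDF.pdf", extension="xml") would return INPUT/MyPDF.xml if that file exists.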
Running multipage2book.py with a -h or --help argument will get you a description of the possible options.
usage: multipage2book.py [-h] [--password PASSWORD] [--overwrite] [--language LANGUAGE] [--resolution RESOLUTION] [--use-hocr] [--mods-dir MODS_DIR] [--mods-extension MODS_EXTENSION]
[--output-dir OUTPUT_DIR] [--merge] [--skip-derivatives] [--skip-hocr-ocr] [--skip-jp2] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
files
Turn a PDF/Tiff or set of PDFs/Tiffs into properly formatted directories for Islandora Book Batch.
positional arguments:
files A file or directory of files to process.
optional arguments:
-h, --help show this help message and exit
--password PASSWORD Password to use when parsing PDFs.
--overwrite Overwrite any existing Tiff/PDF/OCR/Hocr files with new copies.
--language LANGUAGE Language of the source material, used for OCRing. Defaults to eng.
--resolution RESOLUTION
Resolution of the source material, used when generating Tiff. Defaults to 300.
--use-hocr Generate OCR by stripping HTML characters from HOCR, otherwise run tesseract a second time. Defaults to use tesseract.
--mods-dir MODS_DIR Directory of files with a matching name but with the extension "mods" to be added to the books.
--mods-extension MODS_EXTENSION
The extension of the MODS files existing in the above directory. Files are matched based on filename but with this extension. Defaults to 'mods'
--output-dir OUTPUT_DIR
Directory to build books in, defaults to current directory.
--merge Files that have the same name but with a numeric suffix are considered the same book and directories are merged. (ie. MyBook1.pdf and MyBook2.pdf)
--skip-derivatives Only split the source file into the separate pages and directories, don't generate derivatives.
--skip-hocr-ocr Do not generate OCR/HOCR datastreams, this cannot be used with --skip-derivatives
--skip-jp2 Do not generate JP2 datastreams, this cannot be used with --skip-derivatives
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set logging level, defaults to ERROR.
- Process a PDF file into the correct directory structure, with only the individual PDF pages split out.
./multipage2book.py --output-dir=OUTPUT --skip-derivatives MyBook.pdf
This creates the following structure:
OUTPUT/
  MyBook_dir/
    PDF.pdf
    1/
      PDF.pdf
    2/
      PDF.pdf
    ...
- Process a PDF file into the correct directory structure with simple derivatives from the source.
./multipage2book.py --output-dir=OUTPUT --skip-hocr-ocr --skip-jp2 MyBook.pdf
This creates the following structure:
OUTPUT/
  MyBook_dir/
    PDF.pdf
    TN.jpg
    1/
      OBJ.tiff
      PDF.pdf
      JPG.jpg
      TN.jpg
    2/
      OBJ.tiff
      PDF.pdf
      JPG.jpg
      TN.jpg
    ...
- Process a PDF file into the correct directory structure, copying the MODS file down to the pages.
Assuming a directory called "INPUT":
INPUT/
  MyPDF.pdf
  MyPDF.xml
Then calling:
./multipage2book.py INPUT --output-dir=OUTPUT --skip-hocr-ocr --skip-jp2 --mods-extension=xml
This creates the following structure:
OUTPUT/
  MyPDF_dir/
    PDF.pdf
    MODS.xml
    TN.jpg
    1/
      JPG.jpg
      MODS.xml
      OBJ.tiff
      PDF.pdf
      TN.jpg
    2/
      JPG.jpg
      MODS.xml
      OBJ.tiff
      PDF.pdf
      TN.jpg
    ...
- The --merge option is useful, but problematic. Its use case is when a single Tiff could not hold all the pages of a book. As long as the various files share a common basename with an integer appended (e.g. SomeBook1.tiff, SomeBook2.tiff, SomeBook3.tiff), they will all be combined into a single set of pages.
Normally you can process a book over top of a previous run, and the script will just fill in the missing parts. However, the --merge option requires that there NOT be a book directory in the output directory. Because we are adding pages, we can't guarantee correct order and numbering unless it starts fresh each time.
Also, any MODS file must match the filename WITHOUT the numeric suffix.
e.g. MyTitle1.tiff -> MyTitle.mods
Assuming an "INPUT" directory containing 3 files, each with 10 pages:
INPUT/
  MyBook1.tiff
  MyBook2.tiff
  MyBook3.tiff
  MyBook.mods
we process them with
./multipage2book.py INPUT --output-dir=OUTPUT --merge --skip-derivatives
Warning: merge attempts to combine multiple files that start with the same name and end with a digit before the extension. Files are sorted by the number and require an empty starting directory. If the expected directory contains files, it will halt with a warning.
Press any key to proceed
The output directory would look like
OUTPUT/
  MyBook_dir/
    MODS.xml
    1/
      OBJ.tiff
      MODS.xml
    2/
      OBJ.tiff
      MODS.xml
    ...
    29/
      OBJ.tiff
      MODS.xml
    30/
      OBJ.tiff
      MODS.xml
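The grouping and ordering that --merge relies on can be illustrated as follows. This is a sketch under assumptions, not the script's own code: the helper group_for_merge and its regular expression are hypothetical, and only source files (not MODS files) are expected as input.

```python
import re
from collections import defaultdict
from pathlib import Path

# A base name followed by a trailing integer, e.g. "MyBook" + "10".
NUMERIC_SUFFIX = re.compile(r"^(?P<base>.*?)(?P<num>\d+)$")


def group_for_merge(filenames):
    """Group source files by base name, ordered by their numeric suffix."""
    groups = defaultdict(list)
    for name in filenames:
        stem = Path(name).stem
        match = NUMERIC_SUFFIX.match(stem)
        if match and match.group("base"):
            key, number = match.group("base"), int(match.group("num"))
        else:
            # No numeric suffix (or an all-digit name): a book on its own.
            key, number = stem, 0
        groups[key].append((number, name))
    # Sort numerically so that MyBook10 comes after MyBook2.
    return {base: [n for _, n in sorted(parts)] for base, parts in groups.items()}


print(group_for_merge(["MyBook10.tiff", "MyBook2.tiff", "MyBook1.tiff"]))
# prints: {'MyBook': ['MyBook1.tiff', 'MyBook2.tiff', 'MyBook10.tiff']}
```

Sorting on the integer value rather than the file name is what keeps page numbering correct once a book spans ten or more files.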
The hocrpdf.py class is included in such a way that if you specify a --loglevel of DEBUG, any searchable PDFs generated will have the text visibly written over the page image. Only use this setting for debugging, never for production.
Along with multipage2book.py there are several support classes that can be run as standalone scripts. These are:
- Derivatives.py - generate derivatives for a directory or set of directories.
- MODSSpreader.py - copy/alter a MODS file for each page of a paged content item.
- hocrpdf.py - generate a searchable PDF using an image (JP2, JPG) and an hOCR file.
All of these scripts have usage arguments that can be revealed by running them with the -h or --help argument.
hocrpdf.py is a modification/rewrite of hocr-pdf from tmbdev.
It has been modified to:
- make it a class for inclusion in other code
- modify the calculation of the word box base
- change from using setTextOrigin() to using setTextTransform() to assign the rotation of the box
- stop using the included invisible font
- set the font height to match the box height to get better word highlighting
- switch from the lxml.etree library to the xml.etree library
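To give a sense of what reading hOCR with the standard-library xml.etree involves, here is a small sketch. It is not the code from hocrpdf.py: the function word_boxes is hypothetical, and it assumes the hOCR is well-formed XHTML in which each word is an element with class "ocrx_word" and a "bbox x0 y0 x1 y1" entry in its title attribute.

```python
import re
import xml.etree.ElementTree as ET

# hOCR stores the bounding box in the title attribute, e.g.
# title="bbox 10 20 110 60; x_wconf 95"
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def word_boxes(hocr_text):
    """Return (text, (x0, y0, x1, y1)) for every word element in the hOCR."""
    root = ET.fromstring(hocr_text)
    words = []
    # Iterate over every element so the XHTML namespace does not matter.
    for element in root.iter():
        if element.get("class") != "ocrx_word":
            continue
        match = BBOX.search(element.get("title", ""))
        if match:
            # itertext() also collects text nested in child tags like <strong>.
            text = "".join(element.itertext()).strip()
            words.append((text, tuple(int(v) for v in match.groups())))
    return words
```

Each returned box gives the pixel coordinates a PDF writer would need in order to place the word's text over the page image.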