Document Processing Pipelines

Where to start? How about ImportJob: our progress-reporting mechanism. No matter how you upload files, Overview can always tell you:

  • The DocumentSet the files will go into. (Overview always creates a document set first and adds files to it second.)
  • Progress-reporting information: a way to set the user's expectations.
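
In code terms, an ImportJob boils down to those two pieces of information. A minimal sketch, with illustrative field names (not Overview's actual API):

```scala
// Illustrative sketch only -- field names are assumptions, not Overview's API.
case class ImportJobSketch(
  documentSetId: Long,          // the DocumentSet the files will go into
  progress: Option[Double],     // fraction complete (0.0 to 1.0), if known
  description: Option[String]   // human-readable status, to set expectations
)
```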

Beyond that, our import pipelines have a bit in common:

  • Every pipeline creates Document objects.
  • Documents are always generated in Overview's "worker" process (as opposed to its "web server" process).

What We're Generating

Each import pipeline creates Documents. Document data is stored in a few places:

  • Most document data is in the Postgres database, in the document table. In particular, document text and title (which Overview generates within these pipelines) and document notes and metadata (which the user provides) are stored here.
  • Tags are in the tag table, and document-tag lists are in the document_tag table.
  • Processed uploaded files are Files, with metadata in the file table and file contents in BlobStorage (Amazon S3 or the filesystem). Alongside each uploaded file is a generated PDF that Overview lets the user view.
  • When the user chooses to split by page, Overview generates a PDF per page for the user to view: that's in the page table and in BlobStorage.
  • Thumbnails are in BlobStorage.
  • Each document set also has a Lucene index containing document titles, text and metadata. The worker maintains those indexes on the filesystem.
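
As a rough mental model, the split looks something like this. Every name below is illustrative, not Overview's actual schema:

```scala
// Illustrative only: a simplified view of where each piece of a Document lives.
case class DocumentSketch(
  id: Long,                          // document table (Postgres)
  title: String,                     // document table -- generated by the pipeline
  text: String,                      // document table -- generated by the pipeline
  metadataJson: String,              // document table -- supplied by the user
  fileId: Option[Long],              // file table; bytes and view PDF in BlobStorage
  pageId: Option[Long],              // page table, when splitting by page
  thumbnailLocation: Option[String]  // BlobStorage key for the thumbnail
)
// Tags live in the tag and document_tag tables; title, text and metadata are
// also indexed in the document set's Lucene index on the worker's filesystem.
```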

Pipelines

File Upload Pipeline

See Ingest Pipeline. We plan to make this the only Pipeline in Overview.

Right now, a bit of a hack remains:

  1. User uploads files into GroupedFileUploads (and Postgres Large Objects).

    1. On demand, the server creates a FileGroup to hold all the files the user will upload. (There is one FileGroup per User+DocumentSet, and DocumentSet may be null here.)
    2. The user streams each file into a GroupedFileUpload, assigning a client-generated GUID to handle resuming. See js-mass-upload for design details.
    3. The user clicks "Finish". Overview creates the DocumentSet (if it's a new document set), sets FileGroup.addToDocumentSetId and kicks off the worker process.
  2. For each file:

    1. Worker converts GroupedFileUpload to WrittenFile and deletes GroupedFileUpload's associated Postgres Large Object.
    2. Overview runs the Ingest Pipeline on the WrittenFile.
  3. After all files are ingested, the worker sorts the documents and writes the result to document_set.sorted_document_ids.
  4. Worker deletes the FileGroup. (The worker's side of steps 2–4 is sketched below.)
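
Here is a minimal sketch of the worker's side of steps 2–4, using hypothetical types and helper names; the real logic lives in Overview's worker and the Ingest Pipeline:

```scala
// Every name here is a hypothetical stand-in; only the order of operations
// mirrors the steps above.
object FileGroupWorkerSketch {
  case class FileGroup(id: Long, addToDocumentSetId: Long)
  case class GroupedFileUpload(id: Long, largeObjectOid: Long, name: String)
  case class WrittenFile(id: Long, name: String)

  // Step 2.1: copy bytes out of the Postgres Large Object, then delete it.
  def convertToWrittenFile(upload: GroupedFileUpload): WrittenFile =
    WrittenFile(upload.id, upload.name)

  // Step 2.2: run the Ingest Pipeline, which creates Document rows.
  def ingest(written: WrittenFile, documentSetId: Long): Unit = ()

  // Step 3: sort documents and write document_set.sorted_document_ids.
  def writeSortedDocumentIds(documentSetId: Long): Unit = ()

  // Step 4: delete the FileGroup, which also ends progress reporting.
  def deleteFileGroup(fileGroup: FileGroup): Unit = ()

  def process(fileGroup: FileGroup, uploads: Seq[GroupedFileUpload]): Unit = {
    uploads.foreach { upload =>
      val written = convertToWrittenFile(upload)
      ingest(written, fileGroup.addToDocumentSetId)
    }
    writeSortedDocumentIds(fileGroup.addToDocumentSetId)
    deleteFileGroup(fileGroup)
  }
}
```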

When the user asks for a progress report, the web server builds an ImportJob from the file_group table.
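
For instance, the progress fraction could be derived roughly like this; the column names here are assumptions for illustration, not necessarily what the file_group table actually stores:

```scala
object ProgressSketch {
  // Hypothetical: nFiles and nFilesProcessed are illustrative column names.
  case class FileGroupProgressRow(nFiles: Int, nFilesProcessed: Int)

  def progressOf(row: FileGroupProgressRow): Option[Double] =
    if (row.nFiles > 0) Some(row.nFilesProcessed.toDouble / row.nFiles)
    else None
}
```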

DocumentCloud Pipeline

TODO

CSV-Import Pipeline

TODO
