First iteration of transfer step is super-brittle and must be rewritten #6

kspurgin opened this issue Jan 5, 2021 · 0 comments
kspurgin commented Jan 5, 2021

Based on previous discussion with Mark, I had collectionspace-mapper return a record_status attribute (:new or :existing) in the result retrieved during the Process step.

The *_processing_report.csv file produced by the Process step contains a column INFO: record status with this information, which is also cached along with the XML payload (and CSID/path info for existing records) for use in the subsequent Transfer step.
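For illustration, a cached entry for one record might look roughly like the following; the field names here are hypothetical, not the actual cache schema:

```ruby
# Hypothetical shape of what the Process step caches per record (illustrative
# field names; not the real cache schema).
cached_record = {
  record_status: :existing,               # :new or :existing, from collectionspace-mapper
  xml: "<document>...</document>",        # mapped XML payload
  csid: "some-csid",                      # present only for :existing records
  uri: "collectionobjects/some-csid"      # path info used for PUT/DELETE
}
```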

An initial pass at transfer_job invokes RecordTransferService, which uses the combination of result.record_status and Step::Transfer's action_delete, action_create, and action_update attributes to determine

a) whether to attempt to transfer a given record
b) what type of transfer to do for a given record
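Roughly, that decision looks like the following (method and attribute names are illustrative, not the actual RecordTransferService code):

```ruby
# Sketch of the current cached-status-based decision; a nil return means the
# record is skipped by the Transfer step.
def transfer_action(record_status, step)
  case record_status
  when :new
    :create if step.action_create
  when :existing
    if step.action_delete
      :delete
    elsif step.action_update
      :update
    end
  end
end
```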

For a pre-alpha proof-of-concept stage, this basically works if you assume:

  • you are transferring the batch immediately after processing it
  • no one is adding or deleting records manually in CollectionSpace while you are processing this batch
  • no one is running other batches that add or delete the same record IDs
  • the transfer step will not fail or be cancelled, then be reset and restarted

Of course in reality, we can assume none of these things going forward, and we need a new approach because of the following:

  • Relying on the record_status in the mapper result, or on any cached status value, can cause problems. Sending a POST to create a record that already exists produces a duplicate record in CollectionSpace, so a cancelled or failed transfer job that is reset has the potential to create duplicates. This can be worked around by resetting the cached record status value to :existing as part of completing a successful POST/create operation (sketched after this list). HOWEVER, this does not protect against records added manually, or records added in other batches. The problematic possibilities are all compounded if you imagine some delay between the processing and transfer steps.
  • It appears that sending a PUT to update a previously existing, now non-existent record does not return an error OR recreate the record. This poses less of a batch-import liability (unless we imagine starting a delete transfer, cancelling it, resetting, and redoing it as a transfer that includes updates, OR another batch deleting the same record between the process and transfer steps).
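For reference, the partial workaround mentioned above would look something like this (all names hypothetical; the cache update only helps within this batch's own view of the world):

```ruby
# Partial workaround: flip the cached status to :existing once a POST/create
# succeeds, so a reset-and-restarted transfer job does not re-POST the record.
# This does NOT protect against records created manually or by other batches.
def create_record(client, cache, record)
  response = client.post(record.service_path, record.xml)  # hypothetical client call
  if response.success?
    cache.update(record.id, record_status: :existing, uri: response.location)
  end
  response
end
```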

Some, but not all, of this can be handled with a combination of caching record info per client/connection instead of per batch, and updating the cached record status value after successful create or delete transfers, but this gets into other complications that could introduce really bizarre behavior, like the record payload from one batch being transferred by another batch.

The safest and most straightforward way through seems to be pinging the affected record in the live CollectionSpace instance immediately before any transfer operation. This will likely slow down the transfer job, but it means fewer reads/writes to Redis/cache (i.e. less expensive to run via AWS?).
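A sketch of that approach, assuming a client call that can look a record up by its identifier in the live instance (all method names here are hypothetical):

```ruby
# Check the live instance immediately before transferring, rather than trusting
# the status cached at Process time. All client methods are illustrative.
def safe_transfer(client, record, step)
  status = client.find_by_identifier(record.id) ? :existing : :new

  case status
  when :new
    client.create(record.xml) if step.action_create
  when :existing
    if step.action_delete
      client.delete(record.uri)
    elsif step.action_update
      client.update(record.uri, record.xml)
    end
  end
end
```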
