First iteration of transfer step is super-brittle and must be rewritten #6

kspurgin opened this issue Jan 5, 2021 · 0 comments
kspurgin commented Jan 5, 2021

Based on previous discussion with Mark, I had collectionspace-mapper return a record_status attribute (:new or :existing) in the result retrieved during the Process step.

The *_processing_report.csv file produced by the Process step contains a column INFO: record status with this information, which is also cached along with the XML payload (and CSID/path info for existing records) for use in the subsequent Transfer step.
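For illustration, a cached entry for one record might look roughly like the following; the field names here are hypothetical, not the actual cache schema:

```ruby
# Hypothetical shape of what the Process step caches per record (illustrative
# field names; not the real cache schema).
cached_record = {
  record_status: :existing,               # :new or :existing, from collectionspace-mapper
  xml: "<document>...</document>",        # mapped XML payload
  csid: "some-csid",                      # present only for :existing records
  uri: "collectionobjects/some-csid"      # path info used for PUT/DELETE
}
```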

An initial pass at transfer_job invokes RecordTransferService, which uses the combination of result.record_status and Step::Transfer's action_delete, action_create, and action_update attributes to determine

a) whether to attempt to transfer a given record
b) what type of transfer to do for a given record
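Roughly, that decision looks like the following (method and attribute names are illustrative, not the actual RecordTransferService code):

```ruby
# Sketch of the current cached-status-based decision; a nil return means the
# record is skipped by the Transfer step.
def transfer_action(record_status, step)
  case record_status
  when :new
    :create if step.action_create
  when :existing
    if step.action_delete
      :delete
    elsif step.action_update
      :update
    end
  end
end
```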

For a pre-alpha proof-of-concept stage, this basically works if you assume:

  • you are transferring the batch immediately after processing it
  • no one is adding or deleting records manually in CollectionSpace while you are processing this batch
  • no one is running other batches that add or delete the same record IDs
  • the transfer step will not fail or be cancelled, then be reset and restarted

Of course in reality, we can assume none of these things going forward, and we need a new approach because of the following:

  • Relying on the record_status in the mapper result, or on any cached status value, can cause problems. Sending a POST to create a record that already exists produces a duplicate record in CollectionSpace, so a cancelled or failed transfer job that is reset has the potential to create duplicates. This can be worked around by resetting the cached record status value to :existing as part of completing a successful POST/create operation (sketched after this list). HOWEVER, this does not protect against records added manually, or records added in other batches. The problematic possibilities are all compounded if you imagine some delay between the processing and transfer steps.
  • It appears that sending a PUT to update a previously existing, now non-existent record does not return an error OR recreate the record. This poses less of a batch-import liability (unless we imagine starting a delete transfer, cancelling it, resetting, and redoing it as a transfer that includes updates, OR another batch deleting the same record between the process and transfer steps).
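For reference, the partial workaround mentioned above would look something like this (all names hypothetical; the cache update only helps within this batch's own view of the world):

```ruby
# Partial workaround: flip the cached status to :existing once a POST/create
# succeeds, so a reset-and-restarted transfer job does not re-POST the record.
# This does NOT protect against records created manually or by other batches.
def create_record(client, cache, record)
  response = client.post(record.service_path, record.xml)  # hypothetical client call
  if response.success?
    cache.update(record.id, record_status: :existing, uri: response.location)
  end
  response
end
```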

Some, but not all, of this can be handled with a combination of caching record info per client/connection instead of per batch, and updating the cached record status value after successful create or delete transfers, but this gets into other complications that could introduce really bizarre behavior, like the record payload from one batch being transferred by another batch.

The safest and most straightforward way through seems to be pinging the affected record in the live CollectionSpace instance immediately before any transfer operation. This will likely slow down the transfer job, but it means fewer reads/writes to Redis/cache (i.e. less expensive to run via AWS?).
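A sketch of that approach, assuming a client call that can look a record up by its identifier in the live instance (all method names here are hypothetical):

```ruby
# Check the live instance immediately before transferring, rather than trusting
# the status cached at Process time. All client methods are illustrative.
def safe_transfer(client, record, step)
  status = client.find_by_identifier(record.id) ? :existing : :new

  case status
  when :new
    client.create(record.xml) if step.action_create
  when :existing
    if step.action_delete
      client.delete(record.uri)
    elsif step.action_update
      client.update(record.uri, record.xml)
    end
  end
end
```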
