Based on previous discussion w/Mark, I'd made `collectionspace-mapper` return a `record_status` attribute (`:new` or `:existing`) in its result retrieved in the Process step.
The `*_processing_report.csv` file produced by the Process step contains a column `INFO: record status` with this information, which is also cached along with the XML payload (and CSID/path info for existing records) for use in the subsequent Transfer step.
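For illustration, the cached entry per record might look something like this (the field names and key format are my guesses based on the description above, not the actual cache schema):

```ruby
# Hypothetical shape of the per-record data cached by the Process step.
# Field names and the key format are assumptions for illustration only.
def build_cache_entry(result)
  {
    record_status: result.record_status, # :new or :existing
    xml: result.doc.to_s,                # mapped XML payload
    csid: result.csid,                   # nil for :new records
    uri: result.uri                      # nil for :new records
  }
end

# e.g. cache.put("batch:#{batch_id}:#{record_id}", build_cache_entry(result))
```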
An initial pass at `transfer_job` invokes `RecordTransferService`, which uses the combination of `result.record_status` and `Step::Transfer`'s `action_delete`, `action_create`, and `action_update` attributes to determine (see the sketch after this list):

a) whether to attempt to transfer a given record
b) what type of transfer to do for a given record
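As a rough sketch of that decision (method names and flag precedence here are assumptions based on this description, not the actual `RecordTransferService` code):

```ruby
# Combines a record's cached status with the batch's action flags to pick
# a transfer operation, or nil to skip the record. Delete is checked first,
# on the assumption a delete batch won't also have updates enabled.
def transfer_action(record_status, step)
  case record_status
  when :new
    :create if step.action_create
  when :existing
    if step.action_delete then :delete
    elsif step.action_update then :update
    end
  end
end

# transfer_action(:new, step)      #=> :create, or nil if creates are disabled
# transfer_action(:existing, step) #=> :delete, :update, or nil
```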
For a pre-alpha proof-of-concept stage, this basically works, if you assume:

- you are transferring the batch right after processing the batch
- no one is adding/deleting records manually in CollectionSpace while you are processing this batch
- no one is running other batches that add or delete the same record IDs
- the transfer step will not fail or be cancelled, then be reset and restarted
Of course in reality, we can assume none of these things going forward, and we need a new approach because of the following:
Relying on the `record_status` in the mapper result, or on any cached status value, can cause problems. Sending a POST to create a new record when the record already exists creates a duplicate record in CollectionSpace. A cancelled or failed transfer job that is reset thus has the potential to create duplicate records. This can be worked around by resetting the cached record status value to `:existing` as part of completing a successful POST/create operation. HOWEVER, this does not protect against records added manually, or records added in other batches. The problematic possibilities are all compounded if you imagine some delay between the processing and transfer steps.
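That partial mitigation would look roughly like this (the client and cache calls are placeholders, not the actual interfaces):

```ruby
# Flip the cached status to :existing as soon as a create succeeds, so a
# cancelled-and-reset transfer job won't POST the same record twice.
# Does nothing for records created manually or by other batches.
def create_record(client, cache, record_id, service_path, payload)
  response = client.post(service_path, payload) # placeholder client call
  cache.merge(record_id, record_status: :existing) if response.success?
  response
end
```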
It appears that sending a PUT to update a previously existing, now non-existent record does not return an error OR recreate the record. This is less of a batch-import liability (unless we imagine starting a delete transfer, cancelling it, resetting, and redoing it as a transfer that includes updates, OR another batch delete covering the record happening between the process and transfer steps).
Some, but not all, of this can be handled with a combination of caching record info per client/connection instead of per-batch, and updating the cached record status value after successful create or delete transfers, but this gets into other complications that could introduce genuinely bizarre behavior, like the record payload from one batch being transferred by another batch.
The safest and most straightforward way through seems to be pinging the affected record in the live CollectionSpace instance immediately before any transfer operation. This will likely slow down the transfer job, but it means fewer reads/writes to Redis/cache (i.e., less expensive to run via AWS?).
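A sketch of that pre-transfer check, following the search-by-identifier pattern in collectionspace-client (treat the exact calls and response parsing as assumptions):

```ruby
# Ask the live CollectionSpace instance whether the record exists right now,
# ignoring any cached status. Returns :existing or :new.
def live_status(client, type, record_id)
  response = client.find(type: type, value: record_id)
  total = response.parsed.dig("abstract_common_list", "totalItems").to_i
  total.positive? ? :existing : :new
end

# Immediately before each transfer:
#   status = live_status(client, "collectionobjects", record_id)
#   action = transfer_action(status, step) # from the sketch above
```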