CoRB2 needs to be restartable #65
Comments
It's a good idea. Other workarounds, such as adding docs to a CoRB-job-specific collection, add extra overhead. It would be nice to have a client-only mechanism and an easy option to enable it for jobs. Things to consider:
Good points. I am edging towards the idea of expanding on the disk-queue option and making it a requirement if the job needs to be restartable - unless we are using the file loader, streaming loader, etc. We probably need a simple marker (file) to keep track of the indexes of URIs already processed, or the indexes of URIs currently in process when the job was killed, whichever is more efficient. We may be able to use this information to filter out what is already processed and what remains to be processed when the job is restarted. If the job started with a tracking file along with the temp file (if not a URIS-FILE), then we can assume it is a restart. So, instead of a temp file with delete-on-exit, we may be able to change to delete on 'clean' exit. We need a way to report back (as errors) if these files are left undeleted when the job was killed but not restarted. This is not an easy problem to solve, and any batch operation that does this has to track the processing information in a file or a database.
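To make the tracking-file idea concrete, here is a minimal sketch. The class name `ProcessedUriTracker` and the file layout (one completed URI per line) are assumptions for illustration, not actual CoRB code: append each completed URI as it finishes, and delete the file only on a clean exit so a killed job leaves the record behind.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical tracker: appends each completed URI to a side file and
// removes that file only when the job finishes cleanly. If the JVM is
// killed, the file survives and signals that a restart is needed.
public class ProcessedUriTracker implements AutoCloseable {
    private final Path trackingFile;
    private final BufferedWriter writer;
    private boolean cleanExit = false;

    public ProcessedUriTracker(Path trackingFile) throws IOException {
        this.trackingFile = trackingFile;
        this.writer = Files.newBufferedWriter(trackingFile,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Called after a URI is confirmed processed.
    public synchronized void markCompleted(String uri) throws IOException {
        writer.write(uri);
        writer.newLine();
        writer.flush(); // flush so a kill loses at most the in-flight URIs
    }

    public void markCleanExit() {
        cleanExit = true;
    }

    @Override
    public void close() throws IOException {
        writer.close();
        if (cleanExit) {
            Files.deleteIfExists(trackingFile); // delete on 'clean' exit only
        }
    }

    public static void main(String[] args) throws IOException {
        try (ProcessedUriTracker tracker =
                new ProcessedUriTracker(Paths.get("corb-completed-uris.txt"))) {
            tracker.markCompleted("/docs/example-1.xml");
            tracker.markCleanExit(); // omit this call to simulate an unclean exit
        }
    }
}
```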
I am thinking of using a parameter similar to URIS-COMPLETED-FILE (or something a bit more obvious), to which completed URIs will be written by the Monitor class, which is where we track completed URIs. The challenge is to make this transparent to users, i.e., if the parameter is specified, the job should always be restartable without user intervention during the restart (as the restart can/often is done by the scheduler) - not forcing the user to rename files (the user can do it if he/she wants to), update the parameters, or use a combination of parameters. I am not sure how to do this yet (looking for ideas?), but I am thinking that if the completed-URIs file exists, we use it at the start of the 'restart' job and move the previous file aside so that the file name can be reused to track newly completed URIs. Maybe we can append a timestamp to the previous completed-URIs file name to avoid it being overwritten.
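A sketch of that restart handling, under the same assumptions (the file name, rotation scheme, and helper name are illustrative, not the actual implementation): if the completed-URIs file already exists, read it into a set, rename the old file with a timestamp suffix so the name can be reused, and skip any URI found in the set.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical restart helper: loads previously completed URIs and
// moves the old file aside so the new run can write to the same name.
public class RestartHelper {

    public static Set<String> loadAndRotate(Path completedFile) throws IOException {
        Set<String> completed = new HashSet<>();
        if (Files.exists(completedFile)) {
            completed.addAll(Files.readAllLines(completedFile));
            // append a timestamp so the previous run's record is not overwritten
            Path archived = completedFile.resolveSibling(
                    completedFile.getFileName() + "." + System.currentTimeMillis());
            Files.move(completedFile, archived);
        }
        return completed;
    }

    public static void main(String[] args) throws IOException {
        Set<String> alreadyDone = loadAndRotate(Paths.get("corb-completed-uris.txt"));
        String candidate = "/docs/example-1.xml";
        if (alreadyDone.contains(candidate)) {
            System.out.println("skip " + candidate + " (completed in previous run)");
        } else {
            System.out.println("process " + candidate);
        }
    }
}
```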
@hansenmc @vjsaradhi - Please comment if you have any ideas or find mistakes in my approach. Also, we need to think about how we should do this for the new loaders we added for v2.4.0. Is there a way to diff very large files in Java? In our case, we need to find which of the URIs from the original URIs file are missing from the completed-URIs file. I couldn't find any reliable open source implementation. I am wondering if we could write it ourselves - this only works because both files are sorted and the completed-URIs file is always a subset of the original URIs file. I will need to experiment with this on very large files.
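Since both files are sorted and the completed file is a subset of the original, a single merge-style pass should work without loading either file into memory. A rough sketch (class and file names are illustrative, not part of CoRB):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Merge-style diff of two sorted URI files: writes every URI that appears
// in the original file but not in the completed file. Runs in a single
// streaming pass, so file size is not a memory constraint.
public class SortedUriDiff {

    public static void diff(Path original, Path completed, Path remaining) throws IOException {
        try (BufferedReader orig = Files.newBufferedReader(original);
             BufferedReader done = Files.newBufferedReader(completed);
             BufferedWriter out = Files.newBufferedWriter(remaining)) {

            String doneUri = done.readLine();
            String origUri;
            while ((origUri = orig.readLine()) != null) {
                if (doneUri != null && origUri.equals(doneUri)) {
                    doneUri = done.readLine(); // matched, advance the completed file
                } else {
                    out.write(origUri);        // not completed, needs reprocessing
                    out.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        diff(Paths.get("original-uris.txt"),
             Paths.get("completed-uris.txt"),
             Paths.get("remaining-uris.txt"));
    }
}
```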
Question: How can we implement restartability on the new loaders, i.e., the streaming, zip, and directory loaders? I am hoping these are not going to be difficult, but I will keep that implementation towards the end. Note: Once the diff is done, we can restart CoRB with the missing URIs as the URIS-FILE. My big concern is how to get restartability working seamlessly without much user intervention.
bump - I have a customer who would be interested in using this feature. For now they're just restarting the entire job. This would be a very useful feature for users who have large and complicated collection processes.
@mikeburl - sorry for the delay. This has been a little tricky to implement, though the basic idea is simple, i.e., keep track of URIs already processed and skip them if restarted. I will try to get to it in the near future. For the time being, we have been pursuing alternate approaches in our current project: 1. For update jobs, we either tag processed docs with a collection or update a field, so the next time we run the job, it won't pick up already-updated docs. 2. For read-only jobs, split CoRB into two parts, i.e., use the module executor to run the selector and dump the URIs to a file, then use the URIS-FILE option to run the transform, which writes processed URIs to a file. If the job needs to be restarted, we can do a delta between the two files to figure out which URIs haven't yet been processed. This could be automated via a shell script. For a longer-term solution, we could probably build this second option into CoRB itself.
We need to figure out a way to make CoRB restartable. We achieve this today with some workarounds using control documents, but very few batch operations rely on control docs. We could potentially write the processed URIs to a local file and filter them out when the job is restarted.