
[8.0] Change TS addFile logic for Deleted files #7758

Closed
wants to merge 9 commits

Conversation

arrabito (Contributor) commented Aug 21, 2024

In the current 8.0 release, when a file that was attached to a transformation is removed from the DFC, it is set to 'Deleted' status in the TransformationFile table. If a new file with the same LFN as the removed one is later added again to the DFC (and to the TSCatalog), the current implementation of the TS addFile method does not update the status of this file. As a consequence, the newly produced file remains in 'Deleted' status and will never be processed by the transformation. The same holds for the TS setMetadata method.

With this PR, the addFile and setMetadata methods check whether a file already attached to the transformation is in 'Deleted' status; if the file must be attached to the transformation, its status is changed from 'Deleted' to 'Unused'. For coherence, the status in the DataFile table is also updated from 'Deleted' to 'New'.

BEGINRELEASENOTES

*TransformationSystem
CHANGE: Check if file previously DELETED is back in DFC

ENDRELEASENOTES

@DIRACGridBot DIRACGridBot added the alsoTargeting:integration Cherry pick this PR to integration after merge label Aug 21, 2024
    for fileDict in res["Value"]:
        fileIDs.append(fileDict["FileID"])
        if fileDict["Status"] == "Deleted":
            res = self.__setTransformationFileStatus(list(fileIDs), "Unused", connection=connection)
Contributor:

fileIDs is appended to inside the loop, probably one should only use the current fileID?

Contributor:

Or maybe split the fileIDs into Deleted and New, and call the status update once per group?
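
A minimal sketch of that alternative, assuming the surrounding addFile context (res["Value"], fileIDs, __setTransformationFileStatus and TransformationFilesStatus as in the snippet above); anything not shown in the diff is illustrative:

    deletedIDs = []
    for fileDict in res["Value"]:
        fileIDs.append(fileDict["FileID"])
        # collect previously deleted files instead of updating inside the loop
        if fileDict["Status"] == "Deleted":
            deletedIDs.append(fileDict["FileID"])
    if deletedIDs:
        # one bulk status update for all files that were in 'Deleted'
        res = self.__setTransformationFileStatus(deletedIDs, TransformationFilesStatus.UNUSED, connection=connection)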

Contributor (author):

fileIDs is appended to inside the loop, probably one should only use the current fileID?

Yes, I agree.
What about removing:

    fileIDs.append(fileDict["FileID"])

and changing the call arguments in this way:

    [fileDict["FileID"]], TransformationFilesStatus.UNUSED, connection=connection

?
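
Put together, the loop from that proposal would look roughly like this (a sketch, not the final diff):

    for fileDict in res["Value"]:
        fileIDs.append(fileDict["FileID"])
        if fileDict["Status"] == "Deleted":
            # reset only the current file, not every ID accumulated so far
            res = self.__setTransformationFileStatus(
                [fileDict["FileID"]], TransformationFilesStatus.UNUSED, connection=connection
            )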

fstagni (Contributor) commented Aug 21, 2024

I have never thought a file could be "re-added" to the catalog.

This check is somewhat "expensive" and I am wondering if there's a better way.

arrabito (Contributor, author):

I have never thought a file could be "re-added" to the catalog.

In some cases we have jobs that failed but still managed to upload part of their output files. When this happens we simply set all the input files of the job to Unused, and a new job is created to process those files again. When the new job tries to upload its output, it finds some already existing LFNs, so we have simply instructed the job to remove the existing file and upload the new one.
We should probably improve this logic, which is not optimal, but the scenario of re-adding a file could happen anyway, since it's not forbidden.

This check is somewhat "expensive" and I am wondering if there's a better way.

I haven't thought of a better implementation, but what about just removing the file from the transformations as well when it's removed from the catalog, instead of changing its status to 'Deleted'?

fstagni (Contributor) commented Aug 22, 2024

In some cases we have jobs that failed but still managed to upload part of their output files. When this happens we simply set all the input files of the job to Unused, and a new job is created to process those files again.

This, I would argue, is not the best approach. The better way would be to set requests, which is done for you by

    def transferAndRegisterFileFailover(...)

which was created exactly for this purpose.
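
For illustration, a minimal sketch of how a job-side step might use it; the exact signature and parameter names should be checked in DIRAC.DataManagementSystem.Client.FailoverTransfer, so treat the arguments below as assumptions:

    from DIRAC.DataManagementSystem.Client.FailoverTransfer import FailoverTransfer

    failoverTransfer = FailoverTransfer()
    # try the target SE first; on failure the file goes to a failover SE and a
    # request is created to move/register it later (argument names assumed)
    res = failoverTransfer.transferAndRegisterFileFailover(
        fileName="outfile_a",
        localPath="./outfile_a",
        lfn="/vo/prod/outfile_a",
        targetSE="SOME-DEST-SE",
        failoverSEList=["SOME-FAILOVER-SE"],
        fileMetaDict={"GUID": None, "Checksum": None, "ChecksumType": "Adler32"},
    )
    if not res["OK"]:
        # standard DIRAC S_ERROR structure
        print(res["Message"])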

When the new job tries to upload its output, it finds some already existing LFNs, so we have simply instructed the job to remove the existing file and upload the new one. We should probably improve this logic, which is not optimal, but the scenario of re-adding a file could happen anyway, since it's not forbidden.

The scenario is not forbidden but still feels "odd".

This check is somewhat "expensive" and I am wondering if there's a better way.

I haven't thought of a better implementation, but what about just removing the file from the transformations as well when it's removed from the catalog, instead of changing its status to 'Deleted'?

Might be an option, but I think using the FailoverTransfer method is less error-prone.

arrabito (Contributor, author):

In some cases we have jobs that failed but still managed to upload part of their output files. When this happens we simply set all the input files of the job to Unused, and a new job is created to process those files again.

This, I would argue, is not the best approach. The better way would be to set requests, which is done for you by

    def transferAndRegisterFileFailover(...)

which was created exactly for this purpose.

I'm not sure we are talking about exactly the same scenario.
Let me give an example. We have job_1, which processes files infile_a and infile_b and which should produce outfile_a and outfile_b.
For some reason (e.g. a bug in the application) the job is able to process infile_a but not infile_b, so it uploads only outfile_a and ends up 'Failed'.
Since for our jobs we use the FailoverRequest module, e.g.:

    job.setExecutable("/bin/ls -l", modulesList=["Script", "FailoverRequest"])

this takes care of setting all input files of failed jobs to Unused, in this case infile_a and infile_b.

The new job will then process both infile_a and infile_b again, which results in the scenario I've described.

What do you suggest in this case?

fstagni (Contributor) commented Aug 22, 2024

If the job failed (because it did not manage to process all input files), why is that same job uploading the (partial) outputs? It shouldn't even try the first upload.

arrabito (Contributor, author):

If the job failed (because it did not manage to process all input files), why is that same job uploading the (partial) outputs? It shouldn't even try the first upload.

It's because we have implemented the upload of the outputs as an additional step inside the job, e.g.:

    job.setExecutable('app_1')
    job.setExecutable('upload_outs')

However, you are right: if 'app_1' fails we could, for instance, instruct app_1 to remove the partially produced files so that 'upload_outs' has nothing to upload.
I also take the opportunity to ask you a fundamental question about the DIRAC job execution logic.

Why, if the 'app_1' step fails, is the 'upload_outs' step executed anyway?

fstagni (Contributor) commented Aug 23, 2024

Why, if the 'app_1' step fails, is the 'upload_outs' step executed anyway?

Within the current workflow system there is no way to prevent one step from being run. This means that every workflow module's code should start with

    def _checkWFAndStepStatus(self, noPrint=False):
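
A minimal sketch of a custom module honouring this check, assuming it derives from DIRAC's Workflow ModuleBase; the class name and body are illustrative, only _checkWFAndStepStatus comes from DIRAC:

    from DIRAC import S_OK
    from DIRAC.Workflow.Modules.ModuleBase import ModuleBase

    class UploadOutputs(ModuleBase):  # hypothetical custom module
        def execute(self):
            # bail out early if the workflow or a previous step already failed
            if not self._checkWFAndStepStatus(noPrint=True):
                return S_OK("Skipped: workflow or previous step failed")
            # ... upload logic would go here ...
            return S_OK()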

arrabito (Contributor, author) commented Sep 5, 2024

Why, if the 'app_1' step fails, is the 'upload_outs' step executed anyway?

Within the current workflow system there is no way to prevent one step from being run. This means that every workflow module's code should start with

    def _checkWFAndStepStatus(self, noPrint=False):

I see. We don't have custom modules; we just use Script and FailoverRequest in this way:

job.setExecutable("app1")
job.setExecutable("upload_outs")
job.setExecutable("ls", modulesList=["Script", "FailoverRequest"])

So from your answer I understand that we should rather create custom modules for each step and use _checkWFAndStepStatus in each of them.

I'm going to try that.

fstagni (Contributor) commented Sep 11, 2024

You can fix the existing modules if there are errors in them; creating custom ones is possible but a slight pain.

arrabito (Contributor, author):

Hi @fstagni,
I've tried playing with custom modules and indeed I think it's the way to go for us.
We will change our job logic, and hopefully this scenario won't happen anymore, or only very rarely.
So I'll let you decide whether you want to merge this PR or not.
Anyway, just one last question: why isn't the use of

    def _checkWFAndStepStatus(self, noPrint=False):

the default? I mean also in the Script module?

Thank you.

fstagni (Contributor) commented Sep 11, 2024

If you do not need this PR anymore, I would prefer not to merge it.

The use of _checkWFAndStepStatus is not the default partly for historical reasons, but also because sometimes you just need to run all the modules.

arrabito (Contributor, author):

OK, fine for me.
Thank you.

fstagni closed this Sep 11, 2024