
[8.0] Change TS addFile logic for Deleted files #7758

Closed
wants to merge 9 commits

Conversation

arrabito (Contributor) commented Aug 21, 2024

In the current 8.0 release, when a file that was attached to a transformation is removed from the DFC, it is set to 'Deleted' status in the TransformationFile table. If a new file with the same LFN as the removed one is later added again to the DFC (and to the TSCatalog), the current implementation of the TS addFile method does not update the status of this file. As a consequence, the newly produced file remains in 'Deleted' status and will never be processed by the transformation. The same holds for the TS setMetadata method.

With this PR, the addFile and setMetadata methods check whether a file already attached to the transformation is in 'Deleted' status; if the file must be attached to the transformation, its status is changed from 'Deleted' to 'Unused'. For coherence, the status in the DataFile table is also updated from 'Deleted' to 'New'.

BEGINRELEASENOTES

*TransformationSystem
CHANGE: Check if file previously DELETED is back in DFC

ENDRELEASENOTES

@DIRACGridBot DIRACGridBot added the alsoTargeting:integration Cherry pick this PR to integration after merge label Aug 21, 2024
    for fileDict in res["Value"]:
        fileIDs.append(fileDict["FileID"])
        if fileDict["Status"] == "Deleted":
            res = self.__setTransformationFileStatus(list(fileIDs), "Unused", connection=connection)
Contributor:

fileIDs is appended to inside the loop, probably one should only use the current fileID?

Contributor:

Or maybe split the fileIDs into Deleted and New, and call the status update once per group?
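
A minimal sketch of that alternative, assuming the surrounding addFile context (res["Value"], fileIDs, __setTransformationFileStatus and TransformationFilesStatus as in the snippet above); anything not shown in the diff is illustrative:

    deletedIDs = []
    for fileDict in res["Value"]:
        fileIDs.append(fileDict["FileID"])
        # collect previously deleted files instead of updating inside the loop
        if fileDict["Status"] == "Deleted":
            deletedIDs.append(fileDict["FileID"])
    if deletedIDs:
        # one bulk status update for all files that were in 'Deleted'
        res = self.__setTransformationFileStatus(deletedIDs, TransformationFilesStatus.UNUSED, connection=connection)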

Contributor (author):

fileIDs is appended to inside the loop, probably one should only use the current fileID?

Yes, I agree.
What about removing:

    fileIDs.append(fileDict["FileID"])

and changing the call arguments in this way:

    [fileDict["FileID"]], TransformationFilesStatus.UNUSED, connection=connection

?
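
Put together, the loop from that proposal would look roughly like this (a sketch, not the final diff):

    for fileDict in res["Value"]:
        fileIDs.append(fileDict["FileID"])
        if fileDict["Status"] == "Deleted":
            # reset only the current file, not every ID accumulated so far
            res = self.__setTransformationFileStatus(
                [fileDict["FileID"]], TransformationFilesStatus.UNUSED, connection=connection
            )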

fstagni (Contributor) commented Aug 21, 2024

I have never thought a file could be "re-added" to the catalog.

This check is somewhat "expensive" and I am wondering if there's a better way.

arrabito (Contributor, author):

I have never thought a file could be "re-added" to the catalog.

In some cases we have jobs that failed but still managed to upload part of their output files. When this happens we simply set all the input files of the job to Unused, and a new job is created to process those files again. When the new job tries to upload its output, it finds some already existing LFNs, so we have simply instructed the job to remove the existing file and upload the new one.
We should probably improve this logic, which is not optimal, but the scenario of re-adding a file could happen anyway, since it's not forbidden.

This check is somewhat "expensive" and I am wondering if there's a better way.

I haven't thought of a better implementation, but what about just removing the file from the transformations as well when it's removed from the catalog, instead of changing its status to 'Deleted'?

fstagni (Contributor) commented Aug 22, 2024

In some cases we have jobs that failed but still managed to upload part of their output files. When this happens we simply set all the input files of the job to Unused, and a new job is created to process those files again.

This, I would argue, is not the best approach. The better way would be to set requests, which is done for you by

    def transferAndRegisterFileFailover(...)

which was created exactly for this purpose.
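
For illustration, a minimal sketch of how a job-side step might use it; the exact signature and parameter names should be checked in DIRAC.DataManagementSystem.Client.FailoverTransfer, so treat the arguments below as assumptions:

    from DIRAC.DataManagementSystem.Client.FailoverTransfer import FailoverTransfer

    failoverTransfer = FailoverTransfer()
    # try the target SE first; on failure the file goes to a failover SE and a
    # request is created to move/register it later (argument names assumed)
    res = failoverTransfer.transferAndRegisterFileFailover(
        fileName="outfile_a",
        localPath="./outfile_a",
        lfn="/vo/prod/outfile_a",
        targetSE="SOME-DEST-SE",
        failoverSEList=["SOME-FAILOVER-SE"],
        fileMetaDict={"GUID": None, "Checksum": None, "ChecksumType": "Adler32"},
    )
    if not res["OK"]:
        # standard DIRAC S_ERROR structure
        print(res["Message"])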

When the new job tries to upload its output, it finds some already existing LFNs, so we have simply instructed the job to remove the existing file and upload the new one. We should probably improve this logic, which is not optimal, but the scenario of re-adding a file could happen anyway, since it's not forbidden.

The scenario is not forbidden but still feels "odd".

This check is somewhat "expensive" and I am wondering if there's a better way.

I haven't thought of a better implementation, but what about just removing the file from the transformations as well when it's removed from the catalog, instead of changing its status to 'Deleted'?

Might be an option, but I think using the FailoverTransfer method is less error-prone.

arrabito (Contributor, author):

In some cases we have jobs that failed but still managed to upload part of their output files. When this happens we simply set all the input files of the job to Unused, and a new job is created to process those files again.

This, I would argue, is not the best approach. The better way would be to set requests, which is done for you by

    def transferAndRegisterFileFailover(...)

which was created exactly for this purpose.

I'm not sure we are talking about exactly the same scenario.
Let me give an example. We have job_1, which processes files infile_a and infile_b and which should produce outfile_a and outfile_b.
For some reason (e.g. a bug in the application) the job is able to process infile_a but not infile_b, so it uploads only outfile_a and ends up 'Failed'.
Since for our jobs we use the FailoverRequest module, e.g.:

    job.setExecutable("/bin/ls -l", modulesList=["Script", "FailoverRequest"])

this takes care of setting all input files of failed jobs to Unused, in this case infile_a and infile_b.

The new job will then process both infile_a and infile_b again, which results in the scenario I've described.

What do you suggest in this case?

fstagni (Contributor) commented Aug 22, 2024

If the job failed (because it did not manage to process all input files), why is that same job uploading the (partial) outputs? It shouldn't even try the first upload.

arrabito (Contributor, author):

If the job failed (because it did not manage to process all input files), why is that same job uploading the (partial) outputs? It shouldn't even try the first upload.

It's because we have implemented the upload of the outputs as an additional step inside the job, e.g.:

    job.setExecutable('app_1')
    job.setExecutable('upload_outs')

However, you are right: if 'app_1' fails we could, for instance, instruct app_1 to remove the partially produced files so that 'upload_outs' has nothing to upload.
I also take the opportunity to ask you a fundamental question about the DIRAC job execution logic.

Why, if the 'app_1' step fails, is the 'upload_outs' step executed anyway?

fstagni (Contributor) commented Aug 23, 2024

Why, if the 'app_1' step fails, is the 'upload_outs' step executed anyway?

Within the current workflow system there is no way to prevent one step from being run. This means that every workflow module's code should start with

    def _checkWFAndStepStatus(self, noPrint=False):
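
A minimal sketch of a custom module honouring this check, assuming it derives from DIRAC's Workflow ModuleBase; the class name and body are illustrative, only _checkWFAndStepStatus comes from DIRAC:

    from DIRAC import S_OK
    from DIRAC.Workflow.Modules.ModuleBase import ModuleBase

    class UploadOutputs(ModuleBase):  # hypothetical custom module
        def execute(self):
            # bail out early if the workflow or a previous step already failed
            if not self._checkWFAndStepStatus(noPrint=True):
                return S_OK("Skipped: workflow or previous step failed")
            # ... upload logic would go here ...
            return S_OK()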

arrabito (Contributor, author) commented Sep 5, 2024

Why, if the 'app_1' step fails, is the 'upload_outs' step executed anyway?

Within the current workflow system there is no way to prevent one step from being run. This means that every workflow module's code should start with

    def _checkWFAndStepStatus(self, noPrint=False):

I see. We don't have custom modules; we just use Script and FailoverRequest in this way:

job.setExecutable("app1")
job.setExecutable("upload_outs")
job.setExecutable("ls", modulesList=["Script", "FailoverRequest"])

So from your answer I understand that we should rather create custom modules for each step and use _checkWFAndStepStatus in each of them.

I'm going to try that.

fstagni (Contributor) commented Sep 11, 2024

You can fix the existing modules if there are errors in them; creating custom ones is possible but a slight pain.

arrabito (Contributor, author):

Hi @fstagni,
I've tried playing with custom modules and indeed I think it's the way to go for us.
We will change our job logic, and hopefully this scenario won't happen anymore, or only very rarely.
So I'll let you decide whether you want to merge this PR or not.
Anyway, just one last question: why isn't the use of

    def _checkWFAndStepStatus(self, noPrint=False):

the default? I mean also in the Script module?

Thank you.

fstagni (Contributor) commented Sep 11, 2024

If you do not need this PR anymore, I would prefer not to merge it.

The use of _checkWFAndStepStatus is not the default partly for historical reasons, but also because sometimes you just need to run all the modules.

arrabito (Contributor, author):

OK, fine for me.
Thank you.

fstagni closed this Sep 11, 2024