-
Dear all,

I'm having trouble getting multi-core jobs to work, so I have some questions about how to configure DIRAC to run them. Here is my use case.

We have an application that is "almost" single-core: multiple threads are instantiated at the start of the job payload, but they don't use the CPU in parallel, so in general only one core is used, with some exceptions. Our sites are all configured with NumeberOfProcessors = 1 and these jobs usually run fine. On one particular site, however, a large fraction of jobs failed, and the batch system logs showed that this was because the jobs exceeded the number of requested cores, i.e. 1. So we configured the problematic site with:

but now all jobs there fail with Minor Status = Received Kill signal. In the pilot logs I found:

but the pilot status is Done. I don't yet know which component sent the kill signal, but I was wondering whether the configuration for multi-core jobs is correct. Is it enough to just configure the site with:

? Or should something additional also be specified in the job requirements? For instance, in the pilot logs I found:

but I don't know whether this is correct or whether we should have payloadProcessors = 4.

Thank you for your help.
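One way to check how far the "almost single-core" payload actually exceeds one core is to compare its CPU time with its wall-clock time outside of DIRAC. The following is only a rough diagnostic sketch, not part of DIRAC: the function name `average_cores_used` is made up for illustration, it relies on the Unix-only `resource` module, and the demo command is a stand-in for the real payload. An average also hides short bursts, which a batch system may still catch.

```python
import resource
import subprocess
import sys
import time


def average_cores_used(cmd):
    """Run cmd to completion and return (CPU time of child processes) / (wall time)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(cmd, check=True)  # waits for the child, so its usage is accounted
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User + system CPU time consumed by the child between the two snapshots.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return cpu / wall if wall > 0 else 0.0


# Stand-in payload for demonstration: a short single-threaded Python loop.
demo = [sys.executable, "-c", "sum(range(10**6))"]
print(f"average cores used: {average_cores_used(demo):.2f}")
```

A value clearly above 1.0 for the real payload would suggest that requesting a single core is not enough on sites that enforce the limit strictly.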
-
Hi,
How do you submit your jobs? e.g. are you using an ARC or HTCondor CE? Are you submitting them via SSH specifying the underlying batch system? In this case, which batch system is it?
IIUC, the
From what I understand, it should be okay.
-
Thanks a lot for these explanations, which I think answer my question. From my understanding, in my case I should indeed use:
I will let you know if it works as expected. Thanks.
-
Dear all,

First of all, could you please tell me whether, for our use case, we should use the PoolComputingElement, setting:

in the CE configuration? For the moment we haven't specified anything. Then I ran a few tests with:

but the jobs keep failing. In the pilot logs I found:

Could you please tell me whether this is correct? I actually expected payloadProcessors to be 4, since the job can use at most 4 cores. In any case, I don't know whether the jobs failed because they tried to use more than 4 cores or because the requirement is not working as expected. So I also ran a small test of 3 jobs with:

and the jobs succeeded. Unfortunately I don't have the pilot logs. Finally, I ran another small test of 3 jobs with:

and 2 were Done and 1 Failed. I will try to investigate further, but I wanted to know whether it is normal to have payloadProcessors = 1 for jobs that can use 1 core and at most 6 cores. I would also like to know about the use of PoolComputingElement.

@aldbr These tests refer to the HTCondor CE, but later we will also have to run on ARC6.

Thank you.
Luisa
-
Thank you for your answer. When you say that we could set
? Thanks.
-
To check: seems like the
Before I investigate deeper, I just want to make sure that:
- the option is NumberOfProcessors and not NumeberOfProcessors, as you repeatedly made this typo above.
- NumberOfProcessors=1 is the default, you don't have to specify it further.
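A misspelled option name like the one above can be caught mechanically before it reaches the configuration. The following is a minimal sketch, assuming the configuration is available as plain text with `Name = value` lines; the helper `find_misspelled_options`, the `KNOWN_OPTIONS` tuple and the `SomeOtherOption` entry in the sample are illustrative and not part of DIRAC.

```python
import difflib

# Option names to check against; only the one discussed in this thread is listed.
KNOWN_OPTIONS = ("NumberOfProcessors",)


def find_misspelled_options(config_text, known=KNOWN_OPTIONS):
    """Return (line_no, found, suggestion) for names close to, but not equal to, a known option."""
    problems = []
    for line_no, line in enumerate(config_text.splitlines(), start=1):
        if "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        if not name or name in known:
            continue
        # A high cutoff keeps unrelated option names from being flagged.
        match = difflib.get_close_matches(name, known, n=1, cutoff=0.8)
        if match:
            problems.append((line_no, name, match[0]))
    return problems


# Sample text: the first line carries the typo, "SomeOtherOption" is a placeholder.
sample = "NumeberOfProcessors = 1\nNumberOfProcessors = 4\nSomeOtherOption = 2048\n"
for line_no, found, suggestion in find_misspelled_options(sample):
    print(f"line {line_no}: '{found}' looks like a typo for '{suggestion}'")
```

Only near-misses are reported, so correctly spelled and unrelated options pass silently.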