The DIRAC certification instance seems to contain a number of resources that haven't worked in months (years?), making it difficult to distinguish real DIRAC errors from site failures.
So far we have:
Sites/CE:
LCG
htcondor-ce-[1,2,3,4]-kit.gridka.de_condor: submission fails with
/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current:Command ['condor_submit', '-terse', '-pool', 'htcondor-ce-3-kit.gridka.de:9619', '-remote', 'htcondor-ce-3-kit.gridka.de', '/opt/dirac/data/HTCondor/work/HTCondorCE_axlxafve.sub'] failed with: 1 - ERROR: Failed to connect to queue manager htcondor-ce-3-kit.gridka.de
According to their bdii the condor CE still supports dteam. Should this be followed up or the resource deleted ? If kept, they probably don't run EL7 any longer either.
There is a test that explicitly targets GRIF which hasn't worked in over a year:
/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current:2024-06-07T15:52:39,588024Z WorkloadManagement/SiteDirectorDteam/node16.datagrid.cea.fr ERROR: Failed getting the status of the CE. Response: 403 - User can't be assigned configuration
Ask site or retire ?
CERN:
/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current:2024-06-07T15:41:24,161662Z WorkloadManagement/SiteDirectorDteam/WorkloadManagement/SiteDirectorDteam ERROR: The following errors occurred during the pilot submission operation Command ['condor_submit', '-terse', '-pool', 'ce504.cern.ch:9619', '-remote', 'ce504.cern.ch', '/opt/dirac/data/HTCondor/work/HTCondorCE_no8_ygwr.sub'] failed with: 1 - ERROR: Failed to connect to queue manager ce504.cern.ch
I think giving up on CERN is a bad idea.
LCG.NCBJ.pl
One HelloWorld job recently succeeded, but the rest fail. Possibly related to storage elements ?
LCG.RAL.uk
Currently doesn't work due to: #7657, but should work in principle
Imperial, Glasgow, RALPPD: Should all work and can be fixed if they don't. Imperial cloud is currently (07/06/24) broken, but Simon and I are onto it. Everything else should work.
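To make it easier to see which CEs are failing persistently, the two kinds of SiteDirector log lines quoted above could be tallied per host. This is only a rough sketch: the function name `tally_ce_failures` is made up, the regular expressions cover just the two message shapes shown in this ticket, and treating the last component of the logger name (e.g. `node16.datagrid.cea.fr`) as the CE host is an assumption.

```python
import re
from collections import Counter

# Only the two failure shapes quoted in this ticket are recognised;
# any other log line is ignored.
CONNECT_RE = re.compile(r"Failed to connect to queue manager (\S+)")
# Assumption: the last component of the logger name is the CE host.
STATUS_RE = re.compile(
    r"SiteDirector\w*/(\S+) ERROR: Failed getting the status of the CE"
)


def tally_ce_failures(lines):
    """Count failures per (host, kind) from SiteDirector log lines."""
    counts = Counter()
    for line in lines:
        match = CONNECT_RE.search(line)
        if match:
            counts[(match.group(1), "connect")] += 1
            continue
        match = STATUS_RE.search(line)
        if match:
            counts[(match.group(1), "status")] += 1
    return counts
```

Feeding it the lines of `/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current` and printing `counts.most_common()` would give a quick list of candidates to follow up with the sites or retire.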
Storage Elements:
dcache.du.cesnet.cz (CESNET-SE) does not exist, at least I couldn't find it in the gocdb. Remove from config ?
IN2P3-SE (ccsrm.in2p3.fr) does exist, but we need to verify whether it still takes dteam data.
RAL-SE does exist, but does it still take dteam ?
UKI-LT2-IC-HEP-disk and UKI-SOUTHGRID-RALPP-disk should work and if not the sites should be told.
We could probably rope Glasgow in, if you need a further storage element.
Once we have a set of sites/storage elements that should work, we should then remove any obsolete ones from the tests.
I think this ticket should be a group effort :-D.
Also, RAL: Currently the OS is defined for the CE, but for RAL this does not make sense; it needs to hang off the queue. We use a slightly different approach ("Platform", no-one but Simon understands this), so each queue is either EL7, EL8 or EL9 and the OS is None. This seems to work fine, but how do I define this on the certification machine ?
Good points. But I am not sure that bdii provides reliable information. What does goc say?
Given that we need to move to the new DIRAC certification instance, and that this is something we should finalize as a group effort in the workshop's hackathon, we can redefine the list of resources for the new setup.
GOCDB does not give information as to what VOs are supported. We've been pointing this out for quite a while. In the end we will have to email the sites.
I'm currently trying to convince Glasgow to give us another SE for testing.