[hackathon] Only include resources that are actually able to accept jobs on DIRAC certification instance #7658

Open
marianne013 opened this issue Jun 7, 2024 · 4 comments

@marianne013
Contributor

The DIRAC certification instance seems to contain a number of resources that haven't worked in months (years?), making it difficult to distinguish real DIRAC errors from site failures.

So far we have:
Sites/CE:

LCG
htcondor-ce-[1,2,3,4]-kit.gridka.de_condor: submission fails with

/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current:Command ['condor_submit', '-terse', '-pool', 'htcondor-ce-3-kit.gridka.de:9619', '-remote', 'htcondor-ce-3-kit.gridka.de', '/opt/dirac/data/HTCondor/work/HTCondorCE_axlxafve.sub'] failed with: 1 - ERROR: Failed to connect to queue manager htcondor-ce-3-kit.gridka.de

According to their BDII, the HTCondor CEs still support dteam. Should this be followed up, or should the resource be deleted? If kept, they probably don't run EL7 any longer either.

There is a test that explicitly targets GRIF which hasn't worked in over a year:

/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current:2024-06-07T15:52:39,588024Z WorkloadManagement/SiteDirectorDteam/node16.datagrid.cea.fr ERROR: Failed getting the status of the CE. Response: 403 - User can't be assigned configuration

Ask the site or retire?

CERN:

/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current:2024-06-07T15:41:24,161662Z WorkloadManagement/SiteDirectorDteam/WorkloadManagement/SiteDirectorDteam ERROR: The following errors occurred during the pilot submission operation Command ['condor_submit', '-terse', '-pool', 'ce504.cern.ch:9619', '-remote', 'ce504.cern.ch', '/opt/dirac/data/HTCondor/work/HTCondorCE_no8_ygwr.sub'] failed with: 1 - ERROR: Failed to connect to queue manager ce504.cern.ch

I think giving up on CERN is a bad idea.

LCG.NCBJ.pl
One HelloWorld job recently succeeded, but the rest fail. Possibly related to storage elements?

LCG.RAL.uk
Currently doesn't work due to: #7657, but should work in principle

Imperial, Glasgow, RALPPD: Should all work and can be fixed if they don't. Imperial cloud is currently (07/06/24) broken, but Simon and I are onto it. Everything else should work.
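
To get an overview of which hosts are behind the failures, the SiteDirector log could be tallied per host. This is only a rough sketch: the two regular expressions are guesses based on the three log excerpts quoted above, and `failing_hosts` is a hypothetical helper, not part of DIRAC.

```python
import re
from collections import Counter

# Host named in "Failed to connect to queue manager <host>" (GridKa and CERN excerpts).
QUEUE_MANAGER_RE = re.compile(r"Failed to connect to queue manager (\S+)")
# Host appearing as the last part of the logging component, directly before
# " ERROR:" (GRIF excerpt). This assumes the host name contains at least one dot.
COMPONENT_HOST_RE = re.compile(r"/([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+) ERROR:")


def failing_hosts(lines):
    """Count error lines per host for the two patterns seen in the excerpts."""
    counts = Counter()
    for line in lines:
        match = QUEUE_MANAGER_RE.search(line) or COMPONENT_HOST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the excerpts above; adjust for the instance at hand.
    log_path = "/opt/dirac/runit/WorkloadManagement/SiteDirectorDteam/log/current"
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for host, count in failing_hosts(handle).most_common():
            print(f"{count:6d}  {host}")
```

Lines that match neither pattern are silently ignored, so the tally is a lower bound on the failures rather than a full picture.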

Storage Elements:
dcache.du.cesnet.cz (CESNET-SE) does not exist, at least I couldn't find it in the GOCDB. Remove from config?
IN2P3-SE (ccsrm.in2p3.fr) does exist, but we need to verify that it still takes dteam data.
RAL-SE does exist, but does it still take dteam?
UKI-LT2-IC-HEP-disk and UKI-SOUTHGRID-RALPP-disk should work, and if not, the sites should be told.
We could probably rope Glasgow in, if you need a further storage element.
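
The storage element notes above could be kept as a small triage list while we chase the sites. This is a sketch only: the `Resource` class, the status labels ("expected", "verify", "missing") and `keep_for_tests` are made up for illustration, and the entries just restate the open questions from this ticket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    status: str  # hypothetical labels: "expected", "verify", "missing"
    note: str


# Entries transcribed from the storage element list in this ticket.
STORAGE_ELEMENTS = [
    Resource("CESNET-SE", "missing", "dcache.du.cesnet.cz not found in the GOCDB"),
    Resource("IN2P3-SE", "verify", "exists; does it still take dteam data?"),
    Resource("RAL-SE", "verify", "exists; does it still take dteam?"),
    Resource("UKI-LT2-IC-HEP-disk", "expected", "should work"),
    Resource("UKI-SOUTHGRID-RALPP-disk", "expected", "should work"),
]


def keep_for_tests(resources):
    """Return the names of resources currently expected to work."""
    return [r.name for r in resources if r.status == "expected"]


if __name__ == "__main__":
    print("Keep in tests:", ", ".join(keep_for_tests(STORAGE_ELEMENTS)))
    for r in STORAGE_ELEMENTS:
        if r.status != "expected":
            print(f"Follow up: {r.name} ({r.note})")
```

As sites reply, an entry would move from "verify" to either "expected" or out of the list entirely.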

Once we have a set of sites/storage elements that should work, we should then remove any obsolete ones from the tests.
I think this ticket should be a group effort :-D.

@marianne013
Contributor Author

Also, RAL: currently the OS is defined for the CE, but for RAL this does not make sense; it needs to hang off the queue. We use a slightly different approach ("Platform", no-one but Simon understands this), so each queue is either EL7, EL8 or EL9 and the OS is None. This seems to work fine, but how do I define this on the certification machine?

@fstagni
Contributor

fstagni commented Jun 11, 2024

Good points. But I am not sure that the BDII provides reliable information. What does GOCDB say?

Given that we need to move to the new DIRAC certification instance, and that this is something that we should finalize, as a group effort, in the workshop's hackathon, we can redefine the list of resources for the new setup.

@marianne013
Contributor Author

GOCDB does not give information as to what VOs are supported. We've been pointing this out for quite a while. In the end we will have to email the sites.
I'm currently trying to convince Glasgow to give us another SE for testing.

@fstagni fstagni self-assigned this Sep 19, 2024
@fstagni
Contributor

fstagni commented Oct 16, 2024

  • add a CE that uses certificates
