Pilot status for pilots in PollTime related sleep cycle #7636
Comments
Indeed, there is no such status. If we were to add one, it would be something like "FINISHED_EMPTY", meaning "Done with no matched payloads", because of course not matching any job is not an error! What you are suggesting makes sense, for example, for resources to which pilots are submitted but for which there are no matching payloads, maybe because of CPU types that are not fully supported (I am thinking about ARM). Introducing such a status is not difficult; using it for decisions in the SiteDirector requires some careful work. To be fair, I would have this in DiracX, because we are reluctant to implement new functionality before that. Unless you want to give it a try yourself...
I'm not sure "Finished_empty" is the right status, since a pilot can run multiple payloads and does not need to finish once it could not find a matching payload. It will just go into a sleep mode and try again later. The main issue is the following, which we see on our site in production:
Before all pilots submitted to the site are running, the first ones that started have already finished all available payloads. Once there are no more payloads available, all pilots are idle and poll for new payloads periodically, depending on "PollingTime" and the hardcoded increase of the sleep time after a poll that did not yield a payload (it would be good to be able to enable/disable that increase via a configuration option). Also, when pilots are in "sleep" mode between polls and new payloads arrive, the site director does not seem to take such pilots into account. It only seems to take into account pilots that are still idle from a batch system point of view, but not pilots that are running but have no payload. That results in more submitted pilots and, in turn, in a larger number of pilots that get no payload and go to sleep, since the payloads will already have been processed by the time the currently sleeping pilots poll again. What I suggest is that the site director submits new pilots based on
To do so, the status of such running pilots without payload needs to be known.
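To make the polling behaviour described above concrete, here is a minimal sketch of a sleep-time calculation in which the increase after unsuccessful polls can be switched off. The function name, the `increase_enabled` flag, the doubling factor, and the one-hour cap are all assumptions for illustration; the actual hardcoded increase in the pilot code may follow a different rule.

```python
def next_sleep_time(
    polling_time: float,
    failed_polls: int,
    increase_enabled: bool = True,
    factor: float = 2.0,
    max_sleep: float = 3600.0,
) -> float:
    """Return the number of seconds to sleep before the next poll.

    polling_time     -- base interval (the "PollingTime" setting)
    failed_polls     -- consecutive polls that returned no payload
    increase_enabled -- hypothetical switch to disable the increase
    factor/max_sleep -- hypothetical back-off parameters
    """
    if not increase_enabled or failed_polls <= 0:
        # No back-off: always poll at the configured interval.
        return polling_time
    # Grow the interval with every unsuccessful poll, up to a cap.
    return min(polling_time * factor**failed_polls, max_sleep)
```

With the increase disabled, a sleeping pilot keeps polling at the base interval, which shortens the time until it picks up newly arrived payloads at the cost of more matcher requests.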
OK, so I slightly misunderstood your first message: you are not talking about pilots that did not match any job, but pilots for which the last n cycles of the JobAgent did not match jobs.
This can be easily done.
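As a rough illustration of "the last n cycles did not match jobs", the sketch below keeps a short history of matching results and reports a pilot as sleeping only when all of the last n cycles came back empty. The class `MatchHistory` and its methods are hypothetical and not part of the JobAgent.

```python
from collections import deque


class MatchHistory:
    """Track the outcome of the most recent matching cycles (hypothetical helper)."""

    def __init__(self, n_cycles: int = 3):
        # deque with maxlen drops the oldest result automatically.
        self._results = deque(maxlen=n_cycles)

    def record(self, matched: bool) -> None:
        """Store whether the latest cycle matched a payload."""
        self._results.append(matched)

    def is_sleeping(self) -> bool:
        """True only if n cycles were recorded and none of them matched."""
        full = len(self._results) == self._results.maxlen
        return full and not any(self._results)
```

A pilot would call `record()` once per cycle and report its state (or update a counter) whenever `is_sleeping()` changes value.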
The SiteDirector consumes info from the Computing Element. What you are suggesting is to also take into consideration:
This is possible (but won't be very precise anyway).
Instead of having a status (which at the pilot level would be something like "RUNNING_IDLE" or "SLEEPING"), we can also increment or decrement a (central) counter.
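The counter idea could look roughly like the sketch below: one count of sleeping pilots per site, incremented when a pilot goes to sleep and decremented when it matches a payload or exits. The class name, the per-site keying, and the in-memory storage are assumptions; a real central counter would live in a service or database.

```python
import threading
from collections import defaultdict


class SleepingPilotCounter:
    """Per-site count of running pilots without a payload (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def increment(self, site: str) -> int:
        """A pilot at `site` went to sleep after an empty poll."""
        with self._lock:
            self._counts[site] += 1
            return self._counts[site]

    def decrement(self, site: str) -> int:
        """A sleeping pilot at `site` matched a payload or exited."""
        with self._lock:
            # Never go below zero, e.g. after a lost or duplicated update.
            self._counts[site] = max(0, self._counts[site] - 1)
            return self._counts[site]

    def get(self, site: str) -> int:
        """Current number of sleeping pilots at `site`."""
        with self._lock:
            return self._counts[site]
```

A counter avoids adding a new pilot status, but it drifts if a pilot dies while asleep and never sends the decrement, which is one reason the number can only be approximate.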
That's correct.
Having 2. and assuming any job for that site can be matched to any of the sleeping pilots would help in a first step. Adding a config option for a site, e.g. "account for sleeping pilots = yes/no" could disable this feature in case a pilot can only match a specific payload and it is not known which payloads a pilot could potentially match.
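A minimal sketch of such a submission decision, under the assumption stated above that any waiting job for the site can be matched by any sleeping pilot. The function and parameter names are hypothetical, and `account_for_sleeping` stands in for the proposed per-site option; none of this is existing SiteDirector code.

```python
def pilots_to_submit(
    waiting_payloads: int,
    batch_idle_pilots: int,
    sleeping_pilots: int,
    account_for_sleeping: bool = True,
) -> int:
    """Return how many new pilots to submit for a site.

    waiting_payloads     -- payloads currently waiting for this site
    batch_idle_pilots    -- pilots still queued in the batch system
    sleeping_pilots      -- running pilots without a payload
    account_for_sleeping -- hypothetical per-site switch (yes/no)
    """
    # Pilots that are expected to pick up waiting payloads.
    covered = batch_idle_pilots
    if account_for_sleeping:
        covered += sleeping_pilots
    # Submit only for payloads that no existing pilot is expected to take.
    return max(0, waiting_payloads - covered)
```

For example, with 10 waiting payloads, 2 queued pilots, and 5 sleeping pilots, the sketch returns 3 with the option enabled and 8 with it disabled.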
Why would that not be more precise than how it is right now?
What do you suggest would be counted then?
DIRAC/src/DIRAC/WorkloadManagementSystem/Client/PilotStatus.py
Line 23 in 60e2d82
For the pilot status options defined above, it seems there is no status to indicate a pilot that is running in the batch queue but did not get any payload in the last cycle.
Should this status be made available to the system for monitoring as well as for the SiteDirector's decision about submitting new jobs? (If the total number of pilots in such a sleep mode is larger than the number of available payloads, then no new pilot needs to be submitted.)