Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish never-spawned from never-submitted. #6067

Merged
merged 13 commits into from
Apr 29, 2024

Conversation

hjoliver
Copy link
Member

@hjoliver hjoliver commented Apr 13, 2024

Close #6066

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@hjoliver hjoliver added the bug Something is wrong :( label Apr 13, 2024
@hjoliver hjoliver self-assigned this Apr 13, 2024
@hjoliver hjoliver changed the title Distinguish never spawned from suicided before submit. Distinguish never-spawned from suicided-before-submit. Apr 14, 2024
@hjoliver hjoliver changed the title Distinguish never-spawned from suicided-before-submit. Distinguish never-spawned from never-submitted. Apr 15, 2024
@hjoliver hjoliver added this to the cylc-8.3.0 milestone Apr 15, 2024
@hjoliver
Copy link
Member Author

hjoliver commented Apr 15, 2024

Fixing bullet point 2 has generated a DB issue that I don't understand yet.

@hjoliver
Copy link
Member Author

suicide trigger

[scheduling]
    [[graph]]
        R1 = """
            a? & b => c
            a:failed? => !a & !c
        """
[runtime]
    [[a]]
        script = "sleep 3; false"
    [[b,c]]

now works as expected:

...
INFO - [1/c:waiting(runahead)] => waiting
INFO - [1/a/01:failed] removed from active task pool: suicide trigger
INFO - [1/c:waiting] removed from active task pool: suicide trigger
INFO - [1/b/01:running] => succeeded
INFO - Not spawning 1/c: already used in this flow
INFO - [1/b/01:succeeded] removed from active task pool: completed
INFO - Workflow shutting down - AUTOMATIC
INFO - DONE

cylc remove

Since #5643 is not done yet, we can't remove future tasks, so need an example that spawns a waiting task to be removed:

[scheduling]
    [[graph]]
        # b spawns c before a removes c
        R1 = "a & b => c"
[runtime]
    [[a]]
        script = "sleep 5; cylc remove $CYLC_WORKFLOW_ID//1/c"
    [[b,c]]

Result: 1/c does not run (already spawned in this flow; and prior removal was logged).

...
INFO - [1/c:waiting(runahead)] => waiting
INFO - [1/b/01:succeeded] removed from active task pool: completed
INFO - Command "remove_tasks" received. ID=f2834293-11dc-49cd-ae5c-e1d9f5cbdb42
    remove_tasks(tasks=['1/c'])
INFO - [1/c:waiting] removed from active task pool: request
INFO - Command "remove_tasks" actioned. ID=f2834293-11dc-49cd-ae5c-e1d9f5cbdb42
INFO - [1/a/01:running] => succeeded
INFO - Not spawning 1/c: already used in this flow
INFO - [1/a/01:succeeded] removed from active task pool: completed
INFO - Workflow shutting down - AUTOMATIC
INFO - DONE

I think this is what we want, subject to possible revision when #5643 is implemented.

@hjoliver hjoliver marked this pull request as ready for review April 18, 2024 06:53
@hjoliver
Copy link
Member Author

Can rebase this onto 8.2.x if necessary (it seems we may get another 8.2 release out before 8.3.0?)

@oliver-sanders
Copy link
Member

Yes, we've got a cylc-rose issue that we would like to fix in the 8.2 series so are planning another release.

@hjoliver
Copy link
Member Author

This actually conflicts quite badly with 8.2.x. Mind if we leave it on master 8.3.0?

@oliver-sanders
Copy link
Member

Shockingly, we've only had one report of this so far (failure cases are presumably less well tested in workflows?) so we can wait a little.

* Ensure removed tasks are not automatically re-spawned by task pool
  logic.
* Ensure removed tasks can be manually re-spawned by request.
@oliver-sanders
Copy link
Member

oliver-sanders commented Apr 22, 2024

Looks good to me, I wrote a couple of extra tests whilst reviewing - hjoliver#49

Can confirm that this fixes the bug where removed tasks could be automatically respawned at a later time by task pool logic. There's no associated issue for this, but might be worth a mention in the changelog entry.

@hjoliver
Copy link
Member Author

I wrote a couple of extra tests whilst reviewing - hjoliver#49

Thanks, I'll take a look tomorrow. I presume you saw my new integration tests on this branch (which seem to achieve 100% patch coverage) - do yours go beyond mine?

@oliver-sanders
Copy link
Member

oliver-sanders commented Apr 22, 2024

Yes, I extended your test to ensure that it's possible to get the task back by triggering and added a new one to ensure that removed tasks remain removed. These behaviours follow naturally from the implementation, but it's worth locking them down in the tests to protect against future changes.

changes.d/6067.fix.md Outdated Show resolved Hide resolved
cylc/flow/task_pool.py Outdated Show resolved Hide resolved
Copy link
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR but while I've noticed it: the "added to active task pool" info messages introduced in #5658 seem a bit excessive. Consider demoting to debug?

submit_num == 0
):
# Previous instance removed before completing any outputs.
LOG.info(f"Not spawning {point}/{name}: already used in this flow")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this message is easy to understand

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, but how to describe what's happening concisely?

Basically, the flow (on this branch of the graph) came to a halt at this point because the task was removed earlier by suicide trigger or cylc remove. That definitely warrants some kind of log message.

I'll change it to this (subject to further litigation, of course):

Suggested change
LOG.info(f"Not spawning {point}/{name}: already used in this flow")
LOG.info(f"Flow stopping at {point}/{name} - task previously removed")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, "flow blocked at" might be better.

Copy link
Member

@MetRonnie MetRonnie Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like "flow blocked" I'm afraid! How about:

Skipping {point}/{name} {stringify_flow_nums(flow_nums)} - task previously removed

Or just

Not spawning {point}/{name} {stringify_flow_nums(flow_nums)} - task previously removed

as it was the "already used in this flow" that I think doesn't make much sense

Copy link
Member Author

@hjoliver hjoliver Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "flow blocked" is better than your two suggestions - it correctly indicates that the flow will not continue here because the task was removed.

"Skipping" suggests the flow will skip over the removed task, which is not the case.

"Spawning" (as you'll have seen in the chat) is I think being expunged from user terminology.

Copy link
Member Author

@hjoliver hjoliver Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think options are as follows (or some close variant thereof):

  • flow blocked here, the task was removed earlier
    • (note addition of "here" to avoid implicating the entire flow)
  • not spawning task, it was removed earlier

Personally I think in this instance "flow blocked" is more user-friendly, in that it is more or less self-explanatory, compared to "not spawning".

But given that @oliver-sanders agrees "not spawning" is correct, and (as I've just shown) spawning probably has to remain a user-facing term, perhaps we could just stick with that.

It needs to be logged at INFO level though, to explain why the flow does not continue here, at this time.

Copy link
Member

@oliver-sanders oliver-sanders Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, the term "spawn" ... was never a user facing term in the first place.

Ah, really?!

That's Cylc 7!

I'm very confused by these comments. It was your own argument that users should not need to understand the management (insertion or removal) of task proxies or task pool logic in general which I got on board with. It was a primary goal of SoD to remove the need for us to communicate this internal implementation detail, and a necessary side effect as SoD would make it harder for users to monitor the task pool itself. That's why spawning was retired right, s/spawn/set-outputs, users shouldn't need to understand spawning, because SoD will ensure the task proxy is always there when needed right? E.G. #2143. I agreed with you on this at the time, it was a good idea!

To quote your own proposal:

Users should not have to know about the "task pool" and the concepts of task "insert" and "remove" anymore.

I appreciate this is going back a few years now, maybe reality crept in and the thinking has changed. Perhaps it was a bit ambitions to presume we could hide the internal logic. Maybe we should write up the SoD model and communicate it to users as essential reading? We could do with writing this up in any case.

It needs to be logged at INFO level though, to explain why the flow does not continue here, at this time.

I'm not convinced it does. We are trying to communicate to users the graph model, not the internal SoD implementation (or at least I thought we were).

Take this graph for example:

a => b => c => d
a => x
b => x
c => x
d => x

If the user runs cylc remove //x, they will get a message along the lines of "x removed".

They don't need to see four "not re-spawning x" messages in a row as the graph proceeds, there is no new information in these messages. Moreover, it isn't communicating anything to them about the graph. Of course x isn't be re-spawned, I removed it! On purpose!

Copy link
Member

@oliver-sanders oliver-sanders Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My approval still stands of course, I'll leave you two to find something you can agree on.

I was trying to overt a terminology argument not to stoke one, I accepted the original phrasing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just think "flow blocked" sounds too negative and some users will see it and think something has gone wrong

Copy link
Member Author

@hjoliver hjoliver Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK @MetRonnie, let's try to agree. I don't see "flow blocked" as negative, but perhaps we could add more information to explain the reason for it.

We've established that "skipping" isn't right. Are you happy with either of these, or do you have another suggestion?:

  • "flow blocked by previous removal of [task]"
  • "not spawning [task] (previously removed)"

I like "flow blocked" because it automatically conveys the fact that the flow will not continue from this point. But having argued that "spawn" is still needed to explain how flows evolve, I don't mind if we use that here too.

tests/integration/test_task_pool.py Outdated Show resolved Hide resolved
tests/integration/test_task_pool.py Outdated Show resolved Hide resolved
@hjoliver
Copy link
Member Author

All comments addressed. Logging is nice and succinct now:

$ cylc play --no-detach --no-timestamp bug/run17

 ▪ ■  Cylc Workflow Engine 8.3.0.dev
 ██   Copyright (C) 2008-2024 NIWA
▝▘    & British Crown (Met Office) & Contributors

INFO - Extracting job.sh to /home/oliverh/cylc-run/bug/run17/.service/etc/job.sh
INFO - Workflow: bug/run17
INFO - Scheduler: url=tcp://NIWA-1022450.niwa.local:43023 pid=18085
INFO - Workflow publisher: url=tcp://NIWA-1022450.niwa.local:43032
INFO - Run: (re)start number=1, log rollover=1
INFO - Cylc version: 8.3.0.dev
INFO - Run mode: live
INFO - Initial point: 1
INFO - Final point: 1
INFO - Cold start from 1
INFO - New flow: 1 (original flow from 1) 2024-04-23T13:40:13
INFO - [1/a:waiting(runahead)] => waiting
INFO - [1/a:waiting] => waiting(queued)
INFO - [1/a:waiting(queued)] => waiting
INFO - [1/a:waiting] => preparing
INFO - [1/a/01:preparing] submitted to localhost:background[18110]
INFO - [1/a/01:preparing] => submitted
INFO - [1/a/01:submitted] => running
INFO - [1/a/01:running] => succeeded
INFO - Workflow shutting down - AUTOMATIC
INFO - DONE

@hjoliver hjoliver force-pushed the tweak-suicide branch 2 times, most recently from 0f264f5 to 995734b Compare April 23, 2024 02:15
Copy link
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to see "flow blocked" changed, but happy enough

@hjoliver
Copy link
Member Author

hjoliver commented Apr 28, 2024

That's Cylc 7!

Sorry, I guess I took "never" a bit too literally!

To quote your own proposal:

Users should not have to know about the "task pool" and the concepts of task "insert" and "remove" anymore.

Ah, OK @oliver-sanders, I do understand your point now.

That statement was idealistic in retrospect, and maybe that's my fault. Early SoD discussions were very scheduler-focused and we didn't know exactly what would be in the task pool (e.g. whether partially satisfied tasks should be "hidden" or not).

Now the GUI matters too because the n=0 window IS the task pool, so we have to explain what tasks are in it and how they get there.

After going through all the options that we've thought of (to my knowledge) it still seems to me that the best terminology for this is "task pool" and "tasks get spawned (into the pool) by completion of upstream outputs". [I have actually commented on this several times over the past couple years, but not surprising if it got lost in the noise.]

I appreciate this is going back a few years now, maybe reality crept in and the thinking has changed. Perhaps it was a bit ambitions to presume we could hide the internal logic. Maybe we should write up the SoD model and communicate it to users as essential reading? We could do with writing this up in any case.

Yeah, more or less what I meant by "idealistic". The basic "SoD for Users" story for users, which can serve as the basis for documenting flows, optional outputs, and interventions, is really quite simple though:

  • the "task pool" is the set of tasks under active management by the scheduler, as the flow moves along the graph
    • (in an infinite graph the scheduler can only manage a subset of all the tasks, in any implementation)
  • tasks get "spawned" into the pool by upstream (graph) outputs, or automatically if they have no parents
    • (tasks have to get into the current pool somehow, in any implementation)
  • tasks exit the pool if/when they achieve a final status with completed outputs
    • (ditto)
  • in GUI you see a graph-based window around the task pool
  • flows (and what you see in the GUI) evolve according to the graph

@hjoliver
Copy link
Member Author

hjoliver commented Apr 28, 2024

If the user runs cylc remove //x, they will get a message along the lines of "x removed".

They don't need to see four "not re-spawning x" messages in a row as the graph proceeds, there is no new information in these messages.

Yes, but you've given a simple example where the removal and its effects are contiguous and obvious. In more complex workflows, and especially once we can remove future (non-pool) tasks as planned, that may not be the case.

This is not exposing SoD implementation, it just explains why the flow does not continue at this point the graph: because at some point in the past someone or something "removed" the task there.

Whether this is INFO or DEBUG is a pretty small thing, not worth dying in a ditch over, but I can't see how hurts to explain to users why the flow halts now on this branch when the "remove" operation may have happened some time ago.

@hjoliver
Copy link
Member Author

hjoliver commented Apr 29, 2024

Just noticed you approved this @MetRonnie - I could have saved myself some writing above 🤦‍♂️

OK, I'll merge this after tweaking the log message to give a bit more information as described above, so we can push on to 8.3.0.

Of course x isn't be re-spawned, I removed it! On purpose!

BTW I (unsuccessfully!) made the exact same argument for manual expiry, which has the exact same effect on the workflow: x doesn't spawn, which blocks the flow at that point in the graph, which may result in premature shutdown.

I'll concede and demote the message to DEBUG though, because removal is pretty niche, and if the user did it they don't have too much of a right to complain when the inevitable downstream effects occur.

It's definitely a candidate for our proposed intermediate log level though #3647 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

suicide triggers broken
3 participants