Add bluechi is online tool #964

Draft
wants to merge 2 commits into
base: main


Member

@engelmi engelmi commented Oct 18, 2024

Relates to: #962

So far, this PR adds the bluechi-is-online CLI tool, a man page and the RPM package to the spec file.

Member Author

engelmi commented Oct 18, 2024

@dofmind
This PR is still in progress, but you can already build and test the bluechi-is-online binary (using the usual meson install). Here are some examples of how to use it (I will document them later):

######################
# Example 1: Stop service(s) when bluechi-agent loses connection 
$ cat /etc/systemd/system/monitor-bluechi-agent.service
[Unit]
Description=Monitor bluechi-agents connection to controller

[Service]
Type=simple
ExecStart=/usr/local/bin/bluechi-is-online agent --initial-wait=5000 --monitor

$ cat /etc/systemd/system/workload.service
[Unit]
Description=Some workload that should stop running when bluechi-agent disconnects
BindsTo=monitor-bluechi-agent.service
After=monitor-bluechi-agent.service

[Service]
...

######################
# Example 2: Start a service when bluechi-agent loses connection 
$ cat /etc/systemd/system/handle-bluechi-agent-offline.service
[Unit]
Description=Handle BlueChi Agent going offline and start do-stuff.service
OnFailure=do-stuff.service

[Service]
Type=simple
ExecStart=/usr/local/bin/bluechi-is-online agent --initial-wait=5000 --monitor

I'm not sure yet whether BlueChi will provide some general-purpose systemd units for this; I don't have an idea of what those could look like at the moment. If you do, please let me know. And if you have time to test bluechi-is-online, please let me know what you think so we can incorporate your feedback right away.


coveralls commented Oct 18, 2024

Coverage Status

coverage: 83.301% (-0.1%) from 83.413%
when pulling c3adec6 on engelmi:add-bluechi-is-online-tool
into 4fbfa95 on eclipse-bluechi:main.

Contributor

dofmind commented Oct 21, 2024

Thanks for this PR. I tested bluechi-is-online on my system with multiple nodes. The basic behavior of bluechi-is-online worked as we expected. However, there are three issues.

  1. When bluechi-agent loses connection, monitor-bluechi-agent.service stops but does not restart. I added Restart=on-failure. I also need a condition that bluechi-agent must be online when restarting, so I added ExecStartPre=/usr/bin/wait-for-agent-online.sh using the following script.
$ cat scripts/wait-for-agent-online.sh 
#!/bin/sh

main() {
    # Poll once per second until bluechi-is-online reports the agent as online (exit code 0).
    until /usr/bin/bluechi-is-online agent; do
        sleep 1
    done
}

main "$@"
  2. My system uses the SwitchController DBus method of Agent when the leader node changes. When the leader node changes and bluechi-agent executes the SwitchController DBus method, the bluechi-agent status changes to offline and then reconnects to the bluechi-controller of the new leader node. The monitor-bluechi-agent.service may stop even though bluechi-agent does not lose the connection physically.

  3. Before applying bluechi-is-online, I made bluechi-agent exit with 1 when it doesn't receive a heartbeat from the controller. But now, if bluechi-agent disconnects, bluechi-agent will try to reconnect to bluechi-controller.

Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Did not receive heartbeat from controller since '2500'ms. Disconnecting it...
Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Disconnected from controller
Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Trying to connect to controller (try 1)
Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 21 20:09:11 42dot-ak7 bluechi-agent[522173]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected

If the leader node is changed before the error Registering as 'ak7_master_main' failed: Transport endpoint is not connected is reported, the SwitchController DBus method fails as follows and does not work on bluechi-agent.

root@42dot-ak7:~# dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.102,port=842'
Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
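
A possible variation on the wait-for-agent-online.sh loop from issue 1, sketched for illustration only: it gives up after a fixed number of attempts instead of polling forever. It assumes, as the original script does, that bluechi-is-online agent exits with 0 when the agent is online. The function name wait_for_online and its parameters are hypothetical and not part of BlueChi.

```sh
#!/bin/sh
# wait_for_online MAX_TRIES INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most MAX_TRIES times.
# Returns 0 on success and 1 once all attempts are used up.
wait_for_online() {
    max_tries=$1
    interval=$2
    shift 2
    tries=0
    until "$@"; do
        tries=$((tries + 1))
        # Stop after the last allowed attempt has failed.
        [ "$tries" -lt "$max_tries" ] || return 1
        sleep "$interval"
    done
}

# Usage, e.g. from a script referenced by ExecStartPre=:
#   wait_for_online 30 1 /usr/bin/bluechi-is-online agent
```

With a bounded wait, a failing ExecStartPre= lets Restart=on-failure schedule the next attempt instead of the start job blocking on the loop.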

Member Author

engelmi commented Oct 22, 2024

Thanks for your feedback! @dofmind

> 1. When bluechi-agent loses connection, monitor-bluechi-agent.service stops but does not restart. I added `Restart=on-failure`. I also need a condition that bluechi-agent must be online when restarting, so I added `ExecStartPre=/usr/bin/wait-for-agent-online.sh` using the following script.

I'd suggest using UpheldBy= (the inverse of Upholds=) on `monitor-bluechi-agent.service`:

$ cat /etc/systemd/system/monitor-bluechi-agent.service
[Unit]
Description=Monitor bluechi-agents connection to controller

[Service]
Type=simple
ExecStart=/usr/local/bin/bluechi-is-online agent --initial-wait=5000 --monitor

[Install]
UpheldBy=bluechi-agent.service

This way the monitoring service gets restarted as long as the bluechi-agent.service is active.

> 2. My system uses the SwitchController DBus method of Agent when the leader node changes. When the leader node changes and bluechi-agent executes the SwitchController DBus method, the bluechi-agent status changes to offline and then reconnects to the bluechi-controller of the new leader node. The monitor-bluechi-agent.service may stop even though bluechi-agent does not lose the connection physically.

Although I think the behavior of bluechi-is-online is correct here (since the agent really disconnected), I understand that this connection "wiggling" isn't desired. The ControllerAddress property emits a changed signal, which is also triggered for SwitchController right before disconnecting. In bluechi-is-online, we can use that signal to avoid exiting, since a disconnect is then expected to happen... I'll add a new CLI option to set this (maybe with a timeout?).
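
To watch this signal while testing, something like the following sketch may help. It assumes the property change is delivered as a standard org.freedesktop.DBus.Properties.PropertiesChanged signal on the agent's object path /org/eclipse/bluechi (the path used in the dbus-send call earlier in this thread). The helper name properties_changed_rule is made up for illustration.

```sh
#!/bin/sh
# Build a D-Bus match rule for PropertiesChanged signals on one object path.
properties_changed_rule() {
    object_path=$1
    printf "type='signal',interface='org.freedesktop.DBus.Properties',member='PropertiesChanged',path='%s'\n" "$object_path"
}

# Usage (needs access to the system bus). If the assumption above holds,
# a ControllerAddress change is printed when SwitchController is called:
#   dbus-monitor --system "$(properties_changed_rule /org/eclipse/bluechi)"
```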

> 3. Before applying bluechi-is-online, I made bluechi-agent exit with 1 when it doesn't receive a heartbeat from the controller. But now, if bluechi-agent disconnects, bluechi-agent will try to reconnect to bluechi-controller.
> Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Did not receive heartbeat from controller since '2500'ms. Disconnecting it...
> Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Disconnected from controller
> Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Connecting to controller on tcp:host=192.168.16.101,port=842
> Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
> Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Trying to connect to controller (try 1)
> Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Connecting to controller on tcp:host=192.168.16.101,port=842
> Oct 21 20:09:11 42dot-ak7 bluechi-agent[522173]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
>
> If the leader node is changed before the error Registering as 'ak7_master_main' failed: Transport endpoint is not connected is reported, the SwitchController DBus method fails as follows and does not work on bluechi-agent.
>
> root@42dot-ak7:~# dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.102,port=842'
> Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

This sounds like a bug in BlueChi?! This needs more investigation, I think.
If I understood it correctly, this issue can be reproduced by:

  1. setting up 3 nodes: two running a controller and the third running an agent
  2. stopping the controller (yanking the cable or so)
  3. triggering SwitchController before the "Registering as..." message appears in the bluechi-agent log
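
Step 3 is timing-sensitive, so a small helper could trigger the switch as soon as the disconnect is logged. This is only a sketch: the log text and the dbus-send arguments are copied from the output quoted above, while the function name switch_after_disconnect and the DBUS_SEND override (for dry runs) are hypothetical.

```sh
#!/bin/sh
# Read log lines from stdin and call SwitchController once the agent logs
# "Disconnected from controller". Returns 1 if that line never appears.
switch_after_disconnect() {
    new_address=$1
    while read -r line; do
        case $line in
            *"Disconnected from controller"*)
                # DBUS_SEND is a hypothetical override, e.g. DBUS_SEND=echo for a dry run.
                ${DBUS_SEND:-dbus-send} --system --dest=org.eclipse.bluechi.Agent \
                    --print-reply --type=method_call /org/eclipse/bluechi \
                    org.eclipse.bluechi.Agent.SwitchController "string:$new_address"
                return
                ;;
        esac
    done
    return 1
}

# Usage on the agent node (unit name assumed to be bluechi-agent):
#   journalctl -f -u bluechi-agent | switch_after_disconnect 'tcp:host=192.168.16.102,port=842'
```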

Contributor

dofmind commented Oct 24, 2024

> I'd suggest using UpheldBy= (the inverse of Upholds=) on `monitor-bluechi-agent.service`:

I couldn't test using UpheldBy= because my systemd versions (246.9 and 250.5) don't support it. I'll try it after I backport the patch to systemd to support UpheldBy=.

> The ControllerAddress property emits a changed signal, which is also triggered for SwitchController right before disconnecting. In bluechi-is-online, we can use that signal to avoid exiting, since a disconnect is then expected to happen... I'll add a new CLI option to set this (maybe with a timeout?).

Looks good for a new CLI option with a timeout.

> This sounds like a bug in BlueChi?! This needs more investigation, I think.
> If I understood it correctly, this issue can be reproduced by:

That's right, I will create an issue for this.

Member Author

engelmi commented Oct 24, 2024

> I couldn't test using UpheldBy= because my systemd versions (246.9 and 250.5) don't support it. I'll try it after I backport the patch to systemd to support UpheldBy=.

Ah ok, then UpheldBy= can't be used, of course. And I think I misunderstood the condition you wanted to apply, namely that bluechi-agent must be online when restarting. I think you could achieve that by adding an ExecStartPre= to your unit that runs the new tool with --initial-wait. This should keep the unit in an activating state so that dependent services are not started. For example:

[Service]
Type=simple
ExecStartPre=/usr/local/bin/bluechi-is-online agent --initial-wait=5000
ExecStart=/usr/local/bin/bluechi-is-online agent --monitor

> Looks good for a new CLI option with a timeout.

Just added the new option --switch-timeout=<ms>. If bluechi-is-online is called with this option, it will wait for the specified amount of time before exiting with code 1. If the agent reconnects within that time frame, bluechi-is-online will continue to monitor the state.
@dofmind Please give it a try if you have time.
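
One way to picture the described behaviour is the toy model below. It is not the tool's implementation: the event names, the input format and the function monitor_decision are invented for illustration, and it assumes that the address change is signalled before the disconnect and that only a disconnect following an address change is tolerated.

```sh
#!/bin/sh
# Toy model of the behaviour described for --switch-timeout; NOT the tool's code.
# Reads "<time-ms> <event>" lines from stdin, where <event> is one of the
# made-up names address-changed, offline, online.
# Returns 0 if monitoring would continue, 1 if the monitor would exit with code 1.
monitor_decision() {
    timeout_ms=$1
    switching=0       # 1 after the controller address changed
    offline_since=""  # time of the tolerated disconnect, empty while online

    while read -r now event; do
        # A tolerated disconnect that outlived the timeout ends monitoring.
        if [ -n "$offline_since" ] && [ $((now - offline_since)) -gt "$timeout_ms" ]; then
            return 1
        fi
        case $event in
            address-changed)
                switching=1
                ;;
            offline)
                # Without a preceding address change, a disconnect is fatal.
                [ "$switching" -eq 1 ] || return 1
                offline_since=$now
                ;;
            online)
                switching=0
                offline_since=""
                ;;
        esac
    done
    # Input ended while still offline: the agent never reconnected.
    [ -z "$offline_since" ]
}
```

Feeding it the lines "0 address-changed", "10 offline" and "500 online" with a timeout of 1000 models a controller switch that completes in time, so monitoring continues.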

I noticed, however, a problem with the order of the changed signals for the connection state and the address. Currently, we first emit the change in the connection state and then the change of the address, which should be reversed in my view. I prepared a small PR to fix this: #968

> > This sounds like a bug in BlueChi?! This needs more investigation, I think.
> > If I understood it correctly, this issue can be reproduced by:
>
> That's right, I will create an issue for this.

Thank you!

Contributor

dofmind commented Oct 25, 2024

After applying the updated is-online application and the small PR #968, I tested using the --wait (instead of --initial-wait) and --switch-timeout options, and both worked perfectly.

[Service]
Type=simple
ExecStartPre=/usr/bin/bluechi-is-online agent --wait=5000
ExecStart=/usr/bin/bluechi-is-online agent --monitor --switch-timeout=1000
Restart=on-failure

I created an issue about triggering the SwitchController DBus method: #966. If this gets resolved, I'll finally be able to apply the is-online solution on my system. Thanks.


3 participants