iperf3 tests need control over bind address to support tests with NAT'd hosts #1476

Open
jprorama opened this issue Oct 4, 2024 · 12 comments

Comments

@jprorama

jprorama commented Oct 4, 2024

Using pscheduler to test throughput against servers running in NAT'd environments, like cloud hosting, requires the external test participant to use the public IP of the instance while the NAT'd server itself uses the instance's internal (often private) IP.

The pscheduler task subcommand and throughput test have some support for specifying bind addresses, but these do not get translated to the correct bind behavior when running iperf3.

An easy option is to avoid specifying the iperf3 bind parameter "-B" in command invocations. This causes iperf3 to bind on any interface and allows the test to proceed. Unfortunately, the only way to cause iperf3 to be called in this way is to submit pscheduler throughput tasks with only a destination host parameter, e.g. pscheduler task throughput -d <destination>. These test specifications work to test throughput to and from a NAT'd host but require that the test is submitted from a shell on the source host (so that the source address is implied).

This works because the code is written to not use the -B bind parameter to iperf3 if there is no source host provided in the test specification, in run_client() and run_server().
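
To illustrate the behavior described above, here is a minimal sketch. It is not the actual pScheduler code: `build_client_args` and its parameters are hypothetical names made up for this example, and the addresses are documentation placeholders. Only the iperf3 flags themselves (`-B`, `-c`, `-p`, `-t`, `--json`) are real.

```python
from typing import List, Optional


def build_client_args(dest: str, source: Optional[str] = None,
                      port: int = 5201, duration: int = 20) -> List[str]:
    """Hypothetical sketch of building an iperf3 client command line."""
    args = ["iperf3", "-p", str(port)]
    # Only bind when the test spec names a source host. Without -B,
    # iperf3 leaves the choice of local address to the OS, which is
    # what lets the destination-only form work from behind NAT.
    if source is not None:
        args += ["-B", source]
    args += ["-c", dest, "-t", str(duration), "--json"]
    return args


print(build_client_args("ps.example.org"))
print(build_client_args("ps.example.org", source="203.0.113.10"))
```

The second call shows the problem case: if the source resolves to a public address that is not configured on any local interface, the `-B` argument is one iperf3 cannot bind.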

This doesn't work with throughput tests that specify both a source and destination host. In those cases, the -B parameter is passed to both the server and the client iperf3 invocations and will default to the public IP of the NAT'd host. This causes the iperf3 bind to fail and kills the command. The pscheduler task result then reports a timeout and the test fails. Specifying the various bind parameters of the task subcommand or throughput test doesn't work either because these do not make it through to the iperf3 command construction in a way that makes sense for the NAT'd use-case.

Since scheduled tests and third-party tests specify both source and destination hosts in the test specification, the existing code prevents these tests from working with hosts that are using NAT. This prevents regular testing of hosts in cloud environments.

jprorama added a commit to jprorama/pscheduler that referenced this issue Oct 4, 2024
Add support for the bind_address parameter to be read
from the iperf3.conf file and use it to override
bind address interpretation from the test spec.

This enables pscheduler throughput tests against
hosts that are deployed in NAT environments where
their public IP used by clients is different from
their local network address.  The config provides
the local knowledge to use the correct bind address.

Proposed fix for perfsonar#1476
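
A rough sketch of how such a config override could work. The `[iperf3]` section name, the file layout, and the function `effective_bind_address` are assumptions made for illustration; `bind_address` is the parameter name from the commit message, and the actual patch may differ.

```python
import configparser
from typing import Optional


def effective_bind_address(spec_address: Optional[str],
                           config_text: str) -> Optional[str]:
    """Return the configured bind_address if set, else the one from the test spec."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Hypothetical layout: an [iperf3] section with an optional bind_address key.
    override = parser.get("iperf3", "bind_address", fallback=None)
    return override or spec_address


# Placeholder values: 10.9.8.7 stands in for the host's inside address,
# 203.0.113.10 for the public address taken from the test spec.
example_conf = """
[iperf3]
bind_address = 10.9.8.7
"""
print(effective_bind_address("203.0.113.10", example_conf))
```

With no `bind_address` in the config, the function falls back to whatever the test spec provides, so hosts that are not behind NAT would be unaffected.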
@jprorama jprorama changed the title iperf3 tests need control over bind address to support tests from NAT'd hosts iperf3 tests need control over bind address to support tests with NAT'd hosts Oct 5, 2024
@pllopis

pllopis commented Oct 10, 2024

Thanks for creating this issue and for the PR. I'm experiencing the same issue, so would also be interested in a resolution for this issue.
In our case it's a Kubernetes cluster with perfSONAR sitting behind a LoadBalancer service, and while I haven't tested this patch, indeed it looks like it would resolve the issue.

@pllopis

pllopis commented Oct 10, 2024

Worth noting that this seems a duplicate of #1323, for which there used to be a solution that was already merged (which allowed setting --dest-bind and --source-bind separately), but this functionality was removed in 045bc00 as part of #256

@mfeit-internet2
Member

The better way to deal with this is:

  • Put an entry in the NATted host's /etc/hosts that points the outside host's name at the inside address (e.g., 10.9.8.7 outside.cloud.example.org).
  • Make sure the resolver is configured to query hosts before dns.
  • Always refer to the host by its FQDN and never by its outside IP when configuring tests.
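
One way to check whether the steps above took effect is to test whether the host's name resolves to an address the host can actually bind, since that is what `iperf3 -B` needs. This is a minimal sketch using only the Python standard library; `can_bind` is a name made up here, and the FQDN in the usage comment is the example name from the list above.

```python
import socket


def can_bind(name: str) -> bool:
    """Resolve name to IPv4 addresses and report whether any can be bound locally."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    for _family, _type, _proto, _canon, sockaddr in infos:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                # Port 0 asks the OS for any free port; only the address matters.
                sock.bind((sockaddr[0], 0))
                return True
            except OSError:
                continue
    return False


# Loopback is always bindable, so this is a quick sanity check.
print(can_bind("127.0.0.1"))

# On the NATted host, the same call with the outside name is expected to
# succeed only once /etc/hosts maps that name to the inside address:
# print(can_bind("outside.cloud.example.org"))
```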

@pllopis

pllopis commented Oct 10, 2024

Thanks @mfeit-internet2 , that's indeed a valid workaround that works. What also works is to add the options -B 0.0.0.0 --reverse, which even skips having to touch /etc/hosts (since it's obviously just reversing the roles..)

However I'd still be interested in a more permanent solution going forward that doesn't require further manual adjustments.
Do you know why the --dest-bind and --source-bind options were removed? It seems like an elegant way to make this work, especially if -B becomes an alias of --source-bind, as I believe it was already implemented :)
(.. or the config file override as implemented in #1477)

@jprorama
Author

jprorama commented Oct 10, 2024 via email

@pllopis

pllopis commented Oct 17, 2024

Hi again, perhaps this should go on a separate issue (let me know if you'd like me to open a new one) but since it's very related I'll post here first:

There's the same issue affecting owampd, where it will try to bind to the address specified as the destination.

Unfortunately in this case the workaround of using /etc/hosts does not seem to work. With that /etc/hosts edit, iperf3 works just fine, and getent hosts gives back the local IP address for the DNS name that I use with --dest, but owampd seems to try to bind to the public IP anyway.

Is this expected?
Any idea about whether there's another potential workaround when sitting behind a NAT/LB?

If there isn't and changing the code is required, would you accept a patch that works similarly to the one in this PR, where it's possible to optionally override the listen address for UDP tests? I could look into writing it if you'd like to see a patch. (Having a config option in this sense I think would make sense, but let me know if you have a specific opinion on how this should be done)

Thanks!

@mfeit-internet2
Member

@pllopis That should be filed against the owamp repository. The OWAMP protocol embeds addresses in its payloads and I'd have to think about whether or not making alterations to support NAT would conform to RFC 4656.

@jprorama
Author

@mfeit-internet2 I've set up an install of 5.1.4 on Jetstream2 with the name-based configuration you recommended for NAT environments.

https://js2.ps.rundmz.projects.rcops.dev

This works for reaching the host and performing ad hoc tests. But I'm still not seeing scheduled test results show up in our campus toolkit install.

https://ps-sd.rc.uab.edu/

I have our site configured to test every 6 hours but don't see any throughput results showing up. The trace tests appear to report correctly.

I have my NAT'd ps node set up with its correct FQDN.

exouser@js2:~$ cat /etc/hosts | tail -2
# NAT config for perfsonar
10.1.221.182 js2.ps.rundmz.projects.rcops.dev js2
exouser@js2:~$ hostname
js2
exouser@js2:~$ hostname -f
js2.ps.rundmz.projects.rcops.dev
exouser@js2:~$ cat /etc/resolv.conf | tail -3
nameserver 127.0.0.53
options edns0 trust-ad
search ps.rundmz.projects.rcops.dev js2local

Here are the tests configured by our campus ps node (ps-sd.rc.uab.edu) showing up on my js2 instance:

exouser@js2:~$ pscheduler schedule +PT1H
2024-10-25T14:56:46+00:00 - 2024-10-25T14:57:15+00:00 (Pending)
throughput --source 138.26.220.66 --source-node 138.26.220.66 --dest js2.ps.rundmz.projects.rcops.dev --dest-node
  js2.ps.rundmz.projects.rcops.dev --duration PT20S --ip-version 4 (Run with tool 'iperf3')
https://js2/pscheduler/tasks/2a01f146-2b82-43e3-939d-b0b2ad5e1383/runs/f2d16301-f8c1-48e9-b46d-d71fd8612af8


2024-10-25T15:11:10+00:00 - 2024-10-25T15:11:39+00:00 (Pending)
throughput --source js2.ps.rundmz.projects.rcops.dev --source-node js2.ps.rundmz.projects.rcops.dev --dest 138.26.220.66 --dest-node 138.26.220.66 --duration PT20S --ip-version 4 (Run with tool 'iperf3')
https://js2/pscheduler/tasks/0e60eb36-ebf2-461f-8d3b-ed20920b331d/runs/716436ac-670b-42dd-a0d3-0c1748cec0d4

Here are those tests from the ps-sd side:

jpr_@ps-sd:~$ pscheduler schedule --host js2.ps.rundmz.projects.rcops.dev +PT1H
2024-10-25T14:56:46+00:00 - 2024-10-25T14:57:15+00:00 (Pending)
throughput --source 138.26.220.66 --source-node 138.26.220.66 --dest js2.ps.rundmz.projects.rcops.dev --dest-node
  js2.ps.rundmz.projects.rcops.dev --duration PT20S --ip-version 4 (Run with tool 'iperf3')
https://js2.ps.rundmz.projects.rcops.dev/pscheduler/tasks/2a01f146-2b82-43e3-939d-b0b2ad5e1383/runs/f2d16301-f8c1-48e9-b46d-d71fd8612af8


2024-10-25T15:11:10+00:00 - 2024-10-25T15:11:39+00:00 (Pending)
throughput --source js2.ps.rundmz.projects.rcops.dev --source-node js2.ps.rundmz.projects.rcops.dev --dest 138.26.220.66 --dest-node 138.26.220.66 --duration PT20S --ip-version 4 (Run with tool 'iperf3')
https://js2.ps.rundmz.projects.rcops.dev/pscheduler/tasks/0e60eb36-ebf2-461f-8d3b-ed20920b331d/runs/716436ac-670b-42dd-a0d3-0c1748cec0d4

I'm noticing that the js2 node reports the tests under its hostname without the domain name. Not sure if that's the source of the problem.

Do you have any suggestions where to look to fix this config?

@jprorama
Author

It appears that throughput results from ps-sd->js2 have started to show up.

https://ps-sd.rc.uab.edu/grafana/d/b6df6a8d-1b66-4712-a8da-957b36dc37dc/perfsonar-endpoint-pair-explorer?orgId=1&var-ds=f990c0c6-a920-41c2-8b3e-a43e6ff9234a&from=now-2d&to=now&var-source=ps-sd.rc.uab.edu&var-dest=js2.ps.rundmz.projects.rcops.dev&var-node_name=ps-sd.rc.uab.edu&var-source_ref=All&var-dest_ref=All

I'm wondering if these one-directional results relate to the URL hostname vs fqdn differences noted above.

Does the throughput task take its name from the equivalent of the hostname or hostname -f command?
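
For reference, the two names can be compared from Python with the standard library. This only shows what the operating system reports on the node; whether pScheduler derives the task URL from either value is not established in this thread.

```python
import socket

short_name = socket.gethostname()  # comparable to the `hostname` command
full_name = socket.getfqdn()       # roughly comparable to `hostname -f`

print(f"hostname: {short_name}")
print(f"fqdn:     {full_name}")
if short_name == full_name:
    print("The host name is already fully qualified, or no longer name was found.")
```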

@jprorama
Author

jprorama commented Nov 1, 2024

Bidirectional tests with the NAT'd node still appear to have problems. I changed the NAT'd device's hostname to be the FQDN, rather than relying on system config to provide the hostname+domain.

sudo hostnamectl set-hostname js2.ps.rundmz.projects.rcops.dev

This does ensure that the FQDN for the NAT host shows up in its own view of the scheduled test URLs reported by pscheduler. It doesn't appear to resolve the bidirectional test not working.

Inspecting the scheduled tests, it appears they run and succeed in both directions. This is the record of bidirectional tests registered on an external test point.

jpr_@ps-sd:~$ pscheduler schedule  --filter-test=throughput -PT6H | awk 'BEGIN {RS=""; FS="\n"} /js2.ps.rundmz/'
2024-11-01T05:59:33-05:00 - 2024-11-01T05:59:56-05:00 (Finished)
throughput --source 138.26.220.66 --source-node 138.26.220.66 --dest js2.ps.rundmz.projects.rcops.dev --dest-node js2.ps.rundmz.projects.rcops.dev --duration PT20S --ip-version 4 (Run with tool 'iperf3')
https://ps-sd.rc.uab.edu/pscheduler/tasks/d51f023e-4e87-44c4-80c8-ac75eaaf21a7/runs/6fba6633-1ab3-4ec7-8e19-d3eb607a40e4
2024-11-01T06:06:07-05:00 - 2024-11-01T06:06:30-05:00 (Finished)
throughput --source js2.ps.rundmz.projects.rcops.dev --source-node js2.ps.rundmz.projects.rcops.dev --dest 138.26.220.66 --dest-node 138.26.220.66 --duration PT20S --ip-version 4 (Run with tool 'iperf3')
https://ps-sd.rc.uab.edu/pscheduler/tasks/e89d0053-3a11-4c57-8965-49005ca3b482/runs/421a9b76-5765-4a8c-bbb5-064ee667d113

The test URLs show that there are results in both directions. For some reason, however, the web UI of the testing node does not show the test results for when the NAT'd node is the source (second test above).

Looking at the results of a test shows that the iperf3 test did execute and generated results:

https://ps-sd.rc.uab.edu/pscheduler/tasks/e89d0053-3a11-4c57-8965-49005ca3b482/runs/421a9b76-5765-4a8c-bbb5-064ee667d113

However, I cannot access those results through the web UI dashboard:

https://ps-sd.rc.uab.edu/grafana/d/b6df6a8d-1b66-4712-a8da-957b36dc37dc/perfsonar-endpoint-pair-explorer?orgId=1&var-ds=f990c0c6-a920-41c2-8b3e-a43e6ff9234a&from=now-7d&to=now&var-source=js2.ps.rundmz.projects.rcops.dev&var-dest=&var-node_name=ps-sd.rc.uab.edu&var-source_ref=All&var-dest_ref=All


@jprorama
Author

jprorama commented Nov 1, 2024

Looking at the tests from the perspective of the NAT'd node suggests it sees the tests as failing.

exouser@js2:~$ pscheduler schedule  --filter-test=throughput -PT1H | awk 'BEGIN {RS=""; FS="\n"} /138.26/'
2024-11-01T17:13:05+00:00 - 2024-11-01T17:13:08+00:00 (Failed)
throughput --source js2.ps.rundmz.projects.rcops.dev --source-node js2.ps.rundmz.projects.rcops.dev --dest 138.26.220.66 --dest-node 138.26.220.66 --duration PT20S --ip-version 4 (Run with tool 'iperf3')
https://js2.ps.rundmz.projects.rcops.dev/pscheduler/tasks/3ebf604f-54d6-4eb6-9ccc-3dbf94f439f3/runs/f4e99390-6cf0-4bf1-9a3b-906172843fab

At the result URL above, the reported failure is due to not being able to reach the started iperf3 server:

{"added":"2024-11-01T04:34:06+00:00","state":"failed","errors":null,"result":{"diags":"/usr/bin/iperf3 -p 5201 -4 -B js2.ps.rundmz.projects.rcops.dev -c 138.26.220.66 -t 20 --json --rsa-public-key-path /var/pscheduler-server/runner/tmp/tmpf0q07ueb/tmpkp6skygy/public-key --username 9VGmQPkKmNOYAm0jirbR","error":"iperf3 returned an error: unable to connect to server - server may have stopped running or use a different port, firewall issue, etc.: Connection refused","succeeded":false},"duration":"PT3S","end-time":"2024-11-01T17:13:08+00:00","priority":null,"start-time":"2024-11-01T17:13:05+00:00","limit-diags":"Hints:\n  requester: 138.26.220.66\n  server: 10.1.221.182\nIdentified as everybody\nClassified as default\nApplication: Defaults applied to non-friendly hosts\n  Group 1: Limit 'allowed-tests' passed\n  Group 1: Limit 'throughput-default-parallel' passed\n  Group 1: Limit 'throughput-default-time' passed\n  Group 1: Limit 'throughput-default-udp' passed\n  Group 1: Want all, 4/4 passed, 0/4 failed: PASS\n  Application PASSES\nPassed one application.  
Stopping.\nProposal meets limits\nPriority set at 0:\n  Initial priority  (Set to 0)","participant":0,"result-full":[{"diags":"/usr/bin/iperf3 -p 5201 -4 -B js2.ps.rundmz.projects.rcops.dev -c 138.26.220.66 -t 20 --json --rsa-public-key-path /var/pscheduler-server/runner/tmp/tmpf0q07ueb/tmpkp6skygy/public-key --username 9VGmQPkKmNOYAm0jirbR","error":"iperf3 returned an error: unable to connect to server - server may have stopped running or use a different port, firewall issue, etc.: Connection refused","succeeded":false},null],"clock-survey":[{"time":"2024-11-01T13:13:34.472013-04:00","offset":5.435943603515625e-05,"source":"ntp","reference":"secondary reference (2) from 130.207.244.240","synchronized":true},{"time":"2024-11-01T12:13:34.788386-05:00","offset":2.258e-06,"source":"chrony","reference":"23.155.40.38","synchronized":true}],"participants":["js2.ps.rundmz.projects.rcops.dev","138.26.220.66"],"result-merged":{"diags":"Participant 0:\n/usr/bin/iperf3 -p 5201 -4 -B js2.ps.rundmz.projects.rcops.dev -c 138.26.220.66 -t 20 --json --rsa-public-key-path /var/pscheduler-server/runner/tmp/tmpf0q07ueb/tmpkp6skygy/public-key --username 9VGmQPkKmNOYAm0jirbR\n","error":"iperf3 returned an error: unable to connect to server - server may have stopped running or use a different port, firewall issue, etc.: Connection refused","succeeded":false},"state-display":"Failed","participant-data":{"schema":3,"iperf3-version":"3.17.1"},"participant-data-full":[{"schema":3,"iperf3-version":"3.17.1"},{"_auth":null,"schema":3,"server_port":5201,"iperf3-version":"3.17.1"}],"href":"https://js2.ps.rundmz.projects.rcops.dev/pscheduler/tasks/3ebf604f-54d6-4eb6-9ccc-3dbf94f439f3/runs/f4e99390-6cf0-4bf1-9a3b-906172843fab","task-href":"https://js2.ps.rundmz.projects.rcops.dev/pscheduler/tasks/3ebf604f-54d6-4eb6-9ccc-3dbf94f439f3","result-href":"https://js2.ps.rundmz.projects.rcops.dev/pscheduler/tasks/3ebf604f-54d6-4eb6-9ccc-3dbf94f439f3/runs/f4e99390-6cf0-4bf1-9a3b-906172843fab/result"}
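
To make records like the one above easier to read, the `diags` command line can be pulled out of the result JSON and split into its bind and connect arguments. A minimal sketch, assuming the result has the shape shown above; `summarize_iperf3_diags` is a hypothetical helper, and the sample is trimmed from the record above.

```python
import json
import shlex
from typing import Dict, Optional


def summarize_iperf3_diags(result_json: str) -> Dict[str, Optional[str]]:
    """Extract the bind address (-B) and client target (-c) from result.diags."""
    record = json.loads(result_json)
    tokens = shlex.split(record["result"]["diags"])
    summary: Dict[str, Optional[str]] = {"bind": None, "connect_to": None}
    for flag, key in (("-B", "bind"), ("-c", "connect_to")):
        if flag in tokens:
            index = tokens.index(flag)
            if index + 1 < len(tokens):
                summary[key] = tokens[index + 1]
    summary["error"] = record["result"].get("error")
    return summary


# Trimmed copy of the failed record above (only the fields used here).
sample = json.dumps({
    "state": "failed",
    "result": {
        "diags": "/usr/bin/iperf3 -p 5201 -4 -B js2.ps.rundmz.projects.rcops.dev "
                 "-c 138.26.220.66 -t 20 --json",
        "error": "iperf3 returned an error: unable to connect to server",
        "succeeded": False,
    },
})
print(summarize_iperf3_diags(sample))
```

A `-c` argument means the command ran iperf3 in client mode, connecting to that address, while `-B` is the local address it tried to bind.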

@jprorama
Author

jprorama commented Nov 1, 2024

An additional, confusing note on the tests reported by the NAT node: for a test js2->ps-sd as seen from the NAT'd node, the URL referenced for the test appears to be for a test in the reverse direction, i.e. data flowing to the js2 node (the listener).

Here is a failed test record:



2024-11-01T17:19:16+00:00 - 2024-11-01T17:19:19+00:00 (Failed)
throughput --source js2.ps.rundmz.projects.rcops.dev --source-node js2.ps.rundmz.projects.rcops.dev --dest 138.26.220.66 --dest-node 138.26.220.66 --duration PT20S --ip-version 4 (Run with tool 'iperf3')
https://js2.ps.rundmz.projects.rcops.dev/pscheduler/tasks/7ded5d7a-2c47-4b90-878e-7348fb0ba276/runs/8391d355-1b45-49ba-8604-63207c3ae2a0

Inspecting the results URL indicates failure but for a test in the opposite direction.

"/usr/bin/iperf3 -p 5201 -4 -B js2.ps.rundmz.projects.rcops.dev -c 138.26.220.66 -t 20 --json --rsa-public-key-path /var/pscheduler-server/runner/tmp/tmp4cvihl3_/tmpygkjre0m/public-key --username nuJl6Nm2tlMxeuLu3qfO"
