Optimizations: Consider lightweight icmp? Consider closer endpoint? Consider spot? #27

adamravid · 2022-11-09T20:09:55Z

Maybe these are all obvious and each has unique blockers, but have the project maintainers considered a few optimizations: test with icmp instead of tcp, test to an Amazon public address, and use spot instances?

bwhaley · 2022-11-09T21:40:27Z

Thanks for the suggestions!

test with icmp instead of tcp

I initially used icmp, but ultimately decided on tcp/https because I felt it would be a more realistic health check, e.g. in the event of some unusual behavior where ICMP worked but TCP didn't. Seems reasonable to have ICMP as an option though.
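For illustration, a minimal sketch of what a tcp/https style check could look like with Python's standard library. This is not alterNAT's actual code: the function name `https_check`, the 5-second default timeout, and the injectable `opener` parameter are assumptions made for the example. It exercises the TCP connect, the TLS handshake, and the first bytes of the HTTP response, which an ICMP echo would not.

```python
import socket
import urllib.error
import urllib.request


def https_check(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a request to `url` gets any HTTP response within `timeout` seconds.

    `opener` is a hypothetical injection point so the function can be tested
    without network access; it defaults to urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            # Reading one byte confirms response data actually arrived.
            response.read(1)
        return True
    except urllib.error.HTTPError:
        # A 4xx/5xx status still proves the network path works end to end.
        return True
    except (urllib.error.URLError, socket.timeout, OSError):
        # Connect failures, TLS handshake timeouts, and read timeouts land here.
        return False
```

Calling `https_check("https://www.example.com")` from an instance behind the NAT would return False on the kinds of handshake and read timeouts that ICMP alone might not reveal.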

test to an Amazon public address

Also could be a good option!

use spot instances

Might require some testing, but could help reduce the expense of the NAT instance for sure.

PRs are welcome for all of the above. :)

nitrocode · 2022-11-22T12:41:03Z

For spot, it could be as simple as updating the launch template to expose the instance_market_options block (the market options) and/or accepting a user-supplied aws_launch_template ARN as an input, as an escape hatch.

Launch template defined here

https://github.com/1debit/alternat/blob/1e594ce2a86a42130fd9d45f3abecca640fc470a/modules/terraform-aws-alternat/main.tf#L190

To safely kill the instance upon an interruption, I don't think we could rely solely on the instance terminating lifecycle hook and there aren't additional hooks we could key off of.

According to this blog post it looks like it can be implemented in 2 ways.

event rule that matches when spot is interrupted and triggers the lambda
shell script on the instance that checks for when spot is interrupted and triggers the lambda
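As a rough sketch of the first option, the event pattern below uses the `source` and `detail-type` values that EC2 publishes for Spot interruption notices. The names `SPOT_INTERRUPTION_PATTERN` and `matches_pattern` are hypothetical, and the matcher is only a simplified local stand-in for EventBridge's own matching, useful for checking sample events.

```python
import json

# Event pattern for a rule that fires on Spot interruption notices.
SPOT_INTERRUPTION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}

# String form, e.g. for a rule definition that expects JSON.
PATTERN_JSON = json.dumps(SPOT_INTERRUPTION_PATTERN)


def matches_pattern(event, pattern=SPOT_INTERRUPTION_PATTERN):
    """Simplified matcher: each pattern key must list the event's value for that key."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())
```

The interrupted instance's ID arrives in the event under `detail["instance-id"]`, which is what a Lambda triggered by such a rule would act on.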

Finally, it could be put behind a flag i.e. spot_enabled or logic based on the market options input so it can be tested by a subset of users instead of all the users.

What do you folks think?

bwhaley · 2022-11-22T23:07:27Z

Thanks for doing this research. Seems like the event rule approach might be the best option. I wonder if the event rule could call TerminateInstanceInAutoScalingGroup, which would in turn invoke the instance terminating lifecycle hook. Then we wouldn't really need a separate workflow to handle this.
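A minimal sketch of that idea, assuming a small Lambda sits between the event rule and the Auto Scaling API. The function names are hypothetical and none of this exists in alterNAT; the only real API used is boto3's `terminate_instance_in_auto_scaling_group`. Whether the lifecycle hook then completes inside the Spot notice window is untested here.

```python
def handle_spot_interruption(event, autoscaling_client):
    """Ask Auto Scaling to terminate the interrupted instance.

    `autoscaling_client` is expected to behave like boto3.client("autoscaling");
    passing it in keeps the function testable without AWS credentials.
    """
    instance_id = event["detail"]["instance-id"]
    # Leave desired capacity unchanged so the group launches a replacement.
    return autoscaling_client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,
    )


def lambda_handler(event, context):
    """Hypothetical Lambda entry point wired to the Spot interruption rule."""
    import boto3  # imported lazily; available in the AWS Lambda Python runtime

    return handle_spot_interruption(event, boto3.client("autoscaling"))
```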

The only catch I can think of off the top of my head is that a spot instance only has 2 minutes to terminate after the interruption notice, but the termination lifecycle hook heartbeat can be much longer than that. At a minimum, the heartbeat_timeout value would need to be configurable (it probably should be anyway), and the docs should explain that for spot use cases it must be set to less than 120 seconds.
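The constraint could be expressed as a small guard like the one below. It is only a sketch: `validate_heartbeat_timeout` is a hypothetical helper, and `spot_enabled` refers to the flag proposed earlier in this thread, not to an existing input.

```python
SPOT_NOTICE_SECONDS = 120  # Spot interruption notice is two minutes


def validate_heartbeat_timeout(heartbeat_timeout, spot_enabled):
    """Return the timeout unchanged, or raise if it cannot fit in the Spot notice window."""
    if spot_enabled and heartbeat_timeout >= SPOT_NOTICE_SECONDS:
        raise ValueError(
            f"heartbeat_timeout must be less than {SPOT_NOTICE_SECONDS} seconds "
            f"when spot is enabled, got {heartbeat_timeout}"
        )
    return heartbeat_timeout
```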

bwhaley · 2022-12-15T00:30:34Z

Quick updates:

The connectivity check endpoints are now configurable via var.connectivity_test_check_urls
The heartbeat timeout is now exposed via var.lifecycle_heartbeat_timeout

therealvio · 2023-06-29T00:49:53Z

Hi there,

We've deployed alterNAT to the eu-west-1 (Ireland) region and we've noticed that requests to example.com occasionally time out. See the log snippets below.

[ERROR]	2023-06-28T17:12:22.372Z	307bb718-1412-48da-81e4-cfb3db4e055a	error connecting to https://www.example.com: <urlopen error _ssl.c:1112: The handshake operation timed out>
[ERROR]	2023-06-28T17:33:54.441Z	4316bf29-cec9-4ee2-8e03-0ffb1314415c	timeout error connecting to https://www.example.com: The read operation timed out
[ERROR]	2023-06-28T17:36:28.864Z	83de7836-4cfd-4114-bc58-adefa3f906d4	timeout error connecting to https://www.example.com: The read operation timed out
[ERROR]	2023-06-28T17:42:17.948Z	2b58395a-8aa8-4e74-aa28-26d6ff50e00b	error connecting to https://www.example.com: <urlopen error _ssl.c:1112: The handshake operation timed out>

We noticed this because we have monitoring set up to alert us on our error burn rate. I am not convinced that the problem is example.com itself, but rather where the site's endpoint is located, meaning that US-based deployments are OK.

Google is working fine, but I am confident that the hardcoded domain example.com can be a detriment for customers located outside the US. This is probably worth documenting somewhere. Happy to raise a PR to update the README docs if you'd like, @bwhaley.

Thanks for the override options, we're going to apply that anywho.

therealvio · 2023-06-30T03:21:53Z

An update to my last post:

We changed our connectivity check URLs to the following:

https://www.google.com
https://www.aws.amazon.com; and
http://captive.apple.com

and we saw a dramatic improvement in the reliability of the connectivity testing results across our us-west-1 and eu-west-1 region deployments.
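One reason several endpoints can help is sketched below, under the assumption that the check counts as healthy as soon as any one endpoint responds; alterNAT's actual logic may differ. `first_reachable` and the `probe` callable are hypothetical names, and the URL list is the one given above.

```python
CHECK_URLS = [
    "https://www.google.com",
    "https://www.aws.amazon.com",
    "http://captive.apple.com",
]


def first_reachable(urls, probe):
    """Return the first URL for which probe(url) is truthy, or None if all fail.

    `probe` is any callable taking a URL and returning True on success,
    for example a function that performs an HTTP request with a timeout.
    """
    for url in urls:
        if probe(url):
            return url
    return None
```

With this shape, a single slow or distant site only triggers a failure when every other endpoint fails too.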

bwhaley added the "good first issue" and "help wanted" labels · Nov 10, 2022
