Optimizations: Consider lightweight icmp? Consider closer endpoint? Consider spot? #27

Open
adamravid opened this issue Nov 9, 2022 · 6 comments
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@adamravid

Maybe these are all obvious and each has its own blockers, but have the project maintainers considered a few optimizations: test with ICMP instead of TCP, test against an Amazon public address, and use spot instances?

@bwhaley
Member

bwhaley commented Nov 9, 2022

Thanks for the suggestions!

test with icmp instead of tcp

I initially used ICMP, but ultimately decided on TCP/HTTPS because I felt it would be a more realistic health check, e.g. in the event of some unusual behavior where ICMP worked but TCP didn't. It seems reasonable to have ICMP as an option though.

test to an Amazon public address

Also could be a good option!

use spot instances

Might require some testing, but could help reduce the expense of the NAT instance for sure.

PRs are welcome for all of the above. :)

@bwhaley added the good first issue and help wanted labels on Nov 10, 2022
@nitrocode
Contributor

nitrocode commented Nov 22, 2022

For spot, it could be as simple as updating the launch template to expose the market_options argument and/or providing our own aws_launch_template arn as an input as an escape hatch.

Launch template defined here

https://github.com/1debit/alternat/blob/1e594ce2a86a42130fd9d45f3abecca640fc470a/modules/terraform-aws-alternat/main.tf#L190

To safely kill the instance upon an interruption, I don't think we could rely solely on the instance terminating lifecycle hook and there aren't additional hooks we could key off of.

According to this blog post it looks like it can be implemented in 2 ways.

  • event rule that matches when spot is interrupted and triggers the lambda
  • shell script on the instance that checks for when spot is interrupted and triggers the lambda
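
The first option could look roughly like the sketch below. The event fields (`source`, `detail-type`, `detail.instance-id`) follow the EC2 Spot Instance Interruption Warning event that EC2 publishes to EventBridge. Everything else is an assumption made for illustration: `handler`, `instance_id_from_event`, `event_pattern_json`, and the `replace_route` callback are hypothetical names and not part of alterNAT.

```python
import json

# EventBridge pattern matching the two-minute Spot interruption warning.
SPOT_INTERRUPTION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}


def event_pattern_json():
    """Serialize the pattern in the JSON string form an event rule expects."""
    return json.dumps(SPOT_INTERRUPTION_PATTERN)


def instance_id_from_event(event):
    """Return the interrupted instance ID, or None if this is not a Spot warning."""
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None
    return event.get("detail", {}).get("instance-id")


def handler(event, context, replace_route=print):
    """Hypothetical Lambda entry point.

    replace_route is a stand-in for whatever failover logic the real
    Lambda runs; it is an assumption, not alterNAT's actual function.
    """
    instance_id = instance_id_from_event(event)
    if instance_id is None:
        return {"handled": False}
    replace_route(instance_id)
    return {"handled": True, "instance_id": instance_id}
```

Passing the failover logic in as a callback keeps the event parsing testable without AWS credentials.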

Finally, it could be put behind a flag i.e. spot_enabled or logic based on the market options input so it can be tested by a subset of users instead of all the users.

What do you folks think?

@bwhaley
Member

bwhaley commented Nov 22, 2022

Thanks for doing this research. Seems like the event rule approach might be the best option. I wonder if the event rule could call TerminateInstanceInAutoScalingGroup, which would in turn invoke the instance terminating lifecycle hook. Then we wouldn't really need a separate workflow to handle this.

The only catch I can think of off the top of my head is that a spot instance only has 2 minutes to terminate, but the termination lifecycle hook heartbeat can be much longer than that. At a minimum, the heartbeat_timeout value would need to be configurable (it probably should be anyway), and the docs should explain that for spot use cases it must be set to less than 120 seconds.
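
A minimal sketch of that idea, assuming a small Lambda sits between the event rule and the Auto Scaling API. `terminate_instance_in_auto_scaling_group` is the boto3 Auto Scaling client method for TerminateInstanceInAutoScalingGroup. The helper names `heartbeat_fits_spot_window` and `terminate_via_asg` are hypothetical, and the client is passed in so the sketch does not depend on boto3 being installed.

```python
SPOT_NOTICE_SECONDS = 120  # Spot instances get about two minutes of warning


def heartbeat_fits_spot_window(heartbeat_timeout_seconds):
    """True if the lifecycle hook times out before the instance is reclaimed."""
    return 0 < heartbeat_timeout_seconds < SPOT_NOTICE_SECONDS


def terminate_via_asg(autoscaling_client, instance_id):
    """Ask the Auto Scaling group to terminate the instance.

    Going through the group, rather than terminating the instance
    directly, is what lets the terminating lifecycle hook fire. Desired
    capacity is left unchanged so the group launches a replacement.
    """
    return autoscaling_client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,
    )
```

In a real Lambda the client would be `boto3.client("autoscaling")`, and the timeout check could run at deploy time against whatever value is configured for the hook.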

@bwhaley
Member

bwhaley commented Dec 15, 2022

Quick updates:

  • The connectivity check endpoints are now configurable via var.connectivity_test_check_urls
  • The heartbeat timeout is now exposed via var.lifecycle_heartbeat_timeout
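
For illustration, a check over a configurable list of URLs might behave like the sketch below. This is not alterNAT's actual Lambda code. The function name `check_connectivity`, the 5-second default timeout, the injectable `opener`, and the "succeed as soon as any URL responds" rule are all assumptions made for this example.

```python
import urllib.request


def check_connectivity(urls, timeout=5.0, opener=urllib.request.urlopen):
    """Return (ok, errors); ok is True as soon as one URL responds.

    URLError, HTTPError, and socket timeouts are all OSError subclasses,
    so a single except clause covers them. Note that this means an HTTP
    error status also counts as a failed check in this sketch.
    """
    errors = []
    for url in urls:
        try:
            with opener(url, timeout=timeout):
                return True, errors
        except OSError as exc:
            errors.append(f"error connecting to {url}: {exc}")
    return False, errors
```

With more than one URL in the list, a single slow or unreachable endpoint is recorded in `errors` but does not fail the check on its own.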

@therealvio

Hi there,

We've deployed alterNAT to the eu-west-1 (Ireland) region and noticed that requests to example.com occasionally time out. Refer to the log snippets below.

[ERROR]	2023-06-28T17:12:22.372Z	307bb718-1412-48da-81e4-cfb3db4e055a	error connecting to https://www.example.com: <urlopen error _ssl.c:1112: The handshake operation timed out>
[ERROR]	2023-06-28T17:33:54.441Z	4316bf29-cec9-4ee2-8e03-0ffb1314415c	timeout error connecting to https://www.example.com: The read operation timed out
[ERROR]	2023-06-28T17:36:28.864Z	83de7836-4cfd-4114-bc58-adefa3f906d4	timeout error connecting to https://www.example.com: The read operation timed out
[ERROR]	2023-06-28T17:42:17.948Z	2b58395a-8aa8-4e74-aa28-26d6ff50e00b	error connecting to https://www.example.com: <urlopen error _ssl.c:1112: The handshake operation timed out>

We noticed this because we have monitoring set up to advise us of our error burn rate. I am not convinced that the problem is with example.com itself, but rather with where the site's endpoint is located, which would explain why US-based deployments are OK.

Google is working fine, but I am confident that the hardcoded domain example.com can be a detriment for customers located outside the US. This is probably worth documenting somewhere. Happy to raise a PR to update the README docs if you'd like, @bwhaley.

Thanks for the override options, we're going to apply that anywho.

@therealvio

An update to my last post:

We changed our connectivity check URLs to the following:

  • https://www.google.com
  • https://www.aws.amazon.com; and
  • http://captive.apple.com

and we saw a dramatic improvement in the reliability of the connectivity testing results across our us-west-1 and eu-west-1 region deployments.

4 participants