
PHP - Retry through proxy is not successful #427

Closed
digitalprecision opened this issue Nov 3, 2015 · 20 comments

@digitalprecision

http://stackoverflow.com/questions/33487641/twitter-twemproxy-retry-not-working-as-expected

Wondering if anyone has insight into this? I even tried sleeping and creating a new instance, and I can't get it to hit a known good cache node. I set server retry to 1, and in code have retries set to 2 per the docs (retries have to be > server retry).

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

The default value for server_retry_timeout is 30 seconds, so if you retry your request after 30 seconds, it may end up going to the same server. Try setting it to a bigger value. Also use a sane value for the timeout key.
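
For reference, both knobs live in the pool config. A minimal nutcracker.yml sketch (the pool name, addresses, and other values here are placeholders, not a recommendation):

    alpha:
      listen: 127.0.0.1:22121
      hash: fnv1a_64
      distribution: ketama
      auto_eject_hosts: true
      timeout: 400                  # per-request timeout to a server, in ms
      server_retry_timeout: 300000  # how long an ejected server stays out, in ms
      server_failure_limit: 2
      servers:
       - 127.0.0.1:11211:1
       - 127.0.0.1:11212:1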

@digitalprecision
Author

I am storing CSS/JS output in cache for fast retrieval, so I cannot wait 30 seconds to retry; the retry is almost instantaneous for obvious reasons. But it still tries to connect to the known bad server.

"timeout key" = the key expiry? We are also storing sessions in memcached, so the expiry usually runs 15 minutes.

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

Have you read this: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md? Most of your answers can be found there.

@digitalprecision
Author

Of course, specifically the "liveness" section. Did you see my question on SO?

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

Yes; use server_retry_timeout and set it to 300000. The server_retry_timeout option controls how long an auto-ejected server is kept ejected.

@digitalprecision
Author

I understand that.

What I am saying is that at 300001, the bad cache node is reconsidered for re-entry into the pool. However, the request at 300001 will "break" because the server is still not online. To recover from this breakage, the retry mechanism at the app layer re-executes the same command, with the expectation that the bad cache node will by then have been re-ejected from the pool and the retry will go to a known good cache node.

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

Yup, that is correct. If you set it up properly as mentioned in recommendation.md, you won't encounter this issue.

Alternatively, this patch will also solve your issue: #29

@digitalprecision
Author

" if you set it up properly as mentioned in recommendation.md", could you be more specific? The nutcracker.yaml file is posted in the SO link in the OP. What is wrong with that config that would cause the aforementioned behavior?

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

Basically, what you want to do is trade off "application level retries" against "server_retry_timeout". This section talks about it: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md#liveness

@digitalprecision
Author

That's what I am trying to tell you. The application tries 3 times to set an item in the cache store via twemproxy. The first try fails due to the failed node; the 2nd and 3rd tries fail again because twemproxy is sending the request to the same failed server, when the failed server should have been ejected.
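
For context, the retry pattern on the app side is essentially this (a minimal PHP sketch, not the actual application code; the proxy address, key, and payload are placeholders):

    <?php
    // Minimal sketch of an app-level retry against twemproxy.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 22121); // twemproxy listen address, not a cache node

    $cssOutput = '/* compiled css */'; // placeholder payload
    $maxAttempts = 3;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        if ($mc->set('asset:main.css', $cssOutput, 900)) { // 15-minute expiry
            break; // success
        }
        // Near-instant retries can arrive before twemproxy ejects the failed
        // node (or after it has been re-added), so they hit the same dead server.
        usleep(500000); // 500 ms between attempts
    }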

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

The issue is that by the time the second retry arrives at twemproxy, the ejected server has already been added back to the server pool. So you need to set server_retry_timeout to a value greater than 30 seconds (or some value that lets your retries succeed). Say you set server_retry_timeout to 30 seconds and server_failure_limit to 2 (config sketch after the timeline):

  • at t = 0, request to failed node fails
  • at t = 10 (seconds), retried request to failed node fails and failed node is ejected
  • at t = 20 (seconds), retried request is routed to another node
  • at t = 40 (seconds), failed node is added back to the server pool
  • at t = 50 (seconds), requests are routed back to the failed node
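
In config terms, the timeline above assumes something like this sketch (note nutcracker takes both values in milliseconds):

    server_failure_limit: 2       # eject after 2 consecutive failures
    server_retry_timeout: 30000   # re-add the ejected server 30 seconds later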

hth

@digitalprecision
Author

I appreciate the help.

I guess that's where my confusion lies. Because we are dealing with a caching layer, which is expected to be extremely fast, I cannot wait 10 seconds to issue the 2nd retry. The 2nd retry has to happen within milliseconds.

Based on the docs and what you said, if I set server_failure_limit to 1 and keep server_retry_timeout at 30 seconds, this is what is supposed to happen:

  • Request 1 | t = 0 | Fails (vcache-2 down and is ejected)
  • Request 1 | t = 1s + 500ms | App retries request 1 | Succeeds (vcache-1 responds)
  • (30 seconds pass)
  • Request 200 | t = 31s | Fails | (vcache-2 re-introduced into pool, but still down, and is re-ejected)
  • Request 200 | t = 31s + 500ms | App retries request 200 | succeeds (vcache-1 responds)

@charsyam
Contributor

charsyam commented Nov 3, 2015

@digitalprecision
Could you try this branch of twemproxy?
https://github.com/charsyam/twemproxy/tree/feature/heartbeat

This patch tries to restore a failed node after checking it.
I rebased the old patch :)
(cc @manjuraj)

@digitalprecision
Author

Sweet, I'll give her a go and let you know. Thanks.

Just curious, when do you think the heartbeat patch will make it into master?

@charsyam
Contributor

charsyam commented Nov 3, 2015

@digitalprecision Sorry, actually I don't know. Maybe it depends on other people's reviews and usage :)

@digitalprecision
Author

Hmmmm, with that being said, would you say the heartbeat branch is stable? If this does work, it wouldn't be ideal to run a fork off master in a production environment for too long.

@charsyam
Contributor

charsyam commented Nov 3, 2015

@digitalprecision It's not experimental :) but we always need to test :)

@digitalprecision
Author

Actually, I am going to pass on the heartbeat branch. System stability is of utmost importance, and manually compiling the heartbeat fork would force me to update a slew of packages which aren't available in upstream repos (CentOS 6.7, rpmforge, epel).

At this point I am going to go with the following config settings:

  • server_failure_limit: 1
  • server_retry_timeout: 600000 ms (10 minutes)

I'd rather have a server be out of the pool for 10 minutes than deal with breakage, and considering that there's a low % chance of a cache node actually being down, it shouldn't be too painful.
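
In nutcracker.yaml terms, that amounts to the following sketch (just the relevant pool keys):

    auto_eject_hosts: true
    server_failure_limit: 1       # eject on the first failure
    server_retry_timeout: 600000  # keep the ejected node out for 10 minutes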

But I would recommend merging the heartbeat patch into master ASAP. Otherwise, other organizations are under the false impression that an app-level retry can recover from a non-responsive cache node.

@digitalprecision
Author

Curious on your thoughts, @manjuraj

@TysonAndre
Collaborator

The heartbeat patch hasn't been merged into twitter/twemproxy yet, but it is planned for 0.6.0: #608
