
PHP - Retry through proxy is not successful #427

Closed
digitalprecision opened this issue Nov 3, 2015 · 20 comments

@digitalprecision

http://stackoverflow.com/questions/33487641/twitter-twemproxy-retry-not-working-as-expected

Wondering if anyone has insight into this? I even tried sleeping and creating a new instance, and I can't get it to hit a known good cache node. I set server retry to 1, and in code have retries set to 2 per the docs (retries have to be > server retry).

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

The default value for server_retry_timeout is 30 seconds, so if you retry your request after 30 seconds, it may end up going to the same server. Try setting it to a bigger value. Also use a sane value for the timeout key.
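
For reference, both knobs live in the pool config. A minimal nutcracker.yml sketch (the pool name, addresses, and other values here are placeholders, not a recommendation):

    alpha:
      listen: 127.0.0.1:22121
      hash: fnv1a_64
      distribution: ketama
      auto_eject_hosts: true
      timeout: 400                  # per-request timeout to a server, in ms
      server_retry_timeout: 300000  # how long an ejected server stays out, in ms
      server_failure_limit: 2
      servers:
       - 127.0.0.1:11211:1
       - 127.0.0.1:11212:1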

@digitalprecision
Author

I am storing CSS/JS output in cache for fast retrieval, so I cannot wait 30 seconds to retry; the retry is almost instantaneous for obvious reasons. But it still tries to connect to the known bad server.

"timeout key" = the key expiry? We are also storing sessions in memcached, so the expiry usually runs 15 minutes.

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

Have you read this: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md? Most of your answers can be found there.

@digitalprecision
Author

Of course, specifically the "liveness" section. Did you see my question on SO?

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

Yes; use server_retry_timeout and set it to 300000. The server_retry_timeout option controls how long an auto-ejected server is kept ejected.

@digitalprecision
Author

I understand that.

What I am saying is that at 300001, the bad cache node is reconsidered for re-entry into the pool. However, the request at 300001 will "break" because the server is still not online. To recover from this breakage, the retry mechanism at the app layer re-executes the same command, with the expectation that the bad cache node will by then have been re-ejected from the pool and the retry will go to a known good cache node.

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

Yup, that is correct. If you set it up properly as mentioned in recommendation.md, you won't encounter this issue.

Alternatively, this patch will also solve your issue: #29

@digitalprecision
Author

" if you set it up properly as mentioned in recommendation.md", could you be more specific? The nutcracker.yaml file is posted in the SO link in the OP. What is wrong with that config that would cause the aforementioned behavior?

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

Basically, what you want to do is trade off "application level retries" against "server_retry_timeout". This section talks about it: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md#liveness

@digitalprecision
Author

That's what I am trying to tell you. The application tries 3 times to set an item in the cache store via twemproxy. The first try fails due to the failed node; the 2nd and 3rd tries fail again because twemproxy is sending the request to the same failed server, when the failed server should have been ejected.
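
For context, the retry pattern on the app side is essentially this (a minimal PHP sketch, not the actual application code; the proxy address, key, and payload are placeholders):

    <?php
    // Minimal sketch of an app-level retry against twemproxy.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 22121); // twemproxy listen address, not a cache node

    $cssOutput = '/* compiled css */'; // placeholder payload
    $maxAttempts = 3;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        if ($mc->set('asset:main.css', $cssOutput, 900)) { // 15-minute expiry
            break; // success
        }
        // Near-instant retries can arrive before twemproxy ejects the failed
        // node (or after it has been re-added), so they hit the same dead server.
        usleep(500000); // 500 ms between attempts
    }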

@manjuraj
Collaborator

manjuraj commented Nov 3, 2015

The issue is that by the time the second retry arrives at twemproxy, the ejected server has already been added back to the server pool. So you need to set server_retry_timeout to a value greater than 30 seconds (or some value that lets your retries succeed). Say you set server_retry_timeout to 30 seconds and server_failure_limit to 2 (config sketch after the timeline):

  • at t = 0, request to failed node fails
  • at t = 10 (seconds), retried request to failed node fails and failed node is ejected
  • at t = 20 (seconds), retried request is routed to another node
  • at t = 40 (seconds), failed node is added back to the server pool
  • at t = 50 (seconds), requests are routed back to the failed node
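
In config terms, the timeline above assumes something like this sketch (note nutcracker takes both values in milliseconds):

    server_failure_limit: 2       # eject after 2 consecutive failures
    server_retry_timeout: 30000   # re-add the ejected server 30 seconds later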

hth

@digitalprecision
Author

I appreciate the help.

I guess that's where my confusion lies. Because we are dealing with a caching layer, which is expected to be extremely fast, I cannot wait 10 seconds to issue the 2nd retry. The 2nd retry has to happen within milliseconds.

Based on the docs and what you said, if I set server_failure_limit to 1 and keep server_retry_timeout at 30 seconds, this is what is supposed to happen:

  • Request 1 | t = 0 | Fails (vcache-2 down and is ejected)
  • Request 1 | t = 1s + 500ms | App retries request 1 | Succeeds (vcache-1 responds)
  • (30 seconds pass)
  • Request 200 | t = 31s | Fails | (vcache-2 re-introduced into pool, but still down, and is re-ejected)
  • Request 200 | t = 31s + 500ms | App retries request 200 | succeeds (vcache-1 responds)

@charsyam
Contributor

charsyam commented Nov 3, 2015

@digitalprecision
Could you try this branch of twemproxy?
https://github.com/charsyam/twemproxy/tree/feature/heartbeat

This patch tries to restore a failed node after checking it.
I rebased the old patch :)
(cc @manjuraj)

@digitalprecision
Author

Sweet, I'll give her a go and let you know. Thanks.

Just curious, when do you think the heartbeat patch will make it into master?

@charsyam
Contributor

charsyam commented Nov 3, 2015

@digitalprecision Sorry, actually I don't know. Maybe it depends on other people's reviews and usage :)

@digitalprecision
Author

Hmmmm, with that being said, would you say the heartbeat branch is stable? If this does work, it wouldn't be ideal to run a fork off master in a production environment for too long.

@charsyam
Contributor

charsyam commented Nov 3, 2015

@digitalprecision It's not experimental :) but we always need to test :)

@digitalprecision
Author

Actually, I am going to pass on the heartbeat branch. System stability is of utmost importance, and manually compiling the heartbeat fork would force me to update a slew of packages which aren't available in upstream repos (CentOS 6.7, rpmforge, epel).

At this point I am going to go with the following config settings:

  • server_failure_limit: 1
  • server_retry_timeout: 600000 ms (10 minutes)

I'd rather have a server be out of the pool for 10 minutes than deal with breakage, and considering that there's a low % chance of a cache node actually being down, it shouldn't be too painful.
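
In nutcracker.yaml terms, that amounts to the following sketch (just the relevant pool keys):

    auto_eject_hosts: true
    server_failure_limit: 1       # eject on the first failure
    server_retry_timeout: 600000  # keep the ejected node out for 10 minutes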

But I would recommend merging the heartbeat patch into master ASAP. Otherwise, other organizations are under the false impression that an app-level retry can recover from a non-responsive cache node.

@digitalprecision
Author

Curious on your thoughts, @manjuraj

@TysonAndre
Collaborator

The heartbeat patch hasn't been merged into twitter/twemproxy yet, but it is planned for 0.6.0: #608
