Hi,

It seems that HdfsCli doesn't cope well when the WebHDFS namenode it is connected to shuts down, even though an alternative namenode is available.

I don't have a minimal reproducer and my logging isn't 100% clear, but if I'm not mistaken, the following happened during a run of an application I'm testing:
1. The cluster administrator initiates a restart of the namenodes for a settings update; there are two namenodes in the cluster.
2. At the time HdfsCli attempts its first connection, the main node 001 is in the process of restarting and unavailable, so it successfully connects to the alternative node 002.
3. Node 001 finishes restarting; node 002 goes offline to restart.
4. HdfsCli loses its HTTP connection to node 002. It retries its connection to node 002 but never attempts a new connection to node 001. (A sketch of the client setup follows below.)
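For reference, the client is created along these lines (a minimal sketch with placeholder host names, port, and user; as I understand it, the HA support takes multiple namenode URLs separated by semicolons and fails over between them on connection errors):

```python
from hdfs import InsecureClient

# Placeholder host names, port, and user; the real setup differs.
# Both namenodes are listed so the client can fail over between them.
client = InsecureClient(
    'http://namenode001:50070;http://namenode002:50070',
    user='someuser',
)
```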
Stacktrace from Python (names changed to protect the innocent):

```
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/util/connection.py", line 80, in create_connection
raise err
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/util/connection.py", line 70, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib64/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/usr/lib64/python3.4/http/client.py", line 1182, in _send_request
self.endheaders(body)
File "/usr/lib64/python3.4/http/client.py", line 1133, in endheaders
self._send_output(message_body)
File "/usr/lib64/python3.4/http/client.py", line 963, in _send_output
self.send(msg)
File "/usr/lib64/python3.4/http/client.py", line 898, in send
self.connect()
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connection.py", line 181, in connect
conn = self._new_conn()
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fe008d72fd0>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/urllib3/util/retry.py", line 399, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host=<node 002>, port=50070): Max retries exceeded with url: <webhdfs path>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
process()
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/grid/3/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000001/<our package>", line 267, in extract_map
File "/usr/lib64/python3.4/contextlib.py", line 59, in __enter__
return next(self.gen)
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/hdfs/client.py", line 686, in read
buffersize=buffer_size,
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/hdfs/client.py", line 125, in api_handler
raise err
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/hdfs/client.py", line 107, in api_handler
**self.kwargs
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/hdfs/client.py", line 214, in _request
**kwargs
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/grid/0/hadoop/yarn/local/usercache/<user>/appcache/application_1573035413979_0276/container_e55_1573035413979_0276_01_000003/.venv/lib/python3.4/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host=<node 002>, port=50070): Max retries exceeded with url: <webhdfs path>
```
Looking at issue #18, you rightly state that handling failovers is difficult. Is this something you're willing to look into? I can attempt to create a minimal reproducer or test case, but I'd rather not if you regard this issue as out of scope.
Many thanks!
Hi @dutchgecko, apologies for the slow reply. #80 introduced HA support, so I'd like to understand why it doesn't work here. Are you issuing a single large request that spans both restarts?
If that's the case, it makes sense that the overall request fails when the second server restarts, since the first server had already been tried for the same request. Supporting such long requests would require care to avoid infinite loops. I'm inclined to think that such sequences of events are rare enough that retrying them transparently is not worth the added complexity.
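A rough illustration of that tradeoff (only a sketch of the general idea, not the library's actual code): if each namenode is tried at most once per request, a request that outlives restarts of both namenodes has no host left to fall back to.

```python
import requests

def request_with_failover(method, path, urls, **kwargs):
    """Sketch: try each namenode URL at most once for a single request.

    Bounding the retries to one pass over the list avoids infinite
    loops, but a request that spans restarts of every namenode
    exhausts the list and fails.
    """
    last_error = None
    for url in urls:  # e.g. ['http://nn001:50070', 'http://nn002:50070']
        try:
            return requests.request(method, url + path, **kwargs)
        except requests.ConnectionError as err:
            last_error = err  # this namenode is down; move on to the next
    raise last_error
```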
Yes, in this case the request spanned both restarts (or almost: the request started while NN01 was down, and NN02 went down while the request was in flight).
The request is quite long because it is a streaming operation: files are streamed from HDFS, operations are carried out on the stream, and the result is streamed to a new file on HDFS.
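For context, the job is roughly shaped like this (simplified sketch; the paths, the transform, and the helper name are placeholders, not the actual application code):

```python
def copy_with_transform(client, src_path, dst_path):
    """Sketch: stream a file from HDFS, transform it, and stream the
    result back to HDFS, keeping both transfers open for the duration."""
    def transformed_lines():
        # client.read() yields the file as delimited chunks while the
        # underlying WebHDFS transfer stays open.
        with client.read(src_path, encoding='utf-8', delimiter='\n') as reader:
            for line in reader:
                yield line.upper() + '\n'  # placeholder transform

    # client.write() accepts a generator, so the output is streamed too.
    client.write(dst_path, data=transformed_lines(), encoding='utf-8',
                 overwrite=True)
```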
I respect your decision if you regard this as WONTFIX due to complexity; however, in my opinion the current behaviour is not as robust as the HA mechanisms used in other Hadoop tools (a Spark job running at the same time, for example, handled the double failover gracefully). I agree that rebooting both namenodes is a rare occurrence, but part of the appeal of a system like Hadoop is that it should always be "safe" to do so (put another way, I had to deal with a grumpy sysadmin who wondered why he had to manually restart tasks that shouldn't have failed during the reboot he initiated 😉).