-
It seems to be running OK. Does it error every time you try to create a cluster?
From: Jesse Anderson
Sent: Wednesday, December 20, 2023 4:02 PM
Subject: [microsoft/PlanetaryComputer] Is the dask gateway down? (Discussion #309)
For the last couple of days, I've been getting consistent errors when I try to spawn a gateway cluster. For example, running
import dask_gateway
gateway = dask_gateway.Gateway()
cluster = gateway.new_cluster()
client = cluster.get_client()
will produce
# lots of stuff, then
OSError: Timed out trying to connect to gateway://pccompute-dask.westeurope.cloudapp.azure.com:80/prod.752e08130c7945e1a3cfce738ec880d3 after 30 s
Is the gateway down?
-
Yes, every time. I'd try a new token, but I read this as not being related to authentication? I'm running the above code in a notebook instance started with the invocation suggested in the docs:
docker run -it --rm \
-p 8888:8888 \
-e JUPYTERHUB_API_TOKEN=$JUPYTERHUB_API_TOKEN \
-e DASK_GATEWAY__AUTH__TYPE="jupyterhub" \
-e DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE="mcr.microsoft.com/planetary-computer/python:latest" \
-e DASK_GATEWAY__ADDRESS="https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway" \
-e DASK_GATEWAY__PROXY_ADDRESS="gateway://pccompute-dask.westeurope.cloudapp.azure.com:80" \
mcr.microsoft.com/planetary-computer/python:latest \
jupyter lab --no-browser --ip="0.0.0.0"
Here's the full error log:
CancelledError Traceback (most recent call last)
File /srv/conda/envs/notebook/lib/python3.10/site-packages/dask_gateway/comm.py:45, in GatewayConnector.connect(self, address, deserialize, **connection_args)
44 try:
---> 45 plain_stream = await self.client.connect(
46 ip, port, max_buffer_size=MAX_BUFFER_SIZE
47 )
48 stream = await plain_stream.start_tls(
49 False, ssl_options=ctx, server_hostname=sni
50 )
File /srv/conda/envs/notebook/lib/python3.10/site-packages/tornado/tcpclient.py:269, in TCPClient.connect(self, host, port, af, ssl_options, max_buffer_size, source_ip, source_port, timeout)
268 else:
--> 269 addrinfo = await self.resolver.resolve(host, port, af)
270 connector = _Connector(
271 addrinfo,
272 functools.partial(
(...)
277 ),
278 )
CancelledError:
During handling of the above exception, another exception occurred:
CancelledError Traceback (most recent call last)
File /srv/conda/envs/notebook/lib/python3.10/asyncio/tasks.py:456, in wait_for(fut, timeout)
455 try:
--> 456 return fut.result()
457 except exceptions.CancelledError as exc:
CancelledError:
The above exception was the direct cause of the following exception:
TimeoutError Traceback (most recent call last)
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/comm/core.py:292, in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
291 try:
--> 292 comm = await wait_for(
293 connector.connect(loc, deserialize=deserialize, **connection_args),
294 timeout=min(intermediate_cap, time_left()),
295 )
296 break
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:1878, in wait_for(fut, timeout)
1877 async def wait_for(fut: Awaitable[T], timeout: float) -> T:
-> 1878 return await asyncio.wait_for(fut, timeout)
File /srv/conda/envs/notebook/lib/python3.10/asyncio/tasks.py:458, in wait_for(fut, timeout)
457 except exceptions.CancelledError as exc:
--> 458 raise exceptions.TimeoutError() from exc
459 finally:
TimeoutError:
The above exception was the direct cause of the following exception:
OSError Traceback (most recent call last)
Cell In[1], line 4
2 gateway = dask_gateway.Gateway()
3 cluster = gateway.new_cluster()
----> 4 client = cluster.get_client()
File /srv/conda/envs/notebook/lib/python3.10/site-packages/dask_gateway/client.py:1080, in GatewayCluster.get_client(self, set_as_default)
1073 def get_client(self, set_as_default=True):
1074 """Get a ``Client`` for this cluster.
1075
1076 Returns
1077 -------
1078 client : dask.distributed.Client
1079 """
-> 1080 client = Client(
1081 self,
1082 security=self.security,
1083 set_as_default=set_as_default,
1084 asynchronous=self.asynchronous,
1085 loop=self.loop,
1086 )
1087 if not self.asynchronous:
1088 self._clients.add(client)
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:1012, in Client.__init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, connection_limit, **kwargs)
1009 preload_argv = dask.config.get("distributed.client.preload-argv")
1010 self.preloads = preloading.process_preloads(self, preload, preload_argv)
-> 1012 self.start(timeout=timeout)
1013 Client._instances.add(self)
1015 from distributed.recreate_tasks import ReplayTaskClient
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:1210, in Client.start(self, **kwargs)
1208 self._started = asyncio.ensure_future(self._start(**kwargs))
1209 else:
-> 1210 sync(self.loop, self._start, **kwargs)
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:418, in sync(loop, func, callback_timeout, *args, **kwargs)
416 if error:
417 typ, exc, tb = error
--> 418 raise exc.with_traceback(tb)
419 else:
420 return result
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:391, in sync.<locals>.f()
389 future = wait_for(future, callback_timeout)
390 future = asyncio.ensure_future(future)
--> 391 result = yield future
392 except Exception:
393 error = sys.exc_info()
File /srv/conda/envs/notebook/lib/python3.10/site-packages/tornado/gen.py:767, in Runner.run(self)
765 try:
766 try:
--> 767 value = future.result()
768 except Exception as e:
769 # Save the exception for later. It's important that
770 # gen.throw() not be called inside this try/except block
771 # because that makes sys.exc_info behave unexpectedly.
772 exc: Optional[Exception] = e
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:1290, in Client._start(self, timeout, **kwargs)
1287 self.scheduler_comm = None
1289 try:
-> 1290 await self._ensure_connected(timeout=timeout)
1291 except (OSError, ImportError):
1292 await self._close()
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:1353, in Client._ensure_connected(self, timeout)
1350 self._connecting_to_scheduler = True
1352 try:
-> 1353 comm = await connect(
1354 self.scheduler.address, timeout=timeout, **self.connection_args
1355 )
1356 comm.name = "Client->Scheduler"
1357 if timeout is not None:
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/comm/core.py:318, in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
316 await asyncio.sleep(backoff)
317 else:
--> 318 raise OSError(
319 f"Timed out trying to connect to {addr} after {timeout} s"
320 ) from active_exception
322 local_info = {
323 **comm.handshake_info(),
324 **(handshake_overrides or {}),
325 }
326 try:
327 # This would be better, but connections leak if worker is closed quickly
328 # write, handshake = await asyncio.gather(comm.write(local_info), comm.read())
OSError: Timed out trying to connect to gateway://pccompute-dask.westeurope.cloudapp.azure.com:80/prod.2866347575a64b3a873aec3ecf5dfae4 after 30 s
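The innermost frame in the log above is Tornado's `resolver.resolve(host, port, af)` call, so the timeout appears to hit while resolving or reaching the proxy host, before TLS or authentication come into play. A minimal sketch, using only the Python standard library, for checking that step separately from Dask. The helper names `split_gateway_address` and `can_connect` are hypothetical and not part of dask-gateway:

```python
import socket
from urllib.parse import urlsplit


def split_gateway_address(address: str) -> tuple[str, int]:
    """Split a 'gateway://host:port/...' address into (host, port)."""
    parts = urlsplit(address)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {address!r}")
    return parts.hostname, parts.port


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if DNS resolution and a plain TCP connect both succeed."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures (socket.gaierror), refused connections, and timeouts.
        return False
```

Calling `can_connect(*split_gateway_address("gateway://pccompute-dask.westeurope.cloudapp.azure.com:80"))` from the same container would be one way to use it: a `False` result would point at the network path to the proxy rather than at the token.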
-
A little more info: The cluster is created (it is listed as running with
-
Hmm, I'm not sure. I think they might be in separate namespaces, so I don't know how Kubernetes networking would handle that.
From: Jesse Anderson
Sent: Thursday, December 21, 2023 10:46 AM
Cc: Tom Augspurger
Hmm, ok. So should it work through kbatch? (I'm seeing the error there too).
-
You should be able to address a service in a different namespace. I just had a look at our deployment of the PC environment, and I think this is the internal service address: I have a DB proxy service, which I can access from any namespace using a similar hostname.
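For reference, Kubernetes exposes a Service to other namespaces under the DNS name `<service>.<namespace>.svc.<cluster-domain>`, where the cluster domain defaults to `cluster.local`. A minimal sketch of composing such a hostname. The function name is hypothetical, and the service and namespace names in the usage note are placeholders, not the actual names in the Planetary Computer deployment:

```python
def service_dns_name(
    service: str, namespace: str, cluster_domain: str = "cluster.local"
) -> str:
    """Build the cluster-internal DNS name of a Kubernetes Service."""
    for label in (service, namespace):
        # Service and namespace names are single DNS labels, so no dots allowed.
        if not label or "." in label:
            raise ValueError(f"invalid DNS label: {label!r}")
    return f"{service}.{namespace}.svc.{cluster_domain}"
```

With placeholder names, `service_dns_name("example-service", "example-namespace")` gives `example-service.example-namespace.svc.cluster.local`, which a pod in any namespace can resolve as long as no network policy blocks the traffic.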
-
Bumping this up because I'm getting the same error: I'd like to compute using the Dask gateway, using a local notebook or Python file, but I keep getting proxy timeouts on
-
Inside the cluster, you should be able to use the routes defined at https://github.com/microsoft/planetary-computer-hub/blob/408269db9e84497bc9a8a90005804cb098d7b020/helm/chart/config.yaml#L96-L97, for example:
DASK_GATEWAY__ADDRESS = "http://proxy-http:8000/compute/services/dask-gateway/"
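A minimal sketch of applying that address from Python rather than through the container environment. It assumes the code runs in a pod that can resolve `proxy-http`, and it relies on Dask's convention of mapping `DASK_`-prefixed environment variables to config keys. The helper names are hypothetical, and since no in-cluster value for `DASK_GATEWAY__PROXY_ADDRESS` is given in this thread, the proxy address is left to whatever the existing configuration provides:

```python
# Address quoted in the reply above; only reachable from inside the cluster.
INTERNAL_GATEWAY_ADDRESS = "http://proxy-http:8000/compute/services/dask-gateway/"


def env_var_to_config_key(name: str) -> str:
    """Illustrate Dask's naming rule: DASK_GATEWAY__ADDRESS -> gateway.address.

    This mirrors the documented convention (drop the DASK_ prefix, lower-case,
    turn double underscores into nesting); it is not Dask's own implementation.
    Dask treats '_' and '-' in key names as interchangeable.
    """
    prefix = "DASK_"
    if not name.startswith(prefix):
        raise ValueError(f"not a Dask environment variable: {name!r}")
    return name[len(prefix):].lower().replace("__", ".")


def connect_internal():
    """Create a Gateway client against the in-cluster address."""
    import dask_gateway  # imported lazily so the helper above works without it

    # Auth and proxy address still come from the Dask config / environment.
    return dask_gateway.Gateway(address=INTERNAL_GATEWAY_ADDRESS)
```

Passing `address=` explicitly overrides only the API endpoint, so it is easy to switch back to the public URL when running outside the cluster.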
From: Alex Leith
Sent: Monday, January 22, 2024 4:01 PM
Cc: Tom Augspurger
We're running inside the network, right?
Are there new values available for DASK_GATEWAY__ADDRESS and DASK_GATEWAY__PROXY_ADDRESS that work internally, i.e., kubernetes service endpoints that aren't publicly exposed?