vgpu-scheduler-extender terminated with exit code 2 #584

Open
jeonghyunkeem opened this issue Oct 31, 2024 · 3 comments
Labels
kind/bug Something isn't working

Comments

@jeonghyunkeem

What happened: The vgpu-scheduler-extender container (part of the hami-scheduler pod) keeps terminating with exit code 2.

What you expected to happen: vgpu-scheduler-extender stays alive without terminating.

How to reproduce it (as minimally and precisely as possible): I'm not sure, as it happens randomly.

Anything else we need to know?:

My cluster has multiple GPU nodes, and each node has a hami.io/node-nvidia-register annotation like the following:

hami.io/node-nvidia-register=GPU-80c9c145-7ed8-5261-305e-72044d835856,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-7cbd3046-f3e2-dbc2-95dd-a77b1de5639f,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-eac3c055-a9e3-f967-5255-cb1234c78133,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-8f8d7649-0174-5b7a-4499-c93f4f4c1301,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-1a9db261-1fc5-0b0f-da59-636a3e97850b,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-53e5a370-8700-37f5-f10a-00c8ff829794,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-4dcb8085-4e5e-0462-9d96-794c903503ce,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-ea6f48d4-12de-4481-4fb5-883341efecf4,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:
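For readers unfamiliar with the annotation value, it appears to be a list of devices separated by ":" where each device is a comma-separated record of seven fields. The Go sketch below splits a value of that shape. The struct and its field names (Count, MemMiB, Cores, Numa, Healthy) are assumptions made for illustration and are not confirmed anywhere in this thread; parseDevices is a hypothetical helper, not a HAMi function.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// device is one record of the annotation value.
// Field meanings are assumed for illustration only.
type device struct {
	ID      string
	Count   int
	MemMiB  int
	Cores   int
	Type    string
	Numa    int
	Healthy bool
}

// parseDevices splits the value on ":" (one record per device) and each
// record on "," (seven fields). It is a hypothetical helper.
func parseDevices(value string) ([]device, error) {
	var out []device
	for _, rec := range strings.Split(value, ":") {
		if rec == "" {
			continue // the trailing ":" leaves an empty last element
		}
		f := strings.Split(rec, ",")
		if len(f) != 7 {
			return nil, fmt.Errorf("expected 7 fields, got %d in %q", len(f), rec)
		}
		// Fields 1, 2, 3 and 5 are integers in the sample annotation.
		nums := make([]int, 0, 4)
		for _, i := range []int{1, 2, 3, 5} {
			n, err := strconv.Atoi(f[i])
			if err != nil {
				return nil, fmt.Errorf("field %d of %q: %w", i, rec, err)
			}
			nums = append(nums, n)
		}
		healthy, err := strconv.ParseBool(f[6])
		if err != nil {
			return nil, fmt.Errorf("field 6 of %q: %w", rec, err)
		}
		out = append(out, device{
			ID: f[0], Count: nums[0], MemMiB: nums[1], Cores: nums[2],
			Type: f[4], Numa: nums[3], Healthy: healthy,
		})
	}
	return out, nil
}

func main() {
	// First two records of the annotation quoted above.
	value := "GPU-80c9c145-7ed8-5261-305e-72044d835856,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:" +
		"GPU-7cbd3046-f3e2-dbc2-95dd-a77b1de5639f,10,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:"
	devs, err := parseDevices(value)
	if err != nil {
		panic(err)
	}
	for _, d := range devs {
		fmt.Printf("%s type=%q mem=%d healthy=%t\n", d.ID, d.Type, d.MemMiB, d.Healthy)
	}
}
```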
  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs

Here are the final logs of the terminated vgpu-scheduler-extender container:

I1031 00:52:46.362711       1 util.go:146] Encoded container Devices: GPU-937336ae-3b44-8441-66f0-1df55e635668,NVIDIA,65536,0:
I1031 00:52:46.362717       1 util.go:146] Encoded container Devices: GPU-937336ae-3b44-8441-66f0-1df55e635668,NVIDIA,65536,0:
I1031 00:52:46.362722       1 util.go:169] Encoded pod single devices GPU-937336ae-3b44-8441-66f0-1df55e635668,NVIDIA,65536,0:;GPU-937336ae-3b44-8441-66f0-1df55e635668,NVIDIA,65536,0:;
fatal error: concurrent map iteration and map write

goroutine 150 [running]:
github.com/Project-HAMi/HAMi/pkg/scheduler.(*Scheduler).getNodesUsage(0xc000658000, 0xc00a3c9b40, 0x0)
	/k8s-vgpu/pkg/scheduler/scheduler.go:301 +0x356
github.com/Project-HAMi/HAMi/pkg/scheduler.(*Scheduler).RegisterFromNodeAnnotations(0xc000658000)
	/k8s-vgpu/pkg/scheduler/scheduler.go:244 +0x2c5
created by main.start in goroutine 1
	/k8s-vgpu/cmd/scheduler/main.go:75 +0xe5

goroutine 1 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0x7f0d74613eb0, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0x3?, 0x1?, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc008737b80)
	/usr/local/go/src/internal/poll/fd_unix.go:611 +0x2ac
net.(*netFD).accept(0xc008737b80)
	/usr/local/go/src/net/fd_unix.go:172 +0x29
net.(*TCPListener).accept(0xc008ae0da0)
	/usr/local/go/src/net/tcpsock_posix.go:159 +0x1e
net.(*TCPListener).Accept(0xc008ae0da0)
	/usr/local/go/src/net/tcpsock.go:327 +0x30
crypto/tls.(*listener).Accept(0xc00884f998)
	/usr/local/go/src/crypto/tls/tls.go:66 +0x27
net/http.(*Server).Serve(0xc008aec000, {0x1d342a8, 0xc00884f998})
	/usr/local/go/src/net/http/server.go:3260 +0x33e
net/http.(*Server).ServeTLS(0xc008aec000, {0x1d34518, 0xc008ae0da0}, {0x7ffe446421da, 0xc}, {0x7ffe446421f2, 0xc})
	/usr/local/go/src/net/http/server.go:3330 +0x486
net/http.(*Server).ListenAndServeTLS(0xc008aec000, {0x7ffe446421da, 0xc}, {0x7ffe446421f2, 0xc})
	/usr/local/go/src/net/http/server.go:3487 +0x125
net/http.ListenAndServeTLS(...)
	/usr/local/go/src/net/http/server.go:3453
main.start()
	/k8s-vgpu/cmd/scheduler/main.go:90 +0x52d
main.init.func1(0xc000436100?, {0x1a9d158?, 0x4?, 0x1a9d15c?})
	/k8s-vgpu/cmd/scheduler/main.go:45 +0xf
github.com/spf13/cobra.(*Command).execute(0x2a757c0, {0xc000202a90, 0x1b, 0x1b})
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:989 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0x2a757c0)
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041
main.main()
	/k8s-vgpu/cmd/scheduler/main.go:97 +0x1e

goroutine 143 [chan receive]:
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:969 +0x4b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000029f70, {0x1d28440, 0xc0003c4210}, 0x1, 0xc000620060)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00049e770, 0x3b9aca00, 0x0, 0x1, 0xc000620060)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002e1cb0)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:968 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 195
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73

goroutine 127 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc000444188, 0xb733)
	/usr/local/go/src/runtime/sema.go:569 +0x159
sync.(*Cond).Wait(0xc004ef56c0?)
	/usr/local/go/src/sync/cond.go:70 +0x85
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc000444160, 0xc00061a070)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/delta_fifo.go:575 +0x236
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc000440460)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:188 +0x30
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0001b6e48, {0x1d28440, 0xc0005b01e0}, 0x1, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0001b6e48, 0x3b9aca00, 0x0, 0x1, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*controller).Run(0xc000440460, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:159 +0x35e
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run(0xc0002dcdc0, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:504 +0x2c8
k8s.io/client-go/informers.(*sharedInformerFactory).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/informers/factory.go:150 +0x5c
created by k8s.io/client-go/informers.(*sharedInformerFactory).Start in goroutine 1
	/go/pkg/mod/k8s.io/[email protected]/informers/factory.go:148 +0x205

goroutine 128 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0002dd158, 0xbb157)
	/usr/local/go/src/runtime/sema.go:569 +0x159
sync.(*Cond).Wait(0xc005f1b8a0?)
	/usr/local/go/src/sync/cond.go:70 +0x85
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop(0xc0002dd130, 0xc00044e5a0)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/delta_fifo.go:575 +0x236
k8s.io/client-go/tools/cache.(*controller).processLoop(0xc0003d2f00)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:188 +0x30
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00002de48, {0x1d28440, 0xc000615440}, 0x1, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00002de48, 0x3b9aca00, 0x0, 0x1, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*controller).Run(0xc0003d2f00, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:159 +0x35e
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run(0xc0002dd080, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:504 +0x2c8
k8s.io/client-go/informers.(*sharedInformerFactory).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/informers/factory.go:150 +0x5c
created by k8s.io/client-go/informers.(*sharedInformerFactory).Start in goroutine 1
	/go/pkg/mod/k8s.io/[email protected]/informers/factory.go:148 +0x205

goroutine 148 [chan receive]:
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:969 +0x4b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc008d0ff70, {0x1d28440, 0xc0002e39b0}, 0x1, 0xc000342000)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00049ff70, 0x3b9aca00, 0x0, 0x1, 0xc000342000)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002e1b90)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:968 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 179
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73

goroutine 112 [IO wait]:
internal/poll.runtime_pollWait(0x7f0d74613db8, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc0003de080?, 0xc008e76000?, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0003de080, {0xc008e76000, 0xa000, 0xa000})
	/usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc0003de080, {0xc008e76000?, 0x7f0d6450f588?, 0xc0025ebcb0?})
	/usr/local/go/src/net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc000610038, {0xc008e76000?, 0xc00002a938?, 0x4136bb?})
	/usr/local/go/src/net/net.go:185 +0x45
crypto/tls.(*atLeastReader).Read(0xc0025ebcb0, {0xc008e76000?, 0x0?, 0xc0025ebcb0?})
	/usr/local/go/src/crypto/tls/conn.go:806 +0x3b
bytes.(*Buffer).ReadFrom(0xc00024c630, {0x1d26d40, 0xc0025ebcb0})
	/usr/local/go/src/bytes/buffer.go:211 +0x98
crypto/tls.(*Conn).readFromUntil(0xc00024c388, {0x1d270c0, 0xc000610038}, 0xc00002a980?)
	/usr/local/go/src/crypto/tls/conn.go:828 +0xde
crypto/tls.(*Conn).readRecordOrCCS(0xc00024c388, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:626 +0x3cf
crypto/tls.(*Conn).readRecord(...)
	/usr/local/go/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc00024c388, {0xc0005cf000, 0x1000, 0x94c471?})
	/usr/local/go/src/crypto/tls/conn.go:1370 +0x156
bufio.(*Reader).Read(0xc0005c89c0, {0xc0005c42e0, 0x9, 0x0?})
	/usr/local/go/src/bufio/bufio.go:241 +0x197
io.ReadAtLeast({0x1d262a0, 0xc0005c89c0}, {0xc0005c42e0, 0x9, 0x9}, 0x9)
	/usr/local/go/src/io/io.go:335 +0x90
io.ReadFull(...)
	/usr/local/go/src/io/io.go:354
golang.org/x/net/http2.readFrameHeader({0xc0005c42e0, 0x9, 0x2adc0?}, {0x1d262a0?, 0xc0005c89c0?})
	/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:237 +0x65
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0005c42a0)
	/go/pkg/mod/golang.org/x/[email protected]/http2/frame.go:501 +0x85
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc00002afa8)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:2358 +0xda
golang.org/x/net/http2.(*ClientConn).readLoop(0xc0002d4180)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:2254 +0x8b
created by golang.org/x/net/http2.(*Transport).newClientConn in goroutine 111
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:869 +0xd1b

goroutine 179 [chan receive, 2172 minutes]:
k8s.io/client-go/tools/cache.(*sharedProcessor).run(0xc00062e460, 0xc00051e120)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:803 +0x4d
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run.(*Group).StartWithChannel.func4()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:55 +0x1b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 127
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73

goroutine 180 [chan receive, 2172 minutes]:
k8s.io/client-go/tools/cache.(*controller).Run.func1()
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:132 +0x25
created by k8s.io/client-go/tools/cache.(*controller).Run in goroutine 127
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:131 +0xa9

goroutine 195 [chan receive, 2172 minutes]:
k8s.io/client-go/tools/cache.(*sharedProcessor).run(0xc00062e4b0, 0xc000216660)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:803 +0x4d
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run.(*Group).StartWithChannel.func4()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:55 +0x1b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 128
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73

goroutine 196 [chan receive, 2172 minutes]:
k8s.io/client-go/tools/cache.(*controller).Run.func1()
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:132 +0x25
created by k8s.io/client-go/tools/cache.(*controller).Run in goroutine 128
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:131 +0xa9

goroutine 197 [select]:
k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0x2a8c200?}, {0x1d2dde8, 0xc006ca99c0}, {0x7f0d743423c0, 0xc0002dd130}, {0x1d559a8, 0x1a6d060}, 0x0, ...)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:714 +0x187
k8s.io/client-go/tools/cache.(*Reflector).watch(0xc0002b0380, {0x0?, 0x0?}, 0xc00051e060, 0xc000380120)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:433 +0x545
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc0002b0380, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:358 +0x377
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:291 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x2a8c900?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005bbf50, {0x1d28460, 0xc00062e5a0}, 0x1, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc0002b0380, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:290 +0x1c5
k8s.io/client-go/tools/cache.(*controller).Run.(*Group).StartWithChannel.func2()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:55 +0x1b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 128
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73

goroutine 181 [select]:
k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0x2a8c200?}, {0x1d2dde8, 0xc008eb4400}, {0x7f0d743423c0, 0xc000444160}, {0x1d559a8, 0x1a6db40}, 0x0, ...)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:714 +0x187
k8s.io/client-go/tools/cache.(*Reflector).watch(0xc00052a000, {0x0?, 0x0?}, 0xc00051e060, 0xc007d9daa0)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:433 +0x545
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc00052a000, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:358 +0x377
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:291 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x2a8c900?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc008637f50, {0x1d28460, 0xc0003b6370}, 0x1, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc00052a000, 0xc00051e060)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:290 +0x1c5
k8s.io/client-go/tools/cache.(*controller).Run.(*Group).StartWithChannel.func2()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:55 +0x1b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 127
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73

goroutine 144 [select]:
k8s.io/client-go/tools/cache.(*processorListener).pop(0xc0002e1cb0)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:939 +0x107
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 195
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73

goroutine 149 [select]:
k8s.io/client-go/tools/cache.(*processorListener).pop(0xc0002e1b90)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:939 +0x107
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 179
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73

goroutine 201 [select, 12 minutes]:
k8s.io/client-go/tools/cache.(*Reflector).startResync(0xc0002b0380, 0xc00051e060, 0xc0006211a0, 0xc000380120)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:370 +0x10f
created by k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch in goroutine 197
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:357 +0x34d

goroutine 226 [select, 12 minutes]:
k8s.io/client-go/tools/cache.(*Reflector).startResync(0xc00052a000, 0xc00051e060, 0xc0080f9b60, 0xc007d9daa0)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:370 +0x10f
created by k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch in goroutine 181
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:357 +0x34d

goroutine 151 [IO wait, 2171 minutes]:
internal/poll.runtime_pollWait(0x7f0d74613cc0, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0x8?, 0x10?, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc00879e380)
	/usr/local/go/src/internal/poll/fd_unix.go:611 +0x2ac
net.(*netFD).accept(0xc00879e380)
	/usr/local/go/src/net/fd_unix.go:172 +0x29
net.(*TCPListener).accept(0xc0089fed60)
	/usr/local/go/src/net/tcpsock_posix.go:159 +0x1e
net.(*TCPListener).Accept(0xc0089fed60)
	/usr/local/go/src/net/tcpsock.go:327 +0x30
net/http.(*Server).Serve(0xc008a2a000, {0x1d34518, 0xc0089fed60})
	/usr/local/go/src/net/http/server.go:3260 +0x33e
net/http.(*Server).ListenAndServe(0xc008a2a000)
	/usr/local/go/src/net/http/server.go:3189 +0x71
net/http.ListenAndServe(...)
	/usr/local/go/src/net/http/server.go:3443
main.initMetrics({0x7ffe44642236, 0x5})
	/k8s-vgpu/cmd/scheduler/metrics.go:239 +0x225
created by main.start in goroutine 1
	/k8s-vgpu/cmd/scheduler/main.go:76 +0x14b

goroutine 348153 [select]:
net/http.(*http2serverConn).serve(0xc0096db040)
	/usr/local/go/src/net/http/h2_bundle.go:4757 +0x897
net/http.(*http2Server).ServeConn(0xc008ae6fa0, {0x1d482b8, 0xc009e82a88}, 0xc009da9b30)
	/usr/local/go/src/net/http/h2_bundle.go:4345 +0xbad
net/http.http2ConfigureServer.func1(0xc008aec000, 0xc009e82a88, {0x1d266c0, 0xc009e7db80})
	/usr/local/go/src/net/http/h2_bundle.go:4135 +0x125
net/http.(*conn).serve(0xc009e850e0, {0x1d41a90, 0xc008aded80})
	/usr/local/go/src/net/http/server.go:1952 +0x12f3
created by net/http.(*Server).Serve in goroutine 1
	/usr/local/go/src/net/http/server.go:3290 +0x4b4

goroutine 347327 [select, 6 minutes]:
golang.org/x/net/http2.(*clientStream).writeRequest(0xc001ff6180, 0xc00733c6c0, 0x0)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:1536 +0xa85
golang.org/x/net/http2.(*clientStream).doRequest(0xc001ff6180, 0x0?, 0xc0056e6480?)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:1414 +0x56
created by golang.org/x/net/http2.(*ClientConn).roundTrip in goroutine 181
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:1319 +0x3e5

goroutine 347328 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc001ff61c8, 0x6f)
	/usr/local/go/src/runtime/sema.go:569 +0x159
sync.(*Cond).Wait(0x1a?)
	/usr/local/go/src/sync/cond.go:70 +0x85
golang.org/x/net/http2.(*pipe).Read(0xc001ff61b0, {0xc0006e0001, 0x7dff, 0x7dff})
	/go/pkg/mod/golang.org/x/[email protected]/http2/pipe.go:76 +0xdf
golang.org/x/net/http2.transportResponseBody.Read({0x374f?}, {0xc0006e0001?, 0xc00a3cdce0?, 0x4136bb?})
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:2641 +0x65
encoding/json.(*Decoder).refill(0xc0065788c0)
	/usr/local/go/src/encoding/json/stream.go:165 +0x188
encoding/json.(*Decoder).readValue(0xc0065788c0)
	/usr/local/go/src/encoding/json/stream.go:140 +0x85
encoding/json.(*Decoder).Decode(0xc0065788c0, {0x185f3c0, 0xc002d479b0})
	/usr/local/go/src/encoding/json/stream.go:63 +0x75
k8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc009184570, {0xc0026d8000, 0x8000, 0xa000})
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/framer/framer.go:152 +0x19c
k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc00393c6e0, 0x0, {0x1d2d050, 0xc007adb380})
	/go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/streaming/streaming.go:77 +0xa3
k8s.io/client-go/rest/watch.(*Decoder).Decode(0xc0063126c0)
	/go/pkg/mod/k8s.io/[email protected]/rest/watch/decoder.go:49 +0x4b
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc008eb4400)
	/go/pkg/mod/k8s.io/[email protected]/pkg/watch/streamwatcher.go:105 +0xdb
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher in goroutine 181
	/go/pkg/mod/k8s.io/[email protected]/pkg/watch/streamwatcher.go:76 +0x105

goroutine 348319 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc0007a6948, 0xc7)
	/usr/local/go/src/runtime/sema.go:569 +0x159
sync.(*Cond).Wait(0x1a?)
	/usr/local/go/src/sync/cond.go:70 +0x85
golang.org/x/net/http2.(*pipe).Read(0xc0007a6930, {0xc00342e001, 0x7dff, 0x7dff})
	/go/pkg/mod/golang.org/x/[email protected]/http2/pipe.go:76 +0xdf
golang.org/x/net/http2.transportResponseBody.Read({0x5893?}, {0xc00342e001?, 0xc000067ce0?, 0x4136bb?})
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:2641 +0x65
encoding/json.(*Decoder).refill(0xc0002f3180)
	/usr/local/go/src/encoding/json/stream.go:165 +0x188
encoding/json.(*Decoder).readValue(0xc0002f3180)
	/usr/local/go/src/encoding/json/stream.go:140 +0x85
encoding/json.(*Decoder).Decode(0xc0002f3180, {0x185f3c0, 0xc002d95410})
	/usr/local/go/src/encoding/json/stream.go:63 +0x75
k8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc005514510, {0xc00344c000, 0x8000, 0xa000})
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/framer/framer.go:152 +0x19c
k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc00341afa0, 0x0, {0x1d2d050, 0xc0096cf480})
	/go/pkg/mod/k8s.io/[email protected]/pkg/runtime/serializer/streaming/streaming.go:77 +0xa3
k8s.io/client-go/rest/watch.(*Decoder).Decode(0xc00494bee0)
	/go/pkg/mod/k8s.io/[email protected]/rest/watch/decoder.go:49 +0x4b
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc006ca99c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/watch/streamwatcher.go:105 +0xdb
created by k8s.io/apimachinery/pkg/watch.NewStreamWatcher in goroutine 197
	/go/pkg/mod/k8s.io/[email protected]/pkg/watch/streamwatcher.go:76 +0x105

goroutine 347503 [IO wait]:
internal/poll.runtime_pollWait(0x7f0d74613120, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc007e3cf80?, 0xc004a84a00?, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc007e3cf80, {0xc004a84a00, 0x2500, 0x2500})
	/usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc007e3cf80, {0xc004a84a00?, 0x7f0d6450f588?, 0xc0025ebc80?})
	/usr/local/go/src/net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc009c634d0, {0xc004a84a00?, 0xc003ec9788?, 0x4136bb?})
	/usr/local/go/src/net/net.go:185 +0x45
crypto/tls.(*atLeastReader).Read(0xc0025ebc80, {0xc004a84a00?, 0x0?, 0xc0025ebc80?})
	/usr/local/go/src/crypto/tls/conn.go:806 +0x3b
bytes.(*Buffer).ReadFrom(0xc0044bd7b0, {0x1d26d40, 0xc0025ebc80})
	/usr/local/go/src/bytes/buffer.go:211 +0x98
crypto/tls.(*Conn).readFromUntil(0xc0044bd508, {0x1d270c0, 0xc009c634d0}, 0xc003ec97d0?)
	/usr/local/go/src/crypto/tls/conn.go:828 +0xde
crypto/tls.(*Conn).readRecordOrCCS(0xc0044bd508, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:626 +0x3cf
crypto/tls.(*Conn).readRecord(...)
	/usr/local/go/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc0044bd508, {0xc0044f4000, 0x1000, 0x0?})
	/usr/local/go/src/crypto/tls/conn.go:1370 +0x156
net/http.(*connReader).Read(0xc003f87680, {0xc0044f4000, 0x1000, 0x1000})
	/usr/local/go/src/net/http/server.go:789 +0x14b
bufio.(*Reader).fill(0xc0043a42a0)
	/usr/local/go/src/bufio/bufio.go:110 +0x103
bufio.(*Reader).Peek(0xc0043a42a0, 0x4)
	/usr/local/go/src/bufio/bufio.go:148 +0x53
net/http.(*conn).serve(0xc0044d50e0, {0x1d41a90, 0xc008aded80})
	/usr/local/go/src/net/http/server.go:2079 +0x749
created by net/http.(*Server).Serve in goroutine 1
	/usr/local/go/src/net/http/server.go:3290 +0x4b4

goroutine 348156 [IO wait]:
internal/poll.runtime_pollWait(0x7f0d74613bc8, 0x72)
	/usr/local/go/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc000c21e80?, 0xc003df0000?, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000c21e80, {0xc003df0000, 0x6000, 0x6000})
	/usr/local/go/src/internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc000c21e80, {0xc003df0000?, 0x7f0d745a63d8?, 0xc0007c5ab8?})
	/usr/local/go/src/net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc009e13420, {0xc003df0000?, 0xc000026a58?, 0x4136bb?})
	/usr/local/go/src/net/net.go:185 +0x45
crypto/tls.(*atLeastReader).Read(0xc0007c5ab8, {0xc003df0000?, 0x0?, 0xc0007c5ab8?})
	/usr/local/go/src/crypto/tls/conn.go:806 +0x3b
bytes.(*Buffer).ReadFrom(0xc009e82d30, {0x1d26d40, 0xc0007c5ab8})
	/usr/local/go/src/bytes/buffer.go:211 +0x98
crypto/tls.(*Conn).readFromUntil(0xc009e82a88, {0x1d270c0, 0xc009e13420}, 0xc000026aa0?)
	/usr/local/go/src/crypto/tls/conn.go:828 +0xde
crypto/tls.(*Conn).readRecordOrCCS(0xc009e82a88, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:626 +0x3cf
crypto/tls.(*Conn).readRecord(...)
	/usr/local/go/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc009e82a88, {0xc001de3ee0, 0x9, 0x453186?})
	/usr/local/go/src/crypto/tls/conn.go:1370 +0x156
io.ReadAtLeast({0x7f0d74555118, 0xc009e82a88}, {0xc001de3ee0, 0x9, 0x9}, 0x9)
	/usr/local/go/src/io/io.go:335 +0x90
io.ReadFull(...)
	/usr/local/go/src/io/io.go:354
net/http.http2readFrameHeader({0xc001de3ee0, 0x9, 0x0?}, {0x7f0d74555118?, 0xc009e82a88?})
	/usr/local/go/src/net/http/h2_bundle.go:1638 +0x65
net/http.(*http2Framer).ReadFrame(0xc001de3ea0)
	/usr/local/go/src/net/http/h2_bundle.go:1905 +0x85
net/http.(*http2serverConn).readFrames(0xc0096db040)
	/usr/local/go/src/net/http/h2_bundle.go:4637 +0x87
created by net/http.(*http2serverConn).serve in goroutine 348153
	/usr/local/go/src/net/http/h2_bundle.go:4749 +0x56a

goroutine 348448 [runnable]:
fmt.(*pp).printArg(0xc005e90000?, {0x1742900?, 0xc009d18320?}, 0x73?)
	/usr/local/go/src/fmt/print.go:681 +0x5bd
fmt.(*pp).doPrintf(0xc005e90000, {0x1af4138, 0x37}, {0xc001d6e6c0, 0x4, 0x4})
	/usr/local/go/src/fmt/print.go:1075 +0x37e
fmt.Fprintf({0x1d26ac0, 0xc0096bc8c0}, {0x1af4138, 0x37}, {0xc001d6e6c0, 0x4, 0x4})
	/usr/local/go/src/fmt/print.go:224 +0x71
k8s.io/klog/v2.(*loggingT).printfDepth(0x2a8c900, 0x0, 0x0, {0x0, 0x0}, 0x1, {0x1af4138, 0x37}, {0xc001d6e6c0, 0x4, ...})
	/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:763 +0x165
k8s.io/klog/v2.(*loggingT).printf(...)
	/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:744
k8s.io/klog/v2.Infof(...)
	/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1525
github.com/Project-HAMi/HAMi/pkg/scheduler.(*podManager).addPod(0xc000658020, 0xc001100908, {0xc00869e080, 0xe}, 0xc005013980)
	/k8s-vgpu/pkg/scheduler/pods.go:63 +0x338
github.com/Project-HAMi/HAMi/pkg/scheduler.(*Scheduler).Filter(0xc000658000, {0xc001100908?, 0x0?, 0xc000ad40c0?})
	/k8s-vgpu/pkg/scheduler/scheduler.go:486 +0xb38
github.com/Project-HAMi/HAMi/pkg/scheduler/routes.PredicateRoute.func1({0x1d341b8, 0xc0093ff138}, 0xc006228000, {0x0?, 0x0?, 0x0?})
	/k8s-vgpu/pkg/scheduler/routes/route.go:59 +0x33b
github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc00884dbc0, {0x1d341b8, 0xc0093ff138}, 0xc006228000)
	/go/pkg/mod/github.com/julienschmidt/[email protected]/router.go:387 +0x7eb
net/http.serverHandler.ServeHTTP({0xc00509e480?}, {0x1d341b8?, 0xc0093ff138?}, 0xc0002d42a0?)
	/usr/local/go/src/net/http/server.go:3142 +0x8e
net/http.initALPNRequest.ServeHTTP({{0x1d41a90?, 0xc009eb8300?}, 0xc009e82a88?, {0xc008aec000?}}, {0x1d341b8, 0xc0093ff138}, 0xc006228000)
	/usr/local/go/src/net/http/server.go:3750 +0x231
net/http.(*http2serverConn).runHandler(0x952ba8?, 0xc007482900?, 0x0?, 0xc00a9037d0?)
	/usr/local/go/src/net/http/h2_bundle.go:6192 +0xbb
created by net/http.(*http2serverConn).scheduleHandler in goroutine 348153
	/usr/local/go/src/net/http/h2_bundle.go:6127 +0x21d

goroutine 348318 [select]:
golang.org/x/net/http2.(*clientStream).writeRequest(0xc0007a6900, 0xc0063ccd80, 0x0)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:1536 +0xa85
golang.org/x/net/http2.(*clientStream).doRequest(0xc0007a6900, 0x6ea845?, 0xc009971830?)
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:1414 +0x56
created by golang.org/x/net/http2.(*ClientConn).roundTrip in goroutine 197
	/go/pkg/mod/golang.org/x/[email protected]/http2/transport.go:1319 +0x3e5
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version: 2.3
  • nvidia driver or other AI device driver version: 535.154.05
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
@jeonghyunkeem jeonghyunkeem added the kind/bug Something isn't working label Oct 31, 2024
@Nimbus318
Contributor

Could you please provide the exact HAMi image version, so we can trace the specific line of code? It looks like some map-type fields in the scheduler are accessed concurrently without a lock, which causes the fatal error: concurrent map iteration and map write.
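As a general illustration of this failure mode: the Go runtime aborts the whole process when it detects one goroutine ranging over a map while another writes to it, and that abort cannot be recovered. In the trace above, the faulting goroutine is inside getNodesUsage (called from RegisterFromNodeAnnotations) while an HTTP handler goroutine is inside Filter and addPod. The sketch below shows one common way to avoid the race: guard the map with a sync.RWMutex and iterate over a copy. The types usageCache and nodeUsage are hypothetical, and this is not necessarily how HAMi fixed it.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeUsage is a hypothetical stand-in for per-node scheduler state.
type nodeUsage struct {
	UsedMem int
}

// usageCache guards a map with an RWMutex so that one goroutine can
// iterate while others write. All names are illustrative, not HAMi's.
type usageCache struct {
	mu    sync.RWMutex
	nodes map[string]nodeUsage
}

func newUsageCache() *usageCache {
	return &usageCache{nodes: make(map[string]nodeUsage)}
}

// Set takes the write lock before mutating the map.
func (c *usageCache) Set(name string, u nodeUsage) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[name] = u
}

// Snapshot copies the map under the read lock. Callers iterate over the
// copy without holding any lock.
func (c *usageCache) Snapshot() map[string]nodeUsage {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string]nodeUsage, len(c.nodes))
	for k, v := range c.nodes {
		out[k] = v
	}
	return out
}

func main() {
	c := newUsageCache()
	var wg sync.WaitGroup

	// Writers, playing the role of concurrent scheduling requests.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.Set(fmt.Sprintf("node-%d", i), nodeUsage{UsedMem: j})
			}
		}(i)
	}

	// Reader, playing the role of a periodic registration loop.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for j := 0; j < 1000; j++ {
			total := 0
			for _, u := range c.Snapshot() {
				total += u.UsedMem
			}
			_ = total
		}
	}()

	wg.Wait()
	fmt.Println("nodes tracked:", len(c.Snapshot()))
}
```

Running a program like this with go run -race is a quick way to check that no unguarded access remains. Copying the map costs an allocation per read, which is usually acceptable for small maps; holding the read lock for the duration of the loop is the alternative when the copy is too expensive.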

@jeonghyunkeem
Author

@Nimbus318 vgpu-scheduler-extender uses the following image: projecthami/hami:v2.3.13

@Nimbus318
Contributor

@jeonghyunkeem Got it. I checked and found where the problem is. This issue was already fixed in #418, so it should no longer occur if you upgrade to the latest version, 2.4.0.
