GPUd release notes (2024-10-27T09:38:10Z)
Welcome to this new release!
What's Changed
- nits(server): debug level log for redundant register attempts by @gyuho in #126
- fix(nvidia-smi/parse): do not parse remapped rows N/A by @gyuho in #128
- feat(component/network): latency checks to global edge/DERP servers (using tailscale) by @gyuho in #125
- fix(containerd): readable query failure error message (when CRI is not set up) by @gyuho in #129
- fix(components): do not panic when there's no data collected yet by @gyuho in #130
- feat(nvidia): exposing SM core and tensor core metrics in GPUd by @photoszzt in #132
- fix(nvidia/query/metrics): remove duplicate metric register call by @gyuho in #133
- feat(charts): add gpud run helm chart by @gyuho in #123
- fix(infiniband): simplify ibstat existence check when evaluating health by @gyuho in #124
- feat(network/latency): track latency in metrics per region by @gyuho in #134
- Update mothership endpoint by @cardyok in #82
- fix(nvidia): use NVML + lspci to detect NVIDIA GPUs (without running nvidia-smi) by @gyuho in #127
- fix(server): handle "components" URL query, return 404 not found on unknown component queries by @gyuho in #131
- nits(nvidia/query): make detect logs debug level by @gyuho in #135
- fix(status): fix divide by zero by @cardyok in #136
- fix(nvidia/xid): do not log an error when no xid has occurred yet by @gyuho in #138
- fix(nvidia): persistence mode check based on NVML, do not rely on "nvidia-persistenced" binary by @gyuho in #137
New Contributors
- @photoszzt made their first contribution in #132
Full Changelog: v0.0.5...v0.1.0