
Releases: pytorch/test-infra

Runner lambdas v20221102-115928

02 Nov 12:00
2299ad1
rewrite metrics CW to leverage dimensions and be compatible with meta…

Runner lambdas v20221101-172204

01 Nov 17:23
c8e87d5
reference version over lambda alias (#995)

Runner lambdas v20221031-105322

31 Oct 10:54
95fac91
limit cloudwatch metrics for linux disk to /, other mount points are …

Runner lambdas v20221025-105328

25 Oct 10:54
f524390
GHA runners - Separate AMI owner filters for linux and windows instan…

Runner lambdas v20221021-231347

21 Oct 23:14
4412592
Fix runaway runner deletion on scale-down when API quota is hit (#938)

Rethrow the octokit 'API rate limit exceeded' errors when fetching
runner info on `scale-down` instead of consuming them.

Please see SEV pytorch/pytorch#87500 for details.

Note: I've decided on the least invasive implementation (re-throwing
only a very specific class of exceptions) to avoid unintended side
effects. We still need to discuss with @jeanschmidt whether all exceptions
could be safely rethrown.

Testing:
* Unit tests
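
A minimal sketch of the pattern described above, assuming octokit surfaces the quota error as a `RequestError` whose message contains 'API rate limit exceeded'; the wrapper name and call shape are illustrative, not the repo's exact code:

```typescript
import { RequestError } from '@octokit/request-error';

// Illustrative wrapper: fetch runner info, but rethrow rate-limit errors instead of
// swallowing them, so scale-down aborts rather than treating throttled runners as stale.
async function fetchRunnerInfo<T>(fetch: () => Promise<T>): Promise<T | undefined> {
  try {
    return await fetch();
  } catch (e) {
    if (e instanceof RequestError && e.message.includes('API rate limit exceeded')) {
      throw e; // propagate the quota error; consuming it is what caused runaway deletions
    }
    console.warn('Failed to fetch GitHub runner info', e);
    return undefined; // other failures keep the previous "runner unknown" behaviour
  }
}
```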

Runner lambdas v20221019-100329

19 Oct 10:04
7a7c4ea
FIX: Don't remove EC2 instance when it fails to remove githubRunner (#904)

`removeGithubRunner[Org || Repo]` used to terminate the EC2 instance itself, so
there was no need to call `terminateRunner` again. This could cause runners
that failed to be unregistered from GHA to still be terminated on EC2.

As a fix, `removeGithubRunner` no longer terminates the instance or
generates logs. This lets `scaleDown` control when to call
`terminateRunner` and generate the proper logs and metrics, avoiding
this issue in the future.

This bug also explains why, in the past, more EC2 instances appeared to be
kept for their minimum time: instances below the minimum time were
unregistered and terminated without being tracked in the main application
metric. This becomes obvious when comparing the API calls to terminate with
the count of app-level terminations.

![Screenshot 2022-10-18 at 09 21 47](https://user-images.githubusercontent.com/4520845/196364535-5aaab331-2080-44be-b6af-0702f99d50d9.png)
![Screenshot 2022-10-18 at 09 26 19](https://user-images.githubusercontent.com/4520845/196364542-376ff99f-617e-4e82-b459-dfc8364219ad.png)

Bug initially flagged in
[pytorch/pytorch#87134](https://github.com/pytorch/pytorch/issues/87134)
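
A sketch of the resulting control flow, under the assumption that `removeGithubRunner` and `terminateRunner` have roughly these shapes; the signatures and the `metrics` hook are illustrative, not the repo's actual API:

```typescript
// Assumed helper signatures, for illustration only.
declare function removeGithubRunner(ghRunnerId: number): Promise<boolean>; // GHA deregistration only
declare function terminateRunner(instanceId: string): Promise<void>;       // EC2 termination
declare const metrics: { runnerTerminated(instanceId: string): void };

async function scaleDownRunner(runner: { instanceId: string; ghRunnerId: number }): Promise<void> {
  const unregistered = await removeGithubRunner(runner.ghRunnerId);
  if (!unregistered) {
    // Keep the EC2 instance: a runner still registered with GHA may be running a job.
    console.warn(`Could not unregister runner ${runner.ghRunnerId}; skipping termination`);
    return;
  }
  // scaleDown owns termination, so the log and metric are emitted exactly once, here.
  await terminateRunner(runner.instanceId);
  metrics.runnerTerminated(runner.instanceId);
}
```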

Runner lambdas v20221017-084425

17 Oct 08:45
82d970e
FIX: add back metrics runnerLessMinimumTime and runnerFound to scaleD…

Runner lambdas v20221012-113302

12 Oct 11:34
63f4a29
Rewrite scaleDown to fix a series of bugs (#864)

On scaleDown:

- [FIX] Guarantee that runners are stopped from the oldest to the newest,
avoiding keeping runners around for too long;
- [FIX] Fixed a bug where a runner could be removed from GHA but kept
running on AWS;
- [IMPROVED] Always try to maintain a minimum of `minAvailableRunners`
free runners (see the sketch below);
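
A minimal sketch of the ordering and minimum-free logic described above; `minAvailableRunners` comes from the note, while `RunnerInfo` and the selection helper are illustrative assumptions rather than the repo's code:

```typescript
interface RunnerInfo {
  instanceId: string;
  launchTime: Date;
  busy: boolean;
}

// Pick idle runners to terminate, oldest first, while keeping `minAvailableRunners` free.
function selectRunnersToTerminate(runners: RunnerInfo[], minAvailableRunners: number): RunnerInfo[] {
  const idle = runners
    .filter((r) => !r.busy)
    .sort((a, b) => a.launchTime.getTime() - b.launchTime.getTime()); // oldest -> newest

  const excess = Math.max(0, idle.length - minAvailableRunners);
  return idle.slice(0, excess); // terminate only the oldest surplus runners
}
```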

Runner lambdas v20221005-093519

05 Oct 09:36
2060ac2
Jeanschmidt/runners send metrics (#843)


This change updates the CW agent config, enabling runners to send host metrics so that dashboards/alerts can be built for disk usage, CPU, memory, etc.
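
For reference, the kind of amazon-cloudwatch-agent host-metrics section this enables looks roughly like the sketch below; the values are illustrative, and the actual config shipped to the runners lives in the repo's templates:

```typescript
// Illustrative shape of the CloudWatch agent host-metrics config, not the repo's exact file.
const cloudWatchAgentConfig = {
  metrics: {
    metrics_collected: {
      cpu: { measurement: ['usage_active'] },                    // CPU utilisation
      mem: { measurement: ['mem_used_percent'] },                // memory usage
      disk: { measurement: ['used_percent'], resources: ['/'] }, // disk usage (later limited to '/', per the v20221031 release above)
    },
  },
};
```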

Runner lambdas v20221003-131827

03 Oct 13:19
c191dd8
scaleDown lambda tries to send metrics 10s before timing out (#830)
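
A hedged sketch of how a Lambda handler can do this with `context.getRemainingTimeInMillis()`; the handler shape and the `doScaleDown`/`sendMetrics` helpers are assumptions for illustration, not the repo's actual code:

```typescript
import type { Context, ScheduledEvent } from 'aws-lambda';

declare function doScaleDown(): Promise<void>; // assumed main scale-down work
declare function sendMetrics(): Promise<void>; // assumed metrics flush helper

export async function handler(event: ScheduledEvent, context: Context): Promise<void> {
  // Schedule a metrics flush 10s before the Lambda would time out, so metrics
  // are not lost if the main work runs past the configured timeout.
  const flushAt = Math.max(0, context.getRemainingTimeInMillis() - 10_000);
  const timer = setTimeout(() => void sendMetrics(), flushAt);

  try {
    await doScaleDown();
  } finally {
    clearTimeout(timer);
    await sendMetrics(); // normal flush when the work finishes in time
  }
}
```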