
Releases: pytorch/test-infra

Runner lambdas v20221102-115928

02 Nov 12:00
2299ad1
rewrite metrics CW to leverage dimensions and be compatible with meta…

Runner lambdas v20221101-172204

01 Nov 17:23
c8e87d5
reference version over lambda alias (#995)

Runner lambdas v20221031-105322

31 Oct 10:54
95fac91
limit cloudwatch metrics for linux disk to /, other mount points are …

Runner lambdas v20221025-105328

25 Oct 10:54
f524390
GHA runners - Separate AMI owner filters for linux and windows instan…

Runner lambdas v20221021-231347

21 Oct 23:14
4412592
Fix runaway runner deletion on scale-down when API quota is hit (#938)

Rethrow the octokit 'API rate limit exceeded' errors when fetching
runner info on `scale-down` instead of consuming them.

Please see SEV pytorch/pytorch#87500 for details.

Note: I've decided on the least invasive implementation (re-throwing
only a very specific class of exceptions) to avoid unintended side
effects. We still need to discuss with @jeanschmidt whether all exceptions
could be safely rethrown.

Testing:
* Unit tests
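
A minimal sketch of the pattern described above, assuming octokit surfaces the quota error as a `RequestError` whose message contains 'API rate limit exceeded'; the wrapper name and call shape are illustrative, not the repo's exact code:

```typescript
import { RequestError } from '@octokit/request-error';

// Illustrative wrapper: fetch runner info, but rethrow rate-limit errors instead of
// swallowing them, so scale-down aborts rather than treating throttled runners as stale.
async function fetchRunnerInfo<T>(fetch: () => Promise<T>): Promise<T | undefined> {
  try {
    return await fetch();
  } catch (e) {
    if (e instanceof RequestError && e.message.includes('API rate limit exceeded')) {
      throw e; // propagate the quota error; consuming it is what caused runaway deletions
    }
    console.warn('Failed to fetch GitHub runner info', e);
    return undefined; // other failures keep the previous "runner unknown" behaviour
  }
}
```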

Runner lambdas v20221019-100329

19 Oct 10:04
7a7c4ea
FIX: Don't remove EC2 instance when it fails to remove githubRunner (#904)

`removeGithubRunner[Org || Repo]` used to terminate the EC2 instance itself, so
there was no need to call `terminateRunner` again. This could cause runners
that failed to be unregistered from GHA to still be terminated on EC2.

As a fix, `removeGithubRunner` no longer terminates the instance or
generates logs. This lets `scaleDown` control when to call
`terminateRunner` and generate the proper logs and metrics, avoiding
this issue in the future.

This bug also explains why, in the past, more EC2 instances appeared to be
kept for their minimum time: instances below the minimum time were
unregistered and terminated without being tracked in the main application
metric. This becomes obvious when comparing the API calls to terminate with
the count of app-level terminations.

![Screenshot 2022-10-18 at 09 21 47](https://user-images.githubusercontent.com/4520845/196364535-5aaab331-2080-44be-b6af-0702f99d50d9.png)
![Screenshot 2022-10-18 at 09 26 19](https://user-images.githubusercontent.com/4520845/196364542-376ff99f-617e-4e82-b459-dfc8364219ad.png)

Bug initially flagged in
[pytorch/pytorch#87134](https://github.com/pytorch/pytorch/issues/87134)
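
A sketch of the resulting control flow, under the assumption that `removeGithubRunner` and `terminateRunner` have roughly these shapes; the signatures and the `metrics` hook are illustrative, not the repo's actual API:

```typescript
// Assumed helper signatures, for illustration only.
declare function removeGithubRunner(ghRunnerId: number): Promise<boolean>; // GHA deregistration only
declare function terminateRunner(instanceId: string): Promise<void>;       // EC2 termination
declare const metrics: { runnerTerminated(instanceId: string): void };

async function scaleDownRunner(runner: { instanceId: string; ghRunnerId: number }): Promise<void> {
  const unregistered = await removeGithubRunner(runner.ghRunnerId);
  if (!unregistered) {
    // Keep the EC2 instance: a runner still registered with GHA may be running a job.
    console.warn(`Could not unregister runner ${runner.ghRunnerId}; skipping termination`);
    return;
  }
  // scaleDown owns termination, so the log and metric are emitted exactly once, here.
  await terminateRunner(runner.instanceId);
  metrics.runnerTerminated(runner.instanceId);
}
```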

Runner lambdas v20221017-084425

17 Oct 08:45
82d970e
FIX: add back metrics runnerLessMinimumTime and runnerFound to scaleD…

Runner lambdas v20221012-113302

12 Oct 11:34
63f4a29
Rewrite scaleDown to fix a series of bugs (#864)

On scaleDown:

- [FIX] Guarantee that runners are stopped from the oldest to the newest,
avoiding keeping runners around for too long;
- [FIX] Fixed a bug where a runner could be removed from GHA but kept
running on AWS;
- [IMPROVED] Always try to maintain a minimum of `minAvailableRunners`
free runners (see the sketch below);
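
A minimal sketch of the ordering and minimum-free logic described above; `minAvailableRunners` comes from the note, while `RunnerInfo` and the selection helper are illustrative assumptions rather than the repo's code:

```typescript
interface RunnerInfo {
  instanceId: string;
  launchTime: Date;
  busy: boolean;
}

// Pick idle runners to terminate, oldest first, while keeping `minAvailableRunners` free.
function selectRunnersToTerminate(runners: RunnerInfo[], minAvailableRunners: number): RunnerInfo[] {
  const idle = runners
    .filter((r) => !r.busy)
    .sort((a, b) => a.launchTime.getTime() - b.launchTime.getTime()); // oldest -> newest

  const excess = Math.max(0, idle.length - minAvailableRunners);
  return idle.slice(0, excess); // terminate only the oldest surplus runners
}
```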

Runner lambdas v20221005-093519

05 Oct 09:36
2060ac2
Jeanschmidt/runners send metrics (#843)


This change updates the CW agent config, enabling runners to send host metrics so that dashboards/alerts can be built for disk usage, CPU, memory, etc.
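
For reference, the kind of amazon-cloudwatch-agent host-metrics section this enables looks roughly like the sketch below; the values are illustrative, and the actual config shipped to the runners lives in the repo's templates:

```typescript
// Illustrative shape of the CloudWatch agent host-metrics config, not the repo's exact file.
const cloudWatchAgentConfig = {
  metrics: {
    metrics_collected: {
      cpu: { measurement: ['usage_active'] },                    // CPU utilisation
      mem: { measurement: ['mem_used_percent'] },                // memory usage
      disk: { measurement: ['used_percent'], resources: ['/'] }, // disk usage (later limited to '/', per the v20221031 release above)
    },
  },
};
```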

Runner lambdas v20221003-131827

03 Oct 13:19
c191dd8
scaleDown lambda tries to send metrics 10s before timing out (#830)
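
A hedged sketch of how a Lambda handler can do this with `context.getRemainingTimeInMillis()`; the handler shape and the `doScaleDown`/`sendMetrics` helpers are assumptions for illustration, not the repo's actual code:

```typescript
import type { Context, ScheduledEvent } from 'aws-lambda';

declare function doScaleDown(): Promise<void>; // assumed main scale-down work
declare function sendMetrics(): Promise<void>; // assumed metrics flush helper

export async function handler(event: ScheduledEvent, context: Context): Promise<void> {
  // Schedule a metrics flush 10s before the Lambda would time out, so metrics
  // are not lost if the main work runs past the configured timeout.
  const flushAt = Math.max(0, context.getRemainingTimeInMillis() - 10_000);
  const timer = setTimeout(() => void sendMetrics(), flushAt);

  try {
    await doScaleDown();
  } finally {
    clearTimeout(timer);
    await sendMetrics(); // normal flush when the work finishes in time
  }
}
```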