Release 0.3.8

@BalaBalaYi BalaBalaYi released this 29 Sep 02:02
bdc5ed2

Features:

  • Added the basic implementation of the first version of positive diagnostics.
  • Added support for a 'fast-fail' strategy for training jobs in some boundary scenarios, e.g. the pending case.
  • Accelerated pod creation (sync -> async).
  • Added the basic implementation of structured event logging.

BugFix:

  • Fixed unexpected rendezvous failure in occasional fault-tolerant scenarios.
  • Fixed unexpected socket client creation before socket creation.
  • Optimized 'network-check' implementation for 'Ascend NPU'.
  • Optimized some implementations for the master fault-tolerance (internal) scenario.
  • Fixed and optimized numerous other known issues.