Releases: aws-neuron/aws-neuron-sdk
Neuron SDK Release - October 25, 2024
The Neuron 2.20.1 release addresses an issue with the Neuron Persistent Cache that was introduced in the 2.20 release. In 2.20, the issue resulted in a cache-miss scenario when attempting to load a previously compiled Neuron Executable File Format (NEFF) from a different path or Python environment than the one used for the initial Neuron SDK installation and NEFF compilation. This release resolves the cache-miss problem, ensuring that NEFFs can be loaded correctly regardless of the path or Python environment used to install the Neuron SDK, as long as they were compiled using the same Neuron SDK version.
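Conceptually, a persistent compilation cache avoids this class of bug by keying entries on the compiled artifact's inputs (graph contents and compiler version) rather than on filesystem paths. The sketch below illustrates that keying idea in plain Python; it is not the actual Neuron cache implementation, and `cache_key` is a hypothetical helper:

```python
import hashlib

def cache_key(graph_bytes: bytes, compiler_version: str) -> str:
    # Key on graph contents + compiler version only -- never on the
    # filesystem path or Python environment -- so the same graph compiled
    # by the same compiler version always maps to the same cache entry.
    h = hashlib.sha256()
    h.update(graph_bytes)
    h.update(compiler_version.encode())
    return h.hexdigest()

# The same graph yields the same key regardless of where it lives on disk.
k1 = cache_key(b"example-hlo-graph", "2.20.1")
k2 = cache_key(b"example-hlo-graph", "2.20.1")
assert k1 == k2
# A different compiler version produces a different key (cache miss).
assert cache_key(b"example-hlo-graph", "2.20.0") != k1
```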
This release also addresses an excessive lock wait time issue during neuron_parallel_compile graph extraction for large-cluster training. See the PyTorch Neuron (torch-neuronx) release notes and the Neuron XLA pluggable device (libneuronxla) release notes.
Additionally, Neuron 2.20.1 introduces a new Multi Framework DLAMI for Amazon Linux 2023 (AL2023) that customers can use to easily get started with the latest Neuron SDK across the frameworks that Neuron supports. See the Neuron DLAMI Release Notes.
The Neuron 2.20.1 Training DLC is also updated to pre-install the necessary dependencies and support the NxD Training library out of the box. See the Neuron DLC Release Notes.
Neuron SDK Release - September 16, 2024
The Neuron 2.20 release introduces usability improvements and new capabilities across training and inference workloads. A key highlight is the Neuron Kernel Interface (NKI, beta). NKI, pronounced ‘Nicky’, enables developers to build optimized custom compute kernels for Trainium and Inferentia. Additionally, this release introduces NxD Training (beta), a PyTorch-based library for efficient distributed training with a user-friendly interface compatible with NeMo. This release also introduces support for the JAX framework (beta).
Neuron 2.20 also adds inference support for Pixart-alpha and Pixart-sigma Diffusion-Transformer (DiT) models, and adds support for Llama 3.1 8B, 70B, and 405B model inference with up to 128K context length.
Neuron SDK Release - July 19, 2024
This release (Neuron 2.19.1) addresses an issue with the Neuron Persistent Cache that was introduced in the previous release, Neuron 2.19. The issue resulted in a cache-miss scenario when attempting to load a previously compiled Neuron Executable File Format (NEFF) from a different path or Python environment than the one used for the initial Neuron SDK installation and NEFF compilation. This release resolves the cache-miss problem, ensuring that NEFFs can be loaded correctly regardless of the path or Python environment used to install the Neuron SDK, as long as they were compiled using the same Neuron SDK version.
Neuron SDK Release - July 3, 2024
The Neuron 2.19 release adds Llama 3 training support and introduces Flash Attention kernel support to enable LLM training and inference at large sequence lengths. Neuron 2.19 also introduces new features and performance improvements to LLM training, improves LLM inference performance for the Llama 3 model by up to 20%, and adds tools for monitoring, problem detection, and recovery in Kubernetes (EKS) environments, improving efficiency and reliability.
Training highlights: The LLM training user experience with NeuronX Distributed (NxD) is improved by Flash Attention support, enabling training with sequence lengths of 8K and above. Neuron 2.19 adds support for Llama 3 model training. This release also adds support for interleaved pipeline parallelism to reduce idle time (bubble size) and improve training efficiency and resource utilization at large cluster sizes.
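The benefit of interleaved pipeline parallelism can be quantified with the usual bubble-ratio estimate from the pipeline-parallelism literature: for p stages, m microbatches, and v interleaved virtual stages per device, idle time relative to compute time shrinks roughly as (p − 1)/(v·m). A small illustrative sketch (this is the general estimate, not Neuron code):

```python
def bubble_ratio(p: int, m: int, v: int = 1) -> float:
    # Approximate ratio of idle ("bubble") time to useful compute time
    # for a 1F1B pipeline schedule with p stages, m microbatches, and
    # v interleaved virtual stages per device (v=1 is non-interleaved).
    return (p - 1) / (v * m)

# 8 pipeline stages, 32 microbatches:
plain = bubble_ratio(p=8, m=32)             # 7/32 ~= 0.219
interleaved = bubble_ratio(p=8, m=32, v=2)  # 7/64 ~= 0.109
assert interleaved < plain                  # interleaving halves the bubble
```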
Inference highlights: Flash Attention kernel support in the Transformers NeuronX library enables LLM inference for context lengths of up to 32K. This release also adds [Beta] support for continuous batching with mistralai/Mistral-7B-v0.2 in Transformers NeuronX.
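Flash Attention's key trick is an online (streaming) softmax: attention is computed block by block while a running maximum and normalizer are maintained, so the full score row for a long context never has to be materialized at once. The conceptual sketch below (plain Python, not the Neuron kernel) shows the blockwise computation matching the naive result:

```python
import math

def naive_softmax_weighted_sum(scores, values):
    # Reference: softmax(scores) . values computed over the full row.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return sum(e / z * v for e, v in zip(exps, values))

def online_softmax_weighted_sum(scores, values, block=2):
    # Flash-attention-style streaming: process scores/values in blocks,
    # maintaining a running max (m), normalizer (z), and accumulator (acc)
    # so only one block is held at a time.
    m, z, acc = float("-inf"), 0.0, 0.0
    for i in range(0, len(scores), block):
        s_blk = scores[i:i + block]
        v_blk = values[i:i + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)       # rescale previous partial sums
        z = z * scale + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / z

scores = [0.1, 2.0, -1.0, 0.5, 3.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
assert abs(naive_softmax_weighted_sum(scores, values)
           - online_softmax_weighted_sum(scores, values)) < 1e-9
```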
Tools and Neuron DLAMI/DLC highlights: This release introduces the new Neuron Node Problem Detector and Recovery plugin for EKS-supported Kubernetes environments: a tool that monitors the health of Neuron instances and triggers automatic node replacement upon detecting an unrecoverable error. Neuron 2.19 introduces the new Neuron Monitor container to enable easy monitoring of Neuron metrics in Kubernetes, and adds monitoring support with Prometheus and Grafana. This release also introduces new PyTorch 2.1 and PyTorch 1.13 single-framework DLAMIs for Ubuntu 22. Neuron DLAMIs and Neuron DLCs are also updated to support this release (Neuron 2.19).
Neuron SDK Release - April 25, 2024
Patch release with minor Neuron Compiler bug fixes and enhancements. See more in the Neuron Compiler (neuronx-cc) release notes.
Neuron SDK Release - April 10, 2024
The Neuron 2.18.1 release introduces continuous batching (beta) and Neuron vLLM integration (beta) in the Transformers NeuronX library, improving LLM inference throughput. This release also fixes hang issues related to Triton Inference Server and updates the Neuron DLAMIs and DLCs to this release (2.18.1). See more in the Transformers Neuron (transformers-neuronx) release notes and the Neuron Compiler (neuronx-cc) release notes.
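Continuous batching, as used by vLLM-style serving, admits new requests as soon as running sequences finish instead of waiting for a whole static batch to drain. The toy scheduler below illustrates why this raises throughput; it is a conceptual sketch only, not the Transformers NeuronX implementation:

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    # Toy scheduler: each request is (name, tokens_to_generate). A finished
    # sequence frees its batch slot immediately and a waiting request is
    # admitted on the very next step, instead of waiting for the whole
    # batch to drain as static batching would.
    waiting = deque(requests)
    running, steps, finished = [], 0, []
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(list(waiting.popleft()))
        steps += 1                         # one decode step for the batch
        for seq in running:
            seq[1] -= 1                    # every running sequence emits a token
        finished += [name for name, left in running if left == 0]
        running = [s for s in running if s[1] > 0]
    return steps, finished

# Static batching would run (a, b) for 3 steps, then c for 2 steps: 5 total.
steps, order = continuous_batching([("a", 3), ("b", 1), ("c", 2)])
assert steps == 3 and order == ["b", "a", "c"]
```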
Neuron SDK Release - April 1, 2024
What's New
Neuron 2.18 release introduces stable support (out of beta) for PyTorch 2.1, introduces new features and performance improvements to LLM training and inference, and updates Neuron DLAMIs and Neuron DLCs to support this release (Neuron 2.18).
Training highlights: LLM model training user experience using NeuronX Distributed (NxD) is improved by introducing asynchronous checkpointing. This release also adds support for auto partitioning pipeline parallelism in NxD and introduces Pipeline Parallelism in PyTorch Lightning Trainer (beta).
Inference highlights: Speculative Decoding support (beta) in the Transformers NeuronX (TNx) library improves LLM inference throughput and time-per-output-token (TPOT) latency by up to 25% (for LLMs such as Llama-2-70B). TNx also improves weight loading performance by adding support for the SafeTensors checkpoint format. Bucketed inference in PyTorch NeuronX and NeuronX Distributed is improved by a new auto-bucketing feature. This release also adds new samples for Mixtral-8x7B-v0.1 and mistralai/Mistral-7B-Instruct-v0.2 in TNx.
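In speculative decoding, a small draft model proposes several tokens that the large target model then verifies in one pass; the accepted prefix (plus one corrected or bonus token) advances generation by multiple tokens per expensive target invocation. The greedy-decoding sketch below uses toy deterministic models and is illustrative only, not the TNx API:

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    # Greedy speculative decoding sketch: a cheap draft model proposes
    # k tokens, the expensive target model verifies them, and the longest
    # agreeing prefix is accepted (plus one corrected or bonus token),
    # so several tokens can be produced per target verification pass.
    out = list(prompt)
    target_calls = 0
    while len(out) - len(prompt) < n_tokens:
        proposal = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        target_calls += 1                  # one (batched) verification pass
        accepted = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction, stop here
                break
        else:
            accepted.append(target(out + accepted))  # all k agreed: bonus token
        out += accepted
    return out[len(prompt):][:n_tokens], target_calls

# Toy deterministic models over integer tokens: the target follows a fixed
# cycle; the draft agrees with it except when the context length is a
# multiple of 7.
target = lambda seq: len(seq) % 3
draft = lambda seq: 0 if len(seq) % 7 == 0 else len(seq) % 3
tokens, calls = speculative_decode(target, draft, prompt=[7], n_tokens=8)
assert tokens == [1, 2, 0, 1, 2, 0, 1, 2]  # identical to greedy target decoding
assert calls == 3                          # only 3 target passes for 8 tokens
```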
Neuron DLAMI and Neuron DLC support highlights: This release introduces a new Multi Framework DLAMI for Ubuntu 22 that customers can use to easily get started with the latest Neuron SDK across the frameworks that Neuron supports, as well as SSM parameter support for DLAMIs to automate retrieval of the latest DLAMI ID in cloud automation flows. It also adds new Neuron Training and Inference Deep Learning Containers (DLCs) for PyTorch 2.1, a new dedicated GitHub repository to host Neuron container dockerfiles, and a public Neuron container registry to host Neuron container images.
Neuron SDK Release - February 13, 2024
What's New
The Neuron 2.17 release improves the performance of small collective communication operators (message sizes smaller than 16 MB) by up to 30%, which improves large language model (LLM) inference performance by up to 10%. This release also includes improvements in the :ref:`Neuron Profiler <neuron-profile-ug>` and other minor enhancements and bug fixes.
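Small collectives matter because their cost is dominated by per-step latency rather than bandwidth: a ring all-reduce over n ranks takes 2(n − 1) communication steps regardless of message size. The pure-Python simulation below illustrates the classic ring algorithm and its step count; it is a conceptual sketch, not the Neuron Runtime implementation:

```python
def ring_allreduce(data):
    # Conceptual ring all-reduce: data[r] is rank r's vector, split into
    # n chunks (one per rank). A reduce-scatter phase followed by an
    # all-gather phase takes 2*(n-1) steps, each moving one chunk per
    # rank -- so for small messages, fixed per-step latency dominates.
    n = len(data)
    buf = [list(v) for v in data]
    steps = 0
    for s in range(n - 1):                 # reduce-scatter phase
        steps += 1
        sends = [(r, (r - s) % n, buf[r][(r - s) % n]) for r in range(n)]
        for r, c, val in sends:            # rank r sends chunk c to rank r+1
            buf[(r + 1) % n][c] += val
    for s in range(n - 1):                 # all-gather phase
        steps += 1
        sends = [(r, (r + 1 - s) % n, buf[r][(r + 1 - s) % n]) for r in range(n)]
        for r, c, val in sends:
            buf[(r + 1) % n][c] = val
    return buf, steps

result, steps = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
assert all(row == [12, 15, 18] for row in result)  # every rank has the sum
assert steps == 4                                  # 2 * (n - 1) steps
```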
For more detailed release notes of the new features and resolved issues, see :ref:`components-rn`.
To learn about the model architectures currently supported on Inf1, Inf2, Trn1 and Trn1n instances, please see :ref:`model_architecture_fit`.
Neuron Components Release Notes
Inf1, Trn1/Trn1n and Inf2 common packages
Component | Instance/s | Package/s | Details |
---|---|---|---|
Neuron Runtime | Trn1/Trn1n, Inf1, Inf2 | Trn1/Trn1n: aws-neuronx-runtime-lib (.deb, .rpm) Inf1: Runtime is linked into the ML frameworks packages | :ref:`neuron-runtime-rn` |
Neuron Runtime Driver | Trn1/Trn1n, Inf1, Inf2 | aws-neuronx-dkms (.deb, .rpm) | :ref:`neuron-driver-release-notes` |
Neuron System Tools | Trn1/Trn1n, Inf1, Inf2 | aws-neuronx-tools (.deb, .rpm) | :ref:`neuron-tools-rn` |
Containers | Trn1/Trn1n, Inf1, Inf2 | aws-neuronx-k8-plugin (.deb, .rpm) aws-neuronx-k8-scheduler (.deb, .rpm) aws-neuronx-oci-hooks (.deb, .rpm) | :ref:`neuron-k8-rn` :ref:`neuron-containers-release-notes` |
NeuronPerf (Inference only) | Trn1/Trn1n, Inf1, Inf2 | neuronperf (.whl) | :ref:`neuronperf_rn` |
TensorFlow Model Server Neuron | Trn1/Trn1n, Inf1, Inf2 | tensorflow-model-server-neuronx (.deb, .rpm) | :ref:`tensorflow-modeslserver-neuronx-rn` |
Neuron Documentation | Trn1/Trn1n, Inf1, Inf2 | | :ref:`neuron-documentation-rn` |
Neuron SDK Release - January 18, 2024
Patch release with compiler bug fixes and updates to the Neuron Device Plugin and Neuron Kubernetes Scheduler.
Neuron SDK Release - December 21, 2023
What’s New
Neuron 2.16 adds support for Llama-2-70B training and inference, upgrades to PyTorch 2.1 (beta), adds new support for PyTorch Lightning Trainer (beta), delivers performance improvements, and adds Amazon Linux 2023 support.
Training highlights: LLM training performance in the NeuronX Distributed library is improved by up to 15%. The LLM training user experience is improved by introducing support for PyTorch Lightning Trainer (beta) and a new model-optimizer wrapper that minimizes the changes needed to partition models using NeuronX Distributed primitives.
Inference highlights: PyTorch inference now supports dynamically swapping different fine-tuned weights into an already loaded model, and overall LLM inference throughput and latency are improved with Transformers NeuronX. Two new reference samples are added for Llama-2-70B and Mistral-7B model inference.
User experience: This release introduces two new capabilities: a new tool, Neuron Distributed Event Tracing (NDET), which improves debuggability, and support for profiling collective communication operators in the Neuron Profiler tool.
More release content can be found in the table below and each component release notes.
What’s New | Details | Instances |
---|---|---|
Transformers NeuronX (transformers-neuronx) for Inference | [Beta] Support for Grouped Query Attention (GQA). See developer guide. [Beta] Support for Llama-2-70b model inference using Grouped Query Attention. See tutorial. [Beta] Support for Mistral-7B-Instruct-v0.1 model inference. See sample code. See more at Transformers Neuron (transformers-neuronx) release notes | Inf2, Trn1/Trn1n |
NeuronX Distributed (neuronx-distributed) for Training | [Beta] Support for PyTorch Lightning to train models using tensor parallelism and data parallelism. See api guide, developer guide, and tutorial. Support for a Model and Optimizer Wrapper training API that handles the parallelization. See api guide and Developer guide for model and optimizer wrapper (neuronx-distributed). New save_checkpoint and load_checkpoint APIs to save/load checkpoints during distributed training. See Developer guide for save/load checkpoint (neuronx-distributed). Support for a new Query-Key-Value (QKV) module that provides the ability to replicate the Key-Value heads and adds flexibility to use a higher tensor parallel degree during training. See api guide and tutorial. See more at Neuron Distributed Release Notes (neuronx-distributed) | Trn1/Trn1n |
NeuronX Distributed (neuronx-distributed) for Inference | Support for weight deduplication among TP shards by giving the ability to save weights separately from the NEFF files. See developer guide. Llama-2-7B model inference script ([html] [notebook]). See more at Neuron Distributed Release Notes (neuronx-distributed) and API Reference Guide (neuronx-distributed) | Inf2, Trn1/Trn1n |
PyTorch NeuronX (torch-neuronx) | [Beta] Support for PyTorch 2.1. See Introducing PyTorch 2.1 Support (Beta). See llama-2-13b inference sample. Support to separate model weights from NEFF files and a new replace_weights API to replace the separated weights. See PyTorch Neuron (torch-neuronx) Weight Replacement API for Inference and PyTorch NeuronX Tracing API for Inference. [Beta] Script for training stabilityai/stable-diffusion-2-1-base and runwayml/stable-diffusion-v1-5 models. See script. [Beta] Script for training facebook/bart-large model. See script. [Beta] Script for stabilityai/stable-diffusion-2-inpainting model inference. See script | Trn1/Trn1n, Inf2 |
Neuron Tools | New Neuron Distributed Event Tracing (NDET) tool to help visualize execution trace logs and diagnose errors in multi-node workloads. See Neuron Distributed Event Tracing (NDET) User Guide. Support for multi-worker jobs in neuron-profile. See Neuron Profile User Guide. See more at Neuron System Tools | Inf1, Inf2, Trn1/Trn1n |
Documentation Updates | Added setup guide instructions for the AL2023 OS. See Setup Guide. Added announcement for the name change of Neuron Components. See Announcing Name Change for Neuron Components. Added announcement for end of support for PyTorch 1.10. See Announcing End of Support for PyTorch Neuron version 1.10. Added announcement for end of support for PyTorch 2.0 beta. See Announcing End of Support for PyTorch NeuronX version 2.0 (beta). See more at Neuron Documentation Release Notes | Inf1, Inf2, Trn1/Trn1n |
Minor enhancements and bug fixes. | See Neuron Components Release Notes | Trn1/Trn1n, Inf2, Inf1 |
Known Issues and Limitations | See 2.16.0 Known Issues and Limitations | Trn1/Trn1n, Inf2, Inf1 |
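The KV-head replication mentioned for the QKV module above can be illustrated with a small arithmetic sketch: with grouped-query attention, a model can have fewer KV heads than the desired tensor-parallel degree, so KV heads are replicated until every rank holds one. The function name below is illustrative, not the NeuronX Distributed API:

```python
def replicated_kv_heads(kv_heads, q_heads, tp_degree):
    # Grouped-query attention sketch: when a model has fewer KV heads
    # than the desired tensor-parallel degree, each KV head is replicated
    # so every rank holds exactly one copy, unlocking higher TP degrees.
    assert q_heads % tp_degree == 0, "query heads must shard evenly"
    if kv_heads >= tp_degree:
        assert kv_heads % tp_degree == 0
        return kv_heads                    # no replication needed
    assert tp_degree % kv_heads == 0
    return kv_heads * (tp_degree // kv_heads)

# Llama-2-70B-style GQA: 64 query heads, 8 KV heads.
assert replicated_kv_heads(kv_heads=8, q_heads=64, tp_degree=32) == 32
assert replicated_kv_heads(kv_heads=8, q_heads=64, tp_degree=8) == 8
```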