Neuron SDK Release - July 3, 2024
Neuron 2.19 release adds Llama 3 training support and introduces Flash Attention kernel support to enable LLM training and inference for large sequence lengths. Neuron 2.19 also introduces new features and performance improvements to LLM training, improves LLM inference performance for Llama 3 model by upto 20%, and adds tools for monitoring, problem detection and recovery in Kubernetes (EKS) environments, improving efficiency and reliability.
Training highlights: LLM model training user experience using NeuronX Distributed (NxD) is improved by support for Flash Attention to enable training with longer sequence lengths >= 8K. Neuron 2.19 adds support for Llama 3 model training. This release also adds support for Interleaved pipeline parallelism to reduce idle time (bubble size) and enhance training efficiency and resource utilization for large cluster sizes.
Inference highlights: Flash Attention kernel support in the Transformers NeuronX library enables LLM inference for context lengths of up to 32k. This release also adds [Beta] support for continuous batching with mistralai/Mistral-7B-v0.2
in Transformers NeuronX.
Tools and Neuron DLAMI/DLC highlights: This release introduces the new Neuron Node Problem Detector and Recovery plugin in EKS supported Kubernetes environments:a tool to monitor the health of Neuron instances and triggers automatic node replacement upon detecting an unrecoverable error. Neuron 2.19 introduces the new Neuron Monitor container to enable easy monitoring of Neuron metrics in Kubernetes, and adds monitoring support with Prometheus and Grafana. This release also introduces new PyTorch 2.1 and PyTorch 1.13 single framework DLAMIs for Ubuntu 22. Neuron DLAMIs and Neuron DLCs are also updated to support this release (Neuron 2.19).