Neuron SDK Release 2.20.0
---------

Co-authored-by: Finn Thompson <[email protected]>
Co-authored-by: mounchin <[email protected]>
Co-authored-by: Gandiinaa Gumenjav <[email protected]>
Co-authored-by: Wojciech Romaszkan <[email protected]>
Co-authored-by: Gandii Gumenjav <[email protected]>
Co-authored-by: Esha Lakhotia <[email protected]>
Co-authored-by: Nicholas Waldron <[email protected]>
Co-authored-by: mario-aws <[email protected]>
Co-authored-by: aws-ivanrco <[email protected]>
Co-authored-by: Joshua Hannan <[email protected]>
Co-authored-by: Alvin Yin <[email protected]>
Co-authored-by: musunita <[email protected]>
Co-authored-by: Arjun Raman <[email protected]>
Co-authored-by: Vikas Paliwal <[email protected]>
Co-authored-by: Akhil Raj Azhikodan <[email protected]>
Co-authored-by: Rahul Solanki <[email protected]>
Co-authored-by: jeffhataws <[email protected]>
Co-authored-by: Karthick Gopalswamy <[email protected]>
Co-authored-by: Shubham Chandak <[email protected]>
Co-authored-by: aws-rishyraj <[email protected]>
Co-authored-by: geetasg <[email protected]>
Co-authored-by: Shruthi (AWS) <[email protected]>
Co-authored-by: awshaichen <[email protected]>
Co-authored-by: aws-patlange <[email protected]>
Co-authored-by: Bowen Chen <[email protected]>
Co-authored-by: Maen Suleiman <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: aws-caijune <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>
Co-authored-by: aws-yishanm <[email protected]>
Co-authored-by: gsnaws <[email protected]>
Co-authored-by: anistala (AWS) <[email protected]>
Co-authored-by: Karan Dhiman <[email protected]>
Co-authored-by: Rahul Solanki <[email protected]>
Co-authored-by: Huang, Guangtai <[email protected]>
Co-authored-by: Finn Thompson <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Maen Suleiman <[email protected]>
Co-authored-by: micwade (AWS) <[email protected]>
Co-authored-by: Nikhil Yogendra Murali <[email protected]>
Co-authored-by: aws-auderian <[email protected]>
Co-authored-by: aws-bhegedus <[email protected]>
Co-authored-by: Mustafa Quraish <[email protected]>
Co-authored-by: Zhitao Lin <[email protected]>
Co-authored-by: sitgupta-aws <[email protected]>
Co-authored-by: Roopnath <[email protected]>
Co-authored-by: Zhuang Wang <[email protected]>
Co-authored-by: Nathan Mailhot <[email protected]>
Co-authored-by: Nathan Mailhot <[email protected]>
Co-authored-by: Karan Dhiman <[email protected]>
Showing 451 changed files with 24,177 additions and 2,500 deletions.
22 changes: 22 additions & 0 deletions .github/stale_issue_mark_close_workflow.yml
@@ -0,0 +1,22 @@
name: Close inactive issues
on:
schedule:
- cron: "30 1 * * *"

jobs:
close-issues:
runs-on: ubuntu-latest
permissions:
issues: write
pull-requests: write
steps:
- uses: actions/stale@v5
with:
days-before-issue-stale: 30
days-before-issue-close: 14
stale-issue-label: "stale"
stale-issue-message: "This issue is stale because it has been open for 30 days with no activity."
close-issue-message: "This issue was closed because it has been inactive for 14 days since being marked as stale."
days-before-pr-stale: -1
days-before-pr-close: -1
repo-token: ${{ secrets.GITHUB_TOKEN }}
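The timing implied by this workflow can be sketched in a few lines of Python. This is only an illustration of the two thresholds configured above (30 days to stale, 14 more days to close); the function name `stale_timeline` is hypothetical, and the sketch assumes no activity occurs after the given date. Because the job runs once a day on the cron schedule, the real action takes effect on the first run on or after these dates.

```python
from datetime import date, timedelta

# Values copied from the workflow configuration above.
DAYS_BEFORE_ISSUE_STALE = 30
DAYS_BEFORE_ISSUE_CLOSE = 14


def stale_timeline(last_activity: date) -> tuple[date, date]:
    """Return (earliest stale date, earliest close date) for an issue.

    Simplifying assumption: nothing happens on the issue after
    `last_activity`, so neither timer is reset.
    """
    stale_on = last_activity + timedelta(days=DAYS_BEFORE_ISSUE_STALE)
    close_on = stale_on + timedelta(days=DAYS_BEFORE_ISSUE_CLOSE)
    return stale_on, close_on


if __name__ == "__main__":
    stale_on, close_on = stale_timeline(date(2024, 9, 1))
    print(f"marked stale on or after {stale_on}")
    print(f"closed on or after {close_on}")
```

Pull requests are excluded from this timeline: the workflow sets both `days-before-pr-stale` and `days-before-pr-close` to -1.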
1 change: 1 addition & 0 deletions .gitignore
@@ -10,3 +10,4 @@ src/examples/pytorch/libtorch_demo.tar.gz
*-checkpoint.ipynb
.idea/
.vscode/
*/nki/*/generated/
12 changes: 6 additions & 6 deletions .readthedocs.yml
@@ -7,12 +7,12 @@ version: 2

# Set the version of Python and other tools you might need
build:
os: "ubuntu-20.04"
os: "ubuntu-22.04"
tools:
python: "3.7"
jobs:
pre_build:
- python -m sphinx -b linkcheck . _build/linkcheck
python: "3.10"
# jobs:
# pre_build:
# - python -m sphinx -b linkcheck . _build/linkcheck

# Build documentation in the docs/ directory with Sphinx
sphinx:
@@ -33,4 +33,4 @@ formats:
# Optionally set the version of Python and requirements required to build your docs
python:
install:
- requirements: requirements.txt
- requirements: requirements.txt
8 changes: 5 additions & 3 deletions _ext/neuron_tag.py
@@ -23,7 +23,7 @@
'frameworks/tensorflow/tensorflow-neuron/'
]
add_trn1_tag = ['frameworks/neuron-customops/','frameworks/torch/inference-torch-neuronx']
add_neuronx_tag = ['frameworks/torch/torch-neuronx/','frameworks/tensorflow/tensorflow-neuronx/','frameworks/torch/inference-torch-neuronx/','libraries/transformers-neuronx/','libraries/neuronx-distributed/','general/setup/tensorflow-neuronx']
add_neuronx_tag = ['frameworks/torch/torch-neuronx/','frameworks/tensorflow/tensorflow-neuronx/','frameworks/torch/inference-torch-neuronx/','libraries/transformers-neuronx/','libraries/neuronx-distributed/','neuronx-distributed/nxd-training', 'general/setup/tensorflow-neuronx']
clear_inf1_tag = ['general/arch/neuron-features/neuron-caching',
'general/arch/neuron-features/eager-debug-mode',
'general/arch/neuron-features/collective-communication-operations',
@@ -74,7 +74,8 @@
'general/setup/neuron-setup/tensorflow/neuronx/',
'general/setup/neuron-setup/pytorch/neuronx/',
'general/models/inference-inf2-trn1-samples',
'general/models/training-trn1-samples'
'general/models/training-trn1-samples',
'general/nki/',
]

clear_inf2_tag = ['frameworks/torch/torch-neuronx/training',
@@ -84,7 +85,8 @@
'general/arch/neuron-hardware/trn1-arch',
'general/arch/neuron-hardware/trainium',
'general/benchmarks/trn1/trn1-inference-performance',
'general/benchmarks/trn1/trn1-training-performance'
'general/benchmarks/trn1/trn1-training-performance',
'neuronx-distributed/nxd-training'
]

clear_trn1_tag = [ 'general/arch/neuron-hardware/inf2-arch',
24 changes: 20 additions & 4 deletions conf.py
@@ -15,6 +15,8 @@
import datetime

sys.path.append(os.path.abspath("./_ext"))
sys.path.append(os.path.abspath("./general/nki/api"))
sys.path.append(os.path.abspath("./frameworks/torch/torch-neuron/"))
#sys.path.append(os.path.abspath("./_static"))

# get environment variables
@@ -79,6 +81,7 @@
'sphinx.ext.viewcode',
'sphinx.ext.napoleon',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'local_documenter',
'archive',
"sphinx_copybutton",
@@ -90,7 +93,7 @@
}

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
templates_path = ['_templates', 'general/nki/_templates/']

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
@@ -109,6 +112,14 @@

napoleon_google_docstring = True

# Turn on figure/table numbering
numfig = True

# -- autodoc/autosummary options -------------------------------------------------

autosummary_generate = True # Turn on sphinx.ext.autosummary


# -- more options -------------------------------------------------


@@ -135,18 +146,18 @@

intersphinx_mapping = {
'python': ('https://docs.python.org/3', None),
'numpy': ('https://numpy.org/doc/stable/', None),
'torch': ('https://pytorch.org/docs/master/', None),
'transformers': ('https://huggingface.co/docs/transformers/master/en/', None),
}


# -- Options for Theme -------------------------------------------------

#top_banner_message="<a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/neuron2.x/dlami-pytorch-introduce.html'> Deep Learning AMI Neuron PyTorch is now available! </a> <br> <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/neuron2.x/sm-training-trn1-introduce.html'> Amazon Sagemaker now supports training jobs on Trn1! </a>"

#top_banner_message="<span>&#9888;</span><a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/setup-troubleshooting.html#gpg-key-update'> Neuron repository GPG key for Ubuntu installation has expired, see instructions how to update! </a>"

top_banner_message="Neuron 2.19.1 is released! check <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#latest-neuron-release'> What's New </a> and <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/index.html'> Announcements </a>"
top_banner_message="Neuron 2.20.0 is released! check <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#latest-neuron-release'> What's New </a> and <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/index.html'> Announcements </a>"

html_theme = "sphinx_book_theme"
html_theme_options = {
@@ -159,8 +170,13 @@
"home_page_in_toc": False,
"repository_branch" : branch_name,
"announcement": top_banner_message,
# "max_navbar_depth": 2, # needs sphinx_book_theme >= v1.1.0
}

html_context = {
# ...
"default_mode": "light"
}

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
@@ -239,4 +255,4 @@
,r'https://github.com/aws-neuron/aws-neuron-sdk/blob/master/src/examples/pytorch/torch-neuronx/t5-inference-tutorial.ipynb',r'https://github.com/aws-neuron/aws-neuron-parallelcluster-samples/blob/master/examples/jobs/neuronx-nemo-megatron-llamav2-job.md',r'https://github.com/pytorch/PiPPy/blob/main/pippy/IR.py#L697', r'https://github.com/pytorch/pytorch/blob/main/torch/fx/_symbolic_trace.py#L241', r'https://github.com/pytorch/xla/blob/master/torch_xla/utils/checkpoint.py#L129', r'https://github.com/aws-neuron/neuronx-distributed/blob/main/src/neuronx_distributed/parallel_layers/layer_norm.py#L32', r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain.py#L273C1-L289C55'
,r'https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html#pytorch-neuronx-install',r'https://github.com/google-research/bert#user-content-pre-trained-models',r'https://github.com/google-research/bert#user-content-sentence-and-sentence-pair-classification-tasks', r'https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-retirement.html', r'https://repost.aws/knowledge-center/eventbridge-notification-scheduled-events', r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/modeling_gpt_neox_nxd.py',r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain.py',r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/llama-3-8b-32k-sampling.ipynb']
linkcheck_exclude_documents = [r'src/examples/.*', 'general/announcements/neuron1.x/announcements', r'release-notes/.*',r'containers/.*',r'general/.*']
nitpicky = True
nitpicky = False
1 change: 1 addition & 0 deletions containers/faq.rst
@@ -70,6 +70,7 @@ In the kubernetes environment the EFA device plugin is used to detect and advert
EFA interfaces.

::

kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/manifest/efa-k8s-device-plugin.yml

Application can use the resource type vpc.amazonaws.com/efa in a pod request spec
10 changes: 6 additions & 4 deletions containers/getting-started.rst
@@ -1,4 +1,6 @@
Containers - Getting Started
============================

.. include:: /containers/getting-started.txt
.. _containers-getting-started:

Getting started with Neuron DLC using Docker
============================================

.. include:: /containers/getting-started.txt
80 changes: 23 additions & 57 deletions containers/index.rst
@@ -1,70 +1,36 @@
.. _neuron_containers:

Deploy Containers with Neuron
=============================
Neuron Containers
=================

.. toctree::
:maxdepth: 1
:hidden:

Locate Neuron DLC Image </containers/locate-neuron-dlc-image>
Getting Started </containers/getting-started>
Kubernetes Getting Started </containers/kubernetes-getting-started>
Tutorials </containers/tutorials>
Developer Flows </containers/developerflows>
FAQ, Troubleshooting and Release Note </containers/faq-troubleshooting-releasenote>
/containers/getting-started
/containers/locate-neuron-dlc-image
/containers/dlc-then-customize-devflow
/containers/neuron-plugins
/containers/faq


In this section, you'll find resources to help you use containers for accelerating your deep learning models on Inferentia and Trainium instances.

In this section you will find resources to help you use containers for your accelerated deep learning model acceleration on top of Inferentia and Trainium enabled instances.
Getting started with Neuron DLC using Docker
--------------------------------------------
AWS Neuron Deep Learning Containers (DLCs) are a set of Docker images for training and serving models on AWS Trainium and Inferentia instances using AWS Neuron SDK. To build a Neuron container using Docker, please refer to :ref:`containers-getting-started`.

The section is organized based on the target deployment environment
and use case. In most cases, it is recommended to use a preconfigured
`Deep Learning Container (DLC) <https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html>`_ from AWS.
Each DLC is pre-configured to have all of the Neuron components installed and is specific to the chosen ML Framework.
Neuron Deep Learning Containers
-------------------------------
In most cases, it is recommended to use a preconfigured `Deep Learning Container (DLC) <https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html>`_ from AWS. Each DLC is pre-configured to have all of the Neuron components installed and is specific to the chosen ML Framework. For more details on Neuron Deep Learning Containers, please refer to :ref:`locate-neuron-dlc-image`.

.. dropdown:: Locate Neuron DLC image
:class-title: sphinx-design-class-title-med
:class-body: sphinx-design-class-body-small
:animate: fade-in
Customize Neuron DLC
---------------------
Neuron DLC can be customized as needed. To learn more about how to customize the Neuron Deep Learning Container (DLC) to fit your specific project needs, please refer to :ref:`containers-dlc-then-customize-devflow`.

.. include:: /containers/locate-neuron-dlc-image.txt
Neuron Plugins for Containerized Environments
---------------------------------------------
Neuron provides plugins for better observability and fault tolerance. For more information on the plugins, please refer to :ref:`neuron-container-plugins`.

.. dropdown:: Getting Started
:class-title: sphinx-design-class-title-med
:class-body: sphinx-design-class-body-small
:animate: fade-in

.. include:: /containers/getting-started.txt

.. dropdown:: Kubernetes Getting Started
:class-title: sphinx-design-class-title-med
:class-body: sphinx-design-class-body-small
:animate: fade-in

.. include:: /containers/kubernetes-getting-started.txt


.. dropdown:: Tutorials
:class-title: sphinx-design-class-title-med
:class-body: sphinx-design-class-body-small
:animate: fade-in

.. include:: /containers/tutorials.txt


.. dropdown:: Developer Flows
:class-title: sphinx-design-class-title-med
:class-body: sphinx-design-class-body-small
:animate: fade-in

.. include:: /containers/developerflows.txt


.. dropdown:: FAQ, Troubleshooting and Release Note
:class-title: sphinx-design-class-title-med
:class-body: sphinx-design-class-body-small
:animate: fade-in
:open:

.. include:: /containers/faq-troubleshooting-releasenote.txt
Neuron Containers FAQ
----------------------
For frequently asked questions and troubleshooting, please refer to :ref:`container-faq`.
16 changes: 8 additions & 8 deletions containers/kubernetes-getting-started.rst
@@ -1,10 +1,10 @@
Containers - Kubernetes - Getting Started
=========================================
.. _kubernetes-getting-started:

The Neuron device plugin is a DaemonSet run on all Inferentia and Trainium nodes that enables the containers in your Kubernetes cluster to request and use Neuron cores or devices.
The Neuron scheduler extension is required for containers in your Kubernetes cluster that request multiple Neuron resources.
It helps find optimal sets of Neuron resources to minimize inter-resource communication costs.
Below are directions for installing and using the Neuron device plugin and scheduler extension.
Using Neuron with Amazon EKS
============================


.. include:: /containers/kubernetes-getting-started.txt
.. contents:: Table of Contents
:local:
:depth: 2

.. include:: /containers/kubernetes-getting-started.txt
64 changes: 34 additions & 30 deletions containers/kubernetes-getting-started.txt
@@ -1,41 +1,45 @@
.. dropdown:: Prerequisite
:class-title: sphinx-design-class-title-small
:class-body: sphinx-design-class-body-small
:animate: fade-in
.. _tutorial-k8s-env-setup-for-neuron:

.. include:: /containers/tutorials/k8s-prerequisite.rst
EKS Setup For Neuron
--------------------

.. dropdown:: Prerequisite for Neuron Problem Detector Plugin
:class-title: sphinx-design-class-title-small
:class-body: sphinx-design-class-body-small
:animate: fade-in
Customers that use Kubernetes can conveniently integrate Inf1/Trn1 instances into their workflows. This section walks through the steps for setting up an EKS cluster for Neuron.

.. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery-irsa.rst
Prerequisites
-------------

.. dropdown:: Deploy Neuron Device Plugin
:class-title: sphinx-design-class-title-small
:class-body: sphinx-design-class-body-small
:animate: fade-in
.. include:: /containers/tutorials/k8s-prerequisite.rst

.. include:: /containers/tutorials/k8s-neuron-device-plugin.rst
Neuron Device Plugin
--------------------

.. dropdown:: Deploy Neuron Scheduler Extension
:class-title: sphinx-design-class-title-small
:class-body: sphinx-design-class-body-small
:animate: fade-in
.. include:: /containers/tutorials/k8s-neuron-device-plugin.rst

.. include:: /containers/tutorials/k8s-neuron-scheduler.rst
Neuron Scheduler Extension
--------------------------

.. dropdown:: Deploy Neuron Problem Detector And Recovery
:class-title: sphinx-design-class-title-small
:class-body: sphinx-design-class-body-small
:animate: fade-in
.. include:: /containers/tutorials/k8s-neuron-scheduler.rst

.. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery.rst
Neuron Node Problem Detector Plugin
-----------------------------------
The Neuron Problem Detector Plugin facilitates error detection and recovery by continuously monitoring the health of Neuron devices across all Kubernetes nodes. It publishes CloudWatch metrics for node errors and can optionally trigger automatic recovery of affected nodes. Please follow the instructions below to enable the necessary permissions for the plugin.

.. dropdown:: Deploy Neuron Monitor Daemonset
:class-title: sphinx-design-class-title-small
:class-body: sphinx-design-class-body-small
:animate: fade-in
Permissions for Neuron Node Problem Detector Plugin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. include:: /containers/tutorials/k8s-neuron-monitor.rst
.. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery-irsa.rst

Deploy Neuron Node Problem Detector And Recovery
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery.rst

Neuron Monitor Daemonset
------------------------

.. include:: /containers/tutorials/k8s-neuron-monitor.rst

Neuron Helm Chart
-----------------

.. include:: /containers/tutorials/k8s-neuron-helm-chart.rst