Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
--------- Co-authored-by: Finn Thompson <[email protected]> Co-authored-by: mounchin <[email protected]> Co-authored-by: Gandiinaa Gumenjav <[email protected]> Co-authored-by: Wojciech Romaszkan <[email protected]> Co-authored-by: Gandii Gumenjav <[email protected]> Co-authored-by: Esha Lakhotia <[email protected]> Co-authored-by: Nicholas Waldron <[email protected]> Co-authored-by: mario-aws <[email protected]> Co-authored-by: aws-ivanrco <[email protected]> Co-authored-by: Joshua Hannan <[email protected]> Co-authored-by: Alvin Yin <[email protected]> Co-authored-by: musunita <[email protected]> Co-authored-by: Arjun Raman <[email protected]> Co-authored-by: Vikas Paliwal <[email protected]> Co-authored-by: Akhil Raj Azhikodan <[email protected]> Co-authored-by: Rahul Solanki <[email protected]> Co-authored-by: jeffhataws <[email protected]> Co-authored-by: Karthick Gopalswamy <[email protected]> Co-authored-by: Shubham Chandak <[email protected]> Co-authored-by: aws-rishyraj <[email protected]> Co-authored-by: geetasg <[email protected]> Co-authored-by: Shruthi (AWS) <[email protected]> Co-authored-by: awshaichen <[email protected]> Co-authored-by: aws-patlange <[email protected]> Co-authored-by: Bowen Chen <[email protected]> Co-authored-by: Maen Suleiman <[email protected]> Co-authored-by: Lily Liu <[email protected]> Co-authored-by: aws-caijune <[email protected]> Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: aws-yishanm <[email protected]> Co-authored-by: gsnaws <[email protected]> Co-authored-by: anistala (AWS) <[email protected]> Co-authored-by: Karan Dhiman <[email protected]> Co-authored-by: Rahul Solanki <[email protected]> Co-authored-by: Huang, Guangtai <[email protected]> Co-authored-by: Finn Thompson <[email protected]> Co-authored-by: Ubuntu <[email protected]> Co-authored-by: Maen Suleiman <[email protected]> Co-authored-by: micwade (AWS) <[email protected]> Co-authored-by: Nikhil Yogendra Murali <[email protected]> 
Co-authored-by: aws-auderian <[email protected]> Co-authored-by: aws-bhegedus <[email protected]> Co-authored-by: Mustafa Quraish <[email protected]> Co-authored-by: Zhitao Lin <[email protected]> Co-authored-by: sitgupta-aws <[email protected]> Co-authored-by: Roopnath <[email protected]> Co-authored-by: Zhuang Wang <[email protected]> Co-authored-by: Nathan Mailhot <[email protected]> Co-authored-by: Nathan Mailhot <[email protected]> Co-authored-by: Karan Dhiman <[email protected]>
1 parent bc3bb91 commit ce3668b
Showing 451 changed files with 24,177 additions and 2,500 deletions.
@@ -0,0 +1,22 @@
name: Close inactive issues
on:
  schedule:
    - cron: "30 1 * * *"

jobs:
  close-issues:
    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write
    steps:
      - uses: actions/stale@v5
        with:
          days-before-issue-stale: 30
          days-before-issue-close: 14
          stale-issue-label: "stale"
          stale-issue-message: "This issue is stale because it has been open for 30 days with no activity."
          close-issue-message: "This issue was closed because it has been inactive for 14 days since being marked as stale."
          days-before-pr-stale: -1
          days-before-pr-close: -1
          repo-token: ${{ secrets.GITHUB_TOKEN }}
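Reviewer note: the workflow above combines a daily schedule (`30 1 * * *`, which GitHub Actions evaluates in UTC) with a 30-day stale threshold and a 14-day close threshold. The sketch below estimates when an untouched issue would be labeled and then closed under those settings. It is a simplified model, not the logic of `actions/stale` itself; the function names are hypothetical, and the real action may differ in details such as how activity resets the clock.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the workflow inputs above.
DAYS_BEFORE_ISSUE_STALE = 30
DAYS_BEFORE_ISSUE_CLOSE = 14


def next_scheduled_run(after: datetime) -> datetime:
    """Return the first daily 01:30 UTC run at or after `after` (cron "30 1 * * *")."""
    run = after.replace(hour=1, minute=30, second=0, microsecond=0)
    return run if run >= after else run + timedelta(days=1)


def issue_timeline(last_activity: datetime) -> dict:
    """Approximate when an untouched issue is marked stale and then closed.

    Assumption: nothing happens to the issue after `last_activity`, and each
    threshold is acted on by the first scheduled run after it has elapsed.
    """
    stale_due = last_activity + timedelta(days=DAYS_BEFORE_ISSUE_STALE)
    marked_stale = next_scheduled_run(stale_due)
    close_due = marked_stale + timedelta(days=DAYS_BEFORE_ISSUE_CLOSE)
    closed = next_scheduled_run(close_due)
    return {"marked_stale": marked_stale, "closed": closed}


if __name__ == "__main__":
    timeline = issue_timeline(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
    for event, when in timeline.items():
        print(event, when.isoformat())
```

Because `days-before-pr-stale` and `days-before-pr-close` are both `-1`, pull requests are excluded, so the sketch models issues only.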
@@ -10,3 +10,4 @@ src/examples/pytorch/libtorch_demo.tar.gz
*-checkpoint.ipynb
.idea/
.vscode/
*/nki/*/generated/
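Reviewer note: the new ignore rule `*/nki/*/generated/` contains interior slashes, so Git anchors it to the directory holding the `.gitignore`, and each `*` matches within a single path segment only. The sketch below approximates that behavior for this one pattern. It assumes the `.gitignore` sits at the repository root and handles only `*` wildcards; it is not a general gitignore implementation, and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase

PATTERN = "*/nki/*/generated/"


def is_ignored(path: str, pattern: str = PATTERN) -> bool:
    """Return True if the file at `path` lies inside a directory matched by `pattern`.

    `path` is relative to the directory containing the .gitignore and uses
    '/' separators. Each pattern segment is compared with exactly one path
    segment, so '*' never spans a '/'.
    """
    want = pattern.strip("/").split("/")
    have = path.strip("/").split("/")
    # The trailing '/' restricts the pattern to directories, so a matching
    # file needs at least one more segment than the pattern has.
    if len(have) <= len(want):
        return False
    return all(fnmatchcase(h, w) for h, w in zip(have, want))


if __name__ == "__main__":
    for candidate in (
        "general/nki/api/generated/example.rst",       # hypothetical path
        "nki/api/generated/example.rst",               # hypothetical path
        "docs/general/nki/api/generated/example.rst",  # hypothetical path
    ):
        print(candidate, is_ignored(candidate))
```

Under these assumptions the rule applies only to `generated` directories exactly two levels below an `nki` directory that is itself one level below the root.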
@@ -1,4 +1,6 @@
Containers - Getting Started
============================

.. include:: /containers/getting-started.txt
.. _containers-getting-started:

Getting started with Neuron DLC using Docker
============================================

.. include:: /containers/getting-started.txt
@@ -1,70 +1,36 @@
.. _neuron_containers:

Deploy Containers with Neuron
=============================
Neuron Containers
=================

.. toctree::
    :maxdepth: 1
    :hidden:

    Locate Neuron DLC Image </containers/locate-neuron-dlc-image>
    Getting Started </containers/getting-started>
    Kubernetes Getting Started </containers/kubernetes-getting-started>
    Tutorials </containers/tutorials>
    Developer Flows </containers/developerflows>
    FAQ, Troubleshooting and Release Note </containers/faq-troubleshooting-releasenote>
    /containers/getting-started
    /containers/locate-neuron-dlc-image
    /containers/dlc-then-customize-devflow
    /containers/neuron-plugins
    /containers/faq


In this section, you'll find resources to help you use containers for accelerating your deep learning models on Inferentia and Trainium instances.

In this section you will find resources to help you use containers for your accelerated deep learning model acceleration on top of Inferentia and Trainium enabled instances.
Getting started with Neuron DLC using Docker
--------------------------------------------
AWS Neuron Deep Learning Containers (DLCs) are a set of Docker images for training and serving models on AWS Trainium and Inferentia instances using AWS Neuron SDK. To build a Neuron container using Docker, please refer to :ref:`containers-getting-started`.

The section is organized based on the target deployment environment
and use case. In most cases, it is recommended to use a preconfigured
`Deep Learning Container (DLC) <https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html>`_ from AWS.
Each DLC is pre-configured to have all of the Neuron components installed and is specific to the chosen ML Framework.
Neuron Deep Learning Containers
-------------------------------
In most cases, it is recommended to use a preconfigured `Deep Learning Container (DLC) <https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html>`_ from AWS. Each DLC is pre-configured to have all of the Neuron components installed and is specific to the chosen ML Framework. For more details on Neuron Deep Learning Containers, please refer to :ref:`locate-neuron-dlc-image`.

.. dropdown:: Locate Neuron DLC image
    :class-title: sphinx-design-class-title-med
    :class-body: sphinx-design-class-body-small
    :animate: fade-in
Customize Neuron DLC
---------------------
Neuron DLC can be customized as needed. To learn more about how to customize the Neuron Deep Learning Container (DLC) to fit your specific project needs, please refer to :ref:`containers-dlc-then-customize-devflow`.

    .. include:: /containers/locate-neuron-dlc-image.txt
Neuron Plugins for Containerized Environments
---------------------------------------------
Neuron provides plugins for better observability and fault tolerance. For more information on the plugins, please refer to :ref:`neuron-container-plugins`.

.. dropdown:: Getting Started
    :class-title: sphinx-design-class-title-med
    :class-body: sphinx-design-class-body-small
    :animate: fade-in

    .. include:: /containers/getting-started.txt

.. dropdown:: Kubernetes Getting Started
    :class-title: sphinx-design-class-title-med
    :class-body: sphinx-design-class-body-small
    :animate: fade-in

    .. include:: /containers/kubernetes-getting-started.txt


.. dropdown:: Tutorials
    :class-title: sphinx-design-class-title-med
    :class-body: sphinx-design-class-body-small
    :animate: fade-in

    .. include:: /containers/tutorials.txt


.. dropdown:: Developer Flows
    :class-title: sphinx-design-class-title-med
    :class-body: sphinx-design-class-body-small
    :animate: fade-in

    .. include:: /containers/developerflows.txt


.. dropdown:: FAQ, Troubleshooting and Release Note
    :class-title: sphinx-design-class-title-med
    :class-body: sphinx-design-class-body-small
    :animate: fade-in
    :open:

    .. include:: /containers/faq-troubleshooting-releasenote.txt
Neuron Containers FAQ
----------------------
For frequently asked questions and troubleshooting, please refer to :ref:`container-faq`
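Reviewer note: the rewritten index replaces inline dropdowns with `:ref:` links (`containers-getting-started`, `locate-neuron-dlc-image`, `containers-dlc-then-customize-devflow`, `neuron-container-plugins`, `container-faq`), so each target needs a matching `.. _label:` definition somewhere in the docs tree. The sketch below is one way to spot missing labels before a Sphinx build. It is a rough regex-based check, not a replacement for Sphinx's own resolution, and the function name and the file names in the usage example are hypothetical.

```python
import re

# Explicit targets such as ".. _containers-getting-started:" on their own line.
LABEL_RE = re.compile(r"^\.\. _([^:\n]+):\s*$", re.MULTILINE)
# Matches :ref:`label` as well as :ref:`Title <label>`.
REF_RE = re.compile(r":ref:`(?:[^`<]*<)?([^`>]+)>?`")


def undefined_refs(sources: dict[str, str]) -> set[str]:
    """Return :ref: targets that have no label definition in `sources`.

    `sources` maps a file name to its reStructuredText content. Labels are
    compared case-insensitively, which mirrors how Sphinx normalizes them.
    """
    labels: set[str] = set()
    refs: set[str] = set()
    for text in sources.values():
        labels.update(name.strip().lower() for name in LABEL_RE.findall(text))
        refs.update(name.strip().lower() for name in REF_RE.findall(text))
    return refs - labels


if __name__ == "__main__":
    docs = {
        # Hypothetical file names; the content mirrors lines from this commit.
        "index.rst": (
            ".. _neuron_containers:\n\n"
            "please refer to :ref:`containers-getting-started`.\n"
            "please refer to :ref:`container-faq`\n"
        ),
        "getting-started.rst": ".. _containers-getting-started:\n\nGetting started\n",
    }
    print(sorted(undefined_refs(docs)))
```

A check like this only sees the text passed to it, so labels defined in files pulled in through `.. include::` would need to be loaded as well.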
@@ -1,10 +1,10 @@
Containers - Kubernetes - Getting Started
=========================================
.. _kubernetes-getting-started:

The Neuron device plugin is a DaemonSet run on all Inferentia and Trainium nodes that enables the containers in your Kubernetes cluster to request and use Neuron cores or devices.
The Neuron scheduler extension is required for containers in your Kubernetes cluster that request multiple Neuron resources.
It helps find optimal sets of Neuron resources to minimize inter-resource communication costs.
Below are directions for installing and using the Neuron device plugin and scheduler extension.
Using Neuron with Amazon EKS
============================


.. include:: /containers/kubernetes-getting-started.txt
.. contents:: Table of Contents
    :local:
    :depth: 2

.. include:: /containers/kubernetes-getting-started.txt
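Reviewer note: the text in this hunk says the device plugin lets containers request Neuron cores or devices. In Kubernetes, a container makes such a request through an extended resource under `resources.limits`. The sketch below builds a minimal Pod manifest that does this. The resource name `aws.amazon.com/neuron` is an assumption that should be checked against the device plugin documentation, and the pod name and image are placeholders.

```python
import json


def neuron_pod_manifest(
    name: str,
    image: str,
    neuron_devices: int = 1,
    resource_name: str = "aws.amazon.com/neuron",  # assumed resource name
) -> dict:
    """Build a minimal Pod manifest that requests Neuron devices.

    Extended resources are requested through `resources.limits`; Kubernetes
    sets the request equal to the limit when only the limit is given.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {"limits": {resource_name: neuron_devices}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; kubectl accepts JSON manifests as well as YAML.
    manifest = neuron_pod_manifest("neuron-demo", "example.com/my-neuron-image:latest", 2)
    print(json.dumps(manifest, indent=2))
```

When a pod asks for more than one Neuron resource, the hunk above notes that the scheduler extension is also required so that the allocated set minimizes inter-resource communication.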
@@ -1,41 +1,45 @@
.. dropdown:: Prerequisite
    :class-title: sphinx-design-class-title-small
    :class-body: sphinx-design-class-body-small
    :animate: fade-in
.. _tutorial-k8s-env-setup-for-neuron:

    .. include:: /containers/tutorials/k8s-prerequisite.rst
EKS Setup For Neuron
--------------------

.. dropdown:: Prerequisite for Neuron Problem Detector Plugin
    :class-title: sphinx-design-class-title-small
    :class-body: sphinx-design-class-body-small
    :animate: fade-in
Customers that use Kubernetes can conveniently integrate Inf1/Trn1 instances into their workflows. This section will go through steps for setting up EKS cluster for Neuron.

    .. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery-irsa.rst
Prerequisites
-------------

.. dropdown:: Deploy Neuron Device Plugin
    :class-title: sphinx-design-class-title-small
    :class-body: sphinx-design-class-body-small
    :animate: fade-in
.. include:: /containers/tutorials/k8s-prerequisite.rst

    .. include:: /containers/tutorials/k8s-neuron-device-plugin.rst
Neuron Device Plugin
--------------------

.. dropdown:: Deploy Neuron Scheduler Extension
    :class-title: sphinx-design-class-title-small
    :class-body: sphinx-design-class-body-small
    :animate: fade-in
.. include:: /containers/tutorials/k8s-neuron-device-plugin.rst

    .. include:: /containers/tutorials/k8s-neuron-scheduler.rst
Neuron Scheduler Extension
--------------------------

.. dropdown:: Deploy Neuron Problem Detector And Recovery
    :class-title: sphinx-design-class-title-small
    :class-body: sphinx-design-class-body-small
    :animate: fade-in
.. include:: /containers/tutorials/k8s-neuron-scheduler.rst

    .. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery.rst
Neuron Node Problem Detector Plugin
-----------------------------------
The Neuron Problem Detector Plugin facilitates error detection and recovery by continuously monitoring the health of Neuron devices across all Kubernetes nodes. It publishes CloudWatch metrics for node errors and can optionally trigger automatic recovery of affected nodes. Please follow the instructions below to enable the necessary permissions for the plugin.

.. dropdown:: Deploy Neuron Monitor Daemonset
    :class-title: sphinx-design-class-title-small
    :class-body: sphinx-design-class-body-small
    :animate: fade-in
Permissions for Neuron Node Problem Detector Plugin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    .. include:: /containers/tutorials/k8s-neuron-monitor.rst
.. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery-irsa.rst

Deploy Neuron Node Problem Detector And Recovery
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery.rst

Neuron Monitor Daemonset
------------------------

.. include:: /containers/tutorials/k8s-neuron-monitor.rst

Neuron Helm Chart
-----------------

.. include:: /containers/tutorials/k8s-neuron-helm-chart.rst