tt-smi -g hangs when remote chips inaccessible #25

Open
hmohiuddinTT opened this issue May 15, 2024 · 3 comments
Labels
enhancement, internal_report

Comments

@hmohiuddinTT

Summary

This is a chicken-and-egg problem:

  1. User sets up TG and tries to reset the system to get it into a good state.
  2. In order to reset they need a reset_config.json.
  3. In order to generate a reset_config.json they run tt-smi -g.
  4. Since boards are not able to talk to remote chips yet, this will hang.

TLDR: Cannot generate reset_config needed for resetting, without first resetting.

@hmohiuddinTT
Author

OK, never mind. It looks like I just need to wait for the training to time out:

ansible@g14cs03:~$ tt-smi -g
 Detected Chips: 4
 Generated sample reset config file for this host: /home/ansible/.config/tenstorrent/reset_config.json 
 Update the generated file and use it as an input for the -r/--reset option. 
ansible@g14cs03:~$ cat ~/.config/tenstorrent/reset_config.json 
{
    "time": "2024-05-15T23:17:08.833710",
    "host_name": "g14cs03",
    "gs_tensix_reset": {
        "pci_index": []
    },
    "wh_link_reset": {
        "pci_index": [
            0,
            1,
            2,
            3
        ]
    },
    "re_init_devices": true,
    "wh_mobo_reset": [
        {
            "nb_host_pci_idx": [
                0,
                1,
                2,
                3
            ],
            "mobo": "<MOBO NAME>",
            "credo": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ],
            "disabled_ports": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ]
        },
        {
            "nb_host_pci_idx": [
                0,
                1,
                2,
                3
            ],
            "mobo": "<MOBO NAME>",
            "credo": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ],
            "disabled_ports": [
                "<group id>:<credo id>",
                "<group id>:<credo id>"
            ]
        }
    ]
}
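
The generated file still contains template values such as `<MOBO NAME>` and `<group id>:<credo id>` that have to be filled in before it is passed to `-r/--reset`. A minimal sketch of a pre-flight check, assuming any string containing both `<` and `>` is an unfilled placeholder; `find_placeholders` is a hypothetical helper, not part of tt-smi:

```python
import json  # used when loading the real file, see the note below


def find_placeholders(node, path="$"):
    """Return JSON paths of string values that still look like <PLACEHOLDER> text.

    Assumption: a string containing both '<' and '>' is an unfilled template value.
    """
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits.extend(find_placeholders(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_placeholders(value, f"{path}[{index}]"))
    elif isinstance(node, str) and "<" in node and ">" in node:
        hits.append(path)
    return hits


# Trimmed-down copy of the generated config shown above.
sample = {
    "host_name": "g14cs03",
    "wh_link_reset": {"pci_index": [0, 1, 2, 3]},
    "wh_mobo_reset": [
        {"mobo": "<MOBO NAME>", "credo": ["<group id>:<credo id>"]},
    ],
}

for location in find_placeholders(sample):
    print("unfilled:", location)
```

To check the real file, load it with `json.load(open(path))` and pass the result to `find_placeholders`; an empty list means no template values remain.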

@hmohiuddinTT
Author

Might still be useful to have a countdown or timeout in tt-smi while the training is ongoing.
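
Until something like that exists in tt-smi itself, the same effect can be approximated from outside. A rough sketch, not tt-smi's implementation: `run_with_countdown` is a hypothetical wrapper, and the timeout values are assumptions, since the actual training timeout is not stated here.

```python
import subprocess
import sys
import time


def run_with_countdown(cmd, timeout_s=120.0, poll_s=1.0):
    """Run cmd and report the time left until it exits or timeout_s elapses.

    Returns the command's exit code, or None if it was killed at the deadline.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Deadline reached: stop the child and reap it.
            proc.kill()
            proc.wait()
            return None
        try:
            # Wait one poll interval at most, never past the deadline.
            return proc.wait(timeout=min(poll_s, remaining))
        except subprocess.TimeoutExpired:
            left = max(deadline - time.monotonic(), 0.0)
            print(f"still waiting, {left:.0f}s left", file=sys.stderr)
```

It could be called as `run_with_countdown(["tt-smi", "-g"], timeout_s=300)`, where 300 is an arbitrary guess. The deadline should be longer than the training timeout, otherwise the command is killed before it can write the config.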

hmohiuddinTT reopened this May 15, 2024
@sbansalTT
Contributor

Yeah, thanks for raising this. I don't think generating this config file should involve any eth training detection; I'll take a look and see if I can separate the two.
I'll also add some kind of indicator telling users to wait.

warthog9 added the internal_report and enhancement labels Jun 21, 2024