How to fix "RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false" when fine-tuning on a T4 GPU #857
-
I followed the steps in "finetune_demo/lora_finetune.ipynb" exactly: # python finetune_hf.py data/AdvertiseGen_fix/ THUDM/chatglm3-6b configs/lora.yaml yes
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.42it/s]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
Detected kernel version 5.4.119, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
Num examples = 114,599
Num Epochs = 1
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 1
Gradient Accumulation steps = 1
Total optimization steps = 3,000
Number of trainable parameters = 1,949,696
{'loss': 4.4406, 'grad_norm': 4.049661636352539, 'learning_rate': 4.9833333333333336e-05, 'epoch': 0.0}
{'loss': 4.9137, 'grad_norm': 3.5368947982788086, 'learning_rate': 4.966666666666667e-05, 'epoch': 0.0}
{'loss': 4.6822, 'grad_norm': 4.433278560638428, 'learning_rate': 4.9500000000000004e-05, 'epoch': 0.0}
1%|█▊ | 37/3000 [00:10<10:44, 4.60it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/ChatGLM3/finetune_demo/finetune_hf.py:550 in main │
│ │
│ 547 │ │ │ │ trainer.train(resume_from_checkpoint=checkpointdir) │
│ 548 │ │ │ else: │
│ 549 │ │ │ │ # If not, start from scratch │
│ ❱ 550 │ │ │ │ trainer.train() │
│ 551 │ │ else: │
│ 552 │ │ │ # If it is a numerical value, select the corresponding checkpoint │
│ 553 │ │ │ if auto_resume_from_checkpoint.isdigit(): │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/transformers/trainer.py:1624 in │
│ train │
│ │
│ 1621 │ │ │ finally: │
│ 1622 │ │ │ │ hf_hub_utils.enable_progress_bars() │
│ 1623 │ │ else: │
│ ❱ 1624 │ │ │ return inner_training_loop( │
│ 1625 │ │ │ │ args=args, │
│ 1626 │ │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1627 │ │ │ │ trial=trial, │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/transformers/trainer.py:1961 in │
│ _inner_training_loop │
│ │
│ 1958 │ │ │ │ │ self.control = self.callback_handler.on_step_begin(args, self.state, │
│ 1959 │ │ │ │ │
│ 1960 │ │ │ │ with self.accelerator.accumulate(model): │
│ ❱ 1961 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1962 │ │ │ │ │
│ 1963 │ │ │ │ if ( │
│ 1964 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/transformers/trainer.py:2911 in │
│ training_step │
│ │
│ 2908 │ │ │ with amp.scale_loss(loss, self.optimizer) as scaled_loss: │
│ 2909 │ │ │ │ scaled_loss.backward() │
│ 2910 │ │ else: │
│ ❱ 2911 │ │ │ self.accelerator.backward(loss) │
│ 2912 │ │ │
│ 2913 │ │ return loss.detach() / self.args.gradient_accumulation_steps │
│ 2914 │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/accelerate/accelerator.py:1966 in │
│ backward │
│ │
│ 1963 │ │ elif self.scaler is not None: │
│ 1964 │ │ │ self.scaler.scale(loss).backward(**kwargs) │
│ 1965 │ │ else: │
│ ❱ 1966 │ │ │ loss.backward(**kwargs) │
│ 1967 │ │
│ 1968 │ def set_trigger(self): │
│ 1969 │ │ """ │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/torch/_tensor.py:492 in backward │
│ │
│ 489 │ │ │ │ create_graph=create_graph, │
│ 490 │ │ │ │ inputs=inputs, │
│ 491 │ │ │ ) │
│ ❱ 492 │ │ torch.autograd.backward( │
│ 493 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 494 │ │ ) │
│ 495 │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/torch/autograd/__init__.py:251 in │
│ backward │
│ │
│ 248 │ # The reason we repeat the same comment below is that │
│ 249 │ # some Python versions print out the first line of a multi-line function │
│ 250 │ # calls in the traceback and some print out the last line │
│ ❱ 251 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 252 │ │ tensors, │
│ 253 │ │ grad_tensors_, │
│ 254 │ │ retain_graph, │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
1%|█▊ | 37/3000 [00:11<15:00, 3.29it/s]
From what I could find, this looks like a GPU architecture mismatch: the T4 is sm75, while an RTX 3060 is sm86. Other than swapping in a different GPU, is there any other way to solve this? There is very little information about it online, and my attempts to fix it have gone nowhere. I'm at my wits' end.
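One workaround that is sometimes suggested for this class of error (an assumption on my part, not something confirmed in this thread): the sm80/sm90 check comes from PyTorch's flash / memory-efficient scaled-dot-product-attention kernels, so disabling those backends and falling back to the plain math implementation may let the backward pass run on a T4 (sm75), at the cost of speed and memory. A minimal sketch:

```python
# Hypothetical workaround sketch: force PyTorch's math SDPA backend so the
# attention backward does not require an sm80/sm90 GPU. Slower and more
# memory-hungry than the fused kernels, but it avoids the hard check.
import torch

torch.backends.cuda.enable_flash_sdp(False)          # flash kernels need sm80/sm90
torch.backends.cuda.enable_mem_efficient_sdp(False)  # likewise restricted on older GPUs
torch.backends.cuda.enable_math_sdp(True)            # keep the generic fallback enabled
```

If this route is taken, the calls would need to go near the top of finetune_hf.py, before training starts; the launch command itself stays the same.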
Answered by zRzRzRzRzRzRzR, Feb 24, 2024
-
The card is simply too old; you'll have to move to a newer GPU...
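For reference, the device's compute capability can be checked directly from PyTorch (a small illustrative snippet, not part of the original answer):

```python
import torch

# The fused attention kernels behind this error require compute capability
# (8, 0) or newer; a T4 reports (7, 5), i.e. sm75.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm{major}{minor}")
```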