How to fix "RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false" when fine-tuning on a T4 GPU #857
-
I followed the steps in "finetune_demo/lora_finetune.ipynb" exactly: # python finetune_hf.py data/AdvertiseGen_fix/ THUDM/chatglm3-6b configs/lora.yaml yes
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.42it/s]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
Detected kernel version 5.4.119, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
Num examples = 114,599
Num Epochs = 1
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 1
Gradient Accumulation steps = 1
Total optimization steps = 3,000
Number of trainable parameters = 1,949,696
{'loss': 4.4406, 'grad_norm': 4.049661636352539, 'learning_rate': 4.9833333333333336e-05, 'epoch': 0.0}
{'loss': 4.9137, 'grad_norm': 3.5368947982788086, 'learning_rate': 4.966666666666667e-05, 'epoch': 0.0}
{'loss': 4.6822, 'grad_norm': 4.433278560638428, 'learning_rate': 4.9500000000000004e-05, 'epoch': 0.0}
1%|█▊ | 37/3000 [00:10<10:44, 4.60it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/ChatGLM3/finetune_demo/finetune_hf.py:550 in main │
│ │
│ 547 │ │ │ │ trainer.train(resume_from_checkpoint=checkpointdir) │
│ 548 │ │ │ else: │
│ 549 │ │ │ │ # If not, start from scratch │
│ ❱ 550 │ │ │ │ trainer.train() │
│ 551 │ │ else: │
│ 552 │ │ │ # If it is a numerical value, select the corresponding checkpoint │
│ 553 │ │ │ if auto_resume_from_checkpoint.isdigit(): │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/transformers/trainer.py:1624 in │
│ train │
│ │
│ 1621 │ │ │ finally: │
│ 1622 │ │ │ │ hf_hub_utils.enable_progress_bars() │
│ 1623 │ │ else: │
│ ❱ 1624 │ │ │ return inner_training_loop( │
│ 1625 │ │ │ │ args=args, │
│ 1626 │ │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1627 │ │ │ │ trial=trial, │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/transformers/trainer.py:1961 in │
│ _inner_training_loop │
│ │
│ 1958 │ │ │ │ │ self.control = self.callback_handler.on_step_begin(args, self.state, │
│ 1959 │ │ │ │ │
│ 1960 │ │ │ │ with self.accelerator.accumulate(model): │
│ ❱ 1961 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1962 │ │ │ │ │
│ 1963 │ │ │ │ if ( │
│ 1964 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/transformers/trainer.py:2911 in │
│ training_step │
│ │
│ 2908 │ │ │ with amp.scale_loss(loss, self.optimizer) as scaled_loss: │
│ 2909 │ │ │ │ scaled_loss.backward() │
│ 2910 │ │ else: │
│ ❱ 2911 │ │ │ self.accelerator.backward(loss) │
│ 2912 │ │ │
│ 2913 │ │ return loss.detach() / self.args.gradient_accumulation_steps │
│ 2914 │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/accelerate/accelerator.py:1966 in │
│ backward │
│ │
│ 1963 │ │ elif self.scaler is not None: │
│ 1964 │ │ │ self.scaler.scale(loss).backward(**kwargs) │
│ 1965 │ │ else: │
│ ❱ 1966 │ │ │ loss.backward(**kwargs) │
│ 1967 │ │
│ 1968 │ def set_trigger(self): │
│ 1969 │ │ """ │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/torch/_tensor.py:492 in backward │
│ │
│ 489 │ │ │ │ create_graph=create_graph, │
│ 490 │ │ │ │ inputs=inputs, │
│ 491 │ │ │ ) │
│ ❱ 492 │ │ torch.autograd.backward( │
│ 493 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 494 │ │ ) │
│ 495 │
│ │
│ /data/miniforge3/envs/chatglm3/lib/python3.10/site-packages/torch/autograd/__init__.py:251 in │
│ backward │
│ │
│ 248 │ # The reason we repeat the same comment below is that │
│ 249 │ # some Python versions print out the first line of a multi-line function │
│ 250 │ # calls in the traceback and some print out the last line │
│ ❱ 251 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 252 │ │ tensors, │
│ 253 │ │ grad_tensors_, │
│ 254 │ │ retain_graph, │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
1%|█▊ | 37/3000 [00:11<15:00, 3.29it/s]
From what I could find, this looks like a GPU architecture mismatch: the T4 is sm75, while an RTX 3060 is sm86. Other than swapping in a different GPU, is there any other way to solve this? There is very little information about it online, and my attempts to fix it have gone nowhere. I'm at my wits' end.
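One workaround that is sometimes suggested for this class of error (an assumption on my part, not something confirmed in this thread): the sm80/sm90 check comes from PyTorch's flash / memory-efficient scaled-dot-product-attention kernels, so disabling those backends and falling back to the plain math implementation may let the backward pass run on a T4 (sm75), at the cost of speed and memory. A minimal sketch:

```python
# Hypothetical workaround sketch: force PyTorch's math SDPA backend so the
# attention backward does not require an sm80/sm90 GPU. Slower and more
# memory-hungry than the fused kernels, but it avoids the hard check.
import torch

torch.backends.cuda.enable_flash_sdp(False)          # flash kernels need sm80/sm90
torch.backends.cuda.enable_mem_efficient_sdp(False)  # likewise restricted on older GPUs
torch.backends.cuda.enable_math_sdp(True)            # keep the generic fallback enabled
```

If this route is taken, the calls would need to go near the top of finetune_hf.py, before training starts; the launch command itself stays the same.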
Answered by zRzRzRzRzRzRzR, Feb 24, 2024
-
The card is simply too old; you'll have to move to a newer GPU...
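For reference, the device's compute capability can be checked directly from PyTorch (a small illustrative snippet, not part of the original answer):

```python
import torch

# The fused attention kernels behind this error require compute capability
# (8, 0) or newer; a T4 reports (7, 5), i.e. sm75.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm{major}{minor}")
```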