Fine-tuning stays stuck at 0/3000 [00:00<?, ?it/s] #1093

happye · 2024-04-06T17:43:39Z

I ran the following command:
!CUDA_VISIBLE_DEVICES=1 /home/crux/miniconda3/envs/transformers/bin/python finetune_hf.py data/AdvertiseGen_fix /home/crux/AI/LLM/LLM-quickstart/ChatGLM3-6B/ChatGLM3/chatglm3-6b configs/lora.yaml
Output:
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:01<00:00, 5.89it/s]
/home/crux/miniconda3/envs/transformers/lib/python3.12/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/crux/miniconda3/envs/transformers/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model

--> model has 1.949696M params

train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
--> Sanity check
'[gMASK]': 64790 -> -100
'sop': 64792 -> -100
'<|user|>': 64795 -> -100
……
……
……
'萌': 56842 -> 56842
'。': 31155 -> 31155
'': 2 -> 2
/home/crux/miniconda3/envs/transformers/lib/python3.12/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
Num examples = 114,599
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 3,000
Number of trainable parameters = 1,949,696
0%| | 0/3000 [00:00<?, ?it/s]
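
As a sanity check on the figures in the log above (not a diagnosis of the hang), here is a minimal sketch that recomputes them from the reported values. The variable names are illustrative, and the way `Num Epochs` is rounded up when `max_steps` is set is an assumption about the trainer's behavior, not something the log states.

```python
import math

# Values copied from the training log above.
trainable_params = 1_949_696
all_params = 6_245_533_696
num_examples = 114_599
per_device_batch_size = 4
grad_accum_steps = 1
max_steps = 3_000
num_devices = 1  # assumption: CUDA_VISIBLE_DEVICES=1 exposes a single GPU

# Share of parameters that LoRA leaves trainable, in percent.
trainable_pct = 100 * trainable_params / all_params

# Effective batch size per optimizer step.
total_batch_size = per_device_batch_size * grad_accum_steps * num_devices

# Optimizer steps needed to pass over the training set once.
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)

# How much of the dataset max_steps actually covers.
examples_seen = max_steps * total_batch_size
epoch_fraction = examples_seen / num_examples

# Assumed rounding: a partial epoch still counts as one epoch.
num_epochs = math.ceil(max_steps / updates_per_epoch)

print(f"trainable%: {trainable_pct:.6f}")
print(f"total train batch size: {total_batch_size}")
print(f"updates per epoch: {updates_per_epoch}")
print(f"fraction of one epoch covered: {epoch_fraction:.3f}")
print(f"num epochs: {num_epochs}")
```

The recomputed values agree with the log: the configuration is internally consistent, and the 3,000 steps cover only about a tenth of the 114,599 training examples.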