ChatGLM3 training on four GPUs fails #134

Open
eanfs opened this issue Feb 4, 2024 · 1 comment

eanfs commented Feb 4, 2024

[2024-02-04 17:56:47,007] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Traceback (most recent call last):
File "/home/workspace/ChatGLM-Finetuning/train.py", line 234, in <module>
main()
File "/home/workspace/ChatGLM-Finetuning/train.py", line 178, in main
model, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, args=args, config=ds_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1186, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1263, in _configure_basic_optimizer
optimizer = FusedAdam(
^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 489, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 573, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1233, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu116/fused_adam/fused_adam.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
Loading extension module fused_adam...
Traceback (most recent call last):
File "/home/workspace/ChatGLM-Finetuning/train.py", line 234, in <module>
main()
File "/home/workspace/ChatGLM-Finetuning/train.py", line 178, in main
model, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, args=args, config=ds_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1186, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1263, in _configure_basic_optimizer
optimizer = FusedAdam(
^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 489, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 573, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1233, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu116/fused_adam/fused_adam.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
Loading extension module fused_adam...
Loading extension module fused_adam...
Traceback (most recent call last):
File "/home/workspace/ChatGLM-Finetuning/train.py", line 234, in <module>
main()
File "/home/workspace/ChatGLM-Finetuning/train.py", line 178, in main
model, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, args=args, config=ds_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1186, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1263, in _configure_basic_optimizer
optimizer = FusedAdam(
^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 489, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 573, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1233, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu116/fused_adam/fused_adam.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
Traceback (most recent call last):
File "/home/workspace/ChatGLM-Finetuning/train.py", line 234, in <module>
main()
File "/home/workspace/ChatGLM-Finetuning/train.py", line 178, in main
model, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, args=args, config=ds_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1186, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1263, in _configure_basic_optimizer
optimizer = FusedAdam(
^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 489, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 573, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1233, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu116/fused_adam/fused_adam.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
[2024-02-04 17:56:50,782] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30665
[2024-02-04 17:56:50,797] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30666
[2024-02-04 17:56:50,807] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30667
[2024-02-04 17:56:50,817] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30668
[2024-02-04 17:56:50,818] [ERROR] [launch.py:321:sigkill_handler] ['/root/miniconda3/envs/chatglm/bin/python', '-u', 'train.py', '--local_rank=3', '--train_path', 'data/d2q_0.json', '--model_name_or_path', 'chatglm3-6b/', '--per_device_train_batch_size', '1', '--max_len', '1560', '--max_src_len', '1024', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--warmup_ratio', '0.1', '--mode', 'glm3', '--train_type', 'lora', '--freeze_module_name', 'layers.27.,layers.26.,layers.25.,layers.24.', '--seed', '1234', '--ds_file', 'ds_zero2_no_offload.json', '--gradient_checkpointing', '--show_loss_step', '10', '--output_dir', './output-glm3'] exits with return code = 1
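
The log above shows `ninja: no work to do.` right before the failing import, so the `fused_adam.so` being loaded was an already-built file from the extensions cache, not a fresh build. If that cached file was compiled in an earlier, different environment, removing it so the next run rebuilds it may be worth trying before anything more drastic. This is only a guess at the cause, not a confirmed fix. The helper name `clear_cached_extension` is hypothetical and not part of DeepSpeed or PyTorch; the cache path is the one printed in the log.

```python
import shutil
from pathlib import Path


def clear_cached_extension(cache_root: Path, name: str = "fused_adam") -> bool:
    """Delete the cached JIT build directory of one extension.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    build_dir = cache_root / name
    if not build_dir.is_dir():
        return False
    # Removing the directory forces the extension to be recompiled on next load.
    shutil.rmtree(build_dir)
    return True
```

With the path from the log, the call would be `clear_cached_extension(Path("/root/.cache/torch_extensions/py311_cu116"))`. PyTorch also reads the `TORCH_EXTENSIONS_DIR` environment variable, so pointing it at an empty directory has a similar effect without deleting anything.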

@zhouchanglin-rr

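The environment is broken: the binaries are incompatible. Rebuild the system from scratch.
The undefined symbol _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv is a C++-related error.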
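
To make the C++ connection concrete: the undefined symbol is a mangled name in the Itanium C++ ABI scheme that GCC uses. The sketch below decodes only the small subset needed for this one symbol (a nested name with an optional `St` prefix for `std::` and a `v` for an empty parameter list). `demangle_simple` is a hypothetical helper written for illustration; a real tool such as `c++filt` handles the full grammar.

```python
import re


def demangle_simple(symbol: str) -> str:
    """Demangle names of the form _ZN [St] <len><name>... E v (tiny subset only)."""
    if not symbol.startswith("_ZN"):
        raise ValueError("only nested names (_ZN...E) are handled")
    pos = 3
    parts = []
    # "St" is the ABI abbreviation for the std:: namespace.
    if symbol.startswith("St", pos):
        parts.append("std")
        pos += 2
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos] != "E":
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError("unsupported component at offset %d" % pos)
        length = int(match.group())
        pos += len(match.group())
        parts.append(symbol[pos:pos + length])
        pos += length
    if pos >= len(symbol):
        raise ValueError("missing terminating 'E'")
    # After the closing "E" comes the parameter list; "v" means no parameters.
    if symbol[pos + 1:] != "v":
        raise ValueError("only an empty (void) parameter list is handled")
    return "::".join(parts) + "()"


print(demangle_simple("_ZNSt15__exception_ptr13exception_ptr9_M_addrefEv"))
# prints: std::__exception_ptr::exception_ptr::_M_addref()
```

The decoded name is an internal member of `std::exception_ptr` in the C++ standard library, which is consistent with the comment's diagnosis: the compiled extension expects a symbol that the C++ runtime loaded at import time does not provide.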
