[Bug] Running Phi-3-vision-128k-instruct on 8 GPUs fails with “Expected all tensors to be on the same device, but found at least two devices” #2633

dreamerlin · 2024-10-22T04:28:57Z

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Reproduction

import lmdeploy
from lmdeploy import ChatTemplateConfig, PytorchEngineConfig
backend_config = PytorchEngineConfig(tp=8, session_len=session_len)
pipe = lmdeploy.pipeline(args.checkpoint, backend_config=backend_config, chat_template_config=ChatTemplateConfig(model_name='phi-3'))

Environment

sys.platform: linux
Python: 3.9.19 (main, May  6 2024, 19:43:03) [GCC 11.2.0]
CUDA available: False
MUSA available: False
numpy_random_seed: 2147483648
GCC: gcc (GCC) 9.4.0
PyTorch: 2.0.1
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.15.2
LMDeploy: 0.6.1+2323e69
transformers: 4.45.2
gradio: 4.44.1
fastapi: 0.103.2
pydantic: 2.9.2
triton: 3.0.0

Error traceback

Traceback (most recent call last):
  File "/mnt/petrelfs/wangweiyun/miniconda3/envs/lmdploy/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/mnt/petrelfs/wangweiyun/miniconda3/envs/lmdploy/lib/python3.9/site-packages/lmdeploy/vl/engine.py", line 27, in _raise_exception_on_finish
    raise e
  File "/mnt/petrelfs/wangweiyun/miniconda3/envs/lmdploy/lib/python3.9/site-packages/lmdeploy/vl/engine.py", line 23, in _raise_exception_on_finish
    task.result()
  File "/mnt/petrelfs/wangweiyun/miniconda3/envs/lmdploy/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/mnt/petrelfs/wangweiyun/miniconda3/envs/lmdploy/lib/python3.9/site-packages/lmdeploy/vl/engine.py", line 169, in forward
    outputs = self.model.forward(*func_inputs)
  File "/mnt/petrelfs/wangweiyun/miniconda3/envs/lmdploy/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/petrelfs/wangweiyun/miniconda3/envs/lmdploy/lib/python3.9/site-packages/lmdeploy/vl/model/phi3_vision.py", line 193, in forward
    image_features = _process_image_embedding(
  File "/mnt/petrelfs/wangweiyun/miniconda3/envs/lmdploy/lib/python3.9/site-packages/lmdeploy/vl/model/phi3_vision.py", line 64, in _process_image_embedding
    glb_img = torch.cat([glb_img, temp_glb_GN],
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:5 and cuda:6! (when checking argument for argument tensors in method wrapper_CUDA_cat)
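
The `torch.cat` call in the traceback fails because its inputs live on different GPUs. As a rough illustration of one possible workaround (not the maintainers' fix), the sketch below moves every input to the device of the first tensor before concatenating. The helper name `cat_on_first_device` and the tensor shapes are hypothetical, chosen only for the example; it runs on CPU so it can be tried without GPUs.

```python
import torch


def cat_on_first_device(tensors, dim=0):
    """Concatenate tensors after moving them to the first tensor's device.

    Hypothetical helper for illustration; not part of lmdeploy or PyTorch.
    """
    target = tensors[0].device
    # .to() is a no-op for tensors already on the target device.
    return torch.cat([t.to(target) for t in tensors], dim=dim)


# Stand-ins for the two tensors named in the traceback (shapes are assumed).
glb_img = torch.zeros(1, 12, 12, 8)
temp_glb_GN = torch.ones(1, 12, 1, 8)

out = cat_on_first_device([glb_img, temp_glb_GN], dim=2)
print(out.shape, out.device)
```

With inputs on `cuda:5` and `cuda:6`, this would place the result on `cuda:5`. It adds a cross-device copy and may only mask a parameter-placement problem under `tp=8`, so it is probably best treated as a diagnostic rather than a final fix.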

dreamerlin · 2024-10-22T04:29:22Z

This was run on 8 GPUs.

dreamerlin · 2024-10-22T05:37:39Z

By the way, is there a problem with this line of code? https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/vl/model/phi3_vision.py#L61
Shouldn't it be:

temp_glb_GN = self.glb_GN.repeat(1, H // 2, 1, 1)
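
To reason about the proposed change, it may help to check what `Tensor.repeat(1, H // 2, 1, 1)` produces for different parameter shapes. The sketch below is only a shape check under assumed values: `H`, `C`, and the 3-D and 4-D parameter shapes are hypothetical and are not taken from the lmdeploy or Phi-3 source.

```python
import torch

H, C = 24, 16  # hypothetical grid size and feature width

# Hypothetical separator parameters: one 3-D, one 4-D.
param_3d = torch.zeros(1, 1, C)
param_4d = torch.zeros(1, 1, 1, C)

# repeat() with more repeat counts than tensor dims treats the tensor
# as if it had leading size-1 dimensions.
rep_3d = param_3d.repeat(1, H // 2, 1, 1)
rep_4d = param_4d.repeat(1, H // 2, 1, 1)
print(rep_3d.shape, rep_4d.shape)

# Concatenating along dim=2 requires all other dimensions to match.
glb_img = torch.zeros(1, H // 2, H // 2, C)
merged = torch.cat([glb_img, rep_4d], dim=2)
print(merged.shape)
```

Under these assumptions both candidates yield the same shape, so a shape check alone would not tell which parameter is the intended one; that would have to be confirmed against the reference model code.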

dreamerlin · 2024-10-22T05:55:37Z

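After modifying the code myself (changing only the device-related code), I ran an 8k context with 2 images on a text needle task, and the output is wrong.

Are you sure the Phi code logic is correct?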

RunningLeon · 2024-10-23T02:45:26Z

@dreamerlin Hi, it seems that the implementation in lmdeploy is based on an older version of the Phi-3 vision model; see this commit: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/commit/866d1691437a49af79d5f3ad4a34c1750e08d163 . We may update it later.
BTW, could you provide sample code with image files to reproduce this? Thanks.

lvhan028 assigned RunningLeon Oct 22, 2024

RunningLeon added the mllm label Oct 23, 2024
