CUDA error: device-side assert issues when running the model #393

flowbywind · 2023-10-29T09:34:24Z

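After deploying the model, it handled a few hundred calls without problems, but then further calls started failing with this error: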
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app( # type: ignore[func-returns-value]
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
  File "get_api_cuda1.py", line 66, in create_item
    response, history = model.chat(tokenizer,
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 1032, in chat
    inputs = inputs.to(self.device)
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 758, in to
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
  File "/usr/local/anaconda3/envs/chatglm/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 758, in <dictcomp>
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

hxujal · 2023-10-31T09:29:47Z

I'm running into the same problem.

jeinlee1991 · 2023-11-01T10:13:10Z

  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 1038, in chat
    outputs = self.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1522, in generate
    return self.greedy_search(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2339, in greedy_search
    outputs = self(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 830, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 542, in forward
    layernorm_output = self.input_layernorm(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 190, in forward
    hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I'm hitting the same problem too.

Hcnaeg · 2023-11-03T15:19:14Z

Same problem.

hubo0417 · 2023-11-06T07:27:18Z

How do we fix this?

BeiJingChengXiTech · 2023-11-07T09:21:04Z

GLM team, can you fix this quickly???

YiFraternity · 2023-11-08T01:00:30Z

Yes, I'm running into this too.

lanxinlo · 2023-11-08T02:16:05Z

I ran into this too. How do I fix it?

qzl164 · 2023-11-08T03:08:03Z

Same here. Looking for a solution.

zRzRzRzRzRzRzR · 2023-11-08T03:45:38Z · Maintainer

Noted. We're working on a fix as quickly as possible.

1 reply

htlyaaaa Mar 13, 2024

Hurry up!

Btlmd · 2023-11-08T06:02:38Z · Maintainer

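I haven't been able to reproduce this yet. The CUDA error here is reported asynchronously. Could someone set the CUDA_LAUNCH_BLOCKING=1 environment variable and pin down where the error actually originates?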
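A minimal sketch of doing this from inside a Python service (for example the FastAPI app in the first traceback, where the error only surfaced at `inputs = inputs.to(self.device)`). The helper name `enable_sync_cuda_errors` is hypothetical; the only real requirement is that the variable is set before CUDA is initialized.

```python
import os


def enable_sync_cuda_errors() -> None:
    """Make CUDA kernel launches synchronous while debugging (hypothetical helper).

    With CUDA_LAUNCH_BLOCKING=1 each kernel launch waits for the kernel to
    finish, so a device-side assert is reported at the call that actually
    failed rather than at a later, unrelated call such as `inputs.to(device)`.
    The variable is read when CUDA is initialized, so call this before
    `import torch` (or at least before the first CUDA operation).
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_sync_cuda_errors()
# import torch and load the model only after this point
```

From a shell the equivalent is `CUDA_LAUNCH_BLOCKING=1 python get_api_cuda1.py` (script name taken from the first traceback). Synchronous launches slow inference down, so this is for debugging only. `TORCH_USE_CUDA_DSA`, also mentioned in the error text, is a build-time option for PyTorch itself and cannot be switched on this way.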

Btlmd · 2023-11-08T07:06:24Z · Maintainer

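I reproduced the problem in one scenario. One possible cause is that the input sequence is longer than the maximum length covered by the model's position embedding, so the lookup index goes out of range.

A friendlier error message has been added to composite_demo.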
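The composite_demo change itself is not shown in this thread. As a hedged sketch of the same idea, a service can compare the token count against the position limit before calling the model, so an over-long request fails with an ordinary Python exception instead of a device-side assert, which leaves the CUDA context unusable until the process restarts. `check_prompt_fits` is a hypothetical helper, and the 8192 default is the chatglm3-6b `seq_length` mentioned later in this thread.

```python
def check_prompt_fits(num_input_tokens: int, max_new_tokens: int, seq_length: int = 8192) -> int:
    """Fail early, with a readable message, if a request cannot fit the position range.

    Hypothetical helper: `seq_length` stands in for the model config value.
    Returns how many positions are left over after the request.
    """
    if num_input_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    needed = num_input_tokens + max_new_tokens
    if needed > seq_length:
        raise ValueError(
            f"prompt ({num_input_tokens} tokens) + max_new_tokens ({max_new_tokens}) = "
            f"{needed} exceeds the model's {seq_length} positions; "
            "shorten the input or clear the conversation history"
        )
    return seq_length - needed
```

`num_input_tokens` should be the length of the fully encoded prompt including history (for a batch of encoded chat input, `inputs["input_ids"].shape[-1]`), not just the latest user message.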

tjgaozw · 2023-11-08T14:07:58Z

I've hit this too, many times. It happens after multi-turn conversations, and GPU memory was not exhausted.

worm128 · 2023-11-10T00:34:26Z

Same problem. Streaming output with model.stream_chat works fine, but returning the whole string at once with model.chat fails. Stepping into the source with breakpoints, it looks like an out-of-bounds array index.
Findings from debugging:
File: .cache\huggingface\modules\transformers_modules\local\modeling_chatglm.py
Failing line: line 723, words_embeddings = self.word_embeddings(input_ids)
Error message: pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Indexing.cu:1239: block: [28,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
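The assertion `srcIndex < srcSelectDimSize` means an index handed to the embedding lookup was not smaller than the embedding table size. One way to confirm this before the GPU ever sees the batch is to check the token ids on the host. A library-free sketch; `out_of_range_token_ids` is a hypothetical helper.

```python
from typing import Iterable, List


def out_of_range_token_ids(input_ids: Iterable[int], num_embeddings: int) -> List[int]:
    """Return the ids that an embedding table of size `num_embeddings` cannot look up.

    Mirrors the condition behind `srcIndex < srcSelectDimSize` on CUDA (and
    `IndexError: index out of range in self` on CPU): every id must satisfy
    0 <= id < num_embeddings.
    """
    return [token_id for token_id in input_ids if not 0 <= token_id < num_embeddings]
```

With a loaded transformers model one might call it as `out_of_range_token_ids(input_ids.flatten().tolist(), model.get_input_embeddings().num_embeddings)`; whether the ChatGLM3 remote code exposes its word embeddings through `get_input_embeddings()` is an assumption worth checking. An empty result points away from bad token ids and toward position indices instead.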

worm128 · 2023-11-10T00:57:23Z

TORCH_USE_CUDA_DSA

See my bug report #243.

worm128 · 2023-11-10T00:57:54Z

"Noted. We're working on a fix as quickly as possible."

See my bug report #243.

wuyiclover · 2023-11-28T09:16:27Z

At line 822, seq_length is 8192, the default value from the config.
If the input is longer than 8192 tokens, it triggers the indexing error at line 824.
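A worked illustration of that arithmetic, assuming (as the comment above implies) that the model precomputes one rotary-embedding entry per position, `seq_length` entries in total, and then indexes that table by position id. An input of n tokens uses positions 0 through n - 1, so it works for n <= 8192 and fails at n = 8193. The pure-Python stand-in below is not the ChatGLM3 code, and both function names are hypothetical.

```python
from typing import List


def build_position_table(seq_length: int) -> List[int]:
    """Stand-in for a precomputed rotary cache: one entry per supported position."""
    return list(range(seq_length))


def lookup_positions(table: List[int], num_tokens: int) -> List[int]:
    """Look up positions 0..num_tokens-1, as a forward pass over num_tokens tokens would."""
    # Position ids start at 0, so the last one needed is num_tokens - 1.
    # If num_tokens > len(table), Python raises IndexError here; on the GPU
    # the equivalent out-of-range gather trips a device-side assert instead.
    return [table[position] for position in range(num_tokens)]
```

Note that in a multi-turn chat the token count includes the whole history, which is one way a conversation can cross the limit even when each individual message is short.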

zRzRzRzRzRzRzR · 2023-12-07T10:21:31Z · Maintainer

We updated the model configuration files today. Please check whether the problem still occurs.

3 replies

YLiberals Dec 9, 2023

It still occurs.

zRzRzRzRzRzRzR Dec 19, 2023
Maintainer

Hmm, how do you reproduce it? We tried on both Linux and Mac and saw no problem.

YLiberals Dec 19, 2023

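My environment is a Docker container on Ubuntu. Specifically: 1. I ran LoRA fine-tuning with the LLaMA-Factory project; 2. I query the fine-tuned glm3-6b through the API server that LLaMA-Factory provides. The problem appears when the sequence is long. I had already searched the LLaMA-Factory issues, where it was attributed to GLM3. You could try LLaMA-Factory's API to see whether you can reproduce it.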

bearbee123 · 2023-12-10T02:26:11Z

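I see this too. In particular, GPU memory usage rises during every conversation turn. When the turn ends, memory drops back to the level right after the model was loaded, but on the next turn the usage climbs higher than it did on the previous one.

Is something wrong with my code logic? Things improved somewhat after I added the indexing constraint shown in an image (not preserved here).

Any help appreciated!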
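If, as one plausible reading of this report suggests, the growth comes from re-sending an ever-longer history on each turn, a common mitigation is to cap the history. This is an assumption, not something confirmed in the thread. A hedged sketch, using the `history` format shown later in this thread (a list of `{'role': ..., 'content': ...}` dicts with a leading system message); `trim_history` and `count_tokens` are hypothetical names.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def trim_history(history: List[Message], count_tokens: Callable[[str], int], budget: int) -> List[Message]:
    """Drop the oldest user/assistant pairs until the history fits in `budget` tokens.

    Hypothetical helper. System messages are always kept. Per-message chat
    template tokens are ignored, so leave some headroom in `budget`.
    """
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]

    def total(messages: List[Message]) -> int:
        return sum(count_tokens(m["content"]) for m in messages)

    while rest and total(system + rest) > budget:
        rest = rest[2:]  # remove the oldest user/assistant pair
    return system + rest
```

`count_tokens` could be built from the tokenizer, for example `lambda s: len(tokenizer.encode(s))`; dropping whole pairs keeps the remaining history alternating between user and assistant.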

worm128 · 2023-12-12T00:19:47Z

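"We updated the model configuration files today. Please check whether the problem still occurs."
Try testing on Windows 11; you should be able to reproduce the problem there.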

3 replies

zRzRzRzRzRzRzR Dec 19, 2023
Maintainer

Hmm, I don't have a Windows machine. It doesn't happen on Mac or Linux.

tinkball Dec 29, 2023

It doesn't appear after only a few exchanges; it takes at least 200 question-and-answer rounds before it shows up.

treya-lin Feb 5, 2024

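"It doesn't appear after only a few exchanges; it takes at least 200 question-and-answer rounds before it shows up."

No. Every time I've hit it, it appeared after only a handful of rounds.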

zhang-xh95 · 2023-12-12T11:08:57Z

Has anyone hit the same problem with ChatGLM2-6B?

1 reply

lrh000 Mar 15, 2024

Same here.

10cent01 · 2023-12-13T08:46:08Z

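Hi, I hit the same problem when calling stream_chat in a loop. At around the 114th input I got the warning: Input length of input_ids is 32802, but `max_length` is set to 32768. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`. This was followed by a CUDA out-of-bounds error similar to the ones above.

I tried changing max_length and max_new_tokens as the warning suggests, but the problem persisted. I then printed batch_size, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1] inside the stream_generate() function in modeling_chatglm.py. The results were roughly as follows:

input_ids_seq_length (i.e. input_ids.shape[-1]) grew from 360 to over 30,000. I remembered that I never hit this problem with GLM2, so I added the same print there and found the value stayed between roughly 400 and 600, instead of growing steadily as it does with GLM3.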

1 reply

10cent01 Dec 15, 2023

The growth of input_ids_seq_length is caused by the past_key_values parameter in stream_chat, and seems unrelated to the CUDA error.
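To take the kind of measurement described above without editing modeling_chatglm.py, the prompt length can be logged on the caller's side before each call. A hedged sketch; `InputLengthMonitor` is a hypothetical name, and the 32768 limit in the usage below is the `max_length` from the warning quoted above.

```python
import warnings
from typing import List


class InputLengthMonitor:
    """Record the prompt length of each generation call and warn near a limit.

    Hypothetical debugging aid: call `record(n)` with input_ids.shape[-1]
    before each generate/stream_generate call.
    """

    def __init__(self, max_length: int, margin: int = 1024) -> None:
        self.max_length = max_length
        self.margin = margin
        self.lengths: List[int] = []

    def record(self, length: int) -> bool:
        """Store `length`; return True (and emit a warning) if it is within `margin` of the limit."""
        self.lengths.append(length)
        near_limit = length >= self.max_length - self.margin
        if near_limit:
            warnings.warn(f"input length {length} is close to max_length {self.max_length}")
        return near_limit

    def keeps_growing(self) -> bool:
        """True if every recorded length is larger than the one before it."""
        return len(self.lengths) > 1 and all(b > a for a, b in zip(self.lengths, self.lengths[1:]))
```

A steadily growing series like the 360 to 30,000+ reported above shows up as `keeps_growing() == True` long before the limit is reached, which makes it easier to tell context growth apart from an unrelated CUDA failure.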

treya-lin · 2024-02-05T07:30:23Z

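Same error. :( I hit it right after I started trying the model.

Environment: inside Docker on Ubuntu 22.04 (CUDA 11.8, torch 2.1.0), a single 4090 GPU, model unmodified.

Also, it does not take a huge number of rounds; I hit it after three or four. Here is a chat log. (I have also seen even shorter conversations fail starting from the third round, but I didn't record those.)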

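== Input ==
 Hello

==History==
 [{'role': 'system', 'content': "You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown."}]

== Input ==
 Do you know what speech conversion is?

==History==
 [{'role': 'system', 'content': "You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown."}, {'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hello! Is there anything I can help you with?'}]

== Input ==
 What do you think of the currently popular RAG technique? When do you think RAG is appropriate, and when is fine-tuning appropriate?

==History==
 [{'role': 'system', 'content': "You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown."}, {'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hello! Is there anything I can help you with?'}, {'role': 'user', 'content': 'Do you know what speech conversion is?'}, {'role': 'assistant', 'content': "Speech conversion is a technology that converts speech signals into other formats, commonly text, written words, or commands. It can be implemented with algorithms or accomplished through human-computer interaction. For example, some smart assistants can recognize a user's voice and answer questions."}]

== Input ==
 If there were a requirement to build an AI chatbot that can be deployed offline and answer knowledge questions based on the company's internal documents, what data do you think would need to be prepared, and what technology should be used to implement it?

==History==
 [{'role': 'system', 'content': "You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown."}, {'role': 'user', 'content': 'Hello'}, {'role': 'assistant', 'content': 'Hello! Is there anything I can help you with?'}, {'role': 'user', 'content': 'Do you know what speech conversion is?'}, {'role': 'assistant', 'content': "Speech conversion is a technology that converts speech signals into other formats, commonly text, written words, or commands. It can be implemented with algorithms or accomplished through human-computer interaction. For example, some smart assistants can recognize a user's voice and answer questions."}, {'role': 'user', 'content': 'What do you think of the currently popular RAG technique? When do you think RAG is appropriate, and when is fine-tuning appropriate?'}, {'role': 'assistant', 'content': 'RAG (Reinforcement Learning based Approach for Generative Modeling) is a generative modeling method based on reinforcement learning that has attracted more and more attention in recent years. Its core idea is to train a reinforcement learning model to guide the generative model toward producing high-quality samples. By contrast, fine-tuning is a method that takes a pretrained model and fine-tunes it to perform better on a specific task.\n\nWhether RAG or fine-tuning is more suitable depends on the specific situation and requirements. In general, if you need to quickly obtain a model that produces good results and the amount of data is limited, fine-tuning may be the better choice. This is because fine-tuning leverages the knowledge of the pretrained model and can achieve good performance in a relatively short time.\n\nConversely, if you need to handle large amounts of data or want the model to generalize better, RAG may be more suitable. RAG can use a reinforcement learning model to better guide the generative model and thus achieve better performance in the long run.\n\nOf course, in practice the best approach is to try several methods, compare their performance, and then choose the most suitable one based on the results.'}]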
2024-02-05 07:25:42.879 Uncaught app exception
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 535, in _run_script
    exec(code, module.__dict__)
  File "/home/dev/workspace/llm_space/ChatGLM3/composite_demo/main.py", line 72, in <module>
    demo_chat.main(
  File "/home/dev/workspace/llm_space/ChatGLM3/composite_demo/demo_chat.py", line 62, in main
    for response in client.generate_stream(
  File "/home/dev/workspace/llm_space/ChatGLM3/composite_demo/client.py", line 183, in generate_stream
    for new_text, _ in stream_chat(
  File "/home/dev/workspace/llm_space/ChatGLM3/composite_demo/client.py", line 110, in stream_chat
    for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py", line 1169, in stream_generate
    next_token_scores = logits_processor(input_ids, next_token_logits)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 97, in __call__
    scores = processor(input_ids, scores)
  File "/home/dev/workspace/llm_space/ChatGLM3/composite_demo/client.py", line 56, in __call__
    if torch.isnan(scores).any() or torch.isinf(scores).any():
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Also, the last round did not fail immediately; it generated output for a while before erroring out.

1 reply

treya-lin Feb 5, 2024

Another bad case.
Also, although both runs end with "an illegal memory access was encountered", the traceback is different each time. Why would that be? I'm only using the plain chat feature.

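==History==
 [{'role': 'system', 'content': "You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown."}, {'role': 'user', 'content': 'When was your knowledge last updated?'}, {'role': 'assistant', 'content': 'My knowledge base was most recently updated in January 2022. Although I cannot guarantee that every answer is the most accurate, I already have a considerable amount of knowledge for answering all kinds of questions. If you have any questions, feel free to ask me anytime!'}, {'role': 'user', 'content': 'Do you know OpenAI?'}, {'role': 'assistant', 'content': "OpenAI is an artificial intelligence research company headquartered in the United States, founded in 2015. OpenAI's goal is to achieve safe artificial general intelligence (AGI) and make it beneficial to humanity. OpenAI's research covers all aspects of artificial intelligence, such as machine learning, natural language processing, and computer vision. OpenAI's research results and papers are frequently cited by industry and academia, and the company's technology has also received wide attention and use."}, {'role': 'user', 'content': 'Do you know what the latest OpenAI product is?'}, {'role': 'assistant', 'content': "Yes, OpenAI frequently releases new products and features. As of my knowledge cutoff date of January 2022, OpenAI's latest products include:\n\n* GPT-3: a natural language processing model based on deep learning that can generate high-quality natural language text. GPT-3 is an important OpenAI product and one of the largest publicly available natural language processing models in the world.\n* GPT-2: a natural language processing model based on deep learning, somewhat smaller than GPT-3 but still with very strong language generation capabilities. GPT-2 can be used in many different applications, including natural language generation, dialogue systems, and text summarization.\n* Transformer: a neural network architecture based on self-attention, widely used in natural language processing. The emergence of the Transformer model greatly improved the performance of deep learning in natural language processing.\n*Automated machine learning (AutoML): OpenAI's automated machine learning platform, which lets researchers and developers build and train machine learning models quickly and easily. AutoML can be used in many different applications, including image recognition, natural language processing, recommender systems, and more.\n\nThese are only some of OpenAI's products and services; OpenAI has many other products and services in development."}, {'role': 'user', 'content': 'Does OpenAI have any speech-related products or services?'}, {'role': 'assistant', 'content': 'Yes, OpenAI provides some speech-related products and services. The best known is its speech synthesis technology, which can use deep text-to-speech (TTS) models to convert text into natural speech. This technology can be applied in many different applications, including virtual assistants, speech translation, read-aloud, navigation, and more.\n\nIn addition, OpenAI also provides an API called "Sound" that can be used to add background music and other audio effects to applications. This API can be used together with TTS technology to create a more realistic and natural voice interaction experience.\n\nBeyond that, OpenAI is also researching and developing other speech-related technologies and products, such as speech recognition and speech emotion analysis.'}]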
2024-02-05 07:50:08.688 Uncaught app exception
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 535, in _run_script
    exec(code, module.__dict__)
  File "/home/dev/workspace/llm_space/ChatGLM3/composite_demo/main.py", line 72, in <module>
    demo_chat.main(
  File "/home/dev/workspace/llm_space/ChatGLM3/composite_demo/demo_chat.py", line 62, in main
    for response in client.generate_stream(
  File "/home/dev/workspace/llm_space/ChatGLM3/composite_demo/client.py", line 183, in generate_stream
    for new_text, _ in stream_chat(
  File "/home/dev/workspace/llm_space/ChatGLM3/composite_demo/client.py", line 110, in stream_chat
    for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py", line 1159, in stream_generate
    outputs = self(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py", line 830, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py", line 415, in forward
    value_layer = torch.cat((cache_v, value_layer), dim=0)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Longleaves · 2024-02-28T06:12:22Z

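I get the same error when running batch inference with a fine-tuned model. How can I fix it? Thanks, everyone!
It shouldn't be an input-length problem: my prompts are very short, but they contain private information, so I can't post them.

Error message: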
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [1,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [2,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
……

4 replies

xingyunjohn1 Mar 5, 2024

I had the same problem.

wududu123 Mar 15, 2024

I had the same problem. If I run on CPU:

7 in embedding

  2234         # torch.embedding_renorm_
  2235         # remove once script supports set_grad_enabled
  2236         no_grad_embedding_renorm(weight, input, max_norm, norm_type)
❱ 2237     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
  2238
  2239
  2240 def embedding_bag(

IndexError: index out of range in self

lrh000 Mar 15, 2024

+1

ljc010717 Mar 24, 2024

+1

liujingmao · 2024-03-27T09:08:33Z

I ran into this problem too. I hope the official team can fix it soon.

AI-student2024 · 2024-03-27T09:27:54Z

I ran this file from the open-source project as-is, https://github.com/THUDM/ChatGLM3/blob/main/finetune_demo/lora_finetune.ipynb, and got the same error. Part of the output:

/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/nn/modules/module.py:1511 in _wrapped_call_impl

  1508         if self._compiled_call_impl is not None:
  1509             return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
  1510         else:
❱ 1511             return self._call_impl(*args, **kwargs)
  1512
  1513     def _call_impl(self, *args, **kwargs):
  1514         forward_call = (self._slow_forward if torch._C._get_tracing_state() else self.fo

/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/nn/modules/module.py:1520 in _call_impl

  1517         if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks
  1518                 or _global_backward_pre_hooks or _global_backward_hooks
  1519                 or _global_forward_hooks or _global_forward_pre_hooks):
❱ 1520             return forward_call(*args, **kwargs)
  1521
  1522         try:
  1523             result = None

/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py:822 in forward

  819                                             attention_mask], dim=-1)
  820
  821         if full_attention_mask is None:
❱ 822             if (attention_mask is not None and not attention_mask.all()) or (past_key_va
  823                 full_attention_mask = self.get_masks(input_ids, past_key_values, padding
  824
  825         # Rotary positional embeddings
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

2 replies

hanke3 Mar 28, 2024

Same error.

Win7win Mar 29, 2024

I'm hitting this problem too.

liujingmao · 2024-03-27T09:55:32Z

Running `python inference_hf.py your_finetune_path --prompt your prompt` also produces this error.

1 reply

hanke3 Mar 28, 2024

Same error.

hanke3 · 2024-03-28T03:50:14Z

liujingmao · 2024-03-28T06:10:15Z

jeinlee1991 · 2024-03-29T07:49:30Z

zzk2021 · 2024-11-01T07:23:50Z

Same with GLM4. I fine-tuned with LLaMA-Factory, but LLaMA-Factory only works with peft>0.11.

Replies: 37 comments · 34 replies
