The output of the ONNX model is different from that of the model run with TensorRT #4201
Comments
@rajeevsrao @ttyio @pranavm-nvidia @aaronp24 @ilyasher
The problem is that …
Thanks for your reply. I recompiled the engine using the code below, but the inference results from the TensorRT engine are still different from those of Hugging Face.
result
Same issue: you can set flash_attn to false and use bf16 to compile; it works for me.
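A minimal sketch of the first half of that suggestion, assuming the InternViT remote code exposes a `use_flash_attn` flag on its config (check `configuration_intern_vit.py` for the exact attribute name):

```python
# Sketch (assumption): load the HF model with flash attention disabled before ONNX export,
# so the exported graph uses the plain eager attention path.
import torch
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained(
    "OpenGVLab/InternViT-6B-448px-V1-5", trust_remote_code=True
)
config.use_flash_attn = False  # assumed flag name from the model's remote code

model = AutoModel.from_pretrained(
    "OpenGVLab/InternViT-6B-448px-V1-5",
    config=config,
    torch_dtype=torch.float32,
    trust_remote_code=True,
).eval()
```

The bf16 half of the suggestion would then be applied at engine-build time, e.g. via trtexec's `--bf16` flag (build command sketches follow further below).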
I followed the method you provided for testing.
In the onnx -> trt stage, I tried both the --fp16 and --best settings, but the result was the same: the difference between the TRT and ONNX inference results remains significant.
Did you compile following these steps?
I found that bfloat16 is not required, but use_flash_attn must be set to false when exporting the ONNX model, and --stronglyTyped should be added when converting to the TRT engine.
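For reference, the build step described here would look roughly like this (file names are placeholders; `--stronglyTyped` tells trtexec to take layer precisions from the types in the ONNX graph rather than from builder precision flags):

```
trtexec --onnx=internvit.onnx \
        --saveEngine=internvit_strongly_typed.engine \
        --stronglyTyped
```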
I found that the following section of code in the Hugging Face model caused my TRT engine to be exported in float32, which keeps the inference results of TRT and HF consistent. If fp16 or best is configured, the results are not consistent. However, inference in float32 is quite slow, so I am currently looking for a solution.
Description
I attempted to compile a Hugging Face model (https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5, which includes both the model architecture code and the model files) with TensorRT (TRT) to improve inference speed. The steps I followed are hf -> onnx -> trt.
I performed inference on the same image using Hugging Face (hf), ONNX, and the TRT engine. The hf and ONNX results were consistent, but the TRT engine's result differed from the other two.
I would like to know why the ONNX results are correct while the results from the engine compiled with trtexec are wrong.
The conversion code from hf to ONNX is:
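(The original export script is not reproduced here; the following is only a minimal sketch of such an export. The wrapper, the `use_flash_attn` flag, the input/output names, and the 1x3x448x448 input shape are assumptions, not the author's exact code.)

```python
# Sketch (assumptions as noted above): export InternViT to ONNX with eager attention.
import torch
from transformers import AutoConfig, AutoModel


class VisionWrapper(torch.nn.Module):
    """Thin wrapper so the exported graph returns a plain tensor instead of a ModelOutput."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, pixel_values):
        return self.model(pixel_values).last_hidden_state


config = AutoConfig.from_pretrained(
    "OpenGVLab/InternViT-6B-448px-V1-5", trust_remote_code=True
)
config.use_flash_attn = False  # assumed flag; flash attention is not traceable for export

model = AutoModel.from_pretrained(
    "OpenGVLab/InternViT-6B-448px-V1-5",
    config=config,
    torch_dtype=torch.float32,
    trust_remote_code=True,
).eval()

dummy = torch.randn(1, 3, 448, 448, dtype=torch.float32)
torch.onnx.export(
    VisionWrapper(model),
    dummy,
    "internvit.onnx",
    input_names=["pixel_values"],
    output_names=["last_hidden_state"],
    dynamic_axes={"pixel_values": {0: "batch"}, "last_hidden_state": {0: "batch"}},
    opset_version=17,
)
```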
The conversion code from ONNX to TRT engine is:
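(Also not reproduced; the fp16 and "best" variants mentioned earlier would be invoked roughly like this, with placeholder file names:)

```
# fp16 build
trtexec --onnx=internvit.onnx --saveEngine=internvit_fp16.engine --fp16

# "best" build: lets TensorRT choose among all enabled precisions per layer
trtexec --onnx=internvit.onnx --saveEngine=internvit_best.engine --best
```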
The inference code for hf is:
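(Sketch only; it assumes the attached internvn2_40b_image2_patch1.npy already contains the preprocessed 1x3x448x448 pixel tensor.)

```python
# Sketch: reference inference with the Hugging Face model in float32.
import numpy as np
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "OpenGVLab/InternViT-6B-448px-V1-5",
    torch_dtype=torch.float32,
    trust_remote_code=True,
).eval().cuda()

# Assumption: the .npy file holds the already-preprocessed input tensor.
pixel_values = torch.from_numpy(np.load("internvn2_40b_image2_patch1.npy")).float().cuda()

with torch.no_grad():
    hf_out = model(pixel_values).last_hidden_state.cpu().numpy()
print(hf_out.shape, hf_out.reshape(-1)[:5])
```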
The inference code for ONNX is:
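(Sketch with onnxruntime; file names are placeholders.)

```python
# Sketch: run the exported ONNX model and keep the output for comparison.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "internvit.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
pixel_values = np.load("internvn2_40b_image2_patch1.npy").astype(np.float32)

input_name = sess.get_inputs()[0].name
onnx_out = sess.run(None, {input_name: pixel_values})[0]
print(onnx_out.shape, onnx_out.reshape(-1)[:5])
```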
The inference code for TRT engine is:
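(Sketch using the TensorRT 10 tensor-name API with PyCUDA; the engine file name and the fp32 input are assumptions.)

```python
# Sketch: deserialize the engine and run one inference with the name-based I/O API.
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("internvit_fp16.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

pixel_values = np.ascontiguousarray(
    np.load("internvn2_40b_image2_patch1.npy").astype(np.float32)
)

stream = cuda.Stream()
device_buffers, host_outputs = {}, {}

# First pass: set input shapes so output shapes can be resolved.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
        context.set_input_shape(name, pixel_values.shape)

# Second pass: allocate device buffers, copy the input, and bind addresses by name.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
        d_buf = cuda.mem_alloc(pixel_values.nbytes)
        cuda.memcpy_htod_async(d_buf, pixel_values, stream)
    else:
        shape = tuple(context.get_tensor_shape(name))
        dtype = trt.nptype(engine.get_tensor_dtype(name))
        host_outputs[name] = np.empty(shape, dtype=dtype)
        d_buf = cuda.mem_alloc(host_outputs[name].nbytes)
    device_buffers[name] = d_buf
    context.set_tensor_address(name, int(d_buf))

context.execute_async_v3(stream_handle=stream.handle)
for name, host_out in host_outputs.items():
    cuda.memcpy_dtoh_async(host_out, device_buffers[name], stream)
stream.synchronize()

trt_out = next(iter(host_outputs.values()))
print(trt_out.shape, trt_out.reshape(-1)[:5])
```

The three outputs can then be compared directly, e.g. with np.max(np.abs(hf_out - trt_out)).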
The inference results are as follows:
Environment
TensorRT Version: v100500 (10.5.0)
NVIDIA GPU: A100
NVIDIA Driver Version: 535.54.03
CUDA Version: 12.2
CUDNN Version: 8920
Operating System: Docker image nvidia_cuda_12.4.0-devel-ubuntu22.04
Python Version (if applicable): 3.10.12
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 2.2.2+cu121
Baremetal or Container (if so, version): Docker
Relevant Files
Model link: https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5
internvn2_40b_image2_patch1.npy: internvn2_40b_image2_patch1.zip
onnx file link: https://drive.google.com/file/d/1lnEmuQ4cNzf8YA7ddznqUnYsz-W5y5aJ/view?usp=sharing