Questions about polygraphy conversion #4218

Open
EmmaThompson123 opened this issue Oct 22, 2024 · 0 comments
I use Polygraphy to convert an ONNX model to TensorRT:

import numpy as np
import os
from polygraphy.backend.trt import CreateConfig as CreateTrtConfig, EngineFromNetwork, EngineFromBytes, NetworkFromOnnxPath, Profile, TrtRunner, SaveEngine
import time
from polygraphy.logger import G_LOGGER
G_LOGGER.severity = G_LOGGER.VERBOSE

def GB_to_bytes(gb):
    return int(gb * 1024 * 1024 * 1024)

def main():
    workspace_GB = 1.7
    dynamic = False
    bs = 4
    if not dynamic:
        assert bs
    fp16 = True
    onnx_path = f"weights/model_bs{bs}.onnx"
    if dynamic:
        suffix1 = "_dynamic"
    else:
        suffix1 = f"_bs{bs}"
    if fp16:
        suffix2 = "_fp16"
    else:
        suffix2 = "_fp32"
    tactic_sources = ['CUBLAS_LT', 'CUDNN']
    tactic_sources_suffix = "_".join(tactic_sources)
    engine_path = f"weights/model{suffix1}{suffix2}_{workspace_GB}GB_{tactic_sources_suffix}.trt"
    print("build engine")
    profiles = [
        Profile().add('audio_seqs__0', min=[1, 1, 80, 16], opt=[bs, 1, 80, 16], max=[2*bs, 1, 80, 16]).add('img_seqs__1', min=[1, 6, 256, 256], opt=[bs, 6, 256, 256], max=[2*bs, 6, 256, 256]) \
            if dynamic else Profile().add('audio_seqs__0', min=[bs, 1, 80, 16], opt=[bs, 1, 80, 16], max=[bs, 1, 80, 16]).add('img_seqs__1', min=[bs, 6, 256, 256], opt=[bs, 6, 256, 256], max=[bs, 6, 256, 256])
    ]
    create_trt_config = CreateTrtConfig(max_workspace_size=GB_to_bytes(workspace_GB), fp16=fp16, profiles=profiles)
    build_engine = EngineFromNetwork(
        NetworkFromOnnxPath(onnx_path), config=create_trt_config)
    build_engine = SaveEngine(build_engine, path=engine_path)
    with TrtRunner(build_engine) as runner:
        audio_seqs__0 = np.random.randn(bs, 1, 80, 16).astype(np.float32)
        img_seqs__1 = np.random.randn(bs, 6, 256, 256).astype(np.float32)
        outputs = runner.infer(feed_dict={"audio_seqs__0": audio_seqs__0, "img_seqs__1": img_seqs__1})

if __name__ == "__main__":
    main()
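As a sanity check on the workspace-size arithmetic, the `GB_to_bytes` helper from the script above (reproduced standalone so it runs on its own) converts 1.7 GB to exactly the byte count that appears later in the build log:

```python
def GB_to_bytes(gb):
    # 1 GB = 1024**3 bytes; int() truncates the fractional byte
    return int(gb * 1024 * 1024 * 1024)

print(GB_to_bytes(1.7))          # 1825361100, matching "Workspace | 1825361100 bytes"
print(GB_to_bytes(1.7) / 2**20)  # ~1740.8, matching "(1740.80 MiB)"
```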

Its log is:

[V] Loaded Module: tensorrt           | Version: 8.0.3.4  | Path: ['/opt/conda/lib/python3.8/site-packages/tensorrt']
build engine
[V] Loaded Module: polygraphy.util    | Path: ['/opt/conda/lib/python3.8/site-packages/polygraphy/util']
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +503, GPU +0, now: CPU 515, GPU 521 (MiB)
[TensorRT] INFO: ----------------------------------------------------------------
[TensorRT] INFO: Input filename:   weights/model_bs4.onnx
[TensorRT] INFO: ONNX IR version:  0.0.7
[TensorRT] INFO: Opset version:    9
[TensorRT] INFO: Producer name:    pytorch
[TensorRT] INFO: Producer version: 1.10
[TensorRT] INFO: Domain:           
[TensorRT] INFO: Model version:    0
[TensorRT] INFO: Doc string:       
[TensorRT] INFO: ----------------------------------------------------------------
[V]     Setting TensorRT Optimization Profiles
[V]     Input tensor: audio_seqs__0 (dtype=DataType.FLOAT, shape=(4, 1, 80, 16)) | Setting input tensor shapes to: (min=[4, 1, 80, 16], opt=[4, 1, 80, 16], max=[4, 1, 80, 16])
[V]     Input tensor: img_seqs__1 (dtype=DataType.FLOAT, shape=(4, 6, 256, 256)) | Setting input tensor shapes to: (min=[4, 6, 256, 256], opt=[4, 6, 256, 256], max=[4, 6, 256, 256])
[I]     Configuring with profiles: [Profile().add(audio_seqs__0, min=[4, 1, 80, 16], opt=[4, 1, 80, 16], max=[4, 1, 80, 16]).add(img_seqs__1, min=[4, 6, 256, 256], opt=[4, 6, 256, 256], max=[4, 6, 256, 256])]
[I] Building engine with configuration:
    Workspace            | 1825361100 bytes (1740.80 MiB)
    Precision            | TF32: False, FP16: True, INT8: False, Strict Types: False
    Tactic Sources       | ['CUBLAS', 'CUBLAS_LT', 'CUDNN']
    Safety Restricted    | False
    Profiles             | 1 profile(s)
[TensorRT] INFO: [MemUsageSnapshot] Builder begin: CPU 721 MiB, GPU 521 MiB
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +793, GPU +342, now: CPU 1598, GPU 863 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +198, GPU +342, now: CPU 1796, GPU 1205 (MiB)
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 2 inputs and 1 output network tensors.
[TensorRT] INFO: Total Host Persistent Memory: 172288
[TensorRT] INFO: Total Device Persistent Memory: 63409664
[TensorRT] INFO: Total Scratch Memory: 119289856
[TensorRT] INFO: [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 187 MiB, GPU 4 MiB
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2602, GPU 1685 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2602, GPU 1693 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2602, GPU 1677 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2602, GPU 1659 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] Builder end: CPU 2517 MiB, GPU 1659 MiB
[I] Finished engine building in 89.164 seconds
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2537, GPU 1535 (MiB)
[TensorRT] INFO: Loaded engine size: 123 MB
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine begin: CPU 2537 MiB, GPU 1535 MiB
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2538, GPU 1669 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2538, GPU 1677 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2538, GPU 1659 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine end: CPU 2538 MiB, GPU 1659 MiB
[I] Saving engine to weights/wav2lip/wav2lip_bs4_fp16_1.7GB_CUBLAS_LT_CUDNN.trt
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation begin: CPU 2414 MiB, GPU 1659 MiB
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2414, GPU 1669 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2414, GPU 1677 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation end: CPU 2414 MiB, GPU 1953 MiB
[V] Loaded Module: numpy              | Version: 1.23.5   | Path: ['/opt/conda/lib/python3.8/site-packages/numpy']
[V] Runner input metadata is: {audio_seqs__0 [dtype=float32, shape=(4, 1, 80, 16)],
     img_seqs__1 [dtype=float32, shape=(4, 6, 256, 256)]}
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2423, GPU 1873 (MiB)
Deactivated tensorrt runner

My first question: I set tactic_sources = ['CUBLAS_LT', 'CUDNN'] (my model is FP16, so I assume I don't need CUBLAS, and I want to reduce GPU memory consumption), yet the log shows the CUBLAS tactic source is still used. Why?
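(Note that in the script above, the tactic_sources list is only used to build the engine filename and is never passed to CreateConfig, so the builder falls back to its default tactic sources.) For reference, TensorRT's IBuilderConfig.set_tactic_sources expects a bitmask built from trt.TacticSource enum values. A minimal pure-Python sketch of that mask arithmetic, with the TensorRT 8.0 enum values hard-coded as an assumption so the snippet runs without the tensorrt package:

```python
# Enum values as in TensorRT 8.0's trt.TacticSource (hard-coded assumption here,
# so this runs without the tensorrt package installed).
TACTIC_SOURCES = {"CUBLAS": 0, "CUBLAS_LT": 1, "CUDNN": 2}

def tactic_mask(names):
    """Build the bitmask shape expected by IBuilderConfig.set_tactic_sources."""
    mask = 0
    for name in names:
        mask |= 1 << TACTIC_SOURCES[name]
    return mask

# Excluding CUBLAS yields 0b110; enabling all three sources yields 0b111.
print(bin(tactic_mask(["CUBLAS_LT", "CUDNN"])))  # 0b110
```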
Then I use trtexec to run the converted TRT engine: trtexec --loadEngine=weights/model_bs4_fp16_1.7GB_CUBLAS_LT_CUDNN.trt --shapes=audio_seqs__0:4x1x80x16,img_seqs__1:4x6x256x256. The log is:

[10/21/2024-21:45:10] [I] === Model Options ===
[10/21/2024-21:45:10] [I] Format: *
[10/21/2024-21:45:10] [I] Model: 
[10/21/2024-21:45:10] [I] Output:
[10/21/2024-21:45:10] [I] === Build Options ===
[10/21/2024-21:45:10] [I] Max batch: explicit
[10/21/2024-21:45:10] [I] Workspace: 16 MiB
[10/21/2024-21:45:10] [I] minTiming: 1
[10/21/2024-21:45:10] [I] avgTiming: 8
[10/21/2024-21:45:10] [I] Precision: FP32
[10/21/2024-21:45:10] [I] Calibration: 
[10/21/2024-21:45:10] [I] Refit: Disabled
[10/21/2024-21:45:10] [I] Sparsity: Disabled
[10/21/2024-21:45:10] [I] Safe mode: Disabled
[10/21/2024-21:45:10] [I] Restricted mode: Disabled
[10/21/2024-21:45:10] [I] Save engine: 
[10/21/2024-21:45:10] [I] Load engine: weights/wav2lip/wav2lip_bs4_fp16_1.5GB_CUBLAS_LT_CUDNN.trt
[10/21/2024-21:45:10] [I] NVTX verbosity: 0
[10/21/2024-21:45:10] [I] Tactic sources: Using default tactic sources
[10/21/2024-21:45:10] [I] timingCacheMode: local
[10/21/2024-21:45:10] [I] timingCacheFile: 
[10/21/2024-21:45:10] [I] Input(s)s format: fp32:CHW
[10/21/2024-21:45:10] [I] Output(s)s format: fp32:CHW
[10/21/2024-21:45:10] [I] Input build shape: audio_seqs__0=4x1x80x16+4x1x80x16+4x1x80x16
[10/21/2024-21:45:10] [I] Input build shape: img_seqs__1=4x6x256x256+4x6x256x256+4x6x256x256
[10/21/2024-21:45:10] [I] Input calibration shapes: model
[10/21/2024-21:45:10] [I] === System Options ===
[10/21/2024-21:45:10] [I] Device: 0
[10/21/2024-21:45:10] [I] DLACore: 
[10/21/2024-21:45:10] [I] Plugins:
[10/21/2024-21:45:10] [I] === Inference Options ===
[10/21/2024-21:45:10] [I] Batch: Explicit
[10/21/2024-21:45:10] [I] Input inference shape: img_seqs__1=4x6x256x256
[10/21/2024-21:45:10] [I] Input inference shape: audio_seqs__0=4x1x80x16
[10/21/2024-21:45:10] [I] Iterations: 10
[10/21/2024-21:45:10] [I] Duration: 3s (+ 200ms warm up)
[10/21/2024-21:45:10] [I] Sleep time: 0ms
[10/21/2024-21:45:10] [I] Streams: 1
[10/21/2024-21:45:10] [I] ExposeDMA: Disabled
[10/21/2024-21:45:10] [I] Data transfers: Enabled
[10/21/2024-21:45:10] [I] Spin-wait: Disabled
[10/21/2024-21:45:10] [I] Multithreading: Disabled
[10/21/2024-21:45:10] [I] CUDA Graph: Disabled
[10/21/2024-21:45:10] [I] Separate profiling: Disabled
[10/21/2024-21:45:10] [I] Time Deserialize: Disabled
[10/21/2024-21:45:10] [I] Time Refit: Disabled
[10/21/2024-21:45:10] [I] Skip inference: Disabled
[10/21/2024-21:45:10] [I] Inputs:
[10/21/2024-21:45:10] [I] === Reporting Options ===
[10/21/2024-21:45:10] [I] Verbose: Disabled
[10/21/2024-21:45:10] [I] Averages: 10 inferences
[10/21/2024-21:45:10] [I] Percentile: 99
[10/21/2024-21:45:10] [I] Dump refittable layers:Disabled
[10/21/2024-21:45:10] [I] Dump output: Disabled
[10/21/2024-21:45:10] [I] Profile: Disabled
[10/21/2024-21:45:10] [I] Export timing to JSON file: 
[10/21/2024-21:45:10] [I] Export output to JSON file: 
[10/21/2024-21:45:10] [I] Export profile to JSON file: 
[10/21/2024-21:45:10] [I] 
[10/21/2024-21:45:11] [I] === Device Information ===
[10/21/2024-21:45:11] [I] Selected Device: NVIDIA A30
[10/21/2024-21:45:11] [I] Compute Capability: 8.0
[10/21/2024-21:45:11] [I] SMs: 56
[10/21/2024-21:45:11] [I] Compute Clock Rate: 1.44 GHz
[10/21/2024-21:45:11] [I] Device Global Memory: 24060 MiB
[10/21/2024-21:45:11] [I] Shared Memory per SM: 164 KiB
[10/21/2024-21:45:11] [I] Memory Bus Width: 3072 bits (ECC enabled)
[10/21/2024-21:45:11] [I] Memory Clock Rate: 1.215 GHz
[10/21/2024-21:45:11] [I] 
[10/21/2024-21:45:11] [I] TensorRT version: 8003
[10/21/2024-21:45:16] [I] [TRT] [MemUsageChange] Init CUDA: CPU +502, GPU +0, now: CPU 633, GPU 521 (MiB)
[10/21/2024-21:45:16] [I] [TRT] Loaded engine size: 123 MB
[10/21/2024-21:45:16] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 633 MiB, GPU 521 MiB
[10/21/2024-21:45:24] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +793, GPU +342, now: CPU 1426, GPU 987 (MiB)
[10/21/2024-21:45:30] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +198, GPU +342, now: CPU 1624, GPU 1329 (MiB)
[10/21/2024-21:45:30] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1624, GPU 1311 (MiB)
[10/21/2024-21:45:30] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1624 MiB, GPU 1311 MiB
[10/21/2024-21:45:30] [I] Engine loaded in 19.0488 sec.
[10/21/2024-21:45:30] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1500 MiB, GPU 1311 MiB
[10/21/2024-21:45:30] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +10, now: CPU 1501, GPU 1321 (MiB)
[10/21/2024-21:45:30] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1501, GPU 1329 (MiB)
[10/21/2024-21:45:30] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1501 MiB, GPU 1605 MiB
[10/21/2024-21:45:30] [I] Created input binding for audio_seqs__0 with dimensions 4x1x80x16
[10/21/2024-21:45:30] [I] Created input binding for img_seqs__1 with dimensions 4x6x256x256
[10/21/2024-21:45:30] [I] Created output binding for value__0 with dimensions 4x3x256x256
[10/21/2024-21:45:30] [I] Starting inference
[10/21/2024-21:45:33] [I] Warmup completed 26 queries over 200 ms
[10/21/2024-21:45:33] [I] Timing trace has 607 queries over 3.01887 s
[10/21/2024-21:45:33] [I] 
[10/21/2024-21:45:33] [I] === Trace details ===
[10/21/2024-21:45:33] [I] Trace averages of 10 runs:
[10/21/2024-21:45:33] [I] Average on 10 runs - GPU latency: 6.70789 ms - Host latency: 7.19922 ms (end to end 13.276 ms, enqueue 1.39257 ms)
...
[10/21/2024-21:45:33] [I] Average on 10 runs - GPU latency: 4.63984 ms - Host latency: 5.05623 ms (end to end 8.15396 ms, enqueue 1.21301 ms)
[10/21/2024-21:45:33] [I] 
[10/21/2024-21:45:33] [I] === Performance summary ===
[10/21/2024-21:45:33] [I] Throughput: 201.069 qps
[10/21/2024-21:45:33] [I] Latency: min = 4.92627 ms, max = 7.21841 ms, mean = 5.10661 ms, median = 5.06763 ms, percentile(99%) = 7.19328 ms
[10/21/2024-21:45:33] [I] End-to-End Host Latency: min = 5.026 ms, max = 13.315 ms, mean = 9.14545 ms, median = 9.25452 ms, percentile(99%) = 13.2415 ms
[10/21/2024-21:45:33] [I] Enqueue Time: min = 0.482788 ms, max = 2.88159 ms, mean = 1.24856 ms, median = 1.29077 ms, percentile(99%) = 2.15625 ms
[10/21/2024-21:45:33] [I] H2D Latency: min = 0.258789 ms, max = 0.338821 ms, mean = 0.267017 ms, median = 0.264648 ms, percentile(99%) = 0.321045 ms
[10/21/2024-21:45:33] [I] GPU Compute Time: min = 4.53687 ms, max = 6.72342 ms, mean = 4.71299 ms, median = 4.67749 ms, percentile(99%) = 6.70314 ms
[10/21/2024-21:45:33] [I] D2H Latency: min = 0.124268 ms, max = 0.170563 ms, mean = 0.126614 ms, median = 0.125732 ms, percentile(99%) = 0.168869 ms
[10/21/2024-21:45:33] [I] Total Host Walltime: 3.01887 s
[10/21/2024-21:45:33] [I] Total GPU Compute Time: 2.86078 s
[10/21/2024-21:45:33] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/21/2024-21:45:33] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8003] # trtexec --loadEngine=weights/model_bs4_fp16_1.5GB_CUBLAS_LT_CUDNN.trt --shapes=audio_seqs__0:4x1x80x16,img_seqs__1:4x6x256x256
[10/21/2024-21:45:33] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1501, GPU 1525 (MiB)

Note that I set max_workspace_size=1.7 GB and fp16=True in the Polygraphy conversion Python script, but the trtexec log shows:

[10/21/2024-21:45:10] [I] Workspace: 16 MiB
...
[10/21/2024-21:45:10] [I] Precision: FP32

Why are they inconsistent?
