Hello,

I'm struggling to understand a few things about the simulator, given that there is no documentation around it.
As I understand it, the simulation runtime is determined by the requests (either trace files or the synthetic request generator, with arrival times drawn from a Poisson distribution). There is also the execution time, i.e. the actual time it takes to process a batch or pipeline stage during model inference. My question is: how does having multiple GPUs (added through the replica_config_num_pipeline_stages and replica_config_tensor_parallel_size parameters) affect the simulation runtime and/or the request execution time? From the stats extractor script, GPU hours seem to be calculated as runtime * number of GPUs / 3600; however, I would expect the runtime (or the execution time) to decrease with more GPUs due to load balancing, and therefore the total GPU hours to decrease as well. Is this incorrect?
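To make the arithmetic I am referring to concrete, here is a minimal sketch; the function and variable names are my own and may not match what the stats extractor script actually does:

```python
def gpu_hours(runtime_seconds: float,
              num_pipeline_stages: int,
              tensor_parallel_size: int) -> float:
    # Assuming the GPU count per replica is simply the product of the two
    # parallelism parameters (i.e. the world_size).
    num_gpus = num_pipeline_stages * tensor_parallel_size
    return runtime_seconds * num_gpus / 3600.0

# A 1-hour simulated run with 4 pipeline stages and 2-way tensor parallelism
# is billed as 8 GPU hours -- unless the runtime itself shrinks with more
# GPUs, which is exactly what I am unsure about.
print(gpu_hours(3600.0, num_pipeline_stages=4, tensor_parallel_size=2))  # -> 8.0
```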
Also, where in the code should I look to find out how load balancing is handled across multiple GPUs? Is there a load-balancing mechanism across GPUs, or are the GPUs fully independent of each other? I am just curious how tasks are distributed across GPUs when we increase world_size by increasing the replica_config_num_pipeline_stages and replica_config_tensor_parallel_size parameters; a toy sketch of what I mean follows below.
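By "load balancing" I mean something like the purely hypothetical round-robin dispatcher below, which spreads requests over otherwise independent GPU groups. None of these names come from the simulator's code base; I could not find anything like this, hence the question:

```python
from itertools import cycle

class ToyDispatcher:
    def __init__(self, num_gpu_groups: int):
        # each "group" would be one pipeline-parallel x tensor-parallel unit
        self.queues = [[] for _ in range(num_gpu_groups)]
        self._next_group = cycle(range(num_gpu_groups))

    def dispatch(self, request_id: int) -> int:
        group = next(self._next_group)
        self.queues[group].append(request_id)
        return group

dispatcher = ToyDispatcher(num_gpu_groups=2)
print([dispatcher.dispatch(r) for r in range(6)])  # -> [0, 1, 0, 1, 0, 1]
```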
Finally, how are batches determined, and is each batch allocated to a specific GPU? I am asking because the MFU metric (based on utils/mfu_calculator.py) appears to be calculated batch by batch, and I am curious whether it reports the MFU aggregated over all GPUs or for the specific GPU each batch is assigned to.
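Here is a sketch of the MFU definition as I understand it (achieved FLOP/s over peak FLOP/s), just to make the question concrete; this function is mine and I am not claiming it matches utils/mfu_calculator.py:

```python
def batch_mfu(batch_flops: float,
              batch_execution_time_s: float,
              peak_flops_per_gpu: float,
              num_gpus: int = 1) -> float:
    # The num_gpus factor is exactly the ambiguity I am asking about: is the
    # per-batch MFU normalized by a single GPU's peak, or by the combined
    # peak of all GPUs the batch runs on?
    achieved_flops_per_s = batch_flops / batch_execution_time_s
    return achieved_flops_per_s / (peak_flops_per_gpu * num_gpus)

# e.g. 1e14 FLOPs executed in 1 s against a 312 TFLOP/s peak -> ~0.32 (32% MFU)
print(batch_mfu(1e14, 1.0, peak_flops_per_gpu=312e12))
```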
Thanks a lot!