Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMD integration #132

Open
wants to merge 14 commits into
base: master
Choose a base branch
from
Open

AMD integration #132

wants to merge 14 commits into from

Conversation

parthraut
Copy link
Collaborator

@parthraut parthraut commented Oct 17, 2024

  • Added amdsmi to project dependency
  • Implemented a non-blocking contructor of AMDGPU using concurrent.futures.ThreadPoolExecutor (AMDGPUs constructor takes 0.5s (polling time) regardless of number of GPUs)
  • Right now, it succeeds if the measured value is in 1% of expected value, and waits 0.5s. These can be changed.

A few questions:

  1. What do you think about the threshold (1%) and wait time (0.5s)?
  2. This is failing pyright check:
    zeus/zeus/device/gpu/amd.py:247:13 - error: Operator "*" not supported for types "c_uint32" and "Literal[1000]" (reportOperatorIssue) zeus/zeus/device/gpu/amd.py:258:9 - error: Method "supportsGetTotalEnergyConsumption" overrides class "GPU" in an incompatible manner Positional parameter count mismatch; base method has 1, but override has 2 (reportIncompatibleMethodOverride) zeus/zeus/device/gpu/amd.py:280:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:282:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:293:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:344:19 - error: Cannot access attribute "value" for class "AmdSmiException" Attribute "value" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:346:37 - error: Cannot access attribute "msg" for class "AmdSmiException"
    I can fix these, but do you think this is the right approach? I wanted to make sure before tweaking base class signatures.

Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked at a high level. Thanks for your work!

I think 0.5 seconds is just right, assuming it works.

zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work! Just a couple nits :)

zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
Comment on lines 257 to 259
"""Check if the GPU supports retrieving total energy consumption. Returns a future object of the result."""
wait_time = 0.5 # seconds
threshold = 0.8 # 80% threshold
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to first check whether self._supportsGetTotalEnergyConsumption is not None and return the cached value for future invocations of this method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the current structure it's never called more than once. But that would be better, I'll add it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is part of the public API of GPU and GPUs, meaning people can always call it in their code. Thanks!

zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
@jaywonchung
Copy link
Member

jaywonchung commented Oct 27, 2024

@parthraut Is this ready for review? Did stuff work properly on the AMD HPC cluster?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants