-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Auto eval actually works #310
Conversation
# NOTE: we are assuming the first beaker dataset has the model | ||
# I have checked a couple of beaker jobs and found the first dataset is the model | ||
# but we should check this assumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yizhongw @jacob-morrison for multi-node job I assumed the first beaker dataset will have the model. Is this always the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked my experiment list. It seems so. I didn't see an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my one concern is if this is the same when preemption happens, but thats sort of an edge case so happy to merge and then patch later.
6804242
to
05a8f33
Compare
@@ -704,7 +763,7 @@ def upload_metadata_to_hf( | |||
# about a model for leaderboard displays. | |||
with open("tmp.json", "w") as f: | |||
json.dump(metadata_dict, f) | |||
api = HfApi(token=os.getenv("HF_TOKEN", None)) | |||
api = HfApi() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -1001,49 +1001,65 @@ def main(args: FlatArguments): | |||
) | |||
|
|||
# remove all checkpoints to save space | |||
if accelerator.is_main_process: | |||
if accelerator.is_local_main_process: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should remove the intermedaite checkpoints on all the nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems good
# NOTE: we are assuming the first beaker dataset has the model | ||
# I have checked a couple of beaker jobs and found the first dataset is the model | ||
# but we should check this assumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my one concern is if this is the same when preemption happens, but thats sort of an edge case so happy to merge and then patch later.
So auto eval that actually works (also with oe-eval). The idea is to do something like the following
10:30 - now testing with docker containers.
docker submission
mason submission