Adding the prodigy optimizer in training #368

Open
Mixomo opened this issue Oct 12, 2024 · 0 comments
Mixomo commented Oct 12, 2024

Extracted from a Stable Diffusion-oriented guide, but the efficiency of the optimizer applies equally to other networks and architectures.

There are two types of optimizer: adaptive and non-adaptive. Adaptive optimizers consume more VRAM but simplify learning rate adjustments. If VRAM allows, "Prodigy" is recommended.

Using Prodigy optimizer:

Prodigy offers an adaptive approach to the learning rate. It is a direct upgrade over DAdaptation and a clear winner over the rest of the optimizers, but at the cost of significant VRAM usage.

PREREQUISITES:

Set Learning Rates to 1: This is paramount. Prodigy will dynamically adjust the learning rate during training.

Extra Arguments: In the "Optimizer Extra Arguments" field, input the following settings for a good starting point: `decouple=True` `weight_decay=0.01` `d_coef=0.8` `use_bias_correction=True` `safeguard_warmup=True` `betas=0.9,0.99`
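If you are wiring these up in code rather than a GUI field, the same settings translate to keyword arguments. A minimal sketch of turning such `key=value` strings into a kwargs dict (the function name `parse_extra_args` is my own illustration, not part of any particular trainer; `ast.literal_eval` handles the booleans, floats, and the comma-separated betas tuple):

```python
import ast

def parse_extra_args(args):
    """Turn GUI-style "key=value" strings into a kwargs dict."""
    kwargs = {}
    for arg in args:
        key, _, value = arg.partition("=")
        # literal_eval parses True/False, 0.01, and "0.9,0.99" -> (0.9, 0.99)
        kwargs[key] = ast.literal_eval(value)
    return kwargs

extra = ["decouple=True", "weight_decay=0.01", "d_coef=0.8",
         "use_bias_correction=True", "safeguard_warmup=True", "betas=0.9,0.99"]
kwargs = parse_extra_args(extra)
# these could then be passed along the lines of
# Prodigy(model.parameters(), lr=1.0, **kwargs)
```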

Understanding Prodigy Parameters:

d_coef (range: 0.1 to 2, recommended: 0.8): This is the only parameter you should normally change. It scales how quickly Prodigy adapts its learning rate estimate. Generally, keep it under 1. For smaller datasets, consider higher values; if your model overfits without learning anything, lower it.

weight_decay (recommended: 0.01): This adds a penalty to the loss function that encourages smaller weight magnitudes. By keeping the weights small, weight decay simplifies the model, making it less likely to fit noise in the training data and more likely to generalize well to new, unseen data. It's a common technique in many optimization algorithms.
Some tutorials recommend values like 0.1 or even 0.5, but in my experience this is inefficient: it means decaying the weights by 10% or 50% every step (which you might realize is unwise). You can go as high as 0.05, but you shouldn't have to change anything. If your model overfits without learning anything, you might try raising it a bit.
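To see why large values are wasteful, here is a toy sketch of decoupled weight decay in isolation (not Prodigy's exact update): each step multiplies the weights by (1 − lr × weight_decay), so weight_decay=0.01 at lr=1 shrinks them about 1% per step, while 0.5 would halve them every step.

```python
def decoupled_weight_decay_step(w, grad_step, lr=1.0, weight_decay=0.01):
    # Decoupled: the decay shrinks the weight directly,
    # separate from the gradient step.
    return w * (1 - lr * weight_decay) - grad_step

# With no gradient signal, weight_decay=0.01 compounds to ~10% loss
# of magnitude over 10 steps (0.99 ** 10).
w = 1.0
for _ in range(10):
    w = decoupled_weight_decay_step(w, 0.0)
```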

safeguard_warmup: Set this to True if you use a warmup greater than 0; set it to False otherwise.

decouple: Keep this set to True. You can read about what it does in the Prodigy repository (https://github.com/konstmish/prodigy) if you wish.

betas: Leave this at the default "0.9,0.99" for Stable Diffusion. Understanding betas requires a deeper dive into optimizer mechanics, which I will not do in this guide.

Prodigy works well with the cosine LR scheduler.
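For reference, a cosine schedule simply scales the base LR (left at 1.0 for Prodigy, so it scales Prodigy's internally estimated step size) down to a minimum over training. A minimal sketch:

```python
import math

def cosine_lr(step, total_steps, base_lr=1.0, min_lr=0.0):
    # Cosine anneal from base_lr down to min_lr over total_steps.
    t = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

So the multiplier starts at 1.0, passes 0.5 at the halfway point, and reaches min_lr at the end.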


If this were implemented in All Talk in the future, I think it could help achieve better training for those whose GPU has a good amount of VRAM.

@erew123 erew123 mentioned this issue Oct 14, 2024