Extracted from a Stable Diffusion-oriented guide, but the optimizer's efficiency applies equally to other networks and architectures.
There are two types of optimizer: adaptive and non-adaptive. Adaptive optimizers consume more VRAM but simplify learning rate adjustment. If VRAM allows, "Prodigy" is recommended.
Using Prodigy optimizer:
Prodigy offers an adaptive approach to the learning rate. It is a direct upgrade over DAdaptation and a clear winner over the rest of the optimizers, but at the cost of significant VRAM usage.
PREREQUISITES:
Set Learning Rates to 1: This is paramount. Prodigy will dynamically adjust the learning rate during training.
Extra Arguments: In the "Optimizer Extra Arguments" field, input the following settings for a good starting point: "decouple=True" "weight_decay=0.01" "d_coef=0.8" "use_bias_correction=True" "safeguard_warmup=True" "betas=0.9,0.99"
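To make the mapping concrete, here is a small sketch of how that extra-arguments string translates into keyword arguments for the Prodigy constructor. The parameter names come from the list above; the parsing helper itself is hypothetical, written just for illustration:

```python
def parse_extra_args(arg_string):
    """Split quoted 'key=value' tokens and coerce values to bool/float/tuple."""
    kwargs = {}
    for token in arg_string.split():
        key, _, raw = token.strip('"').partition("=")
        if raw in ("True", "False"):
            kwargs[key] = raw == "True"
        elif "," in raw:  # e.g. betas=0.9,0.99 becomes a tuple of floats
            kwargs[key] = tuple(float(v) for v in raw.split(","))
        else:
            kwargs[key] = float(raw)
    return kwargs

args = parse_extra_args(
    '"decouple=True" "weight_decay=0.01" "d_coef=0.8" '
    '"use_bias_correction=True" "safeguard_warmup=True" "betas=0.9,0.99"'
)
# With the prodigyopt package installed, these would be passed through
# to the optimizer (lr=1 as required above):
#   from prodigyopt import Prodigy
#   optimizer = Prodigy(model.parameters(), lr=1.0, **args)
```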
Understanding Prodigy Parameters:
d_coef (range: 0.1 to 2, recommended: 0.8): the only parameter you should need to change. It scales how aggressively Prodigy adjusts the learning rate. Generally, keep it under 1; for smaller datasets, consider higher values. If your model overfits without learning anything, lower it.
weight_decay (recommended: 0.01): Adds a penalty to the loss function that encourages smaller weight magnitudes. By keeping the weights small, weight decay simplifies the model, making it less likely to fit noise in the training data and more likely to generalize to new, unseen data. It's a common regularization technique in many optimization algorithms.
Some tutorials recommend values like 0.1 or even 0.5, but in my experience this is inefficient: it effectively decays 10% or 50% of what was learned at every step (which you might realize is unwise). You can go as high as 0.05, but you shouldn't need to change anything. If your model overfits without learning anything, try raising it slightly.
safeguard_warmup: Set this to True if you use a warmup greater than 0; False otherwise.
decouple: Keep this set to True. You can read about it in the Prodigy paper (linked from https://github.com/konstmish/prodigy) if you wish to know what it does.
betas: Leave this at the default "0.9,0.99" for Stable Diffusion. Understanding betas requires a deeper dive into optimizer mechanics, which is beyond the scope of this guide.
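To illustrate what decoupled weight decay (decouple=True plus weight_decay) actually does, here is a minimal sketch of one parameter update in plain Python. It uses the AdamW-style decoupled form simplified to plain gradient descent; the function name and values are illustrative, not from the guide:

```python
def decoupled_update(w, grad, lr=1.0, weight_decay=0.01):
    """One decoupled-weight-decay step: the decay term is applied
    directly to the weight, separately from the gradient step."""
    return w - lr * grad - lr * weight_decay * w

w = 2.0
w = decoupled_update(w, grad=0.0)  # with zero gradient, only the decay acts
# the weight shrank by lr * weight_decay (1% of its magnitude here),
# which is why large values like 0.5 erase so much of each step's progress
```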
Prodigy works well with the cosine LR scheduler.
Implementing this in All Talk in the future could, I think, help achieve better training for those whose GPU has a good amount of VRAM.
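Since Prodigy adapts the base step size itself, the cosine scheduler only multiplies it by a factor that decays from 1 to 0 over the run. A minimal sketch of that factor (the step counts here are illustrative):

```python
import math

def cosine_factor(step, total_steps):
    """Cosine decay of the LR multiplier: 1.0 at the start of training,
    0.0 at the end, following 0.5 * (1 + cos(pi * t / T))."""
    return 0.5 * (1 + math.cos(math.pi * step / total_steps))

# With Prodigy's lr fixed at 1, the effective multiplier at mid-training:
print(round(cosine_factor(500, 1000), 3))  # → 0.5
```

In PyTorch this corresponds to wrapping the optimizer in torch.optim.lr_scheduler.CosineAnnealingLR with T_max set to the total step count.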