
Continuous-SAC-Pytorch

Reproduce results from the continuous SAC paper.

This repo is based on several SAC implementations, mainly Stable-Baselines3, the authors' implementation, and SAC-Continuous-Pytorch.

Installation

After cloning the repo, install the requirements by running

pip install -r requirements.txt

Alternatively, the package can be installed directly with pip

pip install git+https://github.com/giangbang/Continuous-SAC.git

How to run

python train.py --env_name HalfCheetah-v4 --total_env_step 1000000 --buffer_size 1000000 --actor_log_std_min -20 --batch_size 256 --eval_interval 5000 --critic_tau 0.005 --alpha_lr 3e-4 --num_layers 3 --critic_lr 3e-4 --actor_lr 3e-4 --init_temperature 1 --hidden_dim 256 --reward_scale .2 --train_freq 1 --gradient_steps 1

Some benchmark environments from Gymnasium, for example the Mujoco tasks or CarRacing and LunarLanderContinuous, need to be installed separately with pip install gymnasium[mujoco] or pip install gymnasium[box2d].
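Once the extras are installed, a quick way to check that an environment can be created (a minimal sketch using the standard Gymnasium API):

```python
import gymnasium as gym

# Raises an error if gymnasium[mujoco] is not installed correctly.
env = gym.make("HalfCheetah-v4")
obs, info = env.reset(seed=0)
print(env.observation_space, env.action_space)
env.close()
```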

If the package was installed via setup.py, training can also be launched from the terminal through the entry point

sac_continuous --env_name HalfCheetah-v4 --total_env_step 1_000_000

Results

Most of the experiments used the same hyper-parameters, shown in the table below. Set seed to -1 to use a random seed for every run.

| Hyper-parameter   | Value   | Hyper-parameter   | Value  |
|-------------------|---------|-------------------|--------|
| reward_scale      | 1.0     | critic_lr         | 0.0003 |
| buffer_size       | 1000000 | critic_tau        | 0.005  |
| start_step        | 1000    | actor_lr          | 0.0003 |
| total_env_step    | 1000000 | actor_log_std_min | -20.0  |
| batch_size        | 256     | actor_log_std_max | 2      |
| hidden_dim        | 256     | num_layers        | 3      |
| gradient_steps    | 1       | discount          | 0.99   |
| train_freq        | 1       | init_temperature  | 0.2    |
| eval_interval     | 5000    | alpha_lr          | 0.0003 |
| num_eval_episodes | 10      | seed              | -1     |

Comments

Here are some minor implementation details that are nonetheless crucial to achieving the desired performance:

  • Handle done separately for truncation and termination. SAC performs much worse in some environments when this is not implemented correctly (about a 2k difference in reward in Half-Cheetah); see the sketch after this list.
  • Using the ReLU activation function slightly increases performance compared to Tanh. I suspect that a three-layer Tanh network is not expressive enough to learn the value function of tasks with a large reward range like Mujoco.
  • Using eps=1e-5 in the Adam optimizer, as suggested in Stable-Baselines3, does not provide any significant boost.
  • The initial temperature alpha (entropy coefficient) can impact the final performance more than one might expect. In Half-Cheetah, starting alpha at 0.2 versus 1 can yield a gap of ~1-2k in final performance.
  • Changing actor_log_std_min from -20 to -10 can sometimes reduce performance, though this may not be consistent across seeds.
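A minimal sketch of the first point, assuming a Gymnasium-style step loop; the names env and replay_buffer are illustrative and not taken from this repo:

```python
# Bootstrap only on true termination, not on time-limit truncation.
# `terminated`: the MDP reached a terminal state -> no bootstrapping.
# `truncated`: the episode was cut short (e.g. TimeLimit) -> still bootstrap.
next_obs, reward, terminated, truncated, info = env.step(action)

# Store only the termination flag; the TD target masks the bootstrap with it:
#   y = r + gamma * (1 - terminated) * (min_i Q_target_i(s', a') - alpha * log_pi(a'|s'))
replay_buffer.add(obs, action, reward, next_obs, done=float(terminated))

# Reset when the episode ends for either reason.
if terminated or truncated:
    obs, info = env.reset()
else:
    obs = next_obs
```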
