This repository provides a custom gridworld environment built with the MiniGrid framework (`MiniGridEnv`) and trains a reinforcement learning agent on it using Proximal Policy Optimization (PPO) from Tianshou. The agent operates in a gridworld containing resources (Iron Ore, Silver Ore, Platinum Ore, Gold Ore, Trees), a crafting table, and a chest; its objective is to collect resources, craft a sword, and then open the chest to find the treasure.
The environment is a 12x12 grid, with the agent starting at a random or specified position. It contains the following objects:
- Iron Ore (red)
- Silver Ore (grey)
- Platinum Ore (purple)
- Gold Ore (yellow)
- Tree (green)
- Chest (purple)
- Crafting Table (blue)
- Walls
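For orientation, here is a minimal sketch of how such a layout could be generated with `MiniGridEnv`. The actual repository defines its own world-object classes for the ores, tree, chest, and crafting table, so the `Ball`/`Box` stand-ins and the class name below are illustrative only.

```python
from minigrid.core.grid import Grid
from minigrid.core.mission import MissionSpace
from minigrid.core.world_object import Ball, Box
from minigrid.minigrid_env import MiniGridEnv


class CraftingEnvSketch(MiniGridEnv):
    """Illustrative 12x12 layout; the real env.py uses custom world objects."""

    def __init__(self, size=12, max_steps=300, **kwargs):
        mission_space = MissionSpace(
            mission_func=lambda: "collect resources, craft a sword, open the chest"
        )
        super().__init__(mission_space=mission_space, grid_size=size,
                         max_steps=max_steps, **kwargs)

    def _gen_grid(self, width, height):
        self.grid = Grid(width, height)
        self.grid.wall_rect(0, 0, width, height)  # surrounding walls
        # Stand-ins for the custom resources (colors follow the list above).
        self.place_obj(Ball("red"))     # Iron Ore
        self.place_obj(Ball("grey"))    # Silver Ore
        self.place_obj(Ball("purple"))  # Platinum Ore
        self.place_obj(Ball("yellow"))  # Gold Ore
        self.place_obj(Ball("green"))   # Tree
        self.place_obj(Box("purple"))   # Chest
        self.place_obj(Box("blue"))     # Crafting Table
        self.place_agent()              # random start position
        self.mission = "collect resources, craft a sword, open the chest"
```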
The agent uses LiDAR to detect nearby objects in the environment, and it has an inventory to store collected resources.
The agent can perform the following actions:
- `move_forward`: Move the agent forward by one step.
- `turn_left`: Rotate the agent 90 degrees to the left.
- `turn_right`: Rotate the agent 90 degrees to the right.
- `toggle`: Interact with objects (collect resources or interact with boxes).
- `craft_sword`: Craft a sword using resources in the inventory.
- `open_chest`: Open the chest to win the game if the agent has crafted a sword.
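As a rough sketch, these actions could be exposed as a discrete Gymnasium action space; the indices below are illustrative, since the actual ordering is defined in env.py.

```python
from enum import IntEnum

from gymnasium import spaces


class AgentAction(IntEnum):
    """Illustrative action indices; the real ordering lives in env.py."""
    MOVE_FORWARD = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2
    TOGGLE = 3
    CRAFT_SWORD = 4
    OPEN_CHEST = 5


action_space = spaces.Discrete(len(AgentAction))  # 6 discrete actions
```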
The observation space consists of:
- LiDAR: A grid with 8 beams, each detecting one of the 8 possible objects in the environment.
- Inventory: The agent’s inventory containing resources it has collected.
The LiDAR data is flattened and concatenated with the inventory data to form the final observation space.
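A minimal sketch of how the flattened observation might be assembled; the array shapes, constants, and helper name below are assumptions, not the repository's exact layout.

```python
import numpy as np

NUM_BEAMS = 8      # LiDAR beams
NUM_OBJECTS = 8    # object types a beam can detect
NUM_RESOURCES = 5  # inventory slots: iron, silver, platinum, gold, wood


def build_observation(lidar_grid: np.ndarray, inventory: np.ndarray) -> np.ndarray:
    """Flatten the 8x8 LiDAR grid and append the inventory counts."""
    assert lidar_grid.shape == (NUM_BEAMS, NUM_OBJECTS)
    assert inventory.shape == (NUM_RESOURCES,)
    return np.concatenate([lidar_grid.ravel(), inventory]).astype(np.float32)
```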
The reward structure is as follows:
- Per-Step Penalty: -1 for each time step, to encourage faster completion of the task.
- Resource Collection: +1 for collecting a new resource.
- Crafting the Sword: +50 (only during the first `max_reward_episodes` episodes).
- Opening the Chest: +1000 for successfully opening the chest after crafting the sword.
- Failures: -1 for attempting invalid actions (e.g., crafting without the necessary resources).
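Putting these values together, the per-step reward could be computed along the following lines. The helper and its argument names are hypothetical; the constants mirror the list above.

```python
def step_reward(collected_new_resource: bool,
                crafted_sword: bool,
                opened_chest: bool,
                invalid_action: bool,
                episode_idx: int,
                max_reward_episodes: int) -> float:
    """Illustrative reward shaping matching the values listed above."""
    reward = -1.0                      # per-step penalty
    if collected_new_resource:
        reward += 1.0                  # picked up a new resource
    if crafted_sword and episode_idx < max_reward_episodes:
        reward += 50.0                 # crafting bonus, only early in training
    if opened_chest:
        reward += 1000.0               # task success
    if invalid_action:
        reward -= 1.0                  # e.g. crafting without the resources
    return reward
```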
The training is implemented using the PPO algorithm from Tianshou. The agent is trained across 8 parallel environments, using a replay buffer to store experience. The policy and value networks are trained jointly using the collected experience.
The agent's training will stop once it reaches a success rate of 90% over the last 10 evaluations. Success is defined as successfully opening the chest.
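A condensed sketch of what that setup looks like with Tianshou is shown below. It is written against the Tianshou 0.x API (newer releases wrap `onpolicy_trainer` in an `OnpolicyTrainer` class), the `CraftingEnv` import is hypothetical, and the actual PPO.py will differ in hyperparameters, network shapes, and its 90%-success-rate stopping rule (here approximated by a mean-reward threshold).

```python
import torch
from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import SubprocVectorEnv
from tianshou.policy import PPOPolicy
from tianshou.trainer import onpolicy_trainer
from tianshou.utils.net.common import Net
from tianshou.utils.net.discrete import Actor, Critic

from env import CraftingEnv  # hypothetical import; env.py defines the real class

# 8 parallel training environments, a couple for evaluation.
train_envs = SubprocVectorEnv([lambda: CraftingEnv() for _ in range(8)])
test_envs = SubprocVectorEnv([lambda: CraftingEnv() for _ in range(2)])

probe_env = CraftingEnv()
state_shape = probe_env.observation_space.shape
action_shape = probe_env.action_space.n

# Shared feature extractor with separate actor/critic heads.
net = Net(state_shape, hidden_sizes=[128, 128])
actor = Actor(net, action_shape)
critic = Critic(net)
optim = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=3e-4
)

policy = PPOPolicy(actor, critic, optim, torch.distributions.Categorical)

# Experience from the 8 workers is stored in a vectorized replay buffer.
train_collector = Collector(policy, train_envs, VectorReplayBuffer(20000, 8))
test_collector = Collector(policy, test_envs)

result = onpolicy_trainer(
    policy, train_collector, test_collector,
    max_epoch=100, step_per_epoch=10000, repeat_per_collect=4,
    episode_per_test=10, batch_size=256, step_per_collect=2000,
    # Proxy stop rule: the chest bonus dominates, so a high mean test
    # reward roughly corresponds to reliably opening the chest.
    stop_fn=lambda mean_rewards: mean_rewards >= 900,
)
```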
The environment also supports manual control via keyboard inputs. You can control the agent using the following keys:
- Arrow Keys: Move the agent (left, right, up).
- Spacebar: Interact with objects (collect resources).
- C: Craft the sword.
- O: Open the chest.
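Internally, the manual-control loop only needs a key-to-action mapping along these lines; the key names and action labels below are illustrative, and env.py handles the actual event loop.

```python
# Illustrative mapping from pressed keys to agent actions.
KEY_TO_ACTION = {
    "left": "turn_left",
    "right": "turn_right",
    "up": "move_forward",
    "space": "toggle",       # collect resources / interact
    "c": "craft_sword",
    "o": "open_chest",
}
```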
The project requires:
- Python 3.11+
- Gymnasium
- MiniGrid
- Tianshou
- PyTorch
- NumPy
- TensorBoard
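The dependency list above corresponds to a requirements.txt along these lines; exact version pins are not specified here and are an assumption.

```text
gymnasium
minigrid
tianshou
torch
numpy
tensorboard
```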
To set up the project:
- Clone the repository.
- Create a virtual environment and activate it:
  python3 -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  or with Conda:
  conda create --name see_say python=3.11.4
  conda activate see_say
- Install the dependencies:
  pip install -r requirements.txt
To train the agent using PPO (`PPO.py`), run the following command:
python run.py
You can modify various aspects of the training via command-line arguments:
- `--use-wrapper`: Use the environment wrapper that encodes constraints (default: `False`).
- `--use-attention`: Enable the attention mechanism for constraint-based observations (default: `False`).
- `--device`: Specify the device to run the training on (`cpu`, `cuda`, or `mps`). By default, the script automatically chooses the most suitable device.
- `--max-episodes`: Set the maximum number of training episodes (default: `100000`).
- `--max-timesteps`: Set the maximum number of timesteps per episode (default: `300`).
- `--update-timestep`: Set the number of timesteps before updating the PPO agent (default: `2000`).
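A sketch of how run.py might define these flags with argparse; the actual parser may differ, but the flag names and defaults follow the list above.

```python
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Train the PPO agent")
    parser.add_argument("--use-wrapper", action="store_true",
                        help="use the constraint-encoding environment wrapper")
    parser.add_argument("--use-attention", action="store_true",
                        help="enable attention over constraint-based observations")
    parser.add_argument("--device", type=str, default=None,
                        choices=["cpu", "cuda", "mps"],
                        help="training device; auto-detected when omitted")
    parser.add_argument("--max-episodes", type=int, default=100000)
    parser.add_argument("--max-timesteps", type=int, default=300)
    parser.add_argument("--update-timestep", type=int, default=2000)
    return parser.parse_args()
```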
Example usage:
python run.py --use-wrapper --use-attention --device cuda --max-episodes 50000 --max-timesteps 400
This will start the training process and log results to TensorBoard.
For TensorBoard logging, run:
tensorboard --logdir=log/
To manually control the agent, run the following command:
python env.py
You can then control the agent with the keyboard as described in the Manual Control section.