This repository provides a custom gridworld environment built with the MiniGrid framework (`MiniGridEnv`) and trains a reinforcement learning agent on it using Proximal Policy Optimization (PPO) from Tianshou. The agent operates in a gridworld containing resources (Iron Ore, Silver Ore, Platinum Ore, Gold Ore, Trees), a crafting table, and a chest; its objective is to collect resources, craft a sword, and then open the chest to find the treasure.
The environment is a 12x12 grid, with the agent starting at a random or specified position. It contains the following objects:
- Iron Ore (red)
- Silver Ore (grey)
- Platinum Ore (purple)
- Gold Ore (yellow)
- Tree (green)
- Chest (purple)
- Crafting Table (blue)
- Walls
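For orientation, here is a minimal sketch of how such a layout could be generated with `MiniGridEnv`. The actual repository defines its own world-object classes for the ores, tree, chest, and crafting table, so the `Ball`/`Box` stand-ins and the class name below are illustrative only.

```python
from minigrid.core.grid import Grid
from minigrid.core.mission import MissionSpace
from minigrid.core.world_object import Ball, Box
from minigrid.minigrid_env import MiniGridEnv


class CraftingEnvSketch(MiniGridEnv):
    """Illustrative 12x12 layout; the real env.py uses custom world objects."""

    def __init__(self, size=12, max_steps=300, **kwargs):
        mission_space = MissionSpace(
            mission_func=lambda: "collect resources, craft a sword, open the chest"
        )
        super().__init__(mission_space=mission_space, grid_size=size,
                         max_steps=max_steps, **kwargs)

    def _gen_grid(self, width, height):
        self.grid = Grid(width, height)
        self.grid.wall_rect(0, 0, width, height)  # surrounding walls
        # Stand-ins for the custom resources (colors follow the list above).
        self.place_obj(Ball("red"))     # Iron Ore
        self.place_obj(Ball("grey"))    # Silver Ore
        self.place_obj(Ball("purple"))  # Platinum Ore
        self.place_obj(Ball("yellow"))  # Gold Ore
        self.place_obj(Ball("green"))   # Tree
        self.place_obj(Box("purple"))   # Chest
        self.place_obj(Box("blue"))     # Crafting Table
        self.place_agent()              # random start position
        self.mission = "collect resources, craft a sword, open the chest"
```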
The agent uses LiDAR to detect nearby objects in the environment, and it has an inventory to store collected resources.
The agent can perform the following actions:
- `move_forward`: Move the agent forward by one step.
- `turn_left`: Rotate the agent 90 degrees to the left.
- `turn_right`: Rotate the agent 90 degrees to the right.
- `toggle`: Interact with objects (collect resources or interact with boxes).
- `craft_sword`: Craft a sword using resources in the inventory.
- `open_chest`: Open the chest to win the game if the agent has crafted a sword.
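As a rough sketch, these actions could be exposed as a discrete Gymnasium action space; the indices below are illustrative, since the actual ordering is defined in env.py.

```python
from enum import IntEnum

from gymnasium import spaces


class AgentAction(IntEnum):
    """Illustrative action indices; the real ordering lives in env.py."""
    MOVE_FORWARD = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2
    TOGGLE = 3
    CRAFT_SWORD = 4
    OPEN_CHEST = 5


action_space = spaces.Discrete(len(AgentAction))  # 6 discrete actions
```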
The observation space consists of:
- LiDAR: A grid with 8 beams, each detecting one of the 8 possible objects in the environment.
- Inventory: The agent’s inventory containing resources it has collected.
The LiDAR data is flattened and concatenated with the inventory data to form the final observation space.
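A minimal sketch of how the flattened observation might be assembled; the array shapes, constants, and helper name below are assumptions, not the repository's exact layout.

```python
import numpy as np

NUM_BEAMS = 8      # LiDAR beams
NUM_OBJECTS = 8    # object types a beam can detect
NUM_RESOURCES = 5  # inventory slots: iron, silver, platinum, gold, wood


def build_observation(lidar_grid: np.ndarray, inventory: np.ndarray) -> np.ndarray:
    """Flatten the 8x8 LiDAR grid and append the inventory counts."""
    assert lidar_grid.shape == (NUM_BEAMS, NUM_OBJECTS)
    assert inventory.shape == (NUM_RESOURCES,)
    return np.concatenate([lidar_grid.ravel(), inventory]).astype(np.float32)
```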
The reward structure is as follows:
- Per-Step Penalty: -1 for each time step, to encourage faster completion of the task.
- Resource Collection: +1 for collecting a new resource.
- Crafting the Sword: +50 (only during the first `max_reward_episodes` episodes).
- Opening the Chest: +1000 for successfully opening the chest after crafting the sword.
- Failures: -1 for attempting invalid actions (e.g., crafting without the necessary resources).
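Putting these values together, the per-step reward could be computed along the following lines. The helper and its argument names are hypothetical; the constants mirror the list above.

```python
def step_reward(collected_new_resource: bool,
                crafted_sword: bool,
                opened_chest: bool,
                invalid_action: bool,
                episode_idx: int,
                max_reward_episodes: int) -> float:
    """Illustrative reward shaping matching the values listed above."""
    reward = -1.0                      # per-step penalty
    if collected_new_resource:
        reward += 1.0                  # picked up a new resource
    if crafted_sword and episode_idx < max_reward_episodes:
        reward += 50.0                 # crafting bonus, only early in training
    if opened_chest:
        reward += 1000.0               # task success
    if invalid_action:
        reward -= 1.0                  # e.g. crafting without the resources
    return reward
```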
The training is implemented using the PPO algorithm from Tianshou. The agent is trained across 8 parallel environments, using a replay buffer to store experience. The policy and value networks are trained jointly using the collected experience.
The agent's training will stop once it reaches a success rate of 90% over the last 10 evaluations. Success is defined as successfully opening the chest.
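A condensed sketch of what that setup looks like with Tianshou is shown below. It is written against the Tianshou 0.x API (newer releases wrap `onpolicy_trainer` in an `OnpolicyTrainer` class), the `CraftingEnv` import is hypothetical, and the actual PPO.py will differ in hyperparameters, network shapes, and its 90%-success-rate stopping rule (here approximated by a mean-reward threshold).

```python
import torch
from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import SubprocVectorEnv
from tianshou.policy import PPOPolicy
from tianshou.trainer import onpolicy_trainer
from tianshou.utils.net.common import Net
from tianshou.utils.net.discrete import Actor, Critic

from env import CraftingEnv  # hypothetical import; env.py defines the real class

# 8 parallel training environments, a couple for evaluation.
train_envs = SubprocVectorEnv([lambda: CraftingEnv() for _ in range(8)])
test_envs = SubprocVectorEnv([lambda: CraftingEnv() for _ in range(2)])

probe_env = CraftingEnv()
state_shape = probe_env.observation_space.shape
action_shape = probe_env.action_space.n

# Shared feature extractor with separate actor/critic heads.
net = Net(state_shape, hidden_sizes=[128, 128])
actor = Actor(net, action_shape)
critic = Critic(net)
optim = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=3e-4
)

policy = PPOPolicy(actor, critic, optim, torch.distributions.Categorical)

# Experience from the 8 workers is stored in a vectorized replay buffer.
train_collector = Collector(policy, train_envs, VectorReplayBuffer(20000, 8))
test_collector = Collector(policy, test_envs)

result = onpolicy_trainer(
    policy, train_collector, test_collector,
    max_epoch=100, step_per_epoch=10000, repeat_per_collect=4,
    episode_per_test=10, batch_size=256, step_per_collect=2000,
    # Proxy stop rule: the chest bonus dominates, so a high mean test
    # reward roughly corresponds to reliably opening the chest.
    stop_fn=lambda mean_rewards: mean_rewards >= 900,
)
```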
The environment also supports manual control via keyboard inputs. You can control the agent using the following keys:
- Arrow Keys: Move the agent (left, right, up).
- Spacebar: Interact with objects (collect resources).
- C: Craft the sword.
- O: Open the chest.
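Internally, the manual-control loop only needs a key-to-action mapping along these lines; the key names and action labels below are illustrative, and env.py handles the actual event loop.

```python
# Illustrative mapping from pressed keys to agent actions.
KEY_TO_ACTION = {
    "left": "turn_left",
    "right": "turn_right",
    "up": "move_forward",
    "space": "toggle",       # collect resources / interact
    "c": "craft_sword",
    "o": "open_chest",
}
```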
The project requires:
- Python 3.11+
- Gymnasium
- MiniGrid
- Tianshou
- PyTorch
- NumPy
- TensorBoard
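The dependency list above corresponds to a requirements.txt along these lines; exact version pins are not specified here and are an assumption.

```text
gymnasium
minigrid
tianshou
torch
numpy
tensorboard
```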
To set up the project:
- Clone the repository.
- Create a virtual environment and activate it:
  python3 -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  or with Conda:
  conda create --name see_say python=3.11.4
  conda activate see_say
- Install the dependencies:
  pip install -r requirements.txt
To train the agent using PPO (`PPO.py`), run the following command:
python run.py
You can modify various aspects of the training via command-line arguments:
- `--use-wrapper`: Use the environment wrapper that encodes constraints (default: `False`).
- `--use-attention`: Enable the attention mechanism for constraint-based observations (default: `False`).
- `--device`: Specify the device to run the training on (`cpu`, `cuda`, or `mps`). By default, the script automatically chooses the most suitable device.
- `--max-episodes`: Set the maximum number of training episodes (default: `100000`).
- `--max-timesteps`: Set the maximum number of timesteps per episode (default: `300`).
- `--update-timestep`: Set the number of timesteps before updating the PPO agent (default: `2000`).
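A sketch of how run.py might define these flags with argparse; the actual parser may differ, but the flag names and defaults follow the list above.

```python
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Train the PPO agent")
    parser.add_argument("--use-wrapper", action="store_true",
                        help="use the constraint-encoding environment wrapper")
    parser.add_argument("--use-attention", action="store_true",
                        help="enable attention over constraint-based observations")
    parser.add_argument("--device", type=str, default=None,
                        choices=["cpu", "cuda", "mps"],
                        help="training device; auto-detected when omitted")
    parser.add_argument("--max-episodes", type=int, default=100000)
    parser.add_argument("--max-timesteps", type=int, default=300)
    parser.add_argument("--update-timestep", type=int, default=2000)
    return parser.parse_args()
```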
Example usage:
python run.py --use-wrapper --use-attention --device cuda --max-episodes 50000 --max-timesteps 400
This will start the training process and log results to TensorBoard.
For TensorBoard logging, run:
tensorboard --logdir=log/
To manually control the agent, run the following command:
python env.py
You can then control the agent with the keyboard as described in the Manual Control section.