
PrefMMT

This repository contains the source code for our paper: "PrefMMT: Modeling Human Preferences in Preference-based Reinforcement Learning with Multimodal Transformers", submitted to the 2025 IEEE International Conference on Robotics and Automation (ICRA 2025). For more information, visit our project website.

Abstract

Preference-based Reinforcement Learning (PbRL) is a promising approach for aligning robot behaviors with human preferences, but its effectiveness relies on accurately modeling those preferences through reward models. Traditional methods often assume preferences are Markovian, neglecting the temporal dependencies within robot behavior trajectories that influence human evaluations. While recent approaches use sequence modeling to learn non-Markovian rewards, they overlook the multimodal nature of robot trajectories, which consist of both state and action elements. This oversight limits their ability to capture the intricate interplay between these modalities, which is critical in shaping human preferences.

In this work, we introduce PrefMMT, a multimodal transformer network designed to disentangle and model the state and action modalities separately. PrefMMT hierarchically leverages intra-modal temporal dependencies and inter-modal state-action interactions to capture complex preference patterns. Our experiments show that PrefMMT consistently outperforms state-of-the-art baselines on locomotion tasks from the D4RL benchmark and manipulation tasks from the Meta-World benchmark.
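As a concrete illustration of the preference-modeling setup described above, the sketch below shows the standard Bradley-Terry formulation used throughout PbRL: a learned reward model scores two trajectory segments, and the summed predicted rewards are turned into the probability that a human prefers one segment over the other. This is a minimal sketch of the general recipe, not the paper's exact implementation; `reward_model`, the parameter container, and the array shapes are placeholders.

    import jax
    import jax.numpy as jnp

    def preference_loss(reward_model, params, seg0, seg1, label):
        """Negative Bradley-Terry log-likelihood of a human preference label.

        seg0 / seg1: (states, actions) arrays for two trajectory segments.
        label: 0 if the annotator preferred seg0, 1 if they preferred seg1.
        """
        r0 = reward_model(params, *seg0).sum()  # summed predicted reward, segment 0
        r1 = reward_model(params, *seg1).sum()  # summed predicted reward, segment 1
        logits = jnp.stack([r0, r1])
        return -jax.nn.log_softmax(logits)[label]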

Comparison: PrefMMT vs. Other Preference Modeling Methods

Comparison of PrefMMT with other methods

The diagram above highlights the key distinctions between PrefMMT and other existing preference modeling methods. While traditional approaches often make Markovian assumptions and fail to capture the multimodal interactions between state and action, PrefMMT addresses this gap by leveraging a multimodal transformer architecture. This allows for a more accurate and dynamic understanding of human preferences by modeling both intra-modal and inter-modal dependencies.

Architecture Overview

PrefMMT Architecture
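The figure is not reproduced here, but the following sketch illustrates, in simplified form, the hierarchy the architecture is built around: each modality first attends over its own time steps (intra-modal temporal dependencies), then the two modalities attend to each other (inter-modal state-action interactions), and the fused features are mapped to per-step rewards. Single-head attention, the parameter names, and the reward head below are simplifying assumptions for illustration, not the paper's actual layers.

    import jax
    import jax.numpy as jnp

    def attention(q, k, v):
        # Single-head scaled dot-product attention over a trajectory segment.
        scores = q @ k.T / jnp.sqrt(q.shape[-1])      # (T, T) attention logits
        return jax.nn.softmax(scores, axis=-1) @ v    # (T, d) attended values

    def segment_rewards(params, states, actions):
        # Project each modality into a shared embedding space.
        s = states @ params["W_s"]                    # (T, d) state tokens
        a = actions @ params["W_a"]                   # (T, d) action tokens
        # Intra-modal: each modality attends over its own time steps.
        s = s + attention(s, s, s)
        a = a + attention(a, a, a)
        # Inter-modal: state tokens attend to action tokens, and vice versa.
        s, a = s + attention(s, a, a), a + attention(a, s, s)
        # Per-step scalar rewards from the fused representation; w_r has shape (2d,).
        return jnp.tanh(jnp.concatenate([s, a], axis=-1) @ params["w_r"])  # (T,)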

Installation

Requirements

  1. Install dependencies:

    pip install --upgrade pip
    conda install -y -c conda-forge cudatoolkit=11.1 cudnn=8.2.1
    pip install -r requirements.txt
    cd d4rl
    pip install -e .
    cd ..
  2. Install JAX and jaxlib with GPU support matching the CUDA toolkit installed above (see the example below).
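For the CUDA 11.1 / cuDNN 8.2 toolchain installed in step 1, a command along the following lines should work. This follows the general JAX GPU installation instructions and is an assumption, not a command taken from this repository; check the JAX documentation for the wheel matching your driver and JAX version.

    pip install --upgrade "jax[cuda11_cudnn82]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html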

Run the code

Train the Reward Model

CUDA_VISIBLE_DEVICES=0 python -m JaxPref.main --use_human_label True --comment {experiment_name} --transformer.embd_dim 256 --transformer.n_layer 3 --transformer.n_head 4 --env {D4RL env name} --logging.output_dir './logs/pref_reward' --batch_size 256 --num_query {number of query} --query_len 100 --n_epochs 10000 --skip_flag 0 --seed {seed} --model_type PrefMMT
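For example, the following invocation fills in the placeholders with one possible configuration (the environment, query count, seed, and experiment name below are illustrative choices, not settings prescribed by the authors):

CUDA_VISIBLE_DEVICES=0 python -m JaxPref.main --use_human_label True --comment prefmmt_walker --transformer.embd_dim 256 --transformer.n_layer 3 --transformer.n_head 4 --env walker2d-medium-replay-v2 --logging.output_dir './logs/pref_reward' --batch_size 256 --num_query 500 --query_len 100 --n_epochs 10000 --skip_flag 0 --seed 0 --model_type PrefMMT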

Run IQL with the Learned Reward Model

CUDA_VISIBLE_DEVICES=0 python train_offline.py --seq_len {sequence length in reward prediction} --comment {experiment_name} --eval_interval {5000: mujoco / 100000: antmaze / 5000: metaworld} --env_name {d4rl env name} --config {configs/(mujoco|antmaze|metaworld)_config.py} --eval_episodes {100 for antmaze, 10 otherwise} --use_reward_model True --model_type PrefMMT --ckpt_dir {reward_model_path} --seed {seed}
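For example, to run IQL on a MuJoCo locomotion task with the reward model trained above (the sequence length, checkpoint path, and seed are illustrative; point --ckpt_dir at wherever your reward model checkpoint was saved):

CUDA_VISIBLE_DEVICES=0 python train_offline.py --seq_len 100 --comment prefmmt_walker --eval_interval 5000 --env_name walker2d-medium-replay-v2 --config configs/mujoco_config.py --eval_episodes 10 --use_reward_model True --model_type PrefMMT --ckpt_dir ./logs/pref_reward/prefmmt_walker --seed 0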

(The code was tested on Ubuntu 20.04 with Python 3.8.)

Acknowledgments

Our code is based on the implementations of PT, Flaxmodels, and IQL.
