[EMNLP 2024] Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models

This repository contains the code for the paper "Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models". Below is its workflow.

I. Quick Start with Interactive Mode

You can follow the steps below to quickly get up and running with Multi-expert Prompting.

In a conda env with PyTorch / CUDA available clone and download this repository.

Create and activate a new virtual environment.

conda create -n mep python=3.11
conda activate mep

In the top-level directory run:
```
pip install -r requirements.txt
```
To run OpenAI models, you need to export your API key:
```
export OPENAI_API_KEY=your_api_key_here
```
Once you got everything installed correctly, use the following command:
```
python src/interactive.py --model=[model] --num_experts=[number-of-experts] --temperature=[temperaure] [--verbose]
```
Currently, we support the following open-source (Mistral, Meta-llama) and proprietary models (OpenAI models):
- --model: gpt-4o, chatgpt-4o-latest, gpt-4o-2024-08-06, gpt-3.5-turbo, mistralai/Mistral-7B-Instruct-v0.2, meta-llama/Llama-3.1-8B-Instruct.
- --num_experts: any number. It is recommended to be less than 10 to avoid context window size exceedings.
- --temperature: often between 0 and 1.
Example with gpt-3.5-turbo with 3 experts and temperature equal 0:
```
python src/interactive.py --model="gpt-3.5-turbo" --num_experts=3 --temperature=0 --verbose
```

II. Benchmark Experiment and Evaluation Scripts

Benchmark experiments: Benchmarking data and scripts are coming soon! Alternatively, you can shortly customize src/interactive.py to run your own benchmark experiments.

Benchmark evaluations: We share our outputs in the folder: ./evaluation/results. To obtain the evaluation results, perform the following steps:

Navigate to the directory metrics.

cd Multi-expert-Prompting/evaluation/metrics

Run the scripts there to compute metrics:
```
python BOLD_compute.py
python TOXICITY_compute.py
python HONEST_compute.py
```
Note: Evaluation instructions for TruthfulQA, FactualityPrompt and ExpertQA are coming soon!

III. Main Results

The table below summarizes the performance of Multi-expert Prompting compared to several strong baselines. The details of our outputs are shared in the folder: ./evaluation/results.

Mistral-7B-Inst. v0.2	TruthfulQA ↑	FactualityPrompt ↓	HONEST ↓
Zero-shot	76.00	8.98/16.07	0.012/0.009
Zero-shot-CoT	78.70	9.28/14.87	0.014/0.013
Self-refine	81.88	10.36/14.95	0.007/0.008
Universal Self-consistency	81.64	9.98/15.21	0.007/0.008
Multi-agent Debate	80.78	17.57/18.27	0.004/0.007
ExpertPrompting	80.34	11.43/15.32	0.005/0.005
Multi-expert Prompting	87.15	8.16/14.70	0.003/0.005

ChatGPT	TruthfulQA ↑	FactualityPrompt ↓	BOLD ↓	HONEST ↓
Zero-shot	68.05	6.99/12.90	0.163	0.038/0.023
Zero-shot-CoT	70.38	6.93/13.75	0.163	0.006/0.005
Self-refine	75.89	7.11/13.96	0.064	0.006/0.007
Universal Self-consistency	77.11	5.51/9.71	0.000	0.010/0.008
Multi-agent Debate	64.87	5.64/13.06	0.000	0.005/0.004
ExpertPrompting	80.66	5.64/15.66	0.129	0.004/0.004
Multi-expert Prompting	89.35	4.54/9.45	0.000	0.004/0.003

Key: ↑ indicates higher is better; ↓ indicates lower is better.

IV. Issues

Please report any software “bug”, or other problems with the models through one of the following means:

This Github repo.
Do Xuan Long via xuanlong.do@u.nus.edu.

V. Citation and Acknowledgements

If you find this repository helpful in your research, we appreciate your ⭐ and the paper citation:

@misc{long2024multiexpertpromptingimprovesreliability,
      title={Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models}, 
      author={Do Xuan Long and Duong Ngoc Yen and Anh Tuan Luu and Kenji Kawaguchi and Min-Yen Kan and Nancy F. Chen},
      year={2024},
      eprint={2411.00492},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.00492}, 
}

We would like to acknowledge the Huggingface evaluate and Huggingface transformers.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

[EMNLP 2024] Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models

I. Quick Start with Interactive Mode

II. Benchmark Experiment and Evaluation Scripts

III. Main Results

IV. Issues

V. Citation and Acknowledgements

Files

README.md

Latest commit

History

README.md

File metadata and controls

[EMNLP 2024] Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models

I. Quick Start with Interactive Mode

II. Benchmark Experiment and Evaluation Scripts

III. Main Results

IV. Issues

V. Citation and Acknowledgements