
Merge pull request #175 from ChiYeungLaw/main
release wizardcoder-34B-python-v1.0
victorsungo authored Aug 26, 2023
2 parents 15aec23 + 789ca87 commit 8350e03
Showing 5 changed files with 165 additions and 27 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -26,12 +26,14 @@ Thanks to the enthusiastic friends, their video introductions are more lively and

## News

- We released **WizardCoder-15B-V1.0** , which surpasses **Claude-Plus (+6.8)**, **Bard (+15.3)** and **InstructCodeT5+ (+22.3)** on the [HumanEval Benchmarks](https://github.com/openai/human-eval). For more details, please refer to [WizardCoder](https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder).
- 🔥🔥🔥[2023/08/26] We released **WizardCoder-Python-34B-V1.0**, which achieves **73.2 pass@1** and surpasses **GPT4 (2023/03/15)**, **ChatGPT-3.5**, and **Claude2** on the [HumanEval Benchmarks](https://github.com/openai/human-eval). For more details, please refer to [WizardCoder](https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder).
- [2023/06/16] We released **WizardCoder-15B-V1.0**, which surpasses **Claude-Plus (+6.8)**, **Bard (+15.3)**, and **InstructCodeT5+ (+22.3)** on the [HumanEval Benchmarks](https://github.com/openai/human-eval). For more details, please refer to [WizardCoder](https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder).


| Model | Checkpoint | Paper | HumanEval | MBPP | Demo | License |
| ----- |------| ---- |------|-------| ----- | ----- |
| WizardCoder-15B-V1.0 | 🤗 <a href="https://huggingface.co/WizardLM/WizardCoder-15B-V1.0" target="_blank">HF Link</a> | 📃 <a href="https://arxiv.org/abs/2306.08568" target="_blank">[WizardCoder]</a> | 57.3 |51.8 | | <a href="https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement" target="_blank">OpenRAIL-M</a> |
| WizardCoder-Python-34B-V1.0 | 🤗 <a href="" target="_blank">HF Link</a> | 📃 <a href="https://arxiv.org/abs/2306.08568" target="_blank">[WizardCoder]</a> | 73.2 | 61.2 | [Demo (Only English)](http://47.103.63.15:50085/) | <a href="https://ai.meta.com/resources/models-and-libraries/llama-downloads/" target="_blank">Llama2</a> |
| WizardCoder-15B-V1.0 | 🤗 <a href="https://huggingface.co/WizardLM/WizardCoder-15B-V1.0" target="_blank">HF Link</a> | 📃 <a href="https://arxiv.org/abs/2306.08568" target="_blank">[WizardCoder]</a> | 59.8 |50.6 | -- | <a href="https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement" target="_blank">OpenRAIL-M</a> |



54 changes: 46 additions & 8 deletions WizardCoder/README.md
@@ -2,26 +2,39 @@

[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](CODE_LICENSE)
[![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)](DATA_LICENSE)
[![Model Weight License](https://img.shields.io/badge/Model%20Weights%20License-bigscience%20OpenRAIL%20M%20v1-yellow)](MODEL_WEIGHTS_LICENSE)
<!-- [![Model Weight License](https://img.shields.io/badge/Model%20Weights%20License-bigscience%20OpenRAIL%20M%20v1-yellow)](MODEL_WEIGHTS_LICENSE) -->
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)

To develop our WizardCoder model, we begin by adapting the Evol-Instruct method specifically for coding tasks. This involves tailoring the prompt to the domain of code-related instructions. Subsequently, we fine-tune the Code LLM, StarCoder, utilizing the newly created instruction-following training set.
To develop our WizardCoder model, we begin by adapting the Evol-Instruct method specifically for coding tasks. This involves tailoring the prompt to the domain of code-related instructions. Subsequently, we fine-tune the Code LLMs, StarCoder or Code Llama, utilizing the newly created instruction-following training set.
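As an illustration only (the concrete evolution prompts are given in the WizardCoder paper, and the template wording below is our own placeholder), adapting Evol-Instruct to the code domain amounts to wrapping a seed programming instruction in a code-specific evolution template before handing it to the instruction-evolving LLM. A minimal sketch:

```python
# Illustrative sketch of a code-specific Evol-Instruct step.
# The template text is a placeholder, not the exact prompt used for WizardCoder.
EVOLVE_TEMPLATE = """Please increase the difficulty of the given programming test question a bit.

You can increase the difficulty using, but not limited to, the following method:
{method}

#Given Question#:
{instruction}

#Rewritten Question#:
"""

def build_evol_prompt(instruction: str, method: str) -> str:
    """Wrap a seed coding instruction in an evolution prompt for the instruction-evolving LLM."""
    return EVOLVE_TEMPLATE.format(method=method, instruction=instruction)

if __name__ == "__main__":
    seed = "Write a Python function that reverses a string."
    method = "Add new constraints, e.g. require O(1) extra space and no built-in reverse helpers."
    print(build_evol_prompt(seed, method))
```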

## News

- We released **WizardCoder-15B-V1.0** , which achieves the **57.3 pass@1** and surpasses **Claude-Plus (+6.8)**, **Bard (+15.3)** and **InstructCodeT5+ (+22.3)** on the [HumanEval Benchmarks](https://github.com/openai/human-eval). For more details, please refer to [WizardCoder](https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder).
- 🔥🔥🔥[2023/08/26] We released **WizardCoder-Python-34B-V1.0**, which achieves **73.2 pass@1** and surpasses **GPT4 (2023/03/15)**, **ChatGPT-3.5**, and **Claude2** on the [HumanEval Benchmarks](https://github.com/openai/human-eval).
- [2023/06/16] We released **WizardCoder-15B-V1.0**, which achieves **57.3 pass@1** and surpasses **Claude-Plus (+6.8)**, **Bard (+15.3)**, and **InstructCodeT5+ (+22.3)** on the [HumanEval Benchmarks](https://github.com/openai/human-eval).

❗Note: There are two HumanEval results for GPT4 and ChatGPT-3.5. The scores 67.0 and 48.1 are reported in the official GPT4 report (2023/03/15) from [OpenAI](https://arxiv.org/abs/2303.08774), while 82.0 and 72.5 are the results we measured ourselves with the latest API (2023/08/26).


| Model | Checkpoint | Paper | HumanEval | MBPP | Demo | License |
| ----- |------| ---- |------|-------| ----- | ----- |
| WizardCoder-15B-V1.0 | 🤗 <a href="https://huggingface.co/WizardLM/WizardCoder-15B-V1.0" target="_blank">HF Link</a> | 📃 <a href="https://arxiv.org/abs/2306.08568" target="_blank">[WizardCoder]</a> | 57.3 |51.8 | | <a href="https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement" target="_blank">OpenRAIL-M</a> |
| WizardCoder-Python-34B-V1.0 | 🤗 <a href="" target="_blank">HF Link</a> | 📃 <a href="https://arxiv.org/abs/2306.08568" target="_blank">[WizardCoder]</a> | 73.2 | 61.2 | [Demo (Only English)](http://47.103.63.15:50085/) | <a href="https://ai.meta.com/resources/models-and-libraries/llama-downloads/" target="_blank">Llama2</a> |
| WizardCoder-15B-V1.0 | 🤗 <a href="https://huggingface.co/WizardLM/WizardCoder-15B-V1.0" target="_blank">HF Link</a> | 📃 <a href="https://arxiv.org/abs/2306.08568" target="_blank">[WizardCoder]</a> | 59.8 |50.6 | -- | <a href="https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement" target="_blank">OpenRAIL-M</a> |

- &#x1F4E3; Please refer to our Twitter account https://twitter.com/WizardLM_AI and HuggingFace Repo https://huggingface.co/WizardLM . We will use them to announce any new releases first.

## Comparing WizardCoder-Python-34B-V1.0 with Other LLMs.

- &#x1F4E3; Please refer to our Twitter account https://twitter.com/WizardLM_AI and HuggingFace Repo https://huggingface.co/WizardLM . We will use them to announce any new releases first.
🔥 The following figure shows that our **WizardCoder-Python-34B-V1.0 attains the second position in this benchmark**, surpassing GPT4 (2023/03/15, 73.2 vs. 67.0), ChatGPT-3.5 (73.2 vs. 72.5) and Claude2 (73.2 vs. 71.2).

<p align="center" width="100%">
<a ><img src="imgs/compare_sota.png" alt="WizardCoder" style="width: 96%; min-width: 300px; display: block; margin: auto;"></a>
</p>

❗❗❗**Note: This performance is 100% reproducible! If you cannot reproduce it, please follow the steps in [Evaluation](#evaluation).**

## Comparing WizardCoder with the Closed-Source Models.
❗Note: There are two HumanEval results for GPT4 and ChatGPT-3.5. The scores 67.0 and 48.1 are reported in the official GPT4 report (2023/03/15) from [OpenAI](https://arxiv.org/abs/2303.08774), while 82.0 and 72.5 are the results we measured ourselves with the latest API (2023/08/26).

## Comparing WizardCoder-15B-V1.0 with the Closed-Source Models.

🔥 The following figure shows that our **WizardCoder attains the third position in this benchmark**, surpassing Claude-Plus (59.8 vs. 53.0) and Bard (59.8 vs. 44.5). Notably, our model exhibits a substantially smaller size compared to these models.

@@ -33,7 +46,7 @@ To develop our WizardCoder model, we begin by adapting the Evol-Instruct method

**Note: In this study, we copy the scores for HumanEval and HumanEval+ from the [LLM-Humaneval-Benchmarks](https://github.com/my-other-github-account/llm-humaneval-benchmarks). Notably, all the mentioned models generate code solutions for each problem with a *single attempt*, and the resulting pass-rate percentage is reported. Our *WizardCoder* generates answers with greedy decoding and is evaluated with the same [code](https://github.com/evalplus/evalplus).**
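For reference, these single-attempt numbers are pass@1 scores. A minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); with greedy decoding, n = 1 and pass@1 is simply the fraction of problems whose lone completion passes its tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled per problem, c: completions passing the unit tests, k: budget.
    """
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With a single greedy attempt per problem, pass@1 is just the plain pass fraction.
print(pass_at_k(n=20, c=5, k=1))  # 0.25
```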

## Comparing WizardCoder with the Open-Source Models.
## Comparing WizardCoder-15B-V1.0 with the Open-Source Models.

The following table clearly demonstrates that our **WizardCoder** exhibits a substantial performance advantage over all the open-source models. ❗**If you are confused by the different scores of our model (57.3 and 59.8), please check the Notes.**

@@ -170,7 +183,9 @@ Below is an instruction that describes a task. Write a response that appropriate
### HumanEval

1. Install the environment according to the instructions of [HumanEval](https://github.com/openai/human-eval).
2. Run the following script to generate the answer.
2. Run the following scripts to generate the answers.

- (1) For WizardCoder-15B-V1.0 (based on StarCoder)
```bash
model="/path/to/your/model"
temp=0.2
@@ -203,6 +218,29 @@ for ((i = 0; i < $gpu_num; i++)); do
done
```

- (2) For WizardCoder-Python-34B-V1.0 (based on CodeLlama)

```bash
pip install vllm  # vLLM can accelerate the inference process a lot.
pip install transformers==4.31.0

model="/path/to/your/model"
temp=0.2
max_len=2048
pred_num=200
num_seqs_per_iter=2

output_path=preds/T${temp}_N${pred_num}

mkdir -p ${output_path}
echo 'Output path: '$output_path
echo 'Model to eval: '$model

CUDA_VISIBLE_DEVICES=0,1,2,3 python humaneval_gen_vllm.py --model ${model} \
  --start_index 0 --end_index 164 --temperature ${temp} \
  --num_seqs_per_iter ${num_seqs_per_iter} --N ${pred_num} --max_len ${max_len} --output_path ${output_path} --num_gpus 4
```

3. Run the post-processing script `src/process_humaneval.py` to collect the code completions from all answer files.
```bash
output_path=preds/T${temp}_N${pred_num}
```
Binary file added WizardCoder/imgs/compare_sota.png
18 changes: 1 addition & 17 deletions WizardCoder/src/humaneval_gen.py
@@ -19,24 +19,10 @@
except:
    pass

def extract_text(prompt, remove_lines=True):
    token = '\"\"\"'
    start = token
    end = '>>>'

    start_idx = prompt.find(start) + len(start)
    end_idx = prompt.find(end)

    output = prompt[start_idx: end_idx]
    if remove_lines:
        output = output.replace('\n', ' ')
    output = re.sub(r"\s+", " ", output).strip()

    return output

def generate_prompt(input):
    INSTRUCTION = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Create a Python script for this problem:
{input}
@@ -98,8 +84,6 @@ def main():
    argsdict = vars(args)
    print(pprint.pformat(argsdict))

    STOP_SEQS = ['\nclass', '\ndef', '\n#', '\nif', '\nprint']

    problems = read_problems()

    task_ids = sorted(problems.keys())[args.start_index: args.end_index]
114 changes: 114 additions & 0 deletions WizardCoder/src/humaneval_gen_vllm.py
@@ -0,0 +1,114 @@
import argparse
import pprint
import sys
import os
import re
from tqdm import tqdm
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM, GenerationConfig, BitsAndBytesConfig
from human_eval.data import write_jsonl, read_problems, stream_jsonl

from vllm import LLM
from vllm import SamplingParams

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

try:
    if torch.backends.mps.is_available():
        device = "mps"
except:
    pass

def generate_prompt(input):
    INSTRUCTION = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Create a Python script for this problem:
{input}
### Response:"""
    return INSTRUCTION


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument('--model', type=str, default='bigcode/starcoder', help="")
    parser.add_argument('--lora', type=str, default='bigcode/starcoder', help="")
    parser.add_argument('--output_path', type=str, help="")
    parser.add_argument('--start_index', type=int, default=0, help="")
    parser.add_argument('--end_index', type=int, default=164, help="")
    parser.add_argument('--temperature', type=float, default=0.8, help="")
    parser.add_argument('--N', type=int, default=200, help="")
    parser.add_argument('--max_len', type=int, default=512, help="")
    parser.add_argument('--num_gpus', type=int, default=4, help="")
    parser.add_argument('--decoding_style', type=str, default='sampling', help="")
    parser.add_argument('--num_seqs_per_iter', type=int, default=50, help='')
    parser.add_argument('--overwrite', action='store_true', help='')

    args = parser.parse_args()

    argsdict = vars(args)
    print(pprint.pformat(argsdict))

    problems = read_problems()

    task_ids = sorted(problems.keys())[args.start_index: args.end_index]
    prompts = [problems[task_id]['prompt'] for task_id in task_ids]
    num_samples = len(prompts)
    print("Number of samples: {}".format(num_samples))

    llm = LLM(model=args.model, tensor_parallel_size=args.num_gpus)
    sampling_params = SamplingParams(temperature=args.temperature, top_p=1, max_tokens=args.max_len)

    print(f"Loaded {args.model}.")
    for i in tqdm(range(num_samples), ncols=0, total=num_samples):
        output_file = args.output_path + '/{}.jsonl'.format(args.start_index + i)

        if os.path.exists(output_file) and not args.overwrite:
            print(f'Skip {output_file} as it already exists')
            continue

        # Compress indentation to tabs before prompting; it is expanded back below.
        prompt = prompts[i].replace('    ', '\t')
        prompt_batch = [generate_prompt(prompt)]

        ids_batch = [task_ids[i]]
        completion_seqs = []

        if args.decoding_style == 'sampling':
            loops = int(args.N / args.num_seqs_per_iter)
        else:
            loops = 1

        for _ in tqdm(range(loops), total=loops, leave=False, ncols=0):

            with torch.no_grad():
                completions = llm.generate(prompt_batch, sampling_params)
            gen_seqs = [completions[0].outputs[0].text]

            if gen_seqs is not None:
                assert len(ids_batch) == 1
                task_id = ids_batch[0]

                for seq_idx, gen_seq in enumerate(gen_seqs):
                    completion_seq = gen_seq.split("### Response:")[-1]
                    completion_seq = completion_seq.replace('\t', '    ')
                    all_code = gen_seq.replace('\t', '    ')

                    completion_seqs.append(
                        {'task_id': task_id,
                         'completion': completion_seq,
                         'all_code': all_code,
                         }
                    )

        print("Saving results to {}".format(output_file))
        write_jsonl(output_file, completion_seqs)


if __name__ == '__main__':
    main()
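For a quick sanity check of the generated shards, a minimal sketch that reads back one of the per-problem JSONL files this script writes (the path assumes the README settings temp=0.2 and pred_num=200, and shard 0):

```python
# Illustrative only: inspect one per-problem shard produced by humaneval_gen_vllm.py.
from human_eval.data import stream_jsonl

for record in stream_jsonl("preds/T0.2_N200/0.jsonl"):  # assumed output path
    print(record["task_id"])
    print(record["completion"][:300])  # beginning of the extracted solution
    break
```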
