Qwen2.5 #1863

Merged: 25 commits, Oct 31, 2024
Conversation

@calvinpelletier (Contributor) commented Oct 18, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Issue #1624

Changelog

  • updated the list of special tokens
  • created a prompt template (target format sketched after this list)
  • added model builders
  • added a unit test for Qwen2.5 tokenization
  • moved special token logic from template to tokenizer
  • created separate qwen2/2.5 tokenizers
  • separated base and instruct model builders
  • created various configs
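
For reference, Qwen2/2.5 chat formatting follows the ChatML convention built around the <|im_start|> and <|im_end|> special tokens. Below is a minimal sketch of the target format; the format_chatml helper is illustrative only, not the PR's actual _prompt_template.py implementation:

```python
from typing import List, Tuple

IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def format_chatml(messages: List[Tuple[str, str]]) -> str:
    """Render (role, content) pairs into a single ChatML prompt string."""
    out = []
    for role, content in messages:
        out.append(f"{IM_START}{role}\n{content}{IM_END}\n")
    # End with an opened assistant turn to cue generation.
    out.append(f"{IM_START}assistant\n")
    return "".join(out)


print(format_chatml([
    ("system", "You are a helpful assistant."),
    ("user", "Hello!"),
]))
```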

Test plan

Checked tokenizer and chat-template parity against the current official Qwen2.5 implementation (Hugging Face).

Ran Qwen2.5 finetunes.

Checklist

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (see above for links to loss curves)
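
For convenience, the checklist commands collected in one place (the pre-commit run invocation is an assumption; the checklist only names the hook setup):

```bash
pre-commit install                # one-time hook setup
pre-commit run --all-files        # run hooks/linters (assumed invocation)
pytest tests                      # unit tests
pytest tests -m integration_test  # recipe tests
```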

pytorch-bot commented Oct 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1863

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9271c5c with merge base 09c2619:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Oct 18, 2024. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
@pbontrager (Contributor) left a comment:

I'm interested to get more opinions here, but I think it would be clearer to have a fully separate 2.5 folder even if a few of the models have the exact same definitions. For the 0.5B, 1.5B, and 7B models you can directly have them call the qwen2 builders if you want. But this way we have separate tokenizers (without introducing the concept of tokenizer versions) and separate recipes.

Additionally, I want to see more information on how the new models were checked for accuracy and more configs for the 2.5 model.
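
A hypothetical sketch of that layout, with a qwen2_5 builder delegating to the existing qwen2 one where the architectures match (the function name and module path are illustrative, not the PR's final API):

```python
# torchtune/models/qwen2_5/_model_builders.py (hypothetical)
from torchtune.models.qwen2 import qwen2_7b
from torchtune.modules import TransformerDecoder


def qwen2_5_7b() -> TransformerDecoder:
    """Qwen2.5 7B reuses the Qwen2 7B architecture; only the weights
    differ, so the builder can simply delegate."""
    return qwen2_7b()
```

This keeps a single source of truth for the shared architecture while still giving Qwen2.5 its own namespace, tokenizer, and configs.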

Review threads (resolved):
  • recipes/configs/qwen2_5/3B_full.yaml (outdated)
  • recipes/configs/qwen2_5/3B_full.yaml (outdated)
  • torchtune/models/qwen2/_prompt_template.py (outdated)
@codecov-commenter commented Oct 20, 2024

Codecov Report

Attention: Patch coverage is 12.08791% with 80 lines in your changes missing coverage. Please review.

Project coverage is 69.32%. Comparing base (c70ad29) to head (9595a46).
Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
torchtune/models/qwen2_5/_model_builders.py 0.00% 41 Missing ⚠️
torchtune/models/qwen2_5/_prompt_template.py 0.00% 36 Missing ⚠️
torchtune/models/qwen2_5/__init__.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1863      +/-   ##
==========================================
+ Coverage   67.30%   69.32%   +2.02%     
==========================================
  Files         304      311       +7     
  Lines       16000    16227     +227     
==========================================
+ Hits        10768    11250     +482     
+ Misses       5232     4977     -255     


@fzyzcjy commented Oct 22, 2024

Hi, thank you for your great work!

I wonder whether it will support Qwen2.5-Math (instead of Qwen2.5)?

@calvinpelletier (Contributor, Author) replied:

> Hi, thank you for your great work!
>
> I wonder whether it will support Qwen2.5-Math (instead of Qwen2.5)?

@fzyzcjy Yep! It's the same model, just with different weights. So, for example, run tune download Qwen/Qwen2.5-Math-7B-Instruct --output-dir /tmp/Qwen2_5-Math-7B-Instruct --ignore-patterns None, then use the qwen2_5/7B_lora_single_device config with checkpointer.checkpoint_dir updated to /tmp/Qwen2_5-Math-7B-Instruct.

For best results, you should preprocess your dataset to match the expected data format of Qwen2.5-Math. They recommend CoT or TIR reasoning, so use a system prompt like "Please reason step by step, and put your final answer within \boxed{}."
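
As a rough illustration of that preprocessing step (the file names and the conversations schema here are hypothetical; adapt to your dataset's actual format):

```python
import json

# System prompt recommended by Qwen2.5-Math for CoT reasoning.
MATH_SYSTEM_PROMPT = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)

with open("train.json") as f:  # assumed: a JSON list of chat samples
    samples = json.load(f)

for sample in samples:
    # Prepend the system turn to every conversation.
    sample["conversations"].insert(
        0, {"role": "system", "content": MATH_SYSTEM_PROMPT}
    )

with open("train_math.json", "w") as f:
    json.dump(samples, f, indent=2, ensure_ascii=False)
```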

@ebsmothers (Contributor) left a comment:

Left a bunch of comments, also wanna +1 @pbontrager's comment here. We probably wanna at least have some E2E runs with eval results for a couple of different model sizes.

Review threads:
  • torchtune/models/qwen2_5/__init__.py (outdated, resolved)
  • torchtune/models/qwen2_5/_model_builders.py (resolved)
  • torchtune/models/qwen2_5/_model_builders.py (resolved)
  • torchtune/models/qwen2_5/_model_builders.py (resolved)
  • torchtune/models/qwen2_5/_model_builders.py (outdated, resolved)
  • torchtune/models/qwen2_5/_model_builders.py (outdated, resolved)
  • torchtune/models/qwen2_5/_prompt_template.py (outdated, resolved)
def __call__(
    self,
    messages: List[Message],
    inference: bool = False,
Contributor comment on the snippet above:

I know this is in the interface, but I don't see it used here (or in any of our other prompt templates). Can we remove it? @pbontrager

  • torchtune/models/qwen2_5/_prompt_template.py (outdated, resolved)
  • torchtune/models/qwen2_5/_prompt_template.py (outdated, resolved)
@calvinpelletier (Contributor, Author) commented:

Oh @ebsmothers, I just realized what happened: I was looking at the Qwen2.5 instruct configs and you were looking at the base model configs. I had assumed they would be the same models with different params, since instruct should just be a finetune of the base model. I'll create separate model builders for the instruct and base models.

@RdoubleA (Contributor) left a comment:

Looks awesome, I just left a few minor comments. Will let @ebsmothers comment more on the model builders.

Review threads:
  • torchtune/models/qwen2/_tokenizer.py (outdated, resolved)
  • torchtune/models/qwen2/_tokenizer.py (resolved)
  • torchtune/models/qwen2/_tokenizer.py (resolved)

> The extra text will still get tokenized as normal text, not as special tokens.
> Default: None
> errors (str): Paradigm to follow when decoding bytes to UTF-8. Defaults to "replace".
Contributor comment on the excerpt above:

Would make this parameter name a bit more descriptive, like decode_error_handler?
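
For context, the errors argument maps directly onto Python's standard bytes.decode error handlers, e.g.:

```python
raw = b"hello \xff world"  # \xff is not valid UTF-8
print(raw.decode("utf-8", errors="replace"))  # 'hello \ufffd world'
print(raw.decode("utf-8", errors="ignore"))   # 'hello  world'
```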

Review threads:
  • torchtune/models/qwen2_5/_tokenizer.py (outdated, resolved)
  • torchtune/models/qwen2_5/_tokenizer.py (outdated, resolved)
  • torchtune/models/qwen2_5/_tokenizer.py (resolved)
@joecummings mentioned this pull request on Oct 30, 2024.
@joecummings (Contributor) left a comment:

I think you're missing some default configs.

@RdoubleA (Contributor) left a comment:

If it's not too difficult, it would be awesome if you could expose packed and compile in all the configs. (A sketch of what that might look like follows below.)

I'm also a bit confused about why some model builders have a base and an instruct version and some don't. It would also be great if you could add to the docstrings why the instruct models differ from the base ones (it seems like a different seq len?).
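
A hypothetical config excerpt showing the two knobs being requested; the key names follow torchtune's usual conventions and the dataset component is just an example:

```yaml
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: False  # set True to pack samples up to max_seq_len

compile: False  # set True to torch.compile the training step
```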

Review threads:
  • recipes/configs/qwen2_5/0_5B_full.yaml (outdated, resolved)
  • recipes/configs/qwen2_5/0_5B_full.yaml (outdated, resolved)
  • recipes/configs/qwen2_5/0_5B_full.yaml (resolved)
  • recipes/configs/qwen2_5/0_5B_full.yaml (resolved)
  • recipes/configs/qwen2_5/32B_lora.yaml (outdated, resolved)
  • tests/torchtune/models/qwen2_5/test_tokenizer.py (outdated, resolved)
  • torchtune/models/qwen2_5/_model_builders.py (outdated, resolved)
  • torchtune/models/qwen2_5/_model_builders.py (resolved)
  • torchtune/models/qwen2_5/_model_builders.py (outdated, resolved)
  • torchtune/models/qwen2_5/_tokenizer.py (outdated, resolved)
@RdoubleA (Contributor) left a comment:

Legendary PR, thanks for adding this model and all its variants!

> By default, we set the cache size equal to the size of the official Qwen2 tokenizer.
>
> Example:
> >>> tokenizer = Qwen2Tokenizer(
Contributor comment on the excerpt above:

Could you update this example?
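
One hypothetical shape for an updated example (the constructor arguments and paths here are assumptions, not the final API):

```python
>>> tokenizer = Qwen2Tokenizer(
...     path="/tmp/Qwen2_5-7B-Instruct/vocab.json",
...     merges_file="/tmp/Qwen2_5-7B-Instruct/merges.txt",
... )
>>> tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
```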

@RdoubleA merged commit 73d0821 into pytorch:main on Oct 31, 2024. 17 checks passed.