Tokenizer fails on certain special tokens #920
-
When I use the tokenizer to encode a string such as "你好</s>", the result is [36474, 54591, 1833, 30917, 30994]; that is, the eos_token "</s>" gets split into 1833, 30917, 30994 instead of being kept as a single special token. What is the reason for this?
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("chatglm3-6b", trust_remote_code=True)
token_ids = tokenizer.encode("你好</s>", add_special_tokens=False)
print(token_ids)
Answered by zRzRzRzRzRzRzR on Mar 6, 2024
Replies: 1 comment
-
Load the tokenizer the same way it is loaded in basic_demo; we have our own chat template.
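For reference, a minimal sketch of the basic_demo-style usage, assuming the standard chatglm3-6b checkpoint with trust_remote_code enabled; MODEL_PATH and the prompt string are placeholders, and model.chat is the helper exposed by the model's remote code:
from transformers import AutoTokenizer, AutoModel

MODEL_PATH = "chatglm3-6b"  # placeholder: local path or Hub repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).eval()

# The built-in chat template adds the model's own special tokens around the
# conversation, so "</s>" never needs to appear literally in the user text.
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
With this flow the special tokens are inserted by the template rather than typed into the input string, which avoids the splitting seen in the question.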
Answer selected by zRzRzRzRzRzRzR