Request: add an option for tok to preserve spaces so that the text can be restored after tokenization #1802
Labels
feature request
Comments
Hi,
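This is a bit embarrassing: I ended up writing my own code that compares the original text against the tokenized list, and with it I managed to "restore the text".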
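For anyone who lands here with the same problem, here is a minimal sketch of that kind of comparison. It is not part of HanLP's API: `split_gaps` and `rebuild` are hypothetical helper names, and `convert` is a placeholder for whatever per-token conversion you use (for example a call into opencc). It assumes every token appears verbatim in the original text, in order. If the tokenizer normalizes characters, the lookup will fail and raise an error.

```python
def split_gaps(text, tokens):
    """Find the text skipped before each token (typically spaces).

    Returns (gaps, tail): gaps[i] is whatever sits between the end of the
    previous token and the start of tokens[i]; tail is what follows the
    last token.
    """
    gaps = []
    pos = 0
    for tok in tokens:
        idx = text.find(tok, pos)  # search only after the previous token
        if idx < 0:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        gaps.append(text[pos:idx])
        pos = idx + len(tok)
    return gaps, text[pos:]


def rebuild(tokens, gaps, tail, convert=lambda t: t):
    """Re-join the tokens with their original gaps, converting each token."""
    return "".join(g + convert(t) for g, t in zip(gaps, tokens)) + tail


if __name__ == "__main__":
    original = "Neuro-linguistic programming"
    # Assumed tokenizer output for illustration only.
    toks = ["Neuro-linguistic", "programming"]
    gaps, tail = split_gaps(original, toks)
    print(rebuild(toks, gaps, tail))
```

Because the gaps are stored separately, the converted tokens can change length without affecting the restored spacing, and both half-width and full-width spaces come back exactly as they were.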
Describe the feature and the current behavior/state.
tok discards the spaces in the text, both full-width and half-width.
Will this change the current api? How?
I don't know.
Who will benefit with this feature?
People who convert between Simplified and Traditional Chinese.
Are you willing to contribute it (Yes/No):
No, it is beyond my ability.
System information
Installed with pip install hanlp
Any other info
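I mainly want to use HanLP for Simplified/Traditional Chinese conversion, because opencc's conversion sometimes goes wrong (for example, when converting between 只 and 隻).
In the discussion at its GitHub #224 (comment), I saw someone tokenize with HanLP first and then pass the result to opencc, so I spent a whole day trying it, and it works well.
However, because tok does not preserve spaces, the text cannot be restored correctly.
Example
The output is: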
Neuro-linguistic programming
The space between the two words is gone. After this output is passed to opencc and the text is restored, it becomes
Neuro-linguisticprogramming
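Because my programming ability is extremely limited, my current workflow is:
read the txt file with Python,
tokenize it with HanLP's tok in Python as above,
dump the result to the terminal with json.dumps,
run opencc in the terminal for the Simplified/Traditional conversion,
and then restore the text with tools such as jq and sed.
Or is there a more efficient way to tokenize and then convert between Simplified and Traditional Chinese?
Thanks!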