The text-stitching logic does not handle this case. A fix can be scheduled for a later release.
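Papers almost always use this layout: two columns, with words broken across lines by a hyphen. Surely this feature would not be very complicated?
1. Build a dictionary.
2. If a line ends with a hyphen, look up the hyphenated word (hyphen included) in the dictionary.
→ Found: keep the hyphen.
→ Not found: look up the word with the hyphen removed.
→→ Found: drop the hyphen.
→→ Not found: look up the two parts on either side of the hyphen separately.
→→→ Both found: keep the hyphen.
→→→ Otherwise: drop the hyphen. (This branch could be refined further, for example by asking a large language model whether the word is a person's name, a place name, or another proper noun.)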
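The lookup order proposed above can be sketched in Python roughly as follows. This is only an illustration of the suggestion, not the project's implementation: the names `resolve_hyphen`, `dehyphenate`, and `LINE_BREAK_HYPHEN`, the tiny vocabulary, and the assumption that the converted Markdown contains the break as `fragment- fragment` (hyphen followed by whitespace) are all hypothetical.

```python
import re

# Hypothetical pattern: letters, a hyphen, whitespace, then more letters,
# i.e. the "infor- mation" artifact described in the bug report.
LINE_BREAK_HYPHEN = re.compile(r"([A-Za-z]+)-\s+([A-Za-z]+)")


def resolve_hyphen(left: str, right: str, vocab: set[str]) -> str:
    """Decide whether `left` and `right` form a hyphenated or a joined word."""
    hyphenated = f"{left}-{right}"
    joined = left + right
    # Step 1: the hyphenated form is itself a known word -> keep the hyphen.
    if hyphenated.lower() in vocab:
        return hyphenated
    # Step 2: the form without the hyphen is a known word -> drop the hyphen.
    if joined.lower() in vocab:
        return joined
    # Step 3: both halves are known words on their own -> keep the hyphen.
    if left.lower() in vocab and right.lower() in vocab:
        return hyphenated
    # Fallback: drop the hyphen (an LLM or name list could refine this branch).
    return joined


def dehyphenate(text: str, vocab: set[str]) -> str:
    """Rewrite every 'frag- ment' occurrence in `text` using `resolve_hyphen`."""
    return LINE_BREAK_HYPHEN.sub(
        lambda m: resolve_hyphen(m.group(1), m.group(2), vocab), text
    )


if __name__ == "__main__":
    # Stand-in vocabulary; a real one would be loaded from a word list.
    vocab = {"information", "multimodal", "well", "known"}
    sample = "We use multi- modal infor- mation from a well- known source."
    print(dehyphenate(sample, vocab))
    # We use multimodal information from a well-known source.
```

One limitation of this sketch: the pattern also matches legitimate suspended hyphens such as "pre- and post-processing", so a real implementation would probably apply the rule only at line boundaries detected during PDF layout analysis rather than on the final Markdown text.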
Description of the bug
In a PDF document, when the last word on a line is split across two lines, a '-' hyphen is added at the end of the line.
After conversion to Markdown, the '-' hyphen is still present, so the word is broken apart by '-' followed by a space.
How to reproduce the bug
The bug can be reproduced with this PDF: https://arxiv.org/pdf/2407.01906
Operating system
Linux
Python version
3.10
Device mode
cpu