Words hyphenated across line breaks keep the hyphen and are not rejoined #150

Open
yiyibooks opened this issue Jul 15, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@yiyibooks

Description of the bug

image

image

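As shown in the screenshots above, when the last word on a line of the PDF is split across two lines, a '-' hyphen is added at the end of the line.
After conversion to Markdown, the '-' hyphen is still present, and the word is split in two by the '-' followed by a space.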

How to reproduce the bug

The bug can be reproduced with this PDF: https://arxiv.org/pdf/2407.01906

Operating system

Linux

Python version

3.10

Device mode

cpu

yiyibooks added the bug label Jul 15, 2024
@myhloli
Collaborator

myhloli commented Jul 15, 2024

The joining logic does not account for this case. We can schedule it for development later.

@kv1830

kv1830 commented Jul 19, 2024

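Papers are almost always typeset this way: two columns, with words hyphenated across line breaks. This feature shouldn't be very complicated, should it?
1. Build a dictionary.
2. If there is a hyphen at a line break, look up the word (with the hyphen) in the dictionary.
→ Found: keep the hyphen.
→ Not found: look up the word with the hyphen removed.
→→ Found: drop the hyphen.
→→ Not found: look up the two parts on either side of the hyphen separately.
→→→ Both found: keep the hyphen.
→→→ Otherwise: drop the hyphen (this branch could be extended, for example by asking a large language model whether the word is a personal name, place name, proper noun, etc.)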
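
The lookup order above could be sketched roughly as follows. This is only an illustration, not the project's code: the `WORDS` set, the function names, and the regex are all hypothetical, and a real word list would be loaded from a dictionary file rather than hard-coded.

```python
import re

# Hypothetical toy dictionary (assumption); a real one would be loaded from a word list.
WORDS = {"information", "self-attention", "well", "known"}


def resolve_line_break_hyphen(left: str, right: str, words: set[str]) -> str:
    """Decide how to join a word that was split as 'left-' / 'right' at a line break."""
    hyphenated = f"{left}-{right}"
    joined = f"{left}{right}"
    # Step 1: the hyphenated form is itself a dictionary word, so keep the hyphen.
    if hyphenated.lower() in words:
        return hyphenated
    # Step 2: the form without the hyphen is a dictionary word, so drop the hyphen.
    if joined.lower() in words:
        return joined
    # Step 3: both halves are words on their own, so treat it as a compound and keep the hyphen.
    if left.lower() in words and right.lower() in words:
        return hyphenated
    # Fallback: drop the hyphen (this is where a name/proper-noun check could go).
    return joined


# A word, a hyphen, whitespace (space or newline), then another word.
_BROKEN_WORD = re.compile(r"(\w+)-\s+(\w+)")


def dehyphenate(text: str, words: set[str] = WORDS) -> str:
    """Rejoin every 'left- right' pair in text using the dictionary lookup above."""
    return _BROKEN_WORD.sub(
        lambda m: resolve_line_break_hyphen(m.group(1), m.group(2), words), text
    )


print(dehyphenate("infor- mation from a well- known self- attention layer"))
```

Note that this pattern would also rewrite suspended hyphens such as "pre- and post-processing", so a real implementation would probably want to apply it only at positions known to be line breaks in the PDF layout rather than to the whole Markdown text.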
