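Description of the bug
I previously tested a 10-page PDF and got no errors, but this 400-page PDF fails, and a second attempt produced the same error. The error message is as follows: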
2024-10-22 19:50:46.227 | INFO | magic_pdf.para.para_split_v2:para_split:773 - 连接了第300页和第301页的列表段落
2024-10-22 19:50:46.227 | INFO | magic_pdf.para.para_split_v2:__connect_list_inter_page:471 - 连接page 303 内的list
2024-10-22 19:50:46.227 | INFO | magic_pdf.para.para_split_v2:__connect_list_inter_page:471 - 连接page 309 内的list
2024-10-22 19:50:46.228 | ERROR | magic_pdf.tools.cli:parse_doc:96 - string index out of range
Traceback (most recent call last):

  File "/root/miniconda3/envs/MinerU/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
    │   │    └ <Command cli>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x7fcb5d419990>
           └ <Command cli>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x7fcb5d7b6650>
         │    └ <function Command.invoke at 0x7fcb5d41a440>
         └ <Command cli>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'path': 'autodl-tmp/PDFs/上海浦东发展银行股份有限公司2023年年度报告.pdf', 'output_dir': '/root/', 'method': 'txt', 'debug_able': False, 'start_...
           │   │      │    │           └ <click.core.Context object at 0x7fcb5d7b6650>
           │   │      │    └ <function cli at 0x7fcae8a177f0>
           │   │      └ <Command cli>
           │   └ <function Context.invoke at 0x7fcb5d4191b0>
           └ <click.core.Context object at 0x7fcb5d7b6650>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
                       │       └ {'path': 'autodl-tmp/PDFs/上海浦东发展银行股份有限公司2023年年度报告.pdf', 'output_dir': '/root/', 'method': 'txt', 'debug_able': False, 'start_...
                       └ ()
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 102, in cli
    parse_doc(path)
    │         └ 'autodl-tmp/PDFs/上海浦东发展银行股份有限公司2023年年度报告.pdf'
    └ <function cli.<locals>.parse_doc at 0x7fcb5d66f250>
> File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 84, in parse_doc
    do_parse(
    └ <function do_parse at 0x7fcae8a16f80>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 85, in do_parse
    pipe.pipe_parse()
    │    └ <function TXTPipe.pipe_parse at 0x7fcae8a16d40>
    └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/pipe/TXTPipe.py", line 25, in pipe_parse
    self.pdf_mid_data = parse_txt_pdf(self.pdf_bytes, self.model_list, self.image_writer, is_debug=self.is_debug,
    │    │              │             │    │          │    │           │    │                      │    └ True
    │    │              │             │    │          │    │           │    │                      └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
    │    │              │             │    │          │    │           │    └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x7fcae8a029e0>
    │    │              │             │    │          │    │           └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
    │    │              │             │    │          │    └ [{'layout_dets': [{'category_id': 2, 'poly': [95.07244873046875, 141.0135498046875, 539.9658813476562, 141.0135498046875, 539...
    │    │              │             │    │          └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
    │    │              │             │    └ b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(zh-CN) /StructTreeRoot 1471 0 R/Outlines 888 0...
    │    │              │             └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
    │    │              └ <function parse_txt_pdf at 0x7fcb2970b520>
    │    └ None
    └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/user_api.py", line 34, in parse_txt_pdf
    pdf_info_dict = parse_pdf_by_txt(
                    └ <function parse_pdf_by_txt at 0x7fcae8a16560>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/pdf_parse_by_txt.py", line 12, in parse_pdf_by_txt
    return pdf_parse_union(pdf_bytes,
           │               └ b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(zh-CN) /StructTreeRoot 1471 0 R/Outlines 888 0...
           └ <function pdf_parse_union at 0x7fcae8a163b0>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core.py", line 259, in pdf_parse_union
    para_split(pdf_info_dict, debug_mode=debug_mode)
    │          │                         └ True
    │          └ {'page_0': {'preproc_blocks': [{'type': 'text', 'bbox': [34, 203, 296, 260.75], 'lines': []}], 'layout_bboxes': [{'layout_bbo...
    └ <function para_split at 0x7fcae89ef1c0>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/para/para_split_v2.py", line 762, in para_split
    is_conn = __connect_para_inter_page(pre_page_paras, next_page_paras, pre_page_layout_bbox,
              │                         │               │                └ [[0, 79, 595.4400024414062, 92], [104, 117.83297729492188, 205, 457.8480224609375], [284, 105.71298217773438, 537, 463.047515...
              │                         │               └ [[{'type': 'text', 'bbox': [70, 79, 122, 93], 'lines': [{'bbox': [70.94400024414062, 78.51200866699219, 122, 93.6809616088867...
              │                         └ [[{'type': 'title', 'bbox': [100, 79, 122, 92], 'lines': [{'bbox': [99.26399993896484, 80.89250183105469, 121.34400177001953,...
              └ <function __connect_para_inter_page at 0x7fcae89eeef0>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/para/para_split_v2.py", line 616, in __connect_para_inter_page
    if pre_last_line['bbox'][2] == pre_x2_max and pre_last_line_text[-1] not in LINE_STOP_FLAG and \
       │                           │              │                             └ ['.', '!', '?', '。', '!', '?', ':', ':', ')', ')', ';']
       │                           │              └ ''
       │                           └ 537
       └ {'bbox': [284, 454.14398193359375, 537, 463.0475158691406], 'spans': [{'bbox': [285.8900146484375, 454.14398193359375, 287.69...

IndexError: string index out of range
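The last frame of the traceback shows that `pre_last_line_text` is `''`, so `pre_last_line_text[-1]` raises before the comparison against `LINE_STOP_FLAG` can run. The sketch below reproduces that failure in isolation and shows one possible guard. The helper name `ends_without_stop_flag` is hypothetical, and treating an empty line as "not joinable" is an assumption; this is not necessarily how the dev branch fixes it.

```python
# Value copied from the traceback above.
LINE_STOP_FLAG = ['.', '!', '?', '。', '!', '?', ':', ':', ')', ')', ';']


def ends_without_stop_flag(text: str) -> bool:
    """Hypothetical helper, not part of magic_pdf.

    Returns True when `text` is non-empty and its last character is not
    a sentence-ending flag. An empty string returns False (an assumption)
    instead of being indexed.
    """
    if not text:  # guards the text[-1] access that raised IndexError
        return False
    return text[-1] not in LINE_STOP_FLAG


# Reproduce the original failure: indexing into an empty string.
try:
    ''[-1]
except IndexError as exc:
    error_message = str(exc)

print(error_message)                             # same message as the log
print(ends_without_stop_flag(''))                # empty line: no crash
print(ends_without_stop_flag('annual report'))   # no stop flag at the end
print(ends_without_stop_flag('annual report.'))  # ends with a stop flag
```

Checking for an empty string before indexing keeps the rest of the condition unchanged, which is why the guard is placed first in the helper.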
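How to reproduce the bug
magic-pdf -p 上海浦东发展银行股份有限公司2023年年度报告.pdf -o /root/ -m txt
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.8.x
Device mode
cuda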
The document is attached here: 上海浦东发展银行股份有限公司2023年年度报告.pdf
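This bug has already been fixed on the dev branch. If you want to test it, you can verify the result with the demo on Hugging Face or ModelScope. The demo limits input to 10 pages, so use your browser's print-to-PDF ("Save as PDF") feature to cut out 10 pages around the page where the error occurs and test with those.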
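To pick which 10 pages to keep for the demo, a small helper can compute a window around the failing page. This is only a sketch: the function name `page_window` is hypothetical, and using page 309 (the last page mentioned in the log before the error) as a 0-based index is an assumption. The code only computes the page range; the extraction itself is still done with the browser's print dialog or a PDF tool of your choice.

```python
def page_window(error_page: int, total_pages: int, size: int = 10) -> range:
    """Hypothetical helper: 0-based page indices for a window of up to
    `size` pages around `error_page`, clamped to the document bounds."""
    if total_pages <= 0 or size <= 0:
        return range(0)
    size = min(size, total_pages)
    # Start half a window before the failing page, then clamp so the
    # window never runs past the first or last page.
    start = error_page - size // 2
    start = max(0, min(start, total_pages - size))
    return range(start, start + size)


pages = page_window(error_page=309, total_pages=400)
# Browser print dialogs use 1-based page numbers, so add 1 to each bound.
print(f"print pages {pages.start + 1}-{pages.stop}")
```

Clamping the start rather than the end keeps the window at the full 10 pages even when the error is near the first or last page of the document.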