Error when parsing a 400-page text PDF #768

Closed
whiteThrush opened this issue Oct 22, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@whiteThrush

Description of the bug

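I previously tested a 10-page PDF and it parsed without errors. This 400-page PDF now fails, and a second attempt produced the same error.
The error output is as follows: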

2024-10-22 19:50:46.227 | INFO     | magic_pdf.para.para_split_v2:para_split:773 - 连接了第300页和第301页的列表段落
2024-10-22 19:50:46.227 | INFO     | magic_pdf.para.para_split_v2:__connect_list_inter_page:471 - 连接page 303 内的list
2024-10-22 19:50:46.227 | INFO     | magic_pdf.para.para_split_v2:__connect_list_inter_page:471 - 连接page 309 内的list
2024-10-22 19:50:46.228 | ERROR    | magic_pdf.tools.cli:parse_doc:96 - string index out of range
Traceback (most recent call last):

  File "/root/miniconda3/envs/MinerU/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
    │   │    └ <Command cli>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x7fcb5d419990>
           └ <Command cli>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x7fcb5d7b6650>
         │    └ <function Command.invoke at 0x7fcb5d41a440>
         └ <Command cli>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'path': 'autodl-tmp/PDFs/上海浦东发展银行股份有限公司2023年年度报告.pdf', 'output_dir': '/root/', 'method': 'txt', 'debug_able': False, 'start_...
           │   │      │    │           └ <click.core.Context object at 0x7fcb5d7b6650>
           │   │      │    └ <function cli at 0x7fcae8a177f0>
           │   │      └ <Command cli>
           │   └ <function Context.invoke at 0x7fcb5d4191b0>
           └ <click.core.Context object at 0x7fcb5d7b6650>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
                       │       └ {'path': 'autodl-tmp/PDFs/上海浦东发展银行股份有限公司2023年年度报告.pdf', 'output_dir': '/root/', 'method': 'txt', 'debug_able': False, 'start_...
                       └ ()
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 102, in cli
    parse_doc(path)
    │         └ 'autodl-tmp/PDFs/上海浦东发展银行股份有限公司2023年年度报告.pdf'
    └ <function cli.<locals>.parse_doc at 0x7fcb5d66f250>
> File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 84, in parse_doc
    do_parse(
    └ <function do_parse at 0x7fcae8a16f80>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 85, in do_parse
    pipe.pipe_parse()
    │    └ <function TXTPipe.pipe_parse at 0x7fcae8a16d40>
    └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/pipe/TXTPipe.py", line 25, in pipe_parse
    self.pdf_mid_data = parse_txt_pdf(self.pdf_bytes, self.model_list, self.image_writer, is_debug=self.is_debug,
    │    │              │             │    │          │    │           │    │                      │    └ True
    │    │              │             │    │          │    │           │    │                      └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
    │    │              │             │    │          │    │           │    └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x7fcae8a029e0>
    │    │              │             │    │          │    │           └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
    │    │              │             │    │          │    └ [{'layout_dets': [{'category_id': 2, 'poly': [95.07244873046875, 141.0135498046875, 539.9658813476562, 141.0135498046875, 539...
    │    │              │             │    │          └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
    │    │              │             │    └ b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(zh-CN) /StructTreeRoot 1471 0 R/Outlines 888 0...
    │    │              │             └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
    │    │              └ <function parse_txt_pdf at 0x7fcb2970b520>
    │    └ None
    └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x7fcae8a02890>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/user_api.py", line 34, in parse_txt_pdf
    pdf_info_dict = parse_pdf_by_txt(
                    └ <function parse_pdf_by_txt at 0x7fcae8a16560>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/pdf_parse_by_txt.py", line 12, in parse_pdf_by_txt
    return pdf_parse_union(pdf_bytes,
           │               └ b'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(zh-CN) /StructTreeRoot 1471 0 R/Outlines 888 0...
           └ <function pdf_parse_union at 0x7fcae8a163b0>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core.py", line 259, in pdf_parse_union
    para_split(pdf_info_dict, debug_mode=debug_mode)
    │          │                         └ True
    │          └ {'page_0': {'preproc_blocks': [{'type': 'text', 'bbox': [34, 203, 296, 260.75], 'lines': []}], 'layout_bboxes': [{'layout_bbo...
    └ <function para_split at 0x7fcae89ef1c0>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/para/para_split_v2.py", line 762, in para_split
    is_conn = __connect_para_inter_page(pre_page_paras, next_page_paras, pre_page_layout_bbox,
              │                         │               │                └ [[0, 79, 595.4400024414062, 92], [104, 117.83297729492188, 205, 457.8480224609375], [284, 105.71298217773438, 537, 463.047515...
              │                         │               └ [[{'type': 'text', 'bbox': [70, 79, 122, 93], 'lines': [{'bbox': [70.94400024414062, 78.51200866699219, 122, 93.6809616088867...
              │                         └ [[{'type': 'title', 'bbox': [100, 79, 122, 92], 'lines': [{'bbox': [99.26399993896484, 80.89250183105469, 121.34400177001953,...
              └ <function __connect_para_inter_page at 0x7fcae89eeef0>
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/para/para_split_v2.py", line 616, in __connect_para_inter_page
    if pre_last_line['bbox'][2] == pre_x2_max and pre_last_line_text[-1] not in LINE_STOP_FLAG and \
       │                           │              │                             └ ['.', '!', '?', '。', '!', '?', ':', ':', ')', ')', ';']
       │                           │              └ ''
       │                           └ 537
       └ {'bbox': [284, 454.14398193359375, 537, 463.0475158691406], 'spans': [{'bbox': [285.8900146484375, 454.14398193359375, 287.69...
IndexError: string index out of range
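
The last frame shows `pre_last_line_text` being `''` when `pre_last_line_text[-1]` is evaluated, which is what raises the `IndexError`. The following is a minimal sketch of that failure and one possible guard. It is not the fix that was applied upstream: the helper names are hypothetical, `LINE_STOP_FLAG` is copied from the traceback as displayed, and treating an empty line as "do not connect" is an assumption.

```python
# Value copied from the traceback above, as displayed there.
LINE_STOP_FLAG = ['.', '!', '?', '。', '!', '?', ':', ':', ')', ')', ';']


def unsafe_check(text: str) -> bool:
    """Mirrors the failing expression: indexing [-1] on '' raises IndexError."""
    return text[-1] not in LINE_STOP_FLAG


def safe_check(text: str) -> bool:
    """Hypothetical guard: an empty line is treated as not connectable (assumption)."""
    return bool(text) and text[-1] not in LINE_STOP_FLAG


if __name__ == "__main__":
    try:
        unsafe_check("")
    except IndexError as exc:
        print(f"unsafe_check('') raised: {exc}")
    print(safe_check(""))
    print(safe_check("a line that continues"))
    print(safe_check("a finished sentence."))
```

Whether an empty last line should block or allow connecting paragraphs across pages depends on the intended behaviour of `__connect_para_inter_page`, which the traceback alone does not settle.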

How to reproduce the bug

magic-pdf -p 上海浦东发展银行股份有限公司2023年年度报告.pdf -o /root/ -m txt

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.8.x

Device mode

cuda

whiteThrush added the bug label on Oct 22, 2024
@myhloli
Collaborator

myhloli commented Oct 23, 2024

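This bug has already been fixed on the dev branch. If you want to test the fix, you can verify the result with the demo on huggingface or modelscope.
The demo limits input to 10 pages, so you can use your browser's print-to-PDF ("Save as PDF") function to trim the document down to 10 pages around the page where the error occurs, then test with that file.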
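
As an alternative to printing from the browser, the 10-page excerpt could also be cut with a script. This is only a sketch under assumptions: it uses PyMuPDF (`fitz`), which must be installed separately; `page_window` and `trim_pdf` are hypothetical helper names; and the failing page index has to be supplied by you as a 0-based index, since the log above does not state exactly which page triggered the error.

```python
def page_window(failing_page: int, page_count: int, size: int = 10) -> list[int]:
    """Return up to `size` consecutive 0-based page indices around failing_page."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    if not 0 <= failing_page < page_count:
        raise ValueError("failing_page is outside the document")
    size = min(size, page_count)
    # Centre the window on the failing page, then clamp it to the document bounds.
    start = max(0, failing_page - size // 2)
    start = min(start, page_count - size)
    return list(range(start, start + size))


def trim_pdf(src: str, dst: str, failing_page: int, size: int = 10) -> None:
    """Write a copy of src that keeps only the pages chosen by page_window."""
    import fitz  # PyMuPDF, third-party; imported lazily so page_window works without it

    doc = fitz.open(src)
    try:
        doc.select(page_window(failing_page, doc.page_count, size))
        doc.save(dst)
    finally:
        doc.close()


if __name__ == "__main__":
    # Page indices only; no PDF file is needed for this part.
    print(page_window(305, 400))
```

A call such as `trim_pdf("input.pdf", "excerpt.pdf", failing_page=305)` (both file names are placeholders) would then produce a file small enough for the demo's 10-page limit.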

@myhloli myhloli closed this as completed Oct 23, 2024