StringIndexOutOfBoundsException when using PagePdfDocumentReader #1689

Open
jazzm0 opened this issue Nov 7, 2024 · 5 comments

jazzm0 commented Nov 7, 2024

Bug description
When using the class PagePdfDocumentReader to process PDF files, there is a bug in org.springframework.ai.reader.pdf.layout.TextLine. In the method getNextValidIndex, the index is decremented by one, which can produce a negative index: isSpaceCharacterAtIndex(index - 1) is called, which then leads to a call into StringLatin1 with index -1. This raises a StringIndexOutOfBoundsException. It happens with many PDF files.
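
The failure mode described above can be reproduced in plain Java, without Spring AI. The following is only a minimal sketch, not the actual TextLine source: the class NegativeIndexDemo and both of its helper methods are hypothetical names chosen for illustration. It shows that an unguarded lookup one position to the left of index 0 makes String.charAt throw.

```java
class NegativeIndexDemo {

    // Hypothetical stand-in for an unguarded lookup: no range check before charAt.
    static boolean isSpaceUnchecked(String line, int index) {
        return line.charAt(index) == ' ';
    }

    // Returns true if inspecting the character to the left of 'index' throws.
    static boolean lookupLeftOfThrows(String line, int index) {
        try {
            isSpaceUnchecked(line, index - 1);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String line = "abc";
        System.out.println(lookupLeftOfThrows(line, 1)); // false: index 0 is valid
        System.out.println(lookupLeftOfThrows(line, 0)); // true: index -1 is out of range
    }
}
```

Any character that lands at position 0 of a line would hit this path if the caller always inspects index - 1 first.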

Environment
Spring component version: spring-ai-pdf-document-reader-1.0.0-20241106.071720
Java version: 21

Steps to reproduce

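// Read the uploaded PDF one page per document; get() triggers the parsing that fails.
List<Document> documents = new PagePdfDocumentReader(
        new InputStreamResource(file.getInputStream()),
        PdfDocumentReaderConfig.builder()
                .withPageExtractedTextFormatter(ExtractedTextFormatter.builder().build())
                .withPagesPerDocument(1)
                .build())
        .get();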

on the attached pdf file.
Uploading service-manual-whirlpool-dishwashers.pdf…

Expected behavior
Of course a proper boundary check :)

Minimal Complete Reproducible example
See the attached code. Pay attention to the isSpaceCharacterAtIndex method of the TextLine class: it needs a check for a negative index before the character lookup.
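
One possible shape for such a check is sketched below. This is an assumption-laden illustration, not the Spring AI implementation or the fix in the linked PR: LineBufferSketch, its char[] backing store, and setCharacter are hypothetical, and only the method name isSpaceCharacterAtIndex is taken from the report. The guard treats any index outside the line as "not a space" rather than throwing.

```java
import java.util.Arrays;

class LineBufferSketch {

    private final char[] line;

    LineBufferSketch(int length) {
        this.line = new char[length];
        Arrays.fill(this.line, ' '); // start with a line made only of spaces
    }

    void setCharacter(int index, char character) {
        this.line[index] = character;
    }

    // Guarded lookup: an index outside [0, length) is reported as "not a space"
    // instead of raising an exception.
    boolean isSpaceCharacterAtIndex(int index) {
        return index >= 0 && index < this.line.length && this.line[index] == ' ';
    }

    public static void main(String[] args) {
        LineBufferSketch buffer = new LineBufferSketch(3);
        buffer.setCharacter(0, 'a');
        System.out.println(buffer.isSpaceCharacterAtIndex(-1)); // false
        System.out.println(buffer.isSpaceCharacterAtIndex(0));  // false
        System.out.println(buffer.isSpaceCharacterAtIndex(1));  // true
        System.out.println(buffer.isSpaceCharacterAtIndex(3));  // false
    }
}
```

Whether an out-of-range position should count as a space or as a non-space depends on how the caller uses the result, so that choice is an assumption here and would need to be confirmed against the real getNextValidIndex logic.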

jazzm0 commented Nov 7, 2024

jazzm0 commented Nov 7, 2024

I've created a PR that demonstrates the bug: https://github.com/spring-projects/spring-ai/pull/1690/files

ilayaperumalg self-assigned this Nov 7, 2024
ilayaperumalg added the bug (Something isn't working) label Nov 7, 2024

jazzm0 commented Nov 7, 2024

I also added a unit test that exercises the problematic code. By the way, the TextLine class currently has no test coverage at all.

jazzm0 commented Nov 7, 2024

@ilayaperumalg
Hi, I added tests and rewrote the class to be more efficient: the string operations used when changing the line content can be made much more performant. I also fixed the index-out-of-bounds issue. Please double-check!

ilayaperumalg (Member) commented

@jazzm0 sure, thank you!
