Please do a quick search on GitHub issues first, there might be already a duplicate issue for the one you are about to create.
If the bug is trivial, just go ahead and create the issue. Otherwise, please take a few moments and fill in the following sections:
Bug description
When using the class PagePdfDocumentReader to process PDF files, there is a bug in org.springframework.ai.reader.pdf.layout.TextLine. In the method getNextValidIndex the index is decremented by one, which can produce a negative index: the call isSpaceCharacterAtIndex(index - 1) then leads into a StringLatin1 call with index -1, which raises a StringIndexOutOfBoundsException. This happens with many PDF files.
Environment
Please provide as many details as possible: Spring AI version, Java version, which vector store you use if any, etc
spring component version spring-ai-pdf-document-reader-1.0.0-20241106.071720
java 21
Steps to reproduce
new PagePdfDocumentReader(new InputStreamResource(file.getInputStream()),
        PdfDocumentReaderConfig.builder()
            .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                .build())
            .withPagesPerDocument(1)
            .build()).get()
on the attached PDF file (service-manual-whirlpool-dishwashers.pdf).
Expected behavior
Of course a proper boundary check :)
Minimal Complete Reproducible example
See the attached code. Pay attention to the isSpaceCharacterAtIndex method of the TextLine class: it needs a check for a negative index.
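To make the requested check concrete, here is a minimal sketch of the failure and of one possible guard. It is not the actual Spring AI source: the class name TextLineBoundsSketch and the methods isSpaceUnchecked and isSpaceChecked are hypothetical, and treating an out-of-range index as "not a space" is an assumption about the desired behavior.

```java
// Hypothetical stand-in for the lookup described in this issue; not the Spring AI TextLine class.
final class TextLineBoundsSketch {

    private static final char SPACE = ' ';

    // Unguarded lookup: String.charAt throws StringIndexOutOfBoundsException
    // when index < 0 or index >= line.length().
    static boolean isSpaceUnchecked(String line, int index) {
        return line.charAt(index) == SPACE;
    }

    // Guarded lookup: an index outside the line is reported as "not a space"
    // instead of throwing (assumed behavior, chosen for illustration).
    static boolean isSpaceChecked(String line, int index) {
        if (index < 0 || index >= line.length()) {
            return false;
        }
        return line.charAt(index) == SPACE;
    }

    public static void main(String[] args) {
        String line = "abc";
        int index = 0; // decrementing this yields -1, as in getNextValidIndex

        try {
            isSpaceUnchecked(line, index - 1);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed: " + e.getMessage());
        }

        System.out.println("checked lookup: " + isSpaceChecked(line, index - 1));
    }
}
```

Whether the guard belongs inside isSpaceCharacterAtIndex or in the caller that computes index - 1 depends on how the real class uses the result, so this only illustrates the boundary condition.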
@ilayaperumalg
Hi, I added tests and rewrote the class to be more efficient. The string methods used while changing the line content can be made much more performant. I also fixed the index-out-of-bounds issue. Please double check!