Calculating text width and incorrect result array paths #619

Closed
gidzr opened this issue Jul 26, 2023 · 3 comments

gidzr commented Jul 26, 2023

Hey there

Thanks so much for writing this wrapper. It's probably the only decent one out there for pdf parsing.

I've come across a few discrepancies between what the README and Usage docs describe and how the library actually behaves.

Firstly - I've been trying to work out how to get width and height of the bounding box for text (not the page width/height).

I came across your 'calculateTextWidth' function, trying it per your Usage instructions and attempting to do it my own way, but nothing works.

Btw, my pdfFile is a standard text-based PDF created from an MSWord docx using MSWord's SaveAs PDF. It consists of 4 lines of text only. A very basic industry standard PDF.

I've also configured the parser:
use Smalot\PdfParser\Config;
$config = new Config();
$config->setDataTmFontInfoHasToBeIncluded(true);
use Smalot\PdfParser\Parser;
$parser = new Parser([], $config);

/////////////////////////////////////////////////////////////////////////////////////////////////////////////
////1. When running your usage approach (https://github.com/smalot/pdfparser/blob/master/doc/Usage.md), I get a resulting error:
$pdf = $parser->parseFile($pdfFile);
$data = $pdf->getPages()[0]->getDataTm();
$text = $data[0][1];
$font = reset($pdf->getFonts()); <------- line 299
$width = $font->calculateTextWidth($text);

A PHP Error was encountered
Severity: Notice
Message: Only variables should be passed by reference
Line Number: 299

/////////////////////////////////////////////////////////////////////////////////////////////////////////////
////2. When running your usage approach (https://github.com/smalot/pdfparser/blob/master/doc/Usage.md), I get a resulting error:
$pdf = $parser->parseFile($pdfFile);
$data = $pdf->getPages()[0]->getDataTm();
$fonts = $pdf->getFonts();
$font_id = $data[0][2]; //R7
$font = $fonts[$font_id]; <------- line 310
$text = $data[0][1];
$width = $font->calculateTextWidth($text);

A PHP Error was encountered
Severity: Warning
Message: Undefined array key "F1"
Line Number: 310

/////////////////////////////////////////////////////////////////////////////////////////////////////////////
////3. When I adjust the approach to match the arrays I'm actually getting out of the parser (taking the fonts from the page instead of the document), I get a different error:
$pdf = $parser->parseFile($pdfFile);
$data = $pdf->getPages()[0]->getDataTm();
$fonts = $pdf->getPages()[0]->getFonts();
$font_id = $data[0][2]; //R7
$font = $fonts[$font_id];
$text = $data[0][1];
$width = $font->calculateTextWidth($text);

A PHP Error was encountered
Severity: Warning
Message: Undefined array key "Widths"
Filename: PdfParser/Font.php
Line Number: 279
Backtrace:
File: C:\xampp81\php\vendor\smalot\pdfparser\src\Smalot\PdfParser\Font.php
Line: 279
Function: _error_handler
File: C:\xampp81\htdocs\villg.life\application\views\tabsSecurity\listImage.php
Line: 286
Function: calculateTextWidth

/////////////////////////////////////////////////////////////////////////////////////////////////////////////

When I call ->getDetails() manually on the page (in the same way your function calculateTextWidth(string $text, array &$missing = null) does at https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Font.php), I get the following array from the page, which is page-related data, not the bounding box for the text snippet.
Array (
    [Type] => Page
    [Parent] => Array (
        [Type] => Pages
        [Count] => 1
    )
    [Resources] => Array (
        [ExtGState] => Array (
            [GS5] => Array ( [Type] => ExtGState [BM] => Normal [ca] => 1 )
            [GS11] => Array ( [Type] => ExtGState [BM] => Normal [CA] => 1 )
        )
        [Font] => Array (
            [F1] => Array ( [Name] => ArialMT [Type] => Type0 [Encoding] => Identity-H [Subtype] => Type0 [BaseFont] => ArialMT )
            [F2] => Array ( [Name] => ArialMT [Type] => TrueType [Encoding] => WinAnsiEncoding [Subtype] => TrueType [BaseFont] => ArialMT [FirstChar] => 32 [LastChar] => 32 )
        )
        [ProcSet] => Array ( [0] => PDF [1] => Text [2] => ImageB [3] => ImageC [4] => ImageI )
    )
    [MediaBox] => Array ( [0] => 0 [1] => 0 [2] => 595.32 [3] => 841.92 )
    [Contents] => Array ( [Filter] => FlateDecode [Length] => 683 )
    [Group] => Array ( [Type] => Group [S] => Transparency [CS] => DeviceRGB )
    [Tabs] => S
    [StructParents] => 0
)

I'm not sure what's going on.

Also, I've had a look into the code and it looks like there are a few cool hidden functions that aren't clearly documented in the README. Is that the case or am I overthinking it?

Collaborator

k00ni commented Jul 26, 2023

It is usually good practice to create one issue per problem, otherwise it is harder to provide help. (1) + (2) are important to fix, but they seem to be PDF-dependent rather than something that shows up in general.

k00ni added the bug label Jul 26, 2023
Author

gidzr commented Sep 29, 2023

@k00ni : Noted re issue being split. Will make sure I stick with that approach in future. Thanks for the steer.
Any update on the bug or still on the to-do list?
Thanks 👍

Collaborator

k00ni commented Sep 29, 2023

1. problem

A PHP Error was encountered
Severity: Notice
Message: Only variables should be passed by reference
Line Number: 299

It's because we use reset with a non-variable. Fixed with #644.
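
For readers hitting the same notice, here is a minimal sketch of the cause and a workaround in calling code. `getFontsStub()` is a hypothetical stand-in for `$pdf->getFonts()` and returns made-up strings instead of font objects; this is not the change made in #644.

```php
<?php
// Hypothetical stand-in for $pdf->getFonts(): any function that returns an array.
function getFontsStub(): array
{
    return ['F1' => 'ArialMT (Type0)', 'F2' => 'ArialMT (TrueType)'];
}

// reset() takes its argument by reference, so passing a function result
// directly, as in reset(getFontsStub()), raises
// "Only variables should be passed by reference".

// Workaround: store the result in a variable first.
$fonts = getFontsStub();
$first = reset($fonts);

// Alternative that avoids by-reference functions entirely (PHP >= 7.3).
$firstKey = array_key_first($fonts);
```

Either form works in user code, independent of the library version in use.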

2. problem

A PHP Error was encountered
Severity: Warning
Message: Undefined array key "F1"
Line Number: 310

Your code:

$pdf = $parser->parseFile($pdfFile);
$data = $pdf->getPages()[0]->getDataTm();
$fonts = $pdf->getFonts();
$font_id = $data[0][2]; //R7
$font = $fonts[$font_id]; // <------- line 310

Without the PDF I can't really help here. Maybe $data contains incomplete or wrong information that does not match the result of getFonts.
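
Until the mismatch is understood, calling code can at least avoid the warning by checking that the font ID exists before using it. The sketch below uses plain arrays with made-up values: the `$data` row, the `7_0` key and the `lookupFont()` helper are all assumptions for illustration, not real parser output or library API.

```php
<?php
// Hypothetical helper: returns the font for an ID, or null if the ID is unknown.
function lookupFont(array $fonts, string $fontId)
{
    return $fonts[$fontId] ?? null;
}

// Made-up data shaped like the snippet above: index 1 is the text, index 2 the font ID.
$data = [[['1', '0', '0', '1', '72', '770'], 'Hello', 'F1', '12']];

// Assumption for illustration: one font list is keyed differently from the other.
$documentFonts = ['7_0' => 'ArialMT'];
$pageFonts     = ['F1' => 'ArialMT'];

$fontId = $data[0][2];

$fromDocument = lookupFont($documentFonts, $fontId); // null: key "F1" is not present
$fromPage     = lookupFont($pageFonts, $fontId);     // 'ArialMT'

if ($fromDocument === null) {
    // Show which keys are actually available to help diagnose the mismatch.
    echo 'Font ID ' . $fontId . ' not found; available keys: '
        . implode(', ', array_keys($documentFonts)) . PHP_EOL;
}
```

Printing the available keys next to the requested ID is usually the quickest way to see whether the two lists use different key schemes.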

3. problem

A PHP Error was encountered
Severity: Warning
Message: Undefined array key "Widths"
Filename: PdfParser/Font.php
Line Number: 286 # changed by @k00ni to match current line number

I could not find the origin of the error in time, so I created a hotfix instead: #645 should suppress the warning when the Widths key is not set.
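
As a rough illustration of that kind of guard (not the actual code from #645), the sketch below returns null when the details array has no Widths entry instead of reading a missing key. `sumWidthsOrNull()` is a hypothetical helper, the width values are made up, and the byte-wise character handling is a simplification.

```php
<?php
// Hypothetical helper, not the library's implementation: sums glyph widths
// from a details array, or returns null when no width table is available.
function sumWidthsOrNull(array $details, string $text): ?int
{
    // Guard: without Widths and FirstChar there is nothing to look up.
    if (!isset($details['Widths'], $details['FirstChar'])) {
        return null;
    }

    $total = 0;
    foreach (str_split($text) as $char) {
        // Widths is indexed relative to FirstChar; unknown characters count as 0.
        $index = ord($char) - $details['FirstChar'];
        $total += $details['Widths'][$index] ?? 0;
    }

    return $total;
}

// Shaped like the F2 entry reported above: FirstChar/LastChar but no Widths.
$withoutWidths = ['FirstChar' => 32, 'LastChar' => 32];
$resultMissing = sumWidthsOrNull($withoutWidths, 'Hello');

// Made-up width table covering character codes 32 (space) and 33 (!).
$withWidths = ['FirstChar' => 32, 'LastChar' => 33, 'Widths' => [278, 278]];
$resultFound = sumWidthsOrNull($withWidths, ' !');
```

Returning null rather than 0 lets the caller tell "no width information" apart from "text with zero width".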

4. problem

Please open a new issue for that and provide an example PDF. @GreyWyvern revamped a huge chunk of the library in #634, so please check whether the current version already helps.


I will close this, because most of your problems should be solved (to some extent). For problem 4 open a new issue. In case I forgot something, don't hesitate to comment here.

k00ni closed this as completed Sep 29, 2023