Fix for two bugs related to Unicode translation support by Font objects #667

Closed
wants to merge 4 commits into from

unixnut
Contributor

@unixnut unixnut commented Jan 19, 2024

The symptom was that some documents' contents were rendering as a bunch of control characters. These are the untranslated strings. This happened because, for two different reasons, these strings weren't being translated by \Smalot\PdfParser\Font::decodeContent() in some circumstances.

First fix is to \Smalot\PdfParser\Font::loadTranslateTable():

  • Fixed bug where bfchar sections weren't loaded due to mistake in regexp.
  • It now uses * instead of + and thus supports translation tables with lines like <0000><0000>. (Required <0000> <0000> before.)
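
To illustrate the quantifier change, here is a minimal sketch. The two patterns are simplified stand-ins written for this example, not the actual regexp used in loadTranslateTable(); only the `+` versus `*` difference is taken from the description above.

```php
<?php

// Hypothetical, simplified patterns for one bfchar line: <source> <destination>.
$strict  = '/<([0-9A-F]+)> +<([0-9A-F]+)>/i'; // needs at least one space
$lenient = '/<([0-9A-F]+)> *<([0-9A-F]+)>/i'; // space is optional

// One line without a separating space, one line with it.
$section = "<0000><0000>\n<0041> <0041>";

preg_match_all($strict, $section, $strictMatches);
preg_match_all($lenient, $section, $lenientMatches);

// The strict pattern skips the first line; the lenient one picks up both.
echo count($strictMatches[0]), ' vs ', count($lenientMatches[0]), PHP_EOL;
```

With `+`, the `<0000><0000>` line is silently dropped, which would leave that character code untranslated.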

Second fix is for documents that attach their Font objects to the Pages object instead of each Page object:

  • \Smalot\PdfParser\Page now has a setFonts() method
  • \Smalot\PdfParser\Pages now declares its $fonts variable
  • \Smalot\PdfParser\Pages::getPages() now applies the object's fonts to each child Page
  • \Smalot\PdfParser\Pages::getFonts() copied from Page class
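
The propagation described in this list could look roughly like the sketch below. `SketchPage` and `SketchPages` are hypothetical stand-ins invented for this example; they are not the library's Page and Pages classes, and the real getPages() and getFonts() do considerably more. Whether fonts defined on a Page itself should take precedence is not stated above, so this sketch simply overwrites.

```php
<?php

// Hypothetical stand-in for a single page.
final class SketchPage
{
    /** @var array font id => font description */
    private $fonts = [];

    public function setFonts(array $fonts): void
    {
        $this->fonts = $fonts;
    }

    public function getFonts(): array
    {
        return $this->fonts;
    }
}

// Hypothetical stand-in for the page tree node that owns the fonts.
final class SketchPages
{
    /** @var array */
    private $fonts;

    /** @var SketchPage[] */
    private $kids;

    public function __construct(array $fonts, array $kids)
    {
        $this->fonts = $fonts;
        $this->kids = $kids;
    }

    /** @return SketchPage[] */
    public function getPages(): array
    {
        // Hand the fonts attached to the Pages node down to each child page.
        foreach ($this->kids as $page) {
            $page->setFonts($this->fonts);
        }

        return $this->kids;
    }
}

$pages = new SketchPages(['F1' => 'Helvetica'], [new SketchPage(), new SketchPage()]);
$first = $pages->getPages()[0];
```

After getPages() returns, each child page can resolve `F1` even though the font was only declared on the parent node.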

Type of pull request

  • Bug fix (involves code and configuration changes)
  • New feature (involves code and configuration changes)
  • Documentation update
  • Something else

About

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information:

  • Please add at least one test case (unit test, system test, ...) to demonstrate that the change is working. If existing code was changed, your tests should cover these code parts as well.
    By the way, you don't have to provide a full-fledged PDF file to demonstrate a fix. Sometimes a unit test is sufficient;
    please have a look at FontTest for example code.
    Code changes without any tests are likely to be rejected. If you don't know how to write tests, no problem, tell us upfront and we may add them ourselves or discuss other ways.
  • Please run PHP-CS-Fixer before committing, to conform to our coding style. See https://github.com/smalot/pdfparser/blob/master/.php-cs-fixer.php for more information about our coding style.
  • In case you fix an existing issue, please do one of the following:
    • Write in this text something like fixes #1234 to outline that you are providing a fix for the issue #1234.
    • After the pull request is created, you will find on the right side a section called Development. There you can select issues that will be closed once your pull request is merged.
  • In case you changed internal behavior or functionality, please check our documentation to make sure these changes are documented properly: https://github.com/smalot/pdfparser/tree/master/doc
  • In case you want to discuss new ideas/changes and you are not sure, just create a pull request and mark it as a draft
    (see here for more information).
    This will tell us, that it is not ready for merge, but you want to discuss certain issues.

Changed \Smalot\PdfParser\RawData\RawDataParser::getHeaderValue() to be
a static method.

\Smalot\PdfParser\RawData\RawDataParser::decodeStream() now has an
optional additional parameter $objRefArr (passed by
getIndirectObject()).

Added \Smalot\PdfParser\RawData\DataHelper
/home/alastair/src/unixnut_pdfparser (git):
diff --git c/src/Smalot/PdfParser/RawData/DataHelper.php i/src/Smalot/PdfParser/RawData/DataHelper.php
new file mode 100644
index 0000000..2f5d42f
--- /dev/null
+++ i/src/Smalot/PdfParser/RawData/DataHelper.php
@@ -0,0 +1,64 @@
+<?php
+
+/**
+ * This file is based on code of tecnickcom/TCPDF PDF library.
+ *
+ * Original author Nicola Asuni ([email protected]) and
+ * contributors (https://github.com/tecnickcom/TCPDF/graphs/contributors).
+ *
+ * @see https://github.com/tecnickcom/TCPDF
+ *
+ * Original code was licensed on the terms of the LGPL v3.
+ *
+ * ------------------------------------------------------------------------------
+ *
+ * @file This file is part of the PdfParser library.
+ *
+ * @author  Alastair Irvine <[email protected]>
+ *
+ * @Date    2024-01-12
+ *
+ * @license LGPLv3
+ *
+ * @url     <https://github.com/smalot/pdfparser>
+ *
+ *  PdfParser is a pdf library written in PHP, extraction oriented.
+ *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
+ *
+ *  This program is free software: you can redistribute it and/or modify
+ *  it under the terms of the GNU Lesser General Public License as published by
+ *  the Free Software Foundation, either version 3 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU Lesser General Public License for more details.
+ *
+ *  You should have received a copy of the GNU Lesser General Public License
+ *  along with this program.
+ *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
+ */
+
+namespace Smalot\PdfParser\RawData;
+
+class DataHelper
+{
+    /**
+     * Decode a string of the form "15_0" into an array.
+     *
+     * @param string $objRef    The object ID
+     * @return array object number and generation
+     *
+     * @throws \Exception if @p $objRef is invalid
+     */
+    public static function decodeRef(string $objRef): array
+    {
+        $objRefArr = \explode('_', $objRef);
+        if (2 !== \count($objRefArr)) {
+            throw new \Exception('Invalid object reference for $obj.');
+        }
+
+        return $objRefArr;
+    }
+}
diff --git c/src/Smalot/PdfParser/RawData/RawDataParser.php i/src/Smalot/PdfParser/RawData/RawDataParser.php
index 7763089..433a67a 100644
--- c/src/Smalot/PdfParser/RawData/RawDataParser.php
+++ i/src/Smalot/PdfParser/RawData/RawDataParser.php
@@ -87,7 +87,7 @@ class RawDataParser
      *
      * @throws \Exception
      */
-    protected function decodeStream(string $pdfData, array $xref, array $sdic, string $stream): array
+    protected function decodeStream(string $pdfData, array $xref, array $sdic, string $stream, ?array $objRefArr = null): array
     {
         // get stream length and filters
         $slength = \strlen($stream);
@@ -524,10 +524,7 @@ class RawDataParser
          * build indirect object header
          */
         // $objHeader = "[object number] [generation number] obj"
-        $objRefArr = explode('_', $objRef);
-        if (2 !== \count($objRefArr)) {
-            throw new \Exception('Invalid object reference for $obj.');
-        }
+        $objRefArr = DataHelper::decodeRef($objRef);

         $objHeaderLen = $this->getObjectHeaderLen($objRefArr);

@@ -558,7 +555,7 @@ class RawDataParser
             $offset = $element[2];
             // decode stream using stream's dictionary information
             if ($decoding && ('stream' === $element[0]) && null != $header) {
-                $element[3] = $this->decodeStream($pdfData, $xref, $header[1], $element[1]);
+                $element[3] = $this->decodeStream($pdfData, $xref, $header[1], $element[1], $objRefArr);
             }
             $objContentArr[$i] = $element;
             $header = isset($element[0]) && '<<' === $element[0] ? $element : null;
@@ -760,8 +757,8 @@ class RawDataParser
                         $offset += \strlen($matches[0]);

                         // we get stream length here to later help preg_match test less data
-                        $streamLen = (int) $this->getHeaderValue($headerDic, 'Length', 'numeric', 0);
-                        $skip = false === $this->config->getRetainImageContent() && 'XObject' == $this->getHeaderValue($headerDic, 'Type', '/') && 'Image' == $this->getHeaderValue($headerDic, 'Subtype', '/');
+                        $streamLen = (int) self::getHeaderValue($headerDic, 'Length', 'numeric', 0);
+                        $skip = false === $this->config->getRetainImageContent() && 'XObject' == self::getHeaderValue($headerDic, 'Type', '/') && 'Image' == self::getHeaderValue($headerDic, 'Subtype', '/');

                         $pregResult = preg_match(
                             '/(endstream)[\x09\x0a\x0c\x0d\x20]/isU',
@@ -814,7 +811,7 @@ class RawDataParser
      *
      * @return string|array|null value of obj header's section, or default value if none found, or its type doesn't match $type param
      */
-    private function getHeaderValue(?array $headerDic, string $key, string $type, $default = '')
+    public static function getHeaderValue(?array $headerDic, string $key, string $type, $default = '')
     {
         if (false === \is_array($headerDic)) {
             return $default;
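
For reference, the decodeRef() helper added in the diff above can be tried in isolation. The sketch below copies its body into a free function (the free-function form is an assumption made to keep the example self-contained) and shows what it accepts and rejects.

```php
<?php

// Same logic as DataHelper::decodeRef() in the diff, as a standalone function.
function decodeRef(string $objRef): array
{
    $objRefArr = \explode('_', $objRef);
    if (2 !== \count($objRefArr)) {
        throw new \Exception('Invalid object reference for $obj.');
    }

    return $objRefArr;
}

// "15_0" means object number 15, generation 0. Both parts come back as strings.
[$objNum, $genNum] = decodeRef('15_0');

// Anything that does not split into exactly two parts is rejected.
$rejected = false;
try {
    decodeRef('15');
} catch (\Exception $e) {
    $rejected = true;
}
```

Note that the check only counts the parts; a value such as `a_b` would pass, so callers that need integers still have to validate or cast them.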
\Smalot\PdfParser\RawData\RawDataParser::getRawObject() now uses
(almost) correct quoting semantics.

Does not yet support octal elements, e.g. `\037` (see
\Smalot\PdfParser\Font::decodeOctal())

Note that \Smalot\PdfParser\Page::extractDecodedRawData() still needs
fixing.

Both functions should probably use a common helper function.
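
As a rough idea of what such a common helper might look like, here is a sketch that resolves the backslash escapes defined for PDF literal strings, including the octal form mentioned above. The function name `unescapePdfLiteral` is hypothetical and this is not code from the PR; it assumes the input is the text between the outer parentheses.

```php
<?php

// Hypothetical helper: resolve backslash escapes inside a PDF literal string.
function unescapePdfLiteral(string $raw): string
{
    $map = [
        'n' => "\n", 'r' => "\r", 't' => "\t",
        'b' => "\x08", 'f' => "\x0C",
        '(' => '(', ')' => ')', '\\' => '\\',
    ];

    return preg_replace_callback(
        // A backslash followed by 1-3 octal digits, a CRLF pair, or any single byte.
        '/\\\\([0-7]{1,3}|\r\n|.)/s',
        function (array $m) use ($map): string {
            $esc = $m[1];
            if (1 === preg_match('/^[0-7]{1,3}$/', $esc)) {
                // Octal character code, e.g. \037; high-order overflow is ignored.
                return chr(octdec($esc) & 0xFF);
            }
            if ("\n" === $esc || "\r" === $esc || "\r\n" === $esc) {
                // Backslash at end of line: the string continues on the next line.
                return '';
            }

            // Unknown escapes drop the backslash and keep the character.
            return $map[$esc] ?? $esc;
        },
        $raw
    );
}

$withParens = unescapePdfLiteral('a\\(b\\)');
$withOctal = unescapePdfLiteral('\\037');
```

Handling every escape in a single pass avoids the double-unescaping that chained str_replace() calls can cause on input such as `\\n`.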
Relies on three primary classes:

  - \Smalot\PdfParser\Encryption\Info
  - \Smalot\PdfParser\Encryption\FileKey
  - \Smalot\PdfParser\Encryption\Stream

This namespace also has a number of exception classes.

\Smalot\PdfParser\RawData\RawDataParser interacts with decryption
support via:

  - $decryptionHelper variable
  - decodeStream() now handles decryption after header processing and
    before the object filters are invoked
  - parseData() calls setupDecryption() if the document has an Encrypt object

Also added is \Smalot\PdfParser\Utils with a number of static utility
methods for binary string processing, etc.
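
The ordering in decodeStream() matters because a PDF writer applies the stream filters first and encrypts the result, so a reader has to decrypt before running the filters. The toy sketch below shows only that ordering. The XOR routine is a deliberately trivial stand-in for the real cipher (PDF encryption uses RC4 or AES), `xorCipher` is a name invented for this example, and the zlib extension is assumed to be available.

```php
<?php

// Toy stand-in for the real cipher; XOR is its own inverse.
function xorCipher(string $data, string $key): string
{
    $out = '';
    $keyLen = strlen($key);
    for ($i = 0, $n = strlen($data); $i < $n; ++$i) {
        $out .= chr(ord($data[$i]) ^ ord($key[$i % $keyLen]));
    }

    return $out;
}

$plain = 'BT /F1 12 Tf (Hello) Tj ET';
$key = 'k3y';

// Writer side: apply the filter (Flate), then encrypt.
$stored = xorCipher(gzcompress($plain), $key);

// Reader side, correct order: decrypt, then apply the filter.
$decoded = gzuncompress(xorCipher($stored, $key));

// Reader side, wrong order: inflating still-encrypted bytes fails.
$wrongOrder = @gzuncompress($stored);
```

Running the filters on the encrypted bytes fails immediately, which is why the decryption step has to sit between header processing and filter decoding.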
Collaborator

@k00ni k00ni left a comment

@unixnut Thank you for this rather big pull request!

First of all, I am super busy currently, so my answers are delayed. Nevertheless, I will try to help to get this merged. Maybe @j0k3r, @GreyWyvern or someone else wants to help out here.

Here are my first questions:

  • Can we split this PR into smaller chunks which are easier to review? We are talking about 1.5k new lines.
  • Do you plan to add unit tests to cover your new changes?
  • Your code is about decryption and encryption, but the title is about Unicode translation. Please elaborate on that a bit.
  • You mention two bugs this PR fixes. Please reference them.

I am looking forward to your response.

@GreyWyvern
Contributor

I'm on vacation atm, but I can take a look at this on Monday. 👍

@GreyWyvern
Contributor

> Your code is about decryption and encryption, but the title is about Unicode translation. Please elaborate on that a bit.

After a first pass of the code, this is what I'd like to know as well. I feel like for this many lines, the description of what this PR does needs to be much, much longer! :D What's the purpose of this encryption/decryption code? Does it supersede the previous PR #653?

I will keep studying it.

@k00ni
Collaborator

k00ni commented Jan 25, 2024

Please merge in master branch, to get rid of these coding style issues.

@unixnut
Contributor Author

unixnut commented Feb 5, 2024

@k00ni @GreyWyvern my mistake, I forgot GitHub consumes all commits on a branch for a PR.

This PR is supposed to be for the first commit only

@k00ni
Collaborator

k00ni commented Feb 5, 2024

> @k00ni @GreyWyvern my mistake, I forgot GitHub consumes all commits on a branch for a PR.
>
> This PR is supposed to be for the first commit only

Proceed as you see fit. Either close this one and create another PR with your changes, or simply remove all additional changes from this PR and force-push to update it.

@GreyWyvern
Contributor

I think I would probably cancel this PR and start a clean one, but that's just me. :)

@k00ni k00ni added the stale needs decision label Feb 9, 2024
@k00ni k00ni closed this Feb 26, 2024
@k00ni
Collaborator

k00ni commented Feb 26, 2024

Please open new PRs with smaller sets of code changes. Thanks.
