Fix for two bugs related to Unicode translation support by Font objects #667

Closed
wants to merge 4 commits into from

Commits on Jan 19, 2024

  1. Fix for two bugs related to Unicode translation support by Font objects

    Symptom: the contents of some documents rendered as a run of control
    characters, which are the untranslated strings.  For two different
    reasons, these strings weren't being translated by
    \Smalot\PdfParser\Font::decodeContent() in some circumstances.
    
    First fix is to \Smalot\PdfParser\Font::loadTranslateTable():
    
      - Fixed bug where bfchar sections weren't loaded due to a mistake in the
        regexp.
      - The regexp now uses `*` instead of `+` and thus supports translation
        tables with lines like `<0000><0000>`.  (It required `<0000> <0000>`
        before.)
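
To illustrate the `*` versus `+` point, here is a minimal sketch. The two patterns are simplified stand-ins written for this example and are assumptions; the actual regexp in loadTranslateTable() is not shown in this commit message and is likely more involved.

```php
<?php
// Hypothetical, simplified patterns (not the library's real regexp).
$strict  = '/<([0-9a-f]+)>\s+<([0-9a-f]+)>/i'; // whitespace between groups required
$relaxed = '/<([0-9a-f]+)>\s*<([0-9a-f]+)>/i'; // whitespace between groups optional

$spaced = '<0000> <0000>';
$packed = '<0000><0000>';

// preg_match() returns 1 on a match and 0 on no match.
$strictSpaced  = preg_match($strict, $spaced);
$strictPacked  = preg_match($strict, $packed);
$relaxedSpaced = preg_match($relaxed, $spaced);
$relaxedPacked = preg_match($relaxed, $packed);
```

With the strict pattern only the spaced line matches; with the relaxed pattern both do, which is the behaviour the fix is after.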
    
    Second fix is for documents that attach their Font objects to the Pages
    object instead of each Page object:
    
      - \Smalot\PdfParser\Page now has a setFonts() method
      - \Smalot\PdfParser\Pages now declares its $fonts variable
      - \Smalot\PdfParser\Pages::getPages() now applies the object's fonts to each child Page
      - \Smalot\PdfParser\Pages::getFonts() added (copied from the Page class)
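
The hand-off described above can be sketched roughly as follows. SketchPage and SketchPages are hypothetical stand-ins invented for this example; they model only the font propagation, not the real Page and Pages classes.

```php
<?php
// Hypothetical stand-in for Page: it only stores fonts.
final class SketchPage
{
    /** @var array<string, string> font resource name => font description */
    private array $fonts = [];

    public function setFonts(array $fonts): void
    {
        $this->fonts = $fonts;
    }

    public function getFonts(): array
    {
        return $this->fonts;
    }
}

// Hypothetical stand-in for Pages: it owns the fonts and the child pages.
final class SketchPages
{
    /** @var SketchPage[] */
    private array $kids;

    /** @var array<string, string> */
    private array $fonts;

    public function __construct(array $kids, array $fonts)
    {
        $this->kids = $kids;
        $this->fonts = $fonts;
    }

    /** @return SketchPage[] */
    public function getPages(): array
    {
        foreach ($this->kids as $page) {
            // Apply the fonts attached to the Pages node to each child page.
            $page->setFonts($this->fonts);
        }

        return $this->kids;
    }
}

$page = new SketchPage();
$pages = new SketchPages([$page], ['F1' => 'Helvetica']);
$result = $pages->getPages();
```

After getPages() runs, each child page can resolve font names such as F1 even though the font was declared one level up.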
    unixnut committed Jan 19, 2024
    cb1a70d
  2. Changes to RawDataParser to be aware of object IDs during stream parsing

    Changed \Smalot\PdfParser\RawData\RawDataParser::getHeaderValue() to be
    a static method.
    
    \Smalot\PdfParser\RawData\RawDataParser::decodeStream() now has an
    optional additional parameter $objRefArr (passed by
    getIndirectObject()).
    
    Added \Smalot\PdfParser\RawData\DataHelper
    /home/alastair/src/unixnut_pdfparser (git):
    diff --git c/src/Smalot/PdfParser/RawData/DataHelper.php i/src/Smalot/PdfParser/RawData/DataHelper.php
    new file mode 100644
    index 0000000..2f5d42f
    --- /dev/null
    +++ i/src/Smalot/PdfParser/RawData/DataHelper.php
    @@ -0,0 +1,64 @@
    +<?php
    +
    +/**
    + * This file is based on code of tecnickcom/TCPDF PDF library.
    + *
    + * Original author Nicola Asuni ([email protected]) and
    + * contributors (https://github.com/tecnickcom/TCPDF/graphs/contributors).
    + *
    + * @see https://github.com/tecnickcom/TCPDF
    + *
    + * Original code was licensed on the terms of the LGPL v3.
    + *
    + * ------------------------------------------------------------------------------
    + *
    + * @file This file is part of the PdfParser library.
    + *
    + * @author  Alastair Irvine <[email protected]>
    + *
    + * @Date    2024-01-12
    + *
    + * @license LGPLv3
    + *
    + * @url     <https://github.com/smalot/pdfparser>
    + *
    + *  PdfParser is a pdf library written in PHP, extraction oriented.
    + *  Copyright (C) 2017 - Sébastien MALOT <[email protected]>
    + *
    + *  This program is free software: you can redistribute it and/or modify
    + *  it under the terms of the GNU Lesser General Public License as published by
    + *  the Free Software Foundation, either version 3 of the License, or
    + *  (at your option) any later version.
    + *
    + *  This program is distributed in the hope that it will be useful,
    + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
    + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    + *  GNU Lesser General Public License for more details.
    + *
    + *  You should have received a copy of the GNU Lesser General Public License
    + *  along with this program.
    + *  If not, see <http://www.pdfparser.org/sites/default/LICENSE.txt>.
    + */
    +
    +namespace Smalot\PdfParser\RawData;
    +
    +class DataHelper
    +{
    +    /*
    +     * Decode a string of the form "15_0" into an array.
    +     *
    +     * @param string $objRef    The object ID
    +     * @return array object number and generation
    +     *
    +     * @throws \Exception if @p $objRef is invalid
    +     */
    +    public static function decodeRef(string $objRef): array
    +    {
    +        $objRefArr = \explode('_', $objRef);
    +        if (2 !== \count($objRefArr)) {
    +            throw new \Exception('Invalid object reference for $obj.');
    +        }
    +
    +        return $objRefArr;
    +    }
    +}
    diff --git c/src/Smalot/PdfParser/RawData/RawDataParser.php i/src/Smalot/PdfParser/RawData/RawDataParser.php
    index 7763089..433a67a 100644
    --- c/src/Smalot/PdfParser/RawData/RawDataParser.php
    +++ i/src/Smalot/PdfParser/RawData/RawDataParser.php
    @@ -87,7 +87,7 @@ class RawDataParser
          *
          * @throws \Exception
          */
    -    protected function decodeStream(string $pdfData, array $xref, array $sdic, string $stream): array
    +    protected function decodeStream(string $pdfData, array $xref, array $sdic, string $stream, array $objRefArr = null): array
         {
             // get stream length and filters
             $slength = \strlen($stream);
    @@ -524,10 +524,7 @@ class RawDataParser
              * build indirect object header
              */
             // $objHeader = "[object number] [generation number] obj"
    -        $objRefArr = explode('_', $objRef);
    -        if (2 !== \count($objRefArr)) {
    -            throw new \Exception('Invalid object reference for $obj.');
    -        }
    +        $objRefArr = DataHelper::decodeRef($objRef);
    
             $objHeaderLen = $this->getObjectHeaderLen($objRefArr);
    
    @@ -558,7 +555,7 @@ class RawDataParser
                 $offset = $element[2];
                 // decode stream using stream's dictionary information
                 if ($decoding && ('stream' === $element[0]) && null != $header) {
    -                $element[3] = $this->decodeStream($pdfData, $xref, $header[1], $element[1]);
    +                $element[3] = $this->decodeStream($pdfData, $xref, $header[1], $element[1], $objRefArr);
                 }
                 $objContentArr[$i] = $element;
                 $header = isset($element[0]) && '<<' === $element[0] ? $element : null;
    @@ -760,8 +757,8 @@ class RawDataParser
                             $offset += \strlen($matches[0]);
    
                             // we get stream length here to later help preg_match test less data
    -                        $streamLen = (int) $this->getHeaderValue($headerDic, 'Length', 'numeric', 0);
    -                        $skip = false === $this->config->getRetainImageContent() && 'XObject' == $this->getHeaderValue($headerDic, 'Type', '/') && 'Image' == $this->getHeaderValue($headerDic, 'Subtype', '/');
    +                        $streamLen = (int) self::getHeaderValue($headerDic, 'Length', 'numeric', 0);
    +                        $skip = false === $this->config->getRetainImageContent() && 'XObject' == self::getHeaderValue($headerDic, 'Type', '/') && 'Image' == self::getHeaderValue($headerDic, 'Subtype', '/');
    
                             $pregResult = preg_match(
                                 '/(endstream)[\x09\x0a\x0c\x0d\x20]/isU',
    @@ -814,7 +811,7 @@ class RawDataParser
          *
          * @return string|array|null value of obj header's section, or default value if none found, or its type doesn't match $type param
          */
    -    private function getHeaderValue(?array $headerDic, string $key, string $type, $default = '')
    +    public static function getHeaderValue(?array $headerDic, string $key, string $type, $default = '')
         {
             if (false === \is_array($headerDic)) {
                 return $default;
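
For reference, the behaviour of DataHelper::decodeRef() from the diff above can be tried out with a standalone sketch. The function name decodeRefSketch is hypothetical; it mirrors the method body as a plain function so that it runs without the library installed.

```php
<?php
// Standalone copy of the decodeRef() logic for experimentation.
function decodeRefSketch(string $objRef): array
{
    $objRefArr = explode('_', $objRef);
    if (2 !== count($objRefArr)) {
        throw new \Exception('Invalid object reference: ' . $objRef);
    }

    // Both parts are still strings, e.g. ['15', '0'].
    return $objRefArr;
}

[$objNum, $genNum] = decodeRefSketch('15_0');

// A reference without exactly one underscore is rejected.
$threw = false;
try {
    decodeRefSketch('15');
} catch (\Exception $e) {
    $threw = true;
}
```

Note that explode() returns strings, so callers that need integers have to cast the object number and generation themselves.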
    unixnut committed Jan 19, 2024
    4ae0c9e
  3. Fix for raw string parsing

    \Smalot\PdfParser\RawData\RawDataParser::getRawObject() now uses
    (almost) correct quoting semantics.
    
    It does not yet support octal escapes, e.g. `\037` (see
    \Smalot\PdfParser\Font::decodeOctal()).
    
    Note that \Smalot\PdfParser\Page::extractDecodedRawData() still needs
    fixing.
    
    Both functions should probably use a common helper function.
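
As a rough idea of what such a common helper could look like, here is a sketch that decodes the backslash escapes of a PDF literal string body (the text between the outer parentheses). The function name unescapePdfLiteral is hypothetical and this is not the code from the commit; unlike the commit, the sketch also handles octal escapes.

```php
<?php
/**
 * Sketch: resolve backslash escapes in the body of a PDF literal string.
 * Hypothetical helper, not part of the library.
 */
function unescapePdfLiteral(string $body): string
{
    $map = [
        'n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\x0C",
        '(' => '(', ')' => ')', '\\' => '\\',
    ];
    $octal = '01234567';
    $out = '';
    $len = strlen($body);

    for ($i = 0; $i < $len; ++$i) {
        $c = $body[$i];
        if ('\\' !== $c) {
            $out .= $c;
            continue;
        }
        if (++$i >= $len) {
            break; // lone trailing backslash: ignored
        }
        $next = $body[$i];
        if (isset($map[$next])) {
            $out .= $map[$next];
        } elseif (false !== strpos($octal, $next)) {
            // One to three octal digits; overflow above 0xFF is masked off.
            $digits = $next;
            while (strlen($digits) < 3 && $i + 1 < $len && false !== strpos($octal, $body[$i + 1])) {
                $digits .= $body[++$i];
            }
            $out .= chr(octdec($digits) & 0xFF);
        } elseif ("\r" === $next || "\n" === $next) {
            // Backslash + end-of-line is a line continuation (CRLF counts once).
            if ("\r" === $next && $i + 1 < $len && "\n" === $body[$i + 1]) {
                ++$i;
            }
        } else {
            $out .= $next; // unknown escape: the backslash is dropped
        }
    }

    return $out;
}

$parens = unescapePdfLiteral('a\\(b\\)');
$octalByte = unescapePdfLiteral('\\037');
```

Finding where the literal string ends (balanced, unescaped parentheses) is a separate step that this sketch leaves out.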
    unixnut committed Jan 19, 2024
    65e3852
  4. Decryption support

    Relies on three primary classes:
    
      - \Smalot\PdfParser\Encryption\Info
      - \Smalot\PdfParser\Encryption\FileKey
      - \Smalot\PdfParser\Encryption\Stream
    
    This namespace also has a number of exception classes.
    
    \Smalot\PdfParser\RawData\RawDataParser interacts with decryption
    support via:
    
      - $decryptionHelper variable
      - decodeStream() now handles decryption after header processing and
        before the object filters are invoked
      - parseData() calls setupDecryption() if the document has an Encrypt object
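
The ordering in the second point (decrypt first, then run the stream filters) can be sketched as below. Everything here is an assumption made for illustration: decodeStreamSketch and xorBytes are hypothetical names, and the XOR routine is a toy stand-in for a cipher, not what the Encryption classes do. It requires PHP 7.4+ and the zlib extension.

```php
<?php
// Toy "cipher": XOR with a repeating key. It only demonstrates the order of
// operations; real PDF encryption uses proper ciphers and per-object keys.
function xorBytes(string $data, string $key): string
{
    $out = '';
    $keyLen = strlen($key);
    for ($i = 0, $n = strlen($data); $i < $n; ++$i) {
        $out .= chr(ord($data[$i]) ^ ord($key[$i % $keyLen]));
    }

    return $out;
}

// Hypothetical decode step: decrypt the raw bytes, then apply the filters.
function decodeStreamSketch(string $raw, ?callable $decrypt, array $filters): string
{
    $data = null === $decrypt ? $raw : $decrypt($raw);
    foreach ($filters as $filter) {
        if ('FlateDecode' === $filter) {
            $inflated = gzuncompress($data);
            if (false === $inflated) {
                throw new \RuntimeException('FlateDecode failed');
            }
            $data = $inflated;
        }
    }

    return $data;
}

// Writer side: compress first, then encrypt the compressed bytes.
$key = 'k3y';
$stored = xorBytes(gzcompress('BT (Hello) Tj ET'), $key);

// Reader side: the reverse order, decrypt and then inflate.
$plain = decodeStreamSketch(
    $stored,
    fn (string $s): string => xorBytes($s, $key),
    ['FlateDecode']
);
```

Running the filters before decryption would hand encrypted bytes to gzuncompress(), which is why the decryption step has to sit between header processing and filter invocation.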
    
    Also added is \Smalot\PdfParser\Utils with a number of static utility
    methods for binary string processing, etc.
    unixnut committed Jan 19, 2024
    1236fec