-
Notifications
You must be signed in to change notification settings - Fork 535
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow for CRC23 checksum gzuncompress() #622
Conversation
A zlib compressed stream may have a CRC32 checksum instead of Adler-32 which the PHP gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042 Partially resolves smalot#592.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good extension!
Your code looks a bit clunky, because of a try-catch inside a catch. Also, an @
in front of a function may absorb errors which otherwise would have came to the developers attention. Some changes are just opinionated to improve readability. Here is my suggestion for the whole try-catch part (untested):
// initialize string to return
try {
$decoded = gzuncompress($data, $decodeMemoryLimit);
if (false === $decoded) {
// If gzuncompress() failed try again allowing for a CRC32
// checksum instead of Adler-32.
// See: https://www.php.net/manual/en/function.gzuncompress.php#79042
// Issue: https://github.com/smalot/pdfparser/issues/592
$crc32 = tempnam('/tmp', 'gz_fix');
if (false != $crc32) {
file_put_contents($crc32, "\x1f\x8b\x08\x00\x00\x00\x00\x00".$data);
$decoded = file_get_contents('compress.zlib://'.$crc32);
unlink($crc32);
}
if (is_string($decoded) && 0 < strlen($decoded)) {
// all good
} else {
// If the decoded string is empty, that means decoding failed.
throw new \Exception('TODO: meaningful error message');
}
} catch (\Exception $e) {
throw $e;
} finally {
// Restore old handler just in case it was customized outside of PDFParser.
restore_error_handler();
}
What do you think?
So I tried this and it turns out that the original So either there must be a try/catch within a try/catch, or else the protected function decodeFilterFlateDecode(string $data, int $decodeMemoryLimit): ?string
{
/* REMOVE THIS
* gzuncompress may throw a not catchable E_WARNING in case of an error (like $data is empty)
* the following set_error_handler changes an E_WARNING to an E_ERROR, which is catchable.
*/
/* REMOVE THIS
set_error_handler(function ($errNo, $errStr) {
if (\E_WARNING === $errNo) {
throw new \Exception($errStr);
} else {
// fallback to default php error handler
return false;
}
}); */
$decoded = null;
// initialize string to return
/* REMOVE THIS try { */
// Uncatchable E_WARNING for "data error" is @ suppressed
// so execution may proceed with an alternate decompression
// method.
$decoded = @gzuncompress($data, $decodeMemoryLimit);
if (false === $decoded) {
// If gzuncompress() failed, try again using the compress.zlib://
// wrapper to decode it in a file-based context.
// See: https://www.php.net/manual/en/function.gzuncompress.php#79042
// Issue: https://github.com/smalot/pdfparser/issues/592
$ztmp = tempnam('/tmp', 'gz_fix');
if (false != $ztmp) {
file_put_contents($ztmp, "\x1f\x8b\x08\x00\x00\x00\x00\x00".$data);
$decoded = file_get_contents('compress.zlib://'.$ztmp);
unlink($ztmp);
}
}
if (!is_string($decoded) || 0 === strlen($decoded)) {
// If the decoded string is empty, that means decoding failed.
throw new \Exception('decodeFilterFlateDecode: invalid data');
}
/* REMOVE THIS
} catch (\Exception $e) {
throw $e;
} finally {
// Restore old handler just in case it was customized outside of PDFParser.
restore_error_handler();
} */
return $decoded;
} This change will require changing some tests in FilterHelperTest.php as it will no longer be returning several different exceptions. Just the one 'invalid data' one. |
Good point. So the final code would look like the following (including some minor adaptions)? protected function decodeFilterFlateDecode(string $data, int $decodeMemoryLimit): ?string
{
// Uncatchable E_WARNING for "data error" is @ suppressed
// so execution may proceed with an alternate decompression
// method.
$decoded = @gzuncompress($data, $decodeMemoryLimit);
if (false === $decoded) {
// If gzuncompress() failed, try again using the compress.zlib://
// wrapper to decode it in a file-based context.
// See: https://www.php.net/manual/en/function.gzuncompress.php#79042
// Issue: https://github.com/smalot/pdfparser/issues/592
$ztmp = tempnam('/tmp', 'gz_fix');
if (false != $ztmp) {
file_put_contents($ztmp, "\x1f\x8b\x08\x00\x00\x00\x00\x00".$data);
$decoded = file_get_contents('compress.zlib://'.$ztmp);
unlink($ztmp);
}
}
if (false === is_string($decoded) || 0 === strlen($decoded)) {
// If the decoded string is empty, that means decoding failed.
throw new \Exception('decodeFilterFlateDecode: invalid data');
}
return $decoded;
} Do we have to adapt
|
Yes. We should even account for the file_get_contents('compress.zlib://'.$ztmp, false, null, 0, $decodeMemoryLimit);
No, the sections that are coded differently than |
Sounds reasonable, please do. |
Instead of setting an error handler to catch the E_WARNING's that gzuncompress() emits, suppress it with an @ so we can do away with the try/catch. Make a note of this in the comments. Switch from using tempnam() to tmpfile() because tempnam() can emit E_NOTICE's and would have to be suppressed as well. tmpfile() just returns a handle or false. Limit file_get_contents() by the $decodeMemoryLimit. Unlike gzuncompress() for which a limit value of zero (0) means "no limit", file_get_contents() takes null to mean "no limit".
Fix for PHP < 8.0 that doesn't like a length limit of null for file_get_contents().
Ready to merge? |
A zlib compressed stream may have a CRC32 checksum instead of Adler-32 which the PHP gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042
Partially resolves #592 by correctly decoding the document's default font stream. Includes
inv.pdf
from that issue by permission from the issue author.