Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Ambiguous translated references #2558

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

biredel
Copy link

@biredel biredel commented Mar 29, 2024

pypdf uses id(obj) to keep track of objects in _id_translated. This identifier is not unique for different objects, only for objects existing at the same time. e.g. In CPython id(obj), is just a memory address. This cannot be used to tell whether we already have referenced a page or not. It might point to a page of a since-deleted PdfReader, silently corrupting out output.

The bug manifests rarely, and the corrupted output document usually looks OK at first glance (but has missing/duplicated content), the following is my best reproducer so far, adapted from https://stackoverflow.com/questions/78186323/

# python3 repro.py < resources/multilang.pdf

import gc, io, sys
from pypdf import PdfReader, PdfWriter

def pdfsize(fin):
    pdfout = PdfWriter()
    pageout = pdfout.add_blank_page(width=10, height=10)
    reader = None
    for i in range(80):
        fin.seek(0)
        del reader
        gc.collect(0)
        reader = PdfReader(fin, strict=True)
        for pagein in reader.pages:
            pageout.merge_page(pagein)
    with io.BytesIO() as fout:
        pdfout.write(fout)
        return fout.tell()

if __name__ == "__main__":
    with io.BytesIO() as fin:
        fin.write(sys.stdin.buffer.read())
        for i in range(50):
            if i == 0:
                first_size = pdfsize(fin)
                first_size = pdfsize(fin)
            current_size = pdfsize(fin)
            if not first_size == current_size:
                raise RuntimeError(
                    f"generated PDF file size changed after repeating {i} times {first_size=} != {current_size=}"
                )

The attached draft replaces dict[id(obj)] with dict[weakref(obj)] - For the simple insert/lookup/delete cases, that behaves the same. Yet no-longer-existing objects are cleared automatically, removing the possible collision.

There is another instance of id(obj) outside of __repr__() methods in PageObject.hash_value_data - I have not determined whether that can cause related problems.

Essentially resubmitting #1995 as a replacement for #1841 which has memory usage side-effects and does not fully avoid #1788 anyway.

id(obj) can and will hit the same value for different objects, if they do not exist at the same time. In CPython, its just a memory address. When that collission occurs, PdfWriter when asked to insert a page from one PdfReader object may silently confuse it with one from a since-deleted PdfReader object, thus corrupting output in not-easily-noticed ways.

Replacing dict[id(obj)] with dict[weakref(obj)] should suffice, as we do not replace entries - we only ever insert or delete.
@pubpub-zz
Copy link
Collaborator

@biredel
can you have a review at #1788.
I'm expecting PdfReader used in merging into PdfWriters should not be deleted from memory through the PR attached.

@biredel
Copy link
Author

biredel commented Mar 29, 2024

@pubpub-zz Thanks for pointing me there. I only now noticed the almost identical PR #1995 .. suggested before I even ran into the bug.

What was applied instead, #1841 did not resolve the problem because it only touches one out of two (maybe three?) opportunities to get it wrong: PageObject.clone()->PdfObject._reference_clone() and IndirectObject.clone() duplicate that code.

I expect that (costly) workaround can be removed if the weakref fix works.

@pubpub-zz
Copy link
Collaborator

Also. Have you try first to clone the pages before calling merge_page ?

Copy link

codecov bot commented Mar 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.73%. Comparing base (e35df5a) to head (03e8b81).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2558      +/-   ##
==========================================
+ Coverage   94.71%   94.73%   +0.01%     
==========================================
  Files          50       50              
  Lines        8237     8257      +20     
  Branches     1646     1651       +5     
==========================================
+ Hits         7802     7822      +20     
  Misses        267      267              
  Partials      168      168              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@biredel
Copy link
Author

biredel commented Mar 30, 2024

Also. Have you try first to clone the pages before calling merge_page ?

That is how I know both code paths are exposed. Have the page clone() first, then PdfObject._reference_clone with the workaround catches the problem. Call merge_page() right away, then IndirectObject.clone bites you.

@pubpub-zz
Copy link
Collaborator

@biredel
I've tried your code unsuccessfully : I met no issue.

also there is a trick in the code : _id_translated values are not only Dict[int, int]] but there is also in the dict an element with "PreventGC" that keeps a link to the reader that should prevent the GC to delete it before the writer. This being assigned during the cloning process, it is currently not called when using merge_page

Therefore, looking at your test code I recommend a test calling pageout.merge_page(pagein.clone(pdfout)) to possibly prevent error.

Can you give feedback ?

@biredel
Copy link
Author

biredel commented Apr 1, 2024

I've tried your code unsuccessfully : I met no issue.

What worked best for me was a low-core-count amd64 virtual machine, on a 3.10 series CPython, high iteration count and reading resources/multilang.pdf.

also there is a trick in the code

I saw that and recommend this is discarded once the reason for keeping that reference is resolved.

Therefore, looking at your test code I recommend a test calling pageout.merge_page(pagein.clone(pdfout)) to possibly prevent error.
Can you give feedback ?

Yes; positive; confirming. I did do that. That is how I know both code paths are exposed. You can trigger the bug via pagein.clone() too, once there is at least one id of since-deleted objects in the dictionary keys. It is much easier to trigger the bug using merge_page only[1] so that is what my sample code does, but one such call suffices to potentially corrupt the output.

[1] I suspect that is because CPython is much more likely to recycle id() if the new object fits into the memory slot of a no longer referenced object.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants