-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Ambiguous translated references #2558
base: main
Are you sure you want to change the base?
Conversation
id(obj) can and will hit the same value for different objects, if they do not exist at the same time. In CPython, its just a memory address. When that collission occurs, PdfWriter when asked to insert a page from one PdfReader object may silently confuse it with one from a since-deleted PdfReader object, thus corrupting output in not-easily-noticed ways. Replacing dict[id(obj)] with dict[weakref(obj)] should suffice, as we do not replace entries - we only ever insert or delete.
@pubpub-zz Thanks for pointing me there. I only now noticed the almost identical PR #1995 .. suggested before I even ran into the bug. What was applied instead, #1841 did not resolve the problem because it only touches one out of two (maybe three?) opportunities to get it wrong: PageObject.clone()->PdfObject._reference_clone() and IndirectObject.clone() duplicate that code. I expect that (costly) workaround can be removed if the weakref fix works. |
Also. Have you try first to clone the pages before calling merge_page ? |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2558 +/- ##
==========================================
+ Coverage 94.71% 94.73% +0.01%
==========================================
Files 50 50
Lines 8237 8257 +20
Branches 1646 1651 +5
==========================================
+ Hits 7802 7822 +20
Misses 267 267
Partials 168 168 ☔ View full report in Codecov by Sentry. |
That is how I know both code paths are exposed. Have the page clone() first, then |
@biredel also there is a trick in the code : _id_translated values are not only Therefore, looking at your test code I recommend a test calling Can you give feedback ? |
What worked best for me was a low-core-count amd64 virtual machine, on a 3.10 series CPython, high iteration count and reading
I saw that and recommend this is discarded once the reason for keeping that reference is resolved.
Yes; positive; confirming. I did do that. That is how I know both code paths are exposed. You can trigger the bug via pagein.clone() too, once there is at least one id of since-deleted objects in the dictionary keys. It is much easier to trigger the bug using merge_page only[1] so that is what my sample code does, but one such call suffices to potentially corrupt the output. [1] I suspect that is because CPython is much more likely to recycle id() if the new object fits into the memory slot of a no longer referenced object. |
pypdf uses id(obj) to keep track of objects in
_id_translated
. This identifier is not unique for different objects, only for objects existing at the same time. e.g. In CPython id(obj), is just a memory address. This cannot be used to tell whether we already have referenced a page or not. It might point to a page of a since-deleted PdfReader, silently corrupting out output.The bug manifests rarely, and the corrupted output document usually looks OK at first glance (but has missing/duplicated content), the following is my best reproducer so far, adapted from https://stackoverflow.com/questions/78186323/
The attached draft replaces
dict[id(obj)]
withdict[weakref(obj)]
- For the simple insert/lookup/delete cases, that behaves the same. Yet no-longer-existing objects are cleared automatically, removing the possible collision.There is another instance of id(obj) outside of
__repr__()
methods inPageObject.hash_value_data
- I have not determined whether that can cause related problems.Essentially resubmitting #1995 as a replacement for #1841 which has memory usage side-effects and does not fully avoid #1788 anyway.