Coordinate conversion help #284

samshelley · 2023-06-19T22:39:08Z

samshelley
Jun 19, 2023

Firstly, thanks so much for creating this great library @mara004

I read #204 and especially #214, but neither seems to answer my question. I'm trying to get bounding boxes, expressed as percentages of the canvas, for text found via search. I run the search with the textpage.search method to get the starting index, then loop over the matched characters and call get_charbox with the loose option to build the bounding boxes, as in the snippet below:

index, count = searcher.get_next()
for i in range(count):
    left, bottom, right, top = textpage.get_charbox(index + i, loose=True)
    c_width = right - left
    c_height = bottom - top
    bounding_boxes.append(dict(
        left=(left/width)*100,
        top=(top/height)*100,
        width=c_width/width*100,
        height=c_height/height*100,
        page=page_index+1,  # there's an outer loop for each page
    ))

This almost works, but I'm noticing two broken behaviors that seem potentially related:

1. The top value seems to be greater than the bottom value, which I think doesn't make sense in a coordinate system?
2. The calculated values as percentages for width & left are exactly correct, but the values for top & height are not -- they are off by a bit.

To check my logic, I opened the PDF in Mac Preview and drew a rectangle in approximately the same area of the page that I was trying to extract. Here, the left/right values were again accurate, but top & bottom were off by ~40-60 canvas units.

Do you have any recommendations here? Am I using the APIs incorrectly? Apologies if this was already answered elsewhere or is included in the documentation.

Thanks so much for taking a look. If you need me to provide a full working example with an attached pdf I can do that as well, just wanted to see if it was something obvious first.

samshelley · 2023-06-20T00:57:26Z

samshelley
Jun 20, 2023
Author

I just solved my own issue, while researching Apache PDFBox: https://stackoverflow.com/a/54045861.

I think you sort of alluded to it in other support tickets, but I had to convert from x,y coordinates whose origin is the bottom-left corner of the page to x,y coordinates whose origin is the top-left corner.
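
For anyone landing here with the same problem, the conversion described above can be sketched in plain Python. This is only a minimal sketch, assuming an unrotated page whose visible box starts at (0, 0). The function name charbox_to_percent and its parameters are made up for illustration and are not part of pypdfium2.

```python
def charbox_to_percent(left, bottom, right, top, page_width, page_height):
    """Convert a box given in PDF space (origin bottom-left, y grows upward)
    into percentages measured from the top-left corner of the page.

    Assumes an unrotated page whose visible box starts at (0, 0).
    """
    return {
        "left": left / page_width * 100,
        # Distance from the page's top edge down to the box's top edge.
        "top": (page_height - top) / page_height * 100,
        "width": (right - left) / page_width * 100,
        # In PDF space top > bottom, so this difference is positive.
        "height": (top - bottom) / page_height * 100,
    }


if __name__ == "__main__":
    # A 200 x 400 page with a box spanning x 50..100 and y 200..300.
    print(charbox_to_percent(50, 200, 100, 300, 200, 400))
```

The values returned by get_charbox in the snippet from the first post could be passed straight in as left, bottom, right, top, together with the page width and height.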

I'm going to leave this open only because I think the docs would benefit from a brief section explaining how to convert "PDF Canvas Units" to typical x/y coordinate space. Feel free to close if you disagree!

mara004 · 2023-06-20T10:52:43Z

mara004
Jun 20, 2023
Maintainer

Hi, nice to hear you've essentially figured it out already.

Yes, in PDF, the coordinate system's origin is typically the bottom-left corner (unlike the top-left corner for bitmaps), though in theory the PDF spec allows the coordinate system to be laid out between any opposite corners (I think, anyway).

As you say, comments #214 (comment) and #214 (comment) kind of discuss that already.

As this seems to be a common problem, I suppose you're right that the docs deserve a section on coordinate conversion. Maybe even some support model around FPDF_PageToDevice() / FPDF_DeviceToPage().
I'll need some time to consider, though.

samshelley · 2023-06-20T11:10:26Z

samshelley
Jun 20, 2023
Author

Thanks! Re-reading those comments, it's clear in retrospect; I just didn't grasp it the first time.

I implemented it just using python and not considering rotation. For completeness, it seems like you are suggesting that this will work most of the time, but not all. Is rotation the only additional case to consider?

Or is the easiest solution just to use the raw APIs for each coordinate pair in the bounding box since it will handle it reliably?

mara004 · 2023-06-20T11:27:43Z

mara004
Jun 20, 2023
Maintainer

> I implemented it just using python and not considering rotation. For completeness, it seems like you are suggesting that this will work most of the time, but not all. Is rotation the only additional case to consider?

Yes, that's what I meant.
I think rotation is probably not the only additional case, though (there's the aforementioned "any opposite corners" problem, for one thing), and I'd indeed recommend calling these raw API functions, since that seems safest/easiest.

samshelley · 2023-06-20T11:35:29Z

samshelley
Jun 20, 2023
Author

Got it! I'm very unfamiliar with ctypes, but based on the method signature it seems to suggest that the method I would be using FPDF_PageToDevice returns values as integers instead of float/double which would be a problem since all of the values I'm working with have lots of decimal values.

FPDF_EXPORT FPDF_BOOL FPDF_CALLCONV FPDF_PageToDevice(FPDF_PAGE page,
                                                      int start_x,
                                                      int start_y,
                                                      int size_x,
                                                      int size_y,
                                                      int rotate,
                                                      double page_x,
                                                      double page_y,
                                                      int* device_x,
                                                      int* device_y);

Am I understanding this incorrectly?

If so, the logic in FPDF_PageToDevice turns out not to be that complicated, so if that function doesn't work for me, I'll likely come back to this later and re-implement it in Python using the helper methods you made for PdfMatrix.

Is it possible currently to easily call the raw methods on a page object like CPDF_Page-> GetDisplayMatrix?

mara004 · 2023-06-20T11:46:20Z

mara004
Jun 20, 2023
Maintainer

> based on the method signature it seems to suggest that the method I would be using FPDF_PageToDevice returns values as integers instead of float/double which would be a problem since all of the values I'm working with have lots of decimal values.

Ooh, yes. If you're not actually targeting a bitmap to draw on, that sounds like a problem.
I guess you can use a large bitmap and then downscale so you don't run into real precision trouble, but yes, that's inelegant. Need to think about this...
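
The "large device, then downscale" idea can be illustrated without any real bitmap. The sketch below uses a hypothetical stand-in, page_to_device_int, that rounds to integers the way an int-returning mapping would. It is not pdfium's implementation, and it assumes an unrotated page with its origin at the bottom-left corner. Judging only from the signature quoted above, FPDF_PageToDevice takes sizes rather than a bitmap handle, so the large "bitmap" might never need to be allocated, but that is an assumption worth verifying.

```python
def page_to_device_int(page_w, page_h, size_x, size_y, page_x, page_y):
    """Hypothetical stand-in for an integer-returning page-to-device mapping.

    Assumes an unrotated page with its origin at the bottom-left corner.
    Device space has its origin at the top-left corner, so y is flipped.
    """
    device_x = round(page_x / page_w * size_x)
    device_y = round((page_h - page_y) / page_h * size_y)
    return device_x, device_y


def to_percent_via_virtual_device(page_w, page_h, page_x, page_y, size=1_000_000):
    """Map onto a large virtual device, then scale back down to percentages.

    The rounding error is at most half a device unit, i.e. 50 / size percent.
    """
    dx, dy = page_to_device_int(page_w, page_h, size, size, page_x, page_y)
    return dx / size * 100, dy / size * 100


if __name__ == "__main__":
    # US Letter page (612 x 792 units), one point with fractional coordinates.
    print(to_percent_via_virtual_device(612, 792, 100.25, 700.5, size=100))
    print(to_percent_via_virtual_device(612, 792, 100.25, 700.5))
```

With a device of 100 units per side the result is only accurate to about half a percent, while a device of a million units keeps the error far below anything visible in a highlight layer.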

> Is it possible currently to easily call the raw methods on a page object like CPDF_Page-> GetDisplayMatrix?

Sadly the CPDF_* API layer is pdfium's private C++ backend which we can't access with ABI bindings / ctypes.
This is an unfortunate but known limitation of our (one could say, quick and dirty) bindings concept :(

samshelley · 2023-06-20T11:52:50Z

samshelley
Jun 20, 2023
Author

OK thank you! This has been incredibly helpful -- really appreciate the pointers. Yes I think I'm doing something a bit different than others here (but it does work!)

GetDisplayMatrix is actually really simple as well so we've solved my issue for now -- https://pdfium.googlesource.com/pdfium.git/+/798e18f5e5cfb672c7f3186f6358b84c5ff7785b/core/fpdfapi/page/cpdf_page.cpp
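
For readers who want to stay in floats, here is a rough sketch of the same idea in plain Python. It is written in the spirit of a display matrix but is not a port of pdfium's GetDisplayMatrix, and the name page_to_display_fraction and its parameters are invented for this example. It assumes that the box passed in is the one the viewer actually displays and that rotation is a clockwise multiple of 90 degrees.

```python
def page_to_display_fraction(x, y, box, rotation=0):
    """Map a PDF-space point to (u, v) fractions of the displayed page,
    measured from the top-left corner of the page as shown on screen.

    box: (x0, y0, x1, y1) of the visible page area in PDF space. The two
         corners may be given in any order.
    rotation: clockwise page rotation in degrees, a multiple of 90.
    """
    x0, x1 = sorted((box[0], box[2]))
    y0, y1 = sorted((box[1], box[3]))
    fx = (x - x0) / (x1 - x0)  # 0 at the box's left edge, 1 at its right edge
    fy = (y - y0) / (y1 - y0)  # 0 at the box's bottom edge, 1 at its top edge

    rotation %= 360
    if rotation == 0:
        return fx, 1 - fy
    if rotation == 90:
        # The original bottom edge ends up on the left, the left edge on top.
        return fy, fx
    if rotation == 180:
        return 1 - fx, fy
    if rotation == 270:
        # The original top edge ends up on the left, the right edge on top.
        return 1 - fy, 1 - fx
    raise ValueError("rotation must be a multiple of 90 degrees")


if __name__ == "__main__":
    box = (100, 100, 300, 500)  # a visible box that does not start at (0, 0)
    for angle in (0, 90, 180, 270):
        print(angle, page_to_display_fraction(150, 400, box, angle))
```

To convert a whole bounding box on a rotated page, one option is to map two opposite corners and take the minimum and maximum of the resulting u and v values, since rotation can swap which corner ends up at the top left.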

mara004 · 2023-06-20T12:11:37Z

mara004
Jun 20, 2023
Maintainer

That's good to hear, thanks!

However, I'm still left wondering what I should do in pypdfium2 now.
And I'm curious why you want to change the coordinate representation if you don't actually work with device pixels?

samshelley · 2023-06-20T12:18:13Z

samshelley
Jun 20, 2023
Author

I am rendering a "highlight" layer in a web interface to highlight specific text in a displayed pdf. The rendering engine uses percentage values to determine where to place items so I need to use the right coordinate space.

I'm fairly new to all of this, so I'm honestly not sure whether my suggestion is too narrow. As far as my use case goes, if you had a Python implementation of FPDF_PageToDevice that maintained precision, I would 100% use that instead of what I'm likely to implement when I come back to this. But that might be too narrow a use case, so a note somewhere in the docs explaining the PDF coordinate space (with a reference to it in the API docs for all of the methods that return coordinates) would also have been totally sufficient!

mara004 · 2023-06-20T12:49:09Z

mara004
Jun 20, 2023
Maintainer

I see, thank you for elaborating.

Maybe, as an alternative to a python re-implementation, we could ask pdfium to add a float equivalent of FPDF_PageToDevice()? We don't need any bitmap parameters, just two functions for (almost-)lossless back and forth translation between normalized and native PDF coordinates.

samshelley · 2023-06-20T13:11:29Z

samshelley
Jun 20, 2023
Author

That would work perfectly!

mara004 · 2023-08-08T23:54:09Z

mara004
Aug 8, 2023
Maintainer

Commit a379ecc (in the devel branch) adds a helper around FPDF_PageToDevice() / FPDF_DeviceToPage(), but only to translate between a page and a corresponding bitmap rendering.

The quest for float coordinate normalization still stands.

samshelley · 2023-08-11T11:48:09Z

samshelley
Aug 11, 2023
Author

Thanks for the update!

mara004 · 2023-08-11T18:57:24Z

mara004
Aug 11, 2023
Maintainer

Our docs often mention coordinate order, such as left, bottom, right, top for rectangle return.
That feels problematic. At least we should add something like "relative to the PDF coordinate system".
Or maybe we should avoid these terms entirely and use unspecific variable names instead, e.g. x0, y0, x1, y1?

mara004 · 2023-12-07T23:43:09Z

mara004
Dec 7, 2023
Maintainer

I think I'll convert this to a discussion, because I've realized I don't like the idea of implementing coordinate conversion from scratch in pypdfium2 (nor would I have the time to do so), especially given that FPDF_PageToDevice() / FPDF_DeviceToPage() already exist and cover the main use case.

However, to any users affected, feel free to file a feature request at pdfium for float coordinate normalization (or perhaps even contribute a patch yourself).
