Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repair fix coords #43

Merged
merged 12 commits into from
Sep 14, 2020
Merged

Repair fix coords #43

merged 12 commits into from
Sep 14, 2020

Conversation

bertsky
Copy link
Collaborator

@bertsky bertsky commented Sep 10, 2020

This attempts to fix problems caused by invalid polygons from ocrd-segment-repair (both in sanitize and plausibilize mode).

This taught me another lesson about what can go wrong with Shapely / numpy / PAGE interaction. To sum up:

  1. Ensuring valid polygons on the input side (e.g. from OpenCV) is always necessary. The only generic way I can think of is to feed them through simplify with ever increasing tolerance until valid. EDIT2 The problem is that the result of the algorithm implemented in Shapely/GEOS depends on the starting point it picked. In pathological cases, no simplification whatsoever can be achieved. (The only thing that then helps is re-ordering...)
  2. Operations like union or intersection can create collections of shapes. EDIT There are actually 2 cases here:
    1. homogeneous (MultiPolygon) – a discontiguous collection of Polygon – in which case one needs the convex hull.
    2. heterogeneous (GeometryCollection) – a collection of Polygon with Point or LineString – in which case one needs to filter out those shapes which have no intrinsic area (and then check again for the other cases)
  3. Operations like union or intersection can create non-integer points, which when rounded for PAGE serialization can become invalid paths. Unfortunately, Shapely always calculates in floating point internally. So all we can do is rounding and then ensuring validity (as in 1).

Related:

Copy link
Contributor

@stweil stweil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used this PR in my test for OCR-D/ocrd_tesserocr#149, and it seems to work.

@stweil
Copy link
Contributor

stweil commented Sep 14, 2020

@bertsky, you are faster with pushing new commits to existing pull requests than I am able to test them.

I suggest to wait a little with new commits until the existing pull request was merged and make a new one then (unless there is an urgent need for the new commit of course).

@stweil
Copy link
Contributor

stweil commented Sep 14, 2020

@kba (or whoever has merge rights), can this PR be merged? I'd like to have a working ocrd_all again, and this seems to be one required component.

@bertsky
Copy link
Collaborator Author

bertsky commented Sep 14, 2020

@bertsky, you are faster with pushing new commits to existing pull requests than I am able to test them.

I suggest to wait a little with new commits until the existing pull request was merged and make a new one then (unless there is an urgent need for the new commit of course).

Sorry, there are so many related issues at the moment and the shared problem is always our Shapely code. Also, I don't have the time right now to go forward and backward on each one of them. So I felt like rushing it – this time.

@kba (or whoever has merge rights), can this PR be merged? I'd like to have a working ocrd_all again, and this seems to be one required component.

Right now I am still testing myself. Also, after delegating to the PAGE validator in core, I found another bug there...

requirements.txt Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants