IndexError: list index out of range while running box_groups_to_span_groups #250

Open
egork520 opened this issue Jun 1, 2023 · 0 comments

egork520 commented Jun 1, 2023

Here is the code to reproduce the error:

from mmda.recipes.core_recipe import CoreRecipe

file_name = 'a85f7a895ed9cbe09a90b8b449ad7356fb92de6a.pdf'
recipe = CoreRecipe()  # assumes the default constructor arguments
doc = recipe.from_path(file_name)

Stack trace:

File ~/Documents/codes/git/ai2/s2/mmda/src/mmda/recipes/core_recipe.py:56, in CoreRecipe.from_path(self, pdfpath)
     53 equations = self.effdet_mfd_predictor.predict(document=doc)
     55 # we annotate layout info in the document
---> 56 doc.annotate(layout=layout)
     58 # list annotations separately
     59 doc.annotate(equations=equations)

File ~/Documents/codes/git/ai2/s2/mmda/src/mmda/types/document.py:97, in Document.annotate(self, is_overwrite, **kwargs)
     91     span_groups = self._annotate_span_group(
     92         span_groups=annotations, field_name=field_name
     93     )
     94 elif annotation_type == BoxGroup:
     95     # TODO: not good. BoxGroups should be stored on their own, not auto-generating SpanGroups.
     96     span_groups = self._annotate_span_group(
---> 97         span_groups=box_groups_to_span_groups(annotations, self), field_name=field_name
     98     )
     99 else:
    100     raise NotImplementedError(
    101         f"Unsupported annotation type {annotation_type} for {field_name}"
    102     )

File ~/Documents/codes/git/ai2/s2/mmda/src/mmda/utils/tools.py:70, in box_groups_to_span_groups(box_groups, doc, pad_x, center)
     66 for box in box_group.boxes:
     67 
     68     # Caching the page tokens to avoid duplicated search
     69     if box.page not in all_page_tokens:
---> 70         cur_page_tokens = all_page_tokens[box.page] = doc.pages[
     71             box.page
     72         ].tokens
     73         if token_box_in_box_group is None:
     74             # Determine whether box is stored on token SpanGroup span.box or in the box_group
     75             token_box_in_box_group = all(
     76                 [
     77                     (
   (...)
     82                 ]
     83             )

IndexError: list index out of range

It appears that the doc has fewer pages than the box_groups refer to, e.g.

ipdb> set([box.page for box_group in box_groups for box in box_group])
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36}
ipdb> len(doc.pages)
35

So doc.pages appears to be missing some pages: the boxes reference page indices 0 through 36 (37 pages), while doc.pages has only 35 entries.
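
A quick way to check other PDFs for the same mismatch is to compare the page indices referenced by the boxes against the number of pages in the document. The snippet below is only a minimal sketch: `Box` and `BoxGroup` here are simplified stand-ins rather than the mmda classes, and `out_of_range_pages` is a hypothetical helper, not part of mmda. It assumes zero-based page indices, as in the ipdb output above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Stand-in for a box; only the zero-based page index matters here."""
    page: int


@dataclass
class BoxGroup:
    """Stand-in for a box group holding a list of boxes."""
    boxes: List[Box] = field(default_factory=list)


def out_of_range_pages(box_groups: List[BoxGroup], num_pages: int) -> List[int]:
    """Return the sorted page indices that boxes reference but the document lacks."""
    referenced = {box.page for group in box_groups for box in group.boxes}
    return sorted(p for p in referenced if not 0 <= p < num_pages)


# Mirror the numbers from the report: boxes on pages 0..36, document with 35 pages.
box_groups = [BoxGroup(boxes=[Box(page=p)]) for p in range(37)]
missing = out_of_range_pages(box_groups, num_pages=35)
print(missing)
```

With the real objects, the same comparison could be run just before doc.annotate(layout=layout) to report the offending pages instead of hitting the IndexError.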

List of shas: 736aea59f4c4d6d52ffe5a5ffabc6f734e142239, a85f7a895ed9cbe09a90b8b449ad7356fb92de6a, 0197e4b6a68e920019b3bb2ae2acde6b61eb96c5

More errors can be found in this Datadog log.
