Keep minimal structure of tables in text #13

ivsanro1 · 2024-06-23T16:15:46Z

I think it'd be great to keep some basic separators so that we don't lose too much structural info from tables:

>>> import html_text
>>> from lxml.html import fromstring

>>> tree = fromstring("""
... <table>
...   <tr>
...     <th>Company</th>
...     <th>Contact</th>
...     <th>Country</th>
...   </tr>
...   <tr>
...     <td>Alfreds Futterkiste</td>
...     <td>Maria Anders</td>
...     <td>Germany</td>
...   </tr>
...   <tr>
...     <td>Centro comercial Moctezuma</td>
...     <td>Francisco Chang</td>
...     <td>Mexico</td>
...   </tr>
... </table> 
... """)

>>> print(html_text.extract_text(tree, guess_layout=True))
Company Contact Country
Alfreds Futterkiste Maria Anders Germany
Centro comercial Moctezuma Francisco Chang Mexico

A better output would be:

Company | Contact | Country
Alfreds Futterkiste | Maria Anders | Germany
Centro comercial Moctezuma | Francisco Chang | Mexico
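For illustration, here is a rough standalone sketch of that behaviour. It uses only the standard library's `html.parser` and is not how html_text works internally; `TableTextParser` and `table_to_text` are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect table cell text row by row (hypothetical helper, not part of html_text)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, None outside <tr>
        self._cell = None  # text chunks of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace inside the cell before storing it.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_text(html):
    """Render each table row on its own line, cells joined with ' | '."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(" | ".join(row) for row in parser.rows)


EXAMPLE = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
</table>
"""
print(table_to_text(EXAMPLE))
```

Row boundaries stay as newlines, which the current output already preserves; the only change is the separator between cells.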

@lopuhin do you think this would be relevant for this library?

lopuhin · 2024-06-24T07:43:25Z

@ivsanro1 that makes a lot of sense. Thinking about other options here, one more possibility could be using tabs \t instead of | as a separator. That would still follow the approach of not adding new non-blank characters to the original text, while preserving the same amount of information as the |, and it is how tables are represented when you copy them and paste them into a text field.
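A small sketch of that property, assuming the cells have already been extracted into lists of strings (`render` and `non_blank` are hypothetical helpers, not html_text functions): with tabs, the non-blank characters are identical to the space-joined output, and the cells can still be recovered by splitting on the tab.

```python
rows = [
    ["Company", "Contact", "Country"],
    ["Alfreds Futterkiste", "Maria Anders", "Germany"],
]


def render(rows, cell_sep):
    # One line per row, cells joined with the given separator.
    return "\n".join(cell_sep.join(row) for row in rows)


def non_blank(text):
    # Drop all whitespace so only the visible characters remain.
    return "".join(text.split())


plain = render(rows, " ")
tabbed = render(rows, "\t")
piped = render(rows, " | ")

print(non_blank(tabbed) == non_blank(plain))  # True: tabs add no visible characters
print(non_blank(piped) == non_blank(plain))   # False: "|" is a new visible character
print([line.split("\t") for line in tabbed.split("\n")] == rows)  # True: cells are recoverable
```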

ivsanro1 · 2024-06-24T09:52:40Z

> Thinking about other options here, one more possibility could be using tabs \t instead of | as a separator

Makes sense @lopuhin, thanks for your input on this. Originally I was thinking of | rather than tabs because the vocabularies of recent LLMs (e.g. llama3) tend to include combinations of spaces and tabs, which makes the resulting tokens less consistent, especially if the table has cells without text. I was wondering whether that would affect how an LLM interprets this text, semantically speaking.

I find that the | separator tokenizes more consistently:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.encode("\t", add_special_tokens=False)
[197]
>>> tokenizer.encode("\t\t", add_special_tokens=False)
[298]
>>> tokenizer.encode("\t\t\t", add_special_tokens=False)
[573]
>>> tokenizer.encode(" | ", add_special_tokens=False)
[765, 220]
>>> tokenizer.encode(" |  | ", add_special_tokens=False)
[765, 220, 765, 220]
>>> tokenizer.encode("| ", add_special_tokens=False)
[91, 220]
>>> tokenizer.encode("|  |", add_special_tokens=False)
[91, 220, 765]
>>> tokenizer.encode(" \t  \t ", add_special_tokens=False)
[7163, 79199]
>>> tokenizer.encode(" \t  \t  \t", add_special_tokens=False)
[7163, 256, 63472]
>>> tokenizer.encode(" \t  \t  \t ", add_special_tokens=False)
[7163, 256, 8860, 3762]
>>> tokenizer.encode(" |  |  |", add_special_tokens=False)
[765, 220, 765, 220, 765]
>>> tokenizer.encode(" |  |  | ", add_special_tokens=False)
[765, 220, 765, 220, 765, 220]

But I also like the option of not adding non-blank characters. I think the best option would be to make the separator customizable.
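As a sketch of what a customizable separator could look like for a single row (the `cell_sep` argument and the `join_cells` helper are hypothetical, not an existing html_text option), including the empty-cell case mentioned above:

```python
def join_cells(cells, cell_sep="\t"):
    """Join one row of cell texts; `cell_sep` is a hypothetical option."""
    return cell_sep.join(cells)


row = ["Alfreds Futterkiste", "", "Germany"]  # the middle cell is empty

print(repr(join_cells(row)))                  # 'Alfreds Futterkiste\t\tGermany'
print(repr(join_cells(row, cell_sep=" | ")))  # 'Alfreds Futterkiste |  | Germany'
print(repr(join_cells(row, cell_sep=" ")))    # 'Alfreds Futterkiste  Germany'
```

With a plain space the empty cell cannot be told apart from ordinary spacing, while both the tab and the | variants keep the column count; the empty cell produces exactly the "\t\t" and " |  | " runs from the tokenizer session above.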

ivsanro1 added the "good first issue" label Jun 23, 2024
