Keep minimal structure of tables in text #13
Comments
@ivsanro1 that makes a lot of sense. Thinking about other options here, one more possibility could be using tabs.
makes sense @lopuhin thanks for your input on this. Originally I was thinking on I find using separators
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.encode("\t", add_special_tokens=False)
[197]
>>> tokenizer.encode("\t\t", add_special_tokens=False)
[298]
>>> tokenizer.encode("\t\t\t", add_special_tokens=False)
[573]
>>> tokenizer.encode(" | ", add_special_tokens=False)
[765, 220]
>>> tokenizer.encode(" | | ", add_special_tokens=False)
[765, 220, 765, 220]
>>> tokenizer.encode("| ", add_special_tokens=False)
[91, 220]
>>> tokenizer.encode("| |", add_special_tokens=False)
[91, 220, 765]
>>> tokenizer.encode(" \t \t ", add_special_tokens=False)
[7163, 79199]
>>> tokenizer.encode(" \t \t \t", add_special_tokens=False)
[7163, 256, 63472]
>>> tokenizer.encode(" \t \t \t ", add_special_tokens=False)
[7163, 256, 8860, 3762]
>>> tokenizer.encode(" | | |", add_special_tokens=False)
[765, 220, 765, 220, 765]
>>> tokenizer.encode(" | | | ", add_special_tokens=False)
[765, 220, 765, 220, 765, 220]
But I also like the option of not adding non-spacing chars. I think the best option would be to make it customizable.
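A minimal sketch of what a customizable separator could look like. The function name `table_to_text` and its `cell_sep` / `row_sep` parameters are hypothetical, chosen only for illustration; they are not part of this library's API.

```python
from typing import Iterable, Sequence


def table_to_text(
    rows: Iterable[Sequence[str]],
    cell_sep: str = "\t",
    row_sep: str = "\n",
) -> str:
    """Join table cells into plain text with caller-chosen separators.

    Hypothetical helper for illustration only.
    """
    lines = []
    for row in rows:
        # Strip each cell so the separator is the only spacing between cells.
        cells = [cell.strip() for cell in row]
        lines.append(cell_sep.join(cells))
    return row_sep.join(lines)


rows = [["Name", "Qty"], ["apple ", " 3"]]
print(table_to_text(rows))                  # tab-separated cells
print(table_to_text(rows, cell_sep=" | "))  # pipe-separated cells
print(table_to_text(rows, cell_sep=" "))    # spacing characters only
```

With the separator exposed as a parameter, callers who care about token counts can pick tabs, while callers who prefer a visible delimiter can pass `" | "`.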
I think it'd be great to keep some basic separators to not lose too much structural info from tables:
While some better output would be:
@lopuhin do you think this would be relevant for this library?
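To make the idea concrete, here is a rough sketch of extracting table text while keeping cell and row boundaries. It uses only the standard library's `html.parser` and is not this library's actual extraction code; the class `TableRows`, the function `html_table_to_text`, and the `cell_sep` parameter are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, if any
        self._cell = None   # text chunks of the cell being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                # Collapse whitespace so the cell text stays on one line.
                self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_text(html, cell_sep="\t"):
    """Render table rows one per line, cells joined by cell_sep (hypothetical helper)."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return "\n".join(cell_sep.join(row) for row in parser.rows)


html = "<table><tr><th>Name</th><th>Qty</th></tr><tr><td>apple</td><td> 3 </td></tr></table>"
print(html_table_to_text(html))
print(html_table_to_text(html, cell_sep=" | "))
```

The sketch ignores nested tables, `colspan`/`rowspan`, and unclosed cells; it only shows that one separator per cell and one newline per row is enough to preserve the basic grid.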