Reason for not applying remove_non_printing_characters normalization #416
Hi, thank you for your comment! You are right: the function is not applied during the normalization of the documents (data_tooling/ac_dc/filtering.py, line 357 in e28064e). However, it was used just before the tokenization step: data_tooling/ac_dc/filtering.py, line 213 in e28064e, and data_tooling/ac_dc/filtering.py, line 688 in e28064e.
Because we trained our tokenizers and KenLM models (https://huggingface.co/edugp/kenlm/tree/main/wikipedia) on data from which these non-printing characters had been removed, we kept this function as it was, to be sure that the new data we pass to the tokenizer has the same form as the data it was trained on. This is the main reason why the function is present in the code. As for why it was not used for the normalization of the documents themselves, I don't really know. So if you want to use the same tokenizers or KenLM models as us, you should check the parameters of the normalization applied beforehand and use the same ones. Don't hesitate if you have more questions!
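For illustration, a minimal sketch of that idea: normalize new text the same way before handing it to the SentencePiece tokenizer and the KenLM model. The character set (Unicode categories Cc and Cf), the helper name, and the model file paths are assumptions made for this example, not the repository's exact code:

```python
import re
import unicodedata

import kenlm                 # KenLM Python bindings
import sentencepiece as spm  # SentencePiece tokenizer

# Assumption: "non-printing" means Unicode control (Cc) and format (Cf)
# characters; the repository's actual character list may differ.
NON_PRINTING_RE = re.compile(
    "[" + re.escape("".join(
        chr(c) for c in range(0x110000)
        if unicodedata.category(chr(c)) in ("Cc", "Cf")
    )) + "]"
)

def remove_non_printing(text: str) -> str:
    """Strip the characters the models never saw during training."""
    return NON_PRINTING_RE.sub("", text)

# Hypothetical file names; use the tokenizer/KenLM files that were actually
# trained on the normalized data (e.g. https://huggingface.co/edugp/kenlm).
sp = spm.SentencePieceProcessor(model_file="en.sp.model")
lm = kenlm.Model("en.arpa.bin")

doc = "Some web text\u200b with stray\u00ad non-printing characters."
normalized = remove_non_printing(doc)             # same form as the training data
tokenized = " ".join(sp.encode(normalized, out_type=str))
log10_prob = lm.score(tokenized)                  # KenLM log10 probability
```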
Hi,
We are much inspired by this great work and are in the process of cleaning our data. However, if we understand correctly, the remove_non_printing_characters normalization step is not used for the final cleaning. Do you have any thoughts on why it should not be used? The relevant definition is in data_tooling/ac_dc/normalization.py, line 5 in e28064e.
That line defines the set of characters to remove. We modified it to keep newlines (\n) and tabs (\t), and to also remove soft hyphens, non-breaking spaces, and zero-width spaces. There could of course be more characters that one may want to remove.
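A rough sketch of the modification just described, under the same assumption that the base set is built from Unicode categories Cc/Cf (the actual regex at normalization.py line 5 may differ); the function name is ours:

```python
import re
import unicodedata

# Keep real line structure (\n, \t), drop the remaining Cc/Cf characters, and
# explicitly add a few characters that are otherwise easy to miss. Note that
# soft hyphen and zero-width space are already category Cf; listing them again
# is harmless.
KEEP = {"\n", "\t"}
EXTRA = {"\u00ad", "\u00a0", "\u200b"}  # soft hyphen, non-breaking space, zero-width space

_chars = "".join(
    chr(c) for c in range(0x110000)
    if unicodedata.category(chr(c)) in ("Cc", "Cf") and chr(c) not in KEEP
) + "".join(EXTRA)
NON_PRINTING_KEEP_WS_RE = re.compile("[" + re.escape(_chars) + "]")

def remove_non_printing_keep_whitespace(text: str) -> str:
    return NON_PRINTING_KEEP_WS_RE.sub("", text)

assert remove_non_printing_keep_whitespace("a\u00adb\nc\u00a0d") == "ab\ncd"
```

Depending on the use case, one might prefer to replace the non-breaking space with a regular space rather than delete it, so that words separated by it are not glued together.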
To be clear, I am writing this here for two reasons: to understand why this step is left out of the final cleaning, and to share our modification in case it is useful to others.
Thanks for your amazing contributions!