Reason for not applying remove_non_printing_characters normalization #416

Open

JoeyOhman opened this issue May 20, 2022 · 1 comment

@JoeyOhman

Hi,

We are very much inspired by this great work and are in the process of cleaning our own data. However, if we understand correctly, the remove_non_printing_characters normalization step is not used for the final cleaning. Do you have any thoughts on why it should not be used?

In your code, you have this:

non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]"
)
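
Note that range(0, 32) covers the C0 control characters, which include \t (9) and \n (10), and range(127, 160) covers DEL plus the C1 controls. A quick check (a sketch of ours, not code from the repository):

import re

non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))}]"
)

# The pattern matches tabs and newlines too, so applying it flattens
# multi-line documents onto a single line.
assert non_printing_characters_re.search("\t") and non_printing_characters_re.search("\n")
print(non_printing_characters_re.sub("", "line1\nline2\tend"))  # -> line1line2end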

We modified this to keep newlines (\n) and tabs (\t), and to also remove soft hyphens (U+00AD), non-breaking spaces (U+00A0), and zero-width spaces (U+200B):

# Code points to strip in addition to the control ranges:
# 160 = U+00A0 (non-breaking space), 173 = U+00AD (soft hyphen),
# 8203 = U+200B (zero-width space).
additional_chars_to_remove = [160, 173, 8203]
# The ranges skip 9 (\t) and 10 (\n) so tabs and newlines survive.
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0,9)) + list(range(11, 32)) + list(range(127,160)) + additional_chars_to_remove))}]"
)
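
As a quick usage sketch of the modified pattern (the sample string is made up):

text = "foo\u00a0bar\u00adbaz\u200bqux\nnext\tline"
cleaned = non_printing_characters_re.sub("", text)
# The NBSP, soft hyphen, and zero-width space are stripped; \n and \t survive.
print(repr(cleaned))  # 'foobarbazqux\nnext\tline'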

There could of course be more characters that one may want to remove.
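
For example, one could enumerate further candidates from the Unicode categories Cc (control) and Cf (format), which cover most invisible characters. This is our own sketch, and the category choice is our assumption about what counts as non-printing:

import sys
import unicodedata

# Collect all control (Cc) and format (Cf) code points except \t and \n.
# Cf includes e.g. the zero-width joiner (U+200D) and directional marks.
candidates = [
    cp
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) in ("Cc", "Cf") and chr(cp) not in ("\t", "\n")
]
print(len(candidates))  # roughly a couple hundred code points, depending on the Unicode version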

To be clear, I am writing this here for two reasons:

  1. To get your feedback: do you think this would be a good idea to use for the final data cleaning?
  2. If so, it could be incorporated into this repository to help other people who might be thinking about the same thing.

Thanks for your amazing contributions!

@HugoLaurencon
Collaborator

Hi, thank you for your comment!

The remove_non_printing_characters function was not used during the normalization of the documents:

remove_non_printing_characters=False,

However, it was used just before the tokenization step:

remove_non_printing_characters=True,

and

remove_non_printing_characters=True,

Because we trained our tokenizers and KenLM models (https://huggingface.co/edugp/kenlm/tree/main/wikipedia) on data from which these non-printing characters had already been removed, we kept this function as it was, to be sure that the new data we pass to the tokenizer has the same form as the data the models were trained on.

This is the main reason why this function is present in the code. If we didn't use it for the normalization of the documents, it was probably because \n and \t were in the removal list, as you mentioned. It would make sense to use this function without these characters for the normalization of the documents, but not before the tokenization, because the tokenizer did not see any \n or \t during its training.
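
A minimal sketch of that split, with function names of my own choosing (they are not from the repository):

import re

# Strict pattern: also strips \t and \n, matching what the tokenizer
# and KenLM models saw during their training.
strict_re = re.compile(
    f"[{''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))}]"
)
# Lenient pattern: keeps \t (9) and \n (10) so document structure survives.
lenient_re = re.compile(
    f"[{''.join(map(chr, list(range(0, 9)) + list(range(11, 32)) + list(range(127, 160))))}]"
)

def normalize_document(text):
    # For the stored, cleaned documents: preserve line structure.
    return lenient_re.sub("", text)

def normalize_for_tokenizer(text):
    # Just before tokenization or KenLM scoring: reproduce the
    # training-time normalization exactly.
    return strict_re.sub("", text)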

I don't really know why \n and \t are in the list; the code mostly comes from Facebook's CCNet, which was used for the training of the tokenizers and KenLM models. I think the only thing we modified was to not convert the characters to lower case. @edugp did this part.

So if you want to use the same tokenizers or KenLM models as us, you should check the parameters of the normalization applied beforehand and use the same ones.
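
One simple way to keep them in sync is to define the parameters once and reuse them on both sides; the parameter names below are illustrative, not the exact API of this repository:

import re

non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))}]"
)

# Single source of truth for the normalization applied before the
# tokenizer and the KenLM models, at training time and at inference time.
NORMALIZATION_PARAMS = {
    "lower_case": False,  # we kept the original casing, unlike CCNet
    "remove_non_printing_characters": True,  # must match the training-time setting
}

def normalize(text, params=NORMALIZATION_PARAMS):
    if params["lower_case"]:
        text = text.lower()
    if params["remove_non_printing_characters"]:
        text = non_printing_characters_re.sub("", text)
    return text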

Don't hesitate if you have more questions!
