
Text encoding

Vaclav Hanzl edited this page Oct 31, 2022 · 8 revisions

In a nutshell:

  • Prak uses Unicode, encoded as utf-8, in NFC.
  • If prak encounters NFD in its input, it prints a warning and converts the text to NFC.
  • Prak also deletes any BOM found at the start of any line.
  • Prak works with any combination of CR and LF.

What is Unicode and utf-8

Unicode is a character set meant to cover all characters of the world. Computers manage it quite well today, so there is no reason to use region-specific character sets anymore (like Latin-2).

The utf-8 encoding is a clever way to encode Unicode so that good old ASCII characters remain encoded exactly the same as they were 50 years ago. That's called backward compatibility. Any sane engineer will choose utf-8.
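A minimal Python sketch of that backward compatibility (the sample strings are just illustrations, not anything prak itself uses): pure ASCII text produces the same bytes under both encodings, while an accented letter takes more than one byte in utf-8.

```python
# ASCII-only text: utf-8 and ASCII give identical bytes.
ascii_text = "prak"
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# A non-ASCII letter (here U+010D, c with caron) needs two bytes in utf-8.
accented = "\u010d"
accented_bytes = accented.encode("utf-8")

print(same_bytes)           # True
print(len(accented_bytes))  # 2
```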

There is also utf-16. Open it in an editor that expects ASCII and you see just garbage. Windows engineers have that strange love for incompatible things, so you may encounter it on Windows, and praat has it as a default on Windows. Just switch your praat to utf-8.
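To see where the garbage comes from, here is a small Python sketch. It assumes the little-endian variant without BOM (`utf-16-le`), chosen only to keep the output predictable; real files may differ.

```python
text = "prak"

utf8_bytes = text.encode("utf-8")       # one byte per ASCII letter
utf16_bytes = text.encode("utf-16-le")  # two bytes per letter, second one is zero

# Reading the utf-16 bytes as if they were utf-8 leaves a NUL after every letter.
misread = utf16_bytes.decode("utf-8")

print(utf8_bytes)     # b'prak'
print(utf16_bytes)    # b'p\x00r\x00a\x00k\x00'
print(repr(misread))  # 'p\x00r\x00a\x00k\x00'
```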

What is NFC and NFD

Unicode has two ways to represent letters with accents:

  • as one character which happens to have accent as its integral part (NFC, composed)
  • as two characters: letter + accent (NFD, decomposed)

As long as all the accented letters you need do exist in Unicode in a composed form (very likely), NFC is an obvious choice.

NFD is not total nonsense - it is like, say, driving on the left side of the road. As long as everybody else on your island agrees, it works. But do not try it here. Decomposed characters can arise (you guessed it) on Windows, popping up for no particular reason. If you cut-n-paste text around, you can end up with NFD parts in otherwise fine NFC text; it looks exactly the same and certainly leads to bugs. When prak encounters NFD (in a TextGrid or in exceptions.txt), it converts it to NFC and prints a lengthy and very fair warning.
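The difference is easy to show with Python's standard `unicodedata` module. This is only a sketch of the idea, not prak's actual code; the helper name `to_nfc` is made up for the illustration.

```python
import unicodedata

def to_nfc(text):
    """Return (nfc_text, was_changed). Hypothetical helper, not prak's API."""
    nfc = unicodedata.normalize("NFC", text)
    return nfc, nfc != text

composed = "\u010d"     # c with caron as one code point (NFC)
decomposed = "c\u030c"  # letter c + combining caron (NFD)

# Both look the same on screen, but they are different strings.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

fixed, changed = to_nfc(decomposed)
if changed:
    print("warning: input was not in NFC, converted")
```

Comparing the normalized string with the original is one simple way to decide whether a warning is due.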

What is BOM (and CR and LF)

BOM (byte order mark) is an invisible character added by some Windows software at the beginning of a file. Apart from breaking other software, BOM does not do much. Prak silently deletes BOM on input, not only at the beginning of the file but at the start of any line - just in case you contaminated your text by cut-n-pasting a first line taken from elsewhere.
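A rough Python sketch of per-line BOM removal follows. The function name `strip_bom` is an assumption for the example, and prak's real implementation may well differ.

```python
BOM = "\ufeff"  # U+FEFF; in utf-8 it is the three bytes EF BB BF

def strip_bom(line):
    """Remove any BOM characters from the start of a line (hypothetical helper)."""
    return line.lstrip(BOM)

# Plain utf-8 decoding keeps the BOM as an invisible first character.
raw = b"\xef\xbb\xbfhello\n\xef\xbb\xbfworld\n"
lines = raw.decode("utf-8").split("\n")

cleaned = [strip_bom(line) for line in lines]
print(cleaned)  # ['hello', 'world', '']
```

Python's `utf-8-sig` codec drops a BOM only at the very start of the data, which is why a per-line pass is needed to catch the pasted-in case.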

CR and LF are "carriage return" (go to beginning of line) and "line feed" (go to next line). It made sense on mechanical typewriters. It does not make much sense in computer files today. Sane systems use just LF as end of line. (You know or guessed by now where you can encounter CR+LF, right?) Prak works with both just LF and CR+LF as end of line. In fact, for TextGrid, it takes some effort to smuggle such characters into a string. For exceptions.txt, you can safely edit it with software which saves with CR+LF and it will work.
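One simple way to tolerate mixed line endings is to map everything to LF before splitting. This is a sketch under that assumption; `split_lines` is a name invented here and not necessarily how prak does it.

```python
def split_lines(text):
    """Split text into lines, accepting LF, CR+LF and lone CR (hypothetical helper)."""
    # Replace CR+LF first, so the lone-CR pass does not create empty lines.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.split("\n")

mixed = "first\r\nsecond\nthird\rfourth"
print(split_lines(mixed))  # ['first', 'second', 'third', 'fourth']
```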

Phone encoding

Unicode has IPA characters, but there is much more to consider; see Phone symbols.