Languages / writing systems with 2 line breaking conventions in common use? #11

Open
frivoal opened this issue Oct 16, 2019 · 5 comments

Comments

@frivoal

frivoal commented Oct 16, 2019

Are there writing-systems other than Korean/Hangul that meet the following criteria:

  • has two different line-breaking conventions in relatively common use, between which document authors (or possibly readers) may want to switch:
    i. one of which allows breaking between any letters (letter = grapheme cluster)
    ii. the other of which disallows breaking between the letters of a word and only allows breaking at spaces
  • has variant (i) as the "default" behavior, in the sense of being the one invoked by CSS's word-break: normal

Context: the CSS-WG is planning to introduce a new value for the word-break property that behaves like normal except for Hangul, where it would have behavior (ii) (the same as keep-all). If this is only useful for Korean, then the name of the value can be specific to Korean (e.g. keep-all-hangul). If some other language would want to use it, then the value should be named something more generic, and the behavior adjusted to handle that other language as well.

The reason keep-all is insufficient to serve this need is that not all content can be language tagged (for instance, user-generated content in an editable text field isn't), and keep-all is neither appropriate as a default for all languages, nor appropriate for all content that contains any amount of Korean. Multilingual content exists, and keep-all would not be appropriate for Korean mixed with Japanese (for instance). So we need a second value that's like normal, but with behavior (ii) instead of (i) for Hangul.
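
To illustrate the difference between the three behaviors, here is a rough Python sketch of where break opportunities would fall. Everything in it is an assumption made for illustration: the function names are hypothetical, "keep-all-hangul" is only the tentative name mentioned above and not an existing CSS value, the script ranges are deliberately crude (precomposed Hangul syllables, kana, and the main CJK ideograph block only), and the choice to suppress a break only when both neighbouring characters are Hangul is a guess at the intended behavior. Real line breaking follows UAX14 and is considerably more involved.

```python
def is_hangul(ch: str) -> bool:
    # Assumption: precomposed Hangul syllables only (U+AC00..U+D7A3).
    return 0xAC00 <= ord(ch) <= 0xD7A3


def is_cjk_like(ch: str) -> bool:
    # Assumption: Hangul syllables, kana, and CJK unified ideographs.
    cp = ord(ch)
    return is_hangul(ch) or 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def break_opportunities(text: str, mode: str) -> list[int]:
    """Return every index i such that a break is allowed before text[i]."""
    if mode not in ("normal", "keep-all", "keep-all-hangul"):
        raise ValueError(f"unknown mode: {mode}")
    breaks = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev == " " and cur != " ":
            breaks.append(i)  # breaking after a space is allowed in every mode
            continue
        if prev == " " or cur == " ":
            continue
        if not (is_cjk_like(prev) and is_cjk_like(cur)):
            continue  # e.g. Latin letters: no break inside a word
        if mode == "normal":
            breaks.append(i)  # behavior (i): break between any letters
        elif mode == "keep-all-hangul":
            # Hypothetical value: behavior (ii) for Hangul pairs only.
            if not (is_hangul(prev) and is_hangul(cur)):
                breaks.append(i)
        # mode == "keep-all": behavior (ii) for every CJK-like pair
    return breaks


mixed = "한국어 日本語"
for mode in ("normal", "keep-all", "keep-all-hangul"):
    print(mode, break_opportunities(mixed, mode))
```

In this sketch, the mixed Korean and Japanese string keeps its Korean word intact under the hypothetical value while the Japanese text remains breakable between characters, which is the combination keep-all cannot provide.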

@frivoal
Author

frivoal commented Oct 16, 2019

Additionally, if there are languages with two line breaking behaviors in common use, where the default (as in, the behavior of word-break: normal) is the other way around and which would benefit from being able to opt into a normal-with-break-all-for-a-certain-script, that too would be useful to know.

@r12a
Contributor

r12a commented Oct 24, 2019

Hmm. Not sure.

http://w3c.github.io/elreq/#ethiopic_line_breaking and http://w3c.github.io/elreq/#ethiopic_hyphenation indicate that languages using the Ethiopic script break character by character, regardless of whether space or the word-separator are used between words. However, major browsers actually break on word boundaries (space or word-sep), and i'm not sure whether that might be establishing a new expectation. @dyacob any thoughts on that?

@frivoal
Author

frivoal commented Jan 21, 2020

As far as I can tell, browsers do that because Unicode tells them to: https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt classifies Ethiopic syllables as AL, which by UAX14 prohibits breaks between pairs of such letters.

But given the explanation in elreq, that actually makes sense: when Ethiopic was primarily written with word separators, using a break-all style of line breaking was fine, but with the advent of using spaces, line breaking anywhere becomes somewhat ambiguous.

So, what elreq currently describes seems to be the historic reality that breaking between all letters was the common practice. What it doesn't say is whether there's a continued desire for this behavior.
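
For anyone who wants to check the LineBreak.txt classification mentioned above, here is a small Python sketch that parses lines in that file's `code point(s);class # comment` format and looks up a character's class. The helper names are hypothetical, and the two sample lines are a hand-copied excerpt included only so the sketch runs on its own; they should be verified against the actual file. The `is_al_pair` helper models just the single consequence cited above (two adjacent AL characters get no break between them), not the UAX14 rule set.

```python
def parse_line_break_data(lines):
    """Parse LineBreak.txt-style lines into (start, end, class) tuples."""
    ranges = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        code_points, lb_class = (part.strip() for part in line.split(";", 1))
        if ".." in code_points:
            start, end = code_points.split("..")
        else:
            start = end = code_points
        ranges.append((int(start, 16), int(end, 16), lb_class))
    return ranges


def line_break_class(ch, ranges, default="XX"):
    """Return the line break class of ch, or default if it is not listed."""
    cp = ord(ch)
    for start, end, lb_class in ranges:
        if start <= cp <= end:
            return lb_class
    return default


def is_al_pair(a, b, ranges):
    """True if both characters are class AL (no break between them)."""
    return line_break_class(a, ranges) == "AL" and line_break_class(b, ranges) == "AL"


# Assumption: illustrative excerpt in the file's format, not the full data.
SAMPLE = [
    "1200..1248;AL # Lo [73] ETHIOPIC SYLLABLE HA..ETHIOPIC SYLLABLE QWA",
    "1361;BA # Po ETHIOPIC WORDSPACE",
]

ranges = parse_line_break_data(SAMPLE)
print(line_break_class("\u1200", ranges))      # an Ethiopic syllable
print(is_al_pair("\u1200", "\u1208", ranges))  # two adjacent syllables
```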

@r12a
Contributor

r12a commented Jan 24, 2024

@dyacob is it reasonable to assert that, although it is mostly used for historic text, some modern content authors of text using Ethiopic orthographies still sometimes want the line to break before the last character that fits, rather than wrapping whole words? This is so that Florian can decide whether to name his line-break property value with a generic or a Korean-specific name.

Do you know of other orthographies that behave like Korean?

Personally, i think a generic name would be best because even if modern content authors generally don't expect the text to break like Korean, people writing expository texts about archaic scripts will probably also need this.(?)

@dyacob
Member

dyacob commented Jan 25, 2024

@r12a I think that is very reasonable to say, particularly for content authors targeting web media. In print media, the desire is greater to have the inner-word breaking. I would imagine that other scripts that historically used a printed wordspace would behave like Ethiopic with respect to breaking.

I don't know of other scripts that behave like Korean ("unbreakable" if I'm understanding it correctly).
