Use of ZWJ #2
Comments
[Response by @kojiishii]
Technically speaking, this belongs to font technology, so the best place to look is the OpenType spec. A search for "OpenType zwj site:microsoft.com" shows several hits. The spec for handling ZWJ in Arabic is here. I don't know enough to tell whether this matches what authors need, but I hope it gives the WG something to review.
Given that the ZWJ has been used in many different contexts, a general question like "What is the spec of the zero-width joiner?" may be difficult to answer. If this is a discussion specific to the Arabic script, it would be good to say so.
ZWJ may indeed appear either before or after a letter. When placed between two letters that normally do not connect, it attempts to create a ligature between them, though this may fail if there is no such ligature. This is commonplace in Indic scripts, and rare but sometimes useful in Western scripts. In an Arabic, Syriac, or similar context, space+letter+ZWJ+space will produce a left-joined form, space+ZWJ+letter+space will produce a right-joined form, and space+ZWJ+letter+ZWJ+space will produce a double-joined form, all of them physically isolated from other letters by spaces (which is what you need when talking about specific forms with examples), provided such joined forms actually exist.
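For anyone who wants to try these three patterns, here is a minimal Python sketch that builds the test strings. The helper name `isolated_joining_samples` and the choice of ARABIC LETTER HEH as the sample letter are my own assumptions for illustration; the code only constructs the character sequences and says nothing about how a given browser or font will render them.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER
HEH = "\u0647"  # ARABIC LETTER HEH, picked here only as an example letter


def isolated_joining_samples(letter):
    """Return the three space-delimited patterns described above.

    Hypothetical helper: it builds strings only, it does no shaping.
    """
    return {
        "left-joined": " " + letter + ZWJ + " ",
        "right-joined": " " + ZWJ + letter + " ",
        "double-joined": " " + ZWJ + letter + ZWJ + " ",
    }


if __name__ == "__main__":
    for label, text in isolated_joining_samples(HEH).items():
        codepoints = " ".join("U+%04X" % ord(ch) for ch in text)
        print(label + ": " + codepoints)
```

Pasting the resulting strings into an HTML page or a textarea is one way to compare what each browser does with them.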
In Arabic the behavior of ZWJ is rather simple. It is IMO best explained by saying that it acts as an invisible dual-joining character. ZWNJ is the exact opposite; it acts as an invisible non-joining character. Also, the function of ZWJ and ZWNJ should be font-independent (unless the font is going out of its way to do something unusual) and should not depend on the presence or absence of a glyph for it in the font (just like any other control character). Any behavior other than this is a bug AFAIK. The behavior for other scripts (especially Indic scripts) can be different.
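As a rough illustration of this model, the sketch below picks a contextual form for a letter from the joining types of its immediate neighbours. It is a simplification under stated assumptions: the tiny `JOINING_TYPE` table is hand-written for three letters plus ZWJ and ZWNJ, transparent characters such as combining marks are ignored, and the function names are hypothetical. It is not how any particular shaping engine is implemented.

```python
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Hand-written subset of Unicode joining types, for illustration only:
# D = dual-joining, R = right-joining, C = join-causing, U = non-joining.
JOINING_TYPE = {
    "\u0647": "D",  # ARABIC LETTER HEH
    "\u063a": "D",  # ARABIC LETTER GHAIN
    "\u0627": "R",  # ARABIC LETTER ALEF (does not join to its left)
    ZWJ: "C",
    ZWNJ: "U",
}


def joining_type(ch):
    # Anything not in the table (spaces, Latin letters, ...) is non-joining.
    return JOINING_TYPE.get(ch, "U")


def contextual_form(text, i):
    """Return 'isolated', 'initial', 'medial' or 'final' for text[i].

    Simplified: looks only at the immediate logical neighbours.
    """
    cur = joining_type(text[i])
    prev_t = joining_type(text[i - 1]) if i > 0 else "U"
    next_t = joining_type(text[i + 1]) if i + 1 < len(text) else "U"
    # Joining to the preceding character happens on the letter's right side.
    joins_prev = cur in ("D", "R") and prev_t in ("D", "L", "C")
    # Joining to the following character happens on the letter's left side.
    joins_next = cur in ("D", "L") and next_t in ("D", "R", "C")
    if joins_prev and joins_next:
        return "medial"
    if joins_next:
        return "initial"
    if joins_prev:
        return "final"
    return "isolated"
```

With this model, `contextual_form(" \u0647\u200d ", 1)` gives the initial (left-joined) form and `contextual_form("\u0647\u200c\u0647", 0)` stays isolated, since ZWNJ is treated like any other non-joining character.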
A few tests do produce some surprising results on both Firefox and Chrome. Take a bidi mix of characters like the following: U+0061 U+0061 LATIN SMALL LETTER A Displayed as HTML in a LTR context it looks like: [image] and in a RTL context like: [image] In a textarea, in a LTR context it looks like: [image] and only in a textarea with RTL context does it look like you might expect, i.e. [image] I wonder whether it's something to do with the browser trying to determine the directional runs first, then packaging up the runs of characters in such a way that the ZWJ becomes isolated from the Arabic character it is adjacent to.
@r12a Your point about browsers may be right. Without any LTR characters, all seems ok, whatever the base direction.
I think it might indeed be related to bidi itemisation, since browsers will itemize the text first into runs with the same direction before doing the shaping, so the ZWJ might end up in the wrong run. I think @behdad might have a more concrete idea (IIRC he once talked about this very issue). BTW, http://www.unicode.org/cldr/utility/bidi.jsp is a great way to check the bidi resolution of a string, and it shows quite some difference in where the ZWJ goes based on paragraph direction.
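A quick way to see why the ZWJ is easy to misplace during itemisation is to look at the bidi class of each character. The sketch below uses Python's standard `unicodedata.bidirectional`; the sample sequence (Latin a, ZWJ, Arabic HEH) is an assumption chosen for illustration and is not the exact test string from this thread. It only lists the classes and does not run the full bidi algorithm.

```python
import unicodedata


def bidi_classes(text):
    """List (code point, bidi class) pairs for each character in text."""
    return [
        ("U+%04X" % ord(ch), unicodedata.bidirectional(ch))
        for ch in text
    ]


if __name__ == "__main__":
    # Hypothetical sample: LATIN SMALL LETTER A, ZWJ, ARABIC LETTER HEH
    sample = "a\u200d\u0647"
    for codepoint, bidi_class in bidi_classes(sample):
        print(codepoint, bidi_class)
```

The ZWJ reports class BN (boundary neutral), so it has no strong direction of its own. Which run it lands in then depends on its neighbours and on how the implementation handles such characters, which would be consistent with the differences described here.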
@khaledhosny: You said "I think it might indeed be related to bidi itemisation, since browsers will itemize the text first into runs with the same direction before doing the shaping, so the ZWJ might end in the wrong run."
@duerst That's indeed what Chrome does, i.e. it makes the surrounding text available to the shaper. That's the only way to get a boundary ZWJ right, as it might affect both sides.
Reported for Firefox here. Chrome renders this differently from Firefox, but does not look correct either: [image] So it seems that Chrome fails to apply the ZWJ when it comes first, while Firefox fails to apply it when it comes last.
[Comment by @ntounsi]
What is the spec of the zero-width joiner?
"... The use of a joiner adjacent to a suitable letter permits that letter to form a cursive connection without a visible neighbor." and gives a use case "This provides a simple way to encode some special cases, such as exhibiting a connecting form in isolation."
I guess that "suitable letter" means those letters that join according to their writing system (e.g. Alef doesn't join to its left in Arabic).
But what does "adjacent" mean? To the right, to the left, or both?
"without a visible neighbor": what if there is no visible neighbor at all next to it? Or a visible neighbor that is separated by a space, say?
To support the definition, Unicode gives an example of a special case with letter HEH and shows what ZWJ means for that special case.
"When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms."
This is more explicit, but it (only?) covers "between" two letters. When "would" they not be connected? What are those situations?
Anyway, browser implementations of ZWJ differ from one another. I noticed that browser display (Gecko-based vs. WebKit-based) depends on
Also, the letter Ghain behaves differently from the letter Heh, though they both have four different shapes: initial, medial, final and isolated.
Moreover, the Gecko implementation is right for "colored letters" within a word (they join). But ZWJ applied to only one letter works better in the WebKit implementation: all four shapes show up.