Use of ZWJ #2

Open
r12a opened this issue Nov 10, 2016 · 12 comments

Comments

@r12a
Contributor

r12a commented Nov 10, 2016

[Comment by @ntounsi]

What is the spec of the zero-width joiner?

  1. Unicode-Ch9, Sec 9.2 Arabic, p371 (http://www.unicode.org/versions/Unicode9.0.0/ch09.pdf) says:
    "... The use of a joiner adjacent to a suitable letter permits that letter to form a cursive connection without a visible neighbor." and gives a use case "This provides a simple way to encode some special cases, such as exhibiting a connecting form in isolation."

I guess that "suitable letter" means a letter that joins according to the rules of its writing system (e.g. Aleph doesn't join on its left in Arabic).
But what does "adjacent" mean? To the right, to the left, or both?
"without a visible neighbor": does that mean there is no visible character next to the letter at all? What about a visible character that is separated from it by a space, say?
To support the definition, Unicode gives an example of a special case with the letter HEH and shows what ZWJ means for that special case.

  2. Wikipedia's informal definition says:
    "When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms."
    This is more explicit, but it covers (only?) the case "between" two letters. And when "would" they not be connected? What are those situations?

Anyway, browser implementations of ZWJ differ from one another. I noticed that the display in browsers (Gecko-based vs. WebKit-based) depends on

  • the font used (some fonts seem to react better),
  • which letter is the neighbour, even if separated by a space,
  • the base direction, rtl or ltr (!?)

Also, the letter Ghain behaves differently from the letter Heh, though both have four different shapes: initial, medial, final and isolated.
Moreover, the Gecko implementation is right for "colored letters" within a word (they join). But ZWJ applied to only one letter works better in the WebKit implementation: all four shapes show up.

@r12a
Contributor Author

r12a commented Nov 10, 2016

[Response by @kojiishii]

What is the spec of the zero-width joiner?

Technically speaking, this belongs to font technology, so the best place to look is the OpenType spec. A search for "OpenType zwj site:microsoft.com" shows several hits.

The spec for handling ZWJ in Arabic is here. I don't have enough knowledge to know whether this matches what authors need, but I hope it gives the WG something to review.

@duerst

duerst commented Nov 11, 2016

Given that the ZWJ has been used in many different contexts, a general question like "What is the spec of the zero-width joiner?" may be difficult to answer. If this is a discussion specific to the Arabic script, it would be good to say so.

@johnwcowan

johnwcowan commented Nov 11, 2016

ZWJ may indeed appear either before or after a letter. When placed between two letters that normally do not connect, it attempts to create a ligature between them, though this may fail if there is no such ligature. This is commonplace in Indic scripts, and rare but sometimes useful in Western scripts.

In an Arabic or Syriac or similar context, space+letter+ZWJ+space will produce a left-joined form, space+ZWJ+letter+space will produce a right-joined form, and space+ZWJ+letter+ZWJ+space will produce a double-joined form, all of them physically isolated from other letters by spaces (which is what you need when talking about specific forms with examples), always providing such joined forms actually exist.
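The three sequences described above can be written out explicitly. This is only a minimal sketch in Python: the helper name `isolated_sample` is hypothetical, and HEH is used just as an example of a letter that has all the joined forms.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER
HEH = "\u0647"  # ARABIC LETTER HEH, used here only as an example letter


def isolated_sample(letter, join_before=False, join_after=False):
    """Wrap `letter` in spaces, with an optional ZWJ before and/or after it.

    "Before" and "after" refer to logical (memory) order, not visual order.
    """
    return (
        " "
        + (ZWJ if join_before else "")
        + letter
        + (ZWJ if join_after else "")
        + " "
    )


samples = {
    "space+letter+ZWJ+space": isolated_sample(HEH, join_after=True),
    "space+ZWJ+letter+space": isolated_sample(HEH, join_before=True),
    "space+ZWJ+letter+ZWJ+space": isolated_sample(HEH, True, True),
}

for label, text in samples.items():
    # Print the code points so the invisible ZWJ is easy to see.
    print(label, "->", " ".join(f"U+{ord(ch):04X}" for ch in text))
```

Whether each string actually renders with the expected joined form still depends on the font and the renderer, which is what the rest of this thread is about.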

@khaledhosny

In Arabic the behavior of ZWJ is rather simple. It is IMO best explained as acting like an invisible dual-joining character. ZWNJ is the exact opposite; it acts as an invisible non-joining character.

Also the function of ZWJ and ZWNJ should be font-independent (unless the font is going out of its way to do something unusual) and should not be dependent on the presence or absence of a glyph for it in the font (just like any other control character).

Any behavior other than this is a bug AFAIK.

The behavior for other scripts (especially Indic scripts) can be different.
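To make the "invisible dual-joining character" model concrete, here is a rough Python sketch. It assumes a tiny hand-picked joining-type table (the real data is in Unicode's ArabicShaping.txt), ignores transparent characters such as combining marks, and uses hypothetical names (`JOINING_TYPE`, `joining_type`, `form_at`). It only models the expected Arabic behavior, not what any browser or shaper actually does.

```python
# Hand-picked subset of joining types, for illustration only.
JOINING_TYPE = {
    "\u0647": "D",  # ARABIC LETTER HEH: dual-joining
    "\u063A": "D",  # ARABIC LETTER GHAIN: dual-joining
    "\u0627": "R",  # ARABIC LETTER ALEF: right-joining
    "\u200D": "C",  # ZERO WIDTH JOINER: join-causing
    "\u200C": "U",  # ZERO WIDTH NON-JOINER: non-joining
}


def joining_type(ch):
    # Anything not in the table (spaces, Latin letters, ...) is non-joining.
    return JOINING_TYPE.get(ch, "U")


def form_at(text, i):
    """Return the expected contextual form of text[i] (logical order)."""
    this = joining_type(text[i])
    prev = joining_type(text[i - 1]) if i > 0 else "U"
    nxt = joining_type(text[i + 1]) if i + 1 < len(text) else "U"

    # A letter connects backwards if it can, and the previous character
    # offers a connection; likewise forwards with the next character.
    joins_prev = this in ("D", "R") and prev in ("D", "C", "L")
    joins_next = this in ("D", "L") and nxt in ("D", "C", "R")

    if joins_prev and joins_next:
        return "medial"
    if joins_prev:
        return "final"
    if joins_next:
        return "initial"
    return "isolated"


print(form_at("\u0647\u200D", 0))        # HEH followed by ZWJ
print(form_at("\u200D\u0647\u200D", 1))  # HEH between two ZWJs
print(form_at("\u200D\u0647", 1))        # HEH preceded by ZWJ
print(form_at("\u0647\u200C\u0647", 0))  # HEH followed by ZWNJ
```

In this model ZWJ behaves like any dual-joining neighbour, and ZWNJ like a space, which is why neither should need a glyph in the font.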

@r12a
Contributor Author

r12a commented Nov 22, 2016

A few tests do produce some surprising results on both Firefox and Chrome. Take a bidi mix of characters like the following:

U+0061 LATIN SMALL LETTER A
U+0020 SPACE
U+0647 ARABIC LETTER HEH
U+200D ZERO WIDTH JOINER
U+0020 SPACE
U+0062 LATIN SMALL LETTER B
U+0020 SPACE
U+200D ZERO WIDTH JOINER
U+0647 ARABIC LETTER HEH
U+200D ZERO WIDTH JOINER
U+0020 SPACE
U+0063 LATIN SMALL LETTER C
U+0020 SPACE
U+200D ZERO WIDTH JOINER
U+0647 ARABIC LETTER HEH

Displayed as HTML in an LTR context it looks like:

[image: screen shot 2016-11-22 at 14 16 00]

and in an RTL context like:

[image: screen shot 2016-11-22 at 14 20 17]

In a textarea, in an LTR context it looks like:

[image: screen shot 2016-11-22 at 14 20 31]

and only in a textarea with an RTL context does it look like you might expect, i.e.

[image: screen shot 2016-11-22 at 14 21 43]

I wonder whether it has something to do with the browser determining the directional runs first, then packaging up the runs of characters in such a way that the ZWJ becomes isolated from the Arabic character it is adjacent to.
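For anyone who wants to repeat the four cases above, the following Python sketch generates a small test page. The markup is an assumption, not the page used for the screenshots: it simply puts the same code point sequence into a paragraph and a textarea, once with `dir="ltr"` and once with `dir="rtl"`. The names `ZWJ_SEQUENCE`, `as_char_refs` and `build_test_page` are hypothetical.

```python
# The code point sequence listed earlier in this comment.
ZWJ_SEQUENCE = [
    0x0061, 0x0020, 0x0647, 0x200D, 0x0020,
    0x0062, 0x0020, 0x200D, 0x0647, 0x200D, 0x0020,
    0x0063, 0x0020, 0x200D, 0x0647,
]


def as_char_refs(code_points):
    """Encode code points as numeric character references so the
    invisible ZWJ stays visible in the HTML source."""
    return "".join(f"&#x{cp:04X};" for cp in code_points)


def build_test_page():
    body = as_char_refs(ZWJ_SEQUENCE)
    parts = [
        "<!DOCTYPE html>",
        '<meta charset="utf-8">',
        "<title>ZWJ test</title>",
    ]
    for direction in ("ltr", "rtl"):
        # One plain paragraph and one textarea per base direction.
        parts.append(f'<p dir="{direction}">{body}</p>')
        parts.append(f'<textarea dir="{direction}">{body}</textarea>')
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_test_page())
```

Redirecting the output to a file and opening it in different browsers (ideally with a font such as Amiri installed) should make the comparison repeatable.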

@ntounsi

ntounsi commented Nov 22, 2016

It also depends on the font. In some (simple) fonts, initial Heh may look the same as medial Heh.
I confirm the weird result.

The example only works with some good fonts, e.g. Amiri, Traditional Arabic, Arabic Typesetting, etc.,
AND only when the base direction is rtl.

Here is my test with Firefox,
with the LTR context on the left and RTL on the right:
[image: testrizwj]

Browsers agree only for the RTL context (blue). All other cases are weird.

The question, then, is why ZWJ rendering should depend on the base direction at all.

@ntounsi

ntounsi commented Nov 22, 2016

@r12a Your point about browsers may be right. Without any LTR characters, all seems ok, whatever the base direction.

@khaledhosny

I think it might indeed be related to bidi itemisation, since browsers will itemize the text first into runs with the same direction before doing the shaping, so the ZWJ might end in the wrong run. I think @behdad might have a more concrete idea (IIRC he talked once about this very issue).

BTW, http://www.unicode.org/cldr/utility/bidi.jsp is a great way to check the bidi resolution of a string, which shows quite some difference on where the ZWJ goes based on paragraph direction.
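The suspected effect can be illustrated with a deliberately simplified model in Python. This is not the Unicode Bidirectional Algorithm and not what any browser implements: it only looks at strong directions, treats everything else (including ZWJ, whose bidi class is BN, and space, which is WS) as neutral, and gives a neutral run the base direction unless both of its strong neighbours agree. The function names and the resolution rule are assumptions made for this sketch.

```python
import unicodedata

# The mixed-direction test string from earlier in the thread.
TEXT = "a \u0647\u200D b \u200D\u0647\u200D c \u200D\u0647"


def strong_direction(ch):
    """Return "L" or "R" for strongly directional characters, else None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "L"
    if bidi_class in ("R", "AL"):
        return "R"
    return None  # weak or neutral in this simplified model


def naive_runs(text, base):
    """Split `text` into (direction, substring) runs for base "L" or "R"."""
    dirs = [strong_direction(ch) for ch in text]
    resolved = list(dirs)
    n = len(dirs)
    i = 0
    while i < n:
        if dirs[i] is not None:
            i += 1
            continue
        j = i
        while j < n and dirs[j] is None:
            j += 1
        before = dirs[i - 1] if i > 0 else base
        after = dirs[j] if j < n else base
        # Neutrals between differing strong types take the base direction.
        fill = before if before == after else base
        for k in range(i, j):
            resolved[k] = fill
        i = j

    runs = []
    for ch, direction in zip(text, resolved):
        if runs and runs[-1][0] == direction:
            runs[-1][1].append(ch)
        else:
            runs.append((direction, [ch]))
    return [(direction, "".join(chars)) for direction, chars in runs]


for base in ("L", "R"):
    print("base", base)
    for direction, run in naive_runs(TEXT, base):
        print("  ", direction, [f"U+{ord(ch):04X}" for ch in run])
```

In this model, with an LTR base every HEH ends up alone in its own right-to-left run while the ZWJs land in the neighbouring left-to-right runs; with an RTL base each ZWJ stays in the same run as its HEH. If shaping is then done one run at a time without surrounding context, that would be consistent with the results agreeing only in the RTL case.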

@duerst

duerst commented Nov 23, 2016

@khaledhosny: You said "I think it might indeed be related to bidi itemisation, since browsers will itemize the text first into runs with the same direction before doing the shaping, so the ZWJ might end in the wrong run."
Shouldn't a ZWJ ideally be included in both runs (not necessarily for reordering, but before being sent to rendering) if it falls on a run boundary? That's probably not what the Bidi algorithm says. But the Bidi algorithm is about ordering, and ZWJ has nothing to do with ordering. On the other hand, ZWJ is about shaping, and is defined (as far as I understand) to work logically.

@behdad

behdad commented Nov 23, 2016

@duerst That's indeed what Chrome does, i.e. it makes the surrounding text available to the shaper. That's the only way to get a boundary ZWJ right, as it might affect both sides.

@khaledhosny

Reported for Firefox here.

Chrome renders this differently from Firefox, but the result does not look correct either:
[image: Chrome]

Compared to Firefox:
[image: Firefox]

So it seems that Chrome fails to apply the ZWJ when it comes first, while Firefox fails to apply it when it comes last.

@khaledhosny

Here is the LibreOffice rendering for comparison; Pango renders it similarly:
[image: zwj-libreoffice]
