Intent and internationalization #409

dginev · 2022-08-08T17:14:13Z

Tying up my last loose end, here is a write-up on the examples of international Bulgarian notations in math and chemistry that I am currently aware of:

https://hackmd.io/@dginev/Bk3RY5C6c

Systems by Western authors (such as MathCAT) would likely need explicit intent annotations on all notations that are unknown/conflicting with standard USA conventions. For a "worst-case but valid" example of markup, the open interval from 1.1 to 1.9 could be localized as:

<math>
 <mrow intent="отворен-интервал(1.1, 1.9)">
   <mo>(</mo>
   <mn>1</mn>
   <mo>,</mo>
   <mn>1</mn>
   <mo>;</mo>
   <mn>1</mn>
   <mo>,</mo>
   <mn>9</mn>
   <mo>)</mo>
 </mrow>
</math>

(assuming the presentation tree is this fragmented due to a hypothetical MathML generator tool that only recognizes a decimal dot and not a decimal comma, thus chunking a decimal number into two mn elements)

The main questions of this issue touch on w3c/mathml-docs#40 , but are distinctly different:

What should human remediators who are working on localized documents in non-English languages deposit as the "concept name" values of intent? Do we encourage localized alphabets or insist on English names?
How (if at all) will such values connect to the "Core" and "Open" Intent list values when they are simple translations? (e.g. "отворен-интервал" === "open-interval", "функция-на-Акерман" === "Ackermann-function")

note localizations

I had mentioned before that Wikidata can be leveraged, but that is only an informal sentiment at the moment. E.g. Q341835 has 27 localizations

It could help to have examples from additional language+alphabet combinations, but I hope this much is already helpful to bootstrap a discussion.


davidcarlisle · 2022-08-08T18:15:05Z

I think we should keep the dictionary lookup as currently described (case and [-_. ] insensitive, but otherwise an exact match)
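As a minimal sketch of that matching rule, the normalization could look like the following. It assumes that "insensitive" means the characters `-`, `_`, `.` and space are simply dropped before comparing, and that case folding is applied to non-Latin alphabets as well; the function names are hypothetical and not taken from any spec or implementation.

```python
import re

# Assumption: the characters in "[-_. ]" are ignored entirely during lookup.
_IGNORED = re.compile(r"[-_. ]")


def normalize_concept(name: str) -> str:
    """Drop '-', '_', '.', and spaces, then fold case; nothing else changes."""
    return _IGNORED.sub("", name).casefold()


def same_concept(a: str, b: str) -> bool:
    """True when two concept names are an exact match after normalization."""
    return normalize_concept(a) == normalize_concept(b)
```

Under this reading, "open-interval" and "Open_Interval" match, as do "отворен-интервал" and "ОТВОРЕН ИНТЕРВАЛ", while "open-interval" and "отворен-интервал" do not, since no translation is involved.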

But a given system should be allowed to have a localised dictionary (or, equivalently, columns of localised names giving localised entries). A system should also be allowed to return a speech hint in any localisation, whatever entry was used for lookup.

Which would mean

open-interval($a,$b) could be found in Core but return a localised string отворен-интервал($a,$b)

отворен-интервал($a,$b) could be found in a system specific dictionary and return отворен-интервал($a,$b)

In the first case the lookup SHOULD succeed (as it is in Core Intent) but it MAY return either open-interval or отворен-интервал depending on the system.

In the second case the lookup MAY succeed and if it fails the default reading отворен-интервал($a,$b) would be given

So using the first form would maximise the chance of the lookup succeeding but may end up with English text. Using the second form ensures a Bulgarian reading, though perhaps only the default literal reading of the intent. Either would be a reasonable choice, depending on requirements.
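The two cases can be sketched roughly as below. Everything in the tables is illustrative: `CORE`, `LOCAL`, the speech templates, and the shape of the default literal reading are assumptions made for the example, not the contents of any real Core list or the behaviour of any real system.

```python
# Toy stand-in for Core: normalized English name -> speech templates per language.
CORE = {
    "openinterval": {
        "en": "open interval from {0} to {1}",
        "bg": "отворен интервал от {0} до {1}",
    }
}
# Hypothetical system-specific dictionary: localized name -> Core key.
LOCAL = {"отворенинтервал": "openinterval"}


def normalize(name):
    """Case and [-_. ] insensitive key (assumed to mean: drop those characters)."""
    return "".join(ch for ch in name if ch not in "-_. ").casefold()


def speak(name, args, lang, use_local=True):
    """Return a speech string for intent `name(args)` in language `lang`."""
    key = normalize(name)
    if key not in CORE and use_local:
        key = LOCAL.get(key, key)  # second case: lookup MAY succeed
    entry = CORE.get(key)
    if entry is None:
        # Lookup failed: fall back to a literal reading of the intent as written.
        return f"{name.replace('-', ' ')} {', '.join(args)}"
    # Core names are English, so English is the fallback when no translation exists.
    template = entry.get(lang) or entry["en"]
    return template.format(*args)
```

With these tables, `speak("open-interval", ["a", "b"], "bg")` finds the Core entry and returns the Bulgarian template, while `speak("отворен-интервал", ["a", "b"], "bg", use_local=False)` falls through to the literal reading, which is still Bulgarian because the author wrote it that way.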

But this is just initial thoughts on how I'd imagined intent to work here, not really any fully worked plan... (Also only addressing intent lookup not your notational difference examples, they need more thought...)

NSoiffer · 2022-08-09T00:24:36Z

@dginev's Bulgarian examples often use Cyrillic in subscripts. I see no problem with that. Where I can see a problem is maybe with RTL languages doing that. But at least we have a solution with Unicode or dir. However, I'm not sure we have a solution for vertical direction languages... or even if there is a problem for them. I can easily imagine that a Japanese author would always use horizontal text in math. This link says

...However, with the introduction of western materials, the alphabet, Arabic numbers, and mathematical formulas, it became less convenient to write things vertically. Science-related texts, which include many foreign words, gradually had to be changed to horizontal text.

Today most school textbooks, except those about Japanese or classical literature, are written horizontally...

Finding out if this is universal or not is something I hope we can do (basically finding someone who knows more about vertical mode writing systems).

NSoiffer · 2022-08-09T03:31:58Z

I agree with @davidcarlisle's view on what will happen for names in Core and not in Core, with the correction that if a name is in Core, the system may not say "open interval". It may use "the interval from a to b, not including a or b" or something else.

Core names are biased towards English in that if a translation for the language doesn't exist, then you end up with English because those are the words. But unless every bit of the math is marked up with intent, the defaults would be in English (or whatever default language the implementer used). So ultimately, it isn't much of a bias.

I find some of the notations clever, like ÷ for a (presumably) geometric progression. Here are a few comments on others:

The decimal comma is the standard "switch" from the American "."/"," so that doesn't seem special. Am I missing something?
"tg" and "cotg" are used in other European languages (I think Italian and Spanish, but that's a guess). It was common enough and non-conflicting enough that I added them to my base trig function list in MathPlayer.
Currency is particularly interesting because it can occur in front (like $) or at the end (as in Bulgarian, or ¢ in the US). I have seen € used at the start, end, and middle (as a decimal separator). The latter seems rarely used though, and the Wikipedia page on which I spotted that usage has since removed it.

dginev · 2022-08-09T12:12:48Z

@NSoiffer filling in some minor details you asked about:

The decimal comma is the standard "switch" from the American "."/"," so that doesn't seem special. Am I missing something?

That's exactly it.

An implication that may not be immediately obvious is that such differences result in "localizing" any "Default interpretation" rules for intent to a specific practice. Take <mn>1</mn><mo>,</mo><mn>9</mn> - is it the number 1,9 in Bulgarian, or a list of 1 and 9 in the US? Assuming one leads to the other producing inaccurate speech via the Default rules. Is 1.9 "one times 9" in Bulgaria or "one point nine" in the US? Similarly for : pronounced as ratio.

In practice that may mean that a big part of the remediation of non-English documents will be neutralizing inappropriate defaults, or - similar to my expectation for arXiv - that advanced defaults (trying to infer intervals, lists, etc.) will be completely disabled. Whichever gets more mileage.

Bulgarian examples often use Cyrillic in subscripts. I see no problem with that.

Right, largely not problematic. The reason I took the trouble to include the variety of uses of Cyrillic in those two books (annotations, variable names, function names) was to demonstrate Cyrillic is an alphabet that is used in math syntax.

The relevant background is that Murray S. had repeatedly claimed he had an expectation for Cyrillic to not be used, so far as to override that block to have different behavior in a particular use of Braille. I apologize for not remembering the technical details of what exact Braille feature was being enabled.

NSoiffer · 2022-08-11T00:44:07Z

@dginev wrote:

Take 1,9 - is it the number 1,9 in Bulgarian, or a list of 1 and 9 in the US?

If it is correct MathML, it can't be the number 1,9 because that should be <mn>1,9</mn>. The sad fact is, and this may be your point, most MathML generators don't do that because they don't ask/intuit what it means. So to your point, any MathML cleanup needs to be aware of the locale. (I have a comment in my MathCAT code saying this particular attempt at cleanup needs to take locale into consideration, but locale is currently not one of the settings in MathCAT 👎)
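A locale-aware cleanup pass of the kind described could be sketched as follows. This is not MathCAT's code: the `DECIMAL_SEP` table and the `merge_decimals` helper are hypothetical, the input is assumed to use un-namespaced tag names, and the rule is deliberately naive (it merges any digit, separator, digit run greedily from the left).

```python
import xml.etree.ElementTree as ET

# Hypothetical table: the decimal separator assumed for each language.
DECIMAL_SEP = {"en": ".", "bg": ","}


def merge_decimals(parent, lang):
    """Merge <mn>d</mn><mo>sep</mo><mn>d</mn> children into one <mn>, in place."""
    sep = DECIMAL_SEP.get(lang)  # None for unknown languages: nothing is merged
    kids = list(parent)
    i = 0
    while i + 2 < len(kids):
        a, op, b = kids[i], kids[i + 1], kids[i + 2]
        if (a.tag == "mn" and op.tag == "mo" and b.tag == "mn"
                and (op.text or "").strip() == sep
                and (a.text or "").strip().isdigit()
                and (b.text or "").strip().isdigit()):
            a.text = a.text.strip() + sep + b.text.strip()
            a.tail = b.tail
            parent.remove(op)
            parent.remove(b)
            kids = list(parent)  # re-read children after the removal
        else:
            i += 1


src = ("<mrow><mo>(</mo><mn>1</mn><mo>,</mo><mn>1</mn><mo>;</mo>"
       "<mn>1</mn><mo>,</mo><mn>9</mn><mo>)</mo></mrow>")
bg_row = ET.fromstring(src)
merge_decimals(bg_row, "bg")  # children become: ( 1,1 ; 1,9 )
en_row = ET.fromstring(src)
merge_decimals(en_row, "en")  # unchanged: the commas stay list separators
```

The same nine-token input ends up as five tokens under "bg" and stays at nine under "en", which is the locale dependence being discussed.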

NSoiffer · 2022-09-13T22:08:45Z

To the point about number syntax switch "." and ","... In Switzerland, I recently saw numbers written as "1'234'567". And spaces between digit blocks are also common.

Also (for the record): Asian countries tend to use blocks of four digits, not three digits as is common in the Western world.
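To make the variation concrete, here is a small sketch of digit grouping with a configurable separator and block size. The `group_digits` helper is hypothetical, and the comma in the four-digit example is only a placeholder separator; which separator and block size a given locale actually uses is left to the caller.

```python
def group_digits(digits: str, sep: str, size: int = 3) -> str:
    """Insert `sep` between blocks of `size` digits, counting from the right."""
    head = len(digits) % size  # length of the short leading block, if any
    blocks = [digits[:head]] if head else []
    blocks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return sep.join(blocks)


swiss = group_digits("1234567", "'")        # "1'234'567"
spaced = group_digits("1234567", " ")       # "1 234 567"
by_four = group_digits("12345678", ",", 4)  # "1234,5678"
```

A cleanup or speech system would need the inverse as well: recognizing that such a separator inside a number is grouping rather than a decimal mark or a list separator.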

brucemiller · 2022-10-11T07:44:02Z

On 8/10/22 20:44, NSoiffer wrote:

@dginev wrote:

Take 1,9 - is it the number 1,9 in Bulgarian, or a list of 1 and 9 in the US?

If it is correct MathML, it can't be the number 1,9 because that should be <mn>1,9</mn>.

I disagree. The description of mn in the spec is a bit wishy-washy. It starts off with "Generally speaking, a numeric literal is a sequence of digits, perhaps including a decimal point..." but later says "since mn is a presentation element, there are a few situations where it may be desirable to include arbitrary text...". My reading of this is that the content of mn is a "literal number", but it doesn't say *which* number; it's presentation after all. That's content or intent's job. So, it seems to me quite legitimate that either comma or period can be the "decimal point", while period or comma or even spacing can be used as thousands (or 10K, or 100K...) separator. [Aside: I've often wondered whether the "point" of "decimal point" was universally understood as "dot" rather than "location". Merriam-Webster says: "Definition of decimal point : a period, centered dot, or in some countries a comma at the left of ..." So they've generalized it to comma, but not to RTL! :> ]

dginev · 2023-01-06T16:54:02Z

A related detail here is that we may want to explicitly mention how/if intent values - and especially Core list entries - interplay with the language specified by the hosting document.

As some examples:

HTML uses the lang attribute (and xml:lang for compatibility)
JATS uses xml:lang docs
ePub (uniquely?) also uses a dc:language element docs

On a separate note:
This is naturally complicated by xml:lang being a global attribute that can (in theory) be used on any MathML node. A fun didactic example may be to write an expression that states that a number is equal in two different languages. For example:

<math>
  <mn xml:lang="en">2</mn>
  <mo>=</mo>
  <mn xml:lang="bg">2</mn>
</math>

which ought to render the two numbers identically on a screen and Braille display, but may (or may not; open question) speak them differently.

We could of course decide this level of complexity is unlikely to be useful, and recommend against using language annotations on any inner node.
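If language annotations on inner nodes were honored, one plausible rule is the usual XML scoping of xml:lang, where the nearest enclosing declaration applies. The sketch below computes that effective language for every node; `effective_langs` is a hypothetical helper, and whether a speech or Braille system should act on the result is exactly the open question above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_langs(elem, inherited=None):
    """Yield (element, language) pairs; the innermost declaration wins."""
    lang = elem.get(XML_LANG) or elem.get("lang") or inherited
    yield elem, lang
    for child in elem:
        yield from effective_langs(child, lang)


doc = ET.fromstring(
    '<math xml:lang="en"><mn>2</mn><mo>=</mo><mn xml:lang="bg">2</mn></math>'
)
langs = [(e.tag, lang) for e, lang in effective_langs(doc)]
```

Here the first `mn` and the `mo` inherit "en" from `math`, and only the second `mn` resolves to "bg".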

Just jotting down the question, and maybe this is partially related to the technical details of #425 .

NSoiffer · 2023-01-07T01:52:07Z

NVDA actually sticks a lang attr on all the MathML it creates for speech. For reasons I haven't tracked down, even when I change NVDA's language to French (or something else), it still insists on putting 'en' on the math tag. Maybe it is getting it from the OS??? I ended up ignoring it for MathCAT although I need more testing by others to figure out what really is right.

I haven't seen anything about how lang affects braille (but I haven't really looked). I would think that it does, and that might even affect the display of numbers (some braille systems use "drop numbers", some use different prefixes for digits). I suspect allowing lang, or more specifically, saying lang affects braille in math would be a big problem. For speech, it is probably less of a problem technically/theoretically, but practically, probably a big one.

polx · 2023-01-19T12:28:42Z

I disagree. The description of mn in the spec is a bit wishy-washy. It starts off with "Generally speaking, a numeric literal is a sequence of digits, perhaps including a decimal point..." but later says "since mn is a presentation element, there are a few situations where it may be desirable to include arbitrary text...".

While I feel it is OK for mn to contain any text, I think we should upgrade the spec so that the wishy-washiness is removed. A number can only be a number if it is contained in a single mn.
That would enable converting 1,99 to a single number if it is in a single mn and to a list if not.
An interval with three elements should not exist, but coordinates, for example, often use "," as the separator, which conflicts with the decimal comma of some languages. Ensuring mn singled-ness sounds like a good idea.

polx · 2023-01-19T12:29:47Z

To the point about number syntax switch "." and ","... In Switzerland, I recently saw numbers written as "1'234'567". And spaces between digit blocks are also common.

That's everywhere in Europe.

Also (for the record): Asian countries tend to use blocks of four digits, not three digits as in common in the Western world.

Way fascinating! Bring me a scan please @NSoiffer , so I make this into the notation census!

polx · 2023-01-19T12:31:47Z

lang attribute

I am doubtful we should suggest multi-lingual processing of intents within a single page.
The innermost lang attribute is probably the one to have authority... Or?

NSoiffer · 2023-01-25T23:33:10Z

@polx For digit grouping, Wikipedia has some discussion.

dginev · 2023-03-27T13:10:01Z

On a separate note: This is naturally complicated by xml:lang being a global attribute that can (in theory) be used on any MathML node. A fun didactic example may be to write an expression that states that a number is equal in two different languages. For example:
<math>
  <mn xml:lang="en">2</mn>
  <mo>=</mo>
  <mn xml:lang="bg">2</mn>
</math>

To follow up on my example here, today I was reminded (by listening to a talk by a UK speaker) that there are many names for the number 0 in English. The wiki article contains a nice overview as usual. So one can imagine my didactic example one level down the lang codes:

<math>
  <mn xml:lang="en-GB">0</mn>
  <mo>=</mo>
  <mn xml:lang="en-US">0</mn>
</math>

which may be expected to produce "naught equals zero". The slang narration examples from wiki may also be good illustrations for underscore use, as in <mn intent="_nada">0</mn> or <mn intent="_zilch">0</mn>.

dginev · 2023-04-20T17:52:52Z

We discussed in the call today that the specific mapping between a Core intent name (where the Core list is a unique list in English) and its translation in a different language, is not a Core connection between the two, but an Open one.

As such, the details of connecting translations of Core concepts to their English counterparts is deferred to the way Open lists are organized.

And we can close here without further action. Naturally, we can open new issues for other internationalization questions.

dginev added the intent Issues involving the proposed "intent" attr label Aug 8, 2022

dginev closed this as completed Apr 20, 2023
