TeX2MathML converter implementation guidelines #39

Open
physikerwelt opened this issue Jan 22, 2022 · 10 comments

@physikerwelt
Member

physikerwelt commented Jan 22, 2022

TLDR: Can we create a list of LaTeX commands that generate all elements described by the core spec?

The goal of the Wikimedia community group math is to improve the display of mathematical expressions in Wikipedia. Indeed, using browser-based MathML rendering to deliver high-quality formulae is desirable. The new MathML core specification seems promising as it appears to be detailed enough to implement and evaluate MathML rendering engines based on the spec. Therefore, there are good reasons to be optimistic. Once the spec is final and the rendering engines have been implemented, reasonable MathML markup will lead to appealing rendering results that the community will appreciate.

However, the de facto standard in 2022 for authoring and rendering mathematical formulae is the TeX family of formats. Therefore, I suggest a deeper investigation of the conversion process from TeX-like input formats to MathML. We need conversion tools that generate the intended MathML 4 output from TeX-like input as a prerequisite for our new MathML 4 standard to become a success story. In 2018, we evaluated several TeX2MathML conversion tools, including those listed on our tool page. At that point, we created a manual gold standard dataset for Presentation and Content MathML. However, the gold standard dataset's quality might not be optimal, as it was influenced by LaTeXML. In particular, we used LaTeXML to generate the initial version of the MathML output and fixed only the problems we happened to spot in that output.

Therefore, I suggest creating a non-normative document describing how to convert TeX expressions to the corresponding MathML core expression. While this task is open-ended, I recommend stopping after all elements described in the MathML core spec have at least one corresponding LaTeX input.

After that is completed and we still have enthusiasm, we could extend the exercise not only for core but also for intent. Here one could stop, for example, after having touched all symbols with the planned custom style tag annotations and their corresponding content MathML representations.

Disclaimer: I am currently considering implementing a texvc to MathML converter in PHP. For a TDD workflow, it would therefore be good to be able to generate meaningful test cases.
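
As a minimal sketch of how such a list could drive a TDD workflow: the snippet below maps a few TeX inputs to the MathML Core elements their output is expected to contain and reports which elements still lack a corresponding input. The element set is an illustrative subset, not the full list from the spec, and the names `CORE_ELEMENTS_SUBSET`, `TEST_CASES`, and `uncovered_elements` are hypothetical.

```python
# Illustrative subset of MathML Core element names (assumption: NOT the
# complete list from the spec; extend it from the spec before real use).
CORE_ELEMENTS_SUBSET = {
    "mi", "mn", "mo", "mrow", "mfrac", "msqrt", "mroot", "msub", "msup",
    "msubsup", "munder", "mover", "mtext", "mspace", "mtable", "mtr", "mtd",
}

# Hypothetical test cases: TeX input -> elements the converter output
# is expected to contain.
TEST_CASES = {
    r"x": {"mi"},
    r"2": {"mn"},
    r"+": {"mo"},
    r"\frac{a}{b}": {"mfrac", "mi"},
    r"\sqrt{x}": {"msqrt", "mi"},
    r"\sqrt[3]{x}": {"mroot", "mi", "mn"},
    r"x^2": {"msup", "mi", "mn"},
    r"x_i": {"msub", "mi"},
    r"x_i^2": {"msubsup", "mi", "mn"},
}


def uncovered_elements(cases, elements):
    """Return the sorted element names that no test case mentions yet."""
    covered = set().union(*cases.values())
    return sorted(elements - covered)


if __name__ == "__main__":
    # The stopping criterion from the proposal: this list becomes empty.
    print(uncovered_elements(TEST_CASES, CORE_ELEMENTS_SUBSET))
```

The exercise would stop once `uncovered_elements` returns an empty list for the full element set.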

@dginev
Contributor

dginev commented Jan 22, 2022

Collecting macros for creating a tiny tree with each of the MathML Core elements is certainly quite doable.

My fear is that to make it useful, you'd also have to collect the macros needed for generating the different meaningful values for each of the MathML Core attributes.

And then macros to generate some of the idiomatic expression trees.


For example, would such a list care that a script of an <msup> is better attached to a completed parenthetical base, rather than the closing fence?

( \ldots )^2

attached to full parenthetical base, (latexml with enabled grammar):

  <msup>
    <mrow>
      <mo stretchy="false">(</mo>
      <mi mathvariant="normal">…</mi>
      <mo stretchy="false">)</mo>
    </mrow>
    <mn>2</mn>
  </msup>

vs attaching to the closing fence (via mathjax)

  <mo stretchy="false">(</mo>
  <mo>&#x2026;</mo>
  <msup>
    <mo stretchy="false">)</mo>
    <mn>2</mn>
  </msup>

vs not attaching at all. (latexml with grammar disabled via --noparse):

  <mo>(</mo>
  <mi mathvariant="normal">…</mi>
  <mo>)</mo>
  <msup>
    <mi/>
    <mn>2</mn>
  </msup>

The elements are about the same, but the trees are markedly different. Well, there's also apparent debate whether the ellipsis is an <mi> or <mo> -- which is also a question for a useful list, should most of the math symbols be given clear MathML Core targets?

I fear that a useful list will have to spend a lot more writing in talking about tree structure, than the individual leaf elements.
Still worth starting, but I'd expect it to hit 50-100 pages in size pretty quickly if we include that area of consideration.
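
To make the tree-structure point concrete, here is a minimal sketch of a converter that attaches a superscript to the completed parenthetical base rather than to the closing fence. It is not how LaTeXML or MathJax work internally; the token format, the `leaf` classification, and the single-token exponent are simplifying assumptions, and all function names are hypothetical.

```python
def leaf(tok):
    """Wrap one token in a leaf element (very rough classification)."""
    if tok.isdigit():
        return f"<mn>{tok}</mn>"
    if tok in ("(", ")"):
        return f'<mo stretchy="false">{tok}</mo>'
    if tok.isalpha():
        return f"<mi>{tok}</mi>"
    return f"<mo>{tok}</mo>"


def convert(tokens):
    """Convert a flat token list to MathML, grouping balanced parentheses.

    Assumption: the exponent after '^' is a single token.
    """
    out = []    # finished sibling nodes at the current nesting level
    stack = []  # sibling lists of the enclosing, still-open groups
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            stack.append(out)
            out = [leaf(tok)]
        elif tok == ")" and stack:
            out.append(leaf(tok))
            group = "<mrow>" + "".join(out) + "</mrow>"
            out = stack.pop()
            out.append(group)  # the whole group is now ONE sibling
        elif tok == "^" and out and i + 1 < len(tokens):
            base = out.pop()   # previous sibling: a leaf or a full group
            i += 1
            out.append(f"<msup>{base}{leaf(tokens[i])}</msup>")
        else:
            out.append(leaf(tok))
        i += 1
    while stack:               # unclosed groups: fall back to a flat list
        out = stack.pop() + out
    return "".join(out)


if __name__ == "__main__":
    print(convert(["(", "x", ")", "^", "2"]))
```

Because a closed group is pushed back as a single sibling, the `^` handler picks up the whole `<mrow>` as its base. Dropping the grouping branch would instead attach the script to the closing fence, as in the second tree above.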

@davidfarmer

davidfarmer commented Jan 22, 2022 via email

@physikerwelt
Member Author

My point is that the absolute value and the open interval have to be macros, because this version requires guessing the intents: If $0 < x^2 < 100$ then $|x| \in (0, 10)$.

@davidfarmer I think even in a controlled environment like Wikipedia, with a very restricted set of commands it is hard to predict what people will actually write. Often there is a lot of formatting included. For example, for your interval example, the actual code in Wikipedia looks like this

Both notations are described in [[International standard]] [[ISO 31-11]]. Thus, in [[set builder notation]],
: \begin{align}
{\color{Maroon}(} a,b{\color{Maroon})} = \mathopen{\color{Maroon}]}a,b\mathclose{\color{Maroon}[} &= \{x\in\R\mid a{\color{Maroon}{}<{}}x{\color{Maroon}{}<{}}b\}, \\
{\color{DarkGreen}[}a,b{\color{Maroon})} = \mathopen{\color{DarkGreen}[} a,b\mathclose{\color{Maroon}[} &= \{x\in\R\mid a{\color{DarkGreen}{}\le{}} x{\color{Maroon}{}<{}}b\}, \\
{\color{Maroon}(} a,b{\color{DarkGreen}]} = \mathopen{\color{Maroon}]}a,b\mathclose{\color{DarkGreen}]} &= \{x\in\R\mid a{\color{Maroon}{}<{}}x{\color{DarkGreen}{}\le{}} b\}, \\
{\color{DarkGreen}[}a,b{\color{DarkGreen}]} = \mathopen{\color{DarkGreen}[} a,b\mathclose{\color{DarkGreen}]} &= \{x\in\R\mid a{\color{DarkGreen}{}\le{}} x{\color{DarkGreen}{}\le{}} b\}.
\end{align}
Each interval {{open-open|''a'', ''a''}}, {{closed-open|''a'', ''a''}}, and {{open-closed|''a'', ''a''}} represents the [[empty set]], whereas {{closed-closed|''a'', ''a''}} denotes the singleton set {{math|{''a''}{{null}}}}. When {{math|''a'' > ''b''}}, all four notations are usually taken to represent the empty set.

As you can see, due to the absence of a native TeX or MathML-based solution to annotate intent, people came up with custom templates such as closed-open, etc. However, those templates are hard for authors to discover. E.g., the closed-open template is only used 45 times within English Wikipedia. On the other hand, I think the effort people spend writing and rewriting Wikipedia articles is much higher than the effort to write a paper once and upload it to arXiv. Feel free to look at the statistics of the interval example. Just to quote one of many impressive numbers: The average time between edits is 8.2 days.
While the TeX code within the wikitext tag <math> (not to be confused with the HTML5 element <math>;-) produces MathML output right now, the templates in double curly brackets generate text, e.g., for the closed-open example
<span class="texhtml">[<i>a</i>, <i>a</i>)</span>. Certainly one could change the implementation of the templates to also output MathML. So overall, I see it as a big advantage that we have those semantic templates. However, there are several thousand math templates, and one would need to provide good reasons for people to spend effort modifying these templates. I could imagine that improved accessibility would be a convincing argument. To cut a long story short: if we find a convenient and intuitive way to specify the intent, I expect there is a good chance that it will be implemented in Wikipedia. However, having users change the MathML code either directly or via a WYSIWYG editor is not a solution, because it would be too frustrating if a minor change in the TeX source and subsequent regeneration of the MathML code would reset the intent properties.
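
A rough sketch of what changing a template's output could look like: the same semantic template name drives both the current text output and a MathML variant. The `intent` and `arg` attributes follow my understanding of the draft intent syntax, the concept names such as `closed-open-interval` are hypothetical, and treating both endpoints as single `<mi>` identifiers is a simplifying assumption. None of this reflects the actual template implementation in Wikipedia.

```python
from xml.sax.saxutils import escape

# Fence characters per template name (names taken from the wikitext above).
FENCES = {
    "open-open": ("(", ")"),
    "closed-open": ("[", ")"),
    "open-closed": ("(", "]"),
    "closed-closed": ("[", "]"),
}


def interval_html(kind, a, b):
    """Text output in the style the templates produce today."""
    left, right = FENCES[kind]
    return (f'<span class="texhtml">{left}<i>{escape(a)}</i>, '
            f'<i>{escape(b)}</i>{right}</span>')


def interval_mathml(kind, a, b):
    """Hypothetical MathML output carrying the intent of the template.

    Assumptions: both endpoints are single identifiers, and the concept
    name "<kind>-interval" is invented for illustration.
    """
    left, right = FENCES[kind]
    return (f'<math><mrow intent="{kind}-interval($a,$b)">'
            f"<mo>{left}</mo>"
            f'<mi arg="a">{escape(a)}</mi><mo>,</mo>'
            f'<mi arg="b">{escape(b)}</mi>'
            f"<mo>{right}</mo></mrow></math>")


if __name__ == "__main__":
    print(interval_html("closed-open", "a", "a"))
    print(interval_mathml("closed-open", "a", "a"))
```

Since the intent is derived from the template name on every render, regenerating the output after an edit to the source cannot reset it.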

@physikerwelt
Member Author

I fear that a useful list will have to spend a lot more writing in talking about tree structure, than the individual leaf elements.
Still worth starting, but I'd expect it to hit 50-100 pages in size pretty quickly if we include that area of consideration.

@dginev I think it would be useful to write this, at least for everyone implementing tex2mathml converters, which could be 20 people or even more. I am afraid I might have co-authored papers that are read by fewer people;-) Maybe we can just start something and see how it goes. Can you recommend an authoring tool?

@dginev
Contributor

dginev commented Jan 22, 2022

Can you recommend an authoring tool?

I think for this type of collaborative writing, HackMD may be my current default choice. They have highlighting of TeX and XML snippets (similar to GitHub issues), and also have native MathML rendering (which I had asked of them some time back; <math>...</math> markup will render as regular HTML-in-markdown). But it's just one idea - maybe there's something better.

@davidfarmer

@physikerwelt made what I think is a key point:

"having users change the MathML code either directly or via a WYSIWYG editor is not a solution"

What is needed, especially in an environment like Wikipedia, is either

a) A human-readable, human-writable source format which automatically
converts to the desired MathML output.

b) An editing program which lets a person create the content, and which all
potential authors can use.

My hope is to support option a). The key point @physikerwelt made is a warning
about possible failures of option b).

Both options can coexist if the editing program of b) can output the source
format of a).

I like option a) because it provides an archival format which can adapt to future
changes in the recommended MathML output.

@physikerwelt
Member Author

@davidfarmer exactly. Just to link it back to the Wikimedia terminology. a) corresponds to wikitext and b) corresponds to VisualEditor

Both options can coexist if the editing program of b) can output the source format of a).

I would like to mention that the development of the VisualEditor was extremely challenging due to the constraint that the wikitext output should remain human-readable and editable. This constraint did not merely add a bit more effort; it put the project into a whole new class of problems and increased the effort by several orders of magnitude.

@physikerwelt
Member Author

A quick update on that. @Hyper-Node and I have now converted the LaTeX (subset) parser to PHP and have an AST of the LaTeX representation. We are now looking for ideas on how to generate the MathML output from that AST. @Hyper-Node is going to look into the MathJax source code to come up with a high-level design for generating the MathML output from that tree. I will generate the lists mentioned before.
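
One possible shape for that step, as a minimal sketch only: a recursive walk that maps each AST node type to one MathML element. The node classes `Literal`, `Frac`, `Sup`, and `Row` are hypothetical stand-ins, not the node types of the actual PHP parser, and the sketch is in Python rather than PHP purely for illustration.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Literal:          # a single symbol, number, or identifier
    text: str


@dataclass
class Frac:             # \frac{num}{den}
    num: object
    den: object


@dataclass
class Sup:              # base^exponent
    base: object
    exponent: object


@dataclass
class Row:              # a sequence of sibling nodes
    children: list


def to_mathml(node):
    """Render a (hypothetical) AST node as a MathML string, recursively."""
    if isinstance(node, Literal):
        if node.text.isdigit():
            tag = "mn"
        elif node.text.isalpha():
            tag = "mi"
        else:
            tag = "mo"
        return f"<{tag}>{escape(node.text)}</{tag}>"
    if isinstance(node, Frac):
        return f"<mfrac>{to_mathml(node.num)}{to_mathml(node.den)}</mfrac>"
    if isinstance(node, Sup):
        return f"<msup>{to_mathml(node.base)}{to_mathml(node.exponent)}</msup>"
    if isinstance(node, Row):
        return "<mrow>" + "".join(to_mathml(c) for c in node.children) + "</mrow>"
    raise TypeError(f"unsupported node type: {type(node).__name__}")


if __name__ == "__main__":
    # AST for \frac{a}{x^2}
    print(to_mathml(Frac(Literal("a"), Sup(Literal("x"), Literal("2")))))
```

Keeping one rendering rule per node type means each rule can be tested against a single entry of the TeX-to-MathML list.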

@NSoiffer
Contributor

NSoiffer commented Oct 24, 2022 via email

@NSoiffer
Contributor

NSoiffer commented Jan 5, 2023

Move this to mathml-docs.
