CJK blog post #835

Open: wants to merge 7 commits into base: main
2 changes: 1 addition & 1 deletion _posts/2024-10-05-max-data-uri-size.adoc
@@ -11,7 +11,7 @@ authors:
- https://github.com/opoudjis

excerpt: >-
- This post describes the how Metanorma leverages Data URIs for media files and
+ This post describes how Metanorma leverages Data URIs for media files and
document attachments to create a single, unified XML document for seamless
distribution, and when it is necessary to disable Data URI encoding in cases.
---
212 changes: 212 additions & 0 deletions _posts/2024-10-26-i18n-cjk.adoc
@@ -0,0 +1,212 @@
---
layout: post
title: "Support for Japanese internationalisation"
date: 2024-10-26
categories: documentation

authors:
- name: Nick Nicholas
email: [email protected]
social_links:
- https://github.com/opoudjis

excerpt: >-
This post describes how Metanorma internationalises content, specifically
for Japanese, in light of Metanorma supporting JIS and Plateau as new flavours.
---

== Introduction

Metanorma supports a number of flavours of standardisation documents, many of
which use languages other than English. As a result, internationalisation of content
is a core concern of Metanorma -- particularly with automatically generated content,
such as captions, cross-references, and autonumbering.

Scripts other than Latin pose their own challenges in internationalisation, including
RTL (right-to-left) scripts like Arabic and Hebrew, and the CJK (Chinese, Japanese, Korean)
ideographic scripts. Our recent move to support documents in Japanese has led to a good deal
of effort on CJK scripts specifically. We have already written here on the work we have done with
link:/blog/2023-12-19/ruby-in-metanorma/[Ruby annotation]. This post summarises some of our more recent work.

== JIS and Plateau

Metanorma has done some work in support of Guóbiāo (Chinese national) standards in the past,
and it supports Chinese as one of the six working languages of the ITU, alongside Arabic,
English, Spanish, French, and Russian. However, the most extensive work we have done on
internationalisation has been with Japanese, prompted by our expansion of Metanorma to two
flavours primarily using Japanese:

* link:/author/jis/[`metanorma-jis`] supports Japanese Industrial Standards (JIS), published
by the Japanese Standards Association (JSA). The JSA as a national standards body coordinates
with ISO and IEC, and its format is closely aligned to ISO. Its documents are published in both
Japanese and English.

* link:https://github.com/metanorma/metanorma-plateau[`metanorma-plateau`] supports the
https://www.mlit.go.jp/plateau/[Plateau] project of the Japanese Ministry of Land, Infrastructure, Transport and Tourism.
The flavour is implemented to derive from `metanorma-jis`, but overrides its formatting in several
instances.

== Vertical printing

The default for Japanese standardisation documents follows the Western convention of writing text
left-to-right, top-down; this is particularly preferred as standardisation documents typically
include mathematical formulas, and Western-language text. However, the
https://en.wikipedia.org/wiki/Horizontal_and_vertical_writing_in_East_Asian_scripts[traditional Japanese practice of writing Japanese top-to-bottom, right-to-left]
remains common, particularly in legal text. We are currently working on implementing vertical
writing for CJK in the PDF output of JIS documents, as a rendering option.

== Japanese calendar

Japan uses the Western Gregorian calendar alongside the traditional https://en.wikipedia.org/wiki/Japanese_calendar[Japanese calendar],
which uses regnal years for the year rather than Anno Domini dating. In official contexts, the Japanese calendar
is used: that includes indication of when documents were created and published.

Metanorma uses ISO 8601, which is based on the Gregorian calendar, for the dates it takes as metadata; so the date
a document was created will be given as something like `created-date: 2020-10-11`. When the date
is displayed in the document frontispiece, it is rendered in the Japanese calendar, as 令和二年10月11日
[Year 2 of the Reiwa era -- the reign of emperor Naruhito; month 10 day 11].
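
The conversion from an ISO 8601 date to an era date can be sketched as follows. This is a minimal
illustration in Python, not Metanorma's own implementation: the `ERAS` table and the function names are
hypothetical, only the three most recent eras are listed, and the era year is left in Arabic numerals
for simplicity (unlike the rendering above, which gives it as 二).

[source,python]
----
from datetime import date

# Era start dates (Gregorian), most recent first. Hypothetical table for
# illustration; earlier eras are omitted.
ERAS = [
    (date(2019, 5, 1), "令和"),    # Reiwa
    (date(1989, 1, 8), "平成"),    # Heisei
    (date(1926, 12, 25), "昭和"),  # Showa
]


def japanese_era(d: date) -> tuple[str, int]:
    """Return (era name, year within the era) for a Gregorian date."""
    for start, name in ERAS:
        if d >= start:
            # The calendar year in which an era starts counts as year 1.
            return name, d.year - start.year + 1
    raise ValueError(f"{d.isoformat()} predates the eras in this sketch")


def format_japanese_date(iso: str) -> str:
    """Render an ISO 8601 date string in the Japanese calendar."""
    d = date.fromisoformat(iso)
    era, year = japanese_era(d)
    return f"{era}{year}年{d.month}月{d.day}日"


print(format_japanese_date("2020-10-11"))
----

Keeping the era boundaries in a table means a new era only requires a new row, not new logic.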

== Japanese numbering

As with writing direction, Japanese standardisation documents typically follow Western practice,
using Arabic numerals for automated numbering (clause numbers, ordered list numbers) and in metadata
(edition numbers, dates of publication). However, more conservatively formatted documents such as
legal documents, which tend to use vertical writing, also tend to use Japanese numbering (properly
speaking, Chinese numbering) in those contexts. Metanorma has recently added functionality in JIS and Plateau
to use Japanese numbering instead of Arabic numbering in those contexts. So by default, a Japanese
document equivalent to

____
Published: 2020-10-11

*1. Introduction.*

*1.1. Scope.*

The following topics are in scope of this document:

1. Japanese numbers.
2. Arabic numbers.
3. Conversion between Japanese and Arabic numbers.
____

would be:

____

公開日: 令和二年10月11日

*1 はじめに。*

*1.1 範囲。*

このドキュメントの範囲は以下のトピックです:

1. 日本語の数字。
2. アラビア数字。
3. 日本語とアラビア数字の変換。
____

If Japanese numbering is set:

[source,asciidoc]
----
:presentation-metadata-autonumbering-style: japanese
----

the document will instead look like:

____

公開日: 令和二年十月十一日

*一 はじめに。*

*一・一 範囲。*

このドキュメントの範囲は以下のトピックです:

一. 日本語の数字。
二. アラビア数字。
三. 日本語とアラビア数字の変換。
____

NOTE: The dot between numbers in clause numbers is a middle dot (・) in Japanese numbering.
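
The number conversion itself can be sketched as below. This is an illustrative Python sketch under stated
assumptions, not the code Metanorma uses: the function names `to_kanji` and `clause_number` are
hypothetical, only values from 0 to 9999 are handled, and a leading 一 is omitted before 十, 百 and 千,
as in 十月 and 十一日 above.

[source,python]
----
DIGITS = "〇一二三四五六七八九"
UNITS = [(1000, "千"), (100, "百"), (10, "十")]


def to_kanji(n: int) -> str:
    """Convert an integer in 0..9999 to Japanese (Chinese) numerals."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return DIGITS[0]
    parts = []
    for value, symbol in UNITS:
        count, n = divmod(n, value)
        if count:
            # A multiplier of one is left implicit: 10 is 十, not 一十.
            parts.append(("" if count == 1 else DIGITS[count]) + symbol)
    if n:
        parts.append(DIGITS[n])
    return "".join(parts)


def clause_number(number: str) -> str:
    """Convert a dotted clause number, joining levels with a middle dot."""
    return "・".join(to_kanji(int(level)) for level in number.split("."))


print(to_kanji(11), clause_number("1.1"))
----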

== Full-width punctuation

Punctuation in CJK scripts is different from that of Roman script, even where CJK has adopted
Western punctuation. In order to fit in with the ideographic characters of Chinese, Japanese and
Korean, the punctuation of CJK needs to be of the same width as an ideographic character
("full-width punctuation"). Text that Metanorma generates automatically is populated by default
with Roman punctuation; any such punctuation needs to be adjusted to
be full-width. The context where this is most apparent is bibliographic references, which
are populated by template out of a bibliographic database; but this also applies to cross-references
and captions.

For instance, a list item cross-reference will by default end in a closing parenthesis: _1)_.
If Japanese numbering is being used, that needs to be rendered not as _一)_, but as _一）_,
with a full-width parenthesis.

Often in technical documents, Roman text and Arabic numerals are interspersed with CJK text.
In such cases, full-width punctuation should not be applied when it adjoins Roman text, but
only with CJK text. So in the example above, if Arabic numbering is used for lists, _1)_ should
be left alone, and not converted to _1）_.
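
The adjacency rule can be sketched as follows. This is a simplified Python illustration, not Metanorma's
implementation: the `FULLWIDTH` mapping, the Unicode ranges used to detect CJK characters, and the
function name `widen` are all assumptions made for this sketch, and only the immediately neighbouring
characters are inspected.

[source,python]
----
import re

# Hypothetical subset of punctuation with full-width counterparts.
FULLWIDTH = {"(": "（", ")": "）", ",": "，", ":": "：", ";": "；", "!": "！", "?": "？"}

# Hiragana, Katakana and the main Han ideograph blocks (a simplification).
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]")


def is_cjk(ch: str) -> bool:
    return bool(ch) and CJK.match(ch) is not None


def is_roman(ch: str) -> bool:
    return bool(ch) and ch.isascii() and ch.isalnum()


def widen(text: str) -> str:
    """Make punctuation full-width only where it adjoins CJK and no Roman text."""
    out = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        touches_cjk = is_cjk(prev) or is_cjk(nxt)
        touches_roman = is_roman(prev) or is_roman(nxt)
        if ch in FULLWIDTH and touches_cjk and not touches_roman:
            out.append(FULLWIDTH[ch])
        else:
            out.append(ch)
    return "".join(out)


print(widen("一)"), widen("1)"))
----

Here `widen("一)")` gives `一）`, while `widen("1)")` is returned unchanged.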

== CJK carriage return

In Roman text, the carriage return at the end of a line in AsciiDoc is interpreted as a space; so
text entered as

[source,asciidoc]
----
Now is the time for all good men
to come to the aid of the party.
----

is reflowed in Metanorma XML (and thus Metanorma outputs) as

[source,xml]
----
<p>Now is the time for all good men to come to the aid of the party.</p>
----

Space is used much more sparingly in CJK; as a result, a carriage return in CJK AsciiDoc text
is *not* interpreted as a space; so

[source,asciidoc]
----
今こそ、すべての善良な人々が
政党を支援する時です。
----

is reflowed in Metanorma XML as

[source,xml]
----
<p>今こそ、すべての善良な人々が政党を支援する時です。</p>
----

with no Roman or CJK space introduced between 人々が and 政党を.

However, as with punctuation, any lines ending with Roman text have the space respected:

[source,asciidoc]
----
実施は中村秀子氏と John
Smith 氏の間で交渉されました。
----

reflows to

[source,xml]
----
<p>実施は中村秀子氏と John Smith 氏の間で交渉されました。</p>
----
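
The line-joining rule illustrated above can be sketched in Python as follows. This is only an illustration,
not the Metanorma code: the function name `reflow` and the Unicode ranges are assumptions, and the sketch
drops the space only when the characters on both sides of the line break are CJK (the text above specifies
the behaviour for lines ending in Roman text, but not for a CJK line followed by a Roman one).

[source,python]
----
import re

# CJK punctuation, Hiragana, Katakana, Han ideographs and full-width forms
# (a simplification chosen for this sketch).
CJK = re.compile(r"[\u3000-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")


def reflow(text: str) -> str:
    """Join the lines of a paragraph, omitting the space between CJK characters."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return ""
    result = lines[0]
    for line in lines[1:]:
        both_cjk = bool(CJK.match(result[-1])) and bool(CJK.match(line[0]))
        result += ("" if both_cjk else " ") + line
    return result


print(reflow("実施は中村秀子氏と John\nSmith 氏の間で交渉されました。"))
----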

== Extended space

In CJK scripts, titles consisting of only a few characters are rendered in extended spacing;
so _Foreword_ as a title is not rendered as 序文, but as 序 文. This behaviour has been implemented
in Metanorma for all section titles consisting of four characters or fewer.
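
A rough sketch of this rule in Python is given below. It is an illustration under assumptions, not a
description of how Metanorma renders the spacing: the function name `extend_title` is hypothetical, the
separator is assumed to be an ideographic space (U+3000), and a space is placed between every pair of
characters in a qualifying title.

[source,python]
----
import re

IDEOGRAPHIC_SPACE = "\u3000"
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]")


def extend_title(title: str, max_len: int = 4, sep: str = IDEOGRAPHIC_SPACE) -> str:
    """Space out short titles made up entirely of CJK characters."""
    chars = list(title.strip())
    if 1 < len(chars) <= max_len and all(CJK.match(c) for c in chars):
        return sep.join(chars)
    # Longer titles, and titles containing Roman text, are left unchanged.
    return title


print(extend_title("序文"))
----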
135 changes: 135 additions & 0 deletions _posts/2024-11-07-iso-historical.adoc
@@ -0,0 +1,135 @@
---
layout: post
title: "Support for historical ISO versions"
date: 2024-11-07
categories: documentation

authors:
- name: Nick Nicholas
email: [email protected]
social_links:
- https://github.com/opoudjis

excerpt: >-
This post describes how Metanorma supports legacy versions of ISO standards.
---

== Introduction

Metanorma is mostly used to prepare new standards for publication, but it is also starting
to be used to format and generate older versions of standards, particularly as
those standards are integrated with data. In particular, we have been working on the
https://www.iso.org/standard/7472.html[ISO 2533] standard (the
https://en.wikipedia.org/wiki/International_Standard_Atmosphere[International Standard Atmosphere]),
a data-intensive standard published in 1975, with addenda in 1985 and 1997, and with a
new version currently in preparation. This work has required us to regenerate the standards
as they were originally published, using the same data, and with the look-and-feel ISO documents
had at the time, rather than how those documents would be presented now.

This work has motivated us to support older versions of how ISO standards were specified and formatted,
so that such regenerated standards do not look anachronistic. There have been several different iterations
of document layout for ISO standards over the years:

`2024`::: (default) The latest document layout as of 2024.
`2013`::: Document layout used from 2013 to early 2024.
`2012`::: Document layout used from mid-2012 to 2013. It is equivalent to the `1989` layout with a logo change.
`1989`::: Document layout used from 1989 to mid-2012.
`1987`::: Document layout used from 1987 to 1989.
`1972`::: Document layout used from 1972 to 1987.
`1951`::: Document layout used from 1951 to 1971. The first available published ISO layout.

Metanorma is configured to select the appropriate layout to render a given document in. This is done in
two ways:

* By default, if the `:copyright-year:` document attribute is specified for a document, that year
is compared to the ranges given above, and the corresponding document layout is applied to the document.
(Layouts are assumed to apply from January 1 of their first year, except for the `2012` layout, which applies from
`2012-07-01`.)
* If the user specifies one of the given years as the `:document-scheme:` document attribute, that
year's layout is applied to the document, overriding any layout chosen through `:copyright-year:`.

At this time, the document layout is only applied to PDF output: HTML and Word output use the latest
layout, regardless of the document scheme.
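
The selection logic described above can be sketched as follows. This Python sketch is illustrative only and
is not taken from the Metanorma code base: the `SCHEMES` table and the function `select_scheme` are
hypothetical, the function takes a full date (how a bare copyright year is mapped to a date is not
covered here), and falling back to `1951` for earlier dates is an assumption.

[source,python]
----
from datetime import date
from typing import Optional

# (first day on which the layout applies, scheme name), most recent first.
SCHEMES = [
    (date(2024, 1, 1), "2024"),
    (date(2013, 1, 1), "2013"),
    (date(2012, 7, 1), "2012"),
    (date(1989, 1, 1), "1989"),
    (date(1987, 1, 1), "1987"),
    (date(1972, 1, 1), "1972"),
    (date(1951, 1, 1), "1951"),
]
DEFAULT_SCHEME = "2024"


def select_scheme(copyright_date: Optional[date] = None,
                  document_scheme: Optional[str] = None) -> str:
    """Pick a layout: an explicit scheme wins, otherwise infer it from the date."""
    valid = {name for _, name in SCHEMES}
    if document_scheme is not None:
        if document_scheme not in valid:
            raise ValueError(f"unknown document scheme: {document_scheme}")
        return document_scheme
    if copyright_date is None:
        return DEFAULT_SCHEME
    for start, name in SCHEMES:
        if copyright_date >= start:
            return name
    # Assumption of this sketch: dates before 1951 use the earliest layout.
    return SCHEMES[-1][1]


print(select_scheme(date(2016, 1, 1)))
----

With this table, a document dated 2016 is assigned the `2013` scheme, matching the cover page
inferred from the copyright year 2016 in the first figure below.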

.The "Rice document" PDF cover page, as it appears in the 2013 document scheme (inferred from its copyright year 2016)
image::/assets/blog/2024-11-07a.png[]

.The "Rice document" PDF cover page, as it appears in the 2024 document scheme
image::/assets/blog/2024-11-07b.png[]

.The "Rice document" PDF cover page, as it appears in the 1951 document scheme
image::/assets/blog/2024-11-07c.png[]

The differences between schemes are mostly a matter of visual presentation, but before 2013, ISO documents
allowed Scope, Normative References and Terms and Definitions to be subclauses of an initial General
clause, rather than requiring them to be separate clauses at the start of the document body.

Document attributes are mostly the same across the document schemes, with the following exceptions:

* In `2024`, the attribute `:semantic-metadata-feedback-link:`, which specifies a URL for readers to provide
feedback for a specific document, is used to generate a QR code on the cover page of the document PDF.
* From 1994, ISO has used the
https://en.wikipedia.org/wiki/International_Classification_for_Standards[International Classification for Standards (ICS)]
number to classify documents, and the ICS number appears on document cover pages;
it is specified as comma-delimited values of the `:library-ics:` document attribute (e.g. `:library-ics: 43.040.20,35.220.20-10`).
Prior to 1994, ISO instead used the
https://en.wikipedia.org/wiki/Universal_Decimal_Classification[Universal Decimal Classification (UDC)],
and this is supported through the generic `:classification:` document attribute; values are comma-delimited,
and each UDC value must be prefixed with `UDC:` (e.g. `:classification: UDC:663.971/.976:620.1:551.511.12, UDC:535.643.2`).

In addition, our support for legacy ISO formats means we now support not only Amendments and Technical Corrigenda
of documents (`:doctype: amendment`, `:doctype: technical-corrigendum`), but also Addenda (`:doctype: addendum`),
which were published by ISO until the 2000s. Addenda are marked up in the same way as Amendments and Technical Corrigenda:
they are updates of documents (whose identifier is given under `:updates:`), and they have distinct titles,
indicated through `:title-addendum-{en,fr}:`. For example, the following is how ISO 2533:1975/ADD 1:1985 (Addendum 1 of ISO 2533)
is marked up:

[source,asciidoc]
----
= Standard atmosphere
:docnumber: 2533
:edition: 1
:copyright-year: 1975
:revdate: 1985-02-15
:language: en
:title-main-en: Standard atmosphere
:title-main-fr: Atmosphère type
:updates: ISO 2533:1975
:has-draft: ISO 2533:1975/DAD 1
:updates-document-type: international-standard
:addendum-number: 1
:doctype: addendum
:docstage: 60
:docsubstage: 60
----

Note the use of `:has-draft:`, which gives the identifier of the pre-publication version of the addendum
(`ISO 2533:1975/DAD 1`: Draft Addendum 1).

And this is how ISO 2533:1975/ADD 2:1997 is marked up:

[source,asciidoc]
----
= Standard atmosphere
:docnumber: 2533
:edition: 1
:copyright-year: 1985
:revdate: 1997-11-01
:language: en
:title-main-en: Standard atmosphere
:title-main-fr: Atmosphère type
:title-main-ru: Стандартная атмосфера
:title-addendum-en: Extension to -- 5000 m and standard atmosphere as a function of altitude in feet
:title-addendum-fr: Extension à -- 5000 m, et atmosphère type en fonction de l'altitude, en feet
:title-addendum-ru: Расширение до -- 5000 м и стандартная атмосфера в функции от высоты в футах
:updates: ISO 2533:1975
:updates-document-type: international-standard
:addendum-number: 2
:doctype: addendum
:docstage: 60
:docsubstage: 60
----

Addendum 2 does not give a pre-publication version identifier, but it does provide a title specific
to the addendum.

Binary file added assets/blog/2024-11-07a.png
Binary file added assets/blog/2024-11-07b.png
Binary file added assets/blog/2024-11-07c.png
7 changes: 5 additions & 2 deletions author/iso/ref/document-attributes.adoc
@@ -65,7 +65,7 @@ updating [added in https://github.com/metanorma/isodoc/releases/tag/v1.3.25].

`:docsubtype:`:: A subclass of doctype for which special processing rules apply.

- `vocabulary`:::
+ `:vocabulary`:::
The "vocabulary" document type is defined in the
https://www.iso.org/ISO-house-style.html[ISO House Rules]
and title requirements defined in the ISO/IEC Directives, Part 2, 2018, 11.5.2.
@@ -139,7 +139,7 @@ There may be more than one ICS for a document; if so, they should be comma-delim
`:classification:`::
+
--
- (for `document-scheme` values of `1989` and prior, and a publication date of 1994 onwards)
+ (for `document-scheme` values of `1989` and prior, and a publication date before 1994)

The
https://en.wikipedia.org/wiki/Universal_Decimal_Classification[Universal Decimal Classification (UDC)]
Expand Down Expand Up @@ -196,6 +196,9 @@ as CIE uses UDC.
====
--

+ `:price-code:`:: price code group of publication, as documented in the
+ https://www.iec.ch/members_experts/tools/pdf/IEC_DATA_FEEDS.pdf[IEC Data Feeds: Technical documentation document] [added in https://github.com/metanorma/metanorma-iso/releases/tag/v2.8.10]

=== Document identifier

==== General
4 changes: 4 additions & 0 deletions author/topics/collections/configuration.adoc
@@ -593,6 +593,10 @@ Prefatory content from the collection manifest [added in https://github.com/meta
`final-content`::
Final content from the collection manifest [added in https://github.com/metanorma/metanorma/releases/tag/v1.5.6].

+ `bibdata`::
+ A hash representation of the `bibdata` element representing the bibliographic metadata
+ of the manifest [added in https://github.com/metanorma/metanorma/releases/tag/v2.0.8].


== Multilingual documents
