A new pattern for translating metadata in tabular data packages #687
Unanswered
augusto-herrmann asked this question in Ideas
Replies: 2 comments · 3 replies
-
I really like this idea @augusto-herrmann!
-
I have a similar problem with my datapackages, and I solved it in another way:

```json
{
  "resources": [{
    "name": "ogd10_energieforschungstatistik_ch.csv",
    "languages": ["de", "fr"],
    [...]
      {
        "name": "finanzquelle",
        "type": "string",
        "format": "default",
        "title@de": "Finanzquelle",
        "title@fr": "Source de financement",
        "description@de": "In- oder ausländische Stelle, welche die Forschung finanziert",
        "description@fr": "Entité nationale ou étrangère qui finance la recherche"
      }
    [...]
}
```

The Data Package Validator on DataHub tells me it's valid, but I am not really confident that I am using the datapackage the way it's meant to be used.
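Reading those per-field keys back is simple enough. A minimal sketch, assuming the field descriptors sit under `schema.fields` as in a standard Table Schema; the file path, fallback language and printed property are just assumptions for the example:

```python
import json

# Path is an assumption for this sketch.
with open("datapackage.json", encoding="utf-8") as f:
    package = json.load(f)

def localized(descriptor: dict, prop: str, lang: str, fallback: str = "de") -> str:
    """Return a property in the requested language, falling back to the
    default-language suffix and then to the plain property."""
    return (
        descriptor.get(f"{prop}@{lang}")
        or descriptor.get(f"{prop}@{fallback}")
        or descriptor.get(prop, "")
    )

resource = package["resources"][0]
for field in resource["schema"]["fields"]:
    print(field["name"], "->", localized(field, "title", "fr"))
```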
-
Frictionless already has a couple of patterns for translating tabular data. The first involves creating additional columns with a language tag at the end, e.g. `name@en`. The other one suggests creating different files in the same path, one CSV file for each language.
Both of these require duplicating a lot of data. Sometimes you have a lot of numeric columns, so it makes no sense to translate the data (which is just numbers, after all), but you still want to translate the metadata (i.e. column titles and descriptions). Other times the data could be translated, but the volume is so great that it isn't feasible in the short term, and you still want to release the untranslated data with translated metadata.
With that in consideration, I would like to suggest a third pattern for translating just the metadata in a tabular data package: create one `datapackage.xx.json` file per language, replacing `xx` with the language tag for that file. The strings in those files should all be in the respective language. All other definitions in the tabular data package (e.g. column type, missing values, restrictions, etc.) should be exactly the same in every file.
We are already using this approach in a data package we publish as open data. A challenge of this pattern is keeping the non-textual information in sync among all of those files. To tackle this challenge, we create the `datapackage.xx.json` files with a Python script that defines all of the data validation properties, while all text strings (titles and descriptions) are read from a separate YAML file (a rough sketch of such a script is included below). If we want to add a new language, we just copy the `datapackage-strings.xx.yaml` file to a new one, translate it, run the Python script and get the new respective `datapackage.xx.json` file. If we want to change any of the data validation properties, we edit the Python script and it regenerates all of the `datapackage.xx.json` files, keeping data validation in sync.
Any comments on alternative ways to translate just the metadata (titles and descriptions) in tabular data packages would be appreciated. If this pattern I describe really turns out to be the best way to do it, perhaps we should add it to the translation support patterns.
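For illustration, here is a minimal sketch of such a generation script. The base descriptor, field names and YAML layout below are placeholder assumptions for the example, not our actual schema; only the `datapackage-strings.xx.yaml` / `datapackage.xx.json` naming follows the pattern described above:

```python
import copy
import glob
import json
import re

import yaml  # PyYAML

# Language-independent definitions: everything except titles and descriptions.
# (The resource and fields below are placeholders, not a real schema.)
BASE_DESCRIPTOR = {
    "name": "example-package",
    "resources": [{
        "name": "data",
        "path": "data.csv",
        "schema": {
            "fields": [
                {"name": "ano", "type": "year"},
                {"name": "valor", "type": "number"},
            ],
            "missingValues": ["", "NA"],
        },
    }],
}

# Expected layout of datapackage-strings.xx.yaml (an assumption for this sketch):
#
#   title: "..."
#   description: "..."
#   fields:
#     ano:
#       title: "..."
#       description: "..."

def build_descriptor(strings: dict) -> dict:
    """Merge translated titles and descriptions into a copy of the base descriptor."""
    descriptor = copy.deepcopy(BASE_DESCRIPTOR)
    descriptor["title"] = strings.get("title", "")
    descriptor["description"] = strings.get("description", "")
    for resource in descriptor["resources"]:
        for field in resource["schema"]["fields"]:
            field.update(strings.get("fields", {}).get(field["name"], {}))
    return descriptor

# Write one datapackage.xx.json for every datapackage-strings.xx.yaml found.
for path in glob.glob("datapackage-strings.*.yaml"):
    lang = re.search(r"datapackage-strings\.(.+)\.yaml", path).group(1)
    with open(path, encoding="utf-8") as f:
        strings = yaml.safe_load(f)
    with open(f"datapackage.{lang}.json", "w", encoding="utf-8") as f:
        json.dump(build_descriptor(strings), f, ensure_ascii=False, indent=2)
```

Since every language file is regenerated from the same base descriptor, the validation rules cannot drift apart between translations.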