A Survey of Corpora for Germanic Low-Resource Languages and Dialects

You can read more about this corpus collection here. If you find this overview useful for your research, please cite:

@inproceedings{blaschke-etal-2023-survey,
  title = {A survey of corpora for {G}ermanic low-resource languages and dialects},
  author = {Blaschke, Verena and Sch{\"u}tze, Hinrich and Plank, Barbara},
  year = {2023},
  month = may,
  booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
  address = {T{\'o}rshavn, Faroe Islands},
  publisher = {University of Tartu Library},
  url = {https://aclanthology.org/2023.nodalida-1.41},
  pages = {392--414},
}

Language varieties:

- General
- North Germanic (Faroese · (non-std.) Norwegian · Jutish · East Danish · Elfdalian · (non-std.) Swedish)
- West Germanic
  - North Sea Germanic
    - Anglo-Frisian (Scots · (non-std.) English · West Frisian · North Frisian · Saterland Frisian)
    - Low German (Low Saxon · East Frisian Low Saxon · Gronings · Twents · Achterhoeks · Westphalian)
  - Macro-Dutch ((non-std.) Dutch · West Flemish · Zeelandic)
  - High German
    - Central German (Upper Saxon · Moselle Franconian incl. Luxembourgish · Colognian · Limburgish · Rhine Franconian incl. Palatine German · Pennsylvania Dutch · Yiddish)
    - Upper German ((non-std.) German · Upper Franconian · Bavarian · Cimbrian · Mòcheno · Swabian · Central Alemannic (Swiss German & Alsatian) · Walser)

Inclusion criteria:

- Accessible to researchers
- Can be downloaded (easily)
- No extensive pre-processing required (appropriate file formats; no abundance of OCR errors)
- ~~Full sentences/utterances rather than word lists~~ We have relaxed this criterion and are now also including word-based resources useful for variationist research.
- Data are contemporaneous or from the past century
- If only a written version is available, it should be (manually) annotated and/or showcase variation through phone[t/m]ic transcriptions or orthographies used specifically for that language variety

We focus on manual or manually corrected annotations rather than fully automatically annotated data. For corpora with an “uncurated” note, we strongly recommend manually checking the data quality, as it might be low or mixed. We've excluded corpora where we were able to determine large-scale data quality issues. Note that the webcrawl-based corpora likely overlap with the contents of some of the other corpora, and for languages with especially few resources, the overlap with Wikipedia tends to be extremely high.
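
To get a rough sense of how much a web-crawled corpus overlaps with another corpus such as Wikipedia, one option is to compare normalized sentence fingerprints. The following is a minimal sketch, not part of the survey itself: the function names, the normalization steps and the toy sentences are all assumptions made for illustration, and exact matching will miss near-duplicates.

```python
import hashlib
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Apply NFKC, lowercase, and collapse whitespace so trivial differences do not hide duplicates."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    return re.sub(r"\s+", " ", text).strip()


def fingerprints(sentences):
    """Return the set of SHA-1 digests of the normalized, non-empty sentences."""
    digests = set()
    for sentence in sentences:
        norm = normalize(sentence)
        if norm:
            digests.add(hashlib.sha1(norm.encode("utf-8")).hexdigest())
    return digests


def overlap_ratio(corpus, reference):
    """Share of distinct sentences in `corpus` that also occur in `reference`."""
    corpus_fp = fingerprints(corpus)
    if not corpus_fp:
        return 0.0
    return len(corpus_fp & fingerprints(reference)) / len(corpus_fp)


# Toy data standing in for a web crawl and a Wikipedia dump (hypothetical examples).
web_crawl = [
    "Hetta er ein setningur.",
    "Hetta  er ein  setningur.",  # same sentence with extra whitespace
    "Ein annar setningur.",
]
wikipedia = ["hetta er ein setningur."]

print(overlap_ratio(web_crawl, wikipedia))  # 0.5: one of two distinct sentences is shared
```

For real corpora, the sentence lists would be read from the downloaded files, and a fuzzier comparison (for example n-gram based) may be more appropriate than exact matching.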

The license names link to where the license is mentioned on the corpus website, unless the license is mentioned on the site linked in the first column, in the article accompanying the dataset, or in the downloaded corpus files. Always refer to the original corpus websites/papers to double-check the license information; we cannot guarantee that the information here is up to date.

Did we forget a corpus for a Germanic low-resource language or dialect that fits these inclusion criteria? Please reach out to us via a GitHub issue or an email to verena DOT blaschke ÄT cis.lmu.de!

General

Corpus	Notes	Size	Representation	License
Sound Comparisons: Germanic (Paschen ea 2019)	word-based, 120 locations/doculects from all Germanic sub-branches	106 words × 120 locations	audio, phono (IPA), English ortho, ortho of relevant std languages	CC BY-NC-ND 4.0

Faroese · fao · fao1244

Corpus	Notes	Size	Representation	License
UD Faroese OFT (Tyers ea 2018)	POS (UPOS, Giellatekno-FAO), dependencies (UD), morpho (UD), lemmas. Contains material from Wikipedia	1.2k sentences	Faroese ortho	GNU GPL 2.0, GNU LGPL 2.1, Mozilla Public License 1.1
FarPaHC (Ingason ea 2012, Rögnvalsson ea 2012)	POS (mod. Penn-historical), phrase structure (mod. Penn-historical)	53k tokens	Faroese ortho	CC BY 4.0
UD Faroese FarPaHC (Ingason ea 2012, Rögnvalsson ea 2012)	POS (UPOS), dependencies (UD), morpho (UD)	40k tokens	Faroese ortho	CC BY-SA 4.0
FoNE (Snæbjarnarson ea 2023)	named entities (8 classes). The text overlaps with the BLARK background corpus (Sosialurin subcorpus)	118k tokens	Faroese ortho	CC BY 4.0
Fo-STS (Snæbjarnarson ea 2023)	semantic text similarity (sentence-level), translated subset of the English STS corpus (Cer ea 2017)	729 sentence pairs	Faroese ortho	CC BY 4.0
BLARK 1.0 (background corpus) (Simonsen ea 2022)		25M tokens	Faroese ortho	CC BY 4.0
Sprotin translations	English–Faroese parallel sentences	126k sentence pairs	Faroese ortho	MIT license
Føroyskur talumálsbanki (Jacobsen 2022)		599.9k tokens	Faroese ortho(, audio?)	CLARIN RES-PLAN-BY-PRIV-NORED
Faroese text collection (FTS)	in BLARK 1.0 background corpus	1.1M tokens	Faroese ortho	CC BY 4.0
Korp (Giellatekno)	in BLARK 1.0 background corpus (download via BLARK), contains Wikipedia articles	?	Faroese ortho	CC BY 4.0
BLARK 1.0 (audio) (Simonsen ea 2022)	locations (Suðuroy, Sandoy, Suðurstreymoy, Norðurstreymoy/Eysturoy, Vágar, Norðuroyggjar)	100 hrs	audio, Faroese ortho, some phono	CC BY 4.0
Faroese Danish Corpus Hamburg (FADAC Hamburg) (subset) (Debess 2019)	locations (Tórshavn, Vágar, Suðuroy, Eysturoy/Norðuroyggjar)	31 hrs	audio, Faroese ortho	HZSK-RES
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022)	parallel with ~200 languages	2k sentences	Faroese ortho	CC BY-SA 4.0
Tatoeba (fao subset)	translations into other languages	417 sentences	Faroese ortho	CC BY 2.0 FR
ITU Faroese/Danish (Derczynski ea 2022)	Danish translations; overlaps with (Danish) Tatoeba	4k sentences		CC BY 4.0
Ubuntu via OPUS (Tiedemann 2012)	translations into other languages	20.2k tokens	Faroese ortho	?
QED via OPUS (Abdelali ea 2014, Tiedemann 2012)	translations into other languages	6.4k tokens	Faroese ortho	?
UDHR-LID (subset) (Kargaran ea 2023, Unicode)		57 sentences		CC0 1.0
OpenLID (subset) (Burchell ea 2023)	combines other corpora	40k lines		depend on source datasets
FAO News 2020 (Goldhahn ea 2012)	uncurated?	33.8k sentences		?
FAO Newscrawl 2011 (Goldhahn ea 2012)	uncurated?	8.8k sentences		?
Faroese Mixed Corpus (Goldhahn ea 2012)	uncurated?	300k sentences		?
Faroese Web Corpus (Goldhahn ea 2012)	uncurated?	1M sentences		?
FC3 (Snæbjarnarson ea 2023)	Faroese subset of CommonCrawl (uncurated)	98k paragraphs / 9M tokens	Faroese ortho	unspecified CC license
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012)	uncurated	102 MB	Faroese ortho	CC BY-SA 3.0
MADLAD-400 (subset) (Kudugunta ea 2023)	uncurated, subset of CommonCrawl	1.8M sentences		CC-BY-4.0
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	2.3M sentences		Apache 2.0 + licenses of source datasets
Wikipedia (fo subset)	uncurated	14k articles	Faroese ortho	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0
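
The tables in this overview share five columns (Corpus, Notes, Size, Representation, License). As a hedged illustration of how such rows could be processed, the sketch below parses a few tab-separated rows copied from the Faroese table and flags the ones whose notes mention "uncurated", i.e. the ones for which a manual quality check is recommended. The column keys and helper names are assumptions chosen for this example.

```python
import csv
import io

# Column keys are illustrative names for the five table columns.
COLUMNS = ["corpus", "notes", "size", "representation", "license"]

# Three rows copied from the Faroese table above, tab-separated.
TABLE = (
    "BLARK 1.0 (background corpus) (Simonsen ea 2022)\t\t25M tokens\tFaroese ortho\tCC BY 4.0\n"
    "FAO News 2020 (Goldhahn ea 2012)\tuncurated?\t33.8k sentences\t\t?\n"
    "Wikipedia (fo subset)\tuncurated\t14k articles\tFaroese ortho\t"
    "text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0\n"
)


def parse_rows(text):
    """Parse tab-separated rows into dicts, padding rows that lack trailing cells."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for cells in reader:
        if not cells:
            continue  # skip blank lines
        cells = cells + [""] * (len(COLUMNS) - len(cells))
        rows.append(dict(zip(COLUMNS, cells)))
    return rows


def needs_manual_check(row):
    """True if the notes mention 'uncurated' (also matches 'partially uncurated')."""
    return "uncurated" in row["notes"].lower()


rows = parse_rows(TABLE)
flagged = [row["corpus"] for row in rows if needs_manual_check(row)]
print(flagged)
# ['FAO News 2020 (Goldhahn ea 2012)', 'Wikipedia (fo subset)']
```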

For additional resources/tools, see also the resource list of the Faroese Centre for Language Technology.

Norwegian · nor · norw1258

Corpus	Notes	Size	Representation	License
LIA Treebank (+transcriptions) (Øvrelid ea 2018)	POS (mod. NDT), dependencies (mod. NDT), morpho (mod. NDT), lemmas, locations (17 places in Norway). Annotated subset of LIA Norsk	7.5k speech segments / 78k tokens	Nynorsk ortho, phono	CC BY-NC-SA 4.0
UD Norwegian Nynorsk LIA (+transcriptions) (Øvrelid ea 2018)	POS (UPOS), dependencies (UD), morpho (UD), lemmas, locations (10 places in Norway). Annotated subset of LIA Norsk	5.3k speech segments / 55k tokens	Nynorsk ortho, phono; aligned Nynorsk+phono here (Blaschke ea 2023)	treebank: CC BY-SA 4.0, transcriptions: CC BY-NC-SA 4.0
NDC Treebank (+transcriptions; website) (Kåsen ea 2022, Johannessen ea 2009)	POS (mod. NDT), dependencies (mod. NDT), morpho (mod. NDT), lemmas, locations (17 places)	4.6k speech segments / 66k tokens	Bokmål ortho, phono	treebank and transcriptions: CC BY-NC-SA 4.0
NoMusic (Mæhlum & Scherrer 2024) subset of xSID	slot filling, intent detection, translations into Bokmål and 16 other languages; location (8 dialects)	8×800 sentences	ad-hoc pronunciation spelling	CC BY-SA 4.0
NorDial (subset) (Barnes ea 2021)		348 tweets	ad-hoc spelling	CC0 1.0
NorDial (POS-annotated subset) (Mæhlum ea 2022 – contact authors)	POS (UPOS)	35+ tweets	ad-hoc spelling
Nordic Dialect Corpus (subset) (Johannessen ea 2009)	locations (>100 places)	1.9M tokens	Bokmål ortho, phono; aligned Bokmål+phono here (Scherrer 2023)	CC BY-NC-SA 4.0
LIA Norsk (Øvrelid ea 2018)	locations (222 places)	3.5M tokens	Nynorsk ortho, phono	CC BY-NC-SA 4.0
LIA Norsk (downloadable audio subset) (Øvrelid ea 2018)	locations (178 places)	?	audio, Nynorsk ortho, phono	CC BY-NC-SA 4.0
The spoken language investigation in Oslo (TAUS)	locations (East vs. West Oslo)	387k tokens	Bokmål ortho, phono	CC BY-NC-SA 4.0
American Nordic Speech Corpus (CANS) (subset) (Johannessen ea 2015)	locations (57 places in USA/Canada)	773k tokens	Bokmål ortho, phono	CC BY-NC-SA 4.0
Speech Database for Norwegian (NB Tale)	locations (24 areas)	365 × 2 mins (spontaneous speech), 7.6k sentences (reading)	audio, Bokmål ortho, mod. X-SAMPA	CC0
Norwegian Parliamentary Speech Corpus (NPSC)	locations (5 dialect regions)	140 hrs / 65k sentences / 1.2M tokens	audio, Bokmål/Nynorsk ortho	CC0

Swedish · swe · swe1254

Corpus	Notes	Size	Representation	License
Parallel dialectal-standard Swedish data (Hämäläinen ea 2020, Ivars & Södergård 2007)	Finland Swedish (with locations)	86.5k tokens	transcription, Swedish ortho	CC BY-NC-SA 4.0
American Nordic Speech Corpus (CANS) (subset) (Johannessen ea 2015)	locations (7 places in the US)	46k tokens	Swedish ortho, phono	CC BY-NC-SA 4.0

Scots · sco · scot1243

Corpus	Notes	Size	Representation	License
POS-tagged Scots corpus (Lameris & Stymne 2021)	POS (UPOS); overlaps with the SCOTS corpus	1k tokens	partially ad hoc (SCOTS), partially with a standardized orthography (Mak Forrit)
Scottish Corpus of Texts & Speech (SCOTS) (subset) (Anderson ea 2007)	partially annotated in the POS-tagged Scots corpus	unknown (4.6M tokens total)	mix of ad-hoc spelling and English ortho	custom
UDHR-LID (subset) (Kargaran ea 2023, Unicode)		58 sentences		CC0 1.0
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012)	uncurated	35 MB	?	CC BY-SA 3.0
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	410k sentences		Apache 2.0 + licenses of source datasets
Wikipedia (sco subset)	uncurated, see reports here and here ⚠	39k articles	Scots spelling recommendations	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

English · eng · stan1293

Corpus	Notes	Size	Representation	License
TwitterAAE-UD (Blodgett ea 2016)	dependencies (UD); AAVE	250 tweets	ad-hoc spelling
Diachronic Electronic Corpus of Tyneside English (DECTE) (Corrigan ea 2012)	locations (19 places in NE England). Contains the Newcastle Electronic Corpus of Tyneside English (NECTE) and NECTE2; NECTE in turn contains the Tyneside Linguistic Survey (TLS) and the Phonological Variation and Change in Contemporary Spoken English (PVC) corpus.	72 hrs / 804k tokens	audio, English ortho, partially: phono	custom
Intonational Variation in English (IViE) (Nolan & Post 2013)	locations (British Isles: Belfast, Dublin, Newcastle, Leeds, Bradford, Liverpool, Cambridge, Cardiff, London)	36 hrs	audio, English ortho	custom
Crowdsourced high-quality UK and Ireland English Dialect speech data set (Demirsahin ea 2020)	locations (British Isles: Ireland, Midlands, Northern England, Scotland, Southern England, Wales)	31 hrs	audio, English ortho	CC BY-SA 4.0
Helsinki Corpus of British English Dialects	locations (UK: Cambridgeshire, Devon, Essex/Lancashire, Isle of Ely, Somerset, Suffolk)	1M tokens	audio, English ortho
Nationwide Speech Project (NSP) (Clopper & Pisoni 2006)	locations (USA: West, Midland, North, South, New England, Mid-Atlantic)	60 × 1 hr	audio, partially: English ortho
Corpus of Regional African American Language (CORAAL) (Kendall & Farrington 2021)	6 locations, AAVE	135.6 hrs / 1.5M tokens	audio, English ortho	CC BY-NC-SA 4.0
Sound Comparisons: Englishes (Maguire ea 2019)	word-based, 51 locations	110 words × 51 locations	audio, phono (IPA), English ortho	CC BY-NC-ND 4.0

West(ern) Frisian · fry · west2354

Corpus	Notes	Size	Representation	License
UD Frisian/Dutch Fame (Braggaar & van der Goot 2021, Yılmaz ea 2016)	POS (UPOS), dependencies (UD), code-switching; code-mixed Frisian and Dutch. Annotated subset of FAME.	400 sentences	Frisian(/Dutch) ortho	CC BY-SA 4.0
UD Frisian Frysk (Heeringa ea 2021)	under development!; POS (UPOS), dependencies (UD), morpho (UD), lemmas	2.9k sentences	Frisian ortho	CC BY-SA 3.0
Common Voice (subset) (Ardila ea 2020)		211 hrs	audio, Frisian ortho	CC0
Frisian AudioMining Enterprise (FAME!) (Yılmaz ea 2016)	partially: locations	18.5 hrs	audio, Frisian ortho
Recordings of Dutch-Frisian council meetings (Bentum ea 2022)		26 hrs / 281k tokens	audio, Frisian ortho
Corpus Spoken Frisian / Korpus Sprutsen Frysk (KSF)		200 hrs (65 hrs thereof transcribed)	audio, partially: Frisian ortho
Boarnsterhim Corpus (BHC) (subset) (Sloos ea 2018)	under revision!	unknown (250 hrs total, with Dutch)	audio
Tatoeba (fry subset)	translations into other languages	641 sentences	Frisian ortho	CC BY 2.0 FR
Ubuntu via OPUS (Tiedemann 2012)	translations into other languages	22.4k tokens	Frisian ortho
KDE4 via OPUS (Tiedemann 2012)	translations into other languages	ca. 300k tokens	Frisian ortho
GNOME via OPUS (Tiedemann 2012)	translations into other languages	55.7k tokens	Frisian ortho
Mozilla-I10n	translations into other languages	ca. 400k tokens	Frisian ortho	Mozilla Public License 2.0
UDHR-LID (subset) (Kargaran ea 2023, Unicode)		58 sentences		CC0 1.0
FRY News 2020 (Goldhahn ea 2012)	uncurated?	107.5k sentences	? (written)	?
Western Frisian Newscrawl (Goldhahn ea 2012)	uncurated?	100k sentences
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012)	uncurated	72 MB	Frisian ortho	CC BY-SA 3.0
CC-100 (subset) (Wenzek ea 2020)	uncurated, subset of CommonCrawl	174 MB	Frisian ortho
OSCAR (subset) (Abadji ea 2022)	uncurated, subset of CommonCrawl	9.9M tokens / 70.4 MB	Frisian ortho	Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023)	uncurated, subset of mc4 and OSCAR	223k sentences		see mc4 & OSCAR
MADLAD-400 (subset) (Kudugunta ea 2023)	uncurated, subset of CommonCrawl	3.7M sentences		CC-BY-4.0
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	927k sentences		Apache 2.0 + licenses of source datasets
Wikipedia (fy subset)	uncurated	50k articles	Frisian ortho	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

North(ern) Frisian · frr · north2626

Corpus	Notes	Size	Representation	License
Tatoeba (frr subset)	translations into other languages	2.9k sentences	?	CC BY 2.0 FR
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	55.3k sentences		Apache 2.0 + licenses of source datasets
Wikipedia (frr subset)	uncurated, partially tagged with dialect information	17k articles	different dialect-based (ad-hoc?) orthographies	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Saterland Frisian/Saterfrisian · stq · sate1242

Corpus	Notes	Size	Representation	License
Tatoeba (stq subset)	translations into other languages	96 sentences	?	CC BY 2.0 FR
MADLAD-400 (subset) (Kudugunta ea 2023)	uncurated, subset of CommonCrawl	27.7k sentences		CC-BY-4.0
Wikipedia (stq subset)	uncurated	4k articles	revised Kramer orthography for Saterfrisian (unclear if example, recommendation or rule for this wiki)	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Low Saxon/Low German · nds · lowg1239

Corpus	Notes	Size	Representation	License
UD Low Saxon LSDC (Siewert ea 2021)	POS (UPOS), dependencies (UD), morphological features (UD), glosses (Middle Low Saxon), lemmas, locations (18 dialect areas, see also LSDC note); overlaps with LSDC	1000 sentences	ad-hoc spelling, Nysassiske Skryvwyse	CC BY-SA 4.0
TaPaCo (subset) (Scherrer 2020)	paraphrases; annotated subset of Tatoeba	1107 sentences	?	CC BY 2.0
Low Saxon Dialect Classification (LSDC) (Siewert ea 2020)	locations (15 dialect areas); overlaps with UD Low Saxon LSDC; also contains FRS, WEP, TWD, ACT content	88.9k sentences (incl. FRS etc.)	ad-hoc spelling	CC BY-NC-SA 4.0
Sprachvariation in Norddeutschland (SiN, Hamburg collection) (Schröder 2011, Elmentaler ea 2015) (Low German subset)	varieties of Low Saxon (Nordhannoversch, Emsländisch Oldenburgisch), East Frisian Low Saxon and (Northern) German	unknown (300 hrs total)	audio	HZSK-RES
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD))	locations	80 min / 10.7k tokens	audio, German ortho	custom terms
Tatoeba (nds subset)	translations into other languages	18.1k sentences	?	CC BY 2.0 FR
Ubuntu via OPUS (Tiedemann 2012)	translations into other languages	35.3k tokens		?
KDE4 via OPUS (Tiedemann 2012)	translations into other languages	1.1M tokens		?
GNOME via OPUS (Tiedemann 2012)	translations into other languages	ca. 700k tokens		?
UDHR-LID (subset) (Kargaran ea 2023, Unicode)		58 sentences		CC0 1.0
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012)	uncurated	24 MB	?	CC BY-SA 3.0
OSCAR (subset) (Abadji ea 2022)	uncurated, subset of CommonCrawl	1.6M tokens / 10.7 MB	?	Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023)	uncurated, subset of mc4 and OSCAR	15.1k sentences		see mc4 & OSCAR
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	934k sentences		Apache 2.0 + licenses of source datasets
Wikipedia (nds subset)	uncurated, partially tagged with dialect information	84k articles	Sass'sche Schrievwies	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0
Wikipedia (nds-nl subset)	uncurated, partially tagged with dialect information	8k articles	Nysassiske Skryvwyse (preferred) and Algemene Nedersaksische Schriefwieze (older articles)	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

East Frisian Low Saxon · frs · east2288

Corpus	Notes	Size	Representation	License
Sprachvariation in Norddeutschland (SiN, Hamburg collection) (East Frisian Low Saxon subset)	varieties of Low Saxon, East Frisian Low Saxon and (Northern) German	unknown (300 hrs total)	audio	HZSK-RES
Low Saxon Dialect Classification (LSDC) (OFR subset) (Siewert ea 2020)	minor overlaps with UD Low Saxon LSDC	240 sentences	ad-hoc spelling	CC BY-NC-SA 4.0

Gronings · gos · gron1242

Corpus	Notes	Size	Representation	License
TaPaCo (subset) (Scherrer 2020)	paraphrases; annotated subset of Tatoeba	122 sentences	?	CC BY 2.0
Automatic speech recognition dataset for Gronings (Bartelds ea 2023)		4 hrs	audio, written	CC BY 4.0
Dataset: Gronings (Bartelds & San 2021, San ea 2021)		23 mins	audio, written	CC BY 4.0
Tatoeba (gos subset)	translations into other languages	5.7k sentences	?	CC BY 2.0 FR

Westphalic/Westphalish/Westphalian · wep · west2356

Corpus	Notes	Size	Representation	License
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD))		15 min / 2.4k tokens	audio, German ortho	custom terms
Low Saxon Dialect Classification (LSDC) (OWL subset) (Siewert ea 2020)	minor overlaps with UD Low Saxon LSDC	15k sentences	ad-hoc spelling	CC BY-NC-SA 4.0

Dutch · nld · dutc1256

Corpus	Notes	Size	Representation	License
Corpus of Southern Dutch Dialects (GCND) (Breitbarth ea 2018)	under construction!; might also include West Flemish, Zeelandic, and/or Limburgs		audio, transcriptions
SAND (Barbiers ea 2006)	locations	?	phono	custom
MAND/FAND/GTRP (Goeman ea) (contact institute)	locations		phono (K-IPA)

West(ern) Flemish · vls · vlaa1240

Corpus	Notes	Size	Representation	License
Stemmen uit het verleden (annotated subset) (Lybaert ea 2019, Van Keymeulen ea 2019)	V2 variation, locations (25 places)	1.4k sentences	phono	CC BY-NC 4.0
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	102k sentences		Apache 2.0 + licenses of source datasets
VLS Community 2017 (Goldhahn ea 2012)	possibly uncurated	36.4k sentences	? (written)	?
Wikipedia (vls subset)	uncurated, partially tagged with dialect information	8k articles	Standoardvlams (orthography developed by vls.wikipedia.org editors)	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Zeelandic/Zeeuws · zea · zeeu1238

Corpus	Notes	Size	Representation	License
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	34.4k sentences		Apache 2.0 + licenses of source datasets
Wikipedia (zea subset)	uncurated	6k articles	?	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Upper Saxon · sxu · uppe1400

Corpus	Notes	Size	Representation	License
SXUCorpus (Herms ea 2016) (contact authors)	8 locations	500 min / 70k tokens	audio, German ortho
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD))		12 min / 1.7k tokens	audio, German ortho	custom terms

Luxembourgish · ltz · luxe1243

Corpus	Notes	Size	Representation	License
UD Luxembourgish LuxBank (Plum ea 2024)	POS tags (UPOS), dependencies (UD)	20 sentences	Luxembourgish ortho
Banking Client Support (BCS) Dataset (Lothritz ea 2021)	intent detection, slot filling, parallel with DEU, FRA, ENG	1k sentences	Luxembourgish ortho	?
Luxembourgish translation of Winograd Natural Language Inference (L-WNLI) (Lothritz ea 2022)	NLI, parallel with other languages (Levesque ea 2012)	767 samples	Luxembourgish ortho	?
Luxembourgish POS and NER (Lothritz ea 2022) (contact authors)	POS tags (15 tags), NER (PER, ORG, LOC, GPE, MISC)	5.5k sentences	Luxembourgish ortho	?
Luxembourgish news classification (Lothritz ea 2022) (contact authors)	8 classes	10k articles	Luxembourgish ortho	?
SA1 (Lothritz ea 2023; contact authors)	sentiment	1.8k sentences
Luxembourgish sentence negation (Lothritz ea 2023)	position of negation particle; overlaps with Leipzig corpora (Newscrawl and/or Web and/or Wikipedia)	46k sentences
LuxId (Lavergne ea 2014)	code-switching (LTZ, DEU, FRA)	924 sentences (most with LTZ content)	Luxembourgish(/German/French) ortho	CC BY-SA 3.0
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022)	parallel with ~200 languages	2k sentences	Luxembourgish ortho	CC BY-SA 4.0
FLEURS (subset) (Conneau ea 2023)	parallel with ~100 languages; audio version of FLORES (Goyal ea 2022)	1-3 recordings each of 1.9k sentences (3.8k recordings total)	audio, Luxembourgish ortho	CC BY 4.0
Tatoeba (ltz subset)	translations into other languages	884 sentences	Luxembourgish ortho	CC BY 2.0 FR
Ubuntu via OPUS (Tiedemann 2012)	translations into other languages	17k tokens	Luxembourgish ortho	?
KDE4 via OPUS (Tiedemann 2012)	translations into other languages	28.8k tokens	Luxembourgish ortho	?
Mozilla-I10n	translations into other languages	6.9k tokens	Luxembourgish ortho	Mozilla Public License 2.0
QED via OPUS (Abdelali ea 2014, Tiedemann 2012)	translations into other languages	19.2k tokens	Luxembourgish ortho	?
TED2020 via OPUS (Reimers & Gurevych, Tiedemann 2012)	translations into other languages	1.7k tokens	Luxembourgish ortho	CC BY-NC-ND 4.0
UDHR-LID (subset) (Kargaran ea 2023, Unicode)		59 sentences		CC0 1.0
OpenLID (subset) (Burchell ea 2023)	combines other corpora	37.7k lines		depend on source datasets
Luxembourgish Newscrawl (Goldhahn ea 2012)	uncurated?	300k sentences
Luxembourgish Web Corpus (Goldhahn ea 2012)	uncurated?	1M sentences
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012)	uncurated	81 MB	?	CC BY-SA 3.0
OSCAR (subset) (Abadji ea 2022)	uncurated, subset of CommonCrawl	2.5M tokens / 18.4 MB	?	Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023)	uncurated, subset of mc4 and OSCAR	166k sentences		see mc4 & OSCAR
MADLAD-400 (subset) (Kudugunta ea 2023)	uncurated, subset of CommonCrawl	3.4M sentences		CC-BY-4.0
Wikipedia (lb subset)	uncurated	61k articles	Luxembourgish ortho	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Colognian · ksh · kols1241

Corpus	Notes	Size	Representation	License
Tatoeba (ksh subset)	translations into other languages	82 sentences	?	CC BY 2.0 FR
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	33.5k sentences		Apache 2.0 + licenses of source datasets
Wikipedia (ksh subset)	uncurated, Colognian and other varieties of Ripuarian, partially tagged with dialect and/or orthography information	3k articles	ad-hoc spelling, some articles according to various Ripuarian orthographies	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Limburgish/Limburgan · lim · lim1263

Corpus	Notes	Size	Representation	License
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022)	parallel with ~200 languages; Maastrichtian Limburgs	2k sentences		CC BY-SA 4.0
Ubuntu via OPUS (Tiedemann 2012)	translations into other languages	18.4k tokens		?
GNOME via OPUS (Tiedemann 2012)	translations into other languages	ca. 400k tokens		?
OpenLID (subset) (Burchell ea 2023)	combines other corpora	48k lines		depend on source datasets
LIM Community 2017 (Goldhahn ea 2012)	possibly uncurated	84.4k sentences	? (written)	?
LIM Web 2010 (Netherlands) (Goldhahn ea 2012)	uncurated?	35.4k sentences	? (written)	?
CC-100 (subset) (Wenzek ea 2020)	uncurated, subset of CommonCrawl	8.3 MB
CulturaX (subset) (Nguyen ea 2023)	uncurated, subset of mc4 and OSCAR	206 sentences		see mc4 & OSCAR
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	652k sentences		Apache 2.0 + licenses of source datasets
Wikipedia (li subset)	uncurated, partially tagged with dialect and/or orthography information	14k articles	Veldeke-sjpelling, Algemein Gesjreve Limburgs	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Rhine/Rhenish Franconian · rhin1244

Corpus	Notes	Size	Representation	License
Thorsten-Voice Dataset 2023.09 Hessisch (Müller & Kreutz 2024)	Hessian	2 hrs / 2.1k sentences	audio, German ortho	CC0
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD))	Hessian	8 min / 1.4k tokens	audio, German ortho	custom terms
Wikipedia (pfl subset)	uncurated, partially tagged with dialect information; contains articles in Palatine German, Lorraine Franconian, Hessian	3k articles	(implied) ad-hoc spelling	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Pennsylvania Dutch · pdc · penn1240

Corpus	Notes	Size	Representation	License
Tatoeba (pdc subset)	translations into other languages	57 sentences	?	CC BY 2.0 FR
Wikipedia (pdc subset)	uncurated	2k articles	?	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

General

Faroese · fao · fao1244

Norwegian · nor · norw1258

Jutish · juti1236

East Danish · scan1238

Elfdalian/Övdalian · ovd · elfd1234

Swedish · swe · swe1254

Anglo-Frisian

Scots · sco · scot1243

English · eng · stan1293

West(ern) Frisian · fry · west2354

North(ern) Frisian · frr · north2626

Saterland Frisian/Saterfrisian · stq · sate1242

Low German

Low Saxon/Low German · nds · lowg1239

East Frisian Low Saxon · frs · east2288

Gronings · gos · gron1242

Twents · twd · twen1241

Achterhoeks · act · acht1238

Westphalic/Westphalish/Westphalian · wep · west2356

Macro-Dutch

Dutch · nld · dutc1256

West(ern) Flemish · vls · vlaa1240

Zeelandic/Zeeuws · zea · zeeu1238

Central German

Upper Saxon · sxu · uppe1400

Moselle Franconian · luxe1241

Luxembourgish · ltz · luxe1243

Transylvanian Saxon · tran1294

Colognian · ksh · kols1241

Limburgish/Limburgan · lim · lim1263

Rhine/Rhenish Franconian · rhin1244

Pennsylvania Dutch · pdc · penn1240

Yiddish · yid · west2361/east2295

Upper German

German · deu · stan1295

Upper/High Franconian · uppe1464

Bavarian · bar · bava1246

Cimbrian · cim · cimb1238

Mòcheno · mhn · moch1255

Swabian · swg · swab1242

Central Alemannic (incl. Swiss German & Alsatian) · gsw · swis1247

Walser · wae · wals1238

Yiddish · yid · west2361/east2295

Corpus	Notes	Size	Representation	License
Penn Parsed Corpus of Historical Yiddish (Santorini 2021)	POS (Penn-historical), phrase structure (Penn-historical)	200k tokens	partially YIVO transliteration, partially YIVO-inspired ad-hoc transliteration	CC BY-NC-SA 4.0
CABank Yiddish Corpus (Newman 2015)	New York	1 hr	audio, transcriptions (partially IPA, partially orthography-based (YIVO-transliteration-based?))	CC BY-NC-SA 3.0
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022)	parallel with ~200 languages; Eastern Yiddish (Hasidic)	2k sentences		CC BY-SA 4.0
UDHR-LID (subset) (Kargaran ea 2023, Unicode)	Eastern Yiddish	59 sentences		CC0 1.0
OpenLID (subset) (Burchell ea 2023)	combines other corpora; Eastern Yiddish	911 lines		depend on source datasets
YDD Community 2017 (Goldhahn ea 2012)	Eastern Yiddish; possibly uncurated	21.8k sentences	? (written)	?
CC-100 (subset) (Wenzek ea 2020)	uncurated, subset of CommonCrawl	51 MB
OSCAR (subset) (Abadji ea 2022)	uncurated, subset of CommonCrawl	14.3M tokens / 171.7 MB	?	Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023)	uncurated, subset of mc4 and OSCAR	141k sentences		see mc4 & OSCAR
MADLAD-400 (subset) (Kudugunta ea 2023)	uncurated, subset of CommonCrawl	1.9M sentences		CC-BY-4.0
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	220k sentences		Apache 2.0 + licenses of source datasets
Wikipedia (yi subset)	uncurated	15k articles		text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Bavarian · bar · bava1246

Corpus	Notes	Size	Representation	License
UD Bavarian MaiBaam (Blaschke ea 2024)	POS (UPOS), dependencies (UD); dialect/location information; overlaps with wiki, xSID	1k sentences	ad-hoc pronunciation spelling	CC BY-SA 4.0
Kontatto (Dal Negro & Ciccolone 2020)	POS (unknown), lemmas (German). South Tyrolean	147k tokens	audio, phono	custom
BarNER (Peng ea 2024)	named entities (based on CoNLL2003); overlaps with wiki	11k sentences	ad-hoc pronunciation spelling	CC BY 4.0
xSID/SID4LR (van der Goot ea 2021; Aepli ea 2023; Winkler ea 2024) (de-st and de-ba subsets)	slot filling, intent detection, translations into 16 languages; South Tyrolean and Central Bavarian	2×800 sentences	ad-hoc pronunciation spelling	CC BY-SA 4.0
DiDi (Frey ea 2015, 2019) (subset)	South Tyrolean	9.6k messages	ad-hoc pronunciation spelling	CLARIN ACA-BY-NC-NORED
Kontatti (Ghilardi 2019) (subset)	South Tyrolean	unknown (6:48 hrs total)	audio, German ortho	custom
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD))		between 21 and 34 min / between 2.7k and 3.2k tokens	audio, German ortho	custom terms
AlpiLinK (Rabanus ea 2023) (tir subset)	South Tyrolean; location information	1908 files (49 sentences, up to 43 speakers)	audio, German ortho	CC BY-NC-SA 4.0
VinKo (tir subset) (Rabanus ea 2023, Kruijt ea 2023)	South Tyrolean; location information	148 sentences + 71 words (up to 195 speakers per entry)	audio, German ortho	CC BY-NC-ND 4.0
Tatoeba (bar subset)	translations into other languages	226 sentences	ad-hoc pronunciation spelling	CC BY 2.0 FR
Wikipedia (bar subset)	uncurated, partially tagged with dialect information	27k articles	ad-hoc pronunciation spelling with some optional conventions	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Mòcheno · mhn · moch1255

Corpus	Notes	Size	Representation	License
AlpiLinK (Rabanus ea 2023) (mhn subset)	location information	42 sentences (1 speaker)	audio, German ortho	CC BY-NC-SA 4.0
VinKo (mhn subset) (Rabanus ea 2023, Kruijt ea 2023)	location information	159 sentences + 30 words (up to 17 speakers per entry)	audio, German ortho	CC BY-NC-ND 4.0

Swabian · swg · swab1242

Corpus	Notes	Size	Representation	License
Tatoeba (swg subset)	translations into other languages	1.9k sentences	ad-hoc pronunciation spelling	CC BY 2.0 FR
Wikipedia (subset of als subset)	uncurated	927 (of 27k) articles tagged as Swabian	no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Central Alemannic (incl. Swiss German & Alsatian) · gsw · swis1247

Corpus	Notes	Size	Representation	License
Annotated Corpus for the Alsatian Dialects (Bernhard ea 2018, 2019)	POS (UPOS, mod. UPOS), lemmas, glosses (French), NEs (locations); Alsatian; overlap with Wikipedia	798 sentences	ad-hoc pronunciation spelling	CC BY-SA 4.0
BISAME GSW (STIH 2020, Millour & Fort 2018)	POS (mod. UPOS); Alsatian	382 sentences	ad-hoc pronunciation spelling	CC BY-NC-SA 3.0 FR
NOAH's corpus (Hollenstein & Aepli 2015)	POS (mod. STTS, partially also STTS and UPOS); overlap with UD Swiss German UZH and Wikipedia	115k tokens	(mostly?) ad-hoc pronunciation spelling	annotations: CC BY 4.0
UD Swiss German UZH (Aepli & Clematide 2018)	POS (UPOS, mod. STTS), dependencies (UD); overlap with NOAH's corpus and Wikipedia	100 sentences	(mostly?) ad-hoc pronunciation spelling	CC BY-SA 4.0
WUS DIALOG GSW (Stark ea 2014-20, Ueberwasser & Stark 2017) (subset)	POS (mod. STTS), locations	34.7k tokens	ad-hoc pronunciation spelling, German ortho	CC BY-NC-ND
SID4LR (Aepli ea 2023) (gsw subset)	slot filling, intent detection, translations into 16 languages; Bernese	800 sentences
SwissDial (Dogan-Schönberger ea 2021)	topics (14 classes), translations (across dialects and into German), locations (Aargau, Bern, Basel, Graubünden, Luzern, St. Gallen, Wallis, Zürich); the Wallis data are presumably in Walser (wae)	2.5-4.6 hrs × 7-8 dialects	audio, pronunciation spelling, German ortho	CC BY-NC 4.0
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD))		10 min / 612 tokens	audio, German ortho	custom terms
SpinningBytes Swiss German Corpus (SB-CH) (annotated subset) (Grubenmann ea 2018)	sentiment; potential overlap with NOAH's corpus	2.8k sentences	pronunciation spelling	CC BY 4.0
anko Schweizerdeutsch (subset of the Picture postcard corpus) (Sugisaki ea 2023)	discourse-related text spans	600 postcards	pronunciation spelling	?
What's up, Switzerland? (subset) (Stark ea 2014-20, Ueberwasser & Stark 2017)	locations	507k messages / 3.6M tokens	pronunciation spelling	CC BY-NC-ND
Swatchgroup Geschäftsbericht (subset) via PaCoCo (Graën ea 2019)		79.6k tokens	pronunciation spelling	CC BY-SA
Schweizerdeutsches Mundartkorpus (CHMK; downloadable subcorpus) (Weibel & Peter 2020)	locations	?		CC BY-SA 4.0
Text+Berg via PaCoCo (subset) (Bubenhofer ea 2015, Graën ea 2019)		156 sentences / 3.1k tokens		CC BY-SA
ArchiMob (Scherrer ea 2019)		70 hrs	audio, transcription based on the Dieth orthography for Swiss German, German ortho	CC BY-NC-SA 4.0
STT4SG-350 (Plüss ea 2023)	locations (7 regions)	343 hrs	audio, German ortho	META-SHARE NonCommercial NoRedistribution
SDS-200 (Plüss ea 2022)		200 hrs	audio, German ortho	META-SHARE NonCommercial NoRedistribution
Swiss Parliaments Corpus (Plüss ea 2021a)		293 hrs	audio, German ortho
All Swiss German Dialects Test Set (Plüss ea 2021b)	locations (cantons, incl. Wallis)	13 hrs / 5.8k utterances	audio, German ortho	MIT
Gemeinderat Zürich Audio Corpus (Plüss ea 2021b)		1208 hrs	audio	MIT
Ein geparstes und grammatisch annotiertes Korpus schweizerdeutscher Spontansprachdaten (Schönenberger & Haeberli 2019) (contact authors)	POS (mod. Penn-historical), phrase structure (Penn-historical); location: Wil (SG)	100k+ tokens	Dieth orthography
UDHR-LID (subset) (Kargaran ea 2023, Unicode)		59 sentences	?	CC0 1.0
Swiss Crawl (Linder ea 2020)	uncurated	500k+ sentences	?	CC BY-NC 4.0
SpinningBytes Swiss German Corpus (SB-CH) (Grubenmann ea 2018)	uncurated; contains NOAH's corpus	116k sentences		CC BY 4.0
SwigSpot (Linder 2018)	uncurated	8k sentences	?	Apache 2.0
Tatoeba (gsw subset)	translations into other languages	474 sentences	?	CC BY 2.0 FR
Swiss German Web Corpus (Goldhahn ea 2012)	uncurated?	100k+ sentences		?
OSCAR (subset) (Abadji ea 2022)	uncurated, subset of CommonCrawl	34k tokens / 233 KB	?	Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023)	uncurated, subset of mc4 and OSCAR	6.9k sentences		see mc4 & OSCAR
MADLAD-400 (subset) (Kudugunta ea 2023)	uncurated, subset of CommonCrawl; the dataset audit notes issues with the Swiss German subcorpus ⚠	1M sentences		CC BY 4.0
Glot500-c (subset) (Imani ea 2023)	partially uncurated, corpus overlap documented in data	441k sentences		Apache 2.0 + licenses of source datasets
Wikipedia (subset of als subset)	uncurated, partially tagged with dialect information	27k articles total (including Swabian and Walser), of which 2.3k are (directly or indirectly) tagged as Alsatian and 1.7k as Swiss German	no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Corpus	Notes	Size	Representation	License
ArchiWals / CLiMAlp (Angster ea 2017, Gaeta 2020)	locations (Gressoney, Issime, Formazza, Rimella, Alagna)	80k+ tokens	pronunciation spelling
Walliserdeutsch/RRO (Garner 2014, Garner ea 2014)		8.3 hrs	audio, non-standardized transcription	custom
SwissDial (subset) (Dogan-Schönberger ea 2021)	topics (14 classes), translations (into German and 7 Swiss German dialects)	3.3 hrs	audio, pronunciation spelling, German ortho	CC BY-NC 4.0
All Swiss German Dialects Test Set (Plüss ea 2021b)	locations (cantons, incl. Wallis)	?	audio, German ortho	MIT
AlpiLinK (Rabanus ea 2023) (wae subset)	location information	122 files (42 sentences, up to 3 speakers)	audio, German ortho	CC BY-NC-SA 4.0
Wikipedia (subset of als subset)	uncurated	35 (of 27k) articles tagged as Wal(li)ser	no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography	text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0