A framework for neural machine translation. The name refers to the framework's first task: English-to-French translation.
- Sutskever et al.'s "Sequence to Sequence Learning with Neural Networks"
- Europarl Parallel Corpora: Proceedings of the European Parliament from 1996 to 2011
The system is built with PyTorch and AllenNLP, which are the main dependencies.
- Python 3.6 (3.6.5+ recommended)
It is recommended to first create a virtual environment before installing dependencies, using either conda:

```
conda create --name le-traducteur python=3.6
```

or venv:

```
python3 -m venv /path/to/new/virtual/environment
```
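Remember to activate the environment before installing anything into it. The standard activation commands are shown below; adjust the venv path to wherever you created the environment:

```
conda activate le-traducteur
# or, for venv:
source /path/to/new/virtual/environment/bin/activate
```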
Install PyTorch and AllenNLP via `pip install -r requirements.txt`.
The current version of AllenNLP doesn't support restricting vocabulary by namespace. To enable this and run the provided experiments, you'll have to install AllenNLP from source. Once version 0.5.2 is released, this will no longer be necessary.
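One standard way to install from source is the usual clone-and-editable-install pattern (this is generic git/pip usage, not a repository-specific recipe):

```
git clone https://github.com/allenai/allennlp.git
cd allennlp
pip install --editable .
```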
Several of the tokenizers rely on NLTK and on spaCy's pre-trained models for tokenizing English, French, and Spanish. You do not need to download these models explicitly: if a tokenizer in the config specifies a spaCy model that does not yet exist on your machine, it will be downloaded automatically.
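If you would rather fetch the models up front, the commands below do so. The spaCy shortcuts are the 2.x-era names and `punkt` is NLTK's standard tokenizer data; substitute whichever models your config actually names:

```
python -m spacy download en
python -m spacy download fr
python -m spacy download es
python -c "import nltk; nltk.download('punkt')"
```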
To verify the provided dataset readers are working, go to the root directory of this repository and run `pytest`.
`generate_parallel_corpus.py` is a provided tool for creating a combined parallel corpus for any language pair. It is recommended to refer to languages via their ISO codes when using this script and the framework in general; an example invocation follows the argument list below.

This script can be used for any pair of monolingual transcriptions. It assumes only that the files it is given have the same number of lines, where each line in one file is a translation of the corresponding line in the other.
Arguments to this script are:
- src language: The ISO code of the source language to translate from
- dst language: The ISO code of the destination language to translate to
- src path: The path to the source language utterances
- dst path: The path to the destination language utterances
- save dir: The directory in which to save the new corpus
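For example, building a French-to-English Europarl corpus might look like the following. The argument order and file names here are illustrative assumptions, so consult the script itself for the exact interface:

```
python generate_parallel_corpus.py fr en \
    data/europarl-v7.fr-en.fr \
    data/europarl-v7.fr-en.en \
    data/
```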
Output: A jsonl file containing a single JSON object per line, of the form

```
{
    "id": <line number of the utterance pair in the input files>,
    <src language>: <the src language utterance>,
    <dst language>: <the dst language utterance>
}
```
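The resulting corpus is easy to inspect with a few lines of Python. The field names `en` and `fr` below assume English and French ISO codes were used, and the file name is illustrative:

```python
import json

# Print the first three utterance pairs from a generated corpus.
with open("europarl_en_fr.jsonl", encoding="utf-8") as corpus:
    for line_number, line in enumerate(corpus):
        if line_number == 3:
            break
        pair = json.loads(line)
        print(pair["id"], pair["en"], pair["fr"], sep="\t")
```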
An example dataset reader for the Europarl French-to-English dataset is provided in `europarl_french_english.py`. `smoke_europarl_en_fr.jsonl` is a subset of the full English-French parallel corpus, produced by passing the Europarl transcriptions to `generate_parallel_corpus.py`.
Example parallel corpora and configurations are provided in `experiments/` and `tests/fixtures/`.
Experiments are run with:

```
allennlp train <path to the current experiment's JSON configuration> \
    -s <directory for serialization> \
    --include-package library
```
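For orientation, an AllenNLP experiment configuration has roughly the shape sketched below. The `type` values and paths are placeholders, not the registered names used in this repository; see the configurations under `experiments/` for the real ones:

```
{
    "dataset_reader": {"type": "<registered dataset reader>"},
    "train_data_path": "<path to training jsonl>",
    "validation_data_path": "<path to validation jsonl>",
    "model": {"type": "<registered model>"},
    "iterator": {"type": "basic", "batch_size": 32},
    "trainer": {"num_epochs": 10, "optimizer": "adam"}
}
```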
A recommended workflow for extending beyond the provided models and supported language pairs is provided in `conventions.md`.
- Tam Dang
This project is licensed under the Apache License - see the LICENSE.md file for details.