Language Identification using Character Trigrams

Language identification is a task in human language processing (HLP), also known as natural language processing (NLP). This project explores character trigrams as the basis for a model that solves this task. In addition, three smoothing functions have been implemented and tested: Lidstone, linear discounting, and absolute discounting.
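To make the approach concrete, here is a minimal sketch of a character-trigram identifier with Lidstone smoothing. It is not the code from this repository: the function names, the toy sentences, the value of `lam`, and the choice of taking the number of bins from the trigrams seen in training are all assumptions made for illustration. Linear and absolute discounting are not shown.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Return the overlapping character trigrams of a lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(sentences_by_lang):
    """Count trigrams per language; returns {lang: Counter of trigrams}."""
    return {
        lang: Counter(t for s in sentences for t in char_trigrams(s))
        for lang, sentences in sentences_by_lang.items()
    }


def log_prob(text, counts, vocab_size, lam=0.5):
    """Lidstone-smoothed log-probability of the text's trigrams.

    Each trigram gets probability (count + lam) / (N + lam * B), where N is
    the total trigram count for the language and B is vocab_size.
    """
    denom = sum(counts.values()) + lam * vocab_size
    return sum(math.log((counts[t] + lam) / denom) for t in char_trigrams(text))


def identify(text, models, vocab_size, lam=0.5):
    """Return the language whose model gives the text the highest score."""
    return max(models, key=lambda lang: log_prob(text, models[lang], vocab_size, lam))


# Toy data, purely illustrative (the project trains on 30,000 sentences per language).
toy = {
    "eng": ["the cat sat on the mat", "this is the house"],
    "spa": ["el gato está en la casa", "esta es la casa"],
}
models = train(toy)
# Assumption: B is the number of distinct trigrams seen across all languages.
vocab_size = len(set().union(*models.values()))

print(identify("the house", models, vocab_size))
print(identify("la casa", models, vocab_size))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps unseen trigrams from driving a language's score to minus infinity.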

Dataset

The training and test corpora come from the Wortschatz Leipzig Corpora, which contain texts in different languages. Six languages are used here: Spanish, Italian, English, French, Dutch, and German.

The training set consists of 30,000 sentences for each language, while the test set has 10,000 for each language.

Evaluation

The model achieves an accuracy of 99.8932%, which corresponds to only 64 misclassified sentences. The confusion matrix is shown in the following figure:

[Figure: confusion matrix of the model]
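For readers who want to reproduce this kind of evaluation, the following minimal sketch shows how accuracy and a confusion matrix can be computed from gold and predicted labels. The function names and the toy labels are illustrative assumptions and are not taken from the repository.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def confusion_matrix(gold, pred, labels):
    """Rows are true languages, columns are predicted languages."""
    pairs = Counter(zip(gold, pred))
    return [[pairs[(g, p)] for p in labels] for g in labels]


# Toy labels, purely illustrative.
labels = ["eng", "spa", "fra"]
gold = ["eng", "eng", "spa", "spa", "fra", "fra"]
pred = ["eng", "eng", "spa", "fra", "fra", "fra"]

print(accuracy(gold, pred))
for row in confusion_matrix(gold, pred, labels):
    print(row)
```

Under this definition, accuracy equals one minus the number of errors divided by the number of evaluated sentences, so 64 errors at 99.8932% accuracy correspond to roughly 59,900 evaluated sentences.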

Contents

original_langId: Contains the dataset obtained from the Wortschatz Leipzig Corpora.
preprocessed_langId: Contains the preprocessed datasets.
weights: Contains the model parameters (for both validation and test).
train.py: Code for text preprocessing and for creating the JSON files.
main.ipynb: Notebook with the validation and testing of the model; the main part of the project.
Report (Informe Pràctica1 PLH - Pau Hidalgo i Cai Selvas.pdf): Detailed documentation of the decisions taken, their justifications, the results, and the conclusions.
Requirements: Python 3.11+, sklearn, matplotlib, seaborn, nltk (and spacy, if proper names are to be detected).

References

Wortschatz Leipzig Corpora (source of the dataset)
HLP Course of GIA (UPC)

