Add Spanish (#1)
* Add spanish to tracked

* Fix special characters not showing

* Formats for 5000
df2 English column needs fixing: whitespace

* Format done for 5000

* add format.py

* Update requirements.txt

* Add spanish_3000 and update build config

* Update requirements.txt

* Bump Python version to 3.12.1

* fixup! Update requirements.txt

* Remove xelatex

* Only use Xelatex for English tex

* Use lualatex for spanish

* Replace utf8x with utf8

* Add spanish babel

* Remove helvet

* Revert "Remove helvet"

This reverts commit 4711eaf.

* Update spanish usepackages

* fixup! Update spanish usepackages

* fixup! fixup! Update spanish usepackages

* Add fontspec
sogladev authored Feb 5, 2024
1 parent 45108ec commit 8ec933d
Showing 33 changed files with 10,057 additions and 124 deletions.
115 changes: 106 additions & 9 deletions .github/workflows/main.yml
@@ -6,15 +6,15 @@ on:
branches: [ main ]
workflow_dispatch:
jobs:
build:
build-english:
runs-on: ubuntu-22.04
steps:
- name: Set up Git repository
uses: actions/checkout@v2
- name: Setup Python3
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.12.1
- name: Install pip dependencies
run: |
python -m pip install -r requirements.txt
@@ -30,9 +30,9 @@ jobs:
- name: Create format files with Python
working-directory: ./english
run: |
python format_english.py oxford_3000
python format_english.py oxford_5000
python format_english.py oxford_5000_exclusive
python format.py oxford_3000
python format.py oxford_5000
python format.py oxford_5000_exclusive
- name: Upload html as artifact
uses: actions/upload-artifact@v4
with:
@@ -76,8 +76,74 @@ jobs:
with:
name: english-output-pdf-from-tex
path: english/format/*.pdf
build-spanish:
runs-on: ubuntu-22.04
steps:
- name: Set up Git repository
uses: actions/checkout@v2
- name: Setup Python3
uses: actions/setup-python@v2
with:
python-version: 3.12.1
- name: Install pip dependencies
run: |
python -m pip install -r requirements.txt
- name: Create build and output folder
working-directory: ./spanish
run: |
mkdir -p build output
- name: Upload data as artifact
uses: actions/upload-artifact@v4
with:
name: spanish-data
path: spanish/data
- name: Create format files with Python
working-directory: ./spanish
run: |
python format.py spanish_3000
python format.py spanish_5000
- name: Upload html as artifact
uses: actions/upload-artifact@v4
with:
name: spanish-output-html
path: spanish/output/*.html
- name: Install wkthtmltopdf
run: |
sudo apt-get update && sudo apt-get install -y wkhtmltopdf
- name: Convert html to pdf with wkhtmltopdf
working-directory: ./spanish
run: |
wkhtmltopdf --user-style-sheet format/table.css output/spanish_3000_alphabetical.html output/spanish_3000_alphabetical.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_3000_shuffled.html output/spanish_3000_shuffled.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_3000_underscore_alphabetical.html output/spanish_3000_underscore_alphabetical.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_3000_underscore_by_cefr_alphabetical.html output/spanish_3000_underscore_by_cefr_alphabetical.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_3000_underscore_by_cefr_shuffled.html output/spanish_3000_underscore_by_cefr_shuffled.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_3000_underscore_shuffled.html output/spanish_3000_underscore_shuffled.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_5000_alphabetical.html output/spanish_5000_alphabetical.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_5000_shuffled.html output/spanish_5000_shuffled.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_5000_underscore_alphabetical.html output/spanish_5000_underscore_alphabetical.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_5000_underscore_by_cefr_alphabetical.html output/spanish_5000_underscore_by_cefr_alphabetical.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_5000_underscore_by_cefr_shuffled.html output/spanish_5000_underscore_by_cefr_shuffled.pdf
wkhtmltopdf --user-style-sheet format/table.css output/spanish_5000_underscore_shuffled.html output/spanish_5000_underscore_shuffled.pdf
- name: Upload html-to-pdf as artifact
uses: actions/upload-artifact@v4
with:
name: spanish-output-pdf-from-html
path: spanish/output/*.pdf
- name: Create pdf with Latex
uses: xu-cheng/latex-action@v3
with:
working_directory: spanish/format
latexmk_use_xelatex: true
root_file: |
*.tex
- name: Upload tex-to-pdf as artifact
uses: actions/upload-artifact@v4
with:
name: spanish-output-pdf-from-tex
path: spanish/format/*.pdf
deploy:
needs: build
needs: [build-english, build-spanish]
runs-on: ubuntu-22.04
if: startsWith(github.ref, 'refs/tags/v')
#git tag -a v1.0.0 -m "initial release"
@@ -86,13 +152,23 @@ jobs:
- name: Setup folder structure
run: |
mkdir -p english/data english/output
mkdir -p spanish/data spanish/output
- name: Retrieve data artifact
uses: actions/download-artifact@v4
with:
name: english-data
path: english/data
- name: Retrieve data artifact
uses: actions/download-artifact@v4
with:
name: spanish-data
path: spanish/data
- name: Display structure of data files
run: |
ls -R english/data
- name: Display structure of data files
run: ls -R english/data
run: |
ls -R spanish/data
- name: Retrieve formatted artifacts
uses: actions/download-artifact@v4
with:
@@ -109,13 +185,34 @@ jobs:
uses: montudor/action-zip@v1
with:
args: zip -qq -r english-output.zip . -i english/output/*
- name: Retrieve formatted artifacts
uses: actions/download-artifact@v4
with:
pattern: spanish-output-*
merge-multiple: true
path: spanish/output
- name: Display structure of output files
run: ls -R spanish/output
- name: zip spanish-data
uses: montudor/action-zip@v1
with:
args: zip -qq -r spanish-data.zip . -i spanish/data/*
- name: zip spanish-output
uses: montudor/action-zip@v1
with:
args: zip -qq -r spanish-output.zip . -i spanish/output/*
- name: Release
uses: softprops/action-gh-release@v1
with:
body: |
in this Release
- extracted data/ found in `.pkl`, `.csv`, `.json` format
English:
- extracted data/ in `.pkl`, `.csv`, `.json` format
- formatted/styled output/ in `.pdf` and `.html` format.
Spanish:
- extracted data/ in `.pkl` format
- formatted/style output/ in `.pdf` and `.html` formta
files: |
english-data.zip
english-output.zip
spanish-data.zip
spanish-output.zip
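
The twelve `wkhtmltopdf` calls in the `build-spanish` job follow a regular pattern: two word lists times six output variants, each with the same stylesheet. As a rough sketch (not part of this commit), the same command lines could be generated in Python. The function name `wkhtmltopdf_commands` and its parameters are hypothetical; the file names, the `--user-style-sheet` flag, and the paths are taken from the workflow above.

```python
# Word lists and output variants as they appear in the build-spanish job.
WORD_LISTS = ["spanish_3000", "spanish_5000"]
VARIANTS = [
    "alphabetical",
    "shuffled",
    "underscore_alphabetical",
    "underscore_by_cefr_alphabetical",
    "underscore_by_cefr_shuffled",
    "underscore_shuffled",
]


def wkhtmltopdf_commands(output_dir="output", css="format/table.css"):
    """Build one wkhtmltopdf argument list per (word list, variant) pair.

    Hypothetical helper: it only constructs the commands, it does not run them.
    """
    commands = []
    for word_list in WORD_LISTS:
        for variant in VARIANTS:
            stem = f"{output_dir}/{word_list}_{variant}"
            commands.append(
                ["wkhtmltopdf", "--user-style-sheet", css, f"{stem}.html", f"{stem}.pdf"]
            )
    return commands


if __name__ == "__main__":
    # Print the commands in the same form as the workflow step.
    for cmd in wkhtmltopdf_commands():
        print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` from the `spanish` directory, assuming `wkhtmltopdf` is installed as in the workflow step.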
6 changes: 5 additions & 1 deletion .gitignore
@@ -1,5 +1,4 @@
# wip
spanish/

english/build/
english/upload/
@@ -12,6 +11,11 @@ english/format/*
resources/*
*.zip

spanish/data/*xhtml
spanish/output/*
spanish/format/*
!spanish/format/*.tex
!spanish/format/*.css

# Created by https://www.toptal.com/developers/gitignore/api/latex,visualstudiocode,python,jupyternotebooks
# Edit at https://www.toptal.com/developers/gitignore?templates=latex,visualstudiocode,python,jupyternotebooks
2 changes: 1 addition & 1 deletion README.md
@@ -36,7 +36,7 @@ See release or see hosted seperately https://www.mediafire.com/folder/ik6n07bume
[by_cefr_two_column_by_cefr_shuffle_pdf_sample](english/img/oxford_5000_exclusive_two_column_by_cefr_shuffle_sample.pdf)


## Folder structure for each language
## Folder structure
```
├── audio
│   └── *.mp3
67 changes: 11 additions & 56 deletions english/README.md
@@ -1,4 +1,9 @@
# English vocabulary + pronunciation + definition
# Spanish vocabulary + definition + examples translation

```
latexmk -pdfxe -cd format/spanish_5000_two_column_alphabetical_by_rank_with_example.tex -outdir=../output
```


This project aims to provide easy-to-read and printable vocabulary list of the
most common words of the English language with their meaning.
@@ -7,10 +12,10 @@ The lists are mostly based on data gathered from the oxford 3000, 5000 and 5000

The word lists contain the following points of data
* Spelling (text)
* Pronunciation (audio)
* Lexical spelling (text)
* Meaning (text)
* Example (text)
* Example translation (text)

This project contains scripts to extract data and formatting.

@@ -19,15 +24,12 @@ see `scraping` below
see `formatting` below

## Data
Extracted data is hosted seperately on mediafire and can be
found in formats `.pkl`, `.csv`, `.json`
Audio consists of around 10,000 *.mp3 files totalling 200MB

Formatted lists in `/output` are formatted alphabetically, by CEFR rating, random and viewable in
`.pdf` and `.html` format.

## Sample outputs

To be updated

1. grouped by CEFR alphabetical order
![by_cefr_img_sample](./img/oxford_5000_exclusive_by_cefr_sample.jpg)
[by_cefr_pdf_sample](./img/oxford_5000_exclusive_by_cefr_sample.pdf)
@@ -38,55 +40,17 @@ Formatted lists in `/output` are formatted alphabetically, by CEFR rating, rando

## Folder structure
```
├── audio
│   ├── *_uk.mp3
│   ├── *_us.mp3
│   ├── ...
├── data
│   ├── df_concat.pkl
│   ├── df_definition.pkl
│   ├── df.pkl
│   ├── oxford_3000.csv
│   ├── oxford_3000.json
│   ├── oxford_3000.pkl
│   ├── oxford_5000.csv
│   ├── oxford_5000_exclusive.csv
│   ├── oxford_5000_exclusive.json
│   ├── oxford_5000_exclusive.pkl
│   ├── oxford_5000.json
│   └── oxford_5000.pkl
├── output
│   ├── oxford_3000_alphabetical.html
│   ├── oxford_3000_alphabetical.pdf
│   ├── oxford_3000_by_cefr.html
│   ├── oxford_3000_by_cefr.pdf
│   ├── oxford_3000_two_column_alphabetical.pdf
│   ├── oxford_3000_two_column_by_cefr.pdf
│   ├── oxford_5000_alphabetical.html
│   ├── oxford_5000_alphabetical.pdf
│   ├── oxford_5000_by_cefr.html
│   ├── oxford_5000_by_cefr.pdf
│   ├── oxford_5000_exclusive_alphabetical.html
│   ├── oxford_5000_exclusive_alphabetical.pdf
│   ├── oxford_5000_exclusive_by_cefr.html
│   ├── oxford_5000_exclusive_by_cefr.pdf
│   ├── oxford_5000_exclusive_two_column_alphabetical.pdf
│   ├── oxford_5000_exclusive_two_column_by_cefr.pdf
│   ├── oxford_5000_two_column_alphabetical.pdf
│   └── oxford_5000_two_column_by_cefr.pdf
│   └── *pdf / *html
├── format.ipynb
└── scrape.ipynb
```
## Scraping
Selenium, beautifulsoup4, requests, pandas
and geckodriver
beautifulsoup4, requests, pandas

https://github.com/mozilla/geckodriver/releases
```
$ tar -xf geckodriver-v0.30.0-linux64.tar.gz
$ chmod +x geckodriver
$ mv geckodriver /usr/local/bin

```
See `scrape.ipynb`
@@ -114,12 +78,3 @@ flowchart LR
See `format.ipynb`

## Resources and credit
Oxford 5000 list, online interface to lookup words, filter by CEFR level,
listen pronunciation (US,UK)
also shows meaning but only after clicking to a new page.
https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000

dictionary by tusharlock10
https://github.com/tusharlock10/Dictionary
with relevant stackoverflow thread
https://stackoverflow.com/questions/41768215/english-json-dictionary-with-word-word-type-and-definition
2 changes: 1 addition & 1 deletion english/format_english.py → english/format.py
@@ -31,7 +31,7 @@ def replace_word_in_example_with_underscore(word, example):
example_split = example.split(' ')
def _replace(e):
if word not in e:
return e
return e
if not re.match(f"^{word}.*?$", e):
return e
return e.replace(word, '_')
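
Only part of `replace_word_in_example_with_underscore` is visible in this hunk. The following is a minimal, self-contained sketch of what the visible lines appear to do: blank out every space-separated token that starts with the target word. Two details are assumptions rather than something the diff confirms: the tokens are re-joined with single spaces, and `re.escape` is added so that words containing regex metacharacters are matched literally.

```python
import re


def replace_word_in_example_with_underscore(word, example):
    """Replace `word` with '_' in every token of `example` that starts with it."""

    def _replace(token):
        # Cheap containment check before running the regex.
        if word not in token:
            return token
        # Only tokens that *start* with the word are blanked out.
        # re.escape is an addition; the diff interpolates `word` directly.
        if not re.match(f"^{re.escape(word)}.*?$", token):
            return token
        return token.replace(word, "_")

    # Assumption: the hidden remainder of the function re-joins tokens with spaces.
    return " ".join(_replace(token) for token in example.split(" "))


if __name__ == "__main__":
    print(replace_word_in_example_with_underscore("run", "I run and he runs overrun"))
```

With this sketch, inflected forms such as `runs` keep their suffix (`_s`), while tokens that merely contain the word, such as `overrun`, are left untouched.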
3 changes: 2 additions & 1 deletion english/format/oxford_3000_two_column_alphabetical.tex
@@ -2,7 +2,8 @@
\usepackage[a4paper,left=1cm,right=1cm,top=1cm,bottom=1cm]{geometry}
\usepackage{blindtext}

\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{supertabular}
\usepackage{array}

3 changes: 2 additions & 1 deletion english/format/oxford_3000_two_column_by_cefr.tex
@@ -2,7 +2,8 @@
\usepackage[a4paper,left=1cm,right=1cm,top=1cm,bottom=1cm]{geometry}
\usepackage{blindtext}

\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{supertabular}
\usepackage{array}

3 changes: 2 additions & 1 deletion english/format/oxford_3000_two_column_by_cefr_shuffle.tex
@@ -2,7 +2,8 @@
\usepackage[a4paper,left=1cm,right=1cm,top=1cm,bottom=1cm]{geometry}
\usepackage{blindtext}

\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{supertabular}
\usepackage{array}

Original file line number Diff line number Diff line change
@@ -2,7 +2,8 @@
\usepackage[a4paper,left=1cm,right=1cm,top=1cm,bottom=1cm]{geometry}
\usepackage{blindtext}

\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{supertabular}
\usepackage{array}

3 changes: 2 additions & 1 deletion english/format/oxford_5000_exclusive_two_column_by_cefr.tex
@@ -2,7 +2,8 @@
\usepackage[a4paper,left=1cm,right=1cm,top=1cm,bottom=1cm]{geometry}
\usepackage{blindtext}

\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{supertabular}
\usepackage{array}

Original file line number Diff line number Diff line change
@@ -2,7 +2,8 @@
\usepackage[a4paper,left=1cm,right=1cm,top=1cm,bottom=1cm]{geometry}
\usepackage{blindtext}

\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{supertabular}
\usepackage{array}

3 changes: 2 additions & 1 deletion english/format/oxford_5000_two_column_alphabetical.tex
@@ -2,7 +2,8 @@
\usepackage[a4paper,left=1cm,right=1cm,top=1cm,bottom=1cm]{geometry}
\usepackage{blindtext}

\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{supertabular}
\usepackage{array}

3 changes: 2 additions & 1 deletion english/format/oxford_5000_two_column_by_cefr.tex
@@ -2,7 +2,8 @@
\usepackage[a4paper,left=1cm,right=1cm,top=1cm,bottom=1cm]{geometry}
\usepackage{blindtext}

\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{supertabular}
\usepackage{array}
