Update poetry & move files to eval/data folder
Guillaume Millot committed May 1, 2024
1 parent cba3c42 commit 3f2ff17
Showing 7 changed files with 8,893 additions and 53 deletions.
4 changes: 1 addition & 3 deletions .gitignore
@@ -166,6 +166,4 @@ cython_debug/

# Pickle files
*.pkl
-
-# Ref data file
-data_step2_before-currency-units.csv
+!eval/data/eval_20240408_200249.pkl
61 changes: 13 additions & 48 deletions eval/README.md
@@ -2,76 +2,41 @@

## Setup

-To run the evaluation script, we need some additional requirements that are not
-listed in the project dependencies.
+To run the evaluation scripts, we need some additional requirements that are not listed in the project dependencies.

```
apt-get install wkhtmltopdf
python3 -m pip install pdfkit streamlit_option_menu streamlit-pdf-viewer
```

-## Qualitative evaluation
+## Generate evaluation data

-The evaluation is performed with the `eval_table_extraction.py` script. This
-script will iterate through several reports and apply the set of table
-extraction algorithms you gave in your yaml configuration.
+First, you need to generate evaluation data with the `eval_table_extraction.py` script. This script will iterate through several reports and apply the set of table extraction algorithms you provided in your yaml configuration.

-As an example, you might consider selecting the pages in the report from their
-filename and then apply several table extraction algorithms :
+As an example, you might select the pages in the report from their filename and then apply several table extraction algorithms. Check out `configs/eval_table_extraction.yaml` for a suitable evaluation configuration.

-A suitable `config.yaml` script would be :
-
-```
-pagefilter:
-  type: FromFilename
-table_extraction:
-  - type: Unstructured
-    params:
-      hi_res_model_name: "yolox"
-  - type: Unstructured
-    params:
-      hi_res_model_name: "yolox"
-      pdf_image_dpi: 300
-  - type: Unstructured
-    params:
-      hi_res_model_name: "yolox"
-      pdf_image_dpi: 500
-  - type: UnstructuredAPI
-    params:
-      hi_res_model_name: "yolox"
-  - type: LLamaParse
-```

-You can then call the evaluation script as :
+You can then call the script as:

```
python eval/eval_table_extraction.py configs/eval_table_extraction.yaml \
    ./example_set/inputs/ ./example_set/extractions
```

-This will apply the pipeline for all the reports in the `./example_set/inputs`
-directory and save :
+This will apply the pipeline for all the reports in the `./example_set/inputs` directory and save:

-- the extracted tables with all the algorithms in one file per report in the
+- the extracted tables with all the algorithms in one output PDF file per input report in the
  `./example_set/extractions` directory
-- all the extracted assets in a pickle file in the current directory `eval_xxxx.pkl`
+- all the extracted assets in a pickle file `eval_xxxx.pkl` located in the `eval/data/` directory

-## Comparison with a streamlit app
+## Evaluation with a streamlit app

-To facilitate the qualitative comparison of the extractions, you can use the
-streamlit app `eval/eval_app.py`.
+To facilitate the evaluation of the extractions, you can run the streamlit app `eval/eval_app.py`.

To run the application, it is as simple as:

```
-streamlit run eval/eval_app.py
+streamlit run eval/eval_app.py eval/data/data_step2_before-currency-unit_eval.csv
```

-If you have access to the `data_step2_before-currency-unit.csv` extraction of
-the tax observatory, you can give its path to the command line :
-
-```
-streamlit run eval/eval_app.py ./path/to/data_step2_before-currency-unit.csv
-```
+`data_step2_before-currency-unit_eval.csv` is a cleaned-up version of the `data_step2_before-currency-unit.csv` file, which contains reference data extracted and manually curated by the team.
+
+At launch, you will be requested to provide a pickle file with extracted data. You might select `eval_20240408_200249.pkl` in the `eval/data/` directory. It contains extracted tables for multiple reports and extraction methods and is a great way to get started.
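If you want to poke at the pickled assets outside the streamlit app, a minimal sketch follows; it assumes only that the file is a standard pickle, since the README does not document its internal schema.

```python
# Minimal sketch: load the evaluation pickle and inspect its layout.
# The internal schema is an assumption left unverified here; the README
# only says the file holds "all the extracted assets".
import pickle

with open("eval/data/eval_20240408_200249.pkl", "rb") as f:
    assets = pickle.load(f)

# Print the top-level structure before relying on any particular schema.
print(type(assets))
if isinstance(assets, dict):
    print(list(assets.keys())[:10])
```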
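Likewise, for a first look at the reference CSV outside the app, a short pandas sketch; the default comma delimiter is an assumption, and the column layout is not documented here.

```python
# Minimal sketch: load the reference data for inspection.
# The delimiter and column layout are assumptions; check the output
# before building anything on top of this file.
import pandas as pd

ref = pd.read_csv("eval/data/data_step2_before-currency-unit_eval.csv")
print(ref.shape)
print(ref.columns.tolist())
print(ref.head())
```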