Update README.md
guillaume-millot authored May 1, 2024
1 parent b7483c5 commit 60c421d
Showing 1 changed file, eval/README.md, with 21 additions and 19 deletions.
# Evaluation of the table extraction

## Evaluate extractions with the streamlit eval app

To get started, run the streamlit eval app as:

```
streamlit run eval/eval_app.py eval/data/data_step2_before-currency-unit_eval.csv
```

This app allows you to visually compare tables extracted via multiple methodologies, for multiple reports. It takes two input files (only one of which is mandatory):
- *[Optional]* The REF data file `data_step2_before-currency-unit_eval.csv` is a cleaned-up version of `data_step2_before-currency-unit.csv`, the reference data extracted and manually cleaned up by the TaxObservatory team; it lets you benchmark the extractions against that reference.
- *[Mandatory]* At launch, the app asks you to provide a pickle file with extracted data. Select `eval_20240408_200249.plk` in the `eval/data/` directory to get started without having to generate evaluation data yourself; a quick way to peek inside that file is sketched below.
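
If you want to peek inside that pickle before launching the app, a minimal sketch, assuming it is a standard Python pickle (run it from an environment with the project dependencies installed, since unpickling may require them):

```
# Minimal sketch: inspect the sample pickle shipped in eval/data/.
# The structure of the stored object (e.g. per-report extraction results)
# is project-specific and not documented here.
import pickle

with open("eval/data/eval_20240408_200249.plk", "rb") as f:
    assets = pickle.load(f)

print(type(assets))  # inspect the top-level object before exploring further
```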

## Generate your own evaluation data

You can instead generate your own pickle file containing extracted data.

### Setup

Install the following package, which is used to generate the PDF output files.

```
apt-get install wkhtmltopdf
```

### Data generation

Run the `eval_table_extraction.py` script. It iterates through several reports and applies the set of table extraction algorithms you provide in your YAML configuration. Check out `configs/eval_table_extraction.yaml` for a suitable configuration.
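
If you want to check which settings a configuration defines before running the script, a minimal sketch, assuming the file is standard YAML readable with PyYAML (its exact schema is project-specific and not shown here):

```
# Minimal sketch: load the evaluation config and list its top-level entries.
# Assumes standard YAML; the key names inside are project-specific.
import yaml

with open("configs/eval_table_extraction.yaml") as f:
    config = yaml.safe_load(f)

print(list(config))  # top-level entries of the configuration
```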

You can run the script as:

```
python eval/eval_table_extraction.py configs/eval_table_extraction.yaml \
    ./example_set/inputs/ ./example_set/extractions
```

This will apply the pipeline to all the reports in the `./example_set/inputs` directory and save:

- the tables extracted by all the algorithms, as one output PDF file per input report, in the `./example_set/extractions` directory
- all the extracted assets, in a pickle file `eval_xxxx.pkl` located in the `eval/data/` directory (see the sketch below for locating the most recent one)
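
The generated pickle files carry a timestamp in their name. A minimal sketch for picking out the most recent one to feed to the eval app, assuming the `eval_*.pkl` naming shown above:

```
# Minimal sketch: find the most recent generated evaluation pickle.
# Assumes files follow the timestamped eval_*.pkl naming used above.
import glob

latest = max(glob.glob("eval/data/eval_*.pkl"))  # timestamps sort lexicographically
print(latest)  # select this file when the streamlit app asks for extracted data
```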
