# Evaluation of the table extraction

## Evaluate extractions with the streamlit eval app

To get started, run the streamlit eval app as:

```
streamlit run eval/eval_app.py eval/data/data_step2_before-currency-unit_eval.csv
```

This app allows you to visually compare tables extracted via multiple methodologies and for multiple reports. It needs two input files (only one is mandatory):
- *[Optional]* The REF data file `data_step2_before-currency-unit_eval.csv` is a cleaned-up version of `data_step2_before-currency-unit.csv`. The latter contains reference data extracted and manually cleaned up by the TaxObservatory team, and lets you benchmark the extractions against it.
- *[Mandatory]* At launch, the app will ask you to provide a pickle file with extracted data. Select `eval_20240408_200249.plk` in the `eval/data/` directory to get started easily, without having to generate evaluation data yourself!
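The exact layout of the pickled data is whatever the evaluation script serializes; as a minimal sketch of the idea (the report-to-algorithm-to-table mapping below is a hypothetical layout, not the documented schema), such a file can be written and inspected like this:

```python
import os
import pickle
import tempfile

# Stand-in for a pickle produced by the evaluation pipeline: a mapping from
# report name to the tables each extraction algorithm produced.
# NOTE: this layout and the algorithm names are invented for illustration.
assets = {
    "report_a.pdf": {
        "algo_one": [["country", "profit"], ["FR", "100"]],
        "algo_two": [["country", "profit"], ["FR", "100"]],
    }
}

path = os.path.join(tempfile.mkdtemp(), "eval_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(assets, f)

# Load it back, as the eval app does with the file you select at launch.
with open(path, "rb") as f:
    loaded = pickle.load(f)

for report, extractions in loaded.items():
    for algo, table in extractions.items():
        print(report, algo, "rows:", len(table))
```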

## Generate your own evaluation data

You can instead generate your own pickle file containing extracted data.

### Setup

Install the following package, which is used to generate the PDF output files.

```
apt-get install wkhtmltopdf
```

### Data generation

Run the `eval_table_extraction.py` script. It iterates through several reports and applies the set of table extraction algorithms provided in your yaml configuration. Check out `configs/eval_table_extraction.yaml` for a suitable yaml configuration.
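The authoritative schema is the `configs/eval_table_extraction.yaml` file itself. Purely as a hypothetical illustration of the idea (every key and algorithm name below is invented, not taken from the project), such a configuration pairs extraction algorithms with their parameters:

```yaml
# Hypothetical sketch only -- see configs/eval_table_extraction.yaml
# for the real schema. Keys and algorithm names here are invented.
table_extraction:
  - type: AlgorithmA
    params:
      flavor: stream
  - type: AlgorithmB
    params:
      model: default
```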

You can run the script as:

```
python eval/eval_table_extraction.py configs/eval_table_extraction.yaml ./example_set/inputs/ ./example_set/extractions
```

`data_step2_before-currency-unit_eval.csv` is a cleaned up version of the `data_step2_before-currency-unit.csv` file which contains reference data extracted and manually cleaned up by the TaxObservatory team. | ||
This will apply the pipeline for all the reports in the `./example_set/inputs` directory and save : | ||
|
||
- the extracted tables from all the algorithms, in one output PDF file per input report, in the `./example_set/extractions` directory
- all the extracted assets, in a pickle file `eval_xxxx.pkl` located in the `eval/data/` directory
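With a pickle of extractions and the REF data in hand, benchmarking comes down to comparing each extracted table against its reference cell by cell. A minimal, self-contained sketch of that idea (the metric and the table layout are illustrative assumptions, not the eval app's actual logic):

```python
def cell_accuracy(extracted, reference):
    """Fraction of reference cells reproduced at the same position.

    Tables are represented as lists of rows (lists of strings) --
    an illustrative layout, not the project's actual data structure.
    """
    if not reference:
        return 0.0
    total = sum(len(row) for row in reference)
    hits = 0
    for i, row in enumerate(reference):
        for j, cell in enumerate(row):
            try:
                if extracted[i][j] == cell:
                    hits += 1
            except IndexError:
                # Extracted table is missing this row or column.
                pass
    return hits / total

reference = [["country", "profit"], ["FR", "100"], ["DE", "250"]]
extracted = [["country", "profit"], ["FR", "100"], ["DE", "205"]]

print(cell_accuracy(extracted, reference))  # 5 of 6 cells match
```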