//For any questions or inquiries, please email [email protected]//
This program returns the probability that a PDF document is copyright protected. It processes structural features as well as pixel-based features through an ensemble of classifiers to generate this probability.
1. Create a virtual environment with Python 3.5 using this command:
virtualenv -p [PATH TO PYTHON 3.5 EXECUTABLE] [PATH FOR VIRTUAL ENVIRONMENT]
For example: virtualenv -p /usr/bin/python3 deepdocvirt
2. Clone the deepdocclass application from the repository.
git clone ...
3. Activate the virtual environment with the command:
source deepdocvirt/bin/activate
4. Go into the project folder and install the libraries specified in the requirements.txt file with the command:
pip install -r requirements.txt
5. Before running, you need to install the NLTK stopwords by opening a Python shell and running the commands:
import nltk
nltk.download('stopwords')
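If you prefer not to open an interactive shell, the same download can be triggered in one line (assuming the virtual environment from step 3 is active):
python -c "import nltk; nltk.download('stopwords')"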
6. Make sure to place the PDF files you want to process in a path where you have write permissions.
This application processes any PDF file. It uses different types of features to classify the documents. Some of those features are based on metadata information that is not available in the file itself, so it has to be provided. The metadata features used are the name of the folder the file was located in (or uploaded to) on the server and the name of the file on the server. To provide this data you need a CSV file with the ID of the document, the folder name and the file name. You can also provide the number of participants and the course name for each file if you want extra statistics in your report. A sketch for creating such a file is shown below the header list. The headers of the CSV file should be:
'document_id', 'file_name', 'folder_name', 'number_participants', 'course_name'
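If you need to assemble this metadata file yourself, the following minimal sketch shows one way to write a CSV with the required headers using Python's csv module. The document IDs, file names, folder names and course data below are purely hypothetical placeholders:

import csv

fieldnames = ['document_id', 'file_name', 'folder_name', 'number_participants', 'course_name']

# Hypothetical example rows; replace them with your own export data.
rows = [
    {'document_id': '1001', 'file_name': 'lecture_notes.pdf', 'folder_name': 'week_01',
     'number_participants': '120', 'course_name': 'Introduction to Statistics'},
    {'document_id': '1002', 'file_name': 'reader_chapter3.pdf', 'folder_name': 'readings',
     'number_participants': '120', 'course_name': 'Introduction to Statistics'},
]

# Write the metadata file with exactly the headers expected by the script.
with open('metadata.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)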
When you run the script, the results will be saved in the project folder under results/ in CSV and JSON format. If you want to generate a report, simply use the -report parameter and it will also be saved in the results directory. If you want to choose random files for manual inspection, use the -manual parameter.
To run the script for prediction, use one of the following commands; a combined example is shown after the list.
For classification with basic bow and numeric features only:
python classify_pdf.py -fp [DOCUMENT TO CLASSIFY OR PATH OF DOCUMENTS TO CLASSIFY]
For classification including basic features and metadata features:
python classify_pdf.py -fp [DOCUMENT TO CLASSIFY OR PATH OF DOCUMENTS TO CLASSIFY] -meta [PATH TO METADATA CSV FILE]
For classification including deep features:
python classify_pdf.py -fp [DOCUMENT TO CLASSIFY OR PATH OF DOCUMENTS TO CLASSIFY] -deep
For classification including both deep features and metadata features:
python classify_pdf.py -fp [DOCUMENT TO CLASSIFY OR PATH OF DOCUMENTS TO CLASSIFY] -meta [PATH TO METADATA CSV FILE] -deep
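For example, assuming the PDFs are in a folder called documents/ and the metadata export is named metadata.csv (both names are placeholders), a full run with deep features plus a report and a manual-inspection sample could look like this:
python classify_pdf.py -fp documents/ -meta metadata.csv -deep -report -manual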
A training section will be added shortly...
For the KMK Test the classify.run script should be used. This script processes the files in batches of the quantity specified with the -b parameter and generates a merged report file with all the statistics combined, as well as a report file for a sample of 100 files. Additionally, result files for the sample and for all the remaining files will be created, timestamped with when they were generated. In case of a crash or interruption, the script will resume where it left off. NOTE: If you want to start over from scratch, delete all the files in the results directory and the file called processed_files.csv inside the data/ directory.
To use the bash script for automating the processing of files, run:
./classify.run -fp [PATH OF DOCUMENTS TO CLASSIFY] -rp [PATH FOR RESULTS] -b [BATCH QUANTITY] -meta [METADATA FILE]
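For example, with hypothetical paths and a batch size of 500:
./classify.run -fp /data/uploads -rp results/ -b 500 -meta metadata.csv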
The following results will be generated:
- A report file for a random sample of 100 files. Please send this sample report back to us as soon as it is created so we can check if the results look realistic or if something went wrong.
- A results file for all of the analyzed files. It is created incrementally over the course of approx. 2-3 days (about 6 seconds per document).
- A folder inside the results directory with documents for the manual evaluation. You can also use the script located at help_scripts/copy_files_manual_check.run for this. To use it just run:
./copy_files_manual_check.run -fp [PATH OF DOCUMENTS TO CLASSIFY] -rp [PATH FOR RESULTS]
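For example, with the same hypothetical paths as above:
./copy_files_manual_check.run -fp /data/uploads -rp results/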
We have heard back that, when using the Docker container, there are some things to take into account and possibly adjust manually:
- The metadata export file has to be explicitly named metadata.csv and be moved to the .../files/ directory.
- If the PDF files are in subfolders, they might all have to be moved into the highest directory, like this:
mv */*.pdf .
- The file IDs might have to be amended with their .pdf endings; this is possible with the following command:
for i in uploads/*/*; do mv "$i" "$i.pdf"; done
For studIP exports:
- Please delete the 'licence' column in the metadata.csv before executing the script.
There have been some reports that big files may bias the report in unexpected ways. We have now included a script that can generate a custom report excluding files bigger than a specified number of pages (default: 250) without having to run the whole classification again. The script is in the help_scripts folder; to run it, use:
python create_custom_report.py [PATH TO PDF FILES] [PATH TO RESULT FILES] [PATH TO METADATA FILE] [LIMIT OF PAGES]
The LIMIT OF PAGES argument is optional; you can pass any number of pages as the upper page limit for the files included in the generated report.
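For example, with hypothetical paths and the default limit of 250 pages:
python create_custom_report.py /data/uploads results/ metadata.csv 250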
Please let us know of any other errors that have occurred when using the software so far! We will make sure to fix these problems in the software itself instead of providing hotfixes in the future.
usage: classify_pdf.py [-h] [-fp FP] [-fl FL [FL ...]] [-meta [META]] [-c C]
[-conf CONF] [-pf [PF]] [-po [PO]] [-train [TRAIN]]
[-deep [DEEP]] [-overwrite] [-report] [-manual]
[-sample [SAMPLE]] [-load_classified [LOAD_CLASSIFIED]]
[-rp [RP]] [-b B] [-t T] [-mod [MOD]]
Copyright document classification software.
required arguments:
-fp FP path to pdf file(s) or list of pdf files. If you use
saved features data this is not required, otherwise it
is required.
optional arguments:
-fl FL [FL ...] list of pdf files to process.
-meta [META] specifies metadata file and whether to use metadata
for classification.
-c C specifies amount of cores for parallel processing.
-conf CONF specifies configuration file.
-pf [PF] specifies the name for the file to load the features
data. If -fp is also used then this flag and the data
to load will be ignored.
-po [PO] specifies that the user wants to only preprocess
data. The preprocessed data will be saved.
-train [TRAIN] specifies if the user wants to train the
classification system and load the label data. You can
pass the labels file.
-deep [DEEP] specifies the path to the unlabeled image data needed
for the training procedure. If specified without a
path, then it is used during classification to use the
trained deep models. WARNING: While in training mode
this can take a huge amount of time and space.
-overwrite will overwrite all saved data, if any. If not
specified, the program will try to concatenate the
data to existing files.
-report Generate a report with the results and other helpful
statistics.
-manual Provides a random sample of positively classified
documents for manual evaluation.
-sample [SAMPLE] Process just a random sample of the documents. Default
value is 100.
-load_classified [LOAD_CLASSIFIED]
Checks which files are already classified by checking
the file ../data/processed_files.csv and takes them
out of the list of files to process.
-rp [RP] Specifies the path to store the results, report and
sample for manual evaluation.
-b B ONLY USED IF NOT ON TRAINING MODE. Specifies amount of
files per batch.
-t T ONLY USED IF NOT ON TRAINING MODE. Specifies the value
for the threshold for the classification decision. The
results will be shown in the results file.
-mod [MOD] ONLY USED IF NOT ON TRAINING MODE. Specifies path to
trained models. If not specified the default path
(models/) will be used.