Future Updates: -> containerization for easy distribution of code.
The inputs to the models are the entities and contextual information obtained from the tables and its surroundings. The embeddings go into a CNN (for extracting features) and then into LSTM/BiLSTM which is then fed to a softmax layer for multi-class classification.
Dataset used: https://doi.org/10.7939/DVN/SHL1SL
First create a new conda environment using the following (make sure you have Anaconda installed.)
conda create --name comemnet python=3.8
conda activate comemnet
Now inside the 'comemnet' conda environment, install the following packages. Follow pip installation guidelines. If using Anaconda package manager, use conda to install packages, but generally pip should work.
Python data science stack.
pip install pandas
pip install numpy
python -m pip install -U matplotlib
pip install seaborn
pip install -U scikit-learn
- Tensorflow >=2.5.0
pip install tensorflow
- sentencepiece
pip install sentencepiece
- TensorflowHub
pip install "tensorflow>=2.0.0"
pip install --upgrade tensorflow-hub
NOTE: If you want to create a demo for the trained model, you need gradio. No need to install if demo is not required. This is optional.
!pip install -q gradio
Before running the code, make sure you have the pretrained model checkpoint files (if using the pretrained model). Download this folder from Google Drive. It is a big folder (~few gigabytes). As the default behavior uses the pretrained model, you will need the checkpoint files.
python cmput656_full_data.py #for CNN-BiLSTM
python cnn_plus_lstm.py #for CNN-LSTM
python bilstm_only.py #for BiLSTM-only
Hyperparameter | Ours | Macdonald and Barbosa (2020) |
---|---|---|
CNN Filters | 8 | None |
LSTM/BiLSTM units | 8 | 1 (only LSTM) |
Batch Size | 16 | 16 |
Optimizer | Adam | RMSProp |
Max Token Length | 80 | 50 |
Learning Rate | 2e-5 | 0.001 |
Dropout (for LSTM / BiLSTM) | 0.2 | None |
Model | Parameters |
---|---|
Macdonald and Barbosa | 4,559 |
CNN + LSTM | 40,581 |
CNN + BiLSTM | 50,405 |
BiLSTM only | 86,877 |
Results shown for 5 seeds.
Model | Accuracy | F1 | #Relations (#Tables) | Epochs |
---|---|---|---|---|
Macdonald and Barbosa 2020 | 92% | 95% | 29 (All) | 50 |
CNN-LSTM | 97.57% | 91.44% | 29 (All) | 40 |
CNN-BiLSTM | 97.80% | 92.46% | 29 (All) | 40 |
BiLSTM (8 units) | 98.19% | 94.35% | 29 (All) | 40 |