This directory contains the implementation of the neural method described in the paper. The method computes text similarity, for applications such as similar-question retrieval in community-based QA forums.
The data used in this work is taken from the AskUbuntu 2014 dump. The processed data can be downloaded at this repo.
To run the code, you need the following extra packages installed (an install sketch follows the list):
- PrettyTable (only for this project)
- Scikit-Learn (only for this project)
- Numpy and Theano (required in general for this repository)
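If any of these are missing, here is a minimal install sketch, assuming the standard PyPI package names:

```
# PyPI package names assumed; pin versions compatible with your
# Python/Theano setup if needed
pip install prettytable scikit-learn numpy theano
```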
Once the packages are installed, set up as follows (a consolidated sketch appears after the list):
- Clone the rcnn repo
- Run `export PYTHONPATH=/path/to/rcnn/code` to add the rcnn/code directory to the Python path
- Run `python main.py --help` to see all running options
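Put together, the setup looks roughly like this (the clone URL is a placeholder, not the repo's actual address):

```
# Clone the repo (placeholder URL, substitute the real rcnn repo location)
git clone https://github.com/<user>/rcnn.git

# Make the rcnn/code directory importable
export PYTHONPATH=$(pwd)/rcnn/code

# From this directory, list all running options
python main.py --help
```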
To specify Theano configs, run the code via `THEANO_FLAGS='...' python main.py ...`.
For instance, here is an example that runs the model with default parameters:

```
# THEANO_FLAGS: use GPU and 32-bit floats
# --corpus:     path to the corpus file
# --embeddings: path to the word vectors to load
# --dropout:    dropout probability
# -d:           hidden dimension
# --save_model: save the trained model to this file
THEANO_FLAGS='device=gpu,floatX=float32' \
python main.py --corpus path/to/corpus \
               --embeddings path/to/vectors \
               --train path/to/train \
               --dev path/to/dev \
               --test path/to/test \
               --dropout 0.1 \
               -d 400 \
               --save_model model.pkl.gz
```
The corpus, training/development/test files and the word vectors are available at the data repo.
The above example trains a model from scratch.
To fine-tune a model that was pre-trained on unlabeled text (see the code/pt directory for more information), use the --load_pretrain option:
```
THEANO_FLAGS='device=gpu,floatX=float32' \
python main.py --corpus path/to/corpus \
               --embeddings path/to/vectors \
               --train path/to/train \
               --dev path/to/dev \
               --test path/to/test \
               --dropout 0.1 \
               -d 400 \
               --save_model model.pkl.gz \
               --load_pretrain path/to/pretrained/model
```
You can train the model with different settings by specifying the following options (a combined example is shown after the list):
- Layer type (--layer): rcnn, lstm, gru
- Activation function (--act): relu, tanh, etc.
- Average pooling (--average): 0 or 1 (whether to use mean pooling or just take the last state)
- Number of layers (--depth)
- Dropout (--dropout), L2 regularization (--l2_reg) and hidden dimension (-d)
- Learning method (--learning): adam, adagrad, adadelta, etc.
- Learning rate (--learning_rate): e.g. 0.001, 0.01
- Feature filter width (--order): 2, 3, etc.
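For instance, here is a sketch of a run using a two-layer LSTM encoder with mean pooling; the flag values are illustrative, not tuned settings:

```
# Illustrative values only; tune dropout, l2_reg, etc. for your data
THEANO_FLAGS='device=gpu,floatX=float32' \
python main.py --corpus path/to/corpus \
               --embeddings path/to/vectors \
               --train path/to/train \
               --dev path/to/dev \
               --test path/to/test \
               --layer lstm \
               --depth 2 \
               --average 1 \
               --act tanh \
               --dropout 0.2 \
               --l2_reg 1e-5 \
               -d 400 \
               --learning adam \
               --learning_rate 0.001 \
               --save_model model.lstm.pkl.gz
```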