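DensePhrases is a phrase-level English text retrieval model developed jointly by Korea University and Princeton University. It targets the open-domain question answering and reading comprehension tasks in NLP. The paper was accepted at ACL 2021. You can learn more about the model from its GitHub repository, or try it first-hand through the demo trained on the Wikipedia dump of 2018-12-20.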
This project requires a GPU. If you do not have one locally, you can run the UCAS-DensePhrase.ipynb notebook from this project on Google Colab (recommended).
***** 2021.09.30 *****
***** by lilingwei([email protected])*****
First, install conda.
(Official installer: https://www.anaconda.com )
(Blog walkthrough: https://www.jianshu.com/p/edaa744ea47d )
Create a conda environment for the DensePhrases project and install the required packages.
# Install torch with conda (remember to check your CUDA version)
conda create -n densephrases python=3.7
conda activate densephrases
conda install pytorch=1.7.1
conda install cudatoolkit=11.0 -c pytorch
# Clone the DensePhrases fork used for the UCAS course
git clone https://github.com/blackli7/DensePhrases.git
# Install apex
git clone https://www.github.com/nvidia/apex.git
cd apex
python setup.py install
cd ..
# Install other toolkits
cd DensePhrases
pip install -r requirements.txt
python setup.py develop
Set the required environment variables.
# Running config.sh will set the following three environment variables:
# DATA_DIR: for datasets
# SAVE_DIR: for pre-trained models or index; new models and index will also be saved here
# CACHE_DIR: for cache files from huggingface transformers
source config.sh
Verify the setup.
# Check downloads
pip list
# If the installation succeeded, the list includes these packages:
apex faiss-gpu torch transformers ...
# Check config
echo $SAVE_DIR
# If config.sh was sourced correctly, this prints the output directory:
.//outputs
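The same checks can also be scripted. The sketch below is only an illustration, not part of the project: it assumes the import names `apex`, `faiss`, `torch`, and `transformers` for the packages listed above, and it can only see `DATA_DIR`, `SAVE_DIR`, and `CACHE_DIR` if config.sh exports them into the environment of the Python process.

```python
import importlib.util
import os

# Assumed import names for the packages listed above
# (the pip package faiss-gpu is imported as "faiss").
REQUIRED_MODULES = ["apex", "faiss", "torch", "transformers"]
# The three variables that config.sh is described as setting.
REQUIRED_ENV_VARS = ["DATA_DIR", "SAVE_DIR", "CACHE_DIR"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_env_vars(names, environ=None):
    """Return the variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    print("missing modules:", missing_modules(REQUIRED_MODULES) or "none")
    print("missing env vars:", missing_env_vars(REQUIRED_ENV_VARS) or "none")
```

Run it inside the activated densephrases environment, in the same shell where `source config.sh` was executed.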
Using the pre-trained model densephrases-multi that ships with the project, build a model demo on a small physics corpus (data/wiki_physics.json, taken from English Wikipedia).
The training data used by DensePhrases must follow the JSON format below (see sample/articles.json for details):
{
"data": [
{
"title": "America's Got Talent (season 4)",
"paragraphs": [
{
"context": " The fourth season of \"America's Got Talent\", ... Country singer Kevin Skinner was named the winner on September 16, 2009 ..."
},
{
"context": " Season four was Hasselhoff's final season as a judge. This season started broadcasting live on August 4, 2009. ..."
},
...
]
},
]
}
For examples of converting data into this format, see data_process.
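As a rough illustration of the layout above, the following sketch builds and checks a document of that shape from plain Python data. The helper names `build_articles` and `validate_articles` are hypothetical and are not taken from the project's data_process code; only the keys `data`, `title`, `paragraphs`, and `context` come from the format shown above.

```python
import json


def build_articles(docs):
    """Convert {title: [paragraph text, ...]} into the {"data": [...]} layout."""
    data = []
    for title, paragraphs in docs.items():
        data.append({
            "title": title,
            "paragraphs": [{"context": text} for text in paragraphs],
        })
    return {"data": data}


def validate_articles(obj):
    """Raise ValueError if obj does not match the layout shown above."""
    if not isinstance(obj, dict) or not isinstance(obj.get("data"), list):
        raise ValueError('top level must be an object with a "data" list')
    for i, article in enumerate(obj["data"]):
        if not isinstance(article, dict):
            raise ValueError(f"article {i}: must be an object")
        if not isinstance(article.get("title"), str):
            raise ValueError(f"article {i}: missing string 'title'")
        paragraphs = article.get("paragraphs")
        if not isinstance(paragraphs, list):
            raise ValueError(f"article {i}: missing 'paragraphs' list")
        for j, paragraph in enumerate(paragraphs):
            if not isinstance(paragraph, dict) or not isinstance(paragraph.get("context"), str):
                raise ValueError(f"article {i}, paragraph {j}: missing string 'context'")


if __name__ == "__main__":
    # Made-up example content, used only to show the resulting shape.
    articles = build_articles({"Example title": ["First paragraph.", "Second paragraph."]})
    validate_articles(articles)
    print(json.dumps(articles, ensure_ascii=False, indent=2))
```

Validating before indexing is a cheap way to catch a missing `context` key early, before spending GPU time on phrase vectors.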
Run the following command to build the model demo.
# generate phrase vectors
# build phrase index
# evaluate phrase retrieval
# (Retry if something goes wrong.)
make step1
When it finishes, the console shows the following output:
Test the demo model by typing a question into the console.
# evaluate phrase retrieval with input question
# output the answer, but write details in 'sample/step1_question_test_out.json'
make step1_test
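When it finishes, the console shows the following prompt; type your question as instructed:
After typing the question, press Enter. After a short while the model prints the answer:
To go further, wrap the model for interactive input and output by running the web demo in the web_demo_django folder, or one you write yourself: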
# move into the web directory
cd web_demo_django
# run django server
python manage.py runserver
# Then open http://127.0.0.1:8000/ in your browser.
Q: After installing conda, installing packages with conda fails with PackagesNotFoundError:
A: Try adding the following channels and then run the install again:
conda config --add channels conda-forge
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
Q: git clone fails to download from the remote:
A: Retry a few times, or prefix the git command with GIT_CURL_VERBOSE=0, for example: GIT_CURL_VERBOSE=0 git clone https://www.github.com/nvidia/apex.git
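If you would rather not retry by hand, the retry can be scripted. This is a minimal sketch under stated assumptions: `run_with_retries` is a hypothetical helper, not part of the project, and the attempt count and delay are arbitrary defaults.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay_seconds=5.0):
    """Run cmd until it exits with status 0 or the attempts are used up.

    Returns True on success and False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt}/{attempts} failed with exit code {result.returncode}")
        if attempt < attempts:
            time.sleep(delay_seconds)  # wait before the next try
    return False


# Example call (commented out so that importing this file has no side effects):
# run_with_retries(["git", "clone", "https://www.github.com/nvidia/apex.git"])
```

Passing the command as a list avoids shell quoting issues; the return value lets a calling script stop early if the clone never succeeds.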
If you run into any problems, ask the course instructor or teaching assistants, or contact me (lilingwei: [email protected]). You can also open a GitHub Issue, and I will reply as soon as I can.
Please cite the paper if you use DensePhrases in your work:
@inproceedings{lee2021learning,
title={Learning Dense Representations of Phrases at Scale},
author={Lee, Jinhyuk and Sung, Mujeen and Kang, Jaewoo and Chen, Danqi},
booktitle={Association for Computational Linguistics (ACL)},
year={2021}
}
Please see LICENSE for details.