ACL'2021: Learning Dense Representations of Phrases at Scale

blackli7/DensePhrases

 
 


DensePhrases Demo

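DensePhrases is a phrase-level English text retrieval model developed jointly by Korea University and Princeton University. It targets the open-domain question answering and reading comprehension tasks in NLP. The paper was accepted at ACL 2021. You can learn more about the model from its GitHub repository, or try the demo trained on Wikipedia data (2018.12.20) to see it in action.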

This project requires a GPU. If you do not have one locally, you can run the UCAS-DensePhrase.ipynb notebook from this project on Google Colab (recommended).

Open In Colab

Last updated

***** 2021.09.30 *****

***** by lilingwei([email protected])*****

Environment setup

First, install conda:

(Official site: https://www.anaconda.com)

(Blog guide: https://www.jianshu.com/p/edaa744ea47d)

Project setup

Create a conda environment for DensePhrases and install the required packages:

# Install torch with conda (remember to check your CUDA version)
conda create -n densephrases python=3.7
conda activate densephrases
conda install pytorch=1.7.1 cudatoolkit=11.0 -c pytorch

# Clone the DensePhrases fork used for the UCAS course
git clone https://github.com/blackli7/DensePhrases.git

# Install apex
git clone https://www.github.com/nvidia/apex.git
cd apex
python setup.py install
cd ..

# Install other toolkits
cd DensePhrases
pip install -r requirements.txt
python setup.py develop

Set the environment variables

# Running config.sh will set the following three environment variables:
# DATA_DIR: for datasets
# SAVE_DIR: for pre-trained models or index; new models and index will also be saved here
# CACHE_DIR: for cache files from huggingface transformers
source config.sh

Verify the setup

# Check the installed packages
pip list
# If the installation succeeded, the list includes:
apex faiss-gpu torch transformers ...
# Check the configuration
echo $SAVE_DIR
# If config.sh was sourced, this prints:
.//outputs
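
The same checks can be scripted. The sketch below is a hypothetical helper, not part of the repository. It reports which of the packages listed above cannot be found and which of the three variables set by config.sh are missing. It assumes that faiss-gpu is imported under the module name faiss.

```python
import importlib.util
import os

# Module names assumed from the package list above; faiss-gpu is imported as "faiss".
REQUIRED_MODULES = ["apex", "faiss", "torch", "transformers"]
# Variables that config.sh is documented to set.
REQUIRED_ENV_VARS = ["DATA_DIR", "SAVE_DIR", "CACHE_DIR"]


def check_setup(modules=REQUIRED_MODULES, env_vars=REQUIRED_ENV_VARS, environ=None):
    """Return (missing_modules, unset_env_vars) without importing the packages."""
    environ = os.environ if environ is None else environ
    # find_spec only locates a top-level module; it does not execute it.
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    # Treat empty strings the same as unset variables.
    unset_vars = [v for v in env_vars if not environ.get(v)]
    return missing_modules, unset_vars


if __name__ == "__main__":
    missing, unset = check_setup()
    print("Missing modules:", missing or "none")
    print("Unset variables:", unset or "none")
```

Run it inside the activated densephrases environment after `source config.sh`; both lists should come back empty.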

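Build the demo

Using the project's pre-trained model densephrases-multi, build a demo model on a small physics corpus (data/wiki_physics.json, taken from English Wikipedia).

Training data for DensePhrases must follow the JSON format below (see sample/articles.json for details):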

{
    "data": [
        {
            "title": "America's Got Talent (season 4)",
            "paragraphs": [
                {
                    "context": " The fourth season of \"America's Got Talent\", ... Country singer Kevin Skinner was named the winner on September 16, 2009 ..."
                },
                {
                    "context": " Season four was Hasselhoff's final season as a judge. This season started broadcasting live on August 4, 2009. ..."
                },
                ...
            ]
        }
    ]
}
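
Before running the pipeline, it can help to confirm that a file has this shape. The following is a minimal sketch of a structural check. The function name `validate_articles` is hypothetical and not part of DensePhrases, and it checks only the fields that appear in the sample above.

```python
import json


def validate_articles(obj):
    """Return a list of problems; an empty list means the structure matches."""
    if not isinstance(obj, dict) or not isinstance(obj.get("data"), list):
        return ['top level must be an object with a "data" list']
    problems = []
    for i, article in enumerate(obj["data"]):
        if not isinstance(article, dict):
            problems.append(f"data[{i}] must be an object")
            continue
        if not isinstance(article.get("title"), str):
            problems.append(f'data[{i}] needs a string "title"')
        paragraphs = article.get("paragraphs")
        if not isinstance(paragraphs, list):
            problems.append(f'data[{i}] needs a "paragraphs" list')
            continue
        for j, para in enumerate(paragraphs):
            if not isinstance(para, dict) or not isinstance(para.get("context"), str):
                problems.append(f'data[{i}].paragraphs[{j}] needs a string "context"')
    return problems


def validate_file(path):
    """Load a JSON file and validate it; the path is supplied by the caller."""
    with open(path, encoding="utf-8") as f:
        return validate_articles(json.load(f))
```

For example, `validate_file("data/wiki_physics.json")` should return an empty list if the file follows the format.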

For an example of converting data into this format, see data_process.
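
As a rough illustration of such a conversion, the sketch below turns plain-text documents into the required structure. It is not the code in data_process. The function name `build_articles` and the rule that paragraphs are separated by blank lines are assumptions made for this example.

```python
import json
import re


def build_articles(docs):
    """Convert {title: raw_text} into the {"data": [...]} structure DensePhrases expects."""
    data = []
    for title, text in docs.items():
        # Assumption: paragraphs are separated by one or more blank lines.
        chunks = [c.strip() for c in re.split(r"\n\s*\n", text)]
        paragraphs = [{"context": c} for c in chunks if c]
        if paragraphs:  # skip documents with no usable text
            data.append({"title": title, "paragraphs": paragraphs})
    return {"data": data}


example = build_articles(
    {"Force": "A force can change the motion of an object.\n\nThe SI unit of force is the newton."}
)
print(json.dumps(example, indent=2, ensure_ascii=False))
```

Writing the result with `json.dump(..., ensure_ascii=False)` keeps any non-ASCII characters readable in the output file.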

Run the following command to build the demo model:

# generate phrase vectors
# build phrase index
# evaluate phrase retrieval
# (if a step fails, run the command again)
make step1

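When it finishes, the console shows output like the following:

[Screenshot: step1]

Test the demo

Test the demo model by typing a question into the console.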

# evaluate phrase retrieval with input question
# prints the answer; details are written to 'sample/step1_question_test_out.json'
make step1_test

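Once the model has loaded, the console shows the following prompt. Enter your question as instructed:

[Screenshot: step1_test_q]

Press Enter after typing the question. After a short wait, the model prints the answer:

[Screenshot: step1_test_a]

To go further, you can wrap the model for interactive input and output by running the web demo in the web_demo_django folder, or one that you write yourself: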

# move into the web directory
cd web_demo_django
# run django server
python manage.py runserver
# then open http://127.0.0.1:8000/ in your browser

[Screenshot: web_demo]
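
To confirm from a script that the Django server is reachable, a small probe like the one below can be used. It is a sketch that uses only the Python standard library. The helper name `server_is_up` is hypothetical, and the default URL is the address shown in the comment above.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://127.0.0.1:8000/", timeout=3.0):
    """Return True if any HTTP response comes back from url, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status code.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, and similar errors.
        return False


if __name__ == "__main__":
    print("web demo reachable:", server_is_up())
```

The probe only checks that the server responds; it does not check that the model behind it has finished loading.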

FAQ

1.

Q: After installing conda, installing packages with conda fails with PackagesNotFoundError:

[Screenshot: faq1]
A: Add more channels with the commands below, then try again:
conda config --add channels conda-forge
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/

2.

Q: git clone fails to download from the remote:

[Screenshot: faq2]
A: Retry a few times, or prefix the git command with GIT_CURL_VERBOSE=0, for example:
GIT_CURL_VERBOSE=0 git clone https://www.github.com/nvidia/apex.git

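Feedback

If you run into any problems, ask the course instructor or teaching assistants, or contact me (lilingwei: [email protected]). You can also open a GitHub Issue, and I will reply as soon as I can.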

Reference

Please cite the paper if you use DensePhrases in your work:

@inproceedings{lee2021learning,
   title={Learning Dense Representations of Phrases at Scale},
   author={Lee, Jinhyuk and Sung, Mujeen and Kang, Jaewoo and Chen, Danqi},
   booktitle={Association for Computational Linguistics (ACL)},
   year={2021}
}

License

Please see LICENSE for details.
