Text2gql #285

Closed — wants to merge 52 commits

Commits (52)
dd27e9c
feat(nlu): First NLU finetuning
fangyinc Jun 11, 2024
2343282
fix: Fix train error
fangyinc Jun 11, 2024
1d76041
chore: Add log for NLU training
fangyinc Jun 12, 2024
b3874d2
fix: Fix dbgpt-hub-sql error
fangyinc Jun 13, 2024
ce478fa
ner
Jun 22, 2024
c0d663c
Feat: Add ner task, using qwen-1.5b LLM
Jun 22, 2024
62fa328
Update ner.sh
zhanghy-sketchzh Jun 22, 2024
861182c
fix:label2id
Jun 24, 2024
e5b92a9
add missing pyproject.toml
SonglinLyu Jul 15, 2024
a48fd2b
create dbgpt-hub-graph
SonglinLyu Jul 23, 2024
c392a09
add a prototype of query similarity evaluator
SonglinLyu Aug 13, 2024
1db6ffb
a demo for grammar parser generated from .g4 file
SonglinLyu Aug 14, 2024
bbd46eb
add lcypher and gql eavluator
SonglinLyu Aug 16, 2024
c4e5444
remove unnecessary data file
SonglinLyu Aug 16, 2024
e2028ab
force commit all current changes
SonglinLyu Aug 16, 2024
2c096d0
delete data preparation related folder
SonglinLyu Aug 16, 2024
87b15b0
remove useless dataset
SonglinLyu Aug 16, 2024
e332786
add tugraph-analytics dataset
SonglinLyu Aug 16, 2024
15e74ca
rename dbgpt-hub-graph to dbgpt-hub-gql
SonglinLyu Aug 16, 2024
e52e1b2
rename dbgpt-hub-graph to dbgpt-hub-gql
SonglinLyu Aug 16, 2024
c56e7e8
remove eval_data folder
SonglinLyu Aug 16, 2024
310cb18
remove unnecessary log file
SonglinLyu Aug 16, 2024
ebf9318
ignore wandb folder
SonglinLyu Aug 16, 2024
9e8337d
add README.md
SonglinLyu Aug 16, 2024
c41e61e
add table to README
SonglinLyu Aug 16, 2024
b63a504
rename sql to gql
SonglinLyu Aug 19, 2024
40c0fd2
remove unused data_process module
SonglinLyu Aug 19, 2024
0dee13c
remove baseline
SonglinLyu Aug 19, 2024
4bbc197
correct dataset path
SonglinLyu Aug 19, 2024
8258e92
use prettytable to format evaluation output
SonglinLyu Aug 19, 2024
0e03386
add detail log for evaluation
SonglinLyu Aug 19, 2024
9c3e508
remove sql
SonglinLyu Aug 20, 2024
e78e775
remove ouputs that are not query from dataset
SonglinLyu Aug 20, 2024
2c3cf0e
change tugraph-db to tugraph-db-example, this folder only contains ab…
SonglinLyu Aug 20, 2024
99edb59
remove tugraph-analytics folder
SonglinLyu Aug 20, 2024
1792b45
add tugraph-db-example, a mini dataset
SonglinLyu Aug 20, 2024
ccc4f58
update readme, include tugraph-analytics dataset download method
SonglinLyu Aug 20, 2024
c15e240
delete unneeded notation and print
SonglinLyu Aug 20, 2024
7a41f06
reformate with black
SonglinLyu Aug 20, 2024
bc307b0
update readme
SonglinLyu Aug 20, 2024
973efe8
update readme
SonglinLyu Aug 20, 2024
3e2168c
remove temporary change for hub_sql
SonglinLyu Aug 21, 2024
715fe9d
update introduction
SonglinLyu Aug 21, 2024
6d60913
remove temporary change for hub_sql
SonglinLyu Aug 21, 2024
9c60eb3
update readme
SonglinLyu Aug 21, 2024
4b82cbf
update baseline test result
SonglinLyu Aug 26, 2024
be20757
add link to tugraph-analytics parser
SonglinLyu Aug 26, 2024
5906fcd
fix: readme add text2nlu & gql
csunny Aug 26, 2024
c29f447
fix readme comments
SonglinLyu Aug 26, 2024
906bd68
Merge branch 'text2gql_lsl' of https://github.com/SonglinLyu/DB-GPT-H…
SonglinLyu Aug 26, 2024
5c01d59
update readme(dataset link, baseline result)
SonglinLyu Aug 26, 2024
7adfdd7
Merge branch 'main' into text2gql_lsl
zhanghy-sketchzh Aug 27, 2024
47 changes: 33 additions & 14 deletions .gitignore
@@ -13,21 +13,40 @@ data/spider
data/eval
output_pred/
wandb/
dbgpt_hub/data/*
src/dbgpt-hub-sql/dbgpt_hub_sql/data/*
src/dbgpt-hub-gql/dbgpt_hub_gql/data/*
src/dbgpt-hub-sql/codellama/*
src/dbgpt-hub-gql/codellama/*
src/dbgpt-hub-sql/wandb/*
src/dbgpt-hub-gql/wandb/*
# But track the data/eval_data folder itself
!dbgpt_hub/data/eval_data/
!dbgpt_hub/data/dataset_info.json
!dbgpt_hub/data/example_text2sql.json

# Ignore everything under dbgpt_hub/ouput/ except the adapter directory
dbgpt_hub/output/adapter/*
!dbgpt_hub/output/adapter/.gitkeep
dbgpt_hub/output/logs/*
!dbgpt_hub/output/logs/.gitkeep
dbgpt_hub/output/pred/*
!dbgpt_hub/output/pred/.gitkeep


!src/dbgpt-hub-sql/dbgpt_hub_sql/data/eval_data/
!src/dbgpt-hub-sql/dbgpt_hub_sql/data/dataset_info.json
!src/dbgpt-hub-sql/dbgpt_hub_sql/data/example_text2sql.json
!src/dbgpt-hub-gql/dbgpt_hub_gql/data/tugraph-db-example
!src/dbgpt-hub-gql/dbgpt_hub_gql/data/dataset_info.json
!src/dbgpt-hub-gql/dbgpt_hub_gql/data/example_text2sql.json

# Ignore everything under dbgpt_hub_sql/output/ except the adapter directory
src/dbgpt-hub-sql/dbgpt_hub_sql/output/
src/dbgpt-hub-sql/dbgpt_hub_sql/output/adapter/*
!src/dbgpt-hub-sql/dbgpt_hub_sql/output/adapter/.gitkeep
src/dbgpt-hub-sql/dbgpt_hub_sql/output/logs/*
!src/dbgpt-hub-sql/dbgpt_hub_sql/output/logs/.gitkeep
src/dbgpt-hub-sql/dbgpt_hub_sql/output/pred/*
!src/dbgpt-hub-sql/dbgpt_hub_sql/output/pred/.gitkeep

src/dbgpt-hub-gql/dbgpt_hub_gql/output/
src/dbgpt-hub-gql/dbgpt_hub_gql/output/adapter/*
!src/dbgpt-hub-gql/dbgpt_hub_gql/output/adapter/.gitkeep
src/dbgpt-hub-gql/dbgpt_hub_gql/output/logs/*
!src/dbgpt-hub-gql/dbgpt_hub_gql/output/logs/.gitkeep
src/dbgpt-hub-gql/dbgpt_hub_gql/output/pred/*
!src/dbgpt-hub-gql/dbgpt_hub_gql/output/pred/.gitkeep

# Ignore NLU output
src/dbgpt-hub-nlu/output
src/dbgpt-hub-nlu/data

#
build/
31 changes: 31 additions & 0 deletions Makefile
@@ -0,0 +1,31 @@
.DEFAULT_GOAL := help

SHELL=/bin/bash
VENV = venv

# Detect the operating system and set the virtualenv bin directory
ifeq ($(OS),Windows_NT)
VENV_BIN=$(VENV)/Scripts
else
VENV_BIN=$(VENV)/bin
endif

setup: $(VENV)/bin/activate

$(VENV)/bin/activate: $(VENV)/.venv-timestamp

$(VENV)/.venv-timestamp: src/dbgpt-hub-nlu/setup.py requirements
# Create new virtual environment if setup.py has changed
python3 -m venv $(VENV)
$(VENV_BIN)/pip install --upgrade pip
$(VENV_BIN)/pip install -r requirements/lint-requirements.txt
touch $(VENV)/.venv-timestamp


.PHONY: fmt
fmt: setup ## Format Python code
# TODO: Use isort to sort Python imports.
# https://github.com/PyCQA/isort
$(VENV_BIN)/isort src/
# https://github.com/psf/black
$(VENV_BIN)/black --extend-exclude="examples/notebook" .
76 changes: 42 additions & 34 deletions README.md
@@ -24,9 +24,17 @@
</p>


[**简体中文**](README.zh.md) | [**Discord**](https://discord.gg/7uQnPuveTY) | [**Wechat**](https://github.com/eosphoros-ai/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC) | [**Huggingface**](https://huggingface.co/eosphoros) | [**Community**](https://github.com/eosphoros-ai/community) | [**Paper**](https://arxiv.org/abs/2406.11434)

[**简体中文**](README.zh.md) | [**Discord**](https://discord.gg/7uQnPuveTY) | [**Wechat**](https://github.com/eosphoros-ai/DB-GPT/blob/main/README.zh.md#%E8%81%94%E7%B3%BB%E6%88%91%E4%BB%AC) | [**Huggingface**](https://huggingface.co/eosphoros) | [**Community**](https://github.com/eosphoros-ai/community)
[**Text2SQL**](README.zh.md) | [**Text2GQL**](src/dbgpt-hub-gql/README.zh.md) | [**Text2NLU**](src/dbgpt-hub-nlu/README.zh.md)

</div>


## 🔥🔥🔥 News
- Support [Text2NLU](src/dbgpt-hub-nlu/README.zh.md) fine-tuning to improve semantic understanding accuracy.
- Support [Text2GQL](src/dbgpt-hub-gql/README.zh.md) fine-tuning to generate graph queries.

## Baseline
- update time: 2023/12/08
- metric: execution accuracy (ex)
@@ -392,13 +400,13 @@ Firstly, install `dbgpt-hub` with the following command

Then, set up the arguments and run the whole process.
```python
from dbgpt_hub.data_process import preprocess_sft_data
from dbgpt_hub.train import start_sft
from dbgpt_hub.predict import start_predict
from dbgpt_hub.eval import start_evaluate
from dbgpt_hub_sql.data_process import preprocess_sft_data
from dbgpt_hub_sql.train import start_sft
from dbgpt_hub_sql.predict import start_predict
from dbgpt_hub_sql.eval import start_evaluate

# Config the input datasets
data_folder = "dbgpt_hub/data"
data_folder = "dbgpt_hub_sql/data"
data_info = [
{
"data_source": "spider",
@@ -424,7 +432,7 @@ train_args = {
"template": "llama2",
"lora_rank": 64,
"lora_alpha": 32,
"output_dir": "dbgpt_hub/output/adapter/CodeLlama-13b-sql-lora",
"output_dir": "dbgpt_hub_sql/output/adapter/CodeLlama-13b-sql-lora",
"overwrite_cache": True,
"overwrite_output_dir": True,
"per_device_train_batch_size": 1,
@@ -443,20 +451,20 @@ predict_args = {
"model_name_or_path": "codellama/CodeLlama-13b-Instruct-hf",
"template": "llama2",
"finetuning_type": "lora",
"checkpoint_dir": "dbgpt_hub/output/adapter/CodeLlama-13b-sql-lora",
"predict_file_path": "dbgpt_hub/data/eval_data/dev_sql.json",
"predict_out_dir": "dbgpt_hub/output/",
"checkpoint_dir": "dbgpt_hub_sql/output/adapter/CodeLlama-13b-sql-lora",
"predict_file_path": "dbgpt_hub_sql/data/eval_data/dev_sql.json",
"predict_out_dir": "dbgpt_hub_sql/output/",
"predicted_out_filename": "pred_sql.sql",
}

# Config evaluation parameters
evaluate_args = {
"input": "./dbgpt_hub/output/pred/pred_sql_dev_skeleton.sql",
"gold": "./dbgpt_hub/data/eval_data/gold.txt",
"gold_natsql": "./dbgpt_hub/data/eval_data/gold_natsql2sql.txt",
"db": "./dbgpt_hub/data/spider/database",
"table": "./dbgpt_hub/data/eval_data/tables.json",
"table_natsql": "./dbgpt_hub/data/eval_data/tables_for_natsql2sql.json",
"input": "./dbgpt_hub_sql/output/pred/pred_sql_dev_skeleton.sql",
"gold": "./dbgpt_hub_sql/data/eval_data/gold.txt",
"gold_natsql": "./dbgpt_hub_sql/data/eval_data/gold_natsql2sql.txt",
"db": "./dbgpt_hub_sql/data/spider/database",
"table": "./dbgpt_hub_sql/data/eval_data/tables.json",
"table_natsql": "./dbgpt_hub_sql/data/eval_data/tables_for_natsql2sql.json",
"etype": "exec",
"plug_value": True,
"keep_distict": False,
@@ -479,15 +487,15 @@ start_evaluate(evaluate_args)

DB-GPT-Hub uses an information-matching generation method for data preparation: SQL + Repository generation that combines table information. Combining data-table information helps the model understand the structure and relationships of the tables, which is well suited to generating SQL statements that meet the requirements.
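As a rough illustration of this idea (the function name and prompt format below are assumptions, not the project's actual code — the real pipeline lives in `dbgpt_hub_sql/data_process`), combining table information with the question can be sketched as:

```python
# Minimal sketch of "information matching" prompt construction:
# serialize the table schema and place it next to the natural-language
# question so the model sees structure and relationships alongside the request.

def build_prompt(question: str, tables: dict) -> str:
    # Serialize each table as "name(col1, col2, ...)"
    schema = " | ".join(
        f"{name}({', '.join(cols)})" for name, cols in tables.items()
    )
    return (
        f"##Instruction: given the schema {schema}, write a SQL query.\n"
        f"##Question: {question}\n##Response:"
    )

prompt = build_prompt(
    "How many singers are there?",
    {"singer": ["singer_id", "name", "country"]},
)
print(prompt)
```

The serialized schema is what lets the model resolve column and table names instead of guessing them.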

Download the [Spider dataset]((https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ)) from the Spider dataset link. By default, after downloading and extracting the data, place it in the dbgpt_hub/data directory, i.e., the path should be `dbgpt_hub/data/spider`.
Download the [Spider dataset](https://drive.google.com/uc?export=download&id=1TqleXec_OykOYFREKKtschzY29dUcVAQ) from the link. By default, after downloading and extracting the data, place it in the `dbgpt_hub_sql/data` directory, i.e. the path should be `dbgpt_hub_sql/data/spider`.

For the data preprocessing part, simply **run the following script** :
```bash
## generate train and dev(eval) data
poetry run sh dbgpt_hub/scripts/gen_train_eval_data.sh
poetry run sh dbgpt_hub_sql/scripts/gen_train_eval_data.sh
```

In the directory `dbgpt_hub/data/`, you will find the newly generated training file example_text2sql_train.json and testing file example_text2sql_dev.json, containing 8659 and 1034 entries respectively. For the data used in subsequent fine-tuning, set the parameter `file_name` value to the file name of the training set in dbgpt_hub/data/dataset_info.json, such as example_text2sql_train.json
In the directory `dbgpt_hub_sql/data/`, you will find the newly generated training file `example_text2sql_train.json` and evaluation file `example_text2sql_dev.json`, containing 8,659 and 1,034 entries respectively. For subsequent fine-tuning, set the `file_name` parameter in `dbgpt_hub_sql/data/dataset_info.json` to the file name of the training set, e.g. `example_text2sql_train.json`.
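For orientation, registering the training file in `dataset_info.json` amounts to roughly the following (the exact keys in the project's file may differ — this is a hedged sketch written to a temporary path):

```python
import json
import os
import tempfile

# Hypothetical minimal dataset_info.json entry: the outer key is the dataset
# name passed to training, and "file_name" points at the generated train file.
dataset_info = {
    "example_text2sql_train": {
        "file_name": "example_text2sql_train.json",
    }
}

path = os.path.join(tempfile.mkdtemp(), "dataset_info.json")
with open(path, "w") as f:
    json.dump(dataset_info, f, indent=2)

# Reload to confirm the file_name the fine-tuning step will read.
with open(path) as f:
    loaded = json.load(f)
print(loaded["example_text2sql_train"]["file_name"])
```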


The data in the generated JSON looks something like this:
Expand All @@ -500,43 +508,43 @@ The data in the generated JSON looks something like this:
"history": []
},
```
The data processing code of `chase`, `cosql` and `sparc` has been embedded in the data processing code of the project. After downloading the data set according to the above link, you only need to add ` in `dbgpt_hub/configs/config.py` Just loosen the corresponding code comment in SQL_DATA_INFO`.
The data-processing code for `chase`, `cosql`, and `sparc` is already embedded in the project. After downloading those datasets from the links above, you only need to uncomment the corresponding entries in `SQL_DATA_INFO` in `dbgpt_hub_sql/configs/config.py`.
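The `SQL_DATA_INFO` list can be pictured roughly like this (field names here are illustrative, not copied from `config.py`); enabling an extra dataset is just a matter of uncommenting its entry:

```python
# Illustrative stand-in for SQL_DATA_INFO in dbgpt_hub_sql/configs/config.py.
# Spider is enabled by default; uncomment an entry to include another dataset.
SQL_DATA_INFO = [
    {"data_source": "spider", "train_file": ["train_spider.json", "train_others.json"]},
    # {"data_source": "chase", "train_file": ["train.json"]},
    # {"data_source": "cosql", "train_file": ["cosql_train.json"]},
    # {"data_source": "sparc", "train_file": ["train.json"]},
]

enabled = [d["data_source"] for d in SQL_DATA_INFO]
print(enabled)
```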

### 3.4. Model fine-tuning

The model fine-tuning supports both the LoRA and QLoRA methods. By default, with the `--quantization_bit` parameter, the script uses QLoRA fine-tuning; to switch to LoRA, simply remove that parameter.
Run the command:

```bash
poetry run sh dbgpt_hub/scripts/train_sft.sh
poetry run sh dbgpt_hub_sql/scripts/train_sft.sh
```

After fine-tuning, the model weights will be saved by default in the adapter folder, specifically in the dbgpt_hub/output/adapter directory.
After fine-tuning, the model weights will be saved by default in the adapter folder, specifically in the dbgpt_hub_sql/output/adapter directory.

If you're using **multi-GPU training and want to use DeepSpeed**, modify the default content in `train_sft.sh` as follows:

```
CUDA_VISIBLE_DEVICES=0 python dbgpt_hub/train/sft_train.py \
CUDA_VISIBLE_DEVICES=0 python dbgpt_hub_sql/train/sft_train.py \
--quantization_bit 4 \
...
```
change it to:
```
deepspeed --num_gpus 2 dbgpt_hub/train/sft_train.py \
--deepspeed dbgpt_hub/configs/ds_config.json \
deepspeed --num_gpus 2 dbgpt_hub_sql/train/sft_train.py \
--deepspeed dbgpt_hub_sql/configs/ds_config.json \
--quantization_bit 4 \
...
```

If you need to specify the GPU IDs to use:
```
deepspeed --include localhost:0,1 dbgpt_hub/train/sft_train.py \
--deepspeed dbgpt_hub/configs/ds_config.json \
deepspeed --include localhost:0,1 dbgpt_hub_sql/train/sft_train.py \
--deepspeed dbgpt_hub_sql/configs/ds_config.json \
--quantization_bit 4 \
...
```

The other parts that are omitted (…) can be kept consistent. If you want to change the default deepseed configuration, go into the `dbgpt_hub/configs` directory and make changes to ds_config.json as needed,the default is stage2.
The omitted parts (…) can be kept consistent. If you want to change the default DeepSpeed configuration, go into the `dbgpt_hub_sql/configs` directory and edit `ds_config.json` as needed; the default is stage 2.
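For reference, a minimal ZeRO stage-2 configuration has roughly the following shape (a hedged sketch, not the project's actual `ds_config.json`, which is the authoritative version):

```python
import json

# Hypothetical minimal DeepSpeed ZeRO stage-2 config; "auto" lets DeepSpeed
# inherit batch-size settings from the training arguments.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,                              # partition optimizer state + gradients
        "offload_optimizer": {"device": "cpu"},  # optional CPU offload to save VRAM
        "overlap_comm": True,
    },
}
print(json.dumps(ds_config, indent=2))
```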

In the script, during fine-tuning, different models correspond to key parameters lora_target and template, as shown in the following table:

@@ -563,10 +571,10 @@ In the script, during fine-tuning, different models correspond to key parameters

> quantization_bit: Indicates whether quantization is applied, with valid values being [4 or 8].
> model_name_or_path: The path of the LLM (Large Language Model).
> dataset: Specifies the name of the training dataset configuration, corresponding to the outer key value in dbgpt_hub/data/dataset_info.json, such as example_text2sql.
> dataset: Specifies the name of the training dataset configuration, corresponding to the outer key value in dbgpt_hub_sql/data/dataset_info.json, such as example_text2sql.
> max_source_length: The length of the text input into the model. If computing resources allow, it can be set as large as possible, like 1024 or 2048.
> max_target_length: The length of the SQL content output by the model; 512 is generally sufficient.
> output_dir: The output path of the Peft module during SFT (Supervised Fine-Tuning), set by default to `dbgpt_hub/output/adapter/` .
> output_dir: The output path of the Peft module during SFT (Supervised Fine-Tuning), set by default to `dbgpt_hub_sql/output/adapter/` .
> per_device_train_batch_size: The size of the batch. If computing resources allow, it can be set larger; the default is 1.
> gradient_accumulation_steps: The number of steps for accumulating gradients before an update.
> save_steps: The number of steps at which model checkpoints are saved; it can be set to 100 by default.
@@ -575,10 +583,10 @@ In the script, during fine-tuning, different models correspond to key parameters

### 3.5. Model Predict

Under the project directory ./dbgpt_hub/output/pred/, this folder is the default output location for model predictions(if not exist, just mkdir).
The folder `./dbgpt_hub_sql/output/pred/` under the project directory is the default output location for model predictions (create it first if it does not exist).

```bash
poetry run sh ./dbgpt_hub/scripts/predict_sft.sh
poetry run sh ./dbgpt_hub_sql/scripts/predict_sft.sh
```

In the script, by default with the parameter `--quantization_bit`, it predicts using QLoRA. Removing it switches to the LoRA prediction method.
@@ -593,7 +601,7 @@ You can find the second corresponding model weights from Huggingface [hg-eospho
If you need to merge the weights of the trained base model and the fine-tuned Peft module to export a complete model, execute the following model export script:

```bash
poetry run sh ./dbgpt_hub/scripts/export_merge.sh
poetry run sh ./dbgpt_hub_sql/scripts/export_merge.sh
```

Be sure to replace the parameter path values in the script with the paths corresponding to your project.
@@ -602,7 +610,7 @@ Be sure to replace the parameter path values in the script with the paths corresponding to your project.
To evaluate model performance on the dataset (the Spider dev set by default), run the following command:
```bash
poetry run python dbgpt_hub/eval/evaluation.py --plug_value --input Your_model_pred_file
poetry run python dbgpt_hub_sql/eval/evaluation.py --plug_value --input Your_model_pred_file
```
You can find our latest evaluation results and part of the experiment results [here](docs/eval_llm_result.md).
**Note**: The database pointed to by the default code is a 95M database downloaded from the [Spider official website](https://yale-lily.github.io/spider). If you need the full Spider database (1.27G) from [test-suite](https://github.com/taoyds/test-suite-sql-eval), download it to a custom directory first, then run the evaluation command above with the extra parameter `--db Your_download_db_path`.
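Execution accuracy (`--etype exec`) can be understood with a toy sqlite3 comparison: a prediction counts as correct when it returns the same result set as the gold query on the evaluation database. This is a simplified sketch — the real evaluator also handles value plugging, ordering, and distinctness:

```python
import sqlite3

# Toy in-memory database standing in for a Spider database file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [(1, "A"), (2, "B")])

def execution_match(pred_sql: str, gold_sql: str) -> bool:
    # Compare unordered result sets: the core of execution accuracy.
    pred = set(conn.execute(pred_sql).fetchall())
    gold = set(conn.execute(gold_sql).fetchall())
    return pred == gold

ok = execution_match(
    "SELECT count(*) FROM singer",
    "SELECT count(singer_id) FROM singer",
)
print(ok)  # different SQL text, but the same result set
```

Comparing results rather than SQL strings is what lets syntactically different but semantically equivalent predictions score as correct.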