This project fine-tunes CodeQwen1.5 following the DTS-SQL approach and performs inference with vLLM to implement a text-to-SQL generation pipeline. Here's how to set up and use it.
First, clone the repository from GitHub:
git clone https://github.com/MohammadrezaPourreza/DTS-SQL.git
cd DTS-SQL
Install the necessary dependencies:
pip install -U accelerate bitsandbytes peft transformers==4.39 datasets trl git-lfs wandb flash-attn sql-metadata scipy sqlglot
Note: Avoid using `transformers==4.40` as it leads to errors with CodeQwen.
To set up the dataset for this project, follow these steps:
- Download the dataset from Google Drive.
- Unzip the downloaded file if it is compressed.
- Place the unzipped dataset folder into the `spider` directory at the root of this project.
This project utilizes the Spider 1.0 dataset, which can be found on the Spider dataset website.
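As a quick sanity check after copying the data, here is a minimal sketch assuming the standard Spider 1.0 layout (`tables.json`, `train_spider.json`, `dev.json`, and the `database/` folder); adjust the names if your copy differs:

```python
from pathlib import Path

# Sanity check (assumption: standard Spider 1.0 file names under ./spider).
SPIDER_DIR = Path("spider")
expected = ["tables.json", "train_spider.json", "dev.json", "database"]

for name in expected:
    path = SPIDER_DIR / name
    print(f"{'ok' if path.exists() else 'MISSING':8s} {path}")
```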
Alternatively, you can skip the script execution and directly download the pre-built datasets:
These links contain all the necessary files to get started with the project.
If you don't want to download them and prefer to prepare the training and validation datasets yourself, you can use the provided scripts. Please note that you may need to adjust the dataset paths in the scripts accordingly.
python finetuning_dataset_creator.py
The file with 'filtered' in its name contains the ideal results (schema information extracted directly from the gold SQL answers).
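To illustrate what "extracted directly from the SQL answers" means, here is a sketch using the `sql-metadata` package installed above; this shows the concept only, not the exact logic of `finetuning_dataset_creator.py`:

```python
from sql_metadata import Parser

# Sketch: pull the gold table names straight out of a SQL answer.
# This is the kind of "ideal" schema link the filtered file refers to.
gold_sql = (
    "SELECT T1.name FROM singer AS T1 "
    "JOIN concert AS T2 ON T1.singer_id = T2.singer_id"
)
print(Parser(gold_sql).tables)  # e.g. ['singer', 'concert']
```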
- Execute the Jupyter notebook for schema linking fine-tuning:
jupyter notebook schema_linking_generation_finetuning.ipynb
The output of this step is a trained model capable of extracting potentially relevant table names from a given question and its related database schema.
- Run the inference notebook:
jupyter notebook schema_linking_generation_inference.ipynb
- Train the SQL generation model:
jupyter notebook sql_generation_finetuning.ipynb
- Perform inference to generate SQL queries:
jupyter notebook sql_generation_inference.ipynb
For an upper-bound result, replace the file `generated_test_schema_links.csv` with `./validation/filtered_spider_syn_dataset.csv` in the inference notebook.
Note that the schema-linking stage produces the final checkpoint for the first-stage LoRA layer (`final_checkpoint_part1`) and that model's inference results on the test set. Additionally, `data_processing_scheme2inference.py` is integrated to generate the data (`useful_val_dataset1.csv`) used for the final SQL generation inference.
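As a rough illustration of how the stage-1 checkpoint is used, here is a minimal inference sketch assuming `Qwen/CodeQwen1.5-7B` as the base model and `final_checkpoint_part1` as the LoRA directory; the prompt template below is a placeholder, not the one the notebooks actually use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Assumptions: CodeQwen1.5 as the base model and the stage-1 LoRA folder produced above.
base_id = "Qwen/CodeQwen1.5-7B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, "final_checkpoint_part1")

# Placeholder prompt: question plus schema in, relevant table names out.
prompt = (
    "Schema: singer(singer_id, name), concert(concert_id, singer_id)\n"
    "Question: Which singers performed in a concert?\n"
    "Relevant tables:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```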
Run the batch script to execute all notebooks automatically:
sh run.sh
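If you would rather drive the pipeline from Python than from the shell script, a rough equivalent of what `run.sh` automates (a sketch only; the actual script may differ) can call `jupyter nbconvert`:

```python
import subprocess

# Sketch: execute the notebooks headlessly, in pipeline order.
notebooks = [
    "schema_linking_generation_finetuning.ipynb",
    "schema_linking_generation_inference.ipynb",
    "sql_generation_finetuning.ipynb",
    "sql_generation_inference.ipynb",
]
for nb in notebooks:
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb],
        check=True,
    )
```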
For faster inference, you can use vLLM by running:
jupyter notebook inference_vllm.ipynb
Note: The results with vLLM may not be as accurate as with the fully fine-tuned models.
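For reference, here is a minimal vLLM generation sketch; the model path, prompt, and sampling settings are assumptions, and `inference_vllm.ipynb` configures the real ones:

```python
from vllm import LLM, SamplingParams

# Assumed model path for illustration -- point this at a merged fine-tuned checkpoint.
llm = LLM(model="Qwen/CodeQwen1.5-7B")
params = SamplingParams(temperature=0.0, max_tokens=256)

prompt = (
    "-- Schema: singer(singer_id, name)\n"
    "-- Question: How many singers are there?\n"
    "SELECT"
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```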
Download the required models from the following links:
- Stage 1 LoRA model: Google Drive
- Stage 2 LoRA model: Google Drive
- Stage 1 trained model: Hugging Face
- Stage 2 trained model: Hugging Face
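To fetch the Hugging Face checkpoints programmatically, a small sketch using `huggingface_hub`; the repository IDs below are hypothetical placeholders, so substitute the ones linked above:

```python
from huggingface_hub import snapshot_download

# Hypothetical repo IDs -- replace with the actual Hugging Face repositories above.
for repo_id, local_dir in [
    ("<stage1-trained-model-repo>", "stage1_model"),
    ("<stage2-trained-model-repo>", "stage2_model"),
]:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```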
Evaluation on Spider (https://yale-lily.github.io/spider)
Expected results are:
- Exact Match (EM): 0.736
- Execution Accuracy (EX): 0.813
- Upper-bound result: 0.876 with the ideal data setup described above
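For intuition, execution accuracy asks whether the predicted query returns the same rows as the gold query on the target database. The sketch below is a simplified version of that check; the official Spider evaluation script is more rigorous and also computes exact match:

```python
import sqlite3
from collections import Counter

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Simplified EX check: do both queries return the same multiset of rows?"""
    conn = sqlite3.connect(db_path)
    try:
        gold_rows = conn.execute(gold_sql).fetchall()
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # the predicted query does not even execute
    finally:
        conn.close()
    return Counter(gold_rows) == Counter(pred_rows)

# Example (path assumes the standard Spider layout under spider/database/):
# execution_match("spider/database/concert_singer/concert_singer.sqlite",
#                 "SELECT count(*) FROM singer", "SELECT count(singer_id) FROM singer")
```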