This project fine-tunes CodeQwen1.5 following the DTS-SQL approach and performs inference with vLLM to implement a text-to-SQL generation pipeline. Here's how to set up and use it.
First, clone the repository from GitHub:
git clone https://github.com/MohammadrezaPourreza/DTS-SQL.git
cd DTS-SQL
Install the necessary dependencies:
pip install -U accelerate bitsandbytes peft transformers==4.39 datasets trl git-lfs wandb flash-attn sql-metadata scipy sqlglot
Note: Avoid using `transformers==4.40` as it leads to errors with CodeQwen.
To set up the dataset for this project, follow these steps:
- Download the dataset from Google Drive.
- Unzip the downloaded file if it is compressed.
- Place the unzipped dataset folder into the `spider` directory at the root of this project.
This project utilizes the Spider 1.0 dataset, which can be found on the Spider dataset website.
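As a quick sanity check after copying the data, here is a minimal sketch assuming the standard Spider 1.0 layout (`tables.json`, `train_spider.json`, `dev.json`, and the `database/` folder); adjust the names if your copy differs:

```python
from pathlib import Path

# Sanity check (assumption: standard Spider 1.0 file names under ./spider).
SPIDER_DIR = Path("spider")
expected = ["tables.json", "train_spider.json", "dev.json", "database"]

for name in expected:
    path = SPIDER_DIR / name
    print(f"{'ok' if path.exists() else 'MISSING':8s} {path}")
```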
Alternatively, you can skip the script execution and directly download the pre-built datasets:
These links contain all the necessary files to get started with the project.
If you don't want to download them and prefer to prepare the training and validation datasets yourself, you can use the provided scripts. Please note that you may need to adjust the dataset paths in the scripts accordingly.
python finetuning_dataset_creator.py
The file with 'filtered' in its name contains the ideal results (schema information extracted directly from the gold SQL answers).
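To illustrate what "extracted directly from the SQL answers" means, here is a sketch using the `sql-metadata` package installed above; this shows the concept only, not the exact logic of `finetuning_dataset_creator.py`:

```python
from sql_metadata import Parser

# Sketch: pull the gold table names straight out of a SQL answer.
# This is the kind of "ideal" schema link the filtered file refers to.
gold_sql = (
    "SELECT T1.name FROM singer AS T1 "
    "JOIN concert AS T2 ON T1.singer_id = T2.singer_id"
)
print(Parser(gold_sql).tables)  # e.g. ['singer', 'concert']
```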
- Execute the Jupyter notebook for schema linking fine-tuning:
jupyter notebook schema_linking_generation_finetuning.ipynb
The output of this step is a trained model capable of extracting potentially relevant table names from a given question and its related database schema.
- Run the inference notebook:
jupyter notebook schema_linking_generation_inference.ipynb
- Train the SQL generation model:
jupyter notebook sql_generation_finetuning.ipynb
- Perform inference to generate SQL queries:
jupyter notebook sql_generation_inference.ipynb
For an upper-bound result, replace the file `generated_test_schema_links.csv` with `./validation/filtered_spider_syn_dataset.csv` in the inference notebook.
Note that the schema-linking stage produces the final checkpoint for the first-stage LoRA layer (`final_checkpoint_part1`) and that model's inference results on the test set. Additionally, `data_processing_scheme2inference.py` is integrated to generate the data (`useful_val_dataset1.csv`) used for the final SQL generation inference.
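As a rough illustration of how the stage-1 checkpoint is used, here is a minimal inference sketch assuming `Qwen/CodeQwen1.5-7B` as the base model and `final_checkpoint_part1` as the LoRA directory; the prompt template below is a placeholder, not the one the notebooks actually use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Assumptions: CodeQwen1.5 as the base model and the stage-1 LoRA folder produced above.
base_id = "Qwen/CodeQwen1.5-7B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, "final_checkpoint_part1")

# Placeholder prompt: question plus schema in, relevant table names out.
prompt = (
    "Schema: singer(singer_id, name), concert(concert_id, singer_id)\n"
    "Question: Which singers performed in a concert?\n"
    "Relevant tables:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```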
Run the batch script to execute all notebooks automatically:
sh run.sh
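If you would rather drive the pipeline from Python than from the shell script, a rough equivalent of what `run.sh` automates (a sketch only; the actual script may differ) can call `jupyter nbconvert`:

```python
import subprocess

# Sketch: execute the notebooks headlessly, in pipeline order.
notebooks = [
    "schema_linking_generation_finetuning.ipynb",
    "schema_linking_generation_inference.ipynb",
    "sql_generation_finetuning.ipynb",
    "sql_generation_inference.ipynb",
]
for nb in notebooks:
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb],
        check=True,
    )
```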
For faster inference, you can use vLLM by running:
jupyter notebook inference_vllm.ipynb
Note: The results with vLLM may not be as accurate as with the fully fine-tuned models.
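For reference, here is a minimal vLLM generation sketch; the model path, prompt, and sampling settings are assumptions, and `inference_vllm.ipynb` configures the real ones:

```python
from vllm import LLM, SamplingParams

# Assumed model path for illustration -- point this at a merged fine-tuned checkpoint.
llm = LLM(model="Qwen/CodeQwen1.5-7B")
params = SamplingParams(temperature=0.0, max_tokens=256)

prompt = (
    "-- Schema: singer(singer_id, name)\n"
    "-- Question: How many singers are there?\n"
    "SELECT"
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```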
Download the required models from the following links:
- Stage 1 LoRA model: Google Drive
- Stage 2 LoRA model: Google Drive
- Stage 1 trained model: Hugging Face
- Stage 2 trained model: Hugging Face
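To fetch the Hugging Face checkpoints programmatically, a small sketch using `huggingface_hub`; the repository IDs below are hypothetical placeholders, so substitute the ones linked above:

```python
from huggingface_hub import snapshot_download

# Hypothetical repo IDs -- replace with the actual Hugging Face repositories above.
for repo_id, local_dir in [
    ("<stage1-trained-model-repo>", "stage1_model"),
    ("<stage2-trained-model-repo>", "stage2_model"),
]:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```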
Evaluation on Spider (https://yale-lily.github.io/spider)
Expected results are:
- Exact Match (EM): 0.736
- Execution Accuracy (EX): 0.813
- Upper-bound result: 0.876 with the ideal data setup described above
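For intuition, execution accuracy asks whether the predicted query returns the same rows as the gold query on the target database. The sketch below is a simplified version of that check; the official Spider evaluation script is more rigorous and also computes exact match:

```python
import sqlite3
from collections import Counter

def execution_match(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Simplified EX check: do both queries return the same multiset of rows?"""
    conn = sqlite3.connect(db_path)
    try:
        gold_rows = conn.execute(gold_sql).fetchall()
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # the predicted query does not even execute
    finally:
        conn.close()
    return Counter(gold_rows) == Counter(pred_rows)

# Example (path assumes the standard Spider layout under spider/database/):
# execution_match("spider/database/concert_singer/concert_singer.sqlite",
#                 "SELECT count(*) FROM singer", "SELECT count(singer_id) FROM singer")
```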