text2sql方法:基于ChatGPT的zero-shot方法C3

ChatGPT SQL

ChatGPT SQL出自2023年3月的论文《A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability》(github),这篇论文分析了ChatGPT的text2sql能力,实验结果表明ChatGPT的text2sql能力令人印象深刻,虽然没有达到当时的SOTA,但是无需训练性能也比SOTA低14%,且ChatGPT的鲁棒性与SOTA相比只有7.8%的差距。

使用zero-shot让ChatGPT生成SQL,prompt来自OpenAI展示的demo prompt,论文没有做特意的调整,如论文图1所示,图中上半部分是单轮场景下的text2sql,下半部分是多轮场景下的text2sql。

C3

text2sql方法C3出自2023年7月的论文《C3: Zero-shot Text-to-SQL with ChatGPT》(github), 它通过zero-shot的方式来prompt ChatGPT生成SQL。

C3的prompt包括3个组成部分:++C++ lear Prompting (CP), ++C++ alibration with Hints (CH), and ++C++onsistent Output (CO)

Clear Prompting:包括两个部分clear layout 和clear context,如论文图2©所示。

  • clear layout:用#将prompt里的指令、上下文(数据库schema)、问题分开。因为实验结果表明直接用论文图2(b)的指令可能会使生成的SQL有冗余列,所以在指令后添加了"and do not select extra columns that are not explicitly requested in the query"。
  • clear context:通过schema linking选择与问题相关的表和列。schema linking是通过zero-shot prompt ChatGPT来实现的,包括Table Recall 和 column Recall:

    • Table recall,zero-shop prompt ChatGPT分为3步来选择表。并用self-consistency来保证稳定性,即让ChatGPT生成10个结果集,每个结果中包括了top 4的表格,最后的结果为这10个结果集中出现最频繁的结果集。

      python 复制代码
      """
      Given the database schema and question, perform the following actions: 
      1 - Rank all the tables based on the possibility of being used in the SQL according to the question from the most relevant to the least relevant, Table or its column that matches more with the question words is highly relevant and must be placed ahead. 
      2 - Check whether you consider all the tables. 
      3 - Output a list object in the order of step 2, Your output should contain all the tables. The format should be like: 
      [
      "table_1", "table_2", ...
      ]
      
      Schema:
      # continents ( contid, continent )
      # countries ( countryid, countryname, continent )
      # car_makers ( id, maker, fullname, country )
      # model_list ( moddeli, maker, model )
      # car_names ( makeid, model, make )
      # cars_data ( id, mpg, cylinders, edispl, horsepower, weight, accelerate, year )
      Question:
      ### What is the name of the different car makers who produced a car in 1970?
      """
    • Column Recall,也通过zero-prompt来让ChatGPT分成两步来召回列。同样用self-consistency来保证稳定性,先让ChatGPT对每个表生成10个结果集,最后的结果为这10个结果集中出现最多频繁的5个列。

      python 复制代码
      """
      Given the database tables and question, perform the following actions: 
      1 - Rank the columns in each table based on the possibility of being used in the SQL, Column that matches more with the question words or the foreign key is highly relevant and must be placed ahead.
      You should output them in the order of the most relevant to the least relevant. 
      Explain why you choose each column. 
      2 - Output a JSON object that contains all the columns in each table according to your explanation. The format should be like: 
      { 
      "table_1": ["column_1", "column_2", ......],
      "table_2": ["column_1", "column_2", ......],
      "table_3": ["column_1", "column_2", ......],
      ...... 
      } 
      
      Schema: 
      # car_makers ( id, maker, fullname, country )
      # model_list ( modelid, maker, model )
      # car_names ( makeid, model, make ) 
      # cars_data ( id, mpg, cylinders, edispl, horsepower, weight, accelerate, year )
      Foreign keys: 
      # model_list.maker = car_makers.id 
      # car_names.model = model_list.model 
      # cars_data.id = car_names.makeid 
      
      Question:
      ### What is the name of the different car makers who produced a car in 1970?
      
      """

    Calibration with Hints : 通过对ChatGPT生成的SQL进行分析,发现它容易因为bias出现如论文图3所示的错误,所以在prompt里添加了如论文图1右上部分所示的两个提示。

Consistent Output:使用execution-based Self-consistency。先让LLM采样输出多个SQL结果,然后将这些生成的SQL查询在数据库上执行并记录执行结果,去掉错误记录后,通过对执行结果采取投票机制来选择最后SQL。

github issue 里有一个问题是关于执行时间的,作者回复如下:

Time taken for recalling table: approximately 7s per sample.

Time taken for recalling column: approximately 25s per sample.

Time taken for generating SQL: approximately 2s per sample.

The time spent also depends on the internet status and the rate limits of APl calls

在用self-consistency时,如issue作者所回复,通过ChatGPT api里的参数n来一次生成多个结果,对于n个输入是共享同一个输入token的。

相关推荐
sca1p315 小时前
新南威尔士大学 LiM
论文阅读·人工智能·加密流量分类
m0_650108247 小时前
Lift, Splat, Shoot:自动驾驶多视图相机的 BEV 语义表示学习
论文阅读·自动驾驶·数据驱动·lss·纯视觉bev感知·bev 语义分割·可解释的端到端轨迹规划
m0_6501082411 小时前
Sparse4D v3:端到端 3D 检测与跟踪的技术突破
论文阅读·自动驾驶·sparse4d v3·端到端3d感知框架·去噪思想·端到端跟踪·纯视觉感知
smilejingwei1 天前
Text2SQL 破局技术解析之二:MQL 实现与复杂性
ai·text2sql·spl·chatbi
m0_650108241 天前
VADv2:基于概率规划的端到端矢量化自动驾驶
论文阅读·自动驾驶·端到端矢量化·驾驶场景中的不确定性·概率场建模·多模态编码·vadv2
提娜米苏1 天前
[论文笔记] End-to-End Audiovisual Fusion with LSTMs
论文阅读·深度学习·lstm·语音识别·论文笔记·多模态
m0_650108241 天前
DiffusionDrive:面向端到端自动驾驶的截断扩散模型
论文阅读·扩散模型·端到端自动驾驶·阶段扩散策略·高级联扩散解码器·cvpr2025
提娜米苏1 天前
[论文笔记] 基于 LSTM 的端到端视觉语音识别 (End-to-End Visual Speech Recognition with LSTMs)
论文阅读·深度学习·计算机视觉·lstm·语音识别·视觉语音识别
m0_650108241 天前
BEVDet:鸟瞰图视角下的高性能多相机 3D 目标检测
论文阅读·bevdet·bev视角·3d目标检测范式·多任务统一框架·bev语言分割
STLearner2 天前
AI论文速读 | U-Cast:学习高维时间序列预测的层次结构
大数据·论文阅读·人工智能·深度学习·学习·机器学习·数据挖掘