text2sql方法:基于ChatGPT的zero-shot方法C3

ChatGPT SQL

ChatGPT SQL出自2023年3月的论文《A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability》(github),这篇论文分析了ChatGPT的text2sql能力,实验结果表明ChatGPT的text2sql能力令人印象深刻,虽然没有达到当时的SOTA,但是无需训练性能也比SOTA低14%,且ChatGPT的鲁棒性与SOTA相比只有7.8%的差距。

使用zero-shot让ChatGPT生成SQL,prompt来自OpenAI展示的demo prompt,论文没有做特意的调整,如论文图1所示,图中上半部分是单轮场景下的text2sql,下半部分是多轮场景下的text2sql。

C3

text2sql方法C3出自2023年7月的论文《C3: Zero-shot Text-to-SQL with ChatGPT》(github), 它通过zero-shot的方式来prompt ChatGPT生成SQL。

C3的prompt包括3个组成部分:++C++ lear Prompting (CP), ++C++ alibration with Hints (CH), and ++C++onsistent Output (CO)

Clear Prompting:包括两个部分clear layout 和clear context,如论文图2©所示。

  • clear layout:用#将prompt里的指令、上下文(数据库schema)、问题分开。因为实验结果表明直接用论文图2(b)的指令可能会使生成的SQL有冗余列,所以在指令后添加了"and do not select extra columns that are not explicitly requested in the query"。
  • clear context:通过schema linking选择与问题相关的表和列。schema linking是通过zero-shot prompt ChatGPT来实现的,包括Table Recall 和 column Recall:

    • Table recall,zero-shop prompt ChatGPT分为3步来选择表。并用self-consistency来保证稳定性,即让ChatGPT生成10个结果集,每个结果中包括了top 4的表格,最后的结果为这10个结果集中出现最频繁的结果集。

      python 复制代码
      """
      Given the database schema and question, perform the following actions: 
      1 - Rank all the tables based on the possibility of being used in the SQL according to the question from the most relevant to the least relevant, Table or its column that matches more with the question words is highly relevant and must be placed ahead. 
      2 - Check whether you consider all the tables. 
      3 - Output a list object in the order of step 2, Your output should contain all the tables. The format should be like: 
      [
      "table_1", "table_2", ...
      ]
      
      Schema:
      # continents ( contid, continent )
      # countries ( countryid, countryname, continent )
      # car_makers ( id, maker, fullname, country )
      # model_list ( moddeli, maker, model )
      # car_names ( makeid, model, make )
      # cars_data ( id, mpg, cylinders, edispl, horsepower, weight, accelerate, year )
      Question:
      ### What is the name of the different car makers who produced a car in 1970?
      """
    • Column Recall,也通过zero-prompt来让ChatGPT分成两步来召回列。同样用self-consistency来保证稳定性,先让ChatGPT对每个表生成10个结果集,最后的结果为这10个结果集中出现最多频繁的5个列。

      python 复制代码
      """
      Given the database tables and question, perform the following actions: 
      1 - Rank the columns in each table based on the possibility of being used in the SQL, Column that matches more with the question words or the foreign key is highly relevant and must be placed ahead.
      You should output them in the order of the most relevant to the least relevant. 
      Explain why you choose each column. 
      2 - Output a JSON object that contains all the columns in each table according to your explanation. The format should be like: 
      { 
      "table_1": ["column_1", "column_2", ......],
      "table_2": ["column_1", "column_2", ......],
      "table_3": ["column_1", "column_2", ......],
      ...... 
      } 
      
      Schema: 
      # car_makers ( id, maker, fullname, country )
      # model_list ( modelid, maker, model )
      # car_names ( makeid, model, make ) 
      # cars_data ( id, mpg, cylinders, edispl, horsepower, weight, accelerate, year )
      Foreign keys: 
      # model_list.maker = car_makers.id 
      # car_names.model = model_list.model 
      # cars_data.id = car_names.makeid 
      
      Question:
      ### What is the name of the different car makers who produced a car in 1970?
      
      """

    Calibration with Hints : 通过对ChatGPT生成的SQL进行分析,发现它容易因为bias出现如论文图3所示的错误,所以在prompt里添加了如论文图1右上部分所示的两个提示。

Consistent Output:使用execution-based Self-consistency。先让LLM采样输出多个SQL结果,然后将这些生成的SQL查询在数据库上执行并记录执行结果,去掉错误记录后,通过对执行结果采取投票机制来选择最后SQL。

github issue 里有一个问题是关于执行时间的,作者回复如下:

Time taken for recalling table: approximately 7s per sample.

Time taken for recalling column: approximately 25s per sample.

Time taken for generating SQL: approximately 2s per sample.

The time spent also depends on the internet status and the rate limits of APl calls

在用self-consistency时,如issue作者所回复,通过ChatGPT api里的参数n来一次生成多个结果,对于n个输入是共享同一个输入token的。

相关推荐
张较瘦_3 小时前
[论文阅读] 人工智能+项目管理 | 当 PMBOK 遇见 AI:传统项目管理框架的破局之路
论文阅读·人工智能
张较瘦_3 小时前
[论文阅读] 人工智能 | 大语言模型计划生成的新范式:基于过程挖掘的技能学习
论文阅读·人工智能·语言模型
0x2118 小时前
[论文阅读]TrustRAG: Enhancing Robustness and Trustworthiness in RAG
论文阅读
要努力啊啊啊12 小时前
强化学习基础概念图文版笔记
论文阅读·人工智能·笔记·深度学习·语言模型·自然语言处理
二进制的Liao13 小时前
【数据分析】什么是鲁棒性?
运维·论文阅读·算法·数学建模·性能优化·线性回归·负载均衡
Jamence1 天前
多模态大语言模型arxiv论文略读(108)
论文阅读·人工智能·语言模型·自然语言处理·论文笔记
张较瘦_1 天前
[论文阅读] 人工智能 | 用大语言模型解决软件元数据“身份谜题”:科研软件的“认脸”新方案
论文阅读·人工智能·语言模型
Jamence1 天前
多模态大语言模型arxiv论文略读(106)
论文阅读·人工智能·语言模型·自然语言处理·论文笔记
崔高杰1 天前
To be or Not to be, That‘s a Token——论文阅读笔记——Beyond the 80/20 Rule和R2R
论文阅读·笔记
张较瘦_1 天前
[论文阅读] 人工智能+软件工程 | 用大模型优化软件性能
论文阅读·人工智能·软件工程