Contents
- AI (Study Notes, Lesson 17): langchain v1.0 (SQL Agent)
  - 1. The `sql agent` in `langchain v1.0`
    - 1.1 Overview of the `sql agent`
    - 1.2 Sample code for the `sql agent`
    - 1.3 Sample `database` for the `sql agent`
  - 2. Code walkthrough
    - 2.1 Configuring the LLM and `langsmith`
    - 2.2 Configuring the `sqlite` `database`
    - 2.3 Generating the `tools` with the LLM
    - 2.4 Providing the `system prompt`
    - 2.5 Creating the `sql agent`
    - 2.6 Preparing the user's question about the `database`
    - 2.7 Putting the question to the `sql agent`
  - 3 Checking the execution results
    - 3.1 `human message` (`human`→`AI`): asking the question
    - 3.2 `AI message` (`AI`→`AI database tool`): requesting all tables in the database
    - 3.3 `tool message` (`AI database tool`→`AI`): the `AI tool` returns all tables in the database
    - 3.4 `AI message` (`AI`→`AI database tool`): the `AI` analyzes and requests the `schema` of the relevant tables
    - 3.5 `tool message` (`AI database tool`→`AI`): the `AI tool` returns the `schema` of the relevant tables
    - 3.6 `AI message` (`AI`): the `AI` analyzes further and drafts the `sql` statement
    - 3.6 `tool message` (`AI database tool`): `sql_db_query_checker` returns its check result
    - 3.7 `AI message` (`AI`): the LLM calls `sql_db_query` to run the `db query`
    - 3.8 `tool message` (`AI database tool`): the `AI tool` queries the database and returns the result
    - 3.9 `AI message` (`AI`): the `AI` analyzes the result
    - 3.10 `AI tool message` (`AI database tool`): the `AI tool` returns the full query result
    - 3.11 `AI message` (`AI`): the `AI` gives its final, complete answer
  - 4. Next steps
AI (Study Notes, Lesson 17): langchain v1.0 (SQL Agent)
- The sql agent in langchain v1.0
- Configuring the database
- Setting up SQLDatabaseToolkit for the model
- Interaction between the model and the database
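1. The sql agent in langchain v1.0
1.1 Overview of the sql agent
langchain v1.0 sql agent
Through the following process, the AI can interact with the database, understand the user's natural language, and run sql queries on its own initiative.
- Fetch the available tables from the database, along with the schemas of all tables
- Decide which tables are relevant to the current query
- Fetch the schemas of the relevant tables
- Generate sql based on the question (note that the database's dialect must be specified here)
- Double check the generated sql
- Execute the sql and return the results
- If the sql has errors, correct them and query again until execution succeeds
- Formulate the results and return them to the user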
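The "execute, correct, query again" part of this process can be sketched in plain Python. This is only an illustration under stated assumptions: `run_with_retry` and `fix_query` are hypothetical names, the table is a toy in-memory one, and `fix_query` stands in for the model, which in the real agent is what rewrites a failing statement.

```python
import sqlite3


def run_with_retry(conn, sql, fix_query, max_attempts=3):
    """Execute sql; on a database error, ask fix_query for a corrected statement and retry."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            last_error = exc
            # Feed the error text back, as the agent feeds it back to the model.
            sql = fix_query(sql, str(exc))
    raise RuntimeError(f"query still failing after {max_attempts} attempts: {last_error}")


# Toy in-memory database (not the Chinook file used in this article).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genres (GenreId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany("INSERT INTO genres VALUES (?, ?)", [(1, "Rock"), (2, "Jazz")])


def fix_query(sql, error):
    # Hypothetical corrector standing in for the model: repairs one known typo.
    return sql.replace("genre ", "genres ")


rows = run_with_retry(conn, "SELECT Name FROM genre ORDER BY GenreId", fix_query)
print(rows)  # [('Rock',), ('Jazz',)]
```

The first attempt fails with "no such table: genre", the corrector rewrites the statement, and the second attempt succeeds.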
1.2 Sample code for the sql agent
1.3 Sample database for the sql agent
2. Code walkthrough
2.1 Configuring the LLM and langsmith
python
import os
from langchain_openai import ChatOpenAI
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain.agents import create_agent
# DeepSeek API
model = ChatOpenAI(
    api_key='sk-xxxxxxx',  # placeholder; substitute your own key
    base_url='https://api.deepseek.com/v1',
    model='deepseek-chat',  # or another DeepSeek model
)
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "lsv2_xxxxx"
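The keys above are written directly into the script as placeholders. A common alternative is to read them from the environment so they never land in the git repository. The sketch below is one way to do that; `require_env` and the variable name `DEEPSEEK_API_KEY` are assumptions made for illustration, not names defined by langchain or DeepSeek.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, failing loudly if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# For this demo only: provide a placeholder so the snippet runs standalone.
os.environ.setdefault("DEEPSEEK_API_KEY", "sk-xxxxxxx")

api_key = require_env("DEEPSEEK_API_KEY")
# The value could then be passed as ChatOpenAI(api_key=api_key, ...).
```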
2.2 Configuring the sqlite database
python
# Get the directory of the current file, then build the database path
current_dir = os.path.dirname(os.path.abspath(__file__))
db_path = os.path.join(current_dir, "02_chinook.db")  # adjust to the actual location
db_uri = f"sqlite:///{db_path}"
db = SQLDatabase.from_uri(db_uri)
print(f"Dialect: {db.dialect}")
print(f"Available tables: {db.get_usable_table_names()}")
print(f'Sample output: {db.run("SELECT * FROM Artists LIMIT 5;")}')
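This example uses a local sqlite database for a simple database exercise. The database file is already included in the git repository, so nothing needs to be prepared separately.
Running the program prints: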
json
Dialect: sqlite
Available tables: ['albums', 'artists', 'customers', 'employees', 'genres', 'invoice_items', 'invoices', 'media_types', 'playlist_track', 'playlists', 'tracks']
Sample output: [(1, 'AC/DC'), (2, 'Accept'), (3, 'Aerosmith'), (4, 'Alanis Morissette'), (5, 'Alice In Chains')]
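To see where a table list like this comes from, the same information can be read with the standard-library `sqlite3` module. This is a stand-in for illustration, not a description of how `SQLDatabase` works internally; the file `demo.db` and its two tables are made up so the snippet runs without the Chinook file.

```python
import os
import sqlite3
import tempfile

# Build a throwaway database file (hypothetical; not 02_chinook.db).
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)")
conn.commit()

# Same URI shape as in the code above: "sqlite:///" followed by the file path.
db_uri = f"sqlite:///{db_path}"

# SQLite keeps its catalogue in the sqlite_master table.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(db_uri)
print(table_names)  # ['albums', 'artists']
conn.close()
```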
2.3 Generating the tools with the LLM
python
toolkit = SQLDatabaseToolkit(db=db, llm=model)
tools = toolkit.get_tools()
for tool in tools:
    print(f"{tool.name}: {tool.description}\n")
Output:
json
sql_db_query: Input to this tool is a detailed and correct SQL query, output is a result from the database. If the query is not correct, an error message will be returned. If an error is returned, rewrite the query, check the query, and try again. If you encounter an issue with Unknown column 'xxxx' in 'field list', use sql_db_schema to query the correct table fields.
sql_db_schema: Input to this tool is a comma-separated list of tables, output is the schema and sample rows for those tables. Be sure that the tables actually exist by calling sql_db_list_tables first! Example Input: table1, table2, table3
sql_db_list_tables: Input is an empty string, output is a comma-separated list of tables in the database.
sql_db_query_checker: Use this tool to double check if your query is correct before executing it. Always use this tool before executing a query with sql_db_query!
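As shown, four database tools are provided:
- sql_db_query runs a sql query against the database
- sql_db_schema returns the schemas and sample rows for the given tables
- sql_db_list_tables lists all tables in the database
- sql_db_query_checker checks a sql statement
In short, SQLDatabaseToolkit, given the db and the model, supplies these four database tools.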
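The "check first, then execute" pairing of `sql_db_query_checker` and `sql_db_query` can be imitated without an LLM. The sketch below swaps in a different technique: it asks SQLite to compile the statement with `EXPLAIN` instead of asking a model to review it, so it only catches errors the database itself can detect. `check_query` and the toy `tracks` table are assumptions for illustration.

```python
import sqlite3


def check_query(conn, sql):
    """Return None if SQLite can compile the statement, otherwise the error text."""
    try:
        # EXPLAIN compiles the statement without running the query itself.
        conn.execute(f"EXPLAIN {sql}")
        return None
    except sqlite3.Error as exc:
        return str(exc)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, Milliseconds INTEGER)")
conn.execute("INSERT INTO tracks VALUES (1, 343719)")

good = "SELECT Milliseconds FROM tracks"
bad = "SELECT Millis FROM tracks"  # misspelled column

good_error = check_query(conn, good)
bad_error = check_query(conn, bad)
print(good_error)  # None
print(bad_error)   # no such column: Millis

# Only execute statements that passed the check.
if good_error is None:
    result = conn.execute(good).fetchall()
    print(result)  # [(343719,)]
```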
2.4 Providing the system prompt
python
system_prompt = """
You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run,
then look at the results of the query and return the answer. Unless the user
specifies a specific number of examples they wish to obtain, always limit your
query to at most {top_k} results.
You can order the results by a relevant column to return the most interesting
examples in the database. Never query for all the columns from a specific table,
only ask for the relevant columns given the question.
You MUST double check your query before executing it. If you get an error while
executing a query, rewrite the query and try again.
DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the
database.
To start you should ALWAYS look at the tables in the database to see what you
can query. Do NOT skip this step.
Then you should query the schema of the most relevant tables.
""".format(
    dialect=db.dialect,
    top_k=5,
)
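The two placeholders in the prompt are filled in by `str.format`. The shortened template below is a made-up excerpt, used only to show the substitution and one thing to watch for: any literal brace in such a template has to be doubled, otherwise `format` treats it as a placeholder.

```python
# Shortened, hypothetical excerpt of a prompt template.
template = (
    "Create a syntactically correct {dialect} query. "
    "Limit your query to at most {top_k} results."
)
prompt = template.format(dialect="sqlite", top_k=5)
print(prompt)
# Create a syntactically correct sqlite query. Limit your query to at most 5 results.

# Literal braces must be written as {{ and }}.
json_hint = 'Answer as {{"genre": "..."}} using {dialect}.'.format(dialect="sqlite")
print(json_hint)
# Answer as {"genre": "..."} using sqlite.
```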
2.5 Creating the sql agent
python
agent = create_agent(
    model,
    tools,
    system_prompt=system_prompt,
)
The arguments are:
- the LLM (model)
- the tools
- the system prompt (system_prompt)
2.6 Preparing the user's question about the database
python
question = "Which genre on average has the longest tracks?"
The sample database here is a music catalogue covering genres (genre), tracks (individual songs), and so on.
2.7 Putting the question to the sql agent
python
for step in agent.stream(
    {"messages": [{"role": "user", "content": question}]},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()
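With `stream_mode="values"`, each streamed step carries the whole message list so far, which is why the loop prints only the last element. The generator below is a hypothetical stand-in that imitates that shape with plain tuples; it does not call langchain, and the message texts are abbreviated for the demo.

```python
def fake_stream(messages):
    """Hypothetical stand-in: yield the full, growing state after every new message."""
    state = []
    for message in messages:
        state = state + [message]  # new list each step, like a fresh state snapshot
        yield {"messages": state}


conversation = [
    ("human", "Which genre on average has the longest tracks?"),
    ("ai", "call sql_db_list_tables"),
    ("tool", "genres, tracks"),
]

latest = []
sizes = []
for step in fake_stream(conversation):
    latest.append(step["messages"][-1])  # the newest message, as in the loop above
    sizes.append(len(step["messages"]))  # the state keeps growing

print(sizes)  # [1, 2, 3]
```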
3 Checking the execution results
Let's go through the output step by step to see how the sql agent works.
3.1 human message (human→AI): asking the question
json
================================ Human Message =================================
Which genre on average has the longest tracks?
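The human message puts the question to the AI.
The question is which music genre has the longest tracks on average. The AI is, of course, expected to look this up in the given music database.
3.2 AI message (AI→AI database tool): requesting all tables in the database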
json
================================== Ai Message ==================================
I'll help you find which genre on average has the longest tracks. Let me start by exploring the database structure.
Tool Calls:
sql_db_list_tables (call_00_q8ks92xNcHhSoGW9tgsJPPNm)
Call ID: call_00_q8ks92xNcHhSoGW9tgsJPPNm
Args:
tool_input:
3.3 tool message (AI database tool→AI): the AI tool returns all tables in the database
python
================================= Tool Message =================================
Name: sql_db_list_tables
albums, artists, customers, employees, genres, invoice_items, invoices, media_types, playlist_track, playlists, tracks
The AI database tool has retrieved the names of all tables in the database.
3.4 AI message (AI→AI database tool): the AI analyzes and requests the schema of the relevant tables
python
================================== Ai Message ==================================
Now let me look at the schema for the relevant tables - particularly `genres` and `tracks` tables since we need to analyze track lengths by genre.
Tool Calls:
sql_db_schema (call_00_nlSRcRvpO3yQAP1otxxWkbRw)
Call ID: call_00_nlSRcRvpO3yQAP1otxxWkbRw
Args:
table_names: genres, tracks
After analyzing the table list, the AI calls the AI tool again to fetch the schemas of the relevant tables.
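Each "Tool Calls" entry in these messages boils down to a tool name, a call ID, and a dictionary of arguments. How such a request is routed to a function and answered can be sketched as follows. Everything here is a simplified assumption for illustration: `SCHEMAS`, `REGISTRY`, `dispatch`, and the shortened `CREATE TABLE` strings are hypothetical, and the real agent does this routing for you.

```python
# Hypothetical, shortened schema strings for two tables.
SCHEMAS = {
    "genres": 'CREATE TABLE genres ("GenreId" INTEGER NOT NULL, "Name" NVARCHAR(120))',
    "tracks": 'CREATE TABLE tracks ("TrackId" INTEGER NOT NULL, "GenreId" INTEGER)',
}


def list_tables(tool_input: str = "") -> str:
    return ", ".join(sorted(SCHEMAS))


def table_schema(table_names: str) -> str:
    # The argument arrives as one comma-separated string, as in the log above.
    wanted = [name.strip() for name in table_names.split(",")]
    return "\n".join(SCHEMAS[name] for name in wanted)


REGISTRY = {"sql_db_list_tables": list_tables, "sql_db_schema": table_schema}


def dispatch(tool_call: dict) -> dict:
    """Look the tool up by name, call it with the given args, and wrap the result."""
    result = REGISTRY[tool_call["name"]](**tool_call["args"])
    return {
        "role": "tool",
        "name": tool_call["name"],
        "tool_call_id": tool_call["id"],  # lets the model match reply to request
        "content": result,
    }


call = {"name": "sql_db_schema", "id": "call_demo_1", "args": {"table_names": "genres, tracks"}}
reply = dispatch(call)
print(reply["content"])
```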
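3.5 tool message (AI database tool→AI): the AI tool returns the schema of the relevant tables
Here the AI database tool returns two kinds of information to the AI:
- the schema of each table
- sample data for each table
- ideally there would also be supplementary descriptions of the individual columns, but this example has none
With this data, the AI model has enough information to understand the structure of the whole database.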
json
================================= Tool Message =================================
Name: sql_db_schema
CREATE TABLE genres (
"GenreId" INTEGER NOT NULL,
"Name" NVARCHAR(120),
PRIMARY KEY ("GenreId")
)
/*
3 rows from genres table:
GenreId Name
1 Rock
2 Jazz
3 Metal
*/
CREATE TABLE tracks (
"TrackId" INTEGER NOT NULL,
"Name" NVARCHAR(200) NOT NULL,
"AlbumId" INTEGER,
"MediaTypeId" INTEGER NOT NULL,
"GenreId" INTEGER,
"Composer" NVARCHAR(220),
"Milliseconds" INTEGER NOT NULL,
"Bytes" INTEGER,
"UnitPrice" NUMERIC(10, 2) NOT NULL,
PRIMARY KEY ("TrackId"),
FOREIGN KEY("MediaTypeId") REFERENCES media_types ("MediaTypeId"),
FOREIGN KEY("GenreId") REFERENCES genres ("GenreId"),
FOREIGN KEY("AlbumId") REFERENCES albums ("AlbumId")
)
/*
3 rows from tracks table:
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
1 For Those About To Rock (We Salute You) 1 1 1 Angus Young, Malcolm Young, Brian Johnson 343719 11170334 0.99
2 Balls to the Wall 2 2 1 None 342562 5510424 0.99
3 Fast As a Shark 3 2 1 F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman 230619 3990994 0.99
*/
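A "schema plus sample rows" text of this kind can be produced from SQLite directly. The sketch below is an approximation built on the standard `sqlite3` module, not the code behind `sql_db_schema`; `describe_table` is a hypothetical helper and the `genres` rows are a small made-up subset.

```python
import sqlite3


def describe_table(conn, table, sample_rows=3):
    """Return the CREATE TABLE statement followed by a few sample rows in a comment."""
    (ddl,) = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT {int(sample_rows)}')
    header = "\t".join(col[0] for col in cursor.description)
    body = "\n".join("\t".join(str(value) for value in row) for row in cursor)
    return f"{ddl}\n/*\n{sample_rows} rows from {table} table:\n{header}\n{body}\n*/"


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE genres ("GenreId" INTEGER NOT NULL, "Name" NVARCHAR(120), '
    'PRIMARY KEY ("GenreId"))'
)
conn.executemany(
    "INSERT INTO genres VALUES (?, ?)",
    [(1, "Rock"), (2, "Jazz"), (3, "Metal"), (4, "Blues")],
)

text = describe_table(conn, "genres")
print(text)
```

Showing a handful of rows next to the schema gives the model concrete values to look at, such as what a genre name looks like, without sending the whole table.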
3.6 AI message (AI): the AI analyzes further and drafts the sql statement
json
================================== Ai Message ==================================
Perfect! I can see that:
1. The `tracks` table has a `GenreId` column that references the `genres` table
2. The `tracks` table has a `Milliseconds` column which represents the track length
3. The `genres` table has `GenreId` and `Name` columns
Now I need to write a query that:
4. Joins the `tracks` and `genres` tables
5. Groups by genre
6. Calculates the average track length in milliseconds for each genre
7. Orders by average track length descending to find the genre with the longest average tracks
Let me first check my query before executing it:
Tool Calls:
sql_db_query_checker (call_00_rJzx7nT7IpvLXemzJhCYi9jG)
Call ID: call_00_rJzx7nT7IpvLXemzJhCYi9jG
Args:
query: SELECT
g.Name AS Genre,
AVG(t.Milliseconds) AS AverageTrackLengthMs,
AVG(t.Milliseconds) / 1000.0 AS AverageTrackLengthSeconds,
AVG(t.Milliseconds) / 60000.0 AS AverageTrackLengthMinutes
FROM tracks t
JOIN genres g ON t.GenreId = g.GenreId
GROUP BY g.GenreId, g.Name
ORDER BY AverageTrackLengthMs DESC
LIMIT 5
Here we can see the reasoning the AI goes through before writing the sql. It lays out four steps:
- Joins the tracks and genres tables
- Groups by genre
- Calculates the average track length in milliseconds for each genre
- Orders by average track length descending to find the genre with the longest average tracks
Just in case, though, the AI model next calls sql_db_query_checker to check this sql.
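The four steps can be tried on a toy database. The rows below are made up for the demo (they are not Chinook data), and the query is a trimmed version of the one the AI drafted, keeping the join, the grouping, the average, and the descending order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genres (GenreId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, GenreId INTEGER, Milliseconds INTEGER);
INSERT INTO genres VALUES (1, 'Rock'), (2, 'Jazz');
INSERT INTO tracks VALUES (1, 1, 240000), (2, 1, 300000), (3, 2, 360000);
""")

rows = conn.execute("""
SELECT g.Name AS Genre,
       AVG(t.Milliseconds) AS AverageTrackLengthMs,
       AVG(t.Milliseconds) / 60000.0 AS AverageTrackLengthMinutes
FROM tracks t
JOIN genres g ON t.GenreId = g.GenreId      -- step 1: join
GROUP BY g.GenreId, g.Name                  -- step 2: group by genre
ORDER BY AverageTrackLengthMs DESC          -- step 4: longest average first
""").fetchall()

print(rows)  # [('Jazz', 360000.0, 6.0), ('Rock', 270000.0, 4.5)]
```

Rock averages (240000 + 300000) / 2 = 270000 ms, which is 4.5 minutes, so the single six-minute Jazz track puts Jazz first.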
3.6 tool message (AI database tool): sql_db_query_checker returns its check result
json
================================= Tool Message =================================
Name: sql_db_query_checker
```sql
SELECT
g.Name AS Genre,
AVG(t.Milliseconds) AS AverageTrackLengthMs,
AVG(t.Milliseconds) / 1000.0 AS AverageTrackLengthSeconds,
AVG(t.Milliseconds) / 60000.0 AS AverageTrackLengthMinutes
FROM tracks t
JOIN genres g ON t.GenreId = g.GenreId
GROUP BY g.GenreId, g.Name
ORDER BY AverageTrackLengthMs DESC
LIMIT 5
```
3.7 AI message (AI): the LLM calls sql_db_query to run the db query
python
================================== Ai Message ==================================
The query looks good. Now let me execute it:
Tool Calls:
sql_db_query (call_00_skJHEgeGsjap4lhoI1X5ohvx)
Call ID: call_00_skJHEgeGsjap4lhoI1X5ohvx
Args:
query: SELECT
g.Name AS Genre,
AVG(t.Milliseconds) AS AverageTrackLengthMs,
AVG(t.Milliseconds) / 1000.0 AS AverageTrackLengthSeconds,
AVG(t.Milliseconds) / 60000.0 AS AverageTrackLengthMinutes
FROM tracks t
JOIN genres g ON t.GenreId = g.GenreId
GROUP BY g.GenreId, g.Name
ORDER BY AverageTrackLengthMs DESC
LIMIT 5
3.8 tool message (AI database tool): the AI tool queries the database and returns the result
json
================================= Tool Message =================================
Name: sql_db_query
[('Sci Fi & Fantasy', 2911783.0384615385, 2911.7830384615386, 48.52971730769231),
('Science Fiction', 2625549.076923077, 2625.549076923077, 43.759151282051285),
('Drama', 2575283.78125, 2575.28378125, 42.92139635416667),
('TV Shows', 2145041.0215053763, 2145.041021505376, 35.75068369175627),
('Comedy', 1585263.705882353, 1585.263705882353, 26.421061764705883)]
3.9 AI message (AI): the AI analyzes the result
json
================================== Ai Message ==================================
Great! The results show that **Sci Fi & Fantasy** has the longest average track length at approximately:
- 2,911,783 milliseconds (2,911.8 seconds)
- 48.53 minutes on average
However, I notice these genres seem unusual for music tracks (Sci Fi & Fantasy, Science Fiction, Drama, TV Shows, Comedy). These might be audiobooks or spoken word content rather than traditional music genres. Let me check if there are more traditional music genres with longer average track lengths by looking at all genres:
Tool Calls:
sql_db_query (call_00_K3SlmvjpbSFh6GR8qESVMjWu)
Call ID: call_00_K3SlmvjpbSFh6GR8qESVMjWu
Args:
query: SELECT
g.Name AS Genre,
AVG(t.Milliseconds) AS AverageTrackLengthMs,
AVG(t.Milliseconds) / 60000.0 AS AverageTrackLengthMinutes,
COUNT(t.TrackId) AS NumberOfTracks
FROM tracks t
JOIN genres g ON t.GenreId = g.GenreId
GROUP BY g.GenreId, g.Name
ORDER BY AverageTrackLengthMs DESC
Although the AI had already reached a conclusion, it ran one more analysis to be careful.
sql
SELECT
g.Name AS Genre,
AVG(t.Milliseconds) AS AverageTrackLengthMs,
AVG(t.Milliseconds) / 60000.0 AS AverageTrackLengthMinutes,
COUNT(t.TrackId) AS NumberOfTracks
FROM tracks t
JOIN genres g ON t.GenreId = g.GenreId
GROUP BY g.GenreId, g.Name
ORDER BY AverageTrackLengthMs DESC
This time the query returns the full result set instead of applying limit 5, and it adds a track count per genre.
3.10 AI tool message (AI database tool): the AI tool returns the full query result
json
[('Sci Fi & Fantasy', 2911783.0384615385, 48.52971730769231, 26),
('Science Fiction', 2625549.076923077, 43.759151282051285, 13),
('Drama', 2575283.78125, 42.92139635416667, 64),
('TV Shows', 2145041.0215053763, 35.75068369175627, 93),
('Comedy', 1585263.705882353, 26.421061764705883, 17),
('Metal', 309749.4438502674, 5.162490730837789, 374),
('Electronica/Dance', 302985.8, 5.049763333333333, 30),
('Heavy Metal', 297452.9285714286, 4.95754880952381, 28),
('Classical', 293867.5675675676, 4.897792792792793, 74),
('Jazz', 291755.3769230769, 4.862589615384615, 130),
('Rock', 283910.0431765613, 4.7318340529426886, 1297),
('Blues', 270359.77777777775, 4.505996296296296, 81),
('Alternative', 264058.525, 4.400975416666667, 40),
('Reggae', 247177.75862068965, 4.119629310344828, 58),
('Soundtrack', 244370.88372093023, 4.072848062015503, 43),
('Alternative & Punk', 234353.84939759035, 3.9058974899598393, 332),
('Latin', 232859.26252158894, 3.8809877086931492, 579),
('Pop', 229034.10416666666, 3.817235069444444, 48),
('World', 224923.82142857142, 3.748730357142857, 28),
('R&B/Soul', 220066.8524590164, 3.66778087431694, 61),
('Bossa Nova', 219590.0, 3.6598333333333333, 15),
('Easy Listening', 189164.20833333334, 3.1527368055555556, 24),
('Hip Hop/Rap', 178176.2857142857, 2.969604761904762, 35),
('Opera', 174813.0, 2.91355, 1),
('Rock And Roll', 134643.5, 2.2440583333333333, 12)]
3.11 AI message (AI): the AI gives its final, complete answer
json
================================== Ai Message ==================================
Now I can see the complete picture. The top 5 genres with the longest average track lengths are all non-music content (audiobooks/spoken word). Among traditional music genres, the ones with the longest average track lengths are:
1. **Metal** - 309,749 ms (5.16 minutes)
2. **Electronica/Dance** - 302,986 ms (5.05 minutes)
3. **Heavy Metal** - 297,453 ms (4.96 minutes)
4. **Classical** - 293,868 ms (4.90 minutes)
5. **Jazz** - 291,755 ms (4.86 minutes)
So, to answer your question:
**Among traditional music genres, Metal has the longest average track length at approximately 5.16 minutes.**
However, if we include all content types in the database (including audiobooks and spoken word), then **Sci Fi & Fantasy** has by far the longest average track length at approximately 48.53 minutes.
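The sql agent built from the AI plus the AI database tools proves very capable: it explored the tables, fetched the schemas, drafted and checked the sql, ran it, and then qualified its own answer.
4. Next steps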
