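A previous post explored using an LLM to extract simple numeric values from long text and run calculations on them.
https://blog.csdn.net/liliang199/article/details/159244753
This post goes a step further: extracting and calculating more complex values that span two documents.
The materials and code are adapted from online sources.
1 Obtaining the documents
1.1 Downloading the data
Here we obtain the text versions of Apple's 2022 and 2023 10-K filings from SEC EDGAR.
The links are as follows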
aapl-20220924
https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm
aapl-20230930
https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm
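To keep things simple, open the links above directly, select all the text, copy it, and paste it into local files.
Save them as aapl-20220924.txt and aapl-20230930.txt respectively.
The two documents are then merged. If their combined token count is within 128K, they can simply be concatenated.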
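# Read the two pasted 10-K text files (UTF-8 is assumed for the saved files)
with open("aapl-20230930.txt", "r", encoding="utf-8") as f:
    text_2023 = f.read()
with open("aapl-20220924.txt", "r", encoding="utf-8") as f:
    text_2022 = f.read()
print(f"2023 length: {len(text_2023)} characters")
print(f"2022 length: {len(text_2022)} characters")
The output is as follows
2023 length: 203704 characters
2022 length: 218592 characters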
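1.2 Estimating the token count
Here tiktoken is used to estimate the total token count of the two documents combined. The example program is shown below.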
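import tiktoken

def num_tokens(text):
    # cl100k_base gives an estimate; the target model's tokenizer may differ
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

tokens_2023 = num_tokens(text_2023)
tokens_2022 = num_tokens(text_2022)
print(f"2023 tokens: {tokens_2023}, 2022 tokens: {tokens_2022}, total: {tokens_2022 + tokens_2023}")
# 120000 leaves some headroom below the 128K window for the instructions and the reply
if tokens_2022 + tokens_2023 < 120000:
    combined_text = "=== Apple Inc. fiscal year 2022 10-K ===\n" + text_2022 + "\n\n=== Apple Inc. fiscal year 2023 10-K ===\n" + text_2023
else:
    # Over the limit: truncate or use RAG
    combined_text = text_2022[:60000] + text_2023[:60000]  # naive truncation by characters (not tokens); may lose information
The output is shown below: about 92k tokens in total, which fits within the 128k window.
2023 tokens: 45185, 2022 tokens: 47670, total: 92855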
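The fallback branch above slices by characters, so its 60000 limit does not correspond to a token count. A minimal sketch of truncating by tokens instead is shown below. The helper name `truncate_to_tokens` and the whitespace tokenizer are illustrative assumptions, used only so the snippet runs without extra packages.

```python
from typing import Callable, List, Sequence


def truncate_to_tokens(
    text: str,
    max_tokens: int,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
) -> str:
    """Keep at most max_tokens tokens of text (hypothetical helper)."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
def ws_encode(text: str) -> List[str]:
    return text.split()


def ws_decode(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


sample = "total net sales were 383,285 million in fiscal 2023"
short = truncate_to_tokens(sample, 4, ws_encode, ws_decode)
print(short)
```

With tiktoken, the `encode` and `decode` methods of the encoding returned by `tiktoken.get_encoding("cl100k_base")` could be passed in place of the stand-ins. Truncating from the front still discards everything beyond the budget, so it remains a last resort.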
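2 Extraction and calculation
This section first describes the data to extract and the metrics to calculate.
The extracted data comes from two different documents, for example total revenue for fiscal year 2022 and total revenue for fiscal year 2023.
Some metrics combine data from both documents, for example the revenue growth rate and the change in R&D expense as a share of revenue.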
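For reference, the metrics can be written out as formulas. These are the conventional definitions, and they match what the verification code in section 2.3 computes. The symbols (Rev, COGS, NI, RD, TL, TA, OCF, CapEx) are shorthand introduced here, not notation from the filings.

```latex
\begin{align*}
\text{revenue\_growth} &= \frac{\mathrm{Rev}_{2023} - \mathrm{Rev}_{2022}}{\mathrm{Rev}_{2022}} \\
\text{gross\_margin}_t &= \frac{\mathrm{Rev}_t - \mathrm{COGS}_t}{\mathrm{Rev}_t} \\
\text{net\_profit\_margin}_t &= \frac{\mathrm{NI}_t}{\mathrm{Rev}_t} \\
\text{r\_and\_d\_pct\_change} &= \left( \frac{\mathrm{RD}_{2023}}{\mathrm{Rev}_{2023}} - \frac{\mathrm{RD}_{2022}}{\mathrm{Rev}_{2022}} \right) \times 100 \quad \text{(percentage points)} \\
\text{debt\_to\_assets}_{2023} &= \frac{\mathrm{TL}_{2023}}{\mathrm{TA}_{2023}} \\
\text{free\_cash\_flow}_{2023} &= \mathrm{OCF}_{2023} - \mathrm{CapEx}_{2023}
\end{align*}
```

Only the first and fourth metrics need figures from both filings. The others use a single year.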
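2.1 Prompt
A prompt is used to state which data to extract and which metrics to calculate.
The prompt needs to describe the task clearly, give the calculation requirements, and specify the structured output format (a JSON object in this example).
A chain-of-thought instruction is also included, so that the model reasons first and then fills in the JSON fields.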
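prompt = f"""
You are an experienced financial analyst. Below is part of the text of Apple's 10-K annual reports for fiscal years 2022 and 2023.
Read it carefully, extract the required financial data, and complete the calculations below. All amounts must be expressed in **millions of US dollars**.
**Raw data to extract (must be looked up in the text):**
- revenue_2023: total revenue for fiscal year 2023
- revenue_2022: total revenue for fiscal year 2022
- cogs_2023: cost of sales for fiscal year 2023
- cogs_2022: cost of sales for fiscal year 2022
- net_income_2023: net income for fiscal year 2023
- net_income_2022: net income for fiscal year 2022
- r_and_d_2023: R&D expense for fiscal year 2023
- r_and_d_2022: R&D expense for fiscal year 2022
- total_assets_2023: total assets at the end of fiscal year 2023
- total_liabilities_2023: total liabilities at the end of fiscal year 2023
- operating_cash_flow_2023: cash flow from operating activities for fiscal year 2023
- capital_expenditure_2023: capital expenditure for fiscal year 2023 (usually the cash outflow for "purchases of property, plant and equipment")
**Metrics to calculate (compute them from the data extracted above and fill them into the JSON):**
- revenue_growth: revenue growth rate, formatted like "8.5%"
- gross_margin_2023: 2023 gross margin, formatted like "40.2%"
- gross_margin_2022: 2022 gross margin, formatted like "39.8%"
- net_profit_margin_2023: 2023 net profit margin, formatted like "25.0%"
- net_profit_margin_2022: 2022 net profit margin, formatted like "24.5%"
- r_and_d_pct_change: change in R&D expense as a share of revenue (percentage points), e.g. "+0.5pp"
- debt_to_assets_2023: 2023 debt-to-assets ratio, formatted like "80.1%"
- free_cash_flow_2023: 2023 free cash flow (in millions of US dollars, as a number)
**Output in JSON format, including all of the fields above.** The JSON object must contain every key listed above, with correct numeric or string values (percentages as strings, numbers as numeric values).
Text:
{combined_text}
Reason step by step, then output the JSON.
"""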
2.2 Calling the LLM
With the prompt ready, the next step is to call the API, obtain the LLM's output and reasoning trace, and parse the result.
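import json

# Assumes `client` is an OpenAI-compatible client and `model_name` names a model
# that supports JSON mode; both are defined beforehand.
response = client.chat.completions.create(
    model=model_name,  # a model that supports JSON mode
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    max_tokens=1500,
    response_format={"type": "json_object"},  # force JSON output
)
# Extract the returned JSON string
message = response.choices[0].message
content = message.content
# reasoning_content is a provider-specific field, so fall back to None if it is absent
reasoning_content = getattr(message, "reasoning_content", None)
print("Raw model output:", content)
print("Model reasoning:", reasoning_content)
# Parse the JSON
try:
    result = json.loads(content)
    print("\nParsed result:")
    print(json.dumps(result, indent=2, ensure_ascii=False))
except json.JSONDecodeError as e:
    result = {}  # keep `result` defined for the later comparison
    print("JSON parsing failed:", e)
    # A regex-based extraction could be tried here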
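The comment in the except branch above suggests a regex fallback. One possible sketch follows. `extract_json_object` is a hypothetical helper, and it assumes the reply contains a single top-level JSON object, possibly surrounded by extra text.

```python
import json
import re
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Best-effort recovery of one JSON object from a noisy model reply."""
    # Greedy match from the first "{" to the last "}" (DOTALL spans newlines).
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Only accept a JSON object, not a bare list or scalar.
    return parsed if isinstance(parsed, dict) else None


noisy = 'Here is the result: {"revenue_2023": 383285, "revenue_growth": "-2.80%"} Hope this helps.'
print(extract_json_object(noisy))
```

The greedy pattern is deliberately simple. It would fail on a reply that contains several separate JSON objects, in which case a proper brace-matching scan would be the safer choice.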
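An example of the output of the API call is shown below. In the reasoning trace (reasoning_content),
the LLM first extracts the raw data, then lists the metrics to calculate and the data each one needs.
On that basis it computes the values and re-checks the extracted figures against the source documents.
The trace states almost no formulas explicitly (only liabilities/assets and operating cash flow minus capital expenditure), yet the verification code later in this post shows that the metrics are correct, apart from a 0.01 percentage point difference in net_profit_margin_2023.
The LLM has presumably internalized these formulas and does not need them spelled out.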
Raw model output: {
"revenue_2023": 383285,
"revenue_2022": 394328,
"cogs_2023": 214137,
"cogs_2022": 223546,
"net_income_2023": 96995,
"net_income_2022": 99803,
"r_and_d_2023": 29915,
"r_and_d_2022": 26251,
"total_assets_2023": 352583,
"total_liabilities_2023": 290437,
"operating_cash_flow_2023": 110543,
"capital_expenditure_2023": 10959,
"revenue_growth": "-2.80%",
"gross_margin_2023": "44.13%",
"gross_margin_2022": "43.31%",
"net_profit_margin_2023": "25.30%",
"net_profit_margin_2022": "25.31%",
"r_and_d_pct_change": "+1.15pp",
"debt_to_assets_2023": "82.37%",
"free_cash_flow_2023": 99584
}
Model reasoning: The user wants me to extract financial data from Apple's 2022 and 2023 10-K reports and calculate specific financial metrics. I need to carefully read through the provided text to find the required data points.
Let me identify what I need to extract:
**Raw Data to Extract:**
- revenue_2023: 2023 fiscal year total revenue
- revenue_2022: 2022 fiscal year total revenue
- cogs_2023: 2023 fiscal year cost of sales
- cogs_2022: 2022 fiscal year cost of sales
- net_income_2023: 2023 fiscal year net income
- net_income_2022: 2022 fiscal year net income
- r_and_d_2023: 2023 fiscal year R&D expense
- r_and_d_2022: 2022 fiscal year R&D expense
- total_assets_2023: 2023 fiscal year end total assets
- total_liabilities_2023: 2023 fiscal year end total liabilities
- operating_cash_flow_2023: 2023 fiscal year operating cash flow
- capital_expenditure_2023: 2023 fiscal year capital expenditure (purchase of property, plant and equipment)
**Calculated Metrics:**
- revenue_growth: Revenue growth rate
- gross_margin_2023: 2023 gross margin percentage
- gross_margin_2022: 2022 gross margin percentage
- net_profit_margin_2023: 2023 net profit margin percentage
- net_profit_margin_2022: 2022 net profit margin percentage
- r_and_d_pct_change: R&D as % of revenue change (percentage points)
- debt_to_assets_2023: 2023 debt-to-assets ratio (liabilities/assets)
- free_cash_flow_2023: 2023 free cash flow (operating cash flow - capital expenditure)
I'll start by locating the 2023 fiscal year data in the 2023 10-K report. The Consolidated Statements of Operations shows total net sales of 383,285 million for the year ended September 30, 2023, with cost of sales at 214,137 million and net income of 96,995 million. Research and development expenses totaled 29,915 million. From the Consolidated Balance Sheets, I can see total assets were 352,583 million and total liabilities were 290,437 million as of September 30, 2023. The Consolidated Statements of Cash Flows shows operating cash flow of 110,543 million and capital expenditures (payments for acquisition of property, plant and equipment) of 10,959 million for 2023.
Now I need to find the corresponding 2022 data from the 2022 10-K report to complete my analysis.
From the 2022 10-K Consolidated Statements of Operations, total net sales were 394,328 million for the year ended September 24, 2022, with cost of sales at 223,546 million and net income of 99,803 million. Research and development expenses were 26,251 million. The Consolidated Balance Sheets show total assets of 352,755 million and total liabilities of 302,083 million as of September 24, 2022. Operating cash flow from the Consolidated Statements of Cash Flows was 122,151 million, and capital expenditures were 10,708 million for 2022.
Now I can calculate the key financial metrics. Revenue declined 2.81% year-over-year from 394,328 million to 383,285 million. Gross margin improved slightly from 43.31% in 2022 to 44.13% in 2023. Net profit margin decreased from 25.31% to 25.30%. R&D spending as a percentage of revenue increased from 6.66% to 7.81%, representing a 1.15 percentage point increase. The debt-to-assets ratio improved from 85.63% to 82.37%. Free cash flow for 2023 was $99,584 million after subtracting capital expenditures from operating cash flow.
I should verify these calculations against the source documents to ensure accuracy before finalizing the analysis.
From the 2023 10-K Consolidated Statements of Operations, I can confirm: Total net sales of 383,285 million, cost of sales at 214,137 million, net income of 96,995 million, and R&D expenses of 29,915 million. The Balance Sheet shows total assets of 352,583 million and total liabilities of 290,437 million. The Cash Flow Statement reports operating cash flow of 110,543 million and capital expenditures of 10,959 million.
For 2022, the 10-K shows: Total net sales of 394,328 million, cost of sales at 223,546 million, net income of 99,803 million, and R&D expenses of 26,251 million. The Balance Sheet indicates total assets of 352,755 million and total liabilities of 302,083 million. Operating cash flow was 122,151 million with capital expenditures of 10,708 million.
Now I'll calculate the key metrics. Revenue growth came in at -2.80%, reflecting the year-over-year decline. Gross margin improved to 44.13% in 2023 from 43.31% in 2022. Net profit margin remained relatively stable at 25.30% versus 25.31%. R&D intensity increased by 1.15 percentage points to 7.81% of revenue. The debt-to-assets ratio decreased to 82.37% from the prior year. Free cash flow for 2023 totaled $99,584 million after subtracting capital expenditures from operating cash flow.
Parsed result:
{
"revenue_2023": 383285,
"revenue_2022": 394328,
"cogs_2023": 214137,
"cogs_2022": 223546,
"net_income_2023": 96995,
"net_income_2022": 99803,
"r_and_d_2023": 29915,
"r_and_d_2022": 26251,
"total_assets_2023": 352583,
"total_liabilities_2023": 290437,
"operating_cash_flow_2023": 110543,
"capital_expenditure_2023": 10959,
"revenue_growth": "-2.80%",
"gross_margin_2023": "44.13%",
"gross_margin_2022": "43.31%",
"net_profit_margin_2023": "25.30%",
"net_profit_margin_2022": "25.31%",
"r_and_d_pct_change": "+1.15pp",
"debt_to_assets_2023": "82.37%",
"free_cash_flow_2023": 99584
}
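Before comparing against reference values, it can help to check that the parsed object follows the contract stated in the prompt: all 20 keys present, raw figures as numbers, percentages as strings. The sketch below is one way to do that. `validate_result`, the key lists, and the regular expressions are illustrative assumptions, not part of the original code.

```python
import re

RAW_KEYS = [
    "revenue_2023", "revenue_2022", "cogs_2023", "cogs_2022",
    "net_income_2023", "net_income_2022", "r_and_d_2023", "r_and_d_2022",
    "total_assets_2023", "total_liabilities_2023",
    "operating_cash_flow_2023", "capital_expenditure_2023",
]
PCT_KEYS = [
    "revenue_growth", "gross_margin_2023", "gross_margin_2022",
    "net_profit_margin_2023", "net_profit_margin_2022", "debt_to_assets_2023",
]
PCT_RE = re.compile(r"^[+-]?\d+(\.\d+)?%$")   # e.g. "-2.80%"
PP_RE = re.compile(r"^[+-]?\d+(\.\d+)?pp$")   # e.g. "+1.15pp"


def is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_result(result: dict) -> list:
    """Return a list of problems; an empty list means the shape is as requested."""
    problems = []
    for key in RAW_KEYS + ["free_cash_flow_2023"]:
        if key not in result:
            problems.append(f"missing: {key}")
        elif not is_number(result[key]):
            problems.append(f"not a number: {key}")
    for key in PCT_KEYS:
        if key not in result:
            problems.append(f"missing: {key}")
        elif not (isinstance(result[key], str) and PCT_RE.match(result[key])):
            problems.append(f"bad percent string: {key}")
    pp = result.get("r_and_d_pct_change")
    if not (isinstance(pp, str) and PP_RE.match(pp)):
        problems.append("bad pp string: r_and_d_pct_change")
    return problems


sample = {"revenue_2023": 383285, "revenue_growth": "-2.80%"}
print(validate_result(sample)[:3])
```

This only checks shape, not correctness. Whether the values are right is the job of the comparison in the next section.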
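2.3 Comparison with actual figures
Here the model's accuracy is assessed by comparing its output with the actual figures from the filings.
The example code is as follows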
# Reference data from the filings (in millions of US dollars)
real_data = {
    "revenue_2023": 383285,
    "revenue_2022": 394328,
    "cogs_2023": 214137,
    "cogs_2022": 223546,
    "net_income_2023": 96995,
    "net_income_2022": 99803,
    "r_and_d_2023": 29915,
    "r_and_d_2022": 26251,
    "total_assets_2023": 352583,
    "total_liabilities_2023": 290437,
    "operating_cash_flow_2023": 110543,
    "capital_expenditure_2023": 10959
}
# Reference metric values computed from the data above
real_metrics = {
    "revenue_growth": f"{(real_data['revenue_2023'] - real_data['revenue_2022'])/real_data['revenue_2022']*100:.2f}%",
    "gross_margin_2023": f"{(real_data['revenue_2023'] - real_data['cogs_2023'])/real_data['revenue_2023']*100:.2f}%",
    "gross_margin_2022": f"{(real_data['revenue_2022'] - real_data['cogs_2022'])/real_data['revenue_2022']*100:.2f}%",
    "net_profit_margin_2023": f"{real_data['net_income_2023']/real_data['revenue_2023']*100:.2f}%",
    "net_profit_margin_2022": f"{real_data['net_income_2022']/real_data['revenue_2022']*100:.2f}%",
    "r_and_d_pct_change": f"{(real_data['r_and_d_2023']/real_data['revenue_2023'] - real_data['r_and_d_2022']/real_data['revenue_2022'])*100:+.2f}pp",
    "debt_to_assets_2023": f"{real_data['total_liabilities_2023']/real_data['total_assets_2023']*100:.2f}%",
    "free_cash_flow_2023": real_data['operating_cash_flow_2023'] - real_data['capital_expenditure_2023']
}
# Compare with the model output
for key in real_metrics:
    if key in result:
        pred = result[key]
        real = real_metrics[key]
        print(f"{key}: predicted {pred} vs actual {real}")
    else:
        print(f"Warning: model output is missing field {key}")
Example output follows. The LLM's results match the reference values for seven of the eight metrics; net_profit_margin_2023 differs by 0.01 percentage points (25.30% vs 25.31%).
revenue_growth: predicted -2.80% vs actual -2.80%
gross_margin_2023: predicted 44.13% vs actual 44.13%
gross_margin_2022: predicted 43.31% vs actual 43.31%
net_profit_margin_2023: predicted 25.30% vs actual 25.31%
net_profit_margin_2022: predicted 25.31% vs actual 25.31%
r_and_d_pct_change: predicted +1.15pp vs actual +1.15pp
debt_to_assets_2023: predicted 82.37% vs actual 82.37%
free_cash_flow_2023: predicted 99584 vs actual 99584
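Comparing the formatted strings by eye works for eight metrics, but a numeric comparison with a tolerance makes the check explicit. A minimal sketch follows. The helper names `to_number` and `compare` and the default tolerance of 0.01 are assumptions chosen for illustration.

```python
def to_number(value) -> float:
    """Convert 99584, "44.13%", or "+1.15pp" to a float."""
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = value.strip().replace(",", "")
    for suffix in ("pp", "%"):
        if cleaned.endswith(suffix):
            cleaned = cleaned[: -len(suffix)]
            break
    return float(cleaned)


def compare(pred: dict, real: dict, tol: float = 0.01) -> dict:
    """Map each key in real to True/False, or None if pred lacks the key."""
    report = {}
    for key, real_value in real.items():
        if key not in pred:
            report[key] = None
            continue
        diff = abs(to_number(pred[key]) - to_number(real_value))
        # small epsilon guards against float noise such as 25.31 - 25.30
        report[key] = diff <= tol + 1e-9
    return report


pred = {"net_profit_margin_2023": "25.30%", "free_cash_flow_2023": 99584}
real = {"net_profit_margin_2023": "25.31%", "free_cash_flow_2023": 99584,
        "revenue_growth": "-2.80%"}
print(compare(pred, real))            # tolerance of 0.01
print(compare(pred, real, tol=0.0))   # exact match required
```

Unrounded, 96995 / 383285 is about 25.306%, so the reference value rounds up to 25.31%. The model's 25.30% is therefore possibly a truncation rather than an extraction error, and a 0.01 tolerance accepts it while an exact comparison does not.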
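Apple's financial data from the 2022 and 2023 10-K filings is as follows (M = millions of US dollars)
| Metric | 2023 | 2022 |
|---|---|---|
| Total revenue | $383,285 M | $394,328 M |
| Cost of sales | $214,137 M | $223,546 M |
| Net income | $96,995 M | $99,803 M |
| R&D expense | $29,915 M | $26,251 M |
| Total assets | $352,583 M | $352,755 M (end of FY2022) |
| Total liabilities | $290,437 M | $302,083 M (end of FY2022) |
| Operating cash flow | $110,543 M | $122,151 M |
| Capital expenditure | $10,959 M | $10,708 M |
The data source links are as follows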
https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm
https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm
References
LLM numeric extraction and calculation example
https://blog.csdn.net/liliang199/article/details/159244753
Exploring the relationship between LLM long context and valid numeric output