Example: Extracting and Computing Complex Numerical Values with an LLM

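An earlier post explored using an LLM to extract simple numerical values from a long text and compute with them.

https://blog.csdn.net/liliang199/article/details/159244753

This post goes a step further and looks at extracting and computing values that span two documents.

The materials and code are adapted from resources found online.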

1 Obtaining the documents

1.1 Downloading the data

The text of Apple's fiscal 2022 and fiscal 2023 Form 10-K filings is taken from SEC EDGAR.

The links are:

aapl-20220924

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm

aapl-20230930

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm

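To keep things simple, open each link, select all the text, copy it, and paste it into a local file.

Save the files as aapl-20220924.txt and aapl-20230930.txt.

Then merge the two documents. If their combined token count fits within a 128K context window, they can simply be concatenated.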

with open("aapl-20230930.txt", "r", encoding="utf-8") as f:
    text_2023 = f.read()
with open("aapl-20220924.txt", "r", encoding="utf-8") as f:
    text_2022 = f.read()

print(f"2023 length: {len(text_2023)} characters")
print(f"2022 length: {len(text_2022)} characters")

The output is:

2023 length: 203704 characters
2022 length: 218592 characters

1.2 Estimating the token count

tiktoken is used here to estimate the total token count of the two documents combined. An example program follows.

import tiktoken

def num_tokens(text):
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

tokens_2023 = num_tokens(text_2023)
tokens_2022 = num_tokens(text_2022)
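print(f"2023 tokens: {tokens_2023}, 2022 tokens: {tokens_2022}, total: {tokens_2022 + tokens_2023}")

# 120000 rather than 128000, presumably to leave headroom for the instructions and the reply
if tokens_2022 + tokens_2023 < 120000:
    combined_text = "=== Apple Inc. fiscal 2022 Form 10-K ===\n" + text_2022 + "\n\n=== Apple Inc. fiscal 2023 Form 10-K ===\n" + text_2023
else:
    # Over budget: truncate or switch to RAG
    combined_text = text_2022[:60000] + text_2023[:60000]  # crude cut by characters (not tokens); may lose information

The output is shown below: about 92K tokens in total, within the 128K window.

2023 tokens: 45185, 2022 tokens: 47670, total: 92855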
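The fallback branch above slices by characters, while the budget it is meant to respect is counted in tokens. A minimal sketch of cutting by token count instead is given below. It assumes an encoder object that exposes `encode` and `decode`, as tiktoken's `Encoding` does. The names `truncate_to_tokens`, `split_budget`, and `WhitespaceEncoder` are hypothetical and introduced only for illustration; the stub encoder exists so that the example runs without downloading a tiktoken vocabulary.

```python
class WhitespaceEncoder:
    """Stand-in encoder for the demo: one token per whitespace-separated word.

    Hypothetical helper, not part of tiktoken. In the setting of this post,
    pass tiktoken.get_encoding("cl100k_base") instead.
    """

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


def truncate_to_tokens(text, max_tokens, enc):
    """Return text cut to at most max_tokens tokens, as counted by enc."""
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])


def split_budget(total_budget, n_docs):
    """Give each document an equal share of the token budget."""
    return total_budget // n_docs


enc = WhitespaceEncoder()
per_doc = split_budget(6, 2)  # 3 tokens per document in this toy example
short_2022 = truncate_to_tokens("net sales were 394,328 million", per_doc, enc)
short_2023 = truncate_to_tokens("net sales were 383,285 million", per_doc, enc)
print(short_2022)  # net sales were
print(short_2023)  # net sales were
```

The toy output also shows the cost of head truncation: the figure of interest is exactly what gets cut. In a 10-K the financial statements sit late in the filing, so a head cut would likely drop them, and retrieval over the relevant sections is probably the safer fallback.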

2 Extraction and calculation

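First, the data to extract and the metrics to compute.

The extracted data come from two different documents, for example total revenue for fiscal 2022 and total revenue for fiscal 2023.

Some metrics combine data from both documents, for example the revenue growth rate and the change in R&D expense as a share of revenue.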
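For reference, the metrics requested in this post can be written out with their usual textbook definitions, which agree with the verification code in section 2.3. The symbols are introduced here only for illustration: R is total revenue, C cost of sales, N net income, D R&D expense, L total liabilities, A total assets, and the subscript is the fiscal year.

```latex
\begin{align*}
\text{revenue growth} &= \frac{R_{2023} - R_{2022}}{R_{2022}} \\
\text{gross margin}_t &= \frac{R_t - C_t}{R_t} \\
\text{net profit margin}_t &= \frac{N_t}{R_t} \\
\Delta\,\text{R\&D share (pp)} &= 100\left(\frac{D_{2023}}{R_{2023}} - \frac{D_{2022}}{R_{2022}}\right) \\
\text{debt-to-assets}_{2023} &= \frac{L_{2023}}{A_{2023}} \\
\text{FCF}_{2023} &= \mathrm{OCF}_{2023} - \mathrm{CapEx}_{2023}
\end{align*}
```

Only the first and fourth formulas mix values from the two filings; the others use a single fiscal year.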

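2.1 Prompt

A prompt is used to specify which data to extract and which metrics to compute.

The prompt should state the task clearly, spell out the calculation requirements, and ask for structured (JSON) output.

A chain-of-thought instruction is also included so that the model reasons first and fills in the output fields afterwards.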

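prompt = f"""
You are an experienced financial analyst. Below are excerpts from Apple's Form 10-K annual reports for fiscal 2022 and fiscal 2023.

Read them carefully, extract the required financial data, and complete the calculations below. Express all amounts in **millions of US dollars**.

**Raw data to extract (must be found in the text):**
- revenue_2023: total revenue for fiscal 2023
- revenue_2022: total revenue for fiscal 2022
- cogs_2023: cost of sales for fiscal 2023
- cogs_2022: cost of sales for fiscal 2022
- net_income_2023: net income for fiscal 2023
- net_income_2022: net income for fiscal 2022
- r_and_d_2023: R&D expense for fiscal 2023
- r_and_d_2022: R&D expense for fiscal 2022
- total_assets_2023: total assets at the end of fiscal 2023
- total_liabilities_2023: total liabilities at the end of fiscal 2023
- operating_cash_flow_2023: cash flow from operating activities for fiscal 2023
- capital_expenditure_2023: capital expenditure for fiscal 2023 (usually the cash outflow for "purchases of property, plant and equipment")

**Metrics to compute (calculate from the data extracted above and include in the JSON):**
- revenue_growth: revenue growth rate, formatted like "8.5%"
- gross_margin_2023: 2023 gross margin, formatted like "40.2%"
- gross_margin_2022: 2022 gross margin, formatted like "39.8%"
- net_profit_margin_2023: 2023 net profit margin, formatted like "25.0%"
- net_profit_margin_2022: 2022 net profit margin, formatted like "24.5%"
- r_and_d_pct_change: change in R&D expense as a share of revenue (percentage points), e.g. "+0.5pp"
- debt_to_assets_2023: 2023 debt-to-assets ratio, formatted like "80.1%"
- free_cash_flow_2023: 2023 free cash flow (a number, in millions of US dollars)

**Output JSON containing all of the fields above.** The JSON object must include every key listed above, with correct number or string values (percentages as strings, amounts as numbers).

Text:
{combined_text}

Reason step by step, then output the JSON.
"""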

2.2 Calling the LLM

With the prompt ready, the next step calls the API to obtain the LLM's output and reasoning trace, then parses the result.

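import json

# Assumes `client` is an already configured OpenAI-compatible client and
# `model_name` names a model that supports JSON mode; neither is defined in this post.
response = client.chat.completions.create(
    model=model_name,          # a model that supports JSON mode
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    max_tokens=1500,
    response_format={"type": "json_object"}   # force JSON output
)

# Extract the returned JSON string
message = response.choices[0].message
content = message.content
# reasoning_content is a provider-specific field; fall back to None if it is absent
reasoning_content = getattr(message, "reasoning_content", None)
print("Raw model output:", content)
print("Model reasoning:", reasoning_content)

# Parse the JSON
try:
    result = json.loads(content)
    print("\nParsed result:")
    print(json.dumps(result, indent=2, ensure_ascii=False))
except json.JSONDecodeError as e:
    print("JSON parsing failed:", e)
    # A regex-based extraction could be tried here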
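A sample output follows. In its reasoning trace (reasoning_content), the LLM first lays out the raw data to extract, then the metrics to compute and the data each one needs.

On that basis it reports the computed values and then runs a verification pass.

The trace never states a formula, yet the check in section 2.3 shows that seven of the eight metrics match exactly and the eighth differs only in the last decimal place.

The LLM has probably internalized these formulas and does not need to spell them out.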

Raw model output: {
  "revenue_2023": 383285,
  "revenue_2022": 394328,
  "cogs_2023": 214137,
  "cogs_2022": 223546,
  "net_income_2023": 96995,
  "net_income_2022": 99803,
  "r_and_d_2023": 29915,
  "r_and_d_2022": 26251,
  "total_assets_2023": 352583,
  "total_liabilities_2023": 290437,
  "operating_cash_flow_2023": 110543,
  "capital_expenditure_2023": 10959,
  "revenue_growth": "-2.80%",
  "gross_margin_2023": "44.13%",
  "gross_margin_2022": "43.31%",
  "net_profit_margin_2023": "25.30%",
  "net_profit_margin_2022": "25.31%",
  "r_and_d_pct_change": "+1.15pp",
  "debt_to_assets_2023": "82.37%",
  "free_cash_flow_2023": 99584
}

Model reasoning: The user wants me to extract financial data from Apple's 2022 and 2023 10-K reports and calculate specific financial metrics. I need to carefully read through the provided text to find the required data points.

Let me identify what I need to extract:

**Raw Data to Extract:**

  1. revenue_2023: 2023 fiscal year total revenue
  2. revenue_2022: 2022 fiscal year total revenue
  3. cogs_2023: 2023 fiscal year cost of sales
  4. cogs_2022: 2022 fiscal year cost of sales
  5. net_income_2023: 2023 fiscal year net income
  6. net_income_2022: 2022 fiscal year net income
  7. r_and_d_2023: 2023 fiscal year R&D expense
  8. r_and_d_2022: 2022 fiscal year R&D expense
  9. total_assets_2023: 2023 fiscal year end total assets
  10. total_liabilities_2023: 2023 fiscal year end total liabilities
  11. operating_cash_flow_2023: 2023 fiscal year operating cash flow
  12. capital_expenditure_2023: 2023 fiscal year capital expenditure (purchase of property, plant and equipment)

**Calculated Metrics:**

  1. revenue_growth: Revenue growth rate
  2. gross_margin_2023: 2023 gross margin percentage
  3. gross_margin_2022: 2022 gross margin percentage
  4. net_profit_margin_2023: 2023 net profit margin percentage
  5. net_profit_margin_2022: 2022 net profit margin percentage
  6. r_and_d_pct_change: R&D as % of revenue change (percentage points)
  7. debt_to_assets_2023: 2023 debt-to-assets ratio (liabilities/assets)
  8. free_cash_flow_2023: 2023 free cash flow (operating cash flow - capital expenditure)

I'll start by locating the 2023 fiscal year data in the 2023 10-K report. The Consolidated Statements of Operations shows total net sales of 383,285 million for the year ended September 30, 2023, with cost of sales at 214,137 million and net income of 96,995 million. Research and development expenses totaled 29,915 million. From the Consolidated Balance Sheets, I can see total assets were 352,583 million and total liabilities were 290,437 million as of September 30, 2023. The Consolidated Statements of Cash Flows shows operating cash flow of 110,543 million and capital expenditures (payments for acquisition of property, plant and equipment) of 10,959 million for 2023.

Now I need to find the corresponding 2022 data from the 2022 10-K report to complete my analysis.

From the 2022 10-K Consolidated Statements of Operations, total net sales were 394,328 million for the year ended September 24, 2022, with cost of sales at 223,546 million and net income of 99,803 million. Research and development expenses were 26,251 million. The Consolidated Balance Sheets show total assets of 352,755 million and total liabilities of 302,083 million as of September 24, 2022. Operating cash flow from the Consolidated Statements of Cash Flows was 122,151 million, and capital expenditures were 10,708 million for 2022.

Now I can calculate the key financial metrics. Revenue declined 2.81% year-over-year from 394,328 million to 383,285 million. Gross margin improved slightly from 43.31% in 2022 to 44.13% in 2023. Net profit margin decreased from 25.31% to 25.30%. R&D spending as a percentage of revenue increased from 6.66% to 7.81%, representing a 1.15 percentage point increase. The debt-to-assets ratio improved from 85.63% to 82.37%. Free cash flow for 2023 was $99,584 million after subtracting capital expenditures from operating cash flow.

I should verify these calculations against the source documents to ensure accuracy before finalizing the analysis.

From the 2023 10-K Consolidated Statements of Operations, I can confirm: Total net sales of 383,285 million, cost of sales at 214,137 million, net income of 96,995 million, and R&D expenses of 29,915 million. The Balance Sheet shows total assets of 352,583 million and total liabilities of 290,437 million. The Cash Flow Statement reports operating cash flow of 110,543 million and capital expenditures of 10,959 million.

For 2022, the 10-K shows: Total net sales of 394,328 million, cost of sales at 223,546 million, net income of 99,803 million, and R&D expenses of 26,251 million. The Balance Sheet indicates total assets of 352,755 million and total liabilities of 302,083 million. Operating cash flow was 122,151 million with capital expenditures of 10,708 million.

Now I'll calculate the key metrics. Revenue growth came in at -2.80%, reflecting the year-over-year decline. Gross margin improved to 44.13% in 2023 from 43.31% in 2022. Net profit margin remained relatively stable at 25.30% versus 25.31%. R&D intensity increased by 1.15 percentage points to 7.81% of revenue. The debt-to-assets ratio decreased to 82.37% from the prior year. Free cash flow for 2023 totaled $99,584 million after subtracting capital expenditures from operating cash flow.

Parsed result:

{
  "revenue_2023": 383285,
  "revenue_2022": 394328,
  "cogs_2023": 214137,
  "cogs_2022": 223546,
  "net_income_2023": 96995,
  "net_income_2022": 99803,
  "r_and_d_2023": 29915,
  "r_and_d_2022": 26251,
  "total_assets_2023": 352583,
  "total_liabilities_2023": 290437,
  "operating_cash_flow_2023": 110543,
  "capital_expenditure_2023": 10959,
  "revenue_growth": "-2.80%",
  "gross_margin_2023": "44.13%",
  "gross_margin_2022": "43.31%",
  "net_profit_margin_2023": "25.30%",
  "net_profit_margin_2022": "25.31%",
  "r_and_d_pct_change": "+1.15pp",
  "debt_to_assets_2023": "82.37%",
  "free_cash_flow_2023": 99584
}
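Before any ground truth is brought in, the reply can be checked for internal consistency: do the reported metrics actually follow from the raw values the model itself extracted? The sketch below recomputes the eight metrics from the twelve raw fields and lists any that disagree with what was reported. The helper names `pct`, `recompute_metrics`, and `find_inconsistencies` are hypothetical; the formulas are the standard definitions, and the `result` dict repeats the model output reported above.

```python
def pct(ratio):
    """Format a ratio as a percentage string with two decimals."""
    return f"{ratio * 100:.2f}%"


def recompute_metrics(r):
    """Derive the eight metrics from the raw fields in r (a dict of numbers)."""
    rd_share_2023 = r["r_and_d_2023"] / r["revenue_2023"]
    rd_share_2022 = r["r_and_d_2022"] / r["revenue_2022"]
    return {
        "revenue_growth": pct((r["revenue_2023"] - r["revenue_2022"]) / r["revenue_2022"]),
        "gross_margin_2023": pct((r["revenue_2023"] - r["cogs_2023"]) / r["revenue_2023"]),
        "gross_margin_2022": pct((r["revenue_2022"] - r["cogs_2022"]) / r["revenue_2022"]),
        "net_profit_margin_2023": pct(r["net_income_2023"] / r["revenue_2023"]),
        "net_profit_margin_2022": pct(r["net_income_2022"] / r["revenue_2022"]),
        "r_and_d_pct_change": f"{(rd_share_2023 - rd_share_2022) * 100:+.2f}pp",
        "debt_to_assets_2023": pct(r["total_liabilities_2023"] / r["total_assets_2023"]),
        "free_cash_flow_2023": r["operating_cash_flow_2023"] - r["capital_expenditure_2023"],
    }


def find_inconsistencies(result):
    """Map each metric whose reported value differs from the recomputed one
    to a (reported, recomputed) pair."""
    expected = recompute_metrics(result)
    return {k: (result.get(k), v) for k, v in expected.items() if result.get(k) != v}


# The model output reported above
result = {
    "revenue_2023": 383285, "revenue_2022": 394328,
    "cogs_2023": 214137, "cogs_2022": 223546,
    "net_income_2023": 96995, "net_income_2022": 99803,
    "r_and_d_2023": 29915, "r_and_d_2022": 26251,
    "total_assets_2023": 352583, "total_liabilities_2023": 290437,
    "operating_cash_flow_2023": 110543, "capital_expenditure_2023": 10959,
    "revenue_growth": "-2.80%",
    "gross_margin_2023": "44.13%", "gross_margin_2022": "43.31%",
    "net_profit_margin_2023": "25.30%", "net_profit_margin_2022": "25.31%",
    "r_and_d_pct_change": "+1.15pp",
    "debt_to_assets_2023": "82.37%",
    "free_cash_flow_2023": 99584,
}
print(find_inconsistencies(result))
# {'net_profit_margin_2023': ('25.30%', '25.31%')}
```

The single flagged entry is a rounding slip: 96,995 / 383,285 is about 25.306%, which rounds to 25.31%, not 25.30%. A check of this kind needs no hand-entered reference data, so it could run on every reply.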

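2.3 Comparison with the actual figures

The model's accuracy is assessed by comparing its output with the actual figures from the filings.

Example code: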

# Actual figures (in millions of US dollars)
real_data = {
    "revenue_2023": 383285,
    "revenue_2022": 394328,
    "cogs_2023": 214137,
    "cogs_2022": 223546,
    "net_income_2023": 96995,
    "net_income_2022": 99803,
    "r_and_d_2023": 29915,
    "r_and_d_2022": 26251,
    "total_assets_2023": 352583,
    "total_liabilities_2023": 290437,
    "operating_cash_flow_2023": 110543,
    "capital_expenditure_2023": 10959
}

# Metrics computed from the actual figures
real_metrics = {
    "revenue_growth": f"{(real_data['revenue_2023'] - real_data['revenue_2022'])/real_data['revenue_2022']*100:.2f}%",
    "gross_margin_2023": f"{(real_data['revenue_2023'] - real_data['cogs_2023'])/real_data['revenue_2023']*100:.2f}%",
    "gross_margin_2022": f"{(real_data['revenue_2022'] - real_data['cogs_2022'])/real_data['revenue_2022']*100:.2f}%",
    "net_profit_margin_2023": f"{real_data['net_income_2023']/real_data['revenue_2023']*100:.2f}%",
    "net_profit_margin_2022": f"{real_data['net_income_2022']/real_data['revenue_2022']*100:.2f}%",
    "r_and_d_pct_change": f"{(real_data['r_and_d_2023']/real_data['revenue_2023'] - real_data['r_and_d_2022']/real_data['revenue_2022'])*100:+.2f}pp",
    "debt_to_assets_2023": f"{real_data['total_liabilities_2023']/real_data['total_assets_2023']*100:.2f}%",
    "free_cash_flow_2023": real_data['operating_cash_flow_2023'] - real_data['capital_expenditure_2023']
}

# Compare against the model output
for key in real_metrics:
    if key in result:
        pred = result[key]
        real = real_metrics[key]
        print(f"{key}: predicted {pred} vs actual {real}")
    else:
        print(f"Warning: model output is missing field {key}")

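The output is shown below. Seven of the eight metrics match exactly; net_profit_margin_2023 differs by 0.01 percentage point (25.30% vs 25.31%).

revenue_growth: predicted -2.80% vs actual -2.80%
gross_margin_2023: predicted 44.13% vs actual 44.13%
gross_margin_2022: predicted 43.31% vs actual 43.31%
net_profit_margin_2023: predicted 25.30% vs actual 25.31%
net_profit_margin_2022: predicted 25.31% vs actual 25.31%
r_and_d_pct_change: predicted +1.15pp vs actual +1.15pp
debt_to_assets_2023: predicted 82.37% vs actual 82.37%
free_cash_flow_2023: predicted 99584 vs actual 99584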
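The loop in this section prints the values side by side and leaves the judgement to the reader. A minimal sketch of a numeric comparison is given below, assuming metric values are either plain numbers or strings ending in "%" or "pp" as requested in the prompt. The helper names `parse_metric` and `compare`, and the default tolerance of 0.01, are hypothetical choices made for illustration.

```python
def parse_metric(value):
    """Convert '44.13%', '+1.15pp' or a plain number to a float."""
    if isinstance(value, (int, float)):
        return float(value)
    text = value.strip()
    for suffix in ("pp", "%"):
        if text.endswith(suffix):
            text = text[: -len(suffix)]
    return float(text)


def compare(pred, real, tol=0.01):
    """Return (within_tolerance, absolute_difference) for two metric values.

    The difference is rounded to 6 decimals so that binary floating-point
    noise does not push a 0.01 gap just over the tolerance.
    """
    diff = round(abs(parse_metric(pred) - parse_metric(real)), 6)
    return diff <= tol, diff


print(compare("25.30%", "25.31%"))   # (True, 0.01)
print(compare("+1.15pp", "+1.15pp")) # (True, 0.0)
print(compare(99584, 99584))         # (True, 0.0)
```

With a tolerance of one unit in the last printed decimal, the net_profit_margin_2023 gap counts as a rounding difference; setting `tol=0` would flag it as a mismatch instead.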

Apple Form 10-K financial data for fiscal 2022 and 2023 (in millions of US dollars):

Metric                                   FY2023     FY2022
Total revenue                           383,285    394,328
Cost of sales                           214,137    223,546
Net income                               96,995     99,803
R&D expense                              29,915     26,251
Total assets (fiscal year-end)          352,583    352,755
Total liabilities (fiscal year-end)     290,437    302,083
Operating cash flow                     110,543    122,151
Capital expenditure                      10,959     10,708

Data sources:

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm

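References

Example: Numerical Extraction and Computation with an LLM

https://blog.csdn.net/liliang199/article/details/159244753

Exploring the Relationship Between LLM Long Context and Valid Numerical Output

https://blog.csdn.net/liliang199/article/details/159175752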
