An Example of Extracting and Computing Complex Numerical Values with an LLM

An earlier post explored using an LLM to extract simple numerical values from a long text and compute with them.

https://blog.csdn.net/liliang199/article/details/159244753

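This post goes a step further and explores extracting and computing more complex values that span two documents.

The materials and code used here are adapted from online sources.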

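1 Obtaining the documents

1.1 Downloading the data

The text versions of Apple's fiscal 2022 and fiscal 2023 Form 10-K filings are taken from SEC EDGAR.

The links are as follows.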

aapl-20220924

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm

aapl-20230930

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm

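To keep things simple, open each link above, select all of the text, copy it, and paste it into a local file.

Save the two files as aapl-20220924.txt and aapl-20230930.txt.

The two documents are then merged: if their combined token count fits within a 128K context window, they can simply be concatenated.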


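# Read both filings; the encoding is given explicitly so the result does not depend on the platform default
with open("aapl-20230930.txt", "r", encoding="utf-8") as f:
    text_2023 = f.read()
with open("aapl-20220924.txt", "r", encoding="utf-8") as f:
    text_2022 = f.read()

print(f"2023 length: {len(text_2023)} characters")
print(f"2022 length: {len(text_2022)} characters")

The output is as follows.

2023 length: 203704 characters
2022 length: 218592 characters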

1.2 Estimating the token count

tiktoken is used here to estimate the total token count of the two documents combined. A sample program follows.


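import tiktoken

# cl100k_base is an OpenAI encoding; for models with a different tokenizer the count is only an estimate
ENC = tiktoken.get_encoding("cl100k_base")

def num_tokens(text):
    return len(ENC.encode(text))

tokens_2023 = num_tokens(text_2023)
tokens_2022 = num_tokens(text_2022)
print(f"2023 tokens: {tokens_2023}, 2022 tokens: {tokens_2022}, total: {tokens_2022 + tokens_2023}")

header_2022 = "=== Apple Inc. fiscal 2022 10-K ===\n"
header_2023 = "=== Apple Inc. fiscal 2023 10-K ===\n"

# 120000 leaves headroom below the 128K window for the instructions and the reply
if tokens_2022 + tokens_2023 < 120000:
    combined_text = header_2022 + text_2022 + "\n\n" + header_2023 + text_2023
else:
    # Over budget: truncate or switch to RAG.
    # Crude truncation: this slices by characters, not tokens, and may drop the financial statements.
    combined_text = header_2022 + text_2022[:60000] + "\n\n" + header_2023 + text_2023[:60000]

The output is shown below: 92,855 tokens in total (roughly 93K), which fits within the 128K window.

2023 tokens: 45185, 2022 tokens: 47670, total: 92855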
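The fallback branch above cuts each document by character count, which does not map cleanly onto a token budget. A minimal sketch of token-based truncation is given below. The helper name `truncate_to_token_budget` and the toy tokenizer are illustrative assumptions, not part of the original code; the function only assumes an `encode`/`decode` pair.

```python
from typing import Callable, List


def truncate_to_token_budget(
    text: str,
    budget: int,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> str:
    """Return `text` cut to at most `budget` tokens (unchanged if it already fits)."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    tokens = encode(text)
    if len(tokens) <= budget:
        return text
    # Keep only the first `budget` tokens and turn them back into text
    return decode(tokens[:budget])


# Toy tokenizer for demonstration only: one token per character
def toy_encode(s: str) -> List[int]:
    return [ord(c) for c in s]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


print(truncate_to_token_budget("abcdef", 4, toy_encode, toy_decode))  # abcd
```

With tiktoken, the `encode` and `decode` methods of the encoding object could be passed in place of the toy functions. Truncation still discards the tail of each filing, so the caveat about losing information applies either way.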

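2 Extraction and computation

First, the data to extract and the metrics to compute.

The extracted values come from two different documents, for example total revenue for fiscal 2022 and total revenue for fiscal 2023.

Some of the computed metrics combine data from both documents, for example revenue growth and the change in R&D expense as a share of revenue.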
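For reference, the metrics can be written out as follows. These are the conventional definitions, and they agree with the verification code in section 2.3. The symbols are shorthand introduced here: R is total revenue, C is cost of sales, N is net income, D is R&D expense, A is total assets, L is total liabilities, and the subscript is the fiscal year.

```latex
\[
\begin{aligned}
\text{revenue\_growth} &= \frac{R_{2023} - R_{2022}}{R_{2022}} \times 100\% \\
\text{gross\_margin}_{t} &= \frac{R_{t} - C_{t}}{R_{t}} \times 100\%, \quad t \in \{2022, 2023\} \\
\text{net\_profit\_margin}_{t} &= \frac{N_{t}}{R_{t}} \times 100\%, \quad t \in \{2022, 2023\} \\
\text{r\_and\_d\_pct\_change} &= \left( \frac{D_{2023}}{R_{2023}} - \frac{D_{2022}}{R_{2022}} \right) \times 100 \ \text{pp} \\
\text{debt\_to\_assets}_{2023} &= \frac{L_{2023}}{A_{2023}} \times 100\% \\
\text{free\_cash\_flow}_{2023} &= \text{OCF}_{2023} - \text{CapEx}_{2023}
\end{aligned}
\]
```

Only the first and fourth metrics need values from both filings; the others use a single fiscal year.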

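2.1 Prompt

A prompt is used to state which values to extract and which metrics to compute.

The prompt needs to describe the task clearly, spell out the calculation requirements, and specify the output format (here a JSON object, requested through the API's JSON mode rather than through function calling).

It also includes a chain-of-thought instruction so that the model reasons first and fills in the JSON fields afterwards.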


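prompt = f"""
You are an experienced financial analyst. Below is the text of Apple's Form 10-K annual reports for fiscal 2022 and fiscal 2023.

Read it carefully, extract the required financial data, and complete the calculations below. All amounts must be expressed in **millions of US dollars**.

**Raw data to extract (must be found in the text):**
- revenue_2023: total revenue for fiscal 2023
- revenue_2022: total revenue for fiscal 2022
- cogs_2023: cost of sales for fiscal 2023
- cogs_2022: cost of sales for fiscal 2022
- net_income_2023: net income for fiscal 2023
- net_income_2022: net income for fiscal 2022
- r_and_d_2023: R&D expense for fiscal 2023
- r_and_d_2022: R&D expense for fiscal 2022
- total_assets_2023: total assets at the end of fiscal 2023
- total_liabilities_2023: total liabilities at the end of fiscal 2023
- operating_cash_flow_2023: cash flow from operating activities for fiscal 2023
- capital_expenditure_2023: capital expenditure for fiscal 2023 (usually the cash outflow for "payments for acquisition of property, plant and equipment")

**Metrics to compute (calculate from the data extracted above and include in the JSON):**
- revenue_growth: revenue growth rate, formatted like "8.5%"
- gross_margin_2023: 2023 gross margin, formatted like "40.2%"
- gross_margin_2022: 2022 gross margin, formatted like "39.8%"
- net_profit_margin_2023: 2023 net profit margin, formatted like "25.0%"
- net_profit_margin_2022: 2022 net profit margin, formatted like "24.5%"
- r_and_d_pct_change: change in R&D expense as a share of revenue (percentage points), like "+0.5pp"
- debt_to_assets_2023: 2023 debt-to-assets ratio, formatted like "80.1%"
- free_cash_flow_2023: 2023 free cash flow (in millions of US dollars, as a number)

**Output a JSON object containing all of the fields above.** The JSON object must include every key listed, with correct numeric or string values (percentages as strings, amounts as numbers).

Text:
{combined_text}

Reason step by step, then output the JSON.
"""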
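The prompt requires every key to be present, with numbers for amounts and formatted strings for percentages. A reply can be checked against that contract before it is used. The sketch below is one way to do it: the key names come from the prompt, while `validate_result` and the two regular expressions are hypothetical helpers introduced for illustration.

```python
import re
from typing import Any, Dict, List

NUMERIC_KEYS = [
    "revenue_2023", "revenue_2022", "cogs_2023", "cogs_2022",
    "net_income_2023", "net_income_2022", "r_and_d_2023", "r_and_d_2022",
    "total_assets_2023", "total_liabilities_2023",
    "operating_cash_flow_2023", "capital_expenditure_2023",
    "free_cash_flow_2023",
]
PERCENT_KEYS = [
    "revenue_growth", "gross_margin_2023", "gross_margin_2022",
    "net_profit_margin_2023", "net_profit_margin_2022", "debt_to_assets_2023",
]
PERCENT_RE = re.compile(r"[+-]?\d+(\.\d+)?%")   # e.g. "44.13%" or "-2.80%"
PP_RE = re.compile(r"[+-]?\d+(\.\d+)?pp")       # e.g. "+1.15pp"


def validate_result(result: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the reply matches the contract."""
    problems = []
    for key in NUMERIC_KEYS:
        value = result.get(key)
        # bool is a subclass of int, so it is excluded explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key}: expected a number, got {value!r}")
    for key in PERCENT_KEYS:
        value = result.get(key)
        if not (isinstance(value, str) and PERCENT_RE.fullmatch(value)):
            problems.append(f"{key}: expected a string like '40.2%', got {value!r}")
    value = result.get("r_and_d_pct_change")
    if not (isinstance(value, str) and PP_RE.fullmatch(value)):
        problems.append(f"r_and_d_pct_change: expected a string like '+0.5pp', got {value!r}")
    return problems


print(validate_result({"revenue_2023": 383285}))  # lists every missing or malformed field
```

A non-empty list could trigger a retry or a manual review instead of silently passing incomplete data downstream.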

2.2 Calling the LLM

With the prompt ready, the next step is to call the API, retrieve the LLM's output and its reasoning trace, and parse the result.


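import json

# `client` is assumed to be an OpenAI-compatible SDK client created earlier, and
# `model_name` the name of a model that supports JSON mode; neither is defined in this post.
response = client.chat.completions.create(
    model=model_name,          # a model that supports JSON mode
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    max_tokens=1500,
    response_format={"type": "json_object"}   # force JSON output
)

# Extract the returned JSON string
content = response.choices[0].message.content
# reasoning_content is a provider-specific field, so fall back to None when it is absent
reasoning_content = getattr(response.choices[0].message, "reasoning_content", None)
print("Raw model output:", content)
print("Model reasoning:", reasoning_content)

# Parse the JSON
try:
    result = json.loads(content)
    print("\nParsed result:")
    print(json.dumps(result, indent=2, ensure_ascii=False))
except json.JSONDecodeError as e:
    print("JSON parsing failed:", e)
    result = {}  # keep `result` defined so the later comparison code does not raise NameError
    # A regex-based extraction could be attempted here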
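The comment in the error branch suggests falling back to regex extraction. A simple regex struggles with nested braces, so the sketch below uses a different technique: it scans for each `{` and lets `json.JSONDecoder.raw_decode` try to parse from that position. The function name `extract_first_json_object` is a hypothetical helper, and this is one possible fallback rather than the approach used in the original experiment.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


noisy = 'Here is the result: {"revenue_2023": 383285, "revenue_growth": "-2.80%"} Hope this helps.'
print(extract_first_json_object(noisy))
# {'revenue_2023': 383285, 'revenue_growth': '-2.80%'}
```

In the `except` branch, `result = extract_first_json_object(content) or {}` would be one way to use it.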

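A sample output follows. In the reasoning trace (reasoning_content), the LLM first extracts the raw data, then lists the metrics to compute and the inputs each one needs.

On that basis it produces the computed values and then verifies them.

The reasoning states almost no formulas explicitly, yet the verification code in section 2.3 shows that the metrics match the ground truth, apart from a 0.01-point rounding difference in net_profit_margin_2023 (25.30% vs 25.31%).

Presumably the LLM has internalized these standard formulas and does not need to spell them out.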

Raw model output: {
  "revenue_2023": 383285,
  "revenue_2022": 394328,
  "cogs_2023": 214137,
  "cogs_2022": 223546,
  "net_income_2023": 96995,
  "net_income_2022": 99803,
  "r_and_d_2023": 29915,
  "r_and_d_2022": 26251,
  "total_assets_2023": 352583,
  "total_liabilities_2023": 290437,
  "operating_cash_flow_2023": 110543,
  "capital_expenditure_2023": 10959,
  "revenue_growth": "-2.80%",
  "gross_margin_2023": "44.13%",
  "gross_margin_2022": "43.31%",
  "net_profit_margin_2023": "25.30%",
  "net_profit_margin_2022": "25.31%",
  "r_and_d_pct_change": "+1.15pp",
  "debt_to_assets_2023": "82.37%",
  "free_cash_flow_2023": 99584
}

Model reasoning: The user wants me to extract financial data from Apple's 2022 and 2023 10-K reports and calculate specific financial metrics. I need to carefully read through the provided text to find the required data points.

Let me identify what I need to extract:

**Raw Data to Extract:**

revenue_2023: 2023 fiscal year total revenue

revenue_2022: 2022 fiscal year total revenue

cogs_2023: 2023 fiscal year cost of sales

cogs_2022: 2022 fiscal year cost of sales

net_income_2023: 2023 fiscal year net income

net_income_2022: 2022 fiscal year net income

r_and_d_2023: 2023 fiscal year R&D expense

r_and_d_2022: 2022 fiscal year R&D expense

total_assets_2023: 2023 fiscal year end total assets

total_liabilities_2023: 2023 fiscal year end total liabilities

operating_cash_flow_2023: 2023 fiscal year operating cash flow

capital_expenditure_2023: 2023 fiscal year capital expenditure (purchase of property, plant and equipment)

**Calculated Metrics:**

revenue_growth: Revenue growth rate

gross_margin_2023: 2023 gross margin percentage

gross_margin_2022: 2022 gross margin percentage

net_profit_margin_2023: 2023 net profit margin percentage

net_profit_margin_2022: 2022 net profit margin percentage

r_and_d_pct_change: R&D as % of revenue change (percentage points)

debt_to_assets_2023: 2023 debt-to-assets ratio (liabilities/assets)

free_cash_flow_2023: 2023 free cash flow (operating cash flow - capital expenditure)

I'll start by locating the 2023 fiscal year data in the 2023 10-K report. The Consolidated Statements of Operations shows total net sales of $383,285 million for the year ended September 30, 2023, with cost of sales at $214,137 million and net income of $96,995 million. Research and development expenses totaled $29,915 million. From the Consolidated Balance Sheets, I can see total assets were $352,583 million and total liabilities were $290,437 million as of September 30, 2023. The Consolidated Statements of Cash Flows shows operating cash flow of $110,543 million and capital expenditures (payments for acquisition of property, plant and equipment) of $10,959 million for 2023.

Now I need to find the corresponding 2022 data from the 2022 10-K report to complete my analysis.

From the 2022 10-K Consolidated Statements of Operations, total net sales were $394,328 million for the year ended September 24, 2022, with cost of sales at $223,546 million and net income of $99,803 million. Research and development expenses were $26,251 million. The Consolidated Balance Sheets show total assets of $352,755 million and total liabilities of $302,083 million as of September 24, 2022. Operating cash flow from the Consolidated Statements of Cash Flows was $122,151 million, and capital expenditures were $10,708 million for 2022.

Now I can calculate the key financial metrics. Revenue declined 2.81% year-over-year from $394,328 million to $383,285 million. Gross margin improved slightly from 43.31% in 2022 to 44.13% in 2023. Net profit margin decreased from 25.31% to 25.30%. R&D spending as a percentage of revenue increased from 6.66% to 7.81%, representing a 1.15 percentage point increase. The debt-to-assets ratio improved from 85.63% to 82.37%. Free cash flow for 2023 was $99,584 million after subtracting capital expenditures from operating cash flow.

I should verify these calculations against the source documents to ensure accuracy before finalizing the analysis.

From the 2023 10-K Consolidated Statements of Operations, I can confirm: Total net sales of $383,285 million, cost of sales at $214,137 million, net income of $96,995 million, and R&D expenses of $29,915 million. The Balance Sheet shows total assets of $352,583 million and total liabilities of $290,437 million. The Cash Flow Statement reports operating cash flow of $110,543 million and capital expenditures of $10,959 million.

For 2022, the 10-K shows: Total net sales of $394,328 million, cost of sales at $223,546 million, net income of $99,803 million, and R&D expenses of $26,251 million. The Balance Sheet indicates total assets of $352,755 million and total liabilities of $302,083 million. Operating cash flow was $122,151 million with capital expenditures of $10,708 million.

Now I'll calculate the key metrics. Revenue growth came in at -2.80%, reflecting the year-over-year decline. Gross margin improved to 44.13% in 2023 from 43.31% in 2022. Net profit margin remained relatively stable at 25.30% versus 25.31%. R&D intensity increased by 1.15 percentage points to 7.81% of revenue. The debt-to-assets ratio decreased to 82.37% from the prior year. Free cash flow for 2023 totaled $99,584 million after subtracting capital expenditures from operating cash flow.

Parsed result:
{
  "revenue_2023": 383285,
  "revenue_2022": 394328,
  "cogs_2023": 214137,
  "cogs_2022": 223546,
  "net_income_2023": 96995,
  "net_income_2022": 99803,
  "r_and_d_2023": 29915,
  "r_and_d_2022": 26251,
  "total_assets_2023": 352583,
  "total_liabilities_2023": 290437,
  "operating_cash_flow_2023": 110543,
  "capital_expenditure_2023": 10959,
  "revenue_growth": "-2.80%",
  "gross_margin_2023": "44.13%",
  "gross_margin_2022": "43.31%",
  "net_profit_margin_2023": "25.30%",
  "net_profit_margin_2022": "25.31%",
  "r_and_d_pct_change": "+1.15pp",
  "debt_to_assets_2023": "82.37%",
  "free_cash_flow_2023": 99584
}

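2.3 Comparison with the ground truth

The model's accuracy is assessed here by comparing its output against the actual figures from the filings.

Sample code follows.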


# Ground-truth data (in millions of US dollars)
real_data = {
    "revenue_2023": 383285,
    "revenue_2022": 394328,
    "cogs_2023": 214137,
    "cogs_2022": 223546,
    "net_income_2023": 96995,
    "net_income_2022": 99803,
    "r_and_d_2023": 29915,
    "r_and_d_2022": 26251,
    "total_assets_2023": 352583,
    "total_liabilities_2023": 290437,
    "operating_cash_flow_2023": 110543,
    "capital_expenditure_2023": 10959
}

# Ground-truth metrics, computed from the data above and rounded to two decimals
real_metrics = {
    "revenue_growth": f"{(real_data['revenue_2023'] - real_data['revenue_2022'])/real_data['revenue_2022']*100:.2f}%",
    "gross_margin_2023": f"{(real_data['revenue_2023'] - real_data['cogs_2023'])/real_data['revenue_2023']*100:.2f}%",
    "gross_margin_2022": f"{(real_data['revenue_2022'] - real_data['cogs_2022'])/real_data['revenue_2022']*100:.2f}%",
    "net_profit_margin_2023": f"{real_data['net_income_2023']/real_data['revenue_2023']*100:.2f}%",
    "net_profit_margin_2022": f"{real_data['net_income_2022']/real_data['revenue_2022']*100:.2f}%",
    "r_and_d_pct_change": f"{(real_data['r_and_d_2023']/real_data['revenue_2023'] - real_data['r_and_d_2022']/real_data['revenue_2022'])*100:+.2f}pp",
    "debt_to_assets_2023": f"{real_data['total_liabilities_2023']/real_data['total_assets_2023']*100:.2f}%",
    "free_cash_flow_2023": real_data['operating_cash_flow_2023'] - real_data['capital_expenditure_2023']
}

# Compare against the model output
for key in real_metrics:
    if key in result:
        pred = result[key]
        real = real_metrics[key]
        print(f"{key}: predicted {pred} vs actual {real}")
    else:
        print(f"Warning: model output is missing field {key}")

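A sample output follows. Seven of the eight metrics match the ground truth exactly; net_profit_margin_2023 differs by 0.01 percentage point.

revenue_growth: predicted -2.80% vs actual -2.80%
gross_margin_2023: predicted 44.13% vs actual 44.13%
gross_margin_2022: predicted 43.31% vs actual 43.31%
net_profit_margin_2023: predicted 25.30% vs actual 25.31%
net_profit_margin_2022: predicted 25.31% vs actual 25.31%
r_and_d_pct_change: predicted +1.15pp vs actual +1.15pp
debt_to_assets_2023: predicted 82.37% vs actual 82.37%
free_cash_flow_2023: predicted 99584 vs actual 99584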
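The loop above only prints the pairs side by side, so a difference such as 25.30% vs 25.31% has to be spotted by eye. The exact ratio 96995 / 383285 is about 25.306%, which rounds to 25.31% and truncates to 25.30%, so a numeric comparison with a small tolerance is more informative than string equality. The sketch below shows one way to do this; `parse_metric`, `compare_metric` and the 0.01 tolerance are illustrative assumptions.

```python
import re


def parse_metric(value):
    """Convert '44.13%', '+1.15pp', or a plain number into a float."""
    if isinstance(value, (int, float)):
        return float(value)
    match = re.fullmatch(r"\s*([+-]?\d+(?:\.\d+)?)\s*(?:%|pp)?\s*", str(value))
    if match is None:
        raise ValueError(f"cannot parse metric value: {value!r}")
    return float(match.group(1))


def compare_metric(pred, real, tol=0.01):
    """Return (within_tolerance, pred - real) for two metric values."""
    diff = parse_metric(pred) - parse_metric(real)
    # A small epsilon keeps a difference of exactly `tol` from failing due to float error
    return abs(diff) <= tol + 1e-9, diff


# Values copied from the comparison output above
pairs = {
    "net_profit_margin_2023": ("25.30%", "25.31%"),
    "r_and_d_pct_change": ("+1.15pp", "+1.15pp"),
    "free_cash_flow_2023": (99584, 99584),
}
for key, (pred, real) in pairs.items():
    ok, diff = compare_metric(pred, real)
    print(f"{key}: {'OK' if ok else 'MISMATCH'} (diff {diff:+.2f})")
```

The same helper could be applied to the twelve raw extracted values by comparing `result[key]` with `real_data[key]` using `tol=0`, since those should match exactly.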

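Apple's financial data from the fiscal 2022 and fiscal 2023 Form 10-K filings is listed below (M = millions of US dollars).

Metric	FY2023	FY2022
Total revenue	$383,285 M	$394,328 M
Cost of sales	$214,137 M	$223,546 M
Net income	$96,995 M	$99,803 M
R&D expense	$29,915 M	$26,251 M
Total assets	$352,583 M	$352,755 M (end of FY2022)
Total liabilities	$290,437 M	$302,083 M (end of FY2022)
Operating cash flow	$110,543 M	$122,151 M
Capital expenditure	$10,959 M	$10,708 M

The data sources are linked below.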

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm

https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm

References

LLM numerical extraction and computation example

https://blog.csdn.net/liliang199/article/details/159244753

Exploring the relationship between LLM long context and valid numerical output

https://blog.csdn.net/liliang199/article/details/159175752