Phase goals: scale to 150+ buyers, deploy the full NLP model suite, and complete the indicator system
Core deliverables: complete risk/sentiment indicator system, decision API, BI dashboards, production-grade deployment
Data Source Expansion and Full NLP Deployment
Data source expansion (20+ sources)
Task breakdown
yaml
Add 18 new data sources:
Category 1: Major global financial news (5 sources; an RSS collector sketch follows this list)
├─ Bloomberg API
├─ CNBC RSS Feed
├─ Financial Times API
├─ MarketWatch
└─ Wall Street Journal
Category 2: Geopolitical intelligence (4 sources)
├─ ICG (International Crisis Group) report crawler
├─ State Council Information Office (official statements)
├─ Ministry of Foreign Affairs (official statements)
└─ Press releases from foreign embassies in China
Category 3: Commodity futures (3 sources)
├─ LME metals (copper, zinc, aluminum)
├─ CBOT agricultural products (soybeans, wheat, corn)
└─ ICE energy (natural gas, Brent crude)
Category 4: Ports and logistics (4 sources)
├─ Port of Shanghai official API
├─ Port of Rotterdam official API
├─ Port of Singapore official API
└─ MarineTraffic AIS data (real-time vessel positions)
Category 5: FX and financial indicators (2 sources)
├─ Central bank FX reference rate API
└─ CDS spread data (Bloomberg Terminal)
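Several Category 1 sources publish RSS feeds rather than REST APIs. Below is a minimal collector sketch showing how such a feed could be pushed into the same raw_news_events topic; the feedparser dependency, the placeholder feed URL and the fixed confidence value are assumptions for illustration, not part of this plan.
python
import json
import feedparser  # assumed dependency for RSS parsing
from kafka import KafkaProducer
from datetime import datetime

class RSSNewsCollector:
    """Minimal RSS-to-Kafka collector sketch; feed URL and source name are placeholders."""

    def __init__(self, feed_url, source_name, kafka_broker):
        self.feed_url = feed_url
        self.source_name = source_name
        self.producer = KafkaProducer(
            bootstrap_servers=kafka_broker,
            value_serializer=lambda x: json.dumps(x).encode('utf-8')
        )

    def poll_once(self):
        """Read the feed once and publish each entry as a raw news event."""
        feed = feedparser.parse(self.feed_url)
        for entry in feed.entries:
            event = {
                'event_id': entry.get('id', entry.get('link')),
                'ingest_timestamp': datetime.utcnow().isoformat(),
                'risk_date': datetime.utcnow().date().isoformat(),
                'source_type': 'news',
                'source_name': self.source_name,
                'source_url': entry.get('link'),
                'raw_text': entry.get('title', '') + '\n' + entry.get('summary', ''),
                'language': 'en',
                'detected_geo_keys': [],
                'keywords': [],
                'nlp_processing_flag': False,
                'processing_version': 0,
                'confidence_raw': 0.8  # assumed default for RSS sources
            }
            self.producer.send('raw_news_events', value=event)

# Usage (placeholder URL): RSSNewsCollector(CNBC_FEED_URL, 'CNBC', 'kafka1:9092').poll_once()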
Bloomberg API integration (example)
python
import requests
import json
from kafka import KafkaProducer
from datetime import datetime, timedelta
class BloombergCollector:
def __init__(self, api_key, kafka_broker):
self.api_key = api_key
self.producer = KafkaProducer(
bootstrap_servers=kafka_broker,
value_serializer=lambda x: json.dumps(x).encode('utf-8')
)
def fetch_news_by_ticker(self, ticker, since_hours=1):
"""根据商品代码获取相关新闻"""
since_time = datetime.utcnow() - timedelta(hours=since_hours)
headers = {
'Authorization': f'Bearer {self.api_key}',
'Content-Type': 'application/json'
}
params = {
'query': f'ticker:{ticker}',
'startDate': since_time.strftime('%Y-%m-%dT%H:%M:%S'),
'limit': 100
}
response = requests.get(
'https://api.bloomberg.com/v1/articles',
headers=headers,
params=params
)
if response.status_code == 200:
articles = response.json().get('articles', [])
for article in articles:
event = {
'event_id': article['id'],
'ingest_timestamp': datetime.utcnow().isoformat(),
'risk_date': article['publishedDate'][:10],
'source_type': 'news',
'source_name': 'Bloomberg',
'source_url': article['url'],
'raw_text': article['title'] + '\n' + article.get('summary', ''),
'language': 'en',
'detected_geo_keys': [],
'keywords': [ticker] + article.get('tags', []),
'nlp_processing_flag': False,
'processing_version': 0,
'confidence_raw': 0.9,
                    '_ticker': ticker  # tag the commodity
}
self.producer.send('raw_news_events', value=event)
def run_continuous(self):
"""持续监测多个商品"""
import schedule
import time
        tickers = ['CRU', 'USCRWTIC', 'GC', 'SI', 'RB']  # example commodity tickers
def job():
for ticker in tickers:
try:
self.fetch_news_by_ticker(ticker)
except Exception as e:
print(f'Error fetching {ticker}: {e}')
schedule.every(30).minutes.do(job)
while True:
schedule.run_pending()
time.sleep(60)
# Deployment entry point
if __name__ == '__main__':
collector = BloombergCollector(
api_key='your_api_key',
kafka_broker='kafka1:9092,kafka2:9092,kafka3:9092'
)
collector.run_continuous()
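The collector above drops an article batch whenever a single request fails. A small retry helper with exponential backoff, sketched below under the assumption that transient network failures should be retried a few times before surfacing (the helper name and limits are illustrative, not from this plan):
python
import time
import requests

def fetch_with_retry(fetch_fn, max_retries=3, base_delay=2.0):
    """Retry fetch_fn up to max_retries times on network errors, then let it raise."""
    for attempt in range(max_retries):
        try:
            return fetch_fn()
        except requests.RequestException as e:
            delay = base_delay * (2 ** attempt)  # exponential backoff: 2s, 4s, 8s, ...
            print(f'Request failed ({e}), retrying in {delay:.0f}s')
            time.sleep(delay)
    return fetch_fn()  # final attempt; exceptions propagate to the caller

# Usage inside the scheduler job (hypothetical):
# fetch_with_retry(lambda: collector.fetch_news_by_ticker('CRU'))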
Port status API integration
python
import requests
import json
from kafka import KafkaProducer
from datetime import datetime
class PortStatusCollector:
"""收集主要港口的实时状态"""
MAJOR_PORTS = {
'shanghai': {
            'name': 'Port of Shanghai',
'api': 'https://api.shanghai-port.com/v1/status',
'geo_key': ['China', 'Shanghai']
},
'rotterdam': {
            'name': 'Port of Rotterdam',
'api': 'https://api.portofrotterdam.com/vessel',
'geo_key': ['Netherlands', 'Rotterdam']
},
'singapore': {
            'name': 'Port of Singapore',
'api': 'https://api.mpa.gov.sg/vessels',
'geo_key': ['Singapore']
},
'suez': {
            'name': 'Suez Canal',
'api': 'https://api.suez-canal.gov.eg/status',
'geo_key': ['Egypt', 'Suez']
},
'panama': {
            'name': 'Panama Canal',
'api': 'https://api.panama-canal.com/status',
'geo_key': ['Panama']
}
}
def __init__(self, kafka_broker):
self.producer = KafkaProducer(
bootstrap_servers=kafka_broker,
value_serializer=lambda x: json.dumps(x).encode('utf-8')
)
def fetch_port_status(self, port_name, port_info):
"""获取港口状态"""
try:
response = requests.get(port_info['api'], timeout=10)
if response.status_code == 200:
data = response.json()
                # Parse the port status into a standardized event
status_event = {
'event_id': f"PORT_{port_name}_{datetime.utcnow().timestamp()}",
'ingest_timestamp': datetime.utcnow().isoformat(),
'risk_date': datetime.utcnow().date().isoformat(),
'source_type': 'port_status',
'source_name': port_info['name'],
'source_url': port_info['api'],
'raw_text': json.dumps(data),
'language': 'en',
'detected_geo_keys': port_info['geo_key'],
'keywords': ['port', 'logistics', 'shipping', port_name.lower()],
'nlp_processing_flag': False,
'processing_version': 0,
'confidence_raw': 1.0,
                    # Port-specific fields
'_port_name': port_name,
                    '_congestion_level': data.get('congestion_level'),  # congestion level
'_vessels_waiting': data.get('vessels_waiting', 0),
'_avg_wait_hours': data.get('avg_wait_hours', 0),
'_operational_status': data.get('status') # Operational/Limited/Closed
}
self.producer.send('port_logistics_status', value=status_event)
except Exception as e:
print(f'Error fetching port status for {port_name}: {e}')
def run_continuous(self):
"""每10分钟检查一次所有港口"""
import schedule
import time
def job():
for port_name, port_info in self.MAJOR_PORTS.items():
self.fetch_port_status(port_name, port_info)
schedule.every(10).minutes.do(job)
while True:
schedule.run_pending()
time.sleep(60)
if __name__ == '__main__':
collector = PortStatusCollector(
kafka_broker='kafka1:9092,kafka2:9092,kafka3:9092'
)
collector.run_continuous()
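The port events above carry congestion fields (_congestion_level, _vessels_waiting, _avg_wait_hours, _operational_status), but this plan does not fix how they map into a 0-10 logistics severity. The helper below is a hypothetical scoring rule (weights and saturation points are assumptions) that the downstream LDI calculation could start from:
python
def port_congestion_score(event):
    """Hypothetical 0-10 severity score derived from a port_logistics_status event."""
    status = (event.get('_operational_status') or 'Operational').lower()
    if status == 'closed':
        return 10.0
    score = 0.0
    # Waiting vessels contribute up to 5 points, saturating at 50 vessels
    score += min(5.0, event.get('_vessels_waiting', 0) / 10.0)
    # Average wait time contributes up to 4 points, saturating at 48 hours
    score += min(4.0, event.get('_avg_wait_hours', 0) / 12.0)
    if status == 'limited':
        score += 1.0
    return round(min(10.0, score), 1)

# Example: a 'Limited' port with 30 vessels waiting ~24h scores 3.0 + 2.0 + 1.0 = 6.0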
Data source monitoring and alerting
yaml
Configure health monitoring for all data sources (an exporter sketch follows this block):
Checks:
├─ API availability (checked hourly)
├─ Data freshness (time of last update)
├─ Message throughput (messages per minute)
└─ Error rate (API request failure rate)
Alert rules:
Rule 1: Data source unavailable
├─ Condition: 3 consecutive 5xx API responses
├─ Severity: P1
└─ Action: notify the Tech Lead + switch to the backup source
Rule 2: Data latency out of bounds
├─ Condition: last update more than 2 hours ago
├─ Severity: P2
└─ Action: notify the data engineer + check the API
Rule 3: Abnormal throughput drop
├─ Condition: current-hour throughput below 50% of last week's average
├─ Severity: P2
└─ Action: alert + log analysis
Prometheus monitoring configuration:
```yaml
global:
scrape_interval: 1m
scrape_configs:
- job_name: 'data_sources_health'
static_configs:
- targets: ['datasource-monitor:9090']
metrics_path: '/metrics'
yaml
Grafana dashboards:
├─ Real-time status of all data sources
├─ Throughput line chart (24h)
├─ Latency heatmap
└─ Error rate trend
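A sketch of the datasource-monitor exporter that the Prometheus job above scrapes, using the prometheus_client library; the metric names and the in-memory freshness tracking are illustrative choices, not fixed by this plan.
python
import time
from prometheus_client import Counter, Gauge, start_http_server

# Illustrative metric names; port 9090 matches the 'datasource-monitor:9090' target above
LAST_EVENT_AGE = Gauge('datasource_last_event_age_seconds', 'Seconds since the last event', ['source'])
EVENTS_TOTAL = Counter('datasource_events_total', 'Events ingested', ['source'])
ERRORS_TOTAL = Counter('datasource_errors_total', 'Failed API requests', ['source'])

last_seen = {}  # source name -> unix timestamp of the last event

def record_event(source):
    """Collectors call this whenever they publish an event."""
    last_seen[source] = time.time()
    EVENTS_TOTAL.labels(source=source).inc()

def record_error(source):
    """Collectors call this whenever an API request fails."""
    ERRORS_TOTAL.labels(source=source).inc()

if __name__ == '__main__':
    start_http_server(9090)  # exposes /metrics for the Prometheus scrape job
    while True:
        now = time.time()
        for source, ts in last_seen.items():
            LAST_EVENT_AGE.labels(source=source).set(now - ts)
        time.sleep(30)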
NLP Model Fine-tuning and Full Deployment
NLP model selection and procurement
yaml
Model selection:
1. Event classification model (four dimensions: A/B/C/D)
├─ Base: mBERT (multilingual) / XLM-RoBERTa
├─ Fine-tuning data: 5,000+ manually labeled samples
│ ├─ Class A (logistics disruption): 1,250
│ ├─ Class B (regulatory restrictions): 1,250
│ ├─ Class C (political instability): 1,250
│ └─ Class D (financial shock): 1,250
Cost:
├─ Data labeling: ¥50k (1,000 samples / ¥500)
├─ Model training: ¥50k (GPU rental)
└─ Model deployment: ¥30k (optimization and version management)
2. Named entity recognition (NER) model
├─ Base: XLM-RoBERTa-large (strong multilingual performance)
├─ Fine-tuning data: 3,000+ annotated sentences
│ ├─ Locations (GPE/LOC): ports, straits, countries
│ ├─ Organizations (ORG): companies, government agencies
│ ├─ Persons (PER): political figures, CEOs
│ └─ Commodities (PRODUCT): oil, iron ore, etc.
└─ Accuracy target: >90% F1 score
Cost:
├─ Labeled data: ¥30k
├─ Model training: ¥20k
└─ Deployment optimization: ¥10k
3. Sentiment analysis model (financial domain)
├─ Base: DistilBERT (lightweight)
├─ Fine-tuning data: 2,000+ financial news samples
│ ├─ Label dimensions: positive / neutral / negative
│ ├─ Intensity: strong / weak
│ └─ Finance-specific: opportunity / risk / neutral
├─ Data sources:
│ ├─ Historical financial events (annotated)
│ ├─ Third-party datasets (financial sentiment corpora)
│ └─ Crowdsourced labeling
└─ Accuracy target: >85%
Cost:
├─ Labeling: ¥20k
├─ Training: ¥15k
└─ Deployment: ¥10k
Total NLP cost: ~¥235k (completed within M4-M5)
Model training and optimization workflow
yaml
Timeline:
Week 15-16: Data preparation and labeling
├─ Prepare training data (5,000 labeled samples)
├─ Data cleaning and validation
├─ Split: 70% train / 15% val / 15% test
└─ Save in a standard format (HuggingFace datasets)
Week 17-18: Model training and evaluation
├─ Classification model training
│ ├─ Base model: mBERT / XLM-RoBERTa
│ ├─ Training parameters (a fine-tuning sketch follows this timeline):
│ │ ├─ Learning rate: 2e-5
│ │ ├─ Batch size: 32
│ │ ├─ Epochs: 3-5
│ │ ├─ Early stopping: stop after 3 epochs without validation improvement
│ │ └─ Optimizer: AdamW
│ ├─ Evaluation metrics: Precision / Recall / F1 (per dimension)
│ └─ Target: overall accuracy >95%
│
├─ NER model training
│ ├─ Training parameters (similar)
│ ├─ Evaluation: token-level F1, entity-level F1
│ └─ Target: >90% F1
│
├─ Sentiment model training
│ ├─ 3-class problem (negative/neutral/positive)
│ ├─ Target: accuracy >85%
│ └─ Focus: recognizing finance-specific phrasing
│
├─ Cross-validation and hyperparameter tuning
└─ Compare multiple models and select the best
Week 19-20: Model deployment and inference optimization
├─ Model quantization (lower inference latency)
│ ├─ INT8 quantization (~50% lower inference latency)
│ ├─ Knowledge distillation (~60% smaller model)
│ └─ Mixed-precision training
├─ Deployment environment preparation
│ ├─ GPU server configuration (if GPUs are available)
│ ├─ Inference framework: TorchServe or ONNX Runtime
│ └─ API wrapper (FastAPI)
├─ Performance testing
│ ├─ Throughput: target >1,000 items/sec (classification)
│ ├─ Latency: P99 <100ms (single item)
│ └─ Resource usage: GPU memory <4GB
└─ Model version management
├─ Model Registry (MLflow)
├─ Version tracking and rollback
└─ A/B testing framework
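A minimal fine-tuning sketch for the A/B/C/D classifier using the hyperparameters listed above; the file paths, base checkpoint (xlm-roberta-base) and metric choices are illustrative assumptions, not prescribed by this plan.
python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    AutoModelForSequenceClassification, AutoTokenizer,
    EarlyStoppingCallback, Trainer, TrainingArguments
)

LABELS = ['A', 'B', 'C', 'D']  # logistics / regulatory / political / financial

# Assumed Week 15-16 output: JSON files with records like {"text": "...", "label": 0-3}
dataset = load_dataset('json', data_files={
    'train': 'data/train.json', 'validation': 'data/val.json', 'test': 'data/test.json'
})

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
dataset = dataset.map(lambda b: tokenizer(b['text'], truncation=True, max_length=256), batched=True)
model = AutoModelForSequenceClassification.from_pretrained('xlm-roberta-base', num_labels=len(LABELS))

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average='macro', zero_division=0)
    return {'accuracy': accuracy_score(labels, preds), 'precision': p, 'recall': r, 'f1': f1}

args = TrainingArguments(
    output_dir='models/event-classifier',
    learning_rate=2e-5,              # per the plan
    per_device_train_batch_size=32,  # per the plan
    num_train_epochs=5,              # upper bound; early stopping may finish sooner
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1',      # AdamW is the Trainer default optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # per the plan
)

trainer.train()
print(trainer.evaluate(dataset['test']))  # per-dimension reports can be added on top of this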
Inference service code example:
python
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import pipeline
import uvicorn
app = FastAPI()
# Load the classification model
classifier = pipeline(
'zero-shot-classification',
model='xlm-roberta-large-finetuned-mgir',
device=0
)
class TextInput(BaseModel):
text: str
class ClassificationOutput(BaseModel):
dimension: str
confidence: float
scores: dict
@app.post('/classify')
def classify_text(input_data: TextInput) -> ClassificationOutput:
candidate_labels = [
'Physical logistics disruption',
'Regulatory restrictions',
'Political instability',
'Financial crisis'
]
result = classifier(input_data.text, candidate_labels)
dim_map = {
'Physical logistics disruption': 'A',
'Regulatory restrictions': 'B',
'Political instability': 'C',
'Financial crisis': 'D'
}
    # result['labels'] is sorted by descending score; map each label back to its dimension
    label_scores = {dim_map[lbl]: score for lbl, score in zip(result['labels'], result['scores'])}
    return ClassificationOutput(
        dimension=dim_map[result['labels'][0]],
        confidence=result['scores'][0],
        scores={dim: label_scores.get(dim, 0.0) for dim in ('A', 'B', 'C', 'D')}
    )
if __name__ == '__main__':
uvicorn.run(app, host='0.0.0.0', port=8000)
Deploying the classifier inside Flink:
python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import MapFunction
import requests
import json
class NLPClassificationFunction(MapFunction):
def __init__(self, nlp_api_url='http://nlp-service:8000'):
self.nlp_api_url = nlp_api_url
def map(self, element):
try:
text = element.get('raw_text', '')
            # Call the NLP service API
response = requests.post(
f'{self.nlp_api_url}/classify',
json={'text': text},
timeout=5
)
if response.status_code == 200:
nlp_result = response.json()
element['delivery_impact_dim'] = nlp_result['dimension']
element['confidence_score_nlp'] = nlp_result['confidence']
element['impact_severity_score'] = self._calculate_severity(
nlp_result['confidence'],
element.get('sentiment_score_nlp', 0)
)
element['nlp_processing_flag'] = True
element['processing_version'] = 1
return element
except Exception as e:
print(f'Error in NLP classification: {e}')
return element
def _calculate_severity(self, confidence, sentiment):
"""计算严重度 (0-10)"""
base = confidence * 10
if sentiment < -0.5:
base *= 1.2
return min(10.0, base)
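The severity calculation above consumes sentiment_score_nlp, which the /classify endpoint shown earlier does not produce. A hedged sketch of a companion /sentiment endpoint follows; the model checkpoint name and the [-1, 1] score mapping are assumptions consistent with the DistilBERT plan above.
python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Assumed fine-tuned 3-class (negative/neutral/positive) financial sentiment checkpoint;
# any sequence-classification model with those labels could be substituted here.
sentiment_model = pipeline('text-classification', model='distilbert-finetuned-financial-sentiment')

class TextInput(BaseModel):
    text: str

@app.post('/sentiment')
def score_sentiment(input_data: TextInput):
    result = sentiment_model(input_data.text[:512])[0]  # crude length cap for the sketch
    # Map label + confidence to a score in [-1, 1] used as sentiment_score_nlp
    sign = {'negative': -1.0, 'neutral': 0.0, 'positive': 1.0}.get(result['label'].lower(), 0.0)
    return {'sentiment_score_nlp': sign * result['score'], 'label': result['label']}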
Flink Job Completion and Paimon Data Validation
Complete Flink ETL job
python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode
from pyflink.datastream.functions import (
    MapFunction, KeyedProcessFunction, WindowFunction
)
from pyflink.datastream.connectors.kafka import FlinkKafkaConsumer, FlinkKafkaProducer
from pyflink.datastream.formats.json import (
    JsonRowDeserializationSchema, JsonRowSerializationSchema
)
from pyflink.datastream.window import TumblingEventTimeWindows
from pyflink.common import Duration, Time
from pyflink.common.typeinfo import Types
from pyflink.common.watermark_strategy import WatermarkStrategy, TimestampAssigner
from datetime import datetime, timedelta
import json
class CompleteFlinkETLPipeline:
    """Complete Flink ETL pipeline."""
    def create_environment(self):
        """Create and configure the Flink execution environment."""
        env = StreamExecutionEnvironment.get_execution_environment()
        # Parallelism
        env.set_parallelism(32)
        # State backend and savepoint directory are configured in flink-conf.yaml
        # (state.savepoints.dir: hdfs://namenode:8020/flink/savepoints)
        # Checkpoint configuration
        env.enable_checkpointing(60000)  # every 60 seconds
        checkpoint_config = env.get_checkpoint_config()
        checkpoint_config.set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
        checkpoint_config.set_checkpoint_timeout(600000)  # 10 minutes
        checkpoint_config.set_tolerable_checkpoint_failure_number(3)
        return env
def create_kafka_source(self, env):
"""创建Kafka数据源"""
kafka_props = {
'bootstrap.servers': 'kafka1:9092,kafka2:9092,kafka3:9092',
'group.id': 'flink_etl_complete',
'auto.offset.reset': 'earliest'
}
        # Consume multiple topics
topics = [
'raw_news_events',
'geopolitical_conflict_events',
'sanctions_and_regulations',
'commodity_prices_stream',
'port_logistics_status',
'financial_indicators'
]
        # Deserialize JSON records into named rows
        deserialization_schema = JsonRowDeserializationSchema.builder() \
            .type_info(Types.ROW_NAMED(
                ['event_id', 'ingest_timestamp', 'raw_text', 'source_type', ...],
                [Types.STRING(), Types.STRING(), Types.STRING(), Types.STRING(), ...]
            )).build()
        kafka_source = FlinkKafkaConsumer(
            topics=topics,
            deserialization_schema=deserialization_schema,
            properties=kafka_props
        )

        # Event-time watermarks: 10s bounded out-of-orderness, timestamps taken
        # from the ingest_timestamp field
        class IngestTimestampAssigner(TimestampAssigner):
            def extract_timestamp(self, value, record_timestamp):
                return int(datetime.fromisoformat(
                    value.get('ingest_timestamp')).timestamp() * 1000)

        watermark_strategy = WatermarkStrategy \
            .for_bounded_out_of_orderness(Duration.of_seconds(10)) \
            .with_timestamp_assigner(IngestTimestampAssigner())

        return env.add_source(kafka_source) \
            .assign_timestamps_and_watermarks(watermark_strategy)
def build_pipeline(self, env):
"""构建完整的数据处理管道"""
# 1. Source
raw_stream = self.create_kafka_source(env)
        # 2. Data cleaning (parallelism 16)
cleaned_stream = raw_stream \
.map(DataCleaningFunction()) \
.set_parallelism(16) \
.name('DataCleaning')
        # 3. Deduplication (KeyedProcessFunction with RocksDB-backed state)
deduped_stream = cleaned_stream \
.key_by(lambda x: x.get('source_url')) \
.process(DeduplicationFunction()) \
.set_parallelism(32) \
.name('Deduplication')
        # 4. Geo-enrichment and entity recognition (optional external API calls)
geo_enriched_stream = deduped_stream \
.map(GeoEnrichmentFunction()) \
.set_parallelism(16) \
.name('GeoEnrichment')
        # 5. Time-window aggregation (tumbling event-time window, 60 seconds)
        windowed_stream = geo_enriched_stream \
            .key_by(lambda x: (x.get('buyer_geo_key'), x.get('commodity_type'))) \
            .window(TumblingEventTimeWindows.of(Time.seconds(60))) \
            .apply(DimensionScoreCalculator()) \
            .set_parallelism(32) \
            .name('DimensionScoreCalculation')
        # 6. Composite risk index calculation
composite_stream = windowed_stream \
.map(CompositeRiskCalculator()) \
.set_parallelism(16) \
.name('CompositeRiskCalculation')
        # 7. Output - sink to multiple targets
        # Sink 1: Paimon (standardized events table)
paimon_sink = self.create_paimon_sink('standardized_events')
deduped_stream.add_sink(paimon_sink).name('PaimonStandardizedEvents')
        # Sink 2: Paimon (dimension scores table)
dimension_sink = self.create_paimon_sink('dimension_scores')
windowed_stream.add_sink(dimension_sink).name('PaimonDimensionScores')
        # Sink 3: Doris (final indices table)
doris_sink = DorisStreamLoad(
'doris_cdc',
'mgir',
'mgir_indices_fact',
options={
'fenodes': 'doris_fe:8030',
'username': 'root',
'password': 'doris_password'
}
)
composite_stream.add_sink(doris_sink).name('DorisIndices')
        # Sink 4: Alerting (Red-level risk)
alert_stream = composite_stream \
.filter(lambda x: x.get('risk_alert_level') == 'Red') \
.map(AlertFormattingFunction()) \
.set_parallelism(4) \
.name('AlertFiltering')
alert_kafka_sink = FlinkKafkaProducer(
'alert_notifications',
serialization_schema=JsonRowSerializationSchema.Builder().build(),
producer_config={'bootstrap.servers': 'kafka1:9092,kafka2:9092,kafka3:9092'}
)
alert_stream.add_sink(alert_kafka_sink).name('KafkaAlerts')
        # Sink 5: Monitoring metrics output
metrics_stream = composite_stream \
.map(MetricsFormattingFunction()) \
.set_parallelism(4) \
.name('MetricsFormatting')
        # Push to the Prometheus Pushgateway (optional)
metrics_stream.add_sink(PrometheusMetricsSink()).name('PrometheusMetrics')
return env
def create_paimon_sink(self, table_name):
"""创建Paimon Sink"""
# 这里使用Paimon的FlinkDataStreamSink
# 实际使用时需要引入paimon-flink-runtime的相关依赖
return PaimonStreamSink(
catalog='paimon_catalog',
database='default',
table=table_name,
options={
'warehouse': 'hdfs://namenode:8020/paimon',
'bucket': '64',
'file.compression': 'lz4'
}
)
def submit_job(self, env, job_name):
"""提交Flink任务"""
env.execute(job_name)
# Helper function classes
class DataCleaningFunction(MapFunction):
def map(self, element):
        # ... data cleaning logic
return element
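# DeduplicationFunction is referenced in build_pipeline but not defined in this plan.
# The stub below is a hypothetical implementation: keyed ValueState (backed by RocksDB
# when that state backend is configured) marks a source_url as seen and drops repeats.
class DeduplicationFunction(KeyedProcessFunction):
    def open(self, runtime_context):
        from pyflink.datastream.state import ValueStateDescriptor
        self.seen = runtime_context.get_state(
            ValueStateDescriptor('seen', Types.BOOLEAN()))

    def process_element(self, value, ctx):
        # Emit only the first event per key (source_url); drop subsequent duplicates
        if not self.seen.value():
            self.seen.update(True)
            yield value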
class GeoEnrichmentFunction(MapFunction):
def map(self, element):
        # Call an external GeoIP / geocoding service and populate
        # location_latitude, location_longitude, detected_geo_keys
return element
class DimensionScoreCalculator(WindowFunction):
def apply(self, key, window, elements):
        # Compute LDI, CRI, GRI and FRI over the 60-second window
        # and emit the dimension scores
yield {
'buyer_geo_key': key[0],
'commodity_type': key[1],
'window_end': window.end,
'ldi_score': ...,
'cri_score': ...,
'gri_score': ...,
'fri_score': ...
}
class CompositeRiskCalculator(MapFunction):
def map(self, element):
        # Composite index: 40% * (A+B)/2 + 60% * (C+D)/2, scaled to 0-100
element['composite_risk_index'] = (
0.4 * (element['ldi_score'] + element['cri_score']) / 2 +
0.6 * (element['gri_score'] + element['fri_score']) / 2
) * 10
        # Map the score to an alert level
if element['composite_risk_index'] < 25:
element['risk_alert_level'] = 'Green'
elif element['composite_risk_index'] < 50:
element['risk_alert_level'] = 'Yellow'
elif element['composite_risk_index'] < 75:
element['risk_alert_level'] = 'Orange'
else:
element['risk_alert_level'] = 'Red'
return element
class AlertFormattingFunction(MapFunction):
def map(self, element):
        # Format the alert message
return {
'alert_id': element['event_id'],
'buyer_geo_key': element['buyer_geo_key'],
'risk_score': element['composite_risk_index'],
'timestamp': datetime.now().isoformat(),
'message': f"Red Alert: Risk score {element['composite_risk_index']:.1f} for {element['buyer_geo_key']}"
}
class MetricsFormattingFunction(MapFunction):
def map(self, element):
        # Emit a Prometheus-style metric
return {
'metric': 'mgir_composite_risk_index',
'tags': {
'buyer': element['buyer_geo_key'],
'commodity': element['commodity_type']
},
'value': element['composite_risk_index'],
'timestamp': int(datetime.now().timestamp())
}
# Main entry point
if __name__ == '__main__':
    pipeline = CompleteFlinkETLPipeline()
env = pipeline.create_environment()
env = pipeline.build_pipeline(env)
pipeline.submit_job(env, 'mgir_complete_etl_pipeline')
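A quick worked example of the composite formula and alert thresholds implemented by CompositeRiskCalculator above (the dimension scores are made-up sample values):
python
# Dimension scores on a 0-10 scale (illustrative values)
ldi, cri, gri, fri = 6.0, 4.0, 8.0, 7.0

composite = (0.4 * (ldi + cri) / 2 + 0.6 * (gri + fri) / 2) * 10
# = (0.4 * 5.0 + 0.6 * 7.5) * 10 = (2.0 + 4.5) * 10 = 65.0

assert 50 <= composite < 75   # falls in the 'Orange' band under the 25/50/75 thresholds
print(composite)              # 65.0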
Data validation and quality checks
sql
Run the following data quality checks against Paimon:
```sql
-- 1. 数据新鲜度检查
SELECT
COUNT(*) as total_events,
MAX(ingest_timestamp) as latest_update,
CURRENT_TIMESTAMP - MAX(ingest_timestamp) as age_minutes
FROM paimon_catalog.standardized_events
WHERE risk_date >= CURRENT_DATE - INTERVAL 1 DAY;
-- 2. 去重检查
SELECT
COUNT(*) as total_records,
COUNT(DISTINCT event_id) as unique_events,
COUNT(*) - COUNT(DISTINCT event_id) as duplicate_count
FROM paimon_catalog.standardized_events;
-- 3. 分类覆盖率
SELECT
delivery_impact_dim,
COUNT(*) as count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as percentage
FROM paimon_catalog.standardized_events
WHERE risk_date >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY delivery_impact_dim;
-- 4. 置信度分布
SELECT
ROUND(AVG(confidence_score_nlp), 3) as avg_confidence,
MIN(confidence_score_nlp) as min_confidence,
MAX(confidence_score_nlp) as max_confidence,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY confidence_score_nlp) as p95
FROM paimon_catalog.standardized_events
WHERE risk_date >= CURRENT_DATE - INTERVAL 1 DAY;
Expected results:
├─ Data latency <5 minutes
├─ Deduplication rate >99%
├─ A/B/C/D distribution roughly even (20-30% each)
└─ Average confidence >0.7
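These checks can also be scheduled as a small batch job. The sketch below runs one of them through PyFlink's Table API against the Paimon catalog; the catalog DDL options, table path and alerting hook are assumptions (and the paimon-flink jar must be on the classpath).
python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Register the Paimon catalog (options assumed to match the warehouse used by the sinks)
t_env.execute_sql("""
    CREATE CATALOG paimon_catalog WITH (
        'type' = 'paimon',
        'warehouse' = 'hdfs://namenode:8020/paimon'
    )
""")

dup_check = t_env.execute_sql("""
    SELECT COUNT(*) - COUNT(DISTINCT event_id) AS duplicate_count
    FROM paimon_catalog.`default`.standardized_events
""")

for row in dup_check.collect():
    if row[0] > 0:
        # Hook into the existing alerting channel here (e.g. the alert_notifications topic)
        print(f'Data quality warning: {row[0]} duplicate events found')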
Decision API and BI Dashboard Development
Decision API design and implementation
API specification design
bash
RESTful API design:
1. /v1/risk/latest_alert
├─ Method: GET
├─ Parameters: geo_key (buyer ID)
├─ Returns:
│ {
│ "buyer_geo_key": "BUYER_SH_001",
│ "composite_risk_index": 75.3,
│ "risk_alert_level": "Red",
│ "last_update": "2025-11-30T10:30:00Z",
│ "top_triggers": [
│ {
│ "event_id": "evt_001",
│ "source": "Reuters",
│ "type": "A",
│ "severity": 8.5
│ }
│ ]
│ }
├─ Cache: 1 minute
└─ Implementation: direct query against Doris
2. /v1/risk/events
├─ Method: GET
├─ Parameters:
│ ├─ buyer (buyer ID)
│ ├─ timeframe (24h/7d/30d)
│ ├─ type (A/B/C/D, optional)
│ └─ limit (default 100)
├─ Returns: list of all trigger events for the buyer within the time window
└─ Implementation: Doris query + cache
3. /v1/risk/forecast
├─ Method: GET
├─ Parameters:
│ ├─ geo_key
│ └─ days (1/7/30)
├─ Returns: risk forecast for the next N days
│ {
│ "forecast": [
│ {"date": "2025-12-01", "predicted_risk": 52.3},
│ {"date": "2025-12-02", "predicted_risk": 48.9}
│ ],
│ "confidence": 0.65
│ }
└─ Implementation: simple extrapolation from historical trends (ARIMA or ML)
4. /v1/hedging/suggestion
├─ Method: GET
├─ Parameters:
│ ├─ geo_key
│ └─ commodity
├─ Returns:
│ {
│ "hedging_needed": true,
│ "hedging_ratio": 0.35, # suggested hedge ratio
│ "hedging_instrument": "futures",
│ "futures_contract": "CRU_DEC25",
│ "notional_value": 50000000, # notional value to hedge
│ "estimated_cost": 125000 # estimated hedging cost
│ }
└─ Implementation: recommendation driven by the composite risk index
5. /v1/logistics/alternative_routes
├─ Method: POST
├─ Request body:
│ {
│ "origin_port": "Shanghai",
│ "dest_port": "Rotterdam",
│ "commodity": "Iron Ore",
│ "current_risk_level": "Orange"
│ }
├─ Returns: alternative routes with cost comparison
│ {
│ "routes": [
│ {
│ "route_id": "route_1",
│ "waypoints": ["Shanghai", "Singapore", "Suez", "Rotterdam"],
│ "risk_score": 45,
│ "duration_days": 35,
│ "additional_cost_pct": 5.2,
│ "feasibility": "high"
│ }
│ ]
│ }
└─ Implementation: recommendation based on the current risk index and the shipping network
API implementation (Python FastAPI):
python
from fastapi import FastAPI, Query
from fastapi.responses import JSONResponse
from sqlalchemy import create_engine
import sqlalchemy as sa
from datetime import datetime, timedelta
from typing import Optional
import json
app = FastAPI(title='MGIR Decision API', version='1.0.0')
# Doris connection
doris_engine = create_engine('mysql+pymysql://root:password@doris_fe:9030/mgir')
@app.get('/v1/risk/latest_alert')
async def get_latest_alert(geo_key: str):
"""获取采购商最新风险评分"""
with doris_engine.connect() as conn:
        query = sa.text('''
            SELECT
                buyer_geo_key,
                composite_risk_index,
                CASE
                    WHEN composite_risk_index < 25 THEN 'Green'
                    WHEN composite_risk_index < 50 THEN 'Yellow'
                    WHEN composite_risk_index < 75 THEN 'Orange'
                    ELSE 'Red'
                END as risk_alert_level,
                ingest_timestamp as last_update
            FROM mgir_indices_fact
            WHERE buyer_geo_key = :geo_key
              AND risk_date >= CURDATE()
            ORDER BY ingest_timestamp DESC
            LIMIT 1
        ''')
result = conn.execute(query, {'geo_key': geo_key}).fetchone()
if result:
return {
'buyer_geo_key': result[0],
'composite_risk_index': float(result[1]),
'risk_alert_level': result[2],
'last_update': result[3].isoformat() if result[3] else None
}
        else:
            return JSONResponse(status_code=404, content={'error': 'No data found'})
@app.get('/v1/risk/events')
async def get_risk_events(
buyer: str,
timeframe: str = '24h',
type: Optional[str] = None,
limit: int = 100
):
"""获取采购商的风险事件列表"""
if timeframe == '24h':
since = datetime.utcnow() - timedelta(hours=24)
elif timeframe == '7d':
since = datetime.utcnow() - timedelta(days=7)
elif timeframe == '30d':
since = datetime.utcnow() - timedelta(days=30)
    else:
        return JSONResponse(status_code=400, content={'error': 'Invalid timeframe'})
with doris_engine.connect() as conn:
query_str = '''
SELECT
event_id,
ingest_timestamp,
delivery_impact_dim,
impact_severity_score,
sentiment_score_nlp,
source_url,
confidence_score_nlp
FROM mgir_indices_fact
WHERE buyer_geo_key = :buyer
AND ingest_timestamp >= :since
'''
if type:
query_str += f' AND delivery_impact_dim = :type'
query_str += f' ORDER BY ingest_timestamp DESC LIMIT {limit}'
params = {'buyer': buyer, 'since': since}
if type:
params['type'] = type
results = conn.execute(sa.text(query_str), params).fetchall()
events = []
for row in results:
events.append({
'event_id': row[0],
'timestamp': row[1].isoformat(),
'type': row[2],
'severity': float(row[3]),
'sentiment': float(row[4]),
'source': row[5],
'confidence': float(row[6])
})
return {'events': events, 'count': len(events)}
@app.get('/v1/risk/forecast')
async def get_risk_forecast(geo_key: str, days: int = 7):
"""预测未来N天的风险趋势"""
with doris_engine.connect() as conn:
        # Fetch the last 30 days of data for trend analysis
query = sa.text('''
SELECT
DATE(risk_date) as date,
AVG(composite_risk_index) as avg_risk
FROM mgir_indices_fact
WHERE buyer_geo_key = :geo_key
AND risk_date >= CURDATE() - INTERVAL 30 DAY
GROUP BY DATE(risk_date)
ORDER BY date DESC
''')
results = conn.execute(query, {'geo_key': geo_key}).fetchall()
        # Simple moving-average style forecast
        historical_data = [(row[0], float(row[1])) for row in results]
        forecast = []
        if len(historical_data) >= 7:
            # Use the average of the most recent 7 days as the baseline
            recent_avg = sum(v for d, v in historical_data[:7]) / 7
            # Placeholder trend: linear drift instead of a fitted model
            for i in range(days):
                future_date = datetime.utcnow().date() + timedelta(days=i + 1)
                predicted = recent_avg + (i * 0.5)  # assume +0.5 points of drift per day
                forecast.append({
                    'date': future_date.isoformat(),
                    'predicted_risk': min(100, max(0, predicted))
                })
return {
'geo_key': geo_key,
'forecast': forecast,
'confidence': 0.65 if len(historical_data) >= 7 else 0.35
}
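# Sketch of endpoint 4 (/v1/hedging/suggestion) from the spec above. The ratio and
# instrument heuristics below are illustrative assumptions, not the production hedging logic.
@app.get('/v1/hedging/suggestion')
async def get_hedging_suggestion(geo_key: str, commodity: str):
    """Recommend a hedging ratio from the latest composite risk index."""
    with doris_engine.connect() as conn:
        query = sa.text('''
            SELECT composite_risk_index
            FROM mgir_indices_fact
            WHERE buyer_geo_key = :geo_key
            ORDER BY ingest_timestamp DESC
            LIMIT 1
        ''')
        row = conn.execute(query, {'geo_key': geo_key}).fetchone()
    if not row:
        return JSONResponse(status_code=404, content={'error': 'No data found'})
    risk = float(row[0])
    # Illustrative rule: hedge above a risk of 50, scaling linearly up to 60% at risk = 100
    hedging_needed = risk >= 50
    hedging_ratio = round(min(0.6, max(0.0, (risk - 50) / 100 * 1.2)), 2)
    return {
        'geo_key': geo_key,
        'commodity': commodity,
        'composite_risk_index': risk,
        'hedging_needed': hedging_needed,
        'hedging_ratio': hedging_ratio,
        'hedging_instrument': 'futures' if hedging_needed else None
    }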
# Start the API service
if __name__ == '__main__':
import uvicorn
uvicorn.run(app, host='0.0.0.0', port=8000)
Deployment:
bash
# 1. Create the systemd service (use tee so the redirect runs with root privileges)
sudo tee /etc/systemd/system/mgir-api.service > /dev/null << 'EOF'
[Unit]
Description=MGIR Decision API
After=network.target
[Service]
Type=simple
User=api_user
WorkingDirectory=/opt/mgir-api
ExecStart=/usr/bin/python3 /opt/mgir-api/main.py
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
# 2. Start and enable the service
sudo systemctl start mgir-api
sudo systemctl enable mgir-api
# 3. Configure the reverse proxy (Nginx); quote the heredoc delimiter so $host/$remote_addr are not expanded by the shell
sudo tee /etc/nginx/sites-available/mgir-api > /dev/null << 'EOF'
upstream mgir_api {
server localhost:8000;
}
server {
listen 80;
server_name api.mgir.internal;
location / {
proxy_pass http://mgir_api;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_buffering off;
}
}
EOF
sudo ln -s /etc/nginx/sites-available/mgir-api /etc/nginx/sites-enabled/
sudo systemctl reload nginx
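Once the service and Nginx are up, a quick smoke test (the host and sample geo_key below are placeholders taken from the spec example) confirms the API responds end to end:
python
import requests

BASE_URL = 'http://api.mgir.internal'   # or http://localhost:8000 to bypass Nginx
SAMPLE_GEO_KEY = 'BUYER_SH_001'         # placeholder buyer from the spec example

def smoke_test():
    checks = {
        'latest_alert': f'{BASE_URL}/v1/risk/latest_alert?geo_key={SAMPLE_GEO_KEY}',
        'events': f'{BASE_URL}/v1/risk/events?buyer={SAMPLE_GEO_KEY}&timeframe=24h',
        'forecast': f'{BASE_URL}/v1/risk/forecast?geo_key={SAMPLE_GEO_KEY}&days=7',
    }
    for name, url in checks.items():
        resp = requests.get(url, timeout=10)
        print(f'{name}: HTTP {resp.status_code}')
        resp.raise_for_status()

if __name__ == '__main__':
    smoke_test()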