Infrastructure Planning
1.1 Hardware Configuration Planning
Server types and counts:
Based on a daily data volume of 550 GB and a 3-replica requirement (sizing sketch below):
├─ Total storage: 550 GB × 30 days × 3 replicas ≈ 50 TB (hot data)
├─ Compute: 100+ concurrent queries on an average day, P95 latency < 1 second
└─ Network: average ingest of 15 GB/hour
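A quick back-of-the-envelope check of these figures. The inputs come from the requirements above; the 1.5× headroom factor is an assumption that mirrors the 50% reserve recommended in the bandwidth plan later in this section:
python
# Capacity sizing sketch; all inputs are the figures from the plan above.
DAILY_GB = 550           # raw data per day
RETENTION_DAYS = 30      # hot-data retention
REPLICAS = 3             # Doris replication factor
INGEST_GB_PER_HOUR = 15  # average import volume

hot_storage_tb = DAILY_GB * RETENTION_DAYS * REPLICAS / 1024  # ~48.3 TB, rounded to 50 TB
avg_ingest_gbps = INGEST_GB_PER_HOUR * 8 / 3600               # GB/hour -> Gbit/s

print(f"Hot storage needed  : {hot_storage_tb:6.1f} TB (plan rounds to 50 TB)")
print(f"With 50% headroom   : {hot_storage_tb * 1.5:6.1f} TB")
print(f"Avg ingest bandwidth: {avg_ingest_gbps:.3f} Gbit/s (bursts and replication traffic will be far higher)")
Running it reproduces the ≈50 TB hot-data figure and shows that the average ingest rate is tiny next to the 10 Gbps access links, so peaks, replication, and query traffic dominate the network design.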
Recommended configuration:
┌───────────────────────────────────────────────────────────┐
│           Ad Data Warehouse Infrastructure Plan           │
├─────────────────┬─────────┬───────────────────────────────┤
│ Component       │ Nodes   │ Hardware                      │
├─────────────────┼─────────┼───────────────────────────────┤
│ Doris FE        │ 3       │ 16 cores / 64 GB / 500 GB SSD │
│ Doris BE        │ 6       │ 32 cores / 128 GB / 4 TB SSD  │
│ Kafka Broker    │ 3       │ 16 cores / 64 GB / 2 TB SSD   │
│ Flink Master    │ 2 (HA)  │ 8 cores / 32 GB / 500 GB SSD  │
│ Flink Worker    │ 4       │ 32 cores / 128 GB / 1 TB SSD  │
│ Prometheus      │ 1       │ 8 cores / 32 GB / 500 GB      │
│ ELK Stack       │ 3       │ 8 cores / 32 GB / 2 TB each   │
│ Jump host/mon.  │ 1       │ 4 cores / 16 GB / 200 GB      │
├─────────────────┼─────────┼───────────────────────────────┤
│ Total           │ 23      │ ≈ ¥1.5M–2.0M                  │
└─────────────────┴─────────┴───────────────────────────────┘
Cost breakdown:
├─ Server purchases: ¥1.0M (compute and storage)
├─ Network equipment: ¥150K (switches, fiber, etc.)
├─ Deployment and configuration: ¥200K (labor)
├─ Monitoring tooling: ¥50K (Prometheus/Grafana/ELK licensing)
└─ Data center space: ¥100K/year (power, cooling, network)
High availability and disaster recovery:
├─ Cross-DC deployment: recommended across 2 data centers
├─ Doris cluster: 3-replica policy, automatic failover on single-DC failure
├─ Kafka cluster: 3 replicas, min.insync.replicas=2
├─ Flink: JobManager HA (ZooKeeper or Kubernetes)
└─ Backups: daily backup to a remote site (S3/OSS)
1.2 Network and Storage Planning
Network planning:
Data center network architecture:
┌────────────────────────────────────────────┐
│        Internet ingress (firewall)         │
├────────────────────────────────────────────┤
│         Core switches (10 Gbps)            │
├──────────────┬──────────────┬──────────────┤
│ Aggregation  │ Aggregation  │ Aggregation  │
│ (Doris)      │ (Kafka)      │ (Flink)      │
├──────────────┼──────────────┼──────────────┤
│ Access       │ Access       │ Access       │
│ (6 BE nodes) │ (3 brokers)  │ (4 workers)  │
└──────────────┴──────────────┴──────────────┘
Bandwidth planning:
├─ Core to aggregation: 40 Gbps (redundant)
├─ Aggregation to access: 10 Gbps (redundant)
├─ Doris intra-cluster: estimated 20 Gbps peak
├─ Internet egress: 5 Gbps required (API calls + backups)
└─ Recommended headroom: 50%
Storage planning:
SSD vs. HDD trade-offs:
├─ Doris BE: all SSD (4 TB × 6 = 24 TB)
│  └─ Rationale: query performance is critical, heavy random I/O
├─ Kafka: SSD (2 TB × 3 = 6 TB)
│  └─ Rationale: the message queue needs low latency
├─ Flink/Spark: mixed HDD + SSD
│  └─ Rationale: batch processing tolerates higher latency
└─ Backup/cold storage: HDD (low cost)
   └─ Rationale: backups and long-term retention
Data backup strategy (a simple RPO freshness check follows this list):
├─ RPO (recovery point objective): 1 hour
├─ RTO (recovery time objective): 30 minutes
├─ Backup methods:
│  ├─ Daily full backup (to HDFS or S3)
│  ├─ Hourly incremental backup (Doris binlog)
│  └─ Kafka log retention (7 days)
└─ Off-site backup: periodic upload to a remote DC / public cloud
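A minimal sketch of how the 1-hour RPO can be checked operationally. It assumes backups land in a local staging directory before upload; the path and the *.tar.gz naming pattern are illustrative assumptions, not part of the plan:
python
# RPO freshness check: alert if the newest backup is older than the 1-hour RPO.
# BACKUP_DIR and the "*.tar.gz" pattern are hypothetical placeholders.
import glob
import os
import time

RPO_SECONDS = 3600                 # 1-hour RPO from the plan above
BACKUP_DIR = "/data/backup/doris"  # hypothetical local staging directory

def newest_backup_age_seconds(backup_dir: str) -> float:
    files = glob.glob(os.path.join(backup_dir, "*.tar.gz"))
    if not files:
        return float("inf")
    newest_mtime = max(os.path.getmtime(f) for f in files)
    return time.time() - newest_mtime

if __name__ == "__main__":
    age = newest_backup_age_seconds(BACKUP_DIR)
    if age > RPO_SECONDS:
        print(f"ALERT: newest backup is {age / 60:.0f} min old, exceeds the 60 min RPO")
    else:
        print(f"OK: newest backup is {age / 60:.0f} min old")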
Cluster Deployment
2.1 Doris Cluster Deployment
yaml
# doris-cluster-config.yaml
# Doris Frontend (FE) configuration
frontend_config:
  fe_servers:
    - name: fe1
      host: 10.0.0.1
      edit_log_port: 9010
      http_port: 8030
      query_port: 9030
    - name: fe2
      host: 10.0.0.2
      edit_log_port: 9010
      http_port: 8030
      query_port: 9030
    - name: fe3
      host: 10.0.0.3
      edit_log_port: 9010
      http_port: 8030
      query_port: 9030
  # FE configuration parameters
  fe_conf:
    # Memory
    JAVA_OPTS: "-Xmx32g -Xms32g"
    # Logging
    LOG_DIR: "/var/log/doris/fe"
    # Metadata directory (same layout across the HA group)
    meta_dir: "/data/doris/fe/meta"
    # Performance
    max_parallel_load_tasks: 4
    max_load_dop: 16
# Doris Backend (BE) configuration
backend_config:
  be_servers:
    - name: be1
      host: 10.0.1.1
      heartbeat_service_port: 9050
      brpc_port: 8060
      http_port: 8040
    - name: be2
      host: 10.0.1.2
      heartbeat_service_port: 9050
      brpc_port: 8060
      http_port: 8040
    # ... 6 BE nodes in total
  # BE configuration parameters
  be_conf:
    # Memory (the BE is a C++ process; cap it with mem_limit rather than JVM heap flags)
    mem_limit: "80%"
    # Storage
    storage_root_path: "/data/doris/be/storage"
    # Performance
    read_size: 8388608                     # 8 MB read block size
    disable_storage_page_cache: "false"
    # Scan threads
    doris_scanner_thread_pool_queue_size: 10000
    doris_scanner_thread_pool_thread_num: 48
# Cluster deployment
deployment:
  mode: "HA"
  replication_factor: 3
  quorum_port: 9999
  # Instance plan
  instances:
    total: 9            # 3 FE + 6 BE
    fe_instances: 3
    be_instances: 6
  # Resource isolation (optional)
  resource_groups:
    - name: "ad_tech"
      cpu_share: 500
      memory_mb: 51200
    - name: "other"
      cpu_share: 100
      memory_mb: 10240
# Monitoring and alerting
monitoring:
  prometheus_port: 9090
  metrics_collect_interval: 30    # seconds
  alerts:
    - name: "FE_down"
      condition: "fe_http_requests_total == 0 for 5m"
      action: "webhook"
    - name: "BE_storage_used_percent"
      condition: "> 85%"
      action: "webhook + pagerduty"
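To complement the alert rules above, the FE group can also be probed directly over its HTTP port. A minimal sketch, assuming that an HTTP 200 from /api/bootstrap (the same endpoint used by the Ansible verification task later in this document) means the FE is serving:
python
# Poll each FE's HTTP port and report which frontends respond.
# Treating HTTP 200 from /api/bootstrap as "FE alive" is an assumption of this sketch.
import requests

FE_HOSTS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # from fe_servers above
FE_HTTP_PORT = 8030

def check_frontends() -> dict:
    status = {}
    for host in FE_HOSTS:
        url = f"http://{host}:{FE_HTTP_PORT}/api/bootstrap"
        try:
            resp = requests.get(url, timeout=5)
            status[host] = "up" if resp.status_code == 200 else f"http {resp.status_code}"
        except requests.RequestException as exc:
            status[host] = f"down ({exc.__class__.__name__})"
    return status

if __name__ == "__main__":
    for host, state in check_frontends().items():
        print(f"FE {host}: {state}")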
2.2 Kafka Cluster Deployment
properties
# kafka-broker-config.properties
# Basic configuration
broker.id=1
broker.rack=rack1
listeners=PLAINTEXT://10.0.2.1:9092,SSL://10.0.2.1:9093
advertised.listeners=PLAINTEXT://10.0.2.1:9092,SSL://10.0.2.1:9093
# ZooKeeper connection (reuses the quorum from the Flink HA section; KRaft is also an option on Kafka 3.5)
zookeeper.connect=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/kafka
# Performance tuning
num.network.threads=16
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
# Log configuration
log.dirs=/data/kafka/logs
num.partitions=48                      # many partitions for parallelism
default.replication.factor=3
min.insync.replicas=2
# Retention policy
log.retention.hours=168                # 7 days
log.retention.bytes=2147483648         # 2 GiB per partition
log.segment.bytes=1073741824           # 1 GiB segments
# Flush and compression
log.flush.interval.messages=10000
log.flush.interval.ms=1000
compression.type=snappy
# Group coordination (note: heartbeat.interval.ms and session.timeout.ms are consumer-side settings)
group.initial.rebalance.delay.ms=3000
heartbeat.interval.ms=3000
session.timeout.ms=30000
# Security (optional)
security.protocol=SSL
ssl.keystore.location=/etc/kafka/secrets/kafka.broker.keystore.jks
ssl.keystore.password=password
ssl.key.password=password
# Topic configuration
# Impression events topic
# Topic: raw_impression
# Partitions: 48
# Replication Factor: 3
# Retention: 168 hours
# Compression: snappy
# Click events topic
# Topic: raw_click
# Partitions: 24
# Replication Factor: 3
# Conversion events topic
# Topic: raw_conversion
# Partitions: 12
# Replication Factor: 3
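The topic layout in the comments above can also be created from a script instead of by hand. A minimal sketch using kafka-python's admin client (the same library the collection script in section 3.1 depends on); the retention, compression, and min.insync.replicas values simply mirror the broker settings and can be adjusted:
python
# Create the three raw-event topics with the partition counts listed above.
# Broker addresses and topic specs come from this plan; values are a sketch, not mandatory.
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

BROKERS = ["10.0.2.1:9092", "10.0.2.2:9092", "10.0.2.3:9092"]

TOPIC_SPECS = [
    ("raw_impression", 48),
    ("raw_click", 24),
    ("raw_conversion", 12),
]

def create_topics():
    admin = KafkaAdminClient(bootstrap_servers=BROKERS, client_id="adtech-topic-setup")
    topics = [
        NewTopic(
            name=name,
            num_partitions=partitions,
            replication_factor=3,
            topic_configs={
                "retention.ms": str(168 * 3600 * 1000),  # 168 hours
                "compression.type": "snappy",
                "min.insync.replicas": "2",
            },
        )
        for name, partitions in TOPIC_SPECS
    ]
    try:
        admin.create_topics(new_topics=topics)
        print("Topics created:", [t.name for t in topics])
    except TopicAlreadyExistsError:
        print("Topics already exist, nothing to do")
    finally:
        admin.close()

if __name__ == "__main__":
    create_topics()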
2.3 Flink Cluster Deployment
yaml
# flink-config.yaml
# Cluster configuration
jobmanager.rpc.address: 10.0.3.1
jobmanager.rpc.port: 6123
jobmanager.memory.process.size: 4gb
rest.port: 8081                      # replaces the deprecated jobmanager.web.port
# High availability (ZooKeeper HA)
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: flink_ad_tech
# TaskManager configuration
taskmanager.numberOfTaskSlots: 8
taskmanager.memory.process.size: 64gb
taskmanager.memory.framework.heap.size: 128mb
taskmanager.memory.jvm-metaspace.size: 256mb
# Network memory
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.network.min: 64mb
taskmanager.memory.network.max: 1gb
# State backend (RocksDB)
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.memory.managed: true
taskmanager.memory.managed.fraction: 0.5   # managed memory share used by RocksDB
# Checkpointing
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.interval: 60 s
execution.checkpointing.min-pause: 30 s
state.checkpoints.dir: hdfs:///flink/checkpoints
# Parallelism
parallelism.default: 32
# Metrics
metrics.reporter.prometheus.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prometheus.port: 9091
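A small post-deployment sanity check against the JobManager's REST API, using the same /v1/taskmanagers endpoint as the Ansible verification task later in this document. The expected counts (4 workers × 8 slots) come from the plan above; the slotsNumber field name follows the Flink REST response and should be verified against your Flink version:
python
# Count registered TaskManagers and total task slots via the Flink REST API.
# Expected values come from the cluster plan (4 TaskManagers x 8 slots each).
import requests

FLINK_REST = "http://10.0.3.1:8081"   # JobManager host from flink-config.yaml
EXPECTED_TASKMANAGERS = 4
EXPECTED_SLOTS = 4 * 8

def check_flink_cluster() -> bool:
    resp = requests.get(f"{FLINK_REST}/v1/taskmanagers", timeout=5)
    resp.raise_for_status()
    taskmanagers = resp.json().get("taskmanagers", [])
    total_slots = sum(tm.get("slotsNumber", 0) for tm in taskmanagers)
    print(f"TaskManagers registered: {len(taskmanagers)} (expected {EXPECTED_TASKMANAGERS})")
    print(f"Total task slots       : {total_slots} (expected {EXPECTED_SLOTS})")
    return len(taskmanagers) >= EXPECTED_TASKMANAGERS

if __name__ == "__main__":
    check_flink_cluster()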
Data Pipeline Setup
3.1 API Data Collection Script
python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Ad platform API data collection script.
Pulls report data from each platform's API and pushes it to Kafka.
"""
import json
import logging
import time
from datetime import datetime, timedelta

import requests
from kafka import KafkaProducer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class AdPlatformCollector:
    def __init__(self, kafka_broker='localhost:9092'):
        self.kafka_producer = KafkaProducer(
            bootstrap_servers=[kafka_broker],
            value_serializer=lambda v: json.dumps(v).encode('utf-8'),
            acks='all',
            retries=3
        )
    def collect_baidu_ads(self, token, start_date, end_date):
        """Collect Baidu search ads report data."""
        url = "https://api.baidu.com/json/sms/v3/ReportService/queryReport"
        report_types = ['STAT_FIELD_CLKS', 'STAT_FIELD_COST', 'STAT_FIELD_PV']
        for report_type in report_types:
            payload = {
                'clientToken': token,
                'reportType': report_type,
                'startDate': start_date,
                'endDate': end_date,
                'pageNum': 1,
                'pageSize': 10000
            }
            try:
                response = requests.post(url, json=payload, timeout=30)
                if response.status_code == 200:
                    data = response.json()
                    for record in data.get('records', []):
                        # Normalize to the common schema before publishing
                        standardized = self._standardize_baidu(record)
                        self.kafka_producer.send('raw_click', value=standardized)
                    logger.info(f"Collected {len(data.get('records', []))} records from Baidu")
                else:
                    logger.error(f"Baidu API error: {response.status_code}")
            except Exception as e:
                logger.error(f"Baidu collection failed: {str(e)}")
    def collect_douyin_ads(self, token):
        """Collect Douyin ads report data."""
        url = "https://ad.douyin.com/open_api/v1.3/report/integrated/get/"
        # Pull the last 24 hours of data
        end_date = datetime.now().strftime('%Y-%m-%d')
        start_date = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')
        params = {
            'advertiser_id': 'your_advertiser_id',
            'start_date': start_date,
            'end_date': end_date,
            'data_level': 'REPORT_LEVEL_CAMPAIGN',
            'dimensions': ['campaign_id', 'stat_time_hour'],
            'metrics': ['click', 'show', 'cost']
        }
        headers = {
            'Authorization': f'Bearer {token}'
        }
        try:
            response = requests.get(url, params=params, headers=headers, timeout=30)
            if response.status_code == 200:
                data = response.json()
                for record in data.get('data', []):
                    standardized = self._standardize_douyin(record)
                    self.kafka_producer.send('raw_impression', value=standardized)
                logger.info("Collected Douyin report records")
            else:
                logger.error(f"Douyin API error: {response.status_code}")
        except Exception as e:
            logger.error(f"Douyin collection failed: {str(e)}")
    def _standardize_baidu(self, record):
        """Normalize a Baidu record to the common schema."""
        return {
            'platform': 'baidu',
            'campaign_id': record.get('campaignId'),
            'adgroup_id': record.get('adgroupId'),
            'keyword': record.get('keyword'),
            'pv': record.get('pv'),
            'click': record.get('click'),
            'cost': float(record.get('cost', 0)) / 100,  # fen -> yuan
            'timestamp': datetime.now().isoformat(),
            'data_date': record.get('statDate')
        }

    def _standardize_douyin(self, record):
        """Normalize a Douyin record to the common schema."""
        return {
            'platform': 'douyin',
            'campaign_id': record.get('campaign_id'),
            'show': record.get('show'),
            'click': record.get('click'),
            'cost': float(record.get('cost', 0)),
            'timestamp': datetime.now().isoformat(),
            'stat_time': record.get('stat_time_hour')
        }
    def run(self, config):
        """Run the collection tasks on a fixed schedule."""
        while True:
            try:
                # Run once per hour
                self.collect_baidu_ads(
                    config['baidu_token'],
                    start_date=(datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d'),
                    end_date=datetime.now().strftime('%Y-%m-%d')
                )
                self.collect_douyin_ads(config['douyin_token'])
                self.kafka_producer.flush()  # make sure buffered messages are delivered each cycle
                time.sleep(3600)  # 1 hour
            except Exception as e:
                logger.error(f"Collector error: {str(e)}")
                time.sleep(60)  # wait 1 minute before retrying after an error


if __name__ == '__main__':
    config = {
        'baidu_token': 'your_baidu_token',
        'douyin_token': 'your_douyin_token'
    }
    collector = AdPlatformCollector()
    collector.run(config)
3.2 Doris Ingestion Configuration
sql
-- Real-time ingestion from Kafka into Doris
-- Option 1: Doris Routine Load (recommended)
CREATE ROUTINE LOAD kafka_impression_load ON fact_impression
COLUMNS (impr_id, `timestamp`, advertiser_id, campaign_id, adgroup_id,
         creative_id, user_id, device_type, platform_id, region_id, cost, event_time)
PROPERTIES (
    "format" = "json",
    "jsonpaths" = "[\"$.impr_id\", \"$.timestamp\", \"$.advertiser_id\", ...]",
    "strip_outer_array" = "false",
    "max_filter_ratio" = "0.1"
)
-- With kafka_partitions omitted, all 48 partitions are consumed; new partitions start from the default offset below
FROM KAFKA (
    "kafka_broker_list" = "10.0.2.1:9092,10.0.2.2:9092,10.0.2.3:9092",
    "kafka_topic" = "raw_impression",
    "property.kafka_default_offsets" = "OFFSET_END"
);
-- Monitor the Routine Load job
SHOW ROUTINE LOAD FOR kafka_impression_load\G
-- Option 2: a Kafka connector (batch import)
-- ./bin/kafka-connect.sh config/kafka-connector-doris.conf
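The SHOW ROUTINE LOAD statement above can also be polled from a script over Doris's MySQL protocol (query_port 9030 in the FE config). A minimal sketch using pymysql; the database name, credentials, and the 'State' result column are assumptions to verify against your deployment:
python
# Poll the Routine Load job state over Doris's MySQL protocol (port 9030).
# Host comes from the FE list; database "adtech_dw" and empty root password are placeholders.
import pymysql

def routine_load_state(job_name: str = "kafka_impression_load") -> str:
    conn = pymysql.connect(host="10.0.0.1", port=9030, user="root",
                           password="", database="adtech_dw",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute(f"SHOW ROUTINE LOAD FOR {job_name}")
            row = cur.fetchone()
            # 'State' column name assumed from the SHOW ROUTINE LOAD output
            return row["State"] if row else "NOT_FOUND"
    finally:
        conn.close()

if __name__ == "__main__":
    state = routine_load_state()
    print(f"Routine load state: {state}")
    if state not in ("RUNNING", "NEED_SCHEDULE"):
        print("ALERT: routine load is not running")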
Deployment Configuration Examples
4.1 Deployment Script (Ansible)
yaml
# deploy-adtech-dw.yaml
---
- name: Deploy Adtech Data Warehouse
  hosts: adtech_cluster
  gather_facts: yes
  vars:
    doris_version: "1.2.8"
    flink_version: "1.17.0"
    kafka_version: "3.5.0"
    base_dir: "/opt/adtech"
  tasks:
    - name: Install dependencies
      apt:
        name:
          - openjdk-11-jdk
          - wget
          - curl
          - git
          - build-essential
        state: present
    - name: Create base directory
      file:
        path: "{{ base_dir }}"
        state: directory
        mode: '0755'
    # Doris deployment
    - name: Deploy Doris FE
      block:
        - name: Download Doris
          get_url:
            url: "https://apache-mirror../doris-{{ doris_version }}-bin.tar.gz"
            dest: "{{ base_dir }}/doris.tar.gz"
        - name: Extract Doris
          unarchive:
            src: "{{ base_dir }}/doris.tar.gz"
            dest: "{{ base_dir }}"
            remote_src: yes
        - name: Copy FE config
          template:
            src: fe.conf.j2
            dest: "{{ base_dir }}/doris/fe/conf/fe.conf"
        - name: Start FE
          shell: "{{ base_dir }}/doris/fe/bin/start_fe.sh --daemon"
      when: "'fe' in inventory_hostname"
    # Kafka deployment
    - name: Deploy Kafka
      block:
        - name: Download Kafka
          get_url:
            url: "https://archive.apache.org/dist/kafka/{{ kafka_version }}/kafka_2.13-{{ kafka_version }}.tgz"
            dest: "{{ base_dir }}/kafka.tgz"
        - name: Extract Kafka
          unarchive:
            src: "{{ base_dir }}/kafka.tgz"
            dest: "{{ base_dir }}"
            remote_src: yes
        - name: Copy Kafka config
          template:
            src: server.properties.j2
            dest: "{{ base_dir }}/kafka/config/server.properties"
        - name: Start Kafka
          shell: "{{ base_dir }}/kafka/bin/kafka-server-start.sh -daemon {{ base_dir }}/kafka/config/server.properties"
      when: "'kafka' in inventory_hostname"
    # Flink deployment
    - name: Deploy Flink
      block:
        - name: Download Flink
          get_url:
            url: "https://archive.apache.org/dist/flink/flink-{{ flink_version }}/flink-{{ flink_version }}-bin-scala_2.12.tgz"
            dest: "{{ base_dir }}/flink.tgz"
        - name: Extract Flink
          unarchive:
            src: "{{ base_dir }}/flink.tgz"
            dest: "{{ base_dir }}"
            remote_src: yes
        - name: Copy Flink config
          template:
            src: flink-conf.yaml.j2
            dest: "{{ base_dir }}/flink/conf/flink-conf.yaml"
        - name: Start Flink
          shell: "{{ base_dir }}/flink/bin/start-cluster.sh"
      when: "'flink' in inventory_hostname"
    - name: Verify deployments
      shell: |
        echo "=== Doris FE Status ===" && curl -s http://localhost:8030/api/bootstrap | jq .
        echo "=== Kafka Broker Status ===" && {{ base_dir }}/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 | head -5
        echo "=== Flink Status ===" && curl -s http://localhost:8081/v1/taskmanagers | jq '.taskmanagers | length'
Rollout and Canary Strategy
5.1 Rollout Process
├─ Phase 1: Development (Dev)
│  ├─ 3 servers (1 FE + 1 BE + 1 Kafka)
│  ├─ Data volume: sample data (1 GB)
│  ├─ Duration: 1-2 weeks
│  └─ Exit criteria: functional tests, performance baselines
│
├─ Phase 2: Test (Test/UAT)
│  ├─ 6 servers (a scaled-down copy of the production configuration)
│  ├─ Data volume: historical sample data (100 GB)
│  ├─ Duration: 2-3 weeks
│  └─ Exit criteria: integration tests, load tests, user acceptance
│
├─ Phase 3: Canary
│  ├─ 6 servers (a complete, freshly deployed cluster)
│  ├─ Traffic: 5% of production traffic
│  ├─ Duration: 1-2 weeks
│  ├─ Monitoring: key metrics benchmarked against production
│  └─ Exit criteria: real-time data validation, performance on par with baselines
│
└─ Phase 4: Full Rollout (Production)
   ├─ 23 servers (the full planned configuration)
   ├─ Traffic: 100% of production traffic
   ├─ Duration: rolling rollout over 3-5 days
   ├─ Monitoring: 24/7 real-time monitoring, alert response < 5 minutes
   └─ Rollback: a complete rollback plan prepared in advance
5.2 Monitoring Metrics and Alerts
Key monitoring metrics:
Infrastructure:
├─ CPU utilization: >80% → alert, >95% → P0 alert
├─ Memory utilization: >85% → alert, >95% → P0 alert
├─ Disk utilization: >80% → alert, >95% → P0 alert
├─ Network latency: >50 ms → alert, >100 ms → P1 alert
└─ Packet loss: >0.1% → alert, >0.5% → P0 alert
Data pipeline:
├─ Kafka consumer lag: >100 messages → alert, >1000 messages → P0 alert
├─ Flink job latency: >5 minutes → alert, >1 hour → P0 alert
├─ Flink checkpoint failure rate: >5% → alert
├─ Doris ingestion delay: >10 minutes → alert, >1 hour → P0 alert
└─ Data consistency: Kafka vs. Doris row-count difference >1% → alert
Application:
├─ API response latency: P95 >1 s → alert, P99 >5 s → P1 alert
├─ API error rate: >0.1% → alert, >1% → P0 alert
├─ Query QPS: >1000 QPS → alert, >5000 QPS → P0 alert
└─ Data freshness: >15 minutes → alert, >1 hour → P0 alert (see the probe sketch below)
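A minimal probe for the data-freshness rule above: it compares MAX(event_time) in fact_impression (the table and column used by the Routine Load job in section 3.2) against the wall clock. The database name and credentials are placeholders, and event_time is assumed to be a DATETIME column:
python
# Data-freshness probe for the ">15 minutes -> alert, >1 hour -> P0" rule above.
# Connection details and the "adtech_dw" database name are illustrative assumptions.
from datetime import datetime, timedelta
import pymysql

FRESHNESS_ALERT = timedelta(minutes=15)
FRESHNESS_P0 = timedelta(hours=1)

def check_freshness() -> str:
    conn = pymysql.connect(host="10.0.0.1", port=9030, user="root",
                           password="", database="adtech_dw")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT MAX(event_time) FROM fact_impression")
            latest = cur.fetchone()[0]
    finally:
        conn.close()
    if latest is None:
        return "P0 alert: fact_impression is empty"
    lag = datetime.now() - latest
    if lag > FRESHNESS_P0:
        return f"P0 alert: data is {lag} behind"
    if lag > FRESHNESS_ALERT:
        return f"Alert: data is {lag} behind"
    return f"OK: data is {lag} behind"

if __name__ == "__main__":
    print(check_freshness())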
Summary
This deployment plan delivers:
- Production ready: complete HA and disaster recovery design
- Scalable: supports growth from 500 million to 5 billion events per day
- Highly reliable: 99.9%+ availability, 1-hour RPO
- Low latency: second-level data ingestion, minute-level metric updates
- Manageable: automated deployment with full monitoring and alerting