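Issue 12: Hadoop Cluster Monitoring and Operations - Building an Industrial-Grade Intelligent Operations System
Introduction: Stable operation of the industrial big data platform is the lifeline of smart manufacturing. Starting from the design of the monitoring architecture, this issue examines how to integrate mainstream monitoring tools such as Ganglia, Prometheus, and Grafana, explains how JMX and Metrics interfaces are exposed and scraped, and builds a complete alerting system and automated operations workflow.
12.1 Hadoop Monitoring Architecture Design
12.1.1 Monitoring Requirements of Industrial Big Data Platforms
Monitoring an industrial big data platform differs markedly from monitoring a traditional internet workload:
Monitoring characteristics of industrial big data platforms: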
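┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Monitoring requirements matrix                                                                           │
├─────────────────────┬────────────────────────────┬───────────────────────────────────────────────────────┤
│ Dimension           │ Traditional internet       │ Industrial big data platform                          │
├─────────────────────┼────────────────────────────┼───────────────────────────────────────────────────────┤
│ Data volume         │ GB to TB                   │ TB to PB                                              │
│ Latency requirement │ seconds to minutes         │ milliseconds to seconds                               │
│ SLA target          │ 99.9%                      │ 99.99%+                                               │
│ Fault tolerance     │ delays acceptable          │ zero tolerance (affects the production line)          │
│ Metrics monitored   │ basic resources + business │ resources + jobs + process + equipment                │
│ Alert scenarios     │ generic scenarios          │ process-parameter anomalies, equipment fault warnings │
└─────────────────────┴────────────────────────────┴───────────────────────────────────────────────────────┘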
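To make the SLA row concrete, the sketch below converts an availability target into a yearly downtime budget. The function name and the 365-day year are illustrative assumptions, not part of the original article.

```python
def downtime_budget_minutes(sla_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime, in minutes, that an availability target allows."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - sla_percent / 100.0)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")
```

Moving from 99.9% to 99.99% cuts the yearly budget by a factor of ten, from roughly 8.8 hours to under one hour, which is why alerting and self-healing matter more on the industrial side of the matrix.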
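Industry-specific monitoring metrics:
- PLC sampling frequency, process takt time, OEE (overall equipment effectiveness)
- Sensor data latency, data integrity checks
- Edge-to-cloud data consistency, out-of-order rate of time-series data
- Execution time of critical job DAGs, data pipeline throughput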
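Two of the metrics listed above can be computed with a few lines of code. The sketch below uses the standard OEE definition (availability × performance × quality). The out-of-order rate is defined here, as an assumption, as the share of samples that arrive with a timestamp older than the newest one already seen; a real pipeline may define it differently.

```python
from typing import Sequence


def oee(availability: float, performance: float, quality: float) -> float:
    """Overall equipment effectiveness as the product of its three factors (each in 0..1)."""
    for factor in (availability, performance, quality):
        if not 0.0 <= factor <= 1.0:
            raise ValueError("each OEE factor must be between 0 and 1")
    return availability * performance * quality


def out_of_order_rate(timestamps: Sequence[float]) -> float:
    """Share of samples whose timestamp is older than the newest timestamp seen so far."""
    if not timestamps:
        return 0.0
    late = 0
    high_water_mark = timestamps[0]
    for ts in timestamps[1:]:
        if ts < high_water_mark:
            late += 1
        else:
            high_water_mark = ts
    return late / len(timestamps)


if __name__ == "__main__":
    print(f"OEE: {oee(0.90, 0.95, 0.99):.4f}")
    print(f"Out-of-order rate: {out_of_order_rate([1, 2, 4, 3, 5]):.2f}")
```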
[Figure: layered monitoring architecture. Data source layer: HDFS cluster, YARN cluster, HBase cluster, Hive/Kafka. Collection layer: JMX Exporter, REST API, Node Exporter, Event Log. Storage and compute layer: Prometheus, TimeScaleDB, Elasticsearch. Visualization layer: Grafana, Kibana. Real-time alerting: AlertManager.]
12.1.2 Hadoop's Native Metrics System
Hadoop exposes the runtime state of its components through the Metrics2 framework:
java
// Custom metrics source built on Hadoop Metrics2
package com.industrial.hadoop.metrics;

import static org.apache.hadoop.metrics2.lib.Interns.info;

import org.apache.hadoop.metrics2.MetricsCollector;
import org.apache.hadoop.metrics2.MetricsSource;
import org.apache.hadoop.metrics2.MetricsSystem;
import org.apache.hadoop.metrics2.annotation.Metrics;

@Metrics(context = "industrial-process")
public class IndustrialProcessMetrics implements MetricsSource {
    private final String processId;

    // Process parameter metrics (updated by the data pipeline)
    private volatile long processDataCollected;  // counter: process data points collected
    private volatile int processingLatencyMs;    // gauge: data processing latency (ms)
    private volatile double dataCompleteness;    // gauge: data completeness ratio

    // Equipment status metrics
    private volatile int deviceOnlineCount;      // gauge: devices online
    private volatile long alertCount;            // counter: alerts raised

    public IndustrialProcessMetrics(String processId) {
        this.processId = processId;
    }

    // Called by the metrics system on every snapshot; publishes one record.
    @Override
    public void getMetrics(MetricsCollector collector, boolean all) {
        collector.addRecord("IndustrialProcessMetrics")
            .setContext("industrial-process")
            .tag(info("ProcessId", "Process identifier"), processId)
            .addCounter(info("ProcessDataCollected", "Process data points collected"), processDataCollected)
            .addGauge(info("ProcessingLatencyMs", "Data processing latency (ms)"), processingLatencyMs)
            .addGauge(info("DataCompleteness", "Data completeness ratio"), dataCompleteness)
            .addGauge(info("DeviceOnlineCount", "Devices online"), deviceOnlineCount)
            .addCounter(info("AlertCount", "Alerts raised"), alertCount);
    }

    // Register the source with the metrics system and return it for later updates
    public static IndustrialProcessMetrics init(MetricsSystem ms, String processId) {
        return ms.register("IndustrialProcessMetrics",
            "Industrial process monitoring metrics",
            new IndustrialProcessMetrics(processId));
    }
}
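Metrics registered through Metrics2 are also visible as JMX beans, and Hadoop daemons serve those beans as JSON from the `/jmx` path of their web UI. The sketch below parses such a payload and derives a capacity ratio. The sample payload and its numbers are invented for illustration, and the helper names are assumptions; in practice the text would come from an HTTP request to the NameNode.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical, heavily trimmed example of a NameNode /jmx response.
SAMPLE_JMX_RESPONSE = """
{"beans": [
  {"name": "Hadoop:service=NameNode,name=FSNamesystem",
   "CapacityTotal": 1000, "CapacityUsed": 850, "CapacityRemaining": 150},
  {"name": "java.lang:type=Memory",
   "HeapMemoryUsage": {"used": 512, "max": 1024}}
]}
"""

FS_NAMESYSTEM_BEAN = "Hadoop:service=NameNode,name=FSNamesystem"


def find_bean(payload: Dict[str, Any], bean_name: str) -> Optional[Dict[str, Any]]:
    """Return the bean with the given JMX object name, or None if it is absent."""
    for bean in payload.get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    return None


def capacity_used_ratio(payload: Dict[str, Any]) -> float:
    """Compute CapacityUsed / CapacityTotal from the FSNamesystem bean."""
    bean = find_bean(payload, FS_NAMESYSTEM_BEAN)
    if bean is None or not bean.get("CapacityTotal"):
        raise ValueError("FSNamesystem bean with a non-zero CapacityTotal not found")
    return bean["CapacityUsed"] / bean["CapacityTotal"]


if __name__ == "__main__":
    payload = json.loads(SAMPLE_JMX_RESPONSE)
    print(f"HDFS capacity used: {capacity_used_ratio(payload):.0%}")
```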
12.2 Prometheus + Grafana Monitoring Stack
12.2.1 Hadoop JMX Exporter Configuration
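yaml
# jmx_exporter configuration - hadoop-common
apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-jmx-config
  namespace: bigdata
data:
  hadoop-config.yaml: |
    # Standalone-server mode only: this must be a JMX remote URL, not the Hadoop /jmx HTTP servlet.
    # Omit jmxUrl when the exporter runs as a -javaagent inside the daemon.
    jmxUrl: service:jmx:rmi:///jndi/rmi://localhost:8004/jmxrmi
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    whitelistObjectNames:
      - "Hadoop:*"
      - "java.lang:type=Memory"
      - "java.lang:type=GarbageCollector,*"
    rules:
      # Rules are tried in order and the first match wins, so specific rules come first.
      # HDFS capacity metrics, e.g. hadoop_namenode_capacity_used
      - pattern: 'Hadoop<service=NameNode, name=FSNamesystem><>Capacity(Total|Used|Remaining):'
        name: hadoop_namenode_capacity_$1
        type: GAUGE
        labels:
          component: "namenode"
      # Block status and other FSNamesystem metrics
      - pattern: 'Hadoop<service=NameNode, name=FSNamesystem><>(\w+)'
        name: hadoop_namenode_$1
        type: GAUGE
        labels:
          component: "namenode"
      # Remaining HDFS NameNode metrics
      - pattern: 'Hadoop<service=NameNode, name=(\w+)><>(\w+)'
        name: hadoop_namenode_$1_$2
        type: GAUGE
      # YARN ResourceManager metrics
      - pattern: 'Hadoop<service=ResourceManager, name=(\w+)><>(\w+)'
        name: hadoop_resourcemanager_$1_$2
        type: GAUGE
      # JVM heap metrics
      - pattern: 'java.lang<type=Memory><HeapMemoryUsage>(\w+)'
        name: jvm_heap_$1
        type: GAUGE
      # GC metrics (collector names such as "PS Scavenge" contain spaces)
      - pattern: 'java.lang<type=GarbageCollector, name=([^>]+)><>CollectionCount'
        name: jvm_gc_collection_count_total
        labels:
          gc: "$1"
        type: COUNTER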
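The rule mechanics are easier to follow with a small simulation. The sketch below approximates how an exporter rule turns a flattened bean string into a metric name: rules are tried in order, the first match wins, capture groups fill the name template, and the result is lowercased. It is a simplified stand-in written for this article, not the exporter's own code, and the two rules use the same `domain<properties><keys>attribute` pattern form as the JVM heap rule above.

```python
import re
from typing import List, Optional, Tuple

# (pattern, name template) pairs; \1 refers to the first capture group.
RULES: List[Tuple[str, str]] = [
    (r"Hadoop<service=NameNode, name=FSNamesystem><>(\w+)", r"hadoop_namenode_\1"),
    (r"java.lang<type=Memory><HeapMemoryUsage>(\w+)", r"jvm_heap_\1"),
]


def metric_name(bean_attribute: str,
                rules: List[Tuple[str, str]] = RULES,
                lowercase: bool = True) -> Optional[str]:
    """Return the metric name produced by the first matching rule, or None."""
    for pattern, template in rules:
        match = re.search(pattern, bean_attribute)
        if match:
            name = match.expand(template)
            return name.lower() if lowercase else name
    return None


if __name__ == "__main__":
    for candidate in (
        "Hadoop<service=NameNode, name=FSNamesystem><>CapacityUsed",
        "java.lang<type=Memory><HeapMemoryUsage>used",
        "Hadoop<service=DataNode, name=DataNodeActivity><>BytesWritten",
    ):
        print(candidate, "->", metric_name(candidate))
```

Note that lowercasing does not insert underscores: `CapacityUsed` becomes `capacityused`, so alert and dashboard queries have to use whatever names the rules actually emit.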
12.2.2 Prometheus Operator CRD Configuration
yaml
# Prometheus configuration - ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hadoop-namenode-monitor
  namespace: bigdata
  labels:
    release: prometheus  # must match the serviceMonitorSelector of the Prometheus resource
spec:
  jobLabel: hadoop-nn  # name of the Service label whose value becomes the job name
  selector:
    matchLabels:
      app: hadoop-namenode
  namespaceSelector:
    matchNames:
      - bigdata
  endpoints:
    # JMX Exporter endpoint
    - port: jmx
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
    # Hadoop HTTP endpoint. /jmx returns JSON, which Prometheus cannot scrape;
    # /prom is available in Hadoop 3.3+ when hadoop.prometheus.endpoint.enabled=true.
    - port: http
      path: /prom
      interval: 30s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hadoop-alerts
  namespace: bigdata
spec:
  groups:
    - name: hadoop.rules
      interval: 30s
      rules:
        # Metric names below must match the names your exporter rules actually emit.
        # HDFS capacity alert
        - alert: HDFSCapacityHigh
          expr: |
            (hadoop_namenode_capacity_used /
             hadoop_namenode_capacity_total) > 0.85
          for: 5m
          labels:
            severity: warning
            component: hdfs
          annotations:
            summary: "HDFS cluster capacity usage above 85%"
            description: "HDFS {{ $labels.instance }} capacity usage is {{ $value | humanizePercentage }}"
        # DataNode offline alert (live/dead counts are reported by the NameNode)
        - alert: DataNodeDown
          expr: |
            (hadoop_datanode_num_live_nodes /
             (hadoop_datanode_num_live_nodes + hadoop_datanode_num_dead_nodes)) < 0.9
          for: 2m
          labels:
            severity: critical
            component: hdfs
          annotations:
            summary: "Less than 90% of DataNodes are alive"
            description: "Cluster {{ $labels.instance }} DataNode live ratio is {{ $value | humanizePercentage }}"
        # YARN queue resource alert
        - alert: YARNQueueResourceHigh
          expr: |
            (hadoop_resourcemanager_queue_allocated_memory /
             hadoop_resourcemanager_queue_memory) > 0.9
          for: 5m
          labels:
            severity: warning
            component: yarn
          annotations:
            summary: "YARN queue {{ $labels.queue }} resource usage above 90%"
        # JVM heap alert
        - alert: JVMHeapUsageHigh
          expr: |
            (jvm_heap_used / jvm_heap_max) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "JVM heap usage above 85%"
            description: "Component {{ $labels.component }} heap usage is {{ $value | humanizePercentage }}"
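The `for:` field in each rule delays firing until the expression has stayed true for the whole duration. The sketch below simulates that behaviour over a series of rule evaluations. The function name and the boolean input series are illustrative assumptions; the state names follow Prometheus terminology (inactive, pending, firing).

```python
from typing import List, Optional, Sequence


def alert_states(condition: Sequence[bool], interval_s: int, for_s: int) -> List[str]:
    """Return the alert state after each evaluation of the rule expression."""
    states: List[str] = []
    active_since: Optional[int] = None  # index of the evaluation where the condition became true
    for index, is_true in enumerate(condition):
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = index
        held_for_s = (index - active_since) * interval_s
        states.append("firing" if held_for_s >= for_s else "pending")
    return states


if __name__ == "__main__":
    # DataNodeDown-style rule: evaluated every 30s with for: 2m
    series = [True, True, False, True, True, True, True, True]
    for step, state in enumerate(alert_states(series, interval_s=30, for_s=120)):
        print(f"t={step * 30:>3}s  {state}")
```

A short dip below the threshold resets the timer, which is what keeps brief spikes from paging the on-call engineer.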
12.2.3 Grafana Dashboard for Industrial Big Data
json
{
  "dashboard": {
    "title": "Industrial Big Data Platform Monitoring Dashboard",
    "uid": "industrial-hadoop",
    "tags": ["hadoop", "industrial", "manufacturing"],
    "panels": [
      {
        "title": "HDFS storage overview",
        "type": "stat",
        "gridPos": {"x": 0, "y": 0, "w": 6, "h": 4},
        "targets": [
          {
            "expr": "sum(hadoop_namenode_capacity_total)",
            "legendFormat": "Total capacity"
          },
          {
            "expr": "sum(hadoop_namenode_capacity_used)",
            "legendFormat": "Used"
          },
          {
            "expr": "sum(hadoop_namenode_capacity_remaining)",
            "legendFormat": "Remaining"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "bytes",
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 0.7, "color": "yellow"},
                {"value": 0.85, "color": "orange"},
                {"value": 0.9, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Data processing throughput trend",
        "type": "timeseries",
        "gridPos": {"x": 6, "y": 0, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "rate(process_data_collected_total[5m])",
            "legendFormat": "Collection rate"
          },
          {
            "expr": "rate(data_processed_total[5m])",
            "legendFormat": "Processing rate"
          }
        ],
        "options": {
          "legend": {"displayMode": "table", "placement": "right"}
        }
      },
      {
        "title": "Node health",
        "type": "piechart",
        "gridPos": {"x": 18, "y": 0, "w": 6, "h": 8},
        "targets": [
          {
            "expr": "count_values(\"status\", up{job=~\"hadoop-.*\"})",
            "legendFormat": "{{status}}"
          }
        ]
      }
    ]
  }
}
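Grafana places panels on a grid that is 24 columns wide, so the `gridPos` values above can be checked before the dashboard is imported. The sketch below is one possible validation helper, with hypothetical function names; it reuses the three `gridPos` entries from the dashboard JSON with short English titles.

```python
from typing import Dict, List

GRID_COLUMNS = 24  # width of the Grafana dashboard grid

PANELS: List[Dict] = [
    {"title": "HDFS storage overview", "gridPos": {"x": 0, "y": 0, "w": 6, "h": 4}},
    {"title": "Throughput trend", "gridPos": {"x": 6, "y": 0, "w": 12, "h": 8}},
    {"title": "Node health", "gridPos": {"x": 18, "y": 0, "w": 6, "h": 8}},
]


def overlaps(a: Dict[str, int], b: Dict[str, int]) -> bool:
    """True if two grid rectangles share at least one cell."""
    return (a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
            and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"])


def layout_problems(panels: List[Dict]) -> List[str]:
    """List panels that leave the grid or overlap another panel."""
    problems: List[str] = []
    for panel in panels:
        pos = panel["gridPos"]
        if pos["x"] < 0 or pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"{panel['title']}: outside the {GRID_COLUMNS}-column grid")
    for index, first in enumerate(panels):
        for second in panels[index + 1:]:
            if overlaps(first["gridPos"], second["gridPos"]):
                problems.append(f"{first['title']} overlaps {second['title']}")
    return problems


if __name__ == "__main__":
    print(layout_problems(PANELS) or "layout OK")
```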
12.3 Automated Operations and Self-Healing
12.3.1 Ansible Playbook for Automated Deployment and Operations
yaml
# hadoop_operations.yml - Ansible operations playbook
---
- name: Hadoop cluster operations
  hosts: hadoop_cluster
  gather_facts: yes
  become: yes
  vars:
    hadoop_home: /opt/hadoop
    log_dir: /var/log/hadoop
    conf_dir: /etc/hadoop
  tasks:
    # Cluster health checks (read-only, so they never report "changed")
    - name: Check NameNode state
      shell: |
        {{ hadoop_home }}/bin/hdfs haadmin -getServiceState nn1
      register: nn1_state
      changed_when: false
      failed_when: nn1_state.stdout not in ['active', 'standby']
    - name: Check live DataNode count
      shell: |
        {{ hadoop_home }}/bin/hdfs dfsadmin -report | grep 'Live datanodes'
      register: datanode_report
      changed_when: false
    # Capacity management and expansion
    - name: Get HDFS capacity usage
      shell: |
        {{ hadoop_home }}/bin/hdfs dfsadmin -report | grep -E 'Configured|Capacity'
      register: hdfs_capacity
      changed_when: false
    - name: Check disk space
      shell: |
        df -h {{ hadoop_home }} {{ log_dir }}
      register: disk_usage
      changed_when: false
    # YARN queue management
    - name: Show YARN queue status
      shell: |
        {{ hadoop_home }}/bin/yarn queue -status default
      register: yarn_queue
      changed_when: false
    # Job management
    - name: List running applications
      shell: |
        {{ hadoop_home }}/bin/yarn application -list -appStates RUNNING
      register: running_apps
      changed_when: false
    # Log collection (skipped unless log_collection_enabled is set)
    - name: Collect Hadoop logs
      archive:
        path: "{{ log_dir }}/hadoop-*/*.log"
        dest: "/tmp/hadoop_logs_{{ ansible_date_time.epoch }}.tar.gz"
        format: gz
      when: log_collection_enabled | default(false) | bool
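The playbook registers raw command output, which still has to be turned into numbers before it can drive a decision. The sketch below extracts the live and dead DataNode counts from `hdfs dfsadmin -report` text. The sample report is abbreviated and its values are invented; the helper names are assumptions for illustration.

```python
import re
from typing import Dict

# Abbreviated, made-up excerpt in the shape of `hdfs dfsadmin -report` output.
SAMPLE_REPORT = """\
Configured Capacity: 1000 (1000 B)
DFS Used%: 85.00%

-------------------------------------------------
Live datanodes (3):

Dead datanodes (1):
"""

_COUNT_LINE = re.compile(r"^(Live|Dead) datanodes \((\d+)\):", flags=re.MULTILINE)


def datanode_counts(report: str) -> Dict[str, int]:
    """Return the number of live and dead DataNodes found in the report text."""
    counts = {"live": 0, "dead": 0}
    for state, number in _COUNT_LINE.findall(report):
        counts[state.lower()] = int(number)
    return counts


def live_ratio(counts: Dict[str, int]) -> float:
    """Share of DataNodes that are alive; 0.0 when the report lists none."""
    total = counts["live"] + counts["dead"]
    return counts["live"] / total if total else 0.0


if __name__ == "__main__":
    counts = datanode_counts(SAMPLE_REPORT)
    print(counts, f"live ratio = {live_ratio(counts):.0%}")
```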
12.3.2 Configuring Self-Healing Policies
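python
# fault_healing.py - fault self-healing engine
import json
import time
import urllib.parse
import urllib.request
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class FaultType(Enum):
    DATANODE_DOWN = "datanode_down"
    NAMENODE_SAFE_MODE = "namenode_safe_mode"
    YARN_QUEUE_FULL = "yarn_queue_full"
    DISK_FULL = "disk_full"
    GC_PAUSE = "gc_pause"


@dataclass
class HealingPolicy:
    fault_type: FaultType
    condition: str    # PromQL expression; each returned series counts as one fault
    threshold: float  # interpreted here as the minimum number of returned series
    action: str
    cooldown_seconds: int = 300


class PrometheusClient:
    """Minimal client for instant queries against the Prometheus HTTP API."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def query(self, promql: str) -> List[dict]:
        params = urllib.parse.urlencode({"query": promql})
        url = f"{self.base_url}/api/v1/query?{params}"
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["data"]["result"]


class FaultHealingEngine:
    """Fault self-healing engine."""

    def __init__(self, prometheus_url: str, alertmanager_url: str):
        self.prometheus = PrometheusClient(prometheus_url)
        self.alertmanager_url = alertmanager_url  # kept for notifications; unused in this excerpt
        self.healing_policies = self._load_policies()
        self._last_healed: Dict[FaultType, float] = {}
        self._actions: Dict[str, Callable[[], bool]] = {
            "auto_decommission": self._heal_datanode,
            "leave_safe_mode": self._heal_namenode_safe_mode,
            "cleanup_temp_files": self._cleanup_temp_files,
            "add_node_manager": self._scale_nodemanager,
        }

    def _load_policies(self) -> List[HealingPolicy]:
        return [
            # Automatic DataNode decommission and recovery
            HealingPolicy(FaultType.DATANODE_DOWN, 'up{job="hadoop-datanode"} == 0',
                          threshold=1, action="auto_decommission", cooldown_seconds=600),
            # Automatic exit from NameNode safe mode
            HealingPolicy(FaultType.NAMENODE_SAFE_MODE, 'hadoop_namenode_safe_mode == 1',
                          threshold=1, action="leave_safe_mode", cooldown_seconds=300),
            # Automatic disk cleanup (PromQL has no "GB" suffix, so spell out the bytes)
            HealingPolicy(FaultType.DISK_FULL, 'node_filesystem_avail_bytes < 10 * 1024^3',
                          threshold=1, action="cleanup_temp_files", cooldown_seconds=3600),
            # YARN queue scale-out
            HealingPolicy(FaultType.YARN_QUEUE_FULL,
                          'yarn_queue_available_memory / yarn_queue_total_memory < 0.1',
                          threshold=1, action="add_node_manager", cooldown_seconds=1800),
        ]

    def check_and_heal(self) -> List[Dict]:
        """Check every policy and run its healing action when needed."""
        healing_results = []
        for policy in self.healing_policies:
            # Query the series that currently satisfy the fault condition
            result = self.prometheus.query(policy.condition)
            if self._should_heal(policy, result):
                success = self._execute_healing(policy)
                healing_results.append({
                    'policy': policy.fault_type.value,
                    'action': policy.action,
                    'success': success,
                })
        return healing_results

    def _should_heal(self, policy: HealingPolicy, result: List[dict]) -> bool:
        """Heal when enough series match and the cooldown has expired."""
        last = self._last_healed.get(policy.fault_type)
        in_cooldown = last is not None and time.monotonic() - last < policy.cooldown_seconds
        return len(result) >= policy.threshold and not in_cooldown

    def _execute_healing(self, policy: HealingPolicy) -> bool:
        """Run the healing action and start its cooldown."""
        action = self._actions.get(policy.action)
        if action is None:
            return False
        self._last_healed[policy.fault_type] = time.monotonic()
        return action()

    # Site-specific healing actions: placeholders that report failure until implemented.
    def _heal_datanode(self) -> bool:
        return False

    def _heal_namenode_safe_mode(self) -> bool:
        return False

    def _cleanup_temp_files(self) -> bool:
        return False

    def _scale_nodemanager(self) -> bool:
        return False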
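Each policy condition is evaluated as a Prometheus instant query, whose HTTP API answers with a JSON document listing the matching series. The sketch below shows one way to turn such a response into the list of affected instances and a heal/no-heal decision. The sample response, instance names, and helper names are invented for illustration; only the overall response shape (`status`, `data.result`, `metric`, `value`) follows the Prometheus HTTP API.

```python
import json
from typing import List

# Made-up response for the query: up{job="hadoop-datanode"} == 0
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"job": "hadoop-datanode", "instance": "dn2:9404"},
             "value": [1700000000.0, "0"]},
            {"metric": {"job": "hadoop-datanode", "instance": "dn1:9404"},
             "value": [1700000000.0, "0"]}
          ]}}
"""


def faulty_instances(response_text: str) -> List[str]:
    """Return the sorted instance labels of every series in an instant-query response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error', 'unknown error')}")
    series = body["data"]["result"]
    return sorted(item["metric"].get("instance", "unknown") for item in series)


def should_heal(instances: List[str], threshold: float) -> bool:
    """Trigger healing once at least `threshold` series match the fault condition."""
    return len(instances) >= threshold


if __name__ == "__main__":
    down = faulty_instances(SAMPLE_RESPONSE)
    print("DataNodes down:", down, "heal:", should_heal(down, threshold=1))
```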
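12.4 Monitoring Best Practices for Industrial Scenarios
12.4.1 Designing the Metric System
Metric system for industrial big data platform monitoring (full version):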
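┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                  Four-layer metric system                                   │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                             │
│ ┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐ │
│ │ Infrastructure     │ │ Platform services  │ │ Application jobs   │ │ Process metrics    │ │
│ ├────────────────────┤ ├────────────────────┤ ├────────────────────┤ ├────────────────────┤ │
│ │ CPU usage          │ │ NameNode           │ │ MapReduce          │ │ Data throughput    │ │
│ │ Memory usage       │ │ DataNode           │ │ Spark jobs         │ │ Processing latency │ │
│ │ Disk I/O           │ │ ResourceManager    │ │ Hive queries       │ │ Data completeness  │ │
│ │ Network bandwidth  │ │ NodeManager        │ │ Flink streams      │ │ Equipment status   │ │
│ │ Disk capacity      │ │ HBase              │ │ Kafka consumption  │ │ OEE                │ │
│ └────────────────────┘ └────────────────────┘ └────────────────────┘ └────────────────────┘ │
│                                                                                             │
│ Metric categories:                                                                          │
│ - Real-time metrics: millisecond-level collection, live display in Grafana                  │
│ - Historical metrics: minute-level aggregation, stored in TimeScaleDB                       │
│ - Business metrics: aggregated per job, linked to MES/ERP systems                           │
│                                                                                             │
└─────────────────────────────────────────────────────────────────────────────────────────────┘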
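The step from real-time to historical metrics is a downsampling step: raw samples are grouped into one-minute buckets and reduced to a single value. The sketch below averages each bucket. The averaging choice and the function name are assumptions; a production setup would more likely rely on recording rules or database-side aggregation.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def aggregate_per_minute(samples: Iterable[Tuple[float, float]]) -> Dict[int, float]:
    """Average (unix_seconds, value) samples into one value per minute.

    Keys of the result are the start of each minute, in unix seconds.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        minute_start = int(timestamp // 60) * 60
        buckets[minute_start].append(value)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}


if __name__ == "__main__":
    raw = [(0.0, 10.0), (30.0, 20.0), (60.0, 30.0), (119.9, 50.0)]
    for minute, average in aggregate_per_minute(raw).items():
        print(f"minute starting at {minute:>3}s: average latency {average:.1f} ms")
```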
12.4.2 SLA Monitoring and Reporting
sql
-- SLA monitoring report query
SELECT
    date_trunc('day', timestamp) AS day,
    -- Availability: distinct hours with status 'up' out of 24 (assumes hour and date columns exist)
    ROUND(
        COUNT(DISTINCT CASE WHEN status = 'up' THEN hour END) * 100.0 /
        (24 * COUNT(DISTINCT date)), 2
    ) AS availability_pct,
    -- Average latency (cast so ROUND works even if the column is double precision)
    ROUND(AVG(processing_latency_ms)::numeric, 2) AS avg_latency_ms,
    -- P99 latency
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY processing_latency_ms) AS p99_latency_ms,
    -- Data completeness (multiply first to avoid integer division)
    ROUND(
        SUM(data_received) * 100.0 / NULLIF(SUM(data_expected), 0), 2
    ) AS data_completeness_pct,
    -- Job success rate
    ROUND(
        SUM(job_success) * 100.0 / NULLIF(SUM(job_total), 0), 2
    ) AS job_success_rate
FROM hadoop_sla_metrics
WHERE timestamp >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY date_trunc('day', timestamp)
ORDER BY day DESC;
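For readers who want to sanity-check the report outside the database, the sketch below reproduces two of its calculations in Python. `percentile_cont` uses linear interpolation between neighbouring sorted values, which is how a continuous percentile is computed; `success_rate_pct` mirrors the `NULLIF` guard by returning `None` when there are no jobs. Function names and sample numbers are illustrative assumptions.

```python
import math
from typing import Optional, Sequence


def percentile_cont(values: Sequence[float], fraction: float) -> float:
    """Continuous percentile with linear interpolation between sorted neighbours."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def success_rate_pct(succeeded: int, total: int) -> Optional[float]:
    """Job success rate in percent, or None when no jobs ran."""
    if total == 0:
        return None
    return round(succeeded * 100.0 / total, 2)


if __name__ == "__main__":
    latencies_ms = [12.0, 15.0, 11.0, 240.0, 14.0, 13.0]
    print(f"P99 latency: {percentile_cont(latencies_ms, 0.99):.2f} ms")
    print(f"Job success rate: {success_rate_pct(995, 1000)}%")
```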
12.5 Summary of Key Concepts
[Figure: Hadoop monitoring system overview. Collection layer: JMX Exporter, Node Exporter, REST API. Storage layer: Prometheus, TimeScaleDB, Elasticsearch. Visualization layer: Grafana dashboards, AlertManager, reporting system. Self-healing layer: Ansible playbooks, fault self-healing engine, automated scaling.]
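| Monitoring layer | Core component | Key metrics | Collection interval |
|---|---|---|---|
| Infrastructure | Node Exporter | CPU / memory / disk / network | 15s |
| HDFS | JMX Exporter | Capacity / block count / DataNode status | 30s |
| YARN | REST API | Queues / containers / ApplicationMasters | 30s |
| Jobs | Application Master | Progress / failures / resources | 10s |
| Process | Custom metrics | Throughput / latency / completeness | 5s |
| Alerting | AlertManager | Firing / inhibition / notification | Real time |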
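Next Issue
Issue 13 takes a close look at "Data Lake Architecture", covering how to choose among and apply the three open-source data lake options, Delta Lake, Iceberg, and Hudi, in industrial scenarios. Stay tuned!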
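Author: a researcher in intelligent blast furnace ironmaking technology, focusing on the intersection of iron and steel metallurgy and artificial intelligence.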
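Copyright belongs to the author. Do not plagiarize, reuse, or use this work commercially (or for any other profit-making purpose) without permission.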