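MetricService startup walkthrough (ordered as ScopeFormats → ScopeFormat → MetricRegistry → MetricQueryService → MetricGroup → Metric)
This article answers a single question: when the JobManager starts, how does the metrics subsystem connect "naming rules (scope) → metric grouping (MetricGroup) → metric registration (MetricRegistry) → external querying (MetricQueryService)" into one closed loop?
The main thread follows the order you proposed and covers six core objects (the startup chain and utility classes are woven in only where necessary):
- org.apache.flink.runtime.metrics.scope.ScopeFormats: reads the scope templates of each component (JM/TM/Job/Task/Operator...) from the configuration.
- org.apache.flink.runtime.metrics.scope.ScopeFormat: the "compiled" form of a scope template (split into a template array, with the variable replacement positions precomputed).
- org.apache.flink.runtime.metrics.MetricRegistryImpl: the single funnel for metric registration and unregistration; it also starts the QueryService.
- org.apache.flink.runtime.metrics.dump.MetricQueryService: the query service for metrics (an RpcEndpoint); it serializes the registered metric objects into a dump that can be pulled.
- org.apache.flink.metrics.MetricGroup (runtime implementation: org.apache.flink.runtime.metrics.groups.AbstractMetricGroup): organizes metrics into groups and scope levels; handles subgroup creation, reuse of same-named groups, and collision warnings.
- org.apache.flink.metrics.Metric: the minimal abstraction over all metric types (the common parent interface of Counter/Gauge/Histogram/Meter).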
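1. Topic and responsibilities of the core components (overview)
This document covers only the metrics chain: how scope determines naming, how MetricGroup organizes the hierarchy, how MetricRegistryImpl funnels registration, and how MetricQueryService turns the registered metrics into a queryable snapshot.
- org.apache.flink.runtime.entrypoint.ClusterEntrypoint#initializeServices: wires up the metrics objects during startup (registry, metrics RPC, QueryService, process-level built-in metrics).
- org.apache.flink.runtime.metrics.scope.ScopeFormats: reads the scope template for each level from the configuration and sets up parent-child scope inheritance (the "*." prefix).
- org.apache.flink.runtime.metrics.scope.ScopeFormat: "compiles" a scope template into template[] + templatePos/valuePos, so that variable binding becomes an O(k) replacement, where k is the number of variable slots in the template.
- org.apache.flink.metrics.MetricGroup (runtime: org.apache.flink.runtime.metrics.groups.AbstractMetricGroup): provides the hierarchical namespace and subgroup creation; reuse of same-named groups and collision warnings both live here.
- org.apache.flink.metrics.Metric: the minimal abstraction over all metric types (the common parent interface of Counter/Gauge/Histogram/Meter).
- org.apache.flink.runtime.metrics.MetricRegistryImpl: the main entry point for registration and unregistration; it notifies the reporters and syncs metrics into the QueryService.
- org.apache.flink.runtime.metrics.dump.MetricQueryService: an RpcEndpoint running on the dedicated metrics RPC service. It maintains an index "registered metric object → (scopeInfo, metricName)" (four maps, one each for Counter/Gauge/Histogram/Meter), and on queryMetrics(...) it reads the current value of each metric object and serializes the result into a dump (guarded by truncation at messageSizeLimit).
- org.apache.flink.runtime.metrics.util.MetricUtils: utility class that starts the metrics RPC service and registers built-in metric sets such as JVM/System.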
2. A first pass over the startup chain: the order in which ClusterEntrypoint.initializeServices wires up metrics
This chain runs inside ClusterEntrypoint.initializeServices: it first creates metricRegistry, then metricQueryServiceRpcService, then starts the QueryService on that RPC service, and finally registers the process-level built-in metrics (JVM/System).
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java>
FQCN: org.apache.flink.runtime.entrypoint.ClusterEntrypoint#initializeServices
java
metricRegistry = createMetricRegistry(configuration, pluginManager, rpcSystem);
final RpcService metricQueryServiceRpcService =
MetricUtils.startRemoteMetricsRpcService(
configuration,
commonRpcService.getAddress(),
configuration.get(JobManagerOptions.BIND_HOST),
rpcSystem);
metricRegistry.startQueryService(metricQueryServiceRpcService, null);
final String hostname = RpcUtils.getHostname(commonRpcService);
processMetricGroup =
MetricUtils.instantiateProcessMetricGroup(
metricRegistry,
hostname,
ConfigurationUtils.getSystemResourceMetricsProbingInterval(configuration));
createMetricRegistry → metricRegistry(MetricRegistryImpl)
MetricUtils.startRemoteMetricsRpcService → metricQueryServiceRpcService(RpcService)
metricRegistry.startQueryService → MetricQueryService.start()
MetricUtils.instantiateProcessMetricGroup → registers the JVM/System metrics
[Sequence diagram. Participants: ClusterEntrypoint, MetricRegistryImpl, MetricUtils, RpcSystem, RpcServiceBuilder, metricQueryServiceRpcService (RpcService), MetricQueryService. Calls in order: createMetricRegistry(...) → new MetricRegistryImpl(...); startRemoteMetricsRpcService(...) → remoteServiceBuilder(configuration, externalAddress, portRange) → RpcServiceBuilder; withComponentName("flink-metrics") and withExecutorConfiguration(1 thread) → createAndStart() → RpcService; startQueryService(R, resourceID=null) → createMetricQueryService(R, resourceID, maxFrameSize) → start(); instantiateProcessMetricGroup(MR, hostname, interval).]
3. Core class relationships (class diagram)
[Class diagram. Relationships recoverable from the figure:]
- MetricQueryService (<<RpcEndpoint>>) extends RpcEndpoint and implements MetricQueryServiceGateway (<<interface>>).
- MetricRegistryImpl holds the MetricQueryService (queryService).
- AbstractMetricGroup implements MetricGroup (<<interface>>), holds the registry, and calls register/unregister on it.
- Gauge, Counter, Histogram, and Meter (<<interface>>) extend Metric (<<interface>>).
- ScopeFormats creates the ScopeFormat instances; JobManagerScopeFormat, JobManagerJobScopeFormat, JobManagerOperatorScopeFormat, TaskManagerScopeFormat, TaskManagerJobScopeFormat, TaskScopeFormat, and OperatorScopeFormat extend ScopeFormat.
- GenericMetricGroup and AbstractImitatingJobManagerMetricGroup extend AbstractMetricGroup; GenericKeyMetricGroup and GenericValueMetricGroup extend GenericMetricGroup; ProcessMetricGroup extends AbstractImitatingJobManagerMetricGroup.
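4. ScopeFormats: reading the scope templates (the entry point for naming rules)
ScopeFormats is a container class that does something simple: it reads strings from MetricOptions.SCOPE_NAMING_*, instantiates the concrete ScopeFormat subclass for each component, and links parents to children (for "*." inheritance).
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/scope/ScopeFormats.java>
FQCN: org.apache.flink.runtime.metrics.scope.ScopeFormats#fromConfig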
java
public static ScopeFormats fromConfig(Configuration config) {
String jmFormat = config.get(MetricOptions.SCOPE_NAMING_JM);
String jmJobFormat = config.get(MetricOptions.SCOPE_NAMING_JM_JOB);
String tmFormat = config.get(MetricOptions.SCOPE_NAMING_TM);
String tmJobFormat = config.get(MetricOptions.SCOPE_NAMING_TM_JOB);
String taskFormat = config.get(MetricOptions.SCOPE_NAMING_TASK);
String operatorFormat = config.get(MetricOptions.SCOPE_NAMING_OPERATOR);
String jmOperatorFormat = config.get(MetricOptions.SCOPE_NAMING_JM_OPERATOR);
return new ScopeFormats(
jmFormat,
jmJobFormat,
tmFormat,
tmJobFormat,
taskFormat,
operatorFormat,
jmOperatorFormat);
}
- Input: a set of string templates (for example "<host>", "*.<job_name>", "*.<task_name>.<subtask_index>").
- Output: a ScopeFormats instance that holds JobManagerScopeFormat / TaskManagerScopeFormat / TaskScopeFormat / ..., and passes the parent into each one to support inheritance.
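The "*." inheritance rule can be illustrated with a small standalone sketch. This is not Flink code: the class name ScopeInheritanceSketch and the method expand are hypothetical, and the two template strings in main are only example inputs.

```java
import java.util.Arrays;

/** Hypothetical sketch, not Flink code: expands a "*."-prefixed scope template against its parent. */
class ScopeInheritanceSketch {

    /** Splits the format on '.' and, if the first component is "*", prepends the parent's components. */
    static String[] expand(String format, String[] parentTemplate) {
        String[] raw = format.split("\\.");
        boolean inherits = raw.length > 0 && raw[0].equals("*");
        if (!inherits) {
            return raw;
        }
        if (parentTemplate == null) {
            throw new IllegalArgumentException("'*' prefix used, but there is no parent scope");
        }
        // parent components first, then everything after the leading "*"
        String[] result = Arrays.copyOf(parentTemplate, parentTemplate.length + raw.length - 1);
        System.arraycopy(raw, 1, result, parentTemplate.length, raw.length - 1);
        return result;
    }

    public static void main(String[] args) {
        String[] jm = expand("<host>.jobmanager", null);
        String[] jmJob = expand("*.<job_name>", jm);
        System.out.println(String.join(".", jmJob)); // <host>.jobmanager.<job_name>
    }
}
```

The child template only states what it adds; the parent's components are copied in once, when the format object is built.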
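5. ScopeFormat: template compilation and variable binding (turning a "string rule" into a "structure that can be substituted efficiently")
The ScopeFormat constructor does two things:
- Builds template[]: splits the template on "."; if the template starts with "*", the components of parent.template are prepended.
- Precomputes templatePos[] and valuePos[]: later variable binding does not need to scan the whole template each time; it substitutes directly using the two int arrays.
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/scope/ScopeFormat.java>
FQCN: org.apache.flink.runtime.metrics.scope.ScopeFormat#ScopeFormat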
java
protected ScopeFormat(String format, ScopeFormat parent, String[] variables) {
checkNotNull(format, "format is null");
final String[] rawComponents = format.split("\\" + SCOPE_SEPARATOR);
final boolean parentAsPrefix =
rawComponents.length > 0 && rawComponents[0].equals(SCOPE_INHERIT_PARENT);
if (parentAsPrefix) {
if (parent == null) {
throw new IllegalArgumentException(
"Component scope format requires parent prefix (starts with '"
+ SCOPE_INHERIT_PARENT
+ "'), but this component has no parent (is root component).");
}
this.format = format.length() > 2 ? format.substring(2) : "<empty>";
String[] parentTemplate = parent.template;
int parentLen = parentTemplate.length;
this.template = new String[parentLen + rawComponents.length - 1];
System.arraycopy(parentTemplate, 0, this.template, 0, parentLen);
System.arraycopy(rawComponents, 1, this.template, parentLen, rawComponents.length - 1);
} else {
this.format = format.isEmpty() ? "<empty>" : format;
this.template = rawComponents;
}
HashMap<String, Integer> varToValuePos = arrayToMap(variables);
List<Integer> templatePos = new ArrayList<>();
List<Integer> valuePos = new ArrayList<>();
for (int i = 0; i < template.length; i++) {
final String component = template[i];
if (component != null
&& component.length() >= 3
&& component.charAt(0) == '<'
&& component.charAt(component.length() - 1) == '>') {
Integer replacementPos = varToValuePos.get(component);
if (replacementPos != null) {
templatePos.add(i);
valuePos.add(replacementPos);
}
}
}
this.templatePos = integerListToArray(templatePos);
this.valuePos = integerListToArray(valuePos);
}
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/scope/ScopeFormat.java>
FQCN: org.apache.flink.runtime.metrics.scope.ScopeFormat#bindVariables
java
protected final String[] bindVariables(String[] template, String[] values) {
final int len = templatePos.length;
for (int i = 0; i < len; i++) {
template[templatePos[i]] = values[valuePos[i]];
}
return template;
}
bindVariables is the direct consumer of what ScopeFormat precomputes: templatePos[i] gives the position to replace, and valuePos[i] gives the index into values[] to read from.
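To make the "compile once, bind many times" idea concrete, here is a minimal standalone sketch. The class ScopeBindSketch is hypothetical and deliberately simpler than the Flink class (no parent handling; it binds on a copy of the template), so treat it as an illustration of the two position arrays rather than the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of precomputed variable positions; names are illustrative, not Flink's. */
class ScopeBindSketch {
    private final String[] template;
    private final int[] templatePos; // index in template[] to overwrite
    private final int[] valuePos;    // index in values[] to read from

    ScopeBindSketch(String format, String[] variables) {
        this.template = format.split("\\.");
        List<String> vars = Arrays.asList(variables);
        List<Integer> tPos = new ArrayList<>();
        List<Integer> vPos = new ArrayList<>();
        for (int i = 0; i < template.length; i++) {
            int idx = vars.indexOf(template[i]); // -1 for literals and unknown variables
            if (idx >= 0) {
                tPos.add(i);
                vPos.add(idx);
            }
        }
        this.templatePos = tPos.stream().mapToInt(Integer::intValue).toArray();
        this.valuePos = vPos.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Binds values on a copy of the template; the loop runs once per variable slot. */
    String[] bind(String... values) {
        String[] out = template.clone();
        for (int i = 0; i < templatePos.length; i++) {
            out[templatePos[i]] = values[valuePos[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        ScopeBindSketch format =
                new ScopeBindSketch("<host>.taskmanager.<tm_id>", new String[] {"<host>", "<tm_id>"});
        System.out.println(String.join(".", format.bind("node-1", "tm-42"))); // node-1.taskmanager.tm-42
    }
}
```

The string parsing happens once in the constructor; every later bind call is a plain array copy plus one assignment per variable slot.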
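6. MetricRegistryImpl: the registration funnel and the entry point for starting the QueryService
MetricRegistryImpl is the runtime's main entry point for metrics: whenever a component creates a MetricGroup and registers a metric, the call ends up here. It is also responsible for starting the QueryService so that metrics can be queried externally.
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java>
FQCN: org.apache.flink.runtime.entrypoint.ClusterEntrypoint#createMetricRegistry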
java
protected MetricRegistryImpl createMetricRegistry(
Configuration configuration,
PluginManager pluginManager,
RpcSystemUtils rpcSystemUtils) {
return new MetricRegistryImpl(
MetricRegistryConfiguration.fromConfiguration(
configuration, rpcSystemUtils.getMaximumMessageSizeInBytes(configuration)),
ReporterSetup.fromConfiguration(configuration, pluginManager),
TraceReporterSetup.fromConfiguration(configuration, pluginManager));
}
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/MetricRegistryImpl.java>
FQCN: org.apache.flink.runtime.metrics.MetricRegistryImpl#startQueryService
java
public void startQueryService(RpcService rpcService, ResourceID resourceID) {
synchronized (lock) {
Preconditions.checkState(!isShutdown(), "The metric registry has already been shut down.");
try {
metricQueryServiceRpcService = rpcService;
queryService =
MetricQueryService.createMetricQueryService(
rpcService, resourceID, maximumFramesize);
queryService.start();
} catch (Exception e) {
LOG.warn(
"Could not start MetricDumpActor. No metrics will be submitted to the WebInterface.",
e);
}
}
}
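- metricRegistry.startQueryService(...) is the switch that closes the "registry → query" loop: it registers the QueryService endpoint on metricQueryServiceRpcService.
- The JM side normally passes resourceID == null and the endpointId is a fixed name; the TM side builds a unique endpointId from resourceID.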
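The endpoint-naming rule described above can be sketched in a few lines. The base name "MetricQueryService" and the "_" separator used here are assumptions made for illustration, and EndpointIdSketch is a hypothetical class; check the MetricQueryService source of your Flink version for the actual constants.

```java
/** Hypothetical sketch of the endpoint-naming rule; base name and separator are assumptions. */
class EndpointIdSketch {
    static final String BASE_NAME = "MetricQueryService"; // assumed fixed name

    /** Returns the fixed name when no resource id is given, otherwise a name made unique by the id. */
    static String endpointId(String resourceId) {
        return resourceId == null ? BASE_NAME : BASE_NAME + "_" + resourceId;
    }

    public static void main(String[] args) {
        System.out.println(endpointId(null));   // JM-like case: fixed name
        System.out.println(endpointId("tm-1")); // TM-like case: name suffixed with the resource id
    }
}
```

A fixed name is enough when there is one endpoint per RPC service; the suffix is what keeps several TaskManager endpoints distinguishable from one another.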
7. MetricQueryService: the external query service (RpcEndpoint) and the dedicated metrics RpcService
MetricQueryService is essentially an RpcEndpoint running on the dedicated metrics RpcService. It implements the MetricQueryServiceGateway interface, so a client can call queryMetrics(...) through an RPC proxy to pull a metrics snapshot.
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/dump/MetricQueryService.java>
FQCN: org.apache.flink.runtime.metrics.dump.MetricQueryService
java
public class MetricQueryService extends RpcEndpoint implements MetricQueryServiceGateway {
The dedicated metrics RpcService is the base that hosts this endpoint. It is not a separate class but a started instance of org.apache.flink.runtime.rpc.RpcService, used specifically to run the QueryService (rather than reusing commonRpcService).
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/util/MetricUtils.java>
FQCN: org.apache.flink.runtime.metrics.util.MetricUtils#startRemoteMetricsRpcService
java
public static RpcService startRemoteMetricsRpcService(
Configuration configuration,
String externalAddress,
@Nullable String bindAddress,
RpcSystem rpcSystem)
throws Exception {
final String portRange = configuration.get(MetricOptions.QUERY_SERVICE_PORT);
final RpcSystem.RpcServiceBuilder rpcServiceBuilder =
rpcSystem.remoteServiceBuilder(configuration, externalAddress, portRange);
if (bindAddress != null) {
rpcServiceBuilder.withBindAddress(bindAddress);
}
return startMetricRpcService(configuration, rpcServiceBuilder);
}
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/util/MetricUtils.java>
FQCN: org.apache.flink.runtime.metrics.util.MetricUtils#startMetricRpcService
java
private static RpcService startMetricRpcService(
Configuration configuration, RpcSystem.RpcServiceBuilder rpcServiceBuilder)
throws Exception {
final int threadPriority = configuration.get(MetricOptions.QUERY_SERVICE_THREAD_PRIORITY);
return rpcServiceBuilder
.withComponentName(METRICS_ACTOR_SYSTEM_NAME)
.withExecutorConfiguration(
new RpcSystem.FixedThreadPoolExecutorConfiguration(1, 1, threadPriority))
.createAndStart();
}
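- externalAddress is used to advertise the service and build a connectable address; bindAddress determines which local address is actually listened on.
- componentName is fixed to flink-metrics and the thread pool is fixed to 1 thread: QueryService mutations and serialization run under single-threaded semantics.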
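The consequence of the one-thread executor can be shown with a plain java.util.concurrent sketch. This does not use Flink's RPC classes: SingleThreadConfinementSketch is a hypothetical stand-in in which all map access is funneled through a single-threaded executor, so a non-thread-safe HashMap needs no extra locking.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: state confined to one executor thread, mutated only via submitted tasks. */
class SingleThreadConfinementSketch {
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    private final Map<Object, String> index = new HashMap<>(); // touched only on mainThread

    /** Enqueues the mutation; callers on any thread never touch the map directly. */
    void add(Object metric, String name) {
        mainThread.execute(() -> index.put(metric, name));
    }

    /** Reads the size on the same thread, after all previously enqueued mutations. */
    int sizeSnapshot() {
        Callable<Integer> readSize = index::size;
        try {
            return mainThread.submit(readSize).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        mainThread.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        SingleThreadConfinementSketch svc = new SingleThreadConfinementSketch();
        Thread[] callers = new Thread[4];
        for (int t = 0; t < callers.length; t++) {
            callers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    svc.add(new Object(), "m");
                }
            });
            callers[t].start();
        }
        for (Thread caller : callers) {
            caller.join();
        }
        System.out.println(svc.sizeSnapshot()); // 4000: every add ran before the snapshot task
        svc.shutdown();
    }
}
```

Because tasks are executed in submission order on one thread, a snapshot never observes a half-applied mutation, which is the property the QueryService relies on when it serializes a dump.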
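8. MetricGroup: metric grouping (scope hierarchy) and how addGroup handles deduplication and collisions
The core of MetricGroup is a hierarchical namespace: addGroup("TaskManager").addGroup("Status") forms scope levels, and metrics hang off a leaf group.
In the runtime implementation, groups are created through AbstractMetricGroup#addGroup, which carries two semantics that are often overlooked but important: name-collision warnings and reuse of same-named groups (deduplication).
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java>
FQCN: org.apache.flink.runtime.metrics.groups.AbstractMetricGroup#addGroup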
java
private AbstractMetricGroup<?> addGroup(String name, ChildType childType) {
synchronized (this) {
if (!closed) {
if (metrics.containsKey(name)) {
LOG.warn(
"Name collision: Adding a metric subgroup with the same name as an existing metric: '"
+ name
+ "'. Metric might not get properly reported. "
+ Arrays.toString(scopeComponents));
}
AbstractMetricGroup newGroup = createChildGroup(name, childType);
AbstractMetricGroup prior = groups.put(name, newGroup);
if (prior == null || prior.isClosed()) {
return newGroup;
} else {
groups.put(name, prior);
return prior;
}
} else {
GenericMetricGroup closedGroup = new GenericMetricGroup(registry, this, name);
closedGroup.close();
return closedGroup;
}
}
}
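- A metric and a group with the same name: only a warning, no failure (misuse of metrics should not take the application down).
- A group with the same name: the existing, still-open prior is returned, so callers asking for the same name all get one instance (the method is synchronized on the parent group).
- Parent group already closed: a group that is closed immediately is returned, so later registrations have no side effects.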
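The put-then-restore pattern behind these rules is easy to miss in the original method, so here is a reduced standalone sketch. GroupDedupSketch is a hypothetical class that keeps only the reuse and closed-parent logic; the metric-name collision warning and the child-type switch are left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of same-name group reuse; not the Flink class. */
class GroupDedupSketch {
    final String name;
    private final Map<String, GroupDedupSketch> groups = new HashMap<>();
    private boolean closed;

    GroupDedupSketch(String name) {
        this.name = name;
    }

    synchronized GroupDedupSketch addGroup(String childName) {
        if (closed) {
            // parent already closed: hand back a child that is closed from the start
            GroupDedupSketch dead = new GroupDedupSketch(childName);
            dead.close();
            return dead;
        }
        GroupDedupSketch fresh = new GroupDedupSketch(childName);
        GroupDedupSketch prior = groups.put(childName, fresh);
        if (prior == null || prior.isClosed()) {
            return fresh; // no usable prior: the new group stays registered
        }
        groups.put(childName, prior); // restore the live prior and discard the fresh one
        return prior;
    }

    synchronized void close() {
        closed = true;
    }

    synchronized boolean isClosed() {
        return closed;
    }

    public static void main(String[] args) {
        GroupDedupSketch root = new GroupDedupSketch("root");
        GroupDedupSketch first = root.addGroup("Status");
        System.out.println(first == root.addGroup("Status")); // true: live prior is reused
        first.close();
        System.out.println(first == root.addGroup("Status")); // false: closed prior is replaced
        root.close();
        System.out.println(root.addGroup("JVM").isClosed());  // true: closed parent yields closed child
    }
}
```

Creating the candidate first and restoring the prior afterwards costs one throwaway object on the reuse path, in exchange for a single map operation on the common "new group" path.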
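8.1 MetricGroup implementations on the runtime side: Generic* and ProcessMetricGroup
The class diagram also includes the common implementations of AbstractMetricGroup. In the code, the four you meet most often are:
- org.apache.flink.runtime.metrics.groups.GenericMetricGroup: the most general "named subgroup"; the default implementation behind addGroup("A").addGroup("B") almost always lands here.
- org.apache.flink.runtime.metrics.groups.GenericKeyMetricGroup / GenericValueMetricGroup: used for the key-value subgroup pair created by MetricGroup#addGroup(String key, String value); the value group additionally injects a variable "<key>" = "value", which the scope variable system and the reporters can later map to tags.
- org.apache.flink.runtime.metrics.groups.ProcessMetricGroup: the group that hosts the process-level (JVM/System) built-in metrics; for compatibility with historical naming, its scope and groupName currently imitate the JobManager's output.
8.2 GenericMetricGroup: ordinary hierarchical subgroup (the most common)
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/GenericMetricGroup.java>
FQCN: org.apache.flink.runtime.metrics.groups.GenericMetricGroup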
java
public class GenericMetricGroup extends AbstractMetricGroup<AbstractMetricGroup<?>> {
private String name;
public GenericMetricGroup(MetricRegistry registry, AbstractMetricGroup parent, String name) {
super(registry, makeScopeComponents(parent, name), parent);
this.name = name;
}
}
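- Usage: when you only want another level under the current group, for example group.addGroup("Status").addGroup("JVM"), GenericMetricGroup instances are created or reused along the way.
- Its scope composition rule is straightforward: parent.scopeComponents + name (so the hierarchy path is naturally the scope prefix).
8.3 GenericKeyMetricGroup / GenericValueMetricGroup: key-value subgroup pair (for variable injection)
The MetricGroup interface defines the semantics of the key-value subgroup pair: the value group is returned, and its getAllVariables() contains one more key-value pair than that of addGroup(key).addGroup(value).
Path: <flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/MetricGroup.java>
FQCN: org.apache.flink.metrics.MetricGroup#addGroup(String,String)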
java
MetricGroup addGroup(String key, String value);
In the runtime implementation, AbstractMetricGroup#addGroup(key, value) explicitly creates the KEY and VALUE child groups:
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java>
FQCN: org.apache.flink.runtime.metrics.groups.AbstractMetricGroup#addGroup(String,String)
java
@Override
public MetricGroup addGroup(String key, String value) {
return addGroup(key, ChildType.KEY).addGroup(value, ChildType.VALUE);
}
ChildType.KEY creates a GenericKeyMetricGroup, and ChildType.VALUE creates a GenericValueMetricGroup under the key group (note the two-step creation):
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java>
FQCN: org.apache.flink.runtime.metrics.groups.AbstractMetricGroup#createChildGroup
java
protected GenericMetricGroup createChildGroup(String name, ChildType childType) {
switch (childType) {
case KEY:
return new GenericKeyMetricGroup(registry, this, name);
default:
return new GenericMetricGroup(registry, this, name);
}
}
Then GenericKeyMetricGroup returns a GenericValueMetricGroup when asked to create a VALUE child:
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/GenericKeyMetricGroup.java>
FQCN: org.apache.flink.runtime.metrics.groups.GenericKeyMetricGroup#createChildGroup
java
@Override
protected GenericMetricGroup createChildGroup(String name, ChildType childType) {
switch (childType) {
case VALUE:
return new GenericValueMetricGroup(registry, this, name);
default:
return new GenericMetricGroup(registry, this, name);
}
}
The variable injection itself happens in GenericValueMetricGroup#putVariables: the key group's name is turned into the variable name "<key>" and the value is stored under it.
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/GenericValueMetricGroup.java>
FQCN: org.apache.flink.runtime.metrics.groups.GenericValueMetricGroup#putVariables
java
@Override
protected void putVariables(Map<String, String> variables) {
variables.put(ScopeFormat.asVariable(this.key), value);
}
- Usage: when you want metrics to carry structured, tag-like variables (for example, some reporters/formats map getAllVariables() to labels), use addGroup(key, value); otherwise plain addGroup(name) is enough.
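The difference between addGroup(key, value) and addGroup(key).addGroup(value) can be seen in a toy model. KeyValueGroupSketch below is hypothetical and models only two things: the dotted scope path and the variable map. The group names "op", "shard", and "7" are made-up example values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical toy model of key-value subgroups; not the Flink classes. */
class KeyValueGroupSketch {
    private final KeyValueGroupSketch parent;
    private final String name;
    private final String variableKey; // non-null only for value groups

    private KeyValueGroupSketch(KeyValueGroupSketch parent, String name, String variableKey) {
        this.parent = parent;
        this.name = name;
        this.variableKey = variableKey;
    }

    static KeyValueGroupSketch root(String name) {
        return new KeyValueGroupSketch(null, name, null);
    }

    /** Plain subgroup: adds a scope level, no variable. */
    KeyValueGroupSketch addGroup(String childName) {
        return new KeyValueGroupSketch(this, childName, null);
    }

    /** Key-value pair: two scope levels, and the value level contributes "<key>" -> value. */
    KeyValueGroupSketch addGroup(String key, String value) {
        KeyValueGroupSketch keyGroup = addGroup(key);
        return new KeyValueGroupSketch(keyGroup, value, "<" + key + ">");
    }

    Map<String, String> getAllVariables() {
        Map<String, String> vars = parent == null ? new LinkedHashMap<>() : parent.getAllVariables();
        if (variableKey != null) {
            vars.put(variableKey, name);
        }
        return vars;
    }

    String scope() {
        return parent == null ? name : parent.scope() + "." + name;
    }

    public static void main(String[] args) {
        KeyValueGroupSketch root = root("op");
        KeyValueGroupSketch keyValue = root.addGroup("shard", "7");
        KeyValueGroupSketch plain = root.addGroup("shard").addGroup("7");
        System.out.println(keyValue.scope() + " " + keyValue.getAllVariables()); // op.shard.7 {<shard>=7}
        System.out.println(plain.scope() + " " + plain.getAllVariables());       // op.shard.7 {}
    }
}
```

In this model both calls produce the same path, and only the key-value form leaves something a reporter could turn into a label.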
8.4 ProcessMetricGroup: mount point for process-level metrics (kept compatible with the historical scope)
Process metrics are created through ProcessMetricGroup.create(...):
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/ProcessMetricGroup.java>
FQCN: org.apache.flink.runtime.metrics.groups.ProcessMetricGroup#create
java
public static ProcessMetricGroup create(MetricRegistry metricRegistry, String hostname) {
return new ProcessMetricGroup(metricRegistry, hostname);
}
It extends AbstractImitatingJobManagerMetricGroup, which disguises scope and groupName as "jobmanager", so that dashboards and alerting rules historically built on the jobmanager scope keep working:
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractImitatingJobManagerMetricGroup.java>
FQCN: org.apache.flink.runtime.metrics.groups.AbstractImitatingJobManagerMetricGroup#getGroupName / #getScope
java
private static String[] getScope(MetricRegistry registry, String hostname) {
return registry.getScopeFormats().getJobManagerFormat().formatScope(hostname);
}
@Override
protected final String getGroupName(CharacterFilter filter) {
return "jobmanager";
}
- Usage: the MetricUtils.instantiateProcessMetricGroup(...) call you see in ClusterEntrypoint.initializeServices attaches process-level metrics such as JVM/System to this group, then calls addGroup further down and registers the various Gauges/Counters.
9. Metric: the minimal abstraction over all metric types
Flink abstracts all metric types (Counter/Gauge/Histogram/Meter...) into one minimal interface, Metric, so the QueryService can use the concrete metric instance as a Map key for indexing and removal.
Path: <flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/Metric.java>
FQCN: org.apache.flink.metrics.Metric
java
public interface Metric {
default MetricType getMetricType() {
throw new UnsupportedOperationException("Custom metric types are not supported.");
}
}
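9.1 The four standard sub-interfaces of Metric: Gauge / Counter / Histogram / Meter
Flink does not encourage users to define new metric types, so it fixes the reportable/serializable core types to four sub-interfaces and pins the type at the interface level through getMetricType().
Path: <flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/Gauge.java>
FQCN: org.apache.flink.metrics.Gauge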
java
public interface Gauge<T> extends Metric {
T getValue();
@Override
default MetricType getMetricType() {
return MetricType.GAUGE;
}
}
Path: <flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/Counter.java>
FQCN: org.apache.flink.metrics.Counter
java
public interface Counter extends Metric {
void inc();
void inc(long n);
void dec();
void dec(long n);
long getCount();
@Override
default MetricType getMetricType() {
return MetricType.COUNTER;
}
}
Path: <flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/Histogram.java>
FQCN: org.apache.flink.metrics.Histogram
java
public interface Histogram extends Metric {
void update(long value);
long getCount();
HistogramStatistics getStatistics();
@Override
default MetricType getMetricType() {
return MetricType.HISTOGRAM;
}
}
Path: <flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/Meter.java>
FQCN: org.apache.flink.metrics.Meter
java
public interface Meter extends Metric {
void markEvent();
void markEvent(long n);
double getRate();
long getCount();
@Override
default MetricType getMetricType() {
return MetricType.METER;
}
}
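- Gauge: the value is computed when it is read (pull model); typical use is wrapping a lambda or an MXBean getter as a Gauge.
- Counter: a counter that is incremented or decremented (push model); typically used for cumulative counts and byte totals.
- Histogram: records samples and exposes statistics such as quantiles and the mean; typically used for latency distributions.
- Meter: throughput (events/s) statistics; typically used for rate metrics.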
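The pull/push distinction between Gauge and Counter can be sketched without any Flink dependency. In the snippet below, a java.util.function.Supplier stands in for a Gauge and SimpleCounterSketch is a hypothetical minimal counter; neither is the Flink interface.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical sketch contrasting a pull-style gauge with a push-style counter. */
class PullVsPushSketch {

    /** Push model: the instrumented code updates the metric's own state. */
    static final class SimpleCounterSketch {
        private long count;

        void inc() {
            count++;
        }

        long getCount() {
            return count;
        }
    }

    public static void main(String[] args) {
        AtomicLong queueSize = new AtomicLong();       // state owned by the application
        Supplier<Long> gauge = queueSize::get;         // pull model: evaluated only when read
        SimpleCounterSketch counter = new SimpleCounterSketch();

        queueSize.set(5); // the gauge is not notified; it will see 5 on the next read
        counter.inc();    // the counter is told about every event
        counter.inc();

        System.out.println(gauge.get() + " " + counter.getCount()); // 5 2
    }
}
```

A gauge costs nothing until somebody reads it, while a counter pays a small cost on every event; that trade-off usually decides which of the two fits a given metric.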
9.2 How MetricGroup registers these metrics: AbstractMetricGroup.addMetric → MetricRegistryImpl.register
The runtime MetricGroup implementation, AbstractMetricGroup, funnels counter/gauge/histogram/meter into addMetric(name, metric), and MetricRegistryImpl.register(...) performs the actual registration and fan-out (reporters + queryService).
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java>
FQCN: org.apache.flink.runtime.metrics.groups.AbstractMetricGroup#counter / #gauge / #histogram / #meter
java
@Override
public <C extends Counter> C counter(String name, C counter) {
addMetric(name, counter);
return counter;
}
@Override
public <T, G extends Gauge<T>> G gauge(String name, G gauge) {
addMetric(name, gauge);
return gauge;
}
@Override
public <H extends Histogram> H histogram(String name, H histogram) {
addMetric(name, histogram);
return histogram;
}
@Override
public <M extends Meter> M meter(String name, M meter) {
addMetric(name, meter);
return meter;
}
- Usage (Counter): Counter c = group.counter("numRestarts"); c.inc();
- Usage (Gauge): group.gauge("ClassesLoaded", mxBean::getTotalLoadedClassCount);
- Usage (Histogram): histogram.update(latencyMillis);
- Usage (Meter): meter.markEvent(); or meter.markEvent(n);
9.3 How MetricQueryService stores metrics by type: four maps + instanceof dispatch
MetricQueryService puts the currently registered metrics into four maps by type; the key is the metric instance itself (so unregister/removeMetric must be given the same instance).
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/dump/MetricQueryService.java>
FQCN: org.apache.flink.runtime.metrics.dump.MetricQueryService#addMetric
java
public void addMetric(String metricName, Metric metric, AbstractMetricGroup group) {
runAsync(
() -> {
QueryScopeInfo info = group.getQueryServiceMetricInfo(FILTER);
if (metric instanceof Counter) {
counters.put((Counter) metric, new Tuple2<>(info, FILTER.filterCharacters(metricName)));
} else if (metric instanceof Gauge) {
gauges.put((Gauge<?>) metric, new Tuple2<>(info, FILTER.filterCharacters(metricName)));
} else if (metric instanceof Histogram) {
histograms.put((Histogram) metric, new Tuple2<>(info, FILTER.filterCharacters(metricName)));
} else if (metric instanceof Meter) {
meters.put((Meter) metric, new Tuple2<>(info, FILTER.filterCharacters(metricName)));
}
});
}
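- instanceof is used here instead of Metric#getMetricType(), which means the standard types supported by the QueryService are exactly these four interfaces.
- QueryScopeInfo comes from the MetricGroup: this is why MetricGroup implementations maintain scope components and variables (they are eventually dumped/serialized by the QueryService into a queryable view).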
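A reduced model of the instanceof dispatch shows what happens to a metric that matches none of the known interfaces. TypedIndexSketch and its nested Metric/Counter/Gauge interfaces are hypothetical stand-ins (only two of the four types are modeled), not the Flink types.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-type maps keyed by metric instance; not the Flink classes. */
class TypedIndexSketch {
    interface Metric {}

    interface Counter extends Metric {
        long getCount();
    }

    interface Gauge<T> extends Metric {
        T getValue();
    }

    final Map<Counter, String> counters = new HashMap<>();
    final Map<Gauge<?>, String> gauges = new HashMap<>();

    void addMetric(String name, Metric metric) {
        if (metric instanceof Counter) {
            counters.put((Counter) metric, name);
        } else if (metric instanceof Gauge) {
            gauges.put((Gauge<?>) metric, name);
        }
        // any other Metric subtype falls through and is not indexed at all
    }

    void removeMetric(Metric metric) {
        if (metric instanceof Counter) {
            counters.remove(metric);
        } else if (metric instanceof Gauge) {
            gauges.remove(metric);
        }
    }

    public static void main(String[] args) {
        TypedIndexSketch index = new TypedIndexSketch();
        Counter counter = () -> 7L;
        Gauge<Integer> gauge = () -> 42;
        Metric custom = new Metric() {};

        index.addMetric("numRestarts", counter);
        index.addMetric("queueSize", gauge);
        index.addMetric("custom", custom); // silently ignored by the dispatch
        System.out.println(index.counters.size() + " " + index.gauges.size()); // 1 1

        index.removeMetric(counter); // removal needs the very same instance
        System.out.println(index.counters.size()); // 0
    }
}
```

In this sketch the maps are keyed by objects that do not override equals/hashCode, so lookups are by identity; a different object with the same name would not remove the entry.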
10. The registration loop: MetricGroup → MetricRegistryImpl.register → MetricQueryService.add/remove
10.1 The chain behind "add is really a call to registry.register"
- The runtime MetricGroup (AbstractMetricGroup) holds a MetricRegistry reference (in practice usually MetricRegistryImpl); after a metric is created it calls MetricRegistryImpl.register(...) to perform the actual registration.
- MetricRegistryImpl.register(...) notifies not only the QueryService but also the reporters, and handles View instances separately.
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/MetricRegistryImpl.java>
FQCN: org.apache.flink.runtime.metrics.MetricRegistryImpl#register / #unregister
java
public void register(Metric metric, String metricName, AbstractMetricGroup group) {
synchronized (lock) {
if (reporters != null) {
forAllReporters(MetricReporter::notifyOfAddedMetric, metric, metricName, group);
}
if (queryService != null) {
queryService.addMetric(metricName, metric, group);
}
if (metric instanceof View) {
// ... viewUpdater.notifyOfAddedView((View) metric);
}
}
}
public void unregister(Metric metric, String metricName, AbstractMetricGroup group) {
synchronized (lock) {
if (reporters != null) {
forAllReporters(MetricReporter::notifyOfRemovedMetric, metric, metricName, group);
}
if (queryService != null) {
queryService.removeMetric(metric);
}
if (metric instanceof View) {
// ... viewUpdater.notifyOfRemovedView((View) metric);
}
}
}
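10.2 "The key is the Metric instance; won't the maps keep growing?"
- MetricQueryService.addMetric(...) puts registered metrics into four maps (counters/gauges/histograms/meters), keyed by the metric instance.
- The maps stay bounded provided that, when a metric's lifecycle ends, MetricRegistryImpl.unregister(...) → MetricQueryService.removeMetric(...) removes that same instance from the map.
- In normal use a metric instance is registered only once. Later value changes are internal state changes of the metric object (for example Counter.inc(), or Gauge#getValue() returning the latest value each time); the QueryService only holds a reference and reads the current value for the snapshot on queryMetrics(...).
- "Calling addMetric repeatedly on the same instance" mostly illustrates Map behavior: if misuse or an error path registers the same instance twice, Map.put(key, value) just overwrites the old metadata; it does not accumulate multiple copies of one instance.
- What actually makes the maps grow is continuously creating and registering new metric instances without a matching unregister (for example, a component/MetricGroup whose lifecycle is not closed properly, so the unregister chain never runs).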
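The three cases above (re-adding the same instance, removing it, and registering new instances without removal) can be checked with a plain HashMap. MetricIndexGrowthSketch is a hypothetical stand-in for one of the four maps, and the metric names are made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of how an instance-keyed index grows and shrinks. */
class MetricIndexGrowthSketch {
    private final Map<Object, String> index = new HashMap<>();

    void add(Object metric, String name) {
        index.put(metric, name); // same key: overwrite; new key: one more entry
    }

    void remove(Object metric) {
        index.remove(metric);
    }

    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        MetricIndexGrowthSketch idx = new MetricIndexGrowthSketch();
        Object counter = new Object();

        idx.add(counter, "numRestarts");
        idx.add(counter, "numRestarts"); // duplicate registration of the same instance
        System.out.println(idx.size());  // 1

        idx.remove(counter);
        System.out.println(idx.size());  // 0

        for (int i = 0; i < 3; i++) {
            idx.add(new Object(), "leaked-" + i); // new instance each time, never removed
        }
        System.out.println(idx.size());  // 3
    }
}
```

The last loop is the leak pattern to look for in practice: the map itself behaves correctly, and the growth comes from a lifecycle that never reaches the remove call.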
11. Appendix: two common entry points into MetricUtils (starting the metrics RPC, registering built-in JVM/System metrics)
MetricUtils usually shows up in this chain in two forms:
- Starting the dedicated metrics RpcService: MetricUtils.startRemoteMetricsRpcService(...) (covered above in the MetricQueryService section).
- Registering built-in JVM/System metrics: MetricUtils.instantiateProcessMetricGroup(...) attaches the common JVM/System metric sets under the process-level group.
The ClassLoader metrics serve as a concrete example:
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/util/MetricUtils.java>
FQCN: org.apache.flink.runtime.metrics.util.MetricUtils#instantiateClassLoaderMetrics
java
private static void instantiateClassLoaderMetrics(MetricGroup metrics) {
final ClassLoadingMXBean mxBean = ManagementFactory.getClassLoadingMXBean();
metrics.<Long, Gauge<Long>>gauge("ClassesLoaded", mxBean::getTotalLoadedClassCount);
metrics.<Long, Gauge<Long>>gauge("ClassesUnloaded", mxBean::getUnloadedClassCount);
}
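- ManagementFactory.getClassLoadingMXBean() obtains the JVM's built-in MXBean (an in-process call that does not depend on remote JMX being enabled).
- ClassesLoaded/ClassesUnloaded are both cumulative values since JVM start; each time the Gauge is pulled it calls the MXBean getter for the current value.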
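The same MXBean getters can be tried outside Flink with only the JDK. In this sketch a java.util.function.Supplier plays the role of the Gauge, and ClassLoaderGaugeSketch is a hypothetical class name; the java.lang.management calls are the standard JDK API.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch: JDK-only version of gauges backed by the class-loading MXBean. */
class ClassLoaderGaugeSketch {

    /** Builds name -> supplier pairs; each supplier queries the MXBean when it is read. */
    static Map<String, Supplier<Long>> classLoaderGauges() {
        ClassLoadingMXBean mxBean = ManagementFactory.getClassLoadingMXBean();
        Map<String, Supplier<Long>> gauges = new LinkedHashMap<>();
        gauges.put("ClassesLoaded", mxBean::getTotalLoadedClassCount);
        gauges.put("ClassesUnloaded", mxBean::getUnloadedClassCount);
        return gauges;
    }

    public static void main(String[] args) {
        // values differ per JVM and per run, so none are shown here
        classLoaderGauges().forEach((name, gauge) -> System.out.println(name + " = " + gauge.get()));
    }
}
```

Nothing is sampled at registration time; the method references are stored, and the MXBean is consulted only when a reader calls get().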
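11.1 JobManager startup: metrics service initialization → built-in metric registration (up to the point where metrics are added)
This part is the chain by which the JobManager (strictly speaking, ClusterEntrypoint) installs the metrics foundation and the initial metrics in one go during startup. It ends when the built-in metrics (Gauges etc.) under ProcessMetricGroup, such as Status/JVM/..., have been registered.
11.1.1 ClusterEntrypoint.initializeServices: create the registry → start the query service → build the process root group
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java>
FQCN: org.apache.flink.runtime.entrypoint.ClusterEntrypoint#initializeServices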
java
metricRegistry = createMetricRegistry(configuration, pluginManager, rpcSystem);
final RpcService metricQueryServiceRpcService =
MetricUtils.startRemoteMetricsRpcService(
configuration,
commonRpcService.getAddress(),
configuration.get(JobManagerOptions.BIND_HOST),
rpcSystem);
metricRegistry.startQueryService(metricQueryServiceRpcService, null);
final String hostname = RpcUtils.getHostname(commonRpcService);
processMetricGroup =
MetricUtils.instantiateProcessMetricGroup(
metricRegistry, hostname,
ConfigurationUtils.getSystemResourceMetricsProbingInterval(configuration));
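- createMetricRegistry(...): builds the metric registry (the unified entry point for reporters/views/queryService).
- startRemoteMetricsRpcService(...): starts a separate, dedicated metrics RpcService (for MetricQueryService; commonRpcService is not reused).
- metricRegistry.startQueryService(...): creates and starts MetricQueryService on the metrics RpcService (the server side of Web/REST metric queries).
- instantiateProcessMetricGroup(...): creates the process-level root MetricGroup and attaches built-in metrics such as JVM/System (this is where "adding metrics" happens).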
11.1.2 createMetricRegistry: the MetricRegistryImpl constructor arguments are assembled from the configuration and the plugin system
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java>
FQCN: org.apache.flink.runtime.entrypoint.ClusterEntrypoint#createMetricRegistry
java
protected MetricRegistryImpl createMetricRegistry(
Configuration configuration,
PluginManager pluginManager,
RpcSystemUtils rpcSystemUtils) {
return new MetricRegistryImpl(
MetricRegistryConfiguration.fromConfiguration(
configuration, rpcSystemUtils.getMaximumMessageSizeInBytes(configuration)),
ReporterSetup.fromConfiguration(configuration, pluginManager),
TraceReporterSetup.fromConfiguration(configuration, pluginManager));
}
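- MetricRegistryConfiguration.fromConfiguration(...): gathers the metrics-related parameters (including the maximum message size for the queryService).
- ReporterSetup / TraceReporterSetup: resolve the reporters (metric reporting) and the trace reporters (trace reporting) from the configuration plus the plugin mechanism.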
11.1.3 MetricRegistryImpl: two single-threaded scheduled executors by default (reporter / viewUpdater)
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/MetricRegistryImpl.java>
FQCN: org.apache.flink.runtime.metrics.MetricRegistryImpl#MetricRegistryImpl
java
public MetricRegistryImpl(
MetricRegistryConfiguration config,
Collection<ReporterSetup> reporterConfigurations,
Collection<TraceReporterSetup> traceReporterConfigurations) {
this(
config,
reporterConfigurations,
traceReporterConfigurations,
Executors.newSingleThreadScheduledExecutor(
new ExecutorThreadFactory("Flink-Metric-Reporter")),
Executors.newSingleThreadScheduledExecutor(
new ExecutorThreadFactory("Flink-Metric-View-Updater")));
}
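- The scheduling threads for the reporters and the viewUpdater are decoupled from application threads, and each executor is single-threaded, which keeps concurrent updates from complicating reporter/view behavior.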
11.1.4 Starting the dedicated metrics RpcService: startRemoteMetricsRpcService
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/util/MetricUtils.java>
FQCN: org.apache.flink.runtime.metrics.util.MetricUtils#startRemoteMetricsRpcService
java
public static RpcService startRemoteMetricsRpcService(
Configuration configuration,
String externalAddress,
@Nullable String bindAddress,
RpcSystem rpcSystem)
throws Exception {
final String portRange = configuration.get(MetricOptions.QUERY_SERVICE_PORT);
final RpcSystem.RpcServiceBuilder rpcServiceBuilder =
rpcSystem.remoteServiceBuilder(configuration, externalAddress, portRange);
if (bindAddress != null) {
rpcServiceBuilder.withBindAddress(bindAddress);
}
return startMetricRpcService(configuration, rpcServiceBuilder);
}
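- externalAddress is used for advertising (building the connectable address); bindAddress determines the local listening address.
- QUERY_SERVICE_PORT determines the port or port range of the query service.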
11.1.5 Starting MetricQueryService on the metrics RpcService: startQueryService
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/MetricRegistryImpl.java>
FQCN: org.apache.flink.runtime.metrics.MetricRegistryImpl#startQueryService
java
public void startQueryService(RpcService rpcService, ResourceID resourceID) {
synchronized (lock) {
Preconditions.checkState(!isShutdown(), "The metric registry has already been shut down.");
try {
metricQueryServiceRpcService = rpcService;
queryService =
MetricQueryService.createMetricQueryService(
rpcService, resourceID, maximumFramesize);
queryService.start();
} catch (Exception e) {
LOG.warn(
"Could not start MetricDumpActor. No metrics will be submitted to the WebInterface.",
e);
}
}
}
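- This starts MetricQueryService (an RpcEndpoint); the Web/REST side later calls queryMetrics(...) through the gateway to pull snapshots.
- Note that the RpcService it depends on is the dedicated metrics RpcService started in the previous subsection (which isolates its threading model and load).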
11.1.6 Where "adding metrics" starts: instantiateProcessMetricGroup → Status/JVM → register Gauges
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/util/MetricUtils.java>
FQCN: org.apache.flink.runtime.metrics.util.MetricUtils#instantiateProcessMetricGroup
java
public static ProcessMetricGroup instantiateProcessMetricGroup(
final MetricRegistry metricRegistry,
final String hostname,
final Optional<Time> systemResourceProbeInterval) {
final ProcessMetricGroup processMetricGroup =
ProcessMetricGroup.create(metricRegistry, hostname);
createAndInitializeStatusMetricGroup(processMetricGroup);
systemResourceProbeInterval.ifPresent(
interval -> instantiateSystemMetrics(processMetricGroup, interval));
return processMetricGroup;
}
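- The process-level root ProcessMetricGroup is created first, and the Status subtree (JVM metrics etc.) is attached immediately.
- If a system resource probing interval is configured, OS-level metrics are registered as well (optional).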
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/util/MetricUtils.java>
FQCN: org.apache.flink.runtime.metrics.util.MetricUtils#createAndInitializeStatusMetricGroup
java
private static MetricGroup createAndInitializeStatusMetricGroup(
AbstractMetricGroup<?> parentMetricGroup) {
MetricGroup statusGroup = parentMetricGroup.addGroup(METRIC_GROUP_STATUS_NAME);
instantiateStatusMetrics(statusGroup);
return statusGroup;
}
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/util/MetricUtils.java>
FQCN: org.apache.flink.runtime.metrics.util.MetricUtils#instantiateStatusMetrics
java
public static void instantiateStatusMetrics(MetricGroup metricGroup) {
MetricGroup jvm = metricGroup.addGroup("JVM");
instantiateClassLoaderMetrics(jvm.addGroup("ClassLoader"));
instantiateGarbageCollectorMetrics(
jvm.addGroup("GarbageCollector"), ManagementFactory.getGarbageCollectorMXBeans());
instantiateMemoryMetrics(jvm.addGroup(METRIC_GROUP_MEMORY));
instantiateThreadMetrics(jvm.addGroup("Threads"));
instantiateCPUMetrics(jvm.addGroup("CPU"));
}
Path: <flink-runtime/src/main/java/org/apache/flink/runtime/metrics/util/MetricUtils.java>
FQCN: org.apache.flink.runtime.metrics.util.MetricUtils#instantiateClassLoaderMetrics
java
private static void instantiateClassLoaderMetrics(MetricGroup metrics) {
final ClassLoadingMXBean mxBean = ManagementFactory.getClassLoadingMXBean();
metrics.<Long, Gauge<Long>>gauge("ClassesLoaded", mxBean::getTotalLoadedClassCount);
metrics.<Long, Gauge<Long>>gauge("ClassesUnloaded", mxBean::getUnloadedClassCount);
}
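- addGroup("Status") → addGroup("JVM") → addGroup("ClassLoader"): this is how the scope hierarchy of the built-in metrics is organized.
- metrics.gauge(...) goes through AbstractMetricGroup#addMetric → MetricRegistryImpl.register: at this point the metric has been added and registered, and both the reporters and the queryService can see it from then on.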
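12. Closing the loop (revisiting ScopeFormats → ScopeFormat → MetricRegistry → MetricQueryService → MetricGroup → Metric)
- ScopeFormats reads the scope templates of the different components from the configuration and determines which configuration entry each level's naming rule comes from.
- ScopeFormat compiles the string template into templatePos/valuePos, turning runtime variable binding into an O(k) replacement.
- MetricRegistryImpl is the registration funnel: metric creation in a MetricGroup always ends in register/unregister, and during startup it brings up the QueryService through startQueryService(...).
- MetricQueryService runs on the dedicated metrics RpcService (flink-metrics), maintains the Metric instance references as a serializable view, and exposes queryMetrics(...).
- MetricGroup handles the scope hierarchy and subgroup creation, and deals with same-name reuse and naming-collision warnings in addGroup.
- Metric, as the minimal abstraction, keeps the registration/query chain independent of concrete metric types; value updates happen inside the Metric object, and the QueryService reads a snapshot at query time.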
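13. Additional note: the second half of the JobManager startup flow
At this point the foundation services that JobManager startup depends on (HA, BlobServer, RPC communication, MetricService) have each been taken apart and explained.
Next comes the second half of the JobManager startup chain: starting the three core components and, once they are covered, tying the whole JobManager startup flow together:
- org.apache.flink.runtime.dispatcher.Dispatcher: job submission entry point and JobMaster lifecycle orchestration.
- org.apache.flink.runtime.resourcemanager.ResourceManager: TaskManager registration, heartbeats, slot management, and resource scheduling.
- org.apache.flink.runtime.webmonitor.WebMonitorEndpoint (or the corresponding Web/REST endpoint implementation): exposes the Web UI and the REST API for queries and operations.
Once the startup and interplay of these three components are clear, we return to the main chain of org.apache.flink.runtime.entrypoint.ClusterEntrypoint#runCluster / #initializeServices and string "create foundation services → start core components → register/wait for shutdown" together into the complete JobManager startup flow.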