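In production, the following messages kept appearing in the Router log (note that they are logged at INFO level):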
2025-04-23 17:44:15,780 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router1:8888->hh-fed-sub25:nn2:nn2:8020-EXPIRED
2025-04-23 17:44:15,781 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router1:8888->hh-fed-sub25:nn1:nn1:8020-EXPIRED
2025-04-23 17:44:15,781 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router2:8888->hh-fed-sub25:nn1:nn1:8020-EXPIRED
2025-04-23 17:44:15,781 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router2:8888->hh-fed-sub25:nn2:nn2:8020-EXPIRED
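Cause: the subcluster was originally configured with 3 Routers and 2 NameNodes, so 6 MembershipState records (one per Router/NameNode pair) were stored in the State Store.
Later, two of the subcluster's Routers were stopped and only one was left running. As a result, the messages above show up in the log of the Router that is still running.
The Router periodically loads the MembershipState records and checks each one for expiration on every load. The two stopped Routers had already reported their memberships with the NameNodes to the State Store, and we had disabled deletion of expired records (dfs.federation.router.store.membership.expiration.deletion). The stale records are therefore never removed, and the running Router prints the messages above on every refresh.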
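The scenario can be reproduced with a small toy model. This is only a sketch: the `Membership` class, the helper names, and the Router name `router3` for the surviving Router are hypothetical and not part of Hadoop; the model only keeps the expiration check and leaves deletion disabled, as in our setup.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the scenario above; these are not the Hadoop classes. */
final class MembershipExpiryDemo {

  static final long MINUTE_MS = 60_000L;

  /** Simplified stand-in for one MembershipState record. */
  static final class Membership {
    final String router;
    final String namenode;
    long dateModified;        // time of the last heartbeat written to the store
    String state = "ACTIVE";

    Membership(String router, String namenode, long dateModified) {
      this.router = router;
      this.namenode = namenode;
      this.dateModified = dateModified;
    }

    @Override
    public String toString() {
      return router + "->" + namenode + "-" + state;
    }
  }

  /** 3 Routers x 2 NameNodes = 6 records, all last modified at createdAt. */
  static List<Membership> newStore(long createdAt) {
    List<Membership> store = new ArrayList<>();
    for (String router : new String[] {"router1", "router2", "router3"}) {
      for (String nn : new String[] {"nn1", "nn2"}) {
        store.add(new Membership(router, nn, createdAt));
      }
    }
    return store;
  }

  /** A running Router refreshes the modification time of its own records. */
  static void heartbeat(List<Membership> store, String router, long now) {
    for (Membership m : store) {
      if (m.router.equals(router)) {
        m.dateModified = now;
      }
    }
  }

  /** One cache refresh with deletion disabled: expired records are only re-marked. */
  static List<String> refresh(List<Membership> store, long now, long expirationMs) {
    List<String> log = new ArrayList<>();
    for (Membership m : store) {
      if (m.dateModified + expirationMs < now) {
        m.state = "EXPIRED";
        log.add("Override State Store record MembershipState: " + m);
      }
    }
    return log;
  }

  public static void main(String[] args) {
    long expirationMs = 5 * MINUTE_MS;
    List<Membership> store = newStore(1 * MINUTE_MS);
    // router1 and router2 are stopped; only router3 keeps heartbeating.
    for (long now = 7 * MINUTE_MS; now <= 9 * MINUTE_MS; now += MINUTE_MS) {
      heartbeat(store, "router3", now);
      System.out.println("refresh at minute " + now / MINUTE_MS);
      for (String line : refresh(store, now, expirationMs)) {
        System.out.println("  " + line);
      }
    }
  }
}
```

Every refresh reports the same four records (2 stopped Routers x 2 NameNodes), which matches the four log lines above: nothing ever removes them, so they are overridden again on each cycle.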
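Fix: either of the following works:
- Enable deletion of expired records.
  - dfs.federation.router.store.membership.expiration defaults to 5 min. Setting dfs.federation.router.store.membership.expiration.deletion=2min means that once a membership has expired (no report for more than 5 min), it is deleted after a further 2 min.
- Start the stopped Routers again.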
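For the first option, the configuration might look like the fragment below. This is a sketch under assumptions: it assumes the Router reads these properties from hdfs-rbf-site.xml (adjust to wherever your deployment keeps Router settings), that the values are given in milliseconds as in the shipped defaults, and that the Router has to be restarted to pick up the change.

```xml
<!-- Sketch only; file location and restart requirement are assumptions. -->
<property>
  <name>dfs.federation.router.store.membership.expiration</name>
  <!-- 5 min, the default -->
  <value>300000</value>
</property>
<property>
  <name>dfs.federation.router.store.membership.expiration.deletion</name>
  <!-- 2 min after expiration; a non-positive value leaves deletion disabled -->
  <value>120000</value>
</property>
```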
Source code for reference:
org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords
```java
public void overrideExpiredRecords(QueryResult<R> query) throws IOException {
  List<R> commitRecords = new ArrayList<>();
  List<R> deleteRecords = new ArrayList<>();
  List<R> newRecords = query.getRecords();
  long currentDriverTime = query.getTimestamp();
  if (newRecords == null || currentDriverTime <= 0) {
    LOG.error("Cannot check overrides for record");
    return;
  }
  for (R record : newRecords) {
    if (record.shouldBeDeleted(currentDriverTime)) {
      String recordName = StateStoreUtils.getRecordName(record.getClass());
      if (getDriver().remove(record)) {
        deleteRecords.add(record);
        LOG.info("Deleted State Store record {}: {}", recordName, record);
      } else {
        LOG.warn("Couldn't delete State Store record {}: {}", recordName,
            record);
      }
    } else if (record.checkExpired(currentDriverTime)) {
      String recordName = StateStoreUtils.getRecordName(record.getClass());
      LOG.info("Override State Store record {}: {}", recordName, record);
      commitRecords.add(record);
    }
  }
  if (commitRecords.size() > 0) {
    getDriver().putAll(commitRecords, true, false);
  }
  if (deleteRecords.size() > 0) {
    newRecords.removeAll(deleteRecords);
  }
}
```
org.apache.hadoop.hdfs.server.federation.store.records.BaseRecord#checkExpired
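```java
// Override in the record subclass (here MembershipState, the record type in
// the log above): marks the record EXPIRED so that the caller commits it.
@Override
public boolean checkExpired(long currentTime) {
  if (super.checkExpired(currentTime)) {
    this.setState(EXPIRED);
    // Commit it
    return true;
  }
  return false;
}

// Base implementation in BaseRecord: the record is expired once
// dateModified + expiration lies before the current time.
public boolean checkExpired(long currentTime) {
  long expiration = getExpirationMs();
  long modifiedTime = getDateModified();
  if (modifiedTime > 0 && expiration > 0) {
    return (modifiedTime + expiration) < currentTime;
  }
  return false;
}
```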
org.apache.hadoop.hdfs.server.federation.store.records.BaseRecord#shouldBeDeleted
```java
public boolean shouldBeDeleted(long currentTime) {
  long deletionTime = getDeletionMs();
  if (isExpired() && deletionTime > 0) {
    long elapsedTime = currentTime - (getDateModified() + getExpirationMs());
    return elapsedTime > deletionTime;
  } else {
    return false;
  }
}
```
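To make the timing concrete, the sketch below combines the two checks into one decision per refresh. It is a toy model, not Hadoop code: `DeletionTimeline`, `Action`, and `decide` are hypothetical names, and the `markedExpired` flag assumes that `isExpired()` reflects an EXPIRED state already stored by an earlier refresh, which appears to be the case for MembershipState.

```java
/** Toy model of the per-record decision in overrideExpiredRecords; not Hadoop code. */
final class DeletionTimeline {

  enum Action { KEEP, OVERRIDE, DELETE }

  static final long MINUTE_MS = 60_000L;

  /**
   * @param markedExpired whether the stored record is already in the EXPIRED state
   */
  static Action decide(long dateModified, long expirationMs, long deletionMs,
      boolean markedExpired, long now) {
    // Mirrors shouldBeDeleted(): only records already marked EXPIRED qualify,
    // and a non-positive deletion time disables deletion altogether.
    if (markedExpired && deletionMs > 0) {
      long elapsed = now - (dateModified + expirationMs);
      if (elapsed > deletionMs) {
        return Action.DELETE;
      }
    }
    // Mirrors checkExpired(): both values must be positive.
    if (dateModified > 0 && expirationMs > 0
        && dateModified + expirationMs < now) {
      return Action.OVERRIDE;
    }
    return Action.KEEP;
  }

  public static void main(String[] args) {
    long modified = 10 * MINUTE_MS;   // last report at minute 10
    long expiration = 5 * MINUTE_MS;  // expires after minute 15
    long deletion = 2 * MINUTE_MS;    // deletable after minute 17

    System.out.println(decide(modified, expiration, deletion, false, 14 * MINUTE_MS));
    System.out.println(decide(modified, expiration, deletion, false, 16 * MINUTE_MS));
    System.out.println(decide(modified, expiration, deletion, true, 17 * MINUTE_MS));
    System.out.println(decide(modified, expiration, deletion, true, 18 * MINUTE_MS));
    // Deletion disabled (-1): the record is overridden on every refresh.
    System.out.println(decide(modified, expiration, -1L, true, 18 * MINUTE_MS));
  }
}
```

In this model a record is removed only when it was already marked EXPIRED by a previous refresh and strictly more than the deletion time has passed since its expiration. With deletion disabled the decision never leaves OVERRIDE, which is the repeated log message seen above.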