Hadoop NameNode Startup: Source Code Walkthrough

NameNode startup flow

Adding dependencies

To follow the NameNode startup code, first add the following dependencies to pom.xml:

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs-client</artifactId>
        <version>3.1.3</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

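The NameNode class

The NameNode class is one of the core components of the Hadoop Distributed File System (HDFS). It manages the file system namespace and metadata. An HDFS deployment normally has a single active NameNode (the exceptions are deployments that use a backup/failover NameNode or federated NameNodes). The NameNode controls two key data structures:

The mapping from file name to block sequence (the namespace).
The mapping from block to machine list (the node table).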
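
The two mappings can be pictured as a pair of hash maps. The following is a toy sketch only: the class ToyNamespace and its methods are hypothetical names invented for illustration, and the real NameNode keeps these tables in far more elaborate structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the two NameNode tables; every name here is illustrative. */
final class ToyNamespace {
    // File path -> ordered list of block IDs (the namespace).
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // Block ID -> DataNodes that hold a replica (the node table).
    private final Map<Long, List<String>> blockToNodes = new HashMap<>();

    void addBlock(String path, long blockId) {
        fileToBlocks.computeIfAbsent(path, p -> new ArrayList<>()).add(blockId);
    }

    void reportReplica(long blockId, String dataNode) {
        blockToNodes.computeIfAbsent(blockId, b -> new ArrayList<>()).add(dataNode);
    }

    /** Resolves a file to the DataNodes of each of its blocks, in block order. */
    List<List<String>> locate(String path) {
        List<List<String>> result = new ArrayList<>();
        List<Long> blocks = fileToBlocks.getOrDefault(path, Collections.<Long>emptyList());
        for (Long blockId : blocks) {
            result.add(blockToNodes.getOrDefault(blockId, Collections.<String>emptyList()));
        }
        return result;
    }

    public static void main(String[] args) {
        ToyNamespace ns = new ToyNamespace();
        ns.addBlock("/a.txt", 1L);
        ns.reportReplica(1L, "dn1");
        System.out.println(ns.locate("/a.txt"));
    }
}
```

Reading a file touches both tables in turn: the first lookup yields the block sequence, the second yields the machines to read each block from.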

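The NameNode serves the ClientProtocol interface, which lets clients request file system services, and the DatanodeProtocol interface, which is used to communicate with the DataNodes that store the actual data blocks. In this version the RPC handlers for these protocols are implemented by NameNodeRpcServer, which the NameNode creates during initialization.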

The main method

The main method of the NameNode class creates the NameNode and then blocks in join() until the service shuts down:

public static void main(String argv[]) throws Exception {
    if (DFSUtil.parseHelpArgument(argv, NameNode.USAGE, System.out, true)) {
        System.exit(0);
    }

    try {
        StringUtils.startupShutdownMessage(NameNode.class, argv, LOG);
        NameNode namenode = createNameNode(argv, null);
        if (namenode != null) {
            namenode.join();
        }
    } catch (Throwable e) {
        LOG.error("Failed to start namenode.", e);
        terminate(1, e);
    }
}

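Creating the NameNode object

createNameNode parses the startup option from the command line. Maintenance options such as FORMAT do their work and terminate the process; the default branch initializes the metrics system and constructs the NameNode: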

public static NameNode createNameNode(String argv[], Configuration conf)
    throws IOException {
    StartupOption startOpt = parseArguments(argv);
    if (startOpt == null) {
        printUsage(System.err);
        return null;
    }
    setStartupOption(conf, startOpt);

    boolean aborted = false;
    switch (startOpt) {
        case FORMAT:
            aborted = format(conf, startOpt.getForceFormat(),
                startOpt.getInteractiveFormat());
            terminate(aborted ? 1 : 0);
            return null;
        case GENCLUSTERID:
            //...
        default:
            DefaultMetricsSystem.initialize("NameNode");
            return new NameNode(conf);
    }
}
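
The dispatch pattern above can be tried in isolation. This is a simplified sketch, not Hadoop's code: the enum Option, the class StartupDispatchSketch, and the returned description strings are hypothetical, and only two of the startup options are modelled.

```java
/** Simplified stand-in for the startup-option dispatch; not Hadoop's actual enum. */
final class StartupDispatchSketch {
    enum Option { FORMAT, REGULAR }

    /** Maps the first CLI argument to an option; null signals a usage error. */
    static Option parse(String[] argv) {
        if (argv.length == 0) {
            return Option.REGULAR; // assumption: no argument means a regular start
        }
        if ("-format".equalsIgnoreCase(argv[0])) {
            return Option.FORMAT;
        }
        if ("-regular".equalsIgnoreCase(argv[0])) {
            return Option.REGULAR;
        }
        return null;
    }

    /** Describes what the caller would do next for the given arguments. */
    static String dispatch(String[] argv) {
        Option opt = parse(argv);
        if (opt == null) {
            return "print usage and return null";
        }
        switch (opt) {
            case FORMAT:
                return "format storage, then terminate";
            default:
                return "construct and return a NameNode";
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch(args));
    }
}
```

The point of the pattern is that only the default branch yields a long-running NameNode; every maintenance option ends the process after doing its one job.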

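Initializing the NameNode

The public constructor delegates to a protected one, which calls initialize and, if initialization fails, stops whatever has already been started before rethrowing the exception: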

public NameNode(Configuration conf) throws IOException {
    this(conf, NamenodeRole.NAMENODE);
}

protected NameNode(Configuration conf, NamenodeRole role)
    throws IOException {
    //...
    try {
        initializeGenericKeys(conf, nsId, namenodeId);
        initialize(getConf());
        //...
    } catch (IOException e) {
        this.stopAtException(e);
        throw e;
    } catch (HadoopIllegalArgumentException e) {
        this.stopAtException(e);
        throw e;
    }
    this.started.set(true);
}

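NameNode initialization details

The initialize method does the actual startup work: it starts the HTTP server, loads the image file (fsimage) and the edit log, and so on. The elided part also creates the RPC server and starts the common services, both of which are covered in later sections: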

protected void initialize(Configuration conf) throws IOException {
    //...
    if (NamenodeRole.NAMENODE == role) {
        startHttpServer(conf);
    }
    loadNamesystem(conf);
    startAliasMapServerIfNecessary(conf);
    //...
}

Starting the HTTP server

startHttpServer starts the HTTP server, which listens on port 9870 by default:

private void startHttpServer(final Configuration conf) throws IOException {
    httpServer = new NameNodeHttpServer(conf, this, getHttpServerBindAddress(conf));
    httpServer.start();
    httpServer.setStartupProgress(startupProgress);
}

Loading the image file and edit log

loadNamesystem loads the image file and the edit log into memory:

protected void loadNamesystem(Configuration conf) throws IOException {
    this.namesystem = FSNamesystem.loadFromDisk(conf);
}

Initializing the NameNode RPC server

createRpcServer creates the NameNode's RPC server:

protected NameNodeRpcServer createRpcServer(Configuration conf)
    throws IOException {
    return new NameNodeRpcServer(conf, this);
}

NameNode startup resource check

Starting common services

NameNode.startCommonServices starts the services shared by all NameNode roles, including the resource check:

private void startCommonServices(Configuration conf) throws IOException {
    namesystem.startCommonServices(conf, haContext);

    registerNNSMXBean();
    if (NamenodeRole.NAMENODE != role) {
        startHttpServer(conf);
        httpServer.setNameNodeAddress(getNameNodeAddress());
        httpServer.setFSImage(getFSImage());
    }
    rpcServer.start();
    try {
        plugins = conf.getInstances(DFS_NAMENODE_PLUGINS_KEY,
            ServicePlugin.class);
    } catch (RuntimeException e) {
        String pluginsValue = conf.get(DFS_NAMENODE_PLUGINS_KEY);
        LOG.error("Unable to load NameNode plugins. Specified list of plugins: " +
            pluginsValue, e);
        throw e;
    }
    //...
}

Common services in detail

The startCommonServices method called above is defined in FSNamesystem. It registers MBeans, checks resources, activates the block manager, and so on:

void startCommonServices(Configuration conf, HAContext haContext) throws IOException {
    this.registerMBean(); // Register the MBean for the FSNamesystemState
    writeLock();
    this.haContext = haContext;
    try {
        nnResourceChecker = new NameNodeResourceChecker(conf);
        checkAvailableResources();

        assert !blockManager.isPopulatingReplQueues();
        StartupProgress prog = NameNode.getStartupProgress();
        prog.beginPhase(Phase.SAFEMODE);
        long completeBlocksTotal = getCompleteBlocksTotal();

        prog.setTotal(Phase.SAFEMODE, STEP_AWAITING_REPORTED_BLOCKS,
            completeBlocksTotal);

        blockManager.activate(conf, completeBlocksTotal);
    } finally {
        writeUnlock("startCommonServices");
    }

    registerMXBean();
    DefaultMetricsSystem.instance().register(this);
    if (inodeAttributeProvider != null) {
        inodeAttributeProvider.start();
        dir.setINodeAttributeProvider(inodeAttributeProvider);
    }
    snapshotManager.registerMXBean();
    InetSocketAddress serviceAddress = NameNode.getServiceAddress(conf, true);
    this.nameNodeHostName = (serviceAddress != null) ?
        serviceAddress.getHostName() : "";
}

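Initializing the resource checker

The NameNodeResourceChecker constructor collects the volumes to check: the local edit log directories plus any extra volumes listed in the configuration. Each of these volumes is later checked for sufficient free disk space: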

public NameNodeResourceChecker(Configuration conf) throws IOException {
    this.conf = conf;
    volumes = new HashMap<String, CheckedVolume>();

    duReserved = conf.getLong(DFSConfigKeys.DFS_NAMENODE_DU_RESERVED_KEY,
        DFSConfigKeys.DFS_NAMENODE_DU_RESERVED_DEFAULT);

    Collection<URI> extraCheckedVolumes = Util.stringCollectionAsURIs(conf
        .getTrimmedStringCollection(DFSConfigKeys.DFS_NAMENODE_CHECKED_VOLUMES_KEY));

    Collection<URI> localEditDirs = Collections2.filter(
        FSNamesystem.getNamespaceEditsDirs(conf),
        new Predicate<URI>() {
            @Override
            public boolean apply(URI input) {
                if (input.getScheme().equals(NNStorage.LOCAL_URI_SCHEME)) {
                    return true;
                }
                return false;
            }
        });

    for (URI editsDirToCheck : localEditDirs) {
        addDirToCheck(editsDirToCheck,
            FSNamesystem.getRequiredNamespaceEditsDirs(conf).contains(
                editsDirToCheck));
    }

    for (URI extraDirToCheck : extraCheckedVolumes) {
        addDirToCheck(extraDirToCheck, true);
    }

    minimumRedundantVolumes = conf.getInt(
        DFSConfigKeys.DFS_NAMENODE_CHECKED_VOLUMES_MINIMUM_KEY,
        DFSConfigKeys.DFS_NAMENODE_CHECKED_VOLUMES_MINIMUM_DEFAULT);
}

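Checking available resources

checkAvailableResources checks whether the NameNode has enough disk space to store its metadata, and records how long the check took as a metric: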

void checkAvailableResources() {
    long resourceCheckTime = monotonicNow();
    Preconditions.checkState(nnResourceChecker != null,
        "nnResourceChecker not initialized");

    hasResourcesAvailable = nnResourceChecker.hasAvailableDiskSpace();

    resourceCheckTime = monotonicNow() - resourceCheckTime;
    NameNode.getNameNodeMetrics().addResourceCheckTime(resourceCheckTime);
}

Resource check logic

hasAvailableDiskSpace delegates the decision to NameNodeResourcePolicy:

public boolean hasAvailableDiskSpace() {
    return NameNodeResourcePolicy.areResourcesAvailable(volumes.values(),
        minimumRedundantVolumes);
}

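Resource policy check

areResourcesAvailable applies the policy. Every required resource must be available, and the number of redundant resources that are still available must be at least minimumRedundantResources: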

static boolean areResourcesAvailable(
    Collection<? extends CheckableNameNodeResource> resources,
    int minimumRedundantResources) {

    if (resources.isEmpty()) {
        return true;
    }

    int requiredResourceCount = 0;
    int redundantResourceCount = 0;
    int disabledRedundantResourceCount = 0;

    for (CheckableNameNodeResource resource : resources) {
        if (!resource.isRequired()) {
            redundantResourceCount++;
            if (!resource.isResourceAvailable()) {
                disabledRedundantResourceCount++;
            }
        } else {
            requiredResourceCount++;
            if (!resource.isResourceAvailable()) {
                return false;
            }
        }
    }

    if (redundantResourceCount == 0) {
        return requiredResourceCount > 0;
    } else {
        return redundantResourceCount - disabledRedundantResourceCount >=
            minimumRedundantResources;
    }
}
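
To see how the policy behaves on concrete inputs, the same decision logic can be reproduced without any Hadoop dependency. This is a minimal sketch: the Volume class and ResourcePolicySketch are hypothetical stand-ins for CheckableNameNodeResource and NameNodeResourcePolicy.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ResourcePolicySketch {
    /** Hypothetical stand-in for a checked volume. */
    static final class Volume {
        final boolean required;
        final boolean available;

        Volume(boolean required, boolean available) {
            this.required = required;
            this.available = available;
        }
    }

    /** Mirrors the decision logic of the method shown above. */
    static boolean areResourcesAvailable(List<Volume> volumes, int minRedundant) {
        if (volumes.isEmpty()) {
            return true; // nothing configured, nothing to fail
        }
        int required = 0;
        int redundant = 0;
        int disabledRedundant = 0;
        for (Volume v : volumes) {
            if (v.required) {
                required++;
                if (!v.available) {
                    return false; // one failed required volume is fatal
                }
            } else {
                redundant++;
                if (!v.available) {
                    disabledRedundant++;
                }
            }
        }
        if (redundant == 0) {
            return required > 0;
        }
        return redundant - disabledRedundant >= minRedundant;
    }

    public static void main(String[] args) {
        List<Volume> volumes = Arrays.asList(
            new Volume(true, true),    // required, healthy
            new Volume(false, true),   // redundant, healthy
            new Volume(false, false)); // redundant, out of space
        System.out.println(areResourcesAvailable(volumes, 1));
        System.out.println(areResourcesAvailable(Collections.<Volume>emptyList(), 1));
    }
}
```

Note the asymmetry: a single unavailable required volume fails the check outright, while redundant volumes only fail it once too few of them remain.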

Resource check interface

The CheckableNameNodeResource interface defines the two methods that every checked resource must provide:

interface CheckableNameNodeResource {

    public boolean isResourceAvailable();

    public boolean isRequired();
}

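Implementing the resource check

CheckedVolume implements CheckableNameNodeResource and compares the free space of a single directory against the reserved amount duReserved: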

public boolean isResourceAvailable() {

    long availableSpace = df.getAvailable();

    if (LOG.isDebugEnabled()) {
        LOG.debug("Space available on volume '" + volume + "' is "
            + availableSpace);
    }

    if (availableSpace < duReserved) {
        LOG.warn("Space available on volume '" + volume + "' is "
            + availableSpace +
            ", which is below the configured reserved amount " + duReserved);
        return false;
    } else {
        return true;
    }
}
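
The same comparison can be tried on any local directory. This sketch assumes java.io.File.getUsableSpace as the way to measure free space; the class and method names are hypothetical, and the original code obtains the value through its df object instead.

```java
import java.io.File;

final class VolumeSpaceSketch {
    /** Returns true when the volume holding dir has at least reservedBytes usable. */
    static boolean hasEnoughSpace(File dir, long reservedBytes) {
        // getUsableSpace returns 0 when the path does not name a partition.
        long available = dir.getUsableSpace();
        return available >= reservedBytes;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        long reserved = 100L * 1024 * 1024; // illustrative value: 100 MB
        System.out.println(dir.getAbsolutePath() + " ok? " + hasEnoughSpace(dir, reserved));
    }
}
```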

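NameNode heartbeat timeout detection and safe mode

Heartbeat timeout detection

The activate method of DatanodeManager starts the heartbeat manager, which uses a dedicated thread to check the heartbeat state of the DataNodes periodically.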

void activate(final Configuration conf) {
  datanodeAdminManager.activate(conf);
  heartbeatManager.activate();
}

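The heartbeat manager does this work on heartbeatThread. The thread wakes up every 5 seconds, but it only runs heartbeatCheck once more than heartbeatRecheckInterval has elapsed since the previous check.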

void activate() {
  heartbeatThread.start();
}

public void run() {
  while(namesystem.isRunning()) {
    restartHeartbeatStopWatch();
    try {
      final long now = Time.monotonicNow();
      if (lastHeartbeatCheck + heartbeatRecheckInterval < now) {
        heartbeatCheck();
        lastHeartbeatCheck = now;
      }
      if (blockManager.shouldUpdateBlockKey(now - lastBlockKeyUpdate)) {
        synchronized(HeartbeatManager.this) {
          for(DatanodeDescriptor d : datanodes) {
            d.setNeedKeyUpdate(true);
          }
        }
        lastBlockKeyUpdate = now;
      }
    } catch (Exception e) {
      LOG.error("Exception while checking heartbeat", e);
    }
    try {
      Thread.sleep(5000);  // 5 seconds
    } catch (InterruptedException ignored) {
    }
    if (shouldAbortHeartbeatCheck(-5000)) {
      LOG.warn("Skipping next heartbeat scan due to excessive pause");
      lastHeartbeatCheck = Time.monotonicNow();
    }
  }
}
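
The interaction between the 5-second sleep and the recheck interval is easy to misread, so here is a small simulation of the timing condition alone. It is a sketch under stated assumptions: the clock starts at 0 with the last check also at 0, no real sleeping happens, and the class and method names are hypothetical.

```java
final class HeartbeatLoopSketch {
    /**
     * Counts how many full heartbeat checks run when the loop wakes up
     * every wakeMs milliseconds over a span of durationMs milliseconds.
     */
    static int countChecks(long durationMs, long wakeMs, long recheckMs) {
        long lastCheck = 0;
        int checks = 0;
        for (long now = 0; now <= durationMs; now += wakeMs) {
            // Same strict comparison as in run() above.
            if (lastCheck + recheckMs < now) {
                checks++;
                lastCheck = now;
            }
        }
        return checks;
    }

    public static void main(String[] args) {
        // 20 minutes of wake-ups, 5 s apart, with a 5-minute recheck interval.
        System.out.println(countChecks(1_200_000L, 5_000L, 300_000L));
    }
}
```

In this idealized model the thread wakes 241 times in 20 minutes but performs a full scan on only 3 of them, because the strict comparison pushes each scan to the first wake-up after the interval has passed.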

The heartbeatCheck method scans the DataNodes to detect dead nodes, and counts stale nodes and stale or failed storages along the way.

void heartbeatCheck() {
  final DatanodeManager dm = blockManager.getDatanodeManager();

  boolean allAlive = false;
  while (!allAlive) {
    DatanodeDescriptor dead = null;
    DatanodeStorageInfo failedStorage = null;
    int numOfStaleNodes = 0;
    int numOfStaleStorages = 0;
    synchronized(this) {
      for (DatanodeDescriptor d : datanodes) {
        if (dead == null && dm.isDatanodeDead(d)) {
          stats.incrExpiredHeartbeats();
          dead = d;
        }
        if (d.isStale(dm.getStaleInterval())) {
          numOfStaleNodes++;
        }
        DatanodeStorageInfo[] storageInfos = d.getStorageInfos();
        for(DatanodeStorageInfo storageInfo : storageInfos) {
          if (storageInfo.areBlockContentsStale()) {
            numOfStaleStorages++;
          }
          if (failedStorage == null &&
              storageInfo.areBlocksOnFailedStorage() &&
              d != dead) {
            failedStorage = storageInfo;
          }
        }
      }
      dm.setNumStaleNodes(numOfStaleNodes);
      dm.setNumStaleStorages(numOfStaleStorages);
    }
    //...
  }
}

isDatanodeDead declares a DataNode dead when its last heartbeat is older than the heartbeat expiry interval.

boolean isDatanodeDead(DatanodeDescriptor node) {
  return (node.getLastUpdateMonotonic() <
          (monotonicNow() - heartbeatExpireInterval));
}

The expiry interval heartbeatExpireInterval is computed from the heartbeat recheck interval and the heartbeat interval.

private long heartbeatExpireInterval;
// 10 minutes + 30 seconds with the default values
this.heartbeatExpireInterval = 2 * heartbeatRecheckInterval + 10 * 1000 * heartbeatIntervalSeconds;

private volatile int heartbeatRecheckInterval;
heartbeatRecheckInterval = conf.getInt(
        DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_KEY, 
        DFSConfigKeys.DFS_NAMENODE_HEARTBEAT_RECHECK_INTERVAL_DEFAULT); // 5 minutes

private volatile long heartbeatIntervalSeconds;
heartbeatIntervalSeconds = conf.getTimeDuration(
        DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY,
        DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_DEFAULT, TimeUnit.SECONDS);
public static final long    DFS_HEARTBEAT_INTERVAL_DEFAULT = 3;
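
Plugging the defaults quoted above into the formula gives 2 * 300000 ms + 10 * 1000 * 3 ms = 630000 ms, that is, 10 minutes and 30 seconds. The sketch below repeats that arithmetic and the dead-node comparison; the class and method names are hypothetical.

```java
final class HeartbeatExpirySketch {
    /** Same formula as heartbeatExpireInterval above; result in milliseconds. */
    static long expireIntervalMs(long recheckIntervalMs, long heartbeatIntervalSeconds) {
        return 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSeconds;
    }

    /** Same comparison as isDatanodeDead, with the clock passed in explicitly. */
    static boolean isDead(long lastUpdateMs, long nowMs, long expireIntervalMs) {
        return lastUpdateMs < nowMs - expireIntervalMs;
    }

    public static void main(String[] args) {
        long expire = expireIntervalMs(5 * 60 * 1000L, 3L);
        System.out.println(expire); // prints 630000
        System.out.println(isDead(0L, expire + 1, expire)); // prints true
    }
}
```

Because the comparison is strict, a node whose last heartbeat is exactly 630000 ms old is still considered alive; it is declared dead one millisecond later.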

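Safe mode

During startup, the startCommonServices method of FSNamesystem begins the safe mode phase: it marks the phase in the startup progress and then activates the block manager, which in turn activates safe mode.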

void startCommonServices(Configuration conf, HAContext haContext) throws IOException {
  this.registerMBean(); // Register the MBean for the FSNamesystemState
  writeLock();
  this.haContext = haContext;
  try {
    nnResourceChecker = new NameNodeResourceChecker(conf);
    checkAvailableResources();

    assert !blockManager.isPopulatingReplQueues();
    StartupProgress prog = NameNode.getStartupProgress();

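    // Mark the start of the safe mode startup phase
    prog.beginPhase(Phase.SAFEMODE);

    // Count the complete blocks (total minus blocks under construction)
    long completeBlocksTotal = getCompleteBlocksTotal();

    prog.setTotal(Phase.SAFEMODE, STEP_AWAITING_REPORTED_BLOCKS,
                  completeBlocksTotal);

    // Activate the block manager and its background services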
    blockManager.activate(conf, completeBlocksTotal);
  } finally {
    writeUnlock("startCommonServices");
  }
  //...
}

getCompleteBlocksTotal returns the number of complete blocks: the total block count minus the blocks that are still under construction.

public long getCompleteBlocksTotal() {
  long numUCBlocks = 0;
  readLock();
  try {
    numUCBlocks = leaseManager.getNumUnderConstructionBlocks();
    return getBlocksTotal() - numUCBlocks;
  } finally {
    readUnlock("getCompleteBlocksTotal");
  }
}
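
The lock-then-try-finally shape used here recurs throughout FSNamesystem. A minimal sketch of the same pattern with the JDK's ReentrantReadWriteLock follows; the class BlockCounterSketch and its fields are hypothetical and only stand in for the real counters.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class BlockCounterSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long blocksTotal;
    private long underConstruction;

    /** Updates both counters atomically under the write lock. */
    void set(long total, long uc) {
        lock.writeLock().lock();
        try {
            blocksTotal = total;
            underConstruction = uc;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Reads a consistent pair of counters under the read lock. */
    long completeBlocksTotal() {
        lock.readLock().lock();
        try {
            return blocksTotal - underConstruction;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        BlockCounterSketch counter = new BlockCounterSketch();
        counter.set(100L, 4L);
        System.out.println(counter.completeBlocksTotal()); // prints 96
    }
}
```

Releasing the lock in finally guarantees that an exception inside the guarded section cannot leave the lock held.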

The activate method of BlockManager starts the related background threads and, as its last step, activates safe mode.

public void activate(Configuration conf, long blockTotal) {
  pendingReconstruction.start();
  datanodeManager.activate(conf);

  this.redundancyThread.setName("RedundancyMonitor");
  this.redundancyThread.start();

  storageInfoDefragmenterThread.setName("StorageInfoMonitor");
  storageInfoDefragmenterThread.start();
  this.blockReportThread.start();

  mxBeanName = MBeans.register("NameNode", "BlockStats", this);

  bmSafeMode.activate(blockTotal);
}

The safe mode activation logic is in the activate method of the BlockManagerSafeMode class, the type behind the bmSafeMode field.

void activate(long total) {
  assert namesystem.hasWriteLock();
  assert status == BMSafeModeStatus.OFF;

  startTime = monotonicNow();

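  // Record the block total and derive the block thresholds from it
  setBlockTotal(total);

  // Leave safe mode right away if the DataNode and block thresholds are already met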
  if (areThresholdsMet()) {
    boolean exitResult = leaveSafeMode(false);
    Preconditions.checkState(exitResult, "Failed to leave safe mode.");
  } else {
    status = BMSafeModeStatus.PENDING_THRESHOLD;

    initializeReplQueuesIfNecessary();

    reportStatus("STATE* Safe mode ON.", true);
    lastStatusReport = monotonicNow();
  }
}

setBlockTotal stores the block total and computes the block thresholds from it.

void setBlockTotal(long total) {
  assert namesystem.hasWriteLock();
  synchronized (this) {
    this.blockTotal = total;
    this.blockThreshold = (long) (total * threshold);
  }
  this.blockReplQueueThreshold = (long) (total * replQueueThreshold);
}

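The threshold is a fraction read from the configuration. With the default of 0.999, at least 99.9% of the blocks must be counted as safe before safe mode can end.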

this.threshold = conf.getFloat(DFS_NAMENODE_SAFEMODE_THRESHOLD_PCT_KEY,
        DFS_NAMENODE_SAFEMODE_THRESHOLD_PCT_DEFAULT);

public static final float   DFS_NAMENODE_SAFEMODE_THRESHOLD_PCT_DEFAULT = 0.999f;

areThresholdsMet checks whether the conditions for leaving safe mode are satisfied.

private boolean areThresholdsMet() {
  assert namesystem.hasWriteLock();
  int datanodeNum = 0;

  if (datanodeThreshold > 0) {
    datanodeNum = blockManager.getDatanodeManager().getNumLiveDataNodes();
  }
  synchronized (this) {
    return blockSafe >= blockThreshold && datanodeNum >= datanodeThreshold;
  }
}
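
As a worked example of the two steps above: with 1000 complete blocks and the default fraction 0.999f, the block threshold is (long) (1000 * 0.999f) = 999, so safe mode can end once 999 blocks are safe. The sketch below repeats the computation and the exit condition; the class and method names are hypothetical, and a DataNode threshold of 0 is an assumption made for the example.

```java
final class SafeModeThresholdSketch {
    /** Same computation as blockThreshold in setBlockTotal. */
    static long blockThreshold(long total, float thresholdPct) {
        return (long) (total * thresholdPct);
    }

    /** Same condition as areThresholdsMet, with all inputs passed explicitly. */
    static boolean thresholdsMet(long blockSafe, long blockThreshold,
                                 int liveDataNodes, int datanodeThreshold) {
        return blockSafe >= blockThreshold && liveDataNodes >= datanodeThreshold;
    }

    public static void main(String[] args) {
        long threshold = blockThreshold(1000L, 0.999f);
        System.out.println(threshold); // prints 999
        System.out.println(thresholdsMet(998L, threshold, 3, 0)); // prints false
        System.out.println(thresholdsMet(999L, threshold, 3, 0)); // prints true
    }
}
```

The same arithmetic explains why activate can leave safe mode immediately on an empty namespace: with a total of 0 the block threshold is 0, and 0 safe blocks already satisfy it.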