PyTorch 源码学习:GPU 内存管理之它山之石——TensorFlow BFC 算法

TensorFlow 和 PyTorch 都是常用的深度学习框架,各自有一套独特但又相似的 GPU 内存管理机制(BFC 算法)。它山之石可以攻玉。了解 TensorFlow 的 BFC 算法有助于学习 PyTorch 管理 GPU 内存的精妙之处。本文重点关注 TensorFlow BFC 算法的核心思想,研究的 TensorFlow 版本为1f1379c5f721579c58b4ee560daac7df90f8519a。更多内容请参考:

文章目录

  • [1 关于 TensorFlow BFC 算法](#1 关于 TensorFlow BFC 算法)
  • [2 关于 XLA](#2 关于 XLA)
    • [2.1 初见端倪](#2.1 初见端倪)
    • [2.2 初识 XLA](#2.2 初识 XLA)
  • [3 主要数据结构](#3 主要数据结构)
    • [3.1 Chunk](#3.1 Chunk)
    • [3.2 Bin](#3.2 Bin)
    • [3.3 辅助类](#3.3 辅助类)
  • [4 BFC 算法核心](#4 BFC 算法核心)
    • [4.1 分配内存](#4.1 分配内存)
      • [4.1.1 调整大小](#4.1.1 调整大小)
      • [4.1.2 查找Bin](#4.1.2 查找Bin)
      • [4.1.3 查找Chunk](#4.1.3 查找Chunk)
      • [4.1.4 扩展内存](#4.1.4 扩展内存)
    • [4.2 回收内存](#4.2 回收内存)
      • [4.2.1 获取ChunkHandle](#4.2.1 获取ChunkHandle)
      • [4.2.2 更新状态](#4.2.2 更新状态)
      • [4.2.3 合并Chunk](#4.2.3 合并Chunk)
  • [5 总结](#5 总结)
  • 参考资料
    • [关于 XLA](#关于 XLA)
    • [关于 TensorFlow BFC](#关于 TensorFlow BFC)

1 关于 TensorFlow BFC 算法

以下内容来自:TensorFlow中的显存管理器------BFC Allocator - DeepLearningStack

背景 :使用 GPU 训练时,一次训练任务无论是模型参数还是中间结果都需要占用大量显存。为了避免每次训练重新开辟显存带来计算之外的开销,一般框架的做法是在真正的训练任务开始前,将每个节点的输入和输出,以及模型参数的 shape 计算出来并全局开辟一次,例如 Caffe 就是这种做法。随着深度学习模型的发展和迭代,不仅模型训练的数据 shape 可能发生变化,就连模型本身在训练过程中也可能发生变化,那么按照固定 shape 一次开辟显存的做法就不能满足需求了。为此,TensorFlow 重新设计了较为灵活的显存管理机制 ,它使用了名为 BFC 的分配算法,并通过 BFC Allocator 为每个 Tensor 分配满足需求的显存。

问题 :显存分配与回收的性能需求 。Tensor 在每次创建时会得到存储区域,而每一轮训练都要重新创建新的 Tensor,那么这里面临的一个问题:如此频繁的分配和回收存储区,如何才能做的高效 ?试想对于 GPU 来说,如果Allocate函数直接封装 CUDA 中昂贵的cudaMalloc函数,当 Tensor 被释放时直接调用cudaFree函数,那么训练速度将会因为这些 overhead 大打折扣

思路存储池 。如果你对操作系统这门课比较熟悉,那么应该很容易想到解决办法:将显存按照不同的大小一次性开辟出来,并组成存储池,每次调用Allocate函数时从存储池中获取,Tensor 回收时将显存重新挂到存储池中。这样做确实可以满足性能需求,但是需要为此设计一个相对复杂的存储管理器。BFC Allocator 就是 TensorFlow 中管理 GPU 显存的存储管理器。

2 关于 XLA

2.1 初见端倪

为什么要先说 XLA 呢?

笔者在 TensorFlow 仓库的 main 分支寻找与 BFC (B est-F it with C oalescing) 算法有关的源码时,首先是定位到了 tensorflow/core/common_runtime/gpu,该目录下有 gpu_bfc_allocator.hgpu_bfc_allocator.cc 两个文件,但并没有找到具体的与 BFC 算法相关的代码实现。

gpu_bfc_allocator.h核心代码

cpp 复制代码
#include <memory>
#include <optional>
#include <string>

#include "xla/tsl/framework/allocator.h"
#include "xla/tsl/framework/bfc_allocator.h"
#include "xla/tsl/platform/macros.h"

namespace tensorflow {

// A GPU memory allocator that implements a 'best-fit with coalescing'
// algorithm.
class GPUBFCAllocator : public tsl::BFCAllocator {
 public:
  // See BFCAllocator::Options.
  struct Options {
    // Overridden by TF_FORCE_GPU_ALLOW_GROWTH if that envvar is set.
    bool allow_growth = false;

    // If nullopt, defaults to TF_ENABLE_GPU_GARBAGE_COLLECTION, or true if that
    // envvar is not present.
    //
    // Note:
    //
    //  - BFCAllocator defaults garbage_collection to false, not true.
    //  - this is not the same override behavior as TF_FORCE_GPU_ALLOW_GROWTH.
    std::optional<bool> garbage_collection;

    double fragmentation_fraction = 0;
    bool allow_retry_on_failure = true;
  };

  GPUBFCAllocator(std::unique_ptr<tsl::SubAllocator> sub_allocator,
                  size_t total_memory, const std::string& name,
                  const Options& opts);

  ~GPUBFCAllocator() override {}

  GPUBFCAllocator(const GPUBFCAllocator&) = delete;
  void operator=(const GPUBFCAllocator&) = delete;
};

}  // namespace tensorflow

但从 gpu_bfc_allocator.h 的代码中,笔者发现GPUBFCAllocator类继承自tsl::BFCAllocator,而tsl命名空间就来自第三方库 third_party/xla

而 BFC 算法的具体实现就在 xla/tsl/framework 目录下的 bfc_allocator.hbfc_allocator.cc 两个文件中。

2.2 初识 XLA

TensorFlow 通过 XLA 编译器优化 GPU 代码执行。
下图来自:openxla/xla: A machine learning compiler for GPUs, CPUs, and ML accelerators

下图来自:技术栈架构 - AI技术栈解析及应用- 作者:张真瑜 | 山东大学智能创新研究院

OpenXLA 作为一个灵活的深度学习编译器框架,与 PyTorch 和 TensorFlow 深度集成,通过自定义算子、JIT 编译和 GPU 内核融合等技术,大幅提升了这些深度学习框架在 GPU 上的执行效率。同时,OpenXLA 还利用 CUDA Runtime API 和 CUDA Driver API,实现了对 GPU 硬件的精细控制和优化,包括内存管理、设备操作和内核启动等。

3 主要数据结构

3.1 Chunk

以下内容部分来自:TensorFlow 源码分析之内存管理BFC算法------DeepReve

整个内存空间由一个按基址升序排列 的Chunk双向链表 来表示,它们的直接前趋和后继必须在地址连续的内存空间 。Chunk结构体里含有++实际大小、请求大小、是否被占用、基址、直接前趋、直接后继、Bin索引++等信息。

以下内容部分来自:Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理

Chunk 是 TensorFlow 的最小内存单位 ,由数倍 256 bytes (kMinAllocationSize) 的连续内存块 组成。TensorFlow 的内存管理是基于 Chunk 的管理

以下内容部分来自:TensorFlow中的显存管理器------BFC Allocator - DeepLearningStack

Chunk 是 BFC 最核心的数据结构之一,在 TensorFlow 源码中是以struct来描述的。具体来说,一个 Chunk 代表一段连续的存储空间 ,BFC 要求各个 Chunk 要按照基地址升序 排列并组织成双向链表 ,下图展示了 Chunk 的结构以及 Chunk 之间的连接关系。初始时,每个 Chunk 都有自己的size,并且这些size都是以 256 字节为模。应当注意,每个 Chunk 或者完全被标记为使用,或者完全标记为空闲,不存在该 Chunk 内只有部分空间被使用的情况。

Chunk代码实现

cpp 复制代码
  // A Chunk points to a piece of memory that's either entirely free or entirely
  // in use by one user memory allocation.
  //
  // An AllocationRegion's memory is split up into one or more disjoint Chunks,
  // which together cover the whole region without gaps.  Chunks participate in
  // a doubly-linked list, and the prev/next pointers point to the physically
  // adjacent chunks.
  //
  // Since a chunk cannot be partially in use, we may need to split a free chunk
  // in order to service a user allocation.  We always merge adjacent free
  // chunks.
  //
  // Chunks contain information about whether they are in use or whether they
  // are free, and contain a pointer to the bin they are in.
  struct Chunk {
    size_t size = 0;  // Full size of buffer.

    // We sometimes give chunks that are larger than needed to reduce
    // fragmentation.  requested_size keeps track of what the client
    // actually wanted so we can understand whether our splitting
    // strategy is efficient.
    size_t requested_size = 0;

    // allocation_id is set to -1 when the chunk is not in use. It is assigned a
    // value greater than zero before the chunk is returned from
    // AllocateRaw, and this value is unique among values assigned by
    // the parent allocator.
    int64_t allocation_id = -1;
    void* ptr = nullptr;  // pointer to granted subbuffer.

    // If not kInvalidChunkHandle, the memory referred to by 'prev' is directly
    // preceding the memory used by this chunk.  E.g., It should start
    // at 'ptr - prev->size'
    ChunkHandle prev = kInvalidChunkHandle;

    // If not kInvalidChunkHandle, the memory referred to by 'next' is directly
    // following the memory used by this chunk.  E.g., It should be at
    // 'ptr + size'
    ChunkHandle next = kInvalidChunkHandle;

    // What bin are we in?
    BinNum bin_num = kInvalidBinNum;

    // Optional count when this chunk was most recently made free.
    uint64 freed_at_count = 0;

    bool in_use() const { return allocation_id != -1; }

#ifdef TENSORFLOW_MEM_DEBUG
    // optional debugging info
    const char* op_name = nullptr;
    uint64 step_id = 0;
    int64 action_count = 0;
#endif

    string DebugString(BFCAllocator* a,
                       bool recurse) ABSL_NO_THREAD_SAFETY_ANALYSIS {
      string dbg;
      absl::StrAppend(
          &dbg, "  Size: ", strings::HumanReadableNumBytes(size),
          " | Requested Size: ", strings::HumanReadableNumBytes(requested_size),
          " | in_use: ", in_use(), " | bin_num: ", bin_num);
      if (recurse && prev != BFCAllocator::kInvalidChunkHandle) {
        Chunk* p = a->ChunkFromHandle(prev);
        absl::StrAppend(&dbg, ", prev: ", p->DebugString(a, false));
      }
      if (recurse && next != BFCAllocator::kInvalidChunkHandle) {
        Chunk* n = a->ChunkFromHandle(next);
        absl::StrAppend(&dbg, ", next: ", n->DebugString(a, false));
      }
#ifdef TENSORFLOW_MEM_DEBUG
      absl::StrAppend(&dbg, ", for: ", op_name ? op_name : "UNKNOWN",
                      ", stepid: ", step_id, ", last_action: ", action_count);
#endif
      return dbg;
    }
  };

3.2 Bin

以下内容部分来自:TensorFlow内存管理bfc算法

BFC 算法采取的是被动分块的策略。最开始整个内存是一个 Chunk,随着用户申请空间的次数增加,最开始的大 Chunk 会被不断的 Split 开来,从而产生越来越多的小 Chunk。当 Chunk 数量很大时,为了寻找一个合适的内存块而遍历双链表无疑是一笔巨大的开销。

为了实现对空闲块的高效管理,BFC 算法设计了 Bin 这个抽象数据结构。

  • 每个 Bin 都有一个size属性,一个 Bin 是一个拥有 Chunk size >= Bin size的空闲 Chunk 的集合 。集合中的 Chunk 按照 Chunk size 的升序组织成单链表。
  • BFC 算法维护了一个 Bin 的集合:Bins。它由多个 Bin 以及从属于每个 Bin 的 Chunk 组成。内存中所有的空闲 Chunk 都由 Bins 管理。

图中每一列表示一个 Bin,列首方格中的数字表示 Bin 的size。Bin size 的大小都是 256 的 2^n 的倍。每个 Bin 下面挂载了一系列的空闲 Chunk ,每个 Chunk 的 Chunk size 都大于等于所属的 Bin 的 Bin size,按照 Chunk size 的升序 挂载成单链表

以下内容部分来自:Tensorflow 源码分析- 从GPU OOM开始说Tensorflow的BFC内存管理

当申请新的内存的时候,如何更快高效的查找匹配的空闲 Chunk,这是非常重要的。TensorFlow 基于 Chunk 构建了一个全局的 Bin,每个 Bin 里管理的是内存大小在一定范围的 Chunk(内存大小范围 (2^bin_num)*256 到 (2^(bin_num+1))*256-1的,bin_num 代表的是 Bin 的序列号)。每个 Bin 里会保存着一个空闲的 Chunk 的 Set。

以下内容部分来自:TensorFlow中的显存管理器------BFC Allocator - DeepLearningStack

如果我们想查询某一块符合条件的空闲 Chunk 并取出,那么只能对双向链表做遍历,显然这个效率不是很高。为了加速查询某块 Chunk 的速度 ,可以++在创建 Chunk 链表时按一定顺序排列,并将整个有序链表在逻辑上切分成多个段,为每个段记录所包含的 Chunk 的范围++ ,这种结构就是 Bin,它相当于一种索引 。因此,Bin 结构是为了方便 Chunk 的查询而出现的。

在 BFC Allocator 中,每个段中 Chunk 的顺序是按照size和基地址升序排序 的,每个 Bin 都设有自己的bin_size,该bin_size表示该段包含的最小 Chunk 的size。这样一来,用户端就可以根据所需要申请的 Memory 大小直接找到对应的 Bin,然后在该 Bin 中遍历寻找适合的 Chunk。为了能够根据bin_size直接定位到 Bin,规定bin_sizebin_num的大小关系为:bin_size=256 * 2^bin_num。用户在申请 Memory 时,会将实际大小映射到最适合的bin_size上,然后再根据bin_sizebin_num的关系找到对应的 Bin,进而在该段中遍历搜索。

Bin 中 Chunk 的是通过 Set 组织的,为了能在 Set 中体现双向链表的逻辑,只需要让 Chunk 在 Set 中按照规则升序排列,并修正前驱后继指针即可。

以下内容部分来自:极简入门TensorFlow 内存管理

BFCAllocator 下的两个比较重要的数据结构,Chunk 和 Bin,两者之间的关系如下图,看起来像一个个糖葫芦 ,第一个 bin size 为 256<<1, 第二个为 256<<2, 一次类推,TF 内有 21 个 bin,最后 bin size 为 256 << 21 为 512MB,每一个 bin 下面会接下若干个大于 bin size 的 Chunk ,整个内存空间由以下的结构来组织,当分配内存大小指定时,系统会遍历 bin,找到能够第一次满足 Chunk > bin_size,每一个 bin 下的 Chunk 是有序的(Bin 下的 ChunkComparator)

Bin代码实现

cpp 复制代码
  // A Bin is a collection of similar-sized free chunks.
  // Allocated chunks are never in a Bin.
  struct Bin {
    // All chunks in this bin have >= bin_size memory.
    size_t bin_size = 0;

    class ChunkComparator {
     public:
      explicit ChunkComparator(BFCAllocator* allocator)
          : allocator_(allocator) {}
      // Sort first by size and then use pointer address as a tie breaker.
      bool operator()(const ChunkHandle ha, const ChunkHandle hb) const
          ABSL_NO_THREAD_SAFETY_ANALYSIS {
        const Chunk* a = allocator_->ChunkFromHandle(ha);
        const Chunk* b = allocator_->ChunkFromHandle(hb);
        if (a->size != b->size) {
          return a->size < b->size;
        }
        return a->ptr < b->ptr;
      }

     private:
      BFCAllocator* allocator_;  // The parent allocator
    };

    typedef std::set<ChunkHandle, ChunkComparator> FreeChunkSet;
    // List of free chunks within the bin, sorted by chunk size.
    // Chunk * not owned.
    FreeChunkSet free_chunks;
    Bin(BFCAllocator* allocator, size_t bs)
        : bin_size(bs), free_chunks(ChunkComparator(allocator)) {}
  };

3.3 辅助类

以下内容部分来自:Nvidia GPU显存池-BFC

AllocationRegion 与 RegionManager 两个辅助类的主要功能是记录每次分配给用户的显存指针和对应 Chunk 的映射关系,方便后续显存回收

在本文,我们把重点放在 TensorFlow BFC 算法的核心思想,不展开讨论辅助类 AllocationRegion 和 RegionManager,感兴趣可以学习:

4 BFC 算法核心

4.1 分配内存

以下内容部分来自:TensorFlow中的显存管理器------BFC Allocator - DeepLearningStack

这是 BFCAllocator 的为用户分配 Chunk 的总体流程,该过程涉及到了几个比较重要的子过程。它们分别是

  • 遍历搜索寻找最佳 Chunk 指针的FindChunkPtr过程,
  • 当 Chunk 链表中不存在合适的 Chunk 以至于不得不向物理设备申请新存储空间的Extend过程,
  • 以及分配 Chunk 时为缓解碎片问题而出现的SplitChunk过程。

总流程AllocateRawInternal代码实现

cpp 复制代码
void* BFCAllocator::AllocateRawInternal(size_t unused_alignment,
                                        size_t num_bytes,
                                        bool dump_log_on_failure,
                                        uint64 freed_before) {
  if (num_bytes == 0) {
    VLOG(2) << "tried to allocate 0 bytes";
    return nullptr;
  }
  
  // 1) 调整大小
  // First, always allocate memory of at least kMinAllocationSize
  // bytes, and always allocate multiples of kMinAllocationSize bytes
  // so all memory addresses are nicely byte aligned.
  size_t rounded_bytes = RoundedBytes(num_bytes);
 
  // 2) 查找 Bin
  // The BFC allocator tries to find the best fit first.
  BinNum bin_num = BinNumForSize(rounded_bytes);

  absl::MutexLock l(&mutex_);
  if (!timestamped_chunks_.empty()) {
    // Merge timestamped chunks whose counts have become safe for general use.
    MergeTimestampedChunks(0);
  }
  
  // 3) 查找 Chunk
  void* ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
  if (ptr != nullptr) {
    AddTraceMe("MemoryAllocation", ptr);
    return ptr;
  }
  
  // 4) 如果查找失败,那么扩展内存,然后再查找合适的空闲Chunk
  // Try to extend
  if (Extend(unused_alignment, rounded_bytes)) {
    ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);  // 4.8 再次查找 Chunk
    if (ptr != nullptr) {
      AddTraceMe("MemoryAllocation", ptr);
      return ptr;
    }
  }
  
  // 5) 第一次尝试合并,并再次查找 Chunk
  if ((freed_before == 0) && (!timestamped_chunks_.empty())) {
    // We're unable to satisfy an allocation request without a specific
    // timestamp requirement.  Rather than fail, try merging any held-out
    // timestamped chunks more aggressively until a free chunk of the necessary
    // size is formed.
    if (MergeTimestampedChunks(rounded_bytes)) {
      ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
      if (ptr != nullptr) {
        AddTraceMe("MemoryAllocation", ptr);
        return ptr;
      }
    }
  }

  // 6) 第二次尝试合并,并再次查找 Chunk
  // Reaching this point means that no chunks can satisfy the request. Also,
  // the unallocated bytes cannot satisfy the request. Before giving up, let's
  // try deallocating free regions so that suballocator can combine them with
  // the unallocated bytes and form a larger region.
  if (DeallocateFreeRegions(rounded_bytes) &&
      Extend(unused_alignment, rounded_bytes)) {
    ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
    if (ptr != nullptr) {
      AddTraceMe("MemoryAllocation", ptr);
      return ptr;
    }
  }
  
  // 7) 可用内存不足导致分配失败,报告 OOM
  // We searched all bins for an existing free chunk to use and
  // couldn't find one.  This means we must have run out of memory,
  // Dump the memory log for analysis.
  MaybeWriteMemoryMap();
  if (dump_log_on_failure) {
    LOG(WARNING)
        << "Allocator (" << Name() << ") ran out of memory trying "
        << "to allocate " << strings::HumanReadableNumBytes(num_bytes)
        << " (rounded to " << rounded_bytes << ")" << "requested by op "
        << tsl::profiler::ScopedMemoryDebugAnnotation::CurrentAnnotation()
               .pending_op_name
        << "\nIf the cause is memory fragmentation maybe the environment "
        << "variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will "
        << "improve the situation. \nCurrent allocation summary follows."
        << "\nCurrent allocation summary follows.";
    DumpMemoryLog(rounded_bytes);
    LOG(WARNING) << RenderOccupancy();
  }
  return nullptr;
}

4.1.1 调整大小

RoundedBytes代码实现

cpp 复制代码
size_t BFCAllocator::RoundedBytes(size_t bytes) {
  size_t rounded_bytes =
      (kMinAllocationSize *
       ((bytes + kMinAllocationSize - 1) / kMinAllocationSize));
  DCHECK_EQ(size_t{0}, rounded_bytes % kMinAllocationSize);
  return rounded_bytes;
}

将每次分配的内存大小调整为kMinAllocationSize的N倍,这样所有内存地址都是很好地按字节对齐了。

4.1.2 查找Bin

BinNumForSize代码实现

cpp 复制代码
  BinNum BinNumForSize(size_t bytes) {
    uint64 v = std::max<size_t>(bytes, 256) >> kMinAllocationBits;
    int b = std::min(kNumBins - 1, tsl::Log2Floor64(v));
    return b;
  }

Bin size 是 256 的 2^N 倍。std::max<size_t>(bytes, 256) >> kMinAllocationBits先将分配内存的字节数右移 8 位,然后把结果用在std::min(kNumBins - 1, tsl::Log2Floor64(v)),计算出的二进制有效位数即为 Bin 在 Bins 中的索引

4.1.3 查找Chunk

FindChunkPtr代码实现

cpp 复制代码
void* BFCAllocator::FindChunkPtr(BinNum bin_num, size_t rounded_bytes,
                                 size_t num_bytes, uint64 freed_before) {
  // First identify the first bin that could satisfy rounded_bytes.
  for (; bin_num < kNumBins; bin_num++) {
    // Start searching from the first bin for the smallest chunk that fits
    // rounded_bytes.
    Bin* b = BinFromIndex(bin_num);
    for (auto citer = b->free_chunks.begin(); citer != b->free_chunks.end();
         ++citer) {
         
      // 1) 从之前得到的 Bin 索引开始,查找合适的空闲 Chunk
      const BFCAllocator::ChunkHandle h = (*citer);
      BFCAllocator::Chunk* chunk = ChunkFromHandle(h);
      
      DCHECK(!chunk->in_use());
      if (freed_before > 0 && freed_before < chunk->freed_at_count) {
        continue;
      }
      if (chunk->size >= rounded_bytes) {
        
        // 2) 将找到的 Chunk 从 Bin 中移除
        // We found an existing chunk that fits us that wasn't in use, so remove
        // it from the free bin structure prior to using.
        RemoveFreeChunkIterFromBin(&b->free_chunks, citer);

		// 3) 拆分 Chunk:当以下两个条件之一满足时,SplitChunk过程将被触发。
		// 		1. 当 Chunk 的 size 是用户请求的 round size 两倍及以上时(用户请求的 size 会根据最小分配单元做 round 近似)
		// 		2. 当 Chunk 的 size 减去用户请求的 round size 后依然大于等于最大碎片限定时(128MB)
        // If we can break the size of the chunk into two reasonably large
        // pieces, do don't waste more than max_internal_fragmentation_bytes on
        // padding. If this threshold is not set by the user, then use 128MB as
        // the default.
        const int64_t max_internal_fragmentation_bytes =
            (opts_.fragmentation_fraction > 0.0)
                ? opts_.fragmentation_fraction * memory_limit_
                : 128 << 20;
		
        if (chunk->size >= rounded_bytes * 2 ||
            static_cast<int64_t>(chunk->size) - rounded_bytes >=
                max_internal_fragmentation_bytes) {
          SplitChunk(h, rounded_bytes);
          chunk = ChunkFromHandle(h);  // Update chunk pointer in case it moved
        }
		
		// 4) 修改 Chunk 的请求大小、分配 ID(标记 Chunk 被占用)
        // The requested size of the returned chunk is what the user
        // has allocated.
        chunk->requested_size = num_bytes;
        // Assign a unique id and increment the id counter, marking the
        // chunk as being in use.
        chunk->allocation_id = next_allocation_id_++;
		
		// 5) 更新统计
        // Update stats.
        ++stats_.num_allocs;
        stats_.bytes_in_use += chunk->size;
        if (stats_.bytes_in_use > stats_.peak_bytes_in_use) {
          VLOG(2) << "New Peak memory usage of " << stats_.bytes_in_use
                  << " bytes for " << Name();
        }
        stats_.peak_bytes_in_use =
            std::max(stats_.peak_bytes_in_use, stats_.bytes_in_use);
        stats_.largest_alloc_size =
            std::max<std::size_t>(stats_.largest_alloc_size, chunk->size);

#ifdef TENSORFLOW_MEM_DEBUG
        if (ShouldRecordOpName()) {
          const auto& annotation =
              profiler::ScopedMemoryDebugAnnotation::CurrentAnnotation();
          if (annotation.pending_op_name != nullptr) {
            chunk->op_name = annotation.pending_op_name;
          } else {
            LOG(INFO) << "missing pending_op_name for " << Name()
                      << " reading addr "
                      << static_cast<const void*>(&annotation.pending_op_name)
                      << "\n"
                      << CurrentStackTrace();
            chunk->op_name = nullptr;
          }
          chunk->action_count = ++action_counter_;
          chunk->step_id = annotation.pending_step_id;
          int slot = chunk->action_count % MEM_DEBUG_SIZE_HISTORY_SIZE;
          size_history_[slot] = stats_.bytes_in_use;
        }
#endif

        VLOG(4) << "Returning: " << chunk->ptr;
        if (VLOG_IS_ON(4)) {
          LOG(INFO) << "A: " << RenderOccupancy();
        }
        
        // 6) 成功时返回找到的 Chunk 指针
        return chunk->ptr;
      }
    }
  }
  
  // 7) 失败时返回空指针
  return nullptr;
}

SplitChunk代码实现

cpp 复制代码
void BFCAllocator::SplitChunk(BFCAllocator::ChunkHandle h, size_t num_bytes) {
  // Allocate the new chunk before we do any ChunkFromHandle
  ChunkHandle h_new_chunk = AllocateChunk();

  Chunk* c = ChunkFromHandle(h);
  CHECK(!c->in_use() && (c->bin_num == kInvalidBinNum));

  // Create a new chunk starting num_bytes after c
  BFCAllocator::Chunk* new_chunk = ChunkFromHandle(h_new_chunk);
  new_chunk->ptr = static_cast<void*>(static_cast<char*>(c->ptr) + num_bytes);
  region_manager_.set_handle(new_chunk->ptr, h_new_chunk);

  // Set the new sizes of the chunks.
  new_chunk->size = c->size - num_bytes;
  c->size = num_bytes;

  // The new chunk is not in use.
  new_chunk->allocation_id = -1;

  // It inherits the freed time.
  new_chunk->freed_at_count = c->freed_at_count;

  // 1) 调整 Chunk 的前驱后继指针
  // Maintain the pointers.
  // c <-> c_neighbor becomes
  // c <-> new_chunk <-> c_neighbor
  BFCAllocator::ChunkHandle h_neighbor = c->next;
  new_chunk->prev = h;
  new_chunk->next = h_neighbor;
  c->next = h_new_chunk;
  if (h_neighbor != kInvalidChunkHandle) {
    Chunk* c_neighbor = ChunkFromHandle(h_neighbor);
    c_neighbor->prev = h_new_chunk;
  }

  // 2) 根据 Free Chunk 的大小,将它插入到对应的 Bin 中
  // Add the newly free chunk to the free bin.
  InsertFreeChunkIntoBin(h_new_chunk);
}

4.1.4 扩展内存

Extend代码实现

cpp 复制代码
bool BFCAllocator::Extend(size_t alignment, size_t rounded_bytes) {
  size_t available_bytes = memory_limit_ - *stats_.pool_bytes;
  // Rounds available_bytes down to the nearest multiple of kMinAllocationSize.
  available_bytes = (available_bytes / kMinAllocationSize) * kMinAllocationSize;
  
  // 1) 已占用的加上申请的内存大小,超过最大内存限制时,返回失败。
  // Do we have enough space to handle the client's request?
  // If not, fail immediately.
  if (rounded_bytes > available_bytes) {
    return false;
  }

  // 2) 循环将当前区域可分配的内存扩充 1 倍,直到大于等于申请的内存大小。
  // If curr_region_allocation_bytes_ is not enough to satisfy the
  // allocation, keep multiplying by a power of two until that is
  // sufficient.
  bool increased_allocation = false;
  while (rounded_bytes > curr_region_allocation_bytes_) {
    curr_region_allocation_bytes_ *= 2;
    increased_allocation = true;
  }
  
  // 3) 从内存池里分配内存
  // Try allocating.
  size_t bytes = std::min(curr_region_allocation_bytes_, available_bytes);
  size_t bytes_received;
  void* mem_addr = sub_allocator_->Alloc(alignment, bytes, &bytes_received);
  
  // 4) 如果失败,尝试分配申请内存大小的 90%。一直重复,直到分配成功或可用内存不足。
  if (mem_addr == nullptr) {
    static constexpr float kBackpedalFactor = 0.9;

    // Try allocating less memory.
    while (mem_addr == nullptr) {
      bytes = RoundedBytes(bytes * kBackpedalFactor);
      if (bytes < rounded_bytes) {
        return false;
      }
      mem_addr = sub_allocator_->Alloc(alignment, bytes, &bytes_received);
    }
  }

  if (!increased_allocation) {
    // Increase the region size of the next required allocation.
    curr_region_allocation_bytes_ *= 2;
  }

  VLOG(1) << "Extending allocation by "
          << strings::HumanReadableNumBytes(bytes_received) << " bytes for "
          << Name() << ".";

  *stats_.pool_bytes += bytes_received;
  *stats_.peak_pool_bytes =
      std::max(*stats_.pool_bytes, *stats_.peak_pool_bytes);
  VLOG(1) << "Total allocated bytes: "
          << strings::HumanReadableNumBytes(*stats_.pool_bytes);

  VLOG(1) << "Allocated memory at " << mem_addr << " to "
          << static_cast<void*>(static_cast<char*>(mem_addr) + bytes_received);
  
  // 5) 给分配的内存添加 AllocationRegion
  AllocationRegion* maybe_extended_region = nullptr;
  if (coalesce_regions_) {
    maybe_extended_region =
        region_manager_.AddOrExtendAllocationRegion(mem_addr, bytes_received);
  } else {
    region_manager_.AddAllocationRegion(mem_addr, bytes_received);
  }
  
  // 6) 创建一个空闲 Chunk
  // Create one large chunk for the whole memory space that will
  // be chunked later.
  ChunkHandle h = AllocateChunk();
  BFCAllocator::Chunk* c = ChunkFromHandle(h);
  c->ptr = mem_addr;
  c->size = bytes_received;
  c->allocation_id = -1;
  c->prev = kInvalidChunkHandle;
  c->next = kInvalidChunkHandle;
  c->freed_at_count = 0;

  region_manager_.set_handle(c->ptr, h);

  // If the region was extended, then there exists a previous chunk that should
  // be linked to the new chunk.
  if (maybe_extended_region != nullptr) {
    ChunkHandle prev =
        maybe_extended_region->get_handle(maybe_extended_region->ptr());
    BFCAllocator::Chunk* prev_chunk = ChunkFromHandle(prev);
    // Find the last recorded chunk in the extended region.
    while (prev_chunk->next != kInvalidChunkHandle) {
      prev = prev_chunk->next;
      prev_chunk = ChunkFromHandle(prev);
    }
    c->prev = prev;
    prev_chunk->next = h;
  }
  
  // 7) 将空闲 Chunk 插入 Bin 中
  // Maybe merge adjacent chunks and insert the chunk into the right bin.
  InsertFreeChunkIntoBin(TryToCoalesce(h, /*ignore_freed_at=*/false));

  return true;
}

4.2 回收内存

以下内容部分来自:TensorFlow中的显存管理器------BFC Allocator - DeepLearningStack

因为在回收时只知道存储空间首地址指针 ,并不知道其对应的 Chunk,所以需要先借助region_manager等辅助工具获取其所对应的 Chunk 指针,然后考虑其前驱后继节点是否可以合并。下面展示了整体流程。

总流程DeallocateRawInternal代码实现

cpp 复制代码
void BFCAllocator::DeallocateRawInternal(void* ptr) {
  if (ptr == nullptr) {
    VLOG(2) << "tried to deallocate nullptr";
    return;
  }
  absl::MutexLock l(&mutex_);
  
  // 1) 从 RegionManager 的指针到 ChunkHandle 的映射关系中得到 ChunkHandle
  // Find the chunk from the ptr.
  BFCAllocator::ChunkHandle h = region_manager_.get_handle(ptr);
  CHECK(h != kInvalidChunkHandle);
  // Record chunk information before it's freed.
  Chunk* chunk = ChunkFromHandle(h);
  void* chunk_ptr = chunk->ptr;
  int64_t req_bytes = chunk->requested_size;
  int64_t alloc_bytes = chunk->size;
  
  // 2) 将 Chunk 标记为空闲,然后把总占用的内存量减去 Chunk 的大小
  MarkFree(h);

  // Consider coalescing it.
  if (timing_counter_) {
    InsertFreeChunkIntoBin(h);
    timestamped_chunks_.push_back(h);
  } else {
    // 3) 合并 Chunk
    // 4) 将合并后的空闲 Chunk 插入 Bin
    InsertFreeChunkIntoBin(TryToCoalesce(h, false));
  }

  // TraceMe needs to be added after MarkFree and InsertFreeChunkIntoBin for
  // correct aggregation stats (bytes_in_use, fragmentation).
  AddTraceMe("MemoryDeallocation", chunk_ptr, req_bytes, alloc_bytes);

  if (VLOG_IS_ON(4)) {
    LOG(INFO) << "F: " << RenderOccupancy();
  }
}

4.2.1 获取ChunkHandle

4.2.2 更新状态

MarkFree代码实现

cpp 复制代码
void BFCAllocator::MarkFree(BFCAllocator::ChunkHandle h) {
  Chunk* c = ChunkFromHandle(h);
  CHECK(c->in_use() && (c->bin_num == kInvalidBinNum));

  // 1) 将 Chunk 标记为空闲
  // Mark the chunk as no longer in use.
  c->allocation_id = -1;

  // Optionally record the free time.
  if (timing_counter_) {
    c->freed_at_count = timing_counter_->next();
  }
  
  // 2) 更新状态,把总占用的内存量减去 Chunk 的大小
  // Updates the stats.
  stats_.bytes_in_use -= c->size;

#ifdef TENSORFLOW_MEM_DEBUG
  if (ShouldRecordOpName()) {
    c->action_count = ++action_counter_;
    int slot = c->action_count % MEM_DEBUG_SIZE_HISTORY_SIZE;
    size_history_[slot] = stats_.bytes_in_use;
  }
#endif
}

4.2.3 合并Chunk

TryToCoalesce代码实现

cpp 复制代码
BFCAllocator::ChunkHandle BFCAllocator::TryToCoalesce(ChunkHandle h,
                                                      bool ignore_freed_at) {
  Chunk* c = ChunkFromHandle(h);
  if ((!ignore_freed_at) && c->freed_at_count > 0) return h;
  ChunkHandle coalesced_chunk = h;
  
  // 1) 合并直接后继
  // If the next chunk is free, merge it into c and delete it.
  if (c->next != kInvalidChunkHandle && !ChunkFromHandle(c->next)->in_use()) {
    Chunk* n = ChunkFromHandle(c->next);
    if ((n->freed_at_count == 0) || ignore_freed_at) {
      VLOG(4) << "Merging c->next " << n->ptr << " with c " << c->ptr;
      // 1.1) 将直接后继从 Bin 中移除
      RemoveFreeChunkFromBin(c->next);
      // 1.2) 合并
      Merge(h, c->next);
    }
  }
  
  // 2) 合并直接前趋
  // If the previous chunk is free, merge c into it and delete c.
  if (c->prev != kInvalidChunkHandle && !ChunkFromHandle(c->prev)->in_use()) {
    Chunk* n = ChunkFromHandle(c->prev);
    if ((n->freed_at_count == 0) || ignore_freed_at) {
      VLOG(4) << "Merging c " << c->ptr << " into c->prev " << n->ptr;
      coalesced_chunk = c->prev;
      // 2.1) 将直接前趋从 Bin 中移除
      RemoveFreeChunkFromBin(c->prev);
      // 2.2) 合并
      Merge(c->prev, h);
    }
  }

  return coalesced_chunk;
}

Merge代码实现

cpp 复制代码
// Merges h1 and h2 when Chunk(h1)->next is h2 and Chunk(h2)->prev is c1.
// We merge Chunk(h2) into Chunk(h1).
void BFCAllocator::Merge(BFCAllocator::ChunkHandle h1,
                         BFCAllocator::ChunkHandle h2) {
  Chunk* c1 = ChunkFromHandle(h1);
  Chunk* c2 = ChunkFromHandle(h2);
  // We can only merge chunks that are not in use.
  CHECK(!c1->in_use() && !c2->in_use());

  // c1's prev doesn't change, still points to the same ptr, and is
  // still not in use.

  // 1) 修改 Chunk 的指向关系
  // Fix up neighbor pointers
  //
  // c1 <-> c2 <-> c3 should become
  // c1 <-> c3

  BFCAllocator::ChunkHandle h3 = c2->next;
  c1->next = h3;
  CHECK(c2->prev == h1);
  if (h3 != kInvalidChunkHandle) {
    BFCAllocator::Chunk* c3 = ChunkFromHandle(h3);
    c3->prev = h1;
  }

  // 2) 更新 Chunk 大小
  // Set the new size
  c1->size += c2->size;

  // Pick latest free time.
  c1->freed_at_count = std::max(c1->freed_at_count, c2->freed_at_count);
  
  // 3) 删除被合并的 Chunk
  DeleteChunk(h2);
}

5 总结

以下内容来自:TensorFlow中的显存管理器------BFC Allocator - DeepLearningStack

本文总结了 TensorFlow 中存储管理器------BFC Allocator。它的设计思路来自于经典来的 dlmalloc 分配算法 ,是 Best fit coalecing 的简单实现版本。BFC Allocator 是为了应对 TensorFlow 中频繁分配释放存储空间需求的场景 而出现的解决方案,通过事先将存储空间从物理设备上开辟好,并将这些空闲存储空间封装成 Chunk,组织成有序双向链表,然后利用 Bin 这一种索引结构为 Chunk 的查询做加速,最终完成了高效的分配算法。在实际分配时,可能会遇到 Chunk 链表中不存在符合要求的空闲 Chunk 情况,这时候就可能需要向物理设备中再次开辟新的存储空间,这个过程被视为对 Chunk 链表的扩展,对应的过程是Extend。因为是按 Chunk 进行分配,势必可能造成存储碎片,为了解决碎片问题,BFC Allocator 设计了SplitChunkMerge函数。

参考资料

关于 XLA

关于 TensorFlow BFC

相关推荐
Y_3_71 小时前
30 分钟从零开始入门 CSS
开发语言·前端·css·人工智能·python·tensorflow
IT、木易1 小时前
大白话css第二章深入学习
前端·css·学习
fanged2 小时前
CMake小结2(PICO为例)
学习·嵌入式
笨鸟先飞的橘猫4 小时前
lua学习(一)
开发语言·学习·lua
白云千载尽4 小时前
大语言加持的闭环端到端自动驾驶模型 学习笔记纯干货
论文阅读·笔记·学习·自动驾驶·carla·vlm
wrjwww6 小时前
【Git学习笔记】Git常用命令
笔记·git·学习
键盘敲没电7 小时前
【iOS】小蓝书学习(四)
学习·ios·cocoa
我要昵称干什么8 小时前
STM32学习——RTC实时时钟(BKP与RTC外设)
c语言·stm32·单片机·嵌入式硬件·学习·实时音视频