pprof implementation

cloud.tencent.com/developer/a...

Profiling

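When we want to look closely at how fast a program runs, the best approach is profiling. Profiling automatically samples events while the program executes and then extrapolates from those samples at the end; the resulting statistical summary is called a profile.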

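The CPU profile identifies the functions that consume the most CPU time. The thread running on each CPU is interrupted by the operating system every few milliseconds; each interruption records one profile event, and then normal execution resumes.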

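The heap profile identifies the statements that allocate the most memory. The profiling library samples calls to the internal memory allocation routines so that, on average, one profile event is recorded per 512KB of allocated memory.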

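The blocking profile identifies the operations that block goroutines the longest, such as system calls, channel sends and receives, and lock acquisitions. The profiling library records an event each time a goroutine is blocked by one of these operations.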

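Enabling one of the flags below is enough to generate the corresponding profile. Be careful when combining several flags, because gathering one kind of profile can skew the results of the others.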

go test -cpuprofile cpu.prof

go test -memprofile mem.prof

go test -blockprofile block.prof

go test -mutexprofile mutex.prof

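Once the samples have been collected, we can analyze them with pprof. It ships with the Go toolchain, but because it is not an everyday tool it is reached through the go tool pprof command. The command has many features and options, but basic use needs only two arguments: the executable that produced the profile and the profile itself. Since Go 1.9 the profile embeds symbol information, so the executable can usually be omitted, as in the commands further down.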

func TestBasic(t *testing.T) {
   for i := 0; i < 200; i++ { // running it many times
      if Fib2(30) != 832040 {
         t.Error("Incorrect!")
      }
   }
}
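
The test above calls Fib2, which the article never shows. A minimal sketch, assuming Fib2 is a plain recursive Fibonacci: only the name and the expected value 832040 come from the test, the body is an assumption. It is written as a standalone program here; in practice the function would live in the same package as the test. The naive recursion is handy for profiling because it burns enough CPU for samples to land in it.

```go
package main

import "fmt"

// Fib2 returns the n-th Fibonacci number, with Fib2(0) == 0 and Fib2(1) == 1.
// Hypothetical body: the original test only fixes Fib2(30) == 832040.
func Fib2(n int) int {
	if n < 2 {
		return n
	}
	return Fib2(n-1) + Fib2(n-2)
}

func main() {
	fmt.Println(Fib2(30)) // prints 832040
}
```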

Run the following in the directory that contains the test file:

go test -run TestBasic -cpuprofile cpu.prof

go tool pprof cpu.prof

At the pprof prompt, enter top5 -cum

The topN command shows the top N entries, and the -cum flag sorts them by cumulative time.

When you visit http://host:port/debug/pprof/goroutine?debug=1, pprof returns a list of all goroutines with their stack traces.
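
The /debug/pprof/ endpoints exist only after the program imports net/http/pprof, which registers them on http.DefaultServeMux. The sketch below queries the goroutine endpoint in-process through httptest, so no port is opened; the helper name goroutineDump is made up for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
	"strings"
)

// goroutineDump asks the registered pprof handler for the text form of the
// goroutine profile without opening a network port.
func goroutineDump() string {
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/goroutine?debug=1", nil)
	rec := httptest.NewRecorder()
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	dump := goroutineDump()
	// Print only the summary line, which has the form "goroutine profile: total N".
	fmt.Println(strings.SplitN(dump, "\n", 2)[0])
}
```

In a real service the same import is paired with a listener such as http.ListenAndServe("localhost:6060", nil), where the address is an assumption, and the URL is then fetched with a browser or go tool pprof.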

// A Profile is a collection of stack traces showing the call sequences
// that led to instances of a particular event, such as allocation.
// Packages can create and maintain their own profiles; the most common
// use is for tracking resources that must be explicitly closed, such as files
// or network connections.
//
// A Profile's methods can be called from multiple goroutines simultaneously.
//
// Each Profile has a unique name. A few profiles are predefined:
//
// goroutine    - stack traces of all current goroutines
// heap         - a sampling of memory allocations of live objects
// allocs       - a sampling of all past memory allocations
// threadcreate - stack traces that led to the creation of new OS threads
// block        - stack traces that led to blocking on synchronization primitives
// mutex        - stack traces of holders of contended mutexes
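
The comment mentions that packages can maintain their own profiles for resources that must be closed explicitly. A minimal sketch of that use, built on pprof.NewProfile, Add, Remove and Count; the profile name and the conn type are hypothetical.

```go
package main

import (
	"fmt"
	"runtime/pprof"
)

// openConns is a hypothetical custom profile tracking connections that are
// open but not yet closed. NewProfile panics if the name is already taken.
var openConns = pprof.NewProfile("example.com/openConns")

// conn stands in for a resource that must be closed explicitly.
type conn struct{ id int }

func openConn(id int) *conn {
	c := &conn{id: id}
	// skip=1 starts the recorded stack at the caller of openConn.
	openConns.Add(c, 1)
	return c
}

// Close removes the connection from the profile.
func (c *conn) Close() { openConns.Remove(c) }

// leakDemo opens two connections, closes them one at a time, and returns the
// profile's entry count after each step.
func leakDemo() (afterOpen, afterOneClose, afterAllClosed int) {
	a, b := openConn(1), openConn(2)
	afterOpen = openConns.Count()
	a.Close()
	afterOneClose = openConns.Count()
	b.Close()
	afterAllClosed = openConns.Count()
	return
}

func main() {
	fmt.Println(leakDemo()) // prints 2 1 0
}
```

A count that stays above zero at shutdown points to a leaked resource, and writing the profile with WriteTo shows the stacks that opened it.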

// The heap profile tracks both the allocation sites for all live objects in
// the application memory and for all objects allocated since the program start.
// Pprof's -inuse_space, -inuse_objects, -alloc_space, and -alloc_objects
// flags select which to display, defaulting to -inuse_space (live objects,
// scaled by size).

// The allocs profile is the same as the heap profile but changes the default
// pprof display to -alloc_space, the total number of bytes allocated since
// the program began (including garbage-collected bytes).

// writeHeap writes the current runtime heap profile to w.
func writeHeap(w io.Writer, debug int) error {
   return writeHeapInternal(w, debug, "")
}

// writeAlloc writes the current runtime heap profile to w
// with the total allocation space as the default sample type.
func writeAlloc(w io.Writer, debug int) error {
   return writeHeapInternal(w, debug, "alloc_space")
}
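
writeHeap and writeAlloc are unexported; user code reaches them through pprof.Lookup("heap") and pprof.Lookup("allocs"). The sketch below dumps both with debug=0, which yields the gzip-compressed protobuf form that go tool pprof reads. The helper name dumpProfile is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// dumpProfile serializes the named predefined profile. With debug=0 the
// output is the gzip-compressed protobuf that go tool pprof reads.
func dumpProfile(name string) ([]byte, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return nil, fmt.Errorf("no profile named %q", name)
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	for _, name := range []string{"heap", "allocs"} {
		data, err := dumpProfile(name)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %d bytes\n", name, len(data))
	}
}
```

Both dumps carry the same samples; they differ only in which sample type pprof displays by default.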

var (
   mbuckets  *bucket // memory profile buckets
   bbuckets  *bucket // blocking profile buckets
   xbuckets  *bucket // mutex profile buckets
   ...
)

// A bucket holds per-call-stack profiling information.
// The representation is a bit sleazy, inherited from C.
// This struct defines the bucket header. It is followed in
// memory by the stack words and then the actual record
// data, either a memRecord or a blockRecord.
//
// Per-call-stack profiling information.
// Lookup by hashing call stack into a linked-list hash table.
//
// No heap pointers.
//
//go:notinheap
type bucket struct {
   next    *bucket
   allnext *bucket
   typ     bucketType // memBucket or blockBucket (includes mutexProfile)
   hash    uintptr
   size    uintptr
   nstk    uintptr
}

type memRecord struct {
   // The following complex 3-stage scheme of stats accumulation
   // is required to obtain a consistent picture of mallocs and frees
   // for some point in time.
   // The problem is that mallocs come in real time, while frees
   // come only after a GC during concurrent sweeping. So if we would
   // naively count them, we would get a skew toward mallocs.
   //
   // Hence, we delay information to get consistent snapshots as
   // of mark termination. Allocations count toward the next mark
   // termination's snapshot, while sweep frees count toward the
   // previous mark termination's snapshot:
   //
   //              MT          MT          MT          MT
   //             .·|         .·|         .·|         .·|
   //          .·˙  |      .·˙  |      .·˙  |      .·˙  |
   //       .·˙     |   .·˙     |   .·˙     |   .·˙     |
   //    .·˙        |.·˙        |.·˙        |.·˙        |
   //
   //       alloc → ▲ ← free
   //               ┠┅┅┅┅┅┅┅┅┅┅┅P
   //       C+2     →    C+1    →  C
   //
   //                   alloc → ▲ ← free
   //                           ┠┅┅┅┅┅┅┅┅┅┅┅P
   //                   C+2     →    C+1    →  C
   //
   // Since we can't publish a consistent snapshot until all of
   // the sweep frees are accounted for, we wait until the next
   // mark termination ("MT" above) to publish the previous mark
   // termination's snapshot ("P" above). To do this, allocation
   // and free events are accounted to *future* heap profile
   // cycles ("C+n" above) and we only publish a cycle once all
   // of the events from that cycle must be done. Specifically:
   //
   // Mallocs are accounted to cycle C+2.
   // Explicit frees are accounted to cycle C+2.
   // GC frees (done during sweeping) are accounted to cycle C+1.
   //
   // After mark termination, we increment the global heap
   // profile cycle counter and accumulate the stats from cycle C
   // into the active profile.

   // active is the currently published profile. A profiling
   // cycle can be accumulated into active once its complete.
   active memRecordCycle

   // future records the profile events we're counting for cycles
   // that have not yet been published. This is ring buffer
   // indexed by the global heap profile cycle C and stores
   // cycles C, C+1, and C+2. Unlike active, these counts are
   // only for a single cycle; they are not cumulative across
   // cycles.
   //
   // We store cycle C here because there's a window between when
   // C becomes the active cycle and when we've flushed it to
   // active.
   future [3]memRecordCycle
}

Some key points:

// Mallocs are accounted to cycle C+2.
// Explicit frees are accounted to cycle C+2.
// GC frees (done during sweeping) are accounted to cycle C+1.

...
// After mark termination, we increment the global heap
// profile cycle counter and accumulate the stats from cycle C
// into the active profile.
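
The cycle accounting is easier to follow with a toy model. The sketch below is not the runtime's code: it keeps only counts for a single call stack, and every type and method name in it is made up. It does follow the rules quoted above: mallocs go to cycle C+2, sweep frees go to cycle C+1, and each mark termination increments C and folds the new cycle C into the published totals.

```go
package main

import "fmt"

// cycleStats is a reduced memRecordCycle: counts only, no byte totals.
type cycleStats struct{ allocs, frees int }

// toyRecord mimics memRecord for a single call stack.
type toyRecord struct {
	active cycleStats    // published, cumulative
	future [3]cycleStats // per-cycle counts for cycles C, C+1 and C+2
}

// toyProfiler holds one record plus the global heap profile cycle counter C.
type toyProfiler struct {
	cycle int
	rec   toyRecord
}

// malloc accounts an allocation to cycle C+2.
func (p *toyProfiler) malloc() { p.rec.future[(p.cycle+2)%3].allocs++ }

// sweepFree accounts a GC free to cycle C+1.
func (p *toyProfiler) sweepFree() { p.rec.future[(p.cycle+1)%3].frees++ }

// markTermination increments C, folds the new cycle C into active, and
// clears that slot so it can be reused.
func (p *toyProfiler) markTermination() cycleStats {
	p.cycle++
	slot := &p.rec.future[p.cycle%3]
	p.rec.active.allocs += slot.allocs
	p.rec.active.frees += slot.frees
	*slot = cycleStats{}
	return p.rec.active
}

// simulate runs three GC cycles and returns the published totals after each.
func simulate() [3]cycleStats {
	var p toyProfiler
	var out [3]cycleStats
	for i := 0; i < 3; i++ {
		p.malloc() // three objects allocated before the first mark termination
	}
	out[0] = p.markTermination()
	p.sweepFree() // one of them is swept after the first mark termination
	p.malloc()
	p.malloc()
	out[1] = p.markTermination()
	out[2] = p.markTermination()
	return out
}

func main() {
	fmt.Println(simulate()) // prints [{0 0} {3 1} {5 1}]
}
```

The three early allocations and the free that sweeping finds for them land in the same slot, so they are published together one mark termination later, which is the consistent snapshot the comment describes.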

// memRecordCycle
type memRecordCycle struct {
   allocs, frees           uintptr
   alloc_bytes, free_bytes uintptr
}

// A blockRecord is the bucket data for a bucket of type blockProfile,
// which is used in blocking and mutex profiles.
type blockRecord struct {
   count  float64
   cycles int64
}

writeHeapInternal

When debug != 0, writeHeapInternal calls runtime.ReadMemStats(memStats), which briefly stops the world (STW) to read the memory statistics.

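While allocating memory, the runtime samples allocations according to a sampling policy and records them in mbuckets so that users can analyze them. The sampling algorithm depends on one important setting, MemProfileRate.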

// MemProfileRate controls the fraction of memory allocations 
// that are recorded and reported in the memory profile. 
// The profiler aims to sample an average of 
// one allocation per MemProfileRate bytes allocated.

The default is 512 KB, and users can change it.
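
A short sketch of changing the rate. It sets runtime.MemProfileRate to 1, which samples every allocation and is far too expensive outside a demonstration, then asks runtime.MemProfile how many call stacks have been recorded. The names sink and sampledStacks are made up for the example.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps the allocations reachable so they escape to the heap.
var sink [][]byte

// sampledStacks sets MemProfileRate to 1 (sample every allocation), allocates
// some memory, and returns how many call stacks the memory profile holds.
func sampledStacks() int {
	runtime.MemProfileRate = 1 // expensive; for demonstration only
	for i := 0; i < 100; i++ {
		sink = append(sink, make([]byte, 1024))
	}
	// The published profile can lag by up to two GC cycles, so force two.
	runtime.GC()
	runtime.GC()
	// Passing nil only asks for the number of records.
	n, _ := runtime.MemProfile(nil, true)
	return n
}

func main() {
	fmt.Println("sampled call stacks:", sampledStacks())
}
```

The runtime documentation recommends setting MemProfileRate only once, as early in the program as possible; setting it to 0 turns memory profiling off.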

goroutine

// writeGoroutine writes the current runtime GoroutineProfile to w.
func writeGoroutine(w io.Writer, debug int) error {
   if debug >= 2 {
      return writeGoroutineStacks(w)
   }
   return writeRuntimeProfile(w, debug, "goroutine", runtime_goroutineProfileWithLabels)
}

The goroutine profile also stops the world, via stopTheWorld("profile").

threadcreate

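The implementation of pprof/threadcreate is similar to that of pprof/goroutine: the former walks the global allm list, while the latter walks allgs. The difference is that pprof/threadcreate (ThreadCreateProfile) does not stop the world.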

mutex

Mutex sampling is off by default. Call runtime.SetMutexProfileFraction(int) to set the rate, which turns sampling on or off.

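As with the mbuckets analyzed above, the sampled data here is recorded in xbuckets. Each bucket records the stack that held the lock, the (sampled) count, and related information for users to inspect.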

// A blockRecord is the bucket data for a bucket of type blockProfile,
// which is used in blocking and mutex profiles.
type blockRecord struct {
   count  float64
   cycles int64
}

mutexevent is called from semrelease1.
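
A minimal sketch of turning the mutex profile on and producing one contention event. It relies on a short sleep to let the waiter park, so it is a demonstration rather than a deterministic test; the helper name contendedStacks is made up.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contendedStacks enables mutex profiling, makes a goroutine wait for a held
// lock a few times, and returns the number of stacks in the mutex profile.
func contendedStacks() int {
	prev := runtime.SetMutexProfileFraction(1) // report every contention event
	defer runtime.SetMutexProfileFraction(prev)

	var mu sync.Mutex
	var wg sync.WaitGroup
	for round := 0; round < 3; round++ {
		mu.Lock()
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock() // blocks until the holder unlocks
			mu.Unlock()
		}()
		time.Sleep(20 * time.Millisecond) // give the waiter time to park
		mu.Unlock()                       // the contention is recorded on this release
		wg.Wait()
	}
	return pprof.Lookup("mutex").Count()
}

func main() {
	fmt.Println("contended stacks:", contendedStacks())
}
```

The recorded stack belongs to the goroutine that held the lock and released it, which matches the "holders of contended mutexes" wording of the profile description.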

block

Block events are recorded in chan send/recv, select, semacquire1, and notifyListWait.
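
The block profile is also off until a rate is set. The sketch below passes 1 to runtime.SetBlockProfileRate so that every blocking event is recorded, blocks once on a channel receive, and reads the profile's entry count; the helper name blockedStacks is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// blockedStacks enables block profiling, blocks once on a channel receive,
// and returns the number of stacks in the block profile.
func blockedStacks() int {
	runtime.SetBlockProfileRate(1)       // record every blocking event
	defer runtime.SetBlockProfileRate(0) // turn profiling off again

	ch := make(chan int)
	go func() {
		time.Sleep(20 * time.Millisecond)
		ch <- 1
	}()
	<-ch // blocks for about 20ms on the channel receive
	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Println("blocking stacks:", blockedStacks())
}
```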

cpu

func setProcessCPUProfiler(hz int32) {
   setProcessCPUProfilerTimer(hz)
}

At a fixed interval (the rate), a SIGPROF signal is sent to the thread on which the current g is running.

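Given the GMP model, this combination of timer + signal + current thread, together with the preemptive scheduling now supported, samples the runtime's CPU usage well in the vast majority of cases. Some extreme cases may still go uncovered, because sampling is based only on the current M.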
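
From user code, this machinery is driven by pprof.StartCPUProfile and pprof.StopCPUProfile, which sample at 100 Hz. A minimal sketch that profiles a busy loop into memory; the helper names busy and cpuProfile are made up.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

// busy spins for roughly d so that SIGPROF ticks land in this function.
func busy(d time.Duration) int {
	n := 0
	for start := time.Now(); time.Since(start) < d; {
		n++
	}
	return n
}

// cpuProfile profiles a short busy loop and returns the serialized profile.
func cpuProfile() ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err // for example, profiling is already enabled
	}
	busy(200 * time.Millisecond)
	pprof.StopCPUProfile() // returns only after all writes to buf are done
	return buf.Bytes(), nil
}

func main() {
	data, err := cpuProfile()
	if err != nil {
		panic(err)
	}
	fmt.Printf("cpu profile: %d bytes\n", len(data))
}
```

Writing the bytes to a file gives the same kind of cpu.prof that go test -cpuprofile produces.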

Safety:

pprof can be used in production, but its endpoints must not be exposed directly: operations such as STW carry potential risk.
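
One way to keep the endpoints private, sketched under assumptions: serve the application from its own mux and put the pprof handlers on a second mux that listens only on the loopback interface. The function names, the /hello route and the addresses are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newAppMux builds the public mux. It deliberately avoids http.DefaultServeMux,
// because importing net/http/pprof registers the debug handlers there.
func newAppMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})
	return mux
}

// newDebugMux builds a private mux that carries the pprof endpoints.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// status reports the HTTP status code a mux returns for path.
func status(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	// The public mux does not know the debug path; the private one does.
	fmt.Println(status(newAppMux(), "/debug/pprof/"), status(newDebugMux(), "/debug/pprof/")) // prints 404 200

	// In a real service (addresses are assumptions):
	//   go http.ListenAndServe("127.0.0.1:6060", newDebugMux()) // loopback only
	//   http.ListenAndServe(":8080", newAppMux())               // public
}
```

Access to the loopback port can then go through an SSH tunnel or an authenticated proxy instead of the public listener.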