pprof Implementation

cloud.tencent.com/developer/a...

Profiling

When we want to look closely at how fast our program runs, the best approach is profiling. Profiling is based on automatic sampling during program execution, followed by inference once the run finishes; the resulting statistics are called the profile.

A CPU profile identifies the functions that consume the most CPU time. The thread running on each CPU is interrupted by the operating system every few milliseconds; each interrupt records one profiling sample before normal execution resumes.

A heap profile identifies the statements responsible for the most memory. The profiling library records the runtime's internal allocation calls, triggering a sample on average once per 512 KB of allocated memory.

A blocking profile records the operations that block goroutines the longest, such as system calls, channel sends and receives, and lock acquisition. The profiling library records an event each time a goroutine is blocked by one of these operations.

Each kind of profile can be generated by enabling one of the flags below. Be careful when enabling several at once: profiling one aspect may skew the results of the others.

go test -cpuprofile cpu.prof

go test -memprofile mem.prof

go test -blockprofile block.prof

go test -mutexprofile mutex.prof

Once we have collected the sampling data, we can analyze it with pprof. This tool ships with the Go toolchain as go tool pprof, although it is not an everyday tool. The command has many features and options, but its two most basic arguments are the executable that produced the profile and the profile data itself.

func TestBasic(t *testing.T) {
   for i := 0; i < 200; i++ { // running it many times
      if Fib2(30) != 832040 {
         t.Error("Incorrect!")
      }
   }
}

Run the following in the directory containing the test file:

go test -run TestBasic -cpuprofile cpu.prof

go tool pprof cpu.prof

Then enter top5 -cum:

The topN command shows the top N entries, and the -cum flag sorts them by cumulative time.

Visiting http://host:port/debug/pprof/goroutine?debug=1 makes pprof return a list of all goroutines with their stack traces.

// A Profile is a collection of stack traces showing the call sequences
// that led to instances of a particular event, such as allocation.
// Packages can create and maintain their own profiles; the most common
// use is for tracking resources that must be explicitly closed, such as files
// or network connections.
//
// A Profile's methods can be called from multiple goroutines simultaneously.
//
// Each Profile has a unique name. A few profiles are predefined:
//
// goroutine    - stack traces of all current goroutines
// heap         - a sampling of memory allocations of live objects
// allocs       - a sampling of all past memory allocations
// threadcreate - stack traces that led to the creation of new OS threads
// block        - stack traces that led to blocking on synchronization primitives
// mutex        - stack traces of holders of contended mutexes
// The heap profile tracks both the allocation sites for all live objects in
// the application memory and for all objects allocated since the program start.
// Pprof's -inuse_space, -inuse_objects, -alloc_space, and -alloc_objects
// flags select which to display, defaulting to -inuse_space (live objects,
// scaled by size).
// The allocs profile is the same as the heap profile but changes the default
// pprof display to -alloc_space, the total number of bytes allocated since
// the program began (including garbage-collected bytes).
// writeHeap writes the current runtime heap profile to w.
func writeHeap(w io.Writer, debug int) error {
   return writeHeapInternal(w, debug, "")
}

// writeAlloc writes the current runtime heap profile to w
// with the total allocation space as the default sample type.
func writeAlloc(w io.Writer, debug int) error {
   return writeHeapInternal(w, debug, "alloc_space")
}
var (
   mbuckets  *bucket // memory profile buckets
   bbuckets  *bucket // blocking profile buckets
   xbuckets  *bucket // mutex profile buckets
   ...
   )
// A bucket holds per-call-stack profiling information.
// The representation is a bit sleazy, inherited from C.
// This struct defines the bucket header. It is followed in
// memory by the stack words and then the actual record
// data, either a memRecord or a blockRecord.
//
// Per-call-stack profiling information.
// Lookup by hashing call stack into a linked-list hash table.
//
// No heap pointers.
//
//go:notinheap
type bucket struct {
   next    *bucket
   allnext *bucket
   typ     bucketType // memBucket or blockBucket (includes mutexProfile)
   hash    uintptr
   size    uintptr
   nstk    uintptr
}
type memRecord struct {
   // The following complex 3-stage scheme of stats accumulation
   // is required to obtain a consistent picture of mallocs and frees
   // for some point in time.
   // The problem is that mallocs come in real time, while frees
   // come only after a GC during concurrent sweeping. So if we would
   // naively count them, we would get a skew toward mallocs.
   //
   // Hence, we delay information to get consistent snapshots as
   // of mark termination. Allocations count toward the next mark
   // termination's snapshot, while sweep frees count toward the
   // previous mark termination's snapshot:
   //
   //              MT          MT          MT          MT
   //             .·|         .·|         .·|         .·|
   //          .·˙  |      .·˙  |      .·˙  |      .·˙  |
   //       .·˙     |   .·˙     |   .·˙     |   .·˙     |
   //    .·˙        |.·˙        |.·˙        |.·˙        |
   //
   //       alloc → ▲ ← free
   //               ┠┅┅┅┅┅┅┅┅┅┅┅P
   //       C+2     →    C+1    →  C
   //
   //                   alloc → ▲ ← free
   //                           ┠┅┅┅┅┅┅┅┅┅┅┅P
   //                   C+2     →    C+1    →  C
   //
   // Since we can't publish a consistent snapshot until all of
   // the sweep frees are accounted for, we wait until the next
   // mark termination ("MT" above) to publish the previous mark
   // termination's snapshot ("P" above). To do this, allocation
   // and free events are accounted to *future* heap profile
   // cycles ("C+n" above) and we only publish a cycle once all
   // of the events from that cycle must be done. Specifically:
   //
   // Mallocs are accounted to cycle C+2.
   // Explicit frees are accounted to cycle C+2.
   // GC frees (done during sweeping) are accounted to cycle C+1.
   //
   // After mark termination, we increment the global heap
   // profile cycle counter and accumulate the stats from cycle C
   // into the active profile.

   // active is the currently published profile. A profiling
   // cycle can be accumulated into active once its complete.
   active memRecordCycle

   // future records the profile events we're counting for cycles
   // that have not yet been published. This is ring buffer
   // indexed by the global heap profile cycle C and stores
   // cycles C, C+1, and C+2. Unlike active, these counts are
   // only for a single cycle; they are not cumulative across
   // cycles.
   //
   // We store cycle C here because there's a window between when
   // C becomes the active cycle and when we've flushed it to
   // active.
   future [3]memRecordCycle
}

Some key points:

// Mallocs are accounted to cycle C+2.
// Explicit frees are accounted to cycle C+2.
// GC frees (done during sweeping) are accounted to cycle C+1.

...
// After mark termination, we increment the global heap
// profile cycle counter and accumulate the stats from cycle C
// into the active profile.
// memRecordCycle
type memRecordCycle struct {
   allocs, frees           uintptr
   alloc_bytes, free_bytes uintptr
}
// A blockRecord is the bucket data for a bucket of type blockProfile,
// which is used in blocking and mutex profiles.
type blockRecord struct {
   count  float64
   cycles int64
}

writeHeapInternal

When debug != 0, writeHeapInternal calls runtime.ReadMemStats(memStats), which performs a brief stop-the-world to read memstats.

During memory allocation the runtime samples according to a certain policy and records the samples into mbuckets for users to analyze. The sampling algorithm has one important dependency, MemProfileRate:

// MemProfileRate controls the fraction of memory allocations 
// that are recorded and reported in the memory profile. 
// The profiler aims to sample an average of 
// one allocation per MemProfileRate bytes allocated. 

The default is 512 KB, and users can configure it themselves.

goroutine

// writeGoroutine writes the current runtime GoroutineProfile to w.
func writeGoroutine(w io.Writer, debug int) error {
   if debug >= 2 {
      return writeGoroutineStacks(w)
   }
   return writeRuntimeProfile(w, debug, "goroutine", runtime_goroutineProfileWithLabels)
}

This path also calls stopTheWorld("profile").

threadcreate

The implementation of pprof/threadcreate is similar to that of pprof/goroutine, except that the former iterates over the global allm while the latter iterates over allgs. One difference: pprof/threadcreate => ThreadCreateProfile does not stop the world.

mutex

Mutex sampling is off by default; it is enabled or disabled by configuring the rate via runtime.SetMutexProfileFraction(int).

Similar to the mbuckets analyzed above, the sampling data here is recorded in xbuckets; each bucket records the stack that held the lock, the (sampled) count, and other information for the user to inspect.


mutexevent is called from semrelease1.

block

The block-profile hook is called from chan send/recv, select, semacquire1, and notifyListWait.

cpu

func setProcessCPUProfiler(hz int32) {
   setProcessCPUProfilerTimer(hz)
}

At a fixed interval (rate), a SIGPROF signal is delivered to the thread on which the current g is running.

Thanks to the GMP model, in the vast majority of cases this timer + signal + current-thread approach, together with the preemptive scheduling now supported, samples the whole runtime's CPU profile well. Still, some extreme cases may go uncovered, since the sampling is based only on the current M.

Safety:

pprof can be used in production, but the endpoints must not be exposed directly: operations such as STW are involved, so they are a potential risk.
