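Profiling
When we want to look closely at how fast our program runs, the best approach is profiling. Profiling is based on automatically sampling a number of events during execution and then extrapolating from them in a post-processing step; the resulting statistical summary is called a profile.
A CPU profile identifies the functions that consume the most CPU time. The thread running on each CPU is interrupted by the operating system every few milliseconds; each interruption records one profile event before normal execution resumes.
A heap profile identifies the statements that allocate the most memory. The profiling library samples calls to the internal memory allocation routines so that, on average, one profile event is recorded per 512KB of requested memory.
A blocking profile identifies the operations that block goroutines the longest, such as system calls, channel sends and receives, and lock acquisitions. The profiling library records an event each time a goroutine is blocked by one of these operations.
Enabling one of the flags below is all it takes to generate each kind of profile. Be careful when using several flags at once, because gathering one kind of profile may skew the results of the others.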
go test -cpuprofile cpu.prof
go test -memprofile mem.prof
go test -blockprofile block.prof
go test -mutexprofile mutex.prof
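Once we have gathered a profile, we analyze it with pprof. It ships with the Go toolchain but is not an everyday tool, so it is reached indirectly through the go tool pprof command. The command has many features and options, but basic use takes just two arguments: the executable that produced the profile and the profile itself. Since Go 1.9 the profile embeds symbol information, so passing only the profile file also works.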
func TestBasic(t *testing.T) {
	for i := 0; i < 200; i++ { // running it many times
		if Fib2(30) != 832040 {
			t.Error("Incorrect!")
		}
	}
}
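The test above calls Fib2, which this article does not show. As a minimal sketch, assuming a deliberately naive recursive Fibonacci (the body below is a hypothetical stand-in, not the original implementation), something like this gives the CPU profiler enough work to measure:

```go
package main

import "fmt"

// Fib2 is a hypothetical stand-in for the function under test: a naive
// recursive Fibonacci with Fib2(0) == 0 and Fib2(1) == 1. The exponential
// number of calls is what makes it show up clearly in a CPU profile.
func Fib2(n int) int {
	if n < 2 {
		return n
	}
	return Fib2(n-1) + Fib2(n-2)
}

func main() {
	fmt.Println(Fib2(30))
}
```

In a real package the function would sit next to the _test.go file rather than in package main.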
Run the following from the directory that contains the test file:
go test -run TestBasic -cpuprofile cpu.prof
go tool pprof cpu.prof
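At the (pprof) prompt, enter top5 -cum
The topN command shows the top N entries, and the -cum flag sorts them by cumulative time.
In a program that serves the net/http/pprof handlers, visiting http://host:port/debug/pprof/goroutine?debug=1 returns the stack traces of all goroutines, with identical stacks grouped together.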
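The /debug/pprof/ endpoints only exist once the net/http/pprof handlers are registered. The sketch below registers pprof.Index on a dedicated mux and requests the goroutine profile through an in-memory recorder instead of a real listener; the helper names newDebugMux, goroutineSummary and isGoroutineSummary are made up for this illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"strings"
)

// newDebugMux registers the pprof index handler on its own mux.
// pprof.Index also serves the named profiles below /debug/pprof/,
// such as /debug/pprof/goroutine and /debug/pprof/heap.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	return mux
}

// goroutineSummary requests the goroutine profile in text form (debug=1)
// and returns the first line of the response.
func goroutineSummary() (string, error) {
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/goroutine?debug=1", nil)
	rec := httptest.NewRecorder()
	newDebugMux().ServeHTTP(rec, req)
	if rec.Code != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", rec.Code)
	}
	line, _, _ := strings.Cut(rec.Body.String(), "\n")
	return line, nil
}

// isGoroutineSummary reports whether line looks like the header that the
// text format of the goroutine profile starts with.
func isGoroutineSummary(line string) bool {
	return strings.HasPrefix(line, "goroutine profile: total ")
}

func main() {
	line, err := goroutineSummary()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line)
}
```

In a running service the same mux would be passed to http.ListenAndServe on the host:port used in the URL above.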
// A Profile is a collection of stack traces showing the call sequences
// that led to instances of a particular event, such as allocation.
// Packages can create and maintain their own profiles; the most common
// use is for tracking resources that must be explicitly closed, such as files
// or network connections.
//
// A Profile's methods can be called from multiple goroutines simultaneously.
//
// Each Profile has a unique name. A few profiles are predefined:
//
// goroutine - stack traces of all current goroutines
// heap - a sampling of memory allocations of live objects
// allocs - a sampling of all past memory allocations
// threadcreate - stack traces that led to the creation of new OS threads
// block - stack traces that led to blocking on synchronization primitives
// mutex - stack traces of holders of contended mutexes
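The comment above mentions that packages can maintain their own profiles to track resources that must be closed explicitly. A minimal sketch of that use, assuming a hypothetical conn type and a profile name chosen only for this example:

```go
package main

import (
	"fmt"
	"runtime/pprof"
)

// openConns is a custom profile. The name is an assumption for this
// example; NewProfile panics if the name is already registered.
var openConns = pprof.NewProfile("example.com/openConns")

// conn is a hypothetical resource that must be closed explicitly.
type conn struct{ id int }

// open creates a conn and records the current call stack under it.
// skip=1 starts the recorded stack at the caller of open.
func open(id int) *conn {
	c := &conn{id: id}
	openConns.Add(c, 1)
	return c
}

// Close removes the conn from the profile, so only leaked conns remain.
func (c *conn) Close() { openConns.Remove(c) }

func main() {
	a, b := open(1), open(2)
	fmt.Println("open:", openConns.Count())
	a.Close()
	fmt.Println("open:", openConns.Count())
	b.Close()
}
```

Whatever is still in the profile when it is written out points at the call sites that opened resources and never closed them.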
// The heap profile tracks both the allocation sites for all live objects in
// the application memory and for all objects allocated since the program start.
// Pprof's -inuse_space, -inuse_objects, -alloc_space, and -alloc_objects
// flags select which to display, defaulting to -inuse_space (live objects,
// scaled by size).
// The allocs profile is the same as the heap profile but changes the default
// pprof display to -alloc_space, the total number of bytes allocated since
// the program began (including garbage-collected bytes).
// writeHeap writes the current runtime heap profile to w.
func writeHeap(w io.Writer, debug int) error {
	return writeHeapInternal(w, debug, "")
}

// writeAlloc writes the current runtime heap profile to w
// with the total allocation space as the default sample type.
func writeAlloc(w io.Writer, debug int) error {
	return writeHeapInternal(w, debug, "alloc_space")
}
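writeHeap and writeAlloc are unexported; from user code they are reached through pprof.Lookup("heap") and pprof.Lookup("allocs"). A small sketch, where dumpProfile and isGzip are hypothetical helper names:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// dumpProfile writes the named predefined profile with debug=0, i.e. the
// gzip-compressed protobuf format that go tool pprof reads.
func dumpProfile(name string) ([]byte, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return nil, fmt.Errorf("unknown profile %q", name)
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// isGzip reports whether data starts with the gzip magic bytes.
func isGzip(data []byte) bool {
	return len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b
}

func main() {
	runtime.GC() // bring the published heap statistics up to date first
	for _, name := range []string{"heap", "allocs"} {
		data, err := dumpProfile(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %d bytes, gzip=%v\n", name, len(data), isGzip(data))
	}
}
```

Both dumps carry the same samples; they differ only in which sample type pprof displays by default.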
var (
	mbuckets *bucket // memory profile buckets
	bbuckets *bucket // blocking profile buckets
	xbuckets *bucket // mutex profile buckets
	...
)
// A bucket holds per-call-stack profiling information.
// The representation is a bit sleazy, inherited from C.
// This struct defines the bucket header. It is followed in
// memory by the stack words and then the actual record
// data, either a memRecord or a blockRecord.
//
// Per-call-stack profiling information.
// Lookup by hashing call stack into a linked-list hash table.
//
// No heap pointers.
//
//go:notinheap
type bucket struct {
	next    *bucket
	allnext *bucket
	typ     bucketType // memBucket or blockBucket (includes mutexProfile)
	hash    uintptr
	size    uintptr
	nstk    uintptr
}
type memRecord struct {
// The following complex 3-stage scheme of stats accumulation
// is required to obtain a consistent picture of mallocs and frees
// for some point in time.
// The problem is that mallocs come in real time, while frees
// come only after a GC during concurrent sweeping. So if we would
// naively count them, we would get a skew toward mallocs.
//
// Hence, we delay information to get consistent snapshots as
// of mark termination. Allocations count toward the next mark
// termination's snapshot, while sweep frees count toward the
// previous mark termination's snapshot:
//
//        MT          MT          MT          MT
//       .·|         .·|         .·|         .·|
//    .·˙  |      .·˙  |      .·˙  |      .·˙  |
// .·˙     |   .·˙     |   .·˙     |   .·˙     |
//.·˙      |.·˙        |.·˙        |.·˙        |
//
//       alloc → ▲ ← free
//               ┠┅┅┅┅┅┅┅┅┅┅┅P
//       C+2     →    C+1    →  C
//
//                   alloc → ▲ ← free
//                           ┠┅┅┅┅┅┅┅┅┅┅┅P
//                   C+2     →    C+1    →  C
//
// Since we can't publish a consistent snapshot until all of
// the sweep frees are accounted for, we wait until the next
// mark termination ("MT" above) to publish the previous mark
// termination's snapshot ("P" above). To do this, allocation
// and free events are accounted to *future* heap profile
// cycles ("C+n" above) and we only publish a cycle once all
// of the events from that cycle must be done. Specifically:
//
// Mallocs are accounted to cycle C+2.
// Explicit frees are accounted to cycle C+2.
// GC frees (done during sweeping) are accounted to cycle C+1.
//
// After mark termination, we increment the global heap
// profile cycle counter and accumulate the stats from cycle C
// into the active profile.
// active is the currently published profile. A profiling
// cycle can be accumulated into active once its complete.
active memRecordCycle
// future records the profile events we're counting for cycles
// that have not yet been published. This is ring buffer
// indexed by the global heap profile cycle C and stores
// cycles C, C+1, and C+2. Unlike active, these counts are
// only for a single cycle; they are not cumulative across
// cycles.
//
// We store cycle C here because there's a window between when
// C becomes the active cycle and when we've flushed it to
// active.
future [3]memRecordCycle
}
Some key points:
// Mallocs are accounted to cycle C+2.
// Explicit frees are accounted to cycle C+2.
// GC frees (done during sweeping) are accounted to cycle C+1.
...
// After mark termination, we increment the global heap
// profile cycle counter and accumulate the stats from cycle C
// into the active profile.
// memRecordCycle
type memRecordCycle struct {
	allocs, frees           uintptr
	alloc_bytes, free_bytes uintptr
}
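The C, C+1, C+2 bookkeeping is easier to follow in a toy model. The sketch below is a simplified, single-threaded illustration of the scheme the runtime comment describes, not the runtime's code: the type names are invented, byte counts are omitted, and the cycle increment and the flush are merged into one step even though the runtime performs them separately.

```go
package main

import "fmt"

// cycleStats is a reduced version of memRecordCycle (counts only).
type cycleStats struct{ allocs, frees int }

// toyRecord models memRecord: a published total plus a ring of three
// not-yet-published cycles indexed by the cycle counter.
type toyRecord struct {
	active cycleStats
	future [3]cycleStats
	cycle  int // stands in for the global heap profile cycle C
}

// malloc accounts an allocation to cycle C+2.
func (r *toyRecord) malloc() { r.future[(r.cycle+2)%3].allocs++ }

// gcFree accounts a free found during sweeping to cycle C+1.
func (r *toyRecord) gcFree() { r.future[(r.cycle+1)%3].frees++ }

// markTermination advances C and folds the cycle that is now complete
// into the published profile, then clears its slot for reuse.
func (r *toyRecord) markTermination() {
	r.cycle++
	f := &r.future[r.cycle%3]
	r.active.allocs += f.allocs
	r.active.frees += f.frees
	*f = cycleStats{}
}

func main() {
	var r toyRecord
	r.malloc()          // counted toward cycle 2
	r.markTermination() // C=1: nothing published yet
	fmt.Printf("after 1st MT: %+v\n", r.active)
	r.gcFree()          // sweep after the 1st MT frees it: also cycle 2
	r.markTermination() // C=2: the alloc and its free publish together
	fmt.Printf("after 2nd MT: %+v\n", r.active)
}
```

The point of the delay is visible in the second print: the allocation and the matching sweep free become visible in the same snapshot, so the published profile is not skewed toward mallocs.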
// A blockRecord is the bucket data for a bucket of type blockProfile,
// which is used in blocking and mutex profiles.
type blockRecord struct {
	count  float64
	cycles int64
}
writeHeapInternal
When debug != 0, it calls runtime.ReadMemStats(memStats), which briefly stops the world (STW) to read the memory statistics.
runtime
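During memory allocation the runtime samples according to a fixed policy and records the samples in mbuckets for the user to analyze. The sampling algorithm depends on one key setting, MemProfileRate: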
// MemProfileRate controls the fraction of memory allocations
// that are recorded and reported in the memory profile.
// The profiler aims to sample an average of
// one allocation per MemProfileRate bytes allocated.
The default is 512 KB, and users can change it.
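As a sketch of changing the rate: setting runtime.MemProfileRate to 1 asks the profiler to sample every allocation, which is useful in tests but costly in production. The helper sampledStacks and the variable sink are names invented for this example, and most programs should read the profile through runtime/pprof rather than runtime.MemProfile.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps the buffers reachable so the allocations are not optimized away.
var sink [][]byte

// The rate should be set once, as early as possible in the program.
func init() { runtime.MemProfileRate = 1 }

// sampledStacks allocates n buffers of the given size and returns how many
// distinct allocation sites the memory profile currently holds.
func sampledStacks(n, size int) int {
	for i := 0; i < n; i++ {
		sink = append(sink, make([]byte, size))
	}
	runtime.GC() // the profile reflects the state as of a completed GC
	count, _ := runtime.MemProfile(nil, true)
	return count
}

func main() {
	fmt.Println("rate:", runtime.MemProfileRate)
	fmt.Println("allocation sites recorded:", sampledStacks(16, 1<<20))
}
```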
goroutine
// writeGoroutine writes the current runtime GoroutineProfile to w.
func writeGoroutine(w io.Writer, debug int) error {
	if debug >= 2 {
		return writeGoroutineStacks(w)
	}
	return writeRuntimeProfile(w, debug, "goroutine", runtime_goroutineProfileWithLabels)
}
This path also calls stopTheWorld("profile").
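The two branches of writeGoroutine can be observed from user code through pprof.Lookup("goroutine"). The following sketch compares debug=1 (stacks grouped, with a summary header) and debug=2 (one full stack dump per goroutine); the helper names are made up for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strings"
)

// goroutineDump returns the goroutine profile in the given debug format.
func goroutineDump(debug int) (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, debug); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// isSummary reports whether s starts with the debug=1 header line.
func isSummary(s string) bool {
	return strings.HasPrefix(s, "goroutine profile: total ")
}

// isStackDump reports whether s looks like the debug=2 output, which
// begins with the calling goroutine in the running state.
func isStackDump(s string) bool {
	first, _, _ := strings.Cut(s, "\n")
	return strings.HasPrefix(first, "goroutine ") && strings.Contains(first, "[running]")
}

func main() {
	for _, debug := range []int{1, 2} {
		out, err := goroutineDump(debug)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		first, _, _ := strings.Cut(out, "\n")
		fmt.Printf("debug=%d: %s\n", debug, first)
	}
}
```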
threadcreate
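The implementation of pprof/threadcreate is similar to that of pprof/goroutine, except that the former iterates over the global allm while the latter iterates over allgs. The other difference is that pprof/threadcreate => ThreadCreateProfile does not stop the world (STW).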
mutex
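Mutex sampling is off by default; runtime.SetMutexProfileFraction(int) sets the rate and thereby turns it on or off.
As with the mbuckets analyzed above, the sampled data is recorded in xbuckets. Each bucket records the stack of the lock holder, the (sampled) count, and other information for the user to inspect.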
// A blockRecord is the bucket data for a bucket of type blockProfile,
// which is used in blocking and mutex profiles.
type blockRecord struct {
	count  float64
	cycles int64
}
mutexevent is called from semrelease1.
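A minimal sketch of switching mutex sampling on around a piece of contended code. The helpers contend, currentFraction and withMutexProfiling are invented for this example; a fraction of 1 reports every contention event, and passing a negative value only reads the current setting.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// contend holds a mutex while a second goroutine waits for it, so the
// unlock releases a waiter and can be attributed as a contention event.
func contend() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	mu.Lock()
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock() // waits until the holder unlocks
		mu.Unlock()
	}()
	time.Sleep(10 * time.Millisecond) // give the waiter time to block
	mu.Unlock()
	wg.Wait()
}

// currentFraction reads the sampling fraction without changing it.
func currentFraction() int { return runtime.SetMutexProfileFraction(-1) }

// withMutexProfiling samples every contention event while fn runs,
// restores the previous fraction, and returns the number of stacks
// in the mutex profile afterwards.
func withMutexProfiling(fn func()) int {
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)
	fn()
	return pprof.Lookup("mutex").Count()
}

func main() {
	fmt.Println("fraction before:", currentFraction())
	fmt.Println("stacks in mutex profile:", withMutexProfiling(contend))
	fmt.Println("fraction after:", currentFraction())
}
```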
block
blockevent is called from chan send/recv, select, semacquire1, and notifyListWait.
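Block profiling is also off until a rate is set with runtime.SetBlockProfileRate. The sketch below, with the hypothetical helper blockedStacks, blocks once on a channel receive and then counts the stacks recorded in the block profile; a rate of 1 asks the runtime to record every blocking event.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// blockedStacks enables block profiling, blocks on a channel receive for
// about 10ms, and returns the number of stacks in the block profile.
func blockedStacks() int {
	runtime.SetBlockProfileRate(1)       // record every blocking event
	defer runtime.SetBlockProfileRate(0) // turn profiling off again

	ch := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(ch)
	}()
	<-ch // the chan recv path is one of the callers listed above

	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Println("stacks in block profile:", blockedStacks())
}
```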
cpu
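func setProcessCPUProfiler(hz int32) {
	setProcessCPUProfilerTimer(hz)
}
At a fixed interval (the rate), a SIGPROF signal is sent to the thread on which the current g is running.
Given the GMP model, in the vast majority of cases this combination of timer + signal + current thread, together with the preemptive scheduling that is now supported, samples the CPU profile of the whole runtime well. Still, some extreme cases may not be covered, since sampling is based only on the current M.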
Safety:
pprof can be used in production, but its endpoints must not be exposed directly: operations such as STW make it a potential risk.