Cumulative Run Time Silos in Go `pprof` CPU Profiles
The Go `pprof` CPU profiling tool lists both `flat` and `cum` (cumulative) run times and percentages for routines. The `flat` numbers refer to time spent in the routine itself; the `cum` numbers refer to time spent in the routine and all of its callees. This article discusses how to interpret the cumulative run-time information in the context of different phases of the execution of a Go program.
When the Go pprof profiling interrupt occurs, the profiler walks the currently active stack and records all of the routines that appear in the call chain. This allows the profiler to correctly charge cumulative run-time to each caller in the chain. At the bottom of the stack is a stack frame that records the first caller in the chain. These bottom-most frames do not record calls from user code. Instead, only a few selected system routines will ever appear at the bottom of the call stack, and together these few routines will record 100% of the cumulative run time of the application.
Thus the total run time is partitioned based on which routine is found at the bottom of the stack. In the following, the term *silo* is used to refer to these partitions since, as we discuss below, the different silos record more-or-less independent parts of the execution of the application.
Note: Recent results from profiling test applications suggest that the information contained in this article may not be fully complete or correct; however, we have found the approach described here very useful in interpreting Hyperledger Fabric profiles. We will update the article as we learn more.
These are the routines that may appear at the bottom of the call stack, and which parts of the execution they register. Note that this list was derived from empirical observation of profiles and reading the Go source code, and may not be exhaustive.
- `runtime.goexit`: Go goroutines start with a stack context that makes it appear that the routine was called from `runtime.goexit`. This way, when the goroutine "returns to its caller" the goroutine is cleaned up and terminated. Therefore the cumulative run time for `runtime.goexit` records all of the CPU time spent in goroutines, including the `main` program, which is also executed as a goroutine.
- `runtime.mcall`: This routine silos code that runs on a system stack associated with an operating system thread, outside the context of any goroutine. In our profiles almost all of the `runtime.mcall` silo is associated with `runtime.schedule`, the goroutine scheduler.
- `runtime.morestack`: The sizes of goroutine stacks are dynamic, and Go implements stack-capacity checking on most function calls. If a goroutine needs more stack space than its current allocation, `runtime.morestack` is the routine that arranges this. While this allocation is occurring the stack is modified such that `runtime.morestack` effectively occupies the bottom frame. This is useful in that the cumulative time for `runtime.morestack` records the total overhead of dynamic stack allocation; it is unfortunate, though, in that we have no record of which routines caused the stack growth.
- `runtime.mstart`: This routine is used to start goroutines. It also appears to be used as a bottom-marker in the stack during certain stack-manipulation operations. We typically see only a small amount of system overhead charged to `runtime.mstart`. [Needs further study.]
- `runtime._System`: This is a dummy routine that exists only for profile accounting. Time is charged to `runtime._System` when a stack traceback "is impossible or has failed" for some reason. We typically see a small amount of system overhead accounted here.
- `runtime._GC`: Similar to `runtime._System`, this dummy routine is charged with time when a traceback is not possible but the system appears to be in the midst of garbage collection.
Knowledge of the way that CPU run time is partitioned can help focus attention on different parts of the profile. For example, `go tool pprof` includes the `-focus` option, which can be used to restrict the profile output to only that portion of the run time attributable to an individual silo. As documented, `-focus` "restricts to paths going through the node...". So, for example, since `runtime.goexit` is always a terminal node, focusing on `runtime.goexit` reports on the goroutine silo.
The first example creates a focused list on the terminal, and the second example reports the same information as a call-graph PDF.
```
go tool pprof -top -focus=runtime.goexit pprof.XXX.samples.cpu.NNN.pb.gz
go tool pprof -pdf -focus=runtime.goexit pprof.XXX.samples.cpu.NNN.pb.gz > prof.pdf
```
Similarly, if one wanted to focus on all of the system overhead not associated with goroutines, one could use `-ignore=runtime.goexit`:
```
go tool pprof -top -ignore=runtime.goexit pprof.XXX.samples.cpu.NNN.pb.gz
```
The `runtime.morestack` partition is self-contained. High cumulative percentages recorded under `runtime.morestack` suggest that the default goroutine stack size may be too small. Modifying the default stack size currently requires modifying the source code of the Go runtime package and rebuilding it. As luck would have it, instructions for doing this have been documented.