From 842261dcca2574d70fcb582fa62f753452188017 Mon Sep 17 00:00:00 2001
From: Changkun Ou
Date: Wed, 27 Jan 2021 09:31:31 +0100
Subject: [PATCH] content: proofread

---
 content/posts/zero-alloc-call-sched.md | 305 +++++++++++++------------
 1 file changed, 156 insertions(+), 149 deletions(-)

diff --git a/content/posts/zero-alloc-call-sched.md b/content/posts/zero-alloc-call-sched.md
index cd925b2..7f7db3a 100644
--- a/content/posts/zero-alloc-call-sched.md
+++ b/content/posts/zero-alloc-call-sched.md
@@ -16,56 +16,58 @@ title: Scheduling Function Calls with Zero Allocation

Author(s): [Changkun Ou](https://changkun.de)

GUI programming in Go is a little bit tricky. The infamous issue
-regarding interacting with the legacy GUI frameworks is that most of
-the graphics related APIs must be called from the main thread.
-This basically violates the concurrent nature of Go: A goroutine may be
+regarding interacting with legacy GUI frameworks is that
+most graphics-related APIs must be called from the main thread.
+The issue violates the concurrent nature of Go: A goroutine may be
arbitrarily and randomly scheduled or rescheduled on different running
-threads, i.e., the same pice of code will be called from different
+threads, i.e., the same piece of code will be called from different
threads over time, even without involving the `go` keyword.

## Background

-In multi-threaded programming, operating systems provide space,
+In multithreaded programming, operating systems provide space,
the so-called Thread Local Storage (TLS) for each thread of a process
-to store their private and local content. In the era where multithreaded
-programming and scheduling algorithms are not rich enough,
-the TLS feature was very useful to avoid data race since this storage is
+to store their private and local content. In the era where
+multithreaded programming and scheduling algorithms were not rich enough,
+the TLS feature was handy to avoid data races since this storage is
purely local and guaranteed by the operating system.

-For example, a graphics rendering backend such as OpenGL Context was
-designed to store the rendering context of each thread on TLS;
-In macOS, the famous GUI framework Cocoa also requires rendering user
-interfaces on a specific thread, that is the so-called *main thread*.
+For example, a graphics rendering backend such as OpenGL Context
+was designed to store each thread's rendering context on TLS;
+in macOS, the famous GUI framework Cocoa also requires rendering
+user interfaces on a specific thread,
+which is the so-called *main thread*.

## The Main Thread

-In Go, as we all know that a goroutine will be scheduled on different
+In Go, as we know, a goroutine will be scheduled to different
threads due to its internal work-stealing scheduler [^work-steal] [^go11sched].
-With work-tealing scheduler, goroutines are not promised to run on a specific
-thread forever. Instead, whenever a goroutine goes to sleep, or endering a
-system call, or the Go's runtime proactively interrupts the execution of
-that goroutine, it is likely to be rescheduled to a different thread.
-Therefore, if a rendering (OpenGL) context is stored on the old thread,
-after switching to a new thread will cause the lose of that old context, too.
-**Because such an interruption can happen at anytime and anywhere,
+With a work-stealing scheduler, goroutines are not promised to run
+on a specific thread forever. Instead, whenever a goroutine goes to
+sleep or enters a system call, or the Go runtime proactively
+interrupts the execution of that goroutine, it is likely to
+be rescheduled to a different thread. Therefore, if an (OpenGL)
+rendering context is stored on the old thread, switching to a new thread
+will cause the loss of the old context.
+**Because such an interruption can happen at any time and anywhere,
it is impossible to check if the goroutine remains on the same thread
when the execution resumes.**
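+To make this migration visible, consider a tiny, Linux-only sketch
+(our own illustration, not part of the original post; `syscall.Gettid`
+only exists on Linux) that prints the ID of the OS thread the
+goroutine currently occupies:
+
+```go
+package main
+
+import (
+	"fmt"
+	"runtime"
+	"syscall"
+)
+
+func main() {
+	for i := 0; i < 5; i++ {
+		// Gettid reports the current OS thread; after a yield, the
+		// scheduler is free to resume this goroutine elsewhere.
+		fmt.Println("iteration", i, "thread", syscall.Gettid())
+		runtime.Gosched()
+	}
+}
+```
+
+Whether the printed IDs actually change is entirely up to the
+scheduler; the point is that nothing guarantees they stay the same.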
-The original intntion of designing such a scheduler is to eliminate
-the concept of system thread and multiplexing it. In this way, users won't
-suffer from the paying the cost of threads switch/sleep whereas threads
-always in its full power mode that constantly running tasks either from
-user or the runtime.
+The original intention of designing such a scheduler is to eliminate
+the concept of the system thread by multiplexing it. In this way, users
+do not pay the cost of thread switches and sleeps, whereas threads
+always stay at full power, continuously running tasks either
+from the user or from the runtime.

### Method `runtime.LockOSThread` and Package `mainthread`

If GUI applications must interact with the OS on the main thread,
-how can we achieve the goal where we want run a specific thread perminantly?
+how can we achieve the goal of running code permanently on a specific thread?
Luckily, there is a method called `LockOSThread` offered from the
-`runtime` package, provides the exact feature we want:
+`runtime` package, which provides exactly the feature we want:

```go
// LockOSThread wires the calling goroutine to its current operating system thread.
@@ -85,14 +87,14 @@
func LockOSThread()
```

-As the document of `LockOSThread` says: All `init` functions are run on
+As the `LockOSThread` document says: All `init` functions run on
the startup thread. Calling `LockOSThread` from an `init` function will
cause the main function to be invoked on that thread.
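+Concretely, the quoted behavior suggests the following minimal
+pattern (a sketch of our own, mirroring the documentation above):
+
+```go
+package main
+
+import "runtime"
+
+func init() {
+	// Runs on the startup thread before main starts, so the main
+	// function is guaranteed to be invoked on that same thread.
+	runtime.LockOSThread()
+}
+
+func main() {
+	// Everything here now stays on the main OS thread until
+	// runtime.UnlockOSThread is called.
+}
+```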
-If you think about that carefully, you will immediately realize this gives
-us the opportunity to identify, at least, the main thread.
-When we would like to wrapping thread scheduling as a package `mainthread`,
-we can do something like the following:
+If we think about that carefully, we will immediately realize this
+allows us to identify, at least, the main thread. When we would like to
+wrap thread scheduling as a package `mainthread`, we can do
+something like the following:

```go {linenos=inline,hl_lines=[13,16],linenostart=1}
package mainthread // import "x/mainthread"
@@ -135,15 +137,14 @@ func gn() {
}
```

-Once we solved the problem of API design, the next question is:
-How can we implement the `Init` and `Call`?
+Once we have solved the API design, the next question is: How can we
+implement `Init` and `Call`?

-Well, it is not that difficult. Recall that we use `Init` method
-to obtain the full control of the main thread, then we should never and
-ever to give up such a power. Thus, creating another goroutine to run
-what we initially wants to run, and use a channel to receive
-the calls that we would like to schedule on the main thread
-becomes our only option:
+Well, it is not that difficult. Recall that we use the `Init` method to
+obtain full control of the main thread, so we should never give up
+such power. Thus, creating another goroutine to run what we initially
+want to run and using a channel to receive the calls that we would like
+to schedule on the main thread becomes our only option:

```go
// funcQ is a global channel that is responsible for receiving function
@@ -186,9 +187,10 @@ func Call(f func()) {
}
```

-> Note that we use empty struct as our channel signal, if you are not
-> familiar with empty struct and channels, you might want read two great
-> post from Dave Cheney [^empty-struct] [^curious-channels].
+> Note that we use an empty struct as our channel signal;
+> if you are not familiar with empty structs and channels,
+> you might want to read two great posts from
+> Dave Cheney [^empty-struct] [^curious-channels].

To use such a package, one can use `mainthread.Call` to schedule
a call to be executed on the main thread:

@@ -241,8 +243,8 @@ func Terminate() {
}
```

-and make sure critical calls like `glfw.WaitEventsTimeout` inside
-the rendering loop always be executed from the main thread:
+Furthermore, make sure critical calls like `glfw.WaitEventsTimeout`
+inside the rendering loop are always executed from the main thread:

```go
package app // import "x/app"
@@ -285,9 +287,9 @@ func (w *Win) Run() {
}
```

-As a user of `app` package, can get rid of the understanding
-and thought overhead about when and how do we call a function
-on the main thread:
+As a user of the `app` package, we can get rid of the understanding
+overhead regarding when and how we should call a function
+on the main thread:

```go
package main
@@ -317,21 +319,23 @@ func fn() {
}

![](../assets/zero-alloc-call-sched/app.png)

-Now, we have an empty solid window and will never be crashed randomly 😄.
+Now, we have an empty solid window that will never crash randomly 😄.

## Cost Analysis and Optimization
+
After implementing a first iteration of the `mainthread` package,
-we might directly wonder about the performance of this package,
-questions could be:
+we might directly wonder about this package's performance.
+A question could be:

-_If a function is sent from a thread to the main thread, what's the
-latency when calling such a function?_
+_What is the latency when calling such a function if it is transmitted
+from a thread to the main thread?_

-Let's write a few benchmark tests that can measure the performance of
-such a call. The idea is very simple, we need a baseline to identify
-the initial cost of calling a function, then measure the completion
-time when we schedule the same function call on the main thread:
+Let us write a few benchmark tests that can measure the performance of
+such a call. The idea is straightforward: we need a baseline to
+identify the initial cost of calling a function, then measure
+the completion time when we schedule the same function call
+on the main thread:

```go
var f = func() {}
@@ -357,11 +361,12 @@ func BenchmarkMainThreadCall(b *testing.B) {
}
```
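+As a quick sanity check before reaching for dedicated tooling, the
+benchmarks above can also be run with the standard toolchain (these
+are ordinary `go test` flags):
+
+```shell
+$ go test -run=^$ -bench=. -benchmem
+```
+
+The `-benchmem` flag enables the `B/op` and `allocs/op` columns that
+the following analysis relies on.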
-Be careful with micro benchmarks here, as we discussed in our previous
-research [^bench-time], let's use the `bench`
-tool [^bench-tool] for benchmarking. `bench` is tool for executing
-Go benchmarks reliably. It automatically locks machine's performance
-and execute benchmarks 10x by default to eliminate system measurement error:
+Be careful with micro-benchmarks here: referring to our previous
+discussion about the time measurement of benchmarks [^bench-time],
+let us use the `bench` tool [^bench-tool]. `bench` is a tool
+for executing Go benchmarks reliably; it automatically locks
+the machine's performance and executes benchmarks 10x by default
+to eliminate systematic measurement error:

```
$ bench
@@ -383,50 +388,53 @@ DirectCall-8 0.00
MainThreadCall-8 2.00 ±0%
```

-The benchmark result indicates that calling an empty function directly in Go
-will `1ns` whereas schedule the same empty function to the main thread
-will spend `448ns`. Thus the cost is `447ns`.
+The benchmark result indicates that calling an empty function directly
+in Go takes `1ns`, whereas scheduling the same empty function to
+the main thread takes `448ns`. Thus, the cost of scheduling is `447ns`.

-Moreover, when we talk about cost,
-we actually care about the cost of CPU as well as memory consumption.
-According to the second report regarding `allocs/op`, the result shows
-scheduling an empty function to the mainthread will cost `120B` allocation.
+Moreover, when we talk about cost, we care about the cost of CPU and
+memory consumption. According to the second report regarding `allocs/op`,
+the result shows that scheduling an empty function to the main thread
+costs `120B` of allocation.

-Allocation of `120B` per operation might not be a big deal from our first impression.
-However, if we consider the actual use case of this package, i.e. managing GUI rendering
-calls, either CPU or memory allocation can be propagated to a huge cost over time.
-If we are dealing with rendering, especially graphical rendering, the fresh rate
-is typically minimum 25fps, ideally 30fps or even higher.
+Allocation of `120B` per operation might not be a big deal from
+our first impression. However, if we consider the actual use case of
+this package, i.e., managing GUI rendering calls, either CPU or
+memory allocation can be propagated to a considerable cost over time.
+If we are dealing with rendering, especially graphical rendering,
+the refresh rate is typically a minimum of 25fps, ideally 30fps or even higher.

-This means, every 5 minutes, without considering mouse button, movements,
-and keystrokes, a GUI application will allocate at least:
+That means, for every 5 minutes, without considering mouse buttons,
+movements, and keystrokes, a GUI application will allocate at least:

$$ 5 \times 60 \times 30 \times 120 \,\text{bytes} \approx 1.08 \,\text{MiB} $$

-A directly impact from an excessive allocation behavior is the runtime garbage
-collector and the scavenger. With higher allocation rate, the garbage collector
-is triggered more often, and the scavenger releases memory to the OS more often.
-Because of more works are produced for the GC, the GC will also consume more
-CPU from the system. It is good enough to say the entire application is
-a vicious circle.
+A direct impact of excessive allocation behavior is on the runtime
+garbage collector and the scavenger. With a higher allocation rate,
+the garbage collector is triggered more often, and the scavenger
+releases memory to the OS more often. Because more work is produced
+for the GC, the GC also consumes more CPU from the system.
+It is fair to say the entire application runs in a vicious circle.
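+A trace like the one shown below can be collected with the standard
+`runtime/trace` package and inspected via `go tool trace`. The
+following is a minimal sketch of our own (the `run` function stands
+in for the actual application loop):
+
+```go
+package main
+
+import (
+	"os"
+	"runtime/trace"
+)
+
+func main() {
+	f, err := os.Create("app.trace")
+	if err != nil {
+		panic(err)
+	}
+	defer f.Close()
+
+	// Record scheduler, heap, and GC events while the app runs.
+	if err := trace.Start(f); err != nil {
+		panic(err)
+	}
+	defer trace.Stop()
+
+	run()
+}
+
+// run stands in for the application workload, e.g. the rendering loop.
+func run() {}
+```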
-The following is a trace information of that above application runs in 6 minutes, the total heap allocation is actually 1.41 MiB (2113536-630784 byte), preety close to what we predicted before.
+The following is the trace information of the above application running
+for 6 minutes; the total heap allocation is 1.41 MiB
+(2113536-630784 bytes), pretty close to what we predicted before.

![](./../assets/zero-alloc-call-sched/naive-sched-trace-1.png)
![](./../assets/zero-alloc-call-sched/naive-sched-trace-2.png)

-Where does the allocation come from?
+Where does the allocation occur?
How can we deal with these issues?
-How to optimize the exisiting naive implementation?
-Let's find out in the next section.
+How can we optimize the existing naive implementation?
+Let us find out in the next section.

## Optimal Threading Control

-The first optimization comes to the attempt of avoid allocating
-channels. In our `Call` implementation, we allocate a signal channel
-for every function that we need to call from the main thread:
+The first optimization is the attempt to avoid allocating channels.
+In our `Call` implementation, we allocate a signal channel for
+every function that we need to call from the main thread:

```go {linenos=inline,hl_lines=[3],linenostart=1}
// Call calls f on the main thread and blocks until f finishes.
@@ -440,9 +448,9 @@ func Call(f func()) {
}
```

-This means everytime when we call the `Call` method, we will have to
-allocate at least 96 bytes for a channel due to the Go compiler will
-uses `runtime.hchan` as the struct that represents the actual channel:
+Thus, whenever we call the `Call` method, we have to allocate
+at least 96 bytes for a channel, because the Go compiler uses
+`runtime.hchan` as the struct that represents the channel under the hood:

```go
// in src/runtime/chan.go
@@ -463,7 +471,7 @@ type hchan struct {
}
```

-A well known trick to avoid repetitive allocation is to use
+A well-known trick to avoid repetitive allocation is to use
the `sync.Pool`. One can:

```go {linenos=inline,hl_lines=["1-3", 6, 7],linenostart=1}
@@ -483,7 +491,7 @@ func Call(f func()) {
}
```

-With that simple optimization, a rebenchmarked result indicates
+With that simple optimization, a re-benchmarked result indicates
an 80% reduction in memory usage:

```txt {linenos=inline,hl_lines=[3,7,11],linenostart=1}
@@ -500,24 +508,25 @@ DirectCall-8 0.00 0.00 ~ (all equal)
MainThreadCall-8 2.00 ±0% 1.00 ±0% -50.00% (p=0.000 n=10+10)
```

-Can we do it even better? The answer is yes. As you can notice that
-there is still a 24B of allocation per operation. But to identify it becomes
-a little bit tricky.
+Can we do it even better? The answer is yes. One can notice that
+there are still 24B of allocation per operation. However, identifying
+them becomes somewhat tricky.

In Go, variables can be allocated on the heap if:

1. Using `make` and `new` keywords explicitly, or
2. Escaping from the stack

-The second case is a little bit advance from the normal use of Go. To be short,
-escape from the execution stack to the heap is something that decided from
-compile time. The Go's compiler will decide when should a vaiable be allocated
-on the heap. The process of deciding allocate variables either on the stack or
-the heap is called _escape analysis_.
+The second case is a little more advanced than the regular use of Go.
+In short, escaping from the execution stack to the heap is decided
+at compile time: the Go compiler decides when a variable should
+be allocated on the heap. The process of deciding whether variables live
+on the stack or the heap is called _escape analysis_.
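+As a standalone illustration (a toy example of our own, not the
+`mainthread` package itself), the following program exercises both
+cases; building it with `-gcflags="-m"` should report that the
+function literal escapes to the heap while `x` stays on the stack:
+
+```go
+package main
+
+var ch = make(chan func(), 1)
+
+// local does not escape: x never leaves the stack frame.
+func local() int {
+	x := 42
+	return x
+}
+
+// leak escapes: the function literal is a pointer, and sending it
+// through a channel forces the compiler to allocate it on the heap.
+func leak() {
+	f := func() {}
+	ch <- f
+}
+
+func main() {
+	local()
+	leak()
+	<-ch
+}
+```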
-The great thing about the Go is that this information is trackable and can be
-enabled directly from the Go toolchain. One can use `-gcflags="-m"` to activate
-the escape analysis and see the result from the compile time:
+The great thing about Go is that this information is trackable and
+can be enabled directly from the Go toolchain.
+One can use `-gcflags="-m"` to activate the escape analysis and
+see the result at compile time:

```shell
$ go build -gcflags="-m"
@@ -526,14 +535,15 @@ $ go build -gcflags="-m"
./mainthread.go:52:11: func literal escapes to heap
```

-The compiler shows us that the sending function is leaking and the wrapper
-function that sends via our `funcQ` is causing the func literal escaping to the
-heap. The reason that func literal escapes to the heap is because a func literal
-is considered as a pointer, and sending a pointer via channel will always
+The compiler shows us that the sending function is leaking,
+and the wrapper function that sends via our `funcQ` is causing
+the function literal to escape to the heap.
+The function literal escapes because a function literal
+is considered a pointer, and sending a pointer via a channel will always
cause an escape by design.

-To avoid the escaping fucntion literal, instead of using a function wrapper,
-we can send a struct:
+To avoid the escaping function literal, instead of using
+a function wrapper, we can send a struct:

```go {linenos=inline,hl_lines=["1-4", 10],linenostart=1}
type funcdata struct {
@@ -550,7 +560,7 @@ func Call(f func()) {
}
```

-and when we receives the `funcdata`:
+and when we receive the `funcdata`:

```go {linenos=inline,hl_lines=["6-8"],linenostart=1}
func Init(main func()) {
@@ -589,11 +599,11 @@ Hooray! 🎉

## Verification and Discussion

-Before we conclude our research, let's do a final verification on the real world
-example that we had before: the GUI application.
+Before we conclude this research, let us do a final verification on
+the real-world example that we had before: the GUI application.

-While a re-evaluation, we can see from the trace file that entire application
-is still allocating memory and the heap is still increasing:
+Upon re-evaluation, we can see from the trace file that the entire
+application is still allocating memory and the heap is still increasing:

![](../assets/zero-alloc-call-sched/opt-sched-trace.png)

only allocates:

$$ 958464 - 622592 = 335872 \,\text{bytes} \approx 0.32 \,\text{MiB} $$

-Comparing to the previous 1.41 MiB allocation, we optimized 1.08 MiB of memory
-which is exactly what was predicted before.
+Compared to the previous 1.41 MiB allocation, we optimized 1.08 MiB of
+memory, which closely matches what we predicted before.

-We might still wondering, if scheduling is not allocating memory anymore,
-who is still allocating the memory? To find out, we need a little bit help
-from the `runtime` package. The compiler translates the allocation operation
-to a runtime function `runtime.newobject`. One can add three more lines
-and prints who is exactly calling this function using `runtime.FuncForPC`:
+We might still wonder: if scheduling is not allocating memory anymore,
+who is still allocating the memory? To find out, we need a little bit of
+help from the `runtime` package. The compiler translates the allocation
+operation to a runtime function, `runtime.newobject`. One can add three
+more lines that print who exactly is calling this function,
+using `runtime.FuncForPC`:

```go {linenos=inline,hl_lines=["3-5"],linenostart=1}
// src/runtime/malloc.go
@@ -621,9 +632,9 @@ func newobject(typ *_type) unsafe.Pointer {
}
```

-In above, the `getcallerpc` is a runtime private helper.
-If we execute the application again, we will see printed information similar
-to below:
+In the above, `getcallerpc` is a runtime-private helper.
+If we execute the application again, we will see printed information
+similar to the following:

```
88 runtime.acquireSudog /Users/changkun/dev/godev/go-github/src/runtime/proc.go 375
@@ -632,7 +643,7 @@ to below:
...
```

-This demonstrates how and why the allocation still happens:
+It demonstrates how and why the allocation still happens:

```go {linenos=inline,hl_lines=[23],linenostart=1}
// ch <- elem
@@ -664,30 +675,28 @@ func acquireSudog() *sudog {
}
```

-Unfortunately, this is completely outside the control of the userland.
+Unfortunately, this is entirely outside the control of userland code.
We are not able to optimize here anymore.
Nevertheless, we have reached our goal for today,
and this is the best of what we can do so far.

-One more thing, if you take a closer look into how much the heap grows
-for one step, you will get some calculation like this: 671744-663552=8192
-This is actually the minimum allocation size of the runtime allocator,
-which allocates a _page_. Due to the discussion of this
-topic if much more out from our goal in this research, we leave that as
-your future outlook to dig more on your own, there is a great blog post
-[^mem-alloc] as your starting point.
+One more thing: if we take a closer look at how much the heap grows
+in one step, we will get a calculation like this: 671744 - 663552 = 8192.
+The result is, in fact, the minimum allocation size of the runtime
+allocator, which allocates a _page_ at a time. Since the discussion of
+such a topic deviates from this research's goal, we leave it as a future
+outlook; there is a great blog post [^mem-alloc] to start from.

## Conclusion

In this research, we covered the following topics:

-1. Go's runtime scheduler
-2. Scheduling on a specific thread, especially the main thread
-3. Reliable benchmarking and allocations tracing techniques
-4. Go's runtime memory allocator
-5. Go's runtime garbage collector
+1. The Go runtime scheduler
+2. The Go runtime memory allocator
+3. The Go runtime garbage collector
+4. Scheduling on a specific thread, especially the main thread
+5. Reliable benchmarking and allocation tracing techniques
6. Escape analysis
-7. Go's channel implementation
+7. The channel implementation in Go

There are several points we can summarize:

2. A function literal allocates 24 bytes of memory
3. Escape analysis can help us identify unexpected allocations, and a function literal is considered a pointer that always escapes to the heap
4. Sending information via a channel can cause allocation intrinsically from the runtime.
-5. Go runtime grows the the heap 8K on each step as page allocation
+5. The Go runtime grows the heap by 8K on each step as a page allocation

-Beyond this research, we also encapsulated all the abstractions derived from this
-research, and we published two packages: `mainthread`[^mainthread] and `thread`[^thread].
-These packages gives you the ability to schedule any function calls
-either on the main thread, or a specific thread.
+We also encapsulated all the abstractions from this research and
+published two packages: `mainthread`[^mainthread] and `thread`[^thread].
+These packages allow us to schedule any function calls either on
+the main thread or a specific thread.
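+For example, assuming the published `mainthread` package keeps the
+same `Init`/`Call` API that we developed in this post, using it looks
+like the following sketch:
+
+```go
+package main
+
+import "golang.design/x/mainthread"
+
+func main() {
+	// Hand the main thread over to the package; fn runs in a goroutine.
+	mainthread.Init(fn)
+}
+
+func fn() {
+	// Schedule a call onto the main OS thread and wait for it.
+	mainthread.Call(func() {
+		println("hello from the main thread")
+	})
+}
+```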
+Furthermore, we submitted a pull request to the Fyne project [^fyne],
+which could reduce a considerable amount of memory allocation in
+existing real-world GUI applications.

Have fun!

@@ -714,6 +720,7 @@ Have fun!
[^bench-tool]: Changkun Ou. "bench: Reliable performance measurement for Go programs. All in one design." https://golang.design/s/bench
[^empty-struct]: Dave Cheney. "The empty struct." March 25, 2014. https://dave.cheney.net/2014/03/25/the-empty-struct
[^curious-channels]: Dave Cheney. "Curious Channels." April 30, 2013. https://dave.cheney.net/2013/04/30/curious-channels
-[^mem-alloc]: Dave Cheney. "A few bytes here, a few there, pretty soon you’re talking real memory." Jan 05, 2021. https://dave.cheney.net/2021/01/05/a-few-bytes-here-a-few-there-pretty-soon-youre-talking-real-memory
+[^mem-alloc]: Dave Cheney. "A few bytes here, a few there, pretty soon you're talking real memory." Jan 05, 2021. https://dave.cheney.net/2021/01/05/a-few-bytes-here-a-few-there-pretty-soon-youre-talking-real-memory
[^mainthread]: Changkun Ou. "Package golang.design/x/mainthread." https://golang.design/s/mainthread
-[^thread]: Changkun Ou. "Package golang.design/x/thread." https://golang.design/s/thread
\ No newline at end of file
+[^thread]: Changkun Ou. "Package golang.design/x/thread." https://golang.design/s/thread
+[^fyne]: Changkun Ou. "Optimize the cost of calling on the main/draw threads." Jan 20, 2021. https://github.com/fyne-io/fyne/pull/1837
\ No newline at end of file