article: add pointer-params

golang-design · Oct 27, 2020 · 6a33110
1 parent e1e0e18
commit 6a33110
Showing 2 changed files with 235 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -4,6 +4,7 @@
 
 ## Table of Content
 
+- [Changkun Ou. Pointer Type May Not Be Ideal for Parameters. Oct 27, 2020.](./pointer-params.md)
 - [Changkun Ou. Eliminating A Source of Measurement Errors in Benchmarks. Sep 30, 2020.](./bench-time.md)
 
 ## License

diff --git a/pointer-params.md b/pointer-params.md
@@ -0,0 +1,234 @@
+# Pointer Type May Not Be Ideal for Parameters
+
+Author(s): [Changkun Ou](https://changkun.de)
+
+Last updated: 2020-10-27
+
+## Introduction
+
+Passing parameters by pointer avoids copying data, which usually benefits performance. However, there are edge cases worth keeping in mind.
+
+Let's check this example:
+
+```go
+// vec.go
+type vec1 struct {
+	x, y, z, w float64
+}
+
+func (v vec1) add(u vec1) vec1 {
+	return vec1{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
+}
+
+type vec2 struct {
+	x, y, z, w float64
+}
+
+func (v *vec2) add(u *vec2) *vec2 {
+	v.x += u.x
+	v.y += u.y
+	v.z += u.z
+	v.w += u.w
+	return v
+}
+```
+
+Which `add` implementation runs faster?
+Intuitively, we might expect `vec2` to be faster because its parameter `u` is a pointer, so no data should be copied, whereas `vec1` copies data both when passing the argument and when returning the result.
+
+However, if we write a benchmark:
+
+```go
+func BenchmarkVec(b *testing.B) {
+	b.ReportAllocs()
+	b.Run("vec1", func(b *testing.B) {
+		v1 := vec1{1, 2, 3, 4}
+		v2 := vec1{4, 5, 6, 7}
+		b.ResetTimer()
+		for i := 0; i < b.N; i++ {
+			if i%2 == 0 {
+				v1 = v1.add(v2)
+			} else {
+				v2 = v2.add(v1)
+			}
+		}
+	})
+	b.Run("vec2", func(b *testing.B) {
+		v1 := vec2{1, 2, 3, 4}
+		v2 := vec2{4, 5, 6, 7}
+		b.ResetTimer()
+		for i := 0; i < b.N; i++ {
+			if i%2 == 0 {
+				v1.add(&v2)
+			} else {
+				v2.add(&v1)
+			}
+		}
+	})
+}
+```
+
+And run it as follows:
+
+```sh
+$ perflock -governor 80% go test -v -run=none -bench=. -count=10 | tee new.txt
+$ benchstat new.txt
+```
+
+`benchstat` reports the following result:
+
+```
+name         time/op
+Vec/vec1-16  0.25ns ± 1%
+Vec/vec2-16  2.20ns ± 0%
+```
+
+Why does this happen?
+
+## Inlining Optimization
+
+This is all due to compiler optimizations, mostly inlining.
+
+If we disable inlining for `add`:
+
+```go
+// vec.go
+type vec1 struct {
+	x, y, z, w float64
+}
+
+//go:noinline
+func (v vec1) add(u vec1) vec1 {
+	return vec1{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
+}
+
+type vec2 struct {
+	x, y, z, w float64
+}
+
+//go:noinline
+func (v *vec2) add(u *vec2) *vec2 {
+	v.x += u.x
+	v.y += u.y
+	v.z += u.z
+	v.w += u.w
+	return v
+}
+```
+
+Run the benchmark again and compare the performance with the previous result:
+
+```sh
+$ perflock -governor 80% go test -v -run=none -bench=. -count=10 | tee old.txt
+$ benchstat old.txt new.txt
+name         old time/op  new time/op  delta
+Vec/vec1-16  4.92ns ± 1%  0.25ns ± 1%  -95.01%  (p=0.000 n=10+9)
+Vec/vec2-16  2.89ns ± 1%  2.20ns ± 0%  -23.77%  (p=0.000 n=10+8)
+```
+
+Inlining transforms the code:
+
+```go
+v1 := vec1{1, 2, 3, 4}
+v2 := vec1{4, 5, 6, 7}
+v1 = v1.add(v2)
+```
+
+into a direct assignment:
+
+```go
+v1 := vec1{1, 2, 3, 4}
+v2 := vec1{4, 5, 6, 7}
+v1 = vec1{1+4, 2+5, 3+6, 4+7}
+```
+
+And in `vec2`'s case, it transforms:
+
+```go
+v1 := vec2{1, 2, 3, 4}
+v2 := vec2{4, 5, 6, 7}
+v1.add(&v2)
+```
+
+into a direct manipulation:
+
+```go
+v1 := vec2{1, 2, 3, 4}
+v2 := vec2{4, 5, 6, 7}
+v1.x += v2.x
+v1.y += v2.y
+v1.z += v2.z
+v1.w += v2.w
+```
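+
+The two expansions can be checked against the method calls with a small program. This is only a sketch: the helper names (`viaMethodVec1`, `manualVec1`, `viaMethodVec2`, `manualVec2`) are hypothetical and not part of the benchmark, and it verifies that the results are equal, not that the compiler actually performs the inlining.
+
+```go
+package main
+
+import "fmt"
+
+type vec1 struct{ x, y, z, w float64 }
+
+func (v vec1) add(u vec1) vec1 {
+	return vec1{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
+}
+
+type vec2 struct{ x, y, z, w float64 }
+
+func (v *vec2) add(u *vec2) *vec2 {
+	v.x += u.x
+	v.y += u.y
+	v.z += u.z
+	v.w += u.w
+	return v
+}
+
+// viaMethodVec1 calls the value-receiver method and reassigns the result.
+func viaMethodVec1() vec1 {
+	v1 := vec1{1, 2, 3, 4}
+	v2 := vec1{4, 5, 6, 7}
+	v1 = v1.add(v2)
+	return v1
+}
+
+// manualVec1 is the hand-written equivalent of the inlined call (hypothetical helper).
+func manualVec1() vec1 {
+	v1 := vec1{1, 2, 3, 4}
+	v2 := vec1{4, 5, 6, 7}
+	v1 = vec1{v1.x + v2.x, v1.y + v2.y, v1.z + v2.z, v1.w + v2.w}
+	return v1
+}
+
+// viaMethodVec2 calls the pointer-receiver method, which mutates v1 in place.
+func viaMethodVec2() vec2 {
+	v1 := vec2{1, 2, 3, 4}
+	v2 := vec2{4, 5, 6, 7}
+	v1.add(&v2)
+	return v1
+}
+
+// manualVec2 is the hand-written equivalent of the inlined call (hypothetical helper).
+func manualVec2() vec2 {
+	v1 := vec2{1, 2, 3, 4}
+	v2 := vec2{4, 5, 6, 7}
+	v1.x += v2.x
+	v1.y += v2.y
+	v1.z += v2.z
+	v1.w += v2.w
+	return v1
+}
+
+func main() {
+	fmt.Println("vec1 match:", viaMethodVec1() == manualVec1())
+	fmt.Println("vec2 match:", viaMethodVec2() == manualVec2())
+}
+```
+
+Both structs contain only `float64` fields, so they are comparable with `==`. Note that the value version has to reassign `v1`, while the pointer version updates `v1` through the receiver.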
+
+## Unoptimized Move Semantics
+
+If we check the compiled assembly, the reason quickly becomes clear:
+
+```sh
+$ go tool compile -S vec.go > vec.s
+```
+
+The dumped assembly code is as follows:
+
+```asm
+"".vec1.add STEXT nosplit size=89 args=0x60 locals=0x0
+	0x0000 00000 (vec.go:8)	TEXT	"".vec1.add(SB), NOSPLIT|ABIInternal, $0-96
+	0x0000 00000 (vec.go:8)	FUNCDATA	$0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
+	0x0000 00000 (vec.go:8)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
+	0x0000 00000 (vec.go:9)	MOVSD	"".u+40(SP), X0
+	0x0006 00006 (vec.go:9)	MOVSD	"".v+8(SP), X1
+	0x000c 00012 (vec.go:9)	ADDSD	X1, X0
+	0x0010 00016 (vec.go:9)	MOVSD	X0, "".~r1+72(SP)
+	0x0016 00022 (vec.go:9)	MOVSD	"".u+48(SP), X0
+	0x001c 00028 (vec.go:9)	MOVSD	"".v+16(SP), X1
+	0x0022 00034 (vec.go:9)	ADDSD	X1, X0
+	0x0026 00038 (vec.go:9)	MOVSD	X0, "".~r1+80(SP)
+	0x002c 00044 (vec.go:9)	MOVSD	"".u+56(SP), X0
+	0x0032 00050 (vec.go:9)	MOVSD	"".v+24(SP), X1
+	0x0038 00056 (vec.go:9)	ADDSD	X1, X0
+	0x003c 00060 (vec.go:9)	MOVSD	X0, "".~r1+88(SP)
+	0x0042 00066 (vec.go:9)	MOVSD	"".u+64(SP), X0
+	0x0048 00072 (vec.go:9)	MOVSD	"".v+32(SP), X1
+	0x004e 00078 (vec.go:9)	ADDSD	X1, X0
+	0x0052 00082 (vec.go:9)	MOVSD	X0, "".~r1+96(SP)
+	0x0058 00088 (vec.go:9)	RET
+"".(*vec2).add STEXT nosplit size=73 args=0x18 locals=0x0
+	0x0000 00000 (vec.go:17)	TEXT	"".(*vec2).add(SB), NOSPLIT|ABIInternal, $0-24
+	0x0000 00000 (vec.go:17)	FUNCDATA	$0, gclocals·8f9cec06d1ae35cc9900c511c5e4bdab(SB)
+	0x0000 00000 (vec.go:17)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
+	0x0000 00000 (vec.go:18)	MOVQ	"".u+16(SP), AX
+	0x0005 00005 (vec.go:18)	MOVSD	(AX), X0
+	0x0009 00009 (vec.go:18)	MOVQ	"".v+8(SP), CX
+	0x000e 00014 (vec.go:18)	ADDSD	(CX), X0
+	0x0012 00018 (vec.go:18)	MOVSD	X0, (CX)
+	0x0016 00022 (vec.go:19)	MOVSD	8(AX), X0
+	0x001b 00027 (vec.go:19)	ADDSD	8(CX), X0
+	0x0020 00032 (vec.go:19)	MOVSD	X0, 8(CX)
+	0x0025 00037 (vec.go:20)	MOVSD	16(CX), X0
+	0x002a 00042 (vec.go:20)	ADDSD	16(AX), X0
+	0x002f 00047 (vec.go:20)	MOVSD	X0, 16(CX)
+	0x0034 00052 (vec.go:21)	MOVSD	24(AX), X0
+	0x0039 00057 (vec.go:21)	ADDSD	24(CX), X0
+	0x003e 00062 (vec.go:21)	MOVSD	X0, 24(CX)
+	0x0043 00067 (vec.go:22)	MOVQ	CX, "".~r1+24(SP)
+	0x0048 00072 (vec.go:22)	RET
+```
+
+The `add` implementation of `vec1` reads its operands directly from the caller's stack frame and writes the result straight into the return slot;
+whereas `vec2` first needs `MOVQ` to load the two pointer parameters into registers (`AX` and `CX`), then accesses the data through them and writes the receiver pointer back to the return slot.
+
+The unexpected move cost in `vec2` is the two additional `MOVQ` loads of the pointers plus the memory reads through those two pointers.
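+
+The `args=0x60` and `args=0x18` values in the listing can be related to the struct and pointer sizes. The sketch below assumes a 64-bit platform and the stack-based argument passing visible in the listing, where receiver, parameter, and result each occupy their own slot; the helper names `vec1ArgBytes` and `vec2ArgBytes` are hypothetical.
+
+```go
+package main
+
+import (
+	"fmt"
+	"unsafe"
+)
+
+type vec1 struct{ x, y, z, w float64 }
+
+type vec2 struct{ x, y, z, w float64 }
+
+// vec1ArgBytes estimates the argument area of vec1.add:
+// receiver, parameter, and result are each a full vec1 value.
+func vec1ArgBytes() uintptr {
+	return 3 * unsafe.Sizeof(vec1{})
+}
+
+// vec2ArgBytes estimates the argument area of (*vec2).add:
+// receiver, parameter, and result are each a single pointer.
+func vec2ArgBytes() uintptr {
+	var p *vec2
+	return 3 * unsafe.Sizeof(p)
+}
+
+func main() {
+	fmt.Printf("vec1.add argument area:    %#x bytes\n", vec1ArgBytes())
+	fmt.Printf("(*vec2).add argument area: %#x bytes\n", vec2ArgBytes())
+}
+```
+
+With four `float64` fields each struct is 32 bytes, so the value version moves three 32-byte slots, while the pointer version moves three pointer-sized slots and pays for the indirection instead.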
+
+## Further Reading Suggestions
+
+- Changkun Ou. Conduct Reliable Benchmarking in Go. March 26, 2020. https://golang.design/s/gobench
+- Dave Cheney. Mid-stack inlining in Go. May 2, 2020. https://dave.cheney.net/2020/05/02/mid-stack-inlining-in-go
+- Dave Cheney. Inlining optimisations in Go. April 25, 2020. https://dave.cheney.net/2020/04/25/inlining-optimisations-in-go
+- MOVSD. Move or Merge Scalar Double-Precision Floating-Point Value. Last access: 2020-10-27. https://www.felixcloutier.com/x86/movsd
+- ADDSD. Add Scalar Double-Precision Floating-Point Values. Last access: 2020-10-27. https://www.felixcloutier.com/x86/addsd
+- MOVQ. Move Quadword. Last access: 2020-10-27. https://www.felixcloutier.com/x86/movq
+
+## License
+
+Copyright &copy; 2020 The [golang.design](https://golang.design) Authors.