SIMD Within a Register (SWAR) on Strings in Go 1.20

Created:

To introduce SWAR (SIMD within-a-register) techniques in go, we’ll implement the simplest thing I can think of: checking if a string is ASCII. SWAR is the technique of operating over multiple datapoints within one register of the cpu. In our case one 64-bit variable will contain 8 bytes of our string so we can process 8 bytes at a time. We’ll do this in go without any C functions or assembly.

The Baseline

To determine if a non-ascii code point is present in a utf-8 string, the high-bit of the first byte of the code-point will be set. If we perform a bitwise AND with 0x80 the result will be non-zero if the high bit is set. If we find a non-ascii character, return False. The first version of IsAscii checks the string one byte at a time:

func IsAsciiSlow(s string) bool {
	for i := 0; i < len(s); i++ {
		if 0x80&s[i] > 0 {
			return false
		}
	}
	return true
}

Go 1.20

Go 1.20 adds 3 New functions in the unsafe package: SliceData, String, and StringData. The unsafe.Add function was added in go 1.17 We’ll use the StringData and Add functions from the unsafe package for our SWAR code.

SWAR

On to the SWAR version of IsAscii. First get the bytes of the string using the following code: bytes := unsafe.StringData(s) . This gets a pointer to the bytes of the string. Then we cast the pointer to *uint64 to get the next 8 bytes of the string as an uint64. To proceed along the string, we do pointer addition by adding an offset to the pointer and then cast it to a pointer to a uint64. Then the * will get the actual uint64 value at that position.

// return 8 bytes at offset as an uint64
func GetBytesUint64(bytes *byte, offset int) uint64 {
	// add offset to the pointer and convert bytes to an int64
	data := *(*uint64)(unsafe.Add(unsafe.Pointer(bytes), offset))
	return data
}

In IsAsciiSlow above we checked each byte by ANDing it with 0x80. To use SWAR technique, we can do the AND operation on all 8 bytes of a uint64 in one operation, by ANDing it with 0x8080808080808080. Here’s the SWAR IsAscii function:

const ascii_mask = 0x8080808080808080

func IsAsciiSWAR(s string) bool {
	length := len(s)
	i := 0
	if length >= 8 {
		bytes := unsafe.StringData(s)
		for ; i < length-7; i += 8 {
			if (ascii_mask & GetBytesUint64(bytes, i)) > 0 {
				return false
			}
		}
	}

	// check the last characters of the string
	for ; i < length; i++ {
		if 0x80&s[i] > 0 {
			return false
		}
	}
	return true
}

Bounds Checking

Note that if you run the following code:

str := "foo"
myUint64 := swar.GetBytesUint64(unsafe.StringData(str), 100)

Your program probably won’t return any error. Go doesn’t do any bounds checking of the bytes pointed to by the return value of unsafe.StringData(). By doing these unsafe calls you are making yourself vulnerable to buffer overflow bugs, a major cause of security problems in software. I think the use of unsafe.Add makes your code more risky than other parts of the unsafe package, because it performs pointer arithmetic which is more error prone. Code that just uses the unsafe package to convert from a slice to a different slice type is not as bad as long as you get the slice length right during construction of the new slice.

Loop Unrolling

Another technique used to speed up inner loops of performance-sensitive code is to unroll the loop. Multiple items are processed in one loop. Here’s an example of the loop unrolled 4 times:

func IsAsciiUnroll(s string) bool {
	length := len(s)
	i := 0
	if length >= 4 {
		for ; i < length-3; i += 4 {
			if (0x80 & s[i] & s[i+1] & s[i+2] & s[i+3]) > 0 {
				return false
			}
		}
	}
	for ; i < length; i++ {
		if 0x80&s[i] > 0 {
			return false
		}
	}
	return true
}

Benchmarking

Here’s some benchmark results of the 3 functions. The short string tested is 7 bytes long: “fffffff”. The long string tested is 4099 bytes long.

go_swar>go test -bench=.
goos: windows
goarch: amd64
pkg: swar
cpu: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
BenchmarkIsasciiSlow/short_ASCII-8              202330407                6.013 ns/op
BenchmarkIsasciiSlow/long_non-ASCII-8             587756              2346 ns/op
BenchmarkIsAsciiSWAR/short_ASCII-8              193101264                6.235 ns/op
BenchmarkIsAsciiSWAR/long_non-ASCII-8            3937690               298.3 ns/op
BenchmarkIsasciiUnroll/short_ASCII-8            183495440                6.520 ns/op
BenchmarkIsasciiUnroll/long_non-ASCII-8          1000000              1253 ns/op

(Note that the cpu ‘turboed’ to about 3.58Ghz during the benchmark according to Windows Task Manager)

Table of perf figures of the long string as bytes per second. I’ve divided by 4099 to get ns/byte.

Method ns/byte GBytes/s x Baseline
IsAsciiSlow 0.572 1.747GB/s 1
IsAsciiSWAR 0.073 13.75GB/s 7.87
IsAsciiUnroll 0.306 3.27GB/s 1.87

We can see here that we get a nice speedup of almost 8 times from the SWAR function, there’s almost no overhead to the code. The unrolled function gets us a speedup of almost 2 times, even though we’re processing 4 bytes per loop. Our slow code seems to process almost one byte every 2 clock cycles, and the SWAR code processes almost an 8-byte uint64 in 2 clock cycles.

The main drawback of using SWAR is that the code becomes very hard to understand. In the case of IsAscii, the SWAR code is very simple, but anything more complex and the SWAR code becomes very complex. See for example this page that describes a SWAR tolower.

Inlining

We can check for function inlining using the “-gcflags -m” parameters to go build.If you see the compiler output below, our code is capable of inlining by the go compiler. This helps to get good performance. If we used assembly code to speed up our code by using AVX-2 instructions for example, I believe the go compiler is unable to inline them. That might offset some of the benefits of the faster instructions in go. C++ compilers make intrinsics available for vector instructions to be interspersed with regular code and may inline those. This gives C++ a performance advantage over Go for those who want to use special SIMD instructions.

go build -gcflags -m .
# swar
.\swar.go:9:6: can inline GetBytesUint64
.\swar.go:17:6: can inline unsafeGetBytePointer
.\swar.go:24:6: can inline IsAsciiSWAR
.\swar.go:31:35: inlining call to GetBytesUint64
.\swar.go:45:6: can inline IsAsciiSlow
.\swar.go:55:6: can inline IsAsciiRange
.\swar.go:65:6: can inline IsAsciiUnroll
.\swar.go:9:21: bytes does not escape
.\swar.go:17:27: leaking param: s to result ~r0 level=0
.\swar.go:24:18: s does not escape
.\swar.go:45:18: s does not escape
.\swar.go:55:19: s does not escape
.\swar.go:65:20: s does not escape