Byte or rune: is there any difference?

When you are creating algorithms to deal with strings in Go, especially when using for loops, it’s common to use either a regular for loop (using each item of the slice as a byte) or a for range loop (using each item as a rune). When choosing which one to use, you could have asked youself “Is there any difference?”. That’s what we are going to talk about in this post.

Before going into the difference between rune and byte, it is important to note that a string in Go is, in effect, a read-only slice of bytes.

When trying to decide which one to use, you could write both of them to see what is the difference. See the example below:

func main() {
    const str = "abc"
    for key, char := range str {
        fmt.Printf("%#v of type %T with index %#v\n", char, char, key)
        fmt.Printf("%#v of type %T with index %#v\n", str[key], str[key], key)
    }

    fmt.Println("===")

    for i := 0; i < len(str); i++ {
        fmt.Printf("%#v of type %T with index %#v\n", str[i], str[i], i)
    }
}

Playground to run the above code.

The output of the code above is:

97 of type int32 with index 0
0x61 of type uint8 with index 0
98 of type int32 with index 1
0x62 of type uint8 with index 1
99 of type int32 with index 2
0x63 of type uint8 with index 2
===
0x61 of type uint8 with index 0
0x62 of type uint8 with index 1
0x63 of type uint8 with index 2

By analysing each type, when you index a string (str[key]) the value is going to be a byte (which is the same as uint8). If you use a for range loop, the char is going to be a rune (which is the same as int32).

Then you check if there are any differences between these types when converting back to a string.

func main() {
    var charTest rune = 97
	var byteTest byte = 0x61

	fmt.Println(string(charTest), string(byteTest)) // output: a a
}

Playground to run the above code.

Well, so you could conclude that byte and rune are the same thing and can be used interchangeably, right?

Actually, no.

The key difference between these types is often misunderstood because a lot of programmers learned, wrongly, that a character is stored in one byte.

Some historical context

In the “old-days”, the only characters used were english letters. Each of them has a code called ASCII, which represent every character with a number between 32 and 127 – and, conveniently, this could be stored in one byte.

As time passed, a lot of nations wanted to have their characters in computers as well, but it was impossible to store all possible latin, chinese, arabic, russian and other characters in a single byte. We would have thousands of characters, which cannot be stored in 8 bits.

To solve this problem, Unicode was invented.

Simplifying things a lot, Unicode had the objective to create a single character set that includes every possible character, symbols, etc. This means that a character cannot be always represented as a single byte. Some of them can be represented as one, two, up to six bytes.

In Unicode, one character maps to a code point, which is just an abstract concept. In practice, any letter/symbol/characters maps to a unique code point. You can think of it as the “UnicodeID” of a character.

Code points and rune

Up until now, you know that a string is a slice of bytes and each character (code point) in a string can be represented as 1 to 6 bytes.

Go introduced a short therm for the Code Point concept: rune. This means that a rune is exactly the same as a Code Point. The only addition is that Go defines the word rune as an alias for the type int32, so that programs can be clear when an integer value represents a code point.

Going back to the question “Are byte and rune the same thing and can be used interchangeably?": the reason why we can’t treat byte the same as rune is because a rune can be represented as multiple bytes.

It is easier to understand the difference if we use a character that is stored in more than 1 byte. Check the example below:

func main() {
    str := "日本語"
	for key, char := range str {
		fmt.Printf("Character: %v. Type: %T. Value: %v. Index: %#v\n", string(char), char, char, key)
	}

	fmt.Println("===")

	for key := 0; key < len(str); key++ {
		fmt.Printf("Character: %v. Type: %T. Value: %v. Index: %#v\n", string(str[key]), str[key], str[key], key)
	}
}

Playground to run the above code.

The output of the code above is:

Character: 日. Type: int32. Value: 26085. Index: 0
Character: 本. Type: int32. Value: 26412. Index: 3
Character: 語. Type: int32. Value: 35486. Index: 6
===
Character: æ. Type: uint8. Value: 230. Index: 0
Character: . Type: uint8. Value: 151. Index: 1
Character: ¥. Type: uint8. Value: 165. Index: 2
Character: æ. Type: uint8. Value: 230. Index: 3
Character: . Type: uint8. Value: 156. Index: 4
Character: ¬. Type: uint8. Value: 172. Index: 5
Character: è. Type: uint8. Value: 232. Index: 6
Character: ª. Type: uint8. Value: 170. Index: 7
Character: %

Small challenge 😊

There is a LeetCode problem called Valid Anagram. Below you can check a possible solution for this problem:

func isAnagram(s string, t string) bool {
	if len(s) != len(t) {
		return false
	}

	sOccurrencies := make(map[byte]int, len(s))
	tOccurrencies := make(map[byte]int, len(t))
	for k := range s {
		sOccurrencies[s[k]] += 1
		tOccurrencies[t[k]] += 1
	}

	for k := range sOccurrencies {
		if sOccurrencies[k] != tOccurrencies[k] {
			return false
		}
	}

	return true
}

Playground to run the above code.

This code works fine for single-byte characters. How could you adapt the solution to work for any unicode character? Use the following main function to test it out:

func main() {
	fmt.Println(isAnagram("anagram", "nagaram")) // Works fine
    fmt.Println(isAnagram("こんにちは", "こんばんは")) // Wrong
	fmt.Println(isAnagram("ち", "ん")) // Wrong
}

Conclusion

Using a rune when for ranging looping makes your code compliant to any Unicode character, but if you treat characters as a single byte, your program could not behave as expected when dealing with characters stored in more than 1 byte.

If you want to deep dive more into this topic, I would recommend starting with the following reads:

  1. Strings, bytes, runes and characters in Go.
  2. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

If you have any questions, suggestions or want to discuss the topic even further, reach me out on Twitter!