When you write algorithms that process strings in Go, you typically iterate with either a regular for loop (treating each element as a byte) or a for range loop (treating each element as a rune). When choosing between them, you may have asked yourself, "Is there any difference?" That is what we are going to talk about in this post.
Before going into the difference between rune and byte, it is important to note that a string in Go is, in effect, a read-only slice of bytes.
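As a small sketch of what "read-only slice of bytes" means in practice: you can read individual bytes by index, but to change one you have to copy the string into a []byte first. The helper name replaceFirstByte is made up for this illustration.

```go
package main

import "fmt"

// replaceFirstByte is a hypothetical helper for this illustration.
// It returns a copy of s whose first byte is replaced by b.
func replaceFirstByte(s string, b byte) string {
	if len(s) == 0 {
		return s
	}
	buf := []byte(s) // the conversion copies the bytes; s itself is untouched
	buf[0] = b
	return string(buf)
}

func main() {
	s := "abc"
	// s[0] = 'x' // does not compile: string bytes cannot be assigned to

	fmt.Println(len(s)) // len counts bytes
	fmt.Println(s[0])   // indexing yields a single byte

	changed := replaceFirstByte(s, 'x')
	fmt.Println(s, changed) // the original string is unchanged
}
```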
When trying to decide which loop to use, you could write both and compare the results. See the example below:
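package main

import "fmt"

func main() {
	const str = "abc"
	// for range yields the byte index and the rune that starts at that index.
	for key, char := range str {
		fmt.Printf("%#v of type %T with index %#v\n", char, char, key)
		// Indexing the string yields a single byte.
		fmt.Printf("%#v of type %T with index %#v\n", str[key], str[key], key)
	}
	fmt.Println("===")
	// A regular for loop visits the string one byte at a time.
	for i := 0; i < len(str); i++ {
		fmt.Printf("%#v of type %T with index %#v\n", str[i], str[i], i)
	}
}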
Playground to run the above code.
The output of the code above is:
97 of type int32 with index 0
0x61 of type uint8 with index 0
98 of type int32 with index 1
0x62 of type uint8 with index 1
99 of type int32 with index 2
0x63 of type uint8 with index 2
===
0x61 of type uint8 with index 0
0x62 of type uint8 with index 1
0x63 of type uint8 with index 2
Looking at the types: when you index a string (str[key]), the value is a byte (which is an alias for uint8). When you use a for range loop, char is a rune (which is an alias for int32).
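Because these are aliases rather than distinct types, values move between byte and uint8, or between rune and int32, without any conversion. A minimal sketch (the describe helper is a hypothetical name used only here):

```go
package main

import "fmt"

// describe is a hypothetical helper that returns the type name
// that fmt reports for v.
func describe(v interface{}) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	var b byte = 'a'
	var u uint8 = b // no conversion needed: byte is an alias for uint8

	var r rune = 'a'
	var i int32 = r // no conversion needed: rune is an alias for int32

	fmt.Println(describe(b), describe(u))
	fmt.Println(describe(r), describe(i))
}
```

This is why %T printed uint8 and int32 in the output above instead of byte and rune.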
Then you check if there are any differences between these types when converting back to a string.
package main

import "fmt"

func main() {
	var charTest rune = 97
	var byteTest byte = 0x61
	fmt.Println(string(charTest), string(byteTest)) // output: a a
}
Playground to run the above code.
Well, so you could conclude that byte and rune are the same thing and can be used interchangeably, right?
Actually, no.
The key difference between these types is often misunderstood because a lot of programmers learned, wrongly, that a character is stored in one byte.
Some historical context
In the "old days", the only characters that mattered were unaccented English letters. They were encoded using ASCII, which assigns each character a number between 0 and 127 (the printable ones occupy 32 to 126) and, conveniently, fits in one byte.
As time passed, many nations wanted to have their characters on computers as well, but it was impossible to store all possible Latin, Chinese, Arabic, Russian and other characters in a single byte. There are many thousands of characters, far more than the 256 values that 8 bits can hold.
To solve this problem, Unicode was invented.
Simplifying things a lot, Unicode set out to create a single character set that includes every possible character, symbol, and so on. This means that a character cannot always be represented as a single byte. In UTF-8, the encoding Go uses for source code and string literals, a character takes between one and four bytes.
In Unicode, each character maps to a code point, which is just an abstract concept. In practice, every letter or symbol maps to a unique code point. You can think of it as the "Unicode ID" of a character.
Code points and rune
So far, you know that a string is a slice of bytes and that each character (code point) in a string can be represented as 1 to 4 bytes.
Go introduced a shorter term for the code point concept: rune. This means that a rune is exactly the same as a code point. The only addition is that Go defines the word rune as an alias for the type int32, so that programs can be clear when an integer value represents a code point.
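To see that a rune is just the code point number, you can print a few runes with fmt's %U verb, which formats an integer in the conventional U+XXXX notation. A minimal sketch (codePoint is a hypothetical helper name):

```go
package main

import "fmt"

// codePoint is a hypothetical helper that formats a rune
// in Unicode's U+XXXX notation.
func codePoint(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	for _, r := range []rune{'a', 'ß', '日'} {
		// %c prints the character, %d prints the underlying int32 value.
		fmt.Printf("%c -> %s (decimal %d)\n", r, codePoint(r), r)
	}
}
```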
Going back to the question "Are byte and rune the same thing, and can they be used interchangeably?": we can't treat a byte the same as a rune because a single rune can be encoded as multiple bytes.
It is easier to understand the difference if we use a character that is stored in more than 1 byte. Check the example below:
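package main

import "fmt"

func main() {
	str := "日本語"
	// for range decodes one rune per iteration; key is its starting byte index.
	for key, char := range str {
		fmt.Printf("Character: %v. Type: %T. Value: %v. Index: %#v\n", string(char), char, char, key)
	}
	fmt.Println("===")
	// Indexing walks the raw bytes, splitting each multi-byte character apart.
	for key := 0; key < len(str); key++ {
		fmt.Printf("Character: %v. Type: %T. Value: %v. Index: %#v\n", string(str[key]), str[key], str[key], key)
	}
}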
Playground to run the above code.
The output of the code above is:
Character: 日. Type: int32. Value: 26085. Index: 0
Character: 本. Type: int32. Value: 26412. Index: 3
Character: 語. Type: int32. Value: 35486. Index: 6
===
Character: æ. Type: uint8. Value: 230. Index: 0
Character: . Type: uint8. Value: 151. Index: 1
Character: ¥. Type: uint8. Value: 165. Index: 2
Character: æ. Type: uint8. Value: 230. Index: 3
Character: . Type: uint8. Value: 156. Index: 4
Character: ¬. Type: uint8. Value: 172. Index: 5
Character: è. Type: uint8. Value: 232. Index: 6
Character: ª. Type: uint8. Value: 170. Index: 7
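Character: . Type: uint8. Value: 158. Index: 8
The for range loop yields three runes, and the index jumps by 3 because each of these characters is stored in 3 bytes. The regular for loop visits all 9 bytes one by one, and converting a single byte of a multi-byte character back to a string produces an unrelated (sometimes invisible) character.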
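To make the decoding step explicit, here is a rough sketch of what a for range loop over a string does for you, written by hand with utf8.DecodeRuneInString from the standard library. The runeStarts helper is a hypothetical name; it is not how the compiler implements range, just an equivalent walk over the bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStarts is a hypothetical helper that returns the byte index
// at which each rune of s begins.
func runeStarts(s string) []int {
	var starts []int
	for i := 0; i < len(s); {
		// Decode the rune that begins at byte i and learn its size in bytes.
		_, size := utf8.DecodeRuneInString(s[i:])
		starts = append(starts, i)
		i += size
	}
	return starts
}

func main() {
	str := "日本語"
	fmt.Println(runeStarts(str))

	// Converting to []rune lets you index by character instead of by byte.
	runes := []rune(str)
	fmt.Println(len(runes), string(runes[1]))
}
```

If you need random access by character position, the []rune conversion is the simplest option, at the cost of allocating a new slice.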
Small challenge 😊
There is a LeetCode problem called Valid Anagram. Below you can check a possible solution for this problem:
func isAnagram(s string, t string) bool {
	if len(s) != len(t) {
		return false
	}
	sOccurrencies := make(map[byte]int, len(s))
	tOccurrencies := make(map[byte]int, len(t))
	for k := range s {
		sOccurrencies[s[k]] += 1
		tOccurrencies[t[k]] += 1
	}
	for k := range sOccurrencies {
		if sOccurrencies[k] != tOccurrencies[k] {
			return false
		}
	}
	return true
}
Playground to run the above code.
This code works fine for single-byte characters. How could you adapt the solution to work for any Unicode character? Use the following main function to test it out:
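func main() {
	fmt.Println(isAnagram("anagram", "nagaram")) // true: works fine
	fmt.Println(isAnagram("こんにちは", "こんばんは")) // false: right answer, but only because the byte counts happen to differ
	fmt.Println(isAnagram("ち", "ん")) // false: right answer, for the same reason
	fmt.Println(isAnagram("\u00e9\u0128", "\u00e8\u0129")) // true: wrong, "éĨ" and "èĩ" share the bytes C3 A9 C4 A8 but not the characters
}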
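One possible adaptation, offered as a sketch rather than the definitive answer, is to count runes instead of bytes by iterating with for range. The name isAnagramRunes is made up here so it can sit next to the original function.

```go
package main

import "fmt"

// isAnagramRunes is a hypothetical rune-based variant of isAnagram.
// It compares how often each code point occurs in s and in t.
func isAnagramRunes(s, t string) bool {
	counts := make(map[rune]int)
	// for range decodes UTF-8, so r is a whole code point, not a byte.
	for _, r := range s {
		counts[r]++
	}
	for _, r := range t {
		counts[r]--
		if counts[r] < 0 {
			return false // t has more of r than s does
		}
	}
	// Any count still above zero means s has a rune that t lacks.
	for _, c := range counts {
		if c != 0 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isAnagramRunes("anagram", "nagaram"))
	fmt.Println(isAnagramRunes("こんにちは", "こんばんは"))
	fmt.Println(isAnagramRunes("日本語", "語本日"))
	fmt.Println(isAnagramRunes("\u00e9\u0128", "\u00e8\u0129"))
}
```

Keep in mind that this compares code points. Characters built from several code points, such as a letter followed by a combining accent, would need extra handling that is outside the scope of this post.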
Conclusion
Iterating over runes with a for range loop makes your code work with any Unicode character. If you instead treat each character as a single byte, your program may not behave as expected when it encounters characters stored in more than 1 byte.
If you want to dive deeper into this topic, I recommend starting with the following reads:
- Strings, bytes, runes and characters in Go.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
If you have any questions or suggestions, or want to discuss the topic even further, reach out to me on Twitter!