Characters, Strings, and Encodings
C# 11 adds support for a new flavor of string (kind of) called a UTF-8 string.
To understand what that means, we need to back up a bit and talk about encodings.
Computers, being binary, cannot understand anything but 1’s and 0’s. Every data type must map its set of possible values onto bit patterns made up entirely of 1’s and 0’s. Part of that mapping is deciding how many bits to use.
For some types, this is very simple.
The bool type, for example, just says, “We’ll use 8 bits (1 byte), and the bit pattern 00000000 represents false, the bit pattern 00000001 represents true, and… well… if any other bit pattern comes up, we’ll also treat those as true.”
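If you want to see that single byte for yourself, here’s a minimal sketch using BitConverter, which exposes the raw bytes behind a value:

```csharp
using System;

// BitConverter exposes the raw byte behind a bool value.
byte[] falseBytes = BitConverter.GetBytes(false);
byte[] trueBytes = BitConverter.GetBytes(true);

Console.WriteLine(falseBytes[0]); // 0  (bit pattern 00000000)
Console.WriteLine(trueBytes[0]);  // 1  (bit pattern 00000001)
```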
The byte type is more complicated. It wants to represent numbers between 0 and 255, using 8 bits. So the bit pattern 00000000 represents 0, 00000001 represents 1, 00000010 represents 2, 00000011 represents 3, and so on.
If you’re a math person, you may recognize this sequence as simply counting in a binary or base-2 counting system. 1, 10, 11, 100, 101, 110, 111, and so on.
The math world understood the notion of counting in binary long before computers existed (as well as counting in base-4, base-8, base-16, base-20, and any other base).
But when computers, as we know them today, came into existence, they leveraged this base-2/binary counting system heavily.
All of the integer types essentially follow this pattern of simply counting in base-2, where all of the numbers are simply a representation of that number in base-2. There’s a bit of nuance for the size and for negative numbers, but they’re all similar.
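If you’d like to watch that counting happen, here’s a small sketch using Convert.ToString with a base of 2 to print the bit pattern behind a few byte values:

```csharp
using System;

// Convert.ToString with a base of 2 shows the base-2 digits behind a number.
for (byte value = 0; value <= 5; value++)
{
    string bits = Convert.ToString(value, 2).PadLeft(8, '0');
    Console.WriteLine($"{value} -> {bits}");
}

// 0 -> 00000000
// 1 -> 00000001
// 2 -> 00000010
// 3 -> 00000011
// 4 -> 00000100
// 5 -> 00000101
```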
Floating-point numbers are more complicated, and beyond the scope of this post.
When it comes to text, however, letters aren’t numbers, so we can’t just do the obvious base-2 counting. But the gist is still the same. We come up with a bit pattern for every character that we want to represent.
We’ve also been doing this since before modern computers existed. One of the early flavors of this was ASCII, which initially used seven bits (a total of 128 possible values) to represent a set of characters including uppercase and lowercase letters, the digit symbols 0 through 9, some key punctuation like quote marks, periods, exclamation marks, and the addition symbol, and some control characters not intended for printing directly, but for controlling devices that printed characters (like BEL which dings a bell, and FF/Form Feed which ejects a printed page).
But the set of characters that people wanted to represent grew fast, especially because of a desire to represent non-English languages and characters. Seven bits quickly became eight bits (and nobody could agree on a single way to use the new range, and so there were alternative, competing options).
Eventually, the Unicode standard came along and became an authoritative source for assigning numbers to symbols. Numbers are assigned to all letters in all languages around the world, including dead languages, as well as mathematical symbols and emoji. If a symbol exists, there’s a Unicode “code point” for it.
But Unicode just assigns numbers to symbols in the abstract. It does not deal with how a computer represents such symbols, and there are multiple competing ways to handle that. These competing alternatives for representing code points on a computer are called character encodings.
The catch with Unicode is that there are tons of code points defined (150,000 or so, but growing all the time) but that only a small handful are commonly used. This results in competing goals: we want to be able to use all of the characters available to us, but we don’t, necessarily, want to use up a ton of space for every symbol, especially in large text documents. Different encodings will make different tradeoffs.
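To make those numbers concrete, here’s a small sketch printing the code points behind a few symbols; notice how the common ones get small numbers and the rarer ones climb much higher (char.ConvertToUtf32 is needed for the emoji, for reasons we’ll get to shortly):

```csharp
using System;

// A code point is just a number assigned to a symbol.
Console.WriteLine((int)'A');                     // 65      (U+0041)
Console.WriteLine((int)'é');                     // 233     (U+00E9)
Console.WriteLine((int)'€');                     // 8364    (U+20AC)
Console.WriteLine(char.ConvertToUtf32("😀", 0)); // 128512  (U+1F600)
```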
A simple encoding is UTF-32.
UTF-32 uses 32 bits/4 bytes to represent every single character.
This is a straightforward encoding, because the Unicode code points simply use their numeric value as though they were an int, but are thought of as symbols rather than integers.
The downside to UTF-32 is its size. Every character needs four bytes. The text “Hello, World!” needs 13*4=52 bytes. Contrast that with the old ASCII encoding, which would need 13 bytes. If you make a long enough document, that size is going to sting with a UTF-32 encoding!
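You can check those byte counts yourself with the Encoding class from System.Text, as in this quick sketch:

```csharp
using System;
using System.Text;

// UTF-32 spends four bytes on every character; ASCII spends one.
Console.WriteLine(Encoding.UTF32.GetBytes("Hello, World!").Length); // 52 (13 characters * 4 bytes)
Console.WriteLine(Encoding.ASCII.GetBytes("Hello, World!").Length); // 13
```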
An alternative encoding is UTF-16. This uses two bytes for most characters, especially the common ones. That cuts the size down quite a bit when contrasted with UTF-32. But 16 bits can only represent about 65,500 distinct values (65,536, to be exact), which is not enough to cover the 150,000 Unicode code points! The solution is that sometimes, some characters will “overflow” the 16 bits normally used and use 32 bits instead. If the bit pattern begins in a certain way, it signifies that the full symbol will actually take up two extra bytes. This is referred to as a variable-width encoding, contrasted with UTF-32, which is a fixed-width encoding. This has the advantage of using only two bytes for the most common symbols, cutting the size of most text almost in half compared to UTF-32.
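A quick sketch with Encoding.Unicode (the .NET name for little-endian UTF-16) shows both the common two-byte case and the rare four-byte “overflow” case:

```csharp
using System;
using System.Text;

// Encoding.Unicode is UTF-16 (little-endian) in .NET.
Console.WriteLine(Encoding.Unicode.GetBytes("A").Length);             // 2 -- a common character fits in two bytes
Console.WriteLine(Encoding.Unicode.GetBytes("😀").Length);            // 4 -- a rare character overflows into four
Console.WriteLine(Encoding.Unicode.GetBytes("Hello, World!").Length); // 26, versus 52 for UTF-32
```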
Yet another encoding is UTF-8. This takes the approach of UTF-16 to an extreme. The most common characters can be represented with a single byte, and when necessary, two, three, or four bytes will be used. One nice thing about this encoding is compatibility with the old ASCII standard. The text “Hello, World!”, made entirely of valid, old-school ASCII characters, is actually identical in UTF-8 and ASCII encodings, which is a nice plus.
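Here’s a similar sketch for UTF-8, showing the one-byte ASCII characters alongside symbols that need two, three, or four bytes:

```csharp
using System;
using System.Text;

// UTF-8 uses one byte for ASCII characters and more only when needed.
Console.WriteLine(Encoding.UTF8.GetBytes("Hello, World!").Length); // 13 -- byte-for-byte identical to ASCII
Console.WriteLine(Encoding.UTF8.GetBytes("é").Length);             // 2
Console.WriteLine(Encoding.UTF8.GetBytes("€").Length);             // 3
Console.WriteLine(Encoding.UTF8.GetBytes("😀").Length);            // 4
```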
These are not the only valid encodings. There are plenty of others. But these are some of the most common encodings.
C#’s char and string types use UTF-16. In fact, within running software, UTF-16 has a lot of advantages. With two bytes, you can represent nearly any character you could possibly care about. The catch is that certain rare characters cannot be represented in a single char. But these characters can still be represented in a string by simply occupying the spots of two char values.
The upside to using UTF-16 within a running program is that dealing with multi-char symbols is rare, but all characters can still be represented, and most only need two bytes of memory.
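Here’s a small sketch of what that looks like in practice; the emoji occupies two char slots, and StringInfo (from System.Globalization) can count visible symbols instead of char values:

```csharp
using System;
using System.Globalization;

// A C# string is a sequence of UTF-16 char values, so a rare symbol
// like 😀 occupies two char slots even though it reads as one symbol.
string text = "Hi 😀";

Console.WriteLine(text.Length);                               // 5 -- counts char values
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 4 -- counts visible symbols

Console.WriteLine(char.IsHighSurrogate(text[3])); // True -- first half of the emoji
Console.WriteLine(char.IsLowSurrogate(text[4]));  // True -- second half
```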
So it is quite common for programming languages to use UTF-16 encodings to represent text within a program.
C#, Java, and JavaScript all use UTF-16 encodings, along with a lot of other languages and systems.
Unfortunately, the web world feels differently when it comes to transmission of text data. When it comes to network load, UTF-8 is generally considered preferable, because it keeps the size down. UTF-8 in a running program is less than ideal, because it is quite common for characters to be 1 or 2 bytes, and you must occasionally deal with 3- and 4-byte characters as well. But the reduction in size of, say, an HTML or JSON file, both of which are extremely common, is too much to pass up.
Alas, if you are writing a C# program that runs on the Internet and is transmitting JSON, HTML, or other files that are expected to be UTF-8 encoded, it means you’ll need to convert!
That isn’t too hard, but we’ll save that topic for the next blog post.