UTF-8 Strings in C# 11
In the previous blog post, we talked about character encodings. Of particular note, C#’s internal representation uses UTF-16, which is great for working with text in memory. But the Internet likes to transfer large JSON, HTML, and other files in UTF-8, since it uses fewer bytes in most cases.
That means you may find yourself needing to convert C# strings in UTF-16 encoding to a UTF-8 encoding to send across the Internet, especially if you are working in ASP.NET. (And lots of C# programs are ASP.NET web applications.)
C#’s standard library has code to convert between different encodings.
So suppose you want to encode the text "HTTP/1.0". You could do something like this:
byte[] encoded = Encoding.UTF8.GetBytes("HTTP/1.0");
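To see what that call actually produces, here is a minimal sketch that prints the bytes and then decodes them back again. It assumes a console program using top-level statements; the variable names are just for illustration.

```csharp
using System;
using System.Text;

// Encode a UTF-16 C# string into UTF-8 bytes at run time.
byte[] encoded = Encoding.UTF8.GetBytes("HTTP/1.0");

// Every character here is ASCII, so each one takes exactly one byte in UTF-8.
Console.WriteLine(encoded.Length);                  // 8
Console.WriteLine(BitConverter.ToString(encoded));  // 48-54-54-50-2F-31-2E-30

// Decoding goes the other way, back to a regular C# string.
string decoded = Encoding.UTF8.GetString(encoded);
Console.WriteLine(decoded);                         // HTTP/1.0
```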
Unfortunately, that is quite slow to do all the time: each call re-encodes the text at run time and allocates a new array.
So some people got clever and hand-encoded it themselves, ahead of time:
ReadOnlySpan<byte> encoded = new byte[] { 0x48, 0x54, 0x54, 0x50, 0x2f, 0x31, 0x2e, 0x30 };
That’s much faster, but virtually impossible to read. A good comment would help, but it is certainly not as clear as “HTTP/1.0”.
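Hand-encoding is also easy to get wrong, so it can be worth checking the bytes against the encoder once, for example in a unit test. The following is a rough sketch of such a check; the names handEncoded and fromEncoder are made up for this example.

```csharp
using System;
using System.Text;

// The hand-written bytes from above.
ReadOnlySpan<byte> handEncoded = new byte[] { 0x48, 0x54, 0x54, 0x50, 0x2f, 0x31, 0x2e, 0x30 };

// The same text, encoded by the standard library at run time.
ReadOnlySpan<byte> fromEncoder = Encoding.UTF8.GetBytes("HTTP/1.0");

// SequenceEqual compares the two spans element by element.
bool same = handEncoded.SequenceEqual(fromEncoder);
Console.WriteLine(same);  // True
```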
C# 11 added a way to write the text as a plain string literal and have the compiler convert it to UTF-8 for you:
var encoded = "HTTP/1.0"u8;
Notice the u8 on the end? That is what’s signaling that you want the string to be converted by the compiler to UTF-8!
I intentionally used var there to hide what is actually happening.
The type is actually this:
ReadOnlySpan<byte> encoded = "HTTP/1.0"u8;
It is a ReadOnlySpan<byte>, not a string!
In fact, if you had written it as string encoded = "HTTP/1.0"u8;, it would have given you a compiler error.
It is not a string!
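To make that concrete, here is a small sketch (assuming a C# 11 compiler and top-level statements, with illustrative variable names) that compares the literal with the run-time encoder and shows how to get back to a real string when you need one.

```csharp
using System;
using System.Text;

// The compiler produces the UTF-8 bytes for this literal at compile time.
ReadOnlySpan<byte> literal = "HTTP/1.0"u8;

// The same text, encoded at run time for comparison.
ReadOnlySpan<byte> runtime = Encoding.UTF8.GetBytes("HTTP/1.0");
bool same = literal.SequenceEqual(runtime);
Console.WriteLine(same);            // True

// string wrong = "HTTP/1.0"u8;     // does not compile: the literal is not a string

// Going back to a string means decoding the bytes explicitly.
string roundTripped = Encoding.UTF8.GetString(literal);
Console.WriteLine(roundTripped);    // HTTP/1.0

// Non-ASCII characters take more than one byte: U+00E9 (e with acute) needs two.
int accentedLength = "caf\u00E9"u8.Length;
Console.WriteLine(accentedLength);  // 5
```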
But it does make it easy to write out UTF-8-encoded chunks of memory for use in places where that is necessary and helpful, such as in a RESTful web service.
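As a rough illustration of that kind of use, the sketch below writes a couple of UTF-8 literals straight into a stream. A MemoryStream stands in for a network stream here, and the response text is invented for the example.

```csharp
using System;
using System.IO;
using System.Text;

// A MemoryStream stands in for whatever stream a real web server would hand you.
using var stream = new MemoryStream();

// Stream.Write accepts a ReadOnlySpan<byte>, so the literals can be passed directly.
stream.Write("HTTP/1.0 200 OK\r\n"u8);
stream.Write("Content-Type: application/json\r\n\r\n"u8);

// If an API insists on a byte[], ToArray() copies the span into a new array.
byte[] copy = "HTTP/1.0"u8.ToArray();

// Decode what was written, only to display it.
string text = Encoding.UTF8.GetString(stream.ToArray());
Console.Write(text);
```

Passing the span directly avoids any run-time encoding step; the ToArray() call does allocate, so it is best kept for APIs that cannot take a span.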
NOTE: As of the time of writing, C# 11 is not quite out yet. If you want to try this out today, you may need to turn on C# 11 features for your project, for example by setting LangVersion to preview in the project file.