Thoughts from the office by Ed Ball
Wednesday, April 28, 2004

I have a coworker who was burned twice in one day by surprises while using text encodings with the .NET Framework. He was working with files that were meant to be encoded as UTF-8 but had accidentally been encoded in code page 1252.

First, he wanted to decode a text document as UTF-8, knowing that it might contain invalid UTF-8 byte sequences and hoping to detect that fact somehow. Naturally, he used Encoding.UTF8; unfortunately, that seemed to quietly remove the invalid characters. Creating a new instance of UTF8Encoding did not help either, until he discovered the constructor that takes two Booleans, the second of which causes an exception to be thrown when an invalid byte sequence is found. (Let me take this opportunity to express my dislike of Boolean parameters to methods and constructors. I hope that the .NET Framework API designers discontinue this practice in the future; an enumerated type would be far more appropriate.)
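
For illustration, here is a minimal sketch of the difference between the two decoders. The class name and the sample bytes are my own, not my coworker's code, and exactly what the lenient decoder does with bad bytes (dropping them or substituting a replacement character) may depend on the Framework version, so the sketch only prints the resulting length rather than assuming it.

```csharp
using System;
using System.Text;

class StrictUtf8Demo
{
    static void Main()
    {
        // "café!" as code page 1252 bytes. 0xE9 followed by 0x21 is not a
        // valid UTF-8 sequence, so this stands in for a mis-encoded file.
        byte[] bytes = { 0x63, 0x61, 0x66, 0xE9, 0x21 };

        // Lenient decoder: does not throw, so the problem goes unnoticed.
        string lenient = Encoding.UTF8.GetString(bytes);
        Console.WriteLine("Lenient decode produced {0} characters", lenient.Length);

        // Strict decoder: the first argument controls whether a byte order
        // mark is emitted when encoding; the second (throwOnInvalidBytes)
        // makes decoding throw on invalid input.
        Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            Console.WriteLine("Input is valid UTF-8");
        }
        catch (ArgumentException)
        {
            // Later Framework versions throw a subclass of ArgumentException,
            // so catching the base type should cover both cases.
            Console.WriteLine("Input is not valid UTF-8");
        }
    }
}
```

Two bare Booleans side by side in that constructor call are a good example of why I would prefer an enumerated type: nothing at the call site says which is which.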

Second, he wanted to use StreamReader to decode a text document with code page 1252, knowing that the file might look like UTF-8. Naturally, he created a StreamReader with the constructor that takes a file name and an Encoding object, passing Encoding.GetEncoding("windows-1252") as the encoding, but eventually he realized that the documents were actually being decoded as UTF-8. Ultimately, he discovered the StreamReader constructor with an additional Boolean parameter, which specifies whether a byte order mark at the start of the file is allowed to override the specified encoding.
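
A minimal sketch of the fix, assuming the file path is passed on the command line; the class and method names here are hypothetical, chosen only for the example.

```csharp
using System;
using System.IO;
using System.Text;

class Cp1252ReaderDemo
{
    // Reads the whole file as code page 1252, even if it starts with a
    // UTF-8 byte order mark.
    static string ReadAsCp1252(string path)
    {
        // Note: on newer, non-Framework .NET runtimes this lookup needs the
        // code pages encoding provider to be registered first.
        Encoding cp1252 = Encoding.GetEncoding("windows-1252");

        // The third argument is detectEncodingFromByteOrderMarks. Passing
        // false keeps a byte order mark from overriding cp1252; the
        // two-argument constructor behaves as if it were true.
        using (StreamReader reader = new StreamReader(path, cp1252, false))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main(string[] args)
    {
        Console.WriteLine(ReadAsCp1252(args[0]));
    }
}
```

With detection turned off, any byte order mark in the file is decoded like ordinary data, so it shows up as stray characters at the start of the returned string.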

Things like this can take a while to track down, so I post them here in the hope that readers will avoid similar confusion.
