dionidium.com

Wayne Burkett's Weblog

XML Character Encoding on the Web
01:14AM CST February 19, 2005

The recent talk about character encodings has shamed me into taking a second look at how I'm serving my XML documents. The problem, as described in several other places, is that the way your user agent (UA) should decide on a charset and the way it actually does aren't always the same. Additionally, many have pointed out that text/xml, which I've been using as the default for all my XML files, is a bit of a mess: according to RFC 3023, UAs that receive text/xml documents "MUST use the default charset value of 'us-ascii'" if the charset parameter is omitted. Why is that a problem? As Mark Pilgrim first pointed out:

  1. Most of the XML documents on the web are served as text/xml without the charset parameter, which means...
  2. According to RFC 3023, they've got to be treated as us-ascii, but...
  3. A lot of those documents contain non-ASCII characters, and...
  4. None of the most popular XML parsers enforce RFC 3023.
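
To make those four points concrete, here's a rough sketch in Python of how a UA that did follow RFC 3023 might pick a charset. The function name effective_charset and the sample feed are made up for illustration, and the sniffing of the document itself is deliberately simplified (only a BOM check and the encoding declaration, and any media type other than text/xml is treated like application/xml), so read it as a sketch rather than a conforming implementation.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of an ASCII-compatible document. Simplified on purpose.
XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def effective_charset(content_type, body):
    """Return the charset an RFC 3023-style UA would use (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()   # e.g. "text/xml"
    charset = msg.get_content_charset()   # lowercased, or None if omitted

    # An explicit charset parameter wins for both text/xml and application/xml.
    if charset:
        return charset

    # text/xml with no charset parameter: the default is us-ascii,
    # whatever the document itself claims.
    if media_type == "text/xml":
        return "us-ascii"

    # Assumption: everything else is handled like application/xml, where
    # the document itself is consulted (BOM, then encoding declaration).
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if body.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    match = XML_DECL_ENCODING.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # XML's default when nothing else is declared


# A made-up feed containing one non-ASCII character.
feed = '<?xml version="1.0" encoding="utf-8"?><title>caf\u00e9</title>'.encode("utf-8")

for header in ("text/xml", "text/xml; charset=utf-8", "application/xml"):
    charset = effective_charset(header, feed)
    try:
        feed.decode(charset)
        outcome = "decodes cleanly"
    except UnicodeDecodeError:
        outcome = "fails to decode"
    print(f"{header!r}: {charset} ({outcome})")
```

The bare text/xml case is the troublesome one: the strict reading gives us-ascii, and the non-ASCII character in the feed then can't be decoded, even though the document plainly declares itself as UTF-8.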

This doesn't actually apply to me -- I've explicitly specified UTF-8 as the default charset for all documents -- but I'm switching my feeds to application/xml, anyway, just to be extra-double safe. [1]

And what about the fact that the W3C says you shouldn't specify character encodings in headers? Given that both the encoding specified in my XML documents and the headers they're served with agree that the document contains UTF-8, I'm struggling to see how it matters which one your UA trusts.
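
For anyone who wants to check the same thing for their own documents, here's a minimal sketch of the comparison. The helper names are hypothetical, the regular expression only handles an ASCII-compatible XML declaration at the very start of the body, and codecs.lookup is used just to treat aliases such as "UTF-8" and "utf8" as the same encoding.

```python
import codecs
import re
from email.message import Message

# Simplified: only finds an encoding declaration at the start of an
# ASCII-compatible document.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def header_charset(content_type):
    """Charset parameter from a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def declared_encoding(body):
    """Encoding named in the XML declaration, or None if there isn't one."""
    match = _DECL.match(body)
    return match.group(1).decode("ascii") if match else None


def encodings_agree(content_type, body):
    """True only when both sources name an encoding and the names match."""
    header = header_charset(content_type)
    declared = declared_encoding(body)
    if header is None or declared is None:
        return False  # one side is silent, so there's nothing to compare
    # codecs.lookup normalizes aliases; it raises LookupError for
    # encoding names Python doesn't know about.
    return codecs.lookup(header).name == codecs.lookup(declared).name


doc = b'<?xml version="1.0" encoding="utf-8"?><feed/>'
print(encodings_agree("application/xml; charset=UTF-8", doc))
print(encodings_agree("application/xml; charset=iso-8859-1", doc))
```

When the check passes, a UA that trusts the header and a UA that trusts the declaration end up decoding the bytes the same way, which is the whole point of the paragraph above.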

Update: A real-world example of the problem.

[1] Atom feeds are an exception; they're now served as application/atom+xml.