Friday, March 23, 2012

Helpful Post on UTF Encoding


Slides and notes for Ned Batchelder's talk, Pragmatic Unicode:

http://nedbatchelder.com/text/unipain.html

Summary:
 The five unavoidable Facts of Life:
  1. All input and output of your program is bytes.
  2. The world needs more than 256 symbols to communicate text.
  3. Your program has to deal with both bytes and Unicode.
  4. A stream of bytes can't tell you its encoding.
  5. Encoding specifications can be wrong.
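
A small sketch of facts 4 and 5, assuming Python 3 (the sample word and the variable names are mine, not from the talk): the same bytes decode without complaint under two different encodings, and only one result is the text that was meant.

```python
# The same byte string, decoded three ways. Nothing in the bytes
# themselves says which interpretation is the intended one.
data = "café".encode("utf-8")        # 5 bytes: b'caf\xc3\xa9'

as_utf8 = data.decode("utf-8")       # the text that was intended
as_latin1 = data.decode("latin-1")   # also "succeeds", but yields mojibake

try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    # ASCII has no meaning for bytes >= 0x80, so this decode is rejected.
    ascii_failed = True

print(repr(as_utf8), repr(as_latin1), ascii_failed)
```

Latin-1 assigns a character to every byte value, so it never raises an error. That is why a wrong encoding declaration can go unnoticed until someone reads the output.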
 To keep your code Unicode-clean:
  1. Unicode sandwich: keep all text in your program as Unicode, and convert as close to the edges as possible.
  2. Know what your strings are: you should be able to explain which of your strings are Unicode, which are bytes, and for your byte strings, what encoding they use.
  3. Test your Unicode support. Use exotic strings throughout your test suites to be sure you're covering all the cases.
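
The three habits above can be sketched in a few lines, again assuming Python 3. The function names `shout` and `process` are hypothetical, made up for this illustration, and UTF-8 at both edges is an assumption; a real program should use whatever encoding its inputs and outputs actually declare.

```python
def shout(text: str) -> str:
    """The filling of the sandwich: works on Unicode text (str) only."""
    return text.upper()


def process(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode at the input edge, work in str, encode at the output edge."""
    text = raw.decode(encoding)      # bytes -> Unicode, as early as possible
    result = shout(text)             # everything in the middle is str
    return result.encode(encoding)   # Unicode -> bytes, as late as possible


# Habit 3: feed the code something more exotic than plain ASCII.
exotic = "naïve ☃"
out = process(exotic.encode("utf-8"))
assert out.decode("utf-8") == exotic.upper()
```

The type hints double as habit 2: each signature states whether a value is bytes or Unicode, and the `encoding` parameter records which encoding the bytes use.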
