Code Snippets: UTF-8 to UTF-16 revisited and expanded

Yes, I know I posted yesterday about converting from UTF-8 to UTF-16BE/UTF-16LE, but I wasn’t happy with the code that converted to UTF-16. It relied on mb_convert_encoding which, whilst clear, meant that the sequences I sent could be silently “fixed” and that lone high/low surrogate code points would not be output at all (which I actually needed, as they were one of the things I was trying to test against!). I also just wanted more “insight” into the whole UTF-8 to UTF-16 system.

Unicode is NOT UTF-8

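One thing to remember (and which caused me a time sink) is that Unicode is NOT UTF-8! Unicode is a character set: it assigns each character a numbered code point (for example, U+20AC for the euro sign). UTF-8 is just one way of encoding those code points as byte sequences. It is the most common one, but the same code points can equally be encoded as UTF-16BE/UTF-16LE, UTF-32 and others.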
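To make the difference concrete, here is a minimal sketch of one code point being turned into several different byte sequences. The function names codepointToUtf8 and codepointToUtf16 are hypothetical helpers made up for this illustration (they are not the code from yesterday's post), and the sketch assumes the input is an integer in the range 0 to 0x10FFFF. Note that it does no validation at all, so lone surrogates are encoded as-is rather than refused.

```php
<?php
// Hypothetical helpers for illustration only; assumes 0 <= $cp <= 0x10FFFF.

// Encode one Unicode code point as a UTF-8 byte string (1 to 4 bytes).
function codepointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Encode one Unicode code point as UTF-16BE (default) or UTF-16LE.
// No validation: a lone surrogate such as 0xD800 is packed unchanged.
function codepointToUtf16(int $cp, bool $bigEndian = true): string
{
    $format = $bigEndian ? 'n' : 'v'; // 16-bit unsigned, big/little endian
    if ($cp < 0x10000) {
        return pack($format, $cp);
    }
    // Above the BMP: split into a high/low surrogate pair.
    $cp -= 0x10000;
    $high = 0xD800 | ($cp >> 10);
    $low  = 0xDC00 | ($cp & 0x3FF);
    return pack($format, $high) . pack($format, $low);
}

$cp = 0x20AC; // U+20AC EURO SIGN: one code point, several encodings
echo bin2hex(codepointToUtf8($cp)), "\n";         // e282ac
echo bin2hex(codepointToUtf16($cp, true)), "\n";  // 20ac
echo bin2hex(codepointToUtf16($cp, false)), "\n"; // ac20
echo bin2hex(pack('N', $cp)), "\n";               // 000020ac (UTF-32BE)
```

The code point itself (0x20AC) never changes; only the bytes used to carry it do. That is the whole distinction between Unicode and its encodings.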