Code Snippets: UTF-8 to UTF-16 revisited and expanded

Yes, I know I posted yesterday about converting from UTF-8 to UTF-16BE/UTF-16LE, but I wasn’t happy with the code that did the conversion. It relied on mb_convert_encoding which, whilst clear, meant that malformed sequences could be silently “fixed” and lone high/low surrogate code points would not be output at all (which I actually needed, as they were one thing I was trying to test against!). I also just wanted more “insight” into the whole UTF-8 to UTF-16 system.

Unicode is NOT UTF-8

One thing to remember (and which caused me a timesink) is that Unicode is NOT UTF-8! Unicode is a collection of characters, each with a numbered code point. Those code points are usually stored as UTF-8 byte sequences (but can also be represented in UTF-16BE/UTF-16LE, UTF-32 and others).

Character	Unicode Value (Hexadecimal)	Unicode Value (Decimal)	UTF-8 byte sequence (Hexadecimal)
A (Latin Letter Capital A)	U+0041	65	41 (single byte, as it is not above 7F)
£ (Pound symbol)	U+00A3	163	C2 A3
👋 (Waving Hand)	U+1F44B	128075	F0 9F 91 8B

Example of Unicode Characters, their Unicode codepoint values and the UTF-8 byte sequence that results
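
To make the distinction concrete, here is a minimal sketch (not part of the original code) that writes the single code point U+00A3 in several encodings. The UTF-8 bytes are hard-coded from the table above, and the fixed-width encodings are built with pack(); the variable names are just for illustration.

```php
<?php
// One code point, several byte representations.
$codepoint = 0x00A3; // £ POUND SIGN

$encodings = [
    'UTF-8'    => "\xC2\xA3",            // taken from the table above
    'UTF-16BE' => pack('n', $codepoint), // 16-bit big endian:    00 A3
    'UTF-16LE' => pack('v', $codepoint), // 16-bit little endian: A3 00
    'UTF-32BE' => pack('N', $codepoint), // 32-bit big endian:    00 00 00 A3
];

foreach ($encodings as $name => $bytes) {
    // Show each encoding as space-separated upper-case hex bytes
    $hex = strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
    print $name . ': ' . $hex . "\n";
}
```

The code point never changes; only the bytes used to store it do.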

How does a Unicode code point get converted to a UTF-8 byte sequence? If its value is 7F/127 or less, it is just “as is” as a single byte. If it is more, then it is broken down into binary: the high bits of the first byte indicate the sequence length (two to four bytes) and its remaining 5, 4 or 3 bits carry the top bits of the code point. It is then followed by “continuation bytes”, each of which starts with binary 10 followed by 6 bits of the code point. Wikipedia has a table on the UTF-8 page which explains it a bit more.
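
As a rough sketch of those rules in the encoding direction (the reverse of what the main code does), the hypothetical helper below, codepointToUtf8Hex(), is my own illustration and not part of the original snippet. It deliberately does not reject surrogate code points, in keeping with the testing goal described earlier.

```php
<?php
// Hypothetical helper: encode one code point as UTF-8 and return the bytes as hex.
function codepointToUtf8Hex(int $codepoint): string
{
    if ($codepoint < 0 || $codepoint > 0x10FFFF) {
        throw new InvalidArgumentException('Code point out of range');
    }
    if ($codepoint <= 0x7F) {
        // 1 byte: 0xxxxxxx
        $bytes = [$codepoint];
    } elseif ($codepoint <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        $bytes = [
            0xC0 | ($codepoint >> 6),
            0x80 | ($codepoint & 0x3F),
        ];
    } elseif ($codepoint <= 0xFFFF) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        $bytes = [
            0xE0 | ($codepoint >> 12),
            0x80 | (($codepoint >> 6) & 0x3F),
            0x80 | ($codepoint & 0x3F),
        ];
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        $bytes = [
            0xF0 | ($codepoint >> 18),
            0x80 | (($codepoint >> 12) & 0x3F),
            0x80 | (($codepoint >> 6) & 0x3F),
            0x80 | ($codepoint & 0x3F),
        ];
    }
    return implode(' ', array_map(fn (int $b): string => sprintf('%02X', $b), $bytes));
}

print codepointToUtf8Hex(0x41) . "\n";    // 41
print codepointToUtf8Hex(0xA3) . "\n";    // C2 A3
print codepointToUtf8Hex(0x1F44B) . "\n"; // F0 9F 91 8B
```

The three calls reproduce the byte sequences from the table above.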

The code

Here’s the PHP 8 code for converting UTF-8 strings or UTF-8 hexadecimal byte strings to either UTF-16LE (little endian) or UTF-16BE (big endian). It returns the character string, the escape sequence and the hexadecimal input string.


const UTF16_LOW_BYTE_MASK = 0xFF;

# Leading byte masks for determining sequence length
const UTF8_1BYTE_MASK = 0x80; # 1000 0000
const UTF8_2BYTE_MASK = 0xE0; # 1110 0000
const UTF8_3BYTE_MASK = 0xF0; # 1111 0000
const UTF8_4BYTE_MASK = 0xF8; # 1111 1000

# Leading byte prefixes
const UTF8_1BYTE_PREFIX = 0x00; # 0xxxxxxx
const UTF8_2BYTE_PREFIX = 0xC0; # 110xxxxx
const UTF8_3BYTE_PREFIX = 0xE0; # 1110xxxx
const UTF8_4BYTE_PREFIX = 0xF0; # 11110xxx

# Continuation byte mask/prefix
const UTF8_CONT_MASK = 0xC0; # 1100 0000
const UTF8_CONT_PREFIX = 0x80; # 10xxxxxx

# Payload masks
const UTF8_PAYLOAD_MASK = 0x3F; # 0011 1111 (continuation bytes)
const UTF8_2BYTE_PAYLOAD_MASK = 0x1F; # 0001 1111 (first byte of 2-byte seq)
const UTF8_3BYTE_PAYLOAD_MASK = 0x0F; # 0000 1111 (first byte of 3-byte seq)
const UTF8_4BYTE_PAYLOAD_MASK = 0x07; # 0000 0111 (first byte of 4-byte seq)

# Replacement character for invalid UTF-8
const UNICODE_REPLACEMENT_CHAR = 0xFFFD;

# Surrogate pair boundaries
const UNICODE_PLANE1_BASE = 0x10000; # start of supplementary planes
const UNICODE_HIGH_SURROGATE = 0xD800; # high surrogate base
const UNICODE_LOW_SURROGATE = 0xDC00; # low surrogate base
const UNICODE_SURROGATE_PAYLOAD_MASK = 0x3FF; # mask for the lower 10 bits

/**
 * @param ?string $utf8String A UTF-8 encoded string, or null.
 * @param ?string $utf8HexBytes A string of hexadecimal UTF-8 bytes (spaces allowed), or null.
 * @param bool $asBe True for Java style UTF-16BE handling, false for UTF-16LE.
 * @return array<string,string> The results.
 * @throws InvalidArgumentException If passed incorrect data.
 */
function toUtf16(?string $utf8String = null, ?string $utf8HexBytes = null, bool $asBe = true): array
{
    if ($utf8HexBytes === null && $utf8String !== null) {
        $utf8HexBytes = bin2hex($utf8String);
    } elseif ($utf8HexBytes === null && $utf8String === null) {
        throw new InvalidArgumentException('Either utf8HexBytes OR utf8String needs to be provided');
    } elseif ($utf8HexBytes !== null && $utf8String !== null) {
        throw new InvalidArgumentException('Either utf8HexBytes OR utf8String needs to be provided: not both!');
    } else {
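        $utf8HexBytes = str_replace(' ', '', $utf8HexBytes);
        # Whole bytes only: hex2bin() fails on odd-length input, and \z (unlike \Z) rejects a trailing newline
        if (1 !== preg_match('/\A(?:[A-Fa-f0-9]{2})*\z/', $utf8HexBytes)) {
            throw new InvalidArgumentException('Invalid utf8HexBytes sent: only pairs of hexadecimal characters accepted');
        }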
    }
    # Convert hex string to raw bytes
    $bytes = array_values(unpack('C*', hex2bin($utf8HexBytes)));
    $escape = '';
    $chars = '';
    $position = 0;
    $length = count($bytes);
    while ($position < $length) {
        $firstByte = $bytes[$position++];

        # Determine UTF-8 sequence length from the leading byte
        if (($firstByte & UTF8_1BYTE_MASK) === UTF8_1BYTE_PREFIX) {
            # 1 byte ASCII (less than or equal to 127/0x7F)
            $codepoint = $firstByte;
        } elseif (($firstByte & UTF8_2BYTE_MASK) === UTF8_2BYTE_PREFIX) {
            # 2 byte sequence
            $codepoint = (($firstByte & UTF8_2BYTE_PAYLOAD_MASK) << 6) |
                ($bytes[$position++] & UTF8_PAYLOAD_MASK);
        } elseif (($firstByte & UTF8_3BYTE_MASK) === UTF8_3BYTE_PREFIX) {
            # 3 byte sequence
            $codepoint = (($firstByte & UTF8_3BYTE_PAYLOAD_MASK) << 12) |
                (($bytes[$position++] & UTF8_PAYLOAD_MASK) << 6) |
                ($bytes[$position++] & UTF8_PAYLOAD_MASK);
        } elseif (($firstByte & UTF8_4BYTE_MASK) === UTF8_4BYTE_PREFIX) {
            # 4 byte sequence
            $codepoint = (($firstByte & UTF8_4BYTE_PAYLOAD_MASK) << 18) |
                (($bytes[$position++] & UTF8_PAYLOAD_MASK) << 12) |
                (($bytes[$position++] & UTF8_PAYLOAD_MASK) << 6) |
                ($bytes[$position++] & UTF8_PAYLOAD_MASK);
        } else {
            # Invalid byte to replacement character
            $codepoint = UNICODE_REPLACEMENT_CHAR;
        }

        # Convert code point to UTF-16 code units and escapes
        # (U+10000 itself is the first code point that needs a surrogate pair)
        if ($codepoint < UNICODE_PLANE1_BASE) {
            if ($asBe) {
                # UTF-16BE (Java style)
                $chars  .= pack("n", $codepoint);
                $escape .= sprintf("\\u%04X", $codepoint);
            } else {
                # UTF-16LE
                $chars  .= pack("v", $codepoint);
                $escape .= sprintf("\\x%02X\\x%02X", $codepoint & UTF16_LOW_BYTE_MASK, $codepoint >> 8);
            }
        } else {
            # Surrogate pair
            $codepoint -= UNICODE_PLANE1_BASE;
            $high = UNICODE_HIGH_SURROGATE | ($codepoint >> 10);
            $low = UNICODE_LOW_SURROGATE | ($codepoint & UNICODE_SURROGATE_PAYLOAD_MASK);
            if ($asBe) {
                $chars  .= pack("n", $high) . pack("n", $low);
                $escape .= sprintf("\\u%04X\\u%04X", $high, $low);
            } else {
                $chars  .= pack("v", $high) . pack("v", $low);
                $escape .= sprintf(
                    "\\x%02X\\x%02X\\x%02X\\x%02X",
                    $high & UTF16_LOW_BYTE_MASK,
                    $high >> 8,
                    $low  & UTF16_LOW_BYTE_MASK,
                    $low  >> 8
                );
            }
        }
    }
    return [
        'input utf8 hex' => chunk_split($utf8HexBytes, 2, ' '),
        'utf16 mode' => ($asBe ? 'utf16-be' : 'utf16-le'),
        'utf16 escapes' => $escape,
        'utf16 chars' => $chars
    ];
}

Example usage: to get the UTF-16BE escape sequence for the string “Hello! 👋 Give me £!”:

print toUtf16(utf8String: "Hello! 👋 Give me £!", asBe: true)['utf16 escapes'];

would output:

\u0048\u0065\u006C\u006C\u006F\u0021\u0020\uD83D\uDC4B\u0020\u0047\u0069\u0076\u0065\u0020\u006D\u0065\u0020\u00A3\u0021

Or if you wanted to pass in a hexadecimal string and get the utf16 characters out:

print toUtf16(utf8HexBytes: "48 65 6c 6c 6f 21 20 f0 9f 91 8b 20 47 69 76 65 20 6d 65 20 c2 a3 21", asBe: true)['utf16 chars'];

would output:

utf8 terminal: Hello! ?=?K Give me ?!
utf16be terminal: Hello! 👋 Give me £!

Or to get out the escape sequences in UTF-16LE format:

print toUtf16(utf8String: "Hello! 👋 Give me £!", asBe: false)['utf16 escapes'];

to get:

\x48\x00\x65\x00\x6C\x00\x6C\x00\x6F\x00\x21\x00\x20\x00\x3D\xD8\x4B\xDC\x20\x00\x47\x00\x69\x00\x76\x00\x65\x00\x20\x00\x6D\x00\x65\x00\x20\x00\xA3\x00\x21\x00

(\x escapes one byte at a time in hexadecimal, so each UTF-16LE code unit appears low byte first, whereas \u escapes a whole 16-bit code unit, which reads as big-endian.)
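
To see where the \uD83D\uDC4B pair in the output above comes from, here is a small worked sketch of the surrogate arithmetic on its own. The helper name toSurrogatePair() is hypothetical and only for illustration; the numbers mirror the constants used in the main function.

```php
<?php
// Hypothetical helper: split a supplementary-plane code point
// into its UTF-16 high and low surrogates.
function toSurrogatePair(int $codepoint): array
{
    if ($codepoint < 0x10000 || $codepoint > 0x10FFFF) {
        throw new InvalidArgumentException('Only U+10000 to U+10FFFF need a surrogate pair');
    }
    $offset = $codepoint - 0x10000;    // leaves a 20-bit value
    $high = 0xD800 | ($offset >> 10);  // top 10 bits go in the high surrogate
    $low = 0xDC00 | ($offset & 0x3FF); // bottom 10 bits go in the low surrogate
    return [$high, $low];
}

// U+1F44B: 0x1F44B - 0x10000 = 0xF44B, top 10 bits 0x3D, bottom 10 bits 0x04B
[$high, $low] = toSurrogatePair(0x1F44B);
printf("%04X %04X\n", $high, $low); // D83D DC4B
```

The boundary case U+10000 gives an offset of zero and so encodes as D800 DC00.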

Hope it helps somebody and saves them having to reinvent the wheel.
