
Code Snippets: PHP: Converting between UTF-8 and UTF-16BE


I needed to convert some Unicode UTF-16BE strings (as used in Java) to UTF-8 (which is byte-oriented and so has no endianness to worry about). However, there didn’t seem to be many examples online. I’m not going to claim these methods are robust, 100% accurate for every use case, or even the best way to do it; they are just ways to do it.

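It’s worth noting that Java uses the escape sequence \uXXXX (where each X is a hexadecimal digit), and PHP uses the similar-looking escape sequence \u{XXXX}. They are not equivalent: Java’s \uXXXX is a single 16-bit UTF-16 code unit, so a character above U+FFFF needs a surrogate pair, while PHP’s \u{...} (PHP 7.0+, double-quoted strings only) takes a whole code point and produces UTF-8 bytes. PHP also has \xXX for a single byte; Java string literals have no \x escape, although Java regular expressions accept \xhh.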
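To make the difference concrete, here is a small sketch, assuming PHP 7.0 or later (the variable names are only for illustration). It shows that a PHP `\u{...}` escape yields UTF-8 bytes, that the same bytes can be written with `\xXX`, and that a Java-style sequence in single quotes is just literal text until it is decoded.

```php
<?php
// PHP: \u{...} takes a code point and yields its UTF-8 bytes.
$wave = "\u{1F44B}";
echo bin2hex($wave), PHP_EOL;        // f09f918b

// The same four bytes written with \xXX (one byte per escape).
$sameBytes = "\xF0\x9F\x91\x8B";
var_dump($wave === $sameBytes);      // bool(true)

// In single quotes nothing is interpreted: this is the literal
// Java-style text, two escapes of six characters each.
$javaStyle = '\uD83D\uDC4B';
echo strlen($javaStyle), PHP_EOL;    // 12
```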

ADDED: I’ve improved the UTF-8 to UTF-16 code conversion in a newer post.

From a “Java-esque” escaped hex bytestring (such as `Hand\uD83D\uDC4BEarth\uD83C\uDF0D` ) to UTF-8 string

$bytestring='Hand\uD83D\uDC4BEarth\uD83C\uDF0D';

$callback = function ($matches): string {
    # handle UTF16-BE encodings. Returns input if unable to convert
    $byteString = '';
    $out = false;
    if (false !== preg_match_all('/\\\u([A-Fa-f0-9]{4})/', $matches[0], $submatches, PREG_SET_ORDER)) {
        foreach ($submatches as $sub) {
            $byteString .= $sub[1];
        }
    }
    if ($byteString === '' ) {
        return $matches[0];
    }
    $packed = pack('H*', $byteString);
    if ($packed === false) {
        # Error: Unable to pack Java bytestring
        return $matches[0];
    }
    $out = mb_convert_encoding($packed, 'UTF-8', 'UTF-16BE');
    if ($out === false) {
        # Error: Unable to convert to unicode from bytestring 
        return $matches[0];
    }
    return $out;
}; 

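# Replace each run of consecutive \uXXXX escapes, so surrogate pairs stay together
$out = preg_replace_callback(
    '/(\\\u[A-Fa-f0-9]{4}){1,}/',
    $callback,
    $bytestring
);
print $out; # Hand👋Earth🌍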
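What the callback does for a single run of escapes can be traced step by step. The sketch below uses the waving-hand pair from the example input and assumes the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// 1. Hex digits collected from the run '\uD83D\uDC4B'.
$hex = 'D83D' . 'DC4B';

// 2. pack('H*') turns the hex digits into raw bytes:
//    two UTF-16BE code units, four bytes in total.
$utf16be = pack('H*', $hex);
echo strlen($utf16be), PHP_EOL;      // 4

// 3. mb_convert_encoding() combines the surrogate pair into one
//    code point (U+1F44B) and re-encodes it as UTF-8.
$utf8 = mb_convert_encoding($utf16be, 'UTF-8', 'UTF-16BE');
echo bin2hex($utf8), PHP_EOL;        // f09f918b
```

Matching whole runs rather than single escapes matters here: the two halves of a surrogate pair only make sense when they are converted together.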

Convert a Unicode code point given in hex (such as `1F44B`) to a UTF-16BE escape sequence such as `\uD83D\uDC4B`

# The input is a code point, not UTF-8 or UTF-16 bytes
$decCodepoint = hexdec('1F44B');

if ($decCodepoint <= 0xFFFF) {
    $out = sprintf("\\u%04X", $decCodepoint);
} else {
    $decCodepoint -= 0x10000;
    $high = 0xD800 + (($decCodepoint >> 10) & 0x3FF);
    $low  = 0xDC00 + ($decCodepoint & 0x3FF);
    $out = sprintf("\\u%04X\\u%04X", $high, $low);
}
print $out;
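The surrogate-pair arithmetic can also be wrapped in a function for reuse. This is only a sketch: `codepointToJavaEscape` is a hypothetical helper name, and the range check is an addition that the snippet above does not have.

```php
<?php
// Hypothetical helper: format one Unicode code point as Java-style escapes.
function codepointToJavaEscape(int $codepoint): string
{
    if ($codepoint < 0 || $codepoint > 0x10FFFF) {
        throw new InvalidArgumentException('Not a Unicode code point');
    }
    if ($codepoint <= 0xFFFF) {
        // Basic Multilingual Plane: one UTF-16 code unit.
        return sprintf('\\u%04X', $codepoint);
    }
    // Supplementary planes: split the 20-bit offset into two 10-bit halves.
    $offset = $codepoint - 0x10000;
    $high = 0xD800 + ($offset >> 10);
    $low  = 0xDC00 + ($offset & 0x3FF);
    return sprintf('\\u%04X\\u%04X', $high, $low);
}

echo codepointToJavaEscape(0x20AC), PHP_EOL;   // \u20AC
echo codepointToJavaEscape(0x1F44B), PHP_EOL;  // \uD83D\uDC4B
echo codepointToJavaEscape(0x1F30D), PHP_EOL;  // \uD83C\uDF0D
```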

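Convert a PHP-style escaped code point (such as `\u{01F44B}`, which PHP stores as UTF-8) to a UTF-16BE code unit sequence (such as `\xD83D\xDC4B`). The output is a display notation with one `\x` per 16-bit unit, not valid PHP `\xXX` byte escapes.

$originalString = "\u{01F44B}";
// Encode to UTF-16BE, naming the source encoding explicitly
$utf16EncodedString = mb_convert_encoding($originalString, 'UTF-16BE', 'UTF-8');
// Hex-dump the bytes and group them into 4-digit (16-bit) units
$out = '\x' . implode('\x', str_split(strtoupper(bin2hex($utf16EncodedString)), 4));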
print $out;

Check that the output string is valid UTF-8

// With the /u modifier, preg_match() returns false if $test is not valid UTF-8
$validUTF8 = (bool) preg_match('//u', $test);
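As a quick check of how this behaves, the sketch below runs the `//u` test on a valid and an invalid byte sequence, and shows `mb_check_encoding()` from the mbstring extension as an alternative that gives the same answer for these inputs. The sample strings are made up for illustration.

```php
<?php
$valid   = "Hand\u{1F44B}";   // well-formed UTF-8
$invalid = "Hand\xD8\x3D";    // raw UTF-16 bytes, not valid UTF-8

// preg_match() with /u returns 1 when the empty pattern matches and
// false when the subject is not valid UTF-8.
var_dump((bool) preg_match('//u', $valid));      // bool(true)
var_dump((bool) preg_match('//u', $invalid));    // bool(false)

// mbstring alternative.
var_dump(mb_check_encoding($valid, 'UTF-8'));    // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8'));  // bool(false)
```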

