I needed to convert some UTF-16BE strings (as used in Java) to UTF-8 (which is “byte-oriented”, so endianness is not a concern). However, there didn’t seem to be many examples online. I’m not going to claim these methods are robust, 100% accurate for every use case, or even the best way to do it; they are just ways to do it.
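It’s worth noting that Java uses the escape sequence \uXXXX (where each X is a hexadecimal digit and the four digits form one UTF-16 code unit), and PHP uses the similar escape sequence \u{XXXX} (which takes a Unicode code point rather than a code unit). PHP also uses \xXX for a single byte. Java string literals have no \x escape, although Java regular expressions accept \xXX.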
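To make the difference concrete, here is a small sketch, assuming PHP 7.0 or later (where the \u{...} escape is available in double-quoted strings). The variable names are only for illustration.

```php
<?php
// PHP's "\u{...}" escape takes a Unicode code point and yields UTF-8 bytes.
$wave = "\u{1F44B}";
print bin2hex($wave) . PHP_EOL;     // f09f918b (the four UTF-8 bytes of U+1F44B)

// Inside single quotes PHP interprets nothing, so a Java-style escape
// is just twelve literal ASCII characters until it is decoded.
$javaStyle = '\uD83D\uDC4B';
print strlen($javaStyle) . PHP_EOL; // 12
```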
From a “Java-esque” escaped hex bytestring (such as Hand\uD83D\uDC4BEarth\uD83C\uDF0D ) to UTF-8 string
$bytestring = 'Hand\uD83D\uDC4BEarth\uD83C\uDF0D';
$callback = function ($matches): string {
    # Handle UTF-16BE encodings. Returns the input unchanged if unable to convert.
    $byteString = '';
    # Collect the four hex digits of every \uXXXX escape in this run
    if (false !== preg_match_all('/\\\u([A-Fa-f0-9]{4})/', $matches[0], $submatches, PREG_SET_ORDER)) {
        foreach ($submatches as $sub) {
            $byteString .= $sub[1];
        }
    }
    if ($byteString === '') {
        return $matches[0];
    }
    # Turn the hex digits into raw UTF-16BE bytes
    $packed = pack('H*', $byteString);
    if ($packed === false) {
        # Error: Unable to pack Java bytestring
        return $matches[0];
    }
    $out = mb_convert_encoding($packed, 'UTF-8', 'UTF-16BE');
    if ($out === false) {
        # Error: Unable to convert to unicode from bytestring
        return $matches[0];
    }
    return $out;
};
# Match each run of consecutive \uXXXX escapes so surrogate pairs stay together
$out = preg_replace_callback(
    '/(\\\u[A-Fa-f0-9]{4}){1,}/',
    $callback,
    $bytestring
);
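For reuse, the same idea can be wrapped in a function. The following is a compact sketch rather than a hardened implementation: the name javaEscapesToUtf8 is hypothetical, it assumes the mbstring extension is loaded, and it does nothing special about lone (unpaired) surrogates.

```php
<?php
// Hypothetical helper: decode runs of Java-style \uXXXX escapes to UTF-8.
function javaEscapesToUtf8(string $input): string
{
    $result = preg_replace_callback(
        // One or more consecutive \uXXXX escapes, so surrogate pairs stay together
        '/(?:\\\\u[0-9A-Fa-f]{4})+/',
        static function (array $m): string {
            // Drop the "\u" prefixes, leaving the hex digits of the UTF-16BE code units
            $hex = str_replace('\u', '', $m[0]);
            return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
        },
        $input
    );

    // preg_replace_callback returns null on a regex error; fall back to the input
    return $result ?? $input;
}

print javaEscapesToUtf8('Hand\uD83D\uDC4BEarth\uD83C\uDF0D') . PHP_EOL;
```

Text without any escapes passes through unchanged, which makes the function safe to call on mixed input.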
Convert a Unicode code point in hexadecimal (such as ‘1F44B‘) to UTF-16BE escape sequence such as \uD83D\uDC4B
$decCodepoint = hexdec('1F44B');
if ($decCodepoint <= 0xFFFF) {
    // Inside the Basic Multilingual Plane: a single code unit is enough
    $out = sprintf("\\u%04X", $decCodepoint);
} else {
    // Outside the BMP: split the 20 remaining bits into a surrogate pair
    $decCodepoint -= 0x10000;
    $high = 0xD800 + (($decCodepoint >> 10) & 0x3FF);
    $low = 0xDC00 + ($decCodepoint & 0x3FF);
    $out = sprintf("\\u%04X\\u%04X", $high, $low);
}
print $out;
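The surrogate arithmetic can be sanity-checked by running it in reverse. This is a minimal sketch with a hypothetical helper name, surrogatesToCodePoint; it assumes both arguments really are a high and a low surrogate and does no range checking.

```php
<?php
// Hypothetical helper: combine a UTF-16 surrogate pair back into a code point.
function surrogatesToCodePoint(int $high, int $low): int
{
    // Each surrogate carries 10 bits; the pair encodes (code point - 0x10000)
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", surrogatesToCodePoint(0xD83D, 0xDC4B)); // U+1F44B
```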
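Convert a PHP-style escape sequence (such as \u{01F44B}, which PHP stores as UTF-8) to a UTF-16BE code unit sequence (such as \xD83D\xDC4B)
$originalString = "\u{01F44B}";
// Encode to UTF-16BE, naming the source encoding explicitly rather than relying on the default
$utf16EncodedString = mb_convert_encoding($originalString, 'UTF-16BE', 'UTF-8');
// Hex-encode the bytes and prefix each 16-bit code unit with \x
$out = '\x' . implode('\x', str_split(strtoupper(bin2hex($utf16EncodedString)), 4));
print $out;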
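The same UTF-16BE conversion can also produce Java-style \uXXXX escapes for a whole string, leaving ASCII alone. This is a sketch under assumptions: the function name utf8ToJavaEscapes is hypothetical, the input is assumed to be valid UTF-8, and mbstring must be available.

```php
<?php
// Hypothetical helper: escape every non-ASCII character as Java-style \uXXXX.
function utf8ToJavaEscapes(string $utf8): string
{
    $result = preg_replace_callback(
        // With the /u flag this matches one non-ASCII code point at a time
        '/[^\x00-\x7F]/u',
        static function (array $m): string {
            // UTF-16BE gives 2 bytes (BMP) or 4 bytes (surrogate pair) per character
            $hex = strtoupper(bin2hex(mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8')));
            return '\u' . implode('\u', str_split($hex, 4));
        },
        $utf8
    );

    // null means the regex failed (for example on invalid UTF-8); return the input as-is
    return $result ?? $utf8;
}

print utf8ToJavaEscapes("Hand\u{1F44B}") . PHP_EOL; // Hand\uD83D\uDC4B
```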
Check that the output string is valid UTF-8
$validUTF8 = (bool) preg_match('//u', $test);
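As a quick illustration of that check, the sketch below runs it against one valid and one deliberately broken byte sequence, alongside mb_check_encoding() from the mbstring extension as a cross-check. The sample strings are made up for the example.

```php
<?php
$samples = [
    'valid emoji'      => "\u{1F44B}",
    // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte
    'invalid sequence' => "\xC3\x28",
];

foreach ($samples as $label => $test) {
    // preg_match() returns false when the subject is not valid UTF-8 under /u
    $viaPreg = (bool) preg_match('//u', $test);
    $viaMb   = mb_check_encoding($test, 'UTF-8');
    printf("%s: preg=%s mb=%s\n", $label, var_export($viaPreg, true), var_export($viaMb, true));
}
```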