Converting Between UTF-8 and Unicode (C++)

First, let's clarify what each of the two actually is.

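Unicode: a standard covering all of the world's writing systems. Essentially, it assigns a number (a code point) to every character.
UTF-8: a byte-level encoding scheme for those Unicode numbers.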

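UTF-8 is a variable-length encoding. As originally designed, a sequence is 1 to 6 bytes long, and the Unicode code points are divided into six ranges, as shown in the table below.¹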

Bits	First code point	Last code point	Bytes	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
7	U+0000	U+007F	1	0xxxxxxx
11	U+0080	U+07FF	2	110xxxxx	10xxxxxx
16	U+0800	U+FFFF	3	1110xxxx	10xxxxxx	10xxxxxx
21	U+10000	U+1FFFFF	4	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
26	U+200000	U+3FFFFFF	5	111110xx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx
31	U+4000000	U+7FFFFFFF	6	1111110x	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx

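Two things stand out in the table above:

Except in the 7-bit range, the number of consecutive leading 1 bits in the first byte equals the sequence length. For example, 110xxxxx means 2 bytes and 1110xxxx means 3 bytes.
Every byte after the first starts with the bits 10. This makes the encoding self-synchronizing: a continuation byte can never be mistaken for the start of a character, so a decoder that lands on an arbitrary byte inside a multi-byte sequence cannot decode from there and knows it must look for the nearest leading byte.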
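The self-synchronizing property can be sketched in a few lines. The helper names `is_continuation` and `sync_to_start` are made up for this illustration, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// True when the byte has the form 10xxxxxx, i.e. it continues a sequence.
inline bool is_continuation(std::uint8_t byte) {
  return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: step backwards from pos to the index of the byte
// that starts the character containing s[pos]. Assumes s is valid UTF-8.
inline std::size_t sync_to_start(const std::string& s, std::size_t pos) {
  assert(pos < s.size());  // precondition: pos indexes a real byte
  while (pos > 0 && is_continuation(static_cast<std::uint8_t>(s[pos]))) {
    --pos;
  }
  return pos;
}
```

For the string "a€b" (bytes 61 E2 82 AC 62), positions 1, 2 and 3 all resolve to index 1, the leading byte of €.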

Note

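The exception in the first point (the 7-bit range) exists for compatibility with ASCII, and the second point affects how a code point is converted to UTF-8.

For compatibility with UTF-16, RFC 3629 removed the 5- and 6-byte ranges from UTF-8, so the largest encodable code point is now U+10FFFF.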
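As a rough sketch of what the RFC 3629 restriction means in code: a code point is encodable only if it is at most U+10FFFF and is not a UTF-16 surrogate (U+D800 to U+DFFF). The function names `is_encodable` and `utf8_length` are assumptions made for this example, not part of the code later in this post.

```cpp
#include <cassert>
#include <cstdint>

// True when cp may be encoded as UTF-8 under RFC 3629: at most U+10FFFF
// and outside the surrogate range U+D800..U+DFFF reserved for UTF-16.
inline bool is_encodable(std::uint32_t cp) {
  const bool in_range = cp <= 0x10FFFF;
  const bool is_surrogate = cp >= 0xD800 && cp <= 0xDFFF;
  return in_range && !is_surrogate;
}

// Number of bytes (1 to 4) that UTF-8 uses for an encodable code point.
inline int utf8_length(std::uint32_t cp) {
  assert(is_encodable(cp));  // precondition for this sketch
  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
```

Note that the functions later in this post accept the full original 4-byte range up to U+1FFFFF, which is wider than what RFC 3629 allows.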

Converting between Unicode code points and UTF-8

Again, let's start with some examples.¹

Character	Unicode	Binary code point	Binary UTF-8	Hexadecimal UTF-8
$	U+0024	`0100100`	0`0100100`	24
¢	U+00A2	`00010` `100010`	110`00010` 10`100010`	C2 A2
€	U+20AC	`0010` `000010` `101100`	1110`0010` 10`000010` 10`101100`	E2 82 AC
𤭢	U+24B62	`000` `100100` `101101` `100010`	11110`000` 10`100100` 10`101101` 10`100010`	F0 A4 AD A2

Note
The highlighted (backticked) parts mark the bits that stay the same before and after encoding.
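To make the bit splitting concrete, here is a minimal sketch that handles only the 3-byte range, which is the case of the € row. The name `encode_3byte` is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode a code point in U+0800..U+FFFF as three bytes:
// 1110xxxx 10xxxxxx 10xxxxxx
inline std::string encode_3byte(std::uint32_t cp) {
  assert(cp >= 0x800 && cp <= 0xFFFF);  // only the 3-byte range is handled
  std::string out;
  out += static_cast<char>(0xE0 | (cp >> 12));          // top 4 bits
  out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // middle 6 bits
  out += static_cast<char>(0x80 | (cp & 0x3F));         // low 6 bits
  return out;
}
```

Calling `encode_3byte(0x20AC)` yields the bytes E2 82 AC, matching the table: the 16 bits of U+20AC are cut into groups of 4, 6 and 6 bits and each group is placed behind its marker.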

Working out the boundary cases

Unicode	Binary code point	Binary UTF-8	Hexadecimal UTF-8
U+0000	00000000	00000000	00
U+007F	01111111	01111111	7F
U+0080	00000000 10000000	11000010 10000000	C2 80
U+07FF	00000111 11111111	11011111 10111111	DF BF
U+0800	00000000 00001000 00000000	11100000 10100000 10000000	E0 A0 80
U+FFFF	00000000 11111111 11111111	11101111 10111111 10111111	EF BF BF
U+10000	00000001 00000000 00000000	11110000 10010000 10000000 10000000	F0 90 80 80
U+1FFFFF	00011111 11111111 11111111	11110111 10111111 10111111 10111111	F7 BF BF BF

Implementing the UTF-8 to Unicode conversion function in C++11

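#include <cstdint>
#include <string>

// Decode the first UTF-8 sequence (1 to 4 bytes) in utf8 into its code point.
// Assumes well-formed input that contains the whole sequence.
uint32_t to_unicode(std::string const &utf8)
{
  uint32_t unicode = 0;
  uint8_t first_byte = (uint8_t)utf8[0];
  // Sequence length from the leading bits; 0 means a continuation byte (10xxxxxx).
  uint8_t len =
    (first_byte >> 7) == 0 ? 1 :
    (first_byte & 0xf0) == 0xf0 ? 4 :
    (first_byte & 0xe0) == 0xe0 ? 3 :
    (first_byte & 0xc0) == 0xc0 ? 2 : 0
    ;
  // Shift the length marker off the top, leaving only the payload bits.
  unicode += (uint8_t)(first_byte << len) >> len;
  // Each continuation byte contributes its low 6 bits.
  for(auto i = 1; i < len; ++i) {
    unicode <<= 6;
    unicode += ((uint8_t)utf8[i]) & 0x3F;
  }
  return unicode;
}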
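The same idea extends to decoding every character of a string, not just the first one. The sketch below is one possible way to do it. `lead_length` and `decode_all` are names invented for this example, and it checks only for truncated input, not that each continuation byte really starts with 10.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sequence length implied by a leading byte; 0 for a continuation byte
// or for a byte that cannot start a 1- to 4-byte sequence.
inline std::size_t lead_length(std::uint8_t b) {
  if (b < 0x80) return 1;
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF8) == 0xF0) return 4;
  return 0;
}

// Hypothetical helper: decode every code point in s, stopping at the
// first invalid leading byte or truncated sequence.
inline std::vector<std::uint32_t> decode_all(const std::string& s) {
  std::vector<std::uint32_t> out;
  std::size_t i = 0;
  while (i < s.size()) {
    const std::uint8_t lead = static_cast<std::uint8_t>(s[i]);
    const std::size_t len = lead_length(lead);
    if (len == 0 || i + len > s.size()) break;
    assert(len >= 1 && len <= 4);
    // Payload bits of the leading byte: all 7 for ASCII, else 8 - (len + 1).
    std::uint32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
    for (std::size_t k = 1; k < len; ++k) {
      cp = (cp << 6) | (static_cast<std::uint8_t>(s[i + k]) & 0x3F);
    }
    out.push_back(cp);
    i += len;
  }
  return out;
}
```

Feeding it the four example characters $, ¢, € and 𤭢 concatenated gives back 0x24, 0xA2, 0x20AC and 0x24B62.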

And here is the conversion in the other direction:

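// Encode a code point (up to 0x1FFFFF, the original 4-byte range) as UTF-8.
std::string to_utf8(uint32_t unicode)
{
  std::string utf8;
  // Number of bytes needed, from the range the code point falls in.
  uint8_t len =
    unicode < 0x80 ? 1 :
    unicode < 0x800 ? 2 :
    unicode < 0x10000 ? 3 : 4;
  // Build the leading-byte marker: the top len bits set (110, 1110, 11110).
  uint8_t mask = 0xF0; // 1111 0000
  mask >>= 8 - len;
  mask <<= 8 - len;
  if (len == 1)
    mask = 0; // single-byte sequences carry no marker
  // Emit the continuation bytes from last to first, 6 bits each.
  for(auto i=1; i<len;++i) {
    utf8.insert(0, 1, (char)((unicode & 0x3f) | 0x80));
    unicode >>= 6;
  }
  // What remains of the code point goes into the leading byte.
  utf8.insert(0, 1, (char)(mask | unicode));
  return utf8;
}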

Testing with the boundary cases

// compiled with clang++3.3 on FreeBSD 10
#include <iomanip>
#include <iostream>
using namespace std;

int main(void)
{
  // Boundary code points of the 1- to 4-byte ranges.
  uint32_t code[] = { 0, 0x7f, 0x80, 0x7ff, 0x0800, 0xffff, 0x10000, 0x1fffff };
  for (auto c : code) {
    string utf8 = to_utf8(c);
    cout << hex << c << " : ";
    // Dump each encoded byte in hex.
    for(auto u : utf8) {
      cout << setw(2) << hex << (0x0ff & (uint32_t)u);
    }
    cout << "(" << utf8.size() << ")" << " : ";
    // Decode again to confirm the round trip.
    cout << hex << to_unicode(utf8) << endl;
  }
  return 0;
}

Output

0 :  0(1) : 0
7f : 7f(1) : 7f
80 : c280(2) : 80
7ff : dfbf(2) : 7ff
800 : e0a080(3) : 800
ffff : efbfbf(3) : ffff
10000 : f0908080(4) : 10000
1fffff : f7bfbfbf(4) : 1fffff

Update: a faster length computation

uint8_t len =
  (first_byte >> 7) == 0 ? 1 :
  (~first_byte & 0x20) ? 2 :
  (~first_byte & 0x10) ? 3 : 4;
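One way to gain confidence in the shortcut is to compare it with the original mask-based version over every valid leading byte. The sketch below does that. The function names are invented for this check, and "valid leading byte" is taken to mean 0x00 to 0x7F and 0xC0 to 0xF7.

```cpp
#include <cassert>
#include <cstdint>

// Length from the full-mask comparisons used in the first version.
inline int length_by_mask(std::uint8_t b) {
  return (b >> 7) == 0        ? 1
       : (b & 0xF0) == 0xF0   ? 4
       : (b & 0xE0) == 0xE0   ? 3
       : (b & 0xC0) == 0xC0   ? 2 : 0;
}

// Length from the single-bit tests used in the update.
inline int length_by_bit(std::uint8_t b) {
  return (b >> 7) == 0 ? 1 : (~b & 0x20) ? 2 : (~b & 0x10) ? 3 : 4;
}

// Asserts that both versions agree on every valid leading byte.
inline void check_lengths_agree() {
  for (int b = 0x00; b <= 0xF7; ++b) {
    if (b >= 0x80 && b < 0xC0) continue;  // continuation bytes, not leading bytes
    assert(length_by_mask(static_cast<std::uint8_t>(b)) ==
           length_by_bit(static_cast<std::uint8_t>(b)));
  }
}
```

The two versions differ only on bytes that are not valid leading bytes: for the continuation byte 0x80 the mask version reports 0 while the single-bit version reports 2, so the shortcut relies on the input being well formed.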


¹ http://en.wikipedia.org/wiki/UTF-8
