08 - ICU Technical Panorama: A Complete Analysis of the Unicode Internationalization Components (Open-Source Libraries Used in the Lower Layers of Android)

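I. Overview

1.1 Project Introduction

ICU (International Components for Unicode) is a mature, widely used set of C/C++ and Java internationalization libraries that provides Unicode and globalization support. It was originally developed at IBM and is now maintained under the Unicode Consortium.

Basic information:

Version: ICU 72.1 (Android 14)
Code size:
- C/C++: 809,000+ lines of code
- Java: 1,650 files
- Data files: 4,121 internationalization data files
Directory size: 377 MB
Location: external/icu/
License: Unicode License

Core features:

Unicode character handling (30+ character properties)
Character encoding conversion (50+ encodings)
Text normalization and boundary analysis
Locale-aware date/time formatting (180+ locales)
Number and currency formatting
Text sorting (collation)
Regular expression engine
Bidirectional text (BiDi) handling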

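1.2 ICU Variant Comparison

Component	ICU4C	ICU4J	Android ICU4J
Language	C/C++	Pure Java	Java (Android)
Package name	-	com.ibm.icu	android.icu
API stability	Full API	Full API	Stable subset of the API
Data loading	File/memory	JAR resources	APEX module
Lines of code	809,000+	650,000+	Trimmed-down version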

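1.3 Android Integration Overview

Integration approach:

Android Framework
       ↓
libandroidicu.so (stable C API)
       ↓
┌──────────────┬──────────────┐
│  libicuuc.so │ libicui18n.so│
│   (Common)   │     (i18n)   │
└──────────────┴──────────────┘
       ↓
ICU Data Files (.dat)

Use cases:

Text rendering and font handling
Input method engines (IME)
Date pickers and calendar apps
Contact sorting
Phone number formatting
WebView internationalization support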

II. Architecture Design

2.1 Overall Architecture

┌─────────────────────────────────────────────────────────┐
│                    Android Framework                    │
│        (Java/Kotlin: the android.icu.* packages)        │
└─────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│              libandroidicu.so (stable ABI)              │
│      Version-isolated C API (available to the NDK)      │
└─────────────────────────────────────────────────────────┘
                          ↓
┌──────────────────────┬──────────────────────────────────┐
│   libicuuc.so        │    libicui18n.so                 │
│   (UC = Common)      │    (i18n = Internationalization) │
│                      │                                  │
│ • Char properties    │   • Date/time formatting         │
│ • Charset conversion │   • Number/currency formatting   │
│ • Normalization      │   • Collation (sorting)          │
│ • BiDi algorithm     │   • Regular expressions          │
│ • String operations  │   • Calendar systems             │
│ • Resource bundles   │   • Time zone handling           │
└──────────────────────┴──────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────┐
│                     ICU Data Files                      │
│  • icudt72l.dat (little-endian)                         │
│  • icudt72b.dat (big-endian)                            │
│  • Data for 180+ locales                                │
│  • Collation rules, break rules, dictionaries           │
│  • Time zone database (TZDB)                            │
└─────────────────────────────────────────────────────────┘

2.2 Module Dependencies

Core module dependency graph:

                    ┌──────────────┐
                    │    i18n/     │
                    │ (high-level) │
                    │   241 .cpp   │
                    └──────┬───────┘
                           │ depends on
                           ↓
                    ┌──────────────┐
                    │   common/    │
                    │(core Unicode)│
                    │   199 .cpp   │
                    └──────┬───────┘
                           │ depends on
                           ↓
                    ┌──────────────┐
                    │  stubdata/   │
                    │ (data stub)  │
                    └──────────────┘

Example file-level dependencies:

i18n/calendar.cpp (calendar)
    → common/locid.cpp (locale identification)
    → common/unistr.cpp (Unicode strings)
    → common/utypes.h (basic type definitions)

i18n/coll.cpp (collation)
    → i18n/ucol.cpp (collation C API)
    → common/utrie2.cpp (trie data structure)
    → data/coll/*.txt (collation rule data)

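2.3 Data Flow Design

Character encoding conversion flow:

Input byte stream (GBK)
       ↓
ucnv_open("GBK", &status)
       ↓
ucnv_toUnicode()  ← consults the conversion table
       ↓
UTF-16 internal representation (UChar*)
       ↓
ucnv_fromUnicode() ← called on a second converter opened for UTF-8
       ↓
Output byte stream (UTF-8)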
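The second half of this flow, going from the UTF-16 pivot to UTF-8 bytes, can be sketched without ICU. This is only an illustration of the bit arithmetic, not ICU's converter: the name `utf16_to_utf8` is hypothetical, and a real converter also handles streaming input, error callbacks, and table-driven legacy encodings such as GBK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes UTF-16 code units as UTF-8. Returns the number of bytes written,
 * or -1 on an unpaired surrogate or if the output buffer is too small. */
int utf16_to_utf8(const uint16_t *src, size_t srcLen,
                  uint8_t *dst, size_t dstCap) {
    assert(src != NULL && dst != NULL);
    size_t o = 0;
    for (size_t i = 0; i < srcLen; i++) {
        uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            /* lead surrogate */
            if (i + 1 >= srcLen) return -1;
            uint32_t trail = src[i + 1];
            if (trail < 0xDC00 || trail > 0xDFFF) return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (trail - 0xDC00);
            i++;                                       /* consumed the trail unit */
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;                                 /* stray trail surrogate */
        }
        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need > dstCap) return -1;
        if (need == 1) {
            dst[o++] = (uint8_t)cp;
        } else if (need == 2) {
            dst[o++] = (uint8_t)(0xC0 | (cp >> 6));
            dst[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            dst[o++] = (uint8_t)(0xE0 | (cp >> 12));
            dst[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else {
            dst[o++] = (uint8_t)(0xF0 | (cp >> 18));
            dst[o++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            dst[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        }
    }
    return (int)o;
}
```

For the single unit U+4E2D ('中') the function writes the three bytes E4 B8 AD. A surrogate pair is first combined into one code point and then written as four bytes.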

Locale-aware formatting flow:

Raw data (1234567.89)
       ↓
Locale::createFromName("zh_CN")
       ↓
NumberFormat::createCurrencyInstance(locale, status)
       ↓
format(number) → consults the locale data
       ↓
Formatted string "¥1,234,567.89"
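The "consult the locale data, then format" step can be illustrated with a toy formatter. Everything here is an assumption made for illustration: `ToySymbols`, `kSymbols`, and `toy_format` are hypothetical names, the two-entry table stands in for ICU's locale data, and currency symbols and rounding modes are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-locale symbols; ICU loads the real ones from its locale data. */
typedef struct {
    const char *locale;
    char group;    /* grouping separator */
    char decimal;  /* decimal separator */
} ToySymbols;

static const ToySymbols kSymbols[] = {
    { "en_US", ',', '.' },
    { "de_DE", '.', ',' },
};

/* Formats a finite value with two fraction digits and groups of three.
 * Returns 0 on success, -1 for an unknown locale, a non-finite value,
 * or a too-small buffer. */
int toy_format(const char *locale, double value, char *out, size_t cap) {
    assert(locale != NULL && out != NULL);
    const ToySymbols *sym = NULL;
    for (size_t i = 0; i < sizeof kSymbols / sizeof kSymbols[0]; i++) {
        if (strcmp(kSymbols[i].locale, locale) == 0) sym = &kSymbols[i];
    }
    if (sym == NULL) return -1;

    char plain[64];
    int n = snprintf(plain, sizeof plain, "%.2f", value);  /* e.g. "1234567.89" */
    if (n < 4 || (size_t)n >= sizeof plain || plain[n - 3] != '.') return -1;

    int intEnd = n - 3;                       /* index of the '.' */
    int first = (plain[0] == '-') ? 1 : 0;    /* index of the first digit */
    size_t o = 0;
    for (int i = 0; i < n; i++) {
        if (o + 3 > cap) return -1;           /* room for separator + char + NUL */
        if (i == intEnd) {
            out[o++] = sym->decimal;
            continue;
        }
        if (i > first && i < intEnd && (intEnd - i) % 3 == 0) {
            out[o++] = sym->group;            /* a group boundary precedes this digit */
        }
        out[o++] = plain[i];
    }
    out[o] = '\0';
    return 0;
}
```

The same digits come out with different separators depending only on which table row is selected, which is the point of separating locale data from formatting logic.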

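III. Core Components in Detail

3.1 Unicode Processing Module (common/)

3.1.1 Character Property System

File location: icu4c/source/common/uchar.cpp (144 KB)

Core functionality:

ICU exposes queries for 30+ kinds of Unicode character properties, backed by an optimized trie data structure whose lookups take O(1) time.

Key data structures:

// utrie2.h - Trie2 data structure (compressed lookup table for character properties)
// Simplified view: the real struct carries additional builder and memory-management fields.
typedef struct UTrie2 {
    const uint16_t *index;      // index array
    const uint16_t *data16;     // 16-bit data array
    const uint32_t *data32;     // 32-bit data array (NULL when 16-bit data is used)
    int32_t indexLength;
    int32_t dataLength;

    uint16_t index2NullOffset;
    uint16_t dataNullOffset;
    uint32_t initialValue;
    uint32_t errorValue;        // value returned for out-of-range code points
} UTrie2;

// Lookup macro for 16-bit tries (expands inline, no function call)
#define UTRIE2_GET16(trie, c) \
    _UTRIE2_GET((trie), index, (trie)->indexLength, (c))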

Main API:

// 1. Character classification (30+ functions)
UBool u_isalpha(UChar32 c);        // is it a letter?
UBool u_isdigit(UChar32 c);        // is it a digit?
UBool u_isspace(UChar32 c);        // is it whitespace?
UBool u_ispunct(UChar32 c);        // is it punctuation?

// 2. General category lookup
int8_t u_charType(UChar32 c);       // returns a UCharCategory value

// UCharCategory enum (30 values, including U_UNASSIGNED = 0)
typedef enum UCharCategory {
    U_UPPERCASE_LETTER = 1,         // Lu: uppercase letter
    U_LOWERCASE_LETTER = 2,         // Ll: lowercase letter
    U_TITLECASE_LETTER = 3,         // Lt: titlecase letter
    U_MODIFIER_LETTER = 4,          // Lm: modifier letter
    U_OTHER_LETTER = 5,             // Lo: other letter
    U_DECIMAL_DIGIT_NUMBER = 9,     // Nd: decimal digit
    U_DASH_PUNCTUATION = 19,        // Pd: dash punctuation
    U_MATH_SYMBOL = 24,             // Sm: math symbol
    // ... 30 values in total
} UCharCategory;

// 3. Case mapping
UChar32 u_tolower(UChar32 c);       // to lowercase
UChar32 u_toupper(UChar32 c);       // to uppercase
UChar32 u_totitle(UChar32 c);       // to titlecase

// 4. Character names
int32_t u_charName(UChar32 code,
                   UCharNameChoice nameChoice,
                   char *buffer, int32_t bufferLength,
                   UErrorCode *pErrorCode);

// 5. Script (writing system) detection
UScriptCode uscript_getScript(UChar32 c, UErrorCode *pErrorCode);
// Returns USCRIPT_LATIN, USCRIPT_HAN, USCRIPT_ARABIC, ... (200+ script codes)

Usage examples:

UErrorCode status = U_ZERO_ERROR;

// Example 1: classify a character
UChar32 ch = 0x4E2D;  // '中'
if (u_isalpha(ch)) {
    int8_t type = u_charType(ch);  // returns U_OTHER_LETTER
    // type == 5 (Lo: Other Letter)
}

// Example 2: get the character name
char name[100];
u_charName(0x4E2D, U_UNICODE_CHAR_NAME, name, sizeof(name), &status);
// name = "CJK UNIFIED IDEOGRAPH-4E2D"

// Example 3: script detection
UScriptCode script = uscript_getScript(0x4E2D, &status);
// script = USCRIPT_HAN

// Example 4: script detection over mixed text
const char *mixed = u8"Hello世界";
int32_t length = (int32_t)strlen(mixed);  // compute the byte length once
UChar32 c;
int32_t i = 0;
while (i < length) {
    U8_NEXT(mixed, i, length, c);  // decodes one code point and advances i
    if (c < 0) break;              // ill-formed UTF-8
    UScriptCode s = uscript_getScript(c, &status);
    // 'H','e','l','l','o' → USCRIPT_LATIN
    // '世','界' → USCRIPT_HAN
}
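To show what the decode-and-classify loop in Example 4 does, here is a rough ICU-free sketch. It rests on stated simplifications: `toy_u8_next`, `toy_script`, and `toy_count_script` are hypothetical names, the decoder is less strict than `U8_NEXT`, and the script test is a coarse range check rather than a lookup in the Unicode script data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_OTHER, TOY_LATIN, TOY_HAN } ToyScript;

/* Decodes one code point from UTF-8 and advances *i.
 * Returns -1 on a bad lead byte or a missing/invalid trail byte.
 * Unlike ICU's U8_NEXT it does not reject overlong forms or surrogates. */
static int32_t toy_u8_next(const uint8_t *s, size_t *i, size_t len) {
    uint8_t b = s[(*i)++];
    int extra;
    int32_t cp;
    if (b < 0x80) return b;
    if ((b & 0xE0) == 0xC0)      { extra = 1; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
    else return -1;
    while (extra-- > 0) {
        if (*i >= len || (s[*i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (s[(*i)++] & 0x3F);
    }
    return cp;
}

/* Coarse range test; ICU consults the full Unicode script data instead. */
static ToyScript toy_script(int32_t cp) {
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) return TOY_LATIN;
    if (cp >= 0x4E00 && cp <= 0x9FFF) return TOY_HAN;  /* CJK Unified Ideographs block */
    return TOY_OTHER;
}

/* Counts the code points of the given script; returns -1 on malformed UTF-8. */
int toy_count_script(const char *utf8, size_t len, ToyScript want) {
    assert(utf8 != NULL);
    const uint8_t *s = (const uint8_t *)utf8;
    size_t i = 0;
    int count = 0;
    while (i < len) {
        int32_t cp = toy_u8_next(s, &i, len);
        if (cp < 0) return -1;
        if (toy_script(cp) == want) count++;
    }
    return count;
}
```

Run over the 11 UTF-8 bytes of "Hello世界", the counter sees five Latin and two Han code points, matching the comments in Example 4.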

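Performance optimization:

// Fast Trie2 lookup, simplified for illustration from the utrie2.h macros
// (assumes a frozen trie that stores 32-bit values)
static inline uint32_t
utrie2_get32(const UTrie2 *trie, UChar32 c) {
    // Fast path for BMP code points (U+0000..U+FFFF)
    if ((uint32_t)c <= 0xFFFF) {
        return UTRIE2_GET32(trie, c);  // 2 array accesses
    }
    // Supplementary code points (U+10000..U+10FFFF)
    if ((uint32_t)c <= 0x10FFFF) {
        return UTRIE2_GET32_FROM_SUPP(trie, c);  // 3 array accesses
    }
    return trie->errorValue;  // out-of-range code point
}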
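The idea behind such a lookup table, an index stage that selects a shared data block followed by a data stage that yields the value, can be sketched in a few lines. This is a deliberately tiny two-stage table, not the Trie2 layout: `ToyTrie`, `toy_trie_init`, and `toy_trie_get` are hypothetical, and the table covers only U+0000..U+00FF with a single "is ASCII digit" property.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    BLOCK_SHIFT = 5,                    /* 32 code points per data block */
    BLOCK_SIZE  = 1 << BLOCK_SHIFT,
    BLOCK_MASK  = BLOCK_SIZE - 1,
    TOY_LIMIT   = 0x100                 /* this toy covers U+0000..U+00FF only */
};

typedef struct {
    uint8_t index[TOY_LIMIT >> BLOCK_SHIFT];  /* stage 1: block number -> data block */
    uint8_t data[2 * BLOCK_SIZE];             /* stage 2: block 0 is the shared all-zero block */
    uint8_t errorValue;                       /* returned for out-of-range input */
} ToyTrie;

/* Builds a table whose value is 1 for ASCII digits and 0 elsewhere. */
void toy_trie_init(ToyTrie *t) {
    assert(t != NULL);
    memset(t, 0, sizeof *t);                  /* every index entry starts at block 0 */
    t->index['0' >> BLOCK_SHIFT] = 1;         /* only the digits' block gets its own data */
    for (uint32_t c = '0'; c <= '9'; c++) {
        t->data[1 * BLOCK_SIZE + (c & BLOCK_MASK)] = 1;
    }
}

/* Two array reads: index[] selects a block, data[] yields the value. */
uint8_t toy_trie_get(const ToyTrie *t, uint32_t c) {
    if (c >= TOY_LIMIT) return t->errorValue;
    return t->data[((uint32_t)t->index[c >> BLOCK_SHIFT] << BLOCK_SHIFT) + (c & BLOCK_MASK)];
}
```

Seven of the eight index entries point at the same all-zero block, so the table stores 64 data bytes instead of 256. That sharing is where the compression comes from, while each lookup still costs a fixed number of array reads.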

3.1.2 Character Encoding Conversion

File location: icu4c/source/common/ucnv*.cpp (20 converter files)

Core converters:

ucnvmbcs.cpp (246 KB): multi-byte character set converter (GBK, GB18030, Big5, Shift-JIS, etc.)
ucnv_u8.cpp: UTF-8 converter
ucnv_u16.cpp: UTF-16 converter
ucnv_u32.cpp: UTF-32 converter
ucnvlat1.cpp: Latin-1 (ISO-8859-1) converter

Supported encodings (50+):

• UTF family: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE
• Simplified Chinese: GBK, GB18030, GB2312, HZ-GB-2312
• Traditional Chinese: Big5, Big5-HKSCS, EUC-TW
• Japanese: Shift-JIS, ISO-2022-JP, EUC-JP
• Korean: EUC-KR, ISO-2022-KR
• Western European: ISO-8859-1 (Latin-1), ISO-8859-15, Windows-1252
• EBCDIC family: IBM037, IBM1047, etc.
• Other: KOI8-R, ISO-2022-CN, SCSU, BOCU-1

Core API:

// UConverter (opaque converter object)
typedef struct UConverter UConverter;

// 1. Open a converter
UConverter* ucnv_open(const char *converterName, UErrorCode *err);
// Examples: ucnv_open("GBK", &status);
//           ucnv_open("UTF-8", &status);

// 2. Convert to Unicode (UTF-16)
void ucnv_toUnicode(UConverter *converter,
                    UChar **target, const UChar *targetLimit,
                    const char **source, const char *sourceLimit,
                    int32_t *offsets,
                    UBool flush,
                    UErrorCode *err);

// 3. Convert from Unicode
void ucnv_fromUnicode(UConverter *converter,
                      char **target, const char *targetLimit,
                      const UChar **source, const UChar *sourceLimit,
                      int32_t *offsets,
                      UBool flush,
                      UErrorCode *err);

// 4. Convenience function: convert directly between two encodings
int32_t ucnv_convert(const char *toConverterName,
                     const char *fromConverterName,
                     char *target, int32_t targetCapacity,
                     const char *source, int32_t sourceLength,
                     UErrorCode *pErrorCode);

// 5. Close a converter
void ucnv_close(UConverter *converter);

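Complete conversion examples:

// Example 1: GBK → UTF-8 conversion
// Returns the number of UTF-8 bytes written, or -1 on failure.
// Sketch only: input that expands beyond 1024 UTF-16 units fails with
// U_BUFFER_OVERFLOW_ERROR instead of being converted in chunks.
int32_t convertGBKtoUTF8(const char *gbkText, char *utf8Buffer, int32_t bufferSize) {
    UErrorCode status = U_ZERO_ERROR;

    // Step 1: open the converters
    UConverter *gbkConv = ucnv_open("GBK", &status);
    UConverter *utf8Conv = ucnv_open("UTF-8", &status);
    if (U_FAILURE(status)) {
        ucnv_close(gbkConv);    // ucnv_close(NULL) is a no-op
        ucnv_close(utf8Conv);
        return -1;
    }

    // Step 2: GBK → UTF-16 (the internal pivot format)
    UChar unicodeBuffer[1024];
    const char *gbkSource = gbkText;
    UChar *unicodeTarget = unicodeBuffer;

    ucnv_toUnicode(gbkConv,
                   &unicodeTarget, unicodeBuffer + 1024,
                   &gbkSource, gbkText + strlen(gbkText),
                   NULL, true, &status);

    // Step 3: UTF-16 → UTF-8
    const UChar *unicodeSource = unicodeBuffer;
    char *utf8Target = utf8Buffer;

    ucnv_fromUnicode(utf8Conv,
                     &utf8Target, utf8Buffer + bufferSize,
                     &unicodeSource, unicodeTarget,
                     NULL, true, &status);

    // Step 4: clean up
    ucnv_close(gbkConv);
    ucnv_close(utf8Conv);
    return U_FAILURE(status) ? -1 : (int32_t)(utf8Target - utf8Buffer);
}

// Example 2: the convenience function
// (gbkInput is a NUL-terminated GBK string; status is a UErrorCode set to U_ZERO_ERROR)
char utf8Result[1024];
int32_t len = ucnv_convert(
    "UTF-8",           // target encoding
    "GBK",             // source encoding
    utf8Result, 1024,  // output buffer
    gbkInput, -1,      // input (-1 = NUL-terminated)
    &status
);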

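Java API example:

// Encoding conversion on Android. The android.icu packages do not expose a public
// charset package (ICU4J's CharsetICU lives in com.ibm.icu.charset), so use
// java.nio.charset, which Android backs with ICU converters.
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// Approach 1: encode through a Charset
Charset gbk = Charset.forName("GBK");
String text = "中文测试";
byte[] gbkBytes = text.getBytes(gbk);

// Approach 2: use CharsetDecoder/CharsetEncoder
CharsetDecoder decoder = gbk.newDecoder();
CharBuffer unicode = decoder.decode(ByteBuffer.wrap(gbkBytes));  // may throw CharacterCodingException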

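Performance optimization:

// ucnvmbcs.cpp - fast lookup for multi-byte character sets
// (abridged: the real UConverterMBCSTable has more fields)
typedef struct UConverterMBCSTable {
    uint8_t countStates;
    uint32_t countToUFallbacks;
    uint32_t stateTableLength;

    // State-machine transition table
    const int32_t (*stateTable)/*[countStates]*/[256];

    // Unicode mapping tables (compressed format)
    const uint16_t *fromUnicodeTable;
    const uint32_t *fromUnicodeBytes;
} UConverterMBCSTable;

// Fast conversion (state machine + lookup tables):
// converting a double-byte GBK character takes only 2-3 memory accesses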
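A byte-driven state table of this kind can be sketched as follows. It is a simplified stand-in, not ICU's table format: the names are hypothetical, the table only tracks "start" versus "expecting a trail byte", and it assumes the usual GBK byte ranges (lead 0x81..0xFE, trail 0x40..0xFE except 0x7F). ICU's real state-table entries additionally carry actions and offsets into the mapping tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ST_START = 0, ST_TRAIL = 1, ST_COUNT = 2, ST_ERROR = -1 };

/* Next-state table indexed by [state][byte], filled once on first use. */
static int8_t kNext[ST_COUNT][256];
static int kReady = 0;

static void toy_dbcs_init(void) {
    for (int b = 0; b < 256; b++) {
        /* Start state: ASCII completes a character; 0x81..0xFE opens a pair. */
        kNext[ST_START][b] = (b < 0x80) ? ST_START
                           : (b >= 0x81 && b <= 0xFE) ? ST_TRAIL : ST_ERROR;
        /* Trail state: 0x40..0xFE except 0x7F completes the pair. */
        kNext[ST_TRAIL][b] = (b >= 0x40 && b <= 0xFE && b != 0x7F) ? ST_START : ST_ERROR;
    }
    kReady = 1;
}

/* Counts characters in a GBK-shaped byte stream; returns -1 if it is ill-formed. */
int toy_dbcs_count(const uint8_t *s, size_t len) {
    assert(s != NULL);
    if (!kReady) toy_dbcs_init();
    int state = ST_START;
    int count = 0;
    for (size_t i = 0; i < len; i++) {
        state = kNext[state][s[i]];       /* one table read per input byte */
        if (state == ST_ERROR) return -1;
        if (state == ST_START) count++;   /* back at start: a character just ended */
    }
    return (state == ST_START) ? count : -1;  /* reject a dangling lead byte */
}
```

The per-byte cost is a single indexed read with no branching on encoding-specific ranges, which is why table-driven converters stay fast even for large character sets.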

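3.1.3 Unicode Normalization

File location: icu4c/source/common/unorm*.cpp (5 files)

The four normalization forms:

Form	Full name	Description	Use
NFC	Canonical Decomposition, followed by Canonical Composition	Decompose, then recompose	Default form, most compact
NFD	Canonical Decomposition	Fully decomposed	Text search, sorting
NFKC	Compatibility Decomposition, followed by Canonical Composition	Compatibility-decompose, then recompose	Ignoring formatting differences
NFKD	Compatibility Decomposition	Fully compatibility-decomposed	Text indexing

Normalization examples:

Original character: é (single code point U+00E9)

NFD:  é → e + ´  (U+0065 + U+0301)  [decomposed into base character + combining mark]
NFC:  é → é      (U+00E9)            [kept as, or recomposed into, a single code point]

Original character: ﬁ (ligature U+FB01)

NFKD: ﬁ → f + i  (U+0066 + U+0069)  [compatibility decomposition]
NFKC: ﬁ → fi     (U+0066 + U+0069)  [compatibility decomposition; no canonical composition applies to f + i]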
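The decompose/recompose relationship in the é example can be mimicked with a small pair table. This is a toy under clear assumptions: the names are hypothetical, the table holds only seven canonical pairs, and the code ignores multiple combining marks, canonical reordering, and Hangul, all of which real normalization has to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint16_t base, mark, composed; } ToyPair;

/* A few canonical pairs; ICU's normalization data covers all of Unicode. */
static const ToyPair kPairs[] = {
    { 0x0061, 0x0300, 0x00E0 },  /* a + grave     -> à */
    { 0x0061, 0x0301, 0x00E1 },  /* a + acute     -> á */
    { 0x0065, 0x0300, 0x00E8 },  /* e + grave     -> è */
    { 0x0065, 0x0301, 0x00E9 },  /* e + acute     -> é */
    { 0x006E, 0x0303, 0x00F1 },  /* n + tilde     -> ñ */
    { 0x006F, 0x0308, 0x00F6 },  /* o + diaeresis -> ö */
    { 0x0075, 0x0308, 0x00FC },  /* u + diaeresis -> ü */
};
static const size_t kPairCount = sizeof kPairs / sizeof kPairs[0];

/* NFD-like: expands known precomposed characters. dst must hold 2 * len units. */
size_t toy_decompose(const uint16_t *src, size_t len, uint16_t *dst) {
    assert(src != NULL && dst != NULL);
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        size_t k = 0;
        while (k < kPairCount && kPairs[k].composed != src[i]) k++;
        if (k < kPairCount) {
            dst[o++] = kPairs[k].base;
            dst[o++] = kPairs[k].mark;
        } else {
            dst[o++] = src[i];
        }
    }
    return o;
}

/* NFC-like: merges a base followed by one known mark. dst must hold len units. */
size_t toy_compose(const uint16_t *src, size_t len, uint16_t *dst) {
    assert(src != NULL && dst != NULL);
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        size_t k = kPairCount;
        if (i + 1 < len) {
            k = 0;
            while (k < kPairCount &&
                   !(kPairs[k].base == src[i] && kPairs[k].mark == src[i + 1])) k++;
        }
        if (k < kPairCount) {
            dst[o++] = kPairs[k].composed;
            i++;                      /* the combining mark was consumed */
        } else {
            dst[o++] = src[i];
        }
    }
    return o;
}
```

Composing U+0065 U+0301 yields the single unit U+00E9, and decomposing U+00E9 gives the two-unit sequence back, so the two forms round-trip for characters in the table.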

Core API:

// unorm2.h - Normalization 2.0 API (recommended)

// 1. Get a normalizer instance
const UNormalizer2* unorm2_getNFCInstance(UErrorCode *pErrorCode);
const UNormalizer2* unorm2_getNFDInstance(UErrorCode *pErrorCode);
const UNormalizer2* unorm2_getNFKCInstance(UErrorCode *pErrorCode);
const UNormalizer2* unorm2_getNFKDInstance(UErrorCode *pErrorCode);

// 2. Normalize a string
int32_t unorm2_normalize(const UNormalizer2 *norm2,
                         const UChar *src, int32_t length,
                         UChar *dest, int32_t capacity,
                         UErrorCode *pErrorCode);

// 3. Check whether a string is already normalized
UBool unorm2_isNormalized(const UNormalizer2 *norm2,
                          const UChar *s, int32_t length,
                          UErrorCode *pErrorCode);

UNormalizationCheckResult unorm2_quickCheck(const UNormalizer2 *norm2,
                                            const UChar *s, int32_t length,
                                            UErrorCode *pErrorCode);
// Returns: UNORM_YES (normalized), UNORM_NO (not normalized), UNORM_MAYBE (a full check is needed)

// 4. Comparison under canonical equivalence
//    (declared in unorm.h as unorm_compare; there is no unorm2_compare)
int32_t unorm_compare(const UChar *s1, int32_t length1,
                      const UChar *s2, int32_t length2,
                      uint32_t options,
                      UErrorCode *pErrorCode);

Usage examples:

// Example 1: NFC normalization
UChar source[] = u"e\u0301";  // e + combining acute accent
UChar result[100];
UErrorCode status = U_ZERO_ERROR;

const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
int32_t len = unorm2_normalize(nfc, source, -1, result, 100, &status);
// result = "é" (U+00E9) - a single code point

// Example 2: quick check
// (text, normalized and capacity are supplied by the caller)
if (unorm2_isNormalized(nfc, text, -1, &status)) {
    // The text is already in NFC; nothing to do
} else {
    // Normalization is needed
    unorm2_normalize(nfc, text, -1, normalized, capacity, &status);
}

// Example 3: comparison under canonical equivalence
UChar s1[] = u"é";           // U+00E9 (NFC)
UChar s2[] = u"e\u0301";     // U+0065 + U+0301 (NFD)

int32_t cmp = unorm_compare(s1, -1, s2, -1,
                            U_COMPARE_CODE_POINT_ORDER, &status);
// cmp = 0 (canonically equivalent, so treated as equal)

Java API example:

import android.icu.text.Normalizer;
import android.icu.text.Normalizer2;

// Get normalizer instances
Normalizer2 nfc = Normalizer2.getNFCInstance();
Normalizer2 nfd = Normalizer2.getNFDInstance();

// Normalize
String source = "e\u0301";  // e + combining acute accent
String normalized = nfc.normalize(source);  // → "é"

// Quick check
if (nfc.isNormalized(text)) {
    // Already normalized
}

// Canonical-equivalence comparison (the static compare() lives on Normalizer; 0 = default options)
if (Normalizer.compare(s1, s2, 0) == 0) {
    // Canonically equivalent
}

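3.1.4 Bidirectional Text Algorithm (BiDi)

File location: icu4c/source/common/ubidi*.cpp (8 files)

Core concept:

The BiDi (bidirectional) algorithm handles text that mixes LTR (left-to-right) and RTL (right-to-left) runs, following Unicode UAX #9.

Text direction examples:

Pure English (LTR): "Hello World" → displayed left to right as: Hello World
Pure Arabic (RTL): "مرحبا" → displayed right to left; the first logical character (م) sits at the right edge
Mixed text: "Hello مرحبا World" → needs the BiDi algorithm to reorder the runs for display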
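Once every character has an embedding level (even = LTR, odd = RTL), the reordering step of UAX #9 (rule L2) is mechanical: from the highest level down to the lowest odd level, reverse every maximal run of characters at that level or higher. A minimal sketch of just that step follows. `toy_reorder` is a hypothetical helper; resolving the levels, which is the hard part of the algorithm, is assumed to have been done already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static void toy_reverse(size_t *v, size_t lo, size_t hi) {
    while (lo < hi) {
        size_t t = v[lo];
        v[lo++] = v[hi];
        v[hi--] = t;
    }
}

/* levels[i] is the embedding level of logical position i (even = LTR, odd = RTL).
 * On return, visual[k] is the logical index drawn at visual position k. */
void toy_reorder(const uint8_t *levels, size_t n, size_t *visual) {
    assert(levels != NULL && visual != NULL);
    int maxLevel = 0;
    int minOdd = 256;                      /* sentinel: no odd level seen */
    for (size_t i = 0; i < n; i++) {
        visual[i] = i;
        if (levels[i] > maxLevel) maxLevel = levels[i];
        if ((levels[i] & 1) && levels[i] < minOdd) minOdd = levels[i];
    }
    /* From the highest level down to the lowest odd level, reverse every
     * maximal run of positions whose level is at least the current one. */
    for (int lvl = maxLevel; lvl >= minOdd; lvl--) {
        size_t i = 0;
        while (i < n) {
            if (levels[visual[i]] < lvl) { i++; continue; }
            size_t j = i;
            while (j + 1 < n && levels[visual[j + 1]] >= lvl) j++;
            toy_reverse(visual, i, j);
            i = j + 1;
        }
    }
}
```

For levels {0, 0, 1, 1, 1, 0} only the middle RTL run is reversed, giving the visual order 0, 1, 4, 3, 2, 5. Nested levels work the same way, with inner runs reversed first and then again as part of the enclosing run.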

Core API:

// ubidi.h - BiDi algorithm API

typedef struct UBiDi UBiDi;

// 1. Create a BiDi object
UBiDi* ubidi_open(void);
UBiDi* ubidi_openSized(int32_t maxLength, int32_t maxRunCount,
                       UErrorCode *pErrorCode);

// 2. Set the paragraph text
void ubidi_setPara(UBiDi *pBiDi,
                   const UChar *text, int32_t length,
                   UBiDiLevel paraLevel,  // e.g. UBIDI_DEFAULT_LTR or UBIDI_DEFAULT_RTL
                   UBiDiLevel *embeddingLevels,
                   UErrorCode *pErrorCode);

// 3. Get the overall text direction
UBiDiDirection ubidi_getDirection(const UBiDi *pBiDi);
// Returns: UBIDI_LTR, UBIDI_RTL, UBIDI_MIXED

// 4. Map a logical index to a visual index
int32_t ubidi_getVisualIndex(UBiDi *pBiDi, int32_t logicalIndex,
                              UErrorCode *pErrorCode);

// 5. Get run information (a run is a contiguous stretch of text with one direction)
int32_t ubidi_countRuns(UBiDi *pBiDi, UErrorCode *pErrorCode);
UBiDiDirection ubidi_getVisualRun(UBiDi *pBiDi, int32_t runIndex,
                                   int32_t *pLogicalStart,
                                   int32_t *pLength);

// 6. Write the text in display order
int32_t ubidi_writeReordered(UBiDi *pBiDi,
                              UChar *dest, int32_t destSize,
                              uint16_t options,
                              UErrorCode *pErrorCode);

// 7. Clean up
void ubidi_close(UBiDi *pBiDi);

Usage examples:

// Example 1: basic BiDi processing
UChar text[] = u"Hello مرحبا World";
UErrorCode status = U_ZERO_ERROR;
UBiDi *bidi = ubidi_open();

// Set the text (UBIDI_DEFAULT_LTR: take the base direction from the first
// strong character and fall back to LTR if there is none)
ubidi_setPara(bidi, text, -1, UBIDI_DEFAULT_LTR, NULL, &status);

// Check the direction
UBiDiDirection dir = ubidi_getDirection(bidi);
if (dir == UBIDI_MIXED) {
    // Mixed-direction text: reordering is needed

    // Get the number of runs
    int32_t runCount = ubidi_countRuns(bidi, &status);

    // Walk the runs in visual order
    for (int i = 0; i < runCount; i++) {
        int32_t logicalStart, length;
        UBiDiDirection runDir = ubidi_getVisualRun(bidi, i,
                                                    &logicalStart, &length);
        // Render text[logicalStart..logicalStart+length-1]
        // Direction: runDir (UBIDI_LTR or UBIDI_RTL)
    }
}

// Alternatively, get the reordered text directly
UChar reordered[100];
ubidi_writeReordered(bidi, reordered, 100,
                     UBIDI_DO_MIRRORING | UBIDI_REMOVE_BIDI_CONTROLS,
                     &status);

ubidi_close(bidi);

Java API example:

import android.icu.text.Bidi;

String text = "Hello مرحبا World";
Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);

if (bidi.isMixed()) {
    int runCount = bidi.getRunCount();
    for (int i = 0; i < runCount; i++) {
        int start = bidi.getRunStart(i);
        int limit = bidi.getRunLimit(i);
        int level = bidi.getRunLevel(i);
        boolean isRTL = (level & 1) != 0;

        String run = text.substring(start, limit);
        // Render run with direction: isRTL ? RTL : LTR
    }
}

Practical application:

// Illustrative BiDi rendering loop, in the spirit of what a text widget such as
// Android's TextView has to do. measureText(), drawText(), ALIGN_LEFT and
// ALIGN_RIGHT are placeholders for the caller's rendering layer.
void renderText(const UChar *text, int32_t length, float y) {
    UErrorCode status = U_ZERO_ERROR;
    UBiDi *bidi = ubidi_open();
    ubidi_setPara(bidi, text, length, UBIDI_DEFAULT_LTR, NULL, &status);

    int32_t runCount = ubidi_countRuns(bidi, &status);
    float x = 0;  // current pen position

    for (int i = 0; i < runCount; i++) {
        int32_t start, len;
        UBiDiDirection dir = ubidi_getVisualRun(bidi, i, &start, &len);

        // Text of this run
        const UChar *run = text + start;

        // Measure the run width
        float width = measureText(run, len);

        if (dir == UBIDI_RTL) {
            // RTL run: draw from right to left
            drawText(run, len, x + width, y, ALIGN_RIGHT);
        } else {
            // LTR run: draw from left to right
            drawText(run, len, x, y, ALIGN_LEFT);
        }

        x += width;  // advance to the next run
    }

    ubidi_close(bidi);
}

3.2 Internationalization Module (i18n/)

3.2.1 Date/Time Formatting

File locations:

icu4c/source/i18n/calendar.cpp (144 KB) - calendar system
icu4c/source/i18n/datefmt.cpp (31 KB) - date formatting
icu4c/source/i18n/smpdtfmt.cpp (179 KB) - SimpleDateFormat implementation

Supported calendar systems (11):

typedef enum UCalendarType {
    UCAL_TRADITIONAL,        // the locale's traditional calendar (the default)
    UCAL_GREGORIAN,          // Gregorian calendar

    // Other calendar systems are selected with the locale keyword
    // "@calendar=..." together with UCAL_TRADITIONAL:
    // - Buddhist
    // - Chinese (lunar)
    // - Coptic
    // - Ethiopic
    // - Hebrew
    // - Indian (Indian national calendar)
    // - Islamic
    // - Japanese
    // - Persian
} UCalendarType;

Core API:

// 1. Calendar operations
UCalendar* ucal_open(const UChar *zoneID, int32_t len,
                     const char *locale,
                     UCalendarType type,
                     UErrorCode *status);

// Set/get fields
void ucal_set(UCalendar *cal, UCalendarDateFields field, int32_t value);
int32_t ucal_get(const UCalendar *cal, UCalendarDateFields field,
                 UErrorCode *status);

// Field types (abridged, so positions here do not match the real numeric
// values; see ucal.h for the full list)
typedef enum UCalendarDateFields {
    UCAL_ERA,                // era
    UCAL_YEAR,               // year
    UCAL_MONTH,              // month (0-based!)
    UCAL_WEEK_OF_YEAR,       // week of year
    UCAL_DATE,               // day of month
    UCAL_DAY_OF_YEAR,        // day of year
    UCAL_DAY_OF_WEEK,        // day of week
    UCAL_HOUR_OF_DAY,        // hour (0-23)
    UCAL_MINUTE,             // minute
    UCAL_SECOND,             // second
    UCAL_MILLISECOND,        // millisecond
    // ... further fields omitted
} UCalendarDateFields;

// 2. Date formatting
UDateFormat* udat_open(UDateFormatStyle timeStyle,
                       UDateFormatStyle dateStyle,
                       const char *locale,
                       const UChar *tzID, int32_t tzIDLength,
                       const UChar *pattern, int32_t patternLength,
                       UErrorCode *status);

// Format styles
typedef enum UDateFormatStyle {
    UDAT_FULL,              // full
    UDAT_LONG,              // long
    UDAT_MEDIUM,            // medium
    UDAT_SHORT,             // short
    UDAT_NONE = -1,         // omit this part (no date or no time)
    UDAT_PATTERN = -2       // use a custom pattern
} UDateFormatStyle;

// Format a date
int32_t udat_format(const UDateFormat *format,
                    UDate dateToFormat,
                    UChar *result, int32_t resultLength,
                    UFieldPosition *position,
                    UErrorCode *status);

// Parse a date
UDate udat_parse(const UDateFormat *format,
                 const UChar *text, int32_t textLength,
                 int32_t *parsePos,
                 UErrorCode *status);

Usage examples:

// Example 1: basic date formatting in different locales
// (release each formatter with udat_close() and each calendar with ucal_close() when done)
UErrorCode status = U_ZERO_ERROR;
UDate now = ucal_getNow();  // current time (milliseconds since the epoch)

// English - full format
UDateFormat *fmt_en = udat_open(UDAT_FULL, UDAT_FULL, "en_US",
                                 NULL, 0, NULL, 0, &status);
UChar result_en[100];
udat_format(fmt_en, now, result_en, 100, NULL, &status);
// "Wednesday, January 29, 2025 at 3:45:30 PM China Standard Time"

// Chinese - full format
UDateFormat *fmt_zh = udat_open(UDAT_FULL, UDAT_FULL, "zh_CN",
                                 NULL, 0, NULL, 0, &status);
UChar result_zh[100];
udat_format(fmt_zh, now, result_zh, 100, NULL, &status);
// "2025年1月29日星期三 中国标准时间 15:45:30"

// Arabic - full format
UDateFormat *fmt_ar = udat_open(UDAT_FULL, UDAT_FULL, "ar_SA",
                                 NULL, 0, NULL, 0, &status);
UChar result_ar[100];
udat_format(fmt_ar, now, result_ar, 100, NULL, &status);
// "الأربعاء، ٢٩ يناير ٢٠٢٥ في ٣:٤٥:٣٠ م توقيت الصين الرسمي"

// Example 2: custom pattern
UChar pattern[] = u"yyyy-MM-dd HH:mm:ss";
UDateFormat *fmt = udat_open(UDAT_PATTERN, UDAT_PATTERN, "en_US",
                              NULL, 0, pattern, -1, &status);
UChar result[50];
udat_format(fmt, now, result, 50, NULL, &status);
// "2025-01-29 15:45:30"

// Example 3: different calendar systems
// Gregorian
UCalendar *cal_gregorian = ucal_open(NULL, 0, "en_US@calendar=gregorian",
                                      UCAL_GREGORIAN, &status);
// Buddhist calendar (Thailand)
UCalendar *cal_buddhist = ucal_open(NULL, 0, "th_TH@calendar=buddhist",
                                     UCAL_TRADITIONAL, &status);
ucal_setMillis(cal_buddhist, now, &status);
int32_t year = ucal_get(cal_buddhist, UCAL_YEAR, &status);
// year = 2568 (2025 CE = 2568 BE)

// Chinese lunar calendar
UCalendar *cal_chinese = ucal_open(NULL, 0, "zh_CN@calendar=chinese",
                                    UCAL_TRADITIONAL, &status);
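To make the relationship between a `UDate` (milliseconds since the epoch) and calendar fields concrete, here is an ICU-free sketch that splits an instant into UTC Gregorian fields with the well-known civil-from-days arithmetic, then derives the Buddhist Era year. The names `ToyFields`, `toy_fields_from_millis`, and `toy_buddhist_year` are hypothetical. Time zones, other calendars, and the Julian/Gregorian cutover that ICU handles are ignored.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int year;       /* Gregorian year */
    int month;      /* 0-based, like UCAL_MONTH (0 = January) */
    int day;        /* day of month, 1-based */
    int dayOfWeek;  /* 1 = Sunday ... 7 = Saturday, like UCAL_DAY_OF_WEEK */
} ToyFields;

/* Splits milliseconds since 1970-01-01T00:00Z into UTC calendar fields. */
ToyFields toy_fields_from_millis(int64_t millis) {
    const int64_t kDay = 86400000;
    int64_t days = millis / kDay;
    if (millis % kDay < 0) days--;                 /* floor for pre-1970 instants */

    /* Shift the epoch to 0000-03-01 so that leap days fall at the end of a year. */
    int64_t z = days + 719468;
    int64_t era = (z >= 0 ? z : z - 146096) / 146097;          /* 400-year cycles */
    int64_t doe = z - era * 146097;                            /* day of era */
    int64_t yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
    int64_t doy = doe - (365 * yoe + yoe / 4 - yoe / 100);     /* March-based day of year */
    int64_t mp = (5 * doy + 2) / 153;                          /* 0 = March */

    ToyFields f;
    f.day = (int)(doy - (153 * mp + 2) / 5 + 1);
    int month1 = (int)(mp < 10 ? mp + 3 : mp - 9);             /* 1..12 */
    f.month = month1 - 1;
    f.year = (int)(yoe + era * 400) + (month1 <= 2 ? 1 : 0);
    f.dayOfWeek = (int)(((days % 7) + 7 + 4) % 7) + 1;         /* 1970-01-01 was a Thursday */
    assert(f.day >= 1 && f.day <= 31);
    return f;
}

/* Buddhist Era year for a Gregorian (CE) year. */
int toy_buddhist_year(int gregorianYear) {
    return gregorianYear + 543;
}
```

Note the 0-based month, which mirrors `UCAL_MONTH`: 29 January 2025 comes back as month 0, and the Buddhist year is 2568, in line with Example 3.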

Java API example:

import android.icu.text.DateFormat;
import android.icu.text.SimpleDateFormat;
import android.icu.util.Calendar;
import android.icu.util.ChineseCalendar;
import android.icu.util.ULocale;
import java.util.Date;

// Formatting in different locales
Date now = new Date();

DateFormat df_en = DateFormat.getDateTimeInstance(
    DateFormat.FULL, DateFormat.FULL, new ULocale("en_US"));
String str_en = df_en.format(now);
// "Wednesday, January 29, 2025 at 3:45:30 PM China Standard Time"

DateFormat df_zh = DateFormat.getDateTimeInstance(
    DateFormat.FULL, DateFormat.FULL, new ULocale("zh_CN"));
String str_zh = df_zh.format(now);
// "2025年1月29日星期三 中国标准时间 15:45:30"

// Custom pattern
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss",
                                             new ULocale("en_US"));
String formatted = sdf.format(now);

// Chinese lunar date
ChineseCalendar chineseCal = new ChineseCalendar();
chineseCal.setTime(now);
int year = chineseCal.get(Calendar.YEAR);      // lunar year within the 60-year cycle
int month = chineseCal.get(Calendar.MONTH);    // lunar month (0-based)
int day = chineseCal.get(Calendar.DATE);       // lunar day of month

Time zone support (several hundred IANA time zone IDs):

// Example time zone IDs:
// "America/New_York", "Europe/London", "Asia/Shanghai", "Asia/Tokyo"
// "GMT+8", "UTC", "PST", "EST" (custom GMT offsets and legacy short IDs)

// Enumerate the available time zones
UEnumeration* ucal_openTimeZones(UErrorCode *status);
const UChar* uenum_unext(UEnumeration *en, int32_t *resultLength,
                         UErrorCode *status);

// Select a time zone
UChar tzID[] = u"Asia/Shanghai";
UCalendar *cal = ucal_open(tzID, -1, "zh_CN", UCAL_GREGORIAN, &status);
3.2.2 Number/Currency Formatting

File locations:

icu4c/source/i18n/numfmt.cpp (28 KB) - NumberFormat base class
icu4c/source/i18n/decimfmt.cpp (236 KB) - DecimalFormat implementation
icu4c/source/i18n/currfmt.cpp (13 KB) - CurrencyFormat

Core API:

// 1. Create a number formatter
UNumberFormat* unum_open(UNumberFormatStyle style,
                         const UChar *pattern, int32_t patternLength,
                         const char *locale,
                         UParseError *parseErr,
                         UErrorCode *status);

// Format styles (abridged; values as declared in unum.h)
typedef enum UNumberFormatStyle {
    UNUM_PATTERN_DECIMAL = 0,        // decimal format defined by a custom pattern
    UNUM_DECIMAL = 1,                // decimal number
    UNUM_CURRENCY = 2,               // currency
    UNUM_PERCENT = 3,                // percent
    UNUM_SCIENTIFIC = 4,             // scientific notation
    UNUM_SPELLOUT = 5,               // spelled out (one, two, three...)
    UNUM_ORDINAL = 6,                // ordinal (1st, 2nd, 3rd...)
    UNUM_DURATION = 7,               // duration
    UNUM_CURRENCY_ISO = 10,          // currency with ISO code
    UNUM_CURRENCY_PLURAL = 11,       // currency with plural name
    UNUM_CURRENCY_ACCOUNTING = 12,   // accounting currency format
    UNUM_DECIMAL_COMPACT_SHORT = 14  // compact decimal, short form
} UNumberFormatStyle;

// 2. Format a number
int32_t unum_formatDouble(const UNumberFormat *fmt,
                          double number,
                          UChar *result, int32_t resultLength,
                          UFieldPosition *pos,
                          UErrorCode *status);

int32_t unum_formatInt64(const UNumberFormat *fmt,
                         int64_t number,
                         UChar *result, int32_t resultLength,
                         UFieldPosition *pos,
                         UErrorCode *status);

// 3. Parse a number
double unum_parseDouble(const UNumberFormat *fmt,
                        const UChar *text, int32_t textLength,
                        int32_t *parsePos,
                        UErrorCode *status);

// 4. Set a text attribute (for example the currency code)
void unum_setTextAttribute(UNumberFormat *fmt,
                           UNumberFormatTextAttribute attr,
                           const UChar *newValue, int32_t newValueLength,
                           UErrorCode *status);

Usage examples:

// Example 1: basic number formatting in different locales
UErrorCode status = U_ZERO_ERROR;
double value = 1234567.89;

// English (United States) - comma as the grouping separator
UNumberFormat *fmt_en = unum_open(UNUM_DECIMAL, NULL, 0, "en_US",
                                   NULL, &status);
UChar result_en[100];
unum_formatDouble(fmt_en, value, result_en, 100, NULL, &status);
// "1,234,567.89"

// German (Germany) - period for grouping, comma as the decimal separator
UNumberFormat *fmt_de = unum_open(UNUM_DECIMAL, NULL, 0, "de_DE",
                                   NULL, &status);
UChar result_de[100];
unum_formatDouble(fmt_de, value, result_de, 100, NULL, &status);
// "1.234.567,89"

// Arabic (Saudi Arabia) - Arabic-Indic digits
UNumberFormat *fmt_ar = unum_open(UNUM_DECIMAL, NULL, 0, "ar_SA",
                                   NULL, &status);
UChar result_ar[100];
unum_formatDouble(fmt_ar, value, result_ar, 100, NULL, &status);
// "١٬٢٣٤٬٥٦٧٫٨٩"

// Example 2: currency formatting
// US dollar
UNumberFormat *fmt_usd = unum_open(UNUM_CURRENCY, NULL, 0, "en_US",
                                    NULL, &status);
unum_setTextAttribute(fmt_usd, UNUM_CURRENCY_CODE, u"USD", 3, &status);
UChar result_usd[100];
unum_formatDouble(fmt_usd, value, result_usd, 100, NULL, &status);
// "$1,234,567.89"

// Chinese yuan
UNumberFormat *fmt_cny = unum_open(UNUM_CURRENCY, NULL, 0, "zh_CN",
                                    NULL, &status);
unum_setTextAttribute(fmt_cny, UNUM_CURRENCY_CODE, u"CNY", 3, &status);
UChar result_cny[100];
unum_formatDouble(fmt_cny, value, result_cny, 100, NULL, &status);
// "¥1,234,567.89"

// Example 3: percent formatting
UNumberFormat *fmt_pct = unum_open(UNUM_PERCENT, NULL, 0, "en_US",
                                    NULL, &status);
UChar result_pct[100];
unum_formatDouble(fmt_pct, 0.8952, result_pct, 100, NULL, &status);
// "90%" (rounded)

// Example 4: scientific notation
UNumberFormat *fmt_sci = unum_open(UNUM_SCIENTIFIC, NULL, 0, "en_US",
                                    NULL, &status);
UChar result_sci[100];
unum_formatDouble(fmt_sci, value, result_sci, 100, NULL, &status);
// "1.23456789E6"

// Example 5: custom pattern
UChar pattern[] = u"#,##0.00";  // exactly 2 fraction digits
UNumberFormat *fmt_custom = unum_open(UNUM_PATTERN_DECIMAL,
                                       pattern, -1, "en_US",
                                       NULL, &status);
UChar result_custom[100];
unum_formatDouble(fmt_custom, 123.4, result_custom, 100, NULL, &status);
// "123.40"

// Example 6: compact format (CompactDecimal)
UNumberFormat *fmt_compact = unum_open(UNUM_DECIMAL_COMPACT_SHORT,
                                        NULL, 0, "en_US", NULL, &status);
UChar result_compact[100];
unum_formatDouble(fmt_compact, 1234567, result_compact, 100, NULL, &status);
// "1.2M"

// Chinese compact format
UNumberFormat *fmt_compact_zh = unum_open(UNUM_DECIMAL_COMPACT_SHORT,
                                           NULL, 0, "zh_CN", NULL, &status);
UChar result_compact_zh[100];
unum_formatDouble(fmt_compact_zh, 12345, result_compact_zh, 100, NULL, &status);
// "1.2万"

Java API example:

import android.icu.text.NumberFormat;
import android.icu.text.DecimalFormat;
import android.icu.text.CompactDecimalFormat;
import android.icu.util.Currency;
import android.icu.util.ULocale;

double value = 1234567.89;

// Number formats in different locales
NumberFormat nf_en = NumberFormat.getInstance(new ULocale("en_US"));
String str_en = nf_en.format(value);  // "1,234,567.89"

NumberFormat nf_de = NumberFormat.getInstance(new ULocale("de_DE"));
String str_de = nf_de.format(value);  // "1.234.567,89"

// Currency format
NumberFormat cf_usd = NumberFormat.getCurrencyInstance(new ULocale("en_US"));
cf_usd.setCurrency(Currency.getInstance("USD"));
String str_usd = cf_usd.format(value);  // "$1,234,567.89"

// Percent
NumberFormat pf = NumberFormat.getPercentInstance(new ULocale("en_US"));
String str_pct = pf.format(0.8952);  // "90%"

// Compact format
CompactDecimalFormat cdf = CompactDecimalFormat.getInstance(
    new ULocale("en_US"), CompactDecimalFormat.CompactStyle.SHORT);
String str_compact = cdf.format(1234567);  // "1.2M"

CompactDecimalFormat cdf_zh = CompactDecimalFormat.getInstance(
    new ULocale("zh_CN"), CompactDecimalFormat.CompactStyle.SHORT);
String str_compact_zh = cdf_zh.format(12345);  // "1.2万"

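Currency symbol examples (180+ currencies):

USD → $ (en_US), US$ (other locales)
CNY → ¥ (zh_CN), CN¥ (other locales)
EUR → € (eurozone)
GBP → £ (pound sterling)
JPY → ¥ (ja_JP), JP¥ (other locales)
KRW → ₩ (South Korean won)
RUB → ₽ (Russian ruble)
INR → ₹ (Indian rupee)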

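3.2.3 Sorting (Collation)

File locations:

icu4c/source/i18n/coll.cpp (52 KB) - Collator base class
icu4c/source/i18n/ucol.cpp (94 KB) - Collation C API
icu4c/source/i18n/ucol_sit.cpp (70 KB) - collator short definition strings (ucol_openFromShortString and related functions)

Core concept:

Collation defines the ordering rules for strings; different languages have different sorting conventions. ICU implements the UCA (Unicode Collation Algorithm).

Comparison strength levels:

typedef enum UCollationStrength {
    UCOL_PRIMARY = 0,      // primary differences only: ignores accents and case
                           // 'a' = 'A' = 'à' = 'Á'

    UCOL_SECONDARY = 1,    // secondary differences: accents count, case is ignored
                           // 'a' = 'A' < 'à' = 'À'

    UCOL_TERTIARY = 2,     // tertiary differences: case counts too (default)
                           // 'a' < 'A' < 'à' < 'À'

    UCOL_QUATERNARY = 3,   // quaternary differences: takes punctuation into account
    UCOL_IDENTICAL = 15    // identical: code point comparison as the final tie-breaker
} UCollationStrength;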
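The level-by-level behaviour behind these strengths can be sketched with toy weights: compare all primary weights first, and only if they tie move on to secondary, then tertiary, stopping at the requested strength. Everything below is illustrative. `toy_weights` and `toy_compare` are hypothetical, the input is Latin-1 with a handful of accented letters, and each character maps to exactly one weight per level. ICU's real weights come from collation data and can span several characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_PRIMARY = 0, TOY_SECONDARY = 1, TOY_TERTIARY = 2 } ToyStrength;

/* w[0] = base letter, w[1] = accent, w[2] = case. */
typedef struct { uint8_t w[3]; } ToyWeights;

/* Toy weights for Latin-1 bytes; ICU derives real weights from collation data. */
static ToyWeights toy_weights(unsigned char c) {
    ToyWeights r = { { c, 0, 0 } };
    unsigned char base = c;
    switch (c) {
        case 0xC0: case 0xC1: base = 'A'; r.w[1] = 1; break;  /* À Á */
        case 0xC8: case 0xC9: base = 'E'; r.w[1] = 1; break;  /* È É */
        case 0xE0: case 0xE1: base = 'a'; r.w[1] = 1; break;  /* à á */
        case 0xE8: case 0xE9: base = 'e'; r.w[1] = 1; break;  /* è é */
        default: break;
    }
    if (base >= 'A' && base <= 'Z') {
        r.w[2] = 1;                       /* uppercase sorts after lowercase */
        base = (unsigned char)(base - 'A' + 'a');
    }
    r.w[0] = base;
    return r;
}

/* Compares level by level: a lower level is consulted only if every higher
 * level ties. Returns -1, 0 or 1. Inputs are NUL-terminated Latin-1 strings. */
int toy_compare(const char *a, const char *b, ToyStrength strength) {
    assert(a != NULL && b != NULL);
    for (int level = 0; level <= (int)strength; level++) {
        size_t i = 0;
        while (a[i] != '\0' && b[i] != '\0') {
            uint8_t wa = toy_weights((unsigned char)a[i]).w[level];
            uint8_t wb = toy_weights((unsigned char)b[i]).w[level];
            if (wa != wb) return wa < wb ? -1 : 1;
            i++;
        }
        /* One weight per character, so a length difference is a primary difference. */
        if (a[i] != b[i]) return a[i] == '\0' ? -1 : 1;
    }
    return 0;
}
```

With these weights "cafe" and "café" are equal at primary strength and differ at secondary strength, while "a" and "A" only differ once tertiary strength is requested, matching the comments in the enum.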

Core API:

// 1. Create a collator
UCollator* ucol_open(const char *loc, UErrorCode *status);

// 2. Compare strings
UCollationResult ucol_strcoll(const UCollator *coll,
                               const UChar *source, int32_t sourceLength,
                               const UChar *target, int32_t targetLength);
// Returns: UCOL_LESS (-1), UCOL_EQUAL (0), UCOL_GREATER (1)

// 3. Set the comparison strength
void ucol_setStrength(UCollator *coll, UCollationStrength strength);

// 4. Generate a sort key
int32_t ucol_getSortKey(const UCollator *coll,
                        const UChar *source, int32_t sourceLength,
                        uint8_t *result, int32_t resultLength);

// 5. Close the collator
void ucol_close(UCollator *coll);

使用示例:

c 复制代码

// 示例 1: 基本字符串比较
UErrorCode status = U_ZERO_ERROR;
UCollator *coll_en = ucol_open("en_US", &status);

UChar s1[] = u"apple";
UChar s2[] = u"banana";

UCollationResult result = ucol_strcoll(coll_en, s1, -1, s2, -1);
// result = UCOL_LESS (apple < banana)

// 示例 2: 不同强度比较
UChar a1[] = u"café";
UChar a2[] = u"cafe";

ucol_setStrength(coll_en, UCOL_PRIMARY);
result = ucol_strcoll(coll_en, a1, -1, a2, -1);
// result = UCOL_EQUAL (忽略重音)

ucol_setStrength(coll_en, UCOL_SECONDARY);
result = ucol_strcoll(coll_en, a1, -1, a2, -1);
// result = UCOL_GREATER (café > cafe,考虑重音)

// Example 3: language-specific ordering
// German: ä sorts with a; the umlaut is only a secondary difference
UCollator *coll_de = ucol_open("de_DE", &status);
UChar g1[] = u"Ärger";
UChar g2[] = u"Argument";
result = ucol_strcoll(coll_de, g1, -1, g2, -1);
// result = UCOL_LESS (the primary comparison 'e' < 'u' decides before the umlaut matters)

// Swedish: ä is a separate letter that sorts after z
UCollator *coll_sv = ucol_open("sv_SE", &status);
result = ucol_strcoll(coll_sv, g1, -1, g2, -1);
// result = UCOL_GREATER (Ärger > Argument, ä sits near the end of the alphabet)

// Chinese: pinyin order
UCollator *coll_zh = ucol_open("zh_CN", &status);
UChar c1[] = u"北京";
UChar c2[] = u"上海";
result = ucol_strcoll(coll_zh, c1, -1, c2, -1);
// result = UCOL_LESS (běi < shàng)

// Chinese: stroke order
UCollator *coll_zh_stroke = ucol_open("zh_CN@collation=stroke", &status);
result = ucol_strcoll(coll_zh_stroke, c1, -1, c2, -1);
// result = UCOL_GREATER (compared character by character: 北 has 5 strokes, 上 has 3)

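// Example 4: generating sort keys (for example for a database index)
uint8_t key1[100], key2[100];
int32_t len1 = ucol_getSortKey(coll_en, u"apple", -1, key1, sizeof(key1));
int32_t len2 = ucol_getSortKey(coll_en, u"banana", -1, key2, sizeof(key2));
// A return value larger than the buffer size means the key did not fit

// Sort keys are zero-terminated byte strings, so strcmp() (from <string.h>)
// orders them correctly. Comparing only the shared prefix with memcmp() would
// report equality when one key is a prefix of the other.
// Keys pay off when the same strings are compared many times.
int cmp = strcmp((const char *)key1, (const char *)key2);
// cmp < 0 → apple < banana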
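
The sort key idea can be sketched in Java as well. This is a minimal illustration under stated assumptions: it uses the JDK's `java.text.CollationKey` rather than ICU, and `SortKeyDemo` with its two methods is a hypothetical name chosen for this example. Each string is converted to a key once, and the sort then compares keys instead of re-running the collation algorithm for every pair.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public final class SortKeyDemo {
    /** Sorts by precomputed collation keys (one key per string). */
    static List<String> sortWithKeys(List<String> input) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        CollationKey[] keys = new CollationKey[input.size()];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = collator.getCollationKey(input.get(i));
        }
        Arrays.sort(keys);  // CollationKey is Comparable; this compares key bytes
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    /** Reference implementation: calls Collator.compare for every pair. */
    static List<String> sortWithCompare(List<String> input) {
        List<String> sorted = new ArrayList<>(input);
        sorted.sort(Collator.getInstance(Locale.ENGLISH));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = List.of("banana", "Cherry", "apple");
        System.out.println(sortWithKeys(words));
        System.out.println(sortWithCompare(words));
    }
}
```

Both methods should produce the same order; the key-based variant trades extra memory for cheaper comparisons.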

Java API example:

import android.icu.text.Collator;
import android.icu.util.ULocale;

// Basic comparison
Collator coll_en = Collator.getInstance(new ULocale("en_US"));
int result = coll_en.compare("apple", "banana");  // < 0

// Set the strength
coll_en.setStrength(Collator.PRIMARY);
result = coll_en.compare("café", "cafe");  // = 0 (accents are ignored)

// Chinese pinyin order
Collator coll_zh = Collator.getInstance(new ULocale("zh_CN"));
result = coll_zh.compare("北京", "上海");  // < 0

// Chinese stroke order
Collator coll_stroke = Collator.getInstance(
    new ULocale("zh_CN@collation=stroke"));
result = coll_stroke.compare("北京", "上海");  // > 0

// Generate a sort key (getCollationKey is declared on Collator, no cast needed)
byte[] key = coll_en.getCollationKey("apple").toByteArray();

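Performance optimization - Fast Latin:

// collationfastlatin.cpp - fast path for Latin text
// Simplified pseudocode: the function names below are illustrative,
// not the actual ICU symbols.
// When both strings stay within the Latin range covered by a precomputed
// table, the comparison skips the general UCA machinery.

static inline UCollationResult
ucol_strcollFastLatin(const UChar *source, int32_t sourceLength,
                      const UChar *target, int32_t targetLength) {
    // Quick check: is every character covered by the fast table?
    if (isFastLatin(source, sourceLength) &&
        isFastLatin(target, targetLength)) {
        // Compare through the precomputed lookup table
        return compareFastLatin(source, sourceLength,
                                target, targetLength);
    }
    // Fall back to the full UCA algorithm
    return ucol_strcollRegular(source, sourceLength,
                               target, targetLength);
}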
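
The dispatch pattern above (cheap check, fast path, general fallback) can be shown in a few lines of Java. This is only a sketch of the pattern, not of ICU's table format: `FastPathCompare` is a hypothetical class, the fast path is deliberately restricted to strings of lowercase ASCII letters (where plain code unit order agrees with English alphabetical order), and the fallback is the JDK's `java.text.Collator`.

```java
import java.text.Collator;
import java.util.Locale;

public final class FastPathCompare {
    private static final Collator FULL = Collator.getInstance(Locale.ENGLISH);

    /** True if the string contains only the letters a-z. */
    static boolean isLowerAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 'a' || c > 'z') {
                return false;
            }
        }
        return true;
    }

    static boolean usesFastPath(String a, String b) {
        return isLowerAscii(a) && isLowerAscii(b);
    }

    /** Returns -1, 0 or 1, taking the fast path whenever both inputs qualify. */
    static int compare(String a, String b) {
        if (usesFastPath(a, b)) {
            // For a-z only, code unit order equals alphabetical order.
            return Integer.signum(a.compareTo(b));
        }
        // Anything else goes through the full collation algorithm.
        return Integer.signum(FULL.compare(a, b));
    }

    public static void main(String[] args) {
        System.out.println(compare("apple", "banana"));
        System.out.println(compare("cafe", "caf\u00e9"));
    }
}
```

The important property is that the fast path must return exactly what the slow path would; that is why the eligibility check is conservative.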

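3.2.4 Regular Expression Engine

File locations:

icu4c/source/i18n/regexcmp.cpp (111 KB) - regular expression compiler
icu4c/source/i18n/rematch.cpp (195 KB) - matching engine
icu4c/source/i18n/repattrn.cpp (41 KB) - pattern object

Key features:

Syntax: Perl-style, close to java.util.regex
Unicode support: \p{Property}, \p{Script=Han}, \p{Block=CJK}
Engine: backtracking NFA (the pattern is compiled to an internal instruction sequence)
Performance: designed for Unicode text; compiled patterns can be reused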

Core API:

// 1. Compile a regular expression
URegularExpression* uregex_open(const UChar *pattern, int32_t patternLength,
                                 uint32_t flags,
                                 UParseError *pe,
                                 UErrorCode *status);

// Flags (values of enum URegexpFlag)
UREGEX_CASE_INSENSITIVE = 2,    // case-insensitive matching
UREGEX_COMMENTS         = 4,    // allow whitespace and #comments in the pattern
UREGEX_MULTILINE        = 8,    // ^ and $ match at line boundaries
UREGEX_DOTALL           = 32,   // . also matches line terminators
UREGEX_UWORD            = 256   // \b uses Unicode word boundaries

// 2. Set the input text
void uregex_setText(URegularExpression *regexp,
                    const UChar *text, int32_t textLength,
                    UErrorCode *status);

// 3. Find matches
UBool uregex_find(URegularExpression *regexp,
                  int32_t startIndex,
                  UErrorCode *status);

UBool uregex_findNext(URegularExpression *regexp,
                      UErrorCode *status);

// 4. Get match positions
int32_t uregex_start(URegularExpression *regexp,
                     int32_t groupNum,
                     UErrorCode *status);

int32_t uregex_end(URegularExpression *regexp,
                   int32_t groupNum,
                   UErrorCode *status);

// 5. Extract a capture group
int32_t uregex_group(URegularExpression *regexp,
                     int32_t groupNum,
                     UChar *dest, int32_t destCapacity,
                     UErrorCode *status);

// 6. Replace
int32_t uregex_replaceAll(URegularExpression *regexp,
                          const UChar *replacementText,
                          int32_t replacementLength,
                          UChar *destBuf, int32_t destCapacity,
                          UErrorCode *status);

// 7. Close
void uregex_close(URegularExpression *regexp);

Usage examples:

// Example 1: basic matching
UErrorCode status = U_ZERO_ERROR;
UParseError pe;

UChar pattern[] = u"\\d{3}-\\d{4}";  // phone number such as 123-4567
URegularExpression *regex = uregex_open(pattern, -1, 0, &pe, &status);

UChar text[] = u"My number is 123-4567 and 890-1234";
uregex_setText(regex, text, -1, &status);

if (uregex_find(regex, 0, &status)) {
    int32_t start = uregex_start(regex, 0, &status);
    int32_t end = uregex_end(regex, 0, &status);
    // start = 13, end = 21, matched = "123-4567"

    UChar match[100];
    uregex_group(regex, 0, match, 100, &status);
    // match = "123-4567"
}

// Find the next match
if (uregex_findNext(regex, &status)) {
    UChar match[100];
    uregex_group(regex, 0, match, 100, &status);
    // match = "890-1234"
}

uregex_close(regex);

// Example 2: Unicode character classes
// Match runs of Han characters
UChar pattern_han[] = u"\\p{Script=Han}+";
URegularExpression *regex_han = uregex_open(pattern_han, -1, 0, &pe, &status);

UChar text_mixed[] = u"Hello世界Test测试";
uregex_setText(regex_han, text_mixed, -1, &status);

while (uregex_findNext(regex_han, &status)) {
    UChar match[100];
    uregex_group(regex_han, 0, match, 100, &status);
    // 1st iteration: match = "世界"
    // 2nd iteration: match = "测试"
}

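// Example 3: matching on Unicode properties
// \p{L} = all letters
// \p{N} = all numbers
// \p{P} = all punctuation
// \p{Script=Han} = Han characters
// \p{Script=Latin} = Latin letters
// \p{Block=CJK} = the CJK Unified Ideographs block

UChar pattern_prop[] = u"\\p{L}+";  // runs of letters in any script
URegularExpression *regex_prop = uregex_open(pattern_prop, -1, 0, &pe, &status);

// The words are separated by spaces; without a separator \p{L}+ would
// match the whole string as a single run of letters
UChar text_multi[] = u"Hello 世界 مرحبا";
uregex_setText(regex_prop, text_multi, -1, &status);

while (uregex_findNext(regex_prop, &status)) {
    // 1st iteration: "Hello"
    // 2nd iteration: "世界"
    // 3rd iteration: "مرحبا" (Arabic)
}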

// Example 4: capture groups
UChar pattern_email[] = u"([\\w.]+)@([\\w.]+)";
URegularExpression *regex_email = uregex_open(pattern_email, -1, 0, &pe, &status);

UChar text_email[] = u"Email: user@example.com";
uregex_setText(regex_email, text_email, -1, &status);

if (uregex_find(regex_email, 0, &status)) {
    UChar group0[100], group1[100], group2[100];

    uregex_group(regex_email, 0, group0, 100, &status);  // "user@example.com"
    uregex_group(regex_email, 1, group1, 100, &status);  // "user"
    uregex_group(regex_email, 2, group2, 100, &status);  // "example.com"
}

// Example 5: replacement
UChar pattern_space[] = u"\\s+";
URegularExpression *regex_space = uregex_open(pattern_space, -1, 0, &pe, &status);

UChar text_spaces[] = u"too   many    spaces";
uregex_setText(regex_space, text_spaces, -1, &status);

UChar result[100];
UChar replacement[] = u" ";
int32_t len = uregex_replaceAll(regex_space, replacement, -1,
                                 result, 100, &status);
// result = "too many spaces"

Java API example:

import android.icu.text.UnicodeSet;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

// Basic matching
Pattern pattern = Pattern.compile("\\d{3}-\\d{4}");
Matcher matcher = pattern.matcher("My number is 123-4567");
if (matcher.find()) {
    String match = matcher.group();  // "123-4567"
    int start = matcher.start();     // 13
    int end = matcher.end();         // 21
}

// Unicode script property (no extra flag is needed for \p{Script=...};
// UNICODE_CHARACTER_CLASS only changes predefined classes such as \w and \d)
Pattern pattern_han = Pattern.compile("\\p{Script=Han}+");
Matcher matcher_han = pattern_han.matcher("Hello世界Test");
while (matcher_han.find()) {
    System.out.println(matcher_han.group());  // "世界"
}

// ICU UnicodeSet (more expressive)
UnicodeSet hanSet = new UnicodeSet("[[:Script=Han:]]");
String text = "Hello世界Test测试";
for (int i = 0; i < text.length(); ) {
    int cp = text.codePointAt(i);
    if (hanSet.contains(cp)) {
        // cp is a Han character
    }
    i += Character.charCount(cp);
}

// Capture groups
Pattern pattern_email = Pattern.compile("([\\w.]+)@([\\w.]+)");
Matcher matcher_email = pattern_email.matcher("user@example.com");
if (matcher_email.find()) {
    String full = matcher_email.group(0);   // "user@example.com"
    String user = matcher_email.group(1);   // "user"
    String domain = matcher_email.group(2); // "example.com"
}

// Replacement
String result = "too   many    spaces".replaceAll("\\s+", " ");  // "too many spaces"

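Regular expression engine implementation (simplified):

// The code below is simplified pseudocode for illustration, not the actual ICU source.

// regexcmp.cpp - compiles a pattern into an internal instruction sequence
class RegexCompile {
    // Compilation steps:
    // 1. Parse the pattern
    // 2. Apply simple optimizations (for example merging adjacent literals
    //    and specializing common loop forms)
    // 3. Emit instructions for the matching engine

    void compilePattern(const UnicodeString &pattern) {
        // Examples of emitted opcodes
        emit(URX_NOP);                    // no operation
        emit(URX_BACKTRACK);              // force a backtrack
        emit(URX_STRING_LEN, length);     // literal string match
        emit(URX_CTR_INIT, min, max);     // counted loop {m,n}
        emit(URX_BACKSLASH_X);            // \X (extended grapheme cluster)
        emit(URX_SETREF, setIndex);       // character set, including \p{Property}
    }
};

// rematch.cpp - matching engine
class RegexMatcher {
    // Backtracking NFA matcher
    UBool matchAt(int64_t startIdx) {
        int64_t fp = 0;  // index of the next instruction

        for (;;) {
            int32_t op = fPattern->fCompiledPat->elementAti(fp++);

            switch (URX_TYPE(op)) {  // the opcode type sits in the high bits
            case URX_STRING_LEN:
                // Match a fixed-length literal
                if (memcmp(input + inputIdx,
                           patternString, length) == 0) {
                    inputIdx += length;
                } else {
                    backtrack();  // restore the most recent saved state
                }
                break;

            case URX_SETREF:
                // Character set match (Unicode properties are compiled to sets)
                if (propertySet->contains(input[inputIdx])) {
                    inputIdx++;
                } else {
                    backtrack();
                }
                break;

            // ... other opcodes
            }
        }
    }
};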
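
To make the compile-then-backtrack idea concrete, here is a tiny self-contained virtual machine in Java. It is a teaching sketch, not ICU's design: the four opcodes (`CHAR`, `SPLIT`, `JMP`, `MATCH`), the class name `MiniRegexVm` and the hand-assembled program for the pattern `a+b` are all assumptions made for this example.

```java
public final class MiniRegexVm {
    static final int CHAR = 0;   // match one character (operand x)
    static final int SPLIT = 1;  // try branch x first, fall back to branch y
    static final int JMP = 2;    // jump to x
    static final int MATCH = 3;  // succeed if the input is exhausted

    /** Hand-assembled program for the pattern a+b. */
    static final int[][] A_PLUS_B = {
        {CHAR, 'a', 0},   // 0: consume 'a'
        {SPLIT, 0, 2},    // 1: greedy loop: try another 'a', otherwise continue at 2
        {CHAR, 'b', 0},   // 2: consume 'b'
        {MATCH, 0, 0},    // 3: done
    };

    /** Runs the program from instruction pc at input position pos. */
    static boolean run(int[][] program, String input, int pc, int pos) {
        while (true) {
            int[] insn = program[pc];
            switch (insn[0]) {
                case CHAR:
                    if (pos >= input.length() || input.charAt(pos) != insn[1]) {
                        return false;  // this path failed; the caller backtracks
                    }
                    pc++;
                    pos++;
                    break;
                case SPLIT:
                    // The recursive call acts as the saved backtrack point.
                    if (run(program, input, insn[1], pos)) {
                        return true;
                    }
                    pc = insn[2];
                    break;
                case JMP:
                    pc = insn[1];
                    break;
                case MATCH:
                    return pos == input.length();
                default:
                    throw new IllegalStateException("unknown opcode " + insn[0]);
            }
        }
    }

    static boolean matches(int[][] program, String input) {
        return run(program, input, 0, 0);
    }

    public static void main(String[] args) {
        System.out.println(matches(A_PLUS_B, "aaab"));
        System.out.println(matches(A_PLUS_B, "b"));
    }
}
```

A production engine keeps an explicit backtrack stack instead of recursing, but the control flow is the same: every alternative pushes a saved state, and a failed instruction pops one.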

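3.3 Data File Module (data/)

File location: icu4c/source/data/ (4,121 files)

Data file layout:

data/
├── locales/           # Locale data (1,829 .txt files)
│   ├── en.txt         # English base data
│   ├── zh.txt         # Chinese base data
│   ├── zh_Hans.txt    # Simplified Chinese
│   ├── zh_Hant.txt    # Traditional Chinese
│   ├── ar.txt         # Arabic
│   └── ...            # 180+ locales
│
├── coll/              # Collation rules (271 .txt files)
│   ├── root.txt       # Root collation (based on UCA)
│   ├── zh.txt         # Chinese collation (pinyin by default; the stroke variant lives in the same file)
│   ├── de.txt         # German collation
│   └── ...
│
├── brkitr/            # Break iteration data (32 .txt files)
│   └── rules/
│       ├── char.txt       # Character boundaries
│       ├── word.txt       # Word boundaries
│       ├── line.txt       # Line boundaries
│       ├── sent.txt       # Sentence boundaries
│       └── word_*.txt     # Tailored word boundaries
│
├── translit/          # Transliteration rules (120+ .txt files)
│   ├── Latin_ASCII.txt         # Latin → ASCII
│   ├── Han_Latin.txt           # Han → pinyin
│   ├── Arabic_Latin.txt        # Arabic → Latin
│   └── ...
│
├── rbnf/              # Rule-based number formatting (55 .txt files)
│   ├── en.txt         # English number spell-out
│   ├── zh.txt         # Chinese number spell-out
│   └── ...
│
├── unit/              # Localized unit names (one file per locale)
├── zone/              # Localized time zone names (one file per locale)
├── curr/              # Localized currency names (one file per locale)
└── lang/              # Language names (359 .txt files)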

Data file build flow:

Source files (.txt)
       ↓
   genrb (resource bundle compiler)
       ↓
Binary resources (.res)
       ↓
   pkgdata (packaging tool)
       ↓
Single data file (icudt72l.dat) - little-endian
              or (icudt72b.dat) - big-endian

Data file example:

// locales/zh.txt - Chinese locale data (simplified)
zh {
    Version { "42" }

    // Localized month names
    calendar {
        gregorian {
            monthNames {
                format {
                    wide {
                        "一月", "二月", "三月", "四月",
                        "五月", "六月", "七月", "八月",
                        "九月", "十月", "十一月", "十二月"
                    }
                }
            }
            dayNames {
                format {
                    wide {
                        "星期日", "星期一", "星期二", "星期三",
                        "星期四", "星期五", "星期六"
                    }
                }
            }
        }
        chinese {
            monthNames {
                format {
                    wide {
                        "正月", "二月", "三月", "四月",
                        "五月", "六月", "七月", "八月",
                        "九月", "十月", "十一月", "腊月"
                    }
                }
            }
        }
    }

    // Number formats
    NumberElements {
        latn {
            patterns {
                currencyFormat { "¤#,##0.00" }
                decimalFormat { "#,##0.###" }
                percentFormat { "#,##0%" }
                scientificFormat { "#E0" }
            }
            symbols {
                decimal { "." }
                group { "," }
                minusSign { "-" }
                percentSign { "%" }
            }
        }
    }
}

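Data file access API:

// 1. Open a resource bundle
UResourceBundle* ures_open(const char *packageName,
                           const char *locale,
                           UErrorCode *status);

// 2. Get a child resource by key (fillIn may be NULL or a bundle to reuse)
UResourceBundle* ures_getByKey(const UResourceBundle *resourceBundle,
                               const char *key,
                               UResourceBundle *fillIn,
                               UErrorCode *status);

// 3. Get a string from an array resource by index
const UChar* ures_getStringByIndex(const UResourceBundle *resourceBundle,
                                   int32_t indexS,
                                   int32_t *len,
                                   UErrorCode *status);

// 4. Close a resource bundle
void ures_close(UResourceBundle *resourceBundle);

// Example: ures_getStringByKey() takes a single key, not a slash-separated
// path, so nested tables are walked one level at a time
UErrorCode status = U_ZERO_ERROR;
UResourceBundle *res = ures_open(NULL, "zh", &status);
static const char *const path[] = {
    "calendar", "gregorian", "monthNames", "format", "wide"};
for (int i = 0; i < 5; i++) {
    res = ures_getByKey(res, path[i], res, &status);  // reuse res as fillIn
}

int32_t len = 0;
const UChar *monthName = ures_getStringByIndex(res, 0, &len, &status);
// monthName = "一月" when U_SUCCESS(status)

ures_close(res);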
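
Two ideas are at work in resource lookup: walking a nested table one key at a time, and falling back from a specific locale to its parents. The Java sketch below models both with plain maps. Everything in it is an assumption for illustration: the class `BundleLookup`, the toy bundle contents and the key `firstMonth` are invented, and the fallback chain is simple truncation at underscores, which is simpler than the parent rules ICU actually applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class BundleLookup {
    /** Toy stand-ins for compiled bundles; the real data lives in .res files. */
    static final Map<String, Object> BUNDLES = Map.of(
        "root", Map.of("calendar", Map.of("gregorian", Map.of("firstMonth", "M01"))),
        "zh", Map.of("calendar", Map.of("gregorian", Map.of("firstMonth", "\u4e00\u6708"))),
        "zh_CN", Map.of());

    /** zh_Hans_CN -> [zh_Hans_CN, zh_Hans, zh, root] (simplified truncation). */
    static List<String> fallbackChain(String locale) {
        List<String> chain = new ArrayList<>();
        String current = locale;
        while (!current.isEmpty()) {
            chain.add(current);
            int cut = current.lastIndexOf('_');
            current = cut < 0 ? "" : current.substring(0, cut);
        }
        chain.add("root");
        return chain;
    }

    /** Resolves a slash-separated path, trying each bundle in the chain. */
    static Object lookup(String locale, String path) {
        for (String name : fallbackChain(locale)) {
            Object node = BUNDLES.get(name);
            for (String key : path.split("/")) {
                node = (node instanceof Map) ? ((Map<?, ?>) node).get(key) : null;
                if (node == null) {
                    break;  // not in this bundle; try the parent
                }
            }
            if (node != null) {
                return node;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(lookup("zh_CN", "calendar/gregorian/firstMonth"));
        System.out.println(lookup("fr_FR", "calendar/gregorian/firstMonth"));
    }
}
```

The empty `zh_CN` bundle shows why fallback matters: regional bundles typically store only what differs from their parent.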

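3.4 Android Integration Layer (libandroidicu)

File location: external/icu/libandroidicu/

Stable ABI design:

Native code outside the i18n module
      ↓ calls
libandroidicu.so (stable API subset)
      ↓ forwards
libicuuc.so + libicui18n.so (full implementation, can change with each ICU release)

Why is libandroidicu needed?

Version isolation: callers use a stable API while the ICU version inside the APEX can change
ABI stability: only a curated set of stable functions is exposed, which avoids ABI breaks
Binary size: code links against the system copy instead of bundling its own ICU, which keeps packages small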

Exposed stable API (selection):

// Functions exposed by libandroidicu (200+)
// Character properties
u_charType
u_isalpha
u_isdigit
u_tolower
u_toupper

// Encoding conversion
ucnv_open
ucnv_close
ucnv_toUnicode
ucnv_fromUnicode

// Text processing
ubrk_open
ubrk_next
ubrk_close

// Regular expressions
uregex_open
uregex_find
uregex_group
uregex_close

// ... (curated stable subset)

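Usage example (NDK):

# CMakeLists.txt
# Assumption: apps built with the NDK link against libicu.so (API level 31+),
# while libandroidicu.so is meant for platform code
find_library(icu-lib icu)
target_link_libraries(myapp ${icu-lib})

// C++ code
#include <unicode/uchar.h>
#include <unicode/ucnv.h>

void processText() {
    // Use only the stable C API
    UChar32 ch = 0x4E2D;  // '中'
    if (u_isalpha(ch)) {
        // ...
    }

    // Check the headers of the library you link against before relying on
    // a function family such as ucnv_*; the exported subset differs per library
    UErrorCode status = U_ZERO_ERROR;
    UConverter *conv = ucnv_open("UTF-8", &status);
    // ...
    ucnv_close(conv);
}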

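Java layer integration:

// The Android framework and apps use the android.icu.* packages
import android.icu.text.DateFormat;
import android.icu.text.NumberFormat;
import android.icu.text.Collator;
import android.icu.util.Calendar;

// android.icu.* is ICU4J repackaged under a different package name.
// It is implemented in Java and reads the same .dat file shipped in the
// i18n APEX; classes such as android.icu.text.DateFormat do not call into
// libicuuc.so through JNI. JNI calls into ICU4C are found in parts of
// libcore instead.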

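4. Build System

4.1 Android.bp Build Configuration

Main build file: external/icu/Android.bp

Core library definitions (condensed for illustration; the real module definitions are spread over several Android.bp files):

// 1. libicuuc (Common - Unicode handling)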
cc_library {
    name: "libicuuc",
    host_supported: true,
    native_bridge_supported: true,

    srcs: [
        "icu4c/source/common/**/*.cpp",
    ],
    exclude_srcs: [
        "icu4c/source/common/udata.cpp",  // custom data loading is used instead
    ],

    cflags: [
        "-DU_COMMON_IMPLEMENTATION",
        "-O3",
        "-fvisibility=hidden",
    ],

    export_include_dirs: [
        "icu4c/source/common",
    ],

    // APEX integration
    apex_available: [
        "com.android.i18n",
        "//apex_available:platform",
    ],
}

// 2. libicui18n (Internationalization)
cc_library {
    name: "libicui18n",
    host_supported: true,
    native_bridge_supported: true,

    srcs: [
        "icu4c/source/i18n/**/*.cpp",
    ],

    shared_libs: [
        "libicuuc",  // depends on Common
    ],

    cflags: [
        "-DU_I18N_IMPLEMENTATION",
        "-O3",
    ],

    apex_available: [
        "com.android.i18n",
        "//apex_available:platform",
    ],
}

// 3. libandroidicu (stable ABI layer)
cc_library {
    name: "libandroidicu",
    host_supported: true,
    native_bridge_supported: true,

    // Expose only the stable API
    srcs: [
        "libandroidicu/static_shim/shim.cpp",  // API forwarding layer
    ],

    shared_libs: [
        "libicuuc",
        "libicui18n",
    ],

    // Exported to code outside the module through a symbol map
    llndk: {
        symbol_file: "libandroidicu.map.txt",  // list of exported symbols
    },

    stubs: {
        symbol_file: "libandroidicu.map.txt",
        versions: ["28", "29", "30"],  // API levels
    },
}

// 4. ICU data file
prebuilt_etc {
    name: "icu-data",
    src: "icu4c/source/data/in/icudt72l.dat",
    filename: "icudt72l.dat",

    sub_dir: "icu",

    apex_available: [
        "com.android.i18n",
    ],
}

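4.2 APEX Module Integration

ICU APEX module: com.android.i18n

Manifest (packages/modules/RuntimeI18n/apex/apex_manifest.json):

{
  "name": "com.android.i18n",
  "version": 340090000,
  "provideNativeLibs": [
    "libicuuc.so",
    "libicui18n.so",
    "libandroidicu.so"
  ],
  "requireNativeLibs": []
}

APEX advantages:

Independent packaging: ICU ships as its own module, separate from the rest of the system image, so it can be versioned and replaced as a unit
Fixes: Unicode data and locale data can be corrected by replacing one module (time zone rules are delivered through a separate time zone data module)
Isolation: the linker namespace keeps ICU's versioned internal symbols private; other code sees only the stable libandroidicu surface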

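4.3 Data File Packaging

Data file generation steps:

# 1. Compile resource files (.txt → .res); -s is the source directory, -d the destination
genrb -s source/data/locales/ -d output/ zh.txt

# 2. Package everything into a single .dat file
pkgdata -m common -p icudt72l -c -s output/ -d output/ icudata.lst

# Produces: icudt72l.dat (26 MB)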

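Data file loading (run time):

// Simplified illustration of how the common data file is mapped at run time.
// This is not the actual udata.cpp code; the real entry point is
//   UDataMemory *udata_open(const char *path, const char *type,
//                           const char *name, UErrorCode *pErrorCode);
static const char *kDataFile = "/apex/com.android.i18n/etc/icu/icudt72l.dat";

static const void *mapCommonData(size_t *outSize) {
    // 1. Open the file and query its size
    int fd = open(kDataFile, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    // 2. Map it read-only (no copy into the heap; pages are loaded on demand)
    void *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (data == MAP_FAILED) return NULL;

    *outSize = (size_t)st.st_size;
    return data;
}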

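5. Performance Optimization

5.1 Compiler Optimization

Compiler flags:

cflags: [
    "-O3",                    // highest optimization level
    "-fvisibility=hidden",    // hide internal symbols, smaller .so
    "-ffunction-sections",    // one section per function, enables section GC
    "-fdata-sections",
    "-fno-exceptions",        // disable C++ exceptions (smaller code)
    "-fno-rtti",              // disable RTTI
    "-DU_HAVE_STD_ATOMICS=1", // use C++11 atomics
],

ldflags: [
    "-Wl,--gc-sections",      // discard unused sections at link time
    "-Wl,--exclude-libs,ALL", // hide symbols from static libraries
]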

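5.2 Data Structure Optimization

Trie2 compressed lookup table:

// utrie2.h - Trie2 lookup (simplified from the UTRIE2_GET16 macro)
// BMP characters (U+0000..U+FFFF): 2 array accesses, O(1)
// Supplementary characters: 3 array accesses, O(1)

#define UTRIE2_SHIFT_2 5
#define UTRIE2_DATA_BLOCK_LENGTH (1 << UTRIE2_SHIFT_2)
#define UTRIE2_DATA_MASK (UTRIE2_DATA_BLOCK_LENGTH - 1)
#define UTRIE2_INDEX_SHIFT 2

// BMP lookup for code points that are not lead surrogates
static inline uint16_t
trie2_get16_bmp(const UTrie2 *trie, UChar32 c) {
    int32_t block = trie->index[c >> UTRIE2_SHIFT_2];   // 1st access: block index
    int32_t i = (block << UTRIE2_INDEX_SHIFT) +
                (c & UTRIE2_DATA_MASK);                  // offset inside the block
    return trie->index[i];                               // 2nd access: 16-bit data
                                                         // follows the index array
}

// Compression:
// Flat table for all of Unicode (1,114,112 code points × 2 bytes) = 2.2 MB
// After Trie2 compression: ~80 KB (about 96% smaller)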
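
Why a two-stage table compresses so well is easiest to see by building one. The Java sketch below is a simplified model, not the Trie2 format: `TwoStageTable` is a hypothetical class that covers only the BMP, uses 32-entry blocks, and shares identical blocks. It is filled from `Character.getType` so that the result can be checked against the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class TwoStageTable {
    static final int SHIFT = 5;
    static final int BLOCK = 1 << SHIFT;   // 32 code points per block
    static final int MASK = BLOCK - 1;
    static final int LIMIT = 0x10000;      // BMP only in this sketch

    final int[] index;    // stage 1: block number -> offset into data
    final short[] data;   // stage 2: deduplicated blocks, concatenated

    private TwoStageTable(int[] index, short[] data) {
        this.index = index;
        this.data = data;
    }

    /** Builds a table holding Character.getType(c) for every BMP code point. */
    static TwoStageTable forGeneralCategory() {
        int[] index = new int[LIMIT >> SHIFT];
        Map<String, Integer> seen = new HashMap<>();
        List<short[]> blocks = new ArrayList<>();
        for (int b = 0; b < index.length; b++) {
            short[] block = new short[BLOCK];
            for (int i = 0; i < BLOCK; i++) {
                block[i] = (short) Character.getType((b << SHIFT) + i);
            }
            String key = Arrays.toString(block);
            Integer offset = seen.get(key);
            if (offset == null) {           // first time this block content appears
                offset = blocks.size() * BLOCK;
                seen.put(key, offset);
                blocks.add(block);
            }
            index[b] = offset;              // identical blocks share one copy
        }
        short[] data = new short[blocks.size() * BLOCK];
        for (int i = 0; i < blocks.size(); i++) {
            System.arraycopy(blocks.get(i), 0, data, i * BLOCK, BLOCK);
        }
        return new TwoStageTable(index, data);
    }

    /** Two array reads per lookup. */
    int get(int c) {
        return data[index[c >> SHIFT] + (c & MASK)];
    }

    public static void main(String[] args) {
        TwoStageTable table = forGeneralCategory();
        System.out.println("data entries: " + table.data.length + " of " + LIMIT);
        System.out.println(table.get('A') == Character.UPPERCASE_LETTER);
    }
}
```

Large uniform ranges such as the CJK ideographs collapse into a single shared block, which is where most of the savings come from.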

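MBCS state machine optimization:

// ucnvmbcs.cpp - multi-byte character set state machine
// (simplified illustration; field names and the decoding of table entries
// do not match the real structures)
// GBK conversion: 2-3 memory accesses per character

typedef struct {
    const int32_t (*stateTable)[256];  // 256-entry transition rows (cache friendly)
    const uint16_t *toUnicodeTable;    // direct mapping table
} UConverterMBCSTable;

// Hot path (single-byte ASCII)
if (c < 0x80) {
    return c;  // return immediately
}

// Table-driven state machine
int32_t state = 0;
int32_t nextState = stateTable[state][byte1];
if (nextState >= 0) {
    return toUnicodeTable[nextState];  // single-byte character
} else {
    state = -nextState;
    nextState = stateTable[state][byte2];
    return toUnicodeTable[nextState];  // double-byte character
}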

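5.3 Cache Optimization

Thread-local cache:

// Illustrative sketch, not ICU source code: a per-thread cache that avoids
// re-creating the same Collator on every call
// (requires <memory>, <string>, <unordered_map> and <unicode/coll.h>)

class ThreadLocalCache {
    inline static thread_local
        std::unordered_map<std::string, std::unique_ptr<Collator>> cache;

public:
    static Collator* getCollator(const char *locale) {
        auto it = cache.find(locale);
        if (it != cache.end()) {
            return it->second.get();  // cache hit
        }

        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<Collator> coll(
            Collator::createInstance(Locale(locale), status));
        if (U_FAILURE(status)) {
            return nullptr;
        }
        Collator *raw = coll.get();
        cache[locale] = std::move(coll);  // the cache owns the object
        return raw;
    }
};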
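
The same per-thread caching idea is short in Java. This is a sketch under assumptions: `CollatorCache` is a hypothetical helper, and it caches the JDK's `java.text.Collator` so that it runs without ICU. A per-thread map avoids both repeated construction and sharing a mutable collator between threads.

```java
import java.text.Collator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public final class CollatorCache {
    // One map per thread, so cached collators are never shared across threads.
    private static final ThreadLocal<Map<Locale, Collator>> CACHE =
        ThreadLocal.withInitial(HashMap::new);

    /** Returns this thread's collator for the locale, creating it on first use. */
    static Collator get(Locale locale) {
        return CACHE.get().computeIfAbsent(locale, Collator::getInstance);
    }

    public static void main(String[] args) {
        Collator first = get(Locale.GERMAN);
        Collator second = get(Locale.GERMAN);
        System.out.println(first == second);  // same cached instance on this thread
    }
}
```

Callers must not change the strength or decomposition of a cached instance unless every user on that thread expects the change.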

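Zero-copy data loading with mmap:

// Illustrative sketch of mapping a data file
// mmap avoids reading the file into heap memory

void* loadDataFile(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    // Map the whole file (lazy loading, paged in on demand)
    void *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (data == MAP_FAILED) return NULL;

    // Hint that the pages will be needed soon (fewer page faults later)
    madvise(data, st.st_size, MADV_WILLNEED);

    return data;
}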
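
Java offers the same mechanism through `FileChannel.map`. The sketch below is an illustration with invented names (`MappedData`, `firstByteOfTempFile`); it writes a temporary file so that the example is self-contained, where real code would map an existing data file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class MappedData {
    /** Maps a file read-only; the mapping outlives the channel. */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Demo helper: writes the bytes to a temp file, maps it, returns byte 0. */
    static int firstByteOfTempFile(byte[] content) {
        try {
            Path tmp = Files.createTempFile("mapped-demo", ".dat");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, content);
            MappedByteBuffer buffer = mapReadOnly(tmp);
            return buffer.get(0) & 0xFF;  // read straight from the mapping
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstByteOfTempFile(new byte[] {0x49, 0x43, 0x55}));
    }
}
```

As with the C version, nothing is copied up front; the operating system pages data in when the buffer is first read.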

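5.4 SIMD Optimization

UTF-8 validation (NEON sketch):

// Illustrative sketch of an ASCII fast path using ARM NEON (AArch64);
// not taken from the ICU sources
#ifdef __ARM_NEON
#include <arm_neon.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical scalar validator that implements the full UTF-8 rules
bool utf8_validate_scalar(const uint8_t *data, size_t len);

bool utf8_validate_neon(const uint8_t *data, size_t len) {
    const uint8_t *end = data + len;

    while (data + 16 <= end) {
        // Load 16 bytes
        uint8x16_t chunk = vld1q_u8(data);

        // Fast path: the largest byte is below 0x80, so all 16 are ASCII
        if (vmaxvq_u8(chunk) < 0x80) {
            data += 16;
            continue;
        }

        // Slow path: a multi-byte sequence starts somewhere in this chunk
        break;
    }

    // Validate the remaining bytes (non-ASCII data and the tail) with the
    // scalar routine; data still points at a character boundary here
    return utf8_validate_scalar(data, (size_t)(end - data));
}
#endif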
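
The same "check many bytes at once" trick works without SIMD intrinsics by packing eight bytes into a 64-bit word (often called SWAR). The Java sketch below is an illustration with an invented class name, `AsciiScan`; it only answers whether the input is pure ASCII, which is the fast-path test used above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public final class AsciiScan {
    // The high bit of each of the eight bytes in a long.
    private static final long HIGH_BITS = 0x8080808080808080L;

    /** Returns true if every byte is in the range 0x00..0x7F. */
    static boolean isAscii(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        int i = 0;
        // Eight bytes per step: any set high bit means a non-ASCII byte.
        for (; i + 8 <= data.length; i += 8) {
            if ((buffer.getLong(i) & HIGH_BITS) != 0) {
                return false;
            }
        }
        // Remaining 0..7 bytes, one at a time (bytes >= 0x80 are negative in Java).
        for (; i < data.length; i++) {
            if (data[i] < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAscii("Hello, world!".getBytes(StandardCharsets.UTF_8)));
        System.out.println(isAscii("caf\u00e9".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Text that is mostly ASCII, which is common for markup and identifiers, spends nearly all of its time in the eight-byte loop.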

6. Test Framework

6.1 Unit Tests

Test directory: icu4c/source/test/

Main test modules:

test/
├── intltest/     # C++ API tests (230+ .cpp files)
│   ├── caltest.cpp         # calendar tests
│   ├── dtfmttst.cpp        # date format tests
│   ├── numfmtst.cpp        # number format tests
│   ├── collationtest.cpp   # collation tests
│   └── regextst.cpp        # regular expression tests
│
├── cintltst/     # C API tests (150+ .c files)
│   ├── nucnvtst.c          # converter tests
│   ├── reapits.c           # regular expression C API tests
│   └── ...
│
├── testdata/     # test data files
└── perf/         # performance benchmarks

Test example:

// intltest/caltest.cpp - calendar test
void CalendarTest::TestGregorian() {
    UErrorCode status = U_ZERO_ERROR;
    Calendar *cal = Calendar::createInstance(status);

    // Test 1: set a date
    cal->set(2025, Calendar::JANUARY, 29);
    int32_t year = cal->get(Calendar::YEAR, status);
    int32_t month = cal->get(Calendar::MONTH, status);
    int32_t day = cal->get(Calendar::DATE, status);

    if (year != 2025 || month != 0 || day != 29) {
        errln("Calendar set/get failed");
    }

    // Test 2: date arithmetic
    cal->add(Calendar::DATE, 7, status);  // +7 days
    day = cal->get(Calendar::DATE, status);
    if (day != 5) {  // 2025-01-29 + 7 days = 2025-02-05
        errln("Calendar add failed");
    }

    delete cal;
}

6.2 Android CTS Tests

CTS test location: cts/tests/tests/icu/

Test coverage:

// CTS: android.icu.text.DateFormatTest (illustrative)
public class DateFormatTest extends AndroidTestCase {
    public void testChineseDateFormat() {
        DateFormat df = DateFormat.getDateInstance(
            DateFormat.FULL, new ULocale("zh_CN"));

        Calendar cal = Calendar.getInstance();
        cal.set(2025, Calendar.JANUARY, 29);

        String formatted = df.format(cal.getTime());
        assertTrue(formatted.contains("2025"));
        assertTrue(formatted.contains("1月"));
        assertTrue(formatted.contains("29"));
    }

    public void testArabicNumberFormat() {
        NumberFormat nf = NumberFormat.getInstance(
            new ULocale("ar_SA"));

        String formatted = nf.format(12345);
        // Expect Arabic-Indic digits with the Arabic grouping separator
        assertEquals("١٢٬٣٤٥", formatted);
    }
}

6.3 Performance Benchmarks

Performance tests: icu4c/source/test/perf/

// Collation performance test (illustrative sketch; loadTestData is a
// hypothetical helper)
class CollationPerformanceTest {
    void testSortPerformance() {
        // Sort 10,000 strings
        std::vector<UnicodeString> strings = loadTestData(10000);

        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<Collator> coll(
            Collator::createInstance(Locale("zh_CN"), status));
        if (U_FAILURE(status)) return;
        const Collator *c = coll.get();

        auto start = std::chrono::high_resolution_clock::now();

        std::sort(strings.begin(), strings.end(),
            [c](const UnicodeString &a, const UnicodeString &b) {
                return c->compare(a, b) < 0;
            });

        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

        // Print the number as text; adding an integer to a UnicodeString
        // would append it as a code point
        printf("Sort 10000 strings: %lld ms\n", (long long)duration.count());
    }
};

7. Best Practices

7.1 API Usage Recommendations

1. Prefer the C++ API where it is available

// ✅ Recommended: C++ API with smart pointers (RAII)
#include <memory>
#include <unicode/calendar.h>
#include <unicode/datefmt.h>

void formatDate() {
    UErrorCode status = U_ZERO_ERROR;

    // Resources are managed automatically
    std::unique_ptr<Calendar> cal(Calendar::createInstance(Locale("zh_CN"), status));
    std::unique_ptr<DateFormat> df(DateFormat::createDateInstance(
        DateFormat::FULL, Locale("zh_CN")));
    if (U_FAILURE(status) || !df) return;

    UnicodeString formatted;
    df->format(cal->getTime(status), formatted);

    // Destructors release the objects
}

// ❌ Not recommended: C API with manual cleanup (easy to leak)
UCalendar *cal = ucal_open(...);
UDateFormat *df = udat_open(...);
// ... an early return on error here would leak both objects
ucal_close(cal);
udat_close(df);

2. Reuse Formatter Objects

// ✅ Recommended: reuse the formatter (avoid repeated construction)
class MyApp {
    std::unique_ptr<DateFormat> m_dateFormat;

    MyApp() {
        m_dateFormat.reset(DateFormat::createDateInstance(
            DateFormat::MEDIUM, Locale::getDefault()));
    }

    void formatDates(const std::vector<UDate> &dates) {
        for (UDate date : dates) {
            UnicodeString formatted;
            m_dateFormat->format(date, formatted);
            // use formatted
        }
    }
};

// ❌ Not recommended: creating a formatter per call (construction loads and
// parses locale data, so it is far more expensive than formatting)
void formatDates(const std::vector<UDate> &dates) {
    for (UDate date : dates) {
        DateFormat *df = DateFormat::createDateInstance(...);  // created every time!
        UnicodeString formatted;
        df->format(date, formatted);
        delete df;
    }
}

3. Use Sort Keys When Sorting Many Strings

// ✅ Recommended: sort keys (for large lists)
struct StringWithKey {
    UnicodeString str;
    CollationKey key;
};

void sortLargeList(std::vector<UnicodeString> &strings) {
    UErrorCode status = U_ZERO_ERROR;
    Collator *coll = Collator::createInstance(Locale("zh_CN"), status);
    if (U_FAILURE(status)) return;

    // 1. Generate every sort key once (O(n) key generations)
    std::vector<StringWithKey> data;
    for (const auto &str : strings) {
        CollationKey key;
        coll->getCollationKey(str, key, status);
        data.push_back({str, key});
    }

    // 2. Compare sort keys (cheaper than collating the strings again)
    std::sort(data.begin(), data.end(),
        [](const StringWithKey &a, const StringWithKey &b) {
            return a.key.compareTo(b.key) < 0;  // byte-wise comparison
        });

    delete coll;
}

// ❌ Not recommended for large lists: compare() on every comparison
std::sort(strings.begin(), strings.end(),
    [coll](const UnicodeString &a, const UnicodeString &b) {
        return coll->compare(a, b) < 0;  // re-analyzes both strings each time
    });

7.2 Performance Recommendations

1. Encoding conversion: process in batches

// ✅ Recommended: open the converter once and convert whole strings
void convertBatch(const std::vector<std::string> &gbkTexts) {
    UErrorCode status = U_ZERO_ERROR;
    UConverter *gbkConv = ucnv_open("GBK", &status);
    if (U_FAILURE(status)) return;

    for (const auto &text : gbkTexts) {
        // Convert the whole string in one call
        UChar unicode[1024];
        const char *src = text.c_str();
        UChar *tgt = unicode;

        status = U_ZERO_ERROR;
        ucnv_reset(gbkConv);
        ucnv_toUnicode(gbkConv, &tgt, unicode + 1024,
                       &src, src + text.size(),
                       NULL, true, &status);
        // U_BUFFER_OVERFLOW_ERROR here means the input needs a larger buffer

        // ... process unicode[0 .. tgt - unicode)
    }

    ucnv_close(gbkConv);
}

2. Regular expressions: precompile patterns

// ✅ Recommended: compile the pattern once
class EmailValidator {
    URegularExpression *m_regex;

public:
    EmailValidator() {
        UErrorCode status = U_ZERO_ERROR;
        UParseError pe;
        m_regex = uregex_open(u"[\\w.]+@[\\w.]+", -1, 0, &pe, &status);
    }

    ~EmailValidator() {
        uregex_close(m_regex);
    }

    bool validate(const UChar *email) {
        UErrorCode status = U_ZERO_ERROR;
        uregex_setText(m_regex, email, -1, &status);
        return uregex_matches(m_regex, 0, &status);
    }
};

// ❌ Not recommended: compiling the pattern on every call
bool validateEmail(const UChar *email) {
    UErrorCode status = U_ZERO_ERROR;
    UParseError pe;
    URegularExpression *regex = uregex_open(...);  // compiled every time!
    uregex_setText(regex, email, -1, &status);
    bool result = uregex_matches(regex, 0, &status);
    uregex_close(regex);
    return result;
}
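
On the Java side the equivalent habit is to keep a compiled `Pattern` in a constant. The sketch below uses `java.util.regex` and an invented class name, `EmailShape`; the expression is the same loose shape check as in the C example, not a real e-mail validator.

```java
import java.util.regex.Pattern;

public final class EmailShape {
    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern SHAPE = Pattern.compile("[\\w.]+@[\\w.]+");

    /** Returns true if the whole string has the form something@something. */
    static boolean looksLikeEmail(String candidate) {
        // A Matcher is cheap to create and must not be shared between threads.
        return SHAPE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("user@example.com"));
        System.out.println(looksLikeEmail("not an email"));
    }
}
```

Calling `String.matches` in a loop would recompile the pattern on every call, which is the Java form of the anti-pattern shown above.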

3. Unicode normalization: check first

// ✅ Recommended: check first, normalize only when needed
// (handleNormalizedText is a hypothetical application function)
void processText(const UChar *text, int32_t length) {
    UErrorCode status = U_ZERO_ERROR;
    const Normalizer2 *nfc = Normalizer2::getNFCInstance(status);

    // The check is usually much cheaper than building a new string
    if (nfc->isNormalized(UnicodeString(text, length), status)) {
        // Already normalized, use the input as is
        handleNormalizedText(text, length);
    } else {
        // Normalization is required
        UnicodeString normalized = nfc->normalize(
            UnicodeString(text, length), status);
        handleNormalizedText(normalized.getBuffer(), normalized.length());
    }
}
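
The check-then-normalize pattern looks the same in Java. This sketch uses the JDK's `java.text.Normalizer` (an assumption made so that it runs without ICU) and an invented class name, `NfcDemo`.

```java
import java.text.Normalizer;

public final class NfcDemo {
    /** Returns the input itself when it is already in NFC, else a normalized copy. */
    static String toNfc(String text) {
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text;  // no allocation in the common case
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";  // 'e' followed by a combining acute accent
        String composed = toNfc(decomposed);
        System.out.println(composed.length());            // one precomposed character
        System.out.println(composed.equals("\u00e9"));
    }
}
```

Most real-world text is already in NFC, so the early return is the path taken almost every time.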

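7.3 Common Pitfalls

1. Months are 0-based!

// ❌ Wrong: using 1-12 for the month
cal->set(2025, 1, 29);  // month 1 is February; 2025 has no Feb 29, so a
                        // lenient calendar rolls this over to March 1

// ✅ Correct: months run from 0 to 11
cal->set(2025, Calendar::JANUARY, 29);  // JANUARY = 0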
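
The 0-based month convention is shared by `java.util.Calendar`, so the pitfall can be reproduced on any JVM. The sketch below uses the JDK class rather than ICU, and `MonthIndexDemo` is an invented name.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

public final class MonthIndexDemo {
    /** Returns {month index, day of month} after the calendar resolves the fields. */
    static int[] monthAndDay(int year, int monthIndex, int day) {
        Calendar cal = new GregorianCalendar(year, monthIndex, day);  // lenient by default
        return new int[] {cal.get(Calendar.MONTH), cal.get(Calendar.DAY_OF_MONTH)};
    }

    public static void main(String[] args) {
        int[] intended = monthAndDay(2025, Calendar.JANUARY, 29);
        int[] mistaken = monthAndDay(2025, 1, 29);  // 1 means February
        System.out.println(intended[0] + " / " + intended[1]);
        System.out.println(mistaken[0] + " / " + mistaken[1]);
    }
}
```

Using the named constants instead of literals removes the ambiguity entirely.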

2. UChar is not char!

// ❌ Wrong: mixing up char and UChar
const char *str = "Hello";
uregex_setText(regex, str, -1, &status);  // does not compile!

// ✅ Correct: use UChar* or UnicodeString
const UChar *ustr = u"Hello";  // u prefix (UTF-16 literal)
uregex_setText(regex, ustr, -1, &status);

3. Forgetting to check UErrorCode

// ❌ Wrong: the error status is never checked
UErrorCode status = U_ZERO_ERROR;
Calendar *cal = Calendar::createInstance(status);
cal->setTime(date, status);  // if creation failed, cal may be NULL!

// ✅ Correct: check before use
UErrorCode status = U_ZERO_ERROR;
Calendar *cal = Calendar::createInstance(status);
if (U_FAILURE(status)) {
    // handle the error
    return;
}
cal->setTime(date, status);

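8. Comparison with Other Libraries

8.1 ICU vs. C++ std::locale

Feature	ICU	C++ std::locale
Unicode support	Complete (Unicode 15.0)	Limited (depends on the system)
Number of locales	180+	Depends on the system
Cross-platform consistency	Consistent	Inconsistent (behavior varies by platform)
Calendar systems	11	Gregorian only
Collation rules	UCA	Depends on the system
Regular expressions	Unicode-aware	Limited Unicode support
Binary size	Large (code plus a data file of about 26 MB in this build)	Very small (relies on the system)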

Side-by-side example:

// ICU: consistent across platforms
UErrorCode status = U_ZERO_ERROR;
DateFormat *df = DateFormat::createDateInstance(
    DateFormat::FULL, Locale("zh_CN"));
// On any platform: "2025年1月29日星期三"

// C++ std::locale: platform dependent
std::locale loc("zh_CN.UTF-8");  // Linux
// or std::locale loc("zh-CN");  // Windows
// The formatted result may differ between platforms!

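8.2 ICU vs. Java java.text

Feature	ICU4J	java.text
Implementation	Separate pure-Java library	Part of the Java class library
Data source	ICU data files	Built into the JRE
Update frequency	Fast (several releases a year)	Slow (tied to JRE releases)
Unicode version	Latest	1-2 years behind
Calendar systems	11	8
API compatibility	Extends java.text	Standard Java API

Why Android chose ICU:

Faster Unicode updates (through the APEX module)
Richer internationalization features
One data source shared by C/C++ and Java
Cross-platform consistency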

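9. Summary

9.1 Core Strengths

1. Complete Unicode support

Unicode 15.0 (149,186 characters)
30+ character properties
161 scripts

2. Comprehensive internationalization features

180+ locales
11 calendar systems
50+ character encodings
UCA collation algorithm

3. High performance

Trie2 compressed data structure (about 96% smaller than a flat table)
SIMD optimization (UTF-8 validation, encoding conversion)
Zero-copy data loading (mmap)
Fast Latin optimization

4. Deep Android integration

Packaged as an APEX module
Stable ABI for native code (libandroidicu)
Used throughout the framework
26 MB data file (180+ languages)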

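9.2 Key Statistics

Code size:

C/C++ code: 809,000+ lines
Java code: 650,000+ lines (ICU4J)
Header files: 288
Data files: 4,121

Functional modules:

Character properties: 30+ types
Encoding conversion: 50+ encodings
Calendars: 11 systems
Locales: 180+
Time zones: 179+

Performance indicators (approximate):

Character property lookup: O(1) (2-3 memory accesses)
UTF-8 → UTF-16: ~1 GB/s (NEON optimized)
Sort key generation: ~10 MB/s
Regular expressions: ~50 MB/s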

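9.3 Application Scenarios

Android Framework:

TextView text rendering
Input method engines (IME)
Contact sorting
Calendar apps
Phone number formatting
WebView internationalization

Third-party apps:

Multilingual apps
Text processing tools
Database indexing
Search engines
International e-commerce
Social apps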
