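Building an efficient substring inverted index with a Trie and RoaringBitmap

1. Trie (prefix tree)

✅ What is it?

A Trie (pronounced "try") is a tree-shaped data structure designed for strings. It is particularly well suited to prefix matching and fast lookup.

🌰 An example:

Suppose we have these company names:

华为
华为技术
华为终端
华为云
华为供应链

Looking up "华为" in a plain list means scanning every entry. In a Trie, the names are organized into a tree by their shared prefix:

     (root)
       |
       华
       |
       为
   /  |   |   \
 技术 终端 云  供应链

(For brevity the diagram draws each suffix as a single node; in a real Trie every character is its own node, so "技术" is the path 技 → 术.)

When you type "华为", the lookup walks straight to the "为" node, and every name in the subtree below it is a candidate. That gives autocomplete and prefix matching without scanning the whole list.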
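The prefix lookup described above can be sketched in a few lines of Java. This is a minimal illustration, not the implementation used later in this article: the class name `PrefixTrieSketch` and its methods are hypothetical, and a `TreeMap` is used for the children only to keep the output order stable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Minimal prefix tree: one node per character, with a flag marking complete names. */
final class PrefixTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>(); // sorted for stable output
        boolean terminal; // true if a stored name ends at this node
    }

    private final Node root = new Node();

    void insert(String name) {
        Node node = root;
        for (char c : name.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    /** Returns every stored name that starts with the given prefix. */
    List<String> startingWith(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) { // O(m) walk, m = prefix length
            node = node.children.get(c);
            if (node == null) return new ArrayList<>();
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    /** Depth-first walk that rebuilds each name from the characters on its path. */
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.terminal) out.add(path.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        PrefixTrieSketch trie = new PrefixTrieSketch();
        for (String s : new String[] {"华为", "华为技术", "华为终端", "华为云", "华为供应链", "小米"}) {
            trie.insert(s);
        }
        System.out.println(trie.startingWith("华为")); // the five 华为 names, not 小米
    }
}
```

The cost of reaching the prefix node depends only on the prefix length; the cost of collecting results depends on the size of the subtree below it.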

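✅ Role in this scenario:

Quickly recognizes keywords such as products and companies in user input (for example, typing "华" matches "华为")
Supports prefix search directly; spelling tolerance and alias expansion need extra entries or logic layered on top of the Trie
Improves the recall and speed of NER (named entity recognition)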

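🔹 2. RoaringBitmap

✅ What is it?

RoaringBitmap is a compressed bitmap structure for storing large sets of integers compactly and operating on them quickly, for example "which documents contain a given term".

🌰 An example:

Suppose every product has an ID (1, 2, 3, ... up to 1,000,000), and we want to record which products belong to the "手机" (phone) category.

Conventional approach: keep all IDs in an array or a hash set → high memory use and slow set operations.

With RoaringBitmap: each ID is represented as a bit position and stored in compressed form, for example:

"手机" → {1, 2, 3, 1000, 1001, 200000}

Stored as a RoaringBitmap, a set this small may take only a few dozen bytes, and intersection and union remain very fast.

✅ Role in this scenario:

Associates each keyword (such as "华为") with the list of entity IDs it maps to in the knowledge graph
Stores those ID sets as RoaringBitmaps, which saves a lot of memory
Supports fast merging of candidate sets after keyword matching (for example, "华为 AND 手机" is the intersection of two bitmaps)
Improves fuzzy-search response time under high concurrency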
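The intersection and union mentioned above can be tried out with the JDK alone. The sketch below uses `java.util.BitSet` as a stand-in for RoaringBitmap so that it runs without extra dependencies; `BitSet` is not compressed, so it shows only the set algebra, not the memory savings. The class name `IdSetSketch` and the ID values are made up for illustration.

```java
import java.util.BitSet;

/** Set algebra on entity-ID sets, with java.util.BitSet standing in for RoaringBitmap. */
final class IdSetSketch {
    static BitSet of(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) bits.set(id);
        return bits;
    }

    /** IDs present in both sets (the "华为 AND 手机" case). */
    static BitSet and(BitSet a, BitSet b) {
        BitSet out = (BitSet) a.clone(); // copy so the inputs stay untouched
        out.and(b);
        return out;
    }

    /** IDs present in either set. */
    static BitSet or(BitSet a, BitSet b) {
        BitSet out = (BitSet) a.clone();
        out.or(b);
        return out;
    }

    public static void main(String[] args) {
        BitSet huawei = of(1, 2, 3, 1000);      // hypothetical IDs tagged "华为"
        BitSet phone = of(2, 3, 1001, 200000);  // hypothetical IDs tagged "手机"
        System.out.println(and(huawei, phone));              // IDs carrying both tags
        System.out.println(or(huawei, phone).cardinality()); // number of IDs carrying either tag
    }
}
```

With the RoaringBitmap library the same operations are available on `RoaringBitmap` objects (for example the static `RoaringBitmap.and` and `RoaringBitmap.or`), with compressed storage underneath.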

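🔗 Combining the two: Trie + RoaringBitmap

Step	What happens
1️⃣	The user types "华" → the Trie quickly finds every term starting with "华" (such as "华为" and "华星")
2️⃣	Each term has a RoaringBitmap holding the entity IDs associated with it (for example "华为" → {1001, 1002})
3️⃣	The bitmaps of the matched terms are combined by union or intersection to get the final candidates
4️⃣	The candidates are passed to the NER model for semantic judgment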
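Steps 1 to 3 of the table can be sketched as one small class: a prefix Trie whose terminal nodes each carry an ID set. This is an illustrative sketch under assumptions, not the article's implementation: `PrefixToIdsSketch` is a hypothetical name, `java.util.BitSet` stands in for RoaringBitmap, and the entity IDs are invented.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Prefix trie whose terminal nodes carry an ID set (BitSet stands in for RoaringBitmap). */
final class PrefixToIdsSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        BitSet ids; // non-null only where a complete term ends
    }

    private final Node root = new Node();

    /** Step 2: associate a term with the entity IDs it maps to. */
    void put(String term, int... entityIds) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        if (node.ids == null) node.ids = new BitSet();
        for (int id : entityIds) node.ids.set(id);
    }

    /** Steps 1 and 3: find every term under the prefix and union their ID sets. */
    BitSet candidates(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new BitSet(); // no term starts with this prefix
        }
        BitSet out = new BitSet();
        unionSubtree(node, out);
        return out;
    }

    private void unionSubtree(Node node, BitSet out) {
        if (node.ids != null) out.or(node.ids);
        for (Node child : node.children.values()) unionSubtree(child, out);
    }

    public static void main(String[] args) {
        PrefixToIdsSketch index = new PrefixToIdsSketch();
        index.put("华为", 1001, 1002);
        index.put("华星", 2001); // hypothetical ID
        index.put("小米", 3001); // hypothetical ID
        System.out.println(index.candidates("华")); // union of the 华为 and 华星 ID sets
    }
}
```

The resulting ID set is what step 4 would hand to the NER model.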

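✅ Summary of advantages:

Fast: a Trie matches a prefix in O(m) time, where m is the length of the string
Compact: RoaringBitmap compresses well; memory use is roughly 5 to 10 times lower than an ordinary collection, though the actual ratio depends on how the IDs are distributed
Capable: supports real-time fuzzy retrieval over large volumes of dimension data and holds up under highly concurrent AI queries

💡 In one sentence:

The Trie is responsible for quickly finding the terms that might match, and RoaringBitmap is responsible for efficiently recording and combining the entities those terms map to. Together they give high-performance matching over large volumes of business dimension data.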

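// Imports for the JDK, SLF4J and Spring types used below. The project's own classes
// (AppKnowledgeGraph, AppKnowledgeGraphMapper, AppKnowledgeGraphImpl, BIKgInfoVo,
// BIKgRelationVO, DESUtils) live in packages not shown here, so their imports are omitted.
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Assumed: javax.annotation.* (use jakarta.annotation.* on Spring Boot 3+)
import javax.annotation.PostConstruct;
import javax.annotation.Resource;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component // assumed: the class must be a Spring bean for @Scheduled and @PostConstruct to run
public class KGraphCache {

    // SLF4J logger used by the log.* calls below
    private static final Logger log = LoggerFactory.getLogger(KGraphCache.class);

    private final CompactInvertedIndex invertedIndex = new CompactInvertedIndex();
    private final Map<String, BIKgInfoVo> kgCache = new ConcurrentHashMap<>(); // subject → Vo

    // Set once the first load succeeds; AtomicBoolean keeps the flag thread-safe
    private final AtomicBoolean initialized = new AtomicBoolean(false);

    private final AtomicInteger loadAttemptCount = new AtomicInteger(0);
    private static final int MAX_LOAD_ATTEMPTS = 6;

    private static final String PASSWORD_KGDICT = "*********";

    // Maps category object IDs to their display names
    private static final Map<Integer, String> CATEGORY_MAP = createCategoryMap();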

    private static Map<Integer, String> createCategoryMap() {
        Map<Integer, String> map = new HashMap<>();
        map.put(1, "产品分类");
        map.put(2, "公司");
        map.put(3, "产品");
        return Collections.unmodifiableMap(map);
    }

    @Resource
    private AppKnowledgeGraphMapper appKnowledgeGraphMapper;

    @Autowired
    @Qualifier("asyncExecutor")
    private Executor asyncExecutor;

    // ========================
    // Initialization and scheduling
    // ========================

    @PostConstruct
    public void init() {
        log.info("Knowledge-graph cache component registered; waiting for the first load...");
    }

    /**
     * Runs every 30 minutes: if the cache is not initialized yet, try to load it.
     * Once a load has succeeded, later runs return immediately.
     */
    @Scheduled(initialDelay = 10_000, fixedDelay = 30 * 60 * 1000)
    public void checkAndLoadCache() {
        // Already initialized: nothing to do
        if (initialized.get()) {
            return;
        }

        // Retry budget exhausted: stop trying
        if (loadAttemptCount.get() >= MAX_LOAD_ATTEMPTS) {
            return;
        }

        int attempt = loadAttemptCount.incrementAndGet();
        log.info("[Deferred init check] Attempt {} to load the knowledge-graph cache...", attempt);

        loadCacheAsync().thenAccept(success -> {
            if (success) {
                boolean wasSet = initialized.compareAndSet(false, true);
                if (wasSet) {
                    log.info("Knowledge-graph cache loaded for the first time; marked as initialized");
                }
            } else {
                log.warn("Load attempt {} failed; will retry in the next cycle (at most {} attempts)",
                        attempt, MAX_LOAD_ATTEMPTS);
            }
        }).exceptionally(throwable -> {
            log.warn("Load attempt {} threw an exception: {}", attempt, throwable.getMessage(), throwable);
            return null;
        });
    }

    /**
     * Refreshes the cache every day at 05:01 (incremental or full).
     */
    @Scheduled(cron = "0 1 5 * * ?") // every day at 05:01:00
    public void scheduledRefresh() {
        log.info("Starting scheduled task: daily knowledge-graph cache refresh");
        loadCacheAsync().thenAccept(success -> {
            if (success) {
                log.info("Daily cache refresh finished");
            } else {
                log.warn("Daily cache refresh failed; manual inspection recommended");
            }
        });
    }

    // ========================
    // Core asynchronous loading logic
    // ========================

    /**
     * Loads the cache asynchronously.
     * @return CompletableFuture<Boolean> that completes with true if the load succeeded
     */
    public CompletableFuture<Boolean> loadCacheAsync() {
        return CompletableFuture.supplyAsync(() -> {
            try {
                log.info("[Async task] Loading the knowledge-graph cache...");
                long start = System.currentTimeMillis();

                LocalDate yesterday = LocalDate.now().minusDays(1);
                LocalDate dayBeforeYesterday = yesterday.minusDays(1);

                List<AppKnowledgeGraph> allData = selectByDt(yesterday);
                if (allData.isEmpty()) {
                    log.info("No data found for yesterday ({}); trying the day before ({})", yesterday, dayBeforeYesterday);
                    allData = selectByDt(dayBeforeYesterday);
                }

                if (allData.isEmpty()) {
                    log.warn("No knowledge-graph data loaded (tried yesterday and the day before); treating this load as failed");
                    return false;
                }

                // Deduplicate on (cleaned subject, predicateId, objectId)
                Map<List<Object>, AppKnowledgeGraph> dedupMap = allData.stream()
                        .filter(kg -> kg.getSubject() != null && kg.getPredicateId() != null && kg.getObjectId() != null)
                        .collect(Collectors.toMap(
                                kg -> {
                                    String cleanedSubject = TextCleaner.cleanSubject(kg.getSubject());
                                    return Arrays.asList(cleanedSubject, kg.getPredicateId(), kg.getObjectId());
                                },
                                kg -> kg,
                                (e1, e2) -> e1 // on a duplicate key, keep the first record
                        ));

                List<AppKnowledgeGraph> uniqueData = new ArrayList<>(dedupMap.values());

                // Group by subject (the subject must be cleaned here as well)
                Map<String, List<AppKnowledgeGraph>> grouped = uniqueData.stream()
                        .collect(Collectors.groupingBy(kg -> TextCleaner.cleanSubject(kg.getSubject())));

                // === Incremental cache update starts here ===
                Set<String> currentSubjects = new HashSet<>(grouped.keySet());
                Set<String> existingSubjects = new HashSet<>(kgCache.keySet());

                // 1. Remove subjects that no longer exist
                Set<String> toRemove = new HashSet<>(existingSubjects);
                toRemove.removeAll(currentSubjects);
                for (String subject : toRemove) {
                    kgCache.remove(subject);
                    invertedIndex.remove(subject);
                }

                // 2. Add new subjects and update existing ones
                for (Map.Entry<String, List<AppKnowledgeGraph>> entry : grouped.entrySet()) {
                    String cleanedSubject = entry.getKey();
                    BIKgInfoVo vo = new BIKgInfoVo();
                    vo.setEntity(cleanedSubject);
                    vo.setRelations(entry.getValue().stream()
                            .map(kg -> new BIKgRelationVO(kg.getPredicateId(), kg.getObjectId()))
                            .collect(Collectors.toList()));

                    // Update the cache; only a brand-new subject needs indexing
                    BIKgInfoVo oldVo = kgCache.put(cleanedSubject, vo);
                    if (oldVo == null) {
                        invertedIndex.add(cleanedSubject);
                    }
                    // An existing subject is already in the inverted index
                }

                long time = System.currentTimeMillis() - start;
                log.info("Knowledge-graph cache updated: {} unique triples loaded, cache size {}, took {} ms",
                        uniqueData.size(), kgCache.size(), time);

                return true;

            } catch (Exception e) {
                log.error("Exception while loading the knowledge-graph cache asynchronously", e);
                return false;
            }
        }, asyncExecutor);
    }

    // ========================
    // Data access
    // ========================

    private List<AppKnowledgeGraph> selectByDt(LocalDate localDate) {
        AppKnowledgeGraphImpl example = new AppKnowledgeGraphImpl();
        java.sql.Date sqlDate = java.sql.Date.valueOf(localDate);
        example.createCriteria().andDtEqualTo(sqlDate);
        return appKnowledgeGraphMapper.selectByExample(example);
    }

    // ========================
    // Query API
    // ========================

    public List<BIKgInfoVo> searchByQuestion(String question) {
        if (question == null || question.trim().isEmpty()) {
            return Collections.emptyList();
        }

        question = question.trim();

        Set<Integer> matchedIds = invertedIndex.search(question);
        if (matchedIds.isEmpty()) {
            return Collections.emptyList();
        }

        // Resolve the matched IDs back to candidate subjects
        List<String> subjects = matchedIds.stream()
                .map(invertedIndex::getStringById)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());

        // Compute each subject's longest-common-substring length once,
        // instead of recomputing it on every comparator call during the sort
        Map<String, Integer> lcsBySubject = new HashMap<>();
        for (String subject : subjects) {
            lcsBySubject.put(subject, longestCommonSubstringLength(subject, question));
        }

        // Sort by LCS length (descending), then by subject length (ascending)
        subjects.sort((a, b) -> {
            int lcsA = lcsBySubject.get(a);
            int lcsB = lcsBySubject.get(b);
            if (lcsA != lcsB) {
                return Integer.compare(lcsB, lcsA); // longer LCS ranks first
            }
            return Integer.compare(a.length(), b.length()); // shorter subject ranks first
        });

        // Result list
        List<BIKgInfoVo> result = new ArrayList<>();
        Set<String> seen = new HashSet<>(); // guards against adding the same subject twice

        // Per-category counts: key = category name, value = count
        Map<String, Integer> categoryCount = new HashMap<>();
        Set<String> selectedCategories = new LinkedHashSet<>(); // keeps categories in first-seen order

        // Category lookup table
        Map<Integer, String> categoryMap = CATEGORY_MAP;

        for (String subject : subjects) {
            if (result.size() >= 12) break;

            BIKgInfoVo vo = kgCache.get(subject);
            if (vo == null || seen.contains(subject)) continue;

            // Derive the category from the relations
            String category = extractCategory(vo, categoryMap);
            if (category == null) {
                category = "其他"; // default category ("other")
            }

            // Admission rule: at most 3 categories, at most 4 entries per category (12 results in total)
            if (selectedCategories.size() < 3 || selectedCategories.contains(category)) {
                int count = categoryCount.getOrDefault(category, 0);
                if (count < 4) {
                    selectedCategories.add(category);
                    categoryCount.put(category, count + 1);
                    BIKgInfoVo newVo = new BIKgInfoVo();
                    newVo.setEntity(DESUtils.encrypt(vo.getEntity(), PASSWORD_KGDICT)); // encrypt the entity name
                    newVo.setRelations(vo.getRelations());
                    result.add(newVo);
                    seen.add(subject);
                }
            }
        }

        return result;
    }

    private String extractCategory(BIKgInfoVo vo, Map<Integer, String> categoryMap) {
        if (vo.getRelations() == null) return null;

        // Assumes predicate == 1 denotes the "type" relation
        for (BIKgRelationVO rel : vo.getRelations()) {
            if (rel.getPredicate() != null && rel.getPredicate().equals(1)) {
                Integer obj = rel.getObject();
                if (obj != null && categoryMap.containsKey(obj)) {
                    return categoryMap.get(obj);
                }
            }
        }
        return null; // category could not be determined
    }

    private int longestCommonSubstringLength(String a, String b) {
        int m = a.length(), n = b.length();
        if (m == 0 || n == 0) return 0;

        int[][] dp = new int[m + 1][n + 1];
        int max = 0;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    max = Math.max(max, dp[i][j]);
                } else {
                    dp[i][j] = 0;
                }
            }
        }
        return max;
    }

    // ========================
    // Monitoring and status
    // ========================

    public boolean isInitialized() {
        return initialized.get();
    }

    public int size() {
        return kgCache.size();
    }


    // ================== Utility class: text cleaning ==================
    public static class TextCleaner {
        /**
         * Characters to strip: double quote ", single quote ', backslash \, angle brackets <>,
         * curly braces {}, square brackets [], vertical bar |
         */
        private static final Pattern INVALID_CHARS_PATTERN = Pattern.compile("[\"'\\\\<>{}\\[\\]|]");

        /**
         * Cleans a subject string by removing the invalid characters.
         */
        public static String cleanSubject(String subject) {
            if (subject == null || subject.isEmpty()) {
                return subject;
            }
            return INVALID_CHARS_PATTERN.matcher(subject).replaceAll("");
        }
    }
}

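import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;

import org.roaringbitmap.IntIterator;
import org.roaringbitmap.RoaringBitmap;

/**
 * A compact inverted index built on a Trie plus RoaringBitmap.
 * Indexing: each text is split into its substrings of length >= 2, which are inserted into the Trie and point to the subject ID.
 * Querying: substrings are extracted from the question and the set of matching subject IDs is returned.
 */
public class CompactInvertedIndex {

    private final TrieNode root = new TrieNode();
    private final Map<String, Integer> stringToId = new ConcurrentHashMap<>();
    // Note: CopyOnWriteArrayList copies its backing array on every write, so bulk loading is costly
    private final List<String> idToString = new CopyOnWriteArrayList<>();
    private volatile int nextId = 0;

    // ========================
    // ID mapping
    // ========================

    private int getId(String str) {
        return stringToId.computeIfAbsent(str, k -> {
            int id;
            // nextId and idToString must change together, so both are updated under one monitor
            synchronized (this) {
                id = nextId++;
                while (idToString.size() <= id) {
                    idToString.add(null);
                }
                idToString.set(id, k);
            }
            return id;
        });
    }

    public String getStringById(int id) {
        return id >= 0 && id < idToString.size() ? idToString.get(id) : null;
    }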

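    // ========================
    // Clearing the whole index
    // ========================

    /**
     * Clears all data: empties the Trie and resets the ID mapping.
     * Synchronized on this index, but it must not run concurrently with add():
     * getId() takes the map's internal lock before this monitor, while clear()
     * takes them in the opposite order, which can deadlock.
     */
    public synchronized void clear() {
        this.root.children.clear();
        if (this.root.bitmap != null) {
            this.root.bitmap.clear();
        }
        this.stringToId.clear();
        this.idToString.clear();
        this.nextId = 0;
        log("Inverted index cleared");
    }

    /**
     * Adds several strings (for example a list of subjects) in one call.
     * @param strings the strings to index
     */
    public void addAll(Collection<String> strings) {
        if (strings == null || strings.isEmpty()) return;
        for (String str : strings) {
            add(str);
        }
        log("Added " + strings.size() + " strings to the inverted index");
    }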

    // ========================
    // Building the index
    // ========================

    /**
     * Adds a text (for example a subject) and binds all of its substrings to the text's ID.
     */
    public void add(String text) {
        if (text == null || text.length() < 2) return;
        int id = getId(text);
        for (int i = 0; i <= text.length() - 2; i++) {
            for (int j = i + 2; j <= text.length(); j++) {
                String substr = text.substring(i, j);
                insertSubstring(substr, id);
            }
        }
    }

    private void insertSubstring(String substr, int id) {
        TrieNode node = root;
        for (char c : substr.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        // RoaringBitmap is not thread-safe, so creation and mutation both happen under the node's monitor
        synchronized (node) {
            if (node.bitmap == null) {
                node.bitmap = new RoaringBitmap();
            }
            node.bitmap.add(id);
        }
    }

    // ========================
    // Query matching
    // ========================

    /**
     * Looks up every substring of length >= 2 of the question and returns the matching subject IDs.
     */
    public Set<Integer> search(String question) {
        if (question == null || question.length() < 2) {
            return Collections.emptySet();
        }

        Set<Integer> result = ConcurrentHashMap.newKeySet();
        for (int i = 0; i <= question.length() - 2; i++) {
            for (int j = i + 2; j <= question.length(); j++) {
                String substr = question.substring(i, j);
                RoaringBitmap ids = searchSubstring(substr);
                if (ids != null && !ids.isEmpty()) {
                    // Copy the IDs into the result set; the Set removes duplicates across substrings
                    IntIterator iter = ids.getIntIterator();
                    while (iter.hasNext()) {
                        result.add(iter.next());
                    }
                }
            }
        }
        return result;
    }

    private RoaringBitmap searchSubstring(String substr) {
        TrieNode node = root;
        for (char c : substr.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        // Return a snapshot taken under the node's monitor so the caller never
        // iterates a bitmap that another thread is mutating
        synchronized (node) {
            return node.bitmap == null ? null : node.bitmap.clone();
        }
    }

    // ========================
    // Removal
    // ========================

    public void remove(String text) {
        if (text == null || text.length() < 2) return;

        Integer id = stringToId.get(text);
        if (id == null) return;

        for (int i = 0; i <= text.length() - 2; i++) {
            for (int j = i + 2; j <= text.length(); j++) {
                String substr = text.substring(i, j);
                removeSubstring(substr, id);
            }
        }

        stringToId.remove(text);
        // Optional: idToString.set(id, null) to mark the slot as empty and release the string
    }

    private void removeSubstring(String substr, int id) {
        TrieNode node = root;
        for (char c : substr.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return;
        }

        // Mutate under the node's monitor, matching insertSubstring
        synchronized (node) {
            if (node.bitmap != null) {
                node.bitmap.remove(id);
                // Empty nodes are not reclaimed here (that would need parent pointers)
            }
        }
    }

    // ========================
    // Statistics
    // ========================

    public int size() {
        return stringToId.size();
    }

    /**
     * Rough estimate only: assumes about 100 bytes per Trie node and 2 bytes per character of each key.
     */
    public long getMemoryEstimateKB() {
        return countNodes(root) * 100L / 1024 +
                stringToId.keySet().stream().mapToInt(String::length).sum() * 2L / 1024;
    }

    private long countNodes(TrieNode node) {
        if (node == null) return 0;
        long count = 1;
        for (TrieNode child : node.children.values()) {
            count += countNodes(child);
        }
        return count;
    }

    // ========================
    // Trie node definition
    // ========================

    private static class TrieNode {
        ConcurrentMap<Character, TrieNode> children = new ConcurrentHashMap<>(4);
        volatile RoaringBitmap bitmap; // volatile so a newly created bitmap is visible to other threads
    }

    // ========================
    // Debug logging (optional)
    // ========================

    private void log(String msg) {
        System.out.println("[CompactInvertedIndex] " + msg);
        // Consider replacing this with an SLF4J Logger:
        // log.debug(msg);
    }
}

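The CompactInvertedIndex code above is a typical substring inverted index built on a Trie plus RoaringBitmap. It combines the two techniques to support fast fuzzy matching over a large set of texts.

The rest of this section explains how it puts the Trie and RoaringBitmap together.

✅ I. Overall design goal

The goal of the component is:

Given a user question, quickly find every knowledge-graph entity (such as "华为" or "小米手机") that shares a substring with it, and return the IDs of those entities.

To do that it uses:

Trie: fast lookup of substrings, one character per level
RoaringBitmap: compact storage of the entity IDs attached to each substring; in this code the per-substring results are then merged and deduplicated through a Set<Integer>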

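✅ II. Core structures

1. TrieNode: a node of the Trie

private static class TrieNode {
    ConcurrentMap<Character, TrieNode> children = new ConcurrentHashMap<>(4);
    volatile RoaringBitmap bitmap; // IDs of every string that contains the substring ending at this node
}

children: maps the current character to the next one (for example '华' → '为')
bitmap: once a substring (such as "小米") has been matched in full, holds the IDs of all original strings that contain it (such as "小米手机")

👉 Each path from the root spells one substring, and the bitmap on the path's last node stores the IDs of all texts containing that substring.

2. RoaringBitmap: efficient storage of ID sets

Each TrieNode's bitmap is a RoaringBitmap holding the ID of every string that contains the node's substring.
Advantages:
- Small memory footprint (compressed storage)
- Fast add, remove, or (union), and (intersection) and other set operations
- Well suited to large data volumes; because a RoaringBitmap is not itself thread-safe, concurrent readers and writers need external synchronization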

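✅ III. Building the index: the `add()` method (write path)

public void add(String text) {
    if (text == null || text.length() < 2) return;
    int id = getId(text); // assign an ID to each unique string
    for (int i = 0; i <= text.length() - 2; i++) {
        for (int j = i + 2; j <= text.length(); j++) {
            String substr = text.substring(i, j);
            insertSubstring(substr, id);
        }
    }
}

🔍 Key logic:

Each text (for example "华为手机") is broken into all of its substrings of length ≥ 2:
- "华为"
- "为手"
- "手机"
- "华为手"
- "为手机"
- "华为手机"
Each substring is inserted into the Trie, and the text's ID is recorded in the bitmap of the node where the substring ends.

👉 As a result, as soon as the user input contains any one of these substrings (such as "华为"), the lookup lands on the entity "华为手机".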
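To make the decomposition concrete, the two nested loops of add() can be pulled out into a standalone helper. This is only a sketch: `SubstringSketch` and `substrings` are hypothetical names, and the minimum length is a parameter here rather than the fixed 2 used in the article's code. A text of n characters yields n(n-1)/2 substrings of length ≥ 2, so the index grows quadratically with the length of each subject.

```java
import java.util.ArrayList;
import java.util.List;

final class SubstringSketch {
    /** All substrings of length >= minLen, in the same (i, j) order as the loops in add(). */
    static List<String> substrings(String text, int minLen) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + minLen <= text.length(); i++) {
            for (int j = i + minLen; j <= text.length(); j++) {
                out.add(text.substring(i, j));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the six substrings of "华为手机" listed above (in loop order)
        System.out.println(substrings("华为手机", 2));
    }
}
```

Because the loop order is by start position, the output is grouped by where each substring begins rather than by length as in the list above; the set of substrings is the same.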

Insertion illustrated (taking the substring "华为" as the example):

root
  └── '华'
       └── '为' → TrieNode.bitmap.add(id_of_华为手机)

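✅ IV. Query matching: the `search()` method (read path)

public Set<Integer> search(String question) {
    Set<Integer> result = ConcurrentHashMap.newKeySet();
    for (int i = 0; i <= question.length() - 2; i++) {
        for (int j = i + 2; j <= question.length(); j++) {
            String substr = question.substring(i, j);
            RoaringBitmap ids = searchSubstring(substr);
            if (ids != null && !ids.isEmpty()) {
                // merge the IDs of every matching substring into result
                IntIterator iter = ids.getIntIterator();
                while (iter.hasNext()) {
                    result.add(iter.next());
                }
            }
        }
    }
    return result;
}

🔍 Query logic:

The user question (for example "华为销量") is likewise broken into all substrings of length ≥ 2:
- "华为"
- "为销"
- "销量"
- "华为销"
- "为销量"
- "华为销量"
Each substring is looked up in the Trie.
If it is found, the bitmap on the matching node is read and all of its IDs are added to the result set.

👉 What is finally returned is the set of candidate entity IDs that share at least one substring of length ≥ 2 with the question.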
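The query logic above restarts from the Trie root for every (i, j) pair. An alternative, shown here as a hedged sketch rather than as the article's method, is to walk the Trie once per start position and stop as soon as a character has no child; it returns the same candidates with fewer steps. `SubstringIndexSketch` is a hypothetical class, `java.util.BitSet` stands in for RoaringBitmap, and the sketch is not thread-safe.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Tiny substring index; BitSet stands in for RoaringBitmap. Not thread-safe. */
final class SubstringIndexSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final BitSet ids = new BitSet(); // IDs of texts containing the substring that ends here
    }

    private final Node root = new Node();

    /** Indexes every substring of length >= 2 of text under the given id. */
    void add(String text, int id) {
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.children.computeIfAbsent(text.charAt(j), k -> new Node());
                if (j - i + 1 >= 2) node.ids.set(id);
            }
        }
    }

    /** Unions the ID sets of every substring (length >= 2) of question found in the trie. */
    BitSet search(String question) {
        BitSet result = new BitSet();
        for (int i = 0; i < question.length(); i++) {
            Node node = root;
            for (int j = i; j < question.length(); j++) {
                node = node.children.get(question.charAt(j));
                if (node == null) break; // no indexed substring extends this one
                if (j - i + 1 >= 2) result.or(node.ids);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SubstringIndexSketch index = new SubstringIndexSketch();
        index.add("华为手机", 0);
        index.add("小米手机", 1);
        System.out.println(index.search("手机品牌")); // both IDs match through "手机"
        System.out.println(index.search("华为销量")); // only ID 0 matches, through "华为"
    }
}
```

Merging with a bitmap union (`or`) also removes duplicates without a separate Set, which is where a bitmap-based result pays off.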

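✅ V. Where the Trie + RoaringBitmap advantages show up

Technique	Role	How it appears in this code
Trie	Fast prefix matching without a full scan	Walks one character per level, locating a substring in O(m) time
RoaringBitmap	Compact storage of ID sets	Each node keeps its IDs in a bitmap; search() deduplicates by collecting them into a Set<Integer>
Substring index	Higher recall for fuzzy matching	Indexes every substring of length ≥ 2, so incomplete user input still matches
Concurrency	Multi-threaded reads and writes	Uses `ConcurrentHashMap`, `volatile` and `synchronized`; mutations of a shared RoaringBitmap still need a lock because the bitmap itself is not thread-safe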

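✅ VI. Performance highlights

Memory:
- IDs are stored in compressed RoaringBitmaps, which take roughly 5 to 10 times less memory than HashSet<Integer>; the exact ratio depends on how the IDs are distributed.
Speed:
- A single Trie lookup takes O(m) time, where m is the length of the substring.
- The per-substring lookups are independent of one another, so they could be run in parallel; the code as written performs them sequentially on one thread.
Dynamic updates:
- add, remove and clear are supported, so the index can be updated incrementally along with the cache.
Thread safety:
- ConcurrentHashMap, volatile and synchronized protect the Trie structure and the ID mapping. RoaringBitmap itself is not thread-safe, so every add, remove and read of a node's bitmap must also be guarded.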

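✅ VII. A complete example

Suppose:

"华为手机" is added → assigned ID 0
"小米手机" is added → assigned ID 1

After the build:

The "华为" node in the Trie has bitmap = {0}
The "小米" node has bitmap = {1}
The "手机" node has bitmap = {0, 1}

When the user types "手机品牌":

The substring "手机" matches → IDs {0, 1} are returned
The system can then offer "华为手机" and "小米手机" as candidates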

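✅ Summary: how does it implement Trie + RoaringBitmap?

The CompactInvertedIndex class implements Trie + RoaringBitmap as follows:

A Trie organizes all substrings of length ≥ 2, so each substring is located in O(m) time;

Each Trie node maintains a RoaringBitmap storing the IDs of all original strings that contain that substring;

At query time the question is broken into substrings, each is looked up in the Trie, and the IDs from all matching bitmaps are merged into the candidate set;

Combining substring indexing with bitmap compression gives fuzzy matching with high recall, low latency and low memory use.

🎯 This is a common "candidate recall" optimization in industrial NLP systems, and fits knowledge graphs, search, AI word segmentation and similar scenarios particularly well.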

--------------------------------------------------------------------------------

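The getId(String str) method is a key piece of the CompactInvertedIndex inverted index. It maintains a unique mapping from strings to integer IDs that is efficient and safe to call from multiple threads.

Let's go through it line by line: what it does, the design behind it, and why it is written this way.

🔍 The code

private int getId(String str) {
    return stringToId.computeIfAbsent(str, k -> {
        int id;
        synchronized (this) {
            id = nextId++;
            while (idToString.size() <= id) {
                idToString.add(null);
            }
            idToString.set(id, k);
        }
        return id;
    });
}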

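✅ I. Overall function: string → integer ID mapping

Goal:

Assign every unique string (such as "华为手机") a unique integer ID (0, 1, 2, ...). All later operations use the ID instead of the string, which improves performance.

This technique is known as string interning or dictionary encoding.

Where it is used:

The Trie's bitmaps hold IDs rather than full strings (saves memory)
RoaringBitmap stores IDs of type int (efficient)

✅ II. Data structures

Variable	Type	Purpose
`stringToId`	`Map<String, Integer>`	String → ID mapping (the primary index)
`idToString`	`List<String>`	Reverse ID → string mapping (used to turn query results back into strings)
`nextId`	`int`	The next available ID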

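✅ III. Line-by-line explanation

Line 1: entry point

private int getId(String str) {

Takes a string str
Returns its integer ID

Line 2: `computeIfAbsent` for lazy creation + thread-safe deduplication

return stringToId.computeIfAbsent(str, k -> { ... });

📌 `computeIfAbsent(key, mappingFunction)`

This is an atomic operation provided by ConcurrentHashMap:

If str already has a mapping, the existing ID is returned directly
If not, the lambda that follows is executed to generate a new ID, which is then inserted

✅ Advantages:

Avoids assigning duplicate IDs under high concurrency
No external lock is needed to check for existence

Line 3: entering the synchronized block

synchronized (this) {

Although the outer call is ConcurrentHashMap.computeIfAbsent, the body also modifies the shared variables nextId and idToString, so it takes a lock to make those updates atomic.

⚠️ Note: ConcurrentHashMap alone is not enough, because nextId++ and the idToString update must happen together as one atomic step. The lambda runs while computeIfAbsent holds the map's internal lock for that key's bin, so this monitor is acquired second; clear() acquires the two in the opposite order (the monitor first, then the map through stringToId.clear()), which means clear() must not run concurrently with add().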
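One way to sidestep the lock-ordering concern is to guard both directions of the mapping with a single lock. The class below is a minimal sketch of that alternative, not the article's code: `IdDictionarySketch` is a hypothetical name, and it trades the lock-free read path of ConcurrentHashMap for simplicity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** String <-> int dictionary guarded by one lock, so there is no lock ordering to get wrong. */
final class IdDictionarySketch {
    private final Map<String, Integer> stringToId = new HashMap<>();
    private final List<String> idToString = new ArrayList<>();

    /** Returns the existing ID for str, or assigns the next free one. */
    synchronized int getId(String str) {
        Integer existing = stringToId.get(str);
        if (existing != null) return existing;
        int id = idToString.size(); // the next ID is always the current list size
        idToString.add(str);
        stringToId.put(str, id);
        return id;
    }

    /** Reverse lookup; returns null for an ID that was never assigned. */
    synchronized String getStringById(int id) {
        return id >= 0 && id < idToString.size() ? idToString.get(id) : null;
    }

    /** Resets both directions of the mapping in one step. */
    synchronized void clear() {
        stringToId.clear();
        idToString.clear();
    }
}
```

Since the next ID is derived from the list size, there is no separate counter to keep in step with the list, and every method takes the same monitor.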

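Line 4: take the next ID

id = nextId++;

Uses an auto-increment allocation strategy
Starts at 0 and increases by one on each allocation

For example:

First call: id = 0, nextId becomes 1
Second call: id = 1, nextId becomes 2

Lines 5-6: make sure the `idToString` list is long enough

while (idToString.size() <= id) {
    idToString.add(null);
}

👉 This guards against an out-of-bounds index on the List.

An example:

Currently idToString.size() == 0
id = 5 (a hypothetical case in which the ID has run ahead of the list)
Calling set(5, ...) directly would throw
So null placeholders are added first, until size > 5

In normal operation id equals idToString.size() when the loop starts, so the loop appends exactly one slot; it is a defensive measure.

✅ Grows the list safely and avoids IndexOutOfBoundsException

Line 7: store the reverse mapping from ID to string

idToString.set(id, k);

Stores the string at position idToString[id], so the original string can later be recovered from the ID.

For example:

String entity = idToString.get(1); // yields "小米手机" once two strings have been added as in part IV

Lines 8-9: leave the synchronized block and return the ID

}
return id;

Exits the synchronized block and returns the newly assigned ID.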

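✅ IV. The full flow, step by step

Suppose getId("华为手机") and getId("小米手机") are called in turn:

Step	Operation	`stringToId`	`idToString`	`nextId`
1	Call `getId("华为手机")`	`"华为手机" → 0`	`[0]="华为手机"`	1
2	Call `getId("小米手机")`	adds `"小米手机" → 1`	`[0]="华为手机"`, `[1]="小米手机"`	2
3	Call `getId("华为手机")` again	(already present) returns 0 directly	unchanged	unchanged

✅ V. Why this design? Summary of advantages

Property	How it is achieved	Benefit
Uniqueness	`computeIfAbsent`	The same string always gets the same ID
Thread safety	`ConcurrentHashMap + synchronized(this)`	Concurrent calls to getId cannot assign duplicate or conflicting IDs
Forward and reverse lookup	`stringToId` + `idToString`	Supports two-way ID ↔ string mapping
Performance	Stores `int` instead of `String`	The bitmaps hold compact ints, so they are faster and smaller
Dynamic growth	Auto-increment ID + growable List	New strings can be added at any time, up to the range of `int`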
