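[Knowledge Acquisition and Sharing Community Project | Project Diary, Day 21] Index Building and Autocomplete Suggestions: Outbox Incremental Updates + Completion Suggester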

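1. A search system is more than a query API

The previous two posts focused on the query side of search:

multi_match for keyword recall
function_score for ranking
search_after for pagination

But a search system also has to answer another important question:

Where does the data in ES come from?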

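In this project, MySQL is the primary data store for content, and ES holds the search index.

So the search system has to solve two problems:

How historical content gets written into ES;
How newly published, updated, and deleted content gets synchronized to ES.

The project's approach is:

On startup, if the index is empty, backfill historical public content
Business changes write to the Outbox
Canal subscribes to the Outbox binlog
Kafka distributes the events
The search consumer updates ES asynchronously

In addition, the search index maintains a separate title_suggest field for prefix suggestions.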

2. Initializing the ES index mapping

The index initializer class is located at:

src/main/java/com/tongji/search/index/SearchIndexInitializer.java

The core code is as follows:

@PostConstruct
public void ensureIndex() {
    try {
        boolean exists = es.indices().exists(e -> e.index(INDEX)).value();
        if (exists) {
            return;
        }

        es.indices().create(c -> c.index(INDEX).mappings(m -> m
                .properties("content_id", Property.of(p -> p.long_(LongNumberProperty.of(b -> b))))
                .properties("content_type", Property.of(p -> p.keyword(KeywordProperty.of(b -> b))))
                .properties("description", Property.of(p -> p.text(TextProperty.of(b -> b.analyzer("ik_max_word")))))
                .properties("title", Property.of(p -> p.text(TextProperty.of(b -> b.analyzer("ik_max_word").searchAnalyzer("ik_smart")))))
                .properties("body", Property.of(p -> p.text(TextProperty.of(b -> b.analyzer("ik_max_word")))))
                .properties("tags", Property.of(p -> p.keyword(KeywordProperty.of(b -> b))))
                .properties("author_id", Property.of(p -> p.long_(LongNumberProperty.of(b -> b))))
                .properties("author_avatar", Property.of(p -> p.keyword(KeywordProperty.of(b -> b))))
                .properties("author_nickname", Property.of(p -> p.keyword(KeywordProperty.of(b -> b))))
                .properties("author_tag_json", Property.of(p -> p.keyword(KeywordProperty.of(b -> b))))
                .properties("publish_time", Property.of(p -> p.date(DateProperty.of(b -> b))))
                .properties("like_count", Property.of(p -> p.integer(IntegerNumberProperty.of(b -> b))))
                .properties("favorite_count", Property.of(p -> p.integer(IntegerNumberProperty.of(b -> b))))
                .properties("view_count", Property.of(p -> p.integer(IntegerNumberProperty.of(b -> b))))
                .properties("status", Property.of(p -> p.keyword(KeywordProperty.of(b -> b))))
                .properties("img_urls", Property.of(p -> p.keyword(KeywordProperty.of(b -> b))))
                .properties("is_top", Property.of(p -> p.keyword(KeywordProperty.of(b -> b))))
                .properties("title_suggest", Property.of(p -> p.completion(CompletionProperty.of(b -> b))))
        ));
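    } catch (Exception e) {
        // Do not swallow silently: a failed create (e.g. a missing IK plugin) should be visible.
        // Assumes an SLF4J logger named `log`, as used in SearchIndexService.
        log.warn("Failed to ensure index {}: {}", INDEX, e.getMessage());
    }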
}

The index name is:

private static final String INDEX = "zhiguang_content_index";

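A few field types are especially important:

Field	Type	Purpose
`title`	text	Full-text search on the title
`body`	text	Full-text search on the body
`tags`	keyword	Exact-match tag filtering
`status`	keyword	Filtering by published / deleted
`like_count`	integer	Ranking boost
`view_count`	integer	Ranking boost
`publish_time`	date	Sorting by time
`title_suggest`	completion	Prefix suggestions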

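The title, description, and body fields use the IK analyzers. description and body set only ik_max_word, which then applies at both index and search time, while title indexes with ik_max_word and searches with ik_smart: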

.analyzer("ik_max_word")
.searchAnalyzer("ik_smart")

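This means the ES cluster must have the IK analysis plugin installed. Without it, index creation is rejected because the ik_max_word and ik_smart analyzers cannot be resolved, and because ensureIndex catches the exception rather than rethrowing it, the application still starts, just without the intended mapping.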

3. Backfilling historical data at startup

The index write service is located at:

src/main/java/com/tongji/search/index/SearchIndexService.java

After startup it runs:

@PostConstruct
public void ensureBackfill() {
    try {
        long cnt = es.count(c -> c.index(INDEX)).count();
        if (cnt > 0) return;

        int limit = 500;
        int offset = 0;

        while (true) {
            List<KnowPostFeedRow> rows = knowPostMapper.listFeedPublic(limit, offset);
            if (rows == null || rows.isEmpty()) {
                break;
            }

            for (KnowPostFeedRow r : rows) {
                upsertKnowPost(r.getId());
            }

            offset += rows.size();
        }

        log.info("Search index backfill completed: {} documents",
                es.count(c -> c.index(INDEX)).count());
    } catch (Exception e) {
        log.warn("Search index backfill skipped: {}", e.getMessage());
    }
}

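The logic is straightforward:

If the ES index already contains data, skip the backfill
If the index is empty, page through public, published posts
Call upsertKnowPost for each one to write it into ES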
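The backfill pages with LIMIT/OFFSET over ORDER BY publish_time DESC. Rows that share the same publish_time have no guaranteed order between pages, and OFFSET typically gets more expensive as it grows. A minimal sketch of an alternative, keyset paging on a unique id, is shown below. It runs against an in-memory stand-in; `PostIdSource` and `fetchIdsAfter` are hypothetical names, and the SQL mentioned in the comments is an assumption, not the project's mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

/** Hypothetical sketch of a keyset-paged backfill loop; not the project's actual code. */
final class KeysetBackfill {

    /** Hypothetical data access: returns up to `limit` ids greater than `afterId`, ascending. */
    interface PostIdSource {
        List<Long> fetchIdsAfter(long afterId, int limit);
    }

    /** Visits every id exactly once in ascending order; returns how many were processed. */
    static int backfill(PostIdSource source, int limit, LongConsumer indexer) {
        long lastId = Long.MIN_VALUE;
        int processed = 0;
        while (true) {
            List<Long> page = source.fetchIdsAfter(lastId, limit);
            if (page.isEmpty()) {
                break;
            }
            for (long id : page) {
                indexer.accept(id); // e.g. a call such as upsertKnowPost(id)
                processed++;
            }
            // The cursor is the last id seen, so the next page starts strictly after it.
            lastId = page.get(page.size() - 1);
        }
        return processed;
    }

    /** In-memory stand-in for a query like: WHERE id > ? ORDER BY id LIMIT ? */
    static PostIdSource inMemory(List<Long> sortedIds) {
        return (afterId, limit) -> {
            List<Long> out = new ArrayList<>();
            for (long id : sortedIds) {
                if (id > afterId && out.size() < limit) {
                    out.add(id);
                }
            }
            return out;
        };
    }
}
```

Because the cursor is a unique id rather than a row offset, concurrent inserts or equal timestamps cannot shift rows between pages. Separately, upsertKnowPost requests refresh wait_for on every write, so a sequential backfill also waits for a refresh per document; whether to relax that during backfill is a tuning decision this sketch does not address.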

The MyBatis query used by the backfill is:

<select id="listFeedPublic" resultType="com.tongji.knowpost.model.KnowPostFeedRow">
    SELECT
        p.id,
        p.title,
        p.description,
        p.tags,
        p.img_urls AS imgUrls,
        u.avatar AS authorAvatar,
        u.nickname AS authorNickname,
        u.tags_json AS authorTagJson,
        p.publish_time AS publishTime,
        p.is_top AS isTop
    FROM know_posts p
    JOIN users u ON p.creator_id = u.id
    WHERE p.status = 'published' AND p.visible = 'public'
    ORDER BY p.publish_time DESC
    LIMIT #{limit} OFFSET #{offset}
</select>

The backfill only covers:

published + public

This keeps drafts and non-public content out of the initial search index.
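The backfill applies the published + public rule in SQL, whereas upsertKnowPost indexes whatever row findDetailById returns and stores only status in the document. Assuming the incremental path is meant to follow the same rule (and that findDetailById does not already filter such rows, which the source does not show), the decision could be sketched as below. `IndexGate`, `Action`, and `decide` are hypothetical names.

```java
/** Hypothetical sketch: decide how a row should be reflected in the search index. */
final class IndexGate {

    enum Action { UPSERT, SOFT_DELETE }

    /** Mirrors the backfill filter: only published + public rows are searchable. */
    static Action decide(String status, String visible) {
        boolean searchable = "published".equals(status) && "public".equals(visible);
        return searchable ? Action.UPSERT : Action.SOFT_DELETE;
    }
}
```

With a gate like this, a post that is switched from public to private would be hidden by the same event that carries the metadata change, instead of staying searchable under status = published.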

4. Upserting a single post into ES

The core write method is as follows:

public void upsertKnowPost(long id) {
    try {
        KnowPostDetailRow row = knowPostMapper.findDetailById(id);
        if (row == null) {
            log.warn("Index upsert skipped: post {} not found", id);
            return;
        }

        Map<String, Object> doc = new HashMap<>();
        doc.put("content_id", row.getId());
        doc.put("content_type", row.getType());
        doc.put("title", row.getTitle());
        doc.put("description", row.getDescription());
        doc.put("author_id", row.getCreatorId());
        doc.put("author_avatar", row.getAuthorAvatar());
        doc.put("author_nickname", row.getAuthorNickname());
        doc.put("author_tag_json", row.getAuthorTagJson());

        if (row.getPublishTime() != null) {
            doc.put("publish_time", row.getPublishTime().toEpochMilli());
        }

        doc.put("status", row.getStatus());
        doc.put("tags", parseStringArray(row.getTags()));
        doc.put("img_urls", parseStringArray(row.getImgUrls()));

        if (row.getIsTop() != null) {
            doc.put("is_top", row.getIsTop());
        }

        String body = fetchContentSafe(row.getContentUrl());
        if (body == null || body.isBlank()) {
            body = row.getDescription();
        }
        if (body != null) {
            doc.put("body", truncate(body, 4000));
        }

        Map<String, Long> counts = counterService.getCounts("knowpost", String.valueOf(id), List.of("like","fav"));
        doc.put("like_count", counts.getOrDefault("like", 0L));
        doc.put("favorite_count", counts.getOrDefault("fav", 0L));
        doc.put("view_count", 0L);

        if (row.getTitle() != null && !row.getTitle().isBlank()) {
            doc.put("title_suggest", row.getTitle());
        }

        IndexRequest<Map<String, Object>> req = IndexRequest.of(b -> b
                .index(INDEX)
                .id(String.valueOf(id))
                .document(doc)
                .refresh(Refresh.WaitFor)
        );

        IndexResponse resp = es.index(req);
        log.info("Indexed post {} result={} version={}", id, resp.result(), resp.version());
    } catch (Exception e) {
        log.error("Index upsert failed for post {}: {}", id, e.getMessage());
    }
}

This code does several things.

First, it loads the post details from MySQL:

KnowPostDetailRow row = knowPostMapper.findDetailById(id);

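Second, it assembles the ES document fields:

Core content fields: title, description, body
Author fields: author_id, author_avatar, author_nickname
Ranking fields: publish_time, like_count, view_count (view_count is currently always written as 0)
Filter fields: status, tags
Display fields: img_urls
Suggestion field: title_suggest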

Third, it reads the like and favorite counts from the counter system:

counterService.getCounts("knowpost", String.valueOf(id), List.of("like","fav"));

Fourth, it uses the post ID as the ES document ID:

.id(String.valueOf(id))

This way, repeated upserts of the same post overwrite the same ES document instead of creating duplicates.

5. Fetching the body from OSS

The post body is stored in OSS, so the index service fetches it via contentUrl:

String body = fetchContentSafe(row.getContentUrl());
if (body == null || body.isBlank()) {
    body = row.getDescription();
}
if (body != null) {
    doc.put("body", truncate(body, 4000));
}

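There are two engineering details here.

First, a failed body fetch does not abort indexing; the code falls back to description.

Second, the body is truncated to at most 4000 characters: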

private String truncate(String s, int max) {
    if (s == null) {
        return null;
    }

    return s.length() <= max ? s : s.substring(0, max);
}

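A search system does not have to put the entire Markdown document into ES. For ordinary keyword search, the title, the summary, and the first few thousand characters of the body already cover most recall needs.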
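One detail of truncate is that substring counts UTF-16 code units, so a cut at exactly 4000 can land between the two halves of a surrogate pair (an emoji, for example) and leave a lone surrogate at the end of the body. A minimal sketch of a variant that backs off by one unit in that case follows; `SafeTruncate` is a hypothetical name, not part of the project.

```java
/** Hypothetical variant of truncate that never splits a surrogate pair. */
final class SafeTruncate {

    /** Returns at most `max` UTF-16 code units of `s`, ending on a whole code point. */
    static String truncate(String s, int max) {
        if (s == null || s.length() <= max) {
            return s;
        }
        int end = max;
        // If the last kept unit is a high surrogate, its low surrogate would be cut off.
        if (end > 0 && Character.isHighSurrogate(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }
}
```

The result can be one unit shorter than the limit, which is harmless for a recall-oriented body field.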

6. Soft delete: hiding search results via status

When a post is deleted, the search index does not physically delete the ES document. Instead it writes:

public void softDeleteKnowPost(long id) {
    try {
        Map<String, Object> doc = new HashMap<>();
        doc.put("content_id", id);
        doc.put("status", "deleted");

        IndexRequest<Map<String, Object>> req = IndexRequest.of(b -> b
                .index(INDEX)
                .id(String.valueOf(id))
                .document(doc)
                .refresh(Refresh.WaitFor)
        );

        es.index(req);
    } catch (Exception e) {
        log.error("Index soft delete failed for post {}: {}", id, e.getMessage());
    }
}

The search query includes a filter condition:

status = published

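So once the ES document's status becomes deleted, it no longer appears in search results.

Note that softDeleteKnowPost sends a full index request, so the stored document is replaced by one that contains only content_id and status. The title, body, and title_suggest fields are dropped, which also keeps the deleted title out of prefix suggestions.

This soft-delete approach is simple to implement, and the remaining tombstone document still shows that the post was deleted when troubleshooting index state.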
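To make the difference between replacing a document and merging fields into it concrete, here is a minimal in-memory sketch. It only imitates the two write semantics with maps; `DocStoreModel` and its methods are hypothetical and are not ES client calls.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of document writes: replace vs. merge. */
final class DocStoreModel {

    private final Map<String, Map<String, Object>> docs = new HashMap<>();

    /** Models an index request: the stored document is replaced as a whole. */
    void index(String id, Map<String, Object> doc) {
        docs.put(id, new HashMap<>(doc));
    }

    /** Models a partial update: the given fields are merged into the existing document. */
    void partialUpdate(String id, Map<String, Object> fields) {
        docs.computeIfAbsent(id, k -> new HashMap<>()).putAll(fields);
    }

    /** Returns the stored document, or null if the id is unknown. */
    Map<String, Object> get(String id) {
        return docs.get(id);
    }
}
```

If the original fields were to be kept for troubleshooting, a merge-style write would preserve them, but title_suggest would then survive as well. Since the suggest request shown later in this post carries no status filter, that field would have to be cleared explicitly in such a design.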

7. Outbox-driven incremental index updates

When a post is published, an Outbox event is written:

String payload = objectMapper.writeValueAsString(
        Map.of("entity", "knowpost", "op", "upsert", "id", id)
);
outboxMapper.insert(outId, "knowpost", id, "KnowPostPublished", payload);

Metadata updates also write an upsert event:

String payload = objectMapper.writeValueAsString(
        Map.of("entity", "knowpost", "op", "upsert", "id", id)
);
outboxMapper.insert(outId, "knowpost", id, "KnowPostMetadataUpdated", payload);

Deletion writes a delete event:

String payload = objectMapper.writeValueAsString(
        Map.of("entity", "knowpost", "op", "delete", "id", id)
);
outboxMapper.insert(outId, "knowpost", id, "KnowPostDeleted", payload);

The search consumer listens for Outbox messages in Kafka:

@KafkaListener(topics = OutboxTopics.CANAL_OUTBOX, groupId = "search-index-consumer")
public void onMessage(String message, Acknowledgment ack) {
    try {
        List<JsonNode> rows = OutboxMessageUtil.extractRows(objectMapper, message);

        if (rows.isEmpty()) {
            ack.acknowledge();
            return;
        }

        for (JsonNode row : rows) {
            JsonNode payloadNode = row.get("payload");
            if (payloadNode == null) {
                continue;
            }

            JsonNode payload = objectMapper.readTree(payloadNode.asText());
            String entity = text(payload.get("entity"));
            String op = text(payload.get("op"));
            Long id = asLong(payload.get("id"));

            if (!"knowpost".equals(entity) || id == null) {
                continue;
            }

            if ("delete".equalsIgnoreCase(op)) {
                indexService.softDeleteKnowPost(id);
            } else {
                indexService.upsertKnowPost(id);
            }
        }

        ack.acknowledge();
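    } catch (Exception e) {
        // The message is not acknowledged on this path; log it rather than swallowing it.
        // Assumes an SLF4J logger named `log`.
        log.error("Outbox message handling failed: {}", e.getMessage(), e);
    }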
}

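The value of this pipeline is:

The publish flow only writes the business tables and the Outbox
The search index is updated asynchronously
ES failures do not directly affect the publish transaction
The index can later be repaired through message retries or a backfill

Note that in the code above, upsertKnowPost and softDeleteKnowPost catch and log their own exceptions, so the consumer acknowledges the message even when the ES write failed. Message-level retry would require those failures to reach the consumer.

This is the eventual-consistency design of the search system.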
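One way to make message retries possible is to let index failures reach the consumer and acknowledge only when the whole batch succeeded. The sketch below isolates that dispatch decision from Kafka and Jackson so it can run on its own. `OutboxEvent`, `IndexOps`, and `applyAll` are hypothetical names, and the throwing variants of the index operations are an assumption, since the project's versions catch their own exceptions.

```java
import java.util.List;

/** Hypothetical sketch of the consumer's dispatch logic with failure propagation. */
final class OutboxDispatch {

    /** Parsed form of the outbox payload {"entity": ..., "op": ..., "id": ...}. */
    record OutboxEvent(String entity, String op, Long id) {}

    /** Index operations that are allowed to throw on failure (an assumption). */
    interface IndexOps {
        void upsert(long id) throws Exception;
        void softDelete(long id) throws Exception;
    }

    /**
     * Applies every know-post event in order.
     * Returns true only if all of them succeeded; the caller would acknowledge
     * the Kafka message only in that case.
     */
    static boolean applyAll(List<OutboxEvent> events, IndexOps ops) {
        for (OutboxEvent e : events) {
            if (!"knowpost".equals(e.entity()) || e.id() == null) {
                continue; // not a know-post event, nothing to index
            }
            try {
                if ("delete".equalsIgnoreCase(e.op())) {
                    ops.softDelete(e.id());
                } else {
                    ops.upsert(e.id());
                }
            } catch (Exception ex) {
                return false; // leave the message unacknowledged so it can be retried
            }
        }
        return true;
    }
}
```

Reprocessing a batch is safe here because both operations are idempotent: an upsert reloads the current row from MySQL and overwrites the document with the same ID, and a soft delete writes the same tombstone again.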

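8. Prefix suggestions with the Completion Suggester

As users type in the search box, they usually expect suggestions like these:

Input: Red
Suggestion: Redis counter system design
Suggestion: Redis Lua atomic updates
Suggestion: Redis cache consistency

The project uses ES's completion suggester for low-latency prefix suggestions.

The mapping defines: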

.properties("title_suggest", Property.of(p -> p.completion(CompletionProperty.of(b -> b))))

When writing to the index, title_suggest is set whenever the title is not blank:

if (row.getTitle() != null && !row.getTitle().isBlank()) {
    doc.put("title_suggest", row.getTitle());
}
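Completion suggestions match from the beginning of an input. With the full title as the only input, a title such as "Designing a Redis counter" would not be suggested for the prefix "Red". A completion field accepts several inputs per document, so one option is to derive extra inputs from the title. The sketch below assumes whitespace-separated titles; `SuggestInputs` and `suffixInputs` are hypothetical names, and titles without spaces (most Chinese titles) would need a real tokenizer instead of split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that derives several completion inputs from one title. */
final class SuggestInputs {

    /** Returns the title plus its word-aligned suffixes, at most `maxInputs` entries. */
    static List<String> suffixInputs(String title, int maxInputs) {
        List<String> inputs = new ArrayList<>();
        if (title == null || title.isBlank()) {
            return inputs;
        }
        String[] tokens = title.strip().split("\\s+");
        for (int i = 0; i < tokens.length && inputs.size() < maxInputs; i++) {
            // Join tokens i..end so the suffix starts on a word boundary.
            inputs.add(String.join(" ", Arrays.copyOfRange(tokens, i, tokens.length)));
        }
        return inputs;
    }
}
```

Whether the text shown to the user should then be the matched input or the full title is a separate choice that this sketch does not cover.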

The query endpoint is:

@GetMapping("/suggest")
public SuggestResponse suggest(@RequestParam("prefix") @NotBlank String prefix,
                               @RequestParam(value = "size", required = false, defaultValue = "10") @Min(1) int size) {
    return searchService.suggest(prefix, size);
}

The service implementation is as follows:

public SuggestResponse suggest(String prefix, int size) {
    co.elastic.clients.elasticsearch.core.SearchResponse<Map<String, Object>> resp;
    try {
        resp = es.search(s -> s.index(INDEX)
                .suggest(sug -> sug.suggesters("title_suggest",
                        sc -> sc.prefix(prefix)
                                .completion(c -> c.field("title_suggest").size(size))))
                , (Class<Map<String, Object>>)(Class<?>) Map.class);
    } catch (Exception e) {
        return new SuggestResponse(Collections.emptyList());
    }

    List<String> items = new ArrayList<>();

    try {
        var sugg = resp.suggest();
        List<Suggestion<Map<String, Object>>> entry = sugg == null ? null : sugg.get("title_suggest");

        if (entry != null) {
            for (var s : entry) {
                var comp = s.completion();
                if (comp != null && comp.options() != null) {
                    for (var opt : comp.options()) {
                        String text = opt.text();
                        if (text != null && !text.isBlank()) {
                            items.add(text);
                        }
                    }
                }
            }
        }
    } catch (Exception ignored) {}

    return new SuggestResponse(items);
}

The response structure is very lightweight:

public record SuggestResponse(
        List<String> items
) {}

The frontend can render it directly in the search box dropdown.
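Two small gaps are visible in the endpoint and service above: size has a lower bound (@Min(1)) but no upper bound, and several posts with the same title each contribute their own option, so the list can contain duplicates. A minimal post-processing sketch is shown below; `SuggestPostProcess`, `normalize`, and `clampSize` are hypothetical names, and the cap of 20 used in the usage note is an arbitrary assumption. The completion suggester also has a skip_duplicates option that could handle deduplication on the ES side.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical post-processing for suggestion items. */
final class SuggestPostProcess {

    /** Trims items, drops blanks and duplicates (first occurrence wins), keeps at most `max`. */
    static List<String> normalize(List<String> raw, int max) {
        Set<String> seen = new LinkedHashSet<>();
        for (String s : raw) {
            if (seen.size() >= max) {
                break;
            }
            if (s == null) {
                continue;
            }
            String t = s.strip();
            if (!t.isEmpty()) {
                seen.add(t);
            }
        }
        return new ArrayList<>(seen);
    }

    /** Clamps a requested size into the range [1, max]. */
    static int clampSize(int requested, int max) {
        return Math.max(1, Math.min(requested, max));
    }
}
```

In the service, this could be used as `normalize(items, clampSize(size, 20))` just before building the SuggestResponse.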

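9. Reviewing the full search pipeline

Putting the three posts together, the overall search pipeline is:

Application startup
  ↓
Ensure zhiguang_content_index exists
  ↓
If the index is empty, backfill historical public, published posts
  ↓

User publishes / updates / deletes a post
  ↓
Business transaction writes know_posts + outbox
  ↓
Canal subscribes to the outbox binlog
  ↓
Kafka distributes outbox messages
  ↓
Search consumer upserts / soft-deletes ES documents
  ↓

User searches
  ↓
multi_match searches title/body
  ↓
tags/status filter
  ↓
function_score blends like and view weights
  ↓
search_after cursor pagination
  ↓
Returns a list of FeedItemResponse
  ↓

User types a search prefix
  ↓
completion suggester returns candidate titles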

In this design, MySQL remains the authoritative data source, and ES is a derived index built for the search experience.

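10. Summary

This post looked at index building and autocomplete suggestions in the search system.

The key points are:

SearchIndexInitializer creates the ES mapping
SearchIndexService handles the historical backfill, upserts, and soft deletes
Outbox + Canal + Kafka drive incremental updates to the search index
The post ID (also stored as content_id) is the ES document ID, so writes for the same post overwrite each other
title_suggest uses the completion type to support low-latency prefix suggestions

What really matters in a search system is not just that content "can be found", but that:

Indexes are built automatically
Content changes are synchronized asynchronously
Ranking balances relevance and business value
Pagination loads further results stably
Suggestions respond quickly

This is also a key capability in moving the platform from a plain content publishing system toward a complete content consumption experience.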