Full-Text Search for a Blog System in Practice: Replacing MySQL LIKE Queries with Elasticsearch

Introduction

Is this how search works in your blog system?

SELECT * FROM article 
WHERE (title LIKE '%Spring Boot%' OR content LIKE '%Spring Boot%')
  AND audit_status = 1 
  AND status = 0
ORDER BY score DESC
LIMIT 10;

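This SQL is fine while the table is small. However, a pattern with a leading wildcard such as LIKE '%keyword%' cannot use a B-tree index, so every search is a full table scan, and the cost becomes painful once the table holds tens of thousands of articles. Worse, LIKE only does literal substring matching, with no tokenization or analysis: searching "Spring Boot" does not match "SpringBoot", and searching "弹性搜索" (Chinese for "elastic search") does not match "Elasticsearch".

Starting from a real blog article table, this post walks step by step through building a blog search system on Elasticsearch that supports full-text search, multi-condition filtering, relevance ranking, and search highlighting.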
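
To make the substring behaviour concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made purely for illustration; the two sample rows and the helper name like_search are made up). In both engines, LIKE '%...%' is a literal substring test.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the rows are made-up samples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (aid TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO article VALUES (?, ?)",
    [
        ("a1", "Spring Boot 实战"),
        ("a2", "SpringBoot 自动装配原理"),
    ],
)

def like_search(keyword):
    """Return ids of articles whose title contains `keyword` as a literal substring."""
    rows = conn.execute(
        "SELECT aid FROM article WHERE title LIKE ? ORDER BY aid",
        (f"%{keyword}%",),
    )
    return [aid for (aid,) in rows]

print(like_search("Spring Boot"))  # the "SpringBoot" row is missed
print(like_search("Spring"))       # both rows contain this substring
```

The first call finds only the row spelled with a space, which is the gap a search engine with proper text analysis is meant to close.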

1. From MySQL Table to ES Index Design

First, the article table structure (core fields only):

CREATE TABLE `article` (
  `aid`          varchar(64)   COMMENT 'article id',
  `title`        varchar(200)  COMMENT 'article title',
  `describe`     varchar(500)  COMMENT 'article summary',
  `content`      text          COMMENT 'article content (HTML)',
  `tags`         varchar(500)  COMMENT 'article tags (comma-separated)',
  `author_name`  varchar(100)  COMMENT 'author name',
  `tag_name`     varchar(100)  COMMENT 'tag name',
  `tid`          varchar(64)   COMMENT 'category id',
  `is_top`       int           COMMENT '0: normal, 1: pinned',
  `status`       tinyint(1)    COMMENT '0: normal, 1: deleted',
  `audit_status` int           COMMENT '0: pending review, 1: approved, 2: rejected',
  `look_count`   int           COMMENT 'total views',
  `like_count`   int           COMMENT 'like count',
  `score`        double        COMMENT 'weighted score',
  `create_time`  datetime      COMMENT 'article creation time'
);

The corresponding ES mapping:

PUT /article-index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik_smart_pinyin": {
          "type": "custom",
          "tokenizer": "ik_smart"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "aid":         { "type": "keyword" },
      "title":       { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart",
                       "copy_to": "full_text" },
      "describe":    { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart",
                       "copy_to": "full_text" },
      "content":     { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart",
                       "copy_to": "full_text" },
      "tags":        { "type": "text", "analyzer": "ik_max_word", "copy_to": "full_text" },
      "full_text":   { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart" },
      "author_name": { "type": "keyword" },
      "tag_name":    { "type": "keyword" },
      "tid":         { "type": "keyword" },
      "is_top":      { "type": "integer" },
      "status":      { "type": "integer" },
      "audit_status":{ "type": "integer" },
      "look_count":  { "type": "integer" },
      "like_count":  { "type": "integer" },
      "score":       { "type": "double" },
      "create_time": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis" }
    }
  }
}

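One caveat about the settings: despite its name, the custom analyzer `ik_smart_pinyin` only wraps the `ik_smart` tokenizer. It contains no pinyin filter (that would need a separate pinyin plugin), and no field in the mapping references it. The `ik_max_word` and `ik_smart` analyzers come from the IK analysis plugin, which must be installed on the cluster.

A few key design decisions: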

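Field	Type	Reason
`title`/`describe`/`content`	`text` + IK analyzer	Searched as full text, so they must be tokenized
`aid`/`tid`/`author_name`	`keyword`	Exact match only, no tokenization needed
`tags`	`text` + IK analyzer	Tags may be Chinese and need tokenized matching
`full_text`	`copy_to` target field	A multi-field search can query one field, which is usually cheaper
`audit_status`/`status`	`integer`	Used only in filter clauses, not for scoring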
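
Because `content` is stored as HTML and the mapping above has no HTML handling, tags and attribute names would be indexed as ordinary text unless they are removed first. The following is a minimal sketch of a row-to-document conversion under that assumption; strip_html and row_to_doc are hypothetical helper names, and the row is assumed to arrive as a Python dict.

```python
from datetime import datetime
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html):
    """Drop tags so that markup is not indexed as searchable text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())

def row_to_doc(row):
    """Convert one `article` row (a dict) into a document for article-index."""
    doc = dict(row)
    doc["content"] = strip_html(row.get("content") or "")
    # The mapping accepts "yyyy-MM-dd HH:mm:ss", so format datetimes that way.
    if isinstance(row.get("create_time"), datetime):
        doc["create_time"] = row["create_time"].strftime("%Y-%m-%d %H:%M:%S")
    return doc
```

Elasticsearch also ships an `html_strip` character filter that can do the same job at analysis time; stripping in the sync code simply keeps the mapping unchanged.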

2. Query DSL Basics

Elasticsearch uses a JSON-based query DSL (Domain Specific Language). Every query follows this basic structure:

POST /article-index/_search
{
  "query": {
    "<query type>": {
      "<field>": "<value>"
    }
  }
}

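Queries fall into two broad categories:

Full-text queries: analyze the query text before matching and produce a relevance score
Term-level queries: match exact values without analysis, typically used for filtering

3. Full-Text Search Scenarios

3.1 Single-Field Search (match)

Search for articles whose title matches "Spring Boot 实战" ("Spring Boot in practice"):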

POST /article-index/_search
{
  "query": {
    "match": {
      "title": "Spring Boot 实战"
    }
  }
}

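match analyzes the query string and combines the resulting terms with OR by default: a title is a hit if it contains any one of the terms, for example "spring", "boot", or "实战" (the exact terms depend on the analyzer).

To require that the title contain all of the terms (AND):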

POST /article-index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Spring Boot 实战",
        "operator": "and"
      }
    }
  }
}
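
The difference between the two operators can be pictured with a toy model. This sketch only compares hand-written token lists; it is not how Elasticsearch is implemented, the tokens are assumptions (real IK output may differ), and matches is a hypothetical helper.

```python
def matches(query_tokens, title_tokens, operator="or"):
    """Toy model of `match`: compare analyzed query tokens with analyzed title tokens."""
    hits = [token in title_tokens for token in query_tokens]
    return all(hits) if operator == "and" else any(hits)

# Tokens are written out by hand; the real analyzer output may differ.
query = ["spring", "boot", "实战"]
title = {"spring", "boot", "入门"}

print(matches(query, title))         # True: "spring" and "boot" are present
print(matches(query, title, "and"))  # False: "实战" is missing from the title
```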

3.2 Cross-Field Search (multi_match)

Blog search usually needs to cover the title, summary, content, and tags at once. Use multi_match:

POST /article-index/_search
{
  "query": {
    "multi_match": {
      "query": "Redis 缓存穿透",
      "fields": ["title^3", "describe^2", "tags^2", "content"],
      "type": "best_fields"
    }
  }
}

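^3 is a per-field boost: a match in title is weighted 3 times as heavily as one in content. This matches intuition, since a title hit is usually more relevant than a body hit.

Alternatively, use the full_text field that the mapping fills through copy_to and search all text in one field. The trade-off is that per-field boosts no longer apply, because every source field is merged into the same field: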
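
How boosts interact with best_fields can be sketched with plain arithmetic. A best_fields query scores each field separately and keeps the best one (dis_max style), optionally adding a fraction of the other fields through tie_breaker. The per-field scores below are made-up numbers rather than real BM25 output, and best_fields_score is a hypothetical helper.

```python
def best_fields_score(field_scores, boosts, tie_breaker=0.0):
    """Combine per-field scores the way a dis_max-style best_fields query does."""
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # Other matching fields only count when tie_breaker is greater than 0.
    return best + tie_breaker * (sum(boosted) - best)

boosts = {"title": 3, "describe": 2, "tags": 2, "content": 1}
# Hypothetical raw per-field scores for one document.
scores = {"title": 1.5, "content": 2.0}

print(best_fields_score(scores, boosts))                   # 4.5
print(best_fields_score(scores, boosts, tie_breaker=0.5))  # 5.5
```

Even though content has the higher raw score here, the boosted title score (1.5 × 3) wins, which is exactly the effect the boost is meant to have.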

POST /article-index/_search
{
  "query": {
    "match": {
      "full_text": "Redis 缓存穿透"
    }
  }
}

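3.3 Phrase Search (match_phrase)

Search for articles whose title contains "缓存穿透" ("cache penetration") as a contiguous phrase, so that terms appearing apart or out of order do not match: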

POST /article-index/_search
{
  "query": {
    "match_phrase": {
      "title": "缓存穿透"
    }
  }
}

The slop parameter lets the terms sit apart. With slop: 1, at most one other term may appear between the two terms:

POST /article-index/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "缓存 穿透",
        "slop": 1
      }
    }
  }
}
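
The effect of slop can be approximated on token positions. The sketch below is a simplified model, not Lucene's full sloppy-phrase algorithm (repeated terms, for instance, are not handled specially), and the token lists are written by hand rather than produced by IK. phrase_matches is a hypothetical helper.

```python
from itertools import product

def phrase_matches(doc_tokens, phrase_tokens, slop=0):
    """Simplified sloppy-phrase check on token positions."""
    positions = []
    for term in phrase_tokens:
        found = [i for i, token in enumerate(doc_tokens) if token == term]
        if not found:
            return False  # every phrase term must occur at least once
        positions.append(found)
    for combo in product(*positions):
        # Shift each position by the term's index in the phrase. An exact
        # phrase yields identical shifted values, so the spread measures
        # how far the terms had to move.
        shifted = [pos - idx for idx, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False

title = ["缓存", "的", "穿透"]  # hand-written tokens with one word in between
print(phrase_matches(title, ["缓存", "穿透"]))          # False
print(phrase_matches(title, ["缓存", "穿透"], slop=1))  # True
```

In this model, two terms in reversed order need a slop of 2, which is why a small slop still mostly preserves word order.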

4. Exact Filtering Scenarios

4.1 Filter by Category (term)

Find all articles in a given category:

POST /article-index/_search
{
  "query": {
    "term": {
      "tid": "category-java"
    }
  }
}

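⚠️ term does not analyze the query value, so tid must be mapped as keyword. On an analyzed text field the indexed tokens can differ from the original value (lowercased, split into pieces), and the query may find nothing.

4.2 Filter by Multiple Tags (terms)

Find articles whose tag_name is any of "Java", "Spring", or "SpringBoot":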

POST /article-index/_search
{
  "query": {
    "terms": {
      "tag_name": ["Java", "Spring", "SpringBoot"]
    }
  }
}

4.3 Range Queries (range)

Find articles published in the last 30 days:

POST /article-index/_search
{
  "query": {
    "range": {
      "create_time": {
        "gte": "now-30d/d",
        "lte": "now/d"
      }
    }
  }
}
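
As a rough illustration of how this date math resolves, the sketch below recomputes both bounds in Python for a fixed, made-up "now" in UTC (ES evaluates date math in UTC unless the range query sets time_zone). With /d, a gte bound rounds down to the start of the day and an lte bound rounds up to the last millisecond of the day. The helper name last_30_days_window is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def last_30_days_window(now):
    """Approximate the bounds of gte=now-30d/d and lte=now/d for a UTC `now`."""
    start_of_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # now-30d/d: go back 30 days, then round down to the start of that day.
    gte = start_of_today - timedelta(days=30)
    # now/d used with lte: round up to the last millisecond of today.
    lte = start_of_today + timedelta(days=1) - timedelta(milliseconds=1)
    return gte, lte

now = datetime(2024, 1, 31, 15, 45, tzinfo=timezone.utc)  # made-up clock value
gte, lte = last_30_days_window(now)
print(gte.isoformat())  # 2024-01-01T00:00:00+00:00
print(lte.isoformat())  # 2024-01-31T23:59:59.999000+00:00
```

Rounding to whole days also means the bounds stay the same for every request made on the same day.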

Find popular articles with 1000 or more views:

POST /article-index/_search
{
  "query": {
    "range": {
      "look_count": {
        "gte": 1000
      }
    }
  }
}

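5. Compound Queries: Search as It Runs in Production

A match or term query is rarely used on its own. Real blog search combines several conditions:

Search for articles about "Redis" that are approved and not deleted, ranked by relevance

This calls for a bool query: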

POST /article-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "Redis",
            "fields": ["title^3", "describe^2", "tags^2", "content"],
            "type": "best_fields"
          }
        }
      ],
      "filter": [
        { "term": { "audit_status": 1 } },
        { "term": { "status": 0 } }
      ]
    }
  }
}

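The four clauses of a bool query:

Clause	Meaning	Affects score?
`must`	Must match (AND)	✅ Contributes to the score
`should`	Nice to have (OR)	✅ Contributes to the score
`filter`	Must match (AND)	❌ Not scored, cacheable, usually faster
`must_not`	Must not match (NOT)	❌ Not scored

Key principle: put conditions such as audit_status and status in filter, not must. The reasons:

They say nothing about relevance, so scoring them in must only distorts the ranking
filter clauses can be cached, so repeated filters get cheaper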
5.1 Pinned Articles First

A blog homepage usually wants pinned articles at the very top. One way to push them up is should + boost:

POST /article-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "full_text": "Spring" } }
      ],
      "should": [
        {
          "term": {
            "is_top": {
              "value": 1,
              "boost": 10
            }
          }
        }
      ],
      "filter": [
        { "term": { "audit_status": 1 } },
        { "term": { "status": 0 } }
      ]
    }
  }
}

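boost: 10 adds a large bonus to the score of pinned articles, so they tend to rank first. This is not a strict guarantee: an unpinned article whose relevance score is higher by more than the bonus still wins. If pinned articles must always come first, sort on is_top before _score.

5.2 function_score: Blending in a Business Score

The article table has a score column (a weighted score that combines views, likes, and so on). To combine ES relevance with this business score, use function_score: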

POST /article-index/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "match": { "full_text": "Redis 缓存" } }
          ],
          "filter": [
            { "term": { "audit_status": 1 } },
            { "term": { "status": 0 } }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "score",
            "factor": 0.1,
            "modifier": "log1p",
            "missing": 1
          }
        },
        {
          "field_value_factor": {
            "field": "look_count",
            "factor": 0.01,
            "modifier": "log1p",
            "missing": 0
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}

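Final score = ES relevance score × (score contribution + look_count contribution), where each contribution is log10(1 + factor × field value). This balances search relevance against popularity. Because boost_mode is multiply, a document for which both contributions are 0 ends up with a final score of 0.

6. Sorting and Pagination

6.1 Multi-Key Sorting

Sort by the business score descending, then by creation time descending when scores tie: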
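
The arithmetic of the function_score query above can be reproduced in a few lines. In field_value_factor, the log1p modifier is the base-10 logarithm of 1 plus the factored value. The relevance value and the field values below are made-up inputs, and final_score is a hypothetical helper.

```python
import math

def final_score(relevance, score, look_count):
    """Reproduce the arithmetic of the function_score query above."""
    score_part = math.log10(1 + 0.1 * score)        # factor 0.1, modifier log1p
    views_part = math.log10(1 + 0.01 * look_count)  # factor 0.01, modifier log1p
    # score_mode "sum" adds the two functions; boost_mode "multiply"
    # multiplies the result with the query's relevance score.
    return relevance * (score_part + views_part)

print(round(final_score(2.0, 90, 900), 3))  # 4.0
print(final_score(2.0, 0, 0))               # 0.0
```

The second call shows a side effect worth knowing about: with multiply, an article whose score and look_count are both 0 gets a final score of 0 no matter how well it matches the query.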

POST /article-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "audit_status": 1 } },
        { "term": { "status": 0 } }
      ]
    }
  },
  "sort": [
    { "score":       { "order": "desc" } },
    { "create_time": { "order": "desc" } }
  ]
}

6.2 Pagination

from + size gives basic pagination (suitable for the first 10,000 results):

POST /article-index/_search
{
  "query": { "match_all": {} },
  "sort": [{ "create_time": { "order": "desc" } }],
  "from": 0,
  "size": 10
}

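⚠️ By default ES rejects requests where from + size > 10000 (the index.max_result_window setting). Deep pagination (beyond page 1000 at 10 results per page) should use search_after instead: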

POST /article-index/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "create_time": { "order": "desc" } },
    { "aid": { "order": "asc" } }
  ],
  "search_after": ["2024-01-15 10:30:00", "article-xxx-id"],
  "size": 10
}

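search_after uses the sort values of the last hit on the previous page as a cursor, so the cost stays stable no matter how deep you go. The sort needs a unique tiebreaker (aid here), and the safest cursor is the sort array returned with that last hit, copied verbatim.

7. Search Highlighting

Highlight the matched keywords in the results to improve the user experience: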
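
Client code typically derives each follow-up request from the previous response. The sketch below shows one way to do that: it copies the sort array of the last hit into search_after. next_page_body is a hypothetical helper, and the sample hit (including its epoch-millisecond sort value) is made up.

```python
import copy

def next_page_body(body, last_hit):
    """Return the request body for the page that follows `last_hit`."""
    nxt = copy.deepcopy(body)               # leave the caller's body untouched
    nxt.pop("from", None)                   # search_after replaces from-based offsets
    nxt["search_after"] = last_hit["sort"]  # reuse the sort values verbatim
    return nxt

first_page = {
    "query": {"match_all": {}},
    "sort": [
        {"create_time": {"order": "desc"}},
        {"aid": {"order": "asc"}},
    ],
    "from": 0,
    "size": 10,
}
# Made-up last hit of the first page; ES returns a "sort" array with each hit.
last_hit = {"_id": "article-010", "sort": [1705314600000, "article-010"]}

print(next_page_body(first_page, last_hit)["search_after"])
```

Repeating this with the last hit of every page walks the whole result set without ever using a large from value.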

POST /article-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "multi_match": {
            "query": "Redis 缓存穿透",
            "fields": ["title", "describe", "content"]
        }}
      ],
      "filter": [
        { "term": { "audit_status": 1 } },
        { "term": { "status": 0 } }
      ]
    }
  },
  "highlight": {
    "pre_tags":  "<em class='highlight'>",
    "post_tags": "</em>",
    "fields": {
      "title":    { "number_of_fragments": 0 },
      "describe": { "fragment_size": 150, "number_of_fragments": 1 },
      "content":  { "fragment_size": 200, "number_of_fragments": 3 }
    }
  }
}

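Parameters:

number_of_fragments: 0: return the whole field with highlights (good for titles, no truncation)
fragment_size: approximate size of each highlighted fragment, in characters
number_of_fragments: maximum number of fragments to return

Each hit in the response gains a highlight field. The exact highlighted spans depend on how the analyzer tokenizes the text; the result looks roughly like this: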

{
  "_source": { "title": "Redis 缓存穿透、缓存击穿、缓存雪崩详解" },
  "highlight": {
    "title": ["<em class='highlight'>Redis</em> <em class='highlight'>缓存穿透</em>、缓存击穿、缓存雪崩详解"]
  }
}
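
When rendering results, note that highlight only contains the fields in which something matched, so a hit found through content may have no highlighted title at all. A minimal sketch of a safe fallback follows; display_title is a hypothetical helper and the two hits are made-up samples.

```python
def display_title(hit):
    """Prefer the highlighted title; fall back to the stored one."""
    fragments = hit.get("highlight", {}).get("title")
    if fragments:
        return fragments[0]
    return hit["_source"]["title"]

hit_with_highlight = {
    "_source": {"title": "Redis 缓存穿透详解"},
    "highlight": {"title": ["<em>Redis</em> 缓存穿透详解"]},
}
hit_without_highlight = {"_source": {"title": "Spring Boot 自动装配原理"}}

print(display_title(hit_with_highlight))     # highlighted version
print(display_title(hit_without_highlight))  # plain stored title
```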

8. Batch Operations

8.1 Batch Reads (mget)

Fetch several articles in one request (for example, to render a recommendation list):

POST /article-index/_mget
{
  "docs": [
    { "_id": "article-001" },
    { "_id": "article-002" },
    { "_id": "article-003" }
  ]
}

8.2 Batch Writes (bulk)

When syncing MySQL data to ES, write in batches with bulk to cut network overhead:

POST /_bulk
{ "index": { "_index": "article-index", "_id": "article-001" } }
{ "aid": "article-001", "title": "Redis 缓存穿透详解", "audit_status": 1, "status": 0, "score": 95.5 }
{ "index": { "_index": "article-index", "_id": "article-002" } }
{ "aid": "article-002", "title": "Spring Boot 自动装配原理", "audit_status": 1, "status": 0, "score": 88.0 }
{ "delete": { "_index": "article-index", "_id": "article-deleted-001" } }

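As a rule of thumb, start with 1000-5000 documents per batch and keep each request body under about 15 MB, then tune against your own cluster.

9. A Complete Search Request

Putting everything above together gives a search query that is a reasonable starting point for production: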
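
A sync job has to turn rows into this newline-delimited format and split them into batches. The sketch below builds the request bodies only and leaves the HTTP call out; bulk_bodies is a hypothetical helper, and it assumes each document dict carries its id in aid, as in the example above.

```python
import json

def bulk_bodies(docs, batch_size=1000):
    """Yield NDJSON request bodies for the _bulk API, `batch_size` documents each."""
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            action = {"index": {"_index": "article-index", "_id": doc["aid"]}}
            lines.append(json.dumps(action, ensure_ascii=False))
            lines.append(json.dumps(doc, ensure_ascii=False))
        # The bulk API requires the body to end with a newline.
        yield "\n".join(lines) + "\n"

docs = [
    {"aid": "article-001", "title": "Redis 缓存穿透详解"},
    {"aid": "article-002", "title": "Spring Boot 自动装配原理"},
    {"aid": "article-003", "title": "Elasticsearch 入门"},
]
for body in bulk_bodies(docs, batch_size=2):
    print(body, end="")
```

Each body would be sent as POST /_bulk with the Content-Type header set to application/x-ndjson.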

POST /article-index/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "multi_match": {
                "query": "Redis 缓存",
                "fields": ["title^3", "describe^2", "tags^2", "content"],
                "type": "best_fields",
                "minimum_should_match": "60%"
              }
            }
          ],
          "should": [
            { "term": { "is_top": { "value": 1, "boost": 5 } } }
          ],
          "filter": [
            { "term":  { "audit_status": 1 } },
            { "term":  { "status": 0 } },
            { "range": { "create_time": { "gte": "now-1y/d" } } }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "score",
            "factor": 0.1,
            "modifier": "log1p",
            "missing": 1
          }
        }
      ],
      "boost_mode": "multiply"
    }
  },
  "highlight": {
    "pre_tags": "<em>",
    "post_tags": "</em>",
    "fields": {
      "title":    { "number_of_fragments": 0 },
      "describe": { "fragment_size": 150, "number_of_fragments": 1 }
    }
  },
  "sort": [
    { "_score": { "order": "desc" } },
    { "create_time": { "order": "desc" } }
  ],
  "from": 0,
  "size": 10,
  "_source": ["aid", "title", "describe", "author_name", "tag_name", "look_count", "like_count", "score", "create_time", "blog_img"]
}

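Note the _source field at the end: it returns only the fields the list page needs and leaves out content (the article body can be large), which cuts network transfer. blog_img is not among the core columns shown earlier and is assumed to exist in the full table.

Summary

Scenario	Recommended approach
Full-text search	`multi_match` + IK analysis + field boosts
Exact filtering	`term`/`terms`/`range` inside `filter`
Relevance + business score	`function_score`
Pinning/boosting	`should` + `boost`
Deep pagination	`search_after` instead of `from+size`
Search highlighting	`highlight` + custom tags
Bulk sync	`bulk` API, 1000-5000 documents per batch

The core gains from moving off MySQL LIKE to ES:

Tokenized search that matches individual terms rather than literal substrings (synonyms and spelling variants still need extra analyzer configuration, such as a synonym filter)
Relevance ranking, with the best matches first
Cacheable filters, so frequently used filter conditions get noticeably faster
The content column no longer needs a search index in MySQL (a FULLTEXT index, for example), which reduces storage pressure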