Elasticsearch Core Concepts Explained: Index, Document, Field

I. Overview of the Three Core Concepts

1.1 Relationship Map (Logical View)

Index - roughly a table in a relational database
    │
    ├─ Mapping - the schema definition
    │     │
    │     └─ Field Definitions
    │
    └─ Documents
          │
          ├─ Document 1
          │     ├─ Field: name
          │     ├─ Field: age
          │     └─ Field: address
          │
          ├─ Document 2
          │     ├─ Field: name
          │     └─ Field: email
          └─ ...

Note: this is the logical view of Elasticsearch, i.e. how data is organized from the point of view of users and applications.

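1.2 Storage Structure (Physical View)

Elasticsearch index structure
│
└─ Index
    ├─ Shard 0 (each shard is one Lucene index)
    │   ├─ Segment 0
    │   │   ├─ Inverted Index
    │   │   │   ├─ Term Index (FST)
    │   │   │   ├─ Term Dictionary (sorted term blocks, located through the Term Index)
    │   │   │   └─ Posting List (skip list + Frame Of Reference)
    │   │   ├─ Doc Values (forward index, columnar storage)
    │   │   ├─ Stored Fields (original document storage)
    │   │   └─ Other metadata
    │   ├─ Segment 1
    │   └─ ...
    ├─ Shard 1
    ├─ ...
    └─ Replica Shards (copies of the primary shards)

Note: this is the physical view of Elasticsearch, i.e. how data is stored from the point of view of the storage engine.

💡 For the underlying details, see 《Elasticsearch底层原理与存储架构深度解析.md》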

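1.3 Differences and Connections Between the Two Views

1.3.1 Core Differences

Dimension	Logical View (1.1)	Physical View (1.2)
Perspective	User / application layer	Storage engine layer
Focus	How data is organized and accessed	How data is stored and retrieved
Abstraction level	High-level abstraction	Low-level implementation
Main concepts	Index, Document, Field, Mapping	Shard, Segment, inverted index, Doc Values
Purpose	API design, data modeling	Performance tuning, understanding internals
Visibility	Touched by developers every day	Usually transparent unless you are tuning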

1.3.2 How the Two Views Correspond

Mapping from logical to physical:

Logical view                    Physical view
───────────────────────────────────────────────────────────

Index                 ═══════>  Split across multiple Shards
  │                              │
  │                              ├─ Shard 0
  │                              ├─ Shard 1
  │                              └─ Shard N
  │
  ├─ Mapping            ═══════>  Determines which index structures are built
  │   └─ Field Definitions         ├─ Inverted index (text / keyword)
  │                                 ├─ Doc Values (aggregation / sorting)
  │                                 └─ BKD Tree (numeric / date / geo)
  │
  └─ Documents          ═══════>  Stored in Segments
      │                            │
      ├─ Document 1    ─────────>  Spread across structures:
      │   ├─ Field: name             ├─ Inverted index (search)
      │   ├─ Field: age              ├─ Doc Values (aggregation)
      │   └─ Field: address          └─ Stored Fields (original text)
      │
      └─ Document 2    ─────────>  Stored the same way...

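1.3.3 Detailed Mapping Table

Logical concept	Physical implementation	How they map
Index	Shard	One Index is split into N primary Shards, which can be spread across nodes
Document	Several data structures inside a Segment	A Document lives in one Segment of one Shard; inside it, its data is spread across several structures
Field (text)	Inverted index	Full-text fields are stored as an inverted index (Term → Posting List)
Field (keyword)	Inverted index + Doc Values	The inverted index serves exact matching; columnar Doc Values serve aggregation and sorting
Field (numeric / date)	BKD tree + Doc Values	The BKD tree serves range queries; Doc Values serve aggregation and sorting
Field (original value)	Stored Fields	Holds the original JSON document, returned as _source
Mapping	Index structure definition	Decides which structures each field uses (inverted index / Doc Values / BKD tree)
Writing a Document	Writing a Segment	Data goes to the in-memory buffer first; a refresh turns it into a Segment
Querying a Document	Querying multiple Segments	The query runs across the Segments and the results are merged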

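1.3.4 Why Two Structures?

Reason 1: Separation of responsibilities

Responsibilities of the logical view:
├─ Provide a concise API
├─ Hide the underlying complexity
├─ Make data modeling easy
└─ Lower the learning curve

Responsibilities of the physical view:
├─ Optimize storage efficiency
├─ Improve search performance
├─ Support distributed scaling
└─ Provide high availability

Reason 2: Performance

What is logically "one document" is physically split into several data structures, each with its own job: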

Document {
  "name": "iPhone 15",     ──┐
  "price": 7999,             │  logically a single unit
  "tags": ["5G", "手机"]    ──┘
}

Physical storage (inside one Segment of one Shard):
│
├─ Inverted Index
│   ├─ "iphone" → [Doc1, Doc5, ...]      # name field
│   ├─ "15" → [Doc1, Doc10, ...]         # name field
│   ├─ "5g" → [Doc1, Doc3, ...]          # tags field
│   └─ "手机" → [Doc1, Doc2, ...]        # tags field
│
├─ Doc Values (columnar storage)
│   ├─ price: [7999, 3999, 5999, ...]    # used for aggregation / sorting
│   └─ tags: [["5G","手机"], ...]
│
└─ Stored Fields (row-oriented storage)
    └─ Doc1: {"name":"iPhone 15", "price":7999, "tags":["5G","手机"]}

Benefits:

Inverted index: a search for "iPhone" is a term lookup rather than a scan of every document
Doc Values: computing the average price reads one column sequentially
Stored Fields: the original document is kept so it can be returned in full

Reason 3: Distribution

Logical view:
products (index)
  ├─ 10 million Documents
  └─ 100 GB of data

Physical view (distributed storage):
products (index)
  ├─ Shard 0 (Node1)  ──> 2 million Docs, 20 GB
  ├─ Shard 1 (Node2)  ──> 2 million Docs, 20 GB
  ├─ Shard 2 (Node3)  ──> 2 million Docs, 20 GB
  ├─ Shard 3 (Node4)  ──> 2 million Docs, 20 GB
  └─ Shard 4 (Node5)  ──> 2 million Docs, 20 GB

Logically unified, physically distributed: this is what enables horizontal scaling.

1.3.5 Practical Examples

Example 1: Search flow (from logical to physical)

User's view (logical):

GET /products/_search
{
  "query": {
    "match": { "name": "iPhone" }
  }
}

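What ES does internally (physical):

1. The coordinating node forwards the request to one copy (primary or replica) of every Shard (Shard 0-4)
2. Query phase, inside each Shard:
   ├─ Search the inverted index of every Segment
   ├─ Term "iphone" → Posting List [Doc1, Doc5, ...]
   └─ Return the IDs and scores of the local top hits
3. Fetch phase: the coordinating node merges the hits, then only the final top documents are read from Stored Fields and returned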
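
The scatter-gather pattern in the steps above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `search_shard`, `coordinate`, the dictionary-based shards and the scores are all made up for illustration and are not Elasticsearch APIs.

```python
import heapq

def search_shard(postings, scores, term, size):
    """Query phase on one shard: look up the term and return the local
    top hits as (score, doc_id) pairs."""
    doc_ids = postings.get(term, [])
    return heapq.nlargest(size, ((scores[d], d) for d in doc_ids))

def coordinate(shards, term, size):
    """Coordinating node: gather every shard's local top hits and keep
    the global top `size` document IDs."""
    candidates = []
    for postings, scores in shards:
        candidates.extend(search_shard(postings, scores, term, size))
    return [doc_id for _, doc_id in heapq.nlargest(size, candidates)]

# Two hypothetical shards: (inverted index, per-document score)
shards = [
    ({"iphone": ["d1", "d5"]}, {"d1": 1.5, "d5": 0.7}),
    ({"iphone": ["d8"]}, {"d8": 1.1}),
]
print(coordinate(shards, "iphone", size=2))  # ['d1', 'd8']
```

Each shard only ships IDs and scores; full documents are loaded afterwards for the winners, which is why the fetch phase is separate.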

Example 2: Aggregation (from logical to physical)

User's view (logical):

GET /products/_search
{
  "aggs": {
    "avg_price": {
      "avg": { "field": "price" }
    }
  }
}

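What ES does internally (physical):

1. The request is forwarded to one copy of every Shard
2. Inside each Shard:
   ├─ Read the price column from Doc Values
   ├─ Columnar storage makes the sequential scan efficient
   └─ Compute the local sum and count
3. The coordinating node adds up the per-Shard sums and counts and divides to get the global average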
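
A small Python sketch shows why the per-shard partial result for an average has to carry a sum and a count: averaging the local averages gives the wrong answer when shards hold different numbers of documents. The function names and prices are hypothetical.

```python
def shard_partial(prices):
    """Partial result of one shard: (sum, count), not a finished average."""
    return sum(prices), len(prices)

def merge_avg(partials):
    """Coordinating node: combine the partial sums and counts."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

shard_prices = [[1000, 3000], [8000]]  # two shards with uneven doc counts
partials = [shard_partial(p) for p in shard_prices]

global_avg = merge_avg(partials)
naive = sum(sum(p) / len(p) for p in shard_prices) / len(shard_prices)

print(global_avg)  # 4000.0 (correct)
print(naive)       # 5000.0 (average of averages, wrong)
```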

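1.3.6 How the Two Views Work Together

┌──────────────────────────────────────────┐
│          Application-layer API           │
│  (logical view: Index, Document, Field)  │
└──────────────────────────────────────────┘
                     ↓
         ┌────────────────────────┐
         │ ES coordination layer  │
         │  (logical → physical)  │
         └────────────────────────┘
                     ↓
┌──────────────────────────────────────────┐
│          Physical storage layer          │
│     (Shard, Segment, inverted index)     │
└──────────────────────────────────────────┘

Key points:

Users only need to care about the logical view (writing documents, searching, aggregating)
ES handles the physical view automatically (sharding, index structures, storage optimization)
Tuning requires understanding the physical view (shard count, segment merging, caching)

1.3.7 Learning Path

Beginner:
└─ Understand the logical view (Index, Document, Field)
   Focus: how to use the API to work with data

Intermediate:
└─ Understand the physical view (Shard, Segment, inverted index)
   Focus: performance tuning, capacity planning

Advanced:
└─ Master the mapping between the two
   Focus: deep tuning, troubleshooting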

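1.4 Comparison with Relational Databases

Level	Relational database	Elasticsearch	Notes
Outermost	Database	Cluster	Top-level container
Next level	Table	Index	Collection of data
Row	Row	Document	A single record
Column	Column	Field	An attribute of the data
Table structure	Schema	Mapping	Structure definition

Key differences:

A MySQL database contains multiple tables; an ES cluster contains multiple indices
A MySQL row has a fixed structure; an ES document is flexible JSON
MySQL requires a strictly defined schema; ES can map fields dynamically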

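II. Index in Detail

2.1 What Is an Index?

Definition: an Index is a collection of documents with similar characteristics in Elasticsearch, and the top-level logical namespace for storing and searching data.

Analogies:

If ES were a library, an Index would be the shelf for one category of books
If ES were a file system, an Index would be a top-level folder
If ES were a database, an Index would be a table (ES 7.x and later)

2.2 Role and Positioning of an Index

Role 1: Data isolation

E-commerce example:

es-cluster/
├─ products (product index)
├─ orders (order index)
├─ users (user index)
└─ logs (log index)

Benefits:

Each business domain gets its own shards and files, although indices may still share nodes
Independent settings and tuning strategies
Finer-grained access control

Role 2: Sharded storage

Index: products
├─ Shard 0 (primary) → Node 1
│   └─ Replica 0 → Node 2
├─ Shard 1 (primary) → Node 2
│   └─ Replica 1 → Node 3
└─ Shard 2 (primary) → Node 3
    └─ Replica 2 → Node 1

How it works:

An Index is divided into multiple Shards when it is created
Shards are distributed across different nodes
This enables distributed storage and parallel queries

Role 3: Defining the Mapping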

// Mapping definition of the Index
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "price": { "type": "double" },
      "created_at": { "type": "date" }
    }
  }
}

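2.3 How an Index Works Under the Hood

Physical storage layout

Index: products (logical concept)
    ↓
Physical layer: Shards
    ↓
Shard 0/
├─ segments_N (commit point listing the live segments)
├─ _0.cfs (compound segment file)
├─ _0.cfe (compound file entry table)
├─ _1.si (segment info file)
└─ write.lock (write lock file)

Segment:

The smallest physical storage unit of a Shard
Immutable
Contains the inverted index and Doc Values
Merged periodically to improve performance

Index metadata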

{
  "products": {
    "aliases": {},
    "mappings": { ... },
    "settings": {
      "index": {
        "number_of_shards": "3",
        "number_of_replicas": "1",
        "refresh_interval": "1s",
        "max_result_window": "10000"
      }
    }
  }
}

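Key settings:

Setting	Description	Default	Recommendation
number_of_shards	Number of primary shards	1	Size it to the data volume (cannot be changed in place; use the split/shrink APIs or reindex)
number_of_replicas	Number of replicas	1	At least 1 for high availability
refresh_interval	Refresh interval	1s	Increase it for write-heavy workloads
max_result_window	Maximum from + size for paging	10000	For deep paging use search_after; scroll is better suited to bulk export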

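2.4 Index Lifecycle Management

ILM (Index Lifecycle Management)

Hot Phase
  ↓ after 7 days
Warm Phase - reduce the replica count
  ↓ after 30 days
Cold Phase - move to low-cost storage
  ↓ after 90 days
Delete Phase - delete the index

Example policy (for brevity it omits the cold phase shown above):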

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 1
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

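2.5 Index Naming Conventions and Best Practices

Naming conventions

Recommended format: <project>-<type>-<env>-<date>

Examples:
myapp-logs-prod-2024.01.29
myapp-products-test
system-metrics-2024.01

Benefits:

Easy to manage and look up
Supports wildcard queries: myapp-logs-*
Convenient for date-based rollover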
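
As a rough illustration of the naming scheme, the sketch below builds dated index names and checks which ones a wildcard pattern would select. `index_name` is a hypothetical helper, and `fnmatch` only approximates how Elasticsearch expands `*` in index patterns.

```python
from datetime import date
from fnmatch import fnmatch

def index_name(project, kind, env, day):
    """Build a name in the <project>-<type>-<env>-<date> format."""
    return f"{project}-{kind}-{env}-{day:%Y.%m.%d}"

names = [index_name("myapp", "logs", "prod", date(2024, 1, d)) for d in (28, 29)]
names.append("myapp-products-test")

# Which indices would the pattern myapp-logs-* select?
matched = [n for n in names if fnmatch(n, "myapp-logs-*")]
print(matched)  # ['myapp-logs-prod-2024.01.28', 'myapp-logs-prod-2024.01.29']
```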

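Best practices

1. Single index vs. multiple indices

❌ Not recommended: one huge index for all data
products-all/
├─ phones
├─ computers
├─ clothing
└─ food (1 billion documents)

✅ Recommended: one index per category
products-electronics/
products-clothing/
products-food/

2. Use rolling indices for time-series data

logs-2024.01.28
logs-2024.01.29
logs-2024.01.30

# Use an alias for unified access
logs (alias) → points to all logs-* indices

3. Index Template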

{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}

Purpose: the settings and mappings are applied automatically whenever a new index matching the pattern is created.

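III. Document in Detail

3.1 What Is a Document?

Definition: a Document is the basic unit of information that can be indexed in Elasticsearch, expressed as JSON.

Core characteristics:

Each Document has an _id that is unique within its Index
It is stored in a single Index
Self-contained: it carries all of its fields and values
Schema-free: different documents can have different fields (dynamic mapping)

3.2 Structure of a Document

Full structure example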

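{
  "_index": "products",              // index the document belongs to
  "_type": "_doc",                    // type (always _doc in ES 7.x; removed from responses in 8.x)
  "_id": "1001",                      // unique document identifier
  "_version": 3,                      // version number, incremented on every write
  "_seq_no": 15,                      // sequence number (increments per shard, not globally)
  "_primary_term": 1,                 // primary term of the shard
  "_source": {                        // original JSON data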
    "name": "iPhone 15 Pro Max",
    "brand": "Apple",
    "price": 9999,
    "specs": {
      "storage": "256GB",
      "color": "钛金属"
    },
    "tags": ["5G", "A17"],
    "created_at": "2024-01-29T10:30:00"
  },
  "_score": 1.5,                      // relevance score (search responses only)
  "fields": {                         // stored fields (optional)
    "category": ["手机"]
  }
}

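Metadata fields

Field	Description	Purpose
`_index`	Index the document belongs to	Routing and storage location
`_id`	Unique document identifier	Custom or auto-generated
`_source`	Original JSON document	Stores and returns the original data
`_version`	Version number	Incremented on every write; also used for external versioning
`_seq_no`	Sequence number	Optimistic concurrency control, together with `_primary_term`
`_score`	Relevance score	Ranking of search results

3.3 Document Lifecycle

1. Create (Index)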

# Auto-generated ID
POST /products/_doc
{
  "name": "MacBook Pro",
  "price": 12999
}

# Response
{
  "_index": "products",
  "_id": "W0tpsmIBdwcYyG50zbta",  # auto-generated
  "_version": 1,
  "result": "created"
}

# Explicit ID
PUT /products/_doc/1001
{
  "name": "MacBook Pro",
  "price": 12999
}

2. Read (Get)

GET /products/_doc/1001

# Response
{
  "_index": "products",
  "_id": "1001",
  "_version": 1,
  "_source": {
    "name": "MacBook Pro",
    "price": 12999
  }
}

3. Update

Full update (replaces the whole document):

PUT /products/_doc/1001
{
  "name": "MacBook Pro M3",
  "price": 13999
}

Partial update (changes only the given fields):

POST /products/_update/1001
{
  "doc": {
    "price": 11999
  }
}

Scripted update:

POST /products/_update/1001
{
  "script": {
    "source": "ctx._source.price -= params.discount",
    "params": {
      "discount": 1000
    }
  }
}

4. Delete

DELETE /products/_doc/1001

# Response
{
  "_index": "products",
  "_id": "1001",
  "_version": 2,
  "result": "deleted"
}

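3.4 How Documents Work Under the Hood

Storage

Write path:
1. The document arrives at the coordinating node
   ↓
2. It is routed to the target primary shard
   shard_num = hash(_routing) % number_of_primary_shards
   ↓
3. The primary shard writes it to the in-memory Index Buffer
   ↓
4. The operation is appended to the Translog (by default fsynced before the request is acknowledged)
   ↓
5. The operation is replicated to the replica shards, then the client is acknowledged
   ↓
6. Refresh (every 1 second by default) → a new Segment is created and becomes searchable
   ↓
7. Flush (triggered in the background, for example when the Translog grows large) → Segments are committed to disk and the Translog is trimmed

Key points:

Translog: protects against data loss (similar to MySQL's redo log)
Refresh: makes data searchable (the reason search is near real-time)
Segment: an immutable set of files that contains the inverted index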
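
The relationship between the buffer, the Translog, refresh and flush can be mimicked with a toy class. This is a deliberately simplified sketch: `ToyShard` and its methods are hypothetical names, and real shards add replication, fsync policies and segment merging on top.

```python
class ToyShard:
    """Toy model of one shard's write path (illustration only)."""

    def __init__(self):
        self.buffer = []     # in-memory indexing buffer, not yet searchable
        self.translog = []   # operations logged since the last flush
        self.segments = []   # immutable, searchable segments

    def index(self, doc):
        # A write goes to the buffer and to the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new immutable segment; docs become searchable.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Commit segments durably, after which the translog can be trimmed.
        self.refresh()
        self.translog = []

    def search(self, predicate):
        return [d for seg in self.segments for d in seg if predicate(d)]


shard = ToyShard()
shard.index({"name": "MacBook Pro"})
before = shard.search(lambda d: True)   # nothing visible before refresh
shard.refresh()
after = shard.search(lambda d: True)    # visible after refresh
pending = len(shard.translog)           # still logged until flush
shard.flush()
```

The gap between `before` and `after` is the "near real-time" window: a document is safely logged as soon as it is written, but only searchable after the next refresh.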

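Physical storage of a document

Segment files (Lucene index)
├─ Inverted index (.tip, .tim, .doc)
│   └─ Term → [Doc1, Doc3, Doc5]
│
├─ Doc Values (.dvd, .dvm)
│   └─ field1 → [Doc1: value1, Doc2: value2, ...]
│
├─ Stored fields (.fdt, .fdx)
│   └─ original _source data
│
└─ Norms (.nvd, .nvm)
    └─ length normalization used in BM25 scoring

Three storage structures:

Inverted index: used for search (Term → Documents)
Doc Values: used for sorting and aggregation (Document → Field Values, stored one column per field)
_source: used to return the original document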

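3.5 Document Routing

Routing formula

shard_num = hash(_routing) % number_of_primary_shards

This is the simplified form. It also explains why the number of primary shards cannot be changed in place: a different divisor would send existing documents to different shards.

Default routing: the _id is used as the routing value

# Document ID = "1001"
shard = hash("1001") % 3 = Shard 1   (illustrative result)

# A given document always lands on the same shard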
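
The routing formula can be tried out with a stand-in hash. Note the assumption: `zlib.crc32` is used here only because it is deterministic and in the standard library; Elasticsearch uses its own Murmur3-based hash, so the actual shard numbers will differ. `shard_for` is a hypothetical helper.

```python
import zlib

def shard_for(routing, number_of_primary_shards):
    """Simplified routing: hash(_routing) % number_of_primary_shards.
    crc32 is a stand-in hash for illustration, not what ES uses."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

# The same routing value always maps to the same shard.
first = shard_for("1001", 3)
second = shard_for("1001", 3)

# Changing the shard count relocates many documents, which is why the
# number of primary shards is fixed at index creation.
moved = sum(1 for i in range(1000) if shard_for(str(i), 3) != shard_for(str(i), 4))
print(first == second, moved)
```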

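Custom routing:

# Route by user ID so that all documents of one user land on the same shard
PUT /orders/_doc/order-001?routing=user-123
{
  "user_id": "user-123",
  "product": "iPhone",
  "amount": 9999
}

# The same routing value must be supplied when reading
GET /orders/_doc/order-001?routing=user-123

Benefits:

Related documents are co-located on one shard
Queries that supply the routing value hit a single shard instead of all of them
Query performance improves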

3.6 Document Versioning

Internal versioning (_version)

# First create: _version = 1
PUT /products/_doc/1001
{ "name": "iPhone" }

# After an update: _version = 2
PUT /products/_doc/1001
{ "name": "iPhone Pro" }

# Another update: _version = 3

Optimistic locking (using seq_no and primary_term):

# First read the document's current seq_no and primary_term
GET /products/_doc/1001

# Response
{
  "_seq_no": 5,
  "_primary_term": 1,
  "_source": { ... }
}

# Update only if seq_no=5 and primary_term=1 still hold
PUT /products/_doc/1001?if_seq_no=5&if_primary_term=1
{
  "name": "iPhone Pro Max"
}

# If the document changed in the meantime, a conflict error (HTTP 409) is returned
{
  "error": {
    "type": "version_conflict_engine_exception",
    "reason": "[1001]: version conflict, required seqNo [5], primary term [1]. current document has seqNo [6] and primary term [1]"
  }
}
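
The compare-and-set idea behind `if_seq_no` / `if_primary_term` can be sketched with an in-memory store. `ToyPrimary` and `VersionConflict` are hypothetical names for this illustration; the real check happens inside the primary shard.

```python
class VersionConflict(Exception):
    pass

class ToyPrimary:
    """Toy primary shard that assigns sequence numbers (illustration only)."""

    def __init__(self):
        self.primary_term = 1
        self.next_seq_no = 0
        self.docs = {}  # _id -> (seq_no, primary_term, source)

    def put(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        if if_seq_no is not None:
            # Conditional write: reject it if the document has moved on.
            if current is None or current[:2] != (if_seq_no, if_primary_term):
                raise VersionConflict(doc_id)
        seq_no = self.next_seq_no
        self.next_seq_no += 1
        self.docs[doc_id] = (seq_no, self.primary_term, source)
        return seq_no, self.primary_term


primary = ToyPrimary()
seq, term = primary.put("1001", {"name": "iPhone"})
# A writer that read (seq, term) succeeds with its conditional update...
primary.put("1001", {"name": "iPhone Pro"}, if_seq_no=seq, if_primary_term=term)
# ...but a second writer holding the same stale values is rejected.
try:
    primary.put("1001", {"name": "iPhone Pro Max"}, if_seq_no=seq, if_primary_term=term)
    conflict = False
except VersionConflict:
    conflict = True
print(conflict)  # True
```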

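External versioning

# Use a version number from an external system (for example a database timestamp)
PUT /products/_doc/1001?version=1706515200&version_type=external
{
  "name": "iPhone"
}

# Rule: the external version must be greater than the stored version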

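3.7 Document Best Practices

1. Document ID design

Auto-increment IDs: 1, 2, 3, 4, 5 ...
# They do not cause hot shards: routing hashes the _id, so sequential values still spread evenly.
# The real cost of any custom ID is that every write must first check whether that _id already exists.

✅ No natural key: omit the ID and let ES generate one (fastest for append-only data such as logs)
✅ Natural key exists: use the business key so that retries and updates stay idempotent
Business ID: "order-20240129-123456"
⚠️ Fully random IDs such as UUIDv4 ("a1b2c3d4-e5f6-7890-abcd-ef1234567890") are the slowest to look up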

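2. Document size control

Recommendation: keep a single document well below 10 MB
Hard ceiling: 100 MB by default, the HTTP request size limit (http.max_content_length)

Problems with oversized documents:
- Slow network transfer
- High memory usage
- Inefficient updates

3. Nested vs. parent-child documents

Nested objects (good for a small number of children; the field must be mapped as nested, see 4.2):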

{
  "product_name": "iPhone",
  "comments": [
    { "user": "张三", "text": "不错" },
    { "user": "李四", "text": "很好" }
  ]
}

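Parent-child documents (good for many children that change often; uses a join field, and both sides live in the same index):

# Mapping: declare the relation
PUT /products
{
  "mappings": {
    "properties": {
      "relation": { "type": "join", "relations": { "product": "comment" } }
    }
  }
}

# Parent document
PUT /products/_doc/1001
{ "name": "iPhone", "relation": "product" }

# Child document (must be routed to the parent's shard)
PUT /products/_doc/comment-1?routing=1001
{
  "user": "张三",
  "text": "不错",
  "relation": { "name": "comment", "parent": "1001" }
}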

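4. _source optimization

{
  "mappings": {
    "_source": {
      "enabled": true,           // whether to store the original document
      "includes": ["name", "price"],  // store only these fields
      "excludes": ["large_field"]     // drop large fields
    }
  }
}

When disabling _source can make sense:

Only metric aggregations are needed and the original text is never returned
Storage cost is critical
Every field that must be returned has store enabled

Caution: without _source the update, update-by-query and reindex APIs no longer work, so disable it only when you are sure.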

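IV. Field in Detail

4.1 What Is a Field?

Definition: a Field is a key-value pair inside a Document and represents one attribute of the document.

Characteristics:

Each Field has a name and a value
Each Field has a data type (defined through the Mapping)
Fields can nest (Object or Nested type)
Fields can hold multiple values (arrays)

4.2 Field Data Types

Core data types

Category	Type	Description	Example
String	text	Full-text search (analyzed)	"iPhone 15 Pro Max"
	keyword	Exact match (not analyzed)	"PENDING"
Numeric	long	64-bit integer	123456789
	integer	32-bit integer	12345
	double	Double-precision float	99.99
Boolean	boolean	true/false	true
Date	date	Date and time	"2024-01-29T10:30:00Z"
Object	object	Nested object (flattened)	{ "city": "北京" }
	nested	Nested object (indexed as separate hidden documents)	[{ "name": "张三" }]
Array	-	Multiple values of the same type	["tag1", "tag2"]
Geo	geo_point	Geographic coordinates	{ "lat": 39.9, "lon": 116.4 }
	geo_shape	Geographic shape	Polygon area
Special	ip	IP address	"192.168.1.1"
	binary	Binary data (Base64)	"U29tZSBiaW5hcnk="

Type details

1. text vs keyword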

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",           // full-text search
        "fields": {
          "keyword": {
            "type": "keyword"     // also supports exact match
          }
        }
      }
    }
  }
}

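Differences:

Dimension	text	keyword
Analysis	Analyzed	Not analyzed
Use	Full-text search	Exact match, sorting, aggregation
Index structures	Inverted index	Inverted index + Doc Values
Typical query	match	term

Usage example:

# text: a search for "手机" can match "苹果手机很好用"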
GET /products/_search
{
  "query": {
    "match": { "title": "手机" }
  }
}

# keyword: the whole value "苹果手机很好用" must match exactly
GET /products/_search
{
  "query": {
    "term": { "title.keyword": "苹果手机很好用" }
  }
}

2. object vs nested

object (default): flattened storage

// Document
{
  "comments": [
    { "user": "张三", "rating": 5 },
    { "user": "李四", "rating": 3 }
  ]
}

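// How it is actually indexed (flattened)
{
  "comments.user": ["张三", "李四"],
  "comments.rating": [5, 3]
}

// Problem: the pairing between user and rating is lost!
// A query for "张三 AND rating=3" also matches (a false positive)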
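
The false positive is easy to reproduce in plain Python. The sketch below flattens the array the way the object type does and compares it with per-object matching, which is what nested preserves; all function names are hypothetical.

```python
def flatten(objects, prefix="comments"):
    """Flatten an array of objects into per-field value lists, as the
    object type does. The pairing between fields is lost."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{prefix}.{key}", []).append(value)
    return flat

def object_match(objects, user, rating):
    """Match against the flattened form: each condition is checked alone."""
    flat = flatten(objects)
    return user in flat["comments.user"] and rating in flat["comments.rating"]

def nested_match(objects, user, rating):
    """Match per object: both conditions must hold for the same comment."""
    return any(o["user"] == user and o["rating"] == rating for o in objects)

comments = [{"user": "张三", "rating": 5}, {"user": "李四", "rating": 3}]
print(object_match(comments, "张三", 3))  # True  (false positive)
print(nested_match(comments, "张三", 3))  # False (correct)
```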

nested: independent sub-documents

{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested",        // declared as nested
        "properties": {
          "user": { "type": "keyword" },
          "rating": { "type": "integer" }
        }
      }
    }
  }
}

// Query
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "bool": {
          "must": [
            { "match": { "comments.user": "张三" } },
            { "match": { "comments.rating": 5 } }
          ]
        }
      }
    }
  }
}

3. date type

{
  "mappings": {
    "properties": {
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}

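// Several formats are accepted
{
  "created_at": "2024-01-29 10:30:00"     // string
}
{
  "created_at": 1706515200000             // timestamp (milliseconds)
}

4. Array type

// ES has no dedicated array type; any field can hold an array
{
  "tags": ["5G", "双卡", "快充"],         // string array
  "prices": [99.9, 199.9, 299.9]          // numeric array
}

// Constraint: all elements of an array must have the same type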

4.3 How Fields Are Indexed

Multi-Field

{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",              // main field: full-text search
        "fields": {
          "keyword": {
            "type": "keyword"        // sub-field: exact match
          },
          "pinyin": {
            "type": "text",
            "analyzer": "pinyin"     // sub-field: pinyin search (requires a pinyin analysis plugin)
          }
        }
      }
    }
  }
}

// Usage
{
  "query": {
    "match": { "name": "手机" }          // full-text search
  }
}
{
  "query": {
    "term": { "name.keyword": "小米手机" }  // exact match
  }
}
{
  "query": {
    "match": { "name.pinyin": "shouji" }  // pinyin search
  }
}

Doc Values (columnar storage)

Purpose: support sorting, aggregation, and script access

Inverted index:
Term → Documents
"手机" → [Doc1, Doc3, Doc5]

Doc Values (columnar):
Document → Field Values
Doc1 → price: 999
Doc3 → price: 1999
Doc5 → price: 2999
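
The two access directions can be contrasted with a small Python sketch that builds both structures from the same documents. The documents, the whitespace tokenization and the variable names are assumptions made for illustration.

```python
from collections import defaultdict

docs = {
    1: {"name": "小米 手机", "price": 999},
    3: {"name": "华为 手机", "price": 1999},
    5: {"name": "苹果 手机", "price": 2999},
}

# Inverted index: term -> sorted list of document IDs (used for search)
inverted = defaultdict(list)
for doc_id in sorted(docs):
    for term in docs[doc_id]["name"].split():
        inverted[term].append(doc_id)

# Doc values: one column per field, document ID -> value (used for aggregation)
price_column = {doc_id: doc["price"] for doc_id, doc in docs.items()}

hits = inverted["手机"]                                      # search by term
avg_price = sum(price_column[d] for d in hits) / len(hits)   # aggregate over hits
print(hits, avg_price)  # [1, 3, 5] 1999.0
```

Search starts from a term and finds documents; aggregation starts from the matching documents and reads one field for each of them, which is the access pattern a column layout is built for.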

Configuration:

{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword",
        "doc_values": true      // on by default (keyword, numeric, date)
      },
      "description": {
        "type": "text"          // text has no doc_values; the parameter is not accepted here
      }
    }
  }
}

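Tuning advice:

For fields that never need sorting or aggregation, turn doc_values off to save disk
text fields have no doc_values (use a keyword sub-field when you need to aggregate)

Field Data (heap memory cache)

Purpose: aggregation on text fields (not recommended)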

{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fielddata": true        // enabled (use with caution!)
      }
    }
  }
}

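Problems:

Consumes a large amount of heap memory
Can lead to OOM
Poor performance

Alternative: use a multi-field to add a keyword sub-field

4.4 Field Storage Options

_source vs store

_source (default): stores the whole document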

{
  "_source": {
    "name": "iPhone",
    "price": 9999,
    "description": "很长的描述文本..."
  }
}

store: stores a field separately

{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "store": true           // stored separately
      },
      "description": {
        "type": "text",
        "store": false          // not stored separately (default)
      }
    }
  }
}

// At query time, return only fields with store=true
{
  "query": { "match_all": {} },
  "_source": false,
  "stored_fields": ["name"]    // return only name
}

When to use it:

_source contains very large fields (for example Base64 images)
Only some fields need to be returned
To reduce network transfer

enabled

{
  "mappings": {
    "properties": {
      "metadata": {
        "type": "object",
        "enabled": false        // not indexed, only stored
      }
    }
  }
}

Effect:

The data is kept in _source
But it cannot be searched
Suitable for storing opaque metadata

4.5 Analyzers

The analysis process

Original text: "iPhone 15 Pro Max"
    ↓
Char Filter
    ↓ no change
Tokenizer
    ↓ ["iPhone", "15", "Pro", "Max"]
Token Filter
    ↓ lowercase (stop-word removal is optional and off by default in the standard analyzer)
Final terms: ["iphone", "15", "pro", "max"]
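
The three-stage pipeline can be imitated in a few lines of Python. This is a rough approximation, not the standard analyzer: the regular expression stands in for the Unicode-aware tokenizer, and all function names are hypothetical.

```python
import re

def char_filter(text):
    """Character filter stage: a no-op here, as in the example above."""
    return text

def tokenizer(text):
    """Tokenizer stage: split into word tokens (crude stand-in)."""
    return re.findall(r"\w+", text)

def token_filter(tokens):
    """Token filter stage: lowercase every token."""
    return [t.lower() for t in tokens]

def analyze(text):
    return token_filter(tokenizer(char_filter(text)))

print(analyze("iPhone 15 Pro Max"))  # ['iphone', '15', 'pro', 'max']
```

The same `analyze` function must run at index time and at query time; otherwise a query for "iPhone" would look up a term that was never stored in that form.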

Built-in analyzers

# Standard analyzer (default)
"The quick brown fox" → ["the", "quick", "brown", "fox"]

# Simple analyzer
"Hello-World_123" → ["hello", "world"]

# Whitespace analyzer
"Hello World" → ["Hello", "World"]

# Language analyzer (english)
"running quickly" → ["run", "quickli"]  (stemming; stems need not be dictionary words)

Chinese analyzer (IK)

{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_smart": {           // coarse-grained
          "type": "ik_smart"
        },
        "ik_max_word": {        // fine-grained
          "type": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",      // fine-grained at index time
        "search_analyzer": "ik_smart"   // coarse-grained at search time
      }
    }
  }
}

// Tokenization example (exact output depends on the IK dictionary version)
"我爱北京天安门"
ik_smart:    ["我", "爱", "北京", "天安门"]
ik_max_word: ["我", "爱", "北京", "天安", "天安门", "安门"]

Custom analyzer

{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => happy",
            ":( => sad"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["的", "了", "在"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["emoticons"],
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

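4.6 Field Best Practices

1. Choosing field types

Full-text search → text
Exact match → keyword
Numeric ranges → integer/long/double
Date ranges → date
Sorting / aggregation → keyword/numeric/date (requires doc_values)

2. Performance tuning

Disable features you do not need: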

{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword",
        "index": false,          // never searched
        "doc_values": false      // never sorted or aggregated
      },
      "id": {
        "type": "keyword",
        "norms": false           // no scoring needed (already the default for keyword)
      }
    }
  }
}

Index options (index_options):

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "index_options": "offsets"  // docs < freqs < positions < offsets
      }
    }
  }
}

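Option	What is stored	Enables	Disk usage
docs	Document list	Basic search	Smallest
freqs	+ term frequency	Relevance scoring (BM25)	Small
positions	+ positions	Phrase queries (the default for text)	Medium
offsets	+ character offsets	Faster highlighting	Largest

3. Dynamic mapping control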

{
  "mappings": {
    "dynamic": "strict",       // reject fields that are not in the mapping
    "properties": {
      "user_info": {
        "type": "object",
        "dynamic": true        // allow new sub-fields here
      }
    }
  }
}

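Options:

true: new fields are added to the mapping automatically (default)
false: new fields are ignored (not indexed, but kept in _source)
strict: documents with unmapped fields are rejected (an exception is thrown)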
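
The three settings can be compared with a toy function that decides what happens to unmapped fields. `apply_dynamic` is a hypothetical helper, and using the Python type name as the inferred mapping type is a simplification of the real type-detection rules.

```python
def apply_dynamic(mapping, doc, dynamic):
    """Toy model of the `dynamic` setting. Returns what is kept in _source
    and what actually gets indexed; may add fields to `mapping`."""
    indexed = {}
    for field, value in doc.items():
        if field in mapping:
            indexed[field] = value
        elif dynamic == "true":
            mapping[field] = type(value).__name__  # crude type inference
            indexed[field] = value
        elif dynamic == "strict":
            raise ValueError(f"unmapped field [{field}] rejected")
        # dynamic == "false": the field is skipped but stays in _source
    return {"_source": doc, "indexed": indexed}


doc = {"name": "iPhone", "color": "black"}

m_true = {"name": "text"}
r_true = apply_dynamic(m_true, doc, "true")     # color is added to the mapping

m_false = {"name": "text"}
r_false = apply_dynamic(m_false, doc, "false")  # color only lives in _source

try:
    apply_dynamic({"name": "text"}, doc, "strict")
    rejected = False
except ValueError:
    rejected = True
```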

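4. Field naming conventions

Recommended:
- created_at (snake_case)
- user_id
- product_name

Avoid:
- createdAt (camelCase)
- CreateDate (leading capital)
- product-name (hyphen, can clash with query syntax)

V. How the Three Relate and Work Together

5.1 Hierarchy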

┌─────────────────────────────────────┐
│         Cluster                     │
│  ┌───────────────────────────────┐  │
│  │    Index - products           │  │
│  │  ┌─────────────────────────┐  │  │
│  │  │  Mapping                │  │  │
│  │  │  ├─ name: text          │  │  │
│  │  │  ├─ price: double       │  │  │
│  │  │  └─ tags: keyword       │  │  │
│  │  └─────────────────────────┘  │  │
│  │                                │  │
│  │  Documents                    │  │
│  │  ┌─────────────────────────┐  │  │
│  │  │ Document 1 (_id: 1001)  │  │  │
│  │  │ ├─ name: "iPhone"       │  │  │
│  │  │ ├─ price: 9999          │  │  │
│  │  │ └─ tags: ["5G"]         │  │  │
│  │  └─────────────────────────┘  │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

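5.2 Collaboration Flow

Write flow

1. Define the Index + Mapping
   ↓
2. Create a Document
   ↓
3. Parse the Fields of the Document
   ↓
4. Analyze / index each Field according to its type
   ↓
5. Store it in a Shard of the Index

Query flow

1. Specify the Index
   ↓
2. Specify the query conditions (at Field level)
   ↓
3. Look up matching Document IDs in the inverted index
   ↓
4. Fetch the full Documents from storage
   ↓
5. Return the Fields of the Documents

5.3 A Worked Example

Scenario: e-commerce product search

1. Create the Index and Mapping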

PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik_smart": {
          "type": "ik_smart"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "category": {
        "type": "keyword"
      },
      "brand": {
        "type": "keyword"
      },
      "price": {
        "type": "double"
      },
      "sales": {
        "type": "long"
      },
      "tags": {
        "type": "keyword"
      },
      "description": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "specs": {
        "type": "object",
        "properties": {
          "storage": { "type": "keyword" },
          "color": { "type": "keyword" }
        }
      },
      "created_at": {
        "type": "date"
      }
    }
  }
}

2. Insert a Document

POST /products/_doc/1001
{
  "product_id": "P1001",
  "name": "iPhone 15 Pro Max",
  "category": "手机",
  "brand": "Apple",
  "price": 9999,
  "sales": 12580,
  "tags": ["5G", "A17芯片", "钛金属"],
  "description": "最新款苹果旗舰手机，搭载A17 Pro芯片",
  "specs": {
    "storage": "256GB",
    "color": "钛金属"
  },
  "created_at": "2024-01-29T10:00:00Z"
}

3. Search Documents

# Full-text search + multi-condition filtering
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "苹果手机",
            "fields": ["name^3", "description"]
          }
        }
      ],
      "filter": [
        { "term": { "category": "手机" } },
        { "range": { "price": { "gte": 5000, "lte": 12000 } } },
        { "terms": { "tags": ["5G"] } }
      ]
    }
  },
  "sort": [
    { "sales": { "order": "desc" } },
    { "_score": { "order": "desc" } }
  ],
  "aggs": {
    "brands": {
      "terms": { "field": "brand" }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 3000 },
          { "from": 3000, "to": 6000 },
          { "from": 6000 }
        ]
      }
    }
  }
}

4. Update a Field

# Price cut for a promotion
POST /products/_update/1001
{
  "script": {
    "source": "ctx._source.price = params.new_price",
    "params": {
      "new_price": 8999
    }
  }
}

VI. Case Studies

Case 1: Log Analysis System

Index design

# Index rolled over by date
PUT /logs-2024.01.29
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "refresh_interval": "5s"
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "level": { "type": "keyword" },
      "logger": { "type": "keyword" },
      "message": { "type": "text", "analyzer": "standard" },
      "thread": { "type": "keyword" },
      "host": { "type": "keyword" },
      "ip": { "type": "ip" },
      "user_id": { "type": "keyword" },
      "trace_id": { "type": "keyword" },
      "exception": {
        "type": "object",
        "properties": {
          "class": { "type": "keyword" },
          "message": { "type": "text" },
          "stacktrace": { "type": "text", "index": false }
        }
      }
    }
  }
}

Document example

{
  "@timestamp": "2024-01-29T10:30:45.123Z",
  "level": "ERROR",
  "logger": "com.example.OrderService",
  "message": "Failed to process order: timeout",
  "thread": "http-nio-8080-exec-5",
  "host": "app-server-01",
  "ip": "192.168.1.100",
  "user_id": "user-12345",
  "trace_id": "a1b2c3d4e5f6",
  "exception": {
    "class": "java.net.SocketTimeoutException",
    "message": "Read timed out",
    "stacktrace": "java.net.SocketTimeoutException: Read timed out\n\tat ..."
  }
}

Query and analysis

# Error log trend
GET /logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "ERROR" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "error_timeline": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1m"
      },
      "aggs": {
        "top_errors": {
          "terms": { "field": "exception.class" }
        }
      }
    }
  }
}

Case 2: User Profile System

Index design

PUT /user_profiles
{
  "mappings": {
    "properties": {
      "user_id": { "type": "keyword" },
      "basic_info": {
        "properties": {
          "age": { "type": "integer" },
          "gender": { "type": "keyword" },
          "city": { "type": "keyword" },
          "location": { "type": "geo_point" }
        }
      },
      "preferences": {
        "type": "keyword"
      },
      "behaviors": {
        "type": "nested",
        "properties": {
          "action": { "type": "keyword" },
          "category": { "type": "keyword" },
          "timestamp": { "type": "date" }
        }
      },
      "tags": { "type": "keyword" },
      "score": { "type": "double" }
    }
  }
}

Document example

{
  "user_id": "u_123456",
  "basic_info": {
    "age": 28,
    "gender": "male",
    "city": "北京",
    "location": {
      "lat": 39.9042,
      "lon": 116.4074
    }
  },
  "preferences": ["数码", "运动", "旅游"],
  "behaviors": [
    {
      "action": "view",
      "category": "手机",
      "timestamp": "2024-01-29T10:00:00Z"
    },
    {
      "action": "purchase",
      "category": "手机",
      "timestamp": "2024-01-29T10:30:00Z"
    }
  ],
  "tags": ["高消费", "科技爱好者", "活跃用户"],
  "score": 85.6
}