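Getting Started with Elasticsearch (Part 2): Basic Document Operations

This article was first published on the WeChat official account "itThinking". Original link: mp.weixin.qq.com/s/kxIi__34u...

Introduction

Elasticsearch is a powerful distributed search and analytics engine, and document operations are its core functionality. This article covers the basic document CRUD (create, read, update, delete) operations in Elasticsearch, with detailed examples.

Initializing the Index

Before working with documents, we first create an index with an explicit mapping, much like creating a table in a database. The examples in this article use a blog index.

Create the index and define the mapping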

# Request
curl -X PUT "localhost:9200/my_blog" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",  // Chinese analyzer (requires the IK analysis plugin)
        "search_analyzer": "ik_smart"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "author": {
        "type": "keyword"  // keyword type, for exact matching and aggregations
      },
      "tags": {
        "type": "keyword",
        "ignore_above": 256  // values longer than 256 characters are not indexed
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "updated_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "views": {
        "type": "integer"
      },
      "is_published": {
        "type": "boolean"
      },
      "rating": {
        "type": "float"
      },
      "metadata": {
        "type": "object",  // object type
        "properties": {
          "category": {
            "type": "keyword"
          },
          "read_time": {
            "type": "integer"
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "1s"
  }
}
'

# Response
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "my_blog"
}

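The blog index has 10 top-level fields; fields unrelated to search have been left out.

1. settings: holds the shard and replica configuration. Here the index has 3 primary shards, each with 1 replica.
2. mappings: defines the field types and attributes of a document, comparable to column definitions in a database. A dedicated article on mappings will follow.

View the mapping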

curl -X GET "localhost:9200/my_blog/_mapping"

Update the mapping (add a new field)

# Request
curl -X PUT "localhost:9200/my_blog/_mapping" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "new_field": {
      "type": "text"
    }
  }
}
'
# Response
{
  "acknowledged": true
}

1. Creating Documents (Create)

ES provides two ways to create documents:

1. Index a document with the Index API.
2. Create a document with the Create API.

Index API

# Request
curl -X PUT "localhost:9200/my_blog/_doc/1" -H 'Content-Type: application/json' -d'
{
  "title": "Elasticsearch 入门教程",
  "content": "这是一篇关于 Elasticsearch 基础操作的教程，适合初学者学习",
  "author": "张三",
  "tags": ["搜索", "数据库", "教程", "技术"],
  "created_at": "2023-10-01 09:00:00",
  "updated_at": "2023-10-01 09:00:00",
  "views": 100,
  "is_published": true,
  "rating": 4.5,
  "metadata": {
    "category": "技术文章",
    "read_time": 10
  }
}
'

# Response
{
  "_index": "my_blog",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

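The result is created and _version is 1. If a document is written through the Index API with an ID that already exists, repeated requests do not fail: the result in the response becomes updated and _version increases by 1 each time.

When a document is indexed with an ID that already exists, the old document is deleted first, the new content is then written, and the document version is incremented.

Create API

Create a document with a specified ID

# Request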
curl -X PUT "localhost:9200/my_blog/_create/2" -H 'Content-Type: application/json' -d'
{
  "title": "Elasticsearch 入门教程",
  "content": "这是一篇关于 Elasticsearch 基础操作的教程，适合初学者学习",
  "author": "张三",
  "tags": ["搜索", "数据库", "教程", "技术"],
  "created_at": "2023-10-01 09:00:00",
  "updated_at": "2023-10-01 09:00:00",
  "views": 100,
  "is_published": true,
  "rating": 4.5,
  "metadata": {
    "category": "技术文章",
    "read_time": 10
  }
}
'
# Response
{
  "_index": "my_blog",
  "_type": "_doc",
  "_id": "2",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}

Create a document with an auto-generated ID

# Request
curl -X POST "localhost:9200/my_blog/_doc" -H 'Content-Type: application/json' -d'
{
  "title": "Elasticsearch 高级技巧",
  "content": "深入学习 Elasticsearch 的高级功能和性能优化",
  "author": "李四",
  "tags": ["高级", "优化", "性能", "分布式"],
  "created_at": "2023-10-02 14:30:00",
  "updated_at": "2023-10-02 14:30:00",
  "views": 50,
  "is_published": true,
  "rating": 4.8,
  "metadata": {
    "category": "进阶教程",
    "read_time": 15
  }
}
'

# Response
{
  "_index": "my_blog",
  "_type": "_doc",
  "_id": "xxxxxxxxxxxxxxxx",  // ID generated automatically by Elasticsearch
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}

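No.	Statement	Behavior
1	PUT my_blog/_doc/1	Requires an ID. Writing a document with the same ID again only increments the version in the response and changes result to updated; under the hood the old document is deleted, the new one is written, and the version number is incremented by 1.
2	PUT my_blog/_create/1	Also requires an ID, but writing a document whose ID already exists returns an error with status code 409.
3	POST my_blog/_doc	No document ID is required; the system generates one automatically.

The table above summarizes the three ways to create a document:

• If you need to update document content, use the first approach.
• If you need a uniqueness check when writing a document, use the second approach.
• If you want the system to create the document ID for you, use the third approach.

Compared with the first approach, the third writes more efficiently, because it does not need to check whether the document already exists or perform the subsequent delete.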
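
To make the difference between the three write modes concrete, here is a small in-memory Python model. It is only a toy that mimics the observable behavior described above (result, version, and the 409 conflict); the names `ToyIndex` and `ConflictError` are made up for this illustration and have nothing to do with how Elasticsearch is implemented internally.

```python
import uuid


class ConflictError(Exception):
    """Stands in for the HTTP 409 returned on a create conflict."""


class ToyIndex:
    """In-memory stand-in for one index: maps _id -> (version, source)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, source):
        # PUT <index>/_doc/<id>: create the document or replace it entirely.
        if doc_id in self.docs:
            version = self.docs[doc_id][0] + 1
            result = "updated"
        else:
            version = 1
            result = "created"
        self.docs[doc_id] = (version, dict(source))
        return {"_id": doc_id, "_version": version, "result": result}

    def create(self, doc_id, source):
        # PUT <index>/_create/<id>: refuse to overwrite an existing ID.
        if doc_id in self.docs:
            raise ConflictError(f"document {doc_id} already exists")
        self.docs[doc_id] = (1, dict(source))
        return {"_id": doc_id, "_version": 1, "result": "created"}

    def index_auto_id(self, source):
        # POST <index>/_doc: a fresh ID is generated, so nothing is looked up.
        doc_id = uuid.uuid4().hex
        self.docs[doc_id] = (1, dict(source))
        return {"_id": doc_id, "_version": 1, "result": "created"}
```

Calling `index("1", ...)` twice yields created and then updated with version 2, while calling `create("1", ...)` afterwards raises `ConflictError`.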

Write requests also accept control parameters, for example routing (e.g. routing=user123), refresh (e.g. refresh=true), and timeout (e.g. timeout=1m).
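
These control parameters are passed in the query string of the request URL. The helper below is a rough sketch of assembling such a URL in Python; `build_doc_url` is a hypothetical name, and the base address simply reuses the localhost:9200 from the examples in this article.

```python
from urllib.parse import urlencode


def build_doc_url(index, doc_id=None, base="http://localhost:9200", **params):
    """Build a document URL, appending query-string control parameters."""
    path = f"{base}/{index}/_doc"
    if doc_id is not None:
        path += f"/{doc_id}"
    # Drop parameters that were not supplied, then URL-encode the rest.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{path}?{query}" if query else path


url = build_doc_url("my_blog", "1", routing="user123", refresh="true", timeout="1m")
```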

2. Retrieving Documents (Read)

Get a document by ID

# Request
curl -X GET "localhost:9200/my_blog/_doc/1"

# Response
{
  "_index": "my_blog",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "title": "Elasticsearch 入门教程",
    "content": "这是一篇关于 Elasticsearch 基础操作的教程，适合初学者学习",
    "author": "张三",
    "tags": ["搜索", "数据库", "教程", "技术"],
    "created_at": "2023-10-01 09:00:00",
    "updated_at": "2023-10-01 09:00:00",
    "views": 100,
    "is_published": true,
    "rating": 4.5,
    "metadata": {
      "category": "技术文章",
      "read_time": 10
    }
  }
}

Get the document source (returns only the _source field)

# Request
curl -X GET "localhost:9200/my_blog/_source/1"

# Response
{
  "title": "Elasticsearch 入门教程",
  "content": "这是一篇关于 Elasticsearch 基础操作的教程，适合初学者学习",
  "author": "张三",
  "tags": ["搜索", "数据库", "教程", "技术"],
  "created_at": "2023-10-01 09:00:00",
  "updated_at": "2023-10-01 09:00:00",
  "views": 100,
  "is_published": true,
  "rating": 4.5,
  "metadata": {
    "category": "技术文章",
    "read_time": 10
  }
}

Get specific fields

# Request
curl -X GET "localhost:9200/my_blog/_doc/1?_source=title,author,views"

# Response
{
  "_index": "my_blog",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "title": "Elasticsearch 入门教程",
    "author": "张三",
    "views": 100
  }
}

Get multiple documents with the MGET API

curl -X GET "localhost:9200/my_blog/_mget" -H 'Content-Type: application/json' -d' 
{
  "ids": ["1", "3", "5"]
} '

For more MGET usage, see www.elastic.co/guide/en/el...
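
An MGET response contains a docs array with one entry per requested ID, and each entry carries a found flag. Since some of the requested IDs may not exist, client code usually needs to separate hits from misses. A minimal Python sketch, where `split_mget_response` is a hypothetical helper and the sample response is abbreviated:

```python
def split_mget_response(response):
    """Separate an _mget response into found sources and missing IDs."""
    found, missing = {}, []
    for item in response.get("docs", []):
        if item.get("found"):
            found[item["_id"]] = item.get("_source", {})
        else:
            missing.append(item["_id"])
    return found, missing


# Abbreviated sample response for illustration only.
sample = {
    "docs": [
        {"_index": "my_blog", "_id": "1", "found": True, "_source": {"views": 100}},
        {"_index": "my_blog", "_id": "5", "found": False},
    ]
}
found, missing = split_mget_response(sample)
```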

3. Updating Documents (Update)

Partial field update

# Request
curl -X POST "localhost:9200/my_blog/_update/1" -H 'Content-Type: application/json' -d'
{
  "doc": {
    "views": 200,
    "updated_at": "2023-10-05 16:45:00",
    "metadata": {
      "read_time": 12
    }
  }
}
'

# Response
{
  "_index": "my_blog",
  "_type": "_doc",
  "_id": "1",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 3,
  "_primary_term": 1
}
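
A partial update merges the doc object into the stored _source, descending into nested objects, so sibling fields that are not mentioned are kept. The Python sketch below imitates that merge to show the effect; `merge_partial` is a hypothetical name and this is an illustration of the behavior, not Elasticsearch's actual code. Note that keys are matched literally, so a dotted key such as "metadata.read_time" would be added as a new top-level key rather than reaching into the metadata object.

```python
def merge_partial(source, changes):
    """Recursively merge `changes` into `source`, returning a new dict."""
    merged = dict(source)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge field by field.
            merged[key] = merge_partial(merged[key], value)
        else:
            # Scalars, arrays, and new keys simply replace the old value.
            merged[key] = value
    return merged


stored = {"views": 100, "metadata": {"category": "tech", "read_time": 10}}
nested = merge_partial(stored, {"views": 200, "metadata": {"read_time": 12}})
dotted = merge_partial(stored, {"metadata.read_time": 12})
```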

Update with a script

curl -X POST "localhost:9200/my_blog/_update/1" -H 'Content-Type: application/json' -d'
{
  "script": {
    "source": "ctx._source.views += params.increment; ctx._source.updated_at = params.update_time",
    "params": {
      "increment": 10,
      "update_time": "2023-10-06 10:00:00"
    },
    "lang": "painless"
  }
}
'

4. Deleting Documents (Delete)

Delete a document by ID

curl -X DELETE "localhost:9200/my_blog/_doc/1"

Delete documents matching a query

curl -X POST "localhost:9200/my_blog/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "views": {
        "lt": 50
      }
    }
  }
}
'

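5. Bulk Operations (Bulk API)

When writing many documents, the Bulk API processes them in batches to improve throughput.

The Bulk API can operate on different indices in a single call; the index can be specified in the body or in the URI. It supports the following 4 operation types: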

• Index
• Create
• Update
• Delete

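The Bulk API body is newline-delimited JSON: one line specifies the operation type and metadata (index, document ID, and so on), and the next line carries the content of that operation. A delete operation has no content line, and the body must end with a newline.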

curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "my_blog", "_id" : "3" } }
{ "title": "批量操作教程", "author": "王五", "views": 30, "created_at": "2023-10-03", "is_published": true }
{ "update" : { "_index" : "my_blog", "_id" : "2" } }
{ "doc" : { "views": 75, "updated_at": "2023-10-07" } }
{ "delete" : { "_index" : "my_blog", "_id" : "1" } }
'
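
When the bulk body is generated by a program rather than typed by hand, the newline-delimited format is easy to get wrong. Below is a minimal Python sketch that serializes a list of operations into such a body; `build_bulk_body` and the tuple layout are assumptions made for this example.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, payload) tuples into an NDJSON bulk body."""
    lines = []
    for action, metadata, payload in operations:
        lines.append(json.dumps({action: metadata}, ensure_ascii=False))
        # delete carries no payload line; index, create and update do.
        if action != "delete":
            lines.append(json.dumps(payload, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


ops = [
    ("index", {"_index": "my_blog", "_id": "3"}, {"title": "批量操作教程", "views": 30}),
    ("update", {"_index": "my_blog", "_id": "2"}, {"doc": {"views": 75}}),
    ("delete", {"_index": "my_blog", "_id": "1"}, None),
]
body = build_bulk_body(ops)
```

The resulting string can be sent as the request body of POST /_bulk.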

6. Search Operations

Simple query

curl -X GET "localhost:9200/my_blog/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "title": "Elasticsearch 教程"
    }
  }
}
'

Compound query

curl -X GET "localhost:9200/my_blog/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "教程" } }
      ],
      "filter": [
        { "range": { "views": { "gte": 50 } } },
        { "term": { "is_published": true } }
      ]
    }
  },
  "sort": [
    { "views": { "order": "desc" } }
  ],
  "from": 0,
  "size": 10
}
'
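
In application code the same compound query is usually built from parameters, with from derived from a page number. A rough Python sketch under those assumptions (`build_search_body` and its arguments are hypothetical):

```python
def build_search_body(keyword, min_views, page=1, page_size=10):
    """Build the bool query above for a given keyword, view floor, and page."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    return {
        "query": {
            "bool": {
                # must clauses contribute to the relevance score.
                "must": [{"match": {"title": keyword}}],
                # filter clauses only include or exclude documents.
                "filter": [
                    {"range": {"views": {"gte": min_views}}},
                    {"term": {"is_published": True}},
                ],
            }
        },
        "sort": [{"views": {"order": "desc"}}],
        # Pages are 1-based here; from is the 0-based offset of the first hit.
        "from": (page - 1) * page_size,
        "size": page_size,
    }


body = build_search_body("教程", 50, page=3)
```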

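Best Practices

1. Mapping design: design the mapping carefully before creating the index, to avoid changing field types later
2. Version control: use the if_seq_no and if_primary_term parameters (or external versioning with version and version_type=external) to handle concurrent write conflicts
3. Bulk operations: use the Bulk API for large volumes of data to improve efficiency
4. Error handling: check the errors field in the response to handle failures within a bulk operation
5. Refresh strategy: adjust refresh_interval as needed to balance write performance against search freshness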
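
For the error-handling point, note that a bulk request can succeed at the HTTP level while individual operations fail. The top-level errors flag says whether anything failed, and each entry in items carries its own status and, on failure, an error object. A minimal Python sketch for collecting the failures, where `failed_bulk_items` is a hypothetical helper and the sample response is abbreviated:

```python
def failed_bulk_items(response):
    """Return (action, _id, status, error) for each failed bulk item."""
    if not response.get("errors"):
        return []
    failures = []
    for item in response.get("items", []):
        # Each item has a single key naming the action that was performed.
        for action, result in item.items():
            if "error" in result:
                failures.append(
                    (action, result.get("_id"), result.get("status"), result["error"])
                )
    return failures


# Abbreviated sample response for illustration only.
sample = {
    "errors": True,
    "items": [
        {"index": {"_id": "3", "status": 201}},
        {"create": {"_id": "1", "status": 409,
                    "error": {"type": "version_conflict_engine_exception"}}},
    ],
}
failures = failed_bulk_items(sample)
```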

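Summary

This article walked through the full CRUD workflow for Elasticsearch documents, including:

1. Mapping initialization: how to define field types and attributes
2. Creating documents: with a specified ID or an auto-generated ID
3. Reading documents: fetching a single document and fetching in batches
4. Updating documents: partial updates and script updates
5. Deleting documents: deleting a single document and deleting by query
6. Bulk operations: using the Bulk API for efficient processing

With a well-designed mapping and the right operations, you can take full advantage of Elasticsearch. In real projects, consider using the official client libraries for a better development experience and performance.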