Elasticsearch Synthetic _source

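The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and so is not searchable), but it is stored so that it can be returned by fetch requests such as get or search.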

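If disk usage matters, consider the following options:

Use synthetic _source, which reconstructs the source content at retrieval time instead of storing it on disk. This reduces disk usage but makes access to _source slower in Get and Search queries.
Disable the _source field entirely. This reduces disk usage but disables features that rely on _source.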

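What is synthetic _source?

When a document is indexed, the values of some fields, such as those that need doc_values or stored fields, are copied from _source into separate small lists according to their data type (different data structures on disk, used for pattern matching), so that those values can be searched independently. Once the desired value is found in the small lists, the original document is returned. Because only the small lists are searched rather than every field value of every document, searches take less time. This speeds things up, but it stores the same data twice: once in the small lists and once in the original document.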

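Synthetic _source is an index configuration mode that changes how documents are processed at ingest time in order to save storage space and avoid duplicated data. It still creates the separate lists, but it does not keep the original document. Instead, once a value has been found, the _source content is rebuilt from the data in the small lists. Because only the lists are stored on disk and the original document is not, a substantial amount of storage is saved.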

PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  }
}
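
The following toy model in Python may help make the idea concrete. It is only a sketch and not how Elasticsearch is implemented: the class ToyIndex and its methods are invented for illustration, and each per-field list is modelled as a plain dictionary keyed by document ID. With synthetic mode on, the original document is never kept, and _source is rebuilt from the per-field lists when the document is retrieved.

```python
class ToyIndex:
    """Hypothetical in-memory model of standard vs. synthetic _source."""

    def __init__(self, synthetic):
        self.synthetic = synthetic
        self.doc_values = {}      # field name -> {doc_id: value}
        self.stored_source = {}   # doc_id -> original document (standard mode only)

    def index(self, doc_id, doc):
        # Every field value is copied into its own per-field list.
        for field, value in doc.items():
            self.doc_values.setdefault(field, {})[doc_id] = value
        # Only the standard mode also keeps the original document.
        if not self.synthetic:
            self.stored_source[doc_id] = dict(doc)

    def get_source(self, doc_id):
        if not self.synthetic:
            return self.stored_source[doc_id]
        # Synthetic mode: rebuild the document from the per-field lists.
        return {
            field: column[doc_id]
            for field, column in self.doc_values.items()
            if doc_id in column
        }


index = ToyIndex(synthetic=True)
index.index(1, {"host": "Host_01", "cpu": 42})
print(index.get_source(1))
```

In this model the synthetic index returns the same fields as a standard one while its stored_source dictionary stays empty, which is the storage saving described above. The rebuild step in get_source is the extra work paid at retrieval time.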

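Note that because the _source values are rebuilt on the fly when a document is retrieved, the rebuild takes extra time. On-the-fly reconstruction is usually slower than storing the source document and loading it at query time, so the storage saving comes at the cost of slower retrieval. The extra latency can be avoided by not loading the _source field when it is not needed.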
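
One way to avoid loading _source, sketched here in Python, is to turn it off per request and read the needed fields from doc values instead. The "_source": false and docvalue_fields options belong to the Elasticsearch search API; the helper name search_body_without_source is made up for this sketch, and keyword_field is borrowed from the test mapping later in this article. The script only builds and prints the request body; sending it (for example to GET test_synthetic/_search) is left out.

```python
import json


def search_body_without_source(query, docvalue_fields):
    """Build a search body that skips _source and returns the listed
    fields from doc values instead (hypothetical helper)."""
    return {
        "_source": False,  # do not load or rebuild _source for the hits
        "docvalue_fields": list(docvalue_fields),
        "query": query,
    }


body = search_body_without_source(
    {"match": {"keyword_field": "Host_01"}},
    ["keyword_field"],
)
print(json.dumps(body, indent=2))
```

This only helps for fields that actually have doc values, such as keyword and numeric fields; whether it pays off depends on how many fields a caller needs.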

Synthetic _source is now widely used in logsdb and TSDB indices, where it can save a great deal of disk space.

Related article: Elasticsearch 8.17 Logsdb: a tool for cutting enterprise costs and improving efficiency

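Supported fields

Synthetic _source is supported by all field types. Depending on implementation details, field types have different properties when used with synthetic _source.

Most field types build synthetic _source from data that already exists, most commonly doc_values and stored fields. For these field types, no additional space is needed to store the contents of the _source field. Because of the storage layout of doc_values, the generated _source field is modified compared with the original document.

For all other field types, the original value of the field is stored as is, in the same way as the _source field in non-synthetic mode. In this case there are no modifications, and the field data in _source is identical to the original document. Similarly, malformed values of fields that use ignore_malformed or ignore_above need to be stored as is. This approach is less storage efficient, because the data needed to rebuild _source is stored in addition to the other data required to index the field (such as doc_values).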

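Synthetic _source restrictions

Some field types have additional restrictions. These are documented in the synthetic _source section of each field type's documentation.

Synthetic _source is not supported in source-only snapshot repositories. To store indices that use synthetic _source, choose a different repository type.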

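Synthetic _source modifications

When synthetic _source is enabled, retrieved documents are modified in some ways compared with the original JSON.

Arrays moved to leaf fields

Because the _source values are rebuilt from the values in the doc values lists, arrays in synthetic _source are moved to the leaf fields. For example: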

PUT idx/_doc/1
{
  "foo": [
    {
      "bar": 1
    },
    {
      "bar": 2
    }
  ]
}

becomes:

{
  "foo": {
    "bar": [1, 2]
  }
}

This can cause some arrays to vanish:

PUT idx/_doc/1
{
  "foo": [
    {
      "bar": 1
    },
    {
      "baz": 2
    }
  ]
}

becomes:

{
  "foo": {
    "bar": 1,
    "baz": 2
  }
}
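
The two transformations above can be mimicked with a short Python sketch. It is an illustration only and makes no claim about Elasticsearch internals: the function names are invented here, values are simply collected per leaf path, and a leaf that ends up with a single value is written back without an array. Other effects of the doc_values layout are not modelled.

```python
def collect_leaves(value, path, leaves):
    """Gather every scalar under its full field path, ignoring array nesting."""
    if isinstance(value, dict):
        for key, child in value.items():
            collect_leaves(child, path + (key,), leaves)
    elif isinstance(value, list):
        for item in value:
            collect_leaves(item, path, leaves)
    else:
        leaves.setdefault(path, []).append(value)


def move_arrays_to_leaves(doc):
    """Rebuild a document so that arrays appear only on leaf fields."""
    leaves = {}
    collect_leaves(doc, (), leaves)
    rebuilt = {}
    for path, values in leaves.items():
        node = rebuilt
        for key in path[:-1]:
            node = node.setdefault(key, {})
        # A single collected value is emitted as a scalar, so the array vanishes.
        node[path[-1]] = values[0] if len(values) == 1 else values
    return rebuilt


print(move_arrays_to_leaves({"foo": [{"bar": 1}, {"bar": 2}]}))
print(move_arrays_to_leaves({"foo": [{"bar": 1}, {"baz": 2}]}))
```

The second call shows why the array disappears: bar and baz each receive exactly one value, so nothing is left to indicate that the original document held an array of two objects.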

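Fields named as they are mapped

Synthetic _source names fields as they are named in the mappings. When used with dynamic mapping, fields with dots (.) in their names are, by default, interpreted as multiple objects, while dots in field names are preserved inside objects that have subobjects disabled. For example: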

PUT idx/_doc/1
{
  "foo.bar.baz": 1
}

becomes:

{
  "foo": {
    "bar": {
      "baz": 1
    }
  }
}
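
The default expansion of dotted names shown above can be sketched in Python as follows. The function expand_dotted_keys is a hypothetical helper written for this article; it handles only flat top-level keys and assumes no key collides with a prefix of another (for example "foo" next to "foo.bar").

```python
def expand_dotted_keys(doc):
    """Turn {"a.b.c": v} into {"a": {"b": {"c": v}}} for top-level keys."""
    expanded = {}
    for key, value in doc.items():
        *parents, leaf = key.split(".")
        node = expanded
        for part in parents:
            # Create the intermediate object on first use, then descend into it.
            node = node.setdefault(part, {})
        node[leaf] = value
    return expanded


print(expand_dotted_keys({"foo.bar.baz": 1}))
```

Inside an object with subobjects disabled this expansion would not happen, and the dotted name would be kept as written.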

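How to configure an index for synthetic _source mode

Synthetic _source is switched on through the index.mapping.source.mode index setting. The test that follows compares an index in synthetic _source mode with a standard index.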

PUT index
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  }
}

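Test

The standard index uses a multi-field to show how a document can be retrieved through both full-text search and aggregations, and how the _source content still includes the value of a disabled field.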

PUT test_standard
{
  "mappings": {
    "properties": {
      "disabled_field": {
        "enabled": false
      },
      "multi_field": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Let's index some sample documents:

PUT test_standard/_doc/1
{
  "multi_field": "Host_01",
  "disabled_field" : "Required for storage 01"
}

PUT test_standard/_doc/2
{
  "multi_field": "Host_02",
  "disabled_field" : "Required for storage 02"
}

PUT test_standard/_doc/3
{
  "multi_field": "Host_03",
  "disabled_field" : "Required for storage 03"
}

A full-text search retrieves the documents together with their _source content:

GET test_standard/_search
{
  "query": {
    "match": {
      "multi_field": "host_01"
    }
  }
}

Result: the document is retrieved by a full-text search against the analyzed field. The hit contains every value in _source, including the disabled field:

{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.9808291,
    "hits": [
      {
        "_index": "test_standard",
        "_id": "1",
        "_score": 0.9808291,
        "_source": {
          "multi_field": "Host_01",
          "disabled_field": "Required for storage 01"
        }
      }
    ]
  }
}

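The index in synthetic _source mode also uses a multi-field, this time to show how a "text" field can rely on the doc values of its keyword sub-field, and to check what happens to the values of a disabled field.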

PUT test_synthetic
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keyword_field": {
        "type": "keyword"
      },
      "multi_field": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "text_field": {
        "type": "text"
      },
      "disabled_field": {
        "enabled": false
      },
      "skills_array_field": {
        "properties": {
          "language": {
            "type": "text"
          },
          "level": {
            "type": "text"
          }
        }
      }
    }
  }
}

Let's index some sample documents:

PUT test_synthetic/_doc/1
{
  "keyword_field": "Host_01",
  "disabled_field": "Required for storage 01",
  "multi_field": "Some info about computer 1",
  "text_field": "This is a text field 1",
  "skills_array_field": [
    {
      "language": "ruby",
      "level": "expert"
    },
    {
      "language": "javascript",
      "level": "beginner"
    }
  ],
  "foo": [
    {
      "bar": 1
    },
    {
      "bar": 2
    }
  ],
  "foo1.bar.baz": 1
}

PUT test_synthetic/_doc/2
{
  "keyword_field": "Host_02",
  "disabled_field": "Required for storage 02",
  "multi_field": "Some info about computer 2",
  "text_field": "This is a text field 2",
  "skills_array_field": [
    {
      "language": "C",
      "level": "guru"
    },
    {
      "language": "javascript",
      "level": "beginner"
    }
  ],
  "foo": [
    {
      "bar": 1
    },
    {
      "bar": 2
    }
  ],
  "foo1.bar.baz": 2
}

PUT test_synthetic/_doc/3
{
  "keyword_field": "Host_03",
  "disabled_field": "Required for storage 03",
  "multi_field": "Some info about computer 3",
  "text_field": "This is a text field 3",
  "skills_array_field": [
    {
      "language": "golang",
      "level": "beginner"
    }
  ],
  "foo": [
    {
      "bar": 1
    },
    {
      "bar": 2
    }
  ],
  "foo1.bar.baz": 3
}

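Searching a "keyword" field requires an exact match. Note that in the response below the value of disabled_field is still returned, which matches the earlier statement that values which cannot be rebuilt from doc_values or stored fields are stored as is.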

GET test_synthetic/_search
{
  "query": {
    "match": {
      "keyword_field": "Host_01"
    }
  }
}

Response:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.9808291,
    "hits": [
      {
        "_index": "test_synthetic",
        "_id": "1",
        "_score": 0.9808291,
        "_source": {
          "keyword_field": "Host_01",
          "disabled_field": "Required for storage 01",
          "multi_field": "Some info about computer 1",
          "text_field": "This is a text field 1",
          "skills_array_field": [
            {
              "language": "ruby",
              "level": "expert"
            },
            {
              "language": "javascript",
              "level": "beginner"
            }
          ],
          "foo": [
            {
              "bar": 1
            },
            {
              "bar": 2
            }
          ],
          "foo1.bar.baz": 1
        }
      }
    ]
  }
}
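
The difference between the analyzed match used against test_standard (where "host_01" found "Host_01") and the exact match needed on keyword_field can be approximated in Python. This is a rough sketch under stated assumptions: analyze_text is a simplified stand-in for the standard analyzer (split into word tokens and lowercase), not the real tokenizer, and all three function names are invented for illustration.

```python
import re


def analyze_text(value):
    """Simplified stand-in for text analysis: word tokens, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", value)]


def text_match(indexed_value, query):
    """A match on a text field succeeds if any query token was indexed."""
    indexed_tokens = set(analyze_text(indexed_value))
    return any(token in indexed_tokens for token in analyze_text(query))


def keyword_match(indexed_value, query):
    """A keyword field is not analyzed, so the whole value must be equal."""
    return indexed_value == query


print(text_match("Host_01", "host_01"))        # analyzed field, case ignored
print(keyword_match("Host_01", "host_01"))     # keyword field, case matters
print(keyword_match("Host_01", "Host_01"))     # exact value
```

Under these assumptions the lowercase query matches only the analyzed field, which is why the keyword search above spells the value exactly as it was indexed.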

Let's run one more search, this time a full-text match on multi_field:

GET test_synthetic/_search
{
  "query": {
    "match": {
      "multi_field": "info"
    }
  }
}

The response is:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.13353139,
    "hits": [
      {
        "_index": "test_synthetic",
        "_id": "2",
        "_score": 0.13353139,
        "_source": {
          "keyword_field": "Host_02",
          "disabled_field": "Required for storage 02",
          "multi_field": "Some info about computer 2",
          "text_field": "This is a text field 2",
          "skills_array_field": [
            {
              "language": "C",
              "level": "guru"
            },
            {
              "language": "javascript",
              "level": "beginner"
            }
          ],
          "foo": [
            {
              "bar": 1
            },
            {
              "bar": 2
            }
          ],
          "foo1.bar.baz": 2
        }
      },
      {
        "_index": "test_synthetic",
        "_id": "3",
        "_score": 0.13353139,
        "_source": {
          "keyword_field": "Host_03",
          "disabled_field": "Required for storage 03",
          "multi_field": "Some info about computer 3",
          "text_field": "This is a text field 3",
          "skills_array_field": [
            {
              "language": "golang",
              "level": "beginner"
            }
          ],
          "foo": [
            {
              "bar": 1
            },
            {
              "bar": 2
            }
          ],
          "foo1.bar.baz": 3
        }
      },
      {
        "_index": "test_synthetic",
        "_id": "1",
        "_score": 0.13353139,
        "_source": {
          "keyword_field": "Host_01",
          "disabled_field": "Required for storage 01",
          "multi_field": "Some info about computer 1",
          "text_field": "This is a text field 1",
          "skills_array_field": [
            {
              "language": "ruby",
              "level": "expert"
            },
            {
              "language": "javascript",
              "level": "beginner"
            }
          ],
          "foo": [
            {
              "bar": 1
            },
            {
              "bar": 2
            }
          ],
          "foo1.bar.baz": 1
        }
      }
    ]
  }
}

For further reading, see the official documentation: _source field | Elastic Documentation