Using the Node.js Elasticsearch client to index large CSV files

By joshmock, Elastic

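The bulk API makes it easy to index large numbers of documents into Elasticsearch: convert your data records to JSON documents, interleave them with instructions saying which index each one should be added to, and send that big newline-delimited JSON blob as the body of a single HTTP request to your Elasticsearch cluster. Or, use the Node.js client's bulk function.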
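To make the shape of that request body concrete, here is a minimal sketch that builds it by hand. The `toBulkBody` function is a hypothetical helper written for illustration, not part of the client; the client's bulk function does this serialization for you.

```javascript
// Hypothetical helper (not part of the client): builds the newline-delimited
// JSON (NDJSON) body that the bulk API expects for a list of records.
function toBulkBody(records, indexName) {
  const lines = []
  for (const record of records) {
    // Action line: tells Elasticsearch which index the next document goes to.
    lines.push(JSON.stringify({ index: { _index: indexName } }))
    // Source line: the document itself.
    lines.push(JSON.stringify(record))
  }
  if (lines.length === 0) return ''
  // The bulk API requires the body to end with a newline.
  return lines.join('\n') + '\n'
}

const body = toBulkBody(
  [{ name: 'Ada', city: 'London' }, { name: 'Linus', city: 'Helsinki' }],
  'my_index'
)
console.log(body)
```

Each record contributes two lines, an action followed by its source, which is the same pairing the `operations` array carries when you go through the client.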

Here is how to read a CSV file, convert its rows to JSON objects, and index them:

import { Client } from '@elastic/elasticsearch'
import { parse } from "csv-parse/sync"
import { readFileSync } from 'node:fs'

// Parse the whole file into an array of objects keyed by the header row
const csv = parse(readFileSync('data.csv', 'utf8'), { columns: true })
// Pair each row with an action line naming the target index
const operations = csv.flatMap(row => [
  { index: { _index: "my_index" } },
  row
])

const client = new Client({ node: 'http://localhost:9200' })
// Send every row in a single bulk request
await client.bulk({ operations })

But what if you need to send more data than Elasticsearch can accept in a single request, or your CSV file is too large to load into memory all at once? This is where the bulk helper comes in!

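While the bulk API is simple enough on its own, the helper covers more complex scenarios: it supports streaming input, splits large datasets into multiple requests, and more.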

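For example, if your Elasticsearch server can only receive HTTP requests smaller than 10MB, you can instruct the bulk helper to split up your data by setting a flushBytes value. A bulk request is sent whenever the pending request is about to exceed that value: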

const csv = parse(readFileSync('data.csv', 'utf8'), { columns: true })
await client.helpers.bulk({
  datasource: csv,
  onDocument(doc) {
    return { index: { _index: "my_index" } }
  },
  // send a bulk request for every 9.5MB
  flushBytes: 9500000
})
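To get a feel for what splitting by size means, the following is a rough sketch of the idea. It is not the helper's actual implementation: `batchByBytes` is a hypothetical function, and the 150-byte limit is just a small number chosen so the split is visible.

```javascript
// Illustrative only: this is not the helper's actual implementation.
// Groups documents into batches whose serialized size stays at or under
// flushBytes (a single document larger than the limit gets its own batch).
function* batchByBytes(docs, indexName, flushBytes) {
  let batch = []
  let batchBytes = 0
  for (const doc of docs) {
    const action = JSON.stringify({ index: { _index: indexName } })
    const source = JSON.stringify(doc)
    // +2 for the newline that follows each of the two lines.
    const size = Buffer.byteLength(action) + Buffer.byteLength(source) + 2
    // Flush before adding a document that would push the batch over the limit.
    if (batch.length > 0 && batchBytes + size > flushBytes) {
      yield batch
      batch = []
      batchBytes = 0
    }
    batch.push(action, source)
    batchBytes += size
  }
  // Flush whatever is left once the input is exhausted.
  if (batch.length > 0) yield batch
}

const docs = Array.from({ length: 5 }, (_, i) => ({ id: i, text: 'x'.repeat(20) }))
const batches = [...batchByBytes(docs, 'my_index', 150)]
// Each batch holds two lines per document, so halve the length.
console.log(batches.map(b => b.length / 2))
```

The key design point is that the size check happens before a document is added, so a request goes out just before it would cross the limit rather than just after.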

Or, if your CSV file is too large to fit in memory at once, the helper can take a stream as its datasource instead of an array:

import { createReadStream } from 'node:fs'
import { parse } from 'csv-parse'

const parser = parse({ columns: true })
await client.helpers.bulk({
  datasource: createReadStream('data.csv').pipe(parser),
  onDocument(doc) {
    return { index: { _index: "my_index" } }
  }
})

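This buffers rows from the CSV file into memory, parses them into JSON objects, and lets the helper flush them out as one or more HTTP requests. Not only is this solution memory-efficient, it is also just as easy to read as the version that loads the entire file into memory!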
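As a rough illustration of why streaming keeps memory use bounded, here is a simplified sketch of reading rows from chunks as they arrive. It is an assumption-laden stand-in for what csv-parse does, not its implementation: `csvRows`, `fakeChunks`, and `collectRows` are hypothetical names, and the parsing is deliberately naive (string chunks only, no quoted fields, `\n` line endings only).

```javascript
// Illustrative only: a naive row reader that splits on commas and newlines.
// It does not handle quoted fields or \r\n line endings; csv-parse does.
async function* csvRows(chunks) {
  let header = null
  let pending = ''
  const toRow = line => {
    const values = line.split(',')
    return Object.fromEntries(header.map((name, i) => [name, values[i]]))
  }
  for await (const chunk of chunks) {
    pending += chunk
    const lines = pending.split('\n')
    // The last piece may be an incomplete line; keep it for the next chunk.
    pending = lines.pop()
    for (const line of lines) {
      if (line === '') continue
      if (header === null) header = line.split(',')
      else yield toRow(line)
    }
  }
  // Emit a final line that was not newline-terminated.
  if (pending !== '' && header !== null) yield toRow(pending)
}

// Simulated stream: chunk boundaries fall in the middle of a line.
async function* fakeChunks() {
  yield 'name,ci'
  yield 'ty\nAda,Lon'
  yield 'don\nLinus,Helsinki'
}

async function collectRows() {
  const rows = []
  for await (const row of csvRows(fakeChunks())) rows.push(row)
  return rows
}

collectRows().then(rows => console.log(rows))
```

Only the current chunk and one partial line are held at any time, no matter how large the file is. For real data, csv-parse remains the better choice because it handles quoting and escaping correctly.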

Original post: https://discuss.elastic.co/t/dec-9th-2025-en-use-the-node-js-elasticsearch-client-to-index-large-csv-files/382901