Building a Database from Scratch (2): HashIndex + IndexManager

A Bloom filter speeds up the question "is this key in this SSTable?", but it cannot help when you look up documents by field value (for example email="alice@x.com").

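This part covers three things:

HashIndex: how to build a mapping from field value → set of document IDs
IndexManager: how to maintain indexes automatically on insert/update/delete
How indexes are persisted to disk (JSON format)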

HashIndex data structure

HashIndex { fieldName = "email" }

index (ConcurrentHashMap):
┌─────────────────────┬───────────────────┐
│  "alice@x.com"      →  {"a1", "c3"}     │
│  "bob@x.com"        →  {"b2"}           │
│  "__null__"         →  {"d4"}           │  ← null values all map to "__null__"
└─────────────────────┴───────────────────┘

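In essence, a HashIndex maps a field value to a set of document IDs. One field value can map to several documents (an email is not necessarily unique).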

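import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class HashIndex {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    private final String collectionName;
    private final String fieldName;
    private final File indexFile;
    // field value -> IDs of the documents that hold that value
    private final ConcurrentHashMap<String, Set<String>> index = new ConcurrentHashMap<>();

    public HashIndex(String collectionName, String fieldName, File indexDir) throws IOException {
        this.collectionName = collectionName;
        this.fieldName = fieldName;
        this.indexFile = new File(indexDir, collectionName + "_" + fieldName + ".idx");
        indexDir.mkdirs();
        load(); // restore a previously saved index, if one exists
    }

    // Reads the JSON file back into concurrent sets; a missing file means an empty index.
    private void load() throws IOException {
        if (!indexFile.exists()) return;
        Map<String, List<String>> raw = MAPPER.readValue(indexFile,
                new TypeReference<Map<String, List<String>>>() {});
        for (Map.Entry<String, List<String>> e : raw.entrySet()) {
            Set<String> set = Collections.newSetFromMap(new ConcurrentHashMap<>());
            set.addAll(e.getValue());
            index.put(e.getKey(), set);
        }
    }
}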
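The class above shows only construction and loading, while the rest of the article calls `add`, `remove`, `removeAll` and `lookup`. A minimal in-memory sketch of those operations might look like the following. The class name `HashIndexSketch`, the `keyOf` helper and `distinctValues` are hypothetical names for illustration, and the argument order simply follows the calls shown later in the article; the author's actual implementation may differ.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory core of a hash index; persistence is left out.
class HashIndexSketch {
    private static final String NULL_KEY = "__null__";

    // field value -> IDs of the documents that hold that value
    private final ConcurrentHashMap<String, Set<String>> index = new ConcurrentHashMap<>();

    // Keys are strings, so null gets a sentinel and other values use toString().
    private static String keyOf(Object value) {
        return value == null ? NULL_KEY : value.toString();
    }

    void add(Object value, String docId) {
        // compute() runs atomically per key, so a concurrent remove cannot drop the set mid-add.
        index.compute(keyOf(value), (k, ids) -> {
            if (ids == null) ids = ConcurrentHashMap.newKeySet();
            ids.add(docId);
            return ids;
        });
    }

    void remove(String docId, Object value) {
        // Returning null from computeIfPresent deletes the entry, so empty sets never linger.
        index.computeIfPresent(keyOf(value), (k, ids) -> {
            ids.remove(docId);
            return ids.isEmpty() ? null : ids;
        });
    }

    void removeAll(String docId) {
        // Scans every bucket because the document's field value is not known here.
        for (String key : index.keySet()) {
            index.computeIfPresent(key, (k, ids) -> {
                ids.remove(docId);
                return ids.isEmpty() ? null : ids;
            });
        }
    }

    Set<String> lookup(Object value) {
        Set<String> ids = index.get(keyOf(value));
        if (ids == null) return Collections.emptySet();
        return Collections.unmodifiableSet(new HashSet<>(ids)); // defensive copy
    }

    int distinctValues() {
        return index.size();
    }
}
```

Doing the empty-set cleanup inside `computeIfPresent` keeps removal and cleanup in one atomic step per key, which avoids a separate sweep for empty buckets.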

Index creation (inverted index)

Client: POST /api/collections/users/index  {"field": "email"}
  │
  ▼
Collection.createIndex("email")
  │
  ▼
IndexManager.createIndex("email", this)
  │
  ├─ new HashIndex("users", "email", indexDir)
  │     │
  │     ├─ create directory: data/users/_indexes/
  │     ├─ file path: data/users/_indexes/users_email.idx
  │     └─ load() → file does not exist, so the index starts empty
  │
  ├─ indexes.put("email", hashIndex)  ← register with the manager
  └─ idx.save()  → write the empty index to disk

// IndexManager.java
public void createIndex(String fieldName, Collection collection) throws IOException {
    if (indexes.containsKey(fieldName)) return; // index already exists, nothing to do

    // Note: `collection` is not used here, so existing documents are not backfilled.
    HashIndex idx = new HashIndex(collectionName, fieldName, indexDir);
    indexes.put(fieldName, idx);
    idx.save(); // persist the (still empty) index so it is found again at startup
}

// HashIndex.java
public void save() throws IOException {
    // Copy the sets into plain lists, matching the Map<String, List<String>> shape that load() reads.
    Map<String, List<String>> serializable = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : index.entrySet()) {
        serializable.put(e.getKey(), new ArrayList<>(e.getValue()));
    }
    MAPPER.writeValue(indexFile, serializable);
}

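This article only implements incremental indexing from the moment an index is created; data that already exists is not indexed automatically. Backfilling it would require a full scan of every document in the collection, and the build would also have to deal with writes that arrive while the scan is still running.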
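For reference, the full-scan half of a backfill could be sketched as below. This is not part of the article's implementation. It assumes a hypothetical in-memory view of the collection (`docsById`, document ID → fields), assumes that a missing field is treated like null, and ignores the concurrent-write problem entirely.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical backfill helper: scans every document once and groups IDs by field value.
class IndexBackfillSketch {
    static Map<String, Set<String>> build(Map<String, Map<String, Object>> docsById,
                                          String fieldName) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, Map<String, Object>> e : docsById.entrySet()) {
            Object value = e.getValue().get(fieldName); // a missing field reads as null (assumption)
            String key = value == null ? "__null__" : value.toString();
            index.computeIfAbsent(key, k -> new HashSet<>()).add(e.getKey());
        }
        return index;
    }
}
```

A real build would additionally need to block writers during the scan or replay the changes made while it ran, which is the part the article leaves open.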

Index lifecycle: automatic maintenance

Every write (insert, update, delete) updates the indexes automatically:

Collection.insert(fields)
  │
  ├─ engine.put(key(id), value)          ← data is written to the LSM tree
  └─ indexManager.onInsert(doc)          ← indexes are updated
       │
       for (each existing index):
         read the value of the matching field from doc
         HashIndex.add(value, docId)

Document 1: { _id: "a1", email: "alice@x.com", age: 25 }
Document 2: { _id: "b2", email: "bob@x.com",   age: 30 }
Document 3: { _id: "c3", email: "alice@x.com", age: 28 }

Existing indexes: email, age

Insert document 1:
  email index .add("alice@x.com", "a1")   → {"alice@x.com" → {"a1"}}
  age   index .add(25,           "a1")     → {25 → {"a1"}}

Insert document 2:
  email index .add("bob@x.com",   "b2")   → {"alice@x.com" → {"a1"}, "bob@x.com" → {"b2"}}
  age   index .add(30,            "b2")    → {25 → {"a1"}, 30 → {"b2"}}

Insert document 3:
  email index .add("alice@x.com", "c3")   → {"alice@x.com" → {"a1","c3"}, ...}
  age   index .add(28,            "c3")    → {25 → {"a1"}, 28 → {"c3"}, 30 → {"b2"}}

// IndexManager.onUpdate(oldDoc, newDoc)
for (each index):
  read the old value and the new value
  if (they differ):
    remove(docId, oldValue)     ← remove the ID from the old value's set
    add(newValue, docId)        ← add the ID to the new value's set

Update document 1: age changes from 25 to 26

  age index .remove("a1", 25)    → {25 → {},          28 → {"c3"}, 30 → {"b2"}}
  age index .add(26, "a1")       → {25 → {}, 26 → {"a1"}, ...}

  Then clean up the empty {25 → {}}:
    if (ids.isEmpty()) index.remove(key)

// IndexManager.onDelete(doc)
for (each index):
  HashIndex.removeAll(docId)

Delete document 3 (c3):

  email index .removeAll("c3")    → {"alice@x.com" → {"a1"}, "bob@x.com" → {"b2"}}
  age   index .removeAll("c3")    → {26 → {"a1"}, 30 → {"b2"}}   ← the emptied 28 entry is cleaned up
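The three hooks can be put together in one small sketch that replays the worked example above. Everything here is a hypothetical, single-threaded stand-in (`IndexHooksSketch`, `snapshot`, plain `HashMap`s instead of `HashIndex` objects). Note one deliberate variation: `onDelete` removes by the document's known field value instead of scanning every bucket as `removeAll(docId)` does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical single-threaded stand-in for IndexManager: field -> (value -> doc IDs).
class IndexHooksSketch {
    private final Map<String, Map<String, Set<String>>> indexes = new HashMap<>();

    void createIndex(String field) {
        indexes.putIfAbsent(field, new HashMap<>());
    }

    private static String keyOf(Object v) {
        return v == null ? "__null__" : v.toString();
    }

    private static void add(Map<String, Set<String>> idx, Object value, String id) {
        idx.computeIfAbsent(keyOf(value), k -> new TreeSet<>()).add(id);
    }

    private static void remove(Map<String, Set<String>> idx, Object value, String id) {
        Set<String> ids = idx.get(keyOf(value));
        if (ids == null) return;
        ids.remove(id);
        if (ids.isEmpty()) idx.remove(keyOf(value)); // drop empty buckets right away
    }

    void onInsert(String id, Map<String, Object> doc) {
        indexes.forEach((field, idx) -> add(idx, doc.get(field), id));
    }

    void onUpdate(String id, Map<String, Object> oldDoc, Map<String, Object> newDoc) {
        indexes.forEach((field, idx) -> {
            Object before = oldDoc.get(field);
            Object after = newDoc.get(field);
            if (!Objects.equals(before, after)) { // untouched fields cost nothing
                remove(idx, before, id);
                add(idx, after, id);
            }
        });
    }

    void onDelete(String id, Map<String, Object> doc) {
        // Variation: remove by the old field value rather than scanning all buckets.
        indexes.forEach((field, idx) -> remove(idx, doc.get(field), id));
    }

    Map<String, Set<String>> snapshot(String field) {
        return indexes.get(field);
    }
}
```

Replaying the three inserts, the age update of a1 and the deletion of c3 should leave the age index with only the keys "26" and "30", and "alice@x.com" pointing at a1 alone.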

Index lifecycle: queries

Client: POST /api/collections/users/query
        {"filter": {"email": "alice@x.com"}}
  │
  ▼
DocumentService.query("users", filter, ...)
  │
  ├─ col.hasIndex("email")? → YES
  │
  ├─ col.lookupByIndex("email", "alice@x.com")
  │     → IndexManager.lookup("email", "alice@x.com")
  │         → HashIndex.lookup("alice@x.com")
  │              │
  │              index.get("alice@x.com") → {"a1", "c3"}
  │
  │  candidateIds = {"a1", "c3"}      ← only 2 candidates instead of a full collection scan
  │  indexUsed = true
  │
  ▼
QueryEngine.execute(candidateIds, ...):
  read each document in {"a1", "c3"} → apply filter → projection → sort
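The choice between the indexed path and the full scan can be sketched as follows. `QueryPathSketch` and `findEquals` are hypothetical names, documents and indexes are modelled as plain maps, and only a single equality filter is handled; the article's `DocumentService` and `QueryEngine` do more (projection, sort).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical planner: take candidate IDs from the index when one exists, else scan everything.
class QueryPathSketch {
    static List<String> findEquals(Map<String, Map<String, Object>> docsById,
                                   Map<String, Map<String, Set<String>>> indexes,
                                   String field, Object wanted) {
        String key = wanted == null ? "__null__" : wanted.toString();
        Map<String, Set<String>> idx = indexes.get(field);
        Iterable<String> candidates = (idx != null)
                ? idx.getOrDefault(key, Collections.<String>emptySet())
                : docsById.keySet(); // no index: every document is a candidate
        List<String> hits = new ArrayList<>();
        for (String id : candidates) {
            Map<String, Object> doc = docsById.get(id);
            // Re-apply the filter: the index only narrows the candidates.
            if (doc != null && Objects.equals(doc.get(field), wanted)) {
                hits.add(id);
            }
        }
        Collections.sort(hits); // stable output order for the example
        return hits;
    }
}
```

Re-checking the filter on each candidate keeps the result correct even if an index entry is stale, at the cost of one comparison per candidate.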

Index persistence

In memory:                                       On disk:
HashIndex                                        data/users/_indexes/users_email.idx
┌──────────────────────┐                        ┌──────────────────────────────┐
│ "alice@x.com"→{a1,c3}│  save()                │ {"alice@x.com":["a1","c3"], │
│ "bob@x.com"  →{b2}   │ ──────► JSON ──────►   │  "bob@x.com":["b2"]}        │
└──────────────────────┘                        └──────────────────────────────┘

                                                load() ← read back at startup

Persistence is triggered at these points:

Collection.flush()  → indexManager.saveAll()
Collection.close()  → flush() → saveAll()
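Because `save()` writes straight into the `.idx` file, a crash in the middle of a flush could leave a half-written file. One common way to reduce that risk, which the article does not implement, is to write a temporary sibling file and rename it into place. The sketch below uses only `java.nio.file`; `AtomicFileWriteSketch` and `writeAtomically` are hypothetical names, and the content is passed in as an already serialized string.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: write to a sibling temp file first, then swap it into place.
class AtomicFileWriteSketch {
    static void writeAtomically(Path target, String content) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        try {
            // Whether an existing target is replaced by ATOMIC_MOVE is implementation specific;
            // on POSIX file systems this maps to rename(), which replaces it.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback for file systems without atomic rename; this path is not crash-safe.
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

This sketch does not force the data to disk before the rename, so it protects against partial files but not against every power-loss scenario.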

IndexManager: loading at startup

IndexManager("users", "data/users/_indexes")

  The data/users/_indexes/ directory contains:
    users_email.idx  → parsing the file name gives fieldName="email"
    users_age.idx    → parsing the file name gives fieldName="age"

  For each .idx file:
    new HashIndex("users", fieldName, indexDir)
      → load() reads every mapping back from JSON
      → indexes.put(fieldName, hashIndex)

public IndexManager(String collectionName, File indexDir) throws IOException {
    this.collectionName = collectionName;
    this.indexDir = indexDir;
    indexDir.mkdirs();

    // Load existing indexes
    File[] idxFiles = indexDir.listFiles((d, n) -> n.startsWith(collectionName + "_") && n.endsWith(".idx"));
    if (idxFiles != null) {
        for (File f : idxFiles) {
            String name = f.getName();
            String fieldName = name.substring(collectionName.length() + 1, name.length() - 4);
            indexes.put(fieldName, new HashIndex(collectionName, fieldName, indexDir));
        }
    }
}
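The file-name parsing in the constructor can be isolated into a small helper to make its edge cases visible. `IndexFileNameSketch` and `fieldNameOf` are hypothetical names; unlike the constructor, this sketch rejects a file such as `users_.idx` whose field name would be empty.

```java
// Hypothetical helper mirroring the substring arithmetic in the constructor above.
class IndexFileNameSketch {
    static String fieldNameOf(String collectionName, String fileName) {
        String prefix = collectionName + "_";
        String suffix = ".idx";
        if (!fileName.startsWith(prefix) || !fileName.endsWith(suffix)
                || fileName.length() <= prefix.length() + suffix.length()) {
            return null; // not an index file of this collection, or the field name is empty
        }
        // Everything between "<collection>_" and ".idx" is the field name, underscores included.
        return fileName.substring(prefix.length(), fileName.length() - suffix.length());
    }
}
```

Since only the collection prefix is stripped, a field name that itself contains underscores, such as `home_city`, is recovered intact.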

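At startup, the indexes recorded in the .idx files are read back into memory.

This also carries an out-of-memory (OOM) risk, because every index is loaded in full.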

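Production approach	Description
Do not load everything into memory	Store the index in a B-Tree / LSM tree and read from disk at query time (MongoDB uses a B-Tree)
LRU cache	Cache only hot index pages; cold data stays on disk
Prefix compression	Adjacent keys share a prefix, which reduces memory usage
Segmented loading	Split the index by key range and load segments on demand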
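As an illustration of the LRU row, a bounded cache can be built on `LinkedHashMap` in access order. This is a generic sketch, not something the article implements: `LruCacheSketch` is a hypothetical name, what a "page" contains is left open, and the class is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache for index pages, built on LinkedHashMap's access order.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCacheSketch(int capacity) {
        super(16, 0.75f, true); // true = iterate from least to most recently accessed
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry once over capacity
    }
}
```

With a capacity of 2, inserting pages p1 and p2, reading p1, and then inserting p3 evicts p2, because p1 was touched more recently.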

Summary

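HashIndex:     field value → set of document IDs    (inverted index for a single field)
IndexManager:  manages N HashIndex instances         (per-collection index registry)
Collection:    calls it automatically during CRUD    (keeps data and indexes in sync)

Indexed path:  field value → HashIndex → document IDs → read only those documents
No index:      scan every document in the collection

Cost:          one extra HashMap put per index on each write (barely noticeable)
               one extra JSON serialization per index on flush (the whole index file is rewritten)
Benefit:       equality queries go from an O(N) full scan to an O(1) average-case hash lookup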