Exploring Lance, a Next-Generation Vector Storage Format (19): Vector Indexes - Vector Search Query Optimization

Chapter 19: Vector Search Query Optimization

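🎯 Core Overview

Query optimization is what makes vector search both fast and accurate. By tuning search parameters, applying filters, and reranking candidates, you can trade recall against latency deliberately; on the illustrative workloads in this chapter the combined gain is roughly 5-10x.

📊 Three Dimensions of Query Optimization

1. Dynamic nprobes tuning: balancing recall and speed

What is nprobes?

In an IVF (inverted file) index, nprobes controls how many partitions are probed per query:

Total partitions: for example, 100
nprobes = 1: search only the single closest partition (fast, low recall)
nprobes = 10: search the 10 closest partitions (a balanced setting)
nprobes = 100: search every partition (highest recall, slowest)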

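Query vector q
  ↓ compute the distance from q to every centroid
  ↓ pick the nprobes centroids with the smallest distance
  ↓ search the vectors in those nprobes partitions
  ↓ return the top-K results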
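
The flow above can be tried end to end on a toy dataset. The following is a minimal pure-Python sketch, not Lance's implementation: the helper names (`l2`, `build_partitions`, `ivf_search`) are made up for illustration, and the centroids are simply sampled from the data rather than trained with k-means.

```python
import random

def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_partitions(vectors, centroids):
    """Assign every vector id to the partition of its nearest centroid."""
    partitions = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        partitions[nearest].append(vid)
    return partitions

def ivf_search(query, vectors, centroids, partitions, nprobes, k):
    """Probe the nprobes closest partitions; return the top-k (distance, id) pairs."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vid for c in order[:nprobes] for vid in partitions[c]]
    return sorted((l2(query, vectors[vid]), vid) for vid in candidates)[:k]

random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(2000)]
centroids = random.sample(vectors, 20)  # toy stand-in for trained centroids
partitions = build_partitions(vectors, centroids)
query = [0.5, 0.5]

# Probing all 20 partitions is equivalent to a brute-force scan
exact = ivf_search(query, vectors, centroids, partitions, nprobes=20, k=10)
for nprobes in (1, 5, 20):
    approx = ivf_search(query, vectors, centroids, partitions, nprobes, k=10)
    recall = len({v for _, v in approx} & {v for _, v in exact}) / len(exact)
    print(f"nprobes={nprobes:>2}  recall@10={recall:.2f}")
```

Because every vector lives in exactly one partition, raising nprobes can only add candidates, so recall never decreases as nprobes grows.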

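Performance impact (illustrative figures for 100 equally sized partitions; the 100% recall in the last row assumes exact distances, which an IVF-PQ index only reaches with reranking)

nprobes	Data scanned	Relative speed	Recall
1	1%	100x	60%
5	5%	20x	85%
10	10%	10x	95%
50	50%	2x	99%
100	100%	1x	100%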
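
The scanned-fraction and relative-speed columns of this table follow from simple arithmetic. The sketch below assumes equally sized partitions and a cost that is proportional to the number of vectors scanned; `scan_profile` is a hypothetical helper, and real speedups are usually smaller because centroid comparison and I/O add fixed overhead.

```python
def scan_profile(nprobes, num_partitions):
    """Return (fraction of data scanned, idealized speedup over probing everything)."""
    if not 1 <= nprobes <= num_partitions:
        raise ValueError("nprobes must be between 1 and num_partitions")
    fraction = nprobes / num_partitions
    return fraction, 1.0 / fraction

for nprobes in (1, 5, 10, 50, 100):
    fraction, speedup = scan_profile(nprobes, num_partitions=100)
    print(f"nprobes={nprobes:>3}  scanned={fraction:.0%}  speedup={speedup:.0f}x")
```

The recall column cannot be derived this way; it depends on how the data clusters and has to be measured.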

Rust implementation sketch

// Core parameters of an IVF search (simplified for illustration, not the actual Lance source)
pub struct IvfSearchParams {
    pub nprobes: usize,           // number of partitions to probe
    pub k: usize,                 // number of results to return (top-K)
    pub refine: bool,             // whether to rerank with exact distances
    pub metric_type: MetricType,  // L2, Cosine, or Dot
}

impl IvfIndex {
    // Search entry point
    pub async fn search(
        &self,
        query: &dyn Array,
        params: IvfSearchParams,
    ) -> Result<SearchResult> {
        // 1. Compute the distance from the query vector to every centroid
        let distances_to_centroids = self.compute_query_distances(query)?;

        // 2. Pick the nprobes closest centroids
        let mut indices: Vec<usize> = (0..self.num_partitions).collect();
        // total_cmp defines a total order, so a NaN distance cannot panic the sort
        indices.sort_by(|&a, &b| {
            distances_to_centroids[a].total_cmp(&distances_to_centroids[b])
        });
        // Clamp so that nprobes > num_partitions does not slice out of bounds
        let nprobes = params.nprobes.min(indices.len());
        let probe_indices = &indices[..nprobes];

        // 3. Search each selected partition
        let mut candidates = Vec::new();
        for &partition_id in probe_indices {
            let partition = self.load_partition(partition_id).await?;
            let partition_results = self.search_partition(
                partition,
                query,
                params.k,
            ).await?;
            candidates.extend(partition_results);
        }

        // 4. Merge the per-partition results and keep the global top-K
        candidates.sort_by(|a, b| a.distance.total_cmp(&b.distance));
        candidates.truncate(params.k);

        Ok(SearchResult {
            ids: candidates.iter().map(|c| c.id).collect(),
            distances: candidates.iter().map(|c| c.distance).collect(),
        })
    }
}

Python API usage

import lance
import numpy as np
import pyarrow as pa

DIM = 768
NUM_ROWS = 100_000

# Build a table whose vector column is a fixed-size list, the layout Lance indexes
vectors = np.random.randn(NUM_ROWS, DIM).astype(np.float32)
table = pa.table({
    "vector": pa.FixedSizeListArray.from_arrays(pa.array(vectors.ravel()), DIM),
    "id": np.arange(NUM_ROWS),
    "label": np.random.choice(["A", "B", "C"], NUM_ROWS),
})

ds = lance.write_dataset(table, "./data.lance", mode="overwrite")

# Create the IVF-PQ index
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    num_partitions=100,
    num_sub_vectors=8,
    num_bits=8,
)

# ========== Comparing nprobes settings ==========
query_vec = np.random.randn(DIM).astype(np.float32)


def search(nprobes: int, k: int = 10) -> pa.Table:
    """Top-k ANN query that probes `nprobes` IVF partitions."""
    return ds.to_table(
        nearest={"column": "vector", "q": query_vec, "k": k, "nprobes": nprobes}
    )


# Fast: nprobes=1 scans roughly 1% of the data
results_fast = search(nprobes=1)
print(f"Fast search (nprobes=1): {results_fast.num_rows} results")

# Balanced: nprobes=10 scans roughly 10% of the data
results_balanced = search(nprobes=10)
print(f"Balanced search (nprobes=10): {results_balanced.num_rows} results")

# Exhaustive: nprobes=100 probes all 100 partitions
# (distances are still PQ approximations unless reranking is enabled)
results_exhaustive = search(nprobes=100)
print(f"Exhaustive search (nprobes=100): {results_exhaustive.num_rows} results")

# ========== Dynamic nprobes ==========
# Heuristic starting point: grow nprobes logarithmically with the row count,
# then tune it against measured recall
num_rows = ds.count_rows()
optimal_nprobes = max(1, int(np.log10(num_rows)))

results_optimal = search(nprobes=optimal_nprobes)
print(f"Dynamic nprobes={optimal_nprobes}: {results_optimal.num_rows} results")
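
A logarithmic rule only gives a starting point. A more direct way to adjust nprobes is to benchmark a few settings and keep the cheapest one that meets the recall target. The sketch below assumes such a sweep has already been measured; `pick_nprobes` is a hypothetical helper, and the sample numbers simply reuse the illustrative figures from the nprobes table earlier in this section.

```python
def pick_nprobes(measurements, target_recall):
    """Return the smallest nprobes whose measured recall meets the target.

    measurements: iterable of (nprobes, recall) pairs from a benchmark sweep.
    Falls back to the largest measured nprobes if no setting reaches the target.
    """
    ordered = sorted(measurements)
    if not ordered:
        raise ValueError("at least one measurement is required")
    for nprobes, recall in ordered:
        if recall >= target_recall:
            return nprobes
    return ordered[-1][0]

# Illustrative sweep; replace with recall measured on your own data
sweep = [(1, 0.60), (5, 0.85), (10, 0.95), (50, 0.99), (100, 1.00)]
for target in (0.80, 0.90, 0.99):
    print(f"target recall {target:.2f} -> nprobes={pick_nprobes(sweep, target)}")
```

Picking from measured points avoids assuming any particular shape for the recall curve, which in practice flattens out quickly as nprobes grows.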

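2. Prefilter: filter first, then search

How it works: apply the scalar filter before the vector search, so that the search only considers rows that match.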

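Original dataset: 1,000,000 rows
  ↓ prefilter: WHERE price < 100 AND category = 'electronics'
  ↓ after filtering: 100,000 matching rows
  ↓ vector search: only over those 100,000 rows
  ↓ effect: about 10x fewer distance computations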

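Performance comparison (illustrative, assuming latency scales with the number of distance computations)

Query mode	Rows searched	Distance computations	Relative latency
No filter	1,000,000	1,000,000	100 ms
Prefilter keeps 10%	100,000	100,000	10 ms
Prefilter keeps 1%	10,000	10,000	1 ms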
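
Besides reducing the number of distance computations, filtering before the search also guarantees that a full set of K matching rows comes back. The following toy comparison in plain Python illustrates both points; it uses a brute-force scan rather than an index, and the function names are made up for this example.

```python
import random

def l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prefilter_search(rows, query, predicate, k):
    """Filter first, then rank only the matching rows."""
    matching = [r for r in rows if predicate(r)]
    top_k = sorted(matching, key=lambda r: l2(r["vector"], query))[:k]
    return top_k, len(matching)  # results, distance computations

def postfilter_search(rows, query, predicate, k):
    """Rank every row, keep the top-k, then drop rows that fail the predicate."""
    top_k = sorted(rows, key=lambda r: l2(r["vector"], query))[:k]
    return [r for r in top_k if predicate(r)], len(rows)

random.seed(1)
rows = [
    {"id": i, "vector": [random.random(), random.random()],
     "price": random.randint(10, 999)}
    for i in range(5000)
]
query = [0.5, 0.5]

def is_cheap(row):
    return row["price"] < 100

pre, pre_cost = prefilter_search(rows, query, is_cheap, k=10)
post, post_cost = postfilter_search(rows, query, is_cheap, k=10)
print(f"prefilter:  {len(pre)} results, {pre_cost} distance computations")
print(f"postfilter: {len(post)} results, {post_cost} distance computations")
```

With a selective predicate the postfilter variant usually returns fewer than K rows, because most of the global top-K fail the filter.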

Rust implementation sketch

// Filtering in the scanner (simplified for illustration, not the actual Lance source)
pub struct ScannerBuilder {
    dataset: Arc<Dataset>,
    filter: Option<String>,       // predicate expression, e.g. "price < 100"
    prefilter_batch_size: usize,  // rows evaluated per batch while filtering
}

impl ScannerBuilder {
    // Register a prefilter predicate
    pub fn filter(mut self, filter_expr: &str) -> Self {
        self.filter = Some(filter_expr.to_string());
        self
    }
}

pub struct Scanner {
    dataset: Arc<Dataset>,
    filter_expr: Option<Expr>,  // parsed predicate
}

impl Scanner {
    // Vector search with an optional prefilter
    pub async fn search(
        &self,
        query: &dyn Array,
        k: usize,
    ) -> Result<Vec<SearchResult>> {
        // 1. Apply the filter to collect the matching row IDs
        let filtered_row_ids = if let Some(ref filter) = self.filter_expr {
            self.apply_filter(filter).await?
        } else {
            // No filter: every row is a candidate
            (0..self.dataset.count_rows() as u64).collect::<Vec<_>>()
        };

        // 2. Search only within the filtered rows
        let results = self.search_filtered(
            query,
            &filtered_row_ids,
            k,
        ).await?;

        Ok(results)
    }

    async fn apply_filter(&self, expr: &Expr) -> Result<Vec<u64>> {
        let mut filtered = Vec::new();

        // Scan every fragment
        for fragment in self.dataset.fragments() {
            // Read only the columns the predicate refers to
            let batch = fragment.scan()
                .columns(expr.get_input_columns())
                .execute()
                .await?;

            // Evaluate the predicate into a boolean mask
            let mask = expr.evaluate(&batch)?;

            // Collect the row IDs where the mask is true
            for (idx, &keep) in mask.iter().enumerate() {
                if keep {
                    filtered.push(fragment.get_row_id(idx)?);
                }
            }
        }

        Ok(filtered)
    }
}

Python API usage

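import lance
import numpy as np
import pyarrow as pa

DIM = 768
NUM_ROWS = 100_000

# Build a table with several scalar columns next to the vector column
vectors = np.random.randn(NUM_ROWS, DIM).astype(np.float32)
table = pa.table({
    "vector": pa.FixedSizeListArray.from_arrays(pa.array(vectors.ravel()), DIM),
    "id": np.arange(NUM_ROWS),
    "price": np.random.randint(10, 1000, NUM_ROWS),
    "category": np.random.choice(["electronics", "clothing", "food"], NUM_ROWS),
    "rating": np.random.uniform(1.0, 5.0, NUM_ROWS),
})

ds = lance.write_dataset(table, "./products.lance", mode="overwrite")

# Create the index (num_sub_vectors added to match the other examples)
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    num_partitions=100,
    num_sub_vectors=8,
)

query_vec = np.random.randn(DIM).astype(np.float32)


def search(filter_expr=None, nprobes=None, k=10) -> pa.Table:
    """Top-k vector search; `filter_expr` is applied before the ANN search."""
    nearest = {"column": "vector", "q": query_vec, "k": k}
    if nprobes is not None:
        nearest["nprobes"] = nprobes
    return ds.to_table(
        nearest=nearest,
        filter=filter_expr,
        prefilter=filter_expr is not None,
    )


def report(label, filter_expr, results):
    """Print how many rows were eligible and how many results came back."""
    eligible = ds.count_rows(filter=filter_expr) if filter_expr else ds.count_rows()
    print(f"{label}: {eligible} eligible rows, {results.num_rows} results")


# ========== No filter ==========
# Search across all 100,000 rows
report("No filter", None, search())

# ========== Single-condition prefilter ==========
# Only products with price < 500 (about half of the rows)
expr = "price < 500"
report("Prefilter (price < 500)", expr, search(expr))

# ========== Multi-condition prefilter ==========
# Electronics rated above 4.0 (about 1/3 * 1/4, roughly 8% of the rows)
expr = "category = 'electronics' AND rating > 4.0"
report("Multi-condition prefilter", expr, search(expr))

# ========== Prefilter combined with nprobes ==========
# price < 200 keeps about 19% of the rows; nprobes=5 limits the partitions probed
expr = "price < 200"
report("Prefilter + nprobes=5", expr, search(expr, nprobes=5))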

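3. Reranking: recompute exact distances for the top candidates

How it works: use quantized distances to collect a candidate set quickly, then recompute exact distances for those candidates and reorder them, which improves recall.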

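IVF-PQ search (quantized distances)
  ↓ return the top 1,000 candidates (ranked by PQ-code distance)
  ↓ load the original vectors (read from storage)
  ↓ compute exact distances: ||q - x||^2
  ↓ rerank and return the top 10 (final result)

Effect (illustrative): recall improves by about 2-3 percentage points for roughly 10-20% extra latency, most of it spent reading the original vectors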
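
The two-stage idea can be demonstrated without any index at all. In the sketch below, snapping coordinates to a coarse grid stands in for PQ codes (an assumption made purely for illustration), and `search_with_rerank` over-fetches `k * refine_factor` candidates by approximate distance before reordering them by exact distance.

```python
import random

def l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vec, step=0.25):
    """Snap each coordinate to a coarse grid (a crude stand-in for PQ codes)."""
    return [round(x / step) * step for x in vec]

def search_with_rerank(query, vectors, codes, k, refine_factor):
    """Stage 1: approximate ranking on codes. Stage 2: exact rerank of candidates."""
    n_candidates = min(len(vectors), k * refine_factor)
    coarse = sorted(range(len(vectors)), key=lambda i: l2(query, codes[i]))
    candidates = coarse[:n_candidates]
    # Exact distances are computed only for the over-fetched candidates
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

random.seed(2)
vectors = [[random.random() for _ in range(4)] for _ in range(3000)]
codes = [quantize(v) for v in vectors]
query = [random.random() for _ in range(4)]
k = 10
exact = set(sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))[:k])

for refine_factor in (1, 4, 16):
    ids = search_with_rerank(query, vectors, codes, k, refine_factor)
    recall = len(set(ids) & exact) / k
    print(f"refine_factor={refine_factor:>2}  recall@{k}={recall:.2f}")
```

A larger refine factor widens the candidate set, so recall can only stay the same or improve, while the number of original vectors that must be read grows linearly.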

Reranking flow

// Simplified for illustration, not the actual Lance source
pub struct RerankingStrategy {
    pub refine_factor: usize,        // over-fetch multiplier: 4 means fetch K*4, then rerank
    pub use_original_vectors: bool,  // whether to rerank with the original vectors
}

impl IvfIndex {
    pub async fn search_with_reranking(
        &self,
        query: &dyn Array,
        k: usize,
        rerank_params: RerankingStrategy,
    ) -> Result<SearchResult> {
        // 1. Stage one: fast search returning K * refine_factor candidates
        let candidate_count = k * rerank_params.refine_factor;
        let candidates = self.quick_search(
            query,
            candidate_count,  // over-fetch candidates
        ).await?;

        // 2. Stage two: load the original vectors
        let mut candidates_with_vectors = Vec::new();
        for candidate in candidates {
            // Read the original vector from storage
            let original_vector = self.load_original_vector(candidate.id).await?;
            candidates_with_vectors.push((candidate, original_vector));
        }

        // 3. Stage three: compute exact distances
        let mut refined_results = Vec::new();
        // as_primitive panics on a type mismatch rather than returning a Result
        let query_array = query.as_primitive::<Float32Type>();

        for (candidate, original_vector) in candidates_with_vectors {
            let precise_distance = self.compute_precise_distance(
                query_array,
                &original_vector,
            );
            refined_results.push((candidate.id, precise_distance));
        }

        // 4. Stage four: reorder by exact distance
        refined_results.sort_by(|a, b| a.1.total_cmp(&b.1));

        // 5. Return the final top-K (truncate is safe when fewer than K candidates exist)
        refined_results.truncate(k);
        Ok(SearchResult {
            ids: refined_results.iter().map(|(id, _)| *id).collect(),
            distances: refined_results.iter().map(|(_, dist)| *dist).collect(),
        })
    }
}

Python API usage

import time

import lance
import numpy as np
import pyarrow as pa

DIM = 768
NUM_ROWS = 100_000

vectors = np.random.randn(NUM_ROWS, DIM).astype(np.float32)
table = pa.table({
    "vector": pa.FixedSizeListArray.from_arrays(pa.array(vectors.ravel()), DIM),
    "id": np.arange(NUM_ROWS),
})

ds = lance.write_dataset(table, "./data.lance", mode="overwrite")

# Create the IVF-PQ index
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    num_partitions=100,
    num_sub_vectors=8,
)

query_vec = np.random.randn(DIM).astype(np.float32)


def timed_search(**nearest_opts):
    """Run one top-10 query; return (set of result ids, latency in ms)."""
    nearest = {"column": "vector", "q": query_vec, "k": 10, **nearest_opts}
    start = time.perf_counter()
    result = ds.to_table(columns=["id"], nearest=nearest)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return set(result["id"].to_pylist()), elapsed_ms


# ========== Without reranking ==========
# Results are ranked by PQ (quantized) distance
fast_ids, time_fast = timed_search(nprobes=10)
print(f"No rerank: {time_fast:.1f} ms")

# ========== With reranking ==========
# Fetch 10 * 4 = 40 candidates, then reorder them by exact distance
refined_ids, time_refined = timed_search(nprobes=10, refine_factor=4)
print(f"Rerank (4x): {time_refined:.1f} ms")

# ========== Ground truth ==========
# Probing every partition would still rank by PQ distance, so the exact
# baseline bypasses the index and brute-forces the whole table
exact_ids, _ = timed_search(use_index=False)

# Compute recall against the exact top-10
recall_fast = len(fast_ids & exact_ids) / len(exact_ids)
recall_refined = len(refined_ids & exact_ids) / len(exact_ids)

print("\nRecall comparison:")
print(f"  No rerank: recall = {recall_fast:.1%}")
print(f"  Rerank (4x): recall = {recall_refined:.1%}")
print("  Exact search: recall = 100%")
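
Recall measured on a single query is noisy, so it helps to average over a batch of queries before comparing settings. The helper below is a small sketch of that calculation; `recall_at_k` and `mean_recall` are illustrative names, not part of any Lance API.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(list(exact_ids)[:k])
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    found = set(list(approx_ids)[:k])
    return len(found & truth) / len(truth)

def mean_recall(result_pairs, k):
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    pairs = list(result_pairs)
    if not pairs:
        raise ValueError("at least one query is required")
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)

# Two hypothetical queries: one perfect, one missing a single neighbour
queries = [
    ([1, 2, 3, 4], [1, 2, 3, 4]),
    ([5, 6, 7, 9], [5, 6, 7, 8]),
]
print(f"mean recall@4 = {mean_recall(queries, k=4):.3f}")
```

Averaging over a few hundred representative queries usually gives a stable enough estimate to compare two parameter settings.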

🔧 Combined Optimization Strategy

Adaptive parameter tuning framework

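import time
from typing import Dict, Tuple, Union

import lance
import numpy as np
import pyarrow as pa

# Illustrative calibration points taken from the nprobes table earlier in this
# chapter (100 partitions); replace them with recall measured on your own data
CALIBRATION_NPROBES = [1, 5, 10, 50, 100]
CALIBRATION_RECALL = [0.60, 0.85, 0.95, 0.99, 1.00]


class AdaptiveSearchOptimizer:
    """Pick search parameters from dataset characteristics and targets."""

    def __init__(self, dataset: lance.LanceDataset, num_partitions: int = 100):
        self.dataset = dataset
        self.num_rows = dataset.count_rows()
        self.num_partitions = num_partitions  # must match the IVF index

    def recommend_parameters(
        self,
        required_recall: float,  # target recall, e.g. 0.95
        max_latency_ms: float,   # latency budget, e.g. 50 ms
    ) -> Dict[str, Union[int, float]]:
        """Recommend a starting parameter set."""

        # 1. nprobes: recall saturates as nprobes grows, so interpolate over
        #    the calibration points instead of assuming a straight line
        estimated = np.interp(required_recall, CALIBRATION_RECALL, CALIBRATION_NPROBES)
        nprobes = max(1, min(self.num_partitions, int(np.ceil(estimated))))

        # 2. Refine factor: reranking typically adds 2-3% recall
        #    at the cost of 10-20% extra latency
        refine_factor = 2 if required_recall > 0.98 else 1

        # 3. Prefilter ratio from the latency budget,
        #    assuming about 1,000 rows can be searched per ms
        approx_search_rows = max_latency_ms * 1000
        prefilter_ratio = min(1.0, max(0.01, approx_search_rows / self.num_rows))

        return {
            "nprobes": nprobes,
            "refine_factor": refine_factor,
            "prefilter_ratio": prefilter_ratio,
        }

    def _search_ids(self, query: np.ndarray, **nearest_opts) -> set:
        """Run one top-10 query and return the set of result ids."""
        nearest = {"column": "vector", "q": query, "k": 10, **nearest_opts}
        result = self.dataset.to_table(columns=["id"], nearest=nearest)
        return set(result["id"].to_pylist())

    def benchmark_search(
        self,
        query: np.ndarray,
        nprobes: int,
        refine_factor: int,
    ) -> Tuple[float, float, float]:
        """Benchmark one setting: returns (latency in s, recall, throughput in qps)."""
        opts = {"nprobes": nprobes}
        if refine_factor > 1:
            opts["refine_factor"] = refine_factor  # 1 means no over-fetch

        # Average over 10 runs
        latencies = []
        current_results = set()
        for _ in range(10):
            start = time.perf_counter()
            current_results = self._search_ids(query, **opts)
            latencies.append(time.perf_counter() - start)

        avg_latency = float(np.mean(latencies))
        throughput = 1.0 / avg_latency  # queries per second

        # Recall against an exact brute-force baseline that bypasses the index
        exact_results = self._search_ids(query, use_index=False)
        recall = (
            len(exact_results & current_results) / len(exact_results)
            if exact_results else 1.0
        )

        return avg_latency, recall, throughput


# Usage example (about 3 GB of vectors; shrink NUM_ROWS to try it on a laptop)
DIM = 768
NUM_ROWS = 1_000_000
rng = np.random.default_rng()
vectors = rng.standard_normal((NUM_ROWS, DIM), dtype=np.float32)
table = pa.table({
    "vector": pa.FixedSizeListArray.from_arrays(pa.array(vectors.ravel()), DIM),
    "id": np.arange(NUM_ROWS),
})
ds = lance.write_dataset(table, "./large_dataset.lance", mode="overwrite")
# The optimizer assumes an IVF index with 100 partitions already exists
ds.create_index("vector", index_type="IVF_PQ", num_partitions=100, num_sub_vectors=8)

optimizer = AdaptiveSearchOptimizer(ds, num_partitions=100)

# Scenario 1: high recall
params_high_precision = optimizer.recommend_parameters(
    required_recall=0.99,
    max_latency_ms=100,
)
print(f"High-recall parameters: {params_high_precision}")

# Scenario 2: real-time queries
params_realtime = optimizer.recommend_parameters(
    required_recall=0.90,
    max_latency_ms=10,
)
print(f"Real-time parameters: {params_realtime}")

# Benchmark
query = rng.standard_normal(DIM, dtype=np.float32)
latency, recall, throughput = optimizer.benchmark_search(
    query,
    nprobes=params_realtime["nprobes"],
    refine_factor=params_realtime["refine_factor"],
)
print(f"Benchmark: latency={latency*1000:.1f} ms, recall={recall:.1%}, throughput={throughput:.0f} qps")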

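📈 Performance Optimization Summary

Optimization effects at a glance (illustrative)

Strategy	Speedup	Recall impact	Implementation effort
nprobes tuning	5-100x	< 10% loss up to about 10x, more beyond	⭐
Prefilter	2-10x	none	⭐⭐
Reranking	0.9x (slightly slower)	2-3% gain	⭐⭐
Combined	5-10x	< 5% loss	⭐⭐⭐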

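Best practices

Small-scale exact search: no tuning; scan the whole table directly
Large-scale balanced queries: start from nprobes ≈ log10(N), e.g. nprobes = 6 for N = 1,000,000, then tune against measured recall
High-recall requirements: enable reranking (refine_factor = 2-4)
Real-time, low latency: aggressive prefiltering plus nprobes = 1
Production: use an adaptive optimizer that adjusts parameters from live performance measurements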
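
These rules of thumb can be written down as a small lookup so that every service starts from the same defaults. The sketch below is hypothetical: the scenario names and dictionary keys are labels chosen for this example, and the nprobes rule assumes the log10(N) heuristic used in the dynamic-tuning snippet earlier in this chapter.

```python
import math

def starting_params(num_rows, scenario):
    """Map a workload scenario to a starting parameter set (to be tuned later)."""
    if num_rows < 1:
        raise ValueError("num_rows must be positive")
    log_nprobes = max(1, round(math.log10(num_rows)))
    if scenario == "exact":
        return {"use_index": False}              # small data: brute-force scan
    if scenario == "balanced":
        return {"nprobes": log_nprobes}
    if scenario == "high_recall":
        return {"nprobes": log_nprobes, "refine_factor": 4}
    if scenario == "realtime":
        return {"nprobes": 1, "prefilter": True}  # pair with a selective filter
    raise ValueError(f"unknown scenario: {scenario!r}")

for name in ("exact", "balanced", "high_recall", "realtime"):
    print(name, starting_params(1_000_000, name))
```

The returned values are only starting points; recall and latency should still be measured on real queries before a setting goes to production.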

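📚 Summary

Query optimization for vector search comes down to balancing recall, speed, and cost. The three levers covered in this chapter (dynamic nprobes tuning, prefiltering, and reranking) can together deliver roughly 5-10x faster queries while keeping recall loss small, under 5% in the illustrative figures above. The key is to choose the parameter combination that fits the actual workload.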