Exploring the Next-Generation Vector Storage Format Lance (Part 26): Data Cleanup and Compaction

Chapter 26: Data Cleanup and Compaction

🎯 Core Overview

As data is continually updated and deleted, Lance accumulates stale versions and fragmented files. Cleanup and compaction reclaim storage and improve query performance by deleting expired versions and merging small files; regular cleanup can save 30-50% of storage space.


🗑️ Expired Version Cleanup

Version Lifecycle

Version 1 (create) → Version 2 (update) → Version 3 (delete) → Version 4 (update) → ...
  ↓                    ↓                    ↓                    ↓
  active               active               active               active
  (live queries)       (live queries)       (live queries)       (latest, live queries)
  
  ...time passes...
  
Version 1 → Version 2 → Version 3 (deletable) ← Version 4 (active) ← Version 5 (latest, active)
  expired     expired     expired                active              active
  deletable   deletable   deletable

Cleaner Implementation

rust
pub struct VersionCleaner {
    dataset_path: PathBuf,
    manifest_dir: PathBuf,
    retention_policy: RetentionPolicy,
}

pub enum RetentionPolicy {
    // Keep the last N versions
    KeepLastN(usize),
    
    // Keep versions newer than the given duration
    KeepRecent(Duration),
    
    // Custom policy: keep versions for which the predicate returns true
    Custom(Box<dyn Fn(&VersionInfo) -> bool + Send + Sync>),
}

impl VersionCleaner {
    pub async fn clean(&self) -> Result<CleanupReport> {
        // 1. List all versions
        let all_versions = self.list_all_versions().await?;
        
        if all_versions.is_empty() {
            return Ok(CleanupReport::default());
        }
        
        // 2. Determine which versions are candidates for deletion
        let (_keep_versions, delete_versions) = self.determine_versions(&all_versions)?;
        
        // 3. Never delete a version that an in-flight query still reads
        let active_versions = self.find_active_versions().await?;
        let safe_to_delete: Vec<u64> = delete_versions.iter()
            .copied()
            .filter(|v| !active_versions.contains(v))
            .collect();
        
        // 4. Delete the version files, tallying the space reclaimed
        let mut freed_space = 0u64;
        for version in &safe_to_delete {
            freed_space += self.delete_version(*version).await?;
        }
        
        Ok(CleanupReport {
            versions_deleted: safe_to_delete.len() as u32,
            space_freed: freed_space,
            timestamp: SystemTime::now(),
        })
    }
    
    fn determine_versions(&self, all_versions: &[VersionInfo]) -> Result<(Vec<u64>, Vec<u64>)> {
        let keep: Vec<u64> = match &self.retention_policy {
            RetentionPolicy::KeepLastN(n) => {
                // Keep the last N versions (all_versions is sorted ascending by version)
                all_versions.iter()
                    .rev()
                    .take(*n)
                    .map(|v| v.version)
                    .collect()
            }
            
            RetentionPolicy::KeepRecent(duration) => {
                // Keep versions created within the retention window
                let cutoff_time = SystemTime::now() - *duration;
                all_versions.iter()
                    .filter(|v| v.timestamp >= cutoff_time)
                    .map(|v| v.version)
                    .collect()
            }
            
            RetentionPolicy::Custom(predicate) => {
                all_versions.iter()
                    .filter(|v| predicate(v))
                    .map(|v| v.version)
                    .collect()
            }
        };
        
        let delete: Vec<u64> = all_versions.iter()
            .map(|v| v.version)
            .filter(|v| !keep.contains(v))
            .collect();
        
        Ok((keep, delete))
    }
    
    async fn find_active_versions(&self) -> Result<Vec<u64>> {
        // Determine which versions are pinned by in-flight queries,
        // e.g. by tracking open connections or a query log
        Ok(vec![])  // Simplified: assume no active readers
    }
    
    async fn delete_version(&self, version: u64) -> Result<u64> {
        // 1. Load the manifest for this version
        let manifest = self.load_manifest(version).await?;
        
        // 2. Collect the data files referenced by this version
        //    (a production implementation must not delete files that
        //    are still referenced by retained versions)
        let files_to_delete: Vec<_> = manifest.fragments.iter()
            .flat_map(|f| f.files.iter().map(|file| file.path.clone()))
            .collect();
        
        // 3. Delete them, tallying the space reclaimed
        let mut freed_space = 0u64;
        for file_path in files_to_delete {
            if let Ok(metadata) = tokio::fs::metadata(&file_path).await {
                freed_space += metadata.len();
                tokio::fs::remove_file(&file_path).await?;
            }
        }
        
        // 4. Delete the manifest file itself
        let manifest_file = self.manifest_dir.join(format!("v{}_manifest", version));
        if manifest_file.exists() {
            freed_space += tokio::fs::metadata(&manifest_file).await?.len();
            tokio::fs::remove_file(&manifest_file).await?;
        }
        
        Ok(freed_space)
    }
}

pub struct CleanupReport {
    pub versions_deleted: u32,
    pub space_freed: u64,
    pub timestamp: SystemTime,
}

impl Default for CleanupReport {
    // SystemTime does not implement Default, so derive(Default) would not compile
    fn default() -> Self {
        Self {
            versions_deleted: 0,
            space_freed: 0,
            timestamp: SystemTime::now(),
        }
    }
}
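One subtlety `delete_version` glosses over: data files can be shared across versions, so deleting every file an expired manifest references would corrupt retained versions. A hypothetical reference check in Python (illustrative names; `manifests` maps version number to its list of data-file paths):

```python
def files_safe_to_delete(manifests, expired):
    """A data file may be deleted only if no retained version references it."""
    # Every file referenced by at least one retained (non-expired) version
    retained_files = set()
    for version, files in manifests.items():
        if version not in expired:
            retained_files.update(files)
    # Files referenced by expired versions, minus the retained set
    candidates = set()
    for version in expired:
        candidates.update(manifests[version])
    return sorted(candidates - retained_files)

manifests = {1: ["a", "b"], 2: ["b", "c"], 3: ["c", "d"]}
print(files_safe_to_delete(manifests, expired={1, 2}))  # → ['a', 'b']
```

File "c" survives even though expired versions 1 and 2 are gone, because retained version 3 still references it.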

🗜️ File Compaction

Merging Small Files

rust
pub struct Compactor {
    dataset: Arc<Dataset>,
    config: CompactionConfig,
}

pub struct CompactionConfig {
    // Fragments smaller than this are compaction candidates
    pub size_threshold: u64,
    
    // Maximum size of a single fragment
    pub max_fragment_size: u64,
    
    // Target row count per fragment after compaction
    pub target_rows_per_fragment: u32,
    
    // Maximum number of fragments compacted concurrently
    pub max_concurrent_fragments: usize,
}

impl Compactor {
    pub async fn compact(&self) -> Result<CompactionReport> {
        // 1. Identify fragmented (undersized) fragments
        let fragments_to_compact = self.identify_fragments().await?;
        
        if fragments_to_compact.is_empty() {
            return Ok(CompactionReport::default());
        }
        
        // 2. Process in batches bounded by the concurrency limit
        let batches = fragments_to_compact
            .chunks(self.config.max_concurrent_fragments)
            .collect::<Vec<_>>();
        
        let mut total_rows_before = 0u64;
        let mut total_rows_after = 0u64;
        let mut fragments_compacted = 0u32;
        
        for batch in batches {
            // Compact several fragments concurrently
            let futures = batch.iter().map(|frag_id| {
                self.compact_fragment(*frag_id)
            });
            
            let results = futures::future::join_all(futures).await;
            
            for result in results {
                match result {
                    Ok((before, after)) => {
                        total_rows_before += before;
                        total_rows_after += after;
                        fragments_compacted += 1;
                    }
                    Err(e) => {
                        eprintln!("Compaction error: {}", e);
                    }
                }
            }
        }
        
        Ok(CompactionReport {
            fragments_compacted,
            rows_before: total_rows_before,
            rows_after: total_rows_after,
            // Rough estimate: assume ~1 KiB reclaimed per removed row
            space_saved: total_rows_before.saturating_sub(total_rows_after) * 1024,
        })
    }
    
    async fn identify_fragments(&self) -> Result<Vec<u32>> {
        let current_manifest = self.dataset.load_current_manifest().await?;
        
        let mut small_fragments = Vec::new();
        
        for fragment in &current_manifest.fragments {
            // Compute the fragment's on-disk size
            let size: u64 = fragment.files.iter()
                .map(|f| f.size)
                .sum();
            
            // Below the threshold: mark for compaction
            if size < self.config.size_threshold {
                small_fragments.push(fragment.id);
            }
        }
        
        Ok(small_fragments)
    }
    
    async fn compact_fragment(&self, fragment_id: u32) -> Result<(u64, u64)> {
        // 1. Read all rows in the fragment
        let batch = self.dataset.read_fragment(fragment_id).await?;
        let rows_before = batch.num_rows() as u64;
        
        // 2. Filter out rows marked as deleted
        let visibility = self.dataset.get_visibility().await?;
        let mask = visibility.get_valid_rows_mask(fragment_id)?;
        let filtered = batch.filter(&mask)?;
        let rows_after = filtered.num_rows() as u64;
        
        // 3. Rewrite the surviving rows as a new fragment
        let new_fragment_id = self.dataset.write_batch(filtered).await?;
        
        // 4. Update the manifest (swap the old fragment for the new one)
        self.dataset.replace_fragment(fragment_id, new_fragment_id).await?;
        
        Ok((rows_before, rows_after))
    }
}

#[derive(Default)]
pub struct CompactionReport {
    pub fragments_compacted: u32,
    pub rows_before: u64,
    pub rows_after: u64,
    pub space_saved: u64,
}
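To make the selection step concrete, here is a short Python sketch. `identify_fragments` above only flags undersized fragments; grouping them into merge batches that respect `max_fragment_size` is an assumed extension (greedy first-fit, illustrative only):

```python
def plan_compaction(fragment_sizes, size_threshold, max_fragment_size):
    """Greedily group undersized fragments into merge batches whose
    combined size stays under max_fragment_size."""
    small = [(fid, size) for fid, size in fragment_sizes.items()
             if size < size_threshold]
    batches, current, current_size = [], [], 0
    for fid, size in sorted(small, key=lambda item: item[1]):
        # Close the current batch if adding this fragment would overflow it
        if current and current_size + size > max_fragment_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(fid)
        current_size += size
    if current:
        batches.append(current)
    return batches

sizes = {1: 5, 2: 3, 3: 20, 4: 4}  # fragment id → size (MB)
print(plan_compaction(sizes, size_threshold=10, max_fragment_size=8))
# → [[2, 4], [1]]
```

Fragment 3 is already large enough and is left alone; fragments 2 and 4 merge into one batch, and fragment 1 spills into a second pass.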

Python API

python
import lance
import schedule
import time
from datetime import timedelta

table = lance.open("data.lance")

# Manual compaction
report = table.compact()
print(f"Compacted {report.fragments_compacted} fragments")
print(f"Space saved: {report.space_saved / 1024 / 1024:.1f} MB")

# Clean up old versions (keep the last 7 days)
cleaned = table.cleanup(keep_recent=timedelta(days=7))
print(f"Deleted {cleaned.versions_deleted} versions")
print(f"Freed {cleaned.space_freed / 1024 / 1024:.1f} MB")

# Scheduled cleanup, run nightly during off-peak hours
def cleanup_job():
    table = lance.open("data.lance")
    table.cleanup(keep_recent=timedelta(days=7))
    table.compact()

schedule.every().day.at("02:00").do(cleanup_job)

while True:  # schedule only fires jobs when run_pending() is called
    schedule.run_pending()
    time.sleep(60)

🔄 Space Reclamation Strategy

Incremental Reclamation

rust
pub struct IncrementalRecovery {
    dataset: Arc<Dataset>,
    batch_size: usize,
}

impl IncrementalRecovery {
    // Reclaim space gradually instead of in one large operation
    pub async fn recover_incrementally(&self) -> Result<Vec<RecoveryPass>> {
        let mut passes = Vec::new();
        
        loop {
            // Process a bounded number of fragments per pass
            let pass = self.recover_batch(self.batch_size).await?;
            
            if pass.fragments_processed == 0 {
                break;  // No more fragments to process
            }
            
            passes.push(pass);
            
            // Yield time to concurrent queries between passes
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
        
        Ok(passes)
    }
    
    async fn recover_batch(&self, limit: usize) -> Result<RecoveryPass> {
        // Identify up to `limit` reclaimable fragments
        let fragments = self.identify_candidates(limit).await?;
        
        let mut pass = RecoveryPass::default();
        
        for frag_id in fragments {
            if let Ok((before, after)) = self.recover_fragment(frag_id).await {
                // Rough estimate: ~1 KiB reclaimed per removed row
                pass.space_recovered += before.saturating_sub(after) * 1024;
                pass.fragments_processed += 1;
            }
        }
        
        Ok(pass)
    }
}

#[derive(Default)]
pub struct RecoveryPass {
    pub fragments_processed: usize,
    pub space_recovered: u64,
}
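The same pacing idea fits in a compact Python sketch (illustrative: `process` stands in for per-fragment reclamation, and the pause between passes is what yields I/O back to foreground queries):

```python
import time

def recover_incrementally(candidates, batch_size, process, pause=0.0):
    """Walk reclaimable fragments a batch at a time, pausing between
    passes so foreground queries are not starved of I/O."""
    passes = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        recovered = sum(process(frag_id) for frag_id in batch)
        passes.append((len(batch), recovered))
        time.sleep(pause)  # yield to concurrent work
    return passes

# Five candidate fragments, each reclaiming 10 units, two per pass
print(recover_incrementally([1, 2, 3, 4, 5], 2, lambda f: 10))
# → [(2, 20), (2, 20), (1, 10)]
```

Three small passes replace one large stop-the-world sweep; total space reclaimed is the same, but each pass's I/O burst is bounded.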

📊 Monitoring and Reporting

rust
pub struct CleanupScheduler {
    interval: Duration,
    dataset: Arc<Dataset>,
}

impl CleanupScheduler {
    pub async fn run(&self) {
        let mut interval = tokio::time::interval(self.interval);
        
        loop {
            interval.tick().await;
            
            match self.perform_cleanup().await {
                Ok(report) => {
                    println!("Cleanup completed: {}", serde_json::to_string(&report).unwrap());
                }
                Err(e) => {
                    eprintln!("Cleanup failed: {}", e);
                }
            }
        }
    }
    
    async fn perform_cleanup(&self) -> Result<CleanupSummary> {
        let cleaner = VersionCleaner {
            dataset_path: self.dataset.path().to_path_buf(),
            manifest_dir: self.dataset.path().join("_manifest"),
            retention_policy: RetentionPolicy::KeepLastN(10),
        };
        
        let cleanup_report = cleaner.clean().await?;
        
        let compactor = Compactor {
            dataset: self.dataset.clone(),
            config: CompactionConfig {
                size_threshold: 10 * 1024 * 1024,  // 10MB
                max_fragment_size: 1024 * 1024 * 1024,  // 1GB
                target_rows_per_fragment: 1_000_000,
                max_concurrent_fragments: 4,
            },
        };
        
        let compaction_report = compactor.compact().await?;
        
        Ok(CleanupSummary {
            versions_deleted: cleanup_report.versions_deleted,
            space_freed_from_cleanup: cleanup_report.space_freed,
            fragments_compacted: compaction_report.fragments_compacted,
            space_freed_from_compaction: compaction_report.space_saved,
            timestamp: SystemTime::now(),
        })
    }
}

#[derive(Serialize)]
pub struct CleanupSummary {
    pub versions_deleted: u32,
    pub space_freed_from_cleanup: u64,
    pub fragments_compacted: u32,
    pub space_freed_from_compaction: u64,
    pub timestamp: SystemTime,
}

📚 Summary

Cleanup and compaction are key to maintaining Lance's performance and controlling storage costs:

  1. Version cleanup: delete expired versions to reclaim storage space
  2. File compaction: merge small files to improve query performance
  3. Space reclamation: incremental passes avoid disrupting live workloads
  4. Automatic scheduling: periodic, unattended maintenance

Together, these operations can save 30-50% of storage space while keeping the system highly available.
