Chapter 26: Data Cleanup and Compaction
🎯 Overview
As data is continually updated and deleted, Lance accumulates stale versions and fragmented files. Cleanup and compaction reclaim storage and improve query performance by deleting expired versions and merging small files. Regular cleanup can typically reclaim 30-50% of storage space.
🗑️ Expired Version Cleanup
Version Lifecycle
text
Version 1 (create) → Version 2 (update) → Version 3 (delete) → Version 4 (update) → ...
      ↓                    ↓                    ↓                    ↓
   active               active               active          active (latest)
(live queries)       (live queries)       (live queries)       (live queries)

After time passes...

Version 1 → Version 2 → Version 3 → Version 4 (active) → Version 5 (latest, active)
 expired     expired     expired       active                active
deletable   deletable   deletable
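The lifecycle above boils down to one decision: a version is safe to delete only when it is both expired under the retention policy and not pinned by a live query. A minimal Python sketch of that rule (function and parameter names are illustrative, not Lance's actual API):

```python
def deletable_versions(all_versions, keep_last_n, active):
    """Versions that have expired AND are not pinned by a live query."""
    ordered = sorted(all_versions)
    keep = set(ordered[-keep_last_n:])  # the newest N stay alive
    return [v for v in ordered if v not in keep and v not in active]

# Versions 1-3 are expired under keep_last_n=2, but 3 is pinned by a query
print(deletable_versions([1, 2, 3, 4, 5], keep_last_n=2, active={3}))  # [1, 2]
```

Note that version 3 survives this pass despite being expired; a later pass picks it up once its query finishes.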
Cleaner Implementation
rust
use std::path::PathBuf;
use std::time::{Duration, SystemTime};

pub struct VersionCleaner {
    dataset_path: PathBuf,
    manifest_dir: PathBuf,
    retention_policy: RetentionPolicy,
}

pub enum RetentionPolicy {
    // Keep the most recent N versions
    KeepLastN(usize),
    // Keep versions newer than the given duration
    KeepRecent(Duration),
    // Custom predicate
    Custom(Box<dyn Fn(&VersionInfo) -> bool + Send + Sync>),
}
impl VersionCleaner {
    pub async fn clean(&self) -> Result<CleanupReport> {
        // 1. List all versions
        let all_versions = self.list_all_versions().await?;
        if all_versions.is_empty() {
            return Ok(CleanupReport::default());
        }
        // 2. Determine which versions are deletable
        let (_keep_versions, delete_versions) = self.determine_versions(&all_versions)?;
        // 3. Skip versions pinned by active queries
        let active_versions = self.find_active_versions().await?;
        let safe_to_delete: Vec<u64> = delete_versions.iter()
            .copied()
            .filter(|v| !active_versions.contains(v))
            .collect();
        // 4. Delete the version files
        let mut freed_space = 0u64;
        for version in &safe_to_delete {
            freed_space += self.delete_version(*version).await?;
        }
        Ok(CleanupReport {
            // Count only what was actually deleted (pinned versions are skipped)
            versions_deleted: safe_to_delete.len() as u32,
            space_freed: freed_space,
            timestamp: SystemTime::now(),
        })
    }
    fn determine_versions(&self, all_versions: &[VersionInfo]) -> Result<(Vec<u64>, Vec<u64>)> {
        // Assumes all_versions is sorted ascending by version number
        let keep: Vec<u64> = match &self.retention_policy {
            RetentionPolicy::KeepLastN(n) => {
                // Keep the last N versions
                all_versions.iter()
                    .rev()
                    .take(*n)
                    .map(|v| v.version)
                    .collect()
            }
            RetentionPolicy::KeepRecent(duration) => {
                // Keep versions created within the last `duration`
                let cutoff_time = SystemTime::now() - *duration;
                all_versions.iter()
                    .filter(|v| v.timestamp >= cutoff_time)
                    .map(|v| v.version)
                    .collect()
            }
            RetentionPolicy::Custom(predicate) => {
                all_versions.iter()
                    .filter(|v| predicate(v))
                    .map(|v| v.version)
                    .collect()
            }
        };
        let delete: Vec<u64> = all_versions.iter()
            .map(|v| v.version)
            .filter(|v| !keep.contains(v))
            .collect();
        Ok((keep, delete))
    }
    async fn find_active_versions(&self) -> Result<Vec<u64>> {
        // Find versions pinned by in-flight queries,
        // e.g. by tracking open connections or a query log
        Ok(vec![]) // simplified: assume no active readers
    }
    async fn delete_version(&self, version: u64) -> Result<u64> {
        // 1. Load the manifest for this version
        let manifest = self.load_manifest(version).await?;
        // 2. Collect the files referenced by this version
        // NOTE: data files may be shared with retained versions; a production
        // implementation must only delete files unreferenced by any kept manifest
        let files_to_delete: Vec<_> = manifest.fragments.iter()
            .flat_map(|f| f.files.iter().map(|file| file.path.clone()))
            .collect();
        // 3. Delete the files and tally the space freed
        let mut freed_space = 0u64;
        for file_path in files_to_delete {
            if let Ok(metadata) = tokio::fs::metadata(&file_path).await {
                freed_space += metadata.len();
                tokio::fs::remove_file(&file_path).await?;
            }
        }
        // 4. Delete the manifest file itself
        let manifest_file = self.manifest_dir.join(format!("v{}_manifest", version));
        if manifest_file.exists() {
            freed_space += tokio::fs::metadata(&manifest_file).await?.len();
            tokio::fs::remove_file(&manifest_file).await?;
        }
        Ok(freed_space)
    }
}
}
pub struct CleanupReport {
    pub versions_deleted: u32,
    pub space_freed: u64,
    pub timestamp: SystemTime,
}

// SystemTime does not implement Default, so derive(Default) would not compile
impl Default for CleanupReport {
    fn default() -> Self {
        Self {
            versions_deleted: 0,
            space_freed: 0,
            timestamp: SystemTime::now(),
        }
    }
}
🗜️ File Compaction
Merging Small Files
rust
use std::sync::Arc;

pub struct Compactor {
    dataset: Arc<Dataset>,
    config: CompactionConfig,
}

pub struct CompactionConfig {
    // File size threshold below which a fragment is compacted
    pub size_threshold: u64,
    // Maximum size of a single fragment
    pub max_fragment_size: u64,
    // Target row count per fragment after compaction
    pub target_rows_per_fragment: u32,
    // Maximum number of fragments compacted concurrently
    pub max_concurrent_fragments: usize,
}
impl Compactor {
    pub async fn compact(&self) -> Result<CompactionReport> {
        // 1. Identify fragmented (undersized) fragments
        let fragments_to_compact = self.identify_fragments().await?;
        if fragments_to_compact.is_empty() {
            return Ok(CompactionReport::default());
        }
        // 2. Process in batches bounded by the concurrency limit
        let batches = fragments_to_compact
            .chunks(self.config.max_concurrent_fragments)
            .collect::<Vec<_>>();
        let mut total_rows_before = 0u64;
        let mut total_rows_after = 0u64;
        let mut fragments_compacted = 0u32;
        for batch in batches {
            // Compact the fragments in this batch concurrently
            let futures = batch.iter().map(|frag_id| {
                self.compact_fragment(*frag_id)
            });
            let results = futures::future::join_all(futures).await;
            for result in results {
                match result {
                    Ok((before, after)) => {
                        total_rows_before += before;
                        total_rows_after += after;
                        fragments_compacted += 1;
                    }
                    Err(e) => {
                        eprintln!("Compaction error: {}", e);
                    }
                }
            }
        }
        Ok(CompactionReport {
            fragments_compacted,
            rows_before: total_rows_before,
            rows_after: total_rows_after,
            // Rough estimate: assume ~1 KiB per reclaimed row
            space_saved: total_rows_before.saturating_sub(total_rows_after) * 1024,
        })
    }
    async fn identify_fragments(&self) -> Result<Vec<u32>> {
        let current_manifest = self.dataset.load_current_manifest().await?;
        let mut small_fragments = Vec::new();
        for fragment in &current_manifest.fragments {
            // Total size of the fragment's data files
            let size: u64 = fragment.files.iter()
                .map(|f| f.size)
                .sum();
            // Below the threshold: mark for compaction
            if size < self.config.size_threshold {
                small_fragments.push(fragment.id);
            }
        }
        Ok(small_fragments)
    }
    async fn compact_fragment(&self, fragment_id: u32) -> Result<(u64, u64)> {
        // 1. Read every row in the fragment
        let batch = self.dataset.read_fragment(fragment_id).await?;
        let rows_before = batch.num_rows() as u64;
        // 2. Filter out deleted rows
        let visibility = self.dataset.get_visibility().await?;
        let mask = visibility.get_valid_rows_mask(fragment_id)?;
        let filtered = batch.filter(&mask)?;
        let rows_after = filtered.num_rows() as u64;
        // 3. Rewrite the surviving rows as a new fragment
        let new_fragment_id = self.dataset.write_batch(filtered).await?;
        // 4. Update the manifest (swap the old fragment for the new one)
        self.dataset.replace_fragment(fragment_id, new_fragment_id).await?;
        Ok((rows_before, rows_after))
    }
}
#[derive(Default)]
pub struct CompactionReport {
    pub fragments_compacted: u32,
    pub rows_before: u64,
    pub rows_after: u64,
    pub space_saved: u64,
}
Python API
python
import lance
import schedule
import time

table = lance.open("data.lance")

# Manual compaction
report = table.compact()
print(f"Compacted {report.fragments_compacted} fragments")
print(f"Space saved: {report.space_saved / 1024 / 1024:.1f} MB")

# Clean up old versions (keep the last 7 days)
from datetime import timedelta
cleaned = table.cleanup(keep_recent=timedelta(days=7))
print(f"Deleted {cleaned.versions_deleted} versions")
print(f"Freed {cleaned.space_freed / 1024 / 1024:.1f} MB")

# Scheduled cleanup
def cleanup_job():
    table = lance.open("data.lance")
    table.cleanup(keep_recent=timedelta(days=7))
    table.compact()

schedule.every().day.at("02:00").do(cleanup_job)

# schedule only registers the job; a loop must drive it
while True:
    schedule.run_pending()
    time.sleep(60)
🔄 Space Reclamation Strategies
Incremental Reclamation
rust
pub struct IncrementalRecovery {
    dataset: Arc<Dataset>,
    batch_size: usize,
}

impl IncrementalRecovery {
    // Reclaim space gradually instead of in one large operation
    pub async fn recover_incrementally(&self) -> Result<Vec<RecoveryPass>> {
        let mut passes = Vec::new();
        loop {
            // Process a bounded number of fragments per pass
            let pass = self.recover_batch(self.batch_size).await?;
            if pass.fragments_processed == 0 {
                break; // no more fragments to process
            }
            passes.push(pass);
            // Yield time to concurrent queries
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
        Ok(passes)
    }
    async fn recover_batch(&self, limit: usize) -> Result<RecoveryPass> {
        // Identify up to `limit` reclaimable fragments
        let fragments = self.identify_candidates(limit).await?;
        let mut pass = RecoveryPass::default();
        for frag_id in fragments {
            if let Ok((before, after)) = self.recover_fragment(frag_id).await {
                // Rough estimate: assume ~1 KiB per reclaimed row
                pass.space_recovered += (before - after) * 1024;
                pass.fragments_processed += 1;
            }
        }
        Ok(pass)
    }
}
#[derive(Default)]
pub struct RecoveryPass {
    pub fragments_processed: usize,
    pub space_recovered: u64,
}
📊 Monitoring and Reporting
rust
pub struct CleanupScheduler {
    interval: Duration,
    dataset: Arc<Dataset>,
}

impl CleanupScheduler {
    pub async fn run(&self) {
        let mut interval = tokio::time::interval(self.interval);
        loop {
            interval.tick().await;
            match self.perform_cleanup().await {
                Ok(report) => {
                    println!("Cleanup completed: {}", serde_json::to_string(&report).unwrap());
                }
                Err(e) => {
                    eprintln!("Cleanup failed: {}", e);
                }
            }
        }
    }
    async fn perform_cleanup(&self) -> Result<CleanupSummary> {
        let cleaner = VersionCleaner {
            dataset_path: self.dataset.path().to_path_buf(),
            manifest_dir: self.dataset.path().join("_manifest"),
            retention_policy: RetentionPolicy::KeepLastN(10),
        };
        let cleanup_report = cleaner.clean().await?;
        let compactor = Compactor {
            dataset: self.dataset.clone(),
            config: CompactionConfig {
                size_threshold: 10 * 1024 * 1024,       // 10 MB
                max_fragment_size: 1024 * 1024 * 1024,  // 1 GB
                target_rows_per_fragment: 1_000_000,
                max_concurrent_fragments: 4,
            },
        };
        let compaction_report = compactor.compact().await?;
        Ok(CleanupSummary {
            versions_deleted: cleanup_report.versions_deleted,
            space_freed_from_cleanup: cleanup_report.space_freed,
            fragments_compacted: compaction_report.fragments_compacted,
            space_freed_from_compaction: compaction_report.space_saved,
            timestamp: SystemTime::now(),
        })
    }
}
#[derive(Serialize)]
pub struct CleanupSummary {
    pub versions_deleted: u32,
    pub space_freed_from_cleanup: u64,
    pub fragments_compacted: u32,
    pub space_freed_from_compaction: u64,
    pub timestamp: SystemTime,
}
📚 Summary
Cleanup and compaction are key to maintaining Lance's performance and controlling storage costs:
- Version cleanup: delete expired versions to free storage
- File compaction: merge small files to improve query performance
- Space reclamation: incremental passes avoid disrupting live workloads
- Automatic scheduling: run cleanup periodically and unattended
Together these operations can reclaim 30-50% of storage space while keeping the system highly available.