SugLucene Index Construction

Overview

A search-suggestion index builder based on Apache Lucene, dedicated to building inverted indexes that support both Chinese and pinyin search.

Core Features

1. Index Build Pipeline

Fetch config → Download data → Build index → Upload files → Clean up temp files

2. Key Characteristics

  • Arbitrary-position substring matching: searching any substring finds the full term
  • Mixed Chinese/pinyin search: full pinyin, initial-letter abbreviations, and pinyin prefixes are all searchable (see the lookup sketch after this list)
  • High-performance queries: backed by a Lucene inverted index, with millisecond-level response times
  • Automated builds: scheduled via Crane tasks, supporting periodic rebuilds
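
As a quick illustration, the lookups below are all expected to return the same entry. This is a minimal usage sketch: it assumes the queryFromIndex method shown later in this article and the "restaurant_suggest" scene from the configuration example, with "麦当劳" present in its corpus.

java
// Minimal usage sketch (queryFromIndex is shown later; the scene name and corpus are assumptions)
List<String> bySubstring = queryFromIndex("当劳", "restaurant_suggest");    // substring at an arbitrary position
List<String> byInitials  = queryFromIndex("mdl", "restaurant_suggest");     // pinyin initial letters
List<String> byPrefix    = queryFromIndex("maidang", "restaurant_suggest"); // pinyin prefix
// each of the three lists is expected to contain "麦当劳"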

Technical Architecture

1. Core Class Structure

java
@Service
public class SugLuceneIndexBuilder {
    // Configuration management
    @MtConfig(key = "sug.fst.index.config")
    private static volatile Map<String, FSTBuildScene> fstIndexConfig;
    
    // Core services
    @Autowired private HiveService hiveService;
    @Autowired private AmazonS3Client s3Client;
    
    // Pinyin handling
    private static Sterotoner sterotoner = new Sterotoner();
}

2. Field Definitions

java
public static class CorrectLuceneFields {
    public static String Content = "content";               // stored original content
    public static String ContentPrefix = "content_prefix";  // indexed search field
}

Index Construction in Detail

1. Data Acquisition

java
private boolean downloadDataFromHive(FSTBuildScene sceneConfig) {
    String sql = sceneConfig.getHiveSql();
    String localInputFileName = INPUT_PATH + sceneConfig.getIndexName() + ".tsv";
    
    HiveServiceResponse response = hiveService.downloadFileWithRetryR(sql, localInputFileName, 3);
    return response != null && response.isSuccess();
}

Highlights

  • Data is fetched by running a SQL query on Hive
  • Retries are supported (up to 3 attempts)
  • Data format: the file is named .tsv, but getFirstColumn in the full source splits on Hive's default '\u0001' field delimiter (see the sample line below)
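
As a concrete illustration of the downloaded format, here is a sketch of a single line and how only its first column is used (the second column is hypothetical; getFirstColumn, shown in the full source below, performs the actual split):

java
// Hypothetical Hive output line: columns joined by '\u0001'
String line = "麦当劳" + '\u0001' + "1";
// getFirstColumn(line) returns "麦当劳"; only the first column is indexed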

2. Core Document-Mapping Logic

A. Document Structure Generation

java
public Iterable<? extends IndexableField> mapDoc(String line) {
    String content = line.trim();
    List<IndexableField> fields = new ArrayList<>();
    
    // 1. Store the original content (returned in results)
    fields.add(new StoredField(CorrectLuceneFields.Content, content));
    
    // 2. Generate the indexed search fields
    addContentPrefixFields(fields, content, normalizeString(content));
    
    return fields;
}

B. Index Field Generation Strategy

java
private void addContentPrefixFields(List<IndexableField> fields, String content, String queryClean) {
    // 1. Enumerate all Chinese substrings - O(n²) substrings per entry
    for (int i = 0; i < content.length(); i++) {
        for (int j = i + 1; j <= content.length(); j++) {
            String substring = content.substring(i, j);
            fields.add(new StringField(CorrectLuceneFields.ContentPrefix, substring, Field.Store.NO));
        }
    }
    
    // 2. Generate pinyin variants
    Set<String> allPinyins = getPyQuerys(queryClean, true);
    for (String pinyin : allPinyins) {
        fields.add(new StringField(CorrectLuceneFields.ContentPrefix, pinyin, Field.Store.NO));
        
        // Generate pinyin prefixes (length >= 2, excluding the full form)
        for (int len = 2; len < pinyin.length(); len++) {
            String prefix = pinyin.substring(0, len);
            fields.add(new StringField(CorrectLuceneFields.ContentPrefix, prefix, Field.Store.NO));
        }
    }
}

3. Concrete Example: Index Generation for 麦当劳

Input Data

麦当劳

Generated Index Entries

text
# StoredField (returned in results)
content: "麦当劳"

# StringField (used for search)
content_prefix: "麦"
content_prefix: "麦当"
content_prefix: "麦当劳"
content_prefix: "当"
content_prefix: "当劳"
content_prefix: "劳"
content_prefix: "maidanglao"    # full pinyin
content_prefix: "mdl"           # initial-letter abbreviation
content_prefix: "md"            # prefix of the initials
content_prefix: "ma"            # pinyin prefix
content_prefix: "mai"           # pinyin prefix
content_prefix: "maid"          # pinyin prefix
content_prefix: "maida"         # pinyin prefix
content_prefix: "maidan"        # pinyin prefix
content_prefix: "maidang"       # pinyin prefix
content_prefix: "maidangl"      # pinyin prefix
content_prefix: "maidangla"     # pinyin prefix

Underlying Data Structures

1. Lucene Inverted Index Structure

text
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Term Dictionary │    │   Posting Lists  │    │ Document Store  │
├─────────────────┤    ├──────────────────┤    ├─────────────────┤
│ "麦"     → ptr1 │───→│ [doc1,doc5,doc10]│    │ doc1: "麦当劳"  │
│ "麦当"   → ptr2 │───→│ [doc1,doc8]      │    │ doc2: "劳动节"  │
│ "麦当劳" → ptr3 │───→│ [doc1]           │    │ doc3: "当当网"  │
│ "当"     → ptr4 │───→│ [doc1,doc3,doc7] │    │ ...             │
│ "当劳"   → ptr5 │───→│ [doc1,doc2]      │    │                 │
│ "劳"     → ptr6 │───→│ [doc1,doc4,doc9] │    │                 │
│ "mai"    → ptr7 │───→│ [doc1,doc6]      │    │                 │
│ "mdl"    → ptr8 │───→│ [doc1,doc11]     │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Generated Lucene Document Structure

java
// Each original text line produces one Lucene Document
Document {
    StoredField("content", "麦当劳"),           // returned in results
    
    StringField("content_prefix", "麦"),        // used for search
    StringField("content_prefix", "麦当"),      // used for search
    StringField("content_prefix", "麦当劳"),    // used for search
    StringField("content_prefix", "当"),        // used for search
    StringField("content_prefix", "当劳"),      // used for search
    StringField("content_prefix", "劳"),        // used for search
    StringField("content_prefix", "maidanglao"), // full pinyin
    StringField("content_prefix", "mdl"),       // initials
    StringField("content_prefix", "mai"),       // pinyin prefix
    StringField("content_prefix", "maid"),      // pinyin prefix
    StringField("content_prefix", "maida"),     // pinyin prefix
    StringField("content_prefix", "maidan"),    // pinyin prefix
    StringField("content_prefix", "maidang"),   // pinyin prefix
    StringField("content_prefix", "maidangl"),  // pinyin prefix
    StringField("content_prefix", "maidangla"), // pinyin prefix
    // ... more pinyin prefixes ("ma", "md", ...)
}

2. Search Process: Querying "麦"

Low-Level Query Implementation

java
public List<String> queryFromIndex(String cleanQuery, String scene) {
    try {
        IndexSearcher indexSearcher = indexSearcherMap.get(scene);
        
        // Get the leaf reader (a single segment after forceMerge(1))
        List<LeafReaderContext> leaves = indexSearcher.getIndexReader().leaves();
        LeafReader reader = leaves.get(0).reader();
        
        Set<String> results = new HashSet<>();
        
        // Build the query term
        Term term = new Term("content_prefix", cleanQuery);
        
        // Fetch the posting list
        PostingsEnum postings = reader.postings(term);
        if (postings != null) {
            int docId;
            while ((docId = postings.nextDoc()) != PostingsEnum.NO_MORE_DOCS) {
                Document doc = reader.document(docId);
                String content = doc.get("content");
                if (content != null) {
                    results.add(content);
                }
            }
        }

        return new ArrayList<>(results);
    } catch (Exception e) {
        LOGGER.error("Error querying index for scene: {}, query: {}", scene, cleanQuery, e);
        return new ArrayList<>();
    }
}

Query Execution Steps

text
1. Term lookup: content_prefix:"麦"
2. Fetch the posting list: [doc1, doc5, doc10, ...]
3. Retrieve documents: 
   - doc1 → "麦当劳"
   - doc5 → "麦咖啡"  
   - doc10 → "麦片粥"
4. Return results: ["麦当劳", "麦咖啡", "麦片粥"]
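
For comparison, the same lookup can also be expressed with Lucene's standard IndexSearcher / TermQuery API instead of walking the posting list by hand. This is only a sketch, not the implementation used by this project; it assumes the same indexSearcher as above:

java
// Classes come from org.apache.lucene.search, org.apache.lucene.index and org.apache.lucene.document
public List<String> lookup(IndexSearcher indexSearcher, String cleanQuery) throws IOException {
    Query query = new TermQuery(new Term("content_prefix", cleanQuery));
    TopDocs hits = indexSearcher.search(query, 20);       // top 20 matching documents
    List<String> results = new ArrayList<>();
    for (ScoreDoc sd : hits.scoreDocs) {
        Document doc = indexSearcher.doc(sd.doc);          // load the stored fields
        results.add(doc.get("content"));                    // e.g. "麦当劳", "麦咖啡"
    }
    return results;
}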

Pinyin Processing

1. Sterotoner Pinyin Conversion

java
public Set<String> getPyQuerys(final String query, boolean addFirstAlpha) {
    Set<String> result = new HashSet<>();
    Set<String> fpy = new HashSet<>();  // initial-letter pinyin
    Set<String> apy = new HashSet<>();  // full pinyin
    
    sterotoner.getPinyinAll(query, apy, fpy);
    
    if (addFirstAlpha) {
        result.addAll(fpy);  // add initial-letter abbreviations
    }
    result.addAll(apy);      // add full pinyin
    
    return result;
}

2. Pinyin Variant Example

text
Input: "麦当劳"
Output: 
- Full pinyin: "maidanglao"
- Initials: "mdl"
- Prefixes: "ma", "mai", "maid", "maida", "maidan", "maidang", "maidangl", "maidangla" (plus "md" from the initials)

File Paths and Storage

1. Local Path Configuration

java
public static final String DEFAULT_LOCAL_PATH = "/opt/meituan/dict/sug_fallback_indexes/";
public static final String INPUT_PATH = DEFAULT_LOCAL_PATH + "input/";
public static final String OUTPUT_PATH = DEFAULT_LOCAL_PATH + "output/";

2. File Layout

text
/opt/meituan/dict/sug_fallback_indexes/
├── input/
│   └── {indexName}.tsv           # raw data downloaded from Hive
└── output/
    └── {indexName}/
        ├── segments_1            # Lucene segments file
        ├── _0.cfe               # compound file entry table
        ├── _0.cfs               # compound file data
        ├── _0.si                # segment info
        └── write.lock           # write lock file

3. S3 Upload

java
private void uploadS3(String indexName) {
    String bucketName = "index";
    String s3KeyPrefix = "RecallIndexBackup/FallbackIndex/" + indexName;
    
    // Upload all index files (except write.lock)
    // Generate and upload the file manifest file_manifest.txt
}

Performance Characteristics

1. Time Complexity

  • Index construction: O(n³) per entry, where n is the average string length (O(n²) substrings, each up to O(n) characters long)
  • Query: O(log m + k), where m is the size of the term dictionary and k the number of results

2. Space Complexity

  • Index size: roughly 10-20x the raw data (see the per-entry term estimate below)
  • Memory usage: high during the build (note that buildIndex streams the input line by line, so the raw data itself is not all held in memory at once)
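
A back-of-the-envelope estimate of how many content_prefix terms a single entry produces, following addContentPrefixFields above (a sketch; the exact count depends on which pinyin variants Sterotoner emits):

java
// Rough per-entry term count for "麦当劳" (3 characters)
int n = 3;
int chineseSubstrings = n * (n + 1) / 2;                   // 6: 麦, 麦当, 麦当劳, 当, 当劳, 劳
int fullPinyinTerms   = 1 + ("maidanglao".length() - 2);   // 9: maidanglao plus prefixes "ma" .. "maidangla"
int initialTerms      = 1 + ("mdl".length() - 2);          // 2: "mdl" plus "md"
int total = chineseSubstrings + fullPinyinTerms + initialTerms;
System.out.println(total);                                  // ~17 indexed terms for one 3-character entry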

3. Query Performance

  • Exact match: millisecond-level response
  • Prefix queries: millisecond-level response
  • Arbitrary substrings: millisecond-level response

Configuration Management

1. MCC Configuration

java
@MtConfig(clientId = MccConfiguration.ID, 
          key = "sug.fst.index.config", 
          converter = FSTBuildSceneConverter.class)
private static volatile Map<String, FSTBuildScene> fstIndexConfig;

2. Example Configuration

json
{
  "restaurant_suggest": {
    "hiveSql": "SELECT name FROM restaurant_table WHERE status = 1",
    "indexName": "restaurant_suggest",
    "columnNum": 1
  }
}
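
The JSON above is deserialized into FSTBuildScene objects. The class itself is not shown in this article; the following is only a minimal sketch of the shape implied by the JSON keys and by the getters/setters the builder calls (getHiveSql, getIndexName, getLocalPath, setLocalPath):

java
// Sketch of FSTBuildScene as implied by its usage (not the actual class definition)
public class FSTBuildScene {
    private String hiveSql;    // Hive query that produces the suggestion corpus
    private String indexName;  // used for local file names and the S3 key prefix
    private int columnNum;     // number of columns in the Hive output
    private String localPath;  // set after the data file has been downloaded

    public String getHiveSql() { return hiveSql; }
    public String getIndexName() { return indexName; }
    public String getLocalPath() { return localPath; }
    public void setLocalPath(String localPath) { this.localPath = localPath; }
    // setters for hiveSql / indexName / columnNum omitted; fastjson needs them (or public fields) to deserialize
}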

Task Scheduling

1. Crane Task

java
@Crane("sug.fallback.fst.build.universal.task")
public void buildFSTIndex(String sceneName) {
    // index build logic
}

2. Execution Flow

text
1. Fetch the scene configuration
2. Download data from Hive
3. Build the Lucene index
4. Upload to S3
5. Clean up temp files

Optimization Strategies

1. Index Optimization

java
indexWriter.commit();
indexWriter.flush();
indexWriter.forceMerge(1);  // force-merge into a single segment, so the query side only needs leaves.get(0)

2. String Normalization

java
public static String normalizeString(String phrase) {
    // 1. Strip a leading/trailing "外卖" token
    // 2. Convert full-width characters to half-width
    // 3. Lowercase
    // 4. Remove extra spaces
    return removeSpaceEx(full2Half(formatQuery.toLowerCase()));
}
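
A few illustrative calls and their expected results (a sketch; the exact behavior of full2Half and removeSpaceEx is assumed to match the comments above):

java
normalizeString("麦当劳 外卖");   // trailing " 外卖" stripped           -> "麦当劳"
normalizeString("外卖 麦当劳");   // leading "外卖 " stripped            -> "麦当劳"
normalizeString("ABC咖啡");     // full-width letters halved, lowercased -> "abc咖啡"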

Applicable Scenarios

✅ Good fits

  • Search suggestions / autocomplete
  • Arbitrary-position substring matching
  • Mixed Chinese/pinyin search
  • High-frequency query scenarios

❌ Poor fits

  • Complex full-text search
  • Relevance scoring and ranking
  • Frequently updated indexes
  • Extremely memory-sensitive environments

Monitoring Metrics

1. Build Metrics

  • Index build time
  • Index file size
  • Number of index entries generated

2. Query Metrics

  • Query response time
  • Query accuracy
  • Memory usage

Caveats

1. Data Quality

  • Deduplicate the input data
  • Filter out strings that are too short (< MINLEN)
  • Handle special characters and encoding issues

2. Resource Management

  • Memory usage is high during the build
  • Sufficient disk space is needed to store the index
  • S3 uploads consume network bandwidth

3. Maintenance Cost

  • The index is immutable; updates require a full rebuild
  • Temporary files need to be cleaned up regularly
  • Build task status should be monitored

This implementation trades storage space for query performance, which makes it particularly well suited to search suggestions, where response-time requirements are extremely strict.

java
// lyx/fst-index-builder

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.TypeReference;
import com.cip.crane.client.spring.annotation.Crane;
import com.google.common.base.Splitter;
import com.google.common.collect.Lists;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.common.collect.Iterables;
import com.sankuai.meituan.config.annotation.MtConfig;
import com.sankuai.meituan.config.configuration.MccConfiguration;
import com.sankuai.meituan.config.exception.MtConfigException;
import com.sankuai.meituan.config.function.MtConfigConverter;
import com.sankuai.meituan.waimai.d.search.offline.similarity.relevance.common.pinyin.Sterotoner;
import com.sankuai.meituan.waimai.traffic.offline.task.domain.HiveServiceResponse;
import com.sankuai.meituan.waimai.traffic.offline.task.service.thirdparty.HiveService;
import com.sankuai.meituan.waimai.traffic.offline.task.util.FSTBuildScene;
import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

import static com.sankuai.meituan.waimai.d.search.util.StringHelp.full2Half;
import static com.sankuai.meituan.waimai.traffic.offline.task.util.StringUtil.removeSpaceEx;

@Service
public class SugLuceneIndexBuilder {
    private static final Logger LOGGER = LoggerFactory.getLogger(SugLuceneIndexBuilder.class);

    // Local paths for files downloaded from Hive
    public static final String DEFAULT_LOCAL_PATH = "/opt/meituan/dict/sug_fallback_indexes/";
//    public static final String DEFAULT_LOCAL_PATH = "/Users/longyuxin/Desktop/work/waimai_d_traffic_data_source/waimai-d-traffic-offline-task-service/src/test/resources/test_index";
    public static final String INPUT_PATH = DEFAULT_LOCAL_PATH + "input/";
    public static final String OUTPUT_PATH = DEFAULT_LOCAL_PATH + "output/";
    public static final String REMOTE_INDEX_PATH = "RecallIndexBackup/FallbackIndex/";

    @Autowired
    private HiveService hiveService;
    @Autowired
    private AmazonS3Client s3Client;

    private static Sterotoner sterotoner = new Sterotoner();

    private static final int MINLEN = 2;

    public static class CorrectLuceneFields {
        public static String Content = "content";               // original content
        public static String ContentPrefix = "content_prefix";  // substrings and pinyin variants
    }

    @MtConfig(clientId =  MccConfiguration.ID, key = "sug.fst.index.config", converter = FSTBuildSceneConverter.class)
    private static volatile Map<String, FSTBuildScene> fstIndexConfig = new ConcurrentHashMap<>();


    public static class FSTBuildSceneConverter implements MtConfigConverter<Map<String, FSTBuildScene>> {
        @Override
        public Map<String, FSTBuildScene> convert(Field field, String key, String newValue) throws MtConfigException {
            LOGGER.info("config:{} changed,  newValue:{}", key, newValue);
            Map<String, FSTBuildScene> result = new HashMap<>();
            if (StringUtils.isNotBlank(newValue)) {
                try {
                    result = JSON.parseObject(newValue, new TypeReference<Map<String, FSTBuildScene>>() {});
                }catch (Exception e) {
                    LOGGER.warn("parse FSTBuildScene error! key:{}", key, e);
                }
            }
            return result;
        }
    }

    @Crane("sug.fallback.fst.build.universal.task")
    public void buildFSTIndex(String sceneName) {
        FSTBuildScene scene = null;
        try {
            LOGGER.info("Starting FST index build for scene: {}", sceneName);

            // 1. Fetch the scene configuration
            scene = fstIndexConfig.get(sceneName);

            if (scene == null || scene.getHiveSql().isEmpty()) {
                LOGGER.error("Invalid scene config " + sceneName);
                return;
            }

            // 2. Download data from Hive and keep it locally as a temp file
            if (!downloadDataFromHive(scene)) {
                LOGGER.error("Fetch data failed for scene " + sceneName);
                return;
            }

            // 3. Process the data and build the index
            buildIndex(scene);

        }  catch (Exception e) {
            LOGGER.error("Failed to build FST index for scene: " + sceneName, e);
        }finally {
            // Clean up the temp file
            if (scene != null && scene.getLocalPath() != null) {
                cleanupTempFile(scene.getLocalPath());
            }
        }
    }

    private boolean downloadDataFromHive(FSTBuildScene sceneConfig) {
        try {
            String sql = sceneConfig.getHiveSql();
            String localInputFileName = INPUT_PATH + sceneConfig.getIndexName() + ".tsv";
            sceneConfig.setLocalPath(localInputFileName);

            HiveServiceResponse hiveServiceResponse = hiveService.downloadFileWithRetryR(sql, localInputFileName, 3);
            return hiveServiceResponse != null && hiveServiceResponse.isSuccess() && hiveServiceResponse.getQueryInfo() != null;
        } catch (Exception e) {
            LOGGER.error("Error downloading data from hive", e);
            return false;
        }
    }

    private void cleanupTempFile(String filePath) {
        try {
            Files.deleteIfExists(Paths.get(filePath));
            LOGGER.info("Cleaned up temp file: {}", filePath);
        } catch (IOException e) {
            LOGGER.warn("Failed to cleanup temp file: {}", filePath, e);
        }
    }

    public Iterable<? extends IndexableField> mapDoc(String line) {
        String content = line.trim();
        if (content.isEmpty() || content.length() < MINLEN) {
            return Collections.emptyList();
        }

        List<IndexableField> fields = new ArrayList<>();

        // 1. Store the original content
        fields.add(new StoredField(CorrectLuceneFields.Content, content));

        // 2. Core index field: ContentPrefix (Chinese substrings + pinyin prefixes + initial-letter prefixes)
        addContentPrefixFields(fields, content, normalizeString(content));

        return fields;
    }

    private void addContentPrefixFields(List<IndexableField> fields, String content, String queryClean) {
        // 1. Add Chinese substrings
        for (int i = 0; i < content.length(); i++) {
            for (int j = i + 1; j <= content.length(); j++) {
                String substring = content.substring(i, j);
                fields.add(new StringField(CorrectLuceneFields.ContentPrefix, substring, org.apache.lucene.document.Field.Store.NO));
            }
        }

        // 2. Add pinyin variants and their prefixes
        Set<String> allPinyins = getPyQuerys(queryClean, true);
        Set<String> addedPrefixes = new HashSet<>();
        for (String pinyin : allPinyins) {
            fields.add(new StringField(CorrectLuceneFields.ContentPrefix, pinyin, org.apache.lucene.document.Field.Store.NO));

            for (int len = 2; len < pinyin.length(); len++) {  // note: stops before the full length; the full form was added above
                String prefix = pinyin.substring(0, len);
                if (addedPrefixes.add(prefix)) {
                    fields.add(new StringField(CorrectLuceneFields.ContentPrefix, prefix, org.apache.lucene.document.Field.Store.NO));
                }
            }
        }
    }

    public Set<String> getPyQuerys(final String query, boolean addFirstAlpha) {
        Set<String> result = new HashSet<>();
        Set<String> fpy = new HashSet<String>();
        Set<String> apy = new HashSet<String>();
        sterotoner.getPinyinAll(query, apy, fpy);

        if (addFirstAlpha) {
            for (String py : fpy) {
                if (!py.trim().isEmpty()) {
                    result.add(py);
                }
            }
        }

        for (String py : apy) {
            if (!py.trim().isEmpty()) {
                result.add(py);
            }
        }

        return result;
    }

    public boolean buildIndex(FSTBuildScene scene) throws IOException {
        String localIndexPath = OUTPUT_PATH + scene.getIndexName();

        // Fully clean out the index directory
        cleanupIndexDirectory(localIndexPath);
        Files.createDirectories(Paths.get(localIndexPath));
        IndexWriterConfig writerConfig = new IndexWriterConfig();
        writerConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

        IndexWriter indexWriter = null;
        try {
            indexWriter = new IndexWriter(FSDirectory.open(Paths.get(localIndexPath)), writerConfig);

            // Read and process the input file
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(scene.getLocalPath()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String firstColumn = getFirstColumn(line);
                    if (firstColumn != null) {
                        Iterable<? extends IndexableField> doc = mapDoc(firstColumn);
                        if (!Iterables.isEmpty(doc)) {
                            indexWriter.addDocument(doc);
                        }
                    }
                }
            }

            indexWriter.commit();
            indexWriter.flush();
            indexWriter.forceMerge(1);

        } finally {
            if (indexWriter != null) {
                try {
                    indexWriter.close();
                } catch (IOException e) {
                    LOGGER.error("Failed to close IndexWriter", e);
                }
            }
        }

        uploadS3(scene.getIndexName());
        return true;
    }

    private String getFirstColumn(String line) {
        int firstSeparatorIndex = line.indexOf('\u0001');
        if (firstSeparatorIndex == -1) {
            // No separator: the whole line is the first column
            return line.trim().isEmpty() ? null : line.trim();
        }
        // Separator found: take everything before the first separator
        String firstColumn = line.substring(0, firstSeparatorIndex).trim();
        return firstColumn.isEmpty() ? null : firstColumn;
    }

    private void cleanupIndexDirectory(String indexPath) {
        try {
            File indexDir = new File(indexPath);
            if (indexDir.exists()) {
                File[] files = indexDir.listFiles();
                if (files != null) {
                    for (File file : files) {
                        if (file.isFile()) {
                            boolean deleted = file.delete();
                            LOGGER.info("Deleted index file: {}, success: {}", file.getName(), deleted);
                        }
                    }
                }
            }
        } catch (Exception e) {
            LOGGER.warn("Failed to cleanup index directory: {}", indexPath, e);
        }
    }

    public static String normalizeString(String phrase) {
        if (phrase == null || phrase.trim().length() == 0) {
            return null;
        }

        String formatQuery = phrase;
        if (formatQuery.endsWith(" 外卖")) {
            formatQuery = formatQuery.substring(0, phrase.length() - 3);
        } else if (formatQuery.startsWith("外卖 ")) {
            formatQuery = formatQuery.substring(3, phrase.length());
        }

        if (org.apache.commons.lang3.StringUtils.isEmpty(formatQuery)) {
            formatQuery = "外卖";
        }

        String halfstr = full2Half(formatQuery.toLowerCase());
        return removeSpaceEx(halfstr);
    }

    private void uploadS3(String indexName) {
        try {
            String bucketName = "index";
            String localIndexPath = OUTPUT_PATH + indexName;
            String s3KeyPrefix = REMOTE_INDEX_PATH + indexName;

            File indexDir = new File(localIndexPath);
            if (!indexDir.exists() || !indexDir.isDirectory()) {
                LOGGER.error("Index directory does not exist: {}", localIndexPath);
                return;
            }
            File[] indexFiles = indexDir.listFiles((dir, name) ->
                    !name.equals("write.lock") && !name.startsWith(".")
            );

            if (indexFiles == null || indexFiles.length == 0) {
                LOGGER.warn("No index files found to upload for: {}", indexName);
                return;
            }

            List<String> uploadedFiles = new ArrayList<>();

            // Upload the index files
            for (File file : indexFiles) {
                String s3Key = s3KeyPrefix + "/" + file.getName();
                LOGGER.info("Uploading file: {} to S3 key: {}", file.getName(), s3Key);
                s3Client.upload(bucketName, s3Key, file);
                uploadedFiles.add(file.getName());
            }

            // Generate and upload the file manifest
            uploadFileManifest(bucketName, s3KeyPrefix, uploadedFiles);

            LOGGER.info("Successfully uploaded {} index files and manifest for: {}",
                    uploadedFiles.size(), indexName);

        } catch (Exception e) {
            LOGGER.error("upload to s3 error", e);
        }
    }

    private void uploadFileManifest(String bucketName, String s3KeyPrefix, List<String> fileNames) {
        try {
            // Create a temporary manifest file
            String manifestFileName = "file_manifest.txt";
            String localManifestPath = OUTPUT_PATH + "temp_" + manifestFileName;

            // Write the file names to the manifest
            try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(localManifestPath))) {
                for (String fileName : fileNames) {
                    writer.write(fileName);
                    writer.newLine();
                }
            }

            // Upload the manifest file
            String manifestS3Key = s3KeyPrefix + "/" + manifestFileName;
            s3Client.upload(bucketName, manifestS3Key, new File(localManifestPath));

            // Delete the temporary manifest file
            Files.deleteIfExists(Paths.get(localManifestPath));

            LOGGER.info("Uploaded file manifest with {} files to: {}", fileNames.size(), manifestS3Key);

        } catch (Exception e) {
            LOGGER.error("Failed to upload file manifest", e);
        }
    }
}
java
// Query method
public List<String> queryFromIndex(String cleanQuery, String scene) {
    try {
        IndexSearcher indexSearcher = indexSearcherMap.get(scene);

        if (indexSearcher == null || indexSearcher.getIndexReader() == null ||
                indexSearcher.getIndexReader().leaves().isEmpty()) {
            LOGGER.warn("No index searcher found for scene: {}, available scenes: {}",
                    scene, indexSearcherMap.keySet());
            return new ArrayList<>();
        }


        List<LeafReaderContext> leaves = indexSearcher.getIndexReader().leaves();
        if (leaves.isEmpty()) {
            return new ArrayList<>();
        }
        LeafReader reader = leaves.get(0).reader();
        if (reader == null) {
            return new ArrayList<>();
        }

        Set<String> results = new HashSet<>();
        Term term = new Term("content_prefix", cleanQuery);
        PostingsEnum postings = reader.postings(term);
        if (postings != null) {
            int docId;
            while ((docId = postings.nextDoc()) != PostingsEnum.NO_MORE_DOCS) {
                Document doc = reader.document(docId);
                String content = doc.get("content");
                if (content != null) {
                    results.add(content);
                }
            }
        }

        return new ArrayList<>(results);
    } catch (Exception e) {
        LOGGER.error("Error querying index for scene: {}, query: {}", scene, cleanQuery, e);
        return new ArrayList<>();
    }
}