SugLucene Index Construction

Overview

A search-suggestion index builder based on Apache Lucene, dedicated to building inverted indexes that support both Chinese and pinyin search.

Core Features

1. Index Build Pipeline

Fetch config → Download data → Build index → Upload files → Clean up temp files

2. Key Characteristics

  • Arbitrary-position substring matching: searching any substring finds the full term
  • Mixed Chinese/pinyin search: full pinyin, initial-letter abbreviations, and pinyin prefixes are all searchable (see the lookup sketch after this list)
  • High-performance queries: backed by a Lucene inverted index, with millisecond-level response times
  • Automated builds: scheduled via Crane tasks, supporting periodic rebuilds
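
As a quick illustration, the lookups below are all expected to return the same entry. This is a minimal usage sketch: it assumes the queryFromIndex method shown later in this article and the "restaurant_suggest" scene from the configuration example, with "麦当劳" present in its corpus.

java
// Minimal usage sketch (queryFromIndex is shown later; the scene name and corpus are assumptions)
List<String> bySubstring = queryFromIndex("当劳", "restaurant_suggest");    // substring at an arbitrary position
List<String> byInitials  = queryFromIndex("mdl", "restaurant_suggest");     // pinyin initial letters
List<String> byPrefix    = queryFromIndex("maidang", "restaurant_suggest"); // pinyin prefix
// each of the three lists is expected to contain "麦当劳"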

Technical Architecture

1. Core Class Structure

java
@Service
public class SugLuceneIndexBuilder {
    // Configuration management
    @MtConfig(key = "sug.fst.index.config")
    private static volatile Map<String, FSTBuildScene> fstIndexConfig;
    
    // Core services
    @Autowired private HiveService hiveService;
    @Autowired private AmazonS3Client s3Client;
    
    // Pinyin handling
    private static Sterotoner sterotoner = new Sterotoner();
}

2. Field Definitions

java
public static class CorrectLuceneFields {
    public static String Content = "content";               // stored original content
    public static String ContentPrefix = "content_prefix";  // indexed search field
}

Index Construction in Detail

1. Data Acquisition

java
private boolean downloadDataFromHive(FSTBuildScene sceneConfig) {
    String sql = sceneConfig.getHiveSql();
    String localInputFileName = INPUT_PATH + sceneConfig.getIndexName() + ".tsv";
    
    HiveServiceResponse response = hiveService.downloadFileWithRetryR(sql, localInputFileName, 3);
    return response != null && response.isSuccess();
}

Highlights

  • Data is fetched by running a SQL query on Hive
  • Retries are supported (up to 3 attempts)
  • Data format: the file is named .tsv, but getFirstColumn in the full source splits on Hive's default '\u0001' field delimiter (see the sample line below)
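
As a concrete illustration of the downloaded format, here is a sketch of a single line and how only its first column is used (the second column is hypothetical; getFirstColumn, shown in the full source below, performs the actual split):

java
// Hypothetical Hive output line: columns joined by '\u0001'
String line = "麦当劳" + '\u0001' + "1";
// getFirstColumn(line) returns "麦当劳"; only the first column is indexed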

2. Core Document-Mapping Logic

A. Document Structure Generation

java
public Iterable<? extends IndexableField> mapDoc(String line) {
    String content = line.trim();
    List<IndexableField> fields = new ArrayList<>();
    
    // 1. Store the original content (returned in results)
    fields.add(new StoredField(CorrectLuceneFields.Content, content));
    
    // 2. Generate the indexed search fields
    addContentPrefixFields(fields, content, normalizeString(content));
    
    return fields;
}

B. Index Field Generation Strategy

java
private void addContentPrefixFields(List<IndexableField> fields, String content, String queryClean) {
    // 1. Enumerate all Chinese substrings - O(n²) substrings per entry
    for (int i = 0; i < content.length(); i++) {
        for (int j = i + 1; j <= content.length(); j++) {
            String substring = content.substring(i, j);
            fields.add(new StringField(CorrectLuceneFields.ContentPrefix, substring, Field.Store.NO));
        }
    }
    
    // 2. Generate pinyin variants
    Set<String> allPinyins = getPyQuerys(queryClean, true);
    for (String pinyin : allPinyins) {
        fields.add(new StringField(CorrectLuceneFields.ContentPrefix, pinyin, Field.Store.NO));
        
        // Generate pinyin prefixes (length >= 2, excluding the full form)
        for (int len = 2; len < pinyin.length(); len++) {
            String prefix = pinyin.substring(0, len);
            fields.add(new StringField(CorrectLuceneFields.ContentPrefix, prefix, Field.Store.NO));
        }
    }
}

3. Concrete Example: Index Generation for 麦当劳

Input Data

麦当劳

Generated Index Entries

text
# StoredField (returned in results)
content: "麦当劳"

# StringField (used for search)
content_prefix: "麦"
content_prefix: "麦当"
content_prefix: "麦当劳"
content_prefix: "当"
content_prefix: "当劳"
content_prefix: "劳"
content_prefix: "maidanglao"    # full pinyin
content_prefix: "mdl"           # initial-letter abbreviation
content_prefix: "md"            # prefix of the initials
content_prefix: "ma"            # pinyin prefix
content_prefix: "mai"           # pinyin prefix
content_prefix: "maid"          # pinyin prefix
content_prefix: "maida"         # pinyin prefix
content_prefix: "maidan"        # pinyin prefix
content_prefix: "maidang"       # pinyin prefix
content_prefix: "maidangl"      # pinyin prefix
content_prefix: "maidangla"     # pinyin prefix

Underlying Data Structures

1. Lucene Inverted Index Structure

text
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Term Dictionary │    │   Posting Lists  │    │ Document Store  │
├─────────────────┤    ├──────────────────┤    ├─────────────────┤
│ "麦"     → ptr1 │───→│ [doc1,doc5,doc10]│    │ doc1: "麦当劳"  │
│ "麦当"   → ptr2 │───→│ [doc1,doc8]      │    │ doc2: "劳动节"  │
│ "麦当劳" → ptr3 │───→│ [doc1]           │    │ doc3: "当当网"  │
│ "当"     → ptr4 │───→│ [doc1,doc3,doc7] │    │ ...             │
│ "当劳"   → ptr5 │───→│ [doc1,doc2]      │    │                 │
│ "劳"     → ptr6 │───→│ [doc1,doc4,doc9] │    │                 │
│ "mai"    → ptr7 │───→│ [doc1,doc6]      │    │                 │
│ "mdl"    → ptr8 │───→│ [doc1,doc11]     │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Generated Lucene Document Structure

java
// Each original text line produces one Lucene Document
Document {
    StoredField("content", "麦当劳"),           // returned in results
    
    StringField("content_prefix", "麦"),        // used for search
    StringField("content_prefix", "麦当"),      // used for search
    StringField("content_prefix", "麦当劳"),    // used for search
    StringField("content_prefix", "当"),        // used for search
    StringField("content_prefix", "当劳"),      // used for search
    StringField("content_prefix", "劳"),        // used for search
    StringField("content_prefix", "maidanglao"), // full pinyin
    StringField("content_prefix", "mdl"),       // initials
    StringField("content_prefix", "mai"),       // pinyin prefix
    StringField("content_prefix", "maid"),      // pinyin prefix
    StringField("content_prefix", "maida"),     // pinyin prefix
    StringField("content_prefix", "maidan"),    // pinyin prefix
    StringField("content_prefix", "maidang"),   // pinyin prefix
    StringField("content_prefix", "maidangl"),  // pinyin prefix
    StringField("content_prefix", "maidangla"), // pinyin prefix
    // ... more pinyin prefixes ("ma", "md", ...)
}

2. Search Process: Querying "麦"

Low-Level Query Implementation

java
public List<String> queryFromIndex(String cleanQuery, String scene) {
    try {
        IndexSearcher indexSearcher = indexSearcherMap.get(scene);
        
        // Get the leaf reader (a single segment after forceMerge(1))
        List<LeafReaderContext> leaves = indexSearcher.getIndexReader().leaves();
        LeafReader reader = leaves.get(0).reader();
        
        Set<String> results = new HashSet<>();
        
        // Build the query term
        Term term = new Term("content_prefix", cleanQuery);
        
        // Fetch the posting list
        PostingsEnum postings = reader.postings(term);
        if (postings != null) {
            int docId;
            while ((docId = postings.nextDoc()) != PostingsEnum.NO_MORE_DOCS) {
                Document doc = reader.document(docId);
                String content = doc.get("content");
                if (content != null) {
                    results.add(content);
                }
            }
        }

        return new ArrayList<>(results);
    } catch (Exception e) {
        LOGGER.error("Error querying index for scene: {}, query: {}", scene, cleanQuery, e);
        return new ArrayList<>();
    }
}

Query Execution Steps

text
1. Term lookup: content_prefix:"麦"
2. Fetch the posting list: [doc1, doc5, doc10, ...]
3. Retrieve documents: 
   - doc1 → "麦当劳"
   - doc5 → "麦咖啡"  
   - doc10 → "麦片粥"
4. Return results: ["麦当劳", "麦咖啡", "麦片粥"]
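
For comparison, the same lookup can also be expressed with Lucene's standard IndexSearcher / TermQuery API instead of walking the posting list by hand. This is only a sketch, not the implementation used by this project; it assumes the same indexSearcher as above:

java
// Classes come from org.apache.lucene.search, org.apache.lucene.index and org.apache.lucene.document
public List<String> lookup(IndexSearcher indexSearcher, String cleanQuery) throws IOException {
    Query query = new TermQuery(new Term("content_prefix", cleanQuery));
    TopDocs hits = indexSearcher.search(query, 20);       // top 20 matching documents
    List<String> results = new ArrayList<>();
    for (ScoreDoc sd : hits.scoreDocs) {
        Document doc = indexSearcher.doc(sd.doc);          // load the stored fields
        results.add(doc.get("content"));                    // e.g. "麦当劳", "麦咖啡"
    }
    return results;
}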

Pinyin Processing

1. Sterotoner Pinyin Conversion

java
public Set<String> getPyQuerys(final String query, boolean addFirstAlpha) {
    Set<String> result = new HashSet<>();
    Set<String> fpy = new HashSet<>();  // initial-letter pinyin
    Set<String> apy = new HashSet<>();  // full pinyin
    
    sterotoner.getPinyinAll(query, apy, fpy);
    
    if (addFirstAlpha) {
        result.addAll(fpy);  // add initial-letter abbreviations
    }
    result.addAll(apy);      // add full pinyin
    
    return result;
}

2. Pinyin Variant Example

text
Input: "麦当劳"
Output: 
- Full pinyin: "maidanglao"
- Initials: "mdl"
- Prefixes: "ma", "mai", "maid", "maida", "maidan", "maidang", "maidangl", "maidangla" (plus "md" from the initials)

File Paths and Storage

1. Local Path Configuration

java
public static final String DEFAULT_LOCAL_PATH = "/opt/meituan/dict/sug_fallback_indexes/";
public static final String INPUT_PATH = DEFAULT_LOCAL_PATH + "input/";
public static final String OUTPUT_PATH = DEFAULT_LOCAL_PATH + "output/";

2. File Layout

text
/opt/meituan/dict/sug_fallback_indexes/
├── input/
│   └── {indexName}.tsv           # raw data downloaded from Hive
└── output/
    └── {indexName}/
        ├── segments_1            # Lucene segments file
        ├── _0.cfe               # compound file entry table
        ├── _0.cfs               # compound file data
        ├── _0.si                # segment info
        └── write.lock           # write lock file

3. S3 Upload

java
private void uploadS3(String indexName) {
    String bucketName = "index";
    String s3KeyPrefix = "RecallIndexBackup/FallbackIndex/" + indexName;
    
    // Upload all index files (except write.lock)
    // Generate and upload the file manifest file_manifest.txt
}

Performance Characteristics

1. Time Complexity

  • Index construction: O(n³) per entry, where n is the average string length (O(n²) substrings, each up to O(n) characters long)
  • Query: O(log m + k), where m is the size of the term dictionary and k the number of results

2. Space Complexity

  • Index size: roughly 10-20x the raw data (see the per-entry term estimate below)
  • Memory usage: high during the build (note that buildIndex streams the input line by line, so the raw data itself is not all held in memory at once)
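
A back-of-the-envelope estimate of how many content_prefix terms a single entry produces, following addContentPrefixFields above (a sketch; the exact count depends on which pinyin variants Sterotoner emits):

java
// Rough per-entry term count for "麦当劳" (3 characters)
int n = 3;
int chineseSubstrings = n * (n + 1) / 2;                   // 6: 麦, 麦当, 麦当劳, 当, 当劳, 劳
int fullPinyinTerms   = 1 + ("maidanglao".length() - 2);   // 9: maidanglao plus prefixes "ma" .. "maidangla"
int initialTerms      = 1 + ("mdl".length() - 2);          // 2: "mdl" plus "md"
int total = chineseSubstrings + fullPinyinTerms + initialTerms;
System.out.println(total);                                  // ~17 indexed terms for one 3-character entry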

3. Query Performance

  • Exact match: millisecond-level response
  • Prefix queries: millisecond-level response
  • Arbitrary substrings: millisecond-level response

Configuration Management

1. MCC Configuration

java
@MtConfig(clientId = MccConfiguration.ID, 
          key = "sug.fst.index.config", 
          converter = FSTBuildSceneConverter.class)
private static volatile Map<String, FSTBuildScene> fstIndexConfig;

2. Example Configuration

json
{
  "restaurant_suggest": {
    "hiveSql": "SELECT name FROM restaurant_table WHERE status = 1",
    "indexName": "restaurant_suggest",
    "columnNum": 1
  }
}
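
The JSON above is deserialized into FSTBuildScene objects. The class itself is not shown in this article; the following is only a minimal sketch of the shape implied by the JSON keys and by the getters/setters the builder calls (getHiveSql, getIndexName, getLocalPath, setLocalPath):

java
// Sketch of FSTBuildScene as implied by its usage (not the actual class definition)
public class FSTBuildScene {
    private String hiveSql;    // Hive query that produces the suggestion corpus
    private String indexName;  // used for local file names and the S3 key prefix
    private int columnNum;     // number of columns in the Hive output
    private String localPath;  // set after the data file has been downloaded

    public String getHiveSql() { return hiveSql; }
    public String getIndexName() { return indexName; }
    public String getLocalPath() { return localPath; }
    public void setLocalPath(String localPath) { this.localPath = localPath; }
    // setters for hiveSql / indexName / columnNum omitted; fastjson needs them (or public fields) to deserialize
}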

Task Scheduling

1. Crane Task

java
@Crane("sug.fallback.fst.build.universal.task")
public void buildFSTIndex(String sceneName) {
    // index build logic
}

2. Execution Flow

text
1. Fetch the scene configuration
2. Download data from Hive
3. Build the Lucene index
4. Upload to S3
5. Clean up temp files

Optimization Strategies

1. Index Optimization

java
indexWriter.commit();
indexWriter.flush();
indexWriter.forceMerge(1);  // force-merge into a single segment, so the query side only needs leaves.get(0)

2. String Normalization

java
public static String normalizeString(String phrase) {
    // 1. Strip a leading/trailing "外卖" token
    // 2. Convert full-width characters to half-width
    // 3. Lowercase
    // 4. Remove extra spaces
    return removeSpaceEx(full2Half(formatQuery.toLowerCase()));
}
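
A few illustrative calls and their expected results (a sketch; the exact behavior of full2Half and removeSpaceEx is assumed to match the comments above):

java
normalizeString("麦当劳 外卖");   // trailing " 外卖" stripped           -> "麦当劳"
normalizeString("外卖 麦当劳");   // leading "外卖 " stripped            -> "麦当劳"
normalizeString("ABC咖啡");     // full-width letters halved, lowercased -> "abc咖啡"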

Applicable Scenarios

✅ Good fits

  • Search suggestions / autocomplete
  • Arbitrary-position substring matching
  • Mixed Chinese/pinyin search
  • High-frequency query scenarios

❌ Poor fits

  • Complex full-text search
  • Relevance scoring and ranking
  • Frequently updated indexes
  • Extremely memory-sensitive environments

Monitoring Metrics

1. Build Metrics

  • Index build time
  • Index file size
  • Number of index entries generated

2. Query Metrics

  • Query response time
  • Query accuracy
  • Memory usage

Caveats

1. Data Quality

  • Deduplicate the input data
  • Filter out strings that are too short (< MINLEN)
  • Handle special characters and encoding issues

2. Resource Management

  • Memory usage is high during the build
  • Sufficient disk space is needed to store the index
  • S3 uploads consume network bandwidth

3. Maintenance Cost

  • The index is immutable; updates require a full rebuild
  • Temporary files need to be cleaned up regularly
  • Build task status should be monitored

This implementation trades storage space for query performance, which makes it particularly well suited to search suggestions, where response-time requirements are extremely strict.

java
// lyx/fst-index-builder

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.TypeReference;
import com.cip.crane.client.spring.annotation.Crane;
import com.google.common.base.Splitter;
import com.google.common.collect.Lists;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.common.collect.Iterables;
import com.sankuai.meituan.config.annotation.MtConfig;
import com.sankuai.meituan.config.configuration.MccConfiguration;
import com.sankuai.meituan.config.exception.MtConfigException;
import com.sankuai.meituan.config.function.MtConfigConverter;
import com.sankuai.meituan.waimai.d.search.offline.similarity.relevance.common.pinyin.Sterotoner;
import com.sankuai.meituan.waimai.traffic.offline.task.domain.HiveServiceResponse;
import com.sankuai.meituan.waimai.traffic.offline.task.service.thirdparty.HiveService;
import com.sankuai.meituan.waimai.traffic.offline.task.util.FSTBuildScene;
import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

import static com.sankuai.meituan.waimai.d.search.util.StringHelp.full2Half;
import static com.sankuai.meituan.waimai.traffic.offline.task.util.StringUtil.removeSpaceEx;

@Service
public class SugLuceneIndexBuilder {
    private static final Logger LOGGER = LoggerFactory.getLogger(SugLuceneIndexBuilder.class);

    // Local paths for files downloaded from Hive
    public static final String DEFAULT_LOCAL_PATH = "/opt/meituan/dict/sug_fallback_indexes/";
//    public static final String DEFAULT_LOCAL_PATH = "/Users/longyuxin/Desktop/work/waimai_d_traffic_data_source/waimai-d-traffic-offline-task-service/src/test/resources/test_index";
    public static final String INPUT_PATH = DEFAULT_LOCAL_PATH + "input/";
    public static final String OUTPUT_PATH = DEFAULT_LOCAL_PATH + "output/";
    public static final String REMOTE_INDEX_PATH = "RecallIndexBackup/FallbackIndex/";

    @Autowired
    private HiveService hiveService;
    @Autowired
    private AmazonS3Client s3Client;

    private static Sterotoner sterotoner = new Sterotoner();

    private static final int MINLEN = 2;

    public static class CorrectLuceneFields {
        public static String Content = "content";               // original content
        public static String ContentPrefix = "content_prefix";  // substrings and pinyin variants
    }

    @MtConfig(clientId =  MccConfiguration.ID, key = "sug.fst.index.config", converter = FSTBuildSceneConverter.class)
    private static volatile Map<String, FSTBuildScene> fstIndexConfig = new ConcurrentHashMap<>();


    public static class FSTBuildSceneConverter implements MtConfigConverter<Map<String, FSTBuildScene>> {
        @Override
        public Map<String, FSTBuildScene> convert(Field field, String key, String newValue) throws MtConfigException {
            LOGGER.info("config:{} changed,  newValue:{}", key, newValue);
            Map<String, FSTBuildScene> result = new HashMap<>();
            if (StringUtils.isNotBlank(newValue)) {
                try {
                    result = JSON.parseObject(newValue, new TypeReference<Map<String, FSTBuildScene>>() {});
                }catch (Exception e) {
                    LOGGER.warn("parse FSTBuildScene error! key:{}", key, e);
                }
            }
            return result;
        }
    }

    @Crane("sug.fallback.fst.build.universal.task")
    public void buildFSTIndex(String sceneName) {
        FSTBuildScene scene = null;
        try {
            LOGGER.info("Starting FST index build for scene: {}", sceneName);

            // 1. Fetch the scene configuration
            scene = fstIndexConfig.get(sceneName);

            if (scene == null || scene.getHiveSql().isEmpty()) {
                LOGGER.error("Invalid scene config " + sceneName);
                return;
            }

            // 2. Download data from Hive and keep it locally as a temp file
            if (!downloadDataFromHive(scene)) {
                LOGGER.error("Fetch data failed for scene " + sceneName);
                return;
            }

            // 3. Process the data and build the index
            buildIndex(scene);

        }  catch (Exception e) {
            LOGGER.error("Failed to build FST index for scene: " + sceneName, e);
        }finally {
            // Clean up the temp file
            if (scene != null && scene.getLocalPath() != null) {
                cleanupTempFile(scene.getLocalPath());
            }
        }
    }

    private boolean downloadDataFromHive(FSTBuildScene sceneConfig) {
        try {
            String sql = sceneConfig.getHiveSql();
            String localInputFileName = INPUT_PATH + sceneConfig.getIndexName() + ".tsv";
            sceneConfig.setLocalPath(localInputFileName);

            HiveServiceResponse hiveServiceResponse = hiveService.downloadFileWithRetryR(sql, localInputFileName, 3);
            return hiveServiceResponse != null && hiveServiceResponse.isSuccess() && hiveServiceResponse.getQueryInfo() != null;
        } catch (Exception e) {
            LOGGER.error("Error downloading data from hive", e);
            return false;
        }
    }

    private void cleanupTempFile(String filePath) {
        try {
            Files.deleteIfExists(Paths.get(filePath));
            LOGGER.info("Cleaned up temp file: {}", filePath);
        } catch (IOException e) {
            LOGGER.warn("Failed to cleanup temp file: {}", filePath, e);
        }
    }

    public Iterable<? extends IndexableField> mapDoc(String line) {
        String content = line.trim();
        if (content.isEmpty() || content.length() < MINLEN) {
            return Collections.emptyList();
        }

        List<IndexableField> fields = new ArrayList<>();

        // 1. Store the original content
        fields.add(new StoredField(CorrectLuceneFields.Content, content));

        // 2. Core index field: ContentPrefix (Chinese substrings + pinyin prefixes + initial-letter prefixes)
        addContentPrefixFields(fields, content, normalizeString(content));

        return fields;
    }

    private void addContentPrefixFields(List<IndexableField> fields, String content, String queryClean) {
        // 1. Add Chinese substrings
        for (int i = 0; i < content.length(); i++) {
            for (int j = i + 1; j <= content.length(); j++) {
                String substring = content.substring(i, j);
                fields.add(new StringField(CorrectLuceneFields.ContentPrefix, substring, org.apache.lucene.document.Field.Store.NO));
            }
        }

        // 2. Add pinyin variants and their prefixes
        Set<String> allPinyins = getPyQuerys(queryClean, true);
        Set<String> addedPrefixes = new HashSet<>();
        for (String pinyin : allPinyins) {
            fields.add(new StringField(CorrectLuceneFields.ContentPrefix, pinyin, org.apache.lucene.document.Field.Store.NO));

            for (int len = 2; len < pinyin.length(); len++) {  // note: stops before the full length; the full form was added above
                String prefix = pinyin.substring(0, len);
                if (addedPrefixes.add(prefix)) {
                    fields.add(new StringField(CorrectLuceneFields.ContentPrefix, prefix, org.apache.lucene.document.Field.Store.NO));
                }
            }
        }
    }

    public Set<String> getPyQuerys(final String query, boolean addFirstAlpha) {
        Set<String> result = new HashSet<>();
        Set<String> fpy = new HashSet<String>();
        Set<String> apy = new HashSet<String>();
        sterotoner.getPinyinAll(query, apy, fpy);

        if (addFirstAlpha) {
            for (String py : fpy) {
                if (!py.trim().isEmpty()) {
                    result.add(py);
                }
            }
        }

        for (String py : apy) {
            if (!py.trim().isEmpty()) {
                result.add(py);
            }
        }

        return result;
    }

    public boolean buildIndex(FSTBuildScene scene) throws IOException {
        String localIndexPath = OUTPUT_PATH + scene.getIndexName();

        // Fully clean out the index directory
        cleanupIndexDirectory(localIndexPath);
        Files.createDirectories(Paths.get(localIndexPath));
        IndexWriterConfig writerConfig = new IndexWriterConfig();
        writerConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

        IndexWriter indexWriter = null;
        try {
            indexWriter = new IndexWriter(FSDirectory.open(Paths.get(localIndexPath)), writerConfig);

            // Read and process the input file
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(scene.getLocalPath()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String firstColumn = getFirstColumn(line);
                    if (firstColumn != null) {
                        Iterable<? extends IndexableField> doc = mapDoc(firstColumn);
                        if (!Iterables.isEmpty(doc)) {
                            indexWriter.addDocument(doc);
                        }
                    }
                }
            }

            indexWriter.commit();
            indexWriter.flush();
            indexWriter.forceMerge(1);

        } finally {
            if (indexWriter != null) {
                try {
                    indexWriter.close();
                } catch (IOException e) {
                    LOGGER.error("Failed to close IndexWriter", e);
                }
            }
        }

        uploadS3(scene.getIndexName());
        return true;
    }

    private String getFirstColumn(String line) {
        int firstSeparatorIndex = line.indexOf('\u0001');
        if (firstSeparatorIndex == -1) {
            // No separator: the whole line is the first column
            return line.trim().isEmpty() ? null : line.trim();
        }
        // Separator found: take everything before the first separator
        String firstColumn = line.substring(0, firstSeparatorIndex).trim();
        return firstColumn.isEmpty() ? null : firstColumn;
    }

    private void cleanupIndexDirectory(String indexPath) {
        try {
            File indexDir = new File(indexPath);
            if (indexDir.exists()) {
                File[] files = indexDir.listFiles();
                if (files != null) {
                    for (File file : files) {
                        if (file.isFile()) {
                            boolean deleted = file.delete();
                            LOGGER.info("Deleted index file: {}, success: {}", file.getName(), deleted);
                        }
                    }
                }
            }
        } catch (Exception e) {
            LOGGER.warn("Failed to cleanup index directory: {}", indexPath, e);
        }
    }

    public static String normalizeString(String phrase) {
        if (phrase == null || phrase.trim().length() == 0) {
            return null;
        }

        String formatQuery = phrase;
        if (formatQuery.endsWith(" 外卖")) {
            formatQuery = formatQuery.substring(0, phrase.length() - 3);
        } else if (formatQuery.startsWith("外卖 ")) {
            formatQuery = formatQuery.substring(3, phrase.length());
        }

        if (org.apache.commons.lang3.StringUtils.isEmpty(formatQuery)) {
            formatQuery = "外卖";
        }

        String halfstr = full2Half(formatQuery.toLowerCase());
        return removeSpaceEx(halfstr);
    }

    private void uploadS3(String indexName) {
        try {
            String bucketName = "index";
            String localIndexPath = OUTPUT_PATH + indexName;
            String s3KeyPrefix = REMOTE_INDEX_PATH + indexName;

            File indexDir = new File(localIndexPath);
            if (!indexDir.exists() || !indexDir.isDirectory()) {
                LOGGER.error("Index directory does not exist: {}", localIndexPath);
                return;
            }
            File[] indexFiles = indexDir.listFiles((dir, name) ->
                    !name.equals("write.lock") && !name.startsWith(".")
            );

            if (indexFiles == null || indexFiles.length == 0) {
                LOGGER.warn("No index files found to upload for: {}", indexName);
                return;
            }

            List<String> uploadedFiles = new ArrayList<>();

            // Upload the index files
            for (File file : indexFiles) {
                String s3Key = s3KeyPrefix + "/" + file.getName();
                LOGGER.info("Uploading file: {} to S3 key: {}", file.getName(), s3Key);
                s3Client.upload(bucketName, s3Key, file);
                uploadedFiles.add(file.getName());
            }

            // Generate and upload the file manifest
            uploadFileManifest(bucketName, s3KeyPrefix, uploadedFiles);

            LOGGER.info("Successfully uploaded {} index files and manifest for: {}",
                    uploadedFiles.size(), indexName);

        } catch (Exception e) {
            LOGGER.error("upload to s3 error", e);
        }
    }

    private void uploadFileManifest(String bucketName, String s3KeyPrefix, List<String> fileNames) {
        try {
            // Create a temporary manifest file
            String manifestFileName = "file_manifest.txt";
            String localManifestPath = OUTPUT_PATH + "temp_" + manifestFileName;

            // Write the file names to the manifest
            try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(localManifestPath))) {
                for (String fileName : fileNames) {
                    writer.write(fileName);
                    writer.newLine();
                }
            }

            // Upload the manifest file
            String manifestS3Key = s3KeyPrefix + "/" + manifestFileName;
            s3Client.upload(bucketName, manifestS3Key, new File(localManifestPath));

            // Delete the temporary manifest file
            Files.deleteIfExists(Paths.get(localManifestPath));

            LOGGER.info("Uploaded file manifest with {} files to: {}", fileNames.size(), manifestS3Key);

        } catch (Exception e) {
            LOGGER.error("Failed to upload file manifest", e);
        }
    }
}
java
// Query method
public List<String> queryFromIndex(String cleanQuery, String scene) {
    try {
        IndexSearcher indexSearcher = indexSearcherMap.get(scene);

        if (indexSearcher == null || indexSearcher.getIndexReader() == null ||
                indexSearcher.getIndexReader().leaves().isEmpty()) {
            LOGGER.warn("No index searcher found for scene: {}, available scenes: {}",
                    scene, indexSearcherMap.keySet());
            return new ArrayList<>();
        }


        List<LeafReaderContext> leaves = indexSearcher.getIndexReader().leaves();
        if (leaves.isEmpty()) {
            return new ArrayList<>();
        }
        LeafReader reader = leaves.get(0).reader();
        if (reader == null) {
            return new ArrayList<>();
        }

        Set<String> results = new HashSet<>();
        Term term = new Term("content_prefix", cleanQuery);
        PostingsEnum postings = reader.postings(term);
        if (postings != null) {
            int docId;
            while ((docId = postings.nextDoc()) != PostingsEnum.NO_MORE_DOCS) {
                Document doc = reader.document(docId);
                String content = doc.get("content");
                if (content != null) {
                    results.add(content);
                }
            }
        }

        return new ArrayList<>(results);
    } catch (Exception e) {
        LOGGER.error("Error querying index for scene: {}, query: {}", scene, cleanQuery, e);
        return new ArrayList<>();
    }
}