nltk关键字抽取与轻量级搜索引擎(Whoosh, ElasticSearcher)

背景

有时候你想用一句完整的话或一个文本在基于关键字的搜索引擎里搜索,但是如果把整个文本放进去搜索的话,效果不是很好,因为你的搜索引擎是基于关键字而不是sematic search。那怎么抽取关键字呢?

利用NLTK抽取关键的代码

python 复制代码
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

def extract_keywords(text):
    # Tokenize the text
    words = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
    print('filtered words:', filtered_words)
    # Calculate word frequency
    freq_dist = FreqDist(filtered_words)

    # Extract keywords based on frequency or other criteria
    keywords = [word for word, freq in freq_dist.most_common(10)]  # Adjust the number of keywords as needed

    return keywords

if __name__ == '__main__':
    text = """
    Elasticsearch provides powerful search capabilities and is commonly used in production environments for large-scale document search and retrieval. However, it might be overkill for small projects or scenarios where simpler solutions like Whoosh are sufficient. Choose the solution that best fits your needs.
    """
    keywords = extract_keywords(text)
    print(keywords)

执行结果

python 复制代码
filtered words: ['elasticsearch', 'provides', 'powerful', 'search', 'capabilities', 'commonly', 'used', 'production', 'environments', 'document', 'search', 'retrieval', 'however', 'might', 'overkill', 'small', 'projects', 'scenarios', 'simpler', 'solutions', 'like', 'whoosh', 'sufficient', 'choose', 'solution', 'best', 'fits', 'needs']
['search', 'elasticsearch', 'provides', 'powerful', 'capabilities', 'commonly', 'used', 'production', 'environments', 'document']

基于关键的搜索-whoosh

python 复制代码
from keywords_extractor import *

from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser

# Define the schema for the index
schema = Schema(question=TEXT(stored=True))

# Create or open the index
INDEX_DIR = "indexdir"
ix = create_in(INDEX_DIR, schema)  # Use create_in for creating a new index or open_dir for opening an existing one

# Index your documents (replace doc_content with the actual content of your documents)
writer = ix.writer()
doc_content = "what is angular"

questions = ["How to implement autocomplete, I don't know?", "How does Angular work?", "how Python programming language", "Example question", "Another question"]

for question in questions:
    writer.add_document(question=question)

writer.commit()

# Search using keywords
search_keywords = extract_keywords(doc_content)
query_str = " OR ".join(search_keywords)
print(query_str)

with ix.searcher() as searcher:
    query_parser = QueryParser("question", ix.schema)
    query = query_parser.parse(query_str)
    results = searcher.search(query)

    for result in results:
        print(result)

执行结果

python 复制代码
filtered words: ['angular']
angular
<Hit {'question': 'How does Angular work?'}>
python 复制代码
from elasticsearch import Elasticsearch

# Connect to the Elasticsearch server (make sure it's running)
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Create an index
index_name = "your_index_name"

if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, ignore=400)

# Index a document (replace doc_content with the actual content of your documents)
doc_content = "This is the content of your document."
document = {"content": doc_content}

es.index(index=index_name, body=document)

# Search using keywords
search_keywords = extract_keywords(doc_content)
query_body = {
    "query": {
        "terms": {
            "content": search_keywords
        }
    }
}

results = es.search(index=index_name, body=query_body)

for hit in results['hits']['hits']:
    print(hit['_source'])
相关推荐
AI行业学习10 分钟前
CC-Switch v3.16.1 官方下载 | 安装配置详细教程【2026.6.10】
java·开发语言·vue.js·python·mysql·eclipse·html
早起CaiCai35 分钟前
【Pytorch 实践1】手写数字
人工智能·pytorch·python
吴梓穆1 小时前
Python 语法基础 函数
开发语言·python
Kobebryant-Manba1 小时前
学习文本处理
开发语言·python
m0_617493941 小时前
PaddleOCR报错:OneDnnContext does not have the input Filter 解决方案汇总
python
李可以量化1 小时前
量化迅投 QMT vs 聚宽 (JoinQuant)全面分析
python·量化·qmt·ptrade·聚宽
旅僧1 小时前
运行UMI镜像
python
ellenwan20261 小时前
期货跨期价差程序化怎么做:天勤 SP 合约与腿比例核对
python·区块链
月疯1 小时前
torch:expand和repeate的区别
开发语言·python·深度学习
顾林海2 小时前
Agent入门阶段-编程基础-Python:Python 开发环境与运行方式
python·agent·ai编程