NLTK Keyword Extraction and Lightweight Search Engines (Whoosh, Elasticsearch)

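Background

Sometimes you want to query a keyword-based search engine with a complete sentence or a whole passage. Passing the entire text in as the query tends to give poor results, because the engine matches keywords rather than performing semantic search. So how do you extract the keywords?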

Extracting keywords with NLTK

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

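# Download NLTK resources (only needed once per environment)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load this tokenizer resource instead of 'punkt'
nltk.download('stopwords')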

def extract_keywords(text):
    # Tokenize the text
    words = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
    print('filtered words:', filtered_words)
    # Calculate word frequency
    freq_dist = FreqDist(filtered_words)

    # Extract keywords based on frequency or other criteria
    keywords = [word for word, freq in freq_dist.most_common(10)]  # Adjust the number of keywords as needed

    return keywords

if __name__ == '__main__':
    text = """
    Elasticsearch provides powerful search capabilities and is commonly used in production environments for large-scale document search and retrieval. However, it might be overkill for small projects or scenarios where simpler solutions like Whoosh are sufficient. Choose the solution that best fits your needs.
    """
    keywords = extract_keywords(text)
    print(keywords)

Output

filtered words: ['elasticsearch', 'provides', 'powerful', 'search', 'capabilities', 'commonly', 'used', 'production', 'environments', 'document', 'search', 'retrieval', 'however', 'might', 'overkill', 'small', 'projects', 'scenarios', 'simpler', 'solutions', 'like', 'whoosh', 'sufficient', 'choose', 'solution', 'best', 'fits', 'needs']
['search', 'elasticsearch', 'provides', 'powerful', 'capabilities', 'commonly', 'used', 'production', 'environments', 'document']
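
The pipeline above (tokenize, drop stop words, rank by frequency) can also be sketched with the standard library only, which may help when NLTK and its downloaded resources are not available. This is a rough approximation, not the NLTK behaviour: `extract_keywords_simple` and `STOP_WORDS` are hypothetical names, the stop word set is a deliberately tiny sample, and the regex splits hyphenated words such as "large-scale" into two tokens, whereas the NLTK version above drops them.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "for",
              "in", "it", "what", "how", "does", "to"}

def extract_keywords_simple(text, top_n=10):
    """Return up to top_n most frequent non-stop-word tokens, lowercased."""
    # Keep runs of letters and digits only; punctuation acts as a separator.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    filtered = [t for t in tokens if t not in STOP_WORDS]
    # most_common sorts by count; equal counts keep first-seen order.
    return [word for word, _ in Counter(filtered).most_common(top_n)]

if __name__ == "__main__":
    print(extract_keywords_simple("What is Angular and how does Angular work?"))
    # → ['angular', 'work']
```

As in the NLTK version, words with equal frequency come back in the order they first appear, so for short texts the ranking is mostly positional.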

Keyword-based search with Whoosh

import os

# Assumes the NLTK script above is saved as keywords_extractor.py
from keywords_extractor import extract_keywords

from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser

# Define the schema for the index
schema = Schema(question=TEXT(stored=True))

# create_in expects the directory to exist, so create it first.
# create_in starts a fresh index (clearing any existing one); use open_dir to reuse an index.
INDEX_DIR = "indexdir"
os.makedirs(INDEX_DIR, exist_ok=True)
ix = create_in(INDEX_DIR, schema)

# Free-text query to search with (its keywords are extracted further down)
doc_content = "what is angular"

# Index the documents
writer = ix.writer()

questions = ["How to implement autocomplete, I don't know?", "How does Angular work?", "how Python programming language", "Example question", "Another question"]

for question in questions:
    writer.add_document(question=question)

writer.commit()

# Search using keywords
search_keywords = extract_keywords(doc_content)
query_str = " OR ".join(search_keywords)
print(query_str)

with ix.searcher() as searcher:
    query_parser = QueryParser("question", ix.schema)
    query = query_parser.parse(query_str)
    results = searcher.search(query)

    for result in results:
        print(result)

Output

filtered words: ['angular']
angular
<Hit {'question': 'How does Angular work?'}>
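
One edge case in the search step: if the input consists only of stop words (for example "what is it"), `extract_keywords` returns an empty list and the joined query string is empty. A small guard, sketched below, makes that case explicit and also removes duplicate keywords. `build_or_query` is a hypothetical helper, and returning `None` so the caller can skip the search is an assumption about the desired behaviour, not something the original code does.

```python
def build_or_query(keywords):
    """Join keywords into an OR query string, or return None if nothing is left."""
    # Normalize, skip blank entries, and drop duplicates while keeping order.
    unique = list(dict.fromkeys(k.strip().lower() for k in keywords if k.strip()))
    if not unique:
        return None
    return " OR ".join(unique)

if __name__ == "__main__":
    print(build_or_query(["angular", "Angular", "work"]))  # → angular OR work
    print(build_or_query([]))                              # → None
```

The caller would then run the search only when the result is not `None`. Whoosh's `QueryParser` also accepts `group=OrGroup` (imported from `whoosh.qparser`), which combines space-separated terms with OR instead of the default AND and so avoids building the `OR` string by hand.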

Keyword-based search with Elasticsearch

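from elasticsearch import Elasticsearch

# Assumes the NLTK script above is saved as keywords_extractor.py
from keywords_extractor import extract_keywords

# Connect to the Elasticsearch server (make sure it's running).
# A URL with an explicit scheme is used because 8.x clients reject host dicts without one.
es = Elasticsearch("http://localhost:9200")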

# Create an index
index_name = "your_index_name"

if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name)

# Index a document (replace doc_content with the actual content of your documents)
doc_content = "This is the content of your document."
document = {"content": doc_content}

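es.index(index=index_name, body=document)

# Refresh so the new document is searchable right away; otherwise it only
# becomes visible after the next periodic refresh and the search below may return nothing
es.indices.refresh(index=index_name)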

# Search using keywords
search_keywords = extract_keywords(doc_content)
query_body = {
    "query": {
        "terms": {
            "content": search_keywords
        }
    }
}

results = es.search(index=index_name, body=query_body)

for hit in results['hits']['hits']:
    print(hit['_source'])
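
A note on the query type: a `terms` query is not analyzed, so it matches the extracted keywords exactly against the tokens stored in the index. That works here because `extract_keywords` lowercases its output and the default analyzer lowercases text too. A `match` query, which runs the query text through the field's analyzer, is a common alternative. The sketch below only builds the query dictionary; `build_match_query` is a hypothetical helper name, and the field name `content` is taken from the example above.

```python
def build_match_query(keywords, field="content"):
    """Build a match query that ORs the keywords on a single text field."""
    return {
        "match": {
            field: {
                # The keywords are joined into one string and analyzed by Elasticsearch.
                "query": " ".join(keywords),
                # "or" is the default operator; it is spelled out here for clarity.
                "operator": "or",
            }
        }
    }

if __name__ == "__main__":
    print(build_match_query(["content", "document"]))
```

With the code above, the result could be passed as `query_body = {"query": build_match_query(search_keywords)}`, leaving the rest of the search call unchanged.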