nltk关键字抽取与轻量级搜索引擎(Whoosh, ElasticSearcher)

背景

有时候你想用一句完整的话或一个文本在基于关键字的搜索引擎里搜索,但是如果把整个文本放进去搜索的话,效果不是很好,因为你的搜索引擎是基于关键字而不是sematic search。那怎么抽取关键字呢?

利用NLTK抽取关键的代码

python 复制代码
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

def extract_keywords(text):
    # Tokenize the text
    words = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
    print('filtered words:', filtered_words)
    # Calculate word frequency
    freq_dist = FreqDist(filtered_words)

    # Extract keywords based on frequency or other criteria
    keywords = [word for word, freq in freq_dist.most_common(10)]  # Adjust the number of keywords as needed

    return keywords

if __name__ == '__main__':
    text = """
    Elasticsearch provides powerful search capabilities and is commonly used in production environments for large-scale document search and retrieval. However, it might be overkill for small projects or scenarios where simpler solutions like Whoosh are sufficient. Choose the solution that best fits your needs.
    """
    keywords = extract_keywords(text)
    print(keywords)

执行结果

python 复制代码
filtered words: ['elasticsearch', 'provides', 'powerful', 'search', 'capabilities', 'commonly', 'used', 'production', 'environments', 'document', 'search', 'retrieval', 'however', 'might', 'overkill', 'small', 'projects', 'scenarios', 'simpler', 'solutions', 'like', 'whoosh', 'sufficient', 'choose', 'solution', 'best', 'fits', 'needs']
['search', 'elasticsearch', 'provides', 'powerful', 'capabilities', 'commonly', 'used', 'production', 'environments', 'document']

基于关键的搜索-whoosh

python 复制代码
from keywords_extractor import *

from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser

# Define the schema for the index
schema = Schema(question=TEXT(stored=True))

# Create or open the index
INDEX_DIR = "indexdir"
ix = create_in(INDEX_DIR, schema)  # Use create_in for creating a new index or open_dir for opening an existing one

# Index your documents (replace doc_content with the actual content of your documents)
writer = ix.writer()
doc_content = "what is angular"

questions = ["How to implement autocomplete, I don't know?", "How does Angular work?", "how Python programming language", "Example question", "Another question"]

for question in questions:
    writer.add_document(question=question)

writer.commit()

# Search using keywords
search_keywords = extract_keywords(doc_content)
query_str = " OR ".join(search_keywords)
print(query_str)

with ix.searcher() as searcher:
    query_parser = QueryParser("question", ix.schema)
    query = query_parser.parse(query_str)
    results = searcher.search(query)

    for result in results:
        print(result)

执行结果

python 复制代码
filtered words: ['angular']
angular
<Hit {'question': 'How does Angular work?'}>
python 复制代码
from elasticsearch import Elasticsearch

# Connect to the Elasticsearch server (make sure it's running)
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Create an index
index_name = "your_index_name"

if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, ignore=400)

# Index a document (replace doc_content with the actual content of your documents)
doc_content = "This is the content of your document."
document = {"content": doc_content}

es.index(index=index_name, body=document)

# Search using keywords
search_keywords = extract_keywords(doc_content)
query_body = {
    "query": {
        "terms": {
            "content": search_keywords
        }
    }
}

results = es.search(index=index_name, body=query_body)

for hit in results['hits']['hits']:
    print(hit['_source'])
相关推荐
FinAnalyzer8 分钟前
如何在 InsCodeAI 上搭建并使用 Jupyter Notebook 环境?
ide·python·jupyter
java1234_小锋10 分钟前
【NLP舆情分析】基于python微博舆情分析可视化系统(flask+pandas+echarts) 视频教程 - 微博文章数据可视化分析-文章分类下拉框实现
python·自然语言处理·flask
檀越剑指大厂10 分钟前
【Python系列】Flask 应用中的主动垃圾回收
开发语言·python·flask
檀越剑指大厂17 分钟前
【Python系列】使用 memory_profiler 诊断 Flask 应用内存问题
开发语言·python·flask
WXX_s29 分钟前
【OpenCV篇】OpenCV——03day.图像预处理(2)
人工智能·python·opencv·学习·计算机视觉
Jackilina_Stone2 小时前
【论文|复现】YOLOFuse:面向多模态目标检测的双流融合框架
人工智能·python·目标检测·计算机视觉·融合
双叶8362 小时前
(Python)文件储存的认识,文件路径(文件储存基础教程)(Windows系统文件路径)(基础教程)
开发语言·windows·python
枫昕柚2 小时前
python
开发语言·python
木头左3 小时前
自动驾驶领域中的Python机器学习
python·机器学习·自动驾驶
Dxy12393102163 小时前
Python Requests-HTML库详解:从入门到实战
开发语言·python·html