This article shows how to build an agent that parses free-text input, understands the intent, and searches the local file system for matching files. It also provides a web page for querying the local file system interactively.
1. Feature Overview
The main function takes a user's natural-language description of the search conditions, for example:
```
find the pdf file name contains resume in disk D:\ updated since Nov 1st 2025
```
The LLM parses each condition in the user's input, and the agent then searches the local file system accordingly.
The agent can search local files by the following criteria:
- 文件名称
- 文件类型
- 文件修改日期
- 文件大小
- 文件内容
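For the sample query shown above, the LLM is expected to return a structured parameter set along these lines. This is a sketch with hypothetical values: the field names follow the JSON schema used in the parsing prompt later in this article, and the exact `days_old` depends on the current date.

```python
# Hypothetical parameters extracted from:
# "find the pdf file name contains resume in disk D:\ updated since Nov 1st 2025"
parsed = {
    "name_pattern": "*resume*",   # file name must contain "resume"
    "extensions": [".pdf"],       # restrict to PDF files
    "days_old": 40,               # days between "now" and Nov 1st 2025 in the article's example
    "min_size": None,             # no size constraints in this query
    "max_size": None,
    "keyword": None,              # no content keyword
    "search_path": "D:\\",        # drive named in the query
}
print(sorted(parsed))
```

The agent then only has to walk `search_path` and test each file against these fields, which is exactly what the code below does.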
2. Dependencies
Reference: https://blog.csdn.net/jimmyleeee/article/details/155646865
The dependency list differs slightly from that article; install the libraries as follows:
```shell
pip install langchain-community langchain-ollama streamlit langchain_core sounddevice scipy SpeechRecognition torch torchvision torchaudio
```
3. Building the Agent That Parses User Input and Runs the Search
All of the functionality is implemented by the NaturalLanguageFileSearchAgent class, which encapsulates everything: calling the LLM to parse the parameters out of the input, then searching the local file system using those parsed parameters. The code is as follows:
```python
import os
import fnmatch
import json
import re
from datetime import datetime, timedelta
from typing import List, Dict, Any, Optional
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
class NaturalLanguageFileSearchAgent:
"""
An intelligent agent that understands natural language queries for file searching using Ollama/Qwen2.5
"""
def __init__(self, model="qwen2.5", base_url="http://localhost:11434"):
self.search_history = []
# Define common file types and their extensions
self.file_types = {
'document': ['.doc', '.docx', '.txt', '.rtf', '.odt', '.wpd'],
'word': ['.doc', '.docx'],
'excel': ['.xls', '.xlsx', '.csv', '.ods'],
'spreadsheet': ['.xls', '.xlsx', '.csv', '.ods'],
'powerpoint': ['.ppt', '.pptx', '.odp'],
'presentation': ['.ppt', '.pptx', '.odp'],
'pdf': ['.pdf'],
'image': ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.svg'],
'photo': ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff'],
'video': ['.mp4', '.avi', '.mkv', '.mov', '.wmv', '.flv'],
'audio': ['.mp3', '.wav', '.flac', '.aac', '.ogg'],
'music': ['.mp3', '.wav', '.flac', '.aac', '.ogg'],
'code': ['.py', '.java', '.cpp', '.js', '.html', '.css', '.php'],
'archive': ['.zip', '.rar', '.7z', '.tar', '.gz'],
'compressed': ['.zip', '.rar', '.7z', '.tar', '.gz']
}
# Define time expressions
self.time_expressions = {
'today': 1,
'yesterday': 2,
'week': 7,
'month': 30,
'year': 365
}
# Define common drive letters for Windows
self.common_drives = ['C:', 'D:', 'E:', 'F:', 'G:', 'H:']
# Initialize Ollama model
self.llm = ChatOllama(model=model, base_url=base_url)
# Define prompt for parsing natural language queries
self.parse_prompt = ChatPromptTemplate.from_messages([
("system", """You are an intelligent file search assistant. Your task is to parse natural language queries
and extract structured search parameters. Always respond in valid JSON format with the following keys:
- name_pattern: file name pattern to search for (string or null)
- extensions: list of file extensions to include (array or null)
- days_old: number of days relative to now (integer or null)
* For "updated since [past date]": positive number of days ago
* For "updated since [future date]": negative number
* Example: If today is Dec 11, 2025 and query says "since Nov 1, 2025", then days_old = 40
- min_size: minimum file size in bytes (integer or null)
- max_size: maximum file size in bytes (integer or null)
- keyword: content keyword to search for (string or null)
- search_path: directory path to search in (string or "SYSTEM_WIDE" for system-wide search)
Examples:
Query: "Find recent Word documents from last 10 days"
Response: {"name_pattern": null, "extensions": [".doc", ".docx"], "days_old": 10, "min_size": null, "max_size": null, "keyword": null, "search_path": "SYSTEM_WIDE"}
Query: "Show me PDF files on my desktop"
Response: {"name_pattern": null, "extensions": [".pdf"], "days_old": null, "min_size": null, "max_size": null, "keyword": null, "search_path": "DESKTOP"}
Query: "Look for resume.pdf in my downloads"
Response: {"name_pattern": "resume.pdf", "extensions": null, "days_old": null, "min_size": null, "max_size": null, "keyword": null, "search_path": "DOWNLOADS"}
Query: "Find PDF files containing resume updated since Nov 1st 2025" (assuming today is Dec 11, 2025)
Response: {"name_pattern": null, "extensions": [".pdf"], "days_old": 40, "min_size": null, "max_size": null, "keyword": "resume", "search_path": "SYSTEM_WIDE"}
IMPORTANT: Always respond ONLY with valid JSON, no extra text."""),
("human", "Query: {query}")
])
self.parser_chain = self.parse_prompt | self.llm | StrOutputParser()
def understand_query_with_llm(self, query: str) -> Dict[str, Any]:
"""
Parse natural language query using LLM and extract search parameters
Args:
query: Natural language query string
Returns:
Dictionary with extracted search parameters
"""
try:
# Get response from LLM
response = self.parser_chain.invoke({"query": query})
print(f"LLM Response: {response}")
# Extract JSON from response if needed
json_str = self._extract_json_from_response(response)
# Parse JSON
import json
params = json.loads(json_str)
# Process search_path
if params.get('search_path') in (None, "SYSTEM_WIDE"):
    params['search_path'] = "SYSTEM_WIDE"
elif params['search_path'] == "DESKTOP":
desktop_path = os.path.join(os.path.expanduser('~'), 'Desktop')
params['search_path'] = desktop_path if os.path.exists(desktop_path) else "."
elif params['search_path'] == "DOWNLOADS":
downloads_path = os.path.join(os.path.expanduser('~'), 'Downloads')
params['search_path'] = downloads_path if os.path.exists(downloads_path) else "."
elif params['search_path'] == "DOCUMENTS":
documents_path = os.path.join(os.path.expanduser('~'), 'Documents')
params['search_path'] = documents_path if os.path.exists(documents_path) else "."
elif not os.path.exists(params['search_path']):
# Fallback to system-wide search if path doesn't exist
params['search_path'] = "SYSTEM_WIDE"
return params
except Exception as e:
print(f"Error parsing with LLM: {e}")
# Fallback to rule-based parsing
return self._understand_query_rule_based(query)
def _extract_json_from_response(self, response: str) -> str:
"""
Extract JSON from LLM response
Args:
response: Raw LLM response
Returns:
Clean JSON string
"""
# Look for JSON object in response
import json
try:
# Try to parse entire response as JSON
json.loads(response)
return response
except json.JSONDecodeError:
# Look for JSON object in curly braces
match = re.search(r'\{.*\}', response, re.DOTALL)
if match:
return match.group(0)
else:
# Return default JSON if parsing fails
return '{"name_pattern": null, "extensions": null, "days_old": null, "min_size": null, "max_size": null, "keyword": null, "search_path": "SYSTEM_WIDE"}'
def _understand_query_rule_based(self, query: str) -> Dict[str, Any]:
"""
Fallback rule-based query understanding
Args:
query: Natural language query string
Returns:
Dictionary with extracted search parameters
"""
query = query.lower().strip()
params = {
'name_pattern': None,
'extensions': None,
'days_old': None,
'min_size': None,
'max_size': None,
'keyword': None,
'search_path': None,
'recursive': True
}
# Extract search path from query
params['search_path'] = self._extract_search_path(query)
# Extract file type
for file_type, extensions in self.file_types.items():
if file_type in query:
params['extensions'] = extensions
break
# Extract time expressions
time_patterns = [
r"last\s+(\d+)\s+days?",
r"recent.*?(\d+)\s+days?",
r"past\s+(\d+)\s+days?",
r"(\d+)\s+days?\s+ago",
r"last\s+(\d+)\s+weeks?",
r"recent.*?(\d+)\s+weeks?",
r"(\d+)\s+weeks?\s+ago"
]
for pattern in time_patterns:
match = re.search(pattern, query)
if match:
number = int(match.group(1))
if 'week' in pattern:
number *= 7
params['days_old'] = number
break
# Check for common time expressions
for expr, days in self.time_expressions.items():
if expr in query:
params['days_old'] = days
break
# Handle "updated since [date]" patterns - NEW CODE
date_patterns = [
r"updated\s+(?:since|after)\s+(\w+\s+\d+(?:st|nd|rd|th)?\s+\d{4})",
r"modified\s+(?:since|after)\s+(\w+\s+\d+(?:st|nd|rd|th)?\s+\d{4})"
]
for pattern in date_patterns:
match = re.search(pattern, query)
if match:
date_str = match.group(1)
days_old = self._calculate_days_from_date_string(date_str)
if days_old is not None:
params['days_old'] = days_old
break
# Extract file size expressions
size_patterns = [
r"larger than\s*(\d+)\s*(mb|gb|kb)",
r"bigger than\s*(\d+)\s*(mb|gb|kb)",
r"smaller than\s*(\d+)\s*(mb|gb|kb)",
r"(\d+)\s*(mb|gb|kb)\s*or (larger|bigger|smaller)"
]
for pattern in size_patterns:
match = re.search(pattern, query)
if match:
number = int(match.group(1))
unit = match.group(2).lower()
# Patterns 1-3 capture only (number, unit); infer the comparison from the pattern itself
comparison = match.group(3) if len(match.groups()) > 2 else ('smaller' if 'smaller' in pattern else 'larger')
# Convert to bytes
multiplier = 1
if unit == 'kb':
multiplier = 1024
elif unit == 'mb':
multiplier = 1024 * 1024
elif unit == 'gb':
multiplier = 1024 * 1024 * 1024
size_bytes = number * multiplier
if 'smaller' in comparison:
params['max_size'] = size_bytes
else:
params['min_size'] = size_bytes
break
# Extract keywords for content search
keyword_patterns = [
r"containing\s+['\"]?(.+?)['\"]?$",
r"with\s+['\"]?(.+?)['\"]?$",
r"has\s+['\"]?(.+?)['\"]?$",
r"contains\s+['\"]?(.+?)['\"]?$"
]
for pattern in keyword_patterns:
match = re.search(pattern, query)
if match:
keyword = match.group(1).strip()
keyword = re.sub(r'\s+(file|files|document|documents)?\s*$', '', keyword)
params['keyword'] = keyword
break
# Extract specific filenames or patterns - FIXED VERSION
# Handle "file name contains [pattern]" cases more robustly
if 'file name contains' in query:
# More flexible pattern matching for "file name contains [text]"
# Match everything after "file name contains" until we hit a boundary word or end of string
match = re.search(r"file name contains\s+(.*?)(?:\s+(?:in|on|at|from|to|of|with|by|updated|modified|and)|$)", query)
if match:
filename_part = match.group(1).strip()
# Clean up trailing punctuation
filename_part = re.sub(r'[.,;:]+$', '', filename_part).strip()
if filename_part:
print(f"DEBUG: Found filename pattern: '{filename_part}'")
# If it looks like a full filename with extension, use as-is
# Otherwise add wildcards
if '.' in filename_part and len(filename_part.split('.')[-1]) <= 4:
params['name_pattern'] = filename_part
else:
params['name_pattern'] = f"*{filename_part}*"
else:
# Fallback pattern
match = re.search(r"file name contains\s+(.+)", query)
if match:
filename_part = match.group(1).strip()
filename_part = re.sub(r'[.,;:]+$', '', filename_part).strip()
if filename_part:
print(f"DEBUG: Found filename pattern (fallback): '{filename_part}'")
if '.' in filename_part and len(filename_part.split('.')[-1]) <= 4:
params['name_pattern'] = filename_part
else:
params['name_pattern'] = f"*{filename_part}*"
# General filename extraction if not already set
if params['name_pattern'] is None:
filename_indicators = [
r"name.*?contains\s+['\"]?([^.'\"]+\.[^.'\"]+)",
r"filename.*?contains\s+['\"]?([^.'\"]+\.[^.'\"]+)",
r"search.*?for\s+['\"]?([^.'\"]+\.[^.'\"]+)",
r"find.*?['\"]?([^.'\"]+\.[^.'\"]+)",
r"contains\s+['\"]?([^.'\"]+\.[^.'\"]+)"
]
for pattern in filename_indicators:
match = re.search(pattern, query)
if match:
filename_part = match.group(1).strip()
if '.' in filename_part:
params['name_pattern'] = filename_part
break
# If we still don't have a name pattern but have a keyword that looks like a filename
if params['name_pattern'] is None and params['keyword']:
if '.' in params['keyword'] and len(params['keyword'].split('.')[-1]) <= 4:
params['name_pattern'] = params['keyword']
params['keyword'] = None
print(f"DEBUG: Final parsed parameters: {params}")
return params
def _extract_search_path(self, query: str) -> str:
"""
Extract search path from query or determine system-wide search
Args:
query: Natural language query string
Returns:
Search path string
"""
query_lower = query.lower()
# Look for explicit path mentions
path_patterns = [
    r"in\s+disk\s+([a-zA-Z]:\\)",   # handle "in disk D:\"
    r"in\s+([a-zA-Z]:\\[^\s]+)",    # [^\s], not [^\\s], which would wrongly exclude the letter 's'
    r"under\s+([a-zA-Z]:\\[^\s]+)",
    r"from\s+([a-zA-Z]:\\[^\s]+)",
    r"in\s+([a-zA-Z]:)",
    r"under\s+([a-zA-Z]:)",
    r"from\s+([a-zA-Z]:)"
]
for pattern in path_patterns:
match = re.search(pattern, query_lower)
if match:
path = match.group(1).strip()
if os.path.exists(path):
return path
if ':' in path and not path.endswith('\\'):
fixed_path = path + '\\'
if os.path.exists(fixed_path):
return fixed_path
# Look for common directory references
if 'desktop' in query_lower:
desktop_path = os.path.join(os.path.expanduser('~'), 'Desktop')
if os.path.exists(desktop_path):
return desktop_path
if 'downloads' in query_lower or 'download' in query_lower:
downloads_path = os.path.join(os.path.expanduser('~'), 'Downloads')
if os.path.exists(downloads_path):
return downloads_path
if 'documents' in query_lower or 'document' in query_lower:
documents_path = os.path.join(os.path.expanduser('~'), 'Documents')
if os.path.exists(documents_path):
return documents_path
# Default to system-wide search
if os.name == 'nt': # Windows
return "SYSTEM_WIDE"
else:
return "/"
def search(self, query: str) -> List[Dict[str, Any]]:
"""
Perform search based on natural language query
Args:
query: Natural language query
Returns:
List of matching files
"""
print(f"Understanding query: '{query}'")
params = self.understand_query_with_llm(query)
print(f"Parsed parameters: {params}")
# Debug the parameters
if params.get('days_old') is not None:
target_date = datetime.now() - timedelta(days=params['days_old'])
print(f"DEBUG: Target date for filtering: {target_date}")
# Handle system-wide search
if params['search_path'] == "SYSTEM_WIDE":
print("Performing system-wide search...")
all_results = []
# Search common drives on Windows
for drive in self.common_drives:
drive_path = drive + "\\"
if os.path.exists(drive_path):
print(f"Searching in {drive_path}...")
params_copy = params.copy()
params_copy['search_path'] = drive_path
results = self.advanced_search(query, **params_copy)
all_results.extend(results)
return all_results
else:
print(f"Searching in path {params['search_path']}")
# Validate path exists
if not os.path.exists(params['search_path']):
print(f"Path {params['search_path']} does not exist, searching in current directory")
params['search_path'] = "."
# Use the advanced search with parsed parameters
return self.advanced_search(query, **params)
def advanced_search(self, original_query: str,
name_pattern: Optional[str] = None,
extensions: Optional[List[str]] = None,
min_size: Optional[int] = None,
max_size: Optional[int] = None,
days_old: Optional[int] = None,
keyword: Optional[str] = None,
search_path: str = ".",
recursive: bool = True) -> List[Dict[str, Any]]:
"""
Perform advanced search with multiple criteria
"""
print("Entering advanced_search")
results = []
# Normalize search path
if search_path == ".":
search_path = os.getcwd()
else:
search_path = os.path.abspath(search_path)
print(f"Advanced search searching in {search_path}")
print(f"Searching for files matching criteria:")
print(f" name_pattern: {name_pattern if name_pattern else 'all'}")
print(f" extensions: {extensions if extensions else 'all'}")
print(f" min_size: {min_size if min_size else 'all'}")
print(f" max_size: {max_size if max_size else 'all'}")
print(f" days_old: {days_old if days_old else 'all'}")
print(f" keyword: {keyword if keyword else 'all'}")
files_processed = 0
files_matched = 0
try:
if recursive:
print("Starting recursive search...")
for root, dirs, files in os.walk(search_path):
# Skip system directories that often cause permission issues
dirs[:] = [d for d in dirs if not d.startswith(('$', 'System Volume Information'))]
for filename in files:
files_processed += 1
file_path = os.path.join(root, filename)
# Limit debug output to avoid overwhelming logs
if files_processed <= 20 or files_matched < 5:
match_result = self._matches_criteria(
file_path, name_pattern, extensions, min_size,
max_size, days_old, keyword)
else:
# After initial files, suppress debug output but still check
# Temporarily disable print for performance
import sys, io
old_stdout = sys.stdout
sys.stdout = io.StringIO()
match_result = self._matches_criteria(
file_path, name_pattern, extensions, min_size,
max_size, days_old, keyword)
sys.stdout = old_stdout
if match_result:
files_matched += 1
results.append(self._get_file_info(file_path))
# Show first few matches
if files_matched <= 10:
print(f"MATCH #{files_matched}: {file_path}")
# Progress indicator for large directories
if files_processed % 1000 == 0:
print(f"Processed {files_processed} files, found {files_matched} matches so far...")
else:
print("Starting non-recursive search...")
with os.scandir(search_path) as entries:
for entry in entries:
if entry.is_file():
files_processed += 1
match_result = self._matches_criteria(
entry.path, name_pattern, extensions, min_size,
max_size, days_old, keyword)
if match_result:
files_matched += 1
results.append(self._get_file_info(entry.path))
# Show first few matches
if files_matched <= 10:
print(f"MATCH #{files_matched}: {entry.path}")
print(f"Search complete. Files processed: {files_processed}, Files matched: {files_matched}")
print(f"Records found: {len(results)}")
except PermissionError as e:
print(f"Permission denied accessing some directories: {e}")
except Exception as e:
print(f"Error during search: {e}")
import traceback
traceback.print_exc()
self.search_history.append({
"type": "natural_language",
"query": original_query,
"criteria": {
"name_pattern": name_pattern,
"extensions": extensions,
"min_size": min_size,
"max_size": max_size,
"days_old": days_old,
"keyword": keyword,
"path": search_path
},
"path": search_path,
"results": len(results),
"timestamp": datetime.now()
})
print("Leaving advanced_search")
return results
def _matches_criteria(self, file_path: str, name_pattern: Optional[str],
extensions: Optional[List[str]], min_size: Optional[int],
max_size: Optional[int], days_old: Optional[int],
keyword: Optional[str]) -> bool:
"""Check if a file matches all specified criteria"""
#print(f"\n=== CHECKING FILE: {file_path} ===")
# Name pattern check
if name_pattern:
filename = os.path.basename(file_path)
pattern = name_pattern
#print(f"Name pattern check: looking for '{pattern}' in '{filename}'")
# Handle case-insensitive matching
filename_lower = filename.lower()
pattern_lower = pattern.lower()
# If pattern contains wildcards, use fnmatch
if '*' in pattern_lower or '?' in pattern_lower:
match_result = fnmatch.fnmatch(filename_lower, pattern_lower)
#print(f"Wildcard match '{pattern_lower}' with '{filename_lower}': {match_result}")
if not match_result:
#print(f"REJECTED: Name pattern wildcard mismatch")
return False
else:
# Exact substring matching (case insensitive)
substring_match = pattern_lower in filename_lower
print(f"Substring match '{pattern_lower}' in '{filename_lower}': {substring_match}")
if not substring_match:
#print(f"REJECTED: Name pattern substring mismatch")
return False
else:
print("No name pattern specified")
# Extension check
if extensions:
_, ext = os.path.splitext(file_path)
# Normalize extensions for comparison
normalized_extensions = []
for e in extensions:
if e.startswith('.'):
normalized_extensions.append(e.lower())
else:
normalized_extensions.append('.' + e.lower())
ext_check = ext.lower() in normalized_extensions
print(f"Extension check: file has '{ext.lower()}', looking for {normalized_extensions}, match: {ext_check}")
if not ext_check:
print(f"REJECTED: Extension mismatch")
return False
else:
print("No extension filter specified")
# Size check
if min_size is not None or max_size is not None:
try:
size = os.path.getsize(file_path)
min_check = min_size is None or size >= min_size
max_check = max_size is None or size <= max_size
print(f"Size check: file size {size}, min {min_size}, max {max_size}")
print(f"Min check: {min_check}, Max check: {max_check}")
if not (min_check and max_check):
print(f"REJECTED: Size mismatch")
return False
except (OSError, PermissionError) as e:
print(f"WARNING: Cannot access file size: {e}")
# Don't reject based on size if we can't read it
pass
# Date check
if days_old is not None:
try:
mod_time = datetime.fromtimestamp(os.path.getmtime(file_path))
target_date = datetime.now() - timedelta(days=days_old)
print(f"Date check:")
print(f" File modification time: {mod_time}")
print(f" Target date (since): {target_date}")
print(f" Days old parameter: {days_old}")
print(f" Comparison: {mod_time} >= {target_date} = {mod_time >= target_date}")
# For "updated since [date]", we want files NEWER than or equal to that date
if mod_time < target_date:
print(f"REJECTED: File is older than target date")
return False
else:
print(f"ACCEPTED: File is newer than or equal to target date")
except (OSError, PermissionError) as e:
print(f"WARNING: Cannot access file modification time: {e}")
# Don't reject based on date if we can't read it
pass
else:
print("No date filter specified")
# Content check ======================= Currently disabled
'''
if keyword:
keyword_found = self._file_contains_keyword(file_path, keyword)
print(f"Keyword check: looking for '{keyword}', found: {keyword_found}")
if not keyword_found:
print(f"REJECTED: Keyword '{keyword}' not found in file content")
return False
else:
print(f"ACCEPTED: Keyword '{keyword}' found in file content")
else:
print("No keyword filter specified")
'''
print(f"FINAL RESULT: File ACCEPTED")
return True
def _test_pattern_matching(self):
"""Test function to verify pattern matching works correctly"""
test_cases = [
("*resume*", "my_resume.pdf", True),
("*resume*", "resume_final.docx", True),
("*resume*", "Resume.pdf", True), # Case insensitive
("*resume*", "application.txt", False),
("resume*", "resume_draft.pdf", True),
("*resume", "final_resume.pdf", True),
]
print("\n=== PATTERN MATCHING TESTS ===")
for pattern, filename, expected in test_cases:
result = False
if '*' in pattern or '?' in pattern:
result = fnmatch.fnmatch(filename.lower(), pattern.lower())
else:
result = pattern.lower() in filename.lower()
status = "PASS" if result == expected else "FAIL"
print(f"{status}: Pattern '{pattern}' with '{filename}' -> {result} (expected {expected})")
print("=== END TESTS ===\n")
def _get_file_info(self, file_path: str) -> Dict[str, Any]:
"""Get detailed information about a file"""
try:
stat = os.stat(file_path)
return {
"path": file_path,
"name": os.path.basename(file_path),
"size": stat.st_size,
"modified": datetime.fromtimestamp(stat.st_mtime),
"created": datetime.fromtimestamp(stat.st_ctime),
"extension": os.path.splitext(file_path)[1],
"directory": os.path.dirname(file_path)
}
except (OSError, PermissionError):
return {
"path": file_path,
"name": os.path.basename(file_path),
"error": "Unable to access file information"
}
def _file_contains_keyword(self, file_path: str, keyword: str) -> bool:
"""Check if a file contains a keyword in its content"""
try:
# Only check text-based files
text_extensions = ['.txt', '.py', '.js', '.html', '.css', '.csv', '.md', '.json', '.xml']
_, ext = os.path.splitext(file_path)
if ext.lower() not in text_extensions:
return False
# Skip very large files
if os.path.getsize(file_path) > 10 * 1024 * 1024: # 10MB limit
return False
with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
return keyword.lower() in content.lower()
except (OSError, PermissionError, UnicodeDecodeError):
return False
def _calculate_days_from_date_string(self, date_str: str) -> Optional[int]:
"""
Calculate days old from a date string like "Nov 1st 2025"
Args:
date_str: Date string in format like "Nov 1st 2025"
Returns:
Number of days between now and the given date (positive if past, negative if future)
"""
try:
# Clean up the date string
# Remove ordinal suffixes (st, nd, rd, th)
date_str = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', date_str)
print(f"DEBUG: Parsing date string: '{date_str}'")
# Parse the date
date_obj = datetime.strptime(date_str, "%b %d %Y")
print(f"DEBUG: Parsed date object: {date_obj}")
# Calculate difference in days
delta = datetime.now() - date_obj
days = delta.days
print(f"DEBUG: Days difference: {days}")
return days
except Exception as e:
print(f"Error parsing date string '{date_str}': {e}")
return None
def get_search_history(self) -> List[Dict]:
"""Get history of all searches performed"""
return self.search_history
def clear_search_history(self):
"""Clear search history"""
self.search_history.clear()
def format_file_info(self, file_info: Dict[str, Any]) -> str:
"""
Format file information for display, including modification time
Args:
file_info: Dictionary containing file information
Returns:
Formatted string with file details
"""
try:
# Format file size
size = file_info['size']
if size < 1024:
size_str = f"{size} B"
elif size < 1024 * 1024:
size_str = f"{size // 1024} KB"
elif size < 1024 * 1024 * 1024:
size_str = f"{size // (1024 * 1024)} MB"
else:
size_str = f"{size // (1024 * 1024 * 1024)} GB"
# Format modification time
mod_time = file_info['modified']
mod_time_str = mod_time.strftime("%Y-%m-%d %H:%M:%S")
# Return formatted string
return f"{file_info['name']} ({size_str}, modified: {mod_time_str})"
except Exception:
    # Fallback if there's an error in formatting
    return f"{file_info.get('name', 'N/A')} ({file_info.get('size', '?')} bytes)"
# Example usage
if __name__ == "__main__":
# Create agent instance
agent = NaturalLanguageFileSearchAgent()
# Run pattern matching tests
agent._test_pattern_matching()
#exit()
'''
# Test queries
test_queries = [
"Please help search the file name contains resume.pdf",
"Find recent Word documents from last 10 days",
"Show me PDF files on my desktop",
"Look for images in my downloads folder from last week"
]
print("Testing natural language file search agent with Ollama/Qwen2.5:")
print("=" * 60)
for query in test_queries:
print(f"\nQuery: {query}")
try:
result = agent.search(query)
print(f"Found {len(result)} files")
for file in result[:3]: # Show first 3 results
print(f" - {agent.format_file_info(file)}")
except Exception as e:
print(f"Error: {e}")
print("-" * 40)
'''
try:
result = agent.search("find the pdf file name contains resume in disk D:\\")
#result = agent.search("find the pdf file name contains resume in disk D:\\ updated since Nov 1st 2025")
print(f"Found {len(result)} files")
for file in result[:10]: # Show first 10 results
print(f" - {agent.format_file_info(file)}")
except Exception as e:
print(f"Error: {e}")
print("-" * 40)
```
You can modify the test query in the main function to match files on your own machine, then run `python filename.py` in a console to verify that the agent works.
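The trickiest part to verify by hand is the date handling. This standalone sketch mirrors the agent's `_calculate_days_from_date_string` helper, so you can check the date-to-`days_old` conversion without running the full agent (the value printed depends on the date you run it):

```python
import re
from datetime import datetime

def days_since(date_str: str) -> int:
    # Strip ordinal suffixes ("1st" -> "1"), then parse e.g. "Nov 1 2025"
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", date_str)
    date_obj = datetime.strptime(cleaned, "%b %d %Y")
    # Positive for past dates, negative for future ones
    return (datetime.now() - date_obj).days

print(days_since("Nov 1st 2025"))
```

A positive result means the date lies in the past, so `_matches_criteria` will accept any file modified on or after it; a negative result means the cutoff is in the future.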
4. Building the Web Page
The web page contains an input box and a search button; since a search may take a while, a progress bar is included as well. Once the search succeeds, the matching files are displayed in a table below the input box. The code is as follows:
```python
import streamlit as st
import os
import sys
import traceback
import numpy as np
import sounddevice as sd
import scipy.io.wavfile as wav
import speech_recognition as sr
from scipy.io.wavfile import write
import tempfile
# Set page configuration
st.set_page_config(
page_title="File Search Agent",
page_icon="🔍",
layout="wide"
)
# Custom CSS for better appearance
st.markdown("""
<style>
.stProgress > div > div > div {
background-color: #4CAF50;
}
.file-card {
border: 1px solid #ddd;
border-radius: 5px;
padding: 10px;
margin: 5px 0;
background-color: #f9f9f9;
}
.file-name {
font-weight: bold;
color: #2c3e50;
}
.file-details {
font-size: 0.9em;
color: #7f8c8d;
}
.search-history {
background-color: #ecf0f1;
padding: 10px;
border-radius: 5px;
margin-top: 20px;
}
.status-message {
padding: 10px;
border-radius: 5px;
margin: 10px 0;
}
.recording {
background-color: #f44336 !important;
animation: pulse 1s infinite;
}
@keyframes pulse {
0% { opacity: 1; }
50% { opacity: 0.5; }
100% { opacity: 1; }
}
</style>
""", unsafe_allow_html=True)
def format_file_size(size_bytes):
"""Format file size in human readable format"""
if size_bytes < 1024:
return f"{size_bytes} B"
elif size_bytes < 1024 * 1024:
return f"{size_bytes // 1024} KB"
elif size_bytes < 1024 * 1024 * 1024:
return f"{size_bytes // (1024 * 1024)} MB"
else:
return f"{size_bytes // (1024 * 1024 * 1024)} GB"
@st.cache_resource
def get_search_agent():
"""Initialize and cache the search agent"""
try:
# Import here to isolate potential issues
from file_search_nlp_agent import NaturalLanguageFileSearchAgent
return NaturalLanguageFileSearchAgent()
except Exception as e:
st.error(f"Failed to initialize search agent: {str(e)}")
st.error(f"Traceback: {traceback.format_exc()}")
return None
def initialize_session_state():
"""Initialize session state variables"""
if 'search_results' not in st.session_state:
st.session_state.search_results = []
if 'search_history' not in st.session_state:
st.session_state.search_history = []
if 'is_searching' not in st.session_state:
st.session_state.is_searching = False
if 'current_query' not in st.session_state:
st.session_state.current_query = ""
if 'is_recording' not in st.session_state:
st.session_state.is_recording = False
if 'voice_query' not in st.session_state:
st.session_state.voice_query = ""
return True
def add_to_search_history(query):
"""Add query to search history"""
if query not in [h['query'] for h in st.session_state.search_history]:
import pandas as pd
st.session_state.search_history.append({
'query': query,
'timestamp': pd.Timestamp.now()
})
def record_audio(duration=5, sample_rate=44100):
"""Record audio using sounddevice and convert to text"""
try:
st.info("🎤 Recording... Please speak now.")
# Record audio
audio_data = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1, dtype=np.int16)
sd.wait() # Wait until recording is finished
# Save to temporary WAV file
with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as tmp_file:
wav_file = tmp_file.name
write(wav_file, sample_rate, audio_data)
# Use speech recognition
recognizer = sr.Recognizer()
with sr.AudioFile(wav_file) as source:
audio = recognizer.record(source)
# Convert to text
st.info("🔄 Converting speech to text...")
text = recognizer.recognize_google(audio)
st.success(f"✅ Recognized: {text}")
# Clean up temporary file
os.unlink(wav_file)
return text
except sr.UnknownValueError:
st.error("❓ Could not understand audio. Please try again.")
return None
except sr.RequestError as e:
st.error(f"🚫 Speech recognition service error: {e}")
return None
except Exception as e:
st.error(f"❌ Error recording audio: {str(e)}")
return None
def display_search_results():
"""Display search results in a formatted way"""
    if st.session_state.search_results is not None:
        st.divider()
        st.header(f"📁 Search Results ({len(st.session_state.search_results)} files found)")
        # Show which query produced these results
        st.markdown(f"Showing results for: **{st.session_state.current_query}**")
        # Display results in a table
        if st.session_state.search_results:
            # Prepare rows for the summary dataframe
            display_data = []
            for file_info in st.session_state.search_results:
                try:
                    display_data.append({
                        "File Name": file_info.get('name', 'N/A'),
                        "Size": format_file_size(file_info.get('size', 0)),
                        "Modified": file_info.get('modified').strftime("%Y-%m-%d %H:%M:%S") if file_info.get('modified') else 'N/A',
                        "Directory": file_info.get('directory', 'N/A')[:50] + "..." if len(file_info.get('directory', '')) > 50 else file_info.get('directory', 'N/A')
                    })
                except Exception:
                    continue  # Skip entries with malformed metadata
            # Display the summary as a dataframe
            import pandas as pd
            df = pd.DataFrame(display_data)
            st.dataframe(df, use_container_width=True, height=400)
            # Option to show detailed view
            st.divider()
            st.subheader("📋 Detailed View")
            num_to_show = st.slider("Number of files to display", 1, min(50, len(st.session_state.search_results)), 10)
            for i, file_info in enumerate(st.session_state.search_results[:num_to_show]):
                with st.container():
                    st.markdown(f"""
                    <div class="file-card">
                        <div class="file-name">{file_info.get('name', 'N/A')}</div>
                        <div class="file-details">
                            <strong>Path:</strong> {file_info.get('path', 'N/A')}<br>
                            <strong>Size:</strong> {format_file_size(file_info.get('size', 0))}<br>
                            <strong>Modified:</strong> {file_info.get('modified').strftime("%Y-%m-%d %H:%M:%S") if file_info.get('modified') else 'N/A'}<br>
                            <strong>Created:</strong> {file_info.get('created').strftime("%Y-%m-%d %H:%M:%S") if file_info.get('created') else 'N/A'}<br>
                            <strong>Extension:</strong> {file_info.get('extension', 'N/A')}
                        </div>
                    </div>
                    """, unsafe_allow_html=True)
                    st.markdown("---")
        else:
            # Display message when no results found
            st.info("🔍 No files found matching your query. Try adjusting your search terms or checking the search path.")
            # Add some helpful suggestions
            st.markdown("""
            **💡 Tips for better search results:**
            - Check if the file path exists
            - Try using broader search terms
            - Verify file extensions (e.g., .pdf, .docx)
            - Make sure you have permissions to access the location
            - Try searching in a different directory
            """)
def main():
    st.title("🔍 File Search Agent")
    st.markdown("Search for files on your computer using natural language queries")
    # Initialize session state
    if not initialize_session_state():
        st.stop()
    # Get search agent
    search_agent = get_search_agent()
    if search_agent is None:
        st.warning("Search agent is not available. Some features may not work.")
        return
    # Sidebar
    with st.sidebar:
        st.header("⚙️ Settings")
        model = st.selectbox("Select Model", ["qwen2.5"], index=0, disabled=st.session_state.is_searching)
        st.divider()
        st.header("ℹ️ About")
        st.markdown("""
        This agent can search for files using natural language queries such as:
        - "Find PDF files on my desktop"
        - "Look for resume.docx in D:\\"
        - "Show me images from last week"
        - "Find large video files (>100MB)"
        """)
        # Display search history
        if st.session_state.search_history:
            st.divider()
            st.header("🕒 Recent Searches")
            # Create a copy to avoid issues with the reversed iterator
            history_items = list(reversed(st.session_state.search_history[-5:]))
            for i, history_item in enumerate(history_items):
                button_key = f"history_{i}_{hash(history_item['query'])}"  # Unique key per button
                if st.button(f"{history_item['query'][:30]}{'...' if len(history_item['query']) > 30 else ''}",
                             key=button_key,
                             help=history_item['query'],
                             disabled=st.session_state.is_searching):
                    st.session_state.current_query = history_item['query']
                    st.session_state.is_searching = True
                    st.rerun()
    # Main content
    st.subheader("Enter your search query")
    # Voice input option
    col_voice1, col_voice2, col_voice3 = st.columns([1, 2, 2])
    with col_voice1:
        voice_input = st.checkbox("🎤 Enable Voice Input",
                                  key="voice_input_checkbox",
                                  disabled=st.session_state.is_searching)
    with col_voice2:
        if voice_input:
            record_duration = st.slider("Recording Duration (seconds)", 3, 10, 5)
    with col_voice3:
        if voice_input:
            if st.button("🎙️ Record Query",
                         key="record_button",
                         disabled=st.session_state.is_searching,
                         type="primary"):
                st.session_state.is_recording = True
                st.rerun()
    # Handle voice recording
    if st.session_state.is_recording:
        with st.spinner("🎤 Recording... Please speak now"):
            voice_query = record_audio(duration=record_duration)
            if voice_query:
                st.session_state.voice_query = voice_query
                st.session_state.current_query = voice_query
            # Always clear the recording flag, even if recognition failed,
            # so a failed recording cannot loop forever
            st.session_state.is_recording = False
            st.rerun()
    # Display recognized voice query
    if st.session_state.voice_query and voice_input:
        st.info(f"🎤 Recognized voice query: **{st.session_state.voice_query}**")
    col1, col2 = st.columns([3, 1])
    with col1:
        query = st.text_input("Enter your search query:",
                              placeholder="e.g., Find PDF files on my desktop",
                              key="query_input",
                              value=st.session_state.current_query,
                              disabled=st.session_state.is_searching)
    with col2:
        st.write("")  # Empty space for alignment
        st.write("")  # Empty space for alignment
        search_button = st.button("🔍 Search",
                                  type="primary",
                                  use_container_width=True,
                                  disabled=st.session_state.is_searching)
    # Handle search
    if search_button and (query or st.session_state.voice_query):
        search_query = query if query else st.session_state.voice_query
        st.session_state.current_query = search_query
        st.session_state.is_searching = True
        add_to_search_history(search_query)
        # Show progress
        status_placeholder = st.empty()
        progress_bar = st.progress(0)
        status_placeholder.markdown('<div class="status-message" style="background-color: #e3f2fd;">🔄 Searching for files...</div>', unsafe_allow_html=True)
        progress_bar.progress(25)
        try:
            # Perform search
            progress_bar.progress(50)
            status_placeholder.markdown('<div class="status-message" style="background-color: #e3f2fd;">🔍 Analyzing query...</div>', unsafe_allow_html=True)
            results = search_agent.search(search_query)
            st.session_state.search_results = results
            progress_bar.progress(75)
            status_placeholder.markdown('<div class="status-message" style="background-color: #e3f2fd;">📊 Formatting results...</div>', unsafe_allow_html=True)
            # Update UI
            progress_bar.progress(100)
            if len(results) > 0:
                status_placeholder.markdown(f'<div class="status-message" style="background-color: #c8e6c9;">✅ Search completed! Found {len(results)} files.</div>', unsafe_allow_html=True)
            else:
                status_placeholder.markdown('<div class="status-message" style="background-color: #fff3cd; color: #856404;">🔍 Search completed. No files found matching your query.</div>', unsafe_allow_html=True)
            # Reset searching state
            st.session_state.is_searching = False
            st.session_state.voice_query = ""  # Clear voice query after search
            # Rerun to update the UI
            st.rerun()
        except Exception as e:
            st.session_state.is_searching = False
            status_placeholder.markdown(f'<div class="status-message" style="background-color: #ffcdd2;">❌ Error: {str(e)}</div>', unsafe_allow_html=True)
            progress_bar.empty()
            st.error(f"An error occurred during search: {str(e)}")
            st.error(f"Details: {traceback.format_exc()}")
    # Display results
    display_search_results()
    # Welcome message for first-time users
    if not st.session_state.search_results and not query and not st.session_state.voice_query:
        st.info("💡 Tip: Enter a natural language query above to search for files. Examples:\n\n"
                "- 'Find PDF files on my desktop'\n"
                "- 'Look for resume.docx in D:\\'\n"
                "- 'Show me images from last week'\n"
                "- 'Find large video files (>100MB)'\n\n"
                "🎤 Enable voice input to speak your query instead of typing!")
if __name__ == "__main__":
    # Check if the voice-input libraries are available before launching the app
    try:
        import sounddevice as sd
        import scipy
        import speech_recognition as sr
    except ImportError as e:
        st.error(f"Required libraries not found: {str(e)}")
        st.error("Please install required libraries:")
        st.code("pip install sounddevice scipy SpeechRecognition")
        st.stop()
    main()
Run the web app with `streamlit run file_search_app.py`; in the browser page that opens you can then enter the file query you want:

The detailed information for each file is displayed as follows:

5. Summary
The built-in Windows file search generally cannot satisfy more varied search needs; with this Agent, files can be found by whatever combination of conditions the user's request describes. Keyword search over file contents is time-consuming and rarely used, so that part of the code is currently commented out. The voice-input feature has also not been fully debugged yet; interested readers can improve it further.
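For reference, a minimal sketch of what such a content-keyword search could look like. The function name `search_file_contents` and the `max_bytes` read cap are my own illustrative choices, not the article's commented-out code; capping how much of each file is read is one way to keep the scan from becoming too slow:

```python
def search_file_contents(paths, keyword, max_bytes=1024 * 1024):
    """Return the subset of text files whose contents mention the keyword.

    Only the first max_bytes of each file are read, which bounds the cost
    of scanning large files. Unreadable or binary-garbled files are skipped.
    """
    matches = []
    keyword = keyword.lower()
    for path in paths:
        try:
            # errors="ignore" tolerates binary bytes mixed into text files
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                if keyword in f.read(max_bytes).lower():
                    matches.append(path)
        except OSError:
            continue  # missing file or no permission: skip it
    return matches
```

The candidate `paths` would come from the cheaper filters first (name, type, date, size), so content scanning only touches files that already matched everything else.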
The implementation of this Agent was assisted by the AI tool Tongyi Qianwen (通义千问); however, some hallucinations were also encountered along the way, for example:

Only after adding logging to the original code, covering both the parameter-parsing steps and their results and the file-system query results, did the problem become clearly visible. Once the problem was located, further queries asking the tool to optimize the code worked; the tool proved capable of the task.
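The logging approach described above can be sketched as follows. The function `log_parsed_criteria` and the shape of the criteria dictionary are hypothetical illustrations, not the agent's actual code; the point is simply to print every value the LLM extracted so a hallucinated condition stands out immediately:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("file_search_agent")

def log_parsed_criteria(query: str, criteria: dict) -> dict:
    """Log the raw query and each criterion the LLM extracted from it.

    Comparing these lines against the original query makes it easy to spot
    conditions the model invented or misread before the search runs.
    """
    logger.info("Query: %s", query)
    for key, value in criteria.items():
        logger.info("  parsed %s = %r", key, value)
    return criteria  # pass the criteria through unchanged
```

Inserting a call like this between the parsing step and the file-system search leaves the agent's behavior unchanged while making every intermediate result visible.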