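AI Fundamentals Notes (24): Building an Agent That Parses PDF Resumes

This article shows how to build an Agent that parses a resume in PDF format, extracts the content of each section, and prints the result as JSON. Both Chinese and English resumes are supported.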

1. Environment Dependencies

First, install Ollama from https://ollama.com/download, then pull whichever models you need, for example qwen2.5 or llama3.2, with the following commands:

bash
ollama pull llama3.2
ollama pull qwen2.5

Once the download succeeds, run ollama list to see which models are available locally.
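The same check can be done from Python. The sketch below assumes the Ollama server is running at its default address and uses its /api/tags endpoint, which lists locally available models. The helper names (model_names, has_model, fetch_tags) are hypothetical and not part of the original article.

```python
import json
from urllib.request import urlopen


def model_names(tags_payload):
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """True if `wanted` (e.g. "qwen2.5") is available, ignoring the ":tag" suffix."""
    return any(name.split(":")[0] == wanted for name in model_names(tags_payload))


def fetch_tags(base_url="http://localhost:11434"):
    """Ask a running Ollama server which models it has locally."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Usage (requires a running Ollama server):
#   tags = fetch_tags()
#   print(model_names(tags), has_model(tags, "qwen2.5"))
```

Keeping the parsing separate from the network call makes it easy to check the logic without a server.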

Next, install the required libraries:

text
langchain_ollama
langchain_core
streamlit
pymupdf
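Before running the rest of the code, it can help to confirm that these libraries are importable. The following is a minimal sketch using only the standard library; the missing_modules helper is hypothetical and introduced here for illustration.

```python
from importlib.util import find_spec

# Import names of the libraries listed above.
REQUIRED = ["langchain_ollama", "langchain_core", "streamlit", "pymupdf"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Usage:
#   print(missing_modules(REQUIRED))  # an empty list means everything is installed
```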

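2. Building the Resume-Parsing Agent

The core is a prompt template that tells the Agent how to parse the resume and output a JSON document; the template can be adjusted to fit different resume requirements. In case the output is not valid JSON, a second Agent, validate_json, performs one more correction pass.

python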
from langchain_ollama import ChatOllama
from langchain_core.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate

from langchain_core.output_parsers import StrOutputParser, JsonOutputParser  

base_url  = "http://localhost:11434"
#model = 'llama3.2'
model = "qwen2.5"

llm = ChatOllama(model=model, base_url=base_url)

system = SystemMessagePromptTemplate.from_template("""You are a helpful AI assistant who answers questions based on the context provided.""")

prompt = """
    **Task:** Extract key information from the following resume text.
    **Resume Text:**
    {context}

    **Instructions:**
    Please extract the information listed in the numbered sections below from the resume.

    **Output Format:**
    Return a single JSON object with one key per section:
    "contact_information", "education", "work_experience", "projects",
    "skills", "interests", "additional_information".
    Use an empty list for any section the resume does not mention.

    1. **Contact Information:**
    - Name
    - Email
    - Phone
    - Website/Portfolio

    2. **Education:**
    - Degree
    - Institution
    - Field of Study
    - Graduation Year

    3. **Work Experience:**
    - Job Title
    - Company
    - Dates
    - Description
    - Responsibilities / Projects

    4. **Projects:**
    - Project Title
    - Description/Technologies Used
    - Outcomes/Results
    
    5. **Skills:**
    - Programming Languages
    - Frameworks
    - Databases
    - Tools
    - Other Skills

    6. **Interests:**
    - Hobbies
    - Passions
    - Areas of Interest

    7. **Additional Information:** (if applicable)
    - Certificates
    - Languages
    - Awards or Honors
    - Professional Affiliations

    **Question**:
    {question}

    **Extracted Information:**

"""

prompt = HumanMessagePromptTemplate.from_template(prompt)

def ask_llm(question, context):
    messages = [system, prompt]
    template = ChatPromptTemplate.from_messages(messages)

    qa_chain = template | llm | StrOutputParser()
    return qa_chain.invoke({"question": question, "context": context})


def validate_json(data):
    json_prompt = """
            Please validate and correct the following JSON string.

            **Extracted Information**:
            {data}

            Provide only the corrected JSON, with no preamble or explanation.

            **Corrected JSON**:
    """

    json_prompt = HumanMessagePromptTemplate.from_template(json_prompt)
    json_messages = [system, json_prompt]
    json_template = ChatPromptTemplate.from_messages(json_messages)

    json_chain = json_template | llm | JsonOutputParser()
    return json_chain.invoke({"data": data})
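A cheap local check can run before the second LLM call: if the first answer already parses as JSON, the validate_json step could be skipped. The sketch below is one possible approach using only the standard library. The helpers strip_code_fence and try_parse_json are hypothetical and not part of the original code, and the fallback of taking the outermost braces is a heuristic that can fail on unusual output.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def strip_code_fence(text):
    """Remove a surrounding Markdown code fence if the model added one."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop the first line (the fence plus an optional language tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[:-3]
    return text.strip()


def try_parse_json(text):
    """Return the parsed JSON value, or None if no valid JSON could be found."""
    cleaned = strip_code_fence(text)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass
    # Heuristic fallback: try the outermost {...} span, ignoring any preamble.
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(cleaned[start:end + 1])
        except json.JSONDecodeError:
            return None
    return None
```

With this in place, a caller could try try_parse_json(result) first and fall back to validate_json(result) only when it returns None.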

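3. Building a Web Page with Streamlit for Resume Parsing

Streamlit is used to build a minimal web page where you can upload a resume in PDF format; the page parses the resume and outputs its content as JSON.

Sample code:

python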
import streamlit as st
import pymupdf

# This import expects the code from section 2 to be saved as script/llm.py.
from script.llm import ask_llm, validate_json

st.title("Resume Parser")
st.write("Upload a resume in PDF format to extract information") 
uploaded_file = st.file_uploader("Choose a file", type=["pdf"])

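# Start from an empty context so the button handler below never hits an undefined name.
context = ""

if uploaded_file is not None:
    # Read the upload into memory (named pdf_bytes to avoid shadowing the built-in bytearray).
    pdf_bytes = uploaded_file.read()
    pdf = pymupdf.open(stream=pdf_bytes, filetype="pdf")

    # Concatenate the text of every page, separated by blank lines.
    for page in pdf:
        context = context + "\n\n" + page.get_text()

    pdf.close()

#st.write(context)

question = """
        You are tasked to parse a resume. Your goal is to extract the related information from the resume in a valid structured format.
        If the resume is Chinese, please translate it into English before parsing.
        Do not write preamble or explanation.
        """

if st.button("Parse Resume"):
    if not context.strip():
        # Nothing uploaded yet, or the PDF has no extractable text (e.g. a scanned image).
        st.warning("Please upload a PDF resume that contains extractable text.")
    else:
        with st.spinner("Parsing..."):
            result = ask_llm(question, context)

        # A spinner (rather than an expander) shows progress while the JSON is validated.
        with st.spinner("Validating JSON..."):
            result = validate_json(result)

        st.write("**Extracted Information:**")
        st.write(result)

        st.write("**You can use the parsed JSON to build your resume:**")

        st.balloons()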
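The upload step above passes whatever bytes it receives straight to pymupdf. A small guard in front of it might look like the sketch below, which assumes that a well-formed PDF begins with the %PDF- header. Both helpers (looks_like_pdf, join_pages) are hypothetical names introduced for illustration; join_pages mirrors the page loop above but skips empty pages.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files begin with the '%PDF-' header."""
    return data.lstrip().startswith(b"%PDF-")


def join_pages(page_texts):
    """Join per-page text with blank lines, dropping pages that have no text."""
    cleaned = [text.strip() for text in page_texts]
    return "\n\n".join(text for text in cleaned if text)


# Usage inside the Streamlit script (sketch):
#   if looks_like_pdf(pdf_bytes):
#       context = join_pages(page.get_text() for page in pdf)
```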

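Run the web app (for example with streamlit run followed by the script's file name) to open the page in the browser.

Select a resume in PDF format (either Chinese or English works) and click the [Parse Resume] button to see the parsed result.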
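The question prompt leaves it to the model to notice that a resume is in Chinese. If an explicit check is preferred, a rough heuristic is to measure the share of CJK characters in the extracted text. The sketch below is only an illustration: the helper names and the 0.3 threshold are assumptions, not values from the original article, and only the basic CJK Unified Ideographs range is counted.

```python
def chinese_ratio(text):
    """Fraction of non-whitespace characters in the basic CJK Unified Ideographs range."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def is_mostly_chinese(text, threshold=0.3):
    """Heuristic: treat the text as Chinese when enough characters are CJK (threshold is an assumption)."""
    return chinese_ratio(text) >= threshold


# Usage (sketch): add the translation instruction to the question only when needed.
#   if is_mostly_chinese(context):
#       question += "\nTranslate the resume into English before parsing."
```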
