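AI Fundamentals Notes, Part 24: Building an Agent That Parses PDF Resumes

This article shows how to build an agent that parses a resume in PDF format, extracts the content of each section, and prints it as JSON. Both Chinese and English resumes are supported.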

1. Environment dependencies

First, install Ollama from https://ollama.com/download, then download whichever models you need, for example qwen2.5 or llama3.2, with the following commands:

bash
ollama pull llama3.2
ollama pull qwen2.5

After the download succeeds, run ollama list to see which models are installed.
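
The same check can also be done from Python. The sketch below assumes the Ollama server is running at its default local address (the same base_url used later in this article) and that its /api/tags endpoint returns a JSON body with a "models" list; the helper names model_names and list_local_models are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def model_names(payload: dict) -> list[str]:
    """Pull the model names out of an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the running Ollama server which models are installed locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

Calling list_local_models() while the server is running should return the pulled models; if the server is not running, urlopen raises an error instead.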

Next, install the required libraries:

text
langchain_ollama
langchain_core
streamlit
pymupdf
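
Before moving on, it can help to confirm that these libraries are importable. A minimal sketch using only the standard library; the helper name missing_modules is hypothetical, and the list assumes the import names match the package names above (older PyMuPDF releases are imported as fitz instead of pymupdf).

```python
from importlib.util import find_spec

# Import names of the libraries listed above (assumed to match the package names)
REQUIRED = ["langchain_ollama", "langchain_core", "streamlit", "pymupdf"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]
```

An empty list from missing_modules(REQUIRED) means every dependency was found.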

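2. Building the resume-parsing agent

The core is a prompt template that tells the agent how to parse the resume and emit a JSON document; it can be adjusted to suit different resume requirements. In case the JSON that comes back is malformed, a second agent, validate_json, corrects it in another pass.

python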
from langchain_ollama import ChatOllama
from langchain_core.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate

from langchain_core.output_parsers import StrOutputParser, JsonOutputParser  

base_url  = "http://localhost:11434"
#model = 'llama3.2'
model = "qwen2.5"

llm = ChatOllama(model=model, base_url=base_url)

system = SystemMessagePromptTemplate.from_template("""You are a helpful AI assistant who answers questions based on the context provided.""")

prompt = """
    **Task:** Extract key information from the following resume text.
    **Resume Text:**
    {context}

    **Instructions:**
    Please extract the following information from the resume and format it in a clear structure:
    1. Key Skills
    2. Interests
    3. Preferred Job Types

    **Output Format:**
    Return a JSON object with the following keys:
    1. "skills": A list of key skills
    2. "interests": A list of interests
    3. "job_types": A list of preferred job types

    Also extract the sections below and add each one as a further key in the same JSON object:

    1. **Contact Information:**
    - Name
    - Email
    - Phone
    - website/Portfolio

    2. **Education:**
    - Degree
    - Institution
    - Field of Study
    - Graduation Year

    3. **Work Experience:**
    - Job Title
    - Company
    - Dates
    - Description
    - Responsibilities / Projects

    4. **Projects:**
    - Project Title
    - Description/Technologies Used
    - Outcomes/Results
    
    5. **Skills:**
    - Programming Languages
    - Frameworks
    - Databases
    - Tools
    - Other Skills

    6. **Interests:**
    - Hobbies
    - Passions
    - Areas of Interest

    7. **Additional Information:** (if applicable)
    - Certificates
    - Languages
    - Awards or Honors
    - Passions
    - Professional Affiliations

    **Question**:
    {question}

    **Extracted Information:**

"""

prompt = HumanMessagePromptTemplate.from_template(prompt)

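def ask_llm(question, context):
    """Run the resume text through the extraction prompt and return the raw model output."""
    messages = [system, prompt]
    template = ChatPromptTemplate.from_messages(messages)

    # Prompt -> model -> plain string
    qa_chain = template | llm | StrOutputParser()
    return qa_chain.invoke({"question": question, "context": context})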


def validate_json(data):
    """Ask the model to repair the extracted JSON string and return it as a parsed object."""
    json_prompt = """
            Please validate and correct the following JSON string.

            **Extracted Information**:
            {data}

            Provide only the corrected JSON, with no preamble or explanation.

            **Corrected JSON**:
    """

    json_prompt = HumanMessagePromptTemplate.from_template(json_prompt)
    json_messages = [system, json_prompt]
    json_template = ChatPromptTemplate.from_messages(json_messages)

    json_chain = json_template | llm | JsonOutputParser()
    return json_chain.invoke({"data": data})
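
The second model call is only needed when the first answer is not already valid JSON. As a possible refinement (not part of the original code), the sketch below tries to parse the raw output locally first; try_parse_json is a hypothetical helper, and it assumes the answer contains a single top-level JSON object, possibly surrounded by extra text.

```python
import json


def try_parse_json(text: str):
    """Return the parsed object if a JSON object can be found in `text`, else None."""
    # Keep only the span from the first "{" to the last "}" so that any
    # preamble or trailing explanation around the object is ignored.
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
```

A caller could then fall back to validate_json only when try_parse_json returns None, which saves one round trip to the model for well-formed answers.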

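3. Building a web page with Streamlit for resume parsing

Streamlit is used to build a minimal web page where you can upload a resume in PDF format; the page parses the resume and outputs the result as JSON.

The sample code is as follows:

python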
import streamlit as st
import pymupdf

from script.llm import ask_llm, validate_json

st.title("Resume Parser")
st.write("Upload a resume in PDF format to extract information")
# Restrict the uploader to PDF files
uploaded_file = st.file_uploader("Choose a file", type="pdf")

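# Start with an empty context so the name exists even before a file is uploaded
context = ""

if uploaded_file is not None:
    # Read the uploaded bytes and open them as an in-memory PDF
    pdf_bytes = uploaded_file.read()
    pdf = pymupdf.open(stream=pdf_bytes, filetype="pdf")

    # Concatenate the text of every page
    for page in pdf:
        context = context + "\n\n" + page.get_text()

    pdf.close()

#st.write(context)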

question = """
        You are tasked to parse a resume. Your goal is to extract the related information from the resume in a valid structured format.
        If the resume is Chinese, please translate it into English before parsing.
        Do not write preamble or explanation.
        """

if st.button("Parse Resume"):
    if uploaded_file is None:
        # Nothing to parse yet
        st.warning("Please upload a PDF resume first.")
    else:
        with st.spinner("Parsing..."):
            result = ask_llm(question, context)

        with st.spinner("Validating JSON..."):
            result = validate_json(result)

        st.write("**Extracted Information:**")
        st.write(result)

        st.write("**You can use the parsed JSON to build your resume:**")

        st.balloons()
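
To reuse the parsed result outside the page, it can be serialized to a JSON string. A minimal sketch with the standard library; to_pretty_json is a hypothetical helper, and ensure_ascii=False is chosen on the assumption that names or other fields may still contain Chinese characters.

```python
import json


def to_pretty_json(result) -> str:
    """Serialize the parsed resume as indented JSON, keeping non-ASCII text readable."""
    return json.dumps(result, ensure_ascii=False, indent=2)
```

In the Streamlit page, the resulting string could for instance be shown with st.code or offered through st.download_button; neither is part of the original example.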

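Running the web app displays the following page:

Select a resume in PDF format, in either Chinese or English, and click the [Parse Resume] button to see a result like the following: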
