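This code processes a Word document by extracting DOIs, looking them up, replacing them, and generating a reference list. It can be divided into the following functional modules: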
- Import modules
```python
import re
from docx import Document
from docx.oxml.ns import qn  # imported but never used in this script
from habanero import Crossref
```
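The script imports the regular-expression module `re` for text pattern matching, the `Document` class from the python-docx library for working with Word documents, the `qn` function for handling XML namespaces (not actually used in the code), and the `Crossref` class from the habanero library for looking up reference metadata by DOI.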
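Because two of these imports come from third-party packages, it can help to confirm they are installed before running the script. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` mapping are illustrative and not part of the original code.

```python
import importlib.util

# Import name -> package name to install (the import name for python-docx is "docx")
REQUIRED = {"docx": "python-docx", "habanero": "habanero"}

def missing_packages(required):
    """Return the install names of modules that cannot be found."""
    return [
        package
        for module, package in required.items()
        if importlib.util.find_spec(module) is None
    ]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
```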
- DOI extraction function
```python
def extract_dois(text):
    # "10." + a 4-9 digit registrant code + "/" + the suffix characters
    doi_pattern = r'(10\.\d{4,9}/[-._;()/:A-Z0-9]+)'
    return re.findall(doi_pattern, text, re.IGNORECASE)
```
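`extract_dois` takes a `text` argument and matches it against the regular expression `doi_pattern`, which describes the DOI format. `re.findall` collects every matching DOI string and returns them as a list. Matching is case-insensitive because of `re.IGNORECASE`.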
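A quick way to see what the pattern captures is to run it on a sample sentence. The sketch below repeats the function so it runs on its own. The sample DOIs are made up, and `strip_trailing_punctuation` is a hypothetical helper added here because the character class includes `.` and `;`, so sentence punctuation directly after a DOI gets captured too.

```python
import re

DOI_PATTERN = r'(10\.\d{4,9}/[-._;()/:A-Z0-9]+)'

def extract_dois(text):
    return re.findall(DOI_PATTERN, text, re.IGNORECASE)

def strip_trailing_punctuation(doi):
    # Hypothetical cleanup step: drop punctuation that most likely
    # belongs to the surrounding sentence rather than to the DOI
    return doi.rstrip('.,;:')

sample = "See 10.1000/xyz123 and also doi:10.1234/ABC.def."
raw = extract_dois(sample)
cleaned = [strip_trailing_punctuation(d) for d in raw]

print(raw)      # ['10.1000/xyz123', '10.1234/ABC.def.']
print(cleaned)  # ['10.1000/xyz123', '10.1234/ABC.def']
```

Whether to strip trailing punctuation is a judgment call: a DOI that really ends in one of those characters would be damaged by it.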
- Reference lookup function
```python
def get_reference(doi):
    cr = Crossref()
    try:
        result = cr.works(ids=doi)
        if 'message' in result:
            message = result['message']
            # Collect authors as "Family, G."
            authors = []
            if 'author' in message:
                for author in message['author']:
                    if 'family' in author and 'given' in author:
                        last_name = author['family']
                        first_initial = author['given'][0] if author['given'] else ''
                        authors.append(f"{last_name}, {first_initial}.")
            author_str = ', '.join(authors)
            # Extract the year, title, and other fields, with placeholders for missing ones
            year = message['issued']['date-parts'][0][0] if 'issued' in message and 'date-parts' in message['issued'] and message['issued']['date-parts'] else 'n.d.'
            title = message['title'][0] if 'title' in message and message['title'] else 'No title'
            journal = message['container-title'][0] if 'container-title' in message and message['container-title'] else 'No journal'
            volume = message['volume'] if 'volume' in message else 'No volume'
            issue = message['issue'] if 'issue' in message else 'No issue'
            pages = message['page'] if 'page' in message else 'No pages'
            reference = f"{author_str} ({year}). {title}. {journal}, {volume}({issue}), {pages}. doi:{doi}"
            return reference
        else:
            return None
    except Exception:
        # Any lookup or parsing failure is reported as "no reference"
        return None
```
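`get_reference` takes a DOI argument `doi` and creates a `Crossref` instance `cr` to look up the reference metadata for that DOI. If the result contains a `message` field, the function extracts the authors, year, title, journal, volume, issue, and pages from it, formats them as an APA-style reference string, and returns that string. Missing fields are filled with placeholders such as `n.d.` or `No volume`. If the lookup fails or any exception is raised, the function returns `None`.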
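The formatting step can be tried without network access by feeding it a dictionary shaped like the `message` field. The sketch below is a variation rather than the original logic: `format_reference` is a hypothetical helper that leaves out missing volume, issue, and page fields instead of printing placeholders, and the sample record is invented for illustration.

```python
def format_reference(message, doi):
    """Format a Crossref-style message dict as an APA-like string."""
    authors = [
        f"{a['family']}, {a['given'][0]}."
        for a in message.get('author', [])
        if a.get('family') and a.get('given')
    ]
    date_parts = message.get('issued', {}).get('date-parts') or [[None]]
    year = date_parts[0][0] if date_parts[0] and date_parts[0][0] else 'n.d.'
    title = (message.get('title') or ['No title'])[0]
    journal = (message.get('container-title') or ['No journal'])[0]

    ref = f"{', '.join(authors)} ({year}). {title}. {journal}"
    # Append volume, issue, and pages only when they are present
    if message.get('volume'):
        ref += f", {message['volume']}"
        if message.get('issue'):
            ref += f"({message['issue']})"
    if message.get('page'):
        ref += f", {message['page']}"
    return f"{ref}. doi:{doi}"

# Invented sample record, not real Crossref data
sample_message = {
    'author': [{'family': 'Doe', 'given': 'Jane'},
               {'family': 'Roe', 'given': 'Richard'}],
    'issued': {'date-parts': [[2020, 5, 1]]},
    'title': ['An Example Title'],
    'container-title': ['Journal of Examples'],
    'volume': '12',
    'page': '34-56',
}
formatted = format_reference(sample_message, '10.1000/example')
print(formatted)
```

Separating formatting from the network call also makes the formatting rules easy to test in isolation.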
- Main processing function
```python
def convert_dois_in_word(input_file, output_file):
    doc = Document(input_file)
    all_dois = []
    doi_original_index = {}
    index = 1
    # Collect every unique DOI in the document and number it by first appearance
    for paragraph in doc.paragraphs:
        dois = extract_dois(paragraph.text)
        for doi in dois:
            if doi not in all_dois:
                all_dois.append(doi)
                doi_original_index[doi] = index
                index += 1
    references = []
    successful_dois = []
    failed_dois = []
    # Look up the reference for each DOI
    for doi in all_dois:
        reference = get_reference(doi)
        if reference:
            references.append(reference)
            successful_dois.append(doi)
        else:
            failed_dois.append(doi)
    # Replace each resolved DOI in the document with a superscript citation number
    for paragraph in doc.paragraphs:
        for number, doi in enumerate(successful_dois, start=1):
            # Snapshot the runs, because new runs are added while looping
            for run in list(paragraph.runs):
                if doi not in run.text:
                    continue
                # Splits on the first occurrence only; a DOI that spans
                # several runs is not detected
                before, after = run.text.split(doi, 1)
                run.text = before
                marker = paragraph.add_run(f"[{number}]")
                marker.font.superscript = True
                tail = paragraph.add_run(after)
                # add_run() appends at the end of the paragraph, so move both
                # new runs directly behind the original run (lxml addnext on
                # python-docx's underlying XML elements)
                run._r.addnext(marker._r)
                marker._r.addnext(tail._r)
    # Append the reference list at the end of the document
    doc.add_page_break()
    doc.add_heading('References', level=1)
    for i, reference in enumerate(references, start=1):
        doc.add_paragraph(f"[{i}] {reference}")
    doc.save(output_file)
    # Print the conversion results
    print("Successfully converted DOIs:")
    for doi in successful_dois:
        print(doi)
    print("\nDOIs that failed to convert:")
    for doi in failed_dois:
        original_index = doi_original_index[doi]
        print(f"{original_index}. {doi}")
```
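`convert_dois_in_word` takes the input and output file paths, `input_file` and `output_file`. It opens the input Word document, walks through the paragraphs to extract all DOIs, and assigns each unique DOI a number in order of first appearance. It then tries to fetch the reference for each DOI and sorts the DOIs into successful and failed ones. In a second pass over the paragraphs, every DOI that was resolved is replaced by a superscript citation number, which is its position among the successful DOIs and therefore matches the numbering of the reference list. Finally, the function appends a page break, a reference-list heading, and the formatted references to the end of the document, saves it, and prints the DOIs that were converted and those that failed.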
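The numbering and replacement logic can be sketched on plain strings, which makes it testable without a Word file. This is a simplified illustration under stated assumptions: it numbers every DOI found (it does not drop failed lookups), it cannot represent superscript formatting, and the names `number_dois` and `replace_with_markers` are hypothetical. Replacing via a regex callback handles repeated DOIs in one paragraph and avoids one DOI being matched as a prefix of a longer one.

```python
import re

DOI_PATTERN = re.compile(r'10\.\d{4,9}/[-._;()/:A-Z0-9]+', re.IGNORECASE)

def number_dois(paragraph_texts):
    """Map each unique DOI to a number, in order of first appearance."""
    numbering = {}
    for text in paragraph_texts:
        for doi in DOI_PATTERN.findall(text):
            numbering.setdefault(doi, len(numbering) + 1)
    return numbering

def replace_with_markers(text, numbering):
    """Replace every numbered DOI in text with its "[n]" marker."""
    def to_marker(match):
        doi = match.group(0)
        return f"[{numbering[doi]}]" if doi in numbering else doi
    return DOI_PATTERN.sub(to_marker, text)

# Made-up paragraphs standing in for paragraph.text values
paragraphs = [
    "First claim 10.1000/aaa111 and second claim 10.1000/bbb222 here",
    "Repeated citation 10.1000/aaa111 again",
]
numbering = number_dois(paragraphs)
converted = [replace_with_markers(p, numbering) for p in paragraphs]
print(converted)
```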
- Usage example
```python
input_file = 'input.docx'
output_file = 'output.docx'
convert_dois_in_word(input_file, output_file)
```
This defines the input and output file paths and calls `convert_dois_in_word` to convert the DOIs in the Word document and generate the reference list.
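If the script is meant to be run from a terminal rather than edited each time, the paths could come from command-line arguments. The sketch below only handles argument parsing. `parse_paths` and the `_converted` default suffix are assumptions made for illustration, not part of the original code.

```python
import argparse
from pathlib import Path

def parse_paths(argv):
    """Return (input_file, output_file) from a list of CLI arguments."""
    parser = argparse.ArgumentParser(
        description="Replace DOIs in a Word document with numbered citations."
    )
    parser.add_argument("input_file", help="path to the source .docx")
    parser.add_argument(
        "output_file",
        nargs="?",
        help="path for the converted .docx (default: <input>_converted.docx)",
    )
    args = parser.parse_args(argv)
    source = Path(args.input_file)
    if args.output_file:
        target = Path(args.output_file)
    else:
        # Assumed default: write next to the input with a "_converted" suffix
        target = source.with_name(f"{source.stem}_converted{source.suffix}")
    return str(source), str(target)

paths = parse_paths(["input.docx"])
print(paths)  # ('input.docx', 'input_converted.docx')
```

In the real script, `parse_paths(sys.argv[1:])` would supply the two arguments for `convert_dois_in_word`.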