1.read_docx
read_docx() is the starting point for creating word documents from R.
It creates an R object representing a word document that can be manipulated programmatically.
1-1 函数设计目标
read_docx到底读入哪些内容?
首先,R会用unpack_folder()函数将docx文件解压到临时目录,视DOCX文件复杂性,可以得多个文件夹,主要包括:_rels(内含.rels文件),docProps(内含app.xml、core.xml两个文件),word(内含两个文件夹、八个文件),customXml。和一个文件[Content_types].xml
与文档内容相关的有两个文件在word文件夹中:./word/document.xml和./word/_rels/document.xml.rels
officer使用R6Class存储数据,使用openxml_document、docx_part两个存在继承关系的类
openxml_document是R6Class,私有属性有五个:filename,rels_filename,doc,rels_doc,reldir。公用方法有:feed()(更新filename,rels_filename,doc,rels_doc属性值),file_name()(get privagefilename),name()(getbasename(filename)),get()(getprivatefilename),name()(get basename(filename)),get()(get privatefilename),name()(getbasename(filename)),get()(getprivatedoc),replace_xml()(update privatedoc),dirname()(getprivatedoc),dir_name()(get privatedoc),dirname()(getprivatereldir),rel_df()(privaterelsdocrels_docrelsdocget_data()),relationship()(get private$rels_doc)。它的初始化方法,形参只有一个:dir,函数读取rels文件,更新openxml_document对象中rels相关的两个属性值:
{r}
private$reldir = dir
private$rels_doc <- relationship$new()
openxml_document$feed()方法,形参只有一个:file,这个参数file,应该是.../word/document.xml ,其他目录均无法得到源代码的结果。更新openxml_document对象四个属性值
{r}
private$filename <- file #../word/document.xml
private$doc <- read_xml(file) #../word/document.xml
private$rels_filename <- file.path( dirname(file), "_rels", #../word/_rels/document.xml.rels
paste0(basename(file), ".rels") )
private$rels_doc <-
relationship$new()$feed_from_xml(private$rels_filename)
# else private$rels_doc <- relationship$new()
openxml_document的子类docx_part初始化方法,共有三个私有属性:package_dir,cursor,body_xpath。以word文件夹为目标,初始化函数的形参有四个:path,即package_dir,是docx文件解压所在的目录;main_file是document.xml,cursor待补充,body_path待补充。初始化函数调用openxml_document的初始化方法和feed方法外,另外更新两个值
{r}
super$initialize("word")
private$package_dir <- path
private$body_xpath <- body_xpath
super$feed(file.path(private$package_dir,"word",main_file))
private$cursor <- cursor
调用docx_part初始化方法完成,读取word文件夹下的document.xml.rels、document.xml文件内容,并更新以下属性值:
{r}
private$reldir = "word" # 待证
private$rels_doc <- read_xml('../word/_rels/document.xml.rels')
private$filename <- "../word/document.xml"
private$doc <- read_xml("../word/document.xml")
private$rels_filename <- "../word/_rels/document.xml.rels"
private$package_dir <- "../"
private$body_xpath <- body_xpath
private$cursor <- cursor
1-2 命名空间
1-2-1 命名空间的意义
XML 是一种非常好用的标记语言,它具有极好的可扩展性,因此当我们需要访问 XML 文档时,有可能会出现这样的情况:
在同一份 XML 文档中可能出现多个同名的标签和属性,而这些标签和属性意义的又是完全不同的,遇到这种情况如果我们不能从语法上提供区别,则 XML 处理器将无法区分它们
1-2-2 XML标准
XML 命名空间为我们提供了一个标准的语法,声明 XML 名称空间,并为 XML 文档里的某个元素确定命名空间
要在文档里使用 XML 命名空间,元素名就变成了限定名(qualified names 缩写为 qName),限定名分成了两部分,一部分就是我们之前使用的元素名,另一部分是命名空间的前缀,它确定了这个名称所在的命名空间
<b:book xmlns:b="http://www.yeahstack.com/xml/b">
我们在根标签中添加了一个 xmlns:b 属性,xmlns 代表的是 xml namespace,b 是我们声明的命名空间前缀,b 本身并没有意义,可以将它理解为是 http://www.yeahstack.com/xml/b 的一个别名,我们在标签中使用 b,就相当于使用这个 uri 地址。一旦使用了 b 这个前缀,就代表这个标签是属于 http://www.yeahstack.com/xml/b 这个唯一标识命名空间下的元素
我们还可以在一个文档中定义多个命名空间,如下的语法也是没有问题的:
{xml}
<b:book xmlns:b="http://www.yeahstack.com/xml/b"
xmlns:a="http://www.yeahstack.com/xml/a">
1-2-3 默认的命名空间
这样我们在文档中就可以使用 a 和 b 两个前缀来区分不同的命名空间中的标签了.但是实际上咱们所使用的前缀并不友好,为了方便识别在开发中尽量使用便于识别的前缀,比如 book,author 等
采用以上的方式声明命名空间已经可以很好的解决了咱们的问题,但是这种方式显得有一些麻烦,因为每一个标签都需要加上一个前缀,不如直接写标签名来的爽快.所以 XML 还给我们提供了一种方式可以声明一个默认的命名空间,具体如下:
{xml}
<b:book xmlns="http://www.yeahstack.com/xml/b"
xmlns:a="http://www.yeahstack.com/xml/a">
上边的 xmlns="http://www.yeahstack.com/xml/b"并没有指定前缀,那么这种没有指定前缀的命名空间就会作为页面中元素的默认命名空间,除非在标签中使用其他命名空间的前缀,否则解析器都会认为元素是在默认命名空间下存在
但是要注意的是一个文档中只能有一个默认的命名空间,如下的语法是错误的:
{xml}
<b:book xmlns="http://www.yeahstack.com/xml/b"
xmlns="http://www.yeahstack.com/xml/a">
这里我们指定了两个命名空间而都没有使用前缀,解析器在解析文档时会不知道使用哪个命名空间,所以在一个文档中只能有一个默认的命名空间,其他命名空间必须使用前缀
1-3 读取文档设置参数:update_docx_settings_from_file
本函数在docx_settings.R中定义,从settings.xml文件中读取settings,需要从中获取以下属性的值:
- zoom 指定应用程序显示文档时应应用于文档的放大级别
- default_tab_stop 自动制表位之间的距离
- hyphenation_zone 指定在自动或手动断字符此文档内容时应使用的断字区域
- decimal_symbol 用于字段代码评估 的
decimalSymbol(基数点) - list_separator (字段代码评估) 的列表分隔符
- compatibility_mode
- enev_and_odd_headers 不同的偶数/奇数页页眉和页脚
- auto_hyphenation
但这些属性不是每个文件中都有用户自定义设置,所以先设置一个默认值,再用文件内容更新
总体思路是:
xml %>% read_xml() %>% xml_child() %>% xml_attr()
注意:update_docx_settings_from_file()并没有涉及R6Class对象的属性,而是直接返回list结构。
这样做,应该是本函数只负责提供一个通用的函数,不负责对象创建,需要时简单调用即可
{r}
# docx_settings
docx_settings <- function(
zoom = 1,
default_tab_stop = .5,
hyphenation_zone = .25,
decimal_symbol = ".",
list_separator = ";",
compatibility_mode = "15",
enev_and_odd_headers = FALSE,
auto_hyphenation = FALSE){
x <- list(
zoom = zoom,
default_tab_stop = default_tab_stop,
hyphenation_zone = hyphenation_zone,
decimal_symbol = decimal_symbol,
list_separator = list_separator,
even_and_odd_headers = even_and_odd_headers,
auto_hyphenation = auto_hyphenation,
compatibility_mode = compatibility_mode
)
class(x) <- "docx_settings"
x
}
# update_docx_settings_from_file
update_docx_settings_from_file <- function(x,file){
if(is.null(file)){
cli::cli_abort(
"File settings for word documen is null."
)
}
if(!file.exists(file)){
cli::cli_abort(
"File does not exists {.file{file}}"
)
}
node_doc <- read_xml(file) # 读取settings.xml内容用
# zoom 指定应用程序显示文档时应应用于文档的放大级别。
# 通过使用存储在此元素上的两个属性来指定缩放级别:
# val,用于存储应用于文档的缩放类型
# 百分比,用于存储呈现文档时要使用的缩放百分比
# 当对象序列化为 xml 时,其限定名称为 w:zoom
node_zoom <- xml_child(node_doc,"w:zoom")
if(!inherits(node_zoom,"xml_missing")){
x$zoom <- as.integer(xml_attr(node_zoom,"percent"))/100
}
# 自动制表位之间的距离。
# 将对象序列化为 xml 时,其限定名称为 w:defaultTabStop
node_tab_stop <- xml_child(ndoe_doc,"w:defaultTabStop")
if(!inherits(node_tab_stop,"xml_missing")){
x$default_tab_stop <- as.integer(xml_attr(node_tab_stop,"val"))/1440
}
# w:hyphenation 指定在自动或手动断字符此文档内容时应使用的断字区域。
# 断字区域是可在 (行末尾保留的空格量,或添加到对齐行) 之前,
# 应在文档中 (的下一个单词上尝试断字,以减少) 行上的空格量。
# 较小的断字区域应减少给定文档正文文本右边缘的不规则性,
# 因为更多的单词被断字。
# 相反,较大的断字区域应会增加给定文档文本右边缘的不规则性,
# 因为断字的字数更少
node_hyphenation_zone <- xml_child(node_doc,"w:hyphenation")
if(!inherits(node_hyphenation_zone,"xml_missing")){
x$hyphenation_zone <- as.integer(xml_attr(
node_hyphenation_zone,
"val"
))/1440
}
node_auto_hyphenation <- xml_child(node_doc,"w:autoHyphenation")
if(!inherits(node_auto_hyphenation,"xml_misssing")){
x$auto_hyphenation <- TRUE
}
# 用于字段代码评估 的 decimalSymbol (基数点)
# 此元素指定在评估当前文档中所有字段的内容时应解释为基数点的字符
node_decimal_symbol <- xml_child(node_doc,"w:decimalSymbol")
if(!inherits(node_decimal_symbol,"xml_missing")){
x$decimal_symbol <- xml_attr(node_decimal_symbol,"val")
}
# listSeparator (字段代码评估) 的列表分隔符
# 此元素指定在评估当前文档中所有字段的内容时应解释为列表项分隔符的字符。
# 根据当前文档的内容评估字段指令时,必须知道必须将其视为列表分隔符的字符,
# 该分隔符必须用于评估本文档内容中的字段,而不考虑加载文件的应用程序的区域设置。
node_list_separator <- xml_node(node_doc,"w:listSepartor")
if(!inherits(node_list_separator,"xml_missing")){
x$list_separator <- xml_attr(node_list_separator,"val")
}
# 不同的偶数/奇数页页眉和页脚
node_evenodd_headers <- xml_child(node_doc,"w:evenAndOddHeaders")
if(!inherits(node_evenodd_headers,"xml_missing")){
x$even_and_odd_headers <- TRUE
}else{
x$even_and_Odd_headers <- FALSE
}
x
}
相关函数在relationship.R中定义,处理的是《.rels》文件
1-4 读取文档关系设置:relationship
1-4-1 Structure of a WordProcessingML Document
The basic document structure of a WordProcessingML document consists of the document and body elements, followed by one or more block level elements such as p, which represents a paragraph. A paragraph contains one or more r elements. The r stands for run, which is a region of text with a common set of properties, such as formatting. A run contains one or more t elements. The t element contains a range of text. The following code example shows the WordprocessingML markup for a document that contains the text "Example text."
Open XML 文档存储为包,其格式由 ISO/IEC 29500 定义。 包可以具有多个彼此之间存在关系的部件。 部件之间的关系控制文档的类别。 如果文档的包关系项包含与主文档部件的关系,可将文档定义为字处理文档。 如果文档的包关系项包含与演示文稿部件的关系,可将文档定义为演示文稿文档。 如果文档的包关系项包含与工作簿部件的关系,可将文档定义为电子表格文档。 在本操作方法主题中,您将使用字处理文档包。
{xml}
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body>
<w:p>
<w:r>
<w:t>Example text.</w:t>
</w:r>
</w:p>
</w:body>
</w:document>
1-4-2 the R6 class:relationship
最终获取的参数包括ID、type、target、target_mode、ext_src,get_data()则把这些参数用数据框结构返回
{r}
#' @importForm R6 R6Class
relationship <- R6Class(
"relationship",
public = list(
initialize = function(
id = character(0),
type = character(0),
target = character(0)
){
private$id <- id
private$type <- type
private$target <- target
private$target_mode <- rep(NA_character_,length(target))
private$ext_src <- character(length(id))
},
feed_from_xml <- function(path){
doc <- read_xml(x=path)
children <- xml_children(doc)
xn <- xml_ns(doc)
private$id <- c(private$id,
vapply(
children,
xml_attr,
NA_character_,
attr = "Id",
ns
))
private$type <- c(private$type,
vapply(children,xml_attr,NA_character_,
attr = "Type",ns ))
private$target <- c(
private$target,
vapply(children,xml_attr,NA_character_,
attr = "Target",ns)
)
private$target_mode <- c(private$target_mode,
vapply(children,xml_attr,NA_character_,
attr = "TargetMode",ns))
private$ext_src <- c(private$ext_src,character(length(children)))
self
},
write = function(path){
if(length(private$id)){
str <- paste0(
"<Relationship Id = \"",
private$id,
"\" Typt=\"",
private$type,
"\" Target=\"",
htmlEscapeCopy(private$target),
ifelse(
is.na(private$target_mode),
"",
"\" TargetMode=\"External"
),
"\"/>"
)
}else{
sr <- character(length = 0)
}
str <- c(
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>",
"\n<Relationships xmlns=\"http:www\">",
src,
"</Relationships>"
)
dir.create(dirname(path),showWarnings = FALSE,recursive = FALSE)
writeLines(str,con = path,useBytes = TRUE)
self
},
get_next_id <- function(){
max(c(0,private$get_int_id()),na.rm = TRUE)+1
},
get_data = function(){
data <- data.frame(
id = character(0),
int_id = integer(0),
type = character(0),
target = character(0),
target_mode = character(0),
ext_src = character(0),
stringsAsFactors = FALSE
)
if(length(private$id)){
chr_stripped_id <- gsub("rId([0-9]+)","\\1",private$id)
test_id_as_num <- grepl("^rId[0-9]+",private$id)
if(any(!test_id_as_num)){
wrong_targets <- paste0(
"invalid id(s) for {shQuote(private$target[!test_id_as_num])}"
)
names(wrong_targets) <- rep("*",length(wrong_targets))
cli::cli_warn(
c(
"x" = "relationship file contains invalid id(s).",
wrong_targets
)
)
}
data <- data.frame(
id = private$id,
int_id = as.integer(chr_stripped_id),
type = private$type,
target = private$targe,
target_mode = private$target_mode,
ext_src = private$ext_src,
stringsAsFactors = FALSE
)
data[order(data$id),]
}
data
},
get_images_path = function(){
# URL字符串可以分成两部分,文件名+后缀=basename(),路径=dirname()
# %in% 返回一个由T|F构成的向量
is_img <- basename(private$type) %in% "image"
targets <- private$target[is_img]
names(targets) <- private$id[is_img]
targets
},
add_img <- function(src,root_target){
src <- setdiff(src,private$ext_src)
if(!length(src)){
return(self)
}
if(any(grepl(" ",basename(src)))){
stop(
paste(src,collapse = ", "),
": images with blanks in their basenames ",
call. = FALSE
)
}
last_id <- max(c(0,private$get_int_id()),na.rm = TRUE)
id <- paste0("rId",seq_along(src) + last_id)
type <- rep(
"http://wwww/image",
length(src)
)
target <- file.path(root_target,basename(src))
private$id <- c(private$id,id)
private$type <- c(private$type,type)
private$target <- c(private$target,target)
private$target_mode <- c(private$target_mode,rep(NA,length(id)))
private$ext_src <- c(private$ext_src,src)
self
},
add_drawing = function(src,root_target){
src <- setdiff(src,private$ext_src)
if(!length(src)){
return(self)
}
last_id <- max(c(0,private$get_int_id()),na.rm = TRUE)
id <- paste0("rId",seq_along(src) + last_id)
type <- rep(
"http://www/relationships/drawing",
length(src)
)
target <- file.path(root_target,basename(src))
private$id <- unlist(c(private$id,id))
private$type <- unlist(c(private$type,type))
private$target <- unlist(c(private$target,target))
private$target_mode <- unlist(
c(private$target_mode,rep(NA,length(id)))
)
private$ext_src <- unlist(c(private$ext_src,
rep(NA,length(id))))
self
},
add = function(id,type,target,target_mode = NA){
Encoding(target) <- "UTF-8"
if(!target %in% private$target){
private$id <- c(private$id,id)
private$type <- c(private$type,type)
private$target <- c(private$target,target)
private$target_mode <- c(private$target_mode,target_mode)
private$ext_src <- c(private$ext_src,"")
}
self
},
remove = function(target){
id <- which(basename(private$target) %in% basename(target))
private$id <- private$id[-id]
private$type <- private$type[-id]
private$target <- private$target[-id]
private$target_mode <- private$target_mode[-id]
private$ext_src <- private$ext_src[-id]
self
},
show = function(){
print(self$get_data())
}
),
private = list(
id = NA,
type = NA,
target = NA,
target_mode = NA,
ext_src = NA,
get_int_id = function(){
if(length(private$id) > 0 ){
as.integer(gsub("rId([0-9]+)","\\1",private$id))
}else{
integer(0)
}
}
)
)
1-5 read_custom_properties
该函数在custom_propertied.R中定义
{r}
read_custom_propertied <- function(package_dir){
filename <- file.path(package_dir,"docProps/custom.xml")
if(!file.exists(filename)){
filename <- system.file(package = "officer","template/custom.xml")
}
doc <- read_xml(filename)
all_children <- xml_children(doc)
pid_values <- vapply(all_children,xml_attr,NA_character_,"pid")
name_values <- vapply(all_children,xml_attr,NA_character_,"name")
properties_types <- vapply(all_children,
function(x){
xml_name(xml_child(x,1))
},NA_character_)
value_values <- vapply(all_children,function(x){
as.character(xml_child(x,1))
},NA_character_)
pat1 <- sprintf("<vt\\:(%s)/>",paste0(properties_types,collapse = "|"))
# 不知道pat2的pattern是否应该为"<vt\\:(%s)>"
pat2 <- sprintf("<vt\\:(%s)",paste0(properties_types,collapse = "|"))
pat3 <- sprintf("</vt\\:(%s)>",paste0(properties_types,collapse = "|"))
value_values <- gsub(pat1,"",value_values)
value_values <- gsub(pat2,"",value_values)
value_values <- gsub(pat3,"",value_values)
str <- c(pid_values,name_values,properties_types,value_values)
z <- matrix(str,
ncol = 4,
dimnames = list(NULL,c("pid","name","type","value")))
z <- list(data=z)
class(z) <- "custom_properties"
z
}
1-6 read_core_properties
该函数在core_propertied.R中定义
{r}
read_core_properties <- function(package_dir){
filename <- file.path(package_dir,"docProps/core.xml")
if(!file.exists(filename)){
filename <- system.file(package = "officer","template/core.xml")
}
doc <- read_xml(filename)
ns_ <- xml_ns(doc)
all_ <- xml_find_all(doc,"/cp:coreProperties/*")
if(length(all_)<1){
out <- list(data=structure(character(0),.Dim = c(0L,4L),
.Dimnames=list(NULL,c("ns","name","attrs","value"))),
ns = c(cp="http://",dc="http://",dcmitype="htttp",
dcterms="http://",xsi="http://"))
return(out)
}
names_ <- sapply(all_,xml_name,ns = ns_)
names_ <- strsplit(names_,":")
# concat all attrs in single chars
attrs <- sapply(xml_attrs(all_),function(ns_){
paste0("xsi:",names(ns_),'=\"',ns_,'\"',collapse = " ")
})
attrs <- ifelse(sapply(xml_attrs(all_),length)<1,
"",paste0(" ",attrs))
propnames <- sapply(names_,function(x) x[2])
props <- matrix(c(sapply(names_,function(x) x[1]),
propnames,attrs,xml_text(all_)),ncol = 4)
colnames(props) <- c("ns","name","attrs","value")
rownames(props) <- propnames
z <- list(data = props,ns = unclass(ns_))
class(z) <- "core_properties"
z
}
1-7 content_type
该函数在openxml_content_type.R中定义
{r}
content_type <- R6Class(
"content_type",
public = list(
initialize = function(package_dir){
private$filename <- file.path(package_dir,"[Content_Type].xml")
doc <- read_xml(x = private$filename)
ns <- xml_ns(doc)
node_template <- xml_find_first(doc, "d1:Override[@ContentType='application/vnd.openxmlformats-officedocument.presentationml.template.main+xml']")
if (!inherits(node_template, "xml_missing")) {
xml_attr(node_template, 'ContentType') <- "application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml"
}
# 问什么前面node_template没有利用,下面重新赋值?
# 前面那是ppt,后面这个是word,可能是作者忘记删除前者了
node_template <- xml_find_first(doc, "d1:Override[@ContentType='application/vnd.openxmlformats-officedocument.wordprocessingml.template.main+xml']")
if (!inherits(node_template, "xml_missing")) {
xml_attr(node_template, 'ContentType') <- "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
}
}
),
private = list(
)
)
content_type的源代码打字太费劲,放到一边,先不看了
1-8 docx_part
该函数在docx_part.R中定义
{r}
docx_part <- R6Class(
"docx_part",
inherit = openxml_document,
public = list(
initialize = function(path,main_file,cursor,body_xpath){
super$initialize("word")
private$package_dir <- path
private$body_xpath <- body_xpath
super$feed(file.path(private$package_dir,"word",main_file))
private$cursor <- cursor
},
length = function(){
# self$get()返回xml
xml_length(xml_find_first(self$get(),private$body_xpath))
},
patch_wml = function(
package_dir,
styles_info_tbl_from,
styles_info_tbl_to,
numbering_mapping,
par_style_mapping = list(),
run_style_mapping = list(),
tbl_style_mapping = list(),
addditional_ns = character(0),
prepend_chunks_on_styles = list()
){
doc_str <- self$encode_wml_str(additional_ns = additional_ns)
document_rels <- self$rel_df()
# images processing ------
doc_from_img <- document_rels[basename(document_rels$type) %in% "image",]
for(i in seq_len(nrow(doc_from_img))){
fileext <- paste0(".",tools::file_ext(doc_from_img$target[i]))
new_file <- tempfile(fileext = fileext)
file.copy(
from = file.path(
package_dir,
"word/media",
basename(doc_from_img$target[i])
),
to = new_file,
overwrite = TRUE
)
pat <- "r:embed=\"%s\""
pat <- sprintf(pat,doc_from_img$id[i])
m <- gregexpr(pat,doc_str)
regmatches(doc_str,m) <- sprintf(
"r:embed=\"%s\"",
new_file
)
}
# external links processing -----
doc_from_hl <- document_rels[basename(document_rels$type) %in% "hyperlink",]
for(i in seq_len(nrow(doc_from_hl))){
pat <- "r:id=\"%s\""
pat <- sprintf(pat,doc_from_hl$id[i])
m <- gregexpr(pat,doc_str)
regmatches(doc_str,m) <-
sprintf("r:id=\"%s\"",doc_from_hl$target[i])
}
# numberings processing ------
for(i in seq_len(nrow(numbering_mapping))){
id_from <- numbering_mapping$from[i]
id_to <- numbering_mapping$to[i]
m <- gregexpr(
sprintf("<w:numId w:val=\"%s\"/>",
as.character(id_to))
)
}
sty_par_info_to <- styles_info_tbl_to[
styles_info_tbl_from$style_type %in% "paragraph",
]
# append chuncks when specific styles are found ------
for(style in names(prepend_chunks_on_styles)){
style_id <- sty_par_info_from$style_id[styles_info_tbl_from$style_name %in% style]
# find all paragraphs with this style
match_pstyle <- grep(sprintf("W:pStyle w:val=\"$s\"",style_id),doc_str)
# find all </w:pPr> after each match
match_end_ppr <- grep("</w:pPr>",doc_str)
for(par_i in match_pstyle){
# find next </w:pPr>
current_match_end_ppr <- match_end_ppr[match_end_ppr>par_i]
current_match_end_ppr <- head(current_match_end_ppr,n=1)
# prepend chunk
if(length(current_match_end_ppr) == 1){
doc_str[current_match_end_ppr] <- paste0(
doc_str[current_match_end_ppr],
to_wml(prepend_chunks_on_styles[[style]])
)
}
}
}
# par styles processing ------
m <- gregexpr("w:pStyle w:val=\"[[:alnum:]]+\"",doc_str)
zz <- regmatches(doc_str,m)
zz <- unlist(zz)
p_styles <- gsub("w:pStyle w:val=\"([[:alnum:]]+)\"","\\1",zz)
p_styles <- unique(p_styles)
p_styles <- setdiff(p_styles,sty_par_info_to$style_id)
mapping_styles <- style_mapping_fortify(
style_ids = p_styles,
style_mapping = par_style_mapping,
styles_info_tbl_from = styles_info_tbl_from,
styles_info_tbl_to = styles_info_tbl_to,
style_type = "paragraph",
document_part_label = sefl$document_part_label()
)
for(i in seq_len(nro(mapping_styles))){
doc_str <- gsub(
paste0("w:pStryle w:val=\"",mapping_styles$id_from[i],"\""),
sprintf("w:pStyle w:val=\"%s\"",mapping_styles$id_to[i]),
doc_str
)
}
# runs/characters styles processing -----
sty_chr_info_fo <- styles_info_tbl_to[
styles_info_tbl_to$style_type %in% "character",
]
m <- gregexpr("w:rStyle w:val=\"[[:alnum:]]+\"",doc_str)
zz <- regmatches(doc_str,m)
zz <- unlist(zz)
r_styles <- gsub("w:rStyle w:val=\"([[:alnum:]]+)\"","\\1",zz)
r_styles <- unique(r_styles)
r_styles <- setdiff(r_styles,sty_chr_info_to$style_id)
mapping_styles <- style_mapping_fortify(
styles_ids = r_styles,
style_mapping = run_style_mapping,
styles_info_tbl_from = styles_info_tbl_from,
styles_info_tbl_to = styles_info_tbl_to,
style_type = "character",
document_part_label = self$document_part_label()
)
for(i in seq_len(nrow(mapping_styles))){
doc_str <- gsub(
paste0("w:rStyle w:val=\"",mapping_styles$id_from[i],"\""),
sprintf("w:rStyle w:val=\"%s\"",mapping_styles$id_to[i]),
doc_str
)
}
# tables styles processing -----
sty_tab_info_to <- styles_info_tbl_to[
styles_info_tbl_to$style_type %in% "table",
]
m <- gregexpr("w:tblStyle w:val=\"[[:alnum:]]+\"",doc_str)
zz <- regmatches(doc_str,m)
zz <- unlist(zz)
tbl_styles <- gsub("w:tblStyle w:val=\"([[:alnum:]]+)\"","\\1",zz)
tbl_styles <- unique(tbl_styles)
tbl_styles <- setdiff(tbl_styles,sty_tab_info_to$style_id)
mapping_styles <- style_mapping_fortify(
style_ids = tbl_styles,
style_mapping = tbl_style_mapping,
styles_info_tbl_from = styles_info_tbl_from,
styles_info_tbl_to = styles_info_tbl_to,
style_type = "table",
document_part_label = self$document_part_label()
)
for(i in seq_len(nrow(mapping_styles))){
doc_str <- gsub(
paste0("w:tblStyle w:val=\"",mapping_styles$id_from[i],"\""),
sprintf("w:tblStyle w:val=\"%s\"",mapping_styles$id_to[i]),
doc_str
)
}
doc_str
}
),
private = list(
package_dir = NULL,
cursor = NULL,
body_xpath = NULL
)
)
再看body_part的父类openxml_document的定义,在openxml_document.R中
{r}
openxml_document <- R6Class(
"openxml_document",
public = list(
initialize = function(dir){
private$reldir = dir
private$rels_doc = relationship$new()
},
feed = function(file){
private$filename <- file
private$doc <- read_xml(file)
private$rels_filename <- file.path(dirname(file),
"_rels",paste0(basename(file),".rels"))
if(file.exists(private$filename))
private$rels_doc <- relationship$new$feed_from_xml(private$rels_filename)
else private$rels_doc <- relationship$new()
self
},
file_name = function(){
private$filename
},
name = function(){
basename(private$filename)
},
get = function(){
private$doc
},
replace_xml = function(file){
private$doc <- read_xml(file,options = "NOBLANKS")
},
dir_name = function(){
private$reldir
},
save = function(){
private$doc <- read_xml(as.character(private$doc),options = "NSCLEAN")
write_xml(private$doc,file = private$filename)
if(nrow(self$rel_df())>0){
private$rels_doc$write(private$rels_filename)
}
self
},
remove = function(){
unlink(private$filename)
if(file.exists(private$rels_filename))
unlink(private$rels_filename)
private$filename
},
rel_df = function(){
private$rels_doc$get_data()
},
relationship = function(){
private$rels_doc
}
),
private = list(
filename = NULL,
rels_filename = NULL,
rels_doc = NULL,
reldir = NULL
)
)
1-9 update_hf_list()
函数在post-proc.R中定义
{r}
update_hf_list <- function(part_list = list(),type="header",package_dir){
files <- list.files(
path = file.path(package_dir,"word"),
pattern = sprintf("^%s[0-9]*.xml$",type)
)
files <- files[!basename(files) %in% names(part_list)]
if(type %in% "header"){
cursor <- "/w:hdr/*[1]"
body_xpath <- "/w:hdr"
}else{
cursor <- "/w:ftr/*[1]"
body_xpath <- "/w:ftr"
}
new_list <- lapply(files,function(x){
docx_part$new(
path = package_dir,
main_file = x,
cursor = cursor,
body_xpath = body_xpath
)
})
names(new_list) <- basename(files)
append(part_list,new_list)
}
回到read_docx源码
1-10 read_docx_styles
函数在read_docx_styles.R中定义
{r}
read_docx_styles <- function(package_dir){
style_file <- file.path(package_dir,"word/styles.xml")
doc <- read_xml(styles_file)
all_styles <- xml_find_all(doc,"/w:styles/w:style")
ppr <- read_ppr(all_styles)
rpr <- read_rpr(all_styles)
main <- data.frame(stringsAsFactors = FALSE,
style_type = xml_attr(all_styles,"type"),
style_id = xml_attr(all_styles,"styleId"),
style_name = xml_attr(xml_child(all_styles,"w:name"),"val"),
base_on = xml_attr(xml_child(all_styles,"w:basedOn"),"val"),
is_custom = xml_attr(all_styles,"customStyle") %in% "1",
is_default = xml_attr(all_styles,"default") %in% "1"
)
out <- cbind(main,ppr,rpr)
out
}
1-11 read_docx源码
{r}
read_docx <- function(path=NULL){
# 测试文档是否存在
## 路径值非空但文件不存在
if(!is.null(path) && !file.exists(path)){
stop("could not find file",shQuote(path),call. = FALSE)
}
## 路径值空
if(is.null(path)){
path <- system.file(package = "officer","template/template.docx")
}
## 判断文件类型
if(!grepl("\\.(docx|dotx)$",path,ignore.case = TRUE)){
stop("only support docx files",call. = FALSE)
}
# 将docx文件解压为xml
package_dir <- tempfile()
unpack_folder(file=path,folder=package_dir)
# 创建`rdocx`类的数据结构
# R语言 StructureClasses 位于 methods 包 (package)。
# 虚拟类structure及其扩展类是类似于S语言结构 (例如数组和时间序列)的正式类
# 注意不是R6Class类
obj <- structure(list(package_dir = package_dir),
.Names = c("package_dir"),
class = "rdocx"
)
obj$settings <- update_docx_settings_from_file(
x = docx_settings(),
file = file.path(package_dir,"word","settings.xml")
)
# 先提供一个零模板,再用文件具体设置更新,这个思路要记住
# relationship是R6类,还是先创建一个零对象
obj$rel <- relationship$new()
obj$rel$feed_from_xml(file.path(package_dir,"_rels",".rels"))
obj$doc_properties_custom <- read_custom_properties(package_dir)
obj$doc_properties <- read_core_properties(package_dir)
obj$content_type <- content_type$new(pakcage_dir)
obj$doc_obj <- body_part$new(
package_dir,
main_file = "document.xml",
cursor = "/w:document/w:body/*[1]",
body_xpath = "/w:document/w:body"
)
obj$styles <- read_docx_styles(package_dir)
obj$officer_cursor <- officer_cursor(obj$doc_obj$get())
obj$header <- update_hf_list(part_list = list(),type = "header",package_dir = pacakge_dir)
obj$footer <- update_hf_list(part_list = list(),type = "footer",package_dir = package_dir)
if(!file.exists(file.path(package_dir,"word","comments.xml"))){
file.copy(
system.file(package = "officer","template","comments.xml"),
file.path(package_dir,"word","comments.xml"),
copy.mode = FALSE
)
obj$content_type$add_override(
setNames("application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml","/wrod/commets.xml")
)
}
obj$comments <- docx_part$new(
package_dir,
main_file = "comments.xml",
cursor = "/w:comments/*[last()]",
body_xpath = "/w:comments"
)
if(!file.exists(file.path(package_dir,"word","footnotes.xml"))){
file.copy(
system.file(package = "officer","template","footnotes.xml"),
file.path(package_dir,"word","footnotes.xml"),
copy.mode = FALSE
)
obj$content_type <- add_override(
setNames("application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml","/word/footnotes.xml")
)
}
obj$footnotes <- footnotes_part$new(
package_dir,
main_file = "footnotes.xml",
cursor = "/w:footnotes/*[last()]",
body_xpath <- "/w:footnotes"
)
default_refs <- obj$styles[obj$styles$is_default,]
obj$default_styles <- setNames(as.list(default_refs$style_name),default_refs$style_type)
last_sect <- xml_find_first(obj$doc_obj$get(),"/w:docuemnt/w:body/w:sectPr[last()]")
obj <- cursor_end(obj)
obj
}