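The Problem
In game development, large volumes of frequently changing data are usually configured as static data tables, which makes the data easy for code to read and easy for designers to tune.
Without an in-house table export tool, the usual approach is to configure the data in Excel and export it with scripts. As a project progresses, a single Excel file may hold several sheets and the project as a whole accumulates many Excel files, so finding which Excel file contains a given sheet name becomes a routine task.
So how can this reverse lookup be done efficiently?
Plain Traversal
The first approach that comes to mind is to walk the directory that holds the Excel files, open each file, and check whether any of its sheet names match.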
python
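# Note: xlrd 2.0 and later read only .xls files; opening .xlsx needs xlrd < 2.0.
from xlrd import open_workbook
import time
import os
import sys
import re


def findExcel():
    # sys.argv always contains the script name, so check its length instead.
    if len(sys.argv) < 2:
        print(f"usage: {sys.argv[0]} <sheet-name-pattern> [directory]")
        return
    keyWord = sys.argv[1]
    # Fall back to the current directory when no path is given.
    path = sys.argv[2] if len(sys.argv) >= 3 else "."
    print(f"======== start finding target '{keyWord}' ========")
    for filename in os.listdir(path):
        if not filename.endswith(".xlsx"):
            continue
        try:
            findTarget(path, filename, keyWord)
        except Exception as e:
            # Skip unreadable files, but report them instead of failing silently.
            print(f"Error processing {filename}: {e}")


def findTarget(path, filename, keyWord):
    filepath = os.path.join(path, filename)
    workbook = open_workbook(filepath)
    for name in workbook.sheet_names():
        # keyWord is interpreted as a regular expression.
        if re.search(keyWord, name):
            print(f"sheet name: {name} -> file name {filename}")


if __name__ == "__main__":
    startTime = time.time()
    findExcel()
    endTime = time.time()
    print(f"total cost time {endTime - startTime}s")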
Because there are many Excel files, this brute-force loop is slow: the whole run takes a little over 22 s.
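One detail of the script is worth spelling out: the keyword is passed to `re.search`, so it is treated as a regular expression and matches anywhere in the sheet name. The sketch below illustrates that behaviour and one possible way to force a literal match. The helper name `sheet_matches` and its `literal` flag are assumptions made for illustration; they are not part of the original script.

```python
import re


def sheet_matches(keyword, sheet_name, literal=False):
    """Return True if keyword matches sheet_name.

    With literal=False the keyword is a regular expression, as in the
    script above. With literal=True (a hypothetical option) regex
    metacharacters are escaped so the keyword is matched as plain text.
    """
    pattern = re.escape(keyword) if literal else keyword
    return re.search(pattern, sheet_name) is not None


if __name__ == "__main__":
    print(sheet_matches("Item", "ItemDrop"))           # substring match
    print(sheet_matches("^Item$", "ItemDrop"))         # anchored, whole name only
    print(sheet_matches("a.c", "abc"))                 # "." matches any character
    print(sheet_matches("a.c", "abc", literal=True))   # "." must be a real dot
```

Anchoring the pattern with `^` and `$` is a simple way to look up an exact sheet name when many sheets share a common prefix.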
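Parallel Search
To speed this up, we can take advantage of multiple CPU cores and search in several processes at once.
By default, multiprocessing.Pool creates as many worker processes as there are CPU cores, so several Excel files are searched at the same time.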
python
# Note: xlrd 2.0 and later read only .xls files; opening .xlsx needs xlrd < 2.0.
from xlrd import open_workbook
import os
import sys
import time
import multiprocessing
import re


def process_excel(filename, path, keyWord):
    """Worker function that handles a single Excel file."""
    try:
        filepath = os.path.join(path, filename)
        workbook = open_workbook(filepath)
        for name in workbook.sheet_names():
            # keyWord is interpreted as a regular expression.
            if re.search(keyWord, name):
                print(f"sheet name: {name} -> file name {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {e}")


def findExcel():
    # sys.argv always contains the script name, so check its length instead.
    if len(sys.argv) < 2:
        print(f"usage: {sys.argv[0]} <sheet-name-pattern> [directory]")
        return
    keyWord = sys.argv[1]
    path = sys.argv[2] if len(sys.argv) >= 3 else "."
    print(f"======== start finding target '{keyWord}' ========")
    excel_files = [f for f in os.listdir(path) if f.endswith(".xlsx")]
    # Hand the files to a process pool; each worker handles one file at a time.
    with multiprocessing.Pool() as pool:
        pool.starmap(process_excel, [(filename, path, keyWord) for filename in excel_files])


if __name__ == "__main__":
    startTime = time.time()
    findExcel()
    endTime = time.time()
    print(f"total cost time {endTime - startTime}s")
With the parallel search, efficiency improves considerably: the total time drops to about 5.1 s.
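The pool size does not have to be left at its default. As a minimal sketch, assuming that it is not useful to start more workers than there are files, the size can be capped and passed through the `processes` argument of `multiprocessing.Pool`. The names `pick_worker_count` and `square` are hypothetical, and `square` merely stands in for the per-file work.

```python
import multiprocessing
import os


def pick_worker_count(file_count, cpu_count=None):
    """Cap the pool size at the number of files, with a minimum of one worker."""
    cpus = cpu_count or os.cpu_count() or 1
    return max(1, min(cpus, file_count))


def square(x):
    # Placeholder task; the real script would open one Excel file here.
    return x * x


if __name__ == "__main__":
    tasks = list(range(8))
    workers = pick_worker_count(len(tasks))
    with multiprocessing.Pool(processes=workers) as pool:
        # map() returns the results in task order once all workers finish.
        print(workers, pool.map(square, tasks))
```

Whether capping helps in practice depends on the machine and the number of files, so it is worth timing both variants before settling on one.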
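XML Parsing
Is there an even faster implementation after this optimization? To find one, we need to look at how an Excel file is built underneath.
An .xlsx file is in fact a zip archive that packages the data, the references between sheets, the styles, and other information. Renaming the .xlsx extension to .zip and opening the archive reveals the following layout: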
shell
.
├── [Content_Types].xml
├── _rels
├── docProps
│   ├── app.xml
│   ├── core.xml
│   └── custom.xml
└── xl
    ├── _rels
    │   └── workbook.xml.rels
    ├── comments1.xml
    ├── comments2.xml
    ├── comments3.xml
    ├── drawings
    │   ├── vmlDrawing1.vml
    │   ├── vmlDrawing2.vml
    │   └── vmlDrawing3.vml
    ├── sharedStrings.xml
    ├── styles.xml
    ├── theme
    │   └── theme1.xml
    ├── workbook.xml
    └── worksheets
        ├── _rels
        │   ├── sheet1.xml.rels
        │   ├── sheet2.xml.rels
        │   └── sheet3.xml.rels
        ├── sheet1.xml
        ├── sheet2.xml
        ├── sheet3.xml
        ├── sheet4.xml
        └── sheet5.xml
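The same layout can be inspected without renaming anything, because Python's standard `zipfile` module opens an .xlsx file directly. The sketch below is illustrative only: `list_members` and `make_sample_archive` are hypothetical helpers, and the sample archive contains empty placeholder members rather than a real workbook.

```python
import io
import zipfile


def list_members(xlsx):
    """Return the member paths of an .xlsx archive (a path or binary file object)."""
    with zipfile.ZipFile(xlsx) as archive:
        return archive.namelist()


def make_sample_archive(member_names):
    """Build a small in-memory zip with empty members, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in member_names:
            archive.writestr(name, "")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    sample = make_sample_archive(
        ["[Content_Types].xml", "xl/workbook.xml", "xl/worksheets/sheet1.xml"]
    )
    for member in list_members(sample):
        print(member)
```

For a real file, `list_members("some_table.xlsx")` would print paths such as `xl/workbook.xml`, matching the tree above; the file name here is a placeholder.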
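In this layout, workbook.xml holds the basic information for every sheet, so reading just this one XML file is enough for a fast reverse lookup. The code is as follows: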
python
from lxml import etree
import os
import sys
import multiprocessing
import zipfile
import tempfile
import time
import re

# XML namespace used by the elements in workbook.xml.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def parserWorkBook(filename, path, keyWord):
    # Extract into a fresh temporary directory that is removed automatically,
    # so files left over from a previous workbook can never be parsed by mistake.
    with tempfile.TemporaryDirectory() as dirName:
        try:
            with zipfile.ZipFile(os.path.join(path, filename), "r") as zip_ref:
                zip_ref.extractall(dirName)
        except Exception:
            # Not a readable zip archive (for example an Excel lock file); skip it.
            return
        # Parse workbook.xml.
        try:
            with open(os.path.join(dirName, "xl", "workbook.xml"), "rb") as file:
                tree = etree.parse(file)
        except Exception:
            # Missing or malformed workbook.xml; skip this file.
            return
    root = tree.getroot()
    # Walk the <sheet> elements and match on their name attribute.
    for sheet in root.findall(f".//{{{MAIN_NS}}}sheet"):
        sheetname = sheet.attrib["name"]
        if re.search(keyWord, sheetname):
            print(f"sheet name: {sheetname} -> file name {filename}")


def findExcel():
    # sys.argv always contains the script name, so check its length instead.
    if len(sys.argv) < 2:
        print(f"usage: {sys.argv[0]} <sheet-name-pattern> [directory]")
        return
    keyWord = sys.argv[1]
    path = sys.argv[2] if len(sys.argv) >= 3 else "."
    print(f"======== start finding target '{keyWord}' ========")
    excel_files = [f for f in os.listdir(path) if f.endswith(".xlsx")]
    # Hand the files to a process pool; each worker handles one file at a time.
    with multiprocessing.Pool() as pool:
        pool.starmap(parserWorkBook, [(filename, path, keyWord) for filename in excel_files])


if __name__ == "__main__":
    startTime = time.time()
    findExcel()
    endTime = time.time()
    print(f"total cost time {endTime - startTime}s")
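Because the irrelevant cell data no longer has to be loaded, and the references between data and between sheets no longer have to be built, efficiency improves further: the total time finally drops to about 1.3 s.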
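A possible further refinement, not measured in this article, is to skip the extraction step and read `xl/workbook.xml` straight out of the archive. The sketch below does this with the standard library only, using `xml.etree.ElementTree` in place of lxml. The helpers `sheet_names`, `matching_sheets`, and `make_sample_xlsx` are hypothetical, and the sample archive contains nothing but a minimal workbook.xml, so it is not a file that Excel itself could open.

```python
import io
import re
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by the elements in workbook.xml.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def sheet_names(xlsx):
    """Return the sheet names listed in xl/workbook.xml.

    xlsx may be a file path or a binary file object. Nothing is written
    to disk; the member is parsed directly from the archive.
    """
    with zipfile.ZipFile(xlsx) as archive:
        with archive.open("xl/workbook.xml") as member:
            root = ET.parse(member).getroot()
    return [sheet.attrib["name"] for sheet in root.iter(f"{{{MAIN_NS}}}sheet")]


def matching_sheets(xlsx, pattern):
    """Return the sheet names that match the regular expression pattern."""
    return [name for name in sheet_names(xlsx) if re.search(pattern, name)]


def make_sample_xlsx(names):
    """Build an in-memory archive holding only a minimal workbook.xml (demo only)."""
    sheets = "".join(
        f'<sheet name="{name}" sheetId="{index}"/>'
        for index, name in enumerate(names, start=1)
    )
    xml = f'<workbook xmlns="{MAIN_NS}"><sheets>{sheets}</sheets></workbook>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("xl/workbook.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    sample = make_sample_xlsx(["Item", "ItemDrop", "Monster"])
    print(matching_sheets(sample, "^Item"))
```

Since `matching_sheets` returns its results instead of printing them, it could also serve as the worker function of the process pool, with the parent process collecting and printing the matches.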