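Using the program above, I generated three xlsx files with different combinations of tag attributes. Together with the existing duckdb and wps formats, that makes five files in total, all of which are read with rusty_sheet's read_sheet and written to in-memory tables.
Because of the program's memory requirements, this time I extracted only 3000 rows by 256 columns. The script is as follows: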
sql
copy (from 'wps40000_256.xlsx' limit 3000) to '4/duck_h0.xlsx'; -- then open it in WPS and save it as '4/wps_h0.xlsx'
.system python3 stripxml3.py 4/wps_h0.xlsx 4
create or replace table t2a as from read_sheet('4/duck_h0.xlsx');
create or replace table t3a as from read_sheet('4/wps_h0.xlsx');
create or replace table t4a as from read_sheet('4/wps0.xlsx');
create or replace table t5a as from read_sheet('4/wps1.xlsx');
create or replace table t6a as from read_sheet('4/wps2.xlsx');
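Generated files:
wps0.xlsx - no spans, no r on rows, no r on cells
wps1.xlsx - spans, no r on rows, no r on cells
wps2.xlsx - spans, r on rows, no r on cells
I tested with the default compression level 6, comparing both file size and read_sheet read/write time.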
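To make the three variants concrete, here is a minimal sketch of how such attributes could be dropped from worksheet XML. It is not the stripxml3.py used in this post (that script is not reproduced here); the function name `strip_sheet_attrs` and its flags are hypothetical, and it assumes the row and cell tags are simple enough for regular expressions.

```python
import re

def strip_sheet_attrs(xml: str, keep_spans: bool, keep_row_r: bool, keep_cell_r: bool) -> str:
    """Drop optional positional attributes from <row> and <c> tags (hypothetical helper)."""

    def fix_row(match: re.Match) -> str:
        tag = match.group(0)
        if not keep_spans:
            tag = re.sub(r'\s+spans="[^"]*"', "", tag)
        if not keep_row_r:
            tag = re.sub(r'\s+r="[^"]*"', "", tag)
        return tag

    def fix_cell(match: re.Match) -> str:
        tag = match.group(0)
        if not keep_cell_r:
            tag = re.sub(r'\s+r="[^"]*"', "", tag)
        return tag

    # \b keeps <row ...> apart from <rowBreaks> and <c ...> apart from <col>/<cols>.
    xml = re.sub(r"<row\b[^>]*>", fix_row, xml)
    xml = re.sub(r"<c\b[^>]*>", fix_cell, xml)
    return xml

sample = '<row r="1" spans="1:2"><c r="A1"><v>1</v></c><c r="B1" t="s"><v>0</v></c></row>'
print(strip_sheet_attrs(sample, keep_spans=False, keep_row_r=False, keep_cell_r=False))  # wps0 style
print(strip_sheet_attrs(sample, keep_spans=True, keep_row_r=False, keep_cell_r=False))   # wps1 style
print(strip_sheet_attrs(sample, keep_spans=True, keep_row_r=True, keep_cell_r=False))    # wps2 style
```

Without the r attributes a reader has to infer each cell's position from document order, so this only suits dense sheets with no skipped cells.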
sql
D .system ls -l 4/duck*xlsx
-rw-r--r-- 1 root root 9315700 Aug 21 10:28 4/duck_h0.xlsx
D .system ls -l 4/wps*xlsx
-rw-r--r-- 1 root root 6838304 Aug 21 13:59 4/wps0.xlsx
-rw-r--r-- 1 root root 6841962 Aug 21 13:59 4/wps1.xlsx
-rw-r--r-- 1 root root 6853863 Aug 21 13:59 4/wps2.xlsx
-rw-r--r-- 1 1000 1000 10313455 Aug 21 13:58 4/wps_h0.xlsx
D create or replace table t2a as from read_sheet('4/duck_h0.xlsx');
Run Time (s): real 1.082 user 0.968000 sys 0.116000
D
D create or replace table t3a as from read_sheet('4/wps_h0.xlsx');
Run Time (s): real 1.088 user 1.212000 sys 0.128000
D
D create or replace table t4a as from read_sheet('4/wps0.xlsx');
Run Time (s): real 0.762 user 0.748000 sys 0.016000
D
D create or replace table t5a as from read_sheet('4/wps1.xlsx');
Run Time (s): real 0.893 user 0.900000 sys 0.044000
D
D create or replace table t6a as from read_sheet('4/wps2.xlsx');
Run Time (s): real 0.777 user 0.940000 sys 0.012000
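As shown, the duckdb format includes the r attributes on rows and cells but no spans attribute, and the WPS format is about 1 MB larger. The spans attribute can explain only a small part of that: wps1.xlsx, which keeps spans, is just 3658 bytes larger than wps0.xlsx, so most of the difference must come from other things WPS writes. Looking at the read/write times of the five files, the first two are close to each other, and the last three are close to each other.
The conclusion is that removing the r attribute from cells improves read/write performance by roughly 20 to 30%, while keeping or dropping the spans attribute and the row r attribute makes little difference.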
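The percentages can be checked directly from the "real" times in the transcript above. The sketch below is plain arithmetic on those figures, using wps_h0.xlsx as the baseline. Each figure is a single run, so the results are only indicative.

```python
# "real" times in seconds, copied from the level-6 read_sheet runs above
baseline = 1.088  # wps_h0.xlsx, which keeps every attribute
stripped = {"wps0": 0.762, "wps1": 0.893, "wps2": 0.777}

def reduction_pct(before: float, after: float) -> float:
    """Relative time saved, in percent."""
    return (before - after) / before * 100

savings = {name: round(reduction_pct(baseline, t), 1) for name, t in stripped.items()}
for name, pct in savings.items():
    print(f"{name}: {pct}% faster than wps_h0")
```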
Next, change the compression level to 9.
sql
D .system ls -l 4/wps*xlsx
-rw-r--r-- 1 root root 6746806 Aug 21 14:04 4/wps0.xlsx
-rw-r--r-- 1 root root 6749691 Aug 21 14:04 4/wps1.xlsx
-rw-r--r-- 1 root root 6760856 Aug 21 14:04 4/wps2.xlsx
-rw-r--r-- 1 1000 1000 10313455 Aug 21 13:58 4/wps_h0.xlsx
D create or replace table t4a as from read_sheet('4/wps0.xlsx');
Run Time (s): real 0.788 user 0.744000 sys 0.044000
D
D create or replace table t5a as from read_sheet('4/wps1.xlsx');
Run Time (s): real 0.757 user 0.744000 sys 0.016000
D
D create or replace table t6a as from read_sheet('4/wps2.xlsx');
Run Time (s): real 0.827 user 0.744000 sys 0.080000
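File size drops by less than 100 KB, which is almost negligible, and read_sheet read/write time shows no clear change.
Then change the compression level to 3.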
sql
D .system ls -l 4/wps*xlsx
-rw-r--r-- 1 root root 7154271 Aug 21 14:07 4/wps0.xlsx
-rw-r--r-- 1 root root 7158605 Aug 21 14:07 4/wps1.xlsx
-rw-r--r-- 1 root root 7172998 Aug 21 14:07 4/wps2.xlsx
-rw-r--r-- 1 1000 1000 10313455 Aug 21 13:58 4/wps_h0.xlsx
D create or replace table t4a as from read_sheet('4/wps0.xlsx');
Run Time (s): real 0.771 user 0.756000 sys 0.012000
D
D create or replace table t5a as from read_sheet('4/wps1.xlsx');
Run Time (s): real 0.762 user 0.916000 sys 0.008000
D
D create or replace table t6a as from read_sheet('4/wps2.xlsx');
Run Time (s): real 0.772 user 0.768000 sys 0.008000
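File size grows by about 300 KB, and read_sheet read/write time shows no clear change.
The default level appears to be the best trade-off between compression time and space.
As an additional check, run the same test with duckdb's excel extension.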
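For readers who want to repeat the comparison, here is a rough sketch of writing the same payload at levels 3, 6 and 9 with Python's standard zipfile module. How stripxml3.py sets its level is not shown in this post, so `fake_sheet` and `zip_size` are hypothetical helpers and the synthetic payload only imitates a numeric sheet.

```python
import io
import random
import zipfile

def fake_sheet(rows: int = 200, cols: int = 32, seed: int = 0) -> bytes:
    """Build synthetic sheet XML filled with random doubles (illustrative only)."""
    rng = random.Random(seed)
    parts = ["<sheetData>"]
    for _ in range(rows):
        parts.append("<row>")
        parts.extend(f"<c><v>{rng.random():.15g}</v></c>" for _ in range(cols))
        parts.append("</row>")
    parts.append("</sheetData>")
    return "".join(parts).encode()

def zip_size(payload: bytes, level: int) -> int:
    """Return the archive size after deflating payload at the given level."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED,
                         compresslevel=level) as zf:
        zf.writestr("xl/worksheets/sheet1.xml", payload)
    return len(buf.getvalue())

payload = fake_sheet()
sizes = {level: zip_size(payload, level) for level in (3, 6, 9)}
for level, size in sizes.items():
    print(f"level {level}: {size} bytes (raw {len(payload)})")
```

On data dominated by random digits most of the gain comes from entropy coding, so, as in the measurements above, the spread between levels tends to be small.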
sql
D create or replace table t2 as from '4/duck_h0.xlsx';
Run Time (s): real 1.010 user 1.032000 sys 0.036000
D
D create or replace table t3 as from '4/wps_h0.xlsx';
Run Time (s): real 0.999 user 1.836000 sys 0.028000
D
D create or replace table t4 as from '4/wps0.xlsx';
Run Time (s): real 0.640 user 1.272000 sys 0.004000
D
D create or replace table t5 as from '4/wps1.xlsx';
Run Time (s): real 0.663 user 1.312000 sys 0.012000
D
D create or replace table t6 as from '4/wps2.xlsx';
Run Time (s): real 0.653 user 1.292000 sys 0.008000
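The results are similar.
One detail needs to be added here: WPS loses precision when re-saving, going from up to 17 significant digits to at most 15.
The duckdb-format file in this test was converted from wps40000_256.xlsx, so both files have 15-digit precision.
If the file is generated from the original duckdb format instead, the difference shows in the first two rows.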
sql
D from t3 limit 2;
┌────────┬───────────────────┬───────────────────┬───────────────────┬───┬───────────────────┬───────────────────┬───────────────────┬───────────────────┬───────────────────┐
│ A1 │ B1 │ C1 │ D1 │ ... │ JS1 │ JT1 │ JU1 │ JV1 │ JW1 │
│ double │ double │ double │ double │ │ double │ double │ double │ double │ double │
├────────┼───────────────────┼───────────────────┼───────────────────┼───┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┤
│ 1162.0 │ 0.505097953594325 │ 0.227422430713457 │ 0.582377087901833 │ ... │ 0.41376105295772 │ 0.308463635432716 │ 0.816321922280297 │ 0.615496003220976 │ 0.703933211108833 │
│ 1215.0 │ 0.560794684273284 │ 0.831577667213861 │ 0.928908924059865 │ ... │ 0.602140846924642 │ 0.211443524629613 │ 0.509366372179941 │ 0.860901326092285 │ 0.723913421773335 │
├────────┴───────────────────┴───────────────────┴───────────────────┴───┴───────────────────┴───────────────────┴───────────────────┴───────────────────┴───────────────────┤
│ 2 rows 257 columns (9 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.015 user 0.012000 sys 0.000000
D copy (from 'duck40000_256.xlsx' limit 3000) to '4/duck.xlsx';
Run Time (s): real 5.669 user 9.996000 sys 0.268000
D create or replace table t2 as from '4/duck.xlsx';
Run Time (s): real 0.999 user 1.888000 sys 0.036000
D from t2 limit 2;
┌────────┬────────────────────┬─────────────────────┬────────────────────┬───┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┐
│ A1 │ B1 │ C1 │ D1 │ ... │ JS1 │ JT1 │ JU1 │ JV1 │ JW1 │
│ double │ double │ double │ double │ │ double │ double │ double │ double │ double │
├────────┼────────────────────┼─────────────────────┼────────────────────┼───┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┤
│ 1162.0 │ 0.5050979535943245 │ 0.22742243071345725 │ 0.5823770879018332 │ ... │ 0.4137610529577199 │ 0.3084636354327157 │ 0.816321922280297 │ 0.6154960032209756 │ 0.7039332111088327 │
│ 1215.0 │ 0.560794684273284 │ 0.8315776672138608 │ 0.9289089240598649 │ ... │ 0.6021408469246416 │ 0.2114435246296129 │ 0.5093663721799409 │ 0.8609013260922846 │ 0.7239134217733347 │
├────────┴────────────────────┴─────────────────────┴────────────────────┴───┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┤
│ 2 rows 257 columns (9 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
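The difference between the two outputs is consistent with rounding each double to 15 significant digits. The sketch below applies that rounding to a few values from the first row of t2. That WPS does exactly this on save is an assumption inferred from the numbers above, not something confirmed by the post.

```python
def to_15_digits(x: float) -> float:
    """Round a double to 15 significant digits (assumed to mimic the WPS re-save)."""
    return float(format(x, ".15g"))

# Values taken from the first row of t2 (original duckdb file)
originals = [0.22742243071345725, 0.5823770879018332, 0.4137610529577199, 0.816321922280297]

for x in originals:
    y = to_15_digits(x)
    status = "changed" if x != y else "unchanged"
    print(f"{x!r} -> {y!r} ({status})")
```

Values that already fit in 15 significant digits, such as 0.816321922280297, come through unchanged, which matches the JU1 column in both tables.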