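Background
Here is what happened:
A system at a certain company can import and export Excel files. The backend originally used POI's XSSFWorkbook, but as the data volume grew, its drawbacks became apparent (it holds the whole workbook in memory), so data export was switched to SXSSFWorkbook.
Then a very strange problem appeared: when an exported .xlsx file was imported back into the system directly, without first being opened and saved once on a computer, the generic Excel parsing method failed and returned no data at all, even though opening the .xlsx file showed the data normally.
Why could a file that clearly contains data not be parsed when imported without being saved first, yet parse fine after being saved?
To get to the root cause of this problem, there is one piece of knowledge that has to be understood first!
What exactly is an .xlsx file?
Reference article: "Have you tried renaming an .xlsx file to .zip and unpacking it to see how Excel stores its data?"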
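Since an .xlsx file is a ZIP archive, its parts can be listed with nothing but the JDK. The sketch below is only an illustration: the class `XlsxZipDemo` and its helpers are names I made up, and `fakeXlsx()` builds a stand-in archive with placeholder contents (not a valid workbook) so that the example runs without a real file. The part names are the ones a typical workbook contains.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class XlsxZipDemo {

    /** Lists the entry names of a ZIP stream; an .xlsx file is such a stream. */
    static List<String> listParts(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a tiny stand-in archive in memory (placeholder contents only). */
    static byte[] fakeXlsx() {
        String[] partNames = {
            "[Content_Types].xml",
            "xl/workbook.xml",
            "xl/sharedStrings.xml",
            "xl/worksheets/sheet1.xml"
        };
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : partNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // To inspect a real export, open the .xlsx file as an InputStream instead.
        for (String name : listParts(new ByteArrayInputStream(fakeXlsx()))) {
            System.out.println(name);
        }
    }
}
```

Running the same `listParts` on a file before and after it has been saved in Excel is a quick way to compare the two package layouts.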
Introduction to SXSSFWorkbook
Streaming version of XSSFWorkbook implementing the "BigGridDemo" strategy. This allows to write very large files without running out of memory as only a configurable portion of the rows are kept in memory at any one time. You can provide a template workbook which is used as basis for the written data. See poi.apache.org/spreadsheet... for details.
Please note that there are still things that still may consume a large amount of memory based on which features you are using, e.g. merged regions, comments, ... are still only stored in memory and thus may require a lot of memory if used extensively.
SXSSFWorkbook defaults to using inline strings instead of a shared strings table. This is very efficient, since no document content needs to be kept in memory, but is also known to produce documents that are incompatible with some clients. With shared strings enabled all unique strings in the document has to be kept in memory. Depending on your document content this could use a lot more resources than with shared strings disabled. Carefully review your memory budget and compatibility needs before deciding whether to enable shared strings or not.
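In short: SXSSFWorkbook is the streaming version of XSSFWorkbook. It can write very large files without exhausting memory because only a configurable window of rows is held in memory at any one time, and a template workbook can be supplied as the basis for the written data. By default it writes inline strings rather than a shared strings table. This is very memory-efficient, since no document content has to be kept in memory, but it is known to produce documents that some clients cannot read. With shared strings enabled, every unique string in the document must be kept in memory, which may cost considerably more resources. Weigh your memory budget against your compatibility needs before deciding whether to enable shared strings.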
Constructor (straight to the code)
java
/**
 * Constructs an workbook from an existing workbook.
 * <p>
 * When a new node is created via {@link SXSSFSheet#createRow} and the total number
 * of unflushed records would exceed the specified value, then the
 * row with the lowest index value is flushed and cannot be accessed
 * via {@link SXSSFSheet#getRow} anymore.
 * </p>
 * <p>
 * A value of <code>-1</code> indicates unlimited access. In this case all
 * records that have not been flushed by a call to <code>flush()</code> are available
 * for random access.
 * </p>
 * <p>
 * A value of <code>0</code> is not allowed because it would flush any newly created row
 * without having a chance to specify any cells.
 * </p>
 *
 * @param workbook  the template workbook
 * @param rowAccessWindowSize the number of rows that are kept in memory until flushed out, see above.
 * @param compressTmpFiles whether to use gzip compression for temporary files
 * @param useSharedStringsTable whether to use a shared strings table
 */
public SXSSFWorkbook(XSSFWorkbook workbook, int rowAccessWindowSize, boolean compressTmpFiles, boolean useSharedStringsTable) {
    setRandomAccessWindowSize(rowAccessWindowSize);
    setCompressTempFiles(compressTmpFiles);
    if (workbook == null) {
        _wb = new XSSFWorkbook();
        _sharedStringSource = useSharedStringsTable ? _wb.getSharedStringSource() : null;
    } else {
        _wb=workbook;
        _sharedStringSource = useSharedStringsTable ? _wb.getSharedStringSource() : null;
        for ( Sheet sheet : _wb ) {
            createAndRegisterSXSSFSheet( (XSSFSheet)sheet );
        }
    }
}
A look at the SXSSFWorkbook source shows that, although several constructors are provided, they all end up calling this one:
java
SXSSFWorkbook(
    XSSFWorkbook workbook,
    int rowAccessWindowSize,
    boolean compressTmpFiles,
    boolean useSharedStringsTable
)
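Based on the method's documentation, here is a brief description of the constructor's parameters:
· workbook - easy to understand: the XSSFWorkbook object on which the actual Excel operations are performed (the template workbook; if it is null, a new XSSFWorkbook is created)
· rowAccessWindowSize - the number of rows that are kept in memory until flushed out. Once more rows than this are held in memory, the rows with the lowest index are written to disk
· compressTmpFiles - whether to use gzip compression for temporary files
· useSharedStringsTable - whether to use a shared strings table (this parameter is the real crux of the problem I ran into, and it gets very little attention in practice. Searching Baidu for SXSSFWorkbook turns up few people who mention it; everyone focuses on rowAccessWindowSize)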
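The behaviour of rowAccessWindowSize described in the Javadoc can be pictured with a toy model. This is not POI code: `RowWindowModel` is a hypothetical class of my own that only mirrors the rules quoted above (the row with the lowest index is flushed once the window is exceeded, -1 means unlimited, 0 is rejected), and the "flush" here just records the row index instead of writing to a temporary file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the row access window; it mirrors the Javadoc, not POI's implementation. */
class RowWindowModel {
    private final int windowSize;
    private final TreeMap<Integer, String> inMemory = new TreeMap<>();
    private final List<Integer> flushed = new ArrayList<>();

    RowWindowModel(int windowSize) {
        // 0 is not allowed: a new row would be flushed before any cell could be set.
        if (windowSize == 0 || windowSize < -1) {
            throw new IllegalArgumentException("window size must be -1 or greater than 0");
        }
        this.windowSize = windowSize;
    }

    void createRow(int index, String data) {
        inMemory.put(index, data);
        // -1 means unlimited: nothing is flushed automatically.
        while (windowSize > 0 && inMemory.size() > windowSize) {
            // Flush the row with the lowest index; it is no longer accessible afterwards.
            flushed.add(inMemory.pollFirstEntry().getKey());
        }
    }

    /** Returns null once the row has been flushed out of the window. */
    String getRow(int index) {
        return inMemory.get(index);
    }

    List<Integer> flushedRows() {
        return flushed;
    }

    public static void main(String[] args) {
        RowWindowModel sheet = new RowWindowModel(2);
        for (int i = 0; i < 5; i++) {
            sheet.createRow(i, "row-" + i);
        }
        System.out.println(sheet.flushedRows()); // prints [0, 1, 2]
        System.out.println(sheet.getRow(1));     // prints null
        System.out.println(sheet.getRow(4));     // prints row-4
    }
}
```

This is why rows written through SXSSFWorkbook have to be filled in roughly top to bottom: once a row has left the window, it can no longer be read back or modified.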
What happens when useSharedStringsTable=true?
java
_sharedStringSource = useSharedStringsTable ? _wb.getSharedStringSource() : null;
Looking at the root constructor, when useSharedStringsTable=true it simply obtains the SharedStringsTable object from the workbook. So what exactly is this SharedStringsTable?
java
/**
 * shared string table - a cache of strings in this workbook
 */
protected final SharedStringsTable _sharedStringSource;
Below is the description from the SharedStringsTable class documentation:
Table of strings shared across all sheets in a workbook.
A workbook may contain thousands of cells containing string (non-numeric) data. Furthermore this data is very likely to be repeated across many rows or columns. The goal of implementing a single string table that is shared across the workbook is to improve performance in opening and saving the file by only reading and writing the repetitive information once.
Consider for example a workbook summarizing information for cities within various countries. There may be a column for the name of the country, a column for the name of each city in that country, and a column containing the data for each city. In this case the country name is repetitive, being duplicated in many cells. In many cases the repetition is extensive, and a tremendous savings is realized by making use of a shared string table when saving the workbook. When displaying text in the spreadsheet, the cell table will just contain an index into the string table as the value of a cell, instead of the full string.
The shared string table contains all the necessary information for displaying the string: the text, formatting properties, and phonetic properties (for East Asian languages).
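In short: the shared string table is a single table of strings shared by every sheet in the workbook. String data tends to repeat across many rows and columns (the country name in a table of cities, for example), so each distinct string is stored once, and a cell holds only an index into the table instead of the full text. This speeds up opening and saving the file, and the savings can be very large when repetition is extensive. The table carries everything needed to display a string: the text, its formatting properties, and its phonetic properties (for East Asian languages).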
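The idea of a shared string table can be sketched in a few lines. This is a simplified model using the country and city example from the documentation; `SharedStringsModel` is a hypothetical class of my own, not POI's SharedStringsTable, and it ignores formatting and phonetic properties entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of a shared string table: each distinct string is stored once. */
class SharedStringsModel {
    private final Map<String, Integer> indexByText = new HashMap<>();
    private final List<String> texts = new ArrayList<>();

    /** Returns the index of the string, adding it only if it is not in the table yet. */
    int add(String text) {
        Integer existing = indexByText.get(text);
        if (existing != null) {
            return existing;
        }
        int index = texts.size();
        texts.add(text);
        indexByText.put(text, index);
        return index;
    }

    /** Resolves a cell's stored index back to its text. */
    String get(int index) {
        return texts.get(index);
    }

    int uniqueCount() {
        return texts.size();
    }

    public static void main(String[] args) {
        SharedStringsModel table = new SharedStringsModel();
        // Two rows of (country, city); the cells keep only indexes.
        int[][] cells = {
            {table.add("China"), table.add("Beijing")},
            {table.add("China"), table.add("Shanghai")}
        };
        System.out.println(cells[1][0]);                   // prints 0 ("China" is reused)
        System.out.println(table.uniqueCount());           // prints 3
        System.out.println(table.get(cells[1][1]));        // prints Shanghai
    }
}
```

The trade-off mentioned in the SXSSFWorkbook documentation is visible here: the table has to stay in memory for the whole write, because any later cell may refer to a string added earlier.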
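When useSharedStringsTable=false (the default, which writes inline strings instead of a shared strings table), the generated Excel file, once unzipped, differs from the standard file structure (allow me to be lazy here and leave the comparison for you to do yourself). A file exported this way still displays its content normally, and after it has been saved once, the unzipped package matches the standard structure again.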
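Here is a sketch of how such a difference can leave a parser empty-handed, assuming the difference is the inline-string encoding that the SXSSFWorkbook Javadoc quoted earlier describes. The two XML fragments are hand-written and simplified (namespaces omitted), and `CellStringDemo` with its `handleInline` switch is my own hypothetical stand-in for a "generic parsing method", not the actual parser from the system in question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CellStringDemo {
    // Cells that point into the shared string table by index (t="s").
    static final String SHARED_ROW =
        "<row><c r=\"A1\" t=\"s\"><v>0</v></c><c r=\"B1\" t=\"s\"><v>1</v></c></row>";
    // The same logical content with the text stored inside each cell (t="inlineStr").
    static final String INLINE_ROW =
        "<row><c r=\"A1\" t=\"inlineStr\"><is><t>China</t></is></c>"
      + "<c r=\"B1\" t=\"inlineStr\"><is><t>Beijing</t></is></c></row>";
    static final List<String> SHARED_STRINGS = Arrays.asList("China", "Beijing");

    /** Reads string cells; with handleInline=false only shared-string cells are understood. */
    static List<String> read(String rowXml, boolean handleInline) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rowXml.getBytes(StandardCharsets.UTF_8)));
            NodeList cells = doc.getElementsByTagName("c");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                Element cell = (Element) cells.item(i);
                String type = cell.getAttribute("t");
                if ("s".equals(type)) {
                    String index = cell.getElementsByTagName("v").item(0).getTextContent();
                    values.add(SHARED_STRINGS.get(Integer.parseInt(index)));
                } else if (handleInline && "inlineStr".equals(type)) {
                    values.add(cell.getElementsByTagName("t").item(0).getTextContent());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse row XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(SHARED_ROW, false)); // prints [China, Beijing]
        System.out.println(read(INLINE_ROW, false)); // prints []
        System.out.println(read(INLINE_ROW, true));  // prints [China, Beijing]
    }
}
```

A reader that only understands shared-string cells sees nothing in the inline version, which matches the symptom of a file that opens fine in Excel but yields no data on import. On the export side, the corresponding switch is the fourth constructor argument, useSharedStringsTable.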
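With that, the root cause of the problem described at the start of this article has been found.
Solving it was a rather winding road, but it was also a chance to learn something new: an Excel file turns out to be a ZIP archive?! And a lesson worth repeating: do not casually use a method from a third-party library before you understand what all of its parameters do!
Growing through pain and joy!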