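Highlighting and jumping to chunks of a given PDF with pdf.js
Rewriting a pdf.js class
Requirement: a PDF file is parsed into multiple chunks. Each chunk must be displayable, and clicking a chunk must highlight the corresponding content in the source PDF and jump to it.
pdf.js Chinese documentation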
https://gitcode.gitcode.host/docs-cn/pdf.js-docs-cn/getting_started/index.html
https://github.com/mozilla/pdf.js
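The documentation is not detailed enough. The hard part of working with pdf.js is the documentation.
Implementing basic display
pdf.js is a JavaScript library developed by Mozilla that displays PDF documents in a web browser. It renders each PDF page onto an HTML5 Canvas element and uses JavaScript to control rendering and interaction, which makes it possible to read PDF documents on the web without installing Adobe Reader or any other PDF reader. pdf.js is free, open-source software that is easy to use and modify.
pdf.js/src is the API layer of pdf.js.
pdf.js/web is the viewer layer, which builds the UI on top of the API layer: lazy loading of pages, page navigation, zoom, text search, opening local files, sidebar navigation, printing, and so on.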
javascript
Prebuilt version
├── build/
│ ├── pdf.js - display layer
│ ├── pdf.js.map - display layer's source map
│ ├── pdf.worker.js - core layer
│ └── pdf.worker.js.map - core layer's source map
├── web/
│ ├── cmaps/ - character maps (required by core)
│ ├── compressed.tracemonkey-pldi-09.pdf - PDF file for testing purposes
│ ├── debugger.js - helpful debugging features
│ ├── images/ - images for the viewer and annotation icons
│ ├── locale/ - translation files
│ ├── viewer.css - viewer style sheet
│ ├── viewer.html - viewer layout
│ ├── viewer.js - viewer layer
│ └── viewer.js.map - viewer layer's source map
└── LICENSE
Source version
├── docs/ - website source code
├── examples/ - simple usage examples
├── extensions/ - browser extension source code
├── external/ - third party code
├── l10n/ - translation files
├── src/
│ ├── core/ - core layer
│ ├── display/ - display layer
│ ├── shared/ - shared code between the core and display layers
│ ├── interfaces.js - interface definitions for the core/display layers
│ ├── pdf.*.js - wrapper files for bundling
│ └── worker_loader.js - used for developer builds to load worker files
├── test/ - unit, font and reference tests
├── web/ - viewer layer
├── LICENSE
├── README.md
├── gulpfile.js - build scripts/logic
├── package-lock.json - pinned dependency versions
└── package.json - package definition and dependencies
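Plenty of people have already written about the display features, so I will not cover them here and will just point to an article:
"pdf.js使用全教程" (a complete pdf.js usage tutorial)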
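Implementing the highlight feature
The chunk content produced by the back end should be the same as the text the front end gets from pdf.js, for example fragments such as `ai \n mind\n is a\n`. We can therefore match the back-end chunks against the front-end text, using the start index and the length of each match. pdf.js implements its own find highlighting in the same way.
- Chunk data format: [chunk 1, chunk 2] (take note of this format)
- Remove the other features (hide them by adding the hidden class directly in the pdfview.html file)
- Rendering chunks can follow the find feature of pdf.js: from the chunk data and the text data parsed by pdf.js, compute match data whose structure is identical to the find-highlight data, then render it with the native rendering of pdf.js.
- Locating a chunk (scrolling to it) can likewise follow the find feature.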
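The matching idea above can be sketched as a whitespace-insensitive search that still reports positions in the page text's original coordinates. This is only an illustration under assumptions: `stripWithMap` and `locateChunk` are hypothetical helper names, and `pageText` stands for the page's text items joined into one string.

```javascript
// Characters ignored when comparing back-end chunks with front-end text
// (whitespace and NUL). The exact set is an assumption.
const IGNORED = /\s|\u0000/;

// Strip ignored characters and remember, for every kept character,
// its index in the original string.
function stripWithMap(text) {
  let stripped = "";
  const map = [];
  for (let i = 0; i < text.length; i++) {
    if (!IGNORED.test(text[i])) {
      stripped += text[i];
      map.push(i);
    }
  }
  return { stripped, map };
}

// Return { index, length } of `chunk` inside `pageText`, measured in
// pageText's original coordinates, or null when the chunk is not found.
function locateChunk(pageText, chunk) {
  const page = stripWithMap(pageText);
  const needle = stripWithMap(chunk).stripped;
  if (!needle) {
    return null;
  }
  const at = page.stripped.indexOf(needle);
  if (at === -1) {
    return null;
  }
  const first = page.map[at];
  const last = page.map[at + needle.length - 1];
  return { index: first, length: last - first + 1 };
}
```

The returned pair has the same meaning as one entry of pageMatches plus one entry of pageMatchesLength, so it can be fed into the same conversion the find feature uses.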
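Analysis of the find feature
findController.pageMatches: for page n, the start index of each match, relative to that page's text data
findController.pageMatchesLength: for page n, the length of each matched string
this._convertMatches turns these two arrays into match objects in begin/end format.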
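To make the two data shapes concrete, here is a simplified, standalone variant of that conversion. It is a sketch rather than the pdf.js source: `toDivRanges` is a hypothetical name, error handling is omitted, and the start indices are assumed to be sorted in ascending order.

```javascript
// Convert flat (start index, length) pairs into begin/end positions over
// the page's text items (one item per text div).
function toDivRanges(items, starts, lengths) {
  const result = [];
  let i = 0; // index of the current text item
  let base = 0; // offset of items[i] inside the concatenated page text
  const last = items.length - 1;
  for (let m = 0; m < starts.length; m++) {
    let pos = starts[m];
    // Advance to the item that contains the start of the match.
    while (i !== last && pos >= base + items[i].length) {
      base += items[i].length;
      i++;
    }
    const begin = { divIdx: i, offset: pos - base };
    pos += lengths[m];
    // Advance to the item that contains the end of the match.
    while (i !== last && pos > base + items[i].length) {
      base += items[i].length;
      i++;
    }
    result.push({ begin, end: { divIdx: i, offset: pos - base } });
  }
  return result;
}

const items = ["AiMind", "is a ", "tool"];
console.log(JSON.stringify(toDivRanges(items, [2], [7])));
// → [{"begin":{"divIdx":0,"offset":2},"end":{"divIdx":1,"offset":3}}]
```

Here the match starts at character 2 of the first item and ends at character 3 of the second, which is exactly the information the renderer needs to wrap the right parts of each div in a highlight span.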
The _updateMatches method, pasted here with logging added:
javascript
// Clear the previous highlights, then call _renderMatches to render the new ones.
_updateMatches(reset = false) {
  if (!this.enabled && !reset) {
    return;
  }
  const { findController, matches, pageIdx } = this;
  const { textContentItemsStr, textDivs } = this;
  let clearedUntilDivIdx = -1;
  // Restore the plain text of every div touched by the previous matches.
  for (const match of matches) {
    const begin = Math.max(clearedUntilDivIdx, match.begin.divIdx);
    for (let n = begin, end = match.end.divIdx; n <= end; n++) {
      const div = textDivs[n];
      div.textContent = textContentItemsStr[n];
      div.className = "";
    }
    clearedUntilDivIdx = match.end.divIdx + 1;
  }
  if (!findController?.highlightMatches || reset) {
    return;
  }
  console.log('findController.pageMatches: per page, start index of each match relative to the page text', findController.pageMatches)
  console.log('findController.pageMatchesLength: per page, length of each matched string', findController.pageMatchesLength)
  const pageMatches = findController.pageMatches[pageIdx] || null;
  console.log('pageMatches', pageMatches);
  const pageMatchesLength = findController.pageMatchesLength[pageIdx] || null;
  this.matches = this._convertMatches(pageMatches, pageMatchesLength);
  console.log('this.matches', this.matches)
  this._renderMatches(this.matches);
}
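Processing logic in the text_highlighter class
enable: called when a page is rendered; binds the event listeners and triggers a render update
_convertMatches: converts data 1 (the flat start indices and lengths) into data 2 (match objects in begin/end format)
_updateMatches: clears the previous highlight styles, then calls _renderMatches to render the new ones
_renderMatches: renders the highlights and calls findController.scrollMatchIntoView to scroll to the target position
Chunk data processing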
javascript
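Data structure
Chunk data: it must carry page information, otherwise it cannot be matched to each page's TextHighlighter instance
[
  // Chunks on the first page
  {
    pageIndex: 0,
    cutInfo: [
      'content 1', 'content 2', .....
    ]
  },
  // Chunks on the second page
  {
    pageIndex: 1,
    cutInfo: [
      'content 1', 'content 2', .....
    ]
  }
]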
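Notes:
1. Process the text with normalizeUnicode. For example, the ligature ﬁ is a single character, and the front end and the back end may parse it differently. Run the text parsed on both sides through the normalizeUnicode function exposed by the pdf.js API; afterwards it becomes the two characters f and i.
2. Filter whitespace: the front-end and back-end data may differ in spaces, line breaks and other whitespace (for example, the front end collapses several spaces into one), so these characters must be filtered out before computing positions. (The regular expression I use is /\s|\u0000|\./g.)
3. The back-end chunk data and the text the front end gets from the PDF do not always agree, so some chunks cannot be mapped exactly when rendering.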
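Notes 1 and 2 can be combined into one helper that is applied to both the back-end chunk and the front-end text before comparing them. The sketch below makes one loud assumption: it uses the standard `String.prototype.normalize("NFKC")` as a stand-in for the normalizeUnicode function of pdf.js, so that it runs without the library. The two are not guaranteed to behave identically, and `normalizeForMatching` is a hypothetical name.

```javascript
// Normalize text for comparison between back-end chunks and front-end text.
function normalizeForMatching(text) {
  return text
    // Assumption: NFKC stands in for pdfjsLib.normalizeUnicode here.
    // It expands compatibility characters such as the ligature U+FB01 into "fi".
    .normalize("NFKC")
    // Drop whitespace, NUL characters and periods (note the escaped dot).
    .replace(/\s|\u0000|\./g, "");
}

console.log(normalizeForMatching("\uFB01le  a.b\n")); // → fileab
```

In the viewer itself, the call to `normalize("NFKC")` would be replaced by the normalizeUnicode function exposed by pdf.js, so that both sides go through exactly the same transformation.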
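Chunk data processing
1. Convert the chunk data into per-page data 2, named pagesMatches, and add custom markers to each entry.
2. Register an updatePagesMatches event in text_highlighter that receives and stores pagesMatches and calls _updateMatches to re-render.
3. Rework _updateMatches in text_highlighter: (1) clear the original highlights; (2) merge page n's pagesMatches with the data 2 produced by the find feature into a new data 2 that carries the custom markers, so that the find feature keeps working after the chunk highlights are rendered.
4. Call _renderMatches to render, writing the extended fields onto the HTML elements and adding styles. Highlight rendering is then complete.
5. Jumping to a highlight:
Add the extension field isSelected to the pagesMatches data.
Scroll the containing page into view with PDFViewerApplication.pdfViewer.currentPageNumber = n.
While rendering in _renderMatches, use isSelected together with the find feature's selected match to decide which HTML element to scroll to.
Call findController.scrollMatchIntoView to scroll.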
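Steps 1 and 5 can be sketched from the caller's side as follows. The names `buildPagesMatches` and `jumpToSection` are hypothetical, `app` is assumed to look like PDFViewerApplication (a `pdfViewer` with `currentPageNumber` and an `eventBus` with `dispatch`), and the class name pattern is only an example. The updatePagesMatches event is the custom event described in step 2, not a built-in pdf.js event.

```javascript
// Tag every range of every page with the custom fields and mark one
// section as selected. `pagesRanges` maps pageIndex -> array of
// { begin, end } ranges. A new object is returned on every call.
function buildPagesMatches(pagesRanges, selected) {
  const pagesMatches = {};
  for (const [pageIndex, ranges] of Object.entries(pagesRanges)) {
    pagesMatches[pageIndex] = ranges.map((range, sectionIndex) => ({
      ...range,
      sectionIndex,
      className: `section-${sectionIndex}`,
      isSelected:
        Number(pageIndex) === selected.pageIndex &&
        sectionIndex === selected.sectionIndex,
    }));
  }
  return pagesMatches;
}

// Scroll the page into view, then ask the highlighters to re-render.
function jumpToSection(app, pagesRanges, selected) {
  const pagesMatches = buildPagesMatches(pagesRanges, selected);
  app.pdfViewer.currentPageNumber = selected.pageIndex + 1; // page numbers are 1-based
  app.eventBus.dispatch("updatePagesMatches", { pagesMatches });
  return pagesMatches;
}
```

Returning a fresh object each time matters for the class shown in this article: its listener compares the incoming pagesMatches with the stored one by reference and only re-arms scrolling when they differ.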
javascript
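// The modified TextHighlighter class, pasted in full.
// Assumption: the two variables below are module-level state that the class relies on;
// their declarations are not shown in the original paste.
let defaultPagesMatches = null; // latest pagesMatches received via "updatePagesMatches"
let defaultPagesMatchesIsFocus = false; // true until the selected section has been scrolled to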
class TextHighlighter {
constructor({
findController,
eventBus,
pageIndex
}) {
this.findController = findController;
this.matches = [];
this.eventBus = eventBus;
this.pageIdx = pageIndex;
this._onUpdateTextLayerMatches = null;
this.textDivs = null;
this.textContentItemsStr = null;
this.enabled = false;
// Create the _onUpdatePagesMatches listener if it does not exist yet
if (!this._onUpdatePagesMatches) {
this._onUpdatePagesMatches = (evt) => {
if (evt.pagesMatches !== defaultPagesMatches) {
defaultPagesMatches = evt.pagesMatches;
defaultPagesMatchesIsFocus = true;
sessionStorage.removeItem("pdfFindBar");
}
this._updateMatches(false);
};
this.eventBus._on("updatePagesMatches", this._onUpdatePagesMatches);
}
}
setTextMapping(divs, texts) {
this.textDivs = divs;
this.textContentItemsStr = texts;
}
enable() { // Called when the page is rendered: binds the event listeners and triggers a render update
console.log('enable')
if (!this.textDivs || !this.textContentItemsStr) {
throw new Error("Text divs and strings have not been set.");
}
if (this.enabled) {
throw new Error("TextHighlighter is already enabled.");
}
console.log('page rendered ------------');
this.enabled = true;
if (!this._onUpdateTextLayerMatches) {
this._onUpdateTextLayerMatches = evt => {
if (evt.pageIndex === this.pageIdx || evt.pageIndex === -1) {
this._updateMatches();
}
};
this.eventBus._on("updatetextlayermatches", this._onUpdateTextLayerMatches);
}
if (!this._onUpdatePagesMatches) {
this._onUpdatePagesMatches = (evt) => {
if (evt.pagesMatches !== defaultPagesMatches) {
defaultPagesMatches = evt.pagesMatches;
defaultPagesMatchesIsFocus = true;
sessionStorage.removeItem("pdfFindBar");
}
this._updateMatches(false);
};
this.eventBus._on("updatePagesMatches", this._onUpdatePagesMatches);
}
this._updateMatches();
}
disable() {
if (!this.enabled) {
return;
}
console.log('disable')
this.enabled = false;
if (this._onUpdateTextLayerMatches) {
this.eventBus._off("updatetextlayermatches", this._onUpdateTextLayerMatches);
this._onUpdateTextLayerMatches = null;
}
// Remove the listener when the highlighter is disabled
if (this._onUpdatePagesMatches) {
this.eventBus._off("updatePagesMatches", this._onUpdatePagesMatches);
this._onUpdatePagesMatches = null;
}
this._updateMatches(true);
}
_convertMatches(matches, matchesLength) { // Converts the flat indices and lengths into begin/end format
if (!matches) {
return [];
}
const {
textContentItemsStr
} = this;
let i = 0
let iIndex = 0;
const end = textContentItemsStr.length - 1;
const result = [];
try {
for (let m = 0, mm = matches.length; m < mm; m++) {
let matchIdx = matches[m];
while (i !== end && matchIdx >= iIndex + textContentItemsStr[i].length) {
iIndex += textContentItemsStr[i].length;
i++;
}
if (i === textContentItemsStr.length) {
console.error("Could not find a matching mapping");
}
const match = {
begin: {
divIdx: i,
offset: matchIdx - iIndex
}
};
matchIdx += matchesLength[m];
while (i !== end && matchIdx > iIndex + textContentItemsStr[i].length) {
iIndex += textContentItemsStr[i].length;
i++;
}
match.end = {
divIdx: i,
offset: matchIdx - iIndex
};
result.push(match);
}
} catch (e) {
console.error("_convertMatches failed", e);
}
return result;
}
_renderMatches(matches) {
// Early exit if there is nothing to render.
if (matches.length === 0) {
return;
}
const isPagesMatch = sessionStorage.getItem("pdfFindBar") !== "pdfFindBar";
const { textContentItemsStr, textDivs, findController, pageIdx } = this;
if (!textDivs?.length) {
return;
}
const isSelectedPage = findController?.selected
? pageIdx === findController.selected.pageIdx
: true;
const selectedMatchIdx = findController?.selected?.matchIdx ?? 0;
// const highlightAll = !options ? findController.state.highlightAll : true;
const highlightAll = true;
let prevEnd = null;
const infinity = {
divIdx: -1,
offset: undefined,
};
function beginText(begin, className, styles) {
const divIdx = begin.divIdx;
if (!textDivs[divIdx]) {
return;
}
textDivs[divIdx].textContent = "";
return appendTextToDiv(divIdx, 0, begin.offset, className, styles);
}
function appendTextToDiv(divIdx, fromOffset, toOffset, className, styles) {
let div = textDivs[divIdx];
if (!div) {
return;
}
if (div.nodeType === Node.TEXT_NODE) {
const span = document.createElement("span");
div.before(span);
span.append(div);
textDivs[divIdx] = span;
div = span;
}
const content = textContentItemsStr[divIdx].substring(
fromOffset,
toOffset
);
const node = document.createTextNode(content);
if (className) {
const span = document.createElement("span");
if (styles && span) {
for (let p in styles) {
span.style[p] = styles[p];
}
}
span.className = `${className} appended`;
span.append(node);
div.append(span);
return className.includes("selected") ? span.offsetLeft : 0;
}
div.append(node);
return 0;
}
let i0 = selectedMatchIdx,
i1 = i0 + 1;
if (highlightAll) {
i0 = 0;
i1 = matches.length;
} else if (!isSelectedPage) {
// Not highlighting all and this isn't the selected page, so do nothing.
return;
}
let lastDivIdx = -1;
let lastOffset = -1;
let selectedElement;
let findIndex = -1;
for (let i = i0; i < i1; i++) {
const match = matches[i];
const begin = match.begin;
if (begin.divIdx === lastDivIdx && begin.offset === lastOffset) {
// It's possible to be in this situation if we searched for a 'f' and we
// have a ligature 'ff' in the text. The 'ff' has to be highlighted two
// times.
continue;
}
lastDivIdx = begin.divIdx;
lastOffset = begin.offset;
const end = match.end;
if (match.sectionIndex === undefined) {
findIndex += 1;
}
const isSelected = isPagesMatch
? match.isSelected
: isSelectedPage && findIndex === selectedMatchIdx;
const highlightSuffix = " " + match.className;
let selectedLeft = 0;
// Match inside new div.
if (!prevEnd || begin.divIdx !== prevEnd.divIdx) {
// If there was a previous div, then add the text at the end.
if (prevEnd !== null) {
appendTextToDiv(prevEnd.divIdx, prevEnd.offset, infinity.offset);
}
// Clear the divs and set the content until the starting point.
beginText(begin);
} else {
appendTextToDiv(prevEnd.divIdx, prevEnd.offset, begin.offset);
}
if (begin.divIdx === end.divIdx) {
selectedLeft = appendTextToDiv(
begin.divIdx,
begin.offset,
end.offset,
"highlight" + highlightSuffix,
match.styles
);
} else {
selectedLeft = appendTextToDiv(
begin.divIdx,
begin.offset,
infinity.offset,
"highlight begin" + highlightSuffix,
match.styles
);
for (let n0 = begin.divIdx + 1, n1 = end.divIdx; n0 < n1; n0++) {
if (textDivs[n0]) {
if (match.styles) {
for (let p in match.styles) {
textDivs[n0].style[p] = match.styles[p];
}
}
textDivs[n0].className = "highlight middle" + highlightSuffix;
}
}
beginText(end, "highlight end" + highlightSuffix, match.styles);
}
prevEnd = end;
if (!selectedElement && isSelected) {
let divIdx = begin.divIdx;
while (!textContentItemsStr[divIdx] && divIdx <= end.divIdx) {
divIdx++;
}
const div = textDivs[divIdx];
let isOut = false;
// The element to scroll to must lie inside the visible area
try{
const divStyle = div.style;
let textLayerNode = div;
while (
!textLayerNode.classList.contains("textLayer") &&
textLayerNode.parentElement
) {
textLayerNode = textLayerNode.parentElement;
}
let left = parseFloat(divStyle.left.match(/\d+/g)[0] || "0");
let top = parseFloat(divStyle.top.match(/\d+/g)[0] || "0");
if (
(divStyle.left.includes("%") && left >= 100) ||
(divStyle.top.includes("%") && top >= 100)
) {
isOut = true;
}
if (textLayerNode.classList.contains("textLayer")) {
let width = parseFloat(textLayerNode.style.width.match(/\d+/g)[0] || "0");
let height = parseFloat(textLayerNode.style.height.match(/\d+/g)[0] || "0");
if (
(divStyle.left.includes("px") && left > width) ||
(divStyle.top.includes("px") && top > height)
) {
isOut = true;
}
}
} catch(e) {
console.error(e)
}
if (!isOut && defaultPagesMatchesIsFocus && isPagesMatch) {
selectedElement = div;
} else if (!isOut && !isPagesMatch) {
selectedElement = div;
}
}
}
if (selectedElement && findController) {
findController.scrollMatchIntoView({
element: selectedElement,
selectedLeft: 0,
pageIndex: pageIdx,
matchIndex: selectedMatchIdx,
});
defaultPagesMatchesIsFocus = false;
}
if (prevEnd) {
appendTextToDiv(prevEnd.divIdx, prevEnd.offset, infinity.offset);
}
}
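/**
* Merges the find matches into the section (chunk) matches, splitting a
* section wherever a find match overlaps it.
* @returns the merged list of matches, with empty ranges removed
*/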
_merageMatches(baseMatchs, matches) {
while (matches.length) {
const match = matches[0];
const beginIndex = baseMatchs.findIndex((item) => {
return (
(item.begin.divIdx < match.begin.divIdx ||
(item.begin.divIdx === match.begin.divIdx &&
item.begin.offset <= match.begin.offset)) &&
(item.end.divIdx > match.begin.divIdx ||
(item.end.divIdx === match.begin.divIdx &&
item.end.offset > match.begin.offset))
);
});
const endIndex = baseMatchs.findIndex((item) => {
return (
(item.begin.divIdx < match.end.divIdx ||
(item.begin.divIdx === match.end.divIdx &&
item.begin.offset <= match.end.offset)) &&
(item.end.divIdx > match.end.divIdx ||
(item.end.divIdx === match.end.divIdx &&
item.end.offset >= match.end.offset))
);
});
if (endIndex === -1 && beginIndex === -1) {
baseMatchs.push({
...match,
});
matches.shift();
continue;
}
if (endIndex !== -1 && beginIndex !== -1 && endIndex !== beginIndex) {
baseMatchs[beginIndex].end = { ...match.begin };
baseMatchs[endIndex].begin = { ...match.end };
baseMatchs.splice(beginIndex + 1, endIndex - beginIndex - 1, match);
matches.shift();
continue;
}
if (endIndex !== -1 && beginIndex !== -1 && endIndex === beginIndex) {
baseMatchs.splice(
beginIndex,
1,
{
...baseMatchs[beginIndex],
end: { ...match.begin },
},
match,
{
...baseMatchs[beginIndex],
begin: { ...match.end },
}
);
matches.shift();
continue;
}
if (endIndex !== -1 && beginIndex === -1) {
baseMatchs[endIndex].begin = { ...match.end };
baseMatchs.splice(0, 0, match);
matches.shift();
continue;
}
if (endIndex === -1 && beginIndex !== -1) {
baseMatchs[beginIndex].end = { ...match.begin };
baseMatchs.push(match);
matches.shift();
continue;
}
matches.shift();
console.log("unhandled case", endIndex, beginIndex);
}
baseMatchs.sort((a, b) => {
return a.begin.divIdx - b.begin.divIdx;
});
return baseMatchs.filter((item) => {
if (item.begin.divIdx === item.end.divIdx && item.begin.offset === item.end.offset) {
return false
}
return true
})
}
_updateMatches(reset = false) { // Clears the previous highlights, then calls _renderMatches to render the new ones
if (!this.enabled && !reset) {
return;
}
// this.pageIdx is the index of the current page
const { findController, pageIdx } = this;
const { textContentItemsStr, textDivs, matches = [] } = this;
// console.log('findController', findController);
// console.log('pageIdx', pageIdx);
// console.log('textContentItemsStr', textContentItemsStr);
// console.log('textDivs', textDivs);
// console.log('matches', matches);
// Clear the existing matches
for (let i = 0, ii = matches.length; i < ii; i++) {
const match = matches[i];
const begin = match.begin.divIdx;
for (let n = begin, end = match.end.divIdx; n <= end; n++) {
const div = textDivs[n];
div.textContent = textContentItemsStr[n];
div.className = "";
}
}
// console.log('defaultPagesMatches',defaultPagesMatches);
let sectionMatches = [...(defaultPagesMatches?.[this.pageIdx] || [])];
if (findController?.highlightMatches && !reset) {
const pageMatches = findController.pageMatches[pageIdx] || null;
const pageMatchesLength =
findController.pageMatchesLength[pageIdx] || null;
const findMatches = this._convertMatches(pageMatches, pageMatchesLength);
console.log('findMatches', findMatches);
const selectedMatchIdx = findController.selected.matchIdx;
pageIdx === findController.selected.pageIdx &&
findMatches[selectedMatchIdx] &&
(findMatches[selectedMatchIdx].className = "selected");
this.matches = this._merageMatches(sectionMatches, findMatches);
console.log('this.matches', this.matches);
this._renderMatches(this.matches || []);
return;
}
console.log('2222sectionMatches', sectionMatches);
this.matches = sectionMatches;
this._renderMatches(sectionMatches || []);
}
}
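The most common case in _merageMatches, where a find match falls entirely inside one section, can be illustrated in isolation. The sketch below is a simplified stand-in with hypothetical names (`comparePos`, `splitAround`), not a replacement for the method above: it only covers the branch where the section is split into a leading part, the find match and a trailing part.

```javascript
// Compare two { divIdx, offset } positions: negative when a comes first.
function comparePos(a, b) {
  return a.divIdx - b.divIdx || a.offset - b.offset;
}

// Split `section` around `match` when the match lies entirely inside it.
// Pieces keep the section's custom fields (sectionIndex, className, ...).
function splitAround(section, match) {
  const inside =
    comparePos(section.begin, match.begin) <= 0 &&
    comparePos(match.end, section.end) <= 0;
  if (!inside) {
    return [section];
  }
  return [
    { ...section, end: { ...match.begin } },
    match,
    { ...section, begin: { ...match.end } },
  ].filter((r) => comparePos(r.begin, r.end) !== 0); // drop empty pieces
}
```

Because both outer pieces keep the section's className, the chunk highlight stays visible on either side of the find match, while the find match itself is rendered with its own class.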
javascript
Usage:
function handleTest(evt: any) {
  // Sample chunk text as returned by the back end
  let str = 'AiMind\n文档库'
  let pdfJs = document.getElementsByTagName('iframe')[0]
  // The viewer runs inside the first iframe, so its globals live on window[0]
  let PDFViewerApplication = (window[0] as any).PDFViewerApplication
  // Scroll page 1 (pageIndex 0) into view
  PDFViewerApplication.pdfViewer.currentPageNumber = 1
  let update = PDFViewerApplication.eventBus
  let normalizeUnicode = (window[0] as any).pdfjsLib.normalizeUnicode
  let unicodeHandledStr = normalizeUnicode(str)
  const regex = /\s|\u0000|\./g;
  let regHandledStr = unicodeHandledStr.replace(regex, ' ')
  let metchShotStr = regHandledStr.split(' ')
  // Hard-coded data for testing, keyed by page index
  let testData = {
    0: [
      {
        sectionIndex: 0,
        className: 'section-0 section-color-0',
        begin: { divIdx: 0, offset: 0 },
        end: { divIdx: 0, offset: 2 }
      },
      {
        sectionIndex: 1,
        className: 'section-1 section-color-0',
        begin: { divIdx: 3, offset: 0 },
        end: { divIdx: 3, offset: 50 }
      },
    ]
  }
  // Tell pdf.js to render the highlights
  update.dispatch('updatePagesMatches', { pagesMatches: testData })
}