HTMLRewriter 在测试中的妙用 - Bun 单元测试系列

💎 价值

本文将在"实战"中介绍 bun 内置的速度极快的 HTMLRewriter，4w 多字符的 HTML 处理只需要不到一毫秒！

`HTMLRewriter`

HTMLRewriter 允许你使用 CSS 选择器来转换 HTML 文档。它支持 Request、Response 以及字符串作为输入。Bun 的实现基于 Cloudflare 的 lol-html。

bun.sh/docs/api/ht...

"是 CSS 选择器我们有救了！"，因为 CSS 选择器的强大和灵活毋庸置疑。但是别高兴太早，HTMLRewriter 对选择器的支持并没有浏览器那么全乎。

综上所述：利用 HTMLRewriter 我们可以操作 HTML，而无需借助三方依赖，如 parse5 HTML parser and serializer .)。本身我们的项目单元测试运行时就是 bun，那为何不用 bun 内置的 HTMLRewriter？速度快、0依赖。

先看看官方文档给出的示例，熟悉其 API 风格：

ts 复制代码

rewriter.on("div.content", {
  // Handle elements
  element(element) {
    element.setAttribute("class", "new-content");
    element.append("<p>New content</p>", { html: true });
  },
  // Handle text nodes
  text(text) {
    text.replace("new text");
  },
  // Handle comments
  comments(comment) {
    comment.remove();
  },
});

on(selector, handlers) 方法允许你为一组符合 CSS 选择器的 HTML 元素注册处理函数。解析过程中，每遇到匹配的元素，对应的处理函数就会被调用。可以用来处理 HTML数据，比如爬虫等，比 Python 的 beautifulsoup 更好用。

缺点先讲一下后续随着案例也会具体讲到：

API 并非如 DOM 一样多，比如无法访问父节点
CSS 选择器支持没有浏览器支持那么全面。

案例一、`filter` 的进阶 🧗‍♂️：利用 bun 内置的 `HTMLRewriter`

上一篇文章"HTML 处理以及性能对比 - Bun 单元测试系列"讲到，单元测试生成的文件路径是绝对路径，系统不同则分隔符不同，而且本地生成的路径和 CI 甚至同事之间也会不一样，会导致"明明在我的电脑运行好好的"的经典问题。需要将这些路径"稳定化"，即抹平不同操作系统以及 CI 环境本地环境的差异，比如：

Windows下 D:\\workspace\\foo\\src\\assets\\user.png
CI 环境下 /app/src/assets/user.png to user.png

故我们需要找出这些图片路径并转换成 user.png 或 submit-icon.png 类似这种。

我们介绍过使用 parse5 需要自行递归所有节点，详见上一篇文章。

而使用 HTMLRewriter 我们只需要监听某个元素然后针对其操作即可，API 设计非常人性化，代码可读性很高。

将所有 img src 转换成操作系统、环境无关路径代码：

ts 复制代码

function filter(html: string, ignoreAttrs: IFilter): string {
  // console.time("filter html using HTMLRewriter");
  const rewriter = new HTMLRewriter().on("img", {
    element(node) {
      for (const [name, value] of node.attributes) {
        // 自定义匹配
        const shouldIgnore = ignoreAttrs(node, { name, value });

        if (typeof shouldIgnore === "boolean") {
          node.removeAttribute(name);
        } else {
          // 自定义替换
          node.setAttribute(name, shouldIgnore);
        }
      }
    },
  });

  const result = rewriter.transform(html);
  // console.timeEnd("filter html using HTMLRewriter");

  return result;
}

使用：

ts 复制代码

function toStableHTML(html: string): string {
  return filter(html.trim(), (node, attr) => {
    const isSrcDiskPath =
      node.tagName === 'img' &&
      attr.name === 'src' &&
      (/^[a-zA-Z]:/.test(attr.value) || attr.value.startsWith('/app/'))

    if (isSrcDiskPath) {
      // D:\\workspace\\foo\\src\\assets\\user-2.png
      // to user-2.png
      // /app/src/assets/submitIcon.png to submitIcon.png
      return `...DISK_PATH/${path.basename(attr.value)}`
    }

    // 保留，不做处理
    return false
  })
}

性能对比：

实验数据 HTML 长度 main.innerHTML.length: 41685

ts 复制代码

[0.59ms] filter html using HTMLRewriter
[0.60ms] filter html using HTMLRewriter
[0.85ms] filter html using HTMLRewriter
[0.58ms] filter html using HTMLRewriter
[0.91ms] filter html using HTMLRewriter

五次取平均值，filter 从 parse5 的 10ms 提升到 0.70ms，只有原来的 <math xmlns="http://www.w3.org/1998/Math/MathML"> 1 14 \frac{1} {14} </math>141。

案例二、简化 HTML 断言

比如我们想断言 katex 渲染的页面中有多少个行内公式、块级公式，而且只想断言基础结构否则 snapshot 太大。

上述图中我们想断言存在 <math xmlns="http://www.w3.org/1998/Math/MathML"> i 1... i 4 i1 ... i4 </math>i1...i4 四个行内以及 <math xmlns="http://www.w3.org/1998/Math/MathML"> b 1 b1 </math>b1 一个块级公式，

通过 CSS 选择器 .katex:not(.katex-block .katex) 我们可以选中不在块级的行内元素，一共四个，但是此种写法 HTMLRewriter 并不支持。仅支持简单的 .foo:not(.bar) 不支持 :has，总共支持 20 种 CSS 写法（9 种是属性选择，故实际仅支持 11 种）：

*
E
E:nth-child(n)
E:first-child
E:nth-of-type(n)
E:first-of-type
E:not(s)
E.warning
E#myid
E[foo]
E[foo="bar"]
E[foo="bar" i]
E[foo="bar" s]
E[foo~="bar"]
E[foo^="bar"]
E[foo$="bar"]
E[foo*="bar"]
E[foo|="en"]
E F
E > F

developers.cloudflare.com/workers/run...

故我们只能取巧的断言一个 5 个 inline 1 个 block：

ts 复制代码

  const rewriter = new HTMLRewriter()
    .on('span.katex', {
      element() {
        inlineCount++
      },
    })
    .on('p.katex-block', {
      element() {
        blockCount++
      },
    })
    
  ...
  // 4 个行内，还有 1 个在块级公式中
  // 无法通过 css 选择器选中不在块级公式中的，故断言 5
  expect(inlineCount).toBe(4 + 1)
  expect(blockCount).toBe(1)

其次针对 katex 生成的复杂的 HTML 结构：

我们只想断言简单的 katex-mathml 结构即可证明渲染正常，katex-html 用于渲染过于复杂无需断言，前者13个节点350个字符，后者49，1933个字符 🤯。

这里该怎么写呢，很简单删除这个节点即可，做得更稳妥一点可以断言其出现的次数：

diff 复制代码

+.on('.katex-html', {
+  element(element) {
+    // HTML 节点太多了，只要断言最基本的输出即可
+    element.remove()
+  },
+ })
+
+ expect(await format(result)).toMatchSnapshot()

这样我们的 HTML snapshot 就比较"清爽"便于 diff 和维护 ✨。

元素除了支持 remove 还支持：

bun v1.2.19

ts 复制代码

interface Element {
  getAttribute(name: string): string | null;
  /** Check if an attribute exists */
  hasAttribute(name: string): boolean;
  /** Set an attribute value */
  setAttribute(name: string, value: string): Element;
  /** Remove an attribute */
  removeAttribute(name: string): Element;
  /** Insert content before this element */
  before(content: Content, options?: ContentOptions): Element;
  /** Insert content after this element */
  after(content: Content, options?: ContentOptions): Element;
  /** Insert content at the start of this element */
  prepend(content: Content, options?: ContentOptions): Element;
  /** Insert content at the end of this element */
  append(content: Content, options?: ContentOptions): Element;
  /** Replace this element with new content */
  replace(content: Content, options?: ContentOptions): Element;
  /** Remove this element and its contents */
  remove(): Element;
  /** Remove this element but keep its contents */
  removeAndKeepContent(): Element;
  /** Set the inner content of this element */
  setInnerContent(content: Content, options?: ContentOptions): Element;
  /** Add a handler for the end tag of this element */
  onEndTag(handler: (tag: EndTag) => void | Promise<void>): void;
}

以及只读的属性：

ts 复制代码

interface Element {
  /** The tag name in lowercase (e.g. "div", "span") */
  tagName: string;
  /** Iterator for the element's attributes */
  readonly attributes: IterableIterator<[string, string]>;
  /** Whether this element was removed */
  readonly removed: boolean;
  /** Whether the element is explicitly self-closing, e.g. <foo /> */
  readonly selfClosing: boolean;
  /**
   * Whether the element can have inner content. Returns `true` unless
   * - the element is an [HTML void element](https://html.spec.whatwg.org/multipage/syntax.html#void-elements)
   * - or it's self-closing in a foreign context (eg. in SVG, MathML).
   */
  readonly canHaveContent: boolean;
  /** The element's namespace URI */
  readonly namespaceURI: string;
  /** Get an attribute value by name */
}

完整代码

ts 复制代码

async function expectBriefHTML(actual: string) {
  let inlineCount = 0
  let blockCount = 0
  let katexHTMLCount = 0

  const rewriter = new HTMLRewriter()
    .on('span.katex', {
      element() {
        inlineCount++
      },
    })
    .on('p.katex-block', {
      element() {
        blockCount++
      },
    })
    .on('.katex-html', {
      element(element) {
        // HTML 节点太多了，只要断言最基本的输出即可
        element.remove()
        katexHTMLCount++
      },
    })

  const result = rewriter.transform(actual)

  // 4 个行内，还有 4 个在块级公式中
  // 无法通过 css 选择器选中不在块级公式中的，故断言 8
  expect(inlineCount).toBe(4 + 1)
  expect(blockCount).toBe(1)
  expect(katexHTMLCount).toBe(5) // 所有公式数量

  expect(await format(result)).toMatchSnapshot()
}

欢迎关注"JavaScript与编程艺术"

HTMLRewriter 在测试中的妙用 - Bun 单元测试系列

💎 价值

HTMLRewriter

案例一、filter 的进阶 🧗‍♂️：利用 bun 内置的 HTMLRewriter

性能对比：

案例二、简化 HTML 断言

完整代码

`HTMLRewriter`

案例一、`filter` 的进阶 🧗‍♂️：利用 bun 内置的 `HTMLRewriter`