[Algorithms 4] Algorithms (4th Edition) Study Notes 23 - 5.4 Regular Expressions

Contents

- Preface
- References
- Study notes
  - 1: Regular expressions
    - 1.1: Notation
    - 1.2: Shortcuts
  - 2: Regular expressions and nondeterministic finite-state automata (REs and NFAs)
    - 2.1: Duality
    - 2.2: Pattern-matching implementation
    - 2.3: Nondeterministic finite-state automata
    - 2.4: Nondeterminism
  - 3: NFA simulation
    - 3.1: Demo
    - 3.2: Java implementation
    - 3.3: Analysis
  - 4: NFA construction
    - 4.1: Building the NFA corresponding to an RE
    - 4.2: Implementation
    - 4.3: Demo
    - 4.4: Java implementation
    - 4.5: Analysis
  - 5: Not-so-regular expressions
  - 6: Context
  - 7: Summary

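Preface

This post covers regular expressions and nondeterministic finite-state automata (NFAs).

Before reading it, study or review the previous post on substring search.

References

Princeton University "Algorithms" video course on Bilibili
(Search for it yourself. These notes mainly follow the order of the video course; the lecturer, Robert Sedgewick, is one of the authors of the original book.)
"Algorithms (4th Edition)" on WeChat Read
(Most of this post comes from "5.4 Regular Expressions".)
Official website
(Contains the book's companion content and code.)

Study notes

Note 1: Unless otherwise attributed, quoted passages below are excerpts from the book.
Note 2: All demos are screenshots of the slide demos in the video.
Note 3: Where a slide screenshot has no translation, a translation is given below it; because there are many such cases, this is not pointed out each time.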

1: Regular expressions

1.1: Notation

Corresponding book section: "5.4.1 Describing patterns with regular expressions"

5.4.1.1 Concatenation
5.4.1.2 Or
5.4.1.3 Closure
5.4.1.4 Parentheses
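
The four operations above can be tried out with a minimal sketch. It uses the JDK's `java.util.regex.Pattern` rather than the book's own classes, and the class name `ReOperationsSketch` and the helper `matches` are made up for this illustration. `Pattern.matches` tests whether the whole string belongs to the set described by the RE.

```java
import java.util.regex.Pattern;

public class ReOperationsSketch {
    // True if the whole string s belongs to the set described by regexp.
    static boolean matches(String regexp, String s) {
        return Pattern.matches(regexp, s);
    }

    public static void main(String[] args) {
        // Concatenation: AB describes exactly the string "AB".
        System.out.println(matches("AB", "AB"));
        // Or: A|B describes "A" and "B".
        System.out.println(matches("A|B", "B"));
        // Closure: AB* describes "A", "AB", "ABB", ...
        System.out.println(matches("AB*", "ABB"));
        // Parentheses override precedence: (AB)* describes "", "AB", "ABAB", ...
        System.out.println(matches("(AB)*", "ABAB"));
        // Combined: A(B|C)*D describes "AD", "ABD", "ACD", "ABCCBD", ...
        System.out.println(matches("A(B|C)*D", "ABCCBD"));
    }
}
```

Closure binds more tightly than concatenation, which binds more tightly than or, so `AB*` means `A(B*)` and not `(AB)*`.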

1.2: Shortcuts

Corresponding book section: "5.4.2 Shortcuts"

5.4.2.1 Set-of-characters descriptors
5.4.2.2 Closure shortcuts
5.4.2.3 Escape sequences
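
A short sketch of the three kinds of shortcuts, again using the JDK's `java.util.regex.Pattern` as a stand-in (the class name `ReShortcutsSketch` and the helper `matches` are hypothetical). One assumption to keep in mind: the exact syntax is that of Java's regex engine, which may differ in small ways from the notation used in the book.

```java
import java.util.regex.Pattern;

public class ReShortcutsSketch {
    // True if the whole string s belongs to the set described by regexp.
    static boolean matches(String regexp, String s) {
        return Pattern.matches(regexp, s);
    }

    public static void main(String[] args) {
        // Set-of-characters descriptors: wildcard, range, and complement.
        System.out.println(matches("A.B", "AXB"));
        System.out.println(matches("[A-E]+", "BEAD"));
        System.out.println(matches("[^AEIOU]{6}", "RHYTHM"));
        // Closure shortcuts: + (one or more), ? (zero or one), {n} (exactly n).
        System.out.println(matches("(AB)+", "ABAB"));
        System.out.println(matches("(AB)?C", "C"));
        System.out.println(matches("(AB){2}", "ABAB"));
        // Escape sequence: \. stands for a literal dot instead of the wildcard.
        System.out.println(matches("A\\.B", "A.B"));
    }
}
```

Each shortcut could be written out with the basic operations alone, for example `(AB)+` as `(AB)(AB)*`; the shortcuts only make the RE shorter.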

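2: Regular expressions and nondeterministic finite-state automata (REs and NFAs)

2.1: Duality

RE (regular expression): a concise way to describe a set of strings.
DFA (deterministic finite-state automaton): a machine that decides whether a given string belongs to a given set of strings.

Kleene's theorem:

For any DFA, there exists an RE that describes the same set of strings.
For any RE, there exists a DFA that recognizes the same set of strings.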
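
To make the duality concrete, here is a small sketch that pairs one RE with one DFA for the same set of strings. The example set (strings over {A, B} that end in AB), the transition table, and all names are chosen for this illustration and do not come from the book. The RE side is checked with the JDK's `java.util.regex.Pattern`.

```java
import java.util.regex.Pattern;

public class DfaVersusReSketch {
    // Hand-built DFA for "strings over {A, B} that end in AB".
    // NEXT[state][0] is the move on 'A', NEXT[state][1] is the move on 'B'.
    static final int[][] NEXT = { {1, 0}, {1, 2}, {1, 0} };
    static final int ACCEPT = 2;

    // Run the DFA: exactly one transition applies per state and character.
    static boolean dfaAccepts(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            if (c != 'A' && c != 'B') return false; // outside the alphabet
            state = NEXT[state][c == 'A' ? 0 : 1];
        }
        return state == ACCEPT;
    }

    // The same set of strings described by an RE.
    static boolean reAccepts(String s) {
        return Pattern.matches("(A|B)*AB", s);
    }

    public static void main(String[] args) {
        String[] samples = { "", "AB", "ABA", "BBAAB", "ABAB", "B" };
        for (String s : samples)
            System.out.println("\"" + s + "\": DFA=" + dfaAccepts(s) + ", RE=" + reAccepts(s));
    }
}
```

On every sample the two answers agree, which is what the theorem promises for this pair. The sketch only checks a handful of strings; it is not a proof of equivalence.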

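2.2: Pattern-matching implementation

As with KMP:

No backup in the text input stream.
Quadratic-time guarantee (time proportional to MN in the worst case for an M-character RE and an N-character text; linear time is typical).

Underlying abstraction: the nondeterministic finite-state automaton (NFA).

Basic plan: [apply Kleene's theorem]

Build an NFA from the RE.
Simulate the NFA with the text as input.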

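2.3: Nondeterministic finite-state automata

Corresponding book section: "5.4.4 Nondeterministic finite-state automata".

Some sequences of transitions lead to a wrong state or stall; the text is still matched as long as at least one sequence of transitions ends in the accept state.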

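2.4: Nondeterminism

Q. How do we determine whether a string is matched by an automaton?
DFA (deterministic finite-state automaton): easy, because exactly one transition applies for each state and input character.
NFA (nondeterministic finite-state automaton): several transitions may apply, and the right one has to be chosen!

Q. How do we simulate an NFA?
A. Systematically consider all possible transition sequences.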
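
The idea of considering all transition sequences at once can be sketched by tracking the set of states the NFA could be in after each text character. The sketch below is self-contained and uses only `java.util`; it is not the library's `NFA` class. It assumes the example RE `((A*B|AC)D)`, with the ε-transitions written out by hand, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

public class NfaSimulationSketch {
    // State i corresponds to RE.charAt(i); state M is the accept state.
    static final String RE = "((A*B|AC)D)";
    static final int M = RE.length();
    // Epsilon-transitions for RE, listed by hand for this sketch.
    static final int[][] EPSILON = {
        {0, 1}, {1, 2}, {1, 6}, {2, 3}, {3, 2}, {3, 4}, {5, 8}, {8, 9}, {10, 11}
    };

    // States reachable from any source through epsilon-transitions (iterative DFS).
    static boolean[] reachable(List<Integer> sources) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int v = 0; v <= M; v++) adj.add(new ArrayList<>());
        for (int[] e : EPSILON) adj.get(e[0]).add(e[1]);
        boolean[] marked = new boolean[M + 1];
        Deque<Integer> stack = new ArrayDeque<>();
        for (int s : sources)
            if (!marked[s]) { marked[s] = true; stack.push(s); }
        while (!stack.isEmpty()) {
            int v = stack.pop();
            for (int w : adj.get(v))
                if (!marked[w]) { marked[w] = true; stack.push(w); }
        }
        return marked;
    }

    // Assumes txt contains no metacharacters.
    static boolean recognizes(String txt) {
        boolean[] pc = reachable(Collections.singletonList(0)); // possible states so far
        for (int i = 0; i < txt.length(); i++) {
            List<Integer> match = new ArrayList<>();
            for (int v = 0; v < M; v++) {
                char c = RE.charAt(v);
                // Match transition: a non-metacharacter state that equals the text character.
                if (pc[v] && "()|*".indexOf(c) < 0 && c == txt.charAt(i)) match.add(v + 1);
            }
            if (match.isEmpty()) return false; // every transition sequence has stalled
            pc = reachable(match);
        }
        return pc[M];
    }

    public static void main(String[] args) {
        System.out.println(recognizes("AABD")); // true
        System.out.println(recognizes("ACD"));  // true
        System.out.println(recognizes("AAC"));  // false
    }
}
```

For `AABD` the tracked sets are {0, 1, 2, 3, 4, 6}, then {2, 3, 4, 7}, {2, 3, 4}, {5, 8, 9}, and finally {10, 11}, which contains the accept state 11.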

3: NFA simulation

3.1: Demo

It is worth watching this demo in the video several times to understand the steps.

3.2: Java implementation

edu.princeton.cs.algs4.NFA

edu.princeton.cs.algs4.NFA#NFA

    /**
     * Initializes the NFA from the specified regular expression.
     *
     * @param  regexp the regular expression
     */
    public NFA(String regexp) {
        this.regexp = regexp;
        m = regexp.length();
        Stack<Integer> ops = new Stack<Integer>();
        graph = new Digraph(m+1);
        for (int i = 0; i < m; i++) {
            int lp = i;
            if (regexp.charAt(i) == '(' || regexp.charAt(i) == '|')
                ops.push(i);
            else if (regexp.charAt(i) == ')') {
                int or = ops.pop();

                // 2-way or operator
                if (regexp.charAt(or) == '|') {
                    lp = ops.pop();
                    graph.addEdge(lp, or+1);
                    graph.addEdge(or, i);
                }
                else if (regexp.charAt(or) == '(')
                    lp = or;
                else assert false;
            }

            // closure operator (uses 1-character lookahead)
            if (i < m-1 && regexp.charAt(i+1) == '*') {
                graph.addEdge(lp, i+1);
                graph.addEdge(i+1, lp);
            }
            if (regexp.charAt(i) == '(' || regexp.charAt(i) == '*' || regexp.charAt(i) == ')')
                graph.addEdge(i, i+1);
        }
        if (ops.size() != 0)
            throw new IllegalArgumentException("Invalid regular expression");
    }
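
To see which ε-transitions this construction produces, the following standalone sketch mirrors the same stack-based logic but records the edges as strings instead of adding them to a `Digraph`. It uses only `java.util`; the class name `EpsilonEdgesSketch` and the method `epsilonEdges` are made up for this illustration and are not part of the library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class EpsilonEdgesSketch {
    // Returns the epsilon-transitions for regexp as "from->to" strings,
    // in the order the construction adds them.
    static List<String> epsilonEdges(String regexp) {
        int m = regexp.length();
        List<String> edges = new ArrayList<>();
        Deque<Integer> ops = new ArrayDeque<>(); // positions of pending '(' and '|'
        for (int i = 0; i < m; i++) {
            char c = regexp.charAt(i);
            int lp = i;
            if (c == '(' || c == '|') {
                ops.push(i);
            } else if (c == ')') {
                if (ops.isEmpty())
                    throw new IllegalArgumentException("Invalid regular expression");
                int or = ops.pop();
                if (regexp.charAt(or) == '|') {
                    // 2-way or: skip over either alternative
                    lp = ops.pop();
                    edges.add(lp + "->" + (or + 1));
                    edges.add(or + "->" + i);
                } else {
                    lp = or;
                }
            }
            // Closure (1-character lookahead): edges in both directions
            if (i < m - 1 && regexp.charAt(i + 1) == '*') {
                edges.add(lp + "->" + (i + 1));
                edges.add((i + 1) + "->" + lp);
            }
            // Metacharacters other than '|' step to the next state
            if (c == '(' || c == '*' || c == ')')
                edges.add(i + "->" + (i + 1));
        }
        if (!ops.isEmpty())
            throw new IllegalArgumentException("Invalid regular expression");
        return edges;
    }

    public static void main(String[] args) {
        // prints [0->1, 1->2, 2->3, 3->2, 3->4, 1->6, 5->8, 8->9, 10->11]
        System.out.println(epsilonEdges("((A*B|AC)D)"));
    }
}
```

For the 11-character RE `((A*B|AC)D)` this yields nine ε-transitions over states 0 to 11, where state 11 is the accept state.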

edu.princeton.cs.algs4.NFA#recognizes

    /**
     * Returns true if the text is matched by the regular expression.
     *
     * @param  txt the text
     * @return {@code true} if the text is matched by the regular expression,
     *         {@code false} otherwise
     */
    public boolean recognizes(String txt) {
        DirectedDFS dfs = new DirectedDFS(graph, 0);
        Bag<Integer> pc = new Bag<Integer>();
        for (int v = 0; v < graph.V(); v++)
            if (dfs.marked(v)) pc.add(v);

        // Compute possible NFA states for txt[i+1]
        for (int i = 0; i < txt.length(); i++) {
            if (txt.charAt(i) == '*' || txt.charAt(i) == '|' || txt.charAt(i) == '(' || txt.charAt(i) == ')')
                throw new IllegalArgumentException("text contains the metacharacter '" + txt.charAt(i) + "'");

            Bag<Integer> match = new Bag<Integer>();
            for (int v : pc) {
                if (v == m) continue;
                if ((regexp.charAt(v) == txt.charAt(i)) || regexp.charAt(v) == '.')
                    match.add(v+1);
            }
            if (match.isEmpty()) continue;

            dfs = new DirectedDFS(graph, match);
            pc = new Bag<Integer>();
            for (int v = 0; v < graph.V(); v++)
                if (dfs.marked(v)) pc.add(v);

            // optimization if no states reachable
            if (pc.size() == 0) return false;
        }

        // check for accept state
        for (int v : pc)
            if (v == m) return true;
        return false;
    }