19 - Regular Expressions

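A regular expression (regex for short) is a tool for matching text. Put simply, you use a set of "code words" to describe the text pattern you want to find.

Basic syntax

One thing up front: regular expressions are not specific to Python. Almost every programming language supports them, so the core syntax carries over once you have learned it, although details differ between engines.

Importing the re module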

import re

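The simplest match

An ordinary character matches itself:

result = re.search(r"hello", "say hello world")
print(result)  # <re.Match object; span=(4, 9), match='hello'>

re.search looks for the first match anywhere in the string. It returns a Match object if it finds one and None otherwise.

Note the r prefix on the pattern (a raw string). Regex patterns use backslashes a lot, and with r you do not have to double each one.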
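
To make the effect of the r prefix concrete, here is a minimal sketch. The variable names plain and raw are made up for illustration. It shows that without r, Python turns "\b" into a backspace character before the regex engine ever sees the pattern.

```python
import re

# "\\d+" and r"\d+" are the same string: backslash, d, plus
same = "\\d+" == r"\d+"
print(same)                            # True

# Without r, "\b" becomes a backspace character, so the word boundary is lost
plain = "\bcat\b"
raw = r"\bcat\b"
print(len(plain), len(raw))            # 5 7

print(re.search(plain, "a cat"))       # None
print(re.search(raw, "a cat"))         # <re.Match object; span=(2, 5), match='cat'>
```

Sequences such as \d happen to survive in a normal string, but relying on that is fragile, so using r for every pattern is the safer habit.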

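Metacharacters

These characters have a special meaning in a regex:

Character	Meaning	Example
`.`	Any single character (except a newline)	`h.t` → hat, hot, h3t
`\d`	A digit (`[0-9]`; for str patterns in Python 3, other Unicode decimal digits too)	`\d\d` → matches two digits
`\D`	A non-digit	`\D+` → matches a run of non-digits
`\w`	A letter, digit or underscore (for str patterns in Python 3, this includes Unicode letters such as Chinese characters)	`\w+` → matches one "word"
`\W`	Anything that is not a letter, digit or underscore
`\s`	A whitespace character (space, tab, newline)
`\S`	A non-whitespace character
`^`	Start of the string	`^hello` → starts with hello
`$`	End of the string (or just before a trailing newline)	`world$` → ends with world

# Match a mobile phone number (simplified)
re.search(r"\d{11}", "我的号码是13812345678")

# Match the start of an email address, up to the @
re.search(r"^\w+@", "xiaoming@example.com")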
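
One detail worth seeing in action: for str patterns in Python 3, \w and \d are Unicode-aware, so they match more than ASCII letters and digits. The sketch below uses a made-up sample string and the re.ASCII flag to compare the two behaviours.

```python
import re

sample = "hello 世界 １２３ 456"   # contains Chinese text and full-width digits

# Default (Unicode) matching
words_unicode = re.findall(r"\w+", sample)
digits_unicode = re.findall(r"\d+", sample)
print(words_unicode)    # ['hello', '世界', '１２３', '456']
print(digits_unicode)   # ['１２３', '456']

# re.ASCII restricts \w, \d and \s to ASCII characters
words_ascii = re.findall(r"\w+", sample, re.ASCII)
digits_ascii = re.findall(r"\d+", sample, re.ASCII)
print(words_ascii)      # ['hello', '456']
print(digits_ascii)     # ['456']
```

If you really mean "ASCII digits only", writing [0-9] explicitly is another option.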

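Quantifiers

These control how many times the preceding element may occur:

Quantifier	Meaning	Example
`*`	0 or more times	`\d*` → any number of digits (including none)
`+`	1 or more times	`\d+` → at least one digit
`?`	0 or 1 time	`colou?r` → color or colour
`{n}`	Exactly n times	`\d{4}` → 4 digits
`{n,m}`	Between n and m times	`\d{2,4}` → 2 to 4 digits
`{n,}`	At least n times	`\d{3,}` → at least 3 digits

# Match a year
re.search(r"\d{4}", "发表于2024年")

# Match a price (1 to 3 digits plus an optional decimal part)
re.search(r"\d{1,3}(\.\d{1,2})?", "价格：99.99元")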
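
The quantifiers are easier to compare side by side. The following sketch runs a few of them against made-up inputs; nothing here goes beyond the standard re module.

```python
import re

# ? makes the preceding element optional
optional = re.findall(r"colou?r", "color colour colouur")
print(optional)      # ['color', 'colour']

# {n,m} takes as many repetitions as it can, up to m
ranged = re.findall(r"\d{2,4}", "1 22 333 4444 55555")
print(ranged)        # ['22', '333', '4444', '5555']

# * accepts zero occurrences, + requires at least one
star_empty = bool(re.fullmatch(r"\d*", ""))
plus_empty = bool(re.fullmatch(r"\d+", ""))
print(star_empty, plus_empty)   # True False
```

Note how `\d{2,4}` splits 55555 into 5555 and a leftover 5 that is too short to match on its own.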

Character sets `[]`

List the allowed characters inside square brackets:

# Match vowels
re.findall(r"[aeiou]", "hello world")  # ['e', 'o', 'o']

# Match uppercase letters
re.findall(r"[A-Z]", "Hello World Python")  # ['H', 'W', 'P']

# Negation (a ^ at the start of [] means "none of these")
re.findall(r"[^0-9]", "abc123def")  # ['a', 'b', 'c', 'd', 'e', 'f']

# Ranges
re.findall(r"[a-z]+", "Hello World")  # ['ello', 'orld']
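
Two further points about character sets are sketched below with made-up inputs: several ranges can share one pair of brackets, and most metacharacters are treated literally inside them.

```python
import re

# Several ranges plus single characters in one set.
# A hyphen placed last is a literal hyphen, not a range.
idents = re.findall(r"[A-Za-z0-9_-]+", "user_name-01 @ home")
print(idents)       # ['user_name-01', 'home']

# Inside [], characters such as . + * lose their special meaning
symbols = re.findall(r"[.+*]", "1+1=2. 2*3=6.")
print(symbols)      # ['+', '.', '*', '.']
```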

Groups `()`

Wrap part of a regex in parentheses to form a group:

# Extract the year, month and day from a date
text = "日期：2024-05-25"
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
if match:
    print(match.group(1))  # 2024 (the first group)
    print(match.group(2))  # 05
    print(match.group(3))  # 25
    print(match.groups())  # ('2024', '05', '25')

Named groups

Give a group a name, which reads better than a number (this reuses text from the previous example):

match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
if match:
    print(match.group("year"))   # 2024
    print(match.group("month"))  # 05
    print(match.group("day"))    # 25
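
When a pattern has several named groups, Match.groupdict() collects them all at once. A minimal sketch, with date_re and parts as illustrative names:

```python
import re

date_re = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
m = re.search(date_re, "日期：2024-05-25")

parts = m.groupdict() if m else {}
print(parts)                        # {'year': '2024', 'month': '05', 'day': '25'}

# Named groups are still numbered, and m["year"] is shorthand for m.group("year")
same_group = m.group(1) == m["year"]
print(same_group)                   # True
```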

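Non-capturing groups `(?:...)`

Sometimes you only need parentheses for precedence or to apply a quantifier, and do not want to capture anything:

# (?:...) does not create a group
match = re.search(r"(?:https?://)([\w.]+)", "https://www.example.com")
print(match.group(1))  # www.example.com (group 1 is the domain; the protocol is not a group)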
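
The difference between capturing and non-capturing groups shows up most clearly with re.findall, which returns group contents when the pattern has groups. A small sketch with made-up input:

```python
import re

# Capturing group: findall returns only the last repetition captured by the group
captured = re.findall(r"(ab)+", "ababab x ab")
print(captured)       # ['ab', 'ab']

# Non-capturing group: findall returns the whole match
whole = re.findall(r"(?:ab)+", "ababab x ab")
print(whole)          # ['ababab', 'ab']
```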

Common functions in the re module

re.search: find the first match

match = re.search(r"\d+", "abc 123 def 456")
print(match.group())  # 123 (only the first match is found)

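re.match: match from the start

# match only matches at the start of the string
print(re.match(r"\d+", "123abc"))   # matches
print(re.match(r"\d+", "abc123"))   # None (the string does not start with a digit)

The difference between match and search: match has to match starting at the beginning of the string, while search can match anywhere.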
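
Note that match anchors only the start, not the end. When the whole string has to fit the pattern, the standard library also offers re.fullmatch. A short sketch comparing the two:

```python
import re

s = "123abc"

prefix = re.match(r"\d+", s)
print(prefix)                 # <re.Match object; span=(0, 3), match='123'>

# fullmatch fails because "abc" is left over
whole = re.fullmatch(r"\d+", s)
print(whole)                  # None

all_digits = re.fullmatch(r"\d+", "123")
print(all_digits)             # <re.Match object; span=(0, 3), match='123'>
```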

re.findall: find all matches

# Find all numbers
numbers = re.findall(r"\d+", "我有 3 个苹果和 5 个橘子")
print(numbers)  # ['3', '5']

# If the pattern has groups, the group contents are returned
dates = re.findall(r"(\d{4})-(\d{2})", "2024-01 2024-02 2024-03")
print(dates)  # [('2024', '01'), ('2024', '02'), ('2024', '03')]

re.finditer: return an iterator

Like findall, but it returns an iterator of Match objects, which saves memory when there are many matches:

for match in re.finditer(r"\d+", "1 22 333"):
    print(f"position {match.start()}-{match.end()}: {match.group()}")
# position 0-1: 1
# position 2-4: 22
# position 5-8: 333

re.sub: replace

# Replace numbers with *
result = re.sub(r"\d+", "*", "我有 3 个苹果和 5 个橘子")
print(result)  # 我有 * 个苹果和 * 个橘子

# Use a function to compute the replacement
def double(match):
    return str(int(match.group()) * 2)

result = re.sub(r"\d+", double, "3 和 5")
print(result)  # 6 和 10

# Mask a phone number
def mask_phone(match):
    phone = match.group()
    return phone[:3] + "****" + phone[7:]

result = re.sub(r"1\d{10}", mask_phone, "号码：13812345678")
print(result)  # 号码：138****5678
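
Besides a fixed string or a function, the replacement can refer back to groups in the pattern. The sketch below uses made-up dates to show numbered references (\1), named references (\g<name>), the count parameter, and re.subn, which also reports how many replacements were made.

```python
import re

text = "2024-05-25 and 2023-12-01"

# \1, \2, \3 refer to the numbered groups
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(swapped)        # 25/05/2024 and 01/12/2023

# \g<name> refers to a named group; count=1 replaces only the first match
named = r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})"
first_only = re.sub(named, r"\g<d>.\g<m>.\g<y>", text, count=1)
print(first_only)     # 25.05.2024 and 2023-12-01

# subn returns (new_string, number_of_replacements)
replaced = re.subn(r"\d+", "#", "a1b22")
print(replaced)       # ('a#b#', 2)
```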

re.split: split

# Split on numbers
parts = re.split(r"\d+", "abc123def456ghi")
print(parts)  # ['abc', 'def', 'ghi']

# Split on any of several delimiters
parts = re.split(r"[,;|]", "a,b;c|d")
print(parts)  # ['a', 'b', 'c', 'd']
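
A few edge cases of re.split are worth knowing. This sketch, with made-up inputs, shows that a capturing group keeps the delimiters, that maxsplit limits the number of splits, and that a delimiter at either end produces empty strings.

```python
import re

# A capturing group in the pattern keeps the delimiters in the result
with_delims = re.split(r"([,;])", "a,b;c")
print(with_delims)    # ['a', ',', 'b', ';', 'c']

# maxsplit limits how many splits are performed
limited = re.split(r",", "a,b,c", maxsplit=1)
print(limited)        # ['a', 'b,c']

# A delimiter at the start or end yields empty strings
edges = re.split(r"\d+", "1abc2")
print(edges)          # ['', 'abc', '']
```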

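Compiling a regex

If the same regex is used many times, compile it once up front. This gives the pattern a name and skips the per-call pattern lookup. The re module also caches recently used patterns internally, so the speedup is usually modest:

# Compile once
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# Use many times
print(email_pattern.findall("联系我：a@b.com 或 c@d.org"))
print(email_pattern.search("邮箱：test@example.com"))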

Greedy and non-greedy

Quantifiers are greedy by default, meaning they match as much as possible:

text = "<h1>标题</h1>"
match = re.search(r"<.+>", text)
print(match.group())  # <h1>标题</h1> (the whole string matched!)

Add ? to make a quantifier non-greedy (match as little as possible):

match = re.search(r"<.+?>", text)
print(match.group())  # <h1> (matches only up to the first >)

Greedy	Non-greedy
`*`	`*?`
`+`	`+?`
`{n,m}`	`{n,m}?`
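
A non-greedy quantifier is not the only way to stop at the first >. A negated character set such as [^>] often expresses the same intent more directly. The sketch below compares the two on a made-up snippet, and also uses a backreference (\1) to pair each opening tag with its closing tag. It is an illustration only; a real HTML parser is the better tool for real documents.

```python
import re

html = "<h1>标题</h1><p>正文</p>"

lazy = re.findall(r"<.+?>", html)
negated = re.findall(r"<[^>]+>", html)
print(lazy)               # ['<h1>', '</h1>', '<p>', '</p>']
print(lazy == negated)    # True

# Non-greedy content between an opening tag and its matching closing tag
pairs = re.findall(r"<(\w+)>(.*?)</\1>", html)
print(pairs)              # [('h1', '标题'), ('p', '正文')]
```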

Common flags

Pass flags through the flags parameter of functions such as re.search, or as the second argument of re.compile:

# re.IGNORECASE (re.I): ignore case
re.search(r"hello", "HELLO", re.IGNORECASE)  # matches

# re.MULTILINE (re.M): multi-line mode, ^ and $ match at every line
text = "第一行\n第二行\n第三行"
re.findall(r"^\w+", text, re.MULTILINE)  # ['第一行', '第二行', '第三行']

# re.DOTALL (re.S): let . match a newline as well
re.search(r"a.b", "a\nb", re.DOTALL)  # matches

# Combine flags with |
re.search(r"pattern", text, re.I | re.M)
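
Flags can also be written inside the pattern itself, and one more flag, re.VERBOSE, helps with long patterns by allowing whitespace and comments. A minimal sketch, where date_re is an illustrative name:

```python
import re

# Inline flag: (?i) at the start of the pattern acts like re.IGNORECASE
inline = bool(re.search(r"(?i)hello", "HELLO"))
print(inline)     # True

# re.VERBOSE ignores whitespace in the pattern and allows # comments
date_re = re.compile(r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
""", re.VERBOSE)

fields = date_re.search("2024-05-25").groupdict()
print(fields)     # {'year': '2024', 'month': '05', 'day': '25'}
```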

Practical examples

Validating an email address

def is_valid_email(email):
    # fullmatch requires the whole string to match, so ^ and $ are not needed
    pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"
    return bool(re.fullmatch(pattern, email))

print(is_valid_email("test@example.com"))   # True
print(is_valid_email("not-an-email"))       # False
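
One subtlety when validating with anchors: $ also matches just before a trailing newline, so a validator built from re.match with ^...$ accepts input that ends in "\n". The sketch below compares that form with re.fullmatch on the same email pattern.

```python
import re

anchored = r"^[\w.+-]+@[\w-]+\.[\w.]+$"
unanchored = r"[\w.+-]+@[\w-]+\.[\w.]+"
candidate = "test@example.com\n"     # note the trailing newline

with_match = bool(re.match(anchored, candidate))
with_fullmatch = bool(re.fullmatch(unanchored, candidate))
print(with_match)       # True  ($ matches before the final newline)
print(with_fullmatch)   # False (the newline is not part of the pattern)
```

This pattern is still only a rough shape check; it does not prove that an address exists or follows every rule for email syntax.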

Extracting URLs

text = "请访问 https://www.example.com 或 http://test.org/path?q=1"
urls = re.findall(r"https?://[\w./\-?=&]+", text)
print(urls)  # ['https://www.example.com', 'http://test.org/path?q=1']

Parsing a log line

log = '[2024-05-25 14:30:00] ERROR: 数据库连接失败 (timeout=30s)'
pattern = r'\[(?P<time>[\d\- :]+)\] (?P<level>\w+): (?P<message>.+)'
match = re.search(pattern, log)
if match:
    print(f"time: {match.group('time')}")
    print(f"level: {match.group('level')}")
    print(f"message: {match.group('message')}")
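
The same pattern extends naturally to several lines. The sketch below assumes a list of hypothetical sample lines in the format shown above; LOG_RE, records and errors are illustrative names. It compiles the pattern once, skips lines that do not match, and filters by level.

```python
import re

LOG_RE = re.compile(r"\[(?P<time>[\d\- :]+)\] (?P<level>\w+): (?P<message>.+)")

# Hypothetical sample lines for illustration
lines = [
    "[2024-05-25 14:30:00] ERROR: 数据库连接失败 (timeout=30s)",
    "not a log line",
    "[2024-05-25 14:31:02] INFO: retry succeeded",
]

records = []
for line in lines:
    m = LOG_RE.match(line)
    if m:                          # skip lines that do not fit the format
        records.append(m.groupdict())

errors = [r["message"] for r in records if r["level"] == "ERROR"]
print(len(records))   # 2
print(errors)         # ['数据库连接失败 (timeout=30s)']
```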

Cleaning up a string

text = "  hello    world  \n\n  python  "
# Replace each run of whitespace with a single space
cleaned = re.sub(r"\s+", " ", text).strip()
print(cleaned)  # "hello world python"

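Chapter summary

\d, \w and \s are the most commonly used metacharacters
*, +, ? and {n,m} control how many times something matches
() creates a group; (?P<name>...) creates a named group
re.search finds the first match, re.findall finds all matches, re.sub replaces
Quantifiers are greedy by default; add ? to make them non-greedy
Compile frequently used patterns with re.compile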

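Interview questions

Q1: What is the difference between re.search and re.match?

re.search: looks for the first match anywhere in the string
re.match: matches only at the start of the string and returns None if the start does not match

re.search(r"\d+", "abc123")  # matches (123)
re.match(r"\d+", "abc123")   # None (the string does not start with a digit)
re.match(r"\d+", "123abc")   # matches (123)

If you want to use search but match only at the start, add the ^ anchor: re.search(r"^\d+", text).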

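Q2: What is the difference between greedy and non-greedy matching?

Greedy matching (the default) matches as many characters as possible; non-greedy matching matches as few as possible.

text = "<a>hello</a>"
re.search(r"<.+>", text).group()    # '<a>hello</a>' (greedy)
re.search(r"<.+?>", text).group()   # '<a>' (non-greedy)

Add ? after a quantifier to switch it to non-greedy mode: *?, +?, {n,m}?.

Common use: when pulling HTML tags or JSON fields out of text with a regex, the non-greedy form is usually the one you want.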

Q3: How does the return value of re.findall differ with and without groups?

No groups: returns a list of the full matched strings
```
re.findall(r"\d+", "a1 b22 c333")  # ['1', '22', '333']
```
One group: returns a list of that group's contents
```
re.findall(r"(\d+)", "a1 b22")  # ['1', '22']
```

Several groups: returns a list of tuples

re.findall(r"(\w+)=(\d+)", "a=1 b=2")  # [('a', '1'), ('b', '2')]

If you only need part of each match, groups let you control exactly what is returned.
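
Two small follow-ups to this answer, sketched with the same input: a list of two-element tuples converts directly to a dict, and re.finditer is the way to get both the full match and the groups at once.

```python
import re

pairs = re.findall(r"(\w+)=(\d+)", "a=1 b=2")
as_dict = dict(pairs)
print(as_dict)        # {'a': '1', 'b': '2'}

# finditer exposes the whole match (group 0) as well as each group
detailed = [(m.group(0), m.group(1)) for m in re.finditer(r"(\w+)=\d+", "a=1 b=2")]
print(detailed)       # [('a=1', 'a'), ('b=2', 'b')]
```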

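Q4: How do you match an IP address?

Simple version (does not check ranges):

r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

Strict version (each octet 0-255; use it with re.fullmatch when validating a whole string):

r"(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)"

In real projects, validating with the ipaddress standard library module is recommended:

import ipaddress
try:
    ipaddress.ip_address("192.168.1.1")
    print("valid")
except ValueError:
    print("invalid")

A regex is a good fit for pulling strings that look like IP addresses out of text; for checking validity, a dedicated library is more reliable.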
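
Combining the two ideas, a regex can find the candidates and ipaddress can decide which ones are real. The following is a minimal sketch under that assumption; CANDIDATE_RE and extract_valid_ipv4 are hypothetical names, and the input string is made up.

```python
import ipaddress
import re

# Loose pattern: four dot-separated groups of 1 to 3 digits
CANDIDATE_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def extract_valid_ipv4(text):
    """Return the IPv4-looking strings in text that ipaddress accepts."""
    valid = []
    for candidate in CANDIDATE_RE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue              # looks like an IP but is out of range
        valid.append(candidate)
    return valid

found = extract_valid_ipv4("hosts: 192.168.1.1, 999.1.1.1, 10.0.0.256, 8.8.8.8")
print(found)   # ['192.168.1.1', '8.8.8.8']
```

This keeps the regex simple and leaves the range rules to the library.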
