Base Tools-Associate-Fifth: The re Library in Detail

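Related standard: POSIX (Portable Operating System Interface), which defines the BRE and ERE regular expression syntaxes. Note that Python's re module follows Perl-style syntax rather than strict POSIX syntax.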

re is Python's regular expression library, the core tool for matching, searching, and replacing text in strings.

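1. Core Concepts of the re Library

re is Python's built-in module for working with regular expressions. A regular expression (regex) is a pattern that describes a set of strings, and it makes the following tasks efficient:

Searching, matching, and replacing within strings
Splitting strings and extracting specific content
Data validation (for example phone numbers, email addresses, ID card numbers)

Basic metacharacters (core matching rules)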

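Metacharacter	Meaning	Example
`.`	Matches any single character except the newline `\n` (unless `re.S` is set)	`a.c` matches `abc`, `a1c`, `a#c`
`^`	Matches the start of the string	`^abc` matches `abc123` but not `123abc`
`$`	Matches the end of the string	`abc$` matches `123abc` but not `abc123`
`*`	Matches the preceding element 0 or more times	`ab*c` matches `ac`, `abc`, `abbbc`
`+`	Matches the preceding element 1 or more times	`ab+c` matches `abc`, `abbbc` but not `ac`
`?`	Matches the preceding element 0 or 1 times	`ab?c` matches `ac`, `abc` but not `abbbc`
`{n}`	Matches the preceding element exactly n times	`ab{2}c` matches `abbc` but not `abc`
`{n,}`	Matches the preceding element at least n times	`ab{2,}c` matches `abbc`, `abbbc`
`{n,m}`	Matches the preceding element n to m times	`ab{1,2}c` matches `abc`, `abbc`
`[]`	Matches any one of the characters inside the brackets	`a[0-9]c` matches `a1c`, `a5c` but not `abc`
`|`	Alternation: matches one of several patterns	`abc|def` matches `abc` or `def`
`\`	Escape character (matches a special character literally)	`a\.c` matches `a.c` but not `abc`
`\d`	Matches a digit (equivalent to `[0-9]` under `re.ASCII`; for `str` patterns it matches any Unicode decimal digit by default)	`a\dc` matches `a1c`, `a8c`
`\w`	Matches a letter, digit, or underscore (equivalent to `[a-zA-Z0-9_]` under `re.ASCII`; for `str` patterns it matches Unicode word characters by default)	`a\wc` matches `abc`, `a5c`, `a_c`
`\s`	Matches a whitespace character (space, tab, newline, and so on)	`a\sc` matches `a c` and `a\tc` (where `\t` is a tab character)
`\D` / `\W` / `\S`	The negations of `\d`, `\w`, `\s` respectively	`\D` matches any non-digit character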
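
The rules in the table can be checked directly in the interpreter. The following is a minimal sketch; the helper name `check` and the sample strings are illustrative assumptions, and `re.fullmatch` is used so that each candidate string must match the pattern in full.

```python
import re

def check(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Each entry pairs a pattern with a few candidate strings.
cases = [
    (r"a.c", ["abc", "a#c", "a\nc"]),   # . does not match a newline by default
    (r"ab*c", ["ac", "abbbc"]),         # * allows zero repetitions
    (r"ab+c", ["abc", "ac"]),           # + requires at least one
    (r"ab{1,2}c", ["abbc", "abbbc"]),   # {n,m} bounds the repetition count
    (r"a[0-9]c", ["a5c", "abc"]),       # [] matches one listed character
    (r"abc|def", ["def", "abd"]),       # | chooses between alternatives
    (r"a\.c", ["a.c", "abc"]),          # \. matches a literal dot
]

for pattern, texts in cases:
    for text in texts:
        print(f"{pattern!r} vs {text!r}: {check(pattern, text)}")
```

In each pair of candidates the last string is the one expected to fail, which makes the boundary of every rule visible in the printed output.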

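2. Core Functions of the re Library (Commonly Used)

re.match(): match at the start of the string

Syntax: re.match(pattern, string, flags=0)
Behavior: matches only at the start of the string; returns a Match object on success and None on failure
Common methods: group() (the matched text), span() (the (start, end) position of the match)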

import re

# Example 1: match at the start
result = re.match(r'abc', 'abc123')
if result:
    print("Match:", result.group())  # Output: Match: abc
    print("Span:", result.span())    # Output: Span: (0, 3)

# Example 2: the pattern is not at the start (returns None)
result2 = re.match(r'abc', '123abc')
print(result2)  # Output: None

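re.fullmatch(): check whether the whole string matches the pattern

Syntax: re.fullmatch(pattern, string, flags=0)
Behavior: returns a Match object only if the entire string matches the pattern, otherwise None (unlike match(), which is anchored only at the start)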

import re

# Check whether the string consists of digits only (full match)
print(re.fullmatch(r'\d+', '123'))  # <re.Match object; span=(0, 3), match='123'> (match)
print(re.fullmatch(r'\d+', '123abc'))  # None (not a full match)
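
To make the difference between the matching functions concrete, here is a small sketch that runs the same pattern through match(), search(), and fullmatch(). The validator name `is_valid_phone` and the sample strings are assumptions made for illustration; the phone pattern is a simplified one and not a complete validation rule.

```python
import re

PHONE = r"1[3-9]\d{9}"  # simplified mainland China mobile number pattern

def is_valid_phone(text):
    """Hypothetical validator: the whole string must be a phone number."""
    return re.fullmatch(PHONE, text, flags=re.ASCII) is not None

samples = ["13812345678", "13812345678abc", "x13812345678"]
for s in samples:
    at_start = re.match(PHONE, s) is not None    # anchored at the start only
    anywhere = re.search(PHONE, s) is not None   # found at any position
    whole = is_valid_phone(s)                    # must cover the whole string
    print(f"{s!r}: match={at_start}, search={anywhere}, fullmatch={whole}")
```

For data validation, fullmatch() is usually the safer choice, because match() and search() accept strings that merely contain a valid value.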

re.search(): find the first match anywhere in the string

Syntax: re.search(pattern, string, flags=0)
Behavior: scans the whole string and returns the first match (None if there is no match)

import re

# Example: the first match at any position
result = re.search(r'abc', '123abc456abc')
if result:
    print(result.group())  # Output: abc
    print(result.span())   # Output: (3, 6)

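re.findall(): find every match in the string

Syntax: re.findall(pattern, string, flags=0)
Behavior: returns a list of all non-overlapping matches (an empty list if there are none); if the pattern contains capturing groups, the list holds the group contents instead of the full matches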

import re

# Example 1: extract all numbers
result = re.findall(r'\d+', 'abc123def456ghi789')
print(result)  # Output: ['123', '456', '789']

# Example 2: extract email addresses (simplified pattern)
result2 = re.findall(r'\w+@\w+\.\w+', 'My email is test@163.com, and my backup is abc@qq.com')
print(result2)  # Output: ['test@163.com', 'abc@qq.com']
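
One pitfall with `\w` and `\d` is worth a short sketch: for `str` patterns they are Unicode-aware by default, so non-ASCII letters that sit directly in front of an address are pulled into the match. The sample text and variable names below are assumptions for illustration; `re.ASCII` restricts both classes to ASCII.

```python
import re

EMAIL = r"\w+@\w+\.\w+"
text = "邮箱test@163.com"  # Chinese text directly followed by an address

unicode_hits = re.findall(EMAIL, text)               # \w also matches 邮 and 箱
ascii_hits = re.findall(EMAIL, text, flags=re.ASCII)  # \w is [a-zA-Z0-9_] only
print(unicode_hits)  # ['邮箱test@163.com']
print(ascii_hits)    # ['test@163.com']

arabic_digits = "\u0663\u0664"  # Arabic-Indic digits three and four
unicode_digit_ok = re.fullmatch(r"\d+", arabic_digits) is not None
ascii_digit_ok = re.fullmatch(r"\d+", arabic_digits, flags=re.ASCII) is not None
print(unicode_digit_ok, ascii_digit_ok)  # True False
```

When the input is expected to be plain ASCII (phone numbers, email local parts, and so on), passing `re.ASCII` or writing the character class out explicitly avoids these surprises.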

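re.finditer(): iterate over every match in the string

Syntax: re.finditer(pattern, string, flags=0)
Behavior: returns an iterator of Match objects (not a list)
Main advantages: lower memory use (especially on very long text) and access to position information (span()/start()/end())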

import re

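text = "Item A: 99 yuan, Item B: 199 yuan"
# Walk the iterator to get each match together with its position
for match in re.finditer(r'\d+', text):
    print(f"Matched: {match.group()}, position: {match.span()}")
"""
Output:
Matched: 99, position: (8, 10)
Matched: 199, position: (25, 28)
"""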

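Because finditer() yields Match objects, group contents and positions can be collected in one pass. The sketch below assumes an illustrative "Item X: price" text layout and hypothetical field names (`name`, `price`, `start`); it is one possible way to build records, not a fixed recipe.

```python
import re

text = "Item A: 99 yuan, Item B: 199 yuan"
item_pattern = re.compile(r"Item (?P<name>\w): (?P<price>\d+)")

# Build one record per match, keeping where each match begins.
records = [
    {
        "name": m.group("name"),
        "price": int(m.group("price")),
        "start": m.start(),
    }
    for m in item_pattern.finditer(text)
]
print(records)
# [{'name': 'A', 'price': 99, 'start': 0}, {'name': 'B', 'price': 199, 'start': 17}]
```
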
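re.sub(): replace matched content in the string

Syntax: re.sub(pattern, repl, string, count=0, flags=0)
Parameters:
- repl: the replacement string (or a function that receives the Match object and returns the replacement)
- count: the maximum number of replacements (the default 0 means replace all)
Behavior: returns a new string with the replacements applied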

import re

# Example 1: replace each run of digits with *
result = re.sub(r'\d+', '*', 'Phone: 13812345678')
print(result)  # Output: Phone: *

# Example 2: replace only the first 2 runs of digits
result2 = re.sub(r'\d+', '*', '123abc456def789', count=2)
print(result2)  # Output: *abc*def789

# Example 3: use a function as the replacement (add 1 to each number)
def add_one(matched):
    num = int(matched.group())
    return str(num + 1)

result3 = re.sub(r'\d+', add_one, 'a1b2c3')
print(result3)  # Output: a2b3c4
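
The replacement string can also refer back to captured groups, which the examples above do not show. The following sketch uses `\1`-style numbered references and `\g<name>` named references; the masking and date formats are illustrative assumptions.

```python
import re

# Mask the middle four digits of a phone number, keeping groups 1 and 2.
masked = re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", "13812345678")
print(masked)  # 138****5678

# Reorder a YYYY-MM-DD date into DD/MM/YYYY with named groups.
date_pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
reordered = re.sub(date_pattern, r"\g<day>/\g<month>/\g<year>", "2024-01-31")
print(reordered)  # 31/01/2024
```

Writing the replacement as a raw string keeps the backslashes intact, just as with the pattern itself.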

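re.subn(): replace matched content and report the count

Syntax: re.subn(pattern, repl, string, count=0, flags=0)
Behavior: works like re.sub() but also returns the number of replacements, as a tuple: (new string, number of replacements)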

import re

# Replace each run of digits with * and get the number of replacements
result, count = re.subn(r'\d+', '*', 'a1b22c333')
print("Result:", result)  # Output: Result: a*b*c*
print("Replacements:", count)  # Output: Replacements: 3

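re.split(): split the string at each match

Syntax: re.split(pattern, string, maxsplit=0, flags=0)
Parameters: maxsplit is the maximum number of splits (the default 0 means no limit)
Behavior: returns the list of pieces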

import re

# Example 1: split on any run of digits
result = re.split(r'\d+', 'a1b22c333d')
print(result)  # Output: ['a', 'b', 'c', 'd']

# Example 2: split at most 2 times
result2 = re.split(r'\d+', 'a1b22c333d', maxsplit=2)
print(result2)  # Output: ['a', 'b', 'c333d']
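
Two further behaviors of re.split() are easy to miss: a character class lets one call handle several delimiters, and a capturing group in the pattern keeps the delimiters in the result. A short sketch with made-up sample strings:

```python
import re

# Split on commas, semicolons, or whitespace in any combination.
fields = re.split(r"[,;\s]+", "a, b;c  d")
print(fields)  # ['a', 'b', 'c', 'd']

# With a capturing group, the matched separators are kept in the list.
with_separators = re.split(r"(\d+)", "a1b22c")
print(with_separators)  # ['a', '1', 'b', '22', 'c']
```
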

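re.compile(): precompile a regular expression

Syntax: re.compile(pattern, flags=0)
Use case: when the same pattern is used many times, precompiling avoids repeated lookup and parsing of the pattern and can improve performance. The gain is usually modest, because the module-level functions already cache recently compiled patterns; the compiled object also gives the pattern a reusable name.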

import re

# Precompile the pattern (matches a phone number)
phone_pattern = re.compile(r'1[3-9]\d{9}')

# Reuse the compiled pattern several times
text1 = 'My phone number: 13812345678'
text2 = 'His phone number: 13987654321'

result1 = phone_pattern.search(text1)
result2 = phone_pattern.search(text2)

print(result1.group())  # Output: 13812345678
print(result2.group())  # Output: 13987654321
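
A compiled pattern object offers the same operations as the module-level functions (search, findall, sub, and so on) as methods. The sketch below reuses the phone pattern on an assumed sample sentence; the replacement text `[hidden]` is an arbitrary placeholder.

```python
import re

phone_pattern = re.compile(r"1[3-9]\d{9}")
text = "Call 13812345678 or 13987654321."

numbers = phone_pattern.findall(text)          # all matches as a list
redacted = phone_pattern.sub("[hidden]", text)  # replace every match
print(numbers)                # ['13812345678', '13987654321']
print(redacted)               # Call [hidden] or [hidden].
print(phone_pattern.pattern)  # the original pattern string
```
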

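3. Secondary but Useful Utility Functions (Supplement)

re.escape()

Purpose: automatically escapes the characters that are special in a regex (such as ., *, ?, \), which avoids mistakes from escaping by hand
Use case: matching an ordinary string that contains special characters (such as a file name or URL)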

import re

# The string to match contains special characters: a.b*c
text = "test a.b*c 123"
# Escape the special characters automatically
pattern = re.escape("a.b*c")
print(pattern)  # Output: a\.b\*c
print(re.search(pattern, text).group())  # Output: a.b*c
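
re.escape() matters most when the text to search for comes from outside the program. In the sketch below, the helper `count_literal` and its inputs are hypothetical; it contrasts an escaped search with the same string used directly as a pattern.

```python
import re

def count_literal(needle, haystack):
    """Count occurrences of needle, treated as plain text, in haystack."""
    return len(re.findall(re.escape(needle), haystack))

haystack = "1+1=2, 11, 1+1"
escaped_count = count_literal("1+1", haystack)
# Unescaped, "1+1" means "one or more '1' followed by '1'", so it finds "11".
unescaped_hits = re.findall("1+1", haystack)
print(escaped_count)   # 2
print(unescaped_hits)  # ['11']
```
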

re.purge()

Purpose: clears the regular expression cache (Python caches compiled patterns, and clearing the cache by hand is rarely needed)

import re

re.purge()  # Clear the regular expression cache

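4. Common Flags (Matching Modes)

The flags parameter changes how a pattern is matched. Common values:

Flag	Meaning
`re.I` / `re.IGNORECASE`	Case-insensitive matching
`re.M` / `re.MULTILINE`	Multiline mode (`^` / `$` match at the start / end of each line)
`re.S` / `re.DOTALL`	Makes `.` match every character, including the newline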

import re

# Example 1: ignore case
result = re.search(r'abc', 'ABC123', re.I)
print(result.group())  # Output: ABC

# Example 2: multiline mode
text = '''
line1: abc
line2: 123
'''
# Match "line" at the start of every line
result2 = re.findall(r'^line', text, re.M)
print(result2)  # Output: ['line', 'line']

# Example 3: let . match the newline
result3 = re.search(r'a.b', 'a\nb', re.S)
print(result3.group())  # Prints a and b on separate lines (the match is 'a\nb')
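
Flags can be combined with the `|` operator. The sketch below combines `re.I` and `re.M` on an assumed sample text, and also shows `re.VERBOSE` (`re.X`), a flag not listed in the table above, which allows whitespace and comments inside the pattern.

```python
import re

text = "Title: Intro\nbody line\ntitle: Next\n"
# re.I ignores case; re.M lets ^ and $ match at each line boundary.
titles = re.findall(r"^title: (\w+)$", text, re.I | re.M)
print(titles)  # ['Intro', 'Next']

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
phone = re.compile(r"""
    1        # leading 1
    [3-9]    # second digit
    \d{9}    # remaining nine digits
""", re.VERBOSE)
print(phone.fullmatch("13812345678") is not None)  # True
```
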

5. Group Matching (Extracting Specific Parts)

Parentheses () define a group. group(n) returns the content of the n-th group (group(0) is the entire match).

import re

# Example: extract names and ages
text = 'Name: Zhang San, Age: 25; Name: Li Si, Age: 30'
pattern = r'Name: (.*?), Age: (\d+)'
result = re.findall(pattern, text)
print(result)  # Output: [('Zhang San', '25'), ('Li Si', '30')]

# Example: named groups (easier to read)
pattern2 = r'Name: (?P<name>.*?), Age: (?P<age>\d+)'
result2 = re.search(pattern2, text)
print(result2.group('name'))  # Output: Zhang San
print(result2.group('age'))   # Output: 25
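
A Match object can also return all groups at once, and a group can be made non-capturing with `(?:...)` when it is needed only for repetition. The following is a brief sketch with assumed sample strings:

```python
import re

m = re.search(r"Name: (?P<name>.*?), Age: (?P<age>\d+)",
              "Name: Zhang San, Age: 25")
all_groups = m.groups()    # every group as a tuple
by_name = m.groupdict()    # named groups as a dict
print(all_groups)  # ('Zhang San', '25')
print(by_name)     # {'name': 'Zhang San', 'age': '25'}

# findall returns group contents when the pattern has a capturing group,
# so a non-capturing group is needed to get the full matches back.
capturing = re.findall(r"(ab)+", "ababxab")
non_capturing = re.findall(r"(?:ab)+", "ababxab")
print(capturing)      # ['ab', 'ab']
print(non_capturing)  # ['abab', 'ab']
```
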

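Summary

Core role:
- The re library is Python's built-in module for regular expressions, used mainly to match, search, replace, and split strings
Core functions:
- match(): matches at the start of the string
- search(): finds the first match at any position
- findall(): returns all matches
- sub(): replaces matched content
- compile(): precompiles a pattern for reuse
Key techniques:
- Build matching rules from metacharacters, and use () to extract groups
- Prefer compile() when the same pattern is used many times
- Adjust the matching mode through flags (such as ignoring case or multiline matching)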