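A Brief Summary of Common Methods in Python's re Module

Python's re module provides regular expression operations for pattern matching, searching, and replacing text.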

1. Basic matching methods

re.match() - match from the start of the string

import re

pattern = r"hello"
text = "hello world"

# The pattern matches at the start of the string
result = re.match(pattern, text)
if result:
    print("Matched:", result.group())  # Matched: hello
    print("Span:", result.span())      # Span: (0, 5)

# No match: "world" occurs in the text, but not at the start
result = re.match(r"world", text)
print(result)  # None

re.search() - search the whole string

import re

pattern = r"world"
text = "hello world"

result = re.search(pattern, text)
if result:
    print("Found:", result.group())   # Found: world
    print("Span:", result.span())     # Span: (6, 11)
    print("Start:", result.start())   # Start: 6
    print("End:", result.end())       # End: 11

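re.findall() - find all matches

This function finds every substring that matches the regular expression. If nothing matches, it returns an empty list. If there are matches, the return value depends on the number of capture groups (parentheses) in the pattern:

| Capture groups | Return type | Example |
| --- | --- | --- |
| 0 groups | `List[str]`; each element is the full text of one match | `re.findall(r'\d+', 'a1b2c3')` → `['1', '2', '3']` |
| 1 group | `List[str]`; each element is the text captured by the group | `re.findall(r'(\d+)', 'a1b2c3')` → `['1', '2', '3']` |
| 2 or more groups | `List[Tuple[str, ...]]`; each element is a tuple holding the text captured by every group | `re.findall(r'(\d+)([a-z])', '1a2b')` → `[('1', 'a'), ('2', 'b')]` |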

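import re

# Sample text: "apples 10 yuan, bananas 5 yuan, oranges 8 yuan"
text = "苹果10元，香蕉5元，橙子8元"

# Find all numbers (no capture group: full matches are returned)
numbers = re.findall(r"\d+", text)
print(numbers)  # ['10', '5', '8']

# Find all product names (one capture group: only the group is returned)
products = re.findall(r"([\u4e00-\u9fa5]+)\d+元", text)
print(products)  # ['苹果', '香蕉', '橙子']

# Find all products with their prices (two capture groups: tuples are returned)
items = re.findall(r"([\u4e00-\u9fa5]+)(\d+)元", text)
print(items)  # [('苹果', '10'), ('香蕉', '5'), ('橙子', '8')]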
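
The group-count rule above counts only capturing groups. As a minimal sketch (the sample string and variable names are illustrative assumptions, not part of the original), a non-capturing group `(?:...)` lets a pattern group and repeat a sub-expression without changing what `re.findall()` returns:

```python
import re

text = "2023-12-25 and 2024-01-01"  # illustrative sample data

# One capturing group: findall returns only what the group captured.
# A repeated group keeps its last repetition, here the day with its hyphen.
captured = re.findall(r"\d{4}(-\d{2})+", text)

# Non-capturing group: the pattern has zero capturing groups,
# so findall returns the full text of each match.
full = re.findall(r"\d{4}(?:-\d{2})+", text)

print(captured)  # ['-25', '-01']
print(full)      # ['2023-12-25', '2024-01-01']
```

If the full match is wanted but grouping is still needed, switching `(...)` to `(?:...)` is usually the smallest change.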

re.finditer() - return an iterator of matches

import re

text = "Python 3.9, Python 3.10, Python 3.11"

# finditer yields one Match object per match
for match in re.finditer(r"Python (\d+\.\d+)", text):
    print(f"Version: {match.group(1)}  Span: {match.span()}")
# Version: 3.9  Span: (0, 10)
# Version: 3.10  Span: (12, 23)
# Version: 3.11  Span: (25, 36)

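Return values of `re.match()`, `re.search()`, `re.findall()`, and `re.finditer()` compared

| Function | Return type | On failure | On success | Number of matches | Extra information | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| `re.match()` | `Match` object or `None` | `None` | one `Match` object | at most 1 (at the start) | rich (positions, groups) | Matches from the start of the string; fails if the start does not match |
| `re.search()` | `Match` object or `None` | `None` | one `Match` object | at most 1 (the first) | rich (positions, groups) | Scans the string for the first position that matches |
| `re.findall()` | `List[str]` or `List[Tuple[str, ...]]` | an empty list | a list of strings or a list of tuples, depending on the capture groups | all | minimal (text only) | With 0 or 1 capture group, returns `List[str]`: the full match text (0 groups) or the captured text (1 group). With 2 or more capture groups, returns `List[Tuple[str, ...]]`, each tuple holding the text captured by each group |
| `re.finditer()` | iterator (`Iterator[Match]`) | an empty iterator | an iterator over all `Match` objects | all | rich (positions, groups) | Each iteration yields one `Match` object |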
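
To make the comparison concrete, here is a small sketch that runs all four functions on the same input. The sample string and variable names are assumptions chosen for illustration:

```python
import re

text = "a1b22c333"   # illustrative sample data
pattern = r"\d+"

m = re.match(pattern, text)    # None: the string does not start with a digit
s = re.search(pattern, text)   # Match object for the first run of digits
all_strings = re.findall(pattern, text)  # plain list of matched strings
# finditer yields Match objects, so positions are available for every match
all_spans = [mo.span() for mo in re.finditer(pattern, text)]

print(m)            # None
print(s.group())    # 1
print(all_strings)  # ['1', '22', '333']
print(all_spans)    # [(1, 2), (3, 5), (6, 9)]
```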

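Common methods of the `Match` object

| Method | Description | Example |
| --- | --- | --- |
| `.group()` or `.group(0)` | Returns the full matched string | `match.group()` → `'abc123'` |
| `.group(n)` | Returns the text of the n-th capture group | `match.group(1)` → `'abc'` |
| `.groups()` | Returns a tuple of all capture groups | `match.groups()` → `('abc', '123')` |
| `.start()` | Start index of the match | `match.start()` → `0` |
| `.end()` | End index of the match (exclusive) | `match.end()` → `6` |
| `.span()` | Returns the `(start, end)` tuple | `match.span()` → `(0, 6)` |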

The examples in the table come from the following script:

import re

str_test = 'abc123'
pattern = re.compile(r'([a-z]+)([0-9]+)')
match = pattern.match(str_test)
print(f'{match.group()=}')
print(f'{match.group(0)=}')
print(f'{match.group(1)=}')
print(f'{match.group(2)=}')
print(f'{match.groups()=}')
print(f'{match.start()=}')
print(f'{match.end()=}')
print(f'{match.span()=}')

# Output:
# match.group()='abc123'
# match.group(0)='abc123'
# match.group(1)='abc'
# match.group(2)='123'
# match.groups()=('abc', '123')
# match.start()=0
# match.end()=6
# match.span()=(0, 6)

Combined example

import re

# Sample text: a date written as year/month/day, followed by "Sunday"
text = "2023年12月25日 星期天"
pattern = r"(\d{4})年(\d{1,2})月(\d{1,2})日"

match = re.search(pattern, text)
if match:
    # The full match
    print("Full match:", match.group())      # 2023年12月25日
    print("Full match:", match.group(0))     # same as above

    # Individual groups
    print("Year:", match.group(1))           # 2023
    print("Month:", match.group(2))          # 12
    print("Day:", match.group(3))            # 25

    # All groups at once
    print("All groups:", match.groups())     # ('2023', '12', '25')

    # Group dictionary (requires named groups)
    pattern2 = r"(?P<year>\d{4})年(?P<month>\d{1,2})月(?P<day>\d{1,2})日"
    match2 = re.search(pattern2, text)
    if match2:
        print("Group dict:", match2.groupdict())  # {'year': '2023', 'month': '12', 'day': '25'}
        print("Year:", match2.group('year'))      # 2023

    # Position information
    print("Start:", match.start())           # 0
    print("End:", match.end())               # 11
    print("Span:", match.span())             # (0, 11)
    print("Group 1 start:", match.start(1))  # 0
    print("Group 1 end:", match.end(1))      # 4

2. Replacement methods

re.sub() - replace matches

import re

text = "Today is 2023-12-25, tomorrow is 2023-12-26"

# Simple replacement with a fixed string
new_text = re.sub(r"\d{4}-\d{2}-\d{2}", "DATE", text)
print(new_text)  # Today is DATE, tomorrow is DATE

# Replacement computed by a function that receives each Match object
def replace_date(match):
    date_str = match.group()
    return f"[{date_str}]"

new_text = re.sub(r"\d{4}-\d{2}-\d{2}", replace_date, text)
print(new_text)  # Today is [2023-12-25], tomorrow is [2023-12-26]

# Limit the number of replacements
text = "aaa bbb aaa ccc aaa"
new_text = re.sub(r"aaa", "XXX", text, count=2)
print(new_text)  # XXX bbb XXX ccc aaa
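
Besides a fixed string or a function, the replacement argument of `re.sub()` can refer back to capture groups. The following is a minimal sketch; the sample text, the group names `y`, `m`, `d`, and the output formats are assumptions chosen for illustration:

```python
import re

text = "Today is 2023-12-25"  # illustrative sample data

# Numbered backreferences: \1, \2, \3 insert the captured groups
dmy = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

# Named groups are referenced as \g<name> in the replacement string
named = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<d>.\g<m>.\g<y>",
    text,
)

print(dmy)    # Today is 25/12/2023
print(named)  # Today is 25.12.2023
```

Writing the replacement as a raw string keeps `\1` from being read as a string escape.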

re.subn() - replace and return the count

import re

text = "apple banana apple orange apple"

# Replace and also get the number of replacements made
new_text, count = re.subn(r"apple", "fruit", text)
print(f"New text: {new_text}")    # New text: fruit banana fruit orange fruit
print(f"Replacements: {count}")   # Replacements: 3

# Limit the number of replacements
new_text, count = re.subn(r"apple", "fruit", text, count=2)
print(f"Replacements: {count}")   # Replacements: 2

3. Splitting

re.split() - split by a pattern

import re

text = "apple,banana;orange|grape"

# Split on any of several separators
parts = re.split(r"[,;|]", text)
print(parts)  # ['apple', 'banana', 'orange', 'grape']

# Keep the separators: a capture group adds them to the result
parts = re.split(r"([,;|])", text)
print(parts)  # ['apple', ',', 'banana', ';', 'orange', '|', 'grape']

# Limit the number of splits
text = "1,2,3,4,5"
parts = re.split(r",", text, maxsplit=2)
print(parts)  # ['1', '2', '3,4,5']

# Split on runs of whitespace
text = "words and spaces"
parts = re.split(r"\s+", text)
print(parts)  # ['words', 'and', 'spaces']

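4. Compiling regular expressions

When a regular expression is used repeatedly, precompiling it with re.compile() can improve performance. On success, re.compile() returns a re.Pattern object (a regular expression object); on failure it raises re.error rather than returning None. Compiling also surfaces syntax errors at the point where the pattern is compiled (for example at import time, if the pattern is a module-level constant) instead of at the first call that uses it, which makes programs more robust. For that reason, compiling first is worth considering even for a pattern that is used only once. Note that module-level functions such as re.search() also compile and cache recently used patterns internally, so the speed-up is usually modest.

A compiled Pattern object provides methods that mirror the module-level matching functions. They are used the same way, except that the pattern argument is omitted; the matching methods (match, search, findall, finditer) additionally accept optional pos and endpos arguments:

| Method | Corresponding `re` function | Difference |
| --- | --- | --- |
| `pattern.match(string)` | `re.match(pattern, string)` | The regular expression is not passed again |
| `pattern.search(string)` | `re.search(pattern, string)` | Same as above |
| `pattern.findall(string)` | `re.findall(pattern, string)` | Same as above |
| `pattern.finditer(string)` | `re.finditer(pattern, string)` | Same as above |
| `pattern.sub(repl, string)` | `re.sub(pattern, repl, string)` | Same as above |
| `pattern.split(string)` | `re.split(pattern, string)` | Same as above |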
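
One parameter difference worth illustrating is that the matching methods of a compiled pattern take optional `pos` and `endpos` arguments, which the module-level functions do not. A minimal sketch, with a sample string and variable names chosen for illustration:

```python
import re

pattern = re.compile(r"\d+")
text = "a1b22c333"  # illustrative sample data

# pos/endpos restrict matching to text[pos:endpos] without slicing,
# so reported positions still refer to the original string.
first = pattern.search(text)
later = pattern.search(text, 2)       # start searching at index 2
window = pattern.findall(text, 2, 7)  # only consider indices 2 through 6

print(first.span())  # (1, 2)
print(later.span())  # (3, 5)
print(window)        # ['22', '3']
```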

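Common attributes of the `Pattern` object

| Attribute | Description | Example |
| --- | --- | --- |
| `.pattern` | The original regular expression string | `pattern.pattern` → `r'\d+'` |
| `.flags` | The flags in effect for the compiled pattern | `pattern.flags` → `34` for a `str` pattern compiled with `re.IGNORECASE` (`re.IGNORECASE` is 2, plus the implicit `re.UNICODE`, 32) |
| `.groups` | Number of capture groups in the pattern | `pattern.groups` → `2` |
| `.groupindex` | Mapping from group names to group numbers | `pattern.groupindex` → `{'name': 1}` |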

import re

# Compile the regular expression once
pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

texts = [
    "Phone: 123-456-7890",
    "Mobile: 987-654-3210",
    "Invalid: 12-345-6789"
]

# Reuse the compiled pattern
for text in texts:
    match = pattern.search(text)
    if match:
        print(f"Found phone number: {match.group()}")
    else:
        print("No valid phone number found")

# Pass flags at compile time
pattern = re.compile(r"python", re.IGNORECASE)  # case-insensitive
text = "Python is great, PYTHON is powerful"
matches = pattern.findall(text)
print(matches)  # ['Python', 'PYTHON']
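
The `Pattern` attributes listed in the table can be inspected directly. This is a small sketch; the pattern itself is an illustrative assumption:

```python
import re

# Illustrative pattern: one named group and one unnamed group
pattern = re.compile(r"(?P<name>[a-z]+)(\d+)", re.IGNORECASE)

source = pattern.pattern                  # the original pattern string
group_count = pattern.groups              # number of capture groups
names = dict(pattern.groupindex)          # groupindex is a read-only mapping
ignores_case = bool(pattern.flags & re.IGNORECASE)  # test a single flag bit

print(source)        # (?P<name>[a-z]+)(\d+)
print(group_count)   # 2
print(names)         # {'name': 1}
print(ignores_case)  # True
```

Testing a flag with `&` is more robust than comparing `pattern.flags` to a number, because other flags may be set implicitly.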

5. Common regular expression patterns

import re

# 1. Email validation (a simplified check, not a full RFC-compliant validator)
emails = ["test@example.com", "invalid-email", "user.name@domain.co.uk"]
email_pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

for email in emails:
    if re.match(email_pattern, email):
        print(f"Valid email: {email}")
    else:
        print(f"Invalid email: {email}")

# 2. Mobile number validation (mainland China)
phones = ["13800138000", "12345678901", "010-12345678"]
phone_pattern = r"^1[3-9]\d{9}$"

for phone in phones:
    if re.match(phone_pattern, phone):
        print(f"Valid mobile number: {phone}")

# 3. URL extraction
text = "Visit https://www.example.com or http://sub.domain.org/path"
url_pattern = r"https?://[^\s/$.?#]+\.[^\s]*"
urls = re.findall(url_pattern, text)
print("URLs found:", urls)  # ['https://www.example.com', 'http://sub.domain.org/path']

# 4. HTML tag extraction (\1 requires the closing tag to match the opening tag)
html = "<div>content 1</div><p>content 2</p><span>content 3</span>"
tag_pattern = r"<([a-zA-Z]+)[^>]*>([^<]+)</\1>"
tags = re.findall(tag_pattern, html)
print("HTML tags:", tags)  # [('div', 'content 1'), ('p', 'content 2'), ('span', 'content 3')]

# 5. Extracting Chinese text
text = "Hello 世界！Python 编程 2023。"
chinese_pattern = r"[\u4e00-\u9fa5]+"
chinese_words = re.findall(chinese_pattern, text)
print("Chinese words:", chinese_words)  # ['世界', '编程']

# 6. Extracting numbers (including decimals)
text = "price: 12.5 yuan, quantity: 100, discount: 0.85"
number_pattern = r"\d+\.?\d*"
numbers = re.findall(number_pattern, text)
print("Numbers:", numbers)  # ['12.5', '100', '0.85']
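
One caveat for the validation patterns above: `$` also matches just before a trailing newline, so `re.match()` with `^...$` accepts input that ends in `"\n"`. As a hedged sketch (the candidate string is an assumption chosen to show the effect), `re.fullmatch()` requires the entire string to match:

```python
import re

phone_pattern = r"1[3-9]\d{9}"
candidate = "13800138000\n"  # trailing newline, e.g. a line read from a file

# `$` matches at the end of the string or just before a final newline,
# so the anchored pattern still accepts this input.
loose = re.match(r"^" + phone_pattern + r"$", candidate)

# fullmatch requires the whole string, newline included, to match.
strict = re.fullmatch(phone_pattern, candidate)

print(loose is not None)   # True
print(strict is not None)  # False
```

Whether a trailing newline should be rejected depends on the application; stripping the input first is another option.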

6. Regular expression flags

import re

text = "Python\npython\nPYTHON"

# re.IGNORECASE (re.I) - case-insensitive matching
matches = re.findall(r"python", text, re.IGNORECASE)
print("Case-insensitive:", matches)  # ['Python', 'python', 'PYTHON']

# re.MULTILINE (re.M) - ^ and $ match at the start and end of every line
text = "第一行\n第二行\n第三行"  # three lines: "first line", "second line", "third line"
matches = re.findall(r"^第.*行$", text, re.MULTILINE)
print("Multiline:", matches)  # ['第一行', '第二行', '第三行']

# re.DOTALL (re.S) - the dot matches every character, including newlines
text = "第一行\n第二行"
match = re.search(r"第一行.*第二行", text, re.DOTALL)
if match:
    print("Match across lines:", match.group())

# re.VERBOSE (re.X) - verbose mode (whitespace is ignored and comments are allowed)
pattern = re.compile(r"""
    \d{3}    # area code
    -        # separator
    \d{3}    # prefix
    -        # separator
    \d{4}    # line number
""", re.VERBOSE)

# Combine flags with |
pattern = re.compile(r"python", re.IGNORECASE | re.MULTILINE)
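
The verbose pattern above is defined but never applied. A minimal sketch of using it follows; the capture groups, the sample sentence, and the name `phone` are additions made for illustration. It also shows the inline form `(?i)`, which sets `re.IGNORECASE` from inside the pattern:

```python
import re

# Same layout as the verbose pattern above, with capture groups added
phone = re.compile(r"""
    (\d{3})   # area code
    -         # separator
    (\d{3})   # prefix
    -         # separator
    (\d{4})   # line number
""", re.VERBOSE)

match = phone.search("call 123-456-7890 today")  # illustrative input
parts = match.groups()
print(parts)  # ('123', '456', '7890')

# Inline flag: (?i) at the start of the pattern behaves like re.IGNORECASE
inline = re.findall(r"(?i)python", "Python PYTHON")
print(inline)  # ['Python', 'PYTHON']
```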

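7. Other techniques

Non-greedy matching

import re

text = "<div>content 1</div><div>content 2</div>"

# Greedy matching (the default): .* consumes as much as possible
greedy = re.findall(r"<div>.*</div>", text)
print("Greedy:", greedy)  # ['<div>content 1</div><div>content 2</div>']

# Non-greedy matching: .*? consumes as little as possible
non_greedy = re.findall(r"<div>.*?</div>", text)
print("Non-greedy:", non_greedy)  # ['<div>content 1</div>', '<div>content 2</div>']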

Lookahead and lookbehind assertions

import re

# Sample text: "apples 10 yuan, bananas 5 yuan, oranges 8 yuan"
text = "苹果10元，香蕉5元，橙子8元"

# Positive lookbehind (?<=...): digits preceded by "香蕉" (banana)
pattern = r"(?<=香蕉)\d+"
matches = re.findall(pattern, text)
print("Banana price:", matches)  # ['5']

# Negative lookbehind (?<!...): digits not preceded by "香蕉"
pattern = r"(?<!香蕉)\d+"
matches = re.findall(pattern, text)
print("Non-banana prices:", matches)  # ['10', '8']

# Positive lookahead (?=...): digits followed by "元" (yuan)
pattern = r"\d+(?=元)"
matches = re.findall(pattern, text)
print("All prices:", matches)  # ['10', '5', '8']

# Negative lookahead (?!...): digits not followed by "元"
text2 = "苹果10个，香蕉5斤，橙子8元"
pattern = r"\d+(?!元)"
matches = re.findall(pattern, text2)
print("Units other than yuan:", matches)  # ['10', '5']
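
The negative lookbehind example works because every price after "香蕉" has a single digit. With multi-digit numbers the assertion is only checked where a match starts, so a later digit can slip through. A hedged sketch of the pitfall and one possible fix (the sample text is an assumption chosen for illustration):

```python
import re

text = "苹果10元，香蕉15元"  # "apples 10 yuan, bananas 15 yuan"

# Naive: the match cannot start at "1" (preceded by 香蕉),
# but it can start at "5", which is preceded by "蕉1".
naive = re.findall(r"(?<!香蕉)\d+", text)

# One possible fix: also require that the match does not start mid-number.
strict = re.findall(r"(?<!香蕉)(?<!\d)\d+", text)

print(naive)   # ['10', '5']
print(strict)  # ['10']
```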

Named groups

import re

# Sample record: name "张三", age 30 ("岁" means years old), city "北京" (Beijing)
text = "张三,30岁,北京"
pattern = r"(?P<name>[\u4e00-\u9fa5]+),(?P<age>\d+)岁,(?P<city>[\u4e00-\u9fa5]+)"

match = re.search(pattern, text)
if match:
    print("Name:", match.group('name'))  # 张三
    print("Age:", match.group('age'))    # 30
    print("City:", match.group('city'))  # 北京
    print("Group dict:", match.groupdict())  # {'name': '张三', 'age': '30', 'city': '北京'}

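Precompile for performance and use raw strings to avoid double escaping

import re
import time

# Precompiling can improve performance
text = "a" * 10000 + "b"

# Without precompiling (the re module still caches the compiled pattern internally)
start = time.perf_counter()
for _ in range(1000):
    re.search(r"b", text)
print(f"Without precompiling: {time.perf_counter() - start:.4f} s")

# With precompiling
pattern = re.compile(r"b")
start = time.perf_counter()
for _ in range(1000):
    pattern.search(text)
print(f"With precompiling: {time.perf_counter() - start:.4f} s")

# Use raw strings to avoid escaping problems
# Recommended: in a raw string, each regex backslash is written once
pattern1 = r"\d+\.\d+"      # matches digits, a literal dot, digits
# Without a raw string, every backslash must be doubled
pattern2 = "\\d+\\.\\d+"    # same regex, harder to read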
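
As a quick check of the raw-string point, the sketch below compares the two spellings and applies the pattern. The sample sentence and variable names are assumptions chosen for illustration:

```python
import re

raw = r"\d+\.\d+"         # raw string: backslashes written once
escaped = "\\d+\\.\\d+"   # ordinary string: every backslash doubled

# Both literals produce exactly the same pattern text
same = raw == escaped
decimals = re.findall(raw, "pi is 3.14, e is 2.718")

print(same)      # True
print(len(raw))  # 8
print(decimals)  # ['3.14', '2.718']
```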

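8. Error handling

import re

# Invalid regular expression
try:
    re.compile(r"*invalid")  # a quantifier with nothing to repeat
except re.error as e:
    print(f"Regular expression error: {e}")

# Handling a failed match
text = "text without digits"
match = re.search(r"\d+", text)
if match:  # always check the result before using it
    print("Found number:", match.group())
else:
    print("No number found")

# Extracting groups safely
text = "partial match: 2.10"
match = re.search(r"(\d+)\.(\d+)", text)
if match:
    # The number of groups is fixed by the pattern, so once the match
    # succeeds group(1) and group(2) always exist. A group that did not
    # take part in the match (an optional one, for example) is None.
    integer = match.group(1)
    decimal = match.group(2)
    print(f"Integer part: {integer}, fractional part: {decimal}")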
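
When a pattern comes from user input or configuration, the `try`/`except re.error` idiom above can be wrapped in a small helper. This is only a sketch: `compile_or_none` is a hypothetical name introduced here, not part of the `re` module.

```python
import re
from typing import Optional

def compile_or_none(source: str) -> Optional[re.Pattern]:
    """Hypothetical helper: return a compiled pattern, or None if invalid."""
    try:
        return re.compile(source)
    except re.error as exc:
        # re.error carries the message and the position of the problem
        print(f"invalid pattern {source!r}: {exc.msg} (position {exc.pos})")
        return None

good = compile_or_none(r"\d+")    # valid pattern
bad = compile_or_none(r"(\d+")    # missing closing parenthesis

print(good is not None)  # True
print(bad is None)       # True
```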

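Summary table of common `re` module methods

| Method | Description | Return value |
| --- | --- | --- |
| `re.match(pattern, string)` | Match from the start of the string | Match object (`re.Match`) or `None` |
| `re.search(pattern, string)` | Search the whole string | Match object (`re.Match`) or `None` |
| `re.findall(pattern, string)` | Find all matches | list |
| `re.finditer(pattern, string)` | Iterate over all matches | iterator |
| `re.sub(pattern, repl, string)` | Replace matches | new string |
| `re.subn(pattern, repl, string)` | Replace and return the count | `(new string, count)` |
| `re.split(pattern, string)` | Split by a pattern | list |
| `re.compile(pattern)` | Precompile a regular expression | pattern object (`re.Pattern`) |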

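The re module is a powerful tool for text processing in Python. Mastering these common methods makes string matching, extraction, and replacement efficient. In practice, consider precompiling complex regular expressions for performance, and use groups and flags judiciously to simplify matching logic.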
