Python Programming: Advanced Topics (Regular Expressions)

# Extract all mobile phone numbers (simplified)
import re
text = "Contact me: 13812345678 or 15987654321"
phones = re.findall(r'1[3-9]\d{9}', text)
print(phones)  # ['13812345678', '15987654321']
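The simplified pattern above will also pick an 11-digit slice out of a longer run of digits. As a minimal sketch, assuming a phone number should never be directly preceded or followed by another digit, lookarounds can guard both ends (the `sample` string and the `PHONE` name are made up for illustration):

```python
import re

# Assumption: a valid number is exactly 11 digits and is not part of a longer digit run.
PHONE = re.compile(r'(?<!\d)1[3-9]\d{9}(?!\d)')

sample = "order 913812345678001, call 13812345678 or 15987654321."

loose = re.findall(r'1[3-9]\d{9}', sample)   # also matches inside the order number
strict = PHONE.findall(sample)               # only the two standalone numbers

print(loose)   # ['13812345678', '13812345678', '15987654321']
print(strict)  # ['13812345678', '15987654321']
```

Whether the stricter form is worth it depends on the input: for clean contact text the simple pattern is enough.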

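II. Basic Syntax Quick Reference

1. Basic matching

Pattern	Description	Example
`abc`	Literal match	`re.search('hello', 'hello world')`
`.`	Any character except a newline	`a.c` matches "abc", "a c"
`^`	Start of the string	`^Hello`
`$`	End of the string (or just before a trailing newline)	`world$`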
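The rows of the table can be checked directly in the interpreter. A small sketch (the sample strings are made up for illustration):

```python
import re

# . matches any single character except a newline
dot_match = re.search(r'a.c', 'xx a c yy').group()
dot_newline = re.search(r'a.c', 'a\nc')

# ^ anchors at the start, $ at the end (or just before a trailing newline)
starts = re.search(r'^Hello', 'Hello world') is not None
ends = re.search(r'world$', 'Hello world\n') is not None

# \Z matches only at the very end of the string
ends_strict = re.search(r'world\Z', 'Hello world\n') is not None

print(dot_match)    # a c
print(dot_newline)  # None
print(starts)       # True
print(ends)         # True
print(ends_strict)  # False
```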

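2. Character classes

Pattern	Description	Equivalent
`[abc]`	Any one of a, b, or c	-
`[^abc]`	Any one character other than a, b, or c	-
`[a-z]`	A lowercase ASCII letter	-
`\d`	A digit	`[0-9]` (exactly so only with `re.ASCII`)
`\w`	A word character	`[a-zA-Z0-9_]` (exactly so only with `re.ASCII`)
`\s`	A whitespace character	`[ \t\n\r\f\v]` (exactly so only with `re.ASCII`)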
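For `str` patterns in Python 3, `\d`, `\w`, and `\s` are Unicode-aware by default, so the ASCII equivalents in the table hold exactly only under `re.ASCII`. A short sketch of the difference (the sample text is made up; the escapes spell "é" and two full-width digits):

```python
import re

text = "abc_123 h\u00e9llo \uff14\uff12"

unicode_digits = re.findall(r'\d+', text)            # ASCII and full-width digits
ascii_digits = re.findall(r'\d+', text, re.ASCII)    # only 0-9

unicode_words = re.findall(r'\w+', text)             # accented word stays whole
ascii_words = re.findall(r'\w+', text, re.ASCII)     # split at the accented letter

# A negated class matches any single character NOT listed
negated = re.findall(r'[^abc]', 'aXbYc')

print(ascii_digits)  # ['123']
print(ascii_words)   # ['abc_123', 'h', 'llo']
print(negated)       # ['X', 'Y']
```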

3. Quantifiers

Pattern	Description	Example
`*`	Zero or more times	`ab*c` → "ac", "abbc"
`+`	One or more times	`ab+c` → "abc", "abbbc"
`?`	Zero or one time	`colou?r` → "color", "colour"
`{n}`	Exactly n times	`\d{3}` → "123"
`{n,}`	At least n times	`\d{2,}` → "12", "123"
`{n,m}`	Between n and m times (inclusive)	`\d{2,4}` → "12", "1234"

Greedy vs. non-greedy:

Greedy (default): .* matches as much as possible

Non-greedy: .*? matches as little as possible
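The difference is easiest to see on a string with more than one candidate match. A minimal sketch (the sample strings are made up for illustration):

```python
import re

s = 'say "one" and "two"'

greedy = re.findall(r'".*"', s)    # runs from the first quote to the last
lazy = re.findall(r'".*?"', s)     # stops at the nearest closing quote

# The same rule applies to counted quantifiers
greedy_count = re.findall(r'\d{2,4}', '1 12 12345')
lazy_count = re.findall(r'\d{2,4}?', '1 12 12345')

print(greedy)        # ['"one" and "two"']
print(lazy)          # ['"one"', '"two"']
print(greedy_count)  # ['12', '1234']
print(lazy_count)    # ['12', '12', '34']
```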

4. Grouping and capturing

Pattern	Description
`(abc)`	Capturing group; can be referenced later as `\1`
`(?:abc)`	Non-capturing group
`(?P<name>...)`	Named group
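To make the three kinds of group concrete, here is a small sketch (sample strings and the group names `key` and `value` are made up for illustration):

```python
import re

# \1 refers back to whatever group 1 captured: find doubled words
doubled = re.findall(r'\b(\w+) \1\b', 'this is is a test test')

# With a capturing group, findall() returns the group, not the whole match
capturing = re.findall(r'(ab)+', 'ababab ab')
# With a non-capturing group, findall() returns the whole match
non_capturing = re.findall(r'(?:ab)+', 'ababab ab')

# Named groups can be read by name
m = re.search(r'(?P<key>\w+)=(?P<value>\d+)', 'width=100')
key, value = m.group('key'), m['value']

print(doubled)        # ['is', 'test']
print(capturing)      # ['ab', 'ab']
print(non_capturing)  # ['ababab', 'ab']
print(key, value)     # width 100
```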

III. Core Functions of Python's `re` Module

1. `re.match()`: match at the start of the string

import re

# Matches only at the beginning of the string
print(re.match(r'\d+', '123abc'))  # <re.Match object; span=(0, 3), match='123'>
print(re.match(r'\d+', 'abc123'))  # None

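2. `re.search()`: find the first match anywhere in the string

# Scans the whole string
print(re.search(r'\d+', 'abc123def'))  # matches "123"

Difference:

match() is roughly search() with the pattern anchored by ^ (they differ under re.MULTILINE, where ^ also matches after each newline, while match() still only tries position 0)

In most cases, search() is what you want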
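The difference can be sketched side by side. `re.fullmatch()` is not covered in this article, but it is part of the standard `re` module and completes the picture: it requires the whole string to match.

```python
import re

s = 'abc123'
m_match = re.match(r'\d+', s)        # None: the string does not start with a digit
m_search = re.search(r'\d+', s)      # finds '123'
m_anchored = re.search(r'^\d+', s)   # None: behaves like match() here

# Under re.MULTILINE the two are no longer interchangeable
lines = 'abc\n123'
ml_match = re.match(r'\d+', lines, re.M)      # None: match() still starts at position 0
ml_search = re.search(r'^\d+', lines, re.M)   # finds '123' at the start of line 2

# fullmatch() requires the entire string to match
full = re.fullmatch(r'\d+', '123abc')         # None

print(m_search.group())   # 123
print(ml_search.group())  # 123
```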

3. `re.findall()`: find all matches

text = "Emails: alice@example.com, bob@test.org"
emails = re.findall(r'\w+@\w+\.\w+', text)
print(emails)  # ['alice@example.com', 'bob@test.org']

# With more than one group, each result is a tuple of the groups
pairs = re.findall(r'(\w+)=(\d+)', 'width=100 height=200')
print(pairs)  # [('width', '100'), ('height', '200')]

4. `re.finditer()`: return an iterator of match objects (memory-friendly)

for match in re.finditer(r'\d+', 'a1b22c333'):
    print(match.group(), match.span())
# 1 (1, 2)
# 22 (3, 5)
# 333 (6, 9)

5. `re.sub()`: replace matches

# Simple replacement
text = "Today is 2023-10-05"
new_text = re.sub(r'\d{4}-\d{2}-\d{2}', 'YYYY-MM-DD', text)
print(new_text)  # Today is YYYY-MM-DD

# Dynamic replacement with a function: it receives each match object
def square(match):
    return str(int(match.group()) ** 2)

result = re.sub(r'\d+', square, "1 + 2 = 3")  # "1 + 4 = 9"

6. `re.split()`: split by a pattern

# Split but keep the separators (the capturing group puts them in the result)
parts = re.split(r'(\W+)', "Hello, World!")
print(parts)  # ['Hello', ', ', 'World', '!', '']

# Limit the number of splits
re.split(r'\s+', "a b c d", maxsplit=2)  # ['a', 'b', 'c d']

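IV. Compiling Patterns: `re.compile()` for Better Performance

When the same pattern is used many times, precompiling it skips the per-call lookup in the `re` module's internal pattern cache and gives the pattern a reusable name. The speedup is usually modest, but it adds up in tight loops:

# Uncompiled (each call looks the pattern up in re's internal cache)
# log_lines and process() stand in for your own data and handler
for line in log_lines:
    if re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line):
        process(line)

# Compiled (recommended)
ip_pattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
for line in log_lines:
    if ip_pattern.search(line):
        process(line)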
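Note that `\d{1,3}` also accepts values such as 999, so the pattern above finds IP-shaped text rather than valid addresses. One possible sketch, assuming stricter checking is wanted, is to let the compiled pattern find candidates and hand validation to the standard `ipaddress` module (the `find_ipv4` helper and the sample `log_lines` are hypothetical):

```python
import ipaddress
import re

# Candidate: four dot-separated groups of 1-3 digits, not glued to other digits or dots
IP_CANDIDATE = re.compile(r'(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?![\d.])')

def find_ipv4(line):
    """Return the candidates in `line` that ipaddress accepts as IPv4 addresses."""
    found = []
    for candidate in IP_CANDIDATE.findall(line):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # shaped like an IP, but not a valid address
        found.append(candidate)
    return found

# Hypothetical sample data
log_lines = [
    "10.0.0.1 GET /index.html",
    "bad address 999.1.1.1 rejected",
    "no address here",
]
hits = [find_ipv4(line) for line in log_lines]
print(hits)  # [['10.0.0.1'], [], []]
```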

Methods on the compiled object

pattern = re.compile(r'(\w+)@(\w+)\.(\w+)')
match = pattern.search("contact@runoob.com")

print(match.group(0))    # contact@runoob.com
print(match.group(1))    # contact
print(match.groups())    # ('contact', 'runoob', 'com')
print(match.span())      # (0, 18)

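V. Flags: Controlling Matching Behavior

Flag	Effect	Example
`re.IGNORECASE` (`re.I`)	Case-insensitive matching	`re.search('HELLO', 'hello', re.I)`
`re.MULTILINE` (`re.M`)	Multi-line mode (`^`/`$` match at the start/end of every line)
`re.DOTALL` (`re.S`)	`.` also matches newlines
`re.VERBOSE` (`re.X`)	Ignores whitespace in the pattern and allows `#` comments (improves readability)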
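The table rows without an example can be illustrated with a short sketch (the `log` string is made up for illustration):

```python
import re

log = "start one\nSTART two\nend"

# Without re.M, ^ only matches at the very start of the string
first_only = re.findall(r'^start', log, re.I)
# With re.M, ^ matches at the start of every line
every_line = re.findall(r'^start', log, re.I | re.M)

# Without re.S, . stops at the newline, so the match cannot span lines
no_dotall = re.search(r'one.*two', log)
# With re.S, . also matches '\n'
dotall = re.search(r'one.*two', log, re.S)

print(first_only)           # ['start']
print(every_line)           # ['start', 'START']
print(no_dotall)            # None
print(repr(dotall.group())) # 'one\nSTART two'
```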

Combining flags

# Case-insensitive and multi-line mode at the same time
re.search(r'^start', text, re.I | re.M)

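Inline flags

# (?i) at the start of the pattern applies to the whole pattern;
# the scoped form (?i:...) applies only inside its parentheses
re.search(r'(?i)hello', 'HELLO')  # case-insensitive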
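A short sketch of the two inline forms side by side: `(?i)` at the start of a pattern affects the whole pattern, while `(?i:...)` limits the flag to the group (sample strings made up for illustration):

```python
import re

# Whole-pattern flag: both words are matched case-insensitively
whole = re.search(r'(?i)hello world', 'HELLO WORLD')

# Scoped flag: only "hello" ignores case, "world" must match exactly
scoped_ok = re.search(r'(?i:hello) world', 'HELLO world')
scoped_fail = re.search(r'(?i:hello) world', 'HELLO WORLD')

print(whole is not None)      # True
print(scoped_ok is not None)  # True
print(scoped_fail)            # None
```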

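VI. Advanced Techniques: Zero-Width Assertions and Named Groups

1. Zero-width assertions (lookaround)

Positive lookahead (?=...): what follows must match

# Extract the word that is followed by ".com"
re.findall(r'\w+(?=\.com)', 'site: runoob.com')  # ['runoob']

Negative lookahead (?!...): what follows must not match

# Match words that do not start with "test"
re.findall(r'\b(?!test)\w+', 'test1 hello test2')  # ['hello']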
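Lookaround also works in the other direction. The examples above only show lookahead; as a complement, here is a sketch of the lookbehind forms `(?<=...)` and `(?<!...)`, which check the text before the current position (the `prices` string is made up for illustration):

```python
import re

prices = "cost: $30, weight: 45kg, discount: $5"

# Positive lookbehind: digits directly preceded by a dollar sign
dollars = re.findall(r'(?<=\$)\d+', prices)

# Negative lookbehind: digit runs NOT preceded by a dollar sign.
# The \d in the class stops a match from starting in the middle of a number.
not_dollars = re.findall(r'(?<![$\d])\d+', prices)

print(dollars)      # ['30', '5']
print(not_dollars)  # ['45']
```

In Python's `re`, a lookbehind must match text of a fixed length, so something like `(?<=\$+)` is rejected when the pattern is compiled.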

2. Named groups (better readability)

date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
match = date_pattern.search("Date: 2023-10-05")

print(match.group('year'))   # 2023
print(match.groupdict())     # {'year': '2023', 'month': '10', 'day': '05'}
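Named groups are also available in replacements. A minimal sketch, reusing the same date pattern: `\g<name>` refers to a named group in a replacement string, and a replacement function can read groups by name (the `us_style` helper is made up for illustration):

```python
import re

DATE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# Reorder the parts with \g<name> references
reordered = DATE.sub(r'\g<day>/\g<month>/\g<year>', 'Date: 2023-10-05')

# A replacement function can read named groups from the match object
def us_style(match):
    return f"{int(match['month'])}/{int(match['day'])}/{match['year']}"

converted = DATE.sub(us_style, 'Date: 2023-10-05')

print(reordered)  # Date: 05/10/2023
print(converted)  # Date: 10/5/2023
```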

VII. Common Practical Scenarios

1. Validating an email address (simplified)

def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None
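One caveat of the `^...$` form: `$` also matches just before a trailing newline, so `'alice@example.com\n'` passes. A possible stricter variant, sketched here with `re.fullmatch()` (the name `is_valid_email_strict` is made up; the pattern itself is the same simplified one and is still far from a full address validator):

```python
import re

EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def is_valid_email_strict(email):
    """True only if the ENTIRE string looks like an email address."""
    return EMAIL.fullmatch(email) is not None

# The ^...$ version lets a trailing newline through
loose_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
loose = re.match(loose_pattern, 'alice@example.com\n') is not None

print(loose)                                         # True
print(is_valid_email_strict('alice@example.com\n'))  # False
print(is_valid_email_strict('alice@example.com'))    # True
print(is_valid_email_strict('alice@@example.com'))   # False
```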

2. Extracting text from HTML tags

html = '<div class="content">Hello <b>World</b></div>'
text = re.sub(r'<[^>]+>', '', html)  # remove all tags
print(text)  # Hello World

Note: regular expressions are not suited to parsing complex HTML/XML; use a dedicated library such as BeautifulSoup instead.
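To illustrate the note without a third-party dependency, here is a rough sketch that uses the standard library's `html.parser` in place of BeautifulSoup (a substitution made for this example; `TextExtractor` and `strip_tags` are hypothetical names). It also shows one input where the regex version goes wrong: a `>` inside an attribute value.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)

plain = strip_tags('<div class="content">Hello <b>World</b></div>')
print(plain)  # Hello World

# A ">" inside an attribute value cuts the regex match short
tricky = '<a title="1 > 0">link</a>'
regex_result = re.sub(r'<[^>]+>', '', tricky)
print(repr(regex_result))  # ' 0">link'
print(repr(strip_tags(tricky)))
```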

3. Masking sensitive information

# Hide the middle four digits of a phone number
phone = "13812345678"
masked = re.sub(r'(\d{3})\d{4}(\d{4})', r'\1****\2', phone)
print(masked)  # 138****5678

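VIII. Recommendations and Pitfalls

Recommendations

Use raw strings: r'\n' rather than '\\n'
Precompile patterns that are used repeatedly
Prefer search() over match()
Add comments to complex patterns (together with re.X)
Test edge cases (empty strings, special characters)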
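As a sketch of several of these recommendations together, the date pattern below is written as a raw string, precompiled, commented with `re.X`, and then tried against a couple of edge cases (the inputs are made up for illustration):

```python
import re

# In verbose mode, whitespace in the pattern is ignored and # starts a comment
DATE = re.compile(r'''
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
''', re.X)

found = DATE.search('Date: 2023-10-05')
print(found.groupdict())  # {'year': '2023', 'month': '10', 'day': '05'}

# Edge cases: empty input and input with the wrong separator
empty = DATE.search('')
wrong_sep = DATE.search('2023/10/05')
print(empty, wrong_sep)   # None None
```

To match a literal space or `#` inside a verbose pattern, escape it or put it in a character class.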

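Pitfalls

Overusing regular expressions: for simple string operations, str methods are more efficient
Forgetting to escape: . has a special meaning in a regex; to match a literal dot, use \.

The greedy-matching trap:

# Wrong: .* runs on to the last </div>, so group(1) is 'a</div><div>b'
re.search(r'<div>(.*)</div>', '<div>a</div><div>b</div>')
# Right: use the non-greedy .*?
re.search(r'<div>(.*?)</div>', ...)

Performance: avoid catastrophic backtracking (e.g. (a+)+)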
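Two of these pitfalls can be sketched in a few lines. `re.escape()` handles escaping when a pattern is built from literal text, and a nested quantifier such as `(a+)+` can usually be flattened to `a+` without changing what it accepts (file names and test strings are made up for illustration):

```python
import re

# Escaping: an unescaped . matches any character
filename = "report.v1.txt"
unescaped = re.fullmatch(filename, "reportXv1Ytxt") is not None
escaped = re.fullmatch(re.escape(filename), "reportXv1Ytxt") is not None
exact = re.fullmatch(re.escape(filename), filename) is not None
print(unescaped, escaped, exact)  # True False True

# Backtracking: (a+)+b and a+b accept the same strings, but on a long run of
# a's with no b the nested form can try exponentially many ways to split the run.
nested = re.compile(r'(a+)+b')
flat = re.compile(r'a+b')
samples = ['ab', 'aaab', 'b', 'aaa']   # kept short on purpose
same = all(
    (nested.fullmatch(s) is None) == (flat.fullmatch(s) is None)
    for s in samples
)
print(same)  # True
```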

IX. Debugging Regular Expressions

1. Use online tools

regex101.com (recommended)
pythex.org

2. Debugging tips in Python

# Print the details of a match
match = re.search(pattern, text)
if match:
    print(f"Match: {match.group()}")
    print(f"Span: {match.span()}")
    print(f"Groups: {match.groups()}")
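The snippet above can be wrapped into a small reusable helper. This is only a sketch: `describe_match` is a made-up name, and the sample pattern and text are for illustration. The standard `re.DEBUG` flag is shown as well; it prints the parsed structure of a pattern at compile time.

```python
import re

def describe_match(pattern, text):
    """Return a dict describing the first match of `pattern` in `text`, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return {
        "matched": match.group(),
        "span": match.span(),
        "groups": match.groups(),
        "named": match.groupdict(),
    }

info = describe_match(r'(?P<key>\w+)=(\d+)', 'size: width=100')
print(info)
# {'matched': 'width=100', 'span': (6, 15), 'groups': ('width', '100'), 'named': {'key': 'width'}}

# re.DEBUG prints how the pattern was parsed (output goes to stdout)
re.compile(r'\d{2,4}', re.DEBUG)
```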

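Summary

Regular expressions are a powerful tool for text processing, but as the famous saying goes:

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

Use regular expressions where they genuinely simplify the code.

Avoid overusing them where simple string methods are enough.

"A good regular expression is not the most complex one, but the clearest, most efficient, and most maintainable one."

With the material in this article, you are equipped to tackle most everyday text-processing tasks. Now go and let regular expressions do the tedious work for you!

I hope you found this useful. Your likes and follows are the greatest support and help to me!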
