Regular Expressions and Their Use in Python

I. Python regular expression basics

1. Importing the module

Python provides regular expressions through the re module in the standard library. Import it before use:

```python
import re
```

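2. Core syntax

Ordinary characters: match themselves literally (for example, a matches the character a).
Metacharacters:
- \d: matches a digit. For str patterns this covers any Unicode decimal digit; it is equivalent to [0-9] only when the re.ASCII flag is set.
- \w: matches a word character: letters, digits, and the underscore. For str patterns this includes Unicode letters such as Chinese characters; it is equivalent to [a-zA-Z0-9_] only under re.ASCII.
- \s: matches a whitespace character (space, tab, newline, and so on).
- ^ and $: match the start and end of the string, respectively. $ also matches just before a trailing newline, and with re.MULTILINE both match at the start and end of every line.
- \b: a word boundary (for example, \bpython\b matches python as a standalone word).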
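
The metacharacters listed above can be tried out with a short sketch like the following. The sample strings and variable names are made up for illustration and are not part of the original text.

```python
import re

sample = "Order 42: python rocks, pythonic code\n"

# \d+ collects runs of digits
digits = re.findall(r"\d+", sample)

# \w is Unicode-aware for str patterns unless re.ASCII is passed
unicode_words = re.findall(r"\w+", "数据_1 abc")
ascii_words = re.findall(r"\w+", "数据_1 abc", re.ASCII)

# \b keeps the match from firing inside "pythonic"
standalone = re.findall(r"\bpython\b", sample)

# ^ anchors at the start; $ also matches just before the trailing newline
starts_with_order = re.search(r"^Order", sample) is not None
ends_with_code = re.search(r"code$", sample) is not None

print(digits, unicode_words, ascii_words, standalone)
print(starts_with_order, ends_with_code)
```

Comparing unicode_words with ascii_words shows why the [a-zA-Z0-9_] equivalence holds only in ASCII mode.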

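3. Quantifiers

*: matches 0 or more times (for example, a* matches the empty string or any run of a characters).
+: matches 1 or more times (for example, a+ needs at least one a).
?: matches 0 or 1 time (for example, a? makes the a optional).
{m,n}: matches between m and n times (for example, \d{3,5} matches 3 to 5 digits).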
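
As a rough illustration of the four quantifiers, the sketch below checks a few strings with re.fullmatch(), which succeeds only when the pattern covers the entire string. The helper name fully_matches and the test strings are assumptions chosen for this example.

```python
import re

def fully_matches(pattern, s):
    """Return True when the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None

star_empty = fully_matches(r"a*", "")        # * accepts zero repetitions
plus_empty = fully_matches(r"a+", "")        # + needs at least one
optional_u = [fully_matches(r"colou?r", w) for w in ("color", "colour", "colouur")]

# {3,5} is greedy: it takes up to five digits from a longer run,
# and the two digits left over are too short to match again
digit_runs = re.findall(r"\d{3,5}", "12 123 1234567")

print(star_empty, plus_empty, optional_u, digit_runs)
```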

II. Common functions and usage examples

1. Matching functions

re.match(): matches only at the start of the string and returns a Match object, or None if the pattern does not match there.
```python
import re

text = "Hello, World!"
match = re.match(r"Hello", text)
if match:
    print(match.group())  # prints "Hello"
```
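re.search(): scans the whole string and returns the first match, or None if there is none.
```python
match = re.search(r"World", text)
if match:  # search() returns None when nothing matches
    print(match.group())  # prints "World"
```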
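re.findall(): returns a list of all non-overlapping matches. If the pattern contains capturing groups, it returns the group contents instead.
```python
matches = re.findall(r"\w+", text)
print(matches)  # prints ['Hello', 'World']
```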
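
The difference between these functions is easiest to see side by side. The sketch below is an illustration only; it also uses re.fullmatch() and re.finditer(), two standard re functions that the text above does not cover.

```python
import re

text = "Hello, World!"

at_start = re.match(r"World", text)          # None: "World" is not at position 0
anywhere = re.search(r"World", text)         # Match object for the first hit
whole = re.fullmatch(r"Hello, \w+!", text)   # matches only if the entire string fits

# finditer() yields Match objects, so positions are available as well
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

print(at_start, anywhere.group(), whole is not None)
print(spans)
```

When positions or groups are needed for every match, finditer() is usually more convenient than findall().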

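2. Substitution and splitting

re.sub(): replaces every match of the pattern and returns the new string.

```python
new_text = re.sub(r"World", "Python", text)
print(new_text)  # prints "Hello, Python!"
```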

re.split(): splits the string wherever the pattern matches.
```python
parts = re.split(r",", text)
print(parts)  # prints ['Hello', ' World!']
```
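
Both functions accept a few options that go beyond a plain string replacement or split. The following sketch uses made-up input strings to show the count and maxsplit parameters, a function as the replacement, and a pattern that splits on several delimiters.

```python
import re

# Split on commas or semicolons, swallowing surrounding whitespace
fields = re.split(r"\s*[,;]\s*", "a, b;c ,d")

# maxsplit limits the number of splits
head_and_rest = re.split(r",", "a,b,c", maxsplit=1)

# count limits the number of replacements
first_only = re.sub(r"\d", "#", "a1b2c3", count=1)

# The replacement may be a function that receives each Match object
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

print(fields, head_and_rest, first_only, doubled)
```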

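3. Compiling regular expressions

re.compile() builds a reusable pattern object. The re module already caches recently used patterns, so the speed gain is usually modest. The main benefits are a named, reusable pattern and no cache lookup on each call inside tight loops:

```python
pattern = re.compile(r"\b\w{3}\b")  # matches words of exactly three characters
matches = pattern.findall("The quick brown fox")
print(matches)  # prints ['The', 'fox']
```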

III. Groups and capturing

Parentheses () create capturing groups that extract specific parts of a match:

```python
text = "apple, banana, cherry"
match = re.match(r"(\w+), (\w+), (\w+)", text)
print(match.groups())  # prints ('apple', 'banana', 'cherry')
```
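
Groups can also be read one at a time, or given names with the (?P<name>...) syntax. The sketch below is an illustrative extension of the example above; the group names and sample strings are assumptions.

```python
import re

match = re.search(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    "Released on 2023-04-02.",
)

whole = match.group(0)         # group 0 is the entire match
year = match.group(1)          # groups are numbered from 1
month = match.group("month")   # named groups can be read by name
parts = match.groupdict()      # all named groups as a dict

# With more than one group, findall() returns tuples of group contents
pairs = re.findall(r"(\w+)=(\d+)", "x=1, y=22")

print(whole, year, month, parts, pairs)
```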

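Non-greedy matching

Quantifiers are greedy by default and match as much as possible. Appending ? to a quantifier (*?, +?, ??, {m,n}?) makes it match as little as possible:

```python
text = "2023-04-02T10:11:12Z"
greedy = re.search(r".+-", text).group()  # "2023-04-": runs to the last hyphen
lazy = re.search(r".+?-", text).group()   # "2023-": stops at the first hyphen
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()  # "2023-04-02": fixed widths need no ?
```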

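IV. Practical applications

1. Data validation

Mobile number validation (mainland China): ^1[3-9]\d{9}$ (starts with 1, second digit 3-9, followed by nine more digits, 11 digits in total).
Email extraction: ([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}). This is a pragmatic approximation rather than a complete check of email address syntax.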
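
One way to wrap the two patterns above in helper functions is sketched below. The names is_mobile and extract_emails are hypothetical. The sketch uses fullmatch() instead of ^...$ so that a trailing newline is rejected, and re.ASCII so that \d accepts only 0-9. Both choices are assumptions about what a validator should do, not something the original text specifies.

```python
import re

# re.ASCII restricts \d to 0-9; fullmatch() replaces the ^ and $ anchors
MOBILE_RE = re.compile(r"1[3-9]\d{9}", re.ASCII)
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_mobile(number):
    """Return True if number is an 11-digit mobile number of the form 1[3-9]xxxxxxxxx."""
    return MOBILE_RE.fullmatch(number) is not None

def extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)

print(is_mobile("13812345678"), is_mobile("12812345678"))
print(extract_emails("Contact alice@example.com or bob.smith@mail.example.org."))
```

With re.match(r"^1[3-9]\d{9}$", ...) the string "13812345678\n" would be accepted, because $ also matches just before a trailing newline; fullmatch() avoids that.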

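2. Text processing

Date formatting: rewrite 20230209 as 2023.02.09 by referring back to the three captured groups with \1, \2, and \3:

```python
text = "管理办法20230209(修订).docx"  # a Chinese file name containing an 8-digit date
new_text = re.sub(r"(\d{4})(\d{2})(\d{2})", r"\1.\2.\3", text)
print(new_text)  # prints "管理办法2023.02.09(修订).docx"
```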

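3. Web scraping

Extracting links. A regex is adequate for quick extraction from simple pages; for irregular markup, an HTML parser such as BeautifulSoup is more reliable.

```python
import re
import requests  # third-party package: pip install requests

url = "https://example.com"
html = requests.get(url, timeout=10).text
# Capture the value of every quoted href attribute
links = re.findall(r'href\s*=\s*["\']([^"\']+)["\']', html)
```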
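
The same pattern can be exercised without a network request. The sketch below runs it on a small hand-written HTML string and resolves relative links with urllib.parse.urljoin. The function name extract_links, the sample markup, and the use of re.IGNORECASE are assumptions made for this example.

```python
import re
from urllib.parse import urljoin

# IGNORECASE lets the pattern accept HREF as well as href
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(html, base_url):
    """Return absolute URLs for every quoted href value found in html."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]

sample_html = """
<a href="/docs">Docs</a>
<a HREF = 'https://example.org/page'>External</a>
<a>No link here</a>
"""

links = extract_links(sample_html, "https://example.com")
print(links)
```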

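V. Optimization tips

Precompile regular expressions: use re.compile() for patterns that are reused many times. This avoids repeated cache lookups and keeps the pattern in one place.
Avoid overusing .*: prefer precise patterns (for example, \d{4} instead of .*), which reduces backtracking and accidental over-matching.
Ignore case: pass the re.IGNORECASE flag (for example, re.findall(r"python", text, re.I)).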
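
The three tips can be combined in one small sketch. The log line and variable names below are invented for illustration.

```python
import re

log = "ERROR 2023 disk full; error 2024 Disk Full"

# Compiled once with the flag attached, then reused
YEAR_RE = re.compile(r"error (\d{4})", re.IGNORECASE)
years = YEAR_RE.findall(log)

# .* swallows the rest of the line; a precise pattern stops where intended
loose = re.search(r"error .*", log, re.I).group()
precise = re.search(r"error \d{4}", log, re.I).group()

print(years)
print(loose)
print(precise)
```

The loose pattern returns the whole line, while the precise one returns only the first error code, which is usually what the caller wants.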

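VI. Summary

Python's re module covers matching, substitution, splitting, and grouping. Combined with precompiled patterns and the tips above, it handles most text-processing tasks efficiently. For complex expressions, an online tool such as RegExr is useful for debugging during development.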