Python09_Regular Expressions

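Table of Contents

Python09_Regular Expressions
Python Regular Expressions: Common Questions and Answers (Q&A)
- Chapter 1: Basic Concepts and Getting Started
  - Q1: What is a regular expression (regex)?
  - Q2: How do you use regular expressions in Python?
  - Q3: What is the difference between `re.match()` and `re.search()`?
- Chapter 2: Core Metacharacters and Syntax
  - Q4: What are the common regex metacharacters?
  - Q5: How do quantifiers (repetition) work?
- Chapter 3: Greedy and Non-Greedy Matching
  - Q6: What are greedy and non-greedy matching?
  - Q7: When should you use non-greedy matching?
- Chapter 4: Grouping and Capturing
  - Q8: How do you group with parentheses `()`?
  - Q9: What are named groups?
  - Q10: What is a non-capturing group?
- Chapter 5: Zero-Width Assertions (Lookaround)
  - Q11: What is a zero-width assertion (lookaround)?
  - Q12: How do you match the content between HTML tags?
- Chapter 6: Common Regex Patterns
  - Q13: How do you validate an email address?
  - Q14: How do you validate a mainland China mobile number?
  - Q15: How do you match a URL?
  - Q16: How do you extract dates?
- Chapter 7: Advanced Features of the re Module
  - Q17: What is the difference between `re.findall()` and `re.finditer()`?
  - Q18: How do you replace text with `re.sub()`?
  - Q19: What are the advantages of `re.split()`?
- Chapter 8: Compiled Patterns and Performance
  - Q20: Why use `re.compile()`?
  - Q21: What are the common matching flags?
- Chapter 9: Common Mistakes and Debugging Tips
  - Q22: Why might a regex fail to match?
  - Q23: How do you debug a regex?
- Chapter 10: Practical Use Cases
  - Q24: How do you extract all email addresses from text?
  - Q25: How do you clean text data (remove extra whitespace)?
  - Q26: How do you match repeated words?
- Appendix: Quick Reference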

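Python Regular Expressions: Common Questions and Answers (Q&A)

Chapter 1: Basic Concepts and Getting Started

Q1: What is a regular expression (regex)?

A: A regular expression is a compact pattern language for matching text, and Python supports it through the built-in re module. Think of it as a smart magnifying glass that finds patterns in text rather than only fixed words. For example, a single regex can pull every email address out of a log file, with no need to scan it line by line by hand.

Q2: How do you use regular expressions in Python?

A: First, import the re module: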

import re

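The re module provides functions for matching, searching, replacing, and similar string operations. Write patterns as raw strings (with the r prefix) so that Python's own string escaping does not reinterpret the backslashes before the regex engine sees them.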
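
To make the raw-string advice concrete, here is a minimal sketch (the sample text is made up for illustration). Without the r prefix, Python turns `\b` into a backspace character before the regex engine ever sees the pattern, so the word-boundary test silently disappears.

```python
import re

text = "cat concat cats"

# In a normal string literal, "\b" is the backspace character (\x08),
# so the regex engine never sees a word-boundary token.
plain = "\bcat\b"
# In a raw string the backslashes survive and reach the regex engine intact.
raw = r"\bcat\b"

print(len(plain), len(raw))      # 5 7
print(re.findall(plain, text))   # []
print(re.findall(raw, text))     # ['cat']
```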

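Q3: What is the difference between `re.match()` and `re.search()`?

A:

| Method | What it does | Where it can match |
| --- | --- | --- |
| `re.match()` | Matches the pattern at the start of the string | Only at index 0 |
| `re.search()` | Scans the string and returns the first match | Anywhere in the string |

Example: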

import re

text = "Hello World"

# match only succeeds at the start of the string
print(re.match(r'World', text))      # None
print(re.match(r'Hello', text))      # <re.Match object; span=(0, 5), match='Hello'>

# search finds the first match anywhere in the string
print(re.search(r'World', text))     # <re.Match object; span=(6, 11), match='World'>

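Chapter 2: Core Metacharacters and Syntax

Q4: What are the common regex metacharacters?

A:

| Metacharacter | Meaning | Example |
| --- | --- | --- |
| `.` | Any single character except a newline | `r'a.c'` matches "abc", "a1c" |
| `\d` | A digit: `[0-9]` in ASCII mode (str patterns also match other Unicode digits by default) | `r'\d+'` matches "123" |
| `\w` | A word character: `[a-zA-Z0-9_]` in ASCII mode (str patterns also match Unicode letters and digits by default) | `r'\w+'` matches "hello_123" |
| `\s` | A whitespace character (space, tab, newline) | `r'\s+'` matches a run of spaces |
| `^` | Start of the string | `r'^Hello'` matches strings that start with Hello |
| `$` | End of the string | `r'world$'` matches strings that end with world |
| `[]` | Character set: any one of the listed characters | `r'[aeiou]'` matches any vowel |
| `[^]` | Negated character set: any character not listed | `r'[^0-9]'` matches any non-digit character |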
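
The table entries can be tried out directly. The following is a small sketch with made-up sample strings, one call per metacharacter, rather than a complete tour of the syntax.

```python
import re

sample = "Order 66: ship to bay_7 at 9am"

digits = re.findall(r'\d+', sample)          # runs of digits
words = re.findall(r'\w+', sample)           # runs of word characters
pieces = re.split(r'\s+', "a  b\tc\nd")      # split on any whitespace run
vowels = re.findall(r'[aeiou]', "regex")     # one character from the set
only_digits = re.sub(r'[^0-9]', '', sample)  # delete everything that is not a digit
dot = re.findall(r'a.c', "abc a1c a\nc")     # "." does not match the newline

print(digits)       # ['66', '7', '9']
print(words[:3])    # ['Order', '66', 'ship']
print(pieces)       # ['a', 'b', 'c', 'd']
print(vowels)       # ['e', 'e']
print(only_digits)  # 6679
print(dot)          # ['abc', 'a1c']
print(bool(re.search(r'^Order', sample)), bool(re.search(r'9am$', sample)))  # True True
```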

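Q5: How do quantifiers (repetition) work?

A:

| Quantifier | Meaning | Example |
| --- | --- | --- |
| `*` | Zero or more times | `r'ab*c'` matches "ac", "abc", "abbc" |
| `+` | One or more times | `r'ab+c'` matches "abc", "abbc", but not "ac" |
| `?` | Zero or one time | `r'ab?c'` matches "ac", "abc" |
| `{n}` | Exactly n times | `r'a{3}'` matches "aaa" |
| `{n,}` | At least n times | `r'a{2,}'` matches "aa", "aaa", "aaaa"... |
| `{n,m}` | Between n and m times | `r'a{2,4}'` matches "aa", "aaa", "aaaa" |

Example: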

import re

# Match an email address (simplified); [\w.]+ lets the local part contain dots
pattern = r'[\w.]+@\w+\.\w+'
text = "Contact: john.doe@example.com"
match = re.search(pattern, text)
print(match.group())  # john.doe@example.com
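
To compare the quantifiers side by side, here is a small sketch that runs each one against the same sample strings. The helper name `matching` and the samples are illustrative only; `re.fullmatch` is used so that a pattern counts only when it covers the whole string.

```python
import re

samples = ["ac", "abc", "abbc", "abbbbc"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r'ab*c'))      # ['ac', 'abc', 'abbc', 'abbbbc']
print(matching(r'ab+c'))      # ['abc', 'abbc', 'abbbbc']
print(matching(r'ab?c'))      # ['ac', 'abc']
print(matching(r'ab{2}c'))    # ['abbc']
print(matching(r'ab{2,}c'))   # ['abbc', 'abbbbc']
print(matching(r'ab{1,2}c'))  # ['abc', 'abbc']
```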

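Chapter 3: Greedy and Non-Greedy Matching

Q6: What are greedy and non-greedy matching?

A:

- Greedy matching (the default): the quantifiers *, +, ?, and {m,n} match as many characters as possible.
- Non-greedy (lazy) matching: adding ? after a quantifier, as in *?, +?, ??, and {m,n}?, makes it match as few characters as possible.

Side-by-side example: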

import re

text = "<div>Hello</div><div>World</div>"

# Greedy: .* runs to the last </div>, so both elements land in one match
greedy = re.findall(r'<div>.*</div>', text)
print("greedy:", greedy)
# Output: greedy: ['<div>Hello</div><div>World</div>']

# Non-greedy: .*? stops at the first </div>, so each element matches separately
non_greedy = re.findall(r'<div>.*?</div>', text)
print("non-greedy:", non_greedy)
# Output: non-greedy: ['<div>Hello</div>', '<div>World</div>']

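Q7: When should you use non-greedy matching?

A: Use it when you need tight control over where a match ends and want to avoid over-matching. Common cases include:

- Extracting the content inside HTML/XML tags
- Matching strings inside quotation marks
- Handling text with repeated or nested delimiters, where a greedy match would run to the last one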
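
The quoted-string case can be sketched as follows (the sample line is made up). A negated character class is shown as a common alternative to the lazy quantifier; both stop at the first closing quote.

```python
import re

line = 'name="Alice" role="admin"'

greedy = re.findall(r'"(.*)"', line)       # runs to the last quote on the line
lazy = re.findall(r'"(.*?)"', line)        # stops at the first closing quote
negated = re.findall(r'"([^"]*)"', line)   # same result without a lazy quantifier

print(greedy)   # ['Alice" role="admin']
print(lazy)     # ['Alice', 'admin']
print(negated)  # ['Alice', 'admin']
```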

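Chapter 4: Grouping and Capturing

Q8: How do you group with parentheses `()`?

A: Groups serve three purposes:

- Extracting substrings: capture part of the match
- Applying quantifiers: repeat a whole group of characters
- Backreferences: refer to an earlier group later in the pattern

Example: extracting information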

import re

text = "John Doe: john.doe@example.com"
pattern = r'(\w+) (\w+): (\S+)'

match = re.match(pattern, text)
if match:
    first_name = match.group(1)  # John
    last_name = match.group(2)    # Doe
    email = match.group(3)        # john.doe@example.com
    print(f"Name: {first_name} {last_name}, email: {email}")
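
The other two uses of groups, quantifiers and backreferences, can be sketched in the same way. The tag-matching pattern below is an illustration on made-up input, not a general HTML matcher.

```python
import re

# Quantifier applied to a group: "(ab)+" repeats "ab" as a unit,
# whereas "ab+" repeats only the "b".
group_repeat = re.fullmatch(r'(ab)+', 'ababab') is not None
char_repeat = re.fullmatch(r'ab+', 'ababab') is not None
print(group_repeat, char_repeat)  # True False

# Backreference: \1 must repeat whatever group 1 captured,
# so the closing tag has to match the opening tag.
tag = re.compile(r'<(\w+)>(.*?)</\1>')
pairs = tag.findall('<b>bold</b> <i>italic</i> <b>oops</i>')
print(pairs)  # [('b', 'bold'), ('i', 'italic')]
```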

Q9: What are named groups?

A: The (?P<name>...) syntax gives a group a name, which makes the code easier to read:

import re

pattern = r'(?P<username>\w+)(?:\.(?P<middle>\w+))?@(?P<domain>[A-Za-z0-9.-]+)\.[A-Za-z]{2,}'
email = 'john.doe@example.com'

match = re.match(pattern, email)
if match:
    print(match.group('username'))  # john
    print(match.group('domain'))    # example
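
Named groups also combine well with `match.groupdict()`, which returns every named group in one dictionary. A minimal sketch, assuming a hypothetical log-line format invented for this example:

```python
import re

# Hypothetical log format, used only for this illustration.
log_line = "2024-03-15 ERROR disk full"
log_pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.+)'

log_match = re.match(log_pattern, log_line)
fields = log_match.groupdict() if log_match else {}
print(fields)
# {'date': '2024-03-15', 'level': 'ERROR', 'message': 'disk full'}

# Named groups keep their positional numbers as well.
if log_match:
    print(log_match.group('level'), log_match.group(2))  # ERROR ERROR
```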

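Q10: What is a non-capturing group?

A: (?:...) creates a non-capturing group: it groups part of the pattern without saving the matched text. This keeps group numbering and findall() results uncluttered and saves a little bookkeeping: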

import re

# Capturing groups: the matched text is saved
pattern1 = r'(\d+)-(\d+)'
# Non-capturing group: used only for grouping, nothing is saved
pattern2 = r'\d+(?:-\d+)+'

text = "123-456-789"
print(re.findall(pattern1, text))  # [('123', '456')]
print(re.findall(pattern2, text))  # ['123-456-789']

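Chapter 5: Zero-Width Assertions (Lookaround)

Q11: What is a zero-width assertion (lookaround)?

A: A zero-width assertion checks what comes before or after the current position without consuming any characters, so the asserted text is not part of the match. There are four kinds:

| Assertion | Syntax | Meaning |
| --- | --- | --- |
| Positive lookahead | `(?=...)` | Followed by ... |
| Negative lookahead | `(?!...)` | Not followed by ... |
| Positive lookbehind | `(?<=...)` | Preceded by ... |
| Negative lookbehind | `(?<!...)` | Not preceded by ... |

Example: extracting prices without the currency symbol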

import re

text = "The price is $100.50 and the discount is $20"
# Match digits that are preceded by a $ sign (the $ itself is not part of the match)
pattern = r'(?<=\$)\d+\.?\d*'

prices = re.findall(pattern, text)
print(prices)  # ['100.50', '20']
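
All four assertion types can be exercised on small inputs. This is a sketch on made-up strings; note the extra `\d` alternatives in the negative assertions, which stop the engine from backtracking into a shorter run of digits and matching that instead.

```python
import re

amounts = "100USD 200EUR 300USD"
# Positive lookahead: digits followed by "USD" ("USD" is not part of the match)
usd = re.findall(r'\d+(?=USD)', amounts)
# Negative lookahead: a whole number not followed by "USD"
not_usd = re.findall(r'\b\d+(?!\d|USD)', amounts)

prices = "$100 200 $300"
# Positive lookbehind: digits preceded by "$"
with_dollar = re.findall(r'(?<=\$)\d+', prices)
# Negative lookbehind: a whole number not preceded by "$"
without_dollar = re.findall(r'(?<![$\d])\d+', prices)

print(usd)             # ['100', '300']
print(not_usd)         # ['200']
print(with_dollar)     # ['100', '300']
print(without_dollar)  # ['200']
```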

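Q12: How do you match the content between HTML tags?

A: A lookbehind is not an option here: Python's re module requires lookbehind patterns to have a fixed width, so (?<=<div[^>]*>) raises re.error. Match the tags normally and capture the content between them with a group instead:

import re

html = '<div class="content">Hello <em>world</em></div>'
# Capture everything between the opening and closing div tags (inner tags included)
pattern = r'<div[^>]*>\s*(.*?)\s*</div>'

match = re.search(pattern, html)
if match:
    print(match.group(1))  # Hello <em>world</em>

Note: For complex HTML parsing, use a dedicated library such as BeautifulSoup; regular expressions have limited ability to handle nested structures.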
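
As a dependency-free illustration of the parser-based approach recommended in the note above, the sketch below uses the standard library's `html.parser` in place of BeautifulSoup. The class name `DivTextCollector` is hypothetical, and the logic assumes well-formed input.

```python
from html.parser import HTMLParser

class DivTextCollector(HTMLParser):
    """Collect the text found inside <div> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div> elements are currently open
        self.chunks = []    # text fragments seen while inside a <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

collector = DivTextCollector()
collector.feed('<div class="content">Hello <em>world</em> and <div>nested</div> text</div>')
collector.close()
div_text = "".join(collector.chunks)
print(div_text)  # Hello world and nested text
```

Because the parser tracks how many div elements are open, the nested div does not end the collection early, which is the case a single regex handles poorly.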

Chapter 6: Common Regex Patterns

Q13: How do you validate an email address?

A:

import re

def validate_email(email):
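    # A pragmatic pattern for common addresses, not a full standards-compliant validator
    pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+'
    # fullmatch requires the whole string to match, so a trailing newline is rejected too
    return re.fullmatch(pattern, email) is not None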

# Tests
print(validate_email("test@example.com"))   # True
print(validate_email("invalid_email@"))    # False

Q14: How do you validate a mainland China mobile number?

A:

import re

def validate_phone(phone):
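    # Mainland China mobile number: starts with 1, second digit 3-9, 11 digits in total
    pattern = r'1[3-9]\d{9}'
    # fullmatch requires the whole string to match, so a trailing newline is rejected too
    return re.fullmatch(pattern, phone) is not None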

print(validate_phone("13812345678"))  # True
print(validate_phone("1381234567"))   # False

Q15: How do you match a URL?

A:

import re

# Simplified URL matching: scheme and host only, no path or query string
url_pattern = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'

text = "Visit https://www.example.com or http://site.org"
urls = re.findall(url_pattern, text)
print(urls)  # ['https://www.example.com', 'http://site.org']

Q16: How do you extract dates?

A:

import re

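# Match dates in YYYY-MM-DD, YYYY/MM/DD, or YYYY.MM.DD form
# The whole year is one capturing group; (?:19|20) is non-capturing so findall returns the full year
date_pattern = r'\b((?:19|20)\d\d)[-/.](0[1-9]|1[0-2])[-/.](0[1-9]|[12]\d|3[01])\b'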

text = "Meeting on 2024-03-15 and deadline is 2025/12/31"
dates = re.findall(date_pattern, text)
print(dates)  # [('2024', '03', '15'), ('2025', '12', '31')]

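Chapter 7: Advanced Features of the re Module

Q17: What is the difference between `re.findall()` and `re.finditer()`?

A:

- re.findall(): returns a list of all matches
- re.finditer(): returns an iterator of match objects, which suits large inputs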

import re

text = "The rain in Spain stays mainly in the plain"
pattern = r'ain'

# findall returns a list
print(re.findall(pattern, text))  # ['ain', 'ain', 'ain', 'ain']

# finditer returns an iterator of match objects, which carry position information
for match in re.finditer(pattern, text):
    print(f"Found '{match.group()}' at position {match.start()}-{match.end()}")
# Found 'ain' at position 5-8
# Found 'ain' at position 14-17
# Found 'ain' at position 25-28
# Found 'ain' at position 40-43
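
One further difference shows up once the pattern contains capturing groups: `re.findall()` then returns the groups rather than the whole match, while `re.finditer()` always yields full match objects. A short sketch on a made-up string:

```python
import re

text = "width=640 height=480"

no_groups = re.findall(r'\w+=\d+', text)       # no groups: whole matches
one_group = re.findall(r'\w+=(\d+)', text)     # one group: just that group
two_groups = re.findall(r'(\w+)=(\d+)', text)  # several groups: tuples
whole = [m.group(0) for m in re.finditer(r'(\w+)=(\d+)', text)]

print(no_groups)   # ['width=640', 'height=480']
print(one_group)   # ['640', '480']
print(two_groups)  # [('width', '640'), ('height', '480')]
print(whole)       # ['width=640', 'height=480']
```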

Q18: How do you replace text with `re.sub()`?

A: re.sub() replaces every match in a string, and the replacement can refer to captured groups:

import re

text = "The price is $100.50 and the discount is $20"

# Replace $ with ¥ and keep the number
# \1 refers to the first capturing group
formatted = re.sub(r'\$(\d+\.?\d*)', r'¥\1', text)
print(formatted)  # The price is ¥100.50 and the discount is ¥20
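
The replacement argument of `re.sub()` may also be a function that receives each match object and returns the replacement text, which helps when the new text has to be computed. A minimal sketch; the helper name `double_price` is made up for this example.

```python
import re

text = "The price is $100.50 and the discount is $20"

def double_price(match):
    """Return the matched dollar amount doubled, formatted with two decimals."""
    amount = float(match.group(1))
    return f"${amount * 2:.2f}"

doubled = re.sub(r'\$(\d+\.?\d*)', double_price, text)
print(doubled)
# The price is $201.00 and the discount is $40.00

# count limits how many replacements are made
first_only = re.sub(r'\$', 'USD ', text, count=1)
print(first_only)
# The price is USD 100.50 and the discount is $20
```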

Q19: What are the advantages of `re.split()`?

A: The separator is itself a regex, so one call can split on several different delimiters:

import re

text = "apple, orange; banana grape"

# Split on commas, semicolons, or whitespace
fruits = re.split(r'[;,\s]\s*', text)
print(fruits)  # ['apple', 'orange', 'banana', 'grape']
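
Two further behaviors of `re.split()` are worth knowing: `maxsplit` caps the number of splits, and a capturing group in the separator keeps the delimiters in the result. A sketch with made-up inputs:

```python
import re

# maxsplit=2: split at most twice, the remainder stays in the last item
limited = re.split(r'\s*[;,]\s*', "a, b; c, d", maxsplit=2)
# A capturing group in the separator keeps the delimiters in the output
tokens = re.split(r'\s*([+-])\s*', "10 + 20 - 5")
# Separators at the edges produce empty strings
edges = re.split(r',', ",a,b,")

print(limited)  # ['a', 'b', 'c, d']
print(tokens)   # ['10', '+', '20', '-', '5']
print(edges)    # ['', 'a', 'b', '']
```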

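Chapter 8: Compiled Patterns and Performance

Q20: Why use `re.compile()`?

A: When the same pattern is used many times, compile it once and reuse the pattern object. The module-level functions cache compiled patterns internally, so the gain is mostly the skipped cache lookup on each call, plus a named, reusable pattern object:

import re

# Hypothetical sample data standing in for a large list of strings
large_text_list = ["order 12, item 7", "no digits here", "2024 and 2025"]

# Works, but every call looks the pattern up in re's internal cache
for text in large_text_list:
    re.findall(r'\d+', text)

# Preferred in hot loops: compile once, then reuse the pattern object
pattern = re.compile(r'\d+')
for text in large_text_list:
    pattern.findall(text)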

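Q21: What are the common matching flags?

A:

| Flag | Short form | Effect |
| --- | --- | --- |
| `re.IGNORECASE` | `re.I` | Case-insensitive matching |
| `re.MULTILINE` | `re.M` | Multi-line mode: `^` and `$` match at the start and end of each line |
| `re.DOTALL` | `re.S` | `.` matches every character, including newlines |
| `re.VERBOSE` | `re.X` | Lets a pattern span several lines; whitespace and comments inside it are ignored |

Example: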

import re

# Ignore case
print(re.match(r'hello', 'HELLO', flags=re.IGNORECASE))  # <re.Match object; span=(0, 5), match='HELLO'>

# Multi-line mode
text = "First line\nSecond line"
print(re.findall(r'^Second', text, flags=re.MULTILINE))  # ['Second']
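
The remaining two flags from the table can be sketched as follows. The verbose pattern is an annotated form of an 11-digit mobile-number check like the one in Q14; the HTML-like snippet is made up. Flags can be combined with `|`.

```python
import re

# re.VERBOSE: whitespace and "#" comments inside the pattern are ignored
phone = re.compile(r"""
    ^1        # mobile numbers start with 1
    [3-9]     # second digit is 3 to 9
    \d{9}     # nine more digits, 11 in total
    $
""", re.VERBOSE)
ok = bool(phone.match("13812345678"))
bad = bool(phone.match("12812345678"))
print(ok, bad)  # True False

# re.DOTALL: "." also matches the newline
block = "<p>line one\nline two</p>"
without_dotall = re.findall(r'<p>(.*)</p>', block)
with_dotall = re.findall(r'<p>(.*)</p>', block, flags=re.DOTALL)
print(without_dotall)  # []
print(with_dotall)     # ['line one\nline two']

# Combining flags: case-insensitive and multi-line at once
combined = re.findall(r'^line \w+', "Line one\nLINE two", flags=re.I | re.M)
print(combined)  # ['Line one', 'LINE two']
```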

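Chapter 9: Common Mistakes and Debugging Tips

Q22: Why might a regex fail to match?

A: Common mistakes include:

- Forgetting to escape special characters: . * + ? ( ) [ ] { } ^ $ \ | all have special meaning
- Confusing match and search: match only succeeds at the start of the string
- Over-matching with greedy quantifiers: try the non-greedy forms *? and +?
- Reversed character ranges: [z-a] is an error; write [a-z]
- Forgetting raw strings: without r'...', Python processes the backslash escapes itself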
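
Two of these mistakes are easy to reproduce. The sketch below, on made-up text, uses `re.escape()` to match literal text that contains special characters and catches the `re.error` raised by a reversed character range.

```python
import re

price_list = "a+b costs 3.50, a-b costs 3150"

# Unescaped, "." matches any character, so "3150" matches as well
loose = re.findall(r'3.50', price_list)
# re.escape() backslash-escapes the special characters in a literal string
strict = re.findall(re.escape('3.50'), price_list)
literal_plus = re.search(re.escape('a+b'), price_list).group()

print(loose)         # ['3.50', '3150']
print(strict)        # ['3.50']
print(literal_plus)  # a+b

# A reversed range is rejected when the pattern is compiled
range_error = None
try:
    re.compile(r'[z-a]')
except re.error as exc:
    range_error = exc
    print("invalid pattern:", exc)
```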

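Q23: How do you debug a regex?

A:

- Use an online tester such as regex101.com or pythex.org
- Test step by step: start with a simple pattern and add complexity gradually
- Print the match object: inspect match.group(), match.start(), and match.end()
- Use re.DEBUG to see how the pattern is parsed and compiled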

import re

# Debug mode
pattern = re.compile(r'\d+', re.DEBUG)
# Prints the parsed structure of the pattern at compile time

Chapter 10: Practical Use Cases

Q24: How do you extract all email addresses from text?

A:

import re

text = "Contact us at contact@example.com or support@site.org"
emails = re.findall(r'[\w.-]+@[\w.-]+', text)
print(emails)  # ['contact@example.com', 'support@site.org']

Q25: How do you clean text data (remove extra whitespace)?

A:

import re

text = "  Hello   World  This is   Python  "
# Collapse each run of whitespace into a single space, then strip both ends
clean = re.sub(r'\s+', ' ', text).strip()
print(clean)  # Hello World This is Python

Q26: How do you match repeated words?

A:

import re

text = "the cat in the the hat"
# \b is a word boundary; \1 refers back to the first group
pattern = r'\b(\w+)\s+\1\b'
matches = re.findall(pattern, text)
print(matches)  # ['the']
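
Once repeated words can be found, removing them is a short step. This sketch extends the same backreference idea with `re.sub()`; the sample sentences are made up, and the pattern treats any run of word characters as a word.

```python
import re

text = "the cat in the the hat is is here"

# (\s+\1\b)+ swallows one or more repeats of the word captured in group 1;
# the replacement keeps only the first occurrence.
deduped = re.sub(r'\b(\w+)(\s+\1\b)+', r'\1', text)
print(deduped)  # the cat in the hat is here

# With re.IGNORECASE the backreference also matches a differently cased repeat
deduped_ci = re.sub(r'\b(\w+)(\s+\1\b)+', r'\1', "The the end", flags=re.IGNORECASE)
print(deduped_ci)  # The end
```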

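Appendix: Quick Reference

| Task | Code example |
| --- | --- |
| Import the module | `import re` |
| Match at the start | `re.match(pattern, string)` |
| Search anywhere | `re.search(pattern, string)` |
| Find all matches | `re.findall(pattern, string)` |
| Iterate over matches | `re.finditer(pattern, string)` |
| Replace | `re.sub(pattern, repl, string)` |
| Split | `re.split(pattern, string)` |
| Compile a pattern | `re.compile(pattern)` |
| Get the matched text | `match.group()` |
| Get groups | `match.group(1)`, `match.groups()` |
| Get positions | `match.start()`, `match.end()`, `match.span()` |

This guide covers the core of Python regular expressions. Work through the chapters in order and practice on real projects. For complex text processing, remember that regular expressions are powerful but not universal: for tasks such as HTML/XML parsing, use a dedicated library such as BeautifulSoup.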
