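Python Regular Expressions (RegEx)

Python regular expressions (RegEx for short) are a powerful tool for pattern matching and string manipulation. With a predefined pattern you can quickly search, match, and transform text. Whether you are validating user input, parsing data, or extracting information from large text files, regular expressions are a skill worth having.

This article covers the basics of Python regular expressions, how they work, and practical examples, so that you can apply them in real-world code.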

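The Regular Expression Module in Python (`re`)

Python's re module provides a set of functions for working with regular expressions. It lets you search, match, and manipulate text using specific patterns. The main concepts and features of the re module are as follows:

1. Importing the module

Before using any regular expression function, import the re module:

import re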

2. Basic functions

search(): Searches the string for a match and returns a match object if one is found.
```
match = re.search(r'\d+', 'There are 123 apples')
print(match.group())  # Output: 123
```

match(): Checks whether the beginning of the string matches the pattern.

match = re.match(r'Hello', 'Hello, world!')
print(match.group())  # Output: Hello

findall(): Finds all matches in the string and returns them as a list.

matches = re.findall(r'\d+', '123 apples and 456 oranges')
print(matches)  # Output: ['123', '456']

sub(): Replaces matches with the specified string.

result = re.sub(r'apples', 'bananas', 'I like apples')
print(result)  # Output: I like bananas

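3. Special characters

. (dot): Matches any single character except a newline.
^ (caret): Matches the start of the string.
$ (dollar sign): Matches the end of the string, or the position just before a trailing newline.
[] (square brackets): Matches any one of the characters inside the brackets.
\ (backslash): Escapes a special character or introduces a special sequence.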

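4. Special sequences

\d: Matches any digit.
\D: Matches any non-digit character.
\s: Matches any whitespace character (such as a space, tab, or newline).
\S: Matches any non-whitespace character.
\w: Matches any word character (letters, digits, and the underscore).
\W: Matches any non-word character.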

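5. Quantifiers

*: Matches the preceding pattern zero or more times.
+: Matches the preceding pattern one or more times.
?: Matches the preceding pattern zero or one time.
{n}: Matches the preceding pattern exactly n times.
{n,}: Matches the preceding pattern at least n times.
{n,m}: Matches the preceding pattern at least n and at most m times.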
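
The counted quantifiers are easiest to tell apart side by side. The following is a minimal sketch; the sample string and variable names are made up for illustration, and the \b word boundaries are added so that each run of letters is tested as a whole.

```python
import re

text = "a aa aaa aaaa"

# {n}: a run of exactly two 'a' characters
exactly_two = re.findall(r'\ba{2}\b', text)

# {n,}: a run of at least three 'a' characters
at_least_three = re.findall(r'\ba{3,}\b', text)

# {n,m}: a run of two to three 'a' characters
two_to_three = re.findall(r'\ba{2,3}\b', text)

print(exactly_two)     # ['aa']
print(at_least_three)  # ['aaa', 'aaaa']
print(two_to_three)    # ['aa', 'aaa']
```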

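6. Compiling patterns

When a pattern is used many times, you can compile it once with re.compile() and reuse the resulting pattern object. The re module also caches recently used patterns internally, so the speed gain is usually modest, but a compiled object keeps the pattern in one named place.

pattern = re.compile(r'\d+')
matches = pattern.findall('123 apples and 456 oranges')
print(matches)  # Output: ['123', '456']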

Practical Examples

Validating user input

def validate_email(email):
    pattern = re.compile(r'^[\w\.-]+@[\w\.-]+\.\w+$')
    return bool(pattern.match(email))

print(validate_email('test@example.com'))  # Output: True
print(validate_email('invalid-email'))  # Output: False

Parsing a log file

log_line = '2023-10-01 12:34:56 INFO User logged in'

pattern = re.compile(r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (.+)')

match = pattern.search(log_line)
if match:
    date, time, message = match.groups()
    print(f'Date: {date}, Time: {time}, Message: {message}')
    # Output: Date: 2023-10-01, Time: 12:34:56, Message: INFO User logged in
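
As the number of groups grows, positional unpacking becomes fragile. One possible variation uses named groups so that each field can be read by name. The group names below, including the separate `level` field, are assumptions made for this sketch and are not part of the original pattern.

```python
import re

log_line = '2023-10-01 12:34:56 INFO User logged in'

# (?P<name>...) gives each captured field a name
pattern = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2}) '
    r'(?P<time>\d{2}:\d{2}:\d{2}) '
    r'(?P<level>[A-Z]+) '
    r'(?P<message>.+)'
)

match = pattern.search(log_line)

# groupdict() returns the named groups as a dictionary
fields = match.groupdict() if match else {}
print(fields['level'])    # INFO
print(fields['message'])  # User logged in
```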

Redacting sensitive information

text = 'My credit card number is 1234-5678-9012-3456'

pattern = re.compile(r'\d{4}-\d{4}-\d{4}-\d{4}')
redacted_text = pattern.sub('XXXX-XXXX-XXXX-XXXX', text)

print(redacted_text)  # Output: My credit card number is XXXX-XXXX-XXXX-XXXX
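
A common variation keeps the last four digits visible. The sketch below assumes the same card format as above and uses a capture group with a backreference in the replacement string. The name `masked_text` is chosen for illustration.

```python
import re

text = 'My credit card number is 1234-5678-9012-3456'

# Capture only the final block of four digits so it can be reused
pattern = re.compile(r'\d{4}-\d{4}-\d{4}-(\d{4})')

# \1 in the replacement refers to the first captured group
masked_text = pattern.sub(r'XXXX-XXXX-XXXX-\1', text)

print(masked_text)  # My credit card number is XXXX-XXXX-XXXX-3456
```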

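How to Use Regular Expressions (RegEx) in Python

To search, match, and edit strings in Python, import the re module and call its functions with a regular expression pattern. The following steps and examples show how.

1. Import the `re` module

First, import the re module:

import re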

2. Use the `search()` function

The search() function scans the string for a match and returns a match object if one is found.

import re

text = "The price is 123 dollars"
match = re.search(r'\d+', text)

if match:
    print("Found a match:", match.group())  # Output: Found a match: 123

3. Use the `match()` function

The match() function checks whether the beginning of the string matches the pattern.

import re

text = "Hello, world!"
match = re.match(r'Hello', text)

if match:
    print("Found a match:", match.group())  # Output: Found a match: Hello
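
Because match() only looks at the beginning of the string, it returns None when the pattern occurs later in the text, whereas search() still finds it. A short sketch of the difference, with variable names chosen for illustration:

```python
import re

text = "Hello, world!"

# 'world' is not at the start of the string, so match() finds nothing
at_start = re.match(r'world', text)

# search() scans forward until it finds the first occurrence
anywhere = re.search(r'world', text)

print(at_start)          # None
print(anywhere.group())  # world
print(anywhere.start())  # 7
```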

4. Use the `findall()` function

The findall() function finds all matches in the string and returns them as a list.

import re

text = "123 apples and 456 oranges"
matches = re.findall(r'\d+', text)

print("All matches:", matches)  # Output: All matches: ['123', '456']

5. Use the `sub()` function

The sub() function replaces matches with the specified string.

import re

text = "I like apples"
result = re.sub(r'apples', 'bananas', text)

print("Replaced text:", result)  # Output: Replaced text: I like bananas

6. Use the `compile()` function

The compile() function compiles a regular expression pattern into a pattern object so that it can be reused.

import re

pattern = re.compile(r'\d+')

text = "123 apples and 456 oranges"
matches = pattern.findall(text)

print("All matches:", matches)  # Output: All matches: ['123', '456']

Practical Example

The following example uses a regular expression to extract email addresses from text.

import re

text = """
Contact us at support@example.com for more information.
You can also reach out to sales@example.com or marketing@example.net.
"""

# Regular expression pattern for an email address
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Extract every email address with findall()
email_addresses = re.findall(email_pattern, text)

# Print the extracted addresses
print("Extracted email addresses:", email_addresses)
# Output: Extracted email addresses: ['support@example.com', 'sales@example.com', 'marketing@example.net']

Regular Expression Functions (RegEx Functions)

1. `findall()`

Explanation: Returns a list of all matches of the pattern in the string.
Use: Extracting every match from a text.

Example:

import re

text = "123 apples, 456 oranges, and 789 bananas"
matches = re.findall(r'\d+', text)

print(matches)  # Output: ['123', '456', '789']

2. `search()`

Explanation: Searches the string for the pattern and returns a match object if one is found.
Use: Checking whether a match exists anywhere in the string.

Example:

import re

text = "Hello, world!"
match = re.search(r'world', text)

if match:
    print("Found:", match.group())  # Output: Found: world

3. `split()`

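Explanation: Splits the string at each occurrence of the pattern and returns a list of substrings.
Use: Breaking a string into pieces around a delimiter pattern.

Example:

import re

text = "one, two, three; four"
parts = re.split(r'[,;]', text)

print(parts)  # Output: ['one', ' two', ' three', ' four']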
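
The result above keeps the space that follows each delimiter. If that is not wanted, one option is to let the pattern absorb the surrounding whitespace. The sketch below also shows the maxsplit argument, which limits the number of splits.

```python
import re

text = "one, two, three; four"

# \s* on both sides absorbs any spaces around each delimiter
parts = re.split(r'\s*[,;]\s*', text)
print(parts)  # ['one', 'two', 'three', 'four']

# maxsplit=1 stops after the first split and leaves the rest untouched
first_two = re.split(r'\s*[,;]\s*', text, maxsplit=1)
print(first_two)  # ['one', 'two, three; four']
```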

4. `sub()`

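Explanation: Replaces the parts of the string that match the pattern with a replacement string.
Use: Pattern-based substitution within a string.

Example:

import re

text = "I like apples"
result = re.sub(r'apples', 'bananas', text)

print(result)  # Output: I like bananas

5. `compile()`

Explanation: Compiles a regular expression pattern into a pattern object so that it can be reused.
Use: Defining a pattern once and reusing it, which avoids repeated pattern lookups when it is applied many times.

Example:

import re

pattern = re.compile(r'\d+')

text = "123 apples and 456 oranges"
matches = pattern.findall(text)

print(matches)  # Output: ['123', '456']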

6. `escape()`

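Explanation: Escapes the characters in a string that have a special meaning in a regular expression. (Before Python 3.7 it escaped every non-alphanumeric character, including =.)
Use: Treating special characters as ordinary literal characters.

Example:

import re

text = "example.com?query=value"
escaped_text = re.escape(text)

print(escaped_text)  # Output (Python 3.7+): example\.com\?query=value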
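
re.escape() matters most when the text to search for comes from outside the program and may contain metacharacters. The following sketch uses made-up sample strings to compare an unescaped search with an escaped one.

```python
import re

needle = "1+1=2"  # hypothetical literal text, e.g. typed by a user
haystack = "We know that 1+1=2, but 11=2 is wrong."

# Unescaped, '+' is a quantifier: the pattern means "one or more '1', then '1=2'"
unescaped = re.findall(needle, haystack)

# Escaped, every character is matched literally
escaped = re.findall(re.escape(needle), haystack)

print(unescaped)  # ['11=2']
print(escaped)    # ['1+1=2']
```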

7. `fullmatch()`

Explanation: Checks whether the entire string matches the pattern.
Use: Validating that a string conforms to a pattern from start to end.

Example:

import re

pattern = r'Hello, world!'
text = 'Hello, world!'

match = re.fullmatch(pattern, text)

if match:
    print("Exact match!")  # Output: Exact match!
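
The difference from match() shows up when the input has extra trailing text. In the sketch below, the five-digit rule and both helper names are hypothetical and serve only to illustrate the contrast.

```python
import re

# Hypothetical rule for illustration: the value must be exactly five digits
FIVE_DIGITS = re.compile(r'\d{5}')

def starts_with_five_digits(value):
    # match() anchors only at the start; trailing text is ignored
    return FIVE_DIGITS.match(value) is not None

def is_five_digits(value):
    # fullmatch() requires the whole string to fit the pattern
    return FIVE_DIGITS.fullmatch(value) is not None

print(starts_with_five_digits('12345-extra'))  # True
print(is_five_digits('12345-extra'))           # False
print(is_five_digits('12345'))                 # True
```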

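Metacharacters

Metacharacters are characters that have a special meaning in a regular expression. The most common ones are explained below, each with an example:

1. `[]`

Explanation: Defines a set of characters and matches any one of them.

Example:

import re

text = "bat, cat, hat"
matches = re.findall(r'[bch]at', text)

print(matches)  # Output: ['bat', 'cat', 'hat']

2. `\`

Explanation: Escapes a special character so that it is treated literally. It also introduces special sequences (for example, \d for a digit).

Example:

import re

text = "This is a test. 123."
matches = re.findall(r'\d+', text)

print(matches)  # Output: ['123']

3. `.`

Explanation: Matches any single character except a newline. Often used as a wildcard.

Example:

import re

text = "cat, cot, cut"
matches = re.findall(r'c.t', text)

print(matches)  # Output: ['cat', 'cot', 'cut']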

4. `^`

Explanation: Matches the start of the string, so the pattern must appear at the beginning.

Example:

import re

text = "Hello, world!"
match = re.search(r'^Hello', text)

if match:
    print("Found:", match.group())  # Output: Found: Hello

5. `$`

Explanation: Matches the end of the string (or the position just before a trailing newline), so the pattern must appear at the end.

Example:

import re

text = "Welcome to Python"
match = re.search(r'Python$', text)

if match:
    print("Found:", match.group())  # Output: Found: Python

6. `*`

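Explanation: Matches the preceding pattern zero or more times. Useful for characters that are optional and may repeat.

Example:

import re

text = "ac, abc, abbc"
matches = re.findall(r'ab*c', text)

print(matches)  # Output: ['ac', 'abc', 'abbc']

7. `+`

Explanation: Matches the preceding pattern one or more times. Useful for characters that must appear at least once.

Example:

import re

text = "ac, abc, abbc"
matches = re.findall(r'ab+c', text)

print(matches)  # Output: ['abc', 'abbc']

8. `?`

Explanation: Matches the preceding pattern zero or one time. Useful for making the preceding character optional.

Example:

import re

text = "color, colour"
matches = re.findall(r'colou?r', text)

print(matches)  # Output: ['color', 'colour']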

9. `|`

Explanation: Logical OR. Matches the pattern on either side. Useful for matching one of several alternatives.

Example:

import re

text = "cat, bat, rat"
matches = re.findall(r'cat|rat', text)

print(matches)  # Output: ['cat', 'rat']
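
The | operator applies to everything on each side of it, which can be surprising when only part of the pattern should vary. The sketch below uses a non-capturing group, (?:...), to limit the alternation. The variable names are illustrative.

```python
import re

text = "cat, bat, rat"

# Without a group, the alternatives are the whole sides: 'c' OR 'rat'
whole_pattern = re.findall(r'c|rat', text)

# A non-capturing group restricts the alternation to the first letter
grouped = re.findall(r'(?:c|r)at', text)

print(whole_pattern)  # ['c', 'rat']
print(grouped)        # ['cat', 'rat']
```

A non-capturing group is used here because findall() returns the group contents instead of the whole match when the pattern contains a capturing group.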

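Special Sequences

Special sequences are escape sequences with a specific meaning in a regular expression. The most common ones are explained below, each with an example:

1. `\A`

Explanation: Matches only at the start of the string. It behaves like ^, except that it is not affected by the re.MULTILINE flag, which lets ^ also match at the start of each line.

Example:

import re

text = "Hello world"
match = re.search(r'\AHello', text)

if match:
    print("Found:", match.group())  # Output: Found: Hello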
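
The difference between ^ and \A only becomes visible with multi-line input and the re.MULTILINE flag. A minimal sketch with a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# With MULTILINE, ^ matches at the start of every line
caret_hits = re.findall(r'^\w+', text, flags=re.MULTILINE)

# \A still matches only at the very start of the string
anchor_hits = re.findall(r'\A\w+', text, flags=re.MULTILINE)

print(caret_hits)   # ['first', 'second']
print(anchor_hits)  # ['first']
```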

2. `\b`

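Explanation: Matches the empty string at a word boundary, that is, at the start or end of a word.

Example:

import re

text = "Hello, world!"
matches = re.findall(r'\bworld\b', text)

print(matches)  # Output: ['world']

3. `\B`

Explanation: Matches the empty string at a position that is not a word boundary.

Example:

import re

text = "Hello, world!"
matches = re.findall(r'\Bworld\B', text)

# 'world' starts and ends at a word boundary here, so \B cannot match
print(matches)  # Output: []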

4. `\d`

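Explanation: Matches any decimal digit. With the re.ASCII flag this is equivalent to [0-9]; for ordinary str patterns it also matches digits from other scripts.

Example:

import re

text = "There are 123 apples"
matches = re.findall(r'\d+', text)

print(matches)  # Output: ['123']

5. `\D`

Explanation: Matches any character that is not a digit. With the re.ASCII flag this is equivalent to [^0-9].

Example:

import re

text = "There are 123 apples"
matches = re.findall(r'\D+', text)

print(matches)  # Output: ['There are ', ' apples']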

6. `\s`

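Explanation: Matches any whitespace character (space, tab, newline, and so on). With the re.ASCII flag this is equivalent to [ \t\n\r\f\v]; for ordinary str patterns it also matches other Unicode whitespace.

Example:

import re

text = "Hello world!"
matches = re.findall(r'\s', text)

print(matches)  # Output: [' ']

7. `\S`

Explanation: Matches any character that is not whitespace. With the re.ASCII flag this is equivalent to [^ \t\n\r\f\v].

Example:

import re

text = "Hello world!"
matches = re.findall(r'\S+', text)

print(matches)  # Output: ['Hello', 'world!']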

8. `\w`

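Explanation: Matches any word character: letters, digits, and the underscore. With the re.ASCII flag this is equivalent to [a-zA-Z0-9_]; for ordinary str patterns it also matches letters and digits from other scripts.

Example:

import re

text = "Hello_world 123"
matches = re.findall(r'\w+', text)

print(matches)  # Output: ['Hello_world', '123']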
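
For str patterns in Python 3, \w is Unicode-aware by default, so it is only the same as [a-zA-Z0-9_] when the re.ASCII flag is set. The sketch below uses a sample string with accented letters, written with escape sequences so the code points are unambiguous.

```python
import re

# "na\u00efve caf\u00e9 123" is "naïve café 123" with precomposed accented letters
text = "na\u00efve caf\u00e9 123"

# Default (Unicode) mode: accented letters count as word characters
unicode_words = re.findall(r'\w+', text)

# ASCII mode: \w is limited to [a-zA-Z0-9_], so accented letters split the words
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)  # ['naïve', 'café', '123']
print(ascii_words)    # ['na', 've', 'caf', '123']
```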

9. `\W`

Explanation: Matches any character that is not a word character. With the re.ASCII flag this is equivalent to [^a-zA-Z0-9_].

Example:

import re

text = "Hello world!"
matches = re.findall(r'\W+', text)

print(matches)  # Output: [' ', '!']

10. `\Z`

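Explanation: Matches only at the very end of the string. It is similar to $, but stricter: $ also matches just before a trailing newline and, with re.MULTILINE, at the end of each line.

Example:

import re

text = "Hello world"
match = re.search(r'world\Z', text)

if match:
    print("Found:", match.group())  # Output: Found: world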
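
The stricter behavior of \Z is easiest to see on a string that ends with a newline, such as a line read from a file. A minimal sketch with illustrative variable names:

```python
import re

text = "Hello world\n"

# $ also matches just before a trailing newline
dollar = re.search(r'world$', text)

# \Z matches only at the very end, which here is after the newline
strict = re.search(r'world\Z', text)

print(dollar is not None)  # True
print(strict is not None)  # False
```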

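Sets

A set specifies a group of characters inside square brackets. Some common sets are explained below, each with an example:

1. `[arn]`

Explanation: Matches any one character in the set ('a', 'r', or 'n').
Use: Matching one of a specific group of characters.

Example:

import re

text = "apple, banana, orange"
matches = re.findall(r'[arn]', text)

print(matches)  # Output: ['a', 'a', 'n', 'a', 'n', 'a', 'r', 'a', 'n']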

2. `[a-n]`

Explanation: Matches any one character in the range 'a' to 'n'.
Use: Matching a character within a range.

Example:

import re

text = "apple, banana, orange"
matches = re.findall(r'[a-n]', text)

print(matches)  # Output: ['a', 'l', 'e', 'b', 'a', 'n', 'a', 'n', 'a', 'a', 'n', 'g', 'e']

3. `[^arn]`

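Explanation: Matches any one character that is not in the set (anything except 'a', 'r', or 'n'). Commas and spaces are matched too.
Use: Excluding specific characters.

Example:

import re

text = "apple, banana, orange"
matches = re.findall(r'[^arn]', text)

print(matches)  # Output: ['p', 'p', 'l', 'e', ',', ' ', 'b', ',', ' ', 'o', 'g', 'e']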

4. `[0123]`

Explanation: Matches any one digit in the set ('0', '1', '2', or '3').
Use: Matching one of a specific group of digits.

Example:

import re

text = "1024, 123, 456"
matches = re.findall(r'[0123]', text)

print(matches)  # Output: ['1', '0', '2', '1', '2', '3']

5. `[0-9]`

Explanation: Matches any digit from 0 to 9. Equivalent to \d in ASCII mode.
Use: Matching any digit.

Example:

import re

text = "1024, 123, 456"
matches = re.findall(r'[0-9]', text)

print(matches)  # Output: ['1', '0', '2', '4', '1', '2', '3', '4', '5', '6']

6. `[0-5][0-9]`

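Explanation: Matches any two-digit number from 00 to 59. Suitable for time units such as minutes or seconds.
Use: Matching two-digit numbers within a specific range.

Example:

import re

text = "The time is 12:45 and 08:30."
matches = re.findall(r'[0-5][0-9]', text)

print(matches)  # Output: ['12', '45', '08', '30']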
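
On its own, [0-5][0-9] cannot tell hours from minutes. One way to build on it is to combine it with an hour range and validate a whole HH:MM value. The 24-hour format and the helper name `is_valid_time` are assumptions made for this sketch.

```python
import re

# Hours 00-23, a colon, then minutes 00-59 (assumed 24-hour HH:MM format)
TIME_PATTERN = re.compile(r'(?:[01][0-9]|2[0-3]):[0-5][0-9]')

def is_valid_time(value):
    # fullmatch() rejects any extra characters before or after the time
    return TIME_PATTERN.fullmatch(value) is not None

print(is_valid_time('12:45'))  # True
print(is_valid_time('08:30'))  # True
print(is_valid_time('24:00'))  # False
print(is_valid_time('12:60'))  # False
```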

7. `[a-zA-Z]`

Explanation: Matches any ASCII letter, lowercase or uppercase.
Use: Matching a letter regardless of its case.

Example:

import re

text = "Hello, World!"
matches = re.findall(r'[a-zA-Z]', text)

print(matches)  # Output: ['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']
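
An alternative to listing both cases in the set is the re.IGNORECASE flag. The sketch below shows both approaches giving the same result on this sample text. The + quantifier is added here so that whole words are returned instead of single letters.

```python
import re

text = "Hello, World!"

# Both cases listed explicitly in the set
explicit = re.findall(r'[a-zA-Z]+', text)

# Lowercase range only, with case-insensitive matching switched on
with_flag = re.findall(r'[a-z]+', text, flags=re.IGNORECASE)

print(explicit)   # ['Hello', 'World']
print(with_flag)  # ['Hello', 'World']
```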

8. `[+]`

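Explanation: Matches a literal plus sign (+).
Use: Matching a special character literally. Inside square brackets, most metacharacters lose their special meaning.

Example:

import re

text = "Use + for addition"
matches = re.findall(r'[+]', text)

print(matches)  # Output: ['+']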

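Frequently Asked Questions

1. How do I check whether a string matches a regular expression?

To check whether a string matches a regular expression pattern, use the match(), search(), or fullmatch() function of the re module.

re.match(): Checks whether the beginning of the string matches the pattern.
re.search(): Scans the whole string for any part that matches the pattern.
re.fullmatch(): Requires the entire string to match the pattern.

Each function returns a match object when it finds a match and None otherwise.

Example:

import re

text = "Hello 123"

# Use re.search to find digits
match = re.search(r'\d+', text)

if match:
    print("Found:", match.group())  # Output: Found: 123
else:
    print("No match found")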

2. How do I search for a phrase with a regular expression in Python?

To search a string for a phrase, use the search() function of the re module. Import re, then define a pattern containing the exact phrase you are looking for. re.search() scans the whole string and returns a match object if the phrase is found, or None otherwise.

Example:

import re

text = "This is a sample text with hello world in it."

# Use re.search to look for the phrase "hello world"
match = re.search(r'hello world', text)

if match:
    print("Found:", match.group())  # Output: Found: hello world
else:
    print("No match found")

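3. How do I replace content in a text file with a regular expression in Python?

To replace text in a file, use the sub() function of the re module. Read the file contents into a string, call re.sub() with the pattern to replace and the replacement text, and then write the modified contents back to the file.

Example:

import re

# Read the file contents
with open('example.txt', 'r') as file:
    content = file.read()

# Replace text using a regular expression
updated_content = re.sub(r'foo', 'bar', content)

# Write the modified contents back to the file
with open('example.txt', 'w') as file:
    file.write(updated_content)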
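
The same steps can be wrapped in a small helper that also reports how many replacements were made. This is only a sketch: the function name `replace_in_file` and the UTF-8 encoding are assumptions, and re.subn() is used instead of re.sub() because it returns the replacement count along with the new text.

```python
import re
from pathlib import Path

def replace_in_file(path, pattern, replacement):
    """Rewrite the file at `path` and return the number of substitutions made."""
    file_path = Path(path)
    content = file_path.read_text(encoding='utf-8')  # assumes a UTF-8 text file

    # subn() returns a tuple: (new_string, number_of_substitutions)
    updated_content, count = re.subn(pattern, replacement, content)

    # Skip the write when nothing changed
    if count:
        file_path.write_text(updated_content, encoding='utf-8')
    return count
```

Calling replace_in_file('example.txt', r'foo', 'bar') would then perform the same replacement as the example above and return the number of occurrences that were changed.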

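4. How do I find full names with a regular expression in Python?

To find full names, build a pattern that matches a typical name format. A common pattern is two words, each starting with an uppercase letter followed by lowercase letters. Real names may also include a middle name or initial, which this simple pattern does not cover. Use the findall() or search() function of the re module to locate names in text. Note that the pattern matches any two adjacent capitalized words, so a capitalized word directly before a name (for example, at the start of a sentence) would be picked up by mistake.

Example:

import re

text = "Please contact John Doe or Jane Smith for more information."

# Regular expression pattern for a two-word full name
pattern = r'\b[A-Z][a-z]+\s[A-Z][a-z]+\b'

# Find every full name in the text
full_names = re.findall(pattern, text)

print(full_names)  # Output: ['John Doe', 'Jane Smith']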
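
Names with a middle name or a middle initial need an optional extra part in the pattern. The following is one possible extension, not a general-purpose name matcher; the sample sentence and the exact pattern are assumptions made for illustration.

```python
import re

text = "Please contact John Q. Public, Jane Smith, or Mary Ann Lee."

# First name, then an optional middle initial ("Q.") or middle name ("Ann"),
# then a last name. (?:...) groups without capturing, so findall() returns
# the whole match.
pattern = r'\b[A-Z][a-z]+(?:\s(?:[A-Z]\.|[A-Z][a-z]+))?\s[A-Z][a-z]+\b'

full_names = re.findall(pattern, text)

print(full_names)  # ['John Q. Public', 'Jane Smith', 'Mary Ann Lee']
```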

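Summary

Python regular expressions (RegEx) are a powerful and flexible tool for pattern matching and string manipulation, used in everything from text processing to data validation. Mastering the re module and functions such as match(), sub(), findall(), and search() lets you handle complex text-processing tasks effectively, including text replacement, information extraction, and pattern recognition.

Key Points

Pattern matching:
- match(): Checks whether the beginning of the string matches the pattern.
- search(): Scans the whole string for any part that matches the pattern.
- fullmatch(): Requires the entire string to match the pattern.
String manipulation:
- findall(): Returns a list of all matches of the pattern in the string.
- sub(): Replaces the parts of the string that match the pattern with a replacement string.
- split(): Splits the string at each occurrence of the pattern and returns a list of substrings.
Special sequences and sets:
- Special sequences: \d (digits), \s (whitespace), \w (word characters), and others give a convenient way to match a particular kind of character.
- Sets: [arn] (matches 'a', 'r', or 'n'), [a-n] (matches a character in the range 'a' to 'n'), [^arn] (matches a character other than 'a', 'r', or 'n'), and others let you define more specific matching rules.

Practical Uses

Text replacement: re.sub() replaces specific patterns in text.
Information extraction: re.findall() or re.search() pulls the information you need out of text.
Pattern recognition: re.match() or re.fullmatch() verifies that a string conforms to a specific pattern.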
