Python Regular Expressions: Master Text Processing in 30 Seconds

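I. Overview

1. Definition

A regular expression is a compact notation for describing text rules, that is, the structure and pattern of a string. It is widely used to match, search, replace, and extract text.

2. Characteristics

Complex syntax: many symbols and flexible rules make patterns hard to read.
Powerful: patterns give precise control over string content and suit a wide range of text-processing tasks.
Cross-language support: Python, JavaScript, Java, and other mainstream languages all support regular expressions.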

II. Using Regular Expressions in Python

1. Importing the module

import re

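2. Common functions

`re.match(pattern, string, flags)`

Matches from the beginning of the string. Returns a Match object on success and None otherwise.

pattern: the regular expression to match.
string: the target string.
flags: optional flags, such as re.IGNORECASE for case-insensitive matching.

⚠️ Note:

A failed match returns None, not False.

match() only matches at the start of the string.

Example: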

res = re.match("hello", "hello python")
if res:
    print(res.group())  # Output: hello
else:
    print("No match found")
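
The flags parameter and the None result described above can be seen together in a minimal sketch. The sample text and variable names are made up for illustration; only `re.match` and `re.IGNORECASE` come from the standard library.

```python
import re

text = "Hello Python"  # hypothetical sample text

# Without flags the match is case-sensitive, so "hello" does not match "Hello".
strict = re.match("hello", text)

# re.IGNORECASE makes the same pattern match regardless of case.
relaxed = re.match("hello", text, flags=re.IGNORECASE)

# match() is anchored at the start, so "Python" is not found even though it occurs later.
later = re.match("Python", text)

# A failed match is None, so check for None before calling .group().
for name, m in [("strict", strict), ("relaxed", relaxed), ("later", later)]:
    print(name, "->", m.group() if m is not None else None)
```

Checking `is not None` before calling `.group()` avoids the AttributeError that a failed match would otherwise cause.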

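III. Basic Syntax and Metacharacters

1. Matching single characters

Metacharacter	Meaning
`.`	Matches any single character except `\n`
`\d`	Matches any digit, `[0-9]`
`\D`	Matches any non-digit
`\s`	Matches a whitespace character (space, newline, tab, etc.)
`\S`	Matches a non-whitespace character
`\w`	Matches a word character (letters, digits, underscore `_`)
`\W`	Matches a non-word character
`{}`	Specifies the number of repetitions
`[]`	Matches any one of the characters listed inside the brackets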

Example:

# 1. Use . to match any single character
text = "hello python"

res = re.match("..", text)  # matches 'he'
print(res.group())

# 2. [] matches one of the listed characters
res = re.match("[he]", text)  # matches only 'h': a class matches one character, starting at the beginning
print(res.group())

res = re.match("[he][he]", text)  # matches 'he'
print(res.group())
# 3. Match the digits 0-9
res = re.match("[0123456789]", "2312")  # matches '2'
print(res.group())
res = re.match("[0-9]", "2312")  # matches '2'
print(res.group())
res = re.match("[0-35-9]", "2312")  # any digit except 4; matches '2'
print(res.group())
# 4. Match any letter
res = re.match("[a-zA-Z]", "ABCD")
print(res.group())
# \d matches a digit
res = re.match(r"\d", "9823456")
print(res.group())
# Match 9 followed by three more digits (four digits in total)
res = re.match(r"9\d{3}", "9823456")
print(res.group())

# \D matches a non-digit
res = re.match(r"\D{4}", "ABCDEFGHIJKLMNO")
print(res.group())

# \s matches whitespace
res = re.match(r"\s", " hello python")
print(res.group())

# \S matches non-whitespace
res = re.match(r"\S", "hello python")
print(res.group())

# \w matches a word character: A-Z, a-z, 0-9, _, and also Chinese characters
res = re.match(r"\w", "hello python")
print(res.group())
res = re.match(r"\w", "你好 python")
print(res.group())
res = re.match(r"\w", ".你好 python")  # no match: the string starts with '.'
if res is not None:
    print("True")
    print(res.group())
else:
    print("False")
# \W matches a non-word character
res = re.match(r"\W", "   .你好 python")  # matches the leading space
if res is not None:
    print("True")
    print(repr(res.group()))  # repr makes the matched space visible
else:
    print("False")
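
Because `\w` matches Chinese characters in the examples above, it may help to see how the standard `re.ASCII` flag narrows `\d` and `\w` to ASCII only. This is a small sketch; the sample characters are chosen purely for illustration.

```python
import re

# ASCII digit, Arabic-Indic digit three (U+0663), and a Chinese character.
samples = ["7", "\u0663", "你"]

# Default matching for str patterns is Unicode-aware.
unicode_digit = [re.match(r"\d", s) is not None for s in samples]
unicode_word = [re.match(r"\w", s) is not None for s in samples]

# re.ASCII restricts \d to [0-9] and \w to [a-zA-Z0-9_].
ascii_digit = [re.match(r"\d", s, flags=re.ASCII) is not None for s in samples]
ascii_word = [re.match(r"\w", s, flags=re.ASCII) is not None for s in samples]

print(unicode_digit, unicode_word)
print(ascii_digit, ascii_word)
```

If the input must contain only the digits 0 to 9, writing `[0-9]` explicitly or passing `re.ASCII` is the safer choice.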

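2. Quantifiers (controlling repetition)

Quantifier	Meaning
`*`	Matches the preceding element 0 or more times
`+`	Matches the preceding element 1 or more times
`?`	Matches the preceding element 0 or 1 time
`{m}`	Matches the preceding element exactly m times
`{m,n}`	Matches the preceding element from m to n times (inclusive)

A custom safe-match helper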

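def safe_match(pattern, text):
    """Run re.match and print the outcome instead of failing on a missing match."""
    try:
        res = re.match(pattern, text)
    except re.error as e:
        # Raised when the pattern itself is not a valid regular expression
        print(f"Invalid pattern --> {e}")
        return
    if res is None:
        print(f"No match: [{pattern}]")
    else:
        print(f"Matched: [{pattern}] ==> '{res.group()}'")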

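Examples:

safe_match(r'\w+', " hello python")  # no match: + needs at least one word character, and the string starts with a space

safe_match(r'\w*', "hello python")  # matches 'hello': * repeats \w itself 0 or more times; it does not repeat the text that \w matched

safe_match(r'\w*', "    hello python")  # succeeds with an empty match '', because 0 occurrences are allowed

safe_match(r'\w?', " hello python")  # succeeds with an empty match '', because ? allows 0 or 1 occurrence
safe_match(r'\w{3}', "hello python")  # matches 'hel': exactly 3 word characters

safe_match(r'\w{7}', "hello python")  # no match: only 5 consecutive word characters come before the space
safe_match(r'\w{4,7}', "hello python")  # matches 'hello': 4 to 7 word characters are allowed and 5 are available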
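
As a sketch of how a `{m,n}` quantifier might be used for validation, the helper below checks a made-up rule: a username of 4 to 7 word characters. The name `is_valid_username` and the rule itself are assumptions for illustration; `re.compile` and `fullmatch` are standard.

```python
import re

# Hypothetical rule for illustration: a username is 4 to 7 word characters.
USERNAME = re.compile(r"\w{4,7}")

def is_valid_username(name):
    """Return True only if the whole string is 4 to 7 word characters."""
    # fullmatch requires the pattern to cover the entire string, whereas
    # match would accept any string that merely starts with 4 word characters.
    return USERNAME.fullmatch(name) is not None

for candidate in ["bob", "hello", "python3", "toolongname", "a b c"]:
    print(candidate, is_valid_username(candidate))
```

Using `fullmatch` rather than `match` matters here: `re.match(r"\w{4,7}", "toolongname")` succeeds on the first seven characters even though the whole name is too long.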

IV. Anchors (matching positions)

Symbol	Meaning
`^`	Matches the start of the string
`$`	Matches the end of the string

Example:

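# ^ matches the start of the string; inside [] it negates the character class instead
safe_match(r'^hell', "hello python")  # starts with hell
# [^...]: a leading ^ inside [] means "any character not listed"
safe_match(r"[^py]", "hello python")  # matches 'h', which is neither p nor y
# $ matches the end of the string, but remember that match() starts at the beginning
# To require that the string ends with n:
safe_match(r'.*[n]$', "hello python")
# To require that the string ends with a character other than n (no match here, because the string ends with n):
safe_match(r'.*[^n]$', "hello python")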
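
The anchors above apply to the whole string by default. A short sketch, using a made-up multi-line `log` string, shows how the standard `re.MULTILINE` flag changes `^`, and how `re.search` lets `$` be used without a leading `.*`.

```python
import re

# Hypothetical multi-line input for illustration.
log = "error: disk full\ninfo: retrying\nerror: timeout"

# By default ^ only matches at the very start of the string.
first_only = re.findall(r"^\w+", log)

# With re.MULTILINE, ^ also matches right after each newline.
every_line = re.findall(r"^\w+", log, flags=re.MULTILINE)

# search() scans the whole string, so $ can anchor a pattern to the end
# without the leading .* that match() needs.
ends_with_timeout = re.search(r"timeout$", log) is not None

print(first_only, every_line, ends_with_timeout)
```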

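V. Grouping and Backreferences

Symbol	Meaning
`|`	Matches either the left or the right expression; the left is tried first, and the right only if the left fails
`(ab)`	Treats the characters in the parentheses as a group
`\num`	Matches the same text that group number num matched
`(?P<name>)`	Gives a group a name
`(?P=name)`	Matches the same text that the group named name matched; naming groups makes them easier to reference later

Example: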

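# Match either the left or the right expression
safe_match("abc|ABC", "abc")
safe_match("abc|ABC", "ABC")

# (ab) treats the characters in parentheses as a group
safe_match(r"\w*@(163|qq|wechat)\.com", "stitchcool@163.com")

# \num matches the text captured by group num; typically used when matching paired tags
# Note: groups are numbered by their opening parenthesis, outer before inner, starting at 1
safe_match(r"<\w*>\w*</\w*>", "<html>login</html>")
# That is cumbersome and does not force the closing tag to equal the opening tag; refer back to the captured group instead

safe_match("<(\\w*)>\\w*</\\1>", "<html>login</html>")
# A raw string (r prefix) avoids doubling the backslashes
safe_match(r"<(\w*)>\w*</\1>", "<html>login</html>")
safe_match(r"<(\w*)><(\w*)>\w*</\2></\1>", "<html><body>login</body></html>")

# Named groups
safe_match(r"<(?P<tag1>\w*)><(?P<tag2>\w*)>\w*</(?P=tag2)></(?P=tag1)>", "<html><body>login</body></html>")
# Match URLs: the prefix is usually www and the suffix .com, .cn, and so on
li = ["www.baidu.com", "www.python.org", "http.taobao.cn", "http\\iaidu\\com"]

for i in li:
    # (.) captures the separator after www, and \1 requires the same separator before the suffix
    safe_match(r"www(.)\w*\1(com|cn|org)", i)

# r'' is a raw string: it is stored exactly as written, with no escape processing
# "http\niaidu\\com" has no r prefix, so escapes are processed: \n becomes a newline and \\ becomes one backslash
# r"http\niaidu\com" is a raw string, so memory holds http\niaidu\com with both backslashes intact
# An unrecognized escape such as \c is kept literally in a normal string, but recent Python versions warn about it
print(li[3])

str111 = "http\niaidu\\com"
print(str111)
str111 = r"http\niaidu\com"
print(str111)
safe_match(r"\w", str111)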
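
The helper above prints only the whole match. To read individual groups, a Match object offers `group()` by number or name and `groupdict()`. The date pattern and input below are made up for illustration.

```python
import re

# Hypothetical pattern for illustration: an ISO-style date with named groups.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = DATE.match("2024-05-17 release")

whole = m.group()          # the entire matched text, same as m.group(0)
year = m.group(1)          # groups are numbered from 1
month = m.group("month")   # named groups can also be read by name
parts = m.groupdict()      # all named groups as a dict

print(whole, year, month, parts)
```

Named groups cost a little extra typing in the pattern but keep the extraction code readable when a pattern has several groups.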

VI. Advanced Usage

1. `re.search()`

Scans the whole string and returns the first successful match.

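def safe_search(pattern, text):
    """Run re.search and print the outcome instead of failing on a missing match."""
    res = re.search(pattern, text)
    if res is None:
        print("search found nothing for:", pattern)
    else:
        print("search matched:", res.group())

safe_search(r"\d", "python123")  # matches '1'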
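
The difference between `re.match` and `re.search` is easiest to see on the same input. A minimal sketch, reusing the sample string from the example above:

```python
import re

text = "python123"

at_start = re.match(r"\d", text)     # None: the string does not start with a digit
anywhere = re.search(r"\d", text)    # finds the first digit wherever it is

first_digit = anywhere.group()       # the matched text
position = anywhere.start()          # index of the match in the string

print(at_start, first_digit, position)
```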

2. `re.findall()`

Returns a list of all matches. When nothing matches, it returns an empty list instead of raising an error.

print(re.findall(r"\d", "1py2345thon"))  # ['1', '2', '3', '4', '5']
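
What `re.findall` puts in the list depends on whether the pattern has groups. The sketch below uses a made-up settings string to show the three cases.

```python
import re

# Hypothetical input for illustration.
text = "width=80, height=24, depth=3"

# No groups: each item is the whole match.
numbers = re.findall(r"\d+", text)

# Two groups: each item is a tuple of the captured groups.
pairs = re.findall(r"(\w+)=(\d+)", text)

# No match at all gives an empty list, never None.
nothing = re.findall(r"\d+", "no digits here")

print(numbers, pairs, nothing)
```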

3. `re.sub()`

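Replaces matched content.

`re.sub(pattern, repl, string, count)`

pattern: the regular expression describing the text to replace.

repl: the replacement text.

string: the string to operate on.

count: the maximum number of replacements; by default every match is replaced.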

res = re.sub(r"\d", "X", "[1,2,3,4,5]")
print(res)  # Output: [X,X,X,X,X]
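
The parameters listed above can be combined in a few ways. This sketch shows `count`, a group reference in the replacement, and a function as the replacement; the inputs are illustrative only.

```python
import re

text = "[1,2,3,4,5]"

# count limits how many matches are replaced, working left to right.
first_two = re.sub(r"\d", "X", text, count=2)

# \1 and \2 in the replacement refer to captured groups.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")

# repl may also be a function that receives each Match object.
doubled = re.sub(r"\d", lambda m: str(int(m.group()) * 2), text)

print(first_two, swapped, doubled)
```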

4. `re.split()`

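Splits a string by a regular expression.

`re.split(pattern, string, maxsplit)`

pattern: the regular expression whose matches act as separators.

string: the string to split.

maxsplit: the maximum number of splits.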

res = re.split(r"\d", "Y1Y2Y3Y4Y5]", maxsplit=2)
print(res)  # ['Y', 'Y', 'Y3Y4Y5]']
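
Two common variations on `re.split` are sketched below with made-up inputs: splitting on a run of mixed separators, and keeping the separators by wrapping the pattern in a capturing group.

```python
import re

# A character class followed by + treats any run of separators as one split point.
fields = re.split(r"[,;\s]+", "a, b;c  d")

# A capturing group in the pattern keeps the separators in the result.
with_separators = re.split(r"(\d)", "Y1Y2Y3")

print(fields)
print(with_separators)
```

Note the trailing empty string in the second result: the input ends with a separator, so the piece after it is empty.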

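VII. Greedy and Non-Greedy Matching

Matching is greedy by default: a quantifier matches as much text as possible.
Adding ? after a quantifier makes it non-greedy: it matches as little text as possible.

Example: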

text = "<div>hello</div><div>world</div>"
safe_match(r"<div>(.*)</div>", text)  # greedy: matches the whole string
safe_match(r"<div>(.*?)</div>", text)  # non-greedy: matches only the first div
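
The same contrast shows up clearly with `re.findall`, which returns the captured group for every match. A minimal sketch on the same sample text:

```python
import re

text = "<div>hello</div><div>world</div>"

# Greedy: .* runs to the last </div>, so there is a single, over-long capture.
greedy = re.findall(r"<div>(.*)</div>", text)

# Non-greedy: .*? stops at the nearest </div>, so each div is captured separately.
lazy = re.findall(r"<div>(.*?)</div>", text)

print(greedy)
print(lazy)
```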

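VIII. Raw Strings

In Python, the r"" prefix defines a raw string, which avoids escaping problems.

Form	Meaning
`"\\\\"`	A normal string literal that holds two backslashes, `\\`, in memory; the regex engine reads that pair as one literal `\`
`r"\\"`	A raw string that holds the same two backslashes without Python processing any escapes; recommended

To match a single backslash \ with a pattern written as a normal string, you need "\\\\":

The first layer of unescaping is done by the Python string parser, which turns "\\\\" into \\
The second layer is done by the regular expression engine, which turns \\ into one literal \
With a raw string r"" you only need r"\\", which is clearer and less error-prone.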

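Displayed vs. actual value:

In the string "E:\\pyCode\\pytest\\pythonProject1", the Python interpreter treats each \\ as a single backslash character.

In memory the value is E:\pyCode\pytest\pythonProject1, with one \ at each position, and print() shows it that way.

The doubled form E:\\pyCode\\pytest\\pythonProject1 appears when you evaluate the string directly in the interactive console, because the console shows repr() of the value, which escapes each backslash.

Example: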

safe_match(r"\\", r"\game")  # matches '\'
safe_match("\\\\", "\\game")  # matches '\'
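
The two layers of escaping can be checked directly. This sketch compares the normal and raw forms of the backslash pattern and shows the str() and repr() views of a path; the shortened path is illustrative only.

```python
import re

normal = "\\\\"   # four characters in source, two backslashes in memory
raw = r"\\"       # two characters in source, two backslashes in memory

same_pattern = normal == raw      # both hold the same regex: one literal backslash
pattern_length = len(raw)

path = "E:\\pyCode\\pytest"       # hypothetical shortened path
shown_by_print = str(path)        # what print(path) displays
shown_by_console = repr(path)     # what the interactive console echoes

found = re.findall(raw, path)     # every backslash in the path

print(same_pattern, pattern_length)
print(shown_by_print)
print(shown_by_console)
```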
