8. A Practical Guide to Python String Processing and Regular Expressions


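Overview

This article covers Python's core string-processing methods and practical regular-expression techniques: encoding conversion, splitting and replacing, and the essentials of regex syntax, followed by real-world examples such as log parsing and data cleaning. It closes with 10 graded exercises, each with a complete solution, to take you from the basics to more advanced text processing.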

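1. The Three Core String-Processing Tools

1.1 Encoding conversion (encode/decode)

text = "中文文本"
utf8_bytes = text.encode('utf-8')  # str -> UTF-8 bytes
# Re-encode as GBK: decode the UTF-8 bytes back to str, then encode again
gbk_bytes = utf8_bytes.decode('utf-8').encode('gbk', errors='replace')
print(gbk_bytes.decode('gbk', errors='ignore'))  # prints: 中文文本

Note: when handling multilingual text, settle on one encoding throughout. The errors parameter selects how characters that cannot be encoded or decoded are handled (for example 'replace' or 'ignore'; the default 'strict' raises an exception).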
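As a small illustration of the errors parameter, the sketch below encodes a string containing one non-ASCII character to ASCII under each strategy. The sample string is made up for the demonstration.

```python
sample = "caf\u00e9"  # "café", written with an escape for clarity

# 'replace' substitutes '?' for each character that cannot be encoded
replaced = sample.encode('ascii', errors='replace')
print(replaced)  # b'caf?'

# 'ignore' silently drops the offending character
ignored = sample.encode('ascii', errors='ignore')
print(ignored)  # b'caf'

# The default, 'strict', raises UnicodeEncodeError instead
try:
    sample.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True
```

Both 'replace' and 'ignore' lose information, so they are best kept for display or logging rather than for data that must round-trip.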

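1.2 Smart splitting (split)

csv_line = "2023-07-25,192.168.1.1,GET /api/data,200"
# Split on commas (at most 3 splits) and strip stray whitespace from each field
parts = [x.strip() for x in csv_line.split(',', maxsplit=3)]
print(parts)  # ['2023-07-25', '192.168.1.1', 'GET /api/data', '200']

Going further: rsplit() splits starting from the right, and partition() splits once into a three-part tuple (head, separator, tail).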
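A short sketch of rsplit() and partition() follows; the sample strings are invented for illustration.

```python
path = "backup/site.tar.gz"

# rsplit with maxsplit=1 cuts only at the last dot
stem, ext = path.rsplit('.', 1)
print(stem, ext)  # backup/site.tar gz

# partition cuts at the first separator and always returns a 3-tuple
setting = "timeout=30=seconds"
key, sep, value = setting.partition('=')
print(key, sep, value)  # timeout = 30=seconds

# When the separator is absent, the separator and tail are empty strings
missing = "no_separator".partition('=')
print(missing)  # ('no_separator', '', '')
```

Because partition() always returns three items, it can be unpacked without first checking whether the separator is present.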

1.3 Flexible replacement (replace)

text = "Python2 is outdated, but Python2 code still exists"
new_text = text.replace("Python2", "Python3", 1)  # replace only the first occurrence
print(new_text)  # Python3 is outdated, but Python2 code still exists

Limitation: replace() only handles literal substrings; pattern-based replacement requires regular expressions.
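To make the limitation concrete, here is a minimal sketch of a pattern-based replacement with re.sub that a single replace() call cannot express. The sample sentence and the choice of variants to normalize are assumptions made for the example.

```python
import re

text = "Python2 is old; python 2 scripts remain"

# "python", an optional whitespace character, then "2" at a word boundary,
# matched case-insensitively
normalized = re.sub(r'python\s?2\b', 'Python3', text, flags=re.IGNORECASE)
print(normalized)  # Python3 is old; Python3 scripts remain
```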

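2. Regular Expressions in Depth

2.1 Core methods of the re module

import re

# Precompile for performance (matters most for patterns that are reused)
PHONE_PATTERN = re.compile(r'(\+86)?1[3-9]\d{9}')

text = "Contact: +8613812345678, backup number 15812345678"
matches = PHONE_PATTERN.findall(text)
# With a capturing group in the pattern, findall returns the group's
# contents rather than the whole match
print(matches)  # ['+86', '']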
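If the goal is the full phone numbers rather than the optional prefix, two common options are a non-capturing group (so that findall returns whole matches) or finditer with group(0). The sketch below shows both on a sample string; the name FULL_PHONE_PATTERN is introduced here purely for illustration.

```python
import re

text = "Contact: +8613812345678, backup number 15812345678"

# Option 1: make the prefix group non-capturing, so findall returns whole matches
FULL_PHONE_PATTERN = re.compile(r'(?:\+86)?1[3-9]\d{9}')
numbers = FULL_PHONE_PATTERN.findall(text)
print(numbers)  # ['+8613812345678', '15812345678']

# Option 2: keep the capturing group and read group(0) from each match object
PHONE_PATTERN = re.compile(r'(\+86)?1[3-9]\d{9}')
whole_matches = [m.group(0) for m in PHONE_PATTERN.finditer(text)]
print(whole_matches)  # ['+8613812345678', '15812345678']
```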

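2.2 Regex syntax essentials

Capturing groups: (pattern) vs. (?:pattern) (non-capturing group)
Greediness: .*? matches lazily (as little as possible) vs. .* matches greedily (as much as possible)
Assertions: (?=...) is a positive lookahead, (?<!...) is a negative lookbehind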
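The syntax points above can be seen side by side in a short sketch. All sample strings are made up for the demonstration.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy .* runs to the last </b>; lazy .*? stops at the first one
greedy = re.findall(r'<b>(.*)</b>', html)
lazy = re.findall(r'<b>(.*?)</b>', html)
print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']

# Positive lookahead: digits that are followed by "USD" (the suffix is not consumed)
prices = "cost 100USD, 200EUR, id 300"
usd_amounts = re.findall(r'\d+(?=USD)', prices)
print(usd_amounts)  # ['100']

# Negative lookbehind: whole numbers that are not preceded by "$"
basket = "$15 and 7 apples"
plain_numbers = re.findall(r'(?<!\$)\b\d+', basket)
print(plain_numbers)  # ['7']
```

The \b in the last pattern matters: without it, the engine could skip the "1" in "$15" and report "5" as a match.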

3. Real-World Examples

3.1 Log parsing (Apache logs)

import re

log_line = '127.0.0.1 - - [25/Jul/2023:15:30:22 +0800] "GET /index.html HTTP/1.1" 200 2326'

pattern = r'''
    ^(\d+\.\d+\.\d+\.\d+)\s+       # IP address
    ([\w-]+)\s+                     # client identity
    ([\w-]+)\s+                     # authenticated user
    \[(.*?)\]\s+                    # timestamp
    "(.*?)\s+(.*?)\s+HTTP/(\d\.\d)" # request: method, path, HTTP version
    \s+(\d{3})\s+                   # status code
    (\d+)                           # response size
'''

# re.X (verbose mode) ignores the whitespace and comments inside the pattern
match = re.compile(pattern, re.X).search(log_line)
if match:
    print(f"IP: {match.group(1)}, status: {match.group(8)}")
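Counting positional groups (group(8) for the status code) becomes error-prone as a pattern grows. One possible variation, sketched below, uses named groups and groupdict(). The group names are chosen here for illustration, and accepting "-" as a size is an assumption about how some servers log empty responses.

```python
import re

log_line = '127.0.0.1 - - [25/Jul/2023:15:30:22 +0800] "GET /index.html HTTP/1.1" 200 2326'

LOG_PATTERN = re.compile(r'''
    ^(?P<ip>\d+\.\d+\.\d+\.\d+)\s+
    (?P<ident>[\w-]+)\s+
    (?P<user>[\w-]+)\s+
    \[(?P<time>.*?)\]\s+
    "(?P<method>\S+)\s+(?P<path>\S+)\s+HTTP/(?P<version>\d\.\d)"\s+
    (?P<status>\d{3})\s+
    (?P<size>\d+|-)      # assumption: an empty response may be logged as -
''', re.X)

m = LOG_PATTERN.search(log_line)
record = m.groupdict() if m else {}
print(record.get('ip'), record.get('status'), record.get('path'))
# 127.0.0.1 200 /index.html
```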

3.2 Data cleaning

import re

def clean_html(raw):
    # Replace HTML tags and &nbsp; entities with a space, keeping the text content
    cleaner = re.compile(r'<[^>]+>|&nbsp;')
    # Collapse runs of whitespace (including newlines) and trim both ends
    return re.sub(r'\s+', ' ', cleaner.sub(' ', raw)).strip()

dirty_html = "<p>Hello&nbsp;&nbsp;World!</p>  <br/>"
print(clean_html(dirty_html))  # Hello World!

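4. Advanced Exercises and Solutions

Exercises (10)

1. Extract a specific field from a nested JSON string (ignoring escaped quotes)
2. Validate a complex password rule (upper and lower case, digit, special symbol, 8-20 characters)
3. Tokenize mixed Chinese-English text
4. Efficiently extract all URLs from a large volume of text
5. Convert date formats (MM/DD/YYYY to YYYY-MM-DD)
6. Detect and highlight SQL injection signatures
7. Optimize multi-pattern matching (using a dictionary of regexes)
8. Parse non-standard CSV (fields containing commas)
9. Convert timestamps in multi-level logs
10. Build a simple template engine

Complete implementations of the 10 exercises follow; each solution includes explanatory comments.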

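Exercise 1: Extract a specific field from a nested JSON string (ignoring escaped quotes)

import json
import re

def extract_json_field(json_str, target_field):
    """
    Extract the values of a given field from a JSON string (handles escaped quotes).
    :param json_str: raw JSON string
    :param target_field: name of the field to extract
    :return: list of matched values
    """
    # Field value: a string (with backslash escapes), a number, true, false, or null
    pattern = re.compile(
        rf'"{re.escape(target_field)}"\s*:\s*'
        r'("(?:\\.|[^"\\])*"|-?\d+(?:\.\d+)?|true|false|null)'
    )
    # json.loads decodes every JSON scalar safely; eval would fail on
    # true/false/null and can execute arbitrary code
    return [json.loads(m.group(1)) for m in pattern.finditer(json_str)]

# Test case
sample_json = '''
{"user": "Alice", "data": "{\\"info\\": \\"secret\\"}", "age": 25}
'''
print(extract_json_field(sample_json, "data"))  # ['{"info": "secret"}']

Key point: the regex steps over backslash-escaped characters such as \" inside string values, and json.loads decodes the matched value.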
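When the input is well-formed JSON, parsing it and walking the resulting structure is usually more robust than matching text. The sketch below is one way to do that; the helper name find_field and the sample document are hypothetical.

```python
import json

def find_field(node, target_field):
    """Yield every value stored under target_field anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == target_field:
                yield value
            # Keep descending: the value itself may contain the field again
            yield from find_field(value, target_field)
    elif isinstance(node, list):
        for item in node:
            yield from find_field(item, target_field)

doc = json.loads('{"user": "Alice", "items": [{"id": 1}, {"id": 2, "meta": {"id": 3}}]}')
ids = list(find_field(doc, "id"))
print(ids)  # [1, 2, 3]
```

The regex approach remains useful when the text is only JSON-like or too damaged to parse.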

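Exercise 2: Validate a complex password rule (upper and lower case, digit, special symbol, 8-20 characters)

import re

def validate_password(password):
    """
    Validate password complexity:
    - at least 1 uppercase letter
    - at least 1 lowercase letter
    - at least 1 digit
    - at least 1 special symbol (!@#$%^&*)
    - length 8-20
    """
    # Each lookahead checks one requirement; the final class limits the allowed
    # characters and the length. Note that \w also admits underscores and
    # non-ASCII letters and digits.
    pattern = re.compile(
        r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*])[\w!@#$%^&*]{8,20}$'
    )
    return bool(pattern.fullmatch(password))

# Tests
print(validate_password("Abc123#"))   # False (too short)
print(validate_password("Abcdefg123!"))  # True

Tip: positive lookaheads (?=...) ensure that each required character type is present.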
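A single combined pattern only answers yes or no. If the caller needs to know which requirement failed, one option is to check the rules separately, as in the sketch below. The rule table and the name failed_rules are hypothetical; the rules mirror the docstring above but do not restrict the allowed character set.

```python
import re

# Hypothetical rule table: description -> pattern that must be found
PASSWORD_RULES = {
    "uppercase letter": r'[A-Z]',
    "lowercase letter": r'[a-z]',
    "digit": r'\d',
    "special symbol": r'[!@#$%^&*]',
    "length 8-20": r'\A.{8,20}\Z',
}

def failed_rules(password):
    """Return the descriptions of all rules the password does not satisfy."""
    return [name for name, rule in PASSWORD_RULES.items()
            if not re.search(rule, password)]

weak = failed_rules("abc123")
strong = failed_rules("Abcdefg123!")
print(weak)    # ['uppercase letter', 'special symbol', 'length 8-20']
print(strong)  # []
```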

Exercise 3: Tokenize mixed Chinese-English text

import re

# Attempt 1: \w+ for words, otherwise one Chinese character at a time
def split_chinese_english(text):
    """
    Tokenize mixed text: Chinese by character, English by word.
    """
    return re.findall(r'\b\w+\b|[\u4e00-\u9fa5]', text)

# Test
sample = "Hello世界！Python 代码 2023"
print(split_chinese_english(sample))  # ['Hello世界', 'Python', '代码', '2023']
# This does not separate the two scripts: in Python 3, \w also matches Chinese
# characters, so "Hello世界" is matched as a single word.


# Attempt 2: restrict "words" to ASCII letters and digits
def split_chinese_english(text):
    """
    Tokenize mixed text: Chinese by character, English by word.
    """
    pattern = r'[a-zA-Z0-9]+|[\u4e00-\u9fa5]'
    return re.findall(pattern, text)


# Test
sample = "Hello世界！Python代码2023"
print(split_chinese_english(sample)) # ['Hello', '世', '界', 'Python', '代', '码', '2023']


# Attempt 3: segment Chinese into words with jieba (third-party: pip install jieba)
import jieba

def split_chinese_english(text):
    tokens = []
    # Walk the text in order: ASCII alphanumeric runs, Chinese runs, or any
    # other single non-space character (such as punctuation)
    for m in re.finditer(r'[a-zA-Z0-9]+|[\u4e00-\u9fa5]+|\S', text):
        chunk = m.group()
        if '\u4e00' <= chunk[0] <= '\u9fa5':
            tokens.extend(jieba.lcut(chunk))  # word-level Chinese segmentation
        else:
            tokens.append(chunk)
    return tokens


# Test
sample = "Hello世界！Python代码2023"
print(split_chinese_english(sample)) # ['Hello', '世界', '！', 'Python', '代码', '2023']

How it works: [a-zA-Z0-9]+ matches runs of English letters and digits, and [\u4e00-\u9fa5] matches a single common Chinese character. \b\w+\b is unsuitable here because \w matches Chinese characters as well.

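Exercise 4: Efficiently extract all URLs from a large volume of text

import re
from urllib.parse import urlparse

# Compiled once and reused for every line; (?::\d+)? allows an optional port
URL_PATTERN = re.compile(
    r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+(?::\d+)?(?:[/?#][^\s"]*)?',
    re.IGNORECASE
)

def extract_urls_large_file(file_path):
    """Stream a large file line by line and yield the URLs it contains."""
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            yield from (urlparse(url).geturl() for url in URL_PATTERN.findall(line))

# Usage example
# for url in extract_urls_large_file("bigfile.log"):
#     print(url)

Optimizations: reading the file as a stream keeps memory use flat regardless of file size, and the regex is compiled once rather than per line.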

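Exercise 5: Convert date formats (MM/DD/YYYY to YYYY-MM-DD)

import re

def convert_date_format(text):
    """Convert MM/DD/YYYY dates to YYYY-MM-DD."""
    return re.sub(
        r'\b(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/(\d{4})\b',
        lambda m: f"{m.group(3)}-{m.group(1).zfill(2)}-{m.group(2).zfill(2)}",
        text
    )

# Test
print(convert_date_format("Date: 7/25/2023, 12/31/2024"))
# Output: Date: 2023-07-25, 2024-12-31

Note: zfill pads the month and day with a leading zero, and \b ensures that only complete dates are matched. The pattern checks ranges (month 1-12, day 1-31) but not calendar validity, so 02/31/2023 would still be converted.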
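If impossible dates should be left alone, one option is to let datetime.strptime do the validation inside the replacement function. This is a sketch under that assumption; the names DATE_RE and convert_valid_dates are introduced here for illustration.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'\b(\d{1,2})/(\d{1,2})/(\d{4})\b')

def convert_valid_dates(text):
    """Convert MM/DD/YYYY to YYYY-MM-DD, skipping dates that do not exist."""
    def repl(m):
        try:
            parsed = datetime.strptime(m.group(0), "%m/%d/%Y")
        except ValueError:
            return m.group(0)  # leave impossible dates untouched
        return parsed.strftime("%Y-%m-%d")
    return DATE_RE.sub(repl, text)

result = convert_valid_dates("ok 7/25/2023, bad 2/31/2023")
print(result)  # ok 2023-07-25, bad 2/31/2023
```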

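Exercise 6: Detect and highlight SQL injection signatures

import re

def highlight_sql_injection(text):
    """Highlight common SQL injection keywords and comment markers."""
    # \b keeps keywords from matching inside longer words (e.g. "selection")
    keywords = r'\b(?:union|select|insert|delete|update|drop)\b|--|/\*'
    return re.sub(
        f'({keywords})',
        lambda m: f'\033[31m{m.group(1)}\033[0m',  # ANSI red, then reset
        text,
        flags=re.IGNORECASE
    )

# Test
sample = "SELECT * FROM users WHERE id=1 UNION SELECT password FROM admins"
print(highlight_sql_injection(sample))

Effect: the high-risk keywords are shown in red on an ANSI-capable terminal.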

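Exercise 7: Optimize multi-pattern matching (dictionary of regexes)

import re

class FastMatcher:
    def __init__(self):
        # Compile each pattern once and look it up by name
        self.patterns = {
            'email': re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
            # The country code is optional so that plain 10-digit numbers also match
            'phone': re.compile(r'(?:\+?\d{1,3}[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}'),
            'hashtag': re.compile(r'#\w+')
        }

    def match_all(self, text):
        return {k: p.findall(text) for k, p in self.patterns.items()}

# Usage example
matcher = FastMatcher()
print(matcher.match_all("Contact: test@example.com #info 123-456-7890"))
# Output: {'email': ['test@example.com'], 'phone': ['123-456-7890'], 'hashtag': ['#info']}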
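The dictionary approach scans the text once per pattern. An alternative worth knowing is to join the patterns into one alternation with named groups and read match.lastgroup to see which branch matched, so the text is scanned once. The sketch below assumes the patterns do not need to overlap (each piece of text is claimed by the first branch that matches); TOKEN_RE and classify are names made up for the example, and whether this is actually faster depends on the patterns and the input.

```python
import re

# One alternation; each branch is a named group
TOKEN_RE = re.compile(
    r'(?P<email>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)'
    r'|(?P<hashtag>#\w+)'
)

def classify(text):
    """Group all matches by the name of the branch that produced them."""
    found = {name: [] for name in TOKEN_RE.groupindex}
    for m in TOKEN_RE.finditer(text):
        found[m.lastgroup].append(m.group())
    return found

result = classify("Contact: test@example.com #info #python")
print(result)  # {'email': ['test@example.com'], 'hashtag': ['#info', '#python']}
```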

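Exercise 8: Parse non-standard CSV (fields containing commas)

import re

# A field is either quoted (with "" as an escaped quote) or bare;
# \s* tolerates spaces between the comma and an opening quote
CSV_FIELD = re.compile(r'(?:^|,)\s*(?:"((?:[^"]|"")*)"|([^",]*))')

def parse_nonstandard_csv(csv_str):
    """Parse one CSV line whose quoted fields may contain commas (regex-based)."""
    fields = []
    for m in CSV_FIELD.finditer(csv_str):
        if m.group(1) is not None:
            fields.append(m.group(1).replace('""', '"'))  # unescape doubled quotes
        else:
            fields.append(m.group(2).strip())
    return fields

# Test
sample = '1, "Quote, test", "Escaped ""Double Quote"""'
print(parse_nonstandard_csv(sample))
# Output: ['1', 'Quote, test', 'Escaped "Double Quote"']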
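For comparison, the standard library's csv module handles this particular input without a custom regex: skipinitialspace=True ignores the space after each comma, and doubled quotes are unescaped by default. A minimal sketch on the same sample line:

```python
import csv
import io

sample = '1, "Quote, test", "Escaped ""Double Quote"""'

# csv.reader accepts any iterable of lines; StringIO wraps the single line here
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
rows = list(reader)
print(rows)  # [['1', 'Quote, test', 'Escaped "Double Quote"']]
```

A regex remains an option for formats the csv module cannot be configured to read.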

Exercise 9: Convert timestamps in multi-level logs

import re
from datetime import datetime

def convert_log_timestamp(log_line):
    """Convert [25/Jul/2023:15:30:22 +0800] to ISO 8601 format."""
    def repl(m):
        # %b parses the abbreviated month name according to the current locale
        dt = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
        return dt.isoformat()

    return re.sub(
        r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+\-]\d{4})\]',
        repl,
        log_line
    )

# Test
log = '[25/Jul/2023:15:30:22 +0800] "GET /" 200'
print(convert_log_timestamp(log))
# Output: 2023-07-25T15:30:22+08:00 "GET /" 200

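Exercise 10: Build a simple template engine

import re

class SimpleTemplate:
    def __init__(self, template):
        self.pattern = re.compile(r'\{\{(\w+)\}\}')  # matches {{name}} placeholders
        self.template = template

    def render(self, **context):
        # Unknown placeholders are replaced with an empty string
        return self.pattern.sub(
            lambda m: str(context.get(m.group(1), '')), self.template
        )

# Usage example
tpl = SimpleTemplate("Hello, {{name}}! Your score: {{score}}")
print(tpl.render(name="Alice", score=95))
# Output: Hello, Alice! Your score: 95

In real use, add exception handling, performance tuning, and similar logic as your needs require. Unit tests are a good way to verify edge cases.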
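As one example of such exception handling, the sketch below is a stricter variant of the template renderer: it tolerates whitespace inside the braces and raises KeyError for a missing variable instead of silently rendering an empty string. The names PLACEHOLDER and render_strict are hypothetical.

```python
import re

# Allows optional whitespace around the variable name: {{ name }} or {{name}}
PLACEHOLDER = re.compile(r'\{\{\s*(\w+)\s*\}\}')

def render_strict(template, **context):
    """Render {{name}} placeholders, failing loudly on unknown names."""
    def repl(m):
        name = m.group(1)
        if name not in context:
            raise KeyError(f"missing template variable: {name}")
        return str(context[name])
    return PLACEHOLDER.sub(repl, template)

greeting = render_strict("Hello, {{ name }}! Score: {{score}}", name="Alice", score=95)
print(greeting)  # Hello, Alice! Score: 95

try:
    render_strict("Hello, {{name}}!")
    missing_detected = False
except KeyError:
    missing_detected = True
print(missing_detected)  # True
```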

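5. Key Points for Performance

Precompile regexes: compile any pattern that is used repeatedly
Lazy matching: avoid the excessive backtracking a broad .* can cause; use .*? or a specific character class such as [^"]* instead
Atomic groups: (?>pattern) reduces backtracking (supported by the re module from Python 3.11)
Unicode: the re.ASCII flag restricts \w, \d, \s and \b to ASCII, which can speed up processing of English-only text
Alternatives: for simple text processing, prefer string methods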
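Two of the points above change behavior as well as speed, so it helps to see them in action. The sketch below shows what re.ASCII does to \w and how plain string methods cover simple checks. The sample strings are invented, and no timing claims are made here.

```python
import re

mixed = "Hello世界 123"

# By default \w is Unicode-aware; with re.ASCII it matches only [a-zA-Z0-9_]
unicode_words = re.findall(r'\w+', mixed)
ascii_words = re.findall(r'\w+', mixed, flags=re.ASCII)
print(unicode_words)  # ['Hello世界', '123']
print(ascii_words)    # ['Hello', '123']

# Simple checks need no regex at all
line = "2023-07-25 ERROR disk full"
has_error = "ERROR" in line
starts_with_year = line.startswith("2023-")
print(has_error, starts_with_year)  # True True
```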

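Mastering string processing and regular expressions will serve you well in data processing, log analysis, web development, and related areas. The techniques covered here take repeated practice: run each example yourself and try extending the exercises.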