AWS Lambda Workflow for Unzipping ZIP Files on S3

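This article gives the complete workflow and source code for an AWS Lambda function, written in Python, that unzips one or more ZIP archives stored on S3 into a specified S3 location. The archives can be named by an S3 event or selected by wildcard matching under a prefix, including any nested sub-prefixes.

The solution handles single files, batches of files, and wildcard-matched files spread across several sub-prefixes.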

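Solution overview

Trigger: an S3 event notification or a manual invocation
Processing logic:
- Parse the event (an S3 event record, or a custom event with a wildcard pattern)
- Download the ZIP archive (the code below buffers it in memory)
- Extract it to Lambda's temporary storage (/tmp) and upload the contents to the target S3 path
- Preserve the archive's internal directory structure
Safeguards:
- Size check on the uncompressed contents (Lambda storage limit)
- Only .zip files are processed
- Exception handling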
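
The safeguards listed above can be sketched without any AWS calls. The helper below, `check_archive`, is a hypothetical name and not part of the article's code; it assumes the archive is already held in memory and reuses the 500 MB figure from this article as its limit. Note that the sizes it sums are the ones declared in the archive's headers.

```python
import io
import zipfile

# Assumed limit, matching the 500 MB figure used in this article
MAX_UNCOMPRESSED_BYTES = 500 * 1024 * 1024


def check_archive(buffer, limit=MAX_UNCOMPRESSED_BYTES):
    """Return the declared uncompressed size, or raise ValueError if the check fails."""
    if not zipfile.is_zipfile(buffer):
        raise ValueError("Not a ZIP archive")
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        total = sum(info.file_size for info in archive.infolist())
    if total > limit:
        raise ValueError(f"Uncompressed size {total} exceeds limit {limit}")
    return total


# Build a tiny archive in memory to exercise the check
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("docs/readme.txt", "hello")
sample.seek(0)
print(check_archive(sample))  # 5
```

Running the check before extraction keeps a failing archive from writing anything to /tmp.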

Complete code


import boto3
import zipfile
import os
import io
import re
from urllib.parse import unquote_plus

s3 = boto3.client('s3')
TEMP_DIR = '/tmp/'

def lambda_handler(event, context):
    # Dispatch on the event source
    if 'Records' in event:  # S3 event notification
        handle_s3_event(event)
    else:  # manual invocation with a custom event
        handle_custom_event(event)

def handle_s3_event(event):
    """Handle an event delivered by an S3 trigger."""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 events
        key = unquote_plus(record['s3']['object']['key'])

        # Skip anything that is not a ZIP file
        if not key.lower().endswith('.zip'):
            print(f"Skipping non-zip file: {key}")
            continue

        process_zip_file(
            bucket,
            key,
            target_bucket=os.environ['TARGET_BUCKET'],
            target_prefix=os.environ.get('TARGET_PREFIX', 'unzipped/')
        )

def handle_custom_event(event):
    """Handle a custom event (with a wildcard pattern)."""
    source_bucket = event['source_bucket']
    source_prefix = event.get('source_prefix', '')
    pattern = event.get('pattern', '*.zip')
    target_bucket = event['target_bucket']
    target_prefix = event.get('target_prefix', 'unzipped/')

    # Collect the keys that match the pattern
    zip_files = list_matching_files(source_bucket, source_prefix, pattern)

    for key in zip_files:
        process_zip_file(
            source_bucket,
            key,
            target_bucket,
            target_prefix
        )

def list_matching_files(bucket, prefix, pattern):
    """List keys under prefix whose file name matches the wildcard pattern."""
    # Escape regex metacharacters (such as '.') and turn each '*' into '.*'
    pattern_re = re.compile('.*'.join(re.escape(part) for part in pattern.split('*')))
    files = []
    paginator = s3.get_paginator('list_objects_v2')

    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            # Require a full match on the file name, plus a .zip extension
            if pattern_re.fullmatch(os.path.basename(key)) and key.lower().endswith('.zip'):
                files.append(key)

    print(f"Found {len(files)} files matching pattern: {pattern}")
    return files

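def process_zip_file(source_bucket, source_key, target_bucket, target_prefix):
    """Extract one ZIP archive from S3 and upload its contents to the target path."""
    import shutil  # imported locally; used only for the /tmp cleanup below

    print(f"Processing: s3://{source_bucket}/{source_key}")
    # Extraction directory under /tmp, named after the archive
    extract_dir = os.path.join(TEMP_DIR, os.path.basename(source_key))

    try:
        # Download the ZIP archive into memory
        zip_buffer = io.BytesIO()
        s3.download_fileobj(source_bucket, source_key, zip_buffer)
        zip_buffer.seek(0)

        os.makedirs(extract_dir, exist_ok=True)

        with zipfile.ZipFile(zip_buffer) as zip_ref:
            # Check the declared uncompressed size before extracting
            total_size = sum(f.file_size for f in zip_ref.infolist())
            if total_size > 500 * 1024 * 1024:  # 500 MB, under the default 512 MB of /tmp
                raise ValueError("Uncompressed size exceeds Lambda storage limit")

            zip_ref.extractall(extract_dir)

        # Upload the extracted tree to <target_prefix>/<archive name>/
        upload_directory_to_s3(
            extract_dir,
            target_bucket,
            os.path.join(target_prefix, os.path.splitext(os.path.basename(source_key))[0])
        )

        print(f"Successfully processed: {source_key}")

    except Exception as e:
        print(f"Error processing {source_key}: {e}")
        raise
    finally:
        # /tmp persists across warm invocations, so remove the extracted files
        shutil.rmtree(extract_dir, ignore_errors=True)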

def upload_directory_to_s3(directory, bucket, prefix):
    """Upload every file under directory to S3, keeping relative paths."""
    for root, _, files in os.walk(directory):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, directory)
            # S3 keys always use forward slashes
            s3_key = os.path.join(prefix, relative_path).replace('\\', '/')

            s3.upload_file(local_path, bucket, s3_key)
            print(f"Uploaded: s3://{bucket}/{s3_key}")
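
As a variation on the extract-then-upload approach in the code above, archive members can also be streamed one at a time so that nothing is written to /tmp. The sketch below is not the article's implementation: `stream_members` and its `put_object` callback are hypothetical names, and in a real function the callback would wrap a call such as `s3.upload_fileobj(body, bucket, key)`.

```python
import io
import zipfile


def stream_members(zip_file, put_object, prefix):
    """Pass each file in the archive to put_object(key, file_like) without touching disk."""
    uploaded = []
    with zipfile.ZipFile(zip_file) as archive:
        for info in archive.infolist():
            if info.is_dir():  # directory entries carry no data
                continue
            key = f"{prefix.rstrip('/')}/{info.filename}"
            with archive.open(info) as member:
                put_object(key, member)
            uploaded.append(key)
    return uploaded


# Demo with an in-memory archive and a dict standing in for the bucket
fake_bucket = {}
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as archive:
    archive.writestr("a.txt", "alpha")
    archive.writestr("nested/b.txt", "beta")
source.seek(0)

keys = stream_members(
    source,
    lambda key, body: fake_bucket.update({key: body.read()}),
    "unzipped/data",
)
print(keys)  # ['unzipped/data/a.txt', 'unzipped/data/nested/b.txt']
```

This trades the /tmp size limit for one upload per member while the archive stays in memory; whether that is preferable depends on the archive sizes involved.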

Deployment steps

1. Create the IAM role

Create an IAM role with the following permissions:


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::SOURCE_BUCKET",
                "arn:aws:s3:::SOURCE_BUCKET/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::TARGET_BUCKET/*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:*",
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}

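2. Create the Lambda function

Runtime: Python 3.9+
Timeout: 15 minutes (the Lambda maximum)
Memory: 1024 MB (adjust to the file size; the code buffers the ZIP archive in memory)
Ephemeral storage: 512 MB (the default; configurable up to 10,240 MB, in which case the 500 MB check in the code should be raised to match)
Environment variables:
- TARGET_BUCKET: name of the target bucket
- TARGET_PREFIX: target key prefix (optional, defaults to unzipped/)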
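
The settings above can be collected into the keyword arguments that boto3's Lambda `create_function` call accepts. This is a sketch under stated assumptions: `build_function_config` is a hypothetical helper, the runtime identifier and the `lambda_function.lambda_handler` handler name are assumptions about how the code is packaged, and the required `Code` argument (the deployment package) is deliberately left out.

```python
def build_function_config(function_name, role_arn, target_bucket,
                          target_prefix="unzipped/", runtime="python3.12"):
    """Return keyword arguments shaped for boto3's Lambda create_function (without Code)."""
    return {
        "FunctionName": function_name,
        "Runtime": runtime,  # assumed runtime identifier
        "Role": role_arn,
        # Assumes the code is saved as lambda_function.py
        "Handler": "lambda_function.lambda_handler",
        "Timeout": 900,                     # seconds, i.e. 15 minutes
        "MemorySize": 1024,                 # MB
        "EphemeralStorage": {"Size": 512},  # MB available under /tmp
        "Environment": {
            "Variables": {
                "TARGET_BUCKET": target_bucket,
                "TARGET_PREFIX": target_prefix,
            }
        },
    }


config = build_function_config(
    "unzip-s3-archives",                            # hypothetical function name
    "arn:aws:iam::123456789012:role/unzip-role",    # placeholder role ARN
    "your-target-bucket",
)
print(config["Timeout"], config["EphemeralStorage"]["Size"])  # 900 512
```

With a deployment package at hand, the result could be passed as `boto3.client("lambda").create_function(Code=..., **config)`.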

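3. Configure a trigger (optional)

Option 1: Automatic S3 trigger

Create an event notification on the source S3 bucket
Event type: s3:ObjectCreated:*
Prefix: your/zip/files/path/ (optional)
Suffix: .zip
Destination: the Lambda function created above

Option 2: Manual invocation (wildcard matching)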
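
The same notification can be expressed as the `NotificationConfiguration` structure that boto3's `put_bucket_notification_configuration` accepts. The helper name `build_notification_config` and the function ARN are placeholders introduced for illustration.

```python
def build_notification_config(function_arn, prefix="", suffix=".zip"):
    """Build an S3 notification configuration that invokes a Lambda function for new ZIP files."""
    rules = []
    if prefix:  # the prefix filter is optional
        rules.append({"Name": "prefix", "Value": prefix})
    rules.append({"Name": "suffix", "Value": suffix})
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": rules}},
            }
        ]
    }


notification = build_notification_config(
    "arn:aws:lambda:us-east-1:123456789012:function:unzip-s3-archives",  # placeholder ARN
    prefix="your/zip/files/path/",
)
print(notification["LambdaFunctionConfigurations"][0]["Filter"]["Key"]["FilterRules"])
```

Applying it would look like `s3.put_bucket_notification_configuration(Bucket="your-source-bucket", NotificationConfiguration=notification)`. S3 also needs permission to invoke the function, which is granted through the function's resource-based policy.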


{
  "source_bucket": "your-source-bucket",
  "source_prefix": "path/to/zips/",
  "pattern": "*.zip",
  "target_bucket": "your-target-bucket",
  "target_prefix": "output/path/"
}
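
Before sending a manual event like the one above, it can help to validate it and fill in the same defaults the handler uses. `normalize_event` is a hypothetical helper for the calling side; the required keys and default values mirror what the handler reads from the custom event.

```python
import json

REQUIRED_KEYS = ("source_bucket", "target_bucket")
# Same defaults the handler applies when a key is absent
DEFAULTS = {"source_prefix": "", "pattern": "*.zip", "target_prefix": "unzipped/"}


def normalize_event(event):
    """Check required keys and fill in defaults for a manual-invocation event."""
    missing = [name for name in REQUIRED_KEYS if name not in event]
    if missing:
        raise KeyError(f"Missing required keys: {', '.join(missing)}")
    return {**DEFAULTS, **event}


event = normalize_event({
    "source_bucket": "your-source-bucket",
    "source_prefix": "path/to/zips/",
    "target_bucket": "your-target-bucket",
})
payload = json.dumps(event).encode("utf-8")
print(event["pattern"], event["target_prefix"])  # *.zip unzipped/
```

The resulting bytes could be passed as the `Payload` argument of `boto3.client("lambda").invoke(FunctionName=..., Payload=payload)`.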

Features

Wildcard support:
- *.zip: matches every ZIP file
- data-*.zip: matches files whose names start with data-

Path preservation:
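
The wildcard behaviour described above can be tried locally. This sketch uses the standard library's `fnmatch` module rather than the regular-expression translation in the Lambda code, so it also accepts `?` and `[...]` patterns; the `matches` helper and the sample keys are made up for illustration.

```python
import fnmatch
import posixpath


def matches(key, pattern):
    """Return True if the file name part of an S3 key matches the wildcard pattern."""
    name = posixpath.basename(key)
    # fnmatchcase is case-sensitive and must match the whole name
    return fnmatch.fnmatchcase(name, pattern) and name.lower().endswith(".zip")


sample_keys = [
    "in/data-2023.zip",
    "in/sub/data-1.ZIP",
    "in/report.zip",
    "in/data-x.zipped",
]
print([key for key in sample_keys if matches(key, "data-*.zip")])
# ['in/data-2023.zip']
```

Matching is applied to the file name only, so a pattern never has to account for the directory part of the key.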


s3://source-bucket/path/to/file.zip
→ s3://target-bucket/output/path/file/file.txt
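
The mapping above follows from three pieces: the target prefix, the archive name without its extension, and the member's path inside the archive. A minimal sketch, with `target_key` as a hypothetical helper name:

```python
import posixpath


def target_key(source_key, target_prefix, member_path):
    """Compute the S3 key for one archive member."""
    # file.zip -> file
    archive_name = posixpath.splitext(posixpath.basename(source_key))[0]
    return posixpath.join(target_prefix, archive_name, member_path)


print(target_key("path/to/file.zip", "output/path/", "file.txt"))
# output/path/file/file.txt
print(target_key("path/to/file.zip", "output/path/", "docs/guide.txt"))
# output/path/file/docs/guide.txt
```

Using `posixpath` keeps the separator a forward slash regardless of the platform the code runs on.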

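Safeguards:
- Checks that the file is a ZIP archive
- Verifies the uncompressed size (<500 MB)
- Catches exceptions and logs them

Test cases

Test event 1 (S3 trigger):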


{
  "Records": [
    {
      "s3": {
        "bucket": {"name": "test-bucket"},
        "object": {"key": "uploads/data.zip"}
      }
    }
  ]
}
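
Object keys in real S3 events are URL-encoded, with spaces arriving as `+`, which is why the handler decodes them with `unquote_plus`. The sketch below shows the decoding on a hand-written event; `keys_from_event` is a hypothetical helper and the key is made up.

```python
from urllib.parse import unquote_plus


def keys_from_event(event):
    """Return (bucket, decoded key) pairs from an S3 event."""
    return [
        (record["s3"]["bucket"]["name"], unquote_plus(record["s3"]["object"]["key"]))
        for record in event.get("Records", [])
    ]


encoded_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "test-bucket"},
                "object": {"key": "uploads/annual+report.zip"},
            }
        }
    ]
}
print(keys_from_event(encoded_event))
# [('test-bucket', 'uploads/annual report.zip')]
```

A test event with an encoded key is a useful companion to the plain one above, since a missing decode step only fails on keys that contain such characters.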

Test event 2 (wildcard matching):


{
  "source_bucket": "archive-bucket",
  "source_prefix": "2023-08/",
  "pattern": "backup-*.zip",
  "target_bucket": "processed-data",
  "target_prefix": "unpacked/"
}

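Notes

File size limits:
- Each ZIP archive ≤ 450 MB compressed (the code buffers the whole archive in memory)
- Total uncompressed size ≤ 500 MB
Execution timeout:
- Large files need a longer timeout (the Lambda maximum is 15 minutes)
- For very large numbers of files, process them in batches
Target path:
- S3 has no real directories, so missing prefixes need no explicit creation
- Existing objects with the same key are overwritten
Safety recommendations:
- Add a DLQ for failed events
- Enable Lambda execution logging
- Use S3 versioning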
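
The batching advice in the execution-timeout note can be sketched as a simple chunking helper. `chunked` is a hypothetical name, and handing each batch to its own invocation would require extending the custom event with an explicit key list, which the handler in this article does not support.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


zip_keys = [f"2023-08/backup-{n}.zip" for n in range(5)]
batches = list(chunked(zip_keys, 2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Keeping each batch small enough to finish well inside the 15-minute limit leaves room for slow downloads and retries.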