3D Vision & Point Cloud Processing

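[3D Vision Fundamentals](#3D 视觉基础理论)
Point Cloud Data and Representations
[The PointNet Family](#PointNet 系列)
[3D Object Detection](#3D 目标检测)
[Neural Radiance Fields (NeRF)](#神经辐射场 NeRF)
[3D Gaussian Splatting](#3D Gaussian Splatting)
Multi-View Geometry and 3D Reconstruction
[SLAM](#SLAM 技术)
[3D Generative Models](#3D 生成模型)
Evaluation Metrics and Applications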

1. 3D Vision Fundamentals

1.1 What Is 3D Vision

The goal of 3D vision:
  Understand the three-dimensional world from 2D images or other sensor data

  3D vision tasks

  Perception:
  ├── 3D object detection: localize and classify objects in 3D space
  ├── 3D semantic segmentation: assign semantic labels to points or mesh faces
  ├── Depth estimation: estimate depth from monocular or stereo images
  └── 3D pose estimation: estimate the 3D pose of objects or humans

  Reconstruction:
  ├── 3D reconstruction: recover a 3D model from images
  ├── NeRF: neural radiance fields, an implicit 3D representation
  ├── 3DGS: 3D Gaussian Splatting, an explicit 3D representation
  └── SLAM: simultaneous localization and mapping

  Generation:
  ├── 3D generation: create 3D models from text or images
  ├── 3D editing: modify an existing 3D scene
  └── Novel view synthesis: render a scene from new viewpoints

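1.2 3D Data Representations

Common ways to represent 3D data:

  1. Point cloud:
     An unordered set of points {(x, y, z, ...)}
     Each point has coordinates plus optional attributes (color, normal, etc.)

     Pros: flexible, preserves geometric detail
     Cons: no topology, irregular structure

  2. Voxel grid:
     A regular 3D grid in which each cell stores a value (occupied / empty)

     Pros: regular, so 3D convolutions apply directly
     Cons: memory grows cubically with resolution, which limits the usable resolution

  3. Mesh:
     Vertices + edges + faces

     Pros: explicit topology, efficient to render
     Cons: hard to construct

  4. Implicit representation:
     f(x, y, z) = 0 (the surface)
     or f(x, y, z) ∈ [0, 1] (occupancy probability)

     Pros: continuous, can be queried at any resolution
     Cons: slow inference, since extracting a surface or rendering needs many function queries

  5. Radiance field:
     (x, y, z, θ, φ) → (r, g, b, σ)

     position + viewing direction → color + density

     NeRF uses this representation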
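
To make the trade-off between the first two representations concrete, here is a minimal sketch that converts a point cloud into a dense occupancy grid. The function name `voxelize` and the toy points are illustrative assumptions, not part of the original article.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Convert points [N, 3] into a dense boolean occupancy grid."""
    origin = points.min(axis=0)
    indices = np.floor((points - origin) / voxel_size).astype(int)   # [N, 3]
    grid = np.zeros(tuple(indices.max(axis=0) + 1), dtype=bool)
    grid[indices[:, 0], indices[:, 1], indices[:, 2]] = True
    return grid, origin

points = np.array([[0.0, 0.0, 0.0],
                   [0.05, 0.02, 0.01],   # falls into the same voxel as the first point
                   [0.25, 0.0, 0.0],
                   [0.0, 0.35, 0.15]])
grid, origin = voxelize(points, voxel_size=0.1)
print(grid.shape, int(grid.sum()))
```

Four points collapse into three occupied cells of a 24-cell grid. The grid is regular but mostly empty, which is the memory cost noted above.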

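1.3 Camera Model and Coordinate Systems

Camera projection pipeline:

  3D world coordinates → camera coordinates → image coordinates → pixel coordinates

  World coordinate system:
    Global frame, unit: meters

  Camera coordinate system:
    Origin at the camera center, z axis pointing forward along the optical axis

  Image coordinate system:
    2D image plane, unit: meters

  Pixel coordinate system:
    2D image plane, unit: pixels

Projection equations:
  Let the 3D point in homogeneous world coordinates be P_w = [X_w, Y_w, Z_w, 1]^T

  Camera coordinates: P_c = [R|t] · P_w = [X_c, Y_c, Z_c]^T

  Homogeneous pixel coordinates: [u', v', w']^T = K · P_c, with w' = Z_c

  Pixel coordinates: u = u'/w' = f_x · X_c/Z_c + c_x
                     v = v'/w' = f_y · Y_c/Z_c + c_y

Intrinsic matrix K:
  K = [[f_x, 0, c_x],
       [0, f_y, c_y],
       [0,  0,   1 ]]

  f_x, f_y: focal lengths (in pixels)
  c_x, c_y: principal point (in pixels)

Extrinsics [R|t]:
  R: rotation matrix (3×3)
  t: translation vector (3×1)
  They map world coordinates into the camera frame; the camera center in world coordinates is C = -R^T · t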
import torch
import numpy as np

class CameraModel:
    """
    Pinhole camera projection model

    Theory:
      Projects 3D world coordinates to 2D pixel coordinates

      Projection equations (X_c, Y_c, Z_c are camera-frame coordinates):
        u = f_x · X_c/Z_c + c_x
        v = f_y · Y_c/Z_c + c_y
    """
    def __init__(self, intrinsic, extrinsic=None):
        """
        intrinsic: intrinsic matrix [3, 3]
        extrinsic: world-to-camera matrix [3, 4] or [4, 4]
        """
        self.K = intrinsic
        # Take only the top three rows so that a [4, 4] matrix also works
        self.R = extrinsic[:3, :3] if extrinsic is not None else np.eye(3)
        self.t = extrinsic[:3, 3] if extrinsic is not None else np.zeros(3)
    
    def project(self, points_3d):
        """
        Project 3D points to 2D

        points_3d: [N, 3] world coordinates
        returns: [N, 2] pixel coordinates (meaningful only for points with Z_c > 0)
        """
        # World → camera
        points_cam = (self.R @ points_3d.T).T + self.t
        
        # Camera → homogeneous pixel coordinates
        points_img = (self.K @ points_cam.T).T
        
        # Perspective division: homogeneous → pixel coordinates
        u = points_img[:, 0] / points_img[:, 2]
        v = points_img[:, 1] / points_img[:, 2]
        
        return np.stack([u, v], axis=-1)
    
    def unproject(self, points_2d, depths):
        """
        Back-project 2D pixel coordinates to 3D

        points_2d: [N, 2] pixel coordinates
        depths: [N] depth values (z coordinate in the camera frame)
        returns: [N, 3] camera coordinates
        """
        # Pixel → normalized camera coordinates (z = 1)
        K_inv = np.linalg.inv(self.K)
        points_homo = np.concatenate([points_2d, np.ones_like(points_2d[:, :1])], axis=1)
        points_norm = (K_inv @ points_homo.T).T
        
        # Scale by depth
        points_3d = points_norm * depths[:, None]
        
        return points_3d

"""
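Coordinate transforms:

  World → camera: P_c = R · P_w + t
  Camera → homogeneous pixel: [u', v', w']^T = K · P_c
  Perspective division: u = f_x · X_c/Z_c + c_x, v = f_y · Y_c/Z_c + c_y
  
  Back-projection:
    Given a pixel (u, v) with depth d (z coordinate in the camera frame):
    X_c = (u - c_x) · d / f_x
    Y_c = (v - c_y) · d / f_y
    Z_c = d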
"""

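As a sanity check on the transforms above, the following sketch projects a few world points to pixels and back-projects them to world coordinates, which should reproduce the inputs. The camera parameters are made-up values and the standalone functions `project` and `unproject` are illustrative assumptions (they are not the `CameraModel` methods).

```python
import numpy as np

def project(K, R, t, points_world):
    """World points [N, 3] -> pixel coordinates [N, 2] and camera-frame depths [N]."""
    points_cam = points_world @ R.T + t          # world -> camera
    uvw = points_cam @ K.T                       # camera -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3], points_cam[:, 2]

def unproject(K, R, t, pixels, depths):
    """Pixels [N, 2] with depths [N] -> world points [N, 3]."""
    ones = np.ones((len(pixels), 1))
    rays = np.hstack([pixels, ones]) @ np.linalg.inv(K).T   # normalized rays (z = 1)
    points_cam = rays * depths[:, None]
    return (points_cam - t) @ R                  # camera -> world: R^T (P_c - t)

# Hypothetical camera: 500 px focal length, 640x480 image, rotated 30 degrees about y
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
a = np.deg2rad(30.0)
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([0.1, -0.2, 2.0])

points_world = np.array([[0.0, 0.0, 3.0], [0.5, -0.3, 4.0], [-1.0, 0.2, 5.0]])
pixels, depths = project(K, R, t, points_world)
recovered = unproject(K, R, t, pixels, depths)
print(np.allclose(recovered, points_world))
```

The round trip only works because the depth of each pixel is kept. A pixel alone defines a ray, not a point.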
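2. Point Cloud Data and Representations

2.1 Properties of Point Clouds

Properties of point clouds:

  Definition: an unordered set of 3D points
  
  P = {p₁, p₂, ..., pₙ}, pᵢ ∈ ℝ³ (or ℝ⁶ with color)

  Properties:
    1. Unordered: permuting the points does not change the shape they represent
    2. Sparse: points cover only a small part of 3D space
    3. Irregular: the number of points and their density are not fixed
    4. Local structure: neighboring points usually belong to the same object

  Challenges:
    Standard CNNs expect regular grids and cannot consume point clouds directly
    Dedicated network architectures are needed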

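2.2 Point Cloud Preprocessing

class PointCloudPreprocessing:
    """Point cloud preprocessing (NumPy, points has shape [N, 3])"""
    
    @staticmethod
    def normalize(points):
        """
        Normalize a point cloud

        Theory:
          Center the cloud at its centroid and scale it into the unit sphere,
          which gives the network inputs in a consistent range
        """
        centroid = points.mean(axis=0)
        points = points - centroid
        max_dist = np.max(np.sqrt(np.sum(points ** 2, axis=1)))
        points = points / max_dist
        return points
    
    @staticmethod
    def random_sample(points, num_points):
        """
        Randomly sample a fixed number of points

        Theory:
          Different clouds contain different numbers of points
          A fixed count is needed for batching
        """
        if len(points) >= num_points:
            indices = np.random.choice(len(points), num_points, replace=False)
        else:
            # Too few points: sample with replacement (duplicates are expected)
            indices = np.random.choice(len(points), num_points, replace=True)
        return points[indices]
    
    @staticmethod
    def farthest_point_sample(points, num_points):
        """
        Farthest point sampling (FPS)

        Theory:
          Greedy algorithm: repeatedly pick the point farthest from those already selected
          Produces samples that cover the shape evenly

        Algorithm:
          1. Pick a random starting point
          2. For every point, track its minimum distance to the selected set
          3. Select the point with the largest such distance
          4. Repeat until enough points are sampled
        """
        N, _ = points.shape
        centroids = np.zeros(num_points, dtype=int)
        distances = np.ones(N) * 1e10
        
        # Random starting point
        farthest = np.random.randint(0, N)
        
        for i in range(num_points):
            centroids[i] = farthest
            
            centroid = points[farthest]
            dist = np.sum((points - centroid) ** 2, axis=1)
            distances = np.minimum(distances, dist)
            
            farthest = np.argmax(distances)
        
        return points[centroids]
    
    @staticmethod
    def voxel_downsample(points, voxel_size):
        """
        Voxel downsampling

        Theory:
          Partition space into a voxel grid
          Keep one representative point per voxel (here the centroid)

        Advantages:
          - Even sampling
          - Preserves geometric structure
          - Reduces the point count
        """
        # Integer voxel index of each point
        voxel_indices = np.floor(points / voxel_size).astype(int)
        
        # Group point indices by voxel
        voxel_dict = {}
        for i, idx in enumerate(voxel_indices):
            key = tuple(idx)
            if key not in voxel_dict:
                voxel_dict[key] = []
            voxel_dict[key].append(i)
        
        # Centroid of each voxel
        downsampled = []
        for indices in voxel_dict.values():
            centroid = points[indices].mean(axis=0)
            downsampled.append(centroid)
        
        return np.array(downsampled)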

"""
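Comparison of sampling methods:

  Random sampling: simple and fast, but may lose detail in sparsely sampled regions
  Farthest point sampling: even coverage that preserves geometric structure, at O(N · num_points) cost
  Voxel downsampling: even coverage, with density controlled by the voxel size
  
  FPS is the usual choice inside PointNet++-style networks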
"""

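The dictionary-based voxel downsampling above loops over points in Python. A possible vectorized variant, sketched here under the assumption that `np.unique` with `return_inverse` is an acceptable grouping step, computes the same per-voxel centroids. The function name `voxel_downsample_vectorized` is hypothetical.

```python
import numpy as np

def voxel_downsample_vectorized(points, voxel_size):
    """Return one centroid per occupied voxel for points [N, 3]."""
    voxel_indices = np.floor(points / voxel_size).astype(np.int64)
    # inverse[i] is the row of unique_voxels that point i falls into
    unique_voxels, inverse = np.unique(voxel_indices, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(unique_voxels), 3))
    np.add.at(sums, inverse, points)                 # accumulate coordinates per voxel
    counts = np.bincount(inverse, minlength=len(unique_voxels))
    return sums / counts[:, None]

points = np.array([[0.1, 0.1, 0.1],
                   [0.3, 0.1, 0.1],    # shares voxel (0, 0, 0) with the first point
                   [1.2, 0.2, 0.2],
                   [1.4, 0.4, 0.2]])   # shares voxel (1, 0, 0) with the third point
centroids = voxel_downsample_vectorized(points, voxel_size=1.0)
print(centroids)
```

The output is ordered by voxel index rather than by first appearance, so it matches the loop-based version up to row order.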
2.3 Point Cloud Augmentation

class PointCloudAugmentation:
    """Point cloud data augmentation"""
    
    @staticmethod
    def random_rotation(points):
        """
        Random rotation about the z (up) axis

        Theory:
          Makes the model more robust to rotation
          The label of a shape should not change when the shape is rotated
        """
        theta = np.random.uniform(0, 2 * np.pi)
        rotation_matrix = np.array([
            [np.cos(theta), -np.sin(theta), 0],
            [np.sin(theta), np.cos(theta), 0],
            [0, 0, 1]
        ])
        return points @ rotation_matrix.T
    
    @staticmethod
    def random_scale(points, scale_range=(0.8, 1.2)):
        """
        Random uniform scaling
        """
        scale = np.random.uniform(*scale_range)
        return points * scale
    
    @staticmethod
    def random_translation(points, translation_range=0.1):
        """
        Random translation
        """
        translation = np.random.uniform(-translation_range, translation_range, 3)
        return points + translation
    
    @staticmethod
    def random_jitter(points, sigma=0.01):
        """
        Random jitter (additive Gaussian noise)

        Theory:
          Simulates sensor noise
          Improves robustness
        """
        noise = np.random.normal(0, sigma, points.shape)
        return points + noise
    
    @staticmethod
    def random_dropout(points, max_dropout_ratio=0.2):
        """
        Random point dropout

        Theory:
          Simulates occlusion and incomplete scans
          Improves robustness to missing data
        """
        dropout_ratio = np.random.uniform(0, max_dropout_ratio)
        num_drop = int(len(points) * dropout_ratio)
        indices = np.random.choice(len(points), len(points) - num_drop, replace=False)
        return points[indices]

"""
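Augmentation strategy:

  Training:
    - Random rotation (about the z axis or about an arbitrary axis)
    - Random scaling
    - Random translation
    - Random jitter
    - Random point dropout
    
  Testing:
    - Usually normalization only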
"""

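The class above only rotates about the z axis. For rotation about an arbitrary axis, one common approach (an assumption here, not something the original code shows) is to draw a random rotation matrix from the QR decomposition of a Gaussian matrix. The function name `random_rotation_matrix` is hypothetical.

```python
import numpy as np

def random_rotation_matrix(rng):
    """Draw a random 3x3 rotation matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))        # fix the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:           # reflections have det = -1; flip one axis
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
R = random_rotation_matrix(rng)
points = rng.normal(size=(100, 3))
rotated = points @ R.T

# A rotation preserves distances from the origin
print(np.allclose(np.linalg.norm(points, axis=1), np.linalg.norm(rotated, axis=1)))
```

Full 3D rotation suits object-level datasets. For driving scenes, where gravity fixes the up axis, z-only rotation is usually the safer choice.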
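3. The PointNet Family

3.1 PointNet

Paper: "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation"
       (Qi et al., 2017)

Key contributions:
  1. Operates directly on point clouds (no voxelization)
  2. A symmetric function guarantees permutation invariance
  3. Trained end to end

Core problem:
  A point cloud is unordered, but a generic neural network is sensitive to input order
  The output must not depend on the order of the points

Solution:
  Aggregate per-point features with a symmetric function (such as max pooling)
  
  f({p₁, p₂, ..., pₙ}) = γ(max{h(p₁), h(p₂), ..., h(pₙ)})
  
  Max pooling is permutation invariant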
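import torch
import torch.nn as nn

class PointNet(nn.Module):
    """
    PointNet (classification network)

    Key ideas:
      1. A shared MLP extracts per-point features
      2. Max pooling guarantees permutation invariance
      3. The pooled global feature is used for classification
         (segmentation would concatenate it back onto the per-point features)

    Theory:
      The symmetric function (max pooling) guarantees
        f(P) = f(π(P))  for any permutation π
    """
    def __init__(self, num_classes, use_transform=True):
        super().__init__()
        
        # Per-point feature extraction (shared MLP, implemented as 1×1 convolutions)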
        self.mlp1 = nn.Sequential(
            nn.Conv1d(3, 64, 1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 64, 1),
            nn.BatchNorm1d(64),
            nn.ReLU()
        )
        
        self.mlp2 = nn.Sequential(
            nn.Conv1d(64, 128, 1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 1024, 1),
            nn.BatchNorm1d(1024),
            nn.ReLU()
        )
        
        # Input (3×3) and feature (64×64) transform networks
        if use_transform:
            self.input_transform = TransformNet(3, 3)
            self.feature_transform = TransformNet(64, 64)
        else:
            self.input_transform = None
            self.feature_transform = None
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes)
        )
    
    def forward(self, x):
        """
        x: [B, N, 3] point cloud
        returns: [B, num_classes] classification logits
        """
        B, N, _ = x.shape
        
        # Transpose to [B, 3, N], the layout Conv1d expects
        x = x.permute(0, 2, 1)
        
        # Input transform
        if self.input_transform is not None:
            T = self.input_transform(x)
            x = torch.bmm(T, x)
        
        # Per-point features
        x = self.mlp1(x)  # [B, 64, N]
        
        # Feature transform
        if self.feature_transform is not None:
            T = self.feature_transform(x)
            x = torch.bmm(T, x)
        
        # Higher-dimensional per-point features
        x = self.mlp2(x)  # [B, 1024, N]
        
        # Global feature (max pooling over the points)
        global_feature = torch.max(x, dim=2)[0]  # [B, 1024]
        
        # Classification
        output = self.classifier(global_feature)
        
        return output

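class TransformNet(nn.Module):
    """
    Transform network (T-Net)

    Predicts a transformation matrix for the input points or for the features

    Theory:
      Aligns the points/features to a canonical space
      Makes the model more robust to geometric transformations
    """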
    def __init__(self, input_dim, output_dim):
        super().__init__()
        
        self.mlp = nn.Sequential(
            nn.Conv1d(input_dim, 64, 1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, 1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 1024, 1),
            nn.BatchNorm1d(1024),
            nn.ReLU()
        )
        
        self.fc = nn.Sequential(
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256, output_dim * output_dim)
        )
        
        self.output_dim = output_dim
    
    def forward(self, x):
        """
        x: [B, D, N]
        returns: [B, D, D] transformation matrix
        """
        # Global feature
        x = self.mlp(x)
        x = torch.max(x, dim=2)[0]
        
        # Predict the transformation matrix
        x = self.fc(x)
        x = x.view(-1, self.output_dim, self.output_dim)
        
        # Add the identity so that the predicted transform starts close to the identity
        identity = torch.eye(self.output_dim, device=x.device).unsqueeze(0)
        x = x + identity
        
        return x

"""
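Theoretical analysis of PointNet:

  Theorem: PointNet can approximate any set function that is continuous
  with respect to the Hausdorff distance, given a wide enough max-pooling layer
  
  f({p₁, ..., pₙ}) ≈ γ(max{h(p₁), ..., h(pₙ)})
  
  where:
    h: per-point feature extractor (shared MLP)
    max: symmetric function (element-wise max pooling)
    γ: global feature processor (fully connected layers)
    
  Key points:
    Max pooling guarantees permutation invariance
    An MLP can approximate any continuous function on a compact domain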
"""

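The invariance claim f(P) = f(π(P)) is easy to check numerically. The sketch below uses NumPy and random, untrained weights for h and γ (all names and sizes are illustrative assumptions) and compares the output for a cloud and a shuffled copy of it.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 16)), rng.normal(size=16)    # hypothetical weights of h
W2, b2 = rng.normal(size=(16, 4)), rng.normal(size=4)     # hypothetical weights of γ

def set_function(points):
    """f(P) = γ(max_i h(p_i)) for points [N, 3]."""
    per_point = np.maximum(points @ W1 + b1, 0.0)    # h: shared layer + ReLU, [N, 16]
    pooled = per_point.max(axis=0)                   # symmetric max pooling, [16]
    return pooled @ W2 + b2                          # γ: acts on the global feature, [4]

points = rng.normal(size=(128, 3))
shuffled = points[rng.permutation(len(points))]

out_original = set_function(points)
out_shuffled = set_function(shuffled)
print(np.allclose(out_original, out_shuffled))
```

Replacing the max with a flatten-and-multiply step would break the equality, which is why the aggregation has to be symmetric.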
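3.2 PointNet++

Paper: "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space"
       (Qi et al., 2017)

Key improvements:
  1. Hierarchical learning: the receptive field grows layer by layer
  2. Local features: captures local geometric structure
  3. Multi-scale grouping: copes with non-uniform point density

Problem:
  PointNet pools all points into a single global feature
  It cannot capture local geometric structure

Solution:
  A hierarchy similar to that of a CNN
  The receptive field grows with each layer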

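class PointNetPlusPlus(nn.Module):
    """
    PointNet++ (classification, single-scale grouping)

    Key ideas:
      1. Set Abstraction: hierarchical feature learning
      2. Local features: captures local geometry
      3. Multi-scale grouping for varying density (not implemented in this sketch)

    Architecture:
      A stack of Set Abstraction layers
      Each layer: sampling → grouping → local feature extraction
    """
    def __init__(self, num_classes):
        super().__init__()
        
        # Set Abstraction layers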
        self.sa1 = SetAbstraction(
            npoint=512, radius=0.2, nsample=32,
            in_channel=3, mlp=[64, 64, 128]
        )
        self.sa2 = SetAbstraction(
            npoint=128, radius=0.4, nsample=64,
            in_channel=128 + 3, mlp=[128, 128, 256]
        )
        self.sa3 = SetAbstraction(
            npoint=None, radius=None, nsample=None,
            in_channel=256 + 3, mlp=[256, 512, 1024]
        )
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )
    
    def forward(self, x):
        """
        x: [B, N, 3]
        returns: [B, num_classes] classification logits
        """
        B, N, _ = x.shape
        
        # Set Abstraction
        l1_xyz, l1_features = self.sa1(x, None)
        l2_xyz, l2_features = self.sa2(l1_xyz, l1_features)
        l3_xyz, l3_features = self.sa3(l2_xyz, l2_features)
        
        # Global feature: [B, 1, 1024] → [B, 1024]
        global_feature = l3_features.squeeze(1)
        
        # Classification
        output = self.classifier(global_feature)
        
        return output

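class SetAbstraction(nn.Module):
    """
    Set Abstraction layer

    Steps:
      1. Sampling: choose centroids
      2. Grouping: collect a local neighborhood around each centroid
      3. Feature extraction (PointNet): encode each neighborhood

    Theory:
      Plays the role of convolution + pooling in a CNN
      The receptive field grows layer by layer
    """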
    def __init__(self, npoint, radius, nsample, in_channel, mlp):
        super().__init__()
        
        self.npoint = npoint
        self.radius = radius
        self.nsample = nsample
        
        # Shared MLP layers (1×1 convolutions)
        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList()
        
        last_channel = in_channel
        for out_channel in mlp:
            self.mlp_convs.append(nn.Conv2d(last_channel, out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))
            last_channel = out_channel
    
    def forward(self, xyz, features):
        """
        xyz: [B, N, 3] point coordinates
        features: [B, N, C] point features (optional)

        Assumes batched torch helpers farthest_point_sample (returning indices),
        ball_query and index_points; the NumPy FPS shown earlier is not batched.
        """
        B, N, _ = xyz.shape
        
        # 1. Sampling: choose centroids
        if self.npoint is not None:
            fps_indices = farthest_point_sample(xyz, self.npoint)  # [B, npoint]
            new_xyz = index_points(xyz, fps_indices)  # [B, npoint, 3]
        else:
            new_xyz = xyz.mean(dim=1, keepdim=True)  # global: a single centroid, [B, 1, 3]
        
        # 2. Grouping: find the local neighborhood of each centroid
        if self.radius is not None:
            grouped_indices = ball_query(xyz, new_xyz, self.radius, self.nsample)
        else:
            # Global layer: one group that contains every point
            grouped_indices = torch.arange(N, device=xyz.device).view(1, 1, N).expand(B, 1, N)
        
        # Gather the grouped coordinates
        grouped_xyz = index_points(xyz, grouped_indices)  # [B, npoint, nsample, 3]
        grouped_xyz = grouped_xyz - new_xyz.unsqueeze(2)  # coordinates relative to the centroid
        
        if features is not None:
            grouped_features = index_points(features, grouped_indices)
            grouped_features = torch.cat([grouped_xyz, grouped_features], dim=-1)
        else:
            grouped_features = grouped_xyz
        
        # 3. Feature extraction: shared per-point MLP
        grouped_features = grouped_features.permute(0, 3, 1, 2)  # [B, C, npoint, nsample]
        
        for conv, bn in zip(self.mlp_convs, self.mlp_bns):
            grouped_features = torch.relu(bn(conv(grouped_features)))
        
        # Max pooling over each neighborhood
        new_features = torch.max(grouped_features, dim=3)[0]  # [B, C, npoint]
        new_features = new_features.permute(0, 2, 1)  # [B, npoint, C]
        
        return new_xyz, new_features

"""
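Key operations in PointNet++:

  1. Farthest point sampling (FPS):
     Selects evenly spread centroids
     
  2. Ball query:
     Finds the neighbors within a fixed radius of each centroid (at most nsample of them)
     
  3. Grouping:
     Gathers the coordinates and features of each local neighborhood
     
  4. Local PointNet:
     Extracts a feature vector from each neighborhood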
"""

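`SetAbstraction` calls `farthest_point_sample`, `ball_query` and `index_points` without defining them. The following is one possible batched PyTorch sketch of these three helpers, written to match the call signatures used above. It is an assumption about what the article intends, not the reference CUDA implementation, and it favors clarity over speed.

```python
import torch

def index_points(points, idx):
    """Gather points [B, N, C] with indices [B, ...] -> [B, ..., C]."""
    B = points.shape[0]
    batch = torch.arange(B, device=points.device).view(B, *([1] * (idx.dim() - 1)))
    return points[batch.expand_as(idx), idx]

def farthest_point_sample(xyz, npoint):
    """Return indices [B, npoint] of farthest-point samples from xyz [B, N, 3]."""
    B, N, _ = xyz.shape
    indices = torch.zeros(B, npoint, dtype=torch.long, device=xyz.device)
    min_dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.randint(0, N, (B,), device=xyz.device)     # random start per cloud
    batch = torch.arange(B, device=xyz.device)
    for i in range(npoint):
        indices[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)            # [B, 1, 3]
        dist = ((xyz - centroid) ** 2).sum(dim=-1)              # [B, N]
        min_dist = torch.minimum(min_dist, dist)
        farthest = min_dist.argmax(dim=-1)
    return indices

def ball_query(xyz, new_xyz, radius, nsample):
    """Return indices [B, S, nsample] of points within radius of each centroid."""
    B, N, _ = xyz.shape
    S = new_xyz.shape[1]
    sq_dist = ((new_xyz.unsqueeze(2) - xyz.unsqueeze(1)) ** 2).sum(dim=-1)   # [B, S, N]
    idx = torch.arange(N, device=xyz.device).view(1, 1, N).repeat(B, S, 1)
    idx[sq_dist > radius ** 2] = N                              # mark out-of-ball points
    idx = idx.sort(dim=-1)[0][:, :, :nsample]                   # in-ball indices first
    # Pad balls with fewer than nsample neighbors by repeating their first neighbor
    first = idx[:, :, :1].expand(-1, -1, idx.shape[-1])
    return torch.where(idx == N, first, idx)

torch.manual_seed(0)
xyz = torch.rand(2, 256, 3)
centroid_idx = farthest_point_sample(xyz, 32)                   # [2, 32]
new_xyz = index_points(xyz, centroid_idx)                       # [2, 32, 3]
group_idx = ball_query(xyz, new_xyz, radius=0.3, nsample=16)    # [2, 32, 16]
grouped_xyz = index_points(xyz, group_idx)                      # [2, 32, 16, 3]
print(new_xyz.shape, grouped_xyz.shape)
```

The padding step assumes every centroid is itself one of the input points, so each ball contains at least one neighbor. The dense [B, S, N] distance tensor is fine for a sketch but would need a dedicated kernel for large clouds.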
3.3 Point Cloud Segmentation

class PointNetPlusPlusSegmentation(nn.Module):
    """
    PointNet++ segmentation network

    Feature propagation layers recover per-point features

    Theory:
      Encoder: downsamples step by step and extracts high-level features
      Decoder: upsamples step by step and restores spatial detail
      Skip connections: fuse in the encoder features of the same level
    """
    def __init__(self, num_classes):
        super().__init__()
        
        # Encoder (Set Abstraction)
        self.sa1 = SetAbstraction(512, 0.2, 32, 3, [64, 64, 128])
        self.sa2 = SetAbstraction(128, 0.4, 64, 128+3, [128, 128, 256])
        self.sa3 = SetAbstraction(None, None, None, 256+3, [256, 512, 1024])
        
        # Decoder (Feature Propagation)
        self.fp3 = FeaturePropagation(1024+256, [256, 256])
        self.fp2 = FeaturePropagation(256+128, [256, 128])
        self.fp1 = FeaturePropagation(128+3, [128, 128, 128])
        
        # Per-point classification head
        self.classifier = nn.Conv1d(128, num_classes, 1)
    
    def forward(self, x):
        B, N, _ = x.shape
        
        # Encoder
        l1_xyz, l1_features = self.sa1(x, None)
        l2_xyz, l2_features = self.sa2(l1_xyz, l1_features)
        l3_xyz, l3_features = self.sa3(l2_xyz, l2_features)
        
        # Decoder
        l2_features = self.fp3(l2_xyz, l3_xyz, l2_features, l3_features)
        l1_features = self.fp2(l1_xyz, l2_xyz, l1_features, l2_features)
        # The raw coordinates serve as the skip features at level 0,
        # which matches the 128 + 3 input channels of fp1
        l0_features = self.fp1(x, l1_xyz, x, l1_features)
        
        # Per-point classification: [B, num_classes, N]
        output = self.classifier(l0_features.permute(0, 2, 1))
        
        return output

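class FeaturePropagation(nn.Module):
    """
    Feature propagation layer (decoder)

    Recovers per-point features with interpolation followed by an MLP

    Theory:
      Interpolation: inverse-distance-weighted average over the 3 nearest
      sparse points, which turns sparse features into dense ones
      Skip connection: fuses in the encoder features
    """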
    def __init__(self, in_channel, mlp):
        super().__init__()
        
        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList()
        
        last_channel = in_channel
        for out_channel in mlp:
            self.mlp_convs.append(nn.Conv1d(last_channel, out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm1d(out_channel))
            last_channel = out_channel
    
    def forward(self, xyz1, xyz2, features1, features2):
        """
        xyz1: dense point coordinates [B, N, 3]
        xyz2: sparse point coordinates [B, M, 3]
        features1: dense point features [B, N, C1] (skip connection, optional)
        features2: sparse point features [B, M, C2]

        Assumes a helper three_interpolate that returns [B, N, C2].
        """
        B, N, _ = xyz1.shape
        _, M, _ = xyz2.shape
        
        # Inverse-distance-weighted interpolation from the 3 nearest sparse points
        interpolated = three_interpolate(xyz1, xyz2, features2)
        
        # Skip connection
        if features1 is not None:
            new_features = torch.cat([interpolated, features1], dim=-1)
        else:
            new_features = interpolated
        
        # Shared MLP
        new_features = new_features.permute(0, 2, 1)
        for conv, bn in zip(self.mlp_convs, self.mlp_bns):
            new_features = torch.relu(bn(conv(new_features)))
        
        return new_features.permute(0, 2, 1)
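
`FeaturePropagation` relies on a `three_interpolate` helper that is not shown. A minimal sketch follows, assuming the inverse-distance weighting over the three nearest neighbors described in the PointNet++ paper. The `eps` parameter and the handling of fewer than three source points (needed when the coarsest level has a single global point) are choices made for this sketch.

```python
import torch

def three_interpolate(xyz1, xyz2, features2, eps=1e-8):
    """
    Interpolate features from sparse points onto dense points.

    xyz1: [B, N, 3] dense coordinates, xyz2: [B, M, 3] sparse coordinates,
    features2: [B, M, C] sparse features -> returns [B, N, C].
    """
    B, N, _ = xyz1.shape
    M = xyz2.shape[1]
    k = min(3, M)                                          # allow fewer than 3 sources
    sq_dist = ((xyz1.unsqueeze(2) - xyz2.unsqueeze(1)) ** 2).sum(dim=-1)   # [B, N, M]
    nn_dist, nn_idx = sq_dist.topk(k, dim=-1, largest=False)               # [B, N, k]
    weights = 1.0 / (nn_dist + eps)                        # inverse squared distance
    weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize to sum to 1
    batch = torch.arange(B, device=xyz1.device).view(B, 1, 1).expand(-1, N, k)
    neighbors = features2[batch, nn_idx]                   # [B, N, k, C]
    return (weights.unsqueeze(-1) * neighbors).sum(dim=2)

torch.manual_seed(0)
xyz1 = torch.rand(2, 50, 3)
xyz2 = xyz1[:, :10]                                        # sparse subset of the dense points
constant = torch.full((2, 10, 4), 2.5)
out = three_interpolate(xyz1, xyz2, constant)
single = three_interpolate(xyz1, xyz1[:, :1], torch.rand(2, 1, 4))
print(out.shape, single.shape)
```

Because the weights are normalized, a constant feature field is reproduced exactly, and with a single source point its feature is simply broadcast to every dense point.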

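4. 3D Object Detection

4.1 Overview of 3D Detection

The 3D object detection task:

  Input: point cloud / images / multi-modal data
  Output: 3D bounding boxes (position, size, heading) + class labels

  3D bounding box parameterization:
    (x, y, z, w, l, h, θ)
    
    (x, y, z): box center
    (w, l, h): width, length, height
    θ: heading angle (yaw, rotation about the vertical axis)

  Applications:
    Autonomous driving: detecting vehicles, pedestrians, cyclists
    Robotics: grasping, navigation
    AR/VR: scene understanding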
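
The seven-parameter box can be expanded into its eight corners, which is what IoU computation and visualization usually need. The sketch below assumes a z-up frame with the length along the heading direction and the width across it. Datasets differ on these axis conventions, so treat them (and the name `box_to_corners`) as assumptions.

```python
import numpy as np

def box_to_corners(box):
    """Convert (x, y, z, w, l, h, yaw) into the 8 corner points [8, 3]."""
    x, y, z, w, l, h, yaw = box
    # Corner offsets in the box frame: length along x, width along y, height along z
    dx = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2
    dy = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2
    dz = np.array([-1, -1, -1, -1, 1, 1, 1, 1]) * h / 2
    corners = np.stack([dx, dy, dz], axis=1)               # [8, 3]
    c, s = np.cos(yaw), np.sin(yaw)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return corners @ rot_z.T + np.array([x, y, z])

# A 4 m long, 2 m wide, 1 m tall box turned by 90 degrees
corners = box_to_corners((10.0, 2.0, 0.5, 2.0, 4.0, 1.0, np.pi / 2))
print(corners.min(axis=0), corners.max(axis=0))
```

After the 90 degree yaw the 4 m length lies along the y axis, so the box spans 2 m in x and 4 m in y.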

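4.2 Point-Cloud-Based Methods

PointPillars:
  Paper: "PointPillars: Fast Encoders for Object Detection from Point Clouds"
         (Lang et al., 2019)
  
  Key idea:
    Organize the point cloud into vertical columns (pillars)
    Process the resulting pseudo-image with a 2D CNN
    
  Strength: fast, suitable for real-time applications

VoxelNet:
  Paper: "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection"
         (Zhou et al., 2018)
  
  Key idea:
    Voxelize the point cloud
    Apply 3D convolutions followed by 2D convolutions
    
  Strength: preserves 3D information

SECOND:
  Paper: "SECOND: Sparsely Embedded Convolutional Detection" (Yan et al., 2018)
  
  Key idea:
    Process the voxels with sparse convolutions
    Greatly reduces computation, since most voxels are empty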

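class PointPillars(nn.Module):
    """
    PointPillars

    Key ideas:
      1. Organize the point cloud into pillars
      2. Encode each pillar with a simplified PointNet
      3. Process the resulting pseudo-image with a 2D CNN

    Strengths:
      - Fast (2D convolutions only)
      - Suitable for real-time applications

    Note: create_pillars, scatter_to_image and DetectionHead are used below
    and must be provided separately.
    """
    def __init__(self, num_classes):
        super().__init__()
        
        # Pillar feature network
        self.pillar_feature_net = PillarFeatureNet(
            in_channels=4,  # x, y, z, intensity
            out_channels=64
        )
        
        # 2D backbone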
        self.backbone = nn.Sequential(
            nn.Conv2d(64, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU()
        )
        
        # Detection head
        self.detection_head = DetectionHead(256, num_classes)
    
    def forward(self, point_cloud):
        """
        point_cloud: [B, N, 4] (x, y, z, intensity)
        """
        # 1. Group the points into pillars
        pillars = self.create_pillars(point_cloud)
        
        # 2. Encode each pillar
        pillar_features = self.pillar_feature_net(pillars)
        
        # 3. Scatter the pillar features into a pseudo-image
        pseudo_image = self.scatter_to_image(pillar_features)
        
        # 4. 2D backbone
        features = self.backbone(pseudo_image)
        
        # 5. Detection head
        predictions = self.detection_head(features)
        
        return predictions

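class PillarFeatureNet(nn.Module):
    """
    Pillar feature network

    Extracts one feature vector per pillar from the points inside it

    Theory:
      A pillar is a voxel with unbounded extent along the vertical axis
      A simplified PointNet (shared layers + max pooling) encodes the points of each pillar
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        
        self.mlp = nn.Sequential(
            # +5: offsets to the pillar's point centroid (3) and to the pillar center in x, y (2)
            nn.Linear(in_channels + 5, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Linear(64, out_channels),
            nn.BatchNorm1d(out_channels),
            nn.ReLU()
        )
    
    def forward(self, pillars):
        """
        pillars: [B, max_points, max_pillars, C] with C = in_channels + 5
        returns: [B, max_pillars, out_channels]
        """
        B, P, M, C = pillars.shape
        
        # Per-point features; flatten first because BatchNorm1d expects [batch, channels]
        point_features = self.mlp(pillars.reshape(-1, C)).reshape(B, P, M, -1)
        
        # Max pooling over the points of each pillar
        pillar_features = torch.max(point_features, dim=1)[0]
        
        return pillar_features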

"""
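Why PointPillars is fast:

  Voxel-based methods: rely on 3D convolutions, which are computationally expensive
  PointPillars: collapses the height axis and uses only 2D convolutions
  
  Speed: PointPillars can run at 60+ FPS (the paper reports about 62 Hz on KITTI)
  Accuracy: close to that of 3D-convolution-based methods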
"""

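The step that turns pillar features into a pseudo-image is a scatter by grid index. The sketch below handles a single point cloud and takes the pillar grid coordinates as an explicit argument, so its signature differs from the `self.scatter_to_image(pillar_features)` call above. Both the signature and the toy sizes are assumptions.

```python
import torch

def scatter_to_image(pillar_features, pillar_coords, grid_h, grid_w):
    """
    pillar_features: [P, C] features of the non-empty pillars of one point cloud
    pillar_coords:   [P, 2] integer (row, col) grid cell of each pillar
    returns:         [C, grid_h, grid_w] pseudo-image, zero where no pillar exists
    """
    P, C = pillar_features.shape
    canvas = torch.zeros(C, grid_h * grid_w, dtype=pillar_features.dtype,
                         device=pillar_features.device)
    flat_index = pillar_coords[:, 0] * grid_w + pillar_coords[:, 1]   # [P]
    canvas[:, flat_index] = pillar_features.t()
    return canvas.view(C, grid_h, grid_w)

features = torch.arange(6, dtype=torch.float32).view(3, 2)   # 3 pillars, 2 channels
coords = torch.tensor([[0, 0], [1, 2], [3, 1]])
image = scatter_to_image(features, coords, grid_h=4, grid_w=4)
print(image.shape)
```

Only non-empty pillars are encoded and scattered, which is where much of the speed advantage comes from. The 2D backbone then treats the result like an ordinary image.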
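4.3 Transformer-Based Methods

class DETR3D(nn.Module):
    """
    DETR3D (simplified)

    Applies the DETR idea to 3D detection

    Key ideas:
      1. Learn 3D reference points
      2. Sample image features at the projections of those points
      3. Decode with a Transformer

    Note: this sketch predicts reference points but does not project them
    into the camera views; the queries attend to all image features instead.
    """
    def __init__(self, num_classes, num_queries=300):
        super().__init__()
        
        # Learnable object queries
        self.query_embed = nn.Embedding(num_queries, 256)
        
        # Reference point prediction
        self.reference_points = nn.Linear(256, 3)
        
        # Transformer decoder
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(256, 8, 1024),
            num_layers=6
        )
        
        # Detection heads
        self.class_head = nn.Linear(256, num_classes)
        self.box_head = nn.Linear(256, 7)  # x, y, z, w, l, h, θ
    
    def forward(self, image_features):
        """
        image_features: [B, S, 256] flattened image features
        returns: class logits [B, num_queries, num_classes], boxes [B, num_queries, 7]
        """
        B = image_features.shape[0]
        
        # Query embeddings
        queries = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)
        
        # Reference points (computed but unused in this simplified version)
        ref_points = self.reference_points(queries)
        
        # Decode; the decoder expects [sequence, batch, channels]
        decoded = self.decoder(queries.permute(1, 0, 2), image_features.permute(1, 0, 2))
        decoded = decoded.permute(1, 0, 2)
        
        # Predictions
        class_logits = self.class_head(decoded)
        box_pred = self.box_head(decoded)
        
        return class_logits, box_pred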

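5. Neural Radiance Fields (NeRF)

5.1 NeRF Basics

Paper: "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis"
       (Mildenhall et al., 2020)

Key idea:
  Represent a 3D scene with a neural network
  
  Input: 3D position (x, y, z) + viewing direction (θ, φ)
  Output: color (r, g, b) + volume density σ
  
  F_Θ: (x, y, z, θ, φ) → (r, g, b, σ)   (Θ denotes the network weights)

Rendering equation:
  For each pixel of the image, cast a ray:
    r(t) = o + t·d
    
    o: camera origin
    d: ray direction
    t: distance along the ray
  
  Expected color, with near and far bounds t_n and t_f:
    C(r) = ∫_{t_n}^{t_f} T(t) · σ(r(t)) · c(r(t), d) dt
    
    T(t) = exp(-∫_{t_n}^{t} σ(r(s)) ds)  (transmittance)
    
  Interpretation:
    Color is integrated along the ray, weighted by density and transmittance
    High-density regions contribute more color unless density in front of them blocks the ray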
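import torch
import torch.nn as nn

class NeRF(nn.Module):
    """
    NeRF (Neural Radiance Fields), simplified

    Key idea:
      An MLP represents the 3D scene
      Position and viewing direction in, color and density out

    Theory:
      Positional encoding: maps low-dimensional coordinates to a higher-dimensional space
      View dependence: color depends on the viewing direction, density does not

    Note: the direction is passed as a 3D vector rather than as (θ, φ).
    """
    def __init__(self, pos_freq=10, dir_freq=4):
        super().__init__()
        
        # Encoded input sizes: raw input (3) + sin/cos per frequency per axis
        self.pos_dim = 3 + 3 * 2 * pos_freq
        self.dir_dim = 3 + 3 * 2 * dir_freq
        
        # Number of encoding frequencies
        self.pos_freq = pos_freq
        self.dir_freq = dir_freq
        
        # Position branch (8 linear layers)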
        self.pos_layers = nn.Sequential(
            nn.Linear(self.pos_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 256)
        )
        
        # Density head
        self.density_head = nn.Sequential(
            nn.Linear(256 + self.pos_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )
        
        # Color head
        self.color_layers = nn.Sequential(
            nn.Linear(256 + self.dir_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 3)
        )
    
    def positional_encoding(self, x, freq):
        """
        Positional encoding

        PE(x) = [sin(2⁰πx), cos(2⁰πx), ..., sin(2^(L-1)πx), cos(2^(L-1)πx)]
        (this implementation also keeps the raw input x in front)

        Theory:
          Maps low-dimensional coordinates to a higher-dimensional space
          Lets the MLP fit high-frequency detail
        """
        encodings = [x]
        for i in range(freq):
            encodings.append(torch.sin(2 ** i * torch.pi * x))
            encodings.append(torch.cos(2 ** i * torch.pi * x))
        return torch.cat(encodings, dim=-1)
    
    def forward(self, positions, directions):
        """
        positions: [B, N, 3] 3D coordinates
        directions: [B, N, 3] viewing directions
        returns: [B, N, 4] (r, g, b, σ)
        """
        # Positional encoding
        pos_enc = self.positional_encoding(positions, self.pos_freq)
        dir_enc = self.positional_encoding(directions, self.dir_freq)
        
        # Position features
        pos_features = self.pos_layers(pos_enc)
        
        # Density (independent of the viewing direction)
        density_input = torch.cat([pos_features, pos_enc], dim=-1)
        density = torch.relu(self.density_head(density_input))
        
        # Color (depends on the viewing direction)
        color_input = torch.cat([pos_features, dir_enc], dim=-1)
        color = torch.sigmoid(self.color_layers(color_input))
        
        return torch.cat([color, density], dim=-1)

"""
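Why positional encoding helps:

  MLPs are biased toward low-frequency functions and struggle to fit high-frequency detail
  Positional encoding lifts each input coordinate to sinusoids of increasing frequency
  
  Without positional encoding: only low frequencies are learned (blurry renders)
  With positional encoding: high frequencies can be learned (sharp renders)
  
  Settings used in the paper:
    L=10 for the position (x, y, z)
    L=4 for the viewing direction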
"""

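A quick way to see where the input sizes 63 and 27 in the `NeRF` class come from is to run the encoding on a single point. This NumPy sketch mirrors the formula above. The `include_input` flag is an assumption added to make the raw-input term explicit.

```python
import numpy as np

def positional_encoding(x, num_freqs, include_input=True):
    """Encode x [..., D] with sin/cos at frequencies 2^0 π, ..., 2^(L-1) π."""
    parts = [x] if include_input else []
    for k in range(num_freqs):
        parts.append(np.sin(2.0 ** k * np.pi * x))
        parts.append(np.cos(2.0 ** k * np.pi * x))
    return np.concatenate(parts, axis=-1)

positions = np.array([[0.0, 0.5, -0.25]])
encoded_pos = positional_encoding(positions, num_freqs=10)                    # 3 + 3*2*10
encoded_dir = positional_encoding(np.array([[0.0, 0.0, 1.0]]), num_freqs=4)   # 3 + 3*2*4
print(encoded_pos.shape, encoded_dir.shape)
```

Each coordinate contributes one sine and one cosine per frequency, so the encoded size is D + 2·L·D when the raw input is kept.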
5.2 NeRF Rendering

class VolumeRenderer:
    """
    Volume renderer

    Turns the NeRF outputs into pixel colors

    Rendering equation (discretized):
      C(r) = Σ T_i · α_i · c_i
      
      T_i = Π(1 - α_j) for j < i  (accumulated transmittance)
      α_i = 1 - exp(-σ_i · δ_i)   (opacity)
    """
    
    @staticmethod
    def render(nerf, rays_o, rays_d, near=0.1, far=2.0, num_samples=64):
        """
        Render a batch of rays

        rays_o: [B, 3] ray origins
        rays_d: [B, 3] ray directions (assumed unit length, so t is a metric distance)
        returns: [B, 3] RGB colors
        """
        B = rays_o.shape[0]
        
        # Sample depths along each ray
        t = torch.linspace(near, far, num_samples, device=rays_o.device)
        t = t.unsqueeze(0).expand(B, -1)
        
        # Jitter the samples during training
        if nerf.training:
            noise = torch.rand_like(t) * (far - near) / num_samples
            t = t + noise
        
        # 3D sample positions: [B, num_samples, 3]
        points = rays_o.unsqueeze(1) + t.unsqueeze(2) * rays_d.unsqueeze(1)
        
        # Query the NeRF
        directions = rays_d.unsqueeze(1).expand_as(points)
        nerf_output = nerf(points, directions)
        
        colors = nerf_output[..., :3]
        densities = nerf_output[..., 3]
        
        # Distances between adjacent samples; the last interval is treated as infinite
        delta = t[:, 1:] - t[:, :-1]
        delta = torch.cat([delta, torch.full_like(delta[:, :1], 1e10)], dim=-1)
        
        # Opacity
        alpha = 1.0 - torch.exp(-densities * delta)
        
        # Accumulated transmittance (exclusive cumulative product, so T_0 = 1)
        T = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
        T = torch.cat([torch.ones_like(T[:, :1]), T[:, :-1]], dim=-1)
        
        # Compositing weights
        weights = T * alpha
        
        # Expected color
        rgb = (weights.unsqueeze(-1) * colors).sum(dim=1)
        
        return rgb

"""
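Volume rendering in brief:

  Integrate color along the ray, weighted by density and transmittance
  
  Discretized form:
    C = Σ_i T_i · α_i · c_i
    
  where:
    δ_i = t_{i+1} - t_i: distance between adjacent samples
    α_i = 1 - exp(-σ_i · δ_i): opacity of sample i
    T_i = Π_{j<i} (1 - α_j): accumulated transmittance
    
  Interpretation:
    High-density samples contribute more color
    Samples in front occlude the ones behind them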
"""

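The discretized sum can be traced by hand on a toy ray. The sketch below uses made-up densities and colors (empty space, then a dense red surface, then a green surface behind it) to show that the front surface takes almost all of the weight. The function name `composite` is hypothetical.

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Alpha-composite samples along one ray; returns (rgb, weights)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # α_i
    # T_i = Π_{j<i} (1 - α_j): exclusive cumulative product, so T_0 = 1
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    return (weights[:, None] * colors).sum(axis=0), weights

# Toy ray: two empty samples, two red surface samples, one green sample behind them
sigmas = np.array([0.0, 0.0, 50.0, 50.0, 50.0])
colors = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=float)
deltas = np.full(5, 0.1)

rgb, weights = composite(sigmas, colors, deltas)
print(rgb, weights.sum())
```

The empty samples get zero weight whatever their color, and the weights sum to one minus the transmittance left at the end of the ray, so an opaque ray sums to nearly 1.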
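5.3 Improvements to NeRF

Directions in which NeRF has been improved:

  1. Speed:
     - Instant-NGP: multi-resolution hash encoding
     - Plenoxels: voxel representation
     
  2. Quality and scene coverage:
     - Mip-NeRF: anti-aliased multi-scale rendering
     - NeRF-W: unconstrained in-the-wild photo collections
     - NeRF++: unbounded scenes, with separate foreground and background models
     
  3. Generalization:
     - PixelNeRF: generalizes from one or a few images
     - IBRNet: image-based rendering
     
  4. Dynamic scenes:
     - D-NeRF: dynamic scenes
     - Nerfies: deformable NeRF
     
  5. Large scale:
     - Block-NeRF: city-scale scenes
     - Mega-NeRF: large-scale scenes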

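6. 3D Gaussian Splatting

6.1 Overview of 3DGS

Paper: "3D Gaussian Splatting for Real-Time Radiance Field Rendering"
       (Kerbl et al., 2023)

Key idea:
  Represent the scene with a set of 3D Gaussians
  Render by splatting them onto the image plane

  Each Gaussian:
    G(x) = exp(-½(x-μ)ᵀ Σ⁻¹ (x-μ))
    
    μ: center position (3D)
    Σ: covariance matrix (3×3)
    α: opacity
    c: color (spherical harmonics coefficients)

Strengths:
  1. Real-time rendering: often 100+ FPS on a modern GPU
  2. High quality: comparable to strong NeRF variants
  3. Explicit representation: easy to edit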

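class Gaussian3D:
    """
    3D Gaussian representation

    Each Gaussian has:
      - position μ: [3]
      - covariance Σ: [3, 3] (parameterized by a rotation and a scale)
      - opacity α: [1]
      - color c: spherical harmonics coefficients

    Note: this sketch stores scales and opacities directly; the reference
    implementation stores them before activation (log scale, logit opacity).
    """
    def __init__(self, num_gaussians):
        self.num_gaussians = num_gaussians
        
        # Positions
        self.positions = nn.Parameter(torch.randn(num_gaussians, 3))
        
        # Scales (one per axis)
        self.scales = nn.Parameter(torch.ones(num_gaussians, 3))
        
        # Rotations as quaternions (w, x, y, z)
        self.rotations = nn.Parameter(torch.zeros(num_gaussians, 4))
        self.rotations.data[:, 0] = 1  # initialize to the identity quaternion
        
        # Opacities
        self.opacities = nn.Parameter(torch.ones(num_gaussians, 1))
        
        # Colors: 16 spherical harmonics coefficients (degree 3) per RGB channel
        self.sh_coefficients = nn.Parameter(torch.zeros(num_gaussians, 16, 3))
    
    def get_covariance(self):
        """
        Compute the covariance matrices

        Σ = R · S · S^T · R^T

        R: rotation matrix (from the quaternion)
        S: diagonal scale matrix

        This form keeps Σ symmetric positive semi-definite during optimization.
        """
        # Rotation matrices (quaternion_to_rotation_matrix is assumed to normalize its input)
        R = quaternion_to_rotation_matrix(self.rotations)
        
        # Scale matrices
        S = torch.diag_embed(self.scales)
        
        # Covariance
        M = torch.bmm(R, S)
        covariance = torch.bmm(M, M.transpose(1, 2))
        
        return covariance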

"""
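Rendering by Gaussian splatting:

  For each pixel in the image:
    1. Find the Gaussians that overlap the pixel
    2. Sort them by depth
    3. Blend their colors front to back
    
  C(p) = Σ_i c_i · α_i · G_i(p) · T_i
  
  T_i = Π_{j<i} (1 - α_j · G_j(p))

  Here G_i(p) is the i-th Gaussian projected to 2D and evaluated at pixel p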
"""

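The three rendering steps above can be sketched for a single pixel as follows. This is a minimal CPU illustration, not the tile-based GPU rasterizer used in practice; `blend_pixel` and the input values are hypothetical, and the projected Gaussian values G_i(p) are taken as given.

```python
import numpy as np

def blend_pixel(depths, alphas, gauss_vals, colors):
    """Front-to-back blending: C = sum_i c_i * alpha_i * G_i * T_i."""
    depths = np.asarray(depths, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    gauss_vals = np.asarray(gauss_vals, dtype=float)
    colors = np.asarray(colors, dtype=float)

    order = np.argsort(depths)            # nearest Gaussian first
    color = np.zeros(3)
    transmittance = 1.0                   # T for the first Gaussian is 1
    for i in order:
        a = alphas[i] * gauss_vals[i]     # effective opacity at this pixel
        color += transmittance * a * colors[i]
        transmittance *= 1.0 - a          # light left for Gaussians behind
    return color, transmittance

# A blue Gaussian at depth 2 behind a half-transparent red one at depth 1
color, remaining = blend_pixel(
    depths=[2.0, 1.0],
    alphas=[0.8, 0.5],
    gauss_vals=[1.0, 1.0],
    colors=[[0, 0, 1], [1, 0, 0]],
)
print(color, remaining)
```

The red Gaussian contributes 0.5, the blue one 0.5 × 0.8 = 0.4, and 10% of the light passes through both.
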
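6.2 Training 3DGS

python

import torch


class Gaussian3DTrainer:
    """
    Trainer for 3D Gaussian Splatting
    
    Training strategy:
      1. Initialize from an SfM point cloud
      2. Optimize a rendering loss
      3. Adaptive density control
    """
    def __init__(self, gaussians, renderer, ssim_fn):
        self.gaussians = gaussians
        self.renderer = renderer
        # SSIM is not part of core PyTorch, so the caller supplies it:
        # ssim_fn(pred, gt) -> scalar tensor
        self.ssim_fn = ssim_fn
        
        # Per-parameter learning rates
        self.optimizer = torch.optim.Adam([
            {'params': gaussians.positions, 'lr': 0.00016},
            {'params': gaussians.scales, 'lr': 0.005},
            {'params': gaussians.rotations, 'lr': 0.001},
            {'params': gaussians.opacities, 'lr': 0.05},
            {'params': gaussians.sh_coefficients, 'lr': 0.0025}
        ])
    
    def train_step(self, images, cameras):
        """
        One training step
        
        1. Render an image
        2. Compute the loss
        3. Update the parameters
        4. Adaptive density control
        """
        # Pick a random view
        idx = torch.randint(len(images), (1,)).item()
        image_gt = images[idx]
        camera = cameras[idx]
        
        # Render
        image_pred = self.renderer.render(self.gaussians, camera)
        
        # Loss (L1 + SSIM)
        loss_l1 = torch.abs(image_pred - image_gt).mean()
        loss_ssim = 1 - self.ssim_fn(image_pred, image_gt)
        loss = 0.8 * loss_l1 + 0.2 * loss_ssim
        
        # Backpropagation
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Adaptive density control (the 3DGS paper runs this every
        # 100 iterations rather than after every step)
        self.adaptive_density_control()
        
        return loss.item()
    
    def adaptive_density_control(self, grad_thresh=0.0002, scale_thresh=0.01,
                                 min_opacity=0.01):
        """
        Adaptive density control
        
        1. Clone: small Gaussians with a large positional gradient (under-reconstruction)
        2. Split: large Gaussians with a large positional gradient (over-reconstruction)
        3. Prune: Gaussians with low opacity
        """
        # Gradient magnitude and largest axis scale per Gaussian
        grad_norm = torch.norm(self.gaussians.positions.grad, dim=-1)
        scale_max = self.gaussians.scales.detach().max(dim=-1).values
        high_grad = grad_norm > grad_thresh
        
        # All masks index the current set of Gaussians, so the helpers below
        # (not shown) must keep them aligned while resizing the parameters
        # and the optimizer state
        clone_mask = high_grad & (scale_max <= scale_thresh)
        split_mask = high_grad & (scale_max > scale_thresh)
        prune_mask = self.gaussians.opacities.detach().squeeze(-1) < min_opacity
        
        self.clone_gaussians(clone_mask)
        self.split_gaussians(split_mask)
        self.prune_gaussians(prune_mask)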

"""
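Advantages of 3DGS:

  1. Real-time rendering: 100+ FPS
     - No neural network inference at render time
     - Gaussians are rasterized directly
     
  2. High quality:
     - Rich detail
     - Sharp boundaries
     
  3. Easy to edit:
     - Explicit representation
     - Individual Gaussians can be moved or deleted
     
  4. Fast training:
     - 10-30 minutes (vs. hours to days for NeRF)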
"""

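The clone and split steps of adaptive density control can be sketched on plain NumPy arrays. This is one plausible reading of the procedure, under simplifying assumptions: rotations are ignored (axis-aligned Gaussians), a split Gaussian is replaced by two children sampled from it, and the children's scales are divided by 1.6, the factor reported in the 3DGS paper. The function name `densify` and all input values are hypothetical.

```python
import numpy as np

def densify(positions, scales, clone_mask, split_mask, rng, shrink=1.6):
    """Return new (positions, scales) after cloning and splitting.

    positions, scales: [N, 3] arrays. Cloned Gaussians are kept and copied;
    split Gaussians are removed and replaced by two smaller children.
    """
    keep = ~split_mask
    new_pos = [positions[keep], positions[clone_mask]]
    new_scale = [scales[keep], scales[clone_mask]]
    num_split = int(split_mask.sum())
    for _ in range(2):                                  # two children per split
        # Sample each child position from the parent Gaussian
        offsets = rng.normal(size=(num_split, 3)) * scales[split_mask]
        new_pos.append(positions[split_mask] + offsets)
        new_scale.append(scales[split_mask] / shrink)
    return np.concatenate(new_pos), np.concatenate(new_scale)

rng = np.random.default_rng(0)
positions = np.zeros((4, 3))
scales = np.array([[0.001] * 3, [0.001] * 3, [0.5] * 3, [0.5] * 3])
grad_norm = np.array([0.001, 0.0, 0.001, 0.0])          # made-up gradient norms

high_grad = grad_norm > 0.0002
large = scales.max(axis=1) > 0.01
new_pos, new_scale = densify(positions, scales,
                             clone_mask=high_grad & ~large,
                             split_mask=high_grad & large,
                             rng=rng)
print(new_pos.shape, new_scale.shape)
```

Starting from four Gaussians, one is cloned and one is split, so six remain: three kept, one copy, and two children.
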
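7. Multi-View Geometry and 3D Reconstruction

7.1 Fundamentals of Multi-View Geometry

Multi-view geometry:

  Recover 3D structure from images taken from multiple viewpoints

  Core problems:
    1. Camera pose estimation
    2. Triangulation
    3. Dense reconstruction

Epipolar geometry:
  The geometric relationship between two images of the same scene
  
  Fundamental matrix F:
    p₂ᵀ · F · p₁ = 0
    
    p₁, p₂: homogeneous pixel coordinates of a pair of corresponding points
    
  Essential matrix E:
    E = K₂ᵀ · F · K₁
    
    K₁, K₂: intrinsic matrices of the two cameras
    
  Decomposing E yields the relative rotation R and the translation t (up to scale)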
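
The epipolar constraint can be checked numerically. The sketch below builds E from a known relative pose using the standard relation E = [t]× R (for the convention X₂ = R · X₁ + t, which is an assumption not stated above), derives F, and verifies both p₂ᵀ · F · p₁ = 0 and E = K₂ᵀ · F · K₁. The intrinsics, pose, and 3D point are arbitrary illustrative values, and both cameras share the same K.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]], dtype=float)

K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=float)
theta = np.deg2rad(10)                       # camera 2 is rotated about the y axis
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.5, 0.0, 0.1])

E = skew(t) @ R                              # essential matrix
K_inv = np.linalg.inv(K)
F = K_inv.T @ E @ K_inv                      # fundamental matrix

X1 = np.array([0.3, -0.2, 4.0])              # 3D point in camera-1 coordinates
X2 = R @ X1 + t                              # same point in camera-2 coordinates
p1 = K @ X1 / X1[2]                          # homogeneous pixel coordinates
p2 = K @ X2 / X2[2]

residual = p2 @ F @ p1
E_back = K.T @ F @ K
print(residual)
```

With noise-free correspondences the residual is zero up to floating-point error; with real matches it is small but nonzero, which is why F is usually estimated robustly (for example with RANSAC).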

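7.2 SfM (Structure from Motion)

SfM (Structure from Motion):

  Input: images from multiple viewpoints
  Output: camera poses + a sparse 3D point cloud

  Pipeline:
    1. Feature extraction (SIFT, SuperPoint)
    2. Feature matching (FLANN, SuperGlue)
    3. Camera pose estimation (PnP, five-point algorithm)
    4. Triangulation
    5. Bundle adjustment

  Representative software:
    COLMAP: one of the most widely used SfM tools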
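
Step 4 of the pipeline, triangulation, can be illustrated with the linear (DLT) method: each observation contributes two linear equations in the homogeneous 3D point, and the solution is the null vector of the stacked system. This is a minimal two-view sketch with noise-free observations; the camera matrices and the point are made-up values, and `triangulate_dlt` and `project` are hypothetical helper names.

```python
import numpy as np

def triangulate_dlt(P1, P2, p1, p2):
    """Linear triangulation of one point from two views.

    P1, P2: 3x4 projection matrices; p1, p2: (u, v) pixel observations.
    """
    A = np.stack([
        p1[0] * P1[2] - P1[0],
        p1[1] * P1[2] - P1[1],
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                     # right singular vector of the smallest singular value
    return X[:3] / X[3]            # homogeneous -> Euclidean

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=float)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
# Second camera: same orientation, translation t = (-0.5, 0, 0)
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 5.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
print(X_est)
```

With noisy observations the DLT result is typically used only as an initial value that bundle adjustment then refines.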

python

class SfMPipeline:
    """
    SfM pipeline
    
    Idea:
      Recover 3D structure from multi-view images
      through feature matching and triangulation
    
    The five components are supplied by the caller; their method names
    (extract, match, estimate, triangulate, optimize) are placeholders.
    """
    def __init__(self, feature_extractor, matcher, pose_estimator,
                 triangulator, bundle_adjuster):
        self.feature_extractor = feature_extractor
        self.matcher = matcher
        self.pose_estimator = pose_estimator
        self.triangulator = triangulator
        self.bundle_adjuster = bundle_adjuster
    
    def reconstruct(self, images):
        """
        Full SfM pipeline
        """
        # 1. Feature extraction
        features = [self.feature_extractor.extract(img) for img in images]
        
        # 2. Feature matching (exhaustive over all image pairs)
        matches = []
        for i in range(len(images)):
            for j in range(i + 1, len(images)):
                match = self.matcher.match(features[i], features[j])
                matches.append((i, j, match))
        
        # 3. Camera pose estimation
        poses = self.pose_estimator.estimate(features, matches)
        
        # 4. Triangulation
        points_3d = self.triangulator.triangulate(features, matches, poses)
        
        # 5. Bundle adjustment
        poses, points_3d = self.bundle_adjuster.optimize(
            poses, points_3d, features, matches)
        
        return poses, points_3d

"""
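Bundle adjustment:

  Goal: jointly optimize the camera poses and the 3D points
  
  min Σᵢⱼ ‖p_ij - π(P_i, X_j)‖²
  
  where:
    p_ij: observation of the j-th 3D point in image i
    P_i: pose of camera i
    X_j: the j-th 3D point
    π: projection function
    
  The sum runs over the pairs (i, j) for which point j is visible in image i
  Usually solved with the Levenberg-Marquardt algorithm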
"""

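The bundle adjustment objective can be written down directly. The sketch below only evaluates the cost for a given set of poses and points; it does not perform the Levenberg-Marquardt optimization. It assumes a pinhole camera with shared intrinsics K and poses given as (R, t); the names `project` and `ba_cost` and the scene values are hypothetical.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection pi(P_i, X_j) for a pose P_i = (R, t)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def ba_cost(K, poses, points, observations):
    """Sum of squared reprojection errors.

    observations: list of (i, j, p_ij) with camera index i, point index j,
    and the observed pixel p_ij. Only visible pairs are listed.
    """
    cost = 0.0
    for i, j, p_ij in observations:
        R, t = poses[i]
        residual = np.asarray(p_ij, dtype=float) - project(K, R, t, points[j])
        cost += float(residual @ residual)
    return cost

K = np.array([[500, 0, 320], [0, 500, 240], [0, 0, 1]], dtype=float)
poses = [(np.eye(3), np.zeros(3)), (np.eye(3), np.array([-1.0, 0.0, 0.0]))]
points = [np.array([0.0, 0.0, 5.0]), np.array([1.0, 0.5, 6.0])]

# Synthetic, noise-free observations of every point in every camera
observations = [(i, j, project(K, *poses[i], points[j]))
                for i in range(2) for j in range(2)]

cost_true = ba_cost(K, poses, points, observations)
shifted = [p + np.array([0.1, 0.0, 0.0]) for p in points]
cost_shifted = ba_cost(K, poses, shifted, observations)
print(cost_true, cost_shifted)
```

The cost is zero at the true configuration and grows once the points are perturbed; an optimizer would adjust poses and points together to drive it back down.
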
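8. SLAM Techniques

8.1 Overview of SLAM

SLAM (Simultaneous Localization and Mapping):

  A robot in an unknown environment must:
    1. Estimate its own position (localization)
    2. Build a map of the environment (mapping)
    
  The two tasks depend on each other:
    Localization needs a map
    Mapping needs a known position

Sensors:
  1. Visual SLAM: monocular / stereo / RGB-D cameras
  2. LiDAR SLAM: LiDAR
  3. Multi-sensor fusion: camera + IMU + LiDAR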

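8.2 Visual SLAM

Visual SLAM pipeline:

  1. Frontend:
     Visual odometry
     Estimates the motion between adjacent frames
     
  2. Backend:
     Optimizes the camera trajectory and the map
     Bundle adjustment
     
  3. Loop closure:
     Detects whether the camera has returned to a previously visited place
     Removes accumulated drift
     
  4. Mapping:
     Builds the map of the environment

Representative systems:
  ORB-SLAM: feature-based method
  LSD-SLAM: direct method
  DSO: Direct Sparse Odometry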

python

class VisualSLAM:
    """
    Visual SLAM system
    
    Core components (supplied by the caller; their interfaces are placeholders):
      1. Frontend: visual odometry
      2. Backend: optimization
      3. Loop closure detection
      4. Mapping
    """
    def __init__(self, frontend, backend, loop_detector, mapper):
        self.frontend = frontend
        self.backend = backend
        self.loop_detector = loop_detector
        self.mapper = mapper
    
    def process_frame(self, image, timestamp):
        """
        Process one frame
        """
        # 1. Frontend: estimate the camera motion
        pose, features = self.frontend.track(image)
        
        # 2. Loop closure detection
        loop = self.loop_detector.detect(image, features)
        
        if loop:
            # Loop detected: run a global optimization
            self.backend.optimize_with_loop(loop)
        
        # 3. Local optimization
        self.backend.local_optimize()
        
        # 4. Update the map
        self.mapper.update(features, pose)
        
        return pose

"""
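Feature-based vs. direct methods:

  Feature-based (ORB-SLAM):
    Extract feature points → match them → estimate motion
    Pros: robust, accurate
    Cons: depends on the quality of the features (struggles in low-texture scenes)
    
  Direct (LSD-SLAM):
    Use pixel intensities directly → minimize the photometric error
    Pros: needs no feature extraction; yields semi-dense or dense maps
    Cons: sensitive to illumination changes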
"""

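The photometric error minimized by direct methods can be illustrated with a toy problem. The sketch reduces the camera motion to an integer horizontal image shift and searches for the shift with the lowest mean squared intensity difference; real systems warp pixels with depth and a 6-DoF pose instead. The synthetic images and the name `photometric_error` are assumptions for illustration.

```python
import numpy as np

def photometric_error(ref, cur, shift):
    """Mean squared intensity difference after undoing a horizontal shift of `cur`."""
    warped = np.roll(cur, -shift, axis=1)
    return float(np.mean((ref - warped) ** 2))

rng = np.random.default_rng(0)
ref = rng.random((32, 64))                    # synthetic reference frame
cur = np.roll(ref, 3, axis=1)                 # current frame: moved 3 pixels right

# Brute-force search over candidate motions
errors = {s: photometric_error(ref, cur, s) for s in range(-5, 6)}
best_shift = min(errors, key=errors.get)

# A global brightness change leaves a nonzero error even at the true motion
bright_error = photometric_error(ref, cur + 0.2, 3)
print(best_shift, errors[best_shift], bright_error)
```

The search recovers the true shift with zero error, while the brightened frame leaves a residual of about 0.04 at the correct motion, which is the illumination sensitivity noted above.
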
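9. 3D Generative Models

9.1 Text-to-3D

Text-to-3D generation:

  Input: a text description
  Output: a 3D model

  Methods:
    1. DreamFusion: optimizes a NeRF under the guidance of a 2D diffusion model
    2. Magic3D: coarse-to-fine generation
    3. ProlificDreamer: variational score distillation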

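python

import torch


class DreamFusion:
    """
    DreamFusion
    
    Uses a 2D diffusion model to guide 3D generation
    
    Core idea:
      1. Initialize a NeRF
      2. Render it from a random viewpoint
      3. Compute a gradient with the diffusion model
      4. Update the NeRF
      
    Theory:
      Score Distillation Sampling (SDS):
        ∇_θ L_SDS = E[w(t) · (ε_θ(z_t, t, y) - ε) · ∂x/∂θ]
    """
    def __init__(self, nerf, diffusion_model, lr=1e-3):
        self.nerf = nerf                  # assumed to be an nn.Module
        self.diffusion = diffusion_model
        # Adam and the learning rate are illustrative choices
        self.optimizer = torch.optim.Adam(nerf.parameters(), lr=lr)
    
    def generate(self, text_prompt, num_steps=10000):
        """
        Generate a 3D model
        """
        for step in range(num_steps):
            # Random viewpoint (sample_random_camera is not shown here)
            camera = self.sample_random_camera()
            
            # Render
            image = self.nerf.render(camera)
            
            # SDS loss (compute_sds_loss is not shown here)
            loss = self.compute_sds_loss(image, text_prompt)
            
            # Update the NeRF
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            
        return self.nerf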

"""
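Score Distillation Sampling (SDS):

  Idea:
    Use a pretrained 2D diffusion model as guidance
    The diffusion model knows what a "good image" should look like
    This knowledge is passed to the 3D model through gradients
    
  Formula (SDS is defined through its gradient):
    ∇_θ L_SDS = E[w(t) · (ε_θ(z_t, t, y) - ε) · ∂x/∂θ]
    
    where:
      ε_θ: noise predicted by the frozen diffusion model
      ε: noise that was actually added
      y: text condition
      x: rendered image
      z_t: noised version of x at timestep t
      w(t): timestep-dependent weight
      θ (in ∂x/∂θ): NeRF parameters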
"""

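The SDS update can be demonstrated on a toy problem where everything is known in closed form. In this sketch the "renderer" is the identity (x = θ), and the "diffusion model" is a stand-in: for data concentrated at a single target, the ideal noise prediction is (z_t - √ᾱ · target) / √(1 - ᾱ). The weighting w(t) = 1 - ᾱ, the noise schedule range, and the step size are all assumptions chosen for illustration, not DreamFusion's settings.

```python
import numpy as np

def sds_gradient(x, target, alpha_bar, rng):
    """One-sample SDS gradient for an identity renderer (dx/dtheta = I)."""
    eps = rng.normal(size=x.shape)                                 # noise actually added
    z_t = np.sqrt(alpha_bar) * x + np.sqrt(1 - alpha_bar) * eps    # noised render
    # Stand-in for the diffusion model's noise prediction
    eps_pred = (z_t - np.sqrt(alpha_bar) * target) / np.sqrt(1 - alpha_bar)
    w = 1 - alpha_bar                                              # assumed weighting w(t)
    return w * (eps_pred - eps)

rng = np.random.default_rng(0)
theta = np.zeros(4)                           # "3D parameters", here just a vector
target = np.array([1.0, -1.0, 0.5, 2.0])      # what the toy model considers a good image

for _ in range(500):
    alpha_bar = rng.uniform(0.1, 0.9)         # random noise level, standing in for t
    theta -= 0.1 * sds_gradient(theta, target, alpha_bar, rng)

print(theta)
```

Because the predicted and added noise differ only by a term proportional to (x - target), each step pulls θ toward the target, and no gradient flows through the diffusion model itself.
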
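9.2 3D Editing

3D editing and related generation methods:

  1. Instruct-NeRF2NeRF:
     Edits a NeRF scene according to text instructions
     
  2. Point-E:
     Generates point clouds from text (a generative model rather than an editing tool)
     
  3. Shap-E:
     Directly generates the parameters of an implicit 3D representation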

10. Evaluation Metrics and Applications

10.1 Evaluation Metrics

┌─────────────────────────────────────────────────────────────────────┐
│                    3D Vision Evaluation Metrics                     │
├─────────────────────────┬───────────────────────────────────────────┤
│  Task                   │  Metrics                                  │
├─────────────────────────┼───────────────────────────────────────────┤
│  3D detection           │  AP@IoU (at a 3D IoU threshold)           │
│                         │  BEV AP (bird's-eye view)                 │
├─────────────────────────┼───────────────────────────────────────────┤
│  3D segmentation        │  mIoU (mean intersection over union)      │
│                         │  Per-class IoU                            │
├─────────────────────────┼───────────────────────────────────────────┤
│  3D reconstruction      │  Chamfer Distance                         │
│                         │  Hausdorff Distance                       │
│                         │  F-Score                                  │
├─────────────────────────┼───────────────────────────────────────────┤
│  Novel view synthesis   │  PSNR, SSIM, LPIPS                        │
│                         │  Rendering speed (FPS)                    │
├─────────────────────────┼───────────────────────────────────────────┤
│  SLAM                   │  ATE (absolute trajectory error)          │
│                         │  RPE (relative pose error)                │
└─────────────────────────┴───────────────────────────────────────────┘
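
Three of the metrics in the table are simple enough to sketch in a few lines. The versions below are deliberately simplified and the simplifications are assumptions: the 3D IoU handles only axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), whereas detection benchmarks use rotated boxes; PSNR assumes images scaled to [0, max_val]; and the ATE assumes the estimated trajectory has already been aligned to the ground truth.

```python
import numpy as np

def iou_3d_axis_aligned(box1, box2):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        lo = max(box1[axis], box2[axis])
        hi = min(box1[axis + 3], box2[axis + 3])
        if hi <= lo:
            return 0.0                       # no overlap along this axis
        inter *= hi - lo
    vol1 = np.prod(np.subtract(box1[3:], box1[:3]))
    vol2 = np.prod(np.subtract(box2[3:], box2[:3]))
    return float(inter / (vol1 + vol2 - inter))

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ate_rmse(est_positions, gt_positions):
    """RMSE of per-frame position errors for pre-aligned trajectories."""
    err = np.asarray(est_positions, dtype=float) - np.asarray(gt_positions, dtype=float)
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))

iou = iou_3d_axis_aligned((0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3))
quality = psnr(np.full((4, 4), 0.1), np.zeros((4, 4)))
gt_traj = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0]], dtype=float)
ate = ate_rmse(gt_traj + np.array([0.0, 0.3, 0.4]), gt_traj)
print(iou, quality, ate)
```

The two boxes overlap in a unit cube out of a union of 15, a uniform error of 0.1 gives a PSNR of about 20 dB, and a constant offset of (0, 0.3, 0.4) gives an ATE of 0.5.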

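python

import torch


class Metrics3D:
    """3D evaluation metrics"""
    
    @staticmethod
    def chamfer_distance(pred_points, gt_points):
        """
        Chamfer Distance
        
        CD = (1/|P|) Σ_p min_g ‖p - g‖² + (1/|G|) Σ_g min_p ‖g - p‖²
        
        Idea:
          Measures how similar two point clouds are
          Each point is matched to its nearest neighbor in the other cloud,
          and the squared distances are averaged
        
        pred_points: [N, 3], gt_points: [M, 3]
        """
        # Pairwise Euclidean distances, shape [N, M]
        dist = torch.cdist(pred_points, gt_points)
        
        # pred -> gt: nearest ground-truth point for every predicted point
        min_dist_pred = dist.min(dim=1)[0]
        
        # gt -> pred: nearest predicted point for every ground-truth point
        min_dist_gt = dist.min(dim=0)[0]
        
        # Square the distances to match the formula above
        cd = (min_dist_pred ** 2).mean() + (min_dist_gt ** 2).mean()
        
        return cd
    
    @staticmethod
    def iou_3d(box1, box2):
        """
        3D IoU
        
        Intersection over union of two 3D bounding boxes
        """
        # Not implemented here; for rotated boxes this can be computed
        # with a geometry library such as Shapely or trimesh
        raise NotImplementedError
    
    @staticmethod
    def fscore(pred_points, gt_points, threshold=0.01):
        """
        F-Score
        
        F = 2 · Precision · Recall / (Precision + Recall)
        
        Precision: fraction of predicted points within the threshold of a ground-truth point
        Recall: fraction of ground-truth points within the threshold of a predicted point
        """
        # Pairwise Euclidean distances, shape [N, M]
        dist = torch.cdist(pred_points, gt_points)
        
        precision = (dist.min(dim=1)[0] < threshold).float().mean()
        recall = (dist.min(dim=0)[0] < threshold).float().mean()
        
        # The small constant avoids division by zero
        fscore = 2 * precision * recall / (precision + recall + 1e-8)
        
        return fscore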

"""
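Chamfer Distance vs. F-Score:

  Chamfer Distance:
    Measures the average nearest-neighbor distance
    Sensitive to outliers
    
  F-Score:
    Measures the fraction of points within a distance threshold
    Focuses on the correctly reconstructed part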
"""

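The difference in outlier sensitivity is easy to see on synthetic data. The sketch below compares both metrics on a clean reconstruction and on one with a single far-away point; it uses NumPy re-implementations of the two metrics (squared-distance Chamfer, F-Score with a threshold of 0.02), and the point clouds are made up for illustration.

```python
import numpy as np

def nearest_dists(a, b):
    """For each point in a, the distance to its nearest neighbor in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def chamfer(pred, gt):
    """Symmetric Chamfer distance with squared nearest-neighbor distances."""
    return float((nearest_dists(pred, gt) ** 2).mean()
                 + (nearest_dists(gt, pred) ** 2).mean())

def fscore(pred, gt, threshold=0.02):
    """Harmonic mean of precision and recall at a distance threshold."""
    precision = float((nearest_dists(pred, gt) < threshold).mean())
    recall = float((nearest_dists(gt, pred) < threshold).mean())
    return 2 * precision * recall / (precision + recall + 1e-8)

# Ground truth: 100 points on a line segment
gt = np.stack([np.linspace(0, 1, 100), np.zeros(100), np.zeros(100)], axis=1)
clean = gt.copy()
noisy = gt.copy()
noisy[0] = [0.0, 10.0, 0.0]                   # a single far-away outlier

cd_clean, f_clean = chamfer(clean, gt), fscore(clean, gt)
cd_noisy, f_noisy = chamfer(noisy, gt), fscore(noisy, gt)
print(cd_clean, f_clean)
print(cd_noisy, f_noisy)
```

One outlier among 100 points moves the Chamfer distance from 0 to roughly 1, while the F-Score drops by less than one percentage point.
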
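10.2 Applications

Applications of 3D vision:

  1. Autonomous driving:
     3D object detection
     Point cloud segmentation
     High-definition maps
     
  2. Robotics:
     Grasp planning
     Navigation and obstacle avoidance
     Scene understanding
     
  3. AR/VR:
     Scene reconstruction
     Blending of virtual and real content
     Novel view rendering
     
  4. Medicine:
     3D organ reconstruction
     Surgical planning
     Image analysis
     
  5. Digital twins:
     City modeling
     Factory simulation
     Virtual tourism
     
  6. Film and television:
     3D visual effects
     Virtual humans
     Scene reconstruction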

Appendix

A. Timeline of 3D Vision

2017  ──┬──  PointNet (deep learning on point clouds)
        │
2017  ──┼──  PointNet++ (hierarchical point cloud learning)
        │
2018  ──┼──  VoxelNet (voxel-based detection)
        │
2019  ──┼──  PointPillars (fast detection)
        │
2020  ──┼──  NeRF (neural radiance fields), NeRF++
        │
2021  ──┼──  Mip-NeRF
        │
2022  ──┼──  Instant-NGP (fast NeRF)
        │
2022  ──┼──  DreamFusion (text-to-3D)
        │
2023  ──┼──  3D Gaussian Splatting
        │
2024+ ──┴──  Faster, higher-quality 3D generation

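B. Key Formulas

Formula	Meaning
C(r) = ∫ T(t)·σ(r(t))·c(r(t),d) dt	NeRF volume rendering equation
G(x) = exp(-½(x-μ)ᵀΣ⁻¹(x-μ))	3D Gaussian
CD = (1/|P|) Σ min‖p-g‖² + (1/|G|) Σ min‖g-p‖²	Chamfer distance
p₂ᵀ·F·p₁ = 0	Epipolar constraint
PQ = SQ × RQ	Panoptic quality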

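C. Recommended Resources

Qi, C.R., et al. (2017). PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
Mildenhall, B., et al. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Kerbl, B., et al. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering
Mur-Artal, R., et al. (2015). ORB-SLAM: A Versatile and Accurate Monocular SLAM System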
