Behind the Microsoft Blue Screen Incident: Problematic Data Triggered an Out-of-Bounds Memory Read

Tell stories that move people, write code that people can understand

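CrowdStrike recently published on its website the official preliminary post-incident review of the worldwide enterprise Windows blue-screen incident of July 19, 2024. The review describes the root cause of the incident as follows: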

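On July 19, 2024, we deployed two additional IPC Template Instances. Due to a bug in the Content Validator, one of the two Template Instances passed validation even though it contained problematic content data.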

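Based on the testing performed before the initial deployment of the Template Type (on March 5, 2024), our trust in the checks performed by the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.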

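When received by the sensor and loaded into the Content Interpreter, the problematic content in Channel File 291 resulted in an out-of-bounds memory read, which triggered an exception. The code could not handle this exception gracefully, resulting in a Windows operating system crash (blue screen).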

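In plain language: because of a bug, the Content Validator failed to catch the problematic content data in a Template Instance, so that Template Instance reached production. There, the Content Interpreter inside the sensor read the problematic Template Instance and performed an out-of-bounds memory read. The Content Interpreter, written in C++, could not gracefully handle this "out-of-bounds memory read" condition, and the result was blue screens on enterprise Windows machines around the world.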
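To make the chain "validator misses it, interpreter trusts it" concrete, here is a minimal sketch. Everything in it is hypothetical: `TemplateInstance`, `kExpectedFieldCount`, `validate`, and `interpret` are names invented for illustration and say nothing about CrowdStrike's real code. The idea shown is only that the interpreter can repeat the size check itself rather than rely on the validator having done it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of a template instance: a flat list of field values.
struct TemplateInstance {
  std::vector<std::string> fields;
};

// Hypothetical number of fields the interpreter reads from every instance.
constexpr std::size_t kExpectedFieldCount = 3;

// Validator: accepts only instances the interpreter can read safely.
bool validate(const TemplateInstance &instance) {
  return instance.fields.size() >= kExpectedFieldCount;
}

// Interpreter: repeats the size check instead of trusting the validator.
// Returns false rather than reading past the end of `fields`.
bool interpret(const TemplateInstance &instance, std::string &out) {
  if (instance.fields.size() < kExpectedFieldCount) {
    return false; // malformed content is refused, nothing is read
  }
  out.clear();
  for (std::size_t i = 0; i < kExpectedFieldCount; ++i) {
    out += instance.fields[i]; // i is always below fields.size() here
  }
  return true;
}
```

With this shape, a bug in `validate` alone is not enough to cause an out-of-bounds read, because `interpret` refuses short input on its own.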

What is an "out-of-bounds memory read"?

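An out-of-bounds memory read happens when a program tries to access or read memory outside the region that was allocated to it or that it may legitimately use. It usually occurs with arrays or other contiguous memory structures, when an index or pointer goes past the valid bounds. In essence, an out-of-bounds memory read is an access to a memory location that is unauthorized or unallocated.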

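Common scenarios include an array index that exceeds the array's defined size, a pointer that references memory past the end of an allocated block, and a string operation that runs beyond the string's actual length.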

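Potential consequences include reading uninitialized or unrelated data, program instability or crashes, and security vulnerabilities such as information leaks or flaws that attackers can exploit.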

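Some programming languages (such as C/C++) may permit this behavior without reporting an error right away. Other languages (such as Java, Python, and Rust) normally throw an exception or terminate the program immediately.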

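Ways to detect and prevent out-of-bounds memory reads include using data structures and functions that perform bounds checking, running static code analysis tools, and paying attention to boundary conditions while coding.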
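As a rough illustration of "bounds-checked data structures and functions", the sketch below shows two options in C++. `checkedRead` and `valueOr` are hypothetical helper names made up for this example. `std::vector::at` is the standard library's checked accessor, which throws `std::out_of_range` where `operator[]` would not check anything.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper for raw buffers: copies buffer[index] into `out`
// only when the index is in range, and reports success or failure.
bool checkedRead(const int *buffer, std::size_t length, std::size_t index,
                 int &out) {
  if (buffer == nullptr || index >= length) {
    return false; // refuse the read; `out` is left untouched
  }
  out = buffer[index];
  return true;
}

// Hypothetical helper for vectors: returns values.at(index), or `fallback`
// when the index is out of range.
int valueOr(const std::vector<int> &values, std::size_t index, int fallback) {
  try {
    return values.at(index); // at() throws std::out_of_range when invalid
  } catch (const std::out_of_range &) {
    return fallback;
  }
}
```

For example, with a three-element array, `checkedRead(raw, 3, 3, out)` returns false instead of touching the element past the end.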

Does C++ really allow an out-of-bounds memory read without reporting an error immediately?

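Yes. If you don't believe it, copy and paste the C++ code below into repl.com, run it, and look at the result. (Note: this C++ code only illustrates the problem by simulating one scenario, an array index that exceeds the array's defined size. It is not the code that actually failed in this incident.)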


 1 #include <iostream>
 2 #include <stdexcept>
 3 #include <vector>
 4 
 5 // Simulates receiving data from the sensor
 6 std::vector<int> receiveSensorData(int channel) {
 7   // Assume the data for channel 291 contains the problematic content
 8   if (channel == 291) {
 9     return {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // problematic data
10   } else {
11     return {6, 7, 8, 9, 10}; // normal data
12   }
13 }
14 
15 // Simulated content interpreter class
16 class ContentInterpreter {
17 public:
18   void loadContent(const std::vector<int> &data) {
19     if (data.size() < 10) {
20       throw std::runtime_error("Data size too small for processing");
21     }
22 
23     // Simulate processing the data
24     for (size_t i = 0; i <= data.size(); ++i) {
25       // Out-of-bounds access: the last iteration reads past the end
26       std::cout << "Processing data: " << data[i] << std::endl;
27     }
28   }
29 };
30 
31 int main() {
32   int channel = 291; // the channel where the failure occurs
33   try {
34     std::vector<int> sensorData = receiveSensorData(channel);
35     ContentInterpreter interpreter;
36     interpreter.loadContent(sensorData);
37   } catch (const std::exception &e) {
38     std::cerr << "Exception caught: " << e.what() << std::endl;
39   }
40 
41   return 0;
42 }
// Output:
// Processing data: 1
// Processing data: 2
// Processing data: 3
// Processing data: 4
// Processing data: 5
// Processing data: 6
// Processing data: 7
// Processing data: 8
// Processing data: 9
// Processing data: 10
// Processing data: 1041

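Note line 24 of the code above: the loop condition i <= data.size() is off by one, so the last iteration reads data[10], one element past the end of the vector. The correct condition is i < data.size(). When run, the C++ program did not stop at the out-of-bounds read. It carried on and printed Processing data: 1041. That 1041 is whatever happened to sit in memory just past the vector's elements. Strictly speaking, this read is undefined behavior in C++, so the value printed (and even whether the program crashes) can differ between compilers, platforms, and runs.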

The stray value 1041 looks harmless enough, but a program that keeps running after an out-of-bounds read carries much larger risks:

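Security risk: reading memory the program has no business reading can leak sensitive information.
Stability problems: the program may crash at some later point because of the undetected error, as in this Microsoft blue-screen incident.
Hard to debug: because the error is not caught where it happens, the source of the problem can be hard to locate.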

What happens if we convert the C++ code above into equivalent Rust code and run it?


 1 use std::fmt;
 2 use std::vec::Vec;
 3 
 4 #[derive(Debug)]
 5 struct DataSizeTooSmallError;
 6 
 7 impl fmt::Display for DataSizeTooSmallError {
 8     fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
 9         write!(f, "Data size too small for processing")
10     }
11 }
12 
13 impl std::error::Error for DataSizeTooSmallError {}
14 
15 fn receive_sensor_data(channel: i32) -> Vec<i32> {
16     if channel == 291 {
17         vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10] // problematic data
18     } else {
19         vec![6, 7, 8, 9, 10] // normal data
20     }
21 }
22 
23 struct ContentInterpreter;
24 
25 impl ContentInterpreter {
26     fn load_content(&self, data: &[i32]) -> Result<(), Box<dyn std::error::Error>> {
27         if data.len() < 10 {
28             return Err(Box::new(DataSizeTooSmallError));
29         }
30 
31         // Simulate processing the data
32         for i in 0..=data.len() {
33             // Out-of-bounds access: the last iteration goes past the end
34             println!("Processing data: {}", data[i]);
35         }
36         Ok(())
37     }
38 }
39 
40 fn main() {
41     let channel = 291; // the channel where the failure occurs
42     let sensor_data = receive_sensor_data(channel);
43     let interpreter = ContentInterpreter;
44 
45     match interpreter.load_content(&sensor_data) {
46         Ok(_) => (),
47         Err(e) => println!("Exception caught: {}", e),
48     }
49 }
// Output:
// Processing data: 1
// Processing data: 2
// Processing data: 3
// Processing data: 4
// Processing data: 5
// Processing data: 6
// Processing data: 7
// Processing data: 8
// Processing data: 9
// Processing data: 10
// thread 'main' panicked at src/main.rs:34:45:
// index out of bounds: the len is 10 but the index is 10
// note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

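As the code above shows, the inclusive range on line 32 (0..=data.len(); the correct code is for i in 0..data.len()) produces an index one past the end, just as in the C++ version. But the indexing expression data[i] on line 34 is bounds-checked, so instead of reading out-of-bounds memory the Rust program panicked immediately and stopped. The output states the location and cause of the error clearly: "index out of bounds: the len is 10 but the index is 10". Stopping immediately is safer for the following reasons.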

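Immediate detection: the error is caught where it happens, which makes it easy to locate and fix.
No undefined behavior: potential memory corruption or information leaks are prevented.
Better reliability: failing fast keeps the program from continuing to run in a corrupted state.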

What happens if we convert the C++ code above into equivalent Java code and run it?


 1 import java.util.List;
 2 import java.util.ArrayList;
 3 
 4 public class Main {
 5   // Simulates receiving data from the sensor
 6   public static List<Integer> receiveSensorData(int channel) {
 7     List<Integer> data = new ArrayList<>();
 8     if (channel == 291) {
 9       // problematic data
10       data.add(1);
11       data.add(2);
12       data.add(3);
13       data.add(4);
14       data.add(5);
15       data.add(6);
16       data.add(7);
17       data.add(8);
18       data.add(9);
19       data.add(10);
20     } else {
21       // normal data
22       data.add(6);
23       data.add(7);
24       data.add(8);
25       data.add(9);
26       data.add(10);
27     }
28     return data;
29   }
30 
31   // Simulated content interpreter class
32   static class ContentInterpreter {
33     public void loadContent(List<Integer> data) throws Exception {
34       if (data.size() < 10) {
35         throw new Exception("Data size too small for processing");
36       }
37 
38       // Simulate processing the data
39       for (int i = 0; i <= data.size(); i++) {
40         // Out-of-bounds access: the last iteration goes past the end
41         System.out.println("Processing data: " + data.get(i));
42       }
43     }
44   }
45 
46   public static void main(String[] args) {
47     int channel = 291; // the channel where the failure occurs
48     try {
49       List<Integer> sensorData = receiveSensorData(channel);
50       ContentInterpreter interpreter = new ContentInterpreter();
51       interpreter.loadContent(sensorData);
52     } catch (Exception e) {
53       System.err.println("Exception caught: " + e.getMessage());
54     }
55   }
56 }
// Output:
// Processing data: 1
// Processing data: 2
// Processing data: 3
// Processing data: 4
// Processing data: 5
// Processing data: 6
// Processing data: 7
// Processing data: 8
// Processing data: 9
// Processing data: 10
// Exception caught: Index 10 out of bounds for length 10

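As the code above shows, the loop condition on line 39 is off by one (the correct code is i < data.size();), so the last iteration asks for an element past the end of the list. Java behaves much like Rust here: the call data.get(i) on line 41 checks the index and throws an IndexOutOfBoundsException instead of reading out-of-bounds memory. The exception is caught by the try-catch block in the main method and printed. This approach also provides good safety, for the following reasons.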

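Exception handling: the language provides a structured error-handling mechanism.
Clear error messages: the exception message states exactly what went wrong.
No further execution: the program is stopped from continuing in an erroneous state, which improves reliability.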

How would a professional C++ programmer improve the code to avoid out-of-bounds memory reads?

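A professional C++ programmer would improve the code along the following lines to handle potential out-of-bounds access gracefully. The approach emphasizes prevention, uses modern C++ features, and handles exceptions at the appropriate level.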


 1 #include <algorithm>
 2 #include <iostream>
 3 #include <stdexcept>
 4 #include <vector>
 5 
 6 // Simulates receiving data from the sensor
 7 std::vector<int> receiveSensorData(int channel) {
 8   if (channel == 291) {
 9     return {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // problematic data
10   } else {
11     return {6, 7, 8, 9, 10}; // normal data
12   }
13 }
14 
15 // Simulated content interpreter class
16 class ContentInterpreter {
17 public:
18   void loadContent(const std::vector<int> &data) {
19     if (data.size() < 10) {
20       throw std::runtime_error("Data size too small for processing");
21     }
22 
23     // Use a range-based for loop to avoid manual indexing
24     for (const auto &value : data) {
25       processData(value);
26     }
27   }
28 
29 private:
30   void processData(int value) {
31     std::cout << "Processing data: " << value << std::endl;
32   }
33 };
34 
35 // Error-handling function
36 void handleError(const std::exception &e) {
37   std::cerr << "Error: " << e.what() << std::endl;
38   // Logging, error reporting, and similar logic can be added here
39 }
40 
41 int main() {
42   int channel = 291; // the channel where the failure occurs
43 
44   try {
45     std::vector<int> sensorData = receiveSensorData(channel);
46     ContentInterpreter interpreter;
47     interpreter.loadContent(sensorData);
48   } catch (const std::exception &e) {
49     handleError(e);
50     return 1; // abnormal exit
51   }
52 
53   return 0;
54 }
// Output:
// Processing data: 1
// Processing data: 2
// Processing data: 3
// Processing data: 4
// Processing data: 5
// Processing data: 6
// Processing data: 7
// Processing data: 8
// Processing data: 9
// Processing data: 10

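This version of the C++ code includes the following main improvements.

Range-based for loop: on line 24, the loadContent method uses a range-based for loop in place of manual indexing. With no index variable in the loop, the off-by-one access can no longer occur.
Error-handling strategy: on line 36, a separate handleError function centralizes error handling, which keeps the error-handling logic in one place and consistent.
Exception handling: the main function catches any exception derived from std::exception and passes it to handleError, giving the program a single, uniform error-handling path.
Error return code: on line 50, the program returns a non-zero value when an error occurs, signaling an abnormal exit.
Modularity: on line 30, the data-processing logic is wrapped in the private processData method, which improves modularity and maintainability.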
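A range-based for loop covers the case where every element is visited. When a loop genuinely needs an index or only a prefix of the data, one common precaution is to clamp the loop bound to the container's size first. The sketch below assumes a hypothetical helper, `sumFirst`, invented for this illustration, and uses `std::min` from `<algorithm>` for the clamp.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums at most `limit` leading elements of `data`.
// The loop bound is clamped to data.size(), so data[i] is always in range.
long long sumFirst(const std::vector<int> &data, std::size_t limit) {
  const std::size_t count = std::min(limit, data.size());
  long long total = 0;
  for (std::size_t i = 0; i < count; ++i) {
    total += data[i];
  }
  return total;
}
```

Asking for more elements than exist, as in `sumFirst(data, 10)` on a three-element vector, simply sums the three that are there.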

How would a professional Rust programmer improve the code to avoid out-of-bounds memory reads?

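Similarly, a professional Rust programmer would typically improve this code by using a for loop over an iterator instead of indexed access. This is the most common and safest approach: with no manual index, there is no index that can go out of bounds.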


    // Iterate over the data with a for loop
    for &item in data {
        println!("Processing data: {}", item);
    }

How would a professional Java programmer improve the code to avoid out-of-bounds memory reads?

A professional Java programmer would typically improve this code in one of the following ways to avoid out-of-bounds reads.

Use the enhanced for loop (for-each loop), or the forEach method introduced in Java 8, which takes a lambda expression.

Both are common and safe approaches: they avoid manual indexing entirely, so no index can go out of bounds. The modified loadContent method looks like this:


public void loadContent(List<Integer> data) throws Exception {
    if (data.size() < 10) {
        throw new Exception("Data size too small for processing");
    }

    // Iterate over the data with the enhanced for loop
    for (Integer item : data) {
        System.out.println("Processing data: " + item);
    }
}

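Or with forEach and a lambda expression (Java 8 and later; note that this is the forEach method of Iterable rather than the Stream API proper):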


public void loadContent(List<Integer> data) throws Exception {
    if (data.size() < 10) {
        throw new Exception("Data size too small for processing");
    }

    // Iterate over the data with forEach and a lambda expression
    data.forEach(item -> System.out.println("Processing data: " + item));
}

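These approaches have the following advantages.

Safety: with no manual index, out-of-bounds access in the loop is ruled out.
Readability: the code is shorter and clearer, and its intent is more explicit.
Performance: in most cases, performance is on par with traditional indexed iteration.

These approaches also match modern Java style, and the forEach version in particular leans toward functional programming.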

If you enjoyed this article, don't forget to give it a "like" to encourage me to keep writing! 😃