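Apache GeaFlow Inference Framework (Geaflow-infer) Analysis Series, Part 7: Data Read and Write Flow

Chapter 7: Data Read and Write Flow

Chapter Guide

This chapter explains how InferDataWriter and InferDataReader read and write data efficiently over shared memory. After working through it, you will understand:

The data frame format (4-byte length header + body)
Byte order handling (little-endian)
Streaming I/O and buffer management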

Core Design

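✓ Self-describing data frames (length header)
✓ A single, agreed byte order
✓ Standard Java I/O interfaces
✓ Streaming, so large payloads are supported
✓ Lock-free and high-performance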

7.1 InferDataWriter Implementation

Core Responsibilities

import java.io.Closeable;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class InferDataWriter implements Closeable {
    
    private static final int HEADER_LENGTH = 4;  // 4-byte length header
    private final DataQueueOutputStream outputStream;
    private final byte[] dataHeaderBytes;
    private final ByteBuffer headerByteBuffer;
    
    public InferDataWriter(DataExchangeQueue queue) {
        this.outputStream = new DataQueueOutputStream(queue);
        // Reusable buffer for the length header
        this.dataHeaderBytes = new byte[HEADER_LENGTH];
        this.headerByteBuffer = ByteBuffer.wrap(dataHeaderBytes);
        this.headerByteBuffer.order(ByteOrder.LITTLE_ENDIAN);
    }
    
    /**
     * Write a single record.
     */
    public boolean write(byte[] record) throws IOException {
        return write(record, 0, record.length);
    }
    
    /**
     * Write part of an array: length bytes starting at offset.
     */
    public boolean write(byte[] record, int offset, int length) 
            throws IOException {
        
        // Total size = 4-byte length header + body.
        // length is already the byte count from offset, so offset is not subtracted.
        int outputSize = HEADER_LENGTH + length;
        
        // 1. Check for free space (lock-free)
        if (!outputStream.tryReserveBeforeWrite(outputSize)) {
            return false;  // Queue is full
        }
        
        // 2. Build the length header (little-endian)
        byte[] headerData = extractHeaderData(length);
        
        // 3. Write the length header first
        outputStream.write(headerData, 0, HEADER_LENGTH);
        
        // 4. Then write the body
        outputStream.write(record, offset, length);
        
        return true;
    }
    
    /**
     * Encode the length header in little-endian order.
     */
    private byte[] extractHeaderData(int length) {
        headerByteBuffer.clear();
        headerByteBuffer.putInt(length);
        return dataHeaderBytes;
    }
    
    // Assumed: closing the writer simply closes the underlying stream.
    @Override
    public void close() throws IOException {
        outputStream.close();
    }
}
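
The header buffer is allocated once and reused for every record, which is why extractHeaderData calls clear() before putInt(). The following is a minimal sketch of that pattern in isolation. ReusableHeader and encode are hypothetical names chosen for illustration, not GeaFlow classes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only (not part of GeaFlow).
final class ReusableHeader {

    private final byte[] headerBytes = new byte[4];
    private final ByteBuffer headerBuffer =
        ByteBuffer.wrap(headerBytes).order(ByteOrder.LITTLE_ENDIAN);

    // Overwrites the shared 4-byte array with the little-endian encoding of length.
    byte[] encode(int length) {
        headerBuffer.clear();        // resets position to 0; does not erase contents
        headerBuffer.putInt(length); // writes through to headerBytes
        return headerBytes;          // the same array instance on every call
    }

    public static void main(String[] args) {
        ReusableHeader header = new ReusableHeader();
        byte[] first = header.encode(10);
        System.out.println(first[0]);          // prints 10
        byte[] second = header.encode(258);    // 258 = 0x0102
        System.out.println(second[0] + " " + second[1]); // prints 2 1
        System.out.println(first == second);   // prints true
    }
}
```

Because the same array is returned each time, the caller has to consume the header before encoding the next one, and a single instance should not be shared between threads without external synchronization.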

Data Frame Format

┌──────────────────────────────────────────┐
│ Data format written by InferDataWriter  │
├──────────────────────────────────────────┤
│ [4-byte length header (little-endian)]   │
│  0x0A 0x00 0x00 0x00  → 10               │
│  (the next 10 bytes are the body)        │
├──────────────────────────────────────────┤
│ [Body, 10 bytes]                         │
│  0x48 0x65 0x6C 0x6C 0x6F ...            │
│  (H    e    l    l    o    ...)          │
└──────────────────────────────────────────┘

Example: writing "Hello " (6 bytes) and "World" (5 bytes) as two records

Frame 1:
  Length: 0x06 0x00 0x00 0x00
  Data:   0x48 0x65 0x6C 0x6C 0x6F 0x20 (H e l l o [space])

Frame 2:
  Length: 0x05 0x00 0x00 0x00
  Data:   0x57 0x6F 0x72 0x6C 0x64 (W o r l d)
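
The frame layout can be reproduced with a few lines of plain Java. This is a rough sketch rather than GeaFlow code: FrameEncoder and encodeFrame are hypothetical names, and the frame is built in an ordinary heap array instead of shared memory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of GeaFlow).
final class FrameEncoder {

    static final int HEADER_LENGTH = 4;

    // Builds one frame: a 4-byte little-endian length followed by the payload.
    static byte[] encodeFrame(byte[] payload) {
        ByteBuffer frame = ByteBuffer.allocate(HEADER_LENGTH + payload.length);
        frame.order(ByteOrder.LITTLE_ENDIAN);
        frame.putInt(payload.length);
        frame.put(payload);
        return frame.array();
    }

    public static void main(String[] args) {
        byte[] frame = encodeFrame("World".getBytes(StandardCharsets.US_ASCII));
        StringBuilder hex = new StringBuilder();
        for (byte b : frame) {
            hex.append(String.format("0x%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
        // prints: 0x05 0x00 0x00 0x00 0x57 0x6F 0x72 0x6C 0x64
    }
}
```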

Byte Order Handling

// Little Endian vs Big Endian

Integer 0x12345678

Big Endian (network byte order):
  ├─ byte 0: 0x12
  ├─ byte 1: 0x34
  ├─ byte 2: 0x56
  └─ byte 3: 0x78

Little Endian (native on x86 and on most ARM systems):
  ├─ byte 0: 0x78
  ├─ byte 1: 0x56
  ├─ byte 2: 0x34
  └─ byte 3: 0x12

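// Java's ByteBuffer supports both byte orders
byte[] bytes = new byte[4];
ByteBuffer bb = ByteBuffer.wrap(bytes);

// Option 1: Big Endian (the default)
bb.putInt(0x12345678);  
// Result: [0x12, 0x34, 0x56, 0x78]

// Option 2: Little Endian (used by GeaFlow)
bb.clear();  // rewind to position 0 before writing again
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.putInt(0x12345678);
// Result: [0x78, 0x56, 0x34, 0x12]

// Why Little Endian?
// ✓ It is the native byte order on x86 and on most ARM systems
// ✓ It avoids byte-order conversion overhead
// ✓ Both Java and Python (via mmap) can read it quickly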

7.2 InferDataReader Implementation

Core Responsibilities

import java.io.Closeable;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class InferDataReader implements Closeable {
    
    private static final int HEADER_LENGTH = 4;
    private final DataInputStream input;
    private static final AtomicBoolean END = new AtomicBoolean(false);
    
    public InferDataReader(DataExchangeQueue queue) {
        DataQueueInputStream dataQueueInputStream = 
            new DataQueueInputStream(queue);
        this.input = new DataInputStream(dataQueueInputStream);
    }
    
    /**
     * Read one record. Returns null at end of stream.
     */
    public byte[] read() throws IOException {
        // 1. Read the 4-byte length header
        byte[] buffer = new byte[HEADER_LENGTH];
        int bytesNum = input.read(buffer);
        
        // End of stream: mark the reader as finished
        if (bytesNum < 0) {
            END.set(true);
            return null;
        }
        
        // If the header arrived only partially, read the rest of it
        if (bytesNum < buffer.length) {
            input.readFully(buffer, bytesNum, buffer.length - bytesNum);
        }
        
        // 2. Decode the length (little-endian)
        int len = fromInt32LE(buffer);
        
        // 3. Read the body
        byte[] data = new byte[len];
        input.readFully(data);  // Blocks until the whole body has been read
        
        return data;
    }
    
    /**
     * Convert 4 little-endian bytes to an int.
     */
    private static int fromInt32LE(byte[] bytes) {
        return (bytes[0] & 0xFF) 
            | ((bytes[1] & 0xFF) << 8) 
            | ((bytes[2] & 0xFF) << 16) 
            | ((bytes[3] & 0xFF) << 24);
    }
    
    // Assumed: closing the reader simply closes the underlying stream.
    @Override
    public void close() throws IOException {
        input.close();
    }
}
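
The read logic can be exercised without a shared-memory queue. In the sketch below a ByteArrayInputStream stands in for DataQueueInputStream; FrameReaderSketch and readFrame are hypothetical names, and the negative-length check is an extra safeguard that the excerpt above does not show.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the reader, for illustration only.
final class FrameReaderSketch {

    // Reads one length-prefixed frame; returns null on a clean end of stream.
    static byte[] readFrame(DataInputStream in) throws IOException {
        int first = in.read();
        if (first < 0) {
            return null; // no more frames
        }
        byte[] header = new byte[4];
        header[0] = (byte) first;
        in.readFully(header, 1, 3); // throws EOFException on a truncated header
        int len = (header[0] & 0xFF)
            | ((header[1] & 0xFF) << 8)
            | ((header[2] & 0xFF) << 16)
            | ((header[3] & 0xFF) << 24);
        if (len < 0) {
            throw new IOException("negative frame length: " + len);
        }
        byte[] body = new byte[len];
        in.readFully(body);
        return body;
    }

    public static void main(String[] args) throws IOException {
        // Two frames: "hi" (length 2) followed by an empty record (length 0).
        byte[] wire = {2, 0, 0, 0, 'h', 'i', 0, 0, 0, 0};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        System.out.println(new String(readFrame(in), StandardCharsets.US_ASCII)); // prints hi
        System.out.println(readFrame(in).length); // prints 0
        System.out.println(readFrame(in));        // prints null
    }
}
```

Note that an empty record (length 0) and end of stream (null) are distinct results, which a length header makes possible.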

Read Flow

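DataQueueInputStream
  ↓
Connects to the underlying DataExchangeQueue
  ├─ Polls until data is available
  ├─ Returns as soon as data arrives
  └─ Supports blocking reads with a timeout

DataInputStream
  ↓
Wraps DataQueueInputStream
  ├─ Adds no buffering of its own; it forwards reads to the wrapped stream
  ├─ readFully() keeps reading until exactly N bytes have arrived
  └─ readFully() throws EOFException if the stream ends first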
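
The difference between read() and readFully() is easiest to see with a stream that hands out data in small pieces, much as a queue does while it is still being filled. The sketch below uses a hypothetical TrickleStream that returns at most one byte per call; it is a test double, not a GeaFlow class.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadFullyDemo {

    // Test double: delivers at most one byte per bulk read call.
    static final class TrickleStream extends FilterInputStream {
        TrickleStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 1));
        }
    }

    // A single read() may return fewer bytes than the array can hold.
    static int plainRead(byte[] source, byte[] target) throws IOException {
        InputStream in = new TrickleStream(new ByteArrayInputStream(source));
        return in.read(target);
    }

    // readFully() loops internally until the target array is full.
    static void fullRead(byte[] source, byte[] target) throws IOException {
        DataInputStream in =
            new DataInputStream(new TrickleStream(new ByteArrayInputStream(source)));
        in.readFully(target);
    }

    public static void main(String[] args) throws IOException {
        byte[] source = {1, 2, 3, 4};
        System.out.println(plainRead(source, new byte[4])); // prints 1
        byte[] target = new byte[4];
        fullRead(source, target);
        System.out.println(target[3]); // prints 4
    }
}
```

This is why InferDataReader follows a short header read with readFully() for the remaining bytes.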

7.3 The Elegance of the Frame Design

Why a length header?

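Option 1: No length header (write the raw data)
  Problem: the reader cannot tell where a record ends
  Example:
    Writer writes: [0x48, 0x65, 0x6C, 0x6C, 0x6F]
    Reader reads:  5 bytes or 10? There is no way to tell

Option 2: Length header (recommended)
  Advantage: the data is self-describing, so the reader knows where to stop
  Example:
    Writer writes: [0x05, 0x00, 0x00, 0x00][0x48, 0x65, 0x6C, 0x6C, 0x6F]
    Reader: reads the 4-byte length (5), then reads 5 bytes of data ✓

Option 3: Delimiter (not recommended)
  Problem: the payload may contain the delimiter, so it must be escaped
  Complexity: high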
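
The delimiter problem can be shown concretely. The sketch below frames the same 3-byte payload, which happens to contain a newline, in two ways. FramingComparison and both split methods are hypothetical, the newline delimiter is an arbitrary choice for the example, and the length-based parser does no validation of truncated or negative lengths.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FramingComparison {

    // Delimiter framing: split on '\n'. Breaks when a payload contains '\n'.
    static List<byte[]> splitOnDelimiter(byte[] wire) {
        List<byte[]> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < wire.length; i++) {
            if (wire[i] == '\n') {
                records.add(Arrays.copyOfRange(wire, start, i));
                start = i + 1;
            }
        }
        return records;
    }

    // Length-prefix framing: 4-byte little-endian length, then the payload.
    static List<byte[]> splitOnLength(byte[] wire) {
        List<byte[]> records = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(wire).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            byte[] body = new byte[buf.getInt()];
            buf.get(body);
            records.add(body);
        }
        return records;
    }

    public static void main(String[] args) {
        // One logical record: 'a', '\n', 'b'
        byte[] delimited = {'a', '\n', 'b', '\n'};
        byte[] prefixed = {3, 0, 0, 0, 'a', '\n', 'b'};
        System.out.println(splitOnDelimiter(delimited).size()); // prints 2 (record was split)
        System.out.println(splitOnLength(prefixed).size());     // prints 1
    }
}
```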

Why byte order matters

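Scenario 1: Java writes, Python reads (both use little-endian)
  Java:   int length = 10
          encode it as [0x0A, 0x00, 0x00, 0x00]
          write it to shared memory
  
  Python: read [0x0A, 0x00, 0x00, 0x00] from shared memory
          parse it with struct.unpack('<I', ...)
          result: 10 ✓

Scenario 2: Java uses big-endian, Python uses little-endian
  Java:   int length = 10
          encode it as [0x00, 0x00, 0x00, 0x0A]  ← big-endian
          write it to shared memory
  
  Python: read [0x00, 0x00, 0x00, 0x0A] from shared memory
          parse it with struct.unpack('<I', ...)
          result: 167772160 (0x0A000000) ✗ completely wrong!

Conclusion: both sides must agree on a single byte order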
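
The same mismatch can be reproduced entirely in Java, which is handy for a unit test. This is a small sketch; ByteOrderMismatch and roundTrip are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderMismatch {

    // Encodes value with one byte order and decodes it with another.
    static int roundTrip(int value, ByteOrder writeOrder, ByteOrder readOrder) {
        byte[] bytes = new byte[4];
        ByteBuffer.wrap(bytes).order(writeOrder).putInt(value);
        return ByteBuffer.wrap(bytes).order(readOrder).getInt();
    }

    public static void main(String[] args) {
        // Matching byte orders: the value survives.
        System.out.println(roundTrip(10, ByteOrder.LITTLE_ENDIAN, ByteOrder.LITTLE_ENDIAN)); // prints 10
        // Mismatched byte orders: the bytes are interpreted in reverse.
        System.out.println(roundTrip(10, ByteOrder.BIG_ENDIAN, ByteOrder.LITTLE_ENDIAN));    // prints 167772160
        // A mismatched decode is equivalent to reversing the bytes of the int.
        System.out.println(Integer.reverseBytes(10));                                        // prints 167772160
    }
}
```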

7.4 Streaming I/O and Buffering

DataQueueOutputStream

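public class DataQueueOutputStream extends OutputStream {
    
    private final DataExchangeQueue queue;
    // UNSAFE, WRITE_PTR, READ_PTR and capacity are referenced below;
    // their declarations are not shown in this excerpt.
    
    public DataQueueOutputStream(DataExchangeQueue queue) {
        this.queue = queue;
    }
    
    /**
     * Check, without taking a lock, whether the queue has room for size bytes.
     */
    public boolean tryReserveBeforeWrite(int size) {
        long writeIndex = UNSAFE.getVolatileLong(this, WRITE_PTR);
        long readIndex = UNSAFE.getVolatileLong(this, READ_PTR);
        
        // Free space = capacity - (bytes written but not yet read)
        long available = (readIndex + capacity) - writeIndex;
        return available >= size;
    }
    
    @Override
    public void write(int b) throws IOException {
        byte[] bytes = new byte[1];
        bytes[0] = (byte) b;
        write(bytes);
    }
    
    @Override
    public void write(byte[] b, int off, int len) 
            throws IOException {
        // Delegate to the low-level write method of DataExchangeQueue
        queue.put(b, off, len);
    }
}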
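
The space check works because both indices only ever increase, so their difference is the number of bytes still waiting to be read. Below is a minimal sketch of that arithmetic using AtomicLong in place of Unsafe. RingSpace and its methods are hypothetical, and the sketch assumes a single producer and a single consumer.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical index bookkeeping for a ring buffer (no data storage).
final class RingSpace {

    private final long capacity;
    private final AtomicLong writeIndex = new AtomicLong(); // advanced only by the producer
    private final AtomicLong readIndex = new AtomicLong();  // advanced only by the consumer

    RingSpace(long capacity) {
        this.capacity = capacity;
    }

    // Free bytes = capacity minus bytes written but not yet consumed.
    long available() {
        return capacity - (writeIndex.get() - readIndex.get());
    }

    boolean tryReserve(int size) {
        return available() >= size;
    }

    void advanceWrite(int n) {
        writeIndex.addAndGet(n);
    }

    void advanceRead(int n) {
        readIndex.addAndGet(n);
    }

    public static void main(String[] args) {
        RingSpace ring = new RingSpace(16);
        ring.advanceWrite(12);
        System.out.println(ring.tryReserve(5)); // prints false (only 4 bytes free)
        ring.advanceRead(8);
        System.out.println(ring.tryReserve(5)); // prints true (12 bytes free)
    }
}
```

With one producer, a successful check stays valid until that producer writes: the only other party is the consumer, and consuming can only increase the free space.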

DataQueueInputStream

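public class DataQueueInputStream extends InputStream {
    
    private final DataExchangeQueue queue;
    private static final long DEFAULT_TIMEOUT = 1000;  // milliseconds (1 second)
    
    public DataQueueInputStream(DataExchangeQueue queue) {
        this.queue = queue;
    }
    
    @Override
    public int read() throws IOException {
        byte[] buffer = new byte[1];
        int n = read(buffer);
        return n > 0 ? buffer[0] & 0xFF : -1;
    }
    
    @Override
    public int read(byte[] b, int off, int len) 
            throws IOException {
        // Poll for data, giving up after DEFAULT_TIMEOUT
        long startTime = System.currentTimeMillis();
        while (true) {
            byte[] data = queue.get();  // non-blocking
            
            if (data != null) {
                // Copy no more than the caller asked for or the chunk holds.
                // Simplified: bytes beyond len in this chunk are not retained.
                int n = Math.min(len, data.length);
                System.arraycopy(data, 0, b, off, n);
                return n;
            }
            
            // Timeout check
            if (System.currentTimeMillis() - startTime 
                    > DEFAULT_TIMEOUT) {
                throw new SocketTimeoutException(
                    "Timed out reading data from the queue");
            }
            
            try {
                Thread.sleep(10);  // Yield the CPU instead of busy-polling
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InterruptedIOException(
                    "Interrupted while waiting for queue data");
            }
        }
    }
}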
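
The poll-with-timeout loop can be tried out against an ordinary in-memory queue. In this sketch a ConcurrentLinkedQueue stands in for DataExchangeQueue; PollingReader and pollWithTimeout are hypothetical names. It uses System.nanoTime() for the deadline, since that clock is monotonic and unaffected by wall-clock adjustments, which is a deliberate departure from the currentTimeMillis() call in the listing above.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class PollingReader {

    // Polls the queue until a chunk arrives or timeoutMillis elapses.
    static byte[] pollWithTimeout(Queue<byte[]> queue, long timeoutMillis)
            throws IOException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            byte[] chunk = queue.poll(); // non-blocking; null when the queue is empty
            if (chunk != null) {
                return chunk;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new InterruptedIOException("timed out waiting for queue data");
            }
            try {
                Thread.sleep(1); // back off briefly instead of spinning
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                throw new InterruptedIOException("interrupted while waiting");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Queue<byte[]> queue = new ConcurrentLinkedQueue<>();
        queue.add(new byte[] {1, 2, 3});
        System.out.println(pollWithTimeout(queue, 100).length); // prints 3
        try {
            pollWithTimeout(queue, 20); // queue is now empty
        } catch (InterruptedIOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```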