Engineering a Sibo AI Dual-Eye Smart Speaker on the ESP32-S3: Dual-Eye Displays, Touch, IMU, Haptic Feedback, WebSocket, and MCP Integration
1. Introduction
A traditional smart speaker boils down to "voice input + cloud recognition + TTS playback". That architecture handles basic voice Q&A, but it lacks interaction dimensions such as expressions, touch, posture, and visible state.
This article designs a smart speaker built on the Sibo AI-S3 / AI dual-eye hardware family. An ESP32-S3 serves as the main controller, driving dual LCD "eyes", four touch keys, a 3-axis IMU, a vibration motor, a microphone, a speaker, and an AI module; it connects to a large-model service over WebSocket and maps natural language to device actions through MCP.
The goal is not another Bluetooth speaker, but a desktop AI terminal with eyes, a sense of touch, posture awareness, and an AI personality.
2. Capability Definition
The target capabilities of this design are:
1. AI large-model voice conversation
2. Expression display on 0.71" / 1.28" dual-eye screens
3. Four touch keys: wake, interrupt, volume, mode switching
4. 3-axis IMU: shake, tilt, pick-up, put-down
5. Vibration motor: touch confirmation, AI state, low battery, alarm
6. WebSocket audio link
7. BluFi / mini-program Wi-Fi provisioning
8. Voice cloning, custom knowledge base, voiceprint recognition
9. MCP tool calling
10. OTA and factory-test mode
The whole product can be summarized as:
ESP32-S3 real-time control
+ dual-eye screens for emotion display
+ touch for local interaction
+ IMU for posture sensing
+ haptics for tactile feedback
+ WebSocket AI audio link
+ MCP tool control
+ mini-program configuration
= an AI dual-eye smart speaker
3. System Architecture
┌─────────────────────────────────────────────────────────┐
│              Sibo AI Dual-Eye Smart Speaker             │
├─────────────────────────────────────────────────────────┤
│ Interaction layer                                       │
│ ├─ Dual-eye screens: eyes, expressions, status          │
│ ├─ 4 touch keys: wake, interrupt, volume, mode          │
│ ├─ 3-axis IMU: shake, tilt, pick-up                     │
│ └─ Vibration motor: touch confirm, status feedback      │
├─────────────────────────────────────────────────────────┤
│ Hardware layer                                          │
│ ├─ ESP32-S3: main controller, Wi-Fi, BLE, LCD, I2S, I2C │
│ ├─ VB6824 / ES8311: wake word, audio processing, codec  │
│ ├─ MIC + Speaker: voice input and output                │
│ ├─ LCD x 2: 0.71" / 1.28" dual-eye screens              │
│ ├─ Touch Key x 4                                        │
│ └─ IMU + Motor + Battery                                │
├─────────────────────────────────────────────────────────┤
│ Firmware layer                                          │
│ ├─ app_event_bus: event bus                             │
│ ├─ touch_mgr: touch scanning                            │
│ ├─ imu_mgr: posture recognition                         │
│ ├─ haptic_mgr: vibration control                        │
│ ├─ eye_ui: dual-eye expression rendering                │
│ ├─ audio_i2s: audio capture and playback                │
│ ├─ ai_ws_client: AI WebSocket link                      │
│ ├─ mcp_uart: MCP tool control                           │
│ └─ ota_mgr: OTA updates                                 │
├─────────────────────────────────────────────────────────┤
│ Cloud layer                                             │
│ ├─ ASR: speech recognition                              │
│ ├─ LLM: large-model Q&A                                 │
│ ├─ RAG: knowledge-base retrieval                        │
│ ├─ TTS: speech synthesis                                │
│ ├─ Voice Clone: voice cloning                           │
│ └─ MCP Server: tool calling                             │
└─────────────────────────────────────────────────────────┘
4. Hardware Selection
Recommended main-control modules:
ESPS3-32 N16R8
ESPS3-32 N16R2
ESPS3-32E N16R8
Why the ESP32-S3 fits this project:
1. Dual cores at 240 MHz, good for multitask scheduling
2. Plenty of GPIO for the screens, touch keys, I2S, I2C, and motor
3. Wi-Fi + BLE on one chip
4. PSRAM support, useful for UI animation and audio buffering
5. Sibo already has an AI-S3 / dual-eye / AI EYE hardware ecosystem
Recommended hardware configuration:
| Module | Suggested configuration | Notes |
|---|---|---|
| Main controller | ESP32-S3 N16R8 | PSRAM recommended |
| Display | 0.71" / 1.28" dual-eye screens | Expressions, status, animation |
| Audio | MIC + Speaker | Voice capture and playback |
| Codec | VB6824 / ES8311 | Wake word, audio processing |
| Touch | 4 touch keys | Wake, interrupt, volume, mode |
| Posture | 3-axis IMU | Shake, tilt, pick-up |
| Vibration | Flat motor + MOSFET | Haptic feedback |
| Provisioning | BluFi / SoftAP | Mini-program provisioning |
| Power | Li-ion battery + charge management | Portable use |
| Expansion | RGB / TF / 4G | Optional |
5. Project Directory Layout
sibo_ai_speaker/
├── CMakeLists.txt
├── sdkconfig.defaults
├── main/
│ ├── app_main.c
│ ├── app_config.h
│ ├── board_pins.h
│ ├── app_event_bus.c
│ ├── app_event_bus.h
│ ├── touch_mgr.c
│ ├── imu_mgr.c
│ ├── haptic_mgr.c
│ ├── eye_ui.c
│ ├── audio_i2s.c
│ ├── ai_ws_client.c
│ ├── mcp_uart.c
│ ├── ota_mgr.c
│ └── power_mgr.c
├── components/
│ ├── display_driver/
│ ├── audio_codec/
│ ├── imu_driver/
│ ├── eye_assets/
│ └── json_parser/
└── server/
├── main.py
├── rag_service.py
├── voice_clone.py
├── mcp_tools.py
└── tts_stream.py
6. Base Configuration
app_config.h
#pragma once
#define PRODUCT_NAME "SIBO_AI_DUAL_EYE_SPEAKER"
#define FIRMWARE_VERSION "v1.0.0"
#define SCREEN_TYPE_071 0
#define SCREEN_TYPE_128 1
#define CONFIG_EYE_SCREEN_TYPE SCREEN_TYPE_128
#if CONFIG_EYE_SCREEN_TYPE == SCREEN_TYPE_128
#define EYE_LCD_W 240
#define EYE_LCD_H 240
#else
#define EYE_LCD_W 160
#define EYE_LCD_H 160
#endif
#define AUDIO_SAMPLE_RATE 16000
#define AUDIO_BITS_PER_SAMPLE 16
#define AUDIO_FRAME_MS 20
#define AUDIO_FRAME_SAMPLES (AUDIO_SAMPLE_RATE * AUDIO_FRAME_MS / 1000)
#define AI_WS_URL "wss://your-ai-server.example.com/device/ws"
#define OTA_MANIFEST_URL "https://your-ota.example.com/sibo_ai/manifest.json"
#define ENABLE_BLUFI_PROVISION 1
#define ENABLE_MCP_TOOLS 1
#define ENABLE_RAG_KNOWLEDGE 1
#define ENABLE_VOICE_CLONE 1
#define ENABLE_BLE_SPEAKER 1
board_pins.h
#pragma once
#include "driver/gpio.h"
#define PIN_I2C_SCL GPIO_NUM_9
#define PIN_I2C_SDA GPIO_NUM_8
#define PIN_MOTOR_PWM GPIO_NUM_21
#define PIN_I2S_BCLK GPIO_NUM_4
#define PIN_I2S_WS GPIO_NUM_5
#define PIN_I2S_DIN_MIC GPIO_NUM_6
#define PIN_I2S_DOUT_SPK GPIO_NUM_7
#define PIN_LCD_SPI_SCLK GPIO_NUM_12
#define PIN_LCD_SPI_MOSI GPIO_NUM_13
#define PIN_LCD_LEFT_CS GPIO_NUM_14
#define PIN_LCD_RIGHT_CS GPIO_NUM_15
#define PIN_LCD_DC GPIO_NUM_16
#define PIN_LCD_RST GPIO_NUM_17
#define PIN_LCD_BL GPIO_NUM_18
#define PIN_TOUCH_1 GPIO_NUM_38
#define PIN_TOUCH_2 GPIO_NUM_39
#define PIN_TOUCH_3 GPIO_NUM_40
#define PIN_TOUCH_4 GPIO_NUM_41
#define PIN_BAT_ADC GPIO_NUM_1
#define PIN_CHARGE_DET GPIO_NUM_2
#define PIN_RGB_LED GPIO_NUM_48
7. Event Bus Design
A common mistake in sample code is to let several tasks share one queue: once any task consumes an event, the others never see it.
The recommendation here is a broadcast event bus: each module registers its own queue, and the bus fans every event out to all subscribers.
app_event_bus.h
#pragma once
#include <stdint.h>
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
typedef enum {
EVT_TOUCH_1,
EVT_TOUCH_2,
EVT_TOUCH_3,
EVT_TOUCH_4,
EVT_TOUCH_LONG_1,
EVT_TOUCH_LONG_2,
EVT_TOUCH_LONG_3,
EVT_TOUCH_LONG_4,
EVT_IMU_SHAKE,
EVT_IMU_TILT_LEFT,
EVT_IMU_TILT_RIGHT,
EVT_IMU_PICKUP,
EVT_IMU_PUTDOWN,
EVT_AI_IDLE,
EVT_AI_WAKEUP,
EVT_AI_LISTENING,
EVT_AI_THINKING,
EVT_AI_SPEAKING,
EVT_AI_INTERRUPTED,
EVT_AUDIO_VOLUME_UP,
EVT_AUDIO_VOLUME_DOWN,
EVT_AUDIO_MUTE,
EVT_BAT_LOW,
EVT_BAT_CHARGING,
EVT_BAT_FULL,
EVT_WIFI_CONNECTED,
EVT_WIFI_DISCONNECTED,
EVT_OTA_START,
EVT_OTA_PROGRESS,
EVT_OTA_DONE,
EVT_OTA_FAIL,
} app_evt_type_t;
typedef struct {
app_evt_type_t type;
int32_t value;
int64_t ts_ms;
} app_evt_t;
void app_event_bus_init(void);
QueueHandle_t app_event_bus_register(const char *name, uint32_t queue_len);
void app_event_post(app_evt_type_t type, int32_t value);
app_event_bus.c
#include <string.h>
#include "esp_timer.h"
#include "esp_log.h"
#include "app_event_bus.h"
#define APP_EVENT_SUB_MAX 10
#define APP_EVENT_QUEUE_ITEM_SIZE sizeof(app_evt_t)
static const char *TAG = "event_bus";
typedef struct {
char name[24];
QueueHandle_t q;
} app_event_sub_t;
static app_event_sub_t s_subs[APP_EVENT_SUB_MAX];
static int s_sub_count = 0;
void app_event_bus_init(void)
{
memset(s_subs, 0, sizeof(s_subs));
s_sub_count = 0;
}
QueueHandle_t app_event_bus_register(const char *name, uint32_t queue_len)
{
if (s_sub_count >= APP_EVENT_SUB_MAX) {
ESP_LOGE(TAG, "subscriber full");
return NULL;
}
QueueHandle_t q = xQueueCreate(queue_len, APP_EVENT_QUEUE_ITEM_SIZE);
if (!q) {
ESP_LOGE(TAG, "create queue failed");
return NULL;
}
strncpy(s_subs[s_sub_count].name, name, sizeof(s_subs[s_sub_count].name) - 1);
s_subs[s_sub_count].q = q;
s_sub_count++;
ESP_LOGI(TAG, "register subscriber: %s", name);
return q;
}
void app_event_post(app_evt_type_t type, int32_t value)
{
app_evt_t evt = {
.type = type,
.value = value,
.ts_ms = esp_timer_get_time() / 1000
};
for (int i = 0; i < s_sub_count; i++) {
if (s_subs[i].q) {
xQueueSend(s_subs[i].q, &evt, 0);
}
}
}
8. Main Program Entry
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"
#include "app_config.h"
#include "app_event_bus.h"
extern void touch_mgr_start(void);
extern void imu_mgr_start(void);
extern void haptic_mgr_start(void);
extern void eye_ui_start(void);
extern void audio_i2s_start(void);
extern void ai_ws_client_start(void);
extern void mcp_uart_start(void);
extern void ota_mgr_start(void);
extern void power_mgr_start(void);
static const char *TAG = "app_main";
void app_main(void)
{
ESP_LOGI(TAG, "boot product=%s fw=%s", PRODUCT_NAME, FIRMWARE_VERSION);
app_event_bus_init();
touch_mgr_start();
imu_mgr_start();
haptic_mgr_start();
eye_ui_start();
audio_i2s_start();
ai_ws_client_start();
power_mgr_start();
ota_mgr_start();
#if ENABLE_MCP_TOOLS
mcp_uart_start();
#endif
while (1) {
vTaskDelay(pdMS_TO_TICKS(1000));
}
}
9. Four-Key Touch Handling
Touch can use the ESP32-S3's internal touch peripheral or an external touch IC. The template below reads plain GPIO inputs, so an external touch chip with active-low outputs can be dropped in directly.
#include "driver/gpio.h"
#include "esp_timer.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "board_pins.h"
#include "app_event_bus.h"
typedef struct {
gpio_num_t pin;
app_evt_type_t short_evt;
app_evt_type_t long_evt;
int last_level;
int64_t press_ms;
} touch_key_t;
static touch_key_t s_keys[] = {
{PIN_TOUCH_1, EVT_TOUCH_1, EVT_TOUCH_LONG_1, 1, 0},
{PIN_TOUCH_2, EVT_TOUCH_2, EVT_TOUCH_LONG_2, 1, 0},
{PIN_TOUCH_3, EVT_TOUCH_3, EVT_TOUCH_LONG_3, 1, 0},
{PIN_TOUCH_4, EVT_TOUCH_4, EVT_TOUCH_LONG_4, 1, 0},
};
static int64_t now_ms(void)
{
return esp_timer_get_time() / 1000;
}
static void touch_task(void *arg)
{
while (1) {
for (int i = 0; i < 4; i++) {
int level = gpio_get_level(s_keys[i].pin);
if (s_keys[i].last_level == 1 && level == 0) {
s_keys[i].press_ms = now_ms();
}
if (s_keys[i].last_level == 0 && level == 1) {
int64_t dur = now_ms() - s_keys[i].press_ms;
if (dur > 800) {
app_event_post(s_keys[i].long_evt, dur);
} else if (dur > 30) {
app_event_post(s_keys[i].short_evt, dur);
}
}
s_keys[i].last_level = level;
}
vTaskDelay(pdMS_TO_TICKS(20));
}
}
void touch_mgr_start(void)
{
gpio_config_t io = {
.pin_bit_mask =
(1ULL << PIN_TOUCH_1) |
(1ULL << PIN_TOUCH_2) |
(1ULL << PIN_TOUCH_3) |
(1ULL << PIN_TOUCH_4),
.mode = GPIO_MODE_INPUT,
.pull_up_en = GPIO_PULLUP_ENABLE,
.pull_down_en = GPIO_PULLDOWN_DISABLE,
.intr_type = GPIO_INTR_DISABLE,
};
gpio_config(&io);
xTaskCreate(touch_task, "touch_task", 4096, NULL, 5, NULL);
}
10. Vibration Motor Module
#include "driver/ledc.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "app_event_bus.h"
#include "board_pins.h"
#define MOTOR_LEDC_MODE LEDC_LOW_SPEED_MODE
#define MOTOR_LEDC_TIMER LEDC_TIMER_0
#define MOTOR_LEDC_CH LEDC_CHANNEL_0
#define MOTOR_PWM_FREQ 20000
#define MOTOR_PWM_RES LEDC_TIMER_10_BIT
static QueueHandle_t s_evt_q;
static void motor_set(uint32_t duty)
{
ledc_set_duty(MOTOR_LEDC_MODE, MOTOR_LEDC_CH, duty);
ledc_update_duty(MOTOR_LEDC_MODE, MOTOR_LEDC_CH);
}
static void vibrate(uint32_t duty, uint32_t ms)
{
motor_set(duty);
vTaskDelay(pdMS_TO_TICKS(ms));
motor_set(0);
}
static void haptic_task(void *arg)
{
app_evt_t evt;
while (1) {
if (xQueueReceive(s_evt_q, &evt, portMAX_DELAY) == pdTRUE) {
switch (evt.type) {
case EVT_TOUCH_1:
case EVT_TOUCH_2:
case EVT_TOUCH_3:
case EVT_TOUCH_4:
vibrate(420, 35);
break;
case EVT_TOUCH_LONG_1:
case EVT_AI_INTERRUPTED:
vibrate(900, 35);
vTaskDelay(pdMS_TO_TICKS(50));
vibrate(900, 35);
break;
case EVT_AI_WAKEUP:
vibrate(600, 60);
break;
case EVT_BAT_LOW:
for (int i = 0; i < 3; i++) {
vibrate(650, 80);
vTaskDelay(pdMS_TO_TICKS(100));
}
break;
default:
break;
}
}
}
}
void haptic_mgr_start(void)
{
s_evt_q = app_event_bus_register("haptic", 16);
ledc_timer_config_t timer = {
.speed_mode = MOTOR_LEDC_MODE,
.timer_num = MOTOR_LEDC_TIMER,
.duty_resolution = MOTOR_PWM_RES,
.freq_hz = MOTOR_PWM_FREQ,
.clk_cfg = LEDC_AUTO_CLK
};
ledc_timer_config(&timer);
ledc_channel_config_t ch = {
.speed_mode = MOTOR_LEDC_MODE,
.channel = MOTOR_LEDC_CH,
.timer_sel = MOTOR_LEDC_TIMER,
.intr_type = LEDC_INTR_DISABLE,
.gpio_num = PIN_MOTOR_PWM,
.duty = 0,
.hpoint = 0
};
ledc_channel_config(&ch);
xTaskCreate(haptic_task, "haptic_task", 4096, NULL, 5, NULL);
}
11. IMU Posture Recognition
#include <math.h>
#include "esp_err.h"
#include "esp_timer.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "app_event_bus.h"
typedef struct {
float ax;
float ay;
float az;
} accel_g_t;
static esp_err_t imu_read_accel(accel_g_t *a)
{
    /*
     * Replace with your actual IMU driver in a real project, e.g.:
     * qmi8658_read_accel(a);
     * mpu6050_read_accel(a);
     * lis3dh_read_accel(a);
     * The stub below synthesizes data so the pipeline runs without hardware.
     */
static float t = 0;
t += 0.1f;
a->ax = sinf(t) * 0.05f;
a->ay = cosf(t) * 0.05f;
a->az = 1.0f;
return ESP_OK;
}
static void imu_task(void *arg)
{
accel_g_t a;
float last_mag = 1.0f;
int64_t last_shake_ms = 0;
while (1) {
if (imu_read_accel(&a) == ESP_OK) {
float mag = sqrtf(a.ax * a.ax + a.ay * a.ay + a.az * a.az);
float diff = fabsf(mag - last_mag);
int64_t now = esp_timer_get_time() / 1000;
last_mag = mag;
if (diff > 0.45f && now - last_shake_ms > 800) {
last_shake_ms = now;
app_event_post(EVT_IMU_SHAKE, (int32_t)(diff * 1000));
}
if (a.ax > 0.45f) {
app_event_post(EVT_IMU_TILT_RIGHT, (int32_t)(a.ax * 100));
} else if (a.ax < -0.45f) {
app_event_post(EVT_IMU_TILT_LEFT, (int32_t)(a.ax * 100));
}
if (a.az < 0.65f || mag > 1.35f) {
app_event_post(EVT_IMU_PICKUP, (int32_t)(mag * 100));
}
}
vTaskDelay(pdMS_TO_TICKS(40));
}
}
void imu_mgr_start(void)
{
xTaskCreate(imu_task, "imu_task", 4096, NULL, 5, NULL);
}
12. Dual-Eye Expression State Machine
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "app_event_bus.h"
typedef enum {
EYE_IDLE,
EYE_WAKEUP,
EYE_LISTENING,
EYE_THINKING,
EYE_SPEAKING,
EYE_HAPPY,
EYE_SLEEPY,
EYE_LOW_BAT,
} eye_state_t;
static QueueHandle_t s_evt_q;
static eye_state_t s_eye_state = EYE_IDLE;
static int s_offset_x = 0;
static int s_blink_tick = 0;
static void eye_draw_frame(eye_state_t state, int blink, int offset_x)
{
    /*
     * In a real project this can target:
     * 1. esp_lcd_panel_draw_bitmap()
     * 2. an LVGL canvas
     * 3. a custom RGB565 framebuffer
     */
}
static void eye_handle_event(const app_evt_t *evt)
{
switch (evt->type) {
case EVT_AI_IDLE:
s_eye_state = EYE_IDLE;
break;
case EVT_AI_WAKEUP:
s_eye_state = EYE_WAKEUP;
break;
case EVT_AI_LISTENING:
s_eye_state = EYE_LISTENING;
break;
case EVT_AI_THINKING:
s_eye_state = EYE_THINKING;
break;
case EVT_AI_SPEAKING:
s_eye_state = EYE_SPEAKING;
break;
case EVT_TOUCH_4:
case EVT_IMU_SHAKE:
s_eye_state = EYE_HAPPY;
break;
case EVT_IMU_TILT_LEFT:
s_offset_x = -12;
break;
case EVT_IMU_TILT_RIGHT:
s_offset_x = 12;
break;
case EVT_BAT_LOW:
s_eye_state = EYE_LOW_BAT;
break;
default:
break;
}
}
static void eye_task(void *arg)
{
app_evt_t evt;
while (1) {
while (xQueueReceive(s_evt_q, &evt, 0) == pdTRUE) {
eye_handle_event(&evt);
}
s_blink_tick++;
int blink = 0;
if (s_blink_tick > 120) {
s_blink_tick = 0;
blink = 1;
}
if (s_eye_state == EYE_THINKING) {
s_offset_x = (s_blink_tick % 40) - 20;
}
eye_draw_frame(s_eye_state, blink, s_offset_x);
s_offset_x = s_offset_x * 8 / 10;
vTaskDelay(pdMS_TO_TICKS(33));
}
}
void eye_ui_start(void)
{
s_evt_q = app_event_bus_register("eye_ui", 16);
/*
* display_init_left_right();
*/
xTaskCreate(eye_task, "eye_task", 8192, NULL, 4, NULL);
}
13. I2S Audio Capture
#include "driver/i2s_std.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "app_config.h"
#include "board_pins.h"
extern void ai_ws_send_pcm(const uint8_t *data, size_t len);
static i2s_chan_handle_t s_rx_chan;
static i2s_chan_handle_t s_tx_chan;
static void audio_capture_task(void *arg)
{
int16_t pcm[AUDIO_FRAME_SAMPLES];
while (1) {
size_t bytes_read = 0;
esp_err_t ret = i2s_channel_read(
s_rx_chan,
pcm,
sizeof(pcm),
&bytes_read,
pdMS_TO_TICKS(100)
);
if (ret == ESP_OK && bytes_read > 0) {
            /*
             * Optional processing before upload:
             * VAD: voice activity detection
             * AEC: acoustic echo cancellation
             * NS: noise suppression
             * AGC: automatic gain control
             */
ai_ws_send_pcm((uint8_t *)pcm, bytes_read);
}
}
}
void audio_i2s_start(void)
{
i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(
I2S_NUM_0,
I2S_ROLE_MASTER
);
i2s_new_channel(&chan_cfg, &s_tx_chan, &s_rx_chan);
i2s_std_config_t std_cfg = {
.clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(AUDIO_SAMPLE_RATE),
.slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(
I2S_DATA_BIT_WIDTH_16BIT,
I2S_SLOT_MODE_MONO
),
.gpio_cfg = {
.mclk = I2S_GPIO_UNUSED,
.bclk = PIN_I2S_BCLK,
.ws = PIN_I2S_WS,
.dout = PIN_I2S_DOUT_SPK,
.din = PIN_I2S_DIN_MIC,
},
};
i2s_channel_init_std_mode(s_rx_chan, &std_cfg);
i2s_channel_init_std_mode(s_tx_chan, &std_cfg);
i2s_channel_enable(s_rx_chan);
i2s_channel_enable(s_tx_chan);
xTaskCreate(audio_capture_task, "audio_capture", 8192, NULL, 7, NULL);
}
14. AI WebSocket Client
The device streams PCM upward; the cloud returns state messages and binary TTS audio.
#include <string.h>
#include "esp_websocket_client.h"
#include "esp_log.h"
#include "cJSON.h"
#include "app_config.h"
#include "app_event_bus.h"
static const char *TAG = "ai_ws";
static esp_websocket_client_handle_t s_ws;
static void handle_ai_json(const char *data, int len)
{
cJSON *root = cJSON_ParseWithLength(data, len);
if (!root) return;
cJSON *type = cJSON_GetObjectItem(root, "type");
if (cJSON_IsString(type)) {
if (!strcmp(type->valuestring, "idle")) {
app_event_post(EVT_AI_IDLE, 0);
} else if (!strcmp(type->valuestring, "wakeup")) {
app_event_post(EVT_AI_WAKEUP, 0);
} else if (!strcmp(type->valuestring, "listening")) {
app_event_post(EVT_AI_LISTENING, 0);
} else if (!strcmp(type->valuestring, "thinking")) {
app_event_post(EVT_AI_THINKING, 0);
} else if (!strcmp(type->valuestring, "speaking")) {
app_event_post(EVT_AI_SPEAKING, 0);
} else if (!strcmp(type->valuestring, "interrupted")) {
app_event_post(EVT_AI_INTERRUPTED, 0);
}
}
cJSON_Delete(root);
}
static void ws_event_handler(void *args,
esp_event_base_t base,
int32_t event_id,
void *event_data)
{
esp_websocket_event_data_t *d = (esp_websocket_event_data_t *)event_data;
switch (event_id) {
case WEBSOCKET_EVENT_CONNECTED:
ESP_LOGI(TAG, "connected");
        {
            static const char *hello =
                "{\"type\":\"hello\",\"product\":\"SIBO_AI_DUAL_EYE_SPEAKER\",\"fw\":\"v1.0.0\"}";
            esp_websocket_client_send_text(s_ws, hello, strlen(hello), portMAX_DELAY);
        }
break;
case WEBSOCKET_EVENT_DATA:
if (d->op_code == 0x1) {
handle_ai_json(d->data_ptr, d->data_len);
} else if (d->op_code == 0x2) {
            /*
             * Binary TTS audio stream, e.g.:
             * audio_play_write(d->data_ptr, d->data_len);
             */
}
break;
case WEBSOCKET_EVENT_DISCONNECTED:
ESP_LOGW(TAG, "disconnected");
app_event_post(EVT_AI_IDLE, 0);
break;
default:
break;
}
}
void ai_ws_send_pcm(const uint8_t *data, size_t len)
{
if (s_ws && esp_websocket_client_is_connected(s_ws)) {
esp_websocket_client_send_bin(s_ws, (const char *)data, len, 0);
}
}
void ai_ws_client_start(void)
{
esp_websocket_client_config_t cfg = {
.uri = AI_WS_URL,
.reconnect_timeout_ms = 3000,
.network_timeout_ms = 5000,
};
s_ws = esp_websocket_client_init(&cfg);
esp_websocket_register_events(
s_ws,
WEBSOCKET_EVENT_ANY,
ws_event_handler,
NULL
);
esp_websocket_client_start(s_ws);
}
15. MCP Tool Control
The core value of MCP here is turning natural language into actions the MCU can execute.
For example:
User says: switch the eyes to happy mode
Device action: set_eye_happy
User says: set the volume to 70
Device action: set_volume 70
User says: turn the light blue
Device action: set_rgb 0 0 255
In Sibo's documentation, MCP runs over UART at 115200 8N1: AT+ADDMCP maps natural-language intents to binary control frames, and Type=1 tools can additionally return AI parameters.
#include <string.h>
#include "driver/uart.h"
#include "esp_log.h"
#include "app_event_bus.h"
#define MCP_UART_NUM UART_NUM_1
#define MCP_UART_BAUD 115200
#define MCP_RX_BUF 512
static const char *TAG = "mcp_uart";
static void mcp_send_at(const char *cmd)
{
uart_write_bytes(MCP_UART_NUM, cmd, strlen(cmd));
uart_write_bytes(MCP_UART_NUM, "\r\n", 2);
ESP_LOGI(TAG, "AT>> %s", cmd);
}
static void mcp_register_tools(void)
{
mcp_send_at("AT");
mcp_send_at("AT+CONNECT");
    mcp_send_at("AT+ADDMCP=0,set_eye_happy,switch to happy expression,2,20,01");
    mcp_send_at("AT+ADDMCP=0,set_eye_sleepy,switch to sleepy expression,2,20,02");
    mcp_send_at("AT+ADDMCP=0,set_eye_angry,switch to angry expression,2,20,03");
    mcp_send_at("AT+ADDMCP=1,set_volume,set the volume,F3,1,V");
    mcp_send_at("AT+ADDMCP=1,set_alarm,set an alarm,F2,2,H,M");
    mcp_send_at("AT+ADDMCP=1,set_lamp_color,set the lamp color,F1,3,R,G,B");
}
static void handle_mcp_frame(const uint8_t *buf, int len)
{
if (len < 6) return;
if (buf[0] != 0x55 || buf[1] != 0xAA) return;
uint8_t cmd = buf[3];
switch (cmd) {
case 0x20:
if (buf[4] == 0x01) {
app_event_post(EVT_AI_WAKEUP, 0);
}
break;
    case 0xF1: {
        if (len < 7) break;   /* R, G, B need three payload bytes */
        uint8_t r = buf[4];
        uint8_t g = buf[5];
        uint8_t b = buf[6];
        ESP_LOGI(TAG, "set rgb: r=%d g=%d b=%d", r, g, b);
        break;
    }
case 0xF2: {
uint8_t h = buf[4];
uint8_t m = buf[5];
ESP_LOGI(TAG, "set alarm: %02d:%02d", h, m);
break;
}
case 0xF3: {
uint8_t volume = buf[4];
ESP_LOGI(TAG, "set volume: %d", volume);
break;
}
case 0xFC:
ESP_LOGW(TAG, "mcp reset request, re-register tools");
mcp_register_tools();
break;
default:
ESP_LOGW(TAG, "unknown mcp cmd=0x%02X", cmd);
break;
}
}
static void mcp_uart_task(void *arg)
{
uint8_t buf[MCP_RX_BUF];
while (1) {
int len = uart_read_bytes(
MCP_UART_NUM,
buf,
sizeof(buf),
pdMS_TO_TICKS(100)
);
if (len > 0) {
handle_mcp_frame(buf, len);
}
}
}
void mcp_uart_start(void)
{
uart_config_t cfg = {
.baud_rate = MCP_UART_BAUD,
.data_bits = UART_DATA_8_BITS,
.parity = UART_PARITY_DISABLE,
.stop_bits = UART_STOP_BITS_1,
.flow_ctrl = UART_HW_FLOWCTRL_DISABLE,
    };
    uart_param_config(MCP_UART_NUM, &cfg);
    uart_driver_install(MCP_UART_NUM, MCP_RX_BUF, 0, 0, NULL, 0);
    /* Route the UART to your board's MCP TX/RX pins before use, e.g.:
     * uart_set_pin(MCP_UART_NUM, tx_pin, rx_pin,
     *              UART_PIN_NO_CHANGE, UART_PIN_NO_CHANGE); */
mcp_register_tools();
xTaskCreate(mcp_uart_task, "mcp_uart", 4096, NULL, 5, NULL);
}
16. Backend AI Gateway
The device only captures and plays audio and drives peripherals; the AI capabilities live in the backend.
from fastapi import FastAPI, WebSocket
from pydantic import BaseModel
from typing import Optional

app = FastAPI(title="SIBO AI Speaker Gateway")

class AgentCreateReq(BaseModel):
    name: str
    model: str = "xiaozhi"
    enable_voice_clone: bool = True
    enable_kb: bool = True
    enable_mcp: bool = True

@app.post("/api/agent/create")
def create_agent(req: AgentCreateReq):
    return {
        "agent_id": "agent_sibo_001",
        "name": req.name,
        "features": {
            "voice_clone": req.enable_voice_clone,
            "kb": req.enable_kb,
            "mcp": req.enable_mcp,
        },
    }

@app.websocket("/device/ws")
async def device_ws(ws: WebSocket):
    await ws.accept()
    await ws.send_json({
        "type": "idle",
        "msg": "device connected"
    })
    while True:
        msg = await ws.receive()
        if "bytes" in msg and msg["bytes"]:
            pcm = msg["bytes"]
            text = asr_decode(pcm)
            if text:
                await ws.send_json({"type": "thinking", "text": text})
                answer = llm_with_rag(text)
                await ws.send_json({"type": "speaking", "text": answer})
                async for chunk in tts_stream(answer):
                    await ws.send_bytes(chunk)
                await ws.send_json({"type": "idle"})
        elif "text" in msg and msg["text"]:
            print("device json:", msg["text"])

def asr_decode(pcm: bytes) -> Optional[str]:
    return None

def llm_with_rag(text: str) -> str:
    return "An answer from the LLM, grounded in the knowledge base."

async def tts_stream(text: str):
    yield b""
17. OTA Flow
1. Read the current firmware version
2. Fetch the OTA manifest
3. Compare against the cloud version
4. Download the firmware
5. Verify the SHA256
6. Write the OTA partition
7. Reboot and switch partitions
8. Report the upgrade result
Example manifest:
{
"product": "SIBO_AI_DUAL_EYE_SPEAKER",
"version": "v1.0.1",
"url": "https://your-ota.example.com/firmware/v1.0.1.bin",
"sha256": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"force": false
}
18. Production Test Items
FACTORY_TEST_TOUCH four-key touch check
FACTORY_TEST_LCD dual-screen red/green/blue/white/black test
FACTORY_TEST_AUDIO_IN MIC recording level check
FACTORY_TEST_AUDIO_OUT 1 kHz speaker playback
FACTORY_TEST_MOTOR motor vibration test
FACTORY_TEST_IMU 3-axis posture check
FACTORY_TEST_WIFI Wi-Fi RSSI check
FACTORY_TEST_BATTERY battery voltage check
FACTORY_WRITE_SN write the serial number
FACTORY_WRITE_CERT write device certificates
19. Summary
This article has walked through a complete ESP32-S3 based design for the Sibo AI dual-eye smart speaker: hardware selection, project layout, the event bus, touch, IMU, vibration motor, dual-eye UI, I2S audio, WebSocket, MCP, and OTA.
At its core the design is not a "speaker" but an AI terminal with multimodal interaction:
it can listen
it can speak
it can show expressions
it can sense posture
it can respond with vibration
it can query a knowledge base
it can speak with a cloned voice
it can be controlled in natural language
For AI toys, desktop companion robots, children's education devices, branded IP hardware, and smart-home control terminals, the combination of ESP32-S3 + dual-eye screens + touch + IMU + haptic feedback + an AI large model is well suited to engineering delivery and mass production.