AI Voice Assistant with Custom Roles on Baidu's Large Model [New AI Dev Kit "Handheld AI" + 40,000-Character Tutorial + Beginner-Friendly]

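1. Introduction

This project uses the ESP32-S3 to build an AI voice chat assistant with a custom role (for example, a doctor). Working through it covers ESP32-S3 development in Arduino, calls to the Baidu speech recognition and speech synthesis APIs, custom roles through the Baidu AppBuilder API, training a custom wake word, reading and writing an SD card, using the touchscreen, and Wi-Fi provisioning (SmartConfig). All hardware and software for the project is open source, and it comes with detailed written and video tutorials, so it is well suited to beginners. I hope it helps.

1.1 Features

Wi-Fi provisioning through a mini program
Wake-word activation of the ESP32-S3
Custom wake-word model training
Access to the Baidu speech recognition and speech synthesis APIs
Custom-role agent
Standalone power supply
Power on/off by button
Wake by touch on the 1.28-inch TFT touchscreen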

1.2 Open-source links

Project repository:

https://gitee.com/chging/esp32s3-ai-chat

Video tutorials:

厚国兄's Bilibili space: https://b23.tv/AsFNSeJ

Word version of this tutorial:

"ESP32-S3-AI-Chat-V2.docx", shared through Quark Netdisk:

Link: https://pan.quark.cn/s/19835014b2d6

2. Where to buy the AI kit

Purchase link: https://h5.m.taobao.com/awp/core/detail.htm?ft=t&id=833542085705

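3. Software environment

3.1 Installing the Arduino IDE

Download

Download page: https://www.arduino.cc/en/software
The download steps are as follows:

Installation

Double-click the downloaded installer and follow the steps shown in the figure below:

Wait a moment for the installation to finish.

Switching the Arduino IDE language to Chinese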

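3.2 Installing the ESP32 board package

This project needs the esp32 board package. Try the online installation first; if it fails, use the offline method.

Online installation

Open the Arduino IDE and go to File -> Preferences -> Settings.

Paste the following link into the Additional boards manager URLs field:

https://raw.githubusercontent.com/espressif/arduino-esp32/gh-pages/package_esp32_dev_index.json

Then click OK to save.

Open the Boards Manager, search for esp32, and find "esp32 by Espressif Systems". Select a version (2.0.17 here; it has been tested without problems, while newer versions may cause issues), click Install, and wait for the download and installation to finish. (If it fails, click Install again.)

Installation succeeded.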

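Offline installation

If the download or installation keeps failing, install the package offline.

Download the package directly:

"esp32.rar", shared through Quark Netdisk. Link: https://pan.quark.cn/s/61d4a28219bb

Choose the extraction path. The files must go into the Arduino packages directory of the current user. For the Arduino IDE the path is:

C:\Users\<user name>\AppData\Local\Arduino15\packages

Note: AppData is a hidden folder, so enable "show hidden folders" in the folder view options. My user name here is Administrator.

After extracting into that folder, close the IDE, reopen Arduino, and open the Boards Manager; esp32-arduino now shows as installed.

Installation complete.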

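3.3 Installing libraries

This project needs the online and offline libraries listed below.

Installing online libraries

In Arduino, you can search for a library by name in the Library Manager and install it directly.

Online libraries to install

Library	Version
ArduinoJson	7.1.0
base64	1.3.0
UrlEncode	1.0.1
lvgl	8.5.10
TFT_eSPI	2.5.43
bb_captouch	1.2.2

Installation steps

Open the Library Manager -> search for the library name -> select the version and click Install.

Once installed, the library shows as installed, as in the figure below. To delete it, click Remove.

Install base64 the same way.

Install UrlEncode the same way.

Installing the TFT_eSPI library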

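Install the driver library

Modify the display code

Edit User_Setup_Select.h, which is in the Arduino library installation folder.

Comment out the header include at the top of the file.

Enable the header for your own screen model.

To match the pin configuration, open Setup200_GC9A01.h and edit it.

Installing the lvgl library

Install the driver library

Modify the library code

In the lvgl folder, make a copy of lv_conf_template.h and rename it lv_conf.h.

Place lv_conf.h in the directory that contains the lvgl folder, alongside it.

Open lv_conf.h and change the #if 0 at the top to #if 1.

On line 88, change #define LV_TICK_CUSTOM 0 to #define LV_TICK_CUSTOM 1. (Important: this enables a custom tick source. Without it, animations will not switch and the displayed frame rate stays at 1.)

Enable the demo examples.

Change the file path.

Move the lvgl\demos folder into lvgl\src\demos\.

Import the ui folder.

"test_ui.zip", shared through Quark Netdisk.

Link: https://pan.quark.cn/s/fd4657269252

Download and extract the archive, then copy it into lvgl\src.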
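
As a rough sketch of the lv_conf.h edits described above, the relevant lines typically end up looking like the fragment below. The LV_TICK_CUSTOM_INCLUDE and LV_TICK_CUSTOM_SYS_TIME_EXPR values are what the LVGL 8.x template normally ships for Arduino; they are shown here as an assumption, so compare them with your own copy of the file (line numbers differ between releases).

```cpp
/* lv_conf.h (excerpt) - sketch, assuming an LVGL 8.x template */
#if 1 /* was "#if 0": set to 1 to enable the file's content */

/* Use a custom tick source so LVGL can read elapsed time by itself */
#define LV_TICK_CUSTOM 1
#if LV_TICK_CUSTOM
    #define LV_TICK_CUSTOM_INCLUDE "Arduino.h"       /* header that declares millis() */
    #define LV_TICK_CUSTOM_SYS_TIME_EXPR (millis())  /* expression giving ms since boot */
#endif

#endif /* end of content enable */
```

With the tick left at 0, LVGL expects the application to call its tick function periodically, which would explain the frozen animations mentioned above.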

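Installing the bb_captouch library

Installing offline libraries

Arduino can import library files offline. The offline library this project needs is the trained wake-word library. The one provided is the one I trained; if you have not trained your own, import mine first. To train your own wake word, see Chapter 6.

Offline library to install

Library
wakeup_detect_houguoxiong_inferencing

Installation steps

Go to Sketch -> Include Library -> Add .ZIP Library and select the local Arduino library file.

Select the library file and click Open. The offline library is in the esp32s3-ai-chat\library folder.

Check the installed libraries; the contributed libraries list shows it as installed.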

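4. Applying for and testing the Baidu API key

4.1 Applying for the API key

Before calling the Baidu APIs, apply for an API key on Baidu AI Cloud and enable the corresponding API services. Only then can the APIs be accessed.

Baidu AI Cloud: https://cloud.baidu.com/

Baidu speech recognition

First create the API key for speech recognition.

Go to Products -> Speech Technology -> Speech Recognition -> Short Speech Recognition (Standard).

Click Use Now. The Baidu account login page appears; log in or register with a phone number.

After registering, complete real-name verification; follow the flow below for individual verification.

Open the Speech Technology page. To claim the free quota, click Claim.

Claim the free quota for the speech recognition interfaces that are still unclaimed.

Click Create Application.

Fill in the application name, select all interfaces, set the application owner to Individual, fill in the application description, and click Create Now.

When creation finishes, click Back to Application List.

The application list shows the application you just created along with its API Key and Secret Key. These two keys are copied into the program later to access the speech recognition API.

Next, enable the speech recognition service. In the left sidebar, go to Overview -> Speech Recognition -> Short Speech Recognition -> Enable.

Choose Pay-as-you-go (postpaid) -> Speech Recognition -> Short Speech Recognition - Mandarin Chinese -> accept the service agreement -> Confirm.

The speech recognition API key is now created and the service is enabled.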

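Baidu speech synthesis

Speech synthesis uses the same API key as speech recognition, so the key created in the previous section can be used directly. The service itself must be enabled separately.

In the left sidebar, go to Overview -> Speech Synthesis -> Short Text Online Synthesis -> Basic Voice Library -> Enable.

Choose Pay-as-you-go (postpaid) -> Speech Synthesis -> Short Text Online Synthesis - Basic Voice Library -> accept the service agreement -> Confirm.

The speech synthesis service is now enabled. Note that speech synthesis generally comes with no free quota, so top up the account with a small amount in advance.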

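Creating the Baidu Agent role

The API key for the large-model Agent application is also obtained on Baidu AI Cloud.

On the Baidu AI Cloud home page, select Qianfan AppBuilder (the large-model application development platform).

Click Use Now.

Log in with your Baidu account when prompted; you then enter AppBuilder.

To create a key, go to Key Management -> New Key.

To create an application, go to Personal Space -> Applications -> Create Application -> Autonomous Planning Agent.

Next, set up the Agent's role following the flow below, then click Update and Publish.

Back in Personal Space, the application shows as published, and its application ID can be obtained here.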

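4.2 Testing API access online (optional)

After enabling the API services above, you can test online whether they work.

On the Baidu AI Cloud home page, click Console.

In the console, go to Documentation -> Sample Code.

This opens the API test and sample-code page, where Baidu speech recognition, speech synthesis, and the ERNIE Bot (文心一言) large model can be tested.

Speech synthesis

Test speech synthesis first: its input is plain text that can be typed in directly, whereas speech recognition needs audio data as input. The audio produced by synthesis can then be saved and used as the input for the speech recognition test.

Go to All Products -> Speech Technology to open the speech API test page.

Go to Authentication -> Get AccessToken -> Go Now.

In the application list, select the application whose services were enabled and click OK.

Click Debug. The response data in the debug result contains the access_token, which shows that the API key works.

Next, test whether the speech synthesis service is enabled. Go to Speech Synthesis -> Short Text Online Synthesis, enter the text to synthesize, choose a voice, adjust speed, pitch, and volume, select the wav audio format, and click Synthesize.

When the audio has been generated, click the play button to check it. (Click the three dots next to it to save the audio for use as the speech recognition input later.)

Click Debug Result to view the request and response data. This is how to verify that the speech synthesis API service is enabled.

Click Sample Code to see API call implementations for various languages and platforms, which can be used to call the Baidu API services from other platforms.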

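Speech recognition

Go to Speech Recognition -> Short Speech Recognition (Standard), click Upload File, and upload the audio file produced in the speech synthesis step. Leave the other parameters at their defaults.

Click Debug. After it runs, check Debug Result; the response data shows whether the speech was recognized correctly. This is how to verify that the speech recognition API service is enabled.

As before, Sample Code provides API call implementations in various programming languages.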

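4.3 Testing API access with Apipost (optional, recommended tool)

Apipost is a widely used general-purpose tool for testing whether API endpoints work.

Download: https://www.apipost.cn/

After installing it, open the application, as shown in the figure below.

The example here is the Baidu API authentication request from the previous section that obtains the access_token. Configure the parameters in Apipost by following the sample code from that section.

Go to Sample Code -> Curl -> Copy.

In Apipost, open API Management, click "+", and choose Import from cURL.

Paste the copied code and click Import Now.

Click Send.

Check the response data under Real-time Response. This confirms through Apipost that the API can be accessed.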

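5. Running the AI Agent role-play main program

5.1 Running the main program

The project for the AI Agent role-play main program is esp32s3-ai-chat\ai-all\esp32s3_ai_chat_all\esp32s3_ai_chat_all.ino. Open it, update the Wi-Fi and API key settings, compile, and flash it to the board for testing.

Set the Wi-Fi name: assign the ssid and password of your current Wi-Fi to the corresponding variables. Fill in the API keys obtained in Chapter 4.

Enable PSRAM.

Selecting the board and port

Click the board selector at the top and choose the board and port. Select ESP32S3 Dev Module as the board, and for the port select the serial port that appears after connecting the USB Type-C cable (check Device Manager if needed).

Compile and upload.

Voice conversation test.

Say "houguoxiong" to wake the ESP32-S3 and start a conversation. Alternatively, touch the screen: recording runs while the screen is held and stops on release, after which the system sends the recording through speech recognition, the large model, and the rest of the pipeline.

5.2 Overall software flow

5.3 Implementation and code analysis of the main modules

Wake-word activation

This module implements activation by a self-trained wake word. The project esp32s3-ai-chat/example/wake_detect implements exactly this, and AI voice chat can be developed on top of it.

Full code: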

#include <wakeup_detect_houguoxiong_inferencing.h>

/* Edge Impulse Arduino examples
 * Copyright (c) 2022 EdgeImpulse Inc.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

// If your target is limited in memory remove this macro to save 10K RAM
#define EIDSP_QUANTIZE_FILTERBANK 0

/*
 ** NOTE: If you run into TFLite arena allocation issue.
 **
 ** This may be due to may dynamic memory fragmentation.
 ** Try defining "-DEI_CLASSIFIER_ALLOCATION_STATIC" in boards.local.txt (create
 ** if it doesn't exist) and copy this file to
 ** `<ARDUINO_CORE_INSTALL_PATH>/arduino/hardware/<mbed_core>/<core_version>/`.
 **
 ** See
 ** (https://support.arduino.cc/hc/en-us/articles/360012076960-Where-are-the-installed-cores-located-)
 ** to find where Arduino installs cores on your machine.
 **
 ** If the problem persists then there's not enough memory for this model and application.
 */

/* Includes ---------------------------------------------------------------- */
#include <driver/i2s.h>

#define SAMPLE_RATE 16000U
#define LED_BUILT_IN 21

// INMP441 config
#define I2S_IN_PORT I2S_NUM_0
#define I2S_IN_BCLK 4
#define I2S_IN_LRC 5
#define I2S_IN_DIN 6

/** Audio buffers, pointers and selectors */
typedef struct {
  int16_t *buffer;
  uint8_t buf_ready;
  uint32_t buf_count;
  uint32_t n_samples;
} inference_t;

static inference_t inference;
static const uint32_t sample_buffer_size = 2048;
static signed short sampleBuffer[sample_buffer_size];
static bool debug_nn = false;  // Set this to true to see e.g. features generated from the raw signal
static bool record_status = true;

/**
 * @brief      Arduino setup function
 */
void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);
  // comment out the below line to cancel the wait for USB connection (needed for native USB)
  while (!Serial)
    ;
  Serial.println("Edge Impulse Inferencing Demo");

  pinMode(LED_BUILT_IN, OUTPUT);     // Set the pin as output
  digitalWrite(LED_BUILT_IN, HIGH);  //Turn off

  // Initialize I2S for audio input
  i2s_config_t i2s_config_in = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = SAMPLE_RATE,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,  // Note: the INMP441 outputs 32-bit data frames
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    .communication_format = i2s_comm_format_t(I2S_COMM_FORMAT_STAND_I2S),
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 8,
    .dma_buf_len = 1024,
  };
  i2s_pin_config_t pin_config_in = {
    .bck_io_num = I2S_IN_BCLK,
    .ws_io_num = I2S_IN_LRC,
    .data_out_num = -1,
    .data_in_num = I2S_IN_DIN
  };
  i2s_driver_install(I2S_IN_PORT, &i2s_config_in, 0, NULL);
  i2s_set_pin(I2S_IN_PORT, &pin_config_in);

  // summary of inferencing settings (from model_metadata.h)
  ei_printf("Inferencing settings:\n");
  ei_printf("\tInterval: ");
  ei_printf_float((float)EI_CLASSIFIER_INTERVAL_MS);
  ei_printf(" ms.\n");
  ei_printf("\tFrame size: %d\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE);
  ei_printf("\tSample length: %d ms.\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16);
  ei_printf("\tNo. of classes: %d\n", sizeof(ei_classifier_inferencing_categories) / sizeof(ei_classifier_inferencing_categories[0]));

  ei_printf("\nStarting continious inference in 2 seconds...\n");
  ei_sleep(2000);

  if (microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT) == false) {
    ei_printf("ERR: Could not allocate audio buffer (size %d), this could be due to the window length of your model\r\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT);
    return;
  }

  ei_printf("Recording...\n");
}

/**
 * @brief      Arduino main function. Runs the inferencing loop.
 */
void loop() {
  bool m = microphone_inference_record();
  if (!m) {
    ei_printf("ERR: Failed to record audio...\n");
    return;
  }

  signal_t signal;
  signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
  signal.get_data = &microphone_audio_signal_get_data;
  ei_impulse_result_t result = { 0 };

  EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);
  if (r != EI_IMPULSE_OK) {
    ei_printf("ERR: Failed to run classifier (%d)\n", r);
    return;
  }

  int pred_index = 0;    // Initialize pred_index
  float pred_value = 0;  // Initialize pred_value

  // print the predictions
  ei_printf("Predictions ");
  ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",
            result.timing.dsp, result.timing.classification, result.timing.anomaly);
  ei_printf(": \n");
  for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    ei_printf("    %s: ", result.classification[ix].label);
    ei_printf_float(result.classification[ix].value);
    ei_printf("\n");

    if (result.classification[ix].value > pred_value) {
      pred_index = ix;
      pred_value = result.classification[ix].value;
    }
  }
  // Display inference result
  if (pred_index == 3) {
    digitalWrite(LED_BUILT_IN, LOW);  //Turn on
  } else {
    digitalWrite(LED_BUILT_IN, HIGH);  //Turn off
  }


#if EI_CLASSIFIER_HAS_ANOMALY == 1
  ei_printf("    anomaly score: ");
  ei_printf_float(result.anomaly);
  ei_printf("\n");
#endif
}

static void audio_inference_callback(uint32_t n_bytes) {
  for (int i = 0; i < n_bytes >> 1; i++) {
    inference.buffer[inference.buf_count++] = sampleBuffer[i];

    if (inference.buf_count >= inference.n_samples) {
      inference.buf_count = 0;
      inference.buf_ready = 1;
    }
  }
}

static void capture_samples(void *arg) {

  const int32_t i2s_bytes_to_read = (uint32_t)arg;
  size_t bytes_read = i2s_bytes_to_read;

  while (record_status) {

    /* read data at once from i2s - Modified for XIAO ESP2S3 Sense and I2S.h library */
    i2s_read(I2S_IN_PORT, (void*)sampleBuffer, i2s_bytes_to_read, &bytes_read, 100);
    // esp_i2s::i2s_read(esp_i2s::I2S_NUM_0, (void *)sampleBuffer, i2s_bytes_to_read, &bytes_read, 100);

    if (bytes_read <= 0) {
      ei_printf("Error in I2S read : %d", bytes_read);
    } else {
      if (bytes_read < i2s_bytes_to_read) {
        ei_printf("Partial I2S read");
      }

      // scale the data (otherwise the sound is too quiet)
      for (int x = 0; x < i2s_bytes_to_read / 2; x++) {
        sampleBuffer[x] = (int16_t)(sampleBuffer[x]) * 8;
      }

      if (record_status) {
        audio_inference_callback(i2s_bytes_to_read);
      } else {
        break;
      }
    }
  }
  vTaskDelete(NULL);
}

/**
 * @brief      Init inferencing struct and setup/start PDM
 *
 * @param[in]  n_samples  The n samples
 *
 * @return     { description_of_the_return_value }
 */
static bool microphone_inference_start(uint32_t n_samples) {
  inference.buffer = (int16_t *)malloc(n_samples * sizeof(int16_t));

  if (inference.buffer == NULL) {
    return false;
  }

  inference.buf_count = 0;
  inference.n_samples = n_samples;
  inference.buf_ready = 0;

  //    if (i2s_init(EI_CLASSIFIER_FREQUENCY)) {
  //        ei_printf("Failed to start I2S!");
  //    }

  ei_sleep(100);

  record_status = true;

  xTaskCreate(capture_samples, "CaptureSamples", 1024 * 32, (void *)sample_buffer_size, 10, NULL);

  return true;
}

/**
 * @brief      Wait on new data
 *
 * @return     True when finished
 */
static bool microphone_inference_record(void) {
  bool ret = true;

  while (inference.buf_ready == 0) {
    delay(10);
  }

  inference.buf_ready = 0;
  return ret;
}

/**
 * Get raw audio signal data
 */
static int microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr) {
  numpy::int16_to_float(&inference.buffer[offset], out_ptr, length);

  return 0;
}

/**
 * @brief      Release the inference buffer
 */
static void microphone_inference_end(void) {
  // sampleBuffer is a static array, so it must not be passed to free()
  ei_free(inference.buffer);
}

#if !defined(EI_CLASSIFIER_SENSOR) || EI_CLASSIFIER_SENSOR != EI_CLASSIFIER_SENSOR_MICROPHONE
#error "Invalid model for current sensor."
#endif

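The individual parts of the code are explained below.

This is the header of the trained wake-word model library; it must be included in the project.

#include <wakeup_detect_houguoxiong_inferencing.h>

Initialize the I2S configuration for the INMP441 microphone.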

// Initialize I2S for audio input
  i2s_config_t i2s_config_in = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = SAMPLE_RATE,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,  // Note: the INMP441 outputs 32-bit data frames
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    .communication_format = i2s_comm_format_t(I2S_COMM_FORMAT_STAND_I2S),
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 8,
    .dma_buf_len = 1024,
  };
  i2s_pin_config_t pin_config_in = {
    .bck_io_num = I2S_IN_BCLK,
    .ws_io_num = I2S_IN_LRC,
    .data_out_num = -1,
    .data_in_num = I2S_IN_DIN
  };
  i2s_driver_install(I2S_IN_PORT, &i2s_config_in, 0, NULL);
  i2s_set_pin(I2S_IN_PORT, &pin_config_in);

This initializes the wake-word recognition interface.

// summary of inferencing settings (from model_metadata.h)
  ei_printf("Inferencing settings:\n");
  ei_printf("\tInterval: ");
  ei_printf_float((float)EI_CLASSIFIER_INTERVAL_MS);
  ei_printf(" ms.\n");
  ei_printf("\tFrame size: %d\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE);
  ei_printf("\tSample length: %d ms.\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16);
  ei_printf("\tNo. of classes: %d\n", sizeof(ei_classifier_inferencing_categories) / sizeof(ei_classifier_inferencing_categories[0]));

  ei_printf("\nStarting continious inference in 2 seconds...\n");
  ei_sleep(2000);

  if (microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT) == false) {
    ei_printf("ERR: Could not allocate audio buffer (size %d), this could be due to the window length of your model\r\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT);
    return;
  }

This initialization function mainly creates a FreeRTOS task whose job is to capture audio data in real time.

static bool microphone_inference_start(uint32_t n_samples) {
  inference.buffer = (int16_t *)malloc(n_samples * sizeof(int16_t));

  if (inference.buffer == NULL) {
    return false;
  }

  inference.buf_count = 0;
  inference.n_samples = n_samples;
  inference.buf_ready = 0;

  ei_sleep(100);

  record_status = true;

  xTaskCreate(capture_samples, "CaptureSamples", 1024 * 32, (void *)sample_buffer_size, 10, NULL);

  return true;
}

The real-time audio capture task stores the captured data in the global buffer sampleBuffer.

static void capture_samples(void *arg) {

  const int32_t i2s_bytes_to_read = (uint32_t)arg;
  size_t bytes_read = i2s_bytes_to_read;

  while (record_status) {

    /* read data at once from i2s - Modified for XIAO ESP2S3 Sense and I2S.h library */
    i2s_read(I2S_IN_PORT, (void*)sampleBuffer, i2s_bytes_to_read, &bytes_read, 100);
    // esp_i2s::i2s_read(esp_i2s::I2S_NUM_0, (void *)sampleBuffer, i2s_bytes_to_read, &bytes_read, 100);

    if (bytes_read <= 0) {
      ei_printf("Error in I2S read : %d", bytes_read);
    } else {
      if (bytes_read < i2s_bytes_to_read) {
        ei_printf("Partial I2S read");
      }

      // scale the data (otherwise the sound is too quiet)
      for (int x = 0; x < i2s_bytes_to_read / 2; x++) {
        sampleBuffer[x] = (int16_t)(sampleBuffer[x]) * 8;
      }

      if (record_status) {
        audio_inference_callback(i2s_bytes_to_read);
      } else {
        break;
      }
    }
  }
  vTaskDelete(NULL);
}

The data buffered in sampleBuffer is copied into the inference struct, which later serves as the input to the classifier. This completes the code that prepares the audio input.

static void audio_inference_callback(uint32_t n_bytes) {
  for (int i = 0; i < n_bytes >> 1; i++) {
    inference.buffer[inference.buf_count++] = sampleBuffer[i];

    if (inference.buf_count >= inference.n_samples) {
      inference.buf_count = 0;
      inference.buf_ready = 1;
    }
  }
}

Next, the classification itself.

void loop() {
  bool m = microphone_inference_record();
  if (!m) {
    ei_printf("ERR: Failed to record audio...\n");
    return;
  }

  signal_t signal;
  signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
  signal.get_data = &microphone_audio_signal_get_data;
  ei_impulse_result_t result = { 0 };

  EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);
  if (r != EI_IMPULSE_OK) {
    ei_printf("ERR: Failed to run classifier (%d)\n", r);
    return;
  }

  int pred_index = 0;    // Initialize pred_index
  float pred_value = 0;  // Initialize pred_value

  // print the predictions
  ei_printf("Predictions ");
  ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",
            result.timing.dsp, result.timing.classification, result.timing.anomaly);
  ei_printf(": \n");
  for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    ei_printf("    %s: ", result.classification[ix].label);
    ei_printf_float(result.classification[ix].value);
    ei_printf("\n");

    if (result.classification[ix].value > pred_value) {
      pred_index = ix;
      pred_value = result.classification[ix].value;
    }
  }
  // Display inference result
  if (pred_index == 3) {
    digitalWrite(LED_BUILT_IN, LOW);  //Turn on
  } else {
    digitalWrite(LED_BUILT_IN, HIGH);  //Turn off
  }


#if EI_CLASSIFIER_HAS_ANOMALY == 1
  ei_printf("    anomaly score: ");
  ei_printf_float(result.anomaly);
  ei_printf("\n");
#endif
}

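The main loop runs classification on the captured audio. microphone_audio_signal_get_data fetches the stored audio data, and run_classifier(&signal, &result, debug_nn) computes the prediction values. result holds one prediction for each label the model was trained with.
The closer result.classification[ix].value is to 1.0, the more likely the current audio belongs to that label. When the trained wake word is spoken, its prediction value approaches 1.0, which triggers the wake-up.
To decide whether the wake-up succeeded, compare result.classification[ix].value with a threshold; adjusting the threshold controls the detection sensitivity. This completes the code for the wake-up flow.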
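
The threshold comparison described above is not part of the listing, which only checks pred_index == 3. A minimal sketch of such a check follows; isWakeWordDetected, its parameters, and the 0.8 threshold are hypothetical names and values chosen for illustration, not taken from the project.

```cpp
#include <cstddef>

// Hypothetical helper: decide whether the wake word fired.
// scores    - per-label prediction values (as in result.classification[ix].value)
// count     - number of labels
// wakeIndex - index of the wake-word label (the listing above uses 3)
// threshold - minimum score required, e.g. 0.8f (an assumed starting point)
bool isWakeWordDetected(const float* scores, std::size_t count,
                        std::size_t wakeIndex, float threshold) {
  if (scores == nullptr || wakeIndex >= count) {
    return false;
  }
  // The wake label must be the best-scoring label...
  for (std::size_t ix = 0; ix < count; ix++) {
    if (ix != wakeIndex && scores[ix] > scores[wakeIndex]) {
      return false;
    }
  }
  // ...and must also clear the confidence threshold.
  return scores[wakeIndex] >= threshold;
}
```

Raising the threshold reduces false wake-ups at the cost of missing quieter utterances; lowering it does the opposite, so the value is best tuned against your own recordings.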

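Obtaining the access_token for Baidu API access

Accessing Baidu speech recognition, speech synthesis, and the ERNIE Bot large model all require an access_token. On the ESP32-S3, an HTTP request is built according to the access_token API format and sent; the access_token is then parsed from the response.

Full code: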

// Get Baidu API access token
String getAccessToken(const char* api_key, const char* secret_key) {
  String access_token = "";
  HTTPClient http;

  // Create the HTTP request
  http.begin("https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=" + String(api_key) + "&client_secret=" + String(secret_key));
  int httpCode = http.POST("");

  if (httpCode == HTTP_CODE_OK) {
    String response = http.getString();
    DynamicJsonDocument doc(1024);
    deserializeJson(doc, response);
    access_token = doc["access_token"].as<String>();

    Serial.printf("[HTTP] GET access_token: %s\n", access_token.c_str());
  } else {
    Serial.printf("[HTTP] GET... failed, error: %s\n", http.errorToString(httpCode).c_str());
  }
  http.end();

  return access_token;
}

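First apply for the api_key and secret_key on Baidu AI Cloud, as described in Chapter 4. Then pass api_key and secret_key as input parameters, send the http.POST request in the format the API expects, and parse the access_token from the response data.

Baidu speech recognition API access

After the ESP32-S3 captures audio from the INMP441 over I2S, the audio stream must be converted to text, so the speech recognition API is called to recognize speech in real time. Baidu's speech recognition API is used here.

The main code is as follows: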

String baiduSTT_Send(String access_token, uint8_t* audioData, int audioDataSize) {
  String recognizedText = "";

  if (access_token == "") {
    Serial.println("access_token is null");
    return recognizedText;
  }

  // The audio data must be Base64-encoded, which grows it by about 1/3
  int audio_data_len = audioDataSize * sizeof(char) * 1.4;
  unsigned char* audioDataBase64 = (unsigned char*)ps_malloc(audio_data_len);
  if (!audioDataBase64) {
    Serial.println("Failed to allocate memory for audioDataBase64");
    return recognizedText;
  }

  // JSON buffer size: it holds the Base64 text (about 1/3 larger than the raw audio) plus the other fields
  int data_json_len = audioDataSize * sizeof(char) * 1.4;
  char* data_json = (char*)ps_malloc(data_json_len);
  if (!data_json) {
    Serial.println("Failed to allocate memory for data_json");
    free(audioDataBase64);  // avoid leaking the first buffer
    return recognizedText;
  }

  // Base64 encode audio data
  encode_base64(audioData, audioDataSize, audioDataBase64);

  memset(data_json, '\0', data_json_len);
  strcat(data_json, "{");
  strcat(data_json, "\"format\":\"pcm\",");
  strcat(data_json, "\"rate\":16000,");
  strcat(data_json, "\"dev_pid\":1537,");
  strcat(data_json, "\"channel\":1,");
  strcat(data_json, "\"cuid\":\"57722200\",");
  strcat(data_json, "\"token\":\"");
  strcat(data_json, access_token.c_str());
  strcat(data_json, "\",");
  sprintf(data_json + strlen(data_json), "\"len\":%d,", audioDataSize);
  strcat(data_json, "\"speech\":\"");
  strcat(data_json, (const char*)audioDataBase64);
  strcat(data_json, "\"");
  strcat(data_json, "}");

  // Create the HTTP request
  HTTPClient http_client;

  http_client.begin("http://vop.baidu.com/server_api");
  http_client.addHeader("Content-Type", "application/json");
  int httpCode = http_client.POST(data_json);

  if (httpCode > 0) {
    if (httpCode == HTTP_CODE_OK) {
      // Get the response
      String response = http_client.getString();
      Serial.println(response);

      // Parse the result field from the JSON
      DynamicJsonDocument responseDoc(2048);
      deserializeJson(responseDoc, response);
      recognizedText = responseDoc["result"].as<String>();
    }
  } else {
    Serial.printf("[HTTP] POST failed, error: %s\n", http_client.errorToString(httpCode).c_str());
  }

  // Free memory
  if (audioDataBase64) {
    free(audioDataBase64);
  }

  if (data_json) {
    free(data_json);
  }

  http_client.end();

  return recognizedText;
}

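Key points of the code above:

The buffer for the JSON payload needs to be about 1.4 times the size of the input data, because the input is Base64-encoded. The allocation is fairly large, so it is taken from PSRAM.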

  // The audio data must be Base64-encoded, which grows it by about 1/3
  int audio_data_len = audioDataSize * sizeof(char) * 1.4;
  unsigned char* audioDataBase64 = (unsigned char*)ps_malloc(audio_data_len);
  if (!audioDataBase64) {
    Serial.println("Failed to allocate memory for audioDataBase64");
    return recognizedText;
  }

  // JSON buffer size: it holds the Base64 text (about 1/3 larger than the raw audio) plus the other fields
  int data_json_len = audioDataSize * sizeof(char) * 1.4;
  char* data_json = (char*)ps_malloc(data_json_len);
  if (!data_json) {
    Serial.println("Failed to allocate memory for data_json");
    free(audioDataBase64);  // avoid leaking the first buffer
    return recognizedText;
  }
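
The 1.4 factor above is an approximation. As a sketch of how the sizes could be computed exactly instead, the helpers below derive the Base64 length from the raw length and add room for the other JSON fields. Both function names and the 160-byte allowance are hypothetical; they are not part of the project or of the base64 library.

```cpp
#include <cstddef>

// Exact length of the Base64 text for inputLen raw bytes (with '=' padding),
// not counting a terminating NUL: every 3 input bytes become 4 characters.
std::size_t base64EncodedLength(std::size_t inputLen) {
  return 4 * ((inputLen + 2) / 3);
}

// Hypothetical sizing helper for the JSON request buffer.
// tokenLen is the access_token length; fixedFieldsLen is an assumed
// allowance for the constant JSON keys, punctuation and the "len" digits.
std::size_t sttJsonBufferSize(std::size_t audioLen, std::size_t tokenLen,
                              std::size_t fixedFieldsLen = 160) {
  return base64EncodedLength(audioLen) + tokenLen + fixedFieldsLen + 1;  // +1 for NUL
}
```

The 1.4 factor leaves roughly 0.067 × audioDataSize bytes beyond the Base64 text. That is plenty for recordings of a second or more, but for very short clips it may not cover the token and the fixed JSON fields (a little over 100 characters), which is the case an exact calculation guards against.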

The payload is packed in the format given in the API documentation. Note that len is the size of the raw data, not the size after Base64 encoding.

  // Base64 encode audio data
  encode_base64(audioData, audioDataSize, audioDataBase64);

  memset(data_json, '\0', data_json_len);
  strcat(data_json, "{");
  strcat(data_json, "\"format\":\"pcm\",");
  strcat(data_json, "\"rate\":16000,");
  strcat(data_json, "\"dev_pid\":1537,");
  strcat(data_json, "\"channel\":1,");
  strcat(data_json, "\"cuid\":\"57722200\",");
  strcat(data_json, "\"token\":\"");
  strcat(data_json, access_token.c_str());
  strcat(data_json, "\",");
  sprintf(data_json + strlen(data_json), "\"len\":%d,", audioDataSize);
  strcat(data_json, "\"speech\":\"");
  strcat(data_json, (const char*)audioDataBase64);
  strcat(data_json, "\"");
  strcat(data_json, "}");

The JSON document for the response must be large enough to hold the returned data.

// Parse the result field from the JSON
DynamicJsonDocument responseDoc(2048);
deserializeJson(responseDoc, response);
recognizedText = responseDoc["result"].as<String>();

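Baidu Agent role-definition API access

Speech recognition returns its result as text, which is used as the input to the Baidu large-model Agent API. The code for calling the large-model API is as follows: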

// Get Baidu API conversation id
String getConversation_id(const char* api_key, const char* app_id) {
  String conversation_id = "";

  // Create the HTTP request
  HTTPClient http;
  http.begin("https://qianfan.baidubce.com/v2/app/conversation");
  http.addHeader("Content-Type", "application/json");
  http.addHeader("X-Appbuilder-Authorization", "Bearer " + String(api_key));

  // Create a JSON document
  DynamicJsonDocument requestJson(1024);
  requestJson["app_id"] = app_id;

  // Serialize the JSON data to a string
  String requestBody;
  serializeJson(requestJson, requestBody);

  // Send the HTTP request
  int httpCode = http.POST(requestBody);
  if (httpCode == HTTP_CODE_OK) {
    String response = http.getString();
    DynamicJsonDocument doc(1024);
    deserializeJson(doc, response);
    conversation_id = doc["conversation_id"].as<String>();

    ei_printf("[HTTP] GET conversation_id: %s\n", conversation_id.c_str());
  } else {
    ei_printf("[HTTP] GET... failed, error: %s\n", http.errorToString(httpCode).c_str());
  }
  http.end();

  return conversation_id;
}
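
The listing above only creates a conversation; sending the recognized text to the Agent is a second request that this section does not show. As a hedged sketch of how its body might be assembled, the helpers below escape the text and build a JSON string. The field names (app_id, query, conversation_id, stream) and both function names are assumptions to check against the AppBuilder API documentation and the project's own source.

```cpp
#include <string>

// Escape the characters that must not appear raw inside a JSON string.
std::string jsonEscape(const std::string& in) {
  std::string out;
  for (unsigned char c : in) {
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      case '\r': out += "\\r";  break;
      case '\t': out += "\\t";  break;
      default:
        if (c >= 0x20) {
          out += static_cast<char>(c);  // UTF-8 bytes pass through unchanged
        }
        // other control characters are dropped in this sketch
    }
  }
  return out;
}

// Hypothetical body for the follow-up "run" request (field names assumed).
std::string buildAgentRunBody(const std::string& appId,
                              const std::string& conversationId,
                              const std::string& query) {
  return "{\"app_id\":\"" + jsonEscape(appId) +
         "\",\"query\":\"" + jsonEscape(query) +
         "\",\"conversation_id\":\"" + jsonEscape(conversationId) +
         "\",\"stream\":false}";
}
```

Escaping matters here because the query comes from speech recognition and may contain quotation marks; on the device, building the body with ArduinoJson as in the listing above would handle the escaping automatically.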

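Baidu speech synthesis API access

The text returned by the Baidu ERNIE Bot API has to be played through the speaker, so it must be converted into audio. The Baidu speech synthesis API is called to do the text-to-audio conversion. The main code is as follows: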

void baiduTTS_Send(String access_token, String text) {
  if (access_token == "") {
    Serial.println("access_token is null");
    return;
  }

  if (text.length() == 0) {
    Serial.println("text is null");
    return;
  }

  const int per = 1;
  const int spd = 5;
  const int pit = 5;
  const int vol = 10;
  const int aue = 6;

  // URL-encode the text (twice)
  String encodedText = urlEncode(urlEncode(text));

  // Assemble the HTTP request URL
  String url = "https://tsn.baidu.com/text2audio";

  const char* header[] = { "Content-Type", "Content-Length" };

  url += "?tok=" + access_token;
  url += "&tex=" + encodedText;
  url += "&per=" + String(per);
  url += "&spd=" + String(spd);
  url += "&pit=" + String(pit);
  url += "&vol=" + String(vol);
  url += "&aue=" + String(aue);
  url += "&cuid=esp32s3";
  url += "&lan=zh";
  url += "&ctp=1";

  // Create the HTTP request
  HTTPClient http;

  http.begin(url);
  http.collectHeaders(header, 2);

  // Send the HTTP request
  int httpResponseCode = http.GET();
  if (httpResponseCode > 0) {
    if (httpResponseCode == HTTP_CODE_OK) {
      String contentType = http.header("Content-Type");
      Serial.println(contentType);
      if (contentType.startsWith("audio")) {
        Serial.println("Synthesis succeeded");

        // Get the returned audio data stream
        Stream* stream = http.getStreamPtr();
        uint8_t buffer[512];
        size_t bytesRead = 0;

        // Set the timeout to 200 ms to avoid noise at the end
        stream->setTimeout(200);

        while (http.connected() && (bytesRead = stream->readBytes(buffer, sizeof(buffer))) > 0) {
          // Play the audio
          playAudio(buffer, bytesRead);
          delay(1);
        }

        // Clear the I2S DMA buffer
        clearAudio();
      } else if (contentType.equals("application/json")) {
        Serial.println("Synthesis failed");
      } else {
        Serial.println("Unknown Content-Type");
      }
    } else {
      Serial.println("Failed to receive audio file");
    }
  } else {
    Serial.print("Error code: ");
    Serial.println(httpResponseCode);
  }
  http.end();
}

// Play audio data using MAX98357A
void playAudio(uint8_t* audioData, size_t audioDataSize) {
  if (audioDataSize > 0) {
    // Write the samples to I2S
    size_t bytes_written = 0;
    i2s_write(I2S_OUT_PORT, (int16_t*)audioData, audioDataSize, &bytes_written, portMAX_DELAY);
  }
}

void clearAudio(void) {
  // Clear the I2S DMA buffer
  i2s_zero_dma_buffer(I2S_OUT_PORT);
  Serial.print("clearAudio");
}

Key points of the code above:

The text is URL-encoded twice, as recommended in the official API documentation.

// URL-encode the text (twice)
  String encodedText = urlEncode(urlEncode(text));
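
To make the effect of the double encoding concrete, here is a small host-side sketch. percentEncode is a hypothetical stand-in written for this example; it is not the UrlEncode library's implementation, though it is assumed to behave similarly for these inputs.

```cpp
#include <string>

// Host-side stand-in for urlEncode(): percent-encode every byte except the
// RFC 3986 unreserved characters (letters, digits, '-', '_', '.', '~').
std::string percentEncode(const std::string& in) {
  static const char hex[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : in) {
    bool unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                      (c >= '0' && c <= '9') || c == '-' || c == '_' ||
                      c == '.' || c == '~';
    if (unreserved) {
      out += static_cast<char>(c);
    } else {
      out += '%';
      out += hex[c >> 4];    // high nibble
      out += hex[c & 0x0F];  // low nibble
    }
  }
  return out;
}
```

Encoding "a b" once gives a%20b; encoding that result again gives a%2520b, because the '%' itself is escaped as %25. The same happens to every byte of UTF-8 Chinese text, so the server has to decode the tex parameter twice to recover the original.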

The HTTP request is assembled with parameters set according to the API format.

// Assemble the HTTP request URL
  String url = "https://tsn.baidu.com/text2audio";

  const char* header[] = { "Content-Type", "Content-Length" };

  url += "?tok=" + access_token;
  url += "&tex=" + encodedText;
  url += "&per=" + String(per);
  url += "&spd=" + String(spd);
  url += "&pit=" + String(pit);
  url += "&vol=" + String(vol);
  url += "&aue=" + String(aue);
  url += "&cuid=esp32s3";
  url += "&lan=zh";
  url += "&ctp=1";

  // Create the HTTP request
  HTTPClient http;

  http.begin(url);
  http.collectHeaders(header, 2);

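This sets the maximum timeout for reading the HTTP stream. The library default is 1 s, which caused a warbling noise at the end of speaker playback, so the timeout is reduced here.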

// Set the timeout to 200 ms to avoid noise at the end
stream->setTimeout(200);

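This reads the HTTP audio stream. A delay is needed inside the while loop; otherwise the loop hogs the CPU and other tasks, such as audio recording and wake-word detection, cannot run, so the device cannot be woken during audio output. The delay yields the CPU.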

while (http.connected() && (bytesRead = stream->readBytes(buffer, sizeof(buffer))) > 0) {
  // Play the audio
  playAudio(buffer, bytesRead);
  delay(1);
}

This clears the I2S DMA buffer to eliminate noise.

void clearAudio(void) {
  // Clear the I2S DMA buffer
  i2s_zero_dma_buffer(I2S_OUT_PORT);
  Serial.print("clearAudio");
}

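6. Training your own wake word (advanced)

6.1 Recording audio

Hardware preparation

Prepare the following hardware:

AI kit
microSD card (no larger than 32 GB)
microSD card reader

Formatting the microSD card

Insert the microSD card into the card reader, connect it to the computer, and format the card as FAT32, as shown in the figure below:

After formatting, insert the microSD card into the card slot of the AI kit.

Recording audio data

The recording program is flashed to the ESP32-S3, which saves the recorded audio to the microSD card. The files can then be read on a computer.

Flashing the recording program to the ESP32-S3

The recording project is in esp32s3-ai-chat/example/capture_audio_data. Open the project file. Because the project uses PSRAM, turn on the PSRAM option before compiling, as shown in the figure below, then compile and flash to the ESP32-S3.

Sending labels over serial to record audio

When the program runs normally, the serial log looks like the figure below.

Once the program is running, open a serial terminal tool and send control commands to record audio. Send the label "hgx".

Send the "rec" command to record one sample.

After a label (for example, hgx) is sent, the program waits for the rec command. Each time rec is sent, the program records a new sample (recording stops automatically after 10 seconds), and the files are saved as hgx.1.wav, hgx.2.wav, hgx.3.wav, and so on.
When a new label (for example, noise) is sent, the program switches to recording samples for that label. Each rec command then records a sample saved as noise.1.wav, noise.2.wav, noise.3.wav, and so on.
In the end, all recorded label samples are stored on the SD card and can be read on a computer through the card reader, as shown in the figure below: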
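
To illustrate what each saved file contains, the sketch below builds the label.N.wav file name and the standard 44-byte PCM WAV header that precedes the samples. The helper names are hypothetical, and the 16 kHz, 16-bit, mono format is an assumption based on SAMPLE_RATE in the wake-word sketch; the capture_audio_data project's actual settings are not shown in this section.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Build "<label>.<index>.wav", e.g. hgx.1.wav (naming taken from the text above).
std::string sampleFileName(const std::string& label, unsigned index) {
  return label + "." + std::to_string(index) + ".wav";
}

// Append a 4-character chunk tag such as "RIFF".
static void putTag(std::vector<uint8_t>& h, const char* tag) {
  for (int i = 0; i < 4; i++) h.push_back(static_cast<uint8_t>(tag[i]));
}

// Append `bytes` bytes of `value` in little-endian order.
static void putLE(std::vector<uint8_t>& h, uint32_t value, int bytes) {
  for (int i = 0; i < bytes; i++) {
    h.push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFF));
  }
}

// Standard 44-byte header for uncompressed PCM WAV data.
std::vector<uint8_t> makeWavHeader(uint32_t sampleRate, uint16_t bitsPerSample,
                                   uint16_t channels, uint32_t dataBytes) {
  const uint32_t blockAlign = channels * (bitsPerSample / 8u);
  std::vector<uint8_t> h;
  putTag(h, "RIFF");
  putLE(h, 36 + dataBytes, 4);           // file size minus the first 8 bytes
  putTag(h, "WAVE");
  putTag(h, "fmt ");
  putLE(h, 16, 4);                       // fmt chunk size for PCM
  putLE(h, 1, 2);                        // audio format 1 = PCM
  putLE(h, channels, 2);
  putLE(h, sampleRate, 4);
  putLE(h, sampleRate * blockAlign, 4);  // byte rate
  putLE(h, blockAlign, 2);
  putLE(h, bitsPerSample, 2);
  putTag(h, "data");
  putLE(h, dataBytes, 4);                // size of the sample data that follows
  return h;
}
```

Under these assumptions, a 10-second recording holds 16000 × 2 × 10 = 320000 bytes of sample data, so makeWavHeader(16000, 16, 1, 320000) describes one full clip.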