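Edge Computing Experiments 02

1 Audio analysis and processing

1.1 Speech recognition

(1) Whisper model (CPU)

Introduction

Whisper is an automatic speech recognition (ASR) system released by OpenAI.

The official Whisper implementation is built on PyTorch and is usually run on a GPU. whisper.cpp is a community port of the whole project to C/C++ that makes CPU-only deployment practical. It has no external dependencies and low memory usage, and it supports Mac, Windows, Linux, iOS, and Android.

C++ version: https://github.com/ggerganov/whisper.cpp

Usage

(1) Clone the source code from GitHub:

git clone https://github.com/ggerganov/whisper.cpp

(2) Download the model weights (the download script lives in the repository's models directory; run this from the project root):

bash ./models/download-ggml-model.sh base

The argument base can be replaced with base.en, tiny, tiny.en, small, small.en, medium, medium.en, or large. Names with the .en suffix are English-only models; names without it are multilingual models. This experiment uses the small and medium models.

Disk space and memory required by each model: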

| Model | Disk | Mem |
|--------|------------|----------|
| tiny | 75 MiB | ~273 MB |
| base | 142 MiB | ~388 MB |
| small | 466 MiB | ~852 MB |
| medium | 1.5 GiB | ~2.1 GB |
| large | 2.9 GiB | ~3.9 GB |
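
On a memory-constrained edge device, the table can be turned into a simple selection rule. The sketch below is only an illustration: the helper name largest_model_that_fits is hypothetical, and the memory figures are the approximate values from the table (with 2.1 GB and 3.9 GB rounded to 2100 MB and 3900 MB).

```python
# Approximate memory needs in MB, copied from the table above.
MODEL_MEMORY_MB = {
    "tiny": 273,
    "base": 388,
    "small": 852,
    "medium": 2100,
    "large": 3900,
}


def largest_model_that_fits(available_mb):
    """Return the largest model whose approximate memory need fits, or None."""
    fitting = [name for name, need in MODEL_MEMORY_MB.items() if need <= available_mb]
    if not fitting:
        return None
    return max(fitting, key=MODEL_MEMORY_MB.get)


print(largest_model_that_fits(1024))  # a device with about 1 GB free
```

Larger models are slower as well as bigger, so the memory fit is a necessary condition rather than the whole decision.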

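Note: the download script fetches the weights from Hugging Face. Because that download can be slow or unreliable on some networks, it is advisable to download the required model files in advance and place them in the project's models directory.

Weights on Hugging Face: https://huggingface.co/ggerganov/whisper.cpp/tree/main

(3) Build

Run make in the project root. This produces the executable main, which performs the speech-to-text transcription.

(4) Convert the audio to 16 kHz

The input must be a 16 kHz .wav file. If a recording uses a different sample rate, convert it with ffmpeg. Install ffmpeg with:

sudo apt install ffmpeg

Verify the installation:

ffmpeg -version

Convert a file to a WAV file with a 16000 Hz sample rate:

ffmpeg -i input.wav -vn -acodec pcm_s16le -ar 16000 output.wav

-i: the input audio file
-vn: ignore any video stream in the input
-acodec pcm_s16le: encode the audio as 16-bit little-endian PCM
-ar 16000: set the output sample rate to 16000 Hz
output.wav: the output file, sampled at 16000 Hz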
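
Before converting, it can help to check whether a file already has the required sample rate. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the ffmpeg arguments simply mirror the command shown above.

```python
import wave


def is_16khz_wav(path):
    """Return True if the file is a readable PCM WAV sampled at 16000 Hz."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getframerate() == 16000
    except (wave.Error, EOFError):
        # Not a PCM WAV file that the wave module can parse.
        return False


def ffmpeg_convert_command(src, dst):
    """Build the argument list for the ffmpeg conversion shown above."""
    return ["ffmpeg", "-i", src, "-vn", "-acodec", "pcm_s16le", "-ar", "16000", dst]
```

The list returned by ffmpeg_convert_command can be passed to subprocess.run when is_16khz_wav returns False.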

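(5) Transcribe

Run the following command to perform speech recognition (-l zh selects Chinese, -m selects the model file):

./main -l zh -m models/ggml-small.bin ./speech_data/test.wav

(6) Recognition results

Reference text: 近几年,不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。

Result (small model): 近几年,不但我用书给女儿压碎,也劝说亲朋不要给女儿压碎钱,而改送压碎书。

Time: 146 s

Result (medium model): 近几年,不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。

Time: 484 s

The small model wrote the homophone 压碎 in place of 压岁 in all three occurrences; the medium model's output matches the reference.

Reference text: 你一会儿是断保险去吗

Result (small model): 你一会儿是断保险去吗

Time: 126 s

Result (medium model): (not recorded)

Time: 420 s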
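
To compare models with a number rather than by eye, one common measure is the character error rate (CER): the edit distance between the reference and the recognized text, divided by the reference length. The sketch below is not part of whisper.cpp; the function names are illustrative, and the two strings are the first sample above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


reference = "近几年,不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。"
small_output = "近几年,不但我用书给女儿压碎,也劝说亲朋不要给女儿压碎钱,而改送压碎书。"

print(edit_distance(reference, small_output))  # prints 3 (three substituted characters)
print(character_error_rate(reference, small_output))
```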

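Optional parameters

-h, --help: show the help message and exit.
-t N, --threads N: number of threads to use.
-p N, --processors N: number of processors to use.
-ot N, --offset-t N: time offset into the audio, in milliseconds.
-d N, --duration N: duration of audio to process, in milliseconds.
-mc N, --max-context N: maximum number of text context tokens to store.
-ml N, --max-len N: maximum segment length, in characters.
-sow, --split-on-word: split segments on words rather than on tokens.
-bo N, --best-of N: number of best candidates to keep.
-tr, --translate: translate from the source language to English.
-di, --diarize: perform speaker diarization on stereo audio.
-otxt, --output-txt: write the result to a text file.
-ovtt, --output-vtt: write the result to a VTT file.
-osrt, --output-srt: write the result to an SRT file.
-olrc, --output-lrc: write the result to an LRC file.
-ocsv, --output-csv: write the result to a CSV file.
-oj, --output-json: write the result to a JSON file.
-of FNAME, --output-file FNAME: output file path (without the file extension).
-pp, --print-progress: print progress.
-nt, --no-timestamps: do not print timestamps.
-l LANG, --language LANG: spoken language ('auto' for automatic detection).
-dl, --detect-language: exit after automatically detecting the language.
--prompt PROMPT: initial prompt.
-m FNAME, --model FNAME: model path.
-f FNAME, --file FNAME: input WAV file path.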
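
When many recordings need to be transcribed, the options above can be assembled from a script. This is a minimal sketch, assuming the main binary built earlier sits in the current directory; whisper_command and transcribe are hypothetical helper names, and only flags from the list above are used.

```python
import subprocess


def whisper_command(wav_path, model_path, language="zh", threads=None,
                    output_txt=False, binary="./main"):
    """Build the whisper.cpp argument list from a few of the options above."""
    cmd = [binary, "-l", language, "-m", model_path]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if output_txt:
        cmd.append("-otxt")
    cmd += ["-f", wav_path]
    return cmd


def transcribe(wav_path, model_path, **options):
    """Run whisper.cpp and return whatever it prints to standard output."""
    result = subprocess.run(whisper_command(wav_path, model_path, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, transcribe("./speech_data/test.wav", "models/ggml-small.bin", threads=4) corresponds to the command used in step (5) with -t 4 added.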

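(2) Whisper model (GPU, todo)

1.2 Speech denoising / background sound removal

(1) Denoiser model

Introduction

Denoiser is a speech denoising (speech enhancement) model released by Facebook Research.

Official repository: https://github.com/facebookresearch/denoiser?tab=readme-ov-file#4-denoising

Usage

(1) Clone the source code from GitHub:

git clone https://github.com/facebookresearch/denoiser

(2) Install the dependencies (use the requirements file that matches your hardware)

cd denoiser
pip install -r requirements.txt  # CPU version
pip install -r requirements_cuda.txt  # GPU version

(3) Denoising test

python3 -m denoiser.enhance --dns48 --noisy_dir=./noise --out_dir=./clean

where:

--dns48 selects the pretrained model (it is downloaded over the network; no offline download location was found)
--noisy_dir is the directory containing the noisy WAV files to process
--out_dir is the directory where the denoised audio files are written

Open the clean folder to inspect the denoised audio files.

Files with the suffix _enhanced are the denoised audio; files with the suffix _noisy are the original noisy audio.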
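
For a quick side-by-side review, the two kinds of output file can be paired by base name. The sketch below relies only on the naming convention just described and assumes the files keep the .wav extension; pair_denoiser_outputs is a hypothetical helper, not part of the denoiser package.

```python
from pathlib import Path


def pair_denoiser_outputs(out_dir):
    """Map each base name to its (noisy, enhanced) file paths in out_dir."""
    pairs = {}
    suffix = "_enhanced.wav"
    for enhanced in sorted(Path(out_dir).glob("*" + suffix)):
        base = enhanced.name[: -len(suffix)]
        noisy = enhanced.with_name(base + "_noisy.wav")
        if noisy.exists():  # skip enhanced files that have no noisy counterpart
            pairs[base] = (noisy, enhanced)
    return pairs
```

Calling pair_denoiser_outputs("./clean") after the test in step (3) would list every recording for which both versions exist.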

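(4) Experimental results

(2) DFSMN model (GPU, todo)

1.3 Text-to-speech (TTS)

(1) SummerTTS model

Introduction

SummerTTS is a standalone C++ speech synthesis project for Chinese and English. It runs locally without a network connection and has no additional dependencies; once built, it can synthesize Chinese and English speech.

Official repository: https://github.com/huakunyang/SummerTTS

Usage

(1) Clone the source code from GitHub:

git clone https://github.com/huakunyang/SummerTTS.git

(2) Install the dependencies

sudo apt-get install cmake

Verify the installation:

cmake --version

(3) Download the models

Download the models from the following Baidu Netdisk link and place them in the project's models directory: https://pan.baidu.com/s/1rYhtznOYQH7m8g-xZ_2VVQ?pwd=2d5h

With the model files in place, the models directory looks like this:
models/
├── single_speaker_mid.bin // medium-sized Chinese model; somewhat slower than the earlier models, but the audio quality seems slightly better
├── single_speaker_english.bin // English model
├── single_speaker_english_fast.bin // faster English model
└── single_speaker_fast.bin // Chinese model

(4) Build

Change to the Build directory and run:

cmake ..
make

When the build finishes, the tts_test executable is generated in the Build directory.

(5) Run the synthesis program

In the Build directory, run the following commands to test Chinese speech synthesis:

./tts_test ../test.txt ../models/single_speaker_fast.bin out.wav

./tts_test ../test01.txt ../models/multi_speakers.bin out_multi.wav

Run the following command to test English speech synthesis:

./tts_test ../test_eng.txt ../models/single_speaker_english.bin out_eng.wav

In these commands:

The first argument is the path of the text file containing the text to synthesize.

The second argument is the path of the model described above. The single or multi prefix in the filename indicates whether the model contains one speaker or several. The recommended single-speaker model is single_speaker_fast.bin, which synthesizes quickly with acceptable audio quality.

The third argument is the output audio file (written to the Build directory). It is generated when the program finishes and can be opened with any audio player.

(6) Synthesis results (the model currently synthesizes a single speaker only; switching speakers is not supported)

Input text: 今年是2023年，夏天还没有来，但很快会过去的。

Synthesized speech

Input text: No one can help others as much as you do. No one can express himself like you. No one can express what you want to convey. No one can comfort others in your own way. No one can be as understanding as you are. No one can feel happy, carefree, and no one can smile as much as you do. In a word, no one can show your features to anyone else.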

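(2) Coqui TTS model

Introduction

Coqui TTS is a deep-learning-based text-to-speech project aimed at low-resource, zero-shot synthesis. It can synthesize speech in 13 languages: Arabic, Brazilian Portuguese, Chinese, Czech, Dutch, English, French, German, Italian, Polish, Russian, Spanish, and Turkish.

Official repository: https://github.com/coqui-ai/TTS/

Usage

(1) Clone the source code from GitHub:

git clone https://github.com/coqui-ai/TTS

(2) Install the dependencies (run inside the cloned TTS directory)

cd TTS
pip install -e .[all,dev,notebooks]

Requires Python >= 3.9, < 3.12.

The installation failed with the error below, and the attempts made to fix it did not succeed: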

ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/mec/tts/TTS/setup.py'"'"'; __file__='"'"'/home/mec/tts/TTS/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
         cwd: /home/mec/tts/TTS/
    Complete output (67 lines):
    performance hint: TTS/tts/utils/monotonic_align/core.pyx:11:5: Exception check on 'maximum_path_each' will always require the GIL to be acquired.
    Possible solutions:
        1. Declare 'maximum_path_each' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on 'maximum_path_each' to allow an error code to be returned.
    performance hint: TTS/tts/utils/monotonic_align/core.pyx:42:6: Exception check on 'maximum_path_c' will always require the GIL to be acquired.
    Possible solutions:
        1. Declare 'maximum_path_c' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on 'maximum_path_c' to allow an error code to be returned.
    performance hint: TTS/tts/utils/monotonic_align/core.pyx:47:21: Exception check after calling 'maximum_path_each' will always require the GIL to be acquired.
    Possible solutions:
        1. Declare 'maximum_path_each' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
        2. Use an 'int' return type on 'maximum_path_each' to allow an error code to be returned.
    Compiling TTS/tts/utils/monotonic_align/core.pyx because it changed.
    [1/1] Cythonizing TTS/tts/utils/monotonic_align/core.pyx
    running develop
    /tmp/pip-build-env-yxevugwt/overlay/lib/python3.9/site-packages/setuptools/command/develop.py:39: EasyInstallDeprecationWarning: easy_install command is deprecated.
    !!

            ********************************************************************************
            Please avoid running ``setup.py`` and ``easy_install``.
            Instead, use pypa/build, pypa/installer or other
            standards-based tools.

            See https://github.com/pypa/setuptools/issues/917 for details.
            ********************************************************************************

    !!
      easy_install.initialize_options(self)
    /tmp/pip-build-env-yxevugwt/overlay/lib/python3.9/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
    !!

            ********************************************************************************
            Please avoid running ``setup.py`` directly.
            Instead, use pypa/build, pypa/installer or other
            standards-based tools.

            See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
            ********************************************************************************

    !!
      self.initialize_options()
    running egg_info
    writing TTS.egg-info/PKG-INFO
    writing dependency_links to TTS.egg-info/dependency_links.txt
    writing entry points to TTS.egg-info/entry_points.txt
    writing requirements to TTS.egg-info/requires.txt
    writing top-level names to TTS.egg-info/top_level.txt
    reading manifest file 'TTS.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    warning: no previously-included files matching '*' found under directory 'tests'
    no previously-included directories found matching 'tests*'
    adding license file 'LICENSE.txt'
    writing manifest file 'TTS.egg-info/SOURCES.txt'
    running build_ext
    building 'TTS.tts.utils.monotonic_align.core' extension
    creating build
    creating build/temp.linux-x86_64-cpython-39
    creating build/temp.linux-x86_64-cpython-39/TTS
    creating build/temp.linux-x86_64-cpython-39/TTS/tts
    creating build/temp.linux-x86_64-cpython-39/TTS/tts/utils
    creating build/temp.linux-x86_64-cpython-39/TTS/tts/utils/monotonic_align
    x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/local/lib/python3.9/dist-packages/numpy/core/include -I/usr/include/python3.9 -c TTS/tts/utils/monotonic_align/core.c -o build/temp.linux-x86_64-cpython-39/TTS/tts/utils/monotonic_align/core.o
    TTS/tts/utils/monotonic_align/core.c:29:10: fatal error: Python.h: No such file or directory
       29 | #include "Python.h"
          |          ^~~~~~~~~~
    compilation terminated.
    error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/mec/tts/TTS/setup.py'"'"'; __file__='"'"'/home/mec/tts/TTS/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

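The decisive line in the log is "fatal error: Python.h: No such file or directory": the C extension could not be compiled because the Python development headers for the interpreter in use (Python 3.9 here) were not found. Installing those headers (for example the python3.9-dev package on Debian/Ubuntu) would likely fix this, although that was not verified in this experiment. Because the source installation failed, the package was installed with pip instead:

pip install TTS

The pip installation succeeded.

(3) Synthesize speech

tts --text "知是行之始，行是知之成。" --out_path aaa.wav --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST

where:

--text: the input text

--out_path: the path and filename of the synthesized audio file

--model_name: the model used for synthesis (a Chinese model in the command above)

The specified model files are downloaded when the command runs.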
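
The same synthesis can also be driven from Python instead of the command line. The sketch below is based on the TTS.api.TTS class and its tts_to_file method as documented in the Coqui README; it was not run as part of this experiment, and the exact API may differ between package versions. The helper name synthesize is hypothetical.

```python
MODEL_NAME = "tts_models/zh-CN/baker/tacotron2-DDC-GST"


def synthesize(text, out_path, model_name=MODEL_NAME):
    """Synthesize text to a WAV file with the Coqui TTS Python API."""
    # Imported lazily so this module can be loaded without the TTS package.
    from TTS.api import TTS

    tts = TTS(model_name=model_name)  # downloads the model on first use
    tts.tts_to_file(text=text, file_path=out_path)
    return out_path
```

Calling synthesize("知是行之始，行是知之成。", "aaa.wav") is intended to match the tts command above.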

(4) Synthesis results

Input text: 知是行之始，行是知之成

Synthesized speech

1.4 Speech separation

(1) MossFormer model

Introduction

MossFormer is a speech separation model built on a gated single-head transformer architecture with convolution-augmented joint self-attentions.

Official page: https://www.modelscope.cn/models/damo/speech_mossformer2_separation_temporal_8k/summary

Usage

(1) Install the environment dependencies

# (1) Install speechbrain
#  With PyTorch >= 1.10, install the latest version
pip install speechbrain
# With PyTorch < 1.10 and >= 1.7, pin the following version
pip install speechbrain==0.5.12

# (2) Install libsndfile
# The model uses the third-party SoundFile library to process WAV files; on Linux its underlying libsndfile library must be installed manually
sudo apt-get install libsndfile1

# (3) Install modelscope and the related dependency rotary-embedding-torch
pip install modelscope
pip install rotary-embedding-torch

(2) Download the model files

git clone https://www.modelscope.cn/damo/speech_separation_mossformer_8k_pytorch.git

Note: the script in step (3) loads the model from ./speech_mossformer2_separation_temporal_8k, so the local model directory must match the path used in the script.

(3) Create the Python file

The model takes a mono WAV file sampled at 8000 Hz that contains two people talking over each other, and outputs two separated mono audio streams.

import numpy
import soundfile as sf
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# input_path points to the local audio file
input_path = './speech_mossformer2_separation_temporal_8k/examples/mix_speech1.wav'
# model points to the local model directory
separation = pipeline(
   Tasks.speech_separation,
   model='./speech_mossformer2_separation_temporal_8k')
result = separation(input_path)
# Each entry of output_pcm_list is the raw 16-bit PCM of one separated speaker
for i, signal in enumerate(result['output_pcm_list']):
    save_file = f'output_spk{i}.wav'
    sf.write(save_file, numpy.frombuffer(signal, dtype=numpy.int16), 8000)

(4) Run the script (saved here as test.py); the separated audio files are written to the same directory as the script

python3 test.py
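
Because the model expects 8000 Hz mono input and returns raw 16-bit PCM bytes, both ends can be checked with the standard library alone. The following is a rough sketch with hypothetical helper names; it is an alternative to the soundfile call in the script, not what ModelScope itself does.

```python
import wave


def wav_info(path):
    """Return (sample_rate, channels, frame_count) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getnframes()


def write_pcm16_wav(path, pcm_bytes, sample_rate=8000, channels=1):
    """Write raw little-endian 16-bit PCM bytes to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = int16
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
```

An input is suitable when the first two values returned by wav_info are 8000 and 1.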

1.5 Speaker verification (voiceprint recognition)

(1) CAM++ model

Introduction

CAM++ is a speaker verification model based on a densely connected time delay neural network. Compared with mainstream speaker verification models such as ResNet34 and ECAPA-TDNN, CAM++ is reported to achieve more accurate speaker recognition and faster inference.

Official pages:

https://www.modelscope.cn/models/damo/speech_campplus_sv_zh-cn_16k-common/summary

https://github.com/alibaba-damo-academy/3D-Speaker/tree/main

Usage

(1) Install the dependencies

# (1) Install libsndfile
sudo apt-get install libsndfile1

# (2) Install modelscope and the related dependency rotary-embedding-torch
pip install modelscope
pip install rotary-embedding-torch

(2) Download the model files

git clone https://www.modelscope.cn/damo/speech_campplus_sv_zh-cn_16k-common.git

(3) Create the Python file

The model takes WAV files sampled at 16000 Hz.

# -*- coding: utf-8 -*-
from modelscope.pipelines import pipeline
sv_pipeline = pipeline(
    task='speaker-verification',
    model='./speech_campplus_sv_zh-cn_16k-common',
    model_revision='v1.0.0'
)
speaker1_a_wav = './data/speaker1_a_cn_16k.wav'
speaker1_b_wav = './data/speaker1_b_cn_16k.wav'
speaker2_a_wav = './data/speaker2_a_cn_16k.wav'

speaker3_a_wav = './data/00.wav'
speaker4_a_wav = './data/01.wav'

result = sv_pipeline([speaker1_a_wav, speaker1_b_wav])
print(result)

result = sv_pipeline([speaker1_a_wav, speaker2_a_wav])
print(result)
# thr: custom score threshold; the higher the threshold, the stricter the decision
# (two recordings are judged to be the same speaker only when the score exceeds thr)
result = sv_pipeline([speaker1_a_wav, speaker2_a_wav], thr=0.31)
print(result)

result = sv_pipeline([speaker3_a_wav, speaker4_a_wav], thr=0.52)
print(result)

Output

The results indicate that:

speaker1_a_wav and speaker1_b_wav are the same speaker,

speaker1_a_wav and speaker2_a_wav are not the same speaker,

speaker1_a_wav and speaker2_a_wav are still not the same speaker with thr=0.31,

speaker3_a_wav and speaker4_a_wav are not the same speaker.
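
The role of thr can be illustrated with a toy example. The sketch below assumes that the score is the cosine similarity between two speaker embeddings, which is a common design for verification systems but is an assumption here, not something confirmed by the pipeline's output; the vectors are made up and far shorter than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def same_speaker(emb_a, emb_b, thr=0.31):
    """Accept the pair as one speaker only when the score exceeds thr."""
    return cosine_similarity(emb_a, emb_b) > thr


print(same_speaker([0.9, 0.1, 0.2], [0.8, 0.2, 0.1]))   # similar direction
print(same_speaker([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # orthogonal vectors
```

Raising thr trades false acceptances for false rejections, which is why the script tries different values for different recording pairs.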

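(2) asv-subtools model (GPU, todo)

Official repository: https://github.com/Snowdar/asv-subtools

2 Video analysis and processing

2.1 Object detection + image processing (face recognition)

(1) face_recognition model

Introduction

face_recognition is a powerful, simple, and easy-to-use open-source face recognition project. It comes with complete documentation and application examples, and it is compatible with Raspberry Pi.

Official documentation: https://github.com/ageitgey/face_recognition/blob/master/README_Simplified_Chinese.md

Installation and configuration

(1) Requirements

Python 3.3+ or Python 2.7
macOS or Linux

(2) Create a virtual environment

A virtual environment is generally recommended for Python projects, because it keeps the dependencies of different projects separate.

python3 -m venv myenv  # myenv is the name of the virtual environment
source myenv/bin/activate  # activate and enter the virtual environment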

(3) Install dlib

face_recognition depends on dlib. Running the install command (pip3 install face_recognition) directly produced a variety of errors, all of which turned out to be caused by dlib. It is therefore advisable to build and install dlib from source before running the install command.

a) Configure gcc and g++

Locate gcc and g++:

which gcc
which g++

Point CC and CXX at the paths that were found:

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

b) Install dlib manually

Clone the code:

git clone https://github.com/davisking/dlib.git

Build:

cd dlib
mkdir build; cd build; cmake ..; cmake --build .

Install dlib:

cd ..
python3 setup.py install

If the error output contains the following:

CMake Error in CMakeLists.txt:
        Imported target "pybind11::module" includes non-existent path

          "/usr/include/python3.9"

        in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:

        * The path was deleted, renamed, or moved to another location.

        * An install or uninstall procedure did not complete successfully.

        * The installation package was faulty and references files it does not
        provide.

then CMake could not find the Python header files while building dlib.

First check where the Python headers are located:

find /usr -name Python.h

In this case the headers were found under /usr/include/python3.8, not under /usr/include/python3.9 where CMake was looking.

Install the Python 3.9 development package:

sudo apt-get install python3.9-dev

Check the Python headers again; the path /usr/include/python3.9 now exists.

Re-run python3 setup.py install, and the build and installation succeed.

(4) Install face_recognition

pip3 install face_recognition
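
The header check from the dlib step can also be done from inside the interpreter that will be used for the build, which avoids confusing one Python version's headers with another's. A minimal sketch using the standard sysconfig module (the function name is hypothetical):

```python
import os
import sysconfig


def python_header_status():
    """Return (include_dir, has_header) for the running interpreter."""
    include_dir = sysconfig.get_paths()["include"]
    has_header = os.path.isfile(os.path.join(include_dir, "Python.h"))
    return include_dir, has_header


include_dir, has_header = python_header_status()
print(include_dir, "Python.h found" if has_header else "Python.h missing")
```

If the script reports that Python.h is missing, installing the matching development package (python3.9-dev in the case above) is the likely remedy.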

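Usage

(1) Face recognition

Command-line usage

The face_recognition command-line tool

The face_recognition command-line tool identifies whose face appears in a single image or in a folder of images.

First, prepare a folder of photos of people whose names you already know, one image per person, with each file named after the person it shows. Then prepare a second folder containing the images you want to identify.

In a terminal, change to the directory that contains both folders and run face_recognition with the two folders as arguments. It prints the name of the person in each unknown image:

face_recognition ./pictures_of_people_i_know/ ./unknown_pictures/

The test below uses photos of Wang Baoqiang (王宝强) and Zheng Kai (郑凯).

Images in the i_know folder

Images in the unknow folder

Output (incorrect matches are marked "wrong" in the comments):

./unknow/unknow11.jpg,zhengkai    # wrong: actually Wang Kai (王凯); expected unknown_person
./unknow/unknow12.jpg,baoqiang  # wrong: actually Zheng Kai; expected zhengkai
./unknow/unknow12.jpg,zhengkai  # correct: Zheng Kai
./unknow/unknow10.jpg,baoqiang  # wrong: actually Zheng Kai; expected zhengkai
./unknow/unknow10.jpg,zhengkai  # correct: Zheng Kai
./unknow/unknow05.jpg,baoqiang  # correct: Baoqiang
./unknow/unknow05.jpg,zhengkai  # wrong: actually Baoqiang; expected baoqiang

If one face produces more than one result, the person looks too similar to someone else in the known set (the project's accuracy on children and Asian faces leaves room for improvement). Lowering the tolerance makes matching stricter; it is set with the --tolerance parameter, whose default is 0.6.

A lower tolerance gives stricter and more precise matching, but a value that is too low will reject genuine matches. The recognition was repeated with a tolerance of 0.45:

face_recognition --tolerance 0.45  ./i_know/ ./unknow/

With the adjusted tolerance, all four test images are identified correctly:

./unknow/unknow11.jpg,unknown_person
./unknow/unknow12.jpg,zhengkai
./unknow/unknow10.jpg,zhengkai
./unknow/unknow05.jpg,baoqiang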
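
Why a lower tolerance removes the duplicate matches can be shown with a toy model. The library compares face encodings by Euclidean distance and treats a distance at or below the tolerance as a match; the sketch below imitates that rule with made-up 2-dimensional vectors (real encodings have 128 dimensions), and the helper names are hypothetical.

```python
import math


def face_distance(enc_a, enc_b):
    """Euclidean distance between two face encodings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(enc_a, enc_b)))


def match_names(known, unknown_encoding, tolerance=0.6):
    """Return every known name within tolerance, or unknown_person if none."""
    names = [name for name, enc in known.items()
             if face_distance(enc, unknown_encoding) <= tolerance]
    return names or ["unknown_person"]


# Toy encodings: the probe is very close to baoqiang and moderately close to zhengkai.
known = {"baoqiang": [0.0, 0.0], "zhengkai": [0.5, 0.0]}
probe = [0.02, 0.0]

print(match_names(known, probe, tolerance=0.6))   # both names match
print(match_names(known, probe, tolerance=0.45))  # only the closer one remains
```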

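The face_detection command-line tool

The face_detection command-line tool locates faces in a single image or in a folder of images and prints their pixel coordinates.

Run face_detection on the command line with an image folder or a single image file as the argument:

face_detection  ./folder_with_pictures/

Example output:

examples/image1.jpg,65,215,169,112
examples/image2.jpg,62,394,211,244
examples/image2.jpg,95,941,244,792

Each output line corresponds to one face in an image; the four numbers are the top, right, bottom, and left pixel coordinates of that face.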
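
The top, right, bottom, left order is easy to mix up with the more common x, y, width, height convention, so a small parser can help when the output feeds another tool. A minimal sketch, with a hypothetical function name, applied to the first output line above:

```python
def parse_face_detection_line(line):
    """Parse 'path,top,right,bottom,left' into a dict with width and height."""
    # rsplit keeps any commas inside the file path intact.
    path, top, right, bottom, left = line.strip().rsplit(",", 4)
    top, right, bottom, left = int(top), int(right), int(bottom), int(left)
    return {
        "path": path,
        "top": top, "right": right, "bottom": bottom, "left": left,
        "width": right - left,
        "height": bottom - top,
    }


box = parse_face_detection_line("examples/image1.jpg,65,215,169,112")
print(box["width"], box["height"])  # prints 103 104
```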

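(2) Recognizing faces in a video file (failed)

Code:

https://github.com/ageitgey/face_recognition/blob/master/examples/facerec_from_video_file.py

Running the code consistently raised errors.

Several approaches were tried, but none resolved them.

(3) Drawing boxes and names on the image after recognition (failed)

Code: https://github.com/ageitgey/face_recognition/blob/master/examples/identify_and_draw_boxes_on_faces.py

Running the code consistently raised errors.

Several approaches were tried, but none resolved them.

(2) MindFace model (GPU, todo)

Official page: https://gitee.com/mindspore-lab/mindface#/mindspore-lab/mindface/blob/main/tutorials/detection/infer.md

(3) ArcFace model (failed)

Introduction

Official page: https://www.modelscope.cn/models/damo/cv_ir50_face-recognition_arcface/summary

Usage

(1) Install the dependencies

pip install opencv-python
pip install scikit-image
pip install mmcv-full
pip install torchvision
pip install mmdet
pip install mmengine

Running the test program reported that the installed mmcv-full version was incompatible.

Several approaches were tried, but none resolved the problem.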

2.2 Specific object recognition

Recognize specific objects, such as a particular person's face, or classes of objects such as people, knives, guns, and fire.

(1) fire-and-gun-detection

Introduction

This project is a YOLOv3-based detection model for guns and fire, developed by its author; it supports both images and video.

Official repository: https://github.com/atulyakumar97/fire-and-gun-detection?tab=readme-ov-file

Usage (requires OpenCV; install it beforehand)

(1) Clone the code

git clone https://github.com/atulyakumar97/fire-and-gun-detection.git

(2) Download the model weights and place the file in the project directory

https://onedrive.live.com/?authkey=%21AGDaftEjlDj9k6o&id=E9C1B3533D4253D%213525&cid=0E9C1B3533D4253D&parId=root&parQt=sharedby&parCid=C16D31CE07059237&o=OneUp

(3) Run the program (it currently recognizes three classes: Gun, Fire, and Rifle)

python3 yolo.py --play_video True --video_path videos/fire1.mp4

Parameters

usage: yolo.py [-h] [--webcam WEBCAM] [--play_video PLAY_VIDEO]
               [--image IMAGE] [--video_path VIDEO_PATH]
               [--image_path IMAGE_PATH] [--verbose VERBOSE]

optional arguments:
  -h, --help            show this help message and exit
  --play_video PLAY_VIDEO  Tue/False // whether to play the video (show the detection results)
  --image IMAGE         Tue/False // whether the input is an image
  --video_path VIDEO_PATH  Path of video file  // video path
  --image_path IMAGE_PATH  Path of image to detect objects  // image path

(4) Detection results

Gun detection + fire detection (image)

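2.3 Specific behavior recognition

Recognize specific behaviors, such as falls and boundary intrusion.

(1) falldetection_openpifpaf (fall detection, GPU, todo)

Official repository: https://github.com/cwlroda/falldetection_openpifpaf

(2) HumanFallDetection

Introduction

This project uses an LSTM (long short-term memory network) to detect human falls.

Official repository: https://github.com/taufeeque9/HumanFallDetection

Usage

(1) Clone the code

git clone https://github.com/taufeeque9/HumanFallDetection.git

(2) Enter the project directory and install the dependencies

pip install -r requirements.txt

(3) Run fall detection

python3 fall_detector.py  --video ./video.mp4 --save_output ./dection.mp4 --disable_cuda True

Parameters

num_cams  # number of cameras/videos to process; default 1
video   # path of the input video; default None
save_output  # save the detection result to a file
disable_cuda # disable CUDA; default false

The overlay on the output video shows:

Avg FPS: the average frame rate in frames per second. Here it indicates how fast the system is processing and analyzing the video, that is, an average of x frames per second.
Frame: the index of the frame currently being processed. For example, "Frame: 2" means the second frame of the video.
Pred: the prediction for the current frame. "Normal" means the scene is considered normal, with no fall or abnormal event detected. "FALL Warning" is a fall warning, and "FALL" means a fall has occurred.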

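(3) stcae_pids (boundary intrusion detection, GPU, todo)

Official repository: https://github.com/devashishlohani/stcae_pids

(4) Boundary intrusion detection

Source (blog post): https://blog.csdn.net/People1007/article/details/122428987

Introduction

This is boundary intrusion detection code written by the blog's author on top of the YOLOv3 model.

Usage

(1) Create a main.py file and copy the code from the link above into it

(2) Get the weights file and the video file

Weights file: https://pan.baidu.com/s/1wvHFpRkpuL4uuHWw707n6Q extraction code: 3kja

Video file: https://pan.baidu.com/s/15FZAnQOHhxACHJEGjGItDw extraction code: alz8

(3) Put the weights file and the video file in the same directory as main.py

(4) Run the program

python3 main.py

(5) Get the detection results

Note: the source code from the blog was modified. The original code shows the detection results while playing the video; because the server cannot display video, the code was changed to write the processed video directly to a file. The modified code is as follows: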

# -*- coding: utf-8 -*-

import cv2 as cv
import numpy as np
from matplotlib.path import Path

# Alarm zones. Each Path is the quadrilateral obtained by joining its four
# vertices in order, so the vertices must be listed around the perimeter.
# (The last two vertices of the first zone are swapped relative to the original
# listing, which would otherwise trace a self-intersecting shape.)
# The blog author refers to another of their posts, 《python获取图像中像素点坐标》,
# for how to pick the zone coordinates.
ZONE_M = Path([(0, 338), (869, 333), (870, 425), (0, 443)])
ZONE_N = Path([(968, 319), (1522, 341), (1521, 530), (958, 469)])

CONFIDENCE_THRE = 0.5  # confidence threshold: keep boxes whose score exceeds this value
NMS_THRE = 0.3  # non-maximum suppression threshold


def load_model():
    """Load the class labels and the YOLOv3 network once, not once per frame."""
    with open('./coco.names') as f:
        labels = f.read().strip().split("\n")  # class label file
    net = cv.dnn.readNetFromDarknet('./yolov3.cfg', './yolov3.weights')  # model config and weights
    layer_names = net.getLayerNames()
    # getUnconnectedOutLayers() returns an Nx1 array in older OpenCV builds and
    # a flat array in newer ones; flatten() handles both.
    out_ids = np.array(net.getUnconnectedOutLayers()).flatten()
    out_layers = [layer_names[i - 1] for i in out_ids]  # names of the YOLO output layers
    return net, out_layers, labels


def yolo_detect(img, net, out_layers, labels):
    (H, W) = img.shape[:2]  # image dimensions
    # Build a blob from the image, resize it, and run one forward pass of YOLO
    # to obtain the bounding boxes and their probabilities.
    blob = cv.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    layer_outputs = net.forward(out_layers)

    boxes = []  # bounding boxes, confidences, and class IDs
    confidences = []
    class_ids = []

    for output in layer_outputs:  # iterate over the output layers (three in total)
        for detection in output:  # iterate over each detection
            scores = detection[5:]  # per-class scores
            class_id = int(np.argmax(scores))
            confidence = scores[class_id]
            if confidence > CONFIDENCE_THRE:  # keep only sufficiently confident boxes
                # Scale the box back to the original image size; the network
                # returns the box center followed by its width and height.
                box = detection[0:4] * np.array([W, H, W, H])
                (centerX, centerY, width, height) = box.astype("int")
                x = int(centerX - (width / 2))
                y = int(centerY - (height / 2))
                boxes.append([x, y, int(width), int(height)])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    # Non-maximum suppression removes weak, overlapping boxes
    idxs = cv.dnn.NMSBoxes(boxes, confidences, CONFIDENCE_THRE, NMS_THRE)
    if len(idxs) > 0:  # make sure at least one box remains
        for i in np.array(idxs).flatten():  # iterate over the remaining boxes
            color = (255, 0, 0)
            (x, y) = (boxes[i][0], boxes[i][1])
            (w, h) = (boxes[i][2], boxes[i][3])
            center = (int(x + w / 2), int(y + h / 2))
            # Alarm condition: a person whose box center lies inside either zone.
            # Path.contains_point((x, y)) tests whether the point is inside the zone.
            if (ZONE_M.contains_point(center) or ZONE_N.contains_point(center)) \
                    and labels[class_ids[i]] == 'person':
                color = (0, 0, 255)
                cv.putText(img, "Catch the thief!", (680, 425), cv.FONT_HERSHEY_COMPLEX, 2.0, (0, 0, 255), 5)  # alarm message
            cv.rectangle(img, (x, y), (x + w, y + h), color, 2)  # draw the box, then the class label and confidence
            text = '{}: {:.3f}'.format(labels[class_ids[i]], confidences[i])
            (text_w, text_h), baseline = cv.getTextSize(text, cv.FONT_HERSHEY_SIMPLEX, 0.5, 2)
            cv.rectangle(img, (x, y - text_h - baseline), (x + text_w, y), color, -1)
            cv.putText(img, text, (x, y - 5), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 2)
    return img


def main():
    cap = cv.VideoCapture('./video.mp4')  # change this to your video path

    frame_width = int(cap.get(cv.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv.CAP_PROP_FPS) or 30  # fall back to 30 if the source reports no frame rate

    # Define the codec and create the VideoWriter that saves the result.
    # XVID is used here; another codec such as MP4V can be chosen if needed.
    out = cv.VideoWriter('output.avi', cv.VideoWriter_fourcc(*'XVID'), fps, (frame_width, frame_height))

    net, out_layers, labels = load_model()

    while cap.isOpened():
        success, frame = cap.read()
        if not success:
            break
        out.write(yolo_detect(frame, net, out_layers, labels))  # write the processed frame to the output file

    # Release resources. The original blog's interactive display loop
    # (cv.imshow / cv.waitKey) was removed because the server has no display.
    cap.release()
    out.release()


if __name__ == "__main__":
    main()
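
The alarm condition in the script relies on matplotlib's Path.contains_point. For readers who want to see what that test does, here is a dependency-free sketch of the classic ray-casting point-in-polygon check. It is an illustration of the idea, not matplotlib's implementation, and the zone vertices below are the first alarm zone listed in perimeter order, which is an assumption about the intended shape.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means inside
    return inside


def box_center(x, y, w, h):
    """Center of a box given its top-left corner, width, and height."""
    return (x + w / 2, y + h / 2)


zone = [(0, 338), (869, 333), (870, 425), (0, 443)]
print(point_in_polygon(box_center(380, 360, 40, 40), zone))  # center (400, 380), inside the zone
print(point_in_polygon((400, 100), zone))                    # well above the zone
```

Testing the box center rather than the whole box keeps the check cheap, at the cost of missing a person who only partly overlaps the zone.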

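Edge Computing Experiments 02

1 Audio analysis and processing

1.1 Speech recognition

(1) Whisper model (CPU)

(2) Whisper model (GPU, todo)

1.2 Speech denoising / background sound removal

(1) Denoiser model

(2) DFSMN model (GPU, todo)

1.3 Text-to-speech (TTS)

(1) SummerTTS model

(2) Coqui TTS model

(3) Bark model (GPU, todo)

(4) SpeechBrain model (GPU, todo)

(5) Massively Multilingual Speech model (GPU, todo)

1.4 Speech separation

(1) MossFormer model

1.5 Speaker verification (voiceprint recognition)

(1) CAM++ model

(2) asv-subtools model (GPU, todo)

2 Video analysis and processing

2.1 Object detection + image processing (face recognition)

(1) face_recognition model

(2) MindFace model (GPU, todo)

(3) ArcFace model (failed)

2.2 Specific object recognition

(1) fire-and-gun-detection

2.3 Specific behavior recognition

(1) falldetection_openpifpaf (fall detection, GPU, todo)

(2) HumanFallDetection

(3) stcae_pids (boundary intrusion detection, GPU, todo)

(4) Boundary intrusion detection

3 Natural language analysis and processing

3.1 Offline private large language models

(1) ChatGLM model (GPU, todo)

(2) Llama2 model (GPU, todo)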