Baby Cry Recognition: A Tutorial on Algorithm Deployment
Author: Jet    Date: 2023/03
[TOC]
$ Pipeline
- Step 1: Continuously monitor the audio input; when the volume exceeds a threshold, record audio for a fixed duration
- Step 2: Apply signal processing to the recorded audio and extract a set of features
- Step 3: Feed the features into a classifier for binary classification
- Step 4: Decide whether to raise an alert based on the classification result
- Step 5: Repeat the steps above
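A minimal Python sketch of this loop, purely illustrative: the `read_chunk`, `record_audio`, `extract_features`, and `classifier` names are placeholders that do not appear in the original project, and the threshold and recording length are assumed values.

```python
import numpy as np

VOLUME_THRESHOLD = 0.02  # assumed RMS trigger level
RECORD_SECONDS = 5       # assumed length of each recording window


def monitoring_loop(read_chunk, record_audio, extract_features, classifier):
    """Steps 1-5: monitor -> record -> extract features -> classify -> alert, in a loop."""
    while True:
        chunk = read_chunk()                        # short frame from the microphone
        volume = np.sqrt(np.mean(chunk ** 2))       # RMS volume of the frame
        if volume > VOLUME_THRESHOLD:               # Step 1: volume above threshold
            audio = record_audio(RECORD_SECONDS)    # record a fixed-length clip
            feats = extract_features(audio)         # Step 2: signal processing + features
            pred = classifier.predict([feats])[0]   # Step 3: binary classification
            if pred == 0:                           # Step 4: class 0 = baby cry -> alert
                print("ALERT: baby crying detected")
        # Step 5: loop back and keep monitoring
```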
$$ Methodology
- Input stage
    - Digital speech signal: discrete
    - Formats: .wav / .mp3 / .amr / .m4a / .flac / .aac
- Intermediate stage
    - Feature extraction
        - The short-time Fourier transform (STFT) spectrogram, the Mel-spectrogram, and the MFCC obtained from it via the DCT
    - Feature selection
        - Zero Crossing Rate
        - Spectral Centroid
        - Spectral Roll-off
        - Mel-frequency cepstral coefficients (MFCC)
        - Chroma Frequencies
        - ...
- Output stage
    - 1 Scoring
        - Compute a score from the features and compare it with a threshold (a sketch follows this list)
    - 2 Classifier
        - Machine learning: SVM / Random forest
        - Deep Learning: 1D Conv
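A hedged sketch of the threshold-scoring idea: the document does not give the exact scoring formula, so the weighted-sum combination, the weights, and the threshold below are all illustrative assumptions.

```python
import numpy as np


def score_and_decide(features, weights, threshold):
    """Scoring path: combine the extracted features into a single score and
    compare it with a threshold instead of running a learned classifier.
    `weights` and `threshold` would have to be tuned by hand."""
    features = np.asarray(features, dtype=float)
    score = float(np.dot(weights, features))
    return score > threshold  # True -> treat as crying, raise the alert


# Illustrative usage with made-up numbers:
# is_cry = score_and_decide([0.08, 3.7, 0.003], weights=[1.0, 0.1, 50.0], threshold=1.5)
```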
$$$ Step 1 feature extraction
Key words
- Amplitude --- perceived as loudness
- Frequency --- perceived as pitch
- Sample rate --- how many samples of the sound are taken per second; a sample rate of 22000 Hz means 22000 samples are taken each second
- Bit depth --- the quality (resolution) of each recorded sample, much like pixel depth in an image, so 24-bit audio is of better quality than 16-bit
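To make these terms concrete, a small librosa sketch (the file name is a placeholder, not a file from the original project):

```python
import librosa
import numpy as np

# Load an audio clip; sr=None keeps the file's native sample rate.
audio, sr = librosa.load("baby_cry_example.wav", sr=None)  # placeholder file name
print(f"sample rate: {sr} Hz")               # e.g. 44100 samples per second
print(f"duration: {len(audio) / sr:.2f} s")  # number of samples / sample rate
print(f"peak amplitude: {np.max(np.abs(audio)):.3f}")
```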
STFT
- Running the fast Fourier transform frame by frame is known as the short-time Fourier transform (STFT).
MFCC
- The extraction pipeline is: pre-emphasis, framing, windowing, fast Fourier transform (FFT), Mel filter bank filtering, taking the logarithm, and the discrete cosine transform (DCT), as sketched below.
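A minimal librosa sketch of these transforms, reusing the `audio` and `sr` variables from the loading example above; the frame sizes are illustrative.

```python
import numpy as np
import librosa

# STFT: frame-by-frame FFT -> complex spectrogram, then magnitude in dB
stft = librosa.stft(audio, n_fft=512, hop_length=256)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft))

# Mel-spectrogram: magnitude spectrogram mapped onto the Mel filter bank
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=512, hop_length=256)

# MFCC: log-Mel spectrogram followed by a DCT, keeping the first 13 coefficients
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
print(spectrogram_db.shape, mel.shape, mfcc.shape)  # (freq_bins, frames), (n_mels, frames), (13, frames)
```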
Feature selection
- 9 feature types in total: 8 short-time/STFT-based features + 1 MFCC
    - zero_crossing_rate (zero crossing rate)
    - ste (short-time energy)
    - ste_acc (mean of the positive short-time energy increments)
    - stzcr (short-time zero crossing rate)
    - spectral_centroid (spectral centroid)
    - spectral_bandwidth (spectral bandwidth)
    - spectral_rolloff (spectral roll-off)
    - spectral_flatness (spectral flatness)
    - ==mfcc (Mel-frequency cepstral coefficients)==
```python
import numpy as np
import librosa.feature as lrf
import scipy.signal as scisig


class AudioUtils:
    @staticmethod
    def _sgn(x):
        """Sign function: +1 for x >= 0, -1 for x < 0."""
        y = np.zeros_like(x)
        y[np.where(x >= 0)] = 1.0
        y[np.where(x < 0)] = -1.0
        return y

    @staticmethod
    def ste(data, wintype, winlen):
        """
        Compute short-time energy.
        :param data: 1-D audio signal
        :param wintype: window name accepted by scipy.signal.get_window (e.g. 'hamming')
        :param winlen: window length in samples
        :return: short-time energy, same length as data
        """
        win = scisig.get_window(wintype, winlen)
        return scisig.convolve(data ** 2, win ** 2, mode="same")

    @staticmethod
    def stzcr(data, wintype, winlen):
        """
        Compute short-time zero crossing rate.
        :param data: 1-D audio signal
        :param wintype: window name accepted by scipy.signal.get_window
        :param winlen: window length in samples
        :return: short-time zero crossing rate, same length as data
        """
        win = scisig.get_window(wintype, winlen)
        win = 0.5 * win / len(win)
        x1 = np.roll(data, 1)
        x1[0] = 0.0
        abs_diff = np.abs(AudioUtils._sgn(data) - AudioUtils._sgn(x1))
        return scisig.convolve(abs_diff, win, mode="same")


class FeatureExtraction:
    RATE = 44100
    FRAME = 512

    def __init__(self, label=None):
        self.label = '' if label is None else label

    def extract_feature(self, audio_data):
        """
        Extract the 21-dimensional feature vector (8 scalar features + 13 MFCC means).
        :param audio_data: 1-D audio signal sampled at RATE
        :return: (features, label)
        """
        zcr = lrf.zero_crossing_rate(audio_data, frame_length=self.FRAME, hop_length=self.FRAME // 2)
        feature_zcr = np.mean(zcr)

        ste = AudioUtils.ste(audio_data, 'hamming', int(20 * 0.001 * self.RATE))  # 20 ms window
        feature_ste = np.mean(ste)

        ste_acc = np.diff(ste)
        feature_steacc = np.mean(ste_acc[ste_acc > 0])  # mean of positive energy increments

        stzcr = AudioUtils.stzcr(audio_data, 'hamming', int(20 * 0.001 * self.RATE))
        feature_stezcr = np.mean(stzcr)

        mfcc = lrf.mfcc(y=audio_data, sr=self.RATE, n_mfcc=13)
        feature_mfcc = np.mean(mfcc, axis=1)

        spectral_centroid = lrf.spectral_centroid(y=audio_data, sr=self.RATE, hop_length=self.FRAME // 2)
        feature_spectral_centroid = np.mean(spectral_centroid)

        spectral_bandwidth = lrf.spectral_bandwidth(y=audio_data, sr=self.RATE, hop_length=self.FRAME // 2)
        feature_spectral_bandwidth = np.mean(spectral_bandwidth)

        spectral_rolloff = lrf.spectral_rolloff(y=audio_data, sr=self.RATE, hop_length=self.FRAME // 2,
                                                roll_percent=0.90)
        feature_spectral_rolloff = np.mean(spectral_rolloff)

        spectral_flatness = lrf.spectral_flatness(y=audio_data, hop_length=self.FRAME // 2)
        feature_spectral_flatness = np.mean(spectral_flatness)

        features = np.append([feature_zcr, feature_ste, feature_steacc, feature_stezcr, feature_spectral_centroid,
                              feature_spectral_bandwidth, feature_spectral_rolloff, feature_spectral_flatness],
                             feature_mfcc)
        return features, self.label
```
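A hedged usage sketch, assuming a local wav file (the path is a placeholder) resampled to the class's 44.1 kHz rate:

```python
import librosa

# Load a clip at the extractor's expected sample rate (44.1 kHz).
audio, _ = librosa.load("data/1BabyCry/sample_001.wav", sr=FeatureExtraction.RATE)  # placeholder path

extractor = FeatureExtraction(label="1BabyCry")
features, label = extractor.extract_feature(audio)
print(features.shape, label)  # (21,) '1BabyCry' -- 8 scalar features + 13 MFCC means
```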
$$$ Step 2 audio classification
Input
- the 21-dimensional feature vector selected above
Output
- cry or not-cry (laugh, noise, silence, ...)
Classifier
- Machine learning: SVM / Random forest
```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
```
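A minimal training sketch: the data here is a synthetic stand-in (in the real project `X` holds the 21-dim feature vectors from `FeatureExtraction` and `y` the labels), and the hyperparameters are illustrative, not the project's actual settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (420 samples, 21 features, 2 classes).
X = np.random.randn(420, 21)
y = np.random.randint(0, 2, size=420)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize; the fitted mean/variance are what later gets hard-coded in the C inference code.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))
print("RF  accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```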
- Deep Learning: Conv1d
```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class SpeechDataset(Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels
        # 2-class mapping: self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 1, '4Silence': 1}
        self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 2, '4Silence': 3}

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        data = self.inputs[index]
        label = self.labels[index]
        label = self.map_dict[label]
        return data, label


class Net(nn.Module):
    def __init__(self, in_channels, num_classes, dropout=0.25):
        super().__init__()
        # expected Conv1d input: minibatch_size x num_channels x width
        self.layer = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Dropout(dropout, inplace=True),
            nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Dropout(dropout, inplace=True),
            nn.Flatten(),
            nn.Linear(48, num_classes),  # 48 = 16 channels x width 3 for a 21-dim input
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = x.view(x.size(0), 1, x.size(1))  # [B, D] -> [B, C=1, D]
        x = self.layer(x)
        return x
```
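A quick shape check (a sketch; the 2-class setup and batch size are illustrative), showing why `nn.Linear(48, ...)` matches the 21-dimensional feature vector: 21 → 19 after the first conv (kernel 3) → 9 after pooling → 7 after the second conv → 3 after pooling, and 16 channels × 3 = 48.

```python
import torch

model = Net(in_channels=1, num_classes=2)
dummy = torch.randn(4, 21)  # batch of 4 feature vectors, 21 features each
out = model(dummy)
print(out.shape)            # torch.Size([4, 2]) -- class probabilities per sample
```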
$ Training stage
Device
- linux
Data visualization
Datasets
- Total samples: 420 = Train (80%) + Test (20%)
- Labels
```python
# 2 classes
self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 1, '4Silence': 1}
# 4 classes
self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 2, '4Silence': 3}
```
Notes
- Conv1d needs input of shape [B, C=1, D=feature dim]
```python
def forward(self, x):
    x = x.view(x.size(0), 1, x.size(1))
    x = self.layer(x)
    return x
```
$$ Accuracy & Model Size
| Models | SVM | Random Forest | Neural Network |
|---|---|---|---|
| Accuracy (%) | 97.7 | 96.5 | 84 |
| ==Model Size (KB)== | 42 | 214 | 5 |
- SVM
```python
perf, model = mt.train_svm_model()
# {'accuracy': 0.9770114942528736, 'recall': 0.9761904761904762, 'precision': 0.9782608695652174, 'f1': 0.9763888888888889}
```
- RF
```python
perf, model = mt.train_rf_model()
# {'accuracy': 0.9655172413793104, 'recall': 0.9648268398268398, 'precision': 0.9657608695652173, 'f1': 0.965040650406504}
```
- NN
```
best_epoch:480  max_acc:0.8560919540229885
```
$ Deployment stage
Device
- embedded T31 chip
FFT
Scoring method result
- Comparison plots of the FFT-based score (noise as input vs. baby crying as input) are omitted here.
Compiled C program size
- 32 KB

```c
// FFT (radix-2 butterfly computation)
for (L = 1; L <= M; L++) {
    B = (int)(pow(2, L - 1));              // stage L: the two inputs of each butterfly are B = 2^(L-1) points apart
    for (J = 0; J < B; J++) {
        P = (int)(J * pow(2, M - L));      // each stage has B twiddle factors; each one serves 2^(M-L) butterflies
        for (k = J; k < N; k = (int)(k + pow(2, L))) {
            K1 = k + B;
            complex wn, t;
            Wn_i(N, P, &wn);               // twiddle factor W_N^P
            c_chengfa(f[K1], wn, &t);      // complex multiply: t = f[K1] * wn
            c_jianfa(f[k], t, &(f[K1]));   // butterfly: f[K1] = f[k] - t
            c_jiafa(f[k], t, &(f[k]));     // butterfly: f[k]  = f[k] + t
        }
    }
}
for (int i = 0; i < N; i++)                // FFT output: magnitude spectrum
{
    y[i] = sqrt(f[i].real * f[i].real + f[i].imag * f[i].imag);
    printf("%d %lf\n", i, y[i]);
}
```
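A hedged sanity check for the C routine using numpy's reference FFT; the signal here is synthetic, and N is assumed to be a power of two as the radix-2 butterfly code requires.

```python
import numpy as np

N = 512                                  # power of two, as required by the radix-2 FFT above
t = np.arange(N) / 44100.0
signal = np.sin(2 * np.pi * 440.0 * t)   # synthetic 440 Hz tone as a stand-in input

magnitude = np.abs(np.fft.fft(signal))   # same quantity the C loop prints: sqrt(re^2 + im^2)
peak_bin = int(np.argmax(magnitude[:N // 2]))
print(peak_bin, peak_bin * 44100.0 / N)  # peak bin and its frequency in Hz
```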
Export sklearn model to deploy
- sklearn-porter is tied to specific scikit-learn versions (see https://github.com/nok/sklearn-porter/issues/82): `pip install scikit-learn==0.22`; a Python 3.6 environment works best.

```python
# ! export to C language (run here on Windows)
from sklearn_porter import Porter

porter = Porter(model, language='c')  # export straight to C
output = porter.export(embed_data=True)
with open('svm_infer.c', 'w') as f:
    f.write(output)
```
Inference in C

```c
#include <stdio.h>
#include <math.h>

// predict() is provided by the sklearn-porter generated svm_infer.c
// (prototype assumed here to match the call below)
extern int predict(double features[]);

int main(int argc, const char * argv[]) {
    /* Features of one known '1BabyCry' (class 0) sample: */
    double features[21] = {0.08548660455336426, 3.7481414308053336, 0.0031232076831082135,
        0.04624957396314546, 3676.3404738124937, 3480.6022753882544, 8677.356268579611,
        0.0015091497916728258, -246.88226704105966, 87.94723203609043, -69.86268052131344,
        -0.34579994437744116, -42.9471282727914, 4.457187434533733, -17.901607875608402,
        21.03675732334522, 6.356159473281781, 3.2514479635033497, -7.103540737790745,
        10.796371942992913, -10.94354612752694};
    // labels
    char* labels[4] = {"1BabyCry", "2BabyLaugh", "3Noise", "4Silence"};
    // StandardScaler statistics from the training set
    double mean[21] = {5.19192940e-02, 9.96640933e+00, 1.70053242e-02, 2.81007237e-02,
        2.43518755e+03, 3.18738345e+03, 6.77396457e+03, 4.17157508e-02,
        -3.11171803e+02, 1.12272466e+02, -1.03072190e+01, 1.78713296e+01,
        -4.93390880e+00, 3.33015672e+00, 8.81093827e-01, 2.69630493e+00,
        1.28616258e+00, 2.27378571e+00, -2.89055164e-01, -2.95153293e-01, -1.40412454e-01};
    double var[21] = {1.86645763e-03, 2.01389259e+02, 6.28560738e-04, 5.46686531e-04,
        8.46803415e+05, 1.40065049e+06, 6.21648618e+06, 2.62907671e-02,
        8.45015467e+03, 1.96557435e+03, 2.09598131e+03, 2.57818971e+02,
        3.29287434e+02, 2.32575174e+02, 1.91351733e+02, 1.57938136e+02,
        1.43058889e+02, 1.03135708e+02, 8.86539659e+01, 7.32728827e+01,
        6.26621389e+01};
    // standardize exactly like sklearn's StandardScaler: (x - mean) / sqrt(var)
    for (int i = 0; i < 21; i++) {
        features[i] = (features[i] - mean[i]) / sqrt(var[i]);
    }
    /* Prediction: */
    int class_id = predict(features);
    char* label = labels[class_id];
    printf("class_id:%d \t label:%s\n", class_id, label);
    printf("Done.\n");
    return 0;
}
```
```shell
output:
class_id:0   label:1BabyCry
Done.
56K     run
```
Notes
- With gcc, the order of dependencies on the command line matters (sources/objects must come before the libraries they depend on).
- gcc needs link libraries to be specified explicitly, e.g. `-lm` for `#include <math.h>` (the math library) and `-lpthread` for POSIX threads.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# use fit_transform() on the training set
X_train = scaler.fit_transform(X_train)
# use transform() on the test set (reuses the training-set mean/variance)
X_test = scaler.transform(X_test)
```
Or, for testing, the same standardization with plain numpy:
```python
# * StandardScaler replicated manually (mean_/var_ copied from the fitted scaler)
import numpy as np

features = features.reshape(1, -1)
mean_ = np.array([ 5.19192940e-02, 9.96640933e+00, 1.70053242e-02, 2.81007237e-02,
                   2.43518755e+03, 3.18738345e+03, 6.77396457e+03, 4.17157508e-02,
                  -3.11171803e+02, 1.12272466e+02, -1.03072190e+01, 1.78713296e+01,
                  -4.93390880e+00, 3.33015672e+00, 8.81093827e-01, 2.69630493e+00,
                   1.28616258e+00, 2.27378571e+00, -2.89055164e-01, -2.95153293e-01, -1.40412454e-01])
var_ = np.array([1.86645763e-03, 2.01389259e+02, 6.28560738e-04, 5.46686531e-04,
                 8.46803415e+05, 1.40065049e+06, 6.21648618e+06, 2.62907671e-02,
                 8.45015467e+03, 1.96557435e+03, 2.09598131e+03, 2.57818971e+02,
                 3.29287434e+02, 2.32575174e+02, 1.91351733e+02, 1.57938136e+02,
                 1.43058889e+02, 1.03135708e+02, 8.86539659e+01, 7.32728827e+01,
                 6.26621389e+01])
features = (features - mean_) / np.sqrt(var_)
```
$ References
- conv1d + ensemble
- conv2d
    - The backbone is a CNN+FC network. Unlike an image CNN classifier, whose input has shape (batch, 3, H, W) with depth 3, the Mel-spectrogram of an audio signal has depth 1 and can be treated as a grayscale image with input shape (batch, 1, H, W). In practice it is enough to set in_channels=1 in the first convolution layer of a standard image-classification backbone. Note that this dimension mismatch means ImageNet-pretrained weights cannot be used directly; alternatively, the Mel-spectrogram (grayscale) can be converted to a 3-channel RGB image, after which it is no different from an ordinary RGB image and ImageNet-pretrained models can be used. See the sketch below.
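A hedged torchvision sketch of the first option (swapping the first conv layer to a single input channel); resnet18 is just an example backbone, not necessarily what the referenced work uses, and torchvision >= 0.13 is assumed for the `weights=` argument.

```python
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 2

# Option 1: grayscale Mel-spectrogram input -> change the first conv to in_channels=1.
# ImageNet-pretrained weights no longer fit this layer, so train from scratch.
model = models.resnet18(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, num_classes)

mel = torch.randn(8, 1, 128, 64)  # (batch, 1, n_mels, frames) -- synthetic stand-in
print(model(mel).shape)           # torch.Size([8, 2])

# Option 2 (not shown): repeat the Mel-spectrogram to 3 channels,
# e.g. mel_rgb = mel.repeat(1, 3, 1, 1), and keep an ImageNet-pretrained backbone unchanged.
```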
- ==tools==
    - sklearn-porter: lighter weight
    - m2cgen (see the sketch after this list)
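A hedged m2cgen sketch as an alternative export path, assuming a fitted scikit-learn `model` (as in the sklearn-porter example above) of a type m2cgen supports:

```python
import m2cgen as m2c

# Transpile the fitted scikit-learn model into dependency-free C source code.
c_code = m2c.export_to_c(model)
with open("model_infer.c", "w") as f:
    f.write(c_code)
```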
$ TODO
- Refinement of the Conv1d network
- C/C++ extraction of the STFT feature
- C/C++ extraction of STFT, MFCC and other features
    - reimplement the melspectrogram computation of the Python audio library librosa in C/C++
- C/C++ implementation of SVM / RF / NN inference
    - github.com/livey/svm-e... linear kernel does not work well; an RBF kernel would need to be added, and it only handles binary classification
    - github.com/koba-jon/sv... RBF works
    - github.com/koba-jon/sv... worth referencing
    - github.com/cjlin1/libs... reference (libsvm kernel functions below)
```c++
static double dot(const svm_node *px, const svm_node *py);
double kernel_linear(int i, int j) const
{
    return dot(x[i], x[j]);
}
double kernel_poly(int i, int j) const
{
    return powi(gamma*dot(x[i], x[j])+coef0, degree);
}
double kernel_rbf(int i, int j) const
{
    return exp(-gamma*(x_square[i]+x_square[j]-2*dot(x[i], x[j])));
}
double kernel_sigmoid(int i, int j) const
{
    return tanh(gamma*dot(x[i], x[j])+coef0);
}
double kernel_precomputed(int i, int j) const
{
    return x[i][(int)(x[j][0].value)].value;
}
```