
Baby Cry Recognition: A Tutorial on Taking the Algorithm to Production
Author: Jet    Date: 2023/03

[TOC]

$ Pipeline

  • Step 1 Continuously monitor the audio input; when the volume exceeds a threshold, record audio for a fixed duration
  • Step 2 Apply signal processing to the recorded audio and extract a set of features
  • Step 3 Feed the features into a classifier for binary classification
  • Step 4 Decide whether to raise an alert based on the classification result
  • Step 5 Repeat the steps above (a minimal sketch of this loop follows)
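
A minimal sketch of the loop in Python; the callables `read_chunk`, `record_clip`, `extract_features`, `classify`, and `alert` are hypothetical placeholders for the components developed in the rest of this tutorial:

python
import numpy as np

RATE = 44100                # sampling rate (Hz)
VOLUME_THRESHOLD = 0.05     # RMS trigger level, tuned empirically
RECORD_SECONDS = 5          # clip length recorded after triggering


def monitoring_loop(read_chunk, record_clip, extract_features, classify, alert):
    """Threshold-triggered record -> extract -> classify -> alert loop."""
    while True:
        chunk = read_chunk()                    # Step 1: continuous monitoring
        rms = np.sqrt(np.mean(chunk ** 2))      # simple volume estimate
        if rms > VOLUME_THRESHOLD:
            clip = record_clip(RECORD_SECONDS)  # record a fixed-length clip
            feats = extract_features(clip)      # Step 2: signal processing + features
            if classify(feats):                 # Step 3: binary classification
                alert()                         # Step 4: raise the alert
        # Step 5: loop forever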

$$ Methodology

  • Input
    • Digital audio signal: discrete samples
    • Formats: .wav/.mp3/.amr/.m4a/.flac/.aac
  • Intermediate processing

    • Feature extraction
      • Spectrogram via the short-time Fourier transform (STFT); MFCCs computed from the Mel-spectrogram via the DCT
    • Feature selection
      • Zero crossing rate

      • Spectral centroid

      • Spectral roll-off

      • Mel-frequency cepstral coefficients (MFCC)

      • Chroma frequencies

      • ...

  • Output

    • 1 Scoring

      • Compute a feature score and compare it against a threshold (a toy illustration follows this list)
    • 2 Classifier

      • Machine learning: SVM / Random forest
      • Deep learning: 1D Conv
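
The tutorial does not pin down a concrete scoring rule, so the following is only a toy illustration of the idea; the weights and threshold are made up:

python
import numpy as np

def score_and_decide(features, weights, threshold):
    """Weighted feature score compared against a threshold (illustrative only)."""
    score = float(np.dot(features, weights))
    return score > threshold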

$$$ Step 1 feature extraction

Key words

  • Amplitude --- perceived as loudness.

  • Frequency --- perceived as pitch.

  • Sample rate --- the number of samples taken per second; a sample rate of 22000 Hz means 22000 samples are captured each second (a quick inspection snippet follows this list).

  • Bit depth --- the resolution of each recorded sample, analogous to the bit depth of pixels in an image; 24-bit audio is therefore higher quality than 16-bit.
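
To inspect the sample rate and duration of a concrete file, a sketch using librosa; the filename `cry.wav` is a placeholder:

python
import librosa

# sr=None keeps the file's native sample rate instead of resampling to 22050 Hz
audio, sr = librosa.load("cry.wav", sr=None)
print(f"sample rate: {sr} Hz")               # samples per second
print(f"duration: {len(audio) / sr:.2f} s")  # sample count / sample rate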

STFT

  • Applying the fast Fourier transform frame by frame is known as the short-time Fourier transform (STFT).

MFCC

  • The extraction pipeline: pre-emphasis, framing, windowing, fast Fourier transform (FFT), Mel filter-bank filtering, taking the logarithm, and the discrete cosine transform (DCT). A librosa sketch of both transforms follows.
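
Both transforms are one call each in librosa (a sketch; the frame and hop sizes are illustrative, and librosa's MFCC skips the pre-emphasis step but otherwise follows the pipeline above):

python
import numpy as np
import librosa

audio, sr = librosa.load("cry.wav", sr=None)

# STFT: frame-by-frame FFT -> complex spectrogram of shape (1 + n_fft//2, n_frames)
stft = librosa.stft(audio, n_fft=512, hop_length=256)
magnitude = np.abs(stft)

# MFCC: Mel filter bank on the power spectrogram, log, then DCT
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
print(magnitude.shape, mfcc.shape)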

Feature selection

Nine features in total: 8 STFT/time-domain features + 1 MFCC.

  • zero_crossing_rate

  • ste (short-time energy)

  • ste_acc (positive first differences of the short-time energy)

  • stzcr (short-time zero crossing rate)

  • spectral_centroid

  • spectral_bandwidth

  • spectral_rolloff

  • spectral_flatness

  • ==mfcc (Mel-frequency cepstral coefficients)==

python
import numpy as np
import librosa.feature as lrf
import scipy.signal as scisig


class AudioUtils:
    def __init__(self):
        pass

    @staticmethod
    def _sgn(x):
        y = np.zeros_like(x)
        y[np.where(x >= 0)] = 1.0
        y[np.where(x < 0)] = -1.0
        return y

    @staticmethod
    def ste(data, wintype, winlen):
        """
        Compute short-time energy
        :param data:
        :param wintype:
        :param winlen:
        :return:
        """
        win = scisig.get_window(wintype, winlen)
        return scisig.convolve(data ** 2, win ** 2, mode="same")

    @staticmethod
    def stzcr(data, wintype, winlen):
        """
        Compute short-time zero crossing rate.
        :param data:
        :param wintype:
        :param winlen:
        :return:
        """
        win = scisig.get_window(wintype, winlen)
        win = 0.5 * win / len(win)
        x1 = np.roll(data, 1)
        x1[0] = 0.0
        abs_diff = np.abs(AudioUtils._sgn(data) - AudioUtils._sgn(x1))
        return scisig.convolve(abs_diff, win, mode="same")


class FeatureExtraction:
    RATE = 44100   # expected sample rate of the input audio (Hz)
    FRAME = 512    # analysis frame length in samples

    def __init__(self, label=None):
        self.label = '' if label is None else label

    def extract_feature(self, audio_data):
        """
        Extract the 21-dimensional feature vector (8 scalar features + 13 mean MFCCs).
        :param audio_data: 1-D audio signal sampled at RATE
        :return: (features, label) tuple
        """
        zcr = lrf.zero_crossing_rate(audio_data, frame_length=self.FRAME, hop_length=self.FRAME // 2)
        feature_zcr = np.mean(zcr)

        ste = AudioUtils.ste(audio_data, 'hamming', int(20 * 0.001 * self.RATE))  # 20 ms window
        feature_ste = np.mean(ste)

        ste_acc = np.diff(ste)
        feature_steacc = np.mean(ste_acc[ste_acc > 0])  # mean positive energy increase

        stzcr = AudioUtils.stzcr(audio_data, 'hamming', int(20 * 0.001 * self.RATE))
        feature_stezcr = np.mean(stzcr)

        mfcc = lrf.mfcc(y=audio_data, sr=self.RATE, n_mfcc=13)
        feature_mfcc = np.mean(mfcc, axis=1)

        spectral_centroid = lrf.spectral_centroid(y=audio_data, sr=self.RATE, hop_length=self.FRAME // 2)
        feature_spectral_centroid = np.mean(spectral_centroid)

        spectral_bandwidth = lrf.spectral_bandwidth(y=audio_data, sr=self.RATE, hop_length=self.FRAME // 2)
        feature_spectral_bandwidth = np.mean(spectral_bandwidth)

        spectral_rolloff = lrf.spectral_rolloff(y=audio_data, sr=self.RATE, hop_length=self.FRAME // 2,
                                                roll_percent=0.90)
        feature_spectral_rolloff = np.mean(spectral_rolloff)

        spectral_flatness = lrf.spectral_flatness(y=audio_data, hop_length=self.FRAME // 2)
        feature_spectral_flatness = np.mean(spectral_flatness)

        features = np.append([feature_zcr, feature_ste, feature_steacc, feature_stezcr, feature_spectral_centroid,
                              feature_spectral_bandwidth, feature_spectral_rolloff, feature_spectral_flatness],
                             feature_mfcc)
        return features, self.label
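
A usage sketch (the filename is a placeholder; the audio must be loaded at the class's RATE of 44100 Hz):

python
import librosa

audio, _ = librosa.load("cry.wav", sr=FeatureExtraction.RATE)
fe = FeatureExtraction(label='1BabyCry')
features, label = fe.extract_feature(audio)
print(features.shape, label)  # (21,) -> 8 scalar features + 13 mean MFCCs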

$$$ Step 2 audio classification

Input

  • selected features above

Output

  • cry or not (laugh, noise, silence...)

Classifier

  • Machine learning: SVM / Random forest
python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
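
A minimal training sketch, assuming `X` is the (N, 21) feature matrix and `y` the labels; the hyperparameters here are illustrative, not necessarily those behind the reported results:

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaling on the training set only
X_test = scaler.transform(X_test)

for clf in (SVC(kernel='rbf'), RandomForestClassifier(n_estimators=100)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))
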
  • Deep Learning: Conv1d
python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class SpeechDataset(Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels
        # self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 1, '4Silence': 1}
        self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 2, '4Silence': 3}
        
    def __len__(self):
        return len(self.inputs)
    
    def __getitem__(self, index):
        data = self.inputs[index]
        label = self.labels[index]
        label = self.map_dict[label]
        
        return data, label

    
class Net(nn.Module):
    def __init__(self, in_channels, num_classes, dropout=0.25):
        super().__init__()
        # expected Conv1d input: minibatch_size x num_channels x width
        self.layer = nn.Sequential(
            nn.Conv1d(in_channels=in_channels, out_channels=8, kernel_size=3),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Dropout(dropout, inplace=True),

            nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Dropout(dropout, inplace=True),

            nn.Flatten(),
            # 48 = 16 channels * 3 positions remaining after two conv/pool
            # stages applied to the 21-dimensional feature vector
            nn.Linear(48, num_classes),
            nn.Softmax(dim=1)  # emits class probabilities, not logits
            )

    def forward(self, x):
        x = x.view(x.size(0), 1, x.size(1))
        x = self.layer(x)
        return x
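
A minimal training-loop sketch, assuming `inputs` is an (N, 21) float array and `labels` the corresponding string labels; since Net already ends in Softmax, NLLLoss over the log of its output is the matching criterion:

python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

dataset = SpeechDataset(inputs, labels)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = Net(in_channels=1, num_classes=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()  # expects log-probabilities

for epoch in range(500):
    for data, label in loader:
        optimizer.zero_grad()
        probs = model(data.float())
        loss = criterion(torch.log(probs + 1e-8), label)  # eps guards log(0)
        loss.backward()
        optimizer.step()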

$ Training stage

Device

  • linux

Data visualization

Datasets

  • 420 samples in total, split into train (80%) and test (20%)

  • Labels

    python
       # 2 classes
       self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 1, '4Silence': 1}
       
       # 4 classes
       self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 2, '4Silence': 3}

Notes

  • Conv1d needs input of shape [B, C=1, D=feature_dim]

    python
    def forward(self, x):
        x = x.view(x.size(0), 1, x.size(1))
        x = self.layer(x)
        return x

$$ Accuracy & Model Size

| Model | SVM | Random Forest | Neural Network |
| --- | --- | --- | --- |
| Accuracy (%) | 97.7 | 96.5 | 84 |
| ==Model Size (KB)== | 42 | 214 | 5 |
  • SVM: perf, model = mt.train_svm_model() # {'accuracy': 0.9770114942528736, 'recall': 0.9761904761904762, 'precision': 0.9782608695652174, 'f1': 0.9763888888888889}
  • RF: perf, model = mt.train_rf_model() # {'accuracy': 0.9655172413793104, 'recall': 0.9648268398268398, 'precision': 0.9657608695652173, 'f1': 0.965040650406504}
  • NN: best_epoch: 480, max_acc: 0.8560919540229885 (a sketch of collecting these metrics follows)
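
The metric dicts above look like sklearn.metrics output; a sketch of how they can be collected (the macro averaging is an assumption, and `mt.train_svm_model()` / `mt.train_rf_model()` are the author's helpers, not shown here):

python
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def evaluate(clf, X_test, y_test):
    """Collect the four metrics reported above."""
    y_pred = clf.predict(X_test)
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred, average='macro'),
        'precision': precision_score(y_test, y_pred, average='macro'),
        'f1': f1_score(y_test, y_pred, average='macro'),
    }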

$ Deployment stage

Device

  • embedded T31 chip

FFT

Scoring method result

(result plots: noise as input vs. baby crying as input)

Compiled C program size

  • 32KB

  • debug code example

    c
    	// FFT butterfly computation
    	for (L = 1; L <= M; L++)
    	{
    		B = (int)(pow(2, L - 1));	// at stage L, the two inputs of each butterfly are B = 2^(L-1) points apart
    		for (J = 0; J < B; J++)
    		{
    			P = (int)(J*pow(2, M - L));	// each stage has B twiddle factors; each drives 2^(M-L) butterflies
    			for (k = J; k < N; k = (int)(k + pow(2, L)))
    			{
    				K1 = k + B;
    				complex wn, t;
    				Wn_i(N, P, &wn);				// twiddle factor W_N^P
    				c_chengfa(f[K1], wn, &t);		// t = f[K1] * wn (complex multiply)
    				c_jianfa(f[k], t, &(f[K1]));	// f[K1] = f[k] - t (butterfly)
    				c_jiafa(f[k], t, &(f[k]));		// f[k]  = f[k] + t
    			}
    		}
    	}
    	
    	for (int i = 0; i < N; i++)					// print FFT magnitudes
    	{
    		y[i] = sqrt(f[i].real*f[i].real + f[i].imag*f[i].imag);
    		printf("%d %lf\n", i, y[i]);
    	}

Export the sklearn model for deployment

python
# sklearn-porter version compatibility: https://github.com/nok/sklearn-porter/issues/82
# pip install scikit-learn==0.22
# a Python 3.6 environment works best

# ! export to c language in windows
from sklearn_porter import Porter
porter = Porter(model, language='c') # export straight to C source
output = porter.export(embed_data=True)
with open('svm_infer.c', 'w') as f:
	f.write(output)

Inference in C

c
int main(int argc, const char * argv[]) {
    /* Features: */
    double features[21] = {0.08548660455336426,3.7481414308053336,0.0031232076831082135,0.04624957396314546,3676.3404738124937,3480.6022753882544,8677.356268579611,0.0015091497916728258,-246.88226704105966,87.94723203609043,-69.86268052131344,-0.34579994437744116,-42.9471282727914,4.457187434533733,-17.901607875608402,21.03675732334522,6.356159473281781,3.2514479635033497,-7.103540737790745,10.796371942992913,-10.94354612752694}; // cry 0

    // labels
    char* labels[4] = {"1BabyCry", "2BabyLaugh", "3Noise", "4Silence"};

    // StandardScaler statistics (mean/variance) from the training set
    double mean[21] = {5.19192940e-02,  9.96640933e+00,  1.70053242e-02,  2.81007237e-02,
        2.43518755e+03,  3.18738345e+03,  6.77396457e+03,  4.17157508e-02,
        -3.11171803e+02,  1.12272466e+02, -1.03072190e+01,  1.78713296e+01,
        -4.93390880e+00,  3.33015672e+00,  8.81093827e-01,  2.69630493e+00,
        1.28616258e+00,  2.27378571e+00, -2.89055164e-01, -2.95153293e-01,-1.40412454e-01};
    
    double var[21] = {1.86645763e-03, 2.01389259e+02, 6.28560738e-04, 5.46686531e-04,
        8.46803415e+05, 1.40065049e+06, 6.21648618e+06, 2.62907671e-02,
        8.45015467e+03, 1.96557435e+03, 2.09598131e+03, 2.57818971e+02,
        3.29287434e+02, 2.32575174e+02, 1.91351733e+02, 1.57938136e+02,
        1.43058889e+02, 1.03135708e+02, 8.86539659e+01, 7.32728827e+01,
        6.26621389e+01};
        
    // standardize: (x - mean) / sqrt(var), matching sklearn's StandardScaler
    for(int i=0; i<21; i++){
        features[i] = (features[i]-mean[i]) / sqrt(var[i]);
    }
    
    /* Prediction: */
    int class_id = predict(features);
    char* label = labels[class_id];
    printf("class_id:%d \t label:%s\n", class_id, label);
    printf("Done.\n");

    return 0;
}
shell
output:
    class_id:0       label:1BabyCry
    Done.
    56K     run

Notes

  • gcc link order matters: list object files before the libraries they depend on
  • gcc needs link libraries specified explicitly, e.g. -lm for <math.h> and -lpthread for POSIX threads; so: gcc svm_infer.c -o run -lm
python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# use fit_transform() on the training set
X_train = scaler.fit_transform(X_train)
# use transform() only on the test set
X_test = scaler.transform(X_test)

Or reproduce the scaling manually with NumPy for testing:
        # * StandardScaler
        features = features.reshape(1, -1)
        mean_ = np.array([ 5.19192940e-02,  9.96640933e+00,  1.70053242e-02,  2.81007237e-02,
            2.43518755e+03,  3.18738345e+03,  6.77396457e+03,  4.17157508e-02,
            -3.11171803e+02,  1.12272466e+02, -1.03072190e+01,  1.78713296e+01,
            -4.93390880e+00,  3.33015672e+00,  8.81093827e-01,  2.69630493e+00,
            1.28616258e+00,  2.27378571e+00, -2.89055164e-01, -2.95153293e-01,-1.40412454e-01])
        
        var_ = np.array([1.86645763e-03, 2.01389259e+02, 6.28560738e-04, 5.46686531e-04,
            8.46803415e+05, 1.40065049e+06, 6.21648618e+06, 2.62907671e-02,
            8.45015467e+03, 1.96557435e+03, 2.09598131e+03, 2.57818971e+02,
            3.29287434e+02, 2.32575174e+02, 1.91351733e+02, 1.57938136e+02,
            1.43058889e+02, 1.03135708e+02, 8.86539659e+01, 7.32728827e+01,
            6.26621389e+01])
        
        features = (features - mean_) / np.sqrt(var_)
      

$ References

$ TODO

  • refinement of Conv1d network

  • C/C++ implementation of STFT feature extraction

  • C/C++ implementation of STFT, MFCC, and other feature extraction

    a C/C++ reimplementation of the melspectrogram computation from the Python audio library librosa:

    github.com/xiaominfc/m...

  • C/C++ implementations of SVM/RF/NN inference

    github.com/arnaudsj/li...

    github.com/livey/svm-e... linear kernel doesn't work well; adding an RBF kernel would suffice, but it is binary-only

    github.com/koba-jon/sv... RBF works

    github.com/jgreitemann...

    github.com/koba-jon/sv... worth referencing

    github.com/cjlin1/libs... reference

    c++
    static double dot(const svm_node *px, const svm_node *py);
    double kernel_linear(int i, int j) const
    {
        return dot(x[i],x[j]);
    }
    double kernel_poly(int i, int j) const
    {
        return powi(gamma*dot(x[i],x[j])+coef0,degree);
    }
    double kernel_rbf(int i, int j) const
    {
        return exp(-gamma*(x_square[i]+x_square[j]-2*dot(x[i],x[j])));
    }
    double kernel_sigmoid(int i, int j) const
    {
        return tanh(gamma*dot(x[i],x[j])+coef0);
    }
    double kernel_precomputed(int i, int j) const
    {
        return x[i][(int)(x[j][0].value)].value;
    }