
Baby Cry Recognition: A Tutorial on Taking the Algorithm to Production
Author: Jet    Date: 2023/03

[TOC]

$ Pipeline

  • Step 1 Continuously monitor the audio input; when the volume exceeds a threshold, record audio for a fixed duration
  • Step 2 Apply signal processing to the recorded audio and extract a set of features
  • Step 3 Feed the features into a classifier for binary classification
  • Step 4 Decide whether to raise an alert based on the classification result
  • Step 5 Repeat the steps above (a minimal sketch of this loop follows)
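
A minimal sketch of the loop in Python; the callables `read_chunk`, `record_clip`, `extract_features`, `classify`, and `alert` are hypothetical placeholders for the components developed in the rest of this tutorial:

python
import numpy as np

RATE = 44100                # sampling rate (Hz)
VOLUME_THRESHOLD = 0.05     # RMS trigger level, tuned empirically
RECORD_SECONDS = 5          # clip length recorded after triggering


def monitoring_loop(read_chunk, record_clip, extract_features, classify, alert):
    """Threshold-triggered record -> extract -> classify -> alert loop."""
    while True:
        chunk = read_chunk()                    # Step 1: continuous monitoring
        rms = np.sqrt(np.mean(chunk ** 2))      # simple volume estimate
        if rms > VOLUME_THRESHOLD:
            clip = record_clip(RECORD_SECONDS)  # record a fixed-length clip
            feats = extract_features(clip)      # Step 2: signal processing + features
            if classify(feats):                 # Step 3: binary classification
                alert()                         # Step 4: raise the alert
        # Step 5: loop forever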

$$ Methodology

  • Input
    • Digital audio signal: discrete samples
    • Formats: .wav/.mp3/.amr/.m4a/.flac/.aac
  • Intermediate processing

    • Feature extraction
      • Spectrogram via the short-time Fourier transform (STFT); MFCCs computed from the Mel-spectrogram via the DCT
    • Feature selection
      • Zero crossing rate

      • Spectral centroid

      • Spectral roll-off

      • Mel-frequency cepstral coefficients (MFCC)

      • Chroma frequencies

      • ...

  • Output

    • 1 Scoring

      • Compute a feature score and compare it against a threshold (a toy illustration follows this list)
    • 2 Classifier

      • Machine learning: SVM / Random forest
      • Deep learning: 1D Conv
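
The tutorial does not pin down a concrete scoring rule, so the following is only a toy illustration of the idea; the weights and threshold are made up:

python
import numpy as np

def score_and_decide(features, weights, threshold):
    """Weighted feature score compared against a threshold (illustrative only)."""
    score = float(np.dot(features, weights))
    return score > threshold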

$$$ Step 1 feature extraction

Key words

  • Amplitude --- perceived as loudness.

  • Frequency --- perceived as pitch.

  • Sample rate --- the number of samples taken per second; a sample rate of 22000 Hz means 22000 samples are captured each second (a quick inspection snippet follows this list).

  • Bit depth --- the resolution of each recorded sample, analogous to the bit depth of pixels in an image; 24-bit audio is therefore higher quality than 16-bit.
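
To inspect the sample rate and duration of a concrete file, a sketch using librosa; the filename `cry.wav` is a placeholder:

python
import librosa

# sr=None keeps the file's native sample rate instead of resampling to 22050 Hz
audio, sr = librosa.load("cry.wav", sr=None)
print(f"sample rate: {sr} Hz")               # samples per second
print(f"duration: {len(audio) / sr:.2f} s")  # sample count / sample rate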

STFT

  • Applying the fast Fourier transform frame by frame is known as the short-time Fourier transform (STFT).

MFCC

  • The extraction pipeline: pre-emphasis, framing, windowing, fast Fourier transform (FFT), Mel filter-bank filtering, taking the logarithm, and the discrete cosine transform (DCT). A librosa sketch of both transforms follows.
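
Both transforms are one call each in librosa (a sketch; the frame and hop sizes are illustrative, and librosa's MFCC skips the pre-emphasis step but otherwise follows the pipeline above):

python
import numpy as np
import librosa

audio, sr = librosa.load("cry.wav", sr=None)

# STFT: frame-by-frame FFT -> complex spectrogram of shape (1 + n_fft//2, n_frames)
stft = librosa.stft(audio, n_fft=512, hop_length=256)
magnitude = np.abs(stft)

# MFCC: Mel filter bank on the power spectrogram, log, then DCT
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
print(magnitude.shape, mfcc.shape)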

Feature selection

Nine features in total: 8 STFT/time-domain features + 1 MFCC.

  • zero_crossing_rate

  • ste (short-time energy)

  • ste_acc (positive first differences of the short-time energy)

  • stzcr (short-time zero crossing rate)

  • spectral_centroid

  • spectral_bandwidth

  • spectral_rolloff

  • spectral_flatness

  • ==mfcc (Mel-frequency cepstral coefficients)==

python
import numpy as np
import librosa.feature as lrf
import scipy.signal as scisig


class AudioUtils:
    def __init__(self):
        pass

    @staticmethod
    def _sgn(x):
        y = np.zeros_like(x)
        y[np.where(x >= 0)] = 1.0
        y[np.where(x < 0)] = -1.0
        return y

    @staticmethod
    def ste(data, wintype, winlen):
        """
        Compute short-time energy
        :param data:
        :param wintype:
        :param winlen:
        :return:
        """
        win = scisig.get_window(wintype, winlen)
        return scisig.convolve(data ** 2, win ** 2, mode="same")

    @staticmethod
    def stzcr(data, wintype, winlen):
        """
        Compute short-time zero crossing rate.
        :param data:
        :param wintype:
        :param winlen:
        :return:
        """
        win = scisig.get_window(wintype, winlen)
        win = 0.5 * win / len(win)
        x1 = np.roll(data, 1)
        x1[0] = 0.0
        abs_diff = np.abs(AudioUtils._sgn(data) - AudioUtils._sgn(x1))
        return scisig.convolve(abs_diff, win, mode="same")


class FeatureExtraction:
    RATE = 44100   # expected sample rate of the input audio (Hz)
    FRAME = 512    # analysis frame length in samples

    def __init__(self, label=None):
        self.label = '' if label is None else label

    def extract_feature(self, audio_data):
        """
        Extract the 21-dimensional feature vector (8 scalar features + 13 mean MFCCs).
        :param audio_data: 1-D audio signal sampled at RATE
        :return: (features, label) tuple
        """
        zcr = lrf.zero_crossing_rate(audio_data, frame_length=self.FRAME, hop_length=self.FRAME // 2)
        feature_zcr = np.mean(zcr)

        ste = AudioUtils.ste(audio_data, 'hamming', int(20 * 0.001 * self.RATE))  # 20 ms window
        feature_ste = np.mean(ste)

        ste_acc = np.diff(ste)
        feature_steacc = np.mean(ste_acc[ste_acc > 0])  # mean positive energy increase

        stzcr = AudioUtils.stzcr(audio_data, 'hamming', int(20 * 0.001 * self.RATE))
        feature_stezcr = np.mean(stzcr)

        mfcc = lrf.mfcc(y=audio_data, sr=self.RATE, n_mfcc=13)
        feature_mfcc = np.mean(mfcc, axis=1)

        spectral_centroid = lrf.spectral_centroid(y=audio_data, sr=self.RATE, hop_length=self.FRAME // 2)
        feature_spectral_centroid = np.mean(spectral_centroid)

        spectral_bandwidth = lrf.spectral_bandwidth(y=audio_data, sr=self.RATE, hop_length=self.FRAME // 2)
        feature_spectral_bandwidth = np.mean(spectral_bandwidth)

        spectral_rolloff = lrf.spectral_rolloff(y=audio_data, sr=self.RATE, hop_length=self.FRAME // 2,
                                                roll_percent=0.90)
        feature_spectral_rolloff = np.mean(spectral_rolloff)

        spectral_flatness = lrf.spectral_flatness(y=audio_data, hop_length=self.FRAME // 2)
        feature_spectral_flatness = np.mean(spectral_flatness)

        features = np.append([feature_zcr, feature_ste, feature_steacc, feature_stezcr, feature_spectral_centroid,
                              feature_spectral_bandwidth, feature_spectral_rolloff, feature_spectral_flatness],
                             feature_mfcc)
        return features, self.label
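
A usage sketch (the filename is a placeholder; the audio must be loaded at the class's RATE of 44100 Hz):

python
import librosa

audio, _ = librosa.load("cry.wav", sr=FeatureExtraction.RATE)
fe = FeatureExtraction(label='1BabyCry')
features, label = fe.extract_feature(audio)
print(features.shape, label)  # (21,) -> 8 scalar features + 13 mean MFCCs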

$$$ Step 2 audio classification

Input

  • selected features above

Output

  • cry or not (laugh, noise, silence...)

Classifier

  • Machine learning: SVM / Random forest
python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
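
A minimal training sketch, assuming `X` is the (N, 21) feature matrix and `y` the labels; the hyperparameters here are illustrative, not necessarily those behind the reported results:

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaling on the training set only
X_test = scaler.transform(X_test)

for clf in (SVC(kernel='rbf'), RandomForestClassifier(n_estimators=100)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))
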
  • Deep Learning: Conv1d
python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class SpeechDataset(Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels
        # self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 1, '4Silence': 1}
        self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 2, '4Silence': 3}
        
    def __len__(self):
        return len(self.inputs)
    
    def __getitem__(self, index):
        data = self.inputs[index]
        label = self.labels[index]
        label = self.map_dict[label]
        
        return data, label

    
class Net(nn.Module):
    def __init__(self, in_channels, num_classes, dropout=0.25):
        super().__init__()
        # expected Conv1d input: minibatch_size x num_channels x width
        self.layer = nn.Sequential(
            nn.Conv1d(in_channels=in_channels, out_channels=8, kernel_size=3),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Dropout(dropout, inplace=True),

            nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Dropout(dropout, inplace=True),

            nn.Flatten(),
            # 48 = 16 channels * 3 positions remaining after two conv/pool
            # stages applied to the 21-dimensional feature vector
            nn.Linear(48, num_classes),
            nn.Softmax(dim=1)  # emits class probabilities, not logits
            )

    def forward(self, x):
        x = x.view(x.size(0), 1, x.size(1))
        x = self.layer(x)
        return x
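
A minimal training-loop sketch, assuming `inputs` is an (N, 21) float array and `labels` the corresponding string labels; since Net already ends in Softmax, NLLLoss over the log of its output is the matching criterion:

python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

dataset = SpeechDataset(inputs, labels)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = Net(in_channels=1, num_classes=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()  # expects log-probabilities

for epoch in range(500):
    for data, label in loader:
        optimizer.zero_grad()
        probs = model(data.float())
        loss = criterion(torch.log(probs + 1e-8), label)  # eps guards log(0)
        loss.backward()
        optimizer.step()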

$ Training stage

Device

  • linux

Data visualization

Datasets

  • 420 samples in total, split into train (80%) and test (20%)

  • Labels

    python
       # 2 classes
       self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 1, '4Silence': 1}
       
       # 4 classes
       self.map_dict = {'1BabyCry': 0, '2BabyLaugh': 1, '3Noise': 2, '4Silence': 3}

Notes

  • Conv1d needs input of shape [B, C=1, D=feature_dim]

    python
    def forward(self, x):
        x = x.view(x.size(0), 1, x.size(1))
        x = self.layer(x)
        return x

$$ Accuracy & Model Size

| Model | SVM | Random Forest | Neural Network |
| --- | --- | --- | --- |
| Accuracy (%) | 97.7 | 96.5 | 84 |
| ==Model Size (KB)== | 42 | 214 | 5 |
  • SVM: perf, model = mt.train_svm_model() # {'accuracy': 0.9770114942528736, 'recall': 0.9761904761904762, 'precision': 0.9782608695652174, 'f1': 0.9763888888888889}
  • RF: perf, model = mt.train_rf_model() # {'accuracy': 0.9655172413793104, 'recall': 0.9648268398268398, 'precision': 0.9657608695652173, 'f1': 0.965040650406504}
  • NN: best_epoch: 480, max_acc: 0.8560919540229885 (a sketch of collecting these metrics follows)
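
The metric dicts above look like sklearn.metrics output; a sketch of how they can be collected (the macro averaging is an assumption, and `mt.train_svm_model()` / `mt.train_rf_model()` are the author's helpers, not shown here):

python
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def evaluate(clf, X_test, y_test):
    """Collect the four metrics reported above."""
    y_pred = clf.predict(X_test)
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred, average='macro'),
        'precision': precision_score(y_test, y_pred, average='macro'),
        'f1': f1_score(y_test, y_pred, average='macro'),
    }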

$ Deployment stage

Device

  • embedded T31 chip

FFT

Scoring method result

(result plots: noise as input vs. baby crying as input)

Compiled C program size

  • 32KB

  • debug code example

    c
    	// FFT butterfly computation
    	for (L = 1; L <= M; L++)
    	{
    		B = (int)(pow(2, L - 1));	// at stage L, the two inputs of each butterfly are B = 2^(L-1) points apart
    		for (J = 0; J < B; J++)
    		{
    			P = (int)(J*pow(2, M - L));	// each stage has B twiddle factors; each drives 2^(M-L) butterflies
    			for (k = J; k < N; k = (int)(k + pow(2, L)))
    			{
    				K1 = k + B;
    				complex wn, t;
    				Wn_i(N, P, &wn);				// twiddle factor W_N^P
    				c_chengfa(f[K1], wn, &t);		// t = f[K1] * wn (complex multiply)
    				c_jianfa(f[k], t, &(f[K1]));	// f[K1] = f[k] - t (butterfly)
    				c_jiafa(f[k], t, &(f[k]));		// f[k]  = f[k] + t
    			}
    		}
    	}
    	
    	for (int i = 0; i < N; i++)					// print FFT magnitudes
    	{
    		y[i] = sqrt(f[i].real*f[i].real + f[i].imag*f[i].imag);
    		printf("%d %lf\n", i, y[i]);
    	}

Export the sklearn model for deployment

python
# sklearn-porter version compatibility: https://github.com/nok/sklearn-porter/issues/82
# pip install scikit-learn==0.22
# a Python 3.6 environment works best

# ! export to c language in windows
from sklearn_porter import Porter
porter = Porter(model, language='c') # export straight to C source
output = porter.export(embed_data=True)
with open('svm_infer.c', 'w') as f:
	f.write(output)

Inference in C

c
int main(int argc, const char * argv[]) {
    /* Features: */
    double features[21] = {0.08548660455336426,3.7481414308053336,0.0031232076831082135,0.04624957396314546,3676.3404738124937,3480.6022753882544,8677.356268579611,0.0015091497916728258,-246.88226704105966,87.94723203609043,-69.86268052131344,-0.34579994437744116,-42.9471282727914,4.457187434533733,-17.901607875608402,21.03675732334522,6.356159473281781,3.2514479635033497,-7.103540737790745,10.796371942992913,-10.94354612752694}; // cry 0

    // labels
    char* labels[4] = {"1BabyCry", "2BabyLaugh", "3Noise", "4Silence"};

    // StandardScaler statistics (mean/variance) from the training set
    double mean[21] = {5.19192940e-02,  9.96640933e+00,  1.70053242e-02,  2.81007237e-02,
        2.43518755e+03,  3.18738345e+03,  6.77396457e+03,  4.17157508e-02,
        -3.11171803e+02,  1.12272466e+02, -1.03072190e+01,  1.78713296e+01,
        -4.93390880e+00,  3.33015672e+00,  8.81093827e-01,  2.69630493e+00,
        1.28616258e+00,  2.27378571e+00, -2.89055164e-01, -2.95153293e-01,-1.40412454e-01};
    
    double var[21] = {1.86645763e-03, 2.01389259e+02, 6.28560738e-04, 5.46686531e-04,
        8.46803415e+05, 1.40065049e+06, 6.21648618e+06, 2.62907671e-02,
        8.45015467e+03, 1.96557435e+03, 2.09598131e+03, 2.57818971e+02,
        3.29287434e+02, 2.32575174e+02, 1.91351733e+02, 1.57938136e+02,
        1.43058889e+02, 1.03135708e+02, 8.86539659e+01, 7.32728827e+01,
        6.26621389e+01};
        
    // standardize: (x - mean) / sqrt(var), matching sklearn's StandardScaler
    for(int i=0; i<21; i++){
        features[i] = (features[i]-mean[i]) / sqrt(var[i]);
    }
    
    /* Prediction: */
    int class_id = predict(features);
    char* label = labels[class_id];
    printf("class_id:%d \t label:%s\n", class_id, label);
    printf("Done.\n");

    return 0;
}
shell
output:
    class_id:0       label:1BabyCry
    Done.
    56K     run

Notes

  • gcc link order matters: list object files before the libraries they depend on
  • gcc needs link libraries specified explicitly, e.g. -lm for <math.h> and -lpthread for POSIX threads; so: gcc svm_infer.c -o run -lm
python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# use fit_transform() on the training set
X_train = scaler.fit_transform(X_train)
# use transform() only on the test set
X_test = scaler.transform(X_test)

Or reproduce the scaling manually with NumPy for testing:
        # * StandardScaler
        features = features.reshape(1, -1)
        mean_ = np.array([ 5.19192940e-02,  9.96640933e+00,  1.70053242e-02,  2.81007237e-02,
            2.43518755e+03,  3.18738345e+03,  6.77396457e+03,  4.17157508e-02,
            -3.11171803e+02,  1.12272466e+02, -1.03072190e+01,  1.78713296e+01,
            -4.93390880e+00,  3.33015672e+00,  8.81093827e-01,  2.69630493e+00,
            1.28616258e+00,  2.27378571e+00, -2.89055164e-01, -2.95153293e-01,-1.40412454e-01])
        
        var_ = np.array([1.86645763e-03, 2.01389259e+02, 6.28560738e-04, 5.46686531e-04,
            8.46803415e+05, 1.40065049e+06, 6.21648618e+06, 2.62907671e-02,
            8.45015467e+03, 1.96557435e+03, 2.09598131e+03, 2.57818971e+02,
            3.29287434e+02, 2.32575174e+02, 1.91351733e+02, 1.57938136e+02,
            1.43058889e+02, 1.03135708e+02, 8.86539659e+01, 7.32728827e+01,
            6.26621389e+01])
        
        features = (features - mean_) / np.sqrt(var_)
      

$ References

$ TODO

  • refinement of Conv1d network

  • C/C++ implementation of STFT feature extraction

  • C/C++ implementation of STFT, MFCC, and other feature extraction

    a C/C++ reimplementation of the melspectrogram computation from the Python audio library librosa:

    github.com/xiaominfc/m...

  • C/C++ implementations of SVM/RF/NN inference

    github.com/arnaudsj/li...

    github.com/livey/svm-e... linear kernel doesn't work well; adding an RBF kernel would suffice, but it is binary-only

    github.com/koba-jon/sv... RBF works

    github.com/jgreitemann...

    github.com/koba-jon/sv... worth referencing

    github.com/cjlin1/libs... reference

    c++
    static double dot(const svm_node *px, const svm_node *py);
    double kernel_linear(int i, int j) const
    {
        return dot(x[i],x[j]);
    }
    double kernel_poly(int i, int j) const
    {
        return powi(gamma*dot(x[i],x[j])+coef0,degree);
    }
    double kernel_rbf(int i, int j) const
    {
        return exp(-gamma*(x_square[i]+x_square[j]-2*dot(x[i],x[j])));
    }
    double kernel_sigmoid(int i, int j) const
    {
        return tanh(gamma*dot(x[i],x[j])+coef0);
    }
    double kernel_precomputed(int i, int j) const
    {
        return x[i][(int)(x[j][0].value)].value;
    }