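Silence detection separates speech frames from silence or noise frames. It is widely used for noise reduction in calls and live streams, for segmenting recordings, and as a front end for speech recognition.
I. The basic logic of silence detection:
- Framing: cut the continuous audio into short frames (typically 10 ms or 20 ms, short enough that the speech signal is approximately stationary within one frame)
- Feature extraction: compute a feature for each frame, most commonly its energy
- Threshold comparison: feature ≤ threshold → silence; feature > threshold → speech
- Smoothing (debouncing): a single misclassified frame is filtered out; the silence/speech state switches only after several consecutive frames agree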
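The framing and debouncing steps above can be sketched as follows. This is a minimal illustration, not code from any library: `FrameLength`, `Debouncer`, `Smooth`, and the `min_run` parameter are names assumed here for the example.

```cpp
#include <cstddef>
#include <vector>

// Number of samples in one frame, e.g. 16000 Hz * 10 ms = 160 samples.
constexpr std::size_t FrameLength(int sample_rate_hz, int frame_ms) {
  return static_cast<std::size_t>(sample_rate_hz) * frame_ms / 1000;
}

// Switches state only after `min_run` consecutive frames disagree with the
// current state (hypothetical helper, symmetric in both directions).
class Debouncer {
 public:
  explicit Debouncer(int min_run) : min_run_(min_run) {}

  // `raw_is_speech` is the per-frame threshold decision. Returns the
  // smoothed state after seeing this frame.
  bool Update(bool raw_is_speech) {
    if (raw_is_speech == is_speech_) {
      run_ = 0;  // The frame agrees with the current state.
    } else if (++run_ >= min_run_) {
      is_speech_ = raw_is_speech;  // Enough consecutive frames: switch.
      run_ = 0;
    }
    return is_speech_;
  }

 private:
  int min_run_;
  int run_ = 0;
  bool is_speech_ = false;  // Start in the silence state.
};

// Applies the debouncer to a sequence of raw per-frame decisions.
std::vector<bool> Smooth(const std::vector<bool>& raw, int min_run) {
  Debouncer debouncer(min_run);
  std::vector<bool> out;
  out.reserve(raw.size());
  for (bool r : raw) out.push_back(debouncer.Update(r));
  return out;
}
```

With `min_run = 3`, an isolated speech frame inside silence is ignored, and the state flips only on the third consecutive speech frame.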
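II. Several silence-detection principles:
2.1 Basic energy method
This method decides from the time-domain amplitude or short-time energy: during silence the microphone picks up only ambient noise and the waveform amplitude is very small, while speech makes the amplitude markedly larger.
Formulas:
Let a frame contain $N$ samples, with $x(n)$ the value of the $n$-th sample.
Short-time energy of the frame: $E = \sum_{n=0}^{N-1} x(n)^2$
Short-time average magnitude: $M = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|$
Step 1: use the formulas above to compute a baseline noise value $E_{noise}$, for example from frames that contain only background noise.
Step 2: set a threshold with adjustable sensitivity, such as $T = K \cdot E_{noise}$, where $K$ is a tunable factor chosen to suit the actual environment.
Step 3: decision rule: if $E \le T$ the frame is silence; otherwise it contains sound.
As the formulas and the decision rule show, this method is crude. Because it looks only at loudness, it misses speech that is too quiet and cannot tell car noise from speech.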
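The energy method can be sketched in a few lines. This is an illustration under assumptions: samples are 16-bit PCM, and `noise_energy` and `k` are supplied by the caller as the baseline and sensitivity factor from steps 1 and 2; the function names are made up for the example.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Short-time energy: E = sum of x(n)^2 over the frame.
double ShortTimeEnergy(const std::vector<int16_t>& frame) {
  double energy = 0.0;
  for (int16_t x : frame) energy += static_cast<double>(x) * x;
  return energy;
}

// Short-time average magnitude: M = (1/N) * sum of |x(n)|.
double AverageMagnitude(const std::vector<int16_t>& frame) {
  if (frame.empty()) return 0.0;
  double sum = 0.0;
  for (int16_t x : frame) sum += std::abs(static_cast<double>(x));
  return sum / static_cast<double>(frame.size());
}

// Step 3: the frame is silence when E <= T, with T = K * E_noise.
bool IsSilence(const std::vector<int16_t>& frame, double noise_energy,
               double k) {
  return ShortTimeEnergy(frame) <= k * noise_energy;
}
```

Raising `k` makes the detector less sensitive (more frames count as silence); lowering it lets quieter sounds through at the cost of more false triggers.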
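2.2 Zero-crossing-rate assisted method
Zero-crossing rate (ZCR): the number of times the sampled signal crosses the zero level within one frame.
This method mainly serves to separate noise from speech, based on how each kind of sound behaves:
- White noise / fan noise: irregular waveform, very high ZCR
- Speech (voiced sounds): dominated by low frequencies, smooth waveform, low ZCR
Formula:
Let a frame contain $N$ samples; two adjacent samples with opposite signs produce one zero crossing.
$ZCR = \frac{1}{2N} \sum_{n=1}^{N-1} \left| \operatorname{sgn}(x(n)) - \operatorname{sgn}(x(n-1)) \right|$
where the sign function is $\operatorname{sgn}(x) = 1$ for $x \ge 0$ and $\operatorname{sgn}(x) = -1$ for $x < 0$.
Put simply: each crossing contributes 2 to the sum, so the sum is twice the number of zero crossings in the frame, and dividing by $2N$ normalizes the result to the range $[0, 1]$.
This method can pick out some abrupt, noise-like sounds, but steady low-level noise may still slip through. It combines well with the energy method: use energy for a first-pass decision, then use ZCR to reject part of the noise.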
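A direct implementation of the ZCR formula might look like the sketch below. It assumes 16-bit PCM input; `Sign` and `ZeroCrossingRate` are names chosen for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// sgn(x) = 1 for x >= 0, -1 otherwise.
inline int Sign(int16_t x) { return x >= 0 ? 1 : -1; }

// ZCR = (1 / 2N) * sum over n of |sgn(x(n)) - sgn(x(n-1))|.
double ZeroCrossingRate(const std::vector<int16_t>& frame) {
  if (frame.size() < 2) return 0.0;
  int sum = 0;
  for (std::size_t n = 1; n < frame.size(); ++n) {
    const int d = Sign(frame[n]) - Sign(frame[n - 1]);
    sum += d < 0 ? -d : d;  // |d| is 2 at a crossing and 0 otherwise.
  }
  return static_cast<double>(sum) /
         (2.0 * static_cast<double>(frame.size()));
}
```

For a frame that alternates sign on every sample, such as `{100, -100, 100, -100}`, there are 3 crossings among 4 samples and the function returns 0.75; a frame that never changes sign returns 0.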
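2.3 Frequency-domain energy method
The time domain only sees overall loudness; the frequency domain can separate the speech band from the noise bands. Working on the FFT spectrum improves the results considerably.
The principle:
- Apply an FFT to the audio frame to obtain its spectrum
- Effective speech band: 300 Hz to 3400 Hz (the telephony voice band)
- Sum the energy inside this band only; low-frequency wind noise and high-frequency electronic noise are ignored
- If the speech-band energy exceeds an adaptive threshold, the frame is judged to be speech
The method loosely resembles a psychoacoustic model in that it attends only to the frequencies that matter, here the band where speech energy is concentrated. Because it discards much of the interference outside the speech band, it is usually considerably more accurate than the time-domain energy method in noisy environments.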
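Key formulas:
1. FFT spectrum
Let the FFT output complex bins $X_k$, where $k$ is the bin index. The power of a single bin is:
$P_k = \operatorname{Re}(X_k)^2 + \operatorname{Im}(X_k)^2$
- Re = Real, the real part of the complex value
- Im = Imaginary, the imaginary part of the complex value
2. Total speech-band energy
$k_{low} \approx 300 \cdot N_{fft} / f_s$: the FFT index corresponding to 300 Hz
$k_{high} \approx 3400 \cdot N_{fft} / f_s$: the FFT index corresponding to 3400 Hz
Here $N_{fft}$ is the FFT length and $f_s$ the sampling rate. Sum the power over every bin in the speech band:
$E_{speech} = \sum_{k=k_{low}}^{k_{high}} P_k$
3. Adaptive update of the noise baseline
While frames are continuously judged to be silence, slowly update the background-noise energy:
$E_{noise} \leftarrow \alpha \, E_{noise} + (1 - \alpha) \, E_{speech}$
The smoothing coefficient $\alpha$ lies in $(0, 1)$; the more slowly the noise changes, the larger the value to use.
4. Decision condition
Choose a threshold coefficient $\beta$ (typically 2 to 8, adjusting the sensitivity) and judge the frame as speech when $E_{speech} > \beta \cdot E_{noise}$.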
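The four steps of the frequency-domain method can be put together as in the sketch below. Several things here are assumptions made for the example: the names `SpeechBandEnergy` and `BandVad`, the use of a direct per-bin DFT instead of a real FFT (to keep the block self-contained), and updating the noise baseline on every silent frame rather than only after a run of silent frames.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Speech-band energy: sum of P_k = Re(X_k)^2 + Im(X_k)^2 over the bins
// between 300 Hz and 3400 Hz. A direct DFT per bin is used for brevity;
// a production implementation would use an FFT library.
double SpeechBandEnergy(const std::vector<double>& frame,
                        double sample_rate_hz) {
  const std::size_t n = frame.size();
  const double kPi = 3.14159265358979323846;
  const std::size_t k_low =
      static_cast<std::size_t>(std::lround(300.0 * n / sample_rate_hz));
  const std::size_t k_high =
      static_cast<std::size_t>(std::lround(3400.0 * n / sample_rate_hz));
  double energy = 0.0;
  for (std::size_t k = k_low; k <= k_high && k <= n / 2; ++k) {
    std::complex<double> xk(0.0, 0.0);
    for (std::size_t t = 0; t < n; ++t) {
      xk += frame[t] * std::polar(1.0, -2.0 * kPi * k * t / n);
    }
    energy += std::norm(xk);  // std::norm returns Re^2 + Im^2.
  }
  return energy;
}

// Adaptive-threshold decision (steps 3 and 4).
struct BandVad {
  double noise_energy;  // E_noise, the running noise baseline.
  double alpha;         // Smoothing coefficient in (0, 1).
  double beta;          // Threshold coefficient, e.g. 2 to 8.

  bool IsSpeech(double band_energy) {
    const bool speech = band_energy > beta * noise_energy;
    if (!speech) {
      // Simplification: update the baseline on every silent frame.
      noise_energy = alpha * noise_energy + (1.0 - alpha) * band_energy;
    }
    return speech;
  }
};
```

With an 8 kHz sample rate and 80-sample frames, bins are 100 Hz apart, so the speech band covers bins 3 through 34: a 1 kHz tone lands inside the band, while a 100 Hz hum falls outside it and contributes almost nothing.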
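III. WebRTC VAD
The VAD (Voice Activity Detection) in WebRTC is based mainly on GMMs (Gaussian mixture models) and spectral feature analysis. The core idea is to compare each audio frame's feature vector against a "speech model" and a "noise model", compute the likelihood under each, and decide from the result.
3.1 Overall pipeline
- Preprocessing: downsampling, framing, windowing.
- Feature extraction: extract the key features that separate speech from noise, in the time domain and the frequency domain.
- Model matching: use the GMMs to compute the probability that the features belong to speech or to noise.
- Decision logic: combine the probabilities, energy thresholds, and history state (hangover) to reach the final decision.
3.2 Preprocessing
• Downsampling: whatever the input sample rate (8, 16, 32, or 48 kHz), the VAD internally downsamples the signal to 8 kHz.
• Reason: most of the energy and information in speech sits in the low band (0 to 4 kHz). Lowering the sample rate cuts the computation substantially with little effect on VAD accuracy.
• Framing: divide the continuous signal into short frames. WebRTC VAD supports frame lengths of 10 ms, 20 ms, and 30 ms.
• Windowing: when a frame is transformed with an FFT, a Hamming window is commonly applied to reduce spectral leakage.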
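3.3 Feature extraction
The features below are the ones commonly used to separate speech from noise. The GMM code shown in 3.6 itself takes one feature per frequency sub-band (`features[channel]`, which in WebRTC are the log energies of six sub-bands below 4 kHz) together with the frame's total power (`total_power`).
- Total energy:
• The logarithm of the sum of squares of all samples in the frame.
• $E = \log \sum_{n=0}^{N-1} x(n)^2$
• Purpose: silent frames usually have very low energy.
- Zero-crossing rate (ZCR):
• The number of times the signal crosses the zero axis.
• Purpose: unvoiced sounds (such as /s/, /f/) and noise usually have a high ZCR, while voiced sounds (such as /a/, /o/) have a low ZCR.
- Spectral slope:
• Fit the spectral envelope with a linear regression and take the slope.
• Purpose: the speech spectrum usually falls as frequency rises (negative slope), while the white-noise spectrum is fairly flat.
- Spectral flatness:
• The ratio of the geometric mean to the arithmetic mean of the spectrum.
• Purpose: measures whether the spectrum looks tonal (clear peaks) or noise-like (flat).
- Sub-band energy ratio:
• Split the spectrum of the 8 kHz signal (0 to 4 kHz) into several sub-bands (for example low, mid, high).
• Compute each sub-band's share of the total energy.
• Purpose: speech usually concentrates its energy in the low sub-bands (such as 0-500 Hz and 500-1000 Hz), whereas high-frequency noise puts more energy in the high sub-bands.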
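Two of the features above, spectral flatness and the sub-band energy ratio, can be computed from a power spectrum as in this sketch. It is a generic illustration rather than WebRTC code; the function names and the `kFloor` constant are assumptions made for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Spectral flatness = geometric mean / arithmetic mean of the power
// spectrum. Close to 1 for a flat (noise-like) spectrum, close to 0 for a
// peaky (tonal) one.
double SpectralFlatness(const std::vector<double>& power) {
  if (power.empty()) return 0.0;
  const double kFloor = 1e-12;  // Hypothetical floor that keeps log() finite.
  double log_sum = 0.0;
  double sum = 0.0;
  for (double p : power) {
    const double v = p > kFloor ? p : kFloor;
    log_sum += std::log(v);
    sum += v;
  }
  const double n = static_cast<double>(power.size());
  return std::exp(log_sum / n) / (sum / n);
}

// Share of the total energy that falls in bins [first, last] (inclusive).
double SubbandEnergyRatio(const std::vector<double>& power, std::size_t first,
                          std::size_t last) {
  double band = 0.0;
  double total = 0.0;
  for (std::size_t k = 0; k < power.size(); ++k) {
    total += power[k];
    if (k >= first && k <= last) band += power[k];
  }
  return total > 0.0 ? band / total : 0.0;
}
```

The geometric mean is computed through the average of logarithms, which avoids overflow or underflow when many bins are multiplied together.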
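3.4 Model matching (Gaussian mixture model (GMM) classification)
This is the core of WebRTC VAD. It maintains two separate GMMs:
• Speech model ($\lambda_s$): trained on a large amount of clean speech data.
• Noise model ($\lambda_n$): trained on a variety of background-noise data.
Each model is a mixture of several Gaussians: $p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)$, where $x$ is the feature vector and $w_i$, $\mu_i$, $\Sigma_i$ are the weight, mean, and covariance of the $i$-th component.
WebRTC has also introduced a VAD based on a recurrent neural network (RNN) (modules/audio_processing/agc2/rnn_vad/). It is not covered in detail here; interested readers can study it on their own.
Computation:
- For the feature vector $x$ of the current frame, compute its log-likelihood under the speech model, $\log p(x \mid \lambda_s)$, and under the noise model, $\log p(x \mid \lambda_n)$.
- Compute the log-likelihood ratio: $LLR = \log p(x \mid \lambda_s) - \log p(x \mid \lambda_n)$
- If $LLR > \theta$ for a chosen threshold $\theta$, the frame leans toward speech; otherwise it is treated as noise.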
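The likelihood-ratio computation can be illustrated in floating point for a single scalar feature. This is a sketch for understanding only: WebRTC's implementation is fixed-point, and the struct, function names, and model parameters here are invented for the example.

```cpp
#include <cmath>
#include <vector>

// One mixture component of a one-dimensional GMM.
struct Gaussian1D {
  double weight;  // w_i
  double mean;    // mu_i
  double stddev;  // sigma_i
};

// p(x | model) = sum_i w_i * N(x; mu_i, sigma_i^2) for a scalar feature x.
double GmmLikelihood(const std::vector<Gaussian1D>& model, double x) {
  const double kSqrt2Pi = 2.5066282746310002;
  double p = 0.0;
  for (const Gaussian1D& g : model) {
    const double z = (x - g.mean) / g.stddev;
    p += g.weight * std::exp(-0.5 * z * z) / (g.stddev * kSqrt2Pi);
  }
  return p;
}

// LLR = log p(x | speech) - log p(x | noise). Positive values favor speech.
double LogLikelihoodRatio(const std::vector<Gaussian1D>& speech,
                          const std::vector<Gaussian1D>& noise, double x) {
  const double kTiny = 1e-300;  // Guards against log(0).
  return std::log(GmmLikelihood(speech, x) + kTiny) -
         std::log(GmmLikelihood(noise, x) + kTiny);
}
```

A frame would then be classified as speech when `LogLikelihoodRatio(...)` exceeds the chosen threshold; a feature value close to the speech means gives a positive ratio, one close to the noise means a negative ratio.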
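3.5 Decision logic and smoothing
Raw frame-by-frame decisions are easily disturbed by transient noise and tend to jitter, so a state machine and smoothing are added:
- Adaptive threshold:
• The threshold is not fixed; it is adjusted dynamically according to the background-noise level.
• In a quiet environment the threshold is lower, so weak speech is detected more easily.
• In a noisy environment the threshold is raised to keep noise from triggering false detections.
- Hangover mechanism:
• Speech-to-silence transition: once frames start being classified as noise, the state does not switch to silence immediately; it enters a "hangover" state and keeps reporting speech for a few more frames (for example 3 to 5).
• Purpose: avoid cutting off the tail of an utterance (such as a final consonant).
• Silence-to-speech transition: several consecutive frames must be classified as speech before the state formally switches to speech.
• Purpose: keep brief bursts of noise (such as a door slamming) from being mistaken for the start of speech.
- Mode selection (modes): WebRTC provides four modes, which essentially adjust the thresholds and hangover length described above. The higher the mode, the more aggressively non-speech is rejected:
• Quality (mode 0): the least aggressive; borderline frames are kept as speech.
• Low Bitrate (mode 1): classifies frames as silence more readily (saves bandwidth), with a shorter hangover.
• Aggressive (mode 2): stricter still, with higher thresholds, so more borderline frames are classified as non-speech.
• Very Aggressive (mode 3): the strictest; it reports speech only when the evidence is strong, at the risk of clipping weak speech.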
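The hangover idea can be sketched as a small counter-based smoother. The class below is loosely modeled on the `over_hang` / `num_of_speech` logic at the end of `GmmProbability` in 3.6, but it is a simplified illustration: the class name, the boolean return value, and the constructor parameters are assumptions for this example.

```cpp
// Keeps reporting speech for a few frames after the raw decision drops to
// noise. A long run of speech earns a longer hangover than a short burst.
class Hangover {
 public:
  Hangover(int short_hang, int long_hang, int long_run)
      : short_hang_(short_hang), long_hang_(long_hang), long_run_(long_run) {}

  // `raw_is_speech` is the unsmoothed per-frame decision.
  bool Update(bool raw_is_speech) {
    if (raw_is_speech) {
      if (speech_run_ < long_run_) ++speech_run_;
      // Reload the hangover counter on every speech frame.
      remaining_ = (speech_run_ >= long_run_) ? long_hang_ : short_hang_;
      return true;
    }
    speech_run_ = 0;
    if (remaining_ > 0) {
      --remaining_;  // Still inside the hangover: report speech.
      return true;
    }
    return false;
  }

 private:
  int short_hang_;
  int long_hang_;
  int long_run_;
  int speech_run_ = 0;  // Consecutive speech frames seen, capped at long_run_.
  int remaining_ = 0;   // Hangover frames left.
};
```

With `Hangover h(2, 4, 3)`, a single speech frame is followed by 2 extra frames reported as speech, while a run of 3 or more speech frames is followed by 4. This smoother only delays the speech-to-silence transition; it does not delay the onset of speech.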
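3.6 Implementation source code
1. Feature extraction. The function below, `FeaturesExtractor::CheckSilenceComputeFeatures`, works on 24 kHz float frames and belongs to the RNN VAD mentioned in 3.4 rather than to the GMM VAD. It computes the LP residual, estimates the pitch period, fills the feature vector with the pitch and spectral features, and returns whether silence was detected.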
cpp
bool FeaturesExtractor::CheckSilenceComputeFeatures(
    rtc::ArrayView<const float, kFrameSize10ms24kHz> samples,
    rtc::ArrayView<float, kFeatureVectorSize> feature_vector) {
  // Pre-processing.
  if (use_high_pass_filter_) {
    std::array<float, kFrameSize10ms24kHz> samples_filtered;
    hpf_.Process(samples, samples_filtered);
    // Feed buffer with the pre-processed version of |samples|.
    pitch_buf_24kHz_.Push(samples_filtered);
  } else {
    // Feed buffer with |samples|.
    pitch_buf_24kHz_.Push(samples);
  }
  // Extract the LP residual.
  float lpc_coeffs[kNumLpcCoefficients];
  ComputeAndPostProcessLpcCoefficients(pitch_buf_24kHz_view_, lpc_coeffs);
  ComputeLpResidual(lpc_coeffs, pitch_buf_24kHz_view_, lp_residual_view_);
  // Estimate pitch on the LP-residual and write the normalized pitch period
  // into the output vector (normalization based on training data stats).
  pitch_info_48kHz_ = pitch_estimator_.Estimate(lp_residual_view_);
  feature_vector[kFeatureVectorSize - 2] =
      0.01f * (static_cast<int>(pitch_info_48kHz_.period) - 300);
  // Extract lagged frames (according to the estimated pitch period).
  RTC_DCHECK_LE(pitch_info_48kHz_.period / 2, kMaxPitch24kHz);
  auto lagged_frame = pitch_buf_24kHz_view_.subview(
      kMaxPitch24kHz - pitch_info_48kHz_.period / 2, kFrameSize20ms24kHz);
  // Analyze reference and lagged frames checking if silence has been detected
  // and write the feature vector.
  return spectral_features_extractor_.CheckSilenceComputeFeatures(
      reference_frame_view_, {lagged_frame.data(), kFrameSize20ms24kHz},
      {feature_vector.data() + kNumLowerBands, kNumBands - kNumLowerBands},
      {feature_vector.data(), kNumLowerBands},
      {feature_vector.data() + kNumBands, kNumLowerBands},
      {feature_vector.data() + kNumBands + kNumLowerBands, kNumLowerBands},
      {feature_vector.data() + kNumBands + 2 * kNumLowerBands, kNumLowerBands},
      &feature_vector[kFeatureVectorSize - 1]);
}
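2. The core function, `GmmProbability`, computes the GMM probabilities. It evaluates the Gaussian probability densities using predefined coefficient tables, and it picks its thresholds according to the frame length (80, 160, or 240 samples) and the configured mode.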
cpp
static int16_t GmmProbability(VadInstT* self, int16_t* features,
                              int16_t total_power, size_t frame_length) {
  int channel, k;
  int16_t feature_minimum;
  int16_t h0, h1;
  int16_t log_likelihood_ratio;
  int16_t vadflag = 0;
  int16_t shifts_h0, shifts_h1;
  int16_t tmp_s16, tmp1_s16, tmp2_s16;
  int16_t diff;
  int gaussian;
  int16_t nmk, nmk2, nmk3, smk, smk2, nsk, ssk;
  int16_t delt, ndelt;
  int16_t maxspe, maxmu;
  int16_t deltaN[kTableSize], deltaS[kTableSize];
  int16_t ngprvec[kTableSize] = { 0 };  // Conditional probability = 0.
  int16_t sgprvec[kTableSize] = { 0 };  // Conditional probability = 0.
  int32_t h0_test, h1_test;
  int32_t tmp1_s32, tmp2_s32;
  int32_t sum_log_likelihood_ratios = 0;
  int32_t noise_global_mean, speech_global_mean;
  int32_t noise_probability[kNumGaussians], speech_probability[kNumGaussians];
  int16_t overhead1, overhead2, individualTest, totalTest;
  // Set various thresholds based on frame lengths (80, 160 or 240 samples).
  if (frame_length == 80) {
    overhead1 = self->over_hang_max_1[0];
    overhead2 = self->over_hang_max_2[0];
    individualTest = self->individual[0];
    totalTest = self->total[0];
  } else if (frame_length == 160) {
    overhead1 = self->over_hang_max_1[1];
    overhead2 = self->over_hang_max_2[1];
    individualTest = self->individual[1];
    totalTest = self->total[1];
  } else {
    overhead1 = self->over_hang_max_1[2];
    overhead2 = self->over_hang_max_2[2];
    individualTest = self->individual[2];
    totalTest = self->total[2];
  }
  if (total_power > kMinEnergy) {
    // The signal power of current frame is large enough for processing. The
    // processing consists of two parts:
    // 1) Calculating the likelihood of speech and thereby a VAD decision.
    // 2) Updating the underlying model, w.r.t., the decision made.
    // The detection scheme is an LRT with hypothesis
    // H0: Noise
    // H1: Speech
    //
    // We combine a global LRT with local tests, for each frequency sub-band,
    // here defined as |channel|.
    for (channel = 0; channel < kNumChannels; channel++) {
      // For each channel we model the probability with a GMM consisting of
      // |kNumGaussians|, with different means and standard deviations depending
      // on H0 or H1.
      h0_test = 0;
      h1_test = 0;
      for (k = 0; k < kNumGaussians; k++) {
        gaussian = channel + k * kNumChannels;
        // Probability under H0, that is, probability of frame being noise.
        // Value given in Q27 = Q7 * Q20.
        tmp1_s32 = WebRtcVad_GaussianProbability(features[channel],
                                                 self->noise_means[gaussian],
                                                 self->noise_stds[gaussian],
                                                 &deltaN[gaussian]);
        noise_probability[k] = kNoiseDataWeights[gaussian] * tmp1_s32;
        h0_test += noise_probability[k];  // Q27
        // Probability under H1, that is, probability of frame being speech.
        // Value given in Q27 = Q7 * Q20.
        tmp1_s32 = WebRtcVad_GaussianProbability(features[channel],
                                                 self->speech_means[gaussian],
                                                 self->speech_stds[gaussian],
                                                 &deltaS[gaussian]);
        speech_probability[k] = kSpeechDataWeights[gaussian] * tmp1_s32;
        h1_test += speech_probability[k];  // Q27
      }
      // Calculate the log likelihood ratio: log2(Pr{X|H1} / Pr{X|H0}).
      // Approximation:
      // log2(Pr{X|H1} / Pr{X|H0}) = log2(Pr{X|H1}*2^Q) - log2(Pr{X|H0}*2^Q)
      //                           = log2(h1_test) - log2(h0_test)
      //                           = log2(2^(31-shifts_h1)*(1+b1))
      //                             - log2(2^(31-shifts_h0)*(1+b0))
      //                           = shifts_h0 - shifts_h1
      //                             + log2(1+b1) - log2(1+b0)
      //                          ~= shifts_h0 - shifts_h1
      //
      // Note that b0 and b1 are values less than 1, hence, 0 <= log2(1+b0) < 1.
      // Further, b0 and b1 are independent and on the average the two terms
      // cancel.
      shifts_h0 = WebRtcSpl_NormW32(h0_test);
      shifts_h1 = WebRtcSpl_NormW32(h1_test);
      if (h0_test == 0) {
        shifts_h0 = 31;
      }
      if (h1_test == 0) {
        shifts_h1 = 31;
      }
      log_likelihood_ratio = shifts_h0 - shifts_h1;
      // Update |sum_log_likelihood_ratios| with spectrum weighting. This is
      // used for the global VAD decision.
      sum_log_likelihood_ratios +=
          (int32_t) (log_likelihood_ratio * kSpectrumWeight[channel]);
      // Local VAD decision.
      if ((log_likelihood_ratio * 4) > individualTest) {
        vadflag = 1;
      }
      // TODO(bjornv): The conditional probabilities below are applied on the
      // hard coded number of Gaussians set to two. Find a way to generalize.
      // Calculate local noise probabilities used later when updating the GMM.
      h0 = (int16_t) (h0_test >> 12);  // Q15
      if (h0 > 0) {
        // High probability of noise. Assign conditional probabilities for each
        // Gaussian in the GMM.
        tmp1_s32 = (noise_probability[0] & 0xFFFFF000) << 2;  // Q29
        ngprvec[channel] = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, h0);  // Q14
        ngprvec[channel + kNumChannels] = 16384 - ngprvec[channel];
      } else {
        // Low noise probability. Assign conditional probability 1 to the first
        // Gaussian and 0 to the rest (which is already set at initialization).
        ngprvec[channel] = 16384;
      }
      // Calculate local speech probabilities used later when updating the GMM.
      h1 = (int16_t) (h1_test >> 12);  // Q15
      if (h1 > 0) {
        // High probability of speech. Assign conditional probabilities for each
        // Gaussian in the GMM. Otherwise use the initialized values, i.e., 0.
        tmp1_s32 = (speech_probability[0] & 0xFFFFF000) << 2;  // Q29
        sgprvec[channel] = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, h1);  // Q14
        sgprvec[channel + kNumChannels] = 16384 - sgprvec[channel];
      }
    }
    // Make a global VAD decision.
    vadflag |= (sum_log_likelihood_ratios >= totalTest);
    // Update the model parameters.
    maxspe = 12800;
    for (channel = 0; channel < kNumChannels; channel++) {
      // Get minimum value in past which is used for long term correction in Q4.
      feature_minimum = WebRtcVad_FindMinimum(self, features[channel], channel);
      // Compute the "global" mean, that is the sum of the two means weighted.
      noise_global_mean = WeightedAverage(&self->noise_means[channel], 0,
                                          &kNoiseDataWeights[channel]);
      tmp1_s16 = (int16_t) (noise_global_mean >> 6);  // Q8
      for (k = 0; k < kNumGaussians; k++) {
        gaussian = channel + k * kNumChannels;
        nmk = self->noise_means[gaussian];
        smk = self->speech_means[gaussian];
        nsk = self->noise_stds[gaussian];
        ssk = self->speech_stds[gaussian];
        // Update noise mean vector if the frame consists of noise only.
        nmk2 = nmk;
        if (!vadflag) {
          // deltaN = (x-mu)/sigma^2
          // ngprvec[k] = |noise_probability[k]| /
          //   (|noise_probability[0]| + |noise_probability[1]|)
          // (Q14 * Q11 >> 11) = Q14.
          delt = (int16_t)((ngprvec[gaussian] * deltaN[gaussian]) >> 11);
          // Q7 + (Q14 * Q15 >> 22) = Q7.
          nmk2 = nmk + (int16_t)((delt * kNoiseUpdateConst) >> 22);
        }
        // Long term correction of the noise mean.
        // Q8 - Q8 = Q8.
        ndelt = (feature_minimum << 4) - tmp1_s16;
        // Q7 + (Q8 * Q8) >> 9 = Q7.
        nmk3 = nmk2 + (int16_t)((ndelt * kBackEta) >> 9);
        // Control that the noise mean does not drift to much.
        tmp_s16 = (int16_t) ((k + 5) << 7);
        if (nmk3 < tmp_s16) {
          nmk3 = tmp_s16;
        }
        tmp_s16 = (int16_t) ((72 + k - channel) << 7);
        if (nmk3 > tmp_s16) {
          nmk3 = tmp_s16;
        }
        self->noise_means[gaussian] = nmk3;
        if (vadflag) {
          // Update speech mean vector:
          // |deltaS| = (x-mu)/sigma^2
          // sgprvec[k] = |speech_probability[k]| /
          //   (|speech_probability[0]| + |speech_probability[1]|)
          // (Q14 * Q11) >> 11 = Q14.
          delt = (int16_t)((sgprvec[gaussian] * deltaS[gaussian]) >> 11);
          // Q14 * Q15 >> 21 = Q8.
          tmp_s16 = (int16_t)((delt * kSpeechUpdateConst) >> 21);
          // Q7 + (Q8 >> 1) = Q7. With rounding.
          smk2 = smk + ((tmp_s16 + 1) >> 1);
          // Control that the speech mean does not drift to much.
          maxmu = maxspe + 640;
          if (smk2 < kMinimumMean[k]) {
            smk2 = kMinimumMean[k];
          }
          if (smk2 > maxmu) {
            smk2 = maxmu;
          }
          self->speech_means[gaussian] = smk2;  // Q7.
          // (Q7 >> 3) = Q4. With rounding.
          tmp_s16 = ((smk + 4) >> 3);
          tmp_s16 = features[channel] - tmp_s16;  // Q4
          // (Q11 * Q4 >> 3) = Q12.
          tmp1_s32 = (deltaS[gaussian] * tmp_s16) >> 3;
          tmp2_s32 = tmp1_s32 - 4096;
          tmp_s16 = sgprvec[gaussian] >> 2;
          // (Q14 >> 2) * Q12 = Q24.
          tmp1_s32 = tmp_s16 * tmp2_s32;
          tmp2_s32 = tmp1_s32 >> 4;  // Q20
          // 0.1 * Q20 / Q7 = Q13.
          if (tmp2_s32 > 0) {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(tmp2_s32, ssk * 10);
          } else {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(-tmp2_s32, ssk * 10);
            tmp_s16 = -tmp_s16;
          }
          // Divide by 4 giving an update factor of 0.025 (= 0.1 / 4).
          // Note that division by 4 equals shift by 2, hence,
          // (Q13 >> 8) = (Q13 >> 6) / 4 = Q7.
          tmp_s16 += 128;  // Rounding.
          ssk += (tmp_s16 >> 8);
          if (ssk < kMinStd) {
            ssk = kMinStd;
          }
          self->speech_stds[gaussian] = ssk;
        } else {
          // Update GMM variance vectors.
          // deltaN * (features[channel] - nmk) - 1
          // Q4 - (Q7 >> 3) = Q4.
          tmp_s16 = features[channel] - (nmk >> 3);
          // (Q11 * Q4 >> 3) = Q12.
          tmp1_s32 = (deltaN[gaussian] * tmp_s16) >> 3;
          tmp1_s32 -= 4096;
          // (Q14 >> 2) * Q12 = Q24.
          tmp_s16 = (ngprvec[gaussian] + 2) >> 2;
          tmp2_s32 = OverflowingMulS16ByS32ToS32(tmp_s16, tmp1_s32);
          // Q20 * approx 0.001 (2^-10=0.0009766), hence,
          // (Q24 >> 14) = (Q24 >> 4) / 2^10 = Q20.
          tmp1_s32 = tmp2_s32 >> 14;
          // Q20 / Q7 = Q13.
          if (tmp1_s32 > 0) {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, nsk);
          } else {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(-tmp1_s32, nsk);
            tmp_s16 = -tmp_s16;
          }
          tmp_s16 += 32;  // Rounding
          nsk += tmp_s16 >> 6;  // Q13 >> 6 = Q7.
          if (nsk < kMinStd) {
            nsk = kMinStd;
          }
          self->noise_stds[gaussian] = nsk;
        }
      }
      // Separate models if they are too close.
      // |noise_global_mean| in Q14 (= Q7 * Q7).
      noise_global_mean = WeightedAverage(&self->noise_means[channel], 0,
                                          &kNoiseDataWeights[channel]);
      // |speech_global_mean| in Q14 (= Q7 * Q7).
      speech_global_mean = WeightedAverage(&self->speech_means[channel], 0,
                                           &kSpeechDataWeights[channel]);
      // |diff| = "global" speech mean - "global" noise mean.
      // (Q14 >> 9) - (Q14 >> 9) = Q5.
      diff = (int16_t) (speech_global_mean >> 9) -
          (int16_t) (noise_global_mean >> 9);
      if (diff < kMinimumDifference[channel]) {
        tmp_s16 = kMinimumDifference[channel] - diff;
        // |tmp1_s16| = ~0.8 * (kMinimumDifference - diff) in Q7.
        // |tmp2_s16| = ~0.2 * (kMinimumDifference - diff) in Q7.
        tmp1_s16 = (int16_t)((13 * tmp_s16) >> 2);
        tmp2_s16 = (int16_t)((3 * tmp_s16) >> 2);
        // Move Gaussian means for speech model by |tmp1_s16| and update
        // |speech_global_mean|. Note that |self->speech_means[channel]| is
        // changed after the call.
        speech_global_mean = WeightedAverage(&self->speech_means[channel],
                                             tmp1_s16,
                                             &kSpeechDataWeights[channel]);
        // Move Gaussian means for noise model by -|tmp2_s16| and update
        // |noise_global_mean|. Note that |self->noise_means[channel]| is
        // changed after the call.
        noise_global_mean = WeightedAverage(&self->noise_means[channel],
                                            -tmp2_s16,
                                            &kNoiseDataWeights[channel]);
      }
      // Control that the speech & noise means do not drift to much.
      maxspe = kMaximumSpeech[channel];
      tmp2_s16 = (int16_t) (speech_global_mean >> 7);
      if (tmp2_s16 > maxspe) {
        // Upper limit of speech model.
        tmp2_s16 -= maxspe;
        for (k = 0; k < kNumGaussians; k++) {
          self->speech_means[channel + k * kNumChannels] -= tmp2_s16;
        }
      }
      tmp2_s16 = (int16_t) (noise_global_mean >> 7);
      if (tmp2_s16 > kMaximumNoise[channel]) {
        tmp2_s16 -= kMaximumNoise[channel];
        for (k = 0; k < kNumGaussians; k++) {
          self->noise_means[channel + k * kNumChannels] -= tmp2_s16;
        }
      }
    }
    self->frame_counter++;
  }
  // Smooth with respect to transition hysteresis.
  if (!vadflag) {
    if (self->over_hang > 0) {
      vadflag = 2 + self->over_hang;
      self->over_hang--;
    }
    self->num_of_speech = 0;
  } else {
    self->num_of_speech++;
    if (self->num_of_speech > kMaxSpeechFrames) {
      self->num_of_speech = kMaxSpeechFrames;
      self->over_hang = overhead2;
    } else {
      self->over_hang = overhead1;
    }
  }
  return vadflag;
}
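The comment block in `GmmProbability` approximates the log2 likelihood ratio by `shifts_h0 - shifts_h1`, the difference between the normalization shifts of the two accumulated probabilities. The sketch below illustrates that idea in isolation. `NormShift` is a stand-in written for this example (it is not the library's `WebRtcSpl_NormW32`), and `ApproxLog2Ratio` is a hypothetical wrapper.

```cpp
#include <cstdint>

// Stand-in normalization helper: the number of left shifts needed to move a
// positive 32-bit value's highest set bit up to bit 30. Returns 0 for v <= 0.
int NormShift(int32_t v) {
  if (v <= 0) return 0;
  int shifts = 0;
  while ((v & 0x40000000) == 0) {
    v <<= 1;
    ++shifts;
  }
  return shifts;
}

// Approximates log2(h1 / h0) by the difference of normalization shifts.
// A zero probability is mapped to 31 shifts, as in the function above.
int ApproxLog2Ratio(int32_t h1, int32_t h0) {
  const int shifts_h0 = (h0 == 0) ? 31 : NormShift(h0);
  const int shifts_h1 = (h1 == 0) ? 31 : NormShift(h1);
  return shifts_h0 - shifts_h1;
}
```

For `h1 = 1024` and `h0 = 64` the shifts are 20 and 24, giving 4, which equals log2(1024 / 64) exactly. For `h1 = 1536` and `h0 = 1024` both values normalize with 20 shifts, so the result is 0 although the true value is about 0.58: the fractional terms log2(1+b1) and log2(1+b0) are dropped, which bounds the error below 1.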