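Audio: Silence Detection (VAD)

Silence detection separates speech frames from silence or noise frames. It is widely used for noise suppression in calls, recording segmentation, noise suppression in live streaming, and as a pre-processing step for speech recognition.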

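I. Basic logic of silence detection

  • Framing: split the continuous audio into short frames (commonly 10 ms or 20 ms, short enough that the speech signal can be treated as stationary within a frame)
  • Extract a feature from each frame, for example its energy
  • Compare against a threshold: feature < threshold → silence; feature > threshold → speech
  • Smoothing / debouncing: isolated single-frame misjudgments are filtered out; the silence/speech state only switches after several consecutive frames agree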
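The four steps above can be sketched roughly as follows. This is only a minimal illustration, not a production VAD: the names `FrameEnergy` and `DetectSpeech`, the fixed threshold, and the `min_run` debounce length are all assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Mean squared amplitude of one frame (hypothetical helper).
double FrameEnergy(const std::vector<float>& samples, std::size_t start,
                   std::size_t frame_len) {
  double sum = 0.0;
  for (std::size_t i = start; i < start + frame_len; ++i) {
    sum += static_cast<double>(samples[i]) * samples[i];
  }
  return sum / static_cast<double>(frame_len);
}

// Returns one flag per complete frame: true = speech, false = silence.
// The state only flips after `min_run` consecutive frames disagree with it.
std::vector<bool> DetectSpeech(const std::vector<float>& samples,
                               std::size_t frame_len, double threshold,
                               int min_run) {
  std::vector<bool> flags;
  bool state = false;  // Start in silence.
  int run = 0;         // Consecutive frames contradicting `state`.
  for (std::size_t start = 0; start + frame_len <= samples.size();
       start += frame_len) {
    const bool raw = FrameEnergy(samples, start, frame_len) > threshold;
    if (raw != state) {
      if (++run >= min_run) {
        state = raw;
        run = 0;
      }
    } else {
      run = 0;
    }
    flags.push_back(state);
  }
  return flags;
}
```

With `min_run = 2`, a single loud frame surrounded by silence never changes the reported state, while two loud frames in a row do.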

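II. Principles of several silence detection methods

2.1 Basic energy method

This method decides from the time-domain amplitude / short-time energy: during silence the microphone picks up only ambient noise and the waveform amplitude is very small; speech makes the amplitude clearly larger.

Formulas:

Let a frame contain N samples, with $x(n)$ the n-th sample value.

Short-time energy of a frame:

$$E = \sum_{n=0}^{N-1} x(n)^2$$

Short-time average magnitude:

$$M = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|$$

Step 1: apply the formulas above to noise-only frames to obtain the baseline noise value $E_{noise}$.

Step 2: set a threshold with adjustable sensitivity, $T = K \cdot E_{noise}$, where K is a tunable factor chosen to suit the actual environment.

Step 3: decision rule: if $E \le T$ the frame is silent; otherwise it contains sound.

As the formulas and the rule show, this approach is crude: it misses very quiet speech, and it cannot tell loud non-speech sounds such as car noise from speech.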
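A minimal sketch of the three steps, assuming the baseline is taken as the mean short-time energy of frames known to contain only noise. The function names and that choice of baseline are assumptions for illustration, not something the method prescribes.

```cpp
#include <cmath>
#include <vector>

// Short-time energy: sum of squared samples in the frame.
double ShortTimeEnergy(const std::vector<float>& frame) {
  double energy = 0.0;
  for (float x : frame) energy += static_cast<double>(x) * x;
  return energy;
}

// Short-time average magnitude: mean absolute sample value.
double AverageMagnitude(const std::vector<float>& frame) {
  if (frame.empty()) return 0.0;
  double sum = 0.0;
  for (float x : frame) sum += std::fabs(x);
  return sum / static_cast<double>(frame.size());
}

// Step 1: baseline noise energy, averaged over noise-only frames.
double NoiseBaseline(const std::vector<std::vector<float>>& noise_frames) {
  if (noise_frames.empty()) return 0.0;
  double sum = 0.0;
  for (const auto& frame : noise_frames) sum += ShortTimeEnergy(frame);
  return sum / static_cast<double>(noise_frames.size());
}

// Steps 2 and 3: threshold T = k * noise_energy; silent if E <= T.
bool IsSilent(const std::vector<float>& frame, double noise_energy, double k) {
  return ShortTimeEnergy(frame) <= k * noise_energy;
}
```

A larger `k` makes the detector less sensitive: fewer noise frames are mistaken for sound, but quiet speech is more likely to be dropped.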

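2.2 Zero-crossing rate as an auxiliary feature

Zero-crossing rate (ZCR): the number of times the sampled signal crosses the zero level within one frame.

The method is mainly used to tell noise from speech based on how each typically behaves:

  • White noise / fan noise: chaotic waveform, very high ZCR
  • Speech (voiced sounds): dominated by low frequencies, smooth waveform, low ZCR

Formula:

Let a frame contain N samples; a zero crossing occurs whenever two adjacent samples have opposite signs.

$$ZCR = \frac{1}{2N} \sum_{n=1}^{N-1} \left| \mathrm{sgn}(x(n)) - \mathrm{sgn}(x(n-1)) \right|$$

where the sign function is:

$$\mathrm{sgn}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$

Put simply: the sum counts the zero crossings of the whole frame (each crossing contributes 2), and dividing by 2N normalizes the result to the range [0, 1].

This method can reject some abrupt noises, but it may still fail on steady, low-level noise. It can be combined with the energy method: make a first decision from the energy, then use the ZCR to remove part of the remaining noise.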
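The ZCR definition translates almost directly into code. The sketch below treats a sample of exactly 0 as positive, which is one common convention and an assumption here; `Sign` and `ZeroCrossingRate` are illustrative names.

```cpp
#include <cstddef>
#include <vector>

// Sign function: +1 for x >= 0, -1 otherwise.
int Sign(float x) { return x >= 0.0f ? 1 : -1; }

// Normalized zero-crossing rate of one frame, in [0, 1].
double ZeroCrossingRate(const std::vector<float>& frame) {
  if (frame.size() < 2) return 0.0;
  int sum = 0;
  for (std::size_t n = 1; n < frame.size(); ++n) {
    const int diff = Sign(frame[n]) - Sign(frame[n - 1]);
    sum += diff < 0 ? -diff : diff;  // Each crossing adds 2.
  }
  return static_cast<double>(sum) / (2.0 * static_cast<double>(frame.size()));
}
```

For the frame `{1, -1, 1, -1}` there are three crossings among four samples, so the function returns 0.75; a constant frame returns 0.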

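2.3 Frequency-domain energy method

The time-domain methods only look at overall loudness; the frequency domain makes it possible to separate the speech band from the noise bands. Using an FFT spectrum can improve the results considerably.

How it works:

  • Apply an FFT to each audio frame to obtain its spectrum
  • Effective speech band: 300 Hz to 3400 Hz (the telephone voice band)
  • Sum the energy inside that band only; low-frequency wind noise and high-frequency electronic noise are ignored
  • If the speech-band energy exceeds an adaptive threshold, the frame is judged to be speech

The idea is loosely similar to a psychoacoustic model: only the frequencies that matter are considered, here the speech band. Because a large share of the interference outside the speech band is excluded, accuracy in noisy environments is typically much better than with the time-domain energy method.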

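Key formulas:

1. FFT spectrum

Let the FFT output the complex bins $X_k$, where k is the bin index. The power of a single bin is:

$$P_k = |X_k|^2 = \mathrm{Re}(X_k)^2 + \mathrm{Im}(X_k)^2$$

  • Re = Real, the real part of the complex value
  • Im = Imaginary, the imaginary part of the complex value

2. Total speech-band energy

$k_{low}$: the FFT index corresponding to 300 Hz

$k_{high}$: the FFT index corresponding to 3400 Hz

(For an $N_{FFT}$-point FFT at sampling rate $f_s$, frequency $f$ maps to index $k \approx f \cdot N_{FFT} / f_s$.)

Sum the bin powers over the speech band:

$$E_{speech} = \sum_{k=k_{low}}^{k_{high}} P_k$$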
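As a rough sketch of the band-energy computation, the code below evaluates the needed DFT bins directly instead of calling an FFT library, which keeps the example self-contained at the cost of speed. `BinPower` and `BandEnergy` are hypothetical names, and rounding the band edges inward (ceil for the lower edge, floor for the upper) is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Power |X_k|^2 of DFT bin k, computed directly (O(N) per bin).
double BinPower(const std::vector<float>& frame, std::size_t k) {
  const double kPi = 3.14159265358979323846;
  const std::size_t n_fft = frame.size();
  double re = 0.0;
  double im = 0.0;
  for (std::size_t n = 0; n < n_fft; ++n) {
    const double phase =
        -2.0 * kPi * static_cast<double>(k * n) / static_cast<double>(n_fft);
    re += frame[n] * std::cos(phase);
    im += frame[n] * std::sin(phase);
  }
  return re * re + im * im;  // Re^2 + Im^2
}

// Sum of bin powers between low_hz and high_hz (inclusive).
double BandEnergy(const std::vector<float>& frame, double sample_rate_hz,
                  double low_hz, double high_hz) {
  const std::size_t n_fft = frame.size();
  if (n_fft == 0) return 0.0;
  const double n = static_cast<double>(n_fft);
  const std::size_t k_low =
      static_cast<std::size_t>(std::ceil(low_hz * n / sample_rate_hz));
  std::size_t k_high =
      static_cast<std::size_t>(std::floor(high_hz * n / sample_rate_hz));
  k_high = std::min(k_high, n_fft / 2);  // Stay at or below Nyquist.
  double energy = 0.0;
  for (std::size_t k = k_low; k <= k_high; ++k) energy += BinPower(frame, k);
  return energy;
}
```

For a 160-sample frame at 8 kHz the bins are 50 Hz apart, so 300 Hz to 3400 Hz covers bins 6 through 68. A 1 kHz tone falls inside the band, while a 100 Hz hum contributes almost nothing to it.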

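3. Adaptive update of the noise baseline

While frames keep being judged silent, slowly update the background noise energy:

$$E_{noise} \leftarrow \alpha \, E_{noise} + (1 - \alpha) \, E_{speech}$$

Here $\alpha$ ($0 < \alpha < 1$) is the smoothing factor; the more slowly the noise changes, the larger it should be.

4. Decision rule

Choose a threshold factor $\beta$ (typically 2 to 8, tunable for sensitivity); the frame is judged to be speech when

$$E_{speech} > \beta \, E_{noise}$$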
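The baseline update and the decision rule can be combined into a small stateful helper. This is a sketch under assumptions: the class name, the `quiet_frames_needed` parameter (how many consecutive silent frames count as "sustained"), and the exponential-smoothing form of the update are illustrative choices.

```cpp
// Tracks a background-noise energy estimate and classifies frames against it.
class AdaptiveThresholdVad {
 public:
  AdaptiveThresholdVad(double initial_noise, double alpha, double beta,
                       int quiet_frames_needed)
      : noise_energy_(initial_noise),
        alpha_(alpha),
        beta_(beta),
        quiet_frames_needed_(quiet_frames_needed),
        quiet_run_(0) {}

  // Returns true if the band energy exceeds beta * current noise estimate.
  bool Process(double band_energy) {
    const bool speech = band_energy > beta_ * noise_energy_;
    if (speech) {
      quiet_run_ = 0;
    } else if (++quiet_run_ >= quiet_frames_needed_) {
      // Sustained silence: move the baseline slowly toward the current energy.
      noise_energy_ = alpha_ * noise_energy_ + (1.0 - alpha_) * band_energy;
    }
    return speech;
  }

  double noise_energy() const { return noise_energy_; }

 private:
  double noise_energy_;      // Running background-noise energy.
  double alpha_;             // Smoothing factor; closer to 1 = slower update.
  double beta_;              // Threshold factor.
  int quiet_frames_needed_;  // Silent frames required before updating.
  int quiet_run_;            // Current run of silent frames.
};
```

Freezing the baseline during speech matters: if speech frames were allowed to update it, the threshold would climb and the detector would gradually stop hearing the speaker.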

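III. WebRTC VAD

The VAD (Voice Activity Detection) in WebRTC is mainly based on GMMs (Gaussian mixture models) and spectral feature analysis. The core idea is to compare the feature vector of each audio frame against a pre-trained "speech model" and "noise model", compute the likelihoods, and make the decision from them.

3.1 Overall architecture

  1. Pre-processing: downsampling, framing, windowing.

  2. Feature extraction: extract, from the time and frequency domains, the key features that separate speech from noise.

  3. Model matching: use the GMMs to compute the probability that the features belong to speech or to noise.

  4. Decision logic: combine the probabilities, energy thresholds and history state (hangover) into the final decision.

3.2 Pre-processing

• Downsampling: whether the input sampling rate is 8 kHz, 16 kHz, 32 kHz or 48 kHz, the VAD internally downsamples the signal to 8 kHz.

• Reason: most of the energy and information in speech lies in the low band (0 to 4 kHz). Lowering the sampling rate cuts the computation substantially with little effect on VAD accuracy.

• Framing: the continuous signal is divided into short frames. WebRTC VAD supports frame lengths of 10 ms, 20 ms and 30 ms (80, 160 or 240 samples at 8 kHz).

• Windowing: FFT-based front ends usually apply a Hamming window to reduce spectral leakage.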
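The relationship between sampling rate, frame duration and frame size in samples is simple arithmetic, sketched below. `SamplesPerFrame` and `IsSupportedFrame` are hypothetical helpers written for this example; they only encode the rates and durations named in the text.

```cpp
#include <cstddef>

// Number of samples in a frame lasting `frame_ms` milliseconds.
constexpr std::size_t SamplesPerFrame(int sample_rate_hz, int frame_ms) {
  return static_cast<std::size_t>(sample_rate_hz) * frame_ms / 1000;
}

// True if (rate, length) is one of the combinations listed above:
// 8/16/32/48 kHz input with 10/20/30 ms frames.
bool IsSupportedFrame(int sample_rate_hz, std::size_t frame_len) {
  const int rates[] = {8000, 16000, 32000, 48000};
  const int lengths_ms[] = {10, 20, 30};
  for (int rate : rates) {
    if (rate != sample_rate_hz) continue;
    for (int ms : lengths_ms) {
      if (SamplesPerFrame(rate, ms) == frame_len) return true;
    }
  }
  return false;
}
```

At 8 kHz the three durations give 80, 160 and 240 samples, which are exactly the frame lengths the `GmmProbability` source switches on.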

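3.3 Feature extraction

The features below are commonly used to separate speech from noise. Note that the GMM code in section 3.6 consumes one feature value per frequency sub-band (`features[channel]`), so the sub-band energies are the ones that matter most to it.

  1. Total energy:

• The logarithm of the sum of squares of all samples in the frame.

• Purpose: silent frames usually have very low energy.

  2. Zero-crossing rate (ZCR):

• The number of times the signal crosses the zero axis.

• Purpose: unvoiced sounds (such as /s/, /f/) and noise usually have a high ZCR, while voiced sounds (such as /a/, /o/) have a low ZCR.

  3. Spectral slope:

• Fit the spectral envelope by linear regression and take the slope.

• Purpose: the speech spectrum usually falls as frequency rises (negative slope), whereas the white-noise spectrum is fairly flat.

  4. Spectral flatness:

• The ratio of the geometric mean to the arithmetic mean of the power spectrum.

• Purpose: measures whether the spectrum is tone-like (pronounced peaks, value near 0) or noise-like (flat, value near 1).

  5. Sub-band energy ratio:

• Split the spectrum of the 8 kHz signal (0 to 4 kHz) into several sub-bands (for example low, mid and high).

• Compute the share of the total energy that falls in each sub-band.

• Purpose: speech usually concentrates its energy in the low sub-bands (such as 0-500 Hz and 500-1000 Hz), while high-frequency noise has more energy in the high sub-bands.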
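Two of the features above, spectral flatness and the sub-band energy ratio, can be sketched on top of a power spectrum that has already been computed. The function names and the small `eps` guard against `log(0)` are assumptions for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Spectral flatness: geometric mean / arithmetic mean of the power spectrum.
// Near 1 for a flat (noise-like) spectrum, near 0 for a peaky (tonal) one.
double SpectralFlatness(const std::vector<double>& power, double eps = 1e-12) {
  if (power.empty()) return 0.0;
  double log_sum = 0.0;
  double sum = 0.0;
  for (double p : power) {
    log_sum += std::log(p + eps);  // eps keeps log() finite for empty bins.
    sum += p + eps;
  }
  const double n = static_cast<double>(power.size());
  return std::exp(log_sum / n) / (sum / n);
}

// Share of the total energy that falls in bins [first, last).
double SubbandEnergyRatio(const std::vector<double>& power, std::size_t first,
                          std::size_t last) {
  double band = 0.0;
  double total = 0.0;
  for (std::size_t i = 0; i < power.size(); ++i) {
    total += power[i];
    if (i >= first && i < last) band += power[i];
  }
  return total > 0.0 ? band / total : 0.0;
}
```

Computing the geometric mean through the mean of logarithms avoids the overflow or underflow that multiplying many bin powers together would cause.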

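3.4 Model matching (Gaussian mixture model (GMM) classification)

This is the core of WebRTC VAD. It maintains two separate GMMs:

• Speech model ($\lambda_s$): trained on a large amount of clean speech data.

• Noise model ($\lambda_n$): trained on a variety of background noise data.

Each model is made up of several Gaussian distributions:

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)$$

where $x$ is the feature vector, $w_i$ the weight, $\mu_i$ the mean and $\Sigma_i$ the covariance of the i-th component.

WebRTC has also introduced a VAD based on a recurrent neural network (RNN) (modules/audio_processing/rnn_vad/). It is not covered in detail here; interested readers can explore it on their own.

Computation:

  1. For the feature vector $x$ of the current frame, compute its log-likelihood under the speech model, $\log p(x \mid \lambda_s)$, and under the noise model, $\log p(x \mid \lambda_n)$.

  2. Compute the (log) likelihood ratio:

$$LLR = \log p(x \mid \lambda_s) - \log p(x \mid \lambda_n)$$

  3. If $LLR > \eta$ for a chosen threshold $\eta$, the frame leans toward speech; otherwise it is treated as noise.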
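The three computation steps can be sketched in floating point for a one-dimensional feature. This is a simplified illustration, not WebRTC's fixed-point implementation: the struct, the function names and the tiny `kFloor` guard against `log(0)` are all assumptions.

```cpp
#include <cmath>
#include <vector>

// One component of a one-dimensional Gaussian mixture.
struct Gaussian1D {
  double weight;
  double mean;
  double stddev;
};

// Density of a normal distribution at x.
double GaussianPdf(double x, double mean, double stddev) {
  const double kPi = 3.14159265358979323846;
  const double z = (x - mean) / stddev;
  return std::exp(-0.5 * z * z) / (stddev * std::sqrt(2.0 * kPi));
}

// p(x | model) = sum_i w_i * N(x; mu_i, sigma_i)
double MixtureLikelihood(const std::vector<Gaussian1D>& gmm, double x) {
  double p = 0.0;
  for (const auto& g : gmm) p += g.weight * GaussianPdf(x, g.mean, g.stddev);
  return p;
}

// LLR = log p(x | speech) - log p(x | noise); positive values favor speech.
double LogLikelihoodRatio(const std::vector<Gaussian1D>& speech,
                          const std::vector<Gaussian1D>& noise, double x) {
  const double kFloor = 1e-300;  // Keeps log() finite for vanishing densities.
  return std::log(MixtureLikelihood(speech, x) + kFloor) -
         std::log(MixtureLikelihood(noise, x) + kFloor);
}
```

A frame is then labeled speech when the returned ratio exceeds the chosen threshold. With a noise model centered near 0 and a speech model centered near 5, a feature value of 5.5 gives a positive ratio and 0.5 a negative one.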

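3.5 Decision logic and smoothing (Decision & Smoothing)

Raw frame-by-frame decisions are easily disturbed by transient noise and tend to jitter. A state machine and a smoothing mechanism are therefore added:

  1. Adaptive threshold:

• The threshold is not fixed; it is adjusted dynamically according to the background noise level.

• In a quiet environment the threshold is lower, so weak speech is detected more easily.

• In a noisy environment the threshold is raised to keep noise from triggering false detections.

  2. Hangover mechanism:

• Speech-to-silence transition: when frames start being judged as noise, the state does not switch to silence immediately. It enters a "hangover" state and keeps reporting speech for a few more frames (for example 3 to 5).

• Purpose: avoid cutting off the tail of an utterance (such as a final consonant).

• Silence-to-speech transition: several consecutive frames must be judged as speech before the state formally switches to speech.

• Purpose: keep short bursts of noise (such as a door slamming) from being mistaken for the start of speech.

  3. Modes: WebRTC provides four modes, which essentially adjust the thresholds and hangover lengths described above. The higher the mode, the more aggressively non-speech is rejected:

• Normal (mode 0): the most permissive setting; thresholds are lowest, so borderline frames tend to be kept as speech.

• Low Bitrate (mode 1): stricter than Normal; more frames are judged silent, which saves bandwidth.

• Aggressive (mode 2): stricter again, with higher thresholds.

• Very Aggressive (mode 3): the strictest setting; only frames that clearly look like speech pass, so weak speech may be cut.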
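The onset and hangover behavior described above can be sketched as a small state machine. The class name and the two parameters (`onset_frames`, `hangover_frames`) are assumptions for illustration; WebRTC's own smoothing, shown at the end of `GmmProbability`, differs in detail.

```cpp
// Smooths raw per-frame decisions with an onset requirement and a hangover.
class HangoverSmoother {
 public:
  HangoverSmoother(int onset_frames, int hangover_frames)
      : onset_frames_(onset_frames),
        hangover_frames_(hangover_frames),
        in_speech_(false),
        speech_run_(0),
        hang_left_(0) {}

  // Feeds one raw decision and returns the smoothed state.
  bool Update(bool raw_speech) {
    if (raw_speech) {
      ++speech_run_;
      if (in_speech_ || speech_run_ >= onset_frames_) {
        in_speech_ = true;
        hang_left_ = hangover_frames_;  // Re-arm the hangover.
      }
    } else {
      speech_run_ = 0;
      if (in_speech_) {
        if (hang_left_ > 0) {
          --hang_left_;  // Still inside the hangover: keep reporting speech.
        } else {
          in_speech_ = false;
        }
      }
    }
    return in_speech_;
  }

 private:
  int onset_frames_;     // Consecutive speech frames needed to enter speech.
  int hangover_frames_;  // Extra frames reported as speech after speech ends.
  bool in_speech_;
  int speech_run_;
  int hang_left_;
};
```

With `onset_frames = 2` and `hangover_frames = 2`, the raw sequence speech, noise, speech, speech, noise, noise, noise is smoothed to silence, silence, silence, speech, speech, speech, silence: the isolated first frame is ignored and the tail is extended by two frames.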

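3.6 Source code

1. Feature extraction. The snippet below is `FeaturesExtractor::CheckSilenceComputeFeatures` from the RNN-based VAD: it optionally high-pass filters the 10 ms, 24 kHz frame, computes the LP residual, estimates the pitch period, and then extracts the spectral features while checking for silence.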

bool FeaturesExtractor::CheckSilenceComputeFeatures(
    rtc::ArrayView<const float, kFrameSize10ms24kHz> samples,
    rtc::ArrayView<float, kFeatureVectorSize> feature_vector) {
  // Pre-processing.
  if (use_high_pass_filter_) {
    std::array<float, kFrameSize10ms24kHz> samples_filtered;
    hpf_.Process(samples, samples_filtered);
    // Feed buffer with the pre-processed version of |samples|.
    pitch_buf_24kHz_.Push(samples_filtered);
  } else {
    // Feed buffer with |samples|.
    pitch_buf_24kHz_.Push(samples);
  }
  // Extract the LP residual.
  float lpc_coeffs[kNumLpcCoefficients];
  ComputeAndPostProcessLpcCoefficients(pitch_buf_24kHz_view_, lpc_coeffs);
  ComputeLpResidual(lpc_coeffs, pitch_buf_24kHz_view_, lp_residual_view_);
  // Estimate pitch on the LP-residual and write the normalized pitch period
  // into the output vector (normalization based on training data stats).
  pitch_info_48kHz_ = pitch_estimator_.Estimate(lp_residual_view_);
  feature_vector[kFeatureVectorSize - 2] =
      0.01f * (static_cast<int>(pitch_info_48kHz_.period) - 300);
  // Extract lagged frames (according to the estimated pitch period).
  RTC_DCHECK_LE(pitch_info_48kHz_.period / 2, kMaxPitch24kHz);
  auto lagged_frame = pitch_buf_24kHz_view_.subview(
      kMaxPitch24kHz - pitch_info_48kHz_.period / 2, kFrameSize20ms24kHz);
  // Analyze reference and lagged frames checking if silence has been detected
  // and write the feature vector.
  return spectral_features_extractor_.CheckSilenceComputeFeatures(
      reference_frame_view_, {lagged_frame.data(), kFrameSize20ms24kHz},
      {feature_vector.data() + kNumLowerBands, kNumBands - kNumLowerBands},
      {feature_vector.data(), kNumLowerBands},
      {feature_vector.data() + kNumBands, kNumLowerBands},
      {feature_vector.data() + kNumBands + kNumLowerBands, kNumLowerBands},
      {feature_vector.data() + kNumBands + 2 * kNumLowerBands, kNumLowerBands},
      &feature_vector[kFeatureVectorSize - 1]);
}

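2. The core function `GmmProbability`, which computes the GMM probabilities. It evaluates the Gaussian probability densities using predefined coefficient tables (such as `kNoiseDataWeights` and `kSpeechDataWeights`), picks its thresholds and hangover limits according to the frame length, and then updates the model means and standard deviations based on the decision.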

static int16_t GmmProbability(VadInstT* self, int16_t* features,
                              int16_t total_power, size_t frame_length) {
  int channel, k;
  int16_t feature_minimum;
  int16_t h0, h1;
  int16_t log_likelihood_ratio;
  int16_t vadflag = 0;
  int16_t shifts_h0, shifts_h1;
  int16_t tmp_s16, tmp1_s16, tmp2_s16;
  int16_t diff;
  int gaussian;
  int16_t nmk, nmk2, nmk3, smk, smk2, nsk, ssk;
  int16_t delt, ndelt;
  int16_t maxspe, maxmu;
  int16_t deltaN[kTableSize], deltaS[kTableSize];
  int16_t ngprvec[kTableSize] = { 0 };  // Conditional probability = 0.
  int16_t sgprvec[kTableSize] = { 0 };  // Conditional probability = 0.
  int32_t h0_test, h1_test;
  int32_t tmp1_s32, tmp2_s32;
  int32_t sum_log_likelihood_ratios = 0;
  int32_t noise_global_mean, speech_global_mean;
  int32_t noise_probability[kNumGaussians], speech_probability[kNumGaussians];
  int16_t overhead1, overhead2, individualTest, totalTest;

  // Set various thresholds based on frame lengths (80, 160 or 240 samples).
  if (frame_length == 80) {
    overhead1 = self->over_hang_max_1[0];
    overhead2 = self->over_hang_max_2[0];
    individualTest = self->individual[0];
    totalTest = self->total[0];
  } else if (frame_length == 160) {
    overhead1 = self->over_hang_max_1[1];
    overhead2 = self->over_hang_max_2[1];
    individualTest = self->individual[1];
    totalTest = self->total[1];
  } else {
    overhead1 = self->over_hang_max_1[2];
    overhead2 = self->over_hang_max_2[2];
    individualTest = self->individual[2];
    totalTest = self->total[2];
  }

  if (total_power > kMinEnergy) {
    // The signal power of current frame is large enough for processing. The
    // processing consists of two parts:
    // 1) Calculating the likelihood of speech and thereby a VAD decision.
    // 2) Updating the underlying model, w.r.t., the decision made.

    // The detection scheme is an LRT with hypothesis
    // H0: Noise
    // H1: Speech
    //
    // We combine a global LRT with local tests, for each frequency sub-band,
    // here defined as |channel|.
    for (channel = 0; channel < kNumChannels; channel++) {
      // For each channel we model the probability with a GMM consisting of
      // |kNumGaussians|, with different means and standard deviations depending
      // on H0 or H1.
      h0_test = 0;
      h1_test = 0;
      for (k = 0; k < kNumGaussians; k++) {
        gaussian = channel + k * kNumChannels;
        // Probability under H0, that is, probability of frame being noise.
        // Value given in Q27 = Q7 * Q20.
        tmp1_s32 = WebRtcVad_GaussianProbability(features[channel],
                                                 self->noise_means[gaussian],
                                                 self->noise_stds[gaussian],
                                                 &deltaN[gaussian]);
        noise_probability[k] = kNoiseDataWeights[gaussian] * tmp1_s32;
        h0_test += noise_probability[k];  // Q27

        // Probability under H1, that is, probability of frame being speech.
        // Value given in Q27 = Q7 * Q20.
        tmp1_s32 = WebRtcVad_GaussianProbability(features[channel],
                                                 self->speech_means[gaussian],
                                                 self->speech_stds[gaussian],
                                                 &deltaS[gaussian]);
        speech_probability[k] = kSpeechDataWeights[gaussian] * tmp1_s32;
        h1_test += speech_probability[k];  // Q27
      }

      // Calculate the log likelihood ratio: log2(Pr{X|H1} / Pr{X|H0}).
      // Approximation:
      // log2(Pr{X|H1} / Pr{X|H0}) = log2(Pr{X|H1}*2^Q) - log2(Pr{X|H0}*2^Q)
      //                           = log2(h1_test) - log2(h0_test)
      //                           = log2(2^(31-shifts_h1)*(1+b1))
      //                             - log2(2^(31-shifts_h0)*(1+b0))
      //                           = shifts_h0 - shifts_h1
      //                             + log2(1+b1) - log2(1+b0)
      //                          ~= shifts_h0 - shifts_h1
      //
      // Note that b0 and b1 are values less than 1, hence, 0 <= log2(1+b0) < 1.
      // Further, b0 and b1 are independent and on the average the two terms
      // cancel.
      shifts_h0 = WebRtcSpl_NormW32(h0_test);
      shifts_h1 = WebRtcSpl_NormW32(h1_test);
      if (h0_test == 0) {
        shifts_h0 = 31;
      }
      if (h1_test == 0) {
        shifts_h1 = 31;
      }
      log_likelihood_ratio = shifts_h0 - shifts_h1;

      // Update |sum_log_likelihood_ratios| with spectrum weighting. This is
      // used for the global VAD decision.
      sum_log_likelihood_ratios +=
          (int32_t) (log_likelihood_ratio * kSpectrumWeight[channel]);

      // Local VAD decision.
      if ((log_likelihood_ratio * 4) > individualTest) {
        vadflag = 1;
      }

      // TODO(bjornv): The conditional probabilities below are applied on the
      // hard coded number of Gaussians set to two. Find a way to generalize.
      // Calculate local noise probabilities used later when updating the GMM.
      h0 = (int16_t) (h0_test >> 12);  // Q15
      if (h0 > 0) {
        // High probability of noise. Assign conditional probabilities for each
        // Gaussian in the GMM.
        tmp1_s32 = (noise_probability[0] & 0xFFFFF000) << 2;  // Q29
        ngprvec[channel] = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, h0);  // Q14
        ngprvec[channel + kNumChannels] = 16384 - ngprvec[channel];
      } else {
        // Low noise probability. Assign conditional probability 1 to the first
        // Gaussian and 0 to the rest (which is already set at initialization).
        ngprvec[channel] = 16384;
      }

      // Calculate local speech probabilities used later when updating the GMM.
      h1 = (int16_t) (h1_test >> 12);  // Q15
      if (h1 > 0) {
        // High probability of speech. Assign conditional probabilities for each
        // Gaussian in the GMM. Otherwise use the initialized values, i.e., 0.
        tmp1_s32 = (speech_probability[0] & 0xFFFFF000) << 2;  // Q29
        sgprvec[channel] = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, h1);  // Q14
        sgprvec[channel + kNumChannels] = 16384 - sgprvec[channel];
      }
    }

    // Make a global VAD decision.
    vadflag |= (sum_log_likelihood_ratios >= totalTest);

    // Update the model parameters.
    maxspe = 12800;
    for (channel = 0; channel < kNumChannels; channel++) {

      // Get minimum value in past which is used for long term correction in Q4.
      feature_minimum = WebRtcVad_FindMinimum(self, features[channel], channel);

      // Compute the "global" mean, that is the sum of the two means weighted.
      noise_global_mean = WeightedAverage(&self->noise_means[channel], 0,
                                          &kNoiseDataWeights[channel]);
      tmp1_s16 = (int16_t) (noise_global_mean >> 6);  // Q8

      for (k = 0; k < kNumGaussians; k++) {
        gaussian = channel + k * kNumChannels;

        nmk = self->noise_means[gaussian];
        smk = self->speech_means[gaussian];
        nsk = self->noise_stds[gaussian];
        ssk = self->speech_stds[gaussian];

        // Update noise mean vector if the frame consists of noise only.
        nmk2 = nmk;
        if (!vadflag) {
          // deltaN = (x-mu)/sigma^2
          // ngprvec[k] = |noise_probability[k]| /
          //   (|noise_probability[0]| + |noise_probability[1]|)

          // (Q14 * Q11 >> 11) = Q14.
          delt = (int16_t)((ngprvec[gaussian] * deltaN[gaussian]) >> 11);
          // Q7 + (Q14 * Q15 >> 22) = Q7.
          nmk2 = nmk + (int16_t)((delt * kNoiseUpdateConst) >> 22);
        }

        // Long term correction of the noise mean.
        // Q8 - Q8 = Q8.
        ndelt = (feature_minimum << 4) - tmp1_s16;
        // Q7 + (Q8 * Q8) >> 9 = Q7.
        nmk3 = nmk2 + (int16_t)((ndelt * kBackEta) >> 9);

        // Control that the noise mean does not drift too much.
        tmp_s16 = (int16_t) ((k + 5) << 7);
        if (nmk3 < tmp_s16) {
          nmk3 = tmp_s16;
        }
        tmp_s16 = (int16_t) ((72 + k - channel) << 7);
        if (nmk3 > tmp_s16) {
          nmk3 = tmp_s16;
        }
        self->noise_means[gaussian] = nmk3;

        if (vadflag) {
          // Update speech mean vector:
          // |deltaS| = (x-mu)/sigma^2
          // sgprvec[k] = |speech_probability[k]| /
          //   (|speech_probability[0]| + |speech_probability[1]|)

          // (Q14 * Q11) >> 11 = Q14.
          delt = (int16_t)((sgprvec[gaussian] * deltaS[gaussian]) >> 11);
          // Q14 * Q15 >> 21 = Q8.
          tmp_s16 = (int16_t)((delt * kSpeechUpdateConst) >> 21);
          // Q7 + (Q8 >> 1) = Q7. With rounding.
          smk2 = smk + ((tmp_s16 + 1) >> 1);

          // Control that the speech mean does not drift too much.
          maxmu = maxspe + 640;
          if (smk2 < kMinimumMean[k]) {
            smk2 = kMinimumMean[k];
          }
          if (smk2 > maxmu) {
            smk2 = maxmu;
          }
          self->speech_means[gaussian] = smk2;  // Q7.

          // (Q7 >> 3) = Q4. With rounding.
          tmp_s16 = ((smk + 4) >> 3);

          tmp_s16 = features[channel] - tmp_s16;  // Q4
          // (Q11 * Q4 >> 3) = Q12.
          tmp1_s32 = (deltaS[gaussian] * tmp_s16) >> 3;
          tmp2_s32 = tmp1_s32 - 4096;
          tmp_s16 = sgprvec[gaussian] >> 2;
          // (Q14 >> 2) * Q12 = Q24.
          tmp1_s32 = tmp_s16 * tmp2_s32;

          tmp2_s32 = tmp1_s32 >> 4;  // Q20

          // 0.1 * Q20 / Q7 = Q13.
          if (tmp2_s32 > 0) {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(tmp2_s32, ssk * 10);
          } else {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(-tmp2_s32, ssk * 10);
            tmp_s16 = -tmp_s16;
          }
          // Divide by 4 giving an update factor of 0.025 (= 0.1 / 4).
          // Note that division by 4 equals shift by 2, hence,
          // (Q13 >> 8) = (Q13 >> 6) / 4 = Q7.
          tmp_s16 += 128;  // Rounding.
          ssk += (tmp_s16 >> 8);
          if (ssk < kMinStd) {
            ssk = kMinStd;
          }
          self->speech_stds[gaussian] = ssk;
        } else {
          // Update GMM variance vectors.
          // deltaN * (features[channel] - nmk) - 1
          // Q4 - (Q7 >> 3) = Q4.
          tmp_s16 = features[channel] - (nmk >> 3);
          // (Q11 * Q4 >> 3) = Q12.
          tmp1_s32 = (deltaN[gaussian] * tmp_s16) >> 3;
          tmp1_s32 -= 4096;

          // (Q14 >> 2) * Q12 = Q24.
          tmp_s16 = (ngprvec[gaussian] + 2) >> 2;
          tmp2_s32 = OverflowingMulS16ByS32ToS32(tmp_s16, tmp1_s32);
          // Q20  * approx 0.001 (2^-10=0.0009766), hence,
          // (Q24 >> 14) = (Q24 >> 4) / 2^10 = Q20.
          tmp1_s32 = tmp2_s32 >> 14;

          // Q20 / Q7 = Q13.
          if (tmp1_s32 > 0) {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, nsk);
          } else {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(-tmp1_s32, nsk);
            tmp_s16 = -tmp_s16;
          }
          tmp_s16 += 32;  // Rounding
          nsk += tmp_s16 >> 6;  // Q13 >> 6 = Q7.
          if (nsk < kMinStd) {
            nsk = kMinStd;
          }
          self->noise_stds[gaussian] = nsk;
        }
      }

      // Separate models if they are too close.
      // |noise_global_mean| in Q14 (= Q7 * Q7).
      noise_global_mean = WeightedAverage(&self->noise_means[channel], 0,
                                          &kNoiseDataWeights[channel]);

      // |speech_global_mean| in Q14 (= Q7 * Q7).
      speech_global_mean = WeightedAverage(&self->speech_means[channel], 0,
                                           &kSpeechDataWeights[channel]);

      // |diff| = "global" speech mean - "global" noise mean.
      // (Q14 >> 9) - (Q14 >> 9) = Q5.
      diff = (int16_t) (speech_global_mean >> 9) -
          (int16_t) (noise_global_mean >> 9);
      if (diff < kMinimumDifference[channel]) {
        tmp_s16 = kMinimumDifference[channel] - diff;

        // |tmp1_s16| = ~0.8 * (kMinimumDifference - diff) in Q7.
        // |tmp2_s16| = ~0.2 * (kMinimumDifference - diff) in Q7.
        tmp1_s16 = (int16_t)((13 * tmp_s16) >> 2);
        tmp2_s16 = (int16_t)((3 * tmp_s16) >> 2);

        // Move Gaussian means for speech model by |tmp1_s16| and update
        // |speech_global_mean|. Note that |self->speech_means[channel]| is
        // changed after the call.
        speech_global_mean = WeightedAverage(&self->speech_means[channel],
                                             tmp1_s16,
                                             &kSpeechDataWeights[channel]);

        // Move Gaussian means for noise model by -|tmp2_s16| and update
        // |noise_global_mean|. Note that |self->noise_means[channel]| is
        // changed after the call.
        noise_global_mean = WeightedAverage(&self->noise_means[channel],
                                            -tmp2_s16,
                                            &kNoiseDataWeights[channel]);
      }

      // Control that the speech & noise means do not drift too much.
      maxspe = kMaximumSpeech[channel];
      tmp2_s16 = (int16_t) (speech_global_mean >> 7);
      if (tmp2_s16 > maxspe) {
        // Upper limit of speech model.
        tmp2_s16 -= maxspe;

        for (k = 0; k < kNumGaussians; k++) {
          self->speech_means[channel + k * kNumChannels] -= tmp2_s16;
        }
      }

      tmp2_s16 = (int16_t) (noise_global_mean >> 7);
      if (tmp2_s16 > kMaximumNoise[channel]) {
        tmp2_s16 -= kMaximumNoise[channel];

        for (k = 0; k < kNumGaussians; k++) {
          self->noise_means[channel + k * kNumChannels] -= tmp2_s16;
        }
      }
    }
    self->frame_counter++;
  }

  // Smooth with respect to transition hysteresis.
  if (!vadflag) {
    if (self->over_hang > 0) {
      vadflag = 2 + self->over_hang;
      self->over_hang--;
    }
    self->num_of_speech = 0;
  } else {
    self->num_of_speech++;
    if (self->num_of_speech > kMaxSpeechFrames) {
      self->num_of_speech = kMaxSpeechFrames;
      self->over_hang = overhead2;
    } else {
      self->over_hang = overhead1;
    }
  }
  return vadflag;
}
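One detail in the function above is worth isolating: the log likelihood ratio is approximated by the difference of two normalization shift counts, `shifts_h0 - shifts_h1`, rather than by computing real logarithms. The sketch below reproduces the idea with a hypothetical `NormShift` helper, written for this example as a stand-in for `WebRtcSpl_NormW32` on positive inputs (it counts how many left shifts bring the highest set bit to bit 30).

```cpp
#include <cstdint>

// Left shifts needed to move the top set bit of a positive value to bit 30.
// Hypothetical stand-in for this example; returns 0 for value <= 0.
int NormShift(int32_t value) {
  int shifts = 0;
  while (value > 0 && (value & 0x40000000) == 0) {
    value <<= 1;
    ++shifts;
  }
  return shifts;
}

// Integer approximation of log2(h1 / h0), mirroring the quoted logic:
// a zero likelihood is treated as the smallest possible magnitude (31).
int ApproxLog2Ratio(int32_t h1, int32_t h0) {
  const int shifts_h0 = (h0 == 0) ? 31 : NormShift(h0);
  const int shifts_h1 = (h1 == 0) ? 31 : NormShift(h1);
  return shifts_h0 - shifts_h1;
}
```

For example, `ApproxLog2Ratio(1024, 64)` gives 4, matching log2(16). The fractional part is dropped, so `ApproxLog2Ratio(1536, 1024)` gives 0 even though log2(1.5) is about 0.58; the source comment argues that these errors cancel on average.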