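Audio Silence Detection (VAD)

Silence detection, also called voice activity detection (VAD), separates speech frames from silence and noise frames. It is widely used for call noise suppression, recording segmentation, live-stream noise reduction, and as a preprocessing step for speech recognition.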

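1. The basic logic of silence detection:

Framing: split the continuous audio into short frames (typically 10 ms or 20 ms, short enough that speech is approximately stationary within one frame)
Feature extraction: compute a feature for each frame, such as its energy
Threshold comparison: feature < threshold → silence; feature > threshold → speech
Smoothing (debouncing): isolated single-frame misjudgements are filtered out; the silence/speech state switches only after several consecutive frames agree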
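
The four steps above can be sketched as follows. This is a minimal illustration, not a production detector: the names `FrameEnergy`, `DetectSpeech`, and `min_run` are hypothetical, and mean-square energy is assumed as the per-frame feature.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: mean-square energy of the frame starting at `begin`.
double FrameEnergy(const std::vector<double>& samples, std::size_t begin,
                   std::size_t frame_len) {
  double sum = 0.0;
  for (std::size_t i = begin; i < begin + frame_len; ++i) {
    sum += samples[i] * samples[i];
  }
  return sum / static_cast<double>(frame_len);
}

// Returns one flag per complete frame: true = speech, false = silence.
// The state flips only after `min_run` consecutive frames disagree with it,
// which filters out isolated single-frame misjudgements.
std::vector<bool> DetectSpeech(const std::vector<double>& samples,
                               std::size_t frame_len, double threshold,
                               int min_run) {
  assert(frame_len > 0 && min_run > 0);
  std::vector<bool> flags;
  bool speaking = false;
  int run = 0;  // Consecutive frames whose raw decision differs from the state.
  for (std::size_t begin = 0; begin + frame_len <= samples.size();
       begin += frame_len) {
    const bool raw = FrameEnergy(samples, begin, frame_len) > threshold;
    if (raw != speaking) {
      if (++run >= min_run) {
        speaking = raw;
        run = 0;
      }
    } else {
      run = 0;
    }
    flags.push_back(speaking);
  }
  return flags;
}
```

With `min_run = 2`, a single loud frame surrounded by silence never flips the state, at the cost of reporting a real speech onset one frame late.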

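2. Principles of several silence-detection methods:

2.1 Basic energy method

This method uses time-domain amplitude or short-time energy: during silence the microphone picks up only ambient noise and the waveform amplitude is very small, while speech makes the amplitude noticeably larger.

Formulas:

A frame has $N$ samples, where $x(n)$ is the $n$-th sample value.

Short-time energy of a frame:

$$E = \sum_{n=0}^{N-1} x(n)^2$$

Short-time average magnitude:

$$M = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|$$

Step 1: apply the formulas above to frames known to contain only background noise to obtain the baseline noise energy $E_{noise}$.

Step 2: set a threshold with adjustable sensitivity, for example $T = K \cdot E_{noise}$, where $K$ is a tunable factor chosen to suit the actual environment.

Step 3: decision rule: if $E \le T$ the frame is silent; otherwise it contains sound.

As the formulas and the decision rule show, this method is crude. It misses very quiet speech, and because it measures only loudness it cannot tell car noise from speech.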
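
A minimal sketch of the energy method, assuming the threshold takes the form T = K * E_noise and that the noise baseline is the mean short-time energy of a few frames known to be silent. All function names here are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Short-time energy: E = sum of x(n)^2 over the frame.
double ShortTimeEnergy(const std::vector<double>& frame) {
  double e = 0.0;
  for (double x : frame) e += x * x;
  return e;
}

// Short-time average magnitude: M = (1/N) * sum of |x(n)|.
double AverageMagnitude(const std::vector<double>& frame) {
  assert(!frame.empty());
  double m = 0.0;
  for (double x : frame) m += std::fabs(x);
  return m / static_cast<double>(frame.size());
}

// Step 1: baseline = mean energy of frames assumed to hold only noise.
double NoiseBaseline(const std::vector<std::vector<double>>& noise_frames) {
  assert(!noise_frames.empty());
  double sum = 0.0;
  for (const auto& frame : noise_frames) sum += ShortTimeEnergy(frame);
  return sum / static_cast<double>(noise_frames.size());
}

// Steps 2 and 3: T = K * E_noise; the frame is silent when E <= T.
bool IsSilent(const std::vector<double>& frame, double noise_energy,
              double k) {
  return ShortTimeEnergy(frame) <= k * noise_energy;
}
```

Raising `k` makes the detector less sensitive (more frames count as silence); lowering it does the opposite.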

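2.2 Zero-crossing rate as an auxiliary feature

Zero-crossing rate (ZCR): the number of times the sampled signal crosses the zero level within one frame.

This method is mainly meant to separate noise from speech based on their different characteristics:

White noise / fan noise: irregular waveform, very high ZCR
Voiced speech: dominated by low frequencies, smooth waveform, low ZCR (unvoiced consonants such as /s/ and /f/ are the exception and have a high ZCR)

Formula:

Let a frame have $N$ samples; a zero crossing occurs whenever two adjacent samples have opposite signs.

$$Z = \frac{1}{2N} \sum_{n=1}^{N-1} \left| \operatorname{sgn}(x(n)) - \operatorname{sgn}(x(n-1)) \right|$$

where the sign function is:

$$\operatorname{sgn}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$

Put simply: each crossing contributes 2 to the sum, so the sum is twice the number of zero crossings in the frame, and dividing by $2N$ normalizes the result to the range $[0, 1]$.

This method can reject noise whose ZCR is conspicuously high, but steady, quiet, low-frequency noise may still be indistinguishable from speech. It combines well with the energy method: make a first decision with the energy method, then use ZCR to reject part of the remaining noise.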
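
A minimal sketch of the normalized zero-crossing rate, assuming the convention sgn(x) = 1 for x >= 0 and -1 otherwise, together with a hypothetical `IsSpeechCandidate` helper that stacks the ZCR check on top of an energy check. The thresholds are placeholders to be tuned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// sgn(x) = 1 for x >= 0, -1 otherwise.
int Sign(double x) { return x >= 0.0 ? 1 : -1; }

// Z = (1 / 2N) * sum over n = 1..N-1 of |sgn(x(n)) - sgn(x(n-1))|.
double ZeroCrossingRate(const std::vector<double>& frame) {
  assert(!frame.empty());
  int sum = 0;
  for (std::size_t n = 1; n < frame.size(); ++n) {
    const int d = Sign(frame[n]) - Sign(frame[n - 1]);
    sum += d < 0 ? -d : d;  // 2 for each crossing, 0 otherwise.
  }
  return static_cast<double>(sum) /
         (2.0 * static_cast<double>(frame.size()));
}

// Energy first, ZCR second: loud enough and not noise-like.
bool IsSpeechCandidate(double energy, double zcr, double energy_threshold,
                       double zcr_max) {
  return energy > energy_threshold && zcr < zcr_max;
}
```

A frame that alternates sign on every sample, such as {1, -1, 1, -1}, has 3 crossings in 4 samples and therefore a rate of 0.75.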

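2.3 Frequency-domain energy method

The time domain only sees overall loudness; the frequency domain can separate the speech band from noise bands. Working on the FFT spectrum can improve accuracy considerably.

How it works:

Apply an FFT to each audio frame to obtain its spectrum
Effective speech band: 300 Hz to 3400 Hz (the telephone voice band)
Sum the energy inside this band only; low-frequency wind noise and high-frequency electronic noise are ignored
If the speech-band energy exceeds an adaptive threshold, the frame is judged to be speech

The idea loosely resembles a psychoacoustic model in that it attends only to the frequencies that matter, here the band where speech energy is concentrated. Because interference outside the speech band is excluded, accuracy in noisy environments is usually clearly better than with the time-domain energy method.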

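Key formulas:

1. FFT spectrum

Let $X(k)$ be the complex FFT output, where $k$ is the bin index.

Power of a single bin:

$$P(k) = \operatorname{Re}(X(k))^2 + \operatorname{Im}(X(k))^2$$

Re = Real, the real part of the complex value
Im = Imaginary, the imaginary part of the complex value

2. Total speech-band energy

$k_{low}$: the FFT index corresponding to 300 Hz

$k_{high}$: the FFT index corresponding to 3400 Hz

For an FFT of length $N_{fft}$ at sample rate $f_s$, the index for frequency $f$ is $k = \operatorname{round}(f \cdot N_{fft} / f_s)$.

Sum the power of every bin in the speech band:

$$E_{band} = \sum_{k=k_{low}}^{k_{high}} P(k)$$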
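
A minimal sketch of the speech-band energy, assuming the bin power is Re^2 + Im^2 and the band runs from the bin nearest 300 Hz to the bin nearest 3400 Hz. To stay self-contained it evaluates each bin with a direct DFT sum instead of a real FFT, which gives the same values but is slower; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Index of the DFT bin closest to `freq_hz`: k = round(f * N / fs).
std::size_t BinIndex(double freq_hz, std::size_t n_fft,
                     double sample_rate_hz) {
  return static_cast<std::size_t>(
      std::lround(freq_hz * static_cast<double>(n_fft) / sample_rate_hz));
}

// P(k) = Re(X(k))^2 + Im(X(k))^2, with X(k) from a direct DFT sum.
double BinPower(const std::vector<double>& frame, std::size_t k) {
  const double kPi = 3.14159265358979323846;
  const double n_fft = static_cast<double>(frame.size());
  double re = 0.0;
  double im = 0.0;
  for (std::size_t n = 0; n < frame.size(); ++n) {
    const double angle =
        -2.0 * kPi * static_cast<double>(k) * static_cast<double>(n) / n_fft;
    re += frame[n] * std::cos(angle);
    im += frame[n] * std::sin(angle);
  }
  return re * re + im * im;
}

// Sum of P(k) over the bins between 300 Hz and 3400 Hz, inclusive.
double SpeechBandEnergy(const std::vector<double>& frame,
                        double sample_rate_hz) {
  // The band must lie below the Nyquist frequency.
  assert(!frame.empty() && sample_rate_hz > 6800.0);
  const std::size_t k_low = BinIndex(300.0, frame.size(), sample_rate_hz);
  const std::size_t k_high = BinIndex(3400.0, frame.size(), sample_rate_hz);
  double energy = 0.0;
  for (std::size_t k = k_low; k <= k_high; ++k) energy += BinPower(frame, k);
  return energy;
}
```

For a 20 ms frame at 8 kHz (160 samples) the bins are 50 Hz apart, so the band covers bins 6 to 68. A 1 kHz tone lands inside it, while a 100 Hz hum contributes almost nothing.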

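3. Adaptive update of the noise baseline

Once several consecutive frames have been judged silent, slowly update the background-noise energy:

$$N_{noise}(t) = \alpha \cdot N_{noise}(t-1) + (1 - \alpha) \cdot E_{band}(t)$$

Here $E_{band}(t)$ is the speech-band energy of the current frame and $\alpha$ is the smoothing coefficient ($0 < \alpha < 1$); the more slowly the noise changes, the larger $\alpha$ should be.

Decision rule

$$E_{band} > \beta \cdot N_{noise} \Rightarrow \text{speech}$$

$\beta$ is the threshold factor (typically 2 to 8, adjusted to tune sensitivity).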
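
A minimal sketch of the adaptive noise baseline and the decision rule, assuming exponential smoothing with coefficient alpha and a threshold factor beta. The class name `NoiseFloorVad` and the parameter `min_silent_frames` (how many consecutive silent frames are required before the baseline starts updating) are hypothetical.

```cpp
#include <cassert>

// Tracks the background-noise energy and applies
// "band_energy > beta * noise" as the speech decision.
class NoiseFloorVad {
 public:
  NoiseFloorVad(double initial_noise, double alpha, double beta,
                int min_silent_frames)
      : noise_(initial_noise),
        alpha_(alpha),
        beta_(beta),
        min_silent_frames_(min_silent_frames),
        silent_run_(0) {
    assert(initial_noise > 0.0 && alpha > 0.0 && alpha < 1.0);
    assert(beta > 0.0 && min_silent_frames > 0);
  }

  // Returns true when the frame is judged to be speech.
  bool Process(double band_energy) {
    const bool speech = band_energy > beta_ * noise_;
    if (speech) {
      silent_run_ = 0;
    } else if (++silent_run_ >= min_silent_frames_) {
      // noise(t) = alpha * noise(t-1) + (1 - alpha) * band_energy(t)
      noise_ = alpha_ * noise_ + (1.0 - alpha_) * band_energy;
    }
    return speech;
  }

  double noise() const { return noise_; }

 private:
  double noise_;
  double alpha_;
  double beta_;
  int min_silent_frames_;
  int silent_run_;
};
```

Because the baseline is frozen while speech is detected, a long utterance does not drag the noise estimate upward.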

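3. WebRTC VAD

The VAD (Voice Activity Detection) in WebRTC is mainly based on GMMs (Gaussian mixture models) and spectral feature analysis. The core idea is to compare each frame's feature vector against a pre-trained "speech model" and "noise model", compute the likelihood under each, and decide from the result.

3.1 Overall pipeline

Preprocessing: downsampling and framing (plus windowing in FFT-based designs).
Feature extraction: extract the key features that separate speech from noise.
Model matching: use the GMMs to compute the probability that the features belong to speech or to noise.
Decision logic: combine the probabilities, an energy threshold, and past state (hangover) to reach the final decision.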

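3.2 Preprocessing

• Downsampling: whether the input sample rate is 8, 16, 32, or 48 kHz, the VAD internally brings the signal down to 8 kHz.

• Reason: the main energy and information of speech are concentrated in the low band (0 to 4 kHz). Lowering the sample rate greatly reduces computation with little effect on VAD accuracy.

• Framing: the signal is processed in short frames. WebRTC VAD supports frame lengths of 10 ms, 20 ms, and 30 ms, which at 8 kHz are the 80, 160, and 240 samples that GmmProbability in 3.6 branches on.

• Windowing: FFT-based detectors usually apply a Hamming window to reduce spectral leakage. The GMM VAD instead splits the signal into sub-bands with a filter bank, so no analysis window is involved.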

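3.3 Feature extraction

The features below are the ones commonly used to separate speech from noise. Note that the GMM code listed in 3.6 consumes only per-sub-band features (features[channel]) plus the total power of the frame (total_power).

Total energy:

• The logarithm of the sum of squares of all samples in the frame.

• $E_{log} = \log \sum_{n=0}^{N-1} x(n)^2$

• Purpose: silent frames usually have very low energy.

Zero crossing rate (ZCR):

• The number of times the signal crosses the zero axis.

• Purpose: unvoiced sounds (such as /s/, /f/) and noise usually have a high ZCR, while voiced sounds (such as /a/, /o/) have a low ZCR.

Spectral slope:

• Fit the spectral envelope by linear regression and take the slope.

• Purpose: the speech spectrum usually falls as frequency rises (negative slope), whereas the white-noise spectrum is fairly flat.

Spectral flatness:

• The ratio of the geometric mean to the arithmetic mean of the power spectrum.

• Purpose: measures whether the spectrum looks tonal (pronounced peaks, value near 0) or noise-like (flat, value near 1).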
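
A minimal sketch of spectral flatness as the ratio of the geometric mean to the arithmetic mean of a power spectrum. The function name and the small floor `kFloor` (added so that empty bins do not produce log(0)) are assumptions of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Flatness = geometric mean / arithmetic mean of the power spectrum.
// Close to 1 for a flat (noise-like) spectrum, close to 0 for a peaky one.
double SpectralFlatness(const std::vector<double>& power) {
  assert(!power.empty());
  const double kFloor = 1e-12;  // Assumed floor that keeps log() finite.
  double log_sum = 0.0;
  double sum = 0.0;
  for (double p : power) {
    const double v = p > kFloor ? p : kFloor;
    log_sum += std::log(v);
    sum += v;
  }
  const double n = static_cast<double>(power.size());
  // The geometric mean is computed in the log domain to avoid overflow.
  return std::exp(log_sum / n) / (sum / n);
}
```

A spectrum with equal power in every bin yields a flatness of 1, while a spectrum dominated by a single bin yields a value close to 0.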

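Sub-band energy ratio:

• Split the 0 to 4 kHz spectrum (8 kHz sample rate) into several sub-bands (for example low, middle, and high).

• Compute each sub-band's share of the total energy.

• Purpose: speech usually concentrates its energy in the lower sub-bands (such as 0-500 Hz and 500-1000 Hz), while high-frequency noise puts more energy in the upper sub-bands.

• The WebRTC GMM VAD works on six sub-bands (80-250, 250-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz) and uses the log energy of each sub-band as one feature.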

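3.4 Model matching (Gaussian mixture model (GMM) classification)

This is the core of WebRTC VAD. It maintains two separate GMMs:

• Speech model ($\lambda_s$): trained on a large amount of clean speech data.

• Noise model ($\lambda_n$): trained on a variety of background-noise data.

Each model is a mixture of several Gaussians:

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)$$

where $x$ is the feature vector, $w_i$ are the weights, $\mu_i$ the means, and $\Sigma_i$ the covariances.

WebRTC now also includes a VAD based on a recurrent neural network (RNN) (modules/audio_processing/agc2/rnn_vad/). It is not covered in detail here; interested readers can study it on their own.

Computation:

For the current frame's feature vector $x$, compute its log-likelihood under the speech model, $\log p(x \mid \lambda_s)$, and under the noise model, $\log p(x \mid \lambda_n)$.
Compute the likelihood ratio: $LLR = \log p(x \mid \lambda_s) - \log p(x \mid \lambda_n)$
If $LLR > \theta$, where $\theta$ is a decision threshold, the frame is judged to be speech; otherwise noise.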
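
A minimal floating-point sketch of the likelihood-ratio test for a one-dimensional feature. It only illustrates the principle: the struct `Gmm1D`, the function names, and the model parameters in the usage note are hypothetical, and WebRTC itself performs the computation in fixed-point arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A one-dimensional Gaussian mixture; the three vectors have equal length.
struct Gmm1D {
  std::vector<double> weights;
  std::vector<double> means;
  std::vector<double> stds;
};

// log p(x | model), with p(x) = sum_i w_i * N(x; mu_i, sigma_i^2).
double LogLikelihood(const Gmm1D& model, double x) {
  assert(!model.weights.empty());
  assert(model.weights.size() == model.means.size());
  assert(model.means.size() == model.stds.size());
  const double kSqrt2Pi = 2.5066282746310002;
  double p = 0.0;
  for (std::size_t i = 0; i < model.weights.size(); ++i) {
    const double z = (x - model.means[i]) / model.stds[i];
    p += model.weights[i] * std::exp(-0.5 * z * z) /
         (model.stds[i] * kSqrt2Pi);
  }
  return std::log(p + 1e-300);  // Tiny floor avoids log(0).
}

// LLR = log p(x | speech) - log p(x | noise).
double LogLikelihoodRatio(const Gmm1D& speech, const Gmm1D& noise, double x) {
  return LogLikelihood(speech, x) - LogLikelihood(noise, x);
}

// The frame is judged to be speech when the LLR exceeds the threshold.
bool IsSpeech(const Gmm1D& speech, const Gmm1D& noise, double x,
              double threshold) {
  return LogLikelihoodRatio(speech, noise, x) > threshold;
}
```

For example, with a speech model centred on 8 and 10 and a noise model centred on 2 and 3 (all with unit standard deviation), a feature value of 9 gives a positive ratio and a value of 2.5 gives a negative one.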

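3.5 Decision logic and smoothing

Raw frame-by-frame decisions are easily disturbed by transient noise and tend to flicker, so a state machine and smoothing mechanisms are introduced:

Adaptation to background noise:

• The decision does not rely on a fixed energy level; it tracks the background noise level.

• In a quiet environment the effective threshold is low, so faint speech is easy to detect.

• In a noisy environment the effective threshold rises, which prevents noise from triggering falsely.

• In the code in 3.6 this adaptation comes from updating the means and standard deviations of the noise and speech models on every processed frame; the likelihood-ratio thresholds themselves (individualTest, totalTest) are fixed per mode and frame length.

Hangover mechanism:

• Speech-to-silence transition: when frames start being judged as noise, the output does not switch to silence at once; it enters a "hangover" state and keeps reporting speech for a few more frames (for example 3-5 frames).

• Purpose: avoid cutting off the tail of an utterance (such as a final consonant).

• Silence-to-speech transition: a common safeguard is to require several consecutive speech frames before switching to the speech state. The code in 3.6 does not delay the onset; it uses its count of consecutive speech frames (num_of_speech) only to choose between the shorter hangover (overhead1) and the longer one (overhead2).

• Purpose: keep brief bursts of noise (such as a door slam) from being treated as sustained speech.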
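
A minimal sketch of hangover smoothing, modelled on the final block of GmmProbability listed in 3.6: each speech frame reloads a hangover counter, a long run of speech selects a longer hangover, and non-speech frames keep reporting speech until the counter runs out. The class name and constructor parameters are hypothetical, and the sketch returns a plain bool where the original returns a numeric flag.

```cpp
#include <cassert>

class HangoverSmoother {
 public:
  HangoverSmoother(int short_hangover, int long_hangover,
                   int max_speech_frames)
      : short_hangover_(short_hangover),
        long_hangover_(long_hangover),
        max_speech_frames_(max_speech_frames),
        hangover_(0),
        speech_run_(0) {
    assert(short_hangover >= 0 && long_hangover >= short_hangover);
    assert(max_speech_frames > 0);
  }

  // `raw_speech` is the unsmoothed per-frame decision.
  // Returns the smoothed decision for the same frame.
  bool Update(bool raw_speech) {
    if (raw_speech) {
      if (++speech_run_ > max_speech_frames_) {
        // Sustained speech: saturate the counter, use the long hangover.
        speech_run_ = max_speech_frames_;
        hangover_ = long_hangover_;
      } else {
        hangover_ = short_hangover_;
      }
      return true;
    }
    speech_run_ = 0;
    if (hangover_ > 0) {
      --hangover_;  // Still inside the hangover: keep reporting speech.
      return true;
    }
    return false;
  }

 private:
  int short_hangover_;
  int long_hangover_;
  int max_speech_frames_;
  int hangover_;
  int speech_run_;
};
```

With a short hangover of 2 frames, a single speech frame followed by silence is reported as speech for three frames in total before the output drops to silence.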

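Mode selection: WebRTC provides four modes (0 to 3), which essentially adjust the thresholds and hangover lengths above. A higher mode is more restrictive about reporting speech:

• Normal / Quality (0): the most permissive setting; it has the lowest thresholds, so the least speech is missed.

• Low Bitrate (1): judges silence somewhat more readily (saving bandwidth), with higher thresholds than mode 0.

• Aggressive (2): markedly higher thresholds and a shorter hangover, so more noise is rejected at the cost of missing more weak speech.

• Very Aggressive (3): the strictest setting; only frames that are very likely speech are reported, and the missed-detection rate is the highest.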

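3.6 Implementation source

1. Feature extraction. The function below is the FeaturesExtractor of the RNN VAD: it applies an optional high-pass filter, computes the LP residual, estimates the pitch on that residual, and then extracts spectral features while checking for silence.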

bool FeaturesExtractor::CheckSilenceComputeFeatures(
    rtc::ArrayView<const float, kFrameSize10ms24kHz> samples,
    rtc::ArrayView<float, kFeatureVectorSize> feature_vector) {
  // Pre-processing.
  if (use_high_pass_filter_) {
    std::array<float, kFrameSize10ms24kHz> samples_filtered;
    hpf_.Process(samples, samples_filtered);
    // Feed buffer with the pre-processed version of |samples|.
    pitch_buf_24kHz_.Push(samples_filtered);
  } else {
    // Feed buffer with |samples|.
    pitch_buf_24kHz_.Push(samples);
  }
  // Extract the LP residual.
  float lpc_coeffs[kNumLpcCoefficients];
  ComputeAndPostProcessLpcCoefficients(pitch_buf_24kHz_view_, lpc_coeffs);
  ComputeLpResidual(lpc_coeffs, pitch_buf_24kHz_view_, lp_residual_view_);
  // Estimate pitch on the LP-residual and write the normalized pitch period
  // into the output vector (normalization based on training data stats).
  pitch_info_48kHz_ = pitch_estimator_.Estimate(lp_residual_view_);
  feature_vector[kFeatureVectorSize - 2] =
      0.01f * (static_cast<int>(pitch_info_48kHz_.period) - 300);
  // Extract lagged frames (according to the estimated pitch period).
  RTC_DCHECK_LE(pitch_info_48kHz_.period / 2, kMaxPitch24kHz);
  auto lagged_frame = pitch_buf_24kHz_view_.subview(
      kMaxPitch24kHz - pitch_info_48kHz_.period / 2, kFrameSize20ms24kHz);
  // Analyze reference and lagged frames checking if silence has been detected
  // and write the feature vector.
  return spectral_features_extractor_.CheckSilenceComputeFeatures(
      reference_frame_view_, {lagged_frame.data(), kFrameSize20ms24kHz},
      {feature_vector.data() + kNumLowerBands, kNumBands - kNumLowerBands},
      {feature_vector.data(), kNumLowerBands},
      {feature_vector.data() + kNumBands, kNumLowerBands},
      {feature_vector.data() + kNumBands + kNumLowerBands, kNumLowerBands},
      {feature_vector.data() + kNumBands + 2 * kNumLowerBands, kNumLowerBands},
      &feature_vector[kFeatureVectorSize - 1]);
}

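2. The core function, which computes the GMM probabilities. It evaluates the Gaussian densities in fixed-point arithmetic with predefined weight tables (kNoiseDataWeights, kSpeechDataWeights), picks its thresholds and hangover lengths by frame length from values set per mode, and updates the model parameters after each decision.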

static int16_t GmmProbability(VadInstT* self, int16_t* features,
                              int16_t total_power, size_t frame_length) {
  int channel, k;
  int16_t feature_minimum;
  int16_t h0, h1;
  int16_t log_likelihood_ratio;
  int16_t vadflag = 0;
  int16_t shifts_h0, shifts_h1;
  int16_t tmp_s16, tmp1_s16, tmp2_s16;
  int16_t diff;
  int gaussian;
  int16_t nmk, nmk2, nmk3, smk, smk2, nsk, ssk;
  int16_t delt, ndelt;
  int16_t maxspe, maxmu;
  int16_t deltaN[kTableSize], deltaS[kTableSize];
  int16_t ngprvec[kTableSize] = { 0 };  // Conditional probability = 0.
  int16_t sgprvec[kTableSize] = { 0 };  // Conditional probability = 0.
  int32_t h0_test, h1_test;
  int32_t tmp1_s32, tmp2_s32;
  int32_t sum_log_likelihood_ratios = 0;
  int32_t noise_global_mean, speech_global_mean;
  int32_t noise_probability[kNumGaussians], speech_probability[kNumGaussians];
  int16_t overhead1, overhead2, individualTest, totalTest;

  // Set various thresholds based on frame lengths (80, 160 or 240 samples).
  if (frame_length == 80) {
    overhead1 = self->over_hang_max_1[0];
    overhead2 = self->over_hang_max_2[0];
    individualTest = self->individual[0];
    totalTest = self->total[0];
  } else if (frame_length == 160) {
    overhead1 = self->over_hang_max_1[1];
    overhead2 = self->over_hang_max_2[1];
    individualTest = self->individual[1];
    totalTest = self->total[1];
  } else {
    overhead1 = self->over_hang_max_1[2];
    overhead2 = self->over_hang_max_2[2];
    individualTest = self->individual[2];
    totalTest = self->total[2];
  }

  if (total_power > kMinEnergy) {
    // The signal power of current frame is large enough for processing. The
    // processing consists of two parts:
    // 1) Calculating the likelihood of speech and thereby a VAD decision.
    // 2) Updating the underlying model, w.r.t., the decision made.

    // The detection scheme is an LRT with hypothesis
    // H0: Noise
    // H1: Speech
    //
    // We combine a global LRT with local tests, for each frequency sub-band,
    // here defined as |channel|.
    for (channel = 0; channel < kNumChannels; channel++) {
      // For each channel we model the probability with a GMM consisting of
      // |kNumGaussians|, with different means and standard deviations depending
      // on H0 or H1.
      h0_test = 0;
      h1_test = 0;
      for (k = 0; k < kNumGaussians; k++) {
        gaussian = channel + k * kNumChannels;
        // Probability under H0, that is, probability of frame being noise.
        // Value given in Q27 = Q7 * Q20.
        tmp1_s32 = WebRtcVad_GaussianProbability(features[channel],
                                                 self->noise_means[gaussian],
                                                 self->noise_stds[gaussian],
                                                 &deltaN[gaussian]);
        noise_probability[k] = kNoiseDataWeights[gaussian] * tmp1_s32;
        h0_test += noise_probability[k];  // Q27

        // Probability under H1, that is, probability of frame being speech.
        // Value given in Q27 = Q7 * Q20.
        tmp1_s32 = WebRtcVad_GaussianProbability(features[channel],
                                                 self->speech_means[gaussian],
                                                 self->speech_stds[gaussian],
                                                 &deltaS[gaussian]);
        speech_probability[k] = kSpeechDataWeights[gaussian] * tmp1_s32;
        h1_test += speech_probability[k];  // Q27
      }

      // Calculate the log likelihood ratio: log2(Pr{X|H1} / Pr{X|H0}).
      // Approximation:
      // log2(Pr{X|H1} / Pr{X|H0}) = log2(Pr{X|H1}*2^Q) - log2(Pr{X|H0}*2^Q)
      //                           = log2(h1_test) - log2(h0_test)
      //                           = log2(2^(31-shifts_h1)*(1+b1))
      //                             - log2(2^(31-shifts_h0)*(1+b0))
      //                           = shifts_h0 - shifts_h1
      //                             + log2(1+b1) - log2(1+b0)
      //                          ~= shifts_h0 - shifts_h1
      //
      // Note that b0 and b1 are values less than 1, hence, 0 <= log2(1+b0) < 1.
      // Further, b0 and b1 are independent and on the average the two terms
      // cancel.
      shifts_h0 = WebRtcSpl_NormW32(h0_test);
      shifts_h1 = WebRtcSpl_NormW32(h1_test);
      if (h0_test == 0) {
        shifts_h0 = 31;
      }
      if (h1_test == 0) {
        shifts_h1 = 31;
      }
      log_likelihood_ratio = shifts_h0 - shifts_h1;

      // Update |sum_log_likelihood_ratios| with spectrum weighting. This is
      // used for the global VAD decision.
      sum_log_likelihood_ratios +=
          (int32_t) (log_likelihood_ratio * kSpectrumWeight[channel]);

      // Local VAD decision.
      if ((log_likelihood_ratio * 4) > individualTest) {
        vadflag = 1;
      }

      // TODO(bjornv): The conditional probabilities below are applied on the
      // hard coded number of Gaussians set to two. Find a way to generalize.
      // Calculate local noise probabilities used later when updating the GMM.
      h0 = (int16_t) (h0_test >> 12);  // Q15
      if (h0 > 0) {
        // High probability of noise. Assign conditional probabilities for each
        // Gaussian in the GMM.
        tmp1_s32 = (noise_probability[0] & 0xFFFFF000) << 2;  // Q29
        ngprvec[channel] = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, h0);  // Q14
        ngprvec[channel + kNumChannels] = 16384 - ngprvec[channel];
      } else {
        // Low noise probability. Assign conditional probability 1 to the first
        // Gaussian and 0 to the rest (which is already set at initialization).
        ngprvec[channel] = 16384;
      }

      // Calculate local speech probabilities used later when updating the GMM.
      h1 = (int16_t) (h1_test >> 12);  // Q15
      if (h1 > 0) {
        // High probability of speech. Assign conditional probabilities for each
        // Gaussian in the GMM. Otherwise use the initialized values, i.e., 0.
        tmp1_s32 = (speech_probability[0] & 0xFFFFF000) << 2;  // Q29
        sgprvec[channel] = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, h1);  // Q14
        sgprvec[channel + kNumChannels] = 16384 - sgprvec[channel];
      }
    }

    // Make a global VAD decision.
    vadflag |= (sum_log_likelihood_ratios >= totalTest);

    // Update the model parameters.
    maxspe = 12800;
    for (channel = 0; channel < kNumChannels; channel++) {

      // Get minimum value in past which is used for long term correction in Q4.
      feature_minimum = WebRtcVad_FindMinimum(self, features[channel], channel);

      // Compute the "global" mean, that is the sum of the two means weighted.
      noise_global_mean = WeightedAverage(&self->noise_means[channel], 0,
                                          &kNoiseDataWeights[channel]);
      tmp1_s16 = (int16_t) (noise_global_mean >> 6);  // Q8

      for (k = 0; k < kNumGaussians; k++) {
        gaussian = channel + k * kNumChannels;

        nmk = self->noise_means[gaussian];
        smk = self->speech_means[gaussian];
        nsk = self->noise_stds[gaussian];
        ssk = self->speech_stds[gaussian];

        // Update noise mean vector if the frame consists of noise only.
        nmk2 = nmk;
        if (!vadflag) {
          // deltaN = (x-mu)/sigma^2
          // ngprvec[k] = |noise_probability[k]| /
          //   (|noise_probability[0]| + |noise_probability[1]|)

          // (Q14 * Q11 >> 11) = Q14.
          delt = (int16_t)((ngprvec[gaussian] * deltaN[gaussian]) >> 11);
          // Q7 + (Q14 * Q15 >> 22) = Q7.
          nmk2 = nmk + (int16_t)((delt * kNoiseUpdateConst) >> 22);
        }

        // Long term correction of the noise mean.
        // Q8 - Q8 = Q8.
        ndelt = (feature_minimum << 4) - tmp1_s16;
        // Q7 + (Q8 * Q8) >> 9 = Q7.
        nmk3 = nmk2 + (int16_t)((ndelt * kBackEta) >> 9);

        // Control that the noise mean does not drift too much.
        tmp_s16 = (int16_t) ((k + 5) << 7);
        if (nmk3 < tmp_s16) {
          nmk3 = tmp_s16;
        }
        tmp_s16 = (int16_t) ((72 + k - channel) << 7);
        if (nmk3 > tmp_s16) {
          nmk3 = tmp_s16;
        }
        self->noise_means[gaussian] = nmk3;

        if (vadflag) {
          // Update speech mean vector:
          // |deltaS| = (x-mu)/sigma^2
          // sgprvec[k] = |speech_probability[k]| /
          //   (|speech_probability[0]| + |speech_probability[1]|)

          // (Q14 * Q11) >> 11 = Q14.
          delt = (int16_t)((sgprvec[gaussian] * deltaS[gaussian]) >> 11);
          // Q14 * Q15 >> 21 = Q8.
          tmp_s16 = (int16_t)((delt * kSpeechUpdateConst) >> 21);
          // Q7 + (Q8 >> 1) = Q7. With rounding.
          smk2 = smk + ((tmp_s16 + 1) >> 1);

          // Control that the speech mean does not drift too much.
          maxmu = maxspe + 640;
          if (smk2 < kMinimumMean[k]) {
            smk2 = kMinimumMean[k];
          }
          if (smk2 > maxmu) {
            smk2 = maxmu;
          }
          self->speech_means[gaussian] = smk2;  // Q7.

          // (Q7 >> 3) = Q4. With rounding.
          tmp_s16 = ((smk + 4) >> 3);

          tmp_s16 = features[channel] - tmp_s16;  // Q4
          // (Q11 * Q4 >> 3) = Q12.
          tmp1_s32 = (deltaS[gaussian] * tmp_s16) >> 3;
          tmp2_s32 = tmp1_s32 - 4096;
          tmp_s16 = sgprvec[gaussian] >> 2;
          // (Q14 >> 2) * Q12 = Q24.
          tmp1_s32 = tmp_s16 * tmp2_s32;

          tmp2_s32 = tmp1_s32 >> 4;  // Q20

          // 0.1 * Q20 / Q7 = Q13.
          if (tmp2_s32 > 0) {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(tmp2_s32, ssk * 10);
          } else {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(-tmp2_s32, ssk * 10);
            tmp_s16 = -tmp_s16;
          }
          // Divide by 4 giving an update factor of 0.025 (= 0.1 / 4).
          // Note that division by 4 equals shift by 2, hence,
          // (Q13 >> 8) = (Q13 >> 6) / 4 = Q7.
          tmp_s16 += 128;  // Rounding.
          ssk += (tmp_s16 >> 8);
          if (ssk < kMinStd) {
            ssk = kMinStd;
          }
          self->speech_stds[gaussian] = ssk;
        } else {
          // Update GMM variance vectors.
          // deltaN * (features[channel] - nmk) - 1
          // Q4 - (Q7 >> 3) = Q4.
          tmp_s16 = features[channel] - (nmk >> 3);
          // (Q11 * Q4 >> 3) = Q12.
          tmp1_s32 = (deltaN[gaussian] * tmp_s16) >> 3;
          tmp1_s32 -= 4096;

          // (Q14 >> 2) * Q12 = Q24.
          tmp_s16 = (ngprvec[gaussian] + 2) >> 2;
          tmp2_s32 = OverflowingMulS16ByS32ToS32(tmp_s16, tmp1_s32);
          // Q20  * approx 0.001 (2^-10=0.0009766), hence,
          // (Q24 >> 14) = (Q24 >> 4) / 2^10 = Q20.
          tmp1_s32 = tmp2_s32 >> 14;

          // Q20 / Q7 = Q13.
          if (tmp1_s32 > 0) {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, nsk);
          } else {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(-tmp1_s32, nsk);
            tmp_s16 = -tmp_s16;
          }
          tmp_s16 += 32;  // Rounding
          nsk += tmp_s16 >> 6;  // Q13 >> 6 = Q7.
          if (nsk < kMinStd) {
            nsk = kMinStd;
          }
          self->noise_stds[gaussian] = nsk;
        }
      }

      // Separate models if they are too close.
      // |noise_global_mean| in Q14 (= Q7 * Q7).
      noise_global_mean = WeightedAverage(&self->noise_means[channel], 0,
                                          &kNoiseDataWeights[channel]);

      // |speech_global_mean| in Q14 (= Q7 * Q7).
      speech_global_mean = WeightedAverage(&self->speech_means[channel], 0,
                                           &kSpeechDataWeights[channel]);

      // |diff| = "global" speech mean - "global" noise mean.
      // (Q14 >> 9) - (Q14 >> 9) = Q5.
      diff = (int16_t) (speech_global_mean >> 9) -
          (int16_t) (noise_global_mean >> 9);
      if (diff < kMinimumDifference[channel]) {
        tmp_s16 = kMinimumDifference[channel] - diff;

        // |tmp1_s16| = ~0.8 * (kMinimumDifference - diff) in Q7.
        // |tmp2_s16| = ~0.2 * (kMinimumDifference - diff) in Q7.
        tmp1_s16 = (int16_t)((13 * tmp_s16) >> 2);
        tmp2_s16 = (int16_t)((3 * tmp_s16) >> 2);

        // Move Gaussian means for speech model by |tmp1_s16| and update
        // |speech_global_mean|. Note that |self->speech_means[channel]| is
        // changed after the call.
        speech_global_mean = WeightedAverage(&self->speech_means[channel],
                                             tmp1_s16,
                                             &kSpeechDataWeights[channel]);

        // Move Gaussian means for noise model by -|tmp2_s16| and update
        // |noise_global_mean|. Note that |self->noise_means[channel]| is
        // changed after the call.
        noise_global_mean = WeightedAverage(&self->noise_means[channel],
                                            -tmp2_s16,
                                            &kNoiseDataWeights[channel]);
      }

      // Control that the speech & noise means do not drift too much.
      maxspe = kMaximumSpeech[channel];
      tmp2_s16 = (int16_t) (speech_global_mean >> 7);
      if (tmp2_s16 > maxspe) {
        // Upper limit of speech model.
        tmp2_s16 -= maxspe;

        for (k = 0; k < kNumGaussians; k++) {
          self->speech_means[channel + k * kNumChannels] -= tmp2_s16;
        }
      }

      tmp2_s16 = (int16_t) (noise_global_mean >> 7);
      if (tmp2_s16 > kMaximumNoise[channel]) {
        tmp2_s16 -= kMaximumNoise[channel];

        for (k = 0; k < kNumGaussians; k++) {
          self->noise_means[channel + k * kNumChannels] -= tmp2_s16;
        }
      }
    }
    self->frame_counter++;
  }

  // Smooth with respect to transition hysteresis.
  if (!vadflag) {
    if (self->over_hang > 0) {
      vadflag = 2 + self->over_hang;
      self->over_hang--;
    }
    self->num_of_speech = 0;
  } else {
    self->num_of_speech++;
    if (self->num_of_speech > kMaxSpeechFrames) {
      self->num_of_speech = kMaxSpeechFrames;
      self->over_hang = overhead2;
    } else {
      self->over_hang = overhead1;
    }
  }
  return vadflag;
}
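
The log-likelihood-ratio approximation used inside GmmProbability (shifts_h0 - shifts_h1) can be tried in isolation. The sketch below is a stand-alone illustration under stated assumptions: `NormShifts` is a hypothetical stand-in for WebRtcSpl_NormW32 that handles positive inputs only, and `ApproxLog2Ratio` repeats the listing's substitution of 31 for a zero probability.

```cpp
#include <cassert>
#include <cstdint>

// Number of left shifts needed to move the highest set bit of a positive
// 32-bit value to bit 30 (just below the sign bit).
int NormShifts(int32_t value) {
  assert(value > 0);
  int shifts = 0;
  while ((value & 0x40000000) == 0) {
    value <<= 1;
    ++shifts;
  }
  return shifts;
}

// Approximates log2(h1 / h0) by the difference of normalization shifts.
// Each shift count is an integer log2 estimate, so the result is within
// one unit of the exact value when both inputs are positive.
int ApproxLog2Ratio(int32_t h1, int32_t h0) {
  const int shifts_h0 = (h0 == 0) ? 31 : NormShifts(h0);
  const int shifts_h1 = (h1 == 0) ? 31 : NormShifts(h1);
  return shifts_h0 - shifts_h1;
}
```

For instance, a ratio of 1024 to 1 gives 10 and a ratio of 1536 to 1 also gives 10, although the exact log2 of 1536 is about 10.58. The fractional part is what the listing's comment describes as cancelling on average.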