Using AI to Monitor Voice Calls: Approach and Practice

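Background

As an intermediary, an e-commerce platform has to let merchants and customers talk over privacy (masked-number) phone calls, yet it also worries that merchants will use those calls to steer customers into off-platform deals (飞单), so the calls need to be monitored.
This is the AI era, so running an AI-related project also helps ease the boss's anxiety.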

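Basic Approach

Pull the recordings of privacy calls from the carrier
Transcribe the audio to text
Manually label the call transcripts for whether the merchant induces an off-platform deal
Fine-tune a large model on the labeled data for automatic detection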

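Implementation Steps

Data Preparation

Read order information from the production database into an Excel file, and download the call recordings into a designated directory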
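
A minimal sketch of the download part of this step, assuming each order row already carries a recording URL. The field names `order_id` and `recording_url`, the target directory, and both function names are hypothetical; the article does not show the real schema or the carrier's download interface.

```python
import os
import urllib.request


def plan_downloads(orders: list[dict], out_dir: str) -> list[dict]:
    """Attach a local target path to every order that has a recording URL."""
    planned = []
    for order in orders:
        url = order.get("recording_url")  # hypothetical column name
        if not url:
            continue  # order without a call recording
        # Keep the original extension so later decoding can detect the format
        ext = os.path.splitext(url.split("?")[0])[1] or ".wav"
        local_path = os.path.join(out_dir, f"{order['order_id']}{ext}")
        planned.append({**order, "local_path": local_path})
    return planned


def download_all(planned: list[dict]) -> None:
    """Download each planned recording, skipping files that already exist."""
    for row in planned:
        if os.path.exists(row["local_path"]):
            continue
        os.makedirs(os.path.dirname(row["local_path"]) or ".", exist_ok=True)
        urllib.request.urlretrieve(row["recording_url"], row["local_path"])
```

Separating planning from downloading keeps the path logic easy to check without touching the network.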

Speech Transcription

Preprocessing

Convert every audio file to a standard WAV: mono, 16 kHz sample rate

import logging
import tempfile
from typing import Optional

from pydub import AudioSegment

logger = logging.getLogger(__name__)

def get_temp_path(prefix: str, suffix: str = ".wav") -> str:
    """Create an empty temporary file and return its path (the caller deletes it)."""
    with tempfile.NamedTemporaryFile(prefix=prefix, suffix=suffix, delete=False) as tmp_file:
        return tmp_file.name

def preprocess_audio(input_path: str) -> Optional[str]:
    """
    Audio preprocessing: convert to mono, 16 kHz WAV.
    Returns the path of the processed temporary file, or None on failure.
    """
    try:
        audio = AudioSegment.from_file(input_path)
        if audio.channels > 1:
            audio = audio.set_channels(1)
        if audio.frame_rate != 16000:
            audio = audio.set_frame_rate(16000)

        output_path = get_temp_path("preprocess_")
        audio.export(output_path, format="wav")
        logger.info(f"Preprocessing succeeded: {input_path} -> {output_path}")
        return output_path
    except Exception as e:
        logger.error(f"Preprocessing failed [{input_path}]: {e}")
        return None
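
The result of the preprocessing step can be sanity-checked with the standard-library `wave` module. This is only a sketch; the helper name `is_mono_16k` is made up for illustration.

```python
import wave


def is_mono_16k(wav_path: str) -> bool:
    """Return True if the WAV file is mono with a 16 kHz sample rate."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == 16000
```

Running it over the preprocessed files before transcription catches inputs that slipped through in the wrong format.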

Noise Reduction

The iic/speech_zipenhancer_ans_multiloss_16k_base model is used to clean up the WAV

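import logging
import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

logger = logging.getLogger(__name__)

def load_enhance_pipeline() -> pipeline:
    enhancer_pipe = pipeline(
        Tasks.acoustic_noise_suppression,
        model='iic/speech_zipenhancer_ans_multiloss_16k_base'
    )
    return enhancer_pipe

def enhance_wav(enhancer_pipe: pipeline, input_path: str) -> str:
    """
    Audio enhancement: noise suppression.
    :param enhancer_pipe: pipeline returned by load_enhance_pipeline()
    :param input_path: path of the input audio file
    :return: path of the enhanced audio file
    """
    # get_temp_path is the helper defined in the preprocessing step
    enhanced_path = get_temp_path("enhanced_", suffix=".wav")

    try:
        enhancer_pipe(input_path, output_path=enhanced_path)
        return enhanced_path
    except Exception as e:
        logger.warning(f"Noise suppression failed for {input_path}: {e}")
        # os.copy_file_range works on file descriptors, so copy by path instead
        shutil.copyfile(input_path, enhanced_path)
        logger.warning(f"Using the original audio instead: {input_path} -> {enhanced_path}")
        return enhanced_path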

Speech to Text

iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch is used for speech recognition (this version supports configurable hotwords)
iic/speech_fsmn_vad_zh-cn-16k-common-pytorch is used for voice activity detection
iic/punc_ct-transformer_cn-en-common-vocab471067-large is used for punctuation restoration
iic/speech_campplus_sv_zh-cn_16k-common is used for speaker diarization

import logging

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

logger = logging.getLogger(__name__)

# mills2timestr (milliseconds -> timestamp string) is a project helper not shown in this article


def load_modelscope_pipeline() -> pipeline:
    # Call transcription
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
        # Switched to a model that supports hotwords
        model='iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', model_revision='v2.0.5',  # speech recognition
        # model='iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch', model_revision='v2.0.5',  # speech recognition
        vad_model='iic/speech_fsmn_vad_zh-cn-16k-common-pytorch', vad_model_revision="v2.0.4",  # voice activity detection
        punc_model='iic/punc_ct-transformer_cn-en-common-vocab471067-large', punc_model_revision="v2.0.4",  # punctuation
        spk_model="iic/speech_campplus_sv_zh-cn_16k-common", spk_model_revision="v2.0.2",  # speaker diarization
    )
    return inference_pipeline

def transcript_wavs(inference_pipeline: pipeline, hotword: str, wav_files: list[str], verbose: bool) -> list[dict[str, str]]:
    wav_contents = []
    try:
        rec_result = inference_pipeline(
            wav_files,
            batch_size_s=300,
            batch_size_token_threshold_s=40,
            hotword=hotword,
        )
        for i, result in enumerate(rec_result):
            info = result.get('sentence_info')
            if info is None:
                wav_contents.append({"wav": wav_files[i], "content": ""})
                logger.warning(f"Failed to process {wav_files[i]}; it may not be an audio file")
                continue
            speakers = set()
            lines = []
            for sentence in info:
                spk = sentence['spk']
                txt = sentence['text']
                if spk is None or txt is None or len(txt) == 0:
                    continue
                start = int(sentence['start'])
                lines.append({'spk': spk, 'text': txt, 'start': start})
                speakers.add(spk)

            if len(speakers) < 2:
                wav_contents.append({"wav": wav_files[i], "content": "\n".join(line['text'] for line in lines)})
                logger.warning(f"Fewer than two speakers detected in {wav_files[i]}; it may be a single-speaker recording")
                continue

            # Merge consecutive sentences from the same speaker into one turn
            pre_speaker = None
            pre_line = ""
            pre_start = 0
            contents = []
            for line in lines:
                current_speaker = line['spk']
                current_line = line['text']
                current_start = line['start']
                if pre_speaker is None:
                    pre_speaker = current_speaker
                    pre_line = current_line
                    pre_start = current_start
                elif pre_speaker != current_speaker:
                    contents.append(f"Speaker_{pre_speaker} {mills2timestr(pre_start)}: {pre_line}")
                    pre_speaker = current_speaker
                    pre_line = current_line
                    pre_start = current_start
                else:
                    pre_line += current_line
            if pre_speaker is not None:
                contents.append(f"Speaker_{pre_speaker} {mills2timestr(pre_start)}: {pre_line}")
            wav_contents.append({"wav": wav_files[i], "content": "\n".join(contents)})
            if verbose:
                logger.info(f"Transcript of {wav_files[i]}:\n{wav_contents[-1]['content']}")
            else:
                logger.debug(f"Transcript of {wav_files[i]}:\n{wav_contents[-1]['content']}")
        return wav_contents
    except Exception as e:
        logger.error(f"Transcription failed: {e}")
        return wav_contents
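
`mills2timestr` is called above, but its definition is not shown. Judging from timestamps such as `00:05.250` in the sample transcript further down, it formats a millisecond offset as `mm:ss.mmm`. A minimal sketch under that assumption follows; the author's real helper may differ, for example in how it handles calls longer than an hour.

```python
def mills2timestr(millis: int) -> str:
    """Format a millisecond offset as mm:ss.mmm (assumed format)."""
    total_seconds, ms = divmod(int(millis), 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}.{ms:03d}"
```

For example, `mills2timestr(5250)` returns `"00:05.250"`.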

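Text Summarization

The operations team cannot be expected to read tens of thousands of transcripts one by one. A summary is therefore generated first, together with a check for any inducement to return goods or to buy through other channels, so that operations can spot off-platform deals quickly.

Qwen/Qwen2.5-7B-Instruct is used here to generate the summaries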

import logging

import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger(__name__)


class TextSummarizer:
    def __init__(self, template: str, verbose: bool = False):
        self.template = template
        self.verbose = verbose

        if self.verbose:
            logger.info("Summary template in use:")
            logger.info(self.template)

        model_path = "Qwen/Qwen2.5-7B-Instruct"
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path,
            trust_remote_code=True,
        )

        # device_map places the model on the selected device
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            device_map=device,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True
        )

        # Read the context length the model actually supports
        self.max_ctx_length = self.model.config.max_position_embeddings
        # Cap the summary length
        self.summary_max_length = min(self.max_ctx_length // 4, 512)
        logger.info(f"Model max context length: {self.max_ctx_length}")
        logger.info(f"Summary max length: {self.summary_max_length}")

    def _truncate_text(self, text: str) -> str:
        """Truncate the text so that prompt plus summary fit in the context window."""
        tokens = self.tokenizer.encode(text)
        # Reserve room for the system template, the chat markers (64 is a rough allowance)
        # and the generated summary
        reserved = len(self.tokenizer.encode(self.template)) + self.summary_max_length + 64
        budget = self.max_ctx_length - reserved
        if len(tokens) > budget:
            # Keep the informative ends: first 50% and last 30% of the budget
            head = budget // 2
            tail = budget * 3 // 10
            keep_tokens = tokens[:head] + tokens[-tail:]
            return self.tokenizer.decode(keep_tokens)
        return text

    def generate_summary(self, txt_path: str) -> str:
        """Core method: generate a summary for one transcript file."""
        try:
            # Read and preprocess the text
            with open(txt_path, 'r', encoding='utf-8') as f:
                raw_text = f.read().strip()

            if not raw_text:
                return "Error: file is empty"

            processed_text = self._truncate_text(raw_text)

            prompt = (
                f"<|im_start|>system\n{self.template}<|im_end|>\n"
                f"<|im_start|>user\nPlease analyze the following text:\n{processed_text}<|im_end|>\n"
                "<|im_start|>assistant\n"
            )

            # Generate the summary
            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
            outputs = self.model.generate(**inputs,
                max_new_tokens=self.summary_max_length,
                temperature=0.6,
                top_k=50,
                top_p=0.9,
                do_sample=True,
                repetition_penalty=1.2
            )

            # Decode only the newly generated tokens; splitting the full text on
            # "assistant" breaks whenever the transcript itself contains that word
            new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
            return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

        except Exception as e:
            logger.error(f"Error while processing file {txt_path}: {e}")
            return ""

The summary prompt is configured as follows

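As a professional text analyst, produce a summary with the following structure:
# Core Topic
One-sentence overview

# Key Points
- Point 1 (no more than 15 characters)
- Point 2
- ... (3-5 items)

# Overall Summary
One sentence describing the main content and conclusion of the text

# Inducement Check
Check whether there is any inducement to return goods or to buy through other channels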

Sample Data

The transcript of a voice call associated with one order looks like this

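Speaker_0 00:00.210: Hello, you, hello, hello TK. Hello, hello,
Speaker_1 00:05.250: Didn't you buy twenty boxes of Yangyin Zhenjing tablets (养阴镇静片) from our shop?
Speaker_0 00:10.980: Yes, that's right,
Speaker_1 00:12.360: So, uh, the warehouse was just about to ship it and said this is out of stock now, so I'm calling to let you know, please apply for a refund on your side, OK,
Speaker_0 00:22.450: A refund, you mean?
Speaker_1 00:24.200: Yes, yes, yes, because this, this specification is out of stock now,
Speaker_0 00:27.330: Then just do the return, refund, refund the money.
Speaker_1 00:31.050: Uh, on your side, you, you apply for the refund on your side, I can't issue the refund from here,
Speaker_0 00:35.870: Isn't it one nine three?
Speaker_1 00:37.570: Yes, yes, you apply for the refund, just apply for it now, the shipping deadline is about to pass.
Speaker_0 00:42.890: OK, right away, OK, you do it now, OK, you do it right now. Uh, OK, right away,
Speaker_1 00:47.570: Uh, OK, sure, OK, you go ahead and apply right now.
Speaker_0 00:49.830: Uh, OK, OK, fine,
Speaker_1 00:51.370: Uh, OK, sorry about that, uh, OK, OK.
Speaker_0 00:53.970: Uh,
Speaker_1 00:54.290: OK, goodbye.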

The summary generated from this transcript is:

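# Core Topic
The customer asks how an out-of-stock purchase will be handled.

# Key Points
- Customer confirms the purchase record
- Clerk reports no stock and suggests a refund
- Order is about to ship and cannot be cancelled
- Customer agrees to request the refund

# Overall Summary
The conversation discusses a customer's purchase from the shop that has to be refunded because the item is out of stock, and both sides agree to proceed with the refund.

# Inducement Check
No inducement to return goods or to buy through other channels was found.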
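
Because the prompt asks for fixed `# ` headings, a summary can be split into sections mechanically, for example to put the inducement check into its own spreadsheet column. The sketch below uses a hypothetical helper, `split_sections`, and assumes the model follows the heading format, which is not guaranteed.

```python
def split_sections(summary: str) -> dict[str, str]:
    """Split a summary into {heading: body} using lines that start with '# '."""
    sections: dict[str, str] = {}
    current = None
    for line in summary.splitlines():
        if line.startswith("# "):
            current = line[2:].strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    # Trim surrounding blank lines from each body
    return {title: body.strip() for title, body in sections.items()}
```

Text that appears before the first heading is ignored, and a summary without any heading yields an empty dict, which is a convenient signal that the output needs a manual look.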

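Manual Labeling

Order information, transcripts, summaries, and audio file paths are merged into a single Excel file. The operations team reviews it by hand and labels which calls involve off-platform deals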
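
One way to assemble the review sheet is sketched below with the standard-library `csv` module, so that it runs without extra dependencies. Excel can open the resulting file, whereas writing a real `.xlsx` would need a library such as openpyxl. All column names, including the empty `is_off_platform` column for reviewers to fill in, are assumptions.

```python
import csv
import io

COLUMNS = ["order_id", "wav", "transcript", "summary", "is_off_platform"]


def build_review_rows(wav_to_order: dict[str, str],
                      transcripts: list[dict[str, str]],
                      summaries: dict[str, str]) -> list[dict[str, str]]:
    """Join order ids, transcripts and summaries on the audio file path."""
    rows = []
    for item in transcripts:  # same shape as the output of transcript_wavs
        wav = item["wav"]
        rows.append({
            "order_id": wav_to_order.get(wav, ""),
            "wav": wav,
            "transcript": item["content"],
            "summary": summaries.get(wav, ""),
            "is_off_platform": "",  # left blank for the human reviewer
        })
    return rows


def rows_to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

When saving the text to disk, opening the file with `encoding="utf-8-sig"` and `newline=""` helps Excel detect Chinese text correctly.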

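Model Fine-Tuning

The manually labeled data is used to fine-tune Qwen/Qwen2.5-7B-Instruct so that it can automatically classify call transcripts as off-platform deals or not. For the fine-tuning steps, see my previous article.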
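
Before fine-tuning, the labeled rows have to be turned into training samples. The sketch below writes one chat-style sample per call as a JSON line. The `messages` layout is a common convention for instruction tuning, but the exact format depends on the fine-tuning toolkit. The instruction text, the yes/no labels and the function names are all assumptions rather than the author's actual setup.

```python
import json

# Hypothetical instruction; the wording used for the real fine-tuning run is not shown
INSTRUCTION = (
    "Decide whether the merchant in this call transcript induces "
    "an off-platform deal. Answer yes or no."
)


def to_sample(transcript: str, is_off_platform: bool) -> dict:
    """Build one chat-format training sample from a labeled transcript."""
    return {
        "messages": [
            {"role": "system", "content": INSTRUCTION},
            {"role": "user", "content": transcript},
            {"role": "assistant", "content": "yes" if is_off_platform else "no"},
        ]
    }


def to_jsonl(labeled: list[tuple[str, bool]]) -> str:
    """Serialize (transcript, label) pairs as JSON Lines text."""
    # ensure_ascii=False keeps Chinese transcripts readable in the file
    return "\n".join(
        json.dumps(to_sample(text, label), ensure_ascii=False)
        for text, label in labeled
    )
```

Keeping the answer to a single token-like word makes the fine-tuned model's output easy to parse in the monitoring step.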

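Monitoring and Alerting

Run the program on a schedule to pull the recordings of privacy-number calls
Transcribe the recordings to text
Use the fine-tuned model to judge whether a call involves an off-platform deal
If it does, send an alert to operations through an API
Operations manually verifies the flagged calls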
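
These steps can be sketched as one function that receives the four stages as callables, so that each stage can be swapped or tested separately. Every name here (`run_once`, `fetch_recordings`, `transcribe`, `is_off_platform`, `alert`) is a placeholder; the carrier API, the fine-tuned model call and the alert interface are not shown in the article.

```python
from typing import Callable, Iterable


def run_once(
    fetch_recordings: Callable[[], Iterable[str]],
    transcribe: Callable[[str], str],
    is_off_platform: Callable[[str], bool],
    alert: Callable[[str, str], None],
) -> int:
    """Process one batch of new recordings and return the number of alerts sent."""
    alerts = 0
    for wav_path in fetch_recordings():
        text = transcribe(wav_path)
        if not text:
            continue  # transcription failed or the call was silent
        if is_off_platform(text):
            alert(wav_path, text)  # operations verifies the call by hand afterwards
            alerts += 1
    return alerts
```

A scheduler such as cron, or a simple loop that sleeps between calls to `run_once`, would cover the "run on a schedule" part.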