Podcasts are first and foremost an audio medium, so the various AI-based speech technologies are very important in the podcast space.
Speech-to-Text
Whisper
Whisper is an open-source speech recognition tool released by OpenAI that converts audio files into text and supports more than 50 languages. It was trained on 680,000 hours of audio, of which 117,000 hours are multilingual speech covering 96 different languages. Thanks to this large amount of training data, Whisper's recognition accuracy for English is very high, and its Chinese word error rate (WER) of roughly 14.7% is also respectable.
The name Whisper comes from WSPSR: Web-scale Supervised Pretraining for Speech Recognition.
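For reference, transcribing a file with the open-source `openai-whisper` package takes only a few lines. This is a minimal sketch; the file name and model size below are placeholders:

```python
import whisper

# load a Whisper checkpoint; smaller models ("base", "small") trade accuracy for speed
model = whisper.load_model("small")

# transcribe a local audio file; the language can be forced or auto-detected
result = model.transcribe("episode.mp3", language="zh")

print(result["text"])            # full transcript
for seg in result["segments"]:   # per-segment timestamps
    print(seg["start"], seg["end"], seg["text"])
```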
Text-to-Speech (TTS)
TTS (Text-to-Speech) is the technology that converts text into spoken audio. Modern systems use deep learning models, typically based on the Transformer or similar architectures. OpenAI, Microsoft, Google, and the major Chinese cloud platforms all offer TTS services, and the technology is already quite mature.
The recently discussed MaskGCT is one of the better TTS models; its voice cloning in particular is excellent. You can try it out here.
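As a rough illustration of how the hosted TTS services mentioned above are typically called, here is a minimal sketch using the OpenAI Python SDK; the model and voice names are only examples, and other vendors expose similar text-in, audio-out endpoints:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# synthesize one sentence to an MP3 file
resp = client.audio.speech.create(
    model="tts-1",    # example model name
    voice="alloy",    # example voice
    input="Welcome to this episode of the podcast.",
)
with open("intro.mp3", "wb") as f:
    f.write(resp.content)  # resp.content holds the raw audio bytes
```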
Speech Analysis
pyannote-audio
pyannote-audio performs speaker diarization for a podcast: it distinguishes speaker A from speaker B, and so on. If you want something more specific (i.e., the speakers' actual names), you can build functionality like the following on top of it.
Whisper's transcription accuracy is very good, but unfortunately it offers no speaker identification.
Speaker identification is therefore implemented here with a Python library called pyannote.
pyannote is an open-source project for speaker diarization.
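A minimal diarization run with pyannote-audio looks roughly like this (a sketch only; the pretrained pipeline requires accepting its license on Hugging Face and supplying your own access token):

```python
from pyannote.audio import Pipeline

# load the pretrained diarization pipeline (needs a Hugging Face token)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="<your-huggingface-token>")

# run it on a local audio file
diarization = pipeline("episode.wav")

# iterate over speaker turns: who spoke, and when
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```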
pydub
Pydub is a powerful Python library that simplifies working with audio files. It provides a high-level interface for audio processing, making it easy to load, slice, concatenate, and apply effects to audio files. Under the hood it operates on raw WAV audio data.
API reference: pydub/API.markdown at master · jiaaro/pydub · GitHub
Opening a WAV file
```python
from pydub import AudioSegment

song = AudioSegment.from_wav("never_gonna_give_you_up.wav")
```
Or:
```python
song = AudioSegment.from_mp3("never_gonna_give_you_up.mp3")
```
Slicing audio
```python
# pydub does things in milliseconds
ten_seconds = 10 * 1000

first_10_seconds = song[:ten_seconds]
last_5_seconds = song[-5000:]
```
Slicing a specific range
```python
# cut starting at 3 seconds, lasting 1 second
clip = song[3000:4000]  # the audio from the 3-second mark to the 4-second mark
```
Exporting files
```python
from pydub import AudioSegment

sound = AudioSegment.from_file("/path/to/sound.wav", format="wav")

# simple export
file_handle = sound.export("/path/to/output.mp3", format="mp3")

# more complex export
file_handle = sound.export("/path/to/output.mp3",
                           format="mp3",
                           bitrate="192k",
                           tags={"album": "The Bends", "artist": "Radiohead"},
                           cover="/path/to/albumcovers/radioheadthebends.jpg")

# split sound in 5-second slices and export
for i, chunk in enumerate(sound[::5000]):
    with open("sound-%s.mp3" % i, "wb") as f:
        chunk.export(f, format="mp3")
```
Splitting on silence (silence.split_on_silence())
Splits an audio file into chunks based on the silent passages it contains.
```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

sound = AudioSegment.from_mp3("audio_files/xxxxxx.mp3")
clip = sound[21*1000:45*1000]

# "graph" the volume in 1-second increments
for x in range(0, int(len(clip)/1000)):
    print(x, clip[x*1000:(x+1)*1000].max_dBFS)

chunks = split_on_silence(
    clip,
    min_silence_len=1000,
    silence_thresh=-16,
    keep_silence=100
)
print("number of chunks", len(chunks))
print(chunks)
```
Examples
```python
from pydub import AudioSegment

# Example: cutting audio
def cut_audio(source_file_path, output_file_path, start_second, end_second):
    # load the audio file
    song = AudioSegment.from_file(source_file_path)
    # pydub slices in milliseconds, so convert seconds to milliseconds
    segment = song[start_second*1000:end_second*1000]
    # export the cut segment
    segment.export(output_file_path, format="mp3")

# Example: merging audio
def merge_audio(filepaths, output_file_path):
    combined = AudioSegment.empty()
    for filepath in filepaths:
        # load each file and append it to the combined segment
        audio = AudioSegment.from_file(filepath)
        combined += audio
    # export the merged audio
    combined.export(output_file_path, format="mp3")

cut_audio('example.mp3', 'cut_example.mp3', 10, 20)  # cut the audio from second 10 to second 20
merge_audio(['part1.mp3', 'part2.mp3', 'part3.mp3'], 'merged_example.mp3')  # merge three audio files
```
Applications
Method 1: transcribe first, then split the text by speaker
```python
from pyannote.core import Segment
import os
import whisper
from pyannote.audio import Pipeline


def get_text_with_timestamp(transcribe_res):
    # turn Whisper's segments into (Segment, text) pairs
    timestamp_texts = []
    print(transcribe_res["text"])
    for item in transcribe_res["segments"]:
        print(item)
        start = item["start"]
        end = item["end"]
        text = item["text"].strip()
        timestamp_texts.append((Segment(start, end), text))
    return timestamp_texts


def add_speaker_info_to_text(timestamp_texts, ann):
    # label each transcribed segment with the dominant speaker in that time range
    spk_text = []
    for seg, text in timestamp_texts:
        spk = ann.crop(seg).argmax()
        spk_text.append((seg, spk, text))
    return spk_text


def merge_cache(text_cache):
    sentence = ''.join([item[-1] for item in text_cache])
    spk = text_cache[0][1]
    start = round(text_cache[0][0].start, 1)
    end = round(text_cache[-1][0].end, 1)
    return Segment(start, end), spk, sentence


PUNC_SENT_END = [',', '.', '?', '!', ',', '。', '?', '!']


def merge_sentence(spk_text):
    # merge consecutive segments of the same speaker into sentences
    merged_spk_text = []
    pre_spk = None
    text_cache = []
    for seg, spk, text in spk_text:
        if spk != pre_spk and pre_spk is not None and len(text_cache) > 0:
            merged_spk_text.append(merge_cache(text_cache))
            text_cache = [(seg, spk, text)]
            pre_spk = spk
        elif text and len(text) > 0 and text[-1] in PUNC_SENT_END:
            text_cache.append((seg, spk, text))
            merged_spk_text.append(merge_cache(text_cache))
            text_cache = []
            pre_spk = spk
        else:
            text_cache.append((seg, spk, text))
            pre_spk = spk
    if len(text_cache) > 0:
        merged_spk_text.append(merge_cache(text_cache))
    return merged_spk_text


def diarize_text(transcribe_res, diarization_result):
    timestamp_texts = get_text_with_timestamp(transcribe_res)
    spk_text = add_speaker_info_to_text(timestamp_texts, diarization_result)
    res_processed = merge_sentence(spk_text)
    return res_processed


def write_to_txt(spk_sent, file):
    with open(file, 'w') as fp:
        for seg, spk, sentence in spk_sent:
            line = f'{seg.start:.2f} {seg.end:.2f} {spk} {sentence}\n'
            fp.write(line)


model_size = "large-v3"
os.environ['OPENAI_API_KEY'] = "<your-openai-api-key>"
os.environ['OPENAI_BASE_URL'] = "https://api.chatanywhere.tech/v1"

asr_model = whisper.load_model(model_size)
print("model loaded")

audio = "asr_speaker_demo.wav"
spk_rec_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                            use_auth_token="<your-huggingface-token>")

asr_result = asr_model.transcribe(audio, language="zh", fp16=False)
print("transcribe finished....")
diarization_result = spk_rec_pipeline(audio)
print("diarization finished...")

final_result = diarize_text(asr_result, diarization_result)
for segment, spk, sent in final_result:
    print("[%.2fs -> %.2fs] %s\n %s\n" % (segment.start, segment.end, spk, sent))
```
Method 2: diarize first, then transcribe
Split the recording by speaker turn, export each turn as its own audio file, and then transcribe the files one by one.
```python
import os
import whisper
from pyannote.audio import Pipeline
from pydub import AudioSegment

os.environ['OPENAI_API_KEY'] = "<your-openai-api-key>"
os.environ['OPENAI_BASE_URL'] = "https://api.chatanywhere.tech/v1"

model = whisper.load_model("large-v3")
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="<your-huggingface-token>")

# run the diarization pipeline on an audio file
diarization = pipeline("buss.wav")
audio = AudioSegment.from_wav("buss.wav")

i = 0
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
    # cut this speaker turn out of the original audio (pydub works in milliseconds)
    clip = audio[turn.start*1000:turn.end*1000]
    with open("audio-%s.wav" % i, "wb") as f:
        clip.export(f, format="wav")
    # transcribe the exported clip
    text = model.transcribe("audio-%s.wav" % i, language="zh", fp16=False)["text"]
    print(text)
    i = i + 1
```
Method 3: feed the audio segments in directly, then transcribe
Convert each Segment into an array of audio samples and transcribe it directly, without writing intermediate files.
```python
import os
import whisper
import numpy as np
from pyannote.audio import Pipeline
from pydub import AudioSegment

os.environ['OPENAI_API_KEY'] = "<your-openai-api-key>"
os.environ['OPENAI_BASE_URL'] = "https://api.chatanywhere.tech/v1"

model = whisper.load_model("large-v3")
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="<your-huggingface-token>")

# run the diarization pipeline on an audio file
diarization = pipeline("buss.wav")
audio = AudioSegment.from_wav("buss.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
    audio_segment = audio[turn.start*1000:turn.end*1000]
    # Whisper expects 16 kHz, 16-bit, mono PCM
    if audio_segment.frame_rate != 16000:   # 16 kHz
        audio_segment = audio_segment.set_frame_rate(16000)
    if audio_segment.sample_width != 2:     # int16
        audio_segment = audio_segment.set_sample_width(2)
    if audio_segment.channels != 1:         # mono
        audio_segment = audio_segment.set_channels(1)
    # convert the raw samples to float32 in [-1, 1] and transcribe directly
    arr = np.array(audio_segment.get_array_of_samples())
    arr = arr.astype(np.float32) / 32768.0
    text = model.transcribe(arr, language="zh", fp16=False)["text"]
    print(text)
```
Spotify's AI Voice Translation
Spotify is experimenting with turning foreign-language podcasts into podcasts in the listener's native language, which means your favorite shows may soon be heard in your own language.
Across cultures, countries, and communities, the stories we share connect us, and more often than not the voice of the storyteller carries as much weight as the story itself. For 15 years, Spotify's global platform has let creators of every kind share their work with audiences around the world. At its core, this has been made possible by technology that uses the power of audio to overcome barriers of access, borders, and distance. But with recent advances, we have been asking ourselves: are there more ways to bridge the language barrier so these voices can be heard around the world?
But doing this yourself takes time and effort. You can transcribe the podcast, feed the transcript into Google Translate or ChatGPT a few paragraphs at a time and have it translated, copy the translated material into a new script, and then re-record. Success here depends on the following (a sketch of scripting the translation step follows after this list):
- **Pronunciation:** How comfortable are you speaking the foreign language? Many of us studied Spanish in high school, but how is your Japanese?
- **Translation accuracy:** Google's support documentation claims Google Translate can be up to 94% accurate, but that does not account for colloquial speech (for example, how would it translate expressions like "cat got your tongue" or "in the zeitgeist"?).
- **Patience:** You have to be willing to re-record and re-edit.
There is no way around it: even translating just a few episodes into another language is a daunting task. So, if you can afford some help, what are your options?
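If you do go the do-it-yourself route, the translation step at least is easy to script. The sketch below (file names, chunk size, and the prompt are my own assumptions, not part of any product) sends a transcript to ChatGPT a few paragraphs at a time and collects the translated script; the re-recording still has to be done by a person or by a TTS/voice-cloning model:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_chunk(text, target_language="Japanese"):
    # ask the chat model to translate one chunk of the transcript
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system",
             "content": f"Translate this podcast transcript into {target_language}. "
                        "Keep the conversational tone and preserve speaker labels."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

with open("transcript.txt", encoding="utf-8") as f:
    paragraphs = f.read().split("\n\n")

# translate a few paragraphs at a time to stay within context limits
translated = [translate_chunk("\n\n".join(paragraphs[i:i + 3]))
              for i in range(0, len(paragraphs), 3)]

with open("translated_script.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(translated))
```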
Closing Remarks
In terms of speed and quality, the speech services offered by the major Chinese cloud platforms are excellent, but their APIs are overly complex, their cloud consoles are cluttered, and there are few demo programs. For foundational work, Whisper, pyannote-audio, and pydub are still the tools worth studying.