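This article walks through building an audio dataset of primary-school classroom recordings as an example.
1. Search by keyword to collect audio/video links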
python
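from playwright.sync_api import sync_playwright

# BLVideoSearch is the author's search class; its definition is not shown here.
if __name__ == "__main__":
    with sync_playwright() as playwright:
        searcher = BLVideoSearch(playwright, headless=True)
        # Keyword: "小学公开课" (primary-school open class)
        url = searcher.make_url(keyword=["小学公开课"])
        searcher.run(url, outfile="videos_url.txt")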
This produces a list of links.
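Before downloading, the link list has to be read back in and each link given a name for its output file. The following is a minimal sketch, assuming `videos_url.txt` holds one URL per line (the file format is not shown in this article). The helper `load_tasks` and the `utt_00001` naming scheme are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def load_tasks(url_file):
    """Read one URL per line and pair each unique URL with a generated ID."""
    tasks = []
    seen = set()
    for line in Path(url_file).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        # Skip blank lines and links that already appeared earlier in the file.
        if not url or url in seen:
            continue
        seen.add(url)
        utt = f"utt_{len(tasks) + 1:05d}"  # hypothetical ID scheme
        tasks.append((utt, url))
    return tasks
```

Each `(utt, url)` pair can then be handed to the download step, where `utt` becomes the output file name.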
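2. Batch download and on-the-fly video-to-audio conversion
you-get: downloads the video file for each link
ffmpeg: converts each video to audio as soon as it is downloaded
subprocess: runs the commands above in child processes
2.1 Multithreaded batch download (you-get)
you-get subprocess: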
python
# -o: output directory; -O: output file name
command = [YOUGET, "-o", self.video_dir, "-O", utt, task]
subprocess.run(command, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
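The snippet above covers a single download; the multithreaded part is not shown. One possible way to fan the downloads out over a thread pool is sketched below. It assumes `you-get` is on `PATH` and that tasks are `(utt, url)` pairs. The functions `build_download_command`, `download_one`, and `download_all`, and the injectable `runner` parameter, are hypothetical names introduced for this example, not the author's implementation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

YOUGET = "you-get"  # assumption: the you-get executable is on PATH


def build_download_command(video_dir, utt, url):
    """Build the same you-get command line used in the snippet above."""
    return [YOUGET, "-o", video_dir, "-O", utt, url]


def download_one(video_dir, utt, url, runner=subprocess.run):
    """Download one link; return (utt, True) on success, (utt, False) on failure."""
    command = build_download_command(video_dir, utt, url)
    try:
        runner(command, check=True,
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return utt, True
    except (subprocess.CalledProcessError, OSError):
        # A failed download should not stop the rest of the batch.
        return utt, False


def download_all(video_dir, tasks, max_workers=4, runner=subprocess.run):
    """Run downloads concurrently and collect a {utt: succeeded} map."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download_one, video_dir, utt, url, runner)
                   for utt, url in tasks]
        for future in as_completed(futures):
            utt, ok = future.result()
            results[utt] = ok
    return results
```

Threads are a reasonable fit here because each worker spends its time waiting on a child process rather than running Python code. Catching the error per task means one broken link only marks that entry as failed.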
2.2 On-the-fly video-to-audio conversion
ffmpeg subprocess:
python
# -ac 1: mono; -ar 16000: resample to 16 kHz
command = [FFMPEG, "-i", video_file, "-ac", "1", "-ar", "16000", audio_file]
subprocess.run(command, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
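To convert each video right after its download finishes, the ffmpeg call can be wrapped in a function that locates the downloaded file and writes a WAV next to it. This is a sketch under stated assumptions: the downloaded file is named `<utt>` plus a container extension, the extension set in `VIDEO_SUFFIXES` is a guess, and `ffmpeg` is on `PATH`. The names `build_convert_command` and `convert_downloaded`, and the `keep_video` option, are hypothetical. The added `-y` (overwrite output) and `-vn` (drop the video stream) are standard ffmpeg options that the original command does not use.

```python
import subprocess
from pathlib import Path

FFMPEG = "ffmpeg"  # assumption: the ffmpeg executable is on PATH
VIDEO_SUFFIXES = {".mp4", ".flv", ".mkv", ".webm"}  # assumed container types


def build_convert_command(video_file, audio_file):
    """ffmpeg command: mono, 16 kHz, audio only, overwrite existing output."""
    return [FFMPEG, "-y", "-i", str(video_file), "-vn",
            "-ac", "1", "-ar", "16000", str(audio_file)]


def convert_downloaded(video_dir, audio_dir, utt,
                       runner=subprocess.run, keep_video=True):
    """Convert the video downloaded for `utt`; return the audio path or None."""
    matches = sorted(
        p for p in Path(video_dir).glob(f"{utt}.*")
        if p.is_file() and p.suffix.lower() in VIDEO_SUFFIXES
    )
    if not matches:
        return None  # nothing was downloaded for this ID
    video_file = matches[0]
    audio_file = Path(audio_dir) / f"{utt}.wav"
    audio_file.parent.mkdir(parents=True, exist_ok=True)
    runner(build_convert_command(video_file, audio_file), check=True,
           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if not keep_video:
        video_file.unlink()  # optionally free disk space once audio exists
    return audio_file
```

Calling `convert_downloaded` from the download worker as soon as its you-get process returns gives the on-the-fly behavior, and `keep_video=False` keeps disk usage low when only the audio is needed.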
The downloaded video file information is as follows:
The final output is saved as audio files.