Industry heavyweight OpenMMLab to open-source Amphion, a comprehensive audio generation project

Project repository: https://github.com/open-mmlab/Amphion

TTS: Text-to-Speech

Amphion achieves state-of-the-art performance compared with existing open-source text-to-speech (TTS) repositories. It supports the following models and architectures:

  • FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.

  • VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.

  • Vall-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.

  • NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.

SVC: Singing Voice Conversion

  • Amphion supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec. Their specific roles in SVC have been investigated in our NeurIPS 2023 workshop paper.

  • Amphion implements several state-of-the-art model architectures, including diffusion-, Transformer-, VAE-, and flow-based models. The diffusion-based architecture uses a bidirectional dilated CNN as its backend and supports several sampling algorithms, such as DDPM, DDIM, and PNDM. It also supports single-step inference based on the Consistency Model.
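
To give a feel for what a sampler such as DDIM does at each step, here is a minimal NumPy sketch of one deterministic DDIM update (eta = 0). It is not Amphion's code: the function name `ddim_step`, the array shapes, and the noise-schedule values are assumptions chosen for illustration, and the "model" is replaced by the true noise so the result can be checked.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from step t to the previous step.

    alpha_bar_* are cumulative products of the noise schedule (hypothetical values here).
    """
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Move the estimate to the (lower) noise level of the previous step.
    x_prev = np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
    return x_prev, x0_pred

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 80))   # stand-in for a small mel-spectrogram patch
eps = rng.standard_normal(x0.shape)  # noise used in the forward process
alpha_bar_t, alpha_bar_prev = 0.5, 0.9

# Forward diffusion to step t, then one reverse step with a "perfect" noise prediction.
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
x_prev, x0_pred = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
```

With a perfect noise prediction, `x0_pred` recovers `x0` up to floating-point error. In practice `eps_pred` would come from the trained acoustic model, and the step would be repeated over a decreasing schedule.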

TTA: Text-to-Audio

Amphion supports TTA with a latent diffusion model. Its design follows AudioLDM, Make-an-Audio, and AUDIT. It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper.

Vocoder

  • Amphion supports various widely-used neural vocoders, including:

    • GAN-based vocoders: MelGAN, HiFi-GAN, NSF-HiFiGAN, BigVGAN, APNet.

    • Flow-based vocoders: WaveGlow.

    • Diffusion-based vocoders: Diffwave.

    • Autoregressive vocoders: WaveNet, WaveRNN.

  • Amphion provides the official implementation of the Multi-Scale Constant-Q Transform Discriminator. It can be used to enhance any GAN-based vocoder architecture during training while leaving the inference stage (e.g., memory use and speed) unchanged.

Evaluation

Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics include:

  • F0 Modeling: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.

  • Energy Modeling: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.

  • Intelligibility: Character/Word Error Rate, which can be calculated based on Whisper and more.

  • Spectrogram Distortion: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.

  • Speaker Similarity: Cosine similarity, which can be calculated based on RawNet3, WeSpeaker, and more.
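
Several of the metrics listed above reduce to short formulas. The sketch below shows plain NumPy versions of F0 RMSE, F0 Pearson correlation, embedding cosine similarity, and word error rate. These are illustrative only and are not Amphion's evaluation code: the function names are hypothetical, F0 contours are assumed to be frame-aligned with 0 marking unvoiced frames, and the transcripts and speaker embeddings (which Amphion would obtain from models such as Whisper or RawNet3) are replaced by toy inputs.

```python
import numpy as np

def _voiced_pair(f0_ref, f0_gen):
    """Keep only frames that are voiced (F0 > 0) in both contours."""
    f0_ref, f0_gen = np.asarray(f0_ref, float), np.asarray(f0_gen, float)
    voiced = (f0_ref > 0) & (f0_gen > 0)
    return f0_ref[voiced], f0_gen[voiced]

def f0_rmse(f0_ref, f0_gen):
    """Root mean square error between two F0 contours, in Hz."""
    ref, gen = _voiced_pair(f0_ref, f0_gen)
    return float(np.sqrt(np.mean((ref - gen) ** 2)))

def f0_pearson(f0_ref, f0_gen):
    """Pearson correlation coefficient between two F0 contours."""
    ref, gen = _voiced_pair(f0_ref, f0_gen)
    return float(np.corrcoef(ref, gen)[0, 1])

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Toy inputs: the generated contour is the reference shifted up by 10 Hz.
f0_ref = [0.0, 100.0, 200.0, 300.0]
f0_gen = [0.0, 110.0, 210.0, 310.0]
rmse = f0_rmse(f0_ref, f0_gen)
corr = f0_pearson(f0_ref, f0_gen)
sim = cosine_similarity([1.0, 0.0], [0.0, 1.0])
wer = word_error_rate("the cat sat", "the dog sat")
```

A constant pitch offset yields a non-zero RMSE but a perfect correlation, which is one reason the two F0 metrics are reported side by side.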

Datasets

Amphion unifies the data preprocessing of open-source datasets, including AudioCaps, LibriTTS, LJSpeech, M4Singer, Opencpop, OpenSinger, SVCC, VCTK, and more. The list of supported datasets is available in the project repository and is still being updated.

📀 Installation

git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

# Install Python Environment
conda create --name amphion python=3.9.15
conda activate amphion

# Install Python Packages Dependencies
sh env.sh

🐍 Usage in Python

Instructions for each task are detailed in the following recipes:

  • Text-to-Speech (TTS)

  • Singing Voice Conversion (SVC)

  • Text-to-Audio (TTA)

  • Vocoder

  • Evaluation

🙏 Acknowledgement

  • ming024's FastSpeech2 and jaywalnut310's VITS for model architecture code.

  • lifeiteng's VALL-E for training pipeline and model architecture design.

  • WeNet, Whisper, ContentVec, and RawNet3 for pretrained models and inference code.

  • HiFi-GAN for GAN-based Vocoder's architecture design and training strategy.

  • Encodec for well-organized GAN Discriminator's architecture and basic blocks.

  • Latent Diffusion for model architecture design.

  • TensorFlowTTS for preparing the MFA tools.

©️ License

Amphion is under the MIT License. It is free for both research and commercial use cases.

📚 Citations

Stay tuned, coming soon!
