假如我们一定要说深度学习入门会有一定的门槛,那么设备成本是一个无法避开的话题。深度学习模型通常需要大量的计算资源来进行训练和推理。较大规模的深度学习模型和复杂的数据集需要更高的计算能力才能进行有效的训练。因此,训练深度学习模型可能需要使用高性能的计算设备,如图形处理器(GPU)或专用的深度学习处理器(如TPU),这让很多本地没有N卡的同学望而却步。
GoogleColab是由Google提供的一种基于云的免费Jupyter笔记本环境。它可以帮助入门用户轻松地进行机器学习和深度学习的实验。
尽管GoogleColab提供了很多便利和免费的功能,但也有一些限制。例如,每个会话的计算资源可能是有限的,并且会话可能会在一段时间后自动关闭。此外,Colab的使用可能受到Google的限制和政策规定。
对于笔者这样的穷哥们来讲,GoogleColab就是黑暗中的一道光,就算有训练时长限制,也能凑合用了,要啥自行车?要饭咱也就别嫌饭馊了,本次我们基于GoogleColab在云端训练和推理Bert-vits2-v2.2项目,复刻那黑破坏神角色莉莉丝(lilith)。
配置云端设备
首先进入GoogleColab实验室官网:
arduino
https://colab.research.google.com/
点击新建笔记,并且链接设备服务器:
这里硬件设备选择T4GPU。
随后新建一条命令
ruby
#@title 查看显卡
!nvidia-smi
点击运行程序返回:
sql
Tue Dec 19 03:07:21 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 54C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
新一代图灵架构、16GB 显存,免费 GPU 也能如此耀眼,不愧是业界良心。
克隆代码仓库
随后新建命令:
bash
#@title 克隆代码仓库
!git clone https://github.com/v3ucn/Bert-vits2-V2.2.git
程序返回:
bash
Cloning into 'Bert-vits2-V2.2'...
remote: Enumerating objects: 310, done.
remote: Counting objects: 100% (310/310), done.
remote: Compressing objects: 100% (210/210), done.
remote: Total 310 (delta 97), reused 294 (delta 81), pack-reused 0
Receiving objects: 100% (310/310), 12.84 MiB | 18.95 MiB/s, done.
Resolving deltas: 100% (97/97), done.
安装所需要的依赖
新建安装依赖命令:
shell
#@title 安装所需要的依赖
%cd /content/Bert-vits2-V2.2
!pip install -r requirements.txt
依赖安装的时间要长一些,需要耐心等待。
下载必要的模型
接着下载必要的模型,这里包括bert模型和情感模型:
bash
#@title 下载必要的模型
!wget -P emotional/clap-htsat-fused/ https://huggingface.co/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
!wget -P emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/ https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin
!wget -P bert/chinese-roberta-wwm-ext-large/ https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/resolve/main/pytorch_model.bin
!wget -P bert/bert-base-japanese-v3/ https://huggingface.co/cl-tohoku/bert-base-japanese-v3/resolve/main/pytorch_model.bin
!wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.bin
!wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.generator.bin
!wget -P bert/deberta-v2-large-japanese/ https://huggingface.co/ku-nlp/deberta-v2-large-japanese/resolve/main/pytorch_model.bin
如果推理任务只需要中文语音,那么下载前三个模型即可。
下载底模文件
随后下载Bert-vits2-v2.2底模:
bash
#@title 下载底模文件
!wget -P Data/lilith/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.2-CLAP/resolve/main/DUR_0.pth
!wget -P Data/lilith/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.2-CLAP/resolve/main/D_0.pth
!wget -P Data/lilith/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.2-CLAP/resolve/main/G_0.pth
注意这里的底模要放在角色的models目录中,同时注意底模版本是2.2。
上传音频素材和重采样
随后打开目录,在lilith目录右键新建文件夹raw,接着右键点击上传,将素材上传到云端:
同时也将转写文件esd.list右键上传到项目的lilith目录:
bash
./Data/lilith/wavs/processed_0.wav|lilith|ZH|信仰,叫你们要否定心中的欲望。
./Data/lilith/wavs/processed_1.wav|lilith|ZH|把你們囚禁在自己的身體裡
./Data/lilith/wavs/processed_2.wav|lilith|ZH|圣修雅瑞之母
./Data/lilith/wavs/processed_3.wav|lilith|ZH|我有你要的东西
./Data/lilith/wavs/processed_4.wav|lilith|ZH|你渴望知识
./Data/lilith/wavs/processed_5.wav|lilith|ZH|不惜带着孩子寻遍圣修雅瑞
./Data/lilith/wavs/processed_6.wav|lilith|ZH|这话你真的相信吗
./Data/lilith/wavs/processed_7.wav|lilith|ZH|不必再裝了
./Data/lilith/wavs/processed_8.wav|lilith|ZH|你有問題,我有答案
./Data/lilith/wavs/processed_9.wav|lilith|ZH|我洞悉整個宇宙的真理
./Data/lilith/wavs/processed_10.wav|lilith|ZH|你看了那么多
./Data/lilith/wavs/processed_11.wav|lilith|ZH|知道的卻那麼少
./Data/lilith/wavs/processed_12.wav|lilith|ZH|打碎枷鎖
./Data/lilith/wavs/processed_13.wav|lilith|ZH|你願意接受我的提議嗎?
./Data/lilith/wavs/processed_14.wav|lilith|ZH|你很好奇想知道我
./Data/lilith/wavs/processed_15.wav|lilith|ZH|為什麼饒了你的命
./Data/lilith/wavs/processed_16.wav|lilith|ZH|你相信我吗
./Data/lilith/wavs/processed_17.wav|lilith|ZH|很好,现在你只需要知道。
./Data/lilith/wavs/processed_18.wav|lilith|ZH|我們要去見我兒子
./Data/lilith/wavs/processed_19.wav|lilith|ZH|是的,但不止如此。
./Data/lilith/wavs/processed_20.wav|lilith|ZH|他还是我计划的关键
./Data/lilith/wavs/processed_21.wav|lilith|ZH|雖然我無法預料
./Data/lilith/wavs/processed_22.wav|lilith|ZH|在新世界里你是否愿意站在我身边
./Data/lilith/wavs/processed_23.wav|lilith|ZH|找出自己真正的本性
./Data/lilith/wavs/processed_25.wav|lilith|ZH|可是我還是會為你
./Data/lilith/wavs/processed_27.wav|lilith|ZH|但現在所有的可能性
./Data/lilith/wavs/processed_28.wav|lilith|ZH|统统被夺走了
./Data/lilith/wavs/processed_29.wav|lilith|ZH|奪走啊
./Data/lilith/wavs/processed_30.wav|lilith|ZH|這把鑰匙能打開的不僅是地獄的大門
./Data/lilith/wavs/processed_31.wav|lilith|ZH|也会开启我们的未来
./Data/lilith/wavs/processed_32.wav|lilith|ZH|因為你的犧牲才得以實現到未來
./Data/lilith/wavs/processed_33.wav|lilith|ZH|打碎枷鎖
./Data/lilith/wavs/processed_34.wav|lilith|ZH|接受美麗的罪惡
./Data/lilith/wavs/processed_35.wav|lilith|ZH|这就是第一批
./Data/lilith/wavs/processed_36.wav|lilith|ZH|腦筋動得很快
./Data/lilith/wavs/processed_37.wav|lilith|ZH|没错,我正是莉莉丝。
至于音频如何切分、转写、标注等操作,请移步:本地训练,立等可取,30秒音频素材复刻霉霉讲中文音色基于Bert-VITS2V2.0.2。囿于篇幅,这里不再赘述。
确保素材切分和转写文件都上传成功后,新建命令:
bash
#@title 重采样
!python3 resample.py --sr 44100 --in_dir ./Data/lilith/raw/ --out_dir ./Data/lilith/wavs/
进行重采样操作。
预处理标签文件
接着新建命令:
bash
#@title 预处理标签文件
!python3 preprocess_text.py --transcription-path ./Data/lilith/esd.list --train-path ./Data/lilith/train.list --val-path ./Data/lilith/val.list --config-path ./Data/lilith/configs/config.json
程序返回:
sql
pytorch_model.bin: 100% 1.32G/1.32G [00:26<00:00, 49.4MB/s]
spm.model: 100% 2.46M/2.46M [00:00<00:00, 131MB/s]
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data] Unzipping corpora/cmudict.zip.
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Some weights of EmotionModel were not initialized from the model checkpoint at ./emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
0% 0/36 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.686 seconds.
Prefix dict has been built successfully.
100% 36/36 [00:00<00:00, 40.28it/s]
总重复音频数:0,总未找到的音频数:0
训练集和验证集生成完成!
此时,在lilith目录已经生成训练集和验证集,即train.list和val.list。
生成 BERT 特征文件
接着新建命令:
arduino
#@title 生成 BERT 特征文件
!python3 bert_gen.py --config-path ./Data/lilith/configs/config.json
程序返回:
sql
0% 0/36 [00:00<?, ?it/s]Some weights of the model checkpoint at ./bert/chinese-roberta-wwm-ext-large were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at ./bert/chinese-roberta-wwm-ext-large were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100% 36/36 [00:21<00:00, 1.67it/s]
bert生成完毕!, 共有36个bert.pt生成!
数一下,一共36个,和音频素材数量一致。
生成 clap 特征文件
最后生成clap情感特征文件:
shell
#@title 生成 clap 特征文件
#!wget -P emotional/clap-htsat-fused/ https://huggingface.co/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
!python3 clap_gen.py --config-path ./Data/lilith/configs/config.json
程序返回:
ini
/content/Bert-vits2-V2.2/clap_gen.py:34: FutureWarning: Pass sr=48000 as keyword args. From version 0.10 passing these as positional arguments will result in an error
audio = librosa.load(wav_path, 48000)[0]
0% 0/36 [00:00<?, ?it/s]/content/Bert-vits2-V2.2/clap_gen.py:34: FutureWarning: Pass sr=48000 as keyword args. From version 0.10 passing these as positional arguments will result in an error
audio = librosa.load(wav_path, 48000)[0]
/content/Bert-vits2-V2.2/clap_gen.py:34: FutureWarning: Pass sr=48000 as keyword args. From version 0.10 passing these as positional arguments will result in an error
audio = librosa.load(wav_path, 48000)[0]
/content/Bert-vits2-V2.2/clap_gen.py:34: FutureWarning: Pass sr=48000 as keyword args. From version 0.10 passing these as positional arguments will result in an error
audio = librosa.load(wav_path, 48000)[0]
100% 36/36 [00:44<00:00, 1.23s/it]
clap生成完毕!, 共有36个emo.pt生成!
同样36个,也就是说每个素材需要对应一个bert和一个clap。
开始训练
万事俱备,开始训练:
ruby
#@title 开始训练
!python3 train_ms.py
程序返回:
less
2023-12-19 03:17:48.852966: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-19 03:17:48.853057: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-19 03:17:48.992178: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-19 03:17:49.268092: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-19 03:17:51.369993: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
加载config中的配置localhost
加载config中的配置10086
加载config中的配置1
加载config中的配置0
加载config中的配置0
加载环境变量
MASTER_ADDR: localhost,
MASTER_PORT: 10086,
WORLD_SIZE: 1,
RANK: 0,
LOCAL_RANK: 0
12-19 03:17:55 INFO | data_utils.py:66 | Init dataset...
100% 32/32 [00:00<00:00, 51901.67it/s]
12-19 03:17:55 INFO | data_utils.py:81 | skipped: 0, total: 32
12-19 03:17:55 INFO | data_utils.py:66 | Init dataset...
100% 4/4 [00:00<00:00, 34100.03it/s]
12-19 03:17:55 INFO | data_utils.py:81 | skipped: 0, total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
INFO:models:Loaded checkpoint 'Data/lilith/models/DUR_0.pth' (iteration 0)
ERROR:models:emb_g.weight is not in the checkpoint
INFO:models:Loaded checkpoint 'Data/lilith/models/G_0.pth' (iteration 0)
INFO:models:Loaded checkpoint 'Data/lilith/models/D_0.pth' (iteration 0)
******************检测到模型存在,epoch为 1,gloabl step为 0*********************
0% 0/8 [00:00<?, ?it/s][W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
INFO:models:Train Epoch: 1 [0%]
INFO:models:[2.78941011428833, 2.49017596244812, 5.66870641708374, 25.731149673461914, 4.624840259552002, 3.6382224559783936, 0, 0.0002]
Evaluating ...
INFO:models:Saving model and optimizer state at iteration 1 to Data/lilith/models/G_0.pth
INFO:models:Saving model and optimizer state at iteration 1 to Data/lilith/models/D_0.pth
INFO:models:Saving model and optimizer state at iteration 1 to Data/lilith/models/DUR_0.pth
100% 8/8 [00:40<00:00, 5.05s/it]
INFO:models:====> Epoch: 1
100% 8/8 [00:09<00:00, 1.20s/it]
INFO:models:====> Epoch: 2
100% 8/8 [00:09<00:00, 1.23s/it]
INFO:models:====> Epoch: 3
100% 8/8 [00:09<00:00, 1.24s/it]
INFO:models:====> Epoch: 4
100% 8/8 [00:09<00:00, 1.25s/it]
INFO:models:====> Epoch: 5
100% 8/8 [00:10<00:00, 1.26s/it]
INFO:models:====> Epoch: 6
25% 2/8 [00:02<00:08, 1.41s/it]INFO:models:Train Epoch: 7 [25%]
由此就在底模的基础上开始训练了。
在线推理
训练了100步之后,我们可以先看看效果:
注意修改根目录的config.yml中的模型名称和模型名称一致:
yaml
# webui webui配置
# 注意, ":" 后需要加空格
webui:
# 推理设备
device: "cuda"
# 模型路径
model: "models/G_100.pth"
# 配置文件路径
config_path: "configs/config.json"
# 端口号
port: 7860
# 是否公开部署,对外网开放
share: false
# 是否开启debug模式
debug: false
# 语种识别库,可选langid, fastlid
language_identification_library: "langid"
这里model参数写成:models/G_100.pth
随后新建命令:
ruby
#@title 开始推理
!python3 webui.py
程序返回:
sql
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Some weights of EmotionModel were not initialized from the model checkpoint at ./emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
| numexpr.utils | INFO | NumExpr defaulting to 2 threads.
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
| utils | INFO | Loaded checkpoint 'Data/lilith/models/G_100.pth' (iteration 13)
推理页面已开启!
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://40b8695e0a18b0e2eb.gradio.live
一个内网地址,一个公网地址,访问公网地址40b8695e0a18b0e2eb.gradio.live进行推理即可。
最后奉上GoogleColab笔记链接:
ini
https://colab.research.google.com/drive/1LgewU9jevSovP9NTuqTtoxDop3qeWWKK?usp=sharing
与君共觞。