dreamtalk is a tool that turns audio plus an image into video: give it a speech clip and a portrait photo, and it generates a video of that face lip-syncing to the audio, which is genuinely fun to play with. Alibaba recently announced a similar model, but it hasn't been open-sourced yet; judging from the demo videos, Alibaba's results look better than dreamtalk's, so hopefully it gets released soon.
The real pitfalls in running it locally are in installing the dependencies. The project page is the dreamtalk entry on ModelScope (modelscope.cn); start by pulling down the code, then work through the dependencies one by one. The first is dlib, a Python binding for the C++ dlib library. Prebuilt Windows wheels currently exist only for Python 3.6, so if that happens to be your version, a plain pip install dlib is enough; otherwise you have to compile it locally, which means installing CMake first. You can get CMake from the official release page: https://github.com/Kitware/CMake/releases/download/v3.29.2/cmake-3.29.2-windows-x86_64.msi
If that link won't load for you, copy it into Xunlei and download it there (kudos to Xunlei). When installing, remember to tick the option that adds CMake to PATH. CMake itself is easy to set up: the download is small and it has essentially no other dependencies, so just install it and move on. The other prerequisite is Visual Studio, which is much larger, but apart from the size it's also unproblematic as long as your network is fast and your disk is big enough. With those two in place you can build dlib, but in my experience downloading the dlib source and running python setup.py install still fails; you need to add a flag that disables GIF support: python .\setup.py install --no DLIB_GIF_SUPPORT. With that, the build and install succeed, and the first dependency is done.
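To confirm the build actually produced a working module, a quick smoke test like the sketch below helps; nothing here is dreamtalk-specific, it just imports dlib and constructs a detector.

```python
# Quick sanity check after building dlib from source.
import dlib

print("dlib version:", dlib.__version__)    # should match the source you just built
print("CUDA enabled:", dlib.DLIB_USE_CUDA)  # False is fine for this CPU-only walkthrough

# Constructing the frontal face detector exercises the compiled C++ core.
detector = dlib.get_frontal_face_detector()
print("detector ready:", detector is not None)
```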
The second dependency is ffmpeg, a command-line tool that dreamtalk uses to assemble the final video files. This one is straightforward: download a build and add the folder containing ffmpeg.exe to your PATH environment variable.
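Before going further it's worth confirming that ffmpeg really is visible from the shell that will run dreamtalk; a minimal check using only the standard library:

```python
# Verify that ffmpeg is discoverable on PATH before running dreamtalk.
import shutil
import subprocess

ffmpeg_exe = shutil.which("ffmpeg")
if ffmpeg_exe is None:
    raise SystemExit("ffmpeg not found on PATH; add the folder containing ffmpeg.exe and reopen the terminal")

print("using ffmpeg at:", ffmpeg_exe)
subprocess.run(["ffmpeg", "-version"], check=True)  # prints the build info if everything is wired up
```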
The third dependency is the Python binding for ffmpeg, called PyAV. This one is simple: pip install av and you're done.
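A one-line import check is enough to confirm the wheel installed cleanly:

```python
# If this import fails, the av wheel did not install correctly.
import av

print("PyAV version:", av.__version__)
```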
The fourth dependency is the neural-network stack: torch, torchvision and torchaudio. I installed the CPU build; that's a personal choice, since I only wanted to get a result out and didn't need the GPU. If you run into trouble here you can search for the install instructions yourself, so I won't go into detail. The problem I hit was mismatched versions; after checking which versions are compatible with each other, these worked for me (a quick sanity check follows the list):
torch 1.10.1
torchaudio 0.10.1
torchvision 0.11.2
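A small check that the three packages are the ones you intend and that the CPU-only build is what's actually loaded:

```python
# Print the installed versions and confirm this is the CPU-only build.
import torch
import torchaudio
import torchvision

print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio :", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())  # expected False for the CPU wheels used here
```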
The fifth dependency is a model: dreamtalk relies on jonatasgrosman/wav2vec2-large-xlsr-53-english, which is hosted on Hugging Face, so reaching it needs a proxy from where I am. The code does try to download it automatically, but with no working connection the download simply errors out. A search turned up a mirror, HF-Mirror (a Hugging Face mirror site); set an environment variable named HF_ENDPOINT to the mirror's homepage URL, run the entry script, and the download does start. For some reason, though, it kept failing partway through with a "remote host closed the connection" error; maybe it was bandwidth, maybe my IP. If the automatic download works for you, all the better; for me it clearly wasn't going to, so I fell back to downloading manually. That errored as well, and at the critical moment Xunlei came through again: paste in the link and the files came down very quickly (kudos to Xunlei once more). Once downloaded, here is the important part: the model folder has to be placed under the dreamtalk root directory and wrapped in an extra jonatasgrosman directory, because that is exactly the path structure the code references, so watch out for it.
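Two things are worth double-checking at this step. HF_ENDPOINT has to be in effect before anything starts downloading, and a manually downloaded copy has to sit at exactly the relative path the code references. The sketch below covers both; the file names listed are typical contents of this model repo, not an exhaustive list:

```python
import os

# Option A: point the Hugging Face download machinery at the mirror. This must happen
# before transformers / huggingface_hub are imported (or set HF_ENDPOINT system-wide).
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

# Option B: if you downloaded the model by hand, it must live at
#   <dreamtalk root>/jonatasgrosman/wav2vec2-large-xlsr-53-english/
# because from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-english") should pick up
# a local directory with exactly that relative path when the script runs from the repo root.
local_dir = os.path.join("jonatasgrosman", "wav2vec2-large-xlsr-53-english")
expected = ["config.json", "preprocessor_config.json", "pytorch_model.bin"]  # typical files only
if os.path.isdir(local_dir):
    missing = [f for f in expected if not os.path.isfile(os.path.join(local_dir, f))]
    print("local model found, missing files:", missing or "none")
else:
    print("no local copy; the script will try to download via", os.environ["HF_ENDPOINT"])
```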
The remaining Python dependencies are simple to install with pip. (Screenshot of the package list.)
After five major dependencies and nearly two days of fiddling, I thought I was finally ready to run it, but the entry script failed again, this time with a CUDA error: the CPU build of torch I installed has no CUDA support, so the CUDA-related code has to go. Very carefully, I deleted the calls to cuda(), and it finally ran. The modified inference_for_demo_video.py:
```python
import argparse
import torch
import json
import os
from scipy.io import loadmat
import subprocess
import numpy as np
import torchaudio
import shutil

from core.utils import (
    get_pose_params,
    get_video_style_clip,
    get_wav2vec_audio_window,
    crop_src_image,
)
from configs.default import get_cfg_defaults
from generators.utils import get_netG, render_video
from core.networks.diffusion_net import DiffusionNet
from core.networks.diffusion_util import NoisePredictor, VarianceSchedule
from transformers import Wav2Vec2Processor
from transformers.models.wav2vec2.modeling_wav2vec2 import Wav2Vec2Model

device = torch.device("cpu")


@torch.no_grad()
def get_diff_net(cfg):
    diff_net = DiffusionNet(
        cfg=cfg,
        net=NoisePredictor(cfg),
        var_sched=VarianceSchedule(
            num_steps=cfg.DIFFUSION.SCHEDULE.NUM_STEPS,
            beta_1=cfg.DIFFUSION.SCHEDULE.BETA_1,
            beta_T=cfg.DIFFUSION.SCHEDULE.BETA_T,
            mode=cfg.DIFFUSION.SCHEDULE.MODE,
        ),
    )
    checkpoint = torch.load(cfg.INFERENCE.CHECKPOINT, map_location=torch.device("cpu"))
    model_state_dict = checkpoint["model_state_dict"]
    diff_net_dict = {
        k[9:]: v for k, v in model_state_dict.items() if k[:9] == "diff_net."
    }
    diff_net.load_state_dict(diff_net_dict, strict=True)
    diff_net.eval()

    return diff_net


@torch.no_grad()
def get_audio_feat(wav_path, output_name, wav2vec_model):
    # NOTE: unused stub carried over from the upstream repo; audio_feat_path is not
    # defined in this scope, but the function is never called.
    audio_feat_dir = os.path.dirname(audio_feat_path)

    pass


@torch.no_grad()
def inference_one_video(
    cfg,
    audio_path,
    style_clip_path,
    pose_path,
    output_path,
    diff_net,
    max_audio_len=None,
    sample_method="ddim",
    ddim_num_step=10,
):
    audio_raw = audio_data = np.load(audio_path)

    if max_audio_len is not None:
        audio_raw = audio_raw[: max_audio_len * 50]
    gen_num_frames = len(audio_raw) // 2

    audio_win_array = get_wav2vec_audio_window(
        audio_raw,
        start_idx=0,
        num_frames=gen_num_frames,
        win_size=cfg.WIN_SIZE,
    )

    # audio_win = torch.tensor(audio_win_array).cuda()
    audio_win = torch.tensor(audio_win_array)
    audio = audio_win.unsqueeze(0)

    # the second parameter is "" because of bad interface design...
    style_clip_raw, style_pad_mask_raw = get_video_style_clip(
        style_clip_path, "", style_max_len=256, start_idx=0
    )

    # style_clip = style_clip_raw.unsqueeze(0).cuda()
    style_clip = style_clip_raw.unsqueeze(0)
    style_pad_mask = (
        # style_pad_mask_raw.unsqueeze(0).cuda()
        style_pad_mask_raw.unsqueeze(0)
        if style_pad_mask_raw is not None
        else None
    )

    gen_exp_stack = diff_net.sample(
        audio,
        style_clip,
        style_pad_mask,
        output_dim=cfg.DATASET.FACE3D_DIM,
        use_cf_guidance=cfg.CF_GUIDANCE.INFERENCE,
        cfg_scale=cfg.CF_GUIDANCE.SCALE,
        sample_method=sample_method,
        ddim_num_step=ddim_num_step,
    )
    gen_exp = gen_exp_stack[0].cpu().numpy()

    pose_ext = pose_path[-3:]
    pose = None
    pose = get_pose_params(pose_path)
    # (L, 9)

    selected_pose = None
    if len(pose) >= len(gen_exp):
        selected_pose = pose[: len(gen_exp)]
    else:
        selected_pose = pose[-1].unsqueeze(0).repeat(len(gen_exp), 1)
        selected_pose[: len(pose)] = pose

    gen_exp_pose = np.concatenate((gen_exp, selected_pose), axis=1)
    np.save(output_path, gen_exp_pose)
    return output_path


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="inference for demo")
    parser.add_argument("--wav_path", type=str, default="", help="path for wav")
    parser.add_argument("--image_path", type=str, default="", help="path for image")
    parser.add_argument("--disable_img_crop", dest="img_crop", action="store_false")
    parser.set_defaults(img_crop=True)
    parser.add_argument(
        "--style_clip_path", type=str, default="", help="path for style_clip_mat"
    )
    parser.add_argument("--pose_path", type=str, default="", help="path for pose")
    parser.add_argument(
        "--max_gen_len",
        type=int,
        default=1000,
        help="The maximum length (seconds) limitation for generating videos",
    )
    parser.add_argument(
        "--cfg_scale",
        type=float,
        default=1.0,
        help="The scale of classifier-free guidance",
    )
    parser.add_argument(
        "--output_name",
        type=str,
        default="test",
    )
    args = parser.parse_args()

    cfg = get_cfg_defaults()
    cfg.CF_GUIDANCE.SCALE = args.cfg_scale
    cfg.freeze()

    tmp_dir = f"tmp/{args.output_name}"
    os.makedirs(tmp_dir, exist_ok=True)

    # get audio in 16000Hz
    wav_16k_path = os.path.join(tmp_dir, f"{args.output_name}_16K.wav")
    command = f"ffmpeg -y -i {args.wav_path} -async 1 -ac 1 -vn -acodec pcm_s16le -ar 16000 {wav_16k_path}"
    subprocess.run(command.split())

    # get wav2vec feat from audio
    wav2vec_processor = Wav2Vec2Processor.from_pretrained(
        "jonatasgrosman/wav2vec2-large-xlsr-53-english"
    )

    wav2vec_model = (
        Wav2Vec2Model.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-english")
        .eval()
        # .cuda()
    )

    speech_array, sampling_rate = torchaudio.load(wav_16k_path)
    audio_data = speech_array.squeeze().numpy()
    inputs = wav2vec_processor(
        audio_data, sampling_rate=16_000, return_tensors="pt", padding=True
    )

    with torch.no_grad():
        audio_embedding = wav2vec_model(inputs.input_values, return_dict=False)[
            0
        ]

    audio_feat_path = os.path.join(tmp_dir, f"{args.output_name}_wav2vec.npy")
    np.save(audio_feat_path, audio_embedding[0].cpu().numpy())

    # get src image
    src_img_path = os.path.join(tmp_dir, "src_img.png")
    if args.img_crop:
        crop_src_image(args.image_path, src_img_path, 0.4)
    else:
        shutil.copy(args.image_path, src_img_path)

    with torch.no_grad():
        # get diff model and load checkpoint
        # diff_net = get_diff_net(cfg).cuda()
        diff_net = get_diff_net(cfg)
        # generate face motion
        face_motion_path = os.path.join(tmp_dir, f"{args.output_name}_facemotion.npy")
        inference_one_video(
            cfg,
            audio_feat_path,
            args.style_clip_path,
            args.pose_path,
            face_motion_path,
            diff_net,
            max_audio_len=args.max_gen_len,
        )
        # get renderer
        renderer = get_netG("checkpoints/renderer.pt")
        # render video
        output_video_path = f"output_video/{args.output_name}.mp4"
        render_video(
            renderer,
            src_img_path,
            face_motion_path,
            wav_16k_path,
            output_video_path,
            fps=25,
            no_move=False,
        )

        # add watermark
        # if you want to generate videos with no watermark (for evaluation), remove this code block.
        no_watermark_video_path = f"{output_video_path}-no_watermark.mp4"
        shutil.move(output_video_path, no_watermark_video_path)
        os.system(
            f'ffmpeg -y -i {no_watermark_video_path} -vf "movie=media/watermark.png,scale= 120: 36[watermask]; [in] [watermask] overlay=140:220 [out]" {output_video_path}'
        )
        os.remove(no_watermark_video_path)
```
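For reference, this is roughly how I invoke the modified script on Windows. The argument names come straight from the argparse definitions above; every path below is a placeholder that you would swap for your own audio, portrait image, and the style/pose .mat files shipped with the repo:

```python
# A sketch of the invocation; all file paths here are placeholders, not real repo files.
import subprocess

cmd = [
    "python", "inference_for_demo_video.py",
    "--wav_path", r"data\audio\your_audio.wav",                    # placeholder
    "--image_path", r"data\src_img\your_face.png",                 # placeholder
    "--style_clip_path", r"data\style_clip\3DMM\your_style.mat",   # placeholder
    "--pose_path", r"data\pose\your_pose.mat",                     # placeholder
    "--max_gen_len", "30",
    "--cfg_scale", "1.0",
    "--output_name", "my_first_test",
]
subprocess.run(cmd, check=True)
```

Per the output_video_path in the script above, the finished clip ends up at output_video/<output_name>.mp4.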
The other modified file, dreamtalk/generators/utils.py:
```python
import argparse
import cv2
import json
import os

import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
from PIL import Image


def obtain_seq_index(index, num_frames, radius):
    seq = list(range(index - radius, index + radius + 1))
    seq = [min(max(item, 0), num_frames - 1) for item in seq]
    return seq


device = torch.device("cpu")


@torch.no_grad()
def get_netG(checkpoint_path):
    from generators.face_model import FaceGenerator
    import yaml

    with open("generators/renderer_conf.yaml", "r") as f:
        renderer_config = yaml.load(f, Loader=yaml.FullLoader)

    renderer = FaceGenerator(**renderer_config).to(device)

    checkpoint = torch.load(checkpoint_path, map_location=lambda storage, loc: storage)
    renderer.load_state_dict(checkpoint["net_G_ema"], strict=False)

    renderer.eval()

    return renderer


@torch.no_grad()
def render_video(
    net_G,
    src_img_path,
    exp_path,
    wav_path,
    output_path,
    silent=False,
    semantic_radius=13,
    fps=30,
    split_size=16,
    no_move=False,
):
    """
    exp: (N, 73)
    """
    target_exp_seq = np.load(exp_path)
    if target_exp_seq.shape[1] == 257:
        exp_coeff = target_exp_seq[:, 80:144]
        angle_trans_crop = np.array(
            [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9370641, 126.84911, 129.03864],
            dtype=np.float32,
        )
        target_exp_seq = np.concatenate(
            [exp_coeff, angle_trans_crop[None, ...].repeat(exp_coeff.shape[0], axis=0)],
            axis=1,
        )
        # (L, 73)
    elif target_exp_seq.shape[1] == 73:
        if no_move:
            target_exp_seq[:, 64:] = np.array(
                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9370641, 126.84911, 129.03864],
                dtype=np.float32,
            )
    else:
        raise NotImplementedError

    frame = cv2.imread(src_img_path)
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    src_img_raw = Image.fromarray(frame)
    image_transform = transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5), inplace=True),
        ]
    )
    src_img = image_transform(src_img_raw)

    target_win_exps = []
    for frame_idx in range(len(target_exp_seq)):
        win_indices = obtain_seq_index(
            frame_idx, target_exp_seq.shape[0], semantic_radius
        )
        win_exp = torch.tensor(target_exp_seq[win_indices]).permute(1, 0)
        # (73, 27)
        target_win_exps.append(win_exp)

    target_exp_concat = torch.stack(target_win_exps, dim=0)
    target_splited_exps = torch.split(target_exp_concat, split_size, dim=0)
    output_imgs = []
    for win_exp in target_splited_exps:
        # win_exp = win_exp.cuda()
        # cur_src_img = src_img.expand(win_exp.shape[0], -1, -1, -1).cuda()
        cur_src_img = src_img.expand(win_exp.shape[0], -1, -1, -1)
        output_dict = net_G(cur_src_img, win_exp)
        output_imgs.append(output_dict["fake_image"].cpu().clamp_(-1, 1))

    output_imgs = torch.cat(output_imgs, 0)
    transformed_imgs = ((output_imgs + 1) / 2 * 255).to(torch.uint8).permute(0, 2, 3, 1)

    if silent:
        torchvision.io.write_video(output_path, transformed_imgs.cpu(), fps)
    else:
        silent_video_path = f"{output_path}-silent.mp4"
        torchvision.io.write_video(silent_video_path, transformed_imgs.cpu(), fps)
        os.system(
            f"ffmpeg -loglevel quiet -y -i {silent_video_path} -i {wav_path} -shortest {output_path}"
        )
        os.remove(silent_video_path)
```
As far as I remember those are the only two files that need changes; and of course, if you installed the GPU build of torch, no code edits are needed at all. (Result clip: the audio says "Hello, can you write code?")
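One small note on the edit itself: instead of deleting the .cuda() calls, a device-agnostic pattern keeps the same file runnable on both CPU-only and GPU machines. This is just a sketch of the idea, not what the code above does:

```python
import torch

# Pick the device once; works unchanged on CPU-only and CUDA machines.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Then, wherever the original code calls .cuda(), use .to(device) instead, e.g.:
# audio_win = torch.tensor(audio_win_array).to(device)
# diff_net = get_diff_net(cfg).to(device)
```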
To sum up, getting an open-source project running is not that easy, even when no code needs to be written, and especially the first time, when nothing is installed yet. It really was a hassle, and at one point I doubted it would ever run, but I'm glad I didn't give up; after two days it finally worked. I'm writing this down so anyone who needs it has a reference, and feel free to leave me a comment if you run into problems.