webassembly003 TTS BARK.CPP

TTS task

  • TTS(Text-to-Speech)任务是一种自然语言处理(NLP)任务,其中模型的目标是将输入的文本转换为声音,实现自动语音合成。具体来说,模型需要理解输入的文本并生成对应的语音输出,使得合成的语音听起来自然而流畅,类似于人类语音的表达方式。

Bark

  • Bark(https://github.com/suno-ai/bark) 是由 Suno 创建的基于转换器的文本到音频模型。Bark 可以生成高度逼真的多语言语音以及其他音频,包括音乐、背景噪音和简单的音效。该模型还可以产生非语言交流,如大笑、叹息和哭泣。为了支持研究社区,我们提供了对预训练模型检查点的访问,这些检查点已准备好进行推理并可用于商业用途。

bark.cpp

编译

$mkdir build
$cd build
$cmake ..
-- The C compiler identification is GNU 9.5.0
-- The CXX compiler identification is GNU 9.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Linux detected
-- Configuring done
-- Generating done
-- Build files have been written to: /home/pdd/le/bark.cpp/build
$cmake --build . --config Release
[  7%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[ 14%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 21%] Linking C static library libggml.a
[ 21%] Built target ggml
[ 28%] Building CXX object CMakeFiles/bark.cpp.dir/bark.cpp.o
[ 42%] Linking CXX static library libbark.cpp.a
[ 42%] Built target bark.cpp
[ 50%] Building CXX object examples/main/CMakeFiles/main.dir/main.cpp.o
[ 57%] Linking CXX executable ../../bin/main
[ 57%] Built target main
[ 64%] Building CXX object examples/server/CMakeFiles/server.dir/server.cpp.o
[ 71%] Linking CXX executable ../../bin/server
[ 71%] Built target server
[ 78%] Building CXX object examples/quantize/CMakeFiles/quantize.dir/quantize.cpp.o
[ 85%] Linking CXX executable ../../bin/quantize
[ 85%] Built target quantize
[ 92%] Building CXX object tests/CMakeFiles/test-tokenizer.dir/test-tokenizer.cpp.o
[100%] Linking CXX executable ../bin/test-tokenizer
[100%] Built target test-tokenizer

权重下载与转换

$cd ../
# text_2.pt, coarse_2.pt, fine_2.pt,https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th
$python3 download_weights.py --download-dir ./models
# convert the model to ggml format
$python3 convert.py   --dir-model ./models --codec-path ./models --vocab-path ./ggml_weights/ --out-dir ./ggml_weights/
$ ls -ahl ./models/
总用量 13G
drwxrwxr-x  2 pdd pdd 4.0K Jan 29 08:22 .
drwxrwxr-x 13 pdd pdd 4.0K Jan 29 06:50 ..
-rwxrwxrwx  1 pdd pdd 3.7G Jan 29 07:34 coarse_2.pt
-rw-rw-r--  1 pdd pdd  89M Jan 29 07:29 encodec_24khz-d7cc33bc.th
-rwxrwxrwx  1 pdd pdd 3.5G Jan 29 07:53 fine_2.pt
-rwxrwxrwx  1 pdd pdd 5.0G Jan 29 07:22 text_2.pt
$ ls -ahl ./ggml_weights/
总用量 4.2G
drwxrwxr-x  2 pdd pdd 4.0K Jan 29 08:34 .
drwxrwxr-x 13 pdd pdd 4.0K Jan 29 06:50 ..
-rw-rw-r--  1 pdd pdd 1.3M Jan 29 08:33 ggml_vocab.bin
-rw-rw-r--  1 pdd pdd 1.3G Jan 29 08:34 ggml_weights_coarse.bin
-rw-rw-r--  1 pdd pdd  45M Jan 29 08:34 ggml_weights_codec.bin
-rw-rw-r--  1 pdd pdd 1.2G Jan 29 08:34 ggml_weights_fine.bin
-rw-rw-r--  1 pdd pdd 1.7G Jan 29 08:33 ggml_weights_text.bin
-rw-rw-r--  1 pdd pdd 973K Jan 29 05:23 vocab.txt
$ ./main -m ./ggml_weights/ -p "this is an audio"

运行

$ ./build/bin/main -h
usage: ./build/bin/main [options]

options:
  -h, --help            show this help message and exit
  -t N, --threads N     number of threads to use during computation (default: 4)
  -s N, --seed N        seed for random number generator (default: 0)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -m FNAME, --model FNAME
                        model path (default: /home/pdd/le/bark.cpp/ggml_weights)
  -o FNAME, --outwav FNAME
                        output generated wav (default: output.wav)

$ ./build/bin/main -m ./ggml_weights/ -p "this is an audio"
bark_load_model_from_file: loading model from './ggml_weights/'
bark_load_model_from_file: reading bark text model
gpt_model_load: n_in_vocab  = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 304 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1701.69 MB
bark_load_model_from_file: reading bark vocab

bark_load_model_from_file: reading bark coarse model
gpt_model_load: n_in_vocab  = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 1
gpt_model_load: n_wtes      = 1
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 304 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1250.69 MB

bark_load_model_from_file: reading bark fine model
gpt_model_load: n_in_vocab  = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size  = 1024
gpt_model_load: n_embd      = 1024
gpt_model_load: n_head      = 16
gpt_model_load: n_layer     = 24
gpt_model_load: n_lm_heads  = 7
gpt_model_load: n_wtes      = 8
gpt_model_load: ftype       = 0
gpt_model_load: qntvr       = 0
gpt_model_load: ggml tensor size = 304 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size =   192.00 MB, n_mem = 24576
gpt_model_load: model size  =  1218.26 MB

bark_load_model_from_file: reading bark codec model
encodec_model_load: model size    =   44.32 MB

bark_load_model_from_file: total model size  =  4170.64 MB

bark_tokenize_input: prompt: 'this is an audio'
bark_tokenize_input: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595 
bark_forward_text_encoder: ...........................................................................................................

bark_print_statistics: mem per token =     4.81 MB
bark_print_statistics:   sample time =    16.03 ms / 109 tokens
bark_print_statistics:  predict time =  9644.73 ms / 87.68 ms per token
bark_print_statistics:    total time =  9663.29 ms

bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................

bark_print_statistics: mem per token =     8.53 MB
bark_print_statistics:   sample time =     4.43 ms / 324 tokens
bark_print_statistics:  predict time = 52071.64 ms / 160.22 ms per token
bark_print_statistics:    total time = 52080.24 ms

ggml_new_object: not enough space in the context's memory pool (needed 4115076720, available 4112941056)
段错误 (核心已转储)
  • 一开始以为是内存不足,去增加了虚拟内存,但仍然报错
csharp 复制代码
$ sudo dd if=/dev/zero of=swapfile bs=1024 count=10000000
记录了10000000+0 的读入
记录了10000000+0 的写出
10240000000字节(10 GB,9.5 GiB)已复制,55.3595 s,185 MB/s
$ sudo chmod 600 ./swapfile  # delete the swapfile if you dont need it
$ sudo mkswap -f ./swapfile
正在设置交换空间版本 1,大小 = 9.5 GiB (10239995904  个字节)
无标签, UUID=f3e2a0be-b880-48da-b598-950b7d69f94f
$ sudo swapon ./swapfile
$ free -m
               total        used        free      shared  buff/cache   available
内存:      15731        6441         307        1242        8982        7713
交换:      11813        2047        9765

$ ./build/bin/main -m ./ggml_weights/ -p "this is an audio"
ggml_new_object: not enough space in the context's memory pool (needed 4115076720, available 4112941056)
  • 去看了报错的函数,应该不是内存的原因

    static struct ggml_object * ggml_new_object(struct ggml_context * ctx, enum ggml_object_type type, size_t size) {
    // always insert objects at the end of the context's memory pool
    struct ggml_object * obj_cur = ctx->objects_end;

      const size_t cur_offs = obj_cur == NULL ? 0 : obj_cur->offs;
      const size_t cur_size = obj_cur == NULL ? 0 : obj_cur->size;
      const size_t cur_end  = cur_offs + cur_size;
    
      // align to GGML_MEM_ALIGN
      size_t size_needed = GGML_PAD(size, GGML_MEM_ALIGN);
    
      char * const mem_buffer = ctx->mem_buffer;
      struct ggml_object * const obj_new = (struct ggml_object *)(mem_buffer + cur_end);
    
      if (cur_end + size_needed + GGML_OBJECT_SIZE > ctx->mem_size) {
          GGML_PRINT("%s: not enough space in the context's memory pool (needed %zu, available %zu)\n",
                  __func__, cur_end + size_needed, ctx->mem_size);
          assert(false);
          return NULL;
      }
    
      *obj_new = (struct ggml_object) {
          .offs = cur_end + GGML_OBJECT_SIZE,
          .size = size_needed,
          .next = NULL,
          .type = type,
      };
    
      ggml_assert_aligned(mem_buffer + obj_new->offs);
    
      if (obj_cur != NULL) {
          obj_cur->next = obj_new;
      } else {
          // this is the first object in this context
          ctx->objects_begin = obj_new;
      }
    
      ctx->objects_end = obj_new;
    
      //printf("%s: inserted new object at %zu, size = %zu\n", __func__, cur_end, obj_new->size);
    
      return obj_new;
    

    }

  • 然后找到了https://github.com/PABannier/bark.cpp/issues/122

    $ cd bark.cpp/
    $ git checkout -f 07e651618b3a8a27de3bfa7f733cdb0aa8f46b8a
    HEAD 目前位于 07e6516 ENH Decorrelate fine GPT graph (#111)

  • 运行成功

    /home/pdd/le/bark.cpp/cmake-build-debug/bin/main
    bark_load_model_from_file: loading model from '/home/pdd/le/bark.cpp/ggml_weights'
    bark_load_model_from_file: reading bark text model
    gpt_model_load: n_in_vocab = 129600
    gpt_model_load: n_out_vocab = 10048
    gpt_model_load: block_size = 1024
    gpt_model_load: n_embd = 1024
    gpt_model_load: n_head = 16
    gpt_model_load: n_layer = 24
    gpt_model_load: n_lm_heads = 1
    gpt_model_load: n_wtes = 1
    gpt_model_load: ftype = 0
    gpt_model_load: qntvr = 0
    gpt_model_load: ggml tensor size = 272 bytes
    gpt_model_load: ggml ctx size = 1894.87 MB
    gpt_model_load: memory size = 192.00 MB, n_mem = 24576
    gpt_model_load: model size = 1701.69 MB
    bark_load_model_from_file: reading bark vocab

    bark_load_model_from_file: reading bark coarse model
    gpt_model_load: n_in_vocab = 12096
    gpt_model_load: n_out_vocab = 12096
    gpt_model_load: block_size = 1024
    gpt_model_load: n_embd = 1024
    gpt_model_load: n_head = 16
    gpt_model_load: n_layer = 24
    gpt_model_load: n_lm_heads = 1
    gpt_model_load: n_wtes = 1
    gpt_model_load: ftype = 0
    gpt_model_load: qntvr = 0
    gpt_model_load: ggml tensor size = 272 bytes
    gpt_model_load: ggml ctx size = 1443.87 MB
    gpt_model_load: memory size = 192.00 MB, n_mem = 24576
    gpt_model_load: model size = 1250.69 MB

    bark_load_model_from_file: reading bark fine model
    gpt_model_load: n_in_vocab = 1056
    gpt_model_load: n_out_vocab = 1056
    gpt_model_load: block_size = 1024
    gpt_model_load: n_embd = 1024
    gpt_model_load: n_head = 16
    gpt_model_load: n_layer = 24
    gpt_model_load: n_lm_heads = 7
    gpt_model_load: n_wtes = 8
    gpt_model_load: ftype = 0
    gpt_model_load: qntvr = 0
    gpt_model_load: ggml tensor size = 272 bytes
    gpt_model_load: ggml ctx size = 1411.25 MB
    gpt_model_load: memory size = 192.00 MB, n_mem = 24576
    gpt_model_load: model size = 1218.26 MB

    bark_load_model_from_file: reading bark codec model

    bark_load_model_from_file: total model size = 4170.64 MB

    bark_tokenize_input: prompt: 'this is an audio'
    bark_tokenize_input: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
    encodec_model_load: model size = 44.32 MB
    bark_forward_text_encoder: ...........................................................................................................

    bark_print_statistics: mem per token = 4.80 MB
    bark_print_statistics: sample time = 59.49 ms / 109 tokens
    bark_print_statistics: predict time = 24761.95 ms / 225.11 ms per token
    bark_print_statistics: total time = 24826.76 ms

    bark_forward_coarse_encoder: ...................................................................................................................................................................................................................................................................................................................................

    bark_print_statistics: mem per token = 8.51 MB
    bark_print_statistics: sample time = 19.74 ms / 324 tokens
    bark_print_statistics: predict time = 178366.69 ms / 548.82 ms per token
    bark_print_statistics: total time = 178396.22 ms

    bark_forward_fine_encoder: .....

    bark_print_statistics: mem per token = 0.66 MB
    bark_print_statistics: sample time = 304.20 ms / 6144 tokens
    bark_print_statistics: predict time = 407086.19 ms / 58155.17 ms per token
    bark_print_statistics: total time = 407399.91 ms

    bark_forward_encodec: mem per token = 760209 bytes
    bark_forward_encodec: predict time = 4349.03 ms
    bark_forward_encodec: total time = 4349.07 ms

    Number of frames written = 51840.

    main: load time = 11441.58 ms
    main: eval time = 614987.69 ms
    main: total time = 626429.31 ms

    Process finished with exit code 0

CG