DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022

Overview

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism


This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose DiffSinger (for Singing-Voice-Synthesis) and DiffSpeech (for Text-to-Speech).

In addition, a more detailed and improved code framework, which contains the implementations of FastSpeech 2, DiffSpeech, and our NeurIPS-2021 work PortaSpeech, is coming soon ✨✨✨.

[Figures: DiffSinger/DiffSpeech at training; DiffSinger/DiffSpeech at inference]

🚀 News:

  • Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
  • Sep.29, 2021: Our recent work PortaSpeech: Portable and High-Quality Generative Text-to-Speech was accepted by NeurIPS-2021.
  • May.06, 2021: We submitted DiffSinger to arXiv.

Environments

conda create -n your_env_name python=3.8
source activate your_env_name 
pip install -r requirements_2080.txt   # GPU 2080Ti, CUDA 10.2
# or
pip install -r requirements_3090.txt   # GPU 3090, CUDA 11.4
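
Optionally, before launching any training, you can verify that the installed PyTorch build actually sees the GPU. This small check is not part of the original instructions, just a convenience:

# Quick environment check: verify PyTorch and CUDA are correctly installed.
import torch

print(torch.__version__)              # should correspond to the CUDA build from the requirements file
print(torch.cuda.is_available())      # expect True on a correctly configured 2080Ti / 3090
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))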

DiffSpeech (TTS version)

1. Data Preparation

a) Download and extract the LJ Speech dataset, then create a link to the dataset folder: ln -s /xxx/LJSpeech-1.1/ data/raw/

b) Download and extract the ground-truth durations extracted by MFA: tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/

c) Run the following scripts to pack the dataset for training/inference.

CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml

# `data/binary/ljspeech` will be generated.

2. Training Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name xxx --reset

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name xxx --reset --infer

We also provide:

  • the pre-trained model of DiffSpeech;
  • the pre-trained model of the HifiGAN vocoder;
  • the individual pre-trained model of FastSpeech 2 for the shallow diffusion mechanism in DiffSpeech.

Remember to put the pre-trained models in the checkpoints directory.

About the determination of 'k' in shallow diffusion: we recommend the trick introduced in Appendix B of the paper. We have already provided a proper 'k' for the LJSpeech dataset in the config files.
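
For intuition, here is a minimal sketch of how 'k' is used at inference under the shallow diffusion mechanism: instead of starting the reverse process from pure Gaussian noise at step T, the FastSpeech 2 mel prediction is diffused to step k and only the last k denoising steps are run. This is an illustrative sketch rather than this repo's implementation; denoise_fn (a trained noise predictor) and fs2_mel (the FastSpeech 2 prediction) are assumed inputs, and the schedule constants only mirror typical values from the configs (timesteps, max_beta, K_step).

# Illustrative sketch of shallow diffusion sampling (not the repo's exact code).
import torch

T, K = 100, 51                                     # total steps and boundary k (cf. timesteps / K_step in the configs)
betas = torch.linspace(1e-4, 0.06, T)              # linear schedule; 0.06 mirrors max_beta in the configs
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    # Diffuse a clean mel x0 to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise
    noise = torch.randn_like(x0)
    return alphas_cumprod[t].sqrt() * x0 + (1.0 - alphas_cumprod[t]).sqrt() * noise

def shallow_sample(denoise_fn, fs2_mel):
    # Start from the FastSpeech 2 prediction diffused to step k, then run only the last k reverse steps.
    x = q_sample(fs2_mel, K - 1)
    for t in reversed(range(K)):
        eps = denoise_fn(x, torch.tensor([t]))     # model predicts the noise added at step t
        coef = betas[t] / (1.0 - alphas_cumprod[t]).sqrt()
        x = (x - coef * eps) / (1.0 - betas[t]).sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # sigma_t^2 = beta_t variance choice
    return x

Intuitively, the closer the FastSpeech 2 prediction is to the ground-truth mel distribution, the smaller 'k' can be; Appendix B of the paper describes a trick for estimating this boundary.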

DiffSinger (SVS version)

0. Data Acquirement

  • WIP. We will provide a form to apply for the PopCS dataset.

1. Data Preparation

  • WIP. Similar to DiffSpeech.

2. Training Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6.yaml --exp_name xxx --reset
# or
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name xxx --reset

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config xxx --exp_name xxx --reset --infer

The pre-trained model for SVS will be provided soon.

Tensorboard

tensorboard --logdir_spec exp_name

Mel Visualization

Along the vertical axis of each comparison figure, mel bins [0-80] show DiffSpeech and bins [80-160] show FastSpeech 2 (a small plotting sketch is given below the figures).

DiffSpeech vs. FastSpeech 2

[Mel-spectrogram comparison figures: DiffSpeech vs. FastSpeech 2]
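
To reproduce this stacked layout for your own samples, a minimal plotting sketch is shown below; it is not the repo's visualization code, and the random arrays are placeholders for real [T, 80] mel-spectrograms.

# Minimal sketch (not the repo's plotting code): stack a DiffSpeech mel and a
# FastSpeech 2 mel, each of shape [T, 80], so that bins 0-80 show DiffSpeech and
# bins 80-160 show FastSpeech 2, matching the layout described above.
import numpy as np
import matplotlib.pyplot as plt

diffspeech_mel = np.random.randn(400, 80)   # placeholder; load a generated mel here
fastspeech2_mel = np.random.randn(400, 80)  # placeholder

stacked = np.concatenate([diffspeech_mel, fastspeech2_mel], axis=1)  # [T, 160]
plt.figure(figsize=(10, 4))
plt.imshow(stacked.T, origin="lower", aspect="auto")
plt.ylabel("mel bin (0-80: DiffSpeech, 80-160: FastSpeech 2)")
plt.xlabel("frame")
plt.tight_layout()
plt.savefig("mel_comparison.png")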

Audio Demos

Audio samples can be found in our demo page.

We also put some of the audio samples generated by DiffSpeech+HifiGAN (marked as [P]) and GTmel+HifiGAN (marked as [G]) on the test set in resources/demos_1218.

(These samples correspond to the pre-trained DiffSpeech model.)

Citation

@misc{liu2021diffsinger,
  title={DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism}, 
  author={Jinglin Liu and Chengxi Li and Yi Ren and Feiyang Chen and Zhou Zhao},
  year={2021},
  eprint={2105.02446},
  archivePrefix={arXiv},
}

Acknowledgements

Our code is based on the following repos:

Also thanks to Keon Lee for his fast implementation of our work.

Comments
  • DiffSinger infer problem

    I want to test the opencpop pre-trained model on an unseen song, but I don't know how to generate the wav file.

    1. What data should I prepare for the model?
    2. How do I do it? I saw test_step in FastSpeech2Task, but it seems to be for the TTS task, so do I need to override test_step in DiffSingerMIDITask? Is there another way to solve this, without packing the data into a dataloader, i.e. just load the model and infer?
    opened by leon2milan 9
  • Inference with unseen songs

    Hi. Since DiffSinger (PopCS) needs ground-truth f0 information at inference, is it possible to synthesize an unseen song (with phoneme labels, phoneme durations and notes provided) using the DiffSinger (PopCS) model?

    opened by Charlottecuc 8
  • about some missing parts

    Hi, thanks for your contribution with DiffSinger! And also thanks for mentioning my implementation; I just realized it yesterday :)

    With your detailed documentation in the README and the paper, I was able to reproduce the training & inference procedure and the results with this repo. But during that, I found some missing parts needed for full training of the shallow version: I think the current code only supports a forced K (which is 71) with the pre-trained FastSpeech 2 (especially its decoder). If I understood correctly, we need a process for boundary prediction and pre-training of FastSpeech 2 before training DiffSpeech in shallow mode. Maybe I missed it somewhere in the repo, but if it is not yet pushed, I wonder whether you have planned to provide that part soon.

    Thanks in advance!

    Best, keon

    solved 
    opened by keonlee9420 7
  • I can't get checkpoint files.

    According to README-SVS-opencpop-cascade.md, I made my own datasets and tried training.

    CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/cascade/mydatasets/aux_rel.yaml --exp_name MY_DATASETS_FS_EXP --reset

    The training went to epoch 50 but I didn't get any checkpoint files. Where is the checkpoint file, or could anyone tell me which config option should be changed to save checkpoint files (at a shorter interval)?

    opened by nakasako 6
  • size mismatch for model.encoder_embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).

    Having successfully run step 1, data preparation, I am now trying to run inference. I am using the given dataset preview. Running CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer according to the readme.md, I end up with this error:

    | model Trainable Parameters: 24.253M
    Traceback (most recent call last):
      File "tasks/run.py", line 15, in <module>
        run_task()
      File "tasks/run.py", line 10, in run_task
        task_cls.start()
      File "/.../DiffSinger/tasks/base_task.py", line 258, in start
        trainer.test(task)
      File "/.../DiffSinger/utils/pl_utils.py", line 586, in test
        self.fit(model)
      File "/.../DiffSinger/utils/pl_utils.py", line 489, in fit
        self.run_pretrain_routine(model)
      File "/.../DiffSinger/utils/pl_utils.py", line 541, in run_pretrain_routine
        self.restore_weights(model)
      File "/.../DiffSinger/utils/pl_utils.py", line 617, in restore_weights
        self.restore_state_if_checkpoint_exists(model)
      File "/.../DiffSinger/utils/pl_utils.py", line 655, in restore_state_if_checkpoint_exists
        self.restore(last_ckpt_path, self.on_gpu)
      File "/.../DiffSinger/utils/pl_utils.py", line 668, in restore
        model.load_state_dict(checkpoint['state_dict'], strict=False)
      File "/.../envs/DiffSinger/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for FastSpeech2Task:
    	size mismatch for model.encoder_embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).
    	size mismatch for model.encoder.embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).
    

    Do you have any ideas on what could be wrong here and how to resolve it?

    solved 
    opened by ghost 6
  • The model takes the phoneme duration as input when inference?

    Thanks for your wonderful work! I was running the inference of 0128_opencpop_ds58_midi, but there's a problem that bothers me.

    https://github.com/MoonInTheRiver/DiffSinger/blob/master/tasks/tts/fs2.py#L348

        ############
        # infer
        ############
        def test_step(self, sample, batch_idx):
            spk_embed = sample.get('spk_embed') if not hparams['use_spk_id'] else sample.get('spk_ids')
            txt_tokens = sample['txt_tokens']
            mel2ph, uv, f0 = None, None, None
            ref_mels = None
            if hparams['profile_infer']:
                pass
            else:
                if hparams['use_gt_dur']:
                    mel2ph = sample['mel2ph']
                if hparams['use_gt_f0']:
                    f0 = sample['f0']
                    uv = sample['uv']
                    print('Here using gt f0!!')
                if hparams.get('use_midi') is not None and hparams['use_midi']:
                    outputs = self.model(
                        txt_tokens, spk_embed=spk_embed, mel2ph=mel2ph, f0=f0, uv=uv, ref_mels=ref_mels, infer=True,
                        pitch_midi=sample['pitch_midi'], midi_dur=sample.get('midi_dur'), is_slur=sample.get('is_slur'))
                else:
                    outputs = self.model(
                        txt_tokens, spk_embed=spk_embed, mel2ph=mel2ph, f0=f0, uv=uv, ref_mels=ref_mels, infer=True)
    

    The param use_gt_dur is True, that is, the model takes the phoneme duration as input at inference time. Is that correct?

    solved 
    opened by YawYoung 5
  • Why feed in f0  in the midi version

    Hi @MoonInTheRiver,

    In the midi version, why do you also feed in f0 and uv?

    f0 and uv are generated from the raw wav, but during inference only txt_token and midi are given, so how do we get f0 and uv?

    opened by zhangsanfeng86 5
  • Using the Universal Vocoder

    Hello! Can you please tell me whether I can use your universal vocoder (trained on ~70 hours of singing data) to get a DiffSinger (SVS) model by training on English data, or do I need to train it from scratch on English data? If so, how can I do it? I want to get a model that synthesizes English singing without a Chinese accent, and I want to make sure that there won't be any problems due to the different phonemes.

    opened by ReyraV 3
  •  question about fs2 infer

    Hi, thank you very much for your valuable SVS corpus and code. I strictly followed your instructions up to step "2. Training Example" for SVS in https://github.com/MoonInTheRiver/DiffSinger, and then I got stuck. The error message is:

    Validation sanity check: 0%| | 0/1 [00:00<?, ?batch/s]
    Traceback (most recent call last):
      File "tasks/run.py", line 19, in <module>
        run_task()
      File "tasks/run.py", line 14, in run_task
        task_cls.start()
      File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/tasks/base_task.py", line 256, in start
        trainer.fit(task)
      File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/utils/pl_utils.py", line 489, in fit
        self.run_pretrain_routine(model)
      File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/utils/pl_utils.py", line 565, in run_pretrain_routine
        self.evaluate(model, self.get_val_dataloaders(), self.num_sanity_val_steps, self.testing)
      File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/utils/pl_utils.py", line 1173, in evaluate
        for batch_idx, batch in enumerate(dataloader):
      File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
        data = self._next_data()
      File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
        return self._process_data(data)
      File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
        data.reraise()
      File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
        raise self.exc_type(msg)
    FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
        data = fetcher.fetch(index)
      File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/usr/diffsinger_task.py", line 93, in __getitem__
        fs2_mel = torch.Tensor(np.load(f'{fs2_ckpt}/P_mels_npy/{item_name}.npy'))  # ~M generated by FFT-singer.
      File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/numpy/lib/npyio.py", line 416, in load
        fid = stack.enter_context(open(os_fspath(file), "rb"))
    FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/popcs_fs2_pmf0_1230/P_mels_npy/popcs-说散就散-0000.npy'

    It seems that the required file is not properly produced in the "1. Data Preparation" step, even though that step finished successfully with the following output:

    test_input_dir: , test_num: 0, test_prefixes: ['popcs-说散就散', 'popcs-隐形的翅膀'], test_set_name: test, timesteps: 100, train_set_name: train, use_denoise: False, use_energy_embed: False, use_gt_dur: True, use_gt_f0: True, use_nsf: True, use_pitch_embed: True, use_pos_embed: True, use_spk_embed: False, use_spk_id: False, use_split_spk_id: False, use_uv: True, use_var_enc: False, val_check_interval: 2000, valid_num: 0, valid_set_name: valid, validate: False, vocoder: vocoders.hifigan.HifiGAN, vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128, warmup_updates: 2000, weight_decay: 0, win_size: 512, work_dir: ,
    | Binarizer: <class 'data_gen.singing.binarize.SingingBinarizer'>
    | spk_map: {'SPK1': 0}
    | Build phone set: ['', '', '', 'a', 'ai', 'an', 'ang', 'ao', 'b', 'c', 'ch', 'd', 'e', 'ei', 'en', 'eng', 'er', 'f', 'g', 'h', 'i', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing', 'iong', 'iou', 'j', 'k', 'l', 'm', 'n', 'o', 'ong', 'ou', 'p', 'q', 'r', 's', 'sh', 't', 'u', 'ua', 'uai', 'uan', 'uang', 'uei', 'uen', 'uo', 'v', 'van', 've', 'vn', 'x', 'z', 'zh', '|']
    100%|████████████████████| 27/27 [00:13<00:00, 2.01it/s]
    | valid total duration: 330.677s
    100%|████████████████████| 27/27 [00:13<00:00, 2.04it/s]
    | test total duration: 330.677s
    100%|████████████████████| 1624/1624 [11:55<00:00, 2.27it/s]
    | train total duration: 20878.560s

    I guess the output of Step 1 and the input of Step 2 are possibly not chained perfectly. Any help or hints would be welcome. Thank you in advance.

    Yes, you are right, there is a problem. You need to run "CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer" in advance to produce the "P_mels_npy" files. I have fixed the readme file. Thanks for your report!

    I have a question about this: if we are training from scratch and don't have any saved models for inference, how are the P_mels_npy (predicted mels, I guess) generated?

    Originally posted by @Cescfangs in https://github.com/MoonInTheRiver/DiffSinger/issues/11#issuecomment-1038568795

    opened by Cescfangs 3
  • Data Preparation not working?

    Thank you for sharing this project! I am trying to run inference on a pre-trained model (DiffSinger), following the directions in the README. I have downloaded your dataset preview and trained models. I have tried both using a symlink to the dataset as instructed and also placing everything in data/processed/popcs/ directly.

    At step 1, packing the dataset, I seem to run into a problem:

    (DiffSinger) user@ubuntu:~/.../DiffSinger$ CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/popcs_ds_beta6.yaml
    
    | Hparams chains:  ['configs/config_base.yaml', 'configs/tts/base.yaml', 'configs/tts/fs2.yaml', 'configs/tts/base_zh.yaml', 'configs/singing/base.yaml', 'usr/configs/base.yaml', 'usr/configs/popcs_ds_beta6.yaml']
    | Hparams: 
    K_step: 51, accumulate_grad_batches: 1, audio_num_mel_bins: 80, audio_sample_rate: 24000, base_config: ['configs/tts/fs2.yaml', 'configs/singing/base.yaml', './base.yaml'], 
    binarization_args: {'shuffle': False, 'with_txt': True, 'with_wav': True, 'with_align': True, 'with_spk_embed': False, 'with_f0': True, 'with_f0cwt': True}, binarizer_cls: data_gen.singing.binarize.SingingBinarizer, binary_data_dir: data/binary/popcs-pmf0, check_val_every_n_epoch: 10, clip_grad_norm: 1, 
    content_cond_steps: [], cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, cwt_loss: l1, 
    cwt_std_scale: 0.8, datasets: ['popcs'], debug: False, dec_ffn_kernel_size: 9, dec_layers: 4, 
    decay_steps: 50000, decoder_type: fft, dict_dir: , diff_decoder_type: wavenet, diff_loss_type: l1, 
    dilation_cycle_length: 1, dropout: 0.1, ds_workers: 4, dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], dur_loss: mse, 
    dur_predictor_kernel: 3, dur_predictor_layers: 2, enc_ffn_kernel_size: 9, enc_layers: 4, encoder_K: 8, 
    encoder_type: fft, endless_ds: True, ffn_act: gelu, ffn_padding: SAME, fft_size: 512, 
    fmax: 12000, fmin: 30, fs2_ckpt: , gen_dir_name: , gen_tgt_spk_id: -1, 
    hidden_size: 256, hop_size: 128, infer: False, keep_bins: 80, lambda_commit: 0.25, 
    lambda_energy: 0.0, lambda_f0: 0.0, lambda_ph_dur: 0.0, lambda_sent_dur: 0.0, lambda_uv: 0.0, 
    lambda_word_dur: 0.0, load_ckpt: , log_interval: 100, loud_norm: False, lr: 0.001, 
    max_beta: 0.06, max_epochs: 1000, max_eval_sentences: 1, max_eval_tokens: 60000, max_frames: 5000, 
    max_input_tokens: 1550, max_sentences: 48, max_tokens: 20000, max_updates: 160000, mel_loss: ssim:0.5|l1:0.5, 
    mel_vmax: 1.5, mel_vmin: -6, min_level_db: -120, norm_type: gn, num_ckpt_keep: 3, 
    num_heads: 2, num_sanity_val_steps: 1, num_spk: 1, num_test_samples: 0, num_valid_plots: 10, 
    optimizer_adam_beta1: 0.9, optimizer_adam_beta2: 0.98, out_wav_norm: False, pitch_ar: False, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'], 
    pitch_extractor: parselmouth, pitch_loss: l1, pitch_norm: log, pitch_type: frame, pre_align_args: {'use_tone': False, 'forced_align': 'mfa', 'use_sox': True, 'txt_processor': 'zh_g2pM', 'allow_no_txt': False, 'denoise': False}, 
    pre_align_cls: data_gen.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.0, predictor_hidden: -1, predictor_kernel: 5, 
    predictor_layers: 2, prenet_dropout: 0.5, prenet_hidden_size: 256, pretrain_fs_ckpt: , processed_data_dir: data/processed/popcs, 
    profile_infer: False, raw_data_dir: data/raw/popcs, ref_norm_layer: bn, reset_phone_dict: True, residual_channels: 256, 
    residual_layers: 20, save_best: False, save_ckpt: True, save_codes: ['configs', 'modules', 'tasks', 'utils', 'usr'], save_f0: True, 
    save_gt: False, schedule_type: linear, seed: 1234, sort_by_len: True, spec_max: [0.2645, 0.0583, -0.2344, -0.0184, 0.1227, 0.1533, 0.1103, 0.1212, 0.2421, 0.1809, 0.2134, 0.3161, 0.3301, 0.3289, 0.2667, 0.2421, 0.2581, 0.26, 0.1394, 0.1907, 0.1082, 0.1474, 0.168, 0.255, 0.1057, 0.0826, 0.0423, 0.1203, -0.0701, -0.0056, 0.0477, -0.0639, -0.0272, -0.0728, -0.1648, -0.0855, -0.2652, -0.1998, -0.1547, -0.2167, -0.4181, -0.5463, -0.4161, -0.4733, -0.6518, -0.5387, -0.429, -0.4191, -0.4151, -0.3042, -0.381, -0.416, -0.4496, -0.2847, -0.4676, -0.4658, -0.4931, -0.4885, -0.5547, -0.5481, -0.6948, -0.7968, -0.8455, -0.8392, -0.877, -0.952, -0.8749, -0.7297, -0.8374, -0.8667, -0.7157, -0.9035, -0.9219, -0.8801, -0.9298, -0.9009, -0.9604, -1.0537, -1.0781, -1.3766], 
    spec_min: [-6.8276, -7.027, -6.8142, -7.1429, -7.6669, -7.6, -7.1148, -6.964, -6.8414, -6.6596, -6.688, -6.7439, -6.7986, -7.494, -7.7845, -7.6586, -6.9288, -6.7639, -6.9118, -6.8246, -6.7183, -7.1769, -6.9794, -7.4513, -7.3422, -7.5623, -6.961, -6.8158, -6.9595, -6.8403, -6.5688, -6.6356, -7.0209, -6.5002, -6.7819, -6.5232, -6.6927, -6.5701, -6.5531, -6.7069, -6.6462, -6.4523, -6.5954, -6.4264, -6.4487, -6.707, -6.4025, -6.3042, -6.4008, -6.3857, -6.3903, -6.3094, -6.2491, -6.3518, -6.3566, -6.4168, -6.2481, -6.3624, -6.2858, -6.2575, -6.3638, -6.452, -6.1835, -6.2754, -6.1253, -6.1645, -6.0638, -6.1262, -6.071, -6.1039, -6.4428, -6.1363, -6.1054, -6.1252, -6.1797, -6.0235, -6.0758, -5.9453, -6.0213, -6.0446], spk_cond_steps: [], stop_token_weight: 5.0, task_cls: usr.diffsinger_task.DiffSingerTask, test_ids: [], 
    test_input_dir: , test_num: 0, test_prefixes: ['popcs-说散就散', 'popcs-隐形的翅膀'], test_set_name: test, timesteps: 100, 
    train_set_name: train, use_denoise: False, use_energy_embed: False, use_gt_dur: True, use_gt_f0: True, 
    use_nsf: True, use_pitch_embed: True, use_pos_embed: True, use_spk_embed: False, use_spk_id: False, 
    use_split_spk_id: False, use_uv: True, use_var_enc: False, val_check_interval: 2000, valid_num: 0, 
    valid_set_name: valid, validate: False, vocoder: vocoders.hifigan.HifiGAN, vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128, warmup_updates: 2000, 
    weight_decay: 0, win_size: 512, work_dir: , 
    | Binarizer:  <class 'data_gen.singing.binarize.SingingBinarizer'>
    | spk_map:  {}
    | Build phone set:  []
    0it [00:00, ?it/s]
    | valid total duration: 0.000s
    0it [00:00, ?it/s]
    | test total duration: 0.000s
    0it [00:00, ?it/s]
    | train total duration: 0.000s
    

    It creates the folder data/binary/popcs-pmf0 with 11 files, but they seem to be essentially empty. Can you please tell me what I am missing, and why it does not find or use the dataset?

    solved 
    opened by ghost 3
  • Determining the durations of segmentation operators (|)

    The MFA outputs don't really provide the durations/frames between the words, and I checked that this project uses the duration of the SEG token (word separator). It is often 0 and other times not, so I wanted to ask how you got that in the preprocessing step?

    solved 
    opened by PranjalyaDS 3
  • RuntimeError: index 155 is out of bounds for dimension 1 with size 155

    I am trying to run training on my dataset. The valid data is processed correctly and this error does not occur at that stage, but when the training data is used, a RuntimeError always occurs. I tried to analyze the tensors and look at their sizes, but I have no ideas, because they are identical to the valid ones. The only thing I noticed is that I have a lot of zero tensors at the end, but I'm not sure that this is an important point. The valid data was taken randomly, of course. In fact, this piece of code works correctly for the valid data but does not work for the training data:

    torch.gather(F.pad(encoder_out, [0, 0, 1, 0]), 1, mel2ph)

    Please help, I would be glad for any ideas to solve this problem!

    https://github.com/MoonInTheRiver/DiffSinger/blob/5f2f6eb3c42635f9446363a302602a2ef1d41d70/modules/diffsinger_midi/fs2.py#L100

    opened by ReyraV 4
  • Hello, I have an issue as I try to use another English dataset. I'm wondering why inference from the packed test set works (`CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset --infer`) but inference from raw input (`python inference/svs/ds_e2e.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME`) needs the same phoneme set size?

    Originally posted by @Wayne-wonderai in https://github.com/MoonInTheRiver/DiffSinger/issues/29#issuecomment-1260673475

    opened by michaellin99999 13
  • custom phone_set file

    Hi, with the data preview we have created 72 phonemes. Is there a way to train the model such that it doesn't use the existing phone_set file with 62 phonemes and can use up to 72 phonemes?

    Thanks

    opened by michaellin99999 1
  • decoder part in e2e training using the opencpop dataset

    In the e2e training mode for opencpop, skip_decoder is true and the decoder part is not trained at all, right? But at inference, you still use run_decoder to get mel_out and use it as the start for q_sample, right? Why can run_decoder also be used here?

    Is that why you use k=60 in cascade mode but k=1000 in e2e mode?

    opened by Liujingxiu23 0