Overview

Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Grad-TTS

Official implementation of the Grad-TTS model, based on diffusion probabilistic modelling. For all details, check out our paper accepted to ICML 2021 (https://arxiv.org/abs/2105.06337).

Authors: Vadim Popov*, Ivan Vovk*, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov.

*Equal contribution.

Comments
  • Typo in some equations in GradTTS paper

    Thanks for your great work on Grad-TTS! However, I recently found a tiny error in the arXiv version 2 of the Grad-TTS paper (https://arxiv.org/pdf/2105.06337.pdf). In Eq. 31 and Eq. 32 in the appendix, $X_t$ and $\mu$ are put in the wrong order, i.e. it should probably be $\mu - X_t$ rather than $X_t - \mu$. This typo likely originates from the line above Eq. 31, "In our case $f(X_t, t) = \frac{1}{2} \Sigma^{-1}(X_t - \mu)\beta$ and ...", where it should be $f(X_t, t) = \frac{1}{2} \Sigma^{-1}(\mu - X_t)\beta$ instead. The other parts of the paper do not seem to be affected by this, and the derivations are solid and fluent. Again, great thanks for the work!

    opened by cantabile-kwok 4
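
    For reference, a LaTeX restatement of the corrected drift inside the forward SDE it comes from, as described in the issue: the process is attracted towards the terminal mean $\mu$, which is what forces the $\mu - X_t$ ordering.

    $$
    \mathrm{d}X_t = \frac{1}{2}\Sigma^{-1}(\mu - X_t)\,\beta_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}W_t,
    \qquad
    f(X_t, t) = \frac{1}{2}\Sigma^{-1}(\mu - X_t)\,\beta_t .
    $$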
  • Not able to generate audio using libritts of as good quality as using ljspeech

    Hi, thank you for the great work and for releasing the pretrained models. I trained Grad-TTS on LibriTTS (multi-speaker) and on LJSpeech (single-speaker) and found that the single-speaker setting gives much better quality than the multi-speaker one. This holds even when using your released grad-tts-libri-tts.pt. Were you able to get better quality in the multi-speaker setting? Here are a few samples I generated in the multi-speaker setting using your released model: https://drive.google.com/drive/folders/1ze0_rJXtmPY3JNAwnr0A_9C4OVvULEj7?usp=sharing.

    opened by Hertin 4
  • How is `out_size` in `params` determined

    Hi, I am modifying the code for my own purposes. I notice that here: https://github.com/huawei-noah/Speech-Backbones/blob/b82fdd546d9d977573c8557f242b06a0770ece8e/Grad-TTS/params.py#L53 the argument is hard-coded, and I guess 22050 and 256 are the sampling rate and frame shift used for LJSpeech, right? If so, should I change them when dealing with a different dataset?

    opened by cantabile-kwok 2
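
    For context, a minimal sketch of where that hard-coded value appears to come from, assuming LJSpeech's 22050 Hz sampling rate and 256-sample frame shift, so that `out_size` covers roughly 2 seconds of mel frames (`fix_len_compatibility` is the helper used in the Grad-TTS code):

    ```python
    # Sketch of how out_size is derived in Grad-TTS/params.py (LJSpeech values).
    sample_rate = 22050   # audio sampling rate in Hz
    hop_length = 256      # frame shift of the mel-spectrogram in samples

    # Roughly 2 seconds' worth of mel frames, used as the training segment length.
    out_size = 2 * sample_rate // hop_length  # = 172 frames

    # The repo additionally rounds this with fix_len_compatibility() so the
    # segment length divides cleanly through the U-Net's downsampling layers.
    ```

    So for a different dataset, the 22050 and 256 would presumably be replaced by that dataset's sampling rate and hop length.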
  • Grad-TTS in multispeaker setting

    Thank you for releasing the original implementation of Grad-TTS. I would like to know whether a multi-speaker setting is available or planned for release.

    I am implementing a multi-speaker setting using this repo. Would the maintainers of this repo be interested in discussing or providing feedback on a multi-speaker Grad-TTS implementation?

    Regards, Ajinkya

    opened by ajinkyakulkarni14 2
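
    On the multi-speaker question, a minimal sketch of the usual speaker-ID conditioning recipe; `n_spks` and `spk_emb_dim` follow the naming in the Grad-TTS `params.py` (where `n_spks = 247` is used for Libri-TTS), but the wiring below is an illustrative assumption rather than the repo's exact code:

    ```python
    import torch
    import torch.nn as nn

    # Illustrative speaker-ID conditioning for a multi-speaker TTS model.
    n_spks, spk_emb_dim = 247, 64       # Libri-TTS speaker count, embedding size
    spk_embedder = nn.Embedding(n_spks, spk_emb_dim)

    spk_ids = torch.tensor([0, 5, 12])  # one speaker id per utterance in a batch
    spk_emb = spk_embedder(spk_ids)     # (batch, spk_emb_dim)
    # spk_emb is then broadcast over time and fed to the text encoder and the
    # diffusion decoder as an extra conditioning input.
    ```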
  • [Errno 13] Permission denied: '/home/user/app/Grad-TTS/model/monotonic_align/core.c'

    [Errno 13] Permission denied: '/home/user/app/Grad-TTS/model/monotonic_align/core.c'
    Traceback (most recent call last):
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 1208, in cythonize_one
        result = compile_single(pyx_file, options, full_module_name=full_module_name)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Main.py", line 727, in compile_single
        return run_pipeline(source, options, full_module_name)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Main.py", line 515, in run_pipeline
        err, enddata = Pipeline.run_pipeline(pipeline, source)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 355, in run_pipeline
        data = run(phase, data)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 335, in run
        return phase(data)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 52, in generate_pyx_code_stage
        module_node.process_implementation(options, result)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/ModuleNode.py", line 143, in process_implementation
        self.generate_c_code(env, options, result)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/ModuleNode.py", line 411, in generate_c_code
        f = open_new_file(result.c_file)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Utils.py", line 76, in open_new_file
        return codecs.open(path, "w", encoding="ISO-8859-1")
      File "/usr/local/lib/python3.8/codecs.py", line 905, in open
        file = builtins.open(filename, mode, buffering)

    opened by AK391 0
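
    The failure itself is Cython being unable to write core.c next to core.pyx because the app directory is read-only. A hedged workaround sketch (paths taken from the traceback; the writable destination is an assumption): build the extension from a writable copy of the module directory.

    ```python
    # Workaround sketch: Cython needs write access to emit core.c next to
    # core.pyx, so build the monotonic_align extension from a writable copy.
    import shutil
    import subprocess
    import sys

    src = "/home/user/app/Grad-TTS/model/monotonic_align"  # read-only here
    dst = "/tmp/monotonic_align"                           # writable scratch copy

    shutil.copytree(src, dst, dirs_exist_ok=True)
    subprocess.run([sys.executable, "setup.py", "build_ext", "--inplace"],
                   cwd=dst, check=True)
    # The built extension then has to be made importable, e.g. by adding dst to
    # sys.path or copying the compiled module back once permissions allow it.
    ```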
  • Model training question

    Hi, thanks for sharing the code. I have a folder with wav files from different speakers, and I don't understand what to do next to get a trained model. What type of files should be in the "mels" and "embeds" folders, and how exactly should they be produced? Are there more detailed instructions somewhere?

    opened by Cpgrach 1
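
    Going by the DiffVC data-loading convention, "mels" and "embeds" mirror the wav folder with one .npy file per utterance: a mel-spectrogram and a speaker-encoder embedding respectively. A hedged preprocessing sketch; the mel parameters and file-name suffixes below are assumptions, so match them against the repo's own feature extraction before training:

    ```python
    # Hedged sketch: one mel .npy and one speaker-embedding .npy per wav file.
    import os
    import numpy as np
    import librosa

    wav_dir, mel_dir, emb_dir = "wavs", "mels", "embeds"
    os.makedirs(mel_dir, exist_ok=True)
    os.makedirs(emb_dir, exist_ok=True)

    for name in os.listdir(wav_dir):
        if not name.endswith(".wav"):
            continue
        audio, sr = librosa.load(os.path.join(wav_dir, name), sr=22050)
        mel = librosa.feature.melspectrogram(
            y=audio, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
        mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
        np.save(os.path.join(mel_dir, name.replace(".wav", "_mel.npy")), mel)
        # The speaker embedding would come from a pretrained speaker encoder
        # (e.g. the one shipped with the repo) and be stored analogously:
        # np.save(os.path.join(emb_dir, name.replace(".wav", "_embed.npy")), embed)
    ```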
  • about diffVC on Mandarin datasets

    Hello, I adapted the DiffVC code to Mandarin datasets. However, the audio after voice conversion has tone sandhi problems. I want to ask whether this performance is normal.

    opened by Theweekfoolish229 1
  • Why does the BNE-PPG-VC model in your demo perform better than the pre-trained model given in the original paper?

    I tried the pre-trained model bneSeq2seqMoL-vctk-libritts460-oneshot, then converted source wavs to target wavs from the demo provided with your paper. Your model performed better than the one trained in the original paper. Why? Have you retrained the HiFi-GAN model? Thank you!

    opened by jiazj-jiazj 0
  • About the prior loss and MAS algorithm

    Great work! I've been studying the paper and the code recently, and there's something that confuses me a lot.

    In my understanding, the encoder outputs a Gaussian distribution with a different mu for each phoneme, and the DPM decoder recovers the mel-spectrogram y from these Gaussians; hence y is not Gaussian anymore. But I gather from Eq. (14) and the code that when calculating the prior loss, you are actually computing the log-likelihood of y under the Gaussian distribution with mean mu. Also, when applying MAS for duration modelling, you perform a similar likelihood computation to get the soft alignment (denoted log_prior in the code). So I wonder why this is reasonable. I also compared the GlowTTS code: it uses z to evaluate the Gaussian likelihood with mean mu, where z is the latent variable obtained from the mel-spectrogram via a normalizing flow. That seems more reasonable to me for now, as z is Gaussian by itself.

    opened by cantabile-kwok 1
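
    On this point, what the training code appears to compute for the prior loss is the masked Gaussian log-likelihood of the mel target y under N(mu, I); the same per-frame likelihood serves as the score matrix for MAS. A sketch matching the shape of that computation, lightly simplified from the Grad-TTS training code:

    ```python
    import math
    import torch

    # Prior loss sketch: negative log-likelihood of the mel target y under the
    # encoder's Gaussian N(mu_y, I), masked and averaged over valid entries.
    def prior_loss(y, mu_y, y_mask, n_feats=80):
        # y, mu_y: (batch, n_feats, frames); y_mask: (batch, 1, frames)
        nll = 0.5 * ((y - mu_y) ** 2 + math.log(2 * math.pi)) * y_mask
        return nll.sum() / (y_mask.sum() * n_feats)
    ```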
  • Possibly missing __dict__ in the Projector class' constructor

    While loading the pretrained weights of the ST2VecEncoder, I had to replace **conv_cfg_i with **conv_cfg_i.__dict__ in __init__ of the Projector class (SPIRAL/nemo/collections/asr/parts/spec2vec.py). Doing this allowed me to load all the weights and match the keys successfully; nonetheless, I was curious to know whether I am missing some installation step?

    opened by Sri-Harsha 0
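
    For reference, the fix described above in minimal form; the class and function names here are illustrative stand-ins, not the actual SPIRAL code. A plain config object cannot be **-unpacked, but its __dict__ can:

    ```python
    # Illustrative stand-in for the Projector situation: ** unpacking needs a
    # mapping, so a config object must be unpacked via its __dict__.
    class ConvCfg:
        def __init__(self, in_channels, out_channels, kernel_size):
            self.in_channels = in_channels
            self.out_channels = out_channels
            self.kernel_size = kernel_size

    def make_conv(in_channels, out_channels, kernel_size):
        return (in_channels, out_channels, kernel_size)

    conv_cfg_i = ConvCfg(512, 512, 3)
    # make_conv(**conv_cfg_i)   # TypeError: argument after ** must be a mapping
    layer = make_conv(**conv_cfg_i.__dict__)  # unpacks the attributes as kwargs
    ```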
Owner
HUAWEI Noah's Ark Lab
Working with and contributing to the open source community in data mining, artificial intelligence, and related fields.
Python script for snapping up mobile phones on the Huawei Store

HUAWEI STORE GO 2021. Description: a flash-sale crawler script for the Huawei Store based on Python 3 and Selenium, modified from BUY-HW, a project not updated for almost two years, to snap up a Nova 8 for my goddess (when did Huawei start copying Xiaomi's hunger marketing?). The login and purchase parts of the original project no longer work; this project fixes them to work with the new Huawei Store and adds some features.

ZhangLiang 111 Dec 22, 2022
Main repository for the chatbot Bobotinho.

Bobotinho Bot Main repository for the chatbot Bobotinho. ℹ️ Introduction Twitch chatbot with entertainment commands. Technologies Concurrent code

Bobotinho 14 Nov 29, 2022
Artificial Conversational Entity for queries in Eulogio "Amang" Rodriguez Institute of Science and Technology (EARIST)

Coeus - EARIST A.C.E. Coeus is an Artificial Conversational Entity for queries in Eulogio "Amang" Rodriguez Institute of Science and Technology,

Dids Irwyn Reyes 3 Oct 14, 2022
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.2k Jan 9, 2023
Speech Recognition for Uyghur using Speech transformer

Speech Recognition for Uyghur using Speech transformer Training: this model using CTC loss and Cross Entropy loss for training. Download pretrained mo

Uyghur 11 Nov 17, 2022
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

Chung-Ming Chien 1k Dec 30, 2022
Simple Speech to Text, Text to Speech

Simple Speech to Text, Text to Speech. 1. Download Repository. Option 1: download this repository and extract it to the desired location. Option 2: if you are already famil

Habib Abdurrasyid 5 Dec 28, 2021
A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Nav Module The solution for voice related stuff in Python Nav is a Python module which simplifies voice related stuff in Python. Just import the Modul

Snm Logic 1 Dec 20, 2021
ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab

AliceMind AliceMind: ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab This repository provides pre-trained encode

Alibaba 922 Dec 10, 2021
Code for ACL 2021 main conference paper "Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances".

Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances This repository contains the code and pre-trained mode

ICTNLP 90 Dec 27, 2022
Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

LancoPKU 105 Jan 3, 2023
Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

Dennis Priskorn 9 Nov 17, 2022
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

IndoLEM 40 Nov 30, 2022
SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch.

The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, multi-microphone signal processing and many others.

SpeechBrain 5.1k Jan 9, 2023
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

OpenSpeech provides reference implementations of various ASR modeling papers and three languages recipe to perform tasks on automatic speech recogniti

Soohwan Kim 26 Dec 14, 2022