Overview

Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Grad-TTS

Official implementation of the Grad-TTS model, based on diffusion probabilistic modelling. For all details, check out our paper accepted to ICML 2021 (https://arxiv.org/abs/2105.06337).

Authors: Vadim Popov*, Ivan Vovk*, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov.

*Equal contribution.

Comments
  • Typo in some equations in GradTTS paper

    Thanks for your great work on Grad-TTS! However, I recently found a tiny error in the arXiv version 2 of the Grad-TTS paper (https://arxiv.org/pdf/2105.06337.pdf). In Eq. 31 and Eq. 32 in the appendix, $X_t$ and $\mu$ are put in the wrong order, i.e. it should probably be $\mu - X_t$ rather than $X_t - \mu$. This typo likely originates from the line above Eq. 31, "In our case $f(X_t, t) = \frac{1}{2} \Sigma^{-1}(X_t - \mu)\beta$ and ...", where it should be $f(X_t, t) = \frac{1}{2} \Sigma^{-1}(\mu - X_t)\beta$ instead. The other parts of the paper do not seem to be affected by this, and the derivations are solid and fluent. Again, great thanks for the work!

    opened by cantabile-kwok 4
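
    For reference, a LaTeX restatement of the corrected drift inside the forward SDE it comes from, as described in the issue: the process is attracted towards the terminal mean $\mu$, which is what forces the $\mu - X_t$ ordering.

    $$
    \mathrm{d}X_t = \frac{1}{2}\Sigma^{-1}(\mu - X_t)\,\beta_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}W_t,
    \qquad
    f(X_t, t) = \frac{1}{2}\Sigma^{-1}(\mu - X_t)\,\beta_t .
    $$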
  • Not able to generate audio using libritts of as good quality as using ljspeech

    Hi, thank you for the great work and for releasing the pretrained models. I trained Grad-TTS on LibriTTS (multi-speaker) and on LJSpeech (single-speaker) and found that the single-speaker setting gives much better quality than the multi-speaker one. This holds even when using your released grad-tts-libri-tts.pt. Were you able to get better quality in the multi-speaker setting? Here are a few samples I generated in the multi-speaker setting using your released model: https://drive.google.com/drive/folders/1ze0_rJXtmPY3JNAwnr0A_9C4OVvULEj7?usp=sharing.

    opened by Hertin 4
  • How is `out_size` in `params` determined

    Hi, I am modifying the code for my own purposes. I notice that here: https://github.com/huawei-noah/Speech-Backbones/blob/b82fdd546d9d977573c8557f242b06a0770ece8e/Grad-TTS/params.py#L53 the argument is hard-coded, and I guess 22050 and 256 are the sampling rate and frame shift used for LJSpeech, right? If so, should I change them when dealing with a different dataset?

    opened by cantabile-kwok 2
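
    For context, a minimal sketch of where that hard-coded value appears to come from, assuming LJSpeech's 22050 Hz sampling rate and 256-sample frame shift, so that `out_size` covers roughly 2 seconds of mel frames (`fix_len_compatibility` is the helper used in the Grad-TTS code):

    ```python
    # Sketch of how out_size is derived in Grad-TTS/params.py (LJSpeech values).
    sample_rate = 22050   # audio sampling rate in Hz
    hop_length = 256      # frame shift of the mel-spectrogram in samples

    # Roughly 2 seconds' worth of mel frames, used as the training segment length.
    out_size = 2 * sample_rate // hop_length  # = 172 frames

    # The repo additionally rounds this with fix_len_compatibility() so the
    # segment length divides cleanly through the U-Net's downsampling layers.
    ```

    So for a different dataset, the 22050 and 256 would presumably be replaced by that dataset's sampling rate and hop length.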
  • Grad-TTS in multispeaker setting

    Thank you for releasing the original implementation of Grad-TTS. I would like to know whether a multi-speaker setting is available or planned for release.

    I am implementing a multi-speaker setting using this repo. Would the maintainers of this repo be interested in discussing or providing feedback on a multi-speaker Grad-TTS implementation?

    Regards, Ajinkya

    opened by ajinkyakulkarni14 2
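
    On the multi-speaker question, a minimal sketch of the usual speaker-ID conditioning recipe; `n_spks` and `spk_emb_dim` follow the naming in the Grad-TTS `params.py` (where `n_spks = 247` is used for Libri-TTS), but the wiring below is an illustrative assumption rather than the repo's exact code:

    ```python
    import torch
    import torch.nn as nn

    # Illustrative speaker-ID conditioning for a multi-speaker TTS model.
    n_spks, spk_emb_dim = 247, 64       # Libri-TTS speaker count, embedding size
    spk_embedder = nn.Embedding(n_spks, spk_emb_dim)

    spk_ids = torch.tensor([0, 5, 12])  # one speaker id per utterance in a batch
    spk_emb = spk_embedder(spk_ids)     # (batch, spk_emb_dim)
    # spk_emb is then broadcast over time and fed to the text encoder and the
    # diffusion decoder as an extra conditioning input.
    ```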
  • [Errno 13] Permission denied: '/home/user/app/Grad-TTS/model/monotonic_align/core.c'

    [Errno 13] Permission denied: '/home/user/app/Grad-TTS/model/monotonic_align/core.c'
    Traceback (most recent call last):
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Build/Dependencies.py", line 1208, in cythonize_one
        result = compile_single(pyx_file, options, full_module_name=full_module_name)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Main.py", line 727, in compile_single
        return run_pipeline(source, options, full_module_name)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Main.py", line 515, in run_pipeline
        err, enddata = Pipeline.run_pipeline(pipeline, source)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 355, in run_pipeline
        data = run(phase, data)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 335, in run
        return phase(data)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/Pipeline.py", line 52, in generate_pyx_code_stage
        module_node.process_implementation(options, result)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/ModuleNode.py", line 143, in process_implementation
        self.generate_c_code(env, options, result)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Compiler/ModuleNode.py", line 411, in generate_c_code
        f = open_new_file(result.c_file)
      File "/home/user/.local/lib/python3.8/site-packages/Cython/Utils.py", line 76, in open_new_file
        return codecs.open(path, "w", encoding="ISO-8859-1")
      File "/usr/local/lib/python3.8/codecs.py", line 905, in open
        file = builtins.open(filename, mode, buffering)

    opened by AK391 0
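
    The failure itself is Cython being unable to write core.c next to core.pyx because the app directory is read-only. A hedged workaround sketch (paths taken from the traceback; the writable destination is an assumption): build the extension from a writable copy of the module directory.

    ```python
    # Workaround sketch: Cython needs write access to emit core.c next to
    # core.pyx, so build the monotonic_align extension from a writable copy.
    import shutil
    import subprocess
    import sys

    src = "/home/user/app/Grad-TTS/model/monotonic_align"  # read-only here
    dst = "/tmp/monotonic_align"                           # writable scratch copy

    shutil.copytree(src, dst, dirs_exist_ok=True)
    subprocess.run([sys.executable, "setup.py", "build_ext", "--inplace"],
                   cwd=dst, check=True)
    # The built extension then has to be made importable, e.g. by adding dst to
    # sys.path or copying the compiled module back once permissions allow it.
    ```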
  • Model training question

    Hi, thanks for sharing the code. I have a folder with wav files from different speakers, and I don't understand what to do next to get a trained model. What type of files should be in the "mels" and "embeds" folders, and how exactly should they be produced? Are there more detailed instructions somewhere?

    opened by Cpgrach 1
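
    Going by the DiffVC data-loading convention, "mels" and "embeds" mirror the wav folder with one .npy file per utterance: a mel-spectrogram and a speaker-encoder embedding respectively. A hedged preprocessing sketch; the mel parameters and file-name suffixes below are assumptions, so match them against the repo's own feature extraction before training:

    ```python
    # Hedged sketch: one mel .npy and one speaker-embedding .npy per wav file.
    import os
    import numpy as np
    import librosa

    wav_dir, mel_dir, emb_dir = "wavs", "mels", "embeds"
    os.makedirs(mel_dir, exist_ok=True)
    os.makedirs(emb_dir, exist_ok=True)

    for name in os.listdir(wav_dir):
        if not name.endswith(".wav"):
            continue
        audio, sr = librosa.load(os.path.join(wav_dir, name), sr=22050)
        mel = librosa.feature.melspectrogram(
            y=audio, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
        mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
        np.save(os.path.join(mel_dir, name.replace(".wav", "_mel.npy")), mel)
        # The speaker embedding would come from a pretrained speaker encoder
        # (e.g. the one shipped with the repo) and be stored analogously:
        # np.save(os.path.join(emb_dir, name.replace(".wav", "_embed.npy")), embed)
    ```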
  • about diffVC on Mandarin datasets

    Hello, I adapted the DiffVC code to Mandarin datasets. However, the audio after voice conversion has tone sandhi problems. I want to ask whether this performance is normal.

    opened by Theweekfoolish229 1
  • Why does the BNE-PPG-VC model in your demo perform better than the pre-trained model given in the original paper?

    I tried the pre-trained model bneSeq2seqMoL-vctk-libritts460-oneshot, then converted source wavs to target wavs from the demo provided with your paper. Your model performed better than the one trained in the original paper. Why? Have you retrained the HiFi-GAN model? Thank you!

    opened by jiazj-jiazj 0
  • About the prior loss and MAS algorithm

    Great work! I've been studying the paper and the code recently, and there's something that confuses me a lot.

    In my understanding, the encoder outputs a Gaussian distribution with a different mu for each phoneme, and the DPM decoder recovers the mel-spectrogram y from these Gaussians; hence y is not Gaussian anymore. But I gather from Eq. (14) and the code that when calculating the prior loss, you are actually computing the log-likelihood of y under the Gaussian distribution with mean mu. Also, when applying MAS for duration modelling, you perform a similar likelihood computation to get the soft alignment (denoted log_prior in the code). So I wonder why this is reasonable. I also compared the GlowTTS code: it uses z to evaluate the Gaussian likelihood with mean mu, where z is the latent variable obtained from the mel-spectrogram via a normalizing flow. That seems more reasonable to me for now, as z is Gaussian by itself.

    opened by cantabile-kwok 1
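
    On this point, what the training code appears to compute for the prior loss is the masked Gaussian log-likelihood of the mel target y under N(mu, I); the same per-frame likelihood serves as the score matrix for MAS. A sketch matching the shape of that computation, lightly simplified from the Grad-TTS training code:

    ```python
    import math
    import torch

    # Prior loss sketch: negative log-likelihood of the mel target y under the
    # encoder's Gaussian N(mu_y, I), masked and averaged over valid entries.
    def prior_loss(y, mu_y, y_mask, n_feats=80):
        # y, mu_y: (batch, n_feats, frames); y_mask: (batch, 1, frames)
        nll = 0.5 * ((y - mu_y) ** 2 + math.log(2 * math.pi)) * y_mask
        return nll.sum() / (y_mask.sum() * n_feats)
    ```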
  • Possibly missing __dict__ in the Projector class' constructor

    While loading the pretrained weights of the ST2VecEncoder, I had to replace **conv_cfg_i with **conv_cfg_i.__dict__ in __init__ of the Projector class (SPIRAL/nemo/collections/asr/parts/spec2vec.py). Doing this allowed me to load all the weights and match the keys successfully; nonetheless, I was curious to know whether I am missing some installation step?

    opened by Sri-Harsha 0
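
    For reference, the fix described above in minimal form; the class and function names here are illustrative stand-ins, not the actual SPIRAL code. A plain config object cannot be **-unpacked, but its __dict__ can:

    ```python
    # Illustrative stand-in for the Projector situation: ** unpacking needs a
    # mapping, so a config object must be unpacked via its __dict__.
    class ConvCfg:
        def __init__(self, in_channels, out_channels, kernel_size):
            self.in_channels = in_channels
            self.out_channels = out_channels
            self.kernel_size = kernel_size

    def make_conv(in_channels, out_channels, kernel_size):
        return (in_channels, out_channels, kernel_size)

    conv_cfg_i = ConvCfg(512, 512, 3)
    # make_conv(**conv_cfg_i)   # TypeError: argument after ** must be a mapping
    layer = make_conv(**conv_cfg_i.__dict__)  # unpacks the attributes as kwargs
    ```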
Owner
HUAWEI Noah's Ark Lab
Working with and contributing to the open source community in data mining, artificial intelligence, and related fields.
Python script for snapping up mobile phones on the Huawei Store

HUAWEI STORE GO 2021. Description: a flash-sale crawler script for the Huawei Store based on Python 3 and Selenium, modified from BUY-HW, a project not updated for almost two years, to snap up a Nova 8 for my goddess (when did Huawei start copying Xiaomi's hunger marketing?). The login and purchase parts of the original project no longer work; this project fixes them to work with the new Huawei Store and adds some features.

ZhangLiang 111 Dec 22, 2022
Main repository for the chatbot Bobotinho.

Bobotinho Bot Main repository for the chatbot Bobotinho. ℹ️ Introduction Twitch chatbot with entertainment commands. Technologies Concurrent code

Bobotinho 14 Nov 29, 2022
Artificial Conversational Entity for queries in Eulogio "Amang" Rodriguez Institute of Science and Technology (EARIST)

Coeus - EARIST A.C.E. Coeus is an Artificial Conversational Entity for queries in Eulogio "Amang" Rodriguez Institute of Science and Technology,

Dids Irwyn Reyes 3 Oct 14, 2022
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.2k Jan 9, 2023
Speech Recognition for Uyghur using Speech transformer

Speech Recognition for Uyghur using Speech transformer Training: this model using CTC loss and Cross Entropy loss for training. Download pretrained mo

Uyghur 11 Nov 17, 2022
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

Chung-Ming Chien 1k Dec 30, 2022
Simple Speech to Text, Text to Speech

Simple Speech to Text, Text to Speech. 1. Download Repository. Option 1: download this repository and extract it to the desired location. Option 2: if you are already famil

Habib Abdurrasyid 5 Dec 28, 2021
A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Nav Module The solution for voice related stuff in Python Nav is a Python module which simplifies voice related stuff in Python. Just import the Modul

Snm Logic 1 Dec 20, 2021
ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab

AliceMind AliceMind: ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab This repository provides pre-trained encode

Alibaba 922 Dec 10, 2021
Code for ACL 2021 main conference paper "Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances".

Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances This repository contains the code and pre-trained mode

ICTNLP 90 Dec 27, 2022
Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

LancoPKU 105 Jan 3, 2023
Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

ItemSubjector Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition alg

Dennis Priskorn 9 Nov 17, 2022
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

IndoLEM 40 Nov 30, 2022
SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch.

The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, multi-microphone signal processing and many others.

SpeechBrain 5.1k Jan 9, 2023
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

OpenSpeech provides reference implementations of various ASR modeling papers and three languages recipe to perform tasks on automatic speech recogniti

Soohwan Kim 26 Dec 14, 2022