Content
- What is deepaudio-speaker?
- Installation
- Get Started
- Model Architecture
- How to contribute to deepaudio-speaker?
- Acknowledge
What is deepaudio-speaker?
Deepaudio-speaker is a framework for training neural-network-based speaker embedders. It supports online audio augmentation thanks to torch-audiomentations. It includes, or will include, popular neural network architectures and losses used for speaker embedding.
To make it easy to use features such as mixed-precision, multi-node, and TPU training, I introduced PyTorch Lightning and Hydra in this framework (just as pyannote-audio and openspeech do).
Deepaudio-tts is coming soon.
Installation
conda create -n deepaudio python=3.8.5
conda activate deepaudio
conda install numpy cffi
conda install libsndfile=1.0.28 -c conda-forge
git clone https://github.com/deepaudio/deepaudio-speaker.git
cd deepaudio-speaker
pip install -e .
Get Started
Supported Datasets
#### VoxCeleb2
- Download the VoxCeleb dataset and follow this script to obtain the following directory structure:
/path/to/voxceleb/voxceleb1/dev/wav/id10001/1zcIwhmdeo4/00001.wav
/path/to/voxceleb/voxceleb1/test/wav/id10270/5r0dWxy17C8/00001.wav
/path/to/voxceleb/voxceleb2/dev/aac/id00012/21Uxsk56VDQ/00001.m4a
/path/to/voxceleb/voxceleb2/test/aac/id00017/01dfn2spqyE/00001.m4a
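The layout above encodes the corpus, split, audio format, and speaker in the path itself. As a minimal sketch (not part of deepaudio-speaker), the helper below splits such a path into its components using only the standard library. The name `parse_utterance_path`, the `Utterance` tuple, and its field names are hypothetical, and reading the folder under the speaker ID as a video/session identifier is an assumption.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class Utterance(NamedTuple):
    """Components of one utterance path (field names are illustrative)."""
    corpus: str       # e.g. "voxceleb1" or "voxceleb2"
    split: str        # e.g. "dev" or "test"
    audio_format: str # e.g. "wav" or "aac"
    speaker_id: str   # e.g. "id10001"
    session_id: str   # folder under the speaker (assumed to be a video/session ID)
    filename: str     # e.g. "00001.wav"


def parse_utterance_path(path: str) -> Utterance:
    """Split a path of the form .../<corpus>/<split>/<format>/<speaker>/<session>/<file>."""
    parts = PurePosixPath(path).parts
    if len(parts) < 6:
        raise ValueError(f"path has too few components: {path!r}")
    # Only the last six components matter; the prefix is the user's data root.
    return Utterance(*parts[-6:])


example = parse_utterance_path(
    "/path/to/voxceleb/voxceleb1/dev/wav/id10001/1zcIwhmdeo4/00001.wav"
)
print(example.speaker_id, example.split, example.audio_format)
```

A check like this can be run over a handful of files before training to confirm the directory structure matches what is expected.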
Training examples
- Example1: Train the `ecapa-tdnn` model with `fbank` features on GPU.
$ deepaudio-speaker-train \
dataset=voxceleb2 \
dataset.dataset_path=/your/path/to/voxceleb2/dev/wav/ \
model=ecapa \
model.channels=1024 \
feature=fbank \
lr_scheduler=warmup_reduce_lr_on_plateau \
trainer=gpu \
criterion=aamsoftmax
- Example2: Extract speaker embedding with trained model.
Todo
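When running several experiments, it can be convenient to assemble the Example1 training command from Python instead of typing it by hand. The sketch below only builds the argument list from the `key=value` overrides shown in Example1; the helper name `build_train_command` is hypothetical and not part of deepaudio-speaker, and nothing is executed here.

```python
import shlex
from typing import List, Mapping


def build_train_command(overrides: Mapping[str, object]) -> List[str]:
    """Build an argument list for deepaudio-speaker-train from key=value overrides."""
    args = ["deepaudio-speaker-train"]
    for key, value in overrides.items():
        # Each override is passed as a single "key=value" token.
        args.append(f"{key}={value}")
    return args


# The same overrides as in Example1.
overrides = {
    "dataset": "voxceleb2",
    "dataset.dataset_path": "/your/path/to/voxceleb2/dev/wav/",
    "model": "ecapa",
    "model.channels": 1024,
    "feature": "fbank",
    "lr_scheduler": "warmup_reduce_lr_on_plateau",
    "trainer": "gpu",
    "criterion": "aamsoftmax",
}

command = build_train_command(overrides)
# Print a shell-quoted version for inspection or logging.
print(shlex.join(command))
```

To actually launch training, the list could be passed to `subprocess.run(command, check=True)` in an environment where deepaudio-speaker is installed.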
Model Architecture
ECAPA-TDNN This is an unofficial implementation from @lawlict. Please find more details in this link.
ECAPA-TDNN This is implemented by @joonson. Please find more details in this link.
ResNetSE34L This is borrowed from voxceleb trainer.
ResNetSE34V2 This is borrowed from voxceleb trainer.
resnet101 This is proposed by BUT for speaker diarization. Please note that the feature used in this framework differs from the one used in VB-HMM.
How to contribute to deepaudio-speaker
This is a personal project, so I don't have enough GPU resources to run many experiments. I appreciate any kind of feedback or contribution. Please feel free to open a pull request for small changes such as bug fixes or experiment results. If you have any questions, please open an issue.
Acknowledge
I borrowed a lot of code from openspeech and pyannote-audio.