Unofficial reimplementation of ECAPA-TDNN for speaker recognition (EER=0.86 on Vox1_O when trained only on Vox2)

Tao Ruijie

Last update: Dec 31, 2022

Overview

Introduction

This repository contains my unofficial reimplementation of the standard ECAPA-TDNN, a speaker recognition model, trained on the VoxCeleb2 dataset.

This repository is modified from voxceleb_trainer.

Best Performance in this project (with AS-norm)

Dataset	Vox1_O	Vox1_E	Vox1_H
EER (%)	0.86	1.18	2.17
minDCF	0.0686	0.0765	0.1295

System Description

I will write a technical report about this system with all the details later. Please wait.

Dependencies

Note: These settings are based on my device. You can change the torch and torchaudio versions to suit yours.

Starting from a new environment

conda create -n ECAPA python=3.7.9 anaconda
conda activate ECAPA
pip install -r requirements.txt

Starting from an existing environment

pip install -r requirements.txt

Data preparation

Please follow the official code to prepare your VoxCeleb2 dataset, using the 'Data preparation' part of the voxceleb_trainer repository.

Datasets for training:

VoxCeleb2 training set
MUSAN dataset
RIR dataset

Datasets for evaluation:

VoxCeleb1 test set for Vox1_O
VoxCeleb1 train set for Vox1_E and Vox1_H (optional)

Training

First, change the data paths in trainECAPAModel.py. Then train the ECAPA-TDNN model end-to-end with:

python trainECAPAModel.py --save_path exps/exp1

Every test_step epochs, the system is evaluated on the Vox1_O set and the EER is printed.

The results are saved in exps/exp1/score.txt. The model is saved in exps/exp1/model.

In my case, I trained for 80 epochs on one 3090 GPU. Each epoch takes about 37 minutes, so the total training time is about 48 hours.
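
For readers unfamiliar with the EER figure printed during training, the following is a minimal sketch of how an equal error rate can be computed from trial scores. It is not the code used in this repository: the function name compute_eer is hypothetical, ties between scores are not treated specially, and no interpolation between operating points is done.

```python
import numpy as np

def compute_eer(scores, labels):
    """Return an approximate equal error rate in percent.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Sort trials from the highest score to the lowest.
    labels = labels[np.argsort(-scores)]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target

    # Operating point k accepts the k highest-scoring trials.
    false_accepts = np.concatenate([[0], np.cumsum(1 - labels)])
    false_rejects = n_target - np.concatenate([[0], np.cumsum(labels)])
    far = false_accepts / n_nontarget
    frr = false_rejects / n_target

    # The EER is where the two error rates are closest to each other.
    idx = np.argmin(np.abs(far - frr))
    return 100.0 * (far[idx] + frr[idx]) / 2.0
```

Called on the scores and labels of a trial list, this returns a percentage comparable in meaning to the EER values reported here, although the exact number can differ slightly from an interpolated ROC-based computation.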

Pretrained model

Our pretrained model achieves an EER of 0.96 on the Vox1_O set without AS-norm. You can check it with:

python trainECAPAModel.py --eval --initial_model exps/pretrain.model

With AS-norm, this system achieves an EER of 0.86. We will release the AS-norm code later.
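
Since the AS-norm code is not part of this repository, here is a minimal sketch of the general adaptive score normalization idea, not the author's implementation. It assumes you already have the cosine scores of the enrollment and test embeddings against a cohort of impostor embeddings. The function name as_norm, the argument names, and the default top_k value are all illustrative assumptions, and the cohort scores are assumed not to be constant (otherwise the standard deviation is zero).

```python
import numpy as np

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_k=300):
    """Adaptive symmetric normalization of one trial score.

    score: raw score of the (enrollment, test) trial.
    enroll_cohort_scores: scores of the enrollment embedding against the cohort.
    test_cohort_scores: scores of the test embedding against the cohort.
    top_k: number of highest cohort scores used for the statistics.
    """
    # Keep only the top_k closest cohort entries for each side.
    enroll_top = np.sort(np.asarray(enroll_cohort_scores, dtype=float))[-top_k:]
    test_top = np.sort(np.asarray(test_cohort_scores, dtype=float))[-top_k:]

    # Normalize the raw score with each side's statistics, then average.
    z_side = (score - enroll_top.mean()) / enroll_top.std()
    t_side = (score - test_top.mean()) / test_top.std()
    return 0.5 * (z_side + t_side)
```

The choice of cohort and of top_k affects the result, so the 0.86 figure above should not be expected from this sketch without tuning.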

We have also added the score file exps/pretrain_score.txt, which contains the training loss, training accuracy and Vox1_O EER for each epoch, for your reference.

Reference

@inproceedings{desplanques2020ecapa,
  title={{ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification}},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  booktitle={Interspeech 2020},
  pages={3830--3834},
  year={2020}
}
@inproceedings{chung2020in,
  title={In defence of metric learning for speaker recognition},
  author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle={Interspeech},
  year={2020}
}

Acknowledgements

We studied many useful projects while writing this code, including:

clovaai/voxceleb_trainer
lawlict/ECAPA-TDNN
speechbrain/speechbrain
ranchlai/speaker-verification

Thanks to these authors for open-sourcing their code!

Notes

If you run into problems with this repository, please open an issue on GitHub (in English) instead of messaging me on bilibili, so that others can also benefit from it. Thanks for your understanding!

If you improve on the results of this repository, please let me know. Thanks!

Comments

Accelerating evaluation speed

During evaluation, the current implementation calculates the similarity scores one by one in a for loop, which can be slow when "lines" gets large. Is there an elegant way to vectorize it?

opened by dopiwoo 8
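
One possible way to vectorize the scoring loop is sketched below. It is an illustration under stated assumptions rather than the repository's code: it assumes one L2-normalized embedding vector per file and a list of (enrollment, test) file-name pairs, and the name score_trials is hypothetical. If several embeddings are stored per file, the same indexing trick could be applied to each kind and the resulting scores averaged.

```python
import numpy as np

def score_trials(embeddings, trials):
    """Score all trials at once.

    embeddings: dict mapping file name -> 1-D unit-length vector.
    trials: list of (enroll_name, test_name) pairs.
    Returns a 1-D array with one cosine score per trial.
    """
    names = sorted(embeddings)
    index = {name: i for i, name in enumerate(names)}
    matrix = np.stack([embeddings[name] for name in names])  # (num_files, dim)

    # Look up the row of each side of every trial.
    enroll_idx = np.array([index[enroll] for enroll, _ in trials])
    test_idx = np.array([index[test] for _, test in trials])

    # Row-wise dot product, which equals cosine similarity for unit vectors.
    return np.einsum("ij,ij->i", matrix[enroll_idx], matrix[test_idx])
```

This replaces the per-trial Python loop with two index lookups and one batched dot product, at the cost of holding all embeddings in a single matrix.
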
Inconsistent model input?

I see this in the inference code:

with torch.no_grad():
    embedding_1 = self.speaker_encoder.forward(data_1, aug = False)
    embedding_1 = F.normalize(embedding_1, p=2, dim=1)
    embedding_2 = self.speaker_encoder.forward(data_2, aug = False)
    embedding_2 = F.normalize(embedding_2, p=2, dim=1)
embeddings[file] = [embedding_1, embedding_2]

Here data_1 is the whole utterance, and data_2 is the utterance split into segments and then stacked. For utterances of different lengths, is there no fixed length for data_1 and data_2? Can both be passed to self.speaker_encoder.forward to compute embeddings?

opened by JJ-Guo1996 7
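
To make the two inputs in this question concrete, the sketch below builds a full-utterance batch and a stack of fixed-length segments from one waveform. It is an assumption-laden illustration, not a copy of the repository's loader: the function name make_eval_inputs, the number of segments, and the segment length are all assumed values. ECAPA-TDNN pools statistics over the time axis, so an encoder of this kind can accept inputs of different lengths as long as every item within one batch has the same length.

```python
import numpy as np

def make_eval_inputs(audio, num_segments=5, segment_len=300 * 160 + 240):
    """Return (full, segments) arrays for one utterance.

    full: shape (1, num_samples), whose length varies from file to file.
    segments: shape (num_segments, segment_len), evenly spaced crops.
    """
    audio = np.asarray(audio, dtype=np.float32)
    full = audio[np.newaxis, :]

    # Short utterances are extended by repeating the signal.
    if audio.shape[0] <= segment_len:
        audio = np.pad(audio, (0, segment_len - audio.shape[0]), mode="wrap")

    # Evenly spaced start positions covering the whole utterance.
    starts = np.linspace(0, audio.shape[0] - segment_len, num=num_segments)
    segments = np.stack([audio[int(s):int(s) + segment_len] for s in starts])
    return full, segments
```

With this layout, the first input is a batch of one variable-length waveform and the second is a batch of equal-length crops, so each can be passed through the encoder in a single call.
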
Cannot prepare the dataset

When I followed the Data preparation part in the link and ran this command:

python3 dataprep.py --save_path data --download --user USERNAME --password PASSWORD

I got the following error.

--2021-11-26 14:04:56-- http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
Resolving www.robots.ox.ac.uk (www.robots.ox.ac.uk)... 129.67.94.2
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa [following]
--2021-11-26 14:04:58-- https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-11-26 14:04:59 ERROR 404: Not Found.

Traceback (most recent call last):
  File "Downloads/voxceleb_trainer-master/dataprep.py", line 176, in <module>
    download(args,fileparts)
  File "Downloads/voxceleb_trainer-master/dataprep.py", line 58, in download
    raise ValueError('Download failed %s. If download fails repeatedly, use alternate URL on the VoxCeleb website.'%url)
ValueError: Download failed http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa. If download fails repeatedly, use alternate URL on the VoxCeleb website.

How can I solve this problem? Thanks!

opened by chonghaozhang1998 7
How to use pretrain.model to continue training?

I want to add some Chinese audio to the training data.

Can I use your pretrain.model and continue training with my data,

or do I have to download all the VoxCeleb1 data plus my data, and train from the beginning?

Thank you for your reply.

opened by youyou098888 7
Questions about reproducing the ECAPA-TDNN paper

Hi

I found some differences between your code configuration and the original configuration in ECAPA.

The most important one is that your code randomly chooses just 1 of the 6 noise types to add, whereas ECAPA uses all 6 noise methods, which means they have a larger dataset.

I trained the 512-channel model, which only achieves 1.16 EER (1.01 in ECAPA), but your result with 1024 channels is even better than ECAPA. So is there some training trick you are holding back? Or did you change the configuration in your uploaded code (I just copied your project and changed the channel number, and everything else stays the same)? Or do the small differences in your code make it better for a large model?

And thank you for your excellent work! Any help will be appreciated!

Best

opened by sliver-7 5
About the training time

Hello, thank you so much for contributing this project. I have been training this model recently. I also use one 3090 and the same settings as you, but each epoch takes me about 20 hours. Do you know what the reason might be? Thank you in advance for your answer.

opened by KAI-LI-JAIST 5
How to evaluate your nn

Hi! I'm new to neural networks and I'm having trouble working out how to evaluate your implementation. For now I'm using an audio dataset that is different from your --eval_path and --eval_list, so I'm running this command:

python trainECAPAModel.py --eval --initial_model exps/pretrain.model --eval_list /eval_list_directory --eval_path /eval_path_directory

Is this the correct way to evaluate your implementation? Should I use any different argument? The point is I don't think I understand what exps/pretrain.model is, so I don't know how to use it.

Looking forward to your response! Thanks

opened by rosana-sc7 3
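
When evaluating on a custom dataset as in the question above, the trial list has to be readable by the evaluation code. The sketch below assumes the list follows the VoxCeleb1 verification list layout, with one trial per line written as a label followed by two file paths relative to the evaluation folder. The function name parse_trial_list and the file names in the usage example are hypothetical.

```python
def parse_trial_list(lines):
    """Parse trial lines of the form '<label> <enroll_path> <test_path>'.

    Returns a list of (label, enroll_path, test_path) tuples, where label
    is 1 for a same-speaker trial and 0 for a different-speaker trial.
    """
    trials = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        label, enroll_path, test_path = parts
        trials.append((int(label), enroll_path, test_path))
    return trials


# Usage with two made-up trials.
example = [
    "1 spk1/utt1.wav spk1/utt2.wav",
    "0 spk1/utt1.wav spk2/utt1.wav",
]
trials = parse_trial_list(example)
```

Checking a custom list with a parser like this before running the evaluation can catch missing labels or absolute paths early.
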
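Question about AS-norm

Hi Ruijie, I have been following you on Bilibili for a long time! I have recently been working on the SASVC competition and found that this repository is used as the ASV baseline code. Your README says the ECAPA-TDNN result is the result after AS-norm, but I could not find any backend norm code in the repository. Is this a typo, or have you not added that code to this repository?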

opened by ikou-austin 3
GPU utilization error!

Hi, author. I am training the ECAPA-TDNN model end-to-end using: python trainECAPAModel.py. However, I found that the training time per epoch is very long. After checking, I found that the GPU memory is occupied but its utilization is 0. I manually called model.cuda(), but it did not help. Which part of the program should I change so that the model is loaded onto the GPU successfully?

opened by daiyuuu 2
Do you have any open source plans for the Stage II in your latest paper?

I have read your great work, Self-Supervised Speaker Recognition with Loss-Gated Learning (ICASSP 2022).

I am trying to follow Stage II in your paper, which shows great gains in your experiments.

If this part of the code were available, it would help a lot.

Thanks a lot.

opened by seacj 2
Training set is not 5 times bigger after augmentation

I noticed that in the dataloader, the training set after augmentation is the same size as the original audio set.

So augmentation is not meant to increase the amount of training data, only to increase its diversity?

opened by youyou098888 2
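
The difference raised in this question can be illustrated with a small sketch of on-the-fly augmentation. It is a toy model, not the repository's dataloader: the augmentation names and the functions pick_augmentation and epoch_view are hypothetical. Each utterance is visited once per epoch with one randomly chosen augmentation, so the epoch size stays equal to the number of utterances while the variant seen changes from epoch to epoch.

```python
import random

# Illustrative augmentation names; the real set may differ.
AUGMENTATIONS = ["clean", "reverb", "babble", "music", "noise", "tv_noise"]

def pick_augmentation(rng):
    """Choose one augmentation type at random for a single utterance."""
    return rng.choice(AUGMENTATIONS)

def epoch_view(utterances, rng):
    """Return one (utterance, augmentation) pair per utterance.

    This mimics a loader that augments on the fly: the list has the same
    length as the input, unlike an offline scheme that would store every
    augmented copy and multiply the dataset size.
    """
    return [(utt, pick_augmentation(rng)) for utt in utterances]


rng = random.Random(0)
utterances = ["utt1", "utt2", "utt3", "utt4"]
first_epoch = epoch_view(utterances, rng)
second_epoch = epoch_view(utterances, rng)
```

Over many epochs the model still sees many augmented variants of each utterance, which is why this scheme increases diversity without increasing the per-epoch sample count.
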
ECAPA-TDNN

Hi!

I'm having trouble understanding the ECAPA-TDNN architecture.

To be specific, I don't understand what the elements of ECAPA-TDNN (PreEmphasis, MelSpectrogram, FBankAug, conv1d, relu, batchNorm1d, bottleneck, Attention...) do in the context of speaker verification.

What about the AAMsoftmax classifier, the Adam optimizer and the StepLR scheduler?

Thanks for your attention and time!

opened by rosana-sc7 1

Owner

Tao Ruijie

NUS ECE PhD student

