VGGVox models for Speaker Identification and Verification trained on the VoxCeleb (1 & 2) datasets



This directory contains code to import and evaluate the speaker identification and verification models pretrained on the VoxCeleb (1 & 2) datasets, as described in the following papers [1, 2]:

[1] A. Nagrani*, J. S. Chung*, A. Zisserman, VoxCeleb: a large-scale speaker identification dataset, INTERSPEECH, 2017

[2] J. S. Chung*, A. Nagrani*, A. Zisserman, VoxCeleb2: Deep Speaker Recognition, INTERSPEECH, 2018

The models trained for verification map voice spectrograms to a compact Euclidean space where distances directly correspond to a measure of speaker similarity. Such embeddings can be used for tasks such as speaker verification, clustering and diarisation.
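As a minimal sketch of how such embeddings might be compared (not the repository's own evaluation code), the Python snippet below computes the Euclidean distance between two embedding vectors and accepts the pair as the same speaker when the distance falls under a threshold. The L2 normalisation step, the function names and the threshold value are assumptions made for illustration; in practice the threshold would be tuned on held-out verification trials.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing, not confirmed by the source)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def same_speaker(emb_a, emb_b, threshold):
    """Accept the pair when the normalised embeddings are closer than the threshold.

    The threshold is hypothetical and must be chosen on validation data.
    """
    dist = euclidean_distance(l2_normalize(emb_a), l2_normalize(emb_b))
    return dist <= threshold
```

For example, `same_speaker(emb_a, emb_b, threshold=0.5)` returns `True` only when the two normalised embeddings lie within 0.5 of each other; a smaller distance means the voices are judged more similar.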

Prerequisites

  1. MATLAB

  2. MatConvNet

Installing

The easiest way to use the code in this repo is with the vl_contrib package manager. To install, follow these steps:

  1. Install and compile MatConvNet by following its installation instructions.

  2. Run:

vl_contrib install VGGVox
vl_contrib setup VGGVox
  3. Run the demo scripts to import and test the models. There are three short demos: the first two cover the identification and verification models trained on VoxCeleb1, and the third imports and tests a verification model trained on VoxCeleb2. Each shows how to evaluate a model directly on .wav audio files:
demo_vggvox_identif 
demo_vggvox_verif 
demo_vggvox_verif_voxceleb2

Models

The MatConvNet models can also be downloaded directly using the following links:

Model trained for identification on VoxCeleb1

Model trained for verification on VoxCeleb1

Model trained for verification on VoxCeleb2 (this is a ResNet-based model)

Datasets

These models have been pretrained on the VoxCeleb (1 & 2) datasets. VoxCeleb contains over 1 million utterances from more than 7,000 celebrities, extracted from videos uploaded to YouTube. The speakers span a wide range of ethnicities, accents, professions and ages. The dataset can be downloaded directly from here.

Citation

If you use this code then please cite:

@InProceedings{Nagrani17,
  author       = "Nagrani, A. and Chung, J.~S. and Zisserman, A.",
  title        = "VoxCeleb: a large-scale speaker identification dataset",
  booktitle    = "INTERSPEECH",
  year         = "2017",
}


@InProceedings{Chung18,
  author       = "Chung, J.~S. and Nagrani, A. and Zisserman, A.",
  title        = "VoxCeleb2: Deep Speaker Recognition",
  booktitle    = "INTERSPEECH",
  year         = "2018",
}

Fixes

Note: because we keep only the magnitude of the spectrogram, the MATLAB functions provided here for extracting spectrograms produce spectrograms that are mirrored along the frequency axis. This has been fixed in later models, where we chop the spectrograms in half before feeding them into the network.
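The mirroring follows from a general property of the discrete Fourier transform: for a real-valued signal of length N, the magnitude at bin k equals the magnitude at bin N - k. The Python sketch below illustrates this on a toy frame and keeps only the non-redundant bins. It is an illustration of the property, not the repository's feature extraction; the function names and the toy signal are hypothetical, and keeping N/2 + 1 bins is an assumption, since the note above does not state exactly where the spectrograms are cut.

```python
import cmath
import math


def dft_magnitude(frame):
    """Magnitude of the full N-point DFT of a real-valued frame (naive O(N^2) version)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def keep_unique_half(magnitudes):
    """Keep bins 0..N/2, dropping the mirrored upper half (assumed cut point)."""
    return magnitudes[: len(magnitudes) // 2 + 1]


# Toy frame: two sinusoids at bins 3 and 5 of a 16-point window.
N = 16
frame = [
    math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.cos(2 * math.pi * 5 * t / N)
    for t in range(N)
]
mags = dft_magnitude(frame)

# Bin k and bin N - k carry the same magnitude for real input.
is_mirrored = all(abs(mags[k] - mags[N - k]) < 1e-9 for k in range(1, N))
half = keep_unique_half(mags)
```

Because the upper half repeats the lower half, dropping it removes redundant input without discarding any magnitude information.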

Comments
  • number of filters in conv3 layer?

    I was looking at reimplementing the model in the VoxCeleb paper, and then cross checking with the setup in this repo. In the paper, conv3 has 256 filters, whereas in http://www.robots.ox.ac.uk/~vgg/data/voxceleb/models/vggvox_ident_net.mat it appears to have 384. Did you find that bigger was better for this layer+dataset?

    Thanks - big fan of the VoxCeleb paper, great work :-)

    opened by paulfitz 6
  • Verification Siamese embedding

    Hello, thanks for sharing the trained model!

    I wanted to use the verification model to extract a speaker embedding, as described in the paper. There it is explained that the embedding is trained as the output of a Siamese network at layer fc8, with a dimension of 256. In the provided verification model, however, the last layer seems to have a dimension of 1024 instead of 256 (or the number of classes, 1251).

    Is this the correct embedding, or am I extracting the wrong layer?

    I want to compare the embeddings with a distance function, as proposed in the paper.

    Regards

    opened by lodemo 1
  • some problems in configuring the environment

    Hello! I am a beginner and have run into problems configuring the environment and running the test code. I tried to solve some of them but failed. Could you please share your own running environment: the Linux version, MATLAB version, MatConvNet version, CUDA version and cuDNN version? Looking forward to your reply, thank you!

    opened by TTTJJJWWW 0
  • Can anyone help create a trial list with equal numbers of positive and negative pairs?

    1 id10001/Y8hIVOBuels/00001.wav id10001/utrA-v8pPm4/00001.wav
    0 id10001/Y8hIVOBuels/00001.wav id10341/rX4LkvzySSM/00014.wav
    1 id10001/Y8hIVOBuels/00001.wav id10001/zELwAz2W6hM/00010.wav
    0 id10001/Y8hIVOBuels/00001.wav id10341/5DAommAsxmE/00007.wav
    1 id10001/Y8hIVOBuels/00002.wav id10001/zELwAz2W6hM/00005.wav
    0 id10001/Y8hIVOBuels/00002.wav id10293/X7uOKQUYTCM/00001.wav
    1 id10001/Y8hIVOBuels/00002.wav id10001/7gWzIy6yIIk/00001.wav
    0 id10001/Y8hIVOBuels/00002.wav id10016/o524HaR7jfE/00007.wav
    1 id10001/Y8hIVOBuels/00003.wav id10001/J9lHsKG98U8/00024.wav
    0 id10001/Y8hIVOBuels/00003.wav id10246/ojc6G1jqXOU/00001.wav
    1 id10001/Y8hIVOBuels/00003.wav id10001/9mQ11vBs1wc/00003.wav
    0 id10001/Y8hIVOBuels/00003.wav id10425/kV-qT4iLGTs/00002.wav
    1 id10001/Y8hIVOBuels/00004.wav id10001/J9lHsKG98U8/00023.wav
    0 id10001/Y8hIVOBuels/00004.wav id10166/PPZBsH24NyE/00002.wav
    1 id10001/Y8hIVOBuels/00004.wav id10001/zELwAz2W6hM/00015.wav
    0 id10001/Y8hIVOBuels/00004.wav id10425/x2ZdgyFnZwc/00002.wav
    1 id10001/Y8hIVOBuels/00005.wav id10001/J9lHsKG98U8/00004.wav
    0 id10001/Y8hIVOBuels/00005.wav id10104/i6tWIMbpZFs/00004.wav

    opened by dimuthuanuraj 0
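One possible way to build such a list is sketched below in Python. It is an illustration under stated assumptions, not the script used to produce the official VoxCeleb trial lists: it assumes utterance paths of the form speaker/video/file.wav as in the example above, takes label 1 to mean "same speaker" and 0 to mean "different speaker", and uses hypothetical function names. For every utterance it draws one same-speaker and one different-speaker partner, so the two classes stay balanced.

```python
import random


def speaker_of(path):
    """Speaker ID is assumed to be the first path component, e.g. 'id10001'."""
    return path.split("/")[0]


def build_balanced_trials(utterances, pairs_per_utterance=1, seed=0):
    """Return (label, path_a, path_b) tuples with equal numbers of 1s and 0s."""
    rng = random.Random(seed)  # fixed seed so the list is reproducible
    by_speaker = {}
    for path in utterances:
        by_speaker.setdefault(speaker_of(path), []).append(path)

    trials = []
    for path in utterances:
        spk = speaker_of(path)
        same = [p for p in by_speaker[spk] if p != path]
        other = [p for s, paths in by_speaker.items() if s != spk for p in paths]
        if not same or not other:
            continue  # skip utterances that cannot form both kinds of pair
        for _ in range(pairs_per_utterance):
            trials.append((1, path, rng.choice(same)))   # positive pair
            trials.append((0, path, rng.choice(other)))  # negative pair
    return trials


def format_trials(trials):
    """Render trials as 'label path_a path_b', one per line."""
    return "\n".join(f"{label} {a} {b}" for label, a, b in trials)
```

Calling `format_trials(build_balanced_trials(paths))` on a list of utterance paths yields text in the same layout as the list above. Sampling is with replacement, so a pair can repeat when a speaker has few utterances.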
  • Model architecture seems different from original paper

    Hi, I have tried to export both the VGG-M and the ResNet-50 verification models to Keras. In the first case everything worked well, and the architecture proposed in the paper matched the architecture I obtained from the MATLAB file containing the model. In the second case, however, I found the following discrepancies:

    1. The embedding dimension proposed in the paper is 512. In the MATLAB model the dimension is 128 (why is this the case?)

    2. In the VoxCeleb2 paper (and in the original ResNet paper: https://arxiv.org/pdf/1512.03385.pdf) an activation is applied after the addition of the nonlinear stack's output and the shortcut connection. In the MATLAB model, however, the ReLU activation is applied both before and after the shortcut addition. The original ResNet paper is explicit about applying the activation just after the shortcut addition, so I don't understand the reason behind this.

    3. Just before the last block is applied (fc_1, pool_time, fc_2, following the VoxCeleb2 paper's notation), the MATLAB model adds a pooling layer (pool_final_b1 and pool_final_b2, one for each network of the Siamese architecture). I couldn't find any mention of this layer in the original paper.

    4. Apart from the first convolutional layer (conv0_b1, conv0_b2 in the MATLAB model's notation) and the feed-forward layers (fc65_b1, fc65_b2, fc8_s1, fc8_s2 in the MATLAB model's notation), no intermediate conv layer has bias parameters. Is there any reason for this?

    opened by MoteroAltostratus 0
  • Loss and Acc on VGGVox (Thin-Resnet & NetVLAD)

    Hi, correct me if I'm wrong, but you trained VGGVox on VoxCeleb2 as a classification task, right? If so, what loss and accuracy did you get on the training and validation sets? Thanks.

    opened by jjeremy40 0
  • What is the final embedding dimension used for computing distance?

    In the paper, the dimension of the final embedding is given as 512, and in the code the final computed embedding has dimension 128, but the provided code uses a 2048-dimensional embedding for computing distance.

    opened by abinayreddy 0