This repository contains code to run (i.e., train, perform inference with, and evaluate) a diarization method called EEND-vector-clustering.

Overview

EEND-vector clustering

EEND-vector clustering (End-to-End Neural Diarization with vector clustering) is a speaker diarization framework that integrates two complementary major diarization approaches, i.e., traditional clustering-based and emerging end-to-end neural network-based approaches, to get the best of both worlds. In [1] it is shown that EEND-vector clustering outperforms EEND when the recording is long (e.g., more than 5 min), while [2] shows on CALLHOME data that it outperforms x-vector clustering and EEND-EDA, especially when the number of speakers in a recording is large.

This repository contains an example implementation of EEND-vector clustering based on PyTorch to reproduce the results in [2], i.e., the CALLHOME experiments. For the trainer, we use Padertorch. This repository is implemented based on EEND and relies on some useful functions provided therein.
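
As a rough conceptual sketch of this two-stage design (not the repository's actual API; eend_forward_chunk and the variable names below are hypothetical), the model first diarizes each fixed-length chunk locally and emits one embedding per local speaker; clustering those embeddings across chunks then links local speaker indices to global speaker identities:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(chunks, eend_forward_chunk, n_speakers):
    # Stage 1: per-chunk EEND. eend_forward_chunk is a hypothetical
    # callable returning frame-level speech activities of shape
    # (frames, n_local_speakers) plus one embedding per local speaker
    # of shape (n_local_speakers, dim).
    activities, spk_vectors = [], []
    for chunk in chunks:
        act, vecs = eend_forward_chunk(chunk)
        activities.append(act)
        spk_vectors.append(vecs)

    # Stage 2: cluster all chunk-level speaker vectors. The actual
    # method uses constrained clustering (vectors from the same chunk
    # must not share a cluster); plain agglomerative clustering is
    # shown here for simplicity.
    X = np.concatenate(spk_vectors, axis=0)
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(X)

    # Relabel each chunk's local speakers with their global cluster IDs.
    out, k = [], 0
    for act in activities:
        n_local = act.shape[1]
        out.append((act, labels[k:k + n_local]))
        k += n_local
    return out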

References

[1] Keisuke Kinoshita, Marc Delcroix, and Naohiro Tawara, "Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds," Proc. ICASSP, pp. 7198–7202, 2021

[2] Keisuke Kinoshita, Marc Delcroix, and Naohiro Tawara, "Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech," Proc. Interspeech, 2021 (to appear)

Citation

@inproceedings{eend-vector-clustering,
  author    = {Keisuke Kinoshita and Marc Delcroix and Naohiro Tawara},
  title     = {Integrating End-to-End Neural and Clustering-Based Diarization: Getting the Best of Both Worlds},
  booktitle = {{ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  pages     = {7198--7202},
  year      = {2021}
}

Install tools

Requirements

  • NVIDIA CUDA GPU
  • CUDA Toolkit (version == 9.2, 10.1 or 10.2)

Install kaldi and python environment

cd tools
make
  • This command builds kaldi at tools/kaldi
    • If you want to use a pre-built kaldi:
      cd tools
      make KALDI=<existing_kaldi_root>
      This option makes a symlink at tools/kaldi
  • This command also extracts miniconda3 at tools/miniconda3 and creates a conda environment named 'eend'
  • It then installs PyTorch and Padertorch into the 'eend' environment
  • Finally, it clones EEND so that the symbolic links stored under eend/, egs/, and utils/ can reference it

Test recipe (mini_librispeech)

Configuration

  • Modify egs/mini_librispeech/v1/cmd.sh according to your job scheduler. If you use your local machine, use "run.pl" (default). If you use Grid Engine, use "queue.pl". If you use SLURM, use "slurm.pl". For more information about cmd.sh, see http://kaldi-asr.org/doc/queue.html.

Run data preparation, training, inference, and scoring

cd egs/mini_librispeech/v1
CUDA_VISIBLE_DEVICES=0 ./run.sh
  • See RESULT.md and compare with your result.

CALLHOME experiment

Configuration

  • Modify egs/callhome/v1/cmd.sh according to your job scheduler. If you use your local machine, use "run.pl" (default). If you use Grid Engine, use "queue.pl". If you use SLURM, use "slurm.pl". For more information about cmd.sh, see http://kaldi-asr.org/doc/queue.html.

Run data preparation, training, inference, and scoring

cd egs/callhome/v1
CUDA_VISIBLE_DEVICES=0 ./run.sh --db_path <db_path>
# <db_path> is the absolute path of the directory where the necessary LDC corpora are stored.
  • See RESULT.md and compare with your result.
  • If you want to run multi-GPU training, simply set CUDA_VISIBLE_DEVICES appropriately (e.g., CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh). This environment variable may be set automatically by your job scheduler, such as SLURM.
Comments
  • Requesting the CALLHOME result

    I would like to compare our algorithm with the CALLHOME results in your paper. Could you kindly provide the RTTM hypotheses for CALLHOME from the original paper? Thanks a lot.

    opened by liutaocode 2
  • Which file should I pass to spkv_lab when resuming training with initmodel?

    Hi, for some reason my training process was interrupted. I want to resume training from the latest checkpoint and continue training on the old data. There is a parameter --spkv_lab: "file path of speaker vector with label and speaker ID conversion table for adaptation". Which file exactly does this mean? I tried featlab_chunk_indices.txt but failed, and I cannot find another suitable file... please help. Thanks

    opened by kli017 1
  • Performance of different net architectures

    Hello, I was wondering whether you have evaluated different net architectures. I modified the net according to the Transformer paper (number of layers, number of heads, and hidden unit size), and I found that the results do not get better (and are even worse on some unseen test wavs) as the net becomes more complicated.

    opened by kli017 0
  • Possible to train with audios containing different numbers of speakers?

    Hello, I found there is a parameter named num_speakers in train.yaml. Does that mean the number of speakers in an audio file should equal num_speakers?

    opened by kli017 0
  • Invalid input shape after modifying the layer and head num in train.yaml

    Hello, when I modified the layer and head num of the transformer in train.yaml, I got a RuntimeError: shape '[128, -1, 12, 21]' is invalid for input of size 4915200

    spk_loss_ratio: 0.03
    spkv_dim: 256
    max_epochs: 120
    input_transform: logmel23_mn
    lr: 0.001
    optimizer: noam
    num_speakers: 3
    gradclip: 5
    chunk_size: 150
    batchsize: 128
    num_workers: 8
    hidden_size: 256
    context_size: 7
    subsampling: 10
    frame_size: 200
    frame_shift: 80
    sampling_rate: 8000
    noam_scale: 1.0
    noam_warmup_steps: 25000
    transformer_encoder_n_heads: 12
    transformer_encoder_n_layers: 8
    transformer_encoder_dropout: 0.1
    seed: 777
    feature_nj: 100
    batchsize_per_gpu: 8
    test_run: 0
    

    I haven't gone through the model structure code yet. So these two parameters cannot be modified arbitrarily? Or are they related to other parameters (such as context_size)?

    opened by kli017 0
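
    A likely cause, for reference: standard multi-head attention reshapes the hidden dimension into (n_heads, hidden_size // n_heads), so hidden_size must be divisible by the number of heads, and 256 is not divisible by 12. The numbers in the error match that reading exactly, as this quick check shows:

    batchsize, chunk_size, hidden_size, n_heads = 128, 150, 256, 12

    total = batchsize * chunk_size * hidden_size     # 4915200: the reported input size
    head_dim = hidden_size // n_heads                # 21: the reported last dimension

    # view(batch, -1, n_heads, head_dim) must factor `total` exactly,
    # which fails because hidden_size is not divisible by n_heads:
    print(hidden_size % n_heads)                     # 4 -> 256 is not divisible by 12
    print(total % (batchsize * n_heads * head_dim))  # 12288 -> the reshape is invalid

    Choosing a head count that divides hidden_size (e.g., 4, 8, or 16 for hidden_size 256) should avoid this particular reshape error.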
  • Potential issue excluding silent speakers

    Hello there,

    Thanks for your efforts in open-sourcing the code; it's vital for those of us trying to reproduce the results presented in the paper.

    Problem

    I've come across a RuntimeError when adapting the model with our private data, which shows:

    /*/EEND-vector-clustering/eend/pytorch_backend/train.py:186: RuntimeWarning: invalid value encountered in true_divide
      fet_arr[spk] = org / norm
    ...
    Traceback (most recent call last):
    ...
    RuntimeError: The loss (nan) is not finite.
    

    Detail

    After some debugging, I found the problem actually happens during the backpropagation step when an entry in the embedding layer is left as all zeros: https://github.com/nttcslab-sp/EEND-vector-clustering/blob/b3649eed02fe4f0239f2000fb895120d3c549631/eend/pytorch_backend/train.py#L173-L186

    Since the embeddings are loaded from the dumped speaker embeddings generated by the save_spkv_lab.py script when adapting the model, I suspect there might be an issue in the save_spkv_lab function.

    After some careful step-by-step checking with pdb, I found that silent-speaker labels are in fact added to the all_labels variable when dumping the speaker embeddings: https://github.com/nttcslab-sp/EEND-vector-clustering/blob/b3649eed02fe4f0239f2000fb895120d3c549631/eend/pytorch_backend/infer.py#L349-L355

    Even when torch.sum(t_chunked_t[sigma[i]]) > 0, lab can still be -1, which is considered a silent speaker according to the code in: https://github.com/nttcslab-sp/EEND-vector-clustering/blob/b3649eed02fe4f0239f2000fb895120d3c549631/eend/pytorch_backend/diarization_dataset.py#L94-L99. (This is what confuses me, since it should not happen: both lab and T/t_chunked are produced with info from kaldi_obj.utt2spk.)

    Since these silent-speaker labels are -1 and Python lists support negative indexing, this issue is silently ignored when dumping the embeddings but causes exceptions once training begins.

    Question

    I could simply fix this issue by adding a speaker label to all_labels only if lab >= 0 when saving the speaker embeddings, and the subsequent training process then continues smoothly, resulting in a well-performing model.

    But before opening any PR, I would like to know whether you have ever come across such an issue, or whether you have any idea why this happens.

    Thanks!

    opened by Zenglinxiao 0
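
    A minimal sketch of the guard described above (a hypothetical helper, not the repository's exact code), assuming embs is an (n, d) array of chunk-level speaker embeddings and labels a length-n list of speaker IDs with -1 marking a silent speaker:

    import numpy as np

    def drop_silent_speakers(embs, labels):
        # Keep only entries with a valid (non-negative) speaker label, so a
        # -1 label can neither index the last row via Python's negative
        # indexing nor leave an all-zero embedding row that later yields a
        # NaN loss.
        keep = [i for i, lab in enumerate(labels) if lab >= 0]
        return embs[keep], [labels[i] for i in keep]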
  • Fix Agg. Clustering ValueError with sample < 2

    Currently, the following error may arise when a trained EEND-vector model is used for inference on an audio recording with only a single speaker:

    ValueError: Found array with 1 sample(s) (shape=(1, 1)) while a minimum of 2 is required by AgglomerativeClustering.
    

    This PR fixes the issue by adding an extra verification to make sure min_n_samples is always at least two, which avoids running clustering on a single sample.

    opened by Zenglinxiao 0
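
    A minimal sketch of such a guard (illustrative, not the PR's exact code):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def cluster_speaker_vectors(X, n_clusters):
        # X: (n, d) array of speaker vectors. AgglomerativeClustering
        # requires at least two samples, so with fewer we return trivial
        # labels instead of clustering.
        if len(X) < 2:
            return np.zeros(len(X), dtype=int)
        n_clusters = min(n_clusters, len(X))  # no more clusters than samples
        return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)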