I am trying to fine-tune models to support one more speaker, but it looks like I am doing something wrong.
I want to use the "dia_dihard" pipeline, so I need to fine-tune three models: {sad_dihard, scd_dihard, emb_voxceleb}.
For my speaker I have a single WAV file longer than 1 hour.
So I created a database.yml file:
Databases:
  IK: /content/fine/kirilov/{uri}.wav

Protocols:
  IK:
    SpeakerDiarization:
      kirilov:
        train:
          uri: train.lst
          annotation: train.rttm
          annotated: train.uem
and put the additional files next to database.yml:
kirilov
├── database.yml
├── kirilov.wav
├── train.lst
├── train.rttm
└── train.uem
train.lst:
kirilov
train.rttm:
SPEAKER kirilov 1 0.0 3600.0 <NA> <NA> Kirilov <NA> <NA>
train.uem:
kirilov NA 0.0 3600.0
I assume this tells the trainer to use the kirilov.wav file and to take 3600 seconds of audio from it for training.
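To double-check my understanding of the wiring, here is a minimal stdlib-only sketch (not any pyannote API) of how I expect the {uri} placeholder to resolve and of the consistency between my one-line train.rttm and train.uem:

```python
# "kirilov" is the single URI listed in train.lst.
uri = "kirilov"
audio_path = "/content/fine/kirilov/{uri}.wav".format(uri=uri)
print(audio_path)  # /content/fine/kirilov/kirilov.wav

# Parse my one-line train.rttm: SPEAKER <uri> <chan> <start> <dur> ...
rttm_line = "SPEAKER kirilov 1 0.0 3600.0 <NA> <NA> Kirilov <NA> <NA>"
fields = rttm_line.split()
rttm_uri, rttm_start, rttm_dur = fields[1], float(fields[3]), float(fields[4])

# Parse my one-line train.uem: <uri> <chan> <start> <end>
uem_uri, _, uem_start, uem_end = "kirilov NA 0.0 3600.0".split()

# The annotated region should fully cover the annotation.
assert rttm_uri == uem_uri == uri
assert float(uem_start) <= rttm_start
assert rttm_start + rttm_dur <= float(uem_end)
print("RTTM/UEM are consistent")
```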
Now I fine-tune the models; the current folder is /content/fine/kirilov, so database.yml is picked up from the current directory:
!pyannote-audio sad train --pretrained=sad_dihard --subset=train --to=1 --parallel=4 "/content/fine/sad" IK.SpeakerDiarization.kirilov
!pyannote-audio scd train --pretrained=scd_dihard --subset=train --to=1 --parallel=4 "/content/fine/scd" IK.SpeakerDiarization.kirilov
!pyannote-audio emb train --pretrained=emb_voxceleb --subset=train --to=1 --parallel=4 "/content/fine/emb" IK.SpeakerDiarization.kirilov
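For reference, the checkpoint paths I load later follow the layout these commands create under each experiment root. The pattern below is inferred from the folders in my runs, not from any documented API:

```python
# Build the expected checkpoint path for a given experiment root and epoch.
# Pattern inferred from the folders pyannote-audio wrote during my training runs.
def weights_path(root: str, protocol: str, epoch: int) -> str:
    return f"{root}/train/{protocol}.train/weights/{epoch:04d}.pt"

protocol = "IK.SpeakerDiarization.kirilov"
print(weights_path("/content/fine/sad", protocol, 1))
# /content/fine/sad/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt
```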
Output looks like:
Using cache found in /root/.cache/torch/hub/pyannote_pyannote-audio_develop
Loading labels: 0file [00:00, ?file/s]/usr/local/lib/python3.6/dist-packages/pyannote/database/protocol/protocol.py:128: UserWarning:
Existing key "annotation" may have been modified.
Loading labels: 1file [00:00, 20.49file/s]
/usr/local/lib/python3.6/dist-packages/pyannote/audio/train/trainer.py:128: UserWarning:
Did not load optimizer state (most likely because current training session uses a different loss than the one used for pre-training).
2020-06-19 15:35:26.763592: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Training: 0%| | 0/1 [00:00<?, ?epoch/s]
Epoch #1: 0%| | 0/29 [00:00<?, ?batch/s]
Epoch #1: 0%| | 0/29 [00:00<?, ?batch/s, loss=0.676]
Epoch #1: 3%|▋ | 1/29 [00:00<00:26, 1.04batch/s, loss=0.676]
Etc.
Then I try to run the pipeline with the new .pt checkpoints:
import os
import torch
from pyannote.audio.pipeline import SpeakerDiarization
pipeline = SpeakerDiarization(embedding = "/content/fine/emb/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
sad_scores = "/content/fine/sad/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
scd_scores = "/content/fine/scd/train/IK.SpeakerDiarization.kirilov.train/weights/0001.pt",
method= "affinity_propagation")
# params from dia_dihard/train/X.SpeakerDiarization.DIHARD_Official.development/params.yml
pipeline.load_params("/content/drive/My Drive/pyannote/params.yml")
FILE = {'audio': "/content/groundtruth/new.wav"}
diarization = pipeline(FILE)
diarization
The result is that, for my new.wav, the whole audio is recognized as one speaker talking without any pauses, so I assume the fine-tuned models are broken. And it does not matter whether I train for 1 epoch or for 100.
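To quantify "the whole audio is recognized as speech", I compare total speech time to file duration. The real pipeline output is a pyannote.core.Annotation; the stdlib sketch below approximates it as plain (start, end) tuples just to illustrate the check:

```python
def speech_ratio(segments, total_duration):
    """Fraction of the file covered by speech, merging overlapping segments."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous segment
        else:
            merged.append([start, end])
    return sum(end - start for start, end in merged) / total_duration

# With the broken fine-tuned models the ratio is ~1.0 (no pauses at all);
# a healthy diarization of conversational audio should leave gaps.
print(speech_ratio([(0.0, 60.0)], 60.0))                # 1.0
print(speech_ratio([(0.0, 20.0), (30.0, 50.0)], 60.0))  # ≈0.667
```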
In case I use either:
- 0000.pt (I assume these are the original pretrained weights)
pipeline = SpeakerDiarization(embedding = "/content/fine/emb/train/IK.SpeakerDiarization.kirilov.train/weights/0000.pt",
sad_scores = "/content/fine/sad/train/IK.SpeakerDiarization.kirilov.train/weights/0000.pt",
scd_scores = "/content/fine/scd/train/IK.SpeakerDiarization.kirilov.train/weights/0000.pt",
method= "affinity_propagation")
or
- the weights from the original models
pipeline = SpeakerDiarization(embedding = "/content/drive/My Drive/pyannote/emb_voxceleb/train/X.SpeakerDiarization.VoxCeleb.train/weights/0326.pt",
sad_scores = "/content/drive/My Drive/pyannote/sad_dihard/sad_dihard/train/X.SpeakerDiarization.DIHARD_Official.train/weights/0231.pt",
scd_scores = "/content/drive/My Drive/pyannote/scd_dihard/train/X.SpeakerDiarization.DIHARD_Official.train/weights/0421.pt",
method= "affinity_propagation")
everything is OK and the result is similar to:
pipeline = torch.hub.load('pyannote/pyannote-audio', 'dia_dihard')
FILE = {'audio': "/content/groundtruth/new.wav"}
diarization = pipeline(FILE)
diarization
Could you please advise what could be wrong with my training/fine-tuning process?