Introduction
This repository is an unofficial implementation of the SpeakerGAN paper by Mingming Huang ([email protected]) and Tiezheng Wang ([email protected]), with thanks to TongFeng for advice.
SpeakerGAN paper
SpeakerGAN: Speaker identification with conditional generative adversarial network, by Liyang Chen, Yifeng Liu, Wendong Xiao, Yingxue Wang, and Haiyong Xie.
Usage
For train / test / generate:
python speakergan.py
You may need to change the path to the VAD-preprocessed wav files before running.
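As a purely hypothetical illustration (the actual variable or argument name inside speakergan.py may differ):

```python
# Hypothetical: point the script at your own VAD-preprocessed data before running.
# The real constant/argument name inside speakergan.py may be different.
WAV_DIR = "/path/to/librispeech/train-clean-100_vad"  # adjust to your local path
```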
Our results
acc: 94.27% with a randomly sampled test set.
acc: 93.21% with a fixed-start sampled test set.
Model file used: model/49_D.pkl
acc: 98.44% classification accuracy on real training samples.
Our test-set accuracy is about 4% lower than the paper's result. We have not been able to find the reason and would appreciate your help!
Details of paper
The following are details from the paper and notes on our implementation.
================ input ==================

- feature: fbank, 8000 Hz, 25 ms frame, 10 ms overlap, shape (160, 64) (see the feature-extraction sketch after this list)
- dataset: LibriSpeech train-clean-100, POI (speakers): 251
- data preprocess: VAD, mean and variance normalization, shuffled
- split: 60% train, 40% test
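Below is a minimal feature-extraction sketch of this input pipeline, not the repository code: it assumes torchaudio's Kaldi-compatible fbank and webrtcvad, treats the 10 ms figure as the frame shift, and the function name extract_fbank is illustrative.

```python
import torch
import torchaudio
import webrtcvad

# VAD as in this repo: webrtcvad with aggressiveness mode 3.
# (is_speech() expects 10/20/30 ms frames of 16-bit mono PCM.)
vad = webrtcvad.Vad(3)

def extract_fbank(wav_path, num_frames=160, num_mel_bins=64):
    """8 kHz audio -> 64-dim fbank, 25 ms window, 10 ms shift, CMVN, (160, 64) chunk."""
    waveform, sr = torchaudio.load(wav_path)              # (channels, samples)
    if sr != 8000:
        waveform = torchaudio.functional.resample(waveform, sr, 8000)
    feat = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=8000.0,
        frame_length=25.0,      # 25 ms analysis window
        frame_shift=10.0,       # assuming the "10 ms" above is the frame shift
        num_mel_bins=num_mel_bins,
    )                                                      # (T, 64)
    # per-utterance mean and variance normalization
    feat = (feat - feat.mean(dim=0)) / (feat.std(dim=0) + 1e-8)
    # fixed-length chunk of 160 frames; shorter utterances are handled separately
    return feat[:num_frames]
```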
================ model architecture ==================

- dataflow: data -> feature extraction -> G & D
- model architecture (see the block sketch after this list):
  G: gated CNN, encoder-decoder; Huber loss + adversarial loss
  D: ResNet blocks, temporal average pooling, FC, softmax; cross-entropy loss + adversarial loss
- G: shuffler layer, GLU
- D: ReLU
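Below is a hedged PyTorch sketch of these building blocks; channel counts, kernel sizes, and the classifier head are placeholders rather than the exact configuration from the paper or this repo, and the adversarial (real/fake) output of D is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv2d(nn.Module):
    """Gated CNN block as used in G: the GLU halves the channels, so the
    convolution emits 2x the desired output channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel_size, stride, padding)

    def forward(self, x):
        return F.glu(self.conv(x), dim=1)      # gated linear unit over channels

class ResBlock(nn.Module):
    """Residual block for D with ReLU activations."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        h = F.relu(self.conv1(x))
        return F.relu(x + self.conv2(h))

class SpeakerHead(nn.Module):
    """Temporal average pooling + FC; softmax / cross-entropy over 251 speakers."""
    def __init__(self, ch, num_speakers=251):
        super().__init__()
        self.fc = nn.Linear(ch, num_speakers)

    def forward(self, feat_map):               # (B, C, T, F)
        pooled = feat_map.mean(dim=(2, 3))     # average over time (and frequency)
        return self.fc(pooled)                 # logits for cross-entropy
```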
================ training ==================

- lr: epochs 0-9: 0.0005 | epochs 9-49: 0.0002 (see the sketch after this list)
- L(D): λ1 = λ2 = 1
- batch_size: 64
- D train steps / G train steps = 4
- L_adv loss: label smoothing, 1 -> 0.7 ~ 1.0, 0 -> 0 ~ 0.3
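Below is a small runnable sketch of the schedule and label-smoothing settings above; the helper names learning_rate and adversarial_targets are illustrative and not taken from speakergan.py.

```python
import torch

def learning_rate(epoch: int) -> float:
    """lr schedule from above: 0.0005 for epochs 0-9, 0.0002 for epochs 9-49."""
    return 5e-4 if epoch < 9 else 2e-4

def adversarial_targets(batch_size: int = 64):
    """Label smoothing for L_adv: real 1 -> U(0.7, 1.0), fake 0 -> U(0.0, 0.3)."""
    real = torch.empty(batch_size, 1).uniform_(0.7, 1.0)
    fake = torch.empty(batch_size, 1).uniform_(0.0, 0.3)
    return real, fake

# Per epoch, the optimizer learning rates would be refreshed, e.g.:
#   for group in optimizer.param_groups:
#       group["lr"] = learning_rate(epoch)
# Within each step (batch_size = 64), D is updated 4 times for every G update,
# and the two loss terms are weighted equally (λ1 = λ2 = 1).
```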
======== not sure or differences with paper ========

- weight/bias initialization: xavier_uniform for weights, zeros for biases (see the sketch after this list)
- PyTorch Huber loss: adding 0.5 would match the paper's definition, but this is not implemented here
- shorter wavs: the paper pads with zeros; we pad by repeating the feature
- gated CNN architecture
- we use webrtcvad mode(3) for VAD preprocessing
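Below is a small illustrative sketch of three of these points (initialization, repeat padding, and the "+ 0.5" Huber adjustment); the function names are ours, and the Huber variant is shown only for reference since it is not implemented in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_weights(m: nn.Module) -> None:
    """Initialization as noted above: xavier_uniform_ for weights, zeros for biases."""
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def pad_by_repeating(feat: torch.Tensor, num_frames: int = 160) -> torch.Tensor:
    """Our padding for short utterances: tile the (T, 64) feature along time
    until it reaches num_frames frames (the paper zero-pads instead)."""
    reps = (num_frames + feat.size(0) - 1) // feat.size(0)
    return feat.repeat(reps, 1)[:num_frames]

def paper_style_huber(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """One reading of the note above: add 0.5 to PyTorch's smooth_l1_loss so the
    value matches the paper's Huber definition (not applied in this repo)."""
    return F.smooth_l1_loss(pred, target) + 0.5
```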