Speech Separation
A simple project that separates a mixture of two clean voices into two separate voices.
Result example (click to hear the voices): mix || prediction voice1 || prediction voice2
Mix spectrogram
Predicted voice 1 spectrogram
Predicted voice 2 spectrogram
1. Quick train
Step 1:
Download LibriMixSmall, extract it, and move it to the root of the project.
Step 2:
./train.sh
Training takes ONLY about 2-3 hours on a normal GPU. After each epoch, predictions are written to the ./viz_outout folder.
2. Quick inference
./inference.sh
The results are written to the ./viz_outout folder.
3. More details
- Input: the complex spectrogram, computed from the raw mixed audio signal (see the pipeline sketch after this list).
- Output: the complex ratio mask (cRM) ---> complex spectrogram ---> separated voices.
- Model: a simplified version of this implementation, as defined in the paper Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation.
- Loss function: Permutation Invariant Training (PIT) loss with pairwise negative SI-SDR (the more state-of-the-art option); a loss sketch follows the pipeline sketch below.
- Dataset: a small version of the LibriMix dataset, taken from LibriMixSmall.
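To make the input/output flow above concrete, here is a minimal PyTorch sketch of the pipeline. The STFT settings (n_fft=512, hop=128) and function names are placeholders, not necessarily what this repo uses: the mixture is turned into a complex spectrogram, the model predicts one cRM per speaker, and masking followed by an inverse STFT yields the separated waveforms.

```python
import torch

def mix_to_complex_spec(mix_wav, n_fft=512, hop=128):
    # mix_wav: (batch, time) raw mixture -> complex spectrogram (batch, freq, frames)
    window = torch.hann_window(n_fft, device=mix_wav.device)
    return torch.stft(mix_wav, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)

def apply_crm(mix_spec, crm, n_fft=512, hop=128, length=None):
    # mix_spec: (batch, freq, frames), complex mixture spectrogram.
    # crm:      (batch, 2, freq, frames, 2) predicted mask per speaker,
    #           last dim holding the real and imaginary parts.
    mask = torch.complex(crm[..., 0], crm[..., 1])     # (batch, 2, freq, frames)
    est_specs = mask * mix_spec.unsqueeze(1)           # complex multiply per speaker
    b, s, f, t = est_specs.shape
    window = torch.hann_window(n_fft, device=crm.device)
    wavs = torch.istft(est_specs.reshape(b * s, f, t), n_fft=n_fft,
                       hop_length=hop, window=window, length=length)
    return wavs.reshape(b, s, -1)                      # (batch, 2, time) separated voices
```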
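For the loss, the description corresponds to permutation invariant training over a pairwise negative SI-SDR. Below is a small self-contained sketch for the two-speaker case; it is illustrative rather than the repo's actual loss code (a library implementation such as Asteroid's PITLossWrapper with pairwise_neg_sisdr can be used instead).

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    # est, ref: (batch, time); scale-invariant SDR in dB.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, dim=-1, keepdim=True)
            / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    return 10 * torch.log10(torch.sum(proj ** 2, dim=-1)
                            / (torch.sum(noise ** 2, dim=-1) + eps) + eps)

def pit_neg_sisdr(est, ref):
    # est, ref: (batch, 2, time). Score both speaker orderings, keep the better
    # one, and return the negative SI-SDR so that lower is better.
    same = si_sdr(est[:, 0], ref[:, 0]) + si_sdr(est[:, 1], ref[:, 1])
    swap = si_sdr(est[:, 0], ref[:, 1]) + si_sdr(est[:, 1], ref[:, 0])
    best = torch.maximum(same, swap) / 2   # mean SI-SDR under the best permutation
    return (-best).mean()
```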
4. Current problem
Because the dataset is small (chosen for fast training), the model overfits the training set somewhat. Using a bigger dataset should help to overcome that. Some suggestions:
- Use the original LibriMix dataset, which is much bigger (around 60 times larger than what I trained on).
- Use this work to download a much larger in-the-wild dataset and use datasets/VoiceMixtureDataset.py instead of the LibriMix one I am using (a rough sketch of such a dataset appears below). P.S. I have trained with it and it works too.
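The real dataset code lives in datasets/VoiceMixtureDataset.py; the sketch below is only a hypothetical illustration of the kind of interface an on-the-fly mixture dataset exposes (a list of clean wav paths in, a mixture plus a (2, time) stack of sources out). It assumes torchaudio is available and ignores details such as resampling and loudness normalization.

```python
import random
import torch
import torchaudio
from torch.utils.data import Dataset

class OnTheFlyMixtureDataset(Dataset):
    """Hypothetical sketch: mix pairs of clean utterances from any corpus."""

    def __init__(self, wav_paths, segment_samples=32000):
        self.wav_paths = wav_paths
        self.segment = segment_samples

    def _load_segment(self, path):
        wav, _ = torchaudio.load(path)          # (channels, time); resampling omitted
        wav = wav.mean(dim=0)                   # downmix to mono
        if wav.numel() < self.segment:          # pad clips that are too short
            wav = torch.nn.functional.pad(wav, (0, self.segment - wav.numel()))
        start = random.randint(0, wav.numel() - self.segment)
        return wav[start:start + self.segment]

    def __len__(self):
        return len(self.wav_paths)

    def __getitem__(self, idx):
        s1 = self._load_segment(self.wav_paths[idx])
        s2 = self._load_segment(random.choice(self.wav_paths))
        mix = s1 + s2                           # simple additive two-speaker mixture
        return mix, torch.stack([s1, s2])       # (time,), (2, time)
```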