This repo contains simple-to-use, pretrained or training-free models for speaker diarization.

Last update: Jan 20, 2022


PyDiar


Supported Models

Binary Key Speaker Modeling

Based on pyBK by Jose Patino, which implements the diarization system from "The EURECOM submission to the first DIHARD Challenge" by Jose Patino, Héctor Delgado, and Nicholas Evans.

If you have any other models you would like to see added, please open an issue.

Usage

This library seeks to provide a very basic interface. To use the Binary Key model on a file, do something like this:

import numpy as np
from pydiar.models import BinaryKeyDiarizationModel, Segment
from pydiar.util.misc import optimize_segments
from pydub import AudioSegment

INPUT_FILE = "test.wav"

# Load the audio, then resample to 32 kHz and downmix to mono
sample_rate = 32000
audio = AudioSegment.from_wav(INPUT_FILE)
audio = audio.set_frame_rate(sample_rate)
audio = audio.set_channels(1)

# Run the Binary Key model on the raw samples
diarization_model = BinaryKeyDiarizationModel()
segments = diarization_model.diarize(
    sample_rate, np.array(audio.get_array_of_samples())
)
# Post-process the raw segments
optimized_segments = optimize_segments(segments)

Now optimized_segments contains a list of segments, each with its start, length, and speaker ID.
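
As a rough illustration of working with such a list, the sketch below totals the speaking time per speaker. It assumes each segment exposes start, length and speaker_id attributes measured in seconds. The Segment dataclass here is a hypothetical stand-in defined only for this example, not PyDiar's own class, so check the attribute names against the library before relying on it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a diarization segment (assumed fields)."""
    start: float       # assumed: start time in seconds
    length: float      # assumed: duration in seconds
    speaker_id: int    # assumed: label of the detected speaker


def speaking_time_per_speaker(segments):
    """Sum the segment lengths for each speaker id."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment.speaker_id] += segment.length
    return dict(totals)


# Made-up segments, for illustration only
example_segments = [
    Segment(start=0.0, length=2.5, speaker_id=0),
    Segment(start=2.5, length=1.0, speaker_id=1),
    Segment(start=3.5, length=4.0, speaker_id=0),
]
totals = speaking_time_per_speaker(example_segments)
for speaker, seconds in sorted(totals.items()):
    print(f"speaker {speaker}: {seconds:.1f} s")
```

If the real segments carry the same three fields, the list held in optimized_segments could be passed to speaking_time_per_speaker in place of example_segments.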

Example

A simple script that reads an audio file, diarizes it, and transcribes it into the WebVTT format can be found in examples/generate_webvtt.py. To use it, download a Vosk model from https://alphacephei.com/vosk/models and then run the script using:

poetry install
poetry run python -m examples.generate_webvtt -i PATH/TO/INPUT.wav -m PATH/TO/VOSK_MODEL
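
To give an idea of what the WebVTT output looks like, here is a minimal sketch that turns diarized, transcribed cues into a WebVTT document. It is not the code from examples/generate_webvtt.py. The helper names format_timestamp and to_webvtt, the (start, length, speaker_id, text) tuple layout, and the "Speaker N" voice labels are all assumptions made for this illustration.

```python
def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(cues):
    """Build a WebVTT document from (start, length, speaker_id, text) tuples."""
    lines = ["WEBVTT", ""]
    for start, length, speaker_id, text in cues:
        end = start + length
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        # A WebVTT voice tag attributes the cue text to a speaker
        lines.append(f"<v Speaker {speaker_id}>{text}")
        lines.append("")
    return "\n".join(lines)


# Made-up cues, for illustration only
vtt = to_webvtt([(0.0, 2.5, 0, "Hello there."), (2.5, 1.25, 1, "Hi!")])
print(vtt)
```

Each cue is a timing line followed by the text, with a blank line separating cues, which is the basic layout a WebVTT player expects.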


Comments

🐛 Fix a bug in getBestClustering

This commit fixes a bug in getBestClustering that occurred when no clustering had exactly maxNrSpeakers - 1 speakers. It also fixes what appears to be a logic bug: getBestClustering would never return a clustering with the maximum number of speakers, only one with maxNrSpeakers - 1.

This can happen, for example, if the algorithm generates clusterings of sizes 11, 9, 8, 7, 6, 5, 4, 3, 2, 1. If maxNrSpeakers is >= 11, np.maximum( np.minimum(maxNrSpeakers, np.max(nrSpeakersPerSolution)) - 1, 1 ) returns 10. But no clustering fulfills nrSpeakersPerSolution == 10, so np.where returns an empty result, which leads to a crash in np.maximum.
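
The failure mode described above can be reproduced with a few lines of NumPy. The sketch below only mirrors the quoted expression. The snake_case variable names and the pick_solution_index fallback are hypothetical illustrations, not the code from this commit or from pyBK.

```python
import numpy as np

# Speaker counts of the candidate clusterings from the example (note: no 10)
nr_speakers_per_solution = np.array([11, 9, 8, 7, 6, 5, 4, 3, 2, 1])
max_nr_speakers = 11

# Target speaker count, computed with the expression quoted above
target = np.maximum(
    np.minimum(max_nr_speakers, np.max(nr_speakers_per_solution)) - 1, 1
)
# Indices of clusterings with exactly `target` speakers: empty here
matches = np.where(nr_speakers_per_solution == target)[0]
print(target, matches.size)


def pick_solution_index(nr_speakers_per_solution, max_nr_speakers):
    """Hypothetical fallback: index of the largest count within the limit."""
    allowed = np.where(nr_speakers_per_solution <= max_nr_speakers)[0]
    if allowed.size == 0:
        raise ValueError("no clustering within the speaker limit")
    return int(allowed[np.argmax(nr_speakers_per_solution[allowed])])
```

Selecting the largest count that does not exceed the limit, rather than requiring an exact match, is one way to avoid the empty selection. Whether it matches the behaviour intended by the original algorithm is an open question.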

opened by pajowu

Process killed when input is a very long file.

Running poetry run python -m examples.generate_webvtt -i ~/<path>/segment.wav -m vosk-model-en-us-0.22-lgraph works just fine, but when I run it on the complete file (1.5 hours), the process is reported as killed.

The segment.wav file is 1.5 minutes long.

I tried removing the transcription part (I only need the speaker diarization), and the same problem occurs.

Is the maximum size or length of the file defined somewhere?

opened by danpad01
