Demo for the paper "Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation"

Overview

Streaming speaker diarization

Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation
by Juan Manuel Coria, Hervé Bredin, Sahar Ghannay and Sophie Rosset.

We propose to address online speaker diarization as a combination of incremental clustering and local diarization applied to a rolling buffer updated every 500ms. Every single step of the proposed pipeline is designed to take full advantage of the strong ability of a recently proposed end-to-end overlap-aware segmentation to detect and separate overlapping speakers. In particular, we propose a modified version of the statistics pooling layer (initially introduced in the x-vector architecture) to give less weight to frames where the segmentation model predicts simultaneous speakers. Furthermore, we derive cannot-link constraints from the initial segmentation step to prevent two local speakers from being wrongfully merged during the incremental clustering step. Finally, we show how the latency of the proposed approach can be adjusted between 500ms and 5s to match the requirements of a particular use case, and we provide a systematic analysis of the influence of latency on the overall performance (on AMI, DIHARD and VoxConverse).
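
To illustrate the idea behind this modified pooling layer, here is a minimal sketch of weighted statistics pooling, where per-frame weights are lowered on frames with overlapping speakers. This is only an illustration of the mechanism and not the exact weighting scheme of Equation 2 in the paper; the frame weights shown here are hypothetical.

import torch

def weighted_stats_pooling(frames: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # frames:  (num_frames, feat_dim) frame-level features
    # weights: (num_frames,) non-negative weights, lower where overlap is predicted
    w = weights / (weights.sum() + 1e-8)                       # normalize weights
    mean = (w.unsqueeze(1) * frames).sum(dim=0)                # weighted mean
    var = (w.unsqueeze(1) * (frames - mean) ** 2).sum(dim=0)   # weighted variance
    return torch.cat([mean, var.clamp(min=1e-8).sqrt()])       # concatenate mean and std

# Hypothetical usage: down-weight frames where more than one speaker is active
frames = torch.randn(100, 64)
speaker_probs = torch.rand(100, 4)                             # fake per-frame speaker scores
num_active = (speaker_probs > 0.5).sum(dim=1)
weights = torch.where(num_active > 1, torch.tensor(0.1), torch.tensor(1.0))
pooled = weighted_stats_pooling(frames, weights)               # shape: (128,)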

Citation

Paper currently under review.

Installation

  1. Create environment:

conda create -n diarization python==3.8
conda activate diarization

  2. Install the latest PyTorch version following the official instructions.

  3. Install dependencies:

pip install -r requirements.txt

Usage

CLI

Stream a previously recorded conversation:

python main.py /path/to/audio.wav

Or use a real audio stream from your microphone:

python main.py microphone

This will launch a real-time visualization of the diarization outputs as they are produced by the system:

Example of a state of the real-time output plot

By default, the script uses step = latency = 500ms, and it sets reasonable values for all hyper-parameters. See python main.py -h for more information.

API

We provide various building blocks that can be combined to process an audio stream. Our streaming implementation is based on RxPY, but the functional module is completely independent.

The following example shows how to obtain overlap-aware speaker embeddings (Equation 2 in the paper) from a microphone stream:

import rx
import rx.operators as ops

from sources import MicrophoneAudioSource
from functional import FrameWiseModel, ChunkWiseModel, OverlappedSpeechPenalty, EmbeddingNormalization

mic = MicrophoneAudioSource(sample_rate=16000)

# Initialize independent modules
segmentation = FrameWiseModel("pyannote/segmentation")
embedding = ChunkWiseModel("pyannote/embedding")
osp = OverlappedSpeechPenalty(gamma=3, beta=10)
normalization = EmbeddingNormalization(norm=1)

# Branch the microphone stream to calculate segmentation
segmentation_stream = mic.stream.pipe(ops.map(segmentation))
# Join audio and segmentation stream to calculate speaker embeddings
embedding_stream = rx.zip(mic.stream, segmentation_stream).pipe(
    ops.starmap(lambda wave, seg: (wave, osp(seg))),
    ops.starmap(embedding),
    ops.map(normalization)
)

embedding_stream.subscribe(on_next=lambda emb: print(emb.shape))

mic.read()

Output (one matrix of local speaker embeddings per chunk):

(4, 512)
(4, 512)
(4, 512)
...

Reproducible research

Table 1

In order to reproduce the results of the paper, use the following hyper-parameters:

Dataset      latency   tau     rho     delta
DIHARD III   any       0.555   0.422   1.517
AMI          any       0.507   0.006   1.057
VoxConverse  any       0.576   0.915   0.648
DIHARD II    1s        0.619   0.326   0.997
DIHARD II    5s        0.555   0.422   1.517

For instance, for a DIHARD III configuration with a 5s latency, one would use:

python main.py /path/to/file.wav --latency=5 --tau=0.555 --rho=0.422 --delta=1.517 --output /output/dir

And then to obtain the diarization error rate:

from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.database.util import load_rttm

metric = DiarizationErrorRate()
hypothesis = load_rttm("/output/dir/output.rttm")
hypothesis = list(hypothesis.values())[0]  # Extract hypothesis from dictionary
reference = load_rttm("/path/to/reference.rttm")
reference = list(reference.values())[0]  # Extract reference from dictionary

der = metric(reference, hypothesis)
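
Note that DiarizationErrorRate accumulates its components across calls, so a corpus-level DER over several files (rather than an average of per-file values) can be obtained with abs(metric). Below is a minimal sketch assuming hypothetical paths: a single reference RTTM containing every recording, and one hypothesis RTTM per recording in /output/dir.

import glob

from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

metric = DiarizationErrorRate()

# load_rttm returns a dictionary mapping each recording URI to its annotation
references = load_rttm("/path/to/all_references.rttm")

for hyp_path in glob.glob("/output/dir/*.rttm"):
    for uri, hypothesis in load_rttm(hyp_path).items():
        metric(references[uri], hypothesis)

# DER accumulated over all evaluated recordings
print(f"Corpus-level DER = {100 * abs(metric):.2f}%")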

For convenience and to facilitate future comparisons, we also provide the expected outputs in RTTM format corresponding to every entry of Table 1 and Figure 5 in the paper. This includes the VBx offline baseline as well as our proposed online approach with latencies 500ms, 1s, 2s, 3s, 4s, and 5s.

Figure 5

License

MIT License

Copyright (c) 2021 Université Paris-Saclay
Copyright (c) 2021 CNRS

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Comments
  • Is it possible to pass raw byte data into the pipeline?

    I'm currently working on an implementation where I occasionally receive raw byte data from a TCP socket. I want to pass this data into the pipeline, but the current AudioSource classes seem to be limited to microphone input and audio files. Does the current version support what I'm trying to implement, or do I have to write it myself?

    duplicate feature question 
    opened by chanleii 16
  • Question about changing model in system and real-time issue

    Hi @juanmc2005! It's me again, sorry for bothering you... I have several questions:

    1. I have already fine-tuned a segmentation model on another dataset (not AMI or VoxConverse) and stored it as a ".ckpt" checkpoint. I am sure it can be loaded with ckpt_path = "path/where/store/ckpt/epoch=9_best.ckpt" and Model.from_pretrained(ckpt_path, strict=False). I want to replace the segmentation model in your system with this one for some testing, but when I change segmentation: PipelineModel = Model.from_pretrained('path/where/store/ckpt/epoch=9_best.ckpt') and run "python -m diart.stream", the system is still using "pyannote/segmentation". Could you give me some hints?
    2. Is there any useful augmentation method for training or fine-tuning? I know I can add background noise or something similar. Where should I add it?
    3. This system can be used in real time (streaming), right? If yes, how can I confirm that it is "real-time", e.g. with a real-time factor or something else?

    Thanks for your awesome project, it helps me a lot!! Looking forward to your reply~~ Best Regards

    question 
    opened by Shoawen0213 11
  • Example on README doesn't work

    Hi,

    The following example on top level README doesn't work:

    from sources import MicrophoneAudioSource
    from functional import FrameWiseModel, ChunkWiseModel, OverlappedSpeechPenalty, EmbeddingNormalization
    
    mic = MicrophoneAudioSource(sample_rate=16000)
    
    # Initialize independent modules
    segmentation = FrameWiseModel("pyannote/segmentation")
    embedding = ChunkWiseModel("pyannote/embedding")
    osp = OverlappedSpeechPenalty(gamma=3, beta=10)
    normalization = EmbeddingNormalization(norm=1)
    
    # Branch the microphone stream to calculate segmentation
    segmentation_stream = mic.stream.pipe(ops.map(segmentation))
    # Join audio and segmentation stream to calculate speaker embeddings
    embedding_stream = rx.zip(mic.stream, segmentation_stream).pipe(
        ops.starmap(lambda wave, seg: (wave, osp(seg))),
        ops.starmap(embedding),
        ops.map(normalization)
    )
    
    embedding_stream.suscribe(on_next=lambda emb: print(emb.shape))
    
    mic.read()
    

    First of all, there is a typo in declaring the variable ops (it's been declared as osp).

    Secondly, in line segmentation_stream = mic.stream.pipe(ops.map(segmentation)), there is no method named map in variable ops.

    Thanks!

    bug documentation 
    opened by abhinavkulkarni 11
  • PortAudio Windows Error and Compatibility on Python3.7

    Hi, @juanmc2005

    Is there a chance that diart can run on Windows, since the PortAudio library is only supported on Linux?

    Also, is it possible to import Literal from typing_extensions (Python 3.7) instead of typing (Python 3.8)?

    I get this error while running this demo script :

    import rx
    import rx.operators as ops
    import diart.operators as myops
    from diart.sources import MicrophoneAudioSource
    import diart.functional as fn
    
    sample_rate = 16000
    mic = MicrophoneAudioSource(sample_rate)
    

    error : ImportError: cannot import name 'Literal' from 'typing' (/usr/lib/python3.7/typing.py)

    feature 
    opened by Yagna24 9
  • My output is different from expected_outputs

    I tried to use the following command to test the model: python src/main.py path/to/voxconverse_test_wav/aepyx.wav --latency=0.5 --tau=0.576 --rho=0.915 --delta=0.648 --output results/vox_test_0.5s. But I compared my output with 'expected_outputs/online/0.5s/VoxConverse.rttm' and found that my output ended early. For example, the last line of my output is SPEAKER aepyx 1 149.953 1.989 <NA> <NA> speaker0 <NA> <NA>, but the last line for aepyx.wav in 'expected_outputs/online/0.5s/VoxConverse.rttm' is SPEAKER aepyx 1 168.063 0.507 <NA> <NA> F <NA> <NA>. They end at different times. I don't know if the command I entered is wrong, or if it is due to other reasons.

    bug 
    opened by sablea 9
  • Question about reproduce the result

    Hi! It's me again, sorry for bothering you. I have several questions. Q1. I tried to reproduce the results of the paper using the hyper-parameters listed above. I tested them on the AMI and VoxConverse test datasets, but the results seem different. The AMI test set has 24 wav files. I wrote a script to do the testing; for each wav file, I use the command below: python -m diart.demo $wavpath --tau=0.507 --rho=0.006 --delta=1.057 --output ./output/AMI_dia_fintetune/. After obtaining each RTTM file, I calculate the DER for each wav file with "der = metric(reference, hypothesis)"; the reference RTTM is from "AMI/MixHeadset.test.rttm". Then I compute "sum of DER" / total files (24 in this case) and get 0.3408511916216123 (i.e. 34.085% DER). Did I do something wrong? I can provide the RTTM or the DER for each wav file. The VoxConverse dataset is still processing. I'm afraid I misunderstood something, so I'm asking about the problem first.

    By the way, I used pyannote v1.1 to do the same thing, and I got 0.48516973769490174 as the final DER: import torch; pipeline = torch.hub.load("pyannote/pyannote-audio", "dia"); diarization = pipeline({"audio": "xxx.wav"}). So I'm afraid that I did something wrong.

    Q2. At the same time, I have another question. The paper shows that you tried several methods with different latencies. Does python -m diart.demo use the 5.0s-latency configuration, which has the best result in the paper? If the answer is yes, how can I switch to a different model for other latencies at inference time? And how do I train this part?

    Again, thanks for your awesome project!! Sorry for all those stupid questions... Looking forward to your reply...

    question 
    opened by Shoawen0213 8
  • Question about Optuna on own pipeline with segmentation model

    Hi @juanmc2005, sorry for bothering you! As the title says, I want to use Optuna to tune parameters, and the segmentation model used in the pipeline is my own, trained following training_a_model.ipynb. I have already saved it in ".ckpt" format. Following the "Custom models" part, I tried to define my own segmentation model like this:

    class MySegmentationModel(SegmentationModel):
       def __init__(self):
            super().__init__()
            self.my_pretrained_model = torch.load("./epoch=0-step=69.ckpt")  **<- put my own ckpt here**
        def __call__(
            self,
            waveform: torch.Tensor,
            weights: Optional[torch.Tensor] = None
        ) -> torch.Tensor:
            return self.my_pretrained_model(waveform, weights)
    

    Then I redefine the config like config = PipelineConfig(segmentation=MySegmentationModel()) and optimizer = Optimizer(benchmark, config, hparams, p), and it raises the error shown in the attached screenshot.

    Then I traced back, and it seems that it cannot detect the duration, sample rate, or something similar. Could you tell me how to fix these problems, or is it not possible to use my own segmentation model with Optuna?

    I want to do this because I guess the parameters will be affected by a different model, so I want to give it a try.

    Thanks for your awesome work and help!!! expected your response!

    documentation question 
    opened by Shoawen0213 6
  • Crash on `MicrophoneAudioSource.read()`

    I have an issue with audio access when using the example code from the README, probably due to my audio setup:

    >>> mic.read()
    torch.Size([4, 512])
    python: src/hostapi/alsa/pa_linux_alsa.c:3641: PaAlsaStreamComponent_BeginPolling: Assertion `ret == self->nfds' failed.
    Aborted (core dumped)
    

    This is strange because I'm running this in a Docker container and I also use the sounddevice library without problems in another project in the same container.

    Originally posted by @KannebTo in https://github.com/juanmc2005/StreamingSpeakerDiarization/issues/8#issuecomment-949453753

    bug 
    opened by KannebTo 6
  • Creating a websockets based streaming API endpoint for diarization

    Hi Team,

    Thanks for the great work with this project. I tried the model with a few audio samples and it seems to work great!

    I was wondering if there is any interest in developing a streaming websocket-based API endpoint to serve the model output. I went through the code and while I am not familiar with the reactive programming paradigm (RxPY), it looked like with some effort, it should be possible to develop a streaming solution.

    Let me know what you all think.

    Thanks and keep up the great work!

    question 
    opened by abhinavkulkarni 5
  • Question about Table 1

    Hello! Thanks for sharing, this is a masterpiece of work! I am just confused about what "oracle segmentation" means in Table 1. I can't figure it out; I tried to look it up but failed to find the answer. Thanks!

    question 
    opened by Shoawen0213 4
  • How to fetch audio segments in real time from diarization pipeline

    Hi, First of all thank you for this repo.

    I was wondering if it's possible to use the diarization pipeline to fetch the audio segments of individual speakers. It would help immensely if there's a way to do this.

    question 
    opened by Yagna24 4
  • Client for the websocket server

    Hello,

    I'm trying to use the websocket server for diarization. I implemented a client, since one is not provided. I get an output for the diarization, but it is always speaker0.

    Here is the js code for my client:

    // Set up the start and stop buttons.
    document.getElementById('start-button').onclick = start;
    document.getElementById('stop-button').onclick = stop;
    
    // Global variables to hold the audio stream and the WebSocket connection.
    var stream;
    var socket;
    
    function b64encode(chunk) {
        // Convert the chunk array to a Float32Array
        const bytes = new Float32Array(chunk).buffer;
    
        // Encode the bytes as a base64 string
        let encoded = btoa(String.fromCharCode.apply(null, new Uint8Array(bytes)));
    
        // Return the encoded string as a UTF-8 encoded string
        return decodeURIComponent(encoded);
    }
    
    
    function start() {
        // Disable the start button.
        document.getElementById('start-button').disabled = true;
        // Enable the stop button.
        document.getElementById('stop-button').disabled = false;
    
        // Get access to the microphone.
        navigator.mediaDevices.getUserMedia({ audio: true }).then(function (s) {
            stream = s;
    
            // Create a new WebSocket connection.
            socket = new WebSocket('ws://localhost:7007');
    
            // When the WebSocket connection is open, start sending the audio data.
            socket.onopen = function () {
                var audioContext = new AudioContext();
                var source = audioContext.createMediaStreamSource(stream);
                var processor = audioContext.createScriptProcessor(1024, 1, 1);
                source.connect(processor);
                processor.connect(audioContext.destination);
    
                processor.onaudioprocess = function (event) {
                    var data = event.inputBuffer.getChannelData(0);
                    // var dataString = String.fromCharCode.apply(null, new Uint16Array(data));
                    console.log("sending")
                    socket.send(b64encode(data));
                };
            };
    
            socket.onmessage = function (e) {
                console.log(e);
            }
        }).catch(function (error) {
            console.error('Error getting microphone input:', error);
        });
    }
    
    function stop() {
        // Disable the stop button.
        document.getElementById('stop-button').disabled = true;
        // Enable the start button.
        document.getElementById('start-button').disabled = false;
    
        // Close the WebSocket connection.
        socket.close();
        // Stop the audio stream.
        stream.getTracks().forEach(function (track) { track.stop(); });
    }
    

    Do you have the code for the client you used? I would like to make sure that I got that right before looking anywhere else.

    Best,

    question 
    opened by funboarder13920 0
  • CLI Refactoring: Use jsonargparse

    Problem

    The implementation of the CLI is a bit messy and mixed with the python API.

    Idea

    Use jsonargparse to group diart.stream, diart.tune and diart.benchmark into a single diart.py with automatic CLI documentation (i.e. removing argdoc.py). The new API would remove the dot (diart stream instead of diart.stream), --help would be generated from the docstring, and a class Diart could contain the implementation of all three scripts.

    refactoring 
    opened by juanmc2005 0
  • Changes to files not taking effect

    I'm relatively inexperienced when it comes to Python

    When I make changes to the source code, none of the changes take effect when I run the script.

    I've been trying to find solutions for days and I have had no luck.

    Is it something specific to this project that is preventing changes from being made?

    For instance, I simply tried to change the output of the print statement for the reports that are being printed in the console after the plot window is closed and even the simplest change is not taking effect.

    Any help would be GREATLY appreciated, Thank You!

    unrelated 
    opened by jason-daiuto 1
  • Update gif code snippet with newest API

    The gif in the README is not up to date with the newest changes in the API. It may be better to wait a bit for the API to stabilize before doing this.

    documentation 
    opened by juanmc2005 0
  • Add inference lifecycle hooks

    Problem

    When writing custom models or pipelines, one may want to react to specific inference events, for example before/after benchmarking on a file.

    Idea

    Add classes RealTimeInferenceHook and BenchmarkHook to define listener interfaces. This can also be used to implement other behavior like progress bars, writing results, etc.

    Example

    class MyHook(BenchmarkHook):
        def on_finished(self, results):
            print("Finished!")
    
    benchmark = Benchmark(..., hooks=[MyHook()])
    
    feature 
    opened by juanmc2005 0
Releases(v0.6)
  • v0.6(Oct 31, 2022)

    What's Changed

    • Compatibility with torchaudio streams by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/91
    • Online speaker diarization as a block by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/92
    • Fix bug: RTTM output not being patched when closing plot window by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/100
    • Add cropping_mode to DelayedAggregation by @bhigy in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/105
    • Compatibility with pyannote.audio 2.1.1 requirements by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/108

    New Contributors

    • Thank you @bhigy for the bug hunting in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/105!

    Full Changelog: https://github.com/juanmc2005/StreamingSpeakerDiarization/compare/v0.5.1...v0.6

  • v0.5.1(Aug 31, 2022)

    What's Changed

    • Fix wrong config reference and unpatched annotation by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/89

    Full Changelog: https://github.com/juanmc2005/StreamingSpeakerDiarization/compare/v0.5...v0.5.1

  • v0.5(Aug 31, 2022)

    What's Changed

    • Add study_or_path as a Path for conversion from string by @AMITKESARI2000 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/74
    • Update WebSocketAudioSource by @ckliao-nccu in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/78
    • Fix bug with empty RTTMs by @zaouk in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/81
    • Add websocket compatibility + other improvements by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/77
    • Export csv report in diart.benchmark when output is provided by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/86

    New Contributors

    • @AMITKESARI2000 made their first contribution in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/74
    • @ckliao-nccu made their first contribution in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/78

    Acknowledgements

    Thank you @AMITKESARI2000, @ckliao-nccu and @zaouk for all the bug fixes!

    Full Changelog: https://github.com/juanmc2005/StreamingSpeakerDiarization/compare/v0.4...v0.5

  • v0.4(Jul 13, 2022)

    What's Changed

    • Replace resolve_features with TemporalFeatureFormatter by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/59
    • Make pyannote.audio optional (still mandatory to run default pipeline) by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/61
    • Minor features and improvements by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/64
    • Adds documentation for some of the classes and methods by @zaouk in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/31
    • Add hyper-parameter tuning with optuna by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/65

    New Contributors

    • Thank you @zaouk for your contribution in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/31 !

    Full Changelog: https://github.com/juanmc2005/StreamingSpeakerDiarization/compare/v0.3...v0.4

  • v0.3(May 20, 2022)

    What's Changed

    • Python 3.7 compatibility and PortAudio error fix by @Yagna24 in #29
    • Add citation by @hbredin in #38
    • Benchmark script + improvements and bug fixes by @juanmc2005 in #46
    • Improve API names by @juanmc2005 in #47
    • Add OverlapAwareSpeakerEmbedding class by @juanmc2005 in #51
    • Add inference API with RealTimeInference and Benchmark by @juanmc2005 in #55

    New Contributors

    • Thank you @Yagna24 for your contribution in python 3.7 compatibility!

    Full Changelog: https://github.com/juanmc2005/StreamingSpeakerDiarization/compare/v0.2.1...v0.3

  • v0.2.1(Jan 26, 2022)

    What's Changed

    • Fix empty segment in buffer_output causing a crash by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/24

    Full Changelog: https://github.com/juanmc2005/StreamingSpeakerDiarization/compare/v0.2...v0.2.1

  • v0.2(Jan 7, 2022)

    What's Changed

    • Replace operators.aggregate() with functional.DelayedAggregation by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/16
    • Add Hamming-weighted average to DelayedAggregation by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/18
    • Asynchronous microphone reading by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/19
    • Modular OutputBuilder + better demo performance by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/20
    • Improve README by @juanmc2005 in https://github.com/juanmc2005/StreamingSpeakerDiarization/pull/21

    Full Changelog: https://github.com/juanmc2005/StreamingSpeakerDiarization/compare/v0.1...v0.2

  • v0.1(Dec 15, 2021)

Owner
Juanma Coria
PhD Student working on Continual Representation Learning