CTC segmentation python package

Overview

CTC segmentation

CTC segmentation can be used to find utterance alignments within large audio files.

Installation

  • With pip:
pip install ctc-segmentation
  • From the Arch Linux AUR as python-ctc-segmentation-git using your favourite AUR helper.

  • From source:

git clone https://github.com/lumaku/ctc-segmentation
cd ctc-segmentation
cythonize -3 ctc_segmentation/ctc_segmentation_dyn.pyx
python setup.py build
python setup.py install --optimize=1 --skip-build

Example Code

  1. prepare_text filters characters not in the dictionary, and generates the character matrix.
  2. ctc_segmentation computes character-wise alignments from CTC activations of an already trained CTC-based network.
  3. determine_utterance_segments converts char-wise alignments to utterance-wise alignments.
  4. In a post-processing step, segments may be filtered by their confidence value.

This code is from asr_align.py of the ESPnet toolkit:

from ctc_segmentation import ctc_segmentation
from ctc_segmentation import CtcSegmentationParameters
from ctc_segmentation import determine_utterance_segments
from ctc_segmentation import prepare_text

# ...

config = CtcSegmentationParameters()
char_list = train_args.char_list

for idx, name in enumerate(js.keys(), 1):
    logging.info("(%d/%d) Aligning " + name, idx, len(js.keys()))
    batch = [(name, js[name])]
    feat, label = load_inputs_and_targets(batch)
    feat = feat[0]
    with torch.no_grad():
        # Encode input frames
        enc_output = model.encode(torch.as_tensor(feat).to(device)).unsqueeze(0)
        # Apply ctc layer to obtain log character probabilities
        lpz = model.ctc.log_softmax(enc_output)[0].cpu().numpy()
    # Prepare the text for aligning
    ground_truth_mat, utt_begin_indices = prepare_text(
        config, text[name], char_list
    )
    # Align using CTC segmentation
    timings, char_probs, state_list = ctc_segmentation(
        config, lpz, ground_truth_mat
    )
    # Obtain list of utterances with time intervals and confidence score
    segments = determine_utterance_segments(
        config, utt_begin_indices, char_probs, timings, text[name]
    )
    # Write to "segments" file
    for i, boundary in enumerate(segments):
        utt_segment = (
            f"{segment_names[name][i]} {name} {boundary[0]:.2f}"
            f" {boundary[1]:.2f} {boundary[2]:.9f}\n"
        )
        args.output.write(utt_segment)

After the segments are written to a segments file, they can be filtered with the parameter min_confidence_score. This is the minimum confidence score in log space, as described in the paper. Utterances with a low confidence score are discarded. This parameter may need adjustment depending on dataset, ASR model and language. For the German ASR model, a value of -1.5 worked very well, but for TEDlium, a lower value of about -5.0 seemed more practical.

awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${unfiltered} > ${filtered}

Parameters

There are several notable parameters to adjust the working of the algorithm:

  • min_window_size: Minimum window size considered for a single utterance. The current default value should be OK in most cases.

  • Localization: The character set is taken from the model dictionary, i.e., it is usually generated with SentencePiece. An ASR model trained in the corresponding language and character set is needed. For Asian languages, no changes to the CTC segmentation parameters should be necessary. One exception: If the character set contains any punctuation characters, "#", or the Greek character "ε", adapt the setting in an instance of CtcSegmentationParameters in segmentation.py.

  • CtcSegmentationParameters includes a blank character. If the model dictionary uses, e.g., "<blank>" instead of the default "_", copy the blank character over from the dictionary to the configuration (see the sketch after this list). If the blank in the configuration and in the dictionary mismatch, the algorithm raises an IndexError during backtracking.

  • If replace_spaces_with_blanks is True, then spaces in the ground truth sequence are replaced by blanks. This option is enabled by default and improves compatibility with dictionaries in which the space character is unknown.

  • To align utterances with longer unknown audio sections between them, use blank_transition_cost_zero (default: False). With this option, the stay transition in the blank state is free. A transition to the next character is only consumed if the probability to switch is higher. In this way, more time steps can be skipped between utterances. Caution: in combination with replace_spaces_with_blanks == True, this may lead to misaligned segments.
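
A minimal sketch of how these options can be set, assuming the attribute names match the parameter names documented above (verify them against your installed version):

config = CtcSegmentationParameters()
# Copy the blank symbol over from the model dictionary, e.g. if it uses
# "<blank>" instead of the default "_" (see the bullet point above):
config.blank = "<blank>"
# Spaces in the ground truth are replaced by blanks (enabled by default):
config.replace_spaces_with_blanks = True
# Free stay transitions in the blank state, for long unrelated audio sections:
config.blank_transition_cost_zero = True
# Minimum window size considered for a single utterance:
config.min_window_size = 8000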

Two parameters are needed to correctly map the frame indices to a time stamp in seconds:

  • subsampling_factor: If the encoder sub-samples its input, the number of frames at the CTC layer is reduced by this factor. A BLSTMP encoder with subsampling 1_2_2_1_1 has a subsampling factor of 4.
  • frame_duration_ms: This is the non-overlapping duration of a single frame in milliseconds (the inverse of frames per millisecond). Note: if fs is set, then frame_duration_ms is ignored.

But not all ASR systems have subsampling. If you want to directly use the sampling rate:

  1. For a given sample rate, say, 16kHz, set fs=16000.
  2. Then set the subsampling_factor to the number of sample points in a single CTC-encoded frame. In default ASR systems, this can be calculated as the hop length of the windowing times the encoder subsampling factor. For example, if the hop length is 128 and the subsampling factor in the encoder is 4, then set subsampling_factor=512 (see the sketch below).
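
As a worked example of the index-to-time conversion described above (illustrative values):

# Variant 1: frame duration plus encoder subsampling
subsampling_factor = 4      # e.g., a BLSTMP encoder with subsampling 1_2_2_1_1
frame_duration_ms = 10      # non-overlapping duration of one input frame
seconds_per_index = subsampling_factor * frame_duration_ms / 1000  # 0.04 s

# Variant 2: sampling rate fs; subsampling_factor counts samples per CTC frame
fs = 16000                  # sample rate in Hz
subsampling_factor = 128 * 4  # hop length 128 times encoder subsampling 4
seconds_per_index = subsampling_factor / fs  # 0.032 s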

How it works

1. Forward propagation

Character probabilities for each time step are obtained from a CTC-based network. With these, transition probabilities are mapped into a trellis diagram. To account for preambles or unrelated segments in audio files, the transition costs are set to zero for the start-of-sentence or blank token.

Forward trellis
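
The following is a simplified sketch of the trellis computation, not the package's windowed Cython implementation; it assumes lpz holds log probabilities with the blank token at a known index, and that every ground truth entry is a single character index:

import numpy as np

NEG_INF = -1e10

def fill_trellis(lpz, ground_truth, blank=0):
    # table[t, c] holds the best joint log probability of having consumed
    # the first c ground truth characters after t time steps.
    T, C = lpz.shape[0], len(ground_truth)
    table = np.full((T, C), NEG_INF, dtype=np.float32)
    # Free transitions in the start state: the audio may begin with a
    # preamble or an unrelated segment.
    table[:, 0] = 0.0
    for t in range(1, T):
        for c in range(1, C):
            stay = table[t - 1, c] + lpz[t, blank]
            switch = table[t - 1, c - 1] + lpz[t, ground_truth[c]]
            table[t, c] = max(stay, switch)
    return table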

2. Backtracking

Starting from the time step with the highest probability for the last character, backtracking determines the most probable path of characters through all time steps.

Backward path
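
Continuing the simplified sketch from above: backtracking starts at the time step with the best score for the last character and re-derives, step by step, which transition was taken, recording a frame index for each character:

def backtrack(table, lpz, ground_truth, blank=0):
    T, C = table.shape
    # Start at the time step with the highest probability for the last character.
    t, c = int(np.argmax(table[:, C - 1])), C - 1
    timings = np.zeros(C)
    while t > 0 and c > 0:
        stay = table[t - 1, c] + lpz[t, blank]
        switch = table[t - 1, c - 1] + lpz[t, ground_truth[c]]
        if switch >= stay:
            timings[c] = t  # character c was consumed at time step t
            c -= 1
        t -= 1
    # Multiply by the index duration in seconds to obtain time stamps.
    return timings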

3. Confidence score

As this method generates a probability for each aligned character, a confidence score for each utterance can be derived. For example, if a word within an utterance is missing, this value is low.

Confidence score

The confidence score helps to detect and filter out bad utterances.
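
A minimal sketch of such a score: the minimum over the mean aligned character probability in sliding windows of 30 frames, mirroring the scheme used in the example script quoted in the comments below:

import numpy as np

def confidence_score(char_probs, start_t, end_t, L=30):
    # Minimum mean log probability over sliding windows of L frames; a single
    # badly aligned stretch (e.g., a missing word) pulls the score down.
    if end_t <= start_t:
        return 0.0
    if end_t - start_t <= L:
        return float(np.mean(char_probs[start_t:end_t]))
    return min(float(np.mean(char_probs[t:t + L]))
               for t in range(start_t, end_t - L))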

Reference

The full paper is available as a preprint at https://arxiv.org/abs/2007.09127 and in published form at https://doi.org/10.1007/978-3-030-60276-5_27. To cite this work:

@InProceedings{ctcsegmentation,
  author="K{\"u}rzinger, Ludwig
    and Winkelbauer, Dominik
    and Li, Lujun
    and Watzel, Tobias
    and Rigoll, Gerhard",
  editor="Karpov, Alexey
    and Potapova, Rodmonga",
  title="CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition",
  booktitle="Speech and Computer",
  year="2020",
  publisher="Springer International Publishing",
  address="Cham",
  pages="267--278",
  abstract="Recent end-to-end Automatic Speech Recognition (ASR) systems demonstrated the ability to outperform conventional hybrid DNN/HMM ASR. Aside from architectural improvements in those systems, those models grew in terms of depth, parameters and model capacity. However, these models also require more training data to achieve comparable performance.",
  isbn="978-3-030-60276-5"
}

Comments
  • The problem about last phoneme alignment

    Hi, thanks for this great job. I have tried to integrate it on top of my ASR module; most of the phonemes were aligned perfectly except the last, as can be seen below.

    [figures: ctc1, ctc2]

    The top figure shows the original waveform, and the bottom the alignment result. I found that the waveform near the end was cut off, and the index_duration was right because all the phonemes except the last were aligned accurately.

    So how can I solve this problem? Thanks in advance.

    opened by taylorlu 9
  • What is the difference between <blank> and self-transition?

    Thank you for providing this useful toolkit! I am new to it and am learning it. As far as I know, <blank> in CTC means continuing the last character, so what does the self-transition mean? Can I treat them as the same?

    opened by houwenxin 7
  • CTC Segmentation for German

    Hello, I have to split the audio files in my dataset and their corresponding transcripts as well. Is there a pretrained model of yours for the German language?

    opened by sadia95 6
  • IndexError: out of bounds

    This wave file: pl.zip

    This code:

    import torch, transformers, ctc_segmentation
    import soundfile
    
    # wav2vec2
    model_file = 'jonatasgrosman/wav2vec2-large-xlsr-53-polish'
    vocab_dict = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "A": 5, "I": 6, "E": 7, "O": 8, "Z": 9, "N": 10, "S": 11, "W": 12, "R": 13, "C": 14, "Y": 15, "M": 16, "T": 17, "D": 18, "K": 19, "P": 20, "Ł": 21, "J": 22, "U": 23, "L": 24, "B": 25, "Ę": 26, "G": 27, "Ą": 28, "Ż": 29, "H": 30, "Ś": 31, "Ó": 32, "Ć": 33, "F": 34, "Ń": 35, "Ź": 36, "V": 37, "-": 38, "Q": 39, "X": 40, "'": 41}
    
    processor = transformers.Wav2Vec2Processor.from_pretrained( model_file )
    model = transformers.Wav2Vec2ForCTC.from_pretrained( model_file )
    
    speech_array, sampling_rate = soundfile.read( '/tmp/pl.wav' )
    assert sampling_rate == 16000
    features = processor(speech_array,sampling_rate=16000, return_tensors="pt")
    input_values = features.input_values
    attention_mask = features.attention_mask
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]
    transcription = transcription.lower().split()
    
    # ctc-segmentation
    with torch.no_grad():
        softmax = torch.nn.LogSoftmax(dim=-1)
        lpz = softmax(logits)[0].cpu().numpy()
    config = ctc_segmentation.CtcSegmentationParameters()
    config.index_duration = speech_array.shape[0] / lpz.shape[0] / sampling_rate
    char_list = [x.lower() for x in vocab_dict.keys()]
    ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_text(config, transcription,char_list)
    timings, char_probs, state_list = ctc_segmentation.ctc_segmentation(config, lpz, ground_truth_mat)
    segments = ctc_segmentation.determine_utterance_segments(config, utt_begin_indices, char_probs, timings, transcription)
    

    Console:

    Traceback (most recent call last):
      File "ctc.py", line 31, in <module>
        segments = ctc_segmentation.determine_utterance_segments(config, utt_begin_indices, char_probs, timings, transcription)
      File "/home/max/.local/lib/python3.8/site-packages/ctc_segmentation/ctc_segmentation.py", line 387, in determine_utterance_segments
        start = compute_time(utt_begin_indices[i], "begin")
      File "/home/max/.local/lib/python3.8/site-packages/ctc_segmentation/ctc_segmentation.py", line 380, in compute_time
        return max(timings[index + 1] - 0.5, middle)
    IndexError: index 450 is out of bounds for axis 0 with size 450
    
    opened by doublex 5
  • Super large audio file problems

    Thanks for your work!

    The time length of my audio file is more than 1 hour, so there are some problems when I tried your example code in ESPnet2 and NeMo using my own data.

    In ESPnet2, neither my GPU memory nor my CPU memory is large enough to run the code.

    In NeMo, it gives the following hints:

    INFO:root:CTC segmentation of 62154 chars to 8700.20s audio (217505 indices).
    WARNING:root:IndexError: Backtracking was not successful, the window size might be too small.
    WARNING:root:Increasing the window size to: 64000
    WARNING:root:IndexError: Backtracking was not successful, the window size might be too small.
    ERROR:root:Maximum window size reached.
    ERROR:root:Check data and character list!
    
    opened by SenYan1999 4
  • Installation fails on Windows 10

    Hi, the installation via pip fails on Windows 10 19043.1766 with the following error message.

    I have installed the Visual Studio Windows 10 SDK and Microsoft Visual C++ 14.0.

    pip install ctc-segmentation==1.7.1
    
    Collecting ctc-segmentation==1.7.1
      Using cached ctc_segmentation-1.7.1.tar.gz (71 kB)
      Preparing metadata (setup.py) ... done
    
    [...]
    
    Building wheels for collected packages: ctc-segmentation
      Building wheel for ctc-segmentation (setup.py) ... error
      ERROR: Command errored out with exit status 1:
       command: '[...]\espnet-venv\Scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'[...]\\AppData\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"'; __file__='"'"'[...]\\AppData
    \\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(comp
    ile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d '[...]\AppData\Local\Temp\pip-wheel-zx7xruid'
           cwd: [...]\AppData\Local\Temp\pip-install-qscp16gf\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\
      Complete output (28 lines):
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-3.9
      creating build\lib.win-amd64-3.9\ctc_segmentation
      copying ctc_segmentation\ctc_segmentation.py -> build\lib.win-amd64-3.9\ctc_segmentation
      copying ctc_segmentation\partitioning.py -> build\lib.win-amd64-3.9\ctc_segmentation
      copying ctc_segmentation\__init__.py -> build\lib.win-amd64-3.9\ctc_segmentation
      running build_ext
      creating build\temp.win-amd64-3.9
      creating build\temp.win-amd64-3.9\Release
      creating build\temp.win-amd64-3.9\Release\ctc_segmentation
      "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -I[...]\espnet-venv\include -I[...]\AppData\Local\Programs\Python\Python39\include -I[...]\AppData\Local\Programs\Python\Python39\Include -IC:\Users\Fab
    ian\PyCharmProjects\MA_Tuyet\espnet-venv\lib\site-packages\numpy\core\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt" "-IC:\Program Files (x86)\Windows K
    its\10\include\10.0.20348.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\winrt" /Tcctc_segmentation/ctc_segmentation_dyn.c /Fobuild\temp.win-amd64-3.9\Release\ctc_segmentation/ctc_segmentation_dyn.obj
      ctc_segmentation_dyn.c
      c:\users\fabian\pycharmprojects\ma_tuyet\espnet-venv\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
      ctc_segmentation/ctc_segmentation_dyn.c(2338): warning C4244: "=": Konvertierung von "double" in "float", möglicher Datenverlust
      ctc_segmentation/ctc_segmentation_dyn.c(2482): warning C4244: "Funktion": Konvertierung von "npy_intp" in "long", möglicher Datenverlust
      ctc_segmentation/ctc_segmentation_dyn.c(2499): warning C4244: "=": Konvertierung von "npy_intp" in "long", möglicher Datenverlust
      ctc_segmentation/ctc_segmentation_dyn.c(2512): warning C4244: "=": Konvertierung von "npy_intp" in "long", möglicher Datenverlust
      ctc_segmentation/ctc_segmentation_dyn.c(3240): warning C4244: "=": Konvertierung von "npy_intp" in "int", möglicher Datenverlust
      "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe" /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:[...]\espnet-venv\libs /LIBPATH:[...]\AppData\Local\Programs\Python\Python39\libs /LIBPATH:[...]\AppData\
    Local\Programs\Python\Python39 /LIBPATH:[...]\espnet-venv\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0
    .20348.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.20348.0\um\x64" /EXPORT:PyInit_ctc_segmentation_dyn build\temp.win-amd64-3.9\Release\ctc_segmentation/ctc_segmentation_dyn.obj /OUT:build\lib.win-amd64-3.9\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.9\
    Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.lib
      ctc_segmentation_dyn.obj : warning LNK4197: Export "PyInit_ctc_segmentation_dyn" wurde mehrmals angegeben; erste Angabe wird verwendet.
         Bibliothek "build\temp.win-amd64-3.9\Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.lib" und Objekt "build\temp.win-amd64-3.9\Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.exp" werden erstellt.
      Code wird generiert.
      Codegenerierung ist abgeschlossen.
      LINK : fatal error LNK1327: Fehler beim Ausführen von rc.exe
      error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\link.exe' failed with exit code 1327
      ----------------------------------------
      ERROR: Failed building wheel for ctc-segmentation
      Running setup.py clean for ctc-segmentation
    Failed to build ctc-segmentation
    Installing collected packages: ctc-segmentation
        Running setup.py install for ctc-segmentation ... error
        ERROR: Command errored out with exit status 1:
         command: '[...]\espnet-venv\Scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'[...]\\AppData\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"'; __file__='"'"'[...]\\AppDa
    ta\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(co
    mpile(code, __file__, '"'"'exec'"'"'))' install --record '[...]\AppData\Local\Temp\pip-record-skx2310s\install-record.txt' --single-version-externally-managed --compile --install-headers '[...]\espnet-venv\include\site\python3.9\ctc-segmentation'
             cwd: [...]\AppData\Local\Temp\pip-install-qscp16gf\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\
        Complete output (30 lines):
        running install
        [...]\espnet-venv\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
          warnings.warn(
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.9
        creating build\lib.win-amd64-3.9\ctc_segmentation
        copying ctc_segmentation\ctc_segmentation.py -> build\lib.win-amd64-3.9\ctc_segmentation
        copying ctc_segmentation\partitioning.py -> build\lib.win-amd64-3.9\ctc_segmentation
        copying ctc_segmentation\__init__.py -> build\lib.win-amd64-3.9\ctc_segmentation
        running build_ext
        creating build\temp.win-amd64-3.9
        creating build\temp.win-amd64-3.9\Release
        creating build\temp.win-amd64-3.9\Release\ctc_segmentation
        "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -I[...]\espnet-venv\include -I[...]\AppData\Local\Programs\Python\Python39\include -I[...]\AppData\Local\Programs\Python\Python39\Include -IC:\Users\F
    abian\PyCharmProjects\MA_Tuyet\espnet-venv\lib\site-packages\numpy\core\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt" "-IC:\Program Files (x86)\Windows
     Kits\10\include\10.0.20348.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\winrt" /Tcctc_segmentation/ctc_segmentation_dyn.c /Fobuild\temp.win-amd64-3.9\Release\ctc_segmentation/ctc_segmentation_dyn.obj
        ctc_segmentation_dyn.c
        c:\users\fabian\pycharmprojects\ma_tuyet\espnet-venv\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
        ctc_segmentation/ctc_segmentation_dyn.c(2338): warning C4244: "=": Konvertierung von "double" in "float", möglicher Datenverlust
        ctc_segmentation/ctc_segmentation_dyn.c(2482): warning C4244: "Funktion": Konvertierung von "npy_intp" in "long", möglicher Datenverlust
        ctc_segmentation/ctc_segmentation_dyn.c(2499): warning C4244: "=": Konvertierung von "npy_intp" in "long", möglicher Datenverlust
        ctc_segmentation/ctc_segmentation_dyn.c(2512): warning C4244: "=": Konvertierung von "npy_intp" in "long", möglicher Datenverlust
        ctc_segmentation/ctc_segmentation_dyn.c(3240): warning C4244: "=": Konvertierung von "npy_intp" in "int", möglicher Datenverlust
        "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe" /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:[...]\espnet-venv\libs /LIBPATH:[...]\AppData\Local\Programs\Python\Python39\libs /LIBPATH:[...]\AppDat
    a\Local\Programs\Python\Python39 /LIBPATH:[...]\espnet-venv\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10
    .0.20348.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.20348.0\um\x64" /EXPORT:PyInit_ctc_segmentation_dyn build\temp.win-amd64-3.9\Release\ctc_segmentation/ctc_segmentation_dyn.obj /OUT:build\lib.win-amd64-3.9\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.
    9\Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.lib
        ctc_segmentation_dyn.obj : warning LNK4197: Export "PyInit_ctc_segmentation_dyn" wurde mehrmals angegeben; erste Angabe wird verwendet.
           Bibliothek "build\temp.win-amd64-3.9\Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.lib" und Objekt "build\temp.win-amd64-3.9\Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.exp" werden erstellt.
        Code wird generiert.
        Codegenerierung ist abgeschlossen.
        LINK : fatal error LNK1327: Fehler beim Ausführen von rc.exe
        error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\link.exe' failed with exit code 1327
        ----------------------------------------
    ERROR: Command errored out with exit status 1: '[...]\espnet-venv\Scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'[...]\\AppData\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"'; __fil
    e__='"'"'[...]\\AppData\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"'
    , '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record '[...]\AppData\Local\Temp\pip-record-skx2310s\install-record.txt' --single-version-externally-managed --compile --install-headers '[...]\espnet-venv\include\site\python3.9\ctc-segmentation
    ' Check the logs for full command output.
    WARNING: You are using pip version 21.3.1; however, version 22.1.2 is available.
    You should consider upgrading via the '[...]\espnet-venv\Scripts\python.exe -m pip install --upgrade pip' command.
    

    Can somebody help me? Cheers

    opened by FabianNiehaus 3
  • Timing squeezed in the beginning

    Dear authors, I tried to use your library to align a true-cased text containing punctuation, but the timings obtained all seem squeezed towards the beginning. I set the index_duration to 0.04 since I extract features every 10 ms and have a subsampling of 4 at the beginning. My tokenized textual predictions look like the following:

    ▁But ▁if ▁you ▁could ▁take ▁a ▁pill ▁ <eol> ▁or ▁a ▁vaccine , ▁ <eob> ▁and ▁just ▁like ▁getting ▁over ▁a ▁cold , ▁ <eob> ▁you ▁could ▁heal ▁your ▁wind ▁faster ? ▁ <eob>

    where <eob> and <eol> are treated as special characters in my vocabulary. I select <eob> as a split token, i.e., a sentence is split when <eob> is found in the text. The timings obtained are: 0.04-1.52, 1.52-2.24, 2.24-5.44.

    The first thing that is not correct is that the total duration of the segment is 6.75 s, while the last value of the timings obtained from the CTC segmentation is 5.44 s. The other thing is that the intervals between the timings obtained by your library are 1.48 s, 0.32 s, 2.20 s, but if I listen to the audio and measure them, they are roughly 1.9 s, 2 s, 1.7 s. Looking at other examples I observe the same phenomenon: all the timings seem squeezed towards the beginning of the sentence. I have used both prepare_text and prepare_token_list, but that is not the cause of the problem. Have you any hint on where the problem is? Thank you in advance.

    opened by sarapapi 3
  • how it works when i use my own CTC probabilities and char_list?

    Hi, I want to test CTC segmentation on Chinese, so I use my own CTC probabilities obtained from an acoustic model (CTC+LSTM) and my own char_list (~2000 syllables; one Chinese character corresponds to one syllable). But the result is incorrect:

    [screenshot: 20220407112346]

    The pcm file IC0773W0044-nosp.pcm is labeled "放一首邓丽君的我只在乎你", which corresponds to the syllable string "f_ang4 ii_i1 sh_ou3 d_eng4 l_i4 j_vn1 d_e0 uu_uo3 zh_i3 z_ai4 h_u1 n_i3", and the audio duration is 2.46 s. My script is align.py:

    import numpy as np
    import pyximport
    pyximport.install(setup_args={"include_dirs":np.get_include()},build_dir="build", build_in_temp=False)
    from align_fill import cython_fill_table
    import sys
    import torch
    sys.path.append("../../../espnet")
    from espnet.asr.asr_utils import get_model_conf
    import os
    from pathlib import Path
    from time import time
    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument("model_path")
    parser.add_argument("data_path")
    parser.add_argument("eval_path")
    parser.add_argument('--start_win', type=int, default=8000)
    args = parser.parse_args()
    max_prob = -10000000000.0
    
    def align(lpz, char_list, ground_truth, utt_begin_indices, skip_prob):   
        #blank = 0
        blank = 2344 #my blank id
        print("Audio length: " + str(lpz.shape[0]))
        print("Text length: " + str(len(ground_truth)))
        if len(ground_truth) > lpz.shape[0] and skip_prob <= max_prob:
            raise AssertionError("Audio is shorter than text!")
        window_len = args.start_win
    
        # Try multiple window lengths if it fails
        while True:
            # Create table which will contain alignment probabilities
            print(lpz.shape[0],len(ground_truth),ground_truth.shape[0],ground_truth.shape[1])
            table = np.zeros([min(window_len, lpz.shape[0]), len(ground_truth)], dtype=np.float32)
            table.fill(max_prob)
            # Use array to log window offsets per character
            offsets = np.zeros([len(ground_truth)], dtype=np.int)
    
            # Run actual alignment
            t, c = cython_fill_table(table, lpz.astype(np.float32), np.array(ground_truth), offsets, np.array(utt_begin_indices), blank, skip_prob)
            #for i in table:
            #    print(' '.join(map(str,i)))
            print("Max prob: " + str(table[:, c].max()) + " at " + str(t))
    
            # Backtracking
            timings = np.zeros([len(ground_truth)])
            char_probs = np.zeros([lpz.shape[0]])
            char_list = [''] * lpz.shape[0]
            current_prob_sum = 0
            try:
                # Do until start is reached
                while t != 0 or c != 0:
                    # Calculate the possible transition probabilities towards the current cell
                    min_s = None
                    min_switch_prob_delta = np.inf
                    max_lpz_prob = max_prob
                    for s in range(ground_truth.shape[1]): 
                        if ground_truth[c, s] != -1:                   
                            offset = offsets[c] - (offsets[c - 1 - s] if c - s > 0 else 0)
                            switch_prob = lpz[t + offsets[c], ground_truth[c, s]] if c > 0 else max_prob
                            est_switch_prob = table[t, c] - table[t - 1 + offset, c - 1 - s]
                            if abs(switch_prob - est_switch_prob) < min_switch_prob_delta:
                                min_switch_prob_delta = abs(switch_prob - est_switch_prob)
                                min_s = s
    
                            max_lpz_prob = max(max_lpz_prob, switch_prob)
                    
                    stay_prob = max(lpz[t + offsets[c], blank], max_lpz_prob) if t > 0 else max_prob
                    est_stay_prob = table[t, c] - table[t - 1, c]
                    
                    # Check which transition has been taken
                    if abs(stay_prob - est_stay_prob) > min_switch_prob_delta:
                        # Apply reverse switch transition
                        if c > 0:
                            # Log timing and character - frame alignment
                            for s in range(0, min_s + 1):
                                timings[c - s] = (offsets[c] + t) * 10 * 4 / 1000
                            char_probs[offsets[c] + t] = max_lpz_prob
                            char_list[offsets[c] + t] = train_args.char_list[ground_truth[c, min_s]]
                            current_prob_sum = 0
    
                        c -= 1 + min_s
                        t -= 1 - offset
                     
                    else:
                        # Apply reverse stay transition
                        char_probs[offsets[c] + t] = stay_prob
                        char_list[offsets[c] + t] = "ε"
                        t -= 1
            except IndexError:
                # If the backtracking was not successful this usually means the window was too small
                window_len *= 2
                print("IndexError: Trying with win len: " + str(window_len))
                if window_len < 100000:
                    continue
                else:
                    raise
            break
        return timings, char_probs, char_list
    
    def prepare_text(text):
        # Prepares the given text for alignment
        # Therefore we create a matrix of possible character symbols to represent the given text
    
        # Create list of char indices depending on the models char list
        ground_truth = "#"
        utt_begin_indices = []
        for utt in text:
            # Only one space in-between
            if ground_truth[-1] != " ":
                ground_truth += " "
    
        # Start a new utterance; remember its index
            utt_begin_indices.append(len(ground_truth.strip().split()) - 1)
    
            # Add chars of utterance
            for char in utt.strip().split():
                if char.isspace():
                    if ground_truth.strip().split()[-1] != " ":
                        ground_truth += " "
                elif char in train_args.char_list and char not in [ ".", ",", "-", "?", "!", ":", "»", "«", ";", "'", "›", "‹", "(", ")"]:
                    ground_truth += char
    
        # Add space to the end
        if ground_truth[-1] != " ":
            ground_truth += " "
        utt_begin_indices.append(len(ground_truth.strip().split()) - 1)
        print(ground_truth)
        # Create matrix where first index is the time frame and the second index is the number of letters the character symbol spans
        max_char_len = max([len(c) for c in train_args.char_list])
        # ground_truth_mat = np.ones([len(ground_truth), max_char_len], np.int) * -1    
        ground_truth_mat = np.ones([len(ground_truth.strip().split()), 1], np.int) * -1    
        for i in range(len(ground_truth.strip().split())):
            # for s in range(max_char_len):
            for s in range(1):
                if i-s < 0:
                    continue
                span = ' '.join(ground_truth.strip().split()[i-s:i+1])
                # span = span.replace(" ", '▁')
                span = span.replace(" ", 'SP')
                print(span)
                if span in train_args.char_list:
                    ground_truth_mat[i, s] = train_args.char_list.index(span)        
        print(ground_truth_mat)
        print(utt_begin_indices)
        return ground_truth_mat, utt_begin_indices
    
    def write_output(out_path, utt_begin_indices, char_probs):
    # Uses char-wise alignments to get utterance-wise alignments and writes them into the given file
        with open(str(out_path), 'w') as outfile:
            outfile.write(str(path_wav.name) + '\n')
            def compute_time(index, type):
                # Compute start and end time of utterance.            
                middle = (timings[index] + timings[index - 1]) / 2
                if type == "begin":
                    return max(timings[index + 1] - 0.5, middle)
                elif type == "end":
                    return min(timings[index - 1] + 0.5, middle)
    
            for i in range(len(text)):
                start = compute_time(utt_begin_indices[i], "begin")
                end = compute_time(utt_begin_indices[i + 1], "end")
                start_t = int(round(start * 1000 / 40))
                end_t = int(round(end * 1000 / 40))
                # Compute confidence score by using the min mean probability after splitting into segments of 30 frames
                n = 30
                if end_t == start_t:
                    min_avg = 0
                elif end_t - start_t <= n:
                    min_avg = char_probs[start_t:end_t].mean()
                else:
                    min_avg = 0
                    for t in range(start_t, end_t - n):
                        min_avg = min(min_avg, char_probs[t:t + n].mean())                
                outfile.write(str(start) + " " + str(end) + " " + str(min_avg) + " | " + text[i] + '\n')
    
    model_path = args.model_path
    model_conf = None
    
    # read training config
    idim, odim, train_args = get_model_conf(model_path, model_conf)
    
    #space_id = train_args.char_list.index('▁')
    space_id = train_args.char_list.index('SP')
    train_args.char_list[0] = "ε"
    # train_args.char_list = [c.lower() for c in train_args.char_list]
    
    data_path = Path(args.data_path)
    eval_path = Path(args.eval_path)
    
    for path_wav in data_path.glob("*.pcm"):
        chapter_sents = data_path / path_wav.name.replace(".pcm", ".txt")
        chapter_prob = eval_path / path_wav.name.replace(".pcm", ".npz")
        out_path = eval_path / path_wav.name.replace(".pcm", ".txt")
        with open(str(chapter_sents), "r") as f:
            text = [t.strip() for t in f.readlines()]
        lpz = np.load(str(chapter_prob))["arr_0"]
        print("Syncing " + str(path_wav))                    
        ground_truth_mat, utt_begin_indices = prepare_text(text)
        try:
            timings, char_probs, char_list = align(lpz, train_args.char_list, ground_truth_mat, utt_begin_indices, max_prob)
            print(timings)
        except AssertionError:
            print("Skipping: Audio is shorter than text")
            continue
        write_output(out_path, utt_begin_indices, char_probs)
    

    I wonder where I went wrong, or whether CTC segmentation is not suitable for Chinese syllables. Any suggestion is helpful to me! Thanks very much!

    opened by HalFTeen 3
  • Align text from wav2vec2

    How can I use ctc-segmentation with wav2vec2? The code below runs, but the output is not properly aligned.

    This wav: meisterfloh.zip. This code:

    import torch, transformers, ctc_segmentation
    import soundfile
    
    # wav2vec2
    model_file = 'facebook/wav2vec2-large-xlsr-53-german'
    vocab_dict = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "E": 5, "N": 6, "I": 7, "S": 8, "R": 9, "T": 10, "A": 11, "H": 12, "D": 13, "U": 14, "L": 15, "C": 16, "G": 17, "M": 18, "O": 19, "B": 20, "W": 21, "F": 22, "K": 23, "Z": 24, "V": 25, "Ü": 26, "P": 27, "Ä": 28, "Ö": 29, "J": 30, "Y": 31, "'": 32, "X": 33, "Q": 34, "-": 35}
    
    processor = transformers.Wav2Vec2Processor.from_pretrained( model_file )
    model = transformers.Wav2Vec2ForCTC.from_pretrained( model_file )
    
    speech_array, sampling_rate = soundfile.read( 'meisterfloh.wav' )
    assert sampling_rate == 16000
    features = processor(speech_array,sampling_rate=16000, return_tensors="pt")
    input_values = features.input_values
    attention_mask = features.attention_mask
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]
    transcription = transcription.lower()
    
    # ctc-segmentation
    config = ctc_segmentation.CtcSegmentationParameters()
    with torch.no_grad():
        softmax = torch.nn.Softmax(dim = -1)
        lpz = softmax(logits)[0].cpu().numpy()
    char_list = [x.lower() for x in vocab_dict.keys()]
    ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_text(config, transcription,char_list)
    timings, char_probs, state_list = ctc_segmentation.ctc_segmentation(config, lpz, ground_truth_mat)
    segments = ctc_segmentation.determine_utterance_segments(config, utt_begin_indices, char_probs, timings, transcription)
    
    # dump
    for word, segment in zip(transcription.split(' '), segments):
        print( word, segment )
    
    opened by doublex 2
  • fix installation issue

    https://github.com/espnet/espnet/issues/2365#issuecomment-680987279

    I fixed setup.py.

    • Fix installation issue due to the wrong extension name
    • Fix issues to avoid importing numpy: numpy cannot be imported at the top of setup.py, because it is not installed yet at that point.
    • Added GitHub Actions.
    opened by kamo-naoyuki 2
  • Installation fails

    Defaulting to user installation because normal site-packages is not writeable
    Collecting ctc-segmentation
      Using cached ctc_segmentation-1.7.4.tar.gz (73 kB)
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Preparing metadata (pyproject.toml) ... done
    Requirement already satisfied: Cython in c:\users\zhx\appdata\roaming\python\python39\site-packages (from ctc-segmentation) (0.29.32)
    Requirement already satisfied: numpy in c:\users\zhx\appdata\roaming\python\python39\site-packages (from ctc-segmentation) (1.23.4)
    Requirement already satisfied: setuptools in c:\program files\python39\lib\site-packages (from ctc-segmentation) (56.0.0)
    Building wheels for collected packages: ctc-segmentation
      Building wheel for ctc-segmentation (pyproject.toml) ... error
      error: subprocess-exited-with-error

      × Building wheel for ctc-segmentation (pyproject.toml) did not run successfully.
      │ exit code: 1
      ╰─> [12 lines of output]
          running bdist_wheel
          running build
          running build_py
          creating build
          creating build\lib.win-amd64-cpython-39
          creating build\lib.win-amd64-cpython-39\ctc_segmentation
          copying ctc_segmentation\ctc_segmentation.py -> build\lib.win-amd64-cpython-39\ctc_segmentation
          copying ctc_segmentation\partitioning.py -> build\lib.win-amd64-cpython-39\ctc_segmentation
          copying ctc_segmentation\__init__.py -> build\lib.win-amd64-cpython-39\ctc_segmentation
          running build_ext
          building 'ctc_segmentation.ctc_segmentation_dyn' extension
          error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
          [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for ctc-segmentation
    Failed to build ctc-segmentation
    ERROR: Could not build wheels for ctc-segmentation, which is required to install pyproject.toml-based projects

    opened by Zhonghexin 1
  • Differences between lumaku and cornerfarmer implementations

    First of all marvelous piece of work done here! Thanks @lumaku , your continued participation in ASR projects as well is invaluable!!

    I had a query regarding the differences between the implementation and features exposed by this repo and the cornerfarmer repo.

    Are there any differences with respect to the implementation? What about:

    • Performance and GPU computability? I understand the algorithm is not meant for the GPU and instead runs better on a strong single-core CPU, but mentions of using an RNN instead confuse me. Perhaps that is only for getting the logits? This part confused me a little, as did searching for a suitable pretrained RNN+CTC-based character-outputting STT implementation.
    • Working with longer audio files
    • Any other interfaces exposed or features provided

    I understand that this repo is based on the cornerfarmer one, as that is the code for the paper (DOI: 10.1007/978-3-030-60276-5_27), but would like to ask the author @lumaku if there are any insights to be gained here.

    Of particular use would be for my use case of force aligning long form audio using ctc-segmentation using ASR-generated transcripts. Any insights regarding this would be appreciated as well, otherwise I can create a new topic if that is more acceptable!

    Thanks again @lumaku !

    opened by ShantanuNair 1
  • How to solve the "Audio is shorter than text" error?

    Hello, I want to do a character-level alignment for a recording. I am using your "ctc_segmentation" tool, and I am sending each character on a new line in the input text file. For example, for "IT GAVE" I will send a file with the following:

    utt0 I
    utt1 T
    utt2 _
    utt3 G
    utt4 A
    utt5 V
    utt6 E

    For some recordings it works pretty well. The problem is that sometimes I get the error: "Audio is shorter than text!" I understand that it is something about the ratio between the utterances to be aligned and the audio length, but how can I solve this problem? Could tuning the 'Time stamp parameters' solve the issue?

    opened by chenasr 3
Owner: Ludwig Kürzinger