Charsiu: A transformer-based phonetic aligner

Overview

Charsiu: A transformer-based phonetic aligner [arXiv]

Note. This is a preview version. The aligner is under active development. New functions, new languages and detailed documentation will be added soon!

Intro

Charsiu is a phonetic alignment tool that can:

  • recognise phonemes in a given audio file
  • perform forced alignment using phone transcriptions created in the previous step or provided by the user
  • directly predict the phone-to-audio alignment from the audio alone (text-independent alignment)

Fun fact: Char Siu is one of the most representative dishes of Cantonese cuisine 🍲 (see wiki).

Tutorial (In progress)

You can directly run our model in the cloud via Google Colab!

  • Forced alignment: Open In Colab
  • Textless alignment: Open In Colab
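
If you prefer to run the aligner locally, a rough sketch is shown below. The charsiu.align(audio=..., text=...) call and the charsiu/en_w2v2_fc_10ms model ID are taken from the issue threads at the bottom of this page; the class name charsiu_forced_aligner and the import path are assumptions on our part, so treat the Colab notebooks above as the authoritative reference.

# Minimal sketch of local forced alignment. The align() call and the model ID appear in the
# issues below; the class name charsiu_forced_aligner is assumed (check src/Charsiu.py).
import sys
sys.path.append('charsiu/src')  # assumed: put the repo's src/ directory on the path
from Charsiu import charsiu_forced_aligner

# Load a pretrained English frame-classification aligner from the HuggingFace model hub.
charsiu = charsiu_forced_aligner(aligner='charsiu/en_w2v2_fc_10ms')

# Align a short recording against its transcript.
alignment = charsiu.align(audio='example.wav',
                          text='He began a confused complaint against the wizard')
print(alignment)  # expected: interval-style (start, end, label) entries, as in the issue examples below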

Development plan

  • Package

    Item                  Progress
    Documentation         Nov 2021
    TextGrid support      Nov 2021
    Model compression     TBD

  • Multilingual support

    Language              Progress
    English (American)    Available
    Mandarin Chinese      Nov 2021
    Spanish               Dec 2021
    English (British)     TBD
    Cantonese             TBD
    AAVE                  TBD

Pretrained models

Our pretrained models are available at the HuggingFace model hub: https://huggingface.co/charsiu.
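
If you need the raw checkpoints (for example, for offline use), the standard huggingface_hub client can download a whole model repository. The sketch below is an assumption on our part rather than a documented Charsiu workflow; the model ID charsiu/en_w2v2_fc_10ms is the one mentioned in the issue threads further down.

# Sketch: cache a Charsiu checkpoint locally with the standard huggingface_hub client.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id='charsiu/en_w2v2_fc_10ms')  # aligner mentioned in the issues below
print(local_dir)  # path to the downloaded model files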

Dependencies

pytorch
transformers
datasets
librosa
g2pe
praatio
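
A quick way to confirm the environment is complete is to try the imports directly, as in the sketch below. Note that the g2pe entry above most likely refers to the g2pE grapheme-to-phoneme package, which installs from PyPI as g2p_en; that mapping, and the pip command in the comment, are assumptions on our part.

# Environment sanity check for the dependencies listed above.
# Assumed install command: pip install torch transformers datasets librosa g2p_en praatio
import torch          # "pytorch" above
import transformers
import datasets
import librosa
import g2p_en         # the "g2pe" entry; the PyPI package name is g2p_en (assumption)
import praatio

print('all Charsiu dependencies import cleanly')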

Training

Coming soon!

Finetuning

Coming soon!

Attribution and Citation

For now, you can cite this tool as:

@article{zhu2021charsiu,
  title={Phone-to-audio alignment without text: A Semi-supervised Approach},
  author={Zhu, Jian and Zhang, Cong and Jurgens, David},
  journal={arXiv preprint arXiv:????????????????????},
  year={2021}
}

Or share a direct web link: https://github.com/lingjzhu/charsiu/.

References

Transformers
s3prl
Montreal Forced Aligner

Disclaimer

This tool is a beta version and is still under active development. It may have bugs and quirks, alongside the difficulties and provisos described throughout the documentation. This tool is distributed under the MIT license. Please see the license for details.

By using this tool, you acknowledge:

  • That you understand that this tool does not produce perfect, camera-ready data, and that all results should be hand-checked for sanity's sake or, at the very least, interpreted with the expected noise in mind.

  • That you understand that this tool is a work in progress which may contain bugs. Future versions will be released, and bug fixes (and additions) will not necessarily be advertised.

  • That this tool may break with future updates of the various dependencies, and that the authors are not required to repair the package when that happens.

  • That you understand that the authors are not required or necessarily available to fix bugs which are encountered (although you're welcome to submit bug reports to Jian Zhu ([email protected]), if needed), nor to modify the tool to your needs.

  • That you will acknowledge the authors of the tool if you use, modify, fork, or re-use the code in your future work.

  • That rather than re-distributing this tool to other researchers, you will instead advise them to download the latest version from the website.

... and, most importantly:

  • That neither the authors, our collaborators, nor the University of Michigan or any related universities as a whole are responsible for the results obtained from the proper or improper usage of the tool, and that the tool is provided as-is, as a service to our fellow linguists.

All that said, thanks for using our tool, and we hope it works wonderfully for you!

Support or Contact

Please contact Jian Zhu ([email protected]) for technical support.
Contact Cong Zhang ([email protected]) if you would like to receive more instructions on how to use the package.

Comments
  • TextGrid file isn't according to spec

    Could you check whether the .TextGrid files produced are according to spec?

    I'm using this https://github.com/nltk/nltk_contrib/blob/95d1806e2f4e89e960b76a685b1fba2eaa7d5142/nltk_contrib/textgrid.py to test generated TextGrid files. (A praatio-based check is also sketched below.)

    opened by skol101 4
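
    For a second opinion on generated files, praatio (already a Charsiu dependency) can parse TextGrids, so a file that opens cleanly is at least structurally valid. The sketch below assumes praatio 5+ (where openTextgrid takes an includeEmptyIntervals flag) and a hypothetical output file name.

    # Sketch: confirm a generated TextGrid parses with praatio (assumes praatio 5+).
    from praatio import textgrid

    tg = textgrid.openTextgrid('aligned.TextGrid', includeEmptyIntervals=True)  # hypothetical file name
    print(tg.minTimestamp, tg.maxTimestamp)  # basic sanity check on the parsed time range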
  • How do we use the pretrained attention aligner?

    Hi, I find that getting a pretrained predictive aligner (aligner='charsiu/en_w2v2_fc_10ms') to work with LibriSpeech seems straightforward. However, I'm unable to get the attention aligner working - how do I go about initializing the aligner, and how do I get the corresponding BERT config to go with it? It keeps throwing an error.

    opened by vishhvak 1
  • Bug in phoneme to word conversion -- duplicate words

    Something seems to be wrong with how [SIL] is handled in the word transcriptions.

    This is the first example in the LibriSpeech test set.

    Here is the true transcript:

    HE BEGAN A CONFUSED COMPLAINT AGAINST THE WIZARD WHO HAD VANISHED BEHIND THE CURTAIN ON THE LEFT
    

    Here is the forced aligned word transcript:

    array([['0.0', '0.23', '[SIL]'],
           ['0.23', '0.33', 'he'],
           ['0.33', '0.65', 'began'],
           ['0.65', '0.69', 'a'],
           ['0.69', '1.21', 'confused'],
           ['1.21', '1.62', 'complaint'],
           ['1.62', '1.93', 'against'],
           ['1.93', '2.01', 'the'],
           ['2.01', '2.41', 'wizard'],
           ['2.41', '2.56', '[SIL]'],
           ['2.56', '2.57', 'wizard'],
           ['2.57', '2.63', '[SIL]'],
           ['2.63', '2.75', 'who'],
           ['2.75', '2.84', 'had'],
           ['2.84', '3.26', 'vanished'],
           ['3.26', '3.59', 'behind'],
           ['3.59', '3.66', 'the'],
           ['3.66', '4.02', 'curtain'],
           ['4.02', '4.15', 'on'],
           ['4.15', '4.23', 'the'],
           ['4.23', '4.66', 'left'],
           ['4.66', '4.89', '[SIL]']], dtype='<U32')
    

    Here is the forced aligned phonetic transcript:

    array([['0.0', '0.23', '[SIL]'],
           ['0.23', '0.3', 'HH'],
           ['0.3', '0.33', 'IY'],
           ['0.33', '0.39', 'B'],
           ['0.39', '0.44', 'IH'],
           ['0.44', '0.53', 'G'],
           ['0.53', '0.6', 'AE'],
           ['0.6', '0.65', 'N'],
           ['0.65', '0.69', 'AH'],
           ['0.69', '0.77', 'K'],
           ['0.77', '0.81', 'AH'],
           ['0.81', '0.86', 'N'],
           ['0.86', '0.97', 'F'],
           ['0.97', '1.02', 'Y'],
           ['1.02', '1.1', 'UW'],
           ['1.1', '1.16', 'Z'],
           ['1.16', '1.21', 'D'],
           ['1.21', '1.26', 'K'],
           ['1.26', '1.3', 'AH'],
           ['1.3', '1.34', 'M'],
           ['1.34', '1.44', 'P'],
           ['1.44', '1.49', 'L'],
           ['1.49', '1.55', 'EY'],
           ['1.55', '1.58', 'N'],
           ['1.58', '1.62', 'T'],
           ['1.62', '1.66', 'AH'],
           ['1.66', '1.74', 'G'],
           ['1.74', '1.78', 'EH'],
           ['1.78', '1.84', 'N'],
           ['1.84', '1.9', 'S'],
           ['1.9', '1.93', 'T'],
           ['1.93', '1.96', 'DH'],
           ['1.96', '2.01', 'AH'],
           ['2.01', '2.1', 'W'],
           ['2.1', '2.15', 'IH'],
           ['2.15', '2.26', 'Z'],
           ['2.26', '2.34', 'ER'],
           ['2.34', '2.41', 'D'],
           ['2.41', '2.56', '[SIL]'],
           ['2.56', '2.57', 'D'],
           ['2.57', '2.63', '[SIL]'],
           ['2.63', '2.7', 'HH'],
           ['2.7', '2.75', 'UW'],
           ['2.75', '2.78', 'HH'],
           ['2.78', '2.8', 'AE'],
           ['2.8', '2.84', 'D'],
           ['2.84', '2.95', 'V'],
           ['2.95', '3.04', 'AE'],
           ['3.04', '3.09', 'N'],
           ['3.09', '3.15', 'IH'],
           ['3.15', '3.23', 'SH'],
           ['3.23', '3.26', 'T'],
           ['3.26', '3.3', 'B'],
           ['3.3', '3.35', 'IH'],
           ['3.35', '3.43', 'HH'],
           ['3.43', '3.53', 'AY'],
           ['3.53', '3.56', 'N'],
           ['3.56', '3.59', 'D'],
           ['3.59', '3.62', 'DH'],
           ['3.62', '3.66', 'AH'],
           ['3.66', '3.78', 'K'],
           ['3.78', '3.9', 'ER'],
           ['3.9', '3.93', 'T'],
           ['3.93', '3.96', 'AH'],
           ['3.96', '4.02', 'N'],
           ['4.02', '4.09', 'AA'],
           ['4.09', '4.15', 'N'],
           ['4.15', '4.19', 'DH'],
           ['4.19', '4.23', 'AH'],
           ['4.23', '4.36', 'L'],
           ['4.36', '4.47', 'EH'],
           ['4.47', '4.58', 'F'],
           ['4.58', '4.66', 'T'],
           ['4.66', '4.89', '[SIL]']], dtype='<U32')
    

    I suspect this may indicate a general problem with the phoneme-to-word conversion; a quick check for such duplicates is sketched after this report.

    opened by jhkonan 1
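
    A quick way to screen an output for this bug is to scan the word-level array for the same word appearing in consecutive non-[SIL] intervals, as in the 'wizard' example above. The helper below is hypothetical and written against the array layout shown in this report; it is not part of Charsiu.

    # Sketch: flag words repeated across intervals that are separated only by [SIL].
    def find_duplicate_words(word_alignment):
        """word_alignment: rows of (start, end, word) strings, e.g. the array above."""
        rows = [(float(start), float(end), word) for start, end, word in word_alignment]
        spoken = [row for row in rows if row[2] != '[SIL]']
        return [(a, b) for a, b in zip(spoken, spoken[1:]) if a[2] == b[2]]

    # On the example above, this returns the two adjacent 'wizard' intervals.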
  • Can't currently support long audio?

    “Charsiu works the best when your files are shorter than 15 ms. Test whether your files are longer than 15ms”

    I saw this hint in the description and tested it.

    When force-aligning long audio, the following error appears:

    Traceback (most recent call last):
      File "test.py", line 31, in <module>
        charsiu.align(audio=audio, text=text)
      File "E:\***/python/charsiu/charsiu/src\Charsiu.py", line 157, in align
        pred_words = self.charsiu_processor.align_words(pred_phones,phones,words)
      File "E:\***/python/charsiu/charsiu/src\processors.py", line 417, in align_words
        word_dur.append((dur,words_rep[count])) #((start,end,phone),word)
    IndexError: list index out of range

    opened by wxbool 3