Charsiu: A transformer-based phonetic aligner

Overview

Charsiu: A transformer-based phonetic aligner [arXiv]

Note. This is a preview version. The aligner is under active development. New functions, new languages and detailed documentation will be added soon!

Intro

Charsiu is a phonetic alignment tool that can:

  • recognise phonemes in a given audio file
  • perform forced alignment using phone transcriptions created in the previous step or provided by the user
  • directly predict the phone-to-audio alignment from the audio alone (text-independent alignment)

Fun fact: Char Siu is one of the most representative dishes of Cantonese cuisine 🍲 (see wiki).

Tutorial (In progress)

You can directly run our model in the cloud via Google Colab!

  • Forced alignment: Open In Colab
  • Textless alignment: Open In Colab
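
If you prefer to run the aligner locally, a rough sketch is shown below. The charsiu.align(audio=..., text=...) call and the charsiu/en_w2v2_fc_10ms model ID are taken from the issue threads at the bottom of this page; the class name charsiu_forced_aligner and the import path are assumptions on our part, so treat the Colab notebooks above as the authoritative reference.

# Minimal sketch of local forced alignment. The align() call and the model ID appear in the
# issues below; the class name charsiu_forced_aligner is assumed (check src/Charsiu.py).
import sys
sys.path.append('charsiu/src')  # assumed: put the repo's src/ directory on the path
from Charsiu import charsiu_forced_aligner

# Load a pretrained English frame-classification aligner from the HuggingFace model hub.
charsiu = charsiu_forced_aligner(aligner='charsiu/en_w2v2_fc_10ms')

# Align a short recording against its transcript.
alignment = charsiu.align(audio='example.wav',
                          text='He began a confused complaint against the wizard')
print(alignment)  # expected: interval-style (start, end, label) entries, as in the issue examples below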

Development plan

  • Package

    Item                  Progress
    Documentation         Nov 2021
    TextGrid support      Nov 2021
    Model compression     TBD

  • Multilingual support

    Language              Progress
    English (American)    Available
    Mandarin Chinese      Nov 2021
    Spanish               Dec 2021
    English (British)     TBD
    Cantonese             TBD
    AAVE                  TBD

Pretrained models

Our pretrained models are available at the HuggingFace model hub: https://huggingface.co/charsiu.
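
If you need the raw checkpoints (for example, for offline use), the standard huggingface_hub client can download a whole model repository. The sketch below is an assumption on our part rather than a documented Charsiu workflow; the model ID charsiu/en_w2v2_fc_10ms is the one mentioned in the issue threads further down.

# Sketch: cache a Charsiu checkpoint locally with the standard huggingface_hub client.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id='charsiu/en_w2v2_fc_10ms')  # aligner mentioned in the issues below
print(local_dir)  # path to the downloaded model files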

Dependencies

pytorch
transformers
datasets
librosa
g2pe
praatio
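
A quick way to confirm the environment is complete is to try the imports directly, as in the sketch below. Note that the g2pe entry above most likely refers to the g2pE grapheme-to-phoneme package, which installs from PyPI as g2p_en; that mapping, and the pip command in the comment, are assumptions on our part.

# Environment sanity check for the dependencies listed above.
# Assumed install command: pip install torch transformers datasets librosa g2p_en praatio
import torch          # "pytorch" above
import transformers
import datasets
import librosa
import g2p_en         # the "g2pe" entry; the PyPI package name is g2p_en (assumption)
import praatio

print('all Charsiu dependencies import cleanly')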

Training

Coming soon!

Finetuning

Coming soon!

Attribution and Citation

For now, you can cite this tool as:

@article{zhu2021charsiu,
  title={Phone-to-audio alignment without text: A Semi-supervised Approach},
  author={Zhu, Jian and Zhang, Cong and Jurgens, David},
  journal={arXiv preprint arXiv:????????????????????},
  year={2021}
}

Or share a direct web link: https://github.com/lingjzhu/charsiu/.

References

Transformers
s3prl
Montreal Forced Aligner

Disclaimer

This tool is a beta version and is still under active development. It may have bugs and quirks, alongside the difficulties and provisos described throughout the documentation. This tool is distributed under the MIT license. Please see the license for details.

By using this tool, you acknowledge:

  • That you understand that this tool does not produce perfect, camera-ready data, and that all results should be hand-checked for sanity's sake or, at the very least, interpreted with the expected noise in mind.

  • That you understand that this tool is a work in progress which may contain bugs. Future versions will be released, and bug fixes (and additions) will not necessarily be advertised.

  • That this tool may break with future updates of the various dependencies, and that the authors are not required to repair the package when that happens.

  • That you understand that the authors are not required or necessarily available to fix bugs which are encountered (although you're welcome to submit bug reports to Jian Zhu ([email protected]), if needed), nor to modify the tool to your needs.

  • That you will acknowledge the authors of the tool if you use, modify, fork, or re-use the code in your future work.

  • That rather than re-distributing this tool to other researchers, you will instead advise them to download the latest version from the website.

... and, most importantly:

  • That neither the authors, our collaborators, nor the University of Michigan or any related universities as a whole are responsible for the results obtained from the proper or improper usage of the tool, and that the tool is provided as-is, as a service to our fellow linguists.

All that said, thanks for using our tool, and we hope it works wonderfully for you!

Support or Contact

Please contact Jian Zhu ([email protected]) for technical support.
Contact Cong Zhang ([email protected]) if you would like to receive more instructions on how to use the package.

Comments
  • TextGrid file isn't according to spec

    Could you check whether the .TextGrid files produced are according to spec?

    I'm using this https://github.com/nltk/nltk_contrib/blob/95d1806e2f4e89e960b76a685b1fba2eaa7d5142/nltk_contrib/textgrid.py to test generated TextGrid files. (A praatio-based check is also sketched below.)

    opened by skol101 4
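
    For a second opinion on generated files, praatio (already a Charsiu dependency) can parse TextGrids, so a file that opens cleanly is at least structurally valid. The sketch below assumes praatio 5+ (where openTextgrid takes an includeEmptyIntervals flag) and a hypothetical output file name.

    # Sketch: confirm a generated TextGrid parses with praatio (assumes praatio 5+).
    from praatio import textgrid

    tg = textgrid.openTextgrid('aligned.TextGrid', includeEmptyIntervals=True)  # hypothetical file name
    print(tg.minTimestamp, tg.maxTimestamp)  # basic sanity check on the parsed time range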
  • How do we use the pretrained attention aligner?

    Hi, I find that getting a pretrained predictive aligner (aligner='charsiu/en_w2v2_fc_10ms') to work with LibriSpeech seems straightforward. However, I'm unable to get the attention aligner working - how do I go about initializing the aligner, and how do I get the corresponding BERT config to go with it? It keeps throwing an error.

    opened by vishhvak 1
  • Bug in phoneme to word conversion -- duplicate words

    Something seems to be wrong with how [SIL] is handled in the word transcriptions.

    This is the first example in the LibriSpeech test set.

    Here is the true transcript:

    HE BEGAN A CONFUSED COMPLAINT AGAINST THE WIZARD WHO HAD VANISHED BEHIND THE CURTAIN ON THE LEFT
    

    Here is the forced aligned word transcript:

    array([['0.0', '0.23', '[SIL]'],
           ['0.23', '0.33', 'he'],
           ['0.33', '0.65', 'began'],
           ['0.65', '0.69', 'a'],
           ['0.69', '1.21', 'confused'],
           ['1.21', '1.62', 'complaint'],
           ['1.62', '1.93', 'against'],
           ['1.93', '2.01', 'the'],
           ['2.01', '2.41', 'wizard'],
           ['2.41', '2.56', '[SIL]'],
           ['2.56', '2.57', 'wizard'],
           ['2.57', '2.63', '[SIL]'],
           ['2.63', '2.75', 'who'],
           ['2.75', '2.84', 'had'],
           ['2.84', '3.26', 'vanished'],
           ['3.26', '3.59', 'behind'],
           ['3.59', '3.66', 'the'],
           ['3.66', '4.02', 'curtain'],
           ['4.02', '4.15', 'on'],
           ['4.15', '4.23', 'the'],
           ['4.23', '4.66', 'left'],
           ['4.66', '4.89', '[SIL]']], dtype='<U32')
    

    Here is the forced aligned phonetic transcript:

    array([['0.0', '0.23', '[SIL]'],
           ['0.23', '0.3', 'HH'],
           ['0.3', '0.33', 'IY'],
           ['0.33', '0.39', 'B'],
           ['0.39', '0.44', 'IH'],
           ['0.44', '0.53', 'G'],
           ['0.53', '0.6', 'AE'],
           ['0.6', '0.65', 'N'],
           ['0.65', '0.69', 'AH'],
           ['0.69', '0.77', 'K'],
           ['0.77', '0.81', 'AH'],
           ['0.81', '0.86', 'N'],
           ['0.86', '0.97', 'F'],
           ['0.97', '1.02', 'Y'],
           ['1.02', '1.1', 'UW'],
           ['1.1', '1.16', 'Z'],
           ['1.16', '1.21', 'D'],
           ['1.21', '1.26', 'K'],
           ['1.26', '1.3', 'AH'],
           ['1.3', '1.34', 'M'],
           ['1.34', '1.44', 'P'],
           ['1.44', '1.49', 'L'],
           ['1.49', '1.55', 'EY'],
           ['1.55', '1.58', 'N'],
           ['1.58', '1.62', 'T'],
           ['1.62', '1.66', 'AH'],
           ['1.66', '1.74', 'G'],
           ['1.74', '1.78', 'EH'],
           ['1.78', '1.84', 'N'],
           ['1.84', '1.9', 'S'],
           ['1.9', '1.93', 'T'],
           ['1.93', '1.96', 'DH'],
           ['1.96', '2.01', 'AH'],
           ['2.01', '2.1', 'W'],
           ['2.1', '2.15', 'IH'],
           ['2.15', '2.26', 'Z'],
           ['2.26', '2.34', 'ER'],
           ['2.34', '2.41', 'D'],
           ['2.41', '2.56', '[SIL]'],
           ['2.56', '2.57', 'D'],
           ['2.57', '2.63', '[SIL]'],
           ['2.63', '2.7', 'HH'],
           ['2.7', '2.75', 'UW'],
           ['2.75', '2.78', 'HH'],
           ['2.78', '2.8', 'AE'],
           ['2.8', '2.84', 'D'],
           ['2.84', '2.95', 'V'],
           ['2.95', '3.04', 'AE'],
           ['3.04', '3.09', 'N'],
           ['3.09', '3.15', 'IH'],
           ['3.15', '3.23', 'SH'],
           ['3.23', '3.26', 'T'],
           ['3.26', '3.3', 'B'],
           ['3.3', '3.35', 'IH'],
           ['3.35', '3.43', 'HH'],
           ['3.43', '3.53', 'AY'],
           ['3.53', '3.56', 'N'],
           ['3.56', '3.59', 'D'],
           ['3.59', '3.62', 'DH'],
           ['3.62', '3.66', 'AH'],
           ['3.66', '3.78', 'K'],
           ['3.78', '3.9', 'ER'],
           ['3.9', '3.93', 'T'],
           ['3.93', '3.96', 'AH'],
           ['3.96', '4.02', 'N'],
           ['4.02', '4.09', 'AA'],
           ['4.09', '4.15', 'N'],
           ['4.15', '4.19', 'DH'],
           ['4.19', '4.23', 'AH'],
           ['4.23', '4.36', 'L'],
           ['4.36', '4.47', 'EH'],
           ['4.47', '4.58', 'F'],
           ['4.58', '4.66', 'T'],
           ['4.66', '4.89', '[SIL]']], dtype='<U32')
    

    I suspect this may indicate a general problem with the phoneme-to-word conversion; a quick check for such duplicates is sketched after this report.

    opened by jhkonan 1
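
    A quick way to screen an output for this bug is to scan the word-level array for the same word appearing in consecutive non-[SIL] intervals, as in the 'wizard' example above. The helper below is hypothetical and written against the array layout shown in this report; it is not part of Charsiu.

    # Sketch: flag words repeated across intervals that are separated only by [SIL].
    def find_duplicate_words(word_alignment):
        """word_alignment: rows of (start, end, word) strings, e.g. the array above."""
        rows = [(float(start), float(end), word) for start, end, word in word_alignment]
        spoken = [row for row in rows if row[2] != '[SIL]']
        return [(a, b) for a, b in zip(spoken, spoken[1:]) if a[2] == b[2]]

    # On the example above, this returns the two adjacent 'wizard' intervals.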
  • Can't currently support long audio?

    “Charsiu works the best when your files are shorter than 15 ms. Test whether your files are longer than 15ms”

    I saw this hint in the description and tested it.

    When force-aligning long audio, the following error appears:

    Traceback (most recent call last):
      File "test.py", line 31, in <module>
        charsiu.align(audio=audio, text=text)
      File "E:\***/python/charsiu/charsiu/src\Charsiu.py", line 157, in align
        pred_words = self.charsiu_processor.align_words(pred_phones,phones,words)
      File "E:\***/python/charsiu/charsiu/src\processors.py", line 417, in align_words
        word_dur.append((dur,words_rep[count])) #((start,end,phone),word)
    IndexError: list index out of range

    opened by wxbool 3