Common Voice Utils

This repository collects basic linguistic processing data and utilities for working with dataset dumps from the Common Voice project. It aims to be a one-stop shop for utilities and data useful in training ASR and TTS systems.

Tools

  • Phonemiser:
    • A rudimentary grapheme to phoneme (g2p) system based on either:
      • a deterministic longest-match left-to-right replacement of orthographic units; or
      • a weighted finite-state transducer
  • Validator:
    • A validation and normalisation script.
    • It checks whether a sentence can be converted and, if so, normalises the encoding, removes punctuation and returns the result.
  • Alphabet:
    • The relevant alphabet of the language, appropriate for use in training ASR
  • Segmenter:
    • A deterministic sentence segmentation algorithm tuned for segmenting paragraphs from Wikipedia
  • Corpora:
    • Contains metadata for different corpora you may be interested in using with Common Voice
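
The first phonemiser strategy listed above (deterministic longest-match, left-to-right replacement) can be sketched in a few lines. This is only an illustration of the technique, not the repository's implementation: the function name `phonemise_longest_match` and the toy rule table are assumptions made up for this example, and the real rules live in the package's data files.

```python
def phonemise_longest_match(token, table):
    """Greedy left-to-right rewrite: at each position apply the longest matching rule."""
    # The longest key bounds how far ahead we ever need to look.
    max_len = max((len(key) for key in table), default=1)
    out = []
    i = 0
    while i < len(token):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(token) - i), 0, -1):
            chunk = token[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No rule matched here: pass the character through unchanged.
            out.append(token[i])
            i += 1
    return "".join(out)


# Hypothetical toy rules, NOT taken from the repository's data files.
TOY_TABLE = {"ou": "u", "j": "ʒ", "ch": "ʃ", "c'h": "x"}

print(phonemise_longest_match("implijout", TOY_TABLE))
```

Trying longer units first is what lets a multi-character unit such as `c'h` win over any shorter rule that starts at the same position.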

Installation

The easiest way is with pip:

$ pip install git+https://github.com/ftyers/commonvoice-utils.git

How to use it

Command line tool

There is a command line tool, covo /ˈkəʊvəʊ/, which aims to expose much of the functionality through the command line. Some examples follow:

Process a Wikipedia dump

Use a Wikipedia dump to get text for a language model in the right format:

$ covo dump mtwiki-latest-pages-articles.xml.bz2 | covo segment mt | covo norm mt
x'inhi l-wikipedija
il-wikipedija hi mmexxija mill-fondazzjoni wikimedia fondazzjoni mingħajr fini ta' lukru li tospita proġetti oħra b'kontenut ħieles u multilingwi
il-malti huwa l-ilsien nazzjonali tar-repubblika ta' malta
huwa l-ilsien uffiċjali flimkien mal-ingliż kif ukoll wieħed mill-ilsna uffiċjali tal-unjoni ewropea
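
The segment-then-normalise part of this pipeline can also be expressed in Python. The sketch below takes the two steps as plain callables so that it runs on its own; the helper name `paragraphs_to_lm_lines` and the two stand-in functions are assumptions for illustration. With the package installed, one would presumably pass `Segmenter('mt').segment` and `Validator('mt').validate` instead (both are described in the Python module section).

```python
def paragraphs_to_lm_lines(paragraphs, segment, validate):
    """Yield one normalised sentence per line, dropping sentences that fail validation."""
    for para in paragraphs:
        for sent in segment(para):
            norm = validate(sent)
            if norm is not None:  # None signals that the sentence could not be validated
                yield norm


# Stand-in callables so the sketch is runnable; these are NOT the cvutils implementations.
def toy_segment(para):
    return [s.strip() for s in para.split(".") if s.strip()]


def toy_validate(sent):
    # Reject anything containing digits, otherwise lowercase it.
    return None if any(ch.isdigit() for ch in sent) else sent.lower()


lines = list(paragraphs_to_lm_lines(
    ["Il-Malti huwa ilsien. Sena 2004."], toy_segment, toy_validate))
print(lines)
```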

Query the OPUS corpus collection

Get a list of URLs for a particular language from the OPUS corpus collection:

$ covo opus mt | sort -gr
23859 documents,69.4M tokens	https://object.pouta.csc.fi/OPUS-DGT/v2019/mono/mt.txt.gz
8665 documents,25.8M tokens	https://object.pouta.csc.fi/OPUS-JRC-Acquis/v3.0/mono/mt.txt.gz
5388 documents,8.9M tokens	https://object.pouta.csc.fi/OPUS-JW300/v1b/mono/mt.txt.gz
...
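
If the list needs further processing, each output line can be parsed into counts and a URL. The sketch below is based only on the three sample lines shown above; the function name `parse_opus_line` and support for `K`/`G` suffixes are assumptions, since only `M` appears in the sample.

```python
import re

# Pattern inferred from the sample output: "<n> documents,<x>M tokens<whitespace><url>"
LINE_RE = re.compile(r"^(\d+) documents,([\d.]+)([KMG]?) tokens\s+(\S+)$")
MULTIPLIER = {"": 1, "K": 1_000, "M": 1_000_000, "G": 1_000_000_000}


def parse_opus_line(line):
    """Return (documents, approx_tokens, url), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    docs, number, suffix, url = match.groups()
    # round() avoids off-by-one results from floating point multiplication.
    return int(docs), round(float(number) * MULTIPLIER[suffix]), url


sample = "8665 documents,25.8M tokens\thttps://object.pouta.csc.fi/OPUS-JRC-Acquis/v3.0/mono/mt.txt.gz"
print(parse_opus_line(sample))
```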

Convert grapheme input to phonemes

Get the grapheme to phoneme output for some arbitrary input:

$ echo "euskal herrian euskaraz" | covo phon eu
eus̺kal erian eus̺kaɾas̻

$ echo "قايتا نىشان بەلگىلەش ئورنى ئۇيغۇرچە ۋىكىپىدىيە" | covo phon ug
qɑjtɑ nɪʃɑn bɛlɡɪlɛʃ ornɪ ujʁurtʃɛ vɪkɪpɪdɪjɛ

Export data for use in Coqui STT

Designed for use with Coqui STT, this command converts the clips to 16kHz mono-channel PCM .wav files and runs the transcripts through the validation step. In addition, it outputs a .csv file for each of the input .tsv files.

$ covo export myv cv-corpus-8.0-2022-01-19/myv/
Loading TSV file:  cv-corpus-8.0-2022-01-19/myv/test.tsv
  Importing mp3 files...
  Imported 292 samples.
  Skipped 2 samples that were longer than 10 seconds.
  Final amount of imported audio: 0:27:03 from 0:27:23.
  Saving new Coqui STT-formatted CSV file to:  cv-corpus-8.0-2022-01-19/myv/clips/test.csv
  Writing CSV file for train.py as:  cv-corpus-8.0-2022-01-19/myv/clips/test.csv
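
To sanity-check exported audio, the standard library `wave` module is enough. The sketch below tests the properties mentioned above (16kHz, mono) plus the 10-second limit seen in the log. It is an independent check, not how covo itself works, and the names `wav_info` and `is_exportable` are hypothetical.

```python
import io
import wave


def wav_info(fileobj):
    """Return (sample_rate, channels, duration_seconds) for a PCM .wav file."""
    with wave.open(fileobj, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def is_exportable(fileobj, max_seconds=10.0):
    """True if the clip is 16kHz mono and no longer than max_seconds."""
    rate, channels, duration = wav_info(fileobj)
    return rate == 16000 and channels == 1 and duration <= max_seconds


# Build one second of 16-bit silence in memory to exercise the check.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)
buf.seek(0)

print(is_exportable(buf))
```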

Python module

The code can also be used as a Python module. Here are some examples:

Alphabet

Returns an alphabet appropriate for end-to-end speech recognition.

>>> from cvutils import Alphabet
>>> a = Alphabet('cv')
>>> a.get_alphabet()
' -абвгдежзийклмнопрстуфхцчшщыэюяёҫӑӗӳ'
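
One possible use of the alphabet string is to flag transcript characters that fall outside it. The sketch below hard-codes the Chuvash alphabet shown above so that it runs without the package; the helper name `out_of_alphabet` is an assumption, not part of cvutils.

```python
def out_of_alphabet(transcript, alphabet):
    """Return the sorted list of distinct characters in transcript that are not in alphabet."""
    return sorted({ch for ch in transcript if ch not in alphabet})


# Copied from the example output above; in practice this would come from get_alphabet().
CHUVASH_ALPHABET = ' -абвгдежзийклмнопрстуфхцчшщыэюяёҫӑӗӳ'

# "\u00e7" is Latin c with cedilla, which only looks like the Cyrillic letter.
print(out_of_alphabet("\u00e7апла", CHUVASH_ALPHABET))
print(out_of_alphabet("тута", CHUVASH_ALPHABET))
```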

Corpora

Some miscellaneous tools for working with corpora:

>>> from cvutils import Corpora
>>> c = Corpora('kpv')
>>> c.dump_url()
'https://dumps.wikimedia.org/kvwiki/latest/kvwiki-latest-pages-articles.xml.bz2'
>>> c.target_segments()
[]
>>> c = Corpora('cv')
>>> c.target_segments()
['нуль', 'пӗрре', 'иккӗ', 'виҫҫӗ', 'тӑваттӑ', 'пиллӗк', 'улттӑ', 'ҫиччӗ', 'саккӑр', 'тӑххӑр', 'ҫапла', 'ҫук']
>>> c.dump_url()
'https://dumps.wikimedia.org/cvwiki/latest/cvwiki-latest-pages-articles.xml.bz2'

Grapheme to phoneme

For a given token, return an approximate broad phonemised version of it.

>>> from cvutils import Phonemiser
>>> p = Phonemiser('ab')
>>> p.phonemise('гӏапынхъамыз')
'ʕapənqaməz'

>>> p = Phonemiser('br')
>>> p.phonemise("implijout")
'impliʒut'

Validator

For a given input sentence/utterance, the validator returns either a validated and normalised version of the string according to the validation rules, or None if the string cannot be validated.

>>> from cvutils import Validator
>>> v = Validator('ab')
>>> v.validate('Аллаҳ хаҵеи-ԥҳәыси иеилыхны, аҭыԥҳацәа роума иалихыз?')
'аллаҳ хаҵеи-ԥҳәыси иеилыхны аҭыԥҳацәа роума иалихыз'

>>> v = Validator('br')
>>> v.validate('Ha cʼhoant hocʼh eus da gendercʼhel da implijout ar servijer-mañ ?')
"ha c'hoant hoc'h eus da genderc'hel da implijout ar servijer-mañ"

Sentence segmentation

Mostly designed for use with Wikipedia, takes a paragraph and returns a list of the sentences found within it.

>>> from cvutils import Segmenter 
>>> s = Segmenter('br')
>>> para = "Peurliesañ avat e kemm ar vogalennoù e c'hengerioù evit dont da vezañ heñvel ouzh ar vogalennoù en nominativ (d.l.e. ar stumm-meneg), da skouer e hungareg: Aour, tungsten, zink, uraniom, h.a., a vez kavet e kondon Bouryatia. A-bouez-bras evit armerzh ar vro eo al labour-douar ivez pa vez gounezet gwinizh ha legumaj dreist-holl. A-hend-all e vez gounezet arc'hant dre chaseal ha pesketa."
>>> for sent in s.segment(para):
...     print(sent)
... 
Peurliesañ avat e kemm ar vogalennoù e c'hengerioù evit dont da vezañ heñvel ouzh ar vogalennoù en nominativ (d.l.e. ar stumm-meneg), da skouer e hungareg: Aour, tungsten, zink, uraniom, h.a., a vez kavet e kondon Bouryatia.
A-bouez-bras evit armerzh ar vro eo al labour-douar ivez pa vez gounezet gwinizh ha legumaj dreist-holl.
A-hend-all e vez gounezet arc'hant dre chaseal ha pesketa.
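
To give a feel for why abbreviations such as "d.l.e." do not end a sentence in the output above, here is a much-simplified sketch of abbreviation-aware splitting. It is not the algorithm used by `Segmenter`; the function name `split_sentences` and the abbreviation set passed in are assumptions for illustration.

```python
import re


def split_sentences(paragraph, abbreviations=frozenset()):
    """Split after . ! or ? followed by whitespace, unless the preceding token is an abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    sentences = []
    for piece in pieces:
        if sentences:
            # Last token of the previous piece, ignoring an opening bracket or quote.
            last_token = sentences[-1].split()[-1].lstrip("(\"'")
            if last_token in abbreviations:
                # The split was a false positive: glue the pieces back together.
                sentences[-1] += " " + piece
                continue
        sentences.append(piece)
    return sentences


text = "Gwelet (d.l.e. ar stumm-meneg) amañ. Setu an eil frazenn."
for sent in split_sentences(text, {"d.l.e.", "h.a."}):
    print(sent)
```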

Language support

Language Autonym Code (CV) (WP) Phon Valid Alphabet Segment
Abkhaz Аԥсуа abk ab
Amharic አማርኛ amh am
Arabic اَلْعَرَبِيَّةُ ara ar ar
Assamese অসমীয়া asm as as
Azeri Azərbaycanca aze az az
Bashkort Башҡортса bak ba ba
Basaa Basaa bas bas
Belarusian Беларуская мова bel be be
Bengali বাংলা ben bn bn
Breton Brezhoneg bre br br
Bulgarian Български bul bg bg
Catalan Català cat ca ca
Czech Čeština ces cs cs
Chukchi Ԓыгъоравэтԓьэн ckt
Chuvash Чӑвашла chv cv cv
Hakha Chin Hakha Lai cnh cnh
Welsh Cymraeg cym cy cy
Dhivehi ދިވެހި div dv dv
Greek Ελληνικά ell el el
Danish Dansk dan da da
German Deutsch deu de de
English English eng en en
Esperanto Esperanto epo eo eo
Ewe Eʋegbe ewe ee ee
Spanish Español spa es es
Erzya Эрзянь кель myv myv myv
Estonian Eesti est et et
Basque Euskara eus eu eu
Persian فارسی pes fa fa
Finnish Suomi fin fi fi
French Français fra fr fr
Frisian Frysk fry fy-NL fy
Igbo Ásụ̀sụ́ Ìgbò ibo ig ig
Irish Gaeilge gle ga-IE ga
Galician Galego glg gl gl
Guaraní Avañeʼẽ gug gn gn
Hindi हिन्दी hin hi hi
Hausa Harshen Hausa hau ha ha
Upper Sorbian Hornjoserbšćina hsb hsb hsb
Hungarian Magyar nyelv hun hu hu
Armenian Հայերեն hye hy-AM hy
Interlingua Interlingua ina ia ia
Indonesian Bahasa indonesia ind id id
Icelandic Íslenska isl is is
Italian Italiano ita it it
Japanese 日本語 jpn ja ja
Georgian ქართული ენა kat ka ka
Kabyle Taqbaylit kab kab kab
Kazakh Қазақша kaz kk kk
Kikuyu Gĩgĩkũyũ kik ki ki
Kyrgyz Кыргызча kir ky ky
Kurmanji Kurdish Kurmancî kmr ku ku
Sorani Kurdish سۆرانی ckb ckb ckb
Komi-Zyrian Коми кыв kpv kv kv
Luganda Luganda lug lg lg
Lithuanian Lietuvių kalba lit lt lt
Lingala Lingála lin ln ln
Latvian Latviešu valoda lvs lv lv
Luo Dholuo luo luo
Macedonian Македонски mkd mk mk
Malayalam മലയാളം mal ml ml
Marathi मराठी mar mr mr
Mongolian Монгол хэл khk mn mn
Moksha Мокшень кяль mdf mdf mdf
Maltese Malti mlt mt mt
Dutch Nederlands nld nl nl
Chewa Chichewa nya ny ny
Norwegian Nynorsk Nynorsk nno nn-NO nn
Oriya ଓଡ଼ିଆ ori or or
Punjabi ਪੰਜਾਬੀ pan pa-IN pa
Polish Polski pol pl pl
Portuguese Português por pt pt
Kʼicheʼ Kʼicheʼ quc
Romansch (Sursilvan) Romontsch roh rm-sursilv rm
Romansch (Vallader) Rumantsch roh rm-vallader rm
Romanian Românește ron ro ro
Russian Русский rus ru ru
Kinyarwanda Kinyarwanda kin rw rw
Sakha Саха тыла sah sah sah
Santali ᱥᱟᱱᱛᱟᱲᱤ sat sat sat
Serbian Srpski srp sr sr
Slovak Slovenčina slk sk sk
Slovenian Slovenščina slv sl sl
Swahili Kiswahili swa sw
Swedish Svenska swe sv-SE sv
Tamil தமிழ் tam ta ta
Thai ภาษาไทย tha th th
Turkish Türkçe tur tr tr
Tatar Татар теле tat tt tt
Twi Twi tw tw tw
Ukrainian Українська мова ukr uk uk
Urdu اُردُو urd ur ur
Uyghur ئۇيغۇر تىلى uig ug ug
Uzbek Oʻzbekcha uzb uz uz
Vietnamese Tiếng Việt vie vi vi
Votic Vaďďa tšeeli vot vot
Wolof Wolof wol wo
Yoruba Èdè Yorùbá yor yo
Chinese (China) 中文 cmn zh-CN zh
Chinese (Hong Kong) 中文 cmn zh-HK zh
Chinese (Taiwan) 中文 cmn zh-TW zh

Frequently asked questions

Why not use [insert better system] for [insert task here] ?

There are potentially a lot of better language-specific systems for doing these tasks, but each one has a slightly different API, so if you want to support all the Common Voice languages or even a reasonable subset you have to learn and use the same number of language-specific APIs.

The idea of these utilities is to provide adequate implementations of things that are likely to be useful when working with all the languages in Common Voice. If you are working on a single language or have a specific setup or are using more data than just Common Voice, maybe this isn't for you. But if you want to just train coqui-ai/STT on Common Voice, then maybe it is :)

Why not just make the alphabet from the transcripts ?

Depending on the language in Common Voice, the transcripts can contain a lot of random punctuation, numerals, and incorrect character encodings (for example Latin ç instead of Cyrillic ҫ for Chuvash). These may look the same but will result in greater sparsity for the model. Additionally, some languages may have several encodings of the same character, such as the apostrophe. These will ideally be normalised before training.

Also, if you are working with a single language you probably have time to look through all the transcripts for the alphabetic symbols, but if you want to work with a large number of Common Voice languages at the same time it's useful to have them all in one place.
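
The look-alike problem mentioned above is easy to see with the standard library `unicodedata` module. The sketch below prints the names of the two characters and folds a few apostrophe variants into one. The choice of variants and of ASCII `'` as the target is an assumption for illustration; the appropriate target may differ per language.

```python
import unicodedata

# Latin c with cedilla versus the Cyrillic letter used in Chuvash.
for ch in ("\u00e7", "\u04ab"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Hypothetical folding table: several apostrophe look-alikes mapped to ASCII '.
APOSTROPHES = str.maketrans({"\u2019": "'", "\u02bc": "'", "\u2018": "'"})


def fold_apostrophes(text):
    """Replace apostrophe variants with a single code point."""
    return text.translate(APOSTROPHES)


print(fold_apostrophes("c\u02bchoant"))
```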

Hey aren't some of those languages not in Common Voice ?

That's right, some of the languages are either not in Common Voice (yet!) or are in Common Voice but have not been released yet. If I've been working with them I've included them anyway.

See also

  • epitran: Great grapheme to phoneme system that supports a wide range of languages.

Licence

All the code, aside from that explicitly licensed under a different licence, is licensed under the AGPL v3.0.

Acknowledgements

Comments
  • Ukrainian needs apostrophe

    the apostrophe is needed to write Ukrainian, as in ім'я ("name")

    https://en.wiktionary.org/wiki/%D1%96%D0%BC%27%D1%8F#Ukrainian

    h/t @robinhad

    opened by JRMeyer 6
  • on the Validator

    Hi, thanks for the practical toolkit for CV data preprocessing!

    I recently used this toolkit to validate data for different languages, but found that the Validator failed to initialize. After checking the code, I found that the initialization of Validator requires data/$lang/validate.tsv to be present.

    Thus my questions are: 1) Will the missing data be added soon? and 2) How can I prepare the data/$lang/validate.tsv file from scratch?

    Thanks in advance!

    opened by wenjie-p 5
  • added ß to German alphabet file

    The letter ß is part of the German alphabet and definitely part of the German corpus. It is worth noting that this letter is not used in the Swiss version of High German.

    opened by stefangrotz 4
  • Issue with encoding during setup in Windows

    This happened when I tried to pip install it in the Windows Anaconda cmd (Windows 10 US English version, Python 3.9 and 3.8 tested).

    c:\Users\xxxx> pip install git+https://github.com/ftyers/commonvoice-utils.git
    Collecting git+https://github.com/ftyers/commonvoice-utils.git
      Cloning https://github.com/ftyers/commonvoice-utils.git to c:\temp1\pip-req-build-06f01lti
      Running command git clone -q https://github.com/ftyers/commonvoice-utils.git 'C:\TEMP1\pip-req-build-06f01lti'
      Resolved https://github.com/ftyers/commonvoice-utils.git to commit c738e7f8031cd2e1ca83fdbd6dd3e8a5db1ad583
        ERROR: Command errored out with exit status 1:
         command: 'D:\Anaconda\Anaconda3\envs\p39_c112\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\TEMP1\\pip-req-build-06f01lti\\setup.py'"'"'; __file__='"'"'C:\\TEMP1\\pip-req-build-06f01lti\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\TEMP1\pip-pip-egg-info-mvngstdl'
             cwd: C:\TEMP1\pip-req-build-06f01lti\
        Complete output (9 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "C:\TEMP1\pip-req-build-06f01lti\setup.py", line 8, in <module>
            README = (HERE / "README.md").read_text()
          File "D:\Anaconda\Anaconda3\envs\p39_c112\lib\pathlib.py", line 1267, in read_text
            return f.read()
          File "D:\Anaconda\Anaconda3\envs\p39_c112\lib\encodings\cp1252.py", line 23, in decode
            return codecs.charmap_decode(input,self.errors,decoding_table)[0]
        UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2697: character maps to <undefined>
        ----------------------------------------
    WARNING: Discarding git+https://github.com/ftyers/commonvoice-utils.git. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    

    It seems setup.py needs to configure encoding as UTF-8, but I'm a noob here...

    opened by HarikalarKutusu 2
  • Chatino tones are written with superscript letters

    At the moment the alphabet for Chatino includes sequences of numerals for the tone characters. This is an artefact of the original dataset used to generate the data. The official orthography uses superscript uppercase letters.

    It should be possible to use Unicode superscript letters and implement the conversion within covo, but first we need a mapping from sequences of numerals to superscript uppercase letters.

    opened by ftyers 2
  • Transliterator module missing

    Looks like you forgot to commit it.

    ~$ python3 -m  pip install git+https://github.com/ftyers/commonvoice-utils.git
    Defaulting to user installation because normal site-packages is not writeable
    Collecting git+https://github.com/ftyers/commonvoice-utils.git
      Cloning https://github.com/ftyers/commonvoice-utils.git to /tmp/pip-req-build-4ptidukg
      Running command git clone -q https://github.com/ftyers/commonvoice-utils.git /tmp/pip-req-build-4ptidukg
    Building wheels for collected packages: commonvoice-utils
      Building wheel for commonvoice-utils (setup.py) ... done
      Created wheel for commonvoice-utils: filename=commonvoice_utils-0.2.7-py3-none-any.whl size=142813 sha256=541d42fa2c786d602f4ca04e6f1ad8848a57ded5376f69a19629b1b602577fc7
      Stored in directory: /tmp/pip-ephem-wheel-cache-x_86ocb3/wheels/56/67/73/4bf2d8a681334251a44405673d52e767f646121bbd89c8b7fa
    Successfully built commonvoice-utils
    Installing collected packages: commonvoice-utils
    Successfully installed commonvoice-utils-0.2.7
    ~$ python3 -c "from cvutils import Alphabet"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/selimcan/.local/lib/python3.8/site-packages/cvutils/__init__.py", line 10, in <module>
        from transliterator import Transliterator
    ModuleNotFoundError: No module named 'transliterator'
    

    I don't think it's this package https://pypi.org/project/transliterator/ that's required ?

    opened by IlnarSelimcan 2
  • ' is a valid character in Italian

    Hi! I think I found an error here for Italian, the ' is a valid character for Italian and in cvutils/data/it/validate.tsv#L6 it is replaced with blank, even if there seem to be other rules below in the same file converting other symbols to ' and later allowing '. Maybe I'm not seeing a reason to have that replacement rule there on line 6? Thanks for the useful tools!

    opened by lucarinelli 1
  • Add Basque support to the Segmenter

    This adds Basque support to the segmenter.

    • abbr.tsv:
      • Added common abbreviations from different sources:
        • Elhuyar: http://www.euskara.euskadi.net/appcont/elhuyar/laburdura_zerrenda.pdf
        • Euskaltzaindia: https://www.euskaltzaindia.eus/index.php?option=com_ebe&view=bilaketa&Itemid=1161&task=bilaketa&id=985
        • Euskaljakintza: https://euskaljakintza.com/kontsultategia/laburdurak/
      • Added two regex rules to avoid splitting sentences with . (-garren) in ordinal numbers.
    • validate.tsv: normalize diacritics from Spanish and French that are frequent in Basque texts due to common names or foreign words.

    With this segmenter, I re-trained your previous models adding the Wikipedia corpus here: https://github.com/coqui-ai/STT-models/pull/25

    opened by zuazo 1
  • Update Turkish abbreviation list

    Processed TDK's list to extract abbreviations that use periods (removed 2 instances already existing here). These are added with 1000 in the first column. https://tdk.gov.tr/wp-content/uploads/2019/01/Kısaltmalar_Dizini.pdf

    Also added single capital letters with periods, which are used to shorten names. These are added with 999 in the first column. (we already had T. on the list)

    opened by HarikalarKutusu 1
  • Fixes for DV

    I've made some updates to the DV specifications. I think with these additions the phonemiser output is sensible enough for now. Some of the rules are obviously simplified, since the actual rules are more complex and probably require an FST.

    The following changes were made:

    • Add missing char ޱ in alphabet.txt
    • Fix phons.tsv
      • Add rules for pre-nasalised consonants
      • Make އް and ށް default to stops. This is a sensible default for them. Should sound fine 90% of the time
      • default ން to voiced retroflex nasal.
      • fix vowel rules that were previously tied to consonant އ, preventing proper placement
      • Other minor adjustments to consonant maps
    opened by kudanai 1
  • Incorrect alphabet for Ukrainian

    At https://github.com/ftyers/commonvoice-utils/blob/main/cvutils/data/uk/alphabet.txt should be абвгґдеєжзиіїйклмнопрстуфхцчшщьюя-ʼ. ы is a Russian letter.

    opened by robinhad 1
  • Adding feature to exclude group of information during export

    Is it possible to implement optional "--exclude-xxx fn" flags to exclude recordings during cv export?

    --exclude-voices voices.txt            // E.g. to measure the effect of a single person recording too much
    --exclude-sentences sentences.txt             // E.g. to exclude reported sentences
    --exclude-gender [male|female|other|empty]             // E.g. to train with male voices and test with female voices
    etc
    

    That would very much ease any experiments on biasing effects.

    PS: The correct place to implement these would be CorporaCreator but it is not actively maintained as you know.

    Similar can be implemented for opus corpora.

    Bülent

    opened by HarikalarKutusu 1
  • Please add Korean support

    Although Korean is not fully enabled on Common Voice yet, it only lacks 1500 sentences. If added, we can start using alphabet/normalization support provided by covo.

    opened by HarikalarKutusu 0
  • hindi encoding issue

    Hi. I tried using the g2p tool to phonemise Hindi words, but there were some encoding issues.

    from cvutils import Phonemiser
    p = Phonemiser('hi')
    p.phonemise('अवकाशग्रहण')

    At first, the error message was like:

    UnicodeDecodeError                        Traceback (most recent call last)
    C:\Users\MAGICD~1\AppData\Local\Temp/ipykernel_10860/951158175.py in <module>
          1 from cvutils import Phonemiser
    ----> 2 p = Phonemiser('hi')
          3 p.phonemise('अवकाशग्रहण')

    ~\Anaconda3\lib\site-packages\cvutils\phonemiser.py in __init__(self, lang)
         22             print('[Phonemiser] Function not implemented', file=sys.stderr)
         23         try:
    ---> 24             self.validator = Validator(self.lang)
         25         except FileNotFoundError:
         26             pass

    ~\Anaconda3\lib\site-packages\cvutils\validator.py in __init__(self, lang)
         13         self.nfkd = False
         14         try:
    ---> 15             self.load_data()
         16         except FileNotFoundError:
         17             print('[Validator] Function not implemented', file=sys.stderr)

    ~\Anaconda3\lib\site-packages\cvutils\validator.py in load_data(self)
         24         self.lower = False
         25         data_dir = os.path.abspath(os.path.dirname(__file__)) + '/data/'
    ---> 26         for line in open(data_dir + self.lang + '/validate.tsv').readlines():
         27             if line[0] == '#':
         28                 continue

    UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 50: illegal multibyte sequence

    Then I set encoding='utf-8' on line 26 of '~\Anaconda3\lib\site-packages\cvutils\validator.py', but it didn't work. The error was still:

    UnicodeDecodeError                        Traceback (most recent call last)
    ...
    ~\Anaconda3\lib\site-packages\cvutils\validator.py in load_data(self)
         24         self.lower = False
         25         data_dir = os.path.abspath(os.path.dirname(__file__)) + '/data/'
    ---> 26         for line in open(data_dir + self.lang + '/validate.tsv',encoding='utf-8').readlines():
         27             if line[0] == '#':
         28                 continue

    UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 50: illegal multibyte sequence

    Is there anything I did wrong? And I wonder is there any other method to solve the encoding issue? Thanks!

    opened by treya-lin 3
  • Add a method of checking CJK

    Perhaps something like PASS to basically return whatever was input and REPL for removing punctuation.

    Another option would be something like CB for check Unicode Block.

    opened by ftyers 3
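
As a rough illustration of the "check Unicode block" idea in the last item, the sketch below accepts a string only if every non-space character falls inside a small set of code point ranges. The function names, the selection of blocks and the return convention (the text itself or None, mirroring the Validator) are assumptions, not an existing covo feature.

```python
# Inclusive code point ranges of a few Unicode blocks relevant to CJK text.
CJK_BLOCKS = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
)


def in_blocks(ch, blocks=CJK_BLOCKS):
    """True if the character's code point lies in one of the given ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in blocks)


def check_blocks(text, blocks=CJK_BLOCKS):
    """Return text unchanged if all non-space characters are in the blocks, else None."""
    if all(ch.isspace() or in_blocks(ch, blocks) for ch in text):
        return text
    return None


print(check_blocks("\u4e2d\u6587"))
print(check_blocks("\u4e2d\u6587 abc"))
```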
Owner
Francis Tyers