WikiPron - a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary

Last update: Jan 1, 2023

Overview

WikiPron

WikiPron is a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary, as well as a database of pronunciation dictionaries mined using this tool.

Command-line tool
Python API
Data
Models
Development

If you use WikiPron in your research, please cite the following:

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman (2020). Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228. [bibtex]

Command-line tool

Installation

WikiPron requires Python 3.6+. It is available from PyPI:

pip install wikipron

Usage

Quick Start

After installation, the terminal command wikipron will be available. As a basic example, the following command scrapes G2P data for French:

wikipron fra

Specifying the Language

The language is indicated by a three-letter ISO 639-2 or ISO 639-3 language code, e.g., fra for French. To see which languages can be scraped, consult the complete list of languages on Wiktionary that have pronunciation entries.

Specifying the Dialect

One can optionally specify dialects to target using the --dialect flag. The dialect name appears together with the transcription on Wiktionary, for example "(UK, US) IPA: /təˈmɑːtəʊ/". To select the union of several dialects, separate them with the pipe character '|', e.g., --dialect='General American | US'. Transcriptions that lack a dialect specification are selected regardless of the value of this flag.
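The selection rule for dialects can be sketched in a few lines of Python. This is an illustrative approximation and not WikiPron's actual implementation; the function name dialect_matches and the representation of dialect labels as a list of strings are assumptions made for the example.

```python
def dialect_matches(labels, dialect_flag):
    """Return True if a transcription with these dialect labels would be kept.

    `labels` is a list of dialect names attached to a transcription
    (empty if none are given); `dialect_flag` is the pipe-separated
    string passed to --dialect, or None if the flag is not used.
    Both names are hypothetical, chosen for illustration only.
    """
    if dialect_flag is None or not labels:
        # No restriction requested, or no dialect specified on the entry:
        # the transcription is selected either way.
        return True
    wanted = {name.strip() for name in dialect_flag.split("|")}
    # Keep the transcription if any of its labels is in the requested union.
    return any(label.strip() in wanted for label in labels)


print(dialect_matches(["UK", "US"], "General American | US"))  # True
print(dialect_matches(["UK"], "General American | US"))        # False
print(dialect_matches([], "General American | US"))            # True
```

The third call shows the behavior for unlabeled transcriptions, which pass through whatever the flag says.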

Segmentation

By default, the segments library is used to split the transcription into whitespace-delimited segments. The segmentation keeps IPA diacritics and modifiers attached to their "parent" symbol. For instance, [kʰæt] is rendered kʰ æ t. Segmentation can be disabled using the --no-segment flag.
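To make the idea of attaching diacritics and modifiers to a parent symbol concrete, here is a rough standard-library approximation. It does not use the segments library and is not how WikiPron segments internally; the function name segment and the rules below (Unicode categories plus tie bars) are assumptions for illustration. In particular, stress marks and tone letters get no special treatment in this sketch.

```python
import unicodedata

# Combining double inverted breve and combining double breve below.
TIE_BARS = {"\u0361", "\u035c"}


def segment(transcription):
    """Split an IPA string into space-separated segments.

    Combining marks (Unicode categories Mn, Mc, Me) and modifier letters
    (Lm, e.g. the aspiration mark) are attached to the preceding symbol.
    A symbol that follows a tie bar is attached to the same segment.
    """
    segments = []
    for char in transcription:
        attaches = unicodedata.category(char) in {"Mn", "Mc", "Me", "Lm"}
        after_tie = bool(segments) and segments[-1][-1] in TIE_BARS
        if segments and (attaches or after_tie):
            segments[-1] += char
        else:
            # Also reached for a word-initial modifier, which has no parent.
            segments.append(char)
    return " ".join(segments)


print(segment("kʰæt"))  # kʰ æ t
```

Because the sketch starts a new segment whenever there is no preceding symbol, a word-initial modifier does not raise an error here.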

Parentheses

Some transcriptions contain parentheses to indicate alternative pronunciations. The parentheses (but not the content) are discarded in the scrape unless the --no-skip-parens flag is used.
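As a minimal sketch of what "discarding the parentheses but not the content" means, the following Python snippet strips the parenthesis characters and keeps what is inside them. The function name drop_parens and the sample transcription are made up for illustration; WikiPron's own processing may differ in detail.

```python
import re


def drop_parens(pron):
    """Remove parenthesis characters but keep the material inside them."""
    return re.sub(r"[()]", "", pron)


print(drop_parens("təˈmɑːt(ə)ʊ"))  # təˈmɑːtəʊ
```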

Output

The scraped data is organized with each pair on its own line, where the word and pronunciation are separated by a tab. Note that the pronunciation is in International Phonetic Alphabet (IPA), segmented by spaces that correctly handle the combining and modifier diacritics for modeling purposes, e.g., we have kʰ æ t with the aspirated k instead of k ʰ æ t.

For illustration, here is a snippet of French data scraped by WikiPron:

accrémentitielle    a k ʁ e m ɑ̃ t i t j ɛ l
accrescent  a k ʁ ɛ s ɑ̃
accrétion   a k ʁ e s j ɔ̃
accrétions  a k ʁ e s j ɔ̃

By default, the scraped data appears in the terminal. To save the data in a TSV file, please redirect the standard output to a filename of your choice:

wikipron fra > fra.tsv
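Once saved, the TSV output is easy to load for downstream work. The sketch below reads word/pronunciation pairs into a dictionary, keeping multiple pronunciations per word. The function name read_pron_tsv is an assumption, and the in-memory sample stands in for a file such as fra.tsv.

```python
import csv
import io
from collections import defaultdict


def read_pron_tsv(handle):
    """Map each word to a list of its pronunciations (as segment lists)."""
    lexicon = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        word, pron = row
        lexicon[word].append(pron.split())
    return dict(lexicon)


# Two lines in the same shape as the French snippet above.
sample = io.StringIO("accrescent\ta k ʁ ɛ s ɑ̃\naccrétion\ta k ʁ e s j ɔ̃\n")
lexicon = read_pron_tsv(sample)
print(lexicon["accrescent"])
```

For a real file, the handle could be opened with open("fra.tsv", encoding="utf-8", newline="").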

Advanced Options

The wikipron terminal command has an array of options to configure your scraping run. For a full list of the options, please run wikipron -h.

Python API

The underlying module can also be used from Python. A standard workflow looks like:

import wikipron

config = wikipron.Config(key="fra")  # French, with default options.
for word, pron in wikipron.scrape(config):
    ...
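The loop body is left open in the snippet above. One plausible way to fill it in is to write the pairs out in the same tab-separated layout as the command-line tool. The helper write_tsv below is hypothetical and not part of the WikiPron API; it accepts any iterable of (word, pron) pairs, so the sketch runs on stand-in data without network access.

```python
import sys


def write_tsv(pairs, handle=sys.stdout):
    """Write (word, pron) pairs one per line, tab-separated; return the count."""
    count = 0
    for word, pron in pairs:
        handle.write(f"{word}\t{pron}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in data so the sketch runs offline. With WikiPron installed,
    # the pairs could instead come from
    # wikipron.scrape(wikipron.Config(key="fra")).
    demo_pairs = [("accrescent", "a k ʁ ɛ s ɑ̃"), ("accrétion", "a k ʁ e s j ɔ̃")]
    write_tsv(demo_pairs)
```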

Data

We also make available a database of over 3 million word/pronunciation pairs mined using WikiPron.

Models

We host grapheme-to-phoneme models and modeling software in a separate repository.

Development

Repository

The source code of WikiPron is hosted on GitHub at https://github.com/CUNY-CL/wikipron, where development also happens.

To get the latest changes not yet released through pip, or to work on the codebase yourself, you may obtain the latest source code through GitHub and git:

Create a fork of the wikipron repo on your GitHub account.
Locally, make sure you are in some sort of a virtual environment (venv, virtualenv, conda, etc).

Download and install the library in the "editable" mode together with the core and dev dependencies within the virtual environment:

git clone https://github.com/<your-github-username>/wikipron.git
cd wikipron
pip install -U pip setuptools
pip install -r requirements.txt
pip install --no-deps -e .

We keep track of notable changes in CHANGELOG.md.

Contribution

For questions, bug reports, and feature requests, please file an issue.

If you would like to contribute to the wikipron codebase, please see CONTRIBUTING.md.

License

WikiPron is released under an Apache 2.0 license. Please see LICENSE.txt for details.

Please note that Wiktionary data in the data/ directory has its own licensing terms.

Comments

Potential problem in _parse_combining_modifiers()

I started the second big scrape and while scraping for phonetic data from Albanian, Wikipron threw an error, the last line of which I'll reproduce below:

File ".../wikipron/config.py", line 73, in _parse_combining_modifiers
    last_char = chars.pop()
IndexError: pop from empty list

The final line in the Albanian phonetic tsv is herë h ɛː ɾ meaning the scrape likely failed on this entry which contains what looks like word initial aspiration.

I guess for words like the one that caused this error we would want to combine with next char ʰi d r ɔ ɟ ɛ n?
bug

opened by lfashby 22
TSV files for all Wiktionary languages with over 1000 entries

- Adds tsv files in wikipron/languages/wikipron/tsv_files
- Adds a readme in wikipron/languages/wikipron

The tsv file names are formatted as follows: the ISO 639-2(B) code, followed by _phonetic or _phonemic. (If the language only has an ISO 639-3 code, then that code is used. If a language doesn't have any phonetic entries on Wiktionary, then it will not have a phonetic file. The same goes for phonemic.)

The readme tsv file link links to the file (phonetic or phonemic) with more entries. I tried to determine whether or not to apply case-folding for each language, but may have gotten it wrong for a few languages. If you see any instances in the readme where I incorrectly applied or failed to apply case-folding then let me know and I can rerun those languages if need be.

I will add Russian once I have it, but it may take quite a long time to get it. I can also add all languages with more than 100 but less than 1000 entries in the same pull request as Russian if you'd like those files as well. I can submit a pull request for the code that generated all these files after that.

opened by lfashby 21
[arm] can't use wikipron because of potential readtimeout? Can we use a wiktionary dump?

Hello

I can use the terminal version of WikiPron to scrape a small language like Amharic [amh], and to scrape a big one like French. But when I try to run it on Armenian [hye or arm], the code just stops running after an hour and outputs nothing -- there's not even any errors thrown. I suspect the code is finding a readtimeout error and then skipping it.

I suspect there's a readtimeout error because in the past I used other Wiktionary extractors, such as Wiktextract, and that took 9-12 hours to scrape the Armenian words (just 17k words). I suspect that the Armenian entries are so oddly dispersed across Wiktionary that it takes a while for some scrapers to find them. Granted, Wiktextract was using a Wiktionary dump, and that's how it managed to eventually work. Can WikiPron work over a Wiktionary dump, or does it need to actively use an internet connection?
enhancement

opened by jhdeov 16
[arm] issues in the phones list
Wonderful resource.

There are some errors in the phone list for the Armenian dialects. A lot of the errors got cleaned up since you last scraped them.

The way that the Wiktionary contributors for Armenian work is that they do the following:

1. They take the orthographic form like գրպան (transliteration is <grban>, pronunciation is [gərbɑn])

2. They manually take the orthographic form and rewrite it with Armenian letters in order to apply any phonological rules like schwa insertion: գրպան to գըրպան <gərban>

3. They then use this script to convert from the rewritten form (2) into the IPA for Eastern and Western Armenian

This means that the Western entries are almost all redundant and automatically derived from the Eastern entries. A lot of words actually include both IPA entries together.

For Eastern (EA)

You're right that the <ա> grapheme is /ɑ/. The script automatically converts the rewritten grapheme to [ɑ]. Some people had manually written the pronunciation entries and used non-IPA symbols like [a]. But they've mostly gotten cleaned up. A lot got cleaned before August. Then I personally scraped Wiktionary with another python package, found some overlooked [a]'s, and I cleaned them up too.

Same issue with <ո> actually being [ɔ] but sometimes incorrectly written as [o].

Armenian doesn't have phonemic geminates. What you see is that the orthography has a sequence of identical consonants like փթթել <pttel> [pəttel]. The script automatically converts a sequence of identical segments into a geminate with the [ː] diacritic. So if the rewritten form is փըթթել <pəttel>, then the outputted pronunciation is [pət:el]. If the rewritten form is փըթել <pətel>, then the automatic output is [pətel]. IMO, it would make more sense if the automatic transcription was with doubled segments [pəttel] because Armenian lacks phonemic length. So all the consonants have a possible geminated/lengthened/doubled form. It might just be an accidental gap in the Wiktionary data that you don't have geminate t͡s: and others.

The back fricatives are in free variation between velar and uvular, with more tendency towards uvular. The automatic transcription uses χ, so any [x] that you see is just an error (which I think also got cleaned up in around summer).

What's the issue with missing tie-bars?

TODO: for future cleanup, missing tie-bars on segments like: <ց> [tʃʰ] <չ> [tsʰ]

For Western (WA)

Any [r] that you see is an error. The orthography has two rhotic graphemes ր ռ that are both pronounced the same, as a flap. The trill is just someone manually writing it. That counts as an error and I think most of them got cleaned up on the Wiktionary entries.

Yup you're right about the mergers.

TODO: supposedly the following mergers have occurred in this dialect: aspirated > aspirated voiced > aspirated voiceless unaspirated > voiced

The W dialect is also developing a devoicing rule for clusters of orthographic voiced+voiceless consonants. The automatic transcription covers that though.

language support
opened by jhdeov 16
Negative flags are renamed to positive statements (#141)
This pull request is for #141. Negative flags in cli.py are renamed to positive statements. In order to accommodate this change, wikipron/config.py and tests/test_wikipron/test_config.py are also edited accordingly.

[x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.
opened by yeonju123 15
[mdf] Scrape Moksha + slightly more flexible default pron selector.
Strangely enough some (but not all) Moksha pages are not standard. The pronunciations don't reside under the regular list item elements ("li"), but under the paragraph elements ("p"). For an example, see the page for ала

Please let me know whether you'd prefer a custom extractor for this rather than changing the default template.

[x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.
opened by agutkin 14
[geo] inconsistencies

The Wikipedia transcription guidelines say that the high front phoneme is /i/ but many transcriptions have /ɪ/ instead. We should fix this upstream (e.g., on Wiktionary itself) and rescrape.

See issue for more context.
language support

opened by kylebgorman 14
Add config options to big scrape, separate scraping and writing within big scrape.
(Sorry for the wall of text...) These changes are meant to address issues #66, #67, #68 as well as a few suggestions made in the comments of pull request #61.

Here are the larger changes introduced by this pull request:

Separated the scraping and writing part of scrape.py (formerly scrape_and_write.py).

write.py now generates the README table by inspecting the contents of the tsv/ directory. In the process it also creates a tsv readme_tsv.tsv with similar information as is in the README table.

Added no_stress, no_syllable_boundaries, and cut_off_date options to languages in languages.json.

Modified codes.py to specify default values for these options when adding new languages to languages.json and to copy over previously set values for these options.

cut_off_date should now be set in codes.py prior to running codes.py; I’ve updated the README in languages/wikipron with those instructions.

Added dialect config option (and require_dialect_label option) to English, Spanish and Portuguese.

Restructured scrape.py to handle when one or more dialects are specified for a language. (Ran this new code on Portuguese, because it is a smaller language, to generate some sample data.)

README table now includes dialect information in Wiktionary language name column

The only small changes worth noting are:

Logging in scrape.py will now also output to scraping.log which I’ve added to .gitignore. This way finding the languages that failed to be scraped is a bit easier (don’t need to scroll through the console). It also outputs the language dict from languages.json in the error message for languages that failed to be scraped within our set amount of retries, so it is a bit easier to build a temporary languages.json with the failed languages.

scrape.py will now remove files with less than 100 entries. (TSVs with less than 100 entries have been removed.)

I have a few questions regarding dialects that I'd like your thoughts on:

How dialects are handled in languages.json.

I added dialects to languages.json in the following way:

"por": {
    ...
    "require_dialect_label": true,
    "dialect": {
        "_bz": "Brazil",
        "_po": "Portugal"
    },
    ...
},

The keys (_bz, _po) in dialect serve as a sort of extension when naming the dialect tsv files (por_bz_phonetic.tsv, for example) and help with easy access to the dialect strings ("Brazil") in write.py. If you'd like me to change any of the keys because you'd like different extensions for certain dialects let me know. They can be longer than two letters. I'll provide links to the English, Spanish, and Portuguese entries in languages.json as a separate comment so you can review the keys and dialect strings I'm using.
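The relationship between dialect keys and file names can be sketched as follows. This is not the code from scrape.py or write.py; the function name dialect_tsv_names and the trimmed JSON excerpt are assumptions that mirror only the fields quoted above.

```python
import json

# Hypothetical excerpt mirroring the structure quoted above; the real
# languages.json entry contains further fields that are omitted here.
LANGUAGES_JSON = """
{
  "por": {
    "require_dialect_label": true,
    "dialect": {"_bz": "Brazil", "_po": "Portugal"}
  }
}
"""


def dialect_tsv_names(code, entry, level="phonetic"):
    """Map each dialect string to the TSV filename built from its key."""
    return {
        name: f"{code}{key}_{level}.tsv"
        for key, name in entry.get("dialect", {}).items()
    }


languages = json.loads(LANGUAGES_JSON)
print(dialect_tsv_names("por", languages["por"]))
# {'Brazil': 'por_bz_phonetic.tsv', 'Portugal': 'por_po_phonetic.tsv'}
```

Since the key is simply concatenated onto the language code, keys longer than two letters work without any change.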

Is there a process for finding which dialects are frequently used within a given Wiktionary language category? (Aside from just checking entries and seeing whether dialect information is specified.)

How dialects are handled in scrape.py.

As written scrape.py will first scrape a language entirely and then scrape for dialects if any are specified. This means it will scrape por (Portuguese) with no dialect, then por with "Brazil" as the dialect and then por with "Portugal" as the dialect. Is there any reason to scrape for por with no dialect when we are specifying a dialect? Do we want to keep the tsvs generated from previously scraping por (or eng/spa) with no dialect?

Within scrape.py, I moved a lot of what was in main() to a separate function in order to handle dialects. There may well be a better way of handling dialects than the way I've tried to do it and I'm open to suggestions on how to improve it.
opened by lfashby 14
Create generate_phone_summary.py
[x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

This script automatically generates phone_summary, which has a structure similar to language_summary. The script is based on generate_summary.py.

The output of generate_phone_summary.py is a TSV file instead of README.md, since README.md is already used. I will change the output path after we discuss what would be the good path for the output.
opened by yeonju123 13
Rename files from "phonemic"/"phonetic" to "broad"/"narrow"
[x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.

Closes #389

I renamed the files with this script: link

I only changed the filenames, so I'm sure a lot of stuff is broken at the moment…
opened by ajmalanoski 12
adds unimorph data repo and download routine
adds the json file with all unimorph files and wikipron lg names. Additionally uses download routine to grab data and logs statements to the console

[ ] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.
opened by reubenraff 12
[pam] Can't parse both types of transcriptions from the same line?
For Kapampangan (pam), the format of all pronunciation entries looks as follows:

Hyphenation: ba‧tia‧uan IPA(key): /bəˈtjawən/, [bəˈtjäː.wən]

I suspect we can't parse this when both transcriptions are under the same heading. May be a duplicate.
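For illustration, pulling both transcription types off a single line is straightforward with regular expressions once the line text is in hand. The sketch below is not WikiPron's extraction logic (which works on the page's HTML); the names BROAD, NARROW and split_transcriptions are assumptions.

```python
import re

BROAD = re.compile(r"/[^/]+/")      # phonemic, between slashes
NARROW = re.compile(r"\[[^\]]+\]")  # phonetic, between square brackets


def split_transcriptions(line):
    """Return (broad, narrow) lists of transcriptions found on one line."""
    return BROAD.findall(line), NARROW.findall(line)


line = "Hyphenation: ba‧tia‧uan IPA(key): /bəˈtjawən/, [bəˈtjäː.wən]"
broad, narrow = split_transcriptions(line)
print(broad)
print(narrow)
```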
opened by agutkin 4
remove default casefolding
Removed the statement casefold:true from the languages.json list. I rescraped hye and apw to confirm that the languages were still scraped, but now with the original case marking from Wiktionary.

[x] Updated Unreleased in CHANGELOG.md to reflect the changes in code or data.
opened by jhdeov 3
[arm] finding IPA transcriptions outside of the Pronunciation block

For the word կարկանդակ, wikipron finds the correct pronunciation of [kɑɾkɑndɑk] but it also finds the IPA transcriptions of other words in the Usage Notes section like [pɛrɑʃˈki]. I'm not sure if this is an unavoidable glitch from Wikipron's side, or if it's a glitch that could be fixed from the Wiktionary side.

It seems that what's going on is that WikiPron is just finding any IPA transcription that's inside the Armenian entry, even if it's not associated with a dialect. E.g., if you run wikipron arm --dialect='ladygaga' --no-skip-parens --narrow > randos.tsv you get a handful of IPA transcriptions that aren't associated with the pre-defined dialects. These are either a) IPA transcriptions in the Usage notes or etymology, or b) IPA transcriptions for non-standard dialects. This isn't a problem for using Wikipron on a specific language (because the person can just filter those out manually). But I wonder if this glitch causes any other funny business for the other languages.

Side note: I wonder if there have been enough situations where people had to fix Wiktionary entries in order to optimize WikiPron's scraper (as in the various closed issues). If so, perhaps a tips and tricks page would be helpful down the line?

opened by jhdeov 5
Undoing casefolding?
The commandline lets the user choose to apply casefolding so that entries like English can be changed to either English or english. But for the scraped data on the repo, it seems you apply casefolding by default. Would it be more useful if the online data didn't do casefolding? That way,

If the user wanted to get the original data (with the correct cases), then they can just use the scraped data online instead of running WikiPron on the terminal

If the user wanted to get the casefolded data, then they can take the un-casefolded data from the repo and then apply casefolding on their own machine (a simple fast Excel function).

Right now, if the user wants to get the original cases, then they have to run the terminal option (which takes a while).
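Applying casefolding locally, as suggested above, takes only a few lines of Python. The sketch below casefolds the word column of TSV lines and leaves the pronunciation untouched; the function name casefold_tsv_lines and the sample line are made up for illustration.

```python
def casefold_tsv_lines(lines):
    """Yield TSV lines with the word column casefolded; prons are untouched."""
    for line in lines:
        word, sep, pron = line.rstrip("\n").partition("\t")
        yield f"{word.casefold()}{sep}{pron}"


print(list(casefold_tsv_lines(["English\tˈɪ ŋ ɡ l ɪ ʃ\n"])))
```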
enhancement good first issue
opened by jhdeov 5
scraping audio files?

Do you think there's a reasonable way to make an enhancement that will extract audio file URLs for Wiktionary words? At least for Armenian, the audio files are linked in the Pronunciation section.
enhancement

opened by jhdeov 5

Releases(v1.3.0)

v1.3.0(Nov 28, 2022)
[1.3.0] - 2022-11-28

Under data/

Added

Big scrape for 2022. (#464)

Added the --fresh flag to data/scrape/scrape.py to facilitate running the big scrape in batches. (#464)

Added the --exclude flag for excluding one or more languages in data/scrape/scrape.py. (#460)

Added data/src/normalize.py. (#356)

Updated README.md. (#360)

Added data/cg/tsv/geo.tsv. (#367)

Added data/morphology. (#369)

Added SIGMORPHON 2021 morphology data. (#375)

Added data/cg/tsv/jpn_hira.tsv. (#384)

Enforced final newlines. (#387)

Adds all UniMorph languages to morphology. (#393)

Added data/covering_grammar/tsv/fre_latn_phonemic.tsv (#398)

Added data/covering_grammar/lib/make_test_file.py (#396, #399)

Added Komi-Zyrian (kpv). (#400)

Added Makasar (mak). (#415, #419)

Added Zou (zom). (#421)

Added Wiyot (wiy). (#422)

Added Sidamo (sid). (#423)

Added Central Atlas Tamazight (tzm). (#429)

Added Chibcha (chb). (#430)

Added Kashmiri (kas). (#431)

Added Malayalam (mal). (#434)

Added Dhivehi (div). (#437)

Added Akkadian (akk). (#441)

Added Central Nahuatl (nhn). (#443)

Added Etruscan (ett). (#444)

Added Gujarati (guj). (#445)

Added Kannada (kan). (#446)

Added Karelian (krl). (#447)

Added Romagnol (rgn). (#448)

Added Southern Yukaghir (yux). (#449)

Added Urak Lawoi' (urk). (#451)

Added Hausa (ha). (#452)

Added Kashubian (csb). (#453)

Added Tabaru (tby). (#455)

Added West Makian (mqs). (#457)

Added Amharic (amh). (#458)

Added Livvi (olo). (#459)

Added Kalmyk (xal). (#472)

Added Ternate (tft). (#473)

Added Abkhaz (abk). (#474)

Added Farefare (gur). (#475)

Added Iban (iba). (#476)

Added Laz (lzz). (#477)

Changed

Switched to ISO 639-3 language codes. (#468)

Updated scraped data in preparation for the SIGMORPHON 2022 shared task: swe nno ger dut ita rum ukr bel tgl ceb ben asm per pus tha lwl. (#461)

Made scripts under data/frequencies/ and data/morphology/ more flexible, especially for the purposes of preparing data for a shared task. (#461)

Fixed the --restriction flag for specifying multiple languages in data/scrape/scrape.py. (#460)

Added covering grammar coverage error log and specified error_type in error_analysis.py. (#424)

Added error log writing in error_analysis.py. (#420)

Added new columns in summary tables. (#365)

Fixed broken paths in data/src/generate_phones_summary.py and in data/phones/HOWTO.md. (#352)

Added Atong (India) (aot). (#353)

Added Egyptian Arabic (arz). (#354)

Added Lolopo (ycl). (#355)

Fixed Unicode normalization in data/phones/slv_phonemic.phones and re-scraped Slovenian data. (#356)

Updated data/phones/HOWTO.md to include instructions on applying the NFC Unicode normalization (#357)

Updated data/src/normalize.py to be more efficient. (#358)

Fixed inaccuracies in data/phones/geo_phonemic.phones. (#367)

Fixed typo in data/cg/tsv/geo.tsv and added missing character. (#370)

Morphology URLs are now provided as a list. (#376)

Configured and scraped Yamphu (ybi). (#380)

Configured and scraped Khumi Chin (cnk). (#381)

Made summary generation in common_characters.py optional. (#382)

Fixed phone counting in data/src/generate_phones_summary.py (#390, #392)

Reorganizes scraping scripts under data/scrape (#394)

Reorganizes .phones files and related scripts under data/phones (#395)

Reorganizes CG files and related scripts under data/covering_grammar (#395)

Reorganized data/phones/phones/fre_phonemic.phones (#398)

Removed data/src/ (#401)

Renamed TSV files and phonelists to use the terms "broad"/"narrow" instead of "phonemic"/"phonetic" (#389, #402, #405)

Fixed typo in README.md (#407)

Fixed column ordering of the test file read by the script in data/covering_grammar/lib/error_analysis.py (#411)

Fixed Common character collection in common_characters.py (#419)

Scraping test fixed for blt. (#436)

Changed URLs to point at CUNY-CL repo, where applicable. (#438)

Under wikipron/ and elsewhere

Added

Added ckb in languagecodes.py. (#464)

Added support for Python 3.10. (#462)

Added test of phones list generation in test_data/test_summary.py (#363)

Added Min Nan extraction function. (#397)

Added Tai Dam extraction function, configuration and initial scrape. (#435)

Added test of casefold value for languages in data/scrape/lib/languages.json (#442)

Added support for Python 3.11. (#479)

Added checks for the Python source distribution and wheel on CI. (#479)

Turned on tests for Windows on CI. (#479)

Removed

Dropped support for Python 3.6. (#462)

Dropped support for Python 3.7. (#479)

Changed

Switched to ISO 639-3 language codes. (#468)

Converted setup.py to pyproject.toml. (#479)

v1.2.0(Jan 30, 2021)
Under data/

Added

Added Afrikaans whitelists, filtered TSV file, rescraped phonemic and phonetic TSV files. (#311)

Added German whitelists and filtered TSV file. (#285)

Added whitelisting capabilities to postprocess. (#152)

Added whitelists for Dutch, English, Greek, Latin, Korean, and Spanish. (#158, etc.)

Logged dialect configuration if specified. (#133)

Added typing to big scrape code. (#140)

Added argparse to allow limiting 'big scrape' to individual languages with --restriction flag. (#154)

Added Manchu (mnc). (#185)

Added Polabian (pox). (#186)

Added aar, bdq, jje, and lsi. (#202)

Added tyv to languagecodes.py (#203, #205)

Added bcl, egl, izh, ltg, azg, kir and mga to languagecodes.py. (#205)

Added nep to languagecodes.py. (#206)

Added Ingrian (izh). (#215)

Added French phoneme list and filtered TSV file. (#213, #217)

Added Corsican (cos). (#222)

Added Middle Korean (okm). (#223)

Added Middle Irish (mga). (#224)

Added Old Portuguese (opt). (#225)

Added Serbo-Croatian phoneme list and filtered TSV files. (#227)

Added Tuvan (tyv). (#228)

Added Shan (shn) with custom extraction. (#229)

Added Northern Kurdish (kmr). (#243)

Added a script to facilitate the creation of a .phones file. (#246)

Added IPA validity checks for phonemes. (#248)

Split multiple pronunciations joined by tilde in eng_us_phonetic.

Added Italian phoneme list and filtered TSV file. (#260, #261)

Added Adyghe phone list and filtered TSV file. (#262, #263)

Added Bulgarian phoneme list and filtered TSV file. (#264, #267)

Added Icelandic phoneme list and filtered TSV file. (#269, #270)

Added Slovenian phoneme list and filtered TSV file. (#271, #273)

Added normalization to list_phones.py. Corrected errors relating to ipapy (#275)

Added Welsh .phones lists and filtered TSV files. (#274, #276)

Added draft of covering grammar script. (#297)

Updated data/phones/README.md with instructions to re-scrape. (#279, #281)

Added Vietnamese .phones files and re-scraped and filtered .tsv files. (#278, #283)

Added Hindi .phones files and the re-scraped and filtered .tsv files. (#282, #284)

Added Old Frisian (ofs). (#294)

Added Dungan (dng). (#293)

Added Latgalian (ltg). (#296)

Added Portuguese .phones files and re-scraped data. (#290, #304)

Added Japanese .phones files and re-scraped data. (#230, #307)

Added Moksha (mdf). (#295)

Added Azerbaijani .phones files and re-scraped data. (#306, #312)

Added Turkish .phones file and re-scraped data. (#313, #314)

Added Maltese .phones file and re-scraped data. (#317, #318)

Added Latvian .phones file and re-scraped data. (#321, #322)

Added Khmer .phones file and re-scraped data. (#324, #327)

Added Østnorsk (Bokmål) .phones file and re-scraped data. (#324, #327)

Several languages added to languagecodes.py. (#334)

Changed

Edited the arm_e_phonetic.phones and arm_w_phonetic.phones files. (#298)

Improved printing in the README table. (#145)

Renamed data directory data. (#147)

Split may into Latin and Arabic files. (#164)

Split pan into Gurmukhi and Shahmukhī. (#169)

Split uig into Perso-Arabic and Cyrillic. (#173)

Only allowed Latin spellings in Maltese lexicon. (#166).

Split mon into Cyrillic and Mongol Bichig (#179).

Merged whitelist.py into 'big scrape' script. src scrape.py now checks for existence of whitelist file during scrape to create second filtered TSV. New TSV placed under tsv/\*\_filtered.tsv. (#154).

Updated generate_summary.py to reflect presence of 'filtered' tsv. (#154)

Imperial Aramaic (arc) split into three scripts properly. (#187)

Flattened data directory structure. (#194)

Updated Georgian (geo) to take advantage of upstream bot-based consistency fixes. (#138)

Split arm into Eastern and Western dialects. (#197)

Rescraped files with new whitelists. (#199)

Updated logging statements for consistency. (#196)

Renamed .whitelist file extension name as .phones. (#207)

Split ban into Latin and Balinese scripts. (#214)

Split kir into Cyrillic and Arabic. (#216)

Split Latin (lat) into its dialects. (#233)

Added MyPy coverage for wikipron, tests and data directories. (#247)

Modified paths in codes.py, scrape.py, and split.py. (#251, #256)

Modified config flags in languages.json and scrape.py. (#258)

Edited Serbo-Croatian .phones file to list all vowel/pitch accent combinations. Re-scraped Serbo-Croatian data. (#288)

Moved list_phones.py to parent directory. (#265, #266)

Moved list_phones.py to src directory. (#297)

Frequencies code no longer overwrites TSV files. (#320)

Updated data/phones/README.md to specify that .phones files should be in NFC normalization form. (#333)

Kurdish (kur) and Opata (opt) removed from languages.json. (#334)

Re-scraped Armenian data. Fixed an error in West Armenian phone list. (#338)

Fixed

Fixed path issue with phonetic whitelisted files. (#195)

Under wikipron/ and Elsewhere

Added

Added positive flags for stress, syllable boundaries, tones, segment to cli.py. (#141)

Added positive flags for space skipping to cli.py. (#257)

Added two Vietnamese dialects to languages.json. (#139)

Handled additional language codes. (#132, #148)

Added --no-skip-spaces-word and --no-skip-spaces-pron flag. (#135)

Allowed ASCII apostrophes (0x27) in spellings. (#172).

Added Vietnamese extraction function. (#181).

Modified pron selector in Latin extraction function. (#183).

Added --no-tone flag. (#188)

Customized extractor and new scraped prons for khb. (#219)

Added tests/test_data directory containing two tests. (#226, #251)

Added HTTP User-Agent header to API calls to Wiktionary. (#234)

Added support for python 3.9 (#240)

Added black style formatting to .circleci/config.yml. (#242)

Added logging for scraping a language with --dialect specified that requires its custom extraction logic. (#245)

Improved CircleCI workflow with orbs. (#249)

Added test_split.py to tests/test_data. (#256)

Handled Cantonese for scraping. (#277)

Added exclusion for reconstructions. (#302)

Added Vietnamese contour tone grouping test in tests/test_config.py (#308)

Added restart functionality. (#340)

Changed

Renamed arguments to positive statements in wikipron/config.py and edited _get_process_pron function accordingly. (#141, #257)

Changed testing values used in tests/test_config.py in order to accommodate the addition of positive flags. (#141)

Specified UTF-8 encoding in handling text files. (#221)

Moved previous contents of tests into tests/test_wikipron (#226)

Updated the packages version numbers in requirements.txt to their latest according to PyPI (#239)

Updated the default pron selector to also look for IPA strings under paragraphs in addition to list items. (#295)

Updated segments package version to 2.2.0 (#308)

Removed

Moved Wiktionary querying functions from test_languagecodes.py to codes.py (#205)

v1.1.0(Mar 3, 2020)
[1.1.0] - 2020-03-03

Added

Added the extraction function for Mandarin Chinese and its scraped data. (#124)

Integrated Wortschatz frequencies. (#122)

Changed

Updated the Japanese extraction function and Japanese data. (#129)

Updated all scraped Wiktionary data and frequency data. (#127, #128)

Generalized the splitting script in the big scrape. (#123)

Moved small file removal to generate_summary.py. (#119)

Updated Russian data. (#115)

Fixed

Avoided and logged error in case of pron processing failure. (#130)


Owner

GitHub

