Persian Lexicon
This repo uses the Uppsala Persian Corpus (UPC) to construct a lexicon of 70664 unique words. With all the excitement around the game Wordle, we also extracted words of different lengths (2, 3, 4, ..., 10) and stored them in separate files for easier access. Please note that these files might contain offensive words; I have not checked them manually.
GetWords.py can read these files and return the words as a list of strings.
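For readers who want to load a word file directly, here is a minimal sketch of such a reader. It is not the actual contents of GetWords.py: the function names `parse_words` and `load_words` are hypothetical, and the sketch assumes each file is UTF-8 text with one word per line, which this README does not state explicitly.

```python
from __future__ import annotations

from pathlib import Path


def parse_words(text: str) -> list[str]:
    """Split file contents into words, assuming one word per line."""
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_words(path: str) -> list[str]:
    """Read a lexicon file (assumed UTF-8) and return its words as a list of strings."""
    return parse_words(Path(path).read_text(encoding="utf-8"))


# Example (path taken from this README):
# words = load_words("data/persian-words.txt")
```

Keeping the parsing separate from the file access makes the logic easy to check on an in-memory string.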
Cleanup details
Main Lexicon
The main lexicon (data/persian-words.txt) is built very liberally; we only filter out words that contain ASCII characters or Arabic numerals.
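The liberal filter described here could look roughly like the sketch below. The function name `keep_in_main_lexicon` is hypothetical, and the reading of "Arabic numerals" is an assumption: the sketch rejects any Unicode decimal digit, which covers 0-9 as well as the Arabic-Indic and Persian digit forms. The repo's own cleanup code may differ.

```python
def keep_in_main_lexicon(word: str) -> bool:
    """Return True if the word has no ASCII characters and no decimal digits."""
    if not word:
        return False
    # str.isascii() is True for code points below 128 (Latin letters, 0-9, punctuation).
    # str.isdecimal() is True for decimal digits in any script (assumed meaning
    # of "Arabic numerals" here), e.g. U+0660-U+0669 and U+06F0-U+06F9.
    return not any(ch.isascii() or ch.isdecimal() for ch in word)


candidates = [
    "\u0633\u0644\u0627\u0645",   # a Persian word, kept
    "abc",                        # ASCII only, dropped
    "\u0633\u0644\u0627\u06f1",   # ends with Persian digit one, dropped
]
main_lexicon = [w for w in candidates if keep_in_main_lexicon(w)]
```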
Fixed-length Lexicons
More conservative filtering has been applied to files with fixed word length. We drop all words that contain any of the following characters:
- All forms of hamze (همزه).
- All forms of tanvin (تنوین).
- All forms of short vowels.
- Tashdid (تشدید).
- Zero-width non-joiner (نیمفاصله).
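The exclusion list above can be sketched as a set of code points. The exact characters behind "all forms" are an assumption on my part: the sets below are one plausible reading, and the names `EXCLUDED` and `keep_in_fixed_length` are hypothetical rather than taken from the repo.

```python
# Assumed code points for each category in the list above.
HAMZE = "\u0621\u0623\u0624\u0625\u0626\u0654\u0655"  # standalone, on alef/vav/ye, combining above/below
TANVIN = "\u064B\u064C\u064D"                          # fathatan, dammatan, kasratan
SHORT_VOWELS = "\u064E\u064F\u0650"                    # fatha, damma, kasra
TASHDID = "\u0651"                                     # shadda
ZWNJ = "\u200C"                                        # zero-width non-joiner

EXCLUDED = frozenset(HAMZE + TANVIN + SHORT_VOWELS + TASHDID + ZWNJ)


def keep_in_fixed_length(word: str, length: int) -> bool:
    """Return True if the word has exactly `length` code points and no excluded character."""
    return len(word) == length and EXCLUDED.isdisjoint(word)
```

Alef with madda (U+0622) is deliberately left out of `HAMZE` in this sketch, since it is not a hamze form; adjust the sets if the repo treats it differently.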
After applying these filters, we ended up with the following number of words per file:
- 2 letter words: 310 unique words
- 3 letter words: 2378 unique words
- 4 letter words: 7059 unique words
- 5 letter words: 10043 unique words
- 6 letter words: 9541 unique words
- 7 letter words: 7350 unique words
- 8 letter words: 4681 unique words
- 9 letter words: 2529 unique words
- 10 letter words: 1250 unique words
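To reproduce per-length counts like these from any list of words, a grouping step along the following lines would do. This is a sketch under assumptions: `group_by_length` is a hypothetical helper, and word length is measured with `len`, which counts Unicode code points. That matches the number of letters only when words carry no combining marks, which the fixed-length filtering already ensures.

```python
from collections import defaultdict


def group_by_length(words, min_len=2, max_len=10):
    """Bucket unique words by length, keeping only lengths in [min_len, max_len]."""
    groups = defaultdict(set)
    for word in words:
        if min_len <= len(word) <= max_len:
            groups[len(word)].add(word)  # a set drops duplicates
    # Return plain sorted lists, ordered by word length.
    return {n: sorted(ws) for n, ws in sorted(groups.items())}


sample = ["ab", "cd", "ab", "abc", "a", "abcdefghijk"]
for n, ws in group_by_length(sample).items():
    print(f"{n} letter words: {len(ws)} unique words")
```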