We already have gen_cjk(), and per pull #63 we may soon have gen_cyrillic().
If we want to support other scripts in the future (Tamil, Telugu, etc.), you can see how this would get very cumbersome and duplicative, very quickly.
It might be better to have a generic function that accepts an arbitrary codepoint range, and then wrap it with a function specific to whichever unicode block you want to test.
e.g., instead of this (the current gen_cjk() body):
codepoints = [random.randint(0x4E00, 0x9FCC) for _ in range(length)]
try:
    # (undefined-variable) pylint:disable=E0602
    output = u''.join(unichr(codepoint) for codepoint in codepoints)
except NameError:
    output = u''.join(chr(codepoint) for codepoint in codepoints)
return _make_unicode(output)
...put this into a generate_unicode_range() function that takes the codepoint values as arguments, and then call it inside a function for any desired unicode block (see the sketch after the list below)...
gen_bengali()
gen_hebrew()
gen_hiragana()
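Something along these lines (a rough sketch only; the signature, the default length, and the gen_hebrew() wrapper are illustrative, while random and _make_unicode() are assumed to come from the existing module):

def generate_unicode_range(start, end, length=10):
    """Build a random unicode string from codepoints in [start, end]."""
    codepoints = [random.randint(start, end) for _ in range(length)]
    try:
        # Python 2: unichr() (undefined-variable) pylint:disable=E0602
        output = u''.join(unichr(codepoint) for codepoint in codepoints)
    except NameError:
        # Python 3: chr() already covers the full unicode range
        output = u''.join(chr(codepoint) for codepoint in codepoints)
    return _make_unicode(output)


def gen_hebrew(length=10):
    """Hebrew block is U+0590..U+05FF."""
    return generate_unicode_range(0x0590, 0x05FF, length)

With that in place, each per-script generator collapses to a one-line wrapper over its block range.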
Now, there is a sticky wicket in all this: some scripts span multiple, non-contiguous blocks. More details here:
http://en.wikipedia.org/wiki/Unicode_block
So really, we should be able to pass all of the desired blocks in as a python list, and then either build a single range to rule them all, or simply pick each random character from one of the blocks in the list.
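Continuing the sketch above (again only illustrative; the function name and the Cyrillic wrapper are hypothetical, and the block ranges are the standard ones from the Wikipedia page):

def gen_unicode_blocks(blocks, length=10):
    """'blocks' is a list of (start, end) codepoint tuples.
    Each character is drawn from a randomly chosen block, so scripts
    spanning non-contiguous blocks work without one artificial super-range."""
    codepoints = []
    for _ in range(length):
        start, end = random.choice(blocks)
        codepoints.append(random.randint(start, end))
    try:
        # (undefined-variable) pylint:disable=E0602
        output = u''.join(unichr(codepoint) for codepoint in codepoints)
    except NameError:
        output = u''.join(chr(codepoint) for codepoint in codepoints)
    return _make_unicode(output)


def gen_cyrillic(length=10):
    return gen_unicode_blocks(
        [(0x0400, 0x04FF),   # Cyrillic
         (0x0500, 0x052F)],  # Cyrillic Supplement
        length)

One thing to keep in mind: choosing a block first means small blocks get sampled as often as large ones; if uniform weighting across all codepoints matters, weight the choice by block size instead.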