This PR integrates more inclusive language identification as compared to cld2/3. To this end, the SIL Language Identification API is used as the default language identification model. This API supports 1035 languages currently including many lower resourced languages, and hopefully by using this language ID COVID-QA can start leveraging a variety of data gathered in lowered resourced languages (e.g., via elasticsearch). Some sources of this info are the SIL COVID resources, the endangered languages project, and this repo.
Important points
- I've integrated new environmental variables for the API key, secret and url in
config.py
. These can be obtained for free at developers.sil.org
- I took cld2/3 out of the loop because closely related languages (e.g., russian and bulgarian) might be misidentified. cld2/3 doesn't support many languages and is thus biased in terms of the data that was used to train the model.
- The SIL API returns ISO 639-3 codes only (3 letter language codes)
How to test
$ cd covid_nlp/language/
$ python detect_language.py
Other suggestions/ notes:
- Right now any non-English language is routed to the FAQ-based matching with elasticsearch (I think). It would be great to upgrade this a bit to determine which languages currently have data. Then for languages without any content, we could route those to something like PanLex resources or SIL posters for low resource languages.
- I would like to upgrade the SIL API to support language ID for many samples at once. This may help in dynamically determining what content we have in the datastore.
Supported languages in the API
Supported via Text Classification: biv, cub, cmo, hvn, kus, yal, pwg, myv, guo, des, leu, eip, cso, zia, kri, mca, kno, zza, maz, bps, qub, rmy, lvs, tab, nld, moa, ssg, maw, pww, sab, udm, zsm, zao, dzo, gnw, bru, kog, cwe, bim, tgo, mlh, blz, ckt, lok, smo, kpq, eng, nnq, kmr, pir, cab, tuo, bvc, xte, txu, pny, klv, jic, khm, mhy, yli, kha, dop, ojb, gvl, meq, cof, qvo, kqe, btd, bwu, nii, arb, xtn, top, lex, lob, cjp, por, ote, tmc, sun, grt, mcd, sja, naw, plw, zas, soq, khq, cek, ozm, kud, ted, bmq, pan, rjs, ktb, nhw, krj, ycn, ita, tir, prg, kpf, qup, msy, emp, ncu, qxo, ell, mzk, tim, yaz, dtb, upv, cou, noa, nhy, adh, cly, saj, fuq, rmo, gla, sim, apd, kpr, ota, kqp, gso, afr, kxc, mbt, wiu, pbb, cor, qul, gwr, twu, qve, arl, bku, alz, mto, bak, guc, lat, kgr, agm, cwt, iws, mip, ctp, khz, kyq, vie, dad, dug, yas, irk, kez, mza, nou, yue, law, kur, atg, mco, acr, lhu, myb, tik, djk, hae, tpp, yuj, mwq, rav, kzj, tuk, pbi, ffm, kmo, ybb, bgz, slk, cbv, gof, bjz, jiv, lln, xrb, cjo, qxn, prk, cot, xed, dgi, nsn, mpm, bzj, kne, cnl, bhw, gyr, akh, ntp, pls, aoz, som, tlf, xsb, eus, mfe, hak, aby, mej, myw, dsb, kru, snw, tpt, cle, nyy, tgp, agd, btt, mf1, quz, swg, sck, dyo, qvc, due, mmo, nca, oss, urt, hrv, btx, ban, pib, iri, sba, kub, lif, npl, icr, mbh, amu, sag, zpt, pss, gle, azz, hag, lzh, acu, ara, hns, zpq, mio, zty, cuc, usa, dan, miq, akb, nyo, cbi, caa, gdn, pms, mpt, wer, teo, ghs, mxt, fin, mjv, kwd, cax, zpl, ntr, ake, nog, tlj, aah, ach, mit, fij, apz, ceb, gde, gdr, mcp, cui, twb, mta, ncj, ino, men, mhi, mir, pez, quy, yre, asm, bdd, zpm, hot, zpi, kao, kyu, mvc, zpz, nzi, stp, srp, dik, guk, hat, zca, opm, aso, way, uig, krs, dig, sbl, glg, ava, avk, mkd, con, jac, mbb, heb, ces, mwv, wob, ddn, fuf, jbu, chr, kms, kwi, soy, qvn, rap, sxn, sgw, rel, ukr, gnd, bgt, thk, nob, dga, mie, orv, kyz, guh, pag, pse, tfr, cul, bhl, xsr, vag, qvw, nst, azg, muv, pad, cco, ese, gcf, pol, akp, sey, bex, vut, pam, lus, gvc, vol, stn, kdc, gym, med, wuv, gng, pui, kle, arz, myx, aak, hif, ian, sig, ign, mvp, xuo, kup, bbr, amf, zai, cya, nia, raw, nyf, ayp, czt, saq, zae, sah, kzf, swe, jam, poi, dob, hnn, mhr, okv, aze, gor, nij, aai, mkl, ron, isl, cpb, mup, nod, sus, knf, laj, nnb, tqo, bfd, cok, alj, pcm, kpw, myk, bbo, uvl, jbo, kia, kat, mux, agn, bjv, tly, mak, ixi, spp, xtd, ifu, urd, bom, bel, ruf, mhl, kek, bts, nhe, duo, mfz, otq, trs, old, bus, dbq, tcc, bba, cat, tee, cfm, bef, nwb, tca, dgz, cnk, crn, dah, chv, kwf, aom, bcl, nfr, fal, tpw, gos, crh, tnr, deu, yuw, oku, hoc, luc, rim, zar, ndy, pbc, udu, daa, miy, mog, obo, aia, knk, sgb, kbh, aoj, gaw, jvn, hsb, ljp, rnl, acc, avt, kbm, sbd, nhi, itv, yle, kbp, mzm, ame, amk, srn, ido, mqj, acm, box, xla, gag, tem, ses, boa, lmk, ker, bov, lew, bul, gbo, bmv, agu, aau, kkj, smt, ziw, ind, ter, hla, xsu, lef, qwh, zpu, xal, adj, gux, rus, ztq, kij, lgg, alp, frd, agr, miz, nin, mfq, gmv, urb, bpr, hye, boj, bua, wnu, naf, tgl, acd, sgz, lsm, yat, ton, for, fuv, wwa, tue, atq, iry, kyc, rai, pab, grn, hus, tav, lao, sda, tat, ilo, ury, nyn, lis, nkf, mtp, mxb, waj, kpz, aeu, krc, rwo, tbo, mai, avn, npy, vid, wba, mox, sne, yaa, hun, ben, mhx, viv, bav, vun, tuf, gur, cmr, cgc, sld, aon, ttc, ura, wap, dyi, gwi, ann, kue, quw, cbc, mfy, mtj, mya, mti, mgo, ppk, tac, est, ngp, pkb, zpo, cap, zab, fuh, tbc, dos, mag, mcu, xmm, cmn, mil, mww, apr, big, cdf, gvf, mda, lad, cnt, ipi, bon, kki, mqb, gum, kab, ctd, cme, ong, taj, usp, tpz, moz, ina, kvn, quf, thv, mlp, hin, sps, eka, bmr, sdm, mop, ubu, bnj, lem, gog, kbr, ahk, enb, gej, mif, uzb, ixl, dtp, yid, mnb, mpg, bss, ccp, muy, kto, avu, tos, dww, car, qvm, neb, csk, yrb, amn, jun, imo, nmz, gbi, maa, snc, lip, jpn, nak, bkv, awb, iba, mqf, tvw, xsm, cym, cuk, guu, mxv, nan, coe, mgh, msm, fra, sue, amr, rom, gai, kcg, mur, ctg, nlc, nch, yss, gfk, bos, myy, mxq, kpv, dnw, lac, shn, taq, tna, thl, kdj, spa, jav, anv, atb, cbs, ken, yam, asg, spl, zaw, gah, tnn, alt, enq, sqi, mib, yad, zyp, lit, sur, ife, ktj, ifk, nsu, abt, hne, rro, zpc, mfi, gun, hil, run, qxh, lia, dts, lee, ltz, mzw, pis, epo, ptp, tzj, chz, nim, pes, tzt, ngu, mor, pao, wmw, dsh, bwq, sny, zaa, ber, yut, keo, faa, kxm, ndz, arq, not, cko, ceg, dgk, gqr, tlb, bxr, kaq, mnf, jmc, mar, muh, inb, knj, tha, prf, mon, nnw, nhu, mfh, bcw, bre, kmd, acn, quc, hig, pah, sri, bfo, ade, bgr, rmc, cjv, auy, amh, war, guq, bud, tby, tlh, hrx, pmf, kwj, awa, mee, vmy, cpa, heh, dwr, cni, gui, kje, sas, srm, wuu, buk, lgl, xav, kyf, lwo, mal, fai, far, lww, oci, blt, pau, hto, cbr, abi, mbc, mim, rej, sml, min, yby, nno, lfn, roo, kor, yva, toc, tnk, knv, bvz, nds, gna, nhx, nuj, kjh, urk, gub, amm, nho, huu, qvz, mva, ile, grc, bao, mfk, sil, cbk, sll, snn, mcq, mek, slv, ksr, qvs, kaz, kqy, bkl, bib, tur, yml, suk, kaa, huv, krl, bmh, kze, csb, ape, ppo, ttr, ndj, hub, tte, ess, zos, nvm
Supported via rule-based methods (based on unicode blocks and writing system scripts): xsr, ind, cmo, lif, ron, ojb, nod, mww, men, jun, rus, btd, jpn, hil, arb, pol, run, mai, alt, kyu, amh, taj, war, nld, pam, ljp, bud, lus, grt, mak, sun, akb, dzo, bku, urd, bru, tgl, pag, som, kbp, arz, pan, bts, vie, sas, gag, kxm, mjv, taq, fuf, ita, chr, bul, mya, jav, atb, blt, ceb, ccp, bcl, lao, ilo, mar, oss, hnn, btx, rej, ban, lis