langid.py readme

Introduction

langid.py is a standalone Language Identification (LangID) tool.

The design principles are as follows:

  1. Fast
  2. Pre-trained over a large number of languages (currently 97)
  3. Not sensitive to domain-specific features (e.g. HTML/XML markup)
  4. Single .py file with minimal dependencies
  5. Deployable as a web service

All that is required to run langid.py is Python >= 2.7 and numpy. The main script langid/langid.py is cross-compatible with both Python 2 and Python 3, but the accompanying training tools are still Python 2-only.

langid.py is WSGI-compliant. langid.py will use fapws3 as a web server if available, and default to wsgiref.simple_server otherwise.

langid.py comes pre-trained on 97 languages (ISO 639-1 codes given):

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu

The training data was drawn from 5 different sources:

  • JRC-Acquis
  • ClueWeb 09
  • Wikipedia
  • Reuters RCV2
  • Debian i18n

Usage

langid.py [options]

Options:
  -h, --help            show this help message and exit
  -s, --serve           launch web service
  --host=HOST           host/ip to bind to
  --port=PORT           port to listen on
  -v                    increase verbosity (repeat for greater effect)
  -m MODEL              load model from file
  -l LANGS, --langs=LANGS
                        comma-separated set of target ISO639 language codes
                        (e.g. en,de)
  -r, --remote          auto-detect IP address for remote access
  -b, --batch           specify a list of files on the command line
  --demo                launch an in-browser demo application
  -d, --dist            show full distribution over languages
  -u URL, --url=URL     langid of URL
  --line                process pipes line-by-line rather than as a document
  -n, --normalize       normalize confidence scores to probability values

The simplest way to use langid.py is as a command-line tool; invoke it with python langid.py. If you installed langid.py as a Python module (e.g. via pip install langid), you can invoke langid instead of python langid.py (the two are equivalent). This will display a prompt. Enter text to identify, and hit enter:

>>> This is a test
('en', -54.41310358047485)
>>> Questa e una prova
('it', -35.41771221160889)

langid.py can also detect when the input is redirected (only tested under Linux), in which case it will process the input until EOF rather than line-by-line as in interactive mode:

python langid.py < README.rst
('en', -22552.496054649353)

The value returned is the unnormalized probability estimate for the language. Calculating the exact probability estimate is disabled by default, but can be enabled through a flag:

python langid.py -n < README.rst
('en', 1.0)

More details are provided in this README in the section on Probability Normalization.

You can also use langid.py as a Python library:

# python
Python 2.7.2+ (default, Oct  4 2011, 20:06:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import langid
>>> langid.classify("This is a test")
('en', -54.41310358047485)
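
The library also provides a rank function alongside classify (see the changelog below), which returns a score for every language rather than just the single best one. A minimal sketch; the exact scores are whatever the model produces for your input:

>>> import langid
>>> ranked = langid.rank("This is a test")
>>> ranked[0]   # (language, score) pairs, sorted from most to least likely
('en', -54.41310358047485)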

Finally, langid.py can use Python's built-in wsgiref.simple_server (or fapws3 if available) to provide language identification as a web service. To do this, launch python langid.py -s, and access http://localhost:9008/detect. The web service supports GET, POST and PUT. If GET is performed with no data, a simple HTML form interface is displayed.

The response is returned as JSON; here is an example:

{"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

A utility such as curl can be used to access the web service:

# curl -d "q=This is a test" localhost:9008/detect
{"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

You can also use HTTP PUT:

# curl -T readme.rst localhost:9008/detect
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
100  2871  100   119  100  2752    117   2723  0:00:01  0:00:01 --:--:--  2727
{"responseData": {"confidence": -22552.496054649353, "language": "en"}, "responseDetails": null, "responseStatus": 200}

If no "q=XXX" key-value pair is present in the HTTP POST payload, langid.py will interpret the entire file as a single query. This allows for redirection via curl:

# echo "This is a test" | curl -d @- localhost:9008/detect
{"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

langid.py will attempt to discover the host IP address automatically. Often, this resolves to localhost (127.0.1.1) even though the machine has a different external IP address. langid.py can instead attempt to discover the external IP address automatically; to enable this functionality, start langid.py with the -r flag.

langid.py supports constraining the output language set using the -l flag and a comma-separated list of ISO639-1 language codes (the -n flag enables probability normalization):

# python langid.py -n -l it,fr
>>> Io non parlo italiano
('it', 0.99999999988965627)
>>> Je ne parle pas français
('fr', 1.0)
>>> I don't speak english
('it', 0.92210605672341062)

When using langid.py as a library, the set_languages method can be used to constrain the language set:

# python
Python 2.7.2+ (default, Oct  4 2011, 20:06:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import langid
>>> langid.classify("I do not speak english")
('en', 0.57133487679900674)
>>> langid.set_languages(['de','fr','it'])
>>> langid.classify("I do not speak english")
('it', 0.99999835791478453)
>>> langid.set_languages(['en','it'])
>>> langid.classify("I do not speak english")
('en', 0.99176190378750373)

Batch Mode

langid.py supports batch mode processing, which can be invoked with the -b flag. In this mode, langid.py reads a list of paths to files to classify as arguments. If no arguments are supplied, langid.py reads the list of paths from stdin; this is useful for combining langid.py with UNIX utilities such as find.

In batch mode, langid.py uses multiprocessing to invoke multiple instances of the classifier, utilizing all available CPUs to classify documents in parallel.
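
For example, the list of files to classify can be produced by find and piped straight into langid.py (the paths here are illustrative):

find ./documents -name '*.txt' | python langid.py -b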

Probability Normalization

The probabilistic model implemented by langid.py involves the multiplication of a large number of probabilities. For computational reasons, the actual calculations are implemented in the log-probability space (a common numerical technique for dealing with vanishingly small probabilities). One side-effect of this is that it is not necessary to compute a full probability in order to determine the most probable language in a set of candidate languages. However, users sometimes find it helpful to have a "confidence" score for the probability prediction. Thus, langid.py implements a re-normalization that produces an output in the 0-1 range.
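
Conceptually, the re-normalization is a softmax over the per-language log-probabilities, computed in a numerically stable way. A minimal sketch of the idea (not the exact code used in langid.py):

import numpy as np

def norm_probs(log_probs):
    # log_probs holds one unnormalized log-probability per candidate language.
    # Subtracting the maximum before exponentiating avoids overflow, and the
    # exponentiated scores are then rescaled to sum to 1.
    log_probs = np.asarray(log_probs, dtype=float)
    shifted = log_probs - log_probs.max()
    probs = np.exp(shifted)
    return probs / probs.sum()

print(norm_probs([-54.4, -61.2, -70.9]))  # the best-scoring language ends up close to 1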

langid.py disables probability normalization by default. For command-line usage of langid.py, it can be enabled by passing the -n flag. For probability normalization in library use, the user must instantiate their own LanguageIdentifier. An example of such usage is as follows:

>>> from langid.langid import LanguageIdentifier, model
>>> identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
>>> identifier.classify("This is a test")
('en', 0.9999999909903544)

Training a model

We provide a full set of training tools to train a model for langid.py on user-supplied data. The system is parallelized to take full advantage of modern multiprocessor machines, using a sharding technique similar to MapReduce that allows it to run in constant memory.

The full training can be performed using the tool train.py. For research purposes, the process has been broken down into individual steps, and command-line drivers for each step are provided. This allows the user to inspect the intermediates produced, and also allows for some parameter tuning without repeating the more expensive steps in the computation. By far the most expensive step is the computation of information gain, which makes up more than 90% of the total computation time.

The tools are:

  1. index.py - index a corpus. Produce a list of file, corpus, language pairs.
  2. tokenize.py - take an index and tokenize the corresponding files
  3. DFfeatureselect.py - choose features by document frequency
  4. IGweight.py - compute the IG weights for language and for domain
  5. LDfeatureselect.py - take the IG weights and use them to select a feature set
  6. scanner.py - build a scanner on the basis of a feature set
  7. NBtrain.py - learn NB parameters using an indexed corpus and a scanner

The tools can be found in the langid/train subfolder.

Each tool can be called with --help as the only parameter to provide an overview of the functionality.

To train a model, we require multiple corpora of monolingual documents. Each document should be a single file, and each file should be in a 2-deep folder hierarchy, with language nested within domain. For example, we may have a number of English files:

./corpus/domain1/en/File1.txt
./corpus/domainX/en/001-file.xml

To use default settings, very few parameters need to be provided. Given a corpus in the format described above at ./corpus, the following is an example set of invocations that would result in a model being trained, with a brief description of what each step does.

To build a list of training documents:

python index.py ./corpus

This creates a directory corpus.model and produces a list of paths to documents in the corpus, along with their associated language and domain.

We then tokenize the files using the default byte n-gram tokenizer:

python tokenize.py corpus.model

This runs each file through the tokenizer, tabulating the frequency of each token according to language and domain. This information is distributed into buckets according to a hash of the token, such that all the counts for any given token will be in the same bucket.
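
A rough sketch of the bucketing idea (hypothetical helper, not the actual tokenize.py code): because the bucket is chosen by a stable hash of the token itself, partial counts for the same token produced by different workers always land in the same bucket and can simply be summed later.

import zlib

NUM_BUCKETS = 64  # illustrative; the real tool picks its own bucket count

def bucket_for(token):
    # A stable hash (here zlib.crc32) maps a given token to the same bucket
    # no matter which worker produced the count.
    return zlib.crc32(token) % NUM_BUCKETS

buckets = [{} for _ in range(NUM_BUCKETS)]
for token, lang, domain, count in [(b" the", "en", "wikipedia", 3)]:  # illustrative counts
    bucket = buckets[bucket_for(token)]
    key = (token, lang, domain)
    bucket[key] = bucket.get(key, 0) + count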

The next step is to identify the most frequent tokens by document frequency:

python DFfeatureselect.py corpus.model

This sums up the frequency counts per token in each bucket, and produces a list of the highest-DF tokens for use in the IG calculation stage. Note that this implementation of DFfeatureselect assumes byte n-gram tokenization, and will thus select a fixed number of features per n-gram order. If tokenization is replaced with a word-based tokenizer, this step should be adjusted accordingly.
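
A sketch of the document-frequency selection step (illustrative only; it assumes byte n-gram tokens held in memory, whereas the real tool works over the on-disk buckets):

from collections import Counter

def select_by_df(doc_tokens, feats_per_order=4):
    # doc_tokens: an iterable of sets, one set of byte n-grams per document.
    # Document frequency counts a token once per document, not once per occurrence.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))
    # Keep the top features separately for each n-gram order (i.e. token length).
    selected = []
    for order in sorted({len(t) for t in df}):
        by_order = sorted((t for t in df if len(t) == order), key=df.get, reverse=True)
        selected.extend(by_order[:feats_per_order])
    return selected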

We then compute the IG weights of each of the top features by DF. This is computed separately for domain and for language:

python IGweight.py -d corpus.model
python IGweight.py -lb corpus.model
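
Information gain here measures how much knowing whether a token occurs reduces uncertainty about the class (language in one run, domain in the other). A sketch of the calculation for a single binary feature (illustrative, not the vectorized implementation in IGweight.py):

import numpy as np

def entropy(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def information_gain(with_t, without_t):
    # with_t[c]    = number of documents of class c that contain the token
    # without_t[c] = number of documents of class c that do not contain it
    with_t = np.asarray(with_t, dtype=float)
    without_t = np.asarray(without_t, dtype=float)
    total = with_t + without_t
    p_t = with_t.sum() / total.sum()
    conditional = p_t * entropy(with_t) + (1 - p_t) * entropy(without_t)
    return entropy(total) - conditional

# A token concentrated in one language scores high when the class is "language"
print(information_gain(with_t=[90, 5], without_t=[10, 95]))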

Based on the IG weights, we compute the LD score for each token:

python LDfeatureselect.py corpus.model

This produces the final list of LD features to use for building the NB model.

We then assemble the scanner:

python scanner.py corpus.model

The scanner is a compiled DFA over the set of features that can be used to count the number of times each of the features occurs in a document in a single pass over the document. This DFA is built using Aho-Corasick string matching.
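
A toy stand-in for what the scanner computes (the number of times each selected byte sequence occurs in the document, overlaps included); the real implementation obtains the same counts in a single pass via the Aho-Corasick automaton rather than scanning once per feature:

def count_features(text, features):
    # Naive substring counting, for illustration only
    counts = dict.fromkeys(features, 0)
    for feat in features:
        start = text.find(feat)
        while start != -1:
            counts[feat] += 1
            start = text.find(feat, start + 1)
    return counts

print(count_features(b"this is a test", [b"is", b" a ", b"te"]))
# {b'is': 2, b' a ': 1, b'te': 1}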

Finally, we learn the actual Naive Bayes parameters:

python NBtrain.py corpus.model

This performs a second pass over the entire corpus, tokenizing it with the scanner from the previous step, and computing the Naive Bayes parameters P(C) and P(t|C). It then compiles the parameters and the scanner into a model compatible with langid.py.
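
A sketch of what those two parameter sets look like for a toy corpus (illustrative only; NBtrain.py computes them from the scanner's feature counts and stores them as log-probabilities):

import numpy as np

def nb_parameters(feature_counts, doc_counts):
    # feature_counts: array of shape (num_features, num_classes); entry [t, c]
    #                 is how often feature t occurred in documents of class c.
    # doc_counts:     number of training documents per class.
    feature_counts = np.asarray(feature_counts, dtype=float)
    doc_counts = np.asarray(doc_counts, dtype=float)
    nb_pc = np.log(doc_counts / doc_counts.sum())       # log P(C)
    smoothed = feature_counts + 1                        # add-one smoothing
    nb_ptc = np.log(smoothed / smoothed.sum(axis=0))     # log P(t|C)
    return nb_ptc, nb_pc

nb_ptc, nb_pc = nb_parameters(feature_counts=[[8, 1], [2, 9]], doc_counts=[10, 10])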

In this example, the final model will be at the following path:

./corpus.model/model

This model can then be used in langid.py by invoking it with the -m command-line option as follows:

python langid.py -m ./corpus.model/model

It is also possible to edit langid.py directly to embed the new model string.

Read more

langid.py is based on our published research. [1] describes the LD feature selection technique in detail, and [2] provides more detail about the module langid.py itself. [3] compares the speed of langid.py to Google's Compact Language Detector 2 (CLD2, as used in Chrome), as well as to my own pure-C implementation and to the authors' implementation on specialized hardware.

[1] Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553-561. Available from http://www.aclweb.org/anthology/I11-1062

[2] Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool, In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea. Available from http://www.aclweb.org/anthology/P12-3005

[3] Kenneth Heafield and Rohan Kshirsagar and Santiago Barona (2015) Language Identification and Modeling in Specialized Hardware, In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Available from http://aclweb.org/anthology/P15-2063

Contact

Marco Lui <[email protected]>

I appreciate any feedback, and I'm particularly interested in hearing about places where langid.py is being used. I would love to know more about situations where you have found that langid.py works well, and about any shortcomings you may have found.

Acknowledgements

Thanks to aitzol for help with packaging langid.py for PyPI. Thanks to pquentin for suggestions and improvements to packaging.

Related Implementations

Dawid Weiss has ported langid.py to Java, with a particular focus on speed and memory use. Available from https://github.com/carrotsearch/langid-java

I have written a Pure-C version of langid.py, which an external evaluation (see Read more) has found to be up to 20x as fast as the pure Python implementation here. Available from https://github.com/saffsd/langid.c

I have also written a JavaScript version of langid.py which runs entirely in the browser. Available from https://github.com/saffsd/langid.js

Changelog

v1.0:
  • Initial release
v1.1:
  • Reorganized internals to implement a LanguageIdentifier class
v1.1.2:
  • Added a 'langid' entry point
v1.1.3:
  • Made classify and rank return Python data types rather than numpy ones
v1.1.4:
  • Added set_languages to __init__.py, fixing #10 (and properly fixing #8)
v1.1.5:
  • remove dev tag
  • add PyPI classifiers, fixing #34 (thanks to pquentin)
v1.1.6:
  • make nb_numfeats an int, fixes #46, thanks to @remibolcom
Comments
  • pip install langid does not work

    Hello, I'm trying to install the library with pip on OS X 10.6, Python 3.3. Running the pip command I receive the following:

    $ pip install langid
    Downloading/unpacking langid
    Could not find a version that satisfies the requirement langid (from versions: 1.0dev, 1.1.1dev, 1.1.2dev, 1.1.3dev, 1.1.4dev, 1.1dev)
    Cleaning up...
    No distributions matching the version for langid
    Storing debug log for failure in /Users/xx/.pip/pip.log
    

    So I downloaded the package and tried to install it without pip

    $ python setup.py install
    

    Looked good until here:

    Extracting langid-1.1.4dev-py2.6.egg to /Library/Python/2.6/site-packages
    SyntaxError: ('invalid syntax', ('/Library/Python/2.6/site-packages/langid-1.1.4dev-py2.6.egg/langid/train/common.py', 39, 34, "  with gzip.open(path, 'rb') as f, tempfile.TemporaryFile() as t:\n"))
    
    Adding langid 1.1.4dev to easy-install.pth file
    error: /Library/Python/2.6/site-packages/easy-install.pth: Permission denied
    

    Why isn't it installing? Furthermore, I don't want it to install with Python 2.6 (I guess that's the default Python version on OS X 10.6) but with Python 3.3.

    opened by Stophface 14
  • Strange classifications

    I get these results for English text:

    >>> feeling
    ('en', 0.16946150595865342)
    >>> good
    ('en', 0.16946150595865342)
    >>> feeling good
    ('de', 0.2691886134361688)
    >>> 
    

    Am I right to assume that these results could be more accurate for English if I improved the training data for the English language?

    opened by corpulent 9
  • training data

    Thanks for making langid available! It's awesome! We (researchers at Carnegie Mellon University) would like to augment the training data with more languages. Shall we send you the data so that you can retrain the models when your time permits? Alternatively, feel free to send us the data and we would retrain the models ourselves.

    many thanks! waleed ammar

    opened by wammar 9
  • wrong detection

    Hello,

    With the English text "Ángel Di María: Louis van Gaal dynamic was why I left Manchester United", the classifier returns ('la', 0.9665266986710674), presumably because "Ángel Di María" is a Latin name.

    Is there any way to overcome this situation?

    Thanks in advance, Canh

    opened by canhduong28 7
  • Class probability computation is very inefficient (patch enclosed)

    The following patch produces the same output with a 4.4-fold speedup for language identification (not counting startup time) in --line mode given 650-byte average line lengths, and a 33-fold speedup with 62-byte average line lengths when using the default language model. Larger models with more features show an even larger speedup.

    The speedup results from avoiding a matrix multiplication against a feature-count vector which is mostly zeros. You may wish to tweak the cut-over from "short" to "long" texts by adjusting the self.nb_numfeats/10; it could probably be moved higher, but I was being conservative.

    259a260,302

    # optimized version by Ralf Brown
    def instance2classprobs(self, text):
        """Compute class probabilities for an instance according to the trained model"""
        if isinstance(text, unicode):
            text = text.encode('utf8')

        # Convert the text to a sequence of ascii values
        ords = map(ord, text)

        state = 0
        if len(ords) < self.nb_numfeats / 10:
            # for very short texts, just apply each production every time the
            # state changes, rather than counting the number of occurrences of
            # each state
            pdc = np.zeros(len(self.nb_classes))
            for letter in ords:
                state = self.tk_nextmove[(state << 8) + letter]
                for index in self.tk_output.get(state, []):
                    # compute the dot product incrementally, avoiding lots
                    # of multiplications by zero with a sparse
                    # feature-count vector
                    pdc += self.nb_ptc[index]
        else:
            # Count the number of times we enter each state
            statecount = defaultdict(int)
            for letter in ords:
                state = self.tk_nextmove[(state << 8) + letter]
                statecount[state] += 1

            # Update all the productions corresponding to the state
            arr = np.zeros((self.nb_numfeats,), dtype='uint32')
            for state in statecount:
                for index in self.tk_output.get(state, []):
                    arr[index] += statecount[state]
            # compute the partial log-probability of the document given each class
            pdc = np.dot(arr, self.nb_ptc)

        # compute the partial log-probability of the document in each class
        pd = pdc + self.nb_pc
        return pd

    271,272c314,315
    <     fv = self.instance2fv(text)
    <     probs = self.norm_probs(self.nb_classprobs(fv))
    >     probs = self.instance2classprobs(text)
    >     probs = self.norm_probs(probs)

    282,283c325,326
    <     fv = self.instance2fv(text)
    <     probs = self.norm_probs(self.nb_classprobs(fv))
    >     probs = self.instance2classprobs(text)
    >     probs = self.norm_probs(probs)

    opened by ralfbrown 7
  • Training a new language on Windows doesn't work

    I am trying to train it on some language files I downloaded from the internet. But unfortunately no matter what I try, it always crashes.

    D:\Django\langid\Scripts>python.exe LDfeatureselect.py -c d:\corpus\wikipedia\langid\corpus -o features -j 1
    output path: features
    temp path: c:\users\nick\appdata\local\temp
    corpus path: d:\corpus\wikipedia\langid\corpus
    will tokenize 2 files
    langs: ['am', 'af']
    domains: ['domain1']
    chunk size: 1 (3 chunks)
    Traceback (most recent call last):
      File "LDfeatureselect.py", line 533, in
        chunk_paths, features, chunk_offsets = build_inverted_index(paths, options)
      File "LDfeatureselect.py", line 423, in build_inverted_index
        for i, keycount in enumerate(pass1_out):
      File "C:\Python27\Lib\multiprocessing\pool.py", line 626, in next
        raise value
    OSError: [Errno 9] Bad file descriptor

    I am using Python 2.7.3 on Windows 7 64bit and the latest version of langid.

    opened by ghost 7
  • Different result when giving the same text

    I have a database from which I read. I want to identify the language in a specific cell, defined by column.

    I read from my database like this:

    connector = sqlite3.connect("somedb.db")
    selecter = connector.cursor()
    selecter.execute(''' SELECT tags FROM sometable''')
    for row in selecter: #iterate through all the rows in db
        #print (type(row)) #tuple
        rf = str(row)
        #print (type(rf)) #string
        lan = langid.classify("{}".format(rf))
    

    Technically, it works. It identifies the languages used and later on (not displayed here) writes the identified language back into the database.

    So, now comes the weird part. I wanted to double check some results manually. So I have these words:

    a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
    

    When I perform the language identification on the database, it writes Portuguese into the database. But performing it like this:

    a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
    lan = langid.classify(a)
    

    Well, that returns French. Apart from the fact that the text is neither French nor Portuguese, why does it return different results?!

    opened by Stophface 6
  • Fix versioning

    Your versioning scheme is not PEP 440 compliant, which means that many people, as you have noticed, have issues when installing langid.py with pip.

    First, 1.1.4dev is not valid, you need to use 1.1.4.dev0 instead.

    Second, is there any good reason to choose a dev suffix? Can you simply remove tag_build = dev from setup.cfg before releases?

    Thanks!

    opened by pquentin 5
  • Fix pypi classifiers

    Your setup.py file does not define any PyPI classifiers, which I think prevents pip3 install --pre langid. Here is the list: http://pypi.python.org/pypi?%3Aaction=list_classifiers. I think we should start with:

    Programming Language :: Python :: 2
    Programming Language :: Python :: 2.7
    Programming Language :: Python :: 3
    

    This would fix the Python 3 issue. But since we're off to such a good start, we could continue.

    You then need to choose a status. Either Beta or Production/Stable, probably Production/Stable.

    Development Status :: 4 - Beta
    Development Status :: 5 - Production/Stable
    

    Then, an intended audience. I suggest:

    Intended Audience :: Developers
    Intended Audience :: Science/Research
    

    Then the license. Is your custom license OSI Approved?

    And finally a topic:

    Topic :: Scientific/Engineering :: Artificial Intelligence
    

    Once you tell me your choices, I can send a pull request. Thanks!

    opened by pquentin 5
  • Make redirection in a stream fashion

    Hello,

    I am trying out this fancy language detection script, and I am wondering: do you think it makes more sense to change for line in sys.stdin.readlines(): into for line in sys.stdin:

    in langid.py?

    opened by parafish 5
  • Training on Windows returns error at DFfeatureselect.py step

    I'm trying to train a new language identifier model on my own languages dataset. Unfortunately, it crashes at the DFfeatureselect.py step, returning a "TypeError: marshal.load() arg must be file" error message. Below is the log up to the crash point.

    C:\langid.py-master\langid\train>C:\Python27\python.exe train.py corpus
    corpus path: corpus
    model path: ..model
    langs(22): el(26) eo(42) en(1674) af(285) ca(287) am(2426) an(226) cy(79) ar(82) cs(432) et(449) az(534) es(457) be(292) bg(818) bn(65) de(2795) da(90) dz(220) br(532) bs(493) as(101)
    domains(1): domain(12405)
    identified 12405 documents
    will tokenize 12405 documents
    using byte NGram tokenizer, max_order: 4
    chunk size: 50 (249 chunks)
    job count: 8
    whole-document tokenization
    tokenized chunk (1/249) [11880 keys]
    tokenized chunk (2/249) [12305 keys]
    ... (tokenized chunk lines for chunks 3-248) ...
    tokenized chunk (249/249) [21140 keys]
    Traceback (most recent call last):
      File "train.py", line 196, in
        doc_count = tally(b_dirs, args.jobs)
      File "C:\langid.py-master\langid\train\DFfeatureselect.py", line 92, in tally
        for i, keycount in enumerate(pass_sum_df_out):
      File "C:\Python27\lib\multiprocessing\pool.py", line 620, in next
        raise value
    TypeError: marshal.load() arg must be file

    opened by jd-coderepos 4
  • Dead Languages

    Hi, I just stumbled over langid and then, when trying out how suitable it'd be for my purposes, stumbled over this:

    ❯ echo 'در' | langid -l ar,fa,ota
    Traceback (most recent call last):
      File "/home/jrs/.local/bin/langid", line 8, in <module>
        sys.exit(main())
      File "/home/jrs/.local/lib/python3.9/site-packages/langid/langid.py", line 504, in main
        identifier.set_languages(langs)
      File "/home/jrs/.local/lib/python3.9/site-packages/langid/langid.py", line 245, in set_languages
        raise ValueError("Unknown language code %s" % lang)
    ValueError: Unknown language code ota
    

    What is the project's policy towards ISO 639-2 (as opposed to ISO 639-1 only)? Any chance there'll be support for three-letter codes such as the ota of this example at some point? Or at least a trace-less error message? :-)

    opened by sixtyfive 0
  • Issue with Batch Training

    When running batch training with the -d flag, the following error is output:

    line 585, in main
        writer.writerow(['path']+nb_classes)
    NameError: name 'nb_classes' is not defined

    Looks like there is a misplaced variable assignment. Should be a quick fix for someone more familiar with the code.

    opened by bonham79 1
  • the text "Hello China" is detected as 'it'

    When I detect "Hello China" with print(langid.classify("Hello China")), the result is: ('it', -37.309250354766846) @Paczesiowa @pquentin @martinth @jnothman @saffsd

    opened by gaowenxin95 6
  • TypeError: function takes exactly 0 arguments (1 given)

    It runs without problems on my local Windows machine (Python 3.6), but on an Ubuntu server (Python 3.7) it reports the following error:

    Traceback (most recent call last):
      File "lan_det.py", line 9, in <module>
        print(lan_det(text))
      File "lan_det.py", line 6, in lan_det
        return langid.classify(text)
      File "/home/env/rfh_01/lib/python3.7/site-packages/langid-1.1.6-py3.7.egg/langid/langid.py", line 105, in classify
        load_model()
      File "/home/env/rfh_01/lib/python3.7/site-packages/langid-1.1.6-py3.7.egg/langid/langid.py", line 164, in load_model
        identifier = LanguageIdentifier.from_modelstring(model)
      File "/home/env/rfh_01/lib/python3.7/site-packages/langid-1.1.6-py3.7.egg/langid/langid.py", line 176, in from_modelstring
        z = bz2.decompress(b)
      File "/usr/local/lib/python3.7/bz2.py", line 346, in decompress
        res = decomp.decompress(data)
    TypeError: function takes exactly 0 arguments (1 given)

    Any pointers would be appreciated.

    opened by RFHzhj 0
  • if wordn is set in tokenize.py, the max_order in DFfeatureselect.py is according to words or bytes?

    If word 3-grams are set in tokenize.py, is the unit of max_order in DFfeatureselect.py words or bytes? I ask because in some languages a single character takes up several bytes.

    opened by RyanPeking 0