👄 The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike

Peter M. Stahl

Last update: Dec 28, 2022

Overview

Quick Info

- this library tries to solve language detection of very short words and phrases, even shorter than tweets
- makes use of both statistical and rule-based approaches
- outperforms Apache Tika, Apache OpenNLP and Optimaize Language Detector for more than 70 languages
- works within every Java 6+ application and on Android
- no additional training of language models necessary
- API for adding your own language models
- offline usage without having to connect to an external service or API
- can be used in a REPL for a quick try-out

1. What does this library do?
2. Why does this library exist?
3. Which languages are supported?
4. How good is it?
5. Why is it better than other libraries?
6. Test report and plot generation
7. How to add it to your project?
   7.1 Using Gradle
   7.2 Using Maven
8. How to build?
9. How to use?
   9.1 Programmatic use
   9.2 Standalone mode
10. What's next for version 1.1.0?

1. What does this library do?

Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages.

2. Why does this library exist?

Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full-fledged functionality of those systems or don't want to learn the ropes of those, a small flexible library comes in handy.

So far, the three other comprehensive open source libraries on the JVM for this task are Apache Tika, Apache OpenNLP and Optimaize Language Detector. Unfortunately, the latter in particular has three major drawbacks:

- Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, it doesn't provide adequate results.
- The more languages take part in the decision process, the less accurate the detection results are.
- Configuration of the library is quite cumbersome and requires some knowledge about the statistical methods that are used internally.

Lingua aims at eliminating these problems. It needs almost no configuration and yields pretty accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.

3. Which languages are supported?

Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, the following 74 languages are supported:

A
- Afrikaans
- Albanian
- Arabic
- Armenian
- Azerbaijani
B
- Basque
- Belarusian
- Bengali
- Norwegian Bokmal
- Bosnian
- Bulgarian
C
- Catalan
- Chinese
- Croatian
- Czech
D
- Danish
- Dutch
E
- English
- Esperanto
- Estonian
F
- Finnish
- French
G
- Ganda
- Georgian
- German
- Greek
- Gujarati
H
- Hebrew
- Hindi
- Hungarian
I
- Icelandic
- Indonesian
- Irish
- Italian
J
- Japanese
K
- Kazakh
- Korean
L
- Latin
- Latvian
- Lithuanian
M
- Macedonian
- Malay
- Marathi
- Mongolian
N
- Norwegian Nynorsk
P
- Persian
- Polish
- Portuguese
- Punjabi
R
- Romanian
- Russian
S
- Serbian
- Shona
- Slovak
- Slovene
- Somali
- Sotho
- Spanish
- Swahili
- Swedish
T
- Tagalog
- Tamil
- Telugu
- Thai
- Tsonga
- Tswana
- Turkish
U
- Ukrainian
- Urdu
V
- Vietnamese
W
- Welsh
X
- Xhosa
Y
- Yoruba
Z
- Zulu

4. How good is it?

Lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts:

- a list of single words with a minimum length of 5 characters
- a list of word pairs with a minimum length of 10 characters
- a list of complete grammatical sentences of various lengths

Both the language models and the test data have been created from separate documents of the Wortschatz corpora offered by Leipzig University, Germany. Data crawled from various news websites have been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted, respectively.

Given the generated test data, I have compared the detection results of Lingua, Apache Tika, Apache OpenNLP and Optimaize Language Detector using parameterized JUnit tests running over the data of Lingua's supported 74 languages. Languages that are not supported by the other libraries are simply ignored for those during the detection process.

The box plot below shows the distribution of the averaged accuracy values for all three performed tasks: Single word detection, word pair detection and sentence detection. Lingua clearly outperforms its contenders. Bar plots for each language and further box plots for the separate detection tasks can be found in the file ACCURACY_PLOTS.md. Detailed statistics including mean, median and standard deviation values for each language and classifier are available in the file ACCURACY_TABLE.md.

5. Why is it better than other libraries?

Every language detector uses a probabilistic n-gram model trained on the character distribution in some training corpus. Most libraries only use n-grams of size 3 (trigrams), which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough. The shorter the input text is, the fewer n-grams are available. The probabilities estimated from so few n-grams are not reliable. This is why Lingua makes use of n-grams of sizes 1 up to 5, which results in much more accurate prediction of the correct language.
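To make the effect of the n-gram size concrete, here is a minimal sketch of character n-gram extraction. It is not Lingua's actual code; the class and method names are made up for illustration, and it works on plain UTF-16 chars.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lingua's API.
final class NgramSketch {

    /** Returns all character n-grams of the given size, in order of appearance. */
    static List<String> ngrams(String text, int size) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + size <= text.length(); i++) {
            result.add(text.substring(i, i + size));
        }
        return result;
    }

    /** Collects the n-grams of every size from 1 up to maxSize. */
    static List<String> ngramsUpTo(String text, int maxSize) {
        List<String> result = new ArrayList<>();
        for (int size = 1; size <= maxSize; size++) {
            result.addAll(ngrams(text, size));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("langu", 3));            // prints [lan, ang, ngu]
        System.out.println(ngramsUpTo("langu", 5).size()); // prints 15
    }
}
```

A five-letter word yields only 3 trigrams but 15 n-grams of sizes 1 to 5, so there is considerably more evidence to estimate probabilities from.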

A second important difference is that Lingua does not only use such a statistical model, but also a rule-based engine. This engine first determines the alphabet of the input text and searches for characters which are unique in one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore. In any case, the rule-based engine filters out languages that do not satisfy the conditions of the input text. Only then, in a second step, the probabilistic n-gram model is taken into consideration. This makes sense because loading fewer language models means less memory consumption and better runtime performance.
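The filtering step can be pictured roughly as follows. This is a simplified sketch and not Lingua's actual rule engine: it assumes a hypothetical mapping from language names to a single Unicode script each, and it only checks scripts, not language-specific characters.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lingua's API.
final class ScriptFilterSketch {

    /** Collects the Unicode scripts of all letters in the text. */
    static Set<UnicodeScript> scriptsOf(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> scripts.add(UnicodeScript.of(cp)));
        return scripts;
    }

    /** Keeps only the candidate languages whose script occurs in the text. */
    static Set<String> filter(String text, Map<String, UnicodeScript> candidates) {
        Set<UnicodeScript> found = scriptsOf(text);
        Set<String> remaining = new LinkedHashSet<>();
        for (Map.Entry<String, UnicodeScript> entry : candidates.entrySet()) {
            if (found.contains(entry.getValue())) {
                remaining.add(entry.getKey());
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Assumed language-to-script mapping, for illustration only.
        Map<String, UnicodeScript> candidates = new LinkedHashMap<>();
        candidates.put("ENGLISH", UnicodeScript.LATIN);
        candidates.put("GERMAN", UnicodeScript.LATIN);
        candidates.put("RUSSIAN", UnicodeScript.CYRILLIC);
        System.out.println(filter("Привет мир", candidates));
    }
}
```

Only the languages that survive this filter would then need their n-gram models loaded.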

In general, it is always a good idea to restrict the set of languages to be considered in the classification process using the respective API methods. If you know beforehand that certain languages are never to occur in an input text, do not let those take part in the classification process. The filtering mechanism of the rule-based engine is quite good; however, filtering based on your own knowledge of the input text is always preferable.

6. Test report and plot generation

If you want to reproduce the accuracy results above, you can generate the test reports yourself for all four classifiers and all languages by doing:

./gradlew accuracyReport

You can also restrict the classifiers and languages to generate reports for by passing arguments to the Gradle task. The following task generates reports for Lingua and the languages English and German only:

./gradlew accuracyReport -Pdetectors=Lingua -Planguages=English,German

By default, only a single CPU core is used for report generation. If you have a multi-core CPU in your machine, you can fork as many processes as you have CPU cores. This speeds up report generation significantly. However, be aware that forking more than one process can consume a lot of RAM. You do it like this:

./gradlew accuracyReport -PcpuCores=2

For each detector and language, a test report file is then written into /accuracy-reports, to be found next to the src directory. As an example, here is the current output of the Lingua German report:

com.github.pemistahl.lingua.report.lingua.GermanDetectionAccuracyReport

##### GERMAN #####

>>> Accuracy on average: 89.10%

>> Detection of 1000 single words (average length: 9 chars)
Accuracy: 73.60%
Erroneously classified as DUTCH: 2.30%, ENGLISH: 2.10%, DANISH: 2.10%, LATIN: 2.00%, BOKMAL: 1.60%, ITALIAN: 1.20%, BASQUE: 1.20%, FRENCH: 1.20%, ESPERANTO: 1.10%, SWEDISH: 1.00%, AFRIKAANS: 0.80%, TSONGA: 0.70%, PORTUGUESE: 0.60%, NYNORSK: 0.60%, FINNISH: 0.50%, YORUBA: 0.50%, ESTONIAN: 0.50%, WELSH: 0.50%, SOTHO: 0.50%, SPANISH: 0.40%, SWAHILI: 0.40%, IRISH: 0.40%, ICELANDIC: 0.40%, POLISH: 0.40%, TSWANA: 0.40%, TAGALOG: 0.30%, CATALAN: 0.30%, BOSNIAN: 0.30%, LITHUANIAN: 0.20%, INDONESIAN: 0.20%, ALBANIAN: 0.20%, SLOVAK: 0.20%, ZULU: 0.20%, CROATIAN: 0.20%, ROMANIAN: 0.20%, XHOSA: 0.20%, TURKISH: 0.10%, LATVIAN: 0.10%, MALAY: 0.10%, SLOVENE: 0.10%, SOMALI: 0.10%

>> Detection of 1000 word pairs (average length: 18 chars)
Accuracy: 94.00%
Erroneously classified as DUTCH: 0.90%, LATIN: 0.80%, ENGLISH: 0.70%, SWEDISH: 0.60%, DANISH: 0.50%, FRENCH: 0.40%, BOKMAL: 0.30%, TAGALOG: 0.20%, IRISH: 0.20%, SWAHILI: 0.20%, TURKISH: 0.10%, ZULU: 0.10%, ESPERANTO: 0.10%, ESTONIAN: 0.10%, FINNISH: 0.10%, ITALIAN: 0.10%, NYNORSK: 0.10%, ICELANDIC: 0.10%, AFRIKAANS: 0.10%, SOMALI: 0.10%, TSONGA: 0.10%, WELSH: 0.10%

>> Detection of 1000 sentences (average length: 111 chars)
Accuracy: 99.70%
Erroneously classified as DUTCH: 0.20%, LATIN: 0.10%

The plots have been created with Python and the libraries Pandas, Matplotlib and Seaborn. If you have a global Python 3 installation and the python3 command available on your command line, you can redraw the plots after modifying the test reports by executing the following Gradle task:

./gradlew drawAccuracyPlots

The detailed table in the file ACCURACY_TABLE.md containing all accuracy values can be written with:

./gradlew writeAccuracyTable

7. How to add it to your project?

Lingua is hosted on JCenter and Maven Central.

7.1 Using Gradle

// Groovy syntax
implementation 'com.github.pemistahl:lingua:1.0.3'

// Kotlin syntax
implementation("com.github.pemistahl:lingua:1.0.3")

7.2 Using Maven

<dependency>
    <groupId>com.github.pemistahl</groupId>
    <artifactId>lingua</artifactId>
    <version>1.0.3</version>
</dependency>

8. How to build?

Lingua uses Gradle to build and requires Java >= 1.8 for that.

git clone https://github.com/pemistahl/lingua.git
cd lingua
./gradlew build

Several jar archives can be created from the project.

- ./gradlew jar assembles lingua-1.0.3.jar containing the compiled sources only.
- ./gradlew sourcesJar assembles lingua-1.0.3-sources.jar containing the plain source code.
- ./gradlew jarWithDependencies assembles lingua-1.0.3-with-dependencies.jar containing the compiled sources and all external dependencies needed at runtime. This jar file can be included in projects without dependency management systems. You should be able to use it in your Android project as well by putting it in your project's lib folder. This jar file can also be used to run Lingua in standalone mode (see below).

9. How to use?

Lingua can be used programmatically in your own code or in standalone mode.

9.1 Programmatic use

The API is pretty straightforward and can be used in both Kotlin and Java code.

/* Kotlin */

import com.github.pemistahl.lingua.api.*
import com.github.pemistahl.lingua.api.Language.*

val detector: LanguageDetector = LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH).build()
val detectedLanguage: Language = detector.detectLanguageOf(text = "languages are awesome")

By default, Lingua returns the most likely language for a given input text. However, there are certain words that are spelled the same in more than one language. The word prologue, for instance, is both a valid English and French word. Lingua would output either English or French which might be wrong in the given context. For cases like that, it is possible to specify a minimum relative distance that the logarithmized and summed up probabilities for each possible language have to satisfy. It can be stated in the following way:

val detector = LanguageDetectorBuilder
    .fromAllLanguages()
    .withMinimumRelativeDistance(0.25) // minimum: 0.00 maximum: 0.99 default: 0.00
    .build()

Be aware that the distance between the language probabilities is dependent on the length of the input text. The longer the input text, the larger the distance between the languages. So if you want to classify very short text phrases, do not set the minimum relative distance too high. Otherwise you will get most results returned as Language.UNKNOWN which is the return value for cases where language detection is not reliably possible.
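The following sketch illustrates one plausible reading of this rule. It is an assumption made for illustration, not Lingua's actual implementation: it takes per-language scores that have already been normalized so that the best candidate has the value 1.0, and it returns UNKNOWN when the gap between the two best candidates is smaller than the configured distance. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lingua's API.
final class DistanceSketch {

    static final String UNKNOWN = "UNKNOWN";

    /** Picks the best language, or UNKNOWN if the runner-up is too close. */
    static String decide(Map<String, Double> scores, double minimumRelativeDistance) {
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(scores.entrySet());
        // Highest score first.
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        if (sorted.isEmpty()) {
            return UNKNOWN;
        }
        if (sorted.size() == 1) {
            return sorted.get(0).getKey();
        }
        double distance = sorted.get(0).getValue() - sorted.get(1).getValue();
        return distance < minimumRelativeDistance ? UNKNOWN : sorted.get(0).getKey();
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new java.util.HashMap<>();
        scores.put("ENGLISH", 1.0);
        scores.put("FRENCH", 0.95);
        System.out.println(decide(scores, 0.25));
        System.out.println(decide(scores, 0.0));
    }
}
```

Under this reading, a short ambiguous input whose two best scores lie close together is reported as UNKNOWN once the threshold exceeds their gap, which matches the warning above about setting the value too high for short texts.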

The public API of Lingua never returns null, so it is safe to use from within Java code as well.

/* Java */

import com.github.pemistahl.lingua.api.*;
import static com.github.pemistahl.lingua.api.Language.*;

final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH).build();
final Language detectedLanguage = detector.detectLanguageOf("languages are awesome");

There might be classification tasks where you know beforehand that your language data is definitely not written in Latin, for instance (what a surprise :-). The detection accuracy can become better in such cases if you exclude certain languages from the decision process or just explicitly include relevant languages:

// include all languages available in the library
// WARNING: in the worst case this produces high memory 
//          consumption of approximately 3.5GB 
//          and slow runtime performance
LanguageDetectorBuilder.fromAllLanguages()

// include only languages that are not yet extinct (= currently excludes Latin)
LanguageDetectorBuilder.fromAllSpokenLanguages()

// include only languages written with Cyrillic script
LanguageDetectorBuilder.fromAllLanguagesWithCyrillicScript()

// exclude only the Spanish language from the decision algorithm
LanguageDetectorBuilder.fromAllLanguagesWithout(Language.SPANISH)

// only decide between English and German
LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.GERMAN)

// select languages by ISO 639-1 code
LanguageDetectorBuilder.fromIsoCodes639_1(IsoCode639_1.EN, IsoCode639_1.DE)

// select languages by ISO 639-3 code
LanguageDetectorBuilder.fromIsoCodes639_3(IsoCode639_3.ENG, IsoCode639_3.DEU)

Knowing the most likely language is nice, but how reliable is the computed likelihood? And how much less likely are the other examined languages in comparison to the most likely one? These questions can be answered as well:

val detector = LanguageDetectorBuilder.fromLanguages(GERMAN, ENGLISH, FRENCH, SPANISH).build()
val confidenceValues = detector.computeLanguageConfidenceValues(text = "Coding is fun.")

// {
//   ENGLISH=1.0, 
//   GERMAN=0.8665738136456169, 
//   FRENCH=0.8249537317466078, 
//   SPANISH=0.7792362923625288
// }

In the example above, a map of all possible languages is returned, sorted by their confidence value in descending order. The values that the detector computes are part of a relative confidence metric, not of an absolute one. Each value is a number between 0.0 and 1.0. The most likely language is always returned with value 1.0. All other languages are assigned values lower than 1.0, denoting how much less likely those languages are in comparison to the most likely language.

The map returned by this method does not necessarily contain all languages which the calling instance of LanguageDetector was built from. If the rule-based engine decides that a specific language is truly impossible, then it will not be part of the returned map. Likewise, if no ngram probabilities can be found within the detector's languages for the given input text, the returned map will be empty. The confidence value for each language not being part of the returned map is assumed to be 0.0.
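A relative metric of this kind could be computed as in the sketch below. The formula is an assumption made for illustration and is not taken from Lingua's source: it expects one summed log-probability per language (all strictly negative) and divides the best sum by each language's sum. The names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lingua's API.
final class ConfidenceSketch {

    /**
     * Turns summed log-probabilities (all strictly negative) into relative
     * confidence values, sorted in descending order.
     */
    static Map<String, Double> relativeConfidences(Map<String, Double> summedLogProbabilities) {
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(summedLogProbabilities.entrySet());
        // Sums closer to zero are more likely, so sort in descending order.
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        Map<String, Double> confidences = new LinkedHashMap<>();
        if (sorted.isEmpty()) {
            return confidences;
        }
        double best = sorted.get(0).getValue();
        for (Map.Entry<String, Double> entry : sorted) {
            // best / value is 1.0 for the top language and at most 1.0 otherwise.
            confidences.put(entry.getKey(), best / entry.getValue());
        }
        return confidences;
    }

    /** Languages missing from the map are treated as having confidence 0.0. */
    static double confidenceOf(Map<String, Double> confidences, String language) {
        return confidences.getOrDefault(language, 0.0);
    }

    public static void main(String[] args) {
        Map<String, Double> sums = new LinkedHashMap<>();
        sums.put("GERMAN", -12.5);
        sums.put("ENGLISH", -10.0);
        System.out.println(relativeConfidences(sums));
    }
}
```

The lookup helper mirrors the convention described above: a language that the rule-based engine has excluded simply does not appear in the map and is read as 0.0.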

9.2 Standalone mode

If you want to try out Lingua before you decide whether to use it or not, you can run it in a REPL and immediately see its detection results.

With Gradle: ./gradlew runLinguaOnConsole --console=plain
Without Gradle: java -jar lingua-1.0.3-with-dependencies.jar

Then just play around:

This is Lingua.
Select the language models to load.

1: enter language iso codes manually
2: all supported languages

Type a number and press <Enter>.
Type :quit to exit.

> 1
List some language iso 639-1 codes separated by spaces and press <Enter>.
Type :quit to exit.

> en fr de es
Loading language models...
Done. 4 language models loaded lazily.

Type some text and press <Enter> to detect its language.
Type :quit to exit.

> languages
ENGLISH
> Sprachen
GERMAN
> langues
FRENCH
> :quit
Bye! Ciao! Tschüss! Salut!

10. What's next for version 1.1.0?

Take a look at the planned issues.

Comments

Improve performance and reduce memory consumption
As pointed out in #39 and #57, Lingua's great accuracy comes at the cost of high memory usage. This poses a problem for some projects trying to use Lingua. In this issue I will try to highlight some main areas where performance can be improved; some of this is already covered by #98. Note that some of the proposed changes might decrease execution speed or require some larger refactoring.

Model files

Instead of storing the model data in JSON format, a binary format could be used matching the in-memory format (see "In-memory models" section). This would have the advantage that:

Lookup maps such as Char2DoubleOpenHashMap could be created with the expected size avoiding rehashing of the maps during deserialization.

Model file loading is faster.

Model file sizes will be slightly smaller when encoding the frequency only once, followed by the number of ngrams which share this frequency, followed by the ngram values.

Note that even though the fastutil maps are Serializable, using JDK serialization might introduce unnecessary overhead and would make this library dependent on the internal serialization format of the fastutil maps. Instead the data could be written manually to a DataOutputStream.
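As a rough sketch of what such a format could look like (the layout, class name and method names are assumptions for illustration, not an existing Lingua format): each distinct frequency is written once, followed by the number of ngrams sharing it and the ngrams themselves. The total entry count is written first so that the reader can presize its map.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical format sketch, not part of Lingua.
final class ModelFormatSketch {

    static byte[] write(Map<String, Double> frequencies) {
        // Group ngrams by frequency so each frequency is encoded only once.
        Map<Double, List<String>> byFrequency = new TreeMap<>();
        for (Map.Entry<String, Double> e : frequencies.entrySet()) {
            byFrequency.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(frequencies.size()); // lets the reader presize its map
            out.writeInt(byFrequency.size());
            for (Map.Entry<Double, List<String>> group : byFrequency.entrySet()) {
                out.writeDouble(group.getKey());
                out.writeInt(group.getValue().size());
                for (String ngram : group.getValue()) {
                    out.writeUTF(ngram);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Map<String, Double> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int total = in.readInt();
            // Presized so that no rehashing happens while loading.
            Map<String, Double> result = new HashMap<>((int) (total / 0.75f) + 1);
            int groups = in.readInt();
            for (int g = 0; g < groups; g++) {
                double frequency = in.readDouble();
                int count = in.readInt();
                for (int i = 0; i < count; i++) {
                    result.put(in.readUTF(), frequency);
                }
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch uses java.util.HashMap to stay dependency-free; the same presizing idea would apply to the fastutil maps mentioned above.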

Model file loading

Use a streaming JSON library. The currently used kotlinx-serialization-json does not seem to support streaming yet. Therefore the complete model files are currently loaded as String before being parsed. This is (likely) slow and requires large amounts of memory. Instead, streaming JSON libraries such as https://github.com/square/moshi should be used. Note that this point becomes obsolete if a binary format (as described in the "Model files" section above) is used.

In-memory models

Object2DoubleOpenHashMap load factor can be increased from the default 0.75 to a higher value. This reduces memory usage but might slow down execution.

Ngrams can be encoded using primitives. Since this project uses only up to fivegrams (5 chars), most of the ngrams (and for some languages even ngrams of all lengths) can be encoded as JVM primitives using bitwise operations, e.g.:

Unigrams as Byte or Char

Bigrams as Short or Int

Trigrams as Int or Long

Quadrigrams as Int or Long

Fivegrams as Long or in the worst case as String object. Note that at least for fivegrams the binary encoding should probably be offset based, so one char is the base code point and the remaining bits of the Long encode the offsets of the other chars to the base char. This allows encoding alphabets such as Georgian where each char is > Long.SIZE_BITS / 5.

This might even increase execution speed since it avoids hashCode() and equals(...) calls when looking up frequencies (speed-up, if any, has to be tested though).
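A simplified sketch of the idea, under the assumption of a fixed 16 bits per char (which covers ngrams of up to four chars in one Long and is not the more compact per-length or offset-based encoding described above); the names are hypothetical:

```java
// Hypothetical helper, not part of Lingua.
final class NgramPackingSketch {

    /** Packs an ngram of 1 to 4 chars into a long, 16 bits per char. */
    static long encode(String ngram) {
        if (ngram.isEmpty() || ngram.length() > 4) {
            throw new IllegalArgumentException("only ngrams of length 1 to 4 fit into 64 bits");
        }
        long packed = 0L;
        for (int i = 0; i < ngram.length(); i++) {
            packed = (packed << 16) | ngram.charAt(i);
        }
        return packed;
    }

    /** Reverses encode(); assumes the ngram contains no U+0000 char. */
    static String decode(long packed) {
        StringBuilder sb = new StringBuilder(4);
        for (int shift = 48; shift >= 0; shift -= 16) {
            char c = (char) (packed >>> shift);
            if (c != 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long packed = encode("ლამა");
        System.out.println(Long.toHexString(packed) + " -> " + decode(packed));
    }
}
```

With such keys, a primitive-keyed map can be used for frequency lookup, so no String objects need to be hashed or compared. Fivegrams do not fit this fixed-width scheme and would need the offset-based encoding or a String fallback.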

Reduce frequency accuracy for in-memory models and model files from 64-bit Double to 32-bit. This can have a big impact on memory usage, saving more than 100MB with all models preloaded. However, instead of using a 32-bit Float to store the frequency, a custom 32-bit encoding can (and maybe should) be used since Float 'wastes' some bits for the sign (frequency will never be negative) and the exponent (frequency will never be >= 1.0), though this might decrease language detection speed due to the decoding overhead.

Remove Korean fivegrams (and quadrigrams?). The Korean language models are quite large; additionally, due to the large range of Korean code points, a great majority (> 1.000.000 fivegrams (?)) cannot be encoded with the primitive encoding approach outlined above. Chinese and Japanese don't seem to have quadrigram and fivegram models either. I am not sure if this is due to how the languages work, but maybe it would be acceptable to drop them for Korean as well, also because detection of Korean seems to be rather unambiguous.

Runtime performance

Remove Alphabet. The Alphabet class can probably be removed. Character.UnicodeScript seems to be an exact substitute and might allow avoiding some indirection, e.g. by looking up the UnicodeScript for a Char only once and then comparing it with the expected ones instead of having each Alphabet look up UnicodeScript.

Avoid creation of Ngram objects. Similar to the primitive encoding described in "In-memory models" above, Ngram objects created as part of splitting up the text can be avoided as well (with a different encoding). A Kotlin inline class can be used to still get type safety and have some convenience functions. Primitive encoding can only support trigrams reliably without too much overhead / too complicated encoding, but that is probably fine because since d0f7a7c211abb03885cc89febae9d77fbf640342 at most trigrams will be used for longer texts.

Instead of accessing lazy frequency lookup in every iteration, it might be faster to access it once at the beginning and then directly use it instead (though this could also be premature optimization).

Conclusion

With some / all of these suggestions applied memory usage can be reduced and execution speed can be increased without affecting accuracy. However, some of the suggestions might be premature optimization, and they only work for 16-bit Char but not for supplementary code points (> 16-bit) (but the current implementation, mainly Ngram creation, seems to have that limitation as well).

I have implemented some of these optimizations and some other minor improvements in https://github.com/Marcono1234/lingua/tree/experimental/performance. However, these changes are pretty experimental: The Git history is not very nice to look at; in some commits I fixed bugs I introduced before or reverted changes again. Additionally the unit tests and model file writing are broken. Some of the changes might also be premature optimization. Though maybe it is interesting nonetheless, it appears the memory usage with all languages being preloaded went down to about ~~640MB~~ (Edit: 920MB, made a mistake in the binary encoding) on AdoptOpenJDK 11.
opened by Marcono1234 14

Compact memory data (#101)

I changed the runtime memory model: the original JSON is translated to a dense map. This reduces memory requirements at the cost of speed (frequency lookups should be slower). Frequencies are stored as Float instead of Double; this introduces a 0.001% error in the calculation, and tests are updated accordingly.

fastutil dependency has been removed.

All changes are performed in internal classes, so this request is compatible with the 1.1 version and I hope that the merge will be considered soon.

opened by fvasco 12

Lingua's use of Kotlin coroutines causes leaks in web applications

I'm using lingua 1.1.0 in a Java web application for language detection. The application is set up to load all models on the first request:

LanguageDetectorBuilder.fromAllLanguages().withPreloadedLanguageModels().build();

When I undeploy the web application from the application server, the models stay in memory. I took a heap dump using Eclipse Memory Analyzer. The dump shows that there are still instances of the classes related to coroutines (e.g. kotlinx.coroutines.scheduling.CoroutineScheduler$WorkerState, kotlinx.coroutines.scheduling.WorkQueue, kotlinx.coroutines.scheduling.CoroutineScheduler) after undeploying the application. The coroutines still seem to reference the models.

I've built a reproducer using only Servlet API that seems to show similar behaviour on Tomcat. Tomcat shows warnings like this:

WARNUNG: The web application [lingua-reproducer] appears to have started a thread named [DefaultDispatcher-worker-9] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 sun.misc.Unsafe.park(Native Method)
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
 kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.park(CoroutineScheduler.kt:795)
 kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.tryPark(CoroutineScheduler.kt:740)
 kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:711)
 kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)

Is there a way to ensure that the threads created by Lingua terminate when the application is undeployed?
bug

opened by dnb-erik-brangs 10

v0.6.1 seems better than v1.0.0

I compared version 0.6.1 vs version 1.0.0 on two private test sets

In the table below I report the results for both versions and their difference for all languages of my benchmark. Scores are the ratio of correctly classified segments with respect to a gold reference. The sets contain real-world text from the web and from technical domains, and they have not been checked manually, so they may contain some wrongly classified sentences.

Nevertheless, v1.0.0 seems worse than v0.6.1 for many languages.

One of the big differences concerns text that includes strings in different languages. For instance, Chinese segments containing both Chinese and English, like the following ones, are detected as English and Portuguese instead of Chinese:

Snapchat 并不 是唯一 一家 触 及 这些 文化 底 线 的公 司 
Gomes ： 我們 的目 標 是 提 升 搜 尋 服 務 的 品 質

or Greek segments with a few Western strings, like the following ones, which are detected as Danish and Italian instead of Greek:

Rasmussen μετά τη συνάντησή τους στο ΥΠΕΘΑ
γεγονός που, πέραν της σημασίας των σχετικών πολιτικών επαφών, εμπεριέχει και ιδιαίτερη συμβολική αξία, καθώς η επίσκεψη πραγματοποιήθηκε δύο μόλις έτη μετά την επίσκεψη του πρώην ιταλού Προέδρου, κ. Azeglio Ciampi, στην Αθήνα, στις 15-17.

or Arabic segments with a few English strings, like the following ones, which are detected as English and Tagalog instead of Arabic:

أداة Google Scholar وضعت أبحاثًا متاحةً للجميع البحث عنها سهل والوصول إليها أسهل.
يارد "YARID" - اللاجئون الأفارقة الشباب للتنمية المتكاملة- بدأت كمحادثة داخل المُجتمع الكونغو

Several other examples can be found, even between languages with more similar alphabets.

It seems that v1.0.0 relies too much on Western alphabets to identify the language, without considering the amount of such Western characters.

set  lng  Lingua (v0.6.1)  Lingua100 (v1.0.0)  diff (Lingua100 - Lingua)
setA ar  0.930   0.902  diff: -0.028
setA az  0.807   0.784  diff: -0.023
setA be  0.861   0.816  diff: -0.045
setA bg  0.801   0.734  diff: -0.067
setA bs  0.412   0.408  diff: -0.004
setA ca  0.760   0.762  diff: 0.002
setA cs  0.792   0.785  diff: -0.007
setA da  0.760   0.752  diff: -0.008
setA de  0.848   0.848  diff: 0
setA el  0.947   0.932  diff: -0.015
setA es  0.804   0.853  diff: 0.049
setA et  0.856   0.853  diff: -0.003
setA fi  0.865   0.864  diff: -0.001
setA fr  0.868   0.882  diff: 0.014
setA he  0.972   0.961  diff: -0.011
setA hi  0.790   0.733  diff: -0.057
setA hr  0.628   0.623  diff: -0.005
setA hu  0.858   0.848  diff: -0.01
setA hy  0.827   0.801  diff: -0.026
setA id  0.665   0.665  diff: 0
setA is  0.863   0.831  diff: -0.032
setA it  0.866   0.865  diff: -0.001
setA ja  0.758   0.752  diff: -0.006
setA ka  0.802   0.787  diff: -0.015
setA ko  0.887   0.827  diff: -0.06
setA lt  0.839   0.828  diff: -0.011
setA lv  0.882   0.869  diff: -0.013
setA mk  0.786   0.723  diff: -0.063
setA ms  0.801   0.809  diff: 0.008
setA nb  0.735   0.733  diff: -0.002
setA nl  0.799   0.835  diff: 0.036
setA nn  0.768   0.768  diff: 0
setA pl  0.879   0.881  diff: 0.002
setA pt  0.862   0.858  diff: -0.004
setA ro  0.765   0.751  diff: -0.014
setA ru  0.820   0.773  diff: -0.047
setA sk  0.783   0.766  diff: -0.017
setA sl  0.714   0.708  diff: -0.006
setA sq  0.829   0.826  diff: -0.003
setA sr  0.417   0.302  diff: -0.115
setA sv  0.833   0.830  diff: -0.003
setA th  0.940   0.927  diff: -0.013
setA tl  0.747   0.748  diff: 0.001
setA tr  0.901   0.895  diff: -0.006
setA uk  0.877   0.848  diff: -0.029
setA vi  0.920   0.877  diff: -0.043
setA zh  0.941   0.858  diff: -0.083

setB ar   0.996   0.988  diff: -0.008
setB bg   0.957   0.947  diff: -0.01
setB bs   0.495   0.494  diff: -0.001
setB ca   0.946   0.953  diff: 0.007
setB cs   0.993   0.992  diff: -0.001
setB da   0.947   0.946  diff: -0.001
setB de   0.996   0.996  diff: 0
setB el   0.996   0.992  diff: -0.004
setB en   0.964   0.966  diff: 0.002
setB es   0.897   0.920  diff: 0.023
setB et   0.978   0.974  diff: -0.004
setB fi   0.998   0.998  diff: 0
setB fr   0.962   0.971  diff: 0.009
setB he   1.000   0.999  diff: -0.001
setB hr   0.858   0.868  diff: 0.01
setB hu   0.988   0.988  diff: 0
setB id   0.765   0.765  diff: 0
setB is   0.979   0.971  diff: -0.008
setB it   0.939   0.937  diff: -0.002
setB ja   0.986   0.986  diff: 0
setB ko   0.998   0.998  diff: 0
setB lt   0.992   0.990  diff: -0.002
setB lv   0.990   0.983  diff: -0.007
setB mk   0.927   0.930  diff: 0.003
setB ms   0.927   0.927  diff: 0
setB nb   0.927   0.928  diff: 0.001
setB nl   0.921   0.949  diff: 0.028
setB nn   0.942   0.946  diff: 0.004
setB pl   0.993   0.992  diff: -0.001
setB pt   0.952   0.948  diff: -0.004
setB ro   0.964   0.958  diff: -0.006
setB ru   0.997   0.911  diff: -0.086
setB sk   0.977   0.975  diff: -0.002
setB sl   0.943   0.942  diff: -0.001
setB sq   0.983   0.983  diff: 0
setB sv   0.973   0.971  diff: -0.002
setB th   0.996   0.996  diff: 0
setB tr   0.993   0.990  diff: -0.003
setB uk   0.943   0.964  diff: 0.021
setB vi   0.994   0.954  diff: -0.04
setB zh  0.992   0.955  diff: -0.037

bug

opened by nicolabertoldi 10

Add function to avoid ambiguous results
Hi,

while testing the library with some texts I encountered some ambiguous detection results. As far as I understand, the detectLanguageOf method always returns a language as soon as it has at least some probability. However, there are texts where this behaviour is probably not desired.

Imagine a text which leads to similar probabilities for two languages, with the first one just a little bit more likely. It would be nice to be able to detect such cases, or at least to ensure a certain distance between the probabilities of the most and the second most likely language (otherwise the method may return UNKNOWN). In our use case we would prefer to have more detections reported as unknown rather than (a lot of) false positives.

The following code snippet illustrates my idea:

@JvmOverloads
fun detectLanguageOf(text: String, requiredRelativeDistance: Double = 0.95): Language {
    [...]
    return getMostLikelyLanguage(allProbabilities, unigramCountsOfInputText, requiredRelativeDistance)
}

internal fun getMostLikelyLanguage(
    probabilities: List<Map<Language, Double>>,
    unigramCountsOfInputText: Map<Language, Int>,
    requiredRelativeDistance: Double = 0.95
): Language {
    [...]
    return when {
        filteredProbabilities.none() -> UNKNOWN
        filteredProbabilities.singleOrNull() != null -> filteredProbabilities.first().key
        else -> {
            val candidate = filteredProbabilities.maxBy { it.value }!!
            val second = filteredProbabilities.filter { it.key != candidate.key }.maxBy { it.value }!!
            if (second.value * requiredRelativeDistance < candidate.value) {
                candidate.key
            } else {
                UNKNOWN
            }
        }
    }
}

Feel free to copy the code if you want. I don't know whether this is a good approach for the problem or if there are better ways to do that. However, it would be really nice to have a solution for this in some way.

Thanks in advance!
enhancement
opened by bgeisberger 9
LanguageDetector and multithreading

I planned to use 'lingua' in a multi-threaded Java environment but, if I understand correctly, a 'LanguageDetector' instance is not thread-safe, i.e. if several threads use it simultaneously, they may corrupt each other's work. Am I right? Creating a new 'LanguageDetector' instance seems to be very expensive.
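The release notes for v0.5.0 further down state that LanguageDetector was made completely thread-safe, so building one instance and sharing it across threads is the pattern that avoids repeated construction. A minimal sketch of that pattern follows. It assumes a thread-safe detector and uses a plain `Function<String, String>` as a stand-in for the real detection call; the names `SharedDetectorSketch` and `detectAll` are hypothetical and not part of Lingua.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class SharedDetectorSketch {
    // Runs one shared detector over all texts on a fixed-size thread pool.
    // "detector" is a hypothetical stand-in for a thread-safe detection call.
    static List<String> detectAll(Function<String, String> detector, List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String text : texts) {
                tasks.add(() -> detector.apply(text));
            }
            List<String> results = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException(ex.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With Lingua on the classpath, the stand-in would presumably be a lambda that calls `detectLanguageOf` on the single shared instance.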
question

opened by werder06 9

[ Performance and Memory Analysis for Large Dataset ] very slow for large numbers of Hits

I am trying to run language detection using this script:

    final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(
            ENGLISH, FRENCH, GERMAN, SPANISH, JAPANESE, CHINESE, ITALIAN, PORTUGUESE,
            ARABIC, RUSSIAN, DUTCH, KOREAN, SWEDISH, HINDI, POLISH).build();

    long start = System.currentTimeMillis();

    final Language detectedLanguage = detector.detectLanguageOf("Zum Vergleich kann es auch nützlich sein, diese Rankings neben einigen etwas älteren Forschungsergebnissen zu sehen. Im Jahr 2013, Common Sense Advisory zur Verfügung gestellt , eine empirische Studie basiert auf einer Wallet World Online (WOW) - definiert als ‚die gesamte wirtschaftliche Chance, sowohl online als auch offline, berechnet durch einen Anteil eines Landes BIP zu allen wichtigen Blöcken dieser Gesellschaft assoziieren. ' Hier ist, was uns ihre Studie gezeigt hat.");
    // System.out.println(detectedLanguage.toString());
    long end = System.currentTimeMillis();
    System.out.println("Time: " + (end - start));

It takes 700 milliseconds, which is very slow and cannot be used for 10,000+ files. Is there any approach to get results within 1-10 milliseconds?

Or is there a function like isEnglish() that returns true only for English?
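According to the release notes further down, language models are lazy-loaded on first access, so a single timed call may mostly measure model loading rather than detection itself. A small harness that times the first call separately from later calls could help tell the two apart. This is only a sketch: `DetectionTimingSketch`, `measure`, and the `Function<String, String>` stand-in for the detector are hypothetical names, not Lingua API.

```java
import java.util.function.Function;

final class DetectionTimingSketch {
    // Returns { nanoseconds of the first call, average nanoseconds of the later calls }.
    // "detector" is a hypothetical stand-in for the real detection call.
    static long[] measure(Function<String, String> detector, String text, int repetitions) {
        long firstStart = System.nanoTime();
        detector.apply(text); // may include one-time costs such as model loading
        long first = System.nanoTime() - firstStart;

        long total = 0;
        for (int i = 0; i < repetitions; i++) {
            long start = System.nanoTime();
            detector.apply(text);
            total += System.nanoTime() - start;
        }
        long average = repetitions > 0 ? total / repetitions : 0;
        return new long[] {first, average};
    }
}
```

If the first value dominates, the cost is probably start-up work rather than per-text detection.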

opened by the-black-knight-01 8

How to reduce the size of the jar file by excluding language profiles?

I need to run this lib in a memory-constrained environment: less than 200 MB for the unzipped package. How can I exclude rare language profiles from the library?

An alternative: can the memory size be significantly decreased by minifying the JSON files used for each language?

Note: I am using the Maven build of lingua.
question

opened by seinecle 8
Memory leak when using Lingua in web applications

As discussed in #110 , there seems to be a memory leak when using Lingua in a Java web application. I've uploaded an example application at https://github.com/deutsche-nationalbibliothek/lingua-reproducer-memory-leak . The README contains some information about the problem. Please let me know if you need more information.
bug

opened by dnb-erik-brangs 7

java.lang.NoClassDefFoundError: kotlin/KotlinNothingValueException

Hi,

I am trying to use Lingua inside a plain Java Maven project, using the following Maven dependency:

    <dependency>
        <groupId>com.github.pemistahl</groupId>
        <artifactId>lingua</artifactId>
        <version>1.0.3</version>
    </dependency>

Code Sample:

import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

import static com.github.pemistahl.lingua.api.Language.*;

public class Test {
    public static void detect(){
        LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH).build();
        Language language = detector.detectLanguageOf("languages are awesome");
        System.out.println(language);
    }

    public static void main(String[] args) {
        detect();
    }
}

When running it, I get the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: kotlin/KotlinNothingValueException
	at kotlinx.serialization.SerializersKt.serializer(Unknown Source)
	at com.github.pemistahl.lingua.internal.TrainingDataLanguageModel$Companion.fromJson(TrainingDataLanguageModel.kt:150)
	at com.github.pemistahl.lingua.api.LanguageDetector.loadLanguageModel$lingua(LanguageDetector.kt:401)
	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:407)
	at com.github.pemistahl.lingua.api.LanguageDetector$loadLanguageModels$1.invoke(LanguageDetector.kt:79)
	at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
	at com.github.pemistahl.lingua.api.LanguageDetector.lookUpNgramProbability$lingua(LanguageDetector.kt:390)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeSumOfNgramProbabilities$lingua(LanguageDetector.kt:366)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeLanguageProbabilities$lingua(LanguageDetector.kt:353)
	at com.github.pemistahl.lingua.api.LanguageDetector.computeLanguageConfidenceValues(LanguageDetector.kt:162)
	at com.github.pemistahl.lingua.api.LanguageDetector.detectLanguageOf(LanguageDetector.kt:102)
	at com.tomtom.ssv.apt.ingestion.service.LanguageDetector.detect(LanguageDetector.java:67)
	at com.tomtom.ssv.apt.ingestion.service.LanguageDetector.main(LanguageDetector.java:72)
Caused by: java.lang.ClassNotFoundException: kotlin.KotlinNothingValueException
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 13 more

bug

opened by piyusht007 7

Compare against CLD3 and CLD2

Google's Compact Language Detectors (CLD) are good libraries that are used in the Chrome browser and in many other projects. Although written in C++, they have wrappers for Java (cld2, cld3) and Python (cld2, cld3). While the 2nd version is n-gram based, the 3rd version uses neural networks.

Please compare their performance on your test set, both accuracy and speed wise.

opened by igrinis 7

Language recognition error

LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(ENGLISH, CHINESE, THAI, VIETNAMESE).build();
SortedMap<Language, Double> languageDoubleSortedMap = detector.computeLanguageConfidenceValues("ี่มีประสิทธิภาพหลอดไฟพลังงานแสงอาทิตย์กลางแจ้งเซ็นเซอร์ตรวจจับการเคลื่อนไหวสวนกันน้ำ LED พลังงานแสงอาทิตย์โคมไฟสปอร์ตไลท์สำหรับ Garden เส้นทางถนนแบ็คดรอปเป่าลม Led Light");
System.out.println(languageDoubleSortedMap);

The following is printed: {ENGLISH=1.0, VIETNAMESE=0.5658177137374878}. I think the text is Thai, but the detector reports English and even Vietnamese, and Thai does not appear at all. The version is 1.2.2.

opened by xujiaw 0

Option: Other

Great tool - thank you! Suggestion: the possibility to add OTHER as a language. Let's say I want to find English and French in a multi-language set. I want to add English and French to LanguageDetectorBuilder.from_languages, but if the probability is low, I don't want everything to be marked as English or French, but as something else -> Other.
enhancement

opened by thsm-kb 1
Reduce resources to load language models
Currently, the language models are parsed from json files and loaded into simple maps at runtime. Even though accessing the maps is pretty fast, they consume a significant amount of memory. The goal is to investigate whether there are more suitable data structures available that require less storage space in memory, something like NumPy for Python. Perhaps it is even possible to store those data structures in some kind of binary format on disk which can be loaded faster than the current json files.

Promising candidates could be:

EJML

Colt

la4j

Apache Commons Math
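To make the "binary format on disk" idea above concrete, here is a minimal sketch using only the JDK, independent of the candidate libraries listed. It assumes a model can be viewed as a map from n-gram to probability; the class `BinaryModelSketch` and its layout (entry count, then UTF string and double per entry) are hypothetical and not Lingua's actual format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class BinaryModelSketch {
    // Serializes the map as: int entry count, then (UTF string, double) per entry.
    static byte[] write(Map<String, Double> ngramProbabilities) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(ngramProbabilities.size());
            for (Map.Entry<String, Double> entry : ngramProbabilities.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeDouble(entry.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Reads back the layout produced by write().
    static Map<String, Double> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int size = in.readInt();
            Map<String, Double> result = new HashMap<>(size * 2);
            for (int i = 0; i < size; i++) {
                String ngram = in.readUTF();
                double probability = in.readDouble();
                result.put(ngram, probability);
            }
            return result;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

Whether such a format actually loads faster or uses less memory than the JSON files would have to be measured. The sketch still materializes a boxed `HashMap`, so it addresses parsing cost only, not in-memory size.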

enhancement
opened by pemistahl 0
Support specifying custom `Executor`
Related to #119

Currently Lingua uses ForkJoinPool.commonPool() for model loading and language detection. However, maybe it would be useful to allow users to specify their own Executor, for example with LanguageDetectorBuilder.withExecutor(Executor) (the default could still be commonPool()). This would have the following advantages:

could customize worker thread count, or even run single-threaded, e.g. executor = r -> r.run()

could customize worker threads:

custom name to make performance monitoring easier

custom priority

It would not be possible anymore to use invokeAll then, but a helper function such as the following one might add the missing functionality:

    private fun <E> executeTasks(tasks: List<Callable<E>>): List<E> {
        val futures = tasks.map { FutureTask(it) }
        futures.forEach(executor::execute)
        return futures.map(Future<E>::get)
    }

(Note that I have not extensively checked how well this performs compared to invokeAll, and whether exception collection from the Futures could be improved. Probably this implementation is flawed because the caller would wait on the get() call without participating in the work.) Alternatively, CompletableFuture could be used; but then care must be taken to not use ForkJoinPool.commonPool() when its parallelism is 1, otherwise performance might be pretty bad due to JDK-8213115.
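For Java readers, the helper proposed above can be sketched as follows. It is a direct translation under the same caveats, with the executor passed as a parameter instead of being a field; `ExecutorTasksSketch` is a hypothetical name, and wrapping failures in `IllegalStateException` is one illustrative choice, not a recommendation from the proposal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.FutureTask;

final class ExecutorTasksSketch {
    // Submits every task to the given executor, then collects results in task order.
    static <E> List<E> executeTasks(Executor executor, List<Callable<E>> tasks) {
        List<FutureTask<E>> futures = new ArrayList<>();
        for (Callable<E> task : tasks) {
            futures.add(new FutureTask<>(task));
        }
        for (FutureTask<E> future : futures) {
            executor.execute(future);
        }
        List<E> results = new ArrayList<>();
        for (FutureTask<E> future : futures) {
            try {
                // The caller blocks here without taking part in the work.
                results.add(future.get());
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(ex);
            } catch (ExecutionException ex) {
                throw new IllegalStateException(ex.getCause());
            }
        }
        return results;
    }
}
```

Passing `Runnable::run` as the executor runs every task on the calling thread, which corresponds to the single-threaded case mentioned in the list of advantages.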

This would require some changes to the documentation which currently explicitly refers to ForkJoinPool.commonPool().

What do you think?
new feature
opened by Marcono1234 1
Add more classification metrics in library comparisons

Hello!

So I've been trying out the lingua library and it's awesome. I was wondering if it's possible to add other classification metrics such as precision, recall, specificity and F1 to the comparisons between Tika, Optimaize and the other Java language detection libraries, for more transparency?

Thanks!
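For reference, the requested metrics can be computed per language from the counts of true positives (tp), false positives (fp), true negatives (tn) and false negatives (fn). The sketch below uses the standard definitions; the class name `ClassificationMetricsSketch` and the convention of returning 0 for an empty denominator are assumptions for illustration.

```java
final class ClassificationMetricsSketch {
    // precision = tp / (tp + fp)
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // recall = tp / (tp + fn)
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // specificity = tn / (tn + fp)
    static double specificity(int tn, int fp) {
        return tn + fp == 0 ? 0.0 : (double) tn / (tn + fp);
    }

    // F1 is the harmonic mean of precision and recall.
    static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }
}
```

For example, with tp = 90, fp = 10 and fn = 30, precision is 0.9, recall is 0.75 and F1 is 9/11.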
enhancement

opened by willyspinner 2

Releases(v1.2.2)

v1.2.2(Aug 2, 2022)
Bug Fixes

Due to a bug in the Moshi JSON serialization library, language detection was not possible in certain cases. (#144, #147)

Lingua could not be used properly when a security manager was enabled in the JVM. (#141)

Source code(tar.gz)
Source code(zip)
lingua-1.2.2-javadoc.jar(340.75 KB)
lingua-1.2.2-sources.jar(33.77 KB)
lingua-1.2.2-with-dependencies.jar(104.01 MB)
lingua-1.2.2.jar(76.68 MB)
v1.2.1(Jun 9, 2022)
Bug Fixes

An exception was thrown when trying to detect the language of unigrams and bigrams in low accuracy mode which operates only with trigrams and larger strings. This has been fixed.

Source code(tar.gz)
Source code(zip)
lingua-1.2.1-javadoc.jar(340.75 KB)
lingua-1.2.1-sources.jar(33.69 KB)
lingua-1.2.1-with-dependencies.jar(104.01 MB)
lingua-1.2.1.jar(76.68 MB)
v1.2.0(Jun 7, 2022)
Features

The library can now be used as a Java 9 module. Thanks to @Marcono1234 for helping with the implementation. (#120, #138)

The new method LanguageDetectorBuilder.withLowAccuracyMode() has been introduced. By activating it, detection accuracy for short text is reduced in favor of a smaller memory footprint and faster detection performance. (#136)

Improvements

The memory footprint has been reduced significantly by applying several internal optimizations. Thanks to @Marcono1234, @fvasco and @sigpwned for their help. (#101, #127)

Several language model files have become obsolete and could be deleted without decreasing detection accuracy. This results in a smaller memory footprint and a 36% smaller jar file.

Bug Fixes

A bug in the rule engine has been fixed that caused incorrect language detection for certain texts. Thanks to @bdecarne who has found it.

Other changes

Due to a refactoring of how the internal thread pool works, the method LanguageDetector.destroy() has been deprecated in favor of the newly introduced method LanguageDetector.unloadLanguageModels().

Source code(tar.gz)
Source code(zip)
lingua-1.2.0-javadoc.jar(340.75 KB)
lingua-1.2.0-sources.jar(33.67 KB)
lingua-1.2.0-with-dependencies.jar(104.01 MB)
lingua-1.2.0.jar(76.68 MB)
v1.1.1(Dec 12, 2021)
Improvements

The new method LanguageDetector.destroy() has been introduced that frees internal resources to prevent memory leaks within application server deployments. (#110, #116)

Language model loading performance has been improved by creating a manually optimized internal thread pool. This replaces the coroutines used in the previous release. (#116)

Bug Fixes

The character â was erroneously not treated as a possible indicator for the French language. (#115)

Language detection was non-deterministic when multiple alphabets had the same occurrence count. (#105)

Source code(tar.gz)
Source code(zip)
lingua-1.1.1-javadoc.jar(339.68 KB)
lingua-1.1.1-sources.jar(33.29 KB)
lingua-1.1.1-with-dependencies.jar(149.73 MB)
lingua-1.1.1.jar(125.10 MB)
v1.1.0(May 2, 2021)
Languages

There is now support for the Maori language which was contributed to the Rust implementation of Lingua. (#93)

Features

Language models are now loaded asynchronously and in parallel using Kotlin coroutines, making this step more performant. (#84)

Language Models can now be loaded either lazily (default) or eagerly. (#79)

Instead of loading multiple copies of the language models into memory for each separate instance of LanguageDetector, multiple instances now share the same language models and access them asynchronously. (#91)
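The combination of lazy loading and sharing described in these features can be illustrated with a process-wide cache that loads each model at most once. This is a sketch of the general technique only, not Lingua's implementation; `SharedModelCacheSketch`, `modelFor`, and the representation of a model as `Map<String, Double>` are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class SharedModelCacheSketch {
    // One cache for the whole process, shared by every detector instance.
    private static final Map<String, Map<String, Double>> MODELS = new ConcurrentHashMap<>();

    // Loads the model for a language on first access and reuses it afterwards.
    // computeIfAbsent applies the loader at most once per key, even under concurrency.
    static Map<String, Double> modelFor(String language, Function<String, Map<String, Double>> loader) {
        return MODELS.computeIfAbsent(language, loader);
    }
}
```

Two detectors that both ask for the same language receive the same map instance, so the memory cost is paid once.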

Improvements

Language detection for sentences with more than 120 characters now performs more quickly by iterating through trigrams only which is enough to achieve high detection accuracy.

Textual input that includes logograms from Chinese, Japanese or Korean is now split at each logogram and not only at whitespace. This provides for more reliable language detection for sentences that include multi-language content. (#85)
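The first improvement above, using only trigrams for longer input, can be sketched as a choice of n-gram lengths based on text length. The 120-character threshold comes from the note itself; the upper n-gram length of 5 for shorter input and the names `NgramSketch`, `ngrams` and `ngramLengthsFor` are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NgramSketch {
    // Returns all contiguous substrings of length n (counted in UTF-16 code units).
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }

    // Longer input: trigrams only. Shorter input: lengths 1 to 5 (assumed range).
    static int[] ngramLengthsFor(String text) {
        return text.length() > 120 ? new int[] {3} : new int[] {1, 2, 3, 4, 5};
    }
}
```

For the word "lingua", the trigrams are "lin", "ing", "ngu" and "gua".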

Bug Fixes

For an odd number of words as input, the method LanguageDetector.computeLanguageConfidenceValues computed wrong values under certain circumstances. (#87)

When Lingua was used in projects with an explicitly set Kotlin version which differed from Lingua's implicitly set version in the Gradle script, several errors occurred during runtime. By explicitly setting Lingua's Kotlin version, these errors are now hopefully gone. (#88, #89)

Errors in the rule engine for the Latvian language have been resolved. (#92)

Source code(tar.gz)
Source code(zip)
lingua-1.1.0-javadoc.jar(337.21 KB)
lingua-1.1.0-sources.jar(32.54 KB)
lingua-1.1.0-with-dependencies.jar(150.91 MB)
lingua-1.1.0.jar(125.11 MB)
v1.0.3(Oct 15, 2020)
Bug Fixes

When two languages had exactly the same confidence values, one of them was erroneously removed from the result map. Thanks to @mmedek for reporting this bug. (#72)

There was still a problem with the classification of texts consisting of certain alphabets. Thanks to @nicolabertoldi for reporting this bug. (#76)

The language detection for Spanish did not take the rarely used accented characters á, é, í, ó, ú and ü into account. Thanks to @joeporter for reporting this bug. (#73)

A bug in the rule engine led to weak detection accuracy for Macedonian and Serbian. This has been fixed.
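The first fix above concerns entries with equal confidence values disappearing from the result. The release note does not state the cause, but one common way this kind of bug arises in Java and Kotlin is a sorted collection whose comparator looks at the value only, so that ties count as duplicates. The sketch below demonstrates that general pitfall and a tie-breaker on the key; it is not a reconstruction of Lingua's code, and `TieSafeOrderingSketch` is a hypothetical name.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class TieSafeOrderingSketch {
    // Orders by descending value only: entries with equal values collapse into one.
    static SortedSet<Map.Entry<String, Double>> byValueOnly(Map<String, Double> values) {
        Comparator<Map.Entry<String, Double>> comparator =
                Map.Entry.<String, Double>comparingByValue().reversed();
        SortedSet<Map.Entry<String, Double>> sorted = new TreeSet<>(comparator);
        sorted.addAll(values.entrySet());
        return sorted;
    }

    // Orders by descending value, then by key: ties are kept and ordered deterministically.
    static SortedSet<Map.Entry<String, Double>> byValueThenKey(Map<String, Double> values) {
        Comparator<Map.Entry<String, Double>> comparator =
                Map.Entry.<String, Double>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Double>comparingByKey());
        SortedSet<Map.Entry<String, Double>> sorted = new TreeSet<>(comparator);
        sorted.addAll(values.entrySet());
        return sorted;
    }
}
```

The same tie-breaker also makes the iteration order reproducible across runs.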

Other Changes

The Kotlin compiler and runtime have been updated to version 1.4. This includes the current stable release 1.0.0 of the kotlinx-serialization framework.

The accuracy report files have been moved to their own Gradle source set. This allows for separate compilation of unit tests and accuracy report tests, leading to more flexible and slightly faster compilation.

Source code(tar.gz)
Source code(zip)
lingua-1.0.3-javadoc.jar(49.20 KB)
lingua-1.0.3-sources.jar(31.12 KB)
lingua-1.0.3-with-dependencies.jar(145.08 MB)
lingua-1.0.3.jar(124.79 MB)
v1.0.2(Aug 9, 2020)
Bug Fixes

The language mapping for character ë was incorrect which has been fixed. Thanks to @sandernugterenedia for reporting this bug. (#66)

The implementation of LanguageDetector made use of functionality that was introduced in Java 8 which made the library unusable for Java 6 and 7. Thanks to @levant916 for reporting this bug. (#69)

The Gradle shadow plugin has been added so that ./gradlew jarWithDependencies produces a jar file whose dependencies do not conflict anymore with the same dependencies of different versions in the same project. (#67)

Source code(tar.gz)
Source code(zip)
lingua-1.0.2-javadoc.jar(49.22 KB)
lingua-1.0.2-sources.jar(31.10 KB)
lingua-1.0.2-with-dependencies.jar(144.95 MB)
lingua-1.0.2.jar(124.79 MB)
v1.0.1(Jul 4, 2020)
Bug Fixes

If no ngram probabilities were found for a given input text, a NullPointerException would be thrown. Thanks to @fsonntag for finding and fixing this bug. (#63)

Source code(tar.gz)
Source code(zip)
lingua-1.0.1-javadoc.jar(49.22 KB)
lingua-1.0.1-sources.jar(30.33 KB)
lingua-1.0.1-with-dependencies.jar(144.74 MB)
lingua-1.0.1.jar(124.80 MB)
v1.0.0(Jun 24, 2020)
Languages

added 9 new languages, this time with a focus on Africa: Ganda, Shona, Sotho, Swahili, Tsonga, Tswana, Xhosa, Yoruba, Zulu

removed language Norwegian in favor of Bokmal and Nynorsk (#59)

Features

LanguageDetector can now provide confidence scores for each evaluated language. (#11)

The public API for creating language model (LanguageModelFilesWriter) and test data files (TestDataFilesWriter) has been stabilized. (#37)

New convenience methods have been added to LanguageDetectorBuilder in order to build LanguageDetector from languages written in a certain script. (#61)

Improvements

The rule-based detection algorithm has been made less sensitive so that single words in a different language cannot mislead the algorithm so easily.

The fastutil library has been added again to reduce memory consumption. (#58)

The language model-based algorithm has been optimized so that language detection performs approximately 25% faster now. (#58)

Support for the Kotlin linter ktlint has been added to help with a consistent coding style. (#47)

Third-party dependencies have been updated to their latest versions. (#36)

Bug Fixes

Incorrect regex character classes caused the library to not work properly on Android. (#32)

Test Coverage

Test coverage has been extended from 59% to 72%.

Documentation

The README contains a new section describing how users can add their own languages to Lingua.

Other changes

There is a breaking change in this release:

Methods with the prefix fromAllBuiltIn... have been renamed to fromAll... to make them more succinct and clear. (#61)

Source code(tar.gz)
Source code(zip)
lingua-1.0.0-javadoc.jar(49.14 KB)
lingua-1.0.0-sources.jar(30.25 KB)
lingua-1.0.0-with-dependencies.jar(144.74 MB)
lingua-1.0.0.jar(124.80 MB)
v0.6.1(Feb 6, 2020)
Bug Fixes

The rule-based engine did not take language subset filtering from public api into account (#23).

It was possible to pass through Language.UNKNOWN within the public api (#24).

Fixed a bug in the rule-based engine's alphabet detection algorithm which could be misled by single characters (#25).

Source code(tar.gz)
Source code(zip)
lingua-0.6.1-javadoc.jar(36.52 KB)
lingua-0.6.1-sources.jar(25.63 KB)
lingua-0.6.1-with-dependencies.jar(125.82 MB)
lingua-0.6.1.jar(123.90 MB)
v0.6.0(Jan 5, 2020)
Languages

added 11 new languages: Armenian, Bosnian, Azerbaijani, Esperanto, Georgian, Kazakh, Macedonian, Marathi, Mongolian, Serbian, Ukrainian

Features

There are some breaking changes in this release:

The support for MapDB has been removed. It did not provide enough advantages over Kotlin's lazy loading of language models. It used a lot of disc space and language detection became slow. With the long-term goal of creating a multiplatform library, only those features will be implemented in the future that support JavaScript as well.

The dependency on the fastutil library has been removed. It did not provide enough advantages over Kotlin's lazy loading of language models.

The method LanguageDetector.detectLanguagesOf(text: Iterable<String>) has been removed because the sorting order of the returned languages was undefined for input collections such as a HashSet. From now on, the method LanguageDetector.detectLanguageOf(text: String) will be the only one to be used.

The LanguageDetector can now be built with the following additional methods:

LanguageDetectorBuilder.fromIsoCodes639_1(vararg isoCodes: IsoCode639_1)

LanguageDetectorBuilder.fromIsoCodes639_3(vararg isoCodes: IsoCode639_3)

the following method has been removed: LanguageDetectorBuilder.fromIsoCodes(isoCode: String, vararg isoCodes: String)

The Gson library has been replaced with kotlinx-serialization for the loading of the json language models. This results in a significant reduction of code and makes reflection obsolete, so the dependency on kotlin-reflect could be removed.

Improvements

The overall detection algorithm has been improved again several times to fix several detection bugs.

Source code(tar.gz)
Source code(zip)
lingua-0.6.0-javadoc.jar(36.52 KB)
lingua-0.6.0-sources.jar(25.57 KB)
lingua-0.6.0-with-dependencies.jar(125.82 MB)
lingua-0.6.0.jar(123.90 MB)
v0.5.0(Aug 12, 2019)
Languages

added 12 new languages: Bengali, Chinese (not differentiated between traditional and simplified, as of now), Gujarati, Hebrew, Hindi, Japanese, Korean, Punjabi, Tamil, Telugu, Thai, Urdu

Features

The LanguageDetectorBuilder now supports the additional method withMinimumRelativeDistance(), which lets you specify the minimum distance between the logarithmized and summed-up probabilities for each possible language. If two or more languages yield nearly the same probability for a given input text, it is likely that the wrong language is returned. By specifying a higher value for the minimum relative distance, Language.UNKNOWN is returned instead of risking false positives.

Test report generation can now use multiple CPU cores, so that as many reports can run in parallel as CPU cores are available. This has been implemented as an additional attribute for the respective Gradle task: ./gradlew writeAccuracyReports -PcpuCores=...

The REPL now lets you freely specify the languages you want to try out by entering the desired ISO 639-1 codes. Previously, it was only possible to choose between certain language combinations.
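The decision rule behind the minimum relative distance (the first feature above) can be sketched in plain Java. This is not Lingua's implementation: the class `RelativeDistanceSketch`, the method `pick`, the use of language names as strings, and the plain difference between the two best scores as the "distance" are all assumptions for illustration.

```java
import java.util.Map;

final class RelativeDistanceSketch {
    static final String UNKNOWN = "UNKNOWN";

    // Returns the best-scoring language, or UNKNOWN if the runner-up is too close.
    static String pick(Map<String, Double> scores, double minimumDistance) {
        String best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        double secondValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            double value = entry.getValue();
            if (value > bestValue) {
                secondValue = bestValue;
                bestValue = value;
                best = entry.getKey();
            } else if (value > secondValue) {
                secondValue = value;
            }
        }
        if (best == null) {
            return UNKNOWN; // no candidates at all
        }
        if (secondValue == Double.NEGATIVE_INFINITY) {
            return best; // a single candidate has no competitor
        }
        return bestValue - secondValue >= minimumDistance ? best : UNKNOWN;
    }
}
```

With scores of 1.0 and 0.9 and a minimum distance of 0.25, the sketch returns UNKNOWN; with 1.0 and 0.4 it returns the first language.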

Improvements

The overall detection algorithm has been improved, yielding slightly more accurate results for those languages that are based on the Latin alphabet.

Bug Fixes

Thanks to the great work of contributor Bernhard Geisberger, two bugs could be fixed.

The fix in pull request #8 solves the problem of not being able to recreate the MapDB cache files automatically in case the data has been corrupted.

The fix in pull request #9 makes the class LanguageDetector completely thread-safe. Previously, in some rare cases it was possible that two threads mutated one of the internal variables at the same time, yielding inaccurate language detection results.

Thank you, Bernhard.
Source code(tar.gz)
Source code(zip)
lingua-0.5.0-sources.jar(24.49 KB)
lingua-0.5.0-with-dependencies.jar(147.69 MB)
lingua-0.5.0.jar(111.25 MB)
v0.4.0(May 7, 2019)
This release took some time, but here it is.

Languages

added 18 new languages: Afrikaans, Albanian, Basque, Bokmal, Catalan, Greek, Icelandic, Indonesian, Irish, Malay, Norwegian, Nynorsk, Slovak, Slovene, Somali, Tagalog, Vietnamese, Welsh

Features

Language models are now lazy-loaded into memory upon first access and not already when an instance of LanguageDetector is created. This way, if the rule-based engine can filter out some unlikely languages, their language models are not loaded into memory as they are not necessary at that point. So the overall memory consumption is further reduced.

The fastutil library is used to compress the probability values of the language models in memory. They are now stored as primitive data types (double) instead of objects (Double) which reduces memory consumption by approximately 500 MB if all language models are selected.

Improvements

The overall code quality has been improved significantly. This allows for easier unit testing, configuration and extensibility.

Bug Fixes

Reported bug #3 has been fixed, which prevented certain character classes from being used on Android.

Build system

Starting from this version, Gradle is used as this library's build system instead of Maven. This allows for more customizations, such as in test report generation, and is a first step towards multiplatform support. Please take a look at this project's README to read about the available Gradle tasks.

Test Coverage

Test coverage has been extended from 24% to 55%.

Source code(tar.gz)
Source code(zip)
lingua-0.4.0-sources.jar(22.81 KB)
lingua-0.4.0-with-dependencies.jar(99.93 MB)
lingua-0.4.0.jar(63.67 MB)
v0.3.2(Feb 8, 2019)
This minor update fixes a critical bug reported in issue #1.

Bug Fixes

The attempt to detect the language of a string solely containing characters that do not occur in any of the supported languages returned kotlin.KotlinNullPointerException. This has been fixed in this release. Instead, Language.UNKNOWN is now returned as expected.

Dependency Updates

The Kotlin compiler, standard library and runtime have been updated from version 1.3.20 to 1.3.21

Source code(tar.gz)
Source code(zip)
lingua-0.3.2-sources.jar(23.15 KB)
lingua-0.3.2-with-dependencies.jar(61.17 MB)
lingua-0.3.2.jar(42.70 MB)
v0.3.1(Jan 24, 2019)
This minor update contains some significant detection accuracy improvements.

Accuracy Improvements

added new detection rules to improve accuracy especially for single words and word pairs

accuracy for single words has been increased from 78% to 82% on average

accuracy for word pairs has been increased from 92% to 94% on average

accuracy for sentences has been increased from 98% to 99% on average

overall accuracy has been increased from 90% to 91% on average

overall standard deviation has been reduced from 6.01 to 5.35

API changes

LanguageDetectorBuilder.fromIsoCodes() now accepts vararg arguments instead of a List in order to have a consistent API with the other methods of LanguageDetectorBuilder

If a language iso 639-1 code is passed to LanguageDetectorBuilder.fromIsoCodes() which does not exist, then an IllegalArgumentException is thrown. Previously, Language.UNKNOWN was returned. However, this could lead to bugs as a LanguageDetector with Language.UNKNOWN was built. This is now prevented.

Dependency Updates

The Kotlin compiler, standard library and runtime have been updated from version 1.3.11 to 1.3.20

Source code(tar.gz)
Source code(zip)
lingua-0.3.1-sources.jar(23.14 KB)
lingua-0.3.1-with-dependencies.jar(61.17 MB)
lingua-0.3.1.jar(42.70 MB)
v0.3.0(Jan 16, 2019)
This major release offers a lot of new features, including new languages. Finally! :-)

Languages

added 18 languages: Arabic, Belarusian, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Hungarian, Latvian, Lithuanian, Polish, Persian, Romanian, Russian, Swedish, Turkish

Features

Language models can now be cached by MapDB to reduce memory usage and speed up loading times.

Improvements

In the standalone app, you can now choose which language models to load in order to compare detection accuracy between strongly related languages.

For test report generation using Maven, you can now select a specific language using the attribute language and do not need to run the reports for all languages anymore: mvn test -P accuracy-reports -D detector=lingua -D language=German.

API changes

Lingua's package structure has been simplified. The public API intended for end users now lives in com.github.pemistahl.lingua.api. Breaking changes to it will be kept to a minimum in 0.*.* versions and will no longer be made starting from version 1.0.0. All other code is stored in com.github.pemistahl.lingua.internal and is subject to change without any further notice.

added new class com.github.pemistahl.lingua.api.LanguageDetectorBuilder which is now responsible for building and configuring instances of com.github.pemistahl.lingua.api.LanguageDetector

Test Coverage

Test coverage of the public API has been extended from 6% to 23%.

Documentation

In addition to the test reports, graphical plots have been created in order to compare the detection results between the different classifiers even more easily. The code for the plots has been written in Python and is stored in an IPython notebook under /accuracy-reports/accuracy-reports-analysis-notebook.ipynb.

Source code(tar.gz)
Source code(zip)
lingua-0.3.0-sources.jar(23.50 KB)
lingua-0.3.0-with-dependencies.jar(61.12 MB)
lingua-0.3.0.jar(42.68 MB)
v0.2.2(Dec 28, 2018)
This minor version update provides the following:

Improvements

The included language model JSON files now use a more efficient formatting, saving approximately 25% disk space in uncompressed format compared to version 0.2.1.

Bug Fixes

The version of the Jacoco test coverage Maven plugin was incorrectly specified, leading to download errors. Now the most current snapshot version of Jacoco is used which provides enhancements for Kotlin test coverage measurement.

Source code(tar.gz)
Source code(zip)
lingua-0.2.2-sources.jar(17.46 KB)
lingua-0.2.2-with-dependencies.jar(13.28 MB)
lingua-0.2.2.jar(9.20 MB)
v0.2.1(Dec 20, 2018)
This minor version update provides the following:

Performance Improvements

Lingua's language detection has been sped up. It is now approximately 25% faster for large data sets.

Comparison with Apache Tika

Accuracy report test classes have been written for Apache Tika to compare its language detection performance with Lingua's. Lingua outperforms Tika for short paragraphs of text by up to 15% in accuracy. A detailed comparison table can be found in the README.

Source code(tar.gz)
Source code(zip)
lingua-0.2.1-sources.jar(17.42 KB)
lingua-0.2.1-with-dependencies.jar(14.92 MB)
lingua-0.2.1.jar(10.84 MB)
v0.2.0(Dec 17, 2018)
This release provides both new features and bug fixes. It is the first release that has been published to JCenter. Publication on Maven Central will follow soon.

Languages

added detection support for Portuguese

Features

extended language models for already existing languages to provide for more accurate detection results

the larger language models are now lazy-loaded to reduce waiting times during start-up, especially when starting the lingua REPL

added some unit tests for the LanguageDetector class that cover the most basic functionality (will be extended in upcoming versions)

added accuracy reports and test data for each supported language, in order to measure language detection accuracy (can be generated with mvn test -P accuracy-reports)

added accuracy statistics summary of the current implementation to README

API changes

renamed method LanguageDetector.detectLanguageFrom() to LanguageDetector.detectLanguageOf() to use the grammatically correct English preposition

in version 0.1.0, the method now called LanguageDetector.detectLanguageOf() returned null for strings whose language could not be detected reliably. Now, Language.UNKNOWN is returned instead in those cases to prevent NullPointerExceptions, especially in Java code.

Bug Fixes

fixed a bug in lingua's REPL that caused non-ASCII characters to get broken in consoles which do not use UTF-8 encoding by default, especially on Windows systems

Source code(tar.gz)
Source code(zip)
lingua-0.2.0-sources.jar(16.84 KB)
lingua-0.2.0-with-dependencies.jar(14.92 MB)
lingua-0.2.0.jar(10.84 MB)
v0.1.0(Nov 16, 2018)
This is the very first release of Lingua. It aims at accurate language detection results for both long and especially short text. Detection on short text fragments such as Twitter messages is a weak spot of many similar libraries.

Supported languages so far:

English

French

German

Italian

Latin

Spanish

Source code(tar.gz)
Source code(zip)
