State of the Art Natural Language Processing

John Snow Labs

Last update: Jan 5, 2023

Overview

Spark NLP: State of the Art Natural Language Processing

Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192 languages. It supports state-of-the-art transformers such as BERT, XLNet, ELMO, ALBERT, and Universal Sentence Encoder that can be used seamlessly in a cluster. It also offers Tokenization, Word Segmentation, Part-of-Speech Tagging, Named Entity Recognition, Dependency Parsing, Spell Checking, Multi-class Text Classification, Multi-class Sentiment Analysis, Machine Translation (180+ languages), Summarization and Question Answering (Google T5), and many more NLP tasks.

Project's website

Take a look at our official Spark NLP page: http://nlp.johnsnowlabs.com/ for user documentation and examples

Community support

Slack: For live discussion with the Spark NLP community and the team
GitHub: Bug reports, feature requests, and contributions
Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Medium: Spark NLP articles
YouTube: Spark NLP video tutorials

Features
Requirements
Quick Start
Apache Spark Support
Databricks Support
EMR Support
Using Spark NLP
Pipelines & Models
- Pipelines
- Models
Examples
FAQ
Troubleshooting
Citation
Contributing

Features

Tokenization
Trainable Word Segmentation
Stop Words Removal
Token Normalizer
Document Normalizer
Stemmer
Lemmatizer
NGrams
Regex Matching
Text Matching
Chunking
Date Matcher
Sentence Detector
Deep Sentence Detector (Deep learning)
Dependency parsing (Labeled/unlabeled)
Part-of-speech tagging
Sentiment Detection (ML models)
Spell Checker (ML and DL models)
Word Embeddings (GloVe and Word2Vec)
BERT Embeddings (TF Hub models)
ELMO Embeddings (TF Hub models)
ALBERT Embeddings (TF Hub models)
XLNet Embeddings
Universal Sentence Encoder (TF Hub models)
BERT Sentence Embeddings (42 TF Hub models)
Sentence Embeddings
Chunk Embeddings
Unsupervised keywords extraction
Language Detection & Identification (up to 375 languages)
Multi-class Sentiment analysis (Deep learning)
Multi-label Sentiment analysis (Deep learning)
Multi-class Text Classification (Deep learning)
Neural Machine Translation
Text-To-Text Transfer Transformer (Google T5)
Named entity recognition (Deep learning)
Easy TensorFlow integration
GPU Support
Full integration with Spark ML functions
+710 pre-trained models in +192 languages!
+450 pre-trained pipelines in +192 languages!
Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, and Urdu.

Requirements

To use Spark NLP you need:

Java 8
Apache Spark 2.4.x (or Apache Spark 2.3.x)

Quick Start

This is a quick example of how to use a Spark NLP pre-trained pipeline in Python and PySpark:

$ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.6 -y
$ conda activate sparknlp
$ pip install spark-nlp==2.7.3 pyspark==2.4.7

In Python console or Jupyter Python3 kernel:

# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark Session with Spark NLP
# The start() function has two parameters: gpu and spark23
# sparknlp.start(gpu=True) will start the session with GPU support
# sparknlp.start(spark23=True) is when you have Apache Spark 2.3.x installed
spark = sparknlp.start()

# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Your testing dataset
text = """
The Mona Lisa is a 16th century oil painting created by Leonardo.
It's held at the Louvre in Paris.
"""

# Annotate your testing dataset
result = pipeline.annotate(text)

# What's in the pipeline
list(result.keys())
# Output: ['entities', 'stem', 'checked', 'lemma', 'document',
#          'pos', 'token', 'ner', 'embeddings', 'sentence']

# Check the results
result['entities']
# Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']

For more examples, you can visit our dedicated repository to showcase all Spark NLP use cases!
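
The dictionary returned by annotate() maps each output column to a list of strings. As a minimal sketch, assuming the 'token' and 'pos' lists are aligned one-to-one, tokens can be paired with their part-of-speech tags as shown here. The sample dictionary is hand-written for illustration (it is not real pipeline output), and pair_tokens_with_tags is a hypothetical helper, not part of Spark NLP.

```python
from typing import Dict, List, Tuple


def pair_tokens_with_tags(result: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Pair each token with its POS tag from an annotate()-style dictionary."""
    tokens = result.get("token", [])
    tags = result.get("pos", [])
    # Assumption: one tag per token; fail loudly if that does not hold.
    if len(tokens) != len(tags):
        raise ValueError("token and pos lists have different lengths")
    return list(zip(tokens, tags))


# Hand-written sample, NOT real pipeline output.
sample_result = {"token": ["Leonardo", "painted"], "pos": ["NNP", "VBD"]}
pairs = pair_tokens_with_tags(sample_result)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

With a live session, the same helper could be applied to the result of pipeline.annotate(text).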

Apache Spark Support

Spark NLP 2.7.3 has been built on top of Apache Spark 2.4.x and fully supports Apache Spark 2.3.x:

Spark NLP	Apache Spark 2.3.x	Apache Spark 2.4.x
2.7.x	YES	YES
2.6.x	YES	YES
2.5.x	YES	YES
2.4.x	Partially	YES
1.8.x	Partially	YES
1.7.x	YES	NO
1.6.x	YES	NO
1.5.x	YES	NO

NOTE: Starting with the 2.5.4 release, we support both Apache Spark 2.4.x and Apache Spark 2.3.x at the same time.

Find out more about Spark NLP versions from our release notes.
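
If you need to check compatibility from a script, the table above can be encoded as a small lookup. This is only a sketch: SUPPORT_MATRIX and spark_support are hypothetical names, the values are transcribed from the table, and any combination not listed there is reported as "unknown".

```python
from typing import Dict

# Compatibility matrix transcribed from the table above
# (Spark NLP release line -> Apache Spark release line -> support level).
SUPPORT_MATRIX: Dict[str, Dict[str, str]] = {
    "2.7": {"2.3": "YES", "2.4": "YES"},
    "2.6": {"2.3": "YES", "2.4": "YES"},
    "2.5": {"2.3": "YES", "2.4": "YES"},
    "2.4": {"2.3": "Partially", "2.4": "YES"},
    "1.8": {"2.3": "Partially", "2.4": "YES"},
    "1.7": {"2.3": "YES", "2.4": "NO"},
    "1.6": {"2.3": "YES", "2.4": "NO"},
    "1.5": {"2.3": "YES", "2.4": "NO"},
}


def _release_line(version: str) -> str:
    """Reduce a full version such as '2.7.3' to its release line '2.7'."""
    return ".".join(version.split(".")[:2])


def spark_support(spark_nlp_version: str, spark_version: str) -> str:
    """Return 'YES', 'Partially', 'NO', or 'unknown' for a version pair."""
    row = SUPPORT_MATRIX.get(_release_line(spark_nlp_version), {})
    return row.get(_release_line(spark_version), "unknown")


print(spark_support("2.7.3", "2.4.7"))
```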

Databricks Support

Spark NLP 2.7.3 has been tested and is compatible with the following runtimes:

6.2
6.2 ML
6.3
6.3 ML
6.4
6.4 ML
6.5
6.5 ML

EMR Support

Spark NLP 2.7.3 has been tested and is compatible with the following EMR releases:

5.26.0
5.27.0

Full list of EMR releases.

Usage

Spark Packages

Command line (requires internet connection)

This library has been uploaded to the spark-packages repository.

The benefit of spark-packages is that it makes the library available for both Scala/Java and Python.

To use the most recent version on Apache Spark 2.4.x, just add --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3 to your Spark command:

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3

spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3

This can also be used to create a SparkSession manually by using the spark.jars.packages option in both Python and Scala.

NOTE: To use Spark NLP with GPU you can use the dedicated GPU package com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.7.3

NOTE: To use Spark NLP on Apache Spark 2.3.x you should instead use the following packages:

CPU: com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:2.7.3
GPU: com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:2.7.3
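
The four package names above follow a regular pattern, so the right coordinate can be derived from the Apache Spark version and the GPU choice. The following is a minimal sketch; spark_nlp_coordinate is a hypothetical helper (not part of Spark NLP) and it only covers the Spark 2.3.x and 2.4.x lines discussed here.

```python
def spark_nlp_coordinate(spark_version: str, gpu: bool = False,
                         version: str = "2.7.3") -> str:
    """Build the Maven coordinate for the given Apache Spark line and CPU/GPU choice."""
    spark_line = ".".join(spark_version.split(".")[:2])
    if spark_line == "2.4":
        spark_suffix = ""            # default artifacts target Apache Spark 2.4.x
    elif spark_line == "2.3":
        spark_suffix = "-spark23"    # dedicated artifacts for Apache Spark 2.3.x
    else:
        raise ValueError(f"Unsupported Apache Spark version: {spark_version}")
    artifact = "spark-nlp" + ("-gpu" if gpu else "") + spark_suffix + "_2.11"
    return f"com.johnsnowlabs.nlp:{artifact}:{version}"


print(spark_nlp_coordinate("2.4.7"))
print(spark_nlp_coordinate("2.3.4", gpu=True))
```

The returned string can be passed to --packages or to the spark.jars.packages option.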

NOTE: In case you are using large pretrained models like UniversalSentenceEncoder, you need to have the following set in your SparkSession:

spark-shell --driver-memory 16g --conf spark.kryoserializer.buffer.max=1000M --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3
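
The same flags can be assembled programmatically, for instance when launching spark-shell from a Python wrapper script. The sketch below is illustrative only: build_spark_shell_command is a hypothetical helper, and its default values simply mirror the command above.

```python
import shlex
from typing import List


def build_spark_shell_command(
    package: str = "com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3",
    driver_memory: str = "16g",
    kryo_buffer_max: str = "1000M",
) -> List[str]:
    """Return the spark-shell argument list for large pretrained models."""
    return [
        "spark-shell",
        "--driver-memory", driver_memory,
        "--conf", f"spark.kryoserializer.buffer.max={kryo_buffer_max}",
        "--packages", package,
    ]


args = build_spark_shell_command()
# Quote each argument so the printed line is safe to paste into a shell.
command_line = " ".join(shlex.quote(arg) for arg in args)
print(command_line)
```

The argument list can be handed to subprocess.run(args) on a machine where spark-shell is on the PATH.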

Scala

Our package is deployed to Maven Central. To add it as a dependency in your application:

Maven

spark-nlp on Apache Spark 2.4.x:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
    <version>2.7.3</version>
</dependency>

spark-nlp-gpu on Apache Spark 2.4.x:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.11</artifactId>
    <version>2.7.3</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-spark23 -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>2.7.3</version>
</dependency>

spark-nlp-gpu on Apache Spark 2.3.x:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu-spark23 -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>2.7.3</version>
</dependency>

SBT

spark-nlp on Apache Spark 2.4.x:

// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.7.3"

spark-nlp-gpu on Apache Spark 2.4.x:

// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "2.7.3"

spark-nlp on Apache Spark 2.3.x:

// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-spark23
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-spark23" % "2.7.3"

spark-nlp-gpu on Apache Spark 2.3.x:

// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu-spark23
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu-spark23" % "2.7.3"

Maven Central: https://mvnrepository.com/artifact/com.johnsnowlabs.nlp

Python

Python without explicit PySpark installation

Pip/Conda

If you installed pyspark through pip/conda, you can install spark-nlp through the same channel.

Pip:

pip install spark-nlp==2.7.3

Conda:

conda install -c johnsnowlabs spark-nlp

PyPI spark-nlp package / Anaconda spark-nlp package

Then you'll have to create a SparkSession either from Spark NLP:

import sparknlp

spark = sparknlp.start()

or manually:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[4]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3") \
    .config("spark.kryoserializer.buffer.max", "1000M") \
    .getOrCreate()

If you are using local JARs, you can set spark.jars instead, giving it a comma-delimited list of JAR files. For cluster setups, you'll have to put the JARs in a location reachable by all driver and executor nodes.
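
As a minimal sketch of the local-JAR variant, the helper below builds the comma-delimited spark.jars value from a list of paths. The helper name and the JAR paths are hypothetical placeholders; replace them with the files you actually downloaded.

```python
from typing import Dict, Iterable


def local_jars_config(jar_paths: Iterable[str]) -> Dict[str, str]:
    """Build the spark.jars setting from a list of local JAR paths."""
    paths = [path.strip() for path in jar_paths if path.strip()]
    if not paths:
        raise ValueError("At least one JAR path is required")
    # spark.jars expects a single comma-delimited string.
    return {"spark.jars": ",".join(paths)}


# Hypothetical paths, for illustration only.
conf = local_jars_config(["/opt/jars/spark-nlp.jar", "/opt/jars/extra.jar"])
print(conf["spark.jars"])

# With PySpark installed, each entry can be applied to a session builder:
#   builder = SparkSession.builder.appName("Spark NLP")
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```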

Quick example:

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Create or get the Spark session

spark = sparknlp.start()

sparknlp.version()
spark.version

# Download, load, and annotate a text with a pre-trained pipeline

pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
result = pipeline.annotate('The Mona Lisa is a 16th century oil painting created by Leonardo')

Compiled JARs

Build from source

spark-nlp

FAT-JAR for CPU on Apache Spark 2.4.x

sbt assembly

FAT-JAR for GPU on Apache Spark 2.4.x

sbt -Dis_gpu=true assembly

FAT-JAR for CPU on Apache Spark 2.3.x

sbt -Dis_spark23=true assembly

FAT-JAR for GPU on Apache Spark 2.3.x

sbt -Dis_gpu=true -Dis_spark23=true assembly

Using the jar manually

If for some reason you need to use the JAR, you can either download the Fat JARs provided here or download them from Maven Central.

To add JARs to spark programs use the --jars option:

spark-shell --jars spark-nlp.jar

The preferred way to use the library when running spark programs is using the --packages option as specified in the spark-packages section.

Apache Zeppelin

Use either one of the following options:

Add the following Maven Coordinates to the interpreter's library list

com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3

Add the path to a pre-built JAR from here to the interpreter's library list, making sure the JAR is available on the driver path

Python in Zeppelin

In addition to the previous step, install the Python module through pip:

pip install spark-nlp==2.7.3

Or you can install spark-nlp from inside Zeppelin by using Conda:

python.conda install -c johnsnowlabs spark-nlp

Configure Zeppelin properly and use cells with %spark.pyspark or whichever interpreter name you chose.

Finally, in the Zeppelin interpreter settings, make sure zeppelin.python is set to the Python you want to use and into which you installed the pip library (e.g. python3).

An alternative option is to set SPARK_SUBMIT_OPTIONS (in zeppelin-env.sh) and make sure --packages is there as shown earlier, since it covers both the Scala and Python sides of the installation.

Jupyter Notebook (Python)

The easiest way to get this done is by making Jupyter Notebook run using pyspark as follows:

export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3

Alternatively, you can combine the --jars option for pyspark with pip install spark-nlp.

If you are not using pyspark at all, you'll have to follow the instructions linked here.

Google Colab Notebook

Google Colab is perhaps the easiest way to get started with spark-nlp. It requires no installation or set up other than having a Google account.

Run the following code in Google Colab notebook and start using spark-nlp right away.

import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.7

# Install Spark NLP
! pip install --ignore-installed spark-nlp==2.7.3

# Quick SparkSession start
import sparknlp
spark = sparknlp.start()

print("Spark NLP version")
sparknlp.version()
print("Apache Spark version")
spark.version

Here is a live demo on Google Colab that performs sentiment analysis and NER using pretrained spark-nlp models.

Databricks Cluster

1. Create a cluster if you don't have one already.
2. On a new or existing cluster, add the following to the Advanced Options -> Spark tab:

spark.kryoserializer.buffer.max 1000M
spark.serializer org.apache.spark.serializer.KryoSerializer

3. Check the "Enable autoscaling local storage" box to have persistent local storage.
4. In the Libraries tab inside your cluster, follow these steps:

4.1. Install New -> PyPI -> spark-nlp -> Install

4.2. Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3 -> Install

5. Now you can attach your notebook to the cluster and use Spark NLP!

S3 Cluster

With no Hadoop configuration

If your distributed storage is S3 and you don't have a standard Hadoop configuration (i.e. fs.defaultFS), you need to specify where in the cluster's distributed storage you want to store Spark NLP's tmp files. First, decide where you want to put your application.conf file:

import com.johnsnowlabs.util.ConfigLoader
ConfigLoader.setConfigPath("/somewhere/to/put/application.conf")

Then put the following content in that application.conf:

sparknlp {
  settings {
    cluster_tmp_dir = "somewhere in s3n:// path to some folder"
  }
}
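
For automated deployments, the file can be generated rather than written by hand. The sketch below writes an application.conf with the structure shown above; write_application_conf is a hypothetical helper, and the bucket name and file location are placeholders chosen for illustration.

```python
import os
import tempfile


def write_application_conf(conf_path: str, cluster_tmp_dir: str) -> str:
    """Write a Spark NLP application.conf pointing at the given tmp directory."""
    content = (
        "sparknlp {\n"
        "  settings {\n"
        f'    cluster_tmp_dir = "{cluster_tmp_dir}"\n'
        "  }\n"
        "}\n"
    )
    os.makedirs(os.path.dirname(os.path.abspath(conf_path)), exist_ok=True)
    with open(conf_path, "w", encoding="utf-8") as handle:
        handle.write(content)
    return content


# Placeholder location and bucket, for illustration only.
demo_path = os.path.join(tempfile.mkdtemp(), "application.conf")
demo_content = write_application_conf(demo_path, "s3n://my-bucket/sparknlp-tmp")
print(demo_content)
```

The path written here is the one to pass to ConfigLoader.setConfigPath.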

Pipelines and Models

Pipelines

Spark NLP offers more than 450 pre-trained pipelines in 192 languages.

English pipelines:

Pipeline	Name	Build	lang
Explain Document ML	`explain_document_ml`	2.4.0	`en`
Explain Document DL	`explain_document_dl`	2.4.3	`en`
Recognize Entities DL	`recognize_entities_dl`	2.4.3	`en`
Recognize Entities DL	`recognize_entities_bert`	2.4.3	`en`
OntoNotes Entities Small	`onto_recognize_entities_sm`	2.4.0	`en`
OntoNotes Entities Large	`onto_recognize_entities_lg`	2.4.0	`en`
Match Datetime	`match_datetime`	2.4.0	`en`
Match Pattern	`match_pattern`	2.4.0	`en`
Match Chunk	`match_chunks`	2.4.0	`en`
Match Phrases	`match_phrases`	2.4.0	`en`
Clean Stop	`clean_stop`	2.4.0	`en`
Clean Pattern	`clean_pattern`	2.4.0	`en`
Clean Slang	`clean_slang`	2.4.0	`en`
Check Spelling	`check_spelling`	2.4.0	`en`
Check Spelling DL	`check_spelling_dl`	2.5.0	`en`
Analyze Sentiment	`analyze_sentiment`	2.4.0	`en`
Analyze Sentiment DL	`analyze_sentimentdl_use_imdb`	2.5.0	`en`
Analyze Sentiment DL	`analyze_sentimentdl_use_twitter`	2.5.0	`en`
Dependency Parse	`dependency_parse`	2.4.0	`en`

Quick example:

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val testData = spark.createDataFrame(Seq(
(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
)).toDF("id", "text")

val pipeline = PretrainedPipeline("explain_document_dl", lang="en")

val annotation = pipeline.transform(testData)

annotation.show()
/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.5.0
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_dl,en,public/models)
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 10 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id|                text|            document|               token|            sentence|             checked|               lemma|                stem|                 pos|          embeddings|                 ner|            entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  1|Google has announ...|[[document, 0, 10...|[[token, 0, 5, Go...|[[document, 0, 10...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, go...|[[pos, 0, 5, NNP,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
|  2|The Paris metro w...|[[document, 0, 11...|[[token, 0, 2, Th...|[[document, 0, 11...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 4, 8, Pa...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/

annotation.select("entities.result").show(false)

/*
+----------------------------------+
|result                            |
+----------------------------------+
|[Google, TensorFlow]              |
|[Donald John Trump, United States]|
+----------------------------------+
*/

Please check out our Models Hub for the full list of pre-trained pipelines with examples, demos, benchmarks, and more.

Models

Spark NLP offers more than 710 pre-trained models in 192 languages.

Some of the selected languages: Afrikaans, Arabic, Armenian, Basque, Bengali, Breton, Bulgarian, Catalan, Czech, Dutch, English, Esperanto, Finnish, French, Galician, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Latvian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Southern Sotho, Spanish, Swahili, Swedish, Tswana, Turkish, Ukrainian, Zulu

English Models:

Model	Name	Build	Lang
LemmatizerModel (Lemmatizer)	`lemma_antbnc`	2.0.2	`en`
PerceptronModel (POS)	`pos_anc`	2.0.2	`en`
PerceptronModel (POS UD)	`pos_ud_ewt`	2.2.2	`en`
NerCrfModel (NER with GloVe)	`ner_crf`	2.4.0	`en`
NerDLModel (NER with GloVe)	`ner_dl`	2.4.3	`en`
NerDLModel (NER with BERT)	`ner_dl_bert`	2.4.3	`en`
NerDLModel (OntoNotes with GloVe 100d)	`onto_100`	2.4.0	`en`
NerDLModel (OntoNotes with GloVe 300d)	`onto_300`	2.4.0	`en`
SymmetricDeleteModel (Spell Checker)	`spellcheck_sd`	2.0.2	`en`
NorvigSweetingModel (Spell Checker)	`spellcheck_norvig`	2.0.2	`en`
ViveknSentimentModel (Sentiment)	`sentiment_vivekn`	2.0.2	`en`
DependencyParser (Dependency)	`dependency_conllu`	2.0.8	`en`
TypedDependencyParser (Dependency)	`dependency_typed_conllu`	2.0.8	`en`

Embeddings:

Model	Name	Build	Lang
WordEmbeddings (GloVe)	`glove_100d`	2.4.0	`en`
BertEmbeddings	`bert_base_uncased`	2.4.0	`en`
BertEmbeddings	`bert_base_cased`	2.4.0	`en`
BertEmbeddings	`bert_large_uncased`	2.4.0	`en`
BertEmbeddings	`bert_large_cased`	2.4.0	`en`
ElmoEmbeddings	`elmo`	2.4.0	`en`
UniversalSentenceEncoder (USE)	`tfhub_use`	2.4.0	`en`
UniversalSentenceEncoder (USE)	`tfhub_use_lg`	2.4.0	`en`
AlbertEmbeddings	`albert_base_uncased`	2.5.0	`en`
AlbertEmbeddings	`albert_large_uncased`	2.5.0	`en`
AlbertEmbeddings	`albert_xlarge_uncased`	2.5.0	`en`
AlbertEmbeddings	`albert_xxlarge_uncased`	2.5.0	`en`
XlnetEmbeddings	`xlnet_base_cased`	2.5.0	`en`
XlnetEmbeddings	`xlnet_large_cased`	2.5.0	`en`

Classification:

Model	Name	Build	Lang
ClassifierDL (with tfhub_use)	`classifierdl_use_trec6`	2.5.0	`en`
ClassifierDL (with tfhub_use)	`classifierdl_use_trec50`	2.5.0	`en`
SentimentDL (with tfhub_use)	`sentimentdl_use_imdb`	2.5.0	`en`
SentimentDL (with tfhub_use)	`sentimentdl_use_twitter`	2.5.0	`en`
SentimentDL (with glove_100d)	`sentimentdl_glove_imdb`	2.5.0	`en`

Quick online example:

# load NER model trained by deep learning approach and GloVe word embeddings
ner_dl = NerDLModel.pretrained('ner_dl')
# load NER model trained by deep learning approach and BERT word embeddings
ner_bert = NerDLModel.pretrained('ner_dl_bert')

// load French POS tagger model trained by Universal Dependencies
val french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang="fr")
// load Italian LemmatizerModel
val italian_lemma = LemmatizerModel.pretrained("lemma_dxc", lang="it")

Quick offline example:

Loading a PerceptronModel annotator model inside a Spark NLP Pipeline:

val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/")
      .setInputCols("document", "token")
      .setOutputCol("pos")
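
When many models are stored offline, it can help to recover their metadata from the folder names. The sketch below assumes (it is only an assumption, suggested by the example path above) that folders are named name_lang_version_version_timestamp, and that the two version fields are the Spark NLP and Apache Spark versions. parse_model_folder and ModelFolder are hypothetical names, not part of Spark NLP.

```python
from typing import NamedTuple


class ModelFolder(NamedTuple):
    name: str
    lang: str
    nlp_version: str    # assumed to be the Spark NLP version
    spark_version: str  # assumed to be the Apache Spark version
    timestamp: str


def parse_model_folder(path: str) -> ModelFolder:
    """Split a downloaded model folder name into its assumed components."""
    folder = path.rstrip("/").split("/")[-1]
    # The model name itself may contain underscores, so split from the right.
    parts = folder.rsplit("_", 4)
    if len(parts) != 5:
        raise ValueError(f"Unexpected model folder name: {folder}")
    return ModelFolder(*parts)


info = parse_model_folder("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/")
print(info.name, info.lang)
```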

Please check out our Models Hub for the full list of pre-trained models with examples, demos, benchmarks, and more.

Examples

Need more examples? Check out our dedicated Spark NLP Showcase repository to showcase all Spark NLP use cases!

In addition, don't forget to check Spark NLP in Action built by Streamlit.

All examples: spark-nlp-workshop

FAQ

Check our Articles and Videos page here

Citation

We have published a paper that you can cite for the Spark NLP library:

@article{KOCAMAN2021100058,
    title = {Spark NLP: Natural language understanding at scale},
    journal = {Software Impacts},
    pages = {100058},
    year = {2021},
    issn = {2665-9638},
    doi = {https://doi.org/10.1016/j.simpa.2021.100058},
    url = {https://www.sciencedirect.com/science/article/pii/S2665963821000063},
    author = {Veysel Kocaman and David Talby},
    keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
    abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
}

Contributing

We appreciate contributions of any sort:

ideas
feedback
documentation
bug reports
NLP training and testing corpora
development and testing

Clone the repo and submit your pull-requests! Or directly create issues in this repo.

Contact

[email protected]

John Snow Labs

http://johnsnowlabs.com

Comments

spark-nlp won't download pretrained model on Hadoop Cluster

Description

I am using the code below to get word embeddings using BERT model.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

spark = SparkSession.builder\
    .master("yarn")\
    .config("spark.locality.wait", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.0")\
    .config("spark.sql.autoBroadcastJoinThreshold", -1)\
    .config("spark.sql.codegen.aggregate.map.twolevel.enabled", "false")\
    .getOrCreate()

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setLazyAnnotator(False)

embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
      .setInputCols("sentence") \
      .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

The script works great in Spark local development mode, but when I deployed it on the Hadoop cluster (using YARN as the resource manager) I got the following error:

labse download started this may take some time.
Traceback (most recent call last):
  File "testing_bert_hadoop.py", line 138, in <module>
    embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
  File "/usr/local/lib/python3.6/site-packages/sparknlp/annotator.py", line 1969, in pretrained
    return ResourceDownloader.downloadModel(BertSentenceEmbeddings, name, lang, remote_loc)
  File "/usr/local/lib/python3.6/site-packages/sparknlp/pretrained.py", line 32, in downloadModel
    file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
  File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 192, in __init__
    "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
  File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 129, in __init__
    self._java_obj = self.new_java_obj(java_obj, *args)
  File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 139, in new_java_obj
    return self._new_java_obj(java_class, *args)
  File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/pyspark.zip/pyspark/ml/wrapper.py", line 63, in _new_java_obj
  File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:61)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:90)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:89)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at scala.collection.Iterator$$anon$14.next(Iterator.scala:541)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
	at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
	at scala.collection.AbstractIterator.toList(Iterator.scala:1336)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:92)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:84)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:70)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:399)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:496)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

I tried manually updating the JARs json4s-native, json4s-scalap, and many others, but the error still persists.

Expected Behavior

The pretrained pipeline should be downloaded and loaded into the pipeline_model variable

Current Behavior

The above-mentioned error is raised when running on the Hadoop cluster.

Possible Solution

I tried manually updating the JARs json4s-native, json4s-scalap, and many others, but the error still persists. Maybe I am lacking some knowledge or misunderstanding the problem.

Context

I was trying to get word embeddings using the LaBSE model for a classification problem.

Your Environment

Spark NLP version 3.0.0 on all nodes
Apache Spark version 2.3.0.2.6.5.1175-1
Java version OpenJDK Runtime Environment (build 1.8.0_292-b10) OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)
Setup and installation: Spark comes by default with the Hadoop installation
Operating System and version: CentOS 7
Cluster Manager: Ambari (HDP 2.6.5.1175-1)

Please let me know if you need any more info. Thanks.

question

opened by DanielOX 39

TypeError: 'JavaPackage' object is not callable

I get a "TypeError: 'JavaPackage' object is not callable" error whenever I try to call any annotator.

Description

Platform: Ubuntu 16.04 LTS on Windows 10's Linux subsystem (WSL)
Python: Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19)
PySpark: installed with pip (i.e. Python without explicit Spark installation)
spark-nlp: pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.5.4

I tried running the following, but every call returned the same "TypeError: 'JavaPackage' object is not callable" error. There seems to be a similar bug, "Python annotators should be loadable on its own #91", that was closed some time ago, but it still happens to me.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", "lib/sparknlp.jar") \
    .getOrCreate()

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("./lemmas001.txt")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

Here are the errors:

=== from documentassembler ==============================================

File "", line 1, in
    documentAssembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document")
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
    return func(self, **kwargs)
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/base.py", line 175, in __init__
    super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
    return func(self, **kwargs)
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/base.py", line 20, in __init__
    self._java_obj = self._new_java_obj(classname, self.uid)
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable

=== from lemmatizer ====================================================

Traceback (most recent call last):

File "", line 1, in
    lemmatizer = Lemmatizer() .setInputCols(["token"]) .setOutputCol("lemma") .setDictionary("./lemmas001.txt")
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
    return func(self, **kwargs)
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/annotator.py", line 281, in __init__
    super(Lemmatizer, self).__init__(classname="com.johnsnowlabs.nlp.annotators.Lemmatizer")
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
    return func(self, **kwargs)
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/annotator.py", line 95, in __init__
    self._java_obj = self._new_java_obj(classname, self.uid)
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable

=== from normalizer ====================================================

Traceback (most recent call last):

File "", line 1, in
    normalizer = Normalizer() .setInputCols(["token"]) .setOutputCol("normalized")
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
    return func(self, **kwargs)
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/annotator.py", line 198, in __init__
    super(Normalizer, self).__init__(classname="com.johnsnowlabs.nlp.annotators.Normalizer")
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
    return func(self, **kwargs)
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/annotator.py", line 95, in __init__
    self._java_obj = self._new_java_obj(classname, self.uid)
File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
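
A 'JavaPackage' object is not callable error generally means the Python wrapper could not find the matching Java class, which is what happens when the Spark NLP JAR is not on the JVM classpath. One thing to try, sketched below, is to let Spark resolve the JAR from Maven through spark.jars.packages instead of a relative spark.driver.extraClassPath. The helper names are hypothetical, the coordinate format is the one used in other reports on this page, and the Scala suffix (2.11) and version are assumptions that must match the actual installation.

```python
def spark_nlp_coordinate(version, scala_version="2.11"):
    """Build a Maven coordinate for Spark NLP (hypothetical helper).

    The group and artifact follow the form seen elsewhere on this page,
    e.g. com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.4.
    """
    if not version:
        raise ValueError("a Spark NLP version is required")
    return f"com.johnsnowlabs.nlp:spark-nlp_{scala_version}:{version}"


def session_config(version, scala_version="2.11"):
    """Return the Spark option that asks Spark to fetch the Spark NLP JAR."""
    return {"spark.jars.packages": spark_nlp_coordinate(version, scala_version)}


if __name__ == "__main__":
    # With PySpark installed, the options would be applied like this:
    #   builder = SparkSession.builder
    #   for key, value in session_config("2.5.4").items():
    #       builder = builder.config(key, value)
    #   spark = builder.getOrCreate()
    print(session_config("2.5.4"))
```

The options have to be set before the first SparkSession is created, because the JVM classpath is fixed once the session exists.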

opened by bigheadming 32

Version Compatibility between sparkNLP 2.5.3 and spark 2.3.x

Apache Spark version 2.3.2.3.1.5.0-152
Spark NLP version 1.7.3
Apache Spark setup (OS, docker, jupyter, zeppelin, Cloudera, Databricks, EMR, etc.): Cloudera
How did you install Spark NLP: Quoting the IT team – “we don't install packages from source because doing so would not allow us to pass a umask value to the package during installation and thus making it only importable by the root user so we install via pip, specifically using the pip module in ansible, in order to pass the needed umask value”
Java version : 1.8.0_121
Python/Scala version : Python 3.6.5
Does anything else work in Apache Spark and only Spark NLP related part fails? Not sure. I'm working on Linux and assume it is connected to a Hadoop system that lets me code on Spark.

Code Snippet:*****************************************************

import os
import sys
sys.path.append('../../')

print(sys.version)

from sparknlp.pretrained import ResourceDownloader
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import *

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

spark = SparkSession.builder \
    .appName("ner")\
    .master("local[*]")\
    .config("spark.driver.memory","4G")\
    .config("spark.driver.maxResultSize", "2G")\
    .config("spark.driver.extraClassPath", "/hadoop/anaconda3.6/lib/python3.6/site-packages/sparknlp/lib/sparknlp.jar")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

downloader = ResourceDownloader()


l = [(1,'Thanks for calling to ESI'),(2,'How can i help you'),(3,'Please reach out to us on mail')]

data = spark.createDataFrame(l, ['docID','text'])

#Working fine
document_assembler = DocumentAssembler() \
    .setInputCol("text")

#Working fine
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

#Working fine
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

#Working fine
lemma = LemmatizerModel.load("/user/elxxx/emma_mod").setInputCols(["token"]).setOutputCol("lemma")

#Working fine
pos = PerceptronModel.load("/user/elxxx/pos_anc_mod/").setInputCols(["document","token"]).setOutputCol("pos")

#Working fine
nor_sweet = NorvigSweetingModel.load("/user/elxxx/spell_nor_mod").setInputCols(["token"]).setOutputCol("corrected")

#Working fine
sent_viv = ViveknSentimentModel.load("/user/elxxx/sent_vivek_mod").setInputCols(["sentence","token"]).setOutputCol("sentiment")


#Error: WordEmbeddingsModel not defined
embed = WordEmbeddingsModel.load("/user/elxxx/wordEmbedMod").setStoragePath("/user/elxxx/wordEmbedMod/glove.6B.100d.txt", "TEXT")\
      .setDimension(100)\
      .setStorageRef("glove_100d") \
      .setInputCols("document", "token") \
      .setOutputCol("embeddings")

#Similar issue with other modules
#Error: BertEmbeddingsModel not defined
#bert = BertEmbeddings.load ("/user/elxxx/bert").setInputCols("sentence", "token") .setOutputCol("bert").
**************************************************************************************************************************
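
A "not defined" error after a wildcard import suggests that the installed Python package does not export that class, which could happen if the Python package is older than the JAR. A minimal sketch of how to check this is shown below. The helper name is hypothetical, and the standard-library module math stands in for sparknlp.annotator so that the sketch runs without Spark NLP installed.

```python
import importlib


def missing_names(module_name, names):
    """Return the names that the given module does not export."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


if __name__ == "__main__":
    # With Spark NLP installed, the call would target "sparknlp.annotator", e.g.
    #   missing_names("sparknlp.annotator", ["WordEmbeddingsModel", "BertEmbeddings"])
    # Here "math" is used only as a stand-in module.
    print(missing_names("math", ["sqrt", "WordEmbeddingsModel"]))
```

Any name returned by the check is absent from the installed package, so the Python package version and the JAR version are worth comparing before looking further.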

We replaced the previous sparkNLP.jar with the new Spark NLP fat JAR provided by @maziyarpanahi (renaming it to sparkNLP.jar). It seems to conflict with the Jackson JAR, which might be why Spark crashed.

Could you help us configure Spark NLP for our version of Spark, assuming there are JAR files that support it? Happy to provide more details if needed.
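
Vendor builds report long version strings such as 2.3.2.3.1.5.0-152, which makes it easy to overlook which Spark release line is actually installed. The sketch below extracts the major and minor version and compares it with the release line a given Spark NLP build targets. The function names are hypothetical, and the target value used in the example (Spark 2.4.5, as used with Spark NLP 2.5.x in other reports on this page) is an assumption, not a statement about official compatibility.

```python
import re


def spark_major_minor(version):
    """Extract (major, minor) from a Spark version string, including vendor builds."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse Spark version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_release_line(installed, built_for):
    """True when both version strings share the same major.minor release line."""
    return spark_major_minor(installed) == spark_major_minor(built_for)


if __name__ == "__main__":
    installed = "2.3.2.3.1.5.0-152"  # version string from the report above
    built_for = "2.4.5"              # assumed target of the Spark NLP build
    print(spark_major_minor(installed), same_release_line(installed, built_for))
```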

question Requires more input

opened by akash166d 28

Problematic frame: C [libtensorflow_framework.so.1+0x744da9] _GLOBAL__sub_I_loader.cc+0x99

Description

I have to perform a spark job, which uses the recognize_entities_dl pretrained pipeline, in a mesos (dockerized) cluster. The cmd is as follows:

/opt/spark/spark-2.4.5-bin-hadoop2.7/bin/spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.0,com.couchbase.client:spark-connector_2.11:2.3.0 --master mesos://zk://remote_ip:2181/mesos --deploy-mode client --class tags_extraction.tags_extraction_eng /opt/sparkscala_2.11-0.1.jar

This is the code:

val (sparkSession, sc) = start_spark_session()

def start_spark_session(): (SparkSession, SparkContext) = {

  val sparkSession = SparkSession.builder()
      .master("mesos://zk://remote-ip:32181/mesos")
      .config("spark.mesos.executor.home", "/opt/spark/spark-2.4.5-bin-hadoop2.7")

      .config("spark.jars",
        "/opt/sparkscala_2.11-0.1.jar," +
          "https://repo1.maven.org/maven2/com/couchbase/client/java-client/2.7.6/java-client-2.7.6.jar," +
          "https://repo1.maven.org/maven2/com/couchbase/client/core-io/1.7.6/core-io-1.7.6.jar," +
          "https://repo1.maven.org/maven2/com/couchbase/client/spark-connector_2.11/2.3.0/spark-connector_2.11-2.3.0.jar," +
          "https://repo1.maven.org/maven2/io/opentracing/opentracing-api/0.31.0/opentracing-api-0.31.0.jar," +
          "https://repo1.maven.org/maven2/io/reactivex/rxjava/1.3.8/rxjava-1.3.8.jar," +
          "https://repo1.maven.org/maven2/io/reactivex/rxscala_2.11/0.26.5/rxscala_2.11-0.26.5.jar," +

          //I tried them both and they give the same error
          "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-2.5.0.jar"+
          "https://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/2.5.0/spark-nlp_2.11-2.5.0.jar"
      )
      .config("spark.executor.extraLibraryPath",
        "/sparkscala_2.11-0.1.jar" +
          "/java-client-2.7.6.jar" +
          "/core-io-1.7.6.jar" +
          "/spark-connector_2.11-2.3.0.jar" +
          "/opentracing-api-0.31.0.jar" +
          "/rxjava-1.3.8.jar" +
          "/rxscala_2.11-0.26.5.jar" +
          "/core-1.1.2.jar" +
          "/spark-streaming-kafka-0-10_2.11-2.4.5.jar" +
          "/spark-sql-kafka-0-10_2.11-2.4.5.jar" +
          "/kafka-clients-2.4.0.jar" +
          "/kafka_2.11-2.4.1.jar" +
          "/spark-nlp-assembly-2.5.0.jar" +
          "/spark-nlp_2.11-2.5.0.jar"
      )
      .getOrCreate()

    sparkSession.sparkContext.setLogLevel("DEBUG")

    val sc = sparkSession.sparkContext
    sc.getConf.getAll.foreach(println)

    (sparkSession, sc)
  }


def main(args: Array[String]) {
  
    val feeds_df = sparkSession.read.couchbase(schema = feedSchema, options = Map("bucket" -> "feeds"))
  
    val pipeline = new PretrainedPipeline("recognize_entities_dl", "en")
   
    println("PIPELINE LOADED") // not printed

    val feeds_tags = pipeline.transform(feeds_df)
      .selectExpr("author_id", "id", "category", "text", "entities.result as tags")

    feeds_tags.printSchema()
    println(feeds_tags)
    println(feeds_tags.getClass.toString)
    println(SizeEstimator.estimate(feeds_tags))
     println("COUNT", feeds_tags.count)

    feeds_tags.show()

    sparkSession.close()
  }

}
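
Two details in the configuration above may be worth checking. The last two JAR URLs passed to spark.jars are concatenated without a comma, so they would be read as one location, and spark.executor.extraLibraryPath is meant for native library directories rather than JAR files (the executor log shows its value ending up in LD_LIBRARY_PATH). The sketch below, written in Python for consistency with the other examples on this page, builds the comma-separated value from a list and rejects entries that look like two fused paths. The helper name is hypothetical.

```python
def join_jars(jars):
    """Join JAR locations into the comma-separated form that spark.jars expects."""
    cleaned = [jar.strip() for jar in jars if jar.strip()]
    for jar in cleaned:
        # More than one ".jar" in an entry suggests two paths were fused together.
        if jar.count(".jar") != 1 or not jar.endswith(".jar"):
            raise ValueError(f"suspicious JAR entry: {jar}")
    return ",".join(cleaned)


if __name__ == "__main__":
    jars = [
        "/opt/sparkscala_2.11-0.1.jar",
        "https://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/2.5.0/spark-nlp_2.11-2.5.0.jar",
    ]
    print(join_jars(jars))
```

Building the value from a list keeps the separators out of the string literals, which is where the missing comma crept in.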

While the pipeline is being downloaded, this error is raised when loading stage 4:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007f8c09bc2da9, pid=4192, tid=0x00007f8d51343700
#
# JRE version: OpenJDK Runtime Environment (8.0_252-b09) (build 1.8.0_252-8u252-b09-1~16.04-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.252-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtensorflow_framework.so.1+0x744da9]  _GLOBAL__sub_I_loader.cc+0x99
#
# Core dump written. Default location: /var/lib/mesos/slaves/fb88a3ad-d32c-41ae-be67-36517a272bcb-S0/frameworks/fb88a3ad-d32c-41ae-be67-36517a272bcb-0000/executors/ct:1591367792198:0:tags_extraction_eng:/runs/2a41d953-7343-4dd5-a59b-2e253f0cda55/core or core.4192
#
# An error report file with more information is saved as:
# /var/lib/mesos/slaves/fb88a3ad-d32c-41ae-be67-36517a272bcb-S0/frameworks/fb88a3ad-d32c-41ae-be67-36517a272bcb-0000/executors/ct:1591367792198:0:tags_extraction_eng:/runs/2a41d953-7343-4dd5-a59b-2e253f0cda55/hs_err_pid4192.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
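
SIGILL means the process tried to execute an instruction the CPU does not support. One frequently reported cause with prebuilt TensorFlow binaries is a host CPU (or virtualized CPU) that lacks AVX. Whether that applies here is an assumption, but it is cheap to check. The sketch below parses /proc/cpuinfo-style text and looks for the avx flag. The function names are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx(cpuinfo_text):
    """True when the 'avx' flag is present among the reported CPU flags."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print("AVX available:", supports_avx(handle.read()))
    except OSError:
        print("/proc/cpuinfo is not readable on this platform")
```

The check has to run inside the container or on the worker that loads the model, since that is the CPU the native library executes on.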

Expected Behavior

Download the pretrained pipeline with val pipeline = new PretrainedPipeline("recognize_entities_dl", "en")

Current Behavior

Driver's stdout:

(spark.repl.local.jars,file:///root/.ivy2/jars/com.johnsnowlabs.nlp_spark-nlp_2.11-2.5.0.jar,file:///root/.ivy2/jars/com.couchbase.client_spark-connector_2.11-2.3.0.jar,file:///root/.ivy2/jars/com.typesafe_config-1.3.0.jar,file:///root/.ivy2/jars/org.rocksdb_rocksdbjni-6.5.3.jar,file:///root/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.2.0.jar,file:///root/.ivy2/jars/com.amazonaws_aws-java-sdk-core-1.11.603.jar,file:///root/.ivy2/jars/com.amazonaws_aws-java-sdk-s3-1.11.603.jar,file:///root/.ivy2/jars/com.github.universal-automata_liblevenshtein-3.0.0.jar,file:///root/.ivy2/jars/com.navigamez_greex-1.0.jar,file:///root/.ivy2/jars/org.json4s_json4s-ext_2.11-3.5.3.jar,file:///root/.ivy2/jars/org.tensorflow_tensorflow-1.15.0.jar,file:///root/.ivy2/jars/net.sf.trove4j_trove4j-3.0.3.jar,file:///root/.ivy2/jars/commons-logging_commons-logging-1.1.3.jar,file:///root/.ivy2/jars/org.apache.httpcomponents_httpclient-4.5.9.jar,file:///root/.ivy2/jars/software.amazon.ion_ion-java-1.0.2.jar,file:///root/.ivy2/jars/com.fasterxml.jackson.dataformat_jackson-dataformat-cbor-2.6.7.jar,file:///root/.ivy2/jars/org.apache.httpcomponents_httpcore-4.4.11.jar,file:///root/.ivy2/jars/commons-codec_commons-codec-1.11.jar,file:///root/.ivy2/jars/com.amazonaws_aws-java-sdk-kms-1.11.603.jar,file:///root/.ivy2/jars/com.amazonaws_jmespath-java-1.11.603.jar,file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-databind-2.6.7.2.jar,file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-annotations-2.6.0.jar,file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.6.7.jar,file:///root/.ivy2/jars/com.google.code.findbugs_annotations-3.0.1.jar,file:///root/.ivy2/jars/com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar,file:///root/.ivy2/jars/com.google.protobuf_protobuf-java-3.0.0-beta-3.jar,file:///root/.ivy2/jars/it.unimi.dsi_fastutil-7.0.12.jar,file:///root/.ivy2/jars/org.projectlombok_lombok-1.16.8.jar,file:///root/.ivy2/jars/org.slf4j_slf4j-api-1.7.21.jar,file:///root/.ivy2/jars
/net.jcip_jcip-annotations-1.0.jar,file:///root/.ivy2/jars/com.google.code.findbugs_jsr305-3.0.1.jar,file:///root/.ivy2/jars/com.google.code.gson_gson-2.3.jar,file:///root/.ivy2/jars/dk.brics.automaton_automaton-1.11-8.jar,file:///root/.ivy2/jars/joda-time_joda-time-2.9.5.jar,file:///root/.ivy2/jars/org.joda_joda-convert-1.8.1.jar,file:///root/.ivy2/jars/org.tensorflow_libtensorflow-1.15.0.jar,file:///root/.ivy2/jars/org.tensorflow_libtensorflow_jni-1.15.0.jar,file:///root/.ivy2/jars/com.couchbase.client_java-client-2.7.6.jar,file:///root/.ivy2/jars/com.couchbase.client_dcp-client-0.23.0.jar,file:///root/.ivy2/jars/io.reactivex_rxscala_2.11-0.26.5.jar,file:///root/.ivy2/jars/org.apache.logging.log4j_log4j-api-2.2.jar,file:///root/.ivy2/jars/com.couchbase.client_core-io-1.7.6.jar,file:///root/.ivy2/jars/io.reactivex_rxjava-1.3.8.jar,file:///root/.ivy2/jars/io.opentracing_opentracing-api-0.31.0.jar)
(spark.sql.execution.arrow.enabled,true)
(spark.couchbase.nodes,couchbase://remote_ip)
(com.couchbase.connectTimeout,300000)
(spark.jars,/opt/sparkscala_2.11-0.1.jar,https://repo1.maven.org/maven2/com/couchbase/client/java-client/2.7.6/java-client-2.7.6.jar,https://repo1.maven.org/maven2/com/couchbase/client/core-io/1.7.6/core-io-1.7.6.jar,https://repo1.maven.org/maven2/com/couchbase/client/spark-connector_2.11/2.3.0/spark-connector_2.11-2.3.0.jar,https://repo1.maven.org/maven2/io/opentracing/opentracing-api/0.31.0/opentracing-api-0.31.0.jar,https://repo1.maven.org/maven2/io/reactivex/rxjava/1.3.8/rxjava-1.3.8.jar,https://repo1.maven.org/maven2/io/reactivex/rxscala_2.11/0.26.5/rxscala_2.11-0.26.5.jar,https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-2.5.0.jar)
(spark.executor.id,driver)
(spark.driver.port,41651)
(spark.couchbase.bucket.feeds,)
(spark.couchbase.bucket.users,)
(spark.driver.memory,1g)
(spark.serializer,org.apache.spark.serializer.KryoSerializer)
(com.couchbase.username,apps)
(spark.cores.max,1)
(spark.sql.tungsten.enabled,true)
(spark.driver.host,mesos-slave)
(spark.executor.memory,1g)
(spark.couchbase.bucket.action_sink,)
(com.couchbase.password,password)
(spark.master,mesos://zk://remote_ip:2181/mesos)
(com.couchbase.socketConnect,300000)
(spark.mesos.executor.home,/opt/spark/spark-2.4.5-bin-hadoop2.7)
(spark.submit.deployMode,client)
(spark.app.name,tags_extraction_eng)
(spark.app.id,fb88a3ad-d32c-41ae-be67-36517a272bcb-0005)
(spark.ui.showConsoleProgress,true)
(spark.worker.cleanup.enabled,true)
(spark.executor.extraLibraryPath,/sparkscala_2.11-0.1.jar/java-client-2.7.6.jar/core-io-1.7.6.jar/spark-connector_2.11-2.3.0.jar/opentracing-api-0.31.0.jar/rxjava-1.3.8.jar/rxscala_2.11-0.26.5.jar/core-1.1.2.jar/spark-streaming-kafka-0-10_2.11-2.4.5.jar/spark-sql-kafka-0-10_2.11-2.4.5.jar/kafka-clients-2.4.0.jar/kafka_2.11-2.4.1.jar/spark-nlp-assembly-2.5.0.jar/spark-nlp_2.11-2.5.0.jar)

recognize_entities_dl download started this may take some time.
Approximate size to download 159 MB
Download done! Loading the resource.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007f8c09bc2da9, pid=4192, tid=0x00007f8d51343700
#
# JRE version: OpenJDK Runtime Environment (8.0_252-b09) (build 1.8.0_252-8u252-b09-1~16.04-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.252-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtensorflow_framework.so.1+0x744da9]  _GLOBAL__sub_I_loader.cc+0x99
#
# Core dump written. Default location: /var/lib/mesos/slaves/fb88a3ad-d32c-41ae-be67-36517a272bcb-S0/frameworks/fb88a3ad-d32c-41ae-be67-36517a272bcb-0000/executors/ct:1591367792198:0:tags_extraction_eng:/runs/2a41d953-7343-4dd5-a59b-2e253f0cda55/core or core.4192
#
# An error report file with more information is saved as:
# /var/lib/mesos/slaves/fb88a3ad-d32c-41ae-be67-36517a272bcb-S0/frameworks/fb88a3ad-d32c-41ae-be67-36517a272bcb-0000/executors/ct:1591367792198:0:tags_extraction_eng:/runs/2a41d953-7343-4dd5-a59b-2e253f0cda55/hs_err_pid4192.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Executor's Logs:

...
20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 17
20/06/05 14:40:01 INFO Executor: Running task 0.0 in stage 14.0 (TID 17)
20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 26
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_26_piece0 stored as bytes in memory (estimated size 2.2 KB, free 362.9 MB)
20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 26 took 11 ms
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_26 stored as values in memory (estimated size 3.7 KB, free 362.9 MB)
20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/metadata/part-00000:0+408
20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 25
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 23.1 KB, free 362.8 MB)
20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 25 took 25 ms
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 322.8 KB, free 362.5 MB)
20/06/05 14:40:01 INFO Executor: Finished task 0.0 in stage 14.0 (TID 17). 1209 bytes result sent to driver
20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 18
20/06/05 14:40:01 INFO Executor: Running task 0.0 in stage 15.0 (TID 18)
20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 28
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_28_piece0 stored as bytes in memory (estimated size 2.2 KB, free 362.5 MB)
20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 28 took 13 ms
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_28 stored as values in memory (estimated size 3.7 KB, free 362.5 MB)
20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/metadata/part-00000:0+408
20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 27
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_27_piece0 stored as bytes in memory (estimated size 23.1 KB, free 362.5 MB)
20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 27 took 11 ms
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_27 stored as values in memory (estimated size 322.8 KB, free 362.2 MB)
20/06/05 14:40:01 INFO Executor: Finished task 0.0 in stage 15.0 (TID 18). 1166 bytes result sent to driver
20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 19
20/06/05 14:40:01 INFO Executor: Running task 0.0 in stage 16.0 (TID 19)
20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 30
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 2.4 KB, free 362.2 MB)
20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 30 took 11 ms
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_30 stored as values in memory (estimated size 3.9 KB, free 362.2 MB)
20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00005:0+95
20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 29
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_29_piece0 stored as bytes in memory (estimated size 23.1 KB, free 362.1 MB)
20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 29 took 17 ms
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_29 stored as values in memory (estimated size 322.8 KB, free 361.8 MB)
20/06/05 14:40:01 INFO Executor: Finished task 0.0 in stage 16.0 (TID 19). 765 bytes result sent to driver
20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 20
20/06/05 14:40:01 INFO Executor: Running task 0.0 in stage 17.0 (TID 20)
20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 31
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_31_piece0 stored as bytes in memory (estimated size 2.4 KB, free 362.1 MB)
20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 31 took 19 ms
20/06/05 14:40:01 INFO MemoryStore: Block broadcast_31 stored as values in memory (estimated size 3.9 KB, free 362.2 MB)
20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00007:0+95
20/06/05 14:40:01 INFO Executor: Finished task 0.0 in stage 17.0 (TID 20). 765 bytes result sent to driver
20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 21
20/06/05 14:40:01 INFO Executor: Running task 1.0 in stage 17.0 (TID 21)
20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00011:0+2000
20/06/05 14:40:01 INFO Executor: Finished task 1.0 in stage 17.0 (TID 21). 2146 bytes result sent to driver
20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 22
20/06/05 14:40:01 INFO Executor: Running task 2.0 in stage 17.0 (TID 22)
20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00011:2000+831
20/06/05 14:40:01 INFO Executor: Finished task 2.0 in stage 17.0 (TID 22). 808 bytes result sent to driver
20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 23
20/06/05 14:40:01 INFO Executor: Running task 3.0 in stage 17.0 (TID 23)
20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00009:0+95
20/06/05 14:40:01 INFO Executor: Finished task 3.0 in stage 17.0 (TID 23). 765 bytes result sent to driver
I0605 14:40:04.482619  4374 exec.cpp:445] Executor asked to shutdown
I0605 14:40:04.482844  4374 executor.cpp:184] Received SHUTDOWN event
I0605 14:40:04.482877  4374 executor.cpp:800] Shutting down
I0605 14:40:04.482920  4374 executor.cpp:913] Sending SIGTERM to process tree at pid 4382
20/06/05 14:40:04 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver mesos-slave:41651 disassociated! Shutting down.
I0605 14:40:04.489429  4374 executor.cpp:926] Sent SIGTERM to the following process trees:
[ 
-+- 4382 sh -c LD_LIBRARY_PATH="/sparkscala_2.11-0.1.jar/java-client-2.7.6.jar/core-io-1.7.6.jar/spark-connector_2.11-2.3.0.jar/opentracing-api-0.31.0.jar/rxjava-1.3.8.jar/rxscala_2.11-0.26.5.jar/core-1.1.2.jar/spark-streaming-kafka-0-10_2.11-2.4.5.jar/spark-sql-kafka-0-10_2.11-2.4.5.jar/kafka-clients-2.4.0.jar/kafka_2.11-2.4.1.jar/spark-nlp-assembly-2.5.0.jar/spark-nlp_2.11-2.5.0.jar:$LD_LIBRARY_PATH" "/opt/spark/spark-2.4.5-bin-hadoop2.7/./bin/spark-class" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@mesos-slave:41651 --executor-id 0 --cores 1 --app-id fb88a3ad-d32c-41ae-be67-36517a272bcb-0005 --hostname mesos-slave 
 \--- 4383 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark/spark-2.4.5-bin-hadoop2.7/conf/:/opt/spark/spark-2.4.5-bin-hadoop2.7/jars/* -Xmx1024m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@mesos-slave:41651 --executor-id 0 --cores 1 --app-id fb88a3ad-d32c-41ae-be67-36517a272bcb-0005 --hostname mesos-slave 
]
I0605 14:40:04.489470  4374 executor.cpp:930] Scheduling escalation to SIGKILL in 88secs from now
20/06/05 14:40:04 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
20/06/05 14:40:04 INFO DiskBlockManager: Shutdown hook called
20/06/05 14:40:04 INFO CouchbaseConnection: Performing Couchbase SDK Shutdown
20/06/05 14:40:04 INFO ShutdownHookManager: Shutdown hook called
20/06/05 14:40:04 INFO ShutdownHookManager: Deleting directory /var/lib/mesos/slaves/fb88a3ad-d32c-41ae-be67-36517a272bcb-S0/frameworks/fb88a3ad-d32c-41ae-be67-36517a272bcb-0005/executors/0/runs/50383a32-eafb-45cd-ab6b-3be4f5d790a4/spark-e87c68df-00c0-4d18-acc5-684a42cab22b
20/06/05 14:40:04 INFO ConfigurationProvider: Closed bucket feeds
20/06/05 14:40:04 INFO Node: Disconnected from Node remote_ip/datanode1
I0605 14:40:04.540186  4379 executor.cpp:998] Command terminated with signal Terminated (pid: 4382)
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown IoPool: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown kvIoPool: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown viewIoPool: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown queryIoPool: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown searchIoPool: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown Core Scheduler: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown Runtime Metrics Collector: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown Latency Metrics Collector: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown analyticsIoPool: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown Netty: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown Tracer: success 
20/06/05 14:40:04 INFO CoreEnvironment: Shutdown OrphanReporter: success 
I0605 14:40:05.542169  4381 process.cpp:927] Stopped the socket accept loop

Your Environment

Docker environment:

1 Mesos Master Container
1 Mesos Worker Container
1 Chronos Container

Versions:

Spark NLP version: 2.5.0
Apache Spark version: 2.4.5
Java version (java -version): JRE version: OpenJDK Runtime Environment (8.0_252-b09) (build 1.8.0_252-8u252-b09-1~16.04-b09), Java VM: OpenJDK 64-Bit Server VM (25.252-b09 mixed mode linux-amd64 compressed oops)
Docker Container's Operating System and version:

NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial

wont-fix

opened by FedericoF93 25

'JavaPackage' object is not callable when 'PretrainedPipeline('explain_document_ml', 'en')'

TypeError                                 Traceback (most recent call last)
in ()
----> 1 pipline = PretrainedPipeline('explain_document_ml', 'en')

/home/bioxcel/anaconda3/lib/python3.7/site-packages/sparknlp/pretrained.py in __init__(self, name, lang, remote_loc)
     89
     90     def __init__(self, name, lang='en', remote_loc=None):
---> 91         self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
     92         self.light_model = LightPipeline(self.model)
     93

/home/bioxcel/anaconda3/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadPipeline(name, language, remote_loc)
     50     def downloadPipeline(name, language, remote_loc=None):
     51         print(name + " download started this may take some time.")
---> 52         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
     53         if file_size == "-1":
     54             print("Can not find the model to download please check the name!")

/home/bioxcel/anaconda3/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, name, language, remote_loc)
     68         super(_ClearCache, self).__init__("com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.clearCache", name, language, remote_loc)
     69
---> 70
     71 class _GetResourceSize(ExtendedJavaWrapper):
     72     def __init__(self, name, language, remote_loc):

/home/bioxcel/anaconda3/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, java_obj, *args)
      9         super(ExtendedJavaWrapper, self).__init__(java_obj)
     10         self.sc = SparkContext._active_spark_context
---> 11         self._java_obj = self.new_java_obj(java_obj, *args)
     12         self.java_obj = self._java_obj
     13

/home/bioxcel/anaconda3/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
     19
     20     def new_java_obj(self, java_class, *args):
---> 21         return self._new_java_obj(java_class, *args)
     22
     23     def new_java_array(self, pylist, java_class):

/opt/spark-2.4.3-bin-hadoop2.7/python/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
     65             java_obj = getattr(java_obj, name)
     66         java_args = [_py2java(sc, arg) for arg in args]
---> 67         return java_obj(*java_args)
     68
     69     @staticmethod

TypeError: 'JavaPackage' object is not callable
invalid

opened by vasudhajain0 25
Why do you use hadoop-aws 3.2? Spark 2.4 doesn't come with Hadoop 3.2, which makes it very difficult to work with, as we already use hadoop-aws 2.7.4
Description

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce

With hadoop-aws 2.7.3 already installed, the Hadoop 3.2 dependency conflicts with it, as does the AWS SDK.
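
One option Spark itself offers for this kind of clash is spark.jars.excludes, which drops selected groupId:artifactId pairs while resolving spark.jars.packages. The sketch below only builds the two options as a dictionary. The helper name is hypothetical, the artifact names are taken from the dependency listing in another report on this page, and whether Spark NLP then works with the older hadoop-aws already on the cluster is untested here.

```python
def packages_with_excludes(package, excludes):
    """Build Spark options that resolve a package while excluding some artifacts."""
    for coordinate in excludes:
        # spark.jars.excludes expects groupId:artifactId, without a version.
        if coordinate.count(":") != 1:
            raise ValueError(f"expected groupId:artifactId, got: {coordinate}")
    return {
        "spark.jars.packages": package,
        "spark.jars.excludes": ",".join(excludes),
    }


if __name__ == "__main__":
    options = packages_with_excludes(
        "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.0",
        ["org.apache.hadoop:hadoop-aws", "com.amazonaws:aws-java-sdk-core"],
    )
    for key, value in options.items():
        print(key, "=", value)
```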

Context

Your Environment

Spark NLP version:

Apache NLP version:

Java version (java -version):

Setup and installation (Pypi, Conda, Maven, etc.):

Operating System and version:

Link to your project (if any):

question
opened by appunni-dishq 23

Problem with spark-nlp

Hi! I'm using this example to create my own sentiment classifier, but when I execute the code below, I get an error.

use = BertEmbeddings.load('/home/mahdi/workTable/dataset/bert/') \
                    .setInputCols(["document"])\
                    .setOutputCol("sentence_embeddings")\
                    .setPoolingLayer(-2)

I tested it with UniversalSentenceEncoder but got the same error.

The error:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007fac59e78da9, pid=1736, tid=0x00007fad517fb700
#
# JRE version: OpenJDK Runtime Environment (8.0_252-b09) (build 1.8.0_252-8u252-b09-1~18.04-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.252-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtensorflow_framework.so.1+0x744da9]  _GLOBAL__sub_I_loader.cc+0x99
#
# Core dump written. Default location: /home/mahdi/workTable/core or core.1736

At first I used standalone cluster mode with one master and 3 slaves, each with 4G of memory and 4 cores. Then I used one master and one slave, each with 10G of memory and 6 cores. I still got the same error.

My spark initialization:

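import findspark
findspark.init()  # must run before pyspark is imported

import sparknlp
from pyspark import SparkConf
from pyspark.sql import SparkSession, SQLContext

conf = SparkConf()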
conf.set("spark.driver.memory", "19g")
conf.set("spark.cores.max", "16")
conf.set("spark.executor.memory", "9700m")
conf.set("spark.executor.cores", "8")
conf.set("spark.executor.instances", "8")
conf.set("spark.rpc.message.maxSize","1024")
conf.set("spark.driver.extraJavaOptions","-Djava.io.tmpdir=/home/mahdi/workTable/temp/")
conf.set("spark.executor.extraJavaOptions","-Djava.io.tmpdir=/home/mahdi/workTable/temp/")


spark = SparkSession.builder.master("spark://172.18.16.74:7077").appName("Sentiment Analysis").config(conf=conf)\
                            .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.4")\
                            .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

print("Spark version : " ,spark.version)
print("Spark-NLP version : " ,sparknlp.version())
# Spark version :  2.4.5
# Spark-NLP version :  2.5.4
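Separately from the crash, the configuration above requests 8 cores and 9700m per executor, while the second cluster is described as having 6 cores and 10G per worker. In standalone mode an executor is only placed on a worker that has at least that many free cores and that much free memory. A minimal sketch of that check follows; `parse_spark_memory` and `executor_fits` are hypothetical helpers, not Spark APIs, and this mismatch is not claimed to explain the SIGILL.

```python
def parse_spark_memory(value):
    """Convert a Spark-style memory string such as '9700m' or '10g' to MiB."""
    units = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}
    value = value.strip().lower()
    number, suffix = value[:-1], value[-1:]
    if suffix not in units:
        raise ValueError("unsupported memory string: %r" % value)
    return float(number) * units[suffix]


def executor_fits(worker_cores, worker_memory, executor_cores, executor_memory):
    """True if one executor of the requested size fits on one worker."""
    return (
        executor_cores <= worker_cores
        and parse_spark_memory(executor_memory) <= parse_spark_memory(worker_memory)
    )


# Numbers taken from the report: 6-core / 10G worker, 8-core / 9700m executor.
print(executor_fits(6, "10g", 8, "9700m"))
```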

How can I fix it?

Thanks for your help :)

Requires more input Stale

opened by m-developer96 23

Could not initialize class com.johnsnowlabs.util.ConfigHelper$
Receiving an error when trying to load pretrained model from hdfs.

Description

I copied the offline pretrained model files to HDFS. Using them in code, e.g. bert = BertEmbeddings.load(), throws the error "Could not initialize class com.johnsnowlabs.util.ConfigHelper$".

Expected Behavior

It should load the pretrained model from the uncompressed files in HDFS.

Current Behavior

Receiving an error message: Py4JJavaError: An error occurred while calling None.com.johnsnowlabs.nlp.embeddings.BertEmbeddings. : java.lang.NoClassDefFoundError: Could not initialize class com.johnsnowlabs.util.ConfigHelper$

Possible Solution

Reference to the offline model might be wrong OR something needs to be updated in Config.

Steps to Reproduce

Import all Spark NLP libs:

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import *
import sparknlp

Start Spark NLP:

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

Load the pretrained model from the HDFS path:

bert = BertEmbeddings.load("/user/xxx/bert_base_cased_en_2.4.0_2.4_1580579557778") \
    .setInputCols(["document"]) \
    .setOutputCol("bert") \
    .setCaseSensitive(False) \
    .setPoolingLayer(0)

Context

Trying to apply ClassifierDL with word embeddings and sentence embeddings (USE). ClassifierDL is new to me; fixing this issue will enable its use for many different applications.

Your Environment

Spark NLP version sparknlp.version(): 2.4.5

Apache Spark version spark.version: 2.3.2.3.1.0.0-78

Java version java -version: openjdk version "1.8.0_282", OpenJDK Runtime Environment (build 1.8.0_282-b08), OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)

Setup and installation (Pypi, Conda, Maven, etc.): Pyspark

Operating System and version: Hadoop Cluster

Link to your project (if any):
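The environment above reports Spark 2.3.2, while the model directory name ends in 2.4.0_2.4_1580579557778. Assuming the usual naming pattern for pretrained Spark NLP models, model_lang_nlpVersion_sparkVersion_timestamp, the model was published for Spark 2.4. The sketch below compares the two versions; `parse_model_dir` and `major_minor` are hypothetical helpers, and the mismatch is only a possible lead, not a confirmed cause of the error.

```python
def parse_model_dir(name):
    """Split '<model>_<lang>_<nlp version>_<spark version>_<timestamp>'."""
    model_and_lang, nlp_version, spark_version, timestamp = name.rsplit("_", 3)
    model, lang = model_and_lang.rsplit("_", 1)
    return {
        "model": model,
        "lang": lang,
        "nlp_version": nlp_version,
        "spark_version": spark_version,
        "timestamp": timestamp,
    }


def major_minor(version):
    """Return (major, minor) from a dotted version such as '2.3.2.3.1.0.0-78'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


info = parse_model_dir("bert_base_cased_en_2.4.0_2.4_1580579557778")
cluster = major_minor("2.3.2.3.1.0.0-78")  # value of spark.version in the report
if major_minor(info["spark_version"]) != cluster:
    print("model targets Spark", info["spark_version"], "but cluster runs", cluster)
```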

Thank you for the help.
Requires more input Stale
opened by beginneruser2021 22

Encountering java.lang.NullPointerException when displaying Bert transformations

Hello,

My setup is a single laptop running Kubuntu 20.10 (Linux kernel version 5.8.0-55-generic) on an Intel Core i5-7200U CPU (4 cores) with 5.7 GB of RAM available.

On this modest machine, I am trying to learn how to set up a standalone spark cluster and submit a job with PySpark that uses SparkNLP.

I have based my work on https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/blogposts/3.NER_with_BERT.ipynb

I have installed Spark 3.0.2 in this machine in the home directory and have set SPARK_HOME in my environment variables as necessary. Once done, I ran the start-master.sh script from spark's sbin directory and it launched master successfully. Then, I launched a worker on the same machine and it registered with the master with 4 cores and 4.7 GB of RAM. On this setup, I was able to successfully run the PI approximation example from Spark's website.

Next, on the same machine, I created another directory and set up a virtual environment. Pip packages installed in this venv: numpy==1.20.3 py4j==0.10.9 pyspark==3.0.2 spark-nlp==3.1.0 sparknlp==1.0.0

I launched Python 3.8.6 from this virtual environment and ran the following script:

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("spark://rajan-X556URK:7077")\
    .appName("nerexample")\
    .config("spark.driver.memory", "4G")\
    .config("spark.executor.memory", "4G")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.0")\
    .getOrCreate() 

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *

from urllib.request import urlretrieve

urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.train',
           'eng.train')

urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.testa',
           'eng.testa') 

bert_annotator = BertEmbeddings.pretrained('small_bert_L2_128', 'en') \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("bert")\
 .setBatchSize(8)

from sparknlp.training import CoNLL

test_data = CoNLL().readDataset(spark, '/home/w/Assignments/ner/eng.testa')

test_data = bert_annotator.transform(test_data)



test_data.show(3)

Right when I execute the test_data.show() line, I get a NullPointerException.
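Executor stderr files are dominated by INFO lines about fetching jars, which makes the actual failure hard to spot. As a rough aid, the sketch below keeps only WARN/ERROR lines and stack-trace lines from such a log. The regular expression assumes the default Spark log4j timestamp layout (yy/MM/dd HH:mm:ss LEVEL), and `interesting_lines` is a hypothetical helper name.

```python
import re

# Matches lines such as "21/06/11 20:42:09 WARN Utils: ..."
LOG_LINE = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (?P<level>[A-Z]+) ")


def interesting_lines(log_text, levels=("WARN", "ERROR")):
    """Keep log lines at the given levels plus exception and stack-trace lines."""
    kept = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            if match.group("level") in levels:
                kept.append(line)
        elif "Exception" in line or line.lstrip().startswith(("at ", "Caused by:")):
            # Untimestamped lines that look like part of a Java stack trace.
            kept.append(line)
    return kept
```

For example, `interesting_lines(open("stderr").read())` applied to the worker's stderr file would return only the warnings, errors, and stack traces.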

Following is the log from the stderr file of this worker:

Spark Executor Command: "/usr/lib/jvm/java-11-openjdk-amd64/bin/java" "-cp" "/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/conf/:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/jars/*" "-Xmx4096M" "-Dspark.driver.port=34205" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:34205" "--executor-id" "0" "--hostname" "192.168.2.103" "--cores" "4" "--app-id" "app-20210611204208-0009" "--worker-url" "spark://[email protected]:44535"
========================================

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/06/11 20:42:09 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 15317@rajan-X556URK
21/06/11 20:42:09 INFO SignalUtils: Registered signal handler for TERM
21/06/11 20:42:09 INFO SignalUtils: Registered signal handler for HUP
21/06/11 20:42:09 INFO SignalUtils: Registered signal handler for INT
21/06/11 20:42:09 WARN Utils: Your hostname, rajan-X556URK resolves to a loopback address: 127.0.1.1; using 192.168.2.103 instead (on interface wlp3s0)
21/06/11 20:42:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/jars/spark-unsafe_2.12-3.0.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/06/11 20:42:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/06/11 20:42:10 INFO SecurityManager: Changing view acls to: w
21/06/11 20:42:10 INFO SecurityManager: Changing modify acls to: w
21/06/11 20:42:10 INFO SecurityManager: Changing view acls groups to: 
21/06/11 20:42:10 INFO SecurityManager: Changing modify acls groups to: 
21/06/11 20:42:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(w); groups with view permissions: Set(); users  with modify permissions: Set(w); groups with modify permissions: Set()
21/06/11 20:42:10 INFO TransportClientFactory: Successfully created connection to /192.168.2.103:34205 after 95 ms (0 ms spent in bootstraps)
21/06/11 20:42:10 INFO SecurityManager: Changing view acls to: w
21/06/11 20:42:10 INFO SecurityManager: Changing modify acls to: w
21/06/11 20:42:10 INFO SecurityManager: Changing view acls groups to: 
21/06/11 20:42:10 INFO SecurityManager: Changing modify acls groups to: 
21/06/11 20:42:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(w); groups with view permissions: Set(); users  with modify permissions: Set(w); groups with modify permissions: Set()
21/06/11 20:42:10 INFO TransportClientFactory: Successfully created connection to /192.168.2.103:34205 after 3 ms (0 ms spent in bootstraps)
21/06/11 20:42:10 INFO DiskBlockManager: Created local directory at /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/blockmgr-d8c52b91-a0ef-49fb-8712-7116a0410c3b
21/06/11 20:42:11 INFO MemoryStore: MemoryStore started with capacity 2.2 GiB
21/06/11 20:42:11 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://[email protected]:34205
21/06/11 20:42:11 INFO WorkerWatcher: Connecting to worker spark://[email protected]:44535
21/06/11 20:42:11 INFO ResourceUtils: ==============================================================
21/06/11 20:42:11 INFO ResourceUtils: Resources for spark.executor:

21/06/11 20:42:11 INFO ResourceUtils: ==============================================================
21/06/11 20:42:11 INFO TransportClientFactory: Successfully created connection to /192.168.2.103:44535 after 31 ms (0 ms spent in bootstraps)
21/06/11 20:42:11 INFO WorkerWatcher: Successfully connected to spark://[email protected]:44535
21/06/11 20:42:11 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
21/06/11 20:42:11 INFO Executor: Starting executor ID 0 on host 192.168.2.103
21/06/11 20:42:11 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35511.
21/06/11 20:42:11 INFO NettyBlockTransferService: Server created on 192.168.2.103:35511
21/06/11 20:42:11 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/06/11 20:42:11 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 192.168.2.103, 35511, None)
21/06/11 20:42:11 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 192.168.2.103, 35511, None)
21/06/11 20:42:11 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 192.168.2.103, 35511, None)
21/06/11 20:42:11 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar with timestamp 1623424325166
21/06/11 20:42:11 INFO TransportClientFactory: Successfully created connection to /192.168.2.103:34205 after 3 ms (0 ms spent in bootstraps)
21/06/11 20:42:12 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp16731401106283909142.tmp
21/06/11 20:42:12 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-19307619201623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar
21/06/11 20:42:12 INFO Executor: Fetching spark://192.168.2.103:34205/files/net.jcip_jcip-annotations-1.0.jar with timestamp 1623424325166
21/06/11 20:42:12 INFO Utils: Fetching spark://192.168.2.103:34205/files/net.jcip_jcip-annotations-1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp4135900916397329853.tmp
21/06/11 20:42:12 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/1155917211623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.jcip_jcip-annotations-1.0.jar
21/06/11 20:42:12 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.google.code.findbugs_annotations-3.0.1.jar with timestamp 1623424325166
21/06/11 20:42:12 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.google.code.findbugs_annotations-3.0.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp5390838511657707315.tmp
21/06/11 20:42:12 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/10453638051623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_annotations-3.0.1.jar
21/06/11 20:42:12 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar with timestamp 1623424325166
21/06/11 20:42:12 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp12163369454458897115.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-6753754811623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/org.projectlombok_lombok-1.16.8.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/org.projectlombok_lombok-1.16.8.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp546613331331570155.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/15471060871623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.projectlombok_lombok-1.16.8.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.typesafe_config-1.3.0.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.typesafe_config-1.3.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp501578203029232760.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-6243396901623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.typesafe_config-1.3.0.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/net.sf.trove4j_trove4j-3.0.3.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/net.sf.trove4j_trove4j-3.0.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp4294334457124108819.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-9179969801623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.sf.trove4j_trove4j-3.0.3.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/org.json4s_json4s-ext_2.12-3.5.3.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/org.json4s_json4s-ext_2.12-3.5.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp5724117738489913536.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-1785968311623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.json4s_json4s-ext_2.12-3.5.3.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.google.code.findbugs_jsr305-3.0.1.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.google.code.findbugs_jsr305-3.0.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp6507586328711846510.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-19147812741623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_jsr305-3.0.1.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/org.joda_joda-convert-1.8.1.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/org.joda_joda-convert-1.8.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp11192836114627213928.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-18183925021623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.joda_joda-convert-1.8.1.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/dk.brics.automaton_automaton-1.11-8.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/dk.brics.automaton_automaton-1.11-8.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp17414383452524692686.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/18002895341623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./dk.brics.automaton_automaton-1.11-8.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.navigamez_greex-1.0.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.navigamez_greex-1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp1310093016529474953.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/444129991623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.navigamez_greex-1.0.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.google.code.gson_gson-2.3.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.google.code.gson_gson-2.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp16952031653904177164.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-20852710581623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.gson_gson-2.3.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/it.unimi.dsi_fastutil-7.0.12.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/it.unimi.dsi_fastutil-7.0.12.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp5122682618647664079.tmp
21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-5370007131623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./it.unimi.dsi_fastutil-7.0.12.jar
21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar with timestamp 1623424325166
21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp4638237247886531412.tmp
21/06/11 20:42:16 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-3144268511623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar
21/06/11 20:42:16 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.github.universal-automata_liblevenshtein-3.0.0.jar with timestamp 1623424325166
21/06/11 20:42:16 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.github.universal-automata_liblevenshtein-3.0.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp18408146982236201037.tmp
21/06/11 20:42:16 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/19900329611623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.github.universal-automata_liblevenshtein-3.0.0.jar
21/06/11 20:42:16 INFO Executor: Fetching spark://192.168.2.103:34205/files/org.slf4j_slf4j-api-1.7.21.jar with timestamp 1623424325166
21/06/11 20:42:16 INFO Utils: Fetching spark://192.168.2.103:34205/files/org.slf4j_slf4j-api-1.7.21.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp18433314345265653010.tmp
21/06/11 20:42:16 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/13339163381623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.slf4j_slf4j-api-1.7.21.jar
21/06/11 20:42:16 INFO Executor: Fetching spark://192.168.2.103:34205/files/org.rocksdb_rocksdbjni-6.5.3.jar with timestamp 1623424325166
21/06/11 20:42:16 INFO Utils: Fetching spark://192.168.2.103:34205/files/org.rocksdb_rocksdbjni-6.5.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp15154651623340219296.tmp
21/06/11 20:42:16 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/19889744071623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.rocksdb_rocksdbjni-6.5.3.jar
21/06/11 20:42:16 INFO Executor: Fetching spark://192.168.2.103:34205/files/joda-time_joda-time-2.9.5.jar with timestamp 1623424325166
21/06/11 20:42:16 INFO Utils: Fetching spark://192.168.2.103:34205/files/joda-time_joda-time-2.9.5.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp6878914735123238495.tmp
21/06/11 20:42:16 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-7077374021623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./joda-time_joda-time-2.9.5.jar
21/06/11 20:42:16 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.amazonaws_aws-java-sdk-bundle-1.11.603.jar with timestamp 1623424325166
21/06/11 20:42:16 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.amazonaws_aws-java-sdk-bundle-1.11.603.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp11428706676980878857.tmp
21/06/11 20:42:17 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/11445123081623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.amazonaws_aws-java-sdk-bundle-1.11.603.jar
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.google.protobuf_protobuf-java-3.0.0-beta-3.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.google.protobuf_protobuf-java-3.0.0-beta-3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp2249757634022456047.tmp
21/06/11 20:42:17 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-12346780511623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-3.0.0-beta-3.jar
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/org.json4s_json4s-ext_2.12-3.5.3.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/org.json4s_json4s-ext_2.12-3.5.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp4902258414204843486.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/6839329141623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.json4s_json4s-ext_2.12-3.5.3.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.json4s_json4s-ext_2.12-3.5.3.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/dk.brics.automaton_automaton-1.11-8.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/dk.brics.automaton_automaton-1.11-8.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp7723995488492432875.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/6345908611623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./dk.brics.automaton_automaton-1.11-8.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./dk.brics.automaton_automaton-1.11-8.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/net.jcip_jcip-annotations-1.0.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/net.jcip_jcip-annotations-1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp332592907644172826.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/14461652401623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.jcip_jcip-annotations-1.0.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.jcip_jcip-annotations-1.0.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/org.projectlombok_lombok-1.16.8.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/org.projectlombok_lombok-1.16.8.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp1709548010051135733.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/3280036381623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.projectlombok_lombok-1.16.8.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.projectlombok_lombok-1.16.8.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/net.sf.trove4j_trove4j-3.0.3.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/net.sf.trove4j_trove4j-3.0.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp12992547080912692118.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/18958713891623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.sf.trove4j_trove4j-3.0.3.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.sf.trove4j_trove4j-3.0.3.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/org.joda_joda-convert-1.8.1.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/org.joda_joda-convert-1.8.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp16024886356109174200.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-21432645511623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.joda_joda-convert-1.8.1.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.joda_joda-convert-1.8.1.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.github.universal-automata_liblevenshtein-3.0.0.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.github.universal-automata_liblevenshtein-3.0.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp11719577668617794252.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-19939684301623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.github.universal-automata_liblevenshtein-3.0.0.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.github.universal-automata_liblevenshtein-3.0.0.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.navigamez_greex-1.0.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.navigamez_greex-1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp807262545140534729.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/18948526941623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.navigamez_greex-1.0.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.navigamez_greex-1.0.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/joda-time_joda-time-2.9.5.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/joda-time_joda-time-2.9.5.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp12279758982734310357.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-5516510511623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./joda-time_joda-time-2.9.5.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./joda-time_joda-time-2.9.5.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.google.protobuf_protobuf-java-3.0.0-beta-3.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.google.protobuf_protobuf-java-3.0.0-beta-3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp12934127082379220752.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-4898677941623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-3.0.0-beta-3.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-3.0.0-beta-3.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp7551843316349076899.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/17717935161623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.google.code.gson_gson-2.3.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.google.code.gson_gson-2.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp8987546978536014081.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-7546975391623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.gson_gson-2.3.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.gson_gson-2.3.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/it.unimi.dsi_fastutil-7.0.12.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/it.unimi.dsi_fastutil-7.0.12.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp14662829152554853125.tmp
21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-20180996401623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./it.unimi.dsi_fastutil-7.0.12.jar
21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./it.unimi.dsi_fastutil-7.0.12.jar to class loader
21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/org.rocksdb_rocksdbjni-6.5.3.jar with timestamp 1623424325166
21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/org.rocksdb_rocksdbjni-6.5.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp9949668037197273689.tmp
21/06/11 20:42:18 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/5078754801623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.rocksdb_rocksdbjni-6.5.3.jar
21/06/11 20:42:18 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.rocksdb_rocksdbjni-6.5.3.jar to class loader
21/06/11 20:42:18 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar with timestamp 1623424325166
21/06/11 20:42:18 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp9212643548963030178.tmp
21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/694347761623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar
21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar to class loader
21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.google.code.findbugs_jsr305-3.0.1.jar with timestamp 1623424325166
21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.google.code.findbugs_jsr305-3.0.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp6313052555132831521.tmp
21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-11647417711623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_jsr305-3.0.1.jar
21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_jsr305-3.0.1.jar to class loader
21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/org.slf4j_slf4j-api-1.7.21.jar with timestamp 1623424325166
21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/org.slf4j_slf4j-api-1.7.21.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp16533770977053225215.tmp
21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/18776259231623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.slf4j_slf4j-api-1.7.21.jar
21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.slf4j_slf4j-api-1.7.21.jar to class loader
21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.amazonaws_aws-java-sdk-bundle-1.11.603.jar with timestamp 1623424325166
21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.amazonaws_aws-java-sdk-bundle-1.11.603.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp17738135895720612151.tmp
21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/13928342451623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.amazonaws_aws-java-sdk-bundle-1.11.603.jar
21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.amazonaws_aws-java-sdk-bundle-1.11.603.jar to class loader
21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.typesafe_config-1.3.0.jar with timestamp 1623424325166
21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.typesafe_config-1.3.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp15020909164608666214.tmp
21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-4682533391623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.typesafe_config-1.3.0.jar
21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.typesafe_config-1.3.0.jar to class loader
21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar with timestamp 1623424325166
21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp16503352824305074337.tmp
21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-8807534571623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar
21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar to class loader
21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.google.code.findbugs_annotations-3.0.1.jar with timestamp 1623424325166
21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.google.code.findbugs_annotations-3.0.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp1442695003069020584.tmp
21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/12936857421623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_annotations-3.0.1.jar
21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_annotations-3.0.1.jar to class loader
21/06/11 20:42:36 INFO CoarseGrainedExecutorBackend: Got assigned task 0
21/06/11 20:42:36 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/06/11 20:42:36 INFO TorrentBroadcast: Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB)
21/06/11 20:42:36 INFO TransportClientFactory: Successfully created connection to /192.168.2.103:32947 after 4 ms (0 ms spent in bootstraps)
21/06/11 20:42:36 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KiB, free 2.2 GiB)
21/06/11 20:42:36 INFO TorrentBroadcast: Reading broadcast variable 1 took 128 ms
21/06/11 20:42:36 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KiB, free 2.2 GiB)
21/06/11 20:42:37 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/metadata/part-00000:0+443
21/06/11 20:42:37 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
21/06/11 20:42:37 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.6 KiB, free 2.2 GiB)
21/06/11 20:42:37 INFO TorrentBroadcast: Reading broadcast variable 0 took 18 ms
21/06/11 20:42:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 198.4 KiB, free 2.2 GiB)
21/06/11 20:42:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1414 bytes result sent to driver
21/06/11 20:42:37 INFO CoarseGrainedExecutorBackend: Got assigned task 1
21/06/11 20:42:37 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
21/06/11 20:42:37 INFO CoarseGrainedExecutorBackend: Got assigned task 2
21/06/11 20:42:37 INFO CoarseGrainedExecutorBackend: Got assigned task 3
21/06/11 20:42:37 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)
21/06/11 20:42:37 INFO CoarseGrainedExecutorBackend: Got assigned task 4
21/06/11 20:42:37 INFO Executor: Running task 2.0 in stage 1.0 (TID 3)
21/06/11 20:42:37 INFO Executor: Running task 3.0 in stage 1.0 (TID 4)
21/06/11 20:42:37 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
21/06/11 20:42:37 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.4 KiB, free 2.2 GiB)
21/06/11 20:42:37 INFO TorrentBroadcast: Reading broadcast variable 3 took 15 ms
21/06/11 20:42:37 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 4.1 KiB, free 2.2 GiB)
21/06/11 20:42:37 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00005:0+111532
21/06/11 20:42:37 INFO TorrentBroadcast: Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB)
21/06/11 20:42:37 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00004:0+111799
21/06/11 20:42:37 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00009:0+111710
21/06/11 20:42:37 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 23.6 KiB, free 2.2 GiB)
21/06/11 20:42:37 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00003:0+111815
21/06/11 20:42:37 INFO TorrentBroadcast: Reading broadcast variable 2 took 20 ms
21/06/11 20:42:37 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 198.4 KiB, free 2.2 GiB)
21/06/11 20:42:38 INFO Executor: Finished task 3.0 in stage 1.0 (TID 4). 66763 bytes result sent to driver
21/06/11 20:42:38 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2). 66496 bytes result sent to driver
21/06/11 20:42:38 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 66779 bytes result sent to driver
21/06/11 20:42:38 INFO Executor: Finished task 2.0 in stage 1.0 (TID 3). 66674 bytes result sent to driver
21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 5
21/06/11 20:42:38 INFO Executor: Running task 4.0 in stage 1.0 (TID 5)
21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00006:0+111573
21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 6
21/06/11 20:42:38 INFO Executor: Running task 5.0 in stage 1.0 (TID 6)
21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 7
21/06/11 20:42:38 INFO Executor: Running task 6.0 in stage 1.0 (TID 7)
21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00007:0+111394
21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 8
21/06/11 20:42:38 INFO Executor: Running task 7.0 in stage 1.0 (TID 8)
21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00001:0+111321
21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00008:0+111429
21/06/11 20:42:38 INFO Executor: Finished task 7.0 in stage 1.0 (TID 8). 66350 bytes result sent to driver
21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 9
21/06/11 20:42:38 INFO Executor: Running task 8.0 in stage 1.0 (TID 9)
21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00011:0+111491
21/06/11 20:42:38 INFO Executor: Finished task 6.0 in stage 1.0 (TID 7). 66242 bytes result sent to driver
21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 10
21/06/11 20:42:38 INFO Executor: Finished task 4.0 in stage 1.0 (TID 5). 66494 bytes result sent to driver
21/06/11 20:42:38 INFO Executor: Finished task 5.0 in stage 1.0 (TID 6). 66315 bytes result sent to driver
21/06/11 20:42:38 INFO Executor: Running task 9.0 in stage 1.0 (TID 10)
21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 11
21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00010:0+111524
21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 12
21/06/11 20:42:38 INFO Executor: Running task 10.0 in stage 1.0 (TID 11)
21/06/11 20:42:38 INFO Executor: Running task 11.0 in stage 1.0 (TID 12)
21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00000:0+111679
21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00002:0+111457
21/06/11 20:42:38 INFO Executor: Finished task 8.0 in stage 1.0 (TID 9). 66412 bytes result sent to driver
21/06/11 20:42:38 INFO Executor: Finished task 11.0 in stage 1.0 (TID 12). 66600 bytes result sent to driver
21/06/11 20:42:38 INFO Executor: Finished task 9.0 in stage 1.0 (TID 10). 66445 bytes result sent to driver
21/06/11 20:42:38 INFO Executor: Finished task 10.0 in stage 1.0 (TID 11). 66378 bytes result sent to driver
21/06/11 20:42:55 INFO CoarseGrainedExecutorBackend: Got assigned task 13
21/06/11 20:42:55 INFO Executor: Running task 0.0 in stage 2.0 (TID 13)
21/06/11 20:42:55 INFO TorrentBroadcast: Started reading broadcast variable 6 with 1 pieces (estimated total size 4.0 MiB)
21/06/11 20:42:55 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 77.1 KiB, free 2.2 GiB)
21/06/11 20:42:55 INFO TorrentBroadcast: Reading broadcast variable 6 took 14 ms
21/06/11 20:42:55 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 376.3 KiB, free 2.2 GiB)
21/06/11 20:42:58 INFO CodeGenerator: Code generated in 392.898976 ms
21/06/11 20:42:58 INFO CodeGenerator: Code generated in 50.294749 ms
21/06/11 20:42:58 INFO CodeGenerator: Code generated in 85.842712 ms
21/06/11 20:42:58 INFO CodeGenerator: Generated method too long to be JIT compiled: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.serializefromobject_doConsume_0$ is 20081 bytes
21/06/11 20:42:58 INFO CodeGenerator: Code generated in 257.430603 ms
21/06/11 20:42:59 INFO CodeGenerator: Code generated in 166.091418 ms
21/06/11 20:42:59 INFO TorrentBroadcast: Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB)
21/06/11 20:42:59 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 333.3 KiB, free 2.2 GiB)
21/06/11 20:42:59 INFO TorrentBroadcast: Reading broadcast variable 4 took 8 ms
21/06/11 20:42:59 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 3.4 MiB, free 2.2 GiB)
21/06/11 20:42:59 INFO TorrentBroadcast: Started reading broadcast variable 5 with 5 pieces (estimated total size 20.0 MiB)
21/06/11 20:42:59 INFO MemoryStore: Block broadcast_5_piece3 stored as bytes in memory (estimated size 4.0 MiB, free 2.2 GiB)
21/06/11 20:42:59 INFO MemoryStore: Block broadcast_5_piece2 stored as bytes in memory (estimated size 4.0 MiB, free 2.2 GiB)
21/06/11 20:42:59 INFO MemoryStore: Block broadcast_5_piece4 stored as bytes in memory (estimated size 1039.2 KiB, free 2.2 GiB)
21/06/11 20:42:59 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 4.0 MiB, free 2.2 GiB)
21/06/11 20:42:59 INFO MemoryStore: Block broadcast_5_piece1 stored as bytes in memory (estimated size 4.0 MiB, free 2.2 GiB)
21/06/11 20:42:59 INFO TorrentBroadcast: Reading broadcast variable 5 took 126 ms
21/06/11 20:43:00 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 17.5 MiB, free 2.2 GiB)
21/06/11 20:43:00 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 13)
java.lang.NullPointerException
	at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper.getTFHubSession(TensorflowWrapper.scala:109)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.tag(TensorflowBert.scala:90)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.$anonfun$calculateEmbeddings$1(TensorflowBert.scala:223)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator.toStream(Iterator.scala:1415)
	at scala.collection.Iterator.toStream$(Iterator.scala:1414)
	at scala.collection.AbstractIterator.toStream(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:303)
	at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:303)
	at scala.collection.AbstractIterator.toSeq(Iterator.scala:1429)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.calculateEmbeddings(TensorflowBert.scala:221)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.$anonfun$batchAnnotate$2(BertEmbeddings.scala:237)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.batchAnnotate(BertEmbeddings.scala:229)
	at com.johnsnowlabs.nlp.HasBatchedAnnotate.$anonfun$batchProcess$1(HasBatchedAnnotate.scala:41)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
21/06/11 20:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 14
21/06/11 20:43:00 INFO Executor: Running task 0.1 in stage 2.0 (TID 14)
21/06/11 20:43:00 ERROR Executor: Exception in task 0.1 in stage 2.0 (TID 14)
java.lang.NullPointerException
	at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper.getTFHubSession(TensorflowWrapper.scala:109)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.tag(TensorflowBert.scala:90)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.$anonfun$calculateEmbeddings$1(TensorflowBert.scala:223)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator.toStream(Iterator.scala:1415)
	at scala.collection.Iterator.toStream$(Iterator.scala:1414)
	at scala.collection.AbstractIterator.toStream(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:303)
	at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:303)
	at scala.collection.AbstractIterator.toSeq(Iterator.scala:1429)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.calculateEmbeddings(TensorflowBert.scala:221)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.$anonfun$batchAnnotate$2(BertEmbeddings.scala:237)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.batchAnnotate(BertEmbeddings.scala:229)
	at com.johnsnowlabs.nlp.HasBatchedAnnotate.$anonfun$batchProcess$1(HasBatchedAnnotate.scala:41)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
21/06/11 20:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 15
21/06/11 20:43:00 INFO Executor: Running task 0.2 in stage 2.0 (TID 15)
21/06/11 20:43:00 ERROR Executor: Exception in task 0.2 in stage 2.0 (TID 15)
java.lang.NullPointerException
	at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper.getTFHubSession(TensorflowWrapper.scala:109)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.tag(TensorflowBert.scala:90)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.$anonfun$calculateEmbeddings$1(TensorflowBert.scala:223)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator.toStream(Iterator.scala:1415)
	at scala.collection.Iterator.toStream$(Iterator.scala:1414)
	at scala.collection.AbstractIterator.toStream(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:303)
	at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:303)
	at scala.collection.AbstractIterator.toSeq(Iterator.scala:1429)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.calculateEmbeddings(TensorflowBert.scala:221)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.$anonfun$batchAnnotate$2(BertEmbeddings.scala:237)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.batchAnnotate(BertEmbeddings.scala:229)
	at com.johnsnowlabs.nlp.HasBatchedAnnotate.$anonfun$batchProcess$1(HasBatchedAnnotate.scala:41)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
21/06/11 20:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 16
21/06/11 20:43:00 INFO Executor: Running task 0.3 in stage 2.0 (TID 16)
21/06/11 20:43:01 ERROR Executor: Exception in task 0.3 in stage 2.0 (TID 16)
java.lang.NullPointerException
	at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper.getTFHubSession(TensorflowWrapper.scala:109)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.tag(TensorflowBert.scala:90)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.$anonfun$calculateEmbeddings$1(TensorflowBert.scala:223)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator.toStream(Iterator.scala:1415)
	at scala.collection.Iterator.toStream$(Iterator.scala:1414)
	at scala.collection.AbstractIterator.toStream(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:303)
	at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:303)
	at scala.collection.AbstractIterator.toSeq(Iterator.scala:1429)
	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.calculateEmbeddings(TensorflowBert.scala:221)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.$anonfun$batchAnnotate$2(BertEmbeddings.scala:237)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.batchAnnotate(BertEmbeddings.scala:229)
	at com.johnsnowlabs.nlp.HasBatchedAnnotate.$anonfun$batchProcess$1(HasBatchedAnnotate.scala:41)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
21/06/11 20:43:04 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
21/06/11 20:43:04 INFO MemoryStore: MemoryStore cleared
21/06/11 20:43:04 ERROR CoarseGrainedExecutorBackend: RE

I am a novice and this is probably a trivial issue, but I am raising it nonetheless since I couldn't find a solution anywhere.

Thanks!

bug-fix fixed-next-release

opened by havellay 22

Using Fat Jars behind company's firewall not viable.
Description

I have started this conversation:

https://spark-nlp.slack.com/archives/CA118BWRM/p1617225602087300

and, based on the response, I tried the Fat Jars on my work laptop. With the Fat Jars I did get past the session start-up step, but sentence detection then failed, and the behavior differs considerably between spark-nlp 2.7.x and 3.0.x, as detailed below:

1.1. On Spark NLP version 2.7.5, I got a timeout when the company's VPN was enabled (on my work macOS laptop):

spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-2.7.5.jar")\
    .getOrCreate()
spark

Apache Spark version: 2.4.4
Spark NLP version 2.7.5
sentence_detector_dl download started this may take some time.

Py4JJavaError                             Traceback (most recent call last) in       1 sentencerDL = SentenceDetectorDLModel
----> 2     .pretrained("sentence_detector_dl", "en")
      3     .setInputCols(["document"])
      4     .setOutputCol("sentences")       5 ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)    3095     def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):    3096         from sparknlp.pretrained import ResourceDownloader -> 3097         return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)    3098    3099 ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)      30     def downloadModel(reader, name, language, remote_loc=None, j_dwn='PythonResourceDownloader'):      31         print(name + " download started this may take some time.") ---> 32         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()      33         if file_size == "-1":      34             print("Can not find the model to download please check the name!") ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in init(self, name, language, remote_loc)     190     def init(self, name, language, remote_loc):     191         super(_GetResourceSize, self).init( --> 192             "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)     193     194 ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in init(self, java_obj, *args)     127         super(ExtendedJavaWrapper, self).init(java_obj)     128         self.sc = SparkContext._active_spark_context --> 129         self._java_obj = self.new_java_obj(java_obj, *args)     130         self.java_obj = self._java_obj     131 ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)     137     138     def new_java_obj(self, java_class, *args): --> 139         return self._new_java_obj(java_class, *args)     140     141     def new_java_array(self, 
pylist, java_class): ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)      65             java_obj = getattr(java_obj, name)      66         java_args = [_py2java(sc, arg) for arg in args] ---> 67         return java_obj(*java_args)      68      69     @staticmethod ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/py4j/java_gateway.py in call(self, *args)    1255         answer = self.gateway_client.send_command(command)    1256         return_value = get_return_value( -> 1257             answer, self.gateway_client, self.target_id, self.name)    1258    1259         for temp_arg in temp_args: ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)      61     def deco(*a, **kw):      62         try: ---> 63             return f(*a, **kw)      64         except py4j.protocol.Py4JJavaError as e:      65             s = e.java_exception.toString() ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)     326                 raise Py4JJavaError(     327                     "An error occurred while calling {0}{1}{2}.\n". --> 328                     format(target_id, ".", name), value)     329             else:     330                 raise Py4JError( Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize. 
: com.amazonawsShadedAmazonClientException: Unable to execute HTTP request: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
	at com.amazonawsShadedhttp.AmazonHttpClient.executeHelper(AmazonHttpClient.java:454)
	at com.amazonawsShadedhttp.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonawsShadedservices.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonawsShadedservices.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
	at com.amazonawsShadedservices.s3.AmazonS3Client.getObject(AmazonS3Client.java:984)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:401)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.httpShadedconn.ConnectTimeoutException: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
	at org.apache.httpShadedconn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:551)
	at org.apache.httpShadedimpl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
	at org.apache.httpShadedimpl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
	at org.apache.httpShadedimpl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:641)
	at org.apache.httpShadedimpl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:480)
	at org.apache.httpShadedimpl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
	at org.apache.httpShadedimpl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
	at com.amazonawsShadedhttp.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
	... 21 more

1.2. However, once I disable the company's VPN, the above call to SentenceDetectorDLModel works!

2.1. Using Spark NLP version 3.0.1 I get a NullPointerException back:

spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")\
    .getOrCreate()
spark

Apache Spark version: 3.1.1
Spark NLP version 3.0.1

sentence_detector_dl download started this may take some time.

Py4JJavaError                             Traceback (most recent call last) in       1 sentencerDL = SentenceDetectorDLModel
----> 2     .pretrained("sentence_detector_dl", "en")
      3     .setInputCols(["document"])
      4     .setOutputCol("sentences")       5 ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)    3107     def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):    3108         from sparknlp.pretrained import ResourceDownloader -> 3109         return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)    3110    3111 ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)      30     def downloadModel(reader, name, language, remote_loc=None, j_dwn='PythonResourceDownloader'):      31         print(name + " download started this may take some time.") ---> 32         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()      33         if file_size == "-1":      34             print("Can not find the model to download please check the name!") ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in init(self, name, language, remote_loc)     190     def init(self, name, language, remote_loc):     191         super(_GetResourceSize, self).init( --> 192             "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)     193     194 ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in init(self, java_obj, *args)     127         super(ExtendedJavaWrapper, self).init(java_obj)     128         self.sc = SparkContext._active_spark_context --> 129         self._java_obj = self.new_java_obj(java_obj, *args)     130         self.java_obj = self._java_obj     131 ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)     137     138     def new_java_obj(self, java_class, *args): --> 139         return self._new_java_obj(java_class, *args)     140     141     def new_java_array(self, pylist, 
java_class): ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)      64             java_obj = getattr(java_obj, name)      65         java_args = [_py2java(sc, arg) for arg in args] ---> 66         return java_obj(*java_args)      67      68     @staticmethod ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py in call(self, *args)    1303         answer = self.gateway_client.send_command(command)    1304         return_value = get_return_value( -> 1305             answer, self.gateway_client, self.target_id, self.name)    1306    1307         for temp_arg in temp_args: ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)     109     def deco(*a, **kw):     110         try: --> 111             return f(*a, **kw)     112         except py4j.protocol.Py4JJavaError as e:     113            converted = convert_exception(e.java_exception) ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)     326                 raise Py4JJavaError(     327                     "An error occurred while calling {0}{1}{2}.\n". --> 328                     format(target_id, ".", name), value)     329             else:     330                 raise Py4JError( Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize. 
: java.lang.NullPointerException
	at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsernameEnvironment(ClientConfiguration.java:874)
	at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsername(ClientConfiguration.java:902)
	at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.getProxyUsername(HttpClientSettings.java:90)
	at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.isAuthenticatedProxy(HttpClientSettings.java:182)
	at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.addProxyConfig(ApacheHttpClientFactory.java:96)
	at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:75)
	at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
	at com.amazonaws.ShadedByJSLhttp.AmazonHttpClient.<init>(AmazonHttpClient.java:324)
	at com.amazonaws.ShadedByJSLhttp.AmazonHttpClient.<init>(AmazonHttpClient.java:308)
	at com.amazonaws.ShadedByJSLAmazonWebServiceClient.<init>(AmazonWebServiceClient.java:229)
	at com.amazonaws.ShadedByJSLAmazonWebServiceClient.<init>(AmazonWebServiceClient.java:181)
	at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:617)
	at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:597)
	at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:575)
	at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:542)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client$lzycompute(S3ResourceDownloader.scala:45)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client(S3ResourceDownloader.scala:36)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:401)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

2.2. If I disable the company's VPN, I get the same NullPointerException as in 2.1 above.
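
The first frames of the trace in 2.1 sit in the shaded AWS client's proxy lookup (getProxyUsernameEnvironment), so it may be worth checking which proxy-related environment variables the Python process, and hence the JVM it launches, actually sees. The following is a minimal sketch and not part of Spark NLP: the helper name describe_proxy_env and the list of variable names are assumptions, and the trace does not confirm that the client reads exactly these variables.

```python
import os
from urllib.parse import urlsplit

# Conventional proxy variable names (an assumption; the library may read others).
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy")


def describe_proxy_env(environ=None):
    """Summarize proxy variables: host, port, and whether a username is embedded."""
    environ = os.environ if environ is None else environ
    report = {}
    for name in PROXY_VARS:
        value = environ.get(name)
        if not value:
            continue  # variable unset or empty
        # Accept values written without a scheme, e.g. "proxy.example.com:8080".
        parts = urlsplit(value if "://" in value else "http://" + value)
        report[name] = {
            "host": parts.hostname,
            "port": parts.port,
            "has_username": parts.username is not None,
        }
    return report
```

Calling describe_proxy_env() in the notebook before creating the session shows whether a proxy URL is set at all and whether it carries credentials, which is useful information to attach to a bug report like this one.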

Expected Behavior

I would like to use your code behind the company's firewall and, more importantly, from AWS SageMaker. I test it on my work laptop first, so I would like it to work there as well.

Current Behavior

It is not working. I got a temporary healthcare license, which expires in a couple of days, and so far I have not been able to run any of your code behind the company's firewall. After setting up the spark-nlp session with the Fat Jars, loading a pretrained model such as the following fails:

sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

Possible Solution

I like the idea of using Fat Jars, but they need to be functional.

Steps to Reproduce

Tested on my work laptop (macOS Catalina, latest version) using the installation instructions at https://nlp.johnsnowlabs.com/docs/en/install#python for both:

$ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
$ pip install spark-nlp==3.0.1 pyspark==3.1.1
$ pip install jupyter
$ jupyter notebook

and

$ java -version
$ conda create -n spark-nlp python=3.7 -y
$ conda activate spark-nlp
$ pip install spark-nlp==2.7.5 pyspark==2.4.4
$ pip install jupyter
$ jupyter notebook

I pretty much followed the code from: https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb#scrollTo=KvNuyGXpD7Nt

but using the Fat Jars instead:

from pyspark.sql import SparkSession

# Start a local session that loads Spark NLP from the Fat Jar on disk.
spark = (
    SparkSession.builder
    .appName("Spark NLP")
    .master("local[4]")
    .config("spark.driver.memory","16G")
    .config("spark.driver.maxResultSize", "0")
    .config("spark.kryoserializer.buffer.max", "2000M")
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")
    .getOrCreate()
)

and the moment I hit this code:

from sparknlp.annotator import SentenceDetectorDLModel

sentencerDL = (
    SentenceDetectorDLModel
    .pretrained("sentence_detector_dl", "en")
    .setInputCols(["document"])
    .setOutputCol("sentences")
)

I get the above errors (NullPointerException for spark-nlp 3.0.x and timing out for spark-nlp 2.7.x)
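
Since the 2.7.x failure is a connect timeout to auxdata.johnsnowlabs.com.s3.amazonaws.com:443, a quick way to separate a network problem from a library problem is to test plain TCP reachability of that host with and without the VPN. This is a minimal sketch using only the Python standard library; the function name can_connect and the 5-second default timeout are illustrative choices, not anything Spark NLP provides.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

For example, can_connect("auxdata.johnsnowlabs.com.s3.amazonaws.com") returning False with the VPN on and True with it off would point at the firewall rather than at the Fat Jar. Note that this only checks a direct TCP connection; it says nothing about access through an HTTP proxy.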

Context

Your Environment

Spark NLP version sparknlp.version(): Spark NLP version 3.0.1

Apache Spark version spark.version: Apache Spark version: 3.1.1

Java version java -version: openjdk version "1.8.0_282" OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00) OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)

Conda latest release.

Operating System and version: macOS Catalina, latest release.
opened by Octavian-act 22

Tensorflow lib core dumped

When I try to use a pretrained model, the JVM crashes with a core dump. The error is below.

2020-08-05 14:35:59 INFO  HadoopRDD:54 - Input split: hdfs://namenode:9000/models/recognize_entities_dl/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00011:0+2831
2020-08-05 14:35:59 INFO  Executor:54 - Finished task 4.0 in stage 16.0 (TID 32). 765 bytes result sent to driver
2020-08-05 14:35:59 INFO  TaskSetManager:54 - Finished task 4.0 in stage 16.0 (TID 32) in 38 ms on localhost (executor driver) (5/7)
2020-08-05 14:35:59 INFO  Executor:54 - Finished task 5.0 in stage 16.0 (TID 33). 765 bytes result sent to driver
2020-08-05 14:35:59 INFO  TaskSetManager:54 - Finished task 5.0 in stage 16.0 (TID 33) in 47 ms on localhost (executor driver) (6/7)
2020-08-05 14:35:59 INFO  Executor:54 - Finished task 6.0 in stage 16.0 (TID 34). 2146 bytes result sent to driver
2020-08-05 14:35:59 INFO  TaskSetManager:54 - Finished task 6.0 in stage 16.0 (TID 34) in 55 ms on localhost (executor driver) (7/7)
2020-08-05 14:35:59 INFO  TaskSchedulerImpl:54 - Removed TaskSet 16.0, whose tasks have all completed, from pool 
2020-08-05 14:35:59 INFO  DAGScheduler:54 - ResultStage 16 (first at Feature.scala:120) finished in 0.110 s
2020-08-05 14:35:59 INFO  DAGScheduler:54 - Job 16 finished: first at Feature.scala:120, took 0.119676 s
2020-08-05 14:35:59 INFO  MemoryStore:54 - Block broadcast_31 stored as values in memory (estimated size 8.4 KB, free 361.2 MB)
2020-08-05 14:35:59 INFO  MemoryStore:54 - Block broadcast_31_piece0 stored as bytes in memory (estimated size 440.0 B, free 361.2 MB)
2020-08-05 14:35:59 INFO  BlockManagerInfo:54 - Added broadcast_31_piece0 in memory on 82a79ae5305b:45455 (size: 440.0 B, free: 365.8 MB)
2020-08-05 14:35:59 INFO  SparkContext:54 - Created broadcast 31 from broadcast at Feature.scala:87
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007f2dae59ada9, pid=846, tid=0x00007f2e5dad5700
#
# JRE version: OpenJDK Runtime Environment (8.0_171-b11) (build 1.8.0_171-8u171-b11-1~bpo8+1-b11)
# Java VM: OpenJDK 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtensorflow_framework.so.1+0x744da9]  _GLOBAL__sub_I_loader.cc+0x99
#
# Core dump written. Default location: //core or core.846
#
# An error report file with more information is saved as:
# //hs_err_pid846.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

Steps to Reproduce

Clone the repo https://github.com/miloradtrninic/entity/
Run docker compose from the cloned directory
Download recognize_entities_dl for offline usage (recognize_entities_dl_en_2.4.3_2.4_1584626752821)
Unzip the model on the local computer
docker exec namenode mkdir models
docker cp recognize_entities_dl/ namenode:/models/recognize_entities_dl
docker exec namenode hdfs dfs -mkdir /models
docker exec namenode hdfs dfs -put /models/recognize_entities_dl/ /models/
Run "./submit.sh b=1 d=1 e=1 a=/". It gives 1 GB to the driver and the executors and builds the project with sbt assembly.

Context

I am getting a core dump on a simple execution of the Spark NLP framework.

It seems a lot like #923, but I think I have provided a reproducible environment.

This issue, along with #985, is blocking me completely from using the library and proceeding with my master's thesis. Until it is fixed, can you provide some Docker images that you know it works on? I have this same issue when I use offline models for the spark-nlp starter project in this environment.
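
A SIGILL raised inside libtensorflow_framework.so means the native library executed an instruction the CPU rejected. One common explanation, which is an assumption here and not something the report confirms, is that the CPU (or the virtualized CPU the container sees) lacks an instruction set such as AVX that the binary was built to use. The sketch below parses /proc/cpuinfo-style text on x86 Linux and lists which of a few instruction sets are absent; the function names and the default list of features are illustrative only.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 Linux lists features on lines whose key is exactly "flags".
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def missing_features(cpuinfo_text, wanted=("avx", "avx2", "fma", "sse4_2")):
    """Return the entries of `wanted` that do not appear among the CPU flags."""
    present = cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in present]
```

Inside the container, something like missing_features(open("/proc/cpuinfo").read()) gives a list to include in the issue; an empty list would suggest looking elsewhere for the cause.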

Your Environment

Spark version: 2.4.0
Spark NLP version: 2.4.5
Java version (java -version): 1.8
Setup and installation (Pypi, Conda, Maven, etc.): SBT
Operating System and version: Linux
Link to your project (if any): https://github.com/miloradtrninic/entity/

wont-fix

opened by miloradtrninic 18

SPARKNLP-713 Modifies Default Values GraphExtraction
Description

Modifies default values of explodeEntities and mergeEntities parameters

Motivation and Context

Setting these parameters to true by default makes this annotator produce an output, so users do not conclude that it does not work.

How Has This Been Tested?

Screenshots (if appropriate):

Local Tests

Google Colab notebook

Types of changes

[ ] Bug fix (non-breaking change which fixes an issue)

[x] Code improvements with no or little impact

[ ] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

[x] My code follows the code style of this project.

[ ] My change requires a change to the documentation.

[ ] I have updated the documentation accordingly.

[ ] I have read the CONTRIBUTING page.

[ ] I have added tests to cover my changes.

[x] All new and existing tests passed.
opened by danilojsl 0
SPARKNLP-607: Implement HubertForCTC
Description

This PR adds an Annotator to load HubertForCTC models.

Motivation and Context

With more speech-to-text models coming out, we want to support a wider range of models.

How Has This Been Tested?

Added new tests for the annotator on the Python and Scala sides.

Types of changes

[ ] Bug fix (non-breaking change which fixes an issue)

[ ] Code improvements with no or little impact

[x] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to change)
opened by DevinTDHa 0
Need guidance to finetune BertSentenceEmbedding using domain specific pair of sentences

This is not a proper feature request; rather, I need guidance on building a customized model with BertSentenceEmbedding on top of a pretrained model, for example small_bert_L2_128. I will use a domain-specific dataset to fine-tune that model. Please share the recommended approach from a Spark NLP perspective.
Feature request

opened by srimantacse 0
Relocating public examples back to the main repository
We are relocating all examples related to the public Spark NLP back to the example directory. The reasons for this decision:

It is reasonable to have some examples under the example directory like many other libraries

The public examples are abandoned in the spark-nlp-workshop and not maintained by any specific team

The spark-nlp-workshop has become extremely hard to navigate. It's not easy for a new user to know where to start and I don't see any sign it will get any better

Having all our examples in the main repository will allow us to keep them all compatible with each release (versioning them as well via tags)

This also encourages us to have more examples for different languages, as the people maintaining the workshop mostly know Python

documentation new-feature DON'T MERGE
opened by maziyarpanahi 0
Spark NLP 427 release candidate
https://github.com/JohnSnowLabs/spark-nlp/pull/13280

https://github.com/JohnSnowLabs/spark-nlp/pull/13282

https://github.com/JohnSnowLabs/spark-nlp/pull/13283

https://github.com/JohnSnowLabs/spark-nlp/pull/13284

enhancement documentation bug-fix models_hub DON'T MERGE
opened by maziyarpanahi 0

Releases(4.2.6)

4.2.6(Dec 21, 2022)
:star: Improvements

Updating Spark & PySpark dependencies from 3.2.1 to 3.2.3 in provided scripts and in all the documentation

:bug: Bug Fixes

Fix the broken TypedDependencyParserApproach and TypedDependencyParserModel annotators used in Python (this bug was introduced in 4.2.5 release)

Fix the broken Python API documentation

:book: Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

    # PyPI
    pip install spark-nlp==4.2.6

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.6
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.6

GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.6
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.6

M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.6
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.6

AArch64

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.6
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.6

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp_2.12</artifactId>
      <version>4.2.6</version>
    </dependency>

spark-nlp-gpu:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-gpu_2.12</artifactId>
      <version>4.2.6</version>
    </dependency>

spark-nlp-m1:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-m1_2.12</artifactId>
      <version>4.2.6</version>
    </dependency>

spark-nlp-aarch64:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-aarch64_2.12</artifactId>
      <version>4.2.6</version>
    </dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.6.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.6.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.6.jar

AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.6.jar
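The package coordinates and FAT JAR URLs listed above follow a regular naming pattern, so they can be assembled programmatically, for instance when scripting cluster setup. The following is a minimal sketch in plain Python; the helper names (`maven_coordinate`, `fat_jar_url`) and the `FLAVORS` table are hypothetical, and the pattern is inferred only from the strings in this release note, so it may not hold for other releases.

```python
# Hypothetical helpers: they only reproduce the naming pattern visible in
# the installation notes above and are not part of Spark NLP itself.
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_VERSION = "2.12"
JAR_BASE_URL = "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars"

# Build flavor -> artifact base name, as listed in the release notes.
FLAVORS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "m1": "spark-nlp-m1",
    "aarch64": "spark-nlp-aarch64",
}


def maven_coordinate(flavor: str, version: str) -> str:
    """Return the value to pass to `--packages` for the given flavor."""
    return f"{GROUP_ID}:{FLAVORS[flavor]}_{SCALA_VERSION}:{version}"


def fat_jar_url(flavor: str, version: str) -> str:
    """Return the FAT JAR download URL for the given flavor."""
    return f"{JAR_BASE_URL}/{FLAVORS[flavor]}-assembly-{version}.jar"


print(maven_coordinate("gpu", "4.2.6"))
print(fat_jar_url("cpu", "4.2.6"))
```

An unknown flavor raises `KeyError`, which seems preferable here to silently producing a coordinate that does not exist.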

What's Changed

Contributors

@gadde5300 @diatrambitas @Cabir40 @josejuanmartinez @danilojsl @jsl-builder @DevinTDHa @maziyarpanahi @dcecchini @agsfer

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.5...4.2.6
4.2.5(Dec 16, 2022)
:loudspeaker: Overview

Spark NLP 4.2.5 🚀 comes with a new CamemBERT for sequence classification annotator (multi-class & multi-label), new pipeline validation for LightPipeline in Python, 26 updated notebooks that use the latest TensorFlow and Transformers libraries, support for the new Databricks 11.3 runtime, support for the new EMR versions 6.8 and 6.9 (only EMR versions with Spark 3.3), 400+ state-of-the-art multi-lingual pretrained models, and bug fixes.

Do not forget to visit Models Hub with 11700+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉

:star: New Features & improvements

NEW: Introducing CamemBertForSequenceClassification annotator in Spark NLP 🚀. CamemBertForSequenceClassification can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using CamembertForSequenceClassification for PyTorch or TFCamembertForSequenceClassification for TensorFlow in HuggingFace 🤗
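To make the phrase "a linear layer on top of the pooled output" concrete, here is a minimal sketch in plain Python. It is not Spark NLP or HuggingFace code: the pooled vector, weights, and bias are made-up numbers, the function names are hypothetical, and the 0.5 threshold for the multi-label case is an assumption chosen for illustration.

```python
import math


def linear_head(pooled, weights, bias):
    """Compute one logit per label: logits[k] = dot(weights[k], pooled) + bias[k]."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Multi-class: turn logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Multi-label: score each label independently in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Made-up pooled sentence vector (3 dimensions) and head parameters (2 labels).
pooled = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.5],
           [0.0, 1.0, -0.5]]
bias = [0.0, 0.1]

logits = linear_head(pooled, weights, bias)

# Multi-class: pick the single most probable label.
probs = softmax(logits)
predicted_class = probs.index(max(probs))

# Multi-label: keep every label whose independent score passes the threshold.
active_labels = [score > 0.5 for score in sigmoid(logits)]

print(predicted_class)  # 0
print(active_labels)    # [True, False]
```

The same logits feed both modes; only the final activation and the decision rule differ.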

NEW: Add AnnotatorType validation in Spark NLP LightPipeline. Currently, a misconfiguration of inputCols in an annotator raises an exception when the pipeline's transform method is used, but LightPipeline only outputs empty values. Because this behavior can confuse users, this change introduces a validation that raises an exception in LightPipeline too.

Add outputAnnotatorType for all annotators in Python

Add inputAnnotatorTypes and outputAnnotatorType requirement validation for all subclasses derived from AnnotatorApproach and AnnotatorModel

Adding AnnotatorType validation in LightPipeline
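The validation described above can be pictured as a walk over the pipeline stages that checks each stage's required annotator types against what earlier stages produce. The sketch below is plain Python and does not use the Spark NLP API; the stage dictionaries, the `validate_stages` function, and the error messages are hypothetical, and the real implementation may differ.

```python
def validate_stages(stages):
    """Raise ValueError if a stage's input columns do not supply the
    annotator types that the stage requires."""
    available = {}  # output column name -> annotator type produced so far
    for stage in stages:
        unknown = [col for col in stage["input_cols"] if col not in available]
        if unknown:
            raise ValueError(f"{stage['name']}: unknown input columns {unknown}")
        provided = [available[col] for col in stage["input_cols"]]
        for required in stage["input_types"]:
            if required not in provided:
                raise ValueError(
                    f"{stage['name']}: missing input of type '{required}', "
                    f"got {provided}"
                )
        available[stage["output_col"]] = stage["output_type"]


def stage(name, input_cols, input_types, output_col, output_type):
    """Small constructor for the hypothetical stage descriptions."""
    return {"name": name, "input_cols": input_cols, "input_types": input_types,
            "output_col": output_col, "output_type": output_type}


good_stages = [
    stage("document_assembler", [], [], "document", "document"),
    stage("tokenizer", ["document"], ["document"], "token", "token"),
    stage("embeddings", ["document", "token"], ["document", "token"],
          "embeddings", "word_embeddings"),
]

# Same pipeline, but the last stage lists "document" twice instead of "token".
bad_stages = good_stages[:2] + [
    stage("embeddings", ["document", "document"], ["document", "token"],
          "embeddings", "word_embeddings"),
]

validate_stages(good_stages)  # passes silently
try:
    validate_stages(bad_stages)
except ValueError as error:
    print(error)
```

Failing early with a message that names the stage and the missing type is the point of the change: the misconfiguration surfaces as an error instead of as empty output.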

NEW: Migrate 26 notebooks for importing external Transformer models into Spark NLP. These notebooks now use the latest TensorFlow 2.11.0 and HuggingFace 4.25.1 releases. They also set TF signatures with explicit input data types to guarantee model sanity once imported into Spark NLP

Add validation for the number and type of columns set in the TFNerDLGraphBuilder annotator, to avoid incorrect column definitions when using Spark NLP annotators in Python

Add more details to Alphabet error message in EntityRuler annotator to better guide users

Add instructions on how to resolve RocksDB incompatibilities when using Spark NLP with an M1 machine

Welcoming new Databricks runtimes support

11.3

11.3 ML

11.3 GPU

Welcoming new EMR versions support

6.8.0

6.9.0

Refactor and implement better error handling in ResourceDownloader. This change removes getObjectFromS3, allowing the AWS SDK to raise the corresponding error. In addition, it refactors ResourceDownloader to reflect the intention of each credential type on the downloader

Implement full build and test of all unit tests based on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x major releases

Upgrade sbt-assembly to 1.2.0, which comes with lots of performance improvements. This benefits those who are trying to package Spark NLP as a Fat JAR

Update sbt to 1.8.0 with improvements and bug fixes, but mostly for CVE fixes:

Updates to Coursier 2.1.0-RC1 to address https://github.com/advisories/GHSA-wv7w-rj2x-556x

Updates to Ivy 2.3.0-sbt-a8f9eb5bf09d0539ea3658a2c2d4e09755b5133e to address https://github.com/advisories/GHSA-wv7w-rj2x-556x

Use the new withIncludeScala in assemblyOption instead of value

:bug: Bug Fixes

Fix an issue with the BigTextMatcher annotator, where it would not match entities with overlapping definitions. For example, if both lung and lung cancer are defined, lung would not be matched in a given text. This was due to an abstraction error in one of the subclasses of the BigTextMatcher during construction of the underlying data structure
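The expected behavior for overlapping definitions can be illustrated with a small token trie. This is a plain-Python sketch of the general technique, not the BigTextMatcher implementation: the function names are hypothetical and whitespace tokenization is an assumption made to keep the example short.

```python
_END = object()  # sentinel key marking the end of a phrase inside the trie


def build_trie(phrases):
    """Store each phrase token by token in a nested-dict trie."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node[_END] = phrase
    return root


def match_all(tokens, trie):
    """Return (start, end, phrase) for every phrase occurrence, including
    phrases that are prefixes of longer phrases. Indexes are token positions
    and `end` is inclusive."""
    matches = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end])
            if node is None:
                break
            if _END in node:
                # Record the match but keep walking: a longer phrase may follow.
                matches.append((start, end, node[_END]))
    return matches


trie = build_trie(["lung", "lung cancer"])
tokens = "patient has lung cancer".split()
print(match_all(tokens, trie))
# [(2, 2, 'lung'), (2, 3, 'lung cancer')]
```

The key detail is that reaching the end of one phrase does not stop the walk, so the shorter entity is reported alongside the longer one.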

Fix indexing issue for RegexTokenizer annotator. If the document was split into sentences, the index of the sentence inside the document was not taken into consideration for the indexes of the tokens. This would lead to further issues down the pipeline, where tokens would be filtered while unpacking them for other Annotators
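The indexing fix above comes down to adding each sentence's starting offset to the token positions found inside that sentence. The following plain-Python sketch shows the idea; it is not the RegexTokenizer code, the function name and the `\S+` pattern are assumptions, and inclusive end indexes are a choice made for this illustration.

```python
import re


def tokenize_document(text, sentences, pattern=r"\S+"):
    """Tokenize each sentence and report token offsets relative to the
    whole document.

    `sentences` is a list of (begin, end) character offsets into `text`,
    with `end` exclusive. Returned tuples are (token, begin, end) with
    `end` inclusive.
    """
    tokens = []
    for sent_begin, sent_end in sentences:
        sentence_text = text[sent_begin:sent_end]
        for match in re.finditer(pattern, sentence_text):
            # match.start() is relative to the sentence, so shift it by the
            # sentence's own position inside the document.
            begin = sent_begin + match.start()
            end = sent_begin + match.end() - 1
            tokens.append((match.group(), begin, end))
    return tokens


text = "Hello world. Bye now."
sentences = [(0, 12), (13, 21)]
for token in tokenize_document(text, sentences):
    print(token)
# ('Hello', 0, 4)
# ('world.', 6, 11)
# ('Bye', 13, 15)
# ('now.', 17, 20)
```

Without the `sent_begin` shift, the tokens of the second sentence would be reported at offsets 0 to 7 and would collide with those of the first sentence.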

Refactor the Resolvers object in Spark NLP's dependency to avoid the conflict with the Resolvers inside the new sbt

🛑 Known Issues

TypedDependencyParserModel annotator fails in Python in this release (will be fixed in 4.2.6 release next week)

Models

Spark NLP 4.2.5 comes with 400+ state-of-the-art pre-trained transformer models in many languages.

Featured Models

| Model | Name | Lang |
|:---------------------|:-------------------|:---|
| RoBertaForSequenceClassification | roberta_classifier_autotrain_neurips_chanllenge_1287149282 | en |
| RoBertaForSequenceClassification | roberta_classifier_autonlp_imdb_rating_625417974 | en |
| RoBertaForSequenceClassification | RoBertaForSequenceClassification | bn |
| RoBertaForSequenceClassification | roberta_classifier_autotrain_citizen_nlu_hindi_1370952776 | hi |
| RoBertaForSequenceClassification | roberta_classifier_detect_acoso_twitter | es |
| RoBertaForQuestionAnswering | roberta_qa_deepset_base_squad2 | en |
| RoBertaForQuestionAnswering | roberta_qa_icebert | is |
| RoBertaForQuestionAnswering | roberta_qa_mrm8488_base_bne_finetuned_s_c | es |
| RoBertaForQuestionAnswering | roberta_qa_base_bne_squad2 | es |
| BertEmbeddings | bert_embeddings_rbt3 | zh |
| BertEmbeddings | bert_embeddings_base_it_cased | it |
| BertEmbeddings | bert_embeddings_base_indonesian_522m | id |
| BertEmbeddings | bert_embeddings_base_german_uncased | de |
| BertEmbeddings | [bert_embeddings_base_japanese_char](https://nlp.johnsnowlabs.com/2022/12/02/bert_embeddings_base_japanese_char_ja.html) | ja |
| BertEmbeddings | [bert_embeddings_bangla_base](https://nlp.johnsnowlabs.com/2022/12/02/bert_embeddings_bangla_base_bn.html) | bn |
| BertEmbeddings | [bert_embeddings_base_arabertv01](https://nlp.johnsnowlabs.com/2022/12/02/bert_embeddings_base_arabertv01_ar.html) | ar |

Spark NLP covers the following languages:

English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) 
,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

The complete list of all 11700+ models & pipelines in 230+ languages is available on Models Hub

:notebook: New Notebooks

| Spark NLP | Notebooks |
|:------------|:-------------|
| CamemBertForSequenceClassification | HuggingFace in Spark NLP - CamemBertForSequenceClassification |

:notebook: Updated Notebooks

The following notebooks have been updated to use the latest releases of the TensorFlow 2.11 and Hugging Face 4.25 libraries

| Spark NLP | Notebooks |
|:------------|:-------------|
| BertEmbeddings | HuggingFace in Spark NLP - BERT |
| BertSentenceEmbeddings | HuggingFace in Spark NLP - BERT Sentence |
| DistilBertEmbeddings | HuggingFace in Spark NLP - DistilBERT |
| CamemBertEmbeddings | HuggingFace in Spark NLP - CamemBERT |
| RoBertaEmbeddings | HuggingFace in Spark NLP - RoBERTa |
| DeBertaEmbeddings | HuggingFace in Spark NLP - DeBERTa |
| XlmRoBertaEmbeddings | HuggingFace in Spark NLP - XLM-RoBERTa |
| AlbertEmbeddings | HuggingFace in Spark NLP - ALBERT |
| BertForTokenClassification | HuggingFace in Spark NLP - BertForTokenClassification |
| DistilBertForTokenClassification | HuggingFace in Spark NLP - DistilBertForTokenClassification |
| AlbertForTokenClassification | HuggingFace in Spark NLP - AlbertForTokenClassification |
| RoBertaForTokenClassification | HuggingFace in Spark NLP - RoBertaForTokenClassification |
| XlmRoBertaForTokenClassification | HuggingFace in Spark NLP - XlmRoBertaForTokenClassification |
| CamemBertForTokenClassification | HuggingFace in Spark NLP - CamemBertForTokenClassification |
| CamemBertForSequenceClassification | HuggingFace in Spark NLP - CamemBertForSequenceClassification |
| BertForSequenceClassification | HuggingFace in Spark NLP - BertForSequenceClassification |
| DistilBertForSequenceClassification | HuggingFace in Spark NLP - DistilBertForSequenceClassification |
| AlbertForSequenceClassification | HuggingFace in Spark NLP - AlbertForSequenceClassification |
| RoBertaForSequenceClassification | HuggingFace in Spark NLP - RoBertaForSequenceClassification |
| XlmRoBertaForSequenceClassification | HuggingFace in Spark NLP - XlmRoBertaForSequenceClassification |
| AlbertForQuestionAnswering | HuggingFace in Spark NLP - AlbertForQuestionAnswering |
| BertForQuestionAnswering | HuggingFace in Spark NLP - BertForQuestionAnswering |
| DeBertaForQuestionAnswering | HuggingFace in Spark NLP - DeBertaForQuestionAnswering |
| DistilBertForQuestionAnswering | HuggingFace in Spark NLP - DistilBertForQuestionAnswering |
| RoBertaForQuestionAnswering | HuggingFace in Spark NLP - RoBertaForQuestionAnswering |
| XlmRobertaForQuestionAnswering | HuggingFace in Spark NLP - XlmRobertaForQuestionAnswering |

You can visit Import Transformers in Spark NLP

You can visit Spark NLP Workshop for 100+ examples

Installation

Python

    # PyPI
    pip install spark-nlp==4.2.5

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.5
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.5

GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5

M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.5
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.5

AArch64

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.5
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.5

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp_2.12</artifactId>
      <version>4.2.5</version>
    </dependency>

spark-nlp-gpu:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-gpu_2.12</artifactId>
      <version>4.2.5</version>
    </dependency>

spark-nlp-m1:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-m1_2.12</artifactId>
      <version>4.2.5</version>
    </dependency>

spark-nlp-aarch64:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-aarch64_2.12</artifactId>
      <version>4.2.5</version>
    </dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.5.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.5.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.5.jar

AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.5.jar

What's Changed

Contributors

@Damla-Gurbaz @Cabir40 @josejuanmartinez @danilojsl @mhnavid @DevinTDHa @jsl-builder @KshitizGIT @suvrat-joshi @maziyarpanahi @agsfer

New Contributors

@mhnavid made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/12977

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.4...4.2.5
4.2.4(Nov 28, 2022)
:loudspeaker: Overview

Spark NLP 4.2.4 🚀 comes with new support for GCP storage to automatically download and load models & pipelines by setting the cache_pretrained path, an update to TensorFlow 2.7.4 with security patch fixes, many documentation improvements, and bug fixes.

Do not forget to visit Models Hub with 11400+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉

:star: New Features & improvements

Introducing support for GCP storage to automatically download and load pre-trained models/pipelines from cache_pretrained directory
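As a rough illustration of how a GCP-backed cache folder might be configured, the sketch below builds a dictionary of Spark settings. The two configuration keys are given as I understand them from the Spark NLP configuration documentation and should be verified against it before use; the helper name, the project ID, and the bucket path are hypothetical. The sketch only constructs the settings and does not start a Spark session.

```python
def gcp_cache_settings(project_id: str, cache_uri: str) -> dict:
    """Build Spark settings that point the pretrained cache at a GCS path.

    The keys below are assumptions to check against the Spark NLP
    configuration docs; the function itself is a hypothetical helper.
    """
    if not cache_uri.startswith("gs://"):
        raise ValueError("expected a gs:// URI for the cache folder")
    return {
        "spark.jsl.settings.gcp.project_id": project_id,
        "spark.jsl.settings.pretrained.cache_folder": cache_uri,
    }


# Hypothetical project and bucket names, for illustration only.
settings = gcp_cache_settings("my-gcp-project", "gs://my-bucket/cache_pretrained")
for key, value in settings.items():
    print(f"{key}={value}")
```

Each pair could then be passed to `SparkSession.builder.config(key, value)` when building the session; cluster credentials and any required GCS connector are outside the scope of this sketch.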

Update to TensorFlow 2.7.4 with bug and CVE fixes. Details about the bug and CVE fixes: https://github.com/JohnSnowLabs/spark-nlp/commit/417e2a1ff2b0bca2d2046c4d4740f52ce770689f

Improve error handling while importing external TensorFlow models into Spark NLP

Improve error messages when importing external models from remote storages like DBFS, S3, and HDFS

Update documentation on how to use testDataset param in NerDLApproach, ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach

Update installation instructions for the Apple M1 chip

Add support for future decoder-encoder models that consist of 2 separate models

🐛 Bug Fixes

Add missing setPreservePosition in NerConverter

Add missing inputAnnotatorTypes to BigTextMatcher, ViveknSentimentModel, and NerConverter annotators

Fix all incorrect example code provided for LemmatizerModel in Models Hub

Fix the t5_grammar_error_corrector model to be compatible with Spark NLP 4.0+

Fix provided notebook to import Longformer models from Hugging Face into Spark NLP

:notebook: New Notebooks

| Spark NLP | Notebooks |
|:------------|:-------------|
| Spark NLP Conf | Download and Load Model from GCP Storage |
| LongformerEmbeddings | HuggingFace in Spark NLP - Longformer |

You can visit Import Transformers in Spark NLP

You can visit Spark NLP Workshop for 100+ examples

Installation

Python

    # PyPI
    pip install spark-nlp==4.2.4

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4

GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.4
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.4

M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.4
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.4

AArch64

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.4
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.4

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp_2.12</artifactId>
      <version>4.2.4</version>
    </dependency>

spark-nlp-gpu:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-gpu_2.12</artifactId>
      <version>4.2.4</version>
    </dependency>

spark-nlp-m1:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-m1_2.12</artifactId>
      <version>4.2.4</version>
    </dependency>

spark-nlp-aarch64:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-aarch64_2.12</artifactId>
      <version>4.2.4</version>
    </dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.4.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.4.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.4.jar

AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.4.jar

What's Changed

release note v4.2.2 by @Cabir40 in https://github.com/JohnSnowLabs/spark-nlp/pull/13091

added languages by @ahmedlone127 in https://github.com/JohnSnowLabs/spark-nlp/pull/13097

[skip ci] Create PR 4.2.2-healthcare-docs-8fde8ce2327dce2fb89db1742eec8ca121eee0de-3 by @jsl-builder in https://github.com/JohnSnowLabs/spark-nlp/pull/13084

FEATURE NMH-139: Add annotator to existing model [skip-test] by @KshitizGIT in https://github.com/JohnSnowLabs/spark-nlp/pull/13096

Add Visual NLP 4.2 to compatible versions in models.json by @pabla in https://github.com/JohnSnowLabs/spark-nlp/pull/13099

Add new demos 25 by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13100

Docs/alab 4.3.0 by @diatrambitas in https://github.com/JohnSnowLabs/spark-nlp/pull/13104

Added content for installation in OpenShift by @suvrat-joshi in https://github.com/JohnSnowLabs/spark-nlp/pull/13105

Update subtabs by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13110

Release Notes Updated by @Cabir40 in https://github.com/JohnSnowLabs/spark-nlp/pull/13111

Updated old hc snippets by @ArshaanNazir in https://github.com/JohnSnowLabs/spark-nlp/pull/13092

Added content for healthcare nlp integration by @suvrat-joshi in https://github.com/JohnSnowLabs/spark-nlp/pull/13115

Added some content for troubleshooting section by @suvrat-joshi in https://github.com/JohnSnowLabs/spark-nlp/pull/13116

Docs/alab 2479 add content for model testing page by @rpranab in https://github.com/JohnSnowLabs/spark-nlp/pull/13114

Update oncology.md by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13146

SPARKNLP-656 & SPARKNLP-657: Updated Documentation by @DevinTDHa in https://github.com/JohnSnowLabs/spark-nlp/pull/13108

SPARKNLP-658 Update EngineError message by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13109

SPARKNLP-661: Add missing setPreservePosition in NerConverter by @DevinTDHa in https://github.com/JohnSnowLabs/spark-nlp/pull/13112

fixed Wrong Example code provided for LemmatizerModel #13125 by @ahmedlone127 in https://github.com/JohnSnowLabs/spark-nlp/pull/13126

SPARKNLP-620 Provide GCP Support for Cache Folder by @danilojsl in https://github.com/JohnSnowLabs/spark-nlp/pull/13141

SPARKNLP-669 Adding missing inputAnnotatorTypes by @danilojsl in https://github.com/JohnSnowLabs/spark-nlp/pull/13144

SPARKNLP-665 Updating to TensorFlow 2.7.4 by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13152

SPARKNLP-671 incorporate the exception into the error message by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13153

Models hub by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13160

Release/424 release candidate by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13163

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.3...4.2.4
4.2.1(Nov 28, 2022)
:loudspeaker: Overview

Spark NLP 4.2.1 🚀 comes with new multi-lingual support for Word Segmentation (mostly used for, but not limited to, Chinese, Japanese, and Korean), Automatic Speech Recognition (ASR) pipelines added to the LightPipeline arsenal for faster computation on smaller datasets without Apache Spark (e.g. a RESTful API use case), support for processed audio files of type Double in addition to Float for Wav2Vec2, 230+ state-of-the-art Transformer Vision (ViT) pretrained pipelines for 1-line Image Classification, and bug fixes.

Do not forget to visit Models Hub with 11400+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉

:star: New Features & improvements

NEW: Support for multi-lingual WordSegmenter. Add enableRegexTokenizer feature in WordSegmenter to support word segmentation within mixed and multi-lingual content https://github.com/JohnSnowLabs/spark-nlp/pull/12854
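One way to picture word segmentation within mixed content is a regex pass that first splits the text into script runs and then hands only the CJK runs to a segmenter. The sketch below is plain Python and is not how WordSegmenter or enableRegexTokenizer is implemented: the Unicode ranges are a simplification, the function names are hypothetical, and `per_character` is a stand-in for a trained segmentation model.

```python
import re

# Simplified ranges: Japanese kana, common CJK ideographs, Hangul syllables.
CJK = r"\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"
# A chunk is either a run of CJK characters or a run of other non-space characters.
CHUNK_PATTERN = re.compile(rf"[{CJK}]+|[^\s{CJK}]+")
CJK_RUN = re.compile(rf"[{CJK}]+")


def per_character(chunk):
    """Stand-in for a trained segmenter: one word per character."""
    return list(chunk)


def segment_mixed(text, segment_cjk=per_character):
    """Split mixed-script text: keep non-CJK chunks whole and pass each
    CJK run to `segment_cjk`."""
    words = []
    for match in CHUNK_PATTERN.finditer(text):
        chunk = match.group()
        if CJK_RUN.fullmatch(chunk):
            words.extend(segment_cjk(chunk))
        else:
            words.append(chunk)
    return words


print(segment_mixed("Spark NLP 支持中文 tokenization"))
# ['Spark', 'NLP', '支', '持', '中', '文', 'tokenization']
```

Separating the two steps means the Latin-script tokens are never exposed to the segmenter, which is the practical benefit for mixed and multi-lingual content.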

NEW: Add Audio/ASR (Wav2Vec2) support to LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/12895

NEW: Add support for Double type in addition to Float type to AudioAssembler annotator https://github.com/JohnSnowLabs/spark-nlp/pull/12904

Improve error handling in fullAnnotateImage for LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/12868

Add SpanBertCoref annotator to all docs https://github.com/JohnSnowLabs/spark-nlp/pull/12889

Bug Fixes

Fix feeding fullAnnotate in LightPipeline with a list, which started to fail in the 4.2.0 release

Fix exception in ContextSpellCheckerModel when updateVocabClass is used with append set to true https://github.com/JohnSnowLabs/spark-nlp/pull/12875

Fix exception in Chunker annotator https://github.com/JohnSnowLabs/spark-nlp/pull/12901

:notebook: New Notebooks

| Spark NLP | Notebooks |
|:------------|:-------------|
| SpanBertCorefModel | Coreference Resolution with SpanBertCorefModel |
| WordSegmenter | Train and inference multi-lingual Word Segmenter |

You can visit Import Transformers in Spark NLP

You can visit Spark NLP Workshop for 100+ examples

Models

Spark NLP 4.2.1 comes with 230+ state-of-the-art pre-trained Transformer Vision (ViT) pipelines:

Featured Pipelines

| Pipeline | Name | Lang |
|:---------------------|:-------------------|:---|
| PretrainedPipeline | pipeline_image_classifier_vit_base_patch16_224_finetuned_eurosat | en |
| PretrainedPipeline | pipeline_image_classifier_vit_base_beans_demo_v5 | en |
| PretrainedPipeline | pipeline_image_classifier_vit_animal_classifier_huggingface | en |
| PretrainedPipeline | pipeline_image_classifier_vit_Infrastructures | en |
| PretrainedPipeline | pipeline_image_classifier_vit_blocks | en |
| PretrainedPipeline | pipeline_image_classifier_vit_beer_whisky_wine_detection | en |
| PretrainedPipeline | pipeline_image_classifier_vit_base_xray_pneumonia | en |
| PretrainedPipeline | pipeline_image_classifier_vit_baseball_stadium_foods | en |
| PretrainedPipeline | pipeline_image_classifier_vit_dog_vs_chicken | en |

Check the 460+ Transformer Vision (ViT) models & pipelines on Models Hub - Image Classification

The complete list of all 11000+ models & pipelines in 230+ languages is available on Models Hub

Installation

Python

    # PyPI
    pip install spark-nlp==4.2.1

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.1
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.1

GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.1
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.1

M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.1
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.1

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp_2.12</artifactId>
      <version>4.2.1</version>
    </dependency>

spark-nlp-gpu:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-gpu_2.12</artifactId>
      <version>4.2.1</version>
    </dependency>

spark-nlp-m1:

    <dependency>
      <groupId>com.johnsnowlabs.nlp</groupId>
      <artifactId>spark-nlp-m1_2.12</artifactId>
      <version>4.2.1</version>
    </dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.1.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.1.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.1.jar

AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.1.jar

What's Changed

Contributors

@Meryem1425 @muhammetsnts @jsl-models @josejuanmartinez @DevinTDHa @ArshaanNazir @C-K-Loan @KshitizGIT @agsfer @diatrambitas @danilojsl @Damla-Gurbaz @maziyarpanahi @jsl-builder

New Contributors

@ArshaanNazir made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/12881

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.0...4.2.1
4.2.3(Nov 10, 2022)
:loudspeaker: Overview

Spark NLP 4.2.3 🚀 comes with improvements to the CoNLLGenerator annotator, a new way to pass rules to the RegexMatcher annotator, unified control over the number of columns accepted by setInputCols in Scala and Python, documentation for the new IAnnotation feature for Scala users, and bug fixes.

Do not forget to visit Models Hub, with over 11,400 free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉

:star: New Features & improvements

Add a metadata sentence key parameter to the CoNLLGenerator annotator to select which metadata field to use as the sentence

Escape values in the CoNLLGenerator annotator when writing to CSV and preserve special-character tokens

Add rules and delimiter parameters to the RegexMatcher annotator so that rules can be passed as strings in addition to a file

regexMatcher = RegexMatcher() \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
    .setDelimiter(",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")

Implement a check on the number of accepted input columns in Python. Python now handles the case where the user sets more columns in setInputCols than the annotator allows in the same way as Scala

Add documentation for the new IAnnotation feature for Scala users
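
The string-based RegexMatcher rules listed above pair a regular expression with a label, separated by a delimiter. The following is a minimal pure-Python sketch of that idea, not the annotator's implementation. It assumes the label is whatever follows the last delimiter in each rule; `parse_rules` and `match_all` are hypothetical helper names.

```python
import re

def parse_rules(rules, delimiter=","):
    """Split each 'pattern<delimiter>label' string into a compiled regex and its label."""
    parsed = []
    for rule in rules:
        # Assumption for this sketch: the label follows the LAST delimiter in the rule.
        pattern, label = rule.rsplit(delimiter, 1)
        parsed.append((re.compile(pattern), label))
    return parsed

def match_all(text, parsed_rules):
    """Return every (matched_text, label) pair, rule by rule."""
    results = []
    for regex, label in parsed_rules:
        for m in regex.finditer(text):
            results.append((m.group(0), label))
    return results

# Same two rules as in the release note, written as raw strings.
rules = [r"\d{4}\/\d\d\/\d\d,date", r"\d{2}\/\d\d\/\d\d,short_date"]
matches = match_all("Released on 2022/11/10, patched 22/10/27.", parse_rules(rules))
print(matches)
```

Because every rule is applied to the whole text, the short date pattern also fires inside the long date. A pattern that itself contains commas (for example a `{2,4}` quantifier) is one reason a configurable delimiter is useful.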

Bug Fixes

Fix NotSerializableException when the WordEmbeddings annotator is used over the K8s cluster while setEnableInMemoryStorage is set to true

Fix a bug in the RegexTokenizer annotator when it outputs the wrong indexes if the pattern includes splits that are not followed by a space

Fix training module failing on EMR due to a bad Apache Spark version detection. The use of the following classes was fixed on EMR: CoNLL(), CoNLLU(), POS(), and PubTator()

Fix a bug in the CoNLLGenerator annotator where the token has non-int metadata

Fix the wrong SentencePiece model's name required for DeBertaForQuestionAnswering and DeBertaEmbeddings when importing models

Fix NaN results in some ViTForImageClassification models/pipelines
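
The RegexTokenizer fix above concerns token indexes when a split pattern matches without a following space. As an illustration only, here is a small pure-Python sketch of how begin and end offsets can be derived from the separators themselves rather than from assumed whitespace. The inclusive end offset and the function name `tokenize_with_indexes` are assumptions of this sketch, not the library's code.

```python
import re

def tokenize_with_indexes(text, split_pattern=r"\s+"):
    """Split text on split_pattern and return (token, begin, end) triples.

    Offsets refer to the original text; end is inclusive (an assumption of this sketch).
    """
    tokens = []
    start = 0
    for sep in re.finditer(split_pattern, text):
        if sep.start() > start:
            tokens.append((text[start:sep.start()], start, sep.start() - 1))
        # The next token starts right after the separator, whatever its length.
        start = sep.end()
    if start < len(text):
        tokens.append((text[start:], start, len(text) - 1))
    return tokens

sample = "a,b c"
tokens = tokenize_with_indexes(sample, split_pattern=r"[,\s]")
print(tokens)
```

Each offset pair can be checked against the input with `sample[begin:end + 1] == token`, which is the property downstream annotators rely on.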

:notebook: New Notebooks

You can visit Import Transformers in Spark NLP

You can visit Spark NLP Workshop for 100+ examples

Installation

Python

pip install spark-nlp==4.2.3  # PyPI

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>4.2.3</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>4.2.3</version> </dependency>

spark-nlp-m1:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-m1_2.12</artifactId> <version>4.2.3</version> </dependency>

spark-nlp-aarch64:

 <dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-aarch64_2.12</artifactId> <version>4.2.3</version> </dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.3.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.3.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.3.jar

AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.3.jar
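
The package coordinates and FAT JAR links for 4.2.3 follow a regular naming pattern. The sketch below builds them from a flavor name; `ARTIFACTS`, `package_coordinate`, and `fat_jar_url` are hypothetical helpers for illustration and are not part of Spark NLP.

```python
# Artifact names as they appear in the 4.2.3 installation notes.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "m1": "spark-nlp-m1",
    "aarch64": "spark-nlp-aarch64",
}

def package_coordinate(flavor="cpu", version="4.2.3", scala="2.12"):
    """Return the Maven coordinate used with --packages."""
    artifact = ARTIFACTS[flavor]
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{version}"

def fat_jar_url(flavor="cpu", version="4.2.3"):
    """Return the FAT JAR link, assuming the naming pattern shown for 4.2.3."""
    artifact = ARTIFACTS[flavor]
    return ("https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/"
            f"{artifact}-assembly-{version}.jar")

print(f"pyspark --packages {package_coordinate('gpu')}")
print(fat_jar_url("aarch64"))
```

The pattern is only confirmed by the links listed for this release; other versions may publish a different set of artifacts.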

What's Changed

Models hub legal by @josejuanmartinez in https://github.com/JohnSnowLabs/spark-nlp/pull/12999

Models hub finance by @josejuanmartinez in https://github.com/JohnSnowLabs/spark-nlp/pull/13000

Embed React and ReactDOM instead of packages from unpkg [skip test] by @pabla in https://github.com/JohnSnowLabs/spark-nlp/pull/13002

updated OCR release notes by @albertoandreottiATgmail in https://github.com/JohnSnowLabs/spark-nlp/pull/13010

Compat tables by @albertoandreottiATgmail in https://github.com/JohnSnowLabs/spark-nlp/pull/13012

Updating s3 link for dependency_conllu model by @luca-martial in https://github.com/JohnSnowLabs/spark-nlp/pull/13016

Add new demos by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13020

Add new demos 24 by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13022

Updated legre_contract_doc_parties_en and finre_work_experience_en mo… by @bunyamin-polat in https://github.com/JohnSnowLabs/spark-nlp/pull/13023

Docs/alab update documentation 410 by @diatrambitas in https://github.com/JohnSnowLabs/spark-nlp/pull/13024

Doc fix scala and open source by @ArshaanNazir in https://github.com/JohnSnowLabs/spark-nlp/pull/13008

Update 2022-10-22-finclf_bert_sentiment_analysis_lt.md by @gadde5300 in https://github.com/JohnSnowLabs/spark-nlp/pull/13026

add alab image by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13030

Docs/alab update documentation 410 by @diatrambitas in https://github.com/JohnSnowLabs/spark-nlp/pull/13034

SPARKNLP 643 detecting spark version in a safer way by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13035

Docs/alab update documentation 410 by @diatrambitas in https://github.com/JohnSnowLabs/spark-nlp/pull/13041

Added content for exporting visual NER project ad updated few other sections by @suvrat-joshi in https://github.com/JohnSnowLabs/spark-nlp/pull/13042

Bump model card Spark NLP HC version to 4.2.1 by @luca-martial in https://github.com/JohnSnowLabs/spark-nlp/pull/13027

SPARKNLP-642: Fix indexing issue for regex splits without space by @DevinTDHa in https://github.com/JohnSnowLabs/spark-nlp/pull/13032

Update ALAB by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13045

Serializable Issue K8s Word Embeddings by @danilojsl in https://github.com/JohnSnowLabs/spark-nlp/pull/13001

FEATURE NMH-133: Rename products in search [skip-test] by @KshitizGIT in https://github.com/JohnSnowLabs/spark-nlp/pull/12998

Fix sorting in the versions drop-down [skip test] by @pabla in https://github.com/JohnSnowLabs/spark-nlp/pull/13049

Add tooltips for Unidirectional and Bidirectional models [skip test] by @pabla in https://github.com/JohnSnowLabs/spark-nlp/pull/13064

FEATURE NMH-134: Rebranding products [skip-test] by @KshitizGIT in https://github.com/JohnSnowLabs/spark-nlp/pull/13065

Adding Control for Annotators with One Column by @danilojsl in https://github.com/JohnSnowLabs/spark-nlp/pull/12997

Update 2022-10-18-legre_confidentiality_en.md by @gadde5300 in https://github.com/JohnSnowLabs/spark-nlp/pull/13059

Update 2022-09-28-legre_indemnifications_en.md by @gadde5300 in https://github.com/JohnSnowLabs/spark-nlp/pull/13058

Fix a bug in Vision Transformer annotator that results in NaNs for some models by @ahmedlone127 in https://github.com/JohnSnowLabs/spark-nlp/pull/13048

Bug fix and enhancements for CoNLLGenerator annotator by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13053

SPARKNLP-621: Add string support to RegexMatcher in addition to a file by @DevinTDHa in https://github.com/JohnSnowLabs/spark-nlp/pull/13060

Add ScalaDoc for IAnnotation by @danilojsl in https://github.com/JohnSnowLabs/spark-nlp/pull/13061

doc fix in old hc md files by @ArshaanNazir in https://github.com/JohnSnowLabs/spark-nlp/pull/13025

Release/423 release candidate by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13036

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.2...4.2.3
4.2.2(Oct 27, 2022)
:loudspeaker: Overview

Spark NLP 4.2.2 🚀 comes with support for DBFS, HDFS, and S3 in addition to local file systems when you are importing external models from TF Hub and Hugging Face, unifying LightPipeline APIs across Scala, Java, and Python languages for Image Classification, the new fullAnnotateImage for Scala, the new fullAnnotateImageJava for Java, the support for LightPipeline for QuestionAnswering pre-trained pipelines, and bug fixes.

Do not forget to visit Models Hub, with over 11,400 free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉

:star: New Features & improvements

Add support for importing TensorFlow SavedModel from remote storage such as DBFS, S3, and HDFS. From this release, you can import models from TF Hub and HuggingFace that were saved on remote storage

Add support for fullAnnotate in LightPipeline for the path of images in Scala

Add fullAnnotate method in PretrainedPipeline for Scala

Add fullAnnotateJava method in PretrainedPipeline for Java

Add fullAnnotateImage to PretrainedPipeline for Scala

Add fullAnnotateImageJava to PretrainedPipeline for Java

Add support for Question Answering in fullAnnotate method in PretrainedPipeline

Add Predicted Entities to all Vision Transformers (ViT) models and pipelines
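
Importing a SavedModel from DBFS, HDFS, or S3 in addition to the local file system implies telling these locations apart by their URI scheme. A minimal sketch of that distinction follows; `storage_kind` is a hypothetical helper and says nothing about how Spark NLP resolves paths internally.

```python
from urllib.parse import urlparse

# Remote storage systems named in the release note.
REMOTE_SCHEMES = {"dbfs", "hdfs", "s3"}

def storage_kind(path):
    """Classify a model path as 'local' or by its remote scheme (illustrative only)."""
    scheme = urlparse(path).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        return scheme
    return "local"

for p in ["s3://my-bucket/tf/bert", "dbfs:/models/bert",
          "hdfs://namenode:8020/models/bert", "/tmp/saved_model"]:
    print(p, "->", storage_kind(p))
```

The bucket, host, and directory names are placeholders.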

Bug Fixes

Unify the annotatorType name in Python and Scala for Spark schema in Annotation, AnnotationImage, and AnnotationAudio

Fix missing indexes in the RecursiveTokenizer annotator affecting downstream NLP tasks in the pipeline

:notebook: New Notebooks

| Spark NLP | Notebooks | Colab |
|:------------|:-------------|:----------|
| WordSegmenter | Import External SavedModel From Remote | |

You can visit Import Transformers in Spark NLP

You can visit Spark NLP Workshop for 100+ examples

Installation

Python

pip install spark-nlp==4.2.2  # PyPI

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.2

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>4.2.2</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>4.2.2</version> </dependency>

spark-nlp-m1:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-m1_2.12</artifactId> <version>4.2.2</version> </dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.2.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.2.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.2.jar

AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.2.jar

What's Changed

Contributors

@galiph @agsfer @pabla @josejuanmartinez @Cabir40 @maziyarpanahi @Meryem1425 @danilojsl @jsl-builder @jsl-models @ahmedlone127 @DevinTDHa @jdobes-cz @Damla-Gurbaz @Mary-Sci

New Contributors

@Mary-Sci made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/12978

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.1...4.2.2
4.2.0(Sep 27, 2022)
:loudspeaker: Overview

For the first time ever we are delighted to announce Automatic Speech Recognition (ASR) support in Spark NLP by using state-of-the-art Wav2Vec2 models at scale 🚀. This release also comes with Table Question Answering by TAPAS, CamemBERT for Token Classification, support for an external test dataset during training of all classifiers, much faster EntityRuler, 3000+ state-of-the-art models, and other enhancements and bug fixes!

We are also celebrating crossing 11000+ free and open-source models & pipelines in our Models Hub. 🎉 As always, we would like to thank our community for their feedback, questions, and feature requests.

:star: New Features & improvements

NEW: Introducing Wav2Vec2ForCTC annotator in Spark NLP 🚀. Wav2Vec2ForCTC can load Wav2Vec2 models for the Automatic Speech Recognition (ASR) task. Wav2Vec2 is a multi-modal model that combines speech and text. It is the first multi-modal model of its kind in Spark NLP. This annotator is compatible with all the models trained/fine-tuned by using Wav2Vec2ForCTC for PyTorch or TFWav2Vec2ForCTC for TensorFlow models in HuggingFace 🤗 (https://github.com/JohnSnowLabs/spark-nlp/pull/12767)

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

NEW: Introducing TapasForQuestionAnswering annotator in Spark NLP 🚀. TapasForQuestionAnswering can load TAPAS Models with a cell selection head and optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. This annotator is compatible with all the models trained/fine-tuned by using TapasForQuestionAnswering for PyTorch or TFTapasForQuestionAnswering for TensorFlow models in HuggingFace 🤗

TAPAS: Weakly Supervised Table Parsing via Pre-training

NEW: Introducing CamemBertForTokenClassification annotator in Spark NLP 🚀. CamemBertForTokenClassification can load CamemBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using CamembertForTokenClassification for PyTorch or TFCamembertForTokenClassification for TensorFlow in HuggingFace 🤗 (https://github.com/JohnSnowLabs/spark-nlp/pull/12752)

Implement setTestDataset to evaluate metrics on an external dataset during the training of Text Classifiers in Spark NLP. This feature is similar to the one in NerDLApproach, where metrics are calculated on each epoch, and has been added to the following multi-class/multi-label text classifier annotators: ClassifierDLApproach, SentimentDLApproach, and MultiClassifierDLApproach (https://github.com/JohnSnowLabs/spark-nlp/pull/12796)

Refactor the EntityRuler annotator to make inference up to 24x faster, especially when used with a long list of labels/entities. Inference is sped up by implementing the Aho-Corasick algorithm to match patterns in a string. This requires changes when using EntityRuler, which are described in https://github.com/JohnSnowLabs/spark-nlp/pull/12634

Add support for S3 storage in the cache_folder where models are downloaded, extracted, and loaded from. Previously, only local file systems, HDFS, and DBFS were supported. This new feature is especially useful for users on Kubernetes clusters with no access to HDFS or any other distributed file systems (https://github.com/JohnSnowLabs/spark-nlp/pull/12707)

Implement lookaround functionality in the DocumentNormalizer annotator. Currently, DocumentNormalizer has both lookahead and lookbehind functionalities. To extend support for more complex normalizations, especially within clinical text, we are introducing the lookaround feature (https://github.com/JohnSnowLabs/spark-nlp/pull/12735)

Implementing setReplaceEntities param to NerOverwriter annotator to replace all the NER labels (entities) with the given new labels (entities) (https://github.com/JohnSnowLabs/spark-nlp/pull/12745)
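
The EntityRuler speed-up above relies on the Aho-Corasick algorithm, which finds all occurrences of many patterns in a single pass over the text. Below is a compact textbook-style sketch in Python, offered as an illustration of the technique rather than the annotator's implementation. The pattern-to-label dictionary and the function names are assumptions of this example.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto/fail/output tables for the given patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> fallback state on mismatch
    out = [[]]    # out[state] -> patterns that end at this state

    # Phase 1: insert every pattern into a trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the fallback state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, patterns):
    """Return (start_index, pattern, label) for every match; patterns maps text to label."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern, patterns[pattern]))
    return hits

entities = {"John Snow": "PERSON", "Snow": "PERSON", "Winterfell": "LOCATION"}
hits = find_all("John Snow rode to Winterfell", entities)
print(hits)
```

The text is scanned once regardless of how many patterns are registered, which is why this approach pays off with a long list of labels/entities. Overlapping matches such as "Snow" inside "John Snow" are all reported; choosing among them is left to the caller.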

Bug Fixes

Fix a bug in generating the NerDL graph by using TF v2. The previous graph generated by the TFGraphBuilder annotator resulted in an exception when the length of the sequence was 1. This issue has been resolved and the new graphs created by TFGraphBuilder won't have this issue anymore (https://github.com/JohnSnowLabs/spark-nlp/pull/12636)

Fix a bug introduced in the 4.0.0 release in Transformer-based Word Embeddings annotators. In the 4.0.0 release, the following annotators were migrated to BatchAnnotate to improve their performance, especially on GPU. However, the migration introduced a bug in sentence indices which, when these annotators were combined with SentenceEmbeddings for Text Classification tasks (ClassifierDLApproach and SentimentDLApproach), resulted in low accuracy: AlbertEmbeddings, CamemBertEmbeddings, DeBertaEmbeddings, DistilBertEmbeddings, LongformerEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, and XlnetEmbeddings (https://github.com/JohnSnowLabs/spark-nlp/pull/12641)

Add support for a list of questions and contexts in LightPipeline. Previously, only one context and question at a time were supported in LightPipeline for Question Answering annotators. We have added support to fullAnnotate and annotate to receive two lists of questions and contexts (https://github.com/JohnSnowLabs/spark-nlp/pull/12653)

Fix division by zero exception in the GPT2Transformer annotator when the setDoSample param was set to true (https://github.com/JohnSnowLabs/spark-nlp/pull/12661)

Fix AttributeError when PretrainedPipeline is used in Python with ImageAssembler as one of the stages (https://github.com/JohnSnowLabs/spark-nlp/pull/12813)

:notebook: New Notebooks

| Spark NLP | Notebooks | Colab |
|:------------|:-------------|:----------|
| Wav2Vec2ForCTC | Automatic Speech Recognition in Spark NLP | |
| ViTForImageClassification | HuggingFace in Spark NLP - ViTForImageClassification | |
| CamemBertForTokenClassification | HuggingFace in Spark NLP - CamemBertForTokenClassification | |
| ClassifierDLApproach | ClassifierDL Train and Evaluate | |
| MultiClassifierDLApproach | MultiClassifierDL Train and Evaluate | |
| SentimentDLApproach | SentimentDL Train and Evaluate | |
| Pretrained/cache_folder | Download & Load Models From S3 | |
| EntityRuler | EntityRuler | |
| EntityRuler | EntityRuler Alphabet | |
| EntityRuler | EntityRuler LightPipeline | |
| EntityRuler | EntityRuler Without Storage | |
| DocumentNormalizer | Apply Lookaround Patterns | |
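
The lookaround patterns mentioned for DocumentNormalizer are ordinary regex lookahead and lookbehind assertions. The short sketch below shows both with Python's `re` module on a made-up clinical sentence. It illustrates the regex constructs only and does not show any DocumentNormalizer parameters.

```python
import re

text = "A 45-year-old patient received 500mg of drug X on day 3."

# Lookahead: match digits only when followed by "-year-old", without consuming the suffix.
masked_age = re.sub(r"\d+(?=-year-old)", "<AGE>", text)

# Lookbehind: insert a space before "mg" only when it directly follows a digit.
normalized = re.sub(r"(?<=\d)mg\b", " mg", masked_age)

print(normalized)
```

Because the assertions do not consume characters, the surrounding context ("-year-old", the digit before "mg") stays in the text untouched, and the bare "3" in "day 3" is not masked.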

You can visit Import Transformers in Spark NLP

You can visit Spark NLP Workshop for 100+ examples

Models

Spark NLP 4.2.0 comes with 3000+ state-of-the-art pre-trained transformer models in many languages.

Featured Models

| Model | Name | Lang |
|:---------------------|:-------------------|:---|
| Wav2Vec2ForCTC | asr_wav2vec2_base_100h_by_facebook | en |
| Wav2Vec2ForCTC | asr_wav2vec2_base_960h_by_facebook | en |
| Wav2Vec2ForCTC | asr_wav2vec2_large_960h | en |
| Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_german_by_facebook | de |
| Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_french_by_facebook | fr |
| Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_polish_by_facebook | nl |
| Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | hu |
| Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | fi |
| Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | it |
| Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_japanese_hiragana | ja |

Check out 2000+ Wav2Vec2 models & pipelines on Models Hub - Automatic Speech Recognition (ASR)
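
Wav2Vec2ForCTC models emit one token id per audio frame, and a CTC decoding step turns those frames into text. The sketch below shows greedy CTC decoding in plain Python: collapse repeated ids, then drop the blank. The tiny vocabulary, the use of id 0 as blank, and "|" as the word delimiter are assumptions for illustration; the function is not the annotator's code.

```python
def ctc_greedy_decode(frame_ids, vocab, blank_id=0):
    """Collapse consecutive repeats, drop blanks, and map the remaining ids to characters."""
    decoded = []
    previous = None
    for idx in frame_ids:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank_id:
            decoded.append(vocab[idx])
        previous = idx
    return "".join(decoded)

# Hypothetical vocabulary: 0 is the blank, "|" marks a word boundary.
vocab = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "l", 5: "o"}
frames = [2, 2, 0, 3, 3, 4, 0, 4, 4, 5, 0, 1]

transcript = ctc_greedy_decode(frames, vocab).replace("|", " ").strip()
print(transcript)
```

The blank between the two runs of id 4 is what lets the decoder keep the double "l"; without it the repeats would collapse into one character.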

Spark NLP covers the following languages:

English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) 
,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

The complete list of all 11000+ models & pipelines in 230+ languages is available on Models Hub

Installation

Python

pip install spark-nlp==4.2.0  # PyPI

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.0

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.0

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>4.2.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>4.2.0</version> </dependency>

spark-nlp-m1:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-m1_2.12</artifactId> <version>4.2.0</version> </dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.0.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.0.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.0.jar

AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.0.jar

What's Changed

Contributors

@maziyarpanahi @suvrat-joshi @danilojsl @josejuanmartinez @ahmedlone127 @Damla-Gurbaz @vankov @xusliebana @DevinTDHa @jsl-builder @Cabir40 @muhammetsnts @wolliq @Meryem1425 @pabla @C-K-Loan @rpranab @agsfer

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.1.0...4.2.0

4.1.0(Aug 24, 2022)
Overview

An Image is Worth 16x16 Words!

For the first time ever we are delighted to announce support for Image Classification in Spark NLP by using state-of-the-art Vision Transformer (ViT) models at scale. This release comes with official support for AWS Graviton and ARM64 processors, new Databricks and EMR support, and 1000+ state-of-the-art models.

Spark NLP 4.1 also celebrates crossing 8000+ free and open-source models & pipelines available on Models Hub. 🎉 As always, we would like to thank our community for their feedback, questions, and feature requests.

:star: New Features & improvements

NEW: Introducing ViTForImageClassification annotator in Spark NLP 🚀. ViTForImageClassification can load Vision Transformer ViT Models with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet. This annotator is compatible with all the models trained/fine-tuned by using ViTForImageClassification for PyTorch or TFViTForImageClassification for TensorFlow models in HuggingFace 🤗 (https://github.com/JohnSnowLabs/spark-nlp/pull/11536)

An overview of the ViT model structure as introduced in Google Research’s original 2021 paper

from pyspark.ml import Pipeline
from sparknlp.base import ImageAssembler
from sparknlp.annotator import ViTForImageClassification

data_df = spark.read.format("image") \
    .load(path="images/")

image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

image_classifier = ViTForImageClassification \
    .pretrained() \
    .setInputCols("image_assembler") \
    .setOutputCol("class")

pipeline = Pipeline(stages=[
    image_assembler,
    image_classifier,
])

model = pipeline.fit(data_df)

NEW: Support for AWS Graviton/Graviton2 With up to 3x Better Price-Performance. For the first time, Spark NLP supports Graviton and ARM64 (ARMv8 above) processors. (https://github.com/JohnSnowLabs/spark-nlp/pull/10939)

NEW: Introducing TFNerDLGraphBuilder annotator. TFNerDLGraphBuilder can be used to automatically detect the parameters of a needed NerDL graph and generate the graph within a pipeline when the default NER graphs are not suitable for your training datasets. TFNerDLGraphBuilder supports local, DBFS, and S3 file systems. (https://github.com/JohnSnowLabs/spark-nlp/pull/10564)

Allow passing confidence scores from all XXXForTokenClassification annotators to NerConverter. It is now possible to access the confidence scores coming from the following annotators in NerConverter metadata (similar to NerDLModel): AlbertForTokenClassification, BertForTokenClassification, DeBertaForTokenClassification, DistilBertForTokenClassification, LongformerForTokenClassification, RoBertaForTokenClassification, XlmRoBertaForTokenClassification, and XlnetForTokenClassification

Introducing PushToHub Python class to easily push public models & pipelines to Models Hub

Introducing fullAnnotateImage to existing LightPipeline to support ImageAssembler and ViTForImageClassification annotators in a Spark NLP pipeline. The fullAnnotateImage supports a path to images hosted locally, on DBFS, and S3.

light_pipeline = LightPipeline(model)
annotations_result = light_pipeline.fullAnnotateImage("images/hippopotamus.JPEG")

Welcoming a new EMR 6.x series to our Spark NLP family:

EMR 6.7.0 (now supports Apache Spark 3.2.1, Apache Hive 3.1.3, HUDI 0.11, PrestoDB 0.272, and Trino 0.378.)

Welcoming 3 new Databricks runtimes to our Spark NLP family:

Databricks 11.2 LTS

Databricks 11.2 LTS ML

Databricks 11.2 LTS ML GPU

Welcoming new AWS Graviton-enabled instance types for Databricks runtimes:

General Purpose: m6g, m6gd

Compute Optimized: c6g, c6gd

Memory Optimized: r6g, r6gd

Models

Spark NLP 4.1.0 comes with 1000+ state-of-the-art pre-trained transformer models for Image Classification, Token Classification, and Sequence Classification in many languages.

Featured Models

| Model | Name | Lang |
|:---------------------|:-------------------|:---|
| ViTForImageClassification | image_classifier_vit_base_patch16_224 | en |
| ViTForImageClassification | image_classifier_vit_base_patch16_384 | en |
| ViTForImageClassification | image_classifier_vit_base_patch32_384 | en |
| ViTForImageClassification | image_classifier_vit_base_xray_pneumonia | en |
| ViTForImageClassification | image_classifier_vit_finetuned_chest_xray_pneumonia | en |
| ViTForImageClassification | image_classifier_vit_food | en |
| ViTForImageClassification | image_classifier_vit_base_food101 | en |
| ViTForImageClassification | image_classifier_vit_autotrain_dog_vs_food | en |
| ViTForImageClassification | image_classifier_vit_baseball_stadium_foods | en |
| ViTForImageClassification | image_classifier_vit_south_indian_foods | en |
| ViTForImageClassification | image_classifier_vit_denver_nyc_paris | en |
| ViTForImageClassification | image_classifier_vit_CarViT | en |

Check out 240 (ViT) models on Models Hub - Image Classification

Spark NLP covers the following languages:

English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) 
,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

The complete list of all 8000+ models & pipelines in 230+ languages is available on Models Hub

New Notebooks

| Notebook |
|:------------|
| Graph Builder |
| Graph ViTForImageClassification |

You can visit Spark NLP Workshop for 100+ examples

You can visit Import Transformers in Spark NLP

📖 Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==4.1.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.1.0

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.1.0

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.1.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.1.0</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.1.0</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>4.1.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.1.0.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.1.0.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.1.0.jar

AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.1.0.jar
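The package coordinates listed in this Installation section follow a single pattern: group `com.johnsnowlabs.nlp`, an artifact name that depends on the hardware flavor, the Scala suffix `_2.12`, and the release version. The sketch below only assembles those strings; the helper names `maven_coordinate` and `packages_command` and the `ARTIFACTS` mapping are hypothetical and not part of Spark NLP.

```python
# Hypothetical helper: builds the Maven coordinates shown in the release notes.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "m1": "spark-nlp-m1",
    "aarch64": "spark-nlp-aarch64",
}


def maven_coordinate(flavor="cpu", version="4.1.0", scala="2.12"):
    """Return group:artifact_scala:version for the requested hardware flavor."""
    artifact = ARTIFACTS[flavor]  # raises KeyError for an unknown flavor
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{version}"


def packages_command(shell="pyspark", flavor="cpu", version="4.1.0"):
    """Return the spark-shell / pyspark command line that pulls the package."""
    return f"{shell} --packages {maven_coordinate(flavor, version)}"


print(packages_command("spark-shell", "gpu"))
```

Keeping the flavor-to-artifact mapping in one place makes it harder to mix, for example, a GPU artifact with a CPU-only cluster when the command is generated by a script.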

What's Changed

New Contributors

@paulk-asert made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/11128

@cayorodriguez made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/10376

Contributors

@josejuanmartinez @jsl-models @maziyarpanahi @DevinTDHa @agsfer @rpranab @vankov @cayorodriguez @paulk-asert @Ahmetemintek @muhammetsnts @jsl-builder @Cabir40 @diatrambitas @galiph @ahmedlone127 @pabla @Damla-Gurbaz

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.0.2...4.1.0
4.0.2(Jul 19, 2022)
Overview

We are pleased to release Spark NLP 🚀 4.0.2! This release comes with full compatibility with the newly released Apache Spark 3.3.0 and official support for Databricks' new 11.1 Beta runtimes (which include Apache Spark 3.3.0 and Scala 2.12).

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:

Databricks 11.1 Beta

Databricks 11.1 ML Beta

Databricks 11.1 ML Beta GPU

SentenceDetector now comes with a new parameter customBoundsStrategy for returning custom bounds https://github.com/JohnSnowLabs/spark-nlp/pull/10567

Example

Given setCustomBounds([r"\.", ";"]) and the following text:

This is a sentence. This one uses custom bounds; As is this one;

Without the new parameter, the result is:

["This is a sentence", "This one uses custom bounds", "As is this one"]

With the new parameter set to append:

.setCustomBounds([r"\.", ";"])
.setCustomBoundsStrategy("append")

the result is:

["This is a sentence.", "This one uses custom bounds;", "As is this one;"]

Similarly, with prepend and the following text:

1. This is a list
1.1 This is a subpoint
2. Second thing
2.2 Second subthing

.setCustomBounds([r"\n[\d\.]+"])
.setCustomBoundsStrategy("prepend")

the result is:

[
  "1. This is a list",
  "1.1 This is a subpoint",
  "2. Second thing",
  "2.2 Second subthing"
]
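To make the difference between the strategies concrete, here is a minimal plain-Python sketch that reproduces the outputs listed in the example using only the standard `re` module. It is an illustration of the described behavior, not Spark NLP's implementation; the function `split_with_bounds` and the strategy name `"none"` are assumptions introduced for this sketch.

```python
import re


def split_with_bounds(text, bounds, strategy="none"):
    """Split text on regex bounds and keep each matched bound per strategy.

    strategy: "none" drops the bound, "append" keeps it at the end of the
    preceding sentence, "prepend" keeps it at the start of the next one.
    """
    pattern = re.compile("|".join(f"(?:{b})" for b in bounds))
    sentences, cursor, carried = [], 0, ""
    for match in pattern.finditer(text):
        body = carried + text[cursor:match.start()]
        carried = ""
        if strategy == "append":
            body += match.group()
        elif strategy == "prepend":
            carried = match.group()  # attach the bound to the next sentence
        sentences.append(body)
        cursor = match.end()
    sentences.append(carried + text[cursor:])
    # Trim whitespace and drop empty pieces (e.g. after a trailing bound).
    return [s.strip() for s in sentences if s.strip()]


text = "This is a sentence. This one uses custom bounds; As is this one;"
print(split_with_bounds(text, [r"\.", ";"]))
print(split_with_bounds(text, [r"\.", ";"], "append"))

outline = "1. This is a list\n1.1 This is a subpoint\n2. Second thing\n2.2 Second subthing"
print(split_with_bounds(outline, [r"\n[\d\.]+"], "prepend"))
```

The prepend case depends on the input containing real newlines, since the bound `\n[\d\.]+` only matches a numbering token that starts a new line.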

Bug Fixes

Fix bug that attempts to create spark session on executors when using GraphExtraction in Spark/PySpark 3.3 https://github.com/JohnSnowLabs/spark-nlp/pull/9905

Models and Pipelines

Spark NLP 4.0.2 comes with 620+ state-of-the-art pre-trained transformer models in 21 languages including multi-lingual models.

Featured Models

| Model | Name | Lang |
|:---------------------|:-------------------|:---|
| BertForQuestionAnswering | electra_qa_BioM_Base_SQuAD2_BioASQ8B | en |
| BertForQuestionAnswering | bert_qa_multilingual_base_cased_chines | zh |
| BertForQuestionAnswering | bert_qa_deep_pavlov_full | ru |
| BertForQuestionAnswering | bert_qa_firmanindolanguagemodel | id |
| BertForQuestionAnswering | bert_qa_kcbert_base_finetuned_squad | ko |
| BertForQuestionAnswering | bert_qa_mbert_finetuned_mlqa_de_hi_dev | xx |
| BertForQuestionAnswering | bert_qa_modelontquad | tr |
| BertForQuestionAnswering | bert_qa_newsqa_el_4 | el |
| BertForQuestionAnswering | bert_qa_testpersianqa | fa |
| BertForQuestionAnswering | bert_qa_arabert_finetuned_arcd | ar |
| BertForTokenClassification | bert_ner_NER_legal_de_Sahajtomar | de |
| BertForTokenClassification | bert_ner_NER_en_vi_it_es_tinparadox | xx |
| BertForTokenClassification | bert_ner_NER_CAMELBERT | ar |
| BertForTokenClassification | bert_ner_Swedish_NER | sv |
| BertForTokenClassification | bert_ner_bert_base_chinese_ner | zh |
| BertForTokenClassification | bert_ner_bert_base_hu_cased_ner | hu |
| BertForTokenClassification | bert_ner_bert_base_indonesian_NER | id |
| BertForTokenClassification | bert_ner_bert_base_irish_cased_v1_finetuned_ner | ga |
| BertForTokenClassification | bert_ner_bert_base_pt_archive | pt |
| BertForTokenClassification | bert_ner_bert_base_spanish_wwm_uncased_finetuned_NER_medical | es |

The complete list of all 6900+ models & pipelines in 230+ languages is available on Models Hub

📖 Documentation & Articles

Spark NLP: Hardware Acceleration

Serving Spark NLP via API in Java

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==4.0.2

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.0.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.0.2</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.0.2</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.2.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.2.jar

What's Changed

Contributors

@gadde5300 @danilojsl @hsaglamlar @Cabir40 @ahmedlone127 @muhammetsnts @KshitizGIT @maziyarpanahi @albertoandreottiATgmail @DevinTDHa @luca-martial @Damla-Gurbaz @jsl-models @Meryem1425

New Contributors

@hsaglamlar made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/10544

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.0.1...4.0.2
4.0.1(Jul 1, 2022)
Overview

We are pleased to release Spark NLP 🚀 4.0.1! This release adds support for the newly released Apache Spark 3.3.0, which improves join query performance via Bloom filters, increases Pandas API coverage, and brings many other improvements. In addition, Spark NLP comes with official support for Databricks runtimes 11, other enhancements, and bug fixes.

As always, we would like to thank our community for their feedback, questions, and feature requests.

Features & Enhancements

Full support for Apache Spark & PySpark 3.3.0

Add Apache Spark 3.3.0 to Google Colab and Kaggle setup scripts

New -g option for Google Colab and Kaggle setup on GPU device to upgrade libcudnn8 to 8.1.0 to solve the issue on GPU

Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:

Databricks 11.0 LTS

Databricks 11.0 LTS ML

Databricks 11.0 LTS ML GPU

Bug Fixes

Fix the error caused by PySpark 3.3.0 in CoNLL, CoNLLU, POS, and PubTator annotators as training helpers

Fix and re-upload Dependency and Type Dependency parser pre-trained models

Update pre-trained pipelines with issues on PySpark 3.2 and 3.3

Documentation

Serving Spark NLP via API in Java

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==4.0.1

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.0.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.0.1</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.0.1</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.1.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.1.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.1.jar

What's Changed

Contributors

@muhammetsnts @jsl-models @Meryem1425 @Damla-Gurbaz @jsl-builder @rpranab @danilojsl @josejuanmartinez @Cabir40 @DevinTDHa @agsfer @suvrat-joshi @ahmedlone127 @albertoandreottiATgmail @KshitizGIT @mahmoodbayeshi @maziyarpanahi

New Contributors

@ahmedlone127 made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/9887

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.0.0...4.0.1
4.0.0(Jun 15, 2022)
Overview

We are very excited to release Spark NLP 4.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community! 🎉

This release comes with official support for the Apple silicon M1 chip (for the first time), official support for Spark/PySpark 3.2, support for the oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow performance on CPU by up to 97%, optimized transformer-based embeddings on GPU that increase performance by up to +700%, brand new extractive transformer-based question answering (QA) annotators for tasks like SQuAD based on ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa architectures, 1000+ state-of-the-art models, a WordEmbeddingsModel that now works in clusters without HDFS/DBFS/S3 (such as Kubernetes), new Databricks and EMR support, new NER models achieving the highest F1 score in Spark NLP, and many more enhancements and bug fixes!

We would like to mention that Spark NLP 4.0.0 drops the support for Spark 2.3 and 2.4 (Scala 2.11). Starting 4.0.0 we only support Spark/PySpark 3.x on Scala 2.12.

As always, we would like to thank our community for their feedback, questions, and feature requests.

Major features and improvements

NEW: Support for the oneAPI Deep Neural Network Library (oneDNN) optimizations to improve TensorFlow on CPU. Enabling oneDNN can speed up some transformer-based models by up to 97%. By default, the oneDNN optimizations are turned off. To enable them, set the environment variable TF_ENABLE_ONEDNN_OPTS. On Linux systems, for instance: export TF_ENABLE_ONEDNN_OPTS=1
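When the session is started from Python rather than from a shell, the same variable can be set programmatically. The sketch below is a minimal illustration under two assumptions: the variable must be set before the Spark session (and its JVM) is created, and on a cluster it has to reach the executors through Spark's `spark.executorEnv.<NAME>` configuration. The helper name `onednn_settings` is hypothetical.

```python
import os


def onednn_settings(enabled=True):
    """Set TF_ENABLE_ONEDNN_OPTS locally and return Spark conf entries for executors."""
    flag = "1" if enabled else "0"
    # Local/driver side: processes launched afterwards inherit this environment.
    os.environ["TF_ENABLE_ONEDNN_OPTS"] = flag
    # Cluster side: spark.executorEnv.<NAME> adds an environment variable to executors.
    return {"spark.executorEnv.TF_ENABLE_ONEDNN_OPTS": flag}


conf = onednn_settings()
print(conf)
```

Each returned key/value pair can then be passed to `SparkSession.builder.config(key, value)` before the session is created; setting it after the session is already running is not expected to have any effect.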

NEW: Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations can result in performance improvements up to +700% (more details in the Benchmarks section)

NEW: Official support for Apple silicon M1 on macOS devices. You can use the spark-nlp-m1 package that supports Apple silicon M1 on your macOS machine in Spark NLP 4.0.0

NEW: Introducing AlbertForQuestionAnswering annotator in Spark NLP 🚀. AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using AlbertForQuestionAnswering for PyTorch or TFAlbertForQuestionAnswering for TensorFlow models in HuggingFace 🤗

NEW: Introducing BertForQuestionAnswering annotator in Spark NLP 🚀. BertForQuestionAnswering can load BERT & ELECTRA Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using BertForQuestionAnswering and ElectraForQuestionAnswering for PyTorch or TFBertForQuestionAnswering and TFElectraForQuestionAnswering for TensorFlow models in HuggingFace 🤗

NEW: Introducing DeBertaForQuestionAnswering annotator in Spark NLP 🚀. DeBertaForQuestionAnswering can load DeBERTa v2&v3 Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForQuestionAnswering for PyTorch or TFDebertaV2ForQuestionAnswering for TensorFlow models in HuggingFace 🤗

NEW: Introducing DistilBertForQuestionAnswering annotator in Spark NLP 🚀. DistilBertForQuestionAnswering can load DistilBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DistilBertForQuestionAnswering for PyTorch or TFDistilBertForQuestionAnswering for TensorFlow models in HuggingFace 🤗

NEW: Introducing LongformerForQuestionAnswering annotator in Spark NLP 🚀. LongformerForQuestionAnswering can load Longformer Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using LongformerForQuestionAnswering for PyTorch or TFLongformerForQuestionAnswering for TensorFlow models in HuggingFace 🤗

NEW: Introducing RoBertaForQuestionAnswering annotator in Spark NLP 🚀. RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using RobertaForQuestionAnswering for PyTorch or TFRobertaForQuestionAnswering for TensorFlow models in HuggingFace 🤗

NEW: Introducing XlmRoBertaForQuestionAnswering annotator in Spark NLP 🚀. XlmRoBertaForQuestionAnswering can load XLM-RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForQuestionAnswering for PyTorch or TFXLMRobertaForQuestionAnswering for TensorFlow models in HuggingFace 🤗

NEW: Introducing MultiDocumentAssembler annotator where multiple inputs require to be converted to DOCUMENT such as in XXXForQuestionAnswering annotators

NEW: Introducing SpanBertCorefModel annotator for Coreference Resolution on BERT and SpanBERT models based on BERT for Coreference Resolution: Baselines and Analysis paper. An implementation of a SpanBert-based coreference resolution model.

NEW: Introducing enableInMemoryStorage parameter in WordEmbeddingsModel annotator. By enabling this parameter the annotator will no longer require a distributed storage to unpack indices and will perform everything in-memory.

Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP is shipped for Spark 3.2.x by default and also supports Spark/PySpark 3.0.x and 3.1.x

Unifying all supported Apache Spark packages on Maven into spark-nlp for CPU, spark-nlp-gpu for GPU, and spark-nlp-m1 for new Apple silicon M1 on macOS. The need for Apache Spark specific packages like spark-nlp-spark32 has been removed.

Adding a new param to sparknlp.start() function in Python and Scala for Apple silicon M1 on macOS (m1=True)

Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1

Upgrade RocksDB with new enhancements and support for Apple silicon M1

Upgrade SentencePiece tokenizer TF ops to 2.7.1

Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS

Upgrade to Scala 2.12.15

Update Colab, Kaggle, and SageMaker scripts

Refactor the entire Python module in Spark NLP to make the development and maintenance easier

Refactor unit tests in Python and migrate to pytest

Welcoming 6x new Databricks runtimes to our Spark NLP family:

Databricks 10.4 LTS

Databricks 10.4 LTS ML

Databricks 10.4 LTS ML GPU

Databricks 10.5

Databricks 10.5 ML

Databricks 10.5 ML GPU

Welcoming a new EMR 6.x series to our Spark NLP family:

EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)

Migrate T5Transformer to TensorFlow v2 architecture by re-uploading all the existing models

Support for 2 inputs in LightPipeline with MultiDocumentAssembler

Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)

Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines

Allow change of case sensitivity. Previously, the user could not set the setCaseSensitive param. This change allows users to adjust the value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).

Keep accuracy in ClassifierDL and SentimentDL during the training between 0.0 and 1.0

Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)

Performance Improvements (Benchmarks)

We have introduced two major performance improvements for GPU and CPU devices in Spark NLP 4.0.0 release.

The following benchmarks have been done by using a single Dell Server with the following specs:

GPU: Tesla P100 PCIe 12GB - CUDA Version: 11.3 - Driver Version: 465.19.01

CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz - 40 Cores

Memory: 80G

GPU

We have improved our batch processing approach for transformer-based Word Embeddings to improve their performance on a GPU device. These optimizations result in performance improvements up to +700%. The detailed list of improved transformer models on GPU in comparison to Spark NLP 3.4.x:

| Model on GPU | Spark NLP 3.4.3 vs. 4.0.0 |
| ----------------- |:-------------------------:|
| RoBERTa base | +560% (6.6x) |
| RoBERTa Large | +332% (4.3x) |
| Albert Base | +587% (6.9x) |
| Albert Large | +332% (4.3x) |
| DistilBERT | +659% (7.6x) |
| XLM-RoBERTa Base | +638% (7.4x) |
| XLM-RoBERTa Large | +365% (4.7x) |
| XLNet Base | +449% (5.5x) |
| XLNet Large | +267% (3.7x) |
| DeBERTa Base | +713% (8.1x) |
| DeBERTa Large | +477% (5.8x) |
| Longformer Base | +52% (1.5x) |
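The two numbers in each cell of the GPU table express the same measurement: an improvement of +p% corresponds to a speedup factor of 1 + p/100. A short sketch of that arithmetic, using three rows from the table (the function name `speedup_factor` is introduced here for illustration):

```python
def speedup_factor(percent_improvement):
    """Convert a '+p%' improvement into an 'N x' speedup factor."""
    return 1 + percent_improvement / 100


rows = [("RoBERTa base", 560), ("DeBERTa Base", 713), ("Longformer Base", 52)]
for model, pct in rows:
    print(f"{model}: +{pct}% = {speedup_factor(pct):.1f}x")
```

For these rows the computed factors round to 6.6x, 8.1x, and 1.5x, matching the values in parentheses in the table.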

CPU

The oneAPI Deep Neural Network Library (oneDNN) optimizations are now available in Spark NLP 4.0.0 that uses TensorFlow 2.7.1. You can enable those CPU optimizations by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1.

Intel has been collaborating with Google to optimize TensorFlow's performance on Intel Xeon processor-based platforms using the Intel oneAPI Deep Neural Network Library (oneDNN), an open-source, cross-platform performance library for DL applications. TensorFlow optimizations are enabled via oneDNN to accelerate key performance-intensive operations such as convolution, matrix multiplication, and batch normalization.

The following table compares the last release, Spark NLP 3.4.3 on CPU, with Spark NLP 4.0.0 on CPU with oneDNN enabled.

| Model on CPU | 3.4.x vs. 4.0.0 with oneDNN |
| ----------------- |:------------------------:|
| BERT Base | +47% |
| BERT Large | +42% |
| RoBERTa Base | +51% |
| RoBERTa Large | +61% |
| Albert Base | +83% |
| Albert Large | +58% |
| DistilBERT | +80% |
| XLM-RoBERTa Base | +82% |
| XLM-RoBERTa Large | +72% |
| XLNet Base | +50% |
| XLNet Large | +27% |
| DeBERTa Base | +59% |
| DeBERTa Large | +56% |
| CamemBERT Base | +97% |
| CamemBERT Large | +65% |
| Longformer Base | +63% |

Bug Fixes

Fix the default pre-trained model for DeBertaForTokenClassification in Scala and Python

Remove a requirement in DocumentNormalizer so that consecutive stages can produce empty text annotations without breaking the pipeline

Fix WordSegmenterModel outputting the wrong order of tokens. The regex that groups the tagging format was refactored to preserve the order of segmented outputs (tokens)

Fix encoding sentences not respecting the max sequence length given by a user in XlmRobertaSentenceEmbeddings

Fix encoding sentences by using SentencePiece to calculate the correct tokens indexing

Fix SentencePiece serialization issue when XlmRoBertaEmbeddings and XlmRoBertaSentenceEmbeddings annotators are used from a Fat JAR on GPU

Remove non-existing parameters from DocumentAssembler in Python

Updated Requirements

Java 8 (still supported) or 11

Apache Spark 3.x (3.0, 3.1, and 3.2)

NVIDIA® GPU drivers version 450.80.02 or higher

CUDA® Toolkit 11.2

cuDNN SDK 8.1.0

Scala 2.12.15

Backward Compatibility

Deprecate support for Spark/PySpark 2.3, Spark/PySpark 2.4, and Scala 2.11 https://github.com/JohnSnowLabs/spark-nlp/pull/8319

The start() functions in Python and Scala will no longer have spark23, spark24, and spark32 parameters. The default sparknlp.start() works on PySpark 3.0.x, 3.1.x, and 3.2.x without the need for any Spark-related flags

Some models/pipelines which were trained or saved by using Spark and PySpark 2.3/2.4 will no longer work on Spark NLP 4.0.0

Remove json4s-ext dependency to allow the support for all Apache Spark major releases in one build

Models and Pipelines

Spark NLP 4.0.0 comes with 1000+ state-of-the-art pre-trained transformer models in many languages.

New NER Models

nerdl_conll_deberta_large NER model breaks the previously highest F1 on CoNLL03 dev by 1%

| Model | Name | Lang | Dev F1 |
|:---------------------|:-------------------|:---|:----|
| NerDLModel | nerdl_conll_deberta_large | en | 96% |
| NerDLModel | nerdl_conll_elmo | en | 95.6% |
| NerDLModel | nerdl_conll_deberta_base | en | 94% |

Featured Models

| Model | Name | Lang |
|:---------------------|:-------------------|:---|
| AlbertForQuestionAnswering | albert_base_qa_squad2 | en |
| DeBertaForQuestionAnswering | deberta_v3_xsmall_qa_squad2 | en |
| DistilBertForQuestionAnswering | distilbert_base_cased_qa_squad2 | en |
| LongformerForQuestionAnswering | longformer_base_base_qa_squad2 | en |
| RoBertaForQuestionAnswering | roberta_base_qa_squad2 | en |
| XlmRoBertaForQuestionAnswering | xlm_roberta_base_qa_squad2 | en |
| DistilBertForQuestionAnswering | distilbert_qa_multi_finedtuned_squad | pt |
| BertForQuestionAnswering | bert_qa_bert_large_cased_squad_v1.1_portuguese | pt |
| BertForQuestionAnswering | bert_qa_chinese_pert_base_mrc | zh |
| BertForQuestionAnswering | bert_qa_arap_qa_bert | ar |
| BertForQuestionAnswering | bert_qa_ainize_klue_bert_base_mrc | ko |
| BertForQuestionAnswering | bert_qa_Part_1_mBERT_Model_E1 | xx |
| BertForQuestionAnswering | bert_qa_qacombination_bert_el_Danastos | el |

Spark NLP covers the following languages:

English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) 
,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

The complete list of all 6000+ models & pipelines in 230+ languages is available on Models Hub

New Notebooks

Import hundreds of models in different languages to Spark NLP

| Spark NLP | HuggingFace Notebooks | Colab |
|:------------|:-------------|:----------|
| AlbertForQuestionAnswering | HuggingFace in Spark NLP - AlbertForQuestionAnswering | |
| BertForQuestionAnswering | HuggingFace in Spark NLP - BertForQuestionAnswering | |
| DeBertaForQuestionAnswering | HuggingFace in Spark NLP - DeBertaForQuestionAnswering | |
| DistilBertForQuestionAnswering | HuggingFace in Spark NLP - DistilBertForQuestionAnswering | |
| LongformerForQuestionAnswering | HuggingFace in Spark NLP - LongformerForQuestionAnswering | |
| RoBertaForQuestionAnswering | HuggingFace in Spark NLP - RoBertaForQuestionAnswering | |
| XlmRobertaForQuestionAnswering | HuggingFace in Spark NLP - XlmRobertaForQuestionAnswering | |

You can visit Import Transformers in Spark NLP for more info

Documentation

Serving Spark NLP via API in Java

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==4.0.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.0

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.0

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.0.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.0.0</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.0.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.0.jar

GPU on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.0.jar

M1 on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.0.jar

What's Changed

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.4.4...4.0.0

@vankov @mahmoodbayeshi @Ahmetemintek @DevinTDHa @albertoandreottiATgmail @KshitizGIT @jsl-models @gokhanturer @josejuanmartinez @murat-gunay @rpranab @wolliq @bunyamin-polat @pabla @danilojsl @agsfer @Meryem1425 @gadde5300 @muhammetsnts @Damla-Gurbaz @maziyarpanahi @jsl-builder @Cabir40 @suvrat-joshi
3.4.4 (May 6, 2022)
Overview

We are very excited to release Spark NLP 🚀 3.4.4! This release comes with a new DeBERTa for Token Classification annotator compatible with existing or fine-tuned models on HuggingFace 🤗, a new annotator for CamemBERT embeddings models, up to 18x faster UniversalSentenceEncoder on GPU devices, up to 400% speed improvements in Tokenizer with a list of exceptions, new state-of-the-art NER, French embeddings, DistilBERT embeddings, and ALBERT embeddings models!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

NEW: Introducing DeBertaForTokenClassification annotator in Spark NLP 🚀. DeBertaForTokenClassification can load DeBERTa v2&v3 models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForTokenClassification for PyTorch or TFDebertaV2ForTokenClassification for TensorFlow models in HuggingFace https://github.com/JohnSnowLabs/spark-nlp/pull/8082
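The phrase "a linear layer on top of the hidden-states output" can be pictured with a toy example. The sketch below is plain Python, not Spark NLP or HuggingFace code: the function names, the 2-dimensional hidden states, the weights, and the label set are all made up for illustration.

```python
def linear(vector, weights, bias):
    """Apply a linear layer: one weight row and one bias value per label."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def classify_tokens(hidden_states, weights, bias, labels):
    """Score every token's hidden state and keep the highest-scoring label."""
    predictions = []
    for hidden in hidden_states:
        logits = linear(hidden, weights, bias)
        best = max(range(len(labels)), key=lambda i: logits[i])
        predictions.append(labels[best])
    return predictions


# Hypothetical values: three labels, hidden size 2, three tokens.
labels = ["O", "B-PER", "B-LOC"]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
hidden_states = [[0.9, 0.1], [0.2, 0.8], [-0.7, -0.6]]

print(classify_tokens(hidden_states, weights, bias, labels))
```

Each token gets its own label because the same linear layer is applied to every position of the encoder output, which is what distinguishes a token classification head from a sequence classification head.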

NEW: Introducing CamemBertEmbeddings annotator in Spark NLP 🚀. CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture, pretrained on the French subcorpus of the multilingual corpus OSCAR. For further information or requests, please see the CamemBERT website. https://github.com/JohnSnowLabs/spark-nlp/pull/8237

Add support for batching rows to improve UniversalSentenceEncoder on GPU devices. This new feature increases GPU speed by between 2x and 18x, depending on the distribution of sentences https://github.com/JohnSnowLabs/spark-nlp/pull/8234
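As a rough picture of why batching rows helps, the sketch below groups sentences so that the (simulated) model is called once per batch instead of once per row. It is a minimal illustration in plain Python, not the code from the pull request: `batch_rows`, `encode_batch`, and `embed_all` are hypothetical names, and the "embedding" is a placeholder.

```python
def batch_rows(sentences, batch_size):
    """Yield consecutive batches of at most batch_size sentences."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


def encode_batch(batch):
    """Placeholder for a single model call; returns one fake vector per sentence."""
    return [[float(len(sentence))] for sentence in batch]


def embed_all(sentences, batch_size=8):
    """Embed all sentences and report how many model calls were needed."""
    embeddings = []
    calls = 0
    for batch in batch_rows(sentences, batch_size):
        embeddings.extend(encode_batch(batch))
        calls += 1
    return embeddings, calls


embeddings, calls = embed_all(["some sentence"] * 20, batch_size=8)
print(len(embeddings), "embeddings from", calls, "model calls")
```

With 20 rows and a batch size of 8, the model is invoked three times rather than twenty; on a GPU, fewer and larger calls are usually where the speed-up comes from.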

Bug Fixes & Enhancements

Optimizing Tokenizer performance up to 400% when there is an exceptions list. We have improved the exceptions list to be scalable to a large number of exceptions without impacting the overall performance https://github.com/JohnSnowLabs/spark-nlp/pull/7881
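The release note does not say how the exceptions list was made scalable. One common way to keep the per-token cost independent of the list size is a hash-based lookup, sketched below in plain Python. This is an assumption for illustration only, not the Spark NLP implementation; `tokenize` and its splitting rule are hypothetical.

```python
import re

# Split a chunk into runs of word characters and single punctuation marks.
_SPLIT = re.compile(r"\w+|[^\w\s]")


def tokenize(text, exceptions=()):
    """Whitespace-split the text, keeping exception strings as single tokens."""
    protected = frozenset(exceptions)  # hash lookup, independent of list size
    tokens = []
    for chunk in text.split():
        if chunk in protected:
            tokens.append(chunk)
        else:
            tokens.extend(_SPLIT.findall(chunk))
    return tokens


print(tokenize("Ask e-mail support at 9 a.m. today!", exceptions=["e-mail", "a.m."]))
print(tokenize("Ask e-mail support at 9 a.m. today!"))
```

The first call keeps `e-mail` and `a.m.` whole; the second splits them at the punctuation, which is the behaviour an exceptions list is meant to prevent.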

Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts https://github.com/JohnSnowLabs/spark-nlp/pull/8028

Fix bug that caused get input/output/LazyAnnotator to return None https://github.com/JohnSnowLabs/spark-nlp/pull/8043

Fix DeBertaForSequenceClassification in Python failing to load pretrained models https://github.com/JohnSnowLabs/spark-nlp/pull/8060

Fix missing Lemma and POS models from 3.4.3 release

Dependencies

Removing outdated trove4j dependency in favour of native Java modules https://github.com/JohnSnowLabs/spark-nlp/pull/8236

Upgrade the base Apache Spark to 2.4.8, 3.0.3, and 3.2.1

Upgrade Typesafe Config to 1.4.2

Upgrade sbt to 1.6.2

Models

Spark NLP 3.4.4 comes with more than 160 state-of-the-art multi-lingual pretrained models. Some of the featured models:

New DeBERTa Token Classification Models

New fine-tuned DeBERTa v3 models for token classifications over CoNLL03 and OntoNotes datasets that reach state-of-the-art metrics.

| Model | Name | Lang | F1 Dev |
|:----------------|:-----------|:-----|:-----|
| DeBertaForTokenClassification | deberta_v3_large_token_classifier_conll03 | en | 0.97 |
| DeBertaForTokenClassification | deberta_v3_base_token_classifier_conll03 | en | 0.96 |
| DeBertaForTokenClassification | deberta_v3_small_token_classifier_conll03 | en | 0.95 |
| DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_conll03 | en | 0.93 |
| DeBertaForTokenClassification | deberta_v3_large_token_classifier_ontonotes | en | 0.89 |
| DeBertaForTokenClassification | deberta_v3_base_token_classifier_ontonotes | en | 0.88 |
| DeBertaForTokenClassification | deberta_v3_small_token_classifier_ontonotes | en | 0.87 |
| DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_ontonotes | en | 0.86 |

New CamemBERT Models

| Model | Name | Lang |
|:----------------|:-----------|:-----|
| CamemBertEmbeddings | camembert_large | fr |
| CamemBertEmbeddings | camembert_base | fr |
| CamemBertEmbeddings | camembert_base_ccnet_4gb | fr |
| CamemBertEmbeddings | camembert_base_ccnet | fr |
| CamemBertEmbeddings | camembert_base_oscar_4gb | fr |
| CamemBertEmbeddings | camembert_base_wikipedia_4gb | fr |

New DistilBERT Embeddings Models

| Model | Name | Lang |
|:----------------|:-----------|:-----|
| DistilBertEmbeddings | distilbert_embeddings_distilbert_base_fr_cased | fr |
| DistilBertEmbeddings | distilbert_embeddings_marathi_distilbert | mr |
| DistilBertEmbeddings | distilbert_embeddings_distilbert_base_indonesian | id |
| DistilBertEmbeddings | distilbert_embeddings_javanese_distilbert_small | jv |
| DistilBertEmbeddings | distilbert_embeddings_malaysian_distilbert_small | ms |
| DistilBertEmbeddings | distilbert_embeddings_distilbert_base_ar_cased | ar |

New ALBERT Embeddings Models

| Model | Name | Lang |
|:----------------|:-----------|:-----|
| AlbertEmbeddings | albert_embeddings_fralbert_base | fr |
| AlbertEmbeddings | albert_embeddings_albert_base_arabic | ar |
| AlbertEmbeddings | albert_embeddings_marathi_albert_v2 | mr |
| AlbertEmbeddings | albert_embeddings_albert_fa_base_v2 | fa |
| AlbertEmbeddings | albert_embeddings_albert_large_bahasa_cased | ms |
| AlbertEmbeddings | albert_embeddings_marathi_albert | mr |

The complete list of all 5000+ models & pipelines in 200+ languages is available on Models Hub.

New Notebooks

Import CamemBERT models to Spark NLP 🚀

| Spark NLP | HuggingFace Notebooks | Colab |
|:------------|:-------------|:----------|
| CamemBertEmbeddings | HuggingFace in Spark NLP - CamemBERT | |

You can visit Import Transformers in Spark NLP for more info

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.4.4

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.4

spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.4

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.4
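Because the 3.4.4 artifact name depends on both the Spark version and CPU/GPU, it can help to derive the `--packages` coordinate programmatically. The helper below is a hypothetical convenience written for this note, not part of Spark NLP; it only encodes the Spark-to-artifact mapping listed above, with the GPU names following the `spark-nlp-gpu-sparkXX` pattern used in the Maven section.

```python
GROUP = "com.johnsnowlabs.nlp"
RELEASE = "3.4.4"

# Spark minor version -> (artifact suffix, Scala version), as listed above.
_SPARK_TO_ARTIFACT = {
    "3.0": ("", "2.12"),
    "3.1": ("", "2.12"),
    "3.2": ("-spark32", "2.12"),
    "2.4": ("-spark24", "2.11"),
    "2.3": ("-spark23", "2.11"),
}


def package_coordinate(spark_version, gpu=False):
    """Return the Maven coordinate to pass to --packages for this Spark version."""
    minor = ".".join(spark_version.split(".")[:2])
    if minor not in _SPARK_TO_ARTIFACT:
        raise ValueError(
            f"Spark {spark_version} is not listed for Spark NLP {RELEASE}")
    suffix, scala = _SPARK_TO_ARTIFACT[minor]
    gpu_part = "-gpu" if gpu else ""
    return f"{GROUP}:spark-nlp{gpu_part}{suffix}_{scala}:{RELEASE}"


print(package_coordinate("3.2.1", gpu=True))
print(package_coordinate("2.4.8"))
```

The returned string can be passed to `spark-shell --packages` or `pyspark --packages`; an unlisted Spark version raises an error instead of guessing an artifact.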

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp on Apache Spark 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark32_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.4.jar

GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.4.jar

CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.4.jar

GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.4.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.4.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.4.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.4.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.4.jar

What's Changed

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.4.3...3.4.4

New Contributors

@aymanechilah made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6956

@xusliebana @Ahmetemintek @jsl-models @Meryem1425 @mahmoodbayeshi @aymanechilah @DevinTDHa @agsfer @rpranab @C-K-Loan @maziyarpanahi @Damla-Gurbaz @danilojsl @luca-martial @muhammetsnts @josejuanmartinez @bunyamin-polat @galiph @jsl-builder @albertoandreottiATgmail
3.4.3 (Apr 12, 2022)
Overview

We are very excited to release Spark NLP 🚀 3.4.3! This release comes with a new DeBERTa for Sequence Classification annotator compatible with existing or fine-tuned models on HuggingFace 🤗, a new sigmoid activation function in addition to softmax to support multi-label models in all ForSequenceClassification annotators, new features added to SentenceDetectorDL, new features added to CoNLLU and Lemmatizer, and more than 600 new multi-lingual models for DeBERTa, BERT, DistilBERT, fastText, Lemmatizer and Part of Speech, and other improvements!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

NEW: Introducing DeBertaForSequenceClassification annotator in Spark NLP 🚀. DeBertaForSequenceClassification can load DeBERTa v2&v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using DebertaForSequenceClassification for PyTorch or TFDebertaForSequenceClassification for TensorFlow models in HuggingFace https://github.com/JohnSnowLabs/spark-nlp/pull/7713

New multi-label feature in all ForSequenceClassification annotators. The following annotators now have the option to switch to a sigmoid activation function instead of softmax for the output layer: AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, and XlnetForSequenceClassification https://github.com/JohnSnowLabs/spark-nlp/pull/7479
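The difference between the two activations is easiest to see on the same logits. The sketch below is plain Python and independent of Spark NLP: softmax makes the scores compete so a single label wins, while sigmoid scores each label on its own so several can pass a threshold. The `predict` helper, the label names, and the 0.5 threshold are assumptions for illustration.

```python
import math


def softmax(logits):
    """Normalise logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Squash each logit independently into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def predict(logits, labels, activation="softmax", threshold=0.5):
    """Return one label for softmax, or every label above threshold for sigmoid."""
    if activation == "softmax":
        probs = softmax(logits)
        best = max(range(len(labels)), key=lambda i: probs[i])
        return [labels[best]]
    probs = sigmoid(logits)
    return [label for label, p in zip(labels, probs) if p >= threshold]


labels = ["sports", "politics", "finance"]
logits = [2.0, 1.5, -3.0]

print(predict(logits, labels, "softmax"))  # single-label decision
print(predict(logits, labels, "sigmoid"))  # multi-label decision
```

With these made-up logits, softmax selects only `sports`, whereas sigmoid keeps both `sports` and `politics`, which is the behaviour a multi-label model needs.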

New minLength, maxLength, splitLength, customBounds, and useCustomBoundsOnly parameters in SentenceDetectorDL https://github.com/JohnSnowLabs/spark-nlp/pull/7214
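The release note only names the new parameters. As a sketch of how length rules of this kind are typically applied after sentence detection, the plain-Python example below assumes that `splitLength` forcibly splits long sentences at word boundaries and that `minLength`/`maxLength` then filter the resulting pieces by character count. These semantics and the function names are assumptions, not the SentenceDetectorDL implementation.

```python
def split_long(sentence, split_length):
    """Break a sentence into pieces of at most split_length characters,
    cutting only between words (a single longer word stays whole)."""
    pieces, current, current_len = [], [], 0
    for word in sentence.split(" "):
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and current_len + extra > split_length:
            pieces.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += extra
    if current:
        pieces.append(" ".join(current))
    return pieces


def apply_length_rules(sentences, min_length=0, max_length=None, split_length=None):
    """Split over-long sentences, then drop pieces outside the allowed lengths."""
    result = []
    for sentence in sentences:
        pieces = split_long(sentence, split_length) if split_length else [sentence]
        for piece in pieces:
            if len(piece) < min_length:
                continue
            if max_length is not None and len(piece) > max_length:
                continue
            result.append(piece)
    return result


detected = ["Hi.", "This is a much longer sentence here."]
print(apply_length_rules(detected, min_length=4, split_length=15))
```

Here the three-character sentence is dropped by the minimum length, and the long sentence is cut into pieces of at most 15 characters.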

New impossiblePenultimates in SentenceDetectorDLModel https://github.com/JohnSnowLabs/spark-nlp/pull/7685

New feature to set names for columns in CoNLLU class: textCol, documentCol, sentenceCol, formCol, uposCol, xposCol, and lemmaCol https://github.com/JohnSnowLabs/spark-nlp/pull/7344

New formCol and lemmaCol parameters in Lemmatizer annotator https://github.com/JohnSnowLabs/spark-nlp/pull/7344

Add new functionality to download and extract models from S3 via direct link https://github.com/JohnSnowLabs/spark-nlp/pull/7682

Enhancements

Fix and train new English spell checker models for Spark NLP 3.4.1 on Spark 3.x and 2.x

Update SentenceDetector Python and Scala documentation

Add a missing notebook to demonstrate training a WordSegmenterApproach annotator for word segmentation https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb

Models

New DeBERTa Classification Models

New fine-tuned DeBERTa v3 models for text classifications over IMDB reviews in English and Urdu, AG News categories in English, and Allocine French reviews.

| Model | Name | Lang |
|:----------------|:-----------|:-----|
| DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_imdb | ur |
| DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_allocine | fr |
| DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_imdb | en |
| DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_imdb | en |
| DeBertaForSequenceClassification | deberta_v3_base_sequence_classifier_imdb | en |
| DeBertaForSequenceClassification | deberta_v3_large_sequence_classifier_imdb | en |
| DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_ag_news | en |
| DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_ag_news | en |

New BERT Models

Spark NLP now has up to 250 state-of-the-art BERT models in 27 languages including Arabic, Bengali, Chinese, Dutch, English, Finnish, French, German, Greek, Hindi, Italian, Japanese, Javanese, Korean, Marathi, Panjabi, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Telugu, Turkish, Urdu, Vietnamese, and Multi-lingual.

| Model | Name | Lang |
|:----------------|:-----------|:-----|
| BertEmbeddings | bert_embeddings_ARBERT | ar |
| BertEmbeddings | bert_embeddings_German_MedBERT | de |
| BertEmbeddings | bert_embeddings_bangla_bert_base | bn |
| BertEmbeddings | bert_embeddings_bert_base_5lang_cased | zh |
| BertEmbeddings | bert_embeddings_bert_base_5lang_cased | fr |
| BertEmbeddings | bert_embeddings_bert_base_hi_cased | hi |
| BertEmbeddings | bert_embeddings_bert_base_it_cased | it |
| BertEmbeddings | bert_embeddings_bert_base | ko |
| BertEmbeddings | bert_embeddings_bert_base_tr_cased | tr |
| BertEmbeddings | bert_embeddings_bert_base_ur_cased | ur |
| BertEmbeddings | bert_embeddings_bert_base_vi_cased | vi |

New fastText Models

Over 128 new Word2Vec models in 128 languages made by fastText word embeddings.

| Model | Name | Lang |
|:----------------|:-----------|:-----|
| WordEmbeddingsModel | w2v_cc_300d | hi |
| WordEmbeddingsModel | w2v_cc_300d | azb |
| WordEmbeddingsModel | w2v_cc_300d | bo |
| WordEmbeddingsModel | w2v_cc_300d | diq |
| WordEmbeddingsModel | w2v_cc_300d | cy |
| WordEmbeddingsModel | w2v_cc_300d | ckb |
| WordEmbeddingsModel | w2v_cc_300d | el |
| WordEmbeddingsModel | w2v_cc_300d | es |

New Lemmatizer and Part of Speech Models

234 new Lemmatizer and Part of Speech models in 62 languages based on the new Universal Dependency treebank 2.9 release.

| Model | Name | Lang |
|:----------------|:-----------|:-----|
| LemmatizerModel | lemma_afribooms | af |
| LemmatizerModel | lemma_alksnis | lt |
| LemmatizerModel | lemma_alpino | nl |
| LemmatizerModel | lemma_arcosg | gd |
| LemmatizerModel | lemma_ancora | es |
| LemmatizerModel | lemma_ancora | ca |
| PerceptronModel | pos_mtg | te |
| PerceptronModel | pos_ttb | ta |
| PerceptronModel | pos_vtb | vi |
| PerceptronModel | pos_cac | cs |
| PerceptronModel | pos_btb | bg |
| PerceptronModel | pos_afribooms | af |

The complete list of all 4800+ models & pipelines in 200+ languages is available on Models Hub.

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.4.3

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.3

spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.3

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.3

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.3

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.4.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.4.3</version>
</dependency>

spark-nlp on Apache Spark 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark32_2.12</artifactId>
    <version>3.4.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
    <version>3.4.3</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.4.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.4.3</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.4.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.4.3</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.3.jar

GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.3.jar

CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.3.jar

GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.3.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.3.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.3.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.3.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.3.jar

What's Changed

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.4.2...3.4.3

New Contributors

@snosrap made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/7484

@gokhanturer made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/7654

@suvrat-joshi made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/7671

@vankov @gokhanturer @egenc @Cabir40 @xusliebana @suvrat-joshi @murat-gunay @snosrap @gadde5300 @jsl-models @Meryem1425 @DevinTDHa @agsfer @rpranab @diatrambitas @maziyarpanahi @Damla-Gurbaz @luca-martial @muhammetsnts @josejuanmartinez @bunyamin-polat @jsl-builder @albertoandreottiATgmail
3.4.2 (Mar 10, 2022)
Overview

We are pleased to release Spark NLP 🚀 3.4.2! This release comes with a new DeBERTa transformer for word embeddings, new caching to speed up training Word2Vec and Doc2Vec, new English and multi-lingual state-of-the-art models, and bug fixes!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

Introducing DeBertaEmbeddings annotator. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). This annotator is compatible with all the models trained/fine-tuned by using DebertaV2Model for PyTorch or TFDebertaV2Model for TensorFlow models (DeBERTa-v2 & DeBERTa-v3) in HuggingFace

Introducing a new param enableCaching in Doc2VecApproach to speed up the training

Introducing a new param enableCaching in Word2VecApproach to speed up the training

Support Databricks runtime 10.3, 10.3 ML, and 10.3 ML & GPU

Support EMR emr-5.34.0 and emr-6.5.0

Bug Fixes

Fix bestModelMetric param when the set value was ignored https://github.com/JohnSnowLabs/spark-nlp/pull/6978

New Notebooks

Import DeBERTa models to Spark NLP 🚀

| Spark NLP | HuggingFace Notebooks | Colab |
|:------------|:-------------|:----------|
| DeBertaEmbeddings | HuggingFace in Spark NLP - DeBERTa | |

You can visit Import Transformers in Spark NLP for more info

Models

New state-of-the-art DeBERTa models:

| Model | Name | Lang |
|:----------------|:-----------|:-----|
| DeBertaEmbeddings | deberta_v3_xsmall | en |
| DeBertaEmbeddings | deberta_v3_small | en |
| DeBertaEmbeddings | deberta_v3_base | en |
| DeBertaEmbeddings | deberta_v3_large | en |
| DeBertaEmbeddings | mdeberta_v3_base | xx |

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.4.2

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2

spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp on Apache Spark 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark32_2.12</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.4.2</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.2.jar

GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.2.jar

CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.2.jar

GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.2.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.2.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.2.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.2.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.2.jar

What's Changed

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.4.1...3.4.2

New Contributors

@mahmoodbayeshi made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6835

@bunyamin-polat made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6969

@agsfer @KshitizGIT @gadde5300 @kolia1985 @jsl-models @rpranab @josejuanmartinez @bunyamin-polat @maziyarpanahi @jsl-builder @Damla-Gurbaz @xusliebana @mahmoodbayeshi @luca-martial @dependabot @muhammetsnts @albertoandreottiATgmai
3.4.1(Feb 8, 2022)
Overview

We are pleased to release Spark NLP 🚀 3.4.1! This release comes with a TF session warmup for 3 annotators whose first inference was slower than the rest, a new param to choose which F1 metric to track when saving the best model during NerDL training, new T5 models for tasks such as text-to-SQL and grammar correction, new multilingual state-of-the-art models, and other bug fixes!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features & Enhancements

Implement TF Session warmup for MarianTransformer, T5Transformer, and GPT2Transformer annotators. The first inference for these annotators used to take 15-20 seconds; with the warmup session, all inferences, including the first, take about the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6773

Add bestModelMetric param to choose between micro-averaged and macro-averaged F1 when tracking the best model during NerDL training https://github.com/JohnSnowLabs/spark-nlp/pull/6749
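The difference between the two averages can matter a lot on imbalanced label sets. Below is a minimal sketch in plain Python, assuming simple per-token label comparison (this is not Spark NLP's own evaluation code, and the helper names `f1` and `micro_macro_f1` are made up for illustration). Micro-averaging pools the counts over all labels before computing F1, while macro-averaging computes F1 per label and then takes the unweighted mean.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for two equal-length label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it did not belong
            fn[g] += 1  # missed the gold label g
    labels = sorted(set(gold) | set(pred))
    # Micro: pool all counts, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: one F1 per label, then an unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Illustrative labels only: a frequent class (PER) and two rare ones.
gold = ["PER", "PER", "PER", "PER", "LOC", "ORG"]
pred = ["PER", "PER", "PER", "PER", "ORG", "ORG"]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the frequent label dominates the micro average, while the single missed rare label pulls the macro average down sharply, which is why the choice of tracked metric can change which epoch counts as "best".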

Add trimWhitespace and preservePosition params to RegexTokenizer https://github.com/JohnSnowLabs/spark-nlp/pull/6806

Add a new setSentenceMatch param to EntityRuler to match entities across documents/sentences and not just tokens https://github.com/JohnSnowLabs/spark-nlp/pull/6841

Add support for using the spark32 and real_time_output flags together in the sparknlp.start() function https://github.com/JohnSnowLabs/spark-nlp/pull/6822

Allow users to set tasks in the T5Transformer annotator

Bug Fixes

Fix random NullPointerException when using TensorFlow models without Kryo serialization https://github.com/JohnSnowLabs/spark-nlp/pull/6741

Fix RecursiveTokenizerModel not being readable in a saved Pipeline https://github.com/JohnSnowLabs/spark-nlp/pull/6748

Fix ContextSpellCheckerApproach not being trained on Databricks https://github.com/JohnSnowLabs/spark-nlp/pull/6750

Fix wrong order of tokens in ContextSpellCheckerModel when it's used with Sentence Detectors https://github.com/JohnSnowLabs/spark-nlp/pull/6799

Fix GraphExtraction when fullAnnotate and document are used at the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6845

Fix Word2VecModel being cast to Doc2VecModel by mistake https://github.com/JohnSnowLabs/spark-nlp/pull/6849

Fix broken sentence indexing in BertEmbeddings that impacted SentenceEmbeddings for text classification https://github.com/JohnSnowLabs/spark-nlp/pull/6867

Fix missing setExceptionsPath param in Tokenizer when it's used in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6868

Fix the wrong metric being reported when useBestModel was enabled. The documentation said micro-averaged F1, but it was in fact macro-averaged F1 (the option to choose which metric to track is now available as well)

Update broken slow unit tests https://github.com/JohnSnowLabs/spark-nlp/pull/6767

Models

New state-of-the-art models in English, French, Vietnamese, Dutch, and Indic languages (Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)

Featured Pretrained Models

| Model | Name | Lang |
|:----------------|:-----------|:-----|
| T5Transformer | t5_informal_to_formal_styletransfer | en |
| T5Transformer | t5_formal_to_informal_styletransfer | en |
| T5Transformer | t5_passive_to_active_styletransfer | en |
| T5Transformer | t5_active_to_passive_styletransfer | en |
| T5Transformer | t5_grammar_error_corrector | en |
| T5Transformer | t5_small_wikiSQL | en |
| LongformerEmbeddings | clinical_longformer | en |
| AlbertEmbeddings | albert_indic | xx |
| DistilBertEmbeddings | distilbert_base_cased | vi |
| BertForSequenceClassification | bert_sequence_classifier_news_sentiment | de |
| BertForSequenceClassification | bert_sequence_classifier_emotion | en |
| DistilBertForTokenClassification | distilbert_token_classifier_typo_detector | en |
| DistilBertForTokenClassification | distilbert_base_token_classifier_masakhaner | xx |
| WordEmbeddingsModel | word2vec_wiki_1000 | fr |
| WordEmbeddingsModel | word2vec_wac_200 | fr |
| WordEmbeddingsModel | w2v_cc_300d | fr |

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.4.1

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1

spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1
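The package coordinates in this release follow a regular pattern that matches the Maven artifact names: an optional `-gpu` part, an optional Spark-specific suffix, then the Scala version. The sketch below encodes that pattern for the 3.4.x artifacts listed in these notes. The function name `spark_nlp_coordinate` and the lookup table are illustrative helpers, not part of Spark NLP, and the pattern may not hold for other releases.

```python
# Suffix added to the artifact name for each Spark line, as listed in these notes.
SPARK_SUFFIX = {
    "3.0": "",
    "3.1": "",
    "3.2": "-spark32",
    "2.4": "-spark24",
    "2.3": "-spark23",
}


def spark_nlp_coordinate(spark_version, version, gpu=False):
    """Build the --packages coordinate for a Spark version such as '3.1.2'."""
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in SPARK_SUFFIX:
        raise ValueError(f"Spark {spark_version} is not listed for this release")
    # Spark 3.x builds use Scala 2.12; Spark 2.x builds use Scala 2.11.
    scala = "2.12" if major_minor.startswith("3.") else "2.11"
    gpu_part = "-gpu" if gpu else ""
    artifact = f"spark-nlp{gpu_part}{SPARK_SUFFIX[major_minor]}_{scala}"
    return f"com.johnsnowlabs.nlp:{artifact}:{version}"


print(spark_nlp_coordinate("3.1.2", "3.4.1"))
print(spark_nlp_coordinate("2.3.4", "3.4.1", gpu=True))
```

The returned string can be passed to `spark-shell --packages` or `pyspark --packages`.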

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.4.1</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.4.1</version> </dependency>

spark-nlp on Apache Spark 3.2.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark32_2.12</artifactId> <version>3.4.1</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark32_2.12</artifactId> <version>3.4.1</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.4.1</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.4.1</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.4.1</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.4.1</version> </dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.1.jar

GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.1.jar

CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.1.jar

GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.1.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.1.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.1.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.1.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.1.jar

What's Changed

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.4.0...3.4.1

New Contributors

@Cabir40 made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6685

@rpranab made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6830

@Meryem1425 made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6828

@Damla-Gurbaz made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6847

@diatrambitas @egenc @xyutech @Cabir40 @xusliebana @murat-gunay @KshitizGIT @jsl-models @Meryem1425 @HashamUlHaq @DevinTDHa @agsfer @rpranab @C-K-Loan @maziyarpanahi @Damla-Gurbaz @luca-martial @danilojsl @wolliq @muhammetsnts @pabla @josejuanmartinez @jsl-builder @albertoandreottiATgmail
3.4.0(Jan 5, 2022)
Overview

We are very excited to release Spark NLP 3.4.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community at the dawn of 2022! 🎉

Spark NLP 3.4.0 extends the support for Apache Spark 3.2.x major releases on Scala 2.12. We now support all 5 major Apache Spark and PySpark releases of 2.3.x, 2.4.x, 3.0.x, 3.1.x, and 3.2.x at once helping our community to migrate from earlier Apache Spark versions to newer releases without being worried about Spark NLP end of life support. We also extend support for new Databricks and EMR instances on Spark 3.2.x clusters.

This release also comes with a brand new GPT2Transformer using OpenAI GPT-2 models for prediction at scale, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer annotators to use existing or fine-tuned models for Sequence Classification, new distributed and trainable Word2Vec annotators, new state-of-the-art transformer models in many languages, a new param to useBestModel in NerDL during training, bug fixes, and lots more!

As always, we would like to thank our community for their feedback, questions, and feature requests.

Major features and improvements

NEW: Introducing GPT2Transformer annotator in Spark NLP 🚀 for Text Generation purposes. GPT2Transformer uses OpenAI GPT-2 models from HuggingFace 🤗 for prediction at scale in Spark NLP 🚀 . GPT-2 is a transformer model trained on a very large corpus of English data in a self-supervised fashion. This means it was trained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences
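To make the "automatic process to generate inputs and labels" concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization and a small fixed context window purely for illustration; real GPT-2 training uses byte-pair encoding and much longer contexts, and `next_word_examples` and `context_size` are hypothetical names, not part of Spark NLP or HuggingFace.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, next_word) training pairs with no human labels."""
    tokens = text.split()  # stand-in for GPT-2's byte-pair tokenizer
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]  # the words seen so far
        examples.append((context, tokens[i]))         # the label is simply the next word
    return examples


examples = next_word_examples("Spark NLP scales to large clusters")
for context, target in examples:
    print(context, "->", target)
```

Every position in the text yields one training example, which is why no manual annotation is needed.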

NEW: Introducing RoBertaForSequenceClassification annotator in Spark NLP 🚀. RoBertaForSequenceClassification can load RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using RobertaForSequenceClassification for PyTorch or TFRobertaForSequenceClassification for TensorFlow models in HuggingFace 🤗

NEW: Introducing XlmRoBertaForSequenceClassification annotator in Spark NLP 🚀. XlmRoBertaForSequenceClassification can load XLM-RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForSequenceClassification for PyTorch or TFXLMRobertaForSequenceClassification for TensorFlow models in HuggingFace 🤗

NEW: Introducing LongformerForSequenceClassification annotator in Spark NLP 🚀. LongformerForSequenceClassification can load Longformer Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using LongformerForSequenceClassification for PyTorch or TFLongformerForSequenceClassification for TensorFlow models in HuggingFace 🤗

NEW: Introducing AlbertForSequenceClassification annotator in Spark NLP 🚀. AlbertForSequenceClassification can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using AlbertForSequenceClassification for PyTorch or TFAlbertForSequenceClassification for TensorFlow models in HuggingFace 🤗

NEW: Introducing XlnetForSequenceClassification annotator in Spark NLP 🚀. XlnetForSequenceClassification can load XLNet Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using XLNetForSequenceClassification for PyTorch or TFXLNetForSequenceClassification for TensorFlow models in HuggingFace 🤗
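The "linear layer on top of the pooled output" shared by these ForSequenceClassification annotators can be sketched in a few lines of plain Python. This is an illustration of the idea only, with made-up weights and a 3-dimensional pooled vector; real models use hundreds of dimensions and learned parameters, and the function names here are hypothetical.

```python
import math


def linear_head(pooled, weights, bias):
    """logits[k] = sum_j weights[k][j] * pooled[j] + bias[k]"""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, bias, classes):
    """Return the most probable class and the full probability vector."""
    probs = softmax(linear_head(pooled, weights, bias))
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs


# Toy values: one weight row per class, one column per pooled dimension.
classes = ["Sports", "Business", "World", "Sci/Tech"]
pooled = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [0.5, 0.5, 0.5]]
bias = [0.0, 0.0, 0.0, 0.0]

label, probs = classify(pooled, weights, bias, classes)
print(label, [round(p, 3) for p in probs])
```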

NEW: Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML. You can train Word2Vec in a cluster on multiple machines to handle large-scale datasets and use the trained model for token-level classifications such as NerDL

Introducing useBestModel param in NerDLApproach annotator. This param in the NerDLApproach preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the training
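The priority order described for useBestModel can be read as a simple fallback chain. The sketch below is a plain-Python paraphrase of that description, not NerDLApproach's actual code; `tracked_metric` and `best_epoch` are hypothetical helper names, and the per-epoch numbers are invented.

```python
def tracked_metric(has_test_dataset, validation_split):
    """Pick what to track, following the stated priority order."""
    if has_test_dataset:
        return ("testDataset F1", "max")       # higher is better
    if validation_split > 0:
        return ("validationSplit F1", "max")   # higher is better
    return ("training loss", "min")            # lower is better


def best_epoch(history, mode):
    """Index of the epoch whose tracked value is best under the given mode."""
    pick = max if mode == "max" else min
    return pick(range(len(history)), key=history.__getitem__)


name, mode = tracked_metric(has_test_dataset=False, validation_split=0.2)
print(name, "-> restore epoch", best_epoch([0.71, 0.78, 0.76], mode))
```

The model from the selected epoch is the one that would be preserved and restored at the end of training.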

Support Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.0.x/3.1.x, but now you have spark-nlp-spark32 and spark-nlp-gpu-spark32 packages

Adding a new param to sparknlp.start() function in Python for Apache Spark 3.2.x (spark32=True)

Update Colab and Kaggle scripts for faster setup. We no longer need to remove Java 11 in order to install Java 8 since Spark NLP works on Java 11. This makes the installation of Spark NLP on Colab and Kaggle as fast as pip install spark-nlp pyspark==3.1.2

Add new scripts/notebook to generate custom TensorFlow graphs for ContextSpellCheckerApproach annotator

Add a new graphFolder param to ContextSpellCheckerApproach annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph

Support DBFS file system in graphFolder param. Starting with Spark NLP 3.4.0, you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks

Add a new feature to all classifiers (ForTokenClassification and ForSequenceClassification) to retrieve classes from the pretrained models

from sparknlp.annotator import XlmRoBertaForSequenceClassification

sequenceClassifier = XlmRoBertaForSequenceClassification \
    .pretrained('xlm_roberta_base_sequence_classifier_ag_news', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class')

# list the labels the pretrained model can predict
print(sequenceClassifier.getClasses())
# Sports, Business, World, Sci/Tech

Add inputFormats param to DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now take a list of acceptable input formats, given as date patterns, to search for in the text. The outputFormat param then defines the single pattern used for all matched dates.

from sparknlp.annotator import DateMatcher

# setOutputFormat was previously called setDateFormat
date_matcher = DateMatcher() \
    .setInputCols(['document']) \
    .setOutputCol("date") \
    .setInputFormats(["yyyy", "yyyy/dd/MM", "MM/yyyy"]) \
    .setOutputFormat("yyyyMM") \
    .setSourceLanguage("en")
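The idea of several accepted input patterns feeding one output pattern can be illustrated with Python's standard `datetime` module. This is only an analogy under stated assumptions: the Java-style patterns above are translated by hand to `strptime` codes, the whole string must match (DateMatcher searches inside running text), and `normalize_date` is a hypothetical helper, not a Spark NLP function.

```python
from datetime import datetime

# Hand-translated equivalents of "yyyy", "yyyy/dd/MM", "MM/yyyy".
INPUT_FORMATS = ["%Y", "%Y/%d/%m", "%m/%Y"]
# Equivalent of the single output pattern "yyyyMM".
OUTPUT_FORMAT = "%Y%m"


def normalize_date(text):
    """Try each accepted input format in turn; emit the one output format."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this pattern does not fit, try the next one
        return parsed.strftime(OUTPUT_FORMAT)
    return None  # no accepted pattern matched


for sample in ["2021/25/11", "11/2021", "2021", "yesterday"]:
    print(sample, "->", normalize_date(sample))
```

Fields missing from the input (for example the month in a bare year) fall back to `strptime` defaults here; how DateMatcher fills such gaps is not covered by this sketch.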

Enable batch processing in T5Transformer and MarianTransformer annotators

Add Schema to readDataset in CoNLL() class

Welcoming 6x new Databricks runtimes to our Spark NLP family:

Databricks 10.0

Databricks 10.0 ML GPU

Databricks 10.1

Databricks 10.1 ML GPU

Databricks 10.2

Databricks 10.2 ML GPU

Welcoming 3x new EMR 5.x and 6.x releases to our Spark NLP family:

EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)

EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)

EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)

Bug Fixes

Fix a race condition in cluster mode where, on first access, the TF session was loaded as many times as the number of available cores on the Driver machine. Loading a model multiple times at once results in higher disk usage, and IO may become a bottleneck for larger models, especially on a machine with slower disks. Thanks to @jerrychenhf for finding this issue and offering a solution https://github.com/JohnSnowLabs/spark-nlp/pull/6575
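A common way to avoid this kind of duplicate loading is lock-guarded lazy initialization with a second check inside the lock. The sketch below shows that general pattern in plain Python threads; it is an assumption about the class of fix, not the code from the linked pull request, and `SharedModel` and `load_model` are hypothetical names.

```python
import threading


class SharedModel:
    """Load an expensive resource at most once, even under concurrent first access."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        if self._model is None:          # fast path: no locking once loaded
            with self._lock:
                if self._model is None:  # re-check: another thread may have won the race
                    self._model = self._loader()
        return self._model


load_count = 0


def load_model():
    """Stand-in for reading a large model from disk."""
    global load_count
    load_count += 1
    return object()


shared = SharedModel(load_model)
threads = [threading.Thread(target=shared.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("model loaded", load_count, "time(s)")
```

Without the lock and the second check, each of the eight threads could see an empty slot and trigger its own load, which is the disk and IO pressure the fix description refers to.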

Fix a performance issue introduced in the 3.3.3 release for T5Transformer and MarianTransformer annotators. While we added support for ignored tokens, accidentally we introduced a bug that degraded the performance for these two annotators (sometimes up to 2x slower). Please update to 3.4.0 if you are using any of these two annotators https://github.com/JohnSnowLabs/spark-nlp/pull/6605

Fix a bug in model resolution by not filtering based on the timestamp

Fix configProtoBytes param type in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6549

Fix missing DefaultParamsReadable in RegexTokenizer annotator https://github.com/JohnSnowLabs/spark-nlp/pull/6653

Fix missing models lemma_antbnc, sentiment_vivekn, and spellcheck_norvig for Spark 3.x

Fix missing pipelines clean_slang, check_spelling, match_chunks, and match_datetime for Spark 3.x

Fix saveModel in TrainingHelper

Fix Keyword/Yake module naming in Scala https://github.com/JohnSnowLabs/spark-nlp/pull/6562

Models Hub

Models Hub now comes with new features to easily filter and find your desired models & pipelines by:

NLP Task

Natural Language

Spark NLP version

In addition, you can also filter models & pipelines by:

Models or Pipelines (finally! 😃 )

Tags used inside Model's card

Or even by predicted entities (which labels/classes a model can predict)

As always, you can host your own pre-trained models & pipelines easily accessible to you for free & forever! 🚀

Models and Pipelines

Spark NLP 3.4.0 comes with state-of-the-art pre-trained transformer models. Models Hub supports over 15 NLP tasks: Named Entity Recognition, Text Classification, Sentiment Analysis, Translation, Question Answering, Summarization, Sentence Detection, Embeddings, Language Detection, Stop Words Removal, Word Segmentation, Part of Speech Tagging, Lemmatization, Spell Check, Dependency Parser, and Text Generation

Featured Models

| Model | Name | Lang |
|:---------------------|:-------------------|:---|
| GPT2Transformer | gpt2_distilled | en |
| GPT2Transformer | gpt2 | en |
| GPT2Transformer | gpt2_medium | en |
| GPT2Transformer | gpt2_large | en |
| XlmRoBertaForSequenceClassification | xlm_roberta_base_sequence_classifier_imdb | en |
| XlmRoBertaForSequenceClassification | xlm_roberta_base_sequence_classifier_allocine | fr |
| XlmRoBertaForSequenceClassification | xlm_roberta_base_sequence_classifier_ag_news | en |
| RoBertaForSequenceClassification | roberta_base_sequence_classifier_imdb | en |
| RoBertaForSequenceClassification | roberta_base_sequence_classifier_ag_news | en |
| AlbertForSequenceClassification | albert_base_sequence_classifier_ag_news | en |
| AlbertForSequenceClassification | albert_base_sequence_classifier_imdb | en |
| LongformerForSequenceClassification | longformer_base_sequence_classifier_ag_news | en |
| LongformerForSequenceClassification | longformer_base_sequence_classifier_imdb | en |
| BertForSequenceClassification | bert_sequence_classifier_sentiment | it |
| BertForSequenceClassification | bert_sequence_classifier_finbert_tone | en |
| BertForSequenceClassification | bert_sequence_classifier_toxicity | ru |
| XlnetForSequenceClassification | xlnet_base_sequence_classifier_imdb | en |
| XlnetForSequenceClassification | xlnet_base_sequence_classifier_ag_news | en |
| RoBertaForTokenClassification | roberta_token_classifier_bne_capitel_ner | es |
| RoBertaForTokenClassification | roberta_token_classifier_icelandic_ner | is |
| RoBertaForTokenClassification | roberta_token_classifier_ticker | en |
| RoBertaForTokenClassification | roberta_token_classifier_pos_tagger | id |
| RoBertaForTokenClassification | roberta_token_classifier_timex_semeval | en |
| XlmRoBertaForTokenClassification | xlm_roberta_large_token_classifier_masakhaner | xx |
| XlmRoBertaForTokenClassification | xlm_roberta_base_token_classifier_ner | tr |
| XlmRoBertaForTokenClassification | xlm_roberta_large_token_classifier_ner | id |
| XlmRoBertaForTokenClassification | xlm_roberta_large_token_classifier_conll03 | de |
| XlmRoBertaForTokenClassification | xlm_roberta_large_token_classifier_hrl | xx |
| BertForTokenClassification | bert_hi_en_ner | hi |
| BertForTokenClassification | bert_token_classifier_scandi_ner | xx |
| BertForTokenClassification | bert_token_classifier_hi_en_ner | hi |
| BertForTokenClassification | bert_token_classifier_dutch_udlassy_ner | nl |
| BertForTokenClassification | bert_token_classifier_chinese_ner | zh |
| DistilBertEmbeddings | distilbert_uncased | te |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_swahili | sw |
| BertEmbeddings | bert_base_finnish_uncased | fr |
| BertEmbeddings | bert_base_finnish_cased | fi |
| BertEmbeddings | electra_medal_acronym | en |
| ClassifierDLModel | classifierdl_urduvec_fakenews | ur |
| ClassifierDLModel | classifierdl_bert_news | ur |
| NerDLModel | nerdl_restaurant_100d | en |
| Word2VecModel | word2vec_gigaword_wiki_300 | en |
| Word2VecModel | word2vec_gigaword_300 | en |

Spark NLP covers the following languages:

English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) 
,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

The complete list of all 4100+ models & pipelines in 230+ languages is available on Models Hub

Backward Compatibility

The parameter dateFormat in DateMatcher and MultiDateMatcher annotators has been renamed to outputFormat:

# previously
.setDateFormat("yyyy/MM/dd")
# after 3.4.0 release
.setOutputFormat("yyyy/MM/dd")

Deprecating xling TF Hub models for UniversalSentenceEncoder annotator (there are CMLM models available which outperform xling models with support for more languages)

Deprecating Finnish old BERT models (there are newer models available now)

New Notebooks

Import hundreds of models in different languages to Spark NLP

Spark NLP | HuggingFace Notebooks | Colab
:------------ | :-------------| :----------|
AlbertForSequenceClassification | HuggingFace in Spark NLP - AlbertForSequenceClassification |
RoBertaForSequenceClassification | HuggingFace in Spark NLP - RoBertaForSequenceClassification |
XlmRoBertaForSequenceClassification | HuggingFace in Spark NLP - XlmRoBertaForSequenceClassification |
XlnetForSequenceClassification | HuggingFace in Spark NLP - XlnetForSequenceClassification |

You can visit Import Transformers in Spark NLP for more info

New Word2Vec notebook

Spark NLP | Jupyter Notebook
:------------ | :-------------|
Word2VecApproach | Train Word2Vec and NER models

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.4.0

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.0

spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.0

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.0

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.0

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.4.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.4.0</version> </dependency>

spark-nlp on Apache Spark 3.2.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark32_2.12</artifactId> <version>3.4.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark32_2.12</artifactId> <version>3.4.0</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.4.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.4.0</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.4.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.4.0</version> </dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.0.jar

GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.0.jar

CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.0.jar

GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.0.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.0.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.0.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.0.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.0.jar

What's Changed

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.3.4...3.4.0

New Contributors

@galiph made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6528

@Ahmetemintek made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6531

@xyutech made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6547

@KshitizGIT made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6550

@luca-martial made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6642

@Cabir40 made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6685

@vankov @xyutech @Cabir40 @murat-gunay @Ahmetemintek @KshitizGIT @gadde5300 @jsl-models @DevinTDHa @agsfer @diatrambitas @maziyarpanahi @luca-martial @danilojsl @wolliq @muhammetsnts @pabla @josejuanmartinez @jsl-builder @galiph @albertoandreottiATgmail
3.3.4 (Nov 25, 2021)
Patch release

Fix ClassCastException error in pretrained function for DistilBertForSequenceClassification in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6513

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP publications

Spark NLP in Action

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.3.4

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.3.4</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.3.4</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.3.4</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.3.4</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.3.4</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.3.4</version> </dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.4.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.4.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.4.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.4.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.4.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.4.jar
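The Spark Packages and Maven coordinates listed for this release differ only in the artifact name and the Scala suffix, both of which depend on the Apache Spark version. As a rough sketch of that mapping (the helper `spark_nlp_coordinate` is hypothetical and only encodes the combinations listed above):

```python
def spark_nlp_coordinate(spark_version: str, version: str = "3.3.4", gpu: bool = False) -> str:
    """Return the `--packages` coordinate for a given Apache Spark version.

    Hypothetical helper: it mirrors the artifact names listed in these
    release notes and rejects any Spark version they do not mention.
    """
    major, minor = spark_version.split(".")[:2]
    key = f"{major}.{minor}"
    if key in ("3.0", "3.1"):
        artifact, scala = "spark-nlp", "2.12"
    elif key in ("2.4", "2.3"):
        artifact, scala = f"spark-nlp-spark{major}{minor}", "2.11"
    else:
        raise ValueError(f"no artifact listed for Apache Spark {spark_version}")
    if gpu:
        # GPU builds insert "-gpu" right after the "spark-nlp" prefix.
        artifact = artifact.replace("spark-nlp", "spark-nlp-gpu", 1)
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{version}"


cpu_31 = spark_nlp_coordinate("3.1.2")
gpu_23 = spark_nlp_coordinate("2.3.4", gpu=True)
```

The returned string can be passed to `spark-shell --packages` or `pyspark --packages`, or split on `:` to fill in the Maven `groupId`, `artifactId`, and `version` fields.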

What's Changed

Update documentation of ChunkKeyPhraseExtraction by @vankov in https://github.com/JohnSnowLabs/spark-nlp/pull/6508

Fixes new instantiation in scala section by @josejuanmartinez in https://github.com/JohnSnowLabs/spark-nlp/pull/6469

Fix the wrong name for DistilBertForSequenceClassification in Python by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/6513

Release/334 release candidate by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/6514

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.3.3...3.3.4
3.3.3 (Nov 22, 2021)
Overview

(knock, knock, knock) Penny? Yes, this is a very special release if you are obsessed with the number 3 as much as we are! So we are pleased to announce Spark NLP 🚀 3.3.3 release! 🎉 🎊 🎈

This release comes with a new DistilBertForSequenceClassification annotator for existing or fine-tuned DistilBERT text classification models on HuggingFace, a new distributed and trainable Doc2Vec annotator based on the Word2Vec implementation in Spark ML, improvements to BertEmbeddings and BertSentenceEmbeddings on a single machine with a GPU device when the DataFrame has 1 sentence per row or the input column is set to document, new state-of-the-art fine-tuned DistilBERT models for Sequence Classification, enhancements, bug fixes, and more!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features and Enhancements

NEW: Introducing DistilBertForSequenceClassification annotator in Spark NLP 🚀. DistilBertForSequenceClassification can load DistilBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using DistilBertForSequenceClassification or TFDistilBertForSequenceClassification in HuggingFace 🤗

NEW: Introducing trainable and distributed Doc2Vec annotators based on Word2Vec in Spark ML

Improving BertEmbeddings on a single machine with a GPU device when each DataFrame row holds a single document/sentence

Improving BertSentenceEmbeddings on a single machine with a GPU device when each DataFrame row holds a single document/sentence

Add a new feature to the CoNLL() class, allowing it to read multiple CoNLL files at the same time into a single DataFrame

Add support for Long type in label column for ClassifierDLApproach and SentimentDLApproach

Add a script to set up AWS SageMaker, thanks to @xegulon

Add instructions to set up Amazon Linux 2
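The Doc2Vec annotator above is described as being based on Word2Vec in Spark ML. One common way to derive a document vector from word vectors is to average them; the sketch below illustrates only that idea, under the assumption that Doc2Vec works along similar lines. The function name, the toy vectors, and the choice to skip out-of-vocabulary tokens are all hypothetical and not taken from Spark NLP.

```python
from typing import Dict, List


def average_document_vector(
    tokens: List[str], word_vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of the tokens that have an embedding.

    Skipping unknown tokens is a design choice of this sketch; a document
    with no known tokens maps to the zero vector.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 2-dimensional embeddings, purely illustrative.
toy_vectors = {"spark": [1.0, 0.0], "nlp": [0.0, 1.0]}
doc_vector = average_document_vector(["spark", "nlp", "rocks"], toy_vectors, dim=2)
```

A fixed-size document vector like this can then be fed to a downstream text classifier.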

Bug Fixes

Improve model and pipeline resolution in Spark NLP for cases where the wrong models/pipelines were downloaded regardless of their Apache Spark version

Fix MarianTransformer bug on empty sequences

Fix TFInvalidArgumentException in MarianTransformer for sequences longer than 512

Fix MarianTransformer multi-lingual models and pipelines such as opus_mt_mul_en

Fix a bug in DateMatcher and MultiDateMatcher that mistakenly detected months inside subwords

Add the missing lemma_antbnc model to Models Hub

Add the missing sentiment_vivekn model to Models Hub

Add the missing spellcheck_norvig model to Models Hub
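The DateMatcher fix above concerns months being detected inside subwords. The following sketch shows the general class of problem with plain regular expressions, and how word boundaries avoid it. It is an illustration only and makes no claim about how DateMatcher or MultiDateMatcher are implemented; the patterns and sample text are assumptions.

```python
import re

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Naive pattern: matches a month abbreviation anywhere, even inside another word.
naive = re.compile("|".join(MONTHS), re.IGNORECASE)

# Bounded pattern: \b requires the abbreviation to stand alone as a word.
bounded = re.compile(r"\b(?:" + "|".join(MONTHS) + r")\b", re.IGNORECASE)

text = "The market opens in Mar"
naive_hits = naive.findall(text)      # also picks up "mar" inside "market"
bounded_hits = bounded.findall(text)  # only the standalone "Mar"
```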

Models

New state-of-the-art fine-tuned DistilBERT models for Sequence Classification:

Featured Pretrained Models

| Model | Name | Lang | Build |
|:------|:-----|:-----|:------|
| DistilBertForSequenceClassification | distilbert_sequence_classifier_sst2 | en | 3.3.3 |
| DistilBertForSequenceClassification | distilbert_sequence_classifier_policy | en | 3.3.3 |
| DistilBertForSequenceClassification | distilbert_sequence_classifier_industry | en | 3.3.3 |
| DistilBertForSequenceClassification | distilbert_sequence_classifier_emotion | en | 3.3.3 |
| DistilBertForSequenceClassification | distilbert_sequence_classifier_banking77 | en | 3.3.3 |
| DistilBertForSequenceClassification | distilbert_multilingual_sequence_classifier_allocine | fr | 3.3.3 |
| DistilBertForSequenceClassification | distilbert_base_sequence_classifier_imdb | ur | 3.3.3 |
| DistilBertForSequenceClassification | distilbert_base_sequence_classifier_imdb | en | 3.3.3 |
| DistilBertForSequenceClassification | distilbert_base_sequence_classifier_amazon_polarity | en | 3.3.3 |
| DistilBertForSequenceClassification | distilbert_base_sequence_classifier_ag_news | en | 3.3.3 |
| Doc2VecModel | doc2vec_gigaword_300 | en | 3.3.3 |
| Doc2VecModel | doc2vec_gigaword_wiki_300 | en | 3.3.3 |

The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.

New Notebooks

| Spark NLP | Notebooks |
|:----------|:----------|
| DistilBertForSequenceClassification | HuggingFace in Spark NLP - DistilBertForSequenceClassification |
| Doc2Vec | Train Doc2Vec for Text Classification |

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.3.3

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.3.3</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.3.3</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.3.3</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.3.3</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.3.3</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.3.3</version> </dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.3.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.3.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.3.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.3.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.3.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.3.jar

What's Changed

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.3.2...3.3.3

New Contributors

@xegulon made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6449

@DevinTDHa @diatrambitas @xegulon @egenc @gadde5300 @jsl-models @murat-gunay @josejuanmartinez @maziyarpanahi @jsl-builder @wolliq @xusliebana @agsfer @danilojsl @vankov @muhammetsnts @albertoandreottiATgmail
3.3.2 (Nov 3, 2021)
Overview

We are pleased to release Spark NLP 🚀 3.3.2! This release comes with a new BertForSequenceClassification annotator for existing or fine-tuned models on HuggingFace, a new logging feature during training with Comet.ml, new state-of-the-art fine-tuned BERT models for Sequence Classification, and bug fixes!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

Introducing BertForSequenceClassification annotator. BertForSequenceClassification can load BERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using BertForSequenceClassification (PyTorch) or TFBertForSequenceClassification (TensorFlow) in HuggingFace 🤗

New support for Comet.ml in Spark NLP to build better models faster.

Comet enables data scientists and teams to track, compare, explain and optimize experiments and models across the model's entire lifecycle, from training to production. With just two lines of code, you can start building better models today.

Comet SparkNLP Integration Notebook
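BertForSequenceClassification is described above as a sequence classification head, that is, a linear layer on top of the pooled output. As a minimal sketch of what such a head computes, the snippet below applies a linear layer and a softmax to a single pooled vector. The weights, labels, and function names are toy assumptions for illustration, not values or APIs from Spark NLP or HuggingFace.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    pooled: List[float],
    weights: List[List[float]],
    bias: List[float],
    labels: List[str],
) -> Tuple[str, List[float]]:
    """Linear layer (one weight row per label) followed by softmax and argmax."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy pooled output of size 2 and a 2-class head.
label, probs = classify(
    pooled=[1.0, -1.0],
    weights=[[2.0, 0.0], [0.0, 2.0]],
    bias=[0.0, 0.0],
    labels=["pos", "neg"],
)
```

In a real model the pooled vector comes from the transformer encoder and the head weights are learned during fine-tuning.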

Bug Fixes and Enhancements

Fix a missing batchSize param in NerDLModel that degraded GPU performance by not allowing users to change the default batchSize

Fix NerDLApproach logs format on Databricks

Fix EntityRulerApproach name from import

Fix missing EntityRulerModel in ResourceDownloader

Faster Colab setup script for pyspark 3.0.x and 3.1.x on Java 11

Models

New state-of-the-art fine-tuned BERT models for Sequence Classification in English, French, German, Spanish, Japanese, Turkish, and Russian, plus multilingual models.

Featured Pretrained Models

| Model | Name | Build | Lang |
|:------|:-----|:------|:-----|
| BertForSequenceClassification | bert_multilingual_sequence_classifier_allocine | 3.3.2 | fr |
| BertForSequenceClassification | bert_large_sequence_classifier_imdb | 3.3.2 | en |
| BertForSequenceClassification | bert_base_sequence_classifier_imdb | 3.3.2 | en |
| BertForSequenceClassification | bert_base_sequence_classifier_ag_news | 3.3.2 | en |
| BertForSequenceClassification | bert_base_sequence_classifier_dbpedia_14 | 3.3.2 | en |
| BertForSequenceClassification | bert_sequence_classifier_turkish_sentiment | 3.3.2 | tr |
| BertForSequenceClassification | bert_sequence_classifier_sentiment | 3.3.2 | de |
| BertForSequenceClassification | bert_sequence_classifier_rubert_sentiment | 3.3.2 | ru |
| BertForSequenceClassification | bert_sequence_classifier_multilingual_sentiment | 3.3.2 | xx |
| BertForSequenceClassification | bert_sequence_classifier_japanese_sentiment | 3.3.2 | ja |
| BertForSequenceClassification | bert_sequence_classifier_finbert | 3.3.2 | en |
| BertForSequenceClassification | bert_sequence_classifier_dehatebert_mono | 3.3.2 | en |
| BertForSequenceClassification | bert_sequence_classifier_beto_sentiment_analysis | 3.3.2 | es |
| BertForSequenceClassification | bert_sequence_classifier_beto_emotion_analysis | 3.3.2 | es |

The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.

New Notebooks

| Spark NLP | Notebooks |
|:----------|:----------|
| BertForSequenceClassification | HuggingFace in Spark NLP - BertForSequenceClassification |
| Comet.ml | Comet SparkNLP Integration Notebook |

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.3.2

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.2

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.2

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.2

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.3.2</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.3.2</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.3.2</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.3.2</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.3.2</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.3.2</version> </dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.2.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.2.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.2.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.2.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.2.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.2.jar

3.3.1 (Oct 18, 2021)
Overview

We are pleased to release Spark NLP 🚀 3.3.1! This release comes with a new EntityRuler annotator, better compatibility between TokenClassification annotators and other annotators in Spark NLP pipeline, new state-of-the-art XLM-RoBERTa models in African Languages, and bug fixes!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

Introducing EntityRuler annotators, which receive either a JSON or CSV ontology file that maps entities to patterns. You can implement a purely rule-based entity recognition system by using EntityRuler. It can be saved as a Model and reused in other pipelines to annotate your documents against your knowledge base.

Access EntityRuler Documentation
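To make the idea of an ontology that maps entities to patterns concrete, here is a small pure-Python sketch of rule-based entity matching. It does not use the EntityRuler API; the JSON layout (`label` and `patterns` keys), the helper `find_entities`, and the sample text are assumptions made for illustration, so check the EntityRuler documentation for the actual file format.

```python
import json
import re
from typing import Dict, List, Tuple

# Hypothetical ontology: each entry maps one entity label to literal patterns.
ONTOLOGY_JSON = """
[
  {"label": "PERSON", "patterns": ["John Snow", "Jon Snow"]},
  {"label": "LOCATION", "patterns": ["Winterfell"]}
]
"""


def find_entities(text: str, ontology: List[Dict]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, label, matched text) for every literal pattern hit."""
    matches = []
    for entry in ontology:
        for pattern in entry["patterns"]:
            # re.escape treats the pattern as literal text, not as a regex.
            for m in re.finditer(re.escape(pattern), text):
                matches.append((m.start(), m.end(), entry["label"], m.group()))
    return sorted(matches)


ontology = json.loads(ONTOLOGY_JSON)
entities = find_entities("John Snow rode back to Winterfell.", ontology)
```

Because the rules live in a data file rather than in code, the same matcher can be reused across pipelines by swapping the ontology.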

Bug Fixes

Fix compatibility issue between NerOverwriter and AlbertForTokenClassification, BertForTokenClassification, DistilBertForTokenClassification, LongformerForTokenClassification, RoBertaForTokenClassification, XlmRoBertaForTokenClassification, XlnetForTokenClassification annotators

Fix a bug in ContextSpellCheckerApproach annotator failing to find an appropriate TF graph

Fix a bug in ContextSpellCheckerModel not being able to load a trained model

Fix token alignment with token pieces in BertEmbeddings resulting in missing vectors with Unicode characters

Add the missing pretrained NER models for the XlmRoBertaForTokenClassification annotator

Add the missing pretrained NER models for the LongformerForTokenClassification annotator

Backward compatibility

Renaming YakeModel to YakeKeywordExtraction to represent the actual purpose of this annotator more clearly.

Models and Pipelines

New state-of-the-art XLM-RoBERTa models in Luganda, Naija, Yoruba, Hausa, Kinyarwanda, Wolof, Igbo, Amharic, Swahili, and Luo.

New Transformer Models

| Model | Name | Build | Lang |
|:------|:-----|:------|:-----|
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_yoruba | 3.3.1 | yo |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_wolof | 3.3.1 | wo |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_naija | 3.3.1 | pcm |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_swahili | 3.3.1 | sw |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_luganda | 3.3.1 | lg |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_kinyarwanda | 3.3.1 | rw |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_hausa | 3.3.1 | ha |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_igbo | 3.3.1 | ig |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_amharic | 3.3.1 | am |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_yoruba | 3.3.1 | yo |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_wolof | 3.3.1 | wo |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_swahili | 3.3.1 | sw |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_naija | 3.3.1 | pcm |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_luo | 3.3.1 | lou |

The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.

New Notebooks

| Spark NLP | Jupyter Notebooks |
|:----------|:------------------|
| EntityRuler | EntityRuler |
| EntityRuler | EntityRuler_LightPipeline |
| EntityRuler | EntityRuler_Whitout_Storage |

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.3.1

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.1

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.1

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.1

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.3.1</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.3.1</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.3.1</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.3.1</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.3.1</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.3.1</version> </dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.1.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.1.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.1.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.1.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.1.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.1.jar

3.3.0 (Sep 29, 2021)
Overview

We are very excited to release Spark NLP 🚀 3.3.0! This release comes with new annotators for existing or fine-tuned ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer Token Classification models on HuggingFace 🤗, up to 50x faster saving of Spark NLP models & pipelines, no more 2G limit on the size of imported TensorFlow models, lots of new functions to filter and display pretrained models & pipelines inside Spark NLP, bug fixes, and more!

We are proud to say Spark NLP 3.3.0 is still compatible across all major releases of Apache Spark used locally, by all Cloud providers such as EMR, and all managed services such as Databricks. The major releases of Apache Spark include Apache Spark 3.0.x/3.1.x (spark-nlp), Apache Spark 2.4.x (spark-nlp-spark24), and Apache Spark 2.3.x (spark-nlp-spark23).

As always, we would like to thank our community for their feedback, questions, and feature requests.

Major features and improvements

NEW: Starting with the Spark NLP 3.3.0 release, there is no size limit when you import TensorFlow models! You can now import TF Hub & HuggingFace models larger than 2 gigabytes.

NEW: Up to 50x faster saving Spark NLP models and pipelines! We have improved the way we package TensorFlow SavedModel while saving Spark NLP models & pipelines. For instance, it used to take up to 10 minutes to save the xlm_roberta_base model before Spark NLP 3.3.0, and now it only takes up to 15 seconds!

NEW: Introducing AlbertForTokenClassification annotator in Spark NLP 🚀. AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using AlbertForTokenClassification or TFAlbertForTokenClassification in HuggingFace 🤗

NEW: Introducing XlnetForTokenClassification annotator in Spark NLP 🚀. XlnetForTokenClassification can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using XLNetForTokenClassification or TFXLNetForTokenClassification in HuggingFace 🤗

NEW: Introducing RoBertaForTokenClassification annotator in Spark NLP 🚀. RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using RobertaForTokenClassification or TFRobertaForTokenClassification in HuggingFace 🤗

NEW: Introducing XlmRoBertaForTokenClassification annotator in Spark NLP 🚀. XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForTokenClassification or TFXLMRobertaForTokenClassification in HuggingFace 🤗

NEW: Introducing LongformerForTokenClassification annotator in Spark NLP 🚀. LongformerForTokenClassification can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using LongformerForTokenClassification or TFLongformerForTokenClassification in HuggingFace 🤗

NEW: Introducing new ResourceDownloader functions to easily look for pretrained models & pipelines inside Spark NLP (Python and Scala). You can filter models or pipelines via language, version, or the name of the annotator

from sparknlp.pretrained import *

# display and filter all available pretrained pipelines
ResourceDownloader.showPublicPipelines()
ResourceDownloader.showPublicPipelines(lang="en")
ResourceDownloader.showPublicPipelines(lang="en", version="3.2.0")

# display and filter all available pretrained models
ResourceDownloader.showPublicModels()
ResourceDownloader.showPublicModels("NerDLModel", "3.2.0")
ResourceDownloader.showPublicModels("NerDLModel", "en")
ResourceDownloader.showPublicModels("XlmRoBertaEmbeddings", "xx")

+--------------------------+------+---------+
| Model                    | lang | version |
+--------------------------+------+---------+
| xlm_roberta_base         | xx   | 3.1.0   |
| twitter_xlm_roberta_base | xx   | 3.1.0   |
| xlm_roberta_xtreme_base  | xx   | 3.1.3   |
| xlm_roberta_large        | xx   | 3.3.0   |
+--------------------------+------+---------+

# remove all the downloaded models & pipelines to free up storage
ResourceDownloader.clearCache()

# display all available annotators that can be saved as a Model
ResourceDownloader.showAvailableAnnotators()

Welcoming Databricks Runtime 9.1 LTS, 9.1 ML, and 9.1 ML with GPU
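The token classification annotators introduced in this release are each described as a linear layer on top of the hidden-states output. Unlike a sequence-level head, that layer is applied to every token separately, which is what makes it suitable for NER. The sketch below shows the computation on toy data; the hidden states, weights, labels, and the function `tag_tokens` are illustrative assumptions, not Spark NLP or HuggingFace code.

```python
from typing import List


def tag_tokens(
    hidden_states: List[List[float]],
    weights: List[List[float]],
    bias: List[float],
    labels: List[str],
) -> List[str]:
    """Apply the same linear layer to each token vector and pick the top label."""
    tags = []
    for hidden in hidden_states:
        logits = [sum(w * x for w, x in zip(row, hidden)) + b
                  for row, b in zip(weights, bias)]
        best = max(range(len(logits)), key=logits.__getitem__)
        tags.append(labels[best])
    return tags


# Two tokens with toy 2-dimensional hidden states and a 2-label head.
tags = tag_tokens(
    hidden_states=[[0.9, 0.1], [0.2, 0.8]],
    weights=[[1.0, 0.0], [0.0, 1.0]],
    bias=[0.0, 0.0],
    labels=["O", "B-PER"],
)
```

The output has one label per input token, so it lines up with the tokenized text.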

Bug Fixes

Fix a bug in RoBertaEmbeddings when all special tokens were identical

Fix a bug in RoBertaEmbeddings when a special token contained valid regex

Fix a bug that led to a memory leak inside the NorvigSweeting spell checker. For some inputs, this bug caused problems in pretrained pipelines such as explain_document_ml and explain_document_dl

Fix the wrong types being assigned to minCount and classCount in Python for ContextSpellCheckerApproach annotator

Fix explain_document_ml pretrained pipeline for Spark NLP 3.x on Apache Spark 2.x

Fix WordSegmenterModel wordseg_best model for Thai language

Fix WordSegmenterModel wordseg_large model for Chinese language

Models and Pipelines

Spark NLP 3.3.0 comes with:

New ALBERT, RoBERTa, XLNet, and XLM-RoBERTa models for Token Classification

New XLM-RoBERTa models in Luganda, Kinyarwanda, Igbo, Hausa, and Amharic languages

New Transformer Models

| Model | Name | Build | Lang |
|:------|:-----|:------|:-----|
| RoBertaForTokenClassification | roberta_large_token_classifier_ontonotes | 3.3.0 | en |
| RoBertaForTokenClassification | roberta_large_token_classifier_conll03 | 3.3.0 | en |
| RoBertaForTokenClassification | roberta_base_token_classifier_ontonotes | 3.3.0 | en |
| RoBertaForTokenClassification | roberta_base_token_classifier_conll03 | 3.3.0 | en |
| RoBertaForTokenClassification | distilroberta_base_token_classifier_ontonotes | 3.3.0 | en |
| RoBertaForTokenClassification | roberta_token_classifier_zwnj_base_ner | 3.3.0 | fa |
| XlmRoBertaForTokenClassification | xlm_roberta_token_classifier_ner_40_lang | 3.3.0 | xx |
| AlbertForTokenClassification | albert_xlarge_token_classifier_conll03 | 3.3.0 | en |
| AlbertForTokenClassification | albert_large_token_classifier_conll03 | 3.3.0 | en |
| AlbertForTokenClassification | albert_base_token_classifier_conll03 | 3.3.0 | en |
| XlnetForTokenClassification | xlnet_large_token_classifier_conll03 | 3.3.0 | en |
| XlnetForTokenClassification | xlnet_base_token_classifier_conll03 | 3.3.0 | en |
| XlmRoBertaEmbeddings | xlm_roberta_large | 3.3.0 | xx |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_luganda | 3.3.0 | lg |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_kinyarwanda | 3.3.0 | rw |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_igbo | 3.3.0 | ig |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_hausa | 3.3.0 | ha |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_amharic | 3.3.0 | am |

The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.

New Notebooks

Import hundreds of models in different languages to Spark NLP

Spark NLP | HuggingFace Notebooks | Colab
:------------ | :-------------| :----------|
AlbertForTokenClassification | HuggingFace in Spark NLP - AlbertForTokenClassification | |
RoBertaForTokenClassification | HuggingFace in Spark NLP - RoBertaForTokenClassification | |
XlmRoBertaForTokenClassification | HuggingFace in Spark NLP - XlmRoBertaForTokenClassification | |

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP in Action

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==3.3.0

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.0

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.0

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.0

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.3.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.3.0</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.3.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.3.0</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.3.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.3.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.0.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.0.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.0.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.0.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.0.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.0.jar

3.2.3(Sep 15, 2021)
Overview

We are pleased to release Spark NLP 🚀 3.2.3! This release comes with new and completed documentation for all Transformers and Trainable annotators in Spark NLP, new Japanese NER and Embeddings models, new multilingual Transformer models, code enhancements, and bug fixes.

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

Add delimiter feature to CoNLL() class to support other delimiters in CoNLL files https://github.com/JohnSnowLabs/spark-nlp/pull/5934

Add support for IOB in addition to IOB2 format in GraphExtraction annotator https://github.com/JohnSnowLabs/spark-nlp/pull/6101

Change YakeModel output type from KEYWORD to CHUNK to have more available features after the YakeModel annotator such as Chunk2Doc or ChunkEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/6065

Welcoming Databricks Runtime 9.0, 9.0 ML, and 9.0 ML with GPU

A new and completed Transformer page

description

default model's name

link to Models Hub

link to notebook on Spark NLP Workshop

link to Python APIs

link to Scala APIs

link to source code and unit test

Examples in Python and Scala for

Prediction

Training

Raw Embeddings

A new and completed Training page

Training Datasets

Text Processing

Spell Checkers

Token Classification

Text Classification

External Trainable Models
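The New Features list above mentions that GraphExtraction now accepts IOB in addition to IOB2. As a reminder of how the two tagging schemes differ, here is a minimal sketch in plain Python. It is not the annotator's implementation, and the function name `iob1_to_iob2` is hypothetical. In IOB (often called IOB1) an entity may begin with an `I-` tag, and `B-` is only used to separate two adjacent entities of the same type; in IOB2 every entity begins with `B-`.

```python
def iob1_to_iob2(tags):
    """Convert a sequence of IOB (IOB1) tags to IOB2 (illustrative helper)."""
    converted = []
    prev_type = None  # entity type of the previous token, None when outside an entity
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            prev_type = None
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and prev_type != entity_type:
            # IOB1 lets an entity start with I-; IOB2 requires B- at this position
            converted.append("B-" + entity_type)
        else:
            # B- tags and entity continuations are already valid IOB2
            converted.append(tag)
        prev_type = entity_type
    return converted


example = ["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "O", "I-ORG"]
print(iob1_to_iob2(example))
```

The conversion only rewrites the first tag of each entity, so a sequence that is already IOB2 passes through unchanged.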

Bug Fixes & Enhancements

Fix the default language for XlmRoBertaSentenceEmbeddings pretrained model in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6057

Fix an issue in SentenceEmbeddings that concatenated sentences instead of handling each corresponding sentence https://github.com/JohnSnowLabs/spark-nlp/pull/6060

Fix GraphExtraction usage in LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/6101

Fix compatibility issue in explain_document_ml pipeline

Better import process for corrupted merges file in Longformer tokenizer https://github.com/JohnSnowLabs/spark-nlp/pull/6083

Models and Pipelines

Spanish, Greek, Swedish, Dutch, German, French, Romanian, and Japanese

BERT Embeddings (Word and Sentence)

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| BertEmbeddings | bert_base_uncased_legal | 3.2.2 | en |
| BertEmbeddings | bert_base_uncased | 3.2.2 | es |
| BertEmbeddings | bert_base_cased | 3.2.2 | es |
| BertEmbeddings | bert_base_uncased | 3.2.2 | el |
| BertEmbeddings | bert_base_cased | 3.2.2 | sv |
| BertEmbeddings | bert_base_cased | 3.2.2 | nl |
| BertSentenceEmbeddings | sent_bert_base_uncased_legal | 3.2.2 | en |
| BertSentenceEmbeddings | sent_bert_base_uncased | 3.2.2 | es |
| BertSentenceEmbeddings | sent_bert_base_cased | 3.2.2 | es |
| BertSentenceEmbeddings | sent_bert_base_uncased | 3.2.2 | el |
| BertSentenceEmbeddings | sent_bert_base_cased | 3.2.2 | sv |
| BertSentenceEmbeddings | sent_bert_base_cased | 3.2.2 | nl |
| BertSentenceEmbeddings | sent_bert_base_cased | 3.2.2 | de |

Other multilingual models

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| WordEmbeddingsModel | japanese_cc_300d | 3.2.2 | ja |
| NerDLModel | ner_ud_gsd_cc_300d | 3.2.2 | ja |
| NerDLModel | ner_ud_gsd_xlm_roberta_base | 3.2.2 | ja |
| BertForTokenClassification | bert_token_classifier_ner_ud_gsd | 3.2.2 | ja |
| BertForTokenClassification | bert_token_classifier_ner_btc | 3.2.2 | en |
| ClassifierDLModel | classifierdl_bert_sentiment | 3.2.2 | de |
| ClassifierDLModel | classifierdl_bert_sentiment | 3.2.2 | fr |

The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.

Models Hub for the community, by the community

Serve Your Spark NLP Models for Free! You can host and share your Spark NLP models & pipelines publicly with everyone to reuse them with one line of code!

Models Hub is open to everyone to upload their models and pipelines, showcase their work, and share them with others.

Please visit the following page for more information: https://modelshub.johnsnowlabs.com/

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP in Action

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==3.2.3

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.3

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.3

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.3

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.2.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.2.3</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.2.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.2.3</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.2.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.2.3</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.2.3.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.2.3.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.2.3.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.2.3.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.2.3.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.2.3.jar

3.2.2(Sep 1, 2021)
Overview

We are pleased to release Spark NLP 🚀 3.2.2! This release opens Models Hub to our community to host their models and pipelines for free, and comes with new RoBERTa and XLM-RoBERTa Sentence Embeddings, over 40 new models and pipelines in 20+ languages, bug fixes, and more.

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

A new RoBertaSentenceEmbeddings annotator for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators

A new XlmRoBertaSentenceEmbeddings annotator for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators

Add support for AWS MFA via Spark NLP configuration

Add new AWS configs to Spark NLP configuration when using a private S3 bucket to store logs for training models or access TF graphs needed in NerDLApproach

spark.jsl.settings.aws.credentials.access_key_id

spark.jsl.settings.aws.credentials.secret_access_key

spark.jsl.settings.aws.credentials.session_token

spark.jsl.settings.aws.s3_bucket

spark.jsl.settings.aws.region
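As a minimal sketch of how the AWS properties listed above might be assembled before a session is created, the snippet below collects them from environment variables into a plain dictionary and renders them as `--conf` flags. The property names are the ones listed above; the helper names, the `SPARK_NLP_S3_BUCKET` variable, and the example values are hypothetical and used only for illustration.

```python
import os

# Maps environment variables to the Spark NLP properties listed above.
# AWS_* names follow the usual AWS convention; SPARK_NLP_S3_BUCKET is a
# hypothetical variable name chosen for this sketch.
ENV_TO_PROPERTY = {
    "AWS_ACCESS_KEY_ID": "spark.jsl.settings.aws.credentials.access_key_id",
    "AWS_SECRET_ACCESS_KEY": "spark.jsl.settings.aws.credentials.secret_access_key",
    "AWS_SESSION_TOKEN": "spark.jsl.settings.aws.credentials.session_token",
    "SPARK_NLP_S3_BUCKET": "spark.jsl.settings.aws.s3_bucket",
    "AWS_REGION": "spark.jsl.settings.aws.region",
}


def aws_settings_from_env(env=None):
    """Return {property: value} for every mapped variable that is set and non-empty."""
    env = os.environ if env is None else env
    return {prop: env[name] for name, prop in ENV_TO_PROPERTY.items() if env.get(name)}


def as_conf_flags(settings):
    """Render settings as a flat list of --conf key=value arguments, sorted by key."""
    flags = []
    for key in sorted(settings):
        flags += ["--conf", f"{key}={settings[key]}"]
    return flags


# Example with a fake environment so that no real credentials are needed
fake_env = {"AWS_REGION": "us-east-1", "SPARK_NLP_S3_BUCKET": "my-training-logs"}
settings = aws_settings_from_env(fake_env)
print(as_conf_flags(settings))
```

The same dictionary entries can be applied one by one with `SparkSession.builder.config(key, value)` when the session is built in code rather than from the command line.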

Models Hub for the community, by the community

Serve Your Spark NLP Models for Free! You can host and share your Spark NLP models & pipelines publicly with everyone to reuse them with one line of code!

We are opening Models Hub to everyone to upload their models and pipelines, showcase their work, and share them with others.

Please visit the following page for more information: https://modelshub.johnsnowlabs.com/

Bug Fixes & Enhancements

Improve loading merges file for RoBERTa tokenizer

Remove batchSize param from broadcast in XlmRoBertaEmbeddings so that it can be set after the annotator is created

Preserve previously generated metadata in BertSentenceEmbeddings annotator

Set elmo as a default poolingLayer in ElmoEmbeddings

Fix special tokens ids in XlmRoBertaEmbeddings annotator

Fix distilbert_base_token_classifier_ontonotes model

Fix distilbert_base_token_classifier_conll03 model

Fix distilbert_base_token_classifier_few_nerd model

Fix distilbert_token_classifier_persian_ner model

Fix ner_conll_longformer_base_4096 model

Models and Pipelines

Spark NLP 3.2.2 comes with new Turkish text classifier pipelines, Expert BERT Word and Sentence embeddings such as wiki books and PubMed, new BERT model for 17 Indian languages, and Sentence Detection models for 15 new languages.

Pipelines

| Name | Build | Lang |
|:-------------------|:-----------------|:------|
| classifierdl_berturk_cyberbullying_pipeline | 3.1.3 | tr |
| classifierdl_bert_news_pipeline | 3.1.3 | de |
| classifierdl_electra_questionpair_pipeline | 3.2.0 | en |
| classifierdl_bert_news_pipeline | 3.2.0 | tr |

Named Entity Recognition

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| NerDLModel | ner_conll_elmo | 3.2.2 | en |
| NerDLModel | ner_conll_albert_base_uncased | 3.2.2 | en |
| NerDLModel | ner_conll_albert_large_uncased | 3.2.2 | en |
| NerDLModel | ner_conll_xlnet_base_cased | 3.2.2 | en |

BERT Embeddings

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| BertEmbeddings | bert_muril | 3.2.0 | xx |
| BertEmbeddings | bert_wiki_books_sst2 | 3.2.0 | en |
| BertEmbeddings | bert_wiki_books_squad2 | 3.2.0 | en |
| BertEmbeddings | bert_wiki_books_qqp | 3.2.0 | en |
| BertEmbeddings | bert_wiki_books_qnli | 3.2.0 | en |
| BertEmbeddings | bert_wiki_books_mnli | 3.2.0 | en |
| BertEmbeddings | bert_wiki_books | 3.2.0 | en |
| BertEmbeddings | bert_pubmed_squad2 | 3.2.0 | en |
| BertEmbeddings | bert_pubmed | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books_sst2 | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books_squad2 | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books_qqp | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books_qnli | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books_mnli | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_pubmed_squad2 | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_pubmed | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_muril | 3.2.0 | xx |

Sentence Detection

Yiddish, Ukrainian, Telugu, Tamil, Somali, Sindhi, Russian, Punjabi, Nepali, Marathi, Malayalam, Kannada, Indonesian, Gujarati, Bosnian

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | yi |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | uk |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | te |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | ta |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | so |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | sd |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | ru |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | pa |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | ne |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | mr |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | ml |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | kn |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | id |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | gu |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | bs |

The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP in Action

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP publications

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==3.2.2

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.2

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.2

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.2

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.2.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.2.2</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.2.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.2.2</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.2.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.2.2</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.2.2.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.2.2.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.2.2.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.2.2.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.2.2.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.2.2.jar


3.2.1(Aug 11, 2021)

Patch release

Fix unsupported model error in pretrained function for LongformerEmbeddings, BertForTokenClassification, and DistilBertForTokenClassification https://github.com/JohnSnowLabs/spark-nlp/issues/5947

Documentation

TF Hub & HuggingFace to Spark NLP
Models Hub with new models
Spark NLP publications
Spark NLP in Action
Spark NLP documentation
Spark NLP Scala APIs
Spark NLP Python APIs
Spark NLP Workshop notebooks
Spark NLP training certification notebooks for Google Colab and Databricks
Spark NLP Display for visualization of different types of annotations
Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==3.2.1

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.1

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.1

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.1
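The package coordinates above follow a regular pattern that matches the Maven artifact ids listed for this release: an optional `-gpu` part, an optional Spark suffix (`-spark24` or `-spark23`), and the Scala version. The helper below is a small sketch of that pattern for scripting purposes. The function name and the `"3.x"` / `"2.4.x"` / `"2.3.x"` keys are hypothetical, and the pattern is only taken from the releases shown on this page, so other releases may name their artifacts differently.

```python
# Spark line -> (artifact suffix, Scala version), as listed in the install notes above
SPARK_VARIANTS = {
    "3.x": ("", "2.12"),
    "2.4.x": ("-spark24", "2.11"),
    "2.3.x": ("-spark23", "2.11"),
}


def spark_nlp_package(version, spark="3.x", gpu=False):
    """Build the --packages coordinate for a given Spark NLP version (illustrative)."""
    spark_suffix, scala = SPARK_VARIANTS[spark]
    gpu_part = "-gpu" if gpu else ""
    return f"com.johnsnowlabs.nlp:spark-nlp{gpu_part}{spark_suffix}_{scala}:{version}"


print("pyspark --packages " + spark_nlp_package("3.2.1"))
print("pyspark --packages " + spark_nlp_package("3.2.1", spark="2.3.x", gpu=True))
```

An unknown Spark line raises `KeyError`, which is deliberate in this sketch: guessing a suffix would produce a coordinate that does not resolve.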

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.2.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.2.1</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.2.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.2.1</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.2.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.2.1</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.2.1.jar
GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.2.1.jar
CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.2.1.jar
GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.2.1.jar
CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.2.1.jar
GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.2.1.jar


3.2.0(Aug 10, 2021)
Overview

We are very excited to release Spark NLP 🚀 3.2.0! This is a big release with new Longformer models for long documents, BertForTokenClassification & DistilBertForTokenClassification for existing or fine-tuned models on HuggingFace, GraphExtraction & GraphFinisher to find relevant relationships between words, support for multilingual Date Matching, new Pydoc for Python APIs, and much more!

As always, we would like to thank our community for their feedback, questions, and feature requests.

Major features and improvements

NEW: Introducing LongformerEmbeddings annotator. Longformer is a transformer model for long documents. Longformer is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.

We have trained two NER models based on Longformer Base and Large embeddings:

| Model | Accuracy | F1 Test | F1 Dev |
|:------|:----------|:------|:--------|
| ner_conll_longformer_base_4096 | 94.75% | 90.09 | 94.22 |
| ner_conll_longformer_large_4096 | 95.79% | 91.25 | 94.82 |

NEW: Introducing BertForTokenClassification annotator. BertForTokenClassification can load BERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using BertForTokenClassification or TFBertForTokenClassification in HuggingFace 🤗

NEW: Introducing DistilBertForTokenClassification annotator. DistilBertForTokenClassification can load DistilBERT models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using DistilBertForTokenClassification or TFDistilBertForTokenClassification in HuggingFace 🤗

NEW: Introducing GraphExtraction and GraphFinisher annotators to extract a dependency graph between entities. The GraphExtraction class takes e.g. extracted entities from a NerDLModel and creates a dependency tree that describes how the entities relate to each other. For that, a triple store format is used. Nodes represent the entities and the edges represent the relations between those entities. The graph can then be used to find relevant relationships between words.

NEW: Introducing support for multilingual DateMatcher and MultiDateMatcher annotators. These two annotators will support English, French, Italian, Spanish, German, and Portuguese languages

NEW: Introducing new Python APIs and fully documented Pydoc

NEW: Introducing new Spark NLP configurations via spark.conf() by deprecating application.conf usage. You can easily change Spark NLP configurations in SparkSession. For more examples please visit Spark NLP Configuration

Add support for Amazon S3 to log_folder Spark NLP config and outputLogsPath param in NerDLApproach, ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach annotators

Added cache_folder, log_folder, and cluster_tmp_dir to sparknlp.start() function to set Spark NLP configurations

Added examples to all Spark NLP Scaladoc

Added examples to all Spark NLP Pydoc

Welcoming new Databricks runtimes to our Spark NLP family:

Databricks 8.4 ML & GPU

Fix the wrong version returned by sparknlp.version()
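The GraphExtraction entry above describes a triple store in which nodes are entities and edges are relations. The following is a minimal, library-independent sketch of that idea in plain Python. It does not reproduce the annotator's output format, and the class name, entity names, and relation labels are all made up for illustration.

```python
from collections import defaultdict, deque


class TripleStore:
    """Tiny (subject, relation, object) store: nodes are entities, edges are relations."""

    def __init__(self):
        self._edges = defaultdict(list)  # subject -> [(relation, object), ...]

    def add(self, subject, relation, obj):
        self._edges[subject].append((relation, obj))

    def find_path(self, start, goal):
        """Breadth-first search; returns the triples linking start to goal, or None."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, path = queue.popleft()
            if node == goal:
                return path
            # .get avoids creating empty entries for nodes without outgoing edges
            for relation, nxt in self._edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [(node, relation, nxt)]))
        return None


store = TripleStore()
store.add("John Snow", "works_for", "Acme Labs")   # hypothetical entities and relations
store.add("Acme Labs", "located_in", "London")
print(store.find_path("John Snow", "London"))
```

Breadth-first search returns a shortest chain of relations between two entities, which is one simple way to surface how two extracted entities are connected.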

Models and Pipelines

Spark NLP 3.2.0 comes with new LongformerEmbeddings, BertForTokenClassification, and DistilBertForTokenClassification annotators.

New Longformer Models

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| LongformerEmbeddings | longformer_base_4096 | 3.2.0 | en |
| LongformerEmbeddings | longformer_large_4096 | 3.2.0 | en |

Featured NerDL Models

New NER models for CoNLL (4 entities) and OntoNotes (18 entities) trained by using BERT, RoBERTa, DistilBERT, XLM-RoBERTa, and Longformer Embeddings:

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| NerDLModel | ner_ontonotes_roberta_base | 3.2.0 | en |
| NerDLModel | ner_ontonotes_roberta_large | 3.2.0 | en |
| NerDLModel | ner_ontonotes_distilbert_base_cased | 3.2.0 | en |
| NerDLModel | ner_conll_bert_base_cased | 3.2.0 | en |
| NerDLModel | ner_conll_distilbert_base_cased | 3.2.0 | en |
| NerDLModel | ner_conll_roberta_base | 3.2.0 | en |
| NerDLModel | ner_conll_roberta_large | 3.2.0 | en |
| NerDLModel | ner_conll_xlm_roberta_base | 3.2.0 | en |
| NerDLModel | ner_conll_longformer_base_4096 | 3.2.0 | en |
| NerDLModel | ner_conll_longformer_large_4096 | 3.2.0 | en |

BERT and DistilBERT for Token Classification

New BERT and DistilBERT models fine-tuned for Named Entity Recognition (NER) in English, Persian, Spanish, Swedish, and Turkish:

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| BertForTokenClassification | bert_base_token_classifier_conll03 | 3.2.0 | en |
| BertForTokenClassification | bert_large_token_classifier_conll03 | 3.2.0 | en |
| BertForTokenClassification | bert_base_token_classifier_ontonote | 3.2.0 | en |
| BertForTokenClassification | bert_large_token_classifier_ontonote | 3.2.0 | en |
| BertForTokenClassification | bert_token_classifier_parsbert_armanner | 3.2.0 | fa |
| BertForTokenClassification | bert_token_classifier_parsbert_ner | 3.2.0 | fa |
| BertForTokenClassification | bert_token_classifier_parsbert_peymaner | 3.2.0 | fa |
| BertForTokenClassification | bert_token_classifier_turkish_ner | 3.2.0 | tr |
| BertForTokenClassification | bert_token_classifier_spanish_ner | 3.2.0 | es |
| BertForTokenClassification | bert_token_classifier_swedish_ner | 3.2.0 | sv |
| BertForTokenClassification | bert_base_token_classifier_few_nerd | 3.2.0 | en |
| DistilBertForTokenClassification | distilbert_base_token_classifier_few_nerd | 3.2.0 | en |
| DistilBertForTokenClassification | distilbert_base_token_classifier_conll03 | 3.2.0 | en |
| DistilBertForTokenClassification | distilbert_base_token_classifier_ontonotes | 3.2.0 | en |
| DistilBertForTokenClassification | distilbert_token_classifier_persian_ner | 3.2.0 | fa |

The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.

New Notebooks

Import hundreds of models in different languages to Spark NLP

Spark NLP | HuggingFace Notebooks | Colab
:------------ | :-------------| :----------|
LongformerEmbeddings | HuggingFace in Spark NLP - Longformer | |
BertForTokenClassification | HuggingFace in Spark NLP - BertForTokenClassification | |
DistilBertForTokenClassification | HuggingFace in Spark NLP - DistilBertForTokenClassification | |

You can visit Import Transformers in Spark NLP for more info

New Multilingual DateMatcher and MultiDateMatcher

Spark NLP | Jupyter Notebooks
:------------ | :-------------|
MultiDateMatcher | Date Matcher in English
MultiDateMatcher | Date Matcher in French
MultiDateMatcher | Date Matcher in German
MultiDateMatcher | Date Matcher in Italian
MultiDateMatcher | Date Matcher in Portuguese
MultiDateMatcher | Date Matcher in Spanish
GraphExtraction | Graph Extraction Intro
GraphExtraction | Graph Extraction
GraphExtraction | Graph Extraction Explode Entities

Deprecation

The use of application.conf has been deprecated in Spark NLP 3.2.0 release. You can set those configurations via Spark Conf during SparkSession creation. For the full list and examples please visit the Spark NLP Configuration.

Documentation

TF Hub & HuggingFace to Spark NLP

Models Hub with new models

Spark NLP publications

Spark NLP in Action

Spark NLP documentation

Spark NLP Scala APIs

Spark NLP Python APIs

Spark NLP Workshop notebooks

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI pip install spark-nlp==3.2.0

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.0

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.0

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.0
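The package names above follow a regular pattern: an optional `-gpu` part, an optional Spark suffix, then the Scala version. The sketch below assembles the `--packages` coordinate from those parts, using the artifact names listed in the Maven section. The helper `package_coordinate` and the `"3.x"`, `"2.4"`, `"2.3"` labels are made up for illustration and are not part of Spark NLP.

```python
# Sketch of the artifact naming scheme used in the installation commands.
GROUP_ID = "com.johnsnowlabs.nlp"

# Spark line -> (artifact suffix, Scala binary version), as listed in these notes.
SPARK_VARIANTS = {
    "3.x": ("", "2.12"),
    "2.4": ("-spark24", "2.11"),
    "2.3": ("-spark23", "2.11"),
}


def package_coordinate(version, spark="3.x", gpu=False):
    """Return the --packages coordinate for a release (hypothetical helper)."""
    if spark not in SPARK_VARIANTS:
        raise ValueError(f"unsupported Spark line: {spark!r}")
    suffix, scala = SPARK_VARIANTS[spark]
    gpu_part = "-gpu" if gpu else ""
    return f"{GROUP_ID}:spark-nlp{gpu_part}{suffix}_{scala}:{version}"


print(package_coordinate("3.2.0"))
# com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.0
print(package_coordinate("3.2.0", spark="2.3", gpu=True))
# com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.0
```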

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.2.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.2.0</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.2.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.2.0</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.2.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.2.0</version> </dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.2.0.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.2.0.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.2.0.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.2.0.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.2.0.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.2.0.jar

3.1.3 (Jul 20, 2021)
Overview

We are pleased to release Spark NLP 🚀 3.1.3! In this release, we bring notebooks to easily import BERT and ALBERT models from TF Hub into Spark NLP, new multilingual NER models for 40 languages with a fine-tuned XLM-RoBERTa model, and new state-of-the-art document/sentence embeddings models for English and 100+ languages!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

Support BERT models from TF Hub to Spark NLP

Support BERT for sentence embeddings from TF Hub to Spark NLP

Support ALBERT models from TF Hub to Spark NLP

Welcoming new Databricks 8.4 / 8.4 ML/GPU runtimes to Spark NLP platforms

New Models

We have trained multilingual NER models on the entire XTREME (40 languages) and WIKINER (8 languages) datasets.

Multilingual Named Entity Recognition:

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| NerDLModel | ner_xtreme_xlm_roberta_xtreme_base | 3.1.3 | xx |
| NerDLModel | ner_xtreme_glove_840B_300 | 3.1.3 | xx |
| NerDLModel | ner_wikiner_xlm_roberta_base | 3.1.3 | xx |
| NerDLModel | ner_wikiner_glove_840B_300 | 3.1.3 | xx |
| NerDLModel | ner_mit_movie_simple_distilbert_base_cased | 3.1.3 | en |
| NerDLModel | ner_mit_movie_complex_distilbert_base_cased | 3.1.3 | en |
| NerDLModel | ner_mit_movie_complex_bert_base_cased | 3.1.3 | en |

Fine-tuned XLM-RoBERTa base model by randomly masking 15% of XTREME dataset:

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| XlmRoBertaEmbeddings | xlm_roberta_xtreme_base | 3.1.3 | xx |

New Universal Sentence Encoder trained with CMLM (English & 100+ languages):

These models extend the BERT transformer architecture, which is why they are used with BertSentenceEmbeddings.

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| BertSentenceEmbeddings | sent_bert_use_cmlm_en_base | 3.1.3 | en |
| BertSentenceEmbeddings | sent_bert_use_cmlm_en_large | 3.1.3 | en |
| BertSentenceEmbeddings | sent_bert_use_cmlm_multi_base | 3.1.3 | xx |
| BertSentenceEmbeddings | sent_bert_use_cmlm_multi_base_br | 3.1.3 | xx |

Benchmark

We used BERT base, large, and the new Universal Sentence Encoder trained with CMLM extending the BERT transformer architecture to train ClassifierDL with News dataset:

(120k training examples - 10 Epochs - 512 max sequence - Nvidia Tesla P100)

| Model | Accuracy | F1 | Duration |
|:-----------------------------|:-------------------|:-----------------|:------|
| tfhub_use | 0.90 | 0.89 | 10 min |
| tfhub_use_lg | 0.91 | 0.90 | 24 min |
| sent_bert_base_cased | 0.92 | 0.90 | 35 min |
| sent_bert_large_cased | 0.93 | 0.91 | 75 min |
| sent_bert_use_cmlm_en_base | 0.934 | 0.91 | 36 min |
| sent_bert_use_cmlm_en_large | 0.945 | 0.92 | 72 min |

The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.

Bug Fixes

Fix serialization issue in NorvigSweetingModel

Fix the issue with BertSentenceEmbeddings model in TF v2

Update ArrayType structure to fix Finisher failing to clean up some annotators

New Notebooks

| Spark NLP | TF Hub Notebooks |
|:------------|:-------------|
| BertEmbeddings | TF Hub in Spark NLP - BERT |
| BertSentenceEmbeddings | TF Hub in Spark NLP - BERT Sentence |
| AlbertEmbeddings | TF Hub in Spark NLP - ALBERT |

Documentation

HuggingFace & TF Hub to Spark NLP

Models Hub with new models

Spark NLP publications

Spark NLP in Action

Spark NLP documentation

Spark NLP Workshop notebooks

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.1.3

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.3

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.3

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.3

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.1.3</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.1.3</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.1.3</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.1.3</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.1.3</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.1.3</version> </dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.1.3.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.1.3.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.1.3.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.1.3.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.1.3.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.1.3.jar

3.1.2 (Jul 7, 2021)
Overview

We are pleased to release Spark NLP 🚀 3.1.2! We have a new and much-improved XLNet annotator with support for HuggingFace 🤗 models in Spark NLP. We managed to make XlnetEmbeddings almost 5x faster on GPU compared to prior releases!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

Migrate XlnetEmbeddings to TensorFlow v2. This allows the importing of HuggingFace XLNet models to Spark NLP

Migrate XlnetEmbeddings to BatchAnnotate to allow better performance on accelerated hardware such as GPU

Dynamically extract special tokens from SentencePiece model in XlmRoBertaEmbeddings

Add setIncludeAllConfidenceScores param in NerDLModel to choose between reporting the confidence scores of all labels and reporting only the score of the predicted label

Fully updated Annotators page with full examples in Python and Scala

Fully update Transformers page for all the transformers in Spark NLP
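To make the effect of the setIncludeAllConfidenceScores item above more concrete, here is a purely conceptual sketch of the two reporting modes. The function `reported_scores` and the plain-dictionary score format are assumptions for illustration and do not reflect how NerDLModel stores its annotation metadata.

```python
def reported_scores(label_scores, include_all):
    """Return the confidence scores to report for one token.

    label_scores maps each NER label to its confidence (illustrative format).
    """
    if not label_scores:
        raise ValueError("label_scores must not be empty")
    # The predicted label is the one with the highest confidence.
    predicted = max(label_scores, key=label_scores.get)
    if include_all:
        return dict(label_scores)
    return {predicted: label_scores[predicted]}


scores = {"B-PER": 0.91, "O": 0.06, "B-LOC": 0.03}
print(reported_scores(scores, include_all=False))  # {'B-PER': 0.91}
print(reported_scores(scores, include_all=True))
```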

Bug Fixes & Enhancements

Fix issue with SymmetricDeleteModel

Fix issue with encoding unknown bytes in RoBertaEmbeddings

Fix issue with multi-lingual UniversalSentenceEncoder models

Sync params between Python and Scala for ContextSpellChecker
- change setWordMaxDist to setWordMaxDistance in Scala
- change setLMClasses to setLanguageModelClasses in Scala
- change setBlackListMinFreq to setCompoundCount in Scala
- change setClassThreshold to setClassCount in Scala
- change setWeights to setWeightedDistPath in Scala
- change setInitialBatchSize to setBatchSize in Python

Sync params between Python and Scala for ViveknSentimentApproach
- change setCorpusPrune to setPruneCorpus in Scala

Sync params between Python and Scala for RegexMatcher
- change setRules to setExternalRules in Scala

Sync params between Python and Scala for WordSegmenterApproach
- change setPosCol to setPosColumn
- change setIterations to setNIterations

Sync params between Python and Scala for PerceptronApproach
- change setPosCol to setPosColumn

Fix typos in docs: https://github.com/JohnSnowLabs/spark-nlp/pull/5766 and https://github.com/JohnSnowLabs/spark-nlp/pull/5775 thanks to @brollb

Performance Improvements

Introducing a new batch annotation technique implemented in Spark NLP 3.1.2 for the XlnetEmbeddings annotator to radically improve prediction/inference performance. From now on, the batchSize for this annotator means the number of rows that can be fed into the model for prediction instead of sentences per row. You can control the throughput when you are on accelerated hardware such as GPU to fully utilize it.
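The row-based meaning of batchSize can be pictured with a small, library-independent sketch: rows are grouped into chunks of at most batchSize, and each chunk would be sent to the model in one call. The `batches` helper is hypothetical and only illustrates the grouping, not Spark NLP's internal implementation.

```python
from itertools import islice


def batches(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


rows = [f"row {i}" for i in range(10)]
sizes = [len(batch) for batch in batches(rows, 4)]
print(sizes)  # [4, 4, 2]
```

A larger batch size means fewer, bigger model calls, which is usually what keeps a GPU busy, at the cost of more memory per call.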

Backward compatibility

We have migrated XlnetEmbeddings to TensorFlow v2, so models released prior to 3.1.2 won't work after this release. We have already updated the models and uploaded them to Models Hub. You can use pretrained(), which takes care of this automatically, or make sure you download the new models manually.

Documentation

HuggingFace to Spark NLP

Models Hub with new models

Spark NLP publications

Spark NLP in Action

Spark NLP documentation

Spark NLP Workshop notebooks

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.1.2

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.2

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.2

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.2

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.1.2</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.1.2</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.1.2</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.1.2</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.1.2</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.1.2</version> </dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.1.2.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.1.2.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.1.2.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.1.2.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.1.2.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.1.2.jar

3.1.1 (Jun 23, 2021)
Overview

We are pleased to release Spark NLP 🚀 3.1.1! We have a new and much-improved ALBERT annotator with support for HuggingFace 🤗 models in Spark NLP. We managed to make AlbertEmbeddings almost 7x faster on GPU compared to prior releases!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

Migrate AlbertEmbeddings to TensorFlow v2. This allows the importing of HuggingFace ALBERT models to Spark NLP

Migrate AlbertEmbeddings to BatchAnnotate to allow better performance on accelerated hardware such as GPU

Enable stdout/stderr in real-time for child processes via sparknlp.start(). Thanks to PySpark 3.x, this is now possible with sparknlp.start(real_time_output=True) to have the outputs of Spark NLP (such as metrics during training) right in your Jupyter, Colab, and Kaggle notebooks.

Complete examples for all annotators in Scaladoc APIs https://github.com/JohnSnowLabs/spark-nlp/pull/5668

Bug Fixes & Enhancements

Fix YakeModel issue with empty token https://github.com/JohnSnowLabs/spark-nlp/pull/5683 thanks to @shaddoxac

Fix getAnchorDateMonth method in DateMatcher and MultiDateMatcher https://github.com/JohnSnowLabs/spark-nlp/pull/5693

Fix the broken PubTator class in Python https://github.com/JohnSnowLabs/spark-nlp/pull/5702

Fix relative dates in DateMatcher and MultiDateMatcher such as day after tomorrow or day before yesterday https://github.com/JohnSnowLabs/spark-nlp/pull/5706

Add isPaddedToken param to PubTator https://github.com/JohnSnowLabs/spark-nlp/pull/5702

Fix issue with logger inside session on some setups https://github.com/JohnSnowLabs/spark-nlp/pull/5715

Add signatures to TF session to handle inputs/outputs more dynamically in BertEmbeddings, DistilBertEmbeddings, RoBertaEmbeddings, and XlmRoBertaEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/5715

Fix XlmRoBertaEmbeddings issue with init_all_tables https://github.com/JohnSnowLabs/spark-nlp/pull/5715

Add missing YakeModel from annotators

Add missing random seed param to ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach https://github.com/JohnSnowLabs/spark-nlp/pull/5697

Make the Java Exceptions appear before Py4J exceptions for ease of debugging in Python https://github.com/JohnSnowLabs/spark-nlp/pull/5709

Make sure batchSize set in NerDLModel is the same internally to feed TensorFlow https://github.com/JohnSnowLabs/spark-nlp/pull/5716

Fix a typo in documentation https://github.com/JohnSnowLabs/spark-nlp/pull/5664 thanks to @roger-yu-ds

Performance Improvements

Introducing a new batch annotation technique implemented in Spark NLP 3.1.1 for the AlbertEmbeddings annotator to radically improve prediction/inference performance. From now on, the batchSize for this annotator means the number of rows that can be fed into the model for prediction instead of sentences per row. You can control the throughput when you are on accelerated hardware such as GPU to fully utilize it.

Performance achievements by using Spark NLP 2.x/3.0.x vs. Spark NLP 3.1.1

(Performed on a Databricks cluster)

| Spark NLP 2.x/3.0.x vs. 3.1.1 | CPU | GPU |
|------------------|-------------------------|------------------------|
| ALBERT Base | 22% | 340% |
| ALBERT Large | 20% | 770% |

We will update this benchmark table in future pre-releases.

Backward compatibility

We have migrated AlbertEmbeddings to TensorFlow v2, so models released prior to 3.1.1 won't work after this release. We have already updated the models and uploaded them to Models Hub. You can use pretrained(), which takes care of this automatically, or make sure you download the new models manually.

Documentation

HuggingFace to Spark NLP

Models Hub with new models

Spark NLP publications

Spark NLP in Action

Spark NLP documentation

Spark NLP Workshop notebooks

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.1.1

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.1

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.1

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.1

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.1.1</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.1.1</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.1.1</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.1.1</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.1.1</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.1.1</version> </dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.1.1.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.1.1.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.1.1.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.1.1.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.1.1.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.1.1.jar

3.1.0 (Jun 7, 2021)
Overview

We are very excited to release Spark NLP 🚀 3.1.0! This is one of our biggest releases, with lots of models, pipelines, and groundwork for future features that we are proud to share with our community.

Spark NLP 3.1.0 comes with 2600+ new pretrained models and pipelines in 200+ languages, new DistilBERT, RoBERTa, and XLM-RoBERTa annotators, support for HuggingFace 🤗 (Autoencoding) models in Spark NLP, and extended support for new Databricks and EMR instances.

As always, we would like to thank our community for their feedback, questions, and feature requests.

Major features and improvements

NEW: Introducing DistilBertEmbeddings annotator. DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance

NEW: Introducing RoBertaEmbeddings annotator. RoBERTa (Robustly Optimized BERT-Pretraining Approach) models deliver state-of-the-art performance on NLP/NLU tasks and a sizable performance improvement on the GLUE benchmark. With a score of 88.5, RoBERTa reached the top position on the GLUE leaderboard

NEW: Introducing XlmRoBertaEmbeddings annotator. XLM-RoBERTa (Unsupervised Cross-lingual Representation Learning at Scale) is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data in 100 different languages. It outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving XNLI accuracy by 11.8% for Swahili and 9.2% for Urdu over the previous XLM model

NEW: Introducing support for HuggingFace exported models in equivalent Spark NLP annotators. Starting with this release, you can use the saved_model feature in HuggingFace to import any BERT, DistilBERT, RoBERTa, and XLM-RoBERTa model to Spark NLP within a few lines of code. We will work on the remaining annotators and extend this support to the rest with each release - For more information please visit this discussion

NEW: Migrate MarianTransformer to BatchAnnotate to control the throughput when you are on accelerated hardware such as GPU to fully utilize it

Upgrade to TensorFlow v2.4.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x

Update to CUDA11 and cuDNN 8.0.2 for GPU support

Implement ModelSignatureManager to automatically detect inputs, outputs, save and restore tensors from SavedModel in TF v2. This allows Spark NLP 3.1.x to extend support for external Encoders such as HuggingFace and TF Hub (coming soon!)

Implement a new BPE tokenizer for RoBERTa and XLM models. This tokenizer uses the custom tokens from Tokenizer or RegexTokenizer to generate token pieces, then encodes and decodes the results

Welcoming new Databricks runtimes to our Spark NLP family:

Databricks 8.1 ML & GPU

Databricks 8.2 ML & GPU

Databricks 8.3 ML & GPU

Welcoming a new EMR 6.x series to our Spark NLP family:

EMR 6.3.0 (Apache Spark 3.1.1 / Hadoop 3.2.1)

Added examples to Spark NLP Scaladoc
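For readers unfamiliar with the BPE tokenizer mentioned in the list above, the following is a minimal sketch of how byte-pair encoding turns a token into pieces, encodes them to ids, and decodes them back. It is a simplified character-level illustration with a made-up merge table and vocabulary. It is not the Spark NLP implementation, and real RoBERTa tokenizers work at byte level with learned merges.

```python
def bpe_encode(word, merges):
    """Split a word into pieces by applying merge rules in priority order.

    merges is an ordered list of symbol pairs; earlier pairs have higher priority.
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Made-up merge table and vocabulary, for illustration only.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]
VOCAB = {"low": 0, "er": 1}
INVERSE_VOCAB = {index: piece for piece, index in VOCAB.items()}

pieces = bpe_encode("lower", MERGES)
ids = [VOCAB[piece] for piece in pieces]
decoded = "".join(INVERSE_VOCAB[i] for i in ids)
print(pieces, ids, decoded)  # ['low', 'er'] [0, 1] lower
```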

Models and Pipelines

Spark NLP 3.1.0 comes with 2600+ new pretrained models and pipelines in 200+ languages, available for Windows, Linux, and macOS users.

Featured Transformers

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| BertEmbeddings | bert_base_dutch_cased | 3.1.0 | nl |
| BertEmbeddings | bert_base_german_cased | 3.1.0 | de |
| BertEmbeddings | bert_base_german_uncased | 3.1.0 | de |
| BertEmbeddings | bert_base_italian_cased | 3.1.0 | it |
| BertEmbeddings | bert_base_italian_uncased | 3.1.0 | it |
| BertEmbeddings | bert_base_turkish_cased | 3.1.0 | tr |
| BertEmbeddings | bert_base_turkish_uncased | 3.1.0 | tr |
| BertEmbeddings | chinese_bert_wwm | 3.1.0 | zh |
| BertEmbeddings | bert_base_chinese | 3.1.0 | zh |
| DistilBertEmbeddings | distilbert_base_cased | 3.1.0 | en |
| DistilBertEmbeddings | distilbert_base_uncased | 3.1.0 | en |
| DistilBertEmbeddings | distilbert_base_multilingual_cased | 3.1.0 | xx |
| RoBertaEmbeddings | roberta_base | 3.1.0 | en |
| RoBertaEmbeddings | roberta_large | 3.1.0 | en |
| RoBertaEmbeddings | distilroberta_base | 3.1.0 | en |
| XlmRoBertaEmbeddings | xlm_roberta_base | 3.1.0 | xx |
| XlmRoBertaEmbeddings | twitter_xlm_roberta_base | 3.1.0 | xx |

Featured Translation Models

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| MarianTransformer | Chinese to Vietnamese | 3.1.0 | xx |
| MarianTransformer | Chinese to Ukrainian | 3.1.0 | xx |
| MarianTransformer | Chinese to Dutch | 3.1.0 | xx |
| MarianTransformer | Chinese to English | 3.1.0 | xx |
| MarianTransformer | Chinese to Finnish | 3.1.0 | xx |
| MarianTransformer | Chinese to Italian | 3.1.0 | xx |
| MarianTransformer | Yoruba to English | 3.1.0 | xx |
| MarianTransformer | Yapese to French | 3.1.0 | xx |
| MarianTransformer | Waray to Spanish | 3.1.0 | xx |
| MarianTransformer | Ukrainian to English | 3.1.0 | xx |
| MarianTransformer | Hindi to Urdu | 3.1.0 | xx |
| MarianTransformer | Italian to Ukrainian | 3.1.0 | xx |
| MarianTransformer | Italian to Icelandic | 3.1.0 | xx |

Transformers in Spark NLP

Import hundreds of models in different languages to Spark NLP

| Spark NLP | HuggingFace Notebooks |
|:------------|:-------------|
| BertEmbeddings | HuggingFace in Spark NLP - BERT |
| BertSentenceEmbeddings | HuggingFace in Spark NLP - BERT Sentence |
| DistilBertEmbeddings | HuggingFace in Spark NLP - DistilBERT |
| RoBertaEmbeddings | HuggingFace in Spark NLP - RoBERTa |
| XlmRoBertaEmbeddings | HuggingFace in Spark NLP - XLM-RoBERTa |

The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.

Backward compatibility

We have updated our MarianTransformer annotator to be compatible with TF v2 models. This change is not compatible with previous models/pipelines. However, we have updated and uploaded all the models and pipelines for 3.1.x release. You can either use MarianTransformer.pretrained(MODEL_NAME) and it will automatically download the compatible model or you can visit Models Hub to download the compatible models for offline use via MarianTransformer.load(PATH)

Documentation

HuggingFace to Spark NLP

Models Hub with new models

Spark NLP publications

Spark NLP in Action

Spark NLP documentation

Spark NLP Workshop notebooks

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.1.0

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.0

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.0

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.0

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp_2.12</artifactId> <version>3.1.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu_2.12</artifactId> <version>3.1.0</version> </dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark24_2.11</artifactId> <version>3.1.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark24_2.11</artifactId> <version>3.1.0</version> </dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-spark23_2.11</artifactId> <version>3.1.0</version> </dependency>

spark-nlp-gpu:

<dependency> <groupId>com.johnsnowlabs.nlp</groupId> <artifactId>spark-nlp-gpu-spark23_2.11</artifactId> <version>3.1.0</version> </dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.1.0.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.1.0.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.1.0.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.1.0.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.1.0.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.1.0.jar

3.0.3 (May 6, 2021)
Overview

We are glad to release Spark NLP 3.0.3! We have added some new features to our T5 Transformer annotator to help with longer and more accurate text generation, trained some new multi-lingual models and pipelines in Farsi, Hebrew, Korean, and Turkish, and fixed some bugs in this release.

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

Add 6 new features to T5Transformer for longer and better text generation

doSample: Whether or not to use sampling; use greedy decoding otherwise

temperature: The value used to modulate the next-token probabilities

topK: The number of highest-probability vocabulary tokens to keep for top-k filtering

topP: If set to a float < 1, only the most probable tokens whose probabilities add up to topP or higher are kept for generation

repetitionPenalty: The parameter for repetition penalty. 1.0 means no penalty. See the paper "CTRL: A Conditional Transformer Language Model for Controllable Generation" for more details

noRepeatNgramSize: If set to int > 0, all ngrams of that size can only occur once

Spark NLP 3.0.3 is compatible with the new Databricks 8.2 (ML) runtime

Spark NLP 3.0.3 is compatible with the new EMR 5.33.0 (with Zeppelin 0.9.0) release
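To make the T5Transformer generation parameters listed above more concrete, here is a minimal pure-Python sketch of what temperature, topK, topP, and noRepeatNgramSize conventionally do to a next-token distribution. It is an illustration under stated assumptions, not Spark NLP's implementation: the function names, the order in which the filters are applied, and the toy logits are all hypothetical, and repetitionPenalty is left out.

```python
import math


def filter_next_token_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p; return {token_id: prob}.

    Hypothetical helper for illustration only (not a Spark NLP API).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # temperature: scale logits before the softmax (<1 sharpens, >1 flattens)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = {i: w / total for i, w in enumerate(weights)}

    # Token ids ranked from most to least probable
    ranked = sorted(probs, key=probs.get, reverse=True)

    # topK: keep only the k most probable tokens, then renormalise
    if top_k > 0:
        ranked = ranked[:top_k]
        kept = sum(probs[i] for i in ranked)
        probs = {i: probs[i] / kept for i in ranked}

    # topP: keep the smallest prefix whose cumulative probability reaches top_p
    if top_p < 1.0:
        nucleus, cumulative = [], 0.0
        for i in ranked:
            nucleus.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        kept = sum(probs[i] for i in nucleus)
        probs = {i: probs[i] / kept for i in nucleus}

    return probs


def greedy_token(logits):
    """What decoding falls back to when sampling is off: pick the argmax."""
    return max(range(len(logits)), key=logits.__getitem__)


def banned_next_tokens(generated, ngram_size):
    """Tokens that would repeat an n-gram of the given size already generated."""
    if ngram_size <= 0:
        return set()
    # The last (n - 1) generated tokens form the prefix of the next n-gram
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = generated[start:start + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned
```

With sampling enabled, the next token would be drawn from the dictionary returned by `filter_next_token_probs` (for example with `random.choices`), after removing any ids returned by `banned_next_tokens`.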

Bug Fixes

Fix ChunkEmbeddings Array out of bounds exception https://github.com/JohnSnowLabs/spark-nlp/pull/2796

Fix pretrained tfhub_use_multi and tfhub_use_multi_lg models in UniversalSentenceEncoder https://github.com/JohnSnowLabs/spark-nlp/pull/2827

Fix anchorDateMonth in Python, which resulted in 1 additional month, and fix case sensitivity in some relative dates such as "next friday" or "next Friday" https://github.com/JohnSnowLabs/spark-nlp/pull/2848

Models and Pipelines

New multilingual models and pipelines for Farsi, Hebrew, Korean, and Turkish

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| ClassifierDLModel | classifierdl_bert_news | 3.0.2 | tr |
| UniversalSentenceEncoder | tfhub_use_multi | 3.0.0 | xx |
| UniversalSentenceEncoder | tfhub_use_multi_lg | 3.0.0 | xx |

| Pipeline | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| PretrainedPipeline | recognize_entities_dl | 3.0.0 | fa |
| PretrainedPipeline | explain_document_lg | 3.0.2 | he |
| PretrainedPipeline | explain_document_lg | 3.0.2 | ko |

The complete list of all 1100+ models & pipelines in 192+ languages is available on Models Hub.

Documentation and Notebooks

Add a new Offline section to docs

Installing Spark NLP and Spark OCR in air-gapped networks (offline mode)

Models Hub with new models

Spark NLP publications

Spark NLP in Action

Spark NLP documentation

Spark NLP Workshop notebooks

Spark NLP training certification notebooks for Google Colab and Databricks

Spark NLP Display for visualization of different types of annotations

Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI
pip install spark-nlp==3.0.3

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.0.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.0.3

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.0.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.0.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.0.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.0.3

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.0.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.0.3
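The package coordinates above follow a regular naming pattern: an optional -gpu part, an optional -spark24 or -spark23 part, the Scala version, then the release. As a rough sketch, a small helper can assemble the coordinate for a --packages argument. The function name and the "3.x" / "2.4" / "2.3" keys are hypothetical, and the pattern is inferred from the coordinates used in the Maven section of these notes rather than from any official specification.

```python
# Hypothetical lookup: Spark line -> (artifact suffix, Scala version),
# inferred from the coordinates listed in these release notes.
_SPARK_VARIANTS = {
    "3.x": ("", "2.12"),
    "2.4": ("-spark24", "2.11"),
    "2.3": ("-spark23", "2.11"),
}


def spark_nlp_coordinate(version, spark="3.x", gpu=False):
    """Build a Maven coordinate string such as the ones passed to --packages."""
    if spark not in _SPARK_VARIANTS:
        raise ValueError(f"unsupported Spark line: {spark!r}")
    spark_suffix, scala = _SPARK_VARIANTS[spark]
    artifact = "spark-nlp" + ("-gpu" if gpu else "") + spark_suffix
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{version}"


if __name__ == "__main__":
    # Print the pyspark command for the GPU build on Apache Spark 2.4.x
    print("pyspark --packages " + spark_nlp_coordinate("3.0.3", spark="2.4", gpu=True))
```

Keeping the variants in a single table means an unsupported combination fails loudly instead of producing a coordinate that does not resolve.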

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.0.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.0.3</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.0.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.0.3</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.0.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.0.3</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.0.3.jar

GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.0.3.jar

CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.0.3.jar

GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.0.3.jar

CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.0.3.jar

GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.0.3.jar
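For offline or air-gapped setups, where the FAT JAR has to be fetched ahead of time, the six URLs above can be generated rather than copied by hand. The following is a minimal sketch that only mirrors the URL pattern visible in this list; the function name and the "3.x" / "2.4" / "2.3" keys are hypothetical, and it does not check that a given JAR actually exists on the server.

```python
# Base location taken from the FAT JAR links in these release notes
_JAR_BASE = "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars"

# Hypothetical keys for the three Spark lines that have separate assemblies
_ASSEMBLY_SUFFIX = {"3.x": "", "2.4": "-spark24", "2.3": "-spark23"}


def fat_jar_url(version, spark="3.x", gpu=False):
    """Return the assembly JAR URL for a release, Spark line, and CPU/GPU build."""
    if spark not in _ASSEMBLY_SUFFIX:
        raise ValueError(f"unsupported Spark line: {spark!r}")
    name = "spark-nlp" + ("-gpu" if gpu else "") + _ASSEMBLY_SUFFIX[spark]
    return f"{_JAR_BASE}/{name}-assembly-{version}.jar"


if __name__ == "__main__":
    # List every FAT JAR URL for one release
    for spark_line in ("3.x", "2.4", "2.3"):
        for use_gpu in (False, True):
            print(fat_jar_url("3.0.3", spark=spark_line, gpu=use_gpu))
```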



Comments


Description

Apache Spark version: 2.4.4
Spark NLP version: 2.7.5
sentence_detector_dl download started this may take some time.

Expected Behavior

Current Behavior

Possible Solution