SynapseML - an open source library to simplify the creation of scalable machine learning pipelines

Overview

Release Notes Scala Docs PySpark Docs Academic Paper

SynapseML (previously MMLSpark) is an open source library that simplifies the creation of scalable machine learning pipelines. SynapseML builds on Apache Spark and SparkML to enable new kinds of machine learning, analytics, and model deployment workflows. It adds many deep learning and data science tools to the Spark ecosystem, including seamless integration of Spark Machine Learning pipelines with the Open Neural Network Exchange (ONNX), LightGBM, The Cognitive Services, Vowpal Wabbit, and OpenCV. These tools enable powerful and highly scalable predictive and analytical models for a variety of data sources.

SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. For production-grade deployment, the Spark Serving project enables high-throughput, sub-millisecond-latency web services backed by your Spark cluster.

SynapseML requires Scala 2.12, Spark 3.2+, and Python 3.6+. See the API documentation for Scala and for PySpark.
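The version requirements above can be checked before installing anything. The following is a minimal sketch, not part of SynapseML: the helper names parse_version and check_requirements are hypothetical, and the version strings are assumed to be supplied by the caller (for example from spark.version in an existing session).

```python
import itertools
import sys

# Minimum versions taken from the requirements line above.
MIN_SPARK = (3, 2)
REQUIRED_SCALA = (2, 12)
MIN_PYTHON = (3, 6)


def parse_version(text, parts=2):
    """Return the first `parts` numeric components of a dotted version string."""
    numbers = []
    for piece in text.strip().split(".")[:parts]:
        # Keep only the leading digits so suffixes like "1-SNAPSHOT" are tolerated.
        leading = "".join(itertools.takewhile(str.isdigit, piece))
        numbers.append(int(leading) if leading else 0)
    return tuple(numbers)


def check_requirements(spark_version, scala_version, python_version=None):
    """Return a list of problems; an empty list means every check passed."""
    if python_version is None:
        # Default to the interpreter running this script.
        python_version = "{}.{}".format(*sys.version_info[:2])
    problems = []
    if parse_version(spark_version) < MIN_SPARK:
        problems.append("Spark %s is older than 3.2" % spark_version)
    if parse_version(scala_version) != REQUIRED_SCALA:
        problems.append("Scala %s is not a 2.12 release" % scala_version)
    if parse_version(python_version) < MIN_PYTHON:
        problems.append("Python %s is older than 3.6" % python_version)
    return problems


if __name__ == "__main__":
    for problem in check_requirements("3.2.1", "2.12.15"):
        print(problem)
```

Comparing tuples of integers rather than raw strings avoids the pitfall where "3.10" sorts before "3.6" lexicographically.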

Features

• Vowpal Wabbit on Spark: Fast, Sparse, and Effective Text Analytics
• The Cognitive Services for Big Data: Leverage the Microsoft Cognitive Services at Unprecedented Scales in your existing SparkML pipelines
• LightGBM on Spark: Train Gradient Boosted Machines with LightGBM
• Spark Serving: Serve any Spark Computation as a Web Service with Sub-Millisecond Latency
• HTTP on Spark: An Integration Between Spark and the HTTP Protocol, enabling Distributed Microservice Orchestration
• ONNX on Spark: Distributed and Hardware Accelerated Model Inference on Spark
• Responsible AI: Understand Opaque-box Models and Measure Dataset Biases
• Spark Binding Autogeneration: Automatically Generate Spark bindings for PySpark and SparklyR
• Isolation Forest on Spark: Distributed Nonlinear Outlier Detection
• CyberML: Machine Learning Tools for Cyber Security
• Conditional KNN: Scalable KNN Models with Conditional Queries

Documentation and Examples

For quickstarts, documentation, demos, and examples please see our website.

Setup and installation

Python

To try out SynapseML on a Python (or Conda) installation, install Spark via pip with pip install pyspark. You can then start pyspark with the --packages option described under "Spark package", or configure the package from Python:

import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
            .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.5") \
            .getOrCreate()
import synapse.ml

SBT

If you are building a Spark application in Scala, add the following line to your build.sbt:

libraryDependencies += "com.microsoft.azure" % "synapseml_2.12" % "0.9.5"

Spark package

SynapseML can be conveniently installed on existing Spark clusters via the --packages option, for example:

spark-shell --packages com.microsoft.azure:synapseml_2.12:0.9.5
pyspark --packages com.microsoft.azure:synapseml_2.12:0.9.5
spark-submit --packages com.microsoft.azure:synapseml_2.12:0.9.5 MyApp.jar
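All three commands share the same Maven coordinate, which follows the group:artifact_scalaVersion:version pattern. As a minimal sketch (the helper names synapseml_coordinate and packages_command are hypothetical, not part of any SynapseML tooling), the coordinate and command lines can be generated so that the version is stated in one place:

```python
# Tools that accept the --packages option in the examples above.
SUPPORTED_TOOLS = ("spark-shell", "pyspark", "spark-submit")


def synapseml_coordinate(version="0.9.5", scala_version="2.12"):
    """Build the Maven coordinate used throughout this README."""
    return "com.microsoft.azure:synapseml_{}:{}".format(scala_version, version)


def packages_command(tool, version="0.9.5", app=None):
    """Return a command line that loads SynapseML via --packages."""
    if tool not in SUPPORTED_TOOLS:
        raise ValueError("unsupported tool: %r" % tool)
    parts = [tool, "--packages", synapseml_coordinate(version)]
    if app:
        # spark-submit takes the application jar or script as a trailing argument.
        parts.append(app)
    return " ".join(parts)


if __name__ == "__main__":
    for tool in SUPPORTED_TOOLS:
        print(packages_command(tool, app="MyApp.jar" if tool == "spark-submit" else None))
```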

This can be used in other Spark contexts too. For example, you can use SynapseML in AZTK by adding it to the .aztk/spark-defaults.conf file.

Databricks

To install SynapseML on the Databricks cloud, create a new library from Maven coordinates in your workspace.

For the coordinates use: com.microsoft.azure:synapseml_2.12:0.9.5 with the resolver: https://mmlspark.azureedge.net/maven. Ensure this library is attached to your target cluster(s).

Finally, ensure that your Spark cluster has at least Spark 3.2 and Scala 2.12. If you encounter Netty dependency issues please use DBR 10.1.

You can use SynapseML in both your Scala and PySpark notebooks. To get started with our example notebooks, import the following Databricks archive:

https://mmlspark.blob.core.windows.net/dbcs/SynapseMLExamplesv0.9.5.dbc

Apache Livy and HDInsight

To install SynapseML from within a Jupyter notebook served by Apache Livy, use the following configure magic. You will need to start a new session after this configure cell is executed.

Excluding certain packages from the library may be necessary due to current issues with Livy 0.5:

%%configure -f
{
    "name": "synapseml",
    "conf": {
        "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.9.5",
        "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12"
    }
}

In Azure Synapse, "spark.yarn.user.classpath.first" should be set to "true" to override the existing SynapseML packages:

%%configure -f
{
    "name": "synapseml",
    "conf": {
        "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.9.5",
        "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.12,org.scalactic:scalactic_2.12,org.scalatest:scalatest_2.12",
        "spark.yarn.user.classpath.first": "true"
    }
}
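The two configure cells differ only in the extra classpath setting for Azure Synapse. A minimal sketch of generating either cell from Python follows; the function name livy_configure_cell and its parameters are hypothetical, while the configuration keys and values are copied from the cells above.

```python
import json

# Packages excluded in the configure cells above.
EXCLUDES = [
    "org.scala-lang:scala-reflect",
    "org.apache.spark:spark-tags_2.12",
    "org.scalactic:scalactic_2.12",
    "org.scalatest:scalatest_2.12",
]


def livy_configure_cell(version="0.9.5", azure_synapse=False):
    """Return the text of a %%configure cell that loads SynapseML."""
    conf = {
        "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:" + version,
        "spark.jars.excludes": ",".join(EXCLUDES),
    }
    if azure_synapse:
        # Lets the user-supplied package take precedence over the preinstalled one.
        conf["spark.yarn.user.classpath.first"] = "true"
    body = json.dumps({"name": "synapseml", "conf": conf}, indent=4)
    return "%%configure -f\n" + body


if __name__ == "__main__":
    print(livy_configure_cell(azure_synapse=True))
```

Building the JSON with json.dumps rather than by hand avoids the trailing-comma and quoting mistakes that are easy to make when editing the cell manually.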

Docker

The easiest way to evaluate SynapseML is via our pre-built Docker container. To do so, run the following command:

docker run -it -p 8888:8888 -e ACCEPT_EULA=yes mcr.microsoft.com/mmlspark/release

Navigate to http://localhost:8888/ in your web browser to run the sample notebooks. See the documentation for more on Docker use.

To read the EULA for using the docker image, run docker run -it -p 8888:8888 mcr.microsoft.com/mmlspark/release eula

GPU VM Setup

SynapseML can be used to train deep learning models on GPU nodes from a Spark application. See the instructions for setting up an Azure GPU VM.

Building from source

SynapseML has recently transitioned to a new build infrastructure. For detailed developer docs, please see the Developer Readme.

If you are an existing SynapseML developer, you will need to reconfigure your development setup. We now support platform-independent development and integrate better with IntelliJ and SBT. If you encounter issues, please reach out to our support email!

R (Beta)

To try out SynapseML using the R autogenerated wrappers see our instructions. Note: This feature is still under development and some necessary custom wrappers may be missing.

Papers

Learn More

Contributing & feedback

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

See CONTRIBUTING.md for contribution guidelines.

To give feedback and/or report an issue, open a GitHub Issue.

Other relevant projects

Apache®, Apache Spark, and Spark® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

Comments
  • feat: R test gen

    Related Issues/PRs

    What changes are proposed in this pull request?

    This PR introduces test generation for the R interface to SynapseML, similar to that for Python and Dotnet.

    Briefly describe the changes included in this Pull Request. We generate tests for models in the R language that call the generated R functions that in turn call Scala code.

    How is this patch tested?

    We generate tests and run them.

    Does this PR change any dependencies?

    • [ ] No. You can skip this section.
    • [x] Yes. Make sure the dependencies are resolved correctly, and list changes here. We have new dependencies on the R packages jsonlite and mlflow. Dependencies on sparklyr and testthat were already present.

    Does this PR add a new feature? If so, have you added samples on website?

    • [x] No. You can skip this section.
    • [ ] Yes. Make sure you have added samples following below steps.

    AB#1898553

    opened by niehaus59 170
  • feat: Remove CNTK functionality and replace with ONNX

    What changes are proposed in this pull request?

    Currently the ImageFeaturizer uses CNTK models. This PR replaces this underlying dependency with ONNX models, effectively removing usage of CNTK from the library.

    Associated changes:

    • Create ONNXHub for referring to predefined SynapseML models in the cloud
    • Directly use ONNX-ML protobuf classes for splitting models at internal nodes

    How is this patch tested?

    TODO

    • [ ] I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

    Does this PR change any dependencies?

    TODO

    • [ ] No. You can skip this section.
    • [ ] Yes. Make sure the dependencies are resolved correctly, and list changes here.

    Does this PR add a new feature? If so, have you added samples on website?

    TODO

    • [ ] No. You can skip this section.
    • [ ] Yes. Make sure you have added samples following below steps.

    TODO

    1. Find the corresponding markdown file for your new feature in website/docs/documentation folder. Make sure you choose the correct class estimators/transformers and namespace.
    2. Follow the pattern in markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
    3. Make sure the DocTable points to correct API link.
    4. Navigate to website folder, and run yarn run start to make sure the website renders correctly.
    5. Don't forget to add <!--pytest-codeblocks:cont--> before each python code blocks to enable auto-tests for python samples.
    6. Make sure the WebsiteSamplesTests job pass in the pipeline.

    AB#1910261

    opened by svotaw 127
  • java.net.ConnectException: Connection refused (Connection refused) with LightGBMClassifier in Databricks

    I am trying to run this example with my own dataset on Databricks:

    https://github.com/microsoft/recommenders/blob/master/notebooks/02_model/mmlspark_lightgbm_criteo.ipynb

    My cluster is configured to scale from 2 to 10 worker nodes; each worker has 28 GB of memory and 8 cores. At the beginning of my notebook I set the following properties, but they do not seem to affect the notebook environment:

    spark.conf.set("spark.executor.memory", "80g")
    spark.conf.set("spark.driver.maxResultSize", "6g")

    For the LightGBMClassifier I am using the library Azure:mmlspark:0.16. My dataset has 1,502,306 rows and 9 columns. It is a Spark dataframe, the result of 3 joins between 3 SQL tables (converted to Spark dataframes with spark.sql()). I apply a feature_processor step to encode the categorical columns. Then, after setting the LightGBMClassifier parameters, I train the model. My LightGBMClassifier parameters are:

    NUM_LEAVES = 8
    NUM_ITERATIONS = 20
    LEARNING_RATE = 0.1
    FEATURE_FRACTION = 0.8
    EARLY_STOPPING_ROUND = 5

    # Model name
    MODEL_NAME = 'lightgbm_criteo.mml'

    lgbm = LightGBMClassifier(
        labelCol="kategorie1",
        featuresCol="features",
        objective="multiclass",
        isUnbalance=True,
        boostingType="gbdt",
        boostFromAverage=True,
        baggingSeed=3,  # previously 42
        numLeaves=NUM_LEAVES,
        numIterations=NUM_ITERATIONS,
        learningRate=LEARNING_RATE,
        featureFraction=FEATURE_FRACTION,
        earlyStoppingRound=EARLY_STOPPING_ROUND,
        timeout=1200.0
        #parallelism='data_parallel'
    )

    I applied the repartition trick as well before training the model:

    train = train.repartition(50)
    train.rdd.getNumPartitions()

    Then, when I run

    model = lgbm.fit(train)

    I get the following error:

    Py4JJavaError: An error occurred while calling o1125.fit.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 36.0 failed 4 times, most recent failure: Lost task 10.3 in stage 36.0 (TID 3493, 10.139.64.10, executor 7): java.net.ConnectException: Connection refused (Connection refused)
    	at java.net.PlainSocketImpl.socketConnect(Native Method)
    	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    	at java.net.Socket.connect(Socket.java:589)
    	at java.net.Socket.connect(Socket.java:538)
    	at java.net.Socket.<init>(Socket.java:434)
    	at java.net.Socket.<init>(Socket.java:211)
    	at com.microsoft.ml.spark.TrainUtils$.getNodes(TrainUtils.scala:178)
    	at com.microsoft.ml.spark.TrainUtils$$anonfun$5.apply(TrainUtils.scala:211)
    	at com.microsoft.ml.spark.TrainUtils$$anonfun$5.apply(TrainUtils.scala:205)
    	at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29)
    	at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:204)
    	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
    	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
    	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:200)
    	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:197)
    	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:852)
    	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:852)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
    	at org.apache.spark.scheduler.Task.run(Task.scala:112)
    	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1481)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)

    Driver stacktrace:
    	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2355)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2343)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2342)
    	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2342)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1096)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1096)
    	at scala.Option.foreach(Option.scala:257)
    	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1096)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2574)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2510)
    	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:893)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2240)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2338)
    	at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1051)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    	at org.apache.spark.rdd.RDD.withScope(RDD.scala:379)
    	at org.apache.spark.rdd.RDD.reduce(RDD.scala:1033)
    	at org.apache.spark.sql.Dataset$$anonfun$reduce$1.apply(Dataset.scala:1650)
    	at org.apache.spark.sql.Dataset$$anonfun$withNewRDDExecutionId$1.apply(Dataset.scala:3409)
    	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:99)
    	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:228)
    	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:85)
    	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:158)
    	at org.apache.spark.sql.Dataset.withNewRDDExecutionId(Dataset.scala:3405)
    	at org.apache.spark.sql.Dataset.reduce(Dataset.scala:1649)
    	at com.microsoft.ml.spark.LightGBMClassifier.train(LightGBMClassifier.scala:85)
    	at com.microsoft.ml.spark.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    	at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    	at py4j.Gateway.invoke(Gateway.java:295)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.GatewayConnection.run(GatewayConnection.java:251)
    	at java.lang.Thread.run(Thread.java:748)
    Caused by: java.net.ConnectException: Connection refused (Connection refused)
    	at java.net.PlainSocketImpl.socketConnect(Native Method)
    	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    	at java.net.Socket.connect(Socket.java:589)
    	at java.net.Socket.connect(Socket.java:538)
    	at java.net.Socket.<init>(Socket.java:434)
    	at java.net.Socket.<init>(Socket.java:211)
    	at com.microsoft.ml.spark.TrainUtils$.getNodes(TrainUtils.scala:178)
    	at com.microsoft.ml.spark.TrainUtils$$anonfun$5.apply(TrainUtils.scala:211)
    	at com.microsoft.ml.spark.TrainUtils$$anonfun$5.apply(TrainUtils.scala:205)
    	at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29)
    	at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:204)
    	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
    	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
    	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:200)
    	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:197)
    	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:852)
    	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:852)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
    	at org.apache.spark.scheduler.Task.run(Task.scala:112)
    	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1481)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	... 1 more

    I really want to understand the reason behind this error and to try any suggestions you can offer. I have been stuck on this problem for 2 weeks. I have read about many similar errors and implemented some of the suggestions, such as increasing the cluster memory, configuring spark.executor.memory, and repartitioning the data, but I still cannot train the LightGBMClassifier with my input data.

    bug high priority area/lightgbm 
    opened by emnajaoua 86
  • feat: add singleton dataset mode for faster performance and use old sparse dataset create method to reduce memory usage

    This PR adds a single (or "singleton") dataset mode to the LightGBM learners. Users can enable the new mode by setting the parameter useSingleDatasetMode=True (it is false by default). In this mode, each executor creates a single LightGBMDataset. Currently, by default, each task within an executor creates its own dataset:

    image

    In this PR, a new mode is added to only create one dataset per executor:

    image

    This means that there is lower network communication overhead since fewer nodes are initialized and more parallelization is done within the machine in the native code with default number of threads. This also seems to reduce memory usage significantly for some datasets.

    Note in most cluster configurations there is usually only one executor per machine anyway.

    In performance tests, we've found this mode sometimes outperforms the default in certain scenarios, both in terms of memory and execution time.

    On a sparse dataset with 9 GB of data and large parameter values (num_leaves=768, num_trees=1000, min_data_in_leaf=15000, max_bin=512) and 5 machines with 8 cores and 28 GB of RAM, runtime was 17.54 minutes with this new mode. When specifying tasks=5 it took 106 minutes and in default mode it failed with OOM.

    However, in other scenarios the default mode is faster. On the dense Higgs dataset (4 GB) with default parameters and 8 workers with 14 GB of memory and 4 cores each, the default run took 54 seconds, while the new single dataset mode took 1.1 minutes, which was a bit slower. (The single dataset run used to take 2 minutes; a recent optimization that moved the dataset conversion code to native sped it up considerably.)

    For this reason we will keep this mode as non-default for now as we continue to do more benchmarking/experimentation.
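To make the difference between the two modes concrete, here is a minimal sketch that counts how many LightGBM datasets (and therefore network nodes) each mode would create. It is an illustration only: the function name lightgbm_dataset_count is hypothetical, and it assumes one executor per machine and one task per core, which the description above says is the usual configuration but which is not guaranteed.

```python
def lightgbm_dataset_count(num_executors, tasks_per_executor, use_single_dataset_mode=False):
    """Count the datasets created across the cluster for one training run.

    Default mode: every task creates its own dataset.
    Single dataset mode: every executor creates exactly one dataset.
    """
    if num_executors < 1 or tasks_per_executor < 1:
        raise ValueError("executor and task counts must be positive")
    if use_single_dataset_mode:
        return num_executors
    return num_executors * tasks_per_executor


if __name__ == "__main__":
    # Sparse benchmark cluster from the description: 5 machines with 8 cores each.
    # Assumption: one executor per machine and one task per core.
    print(lightgbm_dataset_count(5, 8))
    print(lightgbm_dataset_count(5, 8, use_single_dataset_mode=True))
```

Under these assumptions the 5-machine, 8-core cluster goes from 40 datasets to 5, which is the reduction in initialized nodes that the description credits for the lower network overhead.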

    opened by imatiach-msft 82
  • feat: new LIME and KernelSHAP explainers

    In this PR, we rewrote the LIME explainers and added KernelSHAP explainers in the com.microsoft.ml.spark.explainers package.

    New features:

    • KernelSHAP explainer for tabular, vector, image and text models.
    • LIME explainer now supports kernel width and sample weights.
    • Both explainers support categorical variables (in the tabular explainer).
    • Both explainers report r-squared metric from the underlying regression model.
    • Both explainers support explaining multiple classes output in one run.
    • For tabular and vector models, both explainers support passing in a background dataframe. ~~If one is not given, the dataframe used for local interpretation will be used as background data.~~
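For readers unfamiliar with the kernel width parameter, the sketch below shows how a kernel width typically turns a sample's distance from the instance being explained into a sample weight. It uses the exponential kernel from the original LIME reference implementation; whether the explainers in this PR use exactly this formula is an assumption, and the function name lime_kernel_weight is hypothetical.

```python
import math


def lime_kernel_weight(distance, kernel_width):
    """Exponential kernel: sqrt(exp(-distance^2 / kernel_width^2)).

    Samples close to the explained instance get weights near 1;
    distant samples get weights near 0.
    """
    if kernel_width <= 0:
        raise ValueError("kernel_width must be positive")
    return math.sqrt(math.exp(-(distance ** 2) / kernel_width ** 2))


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0, 2.0):
        print(d, round(lime_kernel_weight(d, kernel_width=0.75), 4))
```

A larger kernel width flattens the curve, so distant perturbed samples influence the local regression model more; a smaller width makes the explanation more local.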

    Sample notebooks will be included in the next PR.

    opened by memoryz 74
  • feat: Support deep vision

    Summary

    Support common DNN models for deep vision classification.

    Tests

    Added unit test.

    Dependency changes

    Added requirements.txt file to explain dependencies for python package.

    opened by serena-ruan 67
  • test: Add more E2E test to pipeline

    Related Issues/PRs

    What changes are proposed in this pull request?

    Adding tests for synapse extension.

    How is this patch tested?

    • [x] I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

    Does this PR change any dependencies?

    • [x] No. You can skip this section.
    • [ ] Yes. Make sure the dependencies are resolved correctly, and list changes here.

    Does this PR add a new feature? If so, have you added samples on website?

    • [x] No. You can skip this section.
    • [ ] Yes. Make sure you have added samples following below steps.
    1. Find the corresponding markdown file for your new feature in website/docs/documentation folder. Make sure you choose the correct class estimators/transformers and namespace.
    2. Follow the pattern in markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
    3. Make sure the DocTable points to correct API link.
    4. Navigate to website folder, and run yarn run start to make sure the website renders correctly.
    5. Don't forget to add <!--pytest-codeblocks:cont--> before each python code blocks to enable auto-tests for python samples.
    6. Make sure the WebsiteSamplesTests job pass in the pipeline.
    opened by k-rush 62
  • feat: Adding a new param for explicitly setting slot names.

    It is related to #740

    Although it is possible in mmlspark to assign feature names to a dataset through the Spark dataframe's metadata, I sometimes want to set the names more simply, without handling the Spark dataframe.

    If a developer sets the slotNames param, the feature names are taken from it. If not, they are read from the Spark dataframe's metadata.
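The fallback rule described above can be sketched in a few lines. This is an illustration of the rule only, not the PR's Scala implementation: the function name resolve_feature_names and the plain-list representation of the dataframe metadata are assumptions.

```python
def resolve_feature_names(slot_names, metadata_names):
    """Pick feature names: explicit slotNames win, metadata is the fallback."""
    if slot_names:
        # The user set the slotNames param explicitly.
        return list(slot_names)
    # Otherwise use whatever names the dataframe metadata carries (possibly none).
    return list(metadata_names or [])


if __name__ == "__main__":
    print(resolve_feature_names(["age", "income"], ["f0", "f1"]))
    print(resolve_feature_names(None, ["f0", "f1"]))
```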

    opened by ocworld 52
  • feat: Add LightGBM streaming execution mode

    Summary

    Add the streaming execution mode to LightGBM wrapper. This mode uses almost no memory on top of what LightGBM needs to execute.

    Tests

    Tests will be modified to run in both bulk and streaming mode before checkin. Before this PR was pushed, LightGBMClassifier tests were all passing for streaming mode (the bulk of the tests). There are also some new tests just for streaming components and instrumentation in Common.

    Dependency changes

    This PR cannot be checked in without corresponding changes in native LightGBM library, or pointing to custom upload. https://github.com/microsoft/LightGBM/pull/5299

    AB#1891953

    opened by svotaw 50
  • feat: Add support for ContextualBandit in the VW module

    • Adds VowpalWabbitContextualBandit, VowpalWabbitContextualBanditModel, ColumnVectorSequencer classes
    • Update com.github.vowpalwabbit dependency version for CB support
    • Add tests in Scala and Python for the new functionality
    • Other featurizer improvements from @eisber
    opened by jackgerrits 44
  • feat: Causal DoubleMLEstimator (#8)

    What changes are proposed in this pull request?

    Add package 'com.microsoft.azure.synapse.ml.causal' and implementation LinearDMLEstimator

    How is this patch tested?

    • [x] I have written tests

    Does this PR change any dependencies?

    • [x] No.

    Does this PR add a new feature? If so, have you added samples on website?

    • [x] Yes.
    opened by dylanw-oss 42
  • fix: fix annamespace import for Experimental (#1780)

    Related Issues/PRs

    I did not find any related PRs

    #xxx

    What changes are proposed in this pull request?

    In the isolation forest example notebook, udf is used to convert a vector to an array: vec2array = udf(lambda vec: vec.toArray().tolist(), ArrayType(FloatType())). However, udf is not imported, which results in an error when the notebook is run. This can be resolved with the following change: vec2array = F.udf(lambda vec: vec.toArray().tolist(), ArrayType(FloatType()))

    How is this patch tested?

    By running the notebook.

    • [ ] I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

    Does this PR change any dependencies?

    • [x] No. You can skip this section.
    • [ ] Yes. Make sure the dependencies are resolved correctly, and list changes here.

    Does this PR add a new feature? If so, have you added samples on website?

    • [x] No. You can skip this section.
    • [ ] Yes. Make sure you have added samples following below steps.
    1. Find the corresponding markdown file for your new feature in website/docs/documentation folder. Make sure you choose the correct class estimators/transformers and namespace.
    2. Follow the pattern in markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
    3. Make sure the DocTable points to correct API link.
    4. Navigate to website folder, and run yarn run start to make sure the website renders correctly.
    5. Don't forget to add <!--pytest-codeblocks:cont--> before each python code blocks to enable auto-tests for python samples.
    6. Make sure the WebsiteSamplesTests job pass in the pipeline.
    opened by pawarbi 1
  • [BUG]java.lang.Exception: Network init call failed in LightGBM with error: Binding port 12412 failed

    SynapseML version

    0.10.2

    System information

    • Language version: scala 2.12
    • Spark Version: 3.1.1
    • Spark Platform : yarn cluster.

    Describe the problem

    Network exceptions sometimes occur:

    23/01/05 14:07:26 WARN TaskSetManager: Lost task 60.0 in stage 28.0 (TID 17418) (node74-53-153-bdxs.qiyi.hadoop executor 12): java.lang.Exception: Network init call failed in LightGBM with error: Binding port 12412 failed
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:18)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:192)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:126)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    

    Code to reproduce issue

    import com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier

    // `config` and `trainData` are defined elsewhere in the job.
    val lgbClassifier = new LightGBMClassifier()
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setObjective("binary")
      .setIsUnbalance(config.getBoolean("LGB.IS_UNBALANCE"))
      .setBoostingType(config.getString("LGB.BOOSTING_TYPE"))
      .setOtherRate(config.getDouble("LGB.GOSS_RATE"))
      .setZeroAsMissing(config.getBoolean("LGB.ZERO_AS_MISSING"))
      .setLearningRate(config.getDouble("LGB.ETA"))
      .setFeatureFraction(config.getDouble("LGB.FEATURE_FRACTION"))
      .setNegBaggingFraction(config.getDouble("LGB.NEG_BAGGING_FRACTION"))
      .setBaggingFreq(config.getInt("LGB.BAGGING_FREQ"))
      .setBaggingSeed(777)
      .setPassThroughArgs(config.getString("LGB.OTHER_ARGS"))
      .setDropRate(0.1)
      .setLambdaL1(config.getDouble("LGB.REG_L1"))
      .setLambdaL2(config.getDouble("LGB.REG_L2"))
      .setMaxDepth(config.getInt("LGB.MAX_DEPTH"))
      .setMaxBin(config.getInt("LGB.MAX_BIN"))
      .setMinDataInLeaf(config.getInt("LGB.MIN_DATA_IN_LEAF"))
      .setMinGainToSplit(1e-5)
      .setNumTasks(config.getInt("LGB.NUM_TASKS"))
      .setNumThreads(config.getInt("LGB.NUM_THREADS"))
      .setNumIterations(config.getInt("LGB.NUM_ROUND"))
      .setEarlyStoppingRound(config.getInt("LGB.EARLY_STOP_ROUND"))
      .setVerbosity(100)
      .setMetric("auc")
      .setValidationIndicatorCol("is_valid")
      .setIsProvideTrainingMetric(true)
      .setIsEnableSparse(true)
      .setUseSingleDatasetMode(true)
      .setParallelism("data_parallel")
      .setUseBarrierExecutionMode(true)

    // Repartition to the same count passed to setNumTasks above.
    val lgbModel = lgbClassifier.fit(trainData.repartition(config.getInt("LGB.NUM_TASKS")))
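
    Since the failure is "Binding port 12412 failed", one thing that may help narrow it down is checking whether that port (and its neighbours) can be bound at all on the affected node. Below is a minimal diagnostic sketch using only the JDK's `java.net.ServerSocket`. It is not part of SynapseML; `PortCheck`, `canBind`, and the scanned range are hypothetical names and values chosen for illustration. The base port of 12400 is an assumption about the default listen port and should be adjusted to whatever the job actually uses.

```scala
import java.net.{InetSocketAddress, ServerSocket}
import scala.util.Try

// Hypothetical helper, not part of SynapseML: probes whether a TCP port
// can currently be bound on the local host.
object PortCheck {

  /** Returns true if a server socket could be bound to `port` right now. */
  def canBind(port: Int): Boolean =
    Try {
      val socket = new ServerSocket() // unbound socket, so options can be set first
      try {
        socket.setReuseAddress(false)
        socket.bind(new InetSocketAddress(port))
      } finally {
        socket.close() // always release the port again
      }
    }.isSuccess

  def main(args: Array[String]): Unit = {
    // Assumed base port and range width; both are illustrative only.
    val basePort = 12400
    val ports = basePort until (basePort + 20)
    ports.foreach { p =>
      println(s"port $p bindable: ${canBind(p)}")
    }
  }
}
```

    Running this on the host that lost the task (node74-53-153-bdxs in the log above) while no training job is active would show whether another process already holds the port. A result of `true` only means the port was free at that moment; it does not rule out two tasks on the same host racing for it during training.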
    

    Other info / logs

    23/01/05 14:07:26 WARN TaskSetManager: Lost task 60.0 in stage 28.0 (TID 17418) (node74-53-153-bdxs.qiyi.hadoop executor 12): java.lang.Exception: Network init call failed in LightGBM with error: Binding port 12412 failed
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:18)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:192)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:126)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    23/01/05 14:07:26 INFO DAGScheduler: Marking ResultStage 28 (collect at LightGBMBase.scala:595) as failed due to a barrier task failed.
    23/01/05 14:07:26 INFO YarnClusterScheduler: Killing all running tasks in stage 28: Task ResultTask(28, 60) from barrier stage ResultStage 28 (collect at LightGBMBase.scala:595) failed.
    23/01/05 14:07:26 INFO DAGScheduler: ResultStage 28 (collect at LightGBMBase.scala:595) failed in 61.860 s due to Stage failed because barrier task ResultTask(28, 60) finished unsuccessfully.
    java.lang.Exception: Network init call failed in LightGBM with error: Binding port 12412 failed
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:18)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:192)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:126)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    23/01/05 14:07:26 INFO DAGScheduler: Job 9 failed: collect at LightGBMBase.scala:595, took 282.907703 s
    23/01/05 14:07:26 INFO DAGScheduler: Resubmitting ResultStage 28 (collect at LightGBMBase.scala:595) due to barrier stage failure.
    23/01/05 14:07:26 ERROR LightGBMClassifier: {"buildVersion":"0.10.2","className":"class com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier","method":"train","uid":"LightGBMClassifier_229249a7b970"}
    org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(28, 60) finished unsuccessfully.
    java.lang.Exception: Network init call failed in LightGBM with error: Binding port 12412 failed
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:18)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:192)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:126)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
    	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1968)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2443)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
    	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
    	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks(LightGBMBase.scala:595)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks$(LightGBMBase.scala:583)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.executePartitionTasks(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining(LightGBMBase.scala:573)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining$(LightGBMBase.scala:545)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.executeTraining(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch(LightGBMBase.scala:435)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch$(LightGBMBase.scala:392)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.trainOneDataBatch(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$train$2(LightGBMBase.scala:61)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb(BasicLogging.scala:62)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb$(BasicLogging.scala:59)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain(BasicLogging.scala:48)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain$(BasicLogging.scala:47)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logTrain(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:42)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:35)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    	at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
    	at com.iqiyi.ads.algo.dmp.model.BaseModel.buildLgbModel(BaseModel.scala:207)
    	at com.iqiyi.ads.algo.dmp.model.BaseModel.trainAndPredict(BaseModel.scala:285)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.$anonfun$run$1(IndustryTagModel.scala:62)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.$anonfun$run$1$adapted(IndustryTagModel.scala:59)
    	at scala.collection.Iterator.foreach(Iterator.scala:943)
    	at scala.collection.Iterator.foreach$(Iterator.scala:943)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.run(IndustryTagModel.scala:59)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel$.main(IndustryTagModel.scala:79)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.main(IndustryTagModel.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
    23/01/05 14:07:26 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(28, 60) finished unsuccessfully.
    java.lang.Exception: Network init call failed in LightGBM with error: Binding port 12412 failed
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:18)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:192)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:126)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(28, 60) finished unsuccessfully.
    java.lang.Exception: Network init call failed in LightGBM with error: Binding port 12412 failed
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:18)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:192)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:126)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
    	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1968)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2443)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
    	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
    	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks(LightGBMBase.scala:595)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks$(LightGBMBase.scala:583)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.executePartitionTasks(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining(LightGBMBase.scala:573)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining$(LightGBMBase.scala:545)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.executeTraining(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch(LightGBMBase.scala:435)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch$(LightGBMBase.scala:392)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.trainOneDataBatch(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$train$2(LightGBMBase.scala:61)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb(BasicLogging.scala:62)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb$(BasicLogging.scala:59)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain(BasicLogging.scala:48)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain$(BasicLogging.scala:47)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logTrain(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:42)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:35)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    	at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
    	at com.iqiyi.ads.algo.dmp.model.BaseModel.buildLgbModel(BaseModel.scala:207)
    	at com.iqiyi.ads.algo.dmp.model.BaseModel.trainAndPredict(BaseModel.scala:285)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.$anonfun$run$1(IndustryTagModel.scala:62)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.$anonfun$run$1$adapted(IndustryTagModel.scala:59)
    	at scala.collection.Iterator.foreach(Iterator.scala:943)
    	at scala.collection.Iterator.foreach$(Iterator.scala:943)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.run(IndustryTagModel.scala:59)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel$.main(IndustryTagModel.scala:79)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.main(IndustryTagModel.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
    23/01/05 14:07:26 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(28, 60) finished unsuccessfully.
    java.lang.Exception: Network init call failed in LightGBM with error: Binding port 12412 failed
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:18)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:192)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:126)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
    	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1968)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2443)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
    	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
    	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks(LightGBMBase.scala:595)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks$(LightGBMBase.scala:583)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.executePartitionTasks(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining(LightGBMBase.scala:573)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining$(LightGBMBase.scala:545)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.executeTraining(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch(LightGBMBase.scala:435)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch$(LightGBMBase.scala:392)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.trainOneDataBatch(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$train$2(LightGBMBase.scala:61)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb(BasicLogging.scala:62)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb$(BasicLogging.scala:59)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain(BasicLogging.scala:48)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain$(BasicLogging.scala:47)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logTrain(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:42)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:35)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    	at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
    	at com.iqiyi.ads.algo.dmp.model.BaseModel.buildLgbModel(BaseModel.scala:207)
    	at com.iqiyi.ads.algo.dmp.model.BaseModel.trainAndPredict(BaseModel.scala:285)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.$anonfun$run$1(IndustryTagModel.scala:62)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.$anonfun$run$1$adapted(IndustryTagModel.scala:59)
    	at scala.collection.Iterator.foreach(Iterator.scala:943)
    	at scala.collection.Iterator.foreach$(Iterator.scala:943)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.run(IndustryTagModel.scala:59)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel$.main(IndustryTagModel.scala:79)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.main(IndustryTagModel.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
    )
    23/01/05 14:07:26 INFO PrometheusSink: metricsNamespace=None, sparkAppName=Some(online_user_biz_interest_tag_model_game_3224), sparkAppId=Some(application_1672821343673_128705), executorId=Some(driver), sparkMetricsAppId=None
    23/01/05 14:07:26 INFO PrometheusSink: role=driver, job=application_1672821343673_128705
    23/01/05 14:07:27 INFO DAGScheduler: Resubmitting failed stages
    23/01/05 14:07:27 INFO SparkContext: Invoking stop() from shutdown hook
    23/01/05 14:07:27 INFO SparkUI: Stopped Spark web UI at http://node74-46-97-bdxs.qiyi.hadoop:38380
    23/01/05 14:07:27 INFO PrometheusSink: metricsNamespace=None, sparkAppName=Some(online_user_biz_interest_tag_model_game_3224), sparkAppId=Some(application_1672821343673_128705), executorId=Some(driver), sparkMetricsAppId=None
    23/01/05 14:07:27 INFO PrometheusSink: role=driver, job=application_1672821343673_128705
    23/01/05 14:07:27 INFO YarnClusterSchedulerBackend: Shutting down all executors
    23/01/05 14:07:27 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
    23/01/05 14:07:27 INFO ShuffleWriteClientImpl: Successfully send heartbeat to Coordinator grpc client ref to journalnode02-bdxs-g1.qiyi.hadoop:21000
    23/01/05 14:07:27 INFO ShuffleWriteClientImpl: Successfully send heartbeat to Coordinator grpc client ref to journalnode01-bdxs-g1.qiyi.hadoop:21000
    23/01/05 14:07:27 INFO RssShuffleManager: Finish send heartbeat to coordinator and servers
    23/01/05 14:07:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    23/01/05 14:07:27 WARN NioEventLoop: Selector.select() returned prematurely 512 times in a row; rebuilding Selector io.netty.channel.nio.SelectedSelectionKeySetSelector@7740d5ee.
    23/01/05 14:07:27 INFO NioEventLoop: Migrated 5 channel(s) to the new Selector.
    23/01/05 14:07:27 WARN NioEventLoop: Selector.select() returned prematurely 512 times in a row; rebuilding Selector io.netty.channel.nio.SelectedSelectionKeySetSelector@6385cf2e.
    23/01/05 14:07:27 INFO NioEventLoop: Migrated 3 channel(s) to the new Selector.
    23/01/05 14:07:27 INFO MemoryStore: MemoryStore cleared
    23/01/05 14:07:27 INFO BlockManager: BlockManager stopped
    23/01/05 14:07:27 INFO BlockManagerMaster: BlockManagerMaster stopped
    23/01/05 14:07:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    23/01/05 14:07:27 INFO SparkContext: Successfully stopped SparkContext
    23/01/05 14:07:27 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(28, 60) finished unsuccessfully.
    java.lang.Exception: Network init call failed in LightGBM with error: Binding port 12412 failed
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.validate(LightGBMUtils.scala:18)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:192)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.NetworkManager$.initLightGBMNetwork(NetworkManager.scala:203)
    	at com.microsoft.azure.synapse.ml.lightgbm.BasePartitionTask.mapPartitionTask(BasePartitionTask.scala:126)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$executePartitionTasks$1(LightGBMBase.scala:589)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:131)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2259)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2208)
    	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2207)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2207)
    	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1968)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2443)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2388)
    	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2377)
    	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
    	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
    	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks(LightGBMBase.scala:595)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks$(LightGBMBase.scala:583)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.executePartitionTasks(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining(LightGBMBase.scala:573)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining$(LightGBMBase.scala:545)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.executeTraining(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch(LightGBMBase.scala:435)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch$(LightGBMBase.scala:392)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.trainOneDataBatch(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$train$2(LightGBMBase.scala:61)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb(BasicLogging.scala:62)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logVerb$(BasicLogging.scala:59)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain(BasicLogging.scala:48)
    	at com.microsoft.azure.synapse.ml.logging.BasicLogging.logTrain$(BasicLogging.scala:47)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logTrain(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:42)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:35)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
    	at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
    	at com.iqiyi.ads.algo.dmp.model.BaseModel.buildLgbModel(BaseModel.scala:207)
    	at com.iqiyi.ads.algo.dmp.model.BaseModel.trainAndPredict(BaseModel.scala:285)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.$anonfun$run$1(IndustryTagModel.scala:62)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.$anonfun$run$1$adapted(IndustryTagModel.scala:59)
    	at scala.collection.Iterator.foreach(Iterator.scala:943)
    	at scala.collection.Iterator.foreach$(Iterator.scala:943)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.run(IndustryTagModel.scala:59)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel$.main(IndustryTagModel.scala:79)
    	at com.iqiyi.ads.algo.dmp.model.IndustryTagModel.main(IndustryTagModel.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)
    )
    23/01/05 14:07:27 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
    

    What component(s) does this bug affect?

    • [ ] area/cognitive: Cognitive project
    • [ ] area/core: Core project
    • [ ] area/deep-learning: DeepLearning project
    • [X] area/lightgbm: Lightgbm project
    • [ ] area/opencv: Opencv project
    • [ ] area/vw: VW project
    • [ ] area/website: Website
    • [ ] area/build: Project build system
    • [ ] area/notebooks: Samples under notebooks folder
    • [ ] area/docker: Docker usage
    • [ ] area/models: models related issue

    What language(s) does this bug affect?

    • [X] language/scala: Scala source code
    • [ ] language/python: Pyspark APIs
    • [ ] language/r: R APIs
    • [ ] language/csharp: .NET APIs
    • [ ] language/new: Proposals for new client languages

    What integration(s) does this bug affect?

    • [ ] integrations/synapse: Azure Synapse integrations
    • [ ] integrations/azureml: Azure ML integrations
    • [ ] integrations/databricks: Databricks integrations
    bug triage 
    opened by shexuan 2
  • [BUG] publishBlob doesn't work with reading secret from environment variable

    [BUG] publishBlob doesn't work with reading secret from environment variable

    SynapseML version

    0.10.2

    System information

    • Language version (e.g. python 3.8, scala 2.12):
    • Spark Version (e.g. 3.2.3):
    • Spark Platform (e.g. Synapse, Databricks):

    Describe the problem

    1. Before executing the "publishBlob" command, set the storage secret in an environment variable:

       ~ export STORAGE-KEY=
       export: not valid in this context: STORAGE-KEY

    The problem is that a hyphen is not allowed in an environment variable name. All environment variable names are defined in https://github.com/microsoft/SynapseML/blob/master/project/Secrets.scala, and it would be better to switch them to use underscores.

    For example:

    val StorageKeyEnvVarName: String = "STORAGE-KEY" ==> val StorageKeyEnvVarName: String = "STORAGE_KEY"
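
The naming constraint behind this report can be illustrated with a small sketch. It is written in Python for brevity and is not code from SynapseML: `is_portable_env_name`, `to_portable_env_name`, and `read_secret` are hypothetical helpers, and the pattern assumes the common shell rule that a variable name contains only letters, digits, and underscores and does not start with a digit.

```python
import os
import re

# Assumed rule: letters, digits, and underscores only, no leading digit.
_PORTABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_portable_env_name(name: str) -> bool:
    """Return True if a typical shell would accept `name` in `export name=...`."""
    return _PORTABLE_NAME.fullmatch(name) is not None


def to_portable_env_name(name: str) -> str:
    """Replace every disallowed character (such as '-') with an underscore."""
    candidate = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if candidate and candidate[0].isdigit():
        candidate = "_" + candidate
    return candidate


def read_secret(name: str, environ=os.environ):
    """Hypothetical lookup: try the portable spelling first, then the original."""
    return environ.get(to_portable_env_name(name), environ.get(name))


if __name__ == "__main__":
    for name in ("STORAGE-KEY", "STORAGE_KEY"):
        print(name, is_portable_env_name(name), to_portable_env_name(name))
```

Checking the portable spelling first and falling back to the original name is one possible way to rename a variable without breaking environments that already set the old one through a mechanism other than a shell `export`.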

    Code to reproduce issue

    N/A

    Other info / logs

    No response

    What component(s) does this bug affect?

    • [ ] area/cognitive: Cognitive project
    • [ ] area/core: Core project
    • [ ] area/deep-learning: DeepLearning project
    • [ ] area/lightgbm: Lightgbm project
    • [ ] area/opencv: Opencv project
    • [ ] area/vw: VW project
    • [ ] area/website: Website
    • [ ] area/build: Project build system
    • [ ] area/notebooks: Samples under notebooks folder
    • [ ] area/docker: Docker usage
    • [ ] area/models: models related issue

    What language(s) does this bug affect?

    • [ ] language/scala: Scala source code
    • [ ] language/python: Pyspark APIs
    • [ ] language/r: R APIs
    • [ ] language/csharp: .NET APIs
    • [ ] language/new: Proposals for new client languages

    What integration(s) does this bug affect?

    • [ ] integrations/synapse: Azure Synapse integrations
    • [ ] integrations/azureml: Azure ML integrations
    • [ ] integrations/databricks: Databricks integrations
    by design 
    opened by dylanw-oss 2
  • docs: Added more up-to-date ONNX docs

    docs: Added more up-to-date ONNX docs

    Related Issues/PRs

    #xxx

    What changes are proposed in this pull request?

    Added docs for new ONNX content.

    How is this patch tested?

    • [ ] I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

    Does this PR change any dependencies?

    • [ ] No. You can skip this section.
    • [ ] Yes. Make sure the dependencies are resolved correctly, and list changes here.

    Does this PR add a new feature? If so, have you added samples on website?

    • [ ] No. You can skip this section.
    • [ ] Yes. Make sure you have added samples following the steps below.
    1. Find the corresponding markdown file for your new feature in the website/docs/documentation folder. Make sure you choose the correct class estimators/transformers and namespace.
    2. Follow the pattern in the markdown file and add another section for your new API, including PySpark, Scala (and potentially .NET) samples.
    3. Make sure the DocTable points to the correct API link.
    4. Navigate to the website folder and run yarn run start to make sure the website renders correctly.
    5. Don't forget to add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples.
    6. Make sure the WebsiteSamplesTests job passes in the pipeline.
    opened by svotaw 9
  • docs: Add docs for LightGBM execution mode

    docs: Add docs for LightGBM execution mode

    Related Issues/PRs

    #xxx

    What changes are proposed in this pull request?

    Briefly describe the changes included in this Pull Request.

    How is this patch tested?

    • [ ] I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

    Does this PR change any dependencies?

    • [ ] No. You can skip this section.
    • [ ] Yes. Make sure the dependencies are resolved correctly, and list changes here.

    Does this PR add a new feature? If so, have you added samples on website?

    • [ ] No. You can skip this section.
    • [ ] Yes. Make sure you have added samples following the steps below.
    1. Find the corresponding markdown file for your new feature in the website/docs/documentation folder. Make sure you choose the correct class estimators/transformers and namespace.
    2. Follow the pattern in the markdown file and add another section for your new API, including PySpark, Scala (and potentially .NET) samples.
    3. Make sure the DocTable points to the correct API link.
    4. Navigate to the website folder and run yarn run start to make sure the website renders correctly.
    5. Don't forget to add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples.
    6. Make sure the WebsiteSamplesTests job passes in the pipeline.
    opened by svotaw 12
  • feat: add aad authentication support for cognitive services

    feat: add aad authentication support for cognitive services

    Related Issues/PRs

    #xxx

    What changes are proposed in this pull request?

    Briefly describe the changes included in this Pull Request.

    How is this patch tested?

    • [ ] I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

    Does this PR change any dependencies?

    • [ ] No. You can skip this section.
    • [ ] Yes. Make sure the dependencies are resolved correctly, and list changes here.

    Does this PR add a new feature? If so, have you added samples on website?

    • [ ] No. You can skip this section.
    • [ ] Yes. Make sure you have added samples following the steps below.
    1. Find the corresponding markdown file for your new feature in the website/docs/documentation folder. Make sure you choose the correct class estimators/transformers and namespace.
    2. Follow the pattern in the markdown file and add another section for your new API, including PySpark, Scala (and potentially .NET) samples.
    3. Make sure the DocTable points to the correct API link.
    4. Navigate to the website folder and run yarn run start to make sure the website renders correctly.
    5. Don't forget to add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples.
    6. Make sure the WebsiteSamplesTests job passes in the pipeline.
    opened by serena-ruan 9
Releases(v0.10.2)
  • v0.10.2(Nov 22, 2022)

    v0.10.2

    Bug Fixes 🐞

    • remove Vowpal Wabbit exclusion, add Interpretability exclusion (#1708)
    • remove synapse E2E testing exclusion - cyber ml (#1699)
    • update isolation forest notebook (#1696)
    • don't throw on invalid columns in DropColumns (#1695)
    • fix pyarrow failure in deeplearning test (#1689)
    • fix linked service on cog service base (#1685)
    • fix Uplift Modelling style
    • KernelSHAP throws error when the key type in the ZipMap output is LongType (#1656)
    • fix flaky translate tests (#1643)
    • update ubuntu to 20.04 in pipeline (#1624)

    Build 🏭

    • bump actions/checkout from 2 to 3 (#1737)
    • bump loader-utils from 2.0.2 to 2.0.3 in /website (#1709)
    • bump amannn/action-semantic-pull-request from 5.0.1 to 5.0.2 (#1688)
    • bump amannn/action-semantic-pull-request from 4 to 5.0.1 (#1680)

    Documentation 📘

    • update developer readme instruction on python env creation (#1693)
    • fix multiple typos and update error hintings in ai-samples-timeseries notebook (#1663)
    • improve error msg to make it clearer for users and fix typos (#1662)
    • simplify data downloading and add mlflow to uplift modelling (#1659)
    • move magic command forward since it restarts interpreter
    • remove unused docs and fix links
    • improve example notebooks
    • add aisample uplift modelling (#1640)
    • fix command to launch jupyter notebook (#1649)
    • add mlflow in ai samples time series forecasting (#1645)
    • add mlflow logging and loading (#1641)
    • update spark version in Readme
    • improve readme overview
    • add aisample on text classification (#1617)

    Features 🌈

    • add simple deep learning text classifier (#1591)
    • Add SpeakerEmotionInference transformer for generating SSML t… (#1691)
    • Deprecate CNTK objects (#1712)
    • Remove CNTK functionality and replace with ONNX (#1593)
    • R test generation (#1586)

    Maintenance 🔧

    • bump version to 0.10.2 (#1738)
    • fix style (#1736)
    • automate clean-acr with github action workflow (#1735)
    • autodelete old models (#1729)
    • Making secrets optional and cached (#1726)
    • add secret scanning infrastructure (#1724)
    • Move new ImageFeaturizer to onnx namespace (#1711)
    • ScalaStyle fixes (#1716)
    • update scalatest and scalactic (#1706)
    • remove synapse test exclusions (#1698)
    • pin az and python versions (#1705)
    • fix ado integration (#1704)
    • remove notebooks (#1703)
    • fix reopen comment action
    • fix reopen on comment workflow
    • fix typo in issue reopen yaml
    • re open github issues after a comment (#1676)
    • clean up github workflows and add issue label remover (#1674)
    • turn off failing synapse tests temporarily (#1658)
    • added synapse-internal to platform detector function (#1651)
    • publish test jars
    • improve test coverage (#1631)
    • Remove MVAD's dependence on hardwired credentials and azure SDKs (#1629)
    • clean up TextAnalytics cog service APIs (#1622)

    Testing 💚

    • Additional E2E testing infrastructure (#1727)
    • Improve ONNXtests reliability (#1713)

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML.

    Changes:

    • cd1d2ea65ffcd0f89bf1fee231c430560508bcce chore: bump version to 0.10.2 (#1738)
    • fd78889112b8d48927ddac1b660f62400bb1ba12 build: bump actions/checkout from 2 to 3 (#1737)
    • c806ba79d17afa6edf53c9ea55e563f6842a6825 chore: fix style (#1736)
    • e6b5a90352b7456333df92dc9f7755b6cb8f300b feat: add simple deep learning text classifier (#1591)
    • 1de2d558996fed8ff12a312a52d53c39c3322fcb chore: automate clean-acr with github action workflow (#1735)
    • 952d1bd3e0a4b7755d8aa3d069ff90ff626c7b17 clarify date comparisons when deleting old models/groups (#1733)
    • 6ea02bd81ef09647119f4c36583f3a523dd7c795 chore: autodelete old models (#1729)
    • 8b02e1d31751eef8b01b87ee590f1b3169e6371a chore: Making secrets optional and cached (#1726)
    • c62c6ad441c18d98354099dbd39448d4b0734c58 test: Additional E2E testing infrastructure (#1727)
    • aeb2ff7ecde6180fc546e2596d8121c71be49505 feat: Add SpeakerEmotionInference transformer for generating SSML t… (#1691)
    • 0b96cc5bd5b26b677f2bd1ff7371c631c3984ee4 chore: add secret scanning infrastructure (#1724)
    • 2a7a67ba373d3dd69c7ac4cd2f4118d1980eac06 feat: Deprecate CNTK objects (#1712)
    • e38e3ad30c6bb10cbafa4351974aea8c2b8ebf37 chore: Move new ImageFeaturizer to onnx namespace (#1711)
    • 0ff6802377328cb8875f7c60305da77074fa1771 test: Improve ONNXtests reliability (#1713)
    • fe4c5d27a8d35aa90c5a70383b028de652247d48 chore: ScalaStyle fixes (#1716)
    • 050b541e8b74c09d63abfdb2ad05d7582bd06f29 build: bump loader-utils from 2.0.2 to 2.0.3 in /website (#1709)
    • f2e88fdea7c1010118913eecf0457b5daf881d25 feat: Remove CNTK functionality and replace with ONNX (#1593)
    • abdfe19e79ca0533c168366aca96b8082a44a8db fix: remove Vowpal Wabbit exclusion, add Interpretability exclusion (#1708)
    • 6a1f994812234ba861bcae0c96fb11f185eec261 chore: update scalatest and scalactic (#1706)
    • 144674fdb9537e6dbad2817dee95d8c26eaf3fa5 chore: remove synapse test exclusions (#1698)
    • 32c654b83781c6e028fdd23be47e083d1decb8e7 chore: pin az and python versions (#1705)
    • c8fba2831d34338b5c82079a77b77963ceab32b6 chore: fix ado integration (#1704)
    • 92d409574376a1570ab49cc82a60c7efbc9fd1de chore: remove notebooks (#1703)
    • a9537809d877be4c2eec83b9a55c46a942026d41 fix: remove synapse E2E testing exclusion - cyber ml (#1699)
    • b257c70562b312e03ca8fb566d71582601725800 fix: update isolation forest notebook (#1696)
    • 9120b056920b54647b75c2c640a8e9e87c919969 using predictionCol for isolation forest (#1686) [ #1060 ]
    • 448f6b7ca81d0e806e06410a5035bd5edff2ad6e Remove trident.mlflow APIs. (#1687)
    • f4af33f719844d419083e6204bb58ee14b6de133 fix: don't throw on invalid columns in DropColumns (#1695)
    • c531bbbfc93ccee3a3cc167060411941d3635e1b docs: update developer readme instruction on python env creation (#1693)
    • 467e651dd814213bbfe4c13e5bc1b6dac7fd86ee build: bump amannn/action-semantic-pull-request from 5.0.1 to 5.0.2 (#1688)
    • 302831ffd8cec84f0e24de6eef4c193d3ed0966a fix: fix pyarrow failure in deeplearning test (#1689)
    • e857511e21e471829048650955823ecfe8e1e89d fix: fix linked service on cog service base (#1685)
    • f29318a274610dda543ee1422bdbd74cdb6a752a build: bump amannn/action-semantic-pull-request from 4 to 5.0.1 (#1680)
    • 50ac0c8aa7149637396700b8ccf16a422eb732ed Update reopen-issue-on-comment.yml
    • c9278b5c1c2225c6b0f48bcd21996a1a584cd1f9 chore: fix reopen comment action
    • b3a9ba9ca84e7af257664f81858301c9af03fc61 chore: fix reopen on comment workflow
    • 9fe273b8665d9c70b5f0375a9b8e603d65143874 chore: fix typo in issue reopen yaml
    • a7c50de2e905b55d64be9fdc413f012fdaecfb27 chore: re open github issues after a comment (#1676)
    • 8914750ac804fe9072b327e4cecfc6c382cd52f2 chore: clean up github workflows and add issue label remover (#1674)
    • 965231a98c7bc32151dc93f6c1a399c4b6ba4c76 docs: fix multiple typos and update error hintings in ai-samples-timeseries notebook (#1663)
    • 4fa7249966386fdc86edb99ea1ea665ef1643c94 docs: improve error msg to make it clearer for users and fix typos (#1662)
    • dd9e5d24a570f5735c2e048b35d0fb06aa887000 fix: fix Uplift Modelling style
    • 5a52aef842eaeb36796fc4fc825e9198c371229f docs: simplify data downloading and add mlflow to uplift modelling (#1659)
    • 95f451ab3f1b13d635c69b797f153610c2902ea4 chore: turn off failing synapse tests temporarily (#1658)
    • 76d73826de969c36a3e71f884bf0dd7258beb7f0 fix: KernelSHAP throws error when the key type in the ZipMap output is LongType (#1656)
    • e703ad4605e387e711de4e0ee3d9919d57e46674 chore: added synapse-internal to platform detector function (#1651)
    • ca358e369a20fdbbfc43339cb0fb09d481cfe16a docs: move magic command forward since it restarts interpreter
    • 3a160b395dae8d7d3af528b4d6180dc4d7737dd6 docs: remove unused docs and fix links
    • d5a499720a4cd7a536ac755fbe1af2f355a27bc1 docs: improve example notebooks
    • a7d097a7057d4363a5de4d1b9173867df60522b2 chore: publish test jars
    • b7c8cf10b7b70fd90f45fe0e81a4bf7d8b58b4eb docs: add aisample uplift modelling (#1640)
    • c8750ce83a6884ddb7503c361c53c6d67fab86e8 docs: fix command to launch jupyter notebook (#1649)
    • 8d552746951ab4190e82dda1d4f04699fc46c69f docs: add mlflow in ai samples time series forecasting (#1645)
    • d751a52b7e61d460bac15b01d5f2680274cbde77 fix: fix flaky translate tests (#1643)
    • 59a922b4c7aa73f4a1f540b9f17ecc8f46c55a86 docs: add mlflow logging and loading (#1641)
    • 4115d4f0f2ea5210b9eafd777ff7dc6f4567a7fb Create .acrolinx-config.edn
    • 64fecca3f51ec6df6753edde0fba23ab87127a3e docs: update spark version in Readme
    • 32037ecf357917d4270d7a2e7deec7074be91c4b docs: improve readme overview
    • 289bd974275a7df6c0dfc79fb9a20156a14c3c7e remove extra packages installation in pythontests (#1633)
    • 4878686cf58696bc276f810912f7a8667a2bcff0 feat: R test generation (#1586)
    • 1381db524e9a48ba2d0463d7c1bfb8b057c9fc61 chore: improve test coverage (#1631)
    • e700fd146a3c19aefac442b5f91f2b27600da938 chore: Remove MVAD's dependence on hardwired credentials and azure SDKs (#1629)
    • d5ee8e747aeb0edc42a9c1e6b448503717bb1b1c fix: update ubuntu to 20.04 in pipeline (#1624)
    • dbbe6814c6f82793f549cf798bce16a06b4abcc6 chore: clean up TextAnalytics cog service APIs (#1622)
    • d98ac02c989492307e924f286cd5a7f3be767241 docs: add aisample on text classification (#1617)

    This list of changes was auto generated.

  • v0.10.1(Aug 23, 2022)

    SynapseML v0.10.1

    Bug Fixes 🐞

    • fix speechToTextSuite serializationFuzzing failure (#1626)
    • fix translator endpoint and update all endpoints for gov regions (#1623)
    • binder runtime issues (#1598)
    • clean up cluster if databricks tests pass (#1599)
    • fix deep-learning test flakiness (#1600)
    • update dotnetTestBase assembly version (#1601)
    • fix flaky forms test (#1584)

    Build 🏭

    • bump EnricoMi/publish-unit-test-result-action from 1 to 2 (#1609)
    • bump actions/setup-node from 2 to 3 (#1610)
    • bump actions/setup-python from 2.3.2 to 4.2.0 (#1611)
    • bump actions/setup-java from 2 to 3 (#1612)
    • simplify e2e test pipeline with test matrix

    Documentation 📘

    • add aisample notebooks into community folder (#1606)
    • add aisample time series forecasting (#1614)
    • fix .NET logo on website (#1604)
    • improve OpenAI notebook (#1596)
    • pin mybinder to v0.10.0 to avoid thrashing
    • add demo into videos on website (#1581)
    • update installation guidance of v0.10.0 (#1578)
    • add more .net samples (#1570)
    • add dotnet installation & example doc (#1567)
    • Update issue template

    Features 🌈

    • add stale bot for issues (#1602)
    • Support grayscale images in toNDArray (#1592)
    • Add the descriptionExcludes parameter to AnalyzeImage (#1590)
    • Added the DeepVisionClassifier a simple API for deep transfer learning and fine-tuning of a variety of vision backbones (#1518)

    Maintenance 🔧

    • bump to v0.10.1 (#1628)
    • deprecate old Text analytics APIs to prepare for refactoring (#1627)
    • remove deprecated lime APIs (#1620)
    • update openai service to the official deployment, and disable test due to outage (#1619)
    • Auto update GitHub actions with dependabot (#1608)
    • hotfix binder badge
    • pin binder version for users (#1607)
    • Bump spark to 3.2.2
    • bump spark version
    • Format welcome message with emojis (#1583)
    • Add welcome message to new PRs/Issues (#1573)
    • Add GH workflow to label new/reopened issues (#1571)
    • update website (#1566)

    Testing 💚

    • stabilize unit tests (#1576)

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML.

    Changes:

    • 0f54bc65e720ac89f1d4c04502bc2cb5c6310db7 chore: bump to v0.10.1 (#1628)
    • 3d0f3f466afde7f69d91cdf27a97e629ac6dad91 chore: deprecate old Text analytics APIs to prepare for refactor (#1627)
    • 2052e13b82f6ccc94848359f8070f24d5f06de6c chore: remove deprecated lime APIs (#1620)
    • 09213b010e658833148026d242268e2eb0482b17 fix: fix speechToTextSuite serializationFuzzing failure (#1626)
    • 9f78bf0074ebf3668fb7e1d5d18a681ec236f988 fix: fix translator endpoint and update all endpoints for gov regions (#1623)
    • 7e90d190bd3f96869fe176c4eefe2bc417fe34fe docs: add aisample notebooks into community folder (#1606)
    • ac40e5af5d2a7bd1f6c0a25b12b2e07e6ff92c2e chore: update openai service to official, and disable test due to outage (#1619)
    • f54f7f68a4cb6596af5975d4caa20ea0ead798b2 docs: add aisample time series forecasting (#1614)
    • 7b4b0e1c066e1ec7c3bff719e12b991d8193a25e build: bump EnricoMi/publish-unit-test-result-action from 1 to 2 (#1609)
    • 43b0d1714954b1d160d1e0a11c64a52226d6825a build: bump actions/setup-node from 2 to 3 (#1610)
    • c48a07a97493269799571c0df8a06654c414c247 build: bump actions/setup-python from 2.3.2 to 4.2.0 (#1611)
    • b1a331c3c61fd89c147e2b0be65b9b23056eba9b build: bump actions/setup-java from 2 to 3 (#1612)
    • 78e40cb37cab655d6d94b4df8690d1302154d019 chore: Auto update github actions with dependabot (#1608)
    • 69d2d202439187f862bee59cac93b99e19ce0a4d chore: hotfix binder badge
    • 93d7ccf7a782d89ac157d6e1c87ea3f55d11b886 chore: pin binder version for users (#1607)
    • c7a61ecd57f9962590be3a075f586578a7fa3e13 fix: binder runtime issues (#1598)
    • c960c06b8534e6b0013f4b2107d262fd4be62472 docs: fix .NET logo on website (#1604)
    • 28a35b43ea7685e2e70ffc84c1bd39c6f7866176 fix: clean up cluster if databricks tests pass (#1599)
    • 5a28740881a7298d783ed922b7746b7fc3d7c77b fix: fix deep-learning test flakiness (#1600)
    • adf1a61d19a4493a39b063dc2abddc16a8b1bbe6 fix: update dotnetTestBase assembly version (#1601)
    • c659b330342cdde38340b2c488d4b9bc8b2df58b feat: add stale bot for issues (#1602)
    • 05a420257c25167c300a9a7c6e13f5674e4fba9a docs: improve OpenAI notebook (#1596)
    • e019756ae7534cc1cdf81a8b24f8224b92855bdc feat: Support gray scale images in toNDArray (#1592)
    • 51beaa0e462d5f7edc5f32242e0a7b8cc91b3ab5 feat: Add the descriptionExcludes parameter to AnalyzeImage (#1590)
    • b9ac22a544a1398cd77e00dd10f0796f729eaf4c docs: pin mybinder to v0.10.0 to avoid thrashing
    • 1808a0f452ffab9ee24063b1e6c16ef5ed06f95d chore: Bump spark to 3.2.2
    • 8e7d4533e7f2c54da35438eb6561e47ee9269197 build: simplify e2e test pipeline with test matrix
    • 8e34c7ba56687c2d92f601c5bbf1475cbad68584 chore: bump spark version
    • 44c8ed5239dd7b7f43295865ff5b0caa87d40ab6 feat: Added the DeepVisionClassifier a simple API for deep transfer learning and fine-tuning of a variety of vision backbones (#1518)
    • e4f0883740b970430cbbb5781431206b788caa49 fix: fix flaky forms test (#1584)
    • 7da5f49d3161c2a2809f3dce5d117d9ec7903eb5 chore: Format welcome message with emojis (#1583)
    • 0e6bb3557aff7314fd791bd40d8dccaaed7c5093 Serena/update issue template (#1582)
    • a6a271860889dcc0b81bb8c5915bc35f31b866f3 docs: add demo into videos on website (#1581)
    • 7c34fc4332443bff1d4e0f8c7a696f1c94d71977 test: stabilize unit tests (#1576)
    • 49f3a58f9853421f832b6c50bf459d92af075459 chore: Add welcome message to new PRs/Issues (#1573)
    • 4868e8bfed15da4d40cad1a910a272bc43bafc92 Add back LightGBM library initialization in booster (#1575)
    • d427b8842a56a88dbbe1d533a4df083c41adb07f docs: update installation guidance of v0.10.0 (#1578)
    • 55a60c9c017278881de70aa92bf516cf0e5fa552 docs: add more .net samples (#1570)
    • 39fe2d8b987e0bee0823320d456d87da48b7a45d chore: Add GH workflow to label new/reopened issues (#1571)
    • 0febe3cb5df1838bb5daaa138fbc74ee904a69ff docs: add dotnet installation & example doc (#1567)
    • db95a1046584c158a3819325c2229d6501b48330 chore: update website (#1566)

    This list of changes was auto generated.

  • v0.10.0(Jul 18, 2022)

    SynapseML
    Building production ready distributed machine learning pipelines can be a challenge for even the most seasoned researcher or engineer. We are excited to announce the release of SynapseML v0.10.0 (Previously MMLSpark), an open-source library that aims to simplify the creation of massively scalable machine learning pipelines. SynapseML unifies several existing ML Frameworks and new MSFT algorithms in a single, scalable API that’s usable across Python, R, Scala, Java, .NET, C#, and F#.

    Highlights

    | OpenAI Language Models | .NET, C#, and F# Support | Full MLFlow Support | Live Demos in Browser |
    |:--:|:--:|:--:|:--:|
    | Embed 175-billion parameter models into your databases with ease | Use or train any SynapseML model from .NET | Quick and easy MLOps, model management, and autologging | Explore the SynapseML library with zero setup |
    | Learn More | Getting Started Guide | Explore the Docs | Run in Browser |

    New Features

    General ✨

    Azure Cognitive Services for Big Data 🧠

    Responsible AI at Scale 😇

    • Added partial dependence plots (PDP) to allow for understanding how independent variables affect a model's prediction (#1426)
    • Updated ICE/PDP documentation with PDP-based feature importance and additional examples (#1441, #1352)
    • Added a notebook for ICE and PDP feature explainers (#1318)
    • Updated data balance documentation to better describe how it can be used to ensure model fairness (#1540)
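
As a rough illustration of the idea behind partial dependence and PDP-based feature importance, the pure-Python sketch below averages a model's predictions while one feature is forced to each grid value, then scores the feature by how much the resulting curve varies. It is a conceptual sketch only: `model`, `rows`, and the helper names are hypothetical, it does not call the SynapseML ICETransformer, and using the standard deviation of the curve as the importance score is an assumption made here, not something stated in these notes.

```python
from statistics import mean, pstdev

def partial_dependence(model, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        preds = [model({**row, feature: value}) for row in rows]
        curve.append(mean(preds))
    return curve

def pdp_importance(curve):
    """Score a feature by how much its PDP curve varies (population std dev)."""
    return pstdev(curve)

# Hypothetical model: depends on "age" and ignores "noise".
def model(row):
    return 2.0 * row["age"] + 0.0 * row["noise"]

rows = [{"age": 1.0, "noise": 5.0}, {"age": 3.0, "noise": 7.0}]
grid = [0.0, 1.0, 2.0]

age_curve = partial_dependence(model, rows, "age", grid)
noise_curve = partial_dependence(model, rows, "noise", grid)

print("age importance:", pdp_importance(age_curve))
print("noise importance:", pdp_importance(noise_curve))
```

A flat curve (as for `noise` here) scores zero, while a feature the model actually uses produces a curve that moves with the grid.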

    MLFlow 🔃

    LightGBM on Spark 🌳

    • Added the ability to pass in generic argument strings to LightGBM enabling many complex parameterizations (#1444)
    • Added seed parameters to LightGBM (#1387)
    • Added a method to get LightGBM native model string directly (#1515)
    • Fixed issue with validation data creation during useSingleDataset mode (#1527)
    • Fixed multiclass training with initial scores (#1526)
    • Fixed saving LightGBM model iterations with early stopping (#1497)
    • Fixed issue where chunk size parameter was incorrectly specified during data copy (#1490)
    • Fixed issue that occurred when an empty partition is chosen as the main worker in singleDatasetMode (#1458)
    • Fixed bug with data repartitioning in LightGBMRanker (#1368)
    • Fixed outdated docs for useSingleDatasetMode (#1562)
    • Refactored LightGBM class structure to improve logging and debugging (#1557)
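
To make the generic argument string more concrete, here is a small sketch that assembles a space-separated `key=value` string from a dictionary. The helper name `build_pass_through_args` is hypothetical. The commented-out usage assumes the string is accepted through a `passThroughArgs` parameter on the PySpark estimators, so check that against the API docs for your version before relying on it.

```python
def build_pass_through_args(params):
    """Render a dict as a space-separated 'key=value' string."""
    parts = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Assumption: booleans are written in lowercase (true/false).
            value = str(value).lower()
        parts.append(f"{key}={value}")
    return " ".join(parts)

args = build_pass_through_args({"force_row_wise": True, "max_bin": 63})
print(args)

# Assumed usage (needs a Spark session with SynapseML installed):
# from synapse.ml.lightgbm import LightGBMClassifier
# clf = LightGBMClassifier(passThroughArgs=args)
```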

    Vowpal Wabbit 🐇

    • Fixed issues with the saveNativeModel for the VWRegressionModel #1364 (#1366)
    • Fixed issues with building quadratic interaction terms (#1460)

    Isolation Forests 🌲

    Additional Updates

    Maintenance 🔧

    • Removed unused debugging code (#1546)
    • Remove Synapse test exclusion for Explanation Dashboard notebook (#1531)
    • Made python style checks verbose (#1532)
    • Fixed library checking while installing library on Databricks cluster (#1488)
    • Upgraded and fixed Dockerfiles (#1472)
    • Added Developer Docker Image build to pipeline (#1480)
    • Fixed ADO area path in Issue Linker (#1464)
    • Fix master version badge display
    • Improved Databricks error reporting
    • Updated azure cli to stop build errors
    • Fixed SSL handshake flakiness
    • Added itsdangerous as a dependency to ADB tests (#1412)
    • Turned on debug for pr to work item workflow
    • Pointed pr linker to official implementation
    • Changed GitHub action trigger from pull_request_target to pull_request (#1413)
    • Fixed issue where Unit Tests were not executing (#1409)
    • Added Azure DevOps PR linker (#1394)
    • Updated GH PAT name (#1389)
    • Re-enable Synapse E2E Tests (#1517)
    • Updated SynapseE2E Tests to Spark 3.2 (#1362)
    • Fix ADO issue/pr linking (#1463)
    • Cleaned up extra MVAD models and improved network resiliency (#1457)
    • Updated azure blob client version (#1563)
    • Fixed docker security vulnerability (#1561)
    • Streamlined scalastyle hook (#1530)
    • Updated CODEOWNERS (#1523)
    • Updated OpenAI resource info (#1525)
    • Fixed semantic PR checking (#1503)
    • Updated docker images to remain compliant (#1500)
    • Added component governance explicitly to build so timeout variable works (#1489)
    • Fixed path for notebook test files in gitignore (#1485)
    • Increased component governance timeout (#1482)
    • Added conda caching to build
    • Stopped build from failing after 1 hour
    • Fixed flaking MVAD test
    • Refactored build pipeline definitions
    • Split Synapse tests into multiple tests (#1377)
    • Moved from ADO Pipelines to GitHub Workflows (#1406)

    Website Improvements 💻

    • Fixed MathJax expressions rendering (#1343)
    • Fixed google analytics gtags (#1434)
    • Corrected placement of BingSiteAuth.xml config (#1445, #1439)
    • Fixed website security issues and upgraded Docusaurus (#1545)
    • Moved Geospatial Services to its own folder (#1345)
    • Bumped minimist from 1.2.5 to 1.2.6 in /website (#1455)
    • Bumped node-forge from 1.2.1 to 1.3.0 in /website (#1451)
    • Bumped prismjs from 1.25.0 to 1.27.0 in /website (#1430)
    • Bumped follow-redirects from 1.14.7 to 1.14.8 in /website (#1402)
    • Bumped nanoid from 3.1.23 to 3.2.0 in /website (#1355)
    • Bumped shelljs from 0.8.4 to 0.8.5 in /website (#1347)
    • Bumped follow-redirects from 1.14.1 to 1.14.7 in /website (#1348)
    • Bumped cross-fetch from 3.1.4 to 3.1.5 in /website (#1496)
    • Bumped async from 2.6.3 to 2.6.4 in /website (#1481)
    • Pinned onnxmltools to a specific version (#1524)

    Bug Fixes 🐞

    • Fixed twitter sentiment detection notebook (#1544)
    • Fixed issue with DataConversion serialization (#1505)
    • Fixed typos in TestBase (#1501)
    • Fixed issue in GridSpace python API (#1470)
    • Fixed reflective class loading in IntelliJ (#1456)
    • Removed verbose ComputeModelStatistics output and converted scoredLabelsCol to DoubleType (#1361)
    • Fixed flaking in geospatial notebooks

    Code Style 🎶

    • Improved style checks using pre-commit (#1538, #1528, #1535)
    • Formatted code and notebooks with Black style checker (#1522, #1520)

    Documentation 📘

    • Tabularized badges for readability (#1486)
    • Added a PR template (#1418)
    • Improved installation readme (#1369, #1422)
    • Added a Security readme (#1511)
    • Updated the Azure Synapse readme (#1372)
    • Remove reference to custom maven resolver
    • Added pointer to docs on synapse pool configuration
    • Fixed typos in readme (#1516)

    Contributor Spotlight

    We are excited to highlight the contributions of the following SynapseML contributors:

    | Serena Ruan | Ric Serradas | Puneet Pruthi |
    |:--:|:--:|:--:|
    | Serena is a Software Engineer II on the Synapse team in Beijing and a force of nature. In this release, Serena has continued her prolific contribution streak by adding language support for .NET, C#, and F# and integrating SynapseML with MLFlow. Additionally, Serena has contributed several features to the MLFlow and Spark.NET open-source communities so that these systems can work better for every user. These contributions are just some of the many amazing things Serena has accomplished during this release, and her devotion and craft are pivotal to the ecosystem. | Ric is a Senior Engineering Manager on the OneNote team with a shining personality and drive to collaborate. In just a few weeks Ric hit the ground running by setting up an automated link between GitHub and Azure DevOps, building the first working version of SynapseE2E tests, and re-writing our entire build in GH Actions. Furthermore, Ric worked tirelessly through nights and weekends to land his contributions. | Puneet is a Senior Engineer on the SynapseML team with a knack for engineering systems and dockerization. Puneet's contributions to the library include architecting the new binder integration, driving our Synapse E2E tests to completion, and improving SynapseML’s infrastructure around community engagement. Puneet is constantly thinking of ways to improve the community and we value his effort. |
    | Mark Niehaus | Keerthi Yanda | Yagna Oruganti |
    | Mark is a Senior Software Engineer on the SynapseML team with a deep knowledge of the .NET ecosystem and infrastructure development. In this release, Mark architected SynapseML’s .NET binding blob publishing strategy, drove the OpenAI GPT-3 bindings to completion, and wrote a detailed GPT-3 walkthrough. Mark completed these projects while supporting the Time Series Insights service, speaking to his ability to keep multiple plates spinning at a time. | Keerthi is a Software Engineer II on the SynapseML team. Despite joining Microsoft just a few months ago, Keerthi has quickly learned the SynapseML ropes to take command of our integration with the Azure Synapse platform. Huge kudos to her for braving long build times and daunting error messages to make sure SynapseML works out of the box on Synapse Analytics clusters. | Yagna is a Senior Data and Applied Scientist on the Industry AI team with a talent for building solutions that integrate many community tools to solve customer challenges. Yagna's first contribution to SynapseML was a masterpiece of a demo showing how to use Isolation Forests, MLFlow, Tabular SHAP, and the interpret-ml explanation dashboard in a single anomaly detection example. |

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML:

    Serena Ruan @serena-ruan, Eric Dettinger, Scott Votaw @svotaw, Puneet Pruthi @ppruthi, Ric Serradas @riserrad, Mark Niehaus @niehaus59, Kyle Rush @k-rush, Keerthi Yanda @KeerthiYandaOS, Yagna Oruganti @YagnaDeepika, Jason Wang @memoryz, Ilya Matiach @imatiach-msft, Yazeed Alaudah @yalaudah, Elena Zherdeva @ezherdeva, Kashyap Patel @ms-kashyap, Martha Laguna @martthalch @marthalc, Alex Li @liyzcj, Maria Guirguis @maguir, Alexandra Savelieva @alsavelv, @netang, Sudhindra Kovalam @SudhindraKovalam, Markus Cozowicz @eisber, Tom Finley, Markus Weimer, Jeff Zheng, James Verbus @jverbus, Chris Hoder, Misha Desai, Nellie Gustafsson, Eren Orbey, Beverly Kodhek, Louise Han @jr-MS, Justyna Lucznik, Kim Manis, Mitrabhanu Mohanty, Bogdan Crivat, Anand Raman, William T. Freeman, James Montemagno, Luis Quintanilla, Dennis Kennedy, Ryan Hurey, Jarno Ensio, Brian Mouncer, Steve Suh @suhsteve, Akshaya Annavajhala (AK), Guolin Ke, Tara Grumm, Niharika Dutta @Niharikadutta, Andrew Fogarty, Juanyong Duan, Weichen Xu @WeichenXu123, Spark.NET Team, ONNX Team, Azure Global, Vowpal Wabbit Team, LightGBM Team, MSFT Garage Team, MSR Outreach Team, Speech SDK Team, MLflow Team

    Learn More

    | | | |
    |:--:|:--:|:--:|
    | Visit our website for the latest docs, demos, and examples | Read more about SynapseML's GA release in the Microsoft Research Blog | Learn more about our .NET bindings and code generation system. |
    | Watch a demonstration of SynapseML to create a multilingual search engine. | Read our Paper from IEEE Big Data '21 | Explore our integration with the Azure OpenAI Service |

  • v0.9.5(Jan 12, 2022)

    SynapseML
    Building production ready distributed machine learning pipelines can be a challenge for even the most seasoned researcher or engineer. We are excited to announce the release of SynapseML v0.9.5 (Previously MMLSpark), an open-source library that aims to simplify the creation of massively scalable machine learning pipelines. SynapseML unifies several existing ML Frameworks and new MSFT algorithms in a single, scalable API that’s usable across Python, R, Scala, and Java.

    Highlights

    | Geospatial Intelligence | Multivariate Anomaly Detection | Responsible AI at Scale | Text To Speech | Healthcare Analytics |
    |:--:|:--:|:--:|:--:|:--:|
    | Large-scale map and geocoding operations | Build custom time series anomaly detection systems | Distributed Conditional Expectation and Partial Dependence Analysis | Easy-to-use Neural Text to Speech for large datasets | Quickly understand entities and relationships in corpora of medical text. |

    New Features

    Geospatial Intelligence 🗺️

    • Added support for distributed geospatial queries backed by the Azure Maps API
    • Added the geospatial usage overview (#1339)
    • Explore how to use the geospatial intelligence services to analyze flood risks. (#1339)
    • Added the AddressGeocoder transformer to map informal addresses to standardized addresses with latitude and longitude (#1294)
    • Added the ReverseGeocoder transformer to map latitude and longitude measurements to standardized addresses. (#1339)
    • Added the CheckPointInPolygon transformer to detect whether latitude and longitude queries lie inside regions of interest (#1339)
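
The kind of question CheckPointInPolygon answers can be pictured with a local ray-casting test. The sketch below is not the SynapseML or Azure Maps implementation: it treats coordinates as planar (ignoring the Earth's curvature), and `point_in_polygon` and the sample `square` region are hypothetical.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray casting: toggle `inside` each time a ray from the point crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside

# Hypothetical region of interest: a 10 x 10 degree square.
square = [(0.0, 0.0), (0.0, 10.0), (10.0, 10.0), (10.0, 0.0)]

print(point_in_polygon(5.0, 5.0, square))
print(point_in_polygon(15.0, 5.0, square))
```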

    Azure Cognitive Services for Big Data 🧠

    • Added the Healthcare Analytics Transformer for extracting medical information, entities, and relationships from text. [Example Usage] (#1329)
    • Added the FitMultivariateAnomaly estimator for training custom anomaly detection models on DataFrames of multivariate time series data (#1272)
    • Added example notebook for Multivariate Anomaly Detector
    • See how to train a custom Multivariate Anomaly detector in the Estimators reference docs (#1323)
    • Added simplified Text Analytics transformers that support auto-batching (#1329)
    • Added the TextToSpeech Transformer for transforming Dataframes of text to audio files with neural voice synthesis (#1320)
    • Added the TextAnalyze transformer to support executing multiple text analytics workloads within a single API call (#1267, #1312)

    Responsible AI at Scale 😇

    • Added Individual Conditional Expectation explanations and Partial Dependence Plots with the ICETransformer. This tool gives detailed explanations of how features in opaque-box models affect the model prediction. (#1284)
    • Learn about how to use the ICETransformer through an example with the Adult Census dataset
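
Conceptually, an ICE curve is traced for a single row by sweeping one feature across a grid while the row's other features stay fixed. The plain-Python sketch below illustrates that idea under stated assumptions: `model`, `rows`, and `ice_curves` are hypothetical names, and nothing here uses the ICETransformer API.

```python
def ice_curves(model, rows, feature, grid):
    """One curve per row: predictions as `feature` sweeps over `grid`."""
    return [[model({**row, feature: v}) for v in grid] for row in rows]

# Hypothetical model with an interaction: the effect of "hours" depends on "senior".
def model(row):
    slope = 3.0 if row["senior"] else 1.0
    return slope * row["hours"]

rows = [{"hours": 10.0, "senior": False}, {"hours": 20.0, "senior": True}]
grid = [0.0, 1.0, 2.0]

curves = ice_curves(model, rows, "hours", grid)

# Averaging the ICE curves point by point gives the partial dependence curve.
pdp = [sum(col) / len(col) for col in zip(*curves)]

for row, curve in zip(rows, curves):
    print(row["senior"], curve)
print("pdp", pdp)
```

The two rows produce curves with different slopes, which is the per-instance detail that an averaged partial dependence curve hides.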

    MLFlow 🔃

    • Add MLFlow support for saving and loading SynapseML models (#1277)

    LightGBM on Spark 🌳

    • Improved LightGBM training performance 4x-10x by setting num_threads to be cores-1 (#1282)
    • Added the predict_disable_shape_check parameter to LightGBM (#1273)
    • Reduced temporary file bloat by creating the LightGBM native temp directory lazily (#1326)
    • Added logging for number of columns and rows when creating datasets, set useSingleDatasetMode=True by default (#1222)
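
The cores-1 thread rule mentioned in this list can be written in a few lines. This only illustrates the stated heuristic: `default_num_threads` is a hypothetical helper, the floor of one thread is an assumption added here, and these notes do not describe how SynapseML discovers the core count available to each task.

```python
import os

def default_num_threads(cores):
    """Leave one core free, but never go below one thread (assumed floor)."""
    return max(1, cores - 1)

# os.cpu_count() can return None on some platforms, so fall back to 1.
local_cores = os.cpu_count() or 1
print(default_num_threads(local_cores))
```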

    Infrastructure 🏭

    • SynapseML now installable from Maven Central!
    • SynapseML now supports Spark v3.2.x

    Additional Updates

    Bug Fixes 🐞

    • Allowed FlattenBatch to propagate non-array values (#1286)
    • Fixed flaky tests (#1342)
    • Fixed website bugs and migrated docSearch (#1331)
    • Fixed issue where IsolationForestModel does not properly exchange params with the inner model (#1330)
    • Corrected the objective param when using fobj (#1292)
    • Fixed issue where broadcasted sum in breeze 1.0 breaks in Spark 3.2.0 (#1299)
    • Hotfixes for R test runners (#1283)
    • fix installation instruction (#1268)
    • Removing broadcast hint (#1255)
    • fix install instructions (#1259)

    Build 🏭

    • bump algoliasearch-helper from 3.6.1 to 3.6.2 in /website (#1270)
    • remove some deps that cause sec issues (#1264)

    Documentation 📘

    • Fixed broken link to CyberML notebook (#1322)
    • Added website announcement bar (#1263)
    • Updated and improved readme (#1262)
    • Removed references to runme in contributing.md
    • Supported Math expressions in website markdown (#1278)
    • Corrected Synapse typo in website (#1335)

    Maintenance 🔧

    • Stopped lightGBM tests from timing out (#1315)
    • Fixed r test flakiness (#1314)
    • Updated VerifyLightGBMClassifier.scala (#1313)
    • Update speech SDK test results
    • Add in missing tests in build (#1300)
    • Fix flaky build steps (#1298)
    • Fix website telemetry (#1261)
    • Add website telemetry (#1260)
    • Added missing test classes to pipeline

    Contributor Spotlight

    We are excited to highlight the contributions of the following SynapseML contributors:

    | Serena Ruan | Ilya Matiach | Sudhindra Kovalam |
    |:--:|:--:|:--:|
    | Serena is an engineer on the Azure Synapse team in Beijing. In this release, Serena has continued her unbelievable speed of contributions with support for Multivariate Anomaly Detection, MLFlow, and installation from Maven Central. These contributions are just a few of the many projects Serena has contributed since she joined just a few months ago! | Ilya is a prolific engineer on the Azure Machine Learning Boston team working on responsible AI. Ilya contributed LightGBM on Spark and worked tirelessly to improve and support this feature. Ilya has been an active contributor to the SynapseML project for 5 years and has built many of the tools in the library. | Sudhindra is an engineer on the Microsoft Maps team and has contributed intelligent geospatial APIs to SynapseML v0.9.5. Sudhindra developed new ways to automate generation of Spark code from swagger files, allowing him to contribute a large suite of features rapidly. |
    | Elena Zherdeva | The Text Analytics Explorer Interns | Stuart Leeks |
    | Elena is an engineer on the CSX Data team working on building scalable responsible AI tools. In Elena's first contribution to SynapseML she added Individual Conditional Expectation plots at scale. She also contributed a detailed sample notebook that does a fantastic job of explaining key concepts in Responsible AI. | Samantha Konigsberg (top left), Preeti Pidatala (top right), and Victoria Johnston (bottom) were summer explorer interns on the text analytics team. They collaborated to build new simplified APIs for the text analytics service using the Java SDK layer. One of these contributions was the new Healthcare Analytics API in Spark. This was the interns' first Scala project, making this contribution all the more impressive! | Stuart is an engineer on the Commercial Software Engineering team. Stuart not only uses SynapseML to power customer engagements, but also directly contributes features needed to make his customers succeed. Stuart contributed support for the new Analyze Text API, which allows users to perform multiple intelligent text tasks with a single API call. Stuart also added features to SynapseML’s Mini-batchers to improve their generality. |

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML:

    Jason Wang @memoryz , Serena Ruan @serena-ruan, Ilya Matiach @imatiach-msft , Stuart Leeks @stuartleeks, Sudhindra Kovalam @SudhindraKovalam, Elena Zherdeva @ezherdeva, Preeti Pidatala @preetipidatala, Samantha Konigsberg @skonigs, Victoria Johnston @victoriajmicrosoft, Markus Cozowicz @eisber, Yazeed Alaudah @yalaudah, Suhas Mehta @suhas92, Kashyap Patel @ms-kashyap, Wenqing Xu @xuwq1993, Markus Weimer, Jeff Zheng, James Verbus @jverbus, Misha Desai, Nellie Gustafsson, Ruixin Xu, Eric Dettinger, Martha Laguna, Louise Han @jr-MS, Rashid Monin, Ali Emami, Clemens Schotte, Edward Un, Johannes Kebeck, Han Li, Assaf Israel @assafi, Tom Finley, Tomas Talius, Mitrabhanu Mohanty, Anand Raman, William T. Freeman, Ryan Hurey, Jarno Ensio, Brian Mouncer, Sharath Chandra, Beverly Kodhek, Nisheet Jain, Akshaya Annavajhala (AK), Euan Garden, Lev Novik, Guolin Ke, Tara Grumm, Ismaël Mejía, Keunhyun Oh, @martin0258, @sinnfashen, Dung Nguyen @nhymxu, @elswork, ONNX Team, Azure Global, Vowpal Wabbit Team, Light GBM Team, MSFT Garage Team, MSR Outreach Team, Speech SDK Team

    Learn More

    | | | |
    |:--:|:--:|:--:|
    | Visit our new website for the latest docs, demos, and examples | Read more about SynapseML's GA release in the Microsoft Research Blog | SynapseML is now generally available on Azure Synapse! Get started here. |
    | Learn more about Multivariate Anomaly Detection in SynapseML | Read our Paper from IEEE Big Data '21 | Sign up for the Private Preview of Explainable Boosting Machines in SynapseML |

  • v0.9.4(Nov 16, 2021)

    SynapseML
    Building production ready distributed machine learning pipelines can be a challenge for even the most seasoned researcher or engineer. We are excited to announce the release of SynapseML (Previously MMLSpark), an open-source library that aims to simplify the creation of massively scalable machine learning pipelines. SynapseML unifies several existing ML Frameworks and new MSFT algorithms in a single, scalable API that’s usable across Python, R, Scala, and Java.

    Highlights

    | General Availability on Synapse | ONNX on Spark | Responsible AI | Form Recognition and Translation | Reinforcement Learning |
    |:--:|:--:|:--:|:--:|:--:|
    | We are ready to help you productionalize on Azure Synapse Analytics | Distributed and hardware accelerated model inference on Spark | Understand opaque-box models, measure dataset biases, Explainable Boosting Machines | Parse PDFs and translate dataframes between over 100 languages | Contextual Bandit Reinforcement Learning with Vowpal Wabbit |

    New Features

    General ✨

    • Renamed and rebranded! Microsoft ML for Apache Spark is now SynapseML
    • New modular library sub-packages for standalone install of each major set of features
    • Support Spark 3.1.2 and Scala 2.12
    • Support pip install synapseml for python bindings

    ONNX on Spark 🕸

    Cognitive Services for Big Data 🧠

    • Added Multilingual Translation APIs (#1108) (Tutorial)
    • Added FormRecognition APIs (Invoice, IDs, BusinessCards, Layouts, Custom Models) (#1099) (Tutorial)
    • Added the FormOntologyLearner to extract meaningful "ontologies" of objects from collections of forms
    • Add notebook to Create a Multilingual Search Engine from Forms
    • Updated Text Analytics API to V3.1 (#1193)
    • Add redactedText to PIIV3 (#1247)
    • Added Personally Identifying Information (PII) identification
    • Added Read API
    • Added Conversation Transcription API
    • Cognitive Services now support data exfiltration protected (DEP) VNETs, allowing for individualized security solutions on Synapse Analytics (Learn More)
    • Added support for the m4a codec in Speech to Text models
    • Added predictive maintenance notebook
    • Added Cognitive Service overview notebook
    • Added support for linked service authentication in Synapse Analytics
    • Simple no-code support in Synapse Analytics

    Responsible AI at Scale 😇

    • Added Additive Shapley Explanations (SHAP) for understanding the predictions of opaque-box models (#1077)
    • New API for Locally Interpretable Model-Agnostic Explanations (LIME), now supports background distributions text models, and has the same API as SHAP (#1077)
    • Added Measure transformers for Data Balance Analysis (#1218)
    • Add more notebook samples for documentation (#1043)
    • Documentation and notebooks for Interpretability on Spark
    • Introduce Responsible AI section on website (Interpretability + DataBalanceAnalysis) (#1241)
    • Adding document and notebook for Data Balance Analysis (#1226)
    • Explainable Boosting Machines for performant and interpretable ML (Private preview on Synapse Analytics only)
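
For intuition about what Shapley-based explanations attribute to each feature, the sketch below computes exact Shapley values for a tiny model by enumerating every feature ordering. This is a didactic sketch, not how SynapseML computes its explanations: exact enumeration is only feasible for a handful of features, and `shapley_values`, `model`, and the all-zero `baseline` are hypothetical choices made for the example.

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        z = list(baseline)          # start from the baseline input
        prev = f(z)
        for i in order:
            z[i] = x[i]             # switch feature i to its real value
            cur = f(z)
            phi[i] += cur - prev    # marginal contribution of feature i
            prev = cur
    return [p / len(orderings) for p in phi]

# Hypothetical linear model, so each attribution should equal weight * value.
def model(z):
    return 2.0 * z[0] + 3.0 * z[1] - 1.0 * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

phi = shapley_values(model, x, baseline)
print(phi)
print(sum(phi), model(x) - model(baseline))
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that gives these explanations their name.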

    Vowpal Wabbit 🐇

    • Added ContextualBandit reinforcement learning (#896)
    • Added Vowpal Wabbit Overview Notebook

    LightGBM 🌳

    • Added matrix type parameter and improved logic to automatically infer dataset sparsity (#1052)
    • Added several parameters related to dart boosting type (#1045)
    • Added chunk size parameter for copying java data to native (#1041)
    • Added number of threads parameter (#1055)
    • Added custom objective function to LightGBM learners (#1054)
    • Added singleton dataset mode for faster performance and reduced memory usage (#1066)
    • Add num iteration and start iteration parameters to LightGBM model (#1024)
    • Added the average precision metric (#1034)
    • Added overview notebook for LightGBM
    • Moved to new streaming API for dense data to reduce memory usage
    • Tuned chunking code for faster performance

    Build and Infrastructure Improvements 🏭

    • New Docusaurus website generation system
    • E2E Tests on Synapse Analytics (#1014)
    • Split library into separately installable subprojects (#1073)
    • Added a unified logging and telemetry system (#1019)
    • Modernized R wrapper generation
    • New Automated Python test generation (#998)
    • New extensible code generation system
    • New two-tiered security for build secrets
    • Update ubuntu version to 18.04
    • Automated back-up ACR images

    Additional Updates

    Bug Fixes 🐞

    • Enable backwards compatibility for mmlspark python namespace imports (#1244)
    • Fix publishing to maven and pypi (#1242)
    • Fix broken link to notebook in Data Balance Analysis doc (#1240)
    • min_data_in_leaf missing from dataset parameters in lightgbm (#1239)
    • Fix performance issue in interpretability notebooks (#1238)
    • Fixed cognitive service errors (#1176)
    • Fixed flaky tests
    • Rename NERPii to PII
    • Fixed cog service test flakes
    • Fixed setLinkedService issues in Synapse (#1177)
    • Improved LGBM error message for invalid slot names (#1160)
    • Fixed generated python code (#1121)
    • Updated notebookUtils class path (#1118)
    • Fixed LIME NaN weight output (#1117, #1112)
    • Fixed Guava version issue in Azure Synapse and Databricks (#1103)
    • Fixed flakiness in spark session stopping
    • Fixed result parsing for forms
    • Fixed explainers returning wrong results when targetClassesCol is specified
    • Fixed CNTKModel issue due to catalyst bug on databricks (#1076)
    • Fixed null handling in bing image response (#1067)
    • Avoided strange issue with databricks json parser
    • Fixed dependency exclusions and build secret querying
    • Fixed issue in tabular lime sampler (#1058)
    • Updated Bing search URLs (#1048)
    • Refactored python wrappers to use common class (#758)
    • Updated java params patch (#1027)
    • Added missing returns in new python lightGBM model methods
    • Stop R binding generation from failing silently
    • Fixed conversation transcription participant column functionality
    • Reduce verbosity to prevent RPC disassociated errors
    • Fixed performance slip in Featurize
    • Added timeout logic for speech to text
    • Added ffmpeg time limit enforcing for flaky streams (#1001)
    • Fixed upload python whl file to blob (#1000)
    • Cleaned up python tests (#994)
    • Fixed read schemas (#988)
    • Made HTTP default concurrent timeout infinite
    • Made HTTP rate limiting retry indefinitely
    • Recommender Patch for Spark 3 Update (#982)
    • Fix typo in text sentiment schema
    • Changed ints to longs for offset and duration in STT
    • Fixed processing sparse vector size
    • Fixed Double User agent setting bug
    • Fixed build warnings (#1080)
    • Fixed build for new intellij
    • Fixed livy dependency resolution
    • Fixed pom for sbt dependencies (#1202)
    • Fixed bug in testGen parallelism
    • Auto-update packages in docker
    • remove unused code
    • Fix codecov logging of wrapper generation (#1098)
    • Fix badge publishing
    • Remove issue in scalastyle file for new IJ

    Documentation 📘

    • Add explicit pointer to HDI install
    • fix typo (#990)
    • Bump python install to top to make it clearer
    • Add example CyberML notebook (#958)
    • Add CyberML link to README.md (#989)

    New Contributor Spotlight

    We are excited to welcome several new developers to the SynapseML project.

    | Serena Ruan | Jason Wang | Wenqing Xu |
    |:--:|:--:|:--:|
    | Serena is an Engineer on the Azure Synapse team in Beijing. Within her first months working on SynapseML, Serena contributed Forms and Translator cognitive services, a unified logging and telemetry system, notebooks and documentation for every transformer and estimator, and a new docusaurus-based website. | Jason is a Principal Engineer on Microsoft's DSP team and is focused on large-scale responsible AI. Jason started his contribution streak with a new API for model explainability that unifies both SHAP and LIME. Jason has also contributed ONNX on Spark which dramatically broadens the scope of models that can be used in SynapseML. | Wenqing is a software engineer on the Azure Synapse team in Beijing. Wenqing has been instrumental in preparing SynapseML for General Availability. In particular, Wenqing added support for linked service authentication of cognitive services, extended E2E testing to Synapse Analytics, and added the PII identification service. |
    | Kashyap Patel | Rohit Agrawal | Jack Gerrits |
    | Kashyap is an Engineer on Microsoft's DSP team working on improving the fairness of machine learning models. Kashyap contributed tools for assessing dataset bias without requiring a labelled dataset or model. | Rohit is a Senior Engineer on Microsoft's Cognitive Service team working on large-scale orchestration of intelligent services. Rohit modernized our Text Analytics Stack by updating to v3.0 and laid the groundwork for E2E testing on Synapse Analytics. | Jack is a Senior Engineer on the decision service and reinforcement learning team at Microsoft Research NYC. Jack contributed support for contextual bandit reinforcement learning with Vowpal Wabbit. |

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML:

    Jason Wang, Serena Ruan, Ilya Matiach, Jack Gerrits, Kashyap Patel, Wenqing Xu, Markus Weimer, Jeff Zheng, Nellie Gustafsson, Ruixin Xu, Martha Laguna, Markus Cozowicz, Rohit Agrawal, Daniel Ciborowski, Jako Tinkus, Tom Finley, Tomas Talius, Mitrabhanu Mohanty, Roy Levin, Anand Raman, William T. Freeman, Ryan Hurey, Sharath Chandra, Beverly Kodhek, Assaf Israel, Nisheet Jain, Ryan Hurey, Miguel Fierro, Dotan Patrich, Akshaya Annavajhala (AK), Euan Garden, Lev Novik, Guolin Ke, Tara Grumm, Keunhyun Oh, Vanunts Arsenii, Alexandr Severinov, David Lacalle Castillo, Ryosuke Horiuchi, Ashish Solanki, Matthieu Maitre, ONNX Team, Azure Global, Vowpal Wabbit Team, Light GBM Team, MSFT Garage Team, MSR Outreach Team, Speech SDK Team

    Learn More

    | | | |
    |:--:|:--:|:--:|
    | Visit our new website for the latest docs, demos, and examples | Read more about SynapseML in the Microsoft Research Blog | Get started with SynapseML on Azure Synapse Analytics |
    | Read the Synapse Analytics Ignite Announcements | Read our Paper from IEEE Big Data '21 | Watch our ODSC Webinar on working with AI services at scale |

  • v0.9.2 (Nov 3, 2021)

    v0.9.2

    Bug Fixes 🐞

    • fix publish to central maven (#1233)
    • fix website (#1234)
    • fix typo in sbt install
    • lightgbm default params should not be specified if optional (#1232)
    • fix website broken links (#1230)
    • improve azure search writer error message in Array[Array[]] case
    • update baseUrl and fix static images (#1217)
    • Fixing flaky unit tests (#1215)
    • Docker image should install openjdk-8-jre as opposed to default-… (#1211)
    • Fixing flaky test

    Documentation 📘

    • add explanation dashboard integration example notebook (#1236)
    • fix links to developer readme and R setup (#1229)

    Feat

    • Build our new website (#1190)

    Features 🌈

    • support direct pip install (#1223)
    • Measure transformers for Data Balance Analysis (#1218)
    • Add the FormOntologyLearner

    Maintenance 🔧

    • release synapseml 0.9.2 (#1237)

    Performance Improvements 🚀

    • website enhancement (#1221)

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML.

    Changes:

    • 81f5f80bc68918840c51023a0ba8a3cbae55a814 chore: release synapseml 0.9.2 (#1237)
    • 127c70a9f806c6f412e56c2d766b4b65d53d342e docs: add explanation dashboard integration example notebook (#1236)
    • 9b9c2fbb2341949f9a3c85837a7f6b1acb7b9b13 fix: fix publish to central maven (#1233)
    • 7059573dd873494851d8e1db9c5ea9ad44a945a1 fix: fix website (#1234)
    • d47f014159d99c999c14153c1fc7b51622c21999 fix: fix typo in sbt install
    • 336eff5606a965358ef1bbff7f7f970697479e4e fix: lightgbm default params should not be specified if optional (#1232)
    • 3d92dd730e52d8194470347eb7fb43aca3f09343 feat: support direct pip install (#1223)
    • 2771853c4d956c3c5f349bc3156f4d2f7f12b0f8 docs: fix links to developer readme and R setup (#1229)
    • ea91189db473b7a82e66eaf3e42122b9223bcfb0 fix: fix website broken links (#1230)
    • bbd874407161367c6927636c1bdb6dd791bbb36e perf: website enhancement (#1221)
    • c5e174214f4ada3cb9bb534140f6c4d759bd4150 feat: Measure transformers for Data Balance Analysis (#1218)
    • 73c6a657a1cebc580fa6fa8da56dc34eb85dc36e fix: improve azure search writer error message in Array[Array[]] case
    • d8344c5b4efa6b33fbdbbba06f715d4b7f8af2a1 feat: Add the FormOntologyLearner
    • 2d81b5056dce57f9191ac2beb279c554f960259c fix: update baseUrl and fix static images (#1217)
    • e23041f47f3bad97435eb5564e0ca451fc70aee2 fix: Fixing flaky unit tests (#1215)
    • 5d31e3e1054a7bcd571225f3f24e7c4990e95c78 fix: Docker image should install openjdk-8-jre as opposed to default-… (#1211)
    • 9623b3ea1530f32f15610459e657dcf98c0f4d49 Feat: Build our new website (#1190)
    • 3f74133b8a5d00220eaad6e3e8e0361e7faf8856 fix: Fixing flaky test

    This list of changes was auto generated.

  • v0.9.1 (Oct 15, 2021)

    v0.9.1

    Bug Fixes 🐞

    • fix readme badge

    Maintenance 🔧

    • Bump version to 0.9.1

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML.

    Changes:

    • 6b814261af82ea1cdcc34c13d78d086107b72385 chore: Bump version to 0.9.1
    • 274b110913dcffc2f89742c14aebfc45989533fc fix: fix doc publishing
    • 600bc6e84026291a80785923e53f681b67fb1eb3 fix: fix readme badge

    This list of changes was auto generated.

  • v0.9.0 (Oct 15, 2021)

    v0.9.0

    Bug Fixes 🐞

    • don't crash on fallback storage location (#1183)

    Chore

    • rename mmlspark to synapseml (#1204)

    Features 🌈

    • update versions in README.md (#1205)

    Maintenance 🔧

    • release synapseml 0.9.0 (#1206)

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML.

    Changes:

    • a6c7fea6dc6a9bbffcdaeef3e587e5efdb1ada50 chore: release synapseml 0.9.0 (#1206)
    • 383cb951811908fe29b85253edfd8dffb9b2241c Chore: rename mmlspark to synapseml (#1204)
    • ecc6868e2280b5f0e2344b7e3cc9c11e19670b1f fix: don't crash on fallback storage location (#1183)
    • 661e3e5a443d37f24dca68a6f52d4aaae03368a1 feat: update versions in README.md (#1205)

    This list of changes was auto generated.

  • mmlspark-v1.0.0-rc4 (Jul 18, 2022)

    v1.0.0-rc4

    Bug Fixes 🐞

    • fix setLinkedService in Synapse
    • fix cognitive service errors (#1176)
    • fix anomaly detector test cases
    • rename NERPii to PII
    • fix scala style error
    • fix cog service test flakes
    • fix setLinkedService issues in Synapse (#1177)
    • improve LGBM error message for invalid slot names (#1160)
    • flaky lime test
    • fix flaky conversation transcription test
    • fix SpeechToTextSDK setLinkedService (#1138)
    • fix generated python code (#1121)
    • update notebookUtils class path (#1118)
    • LIME returns NaN weight if a feature contains a single value or when the sampler cannot obtain a different state for a feature due to data skew. It returns zero weights for all other features. (#1117)
    • fix Guava version issue in Azure Synapse and Databricks (#1103)
    • fix flakiness in spark session stopping
    • Fix result parsing for forms
    • LIME sometimes returns NaN weights (#1112)
    • reformat code
    • explainers return wrong results when targetClassesCol is specified
    • Unit test OOM error (#1093)
    • Update codeowners (#1092)
    • BingImageSearch fails randomly in E2E test (#1082)
    • [Workaround] CNTKModel does not output correct result (#1076)
    • small issue with null in bing image response (#1067)
    • fix flaky conversation transcription test
    • avoid strange issue with databricks json parser
    • fix dependency exclusions and build secret querying
    • Fix issue in tabular lime sampler (#1058)
    • Bing search URL update (#1048)
    • early stopping test and average precision metric (#1034)
    • refactor python wrappers to use common class (#758)
    • java params patch (#1027)
    • missing returns in new python lightgbm model methods
    • fix issue with r bindings silently failing
    • fix conversation transcription participant column functionality
    • reduce verbosity to prevent RPC disassociated errors
    • Fix performance slip in Featurize
    • add timeout for stt
    • update subscription in build secrets
    • Add ffmpeg time limit enforcing for flaky streams (#1001)
    • fix upload python whl file to blob (#1000)
    • adding more recommendation code owners (#996)
    • cleanup python tests (#994)
    • Fix read schemas (#988)
    • fix issue with NER suite test
    • make concurrent timeout infinite
    • Make rate limiting retry indefinitely
    • Recommender Patch for Spark 3 Update (#982)
    • fix typo in text sentiment schema
    • change ints to longs for offset and duration in STT
    • fix python tests in build
    • fix processing sparse vector size
    • Fix Double User agent setting bug

    Build 🏭

    • add two-tiered security for build secrets
    • Fixing build warnings (#1080)
    • update ubuntu version to 18.04
    • fix build for new intellij
    • fix livy dependency resolution

    Doc

    • add predictive maintenance notebook
    • Add CyberML link to README.md (#989)
    • Add example cyberML notebook (#958)

    Documentation 📘

    • Adding document and notebooks for ONNXModel (#1164)
    • Documentation and notebooks for Interpretability on Spark
    • Add explicit pointer to HDI install
    • fix typo (#990)
    • Bump python install to top to make it clearer

    Features 🌈

    • Update Text Analytics API to V3.1 (#1193)
    • add NERPii
    • Add Infrastructure to Run Tests on Synapse (#1014)
    • rename Read to ReadImage (#1163)
    • ONNX model inference on Spark (#1152)
    • update DocumentTranslator to support setLinkedService in Synapse (#1151)
    • add setLinkedService (#1136)
    • add translator (#1108)
    • add singleton dataset mode for faster performance and use old sparse dataset create method to reduce memory usage (#1066)
    • add form recognizer support (#1099)
    • split library into subprojects (#1073)
    • new LIME and KernelSHAP explainers (#1077)
    • refactor to have separate dataset utils and partition processor (#1089)
    • refactoring of lightgbm code in preparation for single dataset mode (#1088)
    • move partition consolidator and add LocalAggregator API (#1071)
    • add number of threads parameter (#1055)
    • add custom objective function to lightgbm learners (#1054)
    • Add more notebook samples for documentation (#1043)
    • add matrix type parameter and improve auto logic (#1052)
    • add several parameters related to dart boosting type (#1045)
    • added chunk size parameter for copying java data to native (#1041)
    • Add MMLSpark logging infrastructure (#1019)
    • Add R wrapper gen
    • add num iteration and start iteration to lightgbm model (#1024)
    • Refactor code generation system
    • add automated python test generation infrastructure (#998)
    • add TextLIME
    • Add ReadAPI
    • add conversation transcription
    • add m4a codec

    Maintenance 🔧

    • bump version numbers (#1203)
    • Fix pom for sbt dependencies (#1202)
    • Add script to clean and back up ACR
    • fix bug in testgen parallelism
    • testing new build
    • disable failing synapse e2e tests
    • fix flaky serialization fuzzing test
    • disable failing doc translator test
    • fix flakiness in python tests (#1144)
    • auto-update packages in docker
    • fix flaky notebook
    • remove unused code
    • fix codecov logging of wrapper generation (#1098)
    • update to lightgbm 3.2.110
    • fix badge publishing
    • upgrade lightgbm to 3.2.100
    • update build to new subscription (#991)
    • fix Detect face suite (#968)
    • remove issue in scalastyle file for new IJ
    • lower threshold for STT tests

    Performance Improvements 🚀

    • tune chunking code, fix memory leak
    • moving to new streaming API for dense data to reduce memory usage

    Update

    • reformat code
    • update setLocation
    • remove parens
    • use HasSetLinkedService trait
    • add more cognitive service
    • add more cognitive service
    • add more cognitive service
    • add more cognitive service
    • remove test code
    • add test code
    • remove testing code
    • add sample code for test
    • add sample code for test
    • add sample code for test
    • add sample code for test
    • add sample code for test
    • add sample code for test
    • add reflection
    • remove example in test files
    • add class path
    • add reflection
    • notebook
    • update spark version to 3.1.2 (#1086)

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

    Changes:

    • 5fc65abbe43f520529970d2173f671e39004e510 chore: bump version numbers (#1203)
    • 993da81a0ab947a65cabea89fb9cc0a52d4498bb chore: Fix pom for sbt dependencies (#1202)
    • 327be83c6c711d3cba3be84cda85b997dd087c44 feat: Update Text Analytics API to V3.1 (#1193)
    • 661057752d7baea4592842ba5af05fbdc6f3bd9c fix: fix setLinkedService in Synapse
    • e08a8e2918fbf62ec2e83ddfa709023006edb0ba chore: Add script to clean and back up ACR
    • d85aae8dbe489b20299892406be32c32a73c362f fix: fix cognitive service errors (#1176)
    • c6925dbb87b6e7c65a8b9c9c9a4b2d0161a770aa fix: fix anomaly detector test cases
    • b52c36101f9eecc9f306b16ebef1b03700ad421c fix: rename NERPii to PII
    • 2ce1ba6be91e2f39b2ad97550685efd474e979b6 fix: fix scala style error
    • 1000fdb38ddbfbd2f4b4b52870d22b260e1e25df feat: add NERPii
    • 4682199012edc35b1ccefad7167b7aee3c844106 fix: fix cog service test flakes
    • 0c4d32d4b25cbd6c32d65c7fce0f0bca95a0ff2e doc: add predictive maintenance notebook
    • 80889120ff06f242310e1778130cac0ed47f30fd fix: fix setLinkedService issues in Synapse (#1177)
    • 2d65668b194f4cbcf070302765227352379844a0 update notebook link
    • 586e6761bb242fa7124e13845c030b24648ebf42 chore: fix bug in testgen parallelism
    • 5ed9a8cfab0a20b18eed982dcfcc02beae69032c chore: testing new build
    • f00272ec2dc402ce5521ae5f721195c168e82323 chore: disable failing synapse e2e tests
    • fdf756292c6e3679be602ef30faa8993fad65c50 chore: fix flaky serialization fuzzing test
    • f5b9c5ee67b67f9913d72eafaaa13f3175967d38 chore: disable failing doc translator test
    • 3ae67abdfee5f0bedd89a086b82101e7153b3b9c feat: Add Infrastructure to Run Tests on Synapse (#1014)
    • de4b47b8b6643575eb8dec470dec0dadfd1d836b Security upgrade required for openjdk from 8-alpine to 17-ea-22-jdk-oracle (#1165)
    • 21d5ec86c6fa5c4be7d627d77c56567f233c9013 docs: Adding document and notebooks for ONNXModel (#1164)
    • 1f9135f40b76f894b8bcea5983ba8ca37249e123 feat: rename Read to ReadImage (#1163)
    • 8ec07e72d85f4fcc03b51d263856823eda7f7874 fix: improve LGBM error message for invalid slot names (#1160)
    • 448f893684e1f503b6c5cf0d3e3543aa80b61163 feat: ONNX model inference on Spark (#1152)
    • a5135b2ed9bba9f785764f115df6bbeeba7c3797 feat: update DocumentTranslator to support setLinkedService in Synapse (#1151)
    • d5470ffecf1778a6f9ba2df32b0f07049b582e7c chore: fix flakiness in python tests (#1144)
    • 204799258ca23539a275bdc9ee155a6090460f93 update Cognitive Services - Overview notebook (#1126)
    • 6ef2d28a9a3d57d63e40202e3d50ba15ae9ee3d0 fix: flaky lime test
    • 5a6f8946ec24d9f3aa957b19c6c3d8b10160a7db fix: fix flaky conversation transcription test
    • cf1281d0014bb6e88c0d9f0411e5b6d6a23b4d4e build: add two-tiered security for build secrets
    • 8eda1df878256eb68e5921eef9f0c8b6bfef5bb6 feat: add setLinkedService (#1136)
    • 4167921e646619186bc5ae90f2544ddffb0068ed fix: fix SpeechToTextSDK setLinkedService (#1138)
    • 87ec5f7442e2fca4003c952d191d0ea5f7d61eac fix: fix generated python code (#1121)
    • 84d8d246a2c853e00743db1ea2341c47fcef67dd feat: add translator (#1108)
    • d287be6185ca2e2a9a7fe9940a592eda362e727d fix: update notebookUtils class path (#1118)
    • 0f69cf5ac9e12db78ccee67c8fc768ef3b864cb8 feat: add singleton dataset mode for faster performance and use old sparse dataset create method to reduce memory usage (#1066)
    • 41bfd055175f6c8f3aee437b89ca1083f394d20c fix: LIME returns NaN weight if a feature contains a single value or when the sampler cannot obtain a different state for a feature due to data skew. It returns zero weights for all other features. (#1117)
    • fe70f31766818d39ae059ef2e4473735014f8168 fix: fix Guava version issue in Azure Synapse and Databricks (#1103)
    • 115f9214562b1f9a5ac3827f9f674c86bb66eee8 fix: fix flakiness in spark session stopping
    • a825a7430ee49a1c56533b7f844e9094c1e0f898 chore: auto-update packages in docker
    • 9314f82c7713a140311496faaeb229727886ad51 fix: Fix result parsing for forms
    • 0c6490d2394e88ed09121e3a75dde638568464a1 chore: fix flaky notebook
    • 94f04a8b78460826e55eabfcd64caacfa76ec44d fix: LIME sometimes returns NaN weights (#1112)
    • 85f089d0ae7aaaaefe6afa83c8aa96268bf6db14 feat: add form recognizer support (#1099)
    • 931cb42b25e0d637ef251b18a524d7027bbea127 update: reformat code
    • 8c69739c8ff9d714613f46528504ef4fcc67d5a5 update: update setLocation
    • 124b9c651211a3a580ff4d9fa254c627dc6ae866 update: remove parens
    • c2e31923b68862f8ae6890491ac1d80a44eba44f fix: reformat code
    • 20a795b9bf13ac70f658c379cbd7c4998ae25496 update: use HasSetLinkedService trait
    • f075a97f6f0d1bd3446caa8d8389255ec18bd0a2 update: add more cognitive service
    • 13a7126bbee60a287c5cf175060a44cf9a355dae update: add more cognitive service
    • 8114ccce08a88f11ce7df9353d56e18d43dbe503 update: add more cognitive service
    • e5b2a20d276c0c5472045d879d9fd4e64f77e803 update: add more cognitive service
    • f6e6591237c994f02ab79f53f347a79d23c02277 update: remove test code
    • d01fa1818e09d8c3c38ac6bf8c4e63348c5e7196 update: add test code
    • d85fc59960d871060fc0f7866e5d4d55120e6f95 update: remove testing code
    • 873ed329d8324b2814c1517e62e4c18feb52087a update: add sample code for test
    • d842f6205ec4bbb8562a3f60c79de96eb8ba4a53 update: add sample code for test
    • 2318af64c0f08fb2605621c28c2dc5565da6f86d update: add sample code for test
    • 3034b59a570af404bdc5b2f395759e6badc3f5fd update: add sample code for test
    • 74215972bb6ca3d02b8d1c94c20aa54aba7f376a update: add sample code for test
    • 5b7e574ebe5a01a810ebed9137b258a457b63596 update: add sample code for test
    • e633635611cbd79610c835a4aed543b005b7badf update: add reflection
    • df9098d5aba54940278df5e47d8ad53a5123d478 update: remove example in test files
    • 2deca5ee1a6a6b32befcffbe3473ad1a9c1bbee2 update: add class path
    • 80b7a08ac4d3b8ff451cffc8bae2de796df240a5 update: add reflection
    • f480aff79d2e2a2c04efe0fc83564ed239af22b4 Docs update
    • 40f7fbf50d1f7fef6c86d04c00117bcd89c1c2f1 Reformat notebooks with jupyter lab
    • 774af7297b5f61c03b59b350923677172537898b update notebooks
    • bafc8d470fcf0ef1b309831113faabf93e7e7974 Update docs, reformat notebooks
    • 171ed8958126eb274d6138605540c3024dfdd80a update: notebook
    • c255e6617cca64f777a49a887977fc27bfb5cffd Deprecate old lime code and update readme
    • a9b55425f129aa2d251c3cfd3acb76fd2778a64c docs: Documentation and notebooks for Interpretability on Spark
    • 26b9b077431b9ad76689e189225e2ecbb779461f explainer notebooks
    • 84f96e9a46e756396fafd243159aa7225644bbee chore: remove unused code
    • 541f76f7dc1c31a07adb4f7f8c903199b303a4ff fix: explainers return wrong results when targetClassesCol is specified
    • e54406a32ba9a5b56e65d1a12195c824bbbc6f4b chore: fix codecov logging of wrapper generation (#1098)
    • a5b265e41d387ddb32fecf74e6b25f35f6034d9b feat: split library into subprojects (#1073)
    • c84ab47020e358fe875a29160037c4971c0a77a7 fix: Unit test OOM error (#1093)
    • 725a92dce673b05798a410d24658a751ffa89b2e fix: Update codeowners (#1092)
    • 7dd6bb1cf082bdba6298cc0a85b0b6ba95ed1f0e feat: new LIME and KernelSHAP explainers (#1077)
    • 00bac62b94284ab5ac94c30ff1f174571622e836 update: update spark version to 3.1.2 (#1086)
    • 21d6c0444e1e2747b759f65f1c63f13cca12c7f8 feat: refactor to have separate dataset utils and partition processor (#1089)
    • e8a97ed9ecf3b6c11a164543482ada6576f8abd2 feat: refactoring of lightgbm code in preparation for single dataset mode (#1088)
    • e7d4ecafc3f524906ae4548b0879c37bc8633a2d build: Fixing build warnings (#1080)
    • ebee5dc3ac7c0ae69b120dc2b0d50da8c6e0be53 fix: BingImageSearch fails randomly in E2E test (#1082)
    • 0632f1bf61ab6dc793095f1a639cbf3b0754a0d7 fix: [Workaround] CNTKModel does not output correct result (#1076)
    • 36ee274e93e1f7a07fc863061ad726e5ca5b49ee feat: move partition consolidator and add LocalAggregator API (#1071)
    • 2a716c100fc99a66d01c849256b75ced383eb23a feat: add number of threads parameter (#1055)
    • 63ce4ef62a916982002b0b6f8a55e3f7d12b830e fix: small issue with null in bing image response (#1067)
    • 6aecdf1c0c212950344f210f11aea2dfb8760009 Add sparse vector support to KNN. (#1063)
    • ab15ca4237225caab9c8ea6e937bbed3d911b660 fix: fix flaky conversation transcription test
    • 45379694813458c5e113d84186c09b3a5c455cdc fix: avoid strange issue with databricks json parser
    • 4baaf4964fc1c91a532d690a58468c13e32526ad fix: fix dependency exclusions and build secret querying
    • d6b1726d9078f9fd0560c986e3913b47101fe5f7 docs: Add explicit pointer to HDI install
    • ae8004afc2924304ce554c1b67e1ad4c316c7100 feat: add custom objective function to lightgbm learners (#1054)
    • d8bb51f8d4c8b5a9cd2e9a046fb0355dabc356f2 fix: Fix issue in tabular lime sampler (#1058)
    • 663d9650d3884ece260a457d9b016088380c2cb9 feat: Add more notebook samples for documentation (#1043)
    • 12cea2df9e479077813b611c1b098ca39b1a3133 feat: add matrix type parameter and improve auto logic (#1052)
    • 03b8b7d141332b2913fdb9b9b1ee3671fdd12ab7 fix: Bing search URL update (#1048)
    • b704515f2180ea839e67ac37753c8796f759ef1a Update Classification - Adult Census.ipynb
    • bd63cc8d5ab4de1e0ae73779bda6f094d28bc720 feat: add several parameters related to dart boosting type (#1045)
    • b7f29e8300b85e82798c8bfee96cb95207e5b727 feat: added chunk size parameter for copying java data to native (#1041)
    • 1c4691f1b77b93b9fe756e726f053ea77abe77c9 Update pr.yml
    • aad223e045512f5c59249e838cfff2fd5d279e2d fix: early stopping test and average precision metric (#1034)
    • 04a9876fd30f0162f4b17c81059753c0290a5564 fix: refactor python wrappers to use common class (#758)
    • f5479ddfcf9fa9e776a5e83fefe4371db0d6abcc fix: java params patch (#1027)
    • d7b86d34502507dc6aef01a47c186d9b6ab1cfbd Create pr.yml
    • c20aee805bafa17652e014e343fbe18d1981f98f Update ado-integration.yml
    • e3cffa5751c369c44186dd44adb54f91bc0626a9 Delete ado-pr-integration.yml
    • 11f8dbbe6d884f55bdbcaeadcc0b741ff8baf93d Update ado-integration.yml
    • 369bb8326602c55a3695d6848d32e2abedc6d12f Update ado-pr-integration.yml
    • a53003f3f249bf7c1c3de87b702be418afabe405 Update ado-pr-integration.yml
    • 05cb62622b214927021437e0d97426559b639d74 Rename ado-pr-integration to ado-pr-integration.yml
    • 03f6f29d572d3b634375da4865c26b2def437811 Create ado-pr-integration
    • 19b305f0a1170458027ea1ed35cde50ad8e870e0 Update ado-integration.yml
    • a7dbeb83a78caaae7c1520c26e17d9a7aafd077e Update ado-integration.yml
    • 3b8e046cfc514ace79f5bae9554d415c40438978 Update ado-integration.yml
    • acbb268f93db61a863e7921ad0550d9039127d6f Create ado-integration.yml (#1039)
    • 1e2f33b3fa5a3ab0a58093c9dc8df6f58034d024 feat: Add MMLSpark logging infrastructure (#1019)
    • 99b580f5ee7c671fb662908623dddff632bedc9d feat: Add R wrapper gen
    • bf337941f4fed2b4675d307aa446e0e3b54ef251 fix: missing returns in new python lightgbm model methods
    • 99047351f1ec4a3d547ec622c6027506c328da68 chore: update to lightgbm 3.2.110
    • 61d2bf18991b78402a405085f914366c8792afe6 feat: add num iteration and start iteration to lightgbm model (#1024)
    • 2c223f664c506acba4fd1ef4f53b4541df3fcc25 fix: fix issue with r bindings silently failing
    • c33451fb22b7c140749ac443d5a68c98a44c1c0a fix: fix conversation transcription participant column functionality
    • bc9e81ef2cf3fe5b0a1a1a586ace925fa1270d1f perf: tune chunking code, fix memory leak
    • 8942198727fd652d8cae5dbf75ca7404da4e07ee fix: reduce verbosity to prevent RPC disassociated errors
    • 0c44344a6354f2aae4754ec825fbbc97275eacad perf: moving to new streaming API for dense data to reduce memory usage
    • 1b46782818b53c0bb6cce9cb95a6eb98bf49d177 chore: fix badge publishing
    • 1e3a4a44c68fd0d5257b8708c1c5e3885330c760 fix: Fix performance slip in Featurize
    • 8d4c405daec9adbe4482ba20849de6596e217bef feat: Refactor code generation system
    • cd79ecda47bacec8acfa6babf6e585240e617ad0 chore: upgrade lightgbm to 3.2.100
    • ffe2507ed8c1b9c20ea7efe6d3d7407c4bc88506 fix: add timeout for stt
    • 3b91af32cdc1bcd24d59db28240eb23b118cb502 build: update ubuntu version to 18.04
    • 4446afa5d8c6748560c650deae877374e4f7793c fix: update subscription in build secrets
    • 01a8cb4f2bcce7e953d7305f80b439646fc590d8 Update developer-readme.md
    • 54379bf7cdfd7fb2f27f3a0bb5f055c95e560c36 chore: remove flaky LGBMtest
    • 4e915d4312ea1ad11a8dc5fba499f6507c2f8825 feat: add automated python test generation infrastructure (#998)
    • 9b7518316cfcc2f5debce549bbffa3566c2cb865 fix: Add ffmpeg time limit enforcing for flaky streams (#1001)
    • ec7cb7856381cfa1169a3f6fb119a67062510cbc fix: fix upload python whl file to blob (#1000)
    • 96f66447ce69e1cd24ca6ec3b69c4b980255842a fix: adding more recommendation code owners (#996)
    • d496aa7d437e0c7edd3237a85951e43951eee1c5 fix: cleanup python tests (#994)
    • 0717ac4c603ab69f5f8fcc4c87dc2bfebc90e2bc fix: Fix read schemas (#988)
    • 9cff1e6495a4509bcaae832a44205592ecaaa05b chore: update build to new subscription (#991)
    • 7a1f28b0c163979baf48ff23863752c9280a2009 Update pipeline.yaml for Azure Pipelines
    • 657e6b1d969932cd68f29033001abedec6760952 Update pipeline.yaml for Azure Pipelines
    • 3661a443a38111a7971f236f009fa32fd7533f74 Update pipeline.yaml for Azure Pipelines
    • 7ce0c5ff8cc1e0bf470d66354c324a128da35c93 Update pipeline.yaml for Azure Pipelines
    • 19672c485798d65e82bb76846d0d912ed64990e7 Update pipeline.yaml for Azure Pipelines
    • f913bdd94d8cf5230e1e2274c95ee768b21680df docs: fix typo (#990)
    • 59b684178ed12c82c292e24d0bd1ded4effeadd4 Update README.md
    • 062a470e1eb714cf4443c939e97c974f98d99d17 doc: Add CyberML link to README.md (#989)
    • b1c1400802a55b2899f3fa21656e187a3b6fd808 feat: add TextLIME
    • d4fa5771142e3a0a02953da4792622bf1362832a fix: fix issue with NER suite test
    • 86beddec070a4ccdf45d41b4dfd57183a94d5269 fix: make concurrent timeout infinite
    • 89fa081b82f93d6f1240b3229c7918b166571f89 fix: Make rate limiting retry indefinitely
    • f14623e21b70f6ed44ba7828f7886436e21bf496 fix: Recommender Patch for Spark 3 Update (#982)
    • 13ce0c974963d3ccda028658886b4cf323898071 Update developer-readme.md
    • 6218a5b4fdb19a1329c8b91d6ec9148bb12f3d87 Spark 3 (#970)
    • 5a5147addc42036282d1b45088fb91333d45b2d3 fix: fix typo in text sentiment schema
    • 4fe354826d79feffcd852bd166d91402eb1384a1 feat: Add ReadAPI
    • 4dab861e080248b7b938a4b2468d5633ef4be17b feat: add conversation transcription
    • 218913a131a55b9de62cd200cafe9de940cadd38 fix: change ints to longs for offset and duration in STT
    • 1daca68096e595e8774938bbf5d7abb98c000e80 feat: add m4a codec
    • 8e0c9b0f024c0917ae2086245c8bb52d502c0d58 chore: fix Detect face suite (#968)
    • 0571ae25f9c25f7e1491809756687e56e5c2e84e doc: Add example cyberML notebook (#958)
    • b04d6d655e37c22d043cb8de4359ec5b8ba5745a fix: fix python tests in build
    • 15eb55bdf8704c2375ea6e3fdd01b6fe2620c08e chore: remove issue in scalastyle file for new IJ
    • 66ffeca190390115a5cd0c3c1b1c819d57ee8ece chore: lower threshold for STT tests
    • 55a3c1043813ec78a00755068d6028724b91aa41 build: fix build for new intellij
    • 7b1830e53fc88f6cb9efc8fc6e6bd885cd08bcef fix: fix processing sparse vector size
    • 0596de944e7681d8811b2aef4390527df9dfa37e Update developer-readme.md
    • 05359cfa6bf69bc67ca02e07f77e2bd91dd871e6 Update developer-readme.md
    • 0a30d1ae5583bcde95a20264af0a41b0d7175149 fix: Fix Double User agent setting bug
    • 1f077baa295f6c1426d5a28ba45d958e2a058edb Update pipeline.yaml for Azure Pipelines
    • 52463b1750db48adbcdbc073d00574345d996363 Update pipeline.yaml for Azure Pipelines
    • 78083a7ac03b5ac57e031a02d6cfe36d653470da build: fix livy dependency resolution
    • c2a3921739263914d605b5f8847ec01e0000d8d2 fix: remove preview api from NERv2
    • 98a827194b7f17f926a055ae5ab94aca54ba669e docs: Bump python install to top to make it clearer

    This list of changes was auto generated.

  • mmlspark-v1.0.0-rc3 (Jul 18, 2022)

    v1.0.0-rc3

    Bug Fixes 🐞

    • fix broken test link
    • Fix incorrect indexing for determining eval prob in CB (#922)
    • Update DBC path

    Features 🌈

    • Add Env variable parametrized UserAgent header
    • Add support for ContextualBandit in the VW module (#896)
    • Update text analytics api to v3 (#916)

    Maintenance 🔧

    • bump version to 1.0.0-rc3

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

    @jackgerrits @rohit21agrawal

  • mmlspark-v1.0.0-rc2 (Jul 18, 2022)

    Microsoft ML for Apache Spark v1.0.0-rc2

    Highlights

    | | | | | |
    |:--:|:--:|:--:|:--:|:--:|
    | Isolation Forest on Spark | CyberML | Speech To Text | Conditional KNN | LightGBM + SHAP |
    | Distributed Nonlinear Outlier Detection | Machine Learning Tools for Cyber Security | Custom Speech to Text with Streaming Support | Scalable KNN Models with Conditional Queries | Interpret LightGBM Models using Additive Shapley Explanations |

    New Features

    Isolation Forest on Spark ⛺️

    • Added LinkedIn's Isolation Forest outlier detection algorithm
    • Read the original work for more info

    CyberML 🧙‍♂️

    • CyberML aims to provide open source tools for distributed cybersecurity workflows. This first release includes an algorithm that learns user-resource access patterns to detect anomalous access patterns. For more information see the docs

    Cognitive Services for Big Data 🧠

    • Added SpeechToTextSDK transformer. This new transformer transcribes raw audio files and live audio streams into text. Transcription supports real-time audio streaming, automatic splitting into utterances, and profanity detection. It supports several languages and Custom Speech Models.
    • Added TextSentimentV3 transformer to leverage the new Cognitive Services v3 API
    • add save and load methods to AccessAnomalyModel (#905)
    • stream robustness, output audio stream to file, and custom speech
    • Add m3u8 streaming for SpeechToTextSDK
    • enable mp3 file streaming in stt sdk (#822)

    Conditional K-Nearest Neighbors 🏡🏡

    • Added ConditionalKNN estimator and model for efficient search of high dimensional KNNs with conditional predicates.
    • Added Conditional KNN demo here
    • Find hidden artistic connections with the Mosaic application.
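
As a rough illustration of what a conditional query means, the sketch below runs a brute-force nearest-neighbor search restricted to points whose label is in an allowed set. It is plain Python and does not use the ConditionalKNN estimator; the function name `conditional_knn`, the data layout, and the choice of Euclidean distance are assumptions made for illustration. A real implementation would use an index structure rather than a linear scan.

```python
import heapq
import math


def conditional_knn(points, query, k, allowed_labels):
    """Return up to k (distance, label, vector) tuples, nearest first.

    points: list of (vector, label) pairs (hypothetical layout).
    allowed_labels: set of labels a neighbor must carry to be considered.
    """
    candidates = (
        (math.dist(vector, query), label, vector)
        for vector, label in points
        if label in allowed_labels  # the conditional predicate
    )
    return heapq.nsmallest(k, candidates, key=lambda item: item[0])


# Toy data: 2-D vectors tagged with an illustrative category.
points = [
    ((0.0, 0.0), "painting"),
    ((1.0, 0.0), "sculpture"),
    ((0.0, 2.0), "painting"),
    ((3.0, 3.0), "sculpture"),
]

# The closest point overall is a painting, but the query only allows sculptures.
nearest = conditional_knn(points, (0.0, 0.0), 1, {"sculpture"})
print(nearest)
```

The predicate is applied before ranking, so the result is the nearest neighbor among the allowed items rather than a filtered view of the unconditional top-k.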

    HTTP on Spark 🌐

    • Added integration with the Python Requests library, so that Requests calls can be accelerated with HTTP on Spark
    • Optimized HTTP on Spark asynchronous performance

    Vowpal Wabbit on Spark 🐇

    • add barrier mode support for VW (#832)
    • add support for VW readable model, invert hash and re-using a previously trained VW Spark model (#821)
    • support generic numeric types for weights and labels (#817)

    LightGBM on Spark 🌳

    • add featuresShapCol to LightGBMClassifierModel (#863)
    • Expose parameter bin_construct_sample_cnt in spark for LightGBM (#780)
    • add interface function for updating learning_rate per each iteration in LightGBMDelegate (#849)
    • add delegate to monitor training (#847)
    • Add the option to get Feature Contributions in LightGBMBooster used by LightGBMRanker (#791)
    • Add option to add tolerance to improvement in metric evolution (#786)
    • added pred leaf index for LightGBMClassifier
    • Adding a new param for explicitly setting slot names. (#752)
    • added the top_k param for voting parallel (#762)
    • Adding a feature for positive and negative bagging fraction params. (#754)
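
The tolerance option from #786 can be pictured with a small early-stopping check. This is a plain-Python sketch under an assumed semantics: a new score counts as an improvement only when it beats the best score so far by more than the tolerance. The function name `stopping_round` and its parameters are hypothetical and are not the LightGBM on Spark API.

```python
def stopping_round(scores, patience, tolerance=0.0, higher_is_better=True):
    """Return the 0-based round at which training would stop, or None.

    Training stops once `patience` consecutive rounds fail to improve on the
    best score by more than `tolerance` (assumed semantics, see lead-in).
    """
    best = None
    rounds_without_improvement = 0
    for round_index, score in enumerate(scores):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = score > best + tolerance
        else:
            improved = score < best - tolerance
        if improved:
            best = score
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return round_index
    return None


# Validation metric that keeps creeping up by tiny amounts.
auc_per_round = [0.80, 0.805, 0.806, 0.807, 0.808]

# With no tolerance every round counts as an improvement, so training never stops early.
print(stopping_round(auc_per_round, patience=2, tolerance=0.0))
# With a tolerance of 0.01 the small gains are ignored and training stops early.
print(stopping_round(auc_per_round, patience=2, tolerance=0.01))
```

A nonzero tolerance trades a little final accuracy for shorter training when the metric plateaus with only marginal gains.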

    Learn More

    | | | |
    |:--:|:--:|:--:|
    | MosAIc Finds Hidden Connections in World Art (Article, Demo, Webinar) | Watch the Spark Summit Europe Keynote on MMLSpark | Learn about AI for Good and MMLSpark on the MSR Podcast |

    | | | |
    |:--:|:--:|:--:|
    | New Docs for the Cognitive Services for Big Data | Read our New Paper on Conditional KNN Trees | Read our New Paper on Microservices in Databases |

    Bug Fixes 🐞

    • Updating regular Docker Images for helm chart. (#885)
    • improve error message for invalid slot names (#897)
    • categorical parameter regression on dense dataset caused by missing whitespace (#909)
    • fix cyberml test imports
    • add "s" to failing publicwasb download
    • spark.executor.cores' default value based on master when counting workers (#855)
    • fix flakiness in BiLSTM notebook
    • make file type case insensitive
    • Add support for URI parameters and default filetypes
    • remove save_resume/preserve_performance_counters options as it breaks SGD/BFGS chaining (#828)
    • fix optional parsing for the CustomOutputParser (#835)
    • Fix flakiness in io tests
    • Improve codegen readability and added getters and setters to generated models
    • move tests to a separate package and refactor common code
    • added multiclass init score support (#805)
    • LightGBMRanker should repartition by grouping column (#778)
    • Possible multithreading issue when two scores may come in parallel they may not safely fill pointer values (#799)
    • Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance (#792)
    • Fix subtle bug in reverse index creation
    • add cap on max allowed port in network init (#759)
    • added min_data_in_leaf parameter (#760)
    • Reorder ADB Status Checks to fix flakiness
    • increase library install timeout (#763)
    • Fix an issue with the sparkContext not being instantiated at eval time
    • Fix GH release badge display
    • Codegen dataframe param fixes

    Build 🏭

    • bump version
    • Ignore existing installation when running installPipPackageTask (#895)
    • update ffmpeg on build server
    • make python test loop easier:
    • updating lightgbm to 2.3.180 (#850)
    • split cog services on spark tests
    • Split e2e and publishing (#836)
    • Add Caching to build pipeline
    • added isolation forest test to build pipeline (#800)
    • exclude scala from fat jar

    Code Style 🎶

    • Removing redundant file in the root directory: sp.txt (#796)
    • ball tree style fixes

    Documentation 📘

    • Adding section to readme for installing with apache livy (#785)
    • Add fix for maven resolver
    • Added two classification examples using Vowpal Wabbit (#733)

    Maintenance 🔧

    • add Roy to CODEOWNERS
    • fix flaky analyze image test
    • move build to new subscription (#888)
    • Update codeowners file to fix helm owners
    • remove flaky lightGBM test and add retries to Cog service tests
    • Update CODEOWNERS (#831)
    • Add time in httpv2 tests to reduce flakiness on build VMs
    • fixes to improve test flakiness
    • updated lightgbm to 2.3.150 (#757)
    • improve efficiency of lightgbm tests
    • Add more cluster status checks
    • fix flakiness in IdentifyFacesSuite
    • bump heap size in build
    • add default UA

    Acknowledgements 🙌

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

    • Ilya Matiach @imatiach-msft
    • Markus Cozowicz @eisber
    • Lucy Zhang @zhang-lucy
    • Roy Levin @rolevin
    • Keunhyun Oh @ocworld
    • James Verbus
    • Christina Lee
    • Anand Raman
    • William T Freeman
    • Lei Zhang
    • Rohit Agrawal
    • Nisheet Jain
    • Chris Hoder
    • Chris Templeman
    • Chenhui Hu @chenhuims
    • Ryan Hurey
    • Jun Ki Min @loomlike
    • Dotan Patrich
    • Addy Santo
    • Anil Francis Thomas
    • Amrit Bhattacharya
    • Moshe Israel
    • Dalitso Banda
    • Joan Fontanals @JoanFM
    • Jack Gerrits @jackgerrits
    • Akshaya Annavajhala
    • Heiko Rahmel
    • Felix Tran @felixtran39
    • Stephanie Fu
    • Parker Levy
    • Casey Hillenburg
    • Vick Wowo
    • Brendan Walsh
    • Nick Gonsalves
    • Mindren Lu
    • Nurudín Álvarez
    • Guolin Ke
    • Chris Smith @chris-smith-zocdoc
    • David Lacalle Castillo @WaterKnight1998
    • Fokko Driesprong @Fokko
    • Diego Mazon
    • Tommy Li @tommyzli
    • Azure CAT
    • Vowpal Wabbit Team
    • LightGBM Team
    • MSFT Garage Team
    • MSR Outreach Team
    • Speech SDK Team

    Changes:

    • 81e73a27477be66788aa37c042eea27fa1c9bab6 chore: add Roy to CODEOWNERS
    • b12be504b1bf21c6894c13c795639d27e92353f0 build: bump version
    • b431a61b06f48ec1e7bf8ff2e9809025dc5f1bf6 fix: Updating regular Docker Images for helm chart. (#885)
    • 96f0b7775629d6e7b521d1ed8ca0e54655deef00 fix: improve error message for invalid slot names (#897)
    • 95c1f8a782191e3578587a49313e1d57abee5da3 fix: categorical parameter regression on dense dataset caused by missing whitespace (#909)
    • 040ad34964aaa266a6318a6974f324102a8302aa feat: add save and load methods to AccessAnomalyModel (#905)
    • 8f8c504dee24dae8bc9262a84c00d2e5d273352c fix: fix cyberml test imports
    • 9aed00480b08bd5e6378c255990ec80a0e7f9709 chore: fix flaky analyze image test
    • 826cfc22b9c4e8db37e1f520302079ef993cd321 fix: add "s" to failing publicwasb download
    • 22e19e5f52698a653dfae467f133c6552bc26e50 feat: CyberML (#890)
    See More
    • 54a623d445442f10e5d57d2c958b3762d4d9e331 build: Ignore existing installation when running installPipPackageTask (#895)
    • f1b4a946bb0d573d2e0de7705ff6694a3645f04a chore: move build to new subscription (#888)
    • f07e5584459e909223a470e6d2e11135b292f3ea Merge pull request #882 from ocworld/fix-rename-clusterutils-numcores
    • e741993efa34357f17ba5b2d1db357e8a6a68940 build: update ffmpeg on build server
    • 9f9ae53e8927f7c91283b611d0556e1c332f5757 feat: stream robustness, output audio stream to file, and custom speech
    • 0319650f275c8f4539c1ab14d4ac0660352ae32e build: make python test loop easier:
    • 65a13bc1c11b1799f1beb35cc83e5d5723b32526 chore: Update codeowners file to fix helm owwners
    • 7409ba58f1ef25be349c19cf429c880c8d7eb4dc Add num tasks override parameter for LightGBM learners (#881)
    • 64481e9437db43eb5f25cb33e31f097bcc59eccf fix: spark.executor.cores' default value based on master when counting workers (#855)
    • 4ae0fe87699d32c65dc75fa2b1787a0d70d71d75 reduce network communication overhead cost on reduce step for LightGBM learners (#869)
    • b4137492445060f5bcb5cab955e4bf4f91fb9543 fixed shap values shape for multiclass case and improved pyspark API (#870)
    • 840781a2ae6c3e9ee0a065294c893e53df576de7 unify APIs across LightGBM learner types and add SHAP feature importances to regressor (#864)
    • 84b392c3a46cff8d2138326da960a912ed0baf75 re-disable flaky test (#866)
    • d86a9370a9f0baf966f264e686751ddcdd29215c build: updating lightgbm to 2.3.180 (#850)
    • 6bb4a45f5bcc9f67392f934e6ec94670145bac3f feat: add featuresShapCol to LightGBMClassifierModel (#863)
    • 82e7a8eb59d809a4ff5a66d06bedca1ea958bbe3 Bump Apache Spark to 2.4.5
    • a0db5b330b75e0211f629846cb36558b576e339f build: split cog services on spark tests
    • 537b611d9df7bbf2927666095d04f3c785dad66a 1) add functions for before/after batch training (#852)
    • ed435b82e8db55f902c15d18fe1fb52cba1631bd feat: Add m3u8 streaming for `SpeechToTextSDK`
    • 4d998794c114b43a5c60f5b2ed1182fb3f656c7a feat: add interface function for updating learning_rate per each iteration in LightGBMDelegate (#849)
    • be366c514d08e820ca0b1112072db2fb76e6f65f feat: add delegate to monitor training (#847)
    • c695d7a93b1a5b86941c8c9c2e4b586f0a6c421e add option for driver listen port
    • 99795bc38fc4decc20667ff2b6a6c34e64196209 fic: Codegen dataframe param fixes
    • 37e336ef7534bbcc881ec0d999ff60812057d10f feat: add barrier mode support for VW (#832)
    • 9c9a93b857d46a24458cc53f74aa8dfb95135a8a fix: fix flakiness in BiLSTM notebook
    • 5d9410a032ef3dfdae647c98fe771d547b910cd0 fix: make file type case insensitive
    • 55765f8e13cc8fa28a0acbacc71e833871a9cd36 chore: remove flaky lightGBM test and add retries to Cog service tests
    • b1e37972644ddd66b109cbeef4e1fb2c8578e20c fix: Add support for URI parameters and default filetypes
    • 5ae664affe7946a79fb6dbe096edc81b062d17f7 improvement: support numeric types (not just double) for weight/label (#817)
    • 9f15b6cd1a6d582dec9891b61430aeafad24b3b4 feat: add support for VW readable model, invert hash and re-using a previous… (#821)
    • 038b26b3d266a2f99d6b9f094aa97188be108fec fix: remove save_resume/preserve_performance_counters options as it breaks SGD/BFGS chaining (#828)
    • 7dd467092d83e162116cea5bb3084c359207cb87 build: Split e2e and publishing (#836)
    • ca05d1b5b99e6fa93aaf8d9916e55f4c7579d226 extended test case to validate duplicate passes parameter (#834)
    • 2ff6a36c64797847c0a57fb0a75ba697f7dd3e99 fix: fix optional parsing for the CustomOutputParser (#835)
    • f9a56e886ad02ed6b233114424b88ede71f30d7a chore: Update CODEOWNERS (#831)
    • c79dd12abca579d416acdb46c049132b4b41cd0d chore: Add time in httpv2 tests to reduce flakiness on build VMs
    • c7eed5a9f7e6c16a9c8d3270a012177fbe5ab6d5 build: Add Caching to build pipeline
    • c5b8b1579afd30fd7b63d1234023f97b2c2668e4 fix: Fix flakiness in io tests
    • 3abd9b44324ee5abb7a134d24bd96aa69c67680b chore:Split up io tests into 2 sections
    • 5489271aaa42736a6700d1684fc331bd5cd2354c fix:remove error prone IO from notebook tests
    • b4a60e5655d585c8ef7b91abf00a0d9dd205a59b fix:remove error prone IO from notebook tests
    • 2455cbeb5c4b8de746d2f56089445d3175bc715e chore: fixes to improve test flakiness
    • 6d7cfb5f17ca0b5a9ec807da21a77ec78c65b0f3 fix: Improve codegen readability and added getters and setters to generated models
    • 015d4ea0c27fd9bd710be0c6467c410afc58dc3a fix: move tests to a separate package and refactor common code
    • 6b2edc34a6116717267f65dead8582488d91cd9f feat: enable mp3 file streaming in stt sdk (#822)
    • 8005c1702dcf2c3d22fc821e45b14637a78c5c1f feat: Add `TextSentimentV3` Transformer (#812)
    • df0244c7b4f48e6c86bd8a5478d4b674a9a554ce fix: added multiclass init score support (#805)
    • e745784c7e60a07652bfc24e3039ed5906754541 fix: LightGBMRanker should repartition by grouping column (#778)
    • f7029211e737cbbc3019a2da48f5bc72d9f213e9 feat: Add the option to get Feature Contributions in LightGBMBooster used by LightGBMRanker (#791)
    • 875f89de89d3d7ed46a6bb4f73aca336fa276f09 build: added isolation forest test to build pipeline (#800)
    • 290f5cfca57606728a531076cca521d6d2bfda11 fix: Possible multithreading issue when two scores may come in parallel they may not safely fill pointer values (#799)
    • fb3ac9932d56c094a50dd3dceaf08ef6fdbe3ae1 docs: Adding section to readme for installing with apache livy (#785)
    • 7b8efa593037c3a53d2363c702e43391bb6e3304 fix: Guarantee one boosterPtr is allocated and freed per LightGBMBooster instance (#792)
    • 4c812d793a31bd4537e9f7d53cda6c90f08d7c44 style: Removing redundant file in the root directory: sp.txt (#796)
    • bd2f71e6a59c2b8ad730c8bafadc598faf189779 feat: Integration of LinkedIn's Isolation Forest (#781)
    • 9c61053fa126959c962c0707aa543451ef077574 feat: Add option to add tolerance to improvement in metric evolution (#786)
    • dbb281821542dff82386e6855e807d2e906c11bf feat: Expose parameter bin_construct_sample_cnt in spark for LightGBM (#780)
    • fde2d3cd4b6b72b789cbd74087d354c5164deb12 fix: Fix subtle bug in reverse index creation
    • 4b4af04893966d745bc1cc9cf34cda80837992d4 feat: add demo for `ConditionalKNN`
    • cf48d53c5fae480f9278b104fea4597cb966af6a chore: remove keys from demo
    • 2618422c6f1249f364ef9da4de5df3c9b648ecd9 feat: Add `SpeechToTextSDK` Transformer
    • 4da1ff2a1f4e88f5fe7b2a634510bd7dcbcc2993 style: ball tree style fixes
    • 849527d58972a67addd9934717d5db09f3f39897 feat: Add python bindings for `ConditionalBallTree`
    • d4d4ca82b809fd18671a2bf629b0b6201fc9a4f8 feat: Add KNN and ConditionalKNN Estimators
    • 134ddb5beba80165516d58740a364cc152531f65 fix bug in serialization
    • a00c141ce632a4259e285605853a2889ecca04cc fix review points
    • 9cf33cef3387d3edd12dffe8b476b60a976b5203 feat: added pred leaf index for LightGBMClassifier
    • 461d27d535414fe9e2547dc05187853ef1facc4e feat: added pred leaf index for LightGBMClassifier
    • 3a7a8130ee4f82fbfd99edc24f2562cc63cded6a feat: added pred leaf index for LightGBMClassifier
    • f3d624dbd3b15b64061ab8fb9b4a3f6a61f35f99 feat: Adding a new param for explicitly setting slot names. (#752)
    • 280cab7b020c4afdc9084db74ac33ddcb9abcd8f Expose dump model method on MMLSpark-LightGBM so that models can be saved as json.
    • 3da5d4f4cb68b6a6708d26b11497e48393594aa8 fix: add cap on max allowed port in network init (#759)
    • 91652f2e2302a4ae9d309534badff8ca8a2fd517 fix: added min_data_in_leaf parameter (#760)
    • 6bb042909df38ea82d4f2ec608e5235400cdc3a2 chore: updated lightgbm to 2.3.150 (#757)
    • 344dbbda12a2c6b309df290c268b94e4a6d83d1c feat: added the top_k param for voting parallel (#762)
    • ae634973d683514c32aaef95250fa80517bfeaca chore: improve efficiency of lightgbm tests
    • d9568dc2f8b5d1dc53c4293508a4bbb12a4d2653 chore: Add more cluster status checks
    • a9b05b91b59ae2a3fe213c9752c7c7343fd86bd6 chore: fix flakiness in IdentifyFacesSuite
    • 988403ff42630b18e453c01aa6cabf12f9b91fe0 fix: Reorder ADB Status Checks to fix flakiness
    • e1dc2b3df3a9d1a4b3ad5674676ec6d9838d4743 fix: increase library install timeout (#763)
    • a47922f7a45dbcb2c19406960b1b895f82582a8d change labelGain description
    • 43b4e63462a641dace34d16354c7dbef9fefd2e7 feat: Adding a feature for positive and negative bagging fraction params. (#754)
    • 087f290f301d7ec0ae1d9c6fb0a06cda3140fdfb docs: Add fix for maven resolver
    • 3da1d148c07d9de2a8ec7f46bd0de801873ca9cd docs: Added two classification examples using Vowpal Wabbit (#733)
    • dece5aed536658720c7daa8066ea6576bc7cf72c chore: bump heap size in build
    • 8bb7d861981fd519d4623cc903cbe9919762cb3f build: exclude scala from fat jar
    • 2465d4e3a9bf34a929d340e30aaf608997e311ac fix: Fix an issue with the sparkContext not being instantiated at eval time
    • d091b37c050a9c546ac4cb05186f878d93be8282 chore: add default UA
    • 614a4448aed1ffe1bdc58c2fcb7e84539e9fd42c perf: remove async bottlenecks from HTTP on Spark
    • 3caf8f0ef7eb7309f4c896d9e1c44c64e2e12e2d feat: Add wrappers for integrating with python Requests
    • 2fdfe3e852f7010507a1af382ad483a0b977bac6 added max_bin_by_feature, min_gain_to_split, max_delta_step parameters (#712)
    • 95b7ef006d5cdb77346beb826130dc31239fa1db Fix scalastyle
    • 56046025c0fb90816ba176d74ac54e7a411376b9 Fix default case check. Add test cases for countCardinality
    • 491c01cd3de5796a6f9a6abfaf18f6bc67219b37 change getTrainingCols from Option[DataType] -> Seq[DataType]
    • 25425a006d19d2185349f2c0f570e6333b8ab1fd Use a case class instead of anonymous tuple
    • c58b216f477a4d0506f0a3f1ba61e55a1356c1cd Support the group column being a string
    • f22aa732960abdd5c4db00a0d25b88b86b5c28fa Fix: Fix GH release bade display

    This list of changes was auto generated.

  • mmlspark-v1.0.0-rc1(Jul 18, 2022)

    v1.0.0-rc1

    Features 🌈

    • Add brands and objects to AnalyzeImage transformer
    • Add label conversion for VW binary classifier (0/1 -> -1/1) (#700)
    • Add VowpalWabbit ngram support (#696)
    • Add automatic schema inference for writing to Azure Search (#704)
    • Add metric parameter to lightgbm learners (#672)
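The label conversion listed above (#700) maps 0/1 class labels onto the -1/+1 encoding that Vowpal Wabbit's binary losses expect. As a rough illustration only, the mapping can be written as `2 * label - 1`. The helper name `to_vw_binary_label` below is hypothetical and is not part of the MMLSpark API.

```python
def to_vw_binary_label(label):
    """Map a 0/1 label to the -1/+1 encoding (hypothetical helper)."""
    if label not in (0, 1):
        raise ValueError(f"expected a 0/1 label, got {label!r}")
    return 2 * label - 1


# Convert a small column of labels.
converted = [to_vw_binary_label(y) for y in [0, 1, 1, 0]]
```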

    Bug Fixes 🐞

    • Vowpal Wabbit kwargs + improvements (#692)
    • Fix cast errors for label, weight, and init score columns
    • Fix probabilities and some win errors
    • Fix barrier execution mode with repartition for spark standalone (#651)
    • Mitigate flakiness in SpeechToText test

    Build 🏭

    • Add ability to create fat jars (#702)
    • Make Databricks tests use instance pools to remove state (#673)

    Code Refactoring 💎

    • Clean up distributed and continuous HTTP tests
    • Clean up LightGBM tests

    Documentation 📘

    • Example notebook of VW vs LightGBM (#641)
    • Update Cognitive Service docs (#659)
    • Fix typo in Spark Serving docs (#656)
    • Add centOS to VW on spark docs

    Maintenance 🔧

    • Improve code-quality
    • Update lightgbm to 2.2.400
    • Move build to new Azure subscription (#661)

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

    Changes:

    • 8d31c026a252677654717768e942e1cf1adc9082 chore: Bump Version Number to 1.0.0-rc1
    • 2701aedc2a5115860cdeeab7b30e94515f45b828 fixed early stopping test for validation (#711)
    • 6b07829ab302a0e79c34af36fdb12082c83794fa docs: Example notebook of VW vs LightGBM (#641)
    • 163dead1c86c8c3b0506c65d32758a5bb9712f2f fix:fix num cores per executor if config not specified (#709)
    • bc0e0108316927c477b3d3211a4c1193f405d591 chore: ignore flaky test for now
    • ea7d89903163b0efdff815d0e6f3646cf913d11e feat: Add brands and objects to analyze image transformer
    • 04a2fbd31ea3adc857d7d29d6155e00df7532414 feat: added label conversion for VW binary classifier (0/1 -> -1/1) (#700)
    • da124d79f31dde9237c881e7d5d11c83433eece8 feat: Add VowpalWabbit ngram support (#696)
    • a44dafd42562821bc28ab0f9fff39c6991336d49 fix validation data and ranker preprocessing
    • 403786950ce981ac46b99eae767fe0534d379d7f feat: Add automatic schema inference for writing to Azure Search (#704)
    See More
    • 77bb67817d9361c0a8829d06948c5eebbf20d3fc update lightgbm to 2.3.100, remove generateMissingLabels, fix lightgbm getting stuck on unbalanced data
    • 2e45613e6c42949368eaa139989f2e7b18cabfe8 build: Add ability to create fat jars (#702)
    • 035fcd91787cdc1b1b07cfb1bc7c13d5d9f5fa84 cleanup duplication in unit tests (#695)
    • 932ec8667644ae991fcb71b0f527392f6f797677 adding debug for client mode issue and future investigations
    • 95061d0422f32c50f30b4adb13e674b4517eca50 fix: Vowpal Wabbit kwargs + improvements (#692)
    • 3ea5bc53cd0200ec3c9c7f9916aab48aca414961 fix: cast errors for label, weight and init score columns
    • f2bf39fb02ad648de7b5fe77a37ec35919162b5a fix categorical handling on lightgbm learners
    • 671b68892ace5967e60c7a064effd42dd5a21ec7 re-enabling windows tests for lightgbm
    • 8361eadff3ca1e5a7410825643801f49b78e5190 add eval_at parameter to lightgbm ranker
    • c0921fb0f70612fc0e1c2003e9cdb0f40148d911 Better error message when the group column is not a Int/Long
    • 05a2bef54fa88a2293020215cf4cae34f2d212c5 fix: update lightgbm to 2.2.400, fix probabilities and some win errors
    • 16ea090cbc038a466880514fae81dd111b2f099b chore: imporve code-quality
    • ef14350ef283ba4bb92724ed11db78e6227877ef build: databricks tests use instance pools to remove state (#673)
    • 8b27d888824bbca6a385b4d3b7b0364b0150b903 feat: add metric parameter to lightgbm learners (#672)
    • 9805996143d4cf174895ff2e08bb61fd2c99c4f1 fix: fix barrier execution mode with repartition for spark standalone (#651)
    • 1e186adf29ba605a2220228ccc9ffb788555bec7 chore: move to new subscription (#661)
    • 360f2f7d8116a931bf373874cd558c43d7d98973 refactor: clean up distributed HTTP tests
    • 5eedc9360411610555de2323570d223fea0af340 fix: mitigate flakiness in speechToText test
    • 029038610ca56177f3566937dd15747df2b33d67 refactor: clean up continuous http tests
    • 8ed3aeb140eb951208a77fc8a6093a6ac24f8a47 refactor: clean up LightGBM tests
    • f99c9f402c60418f3043eb6aa50aae7b8cf476c2 docs: Update Cog Service docs (#659)
    • df089cdc39512d59592fe70b09acd4b8337a63ce docs: fix typo in spark serving docs (#656)
    • b369244e20d7155029d9c44d90fa4419dee0a6aa docs: add vw to related software
    • 876553a300f245a23c5b5db3eb6cfe71e7674216 docs: add links to readme
    • 81360227321e7a6befc9cbba86721dc10969404e docs: change paper badge color
    • f974a6a30e5d85cea7dd72eb957d0a16d8b86cb2 docs: improve README
    • 8190eb5c721e45b27840c453ee958cdebeabc47f Add links to API documentation
    • 241a48640a06859d468f13178907267f3d34eb83 docs: add centOS to vw on spark docs

    This list of changes was auto generated.

  • mmlspark-v0.18.1(Jul 18, 2022)

    v0.18.1

    Bug Fixes 🐞

    • fix lightgbm stuck in multiclass scenario and added stratified repartition transformer (#618)
    • fix schema issue with databricks e2e tests (#653)
    • update VW dependency to 8.7.0.2 built on CentOS and optimized for portability (#652)

    Build 🏭

    • add proper secrets to publishing step (#650)

    Documentation 📘

    • Remove script action section

    Maintenance 🔧

    • bump version number

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

    Ilya Matiach, Markus Cozowicz

    Changes:

    • 62946d1adf7baa4817f54f6c166db38cea9900db chore: bump version number
    • d518b8aa3aae7ace6608742271f7873decb76b84 fix: fix lightgbm stuck in multiclass scenario and added stratified repartition transformer (#618)
    • 85fb3fc4fa60de7dbe2c20aeb05c4712f0c48d38 fix: fix schema issue with databricks e2e tests (#653)
    • 258cafbd74727b9eed1b7ae66d07e7f85b7b07a6 fix: update VW dependency to 8.7.0.2 built on CentOS and optimized for portability (#652)
    • 376cc6a86e43a2c50d9fee2adb92c34193ebd606 build: add proper secrets to publishing step (#650)
    • 0be08e91cd6c3cc20bd22e98a0f65061df88dbcf docs: Remove script action section

    This list of changes was auto generated.

  • mmlspark-v0.18.0(Jul 18, 2022)

    Microsoft ML for Apache Spark v0.18.0

    Highlights

    | Vowpal Wabbit on Spark | Quality and Build Refactor | LightGBM Ranking and More | Anomaly Detection and Speech To Text |
    |:--:|:--:|:--:|:--:|
    | Fast, Sparse, and Scalable Text Analytics | New Azure Pipelines build with Code Coverage, CICD, and an organized package structure. | Barrier Execution mode, performance improvements, increased parameter coverage | New cognitive services on Spark |

    New Features

    Vowpal Wabbit on Spark: Fast and Sparse Text Analytics

    LightGBM on Spark

    • Now supports barrier execution mode
    • Added the LightGBMRanker
    • Added is_provide_training_metric to LightGBMRanker.
    • Enabled continued training with init score column
    • Added batch training support
    • Reduced memory usage
    • Fixed issues with frozen jobs
    • Fixes for multiclass classification
    • Fixed issue where multiclass classification hangs due to partitions without all classes

    HTTP on Spark

    • Added AnomalyDetector and SimpleAnomalyDetector APIs
    • Added SpeechToText transformer
    • Improved service concurrency
    • Added robustness to socket timeouts

    Miscellaneous

    • Codegen support for wrapping Ranker classes
    • Notebooks now leverage public blob for faster execution
    • Fixed summarize data column handling
    • Better error messages in ComputeModelStatistics
    • Upgraded to Spark 2.4.3
    • Added Spark on Kubernetes Helm Charts
    • Added StratifiedRepartition transformer for ensuring partitions contain all classes
    • Fixed issue where ImageFeaturizer could not be executed on Databricks 2.4.3
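The StratifiedRepartition entry above addresses partitions that are missing some classes, which the LightGBM notes list as a cause of multiclass training hangs. The sketch below illustrates the underlying idea in plain Python, without Spark: rows of each class are dealt round-robin across partitions. It is not the library's implementation. The function name `stratified_assign` is hypothetical, and the sketch only guarantees full class coverage when every class has at least `num_partitions` rows.

```python
from collections import defaultdict


def stratified_assign(labels, num_partitions):
    """Assign row indices to partitions, dealing each class round-robin.

    Hypothetical helper: returns a list of partitions, each a list of
    row indices into `labels`.
    """
    next_slot = defaultdict(int)  # per-class counter of rows seen so far
    partitions = [[] for _ in range(num_partitions)]
    for row, label in enumerate(labels):
        slot = next_slot[label] % num_partitions
        partitions[slot].append(row)
        next_slot[label] += 1
    return partitions


# Eight rows, three classes, two partitions.
labels = [0, 0, 0, 0, 1, 1, 2, 2]
parts = stratified_assign(labels, num_partitions=2)
classes_per_partition = [{labels[i] for i in part} for part in parts]
```

A class with fewer rows than there are partitions would still leave some partitions without it. Covering that case needs oversampling or fewer partitions, which this sketch does not attempt.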

    Build, Quality, and Infrastructure Refactor

    Azure Pipelines Integration

    • Tests parallelized on Azure Pipelines. Builds now take ~25min vs ~90min!
    • Serverless Builds: Queue as many builds as needed with no machine maintenance costs
    • Test results, error messages, and timings are viewable from the GitHub PR section
    • Individual Tests can be re-queued from the GitHub PR Page
    • Builds can be queued using the pull request comment: /azp run.
      • Full details can be seen by typing /azp help
    • CI pipeline entirely specified in small .yaml file in git repo

    Local Developer Support

    • Dramatically simpler developer setup (all through SBT)
    • Local developer setup now works on any platform, including Windows!
    • Local setup no longer needs VM, Vagrant, or 30 min to import the library
    • All build stages are SBT tasks and can be done locally for rapid testing
      • This includes publishing maven packages to local repositories and the MMLSpark maven repo
    • All secrets now managed by centralized Azure Key Vault
    • IntelliJ will pick up on all scalastyle rules for editor-level style feedback while typing

    Code Quality Gates

    • Code Coverage now supported for every PR and reported in the comments and badge
      • Coverage is now a check-in gate: it must never decrease
    • Test coverage increased and dead code removed from the library
    • Custom and auto-generated Python tests now supported
    • CODEOWNERS file for better code reviews and maintenance
    • Codacy integration for automated PR reviews

    Streamlined Library Structure

    • MMLSpark now supports a true Scala/Java idiomatic package hierarchy
    • Namespace hierarchy also reflected in PySpark code
    • Note: This will require changes to existing MMLSpark Programs. For Support in migrating please contact [email protected]

    Maintainability and Community Management

    • Issue and PR templates
    • Gitter channel
    • Welcome bot to greet new contributors
    • Semantic Commits for autogenerating release notes
    • Badges to display current and master versions in the README
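The semantic commit prefixes mentioned above (`feat:`, `fix:`, `build:`, and so on) are what allow commit lists to be sorted into release-note sections automatically. The following is a minimal sketch of that grouping step in plain Python. It is not the project's actual release tooling: the function name `group_commits` and the prefix-to-section mapping are assumptions inferred from the section headings and commit prefixes that appear in these notes.

```python
import re
from collections import OrderedDict

# Assumed mapping from commit prefix to release-note section title.
SECTIONS = OrderedDict([
    ("feat", "Features"),
    ("fix", "Bug Fixes"),
    ("build", "Build"),
    ("refactor", "Code Refactoring"),
    ("style", "Code Style"),
    ("docs", "Documentation"),
    ("chore", "Maintenance"),
])

# Matches "<prefix>: <summary>", tolerating a missing space after the colon.
PREFIX = re.compile(r"^(\w+)\s*:\s*(.+)$")


def group_commits(messages):
    """Group commit summaries by section; unprefixed commits are skipped."""
    grouped = OrderedDict((title, []) for title in SECTIONS.values())
    for message in messages:
        match = PREFIX.match(message.strip())
        if not match:
            continue
        kind, summary = match.group(1).lower(), match.group(2).strip()
        title = SECTIONS.get(kind)
        if title is not None:
            grouped[title].append(summary)
    # Drop sections that received no commits.
    return OrderedDict((t, items) for t, items in grouped.items() if items)


notes = group_commits([
    "chore: bump version number",
    "fix: fix socket timeout error (#640)",
    "build: add sonatype publishing",
    "Vowpal Wabbit on Spark",
    "fix:remove error prone IO from notebook tests",
])
for title, items in notes.items():
    print(title)
    for item in items:
        print("  •", item)
```

Commits without a recognized prefix are left out of the grouped sections in this sketch, which is consistent with unprefixed commits in these notes appearing only under "Changes".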

    Migration Support:

    • For those that already have MMLSpark developer setups please read the new developer guide to reconfigure.
    • For those that have standing PRs that need rebasing assistance please reach out to [email protected]
    • Please report any bugs or feedback!

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

    • Ilya Matiach, Markus Cozowicz, Scott Graham, Daniel Ciborowski, Christina Lee, Dalitso Banda, Shaochen Shi, Sudarshan Raghunathan, Anand Raman, Eli Barzilay, Nick Gonsalves, Tao Wu, Jeremy Reynolds, Miguel Fierro, Robert Alexander, AI CAT Team, Azure Search Team

    Contributions, Collaborations, and Feedback Welcome!


    Changes:

    • 3bb48b8400e92d660355c10c9c6770f5d37f681a chore: bump version number
    • b0797b37929968063a860ff8bc16900732c624a9 docs: Improve cog services on spark docs
    • 8e966b3c098e6a6170221620638479fb7ec561c3 docs: Docs for Cognitive Services (#647)
    • eb0a421c360835b22dfefced8a841d0d39c10db8 docs: Improve VW on Spark Docs
    • 54dbcadb21a5b4bc5147f61803a975436d7126ba docs: add VowpalWabbit documentation
    • fb5b79f460dd3c57a19c6b658cb60ee64db0c949 docs: fix vw on spark description
    • c0d5786aee8d41dda3361a5e5111a88275592327 docs: update readme badges and icons
    • 071b6b0ab0ada8f3c1720949a6f3f84a16c2da87 docs: Add gitter badge
    • 5c343567003af3546e3b62183b901429889edf76 docs: Add VW on Spark to table
    • 1bdcdbfb4314d1e464c566b27806dace14a7bc20 chore: ignore .github folder for CI
    See more
    • 01d498c2f7c18bb57a3ecd2327482fc9696acd46 build: add sonatype publishing
    • 8fab72d2662ed933d5fe551b1394a711b6145797 build: make e2e cancellable
    • ddc7a4f910d391cb7b1b2d500fe37c48f3ecbc87 build: remove broken codecov flags (will reinstate when codecov fixes their service_
    • 188cbdbf5a6d74e00e2351dfe78b994708bb0270 chore: Update issue templates
    • f67b16aba8133cffeda350cd7be37577e64175a8 chore: fix welcome bot indenting
    • eeb7eba1e0b3eda3996ed7a47451d1aa24b2286f fix: Fix logistic regression error when passing "--link logistic" (#644)
    • b6a4f9320697c264bf73b19879ca15c1e59b75f3 fix: fix socket timeout error (#640)
    • 856db6d5619ad30368576b6ee55577d24e91e030 build: add mcr publishing
    • c6e44f95d96d3adc403e21985404e8527cebd6bf fix: fix issue with socket timeout in advanced handler
    • 2425b7adbb7cc5f5a0ae56b19c864ebcc7445dc4 fix: update detect anomaly suite to make anomaly more pronounced
    • 07c7fecf78af53d56f66565dd9b5033019eb71b1 style: run markdown through markdown linter
    • a0e85f5a98ce01c14a3cf3ffca856282a3029822 build: increase setup timeouts
    • 5c190f8eecd158fe32a318325ddd9f8fb94eb15d style: Fix style issues
    • 4bf6f712fa64d43af0efd759813faaae94cf37a5 build: Add build cancel timeouts
    • 915d68334eaeac2ed2fa8022bb5b4b3a3dadb039 build: add release job to Azure Pipelines
    • e48f9cbea3c446888cf2005c129f8ede9cf513db build: Add github version badges
    • 73581cbf19558df899cc909cb7e1aee3d7e5c72e build: fix flaky codecov upload
    • ce1e66d3b17ca035a71dae9148d3adce611e1c37 build: fix e2e notebook cluster check
    • 19aeb8037e3589fb6dbd25fe5840b54b2378ed98 build: Add behavior bot
    • 72ccae226876f57f71cb8ff8e388b34ce05b7031 build: Make task retry part of bash script
    • 16dd7f4eb55d7fa740c83d776599fb94598e361c Update formatting
    • 3fe4db5934552edd34cc9f025faec0c5b2526a64 adding vagrant doc and fixing indentation in vagrantfile
    • d58d6f41909ecafa057a5327374c1825331f66ce Vowpal Wabbit on Spark
    • 95dc73464714793997dffa8050451e1e50cae4dc adding vagrant file back in, updated for sbt (#622)
    • 605c98f914a51661eb868a9d83adeaac3b6e2e37 Add flaky test retry
    • 4ebbb41a08e73f731d556d97cf76a2df52a75b42 remove brittle dataset downloading from demos
    • e572a9aa584616d249652a23f8bc218e3b64ebe6 try to Fix codecov upload
    • fac542e2f6f80e51d8c62b5886b5804cc7481873 Add codecov to python tests
    • b6ba62f4c6ae6d2e9a1d0df7bd9c3bf4e1c4cc52 Add test publishing tobuild
    • 5cada6f78fee649adf2e7c413684b431edc8be23 Increase coverage and remove dead code
    • ae191a6cb777ee7dde9572ff1bdf80e366a29a70 Fix build summary
    • e18ec2e9cdf2af07c40682b5c228fb876001e8d5 leverage codecov.io's coverage capabilities
    • 8e7626332f5da8757a12d2614ffb27b87ff3746f Improve noisy neighbor problems for e2e tests
    • 6ab8916cc236dfc81c2d9b4d912f2903248083b8 add codecov file
    • 70881b2930321019c48b175e38ed9b7998bdf9d4 improve test coverage
    • 41da2b7af2bace4ce0715b50a1db050cd67207e3 improve flakiness
    • aa3c98f22f26ea6f02eebaeea2ffa5a8d8e42cfe improve coverage
    • 237d38821e9dbf23d6d187aa33b0de106066a724 Add Code Coverage badge
    • 7146b9bc2af6da655b2c3061d9cf7edfcfdc517d Add unit test timeout
    • fa87e427996ac270a9763b844d62411c610d48e6 Fix noisy neighbor search index tests
    • 0f98f7df3169e4e648c5d01ecc54173baf8d8f10 add codeowners file
    • 43218097e2b787b4b9009074b20a042e20367292 add codeowners file
    • 80aecab8321423fb20c2d5bbc23362d514180472 Add upload to codecov.io
    • 66db39fbed3e9660b9cdbf90afb065db9ce581d5 Split LGBM tests for speed
    • a6998ec6b0fe068f064ad9600fa204c349b932b0 Update README.md
    • 027e6d72f5473b8d570ca40385aad4019b39d15c Remove unused code
    • 0205b7e692b70433775617e8013f665642df791e Squash with partition fix
    • dc1554f00e0ed2829e65d0414da847ad59094e45 Add r package upload
    • 2fbd81cacfcf5eaf526ca4f9f7332446c88836fe Fix pipeline retry
    • 0fde5941b96e2993576a2453748fdca6bb6cb878 attempt to fix partition consolidator flakiness
    • 7940967acb21c6fc77a05537c6cbdeb9db55da42 Add codecov
    • 7e8225f7e34f7efa5bc44aa0e6731ab087424725 fix retry logic
    • d8c0eb49080193aaa5ca36d0b39c9e65b9a4056e Increase timeout for e2e notebook tests
    • ff059a310ef48aa408d1c01909526880376947d8 Add ability to retry pipeline
    • 8cf91cabb166796726de86e81f64f0734a23c25a Simplify build pipeline
    • 5c8c9032986138964f0d9d0acb6533ce3b8b8004 Delete runme
    • 210b522324e93824bcf6e81897c81eb31d87a9b4 Update CNTK code in README
    • da6e4977c1a1eb93495ec23ca97de18e34e6369a Update pipeline.yaml for Azure Pipelines
    • e94631885c63de61b33dda7229902469e7d6bc12 Add build status bar
    • 37d36af2acf66a46a1c44eec4ae403543061064f Enable PR builds
    • 6c56326c1a5d78460052f51150ccaf70fd3b1f4c transition to new build system
    • fb3e99e53d46ef5536dd2fa765e25b3d7ded07d8 Update dockerfile
    • 637df9d34f508cd1c83542a69e922bc342b1fe0d Update documentation for new build
    • e9ef538cdf75de1e243a21fb4a46e473d5f138a0 Improve test robustness
    • d34f9d173d6f5cb0fbaa93a078bc339c28618549 Remove unused build scripts
    • 4034a4fc9eeef54fac4f3710fdc738a904e026a7 Add doc publishing to build
    • 36d8c3bd53686e94a8a054faf3f2efd161aa85eb Fixup after rebase
    • 7c5e7b676974c21486704e71a3fa793d08f25d1c Get e2e tests working
    • 07316a8c7db982f7f7b9cf9bc6793001c8cf9dbd Fix serialization fuzzing error
    • f6df90771e93a209c4a846c462141de494c379ed Make recomendation tests faster
    • dd99937b6eb3c023d2955a91f58e7133ca4bf248 Add python tests
    • 02a8ac6c46acd0261c5b6bafa8a7ab4a05b14949 Add publish task
    • 3a526c8c6ac0720e15ca22a7e0faeb24cac08bb6 Fix Test Errors and Improve Reliability
    • 4a696c5548be2e505411b39a64af2bc669640a96 Parallelize Tests
    • 2b75b62b8bd50239564ff5d1f50a94b003881bd2 Make build windows compatible
    • 94e9b218a4bc1d6fc9134987d583924f4a83b983 Add developer-readme.md
    • 5659287842bc09710076efe5fc5af2dcc82229a4 Fix python testing
    • 987c7c49b9e10f9c3aa20f47c69fb133067387c9 Get python codegen to work
    • 90089fa36a41260f8366d7ecce0cc24c06081f47 Add scalastyle and unidoc
    • 79d41102fae2dd6e20f4aeafd77bdc9336ad1a24 Add secrets
    • 5742c0e164d54f3b87e2e9007c249d45944f61ec Refactor build
    • 77d7cb4f3c7f0c5eaf46883980754f9149d5d851 Move library into a single package
    • 29c15cb52055d2598f25bd2249a738d0f2261c3d add barrier execution mode
    • aac05361c454e4a4d383ca4f551f3a4051f1b35c fix default value for double array param in codegen
    • 2bd2faf1295c8ffae43c9f528e676ddb2f0909ba fix wrapper generator for ranker models
    • 6885ef5ea42942b6e134a341cd9f6f008e20e156 added lightgbm ranker model pyspark api
    • 08b308585eefeebffb48df5857be1579bc6c5364 fix summarize data columns
    • 044d0b5698fd99d30c874e3328a6b24cbda55acc reduce memory usage, fix frozen jobs, add more debug logging
    • 45c91f98c7ed425beefec23bcd436690e1540dd7 defer lightgbm probability calculation to native core to fix multiclass bug in some scenarios (#578)
    • 44735200184151e180a3188fa315fa15a7fd18fa squish runs together
    • 00ebf64bb34148d1cdc17f6108f31d471ec279c4 use right python version
    • 216abea6317115d4a168cd533c1212ac2063bff3 updated readme. more mini images
    • 3232d848d8de65a23a77908213ee9667f2c3a7a5 Fix flakey test
    • e9a612bb803a346e8b3d3cbfdd18cc8f36653d39 Fix Entity Detector Suite
    • ba3dbd0ea6eb654beb130bc79b9527ac62c2ef0e Improve service concurrency
    • 75819a51fe88a16126e71bcb8f3376a8d8c4837e Add simple Anamoly Detector
    • 17a765e6747dca6ab0f28cce047c7068bd3c31f2 Add is_provide_training_metric to LightGBMRanker.
    • ceb52918c125ad844cf27fb812f30e9bcb5077ac Print metrics of validation data as well.
    • b54363c9f78308505a25d0826c989326312b2c9a Implement is_provide_training_metric in Scala codes through JNI.
    • c7e31e61fb93f198128a5777a5c786cdb9d8458f fix query column to support long type
    • 6a6d57f40ecd25a23efae29b2d18671647dbdb3f Poke Build System
    • 11fe799a3e6142c0788ec5a314d83e2c4f8cb1ee Fixing Cog Service Test
    • 6eba0b6f4d612a35e4464bd955859efdf45eb803 ignore flaky test
    • 53c4b9e0fd917b91cd7fb195ebe44822cdd212ee adding LightGBMRanker
    • fa7785734a54c5e45c98c66196846be3e4682dbf add init score column for continued training
    • 32ac35348312e57599c9275fcdba800765efc638 Add anomaly detection and speech to text services
    • 06273b252d753be61c353a15a2a20455c92e3af2 improved compute model statistics error message
    • e7a309c3d9ea0462cfd055e2d794cae7dfbe5fca pass through slot names to native structure
    • b295dae1a53c7fe127a498e974554f854b316075 add batch training support in lightgbm classifier and regressor

    This list of changes was auto generated.

  • mmlspark-v0.17(Jul 18, 2022)

    Highlights

    • LightGBM evaluation 3-4x faster!
    • Spark Serving v2
    • LightGBM training supports early stopping and regularization
    • LIME on Spark significantly faster

    New Features

    Spark Serving v2:

    • Both Microbatch and Continuous mode have sub-millisecond latency
    • Supports fault tolerance
    • Can reply from anywhere in the pipeline
    • Fail fast modes for warning callers of bad JSON parsing
    • Fully based on DataSource API v2

    LightGBM:

    • 3-4x evaluation performance improvement
    • Add early stopping capabilities
    • Added L1 and L2 Regularization parameters
    • Made network init more robust
    • Fixed bug caused by empty partitions
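
The early stopping item above can be illustrated independently of LightGBM. The following is a minimal Python sketch of a generic patience-based rule, not the LightGBM or MMLSpark implementation; the function name and the `patience` parameter are illustrative assumptions.

```python
def best_iteration_with_early_stopping(validation_losses, patience):
    """Return (best_iteration, best_loss) for a sequence of validation losses.

    Scanning stops once `patience` consecutive iterations have passed
    without improving on the best loss seen so far.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")
    best_loss = float("inf")
    best_iteration = -1
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            # New best: remember it and keep training.
            best_loss, best_iteration = loss, iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` iterations: stop early.
            break
    return best_iteration, best_loss
```

With losses `[0.9, 0.7, 0.6, 0.65, 0.66, 0.5]` and `patience=2`, the scan stops before the final value is seen, so the later improvement to 0.5 is never considered. That is the usual trade-off of a small patience.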

    LIME on Spark:

    • LIME Parallelization significantly faster for large datasets
    • Tabular Lime now supported

    Other:

    • Added UnicodeNormalizer for working with complex text
    • Recognize Text exposes parameters for its polling handlers
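
As background for the UnicodeNormalizer item above: Unicode allows the same visible text to be encoded in more than one way, and normalization maps these variants to a single form. The sketch below uses only Python's standard `unicodedata` module; the helper name and its `lower` option are assumptions for illustration and do not describe the MMLSpark API.

```python
import unicodedata


def normalize_text(text, form="NFKD", lower=True):
    """Normalize `text` to the given Unicode form, optionally lowercasing it."""
    normalized = unicodedata.normalize(form, text)
    return normalized.lower() if lower else normalized


# "é" as one code point versus "e" followed by a combining acute accent.
composed = "Caf\u00e9"
decomposed = "Cafe\u0301"
same_after_nfc = normalize_text(composed, "NFC") == normalize_text(decomposed, "NFC")
```

The two input strings compare unequal before normalization and equal afterwards, which is why normalizing before tokenizing or joining on text columns is often useful.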

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

    • Ilya Matiach, Markus Cozowicz, Scott Graham, Daniel Ciborowski, Jeremy Reynolds, Miguel Fierro, Robert Alexander, Tao Wu, Sudarshan Raghunathan, Anand Raman, Casey Hong, Karthik Rajendran, Dalitso Banda, Manon Knoertzer, Lars Ahlfors, The Microsoft AI Development Acceleration Program, Cognitive Search Team, Azure Search Team
  • mmlspark-v0.16(Jul 18, 2022)

    New Features

    New Examples

    Updates and Improvements

    General

    • MMLSpark Image Schema now unified with Spark Core
    • Bugfixes for Text Analytics services
    • PageSplitter now propagates nulls
    • HTTP on Spark now supports socket and read timeouts
    • HyperparamBuilder python wrappers now return idiomatic python objects

    LightGBM on Spark

    • Added multiclass classification
    • Added multiple types of boosting (Gradient Boosting Decision Tree, Random Forest, Dropout meet Multiple Additive Regression Trees, Gradient-based One-Side Sampling)
    • Added Windows OS support/bugfix
    • LightGBM version bumped to 2.2.200
    • Added native support for categorical columns, either through Spark's StringIndexer, MMLSpark's ValueIndexer or list of indexes/slot names parameter
    • isUnbalance parameter for unbalanced datasets
    • Added boost from average parameter

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

    • Ilya Matiach, Casey Hong, Daniel Ciborowski, Karthik Rajendran, Dalitso Banda, Manon Knoertzer, Sudarshan Raghunathan, Anand Raman, Markus Cozowicz, The Microsoft AI Development Acceleration Program, Cognitive Search Team, Azure Search Team
  • mmlspark-v0.15(Jul 18, 2022)

    New Features

    • Add the TagImage and DescribeImage services
    • Add Ranking Cross Validator and Evaluator

    New Examples

    Updates and Improvements

    LightGBM

    • Fix issue with raw2probabilityInPlace
    • Add weight column
    • Add getModel API to TrainClassifier and TrainRegressor
    • Improve robustness of getting executor cores

    HTTP on Spark and Spark Serving

    • Improve robustness of Gateway creation and management
    • Improve Gateway documentation

    Version Bumps

    • Updated to Spark 2.4.0
    • LightGBM version update to 2.1.250

    Misc

    • Fix Flaky Tests
    • Remove autogeneration of scalastyle
    • Increase training dataset size in snow leopard example

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.

    • Ilya Matiach, Casey Hong, Karthik Rajendran, Daniel Ciborowski, Sebastien Thomas, Eli Barzilay, Sudarshan Raghunathan, @flybywind, @wentongxin, @haal
  • mmlspark-v0.14(Jul 18, 2022)

    New Features

    • The Cognitive Services on Spark: A simple and scalable integration between the Microsoft Cognitive Services and SparkML
      • Bing Image Search
      • Computer Vision: OCR, Recognize Text, Recognize Domain Specific Content, Analyze Image, Generate Thumbnails
      • Text Analytics: Language Detector, Entity Detector, Key Phrase Extractor, Sentiment Detector, Named Entity Recognition
      • Face: Detect, Find Similar, Identify, Group, Verify
    • Added distributed model interpretability with LIME on Spark
    • 100x lower latencies (<1ms) with Spark Serving
    • Expanded Spark Serving to cover the full HTTP protocol
    • Added the SuperpixelTransformer for segmenting images
    • Added a Fluent API, mlTransform and mlFit, for composing pipelines more elegantly
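
The fluent API item above is about composing pipeline stages left to right. The following plain-Python sketch shows only that chaining idea: `ml_transform` is a hypothetical helper, and the stages are ordinary functions standing in for Spark transformers, so this is not the MMLSpark API itself.

```python
from functools import reduce


def ml_transform(data, *stages):
    """Apply each stage to the output of the previous one, left to right."""
    return reduce(lambda current, stage: stage(current), stages, data)


def drop_empty(rows):
    """Stand-in stage: remove empty strings."""
    return [row for row in rows if row]


def lowercase(rows):
    """Stand-in stage: lowercase every string."""
    return [row.lower() for row in rows]


result = ml_transform(["Spark", "", "ML"], drop_empty, lowercase)
```

Passing the stages as arguments keeps the call site readable as a single expression, which is the main appeal of a fluent style over building an explicit pipeline object.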

    New Examples

    • Chain together cognitive services to understand the feelings of your favorite celebrities with CognitiveServices - Celebrity Quote Analysis.ipynb
    • Explore how you can use Bing Image Search and Distributed Model Interpretability to get an Object Detection system without labeling any data in ModelInterpretation - Snow Leopard Detection.ipynb
    • See how to deploy any Spark computation as a web service on any Spark platform with the SparkServing - Deploying a Classifier.ipynb notebook

    Updates and Improvements

    LightGBM

    • More APIs for loading LightGBM Native Models
    • LightGBM training checkpointing and continuation
    • Added tweedie variance power to LightGBM
    • Added early stopping to LightGBM
    • Added feature importances to LightGBM
    • Added a PMML exporter for LightGBM on Spark

    HTTP on Spark

    • Added the VectorizableParam for creating column parameterizable inputs
    • Added handler parameter to HTTP services
    • HTTP on Spark now propagates nulls robustly

    Version Bumps

    • Updated to Spark 2.3.1
    • LightGBM version update to 2.1.250

    Misc

    • Added Vagrantfile for easy Windows developer setup
    • Improved Image Reader fault tolerance
    • Reorganized Examples into Topics
    • Generalized Image Featurizer and other Image based code to handle Binary Files as well as Spark Images
    • Added ModelDownloader R wrapper
    • Added getBestModel and getBestModelInfo to TuneHyperparameters
    • Expanded Binary File Reading APIs
    • Added Explode and Lambda transformers
    • Added SparkBindings trait for automating Spark binding creation
    • Added retries and timeouts to ModelDownloader
    • Added ResizeImageTransformer to remove ImageFeaturizer dependence on OpenCV
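
The ModelDownloader item above mentions retries. As a rough illustration of the general pattern, here is a minimal Python sketch of retrying an operation with exponential backoff. The function name, parameters, and backoff schedule are assumptions chosen for the example; the actual ModelDownloader behavior is not documented here, and timeouts are not covered.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay_seconds=0.5,
                      sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The delay doubles after each failed attempt. The last failure is
    re-raised unchanged so callers see the original error.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay_seconds * 2 ** (attempt - 1))
```

The `sleep` argument is injectable so that the backoff schedule can be checked in tests without actually waiting.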

    Acknowledgements

    We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark. (In alphabetical order)

    • Abhiram Eswaran, Anand Raman, Ari Green, Arvind Krishnaa Jagannathan, Ben Brodsky, Casey Hong, Courtney Cochrane, Henrik Frystyk Nielsen, Ilya Matiach, Janhavi Suresh Mahajan, Jaya Susan Mathew, Karthik Rajendran, Mario Inchiosa, Minsoo Thigpen, Soundar Srinivasan, Sudarshan Raghunathan, @terrytangyuan
  • mmlspark-v0.13(Jul 18, 2022)

    New Functionality:

    • Export trained LightGBM models for evaluation outside of Spark

    • LightGBM on Spark supports multiple cores per executor

    • CNTKModel works with multi-input multi-output models of any CNTK datatype

    • Added Minibatching and Flattening transformers for adding flexible batching logic to pipelines, deep networks, and web clients.

    • Added Benchmark test API for tracking model performance across versions

    • Added PartitionConsolidator function for aggregating streaming data onto one partition per executor (for use with connection/rate-limited HTTP services)
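
The Minibatching and Flattening transformers listed above are described as inverse batching steps. The sketch below shows that round trip on plain Python lists with a fixed batch size. The function names are illustrative assumptions, and it does not reproduce the Spark transformers or their other batching strategies.

```python
def minibatch(rows, batch_size):
    """Group `rows` into consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def flatten(batches):
    """Undo `minibatch`: concatenate the batches back into one list of rows."""
    return [row for batch in batches for row in batch]


batches = minibatch([0, 1, 2, 3, 4], batch_size=2)
restored = flatten(batches)
```

Batching lets a downstream step, such as a deep network or a web client, process several rows per call, while flattening restores the one-row-per-record shape afterwards.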

    Updates and Improvements:

    • Updated to Spark 2.3.0

    • Added Databricks notebook tests to build system

    • CNTKModel uses significantly less memory

    • Simplified example notebooks

    • Simplified APIs for MMLSpark Serving

    • Simplified APIs for CNTK on Spark

    • LightGBM stability improvements

    • ComputeModelStatistics stability improvements

    Acknowledgements:

    We would like to acknowledge the external contributors who helped create this version of MMLSpark (in order of commit history):

    • 严伟, @terrytangyuan, @ywskycn, @dvanasseldonk, Jilong Liao, @chappers, @ekaterina-sereda-rf
  • mmlspark-v0.11(Jul 18, 2022)

    New functionality:

    • TuneHyperparameters: parallel distributed randomized grid search for SparkML and TrainClassifier/TrainRegressor parameters. Sample notebook and python wrappers will be added in the near future.

    • Added PowerBIWriter for writing and streaming data frames to PowerBI.

    • Expanded image reading and writing capabilities, including using images with Spark Structured Streaming. Images can be read from and written to paths specified in a dataframe.

    • New functionality for convenient plotting in Python.

    • UDF transformer and additional UDFs.

    • Expanded pipeline support for arbitrary user code and libraries such as NLTK through UDFTransformer.

    • Refactored fuzzing system and added test coverage.

    • GPU training supports multiple VMs.
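
TuneHyperparameters, listed above, is described as a randomized search over parameter values. The following is a small sequential Python sketch of randomized search in general. It is not parallel or distributed, and the function name, the `param_space` layout, and the scoring function are all assumptions made for the example.

```python
import random


def random_search(param_space, evaluate, num_trials, seed=0):
    """Sample `num_trials` parameter combinations and keep the best one.

    `param_space` maps each parameter name to a list of candidate values;
    `evaluate` returns a score where higher is better.
    """
    if num_trials < 1:
        raise ValueError("num_trials must be at least 1")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    best_params, best_score = None, float("-inf")
    for _ in range(num_trials):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy objective: prefer depth 4 and learning rate 0.1.
space = {"depth": [2, 4, 8], "learning_rate": [0.01, 0.1]}


def toy_score(params):
    return -abs(params["depth"] - 4) - abs(params["learning_rate"] - 0.1)


best_params, best_score = random_search(space, toy_score, num_trials=20)
```

Because each trial is independent, the loop body is the part a distributed implementation would run in parallel.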

    Updates:

    • Updated to Conda 4.3.31, which comes with Python 3.6.3.

    • Also updated SBT and JVM.

    Improvements:

    • Additional bugfixes, stability, and notebook improvements.
  • mmlspark-v0.10(Jul 18, 2022)

    New functionality:

    • We now provide initial support for training on a GPU VM, and an ARM template to deploy an HDI Cluster with an associated GPU machine. See docs/gpu-setup.md for instructions on setting this up.

    • New auto-generated R wrappers for estimators and transformers. To import them into R, you can use devtools to import from the uploaded zip file. Tests and sample notebooks to come.

    • A new RenameColumn transformer for renaming columns within a pipeline.

    New notebooks:

    • Notebook 104: An experiment to demonstrate regression models to predict automobile prices. This notebook demonstrates the use of Pipeline stages, CleanMissingData, and ComputePerInstanceStatistics.

    • Notebook 105: Demonstrates DataConversion to make some columns Categorical.

    • There is a 401 notebook in notebooks/gpu which demonstrates CNTK training when using a GPU VM. (It is not shown with the rest of the notebooks yet.)

    Updates:

    • Updated to use CNTK 2.2. Note that this version of CNTK depends on libpng12 and libjasper1 -- which are included in our docker images. (This should get resolved in the upcoming CNTK 2.3 release.)

    Improvements:

    • Local builds will always use a "0.0" version instead of a version based on the git repository. This should simplify the build process for developers and avoid hard-to-resolve update issues.

    • The TextPreprocessor transformer can be used to find and replace all key value pairs in an input map.

    • Fixed a regression in the image reader where zip files with images no longer displayed the full path to the image inside a zip file.

    • Additional minor bug and stability fixes.

  • mmlspark-v0.9(Jul 18, 2022)

    New functionality:

    • Refactor ImageReader and BinaryFileReader to support streaming images, including a Python API. Also improved performance of the readers. Check the 302 notebook for usage example.

    • Add ClassBalancer estimator for improving classification performance on highly imbalanced datasets.

    • Create an infrastructure for automated fuzzing, serialization, and python wrapper tests.

    • Added a DropColumns pipeline stage.
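
The ClassBalancer item above targets highly imbalanced datasets. One common way to compensate for imbalance is to weight each row inversely to its class frequency. The sketch below assumes a "largest class count divided by class count" scheme, which is an illustrative choice and not a statement of how ClassBalancer computes its weights.

```python
from collections import Counter


def class_weights(labels):
    """Map each label to (count of the most frequent label) / (its own count)."""
    counts = Counter(labels)
    if not counts:
        return {}
    largest = max(counts.values())
    return {label: largest / count for label, count in counts.items()}


labels = ["negative"] * 8 + ["positive"] * 2
weights = class_weights(labels)
```

Here the majority class keeps weight 1.0 and the minority class gets weight 4.0, so both classes contribute the same total weight during training.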

    New notebooks:

    • 305: A Flowers sample notebook demonstrating deep transfer learning with ImageFeaturizer.

    Updates:

    • Our main build is now based on Spark 2.2.

    Improvements:

    • Enable streaming through the EnsembleByKey transformer.

    • Fixes for ImageReader, an HDFS issue, and other bugs.

  • mmlspark-v0.8(Jul 18, 2022)

    New functionality:

    • We are now uploading MMLSpark as an Azure/mmlspark Spark package. Use --packages Azure:mmlspark:0.8 with the Spark command-line tools.

    • Add a bi-directional LSTM medical entity extractor to the ModelDownloader, and a new Jupyter notebook for medical entity extraction using NLTK, PubMed word embeddings, and the Bi-LSTM.

    • Add ImageSetAugmenter for easy dataset augmentation within image processing pipelines.

    Improvements:

    • Optimize the performance of CNTKModel. It now broadcasts a loaded model to workers and shares model weights between partitions on the same worker. Minibatch padding (an internal workaround of a CNTK bug) is now no longer used, eliminating excess computations when there is a mismatch between the partition size and minibatch size.

    • Bugfix: CNTKModel can work with models with unnamed outputs.

    Docker image improvements:

    • Environment variables are now part of the docker image (in addition to being set in bash).

    • New docker images:

      • microsoft/mmlspark:latest: plain image, as always,
      • microsoft/mmlspark:gpu: GPU variant based on an nvidia/cuda image.
      • microsoft/mmlspark:plus and microsoft/mmlspark:plus-gpu: these images contain additional packages for internal use; they will probably be based on an older Conda version too in future releases.

    Updates:

    • The Conda environment now includes NLTK.

    • Updated Java and SBT versions.

  • mmlspark-v0.7(Jul 18, 2022)

    New functionality:

    • New transforms: EnsembleByKey, Cacher, Timer; see the documentation.

    Updates:

    • Miniconda version 4.3.21, including Python 3.6.

    • CNTK version 2.1, using Maven Central.

    • Use OpenCV from the OpenPnP project from Maven Central.

    Improvements:

    • We now work around a regression in Spark's binaryFiles function (present in version 2.1 but not in 2.0) that led to performance issues. Data frame operations after a use of BinaryFileReader (e.g., reading images) are significantly faster as a result.

    • The Spark installation is now patched with hadoop-azure and azure-storage.

    • Includes additional bug fixes and improvements.

  • mmlspark-v0.6(Jul 18, 2022)

    New functionality:

    • Similar to Spark's StringIndexer, we have a ValueIndexer that can be used for indexing any type of values instead of only strings. Not only can it index these values, we also provide a reverse mapping via IndexToValue, similar to Spark's IndexToString transform.

    • A new "clean missing" data estimator, example:

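      // `dataset` and `someCustomValue` are placeholders defined elsewhere.
      // Replace missing values in "some-column" with a caller-supplied value.
      val cmd = new CleanMissingData()
        .setInputCols(Array("some-column"))
        .setOutputCols(Array("some-column"))
        .setCleaningMode(CleanMissingData.customOpt)
        .setCustomValue(someCustomValue)
      // Fit the estimator, then apply the resulting model to the data.
      val cmdModel = cmd.fit(dataset)
      val result = cmdModel.transform(dataset)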
      
    • New default featurization for date and timestamp Spark types and our internal image type. Date columns are converted to double features: year, day of week, month, and day of month. Timestamp columns get the same features plus hour of day, minute of hour, and second of minute. Image columns use the image data converted to double, along with width and height information.

    • Previously, starting the Docker image without an ACCEPT_EULA variable setting would throw an error. We now start a tiny web server that shows the EULA and replaces itself with the Jupyter interface when you click the AGREE button.
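
The default date and timestamp featurization described in the list above can be sketched in a few lines of Python. This mirrors only the feature lists given in the release note. The function names and the day-of-week numbering (ISO, Monday = 1) are assumptions, and the actual MMLSpark numbering may differ.

```python
from datetime import date, datetime


def featurize_date(value):
    """Return [year, day of week, month, day of month] as doubles."""
    return [
        float(value.year),
        float(value.isoweekday()),  # assumption: ISO numbering, Monday = 1
        float(value.month),
        float(value.day),
    ]


def featurize_timestamp(value):
    """Date features plus [hour of day, minute of hour, second of minute]."""
    return featurize_date(value) + [
        float(value.hour),
        float(value.minute),
        float(value.second),
    ]


date_features = featurize_date(date(2017, 10, 4))
timestamp_features = featurize_timestamp(datetime(2017, 10, 4, 13, 45, 30))
```

Splitting a date into separate numeric components lets tree-based and linear models pick up seasonal or weekly patterns that a single epoch value would hide.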

    Breaking changes:

    • Renamed ImageTransform to ImageTransformer.

    Notable bug fixes and other changes:

    • Improved sample notebooks, and a new one: "303 - Transfer Learning by DNN Featurization - Airplane or Automobile".

    • Fix serialization bugs in generated python PipelineStages.

    Acknowledgments

    Thanks to Ali Zaidi for some notebook beautifications.

  • mmlspark-v0.5(Jul 18, 2022)

Owner
Microsoft
Open source projects and samples from Microsoft