Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray

Overview


A unified Data Analytics and AI platform for distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray


What is Analytics Zoo?

Analytics Zoo seamlessly scales TensorFlow, Keras and PyTorch to distributed big data (using Spark, Flink & Ray).


  • End-to-end pipeline for applying AI models (TensorFlow, PyTorch, OpenVINO, etc.) to distributed big data

    • Write TensorFlow or PyTorch inline with Spark code for distributed training and inference.
    • Native deep learning (TensorFlow/Keras/PyTorch/BigDL) support in Spark ML Pipelines.
    • Directly run Ray programs on a big data cluster through RayOnSpark.
    • Plain Java/Python APIs for (TensorFlow/PyTorch/BigDL/OpenVINO) Model Inference.
  • High-level ML workflow for automating machine learning tasks

    • Cluster Serving for automatically distributed (TensorFlow/PyTorch/Caffe/OpenVINO) model inference.
    • Scalable AutoML for time series prediction.
  • Built-in models for Recommendation, Time Series, Computer Vision and NLP applications.


Why use Analytics Zoo?

You may want to develop your AI solutions using Analytics Zoo if:

  • You want to easily apply AI models (e.g., TensorFlow, Keras, PyTorch, BigDL, OpenVINO, etc.) to distributed big data.
  • You want to transparently scale your AI applications from a single laptop to large clusters with "zero" code changes.
  • You want to deploy your AI pipelines to existing YARN or K8S clusters WITHOUT any modifications to the clusters.
  • You want to automate the process of applying machine learning (such as feature engineering, hyperparameter tuning, model selection, distributed inference, etc.).

How to use Analytics Zoo?

Comments
  • Bump tensorflow from 1.15.2 to 2.4.0 in /readthedocs

    Bumps tensorflow from 1.15.2 to 2.4.0.

    Release notes

    Sourced from tensorflow's releases.

    TensorFlow 2.4.0

    Release 2.4.0

    Major Features and Improvements

    • tf.distribute introduces experimental support for asynchronous training of models via the tf.distribute.experimental.ParameterServerStrategy API. Please see the tutorial to learn more.

    • MultiWorkerMirroredStrategy is now a stable API and is no longer considered experimental. Some of the major improvements involve handling peer failure and many bug fixes. Please check out the detailed tutorial on Multi-worker training with Keras.

    • Introduces experimental support for a new module named tf.experimental.numpy which is a NumPy-compatible API for writing TF programs. See the detailed guide to learn more. Additional details below.

    • Adds Support for TensorFloat-32 on Ampere based GPUs. TensorFloat-32, or TF32 for short, is a math mode for NVIDIA Ampere based GPUs and is enabled by default.

    • A major refactoring of the internals of the Keras Functional API has been completed, that should improve the reliability, stability, and performance of constructing Functional models.

    • Keras mixed precision API tf.keras.mixed_precision is no longer experimental and allows the use of 16-bit floating point formats during training, improving performance by up to 3x on GPUs and 60% on TPUs. Please see below for additional details.

    • TensorFlow Profiler now supports profiling MultiWorkerMirroredStrategy and tracing multiple workers using the sampling mode API.

    • TFLite Profiler for Android is available. See the detailed guide to learn more.

    • TensorFlow pip packages are now built with CUDA11 and cuDNN 8.0.2.

    Breaking Changes

    • TF Core:

      • Certain float32 ops run in lower precision on Ampere based GPUs, including matmuls and convolutions, due to the use of TensorFloat-32. Specifically, inputs to such ops are rounded from 23 bits of precision to 10 bits of precision. This is unlikely to cause issues in practice for deep learning models. In some cases, TensorFloat-32 is also used for complex64 ops. TensorFloat-32 can be disabled by running tf.config.experimental.enable_tensor_float_32_execution(False).
      • The byte layout for string tensors across the C-API has been updated to match TF Core/C++; i.e., a contiguous array of tensorflow::tstring/TF_TStrings.
      • C-API functions TF_StringDecode, TF_StringEncode, and TF_StringEncodedSize are no longer relevant and have been removed; see core/platform/ctstring.h for string access/modification in C.
      • tensorflow.python, tensorflow.core and tensorflow.compiler modules are now hidden. These modules are not part of TensorFlow public API.
      • tf.raw_ops.Max and tf.raw_ops.Min no longer accept inputs of type tf.complex64 or tf.complex128, because the behavior of these ops is not well defined for complex types.
      • XLA:CPU and XLA:GPU devices are no longer registered by default. Use TF_XLA_FLAGS=--tf_xla_enable_xla_devices if you really need them, but this flag will eventually be removed in subsequent releases.
    • tf.keras:

      • The steps_per_execution argument in model.compile() is no longer experimental; if you were passing experimental_steps_per_execution, rename it to steps_per_execution in your code. This argument controls the number of batches to run during each tf.function call when calling model.fit(). Running multiple batches inside a single tf.function call can greatly improve performance on TPUs or small models with a large Python overhead.
      • A major refactoring of the internals of the Keras Functional API may affect code that is relying on certain internal details:
        • Code that uses isinstance(x, tf.Tensor) instead of tf.is_tensor when checking Keras symbolic inputs/outputs should switch to using tf.is_tensor.
        • Code that is overly dependent on the exact names attached to symbolic tensors (e.g. assumes there will be ":0" at the end of the inputs, treats names as unique identifiers instead of using tensor.ref(), etc.) may break.
        • Code that uses full path for get_concrete_function to trace Keras symbolic inputs directly should switch to building matching tf.TensorSpecs directly and tracing the TensorSpec objects.
        • Code that relies on the exact number and names of the op layers that TensorFlow operations were converted into may have changed.
        • Code that uses tf.map_fn/tf.cond/tf.while_loop/control flow as op layers and happens to work before TF 2.4. These will explicitly be unsupported now. Converting these ops to Functional API op layers was unreliable before TF 2.4, and prone to erroring incomprehensibly or being silently buggy.
        • Code that directly asserts on a Keras symbolic value in cases where ops like tf.rank used to return a static or symbolic value depending on if the input had a fully static shape or not. Now these ops always return symbolic values.
        • Code already susceptible to leaking tensors outside of graphs becomes slightly more likely to do so now.
        • Code that tries directly getting gradients with respect to symbolic Keras inputs/outputs. Use GradientTape on the actual Tensors passed to the already-constructed model instead.
        • Code that requires very tricky shape manipulation via converted op layers in order to work, where the Keras symbolic shape inference proves insufficient.
        • Code that tries manually walking a tf.keras.Model layer by layer and assumes layers only ever have one positional argument. This assumption doesn't hold true before TF 2.4 either, but is more likely to cause issues now.

    ... (truncated)
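
To make the TensorFloat-32 note above concrete: float32 stores 23 explicit mantissa bits, and TF32 keeps only 10 of them. The sketch below emulates that reduction in pure Python on a single value. It is an illustration only: round_to_tf32 is a hypothetical helper, round-to-nearest-even is an assumed rounding mode, and on real hardware the rounding happens inside the matmul/convolution units rather than in user code.

```python
import struct

def round_to_tf32(x: float) -> float:
    """Emulate TF32 input rounding: keep 10 of float32's 23 mantissa bits."""
    # Reinterpret the value as a 32-bit float bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - 10  # the 13 lowest mantissa bits are discarded
    # Round to nearest, ties to even (assumed rounding mode).
    lsb = (bits >> drop) & 1
    bits = (bits + (1 << (drop - 1)) - 1 + lsb) & 0xFFFFFFFF
    # Clear the discarded bits. NaN payloads are not handled specially.
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

if __name__ == "__main__":
    for value in (1.0, 0.1, 3.14159265):
        print(value, "->", round_to_tf32(value))
```

Values that already fit in 10 mantissa bits, such as 1.0 or 1 + 2**-10, pass through unchanged; everything else moves by a relative amount of at most about 2**-11.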

    python dependencies 
    opened by dependabot[bot] 83
  • Running spark standalone mode in AZ with random master node yields "/bin/sh: 1: ray: not found"

    Start to launch ray on cluster
    Traceback (most recent call last):
      File "fashion_mnist.py", line 175, in <module>
        main()
      File "fashion_mnist.py", line 161, in main
        backend="torch_distributed")
      File "/home/yifan/anaconda3/envs/zoo/lib/python3.7/site-packages/zoo/orca/learn/pytorch/estimator.py", line 92, in from_torch
        backend=backend)
      File "/home/yifan/anaconda3/envs/zoo/lib/python3.7/site-packages/zoo/orca/learn/pytorch/estimator.py", line 137, in init
        workers_per_node=workers_per_node)
      File "/home/yifan/anaconda3/envs/zoo/lib/python3.7/site-packages/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 107, in init
        ray_ctx = RayContext.get()
      File "/home/yifan/anaconda3/envs/zoo/lib/python3.7/site-packages/zoo/ray/raycontext.py", line 390, in get
        ray_ctx.init()
      File "/home/yifan/anaconda3/envs/zoo/lib/python3.7/site-packages/zoo/ray/raycontext.py", line 473, in init
        redis_address = self._start_cluster()
      File "/home/yifan/anaconda3/envs/zoo/lib/python3.7/site-packages/zoo/ray/raycontext.py", line 502, in _start_cluster
        verbose=self.verbose)
      File "/home/yifan/anaconda3/envs/zoo/lib/python3.7/site-packages/zoo/ray/process.py", line 113, in init
        self.print_ray_remote_err_out()
      File "/home/yifan/anaconda3/envs/zoo/lib/python3.7/site-packages/zoo/ray/process.py", line 117, in print_ray_remote_err_out
        raise Exception(str(self.master))
    Exception: node_ip: 192.168.65.130 tag: ray-master, pgid: 3180, pids: [], returncode: 127, master_addr: 192.168.65.130:32350,
    /bin/sh: 1: ray: not found
    Stopping orca context
    stopping org.apache.spark.deploy.worker.Worker
    stopping org.apache.spark.deploy.master.Master
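
Return code 127 is the shell's conventional "command not found" status, so the log above says that the shell which tried to start the Ray master could not find a ray executable on its PATH. A minimal pre-flight check, using only the standard library, might look like the sketch below. locate_command is a hypothetical helper, and it inspects only the PATH of the current Python process, which may differ from the environment that Spark workers use to launch commands.

```python
import os
import shutil
import sys

def locate_command(name: str):
    """Return the absolute path of `name` on this process's PATH, or None."""
    return shutil.which(name)

if __name__ == "__main__":
    path = locate_command("ray")
    if path is None:
        # Same condition that /bin/sh reports with exit status 127.
        print("ray is not on PATH; PATH =", os.environ.get("PATH", ""))
        print("interpreter in use:", sys.executable)
    else:
        print("ray found at", path)
```

If the check fails only on some nodes, one thing worth comparing is whether the conda environment that contains ray is the one active for the worker processes.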

    user issue 
    opened by Yifanzhou-0713 31
  • Error running the latest version of 'NYC_taxi_dataset.ipynb'

    When I ran the latest version of 'NYC_taxi_dataset.ipynb', the following error occurred:

    from zoo.automl.common.util import train_val_test_split
    train_df, val_df, test_df = train_val_test_split(df, val_ratio=0.1, test_ratio=0.1)

    Prepending /home/wxy/anaconda3/envs/ZooAutoml/lib/python3.6/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
    Adding /home/wxy/anaconda3/envs/ZooAutoml/lib/python3.6/site-packages/zoo/share/lib/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.8.1-jar-with-dependencies.jar to BIGDL_JARS
    Prepending /home/wxy/anaconda3/envs/ZooAutoml/lib/python3.6/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path

    ImportError                               Traceback (most recent call last)
    ----> 1 from zoo.automl.common.util import train_val_test_split
          2 train_df, val_df, test_df = train_val_test_split(df, val_ratio=0.1, test_ratio=0.1)

    ImportError: cannot import name 'train_val_test_split'
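
"cannot import name" means the module itself was found but does not define that name, which often points to a mismatch between the notebook and the installed package version (the jar name in the log above suggests an 0.8.1 install; whether that is the cause here is an assumption). A small, hedged way to probe what an installed module actually exposes is sketched below; module_exports is a hypothetical helper, not part of Analytics Zoo.

```python
import importlib

def module_exports(module_name: str, attr: str) -> bool:
    """True if `module_name` imports cleanly and defines `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (or one of its own dependencies) is not importable.
        return False
    return hasattr(module, attr)

if __name__ == "__main__":
    # Module and function names are the ones from the failing notebook cell.
    print(module_exports("zoo.automl.common.util", "train_val_test_split"))
```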

    opened by 2017wxyzwxyz 19
  • [BigDL 2.0] examples on k8s integration tests client mode on new image

    | Module | Example | Client Mode |
    | ---- | ---- | ---- |
    | nnframes | ImageInferenceExample.py | Succeed |
    | nnframes | ImageTransferLearningExample.py | Succeed |
    | pytorch | learn/pytorch/cifar10/cifar10.py | Succeed |
    | pytorch | learn/pytorch/fashion_mnist/fashion_mnist.py | Succeed |
    | pytorch | learn/pytorch/super_resolution/super_resolution.py | Succeed |
    | tf | learn/tf/basic_text_classification/basic_text_classification.py | Succeed |
    | tf | learn/tf/transfer_learning/transfer_learning.py | Succeed |
    | tf | learn/tf/inception/inception.py | Succeed |
    | tf | learn/tf/image_segmentation/image_segmentation.py | Succeed |
    | tf2 | learn/tf2/yolov3/yoloV3.py | Succeed |
    | torchmodel | torchmodel/train/imagenet/main.py | Succeed |
    | torchmodel | torchmodel/train/mnist/main.py | Succeed |
    | torchmodel | torchmodel/train/resnet_finetune/resnet_finetune.py | Succeed |

    opened by piaolaidelangman 17
  • Cannot successfully pass test code for orca image

    I set up the development environment following the "Run in IDE" section of the official developer guide, but I get an error when I run test_write_parquet.py::test_write_mnist in PyCharm.

    /root/anaconda3/envs/zoo-dev/bin/python /root/app/pycharm-community-2020.3/plugins/python-ce/helpers/pycharm/_jb_pytest_runner.py --target test_write_parquet.py::test_write_mnist
    Testing started at 9:29 AM ...
    Launching pytest with arguments test_write_parquet.py::test_write_mnist in /root/zoo-project/analytics-zoo/pyzoo/test/zoo/orca/data
    
    ============================= test session starts ==============================
    platform linux -- Python 3.6.12, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /root/anaconda3/envs/zoo-dev/bin/python
    cachedir: .pytest_cache
    rootdir: /root/zoo-project/analytics-zoo/pyzoo
    collecting ... collected 1 item
    
    test_write_parquet.py::test_write_mnist ERROR                            [100%]
    Initializing orca context
    Current pyspark location is : /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/pyspark/__init__.py
    Start to getOrCreate SparkContext
    pyspark_submit_args is:  --driver-class-path /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/bigdl/share/lib/bigdl-0.12.1-jar-with-dependencies.jar pyspark-shell 
    2020-12-23 09:29:24 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    
    test setup failed
    request = <SubRequest 'orca_context_fixture' for <Function test_write_mnist>>
    
        @pytest.fixture(autouse=True, scope='package')
        def orca_context_fixture(request):
            import os
            from zoo.orca import OrcaContext, init_orca_context, stop_orca_context
            OrcaContext._eager_mode = True
            access_key_id = os.getenv("AWS_ACCESS_KEY_ID")
            secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY")
            if access_key_id is not None and secret_access_key is not None:
                env = {"AWS_ACCESS_KEY_ID": access_key_id,
                       "AWS_SECRET_ACCESS_KEY": secret_access_key}
            else:
                env = None
            sc = init_orca_context(cores=4, spark_log_level="INFO",
                                   env=env, object_store_memory="1g",
    >                              init_ray_on_spark=True)
    
    conftest.py:34: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    ../../../../zoo/orca/common.py:135: in init_orca_context
        sc = init_spark_on_local(cores, **spark_args)
    ../../../../zoo/common/nncontext.py:53: in init_spark_on_local
        python_location=python_location)
    ../../../../zoo/util/spark.py:56: in init_spark_on_local
        redirect_spark_log=self.redirect_spark_log)
    ../../../../zoo/common/nncontext.py:387: in init_nncontext
        redire_spark_logs()
    /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/bigdl/util/common.py:449: in redire_spark_logs
        callBigDlFunc(bigdl_type, "redirectSparkLogs", log_path)
    /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/bigdl/util/common.py:592: in callBigDlFunc
        for jinvoker in JavaCreator.instance(bigdl_type, gateway).value:
    /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/bigdl/util/common.py:56: in instance
        cls._instance = cls(bigdl_type, *args)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <bigdl.util.common.JavaCreator object at 0x7f39c12ed9e8>
    bigdl_type = 'float'
    gateway = <py4j.java_gateway.JavaGateway object at 0x7f39c1331978>
    
        def __init__(self, bigdl_type, gateway):
            self.value = []
            for creator_class in JavaCreator.get_creator_class():
                jclass = getattr(gateway.jvm, creator_class)
                if bigdl_type == "float":
    >               self.value.append(getattr(jclass, "ofFloat")())
    E               TypeError: 'JavaPackage' object is not callable
    
    /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/bigdl/util/common.py:96: TypeError
    
    
    Assertion failed
    
    
    Assertion failed
    
    
    Assertion failed
    
    
    Assertion failed
    
    
    ==================================== ERRORS ====================================
    ______________________ ERROR at setup of test_write_mnist ______________________
    
    request = <SubRequest 'orca_context_fixture' for <Function test_write_mnist>>
    
        @pytest.fixture(autouse=True, scope='package')
        def orca_context_fixture(request):
            import os
            from zoo.orca import OrcaContext, init_orca_context, stop_orca_context
            OrcaContext._eager_mode = True
            access_key_id = os.getenv("AWS_ACCESS_KEY_ID")
            secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY")
            if access_key_id is not None and secret_access_key is not None:
                env = {"AWS_ACCESS_KEY_ID": access_key_id,
                       "AWS_SECRET_ACCESS_KEY": secret_access_key}
            else:
                env = None
            sc = init_orca_context(cores=4, spark_log_level="INFO",
                                   env=env, object_store_memory="1g",
    >                              init_ray_on_spark=True)
    
    conftest.py:34: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    ../../../../zoo/orca/common.py:135: in init_orca_context
        sc = init_spark_on_local(cores, **spark_args)
    ../../../../zoo/common/nncontext.py:53: in init_spark_on_local
        python_location=python_location)
    ../../../../zoo/util/spark.py:56: in init_spark_on_local
        redirect_spark_log=self.redirect_spark_log)
    ../../../../zoo/common/nncontext.py:387: in init_nncontext
        redire_spark_logs()
    /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/bigdl/util/common.py:449: in redire_spark_logs
        callBigDlFunc(bigdl_type, "redirectSparkLogs", log_path)
    /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/bigdl/util/common.py:592: in callBigDlFunc
        for jinvoker in JavaCreator.instance(bigdl_type, gateway).value:
    /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/bigdl/util/common.py:56: in instance
        cls._instance = cls(bigdl_type, *args)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <bigdl.util.common.JavaCreator object at 0x7f39c12ed9e8>
    bigdl_type = 'float'
    gateway = <py4j.java_gateway.JavaGateway object at 0x7f39c1331978>
    
        def __init__(self, bigdl_type, gateway):
            self.value = []
            for creator_class in JavaCreator.get_creator_class():
                jclass = getattr(gateway.jvm, creator_class)
                if bigdl_type == "float":
    >               self.value.append(getattr(jclass, "ofFloat")())
    E               TypeError: 'JavaPackage' object is not callable
    
    /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/bigdl/util/common.py:96: TypeError
    ---------------------------- Captured stdout setup -----------------------------
    Initializing orca context
    Current pyspark location is : /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/pyspark/__init__.py
    Start to getOrCreate SparkContext
    pyspark_submit_args is:  --driver-class-path /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/bigdl/share/lib/bigdl-0.12.1-jar-with-dependencies.jar pyspark-shell 
    2020-12-23 09:29:24 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    ---------------------------- Captured stderr setup -----------------------------
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    =============================== warnings summary ===============================
    ../../../../../../../anaconda3/envs/zoo-dev/lib/python3.6/site-packages/pyspark/cloudpickle.py:47
      /root/anaconda3/envs/zoo-dev/lib/python3.6/site-packages/pyspark/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
        import imp
    
    -- Docs: https://docs.pytest.org/en/stable/warnings.html
    =========================== short test summary info ============================
    ERROR test_write_parquet.py::test_write_mnist - TypeError: 'JavaPackage' obje...
    ========================= 1 warning, 1 error in 4.99s ==========================
    Stopping orca context
    Error in atexit._run_exitfuncs:
    Traceback (most recent call last):
      File "/root/zoo-project/analytics-zoo/pyzoo/zoo/orca/common.py", line 210, in stop_orca_context
        ray_ctx = RayContext.get(initialize=False)
      File "/root/zoo-project/analytics-zoo/pyzoo/zoo/ray/raycontext.py", line 393, in get
        raise Exception("No active RayContext. Please create a RayContext and init it first")
    Exception: No active RayContext. Please create a RayContext and init it first
    
    Process finished with exit code 1
    
    
    Assertion failed
    
    Assertion failed
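
With py4j, "'JavaPackage' object is not callable" generally means the JVM could not resolve the requested class name, so py4j fell back to treating it as a package; a jar missing from the classpath is a common cause. In the log above, --driver-class-path lists only the BigDL jar, so one thing worth checking is which classpath entries are present and whether they exist. The sketch below parses a submit-args string of the shape printed in the log; classpath_entries and missing_entries are hypothetical helpers, and whether a missing Analytics Zoo jar is the actual cause here is an assumption.

```python
import os
import shlex

def classpath_entries(submit_args: str):
    """Extract the entries passed to --driver-class-path in a submit-args string."""
    tokens = shlex.split(submit_args)
    for i, tok in enumerate(tokens):
        if tok == "--driver-class-path" and i + 1 < len(tokens):
            return tokens[i + 1].split(os.pathsep)
    return []

def missing_entries(submit_args: str):
    """Return the classpath entries that do not exist on this machine."""
    return [p for p in classpath_entries(submit_args) if not os.path.exists(p)]

if __name__ == "__main__":
    # PYSPARK_SUBMIT_ARGS is the environment variable PySpark reads these from.
    args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    print("classpath entries:", classpath_entries(args))
    print("missing on disk:", missing_entries(args))
```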
    
    user issue 
    opened by GitEasonXu 17
  • Support pyspark.sql.types.ArrayType with pyspark.sql.types.StringType elements

    This modification has been tested with pyzoo/test/zoo/orca/learn/ray/tf/test_tf_ray_estimator.py by adding test_array_string_input(). (I also modified test_string_input() to use a vocabulary.)

    Results for test_array_string_input():

    [Row(id=0, input=['foo', 'qux', 'bar'], prediction=DenseVector([4.0, 1.0, 2.0])), Row(id=1, input=['qux', 'baz'], prediction=DenseVector([1.0, 3.0]))]
    

    test_string_input():

    [Row(input='foo qux bar', prediction=DenseVector([3.0, 2.0, 5.0, 0.0])), Row(input='qux baz', prediction=DenseVector([2.0, 4.0, 0.0, 0.0]))]
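
For readers unfamiliar with what an array-of-strings input column involves, the per-element step is a string-to-id lookup. The sketch below shows that step in plain Python, without Spark or TensorFlow. The vocabulary is hypothetical and was chosen only so that the ids match the test_array_string_input() output printed above; the vocabulary and model used by the actual test are not shown in this description.

```python
def tokens_to_ids(tokens, vocab, oov_id=0):
    """Map each string token to its integer id; unknown tokens get `oov_id`."""
    return [vocab.get(tok, oov_id) for tok in tokens]

# Hypothetical vocabulary, chosen only to match the ids printed above.
vocab = {"qux": 1, "bar": 2, "baz": 3, "foo": 4}

# Each row mirrors one value of an ArrayType(StringType()) column.
rows = [["foo", "qux", "bar"], ["qux", "baz"]]
ids = [tokens_to_ids(row, vocab) for row in rows]
print(ids)  # [[4, 1, 2], [1, 3]]
```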
    

    test env:

    • python3.7
    • tensorflow==2.7.0
    • ray==1.9.2
    opened by nyamashi 16
  • Showing 'Exception in thread "main" java.lang.IllegalStateException: Cannot find any build directories' when executing 'sc = init_nncontext()'

    Could you please look into this issue? The error appears right after calling init_nncontext() at the very beginning of the code. It also shows the error 'Exception: Java gateway process exited before sending its port number'. Could you please explain what the problem is and how it can be solved?

    user issue 
    opened by dhannya34 16
  • AutoML Installation error, help me

    I tried to install the "automl" toolkit following the steps on https://github.com/intel-analytics/analytics-zoo/tree/automl/apps/automl but the following error occurred. Asking for help.

    (1) Win10 Linux subsystem (Ubuntu18.04onWindows)
    (2) Anaconda3-2020.02-Linux-x86_64
    (3) Details:

    (base) wxy@SC-202007040719:/$ conda activate zoo_automl
    (zoo_automl) wxy@SC-202007040719:/$ pip install analytics-zoo/pyzoo/dist/analytics_zoo-0.8.1-py2.py3-none-manylinux1_x86_64.whl[automl]
    WARNING: Requirement 'analytics-zoo/pyzoo/dist/analytics_zoo-0.8.1-py2.py3-none-manylinux1_x86_64.whl[automl]' looks like a filename, but the file does not exist
    Processing /analytics-zoo/pyzoo/dist/analytics_zoo-0.8.1-py2.py3-none-manylinux1_x86_64.whl
    ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/analytics-zoo/pyzoo/dist/analytics_zoo-0.8.1-py2.py3-none-manylinux1_x86_64.whl'

    (zoo_automl) wxy@SC-202007040719:/$ quit

    Command 'quit' not found, did you mean:

      command 'luit' from deb x11-utils
      command 'quot' from deb quota
      command 'qgit' from deb qgit
      command 'quilt' from deb quilt
      command 'quiz' from deb bsdgames

    Try: sudo apt install

    (zoo_automl) wxy@SC-202007040719:/$ conda deactivate
    (base) wxy@SC-202007040719:/$ source activate zoo_automl
    (zoo_automl) wxy@SC-202007040719:/$ pip install analytics-zoo/pyzoo/dist/analytics_zoo-0.8.1-py2.py3-none-manylinux1_x86_64.whl[automl]
    WARNING: Requirement 'analytics-zoo/pyzoo/dist/analytics_zoo-0.8.1-py2.py3-none-manylinux1_x86_64.whl[automl]' looks like a filename, but the file does not exist
    Processing /analytics-zoo/pyzoo/dist/analytics_zoo-0.8.1-py2.py3-none-manylinux1_x86_64.whl
    ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/analytics-zoo/pyzoo/dist/analytics_zoo-0.8.1-py2.py3-none-manylinux1_x86_64.whl'

    (zoo_automl) wxy@SC-202007040719:/$
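
The session above passes pip a relative wheel path while the shell prompt shows the working directory is /, so pip looks for /analytics-zoo/pyzoo/dist/... and does not find it. Either the wheel has not been built yet or the command was run from the wrong directory. The sketch below reproduces that path resolution with the standard library; split_requirement and explain_wheel_path are hypothetical helpers, and the handling of the trailing [automl] extras is a simplification of what pip does.

```python
import os

def split_requirement(req: str):
    """Split 'path/to/pkg.whl[extra1,extra2]' into (path, [extras])."""
    if req.endswith("]") and "[" in req:
        path, _, extras = req[:-1].rpartition("[")
        return path, [e.strip() for e in extras.split(",") if e.strip()]
    return req, []

def explain_wheel_path(req: str, cwd: str):
    """Return the absolute wheel path a relative requirement resolves to,
    and whether a file exists there."""
    path, _extras = split_requirement(req)
    resolved = os.path.normpath(os.path.join(cwd, path))
    return resolved, os.path.isfile(resolved)

if __name__ == "__main__":
    req = ("analytics-zoo/pyzoo/dist/"
           "analytics_zoo-0.8.1-py2.py3-none-manylinux1_x86_64.whl[automl]")
    print(explain_wheel_path(req, os.getcwd()))
```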

    I am a beginner, please take care of me. Thank you very much!!!!!!!

    opened by 2017wxyzwxyz 15
  • Error while running the openvino example

    Run by downloading the prebuilt package. https://github.com/intel-analytics/analytics-zoo/tree/master/pyzoo/zoo/examples/openvino

    Traceback (most recent call last):
      File "/tmp/1558587157102-0/model-optimizer/mo_tf.py", line 28, in <module>
        from mo.main import main
      File "/tmp/1558587157102-0/model-optimizer/mo/main.py", line 28, in <module>
        from mo.utils.cli_parser import get_placeholder_shapes, get_tuple_values, get_model_name,
      File "/tmp/1558587157102-0/model-optimizer/mo/utils/cli_parser.py", line 26, in <module>
        from mo.front.extractor import split_node_in_port
      File "/tmp/1558587157102-0/model-optimizer/mo/front/extractor.py", line 21, in <module>
        import networkx as nx
    ModuleNotFoundError: No module named 'networkx'
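
The traceback ends in a missing third-party module (networkx) in the Python environment that ran the model optimizer. A quick way to check which modules are absent before launching the example is sketched below; missing_modules is a hypothetical helper, and the list of names to check is up to the reader since the optimizer's full requirements are not shown here.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    # networkx is the module named in the traceback above.
    print(missing_modules(["networkx"]))
```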

    high priority 
    opened by zhichao-li 15
  • FL server and client

    The main modifications are:

    • Use a single FLServer to start all services, and likewise a single FLClient
    • Move some test clients and utils to the test directory

    Some unused code has not been deleted because I do not know whether it will be needed in the future.

    opened by Litchilitchy 14
  • Some random errors when running TF2Estimator on recsys full dataset

    When converting SparkXShards to RayXShards:

    (raylet, ip=172.16.0.113) [2021-07-22 11:06:06,984 C 134880 134880] service_based_gcs_client.cc:235: Couldn't reconnect to GCS server. The last attempted
    [Stage 504:==========================================>       (849 + 158) / 1007]2021-07-22 11:07:33 ERROR DAGScheduler:91 - Failed to update accumulator 0
     (org.apache.spark.api.python.PythonAccumulatorV2) for task 717
    java.net.SocketException: Connection reset
    
    2021-07-22 11:07:33 ERROR DAGScheduler:91 - Failed to update accumulator 0 (org.apache.spark.api.python.PythonAccumulatorV2) for task 885
    java.net.SocketException: Broken pipe (Write failed)
            at java.net.SocketOutputStream.socketWrite0(Native Method)
            at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
            at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
            at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
            at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
            at java.io.DataOutputStream.flush(DataOutputStream.java:123)
            at org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:650)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1257)
            at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1248)
            at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
            at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
            at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1248)
            at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1338)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2107)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
            at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    

    http://172.16.0.107:7777/history/application_1626654036089_0216/jobs/
    http://172.16.0.107:7777/history/application_1626654036089_0518/jobs/
    http://172.16.0.107:7777/history/application_1626654036089_0563/jobs/

    high priority orca friesian 
    opened by hkvision 14
  • TuneError: ('Trials did not complete', [train_func_19925_00004])

    I'm trying to use auto_ts on my multivariate time series data with an LSTM model, but fitting gives the following error.

    image

    Data is in this format

    image

    == Status ==
    Memory usage on this node: 2.7/12.7 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/6.79 GiB heap, 0.0/2.34 GiB objects
    Current best trial: 19925_00005 with mse=0.15181996493457617 and parameters={'hidden_dim': 64, 'layer_num': 2, 'lr': 0.0010343663029423226, 'dropout': 0.09671240437800133, 'input_feature_num': None, 'output_feature_num': 1, 'past_seq_len': 4, 'future_seq_len': 1, 'selected_features': ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'MINUTE', 'DAY', 'DAYOFYEAR', 'HOUR', 'WEEKDAY', 'WEEKOFYEAR', 'MONTH', 'YEAR', 'IS_AWAKE', 'IS_BUSY_HOURS', 'IS_WEEKEND'], 'batch_size': 32}
    Result logdir: /tmp/autots_estimator/autots_estimator
    Number of trials: 6/6 (1 ERROR, 5 TERMINATED)
    Number of errored trials: 1

    Trial name | # failures | error file
    -- | -- | --
    train_func_19925_00004 | 1 | /tmp/autots_estimator/autots_estimator/train_func_19925_00004/error.txt


    ---------------------------------------------------------------------------
    
    TuneError                                 Traceback (most recent call last)
    
    <ipython-input-17-c1017f49fcaa> in <module>()
          2 ts_pipeline = auto_estimator.fit(data=tsdata_train, # train dataset
          3                                  validation_data=tsdata_val, # validation dataset
    ----> 4                                  epochs=5) # number of epochs to train in each trial
    
    

    4 frames
    /usr/local/lib/python3.7/dist-packages/zoo/chronos/autots/autotsestimator.py in fit(self, data, epochs, batch_size, validation_data, metric_threshold, n_sampling, search_alg, search_alg_params, scheduler, scheduler_params)
        246                 search_alg_params=search_alg_params,
        247                 scheduler=scheduler,
    --> 248                 scheduler_params=scheduler_params
        249             )
        250 
    
    
    /usr/local/lib/python3.7/dist-packages/zoo/chronos/autots/model/base_automodel.py in fit(self, data, epochs, batch_size, validation_data, metric_threshold, n_sampling, search_alg, search_alg_params, scheduler, scheduler_params)
         77             search_alg_params=search_alg_params,
         78             scheduler=scheduler,
    ---> 79             scheduler_params=scheduler_params,
         80         )
         81         self.best_model = self.auto_est._get_best_automl_model()
    
    
    /usr/local/lib/python3.7/dist-packages/zoo/orca/automl/auto_estimator.py in fit(self, data, epochs, validation_data, metric, metric_mode, metric_threshold, n_sampling, search_space, search_alg, search_alg_params, scheduler, scheduler_params)
        193                               scheduler=scheduler,
        194                               scheduler_params=scheduler_params)
    --> 195         self.searcher.run()
        196         self._fitted = True
        197 
    
    
    /usr/local/lib/python3.7/dist-packages/zoo/orca/automl/search/ray_tune/ray_tune_search_engine.py in run(self)
        181             resources_per_trial=self.resources_per_trial,
        182             verbose=1,
    --> 183             reuse_actors=True
        184         )
        185         self.trials = analysis.trials
    
    
    /usr/local/lib/python3.7/dist-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, loggers, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint)
        442     if incomplete_trials:
        443         if raise_on_failed_trial:
    --> 444             raise TuneError("Trials did not complete", incomplete_trials)
        445         else:
        446             logger.error("Trials did not complete: %s", incomplete_trials)
    
    
    TuneError: ('Trials did not complete', [train_func_19925_00004])
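
    The status table above points at an error.txt file for the failed trial, and that file normally holds the underlying exception that TuneError only summarizes. A minimal sketch for collecting those files, assuming the layout shown in the log (one directory per trial under the result logdir, each optionally containing error.txt). The helper name collect_trial_errors is hypothetical and not part of Analytics Zoo or Ray:

```python
from pathlib import Path


def collect_trial_errors(logdir):
    """Return {trial directory name: error text} for every trial that left an error.txt."""
    errors = {}
    for error_file in sorted(Path(logdir).glob("*/error.txt")):
        # errors="replace" keeps the scan going even if a log contains odd bytes.
        errors[error_file.parent.name] = error_file.read_text(errors="replace")
    return errors


if __name__ == "__main__":
    # Path taken from the "Result logdir" line in the log above.
    for trial, text in collect_trial_errors("/tmp/autots_estimator/autots_estimator").items():
        print(f"=== {trial} ===")
        print(text)
```

    Posting the contents of that file alongside the TuneError usually makes the failing trial much easier to diagnose.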

    user issue 
    opened by saiprasad2606 3
  • (raylet) socket.gaierror: [Errno -2] Name or service not known

    (raylet) socket.gaierror: [Errno -2] Name or service not known

    When I run the TensorFlow 2 example at https://analytics-zoo.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-tf2keras-quickstart.html, I get the following error:

    (raylet) Traceback (most recent call last):
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 334, in
    (raylet) raise e
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 323, in
    (raylet) loop.run_until_complete(agent.run())
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/asyncio/base_events.py", line 568, in run_until_complete
    (raylet) return future.result()
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 138, in run
    (raylet) modules = self._load_modules()
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 92, in _load_modules
    (raylet) c = cls(self)
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in init
    (raylet) self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/metrics_agent.py", line 76, in init
    (raylet) namespace="ray", port=metrics_export_port)))
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 334, in new_stats_exporter
    (raylet) options=option, gatherer=option.registry, collector=collector)
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 266, in init
    (raylet) self.serve_http()
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/ray/prometheus_exporter.py", line 321, in serve_http
    (raylet) port=self.options.port, addr=str(self.options.address))
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    (raylet) TmpServer.address_family, addr = _get_best_family(addr, port)
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    (raylet) infos = socket.getaddrinfo(address, port)
    (raylet) File "/usr/local/miniconda3/envs/zoo/lib/python3.7/socket.py", line 753, in getaddrinfo
    (raylet) for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    (raylet) socket.gaierror: [Errno -2] Name or service not known

    Hosts file image
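
    The last frame of the traceback is socket.getaddrinfo, so the address handed to the metrics exporter could not be resolved from inside the container. A minimal check, assuming the name in question is the machine's own hostname (the traceback does not show which address was used). The helper can_resolve is hypothetical:

```python
import socket


def can_resolve(host, port=0):
    """Return True if the system resolver can turn `host` into at least one address."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Check the loopback name and the hostname this container reports for itself.
    for name in ("localhost", socket.gethostname()):
        print(name, "resolves" if can_resolve(name) else "does NOT resolve")
```

    If the container's hostname does not resolve, a missing or mismatched entry in /etc/hosts is one likely cause worth ruling out.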

    After running the example, session files are generated in /tmp/ray/ of the system image

    Runtime environment: Docker deployment uses Miniconda to install AZ and Ray

    conda create -n zoo python=3.7
    conda activate zoo
    pip install --pre --upgrade analytics-zoo
    pip install analytics-zoo[ray]
    pip install tensorflow==2.3.0

    conda list

    Name Version Build Channel
    _libgcc_mutex 0.1 main
    _openmp_mutex 5.1 1_gnu
    absl-py 1.0.0 pypi_0 pypi
    aiohttp 3.7.0 pypi_0 pypi
    aiohttp-cors 0.7.0 pypi_0 pypi
    aioredis 1.1.0 pypi_0 pypi
    analytics-zoo 0.12.0b2022052501 pypi_0 pypi
    astunparse 1.6.3 pypi_0 pypi
    async-timeout 3.0.1 pypi_0 pypi
    attrs 21.4.0 pypi_0 pypi
    bigdl 0.13.1.dev1 pypi_0 pypi
    blessings 1.7 pypi_0 pypi
    ca-certificates 2022.4.26 h06a4308_0
    cachetools 5.1.0 pypi_0 pypi
    certifi 2022.5.18.1 py37h06a4308_0
    chardet 3.0.4 pypi_0 pypi
    charset-normalizer 2.0.12 pypi_0 pypi
    click 8.1.3 pypi_0 pypi
    colorama 0.4.4 pypi_0 pypi
    colorful 0.5.4 pypi_0 pypi
    conda-pack 0.3.1 pypi_0 pypi
    deprecated 1.2.13 pypi_0 pypi
    filelock 3.7.0 pypi_0 pypi
    gast 0.3.3 pypi_0 pypi
    google-api-core 2.8.0 pypi_0 pypi
    google-auth 2.6.6 pypi_0 pypi
    google-auth-oauthlib 0.4.6 pypi_0 pypi
    google-pasta 0.2.0 pypi_0 pypi
    googleapis-common-protos 1.56.1 pypi_0 pypi
    gpustat 0.6.0 pypi_0 pypi
    grpcio 1.46.3 pypi_0 pypi
    h5py 2.10.0 pypi_0 pypi
    hiredis 1.1.0 pypi_0 pypi
    idna 3.3 pypi_0 pypi
    importlib-metadata 4.11.4 pypi_0 pypi
    importlib-resources 5.7.1 pypi_0 pypi
    jsonschema 4.5.1 pypi_0 pypi
    keras-preprocessing 1.1.2 pypi_0 pypi
    libedit 3.1.20210910 h7f8727e_0
    libffi 3.2.1 hf484d3e_1007
    libgcc-ng 11.2.0 h1234567_0
    libgomp 11.2.0 h1234567_0
    libstdcxx-ng 11.2.0 h1234567_0
    markdown 3.3.7 pypi_0 pypi
    msgpack 1.0.3 pypi_0 pypi
    multidict 6.0.2 pypi_0 pypi
    ncurses 6.3 h7f8727e_2
    numpy 1.18.5 pypi_0 pypi
    nvidia-ml-py3 7.352.0 pypi_0 pypi
    oauthlib 3.2.0 pypi_0 pypi
    opencensus 0.9.0 pypi_0 pypi
    opencensus-context 0.1.2 pypi_0 pypi
    opencv-python 4.5.5.64 pypi_0 pypi
    openssl 1.0.2u h7b6447c_0
    opt-einsum 3.3.0 pypi_0 pypi
    packaging 21.3 pypi_0 pypi
    pip 21.2.2 py37h06a4308_0
    prometheus-client 0.14.1 pypi_0 pypi
    protobuf 3.20.1 pypi_0 pypi
    psutil 5.9.1 pypi_0 pypi
    py-spy 0.3.12 pypi_0 pypi
    py4j 0.10.7 pypi_0 pypi
    pyasn1 0.4.8 pypi_0 pypi
    pyasn1-modules 0.2.8 pypi_0 pypi
    pyparsing 3.0.9 pypi_0 pypi
    pyrsistent 0.18.1 pypi_0 pypi
    pyspark 2.4.6 pypi_0 pypi
    python 3.7.0 h6e4f718_3
    pyyaml 6.0 pypi_0 pypi
    ray 1.2.0 pypi_0 pypi
    readline 7.0 h7b6447c_5
    redis 4.1.4 pypi_0 pypi
    requests 2.27.1 pypi_0 pypi
    requests-oauthlib 1.3.1 pypi_0 pypi
    rsa 4.8 pypi_0 pypi
    scipy 1.4.1 pypi_0 pypi
    setproctitle 1.2.3 pypi_0 pypi
    setuptools 61.2.0 py37h06a4308_0
    six 1.16.0 pypi_0 pypi
    sqlite 3.33.0 h62c20be_0
    tensorboard 2.9.0 pypi_0 pypi
    tensorboard-data-server 0.6.1 pypi_0 pypi
    tensorboard-plugin-wit 1.8.1 pypi_0 pypi
    tensorflow 2.3.0 pypi_0 pypi
    tensorflow-estimator 2.3.0 pypi_0 pypi
    termcolor 1.1.0 pypi_0 pypi
    tk 8.6.11 h1ccaba5_1
    typing-extensions 4.2.0 pypi_0 pypi
    urllib3 1.26.9 pypi_0 pypi
    werkzeug 2.1.2 pypi_0 pypi
    wheel 0.37.1 pyhd3eb1b0_0
    wrapt 1.14.1 pypi_0 pypi
    xz 5.2.5 h7f8727e_1
    yarl 1.7.2 pypi_0 pypi
    zipp 3.8.0 pypi_0 pypi
    zlib 1.2.12 h7f8727e_2

    1. Check python:

    from zoo.util.utils import detect_python_location
    detect_python_location()

    image

    2. Check ray installation:

    /usr/local/miniconda3/envs/zoo/bin/python /usr/local/miniconda3/envs/zoo/bin/ray start --head --include-dashboard ture --dashboard-host 172.27.0.2 --port 35413 --redis-password 123456 --num-cpus 1

    image

    /usr/local/miniconda3/envs/zoo/bin/python /usr/local/miniconda3/envs/zoo/bin/ray start --address 172.27.0.2:35413 --redis-password 123456 --num-cpus 1

    image

    ray start --address='172.27.0.2:35413' --redis-password='0'

    image

    Related documents.zip

    user issue 
    opened by xunaichao 9
  • Cannot import name 'forecaster' from 'zoo.chronos'

    Cannot import name 'forecaster' from 'zoo.chronos'

    System information

    • OS Platform and Distribution: macOS Big Sur, also tested on WSL with Windows 10
    • Zoo version (zoo.__version__): 0.11.2
    • pyspark version (pyspark.__version__): 2.4.6
    • Python version: Python 3.6.13 | Anaconda, Inc.| (default, Feb 23 2021, 12:58:59)
    • Code we can use to reproduce: [example notebook](https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/chronos/use-case/fsi/stock_prediction_prophet.ipynb)

    Message

    Hi

    I'm exploring Chronos for time series. I've decided to use this example notebook to start.

    When running `from zoo.chronos import forecaster`, I got the following error message:

    ImportError: cannot import name 'forecaster'
    

    When I list the attributes with `dir(zoo.chronos)`, I get only `data`: ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'data']
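
    Note that dir() on a package lists only the submodules that have already been imported, so a name missing from that list does not by itself prove the submodule is absent from the installation. A minimal sketch that lists what is actually shipped on disk, using the standard pkgutil module. The helper list_submodules is hypothetical, and the demo uses the standard json package so that it runs without Analytics Zoo; pass "zoo.chronos" to reproduce the check from the report:

```python
import importlib
import pkgutil


def list_submodules(package_name):
    """Return the sorted names of the modules and subpackages shipped inside a package."""
    package = importlib.import_module(package_name)
    return sorted(info.name for info in pkgutil.iter_modules(package.__path__))


if __name__ == "__main__":
    # Replace "json" with "zoo.chronos" in an environment where Analytics Zoo is installed.
    print(list_submodules("json"))
```

    If forecaster is missing from that list too, the installed release most likely does not ship it, and a different Analytics Zoo version than the one the notebook targets may be installed.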

    user issue 
    opened by mbrhd 1
  • Fail at import: dependencies not installed

    Fail at import: dependencies not installed

    Hi

    I'm exploring Chronos for time series. I've decided to use this example notebook to start.

    When running `from zoo.chronos.data import TSDataset`, I got the following error message:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/zoo/__init__.py", line 17, in <module>
        from zoo.common.nncontext import *
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/zoo/common/__init__.py", line 17, in <module>
        from .utils import *
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/zoo/common/utils.py", line 16, in <module>
        from bigdl.util.common import Sample as BSample, JTensor as BJTensor,\
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/bigdl/__init__.py", line 18, in <module>
        prepare_env()
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/bigdl/util/engine.py", line 155, in prepare_env
        __prepare_spark_env()
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/bigdl/util/engine.py", line 53, in __prepare_spark_env
        if exist_pyspark():
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/bigdl/util/engine.py", line 26, in exist_pyspark
        import pyspark
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/pyspark/__init__.py", line 51, in <module>
        from pyspark.context import SparkContext
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/pyspark/context.py", line 31, in <module>
        from pyspark import accumulators
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/pyspark/accumulators.py", line 97, in <module>
        from pyspark.serializers import read_int, PickleSerializer
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/pyspark/serializers.py", line 72, in <module>
        from pyspark import cloudpickle
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/pyspark/cloudpickle.py", line 145, in <module>
        _cell_set_template_code = _make_cell_set_template_code()
      File "/opt/anaconda3/envs/analytics-zoo-test/lib/python3.8/site-packages/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
        return types.CodeType(
    TypeError: an integer is required (got type bytes)
    

    I fixed this by installing Spark using: conda install pyspark

    Then the same command, `from zoo.chronos.data import TSDataset`, fails because pandas, packaging, and tsfresh are not installed. I fixed this issue by installing pandas, packaging and tsfresh.
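
    The TypeError raised from pyspark/cloudpickle.py is what PySpark 2.4.x typically produces on Python 3.8, which that release line does not support; the paths in the traceback show a python3.8 environment. For the second failure, a minimal sketch that reports which required modules are missing before the import is attempted. The helper missing_modules is hypothetical, and the module list is taken from the report above rather than from an official dependency list:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Modules named in the report; extend the list as needed.
    print(missing_modules(["pyspark", "pandas", "packaging", "tsfresh"]))
```

    An empty list means every listed module can be located, although an individual import can still fail for version reasons such as the one above.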

    user issue 
    opened by mbrhd 1
  • Clarification on BigDL and Analytics-Zoo status

    Clarification on BigDL and Analytics-Zoo status

    The Intel page for BigDL indicates that BigDL is merging Analytics-Zoo into the BigDL project, and that Analytics-Zoo is now a legacy tool. This is backed up by BigDL seeming to provide all the functionality Analytics-Zoo offers. However, this statement appears only on the Intel website.

    Could clarification be provided on this?

    user issue 
    opened by bendavidsteel 2
  • [BigDL2.0 k8s] cluster mode remain issues

    [BigDL2.0 k8s] cluster mode remain issues

    Remaining issues in k8s cluster mode:

    1. PyTorch JEP requires specifying export PYTHONHOME=/usr/local/envs/pytf1 on the driver. In cluster mode, it cannot be set in the driver pod. The k8s image has two Python envs, so we cannot hard-code PYTHONHOME in the image.
    2. TFPark in cluster mode throws ModuleNotFoundError: No module named 'nets'. The k8s image has already cloned the slim models and set PYTHONPATH to opt/models/research/slim:$PYTHONPATH. In client mode, it runs successfully.
    3. The Orca OpenVINO example throws RuntimeError: The support of IR v4 has been removed from the product. Please, convert the original model using the Model Optimizer which comes with this version of the OpenVINO to generate supported IR version. It seems the code does not support OpenVINO installed by conda install openvino-ie4py-ubuntu18 -c intel
    opened by Le-Zheng 0
Releases(v0.11.2)
  • v0.11.2(Jan 24, 2022)

    Highlights

    Note: Analytics Zoo v0.11.2 has been updated to include functional and security updates. Users should update to the latest version.

  • v0.11.1(Dec 22, 2021)

    Highlights

    Note: Analytics Zoo v0.11.1 has been updated to include functional and security updates. Users should update to the latest version.

  • v0.11.0(Jul 19, 2021)

    Highlights

    • Chronos: an overhaul of the previous Zouwu time-series analysis library

    • Reference implementation of large-scale feature transformation pipelines for recommendation systems (e.g., DLRM, DIEN, W&D, etc.)

    • Enhancements to Orca (scaling TF/PyTorch models to distributed Big Data) for end-to-end computer vision pipelines (distributed image preprocessing, training and inference); for more information, please see our CVPR 2021 tutorial.

    • Initial Python and PySpark application support for PPML (privacy preserving big data and machine learning)

  • v0.10.0(May 25, 2021)

    Highlights

    • Improved document website, including quickstarts for Orca, RayOnSpark, Zouwu and BigDL
    • Orca library: unified API for running distributed deep learning (TensorFlow, PyTorch, Keras, BigDL, OpenVINO, etc.) on distributed Big Data (using Apache Spark and Ray)
    • Experimental PPML support for privacy preserving big data analysis and machine learning (i.e., running unmodified Apache Spark, Apache Flink, BigDL and TF/PyTorch/OpenVINO inference in a secure fashion on cloud)
    • Improved AutoML and Time Series (Zouwu) support, including AutoXGBoost, TCN, ONNX inference, etc.
    • Improved Cluster Serving (including performance, stability, log information, etc.)
  • v0.9.0(Dec 17, 2020)

  • v0.8.1(Apr 27, 2020)

  • v0.8.0(Apr 17, 2020)

    Highlights

    • Improved support for running Analytics Zoo on K8s
    • Improvement to tfpark (including support of pre-made TensorFlow Estimator, and support of Spark Dataframe and tf.data.Dataset in zoo.tfpark.TFDataset)
    • Improvement to Cluster Serving (including support for performance mode, TensorFlow saved model, and better TensorBoard integration)
    • Improvement to time series analysis (including MTNet and project Zouwu)
    • Support for Distributed MXNet training on Ray
    • Upgrade OpenVINO support to 2020.R1
  • v0.7.0(Jan 20, 2020)

  • v0.6.0(Oct 15, 2019)

  • v0.5.1(Jun 11, 2019)

  • v0.5.0(Jun 3, 2019)

  • v0.4.0(Jan 25, 2019)

    Highlights

    • Support for BigDL 0.7.2 and Spark 2.4; see the download page for all the supported versions.
    • Initial OpenVINO support for the model serving API, which can use the OpenVINO toolkit to accelerate inference for TensorFlow models on Analytics Zoo. Please refer to the related document and example for more details.
    • Initial Persistent Memory support for distributed deep learning training, which can leverage Intel Optane DC Persistent Memory to cache large training data set; please refer to the related document for more details.
    • Various new features, including additional built-in models (such as sequence-to-sequence model and unsupervised time series anomaly detection model), learning rate schedule in Adam, Spark ML vector support in nnframes, TensorFlow Keras model support in TFOptimizer, etc.
  • v0.3.0(Oct 30, 2018)

    Highlights

    • Distributed TensorFlow (both training and inference) on Spark, which supports:

      • Data wrangling and analysis using PySpark
      • Deep learning model development using TensorFlow or Keras
      • Distributed training/inference on Spark and BigDL
      • All within a single unified pipeline and in a user-transparent fashion!
    • More support for text processing and models, including:

      • Common feature engineering operations for text data (such as tokenization, normalization, padding, etc.)
      • Word Embedding layers that directly load pretrained GloVe model
      • Text matching models (such as KNRM)
    • Various improvements and new features, such as:

      • Support for trainable variable (Parameter)
      • Support for Keras objectives (with zero-based label)
      • Improvements to model serving APIs
      • Improvements to example and use case documents
  • v0.2.0(Jul 17, 2018)

    Highlights

    • Support for both BigDL 0.5.0 and BigDL 0.6.0
    • New reference use case (image similarity based house recommendation)
    • Additional pre-trained models (Inception v3, MobileNet v2, quantized models)
    • Improved support for autograd and custom loss/layer
    • Improved support for model serving APIs
  • v0.1.0(Jun 13, 2018)

    Highlights

    • Support for building and productionizing end-to-end deep learning applications for big data
    • E2E analytics + deep learning pipelines (natively in Spark DataFrames and ML Pipelines) using nnframes
    • Flexible model definition using autograd, Keras & transfer learning APIs
    • Data preprocessing using built-in feature engineering operations
    • Out-of-the-box solutions for a variety of problem types using built-in deep learning models and reference use cases
    • Serving models using POJO model serving APIs for web services and other big data frameworks (e.g., Storm or Kafka)
XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

null 92 Dec 14, 2022
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Horovod Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make dis

Horovod 12.9k Jan 7, 2023
BigDL: Distributed Deep Learning Framework for Apache Spark

BigDL: Distributed Deep Learning on Apache Spark What is BigDL? BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can w

null 4.1k Jan 9, 2023
Distributed Deep learning with Keras & Spark

Elephas: Distributed Deep Learning with Keras & Spark Elephas is an extension of Keras, which allows you to run distributed deep learning models at sc

Max Pumperla 1.6k Dec 29, 2022
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

TensorFlowOnSpark TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the T

Yahoo 3.8k Jan 4, 2023
[DEPRECATED] Tensorflow wrapper for DataFrames on Apache Spark

TensorFrames (Deprecated) Note: TensorFrames is deprecated. You can use pandas UDF instead. Experimental TensorFlow binding for Scala and Apache Spark

Databricks 757 Dec 31, 2022
Uber Open Source 1.6k Dec 31, 2022
Microsoft Machine Learning for Apache Spark

Microsoft Machine Learning for Apache Spark MMLSpark is an ecosystem of tools aimed towards expanding the distributed computing framework Apache Spark

Microsoft Azure 3.9k Dec 30, 2022
An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Ray provides a simple, universal API for building distributed applications. Ray is packaged with the following libraries for accelerating machine lear

null 23.3k Dec 31, 2022
DistML is a Ray extension library to support large-scale distributed ML training on heterogeneous multi-node multi-GPU clusters

null 27 Aug 19, 2022
Spark development environment for k8s

Local Spark Dev Env with Docker Development environment for k8s. Using the spark-operator image to ensure it will be the same environment. Start conta

Otacilio Filho 18 Jan 4, 2022
Code base of KU AIRS: SPARK Autonomous Vehicle Team

KU AIRS: SPARK Autonomous Vehicle Project Check this link for the blog post describing this project and the video of SPARK in simulation and on parkou

Mehmet Enes Erciyes 1 Nov 23, 2021
A basic Ray Tracer that exploits numpy arrays and functions to work fast.

Python-Fast-Raytracer A basic Ray Tracer that exploits numpy arrays and functions to work fast. The code is written keeping as much readability as pos

Rafael de la Fuente 393 Dec 27, 2022
Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way

Apache Liminal's goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validation, deployment and inference in production. Liminal provides a Domain Specific Language to build ML workflows on top of Apache Airflow.

The Apache Software Foundation 121 Dec 28, 2022
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

Microsoft 14.5k Jan 7, 2023
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective. 10x Larger Models 10x Faster Trainin

Microsoft 8.4k Dec 30, 2022
A high performance and generic framework for distributed DNN training

BytePS BytePS is a high performance and general distributed training framework. It supports TensorFlow, Keras, PyTorch, and MXNet, and can run on eith

Bytedance Inc. 3.3k Dec 28, 2022
a distributed deep learning platform

Apache SINGA Distributed deep learning system http://singa.apache.org Quick Start Installation Examples Issues JIRA tickets Code Analysis: Mailing Lis

The Apache Software Foundation 2.7k Jan 5, 2023
Distributed Computing for AI Made Simple

Project Home Blog Documents Paper Media Coverage Join Fiber users email list [email protected] Fiber Distributed Computing for AI Made Simp

Uber Open Source 997 Dec 30, 2022