TensorFlowOnSpark

TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters.

By combining salient features from the TensorFlow deep learning framework with Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.

It enables both distributed TensorFlow training and inferencing on Spark clusters, with the goal of minimizing the code changes required to run existing TensorFlow programs on a shared grid. Its Spark-compatible API manages the TensorFlow cluster through the following steps:

  1. Startup - launches the TensorFlow main function on the executors, along with listeners for data/control messages.
  2. Data ingestion
    • InputMode.TENSORFLOW - leverages TensorFlow's built-in APIs to read data files directly from HDFS.
    • InputMode.SPARK - sends Spark RDD data to the TensorFlow nodes via the TFNode.DataFeed class. Note that we leverage the Hadoop Input/Output Format to access TFRecords on HDFS.
  3. Shutdown - shuts down the TensorFlow workers and PS nodes on the executors (see the sketch below).
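
For illustration, here is a minimal sketch of that lifecycle using InputMode.SPARK. This is an assumption-laden example rather than canonical usage: the function name main_fun, the executor/epoch counts, and the placeholder RDD are invented for the sketch.

from pyspark import SparkConf, SparkContext
from tensorflowonspark import TFCluster, TFNode

def main_fun(args, ctx):
    # Runs on each executor; ctx describes this node's role in the TF cluster.
    tf_feed = TFNode.DataFeed(ctx.mgr)     # InputMode.SPARK: consume fed RDD data
    while not tf_feed.should_stop():
        batch = tf_feed.next_batch(64)     # pull a batch of records fed by Spark
        # ... run one training step on `batch` ...

sc = SparkContext(conf=SparkConf().setAppName("tfos_example"))
data_rdd = sc.parallelize([[0.0] * 10 for _ in range(1000)])   # placeholder data

# 1. Startup: launch main_fun on the executors.
cluster = TFCluster.run(sc, main_fun, None, num_executors=3, num_ps=0,
                        input_mode=TFCluster.InputMode.SPARK, master_node='chief')

# 2. Data ingestion: feed Spark RDD data to the TensorFlow nodes.
cluster.train(data_rdd, num_epochs=3)

# 3. Shutdown: stop the TensorFlow workers and PS nodes.
cluster.shutdown()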

Background

TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on Hadoop clusters in Yahoo's private cloud.

TensorFlowOnSpark provides some important benefits (see our blog) over alternative deep learning solutions.

  • Easily migrate existing TensorFlow programs with <10 lines of code change.
  • Support for all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inferencing, and TensorBoard.
  • Server-to-server direct communication for faster learning when available.
  • Support for datasets on HDFS and other sources, pushed by Spark or pulled by TensorFlow.
  • Easy integration with your existing Spark data processing pipelines.
  • Easy deployment on cloud or on-premise, on CPUs or GPUs.

Install

TensorFlowOnSpark is provided as a pip package, which can be installed on single machines via:

# for tensorflow>=2.0.0
pip install tensorflowonspark

# for tensorflow<2.0.0
pip install tensorflowonspark==1.4.4

For distributed clusters, please see our wiki site for environment-specific documentation, such as our getting-started guides for single-node Spark Standalone, YARN clusters, and AWS EC2. Note: the Windows operating system is not currently supported due to this issue.

Usage

To use TensorFlowOnSpark with an existing TensorFlow application, follow our Conversion Guide, which describes the required changes. Additionally, our wiki site has pointers to presentations that provide an overview of the platform.

Note: since TensorFlow 2.x breaks API compatibility with TensorFlow 1.x, the examples have been updated accordingly. If you are using TensorFlow 1.x, you will need to check out the v1.4.4 tag for compatible examples and instructions.
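
As a rough sketch of the kind of change the guide describes (assuming a tf.keras program; the model below is a placeholder invented for this example, not taken from the guide):

# Before: a standalone TF2 script with its own main() entry point.
# After: the same training logic is wrapped in a function that TFCluster.run
# launches on each executor. TensorFlowOnSpark exports TF_CONFIG for each node,
# so the standard multi-worker strategy discovers the cluster automatically.
def main_fun(args, ctx):
    import tensorflow as tf

    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])   # placeholder model
        model.compile(optimizer='adam', loss='mse')
    # ... build the dataset and call model.fit() as in the original script ...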

API

API Documentation is automatically generated from the code.

Contribute

Please join the TensorFlowOnSpark user group for discussions and questions. If you have a question, please review our FAQ before posting.

Contributions are always welcome. For more information, please see our guide for getting involved.

License

The use and distribution terms for this software are covered by the Apache 2.0 license. See LICENSE file for terms.

Comments
  • Bump protobuf-java from 3.16.1 to 3.16.3

    Bumps protobuf-java from 3.16.1 to 3.16.3.

    Release notes

    Sourced from protobuf-java's releases.

    Protobuf Release v3.16.3

    Java

    • Refactoring java full runtime to reuse sub-message builders and prepare to migrate parsing logic from parse constructor to builder.
    • Move proto wireformat parsing functionality from the private "parsing constructor" to the Builder class.
    • Change the Lite runtime to prefer merging from the wireformat into mutable messages rather than building up a new immutable object before merging. This results in fewer allocations and copy operations.
    • Make message-type extensions merge from wire-format instead of building up instances and merging afterwards. This has much better performance.
    • Fix TextFormat parser to build up recurring (but supposedly not repeated) sub-messages directly from text rather than building a new sub-message and merging the fully formed message into the existing field.
    • This release addresses a Security Advisory for Java users
    Commits
    • b8c2488 Updating version.json and repo version numbers to: 16.3
    • 42e47e5 Refactoring Java parsing (3.16.x) (#10668)
    • 98884a8 Merge pull request #10556 from deannagarcia/3.16.x
    • 450b648 Cherrypick ruby fixes for monterey
    • b17bb39 Merge pull request #10548 from protocolbuffers/3.16.x-202209131829
    • c18f5e7 Updating changelog
    • 6f4e817 Updating version.json and repo version numbers to: 16.2
    • a7d4e94 Merge pull request #10547 from deannagarcia/3.16.x
    • 55815e4 Apply patch
    • 152d7bf Update version.json with "lts": true (#10535)
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • Evaluator hangs while training

    Environment:

    • Python version 3.7
    • Spark version 2.4
    • TensorFlow version 2.5
    • TensorFlowOnSpark version 2.2.3
    • Cluster version hadoop

    Describe the bug: I found that the evaluator node stops working after some time, while the training nodes work fine and the whole cluster does not crash. The total number of training steps is 80000, and the evaluator only evaluates for 10000+ steps. After that, no more logs are output.

    opened by jiqiujia 1
  • Do we support writing TensorFlow models with tensorflow-core-api from Scala & Java code?

    Environment:

    • Python version [e.g. 2.7, 3.6]
    • Spark version [e.g. 2.1, 2.3.1]
    • TensorFlow version [e.g. 1.5, 1.9.0]
    • TensorFlowOnSpark version [e.g. 1.1, 1.3.2]
    • Cluster version [e.g. Standalone, Hadoop 2.8, CDH5]

    Describe the bug: A clear and concise description of what the bug is.

    Logs: If applicable, add logs to help explain your problem. Note: errors may not be fully described in the driver/console logs. Make sure to check the executor logs for possible root causes.

    Spark Submit Command Line: If applicable, add your spark-submit command line.

    opened by mullerhai 3
  • Get stuck at "Added broadcast_0_piece0 in memory on" while running Spark standalone cluster

    Environment:

    • Python version 3.6.9
    • Spark version 3.0.0
    • TensorFlow version 2.6.2
    • TensorFlowOnSpark version 2.2.4
    • Cluster version Standalone, Hadoop 3.1.3

    Describe the bug: Gets stuck at "INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on" while running a Spark standalone cluster, training MNIST using Keras.

    Logs: 2021-12-27 10:51:01,579 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2021-12-27 10:51:03,012 INFO spark.SparkContext: Running Spark version 3.0.0 2021-12-27 10:51:03,044 INFO resource.ResourceUtils: ============================================================== 2021-12-27 10:51:03,045 INFO resource.ResourceUtils: Resources for spark.driver:

    2021-12-27 10:51:03,045 INFO resource.ResourceUtils: ============================================================== 2021-12-27 10:51:03,045 INFO spark.SparkContext: Submitted application: mnist_keras 2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing view acls to: amax 2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing modify acls to: amax 2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing view acls groups to: 2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing modify acls groups to: 2021-12-27 10:51:03,081 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set() 2021-12-27 10:51:03,232 INFO util.Utils: Successfully started service 'sparkDriver' on port 45175. 2021-12-27 10:51:03,255 INFO spark.SparkEnv: Registering MapOutputTracker 2021-12-27 10:51:03,275 INFO spark.SparkEnv: Registering BlockManagerMaster 2021-12-27 10:51:03,289 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 2021-12-27 10:51:03,289 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 2021-12-27 10:51:03,291 INFO spark.SparkEnv: Registering BlockManagerMasterHeartbeat 2021-12-27 10:51:03,298 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-1a9cd977-b478-4add-bb24-887f1cb3e057 2021-12-27 10:51:03,311 INFO memory.MemoryStore: MemoryStore started with capacity 14.2 GiB 2021-12-27 10:51:03,320 INFO spark.SparkEnv: Registering OutputCommitCoordinator 2021-12-27 10:51:03,401 INFO util.log: Logging initialized @2665ms to org.sparkproject.jetty.util.log.Slf4jLog 2021-12-27 10:51:03,443 INFO server.Server: jetty-9.4.z-SNAPSHOT; built: 2019-04-29T20:42:08.989Z; git: e1bc35120a6617ee3df052294e433f3a25ce7097; jvm 1.8.0_212-b10 2021-12-27 10:51:03,457 INFO server.Server: Started @2722ms 2021-12-27 10:51:03,474 INFO server.AbstractConnector: Started ServerConnector@1627cd3b{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} 2021-12-27 10:51:03,475 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 
2021-12-27 10:51:03,495 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5064c620{/jobs,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,496 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4a32f47a{/jobs/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,497 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3508d47e{/jobs/job,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,498 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7264cdd7{/jobs/job/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,499 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4a3d89f5{/stages,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,499 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7da95106{/stages/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,500 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46687b37{/stages/stage,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,501 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@656b0411{/stages/stage/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,501 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@60ad7cfe{/stages/pool,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46482ae5{/stages/pool/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1ef317fc{/storage,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@15492f0a{/storage/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,503 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@10ab3ccf{/storage/rdd,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,503 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@57093dec{/storage/rdd/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4debbe0{/environment,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@327dac75{/environment/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b91ba46{/executors,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,505 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@c385c3{/executors/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,505 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@74bf8e9e{/executors/threadDump,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,506 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@232553eb{/executors/threadDump/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,510 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b24a17f{/static,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,510 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1c7efc23{/,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,511 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@eeaa0a8{/api,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,511 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1823b1f1{/jobs/job/kill,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,512 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6c274337{/stages/stage/kill,null,AVAILABLE,@Spark} 2021-12-27 10:51:03,513 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://amax11:4040 2021-12-27 10:51:03,654 INFO 
client.StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.1.11:7077... 2021-12-27 10:51:03,711 INFO client.TransportClientFactory: Successfully created connection to /192.168.1.11:7077 after 36 ms (0 ms spent in bootstraps) 2021-12-27 10:51:03,793 INFO cluster.StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20211227105103-0014 2021-12-27 10:51:03,794 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/0 on worker-20211226140545-192.168.1.11-43273 (192.168.1.11:43273) with 1 core(s) 2021-12-27 10:51:03,796 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/0 on hostPort 192.168.1.11:43273 with 1 core(s), 27.0 GiB RAM 2021-12-27 10:51:03,796 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/1 on worker-20211226140545-192.168.1.7-36031 (192.168.1.7:36031) with 1 core(s) 2021-12-27 10:51:03,797 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/1 on hostPort 192.168.1.7:36031 with 1 core(s), 27.0 GiB RAM 2021-12-27 10:51:03,797 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/2 on worker-20211226140545-192.168.1.5-36787 (192.168.1.5:36787) with 1 core(s) 2021-12-27 10:51:03,798 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/2 on hostPort 192.168.1.5:36787 with 1 core(s), 27.0 GiB RAM 2021-12-27 10:51:03,801 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42229. 2021-12-27 10:51:03,801 INFO netty.NettyBlockTransferService: Server created on amax11:42229 2021-12-27 10:51:03,803 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 2021-12-27 10:51:03,812 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, amax11, 42229, None) 2021-12-27 10:51:03,815 INFO storage.BlockManagerMasterEndpoint: Registering block manager amax11:42229 with 14.2 GiB RAM, BlockManagerId(driver, amax11, 42229, None) 2021-12-27 10:51:03,818 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, amax11, 42229, None) 2021-12-27 10:51:03,819 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, amax11, 42229, None) 2021-12-27 10:51:03,822 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/0 is now RUNNING 2021-12-27 10:51:03,822 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/2 is now RUNNING 2021-12-27 10:51:03,823 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/1 is now RUNNING 2021-12-27 10:51:03,934 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5ab79e0c{/metrics/json,null,AVAILABLE,@Spark} 2021-12-27 10:51:04,320 INFO history.SingleEventLogFileWriter: Logging events to hdfs://amax11:8020/spark-sa-history/app-20211227105103-0014.inprogress 2021-12-27 10:51:04,521 INFO cluster.StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 args: Namespace(batch_size=64, cluster_size=3, epochs=3, export_dir='/home/amax/TOS/TensorFlowOnSpark/mnist_export', images_labels='/home/amax/TOS/TensorFlowOnSpark/data/mnist/csv/train', model_dir='/home/amax/TOS/TensorFlowOnSpark/mnist_model', tensorboard=False) 2021-12-27 10:51:04,712 INFO memory.MemoryStore: Block 
broadcast_0 stored as values in memory (estimated size 297.3 KiB, free 14.2 GiB) 2021-12-27 10:51:04,759 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.4 KiB, free 14.2 GiB) 2021-12-27 10:51:04,762 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on amax11:42229 (size: 27.4 KiB, free: 14.2 GiB) 2021-12-27 10:51:04,765 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:0 2021-12-27 10:51:04,916 INFO (MainThread-11873) Reserving TFSparkNodes 2021-12-27 10:51:04,917 INFO (MainThread-11873) cluster_template: {'chief': [0], 'worker': [1, 2]} 2021-12-27 10:51:04,919 INFO (MainThread-11873) Reservation server binding to port 0 2021-12-27 10:51:04,919 INFO (MainThread-11873) listening for reservations at ('192.168.1.11', 37823) 2021-12-27 10:51:04,919 INFO (MainThread-11873) Starting TensorFlow on executors 2021-12-27 10:51:04,924 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 27648, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 2021-12-27 10:51:04,928 INFO (MainThread-11873) Waiting for TFSparkNodes to start 2021-12-27 10:51:04,928 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:04,983 INFO spark.SparkContext: Starting job: foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327 2021-12-27 10:51:04,999 INFO scheduler.DAGScheduler: Got job 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) with 3 output partitions 2021-12-27 10:51:04,999 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) 2021-12-27 10:51:05,000 INFO scheduler.DAGScheduler: Parents of final stage: List() 2021-12-27 10:51:05,000 INFO scheduler.DAGScheduler: Missing parents: List() 2021-12-27 10:51:05,003 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[3] at foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327), which has no missing parents 2021-12-27 10:51:05,012 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 17.1 KiB, free 14.2 GiB) 2021-12-27 10:51:05,014 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.7 KiB, free 14.2 GiB) 2021-12-27 10:51:05,015 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on amax11:42229 (size: 11.7 KiB, free: 14.2 GiB) 2021-12-27 10:51:05,015 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1200 2021-12-27 10:51:05,025 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (PythonRDD[3] at foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) (first 15 tasks are for partitions Vector(0, 1, 2)) 2021-12-27 10:51:05,026 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 3 tasks 2021-12-27 10:51:05,245 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.5:52680) with ID 2 2021-12-27 10:51:05,279 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor 
NettyRpcEndpointRef(spark-client://Executor) (192.168.1.7:40504) with ID 1 2021-12-27 10:51:05,286 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.11:44508) with ID 0 2021-12-27 10:51:05,343 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.5:33393 with 14.2 GiB RAM, BlockManagerId(2, 192.168.1.5, 33393, None) 2021-12-27 10:51:05,361 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.7:38801 with 14.2 GiB RAM, BlockManagerId(1, 192.168.1.7, 38801, None) 2021-12-27 10:51:05,363 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:35421 with 14.2 GiB RAM, BlockManagerId(0, 192.168.1.11, 35421, None) 2021-12-27 10:51:05,391 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.5, executor 2, partition 0, PROCESS_LOCAL, 7337 bytes) 2021-12-27 10:51:05,398 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.1.7, executor 1, partition 1, PROCESS_LOCAL, 7337 bytes) 2021-12-27 10:51:05,399 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 192.168.1.11, executor 0, partition 2, PROCESS_LOCAL, 7337 bytes) 2021-12-27 10:51:05,559 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.7:38801 (size: 11.7 KiB, free: 14.2 GiB) 2021-12-27 10:51:05,560 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.11:35421 (size: 11.7 KiB, free: 14.2 GiB) 2021-12-27 10:51:05,566 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.5:33393 (size: 11.7 KiB, free: 14.2 GiB) 2021-12-27 10:51:05,930 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:06,931 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:07,932 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:08,934 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:09,935 INFO (MainThread-11873) waiting for 3 reservations 2021-12-27 10:51:10,936 INFO (MainThread-11873) all reservations completed 2021-12-27 10:51:10,937 INFO (MainThread-11873) All TFSparkNodes started 2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 1, 'host': '192.168.1.7', 'job_name': 'worker', 'task_index': 0, 'port': 36945, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-8lelocpu/listener-c8uef51e', 'authkey': b'n/\x82\x08F\x94L\xb0\xb5T\xd0\xf4\xb6\n\xb2y'} 2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'} 2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 0, 'host': '192.168.1.5', 'job_name': 'chief', 'task_index': 0, 'port': 44093, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-sevn96me/listener-coe66f9n', 'authkey': b'3\xd8\x85\xe7\x06UKa\x8b\xcb\xe6\x11\n\xa1)C'} 2021-12-27 10:51:10,937 INFO (MainThread-11873) Feeding training data 2021-12-27 10:51:11,023 INFO mapred.FileInputFormat: Total input files to process : 10 2021-12-27 10:51:11,061 INFO spark.SparkContext: Starting job: collect at PythonRDD.scala:168 2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Got job 1 (collect at PythonRDD.scala:168) with 30 output partitions 2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at PythonRDD.scala:168) 2021-12-27 10:51:11,064 INFO 
scheduler.DAGScheduler: Parents of final stage: List() 2021-12-27 10:51:11,065 INFO scheduler.DAGScheduler: Missing parents: List() 2021-12-27 10:51:11,066 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (PythonRDD[6] at RDD at PythonRDD.scala:53), which has no missing parents 2021-12-27 10:51:11,078 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.6 KiB, free 14.2 GiB) 2021-12-27 10:51:11,079 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.3 KiB, free 14.2 GiB) 2021-12-27 10:51:11,080 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on amax11:42229 (size: 7.3 KiB, free: 14.2 GiB) 2021-12-27 10:51:11,080 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1200 2021-12-27 10:51:11,081 INFO scheduler.DAGScheduler: Submitting 30 missing tasks from ResultStage 1 (PythonRDD[6] at RDD at PythonRDD.scala:53) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)) 2021-12-27 10:51:11,081 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 30 tasks 2021-12-27 10:51:15,330 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, 192.168.1.5, executor 2, partition 0, ANY, 7536 bytes) 2021-12-27 10:51:15,333 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 9956 ms on 192.168.1.5 (executor 2) (1/3) 2021-12-27 10:51:15,341 INFO python.PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 45655 2021-12-27 10:51:15,361 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.5:33393 (size: 7.3 KiB, free: 14.2 GiB) 2021-12-27 10:51:15,387 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.5:33393 (size: 27.4 KiB, free: 14.2 GiB) 2021-12-27 10:51:15,650 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, 192.168.1.11, executor 0, partition 1, ANY, 7536 bytes) 2021-12-27 10:51:15,650 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 10251 ms on 192.168.1.11 (executor 0) (2/3) 2021-12-27 10:51:15,691 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.11:35421 (size: 7.3 KiB, free: 14.2 GiB) 2021-12-27 10:51:15,712 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, 192.168.1.7, executor 1, partition 2, ANY, 7536 bytes) 2021-12-27 10:51:15,712 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 10314 ms on 192.168.1.7 (executor 1) (3/3) 2021-12-27 10:51:15,714 INFO scheduler.DAGScheduler: ResultStage 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) finished in 10.703 s 2021-12-27 10:51:15,714 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 2021-12-27 10:51:15,715 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.11:35421 (size: 27.4 KiB, free: 14.2 GiB) 2021-12-27 10:51:15,719 INFO scheduler.DAGScheduler: Job 0 is finished. 
Cancelling potential speculative or zombie tasks for this job 2021-12-27 10:51:15,719 INFO scheduler.TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished 2021-12-27 10:51:15,721 INFO scheduler.DAGScheduler: Job 0 finished: foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327, took 10.737965 s 2021-12-27 10:51:15,749 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.7:38801 (size: 7.3 KiB, free: 14.2 GiB) 2021-12-27 10:51:15,779 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.7:38801 (size: 27.4 KiB, free: 14.2 GiB)

    The logs from Spark UI: Spark Executor Command: "/opt/module/jdk1.8.0_212/bin/java" "-cp" "/opt/module/spark-standalone/conf/:/opt/module/spark-standalone/jars/*" "-Xmx27648M" "-Dspark.driver.port=45175" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@amax11:45175" "--executor-id" "0" "--hostname" "192.168.1.11" "--cores" "1" "--app-id" "app-20211227105103-0014" "--worker-url" "spark://[email protected]:43273"

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 21/12/27 10:51:04 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 11957@amax11 21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for TERM 21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for HUP 21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for INT 21/12/27 10:51:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 21/12/27 10:51:04 INFO SecurityManager: Changing view acls to: amax 21/12/27 10:51:04 INFO SecurityManager: Changing modify acls to: amax 21/12/27 10:51:04 INFO SecurityManager: Changing view acls groups to: 21/12/27 10:51:04 INFO SecurityManager: Changing modify acls groups to: 21/12/27 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set() 21/12/27 10:51:04 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:45175 after 54 ms (0 ms spent in bootstraps) 21/12/27 10:51:04 INFO SecurityManager: Changing view acls to: amax 21/12/27 10:51:04 INFO SecurityManager: Changing modify acls to: amax 21/12/27 10:51:04 INFO SecurityManager: Changing view acls groups to: 21/12/27 10:51:04 INFO SecurityManager: Changing modify acls groups to: 21/12/27 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set() 21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:45175 after 1 ms (0 ms spent in bootstraps) 21/12/27 10:51:05 INFO DiskBlockManager: Created local directory at /tmp/spark-66944227-f0bf-4829-b870-f5d85feae2cd/executor-2c04267d-7763-4e33-b90c-a6c1ead9f50b/blockmgr-6ac139bf-9f9c-496c-82de-a559a0202fad 21/12/27 10:51:05 INFO MemoryStore: MemoryStore started with capacity 14.2 GiB 21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@amax11:45175 21/12/27 10:51:05 INFO WorkerWatcher: Connecting to worker spark://[email protected]:43273 21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to /192.168.1.11:43273 after 1 ms (0 ms spent in bootstraps) 21/12/27 10:51:05 INFO WorkerWatcher: Successfully connected to spark://[email protected]:43273 21/12/27 10:51:05 INFO ResourceUtils: ============================================================== 21/12/27 10:51:05 INFO ResourceUtils: Resources for spark.executor:

    21/12/27 10:51:05 INFO ResourceUtils: ============================================================== 21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Successfully registered with driver 21/12/27 10:51:05 INFO Executor: Starting executor ID 0 on host 192.168.1.11 21/12/27 10:51:05 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35421. 21/12/27 10:51:05 INFO NettyBlockTransferService: Server created on 192.168.1.11:35421 21/12/27 10:51:05 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 21/12/27 10:51:05 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 192.168.1.11, 35421, None) 21/12/27 10:51:05 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 192.168.1.11, 35421, None) 21/12/27 10:51:05 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 192.168.1.11, 35421, None) 21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Got assigned task 2 21/12/27 10:51:05 INFO Executor: Running task 2.0 in stage 0.0 (TID 2) 21/12/27 10:51:05 INFO TorrentBroadcast: Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB) 21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:42229 after 2 ms (0 ms spent in bootstraps) 21/12/27 10:51:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.7 KiB, free 14.2 GiB) 21/12/27 10:51:05 INFO TorrentBroadcast: Reading broadcast variable 1 took 90 ms 21/12/27 10:51:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 17.1 KiB, free 14.2 GiB) 2021-12-27 10:51:10,558 INFO (MainThread-12074) Available GPUs: ['0'] 2021-12-27 10:51:10,559 INFO (MainThread-12074) Proposed GPUs: ['0'] 2021-12-27 10:51:10,559 INFO (MainThread-12074) Requested 1 GPU(s), setting CUDA_VISIBLE_DEVICES=0 2021-12-27 10:51:10,578 INFO (MainThread-12074) connected to server at ('192.168.1.11', 37823) 2021-12-27 10:51:10,578 INFO (MainThread-12074) TFSparkNode.reserve: {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'} 2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 0, 'host': '192.168.1.5', 'job_name': 'chief', 'task_index': 0, 'port': 44093, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-sevn96me/listener-coe66f9n', 'authkey': b'3\xd8\x85\xe7\x06UKa\x8b\xcb\xe6\x11\n\xa1)C'} 2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 1, 'host': '192.168.1.7', 'job_name': 'worker', 'task_index': 0, 'port': 36945, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-8lelocpu/listener-c8uef51e', 'authkey': b'n/\x82\x08F\x94L\xb0\xb5T\xd0\xf4\xb6\n\xb2y'} 2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'} 2021-12-27 10:51:12,582 INFO (MainThread-12074) export TF_CONFIG: {"cluster": {"chief": ["192.168.1.5:44093"], "worker": ["192.168.1.7:36945", "192.168.1.11:45615"]}, "task": {"type": "worker", "index": 1}, "environment": "cloud"} 2021-12-27 10:51:15,616 INFO (MainThread-12074) Available GPUs: ['0'] 2021-12-27 10:51:15,616 INFO (MainThread-12074) Proposed GPUs: ['0'] 
2021-12-27 10:51:15,616 INFO (MainThread-12074) Requested 1 GPU(s), setting CUDA_VISIBLE_DEVICES=0 2021-12-27 10:51:15,616 INFO (MainThread-12074) Starting TensorFlow worker:1 as worker on cluster node 2 on background process 21/12/27 10:51:15 INFO PythonRunner: Times: total = 9778, boot = 316, init = 1021, finish = 8441 2021-12-27 10:51:15,627 WARNING (MainThread-12152) From /home/amax/TOS/TensorFlowOnSpark/examples/mnist/keras/mnist_spark.py:11: _CollectiveAllReduceStrategyExperimental.init (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version. Instructions for updating: use distribute.MultiWorkerMirroredStrategy instead 21/12/27 10:51:15 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1549 bytes result sent to driver 21/12/27 10:51:15 INFO CoarseGrainedExecutorBackend: Got assigned task 4 21/12/27 10:51:15 INFO Executor: Running task 1.0 in stage 1.0 (TID 4) 21/12/27 10:51:15 INFO TorrentBroadcast: Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB) 21/12/27 10:51:15 INFO TransportClientFactory: Successfully created connection to /192.168.1.5:33393 after 1 ms (0 ms spent in bootstraps) 21/12/27 10:51:15 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.3 KiB, free 14.2 GiB) 21/12/27 10:51:15 INFO TorrentBroadcast: Reading broadcast variable 2 took 25 ms 21/12/27 10:51:15 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.6 KiB, free 14.2 GiB) 21/12/27 10:51:15 INFO HadoopRDD: Input split: hdfs://amax11:8020/home/amax/TOS/TensorFlowOnSpark/data/mnist/csv/train/part-00001:0+11232549 21/12/27 10:51:15 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB) 21/12/27 10:51:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.4 KiB, free 14.2 GiB) 21/12/27 10:51:15 INFO TorrentBroadcast: Reading broadcast variable 0 took 9 ms 21/12/27 10:51:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 426.4 KiB, free 14.2 GiB) 2021-12-27 10:51:16.542093: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 
2021-12-27 10:51:16.980995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8107 MB memory: -> device: 0, name: GeForce RTX 3080, pci bus id: 0000:65:00.0, compute capability: 8.6 2021-12-27 10:51:16.989052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:1/device:GPU:0 with 8107 MB memory: -> device: 0, name: GeForce RTX 3080, pci bus id: 0000:65:00.0, compute capability: 8.6 2021-12-27 10:51:16.994323: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job chief -> {0 -> 192.168.1.5:44093} 2021-12-27 10:51:16.994344: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 192.168.1.7:36945, 1 -> 192.168.1.11:45615} 2021-12-27 10:51:16.995505: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://192.168.1.11:45615 2021-12-27 10:51:17,003 INFO (MainThread-12152) Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:1/device:CPU:0', '/job:worker/replica:0/task:1/device:GPU:0'] 2021-12-27 10:51:17,104 INFO (MainThread-12152) Waiting for the cluster, timeout = inf 2021-12-27 10:51:17,194 INFO (MainThread-12169) Connected to TFSparkNode.mgr on 192.168.1.11, executor=2, state='running' 2021-12-27 10:51:17,201 INFO (MainThread-12169) mgr.state='running' 2021-12-27 10:51:17,201 INFO (MainThread-12169) Feeding partition <itertools.chain object at 0x7f8091c66d30> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f7fe8a757f0> 21/12/27 10:51:17 INFO PythonRunner: Times: total = 772, boot = -382, init = 438, finish = 716

    Spark Submit Command Line:

    ${SPARK_HOME}/bin/spark-submit \
    --master ${MASTER} \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --executor-memory 27G \
    ${TFoS_HOME}/examples/mnist/keras/mnist_spark.py \
    --cluster_size ${SPARK_WORKER_INSTANCES} \
    --images_labels ${TFoS_HOME}/data/mnist/csv/train \
    --model_dir ${TFoS_HOME}/mnist_model \
    --export_dir ${TFoS_HOME}/mnist_export

    opened by icszhr 1
  • How to integrate a model into a Spark cluster

    How can I integrate a model into a Spark cluster in practice? I have a deep learning (TensorFlow, Python) model that I would like to integrate with a Spark cluster to run some experiments. Can anyone suggest steps to follow?

    opened by jahidhasanlinix 12
Releases (v2.2.5)
  • v2.2.5(Apr 21, 2022)

  • v2.2.4(May 25, 2021)

    • Added option to defer releasing temporary socket/port to user map_function for cases where user code may not bind to the assigned port soon enough to avoid other processes binding to the same port, e.g. extensive pre-processing before invoking TF APIs.
    • Updated screwdriver.cd build template.
    • Trigger documentation publish after PyPI push.
  • v2.2.3(Mar 23, 2021)

  • v2.2.2(Dec 18, 2020)

  • v2.2.1(Mar 16, 2020)

    • Added support for port ranges in TFOS_SERVER_PORT environment variable.
    • Updated mnist/keras/mnist_tf.py example with workaround for tensorflow datasets issue.
    • Added more detailed error message for missing executor_id.
    • Added unit tests for gpu allocation variants.
  • v2.2.0(Feb 19, 2020)

    • Added support for Spark 3.0 GPU resources
    • Updated to support Spark 2.4.5
    • Fixed dataset ordering in mnist_inference.py examples (thanks to @qsbao)
    • Added optional environment variables to configure TF server/grpc ports and TensorBoard ports on executors
    • Fixed bug with TFNode.start_cluster_server in backwards-compatibility code for TF1.x
    • Fixed file conflict issue with compat.export_saved_model in TF2.1
    • Removed support for Python 2.x
  • v2.1.3(Jan 22, 2020)

  • v2.1.2(Jan 10, 2020)

  • v2.1.1(Jan 9, 2020)

    • added compat.is_gpu_available() method to use:
      • tf.config.list_logical_devices('GPU') (for TF2.1)
      • tf.test.is_cuda_available() (for earlier versions of TF).
    • added ability to launch TensorBoard on chief:0 or master:0 nodes (for small clusters without worker nodes).
  • v2.1.0(Dec 9, 2019)

    • Added compat module to manage minor API changes in TensorFlow.
    • Added compatibility for TF2.1.0rc0 (exporting saved_models and configuring auto-shard policy)
    • Re-introduced compatibility for TF1.x (except support for InputMode.TENSORFLOW in the ML Pipeline API).
    • Added TFParallel class for parallelized single-node inferencing via Spark executors.
    • Updated examples for TF API changes.
    • Updated to use module-level loggers.
  • v2.0.0(Oct 2, 2019)

    • initial release compatible with TensorFlow 2.x.
    • API changes:
      • removed TFNode.start_cluster_server, which is not required for tf.keras and tf.estimator.
      • removed TFNode.export_saved_model, which can be replaced by TF native APIs now.
      • added TFNodeContext.num_workers to count master, chief, and worker nodes.
    • Spark ML Pipeline API changes:
      • Scala API has been removed for now, since the Java library for TensorFlow 2.0 is not available yet.
      • removed InputMode.TENSORFLOW support for ML Pipelines, since the input data is always a Spark DataFrame for this API.
      • added HasMasterNode and HasGraceSecs params.
      • removed optional export_fn argument for Spark ML TFEstimator (use TF export APIs instead).
      • added standard default values for signature_def_key and tag_set for Spark ML TFModel.
      • modified inferencing code in TFModel for TF2.x APIs.
    • older TF 1.x examples have been replaced with TF 2.x compatible examples.
  • v1.4.4(Sep 30, 2019)

    • last expected release compatible with TensorFlow 1.x (aside from any critical fixes), since the master branch will be moving to TF 2.0 compatibility.
    • handle multiple outputs with signaturedef (thanks to @markromedia).
    • handle exceptions after data feeding.
    • moved API docs to sphinx_rtd_theme.
    • updated to Spark 2.4.4.
  • v1.4.3(Apr 6, 2019)

    • removed tensorflow as a dependency, in order to support other variants like tensorflow-gpu or tf-nightly.
    • allow use of evaluator node type in cluster (thanks to @bbshetty)
    • refactored cluster template generation.
    • updated wide-deep example to use models/official code.
    • restore termination of feed in mnist/spark example.
    • updated sample notebook instructions.
    • updated to use Spark 2.3.3.
  • v1.4.2(Jan 22, 2019)

    • Set TF_CONFIG for "chief" clusters (required by DistributionStrategy APIs)
    • Fix GPU allocation for multi-gpu nodes
    • Updated examples for MNIST
    • Updated Hadoop and Spark dependency versions
  • v1.4.1(Dec 3, 2018)

    • Added util.single_node_env(), which can be used to initialize the environment (HDFS compatibility + GPU allocation) for running a single-node instance of TensorFlow on the Spark driver.
    • Added an example of parallelized inferencing from a pre-trained SavedModel.
  • v1.4.0(Nov 16, 2018)

    • More deterministic GPU allocation for multi-GPU nodes.
    • Added timeout argument to TFCluster.shutdown() (default is 3 days). This is intended to shutdown the Spark application in the event that any of the TF nodes hang for any reason. Set to -1 to disable timeout.
    • Added ability to start reservation server on a specific port (contributed by @AvihayTsayeg).
    • Updated pipeline API for latest TF APIs (contributed by @AvihayTsayeg)
    • Added unit test for tf.SparseTensor support.
    • Updated examples to latest TF APIs (including workaround for https://github.com/tensorflow/tensorflow/issues/21745).
    • Updated Spark version dependency for Scala Inferencing API.
    • Added __version__ to module.
  • v1.3.4(Sep 27, 2018)

  • v1.3.3(Sep 6, 2018)

    • Only set TF_CONFIG environment variable if cluster_spec has a "master", i.e. when using tf.estimator.
    • Updated mnist/keras/mnist_mlp_estimator.py with example of distributed/parallel inferencing via estimator.predict.
    • Added optional feed_timeout argument to TFCluster.train() for InputMode.SPARK.
    • Added optional grace_secs argument to TFCluster.shutdown().
    • Workaround for firewall proxy issue with get_ip_address (contributed by @viplav).
    • Add support for all Hadoop-compatible File System schemes (contributed by @vishnu2kmohan).
    • Added error messages to assert statements.
    • Initial Travis CI integration.
  • v1.3.2(Jul 13, 2018)

    • add grace period to TFCluster.shutdown()
    • add wide & deep example (contributed by @crafet)
    • update mnist/pipeline examples to tf.data, add instructions, and misc code cleanup (from @yileic)
    • parameterize versions in pom.xml and code cleanup (from @tmielika)
    • update Scala Inferencing pom.xml to latest tensorflow-hadoop artifact (contributed by @psuszyns)
  • v1.3.1(Jul 13, 2018)

    • Add keras/estimator example
    • Update original keras example to latest tf.keras apis
    • Update Scala Inferencing pom.xml to latest TF java version
    • Allow PS to use CPU on TF-GPU builds (contributed by @dratini6)
    • More pep8
    • More py2/py3 compat
  • v1.3.0(Apr 1, 2018)

    • support tf.estimator.train_and_evaluate() API
    • use local file instead of ppid to uniquely identify executors
    • surface GPU allocation errors more readily
    • add sharding, epochs, and shuffling to mnist Dataset example.
    • TFoS example for criteo data (contributed by @amantrac)
    • use tf.train.MonitoredTrainingSession in mnist/spark example (contributed by @wuyifan18)
  • v1.2.1(Apr 1, 2018)

    • Error handling for TF exceptions in InputMode.SPARK (from @eordentlich).
    • Add timeout for reservations (from @eordentlich).
    • Errors will exit Spark job with non-zero exit code.
    • Fix regression in pipeline API.
    • Model export tool
  • v1.2.0(Apr 1, 2018)

Owner
Yahoo
Yahoo is a Verizon Media brand. This organization is the home to many of the active open source projects published by engineers at Yahoo and Verizon Media.