Environment:
- Python version: 3.6.9
- Spark version: 3.0.0
- TensorFlow version: 2.6.2
- TensorFlowOnSpark version: 2.2.4
- Cluster: Spark standalone, Hadoop 3.1.3
Describe the bug:
Training MNIST with Keras on a Spark standalone cluster gets stuck at "INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ...". As the driver log below shows, all three TFSparkNodes are reserved and the training data starts feeding, but the job never progresses past the last broadcast messages.
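For context, the job drives training through TFCluster. A minimal sketch of the wiring in examples/mnist/keras/mnist_spark.py, paraphrased from the TensorFlowOnSpark 2.2 API (here main_fun stands for the example's per-node training function and images_labels for the RDD parsed from the CSV files):

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from tensorflowonspark import TFCluster

sc = SparkContext(conf=SparkConf().setAppName("mnist_keras"))

# Reserve one TF node per executor (cluster_size=3 -> 1 chief + 2 workers),
# then feed CSV partitions into the running nodes (InputMode.SPARK).
cluster = TFCluster.run(sc, main_fun, args, args.cluster_size, num_ps=0,
                        tensorboard=args.tensorboard,
                        input_mode=TFCluster.InputMode.SPARK,
                        master_node='chief')
cluster.train(images_labels, args.epochs)
cluster.shutdown()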
Driver logs:
2021-12-27 10:51:01,579 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-12-27 10:51:03,012 INFO spark.SparkContext: Running Spark version 3.0.0
2021-12-27 10:51:03,044 INFO resource.ResourceUtils: ==============================================================
2021-12-27 10:51:03,045 INFO resource.ResourceUtils: Resources for spark.driver:
2021-12-27 10:51:03,045 INFO resource.ResourceUtils: ==============================================================
2021-12-27 10:51:03,045 INFO spark.SparkContext: Submitted application: mnist_keras
2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing view acls to: amax
2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing modify acls to: amax
2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing view acls groups to:
2021-12-27 10:51:03,081 INFO spark.SecurityManager: Changing modify acls groups to:
2021-12-27 10:51:03,081 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set()
2021-12-27 10:51:03,232 INFO util.Utils: Successfully started service 'sparkDriver' on port 45175.
2021-12-27 10:51:03,255 INFO spark.SparkEnv: Registering MapOutputTracker
2021-12-27 10:51:03,275 INFO spark.SparkEnv: Registering BlockManagerMaster
2021-12-27 10:51:03,289 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2021-12-27 10:51:03,289 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
2021-12-27 10:51:03,291 INFO spark.SparkEnv: Registering BlockManagerMasterHeartbeat
2021-12-27 10:51:03,298 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-1a9cd977-b478-4add-bb24-887f1cb3e057
2021-12-27 10:51:03,311 INFO memory.MemoryStore: MemoryStore started with capacity 14.2 GiB
2021-12-27 10:51:03,320 INFO spark.SparkEnv: Registering OutputCommitCoordinator
2021-12-27 10:51:03,401 INFO util.log: Logging initialized @2665ms to org.sparkproject.jetty.util.log.Slf4jLog
2021-12-27 10:51:03,443 INFO server.Server: jetty-9.4.z-SNAPSHOT; built: 2019-04-29T20:42:08.989Z; git: e1bc35120a6617ee3df052294e433f3a25ce7097; jvm 1.8.0_212-b10
2021-12-27 10:51:03,457 INFO server.Server: Started @2722ms
2021-12-27 10:51:03,474 INFO server.AbstractConnector: Started ServerConnector@1627cd3b{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2021-12-27 10:51:03,475 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
2021-12-27 10:51:03,495 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5064c620{/jobs,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,496 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4a32f47a{/jobs/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,497 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3508d47e{/jobs/job,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,498 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7264cdd7{/jobs/job/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,499 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4a3d89f5{/stages,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,499 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7da95106{/stages/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,500 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46687b37{/stages/stage,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,501 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@656b0411{/stages/stage/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,501 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@60ad7cfe{/stages/pool,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46482ae5{/stages/pool/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1ef317fc{/storage,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,502 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@15492f0a{/storage/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,503 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@10ab3ccf{/storage/rdd,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,503 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@57093dec{/storage/rdd/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4debbe0{/environment,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@327dac75{/environment/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,504 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b91ba46{/executors,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,505 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@c385c3{/executors/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,505 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@74bf8e9e{/executors/threadDump,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,506 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@232553eb{/executors/threadDump/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,510 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b24a17f{/static,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,510 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1c7efc23{/,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,511 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@eeaa0a8{/api,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,511 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1823b1f1{/jobs/job/kill,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,512 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6c274337{/stages/stage/kill,null,AVAILABLE,@Spark}
2021-12-27 10:51:03,513 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://amax11:4040
2021-12-27 10:51:03,654 INFO client.StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.1.11:7077...
2021-12-27 10:51:03,711 INFO client.TransportClientFactory: Successfully created connection to /192.168.1.11:7077 after 36 ms (0 ms spent in bootstraps)
2021-12-27 10:51:03,793 INFO cluster.StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20211227105103-0014
2021-12-27 10:51:03,794 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/0 on worker-20211226140545-192.168.1.11-43273 (192.168.1.11:43273) with 1 core(s)
2021-12-27 10:51:03,796 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/0 on hostPort 192.168.1.11:43273 with 1 core(s), 27.0 GiB RAM
2021-12-27 10:51:03,796 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/1 on worker-20211226140545-192.168.1.7-36031 (192.168.1.7:36031) with 1 core(s)
2021-12-27 10:51:03,797 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/1 on hostPort 192.168.1.7:36031 with 1 core(s), 27.0 GiB RAM
2021-12-27 10:51:03,797 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211227105103-0014/2 on worker-20211226140545-192.168.1.5-36787 (192.168.1.5:36787) with 1 core(s)
2021-12-27 10:51:03,798 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211227105103-0014/2 on hostPort 192.168.1.5:36787 with 1 core(s), 27.0 GiB RAM
2021-12-27 10:51:03,801 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42229.
2021-12-27 10:51:03,801 INFO netty.NettyBlockTransferService: Server created on amax11:42229
2021-12-27 10:51:03,803 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2021-12-27 10:51:03,812 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, amax11, 42229, None)
2021-12-27 10:51:03,815 INFO storage.BlockManagerMasterEndpoint: Registering block manager amax11:42229 with 14.2 GiB RAM, BlockManagerId(driver, amax11, 42229, None)
2021-12-27 10:51:03,818 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, amax11, 42229, None)
2021-12-27 10:51:03,819 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, amax11, 42229, None)
2021-12-27 10:51:03,822 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/0 is now RUNNING
2021-12-27 10:51:03,822 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/2 is now RUNNING
2021-12-27 10:51:03,823 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211227105103-0014/1 is now RUNNING
2021-12-27 10:51:03,934 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5ab79e0c{/metrics/json,null,AVAILABLE,@Spark}
2021-12-27 10:51:04,320 INFO history.SingleEventLogFileWriter: Logging events to hdfs://amax11:8020/spark-sa-history/app-20211227105103-0014.inprogress
2021-12-27 10:51:04,521 INFO cluster.StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
args: Namespace(batch_size=64, cluster_size=3, epochs=3, export_dir='/home/amax/TOS/TensorFlowOnSpark/mnist_export', images_labels='/home/amax/TOS/TensorFlowOnSpark/data/mnist/csv/train', model_dir='/home/amax/TOS/TensorFlowOnSpark/mnist_model', tensorboard=False)
2021-12-27 10:51:04,712 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 297.3 KiB, free 14.2 GiB)
2021-12-27 10:51:04,759 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.4 KiB, free 14.2 GiB)
2021-12-27 10:51:04,762 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on amax11:42229 (size: 27.4 KiB, free: 14.2 GiB)
2021-12-27 10:51:04,765 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:0
2021-12-27 10:51:04,916 INFO (MainThread-11873) Reserving TFSparkNodes
2021-12-27 10:51:04,917 INFO (MainThread-11873) cluster_template: {'chief': [0], 'worker': [1, 2]}
2021-12-27 10:51:04,919 INFO (MainThread-11873) Reservation server binding to port 0
2021-12-27 10:51:04,919 INFO (MainThread-11873) listening for reservations at ('192.168.1.11', 37823)
2021-12-27 10:51:04,919 INFO (MainThread-11873) Starting TensorFlow on executors
2021-12-27 10:51:04,924 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 27648, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2021-12-27 10:51:04,928 INFO (MainThread-11873) Waiting for TFSparkNodes to start
2021-12-27 10:51:04,928 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:04,983 INFO spark.SparkContext: Starting job: foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327
2021-12-27 10:51:04,999 INFO scheduler.DAGScheduler: Got job 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) with 3 output partitions
2021-12-27 10:51:04,999 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327)
2021-12-27 10:51:05,000 INFO scheduler.DAGScheduler: Parents of final stage: List()
2021-12-27 10:51:05,000 INFO scheduler.DAGScheduler: Missing parents: List()
2021-12-27 10:51:05,003 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[3] at foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327), which has no missing parents
2021-12-27 10:51:05,012 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 17.1 KiB, free 14.2 GiB)
2021-12-27 10:51:05,014 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.7 KiB, free 14.2 GiB)
2021-12-27 10:51:05,015 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on amax11:42229 (size: 11.7 KiB, free: 14.2 GiB)
2021-12-27 10:51:05,015 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1200
2021-12-27 10:51:05,025 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (PythonRDD[3] at foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) (first 15 tasks are for partitions Vector(0, 1, 2))
2021-12-27 10:51:05,026 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
2021-12-27 10:51:05,245 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.5:52680) with ID 2
2021-12-27 10:51:05,279 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.7:40504) with ID 1
2021-12-27 10:51:05,286 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.1.11:44508) with ID 0
2021-12-27 10:51:05,343 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.5:33393 with 14.2 GiB RAM, BlockManagerId(2, 192.168.1.5, 33393, None)
2021-12-27 10:51:05,361 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.7:38801 with 14.2 GiB RAM, BlockManagerId(1, 192.168.1.7, 38801, None)
2021-12-27 10:51:05,363 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.11:35421 with 14.2 GiB RAM, BlockManagerId(0, 192.168.1.11, 35421, None)
2021-12-27 10:51:05,391 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.5, executor 2, partition 0, PROCESS_LOCAL, 7337 bytes)
2021-12-27 10:51:05,398 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.1.7, executor 1, partition 1, PROCESS_LOCAL, 7337 bytes)
2021-12-27 10:51:05,399 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 192.168.1.11, executor 0, partition 2, PROCESS_LOCAL, 7337 bytes)
2021-12-27 10:51:05,559 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.7:38801 (size: 11.7 KiB, free: 14.2 GiB)
2021-12-27 10:51:05,560 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.11:35421 (size: 11.7 KiB, free: 14.2 GiB)
2021-12-27 10:51:05,566 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.5:33393 (size: 11.7 KiB, free: 14.2 GiB)
2021-12-27 10:51:05,930 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:06,931 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:07,932 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:08,934 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:09,935 INFO (MainThread-11873) waiting for 3 reservations
2021-12-27 10:51:10,936 INFO (MainThread-11873) all reservations completed
2021-12-27 10:51:10,937 INFO (MainThread-11873) All TFSparkNodes started
2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 1, 'host': '192.168.1.7', 'job_name': 'worker', 'task_index': 0, 'port': 36945, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-8lelocpu/listener-c8uef51e', 'authkey': b'n/\x82\x08F\x94L\xb0\xb5T\xd0\xf4\xb6\n\xb2y'}
2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'}
2021-12-27 10:51:10,937 INFO (MainThread-11873) {'executor_id': 0, 'host': '192.168.1.5', 'job_name': 'chief', 'task_index': 0, 'port': 44093, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-sevn96me/listener-coe66f9n', 'authkey': b'3\xd8\x85\xe7\x06UKa\x8b\xcb\xe6\x11\n\xa1)C'}
2021-12-27 10:51:10,937 INFO (MainThread-11873) Feeding training data
2021-12-27 10:51:11,023 INFO mapred.FileInputFormat: Total input files to process : 10
2021-12-27 10:51:11,061 INFO spark.SparkContext: Starting job: collect at PythonRDD.scala:168
2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Got job 1 (collect at PythonRDD.scala:168) with 30 output partitions
2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at PythonRDD.scala:168)
2021-12-27 10:51:11,064 INFO scheduler.DAGScheduler: Parents of final stage: List()
2021-12-27 10:51:11,065 INFO scheduler.DAGScheduler: Missing parents: List()
2021-12-27 10:51:11,066 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (PythonRDD[6] at RDD at PythonRDD.scala:53), which has no missing parents
2021-12-27 10:51:11,078 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.6 KiB, free 14.2 GiB)
2021-12-27 10:51:11,079 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.3 KiB, free 14.2 GiB)
2021-12-27 10:51:11,080 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on amax11:42229 (size: 7.3 KiB, free: 14.2 GiB)
2021-12-27 10:51:11,080 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1200
2021-12-27 10:51:11,081 INFO scheduler.DAGScheduler: Submitting 30 missing tasks from ResultStage 1 (PythonRDD[6] at RDD at PythonRDD.scala:53) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2021-12-27 10:51:11,081 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 30 tasks
2021-12-27 10:51:15,330 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, 192.168.1.5, executor 2, partition 0, ANY, 7536 bytes)
2021-12-27 10:51:15,333 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 9956 ms on 192.168.1.5 (executor 2) (1/3)
2021-12-27 10:51:15,341 INFO python.PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 45655
2021-12-27 10:51:15,361 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.5:33393 (size: 7.3 KiB, free: 14.2 GiB)
2021-12-27 10:51:15,387 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.5:33393 (size: 27.4 KiB, free: 14.2 GiB)
2021-12-27 10:51:15,650 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, 192.168.1.11, executor 0, partition 1, ANY, 7536 bytes)
2021-12-27 10:51:15,650 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 10251 ms on 192.168.1.11 (executor 0) (2/3)
2021-12-27 10:51:15,691 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.11:35421 (size: 7.3 KiB, free: 14.2 GiB)
2021-12-27 10:51:15,712 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, 192.168.1.7, executor 1, partition 2, ANY, 7536 bytes)
2021-12-27 10:51:15,712 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 10314 ms on 192.168.1.7 (executor 1) (3/3)
2021-12-27 10:51:15,714 INFO scheduler.DAGScheduler: ResultStage 0 (foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327) finished in 10.703 s
2021-12-27 10:51:15,714 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
2021-12-27 10:51:15,715 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.11:35421 (size: 27.4 KiB, free: 14.2 GiB)
2021-12-27 10:51:15,719 INFO scheduler.DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
2021-12-27 10:51:15,719 INFO scheduler.TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
2021-12-27 10:51:15,721 INFO scheduler.DAGScheduler: Job 0 finished: foreachPartition at /home/amax/anaconda3/envs/tos_sa/lib/python3.6/site-packages/tensorflowonspark/TFCluster.py:327, took 10.737965 s
2021-12-27 10:51:15,749 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.7:38801 (size: 7.3 KiB, free: 14.2 GiB)
2021-12-27 10:51:15,779 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.7:38801 (size: 27.4 KiB, free: 14.2 GiB)
Executor logs (from the Spark UI):
Spark Executor Command: "/opt/module/jdk1.8.0_212/bin/java" "-cp" "/opt/module/spark-standalone/conf/:/opt/module/spark-standalone/jars/*" "-Xmx27648M" "-Dspark.driver.port=45175" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@amax11:45175" "--executor-id" "0" "--hostname" "192.168.1.11" "--cores" "1" "--app-id" "app-20211227105103-0014" "--worker-url" "spark://Worker@192.168.1.11:43273"
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/12/27 10:51:04 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 11957@amax11
21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for TERM
21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for HUP
21/12/27 10:51:04 INFO SignalUtils: Registered signal handler for INT
21/12/27 10:51:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/12/27 10:51:04 INFO SecurityManager: Changing view acls to: amax
21/12/27 10:51:04 INFO SecurityManager: Changing modify acls to: amax
21/12/27 10:51:04 INFO SecurityManager: Changing view acls groups to:
21/12/27 10:51:04 INFO SecurityManager: Changing modify acls groups to:
21/12/27 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set()
21/12/27 10:51:04 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:45175 after 54 ms (0 ms spent in bootstraps)
21/12/27 10:51:04 INFO SecurityManager: Changing view acls to: amax
21/12/27 10:51:04 INFO SecurityManager: Changing modify acls to: amax
21/12/27 10:51:04 INFO SecurityManager: Changing view acls groups to:
21/12/27 10:51:04 INFO SecurityManager: Changing modify acls groups to:
21/12/27 10:51:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(amax); groups with view permissions: Set(); users with modify permissions: Set(amax); groups with modify permissions: Set()
21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:45175 after 1 ms (0 ms spent in bootstraps)
21/12/27 10:51:05 INFO DiskBlockManager: Created local directory at /tmp/spark-66944227-f0bf-4829-b870-f5d85feae2cd/executor-2c04267d-7763-4e33-b90c-a6c1ead9f50b/blockmgr-6ac139bf-9f9c-496c-82de-a559a0202fad
21/12/27 10:51:05 INFO MemoryStore: MemoryStore started with capacity 14.2 GiB
21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@amax11:45175
21/12/27 10:51:05 INFO WorkerWatcher: Connecting to worker spark://Worker@192.168.1.11:43273
21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to /192.168.1.11:43273 after 1 ms (0 ms spent in bootstraps)
21/12/27 10:51:05 INFO WorkerWatcher: Successfully connected to spark://Worker@192.168.1.11:43273
21/12/27 10:51:05 INFO ResourceUtils: ==============================================================
21/12/27 10:51:05 INFO ResourceUtils: Resources for spark.executor:
21/12/27 10:51:05 INFO ResourceUtils: ==============================================================
21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
21/12/27 10:51:05 INFO Executor: Starting executor ID 0 on host 192.168.1.11
21/12/27 10:51:05 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35421.
21/12/27 10:51:05 INFO NettyBlockTransferService: Server created on 192.168.1.11:35421
21/12/27 10:51:05 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/12/27 10:51:05 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 192.168.1.11, 35421, None)
21/12/27 10:51:05 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 192.168.1.11, 35421, None)
21/12/27 10:51:05 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 192.168.1.11, 35421, None)
21/12/27 10:51:05 INFO CoarseGrainedExecutorBackend: Got assigned task 2
21/12/27 10:51:05 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
21/12/27 10:51:05 INFO TorrentBroadcast: Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB)
21/12/27 10:51:05 INFO TransportClientFactory: Successfully created connection to amax11/192.168.1.11:42229 after 2 ms (0 ms spent in bootstraps)
21/12/27 10:51:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 11.7 KiB, free 14.2 GiB)
21/12/27 10:51:05 INFO TorrentBroadcast: Reading broadcast variable 1 took 90 ms
21/12/27 10:51:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 17.1 KiB, free 14.2 GiB)
2021-12-27 10:51:10,558 INFO (MainThread-12074) Available GPUs: ['0']
2021-12-27 10:51:10,559 INFO (MainThread-12074) Proposed GPUs: ['0']
2021-12-27 10:51:10,559 INFO (MainThread-12074) Requested 1 GPU(s), setting CUDA_VISIBLE_DEVICES=0
2021-12-27 10:51:10,578 INFO (MainThread-12074) connected to server at ('192.168.1.11', 37823)
2021-12-27 10:51:10,578 INFO (MainThread-12074) TFSparkNode.reserve: {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'}
2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 0, 'host': '192.168.1.5', 'job_name': 'chief', 'task_index': 0, 'port': 44093, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-sevn96me/listener-coe66f9n', 'authkey': b'3\xd8\x85\xe7\x06UKa\x8b\xcb\xe6\x11\n\xa1)C'}
2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 1, 'host': '192.168.1.7', 'job_name': 'worker', 'task_index': 0, 'port': 36945, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-8lelocpu/listener-c8uef51e', 'authkey': b'n/\x82\x08F\x94L\xb0\xb5T\xd0\xf4\xb6\n\xb2y'}
2021-12-27 10:51:12,582 INFO (MainThread-12074) node: {'executor_id': 2, 'host': '192.168.1.11', 'job_name': 'worker', 'task_index': 1, 'port': 45615, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-k4cka2ni/listener-um3u1iby', 'authkey': b'\x9e\xfa+\x93\xb2bFx\xa1\x84pQ\xba\xef=6'}
2021-12-27 10:51:12,582 INFO (MainThread-12074) export TF_CONFIG: {"cluster": {"chief": ["192.168.1.5:44093"], "worker": ["192.168.1.7:36945", "192.168.1.11:45615"]}, "task": {"type": "worker", "index": 1}, "environment": "cloud"}
2021-12-27 10:51:15,616 INFO (MainThread-12074) Available GPUs: ['0']
2021-12-27 10:51:15,616 INFO (MainThread-12074) Proposed GPUs: ['0']
2021-12-27 10:51:15,616 INFO (MainThread-12074) Requested 1 GPU(s), setting CUDA_VISIBLE_DEVICES=0
2021-12-27 10:51:15,616 INFO (MainThread-12074) Starting TensorFlow worker:1 as worker on cluster node 2 on background process
21/12/27 10:51:15 INFO PythonRunner: Times: total = 9778, boot = 316, init = 1021, finish = 8441
2021-12-27 10:51:15,627 WARNING (MainThread-12152) From /home/amax/TOS/TensorFlowOnSpark/examples/mnist/keras/mnist_spark.py:11: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
Instructions for updating:
use distribute.MultiWorkerMirroredStrategy instead
21/12/27 10:51:15 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1549 bytes result sent to driver
21/12/27 10:51:15 INFO CoarseGrainedExecutorBackend: Got assigned task 4
21/12/27 10:51:15 INFO Executor: Running task 1.0 in stage 1.0 (TID 4)
21/12/27 10:51:15 INFO TorrentBroadcast: Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB)
21/12/27 10:51:15 INFO TransportClientFactory: Successfully created connection to /192.168.1.5:33393 after 1 ms (0 ms spent in bootstraps)
21/12/27 10:51:15 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.3 KiB, free 14.2 GiB)
21/12/27 10:51:15 INFO TorrentBroadcast: Reading broadcast variable 2 took 25 ms
21/12/27 10:51:15 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 12.6 KiB, free 14.2 GiB)
21/12/27 10:51:15 INFO HadoopRDD: Input split: hdfs://amax11:8020/home/amax/TOS/TensorFlowOnSpark/data/mnist/csv/train/part-00001:0+11232549
21/12/27 10:51:15 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
21/12/27 10:51:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.4 KiB, free 14.2 GiB)
21/12/27 10:51:15 INFO TorrentBroadcast: Reading broadcast variable 0 took 9 ms
21/12/27 10:51:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 426.4 KiB, free 14.2 GiB)
2021-12-27 10:51:16.542093: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-12-27 10:51:16.980995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8107 MB memory: -> device: 0, name: GeForce RTX 3080, pci bus id: 0000:65:00.0, compute capability: 8.6
2021-12-27 10:51:16.989052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:worker/replica:0/task:1/device:GPU:0 with 8107 MB memory: -> device: 0, name: GeForce RTX 3080, pci bus id: 0000:65:00.0, compute capability: 8.6
2021-12-27 10:51:16.994323: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job chief -> {0 -> 192.168.1.5:44093}
2021-12-27 10:51:16.994344: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 192.168.1.7:36945, 1 -> 192.168.1.11:45615}
2021-12-27 10:51:16.995505: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427] Started server with target: grpc://192.168.1.11:45615
2021-12-27 10:51:17,003 INFO (MainThread-12152) Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:1/device:CPU:0', '/job:worker/replica:0/task:1/device:GPU:0']
2021-12-27 10:51:17,104 INFO (MainThread-12152) Waiting for the cluster, timeout = inf
2021-12-27 10:51:17,194 INFO (MainThread-12169) Connected to TFSparkNode.mgr on 192.168.1.11, executor=2, state='running'
2021-12-27 10:51:17,201 INFO (MainThread-12169) mgr.state='running'
2021-12-27 10:51:17,201 INFO (MainThread-12169) Feeding partition <itertools.chain object at 0x7f8091c66d30> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f7fe8a757f0>
21/12/27 10:51:17 INFO PythonRunner: Times: total = 772, boot = -382, init = 438, finish = 716
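As a side note, the deprecation warning at 10:51:15 above says the experimental strategy constructor should be replaced. A hedged sketch of that change, assuming the strategy is constructed directly inside the example's training function (build_and_compile_model is a hypothetical stand-in for the example's Keras model code, not taken from the log):

import tensorflow as tf

# Deprecated spelling that triggers the warning above:
# strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Replacement suggested by the warning message (graduated API in TF 2.6):
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = build_and_compile_model()  # hypothetical helper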
Spark Submit Command Line:
${SPARK_HOME}/bin/spark-submit \
  --master ${MASTER} \
  --conf spark.cores.max=${TOTAL_CORES} \
  --conf spark.task.cpus=${CORES_PER_WORKER} \
  --executor-memory 27G \
  ${TFoS_HOME}/examples/mnist/keras/mnist_spark.py \
  --cluster_size ${SPARK_WORKER_INSTANCES} \
  --images_labels ${TFoS_HOME}/data/mnist/csv/train \
  --model_dir ${TFoS_HOME}/mnist_model \
  --export_dir ${TFoS_HOME}/mnist_export
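For completeness, the shell variables referenced above are assumed to be set along these lines; the values are inferred from the logs (master at 192.168.1.11:7077, three workers with one core each), not copied from the original script:

export SPARK_HOME=/opt/module/spark-standalone
export TFoS_HOME=/home/amax/TOS/TensorFlowOnSpark
export MASTER=spark://192.168.1.11:7077
export SPARK_WORKER_INSTANCES=3
export CORES_PER_WORKER=1
export TOTAL_CORES=$((SPARK_WORKER_INSTANCES * CORES_PER_WORKER))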