Distributed deep learning on Hadoop and Spark clusters.

Overview

Note: we're lovingly marking this project as Archived since we're no longer supporting it. You are welcome to read the code, fork your own version, and continue to use it under the terms of the project license.

CaffeOnSpark

What's CaffeOnSpark?

CaffeOnSpark brings deep learning to Hadoop and Spark clusters. By combining salient features from deep learning framework Caffe and big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.

As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction. Caffe users can now perform distributed learning using their existing LMDB data files and minimally adjusted network configurations (as illustrated).

CaffeOnSpark is a Spark package for deep learning. It is complementary to non-deep learning libraries MLlib and Spark SQL. CaffeOnSpark's Scala API provides Spark applications with an easy mechanism to invoke deep learning (see sample) over distributed datasets.
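
Below is a minimal sketch of that mechanism using the Python API, based on the usage that appears in the issue reports later on this page (Config, DataSource, CaffeOnSpark, cos.train); the solver and model paths are placeholders, and sc/sqlContext are assumed to come from a pyspark session launched as described on the wiki.

    # Minimal training sketch with the CaffeOnSpark Python API (run inside pyspark).
    # Names follow the example quoted in the issues below; paths are placeholders.
    from com.yahoo.ml.caffe.RegisterContext import registerContext, registerSQLContext
    from com.yahoo.ml.caffe.CaffeOnSpark import CaffeOnSpark
    from com.yahoo.ml.caffe.Config import Config
    from com.yahoo.ml.caffe.DataSource import DataSource

    registerContext(sc)                  # sc/sqlContext come from the pyspark shell
    registerSQLContext(sqlContext)
    cos = CaffeOnSpark(sc, sqlContext)

    cfg = Config(sc)
    cfg.protoFile = 'lenet_memory_solver.prototxt'   # existing Caffe solver config
    cfg.modelPath = 'file:/tmp/lenet.model'          # where the trained model is saved
    cfg.devices = 1                                  # devices per executor
    cfg.clusterSize = 1                              # number of executors

    dl_train_source = DataSource(sc).getSource(cfg, True)   # True -> training source
    cos.train(dl_train_source)                              # run distributed training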

CaffeOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud. It's been in use by Yahoo for image search, content classification and several other use cases.

Why CaffeOnSpark?

CaffeOnSpark provides some important benefits (see our blog) over alternative deep learning solutions.

  • It enables model training, testing, and feature extraction directly on datasets stored in HDFS on Hadoop clusters (see the feature-extraction sketch after this list).
  • It turns your Hadoop or Spark cluster(s) into a powerful platform for deep learning, without the need to set up a separate dedicated cluster.
  • Server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates the scalability bottleneck.
  • Caffe users' existing datasets (e.g. LMDB) and configurations can be used for distributed learning without any conversion.
  • A high-level API empowers Spark applications to easily conduct deep learning.
  • Incremental learning is supported, leveraging previously trained models or snapshots.
  • Additional data formats and network interfaces can be easily added.
  • It can be easily deployed on a public cloud (e.g. AWS EC2) or a private cloud.
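
As a concrete illustration of the list above (feature extraction over HDFS-resident data and reuse of a previously trained model), here is a hedged sketch with the Python API. The Config fields (isFeature, label, features, outputFormat) are copied from the example quoted in the issues below; the features() call and the second getSource argument are assumptions about the API rather than something shown on this page.

    # Feature-extraction sketch (hypothetical usage; see the lead-in for assumptions).
    cfg = Config(sc)
    cfg.protoFile = 'lenet_memory_solver.prototxt'
    cfg.modelPath = 'file:/tmp/lenet.model'   # previously trained model, reused incrementally
    cfg.isFeature = True                      # extract features instead of training
    cfg.label = 'label'
    cfg.features = ['ip1']                    # network blobs to extract
    cfg.outputFormat = 'json'

    source = DataSource(sc).getSource(cfg, False)   # False -> test/extraction source (assumed)
    feature_df = cos.features(source)               # assumed API: returns a Spark DataFrame
    feature_df.show(5)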

Using CaffeOnSpark

Please check the CaffeOnSpark wiki site for detailed documentation, such as build instructions, an API reference, and getting-started guides for standalone clusters and AWS EC2 clusters.

  • Batch sizes specified in prototxt files are per device (a worked example follows this list).
  • Memory layers should not be shared among GPUs; "share_in_parallel: false" is therefore required in the layer configuration.
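
For example, with per-device batch sizes and (assumed) synchronous updates across devices, the effective batch processed per training step scales with the total device count; a small illustrative calculation (the numbers are made up):

    # Effective global batch size when prototxt batch sizes are per device (illustrative).
    batch_size_per_device = 100   # batch_size from the train prototxt
    devices_per_executor = 1      # e.g. -devices / cfg.devices
    num_executors = 4             # executors in the Spark job

    global_batch = batch_size_per_device * devices_per_executor * num_executors
    print(global_batch)           # 400 samples per training step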

Building for Spark 2.X

CaffeOnSpark supports both Spark 1.x and 2.x. For Spark 2.0, our default settings are:

  • spark-2.0.0
  • hadoop-2.7.1
  • scala-2.11.7

You may want to adjust them in caffe-grid/pom.xml.

Mailing List

Please join the CaffeOnSpark user group for discussions and questions.

License

The use and distribution terms for this software are covered by the Apache 2.0 license; see the LICENSE file.

Comments
  • Added OpenStack Swift integration with CaffeOnSpark Documentation

    Adding OpenStack Swift integration documentation for CaffeOnSpark, plus a few scripts to help get the correct versions of Hadoop and Spark. Also added an edited pom.xml that accounts for the hadoop-openstack dependency when rebuilding CaffeOnSpark. Please take a look.

    https://github.com/arundasan91/CaffeOnSpark/wiki/GetStarted_yarn_object_storage

    opened by arundasan91 36
  • Training ImageNet on CaffeOnSpark

    Hello,

    I wanted to study the scalability properties of CaffeOnSpark, and both MNIST and CIFAR-10 do not seem like good examples, because increasing the number of GPUs/machines does not translate into a significant speedup. So I want to try a more complex model with more data, and ImageNet seems a good example. Given that ImageNet has >100 GB of training data, do you see any problems in porting that example to CaffeOnSpark? Any must-know suggestions would also be helpful.

    If it would help, I am willing to contribute back the instructions and necessary files to run the ImageNet example.

    enhancement 
    opened by rahulbhalerao001 32
  • Image_net data didn't work

    I can test cifar10 and mnist data using CaffeOnSpark. When I tested imagenet data (LMDB format), the application on YARN could finish the add-executor procedure, then it just got stuck and would not go any further. Attached: imagenet.prototxt.txt, imagenet_solver.prototxt.txt. This is my spark-submit script:

    export SPARK_WORKER_INSTANCES=10
    export DEVICES=1

    hadoop fs -rm -f hdfs:///data/alexnet_muti/models/alexnet_model.model.h5
    hadoop fs -rm -r -f hdfs:///data/alexnet_muti/models/result/
    spark-submit --master yarn --deploy-mode cluster
    --num-executors ${SPARK_WORKER_INSTANCES}
    --executor-memory 40G --driver-cores 3 --driver-memory 100G
    --files /data/clusterserver/CaffeOnSpark/data/imagenet_solver.prototxt,/data/clusterserver/CaffeOnSpark/data/imagenet.prototxt,/data/clusterserver/CaffeOnSpark/data/imagenet_mean.binaryproto
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
    --class com.yahoo.ml.caffe.CaffeOnSpark
    /data/clusterserver/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
    -train
    -features accuracy, loss -label label
    -conf imagenet_solver.prototxt
    -connection ethernet
    -model hdfs:///data/alexnet_muti/models/alexnet_model.model.h5
    -output hdfs:///data/alexnet_muti/models/result/
    hadoop fs -ls hdfs:///data/alexnet_muti/models/alexnet_model.model.h5
    hadoop fs -cat hdfs:///data/alexnet_muti/models/result/*

    This is the diagnostic on the YARN UI: Shutdown hook called before final status was reported.

    I did go through the logs of this application, but nothing helped. These are my driver logs:

    SLF4J: Found binding in [jar:file:/data/clusterdata/hadoop/nodemanager/yarn/local/usercache/root/filecache/17/spark-assembly-1.6.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/data/clusterserver/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    16/11/14 15:29:00 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM

    opened by liam0949 23
  • NullPointerException when Running CaffeOnSpark on EC2

    I am running CaffeOnSpark on EC2 following the instructions https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_EC2. I got the following errors:

    16/05/06 00:10:35 INFO TaskSetManager: Starting task 1.1 in stage 1.0 (TID 5, ip-10-30-15-17.us-west-2.compute.internal, partition 1,PROCESS_LOCAL, 2197 bytes) 16/05/06 00:10:35 WARN TaskSetManager: Lost task 0.1 in stage 1.0 (TID 4, ip-10-30-15-17.us-west-2.compute.internal): java.lang.NullPointerException at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply$mcVI$sp(CaffeOnSpark.scala:153) at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149) at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

    opened by syuquad 21
  • A Py4JJavaError happened when following the python instructions

    Hi, I am following the python instructions from https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_python and trying to use the python APIs to train models. But when I use the following example command:

    pushd ${CAFFE_ON_SPARK}/data/
    unzip ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
    IPYTHON=1 pyspark --master yarn
    --num-executors 1
    --driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
    --driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
    --conf spark.cores.max=1
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
    --py-files ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
    --files ${CAFFE_ON_SPARK}/data/caffe/_caffe.so
    --jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"

    and then run the examples below, an error appears for the last line:

    from pyspark import SparkConf,SparkContext
    from com.yahoo.ml.caffe.RegisterContext import registerContext,registerSQLContext
    from com.yahoo.ml.caffe.CaffeOnSpark import CaffeOnSpark
    from com.yahoo.ml.caffe.Config import Config
    from com.yahoo.ml.caffe.DataSource import DataSource
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    registerContext(sc)
    registerSQLContext(sqlContext)
    cos=CaffeOnSpark(sc,sqlContext)
    cfg=Config(sc)
    cfg.protoFile='/Users/afeng/dev/ml/CaffeOnSpark/data/lenet_memory_solver.prototxt'
    cfg.modelPath = 'file:/tmp/lenet.model'
    cfg.devices = 1
    cfg.isFeature=True
    cfg.label='label'
    cfg.features=['ip1']
    cfg.outputFormat = 'json'
    cfg.clusterSize = 1
    cfg.lmdb_partitions=cfg.clusterSize

    Train

    dl_train_source = DataSource(sc).getSource(cfg,True)
    cos.train(dl_train_source)   <------------------ error happened after calling this

    the error message is : In [41]: cos.train(dl_train_source) 16/04/27 10:44:34 INFO spark.SparkContext: Starting job: collect at CaffeOnSpark.scala:127 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Got job 4 (collect at CaffeOnSpark.scala:127) with 1 output partitions 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (collect at CaffeOnSpark.scala:127) 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Parents of final stage: List() 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Missing parents: List() 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[14] at map at CaffeOnSpark.scala:116), which has no missing parents 16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.2 KB, free 23.9 KB) 16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.1 KB, free 25.9 KB) 16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 10.110.53.146:59213 (size: 2.1 KB, free: 511.5 MB) 16/04/27 10:44:34 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[14] at map at CaffeOnSpark.scala:116) 16/04/27 10:44:34 INFO cluster.YarnScheduler: Adding task set 4.0 with 1 tasks 16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 10, sweet, partition 0,PROCESS_LOCAL, 2169 bytes) 16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on sweet:46000 (size: 2.1 KB, free: 511.5 MB) 16/04/27 10:44:34 INFO scheduler.DAGScheduler: ResultStage 4 (collect at CaffeOnSpark.scala:127) finished in 0.084 s 16/04/27 10:44:34 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 10) in 84 ms on sweet (1/1) 16/04/27 10:44:34 INFO cluster.YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Job 4 finished: collect at CaffeOnSpark.scala:127, took 0.092871 s 16/04/27 10:44:34 INFO caffe.CaffeOnSpark: rank = 0, address = null, hostname = sweet 16/04/27 10:44:34 INFO caffe.CaffeOnSpark: rank 0:sweet 16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 112.0 B, free 26.0 KB) 16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 221.0 B, free 26.3 KB) 16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on 10.110.53.146:59213 (size: 221.0 B, free: 511.5 MB) 16/04/27 10:44:34 INFO spark.SparkContext: Created broadcast 6 from broadcast at CaffeOnSpark.scala:146 16/04/27 10:44:34 INFO spark.SparkContext: Starting job: collect at CaffeOnSpark.scala:155 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Got job 5 (collect at CaffeOnSpark.scala:155) with 1 output partitions 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Final stage: ResultStage 5 (collect at CaffeOnSpark.scala:155) 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Parents of final stage: List() 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Missing parents: List() 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[16] at map at CaffeOnSpark.scala:149), which has no missing parents 16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 2.6 KB, free 28.9 KB) 16/04/27 
10:44:34 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 1597.0 B, free 30.4 KB) 16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on 10.110.53.146:59213 (size: 1597.0 B, free: 511.5 MB) 16/04/27 10:44:34 INFO spark.SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[16] at map at CaffeOnSpark.scala:149) 16/04/27 10:44:34 INFO cluster.YarnScheduler: Adding task set 5.0 with 1 tasks 16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 11, sweet, partition 0,PROCESS_LOCAL, 2169 bytes) 16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on sweet:46000 (size: 1597.0 B, free: 511.5 MB) 16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on sweet:46000 (size: 221.0 B, free: 511.5 MB) 16/04/27 10:44:34 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 11) in 48 ms on sweet (1/1) 16/04/27 10:44:34 INFO scheduler.DAGScheduler: ResultStage 5 (collect at CaffeOnSpark.scala:155) finished in 0.049 s 16/04/27 10:44:34 INFO cluster.YarnScheduler: Removed TaskSet 5.0, whose tasks have all completed, from pool 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Job 5 finished: collect at CaffeOnSpark.scala:155, took 0.058122 s 16/04/27 10:44:34 INFO caffe.LmdbRDD: local LMDB path:/home/atlas/work/caffe_spark/CaffeOnSpark-master/data/mnist_train_lmdb 16/04/27 10:44:34 INFO caffe.LmdbRDD: 1 LMDB RDD partitions 16/04/27 10:44:34 INFO spark.SparkContext: Starting job: reduce at CaffeOnSpark.scala:205 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Got job 6 (reduce at CaffeOnSpark.scala:205) with 1 output partitions 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Final stage: ResultStage 6 (reduce at CaffeOnSpark.scala:205) 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Parents of final stage: List() 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Missing parents: List() 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[17] at mapPartitions at CaffeOnSpark.scala:190), which has no missing parents 16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 3.4 KB, free 33.8 KB) 16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 2.2 KB, free 35.9 KB) 16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 10.110.53.146:59213 (size: 2.2 KB, free: 511.5 MB) 16/04/27 10:44:34 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1006 16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 6 (MapPartitionsRDD[17] at mapPartitions at CaffeOnSpark.scala:190) 16/04/27 10:44:34 INFO cluster.YarnScheduler: Adding task set 6.0 with 1 tasks 16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 6.0 (TID 12, sweet, partition 0,PROCESS_LOCAL, 1992 bytes) 16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on sweet:46000 (size: 2.2 KB, free: 511.5 MB) 16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added rdd_12_0 on disk on sweet:46000 (size: 26.0 B) 16/04/27 10:44:34 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 6.0 (TID 12, sweet): java.lang.UnsupportedOperationException: empty.reduceLeft at 
scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167) at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157) at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195) at scala.collection.AbstractIterator.reduce(Iterator.scala:1157) at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:199) at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:191) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

    16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 6.0 (TID 13, sweet, partition 0,PROCESS_LOCAL, 1992 bytes) 16/04/27 10:44:34 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 6.0 (TID 13) on executor sweet: java.lang.UnsupportedOperationException (empty.reduceLeft) [duplicate 1] 16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 6.0 (TID 14, sweet, partition 0,PROCESS_LOCAL, 1992 bytes) 16/04/27 10:44:34 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 6.0 (TID 14) on executor sweet: java.lang.UnsupportedOperationException (empty.reduceLeft) [duplicate 2] 16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.3 in stage 6.0 (TID 15, sweet, partition 0,PROCESS_LOCAL, 1992 bytes) 16/04/27 10:44:34 INFO scheduler.TaskSetManager: Lost task 0.3 in stage 6.0 (TID 15) on executor sweet: java.lang.UnsupportedOperationException (empty.reduceLeft) [duplicate 3] 16/04/27 10:44:34 ERROR scheduler.TaskSetManager: Task 0 in stage 6.0 failed 4 times; aborting job 16/04/27 10:44:34 INFO cluster.YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool 16/04/27 10:44:34 INFO cluster.YarnScheduler: Cancelling stage 6 16/04/27 10:44:34 INFO scheduler.DAGScheduler: ResultStage 6 (reduce at CaffeOnSpark.scala:205) failed in 0.117 s

    16/04/27 10:44:34 INFO scheduler.DAGScheduler: Job 6 failed: reduce at CaffeOnSpark.scala:205, took 0.124712 s

    Py4JJavaError Traceback (most recent call last) in () ----> 1 cos.train(dl_train_source)

    /home/atlas/work/caffe_spark/CaffeOnSpark-master/data/com/yahoo/ml/caffe/CaffeOnSpark.py in train(self, train_source) 29 :param DataSource: the source for training data 30 """ ---> 31 self.dict.get('cos').train(train_source) 32 33 def test(self,test_source):

    /home/atlas/work/caffe_spark/CaffeOnSpark-master/data/com/yahoo/ml/caffe/ConversionUtil.py in call(self, _args) 814 for i in self.syms: 815 try: --> 816 return callJavaMethod(i,self.javaInstance,self._evalDefaults(),self.mirror,_args) 817 except Py4JJavaError: 818 raise

    /home/atlas/work/caffe_spark/CaffeOnSpark-master/data/com/yahoo/ml/caffe/ConversionUtil.py in callJavaMethod(sym, javaInstance, defaults, mirror, _args) 617 return javaInstance(__getConvertedTuple(args,sym,defaults,mirror)) 618 else: --> 619 return toPython(javaInstance.getattr(name)(*_getConvertedTuple(args,sym,defaults,mirror))) 620 #It is good for debugging to know whether the argument conversion was successful. 621 #If it was, a Py4JJavaError may be raised from the Java code.

    /home/atlas/work/caffe_spark/3rdparty/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in call(self, *args) 811 answer = self.gateway_client.send_command(command) 812 return_value = get_return_value( --> 813 answer, self.gateway_client, self.target_id, self.name) 814 815 for temp_arg in temp_args:

    /home/atlas/work/caffe_spark/3rdparty/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(_a, *_kw) 43 def deco(_a, *_kw): 44 try: ---> 45 return f(_a, *_kw) 46 except py4j.protocol.Py4JJavaError as e: 47 s = e.java_exception.toString()

    /home/atlas/work/caffe_spark/3rdparty/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 306 raise Py4JJavaError( 307 "An error occurred while calling {0}{1}{2}.\n". --> 308 format(target_id, ".", name), value) 309 else: 310 raise Py4JError(

    Py4JJavaError: An error occurred while calling o2122.train. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 15, sweet): java.lang.UnsupportedOperationException: empty.reduceLeft at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167) at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157) at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195) at scala.collection.AbstractIterator.reduce(Iterator.scala:1157) at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:199) at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:191) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

    Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952) at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007) at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:205) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.UnsupportedOperationException: empty.reduceLeft at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167) at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157) at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195) at scala.collection.AbstractIterator.reduce(Iterator.scala:1157) at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:199) at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:191) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at 
org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ... 1 more

    Could you please help me check what happened?

    opened by dejunzhang 21
  • socket.cpp:61]   ERROR: Read partial messageheader [8 of 12]

    When I run the lenet example, the job hangs if this error appears at the slave node; the error appears if I set --num-executors >= 2. This is my start shell:

    hadoop fs -rm -f -r hdfs:///yudaoming/mnist
    hadoop fs -rm -r -f hdfs:///yudaoming/mnist/feature_result

    export LD_LIBRARY_PATH=/home/hadoop/projects/CaffeOnSpark/caffe-public/distribute/lib:/home/hadoop/projects/CaffeOnSpark/caffe-distri/distribute/lib
    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-7.5/lib64:/opt/OpenBLAS/lib
    export SPARK_WORKER_INSTANCES=2
    export DEVICES=1

    spark-submit --master yarn --deploy-mode cluster
    --num-executors 2
    --executor-cores 1
    --files /home/hadoop/projects/CaffeOnSpark/data/lenet_memory_solver.prototxt,/home/hadoop/projects/CaffeOnSpark/data/lenet_memory_train_test.prototxt
    --driver-memory 10g
    --executor-memory 30g
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
    --class com.yahoo.ml.caffe.CaffeOnSpark
    /home/hadoop/projects/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
    -train
    -features accuracy,loss -label label
    -conf lenet_memory_solver.prototxt
    -devices 4
    -connection ethernet
    -model hdfs:///yudaoming/mnist/mnist.model
    -output hdfs:///yudaoming/mnist/feature_result

    My cluster has three servers: one master node and two slave nodes; each server has 8 NVIDIA K80s, 128 GB of memory, and 16 Intel E2630 cores.

    opened by ydm2011 21
  • Null Pointer Exception on Running CIFAR-10 example

    Hello,

    I was able to run the MNIST example on an AWS cluster following the steps in the README. I then tried to run the CIFAR-10 example with the following steps:

    1. Downloading the data in lmdb format and mean image to master and both slaves
    2. Modifying the data layer to a MemoryData layer, as shown in the attached files:
      cifar10_quick_solver.prototxt.txt cifar10_quick_train_test.prototxt.txt

    I am getting the below exceptions 16/02/28 00:11:58 WARN TransportChannelHandler: Exception in connection from ip-172-31-29-90.eu-west-1.compute.internal/172.31.29.90:46325 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) 16/02/28 00:11:58 ERROR TaskSchedulerImpl: Lost executor 0 on ip-172-31-29-90.eu-west-1.compute.internal: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. AND

    16/02/28 00:12:00 WARN TaskSetManager: Lost task 1.1 in stage 1.0 (TID 5, ip-172-31-29-90.eu-west-1.compute.internal): java.lang.NullPointerException at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply$mcVI$sp(CaffeOnSpark.scala:158) at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:154) at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:154) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

    It would be great if you could guide me in solving it.

    opened by rahulbhalerao001 20
  • Job stuck in reduce phase

    I'm running the mnist data but hit a problem. My cluster has one master node and 6 data nodes, each with 128 GB of memory, and CaffeOnSpark is running in CPU mode. The job runs well with executor = 1, but it gets stuck in the reduce phase if I increase the executor number, as shown in the screenshot (stuck_reduce).

    For executor = 1 the UI turns out like this (screenshot: complete_reduce):

    It finished successfully and showed me some results, but the reduce stages were skipped.

    The log where the program got stuck is shown below:

    I1116 14:51:57.595015 12024 layer_factory.hpp:77] Creating layer data
    I1116 14:51:57.595031 12024 net.cpp:100] Creating Layer data
    I1116 14:51:57.595053 12024 net.cpp:408] data -> data
    I1116 14:51:57.595067 12024 net.cpp:408] data -> label
    I1116 14:51:57.595149 12024 net.cpp:150] Setting up data
    I1116 14:51:57.595165 12024 net.cpp:157] Top shape: 100 1 28 28 (78400)
    I1116 14:51:57.595181 12024 net.cpp:157] Top shape: 100 (100)
    I1116 14:51:57.595190 12024 net.cpp:165] Memory required for data: 314000
    I1116 14:51:57.595197 12024 layer_factory.hpp:77] Creating layer label_data_1_split
    I1116 14:51:57.595208 12024 net.cpp:100] Creating Layer label_data_1_split
    I1116 14:51:57.595216 12024 net.cpp:434] label_data_1_split <- label
    I1116 14:51:57.595227 12024 net.cpp:408] label_data_1_split -> label_data_1_split_0
    I1116 14:51:57.595239 12024 net.cpp:408] label_data_1_split -> label_data_1_split_1
    I1116 14:51:57.595252 12024 net.cpp:150] Setting up label_data_1_split
    I1116 14:51:57.595263 12024 net.cpp:157] Top shape: 100 (100)
    I1116 14:51:57.595283 12024 net.cpp:157] Top shape: 100 (100)
    I1116 14:51:57.595291 12024 net.cpp:165] Memory required for data: 314800
    I1116 14:51:57.595299 12024 layer_factory.hpp:77] Creating layer conv1
    I1116 14:51:57.595315 12024 net.cpp:100] Creating Layer conv1
    I1116 14:51:57.595324 12024 net.cpp:434] conv1 <- data
    I1116 14:51:57.595351 12024 net.cpp:408] conv1 -> conv1
    I1116 14:51:57.595386 12024 net.cpp:150] Setting up conv1
    I1116 14:51:57.595397 12024 net.cpp:157] Top shape: 100 20 24 24 (1152000)
    I1116 14:51:57.595404 12024 net.cpp:165] Memory required for data: 4922800
    I1116 14:51:57.595418 12024 layer_factory.hpp:77] Creating layer pool1
    I1116 14:51:57.595429 12024 net.cpp:100] Creating Layer pool1
    I1116 14:51:57.595437 12024 net.cpp:434] pool1 <- conv1
    I1116 14:51:57.595446 12024 net.cpp:408] pool1 -> pool1
    I1116 14:51:57.595460 12024 net.cpp:150] Setting up pool1
    I1116 14:51:57.595470 12024 net.cpp:157] Top shape: 100 20 12 12 (288000)
    I1116 14:51:57.595477 12024 net.cpp:165] Memory required for data: 6074800
    I1116 14:51:57.595484 12024 layer_factory.hpp:77] Creating layer conv2
    I1116 14:51:57.595497 12024 net.cpp:100] Creating Layer conv2
    I1116 14:51:57.595505 12024 net.cpp:434] conv2 <- pool1
    I1116 14:51:57.595515 12024 net.cpp:408] conv2 -> conv2
    I1116 14:51:57.595755 12024 net.cpp:150] Setting up conv2
    I1116 14:51:57.595768 12024 net.cpp:157] Top shape: 100 50 8 8 (320000)
    I1116 14:51:57.595777 12024 net.cpp:165] Memory required for data: 7354800
    I1116 14:51:57.595789 12024 layer_factory.hpp:77] Creating layer pool2
    I1116 14:51:57.595799 12024 net.cpp:100] Creating Layer pool2
    I1116 14:51:57.595808 12024 net.cpp:434] pool2 <- conv2
    I1116 14:51:57.595816 12024 net.cpp:408] pool2 -> pool2
    I1116 14:51:57.595830 12024 net.cpp:150] Setting up pool2
    I1116 14:51:57.595840 12024 net.cpp:157] Top shape: 100 50 4 4 (80000)
    I1116 14:51:57.595849 12024 net.cpp:165] Memory required for data: 7674800
    I1116 14:51:57.595855 12024 layer_factory.hpp:77] Creating layer ip1
    I1116 14:51:57.595866 12024 net.cpp:100] Creating Layer ip1
    I1116 14:51:57.595875 12024 net.cpp:434] ip1 <- pool2
    I1116 14:51:57.595887 12024 net.cpp:408] ip1 -> ip1
    I1116 14:51:57.600900 12024 net.cpp:150] Setting up ip1
    I1116 14:51:57.600922 12024 net.cpp:157] Top shape: 100 500 (50000)
    I1116 14:51:57.600930 12024 net.cpp:165] Memory required for data: 7874800
    I1116 14:51:57.600945 12024 layer_factory.hpp:77] Creating layer relu1
    I1116 14:51:57.600957 12024 net.cpp:100] Creating Layer relu1
    I1116 14:51:57.600966 12024 net.cpp:434] relu1 <- ip1
    I1116 14:51:57.600976 12024 net.cpp:395] relu1 -> ip1 (in-place)
    I1116 14:51:57.600988 12024 net.cpp:150] Setting up relu1
    I1116 14:51:57.600997 12024 net.cpp:157] Top shape: 100 500 (50000)
    I1116 14:51:57.601004 12024 net.cpp:165] Memory required for data: 8074800
    I1116 14:51:57.601012 12024 layer_factory.hpp:77] Creating layer ip2
    I1116 14:51:57.601024 12024 net.cpp:100] Creating Layer ip2
    I1116 14:51:57.601032 12024 net.cpp:434] ip2 <- ip1
    I1116 14:51:57.601043 12024 net.cpp:408] ip2 -> ip2
    I1116 14:51:57.601111 12024 net.cpp:150] Setting up ip2
    I1116 14:51:57.601122 12024 net.cpp:157] Top shape: 100 10 (1000)
    I1116 14:51:57.601130 12024 net.cpp:165] Memory required for data: 8078800
    I1116 14:51:57.601140 12024 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
    I1116 14:51:57.601151 12024 net.cpp:100] Creating Layer ip2_ip2_0_split
    I1116 14:51:57.601166 12024 net.cpp:434] ip2_ip2_0_split <- ip2
    I1116 14:51:57.601177 12024 net.cpp:408] ip2_ip2_0_split -> ip2_ip2_0_split_0
    I1116 14:51:57.601189 12024 net.cpp:408] ip2_ip2_0_split -> ip2_ip2_0_split_1
    I1116 14:51:57.601202 12024 net.cpp:150] Setting up ip2_ip2_0_split
    I1116 14:51:57.601212 12024 net.cpp:157] Top shape: 100 10 (1000)
    I1116 14:51:57.601222 12024 net.cpp:157] Top shape: 100 10 (1000)
    I1116 14:51:57.601228 12024 net.cpp:165] Memory required for data: 8086800
    I1116 14:51:57.601235 12024 layer_factory.hpp:77] Creating layer accuracy
    I1116 14:51:57.601246 12024 net.cpp:100] Creating Layer accuracy
    I1116 14:51:57.601254 12024 net.cpp:434] accuracy <- ip2_ip2_0_split_0
    I1116 14:51:57.601263 12024 net.cpp:434] accuracy <- label_data_1_split_0
    I1116 14:51:57.601274 12024 net.cpp:408] accuracy -> accuracy
    I1116 14:51:57.601286 12024 net.cpp:150] Setting up accuracy
    I1116 14:51:57.601296 12024 net.cpp:157] Top shape: (1)
    I1116 14:51:57.601303 12024 net.cpp:165] Memory required for data: 8086804
    I1116 14:51:57.601311 12024 layer_factory.hpp:77] Creating layer loss
    I1116 14:51:57.601321 12024 net.cpp:100] Creating Layer loss
    I1116 14:51:57.601330 12024 net.cpp:434] loss <- ip2_ip2_0_split_1
    I1116 14:51:57.601337 12024 net.cpp:434] loss <- label_data_1_split_1
    I1116 14:51:57.601347 12024 net.cpp:408] loss -> loss
    I1116 14:51:57.601359 12024 layer_factory.hpp:77] Creating layer loss
    I1116 14:51:57.601379 12024 net.cpp:150] Setting up loss
    I1116 14:51:57.601389 12024 net.cpp:157] Top shape: (1)
    I1116 14:51:57.601397 12024 net.cpp:160]     with loss weight 1
    I1116 14:51:57.601408 12024 net.cpp:165] Memory required for data: 8086808
    I1116 14:51:57.601415 12024 net.cpp:226] loss needs backward computation.
    I1116 14:51:57.601424 12024 net.cpp:228] accuracy does not need backward computation.
    I1116 14:51:57.601433 12024 net.cpp:226] ip2_ip2_0_split needs backward computation.
    I1116 14:51:57.601439 12024 net.cpp:226] ip2 needs backward computation.
    I1116 14:51:57.601447 12024 net.cpp:226] relu1 needs backward computation.
    I1116 14:51:57.601454 12024 net.cpp:226] ip1 needs backward computation.
    I1116 14:51:57.601461 12024 net.cpp:226] pool2 needs backward computation.
    I1116 14:51:57.601469 12024 net.cpp:226] conv2 needs backward computation.
    I1116 14:51:57.601475 12024 net.cpp:226] pool1 needs backward computation.
    I1116 14:51:57.601483 12024 net.cpp:226] conv1 needs backward computation.
    I1116 14:51:57.601491 12024 net.cpp:228] label_data_1_split does not need backward computation.
    I1116 14:51:57.601500 12024 net.cpp:228] data does not need backward computation.
    I1116 14:51:57.601506 12024 net.cpp:270] This network produces output accuracy
    I1116 14:51:57.601514 12024 net.cpp:270] This network produces output loss
    I1116 14:51:57.601531 12024 net.cpp:283] Network initialization done.
    I1116 14:51:57.601585 12024 solver.cpp:60] Solver scaffolding done.
    I1116 14:51:57.601702 12024 socket.cpp:221] Waiting for valid port [0]
    I1116 14:51:57.601812 12041 socket.cpp:162] Assigned socket server port [50184]
    I1116 14:51:57.602773 12041 socket.cpp:175] Socket Server ready [0.0.0.0]
    I1116 14:51:57.611819 12024 socket.cpp:221] Waiting for valid port [50184]
    I1116 14:51:57.611841 12024 socket.cpp:229] Valid port found [50184]
    I1116 14:51:57.611863 12024 CaffeNet.cpp:262] Socket adapter: longzhou-hdp1.lz.dscc:50184
    I1116 14:51:57.612177 12024 CaffeNet.cpp:402] 0-th Socket addr: 
    I1116 14:51:57.612201 12024 CaffeNet.cpp:402] 1-th Socket addr: longzhou-hdp1.lz.dscc:50184
    I1116 14:51:57.612211 12024 CaffeNet.cpp:402] 2-th Socket addr: longzhou-hdp1.lz.dscc:50184
    I1116 14:51:57.612221 12024 CaffeNet.cpp:402] 3-th Socket addr: longzhou-hdp1.lz.dscc:50184
    I1116 14:51:57.612236 12024 JniCaffeNet.cpp:145] 0-th local addr: 
    I1116 14:51:57.612246 12024 JniCaffeNet.cpp:145] 1-th local addr: longzhou-hdp1.lz.dscc:50184
    I1116 14:51:57.612253 12024 JniCaffeNet.cpp:145] 2-th local addr: longzhou-hdp1.lz.dscc:50184
    I1116 14:51:57.612262 12024 JniCaffeNet.cpp:145] 3-th local addr: longzhou-hdp1.lz.dscc:50184
    16/11/16 14:51:57 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 8). 1018 bytes result sent to driver
    16/11/16 14:51:58 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 14
    16/11/16 14:51:58 INFO executor.Executor: Running task 2.0 in stage 3.0 (TID 14)
    16/11/16 14:51:58 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
    16/11/16 14:51:58 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1584.0 B, free 18.7 KB)
    16/11/16 14:51:58 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 12 ms
    16/11/16 14:51:58 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.6 KB, free 21.3 KB)
    16/11/16 14:51:58 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
    16/11/16 14:51:58 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 170.0 B, free 21.4 KB)
    16/11/16 14:51:58 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 7 ms
    16/11/16 14:51:58 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 1536.0 B, free 22.9 KB)
    I1116 14:51:58.208997 12024 common.cpp:61] 0-th string is NULL
    I1116 14:51:58.209087 12024 socket.cpp:257] Trying to connect with ...[longzhou-hdp3.lz.dscc:56137]
    I1116 14:51:58.209530 12024 socket.cpp:321] Connected to server [longzhou-hdp3.lz.dscc:56137] with client_fd [374]
    I1116 14:51:58.228672 12041 socket.cpp:188] Accepted the connection from client [longzhou-hdp2.lz.dscc]
    I1116 14:51:58.230128 12041 socket.cpp:188] Accepted the connection from client [longzhou-hdp4.lz.dscc]
    I1116 14:51:58.276432 12041 socket.cpp:188] Accepted the connection from client [longzhou-hdp3.lz.dscc]
    I1116 14:51:59.209713 12024 socket.cpp:257] Trying to connect with ...[longzhou-hdp2.lz.dscc:51049]
    I1116 14:51:59.210247 12024 socket.cpp:321] Connected to server [longzhou-hdp2.lz.dscc:51049] with client_fd [378]
    I1116 14:52:00.210403 12024 socket.cpp:257] Trying to connect with ...[longzhou-hdp4.lz.dscc:52620]
    I1116 14:52:00.210934 12024 socket.cpp:321] Connected to server [longzhou-hdp4.lz.dscc:52620] with client_fd [294]
    16/11/16 14:52:01 INFO caffe.CaffeProcessor: Interleave enabled
    I1116 14:52:01.218786 12048 CaffeNet.cpp:633] Interleaved
    I1116 14:52:01.218842 12048 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
    I1116 14:52:01.218858 12048 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
    16/11/16 14:52:01 INFO caffe.CaffeProcessor: Start transformer for train in CaffeProcessor StartThreads
    16/11/16 14:52:01 INFO caffe.CaffeProcessor: Start transformer for validation in CaffeProcessor StartThreads
    16/11/16 14:52:01 INFO executor.Executor: Finished task 2.0 in stage 3.0 (TID 14). 908 bytes result sent to driver
    16/11/16 14:52:01 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 16
    16/11/16 14:52:01 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 16)
    16/11/16 14:52:01 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 5
    16/11/16 14:52:01 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.1 KB, free 25.0 KB)
    16/11/16 14:52:01 INFO broadcast.TorrentBroadcast: Reading broadcast variable 5 took 10 ms
    16/11/16 14:52:01 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.3 KB, free 28.3 KB)
    16/11/16 14:52:01 INFO storage.BlockManager: Found block rdd_1_0 locally
    16/11/16 14:52:01 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 16). 2830 bytes result sent to driver
    16/11/16 14:52:01 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 17
    16/11/16 14:52:01 INFO executor.Executor: Running task 0.0 in stage 5.0 (TID 17)
    16/11/16 14:52:01 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 6
    16/11/16 14:52:01 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 2.0 KB, free 30.3 KB)
    16/11/16 14:52:01 INFO broadcast.TorrentBroadcast: Reading broadcast variable 6 took 10 ms
    16/11/16 14:52:01 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 3.1 KB, free 33.5 KB)
    16/11/16 14:52:01 INFO storage.BlockManager: Found block rdd_1_0 locally
    16/11/16 14:52:01 INFO executor.Executor: Finished task 0.0 in stage 5.0 (TID 17). 2016 bytes result sent to driver
    16/11/16 14:52:01 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 21
    16/11/16 14:52:01 INFO executor.Executor: Running task 1.0 in stage 6.0 (TID 21)
    16/11/16 14:52:01 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 7
    16/11/16 14:52:01 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 2.0 KB, free 35.5 KB)
    16/11/16 14:52:01 INFO broadcast.TorrentBroadcast: Reading broadcast variable 7 took 10 ms
    16/11/16 14:52:01 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 3.1 KB, free 38.6 KB)
    16/11/16 14:52:01 INFO storage.BlockManager: Found block rdd_3_1 locally
    16/11/16 14:52:01 INFO executor.Executor: Finished task 1.0 in stage 6.0 (TID 21). 2015 bytes result sent to driver
    16/11/16 14:52:01 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 25
    16/11/16 14:52:01 INFO executor.Executor: Running task 2.0 in stage 7.0 (TID 25)
    16/11/16 14:52:01 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 8
    16/11/16 14:52:01 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 1309.0 B, free 39.9 KB)
    16/11/16 14:52:01 INFO broadcast.TorrentBroadcast: Reading broadcast variable 8 took 9 ms
    16/11/16 14:52:01 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 2.0 KB, free 42.0 KB)
    16/11/16 14:52:01 INFO executor.Executor: Finished task 2.0 in stage 7.0 (TID 25). 938 bytes result sent to driver
    16/11/16 14:52:02 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 29
    16/11/16 14:52:02 INFO executor.Executor: Running task 0.0 in stage 8.0 (TID 29)
    16/11/16 14:52:02 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 9
    16/11/16 14:52:02 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 2.5 KB, free 44.5 KB)
    16/11/16 14:52:02 INFO broadcast.TorrentBroadcast: Reading broadcast variable 9 took 10 ms
    16/11/16 14:52:02 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 4.1 KB, free 48.6 KB)
    16/11/16 14:52:02 INFO storage.BlockManager: Found block rdd_1_0 locally
    16/11/16 14:52:02 INFO executor.Executor: Finished task 0.0 in stage 8.0 (TID 29). 2211 bytes result sent to driver
    16/11/16 14:52:02 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 31
    16/11/16 14:52:02 INFO executor.Executor: Running task 0.0 in stage 9.0 (TID 31)
    16/11/16 14:52:02 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 33
    16/11/16 14:52:02 INFO executor.Executor: Running task 1.0 in stage 9.0 (TID 33)
    16/11/16 14:52:02 INFO spark.MapOutputTrackerWorker: Updating epoch to 1 and clearing cache
    16/11/16 14:52:02 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 10
    16/11/16 14:52:02 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 2.6 KB, free 51.2 KB)
    16/11/16 14:52:02 INFO broadcast.TorrentBroadcast: Reading broadcast variable 10 took 8 ms
    16/11/16 14:52:02 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 4.4 KB, free 55.5 KB)
    16/11/16 14:52:02 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
    16/11/16 14:52:02 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
    16/11/16 14:52:02 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://[email protected]:2671)
    16/11/16 14:52:02 INFO spark.MapOutputTrackerWorker: Got the output locations
    16/11/16 14:52:02 INFO storage.ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
    16/11/16 14:52:02 INFO storage.ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 4 blocks
    16/11/16 14:52:02 INFO storage.ShuffleBlockFetcherIterator: Started 3 remote fetches in 15 ms
    16/11/16 14:52:02 INFO storage.ShuffleBlockFetcherIterator: Started 3 remote fetches in 16 ms
    

    And there is no more log for this container after it said "Started remote fetches".

    Still, my cluster runs well with other Spark programs that have a reduce phase (not skipped).

    opened by mouendless 18
  • org.apache.maven.plugins:maven-antrun-plugin:1.7

    root@ubuntu:~/GitProgram/Spark-Program/CaffeOnSpark# make build cd caffe-public; make proto; make -j4 -e distribute; cd .. make[1]: Entering directory /root/GitProgram/Spark-Program/CaffeOnSpark/caffe-public' PROTOC src/caffe/proto/caffe.proto make[1]: protoc: Command not found make[1]: *** [.build_release/src/caffe/proto/caffe.pb.cc] Error 127 make[1]: Leaving directory/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-public' make[1]: Entering directory /root/GitProgram/Spark-Program/CaffeOnSpark/caffe-public' PROTOC src/caffe/proto/caffe.proto make[1]: protoc: Command not found make[1]: *** [.build_release/src/caffe/proto/caffe.pb.h] Error 127 make[1]: *** Waiting for unfinished jobs.... make[1]: Leaving directory/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-public' export LD_LIBRARY_PATH="/usr/local/cuda/lib64::/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-public/distribute/lib:/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-distri/distribute/lib:/usr/lib64:/lib64 "; mvn -B package [INFO] Scanning for projects... [WARNING] [WARNING] Some problems were encountered while building the effective model for com.yahoo.ml:caffe-grid:jar:0.1-SNAPSHOT [WARNING] The expression ${version} is deprecated. Please use ${project.version} instead. [WARNING] [WARNING] It is highly recommended to fix these problems because they threaten the stability of your build. [WARNING] [WARNING] For this reason, future Maven versions might no longer support building such malformed projects. [WARNING] [INFO] ------------------------------------------------------------------------ [INFO] Reactor Build Order: [INFO] [INFO] caffe [INFO] caffe-distri [INFO] caffe-grid [INFO]
    [INFO] ------------------------------------------------------------------------ [INFO] Building caffe 0.1-SNAPSHOT [INFO] ------------------------------------------------------------------------ [INFO]
    [INFO] ------------------------------------------------------------------------ [INFO] Building caffe-distri 0.1-SNAPSHOT [INFO] ------------------------------------------------------------------------ [INFO] [INFO] --- maven-antrun-plugin:1.7:run (proto) @ caffe-distri --- [INFO] Executing tasks

    protoc: [exec] make[1]: Entering directory `/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-distri' [exec] make[1]: *** No rule to make target `../caffe-public/distribute/proto/caffe.proto', needed by `src/main/java/caffe/Caffe.java'. Stop. [exec] make[1]: Leaving directory `/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-distri' [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary: [INFO] [INFO] caffe .............................................. SUCCESS [ 0.003 s] [INFO] caffe-distri ....................................... FAILURE [ 1.268 s] [INFO] caffe-grid ......................................... SKIPPED [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] Total time: 1.591 s [INFO] Finished at: 2016-05-16T18:20:36+08:00 [INFO] Final Memory: 9M/176M [INFO] ------------------------------------------------------------------------ [ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (proto) on project caffe-distri: An Ant BuildException has occured: exec returned: 2 [ERROR] around Ant part ...... @ 5:109 in /root/GitProgram/Spark-Program/CaffeOnSpark/caffe-distri/target/antrun/build-protoc.xml [ERROR] -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :caffe-distri make: *** [build] Error 1

    opened by ai408 17
  • nvidia-smi hangs when devices=2 using spark-submit in yarn

    nvidia-smi hangs when devices=2 using spark-submit in yarn

    On a single node with 2 GPUs, when submitting with devices=2, nvidia-smi hangs. I need to shut down the computer to bring the NVIDIA GPUs back into use. The operating system is Ubuntu 14.04.
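    A low-risk first check after such a hang is simply to list the devices: `nvidia-smi -L` returns almost immediately on a healthy driver, so a timeout points at a wedged driver or GPU rather than at Spark or CaffeOnSpark. A small sketch using standard tooling only (nothing project-specific is assumed):

        import subprocess

        # List visible GPUs; on a healthy driver this returns right away.
        try:
            print(subprocess.check_output(["nvidia-smi", "-L"], text=True, timeout=10))
        except subprocess.TimeoutExpired:
            print("nvidia-smi did not respond within 10 seconds; the driver appears hung.")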

    opened by nhe150 16
  • no lmdbjni in java.library.path exception

    no lmdbjni in java.library.path exception

    16/02/26 16:34:34 INFO caffe.DataSource$: Source data layer:0
    16/02/26 16:34:34 INFO caffe.LMDB: Batch size:64
    Exception in thread "main" java.lang.UnsatisfiedLinkError: no lmdbjni in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        at java.lang.System.loadLibrary(System.java:1122)
        at com.yahoo.ml.caffe.LMDB$.makeSequence(LMDB.scala:28)
        at com.yahoo.ml.caffe.LMDB.makeRDD(LMDB.scala:94)
        at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:113)
        at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:44)
        at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    16/02/26 16:34:34 INFO spark.SparkContext: Invoking stop() from shutdown hook
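    The UnsatisfiedLinkError means the JVM cannot locate the lmdbjni native library at runtime. Native library paths have to be in place before the driver and executor JVMs start, so they belong on the spark-submit command line rather than in code run afterwards. Below is a hedged sketch that assembles such a command; the checkout path is a placeholder, and the assumption that the required libraries sit under the two distribute/lib directories comes from the build log earlier in these comments:

        # Placeholder: point this at your own CaffeOnSpark checkout.
        CAFFE_ON_SPARK = "/path/to/CaffeOnSpark"

        # `make build` populates these distribute/lib directories (they also appear on
        # LD_LIBRARY_PATH in the build log above); the lmdbjni native library is assumed
        # to be built there.
        native_libs = ":".join([
            CAFFE_ON_SPARK + "/caffe-public/distribute/lib",
            CAFFE_ON_SPARK + "/caffe-distri/distribute/lib",
        ])
        jar = CAFFE_ON_SPARK + "/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"

        cmd = [
            "spark-submit",
            "--driver-library-path", native_libs,                        # driver native library path
            "--conf", "spark.executor.extraLibraryPath=" + native_libs,  # executor native library path
            "--class", "com.yahoo.ml.caffe.CaffeOnSpark",
            jar,
            # CaffeOnSpark job arguments would follow here
        ]
        print(" ".join(cmd))  # inspect the command, then run it from a shell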

    opened by melody-rain 16
  • java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1

    java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1

    I got an error when I was trying to join two datasets, one from Postgres and one from a CSV file. The error message is:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
    staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, targetString), StringType), true, false) AS targetString#205
    staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, deviceName), StringType), true, false) AS deviceName#206
    staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, alarmDetectionCode), StringType), true, false) AS alarmDetectionCode#207
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:292)
        at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:593)
        at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:593)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
        at org.apache.spark.sql.Row$class.isNullAt(Row.scala:191)
        at org.apache.spark.sql.catalyst.expressions.GenericRow.isNullAt(rows.scala:166)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.StaticInvoke_1$(generated.java:94)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:36)
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
        ... 15 more

    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1877)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:929)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:929)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:929)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2111)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2060)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2049)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:740)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2073)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2113)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
        at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
        at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:745)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:704)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:713)
        at jp.co.nec.necdas.commons.customize.service.dataset.ALMTriggerProcessLogic.process(ALMTriggerProcessLogic.java:295)
        at jp.co.nec.necdas.commons.customize.service.dataset.ALMTriggerProcessLogic.execute(ALMTriggerProcessLogic.java:179)
        at jp.co.nec.necdas.commons.customize.service.dataset.ALMTriggerProcessLogic.execute(ALMTriggerProcessLogic.java:1)
        at jp.co.nec.necdas.commons.service.StandardExecutorService.execute(StandardExecutorService.java:65)
        at jp.co.nec.necdas.commons.service.StandardExecutorServiceTest.testExecute(StandardExecutorServiceTest.java:26)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.springframework.test.context.junit4.statements.RunBeforeTestExecutionCallbacks.evaluate(RunBeforeTestExecutionCallbacks.java:73)
        at org.springframework.test.context.junit4.statements.RunAfterTestExecutionCallbacks.evaluate(RunAfterTestExecutionCallbacks.java:83)
        at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:75)
        at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:86)
        at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:84)
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
        at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:251)
        at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:97)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
        at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
        at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:70)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
        at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:190)
        at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
        at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:538)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:760)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:460)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:206)
    Caused by: java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
    staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, targetString), StringType), true, false) AS targetString#205
    staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, deviceName), StringType), true, false) AS deviceName#206
    staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, alarmDetectionCode), StringType), true, false) AS alarmDetectionCode#207
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:292)
        at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:593)
        at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:593)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
        at org.apache.spark.sql.Row$class.isNullAt(Row.scala:191)
        at org.apache.spark.sql.catalyst.expressions.GenericRow.isNullAt(rows.scala:166)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.StaticInvoke_1$(generated.java:94)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:36)
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
        ... 15 more

    It looks like a mismatch occurred when the Spark application joined two datasets with different schemas, but I don't know how it happened.

    My java code:

        Dataset result = null;
        result = deviceInfoDataset.join(
            searchInfo,
            deviceInfoDataset.col("deviceName").equalTo(searchInfo.col("deviceName")));
        result.show();

    Dataset schemas:

    device

        +--------+----------+----------+
        |ctgry_cd|deviceInfo|deviceName|
        +--------+----------+----------+

    searchinfo

        +------------+----------+------------------+
        |targetString|deviceName|alarmDetectionCode|
        +------------+----------+------------------+
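    For comparison, here is a minimal, self-contained PySpark sketch of the same join with the duplicated `deviceName` column renamed on one side first; the sample rows are invented, only the column names come from the schemas above, and this is not confirmed to be the reporter's root cause (rows with fewer fields than the declared schema, e.g. malformed CSV lines, can raise the same ArrayIndexOutOfBoundsException during encoding):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.master("local[1]").appName("join-demo").getOrCreate()

        # Toy rows for illustration; column names match the schemas shown above.
        device_info = spark.createDataFrame(
            [("cat01", "info-a", "dev-1")],
            ["ctgry_cd", "deviceInfo", "deviceName"])
        search_info = spark.createDataFrame(
            [("tgt-1", "dev-1", "alarm-9")],
            ["targetString", "deviceName", "alarmDetectionCode"])

        # Rename the duplicated column on one side so the joined schema is unambiguous.
        search_info = search_info.withColumnRenamed("deviceName", "searchDeviceName")

        result = device_info.join(
            search_info,
            device_info["deviceName"] == search_info["searchDeviceName"])
        result.show()

        spark.stop()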

    opened by yuanwuab 0
  • Attribute protoFile not valid

    Attribute protoFile not valid

    [root@node93 ~]# pyspark --master yarn \
        --driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
        --driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
        --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
        --py-files ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip \
        --files ${CAFFE_ON_SPARK}/data/caffe/caffe.so \
        --jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
    Python 2.7.5 (default, Nov 6 2016, 00:28:07)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/hdp/2.6.4.0-91/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hdp/2.6.4.0-91/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    19/03/01 11:12:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    19/03/01 11:12:55 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.4.0-91
          /_/

    Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
    SparkSession available as 'spark'.

    >>> from pyspark import SparkConf, SparkContext
    >>> from com.yahoo.ml.caffe.RegisterContext import registerContext, registerSQLContext
    >>> from com.yahoo.ml.caffe.CaffeOnSpark import CaffeOnSpark
    >>> from com.yahoo.ml.caffe.Config import Config
    >>> from com.yahoo.ml.caffe.DataSource import DataSource
    >>> from pyspark.mllib.linalg import Vectors
    >>> from pyspark.mllib.regression import LabeledPoint
    >>> from pyspark.mllib.linalg import Vectors
    >>> from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    >>> registerContext(sc)
    >>> registerSQLContext(sqlContext)
    >>> cos = CaffeOnSpark(sc)
    >>> cfg = Config(sc)
    >>> cfg.protoFile = '/usr/hdp/2.6.4.0-91/CaffeOnSpark/data/lenet_memory_solver.prototxt'
    Attribute protoFile not valid
    >>> cfg.protoFile = '/user/root/lenet_memory_solver.prototxt'
    Attribute protoFile not valid

    I don't know why the protoFile is not valid.
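    One hedged way to debug an "Attribute ... not valid" message without guessing at spellings is to list the attributes the live Config wrapper actually exposes. The snippet below uses ordinary Python introspection on the `cfg` object from the session above; it is not a documented CaffeOnSpark API, so it only reflects what this particular build exposes:

        # `cfg` is the com.yahoo.ml.caffe.Config instance created above; dir() is plain
        # Python introspection, so the output is simply the wrapper's visible attributes.
        settings = sorted(name for name in dir(cfg) if not name.startswith("_"))
        print("\n".join(settings))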

    opened by qinhuaiqiang 0
  • hive java.io.filenotfoundexception system cannot find specified path

    hive java.io.filenotfoundexception system cannot find specified path

    I'm installing Hive on Windows, following the tutorial at http://sandeeppatil101.blogspot.mx/2017/05/step-1-download-hive-2.html. When I reached the point of running "schematool -initSchema -dbType mysql", I got the following error:

    2018-05-14 10:18:22,583 main WARN Unable to instantiate org.fusesource.jansi.WindowsAnsiOutputStream
    2018-05-14 10:18:22,585 main WARN Unable to instantiate org.fusesource.jansi.WindowsAnsiOutputStream
    org.apache.hadoop.hive.metastore.HiveMetaException: File /usr/hive\scripts\metastore\upgrade\mysql\upgrade.order.mysql not found
    Underlying cause: java.io.FileNotFoundException : \usr\hive\scripts\metastore\upgrade\mysql\upgrade.order.mysql (The system cannot find the specified path)
    Use --verbose for detailed stacktrace.
    *** schemaTool failed ***

    The first couple of lines show a warning message; I'm more intrigued by the last two lines, with the "not found" and "The system cannot find the specified path" messages.

    I assume this has to do with the path where upgrade.order.mysql is located. I checked that the file does exist and the path is correct, but for some unknown reason the path inside Cygwin is not correct. Where can I see that configuration path? How do I fix this error?

    Any suggestions will be helpful. Thanks

    opened by emrosales 0
  • Error running javah command: Error executing command line

    Error running javah command: Error executing command line

    Hi, I'd like to share a couple of things. First, as I'm new to Hadoop, I followed a couple of installation tutorials for teaching purposes and got past the problem of .NET + SDK 7.1. While building Hadoop I got a warning and an error: the warning concerned the maven-gpg-plugin referenced at line 133 of the pom.xml. I realized the plugin was not mentioned in the pluginManagement section, so I added it and the warning disappeared. Second, and more serious, I got the following error:

    [ERROR] Failed to execute goal org.codehaus.mojo:native-maven-plugin:1.0-alpha-8:javah (default) on project hadoop-common: Error running javah command: Error executing command line. Exit code:1 -> [Help 1]
    [ERROR]
    [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
    [ERROR] Re-run Maven using the -X switch to enable full debug logging.
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

    Checking the output (I redirected it to a file, which I'm including), I realized that the command line

    [INFO] --- native-maven-plugin:1.0-alpha-8:javah (default) @ hadoop-common ---
    [INFO] cmd.exe /X /C "C:\Progra~1\Java\jdk1.8.0_161\bin\javah -d C:\hadoop-3.0.0-src\hadoop- ... org.apache.hadoop.crypto.OpensslCipher org.apache.hadoop.crypto.random.OpensslSecureRandom org.apache.hadoop.util.NativeCrc32"

    is too long, and my guess is that this line was never executed, hence the error. The full output is in the included logfile.txt.

    I'm running on Windows 7; it's the only platform I have so far, and as I said, this is for teaching purposes. Is there any way to work around the too-long command line? Thanks.

    opened by emrosales 2
  • err “java.lang.UnsupportedOperationException: empty.reduceLeft”

    err “java.lang.UnsupportedOperationException: empty.reduceLeft”

    Hello, after I run 100 iterations I get the error "java.lang.UnsupportedOperationException: empty.reduceLeft". It looks like something is wrong with my DataFrame data, but the same data works fine on another cluster. The difference between the two clusters is:

    Working cluster: OS: Ubuntu | CPUs per machine: 1 | devices per machine: 3 (GTX 1080) | total devices: 9
    Failing cluster: OS: CentOS | CPUs per machine: 2 | devices per machine: 8 (GTX 1080) | total devices: 16

    Can you give me some suggestions? Thanks.

    opened by Zzmc 0
  • Error: Exception in thread

    Error: Exception in thread "AWT-EventQueue-0" java.lang.UnsatisfiedLinkError: /Applications/Alice 2.4.app/Contents/Required/lib/osx/libjogl_awt.jnilib: Library not loaded: /System/Library/Frameworks/JavaVM.framework/Libraries/libjawt.dylib

    Can't figure out how to fix this. I'm trying to download the program Alice on my MacBook Pro and I already updated java but I keep getting an error when the program starts up. How do I fix this?

    opened by purple792 0