A Neural Net Training Interface on TensorFlow, with focus on speed + flexibility

Overview

Tensorpack

Tensorpack is a neural network training interface based on TensorFlow.


Features:

It's Yet Another TF high-level API, with speed and flexibility built together.

  1. Focus on training speed.

    • Speed comes for free with Tensorpack -- it uses TensorFlow in an efficient way with no extra overhead. On common CNNs, it runs training 1.2~5x faster than the equivalent Keras code. Your training can probably get faster if written with Tensorpack.

    • Data-parallel multi-GPU/distributed training strategies are available off the shelf. They scale as well as Google's official benchmark.

    • See tensorpack/benchmarks for some benchmark scripts.

  2. Focus on large datasets.

    • You don't usually need tf.data. Symbolic programming often makes data processing harder. Tensorpack helps you efficiently process large datasets (e.g. ImageNet) in pure Python with autoparallelization.
  3. It's not a model wrapper.

    • There are too many symbolic function wrappers in the world. Tensorpack includes only a few common models. But you can use any symbolic function library inside Tensorpack, including tf.layers/Keras/slim/tflearn/tensorlayer/....
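
As a rough illustration of the "large datasets" point above, the sketch below maps a preprocessing function over a data source in parallel and groups the results into batches, using only the Python standard library. It is not Tensorpack's DataFlow API; `decode`, `parallel_map`, and `batch` are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def decode(record):
    """Stand-in for per-sample work such as decoding or augmentation."""
    return record * 2


def parallel_map(fn, source, workers=4):
    """Apply fn to every item of source with a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(fn, source)


def batch(source, size):
    """Group consecutive items into lists of at most `size` elements."""
    it = iter(source)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


batches = list(batch(parallel_map(decode, range(10)), size=4))
```

Threads suit I/O-bound steps such as reading files; CPU-heavy preprocessing generally needs separate processes instead.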

See the tutorials and documentation to learn more about these features.

Examples:

We refuse toy examples. Instead of showing tiny CNNs trained on MNIST/Cifar10, we provide training scripts that reproduce well-known papers.

We refuse low-quality implementations. Unlike most open source repos which only implement papers, Tensorpack examples faithfully reproduce papers, demonstrating its flexibility for actual research.

Vision:

Reinforcement Learning:

Speech / NLP:

Install:

Dependencies:

  • Python 3.3+.
  • Python bindings for OpenCV. (Optional, but required by a lot of features)
  • TensorFlow ≥ 1.5, < 2
    • TF is not required if you only want to use tensorpack.dataflow alone as a data processing library
    • TF2 is supported if used in graph mode (and use tf.compat.v1 when needed)
pip install --upgrade git+https://github.com/tensorpack/tensorpack.git
# or add `--user` to install to user's local directories

Please note that tensorpack is not yet stable. If you use tensorpack in your code, remember to pin the exact version of tensorpack you use in your dependencies.

Citing Tensorpack:

If you use Tensorpack in your research or wish to refer to the examples, please cite with:

@misc{wu2016tensorpack,
  title={Tensorpack},
  author={Wu, Yuxin and others},
  howpublished={\url{https://github.com/tensorpack/}},
  year={2016}
}
Comments
  • Run Inference after training


    Hello! I am sorry if this is unrelated to Tensorpack. I ran the ResNet on the Cifar10 dataset with Trained Ternary Quantization. Now I don't know how to run inference on the saved checkpoint after training. I have already read "Don’t Use Training Metagraph for Inference" in the Tensorpack documentation. However, I still don't know exactly how to use the snippet below:

    a, b = tf.placeholder(...), tf.placeholder(...)
    with TowerContext('', is_training=False):
        model.build_graph(a, b)
    

    Could you guide me on how to do that? Thank you in advance!

    usage 
    opened by minhson 58
  • error running alexnet_dorefa.py


    Environment: TensorFlow 1.13.0 (in Docker), CUDA 8.0, cuDNN 6, Anaconda2

    I get an error running alexnet_dorefa.py. Oddly, /root/tensorpack_data contains a caffe_ilsvrc12.tar.gz file that is only 4kb in size, although it should be 17MB. This is a little confusing to me. Any help is appreciated! @ppwwyyxx The error looks like this:

    root@997991b14e71:/data/home/users/ccc/projects/tensorpack/examples/DoReFa-Net# ./alexnet-dorefa.py --dorefa 1,2,6 --data /data/data/ImageNetOrigin --gpu 4,5,6,7
    /root/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
      from ._conv import register_converters as _register_converters
    [0703 06:54:57 @logger.py:109] WRN Log directory train_log/alexnet-dorefa-1,2,6 exists! Use 'd' to delete it. 
    [0703 06:54:57 @logger.py:112] WRN If you're resuming from a previous run, you can choose to keep it.
    Press any other key to exit. 
    Select Action: k (keep) / d (delete) / q (quit):d
    [0703 06:54:58 @logger.py:74] Argv: ./alexnet-dorefa.py --dorefa 1,2,6 --data /data/data/ImageNetOrigin --gpu 4,5,6,7
    [0703 06:54:58 @alexnet-dorefa.py:222] Batch per tower: 64
    [0703 06:54:58 @fs.py:88] WRN Env var $TENSORPACK_DATASET not set, using /root/tensorpack_data for datasets.
    caffe_ilsvrc12.tar.gz: 8.19kB [00:00, 26.0kB/s]
    Succesfully downloaded caffe_ilsvrc12.tar.gz. 2942 bytes.
    Traceback (most recent call last):
      File "./alexnet-dorefa.py", line 224, in <module>
        config = get_config()
      File "./alexnet-dorefa.py", line 147, in get_config
        data_train = get_data('train')
      File "./alexnet-dorefa.py", line 143, in get_data
        args.data, dataset_name, BATCH_SIZE, augmentors)
      File "/data/home/users/ccc/projects/tensorpack/examples/DoReFa-Net/imagenet_utils.py", line 101, in get_imagenet_dataflow
        ds = dataset.ILSVRC12(datadir, name, shuffle=True)
      File "/root/anaconda2/lib/python2.7/site-packages/tensorpack/dataflow/dataset/ilsvrc.py", line 247, in __init__
        dir, name, meta_dir, shuffle, dir_structure)
      File "/root/anaconda2/lib/python2.7/site-packages/tensorpack/dataflow/dataset/ilsvrc.py", line 158, in __init__
        meta = ILSVRCMeta(meta_dir)
      File "/root/anaconda2/lib/python2.7/site-packages/tensorpack/dataflow/dataset/ilsvrc.py", line 32, in __init__
        self._download_caffe_meta()
      File "/root/anaconda2/lib/python2.7/site-packages/tensorpack/dataflow/dataset/ilsvrc.py", line 57, in _download_caffe_meta
        tarfile.open(fpath, 'r:gz').extractall(self.dir)
      File "/root/anaconda2/lib/python2.7/tarfile.py", line 1693, in open
        return func(name, filemode, fileobj, **kwargs)
      File "/root/anaconda2/lib/python2.7/tarfile.py", line 1751, in gzopen
        raise ReadError("not a gzip file")
    tarfile.ReadError: not a gzip file
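
    The log above reports a 2942-byte download followed by "not a gzip file", which suggests the saved file is not the real archive. A minimal sketch, using only the standard library, of checking the gzip magic bytes before extracting; `looks_like_gzip` is a hypothetical helper and not part of Tensorpack:

```python
import gzip
import os
import tempfile


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


# Demonstrate on two temporary files: a real gzip file and a non-gzip one.
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.tar.gz")
bad = os.path.join(tmpdir, "bad.tar.gz")

with gzip.open(good, "wb") as f:
    f.write(b"payload")
with open(bad, "wb") as f:
    f.write(b"placeholder for whatever non-gzip bytes were saved")

good_ok = looks_like_gzip(good)
bad_ok = looks_like_gzip(bad)
```

    If the check fails on the downloaded file, deleting it so that it is fetched again may be a reasonable first step.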
    
    examples 
    opened by brisker 55
  • Quantizing Gradients - Meaning of max0() operator in DoReFa v2 paper?


    Thank you for your help so far.

    (1) In section 2.5 on quantizing gradients you use an operator called max0 but do not define it. I did not find a definition in the XNOR or BNN papers either. What does this operator do? How is it different from the regular max() operator?

    (2) Second, you say that dr / 2max0(|dr|) + 1/2 is an affine transform to map the gradient into [0,1], but it seems like in your code you apply an additional step to manually clip the values. Why do you need this additional step?

    Code: https://github.com/ppwwyyxx/tensorpack/blob/master/examples/DoReFa-Net/dorefa.py

    def grad_fg(op, x):
        rank = x.get_shape().ndims
        assert rank is not None
        maxx = tf.reduce_max(tf.abs(x), list(range(1,rank)), keep_dims=True)
        x = x / maxx
        n = float(2**bitG-1)
        x = x * 0.5 + 0.5 + tf.random_uniform(
            tf.shape(x), minval=-0.5/n, maxval=0.5/n)
        x = tf.clip_by_value(x, 0.0, 1.0)  # this is the extra step not in the paper
        x = quantize(x, bitG) - 0.5
        return x * maxx * 2
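
    On question (2), one way to see why the clip is there, based only on the code above: after x / maxx the values lie in [-1, 1], so x * 0.5 + 0.5 lies in [0, 1], and the added uniform noise in [-0.5/n, 0.5/n] can then push values slightly outside that range. A plain-Python sketch of the arithmetic, with scalars instead of tensors; `bit_g`, `shifted_range`, and `clip` are names made up for this example:

```python
def shifted_range(bit_g):
    """Range of x*0.5 + 0.5 + noise for x in [-1, 1] and noise in [-0.5/n, 0.5/n]."""
    n = float(2 ** bit_g - 1)
    low = -1.0 * 0.5 + 0.5 - 0.5 / n
    high = 1.0 * 0.5 + 0.5 + 0.5 / n
    return low, high


def clip(value, lo=0.0, hi=1.0):
    """Scalar counterpart of tf.clip_by_value."""
    return max(lo, min(hi, value))


low, high = shifted_range(bit_g=6)      # n = 63
clipped = (clip(low), clip(high))       # both ends are back inside [0, 1]
```

    Under this reading, the affine transform alone maps into [0, 1] and the clip only compensates for the noise term.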
    

    (3) I am also having trouble understanding this line, could you please explain? - maxx = tf.reduce_max(tf.abs(x), list(range(1,rank)), keep_dims=True).

    It seems like list(range(1,rank)) is somehow related to your statement that "Here dr = ∂c/∂r is the back-propagated gradient of the output r of some layer, and the maximum is taken over all axis of the gradient tensor dr except for the mini-batch axis (therefore each instance in a mini-batch will have its own scaling factor)", but I do not understand this sentence either. Thank you for your help!
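
    For question (3), list(range(1, rank)) lists every axis except axis 0, so the reduction keeps one maximum per mini-batch instance. A small sketch with nested lists in place of tensors, assuming axis 0 is the batch axis as in the quoted sentence; `flatten` and `per_instance_absmax` are illustrative names:

```python
def flatten(nested):
    """Yield every scalar inside an arbitrarily nested list."""
    for item in nested:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


def per_instance_absmax(batch):
    """Max of |x| over all axes except axis 0: one scaling factor per instance."""
    return [max(abs(v) for v in flatten(instance)) for instance in batch]


rank = 3
axes = list(range(1, rank))             # every axis but the batch axis

# A batch of 2 instances, each of shape 2 x 2.
grad = [
    [[0.5, -2.0], [1.0, 0.25]],
    [[-0.1, 0.05], [0.2, -0.4]],
]
scales = per_instance_absmax(grad)      # one value per instance
```

    With keep_dims=True the TensorFlow result keeps the reduced axes with length 1, so it broadcasts against x in x / maxx.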

    examples 
    opened by the-bobo 35
  • train on an Atari game: Breakout-v0 (Utilization of gpu and convergence)


    Hello Yuxin,

    I am training on an Atari game and I noticed that GPU utilization (nvidia-smi -l) is very low (~10-50%). Could you comment on that, please?

    nvidia-smi-l.txt

    Could you also tell me whether my training is going all right, please? It has been running for quite a long time and I would like to make sure that it is making progress.

    Part of the output:

    ................
    [0120 23:32:23 @timer.py:46] Epoch 273 (global_step 1638000) finished, time:2611.25sec.
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv0/W/rms: 0.0015963
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv0/b/rms: 0.034784
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv1/W/rms: 0.00075034
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv1/b/rms: 0.014863
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv2/W/rms: 0.00071202
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv2/b/rms: 0.0056869
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv3/W/rms: 0.00084989
    [0120 23:32:24 @stats.py:101] SummaryGradient/conv3/b/rms: 0.0093001
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc-pi/W/rms: 0.0036259
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc-pi/b/rms: 0.0050046
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc-v/W/rms: 0.023725
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc-v/b/rms: 0.030802
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc0/W/rms: 0.00015396
    [0120 23:32:24 @stats.py:101] SummaryGradient/fc0/b/rms: 0.0010555
    [0120 23:32:24 @stats.py:101] SummaryGradient/prelu/alpha/rms: 0.083734
    [0120 23:32:24 @stats.py:101] async_global_step: 1.638e+06
    [0120 23:32:24 @stats.py:101] cost: 0.010786
    [0120 23:32:24 @stats.py:101] input_queue_size: 2.3367e-37
    [0120 23:32:24 @stats.py:101] learning_rate: 0.0001
    [0120 23:32:24 @stats.py:101] policy_loss: -0.57677
    [0120 23:32:24 @stats.py:101] predict_reward: 2.8047
    [0120 23:32:24 @stats.py:101] rms_advantage: 0.20093
    [0120 23:32:24 @stats.py:101] value_loss: 2.9039
    [0120 23:32:24 @stats.py:101] xentropy_loss: -189.29
    [0120 23:32:25 @timer.py:42] Start Epoch 274 (global_step 1644000) ...
    100%|#####################################################################|6000/6000[43:57<00:00, 2.22it/s]
    [0121 00:16:22 @timer.py:46] Epoch 274 (global_step 1644000) finished, time:2637.45sec.
    [2017-01-21 00:16:24,998] Making new env: Breakout-v0
    [2017-01-21 00:16:25,189] Making new env: Breakout-v0
    100%|#########################################################################|16/16[06:02<00:00, 0.05it/s]
    [0121 00:22:28 @common.py:76] Waiting for all the workers to finish the last run...
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv0/W/rms: 0.0017033
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv0/b/rms: 0.030689
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv1/W/rms: 0.00074152
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv1/b/rms: 0.01373
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv2/W/rms: 0.00068949
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv2/b/rms: 0.005354
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv3/W/rms: 0.00080288
    [0121 00:22:28 @stats.py:101] SummaryGradient/conv3/b/rms: 0.0079926
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc-pi/W/rms: 0.0033409
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc-pi/b/rms: 0.0056811
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc-v/W/rms: 0.01776
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc-v/b/rms: 0.026071
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc0/W/rms: 0.00015412
    [0121 00:22:28 @stats.py:101] SummaryGradient/fc0/b/rms: 0.001081
    [0121 00:22:28 @stats.py:101] SummaryGradient/prelu/alpha/rms: 0.088892
    [0121 00:22:28 @stats.py:101] async_global_step: 1.644e+06
    [0121 00:22:28 @stats.py:101] cost: 0.0021201
    [0121 00:22:28 @stats.py:101] input_queue_size: 0.00082628
    [0121 00:22:28 @stats.py:101] learning_rate: 0.0001
    [0121 00:22:28 @stats.py:101] max_score: 864
    [0121 00:22:28 @stats.py:101] mean_score: 543.19
    [0121 00:22:28 @stats.py:101] policy_loss: -1.5347
    [0121 00:22:28 @stats.py:101] predict_reward: 2.6608
    [0121 00:22:28 @stats.py:101] rms_advantage: 0.19512
    [0121 00:22:28 @stats.py:101] value_loss: 2.762
    [0121 00:22:28 @stats.py:101] xentropy_loss: -191.18
    [0121 00:22:28 @group.py:42] Callbacks took 364.255 sec in total.
    Periodic-Evaluator: 363.350sec
    [0121 00:22:28 @timer.py:42] Start Epoch 275 (global_step 1650000) ...
    ......................

    examples 
    opened by ghost 33
  • Train Faster RCNN


    I get an error when training Faster RCNN based on your example; however, with your model, I am able to evaluate its performance and get the same results you posted on GitHub.

    Always include the following:

    1. What you did. (command you run if using examples; post or describe your code if not)

    ./examples/FasterRCNN/train.py --load snapshots/tensorpack/COCO-ResNet50-FasterRCNN.npz --gpu 2,3 --datadir /path/to/COCO14 --logdir snapshots/fasterRCNN-ResNet50

    2. What you observed. (training logs)
    [1116 16:23:10 @graph.py:70] Running Op sync_variables_from_main_tower ...  
    2017-11-16 16:23:10.457645: E tensorflow/stream_executor/cuda/cuda_driver.cc:1299] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED  
    [1116 16:23:14 @param.py:144] After epoch 0, learning_rate will change to 0.00300000  
    [1116 16:23:15 @base.py:209] Start Epoch 1 ...
    

    After that the program idles there forever. Is it related to the line about CUDA_ERROR_NOT_INITIALIZED?

    3. Your environment (TF version, GPUs), if it matters. TF version 1.4.0, Python-3.6, CUDA 9, CUDNN-7. Tensorpack version: the newest commit.

    4. Others:

    • If I comment out ds = PrefetchDataZMQ(ds, 1) in the get_train_dataflow function of data.py, the training runs. Replacing ds = PrefetchDataZMQ(ds, 1) with ds = PrefetchData(ds, 500, 1) works as well.

    Thanks.

    opened by chunfuchen 32
  • Build ZMQ-operator


    I tried to compile your custom operator on my machine and got:

    Compiling user ops ...
    make: Entering directory '/home/patwie/git/tensorpack/tensorpack/user_ops'
    [dep] zmq_recv_op.cc ...
    In file included from zmq_conn.h:8:0,
                     from zmq_recv_op.cc:10:
    zmq.hpp:84:36: error: missing binary operator before token "("
     #if ZMQ_VERSION >= ZMQ_MAKE_VERSION(3, 3, 0)
    

    Could you briefly comment on which zmq version you use? I had to change

    //#include <zmq.hpp> into
    #include "zmq.hpp"
    

    and use https://github.com/zeromq/cppzmq

    But I am still getting the error.

    enhancement 
    opened by PatWie 32
  • Bug Reports: How to deal with ValueError: Cannot feed value of shape (224, 224, 3) for Tensor 'input:0', which has shape '(?, 224, 224, 3)'


    The first run after rebooting the server seems to be OK. On subsequent attempts, I get this error message.

    The log is as below:

    [1026 20:26:32 @logger.py:74] Argv: main.py
    [1026 20:26:32 @tensor_net.py:46] Running on 2 towers. Batch size per tower: 64
    [1026 20:26:32 @fs.py:89] WRN Env var $TENSORPACK_DATASET not set, using /home/hgao/tensorpack_data for datasets.
    [1026 20:26:34 @prefetch.py:263] [PrefetchDataZMQ] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
    [1026 20:26:34 @ilsvrc.py:118] Assuming directory /tempspace2/hgao/data/imagenet/val has original structure.
    [1026 20:26:34 @param.py:189] Use ./logdir/hyper.txt to set hyperparam: 'learning_rate'.
    [1026 20:26:34 @inference_runner.py:83] InferenceRunner will eval on an InputSource of size 782
    [1026 20:27:04 @input_source.py:178] Setting up the queue 'QueueInput/input_queue' for CPU prefetching ...
    [1026 20:27:04 @input_source.py:459] Setting up StagingArea for GPU prefetching ...
    [1026 20:27:04 @training.py:41] Training a model of 2 towers
    [1026 20:27:04 @training.py:92] Building graph for training tower 0 on device LeastLoadedDeviceSetter-/gpu:0...
    [1026 20:27:06 @regularize.py:108] Add REGULARIZATION_LOSSES of 58 tensors on the total cost.
    [1026 20:27:07 @training.py:92] Building graph for training tower 1 on device LeastLoadedDeviceSetter-/gpu:1...
    [1026 20:27:08 @regularize.py:108] Add REGULARIZATION_LOSSES of 58 tensors on the total cost.
    [1026 20:27:10 @model_utils.py:47] Model Parameters:
    name shape dim device


    conv_s/weights:0 [3, 3, 3, 32] 864 /device:GPU:0 conv_s/batch_norm/gamma:0 [32] 32 /device:GPU:1 conv_s/batch_norm/beta:0 [32] 32 /device:GPU:1 conv_1_0/conv1/conv/weights:0 [3, 3, 32, 1] 288 /device:GPU:1 conv_1_0/conv1/batch_norm/gamma:0 [32] 32 /device:GPU:1 conv_1_0/conv1/batch_norm/beta:0 [32] 32 /device:GPU:1 conv_1_0/conv2/weights:0 [1, 1, 32, 64] 2048 /device:GPU:1 conv_1_0/conv2/batch_norm/gamma:0 [64] 64 /device:GPU:0 conv_1_0/conv2/batch_norm/beta:0 [64] 64 /device:GPU:0 conv_1_1/conv1/conv/weights:0 [3, 3, 64, 1] 576 /device:GPU:0 conv_1_1/conv1/batch_norm/gamma:0 [64] 64 /device:GPU:0 conv_1_1/conv1/batch_norm/beta:0 [64] 64 /device:GPU:0 conv_1_1/conv2/weights:0 [1, 1, 64, 128] 8192 /device:GPU:0 conv_1_1/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_1_1/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_1_2/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_1_2/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_1_2/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_1_2/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_1_2/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_1_2/conv2/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_1_3/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:0 conv_1_3/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_1_3/conv1/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_1_3/conv2/weights:0 [1, 1, 128, 256] 32768 /device:GPU:0 conv_1_3/conv2/batch_norm/gamma:0 [256] 256 /device:GPU:1 conv_1_3/conv2/batch_norm/beta:0 [256] 256 /device:GPU:1 conv_1_4/conv1/conv/weights:0 [3, 3, 256, 1] 2304 /device:GPU:1 conv_1_4/conv1/batch_norm/gamma:0 [256] 256 /device:GPU:1 conv_1_4/conv1/batch_norm/beta:0 [256] 256 /device:GPU:1 conv_1_4/conv2/weights:0 [1, 1, 256, 256] 65536 /device:GPU:1 conv_1_4/conv2/batch_norm/gamma:0 [256] 256 /device:GPU:0 conv_1_4/conv2/batch_norm/beta:0 [256] 256 /device:GPU:0 conv_1_5/conv1/conv/weights:0 [3, 3, 256, 1] 2304 /device:GPU:0 
conv_1_5/conv1/batch_norm/gamma:0 [256] 256 /device:GPU:0 conv_1_5/conv1/batch_norm/beta:0 [256] 256 /device:GPU:0 conv_1_5/conv2/weights:0 [1, 1, 256, 512] 131072 /device:GPU:0 conv_1_5/conv2/batch_norm/gamma:0 [512] 512 /device:GPU:1 conv_1_5/conv2/batch_norm/beta:0 [512] 512 /device:GPU:1 conv_2/group_0_conv0/conv/weights:0 [1, 1, 4, 1, 1] 4 /device:GPU:1 conv_2/group_0/conv_0/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_0/conv_0/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_0/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_0/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_0/conv_0/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_0/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_1/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_0/conv_1/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_1/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_1/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_0/conv_1/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_1/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_2/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_0/conv_2/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_2/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_2/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_0/conv_2/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_2/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_3/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_0/conv_3/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_3/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_3/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 
conv_2/group_0/conv_3/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_3/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_4/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_0/conv_4/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_4/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_0/conv_4/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_0/conv_4/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_0/conv_4/conv2/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_1_conv0/conv/weights:0 [1, 1, 4, 1, 1] 4 /device:GPU:0 conv_2/group_1/conv_0/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:0 conv_2/group_1/conv_0/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_1/conv_0/conv1/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_1/conv_0/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:0 conv_2/group_1/conv_0/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_1/conv_0/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_1/conv_1/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_1/conv_1/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_1/conv_1/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_1/conv_1/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_1/conv_1/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_1/conv_1/conv2/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_1/conv_2/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:0 conv_2/group_1/conv_2/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_1/conv_2/conv1/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_1/conv_2/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:0 conv_2/group_1/conv_2/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_1/conv_2/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_1/conv_3/conv1/conv/weights:0 [3, 3, 128, 
1] 1152 /device:GPU:1 conv_2/group_1/conv_3/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_1/conv_3/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_1/conv_3/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_1/conv_3/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_1/conv_3/conv2/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_1/conv_4/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:0 conv_2/group_1/conv_4/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_1/conv_4/conv1/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_1/conv_4/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:0 conv_2/group_1/conv_4/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_1/conv_4/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_2_conv0/conv/weights:0 [1, 1, 4, 1, 1] 4 /device:GPU:1 conv_2/group_2/conv_0/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_2/conv_0/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_2/conv_0/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_2/conv_0/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_2/conv_0/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_2/conv_0/conv2/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_2/conv_1/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:0 conv_2/group_2/conv_1/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_2/conv_1/conv1/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_2/conv_1/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:0 conv_2/group_2/conv_1/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_2/conv_1/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_2/conv_2/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_2/conv_2/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_2/conv_2/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 
conv_2/group_2/conv_2/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_2/conv_2/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_2/conv_2/conv2/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_2/conv_3/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:0 conv_2/group_2/conv_3/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_2/conv_3/conv1/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_2/conv_3/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:0 conv_2/group_2/conv_3/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_2/conv_3/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_2/conv_4/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_2/conv_4/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_2/conv_4/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_2/conv_4/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_2/conv_4/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_2/conv_4/conv2/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_3_conv0/conv/weights:0 [1, 1, 4, 1, 1] 4 /device:GPU:0 conv_2/group_3/conv_0/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:0 conv_2/group_3/conv_0/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_3/conv_0/conv1/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_3/conv_0/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:0 conv_2/group_3/conv_0/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_3/conv_0/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_3/conv_1/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_3/conv_1/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_3/conv_1/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_3/conv_1/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_3/conv_1/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_3/conv_1/conv2/batch_norm/beta:0 
[128] 128 /device:GPU:0 conv_2/group_3/conv_2/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:0 conv_2/group_3/conv_2/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_3/conv_2/conv1/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_3/conv_2/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:0 conv_2/group_3/conv_2/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_3/conv_2/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_3/conv_3/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:1 conv_2/group_3/conv_3/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_3/conv_3/conv1/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_2/group_3/conv_3/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:1 conv_2/group_3/conv_3/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_3/conv_3/conv2/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_3/conv_4/conv1/conv/weights:0 [3, 3, 128, 1] 1152 /device:GPU:0 conv_2/group_3/conv_4/conv1/batch_norm/gamma:0 [128] 128 /device:GPU:0 conv_2/group_3/conv_4/conv1/batch_norm/beta:0 [128] 128 /device:GPU:0 conv_2/group_3/conv_4/conv2/weights:0 [1, 1, 128, 128] 16384 /device:GPU:0 conv_2/group_3/conv_4/conv2/batch_norm/gamma:0 [128] 128 /device:GPU:1 conv_2/group_3/conv_4/conv2/batch_norm/beta:0 [128] 128 /device:GPU:1 conv_3_0/conv1/conv/weights:0 [3, 3, 512, 1] 4608 /device:GPU:1 conv_3_0/conv1/batch_norm/gamma:0 [512] 512 /device:GPU:1 conv_3_0/conv1/batch_norm/beta:0 [512] 512 /device:GPU:1 conv_3_0/conv2/weights:0 [1, 1, 512, 1024] 524288 /device:GPU:1 conv_3_0/conv2/batch_norm/gamma:0 [1024] 1024 /device:GPU:0 conv_3_0/conv2/batch_norm/beta:0 [1024] 1024 /device:GPU:0 conv_3_1/conv1/conv/weights:0 [3, 3, 1024, 1] 9216 /device:GPU:0 conv_3_1/conv1/batch_norm/gamma:0 [1024] 1024 /device:GPU:0 conv_3_1/conv1/batch_norm/beta:0 [1024] 1024 /device:GPU:0 conv_3_1/conv2/weights:0 [1, 1, 1024, 1024] 1048576 /device:GPU:0 conv_3_1/conv2/batch_norm/gamma:0 [1024] 1024 /device:GPU:1 
conv_3_1/conv2/batch_norm/beta:0 [1024] 1024 /device:GPU:1 out/pool/batch_norm/gamma:0 [1024] 1024 /device:GPU:1 out/pool/batch_norm/beta:0 [1024] 1024 /device:GPU:1 out/dense/weights:0 [1024, 1000] 1024000 /device:GPU:1 out/dense/biases:0 [1000] 1000 /device:GPU:0
    Total #vars=179, #param=3251000 (12.40 MB assuming all float32)
    [1026 20:27:10 @base.py:207] Setup callbacks graph ...
    [1026 20:27:11 @input_source.py:178] Setting up the queue 'DataParallelInferenceRunner/QueueInput/input_queue' for CPU prefetching ...
    [1026 20:27:11 @predictor_factory.py:54] Building predictor tower 'InferenceTower0' on device /gpu:0 ...
    [1026 20:27:12 @predictor_factory.py:54] Building predictor tower 'InferenceTower1' on device /gpu:1 ...
    [1026 20:27:13 @summary.py:34] Maintain moving average summary of 4 tensors.
    [1026 20:27:13 @graph.py:91] Applying collection UPDATE_OPS of 232 ops.
    [1026 20:27:16 @base.py:212] Creating the session ...
    [1026 20:27:19 @base.py:216] Initializing the session ...
    [1026 20:27:19 @base.py:223] Graph Finalized.
    [1026 20:27:21 @concurrency.py:36] Starting EnqueueThread DataParallelInferenceRunner/QueueInput/input_queue ...
    [1026 20:27:21 @concurrency.py:36] Starting EnqueueThread QueueInput/input_queue ...
    [1026 20:27:21 @input_source.py:418] Pre-filling staging area ...
    [1026 20:27:21 @input_source.py:140] ERR Exception in EnqueueThread DataParallelInferenceRunner/QueueInput/input_queue:
    Traceback (most recent call last):
      File "/tempspace/hgao/py3.6/lib/python3.6/site-packages/tensorpack/input_source/input_source.py", line 133, in run
        self.op.run(feed_dict=feed)
      File "/tempspace/hgao/py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2084, in run
        _run_using_default_session(self, feed_dict, self.graph, session)
      File "/tempspace/hgao/py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4542, in _run_using_default_session
        session.run(operation, feed_dict)
      File "/tempspace/hgao/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
        run_metadata_ptr)
      File "/tempspace/hgao/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1096, in _run
        % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
    ValueError: Cannot feed value of shape (224, 224, 3) for Tensor 'input:0', which has shape '(?, 224, 224, 3)'
    [1026 20:27:22 @input_source.py:146] EnqueueThread DataParallelInferenceRunner/QueueInput/input_queue Exited.
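
    The error at the end of the log says the fed value has three dimensions while the placeholder expects four, with a free batch dimension. The rule can be sketched in plain Python as below; `shape_compatible` is a hypothetical helper that mimics the check (None stands for ?), not TensorFlow's implementation:

```python
def shape_compatible(value_shape, placeholder_shape):
    """True if value_shape can be fed to the placeholder; None matches any size."""
    if len(value_shape) != len(placeholder_shape):
        return False
    return all(p is None or p == v
               for v, p in zip(value_shape, placeholder_shape))


placeholder = (None, 224, 224, 3)       # shown as (?, 224, 224, 3) in the log
single_image = (224, 224, 3)            # one datapoint, no batch axis
batched = (64,) + single_image          # 64 datapoints stacked along axis 0

unbatched_ok = shape_compatible(single_image, placeholder)
batched_ok = shape_compatible(batched, placeholder)
```

    This suggests the dataflow feeding the inference runner yields single images rather than batches; that is a guess from the log, not a confirmed diagnosis.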

    opened by HongyangGao 31
  • MultiProcessRunner RuntimeError


    If you're asking about an unexpected problem which you do not know the root cause, use this template. PLEASE DO NOT DELETE THIS TEMPLATE, FILL IT:

    If you already know the root cause to your problem, feel free to delete everything in this template.

    1. What you did:

    (1) If you're using examples, what's the command you run:

    (2) If you're using examples, have you made any changes to the examples? Paste git status; git diff here:

    (3) If not using examples, tell us what you did:

    It's always better to copy-paste what you did than to describe them.

    Please try to provide enough information to let others reproduce your issues. Without reproducing the issue, we may not be able to investigate it.

    I tried to follow the "Efficient Dataflow" tutorial, continuing from https://github.com/tensorpack/tensorpack/issues/1209.

    2. What you observed:

    (1) Include the ENTIRE logs here:

    It's always better to copy-paste what you observed instead of describing them.

    It's always better to paste as much as possible, although sometimes a partial log is OK.

    Tensorpack typically saves stdout to its training log. If stderr is relevant, you can run a command with my_command 2>&1 | tee logs.txt to save both stdout and stderr to one file.

    [0528 10:55:08 @parallel.py:195] WRN MultiProcessRunner does support Windows. However, Windows requires more strict picklability on processes, which may lead of failure on some of the code.
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 106, in spawn_main
        exitcode = _main(fd)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 115, in _main
        prepare(preparation_data)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 226, in prepare
        _fixup_main_from_path(data['init_main_from_path'])
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 278, in _fixup_main_from_path
        run_name="__mp_main__")
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\runpy.py", line 254, in run_path
        pkg_name=pkg_name, script_name=fname)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\runpy.py", line 96, in _run_module_code
        mod_name, mod_spec, pkg_name, script_name)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "C:\AI_Workspace\z_debug\load_lmdb.py", line 79, in <module>
        load_lmdb3()
      File "C:\AI_Workspace\z_debug\load_lmdb.py", line 69, in load_lmdb3
        ds = MultiProcessRunner(ds, 5000, 1) # NOTE: PrefetchData() deprecated in May 2019
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\site-packages\tensorpack\dataflow\parallel.py", line 214, in __init__
        start_proc_mask_signal(self.procs)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\site-packages\tensorpack\utils\concurrency.py", line 244, in start_proc_mask_signal
        p.start()
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\contextlib.py", line 77, in __exit__
        self.gen.throw(type, value, traceback)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\site-packages\tensorpack\utils\concurrency.py", line 216, in mask_sigint
        yield True
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\site-packages\tensorpack\utils\concurrency.py", line 244, in start_proc_mask_signal
        p.start()
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\process.py", line 105, in start
        self._popen = self._Popen(self)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\context.py", line 212, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\context.py", line 313, in _Popen
        return Popen(process_obj)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\popen_spawn_win32.py", line 34, in __init__
        prep_data = spawn.get_preparation_data(process_obj._name)
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 144, in get_preparation_data
        _check_not_importing_main()
      File "C:\Users\dps42\AppData\Local\Continuum\miniconda3\envs\dps42_dev\lib\multiprocessing\spawn.py", line 137, in _check_not_importing_main
        is not going to be frozen to produce an executable.''')
    RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:
    
            if __name__ == '__main__':
                freeze_support()
                ...
    
        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
    

    I will attach the code here: z_debug.zip

    But please notice that the LMDB file I'm using is too large to be attached to the zip file. The LMDB file was created from the same "debug2.py" but with more images and data entries.

    In the load_lmdb3() function, the code crashes at the MultiProcessRunner() call with a RuntimeError. Maybe another Windows issue? I had the same error before PrefetchData() was renamed to MultiProcessRunner().
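
The RuntimeError text itself points at the main-module idiom: on Windows, child processes are spawned by re-importing the main script, so any code that starts processes at import time runs again in every child. The traceback shows load_lmdb3() being called at module level, which fits that pattern. Below is a minimal standard-library sketch of the idiom; `square` and `main` are hypothetical stand-ins for the dataflow work and for load_lmdb3(), and the sketch does not use tensorpack.

```python
import multiprocessing as mp


def square(x):
    # Stand-in for the per-datapoint work done in a worker process.
    return x * x


def main():
    # Worker processes are created only inside main(), never at import time.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, range(5))


if __name__ == '__main__':
    # With the spawn start method (the only one on Windows), each child
    # re-imports this module. The guard keeps children from calling main()
    # again and trying to start processes of their own.
    mp.freeze_support()  # only matters when the script is frozen into an executable
    print(main())
```

Applied to the script in this issue, the analogous change would be to move the module-level load_lmdb3() call under such a guard; that is a suggestion based on the traceback, not a verified fix.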

    (2) Other observations, if any: For example, CPU/GPU utilization, output images, tensorboard curves, if relevant to your issue.

    3. What you expected, if not obvious.

    If you expect higher speed, please read http://tensorpack.readthedocs.io/tutorial/performance-tuning.html before posting.

    If you expect certain accuracy, only in one of the two conditions can we help with it: (1) You're unable to reproduce the accuracy documented in tensorpack examples. (2) It appears to be a tensorpack bug.

    Otherwise, how to train a model to certain accuracy is a machine learning question. We do not answer machine learning questions and it is your responsibility to figure out how to make your models more accurate.

    4. Your environment:

    • Paste the output of this command: python -c 'import tensorpack.tfutils as u; print(u.collect_env_info())' If this command failed, tell us your version of Python/TF/tensorpack.
    • You can install Tensorpack master by pip install -U git+https://github.com/ppwwyyxx/tensorpack.git and see if your issue is already solved.
    • If you're not using tensorpack under a normal command line shell (e.g., using an IDE or jupyter notebook), please retry under a normal command line shell.
    • Include relevant hardware information, e.g. number of GPUs used for training, amount of RAM.

    You may often want to provide extra information related to your issue, but at the minimum please try to provide the above information accurately to save effort in the investigation.

    Windows 10. I think no GPU was used at the moment.

    enhancement 
    opened by dps42 30
  • how to adapt model-agnostic meta learning in tensorpack

    Hello,

    I would like to do model-agnostic meta-learning (MAML) in tensorpack. The training algorithm for a classification task using model-agnostic meta-learning is below:

    We have fθ as the model with parameters θ; α and β are hyperparameters.

    1. In each iteration, sample [inputa, inputb, labela, labelb] from the training set.
    2. Forward inputa through fθ and evaluate the gradient of the cross-entropy loss.
    3. Compute the adapted parameters with gradient descent: θ' = θ - α∇θfθ(inputa)
    4. Update θ ← θ − β∇θfθ'(inputb)

    https://arxiv.org/abs/1703.03400
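
To make the two-level update concrete, here is a small worked sketch in plain Python. It assumes a scalar parameter θ and a squared-error loss 0.5 * (θx − y)² instead of a network with cross-entropy, so that both gradients can be written by hand. The names `loss_grad` and `maml_step` are hypothetical, and nothing here uses TensorFlow or tensorpack.

```python
def loss_grad(theta, x, y):
    """Gradient of 0.5 * (theta * x - y) ** 2 with respect to theta."""
    return (theta * x - y) * x


def maml_step(theta, xa, ya, xb, yb, alpha, beta):
    """One meta-update on a single task with support (xa, ya) and query (xb, yb)."""
    # Inner step: adapt theta on the support example.
    theta_adapted = theta - alpha * loss_grad(theta, xa, ya)
    # Derivative of the adapted parameter with respect to theta.
    # The inner loss has constant second derivative xa ** 2, so this
    # factor carries the second-order term of the meta-gradient.
    d_adapted = 1.0 - alpha * xa * xa
    # Outer gradient: query loss at theta_adapted, differentiated with
    # respect to the original theta via the chain rule.
    outer_grad = loss_grad(theta_adapted, xb, yb) * d_adapted
    # Outer step: update the original parameter.
    return theta - beta * outer_grad


new_theta = maml_step(0.0, xa=1.0, ya=1.0, xb=1.0, yb=1.0, alpha=0.5, beta=0.1)
print(new_theta)
```

In a graph framework the chain-rule factor is not written by hand: differentiating the query loss with respect to θ through the inner update produces it automatically, unless the inner gradients are wrapped in a stop-gradient, which gives the first-order approximation.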

    The source code of model-agnostic meta learning from github is below:

    for j in range(num_updates - 1):
        # Inner step: loss on inputa under the current fast weights.
        loss = self.loss_func(self.forward(inputa, fast_weights, reuse=True), labela)
        grads = tf.gradients(loss, list(fast_weights.values()))
        if FLAGS.stop_grad:
            # Do not backpropagate through the inner-step gradients.
            grads = [tf.stop_gradient(grad) for grad in grads]
        gradients = dict(zip(fast_weights.keys(), grads))
        fast_weights = dict(zip(fast_weights.keys(), [fast_weights[key] - self.update_lr*gradients[key] for key in fast_weights.keys()]))
        # Evaluate the adapted weights on inputb.
        output = self.forward(inputb, fast_weights, reuse=True)
        task_outputbs.append(output)
        task_lossesb.append(self.loss_func(output, labelb))

    task_output = [task_outputa, task_outputbs, task_lossa, task_lossesb]
    

    https://github.com/cbfinn/maml/blob/master/maml.py

    I'd like to know how, in tensorpack with its trainers, I can access the model weights θ within a training iteration, run a forward pass with inputa, compute the gradient-descent step to obtain the adapted weights θ', and then update the model weights θ using task_lossesb at the end of the iteration, as is usually done.

    usage 
    opened by john81923 30
  • Better ModelDesc

    The original design lacks enough consideration and it's not clear how the graph is built, and what one can and cannot do inside build_graph. E.g.:

    • Is it OK to create placeholders inside build_graph?
    • What symbolic functions are allowed and which are not? (e.g. tf.layers.batch_norm? tf.train.input_producer?)
    • What to put in get_inputs and what not? Is this interface even necessary?
    • FIXED by introducing TowerTrainer, TowerFunc, TowerTensorHandle: How to access a tensor a bit later? Setting self.xxx sadly doesn't work (#287), and using the tensor names is not easy. (#315, #317, #442)
    • RESOLVED: Use return cost for single-cost ModelDesc. For other types of models, you need to write your own trainer anyway, so you'll build the graph by yourself anyway. On the contrary, self.cost needs to be set. This seems very hard-coded, and the reason behind it is that self.cost is only set because some (but not all) trainers need it. This contract between Model and Trainer needs to be addressed in a clearer way.
    • FIXED: What's worse, some examples are now actually using self.xxx. Technically they should not rely on this unsupported use.
    • Fancy dynamic stuff might also be hard, but I'm not very familiar.

    Some example use cases that are hard or too tricky to do with the current interface:

    • Input data has different layout (needs different placeholder) in training vs inference.
    • Access some tensors in all towers.
    • Mix of data/model parallel. A special case is to create some variables (not reuse) in each tower.

    Nothing should be deprecated because the current interface works well for most problems. But I'm thinking about new ones which can expose more of the graph building process to users.

    enhancement 
    opened by ppwwyyxx 30
  • Stuck in Pre-filling StagingArea

    Hi there, thanks for tensorpack! I am training a segmentation model on Cityscapes. I wrote a dataflow modeled on get_imagenet_dataflow():

    def __iter__(self):
        for img_addr, gt_addr in self.lst:
            img = cv2.cvtColor(cv2.imread(img_addr, cv2.IMREAD_COLOR), cv2.COLOR_BGR2RGB)
            gt = cv2.imread(gt_addr, cv2.IMREAD_GRAYSCALE)
            yield [img, gt]
    

    I tested this dataflow with the code below. It prints the NumPy arrays and reaches about 30 it/s (8 cores), but then suddenly stops partway through, e.g. at 250/5000.

    ds = PrefetchDataZMQ(ds, parallel)
    ds = BatchData(ds, batch_size, remainder=False)
    ds.reset_state()
    print(next(ds.get_data()))
    TestDataSpeed(ds).start()
    
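
When a speed test stalls partway through, it can help to time the plain Python iterator before any parallel prefetching is added, to see whether the stall comes from reading the data or from the parallel wrapper. The sketch below is a library-free stand-in for that kind of measurement; `measure_throughput` and `fake_dataflow` are hypothetical names, and this is not how TestDataSpeed is implemented.

```python
import time


def measure_throughput(iterable, max_items=1000):
    """Consume up to max_items items and return (count, items_per_second)."""
    count = 0
    start = time.perf_counter()
    for _ in iterable:
        count += 1
        if count >= max_items:
            break
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float('inf')
    return count, rate


def fake_dataflow(n):
    # Hypothetical stand-in for a dataflow's __iter__: yields [image, label] pairs.
    for i in range(n):
        yield [i, i * 2]


count, rate = measure_throughput(fake_dataflow(5000), max_items=250)
print("consumed {} datapoints at {:.1f} it/s".format(count, rate))
```

If the unwrapped iterator runs to completion at a steady rate, the stall is more likely in the prefetching or batching stage than in the image loading itself.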

    Then I ran training with SyncMultiGPUTrainerParameterServer. The problem is that it gets stuck at "Pre-filling StagingArea", as shown in the log below. At the start, CPU usage is at 104% with little GPU memory usage; after about 10-15 minutes, CPU usage drops and GPU memory usage increases, but there is no computation on the GPU (GPU-Util 0%). I have no idea what I did wrong. Could you give me some insight on this? Thanks so much.

    [0926 11:25:49 @base.py:211] Initializing the session ...
    [0926 11:25:49 @base.py:218] Graph Finalized.
    [0926 11:25:50 @concurrency.py:37] Starting EnqueueThread QueueInput/input_queue ...
    [0926 11:26:01 @param.py:148] [HyperParamSetter] At global_step=0, learning_rate will change to 0.00025000
    [0926 11:26:03 @base.py:250] Start Epoch 1 ...
      0%|                                                                                                              |0/371[00:00<?,?it/s]
    [0926 11:26:03 @input_source.py:550] Pre-filling StagingArea ...
    [0926 11:26:05 @input_source.py:554] 1 element was put into StagingArea on each tower.
    

    My environment:

    • Python version: Python 2.7
    • TF version: tf 1.6.0
    • Tensorpack version: 0.8.9.
    • OS: Ubuntu 16.04
    • Hardware information: E5 2630, 4 1080Ti GPUs.
    usage 
    opened by s7ev3n 27
  • Add MMEval support for COCO detection evaluation

    Hi, thanks for this nice work!

    This PR provides a new evaluation tool for examples/FasterRCNN: MMEval.

    MMEval is a unified evaluation library for multiple machine-learning libraries. Its home page is https://github.com/open-mmlab/mmeval

    coco_det_mmeval.py supports multi-GPU and multi-node evaluation with mpi4py:

    # run evaluation
    python tensorpack_mmeval.py --load <model_path>
    
    # launch multi-gpus evaluation by mpirun
    mpirun -np 8 python tensorpack_mmeval.py --load <model_path>
    

    We tested this evaluation script on COCO-MaskRCNN-R50C41x and got the same evaluation results as those reported by Tensorpack.

    Related reference: https://github.com/open-mmlab/mmeval/tree/main/examples/tensorpack

    opened by ice-tong 0
  • Option to disable the tqdm progress bars

    Could you guys add the option to disable the tqdm progress bar? I made the code change here, adding a keyword argument "pbar_disable", but I'm not able to check it in.

    def send_dataflow_zmq(df, addr, hwm=50, format=None, bind=False, pbar_disable=False):
        """
        Run DataFlow and send data to a ZMQ socket addr.
        It will serialize and send each datapoint to this address with a PUSH socket.
        This function never returns.
    
        Args:
            df (DataFlow): Will infinitely loop over the DataFlow.
            addr: a ZMQ socket endpoint.
            hwm (int): ZMQ high-water mark (buffer size)
            format (str): The serialization format.
                 Default format uses :mod:`utils.serialize`.
                 This format works with :class:`dataflow.RemoteDataZMQ`.
                 An alternate format is 'zmq_ops', used by https://github.com/tensorpack/zmq_ops
                 and :class:`input_source.ZMQInput`.
            bind (bool): whether to bind or connect to the endpoint address.
            pbar_disable (bool): whether to disable the tqdm progress bar.
        """
        assert format in [None, 'zmq_op', 'zmq_ops']
        if format is None:
            dump_fn = dumps
        else:
            from zmq_ops import dump_arrays
            dump_fn = dump_arrays
    
        ctx = zmq.Context()
        socket = ctx.socket(zmq.PUSH)
        socket.set_hwm(hwm)
        if bind:
            socket.bind(addr)
        else:
            socket.connect(addr)
        try:
            df.reset_state()
            logger.info("Serving data to {} with {} format ...".format(
                addr, 'default' if format is None else 'zmq_ops'))
            INTERVAL = 200
            q = deque(maxlen=INTERVAL)
    
            try:
                total = len(df)
            except NotImplementedError:
                total = 0
            tqdm_args = get_tqdm_kwargs(
                leave=True, smoothing=0.8, disable=pbar_disable)
            tqdm_args['bar_format'] = tqdm_args['bar_format'] + "{postfix}"
            while True:
                with tqdm.trange(total, **tqdm_args) as pbar:
                    for dp in df:
                        start = time.time()
                        socket.send(dump_fn(dp), copy=False)
                        q.append(time.time() - start)
                        pbar.update(1)
                        if pbar.n % INTERVAL == 0:
                            avg = "{:.3f}".format(sum(q) / len(q))
                            pbar.set_postfix({'AvgSendLat': avg})
        finally:
            logger.info("Exiting send_dataflow_zmq ...")
            socket.setsockopt(zmq.LINGER, 0)
            socket.close()
            if not ctx.closed:
                ctx.destroy(0)
    
    opened by actuallyaswin 0
  • Issue when using automatic mixed precision in training with evaluation callback

    1. What you did:

    I tried to use automatic mixed precision when training a MaskRCNN model via a graph rewrite, as described at https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/train/experimental/enable_mixed_precision_graph_rewrite (TensorFlow r1.15 docs). I added the following line at the end of the GeneralizedRCNN.optimizer() method in generalized_rcnn.py: opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

    2. What you observed:

    When I train the model without evaluation callback, there is no issue at all. Once it is trained, if I load the model with OfflinePredictor, it also works well. However, if I train the model with evaluation callback, I get the following error during the first evaluation:

    InternalError                             Traceback (most recent call last)
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _do_call(self, fn, *args)
       1364     try:
    -> 1365       return fn(*args)
       1366     except errors.OpError as e:
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
       1349       return self._call_tf_sessionrun(options, feed_dict, fetch_list,
    -> 1350                                       target_list, run_metadata)
       1351 
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
       1442                                             fetch_list, target_list,
    -> 1443                                             run_metadata)
       1444 
    
    InternalError: 2 root error(s) found.
      (0) Internal: Blas GEMM launch failed : a.shape=(12032000, 1), b.shape=(1, 4), m=12032000, n=4, k=1
    	 [[{{node tower-pred-0/fpn/upsample_lat4/Tensordot/MatMul}}]]
      (1) Internal: Blas GEMM launch failed : a.shape=(12032000, 1), b.shape=(1, 4), m=12032000, n=4, k=1
    	 [[{{node tower-pred-0/fpn/upsample_lat4/Tensordot/MatMul}}]]
    0 successful operations.
    0 derived errors ignored.
    
    During handling of the above exception, another exception occurred:
    
    InternalError                             Traceback (most recent call last)
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/train/interface.py in launch_train_with_config(config, trainer)
         97         starting_epoch=config.starting_epoch,
         98         max_epoch=config.max_epoch,
    ---> 99         extra_callbacks=config.extra_callbacks)
        100 
        101 
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py in train_with_defaults(self, _sentinel, callbacks, monitors, session_creator, session_init, steps_per_epoch, starting_epoch, max_epoch, extra_callbacks)
        340         self.train(callbacks, monitors,
        341                    session_creator, session_init,
    --> 342                    steps_per_epoch, starting_epoch, max_epoch)
        343 
        344     def __new__(cls, *args, **kwargs):
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py in train(self, callbacks, monitors, session_creator, session_init, steps_per_epoch, starting_epoch, max_epoch)
        312         self.setup_callbacks(callbacks, monitors)
        313         self.initialize(session_creator, session_init)
    --> 314         self.main_loop(steps_per_epoch, starting_epoch, max_epoch)
        315 
        316     def train_with_defaults(
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/utils/argtools.py in wrapper(*args, **kwargs)
        166         cache.add(func)
        167 
    --> 168         return func(*args, **kwargs)
        169 
        170     return wrapper
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py in main_loop(self, steps_per_epoch, starting_epoch, max_epoch)
        284 
        285                     # trigger epoch outside the timing region.
    --> 286                     self._callbacks.trigger_epoch()
        287                 logger.info("Training has finished!")
        288             except (StopTraining, tf.errors.OutOfRangeError) as e:
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/base.py in trigger_epoch(self)
        154 
        155     def trigger_epoch(self):
    --> 156         self._trigger_epoch()
        157 
        158     def _trigger_epoch(self):
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/group.py in _trigger_epoch(self)
         93             display_name = str(cb)
         94             with tm.timed_callback(display_name):
    ---> 95                 cb.trigger_epoch()
         96         tm.log()
         97 
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/base.py in trigger_epoch(self)
        154 
        155     def trigger_epoch(self):
    --> 156         self._trigger_epoch()
        157 
        158     def _trigger_epoch(self):
    
    /opt/conda/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
        433                 raise CancelledError()
        434             elif self._state == FINISHED:
    --> 435                 return self.__get_result()
        436             else:
        437                 raise TimeoutError()
    
    /opt/conda/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
        382     def __get_result(self):
        383         if self._exception:
    --> 384             raise self._exception
        385         else:
        386             return self._result
    
    /opt/conda/lib/python3.7/concurrent/futures/thread.py in run(self)
         55 
         56         try:
    ---> 57             result = self.fn(*self.args, **self.kwargs)
         58         except BaseException as exc:
         59             self.future.set_exception(exc)
    
    /home/jovyan/eval.py in predict_dataflow()
    --> 157               outputs = predict_image(img, model_func)
    
    /home/jovyan/eval.py in predict_image(img, model_func)
    ---> 46     outputs = model_func(img)
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/predict/base.py in __call__(self, *dp)
         39             list[array]: list of outputs
         40         """
    ---> 41         output = self._do_call(dp)
         42         if self.return_input:
         43             return (dp, output)
    
    /opt/conda/lib/python3.7/site-packages/tensorpack/predict/base.py in _do_call(self, dp)
        134         # run_metadata = tf.RunMetadata()
        135         # options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    --> 136         return self._callable(*dp)
        137 
        138 
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _generic_run(*feed_args, **kwargs)
       1230             feed: feed_val for feed, feed_val in zip(feed_list, feed_args)
       1231         }
    -> 1232         return self.run(fetches, feed_dict=feed_dict, **kwargs)
       1233 
       1234       return _generic_run
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
        954     try:
        955       result = self._run(None, fetches, feed_dict, options_ptr,
    --> 956                          run_metadata_ptr)
        957       if run_metadata:
        958         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
       1178     if final_fetches or final_targets or (handle and feed_dict_tensor):
       1179       results = self._do_run(handle, final_targets, final_fetches,
    -> 1180                              feed_dict_tensor, options, run_metadata)
       1181     else:
       1182       results = []
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
       1357     if handle is None:
       1358       return self._do_call(_run_fn, feeds, fetches, targets, options,
    -> 1359                            run_metadata)
       1360     else:
       1361       return self._do_call(_prun_fn, handle, feeds, fetches)
    
    /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in _do_call(self, fn, *args)
       1382                     '\nsession_config.graph_options.rewrite_options.'
       1383                     'disable_meta_optimizer = True')
    -> 1384       raise type(e)(node_def, op, message)
       1385 
       1386   def _extend_graph(self):
    
    InternalError: 2 root error(s) found.
      (0) Internal: Blas GEMM launch failed : a.shape=(12032000, 1), b.shape=(1, 4), m=12032000, n=4, k=1
    	 [[node tower-pred-0/fpn/upsample_lat4/Tensordot/MatMul (defined at /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
      (1) Internal: Blas GEMM launch failed : a.shape=(12032000, 1), b.shape=(1, 4), m=12032000, n=4, k=1
    	 [[node tower-pred-0/fpn/upsample_lat4/Tensordot/MatMul (defined at /opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
    0 successful operations.
    0 derived errors ignored.
    
    Original stack trace for 'tower-pred-0/fpn/upsample_lat4/Tensordot/MatMul':
      File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py", line 16, in <module>
        app.launch_new_instance()
      File "/opt/conda/lib/python3.7/site-packages/traitlets/config/application.py", line 845, in launch_instance
        app.start()
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelapp.py", line 612, in start
        self.io_loop.start()
      File "/opt/conda/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 199, in start
        self.asyncio_loop.run_forever()
      File "/opt/conda/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
        self._run_once()
      File "/opt/conda/lib/python3.7/asyncio/base_events.py", line 1786, in _run_once
        handle._run()
      File "/opt/conda/lib/python3.7/asyncio/events.py", line 88, in _run
        self._context.run(self._callback, *self._args)
      File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 688, in <lambda>
        lambda f: self._run_callback(functools.partial(callback, future))
      File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 741, in _run_callback
        ret = callback()
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 814, in inner
        self.ctx_run(self.run)
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 775, in run
        yielded = self.gen.send(value)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 374, in dispatch_queue
        yield self.process_one()
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 250, in wrapper
        runner = Runner(ctx_run, result, future, yielded)
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 741, in __init__
        self.ctx_run(self.run)
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 775, in run
        yielded = self.gen.send(value)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 358, in process_one
        yield gen.maybe_future(dispatch(*args))
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 234, in wrapper
        yielded = ctx_run(next, result)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 261, in dispatch_shell
        yield gen.maybe_future(handler(stream, idents, msg))
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 234, in wrapper
        yielded = ctx_run(next, result)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 538, in execute_request
        user_expressions, allow_stdin,
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 234, in wrapper
        yielded = ctx_run(next, result)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/ipkernel.py", line 302, in do_execute
        res = shell.run_cell(code, store_history=store_history, silent=silent)
      File "/opt/conda/lib/python3.7/site-packages/ipykernel/zmqshell.py", line 539, in run_cell
        return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2895, in run_cell
        raw_cell, store_history, silent, shell_futures)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2940, in _run_cell
        return runner(coro)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/async_helpers.py", line 68, in _pseudo_sync_runner
        coro.send(None)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3166, in run_cell_async
        interactivity=interactivity, compiler=compiler, result=result)
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3357, in run_ast_nodes
        if (await self.run_code(code, result,  async_=asy)):
      File "/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-2-f9d37edbca59>", line 23, in <module>
        commit_hash = "unknown",
      File "/home/jovyan/train.py", line 315, in train_mask_rcnn
        launch_train_with_config(traincfg, trainer)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/train/interface.py", line 99, in launch_train_with_config
        extra_callbacks=config.extra_callbacks)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py", line 342, in train_with_defaults
        steps_per_epoch, starting_epoch, max_epoch)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py", line 312, in train
        self.setup_callbacks(callbacks, monitors)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/utils/argtools.py", line 168, in wrapper
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/train/base.py", line 209, in setup_callbacks
        self._callbacks.setup_graph(weakref.proxy(self))
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/base.py", line 59, in setup_graph
        self._setup_graph()
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/group.py", line 68, in _setup_graph
        cb.setup_graph(self.trainer)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/callbacks/base.py", line 59, in setup_graph
        self._setup_graph()
      File "/home/jovyan/eval.py", line 305, in _setup_graph
        self.predictors = [self._build_predictor(k % num_gpu) for k in range(self.num_predictor)]
      File "/home/jovyan/eval.py", line 305, in <listcomp>
        self.predictors = [self._build_predictor(k % num_gpu) for k in range(self.num_predictor)]
      File "/home/jovyan/eval.py", line 319, in _build_predictor
        return self.trainer.get_predictor(self._in_names, self._out_names, device=idx)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/train/tower.py", line 136, in get_predictor
        self.tower_func(*input.get_input_tensors())
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/tfutils/tower.py", line 291, in __call__
        output = self._tower_fn(*args)
      File "/home/jovyan/modeling/generalized_rcnn.py", line 129, in build_graph
        features = self.backbone(image)
      File "/home/jovyan/modeling/generalized_rcnn.py", line 307, in backbone
        p23456 = fpn_model('fpn', c2345)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/models/registry.py", line 173, in wrapped_func
        outputs = func(*args, **actual_args)
      File "/home/jovyan/modeling/model_fpn.py", line 65, in fpn_model
        lat = lat + upsample2x('upsample_lat{}'.format(6 - idx), lat_sum_5432[-1])
      File "/home/jovyan/modeling/model_fpn.py", line 51, in upsample2x
        data_format='channels_first')
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/models/registry.py", line 173, in wrapped_func
        outputs = func(*args, **actual_args)
      File "/opt/conda/lib/python3.7/site-packages/tensorpack/models/pool.py", line 127, in FixedUnPooling
        ret = tf.tensordot(x, mat, axes=1)  # bxcxhxwxshxsw
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/ops/math_ops.py", line 4071, in tensordot
        ab_matmul = matmul(a_reshape, b_reshape)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
        return target(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/ops/math_ops.py", line 2754, in matmul
        a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6136, in mat_mul
        name=name)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
        op_def=op_def)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
        attrs, op_def, compute_device)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
        op_def=op_def)
      File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
        self._traceback = tf_stack.extract_stack()
    

    4. Your environment:

    sys.platform          linux
    Python                3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
    Tensorpack            v0.10.1-0-g8f831349
    Numpy                 1.19.5
    TensorFlow            1.15.5/v1.15.5-1-g7d0c58b5326
    TF Compiler Version   7.3.1 20180303
    TF CUDA support       True
    TF MKL support        False
    TF XLA support        False
    Nvidia Driver         /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.51.06
    CUDA                  /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0.221
    CUDNN                 /usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
    NCCL                  /usr/lib/x86_64-linux-gnu/libnccl.so.2.7.8
    CUDA_VISIBLE_DEVICES  Unspecified
    GPU 0                 Tesla T4
    Free RAM              21.86/29.45 GB
    CPU Count             8
    Horovod               0.21.3
    cv2                   4.4.0
    msgpack               1.0.2
    python-prctl          False
    

    Question: Is it possible to run the evaluation callback while training with automatic mixed precision (inference already works outside of training), or are changes needed to make it work?

    opened by martinjammes 0
  • Is there an analogue for parallel Dataset.interleave in Dataflow?

    A typical data loading pipeline in TensorFlow using tf.data.Dataset might look something like this:

    dataset = tf.data.Dataset.from_tensor_slices(filenames)
    dataset = dataset.interleave(
        tf.data.TFRecordDataset,
        num_parallel_calls=reader_num_threads)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.map(
        lambda serialized_example: tf.io.parse_example(serialized_example, features),
        num_parallel_calls=parser_num_threads)
    

    Obviously, I'm not trying to use Dataflow to parse TFRecords, but the workflow is analogous: I want to read from multiple file iterators in parallel. I understand how to do a parallel map with Dataflow, but I don't see how to do a parallel interleave. Any tips?
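A minimal sketch of what a parallel interleave could look like in plain Python, using only the standard library. This is not a Tensorpack API: `parallel_interleave`, `open_fn` and `read_shard` are hypothetical names introduced here for illustration, and the thread-based design is only one possible approach.

```python
import queue
import threading

_DONE = object()  # sentinel: one worker thread has finished


def parallel_interleave(sources, open_fn, num_threads=4, buffer_size=64):
    """Yield every item of open_fn(source) for each source.

    Up to num_threads sources are read concurrently, so the output
    order is not deterministic.
    """
    pending = queue.Queue()
    for src in sources:
        pending.put(src)
    out = queue.Queue(maxsize=buffer_size)  # bounded, so readers cannot run far ahead

    def worker():
        try:
            while True:
                try:
                    src = pending.get_nowait()
                except queue.Empty:
                    return  # no sources left for this worker
                for item in open_fn(src):
                    out.put(item)
        finally:
            out.put(_DONE)

    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()

    finished = 0
    while finished < num_threads:
        item = out.get()
        if item is _DONE:
            finished += 1
        else:
            yield item


def read_shard(name):
    # Hypothetical reader: pretend each "file" holds three records.
    for i in range(3):
        yield (name, i)


records = list(parallel_interleave(["a", "b", "c"], read_shard, num_threads=2))
print(len(records))
```

A generator like this could be called from the `__iter__` of a custom flow. Output order depends on thread scheduling, and an exception raised inside `open_fn` is not re-raised in the consumer in this sketch; a real pipeline would need to handle both.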

    enhancement 
    opened by cyc 6
  • Why doesn't MultiProcessMapData() stop?

    I tried something very simple with MultiProcessMapData():

    from tensorpack import *
    
    class MyFlow(DataFlow):
        def __init__(self, n):
            super().__init__()
            self.n = n
    
        def __iter__(self):
            for i in range(self.n):
                yield i
    
        def __len__(self):
            return self.n
    
    def f(i):
        return i * 10
    
    d0 = MyFlow(10)
    d1 = MultiProcessMapData(d0, num_proc=4, map_func=f, buffer_size=10, strict=False)
    d1.reset_state()
    
    for i in d1:
        print(i)
    print("end")
    

    In this example, the loop never stops; it just keeps producing numbers. If I set strict to True, the code produces 5 numbers (0, 10, 20, 30, 40) and then freezes. Is this the expected behaviour? I am using the latest version of Tensorpack on macOS. Thank you.
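Whether this behaviour is intended is a question for the Tensorpack documentation. Independently of that, a loop over any iterator that does not terminate on its own can be bounded with `itertools.islice`. The sketch below is plain Python; `endless_flow` is a hypothetical stand-in for such a flow, not Tensorpack code.

```python
import itertools


def endless_flow(n):
    """Hypothetical stand-in: keeps cycling over n mapped values forever."""
    while True:
        for i in range(n):
            yield i * 10


n = 10
# Take exactly n items, then stop, even though the source never ends.
first_pass = list(itertools.islice(endless_flow(n), n))
print(first_pass)
```

With parallel workers the items of one pass may arrive in a different order, so a bound like this limits the count rather than guaranteeing which items are seen.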

    opened by hsinhaoyu 2
  • [Placeholder]Detectron2 fbnet backbone

    It was amazing to see Detectron2; it's like the best of PyTorch and TensorFlow. Thank you for the great library.

    According to @wat3rbro (https://github.com/facebookresearch/detectron2/issues/12#issuecomment-565566046 and https://github.com/facebookresearch/detectron2/issues/12#issuecomment-566822670), mobile-friendly models are coming soon.

    Creating this issue as a placeholder to support the FBNet backbone whenever they become available.

    Once again thank you for the great library. Pardon if the category is wrong.

    opened by no-1ne 0
Owner
Tensorpack
Use TensorFlow in the right way
Flax is a neural network ecosystem for JAX that is designed for flexibility.

Flax: A neural network library and ecosystem for JAX designed for flexibility

Google 3.9k Jan 2, 2023
The Medical Detection Toolkit contains 2D + 3D implementations of prevalent object detectors such as Mask R-CNN, Retina Net, Retina U-Net, as well as a training and inference framework focused on dealing with medical images.

MIC-DKFZ 1.2k Jan 4, 2023
U-2-Net: U Square Net - Modified for paired image training of style transfer

U2-Net: U Square Net Modified for paired image training of style transfer This is an unofficial repo making use of the code which was made available b

Doron Adler 43 Oct 3, 2022
Speed-Test - You can check your internet speed using this tool

Speed-Test Tool By Hez_X >> AVAILABLE ON : Termux & Kali linux & Ubuntu (Linux E

Hez-X 3 Feb 17, 2022
Neural networks applied in recognizing guitar chords using python, AutoML.NET with C# and .NET Core

Chord Recognition Demo application The demo application is written in C# with .NETCore. As of July 9, 2020, the only version available is for windows

Andres Mauricio Rondon Patiño 24 Oct 22, 2022
U^2-Net - Portrait matting This repository explores possibilities of using the original u^2-net model for portrait matting.

Dennis Bappert 104 Nov 25, 2022
RGBD-Net - This repository contains a pytorch lightning implementation for the 3DV 2021 RGBD-Net paper.

[3DV 2021] We propose a new cascaded architecture for novel view synthesis, called RGBD-Net, which consists of two core components: a hierarchical depth regression network and a depth-aware generator network.

Phong Nguyen Ha 4 May 26, 2022
QuakeLabeler is a Python package to create and manage your seismic training data, processes, and visualization in a single place — so you can focus on building the next big thing.

QuakeLabeler Quake Labeler was born from the need for seismologists and developers who are not AI specialists to easily, quickly, and independently bu

Hao Mai 15 Nov 4, 2022
Accelerate Neural Net Training by Progressively Freezing Layers

FreezeOut A simple technique to accelerate neural net training by progressively freezing layers. This repository contains code for the extended abstra

Andy Brock 203 Jun 19, 2022
Simple codebase for flexible neural net training

neural-modular Simple codebase for flexible neural net training. Allows for seamless exchange of models, dataset, and optimizers. Uses hydra for confi

Jannik Kossen 7 Apr 5, 2022
A complete, self-contained example for training ImageNet at state-of-the-art speed with FFCV

ffcv ImageNet Training A minimal, single-file PyTorch ImageNet training script designed for hackability. Run train_imagenet.py to get... ...high accur

FFCV 92 Dec 31, 2022
Reimplementation of the paper `Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words? (ACL2020)`

Human Attention for Text Classification Re-implementation of the paper Human Attention Maps for Text Classification: Do Humans and Neural Networks Foc

Shunsuke KITADA 15 Dec 13, 2021
Neural-net-from-scratch - A simple Neural Network from scratch in Python using the Pymathrix library

A Simple Neural Network from scratch A Simple Neural Network from scratch in Pyt

Youssef Chafiqui 2 Jan 7, 2022
🔥RandLA-Net in Tensorflow (CVPR 2020, Oral & IEEE TPAMI 2021)

RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds (CVPR 2020) This is the official implementation of RandLA-Net (CVPR2020, Oral

Qingyong 1k Dec 30, 2022
Selene is a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.

Troyanskaya Laboratory 323 Jan 1, 2023
This repository contains notebook implementations of the following Neural Process variants: Conditional Neural Processes (CNPs), Neural Processes (NPs), Attentive Neural Processes (ANPs).

The Neural Process Family This repository contains notebook implementations of the following Neural Process variants: Conditional Neural Processes (CN

DeepMind 892 Dec 28, 2022
Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

Machine Learning From Scratch About Python implementations of some of the fundamental Machine Learning models and algorithms from scratch. The purpose

Erik Linder-Norén 21.8k Jan 9, 2023
Scripts of Machine Learning Algorithms from Scratch. Implementations of machine learning models and algorithms using nothing but NumPy with a focus on accessibility. Aims to cover everything from basic to advanced.

Algo-ScriptML Python implementations of some of the fundamental Machine Learning models and algorithms from scratch. The goal of this project is not t

Algo Phantoms 81 Nov 26, 2022