DeepRec is a recommendation engine based on TensorFlow.

Alibaba

Last update: Jan 3, 2023

Related tags

Deep Learning DeepRec

Overview

DeepRec

Introduction

DeepRec is a recommendation engine based on TensorFlow 1.15, Intel-TensorFlow and NVIDIA-TensorFlow.

Background

A sparse model is a deep learning model in which discrete-feature computation makes up a relatively high proportion of the model structure. Discrete features are usually non-numeric features, such as IDs, tags, text, and phrases, that algorithms cannot process directly. They are widely used in high-value businesses such as search, advertising, and recommendation.

DeepRec has been under development since 2016 and supports core businesses such as Taobao Search, recommendation, and advertising. It has accumulated a list of features on top of the basic frameworks and performs very well in sparse model training. Given the wide variety of external needs and the trend of deep learning frameworks embracing open source, open-sourcing DeepRec helps establish standardized interfaces, cultivate user habits, greatly reduce the cost for external customers working on the cloud, and establish brand value.

Key Features

DeepRec has super-large-scale distributed training capability, supporting trillion-sample model training and 100-billion-scale embedding processing. For sparse model scenarios, in-depth performance optimization has been conducted across CPU and GPU platforms. It contains three kinds of features that improve usability and performance for super-scale scenarios.

Sparse Functions

Embedding Variable.
Dynamic Dimension Embedding Variable.
Adaptive Embedding Variable.
Multiple Hash Embedding Variable.

Performance Optimization

Distributed Training Framework Optimization, such as grpc+seastar, FuseRecv, StarServer, HybridBackend etc.
Runtime Optimization, such as CPU memory allocator (PRMalloc), GPU memory allocator etc.
Operator level optimization, such as BF16 mixed precision optimization, sparse operator optimization and EmbeddingVariable on PMEM and GPU, new hardware feature enabling, etc.
Graph level optimization, such as AutoGraphFusion, SmartStage, AutoPipeline, StructureFeature, MicroBatch, etc.

Deploy and Serving

Incremental model loading and exporting
Super-scale sparse model distributed serving
Multilevel hybrid storage with multi-backend support
Online deep learning with low latency

Installation

Prepare for installation

CPU Platform

registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-developer:1.15deeprec2106-cpu-py36-ubuntu18.04

GPU Platform

registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-developer:1.15deeprec2106-gpu-py36-cu110-ubuntu18.04

How to Build

configure

$ ./configure

Compile for CPU and GPU (default)

$ bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package

Compile for CPU and GPU: ABI=0

$ bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package

Compile for CPU optimization: oneDNN + Unified Eigen Thread pool

$ bazel build  -c opt --config=opt  --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package

Compile for CPU optimization and ABI=0

$ bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package
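
The four build commands above differ only in two switches: whether the ABI=0 flags are passed and whether the oneDNN thread-pool configuration is enabled. As a rough sketch, a small Python helper can assemble the matching command line. The function name `bazel_build_command` and its parameters are hypothetical and are not part of DeepRec; the flags themselves are copied from the commands above.

```python
from typing import List

# Build target and ABI flag, copied verbatim from the commands in this section.
TARGET = "//tensorflow/tools/pip_package:build_pip_package"
ABI0_FLAG = "-D_GLIBCXX_USE_CXX11_ABI=0"


def bazel_build_command(abi0: bool = False, onednn: bool = False) -> List[str]:
    """Assemble one of the four documented build variants as an argument list.

    `abi0` and `onednn` are hypothetical switch names used for illustration.
    """
    cmd = ["bazel", "build"]
    if abi0:
        # Pass the same ABI setting to both target and host compilations.
        cmd += [f"--cxxopt={ABI0_FLAG}", f"--host_cxxopt={ABI0_FLAG}"]
    cmd += ["-c", "opt", "--config=opt"]
    if onednn:
        # oneDNN + unified Eigen thread pool, as in the CPU-optimized variants.
        cmd += ["--config=mkl_threadpool",
                "--define", "build_with_mkl_dnn_v1_only=true"]
    cmd.append(TARGET)
    return cmd


if __name__ == "__main__":
    # Print the CPU-optimized ABI=0 variant as a single shell line.
    print(" ".join(bazel_build_command(abi0=True, onednn=True)))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting of the `--cxxopt` values.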

Create whl package

$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Install whl package

$ pip3 install /tmp/tensorflow_pkg/tensorflow-1.15.5+deeprec2106-cp36-cp36m-linux_x86_64.whl

Nightly Images

Image for GPU CUDA11.0

registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:deeprec-nightly-gpu-py36-cu110-ubuntu18.04

Image for CPU

registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:deeprec-nightly-cpu-py36-ubuntu18.04

Java Compilation

$ ./configure
$ bazel build --config opt //tensorflow/java:tensorflow   //tensorflow/java:libtensorflow_jni
$ javac -cp bazel-bin/tensorflow/java/libtensorflow.jar ...
$ java -cp bazel-bin/tensorflow/java/libtensorflow.jar  -Djava.library.path=bazel-bin/tensorflow/java  ...

License

Apache License 2.0

Comments

[Grappler] Add Concat+Cast fusion

For the BF16 graph, we usually find a concat+cast pattern from the feature column to the DNN part. This optimization fuses concat(FP32 -> FP32) + cast(FP32 -> BF16) and concat(BF16 -> BF16) + cast(BF16 -> FP32) into one op.
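
To illustrate the idea only, here is a minimal sketch of such a pattern fusion over a toy graph in plain Python. It is not the Grappler pass from this PR: the `Node` class, the `fuse_concat_cast` function, and the fused op name `FusedConcatCast` are all hypothetical. The sketch fuses a Concat into the Cast that consumes it only when the Cast is the Concat's sole consumer, so no other node loses its input.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Toy graph node (hypothetical, not a TensorFlow NodeDef)."""
    name: str
    op: str
    inputs: List[str]
    attrs: Dict[str, str] = field(default_factory=dict)


def fuse_concat_cast(nodes: List[Node]) -> List[Node]:
    """Replace each Concat -> Cast pair with a single fused node."""
    by_name = {n.name: n for n in nodes}
    # Count how many nodes read each node's output.
    consumer_count: Dict[str, int] = {}
    for n in nodes:
        for src in n.inputs:
            consumer_count[src] = consumer_count.get(src, 0) + 1

    dropped = set()      # Concat nodes absorbed into a fused node
    replacement = {}     # Cast name -> fused node (keeps the Cast's name)
    for n in nodes:
        if n.op != "Cast" or len(n.inputs) != 1:
            continue
        producer = by_name.get(n.inputs[0])
        if producer is None or producer.op != "Concat":
            continue
        if consumer_count[producer.name] != 1:
            continue  # Concat output is used elsewhere; fusing would break it.
        dropped.add(producer.name)
        replacement[n.name] = Node(
            name=n.name,
            op="FusedConcatCast",  # hypothetical fused op name
            inputs=list(producer.inputs),
            attrs={"SrcT": producer.attrs["T"], "DstT": n.attrs["DstT"]},
        )
    return [replacement.get(n.name, n) for n in nodes if n.name not in dropped]


graph = [
    Node("a", "Placeholder", []),
    Node("b", "Placeholder", []),
    Node("concat", "Concat", ["a", "b"], {"T": "FP32"}),
    Node("cast", "Cast", ["concat"], {"DstT": "BF16"}),
    Node("dense", "MatMul", ["cast"]),
]
fused_graph = fuse_concat_cast(graph)
```

Because the fused node keeps the Cast's name, downstream nodes such as `dense` need no rewiring in this sketch.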

opened by aalbersk 7

Build from source and import error "cannot import name saver"

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
TensorFlow installed from (source or binary):
TensorFlow version:1.15
Python version:2.7
Installed using virtualenv? pip? conda?:
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source): g++ 7.5
CUDA/cuDNN version:
GPU model and memory:

Describe the problem

ERROR: /DeepRec/tensorflow/BUILD:893:1: Executing genrule //tensorflow:tf_python_api_gen_v1 failed (Exit 1)
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 27, in <module>
    from tensorflow.python.tools.api.generator import doc_srcs
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 73, in <module>
    from tensorflow.python.ops.standard_ops import *
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/ops/standard_ops.py", line 25, in <module>
    from tensorflow.python import autograph
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/__init__.py", line 35, in <module>
    from tensorflow.python.autograph import operators
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/__init__.py", line 40, in <module>
    from tensorflow.python.autograph.operators.control_flow import for_stmt
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/control_flow.py", line 65, in <module>
    from tensorflow.python.autograph.operators import py_builtins
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/py_builtins.py", line 30, in <module>
    from tensorflow.python.data.ops import dataset_ops
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/__init__.py", line 25, in <module>
    from tensorflow.python.data import experimental
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/experimental/__init__.py", line 89, in <module>
    from tensorflow.python.data.experimental.ops.batching import dense_to_sparse_batch
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/experimental/ops/batching.py", line 20, in <module>
    from tensorflow.python.data.ops import dataset_ops
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/ops/dataset_ops.py", line 40, in <module>
    from tensorflow.python.data.ops import iterator_ops
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/ops/iterator_ops.py", line 35, in <module>
    from tensorflow.python.training.saver import BaseSaverBuilder
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/training/saver.py", line 57, in <module>
    from tensorflow.python.training.saving import saveable_object_util
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/training/saving/saveable_object_util.py", line 33, in <module>
    from tensorflow.python.training import saver
ImportError: cannot import name saver
Target //tensorflow/tools/pip_package:build_pip_package failed to build

Provide the exact sequence of commands / steps that you executed before running into the problem

bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=1" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=1" -c opt --config=v1 --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package

Any other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

opened by 3-shi 6

Error log when executing the launch script in a PMEM memkind environment

When I use the latest commit to build a PMEM memkind environment and execute the launch script, the following error appears.

1. The commit code version I used

2. The build option I used

bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --copt="-L/usr/local/lib" --copt="-lpmem" --copt="-lmemkind" --config=opt //tensorflow/tools/pip_package:build_pip_package

3. The script I used

numactl -N 1 ./launch.sh --batch_size=1280 --dim_size=512 --max_mock_id_amplify=1800 --num_steps=2000 --ev_storage=pmem_memkind

4. Error logs

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Traceback (most recent call last):
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:ps/replica:0/task:0: MultiLevel EV's Cache size -1 should large than IDs in batch 1280
  [[{{node fm/embedding_lookup_36}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./benchmark.py", line 228, in <module>
    tf.app.run()
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "./benchmark.py", line 203, in main
    sess.run(train_op)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
    run_metadata=run_metadata)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
    run_metadata=run_metadata)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
    raise six.reraise(*original_exc_info)
  File "/home/pai/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
    return self._sess.run(*args, **kwargs)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
    run_metadata=run_metadata)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
    return self._sess.run(*args, **kwargs)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:ps/replica:0/task:0: MultiLevel EV's Cache size -1 should large than IDs in batch 1280
  [[node fm/embedding_lookup_36 (defined at /home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'fm/embedding_lookup_36':
  File "./benchmark.py", line 228, in <module>
    tf.app.run()
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "./benchmark.py", line 121, in main
    tf.nn.embedding_lookup(fm_w, batch['col{}'.format(sidx)]))
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/embedding_ops.py", line 418, in embedding_lookup
    counts=counts)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/embedding_ops.py", line 184, in _embedding_lookup_and_transform
    counts=counts),
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 3958, in gather
    counts=counts)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/kv_variable_ops.py", line 749, in sparse_read
    name=name)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_kv_variable_ops.py", line 647, in kv_resource_gather
    validate_indices=validate_indices, name=name)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

opened by jiefengshuo 4

Tensor slice example: the tensor slice dataset is much slower than TextLineDataset

I tried replacing the TextLineDataset with a tensor slice dataset (see train.py), but creating the MonitoredTrainingSession is much slower than with the original. It takes roughly 100-110 seconds to create, whereas with TextLineDataset it takes only 7 seconds. If I set checkpoint_dir to None, it saves about 70 seconds.

Do you have any advice on this? Can the checkpoint_dir handling be improved?

opened by zhanglirong1999 3

[BUILD] Building DeepRec with gcc-8.3 fails.


System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Centos 7

Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: no

TensorFlow installed from (source or binary): source

TensorFlow version: r1.15.5-deeprec2204u1

Python version:

Installed using virtualenv? pip? conda?:

Bazel version (if compiling from source):

GCC/Compiler version (if compiling from source): gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)

CUDA/cuDNN version: cuda11.4

GPU model and memory:

Describe the problem

Building DeepRec fails with gcc 8.3.1 because it triggers a compiler bug in gcc 8.3.1. The error is as follows:

unique_ali_op_ut.h:498:77: internal compiler error: in is_normal_capture_proxy, at cp/lambda.c:292

Provide the exact sequence of commands / steps that you executed before running into the problem

Any other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

bug
opened by ProphetPeng 3

Unsupport GlobalStep in subclass of ValuePtrBase

When we save a checkpoint, the following fatal error occurs: F ./tensorflow/core/framework/embedding/value_ptr.h:256] Unsupport GlobalStep in subclass of ValuePtrBase. I also find that the checkpoint is a temporary file: best_checkpoint/best.data-00000-of-00001.tempstate11898667549733680686.

opened by Lihengwannafly 3

[Modelzoo] DIN and DIEN perf drop based on r1.15.5-deeprec2201 tag.

Modelzoo perf Test based on [Release] Update DeepRec release version to 1.15.5+deeprec2201. (#43). Test machines: Alibaba Cloud ECS general purpose instance family with high clock speeds - ecs.hfg7.2xlarge.

Test perf result:

Gstep | WDL | WDL | DLRM | DLRM | DeepFM | DeepFM | DSSM | DSSM | DIEN | DIEN | DIN | DIN
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
/ | value | percent | value | percent | value | percent | value | percent | value | percent | value | percent
Community TF | 31.92626 | baseline | 82.09168 | baseline | 37.20978 | baseline | 18.54726 | baseline | 14.62987 | baseline | 18.57746 | baseline
DeepRec FP32 | 34.69318 | 108.67% | 105.4547 | 128.46% | 43.31713 | 116.41% | 21.64175 | 116.68% | 13.27125 | 90.71% | 17.6932 | 95.24%
DeepRec BF16 | 49.38222 | 154.68% | 114.2221 | 139.14% | 47.34401 | 127.24% | 23.13698 | 124.75% | 13.0392 | 89.13% | 17.20525 | 92.61%
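
The percent columns appear to be each value divided by the TF baseline value of the same model. A short sketch of that arithmetic, assuming this reading of the table is right (the helper name `relative_percent` is made up for illustration):

```python
def relative_percent(value: float, baseline: float) -> str:
    """Return `value` as a percentage of `baseline`, with two decimals."""
    return f"{value / baseline * 100:.2f}%"


# Gstep values copied from the table: TF baseline, then DeepRec FP32.
wdl_baseline, wdl_fp32 = 31.92626, 34.69318
dien_baseline, dien_fp32 = 14.62987, 13.27125

print("WDL  FP32:", relative_percent(wdl_fp32, wdl_baseline))
print("DIEN FP32:", relative_percent(dien_fp32, dien_baseline))
```

A result below 100% for DIEN and DIN is what the issue title refers to as the perf drop.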

Test AUC result:

AUC | WDL | WDL | DLRM | DLRM | DeepFM | DeepFM | DSSM | DSSM | DIEN | DIEN | DIN | DIN
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
/ | value | percent | value | percent | value | percent | value | percent | value | percent | value | percent
Community TF | 0.775168 | baseline | 0.768852 | baseline | 0.744794 | baseline | 0.504404 | baseline | 0.8443 | baseline | 0.7887 | baseline
DeepRec FP32 | 0.775515 | 100.04% | 0.771128 | 100.30% | 0.746055 | 100.17% | 0.503653 | 99.85% | 0.8472 | 100.34% | 0.7913 | 100.33%
DeepRec BF16 | 0.77604 | 100.11% | 0.772185 | 100.43% | 0.741192 | 99.52% | 0.492327 | 97.61% | 0.8358 | 98.99% | 0.7883 | 99.95%

PS: The DSSM dataset is small, so its ACC and AUC are limited.

opened by changqi1 3

[SmartStage] SmartStage has low performance on GPU.

Test environment and performance comparison

[1] Invalid argument: Trying to access resource linear/linear_model/C1/weights/part_0 located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0
[2] 2022-06-07 09:49:01.768708: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at resource_variable_ops.cc:400 : Invalid argument: Trying to access resource linear/linear_model/C12/weights/part_0 located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0

opened by JackMoriarty 2

[Op] Parallelize UnsortedSegment op.

Parallelize UnsortedSegmentSum on CPU device.

Under the same conditions, we can see that the "parallel" way is more effective.

Op | Row | Col | S_id | T_nums
-- | -- | -- | -- | --
UnsortedSegmentSum | 4096 | 1024 | 128 | 1
UnsortedSegmentSum | 4096 | 1024 | 128 | 2
UnsortedSegmentSum | 4096 | 1024 | 128 | 4
UnsortedSegmentSum | 4096 | 1024 | 128 | 8
UnsortedSegmentSum | 4096 | 1024 | 128 | 16
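
For readers unfamiliar with the op, the following is a plain-Python sketch of what an unsorted segment sum computes, plus one possible way to split the work across threads. The partitioning by column ranges is an assumption chosen because the threads then never write to the same output element; it is not necessarily how this PR parallelizes the kernel. Both function names are hypothetical, and pure Python threads will not show the speedup a C++ kernel would.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def unsorted_segment_sum(data: Sequence[Sequence[float]],
                         segment_ids: Sequence[int],
                         num_segments: int) -> List[List[float]]:
    """Serial reference: out[segment_ids[i]] += data[i] for every row i."""
    cols = len(data[0]) if data else 0
    out = [[0.0] * cols for _ in range(num_segments)]
    for row, sid in zip(data, segment_ids):
        for c, value in enumerate(row):
            out[sid][c] += value
    return out


def unsorted_segment_sum_parallel(data: Sequence[Sequence[float]],
                                  segment_ids: Sequence[int],
                                  num_segments: int,
                                  num_threads: int = 4) -> List[List[float]]:
    """Same result, with each thread owning a disjoint range of columns."""
    cols = len(data[0]) if data else 0
    out = [[0.0] * cols for _ in range(num_segments)]

    def work(col_range: range) -> None:
        # Each worker touches only its own columns, so no locking is needed.
        for row, sid in zip(data, segment_ids):
            for c in col_range:
                out[sid][c] += row[c]

    chunk = max(1, (cols + num_threads - 1) // num_threads)
    ranges = [range(s, min(s + chunk, cols)) for s in range(0, cols, chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(work, ranges))  # consume the iterator to surface errors
    return out


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ids = [0, 1, 0]
serial = unsorted_segment_sum(rows, ids, num_segments=2)
parallel = unsorted_segment_sum_parallel(rows, ids, num_segments=2, num_threads=2)
```

Rows 0 and 2 share segment 0, so they are summed together, while row 1 lands in segment 1 on its own.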

enhancement

opened by marvin-Yu 2

DeepRec shows very low GPU utilization on a particular kind of CPU

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 in Docker
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): r1.15.5-deeprec2204-39-g0527d0b2ad8 1.15.5
Python version: Python 3.6.9
Bazel version (if compiling from source): Bazelisk version: v1.11.0 Build label: 0.24.1
GCC/Compiler version (if compiling from source): gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA/cuDNN version: CUDA=11.4, V11.4.152, cuDNN 8
GPU model and memory: NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4, Tesla P100 * 4, 16280MiB

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior: On one kind of GPU instance on Aliyun, I built DeepRec from source following these docs: https://github.com/alibaba/DeepRec#how-to-build. I confirmed that GPU support was enabled, but on this machine my code runs only on the CPU; GPU-Util is always zero and GPU memory usage is low. Here is a runtime capture:

On other machines, the same build and execution steps work normally.

Here is the CPU info of the machine that works fine:

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
stepping        : 4
microcode       : 0x1
cpu MHz         : 2499.998
cache size      : 33792 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat
bogomips        : 4999.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Here is the CPU info of the machine that shows low GPU utilization:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
stepping        : 1
microcode       : 0x1
cpu MHz         : 2499.996
cache size      : 40960 KB
physical id     : 0
siblings        : 32
core id         : 0
cpu cores       : 16
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat spec_ctrl intel_stibp
bogomips        : 4999.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Describe the expected behavior

Code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

opened by fuhailin 2

[OP] Change fused matmul layout type and number of threads for small-size inputs.

This PR mainly changes the _MklFusedMatMul layout type. It removes unnecessary tensor format changes and reduces framework overhead.

Before applying this PR.

After applying this PR.

Performance change

|_MklFusedMatMul performance|Time(ms)|percent|
|:--:|:--:|:--:|
|DeepRec FP32 - Before|8.862|baseline|
|DeepRec FP32 - After|8.689|101%|

opened by changqi1 2

ParquetDataset returns an erroneous shape.

Error log:

Traceback (most recent call last):
  File "train.py", line 832, in <module>
    main()
  File "train.py", line 544, in main
    iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
AttributeError: 'PrefetchDataset' object has no attribute 'output_types'

Modification in /root/DeepRec/modelzoo/dlrm:

diff --git a/modelzoo/dlrm/train.py b/modelzoo/dlrm/train.py
index 1cd0e7915e..5fbc5ee4f2 100644
--- a/modelzoo/dlrm/train.py
+++ b/modelzoo/dlrm/train.py
@@ -24,6 +24,7 @@ from tensorflow.python.client import timeline
 import json

 from tensorflow.python.ops import partitioned_variables
+from tensorflow.python.data.experimental.ops import parquet_dataset_ops

 # Set to INFO for tracking training, default is WARN. ERROR for least messages
 tf.logging.set_verbosity(tf.logging.INFO)
@@ -300,6 +301,22 @@ def build_model_input(filename, batch_size, num_epochs):
         features = all_columns
         return features, labels

+    def parse_parquet(value):
+        cont_defaults = [[0.0] for i in range(1, 14)]
+        cate_defaults = [[' '] for i in range(1, 27)]
+        label_defaults = [[0]]
+        column_headers = TRAIN_DATA_COLUMNS
+        record_defaults = label_defaults + cont_defaults + cate_defaults
+        columns = value
+        vs = []
+        for k,v in columns.items():
+            vs.append(v)
+        all_columns = collections.OrderedDict(zip(column_headers, vs))
+        labels = all_columns.pop(LABEL_COLUMN[0])
+        features = all_columns
+        return features, labels
+
+
     '''Work Queue Feature'''
     if args.workqueue and not args.tf:
         from tensorflow.python.ops.work_queue import WorkQueue
@@ -311,12 +328,8 @@ def build_model_input(filename, batch_size, num_epochs):

opened by zhaozheng09 0

ParquetDataset returns a dynamic-shape Tensor when drop_remainder is set to True

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): 1.15.5+deeprec2208
Python version: python3.6
Bazel version (if compiling from source): 0.26.1
GCC/Compiler version (if compiling from source): gcc version 7.5.0
CUDA/cuDNN version: None
GPU model and memory: None

Describe the current behavior: I use Python to generate parquet files. When I read them with ParquetDataset and set drop_remainder=True, it returns a dynamic-shape Tensor.

Describe the expected behavior: With TFRecordDataset and drop_remainder=True, a static-shape Tensor is returned. ParquetDataset should likewise return a static-shape Tensor when drop_remainder=True.
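
To make the expectation concrete, here is a small pure-Python sketch of batching with and without dropping the remainder. It does not use TensorFlow or ParquetDataset, and the helper name `make_batches` is made up for illustration. When the remainder is dropped, every batch has exactly `batch_size` rows, which is why the leading dimension could be reported statically instead of as `?`.

```python
from typing import List, Sequence


def make_batches(records: Sequence[int], batch_size: int,
                 drop_remainder: bool = False) -> List[List[int]]:
    """Split records into consecutive batches of up to batch_size items."""
    batches = [list(records[i:i + batch_size])
               for i in range(0, len(records), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the final, smaller batch
    return batches


f1 = [1, 2, 3, 4, 5]  # same values as column f1 in the reproduction below
kept = make_batches(f1, batch_size=2)
dropped = make_batches(f1, batch_size=2, drop_remainder=True)

# With drop_remainder=True, all batches share one size.
sizes = {len(b) for b in dropped}
```

Without dropping, the last batch holds a single row, so the batch size can only be known at run time.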

Code to reproduce the issue: Provide a reproducible test case that is the bare minimum necessary to generate the problem. Generate the parquet file:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
 
schema = pa.schema([
    ('f1', pa.int64()),
    ('f2', pa.int64()),
    ('f3', pa.int64()),
    ('f4', pa.int64()),
    ('label', pa.float32())
])
 
f1 = pa.array([1, 2, 3, 4, 5], type = pa.int64())
f2 = pa.array([1, 2, 3, 4, 5], type = pa.int64())
f3 = pa.array([1, 2, 3, 4, 5], type = pa.int64())
f4 = pa.array([1, 2, 3, 4, 5], type = pa.int64())
label = pa.array([0.1, 0.2, 0.3, 0.4, 0.5], pa.float32())
 
batch = pa.RecordBatch.from_arrays(
    [f1, f2, f3, f4, label],
    schema = schema
)
table = pa.Table.from_batches([batch])
 
pq.write_table(table, 'feature.parquet')

Read the parquet file:

import os

import tensorflow as tf
from tensorflow.python.data.experimental.ops.dataframe import DataFrame
from tensorflow.python.data.experimental.ops.parquet_dataset_ops import ParquetDataset
from tensorflow.python.data.ops import dataset_ops


def make_initializable_iterator(ds):
    r"""Wrapper of make_initializable_iterator."""
    if hasattr(dataset_ops, "make_initializable_iterator"):
        return dataset_ops.make_initializable_iterator(ds)
    return ds.make_initializable_iterator()


def parquet_map(record):
    label = record.pop("label")
    return record, label


filename = """feature.parquet"""

ds = ParquetDataset(
    filename,
    batch_size=2,
    fields=[
        DataFrame.Field("f1", tf.int64),
        DataFrame.Field("f2", tf.int64),
        DataFrame.Field("f3", tf.int64),
        DataFrame.Field("f4", tf.int64),
        DataFrame.Field("label", tf.float32),
    ],
    num_parallel_reads=8,
    drop_remainder=True,
).map(parquet_map)
ds = ds.prefetch(4)

iterator = make_initializable_iterator(ds)
features, labels = iterator.get_next()
print("f1 type is:")
print(type(features['f1']))
print('f1 shape is:')
print(features['f1'].shape)

sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)

with tf.Session(config=sess_config) as sess:
    sess.run(iterator.initializer)
    for i in range(1):
        feature, label = sess.run([features, labels])
        print(feature)
        print("Label: ")
        print(label)

f1 type is:
<class 'tensorflow.python.framework.ops.Tensor'>
f1 shape is:
(?,)
2023-01-03 13:23:36.623328: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2600000000 Hz
2023-01-03 13:23:36.629326: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4ea9280 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-01-03 13:23:36.629356: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
{'f1': array([1, 2]), 'f2': array([1, 2]), 'f3': array([1, 2]), 'f4': array([1, 2])}
Label: 
[0.1 0.2]

The tfrecord code is added below.

write tfrecord:

import tensorflow as tf
tf.enable_eager_execution()

# All raw values should be converted to a type compatible with tf.Example. Use
# the following functions to do these conversions.
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))


def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))


def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def write_record():
    f1 = [1, 2, 3, 4, 5]
    label = [1, 2, 3, 4, 5]

    feature = {
        'label': _int64_feature(label),
        'f1': _int64_feature(f1),
    }
    
    # Create a `example` from the feature dict.
    tf_example = tf.train.Example(features=tf.train.Features(feature=feature))
  
    # Write the serialized example to a record file.
    with tf.python_io.TFRecordWriter('feature.tfrecords') as writer:
        writer.write(tf_example.SerializeToString())

if __name__ == "__main__":
    write_record()

read tfrecord:

import os

import tensorflow as tf
from tensorflow.python.data.experimental.ops.dataframe import DataFrame
from tensorflow.python.data.experimental.ops.parquet_dataset_ops import ParquetDataset
from tensorflow.python.data.ops import dataset_ops


def make_initializable_iterator(ds):
    r"""Wrapper of make_initializable_iterator."""
    if hasattr(dataset_ops, "make_initializable_iterator"):
        return dataset_ops.make_initializable_iterator(ds)
    return ds.make_initializable_iterator()


def tfrecord_map(example_proto):
    features = {}
    features['f1'] = tf.FixedLenFeature(shape=(1,), dtype=tf.int64)
    features['label'] = tf.FixedLenFeature(shape=(1,), dtype=tf.int64)
    parsed_example = tf.parse_example(example_proto, features)
    f1 = parsed_example['f1']
    label = parsed_example['label']
    features = {'f1': f1}
    labels = {'label': label}
    return features, labels


filename = """feature.tfrecords"""

dataset = tf.data.TFRecordDataset(filename)
dataset = dataset.batch(2, drop_remainder=True)
dataset = dataset.map(lambda example_proto: tfrecord_map(example_proto))
dataset = dataset.prefetch(2)
iterator = make_initializable_iterator(dataset)
features, labels = iterator.get_next()
print("f1 type is:")
print(type(features['f1']))
print('f1 shape is:')
print(features['f1'].shape)

result:

f1 type is:
<class 'tensorflow.python.framework.ops.Tensor'>
f1 shape is:
(2, 1)

opened by welsonzhang 0

ParquetDataset met coredump when data contain DELTA_BINARY_PACKED encoding
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04

Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None

TensorFlow installed from (source or binary): source

TensorFlow version (use command below): r1.15.5-deeprec2210-25-ga27850bf1de 1.15.5

Python version: Python 3.6.9

Bazel version (if compiling from source): 0.26.1

GCC/Compiler version (if compiling from source): gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

CUDA/cuDNN version: cuda:11.7.0-cudnn8

GPU model and memory: NVIDIA TITAN V 12288MiB

You can collect some of this information using our environment capture script You can also obtain the TensorFlow version with: 1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)" 2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

I use Apache Iceberg to generate parquet files. When the files are compressed with zstd, ParquetDataset crashes while reading int64 data. I noticed that DeepRec uses arrow 5.0, but arrow only supports the DELTA_BINARY_PACKED encoding from version 7.0 onward, so I think the arrow version needs to be upgraded. That should not affect compatibility for other users.

Describe the expected behavior

ParquetDataset works as expected with DELTA_BINARY_PACKED encoding.

Code to reproduce the issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

part.zstd.parquet: https://drive.google.com/file/d/1CoumvsuL47trnFi4Bn6haRIsgTy9frSE/view?usp=share_link
part.gz.parquet: https://drive.google.com/file/d/1V_cOrjIVTVZ5y7Q4KbHa085ay6GeaZH-/view?usp=share_link

import os

import tensorflow as tf
from tensorflow.python.data.experimental.ops.dataframe import DataFrame
from tensorflow.python.data.experimental.ops.parquet_dataset_ops import ParquetDataset
from tensorflow.python.data.ops import dataset_ops


def make_initializable_iterator(ds):
    r"""Wrapper of make_initializable_iterator."""
    if hasattr(dataset_ops, "make_initializable_iterator"):
        return dataset_ops.make_initializable_iterator(ds)
    return ds.make_initializable_iterator()


def parquet_map(record):
    label = record.pop("label")
    return record, label


filename = """part.zstd.parquet"""
# filename = 'part.gz.parquet'

# Read from a parquet file.
ds = ParquetDataset(
    filename,
    batch_size=4,
    fields=[
        DataFrame.Field("f_2672", tf.int64),
        DataFrame.Field("f_2671", tf.int64, ragged_rank=0),
        DataFrame.Field("f_2673", tf.int64, ragged_rank=0),
        DataFrame.Field("f_5196", tf.float32, ragged_rank=0),
        DataFrame.Field("f_8436", tf.float32, ragged_rank=0),
        DataFrame.Field("label", tf.int32),
    ],
    num_parallel_reads=8,
).map(parquet_map)
ds = ds.prefetch(4)
iterator = make_initializable_iterator(ds)
features, labels = iterator.get_next()

sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)

with tf.Session(config=sess_config) as sess:
    sess.run(iterator.initializer)
    for i in range(1):
        feature, label = sess.run([features, labels])
        print(feature)
        print("Label: ")
        print(label)
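
Before digging into the crash, it can help to confirm that a file really contains DELTA_BINARY_PACKED column chunks by reading the encodings recorded in the Parquet footer. The following is a minimal sketch, assuming a recent standalone pyarrow is installed (it is imported lazily and is separate from the arrow bundled with DeepRec). The helper names collect_encodings and columns_using are hypothetical and are not part of DeepRec.

```python
def collect_encodings(path):
    """Map each column path to the set of encodings used by its chunks.

    Assumption: a standalone pyarrow is available; it is imported here
    so that the rest of the module works without it.
    """
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    found = {}
    for rg_index in range(meta.num_row_groups):
        row_group = meta.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            # chunk.encodings is a tuple of encoding names (strings).
            found.setdefault(chunk.path_in_schema, set()).update(chunk.encodings)
    return found


def columns_using(encodings_by_column, encoding="DELTA_BINARY_PACKED"):
    """Return the sorted names of columns whose chunks use `encoding`."""
    return sorted(
        name for name, encodings in encodings_by_column.items()
        if encoding in encodings
    )
```

For the files attached to the report, columns_using(collect_encodings("part.zstd.parquet")) would list the affected columns, if any.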

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
opened by fuhailin 0

[Auto Micro Batch] Iterator has not been initialized when setting micro_batch_num


System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.2 LTS
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
TensorFlow installed from (source or binary):
TensorFlow version (use command below): 1.15
Python version: 3.6
Bazel version (if compiling from source): 0.26.1
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 11.4
GPU model and memory: T4

Describe the current behavior

After setting sess_config.graph_options.optimizer_options.micro_batch_num = 2, the following error occurs:

  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
         [[{{node cond/IteratorGetNext_1/dup0}}]]
         [[metrics_1/ROC_cvr2_cpd_second_stay_act_metric/assert_greater_equal/Assert/AssertGuard/Assert/data_1/_22249]]
  (1) Failed precondition: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
         [[{{node cond/IteratorGetNext_1/dup0}}]]

With micro_batch disabled, the code runs normally.

Code to reproduce the issue

_train_input_fn = tf.compat.v1.data.make_initializable_iterator(_train_input_fn)
_eval_input_fn = tf.compat.v1.data.make_initializable_iterator(_eval_input_fn)
features, labels = tf.cond(is_training, true_fn=lambda: _train_input_fn.get_next(),
                           false_fn=lambda: _eval_input_fn.get_next())
initializer = [tf.compat.v1.global_variables_initializer(),
               tf.compat.v1.local_variables_initializer(),
               tf.compat.v1.tables_initializer(),
               _train_input_fn.initializer,
               _eval_input_fn.initializer]
sess_config = tf.compat.v1.ConfigProto(allow_soft_placement=True, log_device_placement=log_device_placement)
sess_config.gpu_options.allow_growth = True
sess_config.graph_options.optimizer_options.micro_batch_num = 2

sess_config.intra_op_parallelism_threads = intra_threads
sess_config.inter_op_parallelism_threads = inter_threads
session = tf.compat.v1.Session(config=sess_config)
with session:
    session.run(initializer)


opened by Lihengwannafly 0

ParquetDataset raise ValueError: No supported fields found in parquet file

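Problem: I generated Parquet files with MapReduce, but DeepRec's ParquetDataset raised the following exception when reading them: No supported fields found in parquet file. The schema of the parquet file is: message example {required int32 id;required binary email;} The MR job is based on https://github.com/whale2/iow-hadoop-streaming

Cause: reading through the code showed that the schema in the footer could not be scanned, which led to the exception. Why could the schema not be obtained?

Troubleshooting steps:
(1) My own Python code can read the schema without any problem.
(2) Setting MR aside and generating a similarly structured file with Python code, ParquetDataset reads it successfully.
(3) Comparing the parquet generated by MR with the parquet generated locally by Python shows that one stores the field as bytes and the other as string.
(4) I therefore suspect that the MR-generated parquet stores the field as bytes, which DeepRec cannot recognize.

Changing the schema definition to the following solved the problem: message example {required int32 id;required binary email(UTF-8);}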
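
The check behind this fix can be sketched in a few lines: scan a Parquet message schema for binary fields that carry no string annotation, since those are the fields reported above as unreadable. This is only an illustration under stated assumptions. The helper unannotated_binary_fields is hypothetical, the regular expression handles flat schemas like the one in this report rather than the full schema grammar, and the accepted annotation spellings (UTF8, STRING, and UTF-8 as written in the report) are assumptions.

```python
import re

# Matches leaf fields such as "required binary email (UTF8);".
# Assumption: flat schemas only; nested groups are not handled.
FIELD = re.compile(
    r"(required|optional|repeated)\s+(\w+)\s+(\w+)\s*(?:\(([^)]*)\))?\s*;"
)

# Assumed spellings of the string annotation; "UTF-8" is kept because
# that is how the report writes it.
STRING_ANNOTATIONS = {"UTF8", "UTF-8", "STRING"}


def unannotated_binary_fields(schema_text):
    """Return names of binary fields that lack a string annotation."""
    names = []
    for _repetition, physical, name, annotation in FIELD.findall(schema_text):
        is_string = annotation.strip().upper() in STRING_ANNOTATIONS
        if physical == "binary" and not is_string:
            names.append(name)
    return names
```

Applied to the two schemas in this report, the helper flags email in the original definition and nothing in the corrected one.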

opened by welsonzhang 0

Releases(r1.15.5-deeprec2210)

r1.15.5-deeprec2210(Nov 17, 2022)
Major Features and Improvements

Embedding

Support HBM-DRAM-SSD storage in EmbeddingVariable multi-tier storage.

Support multi-tier EmbeddingVariable initialized based on frequency when restore model.

Support to lookup location of ids of EmbeddingVariable.

Support kv_initialized_op for GPU Embedding Variable.

Support restore compatibility of EmbeddingVariable using init_from_proto.

Improve performance of apply/gather ops for EmbeddingVariable.

Add Eviction Manager in EmbeddingVariable Multi-tier storage.

Add unified thread pool for cache of Multi-tier storage in EmbeddingVariable.

Save frequencies and versions of features in SSDHash and LevelDB storage of EmbeddingVariable.

Avoid invalid eviction when using HBM-DRAM storage of EmbeddingVariable.

Prevent access to uninitialized data when using EmbeddingVariable.

Graph & Grappler Optimization

Optimize Async EmbeddingLookup by placement optimization.

Place VarHandlerOp to Compute main graph for SmartStage.

Support independent thread pool for stage subgraph to avoid thread contention.

Implement device placement optimization.

Runtime Optimization

Support CUDA Graph execution by adding CUDA Graph mode session.

Support CUDA Graph execution in JIT mode.

Support intra task cost estimate in CostModel in Executor.

Support tf.stream and tf.colocate python API for CUDA multi-stream.

Support embedding subgraphs partition policy when using CUDA multi-stream.

Optimize CUDA multi-stream by merging copy stream into compute stream.

Ops & Hardware Acceleration

Add a list of Quantized* and _MklQuantized* ops.

Implement GPU version of SparseFillEmptyRows.

Implement C version of spin_lock to support multiple architectures.

Upgrade the OneDNN version to v2.7.

Distributed

Support distributed training using SOK based on EmbeddingVariable.

Add NETWORK_MAX_CONNECTION_TIMEOUT to make the connection timeout configurable in StarServer.

Upgrade the SOK version to v4.2.

IO

Add TF_NEED_PARQUET_DATASET to enable ParquetDataset.

Serving

Optimize embedding lookup performance by disabling feature filter when serving.

Optimize the error code returned to the user when parsing a request or response fails.

Support an independent model-update threadpool to avoid performance jitter.

ModelZoo

Add MaskNet Model.

Add PLE Model.

Support variable type BF16 in DCN model.

BugFix

Fix tf.nn.embedding_lookup interface bug and session hang bug when enabling async embedding.

Fix warmup failure when the user sets a warmup file path.

Fix build failure in ev_allocator.cc and hash.cc on ARM.

Fix build failure in arrow when building on ARM.

Fix redefined error in NEON header file for ARM.

Fix _mm_malloc build failure in sparsehash on ARM.

Fix warmup failure when using session_group.

Fix build save graph bug when creating partitioned EmbeddingVariable in feature_column API.

Fix the colocation error when using EmbeddingVariable in distribution.

Fix HostNameToIp failure by replacing gethostbyname with getaddrinfo in StarServer.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2210-cpu-py36-ubuntu18.04

GPU Image

alideeprec/deeprec-release:deeprec2210-gpu-py36-cu116-ubuntu18.04

Thanks to our Contributors

Duyi-Wang, Locke, shijieliu, Honglin Zhu, chenxujun, GosTraight2020, LALBJ, Nanno
r1.15.5-deeprec2208u1(Nov 2, 2022)
Major Features and Improvements

BugFix

Fix a list of Quantized* and _MklQuantized* ops not found issue.

Fix build save graph bug when creating partitioned EmbeddingVariable in feature_column API.

Fix warmup failure when the user sets a warmup file path.

Fix warmup failure when using session_group.

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2208u1-cpu-py36-ubuntu18.04

GPU Image

alideeprec/deeprec-release:deeprec2208u1-gpu-py36-cu116-ubuntu18.04
r1.15.5-deeprec2208(Sep 23, 2022)
Major Features and Improvements

Embedding

Multi-tier EmbeddingVariable supports HBM; add async compactor in SSDHashKV.

Support tf.feature_column.shard_embedding_columns, SequenceCategoricalColumn and WeightedCategoricalColumn API for EmbeddingVariable.

Support save and restore checkpoint of GPU EmbeddingVariable.

Support EmbeddingVariable OpKernel with REAL_NUMBER_TYPES.

Support user defined default_value for feature filter.

Support feature column API for MultiHash.

Graph & Grappler Optimization

Add FP32 fused l2 normalize op and grad op and tf.nn.fused_layer_normalize API.

Add Concat+Cast fusion ops.

Optimize SmartStage performance on GPU.

Add macro to control to optimize mkl_layout_pass.

Support asynchronous embedding lookup.

Runtime Optimization

CPUAllocator, avoid multiple threads cleanup at the same time.

Support an independent intra threadpool for each session, and pinning the intra threadpool to a cpuset.

Support multi-stream with virtual device.

Ops & Hardware Acceleration

Implement ApplyFtrl, ResourceApplyFtrl, ApplyFtrlV2 and ResourceApplyFtrlV2 GPU kernels.

Optimize BatchMatmul GPU kernel.

Integrate cuBLASlt into backend and use BlasLtMatmul in batch_matmul_op.

Support GPU fusion of matmul+bias+(activation).

Merge NV-TF r1.15.5+22.06.

Optimizer

Support AdamW optimizer for EmbeddingVariable.

Model Save/Restore

Support asynchronously restore EmbeddingVariable from checkpoint.

Support EmbeddingVariable in init_from_checkpoint.

Serving

Add go/java/python client SDK and demo.

Support GPU multi-streams in SessionGroup.

Support independent inter thread pool for each session in SessionGroup.

Support multi-tiered Embedding.

Support immutable EmbeddingVariable.

Quantization

Add low precision optimization tool, support BF16, FP16, INT8 for savedmodel and checkpoint.

Add embedding variable quantization.

ModelZoo

Optimize DIN's BF16 performance.

Add DCN & DCNv2 models and MLPerf recommendation benchmark.

Profiler

Add detail information for RecvTensor in timeline.

Dockerfile

Add ubuntu 22.04 dockerfile and images with gcc11.2 and python3.8.6.

Add cuda11.2, cuda11.4, cuda11.6, cuda11.7 docker images and use cuda 11.6 as default GPU image.

Environment & Build

Update default TF_CUDA_COMPUTE_CAPABILITIES to 6.0,6.1,7.0,7.5,8.0.

Upgrade bazel version to 0.26.1.

Support for building DeepRec on ROCm2.10.0.

BugFix

Fix build failures with gcc11 & gcc12.

StarServer, remove user packet split to avoid multiple user packet out-of-order issue.

Fix the 'NodeIsInGpu is not declare' issue.

Fix the placement bug of worker devices during distributed training in ModelZoo.

Fix out of range issue for BiasAddGrad op when AVX512 is enabled.

Avoid loading an invalid model when the model is updated in serving.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2208-cpu-py36-ubuntu18.04

GPU Image

alideeprec/deeprec-release:deeprec2208-gpu-py36-cu116-ubuntu18.04
r1.15.5-deeprec2206(Jul 6, 2022)
Major Features and Improvements

Embedding

Multi-tier EmbeddingVariable, add SSD_HashKV, which performs better than LevelDB.

Support GPU EmbeddingVariable, whose gather/apply ops are placed on GPU.

Add user API to record frequency and version for EmbeddingVariable.

Graph Optimization

Add Embedding Fusion ops for CPU/GPU.

Optimize SmartStage performance on GPU.

Runtime Optimization

Executor, support cost-based scheduling and running critical-path ops first.

GPUAllocator, support CUDA malloc async allocator. (requires CUDA >= 11.2)

CPUAllocator, automatic generation of memory allocation policy.

PMEMAllocator, optimize allocator and add statistics.

Ops & Hardware Acceleration

Implement SparseReshape, SparseApplyAdam, SparseApplyAdagrad, SparseApplyFtrl, ApplyAdamAsync, SparseApplyAdamAsync, KvSparseApplyAdamAsync GPU kernels.

Optimize UnSortedSegment on CPU.

Upgrade OneDNN to v2.6.

IO & Dataset

ParquetDataset, add parquet dataset which could reduce storage and improve performance.

Model Save/Restore

Asynchronously restore EmbeddingVariable from checkpoint.

Serving

SessionGroup, greatly improve QPS and RT in inference.

ModelZoo

Add models SimpleMultiTask, ESMM, DBMTL, MMoE, BST.

Profiler

Support for mapping of operators and real thread ids in timeline.

BugFix

Fix EmbeddingVariable core when EmbeddingVariable only has primary embedding value.

Fix abnormal behavior in L2-norm calculation.

Fix save checkpoint issue when using LevelDB in EmbeddingVariable.

Fix failure to delete old checkpoints when using incremental checkpoint.

Fix build failure with CUDA 11.6.

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2206-cpu-py36-ubuntu18.04

GPU Image

alideeprec/deeprec-release:deeprec2206-gpu-py36-cu110-ubuntu18.04
r1.15.5-deeprec2204u1(Apr 28, 2022)
Major Features and Improvements

BugFix

Fix saving checkpoint issue when using EmbeddingVariable. (https://github.com/alibaba/DeepRec/issues/167)

Fix inputs from different frames issue when using auto graph fusion. (https://github.com/alibaba/DeepRec/issues/144)

Fix embedding_lookup_sparse graph issue.

Release Images

CPU Image

alideeprec/deeprec-release:deeprec2204u1-cpu-py36-ubuntu18.04

GPU Image

alideeprec/deeprec-release:deeprec2204u1-gpu-py36-cu110-ubuntu18.04
r1.15.5-deeprec2204(Apr 7, 2022)
Major Features and Improvements

Embedding

Support hybrid storage of EmbeddingVariable (DRAM, PMEM, LevelDB)

Support memory-continuous storage of multi-slot EmbeddingVariable.

Optimize beta1_power and beta2_power slots of EmbeddingVariable.

Support restore frequency of features in EmbeddingVariable.

Distributed Training

Integrate SOK in DeepRec.

Graph Optimization

Auto Graph Fusion, support float32/int32/int64 type for select fusion.

SmartStage, fix the bug where the graph contains a cycle when SmartStage optimization is enabled.

Runtime Optimization

GPUTensorPoolAllocator, which reduces GPU memory usage and improves performance.

PMEMAllocator, support allocation in persistent memory.

Optimizer

Optimize AdamOptimizer performance.

Op & Hardware Acceleration

Change fused MatMul layout type and number of threads for small-size inputs.

IO & Dataset

KafkaGroupIODataset, support consumer rebalance.

Model Save/Restore

Support dump incremental graph info.

Serving

Add serving module (ODL processor), which supports Online Deep Learning (ODL).

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2204-cpu-py36-ubuntu18.04

GPU Image

registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2204-gpu-py36-cu110-ubuntu18.04

Known Issue

Some users reported issues when using Embedding Variable, such as https://github.com/alibaba/DeepRec/issues/167. The bug is fixed in r1.15.5-deeprec2204u1.
r1.15.5-deeprec2201(Jan 11, 2022)
This is the first release of DeepRec. DeepRec has super large-scale distributed training capability, supporting model training on trillions of samples and embedding processing at the 100-billion scale. For sparse model scenarios, in-depth performance optimization has been conducted across CPU and GPU platforms.

Major Features and Improvements

Embedding

Embedding Variable (including feature eviction and feature filter)

Dynamic Dimension Embedding Variable

Adaptive Embedding

Multi-Hash Variable

Distributed Training

GRPC++

StarServer

Graph Optimization

Auto Micro Batch

Auto Graph Fusion

Embedding Fusion

Smart Stage

Runtime Optimization

CPU Memory Optimization

GPU Memory Optimization

GPU Virtual Memory

Optimizer

AdamAsync Optimizer

AdagradDecay Optimizer

Op & Hardware Acceleration

CPU/GPU optimization of tens of ops, including Unique, Gather, DynamicStitch, BiasAdd, Select, Transpose, SparseSegmentReduction, where, DynamicPartition and SparseConcat.

Support oneDNN-2.3.2 & BF16

Support TF32

IO & Dataset

WorkQueue

KafkaDataset

More details of features: https://deeprec.readthedocs.io/zh/latest/

Release Images

CPU Image

registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2201-cpu-py36-ubuntu18.04

GPU Image

registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2201-gpu-py36-cu110-ubuntu18.04