NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

Last update: Jan 7, 2023

Related tags

Deep Learning NVTabular

Overview

NVTabular | Documentation

NVTabular is a feature engineering and preprocessing library for tabular data that is designed to easily manipulate terabyte scale datasets and train deep learning (DL) based recommender systems. It provides high-level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS Dask-cuDF library. NVTabular is designed to be interoperable with both PyTorch and TensorFlow using dataloaders that have been developed as extensions of native framework code. In our experiments, we were able to speed up existing TensorFlow pipelines by nine times and existing PyTorch pipelines by five times with our highly-optimized dataloaders.

NVTabular is a component of NVIDIA Merlin Open Beta. NVIDIA Merlin is used for building large-scale recommender systems. With NVTabular being a part of the Merlin ecosystem, it also works with the other Merlin components including HugeCTR and Triton Inference Server to provide end-to-end acceleration of recommender systems on the GPU. Extending beyond model training, with NVIDIA’s Triton Inference Server, the feature engineering and preprocessing steps performed on the data during training can be automatically applied to incoming data during inference.

Benefits

When training DL recommender systems, data scientists and machine learning (ML) engineers have been faced with the following challenges:

Huge Datasets: Commercial recommenders are trained on huge datasets that may be several terabytes in scale.
Complex Data Feature Engineering and Preprocessing Pipelines: Datasets need to be preprocessed and transformed so that they can be used with DL models and frameworks. In addition, feature engineering creates an extensive set of new features from existing ones, requiring multiple iterations to arrive at an optimal solution.
Input Bottleneck: Data loading, if not well optimized, can be the slowest part of the training process, leading to under-utilization of high-throughput computing devices such as GPUs.
Extensive Repeated Experimentation: The entire data engineering, training, and evaluation process can be repetitious and time consuming, requiring significant computational resources.

NVTabular alleviates these challenges and helps data scientists and ML engineers:

process datasets that exceed GPU and CPU memory without having to worry about scale.
use optimized dataloaders to accelerate training with TensorFlow, PyTorch, and HugeCTR.
focus on what to do with the data and not how to do it by using abstraction at the operation level.
prepare datasets quickly and easily for experimentation so that more models can be trained.

NVTabular provides faster iteration on massive tabular datasets during experimentation and training. NVTabular helps ML/Ops engineers with deploying models into production by providing faster dataset transformation. This makes it easy for production models to be trained more frequently and kept up to date, helping improve responsiveness and model performance.

To learn more about NVTabular's core features, see the following:

Performance

When running NVTabular on the Criteo 1TB Click Logs Dataset using a single V100 32GB GPU, feature engineering and preprocessing was able to be completed in 13 minutes. Furthermore, when running NVTabular on a DGX-1 cluster with eight V100 GPUs, feature engineering and preprocessing was able to be completed within three minutes. Combined with HugeCTR, the dataset can be processed and a full model can be trained in only six minutes.

The performance of the Criteo DRLM workflow also demonstrates the effectiveness of the NVTabular library. The original ETL script provided in Numpy took over five days to complete. Combined with CPU training, the total iteration time is over one week. By optimizing the ETL code in Spark and running on a DGX-1 equivalent cluster, the time to complete feature engineering and preprocessing was reduced to three hours. Meanwhile, training was completed in one hour.

Installation

Prior to installing NVTabular, ensure that you meet the following prerequisites:

CUDA version 10.1+
Python version 3.7+
NVIDIA Pascal GPU or later

NOTE: NVTabular will only run on Linux. Other operating systems are not currently supported.

Installing NVTabular Using Conda

NVTabular can be installed with Anaconda from the nvidia channel by running the following command:

conda install -c nvidia -c rapidsai -c numba -c conda-forge nvtabular python=3.7 cudatoolkit=11.0

If you'd like to create a full conda environment to run the example notebooks, do the following:

Use the environment files that have been provided to install the CUDA Toolkit (11.0 or 11.2).

Clone the NVTabular repo and run the following commands from the root directory:

conda env create -f=conda/environments/nvtabular_dev_cuda11.2.yml
conda activate nvtabular_dev_11.2
python -m ipykernel install --user --name=nvt
pip install -e .
jupyter notebook

When opening a notebook, be sure to select nvt from the Kernel->Change Kernel menu.

Installing NVTabular with Docker

NVTabular Docker containers are available in the NVIDIA Merlin container repository. There are four different containers:

Container Name	Container Location	Functionality
merlin-inference	https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-inference	NVTabular, HugeCTR, and Triton Inference
merlin-training	https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-training	NVTabular and HugeCTR
merlin-tensorflow-training	https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-tensorflow-training	NVTabular, TensorFlow, and HugeCTR Tensorflow Embedding plugin
merlin-pytorch-training	https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-pytorch-training	NVTabular and PyTorch

To use these Docker containers, you'll first need to install the NVIDIA Container Toolkit to provide GPU support for Docker. You can use the NGC links referenced in the table above to obtain more information about how to launch and run these containers. To obtain more information about the software and model versions that NVTabular supports per container, see Support Matrix.

Notebook Examples and Tutorials

We provide a collection of examples, use cases, and tutorials as Jupyter notebooks that demonstrate how to use the following datasets:

MovieLens25M
Outbrain Click Prediction
Criteo 1TB Click Logs
RecSys2020 Competition Hosted by Twitter
Rossmann Sales Prediction

Each Jupyter notebook covers the following:

Feature engineering and preprocessing with NVTabular
Advanced workflows with NVTabular
Accelerated dataloaders for TensorFlow and PyTorch
Scaling to multi-GPU and multi-node systems
Integrating NVTabular with HugeCTR
Deploying to inference with Triton

Feedback and Support

If you'd like to contribute to the library directly, see the Contributing.md. We're particularly interested in contributions or feature requests for our feature engineering and preprocessing operations. To further advance our Merlin Roadmap, we encourage you to share all the details regarding your recommender system pipeline in this survey.

If you're interested in learning more about how NVTabular works, see our NVTabular documentation. We also have API documentation that outlines the specifics of the available calls within the library.

Comments

Adding ops for feature column functionality and feature column to workflow mapping function

Increasing NVTabular compatibility with TensorFlow feature column API by adding remaining necessary ops (cross op and bucketize) and a function which can map from a set of feature columns to an NVTabular workflow that performs all analogous preprocessing. Addresses #371

HashedCross doesn't support multi-hot yet, and I'm not sure that extending to it will be necessarily easy. For reference, the TF cross op handles multi-hots by doing a cartesian product of all indices for each feature. See the documentation here.

Still need to add bucketized support and test everything.

opened by alecgunny 40
Data inspect
Opening because #510 was closed after new_api branch was merged into main.

@jperez999 I applied the feedback you told me, and I also added unit test. I have seen your data generation branch, I can follow your packaging style when you get that PR Merged.

Pending for the future:

Fix dask-cudf dtypes for lists: We need cudf support, there is an issue created (https://github.com/rapidsai/cudf/issues/7157)

When list are supported, test them in the unit tests. @benfred can I modify the testing dataset to add one column that is a list?
opened by albert17 39
[REVIEW] Adding TensorFlow example

Building end-to-end TensorFlow Criteo example to demonstrate how to use NVTabular as a data loader and online preprocessor for TensorFlow. End version will probably end up restructuring existing Criteo example to make its preprocessing notebook callable

opened by alecgunny 38
Asv test support

These changes add support for benchmarking leveraging the Air Speed Velocity framework to display benchmarks over time. This will be extremely helpful in catch regressions in performance while also ensuring any new changes do not adversely affect our selected benchmarks.

opened by jperez999 37
data generator phase 1

builds a dataset given, user supplies number of cont columns, cat columns, cardinality list and distribution type. Also added distribution verification capability to the generator.

opened by jperez999 35
[REVIEW] NVTabular+TF training and inference integration test

Even though we run notebooks as a part of our integration tests, we do not compare the output of the Triton inference with the NVTabular transform() and TF predict() using the same data. This PR addresses the need of an integration test that does that.

We'll be able to test C++ backend with these tests as well.

We might wanna add or run this test from the C++ backend repo.

opened by oyilmaz-nvidia 34
Handle hive-partitioning in NVTabular.dataset.Dataset
Closes #642 Addresses global shuffle component of #641

Thu purpose of this PR is to improve handling of hive-partitioned parquet data in NVTabular. Since the Dataset API already uses dask.dataframe.read_parquet, there is currenlty no "correcness" issue with reading hive-partitioned data. However, (1) there is no convenient mechanism to write hive-partitioned data, and (2) the read stage typically results in many small partitions (rather than a single partition for each input directory).

Solution to (1): The Dataset.to_parquet method now supports a partition_on= argument. This is designed to match the same option in dask.dataframe/dask_cudf. If the user passes a list of 1+ columns with this argument, the output data will be shuffled at IO time into a distinct directory for each unique combination of those partition_on column values. When multiple columns are use for partitioning (e.g. ["month", "day"]), the directory structure is nested (so that the full path for an output file will look something like "/month=Mar/day=30/part.0.parquet").

Solution to (2): Since #641 will need a mechanism to ensure a unique mapping between specified column groups and ddf partitions, this PR adds a Dataset.partition_by_keys method to perform a global shuffle on the specified column group (keys) and return a new (shuffled) Dataset. For general Dataset objects, this method will simply call ddf.shuffle() under the hood. For Dataset objects that are backed by hive-partitioned data, however, we use the metadata stored in the file paths to avoid a full shuffle. In the future, this optimization can be pushed even further by directly agregating all IO tasks within the same hive-partition. However, I suspect that shuch an optimization should be implemented in dask.dataframe.

Example Usage

import pandas as pd import dask.dataframe as dd import dask import nvtabular as nvt path = "fake.data" # Create a sample ddf ddf = dask.datasets.timeseries( start="2000-01-01", end="2000-01-03", freq="600s", partition_freq="6h", seed=42, ).reset_index() ddf['timestamp'] = ddf['timestamp'].dt.round('D').dt.day # Convert to a Datset and write out hive-partitioned data to disk keys = ["timestamp", "name"] nvt.Dataset(ddf).to_parquet(path, partition_on=keys)

This will produce a directory structure like:

$ find fake.data/ -type d -print fake.data/ fake.data/timestamp=1 fake.data/timestamp=1/name=Alice fake.data/timestamp=1/name=Frank fake.data/timestamp=1/name=Victor fake.data/timestamp=1/name=George fake.data/timestamp=1/name=Quinn fake.data/timestamp=1/name=Kevin fake.data/timestamp=1/name=Ursula ...

Then, you can read the data back in with NVT, and ensure that the ddf partitions are shuffled by keys:

ds = nvt.Dataset(path, engine="parquet").shuffle_by_keys(keys) ds.to_ddf().compute()

id x y timestamp name 0 991 -0.750009 -0.587392 1 Alice 1 1022 0.866823 -0.682096 1 Alice 2 991 0.467775 0.683211 1 Alice 3 967 0.534984 -0.931405 1 Bob 4 991 -0.149716 -0.651939 1 Bob .. ... ... ... ... ... 25 964 0.843602 0.598580 3 Yvonne 26 961 0.853070 -0.987596 3 Yvonne 27 947 0.934162 0.190069 3 Yvonne 28 1024 -0.107280 0.662606 3 Yvonne 29 1006 0.169090 -0.784889 3 Zelda [288 rows x 5 columns]
opened by rjzamora 33
TF and PT dataloader list support

Moving MH support from #365 into backend then adding TensorFlow support on top of it. Still needs tests, benchmarking, embedding compatibility on the TF side, and I'm not sure how I feel about the naming conventions adopted (multi-hot columns get __values and __nnz appended to the end of their TF dict representation), but functionality should be there.

Note that while PT is using offsets representation, TF is using nnz since RaggedTensor embedding layer will need that. Eventually we should migrate to offsets when we build custom EmbeddingBag op, so this is controllable with a keyword.

The other key functionality I've added is moving all the conversion from gdf to framework tensors into a make_tensor method on the dataloaders. This way, you can transform "arbitrary" datarames (to the extent they match the schema used to initialize the dataloader) into framework tensors. The intended use case is if you have some test dataset which hasn't yet been transformed by the preprocessing workflow, you can iterate through it, apply the workflow, then send it into a framework for prediction in one foul swoop.

opened by alecgunny 33
[REVIEW] Adding file type option for HugeCTR output

Adds option in HugeCTR to choose if the output should be in binary or parquet formar.

Creates a Base Class (Writer) to be used for Shuffles and HugeCTR.

Modifications: workflow.py and io.oy.

opened by albert17 33
various fixes for diff issues

This will refactor the workflow process for ingesting operators including simplified api and allowing multiple operators of same kind. Ordering will have priority this will allow chaining to follow a more user friendly convention. Reductions to phases will still be conducted before application phase. This also fixes tensorflow API 2 gpu memory usage util and band aid for torch tensor convergence issue. #383 #377 #372

opened by jperez999 32
[REVIEW] Implement Feature Request from #1077 on Left Padding

This pull request is for resolving issue #1077 on implementing left padding for sparse sequential features.

The changes being made, or to be made, in this PR include the implementation of left padding in the torch and tensorflow dataloading modules, any needed changes to user-facing methods, unit test(s) for the this change, and documentation updates related to this fix and while I have been reading and working on the dataloader modules.
dataloader ops

opened by lesnikow 31
Replace `flake8` with `ruff` (which is equivalent but faster)

ruff is a very fast Python linter written in Rust that covers the flake8 rules in addition to a bunch of other popular linters. This sets up a config for it that covers the existing flake8 config, and adds additional linters while generating minimal errors with the existing code base. Since pylint is much slower than ruff (and I suspect slower than flake8 too), this config executes the ruff checks first in order to fail fast.
clean up chore

opened by karlhigley 2
Add support for serializing modules involved in LambdaOp execution by value
These commits address issue #1737 in two ways:

by allowing users to direct Workflow.save to serialize an explicit list of named modules by value, and

by allowing users to direct Workflow.save to use a heuristic to automatically infer the external modules involved in a workflow.

Taken together, these commits will make it easier to serialize workflows that are resilient to execution in environments without source files for all of the modules that were available when the workflow was created, e.g., Docker containers.

The main contribution is the addition of a modules_byvalue parameter to Workflow.save. If this is passed an explicit list of modules, Workflow.save will direct cloudpickle to serialize these modules by value. If it is passed the string "auto", Workflow.save will employ a heuristic approach that will be useful for many real-world uses of LambdaOp.

I believe that it would be harmless and useful to make "auto" the default but have not done so in this PR.

For more background on some of the tradeoffs that would be involved in making the heuristic more precise, please see this blog post. I do not believe that many of the cases I identified in that post (explicit imports within functions, list comprehensions, etc.) are likely to present in realistic NVTabular LambdaOp client code, and thus took a relatively simple approach.
bug
opened by willb 3
[BUG] LambdaOp may fail to deserialize if the processing function is declared in another module
Describe the bug

By default, Cloudpickle serializes functions and modules by reference. Deserializing workflows that contain user functions declared in other modules can fail if the Python source files for those modules are not available. This means that serializing a workflow and publishing it in a Docker container requires either (1) that all user source files are published in the same paths in the container as they were when the workflow was serialized or (2) that user modules are not serialized by reference.

Steps/Code to reproduce bug

Assume that identity.py contains the identity function, called identity.

with open("identity.py", "w") as of: of.write(""" def identity(col): return col """)

Then run this code:

import nvtabular as nvt import identity wf = nvt.Workflow( ["col_a"] >> nvt.ops.LambdaOp(identity.identity) ) wf.save("identity-workflow")

This serialized workflow will fail to deserialize if identity.py is not available in the same location, e.g., in a Docker container or on another host.

import sys import os del sys.modules["identity"] os.unlink("identity.py") try: wf_ref = nvt.Workflow.load("identity-workflow") except ModuleNotFoundError as mnfe: print("Failed to load workflow") print(str(mnfe))

Expected behavior

Serialized workflows should be largely self-contained and resilient to missing source files containing user transformation code.

Environment details (please complete the following information):

Environment location: Docker

Method of NVTabular install: Docker

If method of install is [Docker], provide docker pull & docker run commands used:

docker run -v /home/willb/devel/nvtabular-sandbox:/workspace/host --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8787 -p 8796:8786 --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-training:22.03 /bin/bash

Additional context

This notebook demonstrates the minimal reproducer as well as a workaround.

I expect to submit a patch shortly that will enhance Workflow.save in two ways:

to allow clients to provide explicit direction that certain modules should be serialized by value, and

to allow clients to request optional automatic identification of such modules.

bug P1
opened by willb 0
Fix max_size behavior in Categorify

Closes https://github.com/NVIDIA-Merlin/NVTabular/issues/1735

Changes both combo- and joint-encoding logic to increase the null-value count to include any categories that will be dropped to satisfy the max_size argument.

Warning: This change means the 0th row in unique.*.parquet will no longer correspond to actual null-value count, but to the number of values in the dataset that would be mapped to the "null category" at transformation time (including both null and low-frequency values). cc @rnyak - Please confirm that this is the expected behavior.
ops

opened by rjzamora 2
[BUG] Categorify does not set non-frequent item mapping size in unique parquet file when `max_size` arg is set
Describe the bug

When we set the max_size arg in Categorify() op, we intent to apply frequency capping, i.e., non-frequent items are mapped to 0, based on max_size value that is set. However, although Categorify works fine and encodes the frequent and non-frequent items properly, the unique_...parquet file does not set the size of non_frequent items correct. This is an important issue since users rely on these unique...parquet files and they do further processes with them. One example is to calculate the item frequencies from unique.item_id.parquet file and do some logits correction in Merlin Models.

Steps/Code to reproduce bug Please run the code to repro the issue

import nvtabular as nvt from merlin.core import dispatch from merlin.core.dispatch import make_df from nvtabular import ColumnSelector, ops import pandas as pd df = dispatch.make_df( { "item_id": [ "A", "E", "B", "C", "A", "A", "B", "C", "D", "A", "B", "A", "B", ], } ) cat_names = ["item_id"] dataset = nvt.Dataset(df) cat_features = cat_names >> ops.Categorify(max_size=3) processor = nvt.Workflow(cat_features) processor.fit(dataset) new_gdf = processor.transform(dataset).to_ddf().compute() pd.read_parquet('./categories/unique.item_id.parquet') item_id item_id_size 0 <NA> 0 1 A 5 2 B 4

Expected behavior The size value in the first row in the unique.item_id.parquet should read as 4 instead of 0, since after setting max_size arg, we have 4 samples in the processed new_gdf that were mapped to 0.

Environment details (please complete the following information): I am using merlin-tensorflow:22.11 image with the latest main branches of the Merlin libs pulled and installed.
bug P0
opened by rnyak 0

Releases(v1.8.0)

v1.8.0(Dec 30, 2022)
What’s Changed

📄 Documentation

Address virtual developer review feedback @mikemckiernan (#1724)

🔧 Maintenance

remove test references that are no longer available @jperez999 (#1730)

remove integration tests for notebooks no longer available @jperez999 (#1729)

Use pre-commit for lint checks in GitHub Actions Workflow @oliverholworthy (#1723)

Remove echo from command in tox.ini @oliverholworthy (#1725)

Migrate the legacy examples to the Merlin repo @karlhigley (#1711)

Handle data loader as an iterator @oliverholworthy (#1720)

Release draft fix @jperez999 (#1712)

Add Jenkinsfile @AyodeAwe (#1702)

Source code(tar.gz)
Source code(zip)
nvtabular-1.8.0-cp38-cp38-linux_x86_64.whl(254.99 KB)
nvtabular-1.8.0-cp39-cp39-linux_x86_64.whl(255.37 KB)
nvtabular-1.8.0.tar.gz(123.24 KB)
v1.7.0(Nov 23, 2022)
What’s Changed

🐜 Bug Fixes

fix tox to use correct branch in release tags @jperez999 (#1710)

Remove min value count from properties when using sparse_max @oliverholworthy (#1705)

Update metrics keys in example notebook tests @karlhigley (#1703)

Fix first/last groupby aggregation on list columns @rjzamora (#1693)

📄 Documentation

Update metrics keys in example notebook tests @karlhigley (#1703)

docs: Add semver to calver banner @mikemckiernan (#1699)

docs: Add basic SEO configuration @mikemckiernan (#1697)

🔧 Maintenance

fix tox to use correct branch in release tags @jperez999 (#1710)

Upload binary wheels for nvtabular @benfred (#1696)

Use merlin-dataloader package @benfred (#1694)

Source code(tar.gz)
Source code(zip)
nvtabular-1.7.0-cp38-cp38-linux_x86_64.whl(254.79 KB)
nvtabular-1.7.0-cp39-cp39-linux_x86_64.whl(254.63 KB)
nvtabular-1.7.0.tar.gz(123.35 KB)
v1.6.0(Oct 31, 2022)
What’s Changed

🐜 Bug Fixes

Fix first/last groupby aggregation on list columns @rjzamora (#1693)

Fix Categorify bug for combo encoding with null values @rjzamora (#1652)

Fix joint Categorify with list columns @rjzamora (#1685)

📄 Documentation

update NVTabular examples @radekosmulski (#1633)

Remove examples Part 1 - Rossmann, RecSys2020, Outbrain @bschifferer (#1669)

🔧 Maintenance

adding import or skip for tensorflow framework required by examples @jperez999 (#1691)

Source code(tar.gz)
Source code(zip)
v1.5.0(Sep 26, 2022)
What’s Changed

🐜 Bug Fixes

Use Merlin DAG executors from core in integration tests @jperez999 (#1677)

Fix target encoding tagging issue @bbozkaya (#1672)

🔧 Maintenance

Remove stray file left over from Torch/Horovod multi-GPU example @karlhigley (#1674)

Use Merlin DAG executors from core in integration tests @jperez999 (#1677)

Remove poetry config @benfred (#1673)

chore: Add pybind11 as a tox requirement @mikemckiernan (#1675)

Switch to using the DAG executors from Merlin Core @karlhigley (#1666)

Use the latest version of Merlin Core from main in the tox test envs @karlhigley (#1671)

Set up tox environments for testing, linting, and building docs @karlhigley (#1667)

Source code(tar.gz)
Source code(zip)
nvtabular-1.5.0.tar.gz(130.49 KB)
v1.4.0(Sep 6, 2022)
What’s Changed

⚠ Breaking Changes

Remove FastAI notebooks @benfred (#1668)

Fix dl @jperez999 (#1661)

Replace cudf series ceil() with numpy ceil() @jperez999 (#1656)

🐜 Bug Fixes

Fix integration tests that reached into Workflow's private methods @karlhigley (#1660)

Fix groupby on lists with cudf 22.06+ @benfred (#1654)

Update the Categorify operator to set the domain max correctly @oliverholworthy (#1641)

Test LambdaOp with dask workflows @benfred (#1634)

🚀 Features

Add sum to supported aggregations in Groupby @radekosmulski (#1638)

📄 Documentation

Remove using-feature-columns nb @rnyak (#1657)

Fix typos @benfred (#1655)

🔧 Maintenance

Add optional requirement specifiers for GPU and dev requirements @karlhigley (#1664)

Add scipy as a dependency @karlhigley (#1663)

Fix dl @jperez999 (#1661)

Fix integration tests that reached into Workflow's private methods @karlhigley (#1660)

Update black/pylint/flake8,isort etc @benfred (#1659)

Remove using-feature-columns nb @rnyak (#1657)

Replace cudf series ceil() with numpy ceil() @jperez999 (#1656)

Extract Python and Dask Executor classes from Workflow @karlhigley (#1609)

Update versioneer from 0.19 to 0.23 @oliverholworthy (#1651)

Source code(tar.gz)
Source code(zip)
nvtabular-1.4.0.tar.gz(132.41 KB)
v1.3.3(Jul 22, 2022)

Source code(tar.gz)
Source code(zip)
nvtabular-1.3.3.tar.gz(129.48 KB)
v1.3.2(Jul 20, 2022)

Source code(tar.gz)
Source code(zip)
nvtabular-1.3.2.tar.gz(129.56 KB)
v1.3.1(Jul 19, 2022)
What’s Changed

🔧 Maintenance

Tri up time @jperez999 (#1623)

Source code(tar.gz)
Source code(zip)
nvtabular-1.3.1.tar.gz(129.49 KB)
v1.3.0(Jul 19, 2022)
What’s Changed

🐜 Bug Fixes

Don't install tests with nvtabular @benfred (#1608)

Groupby to no longer require groupby_cols in column selector @radekosmulski (#1598)

Adjust imports in the TritonPythonModel for Workflows @karlhigley (#1604)

column names can now include aggregations in ops.Groupby @radekosmulski (#1592)

Normalize Op using fp32 @benfred (#1597)

Cast warning to string in configure_tensorflow @leewyang (#1587)

📄 Documentation

docs: Add TF compat info @mikemckiernan (#1528)

🔧 Maintenance

Fix movielens notebook data path @jperez999 (#1622)

skip download step, that is not allowed in CI @jperez999 (#1620)

fix tritonserver gpu id & fixed timeout for criteo integration tests @jperez999 (#1619)

Remove unnecessary docs dependencies @mikemckiernan (#1617)

fix ci script for integration tests and added skip check @jperez999 (#1616)

Integration tests refactor @jperez999 (#1614)

Don't git pull origin main in integration tests, use container version @karlhigley (#1610)

Source code(tar.gz)
Source code(zip)
nvtabular-1.3.0.tar.gz(129.52 KB)
v1.2.2(Jun 21, 2022)
What’s Changed

🐜 Bug Fixes

add casting for additional aggs in groupby @radekosmulski (#1580)

📄 Documentation

Update URLs to Criteo datasets @mikemckiernan (#1591)

🔧 Maintenance

Fix integration tests @benfred (#1594)

Source code(tar.gz)
Source code(zip)
nvtabular-1.2.2.tar.gz(209.88 KB)
v1.2.1(Jun 16, 2022)
What’s Changed

🔧 Maintenance

Update the container labels in the integration tests @benfred (#1588)

Update poetry config @benfred (#1585)

Source code(tar.gz)
Source code(zip)
nvtabular-1.2.1.tar.gz(209.56 KB)
v1.2.0(Jun 15, 2022)
What’s Changed

🐜 Bug Fixes

remove nvtabular triton backend that seg faults on termination. @jperez999 (#1576)

Fix LambdaOp example usage 1 @rnyak (#1561)

📄 Documentation

Merlin offers three containers @mikemckiernan (#1581)

Fix dataloader docstring @benfred (#1573)

Improved docstrings of GroupBy op to reinforce the required usage of dataset.shuffle_by_keys() @gabrielspmoreira (#1551)

Remove old support matrix table, @benfred (#1560)

Update CONTRIBUTING to mention PR labels @mikemckiernan (#1554)

Update changelog to point to github releases @benfred (#1549)

Use common release-drafter workflow @mikemckiernan (#1548)

🔧 Maintenance

Add a GA workflow that requires labels on PR's @benfred (#1579)

Use shared implementation of triage workflow @benfred (#1577)

Don't pull main on running NVT unittests @benfred (#1578)

Don't build model_config_pb2 @benfred (#1566)

Add conda builds to our github actions workflow @benfred (#1557)

Add release-drafter workflow for generating changelogs @benfred (#1540)

Remove message about integration tests missing @benfred (#1539)

Source code(tar.gz)
Source code(zip)
nvtabular-1.2.0.tar.gz(209.74 KB)
v1.1.1(May 10, 2022)

Source code(tar.gz)
Source code(zip)
nvtabular-1.1.1.tar.gz(234.46 KB)
v1.1.0(May 10, 2022)
Known Issues

Error when sending request to Triton after loading a Transformers4Rec PyTorch model https://github.com/NVIDIA-Merlin/NVTabular/issues/1502

What's Changed

Automate pushing package to pypi by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1505

docs: Add attention admonition to Merlin SMX by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1507

added category name to domain for column properties by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1508

Fix the embedding size lookup in Categorify op by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1511

Max auc by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1513

Fix inf container tag in getting started TF-inf nb and polish exp README by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1516

Fix for max-size categorify operator category ordering by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1519

Criteo HugeCTR Inference Configuration Fix by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1522

Add ascending param in the Groupby op by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1525

Remove os.environ["TF_MEMORY_ALLOCATION"] from getting-started 03-Training-with-TF nb to avoid OOM by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1527

Fix getting started 03-Training-with-HugeCTR.ipynb nb's training without printing out auc and loss metrics issue by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1532

reqs fixed by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1536

docs: Add ext-toc, switch to MyST-NB by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1529

remove horovod example, no longer supported by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1530

Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v1.0.0...v1.1.0
Source code(tar.gz)
Source code(zip)
nvtabular-1.1.0.tar.gz(234.34 KB)
v1.0.0(Apr 6, 2022)
What's Changed

Assume 'merlin' is a first party package for isort by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1420

End-to-end inference POC migration to new ensemble API by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1391

Update test_integration.sh by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1422

update test_tf4rec.py by @radekosmulski in https://github.com/NVIDIA-Merlin/NVTabular/pull/1424

Fix lambda dtype issue in PyTorch Multi-GPU training example notebook by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1425

Prevent dataloaders from using GPU memory when CPU device is selected by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1429

Fix dtype bug with GroupBy operator when aggs is a string by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1430

Fix typo in example notebook by @L0Z1K in https://github.com/NVIDIA-Merlin/NVTabular/pull/1390

Extract Triton Ensemble DAG to merlin.systems package by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1426

Add TagAs and related wrapper classes by @radekosmulski in https://github.com/NVIDIA-Merlin/NVTabular/pull/1414

docs: Add preview doc build to PR by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1432

Docs script by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1433

docs: Ensure that parent review directory exists by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1434

Update reqs by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1406

Handle aiobotocore v2.0+ in test_s3 by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1439

Update to work with the latest merlin-core by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1441

Add intersphinx mappings for merlin.core by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1440

Updates Container tests by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1445

Asvdb fix for integration testing by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1413

remove setuptools by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1460

Update imports for classes that moved to merlin-core by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1447

Reactivate hugectr Criteo integration test by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1457

Wrapper for TagAs did not work by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1462

Set up automated docstring coverage checks by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1454

doc: Update matrix for 22.03 by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1450

Remove Systems library from nvtabular by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1456

Fix bug about criteo download notebook by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1453

Add deprecation warnings to modules that moved to core by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1466

Hard-code the Workflow output dtypes for HugeCTR in Triton by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1468

AWS SageMaker by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1421

Improve Workflow error about mismatched dtypes by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1465

Exclude additional directories and boost docstring coverage req to 35 percent by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1471

fix(docs): Restore the version picker by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1474

Documentation fixes from the docstring scrub by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1475

Add missing --user flag to natsort CI install by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1476

Change merlin level NVT import to transforms (from transform) by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1472

Move merlin.core.worker to merlin.io.worker by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1477

Fix merlin.core.worker imports by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1482

Use quieter DeprecationWarning instead of FutureWarning by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1486

Remove imports to deprecated modules by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1487

README updates by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1478

Add Troubleshoot for OOM errors with NVTabular dataloaders by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1373

Upgrade poetry dependencies by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1489

Note in the README that installing with pip runs only on CPU by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1494

Add deprecation warnings to loader, inference, framework_utils by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1492

Add merlin.transforms.ops sub-package by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1491

fix for 1455 by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1497

Restrict running on pandas 1.4.x by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1496

Fixing Criteo Inference for TensorFlow and HugeCTR by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1500

docs: Add a redirect page by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1499

Final updates for 1.0 release by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1501

update to compatible dtype by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1503

New Contributors

@radekosmulski made their first contribution in https://github.com/NVIDIA-Merlin/NVTabular/pull/1424

@L0Z1K made their first contribution in https://github.com/NVIDIA-Merlin/NVTabular/pull/1390

Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v0.11.0...v1.0.0
Source code(tar.gz)
Source code(zip)
v0.11.0(Mar 1, 2022)
What's Changed

Docs: Update URL to Criteo notebook by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1383

Update support_matrix.rst by @lgardenhire in https://github.com/NVIDIA-Merlin/NVTabular/pull/1375

Support min_val for categorical features in DataGen by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1369

Fix null_size logic in Categorify op by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1386

Fix CUDA version doc by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1387

Fixes tests utils imports by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1393

Exit integration by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1395

Fix lambdaop call by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1394

Add ReduceDtypeSize op by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1398

Fix remove_inputs usage in export_pytorch_ensemble by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1389

Param to send test results by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1405

Migrate io, graph, dispatch, worker, and utils to merlin.core by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1384

Import Distributed and Serial execution-manager utilities from merlin-core by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1380

Pin merlin-core to a specific commit to avoid breaking changes by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1409

Rename merlin.graph to merlin.dag by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1411

Add DropLowCardinality op by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1412

Update merlin-core to v0.1.1 (instead of main branch) by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1419

New Contributors

@mikemckiernan made their first contribution in https://github.com/NVIDIA-Merlin/NVTabular/pull/1383

Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v0.10.0...v0.11.0
Source code(tar.gz)
Source code(zip)
v0.10.0(Feb 2, 2022)
What's Changed

schema metadata propagation by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1354

Create TagSet as a container that resolves conflicts between tags (like continuous and categorical) by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1360

Update support_matrix.rst by @lgardenhire in https://github.com/NVIDIA-Merlin/NVTabular/pull/1363

Raise an error when the actual dtype produced by an operator doesn't match the schema by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1362

Deprecate client from Dataset, Workflow, and DatasetInspector by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1318

fixes asv display to one metric per notebook and does not repeat metrics by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1366

Keras loader nvt dataset usage by default if available by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1374

Fixes hash_crossed with cudf 21.12 by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1376

Fixes tests by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1377

Support custom Python operators in the Triton operator/ensemble API by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1368

Use new fsspec.parquet module to accelerate reads from remote storage by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1241

Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v0.9.0...v0.10.0
Source code(tar.gz)
Source code(zip)
v0.9.0(Jan 11, 2022)
What's Changed

Workflow for adding issues to the backlog by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1305

Set the priority and date added fields for new issues. by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1308

Label issues not created by nvidia-merlin members by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1309

moved tf import to after tf config is completed by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1311

Fix Triton import for _convert_string2pytorch_dtype by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1312

Apply NVT graph API/DSL to building Triton ensembles by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1292

Fixes tests by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1326

Activates Blossom CI by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1324

Add a compute_input_schema method to operators by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1330

removed column_types.json from nvtabular by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1317

working refit as expected by user by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1338

Update support_matrix.rst by @lgardenhire in https://github.com/NVIDIA-Merlin/NVTabular/pull/1336

HugeCTR Multihot Training-Inference example by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1329

Triton setup via merlin graph api by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1339

removed parents selector logic in selector setter, by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1343

Switch to packaging.version.Version for version checks by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1345

fix for storage name bug in path creation by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1347

Fix multiGPU Pytorch MovieLens by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1319

Update dead links in Documentation by @SimonCW in https://github.com/NVIDIA-Merlin/NVTabular/pull/1342

Fixes cudf 21.10 error by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1350

Fixes unit tests for containers by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1349

Create an explicit mapping between Operator input and output columns by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1348

Updates notebooks for cudf 21.10 by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1353

Revert notebook by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1355

Update conda packages to cudf >= 21.10 and add pynvml by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1356

Fix writing out workflows to S3 by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1357

New Contributors

@SimonCW made their first contribution in https://github.com/NVIDIA-Merlin/NVTabular/pull/1342

Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v0.8.0...v0.9.0
Source code(tar.gz)
Source code(zip)
v0.8.0(Dec 7, 2021)
What's Changed

Allow writing workflows to cloud storage by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1232

Avoid copy of remote-data buffer in call to read_parquet by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1239

Update container references to merlin 21.11 by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1242

Fix numpy version in CI by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1255

Modularize the Triton inference model for NVT Workflows by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1252

Dl cpu by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1245

fixes for schema saving and writing by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1215

decouple io from schema by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1161

Remove non-exist Torch uint dtypes from Triton conversion utils by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1270

utf-8 when opening notebooks by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1271

Add 'pad' option for the ListSlice op by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1262

End-to-end Inference support for Transformers4Rec Tensorflow Models by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1256

fix lookup error on typo in tags for target by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1281

Fix resolution of tags to column names when executing Workflows by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1285

Extract all knowledge of Triton from the serving-time WorkflowRunners by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1257

Extract an abstract graph package from NVT Workflows by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1265

dataset duck typing for dataloader by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1272

Reduce device-memory footprint in Categorify fit by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1259

Fixes for ListSlice operator with padding by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1288

Update support_matrix.rst by @lgardenhire in https://github.com/NVIDIA-Merlin/NVTabular/pull/1243

Fix notebook tests broken by recent graph refactoring by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1293

add init file for import support by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1300

add missing dependencies to poetry by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1298

Fix inference issues for end-to-end TF example for Transformers4Rec by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1299

Uninstall NVT (removing versions from PyPI) before installing NVT in CI by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1303

Updates integration tests by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1294

fix train_test split by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1291

fix arbitrary output file number bug, shrink number of files and warn… by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1301

Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v0.7.1...v0.8.0
Source code(tar.gz)
Source code(zip)
v0.7.1(Nov 4, 2021)
NVTabular v0.7.1 (2 November 2021)

Improvements

Add LogOp support for list features #1153

Add Normalize operator support for list features #1154

Add DataLoader.epochs() method and Dataset.to_iter(epochs=) argument #1147

Add ValueCount operator for recording of multihot min and max list lengths #1171

Bug Fixes

Fix Criteo inference #1198

Fix performance regressions in Criteo benchmark #1222

Fix error in JoinGroupby op #1167

Fix Filter/JoinExternal key error #1143

Fix LambdaOp transforming dependency values #1185

Fix reading parquet files with list columns from GCS #1155

Fix TargetEncoding with dependencies as the target #1165

Fix Categorify op to calculate unique count stats for Nulls #1159

Source code(tar.gz)
Source code(zip)
v0.7.0(Sep 24, 2021)
NVTabular v0.7.0

Improvements

Add column tagging API #943

Export dataset schema when writing out datasets #948

Make dataloaders aware of schema #947

Standardize a Workflows representation of its output columns #372

Add multi-gpu training example using PyTorch Distributed #775

Speed up reading Parquet files from remote storage like GCS or S3 #1119

Add utility to convert TFRecord datasets to Parquet #1085

Add multi-gpu training example using PyTorch Distributed #775

Add multihot support for PyTorch inference #719

Add options to reserve categorical indices in the Categorify() op #1074

Update notebooks to work with CPU only systems #960

Save output from Categorify op in a single table for HugeCTR #946

Add a keyset file for HugeCTR integration #1049

Bug Fixes

Fix category counts written out by the Categorify op #1128

Fix HugeCTR inference example #1130

Fix make_feature_column_workflow bug in Categorify if features have vocabularies of varying size. #1062

Fix TargetEncoding op on CPU only systems #976

Fix writing empty partitions to Parquet files #1097

Source code(tar.gz)
Source code(zip)
v0.6.1(Aug 11, 2021)
NVTabular v0.6.1

Bug Fixes

Fix installing package via pip #1030

Fix inference with groupby operator #1019

Install tqdm with conda package #1030

Fix workflow output_dtypes with empty partitions #1028

Source code(tar.gz)
Source code(zip)
v0.6.0(Aug 3, 2021)
NVTabular v0.6.0

Improvements

Add CPU support #534

Speed up inference on Triton Inference Server #744

Add support for session based recommenders #355

Add PyTorch Dataloader support for Sparse Tensors #500

Add ListSlice operator for truncating list columns #734

Categorical ids sorted by frequency #799

Add ability to select a subset of a ColumnGroup #809

Add option to use Rename op to give a single column a new fixed name #825

Add a 'map' function to KerasSequenceLoader, which enables sample weights #667

Add JoinExternal option on nvt.Dataset in addition to cudf #370

Allow passing ColumnGroup to get_embedding_sizes #732

Add ability to name LambdaOp and provide a better default name in graph visualizations #860

Bug Fixes

Fix make_feature_column_workflow for Categorical columns #763

Fix Categorify output dtypes for list columns #963

Fix inference for Outbrain example #669

Fix dask metadata after calling workflow.to_ddf() #852

Fix out of memory errors #896, #971

Fix normalize output when stdev is zero #993

Fix using UCX with a dask cluster on Merlin containers #872

Source code(tar.gz)
Source code(zip)
v0.5.3(May 26, 2021)
Bug Fixes

Fix Shuffling in Torch DataLoader #818

Fix "Unsupported type_id conversion" in triton inference for string columns #813

Fix HugeCTR inference backend Merlin#8

Source code(tar.gz)
Source code(zip)
v0.5.2(May 13, 2021)
Bug Fixes

Fix Movielens TF example running on 1080ti #792

Fix Multihot output from get_embedding_sizes #808

Fix accelerated training documentation #791

Source code(tar.gz)
Source code(zip)
v0.5.1(May 3, 2021)
Improvements

Update dependencies to use cudf 0.19

Removed conda from docker containers, leading to much smaller container sizes

Added CUDA 11.2 support

Added FastAI v2.3 support

Bug Fixes

Fix NVTabular preprocessing with HugeCTR inference

Source code(tar.gz)
Source code(zip)
v0.5.0(Apr 12, 2021)
Improvements

Adding Horovod integration to NVTabular's dataloaders, allowing you to use multiple GPU's to train TensorFlow and PyTorch models

Adding a Groupby operation for use with session based recommender models

Added ability to read and write datasets partitioned by a column

Add example notebooks for using Triton Inference Server with NVTabular

Restructure and simplify Criteo example notebooks

Add support for PyTorch inference with Triton Inference Server

Bug Fixes

Fix bug with preprocessing categorical columns with NVTabular not working with HugeCTR and Triton Inference Server #707

Source code(tar.gz)
Source code(zip)
v0.4.0(Mar 9, 2021)
Breaking Changes

The API for NVTabular has been signficantly refactored, and existing code targetting the 0.3 API will need to be updated. Workflows are now represented as graphs of operations, and applied using a sklearn 'transformers' style api. Read more by checking out the examples

Improvements

Triton integration support for NVTabular with TensorFlow and HugeCTR models

Recommended cloud configuration and support for AWS and GCP

Reorganized examples and documentation

Unified Docker containers for Merlin components (NVTabular, HugeCTR and Triton)

Dataset analysis and generation tools

Source code(tar.gz)
Source code(zip)
v0.3.0(Mar 9, 2021)

Source code(tar.gz)
Source code(zip)