NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

Overview


NVTabular is a feature engineering and preprocessing library for tabular data that is designed to easily manipulate terabyte-scale datasets used to train deep learning (DL) based recommender systems. It provides a high-level abstraction that simplifies code and accelerates computation on the GPU using the RAPIDS Dask-cuDF library. NVTabular is interoperable with both PyTorch and TensorFlow through dataloaders developed as extensions of native framework code. In our experiments, these highly optimized dataloaders sped up existing TensorFlow pipelines by nine times and existing PyTorch pipelines by five times.

NVTabular is a component of NVIDIA Merlin Open Beta, a framework for building large-scale recommender systems. As part of the Merlin ecosystem, NVTabular works with the other Merlin components, including HugeCTR and Triton Inference Server, to provide end-to-end acceleration of recommender systems on the GPU. Extending beyond model training, with NVIDIA’s Triton Inference Server, the feature engineering and preprocessing steps performed on the data during training can be automatically applied to incoming data during inference.

Benefits

When training DL recommender systems, data scientists and machine learning (ML) engineers have been faced with the following challenges:

  • Huge Datasets: Commercial recommenders are trained on huge datasets that may be several terabytes in scale.
  • Complex Data Feature Engineering and Preprocessing Pipelines: Datasets need to be preprocessed and transformed so that they can be used with DL models and frameworks. In addition, feature engineering creates an extensive set of new features from existing ones, requiring multiple iterations to arrive at an optimal solution.
  • Input Bottleneck: Data loading, if not well optimized, can be the slowest part of the training process, leading to under-utilization of high-throughput computing devices such as GPUs.
  • Extensive Repeated Experimentation: The entire data engineering, training, and evaluation process can be repetitious and time consuming, requiring significant computational resources.

NVTabular alleviates these challenges and helps data scientists and ML engineers:

  • process datasets that exceed GPU and CPU memory without having to worry about scale.
  • use optimized dataloaders to accelerate training with TensorFlow, PyTorch, and HugeCTR.
  • focus on what to do with the data and not how to do it by using abstraction at the operation level (see the sketch after this list).
  • prepare datasets quickly and easily for experimentation so that more models can be trained.
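As a sketch of that operation-level abstraction, here is a minimal workflow built from ops that appear elsewhere on this page (Categorify, Normalize); the column names and file paths are hypothetical:

import nvtabular as nvt
from nvtabular import ops

# Declare WHAT to do with the data; NVTabular decides HOW to execute it
# on the GPU, out of core when the data exceeds device memory.
cat_features = ["user_id", "item_id"] >> ops.Categorify()
cont_features = ["price"] >> ops.Normalize()
workflow = nvt.Workflow(cat_features + cont_features)

dataset = nvt.Dataset("data/*.parquet", engine="parquet")  # hypothetical path
workflow.fit(dataset)                                 # compute statistics
workflow.transform(dataset).to_parquet("processed/")  # apply and write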

NVTabular provides faster iteration on massive tabular datasets during experimentation and training. It also helps MLOps engineers deploy models into production by providing faster dataset transformation, which makes it easy to train production models more frequently and keep them up to date, improving responsiveness and model performance.

To learn more about NVTabular's core features, see the NVTabular documentation.

Performance

When running NVTabular on the Criteo 1TB Click Logs dataset using a single V100 32GB GPU, feature engineering and preprocessing completed in 13 minutes. On a DGX-1 cluster with eight V100 GPUs, the same feature engineering and preprocessing completed within three minutes. Combined with HugeCTR, the dataset can be processed and a full model trained in only six minutes.

The performance of the Criteo DLRM workflow also demonstrates the effectiveness of the NVTabular library. The original ETL script, written in NumPy, took over five days to complete; combined with CPU training, the total iteration time was over one week. By optimizing the ETL code in Spark and running on a DGX-1 equivalent cluster, the time to complete feature engineering and preprocessing was reduced to three hours, while training completed in one hour.

Installation

Prior to installing NVTabular, ensure that you meet the following prerequisites:

  • CUDA version 10.1+
  • Python version 3.7+
  • NVIDIA Pascal GPU or later

NOTE: NVTabular will only run on Linux. Other operating systems are not currently supported.

Installing NVTabular Using Conda

NVTabular can be installed with Anaconda from the nvidia channel by running the following command:

conda install -c nvidia -c rapidsai -c numba -c conda-forge nvtabular python=3.7 cudatoolkit=11.0

If you'd like to create a full conda environment to run the example notebooks, do the following:

  1. Use the environment file provided for your CUDA Toolkit version (11.0 or 11.2).
  2. Clone the NVTabular repo and run the following commands from the root directory:
    conda env create -f=conda/environments/nvtabular_dev_cuda11.2.yml
    conda activate nvtabular_dev_11.2
    python -m ipykernel install --user --name=nvt
    pip install -e .
    jupyter notebook
    
    When opening a notebook, be sure to select nvt from the Kernel->Change Kernel menu.

Installing NVTabular with Docker

NVTabular Docker containers are available in the NVIDIA Merlin container repository. There are four different containers:

  • merlin-inference (https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-inference): NVTabular, HugeCTR, and Triton Inference Server
  • merlin-training (https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-training): NVTabular and HugeCTR
  • merlin-tensorflow-training (https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-tensorflow-training): NVTabular, TensorFlow, and the HugeCTR TensorFlow Embedding plugin
  • merlin-pytorch-training (https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-pytorch-training): NVTabular and PyTorch

To use these Docker containers, you'll first need to install the NVIDIA Container Toolkit to provide GPU support for Docker. You can use the NGC links referenced above to obtain more information about how to launch and run these containers. For details about the software and model versions that NVTabular supports per container, see the Support Matrix.
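As an illustration, a container can be launched with a command along the following lines (mirroring the docker run example that appears further down this page; the image tag here is illustrative, so check NGC for current tags):

docker run --runtime=nvidia --rm -it -p 8888:8888 --ipc=host nvcr.io/nvidia/merlin/merlin-pytorch-training:22.03 /bin/bash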

Notebook Examples and Tutorials

We provide a collection of examples, use cases, and tutorials as Jupyter notebooks that demonstrate how to use the following datasets:

  • MovieLens25M
  • Outbrain Click Prediction
  • Criteo 1TB Click Logs
  • RecSys2020 Competition Hosted by Twitter
  • Rossmann Sales Prediction

Each Jupyter notebook covers the following:

  • Feature engineering and preprocessing with NVTabular
  • Advanced workflows with NVTabular
  • Accelerated dataloaders for TensorFlow and PyTorch (see the sketch after this list)
  • Scaling to multi-GPU and multi-node systems
  • Integrating NVTabular with HugeCTR
  • Deploying to inference with Triton
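As a small sketch of the accelerated dataloaders, the following uses the TensorFlow loader import path from NVTabular's example notebooks; the column names and dataset path are hypothetical, and exact keyword arguments may vary by version:

from nvtabular.loader.tensorflow import KerasSequenceLoader

train_loader = KerasSequenceLoader(
    "processed/*.parquet",        # hypothetical preprocessed dataset
    batch_size=65536,
    label_names=["label"],
    cat_names=["user_id", "item_id"],
    cont_names=["price"],
    engine="parquet",
    shuffle=True,
)
# model.fit(train_loader, epochs=1)  # drop-in replacement for a tf.data pipeline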

Feedback and Support

If you'd like to contribute to the library directly, see Contributing.md. We're particularly interested in contributions and feature requests for our feature engineering and preprocessing operations. To help us further advance the Merlin roadmap, we encourage you to share the details of your recommender system pipeline in this survey.

If you're interested in learning more about how NVTabular works, see our NVTabular documentation. We also have API documentation that outlines the specifics of the available calls within the library.

Comments
  • Adding ops for feature column functionality and feature column to workflow mapping function


    Increasing NVTabular compatibility with the TensorFlow feature column API by adding the remaining necessary ops (cross op and bucketize) and a function which can map from a set of feature columns to an NVTabular workflow that performs all analogous preprocessing. Addresses #371

    HashedCross doesn't support multi-hot yet, and I'm not sure that extending it will necessarily be easy. For reference, the TF cross op handles multi-hots by doing a Cartesian product of all indices for each feature. See the documentation here.

    Still need to add bucketized support and test everything.
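    For illustration, here is a rough sketch of how these ops might be used under the graph API shown elsewhere on this page; the column names are hypothetical and the exact op signatures may differ from what was merged:

    import nvtabular as nvt
    from nvtabular import ops

    # Bucketize a continuous column and hash-cross two categorical columns,
    # mirroring TF's bucketized_column and crossed_column feature columns.
    bucketized = ["age"] >> ops.Bucketize({"age": [18, 30, 50, 65]})
    crossed = ["lat_bucket", "lon_bucket"] >> ops.HashedCross(num_buckets=1000)
    workflow = nvt.Workflow(bucketized + crossed)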

    opened by alecgunny 40
  • Data inspect


    Opening because #510 was closed after new_api branch was merged into main.

    @jperez999 I applied the feedback you gave me, and I also added a unit test. I have seen your data generation branch; I can follow your packaging style when you get that PR merged.

    Pending for the future:

    1. Fix dask-cudf dtypes for lists: We need cudf support; there is an issue created (https://github.com/rapidsai/cudf/issues/7157)
    2. When lists are supported, test them in the unit tests. @benfred can I modify the testing dataset to add one column that is a list?
    opened by albert17 39
  • [REVIEW] Adding TensorFlow example


    Building an end-to-end TensorFlow Criteo example to demonstrate how to use NVTabular as a data loader and online preprocessor for TensorFlow. The end version will probably restructure the existing Criteo example to make its preprocessing notebook callable.

    opened by alecgunny 38
  • Asv test support


    These changes add support for benchmarking, leveraging the Air Speed Velocity (asv) framework to display benchmarks over time. This will be extremely helpful in catching performance regressions while also ensuring any new changes do not adversely affect our selected benchmarks.

    opened by jperez999 37
  • data generator phase 1


    Builds a dataset from user-supplied parameters: the number of continuous columns, the number of categorical columns, a cardinality list, and a distribution type. Also adds distribution verification capability to the generator.

    opened by jperez999 35
  • [REVIEW] NVTabular+TF training and inference integration test


    Even though we run notebooks as part of our integration tests, we do not compare the output of Triton inference with the NVTabular transform() and TF predict() using the same data. This PR addresses the need for an integration test that does that.

    We'll be able to test the C++ backend with these tests as well.

    We might wanna add or run this test from the C++ backend repo.

    opened by oyilmaz-nvidia 34
  • Handle hive-partitioning in NVTabular.dataset.Dataset


    Closes #642. Addresses the global shuffle component of #641.

    The purpose of this PR is to improve handling of hive-partitioned parquet data in NVTabular. Since the Dataset API already uses dask.dataframe.read_parquet, there is currently no "correctness" issue with reading hive-partitioned data. However, (1) there is no convenient mechanism to write hive-partitioned data, and (2) the read stage typically results in many small partitions (rather than a single partition for each input directory).

    • Solution to (1): The Dataset.to_parquet method now supports a partition_on= argument. This is designed to match the same option in dask.dataframe/dask_cudf. If the user passes a list of 1+ columns with this argument, the output data will be shuffled at IO time into a distinct directory for each unique combination of those partition_on column values. When multiple columns are used for partitioning (e.g. ["month", "day"]), the directory structure is nested (so that the full path for an output file will look something like "/month=Mar/day=30/part.0.parquet").
    • Solution to (2): Since #641 will need a mechanism to ensure a unique mapping between specified column groups and ddf partitions, this PR adds a Dataset.partition_by_keys method to perform a global shuffle on the specified column group (keys) and return a new (shuffled) Dataset. For general Dataset objects, this method will simply call ddf.shuffle() under the hood. For Dataset objects that are backed by hive-partitioned data, however, we use the metadata stored in the file paths to avoid a full shuffle. In the future, this optimization can be pushed even further by directly aggregating all IO tasks within the same hive partition. However, I suspect that such an optimization should be implemented in dask.dataframe.

    Example Usage

    import pandas as pd
    import dask.dataframe as dd
    import dask
    import nvtabular as nvt
    
    path = "fake.data"
    
    # Create a sample ddf
    ddf = dask.datasets.timeseries(
        start="2000-01-01",
        end="2000-01-03",
        freq="600s",
        partition_freq="6h",
        seed=42,
    ).reset_index()
    ddf['timestamp'] = ddf['timestamp'].dt.round('D').dt.day
    
    # Convert to a Dataset and write out hive-partitioned data to disk
    keys = ["timestamp", "name"]
    nvt.Dataset(ddf).to_parquet(path, partition_on=keys)
    

    This will produce a directory structure like:

    $ find fake.data/ -type d -print
    fake.data/
    fake.data/timestamp=1
    fake.data/timestamp=1/name=Alice
    fake.data/timestamp=1/name=Frank
    fake.data/timestamp=1/name=Victor
    fake.data/timestamp=1/name=George
    fake.data/timestamp=1/name=Quinn
    fake.data/timestamp=1/name=Kevin
    fake.data/timestamp=1/name=Ursula
    ...
    

    Then, you can read the data back in with NVT, and ensure that the ddf partitions are shuffled by keys:

    ds = nvt.Dataset(path, engine="parquet").shuffle_by_keys(keys)
    ds.to_ddf().compute()
    
          id         x         y timestamp    name
    0    991 -0.750009 -0.587392         1   Alice
    1   1022  0.866823 -0.682096         1   Alice
    2    991  0.467775  0.683211         1   Alice
    3    967  0.534984 -0.931405         1     Bob
    4    991 -0.149716 -0.651939         1     Bob
    ..   ...       ...       ...       ...     ...
    25   964  0.843602  0.598580         3  Yvonne
    26   961  0.853070 -0.987596         3  Yvonne
    27   947  0.934162  0.190069         3  Yvonne
    28  1024 -0.107280  0.662606         3  Yvonne
    29  1006  0.169090 -0.784889         3   Zelda
    
    [288 rows x 5 columns]
    
    opened by rjzamora 33
  • TF and PT dataloader list support


    Moving MH support from #365 into the backend, then adding TensorFlow support on top of it. Still needs tests, benchmarking, and embedding compatibility on the TF side, and I'm not sure how I feel about the naming conventions adopted (multi-hot columns get __values and __nnz appended to the end of their TF dict representation), but the functionality should be there.

    Note that while PT uses an offsets representation, TF uses nnz, since the RaggedTensor embedding layer will need that. Eventually we should migrate to offsets when we build a custom EmbeddingBag op, so this is controllable with a keyword.

    The other key functionality I've added is moving all the conversion from gdf to framework tensors into a make_tensor method on the dataloaders. This way, you can transform "arbitrary" dataframes (to the extent they match the schema used to initialize the dataloader) into framework tensors. The intended use case is if you have some test dataset which hasn't yet been transformed by the preprocessing workflow: you can iterate through it, apply the workflow, then send it into a framework for prediction in one fell swoop.
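    Purely as an illustration of the naming convention described above, here is what a toy batch with one multi-hot column might look like, and how a RaggedTensor can be rebuilt from the nnz representation (all names and values are made up):

    import tensorflow as tf

    # One multi-hot column ("genres") across three rows holding 2, 1, and 3 values.
    batch = {
        "userId": tf.constant([[10], [42], [7]]),
        "genres__values": tf.constant([1, 3, 2, 5, 6, 3]),
        "genres__nnz": tf.constant([2, 1, 3]),
    }

    # The nnz representation is what a RaggedTensor embedding layer consumes:
    ragged = tf.RaggedTensor.from_row_lengths(
        values=batch["genres__values"],
        row_lengths=tf.cast(batch["genres__nnz"], tf.int64),
    )
    print(ragged)  # <tf.RaggedTensor [[1, 3], [2], [5, 6, 3]]>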

    opened by alecgunny 33
  • [REVIEW] Adding file type option for HugeCTR output


    Adds an option in HugeCTR to choose whether the output should be in binary or parquet format.

    Creates a Base Class (Writer) to be used for Shuffles and HugeCTR.

    Modifications: workflow.py and io.py.

    opened by albert17 33
  • various fixes for diff issues


    This will refactor the workflow process for ingesting operators, including a simplified API and support for multiple operators of the same kind. Ordering will have priority, which allows chaining to follow a more user-friendly convention. Reductions to phases will still be conducted before the application phase. This also fixes the TensorFlow API 2 GPU memory usage util and adds a band-aid for the torch tensor convergence issue. #383 #377 #372

    opened by jperez999 32
  • [REVIEW] Implement Feature Request from #1077 on Left Padding


    This pull request is for resolving issue #1077 on implementing left padding for sparse sequential features.

    The changes made, or to be made, in this PR include the implementation of left padding in the torch and tensorflow dataloading modules, any needed changes to user-facing methods, unit tests for this change, and documentation updates related to this fix and to the dataloader modules I have been reading and working on.

    dataloader ops 
    opened by lesnikow 31
  • Replace `flake8` with `ruff` (which is equivalent but faster)


    ruff is a very fast Python linter written in Rust that covers the flake8 rules in addition to a bunch of other popular linters. This sets up a config for it that covers the existing flake8 config, and adds additional linters while generating minimal errors with the existing code base. Since pylint is much slower than ruff (and I suspect slower than flake8 too), this config executes the ruff checks first in order to fail fast.

    clean up chore 
    opened by karlhigley 2
  • Add support for serializing modules involved in LambdaOp execution by value


    These commits address issue #1737 in two ways:

    1. by allowing users to direct Workflow.save to serialize an explicit list of named modules by value, and
    2. by allowing users to direct Workflow.save to use a heuristic to automatically infer the external modules involved in a workflow.

    Taken together, these commits will make it easier to serialize workflows that are resilient to execution in environments without source files for all of the modules that were available when the workflow was created, e.g., Docker containers.

    The main contribution is the addition of a modules_byvalue parameter to Workflow.save. If this is passed an explicit list of modules, Workflow.save will direct cloudpickle to serialize these modules by value. If it is passed the string "auto", Workflow.save will employ a heuristic approach that will be useful for many real-world uses of LambdaOp.

    I believe that it would be harmless and useful to make "auto" the default but have not done so in this PR.

    For more background on some of the tradeoffs that would be involved in making the heuristic more precise, please see this blog post. I do not believe that many of the cases I identified in that post (explicit imports within functions, list comprehensions, etc.) are likely to present in realistic NVTabular LambdaOp client code, and thus took a relatively simple approach.
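    As a sketch of the proposed usage, reusing the identity.py example from the issue below (the modules_byvalue parameter comes from this PR's description and may not be available in released versions):

    import identity
    import nvtabular as nvt

    wf = nvt.Workflow(["col_a"] >> nvt.ops.LambdaOp(identity.identity))

    # Serialize the `identity` module by value, so the saved workflow no
    # longer requires identity.py to be importable at load time:
    wf.save("identity-workflow", modules_byvalue=[identity])

    # Or ask Workflow.save to infer the external modules heuristically:
    wf.save("identity-workflow", modules_byvalue="auto")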

    bug 
    opened by willb 3
  • [BUG] LambdaOp may fail to deserialize if the processing function is declared in another module


    Describe the bug

    By default, Cloudpickle serializes functions and modules by reference. Deserializing workflows that contain user functions declared in other modules can fail if the Python source files for those modules are not available. This means that serializing a workflow and publishing it in a Docker container requires either (1) that all user source files are published in the same paths in the container as they were when the workflow was serialized or (2) that user modules are not serialized by reference.

    Steps/Code to reproduce bug

    Assume that identity.py contains the identity function, called identity.

    with open("identity.py", "w") as of:
        of.write("""
    def identity(col):
        return col
    
    """)
    

    Then run this code:

    import nvtabular as nvt
    import identity
    
    wf = nvt.Workflow(
        ["col_a"] >> nvt.ops.LambdaOp(identity.identity)
    )
    
    wf.save("identity-workflow")
    

    This serialized workflow will fail to deserialize if identity.py is not available in the same location, e.g., in a Docker container or on another host.

    import sys
    import os
    
    del sys.modules["identity"]
    os.unlink("identity.py")
    
    try:
        wf_ref = nvt.Workflow.load("identity-workflow")
    except ModuleNotFoundError as mnfe:
        print("Failed to load workflow")
        print(str(mnfe))
    
    

    Expected behavior

    Serialized workflows should be largely self-contained and resilient to missing source files containing user transformation code.

    Environment details (please complete the following information):

    • Environment location: Docker
    • Method of NVTabular install: Docker
      • If method of install is [Docker], provide docker pull & docker run commands used:

    docker run -v /home/willb/devel/nvtabular-sandbox:/workspace/host --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8787 -p 8796:8786 --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-training:22.03 /bin/bash

    Additional context

    This notebook demonstrates the minimal reproducer as well as a workaround.

    I expect to submit a patch shortly that will enhance Workflow.save in two ways:

    1. to allow clients to provide explicit direction that certain modules should be serialized by value, and
    2. to allow clients to request optional automatic identification of such modules.
    bug P1 
    opened by willb 0
  • Fix max_size behavior in Categorify


    Closes https://github.com/NVIDIA-Merlin/NVTabular/issues/1735

    Changes both combo- and joint-encoding logic to increase the null-value count to include any categories that will be dropped to satisfy the max_size argument.

    Warning: This change means the 0th row in unique.*.parquet will no longer correspond to the actual null-value count, but to the number of values in the dataset that would be mapped to the "null category" at transformation time (including both null and low-frequency values). cc @rnyak - Please confirm that this is the expected behavior.

    ops 
    opened by rjzamora 2
  • [BUG] Categorify does not set non-frequent item mapping size in unique parquet file when `max_size` arg is set


    Describe the bug

    When we set the max_size arg in the Categorify() op, we intend to apply frequency capping, i.e., non-frequent items are mapped to 0 based on the max_size value that is set. However, although Categorify works fine and encodes the frequent and non-frequent items properly, the unique_...parquet file does not set the size of non-frequent items correctly. This is an important issue since users rely on these unique...parquet files and do further processing with them. One example is calculating the item frequencies from the unique.item_id.parquet file to do logits correction in Merlin Models.

    Steps/Code to reproduce bug

    Please run the following code to reproduce the issue:

    import nvtabular as nvt
    from merlin.core import dispatch
    from merlin.core.dispatch import make_df
    from nvtabular import ColumnSelector, ops
    import pandas as pd
    
    df = dispatch.make_df(
        {
            "item_id": [
                "A",
                "E",
                "B",
                "C",
                "A",
                "A",
                "B",
                "C",
                "D",
                "A",
                "B",
                "A",
                "B",
            ],
        }
    )
    
    cat_names = ["item_id"]
    dataset = nvt.Dataset(df)
    cat_features = cat_names >> ops.Categorify(max_size=3)
    processor = nvt.Workflow(cat_features)
    processor.fit(dataset)
    new_gdf = processor.transform(dataset).to_ddf().compute()
    pd.read_parquet('./categories/unique.item_id.parquet')
    
    	item_id	item_id_size
    0	<NA>	0
    1	A	5
    2	B	4
    

    Expected behavior

    The size value in the first row of unique.item_id.parquet should read 4 instead of 0, since after setting the max_size arg, we have 4 samples in the processed new_gdf that were mapped to 0.
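    As an illustrative check of where the expected value of 4 comes from, using the df defined above: with max_size=3, only the null slot plus the two most frequent items (A: 5 occurrences, B: 4) keep distinct ids, so every remaining occurrence should be counted in row 0.

    vc = df["item_id"].value_counts()
    print(vc)                      # A=5, B=4, C=2, D=1, E=1
    print(int(vc.iloc[2:].sum()))  # 2 + 1 + 1 = 4 -> expected size for row 0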

    Environment details (please complete the following information):

    I am using the merlin-tensorflow:22.11 image with the latest main branches of the Merlin libs pulled and installed.

    bug P0 
    opened by rnyak 0
Releases (latest: v1.8.0)
  • v1.8.0(Dec 30, 2022)

    What’s Changed

    📄 Documentation

    • Address virtual developer review feedback @mikemckiernan (#1724)

    🔧 Maintenance

    • remove test references that are no longer available @jperez999 (#1730)
    • remove integration tests for notebooks no longer available @jperez999 (#1729)
    • Use pre-commit for lint checks in GitHub Actions Workflow @oliverholworthy (#1723)
    • Remove echo from command in tox.ini @oliverholworthy (#1725)
    • Migrate the legacy examples to the Merlin repo @karlhigley (#1711)
    • Handle data loader as an iterator @oliverholworthy (#1720)
    • Release draft fix @jperez999 (#1712)
    • Add Jenkinsfile @AyodeAwe (#1702)
    Source code(tar.gz)
    Source code(zip)
    nvtabular-1.8.0-cp38-cp38-linux_x86_64.whl(254.99 KB)
    nvtabular-1.8.0-cp39-cp39-linux_x86_64.whl(255.37 KB)
    nvtabular-1.8.0.tar.gz(123.24 KB)
  • v1.7.0(Nov 23, 2022)

    What’s Changed

    🐜 Bug Fixes

    • fix tox to use correct branch in release tags @jperez999 (#1710)
    • Remove min value count from properties when using sparse_max @oliverholworthy (#1705)
    • Update metrics keys in example notebook tests @karlhigley (#1703)
    • Fix first/last groupby aggregation on list columns @rjzamora (#1693)

    📄 Documentation

    • Update metrics keys in example notebook tests @karlhigley (#1703)
    • docs: Add semver to calver banner @mikemckiernan (#1699)
    • docs: Add basic SEO configuration @mikemckiernan (#1697)

    🔧 Maintenance

    • fix tox to use correct branch in release tags @jperez999 (#1710)
    • Upload binary wheels for nvtabular @benfred (#1696)
    • Use merlin-dataloader package @benfred (#1694)
    Source code(tar.gz)
    Source code(zip)
    nvtabular-1.7.0-cp38-cp38-linux_x86_64.whl(254.79 KB)
    nvtabular-1.7.0-cp39-cp39-linux_x86_64.whl(254.63 KB)
    nvtabular-1.7.0.tar.gz(123.35 KB)
  • v1.6.0(Oct 31, 2022)

    What’s Changed

    🐜 Bug Fixes

    • Fix first/last groupby aggregation on list columns @rjzamora (#1693)
    • Fix Categorify bug for combo encoding with null values @rjzamora (#1652)
    • Fix joint Categorify with list columns @rjzamora (#1685)

    📄 Documentation

    • update NVTabular examples @radekosmulski (#1633)
    • Remove examples Part 1 - Rossmann, RecSys2020, Outbrain @bschifferer (#1669)

    🔧 Maintenance

    • adding import or skip for tensorflow framework required by examples @jperez999 (#1691)
    Source code(tar.gz)
    Source code(zip)
  • v1.5.0(Sep 26, 2022)

    What’s Changed

    🐜 Bug Fixes

    • Use Merlin DAG executors from core in integration tests @jperez999 (#1677)
    • Fix target encoding tagging issue @bbozkaya (#1672)

    🔧 Maintenance

    • Remove stray file left over from Torch/Horovod multi-GPU example @karlhigley (#1674)
    • Use Merlin DAG executors from core in integration tests @jperez999 (#1677)
    • Remove poetry config @benfred (#1673)
    • chore: Add pybind11 as a tox requirement @mikemckiernan (#1675)
    • Switch to using the DAG executors from Merlin Core @karlhigley (#1666)
    • Use the latest version of Merlin Core from main in the tox test envs @karlhigley (#1671)
    • Set up tox environments for testing, linting, and building docs @karlhigley (#1667)
    Source code(tar.gz)
    Source code(zip)
    nvtabular-1.5.0.tar.gz(130.49 KB)
  • v1.4.0(Sep 6, 2022)

    What’s Changed

    ⚠ Breaking Changes

    • Remove FastAI notebooks @benfred (#1668)
    • Fix dl @jperez999 (#1661)
    • Replace cudf series ceil() with numpy ceil() @jperez999 (#1656)

    🐜 Bug Fixes

    • Fix integration tests that reached into Workflow's private methods @karlhigley (#1660)
    • Fix groupby on lists with cudf 22.06+ @benfred (#1654)
    • Update the Categorify operator to set the domain max correctly @oliverholworthy (#1641)
    • Test LambdaOp with dask workflows @benfred (#1634)

    🚀 Features

    • Add sum to supported aggregations in Groupby @radekosmulski (#1638)

    📄 Documentation

    • Remove using-feature-columns nb @rnyak (#1657)
    • Fix typos @benfred (#1655)

    🔧 Maintenance

    • Add optional requirement specifiers for GPU and dev requirements @karlhigley (#1664)
    • Add scipy as a dependency @karlhigley (#1663)
    • Fix dl @jperez999 (#1661)
    • Fix integration tests that reached into Workflow's private methods @karlhigley (#1660)
    • Update black/pylint/flake8,isort etc @benfred (#1659)
    • Remove using-feature-columns nb @rnyak (#1657)
    • Replace cudf series ceil() with numpy ceil() @jperez999 (#1656)
    • Extract Python and Dask Executor classes from Workflow @karlhigley (#1609)
    • Update versioneer from 0.19 to 0.23 @oliverholworthy (#1651)
    Source code(tar.gz)
    Source code(zip)
    nvtabular-1.4.0.tar.gz(132.41 KB)
  • v1.3.1(Jul 19, 2022)

  • v1.3.0(Jul 19, 2022)

    What’s Changed

    🐜 Bug Fixes

    • Don't install tests with nvtabular @benfred (#1608)
    • Groupby to no longer require groupby_cols in column selector @radekosmulski (#1598)
    • Adjust imports in the TritonPythonModel for Workflows @karlhigley (#1604)
    • column names can now include aggregations in ops.Groupby @radekosmulski (#1592)
    • Normalize Op using fp32 @benfred (#1597)
    • Cast warning to string in configure_tensorflow @leewyang (#1587)

    📄 Documentation

    • docs: Add TF compat info @mikemckiernan (#1528)

    🔧 Maintenance

    • Fix movielens notebook data path @jperez999 (#1622)
    • skip download step, that is not allowed in CI @jperez999 (#1620)
    • fix tritonserver gpu id & fixed timeout for criteo integration tests @jperez999 (#1619)
    • Remove unnecessary docs dependencies @mikemckiernan (#1617)
    • fix ci script for integration tests and added skip check @jperez999 (#1616)
    • Integration tests refactor @jperez999 (#1614)
    • Don't git pull origin main in integration tests, use container version @karlhigley (#1610)
    Source code(tar.gz)
    Source code(zip)
    nvtabular-1.3.0.tar.gz(129.52 KB)
  • v1.2.2(Jun 21, 2022)

  • v1.2.1(Jun 16, 2022)

  • v1.2.0(Jun 15, 2022)

    What’s Changed

    🐜 Bug Fixes

    • remove nvtabular triton backend that seg faults on termination. @jperez999 (#1576)
    • Fix LambdaOp example usage 1 @rnyak (#1561)

    📄 Documentation

    • Merlin offers three containers @mikemckiernan (#1581)
    • Fix dataloader docstring @benfred (#1573)
    • Improved docstrings of GroupBy op to reinforce the required usage of dataset.shuffle_by_keys() @gabrielspmoreira (#1551)
    • Remove old support matrix table, @benfred (#1560)
    • Update CONTRIBUTING to mention PR labels @mikemckiernan (#1554)
    • Update changelog to point to github releases @benfred (#1549)
    • Use common release-drafter workflow @mikemckiernan (#1548)

    🔧 Maintenance

    • Add a GA workflow that requires labels on PR's @benfred (#1579)
    • Use shared implementation of triage workflow @benfred (#1577)
    • Don't pull main on running NVT unittests @benfred (#1578)
    • Don't build model_config_pb2 @benfred (#1566)
    • Add conda builds to our github actions workflow @benfred (#1557)
    • Add release-drafter workflow for generating changelogs @benfred (#1540)
    • Remove message about integration tests missing @benfred (#1539)
    Source code(tar.gz)
    Source code(zip)
    nvtabular-1.2.0.tar.gz(209.74 KB)
  • v1.1.0(May 10, 2022)

    Known Issues

    • Error when sending request to Triton after loading a Transformers4Rec PyTorch model https://github.com/NVIDIA-Merlin/NVTabular/issues/1502

    What's Changed

    • Automate pushing package to pypi by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1505
    • docs: Add attention admonition to Merlin SMX by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1507
    • added category name to domain for column properties by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1508
    • Fix the embedding size lookup in Categorify op by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1511
    • Max auc by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1513
    • Fix inf container tag in getting started TF-inf nb and polish exp README by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1516
    • Fix for max-size categorify operator category ordering by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1519
    • Criteo HugeCTR Inference Configuration Fix by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1522
    • Add ascending param in the Groupby op by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1525
    • Remove os.environ["TF_MEMORY_ALLOCATION"] from getting-started 03-Training-with-TF nb to avoid OOM by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1527
    • Fix getting started 03-Training-with-HugeCTR.ipynb nb's training without printing out auc and loss metrics issue by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1532
    • reqs fixed by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1536
    • docs: Add ext-toc, switch to MyST-NB by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1529
    • remove horovod example, no longer supported by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1530

    Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v1.0.0...v1.1.0

    Source code(tar.gz)
    Source code(zip)
    nvtabular-1.1.0.tar.gz(234.34 KB)
  • v1.0.0(Apr 6, 2022)

    What's Changed

    • Assume 'merlin' is a first party package for isort by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1420
    • End-to-end inference POC migration to new ensemble API by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1391
    • Update test_integration.sh by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1422
    • update test_tf4rec.py by @radekosmulski in https://github.com/NVIDIA-Merlin/NVTabular/pull/1424
    • Fix lambda dtype issue in PyTorch Multi-GPU training example notebook by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1425
    • Prevent dataloaders from using GPU memory when CPU device is selected by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1429
    • Fix dtype bug with GroupBy operator when aggs is a string by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1430
    • Fix typo in example notebook by @L0Z1K in https://github.com/NVIDIA-Merlin/NVTabular/pull/1390
    • Extract Triton Ensemble DAG to merlin.systems package by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1426
    • Add TagAs and related wrapper classes by @radekosmulski in https://github.com/NVIDIA-Merlin/NVTabular/pull/1414
    • docs: Add preview doc build to PR by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1432
    • Docs script by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1433
    • docs: Ensure that parent review directory exists by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1434
    • Update reqs by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1406
    • Handle aiobotocore v2.0+ in test_s3 by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1439
    • Update to work with the latest merlin-core by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1441
    • Add intersphinx mappings for merlin.core by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1440
    • Updates Container tests by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1445
    • Asvdb fix for integration testing by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1413
    • remove setuptools by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1460
    • Update imports for classes that moved to merlin-core by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1447
    • Reactivate hugectr Criteo integration test by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1457
    • Wrapper for TagAs did not work by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1462
    • Set up automated docstring coverage checks by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1454
    • doc: Update matrix for 22.03 by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1450
    • Remove Systems library from nvtabular by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1456
    • Fix bug about criteo download notebook by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1453
    • Add deprecation warnings to modules that moved to core by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1466
    • Hard-code the Workflow output dtypes for HugeCTR in Triton by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1468
    • AWS SageMaker by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1421
    • Improve Workflow error about mismatched dtypes by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1465
    • Exclude additional directories and boost docstring coverage req to 35 percent by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1471
    • fix(docs): Restore the version picker by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1474
    • Documentation fixes from the docstring scrub by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1475
    • Add missing --user flag to natsort CI install by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1476
    • Change merlin level NVT import to transforms (from transform) by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1472
    • Move merlin.core.worker to merlin.io.worker by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1477
    • Fix merlin.core.worker imports by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1482
    • Use quieter DeprecationWarning instead of FutureWarning by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1486
    • Remove imports to deprecated modules by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1487
    • README updates by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1478
    • Add Troubleshoot for OOM errors with NVTabular dataloaders by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1373
    • Upgrade poetry dependencies by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1489
    • Note in the README that installing with pip runs only on CPU by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1494
    • Add deprecation warnings to loader, inference, framework_utils by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1492
    • Add merlin.transforms.ops sub-package by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1491
    • fix for 1455 by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1497
    • Restrict running on pandas 1.4.x by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1496
    • Fixing Criteo Inference for TensorFlow and HugeCTR by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1500
    • docs: Add a redirect page by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1499
    • Final updates for 1.0 release by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1501
    • update to compatible dtype by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1503

    New Contributors

    • @radekosmulski made their first contribution in https://github.com/NVIDIA-Merlin/NVTabular/pull/1424
    • @L0Z1K made their first contribution in https://github.com/NVIDIA-Merlin/NVTabular/pull/1390

    Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v0.11.0...v1.0.0

    Source code(tar.gz)
    Source code(zip)
  • v0.11.0(Mar 1, 2022)

    What's Changed

    • Docs: Update URL to Criteo notebook by @mikemckiernan in https://github.com/NVIDIA-Merlin/NVTabular/pull/1383
    • Update support_matrix.rst by @lgardenhire in https://github.com/NVIDIA-Merlin/NVTabular/pull/1375
    • Support min_val for categorical features in DataGen by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1369
    • Fix null_size logic in Categorify op by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1386
    • Fix CUDA version doc by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1387
    • Fixes tests utils imports by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1393
    • Exit integration by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1395
    • Fix lambdaop call by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1394
    • Add ReduceDtypeSize op by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1398
    • Fix remove_inputs usage in export_pytorch_ensemble by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1389
    • Param to send test results by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1405
    • Migrate io, graph, dispatch, worker, and utils to merlin.core by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1384
    • Import Distributed and Serial execution-manager utilities from merlin-core by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1380
    • Pin merlin-core to a specific commit to avoid breaking changes by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1409
    • Rename merlin.graph to merlin.dag by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1411
    • Add DropLowCardinality op by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1412
    • Update merlin-core to v0.1.1 (instead of main branch) by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1419

    New Contributors

    • @mikemckiernan made their first contribution in https://github.com/NVIDIA-Merlin/NVTabular/pull/1383

    Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v0.10.0...v0.11.0

    Source code(tar.gz)
    Source code(zip)
  • v0.10.0(Feb 2, 2022)

    What's Changed

    • schema metadata propagation by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1354
    • Create TagSet as a container that resolves conflicts between tags (like continuous and categorical) by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1360
    • Update support_matrix.rst by @lgardenhire in https://github.com/NVIDIA-Merlin/NVTabular/pull/1363
    • Raise an error when the actual dtype produced by an operator doesn't match the schema by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1362
    • Deprecate client from Dataset, Workflow, and DatasetInspector by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1318
    • fixes asv display to one metric per notebook and does not repeat metrics by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1366
    • Keras loader nvt dataset usage by default if available by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1374
    • Fixes hash_crossed with cudf 21.12 by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1376
    • Fixes tests by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1377
    • Support custom Python operators in the Triton operator/ensemble API by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1368
    • Use new fsspec.parquet module to accelerate reads from remote storage by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1241

    Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v0.9.0...v0.10.0

    Source code(tar.gz)
    Source code(zip)
  • v0.9.0(Jan 11, 2022)

    What's Changed

    • Workflow for adding issues to the backlog by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1305
    • Set the priority and date added fields for new issues. by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1308
    • Label issues not created by nvidia-merlin members by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1309
    • moved tf import to after tf config is completed by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1311
    • Fix Triton import for _convert_string2pytorch_dtype by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1312
    • Apply NVT graph API/DSL to building Triton ensembles by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1292
    • Fixes tests by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1326
    • Activates Blossom CI by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1324
    • Add a compute_input_schema method to operators by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1330
    • removed column_types.json from nvtabular by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1317
    • working refit as expected by user by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1338
    • Update support_matrix.rst by @lgardenhire in https://github.com/NVIDIA-Merlin/NVTabular/pull/1336
    • HugeCTR Multihot Training-Inference example by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1329
    • Triton setup via merlin graph api by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1339
    • removed parents selector logic in selector setter, by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1343
    • Switch to packaging.version.Version for version checks by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1345
    • fix for storage name bug in path creation by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1347
    • Fix multiGPU Pytorch MovieLens by @bschifferer in https://github.com/NVIDIA-Merlin/NVTabular/pull/1319
    • Update dead links in Documentation by @SimonCW in https://github.com/NVIDIA-Merlin/NVTabular/pull/1342
    • Fixes cudf 21.10 error by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1350
    • Fixes unit tests for containers by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1349
    • Create an explicit mapping between Operator input and output columns by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1348
    • Updates notebooks for cudf 21.10 by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1353
    • Revert notebook by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1355
    • Update conda packages to cudf >= 21.10 and add pynvml by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1356
    • Fix writing out workflows to S3 by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1357

    New Contributors

    • @SimonCW made their first contribution in https://github.com/NVIDIA-Merlin/NVTabular/pull/1342

    Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v0.8.0...v0.9.0

    Source code(tar.gz)
    Source code(zip)
  • v0.8.0(Dec 7, 2021)

    What's Changed

    • Allow writing workflows to cloud storage by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1232
    • Avoid copy of remote-data buffer in call to read_parquet by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1239
    • Update container references to merlin 21.11 by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1242
    • Fix numpy version in CI by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1255
    • Modularize the Triton inference model for NVT Workflows by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1252
    • Dl cpu by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1245
    • fixes for schema saving and writing by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1215
    • decouple io from schema by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1161
    • Remove non-exist Torch uint dtypes from Triton conversion utils by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1270
    • utf-8 when opening notebooks by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1271
    • Add 'pad' option for the ListSlice op by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1262
    • End-to-end Inference support for Transformers4Rec Tensorflow Models by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1256
    • fix lookup error on typo in tags for target by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1281
    • Fix resolution of tags to column names when executing Workflows by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1285
    • Extract all knowledge of Triton from the serving-time WorkflowRunners by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1257
    • Extract an abstract graph package from NVT Workflows by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1265
    • dataset duck typing for dataloader by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1272
    • Reduce device-memory footprint in Categorify fit by @rjzamora in https://github.com/NVIDIA-Merlin/NVTabular/pull/1259
    • Fixes for ListSlice operator with padding by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1288
    • Update support_matrix.rst by @lgardenhire in https://github.com/NVIDIA-Merlin/NVTabular/pull/1243
    • Fix notebook tests broken by recent graph refactoring by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1293
    • add init file for import support by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1300
    • add missing dependencies to poetry by @benfred in https://github.com/NVIDIA-Merlin/NVTabular/pull/1298
    • Fix inference issues for end-to-end TF example for Transformers4Rec by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1299
    • Uninstall NVT (removing versions from PyPI) before installing NVT in CI by @karlhigley in https://github.com/NVIDIA-Merlin/NVTabular/pull/1303
    • Updates integration tests by @albert17 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1294
    • fix train_test split by @rnyak in https://github.com/NVIDIA-Merlin/NVTabular/pull/1291
    • fix arbitrary output file number bug, shrink number of files and warn… by @jperez999 in https://github.com/NVIDIA-Merlin/NVTabular/pull/1301

    Full Changelog: https://github.com/NVIDIA-Merlin/NVTabular/compare/v0.7.1...v0.8.0

    Source code(tar.gz)
    Source code(zip)
  • v0.7.1(Nov 4, 2021)

    NVTabular v0.7.1 (2 November 2021)

    Improvements

    • Add LogOp support for list features #1153
    • Add Normalize operator support for list features #1154
    • Add DataLoader.epochs() method and Dataset.to_iter(epochs=) argument #1147
    • Add ValueCount operator for recording of multihot min and max list lengths #1171

    Bug Fixes

    • Fix Criteo inference #1198
    • Fix performance regressions in Criteo benchmark #1222
    • Fix error in JoinGroupby op #1167
    • Fix Filter/JoinExternal key error #1143
    • Fix LambdaOp transforming dependency values #1185
    • Fix reading parquet files with list columns from GCS #1155
    • Fix TargetEncoding with dependencies as the target #1165
    • Fix Categorify op to calculate unique count stats for Nulls #1159
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0(Sep 24, 2021)

    NVTabular v0.7.0

    Improvements

    • Add column tagging API #943
    • Export dataset schema when writing out datasets #948
    • Make dataloaders aware of schema #947
    • Standardize a Workflows representation of its output columns #372
    • Add multi-gpu training example using PyTorch Distributed #775
    • Speed up reading Parquet files from remote storage like GCS or S3 #1119
    • Add utility to convert TFRecord datasets to Parquet #1085
    • Add multihot support for PyTorch inference #719
    • Add options to reserve categorical indices in the Categorify() op #1074
    • Update notebooks to work with CPU only systems #960
    • Save output from Categorify op in a single table for HugeCTR #946
    • Add a keyset file for HugeCTR integration #1049

    Bug Fixes

    • Fix category counts written out by the Categorify op #1128
    • Fix HugeCTR inference example #1130
    • Fix make_feature_column_workflow bug in Categorify if features have vocabularies of varying size. #1062
    • Fix TargetEncoding op on CPU only systems #976
    • Fix writing empty partitions to Parquet files #1097
    Source code(tar.gz)
    Source code(zip)
  • v0.6.1(Aug 11, 2021)

  • v0.6.0(Aug 3, 2021)

    NVTabular v0.6.0

    Improvements

    • Add CPU support #534
    • Speed up inference on Triton Inference Server #744
    • Add support for session based recommenders #355
    • Add PyTorch Dataloader support for Sparse Tensors #500
    • Add ListSlice operator for truncating list columns #734
    • Categorical ids sorted by frequency #799
    • Add ability to select a subset of a ColumnGroup #809
    • Add option to use Rename op to give a single column a new fixed name #825
    • Add a 'map' function to KerasSequenceLoader, which enables sample weights #667
    • Add JoinExternal option on nvt.Dataset in addition to cudf #370
    • Allow passing ColumnGroup to get_embedding_sizes #732
    • Add ability to name LambdaOp and provide a better default name in graph visualizations #860

    Bug Fixes

    • Fix make_feature_column_workflow for Categorical columns #763
    • Fix Categorify output dtypes for list columns #963
    • Fix inference for Outbrain example #669
    • Fix dask metadata after calling workflow.to_ddf() #852
    • Fix out of memory errors #896, #971
    • Fix normalize output when stdev is zero #993
    • Fix using UCX with a dask cluster on Merlin containers #872
    Source code(tar.gz)
    Source code(zip)
  • v0.5.3(May 26, 2021)

  • v0.5.2(May 13, 2021)

  • v0.5.1(May 3, 2021)

    Improvements

    • Update dependencies to use cudf 0.19
    • Removed conda from docker containers, leading to much smaller container sizes
    • Added CUDA 11.2 support
    • Added FastAI v2.3 support

    Bug Fixes

    • Fix NVTabular preprocessing with HugeCTR inference
    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Apr 12, 2021)

    Improvements

    • Adding Horovod integration to NVTabular's dataloaders, allowing you to use multiple GPUs to train TensorFlow and PyTorch models
    • Adding a Groupby operation for use with session based recommender models
    • Added ability to read and write datasets partitioned by a column
    • Add example notebooks for using Triton Inference Server with NVTabular
    • Restructure and simplify Criteo example notebooks
    • Add support for PyTorch inference with Triton Inference Server

    Bug Fixes

    • Fix bug with preprocessing categorical columns with NVTabular not working with HugeCTR and Triton Inference Server #707
    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Mar 9, 2021)

    Breaking Changes

    • The API for NVTabular has been significantly refactored, and existing code targeting the 0.3 API will need to be updated. Workflows are now represented as graphs of operations and applied using an sklearn 'transformers'-style API. Read more by checking out the examples.

    Improvements

    • Triton integration support for NVTabular with TensorFlow and HugeCTR models
    • Recommended cloud configuration and support for AWS and GCP
    • Reorganized examples and documentation
    • Unified Docker containers for Merlin components (NVTabular, HugeCTR and Triton)
    • Dataset analysis and generation tools
    Source code(tar.gz)
    Source code(zip)