Ludwig is a toolbox that allows you to train and evaluate deep learning models without the need to write code.

Ludwig

Last update: Dec 31, 2022


Overview

Translated in 🇰🇷 Korean.

Ludwig is a toolbox that allows users to train and test deep learning models without the need to write code. It is built on top of TensorFlow.

To train a model, all you need to provide is a file containing your data, a list of columns to use as inputs, and a list of columns to use as outputs; Ludwig will do the rest. Simple commands can be used to train models both locally and in a distributed way, and to use them to predict new data.

A programmatic API is also available to use Ludwig from Python. A suite of visualization tools allows you to analyze models' training and test performance and to compare them.

Ludwig is built with extensibility principles in mind and is based on datatype abstractions, making it easy to add support for new datatypes as well as new model architectures.

It can be used by practitioners to quickly train and test deep learning models as well as by researchers to obtain strong baselines to compare against and have an experimentation setting that ensures comparability by performing the same data processing and evaluation.

Ludwig provides a set of model architectures that can be combined together to create an end-to-end model for a given use case. As an analogy, if deep learning libraries provide the building blocks to make your building, Ludwig provides the buildings to make your city, and you can choose among the available buildings or add your own building to the set of available ones.

The core design principles baked into the toolbox are:

No coding required: no coding skills are required to train a model and use it for obtaining predictions.
Generality: a new datatype-based approach to deep learning model design makes the tool usable across many different use cases.
Flexibility: experienced users have extensive control over model building and training, while newcomers will find it easy to use.
Extensibility: easy to add new model architectures and new feature datatypes.
Understandability: deep learning model internals are often considered black boxes, but Ludwig provides standard visualizations to understand their performance and compare their predictions.
Open Source: Apache License 2.0

Ludwig is hosted by the Linux Foundation as part of the LF AI & Data Foundation. For details about who's involved and how Ludwig fits into the larger open source AI landscape, read the Linux Foundation announcement.

Installation

Ludwig requires you to use Python 3.6+. If you don’t have Python 3 installed, install it by running:

sudo apt install python3  # on ubuntu
brew install python3      # on mac

You may want to use a virtual environment to maintain an isolated Python environment.

virtualenv -p python3 venv

In order to install Ludwig just run:

pip install ludwig

This will install only Ludwig's basic requirements; different feature types require different dependencies. They are split into separate extras so that users can install only the ones they actually need:

ludwig[text] for text dependencies.
ludwig[audio] for audio and speech dependencies.
ludwig[image] for image dependencies.
ludwig[hyperopt] for hyperparameter optimization dependencies.
ludwig[horovod] for distributed training dependencies.
ludwig[serve] for serving dependencies.
ludwig[viz] for visualization dependencies.
ludwig[test] for dependencies needed for testing.

Distributed training is supported with Horovod, which can be installed with pip install ludwig[horovod] or HOROVOD_GPU_OPERATIONS=NCCL pip install ludwig[horovod] for GPU support. See Horovod's installation guide for full details on available installation options.

Any combination of extra packages can be installed at the same time with pip install ludwig[extra1,extra2,...], for instance pip install ludwig[text,viz]. The full set of dependencies can be installed with pip install ludwig[full].

For developers who wish to build the source code from the repository:

git clone git@github.com:ludwig-ai/ludwig.git
cd ludwig
virtualenv -p python3 venv
source venv/bin/activate
pip install -e '.[test]'

Note that if you are running without GPUs, you may wish to use the CPU-only version of TensorFlow, which takes up much less space on disk. To use a CPU-only TensorFlow version, uninstall tensorflow and replace it with tensorflow-cpu after having installed ludwig. Be sure to install a version within the compatible range as shown in requirements.txt.

Basic Principles

Ludwig provides three main functionalities: training models, using them to predict, and evaluating them. It is based on datatype abstraction, so the same data preprocessing and postprocessing is performed on different datasets that share datatypes, and the same encoding and decoding models can be re-used across several tasks.

Training a model in Ludwig is pretty straightforward: you provide a dataset file and a config YAML file.

The config contains a list of input features and a list of output features. All you have to do is specify the names of the columns in the dataset that are inputs to your model, along with their datatypes, and the names of the columns that will be outputs, the target variables which the model will learn to predict. Ludwig will compose a deep learning model accordingly and train it for you.

Currently, the available datatypes in Ludwig are:

binary
numerical
category
set
bag
sequence
text
timeseries
image
audio
date
h3
vector

By choosing different datatypes for inputs and outputs, users can solve many different tasks, for instance:

text input + category output = text classifier
image input + category output = image classifier
image input + text output = image captioning
audio input + binary output = speaker verification
text input + sequence output = named entity recognition / summarization
category, numerical and binary inputs + numerical output = regression
timeseries input + numerical output = forecasting model
category, numerical and binary inputs + binary output = fraud detection

Take a look at the Examples to see how you can use Ludwig for several more tasks.

The config can contain additional information, in particular how to preprocess each column in the data, which encoder and decoder to use for each one, architectural and training parameters, hyperparameters to optimize. This allows ease of use for novices and flexibility for experts.
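As a rough illustration of how datatype choices translate into a config, the sketch below builds the input_features / output_features structure as a Python dictionary and rejects datatypes that are not in the list above. The helper name build_config and the column names in the regression example (city, area, has_garden, price) are hypothetical and are not part of Ludwig; doc_text and class are the columns used in the Training example.

```python
# Datatypes listed in this README; anything else is rejected.
LUDWIG_DATATYPES = {
    "binary", "numerical", "category", "set", "bag", "sequence", "text",
    "timeseries", "image", "audio", "date", "h3", "vector",
}


def build_config(inputs, outputs, **extra_sections):
    """Build a Ludwig-style config dict from {column_name: datatype} maps.

    Hypothetical helper for illustration; it is not part of Ludwig.
    """
    def features(columns):
        result = []
        for name, datatype in columns.items():
            if datatype not in LUDWIG_DATATYPES:
                raise ValueError(f"unknown datatype {datatype!r} for column {name!r}")
            result.append({"name": name, "type": datatype})
        return result

    config = {
        "input_features": features(inputs),
        "output_features": features(outputs),
    }
    config.update(extra_sections)  # e.g. training={"epochs": 50}
    return config


# text input + category output = text classifier
text_classifier = build_config({"doc_text": "text"}, {"class": "category"})

# category, numerical and binary inputs + numerical output = regression
regressor = build_config(
    {"city": "category", "area": "numerical", "has_garden": "binary"},
    {"price": "numerical"},
    training={"epochs": 50},
)
print(text_classifier)
print(regressor)
```

Per-feature options such as an encoder or preprocessing settings would be added to the individual feature dictionaries; the User Guide is the reference for which keys are accepted.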

Training

For example, given a text classification dataset like the following:

doc_text	class
Former president Barack Obama ...	politics
Juventus hired Cristiano Ronaldo ...	sport
LeBron James joins the Lakers ...	sport
...	...

you want to learn a model that uses the content of the doc_text column as input to predict the values in the class column. You can use the following config:

{input_features: [{name: doc_text, type: text}], output_features: [{name: class, type: category}]}

and start the training by typing the following command in your console:

ludwig train --dataset path/to/file.csv --config "{input_features: [{name: doc_text, type: text}], output_features: [{name: class, type: category}]}"

where path/to/file.csv is the path to a UTF-8 encoded CSV file containing the dataset in the previous table (many other data formats are supported). Ludwig will:

Perform a random split of the data.
Preprocess the dataset.
Build a ParallelCNN model (the default for text features) that decodes output classes through a softmax classifier.
Train the model on the training set until the performance on the validation set stops improving.
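The first step in the list above, the random split, can be sketched in plain Python as follows. This is not Ludwig's implementation; the function name random_split and the seed are assumptions, and the 0.7 / 0.1 / 0.2 proportions are borrowed from the split_probabilities value that appears in the configuration dump further down this page.

```python
import random


def random_split(rows, probabilities=(0.7, 0.1, 0.2), seed=42):
    """Shuffle rows and cut them into train / validation / test lists."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("split probabilities must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    train_end = round(n * probabilities[0])
    validation_end = train_end + round(n * probabilities[1])
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],  # the test set takes whatever remains
    )


train, validation, test = random_split(range(100))
print(len(train), len(validation), len(test))
```

The validation portion is what the last step in the list relies on: training continues only while performance on it keeps improving.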

Training progress will be displayed in the console, but TensorBoard can also be used.

If you prefer to use an RNN encoder and increase the number of epochs to train for, all you have to do is to change the config to:

{input_features: [{name: doc_text, type: text, encoder: rnn}], output_features: [{name: class, type: category}], training: {epochs: 50}}

Refer to the User Guide to find out all the options available to you in the config and take a look at the Examples to see how you can use Ludwig for several different tasks.

After training, Ludwig will create a results directory containing the trained model with its hyperparameters and summary statistics of the training process. You can visualize them using one of the several visualization options available in the visualize tool, for instance:

ludwig visualize --visualization learning_curves --training_statistics path/to/training_statistics.json

This command will display a graph like the following, where you can see loss and accuracy during the training process:

Several more visualizations are available; please refer to Visualizations for more details.

Distributed Training

You can distribute the training of your models using Horovod, which allows training on a single machine with multiple GPUs as well as on multiple machines with multiple GPUs. Refer to the User Guide for full details.

Prediction and Evaluation

If you want your previously trained model to predict target output values on new data, you can type the following command in your console:

ludwig predict --dataset path/to/data.csv --model_path /path/to/model

Running this command will return model predictions.

If your dataset also contains ground truth values of the target outputs, you can compare them to the predictions obtained from the model to evaluate the model performance.

ludwig evaluate --dataset path/to/data.csv --model_path /path/to/model

This will produce evaluation performance statistics that can be visualized by the visualize tool, which can also be used to compare performances and predictions of different models, for instance:

ludwig visualize --visualization compare_performance --test_statistics path/to/test_statistics_model_1.json path/to/test_statistics_model_2.json

will return a bar plot comparing the models on different metrics:

A handy ludwig experiment command that performs training and prediction one after the other is also available.

Programmatic API

Ludwig also provides a simple programmatic API that allows you to train or load a model and use it to obtain predictions on new data:

from ludwig.api import LudwigModel

# train a model
config = {...}
model = LudwigModel(config)
train_stats = model.train(training_data)

# or load a model
model = LudwigModel.load(model_path)

# obtain predictions
predictions = model.predict(test_data)

Here, config is a dictionary containing the same information as the YAML file provided to the command line interface. More details are provided in the User Guide and in the API documentation.
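To make the correspondence between the dictionary and the command line concrete, the sketch below writes the RNN config from the Training section as a Python dictionary and serializes it into a string for the --config flag. It assumes the flag parses its argument as YAML (JSON output is, for practical purposes, valid YAML); that assumption has not been checked against every Ludwig version. Only the standard library is used, so nothing here calls Ludwig itself.

```python
import json
import shlex

# Same content as the inline YAML config shown in the Training section.
config = {
    "input_features": [{"name": "doc_text", "type": "text", "encoder": "rnn"}],
    "output_features": [{"name": "class", "type": "category"}],
    "training": {"epochs": 50},
}

# Serialize the dictionary; the result can stand in for the inline YAML string.
config_string = json.dumps(config)

# Assemble the command as a list of arguments, then quote it for a shell.
command = ["ludwig", "train", "--dataset", "path/to/file.csv", "--config", config_string]
print(shlex.join(command))
```

The same dictionary can be passed directly to LudwigModel(config) as in the snippet above, which avoids the quoting step altogether.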

Extensibility

Ludwig is built from the ground up with extensibility in mind. It is easy to add an additional datatype that is not currently supported by adding a datatype-specific implementation of abstract classes that contain functions to preprocess the data, encode it, and decode it.

Furthermore, new models, with their own specific hyperparameters, can be easily added by implementing a class that accepts tensors (of a specific rank, depending on the datatype) as inputs and provides tensors as output. This encourages reuse and sharing new models with the community. Refer to the Developer Guide for further details.

Full documentation

You can find the full documentation here.

License

Comments

TF2 porting: Sequence generator decoder

Start of the sequence feature's generator decoder. This is not a complete implementation. I wanted to give you an early look at the approach I'm taking.

Instead of retrofitting the TF1 code, I'm using the approach described in TFA and demonstrated in these examples: example one and example two

I'm just restating the obvious to establish my starting point. Ludwig supports one of two tensor shapes for Decoder input

shape [b, h] w/o attention

shape [b, s, h] with attention

The expected decoder output: shape [b, s, c]
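For reference, the shape contract above can be written down as a small checker. This is a hypothetical helper for reasoning about shapes, not code from the PR; in particular, sequence_length is passed in explicitly because how the decoder should determine it is left open here.

```python
def decoder_output_shape(input_shape, sequence_length, num_classes):
    """Return the expected [b, s, c] output shape for a decoder input shape.

    Hypothetical helper; it only encodes the shape contract described above.
    """
    if len(input_shape) == 2:        # [b, h]: without attention
        batch_size, _hidden = input_shape
    elif len(input_shape) == 3:      # [b, s, h]: with attention
        batch_size, _encoder_steps, _hidden = input_shape
    else:
        raise ValueError(f"expected a rank 2 or rank 3 input, got shape {input_shape}")
    return (batch_size, sequence_length, num_classes)


print(decoder_output_shape((8, 64), sequence_length=10, num_classes=5))
print(decoder_output_shape((8, 7, 64), sequence_length=10, num_classes=5))
```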

One question: it is not clear to me how to determine the sequence length for the decoder's output tensor. Right now I'm just using num_classes as a placeholder so that the decoder generates a rank 3 tensor.

State of the implementation

Initial commit supports decoder input shape [b, h] and generates a tensor with shape [b, s, c], implemented the basic decoder.

No exceptions are generated in full_experiment(). The loss value decreases each epoch.

These warning messages appear during training. Same as with category feature.

WARNING:tensorflow:Gradients do not exist for variables ['embeddings:0', 'ecd/sequence_output_feature/fc_stack/fc_layer/dense/kernel:0', 'ecd/sequence_output_feature/fc_stack/fc_layer/dense/bias:0', 'ecd/sequence_output_feature/fc_stack/fc_layer_1/dense_1/kernel:0', 'ecd/sequence_output_feature/fc_stack/fc_layer_1/dense_1/bias:0', 'ecd/sequence_output_feature/fc_stack/fc_layer_2/dense_2/kernel:0', 'ecd/sequence_output_feature/fc_stack/fc_layer_2/dense_2/bias:0', 'ecd/sequence_output_feature/fc_stack/fc_layer_3/dense_3/kernel:0', 'ecd/sequence_output_feature/fc_stack/fc_layer_3/dense_3/bias:0', 'ecd/sequence_output_feature/fc_stack/fc_layer_4/dense_4/kernel:0', 'ecd/sequence_output_feature/fc_stack/fc_layer_4/dense_4/bias:0'] when minimizing the loss.

If interested here is the log file for an experiment run. sandbox_model_sequence_feature.txt

Next Steps

Add support for Attention
opened by jimthompson5802 153

TF2 porting: eager mode training and evaluation, numerical and binary features

Reopening PR for TF2 porting...

I'm hoping this posting provides some evidence of progress. With your last guidance, I was able to get a "minimal training loop" working with model subclassing and eager execution.

I wanted to offer you an early look at how I'm adapting Ludwig training to TF2 eager execution.

This commit demonstrates the minimal training loop for this model_definition:

input_features = [
    {'name': 'x1', 'type': 'numerical', 'preprocessing': {'normalization': 'zscore'}},
    {'name': 'x2', 'type': 'numerical', 'preprocessing': {'normalization': 'zscore'}},
    {'name': 'x3', 'type': 'numerical', 'preprocessing': {'normalization': 'zscore'}}
]
output_features = [
    {'name': 'y', 'type': 'numerical'}
]

model_definition = {
    'input_features': input_features,
    'output_features': output_features,
    'combiner': {
        'type': 'concat',  # 'concat',
        'num_fc_layers': 5,
        'fc_size': 64
    },
    'training': {'epochs': 100}
}
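The 'zscore' normalization requested for x1, x2 and x3 rescales a column to zero mean and unit standard deviation. A minimal sketch follows; whether Ludwig uses the population or the sample standard deviation is not stated here, so the population form is an assumption, and the sample values are made up.

```python
import statistics


def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


x1 = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, population std 2
print(zscore(x1))
```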

The main result of this "minimal training loop" is demonstrating:

Not blowing up while running the specified number of epochs
Creating the numpy arrays used for training from the input features and output features specified in model_definition
Reducing the loss function during training

Here is an excerpt of the log file for the minimal training loop:


Epoch   1

Epoch   1
Training: 100%|██████████| 6/6 [00:00<00:00, 50.40it/s]
Epoch 1, train Loss: 9732.708984375, : train metric 9743.146484375

Epoch   2

Epoch   2
Training: 100%|██████████| 6/6 [00:00<00:00, 77.71it/s]
Epoch 2, train Loss: 9711.203125, : train metric 9721.9404296875

Epoch   3

Epoch   3
Training: 100%|██████████| 6/6 [00:00<00:00, 80.07it/s]
Epoch 3, train Loss: 9678.2998046875, : train metric 9689.5517578125

<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>

Epoch  50

Epoch  50
Training: 100%|██████████| 6/6 [00:00<00:00, 80.76it/s]
Epoch 50, train Loss: 1540.60400390625, : train metric 1547.77294921875

Epoch  51

Epoch  51
Training: 100%|██████████| 6/6 [00:00<00:00, 75.30it/s]
Epoch 51, train Loss: 1510.5498046875, : train metric 1517.5792236328125

Epoch  52

Epoch  52
Training: 100%|██████████| 6/6 [00:00<00:00, 75.37it/s]
Epoch 52, train Loss: 1481.64404296875, : train metric 1488.5391845703125

<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Epoch  98

Epoch  98
Training: 100%|██████████| 6/6 [00:00<00:00, 78.33it/s]
Epoch 98, train Loss: 787.4824829101562, : train metric 791.1544189453125

Epoch  99

Epoch  99
Training: 100%|██████████| 6/6 [00:00<00:00, 75.26it/s]
Epoch 99, train Loss: 779.5379028320312, : train metric 783.1728515625

Epoch 100

Epoch 100
Training: 100%|██████████| 6/6 [00:00<00:00, 83.06it/s]
Epoch 100, train Loss: 771.7520141601562, : train metric 775.3506469726562

Here is the entire eager execution log file: proof-of-concept_log_tf2_eager_exec.txt.
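The pattern the log shows, a forward pass, a loss, a gradient step, and a loss that falls epoch over epoch, can be reproduced without any framework. The sketch below is not the PR's code: it fits a linear model with three numerical inputs and one numerical output by plain gradient descent on synthetic data, and every constant in it (weights, learning rate, sample count) is chosen for illustration only.

```python
import random

rng = random.Random(0)

# Synthetic data: y = 3*x1 - 2*x2 + 0.5*x3 + 1 (illustrative weights).
true_weights, true_bias = [3.0, -2.0, 0.5], 1.0
features = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(64)]
targets = [sum(w * x for w, x in zip(true_weights, row)) + true_bias for row in features]

weights, bias, learning_rate = [0.0, 0.0, 0.0], 0.0, 0.1
losses = []
for epoch in range(1, 101):
    # Forward pass: prediction errors and mean squared error.
    errors = [sum(w * x for w, x in zip(weights, row)) + bias - y
              for row, y in zip(features, targets)]
    loss = sum(e * e for e in errors) / len(errors)
    losses.append(loss)
    # Backward pass: gradients of the MSE with respect to weights and bias.
    grad_w = [2 * sum(e * row[j] for e, row in zip(errors, features)) / len(errors)
              for j in range(3)]
    grad_b = 2 * sum(errors) / len(errors)
    # Update step: plain gradient descent.
    weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b
    if epoch in (1, 50, 100):
        print(f"Epoch {epoch}, train loss: {loss:.6f}")
```

In the eager-execution version the gradient computation is delegated to the framework, but the structure of the loop is the same.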

Now come all the limitations and caveats for the current state of the code:

Training loss reduction is not as fast as in the TF1 implementation. Here is the TF1 log file for the same model: proof-of-concept_log_tf1.txt.
Commented out large sections of the code in the model.train() method. I envision re-enabling and modifying the commented code as work progresses.
Several items were hard-coded in this implementation to minimize the amount of change required just to demonstrate the "training loop". The hard-coded items are:
- The encoder structure is hard-coded in the model.call() method. This will change to reflect the encoders/decoders specified in the model_definition.
- The objective, loss and metric functions are hard-coded. Future work will build these from the model_definition.
One of the sections of code commented out creates these data structures. Without these structures, Ludwig processing after the training loop terminates abnormally. This will be fixed as the work progresses. Actually, I'm thinking that this is the next thing to fix with the simple model I'm using for testing. If I can get these enabled, then the rest of Ludwig processing "should work".
- progress_tracker.train_stats
- progress_tracker.vali_stats
- progress_tracker.test_stats

opened by jimthompson5802 92

TF2 porting: Enable early stopping + model save and load

Re-introduced the early stopping option. From what I can tell there is no existing unit test for early stopping, so I added such a test. This test used two values (3 and 5) for the early_stop option and confirmed that training stopped per the early stop specification.

pytest -v test_model_training_options.py
======================================================= test session starts ========================================================
platform linux -- Python 3.6.9, pytest-5.4.3, py-1.8.1, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /opt/project
plugins: pycharm-0.6.0, typeguard-2.8.0
collected 2 items

test_model_training_options.py::test_early_stopping[3] PASSED                                                                [ 50%]
test_model_training_options.py::test_early_stopping[5] PASSED                                                                [100%]

========================================================= warnings summary =========================================================
/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py:15
  /usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py:15: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp

-- Docs: https://docs.pytest.org/en/latest/warnings.html
================================================== 2 passed, 1 warning in 12.85s ===================================================

The test_model_training_options.py file can be used as a foundation for other unit tests around model training options, e.g. save or not save various logs, etc.

This PR enabled only the early stopping test. Nothing else was reenabled.

If this looks OK, I can start re-enabling other model training options.
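For readers unfamiliar with the option, the behaviour being tested can be sketched as a patience counter: training stops once a given number of consecutive epochs pass without a new best validation loss. The helper below is a hypothetical illustration under that assumption, not the code in this PR, and the loss history is made up.

```python
def epochs_run(validation_losses, early_stop):
    """Return how many epochs run before early stopping triggers.

    Training stops once `early_stop` consecutive epochs pass without a new
    best validation loss. Hypothetical helper, not Ludwig's implementation.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= early_stop:
                return epoch
    return len(validation_losses)


# The loss improves for four epochs, then plateaus.
history = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.66, 0.68, 0.70, 0.69]
print(epochs_run(history, early_stop=3))  # stops at epoch 7
print(epochs_run(history, early_stop=5))  # stops at epoch 9
```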

opened by jimthompson5802 71

TF2 porting: initial work
@w4nderlust This is a test of sending results of my work. Initially I tried pushing my changes directly to uber/ludwig but it failed because I did not have write permission to uber/ludwig.

I'm now trying to submit my changes through a PR from the tf2_porting branch in my fork jimthompson5802/ludwig. The target for this PR is the tf2_porting branch of uber/ludwig. If this works, then I can just add commits to this PR.

The Docker image I use for my development environment is built with the updated requirements.txt on the tf2_porting branch, which contains

tensorflow==2.1
tensorflow-addons

With this initial set of commits, train completes and some, not all, of the data are saved for TensorBoard. For my test, I'm using the Titanic Survivor example. Here is the log from training. tf2_sample_train_log.txt

Here is a screenshot of TensorBoard for the data that was collected.

Let me know what you think.
opened by jimthompson5802 54

TF2 porting: sequence feature

This is the start of porting sequence feature to TF2. There is still more to be done. Current state of code:

Placed TF2-related stub methods in sequence_feature.py. These will be filled in as work progresses.
Created classes SequencePassthroughEncoder and SequenceEmbedEncoder as subclasses of tf.keras.layers.Layer. Integrated existing functions to work with the Layer subclasses.

From a small test both encoders work. The test passed an input array of shape [2, 5] through the SequencePassthroughEncoder and SequenceEmbedEncoder. Results of the test:

Passthrough encoder with reduce_output=None resulted in an output tensor with shape [2, 5, 1]
Embed encoder with embedding_size=3, reduce_output='sum' resulted in an output tensor with shape [2,3]
Embed encoder with embedding_size=3, reduce_output='None' resulted in an output tensor with shape [2, 5, 3]

Does this look correct?
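As a framework-free cross-check of the three shapes reported above, the sketch below reproduces the passthrough, embed, and sum-reduce steps on nested Python lists. It is not the encoder code from this PR; the function names are hypothetical and the embedding table is filled with random values purely to have something to look up.

```python
import random


def shape(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def passthrough(batch):
    """[b, s] -> [b, s, 1]: wrap every value in its own length-1 vector."""
    return [[[float(token)] for token in sequence] for sequence in batch]


def embed(batch, table):
    """[b, s] -> [b, s, e]: look up each token id in the embedding table."""
    return [[table[token] for token in sequence] for sequence in batch]


def reduce_sum(embedded):
    """[b, s, e] -> [b, e]: sum over the sequence dimension."""
    return [[sum(column) for column in zip(*sequence)] for sequence in embedded]


rng = random.Random(0)
vocab_size, embedding_size = 10, 3
table = [[rng.uniform(-1, 1) for _ in range(embedding_size)] for _ in range(vocab_size)]
batch = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

print(shape(passthrough(batch)))               # (2, 5, 1)
print(shape(embed(batch, table)))              # (2, 5, 3)
print(shape(reduce_sum(embed(batch, table))))  # (2, 3)
```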

My next step is to get a working decoder and get full_experiment() to complete successfully for a simple model.

This is the simple test case and its output.

import numpy as np
import tensorflow as tf

from ludwig.models.modules.sequence_encoders import SequencePassthroughEncoder, \
    SequenceEmbedEncoder

# setup input
input = np.array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]])

# test Passthrough encoder
print("Sample SequencePassthroughEncoder")
spt = SequencePassthroughEncoder()
output = spt(input)
print('input shape:', input.shape,'\nvalues:\n', input)
print('\noutput shape:\n', output.shape,'\nvalues\n', output.numpy())


# test embed encoder
print("\n\nSample SequenceEmbedEncoder")
vocab = list('abcdefghij')
print('vocab:', len(vocab), vocab)
emb = SequenceEmbedEncoder(vocab, embedding_size=3)
output2 = emb(input)
print('input shape:', input.shape,'\nvalues:\n', input)
print('\noutput shape:\n', output2.shape,'\nvalues\n', output2.numpy())


# test embed encoder with reduce_output=None
print("\n\nSample SequenceEmbedEncoder, reduce_output=None")
vocab = list('abcdefghij')
print('vocab:', len(vocab), vocab)
emb2 = SequenceEmbedEncoder(vocab, embedding_size=3, reduce_output='None')
output3 = emb2(input)
print('input shape:', input.shape,'\nvalues:\n', input)
print('\noutput shape:\n', output3.shape,'\nvalues\n', output3.numpy())

Test output

2516c6abd7cc:python -u /opt/project/sandbox/tf2_port/sequence_encoder_tester.py
2020-04-10 03:41:27.645738: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-04-10 03:41:27.645874: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-04-10 03:41:27.645895: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Sample SequencePassthroughEncoder
2020-04-10 03:41:28.838864: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-04-10 03:41:28.838974: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-04-10 03:41:28.839016: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (2516c6abd7cc): /proc/driver/nvidia/version does not exist
2020-04-10 03:41:28.839430: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-10 03:41:28.845809: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2791435000 Hz
2020-04-10 03:41:28.846760: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x439f600 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-10 03:41:28.846828: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
input shape: (2, 5) 
values:
 [[0 1 2 3 4]
 [5 6 7 8 9]]

output shape:
 (2, 5, 1) 
values
 [[[0.]
  [1.]
  [2.]
  [3.]
  [4.]]

 [[5.]
  [6.]
  [7.]
  [8.]
  [9.]]]


Sample SequenceEmbedEncoder
vocab: 10 ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
input shape: (2, 5) 
values:
 [[0 1 2 3 4]
 [5 6 7 8 9]]

output shape:
 (2, 3) 
values
 [[ 0.43313265  2.5553155   0.4227686 ]
 [-2.2676423   0.80896974 -0.28734756]]


Sample SequenceEmbedEncoder, reduce_output=None
vocab: 10 ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
input shape: (2, 5) 
values:
 [[0 1 2 3 4]
 [5 6 7 8 9]]

output shape:
 (2, 5, 3) 
values
 [[[ 0.          0.         -0.        ]
  [-0.3263557  -0.9351046   0.81713986]
  [ 0.10236788 -0.8854079   0.736264  ]
  [ 0.6721873  -0.6389749  -0.22343826]
  [ 0.9717505   0.37574863 -0.77134395]]

 [[-0.29198527  0.8138087  -0.12796235]
  [-0.7267971   0.70457935  0.5321882 ]
  [ 0.8782997   0.01055479  0.969434  ]
  [ 0.89878535  0.36403346  0.67668414]
  [ 0.55935645 -0.59581757  0.71213484]]]

Process finished with exit code 0

opened by jimthompson5802 50

TF2 porting: category feature

Here is the start of converting the category feature to TF2 eager execution. More work is needed to complete.

The following have been implemented:

Setup category encoder and decoder

Adapt the category encoder to use current Ludwig Embed class

Custom softmax cross entropy loss function for training and evaluation

Custom softmax cross entropy metric function

At this point the training phase completes without error. As noted above, only the loss function has been implemented.
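For context, softmax cross entropy for a single example with an integer class label can be computed as the log-sum-exp of the logits minus the logit of the true class. The sketch below is a plain-Python illustration of that formula, not the custom loss added in this PR; the function names and sample logits are made up.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross entropy of softmax(logits) against one integer class label."""
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


def mean_loss(batch_logits, targets):
    """Average the per-example loss over a batch."""
    losses = [softmax_cross_entropy(logits, target)
              for logits, target in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


# Uniform logits over three classes give a loss of log(3).
print(softmax_cross_entropy([0.0, 0.0, 0.0], 1))
print(mean_loss([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]], [0, 2]))
```

Subtracting the maximum logit before exponentiating keeps the computation finite even for very large logits.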

The predict phase fails because of an incomplete implementation of the predictions() method and missing metric functions. The work to be done is similar to what I did with the binary feature.

Since this is the first time I've implemented an encoder and decoder, I'd appreciate it if you would take a look at how I implemented them.

I'm attaching the training data I'm using for testing. Included in the zip file is a log file from a test run. ludwig_category_feature.zip

Here is the model definition I'm using

python -m ludwig.experiment --data_csv data4/train.csv \
  --skip_save_processed_input \
  --model_definition "{input_features: [{name: x1, type: numerical, preprocessing: {normalization: minmax}}, {name: x2, type: numerical, preprocessing: {normalization: zscore}}, {name: x3, type: numerical, preprocessing: {normalization: zscore}}, {name: c1, type: category, embedding_size: 6}], combiner: {type: concat, num_fc_layers: 5, fc_size: 64}, output_features: [{name: y, type: category}], training: {epochs: 10}}"
opened by jimthompson5802 38
Initial pass at wandb Ludwig integration

This is an initial pass at a basic integration of wandb with Ludwig. It currently mirrors tensorboard events, stores hyperparameters, syncs artifacts, and stores evaluation results.

I'm using the python "black" code formatter which seems to be clashing with some of the existing code. Is there a preferred formatter to use?

opened by vanpelt 35
Mac M1 Support
Hi, I'm wondering if Ludwig currently has support for the Mac M1? I have tried to install it several times through the steps on the website and no luck.

I've downloaded Tensorflow 2.4.0-rc0 (the only one available for M1) separately, as it wasn't getting anywhere through the "pip install ludwig" command. I kept getting dependency conflict errors and I corrected them for the most part, but there seems to be no workaround for TF; see the conflicts below.

The conflict is caused by:
    ludwig 0.3.3 depends on tensorflow>=2.3.1
    ludwig 0.3.2 depends on tensorflow>=2.3.1
    ludwig 0.3.1 depends on tensorflow>=2.2
    ludwig 0.3 depends on tensorflow>=2.2
    ludwig 0.2.2.8 depends on tensorflow==1.15.3
    ludwig 0.2.2.7 depends on tensorflow==1.15.3
    ludwig 0.2.2.6 depends on tensorflow==1.15.2
    ludwig 0.2.2.5 depends on tensorflow==1.15.2
    ludwig 0.2.2.4 depends on tensorflow==1.15.2
    ludwig 0.2.2.3 depends on tensorflow==1.15.2
    ludwig 0.2.2.2 depends on tensorflow-gpu==1.15.2
    ludwig 0.2.2 depends on tensorflow-gpu==1.15.2
    ludwig 0.2.1 depends on tensorflow==1.14.0
    ludwig 0.2 depends on tensorflow==1.14.0
    ludwig 0.1.2 depends on tensorflow==1.13.1
    ludwig 0.1.1 depends on tensorflow==1.13.1
    ludwig 0.1.0 depends on tensorflow>=1.12

To fix this you could try to:

loosen the range of package versions you've specified

remove package versions to allow pip attempt to solve the dependency conflict

It seems M1's limited TF availability is not letting the Ludwig install get through the TF dependencies.

Expected behavior Successfully install Ludwig.

Environment (please complete the following information):

OS: Big Sur 11.2.1

Python 3.8

Ludwig 0.3.3

Thanks
waiting for answer dependencies
opened by camaya7 34
pandas.errors.ParserError: Error tokenizing data. C error: Expected 83 fields in line 40, saw 92

Describe the bug Hi, I am getting this error message: pandas.errors.ParserError: Error tokenizing data. C error: Expected 83 fields in line 40, saw 92

I did everything I could to produce a valid CSV, but I guess something is still wrong. As a suggestion, I believe it would be good to have a way to handle bad lines (skip them) and report the total number of ignored lines.

Back to my settings. The saved CSV file is UTF-8 encoded. My columns contain mostly text, and some contained ',' which I turned into '\t' as described in the documentation. I also made the following basic test: I re-imported the CSV into Excel and made sure all the lines and columns (24 in total in my case) are in the right place. That was the case: everything looked fine in MS Excel.

The YAML has been validated correctly, so I rule it out as the cause in my case.
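The suggestion above, skipping malformed lines and reporting how many were ignored, can be prototyped outside Ludwig with the standard csv module. The sketch below is only an illustration of that idea and does not describe Ludwig's behaviour; the function name read_consistent_rows and the sample data are made up.

```python
import csv
import io


def read_consistent_rows(handle, delimiter=","):
    """Return (header, good_rows, skipped_line_numbers) for a CSV stream.

    Rows whose field count differs from the header's are skipped and the
    line number at which each one ends is recorded.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    good_rows, skipped = [], []
    for row in reader:
        if len(row) == len(header):
            good_rows.append(row)
        else:
            skipped.append(reader.line_num)
    return header, good_rows, skipped


sample = io.StringIO(
    'name,description,price\n'
    'Barolo,"dry, full-bodied",35\n'   # quoted comma: still 3 fields
    'Chianti,dry, medium-bodied,18\n'  # unquoted comma: 4 fields, skipped
    'Soave,crisp,12\n'
)
header, rows, skipped = read_consistent_rows(sample)
print(f"kept {len(rows)} rows, skipped {len(skipped)} at lines {skipped}")
```

Running a check like this on the dataset before training shows which lines have an unexpected number of fields, which is usually a sign of unquoted delimiters inside text columns. When loading with pandas directly, read_csv in pandas 1.3 and later also accepts on_bad_lines="skip".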

Below are the lines from my terminal.

ludwig_version: '0.1.0' command: ('/Library/Frameworks/Python.framework/Versions/3.6/bin/ludwig train ' '--data_csv /Users/my_mac/Projects/ML/LUDWIG/Wine/red_dataset.csv ' '--model_definition_file ' '/Users/my_mac/Projects/ML/LUDWIG/Wine/dataset.yaml') dataset_type: '/Users/my_mac/Projects/ML/LUDWIG/Wine/red_dataset.csv' model_definition: { 'combiner': {'type': 'concat'}, 'input_features': [ { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Order', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Produttore', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Tipo', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Descrizione', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Vitigni', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Vigneti', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Vinificazione', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Affinamento', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Filosofia', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Temperatura', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Quando_aprire', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Ideale', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Quando_bere', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Descrizione_long', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Colore', 'tied_weights': None, 'type': 'text'}, { 
'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Profumo', 'tied_weights': None, 'type': 'text'}, { 'encoder': 'parallel_cnn', 'level': 'word', 'name': 'Gusto', 'tied_weights': None, 'type': 'text'}, { 'in_memory': True, 'name': 'Immagine-src', 'should_resize': False, 'tied_weights': None, 'type': 'image'}], 'output_features': [ { 'dependencies': [], 'loss': { 'type': 'mean_squared_error', 'weight': 1}, 'name': 'Prezzo', 'reduce_dependencies': 'sum', 'reduce_input': 'sum', 'type': 'numerical', 'weight': 1}, { 'decoder': 'generator', 'dependencies': [], 'level': 'char', 'loss': { 'class_distance_temperature': 0, 'class_weights': 1, 'type': 'softmax_cross_entropy', 'weight': 1}, 'name': 'Denominazione', 'reduce_dependencies': 'sum', 'reduce_input': 'sum', 'type': 'text', 'weight': 1}, { 'decoder': 'generator', 'dependencies': [], 'level': 'char', 'loss': { 'class_distance_temperature': 0, 'class_weights': 1, 'type': 'softmax_cross_entropy', 'weight': 1}, 'name': 'Regione', 'reduce_dependencies': 'sum', 'reduce_input': 'sum', 'type': 'text', 'weight': 1}, { 'decoder': 'generator', 'dependencies': [], 'level': 'char', 'loss': { 'class_distance_temperature': 0, 'class_weights': 1, 'type': 'softmax_cross_entropy', 'weight': 1}, 'name': 'Gradazione', 'reduce_dependencies': 'sum', 'reduce_input': 'sum', 'type': 'text', 'weight': 1}], 'preprocessing': { 'bag': { 'fill_value': '', 'format': 'space', 'lowercase': 10000, 'missing_value_strategy': 'fill_with_const', 'most_common': False}, 'binary': { 'fill_value': 0, 'missing_value_strategy': 'fill_with_const'}, 'category': { 'fill_value': '', 'lowercase': False, 'missing_value_strategy': 'fill_with_const', 'most_common': 10000}, 'force_split': False, 'image': {'missing_value_strategy': 'backfill'}, 'numerical': { 'fill_value': 0, 'missing_value_strategy': 'fill_with_const'}, 'sequence': { 'fill_value': '', 'format': 'space', 'lowercase': False, 'missing_value_strategy': 'fill_with_const', 'most_common': 20000, 'padding': 
'right', 'padding_symbol': '', 'sequence_length_limit': 256, 'unknown_symbol': ''}, 'set': { 'fill_value': '', 'format': 'space', 'lowercase': False, 'missing_value_strategy': 'fill_with_const', 'most_common': 10000}, 'split_probabilities': (0.7, 0.1, 0.2), 'stratify': None, 'text': { 'char_format': 'characters', 'char_most_common': 70, 'char_sequence_length_limit': 1024, 'fill_value': '', 'lowercase': True, 'missing_value_strategy': 'fill_with_const', 'padding': 'right', 'padding_symbol': '', 'unknown_symbol': '', 'word_format': 'space_punct', 'word_most_common': 20000, 'word_sequence_length_limit': 256}, 'timeseries': { 'fill_value': '', 'format': 'space', 'missing_value_strategy': 'fill_with_const', 'padding': 'right', 'padding_value': 0, 'timeseries_length_limit': 256}}, 'training': { 'batch_size': 128, 'bucketing_field': None, 'decay': False, 'decay_rate': 0.96, 'decay_steps': 10000, 'dropout_rate': 0.0, 'early_stop': 3, 'epochs': 10, 'gradient_clipping': None, 'increase_batch_size_on_plateau': 0, 'increase_batch_size_on_plateau_max': 512, 'increase_batch_size_on_plateau_patience': 5, 'increase_batch_size_on_plateau_rate': 2, 'learning_rate': 0.001, 'learning_rate_warmup_epochs': 5, 'optimizer': { 'beta1': 0.9, 'beta2': 0.999, 'epsilon': 1e-08, 'type': 'adam'}, 'reduce_learning_rate_on_plateau': 0, 'reduce_learning_rate_on_plateau_patience': 5, 'reduce_learning_rate_on_plateau_rate': 0.5, 'regularization_lambda': 0, 'regularizer': 'l2', 'staircase': False, 'validation_field': 'combined', 'validation_measure': 'loss'}}

Using full raw csv, no hdf5 and json file with the same name have been found
Building dataset (it may take a while)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ludwig/utils/data_utils.py", line 46, in read_csv
    df = pd.read_csv(data_fp, header=header)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 83 fields in line 40, saw 92

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/bin/ludwig", line 11, in
    sys.exit(main())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ludwig/cli.py", line 86, in main
    CLI()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ludwig/cli.py", line 64, in init
    getattr(self, args.command)()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ludwig/cli.py", line 70, in train
    train.cli(sys.argv[2:])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ludwig/train.py", line 663, in cli
    full_train(**vars(args))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ludwig/train.py", line 224, in full_train
    random_seed=random_seed
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ludwig/data/preprocessing.py", line 457, in preprocess_for_training
    random_seed=random_seed
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ludwig/data/preprocessing.py", line 54, in build_dataset
    dataset_df = read_csv(dataset_csv)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ludwig/utils/data_utils.py", line 48, in read_csv
    logging.WARNING('Failed to parse the CSV with pandas default way,'
TypeError: 'int' object is not callable
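The log shows two separate problems. The pandas ParserError says that line 40 of the CSV has 92 fields where the parser expected 83, which usually points to unquoted delimiters inside free text. The TypeError comes from the fallback path: logging.WARNING is an integer level constant, while the callable function is logging.warning. The following is a minimal standard-library sketch for locating ragged rows before training; find_ragged_rows and the sample data are hypothetical and not part of Ludwig.

```python
import csv
import io
import logging


def find_ragged_rows(csv_file, delimiter=","):
    """Return the header width and (line number, field count) for each mismatched row."""
    reader = csv.reader(csv_file, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    ragged = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the number of source lines read so far
            ragged.append((reader.line_num, len(row)))
    return expected, ragged


# Hypothetical sample: the third line has one field too many.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")
expected, ragged = find_ragged_rows(sample)

# logging.warning (lowercase) is the function; logging.WARNING is the int level.
logging.warning("Expected %d fields; ragged rows: %s", expected, ragged)
```

Quoting the offending text fields in the CSV, or removing the extra delimiters, lets pandas parse the file on its default path.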
feature

opened by IzzyHibbert 29
How to get the visualization result?

After running ludwig visualize --visualization learning_curves -ts results/experiment_run_4/training_statistics.json, nothing is returned. Where can I find the visualization results? Are they saved as an image somewhere?
feature waiting for answer

opened by MrRace 28
Save models with SavedModel

Hello, I know there is issue #55 about exposing the model, and I have read it. It was suggested there that a model developed with Ludwig can be used in TensorFlow Serving. I have been trying to convert the model to a format that TensorFlow Serving can use.

For example, I followed the readallcat.csv example and developed a model in Ludwig, which is saved in this format:

checkpoint model_weights.data-00000-of-00001 model_weights.meta model_hyperparameters.json model_weights.index train_set_metadata.json

For TensorFlow Serving, the model needs to be in the format below, as described in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md:

assets/ variables/ variables.data-***-of-*** variables.index saved_model.pb

I have been trying to use the TensorFlow SavedModel API, but have had no luck with the conversion. Any guidance would be a big help. Thank you.
feature looking into it

opened by upalchowdhury 27
Refactor GBMs to request at least 1 CPU by default
lightgbm_ray attempts to auto-detect cluster resources before training, and will override CPU resource requests if the model

is not being hyperopted

has not been assigned a placement group

has requested <= 0 CPUs

Under these conditions, training may fail because no available resources match the request. To prevent this, GBMs should request at least 1 CPU by default.
opened by jeffkinnison 0
Reduce resource overhead of TestDatasetWindowAutosizing tests

These tests seem to lead to 143 errors in many cases, which are commonly the result of OOMs or other high resource cancellations by GitHub. We should find a way to run these tests that doesn't require generating and processing so much ephemeral data.
tests

opened by tgaddair 1
fix: Remove extra properties from `input_features`, `output_features`, `defaults` schemas for GBM models
Removes extra properties from the features and defaults schemas that are otherwise unsupported by GBM.

In particular

encoder and decoder sub schemas are removed

Only GBM-supported feature types are left behind after pruning the defaults schema

Extra fields with references to encoders/decoders are removed: input_size, top_k, num_classes - can I get a double check on this?
opened by ksbrar 6
Refactor metrics and metric tables and support adding more in-training metrics.
Changes:

Rename metric_type_registry -> metric_feature_type_registry

Rename BINARY_WEIGHTED_CROSS_ENTROPY -> BINARY_WEIGHTED_CROSS_ENTROPY_LOSS

Add Precision and Recall metrics.

Remove get_output_metric_functions() everywhere and rely on the metric registry to determine which metrics correspond to which feature types.

For output features, make metric_functions private, in favor of a public metric_names attribute. Metrics are set up, updated, and managed internally within the OutputFeature -- we don't need to expose the actual functions. metric_names is sufficient for previous users of metric_functions.

Use the metric registry to populate options for the hyperopt schema.

Clean up metric table printing mechanisms by refactoring metric tables to be constructed outside of trainer.evaluation() (instead of appended to during evaluation). We also factor out these actions to separate libraries that are re-used between trainer and trainer_lightgbm.

Add get registry function for metrics registries, and use them.
opened by justinxzhao 1
Move `on_batch_end` callback to omit eval from batch duration during benchmarking

This PR moves the on_batch_end callback to before run_evaluation is called. This is useful in getting us more accurate batch throughput measurements during benchmarking. Prior to this change, any measurements of average batch duration would incorporate the evaluation step, particularly if the average batch duration was computed across multiple epochs, leading to a (potentially significant) overestimation of average batch duration.

opened by geoffreyangus 3

Releases(v0.6.4)

v0.6.4(Oct 28, 2022)
What's Changed

Field fix: by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2714

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2719

Bump Ludwig to 0.6.4 by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2720

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.6.3...v0.6.4
v0.6.3(Oct 20, 2022)
What's Changed

Cherry-pick remote file syncing with hyperopt by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2644

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2646

Cherry-pick bb8bef02c002eccbb6369292ac54490875bebbc4 by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2651

Cherry-pick: Ensure no ghost ray instances are running in tests (#2607) by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2654

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2660

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2677

Update version to v0.6.3 by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2682

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.6.2...v0.6.3
v0.6.2(Oct 13, 2022)
What's Changed

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2594

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2602

0.6.2: cherry-pick Explanation API by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2604

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2609

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2608

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2613

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2618

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2634

Cherrypick: feat: adds max_batch_size to auto batch size functionality by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2632

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2636

Update version to 0.6.2 by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2624

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.6.1...v0.6.2
v0.6.1(Oct 4, 2022)
What's Changed

Cherry pick hyperopt plots by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2567

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2571

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2572

fix: Limit frequency array to top_n_classes in F1 viz (#2565) by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2575

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2581

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2583

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2590

Cherrypick: Comprehensive Configs by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2580

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2591

Update version to 6.1 by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2582

Readme fixes by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2592

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.6...v0.6.1
v0.6(Sep 27, 2022)
Overview

Ludwig 0.6 introduces several exciting features focused on modeling, deployment, and testing that make it more flexible, reliable, and easy to use in production.

Gradient boosted models: Historically, Ludwig has been built around a single, flexible neural network architecture called ECD (for Encoder-Combiner-Decoder). With the release of 0.6 we are adding support for a different model architecture: gradient-boosted tree models (GBMs).

Richer configuration schema and validation: We formalized the schema of Ludwig configurations and now validate it before initialization, which can help you avoid mistakes like typos and syntax errors.

Probability calibration for binary and multi-class classification: With deep neural networks, the probabilities given by models often don't match the true likelihood of the data. Ludwig now supports temperature scaling calibration (On Calibration of Modern Neural Networks), which brings class probabilities closer to their true likelihoods in the validation set.

Pipelined TorchScript: We improved the TorchScript model export functionality, making it easier than ever to train and deploy models for high performance inference.

Model parameter update unit tests: The code to update parameters of deep neural networks can be too complex for developers to make sure the model parameters are updated. To address this difficulty and improve the robustness of our models, we implemented a reusable utility to ensure parameters are updated during one cycle of a forward-pass / backward-pass / optimizer step.

Additional improvements include a new global configuration section, time-based dataset splitting and more flexible hyperparameter optimization configurations. Read more about each specific feature below.

If you are learning about Ludwig for the first time, or if these new features are relevant and exciting to your research or application, we'd love to hear from you. Join our Ludwig Slack Community here.

Gradient Boosted Models (@jppgks)

Historically, Ludwig has been built around a single, flexible neural network architecture called ECD (for Encoder-Combiner-Decoder). With the release of 0.6 we are adding support for a different model architecture: gradient-boosted tree models (GBM).

This is motivated by the fact that tree models still outperform neural networks on some tabular datasets, and the fact that tree models are generally less compute-intensive, making them a better choice for some applications. In Ludwig, users can now experiment with both neural and tree-based architectures within the same framework, taking advantage of all of the additional functionalities and conveniences that Ludwig offers like: preprocessing, hyperparameter optimization, integration with different backends (local, ray, horovod), and interoperability with different data sources (pandas, dask, modin).

How to use it

Install the tree extra package with pip install ludwig[tree]. After the installation, you can use the new gbm model type in the configuration. Ludwig will default to using the ECD architecture, which can be overridden as follows to use GBM:
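A minimal sketch of such a configuration, written as the Python dict form of a Ludwig config. Only the model_type key and its gbm value come from the text above; the feature names are hypothetical, and the feature type names and the "ecd" default value are assumptions about the 0.6 schema.

```python
# Hypothetical tabular features; GBM support is limited to binary,
# categorical and numeric features with a single output feature.
config = {
    "model_type": "gbm",  # assumed default when omitted: "ecd"
    "input_features": [
        {"name": "age", "type": "number"},
        {"name": "occupation", "type": "category"},
    ],
    "output_features": [
        {"name": "income_over_50k", "type": "binary"},
    ],
}

# The initial GBM support expects exactly one output feature.
assert len(config["output_features"]) == 1

# With ludwig[tree] installed, the dict could be passed to the Python API
# (file name is a placeholder):
# from ludwig.api import LudwigModel
# model = LudwigModel(config)
# model.train(dataset="train.csv")
```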

In some initial benchmarking we found that GBMs are particularly performant on smaller tabular datasets and can sometimes deal better with class imbalance compared to neural networks. Stay tuned for a more in-depth blogpost on the topic. Like the ECD neural networks, GBMs can be sensitive to hyperparameter values, and hyperparameter tuning is important to get a well-performing model.

Under the hood, Ludwig uses LightGBM for training gradient-boosted tree models, and the LightGBM trainer parameters can be configured in the trainer section of the configuration. For serving, the LightGBM model is converted to a PyTorch graph using Hummingbird for efficient evaluation and inference.

Limitations

Ludwig's initial support for GBM is limited to tabular data (binary, categorical and numeric features) with a single output feature target.

Calibrating probabilities for category and binary output features (@dantreiman)

Suppose your model outputs a class probability of 90%. Is there a 90% chance that the model prediction is correct? Do the probabilities given by your model match the true likelihood of the data? With deep neural networks, they often don't.

Drawing on the methods described in On Calibration of Modern Neural Networks (Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger), Ludwig now supports temperature scaling for binary and category output features. Temperature scaling brings a model's output probabilities closer to the true likelihood while preserving the same accuracy and top k predictions.

How to use Calibration

To enable calibration, add calibration: true to any binary or category output feature configuration:
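As a sketch, the setting might look like this in the Python dict form of a config. The calibration key is taken from the sentence above; the feature names are hypothetical and loosely follow the forest cover example.

```python
# Hypothetical features; only the "calibration" key comes from the release notes.
config = {
    "input_features": [
        {"name": "elevation", "type": "number"},
    ],
    "output_features": [
        {"name": "cover_type", "type": "category", "calibration": True},
    ],
}
```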

With calibration enabled, Ludwig will find a scale factor (temperature) which will bring the class probabilities closer to their true likelihoods in the validation set. The calibration scale factor is determined in a short phase after training is complete. If no validation split is provided, the training set is used instead.
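To make the idea concrete, here is a framework-free sketch of temperature scaling. It divides the logits by a temperature T and picks the T that minimizes negative log-likelihood on held-out data. This is not Ludwig's implementation: the grid search, the function names, and the toy logits are assumptions for illustration, and whether Ludwig reports T or its inverse as the scale factor is not stated here.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


# Toy over-confident model: large logit margins, but one of four predictions is wrong.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]
candidates = [0.5 + 0.25 * i for i in range(19)]  # 0.5, 0.75, ..., 5.0

best_t = fit_temperature(val_logits, val_labels, candidates)
calibrated = [softmax(row, best_t) for row in val_logits]
```

Because every logit in a row is divided by the same positive number, the ranking of classes is unchanged, which is why accuracy and top k predictions are preserved.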

To visualize the effects of calibration in Ludwig, you can use Calibration Plots, which bin the data based on model probability and plot the model probability (X) versus observed (Y) for each bin (see code examples).

In a perfectly calibrated model, the observed probability equals the predicted probability, and all predictions will land on the dotted line y=x. In this example using the forest cover dataset, the uncalibrated model in blue gives over-confident predictions near the left and right edges close to probability values of 0 or 1. Temperature scaling learns a scale factor of 0.51 which improves the calibration curve in orange, moving it closer to y=x.

Limitations

Calibration is currently limited to models with binary and category output features.

Richer configuration schema and validation (@connor-mccorm @ksbrar @justinxzhao )

Ludwig configurations are flexible by design, as they internally map to Python function signatures. This allows for expressive configurations with many parameters for users to play with, but we have found that users would too easily introduce mistakes in their configs, like incorrect value types or other syntactical inconsistencies, that were not easy to catch.

We have now formalized the Ludwig config with a strongly typed schema, serving as a centralized source of truth for parameter documentation and config validation. Ludwig validation now explicitly restricts each parameter's values to valid ones, decreasing the chance of syntactical and logical errors and signaling immediately to the user where the issues lie, before processing data or starting training. Schemas also provide many future benefits including autocompletion.

Nested encoder and decoder parameters (@connor-mccorm )

We have also restructured the way that encoders and decoders are configured to now use a nested structure, consistent with other modules in Ludwig such as combiners and loss.

As these changes impact what constitutes a valid Ludwig config, we also introduced a mechanism for ensuring backward compatibility that invisibly and automatically upgrades older configs to the current config structure.
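The kind of upgrade described here can be sketched as follows. This is only an illustration of moving flat encoder settings into a nested block, not Ludwig's actual upgrade code; nest_encoder and the list of encoder parameter names are hypothetical.

```python
def nest_encoder(feature, encoder_param_names=("embedding_size", "dropout")):
    """Return a copy of a feature with flat encoder settings nested under "encoder"."""
    upgraded = dict(feature)
    encoder = upgraded.get("encoder")
    if isinstance(encoder, dict):
        return upgraded  # already in the nested form
    nested = {}
    if encoder is not None:
        nested["type"] = encoder  # old style: encoder name given as a string
    for key in encoder_param_names:
        if key in upgraded:
            nested[key] = upgraded.pop(key)
    if nested:
        upgraded["encoder"] = nested
    return upgraded


# Old-style feature: encoder name and its parameters sit next to name and type.
old_feature = {"name": "Descrizione", "type": "text",
               "encoder": "parallel_cnn", "embedding_size": 64}
new_feature = nest_encoder(old_feature)
```

Here new_feature keeps name and type at the top level and carries an encoder block with type and embedding_size, while old_feature is left unchanged.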

We hope that, with the new Ludwig schema and the improved encoder/decoder nesting structure, you find using Ludwig to be a much more robust and user-friendly experience!

New Defaults Ludwig Section (@arnavgarg1 )

In Ludwig 0.5, users could specify global preprocessing parameters on a per-feature-type basis through the preprocessing section in Ludwig configs. This is useful if users know they always want to apply certain transformations to their data for every feature of the same type. However, there was no equivalent mechanism for global encoder, decoder or loss related parameters.

For example, say we have a mammography dataset to predict breast cancer that contains many categorical features. In Ludwig 0.5, we might define our input features with encoder parameters in the following way:

Here, the problem is that we have to redefine the same encoder parameters (type, dropout, and embedding_size) for each of the input features if we want to override the default value across all categorical features.

In Ludwig 0.6, we are introducing a new defaults section within the Ludwig config to define feature-type defaults for preprocessing, encoders, decoders, and loss. Default preprocessing and encoder configurations will be applied to all input_features of that feature type, while decoder and loss configurations will be applied to all output_features of that feature type.

Note that you can still specify feature specific parameters as usual, and these will override any default parameter values that come from the global defaults section.

The same mammography config above could be defined in the following, much more concise way in Ludwig 0.6:

Here, the encoder defaults for type, dropout and embedding_size are applied to all three categorical features. The he_normal embedding initializer is only applied to tumor_size and inv_nodes since we didn't specify this parameter in their feature definitions, but breast_quadrant will use the glorot_normal initializer since it will override the value from the defaults section.
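The override behaviour described above can be sketched in a few lines of Python. This is an illustration of the semantics only, not Ludwig's implementation; the encoder values, the embedding_initializer key name, and the helper functions are assumptions.

```python
import copy


def deep_merge(base, override):
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_input_features(config):
    """Apply per-type preprocessing and encoder defaults to each input feature."""
    defaults = config.get("defaults", {})
    resolved = []
    for feature in config["input_features"]:
        type_defaults = defaults.get(feature["type"], {})
        applicable = {key: value for key, value in type_defaults.items()
                      if key in ("preprocessing", "encoder")}
        resolved.append(deep_merge(applicable, feature))
    return resolved


config = {
    "defaults": {
        "category": {
            "encoder": {
                "type": "sparse",          # hypothetical values
                "dropout": 0.1,
                "embedding_size": 8,
                "embedding_initializer": "he_normal",
            },
        },
    },
    "input_features": [
        {"name": "tumor_size", "type": "category"},
        {"name": "inv_nodes", "type": "category"},
        {"name": "breast_quadrant", "type": "category",
         "encoder": {"embedding_initializer": "glorot_normal"}},
    ],
}

features = {f["name"]: f for f in resolve_input_features(config)}
```

After resolution, all three features share the default type, dropout and embedding_size, and only breast_quadrant keeps its own initializer.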

Additionally, in Ludwig 0.6, we have moved all global feature-type preprocessing within this new defaults section from the preprocessing section.

The defaults section enables the same fine-grained control with the benefit of making your config easier to define and read.

Global Defaults In Hyperopt (@arnavgarg1 )

The defaults section has also been added to hyperopt, so that users can define feature-type level parameters for individual trials. This makes the definition of the hyperopt search space more convenient, without the need to define individual parameters for each of the features in instances where the dataset has a large number of input or output features.

For example, if you want to hyperopt over different encoders for all text features in each of the trials, you can do so by defining a parameter this way:

This will sample one of the three encoders for text features and apply it to all the text features for that particular trial.
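A sketch of how such a parameter could be declared and applied per trial. The dotted parameter name, the space/categories keys, and the three encoder names are assumptions about the hyperopt schema, and sample_trial_config is a hypothetical helper rather than a Ludwig function.

```python
import copy
import random

# Assumed shape of a hyperopt parameter that targets a feature-type default.
hyperopt = {
    "parameters": {
        "defaults.text.encoder.type": {
            "space": "choice",
            "categories": ["parallel_cnn", "stacked_cnn", "rnn"],
        },
    },
}


def set_by_path(config, dotted_path, value):
    """Set a nested value addressed by a dotted path, creating dicts as needed."""
    *parents, leaf = dotted_path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def sample_trial_config(base_config, hyperopt, rng):
    """Sample every 'choice' parameter once and write it into a copy of the config."""
    trial = copy.deepcopy(base_config)
    for path, spec in hyperopt["parameters"].items():
        if spec["space"] == "choice":
            set_by_path(trial, path, rng.choice(spec["categories"]))
    return trial


base_config = {
    "input_features": [
        {"name": "title", "type": "text"},
        {"name": "body", "type": "text"},
    ],
}
trial_config = sample_trial_config(base_config, hyperopt, random.Random(0))
```

Because the sampled value lands in the defaults section for text, both text features in the trial pick up the same encoder.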

Nested Configs In Hyperopt (@tgaddair )

We have extended the range of hyperopt parameters to support parameter choices that consist of partial or complete blocks of nested Ludwig config sections. This allows users to search over a set of Ludwig configs, as opposed to needing to specify config params individually and search over all combinations.

To provide a parameter that represents a full top-level Ludwig config, the . key name can be used.

For example, we can define a hyperopt search space that samples partial Ludwig configs, so that each hyperopt sample is a config with one of the sampled blocks merged in.
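A sketch of that idea in Python. The "." key is taken from the text above; the space/categories keys, the combiner and trainer values, and the merge helper are assumptions meant to show the mechanics, not Ludwig's implementation.

```python
import copy
import random


def deep_merge(base, override):
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Each category is a partial top-level config block (values are hypothetical).
search_space = {
    ".": {
        "space": "choice",
        "categories": [
            {"combiner": {"type": "tabnet"},
             "trainer": {"learning_rate": 0.001, "batch_size": 64}},
            {"combiner": {"type": "concat"},
             "trainer": {"batch_size": 256}},
        ],
    },
}

base_config = {
    "combiner": {"type": "concat"},
    "trainer": {"epochs": 10, "batch_size": 128},
}


def sample_config(base_config, search_space, rng):
    """Pick one partial config block and merge it over the base config."""
    block = rng.choice(search_space["."]["categories"])
    return deep_merge(base_config, block)


sample = sample_config(base_config, search_space, random.Random(0))
```

Settings that the chosen block does not mention, such as trainer.epochs here, carry over from the base config.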

Pipelined TorchScript (@geoffreyangus @brightsparc )

In Ludwig v0.6, we improved the TorchScript model export functionality, making it easier than ever to train and deploy models for high performance inference.

At the core of our implementation is a pipeline-based approach to exporting models. After training a Ludwig model, users can run the export_torchscript command in the CLI, or call LudwigModel.save_torchscript. If model training was performed on a GPU device, doing so produces three new TorchScript artifacts:

These artifacts represent a single LudwigModel as three modules, each separated by stage: preprocessing, prediction, and postprocessing. These artifacts can be pipelined together using the InferenceModule class method InferenceModule.from_directory, or with some tools such as NVIDIA Triton.

One of the most significant benefits is that TorchScripted models are backend and environment independent and different parts can run on different hardware to maximize throughput. They can be loaded up in either a C++ or Python backend, and in either, minimal dependencies are required to run model inference. Such characteristics ensure that the model itself is both highly portable and backward compatible.

Time-based Dataset Splitting (@tgaddair )

In Ludwig v0.6, we have added the ability to split based on a date column such that the data is ordered by date (ascending) and then split into train-validation-test along the time dimension. To make this possible, we have reworked the way splitting is handled in the Ludwig configuration to support a dedicated split section:

In this example, by setting probabilities: [0.7, 0.1, 0.2], the earliest 70% of the data will be used for training, the middle 10% used for validation, and the last 20% used for testing.
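The behaviour can be sketched as follows. The split dict mirrors the probabilities setting mentioned above; the type and column key names, the column name, and the datetime_split helper are assumptions for illustration, not Ludwig's code.

```python
from datetime import date

# Assumed shape of the dedicated split section (column name is hypothetical).
split_config = {
    "type": "datetime",
    "column": "created_ts",
    "probabilities": [0.7, 0.1, 0.2],
}


def datetime_split(rows, column, probabilities):
    """Sort rows by date (ascending) and cut them into train/validation/test."""
    ordered = sorted(rows, key=lambda row: row[column])
    n = len(ordered)
    # round() guards against float error in the cumulative fractions
    train_end = round(n * probabilities[0])
    val_end = round(n * (probabilities[0] + probabilities[1]))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Ten daily records, deliberately given in reverse order.
rows = [{"created_ts": date(2022, 1, day), "price": 100 + day}
        for day in range(10, 0, -1)]

train, val, test = datetime_split(
    rows, split_config["column"], split_config["probabilities"])
```

Every training row is older than every validation row, and every validation row is older than every test row, which is the property backtesting relies on.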

This feature is important to support backtesting strategies where the user needs to know if a model trained on historical data would have performed well on unseen future data. If we were to use a uniformly random split strategy in these cases, then the model performance may not reflect the model's ability to generalize well if the data distribution is subject to change over time. For example, imagine a model that is predicting housing prices. If we both train and test on data from around the same time, we may fool ourselves into believing our model has learned something fundamental about housing valuations when in reality it might just be basing its predictions on recent trends in the market (trends that will likely change once the model is put into production). Splitting the training from the test data along the time dimension is one way to avoid this false sense of confidence, by showing how well the model should do on unseen data from the future.

Prior to Ludwig v0.6, the preprocessing configuration supported splitting based on a split column, split probabilities (train-val-test), or stratified splitting based on a category, all of which were flattened into the top-level of the preprocessing section:

This approach was limiting in that every new split type required reconciling all of the above params and determining how they should interact with the new type. To resolve this complexity, all of the existing split types have been similarly reworked to follow the new structure supported for datetime splitting.

Examples

Splitting by row at random (default):

Splitting based on a fixed column.

Stratified splits using a chosen stratification category column.
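The three cases above might be written as follows, together with a sketch of how the older flat parameters could map onto the new structure. The type and column key names and the column names are assumptions; the old keys force_split, split_probabilities and stratify appear in older Ludwig configs, but their exact semantics as encoded in upgrade_split are assumed, and the function is not Ludwig's upgrade code.

```python
# Assumed shapes of the new split section (column names are hypothetical).
random_split = {"type": "random", "probabilities": [0.7, 0.1, 0.2]}
fixed_split = {"type": "fixed", "column": "split"}
stratified_split = {"type": "stratify", "column": "color",
                    "probabilities": [0.7, 0.1, 0.2]}


def upgrade_split(preprocessing, has_split_column=False):
    """Map old flat split settings to a nested split section (illustrative only)."""
    probabilities = list(preprocessing.get("split_probabilities", (0.7, 0.1, 0.2)))
    stratify = preprocessing.get("stratify")
    if stratify is not None:
        return {"type": "stratify", "column": stratify,
                "probabilities": probabilities}
    if has_split_column and not preprocessing.get("force_split", False):
        return {"type": "fixed", "column": "split"}
    return {"type": "random", "probabilities": probabilities}


old_preprocessing = {
    "force_split": False,
    "split_probabilities": (0.7, 0.1, 0.2),
    "stratify": None,
}
upgraded = upgrade_split(old_preprocessing)
```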

Be on the lookout as we continue to add additional split strategies in the future to support advanced usage such as bucketed backtesting. If you are interested in these kinds of scenarios, please reach out!

Parameter Update Unit Tests (@jimthompson5802 )

A significant step was taken in this release to improve the code quality of Ludwig components, e.g., encoders, combiners, and decoders. Deep neural networks have many layers composed of a large number of parameters that must be updated to converge to a solution. Depending on the particular algorithm, the code for updating parameters during training can be quite complex. As a result, it is near impossible for a developer to reason through an analysis that confirms model parameters are updated.

To address this difficulty, we implemented a reusable utility to perform a quick sanity check to ensure parameters, such as tensor weights and biases, are updated during one cycle of a forward-pass / backward-pass / optimizer step. This work was inspired by these earlier blog postings: How to unit test machine learning code and Testing Your PyTorch Models with Torcheck.

This utility was added to unit tests for existing Ludwig components. With this addition, unit tests for Ludwig now ensure the following:

No run-time exceptions are raised

Generated outputs are of the correct data type and shape

(New capability) Model parameters are updated as expected

The above is an example of a unit test. First, it sets the random number seed to ensure repeatability. Next, the test instantiates the Ludwig component and processes synthetic data to ensure the component does not raise an error and that the output has the expected shape. Finally, the unit test checks if the parameters are updated under the different combinations of configuration settings.
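The core of such a check can be shown without any deep learning framework. The sketch below is a toy stand-in, not Ludwig's utility: TinyLinear, its hand-written gradients, and check_parameters_updated are hypothetical, and the real tests operate on PyTorch modules.

```python
class TinyLinear:
    """Toy one-input linear model with hand-written gradients."""

    def __init__(self, frozen=()):
        self.params = {"weight": 0.5, "bias": 0.0}
        self.frozen = set(frozen)  # parameter names that must not be updated

    def forward(self, x):
        return self.params["weight"] * x + self.params["bias"]

    def training_step(self, x, target, lr=0.1):
        error = self.forward(x) - target                       # forward pass
        grads = {"weight": 2 * error * x, "bias": 2 * error}   # backward pass (squared error)
        for name, grad in grads.items():                       # optimizer step (plain SGD)
            if name not in self.frozen:
                self.params[name] -= lr * grad


def check_parameters_updated(model, x, target):
    """Run one training step and report which parameters changed."""
    before = dict(model.params)
    model.training_step(x, target)
    updated = [name for name in before if model.params[name] != before[name]]
    not_updated = [name for name in before if model.params[name] == before[name]]
    return updated, not_updated


# All parameters trainable: nothing should be left untouched.
updated, not_updated = check_parameters_updated(TinyLinear(), x=2.0, target=3.0)

# With the bias frozen, the check should flag exactly that parameter.
_, frozen_report = check_parameters_updated(
    TinyLinear(frozen=("bias",)), x=2.0, target=3.0)
```

A unit test would then assert that the list of parameters that were not updated matches what the configuration is expected to freeze.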

In addition to the new parameter update check utility, Ludwig's Developer Guide contains instructions for using the utility. This allows an advanced user or a contributor, who is developing custom encoders, combiners, or decoders, to ensure the quality of their custom component.

Stay in the loop

Ludwig's thriving open source community gathers on Slack; join it to get involved!

If you are interested in adopting Ludwig in the enterprise, check out Predibase, the declarative ML platform that connects with your data, manages the training, iteration, and deployment of your models, and makes them available for querying, reducing time to value of machine learning projects.

Full Changelog

Fix ray nightly import by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2196

Restructured split config and added datetime splitting by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2132

enh: Implements InferenceModule as a pipelined module with separate preprocessor, predictor, and postprocessor modules by @brightsparc in https://github.com/ludwig-ai/ludwig/pull/2105

Explicitly pass data credentials when reading binary files from a RayBackend by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2198

MlflowCallback: do not end run on_trainer_train_teardown by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2201

Fail hyperopt with full import error when Ray not installed by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2203

Make convert_predictions() backend-aware by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2200

feat: MVP for explanations using Integrated Gradients from captum by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2205

[Torchscript] Adds GPU-enabled input types for Vector and Timeseries by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2197

feat: Added model type GBM (LightGBM tree learner), as an alternative to ECD by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2027

[Torchscript] Parallelized Text/Sequence Preprocessing by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2206

feat: Adding feature type shared parameter capability for hyperopt by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2133

Bump up version to 0.6.dev. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2209

Define FloatOrAuto and IntegerOrAuto schema fields, and use them. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2219

Define a dataclass for parameter metadata. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2218

Add explicit handling for zero-length image byte buffers to avoid cryptic errors by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2210

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2231

Create dataset util to form repeatable train/vali/test split by @amholler in https://github.com/ludwig-ai/ludwig/pull/2159

Bug fix: Use safe rename which works across filesystems when writing checkpoints by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2225

Add parameter metadata to the trainer schema. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2224

Add an explicit call to merge_with_defaults() when loading a config from a model directory. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2226

Fixes flaky test test_datetime_split[dask] by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2232

Fixes prediction saving for models with Set output by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2211

Make ExpectedImpact JSON serializable by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2233

standardised quotation marks, added missing word by @Marvjowa in https://github.com/ludwig-ai/ludwig/pull/2236

Add boolean postprocessing to dataset type inference for automl by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2193

Update get_repeatable_train_val_test_split to handle non-stratified split w/ no existing split by @amholler in https://github.com/ludwig-ai/ludwig/pull/2237

Update R2 score to handle single sample computation by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2235

Input/Output Feature Schema Refactor by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2147

Fix nan in entmax loss and flaky sparsemax/entmax loss tests by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2238

Fix preprocessing dataset split API backwards compatibility upgrade bug. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2239

Removing duplicates in constants from recent PRs by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2240

Add attention scores of the vit encoder as an additional return value by @Dennis-Rall in https://github.com/ludwig-ai/ludwig/pull/2192

Unnest Audio Feature Preprocessing Config by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2242

Fixed handling of invalid number values to treat as missing values by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2247

Support saving numpy predictions to remote FS by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2245

Use global constant for description.json by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2246

Removed import warnings when LightGBM and Ray not requested by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2249

Adds ability to read images from numpy files and numpy arrays by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2212

Hyperopt steps per epoch not being computed correctly by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2175

Fixed splitting when providing pre-split inputs by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2248

Added Backwards Compatibility for Audio Feature Preprocessing by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2254

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2256

Fix: Don't skip saving the model if the save path already exists. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2264

Load best weights outside of finally block, since load may throw an exception by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2268

Reduce number of distributed tests. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2270

[WIP] Adds inference_utils.py by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2213

Run github checks for pushes and merges to *-stable. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2266

Add ludwig logo and version to CLI help text. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2258

Add hyperopt_statistics.json constant by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2276

fix: Make BaseTrainerConfig an abstract class by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2273

[Torchscript] Adds --device argument to export_torchscript CLI command by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2275

Use pytest tmpdir fixture wherever temporary directories are used in tests. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2274

adding configs used in benchmarking by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2263

Fixes #2279 by @noahlh in https://github.com/ludwig-ai/ludwig/pull/2284

adding hardware usage and software packages tracker by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2195

benchmarking utils by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2260

dataclasses for summarizing benchmarking results by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2261

Benchmarking core by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2262

Fixed default eval_batch_size when setting batch_size=auto by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2286

Remove obsolete postprocess_inference_graph function. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2267

[Torchscript] Adds BERT tokenizer + partial HF tokenizer support by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2272

Support passing ground_truth as df for visualizations by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2281

catching urllib3 exception by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2294

Run pytest workflow on release branches. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2291

Save checkpoint if train_steps is smaller than batcher's steps_per_epoch by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2298

Fix typo in amazon review datasets: s/review_tile/review_title by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2300

Refactor non-distributed automl utils into a separate directory. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2296

Don't skip normalization in TabNet during inference on a single row. by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2299

Fix error in postproc_predictions calculation in model.evaluate() by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2304

Test for parameter updates in Ludwig components by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/2194

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2311

Use warnings to suppress repeated logs for failed image reads by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2312

Use ray dataset and drop type casting in binary_feature prediction post processing for speedup by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2293

Add size_bytes to DatasetInfo and DataSource by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2306

Fixes TensorDtype TypeError in Ray nightly by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2320

Add configuration section for global feature parameters by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2208

Ensures unit tests are deleting artifacts during teardown by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2310

Fixes unit test that had empty Dask partitions after splitting by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2313

Serve json numpy encoding by @jeffkinnison in https://github.com/ludwig-ai/ludwig/pull/2316

fix: Mlflow config being injected in hyperopt config by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2321

Update tests that use preprocessing to match new defaults config structure by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2323

Bump test timeout to 60 minutes by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2325

Set a default value for size_bytes in DatasetInfo by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2331

Pin nightly versions to fix CI by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2327

Log number of failed image reads by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2317

Add test with encoder dependencies for global defaults by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2342

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2334

Add wine quality notebook to demonstrate using config defaults by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2333

fix: GBM tests failing after new release from upstream dependency by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2347

fix: restore overwrite of eval_batch_size on GBM schema by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2345

Removes empty partitions after dropping rows and splitting datasets by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2328

fix: Properly serialize ParameterMetadata to JSON by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2348

Test for parameter updates in Ludwig Components - Part 2 by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/2252

refactor: Replace bespoke marshmallow fields that accept multiple types with a new 'combinatorial' OneOfField that accepts other fields as arguments. by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2285

Use Ray Datasets to read binary files in parallel by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2241

typos: Update README.md by @andife in https://github.com/ludwig-ai/ludwig/pull/2358

Respect the resource requests in RayPredictor by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2359

Resource tracker threading by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2352

Allow writing init_config results to remote filesystems by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2364

Fixed export_mlflow command to not assume an existing registered_model_name by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2369

fix: Fixes to serialization, and update to allow set repo location. by @brightsparc in https://github.com/ludwig-ai/ludwig/pull/2367

Add amazon employee access challenge kaggle dataset by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2349

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2362

Wrap read of cached training set metadata in try/except for robustness by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2373

Reduce dropout prob in test_conv1d_stack by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2380

fever: change broken download links by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2381

Add default split config by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2379

Fix CI: Skip failing ray GBM tests by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2391

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2389

Triton ensemble export by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2251

Fix: Random dataset splitting with 0.0 probability for optional validation or test sets. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2382

Print final training report as tabulated text. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2383

Add Ray 2.0 to CI by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2337

add GBM configs to benchmarking by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2395

Optional artifact logging for MLFlow by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/2255

Simplify ludwig.benchmarking.benchmark API and add ludwig benchmark CLI by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2394

rename kaggle_api_key to kaggle_key by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2384

use new URL for yosemite dataset by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2385

Encoder refactor V2 by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2370

re-enable GBM tests after new lightgbm-ray release by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2393

Added option to log artifact location while creating mlflow experiment by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/2397

Treat dataset columns as object dtype during first pass of handle_missing_values by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2398

fix: ParameterMetadata JSON serialization bug by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2399

Adds registry to organize backward compatibility updates around versions and config sections by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2335

Include split column in explanation df by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2405

Fix AimCallback to model_name as Run.name by @alberttorosyan in https://github.com/ludwig-ai/ludwig/pull/2413

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2410

Hotfix: features eligible for shared params hyperopt by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2417

Nest FC Params in Decoder by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2400

Hyperopt Backwards Compatibility by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2419

Investigating test_resnet_block_layer intermittent test failure by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2414

fix: Remove duplicate option from cell_type field schema by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2428

Test for parameter updates in Ludwig Combiners - Part 3 by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/2332

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2430

Hotfix: Proc column missing in output feature schema by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2435

Nest hyperopt parameters into decoder object by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2436

Fix: Make the twitter bots modeling example runnable by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2433

Add MLG-ULB creditcard fraud dataset by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2425

Bugfix: non-number inputs to GBM by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2418

GBM: log intermediate progress by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2421

Fix: Upgrade ludwig config before schema validation by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2441

Log warning for calibration if validation set is trivially small by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2440

Fixes calibration and adds example scripts by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2431

Add medical no-show appointments dataset by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2387

Added conditional check for UNK token insertion into category feature vocab by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2429

Ensure synthetic dataset unit tests to clean up extra files. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2442

Added feature specific parameter test for hyperopt by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2329

Fixed version transformation to accept user configs without ludwig_version by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2424

Fix multiple partition predict by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2422

Cache jsonschema validator to reduce memory pressure by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2444

[tests] Added more explicit lifecycle management to Ray clusters during tests by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2447

Fix: explicit keyword args for seaborn plot fn by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2454

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2453

Extended hyperopt to support nested configuration block parameters by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2445

Consolidate missing value strategy to only include bfill and ffill by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2457

fix: Switched Learning Rate to NonNegativeFloat Field by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2446

Support GitHub Codespaces by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2463

Enh: quality-of-life improvements for export_torchscript by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2459

Disables batch_size: auto for CPU-only training by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2455

bugfix: triton model version as a string by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2461

Updating images to Ray 2.0.0 and CUDA 11.3 by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2390

Loss, Split, and Defaults Schema Additions by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2439

More precise resource usage tracking by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2363

Summarizing performance metrics and resource usage results by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2372

[release-0.6] Cherry-pick bugfixes from upstream by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2471

[release-0.6] Cherry-pick upstream commits by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2473

[release-0.6] Cherry-pick upstream by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2476

Cherry-pick backwards-compatibility fixes by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2487

[cherry-pick] Fixed usage of checkpoints for AutoML in Ray 2.0 (#2485) by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2491

fix: Automatically assign title to OneOfOptionsField (#2480) by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2492

[cherry-pick] Fixed stratified splitting with Dask (#1883) by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2494

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2505

AUTO: Enable hyperopt to be launched from a ray client by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2504

[cherry-pick] Pin transformers < 4.22 until issues resolved (#2495) by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2510

[cherry-pick] Fix flaky ray nightly image test (#2493) by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2511

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2513

Add in-memory dataset size calculation to dataset statistics and hyperopt (#2509) by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2518

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2521

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2528

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2534

Cherrypick: Cleanup: move to per-module loggers instead of the global logging object by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2539

Update version to 0.6rc1. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2529

Add resource isolation to 0.6 and fix merge conflicts by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2538

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2542

More resource isolation cherrypicks by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2544

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2546

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2552

Pin ray nightly version to avoid test failures related to TensorDType… by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2559

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2557

Update version to 0.6. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2549

New Contributors

Congratulations to our new contributors!

@Marvjowa made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2236

@Dennis-Rall made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2192

@abidwael made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2263

@noahlh made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2284

@jeffkinnison made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2316

@andife made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2358

@alberttorosyan made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2413

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.5.3...v0.6
v0.6rc1 (Sep 20, 2022)
What's Changed

[release-0.6] Cherry-pick bugfixes from upstream by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2471

[release-0.6] Cherry-pick upstream commits by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2473

[release-0.6] Cherry-pick upstream by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2476

Cherry-pick backwards-compatibility fixes by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2487

[cherry-pick] Fixed usage of checkpoints for AutoML in Ray 2.0 (#2485) by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2491

fix: Automatically assign title to OneOfOptionsField (#2480) by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2492

[cherry-pick] Fixed stratified splitting with Dask (#1883) by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2494

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2505

AUTO: Enable hyperopt to be launched from a ray client by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2504

[cherry-pick] Pin transformers < 4.22 until issues resolved (#2495) by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2510

[cherry-pick] Fix flaky ray nightly image test (#2493) by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2511

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2513

Add in-memory dataset size calculation to dataset statistics and hyperopt (#2509) by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2518

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2521

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2528

AUTO: by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2534

Cherrypick: Cleanup: move to per-module loggers instead of the global logging object by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2539

Update version to 0.6rc1. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2529

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.6.beta...v0.6rc1
v0.6.beta (Sep 8, 2022)
What's Changed

Fix ray nightly import by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2196

Restructured split config and added datetime splitting by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2132

enh: Implements InferenceModule as a pipelined module with separate preprocessor, predictor, and postprocessor modules by @brightsparc in https://github.com/ludwig-ai/ludwig/pull/2105

Explicitly pass data credentials when reading binary files from a RayBackend by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2198

MlflowCallback: do not end run on_trainer_train_teardown by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2201

Fail hyperopt with full import error when Ray not installed by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2203

Make convert_predictions() backend-aware by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2200

feat: MVP for explanations using Integrated Gradients from captum by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2205

[Torchscript] Adds GPU-enabled input types for Vector and Timeseries by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2197

feat: Added model type GBM (LightGBM tree learner), as an alternative to ECD by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2027

[Torchscript] Parallelized Text/Sequence Preprocessing by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2206

feat: Adding feature type shared parameter capability for hyperopt by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2133

Bump up version to 0.6.dev. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2209

Define FloatOrAuto and IntegerOrAuto schema fields, and use them. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2219

Define a dataclass for parameter metadata. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2218

Add explicit handling for zero-length image byte buffers to avoid cryptic errors by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2210

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2231

Create dataset util to form repeatable train/vali/test split by @amholler in https://github.com/ludwig-ai/ludwig/pull/2159

Bug fix: Use safe rename which works across filesystems when writing checkpoints by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2225

Add parameter metadata to the trainer schema. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2224

Add an explicit call to merge_with_defaults() when loading a config from a model directory. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2226

Fixes flaky test test_datetime_split[dask] by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2232

Fixes prediction saving for models with Set output by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2211

Make ExpectedImpact JSON serializable by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2233

standardised quotation marks, added missing word by @Marvjowa in https://github.com/ludwig-ai/ludwig/pull/2236

Add boolean postprocessing to dataset type inference for automl by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2193

Update get_repeatable_train_val_test_split to handle non-stratified split w/ no existing split by @amholler in https://github.com/ludwig-ai/ludwig/pull/2237

Update R2 score to handle single sample computation by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2235

Input/Output Feature Schema Refactor by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2147

Fix nan in entmax loss and flaky sparsemax/entmax loss tests by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2238

Fix preprocessing dataset split API backwards compatibility upgrade bug. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2239

Removing duplicates in constants from recent PRs by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2240

Add attention scores of the vit encoder as an additional return value by @Dennis-Rall in https://github.com/ludwig-ai/ludwig/pull/2192

Unnest Audio Feature Preprocessing Config by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2242

Fixed handling of invalid number values to treat as missing values by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2247

Support saving numpy predictions to remote FS by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2245

Use global constant for description.json by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2246

Removed import warnings when LightGBM and Ray not requested by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2249

Adds ability to read images from numpy files and numpy arrays by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2212

Hyperopt steps per epoch not being computed correctly by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2175

Fixed splitting when providing pre-split inputs by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2248

Added Backwards Compatibility for Audio Feature Preprocessing by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2254

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2256

Fix: Don't skip saving the model if the save path already exists. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2264

Load best weights outside of finally block, since load may throw an exception by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2268

Reduce number of distributed tests. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2270

[WIP] Adds inference_utils.py by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2213

Run github checks for pushes and merges to *-stable. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2266

Add ludwig logo and version to CLI help text. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2258

Add hyperopt_statistics.json constant by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2276

fix: Make BaseTrainerConfig an abstract class by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2273

[Torchscript] Adds --device argument to export_torchscript CLI command by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2275

Use pytest tmpdir fixture wherever temporary directories are used in tests. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2274

adding configs used in benchmarking by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2263

Fixes #2279 by @noahlh in https://github.com/ludwig-ai/ludwig/pull/2284

adding hardware usage and software packages tracker by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2195

benchmarking utils by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2260

dataclasses for summarizing benchmarking results by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2261

Benchmarking core by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2262

Fixed default eval_batch_size when setting batch_size=auto by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2286

Remove obsolete postprocess_inference_graph function. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2267

[Torchscript] Adds BERT tokenizer + partial HF tokenizer support by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2272

Support passing ground_truth as df for visualizations by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2281

catching urllib3 exception by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2294

Run pytest workflow on release branches. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2291

Save checkpoint if train_steps is smaller than batcher's steps_per_epoch by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2298

Fix typo in amazon review datasets: s/review_tile/review_title by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2300

Refactor non-distributed automl utils into a separate directory. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2296

Don't skip normalization in TabNet during inference on a single row. by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2299

Fix error in postproc_predictions calculation in model.evaluate() by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2304

Test for parameter updates in Ludwig components by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/2194

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2311

Use warnings to suppress repeated logs for failed image reads by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2312

Use ray dataset and drop type casting in binary_feature prediction post processing for speedup by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2293

Add size_bytes to DatasetInfo and DataSource by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2306

Fixes TensorDtype TypeError in Ray nightly by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2320

Add configuration section for global feature parameters by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2208

Ensures unit tests are deleting artifacts during teardown by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2310

Fixes unit test that had empty Dask partitions after splitting by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2313

Serve json numpy encoding by @jeffkinnison in https://github.com/ludwig-ai/ludwig/pull/2316

fix: Mlflow config being injected in hyperopt config by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2321

Update tests that use preprocessing to match new defaults config structure by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2323

Bump test timeout to 60 minutes by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2325

Set a default value for size_bytes in DatasetInfo by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2331

Pin nightly versions to fix CI by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2327

Log number of failed image reads by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2317

Add test with encoder dependencies for global defaults by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2342

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2334

Add wine quality notebook to demonstrate using config defaults by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2333

fix: GBM tests failing after new release from upstream dependency by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2347

fix: restore overwrite of eval_batch_size on GBM schema by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2345

Removes empty partitions after dropping rows and splitting datasets by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2328

fix: Properly serialize ParameterMetadata to JSON by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2348

Test for parameter updates in Ludwig Components - Part 2 by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/2252

refactor: Replace bespoke marshmallow fields that accept multiple types with a new 'combinatorial' OneOfField that accepts other fields as arguments. by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2285

Use Ray Datasets to read binary files in parallel by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2241

typos: Update README.md by @andife in https://github.com/ludwig-ai/ludwig/pull/2358

Respect the resource requests in RayPredictor by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2359

Resource tracker threading by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2352

Allow writing init_config results to remote filesystems by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2364

Fixed export_mlflow command to not assume an existing registered_model_name by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2369

fix: Fixes to serialization, and update to allow set repo location. by @brightsparc in https://github.com/ludwig-ai/ludwig/pull/2367

Add amazon employee access challenge kaggle dataset by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2349

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2362

Wrap read of cached training set metadata in try/except for robustness by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2373

Reduce dropout prob in test_conv1d_stack by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2380

fever: change broken download links by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2381

Add default split config by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2379

Fix CI: Skip failing ray GBM tests by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2391

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2389

Triton ensemble export by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2251

Fix: Random dataset splitting with 0.0 probability for optional validation or test sets. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2382

Print final training report as tabulated text. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2383

Add Ray 2.0 to CI by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2337

add GBM configs to benchmarking by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2395

Optional artifact logging for MLFlow by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/2255

Simplify ludwig.benchmarking.benchmark API and add ludwig benchmark CLI by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2394

rename kaggle_api_key to kaggle_key by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2384

use new URL for yosemite dataset by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2385

Encoder refactor V2 by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2370

re-enable GBM tests after new lightgbm-ray release by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2393

Added option to log artifact location while creating mlflow experiment by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/2397

Treat dataset columns as object dtype during first pass of handle_missing_values by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2398

fix: ParameterMetadata JSON serialization bug by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2399

Adds registry to organize backward compatibility updates around versions and config sections by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2335

Include split column in explanation df by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2405

Fix AimCallback to model_name as Run.name by @alberttorosyan in https://github.com/ludwig-ai/ludwig/pull/2413

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2410

Hotfix: features eligible for shared params hyperopt by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2417

Nest FC Params in Decoder by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2400

Hyperopt Backwards Compatibility by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2419

Investigating test_resnet_block_layer intermittent test failure by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2414

fix: Remove duplicate option from cell_type field schema by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2428

Test for parameter updates in Ludwig Combiners - Part 3 by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/2332

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2430

Hotfix: Proc column missing in output feature schema by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2435

Nest hyperopt parameters into decoder object by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2436

Fix: Make the twitter bots modeling example runnable by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2433

Add MLG-ULB creditcard fraud dataset by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2425

Bugfix: non-number inputs to GBM by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2418

GBM: log intermediate progress by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2421

Fix: Upgrade ludwig config before schema validation by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2441

Log warning for calibration if validation set is trivially small by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2440

Fixes calibration and adds example scripts by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2431

Add medical no-show appointments dataset by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2387

Added conditional check for UNK token insertion into category feature vocab by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2429

Ensure synthetic dataset unit tests to clean up extra files. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2442

Added feature specific parameter test for hyperopt by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2329

Fixed version transformation to accept user configs without ludwig_version by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2424

Fix multiple partition predict by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2422

Cache jsonschema validator to reduce memory pressure by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2444

[tests] Added more explicit lifecycle management to Ray clusters during tests by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2447

Fix: explicit keyword args for seaborn plot fn by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2454

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2453

Extended hyperopt to support nested configuration block parameters by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2445

Consolidate missing value strategy to only include bfill and ffill by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2457

fix: Switched Learning Rate to NonNegativeFloat Field by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2446

Support GitHub Codespaces by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2463

Enh: quality-of-life improvements for export_torchscript by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2459

Disables batch_size: auto for CPU-only training by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2455

bugfix: triton model version as a string by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2461

Updating images to Ray 2.0.0 and CUDA 11.3 by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2390

Loss, Split, and Defaults Schema Additions by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2439

More precise resource usage tracking by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2363

Summarizing performance metrics and resource usage results by @abidwael in https://github.com/ludwig-ai/ludwig/pull/2372

New Contributors

@Marvjowa made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2236

@Dennis-Rall made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2192

@abidwael made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2263

@noahlh made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2284

@jeffkinnison made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2316

@andife made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2358

@alberttorosyan made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2413

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.5.3...v0.6.beta
v0.5.5(Aug 2, 2022)
What's Changed

Bump Ludwig From v0.5.4 -> v0.5.5 by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2340

Bug fix: Use safe rename which works across filesystems when writing checkpoints

Fixed default eval_batch_size when setting batch_size=auto

Update R2 score to handle single sample computation

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.5.4...v0.5.5
v0.5.4(Jul 12, 2022)
What's Changed

Cherrypick fixes to 0.5 by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2257

Update ludwig version to v0.5.4. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2265

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.5.3...v0.5.4
v0.5.3(Jun 25, 2022)
What's Changed

Changed CheckpointManager to write the latest checkpoint to a consistent filename by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2123

fix: restore existing credentials when exiting use_credentials context manager by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2112

Torchscript-compatible TabNet by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2126

Add tests to ensure optional imports are optional by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2116

Added ray 1.13.0 and nightly wheel tests to CI by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2128

fix: Add default to top level of NumericOrStringOptions schema by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2119

Comprehensive configs for trainer and combiner. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2118

Set saved_weights_in_checkpoint immediately after creating model. Also adds test. by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2131

Fix Torchscript for exclusively binary feature inputs by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2103

Fixes NaN handling in boolean dtypes by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2058

[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in https://github.com/ludwig-ai/ludwig/pull/2135

Parallelizes URL reads for images using Ray/Multithreading by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2048

Fixes dtype of SPLIT column if already provided in CSV by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2140

Fixes FILL_WITH_MEAN missing value strategy with appropriate cast by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2141

Remove tune_batch_size from tabnet config by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2145

Accept kwargs in read_xsv by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2151

Remove all torch packages from the nightly test requirements by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2157

[Torchscript] Add Set output feature by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2161

Cleaning hyperopt logging by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2162

enh: Aim experiment tracking for Ludwig by @osoblanco in https://github.com/ludwig-ai/ludwig/pull/2097

Update to packaging version instead of LooseVersion by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2173

rmspe: add epsilon to avoid division by zero by @jppgks in https://github.com/ludwig-ai/ludwig/pull/2139

Fix creating tensor from copy of numpy array warning messages by @arnavgarg1 in https://github.com/ludwig-ai/ludwig/pull/2170

[Torchscript] Add Vector preprocessing and postprocessing by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2160

[Torchscript] Add H3 preprocessing by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2164

Expose dtype as a parameter of the read_xsv function instead of a purely hardcoded value by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2177

[Torchscript] Adds Sequence and Text feature postprocessing by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2163

[Torchscript] Add Date feature preprocessing by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2178

Added flag for writing per trial logs in hyperopt by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/2149

Replace ray.state.nodes() with ray.nodes(). by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2183

HYPEROPT: Migrate Sampler functionality to Executor by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/2165

Changes for enabling checkpoint syncing for hyperopt by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/2115

Adds mechanism for calibrating probabilities for category and binary features by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/1949

fix: Set divisions for proc_cols directly from original dataset by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2187

Avoid unneeded total_entropy calculation when sparsity=0 by @amholler in https://github.com/ludwig-ai/ludwig/pull/2190

Fix changing parameters on plateau. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2191

[Torchscript] Adds NaN handling to preprocessing modules by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2179

Fix postprocessing on binary feature columns with number dtype by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2189

automl: Use auto batch size by default with tabnet by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2150

Update ludwig version to v0.5.3. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2184

New Contributors

@arnavgarg1 made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2162

@osoblanco made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2097

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.5.2...v0.5.3
v0.5.2(Jun 8, 2022)
What's Changed

Addresses SettingWithCopyWarning in read_csv_with_nan by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2053

Update AutoML to check for imbalanced binary or category output features by @amholler in https://github.com/ludwig-ai/ludwig/pull/2052

fix: Pin jsonschema requirement by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2059

fix: Adjust custom JSON schema for betas field on optimizers by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2056

Use the smaller, unanimated GIF version so that it loads properly in PyPi by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2063

Make text encoder trainable property default to False for pre-trained HF encoders by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2060

Pin protobuf to 3.20.1 to workaround FieldDescriptor error by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2062

Use the smaller, unanimated GIF version so that it loads properly in PyPI by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2064

Factor pytorch device setting code by @amholler in https://github.com/ludwig-ai/ludwig/pull/2068

fix: pin protobuf to 3.20.1 in tests by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/2070

Update torch nightly and pin torchvision to fix CI by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2072

Added explicit encode, combine, decode functions to ECD by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2073

Revert "Adds rule of thumb for determining embeddings size" by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2069

Unpin torchvision by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2077

Restrict torchmetrics<0.9 and whylogs<1.0 until compatibility fixed by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2079

Adding new export for Triton by @brightsparc in https://github.com/ludwig-ai/ludwig/pull/2078

Adds step tracking at epoch level by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2081

Fix ray hyperopt by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/1999

Adds regression test for #2081 by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2084

Complete PR comments for hyperopt refactoring by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/2082

Parallelizes URL reads using Ray / Multithreading by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2040

Set Hyperopt Executor Type default to RAY by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/2093

Fixes shape issue in _BinaryPostprocessing by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2094

Rename sequence_size -> max_sequence_length by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2086

Fix type hints for dropout, dropout parameter references, and add docs for FCLayer and FCStack. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2061

Fix to_numpy_dataset() for Dask series by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2095

Add DATA_TRAIN_HDF5_FP in training_set_metadata for ParquetPreprocessor by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2096

Adds torchscript-compatible Audio input feature by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/1980

Fix progress bar ray by @magdyksaleh in https://github.com/ludwig-ai/ludwig/pull/2051

Fixes binary feature postprocessing upcast by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2101

Fixes for large scale hyperopt by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/2083

Changes batch norm momentum defaults to 1-momentum by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2100

Add imbalanced tabular dataset for developing AutoML heuristics by @amholler in https://github.com/ludwig-ai/ludwig/pull/2106

Deflakes and refactors torchscript tests by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2109

Fixed combiner schema creation by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2114

Added ability to stop and resume hyperopt / automl runs by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2108

Use the Backend to check for dask dataframes, instead of a hard check. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2113

Rename 'bias' to 'use_bias' for consistency by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/2104

Update ludwig version to v0.5.2. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2098

New Contributors

@magdyksaleh made their first contribution in https://github.com/ludwig-ai/ludwig/pull/2051

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.5.1...v0.5.2
v0.5.1(May 23, 2022)
What's Changed

refactor: Rename, reorganize schema module by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/1963

Fix redundant import by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2019

fix: Various marshmallow improvements. by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/1975

fixes nans in dask df engine by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2020

Adds regression tests for #2020 by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2021

Removes pinned torchtext and torch for windows. by @dantreiman in https://github.com/ludwig-ai/ludwig/pull/1998

Add AutoML inference for audio by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2023

Added support for batch size and learning rate tuning using Ray backend by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2024

Added split column for a deterministic output so flakes stop by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2028

Workaround test_tune_batch_size_lr flakiness by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2030

Fixed ordering of imports for comet test by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2031

Adds regression tests for #2007 by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2018

Improve performance of DataFrameEngine.df_like by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2029

Fixed infinite loop in tune_batch_size by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2034

Fixed learning rate tuning on gpu by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/2035

Fix SIGINT handler to modify the number of remaining training steps. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/2032

upgrade: Update jsonschema validator to latest spec. by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/2036

Bumps py3.7 Ray version to 1.12.0 by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2041

Added blocking warning for experiment CLI, and visual warning for tra… by @connor-mccorm in https://github.com/ludwig-ai/ludwig/pull/2043

Adds ability to export scripted ECD model without pre-/post- processing modules by @geoffreyangus in https://github.com/ludwig-ai/ludwig/pull/2042

Convert nan to 0 in avg_num_tokens() by @hungcs in https://github.com/ludwig-ai/ludwig/pull/2046

Fixing the trainable parameter in pretrained encoders by @w4nderlust in https://github.com/ludwig-ai/ludwig/pull/2047

Fixes trainability of sparse embeddings by @w4nderlust in https://github.com/ludwig-ai/ludwig/pull/2049

Adds rule of thumb for determining embeddings size by @w4nderlust in https://github.com/ludwig-ai/ludwig/pull/2050

Refactor HyperOpt to use RayTune by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/1994

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.5...v0.5.1
v0.5(May 10, 2022)
Ludwig v0.5 is a complete renovation of Ludwig from the ground up with a focus on parity, scalability, deployment, reliability, and documentation. Ludwig v0.5 migrates our entire backend from TensorFlow to PyTorch and introduces several new features and technical improvements, including:

Step-based training and evaluation to enable frequent sub-epoch monitoring of model health and evaluation metrics. This is particularly useful for large datasets that may be trained using large models.

Data balancing: upsampling and downsampling during preprocessing to produce better-proportioned datasets.

End-to-end torchscript to support low-level optimized model deployment, including preprocessing and post-processing, to go directly from example to predictions.

Ludwig on Ray with RayDatasets enabling significant training speed boosts for reading large datasets while training Ludwig models on a Ray cluster.

The addition of MLPMixer and ViTEncoder as image encoders for state-of-the-art deep learning on image data.

AutoML for tabular and text classification, integrated with distributed hyperparameter search using RayTune.

Scalability optimizations with Dask, Modin, and Ray, enabling Ludwig to preprocess, train, and evaluate over datasets hundreds of gigabytes in size in tens of minutes.

Config validation using marshmallow schemas, which reveals configuration typos or bad values early and increases reliability.

More tests. We've quadrupled the number of unit tests and end-to-end integration tests and we've expanded our CI testing to run in distributed and GPU settings. This strengthens Ludwig's stability and helps build confidence in new changes going forward.

Our team is thoroughly invested in improving the declarative ML experience, and, as part of the v0.5 release, we've revamped the getting started guide, user guide, and developer documentation. We've also published a handful of end-to-end tutorials with thoroughly documented notebooks on text, tabular, image, and multimodal classification that provide a deep walkthrough of Ludwig's functionality.

Migrating to PyTorch

Ludwig's migration to PyTorch comes from a substantial six-month undertaking involving 230+ commits, changes to 70k+ lines of code, and contributions from 40+ people.

PyTorch's pythonic design and emphasis on developer experience are well-aligned with Ludwig's principles of simplicity, modularity, and extensibility. Switching to use PyTorch as Ludwig’s backend of choice was strongly motivated by the increase in productivity in development, debugging, and iteration that the more pythonic PyTorch API affords us as well as the great ecosystem the PyTorch community has built around it. With Ludwig on PyTorch, we're thrilled to see what developers, researchers, and data scientists in the PyTorch and broader deep learning community can bring to Ludwig.

Feature and Performance Parity

Over the last several months, we've moved all Ludwig encoders, combiners, decoders, and metrics for every data modality that Ludwig supports, as well as all of the backend infrastructure on Horovod and Ray, to PyTorch.

At the same time, we wanted to make sure that the experience of Ludwig users continues to be performant and delightful. We've run extensive comparisons between Ludwig v0.5 (PyTorch-based) and Ludwig v0.4 on text, image, and tabular datasets, evaluating training speed, inference throughput, and model performance, to verify that there's been no degradation.

Our results reveal roughly the same high GPU utilization (~90%) on several datasets, with significant improvements in distributed training speed and memory usage and without impacting model accuracy or time to convergence. We'll be publishing a blog post with more details on benchmarking soon.

New Features

In addition to the PyTorch migration, Ludwig v0.5 is packed with new functionality, features, and additional changes that make v0.5 the most feature-rich and robust release of Ludwig yet.

Step-based training and evaluation

Ludwig's train loop is epoch-based by default, with one round of evaluation per epoch (one pass through the dataset).

    # Pseudocode for the default epoch-based loop.
    for epoch in range(num_epochs):
        for batch in training_data.batches:
            train(batch)
        save_model(model_dir)
        evaluation(training_data)
        evaluation(validation_data)
        evaluation(test_data)
        print_results()

This is an appropriate fit for tabular datasets, which are small, fit in memory, and train quickly. However, this can be awkward for unstructured datasets, which tend to be much larger, and train more slowly due to larger models. Now, with step-based training and evaluation, users can configure a more frequent sub-epoch evaluation cadence to more regularly monitor metrics and model health.

Use steps_per_checkpoint to run evaluation every N training steps, or checkpoints_per_epoch to run evaluation N times per epoch.

    trainer:
      steps_per_checkpoint: 1000

    trainer:
      checkpoints_per_epoch: 2

Note that it is invalid to specify both checkpoints_per_epoch and steps_per_checkpoint simultaneously.
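The relationship between the two options can be sketched in a few lines of Python. This is an illustration only, not Ludwig's implementation: the function name `steps_between_evaluations` and the `steps_per_epoch` argument are hypothetical, and treating `0` or `None` as "unset" is an assumption.

```python
from typing import Optional


def steps_between_evaluations(
    steps_per_epoch: int,
    steps_per_checkpoint: Optional[int] = None,
    checkpoints_per_epoch: Optional[int] = None,
) -> int:
    """Return how many training steps pass between two evaluation rounds.

    Hypothetical helper: 0 or None is treated as "option not set".
    """
    if steps_per_checkpoint and checkpoints_per_epoch:
        # The two options describe the same cadence in different units.
        raise ValueError(
            "Specify only one of steps_per_checkpoint and checkpoints_per_epoch."
        )
    if steps_per_checkpoint:
        return steps_per_checkpoint
    if checkpoints_per_epoch:
        # Spread the evaluations evenly over the epoch, at least one step apart.
        return max(1, steps_per_epoch // checkpoints_per_epoch)
    # Neither option set: fall back to one evaluation per epoch.
    return steps_per_epoch
```

With an epoch of 5000 steps, `checkpoints_per_epoch=2` would correspond to an evaluation every 2500 steps under these assumptions, while leaving both options unset reproduces the epoch-based default.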

To further speed up evaluation, users can skip evaluation on the training set by setting evaluate_training_set to False.

    trainer:
      evaluate_training_set: false

Data balancing

Users working with imbalanced datasets can specify an oversampling or undersampling parameter which will balance the data during preprocessing.

In this example, Ludwig will oversample the minority class to achieve a 50% representation in the overall dataset.

    preprocessing:
      oversample_minority: 0.5

In this example, Ludwig will undersample the majority class to achieve a 70% representation in the overall dataset.

    preprocessing:
      undersample_majority: 0.7

Data balancing is only supported for binary output classes. Specifying both parameters at the same time is also not supported.

When developing models, it can be useful to iterate quickly with a smaller portion of the dataset. Ludwig supports this with a new preprocessing parameter, sample_ratio, which subsamples the dataset.

    preprocessing:
      sample_ratio: 0.7
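To make the balancing arithmetic concrete, here is a minimal sketch that reads the balancing parameter the way the examples above describe it, as the class's share of the overall dataset. The function names are hypothetical, and Ludwig's actual preprocessing may define or round the ratio differently, so treat this as an illustration of the arithmetic only.

```python
import math


def _check_share(target_share: float) -> None:
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")


def oversampled_minority_count(minority: int, majority: int, target_share: float) -> int:
    """Minority rows needed so that minority / (minority + majority) reaches target_share."""
    _check_share(target_share)
    # Solve m / (m + majority) = target_share for m.
    needed = math.ceil(target_share * majority / (1.0 - target_share))
    return max(minority, needed)  # oversampling never removes rows


def undersampled_majority_count(minority: int, majority: int, target_share: float) -> int:
    """Majority rows to keep so that majority / (minority + majority) falls to target_share."""
    _check_share(target_share)
    # Solve M / (M + minority) = target_share for M.
    allowed = math.floor(target_share * minority / (1.0 - target_share))
    return min(majority, allowed)  # undersampling never adds rows
```

For a dataset with 100 minority rows and 900 majority rows, a 0.5 oversampling target means growing the minority class to 900 rows under this reading, whereas a 0.7 undersampling target means keeping roughly 233 majority rows.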

End-to-end torchscript

Users can export trained ludwig models to torchscript with ludwig export_torchscript.

    ludwig export_torchscript --model=/path/to/model

Models that use number, category, and binary features now support torchscript-compatible preprocessing, enabling end-to-end torchscript compilation.

    import torch

    # `model` is assumed to be an already trained Ludwig model.
    inputs = {
        'cat_feature': ['foo', 'bar'],
        'num_feature': torch.tensor([42, 7]),
        'bin_feature1': torch.tensor([True, False]),
        'bin_feature2': ['No', 'Yes'],
    }
    scripted_model = model.to_torchscript()
    output = scripted_model(inputs)

End-to-end torchscript compilation is also supported for text features that use torchscript-enabled torchtext tokenizers. We are actively working on adding support for other data types.

AutoML for Text Classification

In v0.4, we introduced experimental AutoML functionalities into Ludwig.

Ludwig AutoML automatically creates deep learning models given a dataset, its label column, and a time budget. Ludwig AutoML infers the input and output feature types, chooses the model architecture, and specifies the parameters and ranges across which to perform hyperparameter search.

import ludwig.automl

auto_train_results = ludwig.automl.auto_train(
    dataset=my_dataset_df,
    target=target_column_name,
    time_limit_s=7200,
    tune_for_memory=False,
)

Our initial AutoML work focused on tabular datasets, since good performance on such datasets is a current area of interest in the DL community. In v0.5, we expand on this work to develop and validate Ludwig AutoML for text classification.

Config validation against Marshmallow Schemas

The combiner and trainer sections of Ludwig configurations are now validated against official Marshmallow schemas. This centralizes documentation, flags configuration typos or bad values, and helps catch regressions.
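To illustrate what schema validation buys in practice, here is a small standard-library stand-in that flags unknown keys (likely typos) and bad values. It deliberately does not use Marshmallow or Ludwig's real schemas. The TRAINER_SCHEMA table, its fields, and the validate_trainer helper are hypothetical and heavily reduced.

```python
from difflib import get_close_matches
from typing import Any, Dict, List

# Hypothetical, reduced stand-in for a trainer schema:
# field name -> (expected type, predicate the value must satisfy).
TRAINER_SCHEMA = {
    "epochs": (int, lambda v: v > 0),
    "batch_size": (int, lambda v: v > 0),
    "learning_rate": (float, lambda v: v > 0),
    "evaluate_training_set": (bool, lambda v: True),
}


def validate_trainer(section: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key, value in section.items():
        if key not in TRAINER_SCHEMA:
            # Suggest the closest known field name to surface typos.
            hint = get_close_matches(key, list(TRAINER_SCHEMA), n=1)
            suffix = f" (did you mean '{hint[0]}'?)" if hint else ""
            errors.append(f"unknown field '{key}'{suffix}")
            continue
        expected_type, is_valid = TRAINER_SCHEMA[key]
        if not isinstance(value, expected_type) or not is_valid(value):
            errors.append(f"bad value for '{key}': {value!r}")
    return errors


# A typo ("epoch") and an out-of-range value are both reported.
for problem in validate_trainer({"epoch": 10, "learning_rate": -0.1}):
    print(problem)
```

Keeping the allowed fields and their constraints in one table is what lets the same definition serve as documentation and as a regression check.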

Better Test Coverage

We've quadrupled the number of unit and integration tests and we've established new testing guidelines for well-tested contributions going forward. This strengthens Ludwig's stability and iterability, and helps build confidence in new changes.

Backward Compatibility

Despite all of the code changes, we've worked hard to ensure that Ludwig’s simple interface remains consistent and compatible with earlier releases as much as possible. A few minor parameter naming changes in the Ludwig configuration to be aware of:

training -> trainer

numeric -> number

fc_size -> output_size

tied_weights -> tied

{weight/bias/activation}_regularizer removed -> a global regularization_lambda and regularization_type now control regularization across the entire model.

dropout: True/False removed -> dropout is now a float in [0, 1].
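The renames above can be applied to an older configuration mechanically. The sketch below is one way to do it, under the assumptions that training is a section key, that fc_size and tied_weights are parameter keys that may appear at any depth, and that numeric appears as the value of a type key. The upgrade_config helper and the example config are hypothetical, and regularizer and dropout changes are left out because they need manual review.

```python
from typing import Any

# Key renames listed above (old name -> new name).
KEY_RENAMES = {
    "training": "trainer",
    "fc_size": "output_size",
    "tied_weights": "tied",
}
# Feature type rename listed above.
TYPE_RENAMES = {"numeric": "number"}


def upgrade_config(node: Any) -> Any:
    """Return a copy of a config with the renames applied at every depth."""
    if isinstance(node, dict):
        upgraded = {}
        for key, value in node.items():
            if key == "type" and isinstance(value, str):
                value = TYPE_RENAMES.get(value, value)
            upgraded[KEY_RENAMES.get(key, key)] = upgrade_config(value)
        return upgraded
    if isinstance(node, list):
        return [upgrade_config(item) for item in node]
    return node


# Made-up pre-v0.5 style config used only for illustration.
old_config = {
    "input_features": [{"name": "age", "type": "numeric"}],
    "output_features": [{"name": "label", "type": "binary"}],
    "combiner": {"type": "concat", "fc_size": 128},
    "training": {"epochs": 10},
}
print(upgrade_config(old_config))
```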

Finally, we've dropped support for Python 3.6. Please use Python 3.7 or later going forward.

New Contributors

@vreyespue made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1213

@Yard1 made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1277

@EnricoMi made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1442

@q0w made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1512

@kriziacicchetti made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1525

@RebSolcia made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1526

@noyoshi made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1540

@louixs made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1552

@dantreiman made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1576

@pre-commit-ci made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1595

@connor-mccorm made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1699

@hfurkanbozkurt made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1734

@brightsparc made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1830

@tirkarthi made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1838

@jeffreykennethli made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1856

@rk0n made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1864

@geoffreyangus made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1882

@jppgks made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1959

v0.5rc2(Mar 7, 2022)

Fixes loss reporting consistency issues, and shape-based metric calculation errors with SET output features.
v0.5rc1(Feb 10, 2022)

Migration to PyTorch.
v0.4.1(Feb 1, 2022)
Summary

This release features experimental AutoML with auto config generation and auto-training integrated with hyperopt on RayTune, and integrations with Ray training and Ray datasets. We're still working on a comprehensive overhaul of the documentation, and all the new functionality will also be available in the upcoming v0.5.

Aside from fixes for critical bugs and new datasets, v0.4.1 will be the last release of Ludwig using TensorFlow. Starting with v0.5 (release coming soon), Ludwig will use PyTorch as the backend for tensor computation. We will release a blog post detailing the rationale and impact of this decision, but we wanted to do one last TensorFlow release so that everyone committed to the TensorFlow ecosystem who has used Ludwig so far can benefit from the many bug fixes and improvements to the codebase that are not specific to PyTorch.

The next version v0.5 will also have several additional improvements that we’ll be excited to share in the coming weeks.

Additions

Non-absolute image path support by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1224

Add image dim inference to schema by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1225

Additional Tabular Datasets by @amholler (#1226, #1230, #1237)

Initial implementation of the end-to-end autotrain module by @ANarayan in https://github.com/ludwig-ai/ludwig/pull/1219

[automl] AutoML Extended public API by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1235

Add image dimension inference to automl by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1243

[automl] Memory Aware Config Tuning by @ANarayan in https://github.com/ludwig-ai/ludwig/pull/1257

Added DataFrame wrapper type and fixed usage of optional imports by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1371

Added Dask kwargs to Ray backend by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1380

Configure Dask to determine parallelism automatically by default by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1383

Add Ray backend to Ray hyperopt by @Yard1 in https://github.com/ludwig-ai/ludwig/pull/1269

Add additional hyperopt callbacks by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1388

Added preprocessing callbacks by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1398

Added Slack and Twitter badges by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1399

Add support for Ray Train and Ray Datasets in training by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1391

Add combiner schema validation by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/1347

Publish unit test results by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1414

Publish test results for fork repos as well by @EnricoMi in https://github.com/ludwig-ai/ludwig/pull/1442

Build docker images for tf-legacy by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1504

Added init_config and render_config command-line utils (#1506) by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1514

Add experiment heuristics to automl module (variant of Avanika PR 1362) by @amholler in https://github.com/ludwig-ai/ludwig/pull/1507

Add random_seed to auto_train API to improve repeatability by @amholler in https://github.com/ludwig-ai/ludwig/pull/1619

Support use_reference_config option to AutoML to add initial trial from relevant best past model by @amholler in https://github.com/ludwig-ai/ludwig/pull/1636

Add remote checkpoint support to ray tune post search evaluation by @amholler in https://github.com/ludwig-ai/ludwig/pull/1646

[datasets] Add remote filesystem support to datasets module by @ANarayan in https://github.com/ludwig-ai/ludwig/pull/1244

Add sample training by @amholler in https://github.com/ludwig-ai/ludwig/pull/1227

Add support for Santander Customer Satisfaction dataset, along with s… by @amholler in https://github.com/ludwig-ai/ludwig/pull/1238

Improvements

Allow logging params to mlflow from any epoch by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1211

Changed remote fs behavior to upload at the end of each epoch by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1210

Add metric and loss modules for RMSE, RMSPE, and AUC by @ANarayan in https://github.com/ludwig-ai/ludwig/pull/1214

[hyperopt] fixed metric_score to use test split when available by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1239

Fixed metric selection to ignore config split if unavailable by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1248

Ray Tune Intermediate Checkpoint Cleaning by @ANarayan in https://github.com/ludwig-ai/ludwig/pull/1255

Do not initialize Ray if already initialized by @Yard1 in https://github.com/ludwig-ai/ludwig/pull/1277

Changed default combiner to concat from tabnet by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/1278

Ray data migration by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/1260

Fix automl to treat binary as categorical when missing values present by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1292

Add serialization for DatasetInfo and round avg_words to int by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1294

Cast max_length to int in build_sequence_matrix::pad by @Yard1 in https://github.com/ludwig-ai/ludwig/pull/1295

[automl] update model config parameter ranges by @ANarayan in https://github.com/ludwig-ai/ludwig/pull/1298

Change INFER_IMAGE_DIMENSIONS default to True by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1303

Add HTTPS retries for image urls by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1304

Return None for unreadable images and try to infer num channels by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1307

Add gray image/avg image fallbacks for unreachable images by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1312

Account for image extensions during image type inference by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1335

Fixed schema validation to handle null preprocessing values for strings by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1344

Added default size and output_size for tabnet by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1355

Removed DaskBackend and moved tests to RayBackend by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1412

Perform preprocessing first before hyperopt when possible by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1415

Employ a fallback str2bool mapping from the feature column's distinct values when the feature's values aren't boolean-like. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/1471

Remove trailing dot in income label field in adult_census… by @amholler in https://github.com/ludwig-ai/ludwig/pull/1475

Update Ludwig AutoML Feature Type Selection by @amholler in https://github.com/ludwig-ai/ludwig/pull/1485

Update infer_type tests to reflect interface and functionality updates by @amholler in https://github.com/ludwig-ai/ludwig/pull/1493

Skip converting to TensorDType if the column is binary by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1547

Remove TensorDType conversion for all scalar types by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1560

Update AutoML tabular model type choice to remove heuristic for concat by @amholler in https://github.com/ludwig-ai/ludwig/pull/1548

Better handle empty fields with distinct_values=[] by @hungcs in https://github.com/ludwig-ai/ludwig/pull/1574

Port #1476 ('dict' option for weights_initializer and bias_initializer) to tf_legacy by @ksbrar in https://github.com/ludwig-ai/ludwig/pull/1599

Modify combiners to accept input_features as a dict instead of a list by @jeffreyftang in https://github.com/ludwig-ai/ludwig/pull/1618

Update hyperopt: Choose best model from validation data; For stopped Ray Tune trials, run evaluate at search end by @amholler in https://github.com/ludwig-ai/ludwig/pull/1612

Keep search_alg type in dict to record in hyperopt_statistics.json by @amholler in https://github.com/ludwig-ai/ludwig/pull/1626

For ames_housing, remove test.csv from processing; it has no label column which prevents test split eval by @amholler in https://github.com/ludwig-ai/ludwig/pull/1634

Improve Ludwig resilience to Ray Tune issues by @amholler in https://github.com/ludwig-ai/ludwig/pull/1660

Handle download gzip files by @amholler in https://github.com/ludwig-ai/ludwig/pull/1676

Upgrade tf from 2.5.2 to 2.7.0. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/1713

Add basic precommit to tf-legacy to pass precommit checks on tf-legacy PRs. by @justinxzhao in https://github.com/ludwig-ai/ludwig/pull/1718

For kdd datasets, do not include unlabeled test data by default by @amholler in https://github.com/ludwig-ai/ludwig/pull/1704

Use config which has been previously validated by @vreyespue in https://github.com/ludwig-ai/ludwig/pull/1213

Update Readme to activate directly the virtualenv by @vreyespue in https://github.com/ludwig-ai/ludwig/pull/1212

doc: Correct README.md link to Developer Guide by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/1217

Update pandas version by @w4nderlust in https://github.com/ludwig-ai/ludwig/pull/1223

Modify Kaggle datasets to not process test sets by @ANarayan in https://github.com/ludwig-ai/ludwig/pull/1233

Restructure dataframe preprocessing setup and change to avoid creatin… by @amholler in https://github.com/ludwig-ai/ludwig/pull/1240

Bug fixes

Fixed Keras imports by @w4nderlust in https://github.com/ludwig-ai/ludwig/pull/1215

Fix assert in tabnet to be tf assert_rank by @w4nderlust in https://github.com/ludwig-ai/ludwig/pull/1222

Fixed read_csv for Dask by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1247

Fix TensorFlow CUDA version mismatch in Ray GPU image by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1256

Fix excluded field detection by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1285

Fixed automl to work when combiner is not specified by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1293

FIX: Issue 1181 resolves the ZeroDivisionError when calculating sample variance by @jimthompson5802 in https://github.com/ludwig-ai/ludwig/pull/1326

Fixed steps_per_epoch to be computed on batch resizing by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1402

Fix evaluation and visualization of confusion_matrix by @carlogrisetti in https://github.com/ludwig-ai/ludwig/pull/1408

Fixed auto eval batch size when train batch size is set by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1410

Fixed gpu isolation by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1455

Address issues in AutoML managing time-budget while exploring trial space by @amholler in https://github.com/ludwig-ai/ludwig/pull/1535

Fixed RayDatasets by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1565

Fix makedirs call to path_exists to pass url by @amholler in https://github.com/ludwig-ai/ludwig/pull/1592

Fixed KeyError while creating default config (#1643) by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1654

Fix FileNotFoundError while caching when cache_dir is … by @ShreyaR in https://github.com/ludwig-ai/ludwig/pull/1665

Fixed TabNet conversion to TF graph with unknown batch size by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1252

Other changes and things to note

Moved experiments to separate repo by @tgaddair in https://github.com/ludwig-ai/ludwig/pull/1245

Neuropod does not yet support Python 3.9. Ludwig still supports Neuropod for Python <= 3.8.

New Contributors

@vreyespue made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1213

@Yard1 made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1277

@EnricoMi made their first contribution in https://github.com/ludwig-ai/ludwig/pull/1442

Full Changelog: https://github.com/ludwig-ai/ludwig/compare/v0.4...v0.4.1
v0.4(Jun 15, 2021)
Changelog

Additions

Integrate ray tune into hyperopt (#1001)

Added Ames Housing Kaggle dataset (#1098)

Added functionality to obtain subtrees in the SST dataset (#1108)

Added comparator combiner (#1113)

Additional Text Classification Datasets (#1121)

Added Ray remote backend and Dask distributed preprocessing (#1090)

Added TabNet combiner and needed modules (#1062)

Added Higgs Boson dataset (#1157)

Added GitHub workflow to push to Docker Hub (#1160)

Added more tagging schemes for Docker images (#1161)

Added Docker build matrix (#1162)

Added category feature > 1 dim to TabNet (#1150)

Added timeseries datasets (#1149)

Add TabNet Datasets (#1153)

Forest Cover Type, Adult Census Income and Rossmann Store Sales datasets (#1165)

Added KDD Cup 2009 datasets (#1167)

Added Ray GPU image (#1170)

Added support for cloud object storage (S3, GCS, ADLS, etc.) (#1164)

Perform inference with Dask when using the Ray backend (#1128)

Added schema validation to config files (#1186)

Added MLflow experiment tracking support (#1191)

Added export to MLflow pyfunc model format (#1192)

Added MLP-Mixer image encoder (#1178)

Added TransformerCombiner (#1177)

Added TFRecord support as a preprocessing cache format (#1194)

Added higgs boson tabnet examples (#1209)

Improvements

Abstracted Horovod params into the Backend API (#1080)

Added allowed_origins to serving to allow cross-origin requests (#1091)

Added callbacks to hook into the training loop programmatically (#1094)

Added scheduler support to Ray Tune hyperopt and fixed GPU usage (#1088)

Ray Tune: enforced that epochs equals max_t and early stopping is disabled (#1109)

Added register_trainable logic to RayTuneExecutor (#1117)

Replaced Travis CI with GitHub Actions (#1120)

Split distributed tests into separate test suite (#1126)

Removed unused regularizer parameter from training defaults

Restrict docker built GA to only ludwig-ai repos (#1166)

Harmonize return object for categorical, sequence generator and sequence tagger (#1171)

Sourcing images from either file path or in-memory ndarrays (#1174)

Refactored hyperopt results into object structure for easier programmatic usage (#1184)

Refactored all contrib classes to use the Callback interface (#1187)

Improved performance of Dask preprocessing by adding parallelism (#1193)

Improved TabNetCombiner and Concat combiner (#1177)

Added additional backend configuration options (#1195)

Made should_shuffle configurable in Trainer (#1198)

Bugfixes

Fix SST parentheses issue

Fix serve.py adding a try around the form parsing (#1111)

Fix #1104: add lengths to text encoder output with updated unit test (#1105)

Fix sst2 subtree logic to match glue sst2 dataset (#1112)

Fix #1078: Avoid recreating cache when using image preproc (#1114)

Fix checking if dask exists in figure_data_format_dataset

Fixed bug in EthosBinary dataset class and model directory copying logic in RayTuneReportCallback (#1129)

Fix #1070: error when saving model with image feature (#1119)

Fixed IterableBatcher incompatibility with ParquetDataset and remote model serialization (#1138)

Fix: passing backend and TF config parameters to model load path in experiment

Fix: improved TabNet numerical stability + refactoring

Fix #1147: passing bn_epsilon to AttentiveTransformer initialization in TabNet

Fix #1093: loss value mismatch (#1103)

Fixed CacheManager to correctly handle test_set and validation_set (#1189)

Fixing TabNet sparsity loss issue (#1199)

Breaking changes

Most models trained with v0.3.3 will keep working in v0.4. The main changes in v0.4 are additional options, so what worked previously should not be broken now. One exception is that there is now a much stricter check of the validity of the model configuration. This is useful because it catches errors earlier, although configurations that worked in the past despite containing errors may not work anymore. The checks should help identify the issues in such configurations, so the errors should be easily fixable.

Contributors

@tgaddair @jimthompson5802 @ANarayan @kaushikb11 @mejackreed @ronaldyang @zhisbug @nimz @kanishk16
v0.3.3(Feb 1, 2021)
Changelog

Additions

Added Irony dataset and AGNews dataset (#1073)

Improvements

Updated hyperopt sampling functions to handle list values (#1082)

Bugfixes

Fix compatibility issues with Transformers 4.2.1 (#1077)

Fixed SST dataset link (#1085)

Fix hyperopt batch sampling (#1086)

Bumped skimage dependency version (#1087)

v0.3.2(Dec 29, 2020)
Changelog

Additions

Added feature identification logic (#957)

Added Backend interface for abstracting DataFrame preprocessing steps (#1014)

Add support for transforming numeric predictions that were normalized (#1015)

Added Kaggle API integration and Titanic dataset (#1021)

Add Korean translation for the README (#1022)

Added cast_columns function to preprocessing and cast_column function to all feature mixin classes (#1027)

Added custom encoder / decoder registration decorator (#1017)

Add titles to Hyperopt Report visualization (#1026)

Added label-wise probability to binary feature predictions (#1033)

Add support for num_layers in sequence generator decoder (#1050)

Added Flickr8k dataset (#1053)

Improvements

Improved triggering of cache re-creation (now it depends also on changes in feature types)

Improved legend and add tight_layout param to compare predictions plot (#1037)

Improved postprocessing for binary features so prediction vocab matches inputs (#1038)

Bump TensorFlow and tfa-nightly for 2.4.0 release (#1058)

Updated Dockerfiles to TensorFlow 2.4.0 (#1059)

Bugfixes

Fix missing yaml files for datasets in pip package

Fix hdf5 preprocessing error

Fix calculation of the metric score for hyperopt (#1031)

Fix wrong argument in visualize.py from -f to -ofn (#1032)

Fix fill NaN by adding selected conversion of columns to string when computing metadata (#1042)

Fix: inconsistent seq length for probabilities (#1043)

Fix issues with changes in xlrd package (#1056)

v0.3.1(Nov 16, 2020)
Additions

Added dataset module (#949) containing MNIST, SST-2, SST-5, REUTERS, OHSUMED, FEVER and GoEmotions datasets

Add Ludwig Model Serve Example (#947)

Add checksum mechanism for HDF5 and Meta JSON cache file (#1006)

Improvements

Updated run_experiment to use new skip parameters and returns (#955)

Several improvements to testing (more coverage, with faster tests)

Changed default value of HF encoder trainable parameter to True (for performance reasons) (#996)

Improved and slightly modified visualization functions API

Bugfixes

Changed not to is None in dataset checks in hyperopt.run.hyperopt() (#956)

Fix LudwigModel.predict() when skip_save_predictions = False (#962)

Fix #963: Convert materialized tensors to numpy arrays up front to avoid repeated conversion

Fix errors with DataFrame truth checks in hyperopt (#956)

Added truncation to HF tokenizer (#978)

Reimplemented Jaccard Metric for the Set Feature (#979)

Fix learning rate computation with decay and warmup (#982)

Fix CLI logger typos (#998, #999)

Fix loading of split from hdf5 (#1003)

Fix visualization unit tests (#981)

Fix concatenate_csv to work with arbitrary read functions and renamed it to concatenate_datasets

Fix compatibility issue with matplotlib 3.3.3

Limit numpy and h5py max versions due to tensorflow 2.3.1 max supported versions (#990)

Fixed usage of model_load_path with Horovod (#1011)

v0.3(Oct 6, 2020)
Improvements

Full porting to TensorFlow 2.

New hyperparameter optimization functionality through the hyperopt command.

Integration with HuggingFace Transformers for pre-trained text encoders.

Refactored preprocessing with new supported data formats: auto, csv, df, dict, excel, feather, fwf, hdf5 (cache file produced during previous training), html (file containing a single HTML <table>), json, jsonl, parquet, pickle (pickled Pandas DataFrame), sas, spss, stata, tsv.

Improved validation logic.

New Transformer encoders for sequential data types (sequence, text, audio, timeseries).

New batch_predict functionality in the REST API.

New export command to export to SavedModel and Neuropod.

New collect_summary command to print out a model summary with layer names.

Modified the predict command and split it into predict and evaluate. The first only produces predictions, the second evaluates those predictions against ground truth.

Two new hyperopt-related visualizations: hyperopt_report and hyperopt_hiplot.

Improved tracking of metrics in TensorBoard.

Greatly improved test suite.

Various documentation improvements.

Bugfixes

This release includes a fundamental rewrite of the internals, so many bugs have been fixed while rewriting. This list includes only the ones that have a specific Issue associated with them, but many others were addressed.

Fix #649: Replaced SPLIT with 'split' in example code.

Fix documentation, wrong parameter name (#684)

Fix #702: Fixed setting defaults in binary output feature.

Fix #729: Reduce output was not passed to the sequence encoder inside the sequence combiner.

Fix #742: Renamed self._learning_rate in Progresstracker.

Fix #799: Added tf_version to description.json.

Fix #840: Better messaging for plateau logic.

Fix #850: Switch from ValueError to Warning to make stratify work on non-output features.

Fix #844: Load LudwigModel in test_savedmodel before creating saved model.

Fix #833: loads the model after training and before predicting if the model was saved on disk.

Fix #933: Added NumpyDecoder before returning JSON response from server.

Fix #935: Multiple categorical features with different vocabs now work.

Breaking changes

Because of the change in the underlying tensor computation library (TensorFlow 1 to TensorFlow 2) and the internal reworking it required, models trained with v0.2 don't work on v0.3. We suggest retraining such models; in most cases the same model definition can be used. One impactful breaking change is that model_definition is now called config, because it no longer contains only information about the model, but also training, preprocessing, and a newly added hyperopt section.

There have been some changes in the parameters inside the config too. In particular, dropout is now a float value that can be specified for each encoder / combiner / decoder / layer, while before it was a boolean parameter. As a consequence, the dropout_rate parameter in the training section has been removed.
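One way to picture the dropout change is as a small migration over an old definition: every boolean dropout flag becomes a float, and the global dropout_rate is dropped from the training section. The sketch below assumes that an enabled flag should inherit the old global rate and a disabled one becomes 0.0. The helper names and the example definition are hypothetical, and this is not a tool shipped with Ludwig.

```python
from typing import Any


def migrate_dropout(node: Any, rate: float) -> Any:
    """Replace boolean dropout flags with a float rate at every depth."""
    if isinstance(node, dict):
        migrated = {}
        for key, value in node.items():
            if key == "dropout" and isinstance(value, bool):
                # Assumption: enabled flags inherit the old global rate.
                migrated[key] = rate if value else 0.0
            else:
                migrated[key] = migrate_dropout(value, rate)
        return migrated
    if isinstance(node, list):
        return [migrate_dropout(item, rate) for item in node]
    return node


def migrate_config(model_definition: dict) -> dict:
    """Turn a v0.2-style definition into a v0.3-style config (sketch)."""
    config = dict(model_definition)
    training = dict(config.get("training", {}))
    # The global rate disappears; its value moves into each dropout field.
    rate = training.pop("dropout_rate", 0.0)
    if "training" in config:
        config["training"] = training
    return migrate_dropout(config, rate)


# Made-up v0.2-style definition used only for illustration.
old_definition = {
    "input_features": [{"name": "review", "type": "text", "dropout": True}],
    "combiner": {"type": "concat", "dropout": False},
    "training": {"dropout_rate": 0.2, "epochs": 5},
}
print(migrate_config(old_definition))
```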

Another change in training parameters is the set of available optimizers. TensorFlow 2 doesn't ship with some of the ones that were exposed in Ludwig (adagradda, proximalgd, proximaladagrad), and the momentum optimizer has been removed because momentum is now a parameter of the sgd optimizer. Newly added optimizers are nadam and adamax. Note that the accuracy metric for the combined feature has been removed because it was misleading in some scenarios when multiple features of different types were trained.

In most cases, encoders, combiners and decoders now have an increased number of exposed parameters to play with for increased flexibility. One notable change is that the previous BERT encoder has been replaced by a HuggingFace-based one with different parameters, and it is now available only for text features. Please refer to the User Guide for details on each encoder.

Tokenizers also changed substantially, with new parameters supported; refer to the User Guide for more details.

Other major changes are related to the CLI. The predict command has been replaced in functionality by a simplified predict and a new evaluate. The first only produces predictions, the second evaluates those predictions against ground truth. Some parameters of all CLI commands changed. All the different data_* parameters have been replaced by generic dataset, training_set, validation_set and test_set parameters. The data format is determined automatically, but can also be set manually with the data_format argument. There is no gpu_fraction anymore; users can now specify gpu_limit to manage VRAM usage. For all additional minor changes to the CLI, please refer to the User Guide.

The programmatic API changed too, as a consequence. All the parameters now closely match the ones of the CLI, including the new dataset and gpu parameters. Here too, the predict function has been split into predict and evaluate. Finally, the returned values of most functions changed to include some intermediate processing values, such as the preprocessed and split data when calling train, the output experiment directory, and so on. Notably, there is now an experiment function in the API too, together with a new hyperopt one. For more details, refer to the API reference.

Contributors

@jimthompson5802 @tgaddair @kaushikb11 @ANarayan @calio @dme65 @ydudin3 @carlogrisetti @ifokeev @flozi00 @soovam123 @KushalP1 @JiByungKyu @stremlau @adiov @martinremy @dsblank @jakobt @vkuzmin-uber @mbzhu1 @moritzebeling @lnxpy
v0.2.2(Mar 6, 2020)

Improvements

Added integration with Weights and Biases.

Added K-Fold cross validation.

Added 4 examples with their respective code and Jupyter Notebooks: Hyper-parameter optimization, K-Fold Cross Validation, MNIST, Titanic.

Greatly improved the measures tracked on the TensorBoard.

Added auto-detect function for field separator when reading CSVs.

Added CI tooling.

Class weights can be specified as a dictionary (#615).

Removed deprecation warning from h5py.

Removed most deprecation warnings from TensorFlow.

Bypass multiprocessing.Pool.map for faster execution.

Updated TensorFlow dependency to 1.15.2.

Various documentation improvements.

Bugfixes

Fix cudnn error on RTX GPUs.

Fix inverted confusion_matrix axis.

Fix #201: Removed whitespace as a separator option.

Fix #540: Fixed default text parameters for sampled loss.

Fix #541: Docker image improvements (removed libgmp and spacy model download).

Fix #554: Fix audio input test case in docker container.

Fix #570: Temporary solution for in_memory flag usage in API.

Fix #574: Setting intra and inter op parallelism to 0 so that TF determines them automatically.

Fix #329 and #575: Fixed use of SavedModel and added an integration test.

Fix #609: When predicting, if a split is in the CSV, data is split correctly.

Fix #616: Change preprocessing in siamese network example.

Fix #620: Failure in unit tests for 1 vs all calibration plots.

Fix #632: Setting minimum version requirements for six.

Fix #636: CLI output table column ordering preserved when resuming.

Fix #641: Added multi-task learning section specifying the weight for each output feature in the User Guide.

Fix #642: Fixing horovod use when loading a model as initialization.

Contributors

@jimthompson5802 @calz1 @pingsutw @vanpelt @carlogrisetti @anttisaukko @dsblank @borisdayma @flozi00 @jshah02
v0.2.1(Oct 13, 2019)

Improvements

Add Filter Bank features for audio.

Added two more parameters, skip_save_test_predictions and skip_save_test_statistics, to the train and experiment CLI commands and API.

Updated to spaCy 2.2 with support for Norwegian and Lithuanian tokenizers.

Reorganized dependencies: the defaults are now barebones and there are several extra ones.

Added fc_layers to H3 embed encoder.

Added get_preprocessing_params in preprocessing.

Refactored image features preprocessing to use multiprocessing.

Refactored preprocessing with strategy pattern.

Bugfixes

Fix #452: Removed dependency on gmpy. Fix #465: Adds capability to set the vocabulary from a GloVe file. Fix #480: Adds a health check to ludwig serve. Fix #481: Added some examples of visualization commands. Fix #491: Improved skip parameters, now no directories are created if not needed. Fix #492: Adds skip saving unprocessed output to api.py. Fix #493: Added parameters for the vocabulary file and the UNK and PAD symbols in sequence feature call to create_vocabulary in the calculation of metadata. Fix #500: Fixed learning_curves() when the training statistics file does not contain validation. Fix #509: Fixes in_memory issues in image features. Fix #525: Adding check is_on_master() before creating save_path directory. Fix #510: Fixed version of pydantic. Fix #532: Improved speed of add_sequence_feature_column().

Potentially breaking changes

Fix #520: Renamed the field parameter in visualization to output_feature_name for clarity and improved documentation. Please make sure to rename your function calls if you were passing this parameter by name (the positional order stays the same).

Contributors

@sriki18 @carlogrisetti @areeves87 @naresh-bhandari @revolunet @patrickvonplaten @Athanaziz @dsblank @tgaddair @Mechachleopteryx @AlexeyGy @yu-iskw
v0.2(Jul 24, 2019)

Improvements

New BERT encoder with its BPE tokenizer

Added Audio features that can also be used for speech data (with appropriate preprocessing feature extraction)

Added H3 feature, together with 3 encoders to deal with spatial information

Added Date feature and two encoders to deal with temporal information

Improved Comet.ml integration

Refactored visualization.py to make individual functions usable from the API

Added capability of saving the visualization graph in the visualization command and visualizations_utils.py

Added a serve command that allows for spawning a prediction server using FastAPI

Added a test command (that requires output columns in the data) to avoid confusion with predict (which does not require output columns)

Added pixel normalization and pixel standardization scaling options for image features

Added greyscaling of images if specified channels = 1 and img channels is 3 or 4

Added normalization strategies for numerical features (#367)

Added experiment name parameter in the API (#357)

Refactored text tokenizers

Several improvements in logging

Added a method for saving models with SavedModels in model.py and exposed it in the API with a save_for_serving() function (#329)(#425)

Upgraded to the latest version of TensorFlow 1.14 (#429)

Added learning rate warmup for non-distributed settings

Bugfixes

Fix #321: Removed the 6n+2 check for ResNet size

Fix #328: Adds missing UPDATE_OPS to the optimization operation

Fix #336: GloVe embeddings loading now reads utf-8 encoded files

Fix #336: Addresses the malformed lines issue in embeddings loading

Fix #346: Added a parameter indicating if the session should be closed after training in full_train

Fix #351: Values in categorical columns are now stripped before being compared to the vocabulary

Fix #364: Associate the right function to non-English text format functions

Fix #372: Set evaluate performance parameter to false in predict.py

Fix #394: Improved error explanation when image dimensions don't match and improved documentation accordingly

Fix #411: Images in HDF5 are now correctly saved as uint8 instead of int8

Fix #431: Missing libgmp3-dev dependency in docker (#428)

Fixed image resizing

Fix model load path (#424)

Fix batch norm in convolutional layers (now uses tf internal layer and not the one in contrib)

Several additional minor fixes
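To see why the dtype in Fix #411 matters, the short sketch below shows what happens to 8-bit pixel values when the same byte is read as a signed int8 rather than an unsigned uint8. It illustrates the general hazard using only the standard library; the helper name is made up for this example, and the note does not describe how the original bug actually manifested in Ludwig.

```python
def reinterpret_as_int8(pixel: int) -> int:
    """Return the value a 0-255 pixel takes when its byte is read as signed int8."""
    return int.from_bytes(bytes([pixel]), byteorder="little", signed=True)


pixels = [0, 127, 128, 200, 255]

# Values above 127 wrap around to negative numbers under a signed interpretation.
as_int8 = [reinterpret_as_int8(p) for p in pixels]
```

Every pixel brighter than 127 ends up negative, so the bright half of the intensity range is no longer represented correctly.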

Contributors

@carlogrisetti @jaipradeesh @glongh @dsblank @danicattaneob @gogasca @lordeddard @IgorWilbert @patrickvonplaten @ojus1 @jimthompson5802 @johnwahba @revolunet
v0.1.2(Apr 27, 2019)
Improvements

Improved import speed by ~50%

Improved Comet.ml integration

Replaced only_predict with evaluate_performance (and flipped the logic) in all predict commands and functions

Refactored preprocessing functions for improved testability, understandability and extensibility

Added data_dict to the train method in LudwigModel

Improved tests speed

Bugfixes

Fix issue #283: word_format in text features is now properly used

Fix issue #286: avoid using signal when not on main thread

Fix issue where the order of operations when preprocessing images between resizing and changing channels was inverted

Fix safety issues: now using yaml.safe_load instead of yaml.load and replaced pickling of the progress tracker with a JSON equivalent
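The second half of this fix, replacing a pickled object with a JSON file, can be sketched as follows. Unpickling can execute arbitrary code from an untrusted file, whereas JSON can only produce plain data. The `ProgressTracker` class and its fields here are hypothetical stand-ins, not Ludwig's actual tracker, and the sketch covers only the JSON part (`yaml.safe_load` needs the third-party PyYAML package).

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class ProgressTracker:
    """Hypothetical training-progress record made of JSON-friendly fields."""
    epoch: int
    steps: int
    best_valid_loss: float
    history: list


def save_tracker(tracker: ProgressTracker, path: str) -> None:
    # Serialize the tracker as plain data rather than a pickled object.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(tracker), f)


def load_tracker(path: str) -> ProgressTracker:
    # Loading JSON yields only dicts, lists, numbers and strings.
    with open(path, encoding="utf-8") as f:
        return ProgressTracker(**json.load(f))


tracker = ProgressTracker(epoch=3, steps=1200, best_valid_loss=0.42,
                          history=[0.9, 0.6, 0.42])

with tempfile.TemporaryDirectory() as tmp_dir:
    tracker_path = os.path.join(tmp_dir, "progress.json")
    save_tracker(tracker, tracker_path)
    restored = load_tracker(tracker_path)
```

A side benefit of this design is that the saved file is human-readable, which makes it easier to inspect when resuming a run.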

Fix minor bug with missing tied_weights key in some features

Fixed a few minor issues discovered with deepsource.io

Other Changes

LudwigModel was previously imported from ludwig; it should now be imported from ludwig.api. This change was needed to speed up imports
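For code that needs updating, the new import looks like the sketch below. The `try`/`except` guard is an addition for this example so the snippet also runs where Ludwig is not installed; it is not something the release note prescribes.

```python
try:
    # Import location from v0.1.2 onward, per the note above.
    from ludwig.api import LudwigModel
except ImportError:
    # Ludwig is not installed in this environment.
    LudwigModel = None
```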

Contributors

@dsblank @Ignisor @bertyhell @jaipradeesh
v0.1.1(Apr 9, 2019)
New features and improvements

Updated to tensorflow 1.13.1 and spacy 2.1 (this also makes Ludwig compatible with Python 3.7)

Added an initial integration with Comet.ml

Added support for text preprocessing of additional languages: Italian, Spanish, German, French, Portuguese, Dutch, Greek and Multi-language (Feature Request #251).

Added skip_save_progress, skip_save_model and skip_save_log parameters

Improved the default parameters of the image feature (this may make previously trained models including image features not compatible. If that is the case retrain your model)

Added PassthroughEncoder

Added eval_batch_size parameter

Added sanity checks for model definitions, with improved error messages

Add Dockerfile for running Ludwig on a CPU

Added clip parameter to numerical output features

Added a full MNIST training example, a fraud detection example and a more complex regression example on fuel consumption

Bug fixes

Fix issue #56: removing just keys that exist in dataset when replacing text feature names concatenating their level

Fix issue #46 #144: Solved Mac OS X mpl.use('TkAgg') use

Fix issue #74: Call subprocess within try except

Fix issue #81: Opens a file before calling yaml.load()

Fix issue #90: Forcing csv writer to write utf-8 encoded files

Fix issue #120: Missing sgd (and synonyms) key in optimizers default

Fix issue #64: Fix for files with capitalized extensions

Fix issue #121: Typo bucketin_field to bucketing_field

Fix training when validation or test CSVs are provided separately

Fix issue #112: dataframe_df may not have a csv attribute

Fix missing checks if dataset is None in preprocessing.py and api.py

Fix error measure aggregation and default value

Fix image interpolation

Fix preprocessing_defaults error in bag_feature.py

Fix text output features populate_defaults() and update_model_definition_with_metadata()

Fix in timeseries placeholder datatype

Moved image preprocessing params to preprocessing section (this may make previously trained models including image features not compatible. If that is the case retrain your model)

Fix warmup learning rate function for distributed training

Fix issue #214: replace_text_feature_level usage in api.py

Fix issue #214: replaced SPACE_PUNCTUATION_REGEX

Fix issue #229 #100: solved missing hdf5 / csv file reference

Fix issue #222: incorrect logging in read_csv

Fix issue #194: Renaming class_distance to class_similarities and several bugfixes regarding class_similarities, class_weights and their interaction at model building time

Fix issue #100 #225: solves image prediction issues

Fix issue #98: solves dealing with images with different numbers of channels, including transparencies

Fix unwanted creation of hdf5 files when running ludwig.predict on images

And a few more minor fixes

Contributors

Thanks to all our amazing contributors (some of your PRs were not merged, but we used some of their code in our commits, so thank you anyway!): @dsblank @MariusDanner @BenMacKenzie @Barathwaja @gabefair @kevinqz @yantsey @jontonsoup4 @Praneet460 @DakshMiglani @syeef @Tejaf @rolisz @JakeConnors376W @AndyZZH @us @0xflotus @laserbeam3 @krychu @dettmering @bbrodsky @c-m-hunt @C0deFxxker @hemchander23 @Shivam-Beeyani @yashrajbharti @rbramwell @emushtaq @EBazarov @graytowne @jovilius @ivanhe @philippgille @floscha
v0.1.0(Feb 11, 2019)

This is the first public release of Ludwig
