An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.

Microsoft

Last update: Dec 31, 2022


Overview


NNI (Neural Network Intelligence) is a lightweight but powerful toolkit to help users automate Feature Engineering, Neural Architecture Search, Hyperparameter Tuning and Model Compression.

The tool manages automated machine learning (AutoML) experiments: it dispatches and runs the trial jobs generated by tuning algorithms to search for the best neural architecture and/or hyper-parameters in different training environments such as Local Machine, Remote Servers, OpenPAI, Kubeflow, FrameworkController on K8S (AKS etc.), DLWorkspace (aka DLTS), AML (Azure Machine Learning), AdaptDL (aka ADL), other cloud options, and even Hybrid mode.

Who should consider using NNI

Those who want to try different AutoML algorithms in their training code/model.
Those who want to run AutoML trial jobs in different environments to speed up search.
Researchers and data scientists who want to easily implement and experiment with new AutoML algorithms, whether a hyperparameter tuning algorithm, a neural architecture search algorithm or a model compression algorithm.
ML Platform owners who want to support AutoML in their platform.

What's NEW!

New release: v2.5 is available - released on Nov-04-2021
New demo available: Youtube entry | Bilibili entry - last updated on May-26-2021
New webinar: Introducing Retiarii: A deep learning exploratory-training framework on NNI - scheduled on June-24-2021
New community channel: Discussions
New emoticons release: nnSpider

NNI capabilities at a glance

NNI provides a command-line tool as well as a user-friendly WebUI to manage training experiments. With the extensible API, you can customize your own AutoML algorithms and training services. To make it easy for new users, NNI also provides a set of built-in state-of-the-art AutoML algorithms and out-of-the-box support for popular training platforms.

The following summarizes the current NNI capabilities. We are gradually adding new capabilities and we'd love to have your contribution.

Built-in

Frameworks & Libraries

Supported Frameworks: PyTorch, Keras, TensorFlow, MXNet, Caffe2, and more
Supported Libraries: Scikit-learn, XGBoost, LightGBM, and more
Examples: more...

Algorithms

Hyperparameter Tuning: exhaustive search, heuristic search, Bayesian optimization
Neural Architecture Search (Retiarii)
Model Compression: pruning, quantization
Feature Engineering (Beta)
Early Stop Algorithms

Training Services

References: Support TrainingService, Implement TrainingService

Installation

Install

NNI supports and is tested on Ubuntu >= 16.04, macOS >= 10.14.1, and Windows 10 >= 1809. Simply run the following pip install command in an environment that has 64-bit Python >= 3.6.

Linux or macOS

python3 -m pip install --upgrade nni

Windows

python -m pip install --upgrade nni
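
Before running the install command, it can help to confirm that the interpreter meets the stated requirement (64-bit Python >= 3.6). The following is a minimal sketch using only the standard library; the function names `is_64bit_python` and `meets_nni_requirements` are hypothetical helpers for illustration and are not part of NNI.

```python
import struct
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in the install notes above


def is_64bit_python() -> bool:
    """Return True when the running interpreter uses 64-bit pointers."""
    return struct.calcsize("P") * 8 == 64


def meets_nni_requirements(version=None, is_64bit=None) -> bool:
    """Check an interpreter against the stated requirement (64-bit, >= 3.6).

    Both arguments default to the values of the running interpreter;
    they can be passed explicitly to check another configuration.
    """
    if version is None:
        version = sys.version_info[:2]
    if is_64bit is None:
        is_64bit = is_64bit_python()
    return tuple(version) >= MIN_PYTHON and bool(is_64bit)


if __name__ == "__main__":
    verdict = "OK" if meets_nni_requirements() else "not supported"
    print("Python", sys.version.split()[0], verdict)
```

Run it with the same interpreter that will run `pip`, so the check applies to the environment NNI is installed into.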

If you want to try the latest code, please install NNI from source code.

For detailed system requirements of NNI, please refer to the installation documentation for Linux & macOS and for Windows.

Note:

If there is any privilege issue, add --user to install NNI in the user directory.
Currently NNI on Windows supports local, remote and pai mode. Anaconda or Miniconda is highly recommended to install NNI on Windows.
If there is any error like Segmentation fault, please refer to FAQ. For FAQ on Windows, please refer to NNI on Windows.

Verify installation

Download the examples by cloning the source code.

git clone -b v2.5 https://github.com/Microsoft/nni.git

Run the MNIST example.

Linux or macOS

nnictl create --config nni/examples/trials/mnist-pytorch/config.yml

Windows

nnictl create --config nni\examples\trials\mnist-pytorch\config_windows.yml

Wait for the message INFO: Successfully started experiment! in the command line. This message indicates that your experiment has been successfully started. You can explore the experiment using the Web UI URL.

INFO: Starting restful server...
INFO: Successfully started Restful server!
INFO: Setting local config...
INFO: Successfully set local config!
INFO: Starting experiment...
INFO: Successfully started experiment!
-----------------------------------------------------------------------
The experiment id is egchD4qy
The Web UI urls are: http://223.255.255.1:8080   http://127.0.0.1:8080
-----------------------------------------------------------------------

You can use these commands to get more information about the experiment
-----------------------------------------------------------------------
         commands                       description
1. nnictl experiment show        show the information of experiments
2. nnictl trial ls               list all of trial jobs
3. nnictl top                    monitor the status of running experiments
4. nnictl log stderr             show stderr log content
5. nnictl log stdout             show stdout log content
6. nnictl stop                   stop an experiment
7. nnictl trial kill             kill a trial job by id
8. nnictl --help                 get help information about nnictl
-----------------------------------------------------------------------

Open the Web UI URL in your browser to view detailed information about the experiment and all the submitted trial jobs. More Web UI pages are described in the NNI documentation.
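
The same port that serves the Web UI also serves NNI's REST API; the traceback in the first issue report further down this page shows the Python client polling `/api/v1/nni/check-status` on `localhost:8080`. As a rough sketch of how one might check from a script whether the server is still reachable, the snippet below requests that path with the standard library. The helper names `status_url` and `fetch_status` are hypothetical, and the format of the response body is not documented in this page, so it is returned as raw text rather than parsed.

```python
import urllib.request

API_PREFIX = "/api/v1/nni"  # path prefix as it appears in the traceback on this page


def status_url(port: int = 8080, host: str = "localhost") -> str:
    """Build the check-status URL for an experiment's REST server."""
    return f"http://{host}:{port}{API_PREFIX}/check-status"


def fetch_status(port: int = 8080, timeout: float = 2.0):
    """Return the raw response body as text, or None if the server is unreachable.

    urllib raises URLError (a subclass of OSError) for refused connections,
    timeouts and HTTP error codes, so a single except clause covers those cases.
    """
    try:
        with urllib.request.urlopen(status_url(port), timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return None


if __name__ == "__main__":
    body = fetch_status(8080)
    print("server not reachable" if body is None else body)
```

A `None` result corresponds to the "Connection refused" situation in that report: the REST server process is no longer listening on the port.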

Releases and Contributing

NNI has a monthly release cycle (major releases). Please let us know if you encounter a bug by filing an issue.

We appreciate all contributions. If you are planning to contribute any bug-fixes, please do so without further discussions.

If you plan to contribute new features, new tuners, new training services, etc., please first open an issue or reuse an existing issue, and discuss the feature with us. We will respond on the issue promptly or set up conference calls if needed.

To learn more about making a contribution to NNI, please refer to our How-to contribution page.

We appreciate all contributions and thank all the contributors!

Feedback

File an issue on GitHub.
Open or participate in a discussion.
Discuss on the NNI Gitter.

Join IM discussion groups: Gitter or WeChat

Test status

Essentials

Type	Status
Fast test
Full linux
Full windows

Training services

Type	Status
Remote - linux to linux
Remote - linux to windows
Remote - windows to linux
OpenPAI
Frameworkcontroller
Kubeflow
Hybrid
AzureML

Related Projects

Aiming at openness and advancing state-of-the-art technology, Microsoft Research (MSR) has also released a few other open source projects.

OpenPAI : an open source platform that provides complete AI model training and resource management capabilities. It is easy to extend and supports on-premise, cloud and hybrid environments at various scales.
FrameworkController : an open source general-purpose Kubernetes Pod Controller that orchestrates all kinds of applications on Kubernetes with a single controller.
MMdnn : A comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. The "MM" in MMdnn stands for model management and "dnn" is an acronym for deep neural network.
SPTAG : Space Partition Tree And Graph (SPTAG) is an open source library for large-scale approximate nearest neighbor search over vectors.
nn-Meter : An accurate inference latency predictor for DNN models on diverse edge devices.

We encourage researchers and students to leverage these projects to accelerate AI development and research.

License

The entire codebase is under the MIT license.

Comments

The example code crashes after running normally for about 10 s.

Describe the issue: I am trying your PyTorch example from https://nni.readthedocs.io/zh/stable/tutorials/hpo_quickstart_pytorch/model.html. Everything goes well at first: the installation succeeds and the web page looks nice. But after about 10 s, the program terminates.

[2023-01-02 01:56:47] Creating experiment, Experiment ID: 6j50nacv
[2023-01-02 01:56:47] Starting web server...
[2023-01-02 01:56:48] Setting up...
[2023-01-02 01:56:48] Web portal URLs: http://127.0.0.1:8080 http://172.18.36.113:8080
node:internal/fs/watchers:252
    throw error;
    ^

Error: ENOSPC: System limit for number of file watchers reached, watch '/home/linux_username/nni-experiments/6j50nacv/trials/gnav8/.nni/metrics'
    at FSWatcher.<computed> (node:internal/fs/watchers:244:19)
    at Object.watch (node:fs:2251:34)
    at TailStream.waitForMoreData (/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni_node/node_modules/tail-stream/index.js:123:31)
    at TailStream.<anonymous> (/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni_node/node_modules/tail-stream/index.js:275:22)
    at FSReqCallback.wrapper [as oncomplete] (node:fs:660:5) {
  errno: -28,
  syscall: 'watch',
  code: 'ENOSPC',
  path: '/home/linux_username/nni-experiments/6j50nacv/trials/gnav8/.nni/metrics',
  filename: '/home/linux_username/nni-experiments/6j50nacv/trials/gnav8/.nni/metrics'
}
Thrown at:
    at __node_internal_captureLargerStackTrace (node:internal/errors:464:5)
    at __node_internal_uvException (node:internal/errors:521:10)
    at FSWatcher.<computed> (node:internal/fs/watchers:244:19)
    at watch (node:fs:2251:34)
    at TailStream.waitForMoreData (/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni_node/node_modules/tail-stream/index.js:123:31)
    at /home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni_node/node_modules/tail-stream/index.js:275:22
    at wrapper (node:fs:660:5)
Traceback (most recent call last):
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/__main__.py", line 85, in <module>
    main()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/__main__.py", line 61, in main
    dispatcher.run()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run
    command, data = self._channel._receive()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/tuner_command_channel/channel.py", line 94, in _receive
    command = self._retry_receive()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/tuner_command_channel/channel.py", line 104, in _retry_receive
    self._channel.connect()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 62, in connect
    self._ws = _wait(_connect_async(self._url))
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 111, in _wait
    return future.result()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 125, in _connect_async
    return await websockets.connect(url, max_size=None)  # type: ignore
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/websockets/legacy/client.py", line 659, in __await_impl_timeout__
    return await asyncio.wait_for(self.__await_impl__(), self.open_timeout)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/tasks.py", line 479, in wait_for
    return fut.result()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/websockets/legacy/client.py", line 663, in __await_impl__
    _transport, _protocol = await self._create_connection()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/base_events.py", line 1065, in create_connection
    raise exceptions[0]
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/base_events.py", line 1050, in create_connection
    sock = await self._connect_sock(
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/base_events.py", line 961, in _connect_sock
    await self.sock_connect(sock, address)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/selector_events.py", line 500, in sock_connect
    return await fut
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/selector_events.py", line 535, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 8080)
Traceback (most recent call last):
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 980, in send
    self.connect()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f9d7c6042e0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9d7c6042e0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/linux_username/yecanming/repo/new_things_exploring/tunning_parameter/nni/nni_main.py", line 23, in <module>
    experiment.run(8080)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/experiment.py", line 183, in run
    self._wait_completion()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/experiment.py", line 163, in _wait_completion
    status = self.get_status()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/experiment.py", line 283, in get_status
    resp = rest.get(self.port, '/check-status', self.url_prefix)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/rest.py", line 43, in get
    return request('get', port, api, prefix=prefix)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9d7c6042e0>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2023-01-02 01:57:08] Stopping experiment, please wait...
[2023-01-02 01:57:08] ERROR: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/experiment (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9d929a5c40>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 980, in send
    self.connect()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f9d929a5c40>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/experiment (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9d929a5c40>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/experiment.py", line 143, in _stop_impl
    rest.delete(self.port, '/experiment', self.url_prefix)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/rest.py", line 52, in delete
    request('delete', port, api, prefix=prefix)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/adapters.py", line 565, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/experiment (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9d929a5c40>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2023-01-02 01:57:08] WARNING: Cannot gracefully stop experiment, killing NNI process...
[2023-01-02 01:57:08] Experiment stopped

Environment:

NNI version: 2.10
Training service (local|remote|pai|aml|etc): local
Client OS: Linux qaz 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Server OS (for remote mode only):
Python version: Python 3.9.13
PyTorch/TensorFlow version: '1.13.0+cu116'
Is conda/virtualenv/venv used?: conda is used
Is running in Docker?: no

Configuration:

Experiment config (remember to remove secrets!):

experiment.config.trial_command = 'python run_nn.py'
experiment.config.trial_code_directory = '.'

experiment.config.search_space = search_space

experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'

experiment.config.max_trial_number = 10
experiment.config.trial_concurrency = 2

experiment.config.max_experiment_duration = '1h'

Search space:

Log message:

nnimanager.log: cannot find your tutorial at "https://github.com/microsoft/nni/blob/master/docs/en_US/Tutorial/HowToDebug.md#experiment-root-director"
dispatcher.log:
nnictl stdout and stderr:

How to reproduce it?:

opened by 2catycm 0
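
The first error in the log of the report above, `ENOSPC: System limit for number of file watchers reached`, is raised by Node.js when the Linux inotify watch limit of the current user is exhausted; the later "Connection refused" tracebacks follow from the web server process having exited. A minimal sketch for inspecting that limit, assuming a Linux host that exposes `/proc/sys/fs/inotify/max_user_watches`; the helper name `read_watch_limit` is hypothetical and not part of NNI.

```python
from pathlib import Path

# Standard location of the per-user inotify watch limit on Linux.
WATCH_LIMIT_FILE = Path("/proc/sys/fs/inotify/max_user_watches")


def read_watch_limit(path=WATCH_LIMIT_FILE):
    """Return the inotify watch limit as an int, or None if it cannot be read.

    None is returned on systems without this file (for example macOS or
    Windows) or when the file content is not a plain integer.
    """
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    limit = read_watch_limit()
    if limit is None:
        print("inotify watch limit not available on this system")
    else:
        print(f"fs.inotify.max_user_watches = {limit}")
```

On most distributions an administrator can raise the value through the `fs.inotify.max_user_watches` sysctl. Whether doing so resolves this particular report is not confirmed in the thread.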

Prune Problem
Describe the issue: The code runs normally before adding the pruning code (L2FilterPruner), but raises an error after adding it. My project code is from: https://github.com/minar09/cp-vton-plus. The error, the failing part and my code were attached to the original post.

Environment:

NNI version:2.10

Training service (local|remote|pai|aml|etc): local

Client OS:

Server OS (for remote mode only):

Python version:3.10

PyTorch/TensorFlow version:1.12.1

Is conda/virtualenv/venv used?: yes

Is running in Docker?:

Configuration:

Experiment config (remember to remove secrets!):

Search space:

Log message:

nnimanager.log:

dispatcher.log:

nnictl stdout and stderr:

How to reproduce it?:
opened by ShuYangXie 0

yolov5prune error

Describe the issue: I'm trying to prune the pre-trained model yolov5n-0.5 from Yolov5-face. Here is the code I used:

import torch, torchvision
from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner, L2NormPruner,FPGMPruner,ActivationAPoZRankPruner
from nni.compression.pytorch.speedup import ModelSpeedup
from rich import print
from utils.general import check_img_size
from models.common import Conv
from models.experimental import attempt_load
from models.yolo import Detect
from utils.activations import SiLU
import torch.nn as nn
from nni.compression.pytorch.utils.counter import count_flops_params

class SiLU(nn.Module):  # export-friendly version of nn.SiLU()
    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)

device = torch.device("cuda:1")
model = attempt_load('/data03/hezhenhui/project/helmet/yolov5-6.0/runs/train/helmet6/weights/best.pt', map_location=device, inplace=True, fuse=True) # load FP32 model
model.eval()

for k, m in model.named_modules():
    if isinstance(m, Conv):  # assign export-friendly activations
        if isinstance(m.act, nn.SiLU):
            m.act = SiLU()
    elif isinstance(m, Detect):  # must be a sibling of the Conv check; nested under it, this branch could never run
        m.inplace = False
    m.onnx_dynamic = False
    if hasattr(m, 'forward_export'):
        m.forward = m.forward_export  # assign custom forward (optional)


imgsz = (640, 640)
imgsz *= 2 if len(imgsz) == 1 else 1 # expand

gs = int(max(model.stride)) # grid size (max stride)
imgsz = [check_img_size(x, gs) for x in imgsz] # verify img_size are gs-multiples
im = torch.zeros(1, 3, *imgsz).to(device)  # dummy input in BCHW layout, here (1, 3, 640, 640)
dummy_input = im

cfg_list = [{
'sparsity': 0.3, 'op_types': ['Conv2d'],'op_names': [
    'model.0.conv',
    'model.1.conv',
    'model.2.cv1.conv',
    'model.2.cv2.conv',
    'model.2.cv3.conv',
    'model.2.m.0.cv1.conv',
    'model.2.m.0.cv2.conv',
    'model.2.m.1.cv1.conv',
    'model.2.m.1.cv2.conv',
    'model.2.m.2.cv1.conv',
    'model.2.m.2.cv2.conv',
    'model.2.m.3.cv1.conv',
    'model.2.m.3.cv2.conv',
    'model.3.conv',
    'model.4.cv1.conv',
    'model.4.cv2.conv',
    'model.4.cv3.conv',
    'model.4.m.0.cv1.conv',
    'model.4.m.0.cv2.conv',
    'model.4.m.1.cv1.conv',
    'model.4.m.1.cv2.conv',
    'model.4.m.2.cv1.conv',
    'model.4.m.2.cv2.conv',
    'model.4.m.3.cv1.conv',
    'model.4.m.3.cv2.conv',
    'model.4.m.4.cv1.conv',
    'model.4.m.4.cv2.conv',
    'model.4.m.5.cv1.conv',
    'model.4.m.5.cv2.conv',
    'model.4.m.6.cv1.conv',
    'model.4.m.6.cv2.conv',
    'model.4.m.7.cv1.conv',
    'model.4.m.7.cv2.conv',
    'model.5.conv',
    'model.6.cv1.conv',
    'model.6.cv2.conv',
    'model.6.cv3.conv',
    'model.6.m.0.cv1.conv',
    'model.6.m.0.cv2.conv',
    'model.6.m.1.cv1.conv',
    'model.6.m.1.cv2.conv',
    'model.6.m.2.cv1.conv',
    'model.6.m.2.cv2.conv',
    'model.6.m.3.cv1.conv',
    'model.6.m.3.cv2.conv',
    'model.6.m.4.cv1.conv',
    'model.6.m.4.cv2.conv',
    'model.6.m.5.cv1.conv',
    'model.6.m.5.cv2.conv',
    'model.6.m.6.cv1.conv',
    'model.6.m.6.cv2.conv',
    'model.6.m.7.cv1.conv',
    'model.6.m.7.cv2.conv',
    'model.6.m.8.cv1.conv',
    'model.6.m.8.cv2.conv',
    'model.6.m.9.cv1.conv',
    'model.6.m.9.cv2.conv',
    'model.6.m.10.cv1.conv',
    'model.6.m.10.cv2.conv',
    'model.6.m.11.cv1.conv',
    'model.6.m.11.cv2.conv',
    'model.7.conv',
    'model.8.cv1.conv',
    'model.8.cv2.conv',
    'model.8.cv3.conv',
    'model.8.m.0.cv1.conv',
    'model.8.m.0.cv2.conv',
    'model.8.m.1.cv1.conv',
    'model.8.m.1.cv2.conv',
    'model.8.m.2.cv1.conv',
    'model.8.m.2.cv2.conv',
    'model.8.m.3.cv1.conv',
    'model.8.m.3.cv2.conv',
    'model.9.cv1.conv',
    'model.9.cv2.conv',
    'model.10.conv',
    'model.13.cv1.conv',
    'model.13.cv2.conv',
    'model.13.cv3.conv',
    'model.13.m.0.cv1.conv',
    'model.13.m.0.cv2.conv',
    'model.13.m.1.cv1.conv',
    'model.13.m.1.cv2.conv',
    'model.13.m.2.cv1.conv',
    'model.13.m.2.cv2.conv',
    'model.13.m.3.cv1.conv',
    'model.13.m.3.cv2.conv',
    'model.14.conv',
    'model.17.cv1.conv',
    'model.17.cv2.conv',
    'model.17.cv3.conv',
    'model.17.m.0.cv1.conv',
    'model.17.m.0.cv2.conv',
    'model.17.m.1.cv1.conv',
    'model.17.m.1.cv2.conv',
    'model.17.m.2.cv1.conv',
    'model.17.m.2.cv2.conv',
    'model.17.m.3.cv1.conv',
    'model.17.m.3.cv2.conv',
    'model.18.conv',
    'model.20.cv1.conv',
    'model.20.cv2.conv',
    'model.20.cv3.conv',
    'model.20.m.0.cv1.conv',
    'model.20.m.0.cv2.conv',
    'model.20.m.1.cv1.conv',
    'model.20.m.1.cv2.conv',
    'model.20.m.2.cv1.conv',
    'model.20.m.2.cv2.conv',
    'model.20.m.3.cv1.conv',
    'model.20.m.3.cv2.conv',
    'model.21.conv',
    'model.23.cv1.conv',
    'model.23.cv2.conv',
    'model.23.cv3.conv',
    'model.23.m.0.cv1.conv',
    'model.23.m.0.cv2.conv',
    'model.23.m.1.cv1.conv',
    'model.23.m.1.cv2.conv',
    'model.23.m.2.cv1.conv',
    'model.23.m.2.cv2.conv',
    'model.23.m.3.cv1.conv',
    'model.23.m.3.cv2.conv'
    ]
},
{
'op_names':['model.24.m.0','model.24.m.1','model.24.m.2'],
'exclude': True
    }
]


pruner = L1NormPruner(model, cfg_list)
_, masks = pruner.compress()
# print(masks)
pruner.export_model(model_path='helmet_yolov5s.pt', mask_path='helmet_mask.pt')
pruner.show_pruned_weights()
pruner._unwrap_model()

print("im.shape:",dummy_input.shape)

But it always throws this error:

ERROR: Tensor-valued Constant nodes differed in value across invocations. This often indicates that the tracer has encountered untraceable code.
        Node:
                %864 : Tensor = prim::Constant[value={2}](), scope: __module.model.24 # /data03/hezhenhui/project/helmet/yolov5-6.0/models/yolo.py:66:0
        Source Location:
                /data03/hezhenhui/project/helmet/yolov5-6.0/models/yolo.py(66): forward
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/torch/nn/modules/module.py(709): _slow_forward
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/torch/nn/modules/module.py(725): _call_impl
                /data03/hezhenhui/project/helmet/yolov5-6.0/models/yolo.py(149): _forward_once
                /data03/hezhenhui/project/helmet/yolov5-6.0/models/yolo.py(126): forward
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/torch/nn/modules/module.py(709): _slow_forward
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/torch/nn/modules/module.py(725): _call_impl
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/torch/jit/_trace.py(934): trace_module
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/torch/jit/_trace.py(733): trace
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/nni/common/graph_utils.py(91): _trace
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/nni/common/graph_utils.py(67): __init__
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/nni/common/graph_utils.py(265): __init__
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/nni/common/graph_utils.py(25): build_module_graph
                /data03/hezhenhui/.conda/envs/tdn/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compressor.py(73): __init__
                prune_nni.py(242): <module>
        Comparison exception:   expand(torch.cuda.FloatTensor{[1, 3, 40, 40, 2]}, size=[]): the number of sizes provided (0) must be greater or equal to the number of dimensions in the tensor (5)

I can't find a solution to this problem. Can you give some advice?

Environment:

NNI version:2.10
Cent OS version:Linux version 3.10.0-957.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Thu Oct 4 20:48:51 UTC 2018
Python version:3.8.13
PyTorch version:1.7.1

opened by Turing77 0

NetWork Error
Describe the issue: When I use NNI, a "NetWork Error" message appears after at most 10 minutes, and then the connection to the port is dropped. I would like to know what causes this. I am using Windows 10.

Environment:

NNI version: 2.10

Training service (local|remote|pai|aml|etc):local

Client OS:windows 10

Server OS (for remote mode only):

Python version:3.7

PyTorch/TensorFlow version:torch==1.8.1

Is conda/virtualenv/venv used?:conda

Is running in Docker?:no

Configuration:

Experiment config (remember to remove secrets!):

Search space:websockets.exceptions.InvalidMessage: did not receive a valid HTTP response

Log message:

nnimanager.log:

dispatcher.log:

[2022-12-26 10:35:12] INFO (nni.tuner.tpe/MainThread) Using random seed 175818551
[2022-12-26 10:35:12] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2022-12-26 10:41:26] WARNING (nni.runtime.tuner_command_channel.channel/MainThread) Exception on receiving: ConnectionClosedError(None, None, None)
[2022-12-26 10:41:26] WARNING (nni.runtime.tuner_command_channel.channel/MainThread) Connection lost. Trying to reconnect...
[2022-12-26 10:41:26] INFO (nni.runtime.tuner_command_channel.channel/MainThread) Attempt #0, wait 0 seconds...
[2022-12-26 10:41:26] INFO (nni.runtime.msg_dispatcher_base/MainThread) Report error to NNI manager: Traceback (most recent call last):
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\websockets\legacy\client.py", line 138, in read_http_response
    status_code, reason, headers = await read_response(self.reader)
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\websockets\legacy\http.py", line 120, in read_response
    status_line = await read_line(stream)
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\websockets\legacy\http.py", line 194, in read_line
    line = await stream.readline()
  File "E:\Anaconda\install\envs\pytorch\lib\asyncio\streams.py", line 496, in readline
    line = await self.readuntil(sep)
  File "E:\Anaconda\install\envs\pytorch\lib\asyncio\streams.py", line 588, in readuntil
    await self._wait_for_data('readuntil')
  File "E:\Anaconda\install\envs\pytorch\lib\asyncio\streams.py", line 473, in _wait_for_data
    await self._waiter
  File "E:\Anaconda\install\envs\pytorch\lib\asyncio\selector_events.py", line 814, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\nni\__main__.py", line 61, in main
    dispatcher.run()
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\nni\runtime\msg_dispatcher_base.py", line 69, in run
    command, data = self._channel._receive()
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\nni\runtime\tuner_command_channel\channel.py", line 94, in _receive
    command = self._retry_receive()
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\nni\runtime\tuner_command_channel\channel.py", line 104, in _retry_receive
    self._channel.connect()
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\nni\runtime\tuner_command_channel\websocket.py", line 62, in connect
    self._ws = _wait(_connect_async(self._url))
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\nni\runtime\tuner_command_channel\websocket.py", line 111, in _wait
    return future.result()
  File "E:\Anaconda\install\envs\pytorch\lib\concurrent\futures\_base.py", line 435, in result
    return self.__get_result()
  File "E:\Anaconda\install\envs\pytorch\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\nni\runtime\tuner_command_channel\websocket.py", line 125, in _connect_async
    return await websockets.connect(url, max_size=None) # type: ignore
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\websockets\legacy\client.py", line 659, in await_impl_timeout
    return await asyncio.wait_for(self.await_impl(), self.open_timeout)
  File "E:\Anaconda\install\envs\pytorch\lib\asyncio\tasks.py", line 442, in wait_for
    return fut.result()
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\websockets\legacy\client.py", line 671, in await_impl
    extra_headers=protocol.extra_headers,
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\websockets\legacy\client.py", line 326, in handshake
    status_code, response_headers = await self.read_http_response()
  File "E:\Anaconda\install\envs\pytorch\lib\site-packages\websockets\legacy\client.py", line 144, in read_http_response
    raise InvalidMessage("did not receive a valid HTTP response") from exc
websockets.exceptions.InvalidMessage: did not receive a valid HTTP response

nnictl stdout and stderr:

How to reproduce it?:
opened by accelerator1737 0
How to solve "UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed"
Describe the issue:

execute: CUDA_VISIBLE_DEVICES=0 python taylorfo_lightning_evaluator.py

some warning:

[2022-12-27 10:01:54] Update the indirect sparsity for the model.classifier.3
/home/user/miniconda3/envs/prune/lib/python3.8/site-packages/nni/compression/pytorch/speedup/infer_mask.py:275: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:480.)
  if isinstance(self.output, torch.Tensor) and self.output.grad is not None:

[2022-12-27 10:01:54] Update the indirect sparsity for the model.classifier.2
/home/user/miniconda3/envs/prune/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compressor.py:305: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:480.)
  if last_output.grad is not None and tin.grad is not None:
/home/user/miniconda3/envs/prune/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compressor.py:307: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:480.)
  elif last_output.grad is None:

Environment:ubuntu20.04

NNI version:2.10

Training service (local|remote|pai|aml|etc):local

Client OS:ubuntu20.04

Server OS (for remote mode only):

Python version:3.8.15

PyTorch/TensorFlow version:1.13.1+cu116

Is conda/virtualenv/venv used?:conda

Is running in Docker?:no

pytorch-lightning: 1.8.6

Configuration:

Experiment config (remember to remove secrets!):

Search space:

Log message:

nnimanager.log:

dispatcher.log:

nnictl stdout and stderr:

How to reproduce it?: execute: python taylorfo_lightning_evaluator.py
opened by skylaugher 0
cannot fix the mask of the interdependent layers
Describe the issue: I pruned a segmentation model and saved the mask as mask.pth. When I then run speedup, the mask does not fit the new architecture of the model.

Environment:

NNI version:2.0

Training service (local|remote|pai|aml|etc):local

Client OS:ubuntu

Server OS (for remote mode only):

Python version:3.7

PyTorch/TensorFlow version:pytorch

Is conda/virtualenv/venv used?:yes

Is running in Docker?:no

Configuration:

Experiment config (remember to remove secrets!):

Search space:

Log message:

nnimanager.log:

dispatcher.log:

nnictl stdout and stderr:

mask_conflict.py, line 195, in fix_mask_conflict
    assert shape[0] % group == 0
AssertionError

When I print shape[0] and group, I get: 32 32 32 32 16 1 96 96 96 96 24 320. The last group (320) is larger than the corresponding shape[0] (24). How can I fix this problem?
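The assertion in the traceback requires shape[0] to be divisible by group. The following is a minimal sketch of that check applied to the printed values, assuming they are (shape[0], group) pairs in the order shown. The helper name `divisible_by_group` is hypothetical and is not part of NNI.

```python
# Values printed in the issue, assumed to be (shape[0], group) pairs in order.
printed = [32, 32, 32, 32, 16, 1, 96, 96, 96, 96, 24, 320]
pairs = list(zip(printed[0::2], printed[1::2]))


def divisible_by_group(out_channels: int, group: int) -> bool:
    """Hypothetical helper mirroring the `shape[0] % group == 0` assertion."""
    return out_channels % group == 0


# Collect every pair that would trip the assertion.
failing = [(c, g) for c, g in pairs if not divisible_by_group(c, g)]
print(failing)
```

Under that pairing assumption, only the last pair fails, which matches the observation that group is larger than shape[0] there.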

How to reproduce it?:
opened by sungh66 5

Releases(v2.10)

v2.10(Nov 14, 2022)
Neural Architecture Search

Added trial deduplication for evolutionary search.

Fixed the racing issue in RL strategy on submitting models.

Fixed an issue introduced by the trial recovery feature.

Fixed import error of PyTorch Lightning in NAS.

Compression

Supported parsing schema by replacing torch._C.parse_schema in pytorch 1.8.0 in ModelSpeedup.

Fixed a bug where rand_like_with_shape in speedup could easily overflow when dtype=torch.int8.

Fixed the propagation error with view tensors in speedup.

Hyper-parameter optimization

Supported rerunning trials that were interrupted by the termination of an NNI experiment when that experiment is resumed.

Fixed a dependency issue of Anneal tuner by changing Anneal tuner dependency to optional.

Fixed a bug that tuner might lose connection in long experiments.

Training service

Fixed a bug that trial code directory cannot have non-English characters.

Web portal

Fixed an error of columns in HPO experiment hyper-parameters page by using localStorage.

Fixed a link error in About menu on WebUI.

Known issues

ModelSpeedup does not support non-tensor intermediate variables.

Source code(tar.gz)
Source code(zip)
v2.9(Sep 7, 2022)
Neural Architecture Search

New tutorial of model space hub and one-shot strategy. (tutorial)

Add pretrained checkpoints to AutoFormer. (doc)

Support loading checkpoint of a trained supernet in a subnet. (doc)

Support view and resume of NAS experiment. (doc)

Enhancements

Support fit_kwargs in lightning evaluator. (doc)

Support drop_path and auxiliary_loss in NASNet. (doc)

Support gradient clipping in DARTS. (doc)

Add export_probs to monitor the architecture weights.

Rewrite configure_optimizers, functions to step optimizers / schedulers, along with other hooks for simplicity, and to be compatible with latest lightning (v1.7).

Align implementation of DifferentiableCell with DARTS official repo.

Re-implementation of ProxylessNAS.

Move nni.retiarii code-base to nni.nas.

Bug fixes

Fix a performance issue caused by tensor formatting in weighted_sum.

Fix a misuse of lambda expression in NAS-Bench-201 search space.

Fix the gumbel temperature schedule in Gumbel DARTS.

Fix the architecture weight sharing when sharing labels in differentiable strategies.

Fix the memo reusing in exporting differentiable cell.

Compression

New tutorial of pruning transformer model. (tutorial)

Add TorchEvaluator, LightningEvaluator, TransformersEvaluator to ease the expression of training logic in pruner. (doc, API)

Enhancements

Promote all pruner API using Evaluator, the old API is deprecated and will be removed in v3.0. (doc)

Greatly enlarge the set of supported operators in pruning speedup via automatic operator conversion.

Support lr_scheduler in pruning by using Evaluator.

Support pruning NLP task in ActivationAPoZRankPruner and ActivationMeanRankPruner.

Add training_steps, regular_scale, movement_mode, sparse_granularity for MovementPruner. (doc)

Add GroupNorm replacement in pruning speedup. Thanks external contributor @cin-xing .

Optimize balance mode performance in LevelPruner.

Bug fixes

Fix the invalid dependency_aware mode in scheduled pruners.

Fix the bug where bias mask cannot be generated.

Fix the bug where max_sparsity_per_layer has no effect.

Fix Linear and LayerNorm speedup replacement in NLP task.

Fix tracing LightningModule failed in pytorch_lightning >= 1.7.0.

Hyper-parameter optimization

Fix the bug that weights are not defined correctly in adaptive_parzen_normal of TPE.

Training service

Fix trialConcurrency bug in K8S training service: use ${envId}_run.sh to replace run.sh.

Fix upload dir bug in K8S training service: use a separate working directory for each experiment. Thanks external contributor @amznero .

Web portal

Support dict keys in Default metric chart in the detail page.

Show experiment error message with small popup windows in the bottom right of the page.

Upgrade React router to v6 to fix index router issue.

Fix the issue of details page crashing due to choices containing None.

Fix the issue of missing dict intermediate dropdown in comparing trials dialog.

Known issues

Activation-based pruners do not support inputs of shape [batch, seq, hidden].

Failed trials are NOT auto-submitted when experiment is resumed (#4931 is reverted due to its pitfalls).

v2.8(Jun 22, 2022)
Neural Architecture Search

Align user experience of one-shot NAS with multi-trial NAS, i.e., users can use one-shot NAS by specifying the corresponding strategy (doc)

Support multi-GPU training of one-shot NAS

[Preview] Support loading/retraining the pre-searched models of some search spaces, i.e., 18 models in 4 different search spaces (doc)

Support AutoFormer search space in search space hub, thanks to our collaborators @nbl97 and @penghouwen

One-shot NAS supports the NAS API repeat and cell

Refactor of RetiariiExperiment to share the common implementation with HPO experiment

CGO supports pytorch-lightning 1.6

Model Compression

[Preview] Refactor and improvement of automatic model compression with a new CompressionExperiment

Support customizing the module replacement function for unsupported modules in model speedup (doc)

Support the module replacement function for some user-mentioned modules

Support output_padding for convtranspose2d in model speedup, thanks external contributor @haoshuai-orka

Hyper-Parameter Optimization

Make config.tuner.name case insensitive

Allow writing configurations of advisor in tuner format, i.e., aligning the configuration of advisor and tuner

Experiment

Support launching multiple HPO experiments in one process

Internal refactors and improvements

Refactor of the logging mechanism in NNI

Refactor of NNI manager globals for flexibility and high extensibility

Migrate dispatcher IPC to WebSocket

Decouple locking from experiments manager logic

Use launcher's sys.executable to detect Python interpreter

WebUI

Improve user experience of trial ordering in the overview page

Fix the update issue in the trial detail page

Documentation

A new translation framework for documentation

Add a new quantization demo (doc)

Notable Bugfixes

Fix TPE import issue for old metrics

Fix the issue in TPE nested search space

Support RecursiveScriptModule in speedup

Fix the issue of failed "implicit type cast" in merge_parameter()

v2.7(Apr 18, 2022)
Documentation

A full-size upgrade of the documentation, with the following significant improvements in the reading experience, practical tutorials, and examples:

Reorganized the document structure with a new document template. (Upgraded doc entry)

Add more friendly tutorials with Jupyter notebooks. (New Quick Starts)

New model pruning demo available. (Youtube entry, Bilibili entry)

Hyper-Parameter Optimization

[Improvement] TPE and random tuners will not generate duplicate hyperparameters anymore.

[Improvement] Most Python APIs now have type annotations.
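As an illustration of the deduplication improvement for TPE and random tuners noted above, one simple way to avoid duplicate hyperparameters is to remember every configuration already issued and resample on a repeat. This is a hypothetical sketch using only the standard library; the function name `sample_unique` and the search space are assumptions, and it does not reproduce NNI's actual implementation.

```python
import random


def sample_unique(space, n_trials, seed=0):
    """Draw up to n_trials distinct configurations from a discrete space."""
    rng = random.Random(seed)
    # The space is finite, so cap the request at the number of combinations.
    total = 1
    for choices in space.values():
        total *= len(choices)
    seen = set()
    trials = []
    while len(trials) < min(n_trials, total):
        config = tuple((name, rng.choice(choices)) for name, choices in space.items())
        if config in seen:
            continue  # duplicate: resample instead of issuing it again
        seen.add(config)
        trials.append(dict(config))
    return trials


# Hypothetical search space with 3 * 2 = 6 possible configurations.
space = {"lr": [0.1, 0.01, 0.001], "batch_size": [16, 32]}
trials = sample_unique(space, n_trials=10)
print(len(trials))
```

Requesting more trials than the space contains simply exhausts the space once, rather than repeating a configuration.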

Neural Architecture Search

Jointly search for architecture and hyper-parameters: ValueChoice in evaluator. (doc)

Support composition (transformation) of one or several value choices. (doc)

Enhanced Cell API (merge_op, preprocessor, postprocessor). (doc)

The argument depth in the Repeat API allows ValueChoice. (doc)

Support loading state_dict between sub-net and super-net. (doc, example in spos)

Support BN fine-tuning and evaluation in SPOS example. (doc)

[Experimental] Model hyper-parameter choice. (doc)

[Preview] Lightning implementation for Retiarii including DARTS, ENAS, ProxylessNAS and RandomNAS. (example usage)

[Preview] A search space hub that contains 10 search spaces. (code)

Model Compression

Pruning V2 is promoted to the default pruning framework; the old pruning framework is now legacy and will be kept for a few more releases. (doc)

A new pruning mode balance is supported in LevelPruner. (doc)

Support coarse-grained pruning in ADMMPruner. (doc)

[Improvement] Support more operation types in pruning speedup.

[Improvement] Optimize performance of some pruners.

Experiment

[Improvement] Experiment.run() no longer stops web portal on return.

Notable Bugfixes

Fixed: experiment list could not open experiment with prefix.

Fixed: serializer for complex kinds of arguments.

Fixed: some typos in code. (thanks @a1trl9 @mrshu)

Fixed: dependency issue across layer in pruning speedup.

Fixed: unchecking a trial did not work in the detail table.

Fixed: filter name | id bug in the experiment management page.

v2.6.1(Feb 18, 2022)
Bug Fixes

Fix a bug that new TPE does not support dict metrics.

Fix a bug caused by a missing comma. (Thanks to @mrshu)

v2.6(Jan 19, 2022)
NOTE: NNI v2.6 is the last version that supports Python 3.6. From next release NNI will require Python 3.7+.

Hyper-Parameter Optimization

Experiment

The legacy experiment config format is now deprecated. (doc of new config)

If you are still using legacy format, nnictl will show equivalent new config on start. Please save it to replace the old one.

nnictl now uses nni.experiment.Experiment APIs as backend. The output messages of the create, resume, and view commands have changed.

Added Kubeflow and Frameworkcontroller support to hybrid mode. (doc)

The hidden tuner manifest file has been updated. This should be transparent to users, but if you encounter issues like failed to find tuner, please try to remove ~/.config/nni.

Algorithms

Random tuner now supports classArgs seed. (doc)

TPE tuner is refactored: (doc)

Support classArgs seed.

Support classArgs tpe_args for expert users to customize algorithm behavior.

Parallel optimization has been turned on by default. To turn it off set tpe_args.constant_liar_type to null (or None in Python).

parallel_optimize and constant_liar_type have been removed. If you are using them, please update your config to use tpe_args.constant_liar_type instead.

Grid search tuner now supports all search space types, including uniform, normal, and nested choice. (doc)
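The TPE notes above can be sketched as a tuner section with classArgs. This is a minimal, hypothetical example: the key names seed, tpe_args, and constant_liar_type come from the notes, while the surrounding layout and the seed value are assumptions. It also shows that Python's None serializes to null, the value that turns parallel optimization off.

```python
import json

# Hypothetical tuner section; only seed, tpe_args and constant_liar_type
# are taken from the release notes, the rest is illustrative.
tuner_config = {
    "name": "TPE",
    "classArgs": {
        "seed": 12345,  # arbitrary example seed
        "tpe_args": {
            # None (null once serialized) turns parallel optimization off.
            "constant_liar_type": None,
        },
    },
}

rendered = json.dumps(tuner_config, indent=2)
print(rendered)
```

Leaving tpe_args out entirely would keep the default behavior, in which parallel optimization is on.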

Neural Architecture Search

Enhancement to serialization utilities (doc) and changes to recommended practice of customizing evaluators. (doc)

Support latency constraint on edge device for ProxylessNAS based on nn-Meter. (doc)

Trial parameters are shown in a friendlier format in Retiarii experiments.

Refactor NAS examples of ProxylessNAS and SPOS.

Model Compression

New Pruner Supported in Pruning V2

Auto-Compress Pruner (doc)

AMC Pruner (doc)

Movement Pruning Pruner (doc)

Support nni.trace wrapped Optimizer in Pruning V2. The input parameters of the optimizer are traced while affecting the user experience as little as possible. (doc)

Optimize Taylor Pruner, APoZ Activation Pruner, Mean Activation Pruner in V2 memory usage.

Add more examples for Pruning V2.

Add document for pruning config list. (doc)

Parameter masks_file of ModelSpeedup now accepts pathlib.Path object. (Thanks to @dosemeion) (doc)

Bug Fix

Fix Slim Pruner in V2 not sparsifying the BN weight.

Fix Simulated Annealing Task Generator generating configs that ignore 0 sparsity.

Documentation

Supported GitHub feature "Cite this repository".

Updated index page of readthedocs.

Updated Chinese documentation.

From now on NNI only maintains translations for the most important docs and ensures they are up to date.

Reorganized HPO tuners' doc.

Bugfixes

Fixed a bug where numpy array is used as a truth value. (Thanks to @khituras)

Fixed a bug in updating search space.

Fixed a bug where HPO search space files did not support scientific notation or tab indentation.

For now NNI does not support mixing scientific notation and YAML features. We are waiting for PyYAML to update.

Fixed a bug that causes DARTS 2nd order to crash.

Fixed a bug that causes deep copy of mutation primitives (e.g., LayerChoice) to crash.

Removed blank at bottom in Web UI overview page.

v2.5(Nov 4, 2021)
Model Compression

New major version of pruning framework (doc)

Iterative pruning is more automated; users need less code to implement it.

Support exporting intermediate models in the iterative pruning process.

The implementation of the pruning algorithm is closer to the paper.

Users can easily customize their own iterative pruning by using PruningScheduler.

Optimized the underlying mask-generation logic of the basic pruners, making it easier to extend with new functions.

Optimized the memory usage of the pruners.

MobileNetV2 end-to-end example (notebook)

Improved QAT quantizer (doc)

Support dtype and scheme customization

Support dp multi-gpu training

Support load_calibration_config

Model speed-up now supports directly loading the mask (doc)

Support speed-up depth-wise convolution

Support bn-folding for LSQ quantizer

Support QAT and LSQ resume from PTQ

Added doc for observer quantizer (doc)

Neural Architecture Search

NAS benchmark (doc)

Support benchmark table lookup in experiments

New data preparation approach

Improved quick start doc

Experimental CGO execution engine (doc)

Hyper-Parameter Optimization

New training platform: Alibaba DSW+DLC (doc)

Support passing ConfigSpace definition directly to BOHB (doc) (thanks to @khituras)

Reformatted experiment config doc

Added example config files for Windows (thanks to @politecat314)

FrameworkController now supports reuse mode

Fixed Bugs

Experiment cannot start due to platform timestamp format (issue #4077 #4083)

Cannot use 1e-5 in search space (issue #4080)

Dependency version conflict caused by ConfigSpace (issue #3909) (thanks to @jexxers)

Hardware-aware SPOS example does not work (issue #4198)

Web UI shows wrong remaining time when duration exceeds limit (issue #4015)

cudnn.deterministic is always set in AMC pruner (#4117) thanks to @mstczuo

And...

New emoticons!

Install from pypi
v2.4(Aug 12, 2021)
Major Updates

Neural Architecture Search

NAS visualization: visualize model graph through Netron (#3878)

Support NAS bench 101/201 on Retiarii framework (#3871 #3920)

Support hypermodule AutoActivation (#3868)

Support PyTorch v1.8/v1.9 (#3937)

Support Hardware-aware NAS with nn-Meter (#3938)

Enable fixed_arch on Retiarii (#3972)

Model Compression

Refactor of ModelSpeedup: auto shape/mask inference (#3462)

Added more examples for ModelSpeedup (#3880)

Support global sort for Taylor pruning (#3896)

Support TransformerHeadPruner (#3884)

Support batch normalization folding in QAT quantizer (#3911, thanks the external contributor @chenbohua3)

Support post-training observer quantizer (#3915, thanks the external contributor @chenbohua3)

Support ModelSpeedup for Slim Pruner (#4008)

Support TensorRT 8.0.0 in ModelSpeedup (#3866)

Hyper-parameter Tuning

Improve HPO benchmarks (#3925)

Improve type validation of user defined search space (#3975)

Training service & nnictl

Support JupyterLab (#3668 #3954)

Support viewing experiment from experiment folder (#3870)

Support kubeflow in training service reuse framework (#3919)

Support viewing trial log on WebUI for an experiment launched in view mode (#3872)

Minor Updates & Bug Fixes

Fix the failure of the exit of Retiarii experiment (#3899)

Fix exclude not supported in some config_list cases (#3815)

Fix bug in remote training service on reuse mode (#3941)

Improve IP address detection in a modern way (#3860)

Fix bug of the search box on WebUI (#3935)

Fix bug in url_prefix of WebUI (#4051)

Support dict format of intermediate on WebUI (#3895)

Fix bug in openpai training service induced by experiment config v2 (#4027 #4057)

Improved doc (#3861 #3885 #3966 #4004 #3955)

Improved the API export_model in model compression (#3968)

Supported UnSqueeze in ModelSpeedup (#3960)

Thanks other external contributors: @Markus92 (#3936), @thomasschmied (#3963), @twmht (#3842)

v2.3(Jun 15, 2021)
Major Updates

Neural Architecture Search

Retiarii Framework (NNI NAS 2.0) Beta Release with new features:

Support new high-level APIs: Repeat and Cell (#3481)

Support pure-python execution engine (#3605)

Support policy-based RL strategy (#3650)

Support nested ModuleList (#3652)

Improve documentation (#3785)

Note: there are more exciting features of Retiarii planned in the future releases, please refer to Retiarii Roadmap for more information.

Add new NAS algorithm: Blockwise DNAS FBNet (#3532, thanks the external contributor @alibaba-yiwuyao)

Model Compression

Support Auto Compression Framework (#3631)

Support slim pruner in Tensorflow (#3614)

Support LSQ quantizer (#3503, thanks the external contributor @chenbohua3)

Improve APIs for iterative pruners (#3507 #3688)

Training service & Rest

Support 3rd-party training service (#3662 #3726)

Support setting prefix URL (#3625 #3674 #3672 #3643)

Improve NNI manager logging (#3624)

Remove outdated TensorBoard code on nnictl (#3613)

Hyper-Parameter Optimization

Add new tuner: DNGO (#3479 #3707)

Add benchmark for tuners (#3644 #3720 #3689)

WebUI

Improve search parameters on trial detail page (#3651 #3723 #3715)

Make selected trials consistent after auto-refresh in detail table (#3597)

Add trial stdout button on local mode (#3653 #3690)

Examples & Documentation

Convert all trial examples from config v1 to config v2 (#3721 #3733 #3711 #3600)

Add new jupyter notebook examples (#3599 #3700)

Dev Excellent

Upgrade dependencies in Dockerfile (#3713 #3722)

Substitute PyYAML for ruamel.yaml (#3702)

Add pipelines for AML and hybrid training service and experiment config V2 (#3477 #3648)

Add pipeline badge in README (#3589)

Update issue bug report template (#3501)

Bug Fixes & Minor Updates

Fix syntax error on Windows (#3634)

Fix a logging related bug (#3705)

Fix a bug in GPU indices (#3721)

Fix a bug in FrameworkController (#3730)

Fix a bug in export_data_url format (#3665)

Report version check failure as a warning (#3654)

Fix bugs and lints in nnictl (#3712)

Fix bug of optimize_mode on WebUI (#3731)

Fix bug of useActiveGpu in AML v2 config (#3655)

Fix bug of experiment_working_directory in Retiarii config (#3607)

Fix a bug in mask conflict (#3629, thanks the external contributor @Davidxswang)

Fix a bug in model speedup shape inference (#3588, thanks the external contributor @Davidxswang)

Fix a bug in multithread on Windows (#3604, thanks the external contributor @Ivanfangsc)

Delete redundant code in training service (#3526, thanks the external contributor @maxsuren)

Fix typo in DoReFa compression doc (#3693, thanks the external contributor @Erfandarzi)

Update docstring in model compression (#3647, thanks the external contributor @ichejun)

Fix a bug when using Kubernetes container (#3719, thanks the external contributor @rmfan)

v2.2(Apr 26, 2021)
Major updates

Neural Architecture Search

Improve NAS 2.0 (Retiarii) Framework (Alpha Release)

Support local debug mode (#3476)

Support nesting ValueChoice in LayerChoice (#3508)

Support dict/list type in ValueChoice (#3508)

Improve the format of export architectures (#3464)

Refactor of NAS examples (#3513)

Refer to https://github.com/microsoft/nni/issues/3301 for the Retiarii Roadmap

Model Compression

Support speedup for mixed precision quantization model (Experimental) (#3488 #3512)

Support model export for quantization algorithm (#3458 #3473)

Support model export in model compression for TensorFlow (#3487)

Improve documentation (#3482)

nnictl & nni.experiment

Add native support for experiment config V2 (#3466 #3540 #3552)

Add resume and view mode in Python API nni.experiment (#3490 #3524 #3545)

Training Service

Support umount for shared storage in remote training service (#3456)

Support Windows as the remote training service in reuse mode (#3500)

Remove duplicated env folder in remote training service (#3472)

Add log information for GPU metric collector (#3506)

Enable optional Pod Spec for FrameworkController platform (#3379, thanks the external contributor @mbu93)

WebUI

Support launching TensorBoard on WebUI (#3454 #3361 #3531)

Upgrade echarts-for-react to v5 (#3457)

Add wrap for dispatcher/nnimanager log monaco editor (#3461)

Bug Fixes

Fix bug of FLOPs counter (#3497)

Fix bug of hyper-parameter Add/Remove axes and table Add/Remove columns button conflict (#3491)

Fix bug that monaco editor search text is not displayed completely (#3492)

Fix bug of Cream NAS (#3498, thanks the external contributor @AliCloud-PAI)

Fix typos in docs (#3448, thanks the external contributor @OliverShang)

Fix typo in NAS 1.0 (#3538, thanks the external contributor @ankitaggarwal23)

v2.1(Mar 10, 2021)
Major updates

Neural architecture search

Improve NAS 2.0 (Retiarii) Framework (Improved Experimental)

Improve the robustness of graph generation and code generation for PyTorch models (#3365)

Support the inline mutation API ValueChoice (#3349 #3382)

Improve the design and implementation of Model Evaluator (#3359 #3404)

Support Random/Grid/Evolution exploration strategies (i.e., search algorithms) (#3377)

Refer to here for Retiarii Roadmap

Training service

Support shared storage for reuse mode (#3354)

Support Windows as the local training service in hybrid mode (#3353)

Remove PAIYarn training service (#3327)

Add "recently-idle" scheduling algorithm (#3375)

Deprecate preCommand and enable pythonPath for remote training service (#3284 #3410)

Refactor reuse mode temp folder (#3374)

nnictl & nni.experiment

Migrate nnicli to new Python API nni.experiment (#3334)

Refactor the way of specifying tuner in experiment Python API (nni.experiment), more aligned with nnictl (#3419)

WebUI

Support showing the assigned training service of each trial in hybrid mode on WebUI (#3261 #3391)

Support multiple selection for filter status in experiments management page (#3351)

Improve overview page (#3316 #3317 #3352)

Support copy trial id in the table (#3378)

Documentation

Improve model compression examples and documentation (#3326 #3371)

Add Python API examples and documentation (#3396)

Add SECURITY doc (#3358)

Add 'What's NEW!' section in README (#3395)

Update English contributing doc (#3398, thanks external contributor @Yongxuanzhang)

Bug fixes

Fix AML outputs path and python process not killed (#3321)

Fix bug that an experiment launched from Python cannot be resumed by nnictl (#3309)

Fix import path of network morphism example (#3333)

Fix bug in the tuple unpack (#3340)

Fix bug of security for arbitrary code execution (#3311, thanks external contributor @huntr-helper)

Fix NoneType error on jupyter notebook (#3337, thanks external contributor @tczhangzhi)

Fix bugs in Retiarii (#3339 #3341 #3357, thanks external contributor @tczhangzhi)

Fix bug in AdaptDL mode example (#3381, thanks external contributor @ZeyaWang)

Fix the spelling mistake of assessor (#3416, thanks external contributor @ByronCHAO)

Fix bug in ruamel import (#3430, thanks external contributor @rushtehrani)

v2.0(Jan 14, 2021)
Major updates

Neural architecture search

Support an improved NAS framework: Retiarii (experimental)

Feature roadmap

Related issues and pull requests

Documentation

Support a new NAS algorithm: Cream (#2705)

Add a new NAS benchmark for NLP model search (#3140)

Training service

Support hybrid training service (#3097 #3251 #3252)

Support AdlTrainingService, a new training service based on Kubernetes (#3022, thanks external contributors Petuum @pw2393)

Model compression

Support pruning schedule for fpgm pruning algorithm (#3110)

ModelSpeedup improvement: support torch v1.7 (updated graph_utils.py) (#3076)

Improve model compression utility: model flops counter (#3048 #3265)

WebUI & nnictl

Support experiments management on WebUI, add a web page for it (#3081 #3127)

Improve the layout of overview page (#3046 #3123)

Add navigation bar on the right for logs and configs; add expanded icons for table (#3069 #3103)

Others

Support launching an experiment from Python code (#3111 #3210 #3263)

Refactor builtin/customized tuner installation (#3134)

Support new experiment configuration V2 (#3138 #3248 #3251)

Reorganize source code directory hierarchy (#2962 #2987 #3037)

Change SIGKILL to SIGTERM in local mode when cancelling trial jobs (#3173)

Refactor hyperband (#3040)

Documentation

Port markdown docs to reStructuredText docs and introduce githublink (#3107)

List related research and publications in doc (#3150)

Add tutorial of saving and loading quantized model (#3192)

Remove paiYarn doc and add description of reuse config in remote mode (#3253)

Update EfficientNet doc to clarify repo versions (#3158, thanks external contributor @ahundt)

Bug fixes

Fix exp-duration pause timing under NO_MORE_TRIAL status (#3043)

Fix bug in NAS SPOS trainer, apply_fixed_architecture (#3051, thanks external contributor @HeekangPark)

Fix _compute_hessian bug in NAS DARTS (PyTorch version) (#3058, thanks external contributor @hroken)

Fix bug of conv1d in the cdarts utils (#3073, thanks external contributor @athaker)

Fix the handling of unknown trials when resuming an experiment (#3096)

Fix bug of kill command under Windows (#3106)

Fix lazy logging (#3108, thanks external contributor @HarshCasper)

Fix checkpoint load and save issue in QAT quantizer (#3124, thanks external contributor @eedalong)

Fix quant grad function calculation error (#3160, thanks external contributor @eedalong)

Fix device assignment bug in quantization algorithm (#3212, thanks external contributor @eedalong)

Fix bug in ModelSpeedup and enhance UT for it (#3279)

and others

v1.9(Oct 22, 2020)
Release 1.9 - 10/22/2020

Major updates

Neural architecture search

Support regularized evolution algorithm for NAS scenario (#2802)

Add NASBench201 in search space zoo (#2766)

Model compression

AMC pruner improvement: support resnet, support reproduction of the experiments (default parameters in our example code) in AMC paper (#2876 #2906)

Support constraint-aware on some of our pruners to improve model compression efficiency (#2657)

Support "tf.keras.Sequential" in model compression for TensorFlow (#2887)

Support customized op in the model flops counter (#2795)

Support quantizing bias in QAT quantizer (#2914)

Training service

Support configuring python environment using "preCommand" in remote mode (#2875)

Support AML training service in Windows (#2882)

Support reuse mode for remote training service (#2923)

WebUI & nnictl

The "Overview" page on WebUI is redesigned with new layout (#2914)

Upgraded node, yarn and FabricUI, and enabled Eslint (#2894 #2873 #2744)

Add/Remove columns in hyper-parameter chart and trials table in "Trials detail" page (#2900)

JSON format utility beautify on WebUI (#2863)

Support nnictl command auto-completion (#2857)

UT & IT

Add integration test for experiment import and export (#2878)

Add integration test for user installed builtin tuner (#2859)

Add unit test for nnictl (#2912)

Documentation

Refactor of the document for model compression (#2919)

Bug fixes

Fix naïve evolution tuner so that it correctly handles trial failures (#2695)

Resolve the warning "WARNING (nni.protocol) IPC pipeline not exists, maybe you are importing tuner/assessor from trial code?" (#2864)

Fix search space issue in experiment save/load (#2886)

Fix bug in experiment import data (#2878)

Fix annotation in remote mode (python 3.8 ast update issue) (#2881)

Support boolean type for "choice" hyper-parameter when customizing trial configuration on WebUI (#3003)

v1.8(Aug 28, 2020)
Release 1.8 - 8/27/2020

Major updates

Training service

Access trial log directly on WebUI (local mode only) (#2718)

Add OpenPAI trial job detail link (#2703)

Support GPU scheduler in reusable environment (#2627) (#2769)

Add timeout for web_channel in trial_runner (#2710)

Show environment error message in AzureML mode (#2724)

Add more log information when copying data in OpenPAI mode (#2702)

WebUI, nnictl and nnicli

Improve hyper-parameter parallel coordinates plot (#2691) (#2759)

Add pagination for trial job list (#2738) (#2773)

Enable panel close when clicking overlay region (#2734)

Remove support for Multiphase on WebUI (#2760)

Support save and restore experiments (#2750)

Add intermediate results in export result (#2706)

Add command to list trial results with highest/lowest metrics (#2747)

Improve the user experience of nnicli with examples (#2713)

Neural architecture search

Search space zoo: ENAS and DARTS (#2589)

API to query intermediate results in NAS benchmark (#2728)

Model compression

Support the List/Tuple Construct/Unpack operation for TorchModuleGraph (#2609)

Model speedup improvement: Add support of DenseNet and InceptionV3 (#2719)

Support the multiple successive tuple unpack operations (#2768)

Doc of comparing the performance of supported pruners (#2742)

New pruners: Sensitivity pruner (#2684) and AMC pruner (#2573) (#2786)

TensorFlow v2 support in model compression (#2755)

Backward incompatible changes

Update the default experiment folder from $HOME/nni/experiments to $HOME/nni-experiments. If you want to view the experiments created by previous NNI releases, you can move the experiments folders from $HOME/nni/experiments to $HOME/nni-experiments manually. (#2686) (#2753)

Dropped support for Python 3.5 and scikit-learn 0.20 (#2778) (#2777) (#2783) (#2787) (#2788) (#2790)
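
The manual move described for the default experiment folder can be scripted. The following is a minimal sketch, not an official NNI tool: the helper name migrate_experiments is hypothetical, and it assumes each experiment is a direct child of the old folder and that an experiment already present in the new folder should be left alone.

```python
import shutil
from pathlib import Path
from typing import List


def migrate_experiments(home: Path) -> List[str]:
    """Move experiment folders from <home>/nni/experiments to <home>/nni-experiments.

    Returns the names of the entries that were moved. Hypothetical helper,
    written only to illustrate the manual step in the release note.
    """
    old_root = home / "nni" / "experiments"
    new_root = home / "nni-experiments"
    moved: List[str] = []
    if not old_root.is_dir():
        return moved  # nothing created by a previous release
    new_root.mkdir(parents=True, exist_ok=True)
    for entry in sorted(old_root.iterdir()):
        target = new_root / entry.name
        if target.exists():
            # Never overwrite an experiment that already exists in the new location.
            continue
        shutil.move(str(entry), str(target))
        moved.append(entry.name)
    return moved
```

Calling migrate_experiments(Path.home()) would apply it to the current user's home directory; running it a second time moves nothing.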

Others

Upgrade TensorFlow version in Docker image (#2732) (#2735) (#2720)

Examples

Remove gpuNum in assessor examples (#2641)

Documentation

Improve customized tuner documentation (#2628)

Fix several typos and grammar mistakes in documentation (#2637 #2638, thanks @tomzx)

Improve AzureML training service documentation (#2631)

Improve CI of Chinese translation (#2654)

Improve OpenPAI training service documentation (#2685)

Improve documentation of community sharing (#2640)

Add tutorial of Colab support (#2700)

Improve documentation structure for model compression (#2676)

Bug fixes

Fix mkdir error in training service (#2673)

Fix bug when using chmod in remote training service (#2689)

Fix dependency issue by making _graph_utils imported inline (#2675)

Fix mask issue in SimulatedAnnealingPruner (#2736)

Fix intermediate graph zooming issue (#2738)

Fix issue when dict is unordered when querying NAS benchmark (#2728)

Fix import issue for gradient selector dataloader iterator (#2690)

Fix support of adding tens of machines in remote training service (#2725)

Fix several styling issues in WebUI (#2762 #2737)

Fix support of unusual types in metrics including NaN and Infinity (#2782)

Fix nnictl experiment delete (#2791)

v1.7.1(Jul 31, 2020)
Release 1.7.1 - 8/1/2020

Bug Fixes

Fix pai training service error handling #2692

Fix pai training service codeDir copying issue #2673

Upgrade training service to support latest pai restful API #2722

v1.7(Jul 8, 2020)
Release 1.7 - 7/8/2020

Major Features

Training Service

Support AML(Azure Machine Learning) platform as NNI training service.

OpenPAI jobs can be reused. When a trial completes, the OpenPAI job does not stop but waits for the next trial. Refer to the reuse flag in the OpenPAI config.

Support ignoring files and folders in code directory with .nniignore when uploading code directory to training service.
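
To illustrate the idea behind .nniignore, here is a simplified sketch of filtering a code directory listing before upload. It is not NNI's implementation: the function name filter_upload is hypothetical, and the matching rule (shell-style wildcards applied to the whole relative path or to any single path component) is an assumption that covers only a small part of what an ignore file may express.

```python
import fnmatch
from pathlib import PurePosixPath
from typing import Iterable, List


def filter_upload(paths: Iterable[str], ignore_lines: Iterable[str]) -> List[str]:
    """Return the relative paths that no ignore pattern matches.

    Assumed simplification: blank lines and lines starting with '#' are
    skipped, a trailing '/' is dropped, and a pattern matches if it matches
    the full relative path or any one component of it.
    """
    patterns = []
    for line in ignore_lines:
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line.rstrip("/"))

    kept = []
    for path in paths:
        parts = PurePosixPath(path).parts
        ignored = any(
            fnmatch.fnmatchcase(path, pattern)
            or any(fnmatch.fnmatchcase(part, pattern) for part in parts)
            for pattern in patterns
        )
        if not ignored:
            kept.append(path)
    return kept
```

With the ignore lines "data/" and "*.txt", a listing of train.py, data/big.bin, logs/run1.txt and model/net.py keeps only train.py and model/net.py.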

Neural Architecture Search (NAS)

Provide NAS Open Benchmarks (NasBench101, NasBench201, NDS) with friendly APIs.

Support Classic NAS (i.e., non-weight-sharing mode) on TensorFlow 2.X.

Model Compression

Improve Model Speedup: track more dependencies among layers and automatically resolve mask conflict, support the speedup of pruned resnet.

Added new pruners: three auto model pruning algorithms (NetAdapt Pruner, SimulatedAnnealing Pruner, AutoCompress Pruner) and ADMM Pruner.

Added model sensitivity analysis tool to help users find the sensitivity of each layer to the pruning.

Easy flops calculation for model compression and NAS.

Update lottery ticket pruner to export winning ticket.

Examples

Automatically optimize tensor operators on NNI with a new customized tuner OpEvo.

Built-in tuners/assessors/advisors

Allow customized tuners/assessor/advisors to be installed as built-in algorithms.

WebUI

Visualize nested search spaces in a more user-friendly way.

Show trial's dict keys in hyper-parameter graph.

Enhancements to trial duration display.

Others

Provide utility function to merge parameters received from NNI

Support setting paiStorageConfigName in pai mode
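
The parameter-merging utility mentioned above is not named in these notes. As a rough sketch of what such a helper typically does, the hypothetical merge_params below overlays tuner-provided values on a trial's default arguments. The type casting and the rejection of unknown keys are assumptions made for illustration.

```python
from argparse import Namespace
from typing import Any, Dict


def merge_params(defaults: Namespace, tuned: Dict[str, Any]) -> Namespace:
    """Return a copy of `defaults` with values from `tuned` applied on top.

    Hypothetical helper: unknown keys are rejected, and each tuned value is
    cast to the type of the corresponding default (so 64.0 becomes 64).
    """
    merged = Namespace(**vars(defaults))
    for key, value in tuned.items():
        if not hasattr(defaults, key):
            raise KeyError(f"unexpected parameter from tuner: {key}")
        default = getattr(defaults, key)
        if default is not None:
            value = type(default)(value)
        setattr(merged, key, value)
    return merged
```

A trial could build its defaults with argparse, fetch the tuned values from the tuner, and pass both to this helper so the rest of the training code reads a single namespace.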

Documentation

Improve documentation for model compression

Improve documentation and examples for NAS benchmarks.

Improve documentation for AzureML training service

Migrate homepage to Read the Docs.

Bug Fixes

Fix bug for model graph with shared nn.Module

Fix nodejs OOM when make build

Fix NASUI bugs

Fix duration and intermediate results pictures update issue.

Fix minor WebUI table style issues.

v1.6(May 26, 2020)
Release 1.6 - 5/26/2020

Major Features

New Features and improvement

Support __version__ for SDK version

Support Windows dev install

Raise the IPC limit to 100W

Improve code storage upload logic among trials on non-local platforms

HPO Updates

Improve PBT on failure handling and support experiment resume for PBT

NAS Updates

NAS support for TensorFlow 2.0 (preview); see the TF2.0 NAS examples

Use OrderedDict for LayerChoice

Prettify the format of export

Replace layer choice with selected module after applied fixed architecture

Model Compression Updates

Model compression PyTorch 1.4 support

Training Service Updates

Update pai yaml merge logic

Support Windows as remote machine in remote mode

Web UI new supports or improvements

Show trial error message

finalize homepage layout

Refactor overview's best trials module

Remove multiphase from webui

add tooltip for trial concurrency in the overview page

Show top trials for hyper-parameter graph

Bug Fix

Fix dev install

Fix SPOS example crash when the checkpoints do not have state_dict

Fix table sort issue when experiment had failed trial

Support multi python env (conda, pyenv etc)

v1.5(Apr 13, 2020)
New Features and Documentation

Hyper-Parameter Optimizing

New tuner: Population Based Training (PBT)

Trials can now report infinity and NaN as result

Neural Architecture Search

New NAS algorithm: TextNAS

ENAS and DARTS now support visualization through web UI.

Model Compression

New Pruner: GradientRankFilterPruner

Compressors will validate configuration by default

Refactor: Adding optimizer as an input argument of pruner, for easy support of DataParallel and more efficient iterative pruning. This is a breaking change for the usage of iterative pruning algorithms.

Model compression examples are refactored and improved

Added documentation for implementing compressing algorithm

Training Service

Kubeflow now supports pytorchjob crd v1 (thanks external contributor @jiapinai)

Experimental DLTS support

Overall Documentation Improvement

Documentation is significantly improved on grammar, spelling, and wording (thanks external contributor @AHartNtkn)

Fixed Bugs

ENAS cannot have more than one LSTM layer (thanks external contributor @marsggbo)

NNI manager's timers will never unsubscribe (thanks external contributor @guilhermehn)

NNI manager may exhaust heap memory (thanks external contributor @Sundrops)

Batch tuner does not support customized trials (#2075)

Experiment cannot be killed if it failed on start (#2080)

Non-number type metrics break web UI (#2278)

A bug in lottery ticket pruner

Other minor glitches

v1.4(Feb 19, 2020)
Release 1.4 - 2/19/2020

Major Features

Neural Architecture Search

Support C-DARTS algorithm and add the example using it

Support a preliminary version of ProxylessNAS and the corresponding example

Add unit tests for the NAS framework

Model Compression

Support DataParallel for compressing models, and provide an example of using DataParallel

Support model speedup for compressed models, in Alpha version

Training Service

Support complete PAI configurations by allowing users to specify PAI config file path

Add example config yaml files for the new PAI mode (i.e., paiK8S)

Support deleting experiments using sshkey in remote mode (thanks external contributor @tyusr)

WebUI

WebUI refactor: adopt fabric framework

Others

Support running NNI experiment in the foreground, i.e., --foreground argument in nnictl create/resume/view

Support canceling the trials in UNKNOWN state

Support large search space whose size could be up to 50mb (thanks external contributor @Sundrops)

Documentation

Improve the index structure of NNI readthedocs

Improve documentation for NAS

Improve documentation for the new PAI mode

Add QuickStart guidance for NAS and model compression

Improve documentation for the supported EfficientNet

Bug Fixes

Correctly support NaN in metric data, JSON compliant

Fix the out-of-range bug of randint type in search space

Fix the bug of wrong tensor device when exporting onnx model in model compression

Fix incorrect handling of nnimanagerIP in the new PAI mode (i.e., paiK8S)
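
Standard JSON has no literal for NaN or Infinity, which is why metric data containing them needs special handling to stay compliant. The sketch below shows one possible approach, encoding non-finite floats as strings before serialization. The string encoding and the helper names are assumptions for illustration and are not taken from NNI's code.

```python
import json
import math
from typing import Any


def to_json_safe(value: Any) -> Any:
    """Recursively replace non-finite floats with string markers (assumed encoding)."""
    if isinstance(value, float) and not math.isfinite(value):
        if math.isnan(value):
            return "NaN"
        return "Infinity" if value > 0 else "-Infinity"
    if isinstance(value, dict):
        return {key: to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    return value


def dump_metric(metric: Any) -> str:
    # allow_nan=False makes json.dumps raise if a non-finite float slips through,
    # so the output is guaranteed to be standard-compliant JSON.
    return json.dumps(to_json_safe(metric), allow_nan=False)
```

By default Python's json.dumps emits bare NaN and Infinity tokens, which strict JSON parsers reject; converting first avoids that.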

v1.3(Dec 31, 2019)
Release 1.3 - 12/30/2019

Major Features

Neural Architecture Search Algorithms Support

Single Path One Shot algorithm and the example using it

Model Compression Algorithms Support

Example: Knowledge Distillation algorithm and the example using it

Pruners

L2Filter Pruner

ActivationAPoZRankFilterPruner

ActivationMeanRankFilterPruner

BNN Quantizer

Training Service

NFS Support for PAI

Instead of using HDFS as default storage, since OpenPAI v0.11, OpenPAI can have NFS or AzureBlob or other storage as default storage. In this release, NNI extended the support for this recent change made by OpenPAI, and could integrate with OpenPAI v0.11 or later version with various default storage.

Kubeflow update adoption

Add support for zero gpuNum in kubernetes (#1830 | thanks to external contributor @skyser2003)

Adopt Kubeflow 0.7's new support for tf-operator (thanks to external contributor @skyser2003)

Engineering (code and build automation)

Enforced ESLint on static code analysis.

Small changes & Bug Fixes

correctly recognize builtin tuner and customized tuner

logging in dispatcher base

fix the bug where tuner/assessor's failure sometimes kills the experiment.

Fix local system as remote machine issue

de-duplicate trial configuration in smac tuner

v1.2(Dec 2, 2019)
Release 1.2 - 12/2/2019

Major Features

Feature Engineering

New feature engineering interface

Feature selection algorithms: Gradient feature selector & GBDT selector

Examples for feature engineering

Neural Architecture Search (NAS) on NNI

New NAS interface

NAS algorithms: ENAS, DARTS, P-DARTS (in PyTorch)

NAS in classic mode (each trial runs independently)

Model compression

New model pruning algorithms: lottery ticket pruning approach, L1Filter pruner, Slim pruner, FPGM pruner

New model quantization algorithms: QAT quantizer, DoReFa quantizer

Support the API for exporting compressed model.

Training Service

Support OpenPAI token authentication

Examples:

An example to automatically tune rocksdb configuration with NNI.

A new MNIST trial example supports tensorflow 2.0.

Engineering Improvements

For remote training service, trial jobs that require no GPU are now scheduled with a round-robin policy instead of randomly.

Pylint rules added to check pull requests; new pull requests need to comply with these pylint rules.
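
The difference between round-robin and random placement of trial jobs that need no GPU can be sketched in a few lines. This is an illustrative model only: the function name round_robin_assign and the plain string identifiers for trials and machines are hypothetical, not NNI's scheduler.

```python
import itertools
from typing import Iterable, Iterator, List, Tuple


def round_robin_assign(trials: Iterable[str], machines: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (trial, machine) pairs, cycling through machines in order.

    Unlike random choice, consecutive trials are spread evenly, so no machine
    receives two jobs before every other machine has received one.
    """
    if not machines:
        raise ValueError("at least one machine is required")
    rotation = itertools.cycle(machines)
    for trial in trials:
        yield trial, next(rotation)
```

With two machines and five trials, the first machine receives trials 1, 3 and 5 and the second receives trials 2 and 4.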

Web Portal & User Experience

Support adding customized trials.

Users can zoom in and out of detail graphs, except the Hyper-parameter graph.

Documentation

Improved NNI API documentation with more API docstring.

Bug fix

Fix the table sort issue when failed trials have no metrics. -Issue #1764

Maintain selected status(Maximal/Minimal) when the page switched. -PR #1710

Make hyper-parameters graph's default metric yAxis more accurate. -PR #1736

Fix GPU script permission issue. -Issue #1665

v1.1(Oct 23, 2019)
Release 1.1 - 10/23/2019

Major Features

New tuner: PPO Tuner

View stopped experiments

Tuners can now use dedicated GPU resource (see gpuIndices in tutorial for details)

Web UI improvements

Trials detail page can now list hyperparameters of each trial, as well as their start and end time (via "add column")

Viewing huge experiment is now less laggy

More examples

EfficientNet PyTorch example

Cifar10 NAS example

Model compression toolkit - Alpha release: We are glad to announce the alpha release of the model compression toolkit on top of NNI. It is still in the experimental phase and might evolve based on usage feedback. We'd like to invite you to use it, give feedback and even contribute.

Fixed Bugs

Multiphase job hangs when search space exhausted (issue #1204)

nnictl fails when log not available (issue #1548)

v1.0(Sep 2, 2019)
Release 1.0 - 09/02/2019

Major Features

Tuners and Assessors

Support Auto-Feature generator & selection -Issue #877 -PR #1387

Provide auto feature interface

Tuner based on beam search

Add Pakdd example

Add a parallel algorithm to improve the performance of TPE with large concurrency. -PR #1052

Support multiphase for hyperband -PR #1257

Training Service

Support private docker registry -PR #755

Engineering Improvements

Python wrapper for REST API, supporting programmatic retrieval of metric values -PR #1318

New Python API: get_experiment_id(), get_trial_id() -PR #1353 -Issue #1331 & -Issue #1368

Optimized NAS Searchspace -PR #1393

Unify NAS search space with _type -- "mutable_type"

Update random search tuner

Set gpuNum as optional -Issue #1365

Remove outputDir and dataDir configuration in PAI mode -Issue #1342

When creating a trial in Kubeflow mode, codeDir will no longer be copied to logDir -Issue #1224

Web Portal & User Experience

Show the best metric curve during search progress in WebUI -Issue #1218

Show the current number of parameters list in multiphase experiment -Issue #1210 -PR #1348

Add "Intermediate count" option in AddColumn. -Issue #1210

Support search parameters value in WebUI -Issue #1208

Enable automatic scaling of axes for metric value in default metric graph -Issue #1360

Add a detailed documentation link to the nnictl command in the command prompt -Issue #1260

UX improvement for showing Error log -Issue #1173

Documentation

Update the docs structure -Issue #1231

Multi phase document improvement -Issue #1233 -PR #1242

Add configuration example

WebUI description improvement -PR #1419

Bug fix

(Bug fix)Fix the broken links in 0.9 release -Issue #1236

(Bug fix)Script for auto-complete

(Bug fix)Fix pipeline issue that it only checks the exit code of the last command in a script. -PR #1417

(Bug fix)quniform for tuners -Issue #1377

(Bug fix)'quniform' has different meaning between GridSearch and other tuners. -Issue #1335

(Bug fix)"nnictl experiment list" gives the status of a "RUNNING" experiment as "INITIALIZED" -PR #1388

(Bug fix)SMAC cannot be installed if nni is installed in dev mode -Issue #1376

(Bug fix)The filter button of the intermediate result cannot be clicked -Issue #1263

(Bug fix)API "/api/v1/nni/trial-jobs/xxx" doesn't show a trial's all parameters in multiphase experiment -Issue #1258

(Bug fix)Succeeded trial doesn't have final result but webui show ×××(FINAL) -Issue #1207

(Bug fix)IT for nnictl stop -Issue #1298

(Bug fix)fix security warning

(Bug fix)Hyper-parameter page broken -Issue #1332

(Bug fix)Run flake8 tests to find Python syntax errors and undefined names -PR #1217
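
For context on the quniform fixes listed above, the sketch below shows the usual quantized-uniform definition: draw uniformly from [low, high], round to the nearest multiple of q, then clip back into the range. This is written from the common definition of the distribution and is an assumption about what tuners other than GridSearch compute. It is not copied from NNI's source, and the notes do not describe the GridSearch variant.

```python
import random


def quniform(low: float, high: float, q: float, rng: random.Random) -> float:
    """Sample clip(round(uniform(low, high) / q) * q, low, high).

    Assumed definition, for illustration only.
    """
    if q <= 0:
        raise ValueError("q must be positive")
    if low > high:
        raise ValueError("low must not exceed high")
    value = round(rng.uniform(low, high) / q) * q
    # Rounding can land outside the range (for example on 0 when low is 1),
    # so the result is clipped back into [low, high].
    return min(max(value, low), high)
```

With low=1, high=9 and q=4, rounding alone would yield 0, 4 or 8; clipping turns the 0 into 1, so a sample is always 1, 4 or 8.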

v0.9(Jul 1, 2019)
Release 0.9 - 7/1/2019

Major Features

General NAS programming interface

Add enas-mode and oneshot-mode for NAS interface: PR #1201

Gaussian Process Tuner with Matern kernel

Multiphase experiment supports

Added new training service support for multiphase experiment: PAI mode supports multiphase experiment since v0.9.

Added multiphase capability for the following builtin tuners:

TPE, Random Search, Anneal, Naïve Evolution, SMAC, Network Morphism, Metis Tuner.

For details, please refer to Write a tuner that leverages multi-phase

Web Portal

Enable trial comparison in Web Portal. For details, refer to View trials status

Allow users to adjust rendering interval of Web Portal. For details, refer to View Summary Page

Show intermediate results in a more user-friendly way. For details, refer to View trials status

Commandline Interface

nnictl experiment delete: delete one or all experiments, including logs, results, environment information and cache. Use it to delete unneeded experiment results or to save disk space.

nnictl platform clean: clean up disk space on a target platform. The provided YAML file includes the information of the target platform, and it follows the same schema as the NNI configuration file.

Bug fix and other changes

Tuner Installation Improvements: add sklearn to nni dependencies.

(Bug Fix) Failed to connect to PAI http code - Issue #1076

(Bug Fix) Validate file name for PAI platform - Issue #1164

(Bug Fix) Update GMM evaluation in Metis Tuner

(Bug Fix) Negative time number rendering in Web Portal - Issue #1182, Issue #1185

(Bug Fix) Hyper-parameter not shown correctly in WebUI when there is only one hyper parameter - Issue #1192

v0.8(Jun 5, 2019)
Release 0.8 - 6/4/2019

Major Features

Support NNI on Windows for PAI/Remote mode

NNI running on windows for remote mode

NNI running on windows for PAI mode

Advanced features for using GPU

Run multiple trial jobs on the same GPU for local and remote mode

Run trial jobs on the GPU running non-NNI jobs

Kubeflow v1beta2 operator

Support Kubeflow TFJob/PyTorchJob v1beta2

General NAS programming interface

Provide NAS programming interface for users to easily express their neural architecture search space through NNI annotation

Provide a new command nnictl trial codegen for debugging the NAS code

Tutorial of NAS programming interface, example of NAS on mnist, customized random tuner for NAS

Support resume tuner/advisor's state for experiment resume

For experiment resume, tuner/advisor will be resumed by replaying finished trial data
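
Resuming by replay means the tuner is not restored from a saved snapshot; a fresh instance is simply fed the finished trials again in order. A toy sketch of the idea follows. The ToyTuner class and resume_tuner function are hypothetical stand-ins, not NNI classes.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Trial = Tuple[Dict[str, float], float]  # (parameters, final metric)


class ToyTuner:
    """Hypothetical tuner whose whole state is the list of observed results."""

    def __init__(self) -> None:
        self.history: List[Trial] = []

    def receive_trial_result(self, params: Dict[str, float], metric: float) -> None:
        self.history.append((params, metric))

    def best_params(self) -> Optional[Dict[str, float]]:
        if not self.history:
            return None
        return max(self.history, key=lambda item: item[1])[0]


def resume_tuner(finished_trials: Iterable[Trial]) -> ToyTuner:
    """Rebuild tuner state by replaying finished trial data into a fresh tuner."""
    tuner = ToyTuner()
    for params, metric in finished_trials:
        tuner.receive_trial_result(params, metric)
    return tuner
```

The appeal of this design is that any tuner able to receive results can be resumed without a serialization format of its own, at the cost of reprocessing the trial history on startup.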

Web Portal

Improve the design of copying trial's parameters

Support 'randint' type in hyper-parameter graph

Use shouldComponentUpdate to avoid unnecessary render

Bug fix and other changes

Bug fix that nnictl update has inconsistent command styles

Support import data for SMAC tuner

Bug fix that experiment state transition from ERROR back to RUNNING

Fix bug of table entries

Nested search space refinement

Refine 'randint' type and support lower bound

Comparison of different hyper-parameter tuning algorithm

Comparison of NAS algorithm

NNI practice on Recommenders

v0.7(Apr 29, 2019)
Release 0.7 - 4/29/2019

Major Features

Support NNI on Windows

NNI running on windows for local mode

New advisor: BOHB

Support a new advisor, BOHB, a robust and efficient hyperparameter tuning algorithm that combines the advantages of Bayesian optimization and Hyperband

Support import and export experiment data through nnictl

Generate analysis results report after the experiment execution

Support import data to tuner and advisor for tuning

Designated gpu devices for NNI trial jobs

Specify GPU devices for NNI trial jobs with the gpuIndices configuration. If gpuIndices is set in the experiment configuration file, only the specified GPU devices are used for NNI trial jobs.
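
One common way to restrict a process to designated GPUs is the CUDA_VISIBLE_DEVICES environment variable. The sketch below shows how a gpuIndices value could be turned into a trial's environment. The comma-separated string format, the helper names, and the use of CUDA_VISIBLE_DEVICES are all assumptions for illustration; the notes do not state how NNI applies the setting internally.

```python
from typing import Dict, List, Mapping


def parse_gpu_indices(gpu_indices: str) -> List[int]:
    """Parse a comma-separated list such as "0,2" into [0, 2] (assumed format)."""
    indices = [int(token) for token in gpu_indices.split(",") if token.strip()]
    if not indices:
        raise ValueError("gpuIndices must name at least one device")
    if any(index < 0 for index in indices) or len(set(indices)) != len(indices):
        raise ValueError("gpuIndices must be distinct non-negative integers")
    return indices


def trial_environment(base_env: Mapping[str, str], gpu_indices: str) -> Dict[str, str]:
    """Return a copy of base_env in which only the designated GPUs are visible."""
    env = dict(base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in parse_gpu_indices(gpu_indices))
    return env
```

The returned mapping could be passed as the env argument of subprocess.Popen when launching a trial command.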

Web Portal enhancement

Decimal format of metrics other than default on the Web UI

Hints in WebUI about Multi-phase

Enable copy/paste for hyperparameters as python dict

Enable early stopped trials data for tuners.

nnictl provides better error messages

nnictl provides more meaningful error messages for YAML file format errors

Bug fix

Unable to kill all python threads after nnictl stop in async dispatcher mode

nnictl --version does not work with make dev-install

All trial jobs' status stays on 'waiting' for a long time on PAI platform

v0.6(Apr 2, 2019)
Release 0.6 - 4/2/2019

Major Features

Version checking

check whether the version is consistent between nniManager and trialKeeper

Report final metrics for early stop job

If includeIntermediateResults is true, the last intermediate result of the trial that is early stopped by assessor is sent to tuner as final result. The default value of includeIntermediateResults is false.
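
The rule for early-stopped trials can be written as a small decision function. This sketch only restates the behaviour described above; the function name, the "EARLY_STOPPED" status string and the argument layout are hypothetical.

```python
from typing import Optional, Sequence


def final_result_for_tuner(
    status: str,
    final_metric: Optional[float],
    intermediate_results: Sequence[float],
    include_intermediate_results: bool = False,  # default is false, as in the note
) -> Optional[float]:
    """Decide which value, if any, is reported to the tuner as the final result."""
    if final_metric is not None:
        return final_metric  # the trial reported its own final result
    if (
        status == "EARLY_STOPPED"  # assumed status name for assessor-stopped trials
        and include_intermediate_results
        and intermediate_results
    ):
        # Promote the last intermediate result to the final result.
        return intermediate_results[-1]
    return None  # nothing is sent to the tuner
```

With the default setting, a trial stopped by the assessor contributes nothing to the tuner; enabling the option lets the tuner learn from partially trained trials.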

Separate Tuner/Assessor

Adds two pipes to separate message receiving channels for tuner and assessor.

Make log collection feature configurable

Add intermediate result graph for all trials

Bug fix

Add shmMB config key for PAI

Fix the bug that doesn't show any result if metrics is dict

Fix the number calculation issue for float types in hyperband

Fix a bug in the search space conversion in SMAC tuner

Fix the WebUI issue when parsing experiment.json with illegal format

Fix cold start issue in Metis Tuner

v0.5.2.1(Mar 4, 2019)
Release 0.5.2.1 - 3/4/2019

Add release note.

Fix Metis tuner cold start issue.

v0.5.2(Mar 4, 2019)
Release 0.5.2 - 3/4/2019

Improvements

Curve fitting assessor performance improvement.

Documentation

Chinese version document: https://nni.readthedocs.io/zh/latest/

Debuggability/serviceability document: https://nni.readthedocs.io/en/latest/Tutorial/HowToDebug.html

Tuner assessor reference: https://nni.readthedocs.io/en/latest/sdk_reference.html#tuner

Bug Fixes and Other Changes

Fix a race condition bug that does not store trial job cancel status correctly.

Fix search space parsing error when using SMAC tuner.

Fix cifar10 example broken pipe issue.

Add unit test cases for nnimanager and local training service.

Add integration test azure pipelines for remote machine, PAI and kubeflow training services.

Support Pylon in PAI webhdfs client.

v0.5.1(Jan 31, 2019)
Release 0.5.1 - 1/31/2019

Improvements

Making log directory configurable

Support different levels of logs, making it easier for debugging

Documentation

Reorganized documentation & New Homepage Released: https://nni.readthedocs.io/en/latest/

Chinese users can learn NNI with the translated Chinese doc: https://github.com/microsoft/nni/blob/master/README_zh_CN.md. Dear Contributors: We'd love to provide more language translations, so please contribute translations of NNI into more languages =)

Bug Fixes and Other Changes

Fix the bug of installation in python virtualenv, and refactor the installation logic

Fix the bug of HDFS access failure on PAI mode after PAI is upgraded.

Fix the bug that sometimes in-place flushed stdout makes experiment crash
