Describe the issue:
I am trying the PyTorch example at https://nni.readthedocs.io/zh/stable/tutorials/hpo_quickstart_pytorch/model.html
Installation succeeds and the web page loads fine, but after about 10 seconds the program exits with the following output:
[2023-01-02 01:56:47] Creating experiment, Experiment ID: 6j50nacv
[2023-01-02 01:56:47] Starting web server...
[2023-01-02 01:56:48] Setting up...
[2023-01-02 01:56:48] Web portal URLs: http://127.0.0.1:8080 http://172.18.36.113:8080
node:internal/fs/watchers:252
throw error;
^
Error: ENOSPC: System limit for number of file watchers reached, watch '/home/linux_username/nni-experiments/6j50nacv/trials/gnav8/.nni/metrics'
at FSWatcher.<computed> (node:internal/fs/watchers:244:19)
at Object.watch (node:fs:2251:34)
at TailStream.waitForMoreData (/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni_node/node_modules/tail-stream/index.js:123:31)
at TailStream.<anonymous> (/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni_node/node_modules/tail-stream/index.js:275:22)
at FSReqCallback.wrapper [as oncomplete] (node:fs:660:5) {
errno: -28,
syscall: 'watch',
code: 'ENOSPC',
path: '/home/linux_username/nni-experiments/6j50nacv/trials/gnav8/.nni/metrics',
filename: '/home/linux_username/nni-experiments/6j50nacv/trials/gnav8/.nni/metrics'
}
Thrown at:
at __node_internal_captureLargerStackTrace (node:internal/errors:464:5)
at __node_internal_uvException (node:internal/errors:521:10)
at FSWatcher.<computed> (node:internal/fs/watchers:244:19)
at watch (node:fs:2251:34)
at TailStream.waitForMoreData (/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni_node/node_modules/tail-stream/index.js:123:31)
at /home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni_node/node_modules/tail-stream/index.js:275:22
at wrapper (node:fs:660:5)
Traceback (most recent call last):
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/__main__.py", line 85, in <module>
main()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/__main__.py", line 61, in main
dispatcher.run()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/msg_dispatcher_base.py", line 69, in run
command, data = self._channel._receive()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/tuner_command_channel/channel.py", line 94, in _receive
command = self._retry_receive()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/tuner_command_channel/channel.py", line 104, in _retry_receive
self._channel.connect()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 62, in connect
self._ws = _wait(_connect_async(self._url))
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 111, in _wait
return future.result()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/runtime/tuner_command_channel/websocket.py", line 125, in _connect_async
return await websockets.connect(url, max_size=None) # type: ignore
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/websockets/legacy/client.py", line 659, in __await_impl_timeout__
return await asyncio.wait_for(self.__await_impl__(), self.open_timeout)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/tasks.py", line 479, in wait_for
return fut.result()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/websockets/legacy/client.py", line 663, in __await_impl__
_transport, _protocol = await self._create_connection()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/base_events.py", line 1065, in create_connection
raise exceptions[0]
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/base_events.py", line 1050, in create_connection
sock = await self._connect_sock(
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/base_events.py", line 961, in _connect_sock
await self.sock_connect(sock, address)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/selector_events.py", line 500, in sock_connect
return await fut
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/asyncio/selector_events.py", line 535, in _sock_connect_cb
raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 8080)
Traceback (most recent call last):
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1331, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1280, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1040, in _send_output
self.send(msg)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 980, in send
self.connect()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f9d7c6042e0>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9d7c6042e0>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/linux_username/yecanming/repo/new_things_exploring/tunning_parameter/nni/nni_main.py", line 23, in <module>
experiment.run(8080)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/experiment.py", line 183, in run
self._wait_completion()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/experiment.py", line 163, in _wait_completion
status = self.get_status()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/experiment.py", line 283, in get_status
resp = rest.get(self.port, '/check-status', self.url_prefix)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/rest.py", line 43, in get
return request('get', port, api, prefix=prefix)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/rest.py", line 31, in request
resp = requests.request(method, url, timeout=timeout)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/adapters.py", line 565, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9d7c6042e0>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2023-01-02 01:57:08] Stopping experiment, please wait...
[2023-01-02 01:57:08] ERROR: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/experiment (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9d929a5c40>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1331, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1280, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 1040, in _send_output
self.send(msg)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/http/client.py", line 980, in send
self.connect()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f9d929a5c40>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/experiment (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9d929a5c40>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/experiment.py", line 143, in _stop_impl
rest.delete(self.port, '/experiment', self.url_prefix)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/rest.py", line 52, in delete
request('delete', port, api, prefix=prefix)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/nni/experiment/rest.py", line 31, in request
resp = requests.request(method, url, timeout=timeout)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/home/linux_username/anaconda3/envs/torch/lib/python3.9/site-packages/requests/adapters.py", line 565, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/experiment (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9d929a5c40>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2023-01-02 01:57:08] WARNING: Cannot gracefully stop experiment, killing NNI process...
[2023-01-02 01:57:08] Experiment stopped
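The first error in the log above is `ENOSPC: System limit for number of file watchers reached`; the later `Connection refused` errors appear after it. On Linux this message usually refers to the per-user inotify watch limit, though I have not confirmed that this is the cause here. A minimal sketch for checking the current limit, assuming the standard `/proc/sys/fs/inotify/max_user_watches` location; the helper name `read_inotify_watch_limit` is hypothetical and not part of NNI:

```python
from pathlib import Path
from typing import Optional

# Standard Linux location of the per-user inotify watch limit.
# Assumption: this path exists on the machine running the experiment.
INOTIFY_LIMIT_PATH = Path("/proc/sys/fs/inotify/max_user_watches")


def read_inotify_watch_limit(path: Path = INOTIFY_LIMIT_PATH) -> Optional[int]:
    """Return the inotify watch limit, or None if it cannot be read or parsed."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing (non-Linux system, restricted container) or unexpected content.
        return None


if __name__ == "__main__":
    limit = read_inotify_watch_limit()
    if limit is None:
        print("Could not read fs.inotify.max_user_watches on this system")
    else:
        print(f"fs.inotify.max_user_watches = {limit}")
```

If the reported value is small, raising `fs.inotify.max_user_watches` with `sysctl` (which needs root) may be worth trying before rerunning the experiment.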
Environment:
- NNI version: 2.10
- Training service (local|remote|pai|aml|etc): local
- Client OS: Linux qaz 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Server OS (for remote mode only):
- Python version: Python 3.9.13
- PyTorch/TensorFlow version: '1.13.0+cu116'
- Is conda/virtualenv/venv used?: conda is used
- Is running in Docker?: no
Configuration:
- Experiment config (remember to remove secrets!):
experiment.config.trial_command = 'python run_nn.py'
experiment.config.trial_code_directory = '.'
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
experiment.config.max_trial_number = 10
experiment.config.trial_concurrency = 2
experiment.config.max_experiment_duration = '1h'
Log message:
- nnimanager.log: not attached; I could not find the tutorial at "https://github.com/microsoft/nni/blob/master/docs/en_US/Tutorial/HowToDebug.md#experiment-root-director" that explains where to find it
- dispatcher.log:
- nnictl stdout and stderr:
How to reproduce it?: