Update 10/14
Follow up items:
- [x] add custom resolver to get lib versions.
- [x] clean up orphaned instances.
Update 09/30
A few items I want to follow up on separately (I will create PRs for these):
- The plugin cannot run on Python 3.6 (cloudpickle issue, same error message as #428; failed run here: https://app.circleci.com/pipelines/github/jieru-hu/hydra/611/workflows/0d99e3b0-1442-446f-857b-f476a7707b6d/jobs/6648 ). I tried a few things (updating the cloudpickle version, etc.) but was not able to resolve it.
- Nightly builds of the test AMIs. Right now the AMI is set as an env variable, which is annoying every time we need to update the AMI id; I want to automate this.
- Doc update. The doc needs to be refreshed a bit. I think it might be easier to create a separate PR for this.
Update 09/28
Summary of the changes:
1. Address omry's comments.
2. Changes to the integration test:
The goal is "no outbound traffic for the test instances." The barrier is the `pip install` and `conda create` commands we need to run while setting up the instance, which require us to open port 443 to all outbound traffic.
To get around this: 1) for conda and the dependency packages needed for starting the cluster, I created a base AMI that has everything pre-installed; 2) for Hydra-related packages, we build the wheels at test time and install them on the instance.
The upside is that we achieve "no outbound traffic for the instance"; the downside is that we need to update the AMI when dependencies change. To help with that, I created a script (`create_ami.py`) to automate building the AMI.
It would be good to build nightly AMIs and wheels; that's something I want to work on soon.
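A minimal sketch of the "open 443 only while the AMI is being built" pattern visible in the `create_ami.py` output. This is not the actual script: the helper names (`egress_command`, `temporary_https_egress`, `run_aws`) are hypothetical, and only the two `aws ec2` commands and the rule string are taken from the log.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

# Rule string copied from the create_ami.py log: HTTPS (443) to anywhere.
HTTPS_EGRESS = "IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges=[{CidrIp=0.0.0.0/0}]"


def egress_command(action: str, group_id: str) -> List[str]:
    """Build the `aws ec2 <action>-security-group-egress` command line."""
    return [
        "aws", "ec2", f"{action}-security-group-egress",
        "--group-id", group_id,
        "--ip-permissions", HTTPS_EGRESS,
    ]


def run_aws(cmd: List[str]) -> None:
    """Default runner (hypothetical): execute and fail on a non-zero exit."""
    subprocess.run(cmd, check=True)


@contextmanager
def temporary_https_egress(
    group_id: str, run: Callable[[List[str]], None] = run_aws
) -> Iterator[None]:
    """Open outbound 443 for the duration of the block, then always revoke it."""
    run(egress_command("authorize", group_id))
    try:
        yield
    finally:
        # Runs even if the setup step fails, so the rule is never left open.
        run(egress_command("revoke", group_id))
```

The instance setup (`ray up`, `setup_ami.py`) would go inside the `with temporary_https_egress(...)` block. Passing a different `run` callable makes it possible to check the command sequence without touching AWS.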
Output from running `create_ami.py`:
$ AWS_PROFILE=jieru python create_ami.py
2020-09-28 16:23:56.268051 - Running: aws ec2 authorize-security-group-egress --group-id sg-0a1 --ip-permissions IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges=[{CidrIp=0.0.0.0/0}]
2020-09-28 16:23:57.464861 -
2020-09-28 16:23:57.487688 - Running: ray up /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/tmpjw6lihef.yaml -y
2020-09-28 16:25:57.400029 - 2020-09-28 16:23:58,540 INFO cli_logger.py:388 -- Using cached config at /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/ray-config-d951a214f8602b878335411b5df6e84af463922b
2020-09-28 16:25:57.462567 - Running: ray rsync_up /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/tmpjw6lihef.yaml './setup_ami.py' '/home/ubuntu/'
2020-09-28 16:25:59.210039 - 2020-09-28 16:25:58,320 INFO cli_logger.py:388 -- Using cached config at /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/ray-config-d951a214f8602b878335411b5df6e84af463922b
2020-09-28 16:25:58,736 INFO cli_logger.py:388 -- NodeUpdater: i-0d82abc901a725abc: Syncing ./setup_ami.py to /home/ubuntu/...
2020-09-28 16:25:58,922 INFO log_timer.py:25 -- NodeUpdater: i-0d82abc901a725abc: Got IP [LogTimer=186ms]
building file list ... done
setup_ami.py
sent 692 bytes received 42 bytes 489.33 bytes/sec
total size is 1345 speedup is 1.83
Installing dependencies now, this may take a while...
2020-09-28 16:25:59.210121 - Running: ray exec /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/tmpjw6lihef.yaml 'python ./setup_ami.py'
...
Installing collected packages: typing-extensions, omegaconf
Successfully installed omegaconf-2.0.2 typing-extensions-3.7.4.3
2020-09-28 23:45:09.147903 - OUT: /home/ubuntu/anaconda3/envs/hydra_3.8.5/bin/pip install antlr4-python3-runtime==4.8
2020-09-28 23:45:09.927774 - OUT: Processing ./.cache/pip/wheels/c8/d0/ab/d43c02eaddc5b9004db86950802442ad9a26f279c619e28da0/antlr4_python3_runtime-4.8-py3-none-any.whl
Installing collected packages: antlr4-python3-runtime
Successfully installed antlr4-python3-runtime-4.8
2020-09-28 23:45:09.927847 - OUT: /home/ubuntu/anaconda3/envs/hydra_3.8.5/bin/pip install --ignore-installed PyYAML
2020-09-28 23:45:10.798501 - OUT: Processing ./.cache/pip/wheels/13/90/db/290ab3a34f2ef0b5a0f89235dc2d40fea83e77de84ed2dc05c/PyYAML-5.3.1-cp38-cp38-linux_x86_64.whl
Installing collected packages: PyYAML
Successfully installed PyYAML-5.3.1
Shared connection to 34.221.119.106 closed.
2020-09-28 16:45:11.007294 - Running: aws ec2 revoke-security-group-egress --group-id sg-0a1 --ip-permissions IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges=[{CidrIp=0.0.0.0/0}]
2020-09-28 16:45:12.047395 -
2020-09-28 16:45:12.047395 -
ec2.Image(id='ami-0c46') current state pending
ec2.Image(id='ami-0c46') current state pending
...
ami-0c46 ready for use now.
3. Skip the `-Werror` flag for the ray launcher. The tests fail with the flag; see the stack trace below (this is caused by ray, not the plugin itself). The solution is to add a `pytest.ini` in ray's tests dir to suppress the warnings.
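For illustration only, here is how an "error" warning filter and a more specific "ignore" filter interact in Python's standard `warnings` module. This is a sketch of the general mechanism, not the contents of the `pytest.ini`; the function names are hypothetical.

```python
import warnings


def emits_resource_warning() -> None:
    """Stand-in for library code that leaks a file handle."""
    warnings.warn("unclosed file <demo>", ResourceWarning)


def run_with_filters(ignore_resource_warnings: bool) -> str:
    with warnings.catch_warnings():
        # Equivalent in spirit to -Werror: every warning becomes an exception.
        warnings.simplefilter("error")
        if ignore_resource_warnings:
            # Filters added later are checked first, so this one wins
            # for ResourceWarning while other warnings still raise.
            warnings.filterwarnings("ignore", category=ResourceWarning)
        try:
            emits_resource_warning()
        except ResourceWarning:
            return "raised"
        return "suppressed"
```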
Stack trace
test_ray_local_launcher.py .[2020-09-28 21:52:46,553][HYDRA] Ray Launcher is launching 1 jobs, sweep output dir: /private/var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/pytest-of-jieru/pytest-25/test_sweep_1_job_ray_local_ove0
[2020-09-28 21:52:46,553][HYDRA] Initializing ray with config: {'num_cpus': 1, 'num_gpus': 0}
2020-09-28 21:52:46,564 INFO resource_spec.py:223 -- Starting Ray with 8.79 GiB memory available for workers and up to 4.41 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis-shard_0.err' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis-shard_0.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis-shard_0.out' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis-shard_0.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis.err' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis.out' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/gcs_server.out' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 748, in start_head_processes
self.start_gcs_server()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/gcs_server.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/gcs_server.err' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 748, in start_head_processes
self.start_gcs_server()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/gcs_server.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/monitor.out' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 750, in start_head_processes
self.start_monitor()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/monitor.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/monitor.err' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 750, in start_head_processes
self.start_monitor()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/monitor.err' mode='a' encoding='utf-8'>
F[2020-09-28 21:52:47,565][HYDRA] Ray Launcher is launching 2 jobs, sweep output dir: /private/var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/pytest-of-jieru/pytest-25/test_sweep_2_jobs_ray_local_ov0
[2020-09-28 21:52:47,565][HYDRA] Initializing ray with config: {'num_cpus': 1, 'num_gpus': 0}
2020-09-28 21:52:47,573 INFO resource_spec.py:223 -- Starting Ray with 8.74 GiB memory available for workers and up to 4.38 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis-shard_0.err' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis-shard_0.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis-shard_0.out' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis-shard_0.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis.err' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis.out' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/gcs_server.out' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 748, in start_head_processes
self.start_gcs_server()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/gcs_server.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/gcs_server.err' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 748, in start_head_processes
self.start_gcs_server()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/gcs_server.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/monitor.out' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 750, in start_head_processes
self.start_monitor()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/monitor.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/monitor.err' mode='ab' closefd=True>
Traceback (most recent call last):
File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 750, in start_head_processes
self.start_monitor()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/monitor.err' mode='a' encoding='utf-8'>
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/main.py", line 191, in wrap_session
INTERNALERROR> session.exitstatus = doit(config, session) or 0
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/main.py", line 247, in _main
INTERNALERROR> config.hook.pytest_runtestloop(session=session)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/hooks.py", line 286, in __call__
INTERNALERROR> return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 93, in _hookexec
INTERNALERROR> return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 84, in <lambda>
INTERNALERROR> self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 208, in _multicall
INTERNALERROR> return outcome.get_result()
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 80, in get_result
INTERNALERROR> raise ex[1].with_traceback(ex[2])
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 187, in _multicall
INTERNALERROR> res = hook_impl.function(*args)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/main.py", line 272, in pytest_runtestloop
INTERNALERROR> item.config.hook.pytest_runtest_protocol(item=item, nextitem=nextitem)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/hooks.py", line 286, in __call__
INTERNALERROR> return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 93, in _hookexec
INTERNALERROR> return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 84, in <lambda>
INTERNALERROR> self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 208, in _multicall
INTERNALERROR> return outcome.get_result()
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 80, in get_result
INTERNALERROR> raise ex[1].with_traceback(ex[2])
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 187, in _multicall
INTERNALERROR> res = hook_impl.function(*args)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/runner.py", line 85, in pytest_runtest_protocol
INTERNALERROR> runtestprotocol(item, nextitem=nextitem)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/runner.py", line 100, in runtestprotocol
INTERNALERROR> reports.append(call_and_report(item, "call", log))
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/runner.py", line 188, in call_and_report
INTERNALERROR> report = hook.pytest_runtest_makereport(item=item, call=call)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/hooks.py", line 286, in __call__
INTERNALERROR> return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 93, in _hookexec
INTERNALERROR> return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 84, in <lambda>
INTERNALERROR> self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 203, in _multicall
INTERNALERROR> gen.send(outcome)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/skipping.py", line 129, in pytest_runtest_makereport
INTERNALERROR> rep = outcome.get_result()
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 80, in get_result
INTERNALERROR> raise ex[1].with_traceback(ex[2])
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 187, in _multicall
INTERNALERROR> res = hook_impl.function(*args)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/runner.py", line 260, in pytest_runtest_makereport
INTERNALERROR> return TestReport.from_item_and_call(item, call)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/reports.py", line 294, in from_item_and_call
INTERNALERROR> longrepr = item.repr_failure(excinfo)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/python.py", line 1511, in repr_failure
INTERNALERROR> return self._repr_failure_py(excinfo, style=style)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/nodes.py", line 355, in _repr_failure_py
INTERNALERROR> return excinfo.getrepr(
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 635, in getrepr
INTERNALERROR> return fmt.repr_excinfo(self)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 880, in repr_excinfo
INTERNALERROR> reprtraceback = self.repr_traceback(excinfo_)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 824, in repr_traceback
INTERNALERROR> reprentry = self.repr_traceback_entry(entry, einfo)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 774, in repr_traceback_entry
INTERNALERROR> source = self._getentrysource(entry)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 685, in _getentrysource
INTERNALERROR> source = entry.getsource(self.astcache)
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 246, in getsource
INTERNALERROR> astnode, _, end = getstatementrange_ast(
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/source.py", line 384, in getstatementrange_ast
INTERNALERROR> astnode = ast.parse(content, "source", "exec")
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/ast.py", line 47, in parse
INTERNALERROR> return compile(source, filename, mode, flags,
INTERNALERROR> File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/worker.py", line 869, in sigterm_handler
INTERNALERROR> sys.exit(signum)
INTERNALERROR> SystemExit: 15
mainloop: caught unexpected SystemExit!
4. Install conda in the CircleCI Linux docker image
Previously we pinned the CircleCI Linux docker image to python:3.8. However, the image runs Python 3.8.6, which is not yet available to install in conda. As a result, the ray launcher tests fail (cloudpickle requires the exact same version of Python on the pickling and unpickling sides).
To be consistent with how tests are run on macOS and Windows, I added the miniconda installation for Linux machines as well. The installation takes a few seconds, so I didn't add a cache for it.
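A small sketch of the kind of check this constraint implies: compare the full `major.minor.micro` version on the local side with the one reported by the remote side before launching. The function names are hypothetical, and the parser assumes a plain `X.Y.Z` version (no `rc` or `+` suffixes).

```python
import sys
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'Python 3.8.5' or '3.8.5' into (3, 8, 5)."""
    number = text.strip().split()[-1]
    major, minor, micro = (int(part) for part in number.split(".")[:3])
    return major, minor, micro


def same_python(local: str, remote: str) -> bool:
    """True only when both sides agree down to the micro version."""
    return parse_version(local) == parse_version(remote)


def local_python_version() -> str:
    """Version of the interpreter running this code, as 'X.Y.Z'."""
    return ".".join(str(n) for n in sys.version_info[:3])
```

For example, `same_python("3.8.6", "Python 3.8.5")` is false, which is the mismatch described above between the docker image and the conda environment on the instance.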
Update 09/21
Now that #815 has finally landed, this PR is unblocked! This is my priority this week.
TODO items.
- [x] rebase onto latest master
- [x] upload latest wheels to S3 for installation during integration tests.
In order to upload and install the latest wheels in the integration tests, I want to:
- list all the plugins that are going to be tested (by reading the `PLUGINS` env variable), build wheels, and scp them all to the EC2 instance.
- install all the wheels on the EC2 instance.
This way, we can remove all outbound traffic from the testing EC2 instances.
- [x] Address omry's comments
- [ ] Update the CircleCI test user creds and finish all the TODO items outlined in the proposal quip.
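The wheel step in the TODO list above could be sketched roughly as follows. The comma-separated format of `PLUGINS` and the helper names are assumptions; `pip wheel --no-deps --wheel-dir` and `pip install --no-index --find-links` are standard pip options. Because `--no-deps` is used, third-party dependencies are assumed to be pre-installed on the instance already.

```python
import sys
from pathlib import Path
from typing import List


def selected_plugins(env_value: str) -> List[str]:
    """Split a comma-separated PLUGINS value (the format is an assumption)."""
    return [name.strip() for name in env_value.split(",") if name.strip()]


def build_wheel_command(plugin_dir: Path, wheel_dir: Path) -> List[str]:
    """Build a wheel for the project in plugin_dir and place it in wheel_dir."""
    return [
        sys.executable, "-m", "pip", "wheel", "--no-deps",
        "--wheel-dir", str(wheel_dir), str(plugin_dir),
    ]


def offline_install_command(wheel_dir: str, package: str) -> List[str]:
    """Install from local wheels only; --no-index keeps pip off the network."""
    return ["pip", "install", "--no-index", "--find-links", wheel_dir, package]
```

The build commands run on the test driver; the install command is what would be executed on the instance after the wheels have been copied over, so the instance itself never needs to reach a package index.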
Update 09/08
edit: moved the TODO items to the latest update.
Update 09/01
This PR has been blocked by #815. Now that we've figured out a good solution for #815, I will go ahead and get #815 in first and then circle back here.
Also, I'm going to create data classes for the ray init, ray remote, and boto configs (we will only add typing for common boto fields, and the boto config will extend Dict[str, Any]).
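A rough sketch of what such data classes might look like. The class names and the `region` field are hypothetical; `num_cpus` and `num_gpus` are used only because they appear in the launcher's log output. The final config classes may differ.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class RayInitConf:
    """Typed subset of the arguments passed to ray.init()."""
    num_cpus: int = 1
    num_gpus: int = 0


@dataclass
class RayRemoteConf:
    """Typed subset of the arguments passed to ray.remote()."""
    num_cpus: int = 1
    num_gpus: int = 0


@dataclass
class BotoConf(Dict[str, Any]):
    """Common boto fields are typed; anything else can be stored as a dict key."""
    region: str = "us-west-2"  # hypothetical example field


if __name__ == "__main__":
    print(asdict(RayInitConf()))
    boto = BotoConf()
    boto["SomeUntypedOption"] = True  # untyped extras live in the dict part
    print(boto.region, dict(boto))
```

Extending `Dict[str, Any]` keeps the config open-ended, so less common boto options do not each need a typed field.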
For the integration tests to work on the latest code, we are going to build wheels and upload them to S3 with each integration test run. This will likely be a separate PR.
Motivation
This is built on https://github.com/facebookresearch/hydra/pull/515. Sorry, I had to open a new pull request; I still need to figure out a better workflow for forking and syncing.
This PR addresses some comments from PR 515:
- Add local mode for ray launcher
- Add docker options
- refactor the Launcher class, group all file syncing together.
Plan to add in next PR(s):
- Add Integration test
- Add an option for users to update the cluster if they need.
- Remote cluster return results to laptop.
- make _dump_func_params better/less hard coded.
Update 04/10
Add integration tests.
`ray up` automatically updates the cluster, so there is no need for us to provide an update option.
JobReturns are now copied back to the laptop.
Refactor _dump_func_params a bit more.
Next:
- Fix LOCAL mode to run RAY directly.
- Add integration tests for both LOCAL and AWS mode.
Test Plan
Integration tests
Run launcher in both local and AWS mode
Run the
Related Issues and PRs