The fastai deep learning library

fast.ai

Last update: Jan 7, 2023

Overview

Welcome to fastai

fastai simplifies training fast and accurate neural nets using modern best practices

Important: This documentation covers fastai v2, which is a from-scratch rewrite of fastai. The v1 documentation has moved to fastai1.fast.ai. To stop fastai from updating to v2, run in your terminal echo 'fastai 1.*' >> $CONDA_PREFIX/conda-meta/pinned (if you use conda).

Installing

You can use fastai without any installation by using Google Colab. In fact, every page of this documentation is also available as an interactive notebook: click "Open in colab" at the top of any page to open it (be sure to change the Colab runtime to "GPU" to have it run fast!). See the fast.ai documentation on Using Colab for more information.

You can install fastai on your own machines with conda (highly recommended). If you're using Anaconda then run:

conda install -c fastai -c pytorch -c anaconda fastai gh anaconda

...or if you're using miniconda, then run:

conda install -c fastai -c pytorch fastai

To install with pip, use: pip install fastai. If you install with pip, you should install PyTorch first by following the PyTorch installation instructions.

If you plan to develop fastai yourself, or want to be on the cutting edge, you can use an editable install (if you do this, you should also use an editable install of fastcore to go with it):

git clone https://github.com/fastai/fastai
pip install -e "fastai[dev]"

Learning fastai

The best way to get started with fastai (and deep learning) is to read the book, and complete the free course.

To see what's possible with fastai, take a look at the Quick Start, which shows how to use around 5 lines of code to build an image classifier, an image segmentation model, a text sentiment model, a recommendation system, and a tabular model. For each of the applications, the code is much the same.

Read through the Tutorials to learn how to train your own models on your own datasets. Use the navigation sidebar to look through the fastai documentation. Every class, function, and method is documented here.

To learn about the design and motivation of the library, read the peer reviewed paper.

About fastai

fastai is a deep learning library which provides practitioners with high-level components that can quickly and easily provide state-of-the-art results in standard deep learning domains, and provides researchers with low-level components that can be mixed and matched to build new approaches. It aims to do both things without substantial compromises in ease of use, flexibility, or performance. This is possible thanks to a carefully layered architecture, which expresses common underlying patterns of many deep learning and data processing techniques in terms of decoupled abstractions. These abstractions can be expressed concisely and clearly by leveraging the dynamism of the underlying Python language and the flexibility of the PyTorch library. fastai includes:

A new type dispatch system for Python along with a semantic type hierarchy for tensors
A GPU-optimized computer vision library which can be extended in pure Python
An optimizer which refactors out the common functionality of modern optimizers into two basic pieces, allowing optimization algorithms to be implemented in 4–5 lines of code
A novel 2-way callback system that can access any part of the data, model, or optimizer and change it at any point during training
A new data block API
And much more...
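
The optimizer item in the list above can be illustrated with a toy, pure-Python sketch. It assumes the two pieces are what the fastai paper calls stats (functions that track running statistics) and steppers (functions that update the weights). The names `average_grad`, `momentum_step`, and `ToyOptimizer` are hypothetical, and plain floats stand in for tensors, so this is not fastai's implementation.

```python
# Toy illustration only: plain floats stand in for tensors, and every name
# here is hypothetical rather than part of the fastai API.

def average_grad(state, grad, mom=0.9, **hypers):
    """Stat: keep a running (momentum) sum of the gradient."""
    state["grad_avg"] = mom * state.get("grad_avg", 0.0) + grad
    return state

def momentum_step(param, state, lr=0.1, **hypers):
    """Stepper: move the parameter against the accumulated gradient."""
    return param - lr * state["grad_avg"]

class ToyOptimizer:
    """Runs every stat, then every stepper, for each parameter."""
    def __init__(self, params, stats, steppers, **hypers):
        self.params, self.stats, self.steppers = list(params), stats, steppers
        self.hypers = hypers
        self.state = [{} for _ in self.params]

    def step(self, grads):
        for i, grad in enumerate(grads):
            for stat in self.stats:
                self.state[i] = stat(self.state[i], grad, **self.hypers)
            for stepper in self.steppers:
                self.params[i] = stepper(self.params[i], self.state[i], **self.hypers)

# Minimise f(w) = w**2, whose gradient is 2*w.
opt = ToyOptimizer([1.0], stats=[average_grad], steppers=[momentum_step], lr=0.1, mom=0.5)
for _ in range(50):
    opt.step([2 * w for w in opt.params])
```

In this sketch, SGD with momentum is just one stat plus one stepper; a different algorithm would swap in other small functions without touching the loop.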

fastai is organized around two main design goals: to be approachable and rapidly productive, while also being deeply hackable and configurable. It is built on top of a hierarchy of lower-level APIs which provide composable building blocks. This way, a user wanting to rewrite part of the high-level API or add particular behavior to suit their needs does not have to learn how to use the lowest level.

Migrating from other libraries

It's very easy to migrate from plain PyTorch, Ignite, or any other PyTorch-based library, or even to use fastai in conjunction with other libraries. Generally, you'll be able to use all your existing data processing code, but will be able to reduce the amount of code you require for training, and more easily take advantage of modern best practices. The fastai documentation includes migration guides from some popular libraries to help you on your way.

Tests

To run the tests in parallel, launch:

nbdev_test_nbs or make test

For all the tests to pass, you'll need to install the following optional dependencies:

pip install "sentencepiece<0.1.90" wandb tensorboard albumentations pydicom opencv-python scikit-image pyarrow kornia \
    catalyst captum neptune-cli

Tests are written using nbdev; for example, see the documentation for test_eq.

Contributing

After you clone this repository, please run nbdev_install_git_hooks in your terminal. This sets up git hooks that clean up the notebooks by removing extraneous data stored in them (e.g. which cells you ran), which would otherwise cause unnecessary merge conflicts.

Before submitting a PR, check that the local library and notebooks match. The script nbdev_diff_nbs can let you know if there is a difference between the local library and the notebooks.

If you made a change to the notebooks in one of the exported cells, you can export it to the library with nbdev_build_lib or make fastai.
If you made a change to the library, you can export it back to the notebooks with nbdev_update_lib.

Docker Containers

For those interested in official docker containers for this project, they can be found here.

Comments

System halted when calling `model.fit`

I tried the lesson1 and lesson4-imdb Jupyter notebooks; however, whenever I tried to train a model (calling the fit method), the system halted and then rebooted.

I tried to debug this myself, checked all the system logs, and searched for suspicious logs via everything, but none of them seem to record the error details.

I noticed that Anaconda installed cudnn-7.1.4-cuda9.0_0; perhaps it conflicts with the current CUDA installation?

The error cell in lesson1

arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 2)

The error cell in lesson4-imdb

learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

Running the error cells causes the system to halt and reboot.

os: windows 10
jupyter lab: 0.32.1
notebook: 5.5.0
cuda: 9.0.176

cudnn1(installed before): cudnn-9.0-windows10-x64-v7.2.1.38
cudnn2(installed by anaconda): cudnn-7.1.4-cuda9.0_0
- E:\Anaconda3\pkgs\cudnn-7.1.4-cuda9.0_0

packages:

alabaster==0.7.11
appdirs==1.4.3
asn1crypto==0.24.0
astroid==2.0.4
atomicwrites==1.2.1
attrs==18.2.0
Automat==0.7.0
Babel==2.6.0
backcall==0.1.0
bcolz==1.2.1
beautifulsoup4==4.6.3
bleach==2.1.4
bokeh==0.13.0
certifi==2018.8.24
cffi==1.11.5
chardet==3.0.4
click==6.7
click-plugins==1.0.3
cliff==2.8.2
cligj==0.4.0
cloudpickle==0.5.5
cmd2==0.9.4
colorama==0.3.9
configparser==3.5.0
constantly==15.1.0
cryptography==2.3.1
cryptography-vectors==2.3.1
cssselect==1.0.3
cycler==0.10.0
cymem==1.31.2
cytoolz==0.9.0.1
dask==0.19.0
decorator==4.3.0
descartes==1.1.0
dill==0.2.8.2
distributed==1.23.0
docutils==0.14
en-core-web-sm==2.0.0
entrypoints==0.2.3
feather-format==0.4.0
feedparser==5.2.1
Fiona==1.7.10
GDAL==2.2.2
geopandas==0.4.0
graphviz==0.9
h5py==2.8.0rc1
heapdict==1.0.0
html5lib==1.0.1
hyperlink==17.3.1
idna==2.7
imagesize==1.1.0
incremental==17.5.0
ipykernel==4.9.0
ipython==6.5.0
ipython-genutils==0.2.0
ipywidgets==7.4.1
isort==4.3.4
isoweek==1.3.3
jedi==0.12.1
Jinja2==2.10
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==5.2.0
jupyter-contrib-core==0.3.3
jupyter-contrib-nbextensions==0.5.0
jupyter-core==4.4.0
jupyter-highlight-selected-word==0.2.0
jupyter-latex-envs==1.4.4
jupyter-nbextensions-configurator==0.4.0
kaggle-cli==0.12.13
keyring==13.2.1
kiwisolver==1.0.1
lazy-object-proxy==1.3.1
locket==0.2.0
lxml==4.0.0
MarkupSafe==1.0
matplotlib==2.2.3
mccabe==0.6.1
MechanicalSoup==0.8.0
mistune==0.8.3
mizani==0.4.6
mkl-fft==1.0.6
mkl-random==1.0.1
more-itertools==4.3.0
msgpack==0.5.6
msgpack-numpy==0.4.3.1
munch==2.3.2
murmurhash==0.28.0
nbconvert==5.3.1
nbformat==4.4.0
notebook==5.6.0
numexpr==2.6.6
numpy==1.15.1
numpydoc==0.8.0
olefile==0.45.1
opencv-python==3.4.2.17
packaging==17.1
palettable==3.1.1
pandas==0.23.4
pandas-summary==0.0.5
pandocfilters==1.4.2
parso==0.3.1
partd==0.3.8
path.py==11.0.1
patsy==0.5.0
pbr==4.2.0
pexpect==4.6.0
pickleshare==0.7.4
Pillow==5.2.0
plac==0.9.6
plotnine==0.4.0
pluggy==0.7.1
preshed==1.0.0
prettytable==0.7.2
progressbar2==3.34.3
prometheus-client==0.3.0
prompt-toolkit==1.0.15
psutil==5.4.7
py==1.6.0
pyarrow==0.10.0
pyasn1==0.4.4
pyasn1-modules==0.2.1
pycodestyle==2.4.0
pycparser==2.18
pyflakes==2.0.0
Pygments==2.2.0
PyHamcrest==1.9.0
pylint==2.1.1
pyOpenSSL==18.0.0
pyparsing==2.2.0
pyperclip==1.6.4
pyproj==1.9.5.1
pyreadline==2.1
PySocks==1.6.8
pytest==3.7.4
python-dateutil==2.7.3
python-utils==2.3.0
pytz==2018.5
pywin32==223
pywinpty==0.5.4
PyYAML==3.13
pyzmq==17.1.2
QtAwesome==0.4.4
qtconsole==4.4.1
QtPy==1.5.0
regex==2017.11.9
requests==2.19.1
rope==0.11.0
scikit-learn==0.19.2
scipy==1.1.0
seaborn==0.9.0
Send2Trash==1.5.0
service-identity==17.0.0
Shapely==1.6.4.post2
simplegeneric==0.8.1
six==1.11.0
sklearn-pandas==1.7.0
snowballstemmer==1.2.1
sortedcontainers==2.0.4
spacy==2.0.12
Sphinx==1.7.8
sphinxcontrib-websupport==1.1.0
spyder==3.3.1
spyder-kernels==0.2.6
statsmodels==0.9.0
stevedore==1.29.0
tables==3.4.4
tblib==1.3.2
termcolor==1.1.0
terminado==0.8.1
testfixtures==6.3.0
testpath==0.3.1
thinc==6.10.3
toolz==0.9.0
torch==0.4.1
torchtext==0.2.3
torchvision==0.2.1
tornado==4.5.3
tqdm==4.24.0
traitlets==4.3.2
Twisted==18.7.0
typed-ast==1.1.0
ujson==1.35
urllib3==1.23
wcwidth==0.1.7
webencodings==0.5.1
widgetsnbextension==3.4.1
win-inet-pton==1.0.1
wincertstore==0.2
wrapt==1.10.11
zict==0.1.3
zope.interface==4.5.0

opened by geekan 68

RuntimeError: received 0 items of ancdata

I'm running into an issue when trying to predict with the dn models. From what I've researched, it may be related to this PyTorch issue: https://github.com/pytorch/pytorch/issues/973, where the workaround was setting the number of workers to 0. I tried setting num_workers on ImageClassifierData to 0, but that didn't solve the issue for me, so if anybody else has encountered this or knows the right way to set the number of workers to 0, please let me know. I don't know if anything can be done on the fastai side since it appears to be a PyTorch problem, but I figured it's at least worth documenting in case anybody has ideas to look into.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-13-c94a818ff72b> in <module>()
     10     learn[i].fit(0.01, 3, cycle_len=1, cycle_mult=4)
     11 
---> 12     test_predictions = learn[i].predict(is_test=True)
     13 
     14     #tmp_log_preds,tmp_y = learn[i].TTA(is_test=True, n_aug=50)

~/fastaip1v2/fastai/courses/dl1/fastai/learner.py in predict(self, is_test)
    136         self.load('tmp')
    137 
--> 138     def predict(self, is_test=False): return self.predict_with_targs(is_test)[0]
    139 
    140     def predict_with_targs(self, is_test=False):

~/fastaip1v2/fastai/courses/dl1/fastai/learner.py in predict_with_targs(self, is_test)
    140     def predict_with_targs(self, is_test=False):
    141         dl = self.data.test_dl if is_test else self.data.val_dl
--> 142         return predict_with_targs(self.model, dl)
    143 
    144     def predict_dl(self, dl): return predict_with_targs(self.model, dl)[0]

~/fastaip1v2/fastai/courses/dl1/fastai/model.py in predict_with_targs(m, dl)
    115     if hasattr(m, 'reset'): m.reset()
    116     preda,targa = zip(*[(get_prediction(m(*VV(x))),y)
--> 117                         for *x,y in iter(dl)])
    118     return to_np(torch.cat(preda)), to_np(torch.cat(targa))
    119 

~/fastaip1v2/fastai/courses/dl1/fastai/model.py in <listcomp>(.0)
    114     m.eval()
    115     if hasattr(m, 'reset'): m.reset()
--> 116     preda,targa = zip(*[(get_prediction(m(*VV(x))),y)
    117                         for *x,y in iter(dl)])
    118     return to_np(torch.cat(preda)), to_np(torch.cat(targa))

~/fastaip1v2/fastai/courses/dl1/fastai/dataset.py in __next__(self)
    226         if self.i>=len(self.dl): raise StopIteration
    227         self.i+=1
--> 228         return next(self.it)
    229 
    230     @property

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    193         while True:
    194             assert (not self.shutdown and self.batches_outstanding > 0)
--> 195             idx, batch = self.data_queue.get()
    196             self.batches_outstanding -= 1
    197             if idx != self.rcvd_idx:

~/anaconda3/envs/fastai/lib/python3.6/multiprocessing/queues.py in get(self)
    335             res = self._reader.recv_bytes()
    336         # unserialize the data after having released the lock
--> 337         return _ForkingPickler.loads(res)
    338 
    339     def put(self, obj):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/multiprocessing/reductions.py in rebuild_storage_fd(cls, df, size)
     68         fd = multiprocessing.reduction.rebuild_handle(df)
     69     else:
---> 70         fd = df.detach()
     71     try:
     72         storage = storage_from_cache(cls, fd_id(fd))

~/anaconda3/envs/fastai/lib/python3.6/multiprocessing/resource_sharer.py in detach(self)
     56             '''Get the fd.  This should only be called once.'''
     57             with _resource_sharer.get_connection(self._id) as conn:
---> 58                 return reduction.recv_handle(conn)
     59 
     60 

~/anaconda3/envs/fastai/lib/python3.6/multiprocessing/reduction.py in recv_handle(conn)
    180         '''Receive a handle over a local connection.'''
    181         with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
--> 182             return recvfds(s, 1)[0]
    183 
    184     def DupFd(fd):

~/anaconda3/envs/fastai/lib/python3.6/multiprocessing/reduction.py in recvfds(sock, size)
    159             if len(ancdata) != 1:
    160                 raise RuntimeError('received %d items of ancdata' %
--> 161                                    len(ancdata))
    162             cmsg_level, cmsg_type, cmsg_data = ancdata[0]
    163             if (cmsg_level == socket.SOL_SOCKET and

RuntimeError: received 0 items of ancdata

opened by kevinbird15 46

cannot instantiate 'WindowsPath' on your system

Describe the bug

Installed fastai on the python:3.7 image but it fails to load the model:

    empty_data = ImageDataBunch.load_empty(modelPath)
  File "/usr/local/lib/python3.7/site-packages/fastai/data_block.py", line 649, in _databunch_load_empty
    sd = LabelLists.load_empty(path, fn=fname)
  File "/usr/local/lib/python3.7/site-packages/fastai/data_block.py", line 513, in load_empty
    state = pickle.load(open(path/fn, 'rb'))
  File "/usr/local/lib/python3.7/pathlib.py", line 997, in __new__
    % (cls.__name__,))
NotImplementedError: cannot instantiate 'WindowsPath' on your system

Provide your installation details

=== Software ===
python        : 3.7.2
fastai        : 1.0.40
fastprogress  : 0.1.18
torch         : 1.0.0
torch cuda    : 9.0.176 / is **Not available**

=== Hardware ===
No GPUs available

=== Environment ===
platform      : Linux-4.9.125-linuxkit-x86_64-with-debian-9.6
distro        : #1 SMP Fri Sep 7 08:20:28 UTC 2018
conda env     : Unknown
python        : /usr/local/bin/python
sys.path      :
/usr/local/lib/python37.zip
/usr/local/lib/python3.7
/usr/local/lib/python3.7/lib-dynload
/usr/local/lib/python3.7/site-packages
no supported gpus found on this system

To Reproduce

Try building this Dockerfile and then run it:

FROM python:3.7
WORKDIR /app
RUN pip3 install flask flask-cors gunicorn
RUN pip3 install torch torchvision
RUN pip3 install fastai
COPY ./src /app/src
COPY ./dist /app/dist
CMD gunicorn --bind 0.0.0.0:$PORT src.app:app

Expected behavior

The model should load and predict correctly.
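
One workaround often suggested for this kind of error is sketched below. It assumes the pickled file was exported on Windows, so it contains `WindowsPath` objects that cannot be instantiated on Linux, and it temporarily aliases `pathlib.WindowsPath` to the native path class while unpickling. The helper name `windows_paths_as_native` and the hand-written pickle payload are illustrative only and are not part of fastai.

```python
import contextlib
import pathlib
import pickle

@contextlib.contextmanager
def windows_paths_as_native():
    """Temporarily make pickled WindowsPath objects load as the native path class."""
    original = pathlib.WindowsPath
    native = type(pathlib.Path())  # PosixPath on Linux/macOS, WindowsPath on Windows
    pathlib.WindowsPath = native   # hypothetical workaround, not a fastai API
    try:
        yield
    finally:
        pathlib.WindowsPath = original  # always restore the real class

# A tiny protocol-0 pickle equivalent to WindowsPath('models', 'export.pkl'),
# standing in for a file that was exported on Windows.
payload = b"cpathlib\nWindowsPath\n(Vmodels\nVexport.pkl\ntR."

with windows_paths_as_native():
    loaded = pickle.loads(payload)

print(type(loaded).__name__, loaded.name)
```

In the setup reported here, the failing `ImageDataBunch.load_empty(modelPath)` call would presumably go inside the `with` block in place of `pickle.loads`.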

opened by PsidomPC 35
Serialization / Deserialization of Fastai objects to byte streams

Added options to save/export/load using BytesIO streams to the following functions: Learn.save, Learn.export, load_learner, DataBunch.save, load_data.

Following a discussion with @sgugger here.
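
As a rough illustration of the idea (not the code from this pull request), the sketch below shows one common pattern for accepting either a file path or an in-memory byte stream. It uses only `pickle` and `io.BytesIO`; the helper names `save_obj` and `load_obj` are hypothetical.

```python
import io
import pickle
from pathlib import Path

def save_obj(obj, target):
    """Pickle obj to a path, or to any writable binary stream such as BytesIO."""
    if isinstance(target, (str, Path)):
        with open(target, "wb") as f:
            pickle.dump(obj, f)
    else:
        pickle.dump(obj, target)

def load_obj(source):
    """Load an object from a path, or from a readable binary stream."""
    if isinstance(source, (str, Path)):
        with open(source, "rb") as f:
            return pickle.load(f)
    return pickle.load(source)

# Round-trip through memory without touching the disk.
buffer = io.BytesIO()
save_obj({"weights": [0.1, 0.2], "epoch": 3}, buffer)
buffer.seek(0)  # rewind before reading
restored = load_obj(buffer)
```

Dispatching on the argument type keeps the existing path-based calls working while letting callers pass a stream when the bytes need to go somewhere other than the local disk.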

opened by bachsh 31

ImageDataLoaders num_workers >0 → RuntimeError: Cannot pickle CUDA storage; try pickling a CUDA tensor instead

Please confirm you have the latest versions of fastai, fastcore, fastscript, and nbdev prior to reporting a bug (delete one): YES

Describe the bug

When using a DataLoaders with num_workers>0, training raises RuntimeError: Cannot pickle CUDA storage; try pickling a CUDA tensor instead

To Reproduce

Steps to reproduce the behavior:

from fastai.vision.data import ImageDataLoaders
from fastai.vision.learner import cnn_learner
from fastai.vision.augment import aug_transforms
import pandas as pd
from fastai import vision

df = pd.read_csv("/data/cats/labels.csv")

data = ImageDataLoaders.from_df(df=df, path="/", label_col=1, bs=100, batch_tfms=[
    *aug_transforms(size=224)], valid_pct=0.2, num_workers=1)
learn = cnn_learner(data, getattr(vision.models, "resnet18"))
learn.fit_one_cycle(10)

Expected behavior

There shouldn't be an exception, as there is none when using num_workers=0.

Error with full stack trace

Traceback (most recent call last):
  File "/home/df/git/mitl/mitlmodels/model.py", line 426, in train
    pass  # This comment shows up if we ran into a callback error
  File "/home/df/git/mitl/mitlmodels/ml_utils.py", line 63, in __exit__
    raise exc_type(exc_val).with_traceback(exc_tb) from None
  File "/home/df/git/mitl/mitlmodels/model.py", line 401, in train
    learn.fit_one_cycle(max_epochs, slice(lr_init, lr_init * 30), wd=wd,
  File "/home/df/.local/lib/python3.8/site-packages/fastcore/logargs.py", line 56, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/home/df/.local/lib/python3.8/site-packages/fastai/callback/schedule.py", line 113, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/home/df/.local/lib/python3.8/site-packages/fastcore/logargs.py", line 56, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/home/df/.local/lib/python3.8/site-packages/fastai/learner.py", line 207, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/home/df/.local/lib/python3.8/site-packages/fastai/learner.py", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "/home/df/.local/lib/python3.8/site-packages/fastai/learner.py", line 197, in _do_fit
    self._with_events(self._do_epoch, 'epoch', CancelEpochException)
  File "/home/df/.local/lib/python3.8/site-packages/fastai/learner.py", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "/home/df/.local/lib/python3.8/site-packages/fastai/learner.py", line 191, in _do_epoch
    self._do_epoch_train()
  File "/home/df/.local/lib/python3.8/site-packages/fastai/learner.py", line 183, in _do_epoch_train
    self._with_events(self.all_batches, 'train', CancelTrainException)
  File "/home/df/.local/lib/python3.8/site-packages/fastai/learner.py", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "/home/df/.local/lib/python3.8/site-packages/fastai/learner.py", line 161, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/home/df/.local/lib/python3.8/site-packages/fastai/data/load.py", line 102, in __iter__
    for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
  File "/home/df/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 737, in __init__
    w.start()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__

opened by dreamflasher 26

Added heatmap boolean variable to plot_top_losses. By default this va…

Added heatmap boolean variable to plot_top_losses. By default this variable is True.

When true, plot_top_losses will overlay heat-maps on the top of images. Otherwise, plot_top_losses will display only images associated with top losses.

I am not sure how to write a test case, but here are two scenarios in which the test worked well with and without my code. (I assumed that passing the test is equivalent to displaying the images associated with the top losses.) I am looking forward to learning more about it.

##### with my code
path = untar_data(URLs.PETS); path_anno = path/'annotations'
path_img = path/'images'
np.random.seed(2)
pat = re.compile(r'/([^/]+)_\d+.jpg$')
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=bs).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet34, metrics=error_rate)
interp = ClassificationInterpretation.from_learner(learn)
losses,idxs = interp.top_losses()
len(data.valid_ds)==len(losses)==len(idxs)
interp.plot_top_losses(9, figsize=(15,11),heatmap=True)

### without my code
path = untar_data(URLs.PETS); path_anno = path/'annotations'
path_img = path/'images'
np.random.seed(2)
pat = re.compile(r'/([^/]+)_\d+.jpg$')
data = ImageDataBunch.from_name_re(path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=bs).normalize(imagenet_stats)
learn = create_cnn(data, models.resnet34, metrics=error_rate)
interp = ClassificationInterpretation.from_learner(learn)
losses,idxs = interp.top_losses()
len(data.valid_ds)==len(losses)==len(idxs)
interp.plot_top_losses(9, figsize=(15,11))

opened by at110 25
ImageCleaner.next_batch() and/or .render() broken in JupyterLab

Describe the bug

Hello. After successfully running notebooks on an instance running on GCP by following the instructions, I cannot get the ImageCleaner widget to work properly. First, the widget does not even appear in JupyterLab unless I install the ipywidgets JupyterLab extension; only the object is returned. Second, even after installing this extension, the "Next Batch" button does not work. The CSV is properly created and updated, but the next batch of images is not rendered. This leads me to believe that ImageCleaner.render() is broken.

Provide your installation details

=== Software ===
python        : 3.7.1
fastai        : 1.0.42
fastprogress  : 0.1.18
torch         : 1.0.0
nvidia driver : 410.72
torch cuda    : 10.0.130 / is available
torch cudnn   : 7401 / is enabled

=== Hardware ===
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7611MB | Tesla P4

=== Environment ===
platform      : Linux-4.9.0-8-amd64-x86_64-with-debian-9.7
distro        : #1 SMP Debian 4.9.130-2 (2018-10-27)
conda env     : base
python        : /opt/anaconda3/bin/python
sys.path      :
/home/jupyter/tutorials/fastai/course-v3/nbs/dl1
/opt/anaconda3/lib/python37.zip
/opt/anaconda3/lib/python3.7
/opt/anaconda3/lib/python3.7/lib-dynload
/opt/anaconda3/lib/python3.7/site-packages
/opt/anaconda3/lib/python3.7/site-packages/IPython/extensions
/home/jupyter/.ipython

To Reproduce

Create a new instance via GCP instructions

gcloud compute ssh --zone=$ZONE jupyter@$INSTANCE_NAME -- -L 8080:localhost:8080

Point browser to localhost:8080

Open lesson2-download.ipynb

Run all cells as instructed in Lesson 2 (careful to create dirs and download images properly)

Attempt to instantiate ImageCleaner(ds, idxs, path); note that the current version of lesson2-download.ipynb is missing the necessary path argument

See that the output of the cell is merely the object

Install ipywidgets JupyterLab extension via jupyter labextension install @jupyter-widgets/jupyterlab-manager

Refresh notebook browser window

Attempt to instantiate ImageCleaner(ds, idxs, path) again

See that the widget appears

Interact with the widget and click "Next Batch"

See that the next batch of images does not render, but that cleaned.csv is created

Expected behavior

The next batch of images should appear.

Thanks.
opened by amqdn 24
Tokenization is time and space inefficient

Going through the code in transform.py, I cannot help but notice several opportunities to optimize it for parallel execution. In its current form it would take more than 4 days to tokenize a 12 GB corpus on a 16-core/32-thread CPU (if it didn't run out of memory first, as 36 GB of RAM wasn't enough). Writing a custom implementation reduced the time to a little more than 4 hours and memory use by 2-3x. The code is mission-specific, write-once dirt, so I'm reluctant to share it, but I'd gladly share the gist of it below.

In its current implementation the tokenization process is parallelized using the very inefficient concurrent.futures.ProcessPoolExecutor's map function which creates Future objects where there is no good reason to. These are good for fine-grained control like progress reporting, cancelling etc, but are fairly heavy. In this case we are actually only interested in the returned tokens. multiprocessing.Pool's map should perform considerably better. See this SO post for more details.

A number of new processes are created to tokenize each batch of text. This means that every few seconds a new batch of processes needs to be forked, each of which initializes a fresh instance of the spaCy Tokenizer and receives a fairly large chunk of text via IPC. This seems very inefficient. For small enough batches, more time will be spent forking, doing IPC, and joining than on the actual work. Instead, a number of long-lived tokenizer worker processes should be initialized at the beginning, each with a workload to process, and each should process a stream of text with the more efficient Tokenizer.pipe function from spaCy.
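
The two points above (a plain `multiprocessing.Pool` instead of futures, and long-lived workers that build their tokenizer once) can be sketched as follows. This is a simplified stand-in, not the script referred to in this issue: a whitespace splitter replaces the spaCy tokenizer, and `init_worker`, `tokenize_chunk`, and `tokenize_corpus` are hypothetical names.

```python
import multiprocessing as mp

_tokenizer = None  # one instance per worker process, created once

def init_worker():
    """Runs once when each worker starts; build the expensive tokenizer here."""
    global _tokenizer
    _tokenizer = str.split  # placeholder for a real (e.g. spaCy) tokenizer

def tokenize_chunk(texts):
    """Tokenize a list of texts with the worker's long-lived tokenizer."""
    tok = _tokenizer or str.split  # fall back when called outside a pool
    return [tok(t) for t in texts]

def tokenize_corpus(chunks, n_workers=2):
    """Map chunks over a fixed pool; results come back in input order."""
    with mp.Pool(n_workers, initializer=init_worker) as pool:
        return pool.map(tokenize_chunk, chunks)

if __name__ == "__main__":
    chunks = [["a b c", "d e"], ["f g"]]
    print(tokenize_corpus(chunks))
```

Because the pool is created once, the cost of starting processes and initializing the tokenizer is paid per worker rather than per batch.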

Also, for large batches it is by leaps and bounds more efficient to have each process read its own batch from disk than to have some producer process provide it by IPC. Python's performance for reading large objects through IPC is atrocious (I don't know if this is Python-specific). See this SO post for more context.
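
One way to let each worker read its own slice from disk is to hand out byte ranges instead of text. The sketch below is an assumption about how that could look, not the implementation referred to in this issue: `line_aligned_ranges` and `read_range` are hypothetical helpers that split a file into ranges ending on line boundaries, so only two integers per task need to cross the process boundary.

```python
import os
import tempfile

def line_aligned_ranges(path, n_parts):
    """Split a file into up to n_parts (start, end) byte ranges that end on newlines."""
    size = os.path.getsize(path)
    ranges, start = [], 0
    with open(path, "rb") as f:
        for i in range(1, n_parts + 1):
            if start >= size:
                break
            # Aim for an even split, then extend to the end of the current line.
            target = size if i == n_parts else max(start, size * i // n_parts)
            f.seek(target)
            f.readline()
            end = min(f.tell(), size)
            if end > start:
                ranges.append((start, end))
                start = end
    return ranges

def read_range(path, start, end):
    """What a worker would call: read only its own byte range from disk."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).decode("utf-8")

# Demo on a small temporary file.
text = "".join(f"line {i}\n" for i in range(10))
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus = os.path.join(tmp_dir, "corpus.txt")
    with open(corpus, "wb") as f:
        f.write(text.encode("utf-8"))
    ranges = line_aligned_ranges(corpus, 3)
    parts = [read_range(corpus, s, e) for s, e in ranges]
```

Each `(start, end)` pair could then be passed to a worker, which opens the file itself instead of receiving the text through IPC.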

The current implementation requires enormous amounts of RAM for relatively small corpora (something like ~1 GB requires >24 GB of RAM). Serializing the tokens and word counts from each tokenizer worker to disk, merging them after tokenization, and (if needed) truncating the vocabulary and replacing deleted instances with UNK in the tokenized text files is vastly more memory-efficient and can scale to much larger texts.
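
The merge step described above can be sketched with the standard library. This is an assumption about the overall shape rather than the code referred to in this issue: each worker is imagined to have written its word counts to a JSON file, and the hypothetical helpers `merge_counts`, `build_vocab`, and `replace_rare` combine them, truncate the vocabulary, and map dropped words to `UNK`.

```python
import json
import os
import tempfile
from collections import Counter

UNK = "UNK"

def merge_counts(count_files):
    """Sum the per-worker word counts stored as JSON files on disk."""
    total = Counter()
    for path in count_files:
        with open(path, encoding="utf-8") as f:
            total.update(json.load(f))  # Counter.update adds counts together
    return total

def build_vocab(counts, max_vocab):
    """Keep only the max_vocab most frequent words."""
    return {word for word, _ in counts.most_common(max_vocab)}

def replace_rare(tokens, vocab):
    """Replace every token outside the vocabulary with UNK."""
    return [t if t in vocab else UNK for t in tokens]

# Demo: two "workers" each write their counts to disk, then the counts are merged.
with tempfile.TemporaryDirectory() as tmp_dir:
    files = []
    for i, tokens in enumerate([["the", "cat", "the"], ["the", "dog", "cat"]]):
        path = os.path.join(tmp_dir, f"counts_{i}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(Counter(tokens), f)
        files.append(path)
    counts = merge_counts(files)

vocab = build_vocab(counts, max_vocab=2)
cleaned = replace_rare(["the", "dog", "sat"], vocab)
```

Only the merged counts need to fit in memory at once; the tokenized text can stay on disk and be rewritten file by file.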

I don't really know if the goal of the code as it exists today is to make it easier to "bring your own tokenizer" or to just make it small and understandable, and it is definitely nice to have functions like .from_csv or .from_files that "magically" do everything in one go, for demonstration purposes, but for more serious datasets, maybe breaking the process to more manageable pieces would be a better approach?

[EDIT: Some demonstration]

Here is what processor utilization looks like with the current implementation:

Here is what it should look like (running the example script from here)
opened by kliron 20

AttributeError: 'Learner' object has no attribute 'min_grad_lr'

Describe the bug

Intermittently, I am getting the error AttributeError: 'Learner' object has no attribute 'min_grad_lr' when attempting to run:

learn.lr_find()
fig = learn.recorder.plot(suggestion=True, return_fig=True);
lr = learn.recorder.min_grad_lr

Provide your installation details

=== Software === 
python        : 3.6.5
fastai        : 1.0.46
fastprogress  : 0.1.20
torch         : 1.0.0
nvidia driver : 410.79
torch cuda    : 10.0.130 / is available
torch cudnn   : 7401 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 11441MB | Tesla K80

=== Environment === 
platform      : Linux-4.14.97-74.72.amzn1.x86_64-x86_64-with-glibc2.9
distro        : #1 SMP Tue Feb 5 20:59:30 UTC 2019
conda env     : pytorch_p36
python        : /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python
sys.path      : 
/home/ec2-user/src/cntk/bindings/python
/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python36.zip
/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6
/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/lib-dynload
/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages
/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/IPython/extensions
/home/ec2-user/.ipython

To Reproduce

This error is happening while training inside a docker container (local-notebook SageMaker training). It seems that I can't reproduce this error directly inside a notebook environment. I've also tried lr_find(learn), but I assume that's the same thing.

opened by austinmw 20

Save load

The export and load_learner methods of Learner only worked when a GPU with CUDA was available, so there was no way to export a model and then load it on a CPU-only device. You can now do that by specifying device='cpu' when calling load_learner.

opened by pouannes 20
Support PyTorch 1.8, TorchVision 0.9.0 and TorchAudio 0.8.0

I know there is probably some testing that needs to happen, and that you devs are probably already aware of it, but PyTorch 1.8, TorchVision 0.9.0 and TorchAudio 0.8.0 were released two days ago, so support for these in the next fastai release would be nice.

https://github.com/pytorch/pytorch/releases/tag/v1.8.0

opened by DavidSpek 19
AttributeError: module 'sklearn.metrics._dist_metrics' has no attribute 'DistanceMetric32'

This was removed post scikit-learn version 1.1.0 I believe.

Installing scikit-learn 1.1.0 fixed this issue for me when trying the first line import of the vision tutorial.

from fastai.vision.all import *

opened by talentoscope 0
Gradio unable to render output properly in Jupyter Notebook (fastai uses np.int but Gradio does not)

Describe the Bug

In creating a simple image classifier, there appears to be a bug when trying to render the output onto a Jupyter Notebook. Specifically, here's the output issue that I experience:

AttributeError: module 'numpy' has no attribute 'int'

The code deploys properly on Hugging Face Spaces, but I get an error when the output is rendered in Jupyter Notebook. How can this be resolved so that I can create and test the output locally in Jupyter Notebook before deploying it more broadly on Hugging Face Spaces? I've posted the same question under issues in the Gradio repo, but it was suggested that I post it in the fastai repo.
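
The error message matches a known NumPy change: the `np.int` alias was deprecated in NumPy 1.20 and removed in NumPy 1.24, so code that still references it fails on newer releases. The sketch below is a hypothetical diagnostic, not part of fastai or Gradio, that checks whether the installed NumPy is new enough to have dropped the alias; `np_int_alias_removed` and `installed_numpy_version` are illustrative names.

```python
import re
from importlib import metadata

def np_int_alias_removed(version):
    """Return True if this NumPy version no longer provides np.int (removed in 1.24)."""
    major, minor = (int(x) for x in re.match(r"(\d+)\.(\d+)", version).groups())
    return (major, minor) >= (1, 24)

def installed_numpy_version():
    """Return the installed NumPy version string, or None if NumPy is missing."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None

version = installed_numpy_version()
if version is None:
    print("NumPy is not installed")
elif np_int_alias_removed(version):
    print(f"NumPy {version}: np.int has been removed; code should use int or np.int64")
else:
    print(f"NumPy {version}: np.int is still available")
```

If the two environments report different NumPy versions, that would be consistent with the code working in one place and failing in the other.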

Reproduction

Here's the full source code on GitHub: https://github.com/emptytank/invoice_classifier/blob/main/invoice_classifier.ipynb

Here's the Hugging face spaces: https://huggingface.co/spaces/emptytank/invoice_classifier

Here's the link to the issue posted in the Gradio GitHub repo: https://github.com/gradio-app/gradio/issues/2908

Logs

Traceback (most recent call last):
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\gradio\routes.py", line 321, in run_predict
    output = await app.blocks.process_api(
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\gradio\blocks.py", line 1015, in process_api
    result = await self.call_function(fn_index, inputs, iterator, request)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\gradio\blocks.py", line 856, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\tangmi2\AppData\Local\Temp\ipykernel_22992\830469006.py", line 6, in predict
    pred, pred_idx, probs = learn.predict(img)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\learner.py", line 313, in predict
    inp,preds,_,dec_preds = self.get_preds(dl=dl, with_input=True, with_decoded=True)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\learner.py", line 300, in get_preds
    self._do_epoch_validate(dl=dl)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\learner.py", line 236, in _do_epoch_validate
    with torch.no_grad(): self._with_events(self.all_batches, 'validate', CancelValidException)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\learner.py", line 193, in _with_events
    try: self(f'before_{event_type}'); f()
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\learner.py", line 199, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\data\load.py", line 127, in __iter__
    for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\torch\utils\data\dataloader.py", line 628, in __next__
    data = self._next_data()
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\torch\utils\data\dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\torch\utils\data\_utils\fetch.py", line 43, in fetch
    data = next(self.dataset_iter)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\data\load.py", line 138, in create_batches
    yield from map(self.do_batch, self.chunkify(res))
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastcore\basics.py", line 230, in chunked
    res = list(itertools.islice(it, chunk_sz))
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\data\load.py", line 153, in do_item
    try: return self.after_item(self.create_item(s))
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastcore\transform.py", line 208, in __call__
    def __call__(self, o): return compose_tfms(o, tfms=self.fs, split_idx=self.split_idx)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastcore\transform.py", line 158, in compose_tfms
    x = f(x, **kwargs)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastcore\transform.py", line 81, in __call__
    def __call__(self, x, **kwargs): return self._call('encodes', x, **kwargs)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastcore\transform.py", line 91, in _call
    return self._do_call(getattr(self, fn), x, **kwargs)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastcore\transform.py", line 98, in _do_call
    res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastcore\transform.py", line 98, in <genexpr>
    res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastcore\transform.py", line 97, in _do_call
    return retain_type(f(x, **kwargs), x, ret)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastcore\dispatch.py", line 120, in __call__
    return f(*args, **kwargs)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\vision\core.py", line 236, in encodes
    def encodes(self, o:PILBase): return o._tensor_cls(image2tensor(o))
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\vision\core.py", line 106, in image2tensor
    res = tensor(img)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\torch_core.py", line 154, in tensor
    else _array2tensor(array(x), **kwargs))
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\fastai\torch_core.py", line 136, in _array2tensor
    if sys.platform == "win32" and x.dtype==np.int: x = x.astype(np.int64)
  File "c:\Users\tangmi2\GitHub\invoice_classifier\venv-invoice\lib\site-packages\numpy\__init__.py", line 284, in __getattr__
    raise AttributeError("module {!r} has no attribute "

System Info

Gradio Version: gradio==3.15.0
Operating System: Windows 10 Enterprise 64-bit
Browser: Microsoft Edge
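
The final frames of the traceback in the Logs section point at `np.int`, an alias that NumPy deprecated in 1.20 and removed in 1.24. The error message itself is cut off, so this reading is an assumption. A minimal sketch of an equivalent check that does not rely on the alias is shown below; the helper name `to_int64_if_default_int` is made up for illustration and is not fastai's code.

```python
import numpy as np

def to_int64_if_default_int(x: np.ndarray) -> np.ndarray:
    """Hypothetical helper: widen platform-default integer arrays to int64."""
    # Comparing the dtype against the builtin `int` avoids the removed
    # `np.int` alias; NumPy maps `int` to its default integer dtype.
    if x.dtype == int:
        x = x.astype(np.int64)
    return x

arr = to_int64_if_default_int(np.array([1, 2, 3]))
print(arr.dtype)
```

Arrays with other dtypes (floats, for instance) pass through unchanged. Pinning NumPy below 1.24 is another possible workaround, under the same assumption about the cause.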

opened by emptytank 0
Add option to (optionally) save confusion matrix plot

This PR adds an optional parameter save_plot to the plot_confusion_matrix function, which allows offline analysis of models and their confusion matrices across several tuning runs or iterations.

opened by aspiringastro 1
Fastai docs not available as Colab notebooks any more?

The https://docs.fast.ai/ website says that every page of the docs is available as a Colab notebook. But I couldn't find the Colab link on any of the pages. Are the Colab notebooks not available any more?

opened by amoghvaishampayan 0

Multi-GPU training CNN hangs when using TensorboardCallback

Please confirm you have the latest versions of fastai, fastcore, and nbdev prior to reporting a bug (delete one): YES

Describe the bug

Hi! I have recently been experimenting with TensorBoard and fastai, especially as a means of tracking metrics in real-time with ClearML.

I've noticed that training a CNN across GPUs using Accelerate works fine, but the moment you introduce a TensorBoardCallback in training, it hangs indefinitely without any errors. Training the CNN on a multi-GPU instance without distributed training (i.e. using only one of the GPUs and without including Accelerate) works perfectly fine too.

I mentioned this issue on the Accelerate repo: https://github.com/huggingface/accelerate/issues/900 and @muellerzr hypothesised that this is because TensorBoard can only run on the main process, which fastai doesn't guard against. (Thanks, Zachary!)
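
To make the hypothesis concrete, here is a minimal sketch of what a main-process guard could look like. It assumes the launcher exports a `RANK` environment variable for each worker, as torch.distributed launchers conventionally do. The names `is_main_process` and `MainProcessOnly` are hypothetical and are not part of fastai or Accelerate.

```python
import os

def is_main_process(env=None) -> bool:
    """Return True when this process is rank 0, or when no rank is set."""
    env = os.environ if env is None else env
    # Assumption: a missing RANK means ordinary single-process training.
    return int(env.get("RANK", "0")) == 0

class MainProcessOnly:
    """Hypothetical wrapper that forwards calls to `logger` only on rank 0."""
    def __init__(self, logger, env=None):
        self.logger = logger
        self.enabled = is_main_process(env)

    def log(self, *args, **kwargs):
        # Non-main processes silently skip logging instead of blocking.
        if self.enabled:
            return self.logger(*args, **kwargs)

records = []
MainProcessOnly(records.append, env={"RANK": "0"}).log("loss=0.5")
MainProcessOnly(records.append, env={"RANK": "3"}).log("loss=0.5")
print(records)
```

Only the rank 0 wrapper records anything here. Whether a guard of this kind resolves the hang described in this issue has not been verified.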

To Reproduce

Steps to reproduce the behavior:

Spin up a notebook session with multiple GPUs. Here is the information on all of my settings:

Accelerate version: 0.15.0
OS: CentOS 7 (running JupyterLab through Docker with a CUDA-configured container)
Python version: 3.9.12
numpy version: 1.23.5
ClearML version: 1.8.2
torch version:
* torch==1.12.1+cu113
* torchaudio==0.12.1+cu113
* torchvision==0.13.1+cu113
fastai version: 2.7.10
protobuf version: 3.19.6 (because of tensorboard issues)
accelerate configuration:
  * command_file: null
  * commands: null
  * compute_environment: LOCAL_MACHINE
  * deepspeed_config: {}
  * distributed_type: MULTI_GPU
  * downcast_bf16: 'no'
  * dynamo_backend: 'NO'
  * fsdp_config: {}
  * gpu_ids: all
  * machine_rank: 0
  * main_process_ip: null
  * main_process_port: null
  * main_training_function: main
  * megatron_lm_config: {}
  * mixed_precision: 'no'
  * num_machines: 1
  * num_processes: 4
  * rdzv_backend: static
  * same_network: true
  * tpu_name: null
  * tpu_zone: null
  * use_cpu: false
CUDA version: 11.3
EC2 instance type: p3.8xlarge

Take the following base script:

from fastai.vision.all import *

from accelerate import notebook_launcher
from fastai.distributed import *
from clearml import Task, Logger
from fastai.callback.tensorboard import TensorBoardCallback

path = untar_data(URLs.PETS)/'images'

# Not included - the credentials and host information for ClearML set as environment variables
task = Task.init(project_name='Test Project', task_name='clearml-fastai-integration-demo-4')
logger = Logger.current_logger()

def train():
    print('Creating DataLoader')
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2,
        label_func=lambda x: x[0].isupper(), item_tfms=Resize(224))
    print('Creating learner')
    learn = vision_learner(dls, resnet34, metrics=error_rate).to_fp16()
    print('Outside learn.distrib_ctx')
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        print('Inside learn.distrib_ctx')
        # learn.fine_tune(2, cbs=[TensorBoardCallback()])
        learn.fine_tune(2)

notebook_launcher(train, num_processes=4)

Paste it into a cell in a Jupyter Lab session
Uncomment one of the learn.fine_tune lines and comment the other (eg try first without any callbacks)
Run the cell
Swap the uncommented and commented learn.fine_tune lines (eg now try with TensorBoardCallback)

Expected behavior

I expect the model to train across GPUs with a TensorBoardCallback enabled.

Error with full stack trace

Here's the output when there is no callback:

Launching training on 4 GPUs.
Creating DataLoader
Creating DataLoader
Creating DataLoader
Creating DataLoader
Creating learner
/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning:

The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.

/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning:

Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.

Creating learner
/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning:

The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.

/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning:

Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.

2022-12-02 04:02:37,332 - clearml.model - INFO - Selected model id: b212faeef29d4a54861d19d9bd2a3bde
Creating learner
/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning:

The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.

/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning:

Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.

Creating learner
/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning:

The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.

/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning:

Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.

2022-12-02 04:02:37,632 - clearml.model - INFO - Selected model id: b212faeef29d4a54861d19d9bd2a3bde
2022-12-02 04:02:38,745 - clearml.model - INFO - Selected model id: b212faeef29d4a54861d19d9bd2a3bde
Outside learn.distrib_ctx
2022-12-02 04:02:39,055 - clearml.model - INFO - Selected model id: b212faeef29d4a54861d19d9bd2a3bde
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
Outside learn.distrib_ctx
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
Outside learn.distrib_ctx
Outside learn.distrib_ctx
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
Training Learner...
Inside learn.distrib_ctxInside learn.distrib_ctxInside learn.distrib_ctxInside learn.distrib_ctx



[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).

 0.00% [0/1 00:00<?]
epoch	train_loss	valid_loss	error_rate	time

 39.13% [9/23 00:06<00:09 0.0596]

Here's the output when using TensorBoardCallback as a callback:

Launching training on 4 GPUs.
Creating DataLoader
Creating DataLoader
Creating DataLoader
Creating DataLoader
Creating learner
/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning:

The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.

/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning:

Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.

Creating learner
/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning:

The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.

/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning:

Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.

Creating learner
/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning:

The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.

/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning:

Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.

Creating learner
/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning:

The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.

/root/miniconda/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning:

Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.

2022-12-02 03:58:35,509 - clearml.model - INFO - Selected model id: b212faeef29d4a54861d19d9bd2a3bde
2022-12-02 03:58:35,608 - clearml.model - INFO - Selected model id: b212faeef29d4a54861d19d9bd2a3bde
2022-12-02 03:58:35,646 - clearml.model - INFO - Selected model id: b212faeef29d4a54861d19d9bd2a3bde
2022-12-02 03:58:35,712 - clearml.model - INFO - Selected model id: b212faeef29d4a54861d19d9bd2a3bde
Outside learn.distrib_ctx
Outside learn.distrib_ctx
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
Outside learn.distrib_ctx
Outside learn.distrib_ctx
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
Training Learner...
Inside learn.distrib_ctx
Inside learn.distrib_ctxInside learn.distrib_ctx

Inside learn.distrib_ctx
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).

The output does not progress from here.

Additional context

To summarise:

| Using Accelerate to multi-process training? | Callbacks enabled? | Result |
|:---|:--:|---:|
| No | No | Runs successfully |
| No | Yes | Runs successfully |
| Yes | No | Runs successfully |
| Yes | Yes | Hangs indefinitely |

opened by ntdesilv 0

Releases(2.7.10)

2.7.10(Nov 2, 2022)
New Features

Add torch save and load kwargs (#3831), thanks to @JonathanGrant

This lets us do nice things like set pickle_module to cloudpickle

PyTorch 1.13 Compatibility (#3828), thanks to @warner-benjamin

Recursive copying of attribute dictionaries for TensorImage subclass (#3822), thanks to @restlessronin

OptimWrapper sets same param groups as Optimizer (#3821), thanks to @warner-benjamin

This PR harmonizes the default parameter group setting between OptimWrapper and Optimizer by modifying OptimWrapper to match Optimizer's logic.

Support normalization of 1-channel images in unet (#3820), thanks to @marib00

Add img_cls param to ImageDataLoaders (#3808), thanks to @tcapelle

This is particularly useful for passing PILImageBW for MNIST.

Add support for kwargs to tensor() when arg is an ndarray (#3797), thanks to @SaadAhmedGit

Add latest TorchVision models on fastai (#3791), thanks to @datumbox

Option to preserve filenames in download_images (#2983), thanks to @mess-lelouch

Bugs Squashed

get_text_classifier fails with custom AWS_LSTM (#3817)

revert auto-enable of mac mps due to pytorch limitations (#3769)

Workaround for performance bug in PyTorch with subclassed tensors (#3683), thanks to @warner-benjamin

2.7.8(Aug 2, 2022)
New Features

add split value argument to ColSplitter (#3737), thanks to @DanteOz

deterministic repr for PIL images (#3762)

option to skip default callbacks in Learner (#3739)

update for nbdev2 (#3747)

Bugs Squashed

IntToFloatTensor failing on Mac mps due to missing op (#3761)

fix for pretrained in vision.learner (#3746), thanks to @peterdudfield

fix same file error message when resizing image (#3743), thanks to @cvergnes

2.7.6(Jul 7, 2022)
New Features

Initial Mac GPU (mps) support (#3719)

2.7.5(Jul 4, 2022)
New Features

auto-normalize timm models (#3716)

PyTorch 1.12 support

2.7.4(Jun 28, 2022)
New Features

Add DataBlock.weighted_dataloaders (#3706)

2.7.2(Jun 19, 2022)
Bugs Squashed

PIL.Resampling only added in v9.1 (#3699)

2.7.1(Jun 19, 2022)
Update fastcore minimum version

2.7.0(Jun 19, 2022)
Breaking changes

Distributed training now uses Hugging Face Accelerate, rather than fastai's launcher. Distributed training is now supported in a notebook -- see this tutorial for details

New Features

resize_images creates folder structure at dest when recurse=True (#3692)

Integrate nested callable and getcallable (#3691), thanks to @muellerzr

workaround pytorch subclass performance bug (#3682)

Torch 1.12.0 compatibility (#3659), thanks to @josiahls

Integrate Accelerate into fastai (#3646), thanks to @muellerzr

New Callback event, before and after backward (#3644), thanks to @muellerzr

Let optimizer use built torch opt (#3642), thanks to @muellerzr

Support PyTorch Dataloaders with DistributedDL (#3637), thanks to @tmabraham

Add channels_last cb (#3634), thanks to @tcapelle

support all timm kwargs (#3631)

send self.loss_func to device if it is an instance of nn.Module (#3395), thanks to @arampacha

Bugs Squashed

Solve hanging load_model and let LRFind be run in a distributed setup (#3689), thanks to @muellerzr

pytorch subclass functions fail if no positional args (#3687)

Workaround for performance bug in PyTorch with subclassed tensors (#3683), thanks to @warner-benjamin

Fix Tokenizer.get_lengths (#3667), thanks to @karotchykau

load_learner with cpu=False doesn't respect the current cuda device if model exported on another; fixes #3656 (#3657), thanks to @ohmeow

[Bugfix] Fix smoothloss on distributed (#3643), thanks to @muellerzr

WandbCallback Error: "Tensors must be CUDA and dense" on distributed training (#3291)

vision tutorial failed at learner.fine_tune(1) (#3283)

2.6.3(May 1, 2022)
Bugs Squashed

Fix Learner pickling problem introduced in v2.6.2

2.6.2(Apr 30, 2022)
Bugs Squashed

Race condition: 'Tensor' object has no attribute 'append' (#3385)

2.6.1(Apr 30, 2022)
Bugs Squashed

Race condition: 'Tensor' object has no attribute 'append' (#3385)

2.6.0(Apr 24, 2022)
New Features

add support for Ross Wightman's Pytorch Image Models (timm) library (#3624)

rename cnn_learner to vision_learner since we now support models other than CNNs too (#3625)

Bugs Squashed

Fix AccumMetric name.setter (#3621), thanks to @warner-benjamin

Fix Classification Interpretation (#3563), thanks to @warner-benjamin

2.5.6(Apr 2, 2022)
New Features

support pytorch 1.11 (#3618)

Add in exceptions and verbose errors (#3611), thanks to @muellerzr

Bugs Squashed

Fix name conflicts in ColReader (#3602), thanks to @hiromis

2.5.5(Mar 25, 2022)
New Features

Update fastcore dep

2.5.4(Mar 25, 2022)
New Features

Support py3.10 annotations (#3601)

Bugs Squashed

Fix pin_memory=True breaking (batch) Transforms (#3606), thanks to @johan12345

Add Python 3.9 to setup.py for PyPI (#3604), thanks to @nzw0301

removes add_vert from get_grid calls (#3593), thanks to @kevinbird15

Making loss_not_reduced work with DiceLoss (#3583), thanks to @hiromis

Fix bug in URLs.path() in 04_data.external (#3582), thanks to @malligaraj

Custom name for metrics (#3573), thanks to @bdsaglam

Update import for show_install (#3568), thanks to @fr1ll

Fix Classification Interpretation (#3563), thanks to @warner-benjamin

Updates Interpretation class to be memory efficient (#3558), thanks to @warner-benjamin

Learner.show_results uses passed dataloader via dl_idx or dl arguments (#3554), thanks to @warner-benjamin

Fix learn.export pickle error with MixedPrecision Callback (#3544), thanks to @warner-benjamin

Fix concurrent LRFinder instances overwriting each other by using tempfile (#3528), thanks to @warner-benjamin

Fix _get_shapes to work with dictionaries (#3520), thanks to @ohmeow

Fix torch version checks, remove clip_grad_norm check (#3518), thanks to @warner-benjamin

Fix nested tensors predictions compatibility with fp16 (#3516), thanks to @tcapelle

Learning rate passed via OptimWrapper not updated in Learner (#3337)

Different results after running lr_find() at different times (#3295)

lr_find() may fail if run in parallel from the same directory (#3240)

2.5.3(Oct 23, 2021)
New Features

add at_end feature to SaveModelCallback (#3296), thanks to @tmabraham

Bugs Squashed

fix fp16 test (#3284), thanks to @tmabraham

2.5.1(Aug 11, 2021)
Import download_url from fastdownload

2.5.0(Aug 6, 2021)
Breaking changes

config.yml has been renamed to config.ini, and is now in ConfigParser format instead of YAML

The _path suffixes in config.ini have been removed
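
For readers unfamiliar with the ConfigParser (INI) format, here is a minimal sketch of reading such a file with Python's standard library. The section contents and key names below are hypothetical, chosen only to illustrate keys without a _path suffix; they are not taken from fastai's actual config.ini.

```python
import configparser

# Hypothetical file contents; the keys and paths are illustrative only.
SAMPLE = """
[DEFAULT]
data = /home/user/.fastai/data
model = /home/user/.fastai/models
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Collect the DEFAULT section into a plain dict of strings.
settings = dict(parser["DEFAULT"])
print(settings["data"])
```

Unlike YAML, every value in this format is read as a string, so any type conversion is left to the caller.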

Bugs Squashed

Training with learn.to_fp16() fails with PyTorch 1.9 / Cuda 11.4 (#3438)

pandas 1.3.0 breaks add_elapsed_times (#3431)

2.4.1(Jul 14, 2021)
New Features

add DiceLoss (#3386), thanks to @tcapelle

TabularPandas data transform reproducibility (#2826)

Bugs Squashed

Latest Pillow v8.3.0 breaks conversion Image to Tensor (#3416)

2.4(Jun 16, 2021)
Breaking changes

QRNN module removed, due to incompatibility with PyTorch 1.9, and lack of utilization of QRNN in the deep learning community. QRNN was our only module that wasn't pure Python, so with this change fastai is now a pure Python package.

New Features

Support for PyTorch 1.9

Improved LR Suggestions (#3377), thanks to @muellerzr

SaveModelCallback every nth epoch (#3375), thanks to @KeremTurgutlu

Send self.loss_func to device if it is an instance of nn.Module (#3395), thanks to @arampacha

Batch support for more than one image (#3339)

Changeable tfmdlists for TransformBlock, Datasets, DataBlock (#3327)

Bugs Squashed

convert TensorBBox to TensorBase during compare (#3388), thanks to @kevinbird15

Check if normalize exists on _add_norm (#3371), thanks to @renato145

2.3.2(Jun 16, 2021)
New Features

send self.loss_func to device if it is an instance of nn.Module (#3395), thanks to @arampacha

Improved LR Suggestions (#3377), thanks to @muellerzr

SaveModelCallback every nth epoch (#3375), thanks to @KeremTurgutlu

Batch support for more than one image (#3339)

Changeable tfmdlists for TransformBlock, Datasets, DataBlock (#3327)

Bugs Squashed

convert TensorBBox to TensorBase during compare (#3388), thanks to @kevinbird15

Check if normalize exists on _add_norm (#3371), thanks to @renato145

2.3.1(May 4, 2021)
New Features

Add support for pytorch 1.8 (#3349)

Add support for spacy3 (#3348)

Add support for Windows. Big thanks to Microsoft for many contributions to get this working

Timedistributed layer and Image Sequence Tutorial (#3124), thanks to @tcapelle

Add interactive run logging to AzureMLCallback (#3341), thanks to @yijinlee

Batch support for more than one image (#3339)

Have interp use ds_idx, add tests (#3332), thanks to @muellerzr

Automatically have fastai determine the right device, even with torch DataLoaders (#3330), thanks to @muellerzr

Add at_end feature to SaveModelCallback (#3296), thanks to @tmabraham

Improve inplace params in Tabular's new and allow for new and test_dl to be in place (#3292), thanks to @muellerzr

Update VSCode & Codespaces dev container (#3280), thanks to @bamurtaugh

Add max_scale param to RandomResizedCrop(GPU) (#3252), thanks to @kai-tub

Increase testing granularity for speedup (#3242), thanks to @ddobrinskiy

Bugs Squashed

Make TTA turn shuffle and drop_last off when using ds_idx (#3347), thanks to @muellerzr

Add order to TrackerCallback derived classes (#3346), thanks to @muellerzr

Prevent schedule from crashing close to the end of training (#3335), thanks to @Lewington-pitsos

Fix ability to use raw pytorch DataLoaders (#3328), thanks to @hamelsmu

Fix PixelShuffle_icnr weight (#3322), thanks to @pratX

Creation of new DataLoader in Learner.get_preds has wrong keyword (#3316), thanks to @tcapelle

Correct layers order in tabular learner (#3314), thanks to @gradientsky

Fix vmin parameter default (#3305), thanks to @tcapelle

Ensure call to one_batch places data on the right device (#3298), thanks to @tcapelle

Fix Cutmix Augmentation (#3259), thanks to @MrRobot2211

Fix custom tokenizers for DataLoaders (#3256), thanks to @iskode

Fix error setting 'tok_tfm' parameter in TextDataLoaders.from_folder

Fix lighting augmentation (#3255), thanks to @kai-tub

Fix CUDA variable serialization (#3253), thanks to @mszhanyi

change batch tfms to have the correct dimensionality (#3251), thanks to @trdvangraft

Ensure add_datepart adds elapsed as numeric column (#3230), thanks to @aberres

2.3.0(Mar 31, 2021)
Breaking Changes

fix optimwrapper to work with param_groups (#3241), thanks to @tmabraham

OptimWrapper now has a different constructor signature, which makes it easier to wrap PyTorch optimizers

New Features

Support discriminative learning with OptimWrapper (#2829)

Bugs Squashed

Updated to support adding transforms to multiple dataloaders (#3268), thanks to @marii-moe

This fixes an issue in 2.2.7 which resulted in incorrect validation metrics when using Normalization

2.2.7(Feb 22, 2021)
Bugs Squashed

Regression fix: Ensure add_datepart adds elapsed as numeric column (#3230), thanks to @aberres

2.2.6(Feb 21, 2021)
Bugs Squashed

2.2.5 was not released correctly - it was actually 2.2.3

2.2.5(Feb 8, 2021)
New Features

Enhancement: Let TextDataLoaders take in a custom tok_text_col (#3208), thanks to @muellerzr

Changed dataloaders arguments to have consistent overrides (#3178), thanks to @marii-moe

Better support for iterable datasets (#3173), thanks to @jcaw

Bugs Squashed

BrokenProcessPool in download_images() on Windows (#3196)

error on predict() or using interp with resnet and MixUp (#3180)

Fix 'cat' attribute with pandas dataframe: AttributeError: Can only use .cat accessor with a 'category' dtype (#3165), thanks to @dreamflasher

cont_cat_split does not support pandas types (#3156)

DataBlock.dataloaders does not support the advertised "shuffle" argument (#3133)

2.2.3(Jan 12, 2021)
New Features

Calculate correct nf in create_head based on concat_pool (#3115), thanks to @muellerzr

Bugs Squashed

wandb integration failing with latest wandb library (#3066)

Learner.load and LRFinder not functioning properly for the optimizer states (#2892)

2.2.2(Jan 7, 2021)
Bugs Squashed

tensorboard and wandb can not access smooth_loss (#3131)

2.2.0(Jan 6, 2021)
Breaking Changes

Promote NativeMixedPrecision to default MixedPrecision (and similar for Learner.to_fp16); old MixedPrecision is now called NonNativeMixedPrecision (#3127)

Use the new GradientClip callback instead of the clip parameter to use gradient clipping

Adding a Callback which has the same name as an attribute no longer raises an exception (#3109)

RNN training now requires RNNCallback, but does not require RNNRegularizer; out and raw_out have moved to RNNRegularizer (#3108)

Call rnn_cbs to get all callbacks needed for RNN training, optionally with regularization

replace callback run_after with order; do not run after cbs on exception (#3101)
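
As a rough illustration of scheduling callbacks by a numeric `order` attribute rather than by explicit run_after links, consider the sketch below. The classes are hypothetical stand-ins and do not use fastai's own Callback; the convention that lower values run first is an assumption of this sketch.

```python
class Callback:
    order = 0  # assumed convention: lower values run first

    def __init__(self, name):
        self.name = name

class Early(Callback):
    order = -10

class Late(Callback):
    order = 10

# Registration order is arbitrary; the run order comes from `order` alone.
cbs = [Late("log"), Callback("train"), Early("setup")]
run_order = [cb.name for cb in sorted(cbs, key=lambda cb: cb.order)]
print(run_order)
```

A single integer per callback keeps the ordering total and easy to reason about, at the cost of having to pick values that leave room for callbacks added later.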

New Features

Add GradientClip callback (#3107)

Make Flatten cast to TensorBase to simplify type compatibility (#3106)

make flattened metrics compatible with all tensor subclasses (#3105)

New class method TensorBase.register_func to register types for __torch_function__ (#3097)

new dynamic flag for controlling dynamic loss scaling in NativeMixedPrecision (#3096)

remove need to call to_native_fp32 before predict; set skipped in NativeMixedPrecision after NaN from dynamic loss scaling (#3095)

make native fp16 extensible with callbacks (#3094)

Calculate correct nf in create_head based on concat_pool (#3115) thanks to @muellerzr

2.1.10(Dec 22, 2020)
New Features

Small DICOM segmentation dataset (#3034), thanks to @moritzschwyzer

Bugs Squashed

NoneType object has no attribute append in fastbook chapter 6 BIWI example (#3091)
