Code for visualizing the loss landscape of neural nets

Overview

Visualizing the Loss Landscape of Neural Nets

This repository contains the PyTorch code for the paper

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer and Tom Goldstein. Visualizing the Loss Landscape of Neural Nets. NIPS, 2018.

An interactive 3D visualizer for loss surfaces has been provided by telesens.

Given a network architecture and its pre-trained parameters, this tool calculates and visualizes the loss surface along random direction(s) near the optimal parameters. The calculation can be run in parallel across multiple GPUs per node and across multiple nodes. The random direction(s) and loss surface values are stored in HDF5 (.h5) files after they are produced.
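As a rough illustration of the idea, the sketch below evaluates a loss L(theta* + alpha * d) at evenly spaced values of alpha along a single random direction d. It is a NumPy toy, not this repository's implementation: toy_loss and the flat 100-dimensional parameter vector are hypothetical stand-ins for a network and its trained weights.

```python
import numpy as np

def toy_loss(theta):
    """Hypothetical stand-in for a network loss: an anisotropic quadratic bowl."""
    scales = np.linspace(1.0, 10.0, theta.size)
    return float(np.sum(scales * theta ** 2))

def loss_along_direction(loss_fn, theta_star, direction, alphas):
    """Evaluate loss_fn(theta_star + alpha * direction) for every alpha."""
    return np.array([loss_fn(theta_star + a * direction) for a in alphas])

rng = np.random.default_rng(0)
theta_star = np.zeros(100)             # pretend minimizer
direction = rng.standard_normal(100)   # random direction with the same size as the parameters
alphas = np.linspace(-1.0, 1.0, 51)    # same grid as a spec like --x=-1:1:51
losses = loss_along_direction(toy_loss, theta_star, direction, alphas)
```

Because the toy minimizer sits at alpha = 0, the smallest loss lands at the middle of the grid (index 25).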

Setup

Environment: One or more multi-GPU node(s) with the following software/libraries installed:

Pre-trained models: The code accepts pre-trained PyTorch models for the CIFAR-10 dataset. To load a pre-trained model correctly, the model file should contain a state_dict, i.e., the dictionary returned by the model's state_dict() method. The default path for pre-trained networks is cifar10/trained_nets. Some of the pre-trained models and plotted figures can be downloaded here:

Data preprocessing: The data pre-processing method used for visualization should be consistent with the one used for model training. No data augmentation (random cropping or horizontal flipping) is used in calculating the loss values.

Visualizing 1D loss curve

Creating 1D linear interpolations

The 1D linear interpolation method [1] evaluates the loss values along the direction between two minimizers of the same network loss function. This method has been used to compare the flatness of minimizers trained with different batch sizes [2]. A 1D linear interpolation plot is produced with the plot_surface.py script.

mpirun -n 4 python plot_surface.py --mpi --cuda --model vgg9 --x=-0.5:1.5:401 --dir_type states \
--model_file cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=128_wd=0.0_save_epoch=1/model_300.t7 \
--model_file2 cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=8192_wd=0.0_save_epoch=1/model_300.t7 --plot
  • --x=-0.5:1.5:401 sets the range and resolution for the plot. The x-coordinates in the plot will run from -0.5 to 1.5 (the minimizers are located at 0 and 1), and the loss value will be evaluated at 401 locations along this line.
  • --dir_type states indicates that the direction contains dimensions for all parameters as well as for the statistics of the BN layers (running_mean and running_var). Note that if running_mean and running_var are ignored, the loss values will not be correct when two solutions are plotted together in the same figure.
  • The two model files contain network parameters describing the two distinct minimizers of the loss function. The plot will interpolate between these two minima.

VGG-9 SGD, WD=0
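The interpolation described above amounts to theta(alpha) = (1 - alpha) * theta1 + alpha * theta2, applied entry by entry. The sketch below assumes each minimizer is a dict mapping names to NumPy arrays (a stand-in for a PyTorch state_dict; with --dir_type states it would also hold the BN running_mean and running_var). parse_range is a hypothetical helper that mimics the min:max:num format of --x; it is not the repository's parser.

```python
import numpy as np

def parse_range(spec):
    """Turn a 'min:max:num' string such as '-0.5:1.5:401' into evenly spaced coordinates."""
    lo, hi, num = spec.split(":")
    return np.linspace(float(lo), float(hi), int(num))

def interpolate_states(state1, state2, alpha):
    """(1 - alpha) * state1 + alpha * state2 for every entry.

    alpha = 0 gives state1, alpha = 1 gives state2; values outside [0, 1]
    extrapolate beyond the two minimizers.
    """
    return {k: (1.0 - alpha) * state1[k] + alpha * state2[k] for k in state1}

# Two hypothetical "minimizers": one weight tensor plus BN running statistics.
state_a = {"conv.weight": np.ones((2, 3)), "bn.running_mean": np.zeros(2), "bn.running_var": np.ones(2)}
state_b = {"conv.weight": -np.ones((2, 3)), "bn.running_mean": np.full(2, 0.5), "bn.running_var": np.full(2, 2.0)}

xs = parse_range("-0.5:1.5:401")                  # 401 evaluation points
midpoint = interpolate_states(state_a, state_b, 0.5)
```

Each coordinate in xs would then be turned into one interpolated model whose loss is evaluated.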

Producing plots along random normalized directions

A random direction with the same dimension as the model parameters is created and "filter normalized." Then we can sample loss values along this direction.

mpirun -n 4 python plot_surface.py --mpi --cuda --model vgg9 --x=-1:1:51 \
--model_file cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=128_wd=0.0_save_epoch=1/model_300.t7 \
--dir_type weights --xnorm filter --xignore biasbn --plot
  • --dir_type weights indicates the direction has the same dimensions as the learned parameters, including bias and parameters in the BN layers.
  • --xnorm filter normalizes the random direction at the filter level. Here, a "filter" refers to the parameters that produce a single feature map. For fully connected layers, a "filter" contains the weights that contribute to a single neuron.
  • --xignore biasbn ignores the direction components corresponding to bias and BN parameters (the corresponding entries in the random vector are filled with zeros).

VGG-9 SGD, WD=0
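The filter normalization described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the repository's code: parameters are a dict of NumPy arrays, a "filter" is taken to be one slice along the first axis, and 1-D tensors (biases, BN scale/shift) are always zeroed, so the sketch only covers the --xignore biasbn case. The function names are hypothetical.

```python
import numpy as np

def filter_normalize(direction, weights, eps=1e-10):
    """Rescale every filter of `direction` to the norm of the matching filter of `weights`.

    A filter is one slice along the first axis: an output channel of a conv kernel
    shaped (out, in, kh, kw), or one row (neuron) of a fully connected matrix (out, in).
    """
    out = np.empty_like(direction)
    for j in range(direction.shape[0]):
        d_j = direction[j]
        out[j] = d_j / (np.linalg.norm(d_j) + eps) * np.linalg.norm(weights[j])
    return out

def random_filter_normalized_direction(params, rng):
    """Draw a Gaussian direction per parameter tensor and filter-normalize it.

    1-D tensors are zeroed, mirroring the description of --xignore biasbn.
    """
    direction = {}
    for name, w in params.items():
        if w.ndim <= 1:
            direction[name] = np.zeros_like(w)
        else:
            direction[name] = filter_normalize(rng.standard_normal(w.shape), w)
    return direction
```

After normalization, each filter of the direction has the same norm as the corresponding filter of the trained weights, so perturbation sizes are comparable across layers with very different weight scales.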

We can also customize the appearance of the 1D plots by calling plot_1D.py once the surface file is available.

Visualizing 2D loss contours

To plot the loss contours, we choose two random directions and normalize them in the same way as the 1D plotting.

mpirun -n 4 python plot_surface.py --mpi --cuda --model resnet56 --x=-1:1:51 --y=-1:1:51 \
--model_file cifar10/trained_nets/resnet56_sgd_lr=0.1_bs=128_wd=0.0005/model_300.t7 \
--dir_type weights --xnorm filter --xignore biasbn --ynorm filter --yignore biasbn  --plot

ResNet-56
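Conceptually, the 2D case evaluates L(theta* + alpha * delta + beta * eta) on a grid spanned by two directions delta and eta. The NumPy toy below shows the grid evaluation only; the quadratic loss, random directions, and function name are hypothetical, and it does not reproduce the repository's MPI work distribution or file layout. With --x=-1:1:51 --y=-1:1:51 the grid has 51 x 51 = 2601 points, each requiring one loss evaluation.

```python
import numpy as np

def loss_surface_2d(loss_fn, theta_star, delta, eta, xs, ys):
    """Return surface with surface[i, j] = loss_fn(theta_star + xs[i] * delta + ys[j] * eta)."""
    surface = np.empty((len(xs), len(ys)))
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            surface[i, j] = loss_fn(theta_star + a * delta + b * eta)
    return surface

rng = np.random.default_rng(0)
theta_star = np.zeros(100)                   # pretend trained parameters
delta, eta = rng.standard_normal((2, 100))   # two random directions
xs = np.linspace(-1, 1, 51)                  # like --x=-1:1:51
ys = np.linspace(-1, 1, 51)                  # like --y=-1:1:51
surface = loss_surface_2d(lambda t: float(np.sum(t ** 2)), theta_star, delta, eta, xs, ys)
```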

Once a surface is generated and stored in a .h5 file, we can produce and customize a contour plot using the script plot_2D.py.

python plot_2D.py --surf_file path_to_surf_file --surf_name train_loss
  • --surf_name specifies the type of surface. The default choice is train_loss.
  • --vmin and --vmax set the range of values to be plotted.
  • --vlevel sets the step between contour levels.
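If you want to build your own plot from a surface file, the sketch below reads one with h5py. It assumes the file holds 1-D xcoordinates and ycoordinates datasets plus a dataset named after the surface (e.g. train_loss), which matches the keys mentioned in the Comments below, but the helper names are hypothetical and contour_levels is only one plausible reading of how --vmin, --vmax and --vlevel combine.

```python
import numpy as np

def load_surface(surf_file, surf_name="train_loss"):
    """Return (x, y, z) arrays read from a surface .h5 file."""
    import h5py  # imported lazily so contour_levels works without h5py installed

    with h5py.File(surf_file, "r") as f:
        x = f["xcoordinates"][:]
        y = f["ycoordinates"][:]
        z = np.array(f[surf_name][:])
    return x, y, z

def contour_levels(vmin, vmax, vlevel):
    """Contour levels from vmin up to (but excluding) vmax, spaced vlevel apart."""
    return np.arange(vmin, vmax, vlevel)
```

The arrays can then be passed to matplotlib's contour with levels=contour_levels(vmin, vmax, vlevel). Check whether z is laid out as (len(x), len(y)) or (len(y), len(x)) first; this sketch does not assume either.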

Visualizing 3D loss surface

plot_2D.py can make a basic 3D loss surface plot with matplotlib. If you want a more detailed rendering that uses lighting to display details, you can render the loss surface with ParaView.

ResNet-56-noshort ResNet-56

To do this, you must

  1. Convert the surface .h5 file to a .vtp file.
python h52vtp.py --surf_file path_to_surf_file --surf_name train_loss --zmax  10 --log

This generates a VTK (.vtp) file containing the loss surface, capped at a maximum value of 10 and displayed on a log scale.

  2. Open the .vtp file in ParaView using the VTK reader. Click the eye icon in the Pipeline Browser to make the figure show up. You can drag the surface around and change the colors in the Properties window.

  3. If the surface appears extremely skinny and needle-like, you may need to adjust the "transforming" parameters in the left control panel. Enter numbers larger than 1 in the "scale" fields to widen the plot.

  4. Select Save screenshot in the File menu to save the image.
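To get a feel for what --zmax 10 --log does to the heights in step 1, here is a hypothetical NumPy re-creation: cap the loss at zmax, then take the natural log. The exact transform inside h52vtp.py may differ (for example in the order of the two steps or an added offset), and the sketch assumes strictly positive loss values.

```python
import numpy as np

def prepare_heights(loss, zmax=10.0, log=True):
    """Cap loss values at zmax, then optionally move them to a natural-log scale.

    Assumes every loss value is > 0; log(0) would give -inf.
    """
    z = np.minimum(np.asarray(loss, dtype=float), zmax)
    return np.log(z) if log else z
```

Capping keeps a few huge losses far from the minimizer from stretching the rendered surface into the needle shape mentioned in step 3.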

Reference

[1] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. ICLR, 2015.

[2] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. ICLR, 2017.

Citation

If you find this code useful in your research, please cite:

@inproceedings{visualloss,
  title={Visualizing the Loss Landscape of Neural Nets},
  author={Li, Hao and Xu, Zheng and Taylor, Gavin and Studer, Christoph and Goldstein, Tom},
  booktitle={Neural Information Processing Systems},
  year={2018}
}
Comments
  • Is there any difference about using 'OpenMPI'?

    Hello. I tried to run this code on Ubuntu 16.04 LTS with a GeForce TITAN X GPU and PyTorch 0.4.1. When I run the code with mpirun -n 4 python plot_surface.py --mpi --cuda --model vgg9 --x=-0.5:1.5:401 --dir_type states --model_file cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=128_wd=0.0_save_epoch=1/model_300.t7 --model_file2 cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=8192_wd=0.0_save_epoch=1/model_300.t7 --plot --show, nothing happens.

    But if I just delete mpirun -n 4, the code starts running. I think it is in the training process. After that process finishes, I can see the plotted results.

    Can I run this code without OpenMPI? I only know that OpenMPI is used for parallel computation. So can I use the code with python plot_surface.py --mpi --cuda --model vgg9 --x=-0.5:1.5:401 --dir_type states --model_file cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=128_wd=0.0_save_epoch=1/model_300.t7 --model_file2 cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=8192_wd=0.0_save_epoch=1/model_300.t7 --plot --show?

    opened by seongkyun 4
  • error when converting h5

    hi

    I downloaded the ResNet .h5 files from https://drive.google.com/a/cs.umd.edu/file/d/12oxkvfaKcPyyHiOevVNTBzaQ1zAFlNPX/view?usp=sharing

    Then I tried the conversion: python h52vtp.py --surf_file path_to_surf_file --surf_name train_loss --zmax 10 --log

    but I got this error:

    Traceback (most recent call last):
      File "loss-landscape/h52vtp.py", line 259, in <module>
        h5_to_vtp(args.surf_file, args.surf_name, log=args.log, zmax=args.zmax, interp=args.interp)
      File "loss-landscape/h52vtp.py", line 38, in h5_to_vtp
        [xcoordinates, ycoordinates] = np.meshgrid(f['xcoordinates'][:], f['ycoordinates'][:][:])
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
      File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 177, in __getitem__
        oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
      File "h5py/h5o.pyx", line 190, in h5py.h5o.open
    KeyError: "Unable to open object (object 'xcoordinates' doesn't exist)"

    any tips?

    And most importantly, can you advise me on how to convert .h5 files to .obj mesh files? Thank you so much.

    opened by javismiles 3
  • Undefined name: import os for lines 9 and 10

    flake8 testing of https://github.com/tomgoldstein/loss-landscape on Python 3.7.1

    $ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

    ./cifar10/dataloader.py:7:16: F821 undefined name 'os'
            assert os.path.exists(trainloader_path), 'trainloader does not exist'
                   ^
    ./cifar10/dataloader.py:7:31: F821 undefined name 'trainloader_path'
            assert os.path.exists(trainloader_path), 'trainloader does not exist'
                                  ^
    ./cifar10/dataloader.py:8:16: F821 undefined name 'os'
            assert os.path.exists(testloader_path), 'testloader does not exist'
                   ^
    ./cifar10/dataloader.py:8:31: F821 undefined name 'testloader_path'
            assert os.path.exists(testloader_path), 'testloader does not exist'
                                  ^
    ./cifar10/dataloader.py:9:34: F821 undefined name 'trainloader_path'
            trainloader = torch.load(trainloader_path)
                                     ^
    ./cifar10/dataloader.py:10:33: F821 undefined name 'testloader_path'
            testloader = torch.load(testloader_path)
                                    ^
    6     F821 undefined name 'os'
    6
    
    opened by cclauss 3
  • H5py file lock fix for newer h5py versions.

    So newer versions of h5py throw the following error when f = h5py.File(surf_file, 'r+' if rank == 0 else 'r') is called:

    OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
    

    This fix prevents this error from happening and the project can run on newer versions of h5py. This will also finally fix issue https://github.com/tomgoldstein/loss-landscape/issues/4 .

    opened by KaleabTessera 2
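    For reference, a commonly used workaround for this lock error (not necessarily what this PR changes) is to disable HDF5 file locking through the HDF5_USE_FILE_LOCKING environment variable, which newer HDF5 releases read. The sketch sets it from Python; the function name is hypothetical, and the variable should be set before h5py is imported so that every MPI rank inherits it.

```python
import os

def disable_hdf5_file_locking():
    """Switch off HDF5 file locking for this process and any child processes it starts.

    Newer HDF5 releases honour HDF5_USE_FILE_LOCKING; call this before importing h5py.
    """
    os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

disable_hdf5_file_locking()
# import h5py only after this point
```

    The same effect can be had by exporting the variable in the shell before launching mpirun.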
  • train_loss is not found

    I run the plot_surface code like so:

        /usr/bin/python -u /local/mnt/workspace/ikarmano/Gitlab/sagd/loss-landscape/plot_surface.py --cuda \
        --x=-1:1:51 --y=-1:1:51 --model_file models/32_32_32_32_32_32_32_32_32_32_32_32_32_32_32cnn.t \
        --dir_type weights --xnorm filter --xignore biasbn --ynorm filter --yignore biasbn --plot
    

    And it seems to calculate the loss fine:

    Evaluating rank 2  90/2601  (3.5%)  coord=[ 0.56 -0.96] 	train_loss= 21.470 	train_acc=14.54 	time=5.28 	sync=0.00
    Evaluating rank 2  91/2601  (3.5%)  coord=[ 0.6  -0.96] 	train_loss= 22.225 	train_acc=14.10 	time=5.65 	sync=0.00
    Evaluating rank 2  92/2601  (3.5%)  coord=[ 0.64 -0.96] 	train_loss= 23.044 	train_acc=13.67 	time=5.92 	sync=0.00
    Evaluating rank 2  93/2601  (3.6%)  coord=[ 0.68 -0.96] 	train_loss= 23.935 	train_acc=13.33 	time=5.71 	sync=0.00
    Evaluating rank 2  94/2601  (3.6%)  coord=[ 0.72 -0.96] 	train_loss= 24.905 	train_acc=13.02 	time=5.65 	sync=0.00
    Evaluating rank 2  95/2601  (3.7%)  coord=[ 0.76 -0.96] 	train_loss= 25.958 	train_acc=12.66 	time=5.50 	sync=0.00
    Evaluating rank 2  96/2601  (3.7%)  coord=[ 0.8  -0.96] 	train_loss= 27.100 	train_acc=12.37 	time=5.99 	sync=0.00
    Evaluating rank 2  97/2601  (3.7%)  coord=[ 0.84 -0.96] 	train_loss= 28.334 	train_acc=12.13 	time=5.85 	sync=0.00
    Evaluating rank 2  98/2601  (3.8%)  coord=[ 0.88 -0.96] 	train_loss= 29.666 	train_acc=11.91 	time=5.71 	sync=0.00
    Evaluating rank 2  99/2601  (3.8%)  coord=[ 0.92 -0.96] 	train_loss= 31.101 	train_acc=11.69 	time=5.58 	sync=0.00
    

    However, the plot functions do not work because 'train_loss' is not found:

    train_loss is not found in ../models/32_32_32_32_32_32_32_32_32_32_32_32_32_32_32cnn.t_weights_xignore=biasbn_xnorm=filter_yignore=biasbn_ynorm=filter.h5_[-1.0,1.0,51]x[-1.0,1.0,51].h5
    

    And if I print the keys(), it's just:

    <KeysViewHDF5 ['dir_file', 'xcoordinates', 'ycoordinates']>

    Not sure what I'm doing wrong?

    opened by ilkarman 2
  • Trajectory plot

    To plot Figure 10 in your paper, I assume I should first generate the PCA directions with plot_trajectory.py and then use those PCA directions to plot the loss contours of the final model.

    I had to write my own code because of some technical difficulties. However, I noticed that for my data the trajectory should start from a loss of about 0.9, but the loss contour of the final model is far from 0.9 at the trajectory's starting point. This makes me think that there is no guarantee that the loss contours crossed by the plotted trajectory reflect the real loss; in other words, the loss contours of the final model do not show the "loss landscape" along the trajectory. However, when I reduced the number of models used to perform the PCA, the trajectory's starting point lay near the 0.9 loss contour of the final model.

    This is reasonable, since the final model is perturbed along pc1 and pc2, while the trajectory is projected onto pc1 and pc2, and a model can actually be far away from its projection. Thus the loss along the trajectory can be far from the loss contours of the final model.

    I understand that pc1 and pc2 can explain most of the variance among the parameters of all the models, but there is no guarantee that they explain most of the difference between any given model and the final model. Is that why I got more "accurate" results when I used fewer models to estimate the principal components?

    opened by ghost 2
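    The concern above can be made concrete with a small NumPy sketch: compute two principal directions from the differences (checkpoint minus final model), project each checkpoint onto that plane, and measure how far it lies from the plane. The residual is exactly the part of a checkpoint that a contour plot through the final model cannot show. This is an illustration under assumptions (flattened parameter vectors, mean-centered PCA via SVD, hypothetical function names), not the repository's plot_trajectory.py.

```python
import numpy as np

def pca_directions(checkpoints, final, k=2):
    """Top-k principal directions of the differences (checkpoint - final).

    checkpoints: (n_models, n_params) array, one flattened model per row.
    final: (n_params,) flattened final model.
    Returns (k, n_params) orthonormal directions and their explained-variance ratios.
    """
    diffs = checkpoints - final
    centered = diffs - diffs.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratios = s ** 2 / np.sum(s ** 2)
    return vt[:k], ratios[:k]

def project_trajectory(checkpoints, final, dirs):
    """2D coordinates of each checkpoint in the plane through `final` spanned by dirs,
    plus each checkpoint's distance from that plane."""
    diffs = checkpoints - final
    coords = diffs @ dirs.T
    residual = np.linalg.norm(diffs - coords @ dirs, axis=1)
    return coords, residual
```

    A checkpoint with a large residual can sit on a contour line in the plot while its true loss is quite different, even when the two components explain most of the total variance.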
  • change type conversion code to fix bug on fp16

    There is an implicit float16 -> float32 conversion in the original code if the net weights are float16:

    p.data = w + torch.Tensor(d).type(type(w))
    

    This results in inaccurate loss computation. The following code avoids the problem because type_as(w) converts d to exactly the type of w:

    p.data = w + torch.Tensor(d).type_as(w)
    
    opened by TobiasLee 1
  • Nan values for loss when running both 1D and 2D plotting

    Hi, I'm not sure if I'm missing something simple, but using either VGG9 or ResNet56 for both the 1D and 2D visualizations gives NaN losses where I would expect much smaller, finite values. I'm using PyTorch 0.4.1 and the models downloaded from the provided links.

    opened by pkadambi 1
  • Plotting to pdf files

    Some of our nodes only support a command-line interface and have no GUI display. When we run the code, we get:

    _tkinter.TclError: no display name and no $DISPLAY environment variable
    

    Can we save figures just into files such as pdf files?

    opened by wenwei202 1
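    A common way to avoid this error on headless nodes (not necessarily what the repository does) is to select matplotlib's non-interactive Agg backend before pyplot is imported and write figures straight to disk; setting the MPLBACKEND=Agg environment variable has the same effect. The helper below is a hypothetical example that saves a 1D loss curve as a PDF.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no $DISPLAY required
import matplotlib.pyplot as plt

def save_curve_pdf(xs, losses, out_path):
    """Plot a 1D loss curve and write it to a PDF file without opening a window."""
    fig, ax = plt.subplots()
    ax.plot(xs, losses)
    ax.set_xlabel("alpha")
    ax.set_ylabel("train_loss")
    fig.savefig(out_path, format="pdf")
    plt.close(fig)  # free the figure; nothing is ever shown on screen
```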
  • trajectory plot on contour

    Hi guys,

    I have already gotten the contour plot and the trajectory plot, but in two different PDFs. I really want to plot the trajectory on the contour (just like Figure 9 in the paper). Has anyone successfully done that?

    Thanks!

    opened by activelearning2022 0
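    One way to get both on a single figure is to draw the contours and the projected trajectory on the same matplotlib axes. The sketch below is hypothetical: it assumes the contour grid and the trajectory coordinates are already expressed in the same (pc1, pc2) units, which is the case when the surface was computed along the same directions used for the projection.

```python
import matplotlib
matplotlib.use("Agg")  # render to file; no display needed
import matplotlib.pyplot as plt

def plot_trajectory_on_contour(x, y, z, traj, out_path, levels=20):
    """Draw loss contours and overlay a projected trajectory on the same axes.

    x, y: 1-D grid coordinates; z: loss values shaped (len(y), len(x));
    traj: (n_steps, 2) array of projected (pc1, pc2) coordinates.
    """
    fig, ax = plt.subplots()
    cs = ax.contour(x, y, z, levels=levels)
    ax.clabel(cs, inline=True, fontsize=8)
    ax.plot(traj[:, 0], traj[:, 1], "o-", color="red", markersize=3, label="trajectory")
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)
```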
  • How to generate the direction and surface file (.h5 files)?

    I can plot the loss surface using the attached trained networks. If I want to apply the code to a new model, could you provide more details about how to generate the .t7 and .h5 files?

    opened by DAIZHENWEI 0
  • Support for later versions of h5py (3.7)

    Are there any plans to add support for h5py 3.7 and HDF5 1.12.0 in the near future? At this point it is very difficult to find and compile HDF5 1.8.16 on macOS Monterey, as it is obsolete. Also, h5py 2.7 (the required version) does not work with later versions of Python (3.9+) either.

    opened by stefanroata 0
  • Why L2 norm leads to narrower landscape near minimal when showing the trajectory?

    In Figure 9 of your paper, I noticed that with the L2 norm the landscape becomes narrower around the minimum, which is different from the previous figures.

    I know that you choose the directions differently here, using PCA. The pattern can also be explained by reasoning from the result back to the cause: the L2 norm makes training harder, so the convex region is smaller. However, I am curious whether you have any deeper insight into this pattern. Thanks!

    opened by Zhangyanbo 0
  • 2D loss contours

    When I re-run the code for visualizing 2D loss contours, the generated 2D loss contours and 3D surface are different from the figures provided on Google Drive. I use the same settings and pre-trained models:

    mpirun -n 4 python plot_surface.py --mpi --cuda --model resnet56 --x=-1:1:51 --y=-1:1:51 \
    --model_file cifar10/trained_nets/resnet56_sgd_lr=0.1_bs=128_wd=0.0005/model_300.t7 \
    --dir_type weights --xnorm filter --xignore biasbn --ynorm filter --yignore biasbn  --plot 
    
    opened by lan-lw 0
  • --dir_type states vs weights

    --dir_type states indicates the direction contains dimensions for all parameters as well as the statistics of the BN layers (running_mean and running_var). Note that ignoring running_mean and running_var cannot produce correct loss values when plotting two solutions together in the same figure.

    Why are the running mean and variance important for plotting two solutions in the same figure? Could anyone help me with this?

    I'm trying to plot two or three solutions in one surface plot.

    Thanks

    opened by anyuzoey 0
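    A small NumPy sketch may help with this question. At evaluation time a BN layer normalizes with its stored running_mean and running_var, so if only the weights are interpolated and the statistics stay at model A's values, the network evaluated at alpha = 1 is not model B, and the curve no longer passes through the second solution. The single BN layer and all numbers below are made up for illustration.

```python
import numpy as np

def bn_eval(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Batch norm in evaluation mode: normalize with the stored running statistics."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

def lerp(a, b, alpha):
    """Linear interpolation between a (alpha = 0) and b (alpha = 1)."""
    return (1 - alpha) * a + alpha * b

x = np.array([1.0, 2.0, 3.0])
model_a = dict(gamma=1.0, beta=0.0, running_mean=0.0, running_var=1.0)
model_b = dict(gamma=2.0, beta=0.5, running_mean=2.0, running_var=4.0)
alpha = 1.0

# "weights"-style interpolation: gamma/beta move, running statistics stay at model A's values
weights_only = bn_eval(x, lerp(model_a["gamma"], model_b["gamma"], alpha),
                       lerp(model_a["beta"], model_b["beta"], alpha),
                       model_a["running_mean"], model_a["running_var"])
# "states"-style interpolation: running statistics are interpolated too
states = bn_eval(x, **{k: lerp(model_a[k], model_b[k], alpha) for k in model_a})
target = bn_eval(x, **model_b)  # what model B actually computes
```

    Here states matches model B exactly at alpha = 1, while weights_only does not, which is why the loss at the second minimizer would come out wrong without the BN statistics.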
Owner
Tom Goldstein