GRF: Learning a General Radiance Field for 3D Representation and Rendering

[Paper] [Video]

GRF: Learning a General Radiance Field for 3D Representation and Rendering
Alex Trevithick1,2 and Bo Yang2,3
1Williams College, 2University of Oxford, 3The Hong Kong Polytechnic University. In ICCV 2021.

This codebase is currently a work in progress.

Overview of GRF

GRF is a powerful implicit neural function that can represent and render arbitrarily complex 3D scenes in a single network only from 2D observations. GRF takes a set of posed 2D images as input, constructs an internal representation for each 3D point of the scene, and renders the corresponding appearance and geometry of any 3D point viewed from an arbitrary angle. The key to our approach is to explicitly integrate the principle of multi-view geometry to obtain features representative of an entire ray from a given viewpoint. Thus, in a single forward pass to render a scene from a novel view, GRF takes some views of that scene as input, computes per-pixel pose-aware features for each ray from the given viewpoints through the image plane at that pixel, and then uses those features to predict the volumetric density and RGB values of points in 3D space. Volumetric rendering is then applied to produce the final image.
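
To make the data flow above concrete, here is a minimal, self-contained PyTorch sketch of a GRF-style forward pass. The module and function names (FeatureCNN, GRFMLP, aggregate_point_features, render_ray), the toy shapes, the plain mean over views, and the placeholder pixel projections are all assumptions for illustration; the actual codebase uses a U-Net feature extractor, attention-based aggregation (AttSets or slot attention), and true multi-view projection.

import torch
import torch.nn as nn

class FeatureCNN(nn.Module):
    # Toy per-view feature extractor (stand-in for the repo's U-Net).
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )

    def forward(self, images):            # images: [V, 3, H, W]
        return self.net(images)           # [V, C, H, W]

class GRFMLP(nn.Module):
    # Maps an aggregated point feature plus the query direction to (RGB, sigma).
    def __init__(self, feat_dim=32, dir_dim=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + dir_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),         # 3 RGB channels + 1 density
        )

    def forward(self, x):
        out = self.net(x)
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3])

def aggregate_point_features(feat_maps, pixel_uv):
    # Gather the feature at each 3D point's projected pixel in every input view,
    # then pool across views (a simple mean here; the repo uses attention-based pooling).
    V = feat_maps.shape[0]
    gathered = feat_maps[torch.arange(V)[:, None], :,
                         pixel_uv[..., 1], pixel_uv[..., 0]]   # [V, P, C]
    return gathered.mean(dim=0)                                # [P, C]

def render_ray(rgb, sigma, z_vals):
    # Standard NeRF-style volume rendering along a single ray.
    deltas = z_vals[1:] - z_vals[:-1]
    deltas = torch.cat([deltas, deltas[-1:]], dim=0)
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)                 # final pixel color [3]

# Example: render one ray through a scene observed from V = 4 posed input views.
V, H, W, S = 4, 64, 64, 32
images = torch.rand(V, 3, H, W)
cnn, mlp = FeatureCNN(), GRFMLP()
feats = cnn(images)                                   # per-view feature maps
z_vals = torch.linspace(2.0, 6.0, S)                  # sample depths along the ray
pixel_uv = torch.randint(0, W, (V, S, 2))             # placeholder projections of the S points
point_feats = aggregate_point_features(feats, pixel_uv)        # [S, C]
view_dir = torch.tensor([0.0, 0.0, 1.0]).expand(S, 3)          # query ray direction
rgb, sigma = mlp(torch.cat([point_feats, view_dir], dim=-1))
pixel_rgb = render_ray(rgb, sigma, z_vals)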

Setting Up the Environment

Use conda to set up an environment as follows:

conda env create -f environment.yml
conda activate grf

Data

  • SRN cars and chairs datasets can be downloaded from the paper's drive link
  • NeRF-Synthetic and LLFF datasets can be downloaded from the NeRF drive link
  • MultiShapenet dataset can be downloaded from the DISN drive link

Training and Rendering from the Model

To train and render from the model, use the run.py script:

python run.py --data_root [path to directory with dataset] \
    --expname [experiment name]
    --basedir [where to store ckpts and logs]
    --datadir [input data directory]
    --netdepth [layers in network]
    --netwidth [channels per layer]
    --netdepth_fine [layers in fine network]
    --netwidth_fine [channels per layer in fine network]
    --N_rand [batch size (number of random rays per gradient step)]
    --lrate [learning rate]
    --lrate_decay [exponential learning rate decay (in 1000s)]
    --chunk [number of rays processed in parallel, decrease if running out of memory]
    --netchunk [number of pts sent through network in parallel, decrease if running out of memory]
    --no_batching [only take random rays from 1 image at a time]
    --no_reload [do not reload weights from saved ckpt]
    --ft_path [specific weights npy file to reload for coarse network]
    --random_seed [fix random seed for repeatability]
    --precrop_iters [number of steps to train on central crops]
    --precrop_frac [fraction of img taken for central crops]
    --N_samples [number of coarse samples per ray]
    --N_importance [number of additional fine samples per ray]
    --perturb [set to 0. for no jitter, 1. for jitter]
    --use_viewdirs [use full 5D input instead of 3D]
    --i_embed [set 0 for default positional encoding, -1 for none]
    --multires [log2 of max freq for positional encoding (3D location)]
    --multires_views [log2 of max freq for positional encoding (2D direction)]
    --raw_noise_std [std dev of noise added to regularize sigma_a output, 1e0 recommended]
    --render_only [do not optimize, reload weights and render out render_poses path]
    --dataset_type [options: llff / blender / shapenet / multishapenet]
    --testskip [will load 1/N images from test/val sets, useful for large datasets like deepvoxels]
    --white_bkgd [set to render synthetic data on a white bkgd (always use for dvoxels)]
    --half_res [load blender synthetic data at 400x400 instead of 800x800]
    --no_ndc [do not use normalized device coordinates (set for non-forward facing scenes)]
    --lindisp [sampling linearly in disparity rather than depth]
    --spherify [set for spherical 360 scenes]
    --llffhold [will take every 1/N images as LLFF test set, paper uses 8]
    --i_print [frequency of console printout and metric logging]
    --i_img [frequency of tensorboard image logging]
    --i_weights [frequency of weight ckpt saving]
    --i_testset [frequency of testset saving]
    --i_video [frequency of render_poses video saving]
    --attention_direction_multires [frequency of embedding for value]
    --attention_view_multires [frequency of embedding for direction]
    --training_recon [whether to render images from the test set or not during final evaluation]
    --use_quaternion [append input pose as quaternion to input to unet]
    --no_globl [don't use global vector in middle of unet]
    --no_render_pose [append render pose to input to unet]
    --use_attsets [use attsets, otherwise use slot attention]

In particular, note that to render and test from a trained model, set render_only to True in the config.
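
For illustration, a training invocation might look like the following. The dataset path, experiment name, and flag values below are placeholders rather than settings from the paper; in practice the provided configs (next section) set most of these options.

python run.py --datadir /path/to/nerf_synthetic/lego \
    --dataset_type blender \
    --expname lego_grf \
    --basedir ./logs \
    --N_rand 1024 \
    --N_samples 64 \
    --N_importance 64

To render and evaluate from the resulting checkpoints, set render_only to True in the corresponding config (as noted above) and re-run with the same expname and basedir.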

Configs

The current configs are for the Blender, LLFF, and ShapeNet datasets, and can be found in the configs directory.

After setting the parameters of the model, run it with:

python run.py --config configs/config_DATATYPE

Practical Concerns

The models were tested on 32GB GPUs, and higher-resolution images require very large amounts of memory. The ShapeNet experiments should run on 16GB GPUs.

Acknowledgements

The code is built upon the original NeRF implementation. Thanks to LucidRains for the torch implementation of slot attention on which the current version is based.
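
The slot attention referenced above (and the alternative AttSets pooling selected by --use_attsets) is used to aggregate per-view features into a single point feature. A rough, hypothetical sketch of attentional set pooling in that spirit, assuming a single linear scoring layer (not the repo's exact module):

import torch
import torch.nn as nn

class AttSetsPool(nn.Module):
    # Attentional set pooling: score each element, softmax over the set (view)
    # dimension, and take the weighted sum.
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, feat_dim)

    def forward(self, x):                              # x: [num_views, num_points, feat_dim]
        weights = torch.softmax(self.score(x), dim=0)  # attention weights over views
        return (weights * x).sum(dim=0)                # [num_points, feat_dim]

# e.g. pool features from 4 input views for 1024 sampled 3D points
pooled = AttSetsPool(feat_dim=32)(torch.rand(4, 1024, 32))   # -> [1024, 32]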

Citation

If you find our work useful in your research, please consider citing:

@inproceedings{grf2020,
  title={GRF: Learning a General Radiance Field for 3D Scene Representation and Rendering},
  author={Trevithick, Alex and Yang, Bo},
  booktitle={arXiv:2010.04595},
  year={2020}
}
Comments
  • A question about the section 3.5

    In Figure 6 in Section 3.5, the inputs to the MLP are the 3D point feature and the viewpoint (x, y, z) (corresponding to the 3D position in classical NeRF). I wonder whether the 2D direction is also needed as an input to the MLP?

    opened by sstzal 5
  • Some questions about the paper

    Hi, thanks for the great work!

    I have some questions:

    1. How much computational overhead does the CNN feature extraction introduce? At inference it is probably not much, since we only need one forward pass per image and can store the features in a buffer, but during training it has to be performed on the entire images at every iteration, while we only train on a very small portion of rays (800-1000). So I wonder whether this is somewhat inefficient and slow, or whether you have some clever implementation to accelerate this part.
    2. As for generalization, is it correct to understand that it only generalizes to objects within the same class (experiments on ShapeNetV2) with very similar visual and pose settings? For example, if we train on 7 NeRF-Synthetic scenes, does it generalize to the 8th?
    opened by kwea123 5
  • Confusion on section 3.3

    I'm rather confused by this section, because you cite P as this function based on multi-view geometry and then describe two approximations. Do these approximations represent P? Also, I am confused about how to implement these approximations beyond checking whether points fall inside or outside the image; specifically, how do you "duplicate its features to the 3D point"?

    opened by collinarnett 3
  • Question about the CNN model

    Hi Alex, I notice that different CNN models are used for different datasets. I wonder if there were some special considerations when designing the CNNs. And if I want to design a CNN for my dataset, what should I pay attention to? Thanks!

    opened by sstzal 2
  • A question on Section 4.1 (SHAPENETV2)

    Hey, thanks for sharing the great work. I have a question about Figure 4:

    "3) To further demonstrate the advantage of GRF over SRNs, we directly evaluate the trained SRNs model on unseen objects (of the same category) without retraining. For comparison, we also directly evaluate the trained GRF model on the same novel objects. Figure 4 shows the qualitative results. It can be seen that if not retrained, SRNs completely fails to".

    Since the SRNs model is not trained on the unseen objects, the latent code z for an unseen object is not optimized, so is it randomly initialized? If so, how can it generate novel views of unseen cars similar to the GT views? My understanding is that a randomly initialized latent code z would generate unpredictable cars (though similar to the training set), which seems to conflict with the quoted words above. This has confused me for hours.

    opened by wdmwhh 2
  • Positional Encoding

    I'm using the positional encoder found in the NeRF paper to encode my images after stacking the viewpoints on the colors, as mentioned in the paper; however, I'm unable to get the shapes to line up for input into the CNN. In NeRF's implementation they flatten before sending their inputs to the encoder.

    To give a more concrete example of what I'm talking about, here is my code:

    # reshape inputs to [20, 378, 504, 6] concatenating view to colorspace
    inputs = torch.tensor(np.concatenate([images, np.broadcast_to(np.expand_dims(C, (1,2)), images.shape)], axis=-1))
    # create embedder with length 5 as specified in the paper
    embed, input_ch = get_embedder(5, 0)
    # flatten (not sure if this step is required); shape [3810240, 6]
    inputs_flat = torch.reshape(inputs, [-1, inputs.shape[-1]])
    # apply embedding for an output shape of [3810240, 66]
    embedding = embed(inputs_flat)
    

    Not really sure where to go from here to submit to the CNN.
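
    One possible way to make the shapes line up (a sketch, not the repository's exact pipeline) is to apply the embedding to the flattened pixels and then restore the image layout before permuting to NCHW for the CNN; get_embedder here is the helper from the NeRF codebase used in the snippet above:

    # apply the embedding per pixel, then restore image layout for a 2D CNN
    embed, input_ch = get_embedder(5, 0)
    flat = inputs.reshape(-1, inputs.shape[-1])                          # [20*378*504, 6]
    embedded = embed(flat)                                               # [20*378*504, 66]
    embedded = embedded.reshape(*inputs.shape[:-1], embedded.shape[-1])  # [20, 378, 504, 66]
    cnn_input = embedded.permute(0, 3, 1, 2).float()                     # NCHW: [20, 66, 378, 504]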

    opened by collinarnett 2
  • Out of memory (OOM)

    My GPU has 24GB. I decreased the parameters as you suggested: --chunk [number of rays processed in parallel, decrease if running out of memory] and --netchunk [number of pts sent through network in parallel, decrease if running out of memory]. But it does not help, even when these two parameters are set to one.

    opened by qhdqhd 1
  • How do you organize your dataset directory?

    I noticed that when using the ShapeNet dataset for training, your dataset loading module uses paths like "train" and "train_val", but this is inconsistent with the raw dataset that Vincent provides. May I ask how you organize your project's dataset folders? Thank you in advance.

    opened by Chester-CS 1
  • How to get the unseen category/scene results as in Sections 4.2 and 4.3, and how to train several classes together for generalization?

    Hello! Thanks for sharing your great work! I wonder how to get the rendering results for an unseen category/scene. Section 4.3 says "We train a single model on randomly selected 4 scenes, i.e., Chair, Mic, Ship, and Hotdog, ....", and I wonder how to train several classes in each image batch (e.g., all NeRF-Synthetic datasets together, to get a generalized GRF model). As far as I can tell, each config text file contains only a single class. I guess that if a batch contains 4 classes and each class has 8 views, then the batch has 32 images in total; or a batch could contain 32 images of a single class, and the model would sequentially see all classes one by one.
    Can you describe your training configuration for generalization in more detail?

    Also, if 2 views or 6 views are fed, are only the 2/6 images given as input, or are the corresponding poses also required? Or could you provide all the required data for Sections 4.1-4.3 if possible (that would be the best way for me to understand it clearly)?

    Please give me some hint. ;)

    Cheers!

    opened by dedoogong 0