Dense Gaussian Processes For Few-Shot Segmentation |
Few-shot segmentation is a challenging dense prediction task, which entails segmenting a novel query image given only a small annotated support set. The key problem is thus to design a method that aggregates detailed information from the support set, while being robust to large variations in appearance and context. To this end, we propose a few-shot segmentation method based on dense Gaussian process (GP) regression. Given the support set, our dense GP learns the mapping from local deep image features to mask values, capable of capturing complex appearance distributions. Furthermore, it provides a principled means of capturing uncertainty, which serves as another powerful cue for the final segmentation, obtained by a CNN decoder. Instead of a one-dimensional mask output, we further exploit the end-to-end learning capabilities of our approach to learn a high-dimensional output space for the GP. Our approach sets a new state-of-the-art for both 1-shot and 5-shot FSS on the PASCAL-5$^i$ and COCO-20$^i$ benchmarks, achieving an absolute gain of $+14.9$ mIoU in the COCO-20$^i$ 5-shot setting. Furthermore, the segmentation quality of our approach scales gracefully when increasing the support set size, while achieving robust cross-dataset transfer. |
Joakim Johnander, Johan Edstedt, Michael Felsberg, Fahad Shahbaz Khan, Martin Danelljan |
2110.03674 |
link |
One Thing To Fool Them All: Generating Interpretable, Universal, And Physically-Realizable Adversarial Features |
It is well understood that modern deep networks are vulnerable to adversarial attacks. However, conventional methods fail to produce adversarial perturbations that are intelligible to humans, and they pose limited threats in the physical world. To study feature-class associations in networks and better understand the real-world threats they face, we develop feature-level adversarial perturbations using deep image generators and a novel optimization objective. We term these feature-fool attacks. We show that they are versatile and use them to generate targeted feature-level attacks at the ImageNet scale that are simultaneously interpretable, universal to any source image, and physically-realizable. These attacks can also reveal spurious, semantically-describable feature/class associations, and we use them to guide the design of "copy/paste" adversaries in which one natural image is pasted into another to cause a targeted misclassification. |
Stephen Casper, Max Nadeau, Gabriel Kreiman |
2110.03605 |
link |
Towards Accurate Cross-Domain In-Bed Human Pose Estimation |
Human behavioral monitoring during sleep is essential for various medical applications. The majority of contactless human pose estimation algorithms are based on the RGB modality, which makes them ineffective for in-bed pose estimation due to occlusions by blankets and varying illumination conditions. Pose estimation algorithms based on the long-wavelength infrared (LWIR) modality overcome these challenges; however, generating ground-truth poses by human annotation under such conditions is not feasible. A feasible solution is to transfer the knowledge learned from images with pose labels and no occlusions, and adapt it to real-world conditions (occlusions due to blankets). In this paper, we propose a novel learning strategy comprising two-fold data augmentation to reduce the cross-domain discrepancy and knowledge distillation to learn the distribution of unlabeled images in real-world conditions. Our experiments and analysis show the effectiveness of our approach over multiple standard human pose estimation baselines. |
Mohamed Afham, Udith Haputhanthri, Jathurshan Pradeepkumar, Mithunjha Anandakumar, Ashwin De Silva, Chamira Edussooriya |
2110.03578 |
link |
A Few-Shot Learning Graph Multi-Trajectory Evolution Network For Forecasting Multimodal Baby Connectivity Development From A Baseline Timepoint |
Charting the baby connectome evolution trajectory during the first year after birth plays a vital role in understanding dynamic connectivity development of baby brains. Such analysis requires acquisition of longitudinal connectomic datasets. However, both neonatal and postnatal scans are rarely acquired due to various difficulties. A small body of work has focused on predicting the baby brain evolution trajectory from a neonatal brain connectome derived from a single modality. Although promising, large training datasets are essential to boost model learning and to generalize to a multi-trajectory prediction from different modalities (i.e., functional and morphological connectomes). Here, we unprecedentedly explore the question: can we design a few-shot learning-based framework for predicting brain graph trajectories across different modalities? To this aim, we propose a Graph Multi-Trajectory Evolution Network (GmTE-Net), which adopts a teacher-student paradigm where the teacher network learns on pure neonatal brain graphs and the student network learns on simulated brain graphs given a set of different timepoints. To the best of our knowledge, this is the first teacher-student architecture tailored for brain graph multi-trajectory growth prediction that is based on few-shot learning and generalized to graph neural networks (GNNs). To boost the performance of the student network, we introduce a local topology-aware distillation loss that forces the predicted graph topology of the student network to be consistent with the teacher network. Experimental results demonstrate substantial performance gains over benchmark methods. Hence, our GmTE-Net can be leveraged to predict atypical brain connectivity trajectory evolution across various modalities. Our code is available at https://github.com/basiralab/gmte-net. |
Alaa Bessadok, Ahmed Nebli, Mohamed Ali Mahjoub, Gang Li, Weili Lin, Dinggang Shen, Islem Rekik |
2110.03535 |
link |
Unsupervised Image Decomposition With Phase-Correlation Networks |
The ability to decompose scenes into their object components is a desired property for autonomous agents, allowing them to reason and act in their surroundings. Recently, different methods have been proposed to learn object-centric representations from data in an unsupervised manner. These methods often rely on latent representations learned by deep neural networks, hence requiring high computational costs and large amounts of curated data. Such models are also difficult to interpret. To address these challenges, we propose the Phase-Correlation Decomposition Network (PCDNet), a novel model that decomposes a scene into its object components, which are represented as transformed versions of a set of learned object prototypes. The core building block in PCDNet is the Phase-Correlation Cell (PC Cell), which exploits the frequency-domain representation of the images in order to estimate the transformation between an object prototype and its transformed version in the image. In our experiments, we show how PCDNet outperforms state-of-the-art methods for unsupervised object discovery and segmentation on simple benchmark datasets and on more challenging data, while using a small number of learnable parameters and being fully interpretable. |
Angel Villar-Corrales, Sven Behnke |
2110.03473 |
link |
Recurrent Multigraph Integrator Network For Predicting The Evolution Of Population-Driven Brain Connectivity Templates |
Learning how to estimate a connectional brain template (CBT) from a population of brain multigraphs, where each graph (e.g., functional) quantifies a particular relationship between pairs of brain regions of interest (ROIs), allows us to pin down the unique connectivity patterns shared across individuals. Specifically, a CBT is viewed as an integral representation of a set of highly heterogeneous graphs that ideally meets the centeredness (i.e., minimum distance to all graphs in the population) and discriminativeness (i.e., distinguishes the healthy from the disordered population) criteria. So far, existing works have been limited to integrating and fusing a population of brain multigraphs acquired at a single timepoint. In this paper, we unprecedentedly tackle the question: given a baseline multigraph population, can we learn how to integrate and forecast its CBT representations at follow-up timepoints? Addressing this question is of paramount importance in predicting common alterations across healthy and disordered populations. To fill this gap, we propose the Recurrent Multigraph Integrator Network (ReMI-Net), the first graph recurrent neural network which infers the baseline CBT of an input population at $t_1$ and predicts its longitudinal evolution over time ($t_i > t_1$). Our ReMI-Net is composed of recurrent neural blocks with graph convolutional layers using cross-node message passing to first learn hidden-state embeddings of each CBT node (i.e., brain region of interest) and then predict its evolution at the consecutive timepoint. Moreover, we design a novel time-dependent loss to regularize the CBT evolution trajectory over time and further introduce a cyclic recursion and learnable normalization layer to generate well-centered CBTs from time-dependent hidden-state embeddings. Finally, we derive the CBT adjacency matrix from the learned hidden-state graph representation. |
Oytun Demirbilek, Islem Rekik |
2110.03453 |
link |
Inter-Domain Alignment For Predicting High-Resolution Brain Networks Using Teacher-Student Learning |
Accurate and automated super-resolution image synthesis is highly desired since it has great potential to circumvent the need for acquiring high-cost medical scans and a time-consuming preprocessing pipeline of neuroimaging data. However, existing deep learning frameworks are solely designed to predict a high-resolution (HR) image from a low-resolution (LR) one, which limits their generalization ability to brain graphs (i.e., connectomes). A small body of work has focused on superresolving brain graphs, where the goal is to predict an HR graph from a single LR graph. Although promising, existing works mainly focus on superresolving graphs belonging to the same domain (e.g., functional), overlooking the domain fracture existing between multimodal brain data distributions (e.g., morphological and structural). To this aim, we propose a novel inter-domain adaptation framework, namely Learn to SuperResolve Brain Graphs with Knowledge Distillation Network (L2S-KDnet), which adopts a teacher-student (TS) paradigm to superresolve brain graphs. Our teacher network is a graph encoder-decoder that first learns the LR brain graph embeddings, and second learns how to align the resulting latent representations to the HR ground-truth data distribution using an adversarial regularization. Ultimately, it decodes the HR graphs from the aligned embeddings. Next, our student network learns the knowledge of the aligned brain graphs as well as the topological structure of the predicted HR graphs transferred from the teacher. We further leverage the decoder of the teacher to optimize the student network. L2S-KDnet presents the first TS architecture tailored for brain graph super-resolution synthesis that is based on inter-domain alignment. Our experimental results demonstrate substantial performance gains over benchmark methods. |
Basar Demir, Alaa Bessadok, Islem Rekik |
2110.03452 |
link |
RHH-LGP: Receding Horizon And Heuristics-Based Logic-Geometric Programming For Task And Motion Planning |
Sequential decision-making and motion planning for robotic manipulation induce combinatorial complexity. For long-horizon tasks, especially when the environment comprises many objects that can be interacted with, planning efficiency becomes even more important. To plan such long-horizon tasks, we present the RHH-LGP algorithm for combined task and motion planning (TAMP). First, we propose a TAMP approach (based on Logic-Geometric Programming) that effectively uses geometry-based heuristics for solving long-horizon manipulation tasks. We further improve the efficiency of this planner by a receding horizon formulation, resulting in RHH-LGP. We demonstrate the effectiveness and generality of our approach on several long-horizon tasks that require reasoning about interactions with a large number of objects. Using our framework, we can solve tasks that require multiple robots, including a mobile robot and snake-like walking robots, to form novel heterogeneous kinematic structures autonomously. |
Cornelius V. Braun, Joaquim Ortiz-Haro, Marc Toussaint, Ozgur S. Oguz |
2110.03420 |
link |
Optimized U-Net For Brain Tumor Segmentation |
We propose an optimized U-Net architecture for a brain tumor segmentation task in the BraTS21 challenge. To find the optimal model architecture and learning schedule, we ran an extensive ablation study to test: deep supervision loss, focal loss, decoder attention, drop block, and residual connections. Additionally, we have searched for the optimal depth of the U-Net and number of convolutional channels. Our solution was the winner of the challenge validation phase, with the normalized statistical ranking score of 0.267 and mean Dice score of 0.8855. |
Michał Futrega, Alexandre Milesi, Michal Marcinkiewicz, Pablo Ribalta |
2110.03352 |
link |
End-To-End Supermask Pruning: Learning To Prune Image Captioning Models |
With the advancement of deep models, research work on image captioning has led to a remarkable gain in raw performance over the last decade, along with increasing model complexity and computational cost. Surprisingly, however, work on compressing deep networks for the image captioning task has received little to no attention. For the first time in image captioning research, we provide an extensive comparison of various unstructured weight pruning methods on three different popular image captioning architectures, namely Soft-Attention, Up-Down and Object Relation Transformer. Following this, we propose a novel end-to-end weight pruning method that performs gradual sparsification based on weight sensitivity to the training loss. The pruning schemes are then extended with encoder pruning, where we show that conducting both decoder pruning and training simultaneously prior to the encoder pruning provides good overall performance. Empirically, we show that an 80% to 95% sparse network (up to 75% reduction in model size) can either match or outperform its dense counterpart. The code and pre-trained models for Up-Down and Object Relation Transformer that are capable of achieving CIDEr scores >120 on the MS-COCO dataset but with only 8.7 MB and 14.5 MB in model size (size reduction of 96% and 94% respectively against dense versions) are publicly available at https://github.com/jiahuei/sparse-image-captioning. |
Jia Huei Tan, Chee Seng Chan, Joon Huang Chuah |
2110.03298 |
link |
Propagating State Uncertainty Through Trajectory Forecasting |
Uncertainty pervades the modern robotic autonomy stack, with nearly every component (e.g., sensors, detection, classification, tracking, behavior prediction) producing continuous or discrete probabilistic distributions. Trajectory forecasting, in particular, is surrounded by uncertainty as its inputs are produced by (noisy) upstream perception and its outputs are predictions that are often probabilistic for use in downstream planning. However, most trajectory forecasting methods do not account for upstream uncertainty, instead taking only the most-likely values. As a result, perceptual uncertainties are not propagated through forecasting and predictions are frequently overconfident. To address this, we present a novel method for incorporating perceptual state uncertainty in trajectory forecasting, a key component of which is a new statistical distance-based loss function which encourages predicting uncertainties that better match upstream perception. We evaluate our approach both in illustrative simulations and on large-scale, real-world data, demonstrating its efficacy in propagating perceptual state uncertainty through prediction and producing more calibrated predictions. |
Boris Ivanovic, Yifeng Lin, Shubham Shrivastava, Punarjay Chakravarty, Marco Pavone |
2110.03267 |
link |
Gradient Step Denoiser For Convergent Plug-And-Play |
Plug-and-play methods constitute a class of iterative algorithms for imaging problems where regularization is performed by an off-the-shelf denoiser. Although plug-and-play methods can lead to tremendous visual performance for various image problems, the few existing convergence guarantees are based on unrealistic (or suboptimal) hypotheses on the denoiser, or limited to strongly convex data terms. In this work, we propose a new type of plug-and-play method, based on half-quadratic splitting, for which the denoiser is realized as a gradient descent step on a functional parameterized by a deep neural network. Exploiting convergence results for proximal gradient descent algorithms in the non-convex setting, we show that the proposed plug-and-play algorithm is a convergent iterative scheme that targets stationary points of an explicit global functional. Moreover, experiments show that it is possible to learn such a deep denoiser while not compromising the performance in comparison to other state-of-the-art deep denoisers used in plug-and-play schemes. We apply our proximal gradient algorithm to various ill-posed inverse problems, e.g., deblurring, super-resolution and inpainting. For all these applications, numerical results empirically confirm the convergence results. Experiments also show that this new algorithm reaches state-of-the-art performance, both quantitatively and qualitatively. |
Samuel Hurault, Arthur Leclaire, Nicolas Papadakis |
2110.03220 |
link |
TreeGCN-ED: Encoding Point Cloud Using A Tree-Structured Graph Network |
A point cloud is an efficient way of representing and storing 3D geometric data. Deep learning algorithms on point clouds are time and memory efficient. Several methods such as PointNet and FoldingNet have been proposed for processing point clouds. This work proposes an autoencoder-based framework to generate robust embeddings for point clouds by utilizing hierarchical information using graph convolution. We perform multiple experiments to assess the quality of embeddings generated by the proposed encoder architecture and visualize the t-SNE map to highlight its ability to distinguish between different object classes. We further demonstrate the applicability of the proposed framework in applications such as 3D point cloud completion and single-image-based 3D reconstruction. |
Prajwal Singh, Kaustubh Sadekar, Shanmuganathan Raman |
2110.03170 |
link |
Efficient Sharpness-Aware Minimization For Improved Training Of Neural Networks |
Overparametrized deep neural networks (DNNs) often achieve astounding performances, but may potentially result in severe generalization error. Recently, the relation between the sharpness of the loss landscape and the generalization error has been established by Foret et al. (2020), in which the Sharpness Aware Minimizer (SAM) was proposed to mitigate the degradation of the generalization. Unfortunately, SAM's computational cost is roughly double that of base optimizers, such as Stochastic Gradient Descent (SGD). This paper thus proposes the Efficient Sharpness Aware Minimizer (ESAM), which boosts SAM's efficiency at no cost to its generalization performance. ESAM includes two novel and efficient training strategies: Stochastic Weight Perturbation and Sharpness-Sensitive Data Selection. In the former, the sharpness measure is approximated by perturbing a stochastically chosen set of weights in each iteration; in the latter, the SAM loss is optimized using only a judiciously selected subset of data that is sensitive to the sharpness. We provide theoretical explanations as to why these strategies perform well. We also show, via extensive experiments on the CIFAR and ImageNet datasets, that ESAM enhances the efficiency over SAM from requiring 100% extra computations to 40% vis-a-vis base optimizers, while test accuracies are preserved or even improved. |
Jiawei Du, Hanshu Yan, Jiashi Feng, Joey Tianyi Zhou, Liangli Zhen, Rick Siow Mong Goh, Vincent Y. F. Tan |
2110.03141 |
link |
SPEED+: Next Generation Dataset For Spacecraft Pose Estimation Across Domain Gap |
Autonomous vision-based spaceborne navigation is an enabling technology for future on-orbit servicing and space logistics missions. While computer vision in general has benefited from Machine Learning (ML), training and validating spaceborne ML models are extremely challenging due to the impracticality of acquiring a large-scale labeled dataset of images of the intended target in the space environment. Existing datasets, such as the Spacecraft PosE Estimation Dataset (SPEED), have so far mostly relied on synthetic images for both training and validation, which are easy to mass-produce but fail to resemble the visual features and illumination variability inherent to the target spaceborne images. In order to bridge the gap between the current practices and the intended applications in future space missions, this paper introduces SPEED+: the next generation spacecraft pose estimation dataset with specific emphasis on domain gap. In addition to 60,000 synthetic images for training, SPEED+ includes 9,531 simulated images of a spacecraft mockup model captured from the Testbed for Rendezvous and Optical Navigation (TRON) facility. TRON is a first-of-a-kind robotic testbed capable of capturing an arbitrary number of target images with accurate and maximally diverse pose labels and high-fidelity spaceborne illumination conditions. SPEED+ will be used in the upcoming international Satellite Pose Estimation Challenge co-hosted with the Advanced Concepts Team of the European Space Agency to evaluate and compare the robustness of spaceborne ML models trained on synthetic images. |
Tae Ha Park, Marcus Märtens, Gurvan Lecuyer, Dario Izzo, Simone D'Amico |
2110.03101 |
link |
Large-Scale Topological Radar Localization Using Learned Descriptors |
In this work, we propose a method for large-scale topological localization based on radar scan images using learned descriptors. We present a simple yet efficient deep network architecture to compute a rotationally invariant discriminative global descriptor from a radar scan image. The performance and generalization ability of the proposed method are experimentally evaluated on two large-scale driving datasets: MulRan and Oxford Radar RobotCar. Additionally, we present a comparative evaluation of radar-based and LiDAR-based localization using learned global descriptors. Our code and trained models are publicly available on the project website. |
Jacek Komorowski, Monika Wysoczanska, Tomasz Trzcinski |
2110.03081 |
link |
DeepBBS: Deep Best Buddies For Point Cloud Registration |
Recently, several deep learning approaches have been proposed for point cloud registration. These methods train a network to generate a representation that helps find matching points in two 3D point clouds. Finding good matches allows them to calculate the transformation between the point clouds accurately. Two challenges of these techniques are dealing with occlusions and generalizing to objects of classes unseen during training. This work proposes DeepBBS, a novel method for learning a representation that takes into account the best buddy distance between points during training. Best Buddies (i.e., mutual nearest neighbors) are pairs of points nearest to each other. The Best Buddies criterion is a strong indication for correct matches that, in turn, leads to accurate registration. Our experiments show improved performance compared to previous methods. In particular, our learned representation leads to an accurate registration for partial shapes and in unseen categories. |
Itan Hezroni, Amnon Drory, Raja Giryes, Shai Avidan |
2110.03016 |
link |
Adaptive Unfolding Total Variation Network For Low-Light Image Enhancement |
Real-world low-light images suffer from two main degradations, namely, inevitable noise and poor visibility. Since the noise exhibits different levels, its estimation has been implemented in recent works when enhancing low-light images from raw Bayer space. When it comes to sRGB color space, the noise estimation becomes more complicated due to the effect of the image processing pipeline. Nevertheless, most existing enhancing algorithms in sRGB space only focus on the low visibility problem or suppress the noise under a hypothetical noise level, rendering them impractical due to the lack of robustness. To address this issue, we propose an adaptive unfolding total variation network (UTVNet), which approximates the noise level from the real sRGB low-light image by learning the balancing parameter in the model-based denoising method with total variation regularization. Meanwhile, we learn the noise level map by unrolling the corresponding minimization process for providing the inferences of smoothness and fidelity constraints. Guided by the noise level map, our UTVNet can recover finer details and is better able to suppress noise in real captured low-light scenes. Extensive experiments on real-world low-light images clearly demonstrate the superior performance of UTVNet over state-of-the-art methods. |
Chuanjun Zheng, Daming Shi, Wentian Shi |
2110.00984 |
link |
Student Helping Teacher: Teacher Evolution Via Self-Knowledge Distillation |
Knowledge distillation usually transfers the knowledge from a pre-trained cumbersome teacher network to a compact student network, which follows the classical teacher-teaching-student paradigm. Based on this paradigm, previous methods mostly focus on how to efficiently train a better student network for deployment. Different from the existing practices, in this paper, we propose a novel student-helping-teacher formula, Teacher Evolution via Self-Knowledge Distillation (TESKD), where the target teacher (for deployment) is learned with the help of multiple hierarchical students by sharing the structural backbone. The diverse feedback from multiple students allows the teacher to improve itself through the shared feature representations. The effectiveness of our proposed framework is demonstrated by extensive experiments with various network settings on two standard benchmarks including CIFAR-100 and ImageNet. Notably, when trained together with our proposed method, ResNet-18 achieves 79.15% and 71.14% accuracy on CIFAR-100 and ImageNet, outperforming the baseline results by 4.74% and 1.43%, respectively. The code is available at: https://github.com/zhengli427/teskd. |
Zheng Li, Xiang Li, Lingfeng Yang, Jian Yang, Zhigeng Pan |
2110.00329 |
link |
On Assessing The Usefulness Of Proxy Domains For Developing And Evaluating Embodied Agents |
In many situations it is either impossible or impractical to develop and evaluate agents entirely on the target domain on which they will be deployed. This is particularly true in robotics, where doing experiments on hardware is much more arduous than in simulation. This has become arguably more so in the case of learning-based agents. To this end, considerable recent effort has been devoted to developing increasingly realistic and higher fidelity simulators. However, we lack any principled way to evaluate how good a "proxy domain" is, specifically in terms of how useful it is in helping us achieve our end objective of building an agent that performs well in the target domain. In this work, we investigate methods to address this need. We begin by clearly separating two uses of proxy domains that are often conflated: 1) their ability to be a faithful predictor of agent performance and 2) their ability to be a useful tool for learning. In this paper, we attempt to clarify the role of proxy domains and establish new proxy usefulness (PU) metrics to compare the usefulness of different proxy domains. We propose the relative predictive PU to assess the predictive ability of a proxy domain and the learning PU to quantify the usefulness of a proxy as a tool to generate learning data. Furthermore, we argue that the value of a proxy is conditioned on the task that it is being used to help solve. We demonstrate how these new metrics can be used to optimize parameters of the proxy domain for which obtaining ground truth via system identification is not trivial. |
Anthony Courchesne, Andrea Censi, Liam Paull |
2109.14516 |
link |
A Scaling Law For Synthetic-To-Real Transfer: How Much Is Your Pre-Training Effective? |
Synthetic-to-real transfer learning is a framework in which a synthetically generated dataset is used to pre-train a model to improve its performance on real vision tasks. The most significant advantage of using synthetic images is that the ground-truth labels are automatically available, enabling unlimited expansion of the data size without human cost. However, synthetic data may have a huge domain gap, in which case increasing the data size does not improve the performance. How can we know that? In this study, we derive a simple scaling law that predicts the performance from the amount of pre-training data. By estimating the parameters of the law, we can judge whether we should increase the data or change the setting of image synthesis. Further, we analyze the theory of transfer learning by considering learning dynamics and confirm that the derived generalization bound is consistent with our empirical findings. We empirically validated our scaling law on various experimental settings of benchmark tasks, model sizes, and complexities of synthetic images. |
Hiroaki Mikami, Kenji Fukumizu, Shogo Murai, Shuji Suzuki, Yuta Kikuchi, Taiji Suzuki, Shin-Ichi Maeda, Kohei Hayashi |
2108.11018 |
link |
APReL: A Library For Active Preference-Based Reward Learning Algorithms |
Reward learning is a fundamental problem in robotics: robots should operate in alignment with what their human users want. Many preference-based learning algorithms and active querying techniques have been proposed as a solution to this problem. In this paper, we present APReL, a library for active preference-based reward learning algorithms, which enables researchers and practitioners to experiment with the existing techniques and easily develop their own algorithms for various modules of the problem. |
Erdem Bıyık, Aditi Talati, Dorsa Sadigh |
2108.07259 |
link |
Gait-Learning With Morphologically Evolving Robots Generated By L-System |
When controllers (brains) and morphologies (bodies) of robots simultaneously evolve, this can lead to a problem, namely the brain & body mismatch problem. In this research, we propose a solution of lifetime learning. We set up a system where modular robots can create offspring that inherit the bodies of parents by recombination and mutation. With regard to the brains of the offspring, we use two methods to create them. The first one entails solely evolution, which means the brain of a robot child is inherited from its parents. The second approach is evolution plus learning, which means the brain of a child is inherited as well, but additionally is developed by a learning algorithm, RevDEknn. We compare these two methods by running experiments in a simulator called Revolve and use efficiency, efficacy, and the morphology intelligence of the robots for the comparison. The experiments show that the evolution plus learning method not only leads to a higher fitness level, but also to more morphologically evolving robots. This constitutes a quantitative demonstration that changes in the brain can induce changes in the body, leading to the concept of morphological intelligence, which is quantified by the learning delta, meaning the ability of a morphology to facilitate learning. |
Jie Luo, Daan Zeeuwe, Agoston E. Eiben |
2107.08249 |
link |
Align Before Fuse: Vision And Language Representation Learning With Momentum Distillation |
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to align the image and text representations before fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images. In order to improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed. Code and pre-trained models are available at https://github.com/salesforce/albef/. |
Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, Steven Hoi |
2107.07651 |
link |
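The contrastive alignment step mentioned in the ALBEF abstract above can be illustrated with a generic symmetric image-text contrastive loss. This is only a minimal sketch in plain Python: the function names, the temperature value of 0.07, and the toy embeddings are assumptions made for illustration, and it omits the momentum distillation and other training details the abstract refers to.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Pair k of image_embs and text_embs is treated as the matching pair;
    every other pairing in the batch acts as a negative.
    The temperature default is an illustrative assumption.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    image_to_text = sum(_cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        _cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: two image embeddings and their texts, aligned and then swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(images, aligned_texts) < contrastive_loss(images, swapped_texts))
```

The loss is small when each image is most similar to its own text and large when the pairs are mismatched, which is the sense in which such a loss aligns the two representations before they are fused.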
LLC: Accurate, Multi-Purpose Learnt Low-Dimensional Binary Codes |
Learning binary representations of instances and classes is a classical problem with several high-potential applications. In modern settings, the compression of high-dimensional neural representations to low-dimensional binary codes is a challenging task and often requires large bit-codes to be accurate. In this work, we propose a novel method for learning low-dimensional binary codes (LLC) for instances as well as classes. Our method does not require any side-information, like annotated attributes or label meta-data, and learns extremely low-dimensional binary codes (~20 bits for ImageNet-1K). The learnt codes are super-efficient while still ensuring nearly optimal classification accuracy for ResNet50 on ImageNet-1K. We demonstrate that the learnt codes capture intrinsically important features in the data, by discovering an intuitive taxonomy over classes. We further quantitatively measure the quality of our codes by applying them to efficient image retrieval as well as out-of-distribution (OOD) detection problems. For the ImageNet-100 retrieval problem, our learnt binary codes outperform 16-bit HashNet using only 10 bits and are also as accurate as 10-dimensional real representations. Finally, our learnt binary codes can perform OOD detection, out-of-the-box, as accurately as a baseline that needs ~3000 samples to tune its threshold, while we require none. Code is open-sourced at https://github.com/raivnlab/llc. |
Aditya Kusupati, Matthew Wallingford, Vivek Ramanujan, Raghav Somani, Jae Sung Park, Krishna Pillutla, Prateek Jain, Sham Kakade, Ali Farhadi |
2106.01487 |
link |
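To make the retrieval use of short binary codes in the LLC abstract above concrete, the following is a minimal sketch of nearest-neighbour lookup by Hamming distance. It does not show how LLC learns its codes; the function names and the 4-bit toy database are hypothetical and chosen only for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two codes stored as Python ints."""
    return bin(a ^ b).count("1")


def retrieve(query, database, k=3):
    """Indices of the k database codes closest to the query.

    Ties in Hamming distance are broken by the lower index.
    """
    ranked = sorted(
        range(len(database)),
        key=lambda i: (hamming(query, database[i]), i),
    )
    return ranked[:k]


# Hypothetical 4-bit codes; real codes would come from a trained model.
database = [0b0000, 0b1111, 0b0001, 0b0111]
print(retrieve(0b0011, database, k=2))
```

Because each comparison is an XOR followed by a bit count, shorter codes directly reduce both storage and comparison cost, which is why code length matters in the retrieval comparison the abstract reports.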
AgentFormer: Agent-Aware Transformers For Socio-Temporal Multi-Agent Forecasting |
Predicting accurate future trajectories of multiple agents is essential for autonomous systems, but is challenging due to the complex agent interaction and the uncertainty in each agent's future behavior. Forecasting multi-agent trajectories requires modeling two key dimensions: (1) time dimension, where we model the influence of past agent states over future states; (2) social dimension, where we model how the state of each agent affects others. Most prior methods model these two dimensions separately, e.g., first using a temporal model to summarize features over time for each agent independently and then modeling the interaction of the summarized features with a social model. This approach is suboptimal since independent feature encoding over either the time or social dimension can result in a loss of information. Instead, we would prefer a method that allows an agent's state at one time to directly affect another agent's state at a future time. To this end, we propose a new transformer, AgentFormer, that jointly models the time and social dimensions. The model leverages a sequence representation of multi-agent trajectories by flattening trajectory features across time and agents. Since standard attention operations disregard the agent identity of each element in the sequence, AgentFormer uses a novel agent-aware attention mechanism that preserves agent identities by attending to elements of the same agent differently than elements of other agents. Based on AgentFormer, we propose a stochastic multi-agent trajectory prediction model that can attend to features of any agent at any previous timestep when inferring an agent's future position. The latent intent of all agents is also jointly modeled, allowing the stochasticity in one agent's behavior to affect other agents. Our method substantially improves the state of the art on well-established pedestrian and autonomous driving datasets. |
Ye Yuan, Xinshuo Weng, Yanglan Ou, Kris Kitani |
2103.14023 |
link |
What's In My LiDAR Odometry Toolbox? |
With the democratization of 3D LiDAR sensors, precise LiDAR odometries and SLAM are in high demand. New methods regularly appear, proposing solutions ranging from small variations in classical algorithms to radically new paradigms based on deep learning. Yet it is often difficult to compare these methods, notably due to the few datasets on which the methods can be evaluated and compared. Furthermore, their weaknesses are rarely examined, often letting the user discover the hard way whether a method would be appropriate for a use case. In this paper, we review and organize the main 3D LiDAR odometries into distinct categories. We implemented several approaches (geometric based, deep learning based, and hybrid methods) to conduct an in-depth analysis of their strengths and weaknesses on multiple datasets, guiding the reader through the different LiDAR odometries available. Implementation of the methods has been made publicly available at https://github.com/kitware/pylidar-slam. |
Pierre Dellenbach, Jean-Emmanuel Deschaud, Bastien Jacquet, François Goulette |
2103.09708 |
link |
Efficient Two-Stream Network For Violence Detection Using Separable Convolutional LSTM |
Automatically detecting violence from surveillance footage is a subset of activity recognition that deserves special attention because of its wide applicability in unmanned security monitoring systems, internet video filtration, etc. In this work, we propose an efficient two-stream deep learning architecture leveraging separable convolutional LSTM (SepConvLSTM) and pre-trained MobileNet, where one stream takes in background-suppressed frames as inputs and the other stream processes the difference of adjacent frames. We employed simple and fast input pre-processing techniques that highlight the moving objects in the frames by suppressing non-moving backgrounds and capture the motion in-between frames. As violent actions are mostly characterized by body movements, these inputs help produce discriminative features. SepConvLSTM is constructed by replacing the convolution operation at each gate of ConvLSTM with a depthwise separable convolution, which enables producing robust long-range spatio-temporal features while using substantially fewer parameters. We experimented with three fusion methods to combine the output feature maps of the two streams. Evaluation of the proposed methods was done on three standard public datasets. Our model improves on the state-of-the-art accuracy on the larger and more challenging RWF-2000 dataset by a margin of more than 2% while matching state-of-the-art results on the smaller datasets. Our experiments lead us to conclude that the proposed models are superior in terms of both computational efficiency and detection accuracy. |
Zahidul Islam, Mohammad Rukonuzzaman, Raiyan Ahmed, Md. Hasanul Kabir, Moshiur Farazi |
2102.10590 |
link |
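The parameter saving that the SepConvLSTM abstract above attributes to depthwise separable convolutions can be checked with simple arithmetic. The sketch below counts convolution weights for a ConvLSTM cell under stated assumptions: four gates, each convolving the concatenated input and hidden state, biases ignored, and illustrative channel and kernel sizes. The paper's actual layer configuration may differ.

```python
def conv_weights(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def separable_conv_weights(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1
    pointwise convolution (bias omitted)."""
    return c_in * k * k + c_in * c_out


def convlstm_weights(c_in, c_hidden, k, separable=False):
    """Convolution weights in one ConvLSTM cell.

    Assumes four gates, each applying one convolution to the
    concatenation of the input and the previous hidden state.
    """
    per_gate = separable_conv_weights if separable else conv_weights
    return 4 * per_gate(c_in + c_hidden, c_hidden, k)


# Illustrative sizes only: 64 input channels, 64 hidden channels, 3 x 3 kernels.
standard = convlstm_weights(64, 64, 3)
separable = convlstm_weights(64, 64, 3, separable=True)
print(standard, separable, round(standard / separable, 2))
```

Under these assumed sizes the separable cell needs close to eight times fewer convolution weights, and the gap widens as the kernel size grows, since the k x k term no longer multiplies the output channel count.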
Full-Glow: Fully Conditional Glow For More Realistic Image Generation |
Autonomous agents, such as driverless cars, require large amounts of labeled visual data for their training. A viable approach for acquiring such data is training a generative model with collected real data, and then augmenting the collected real dataset with synthetic images from the model, generated with control of the scene layout and ground truth labeling. In this paper we propose Full-Glow, a fully conditional Glow-based architecture for generating plausible and realistic images of novel street scenes given a semantic segmentation map indicating the scene layout. Benchmark comparisons show our model to outperform recent works in terms of the semantic segmentation performance of a pretrained PSPNet. This indicates that images from our model are, to a higher degree than from other models, similar to real images of the same kinds of scenes and objects, making them suitable as training data for a visual semantic segmentation or object recognition system. |
Moein Sorkhei, Gustav Eje Henter, Hedvig Kjellström |
2012.05846 |
link |
Multi-Agent Reinforcement Learning For Visibility-Based Persistent Monitoring |
The visibility-based persistent monitoring (VPM) problem seeks to find a set of trajectories (or controllers) for robots to persistently monitor a changing environment. Each robot has a sensor, such as a camera, with a limited field-of-view that is obstructed by obstacles in the environment. The robots may need to coordinate with each other to ensure no point in the environment is left unmonitored for long periods of time. We model the problem such that there is a penalty that accrues every time step if a point is left unmonitored. However, the dynamics of the penalty are unknown to us. We present a multi-agent reinforcement learning (MARL) algorithm for the VPM problem. Specifically, we present a multi-agent graph attention proximal policy optimization (MA-G-PPO) algorithm that takes as input the local observations of all agents combined with a low resolution global map to learn a policy for each agent. The graph attention allows agents to share their information with others, leading to an effective joint policy. Our main focus is to understand how effective MARL is for the VPM problem. We investigate five research questions with this broader goal. We find that MA-G-PPO is able to learn a better policy than the non-RL baseline in most cases, that the effectiveness depends on agents sharing information with each other, and that the policy learnt shows emergent behavior for the agents. |
Jingxi Chen, Amrish Baskaran, Zhongshun Zhang, Pratap Tokekar |
2011.01129 |
link |
Light Field Salient Object Detection: A Review And Benchmark |
Salient object detection (SOD) is a long-standing research topic in computer vision and has drawn an increasing amount of research interest in the past decade. This paper provides the first comprehensive review and benchmark for light field SOD, which has long been lacking in the saliency community. Firstly, we introduce preliminary knowledge on light fields, including theory and data forms, and then review existing studies on light field SOD, covering ten traditional models, seven deep learning-based models, one comparative study, and one brief review. Existing datasets for light field SOD are also summarized with detailed information and statistical analyses. Secondly, we benchmark nine representative light field SOD models together with several cutting-edge RGB-D SOD models on four widely used light field datasets, from which insightful discussions and analyses, including a comparison between light field SOD and RGB-D SOD models, are drawn. Besides, due to the inconsistency of datasets in their current forms, we further generate complete data and supplement focal stacks, depth maps and multi-view images for the inconsistent datasets, making them consistent and unified. Our supplemental data makes a universal benchmark possible. Lastly, because light field SOD is a rather special problem, owing to its diverse data representations and high dependency on acquisition hardware, and differs greatly from other saliency detection tasks, we provide nine hints on the challenges and future directions, and outline several open issues. We hope our review and benchmarking can help advance research in this field. All the materials, including collected models, datasets, benchmarking results, and supplemented light field datasets, will be publicly available on our project site https://github.com/kerenfu/lfsod-survey. |
Keren Fu, Yao Jiang, Ge-Peng Ji, Tao Zhou, Qijun Zhao, Deng-Ping Fan |
2010.04968 |
link |
Artificial Fingerprinting For Generative Models: Rooting Deepfake Attribution In Training Data |
Photorealistic image generation has reached a new level of quality due to the breakthroughs of generative adversarial networks (GANs). Yet, the dark side of such deepfakes, the malicious use of generated media, raises concerns about visual misinformation. While existing research work on deepfake detection demonstrates high accuracy, it is subject to advances in generation techniques and adversarial iterations on detection countermeasure techniques. Thus, we seek a proactive and sustainable solution for deepfake detection that is agnostic to the evolution of generative models, by introducing artificial fingerprints into the models. Our approach is simple and effective. We first embed artificial fingerprints into training data, then validate a surprising discovery on the transferability of such fingerprints from training data to generative models, which in turn appear in the generated deepfakes. Experiments show that our fingerprinting solution (1) holds for a variety of cutting-edge generative models, (2) leads to a negligible side effect on generation quality, (3) stays robust against image-level and model-level perturbations, (4) remains hard for adversaries to detect, and (5) converts deepfake detection and attribution into trivial tasks and outperforms the recent state-of-the-art baselines. Our solution closes the responsibility loop between publishing pre-trained generative model inventions and their possible misuses, which makes it independent of the current arms race. |
Ning Yu, Vladislav Skripniuk, Sahar Abdelnabi, Mario Fritz |
2007.08457 |
link |
Universal Graph Transformer Self-Attention Networks |
The transformer has been extensively used in research domains such as computer vision, image processing, and natural language processing. The transformer, however, has not been actively used in graph neural networks. To this end, we introduce a transformer-based advanced GNN model, named UGformer, to learn graph representations. In particular, given an input graph, we present two UGformer variants. The first variant is to leverage the transformer on a set of sampled neighbors for each node, while the second is to leverage the transformer directly on the input graph. Experimental results demonstrate that these two UGformer variants achieve state-of-the-art accuracies on well-known benchmark datasets for graph classification and inductive text classification, respectively. The code is available on GitHub: https://github.com/daiquocnguyen/graph-transformer. |
Dai Quoc Nguyen, Tu Dinh Nguyen, Dinh Phung |
1909.11855 |
link |