Object Detection Using Deep Learning, CNNs and Vision Transformers: A Review

Object detection remains one of the most fundamental and challenging problems in computer vision and image understanding. Significant advances in object detection have been achieved through improved object representation and the use of deep neural network models. This paper examines how object detection has evolved in the deep learning era over recent years. We present a literature review of state-of-the-art object detection algorithms and the concepts underlying them. We classify these methods into three main groups: anchor-based, anchor-free, and transformer-based detectors, which differ in how they identify objects in an image. We discuss the insights behind these algorithms and present experimental analyses comparing quality metrics, speed/accuracy tradeoffs, and training methodologies. The survey compares the major convolutional neural networks for object detection, covers the strengths and limitations of each detector model, and draws conclusions from the comparison. We provide simple graphical illustrations summarising the development of object detection methods under deep learning. Finally, we identify directions for future research.
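As a rough illustration of how an anchor-based detector relates predefined boxes to objects, the sketch below computes intersection over union (IoU) and keeps the anchors that overlap a ground-truth box. The function names, the (x1, y1, x2, y2) box layout, and the 0.5 threshold are assumptions chosen for this example, not details taken from the survey.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_box, threshold=0.5):
    """Return indices of anchors whose IoU with gt_box meets the threshold.

    The threshold value is an illustrative assumption.
    """
    return [i for i, a in enumerate(anchors) if iou(a, gt_box) >= threshold]


anchors = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
positives = match_anchors(anchors, gt_box=(0, 0, 10, 10))
```

Anchor-free detectors skip this matching step and predict object locations directly, for example from keypoints or centre points.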

View this article on IEEE Xplore


MMNeRF: Multi-Modal and Multi-View Optimized Cross-Scene Neural Radiance Fields

We present MMNeRF, a simple yet powerful learning framework for highly photo-realistic novel view synthesis that learns Multi-modal and Multi-view features to guide neural radiance fields toward a model that generalizes across scenes. Novel view synthesis has improved greatly with the success of NeRF-series methods. However, making these methods generic across scenes remains a challenging task. A promising approach is to introduce 2D image features as prior knowledge for adaptive modeling, yet RGB features lack geometry and 3D spatial information, which causes shape-radiance ambiguity and leads to blurry, low-resolution synthesized images. We propose a multi-modal, multi-view method to address these shortcomings of existing methods. Specifically, we introduce depth features alongside RGB features into the model and fuse these multi-modal features with modality-based attention. Furthermore, our framework adopts a transformer encoder to fuse multi-view features and uses a transformer decoder to adaptively incorporate the target view with global memory. Extensive experiments are carried out on both category-specific and category-agnostic benchmarks, and the results demonstrate that MMNeRF achieves state-of-the-art neural rendering performance.
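The idea of weighting RGB and depth features with attention can be sketched as follows. This is a minimal illustration under stated assumptions, not the MMNeRF implementation: the `query` vector stands in for a learned parameter, the scaled dot-product scoring is one common choice, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, query):
    """Fuse per-modality feature vectors with attention weights.

    features: dict mapping a modality name to a feature vector (equal lengths).
    query: hypothetical learned vector used to score each modality.
    Returns the fused vector and the weight given to each modality.
    """
    names = list(features)
    scale = math.sqrt(len(query))
    # One scaled dot-product score per modality.
    scores = [
        sum(f * q for f, q in zip(features[n], query)) / scale for n in names
    ]
    weights = softmax(scores)
    # Weighted sum of the modality features, dimension by dimension.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


fused, weights = fuse_modalities(
    {"rgb": [1.0, 0.0], "depth": [0.0, 1.0]}, query=[1.0, 0.0]
)
```

With this query the RGB feature scores higher than the depth feature, so it receives the larger weight; the weights always sum to one.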

View this article on IEEE Xplore


Chat2VIS: Generating Data Visualizations via Natural Language Using ChatGPT, Codex and GPT-3 Large Language Models

The field of data visualisation has long aimed to devise solutions for generating visualisations directly from natural language text. Research in Natural Language Interfaces (NLIs) has contributed towards the development of such techniques. However, implementing workable NLIs has always been challenging, both because natural language is inherently ambiguous and because unclear or poorly written user queries make it hard for existing language models to discern user intent. Instead of pursuing the usual path of developing new iterations of language models, this study proposes leveraging advances in pre-trained large language models (LLMs) such as ChatGPT and GPT-3 to convert free-form natural language directly into code for appropriate visualisations. This paper presents a novel system, Chat2VIS, which takes advantage of the capabilities of LLMs and demonstrates how, with effective prompt engineering, the complex problem of language understanding can be solved more efficiently, resulting in simpler and more accurate end-to-end solutions than prior approaches. Chat2VIS shows that LLMs together with the proposed prompts offer a reliable approach to rendering visualisations from natural language queries, even when queries are highly misspecified and underspecified. This solution also significantly reduces the cost of developing NLI systems, while attaining greater visualisation inference abilities than traditional NLP approaches that use hand-crafted grammar rules and tailored models. This study also shows how LLM prompts can be constructed in a way that preserves data security and privacy while remaining generalisable to different datasets. This work compares the performance of GPT-3, Codex and ChatGPT across several case studies and contrasts the results with prior studies.
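One way a prompt can preserve privacy is to describe only the dataset's schema and never include its rows. The sketch below builds such a prompt from column names and types. The function name, prompt wording, and code-starter lines are assumptions made for illustration, not the prompt used by Chat2VIS, and sending the prompt to a model is left out.

```python
def build_prompt(df_name, columns, query):
    """Build a code-generation prompt from a table schema and a user query.

    df_name: variable name the generated code should use for the data.
    columns: dict mapping column name to a type label, e.g. {"gdp": "float64"}.
    query: the user's free-form visualisation request.
    Only the schema is included; no data values are placed in the prompt.
    """
    schema_lines = [
        f"# - '{name}' (type {dtype})" for name, dtype in columns.items()
    ]
    return "\n".join(
        [
            f"# A pandas DataFrame named {df_name} has these columns:",
            *schema_lines,
            f"# Task: write Python matplotlib code to plot: {query}",
            # Code starter that nudges the model to continue with plotting code.
            "import pandas as pd",
            "import matplotlib.pyplot as plt",
        ]
    )


prompt = build_prompt(
    "df",
    {"country": "str", "gdp": "float64"},
    "GDP by country as a bar chart",
)
```

Because the prompt depends only on column names and types, the same function can be reused with a different dataset by passing a different schema.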

View this article on IEEE Xplore