A Hierarchical Graph-Based Approach for Recognition and Description Generation of Bimanual Actions in Videos

Nuanced understanding and detailed description generation for (bimanual) manipulation actions in videos are important for disciplines such as robotics, human-computer interaction, and video content analysis. This study presents a novel method that integrates graph-based modeling with a layered hierarchical attention mechanism, yielding more precise and more comprehensive video descriptions. To achieve this, we first encode the spatio-temporal interdependencies between objects and actions with scene graphs, and then combine this representation with a novel three-level architecture that implements hierarchical attention using Graph Attention Networks (GATs). The three-level GAT architecture captures local as well as global contextual elements. In this way, several descriptions of different semantic complexity can be generated in parallel for the same video clip, enhancing the discriminative accuracy of action recognition and action description. The performance of our approach is empirically evaluated on several 2D and 3D datasets. Compared to the state of the art, our method consistently achieves better accuracy, precision, and contextual relevance for both action recognition and description generation. In a large set of ablation experiments we also assess the role of the different components of our model. With its multi-level design, the system produces descriptions of different semantic depths, much as different people describe the same action at different levels of detail. Furthermore, the deeper insight into bimanual hand-object interactions achieved by our model may foster advances in robotics, enabling robots to emulate intricate human actions with greater precision.
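
The three-level attention stack described in the abstract can be illustrated with a short sketch. The following PyTorch code is a minimal illustration under our own assumptions: the class names (GraphAttentionLayer, HierarchicalGAT), the shared adjacency matrix across levels, and the mean-pooled readout are illustrative choices, not the authors' implementation.

    # Minimal sketch of a 3-level hierarchical graph-attention stack over a
    # scene graph. All names and design details here are assumptions for
    # illustration only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphAttentionLayer(nn.Module):
        """Single-head graph attention (Velickovic et al., 2018), simplified."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.W = nn.Linear(in_dim, out_dim, bias=False)
            self.a = nn.Linear(2 * out_dim, 1, bias=False)

        def forward(self, x, adj):
            # x: (N, in_dim) node features; adj: (N, N) binary adjacency,
            # assumed to include self-loops so every row has a neighbour.
            h = self.W(x)                                   # (N, out_dim)
            N = h.size(0)
            # Pairwise attention logits e_ij = a([h_i || h_j])
            e = self.a(torch.cat([h.unsqueeze(1).expand(N, N, -1),
                                  h.unsqueeze(0).expand(N, N, -1)],
                                 dim=-1)).squeeze(-1)
            e = F.leaky_relu(e, 0.2)
            e = e.masked_fill(adj == 0, float('-inf'))      # neighbours only
            alpha = torch.softmax(e, dim=-1)
            return F.elu(alpha @ h)                         # (N, out_dim)

    class HierarchicalGAT(nn.Module):
        """Three stacked attention levels: local, mid-level, global context."""
        def __init__(self, dim):
            super().__init__()
            self.local_att = GraphAttentionLayer(dim, dim)   # single objects
            self.mid_att = GraphAttentionLayer(dim, dim)     # hand-object pairs
            self.global_att = GraphAttentionLayer(dim, dim)  # whole-scene context

        def forward(self, x, adj):
            x = self.local_att(x, adj)
            x = self.mid_att(x, adj)
            x = self.global_att(x, adj)
            # In the paper's spirit, a readout per level could feed decoders
            # generating descriptions of increasing semantic depth; here we
            # simply return one pooled scene embedding.
            return x.mean(dim=0)

In such a design, each level's output could be read out separately, which is one way descriptions of different semantic complexity might be produced in parallel for the same clip.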

View this article on IEEE Xplore


Clinical Micro-CT Empowered by Interior Tomography, Robotic Scanning, and Deep Learning

While micro-CT systems are instrumental in preclinical research, clinical micro-CT imaging has long been desired, with cochlear implantation as a primary application. The structural details of the cochlear implant and the temporal bone require a significantly higher image resolution than the roughly 0.2 mm provided by current medical CT scanners. In this paper, we propose a clinical micro-CT (CMCT) system design that integrates conventional spiral cone-beam CT, contemporary interior tomography, deep learning techniques, and the technologies of a micro-focus X-ray source, a photon-counting detector (PCD), and robotic arms for ultrahigh-resolution localized tomography of a freely selected volume of interest (VOI) at a minimized radiation dose. The whole system consists of a standard CT scanner for a clinical CT exam and VOI specification, and a robotic micro-CT scanner for a local scan at high spatial and spectral resolution and minimized radiation dose. Prior information from the global scan is fully utilized for background compensation of the local scan data, enabling accurate and stable VOI reconstruction. Our results and analysis show that the proposed hybrid reconstruction algorithm delivers accurate high-resolution local reconstruction and is insensitive to misalignment of the isocenter position, the initial view angle, and scale mismatch in the data/image registration. These findings demonstrate the feasibility of our system design. We envision that deep learning techniques can be further leveraged to optimize imaging performance. Combining high-resolution imaging, high dose efficiency, and low system cost, the proposed CMCT system holds great promise for temporal bone imaging as well as various other clinical applications.
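
The background-compensation step can be sketched in a few lines. The following Python code is a minimal sketch under our own assumptions: forward_project and reconstruct are placeholders for any projector/reconstruction pair (e.g., from a toolbox such as ASTRA or TIGRE), and the function name compensate_and_reconstruct is hypothetical, not the authors' algorithm.

    # Minimal sketch of background compensation for interior tomography:
    # use a prior global reconstruction to estimate, and remove, the
    # contribution of tissue outside the VOI from the truncated local scan.
    import numpy as np

    def compensate_and_reconstruct(local_sino, global_image, voi_mask,
                                   forward_project, reconstruct):
        """Reconstruct a VOI from truncated local projections.

        local_sino   : measured interior (truncated) sinogram of the VOI scan
        global_image : prior low-resolution image from the global scan
        voi_mask     : boolean array, True inside the volume of interest
        """
        # 1. Estimate the contribution of tissue outside the VOI by
        #    forward-projecting the prior image with the VOI zeroed out.
        background = global_image * (~voi_mask)
        background_sino = forward_project(background)

        # 2. Subtract that estimate so the residual sinogram is
        #    (approximately) consistent with the VOI alone.
        residual_sino = local_sino - background_sino

        # 3. Reconstruct the residual and keep only the interior region.
        voi_image = reconstruct(residual_sino) * voi_mask
        return voi_image

This is only the compensation idea in isolation; the paper's hybrid algorithm additionally handles registration between the global and local coordinate frames, which is where the reported robustness to isocenter, view-angle, and scale mismatch matters.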

*The video published with this article received a promotional prize for the 2020 IEEE Access Best Multimedia Award (Part 2).

View this article on IEEE Xplore


Robots and Wizards: An Investigation Into Natural Human–Robot Interaction

The goal of this study was to investigate the communication modalities needed for intuitive Human-Robot Interaction. The study uses a Wizard of Oz prototyping method to enable restriction-free, intuitive interaction with an industrial robot. Data from 36 test subjects suggest a strong preference for speech input, automatic path planning, and pointing gestures. The gesture catalogue developed during the experiment indicates that the two most popular gestures per action can be sufficient to cover the majority of users. The system scored an average of 74% across several user-experience questionnaires, even though it contained deliberately forced flaws. These findings support the future development of an intuitive Human-Robot Interaction system with high user acceptance.

*The video published with this article received a promotional prize for the 2020 IEEE Access Best Multimedia Award (Part 2).

View this article on IEEE Xplore