SIG-CV latest activities
Three talks are given by Phong Nguyen-Ha, Zhaodong Sun and Martin Trapp.
Talk 1 -- Phong Nguyen-Ha: "Free-Viewpoint RGB-D Human Performance Capture and Rendering."
Abstract: Novel view synthesis for humans in motion is a challenging computer vision problem that enables applications such as free-viewpoint video. Existing methods typically use complex setups with multiple input views, 3D supervision or pre-trained models that do not generalize well to new identities. Aiming to address these limitations, we present a novel view synthesis framework to generate realistic renders from unseen views of any human captured from a single-view sensor with sparse RGB-D, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture to learn dense features in novel views obtained by sphere-based neural rendering, and create complete renders using a global context inpainting model. Additionally, an enhancer network improves the overall fidelity, even in occluded areas from the original view, producing crisp renders with fine details. We show our method generates high-quality novel views of synthetic and real human actors given a single sparse RGB-D input. It generalizes to unseen identities and new poses, and faithfully reconstructs facial expressions. Our approach outperforms prior human view synthesis methods and is robust to different levels of input sparsity.
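To make the pipeline concrete, below is a minimal PyTorch-style sketch of its first stage, under stated assumptions: a nearest-point z-buffer splat stands in for the paper's differentiable sphere-based rendering, and all names are hypothetical. The resulting sparse feature image would then be completed by the global context inpainting model and sharpened by the enhancer network.

```python
import torch

def splat_features(points, feats, K, R, t, H, W):
    """Crude stand-in for sphere-based neural rendering: project 3-D points
    (N, 3) with per-point features (N, C) into a novel view and keep the
    nearest point per pixel via a z-buffer overwrite."""
    cam = points @ R.T + t                      # (N, 3) in novel camera frame
    valid = cam[:, 2] > 0                       # keep points in front of camera
    cam, feats = cam[valid], feats[valid]
    uv = cam @ K.T                              # perspective projection
    u = (uv[:, 0] / uv[:, 2]).long().clamp(0, W - 1)
    v = (uv[:, 1] / uv[:, 2]).long().clamp(0, H - 1)
    out = torch.zeros(feats.shape[1], H, W)
    for i in torch.argsort(cam[:, 2], descending=True):  # far-to-near: near wins
        out[:, v[i], u[i]] = feats[i]
    return out   # sparse feature image, to be completed by the inpainting model
```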
Phong Nguyen-Ha received the B.Sc. degree in mechanical engineering from the Hanoi University of Science and Technology (HUST), Vietnam, and the M.Sc. degree in computer science engineering from Dongguk University, Seoul, South Korea. Since then, he has been working as a doctoral candidate at CMVS, fully funded by the Vision-based 3D perception for mixed reality applications grant from the Infotech Institute. His research interests include 3D computer vision, computer graphics and deep learning.
Talk 2 -- Zhaodong Sun: "Contrast-Phys: Unsupervised Video-based Remote Physiological Measurement via Spatiotemporal Contrast."
Abstract: Video-based remote physiological measurement utilizes face videos to measure the blood volume change signal, which is also called remote photoplethysmography (rPPG). Supervised methods for rPPG measurement achieve state-of-the-art performance. However, supervised rPPG methods require face videos and ground truth physiological signals for model training. In this paper, we propose an unsupervised rPPG measurement method that does not require ground truth signals for training. We use a 3DCNN model to generate multiple rPPG signals from each video at different spatiotemporal locations and train the model with a contrastive loss based on prior knowledge about rPPG. The results show that our method outperforms the previous unsupervised baseline and achieves accuracies very close to the current best supervised rPPG methods. Furthermore, we also demonstrate that our approach runs at a much faster speed and is more robust to noise than the previous unsupervised baseline.
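As a rough illustration of the contrastive idea (not the paper's exact implementation: pair sampling, frequency-band masking and the 3DCNN backbone are omitted, and all names are hypothetical), one can pull together the power spectral densities of rPPG signals drawn from the same video and push apart those from different videos:

```python
import torch
import torch.nn.functional as F

def psd(x):
    """Normalized power spectral density of a batch of 1-D signals (N, T)."""
    p = torch.fft.rfft(x, dim=-1).abs() ** 2
    return p / p.sum(dim=-1, keepdim=True)

def spatiotemporal_contrast_loss(sig_a, sig_b):
    """sig_a, sig_b: (N, T) rPPG signals drawn from different spatiotemporal
    blocks of video A and video B. Signals from the same video should share
    a PSD (positive pairs); signals across videos should not (negatives)."""
    pa, pb = psd(sig_a), psd(sig_b)
    pos = F.mse_loss(pa, pa.roll(1, dims=0)) + F.mse_loss(pb, pb.roll(1, dims=0))
    neg = F.mse_loss(pa, pb)
    return pos - neg   # minimize: similar within a video, dissimilar across
```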
Zhaodong Sun is currently a second-year doctoral student under the supervision of Asst. Prof. Xiaobai Li and Prof. Olli Silvén at the Center for Machine Vision and Signal Analysis, University of Oulu. His research interests include remote photoplethysmography, computer vision, affective computing, and biomedical signal processing. His doctoral research topic is remote physiological signal measurement from facial videos, which covers two aspects: remote physiological measurement algorithm development, and remote physiological signal applications including healthcare and security.
Talk 3 -- Martin Trapp: "Uncertainty-guided source-free domain adaptation."
Abstract: Source-free domain adaptation (SFDA) aims to adapt a classifier to an unlabelled target data set by using only a pre-trained source model. However, the absence of the source data and the domain shift make the predictions on the target data unreliable. We propose quantifying the uncertainty in the source model predictions and utilizing it to guide the target adaptation. For this, we construct a probabilistic source model by incorporating priors on the network parameters, inducing a distribution over the model predictions. Uncertainties are estimated by employing a Laplace approximation and incorporated to identify target data points that do not lie in the source manifold and to down-weight them when maximizing the mutual information on the target data. Unlike recent works, our probabilistic treatment is computationally lightweight, decouples source training and target adaptation, and requires no specialized source training or changes of the model architecture. We show the advantages of uncertainty-guided SFDA over traditional SFDA in closed-set and open-set settings, and provide empirical evidence that our approach is more robust to strong domain shifts even without tuning.
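One plausible reading of the objective, sketched below under stated assumptions: per-sample weights w_i are derived from the Laplace-approximated predictive uncertainty (e.g. w_i = exp(-H_i), where H_i is the predictive entropy under the posterior) and down-weight off-manifold target points in an information-maximization loss. The names and exact weighting scheme are hypothetical, not the paper's code.

```python
import torch

def uncertainty_weighted_im_loss(probs, weights):
    """Mutual-information objective on target predictions with per-sample
    weights from source-model uncertainty (high uncertainty -> low weight).
    probs: (N, C) softmax outputs averaged over Laplace posterior samples;
    weights: (N,) non-negative."""
    ent = -(probs * probs.clamp_min(1e-8).log()).sum(1)       # per-sample entropy
    mean = (weights[:, None] * probs).sum(0) / weights.sum()  # weighted marginal
    marg_ent = -(mean * mean.clamp_min(1e-8).log()).sum()
    # Minimizing this sharpens confident predictions (low conditional entropy)
    # while keeping the weighted class marginal diverse (high marginal entropy).
    return (weights * ent).sum() / weights.sum() - marg_ent
```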
Dr. Martin Trapp is an Academy of Finland postdoctoral fellow at Aalto University, primarily working on methods development in probabilistic machine learning with a focus on flexible Bayesian models (e.g., Gaussian processes, Bayesian neural networks, Pólya trees) and tractable modelling families (e.g., probabilistic circuits). Before this, he worked with Arno Solin as a postdoc at Aalto University and finished his PhD in machine learning at Graz University of Technology, Austria, under the supervision of Franz Pernkopf and Robert Peharz in 2020. In addition to his research, Martin is a core developer of the probabilistic programming language Turing.jl and co-organiser of the ELLIS seminar on Advances in Probabilistic Machine Learning at Aalto University.
Three talks are given by Heikki Huttunen, Yuxin Hou and Yang Liu.
Talk 1 -- Heikki Huttunen: "Computer vision for ports and terminals."
Abstract: International ports compete to best serve their customers in terms of efficiency and safety, and automation is a key tool in improving cargo handling processes. Artificial intelligence both improves safety, by moving potentially dangerous cargo handling tasks from humans to machines, and facilitates new ways of data collection for a new level of situational awareness. In this talk, we will discuss the automation needs of modern mid-size to large cargo terminals, and present a few case examples of how computer vision is used to address these needs. The cases utilize a wide spectrum of deep learning techniques, ranging from basic tasks (detection, semantic segmentation and classification) to more specialized tasks such as optical character recognition and face recognition. Finally, the business significance of artificial intelligence, as well as the disruptions it can bring to the industry, is discussed.
Dr. Heikki Huttunen is currently the chief technical officer at Visy Oy, where he leads the company's AI development. The products are based on an in-house AI engine and are used worldwide for various detection, classification and optical character recognition (OCR) tasks in international ports and terminals. He received his doctoral degree from Tampere University of Technology in 1999 and served in various positions in academia, most recently as an associate professor of signal processing and machine learning at Tampere University until 2020. During his academic career he authored over 100 scientific articles. His research interests include deep learning for detection, text spotting, scene text recognition and semantic segmentation.
Talk 2 -- Yuxin Hou: "Implicit Map Augmentation for Relocalization."
Abstract: Learning neural radiance fields (NeRF) has recently revolutionized novel view synthesis and related topics. However, the fact that the implicit scene models learned via NeRF greatly extend the representational capability of sparse maps is largely overlooked. In this talk, we will show how implicit map augmentation uses an implicit scene representation to augment sparse maps and help with visual relocalization. Given a sparse map reconstructed by structure-from-motion, we first train a NeRF model conditioned on the sparse point cloud. An augmented sparse map representation can then be sampled from the NeRF model and plugged directly into a traditional visual localization pipeline. The experiments demonstrate that using augmented maps achieves better relocalization results on challenging views.
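A minimal sketch of the sampling step, assuming a trained NeRF exposed as a callable nerf(xyz) -> (density, rgb) (a hypothetical interface; the paper's conditioning and sampling strategy may differ): candidate points where the model predicts high volume density are kept as extra map points.

```python
import torch

def augment_sparse_map(nerf, bbox_min, bbox_max, n=200_000, thr=10.0):
    """Sample candidate 3-D points inside the scene bounds and keep those
    where the trained NeRF predicts high volume density, yielding additional
    map points for a traditional localization pipeline."""
    xyz = bbox_min + (bbox_max - bbox_min) * torch.rand(n, 3)
    with torch.no_grad():
        density, rgb = nerf(xyz)          # assumed interface: per-point outputs
    keep = density.squeeze(-1) > thr      # density threshold marks occupied space
    return xyz[keep], rgb[keep]           # augmented points with predicted colors
```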
Yuxin Hou is currently a PhD student at Aalto University, supervised by Dr. Juho Kannala and Dr. Arno Solin. Her research interests include deep learning for 3D vision tasks such as depth estimation, novel view synthesis, and implicit 3D representation learning. She joined Meta Reality Labs in 2021 as a Research Intern, working on algorithms for visual relocalization.
Talk 3 -- Yang Liu: "Graph-based Facial Affect Analysis: Trends and Applications."
Abstract: As one of the most important affective signals, facial affect analysis (FAA) is essential for developing human-computer interaction systems. Early methods focus on extracting appearance and geometry features while ignoring the latent semantic information among individual facial changes, leading to limited performance and generalization. Recent work attempts to establish a graph-based representation to model these semantic relationships and develop frameworks to leverage them for various FAA tasks. We provide a comprehensive review of graph-based FAA, including the evolution of algorithms, state-of-the-art frameworks, and open directions. Finally, we also propose a new graph-based method for uncertain facial expression recognition. Based on the latent dependency between emotions and Action Units (AUs), an auxiliary branch using graph convolutional layers is added to extract the semantic information from graph topologies. A re-labeling strategy corrects ambiguous annotations by comparing their feature similarities with semantic templates. Experiments on large-scale datasets reveal a substantial performance improvement.
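Two of the ingredients mentioned above can be sketched compactly (hypothetical names; the paper's own branch design and similarity criterion may differ): a single graph-convolution step over AU nodes, and a re-labeling rule that switches a sample's label when its feature is clearly closer to another class's semantic template.

```python
import torch
import torch.nn.functional as F

def gcn_layer(x, adj, weight):
    """One graph-convolution step: x (N, F_in) node features, adj (N, N)
    row-normalized adjacency, weight (F_in, F_out)."""
    return F.relu(adj @ x @ weight)

def relabel(features, labels, templates, margin=0.2):
    """Re-label ambiguous samples: if a feature's cosine similarity to the
    best class template exceeds its similarity to the annotated class's
    template by more than `margin`, adopt the better-matching label."""
    sim = F.normalize(features, dim=1) @ F.normalize(templates, dim=1).T
    best = sim.argmax(1)
    gap = sim.max(1).values - sim.gather(1, labels[:, None]).squeeze(1)
    return torch.where(gap > margin, best, labels)
```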
Yang Liu is currently a postdoctoral research fellow under the supervision of Academy Prof. Guoying Zhao at the Center for Machine Vision and Signal Analysis, University of Oulu. He received his Ph.D. degree in computer science and technology from the South China University of Technology in 2021. His research interests include facial expression recognition, multi-modal affective computing, and machine learning.
Four talks are given by Matti Pietikäinen, Serkan Kiranyaz, Abol Basher and Jaakko Lehtinen.
Talk 1 -- Matti Pietikäinen: "Machine Vision Group: 40 Years of Computer Vision and Pattern Recognition Research."
Abstract: The Machine Vision Group (MVG) at the University of Oulu celebrates its 40th anniversary in 2021. This talk will first present the objectives and scientific impact of our research, as well as its impact on ecosystem development. Then, some highlights of past research are presented, including approaches for texture-based image and video description with local binary patterns (LBP), geometric camera calibration, and face and facial expression analysis. Various application examples of this research are also shown. Next, the current focus areas of CMVS research are introduced, including learning image, video and 3-D representations, affective computing, geometric 3-D vision, bio-signal analysis, and embedded vision systems. Examples of this research are presented, dealing with energy and sample efficient AI, facial expression and micro-expression analysis, context recognition with 3-D view synthesis, multimodal video-based bio-signal extraction, and "flat" imaging for embedded perceptual user interfaces. Finally, a new project on emotion AI is briefly introduced.
Matti Pietikäinen received his Doctor of Science in Technology degree from the University of Oulu, Finland. He is an emeritus professor at the Center for Machine Vision and Signal Analysis, University of Oulu. From 1980 to 1981 and from 1984 to 1985, he visited the Computer Vision Laboratory at the University of Maryland. He has made fundamental contributions, e.g. to local binary pattern (LBP) methodology, texture-based image and video analysis, and facial image analysis. He has authored over 350 refereed papers in international journals, books and conferences. His papers have 71,663+ citations in Google Scholar (h-index 93, June 16, 2021). He was an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Pattern Recognition, IEEE Transactions on Information Forensics and Security, IEEE Transactions on Biometrics, Behavior, and Identity Science, and Image and Vision Computing journals. Currently he serves as a Guest Editor for a special issue of IEEE TPAMI on Learning with Less Labels in Computer Vision. He was President of the Pattern Recognition Society of Finland from 1989 to 1992 and was named its Honorary Member in 2014. From 1989 to 2007 he served as a member of the Governing Board of the International Association for Pattern Recognition (IAPR), and he became one of the founding fellows of the IAPR in 1994. He is an IEEE Fellow for contributions to texture and facial image analysis for machine vision. In 2014, his research on LBP-based face description was awarded the Koenderink Prize for fundamental contributions in computer vision that have withstood the test of time. He received the prestigious IAPR King-Sun Fu Prize 2018 for fundamental contributions to texture analysis and facial image analysis. In 2018, he was named a Highly Cited Researcher by Clarivate Analytics, for producing multiple highly cited papers in 2006-2016 that rank in the top 1% by citations in his field in Web of Science.
Talk 2 -- Serkan Kiranyaz: "New-Generation Neural Networks and Applications."
Abstract: Multi-Layer Perceptrons (MLPs) and their derivatives, Convolutional Neural Networks (CNNs), have a common drawback: they employ a homogeneous network structure with an identical "linear" neuron model. This naturally makes them only a crude model of biological neurons or mammalian neural systems, which are heterogeneous and composed of highly diverse neuron types with distinct biochemical and electrophysiological properties. With such crude models, conventional homogeneous networks can learn sufficiently well problems with a monotonous, relatively simple, and linearly separable solution space, but they fail whenever the solution space is highly nonlinear and complex. To address this drawback, a heterogeneous and dense network model, Generalized Operational Perceptrons (GOPs), has recently been proposed. GOPs aim to model biological neurons with distinct synaptic connections. GOPs have demonstrated the kind of diversity encountered in biological neural networks, which resulted in an elegant performance level on numerous challenging problems where conventional MLPs entirely failed. Following in GOPs' footsteps, a heterogeneous and nonlinear network model, called the Operational Neural Network (ONN), has recently been proposed as a superset of CNNs. ONNs, like their predecessor GOPs, boost the diversity to learn highly complex and multi-modal functions or spaces with minimal network complexity and training data. However, ONNs also exhibit certain drawbacks, such as strict dependence on the operators in the operator set library, the mandatory search for the best operator set for each layer/neuron, and the need to fix the operator sets of the output layer neuron(s) in advance. Self-organized ONNs (Self-ONNs) with generative neurons can address all these drawbacks without any prior search or training and with elegant computational complexity. However, generative neurons still perform "localized" kernel operations, and hence the kernel size of a neuron at a particular layer solely determines the capacity of the receptive fields and the amount of information gathered from the previous layer. To improve the receptive field size, and even to find the best possible location for each kernel, non-localized kernel operations for Self-ONNs are embedded in a neuron model superior to the generative neurons, hence called "super (generative) neurons". This talk will cover the natural evolution of artificial neuron and network models, starting from the ancient (linear) neuron model of the 1940s and ending with the super neurons and new-generation Self-ONNs. The focus will particularly be drawn to numerous image processing applications, such as image restoration, denoising, and regression, where Self-ONNs, especially with super neurons, have achieved state-of-the-art performance levels with a significant gap.
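For orientation, the generative neuron at the heart of Self-ONNs can be sketched as a convolution whose fixed linear kernel is replaced by a learned truncated power series of the input, so the nodal operator itself is learned rather than chosen from an operator library. A minimal PyTorch sketch (hypothetical class name; inputs are assumed bounded, e.g. by a preceding tanh, so the powers stay numerically stable):

```python
import torch
import torch.nn as nn

class GenerativeConv2d(nn.Module):
    """Sketch of a Self-ONN generative-neuron layer: one kernel per power of
    the input, y = sum_q conv(x**q, W_q), approximating an arbitrary nodal
    operator via a truncated Maclaurin expansion of order q."""
    def __init__(self, in_ch, out_ch, k, q=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for _ in range(q))

    def forward(self, x):
        # q = 1 reduces to an ordinary convolution; higher q adds nonlinearity.
        return sum(conv(x.pow(i + 1)) for i, conv in enumerate(self.convs))
```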
Serkan Kiranyaz (Dr. Tech.) was born in Turkey in 1972. He received his BS and MS degrees from the Department of Electrical and Electronics Engineering at Bilkent University, Ankara, Turkey, in 1994 and 1996, respectively. He received his PhD degree in 2005 and his docency in 2007 from the Institute of Signal Processing, Tampere University of Technology, and worked as a Professor in the Signal Processing Department of the same university from 2009 to 2015. He currently works as a Professor at Qatar University, Doha, Qatar. Prof. Kiranyaz has noteworthy expertise and background in various signal processing domains. He has published two books, 7 book chapters, 7 patents, more than 90 journal articles in several IEEE Transactions and other high-impact journals, and more than 110 papers in international conferences. He has served as PI and LPI in several national and international projects. His principal research field is machine learning and signal processing. He aims to reinvent signal processing with novel paradigms, to enrich it with new approaches especially in machine intelligence, and to revolutionize the means of "learn-to-process" signals. He has made significant contributions to bio-signal analysis (particularly EEG and ECG analysis, processing, classification and segmentation), computer vision with applications to recognition, classification and multimedia retrieval, evolving systems and evolutionary machine learning, and swarm intelligence and evolutionary optimization.
Talk 3 -- Abol Basher: "LightSAL: Lightweight Sign Agnostic Learning for Implicit Surface Representation."
Abstract: Recently, several works have addressed modeling of 3D shapes using deep neural networks to learn implicit surface representations. Up to now, the majority of works have concentrated on reconstruction quality, paying little or no attention to model size or training time. This work proposes LightSAL, a novel deep convolutional architecture for learning 3D shapes; the proposed work concentrates on efficiency both in network training time and resulting model size. We build on the recent concept of Sign Agnostic Learning for training the proposed network, relying on signed distance fields, with unsigned distance as ground truth. In the experimental section of the paper, we demonstrate that the proposed architecture outperforms previous work in model size and number of required training iterations, while achieving equivalent accuracy. Experiments are based on the D-Faust dataset that contains 41k 3D scans of human shapes.
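The sign-agnostic training signal itself is compact enough to state directly: the network predicts a signed field f(x) but is supervised only by unsigned distances h(x), via the loss | |f(x)| - h(x) |. A minimal sketch of that loss (hypothetical names; the paper's architecture and training schedule are separate concerns):

```python
import torch

def sal_loss(model, points, unsigned_dist):
    """Sign-agnostic loss: compare the magnitude of the predicted signed
    field against ground-truth unsigned distances, letting the sign emerge
    during training. points: (N, 3); unsigned_dist: (N,)."""
    pred = model(points).squeeze(-1)
    return (pred.abs() - unsigned_dist).abs().mean()
```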
Abol Basher has been a doctoral student in Computer Science under the supervision of Prof. Jani Boutellier at the University of Vaasa since January 2021. He has been working on the Academy of Finland funded project "Robust and Efficient Perception for Autonomous Thing (REPEAT)", and his research topic is 3D reconstruction and scene completion. He has also been a project researcher for the Digital Economy Research Platform at the University of Vaasa since October 2020.
Talk 4 -- Jaakko Lehtinen: "Alias-Free Generative Adversarial Networks." (postponed)
Abstract: We observe that despite their hierarchical convolutional nature, the synthesis process of typical generative adversarial networks depends on absolute pixel coordinates in an unhealthy manner. This manifests itself as, e.g., detail appearing to be glued to image coordinates instead of the surfaces of depicted objects. We trace the root cause to careless signal processing that causes aliasing in the generator network. Interpreting all signals in the network as continuous, we derive generally applicable, small architectural changes that guarantee that unwanted information cannot leak into the hierarchical synthesis process. The resulting networks match the FID of StyleGAN2 but differ dramatically in their internal representations, and they are fully equivariant to translation and rotation even at subpixel scales. Our results pave the way for generative models better suited for video and animation.
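The core signal-processing fix can be sketched as follows: treat each feature map as a sampled continuous signal, apply the pointwise nonlinearity at a higher sampling rate, and low-pass back down so the nonlinearity cannot inject aliased high frequencies. The sketch below uses bilinear resampling purely as a stand-in; the paper uses carefully designed windowed-sinc filters, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def antialiased_lrelu(x, up=2):
    """Apply leaky ReLU at `up`-times the sampling rate of the (N, C, H, W)
    feature map, then resample back down, approximating an alias-suppressed
    pointwise nonlinearity on the underlying continuous signal."""
    x = F.interpolate(x, scale_factor=up, mode='bilinear', align_corners=False)
    x = F.leaky_relu(x, 0.2)               # nonlinearity at the higher rate
    return F.interpolate(x, scale_factor=1 / up, mode='bilinear',
                         align_corners=False)
```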
Jaakko Lehtinen is a tenured associate professor at Aalto University, and a principal research scientist at NVIDIA Research. He works on computer graphics, computer vision, and machine learning, with particular interests in generative modelling, realistic image synthesis, and appearance acquisition and reproduction. Overall, he is fascinated by the combination of machine learning techniques with physical simulators in the search for robust, interpretable AI. Prior to taking his current positions, he spent 2007-10 as a postdoc with Frédo Durand at MIT. Before his research career, he worked for the game developer Remedy Entertainment in 1996-2005 as a graphics programmer, and contributed significantly to the graphics technology behind the worldwide blockbuster hit games Max Payne (2001), Max Payne 2 (2003), and Alan Wake (2009).