Publications by authors named "Zisserman A"

Auditory and visual signals are two primary perception modalities that are usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals-usually speech audio. In this study, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without relying on dense supervisory annotations from human experts for the model training.

View Article and Find Full Text PDF

Spinal magnetic resonance (MR) scans are a vital tool for diagnosing the cause of back pain for many diseases and conditions. However, interpreting clinically useful information from these scans can be challenging, time-consuming and hard to reproduce across different radiologists. In this paper, we alleviate these problems by introducing a multi-stage automated pipeline for analysing spinal MR scans.

View Article and Find Full Text PDF

Our objective is to locate and provide a unique identifier for each mouse in a cluttered home-cage environment through time, as a precursor to automated behaviour recognition for biological research. This is a very challenging problem due to (i) the lack of distinguishing visual features for each mouse, and (ii) the close confines of the scene with constant occlusion, making standard visual tracking approaches unusable. However, a coarse estimate of each mouse's location is available from a unique RFID implant, so there is the potential to optimally combine information from (weak) tracking with coarse information on identity.

View Article and Find Full Text PDF

Background: Scoliosis is spinal curvature that may progress to require surgical stabilisation. Risk factors for progression are little understood due to lack of population-based research, since radiographs cannot be performed on entire populations due to high levels of radiation. To help address this, we have previously developed and validated a method for quantification of spinal curvature from total body dual energy X-ray absorptiometry (DXA) scans.

View Article and Find Full Text PDF

Objectives: The relationship of degeneration to symptoms has been questioned. MRI detects apparently similar disc degeneration and degenerative changes in subjects both with and without back pain. We aimed to overcome these problems by re-annotating MRIs from asymptomatic and symptomatics groups onto the same grading system.

View Article and Find Full Text PDF

Large video datasets of wild animal behavior are crucial to produce longitudinal research and accelerate conservation efforts; however, large-scale behavior analyses continue to be severely constrained by time and resources. We present a deep convolutional neural network approach and fully automated pipeline to detect and track two audiovisually distinctive actions in wild chimpanzees: buttress drumming and nut cracking. Using camera trap and direct video recordings, we train action recognition models using audio and visual signatures of both behaviors, attaining high average precision (buttress drumming: 0.

View Article and Find Full Text PDF

We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. We present a new approach called AutoNovel to address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labelled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use ranking statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. Moreover, we propose a method to estimate the number of classes for the case where the number of new categories is not known a priori.

View Article and Find Full Text PDF

Capturing the 'mutual gaze' of people is essential for understanding and interpreting the social interactions between them. To this end, this paper addresses the problem of detecting people Looking At Each Other (LAEO) in video sequences. For this purpose, we propose LAEO-Net++, a new deep CNN for determining LAEO in videos.

View Article and Find Full Text PDF

In the original version of the article, the co-author would like to add to the acknowledgements section to highlight their funding stream (EPSRC). The revised acknowledgements is given below.

View Article and Find Full Text PDF

Time-lapse cameras facilitate remote and high-resolution monitoring of wild animal and plant communities, but the image data produced require further processing to be useful. Here we publish pipelines to process raw time-lapse imagery, resulting in count data (number of penguins per image) and 'nearest neighbour distance' measurements. The latter provide useful summaries of colony spatial structure (which can indicate phenological stage) and can be used to detect movement - metrics which could be valuable for a number of different monitoring scenarios, including image capture during aerial surveys.

View Article and Find Full Text PDF

Scoliosis is a 3D-torsional rotation of the spine, but risk factors for initiation and progression are little understood. Research is hampered by lack of population-based research since radiographs cannot be performed on entire populations due to the relatively high levels of ionising radiation. Hence we have developed and validated a manual method for identifying scoliosis from total body dual energy X-ray absorptiometry (DXA) scans for research purposes.

View Article and Find Full Text PDF

The implementation of video-based non-contact technologies to monitor the vital signs of preterm infants in the hospital presents several challenges, such as the detection of the presence or the absence of a patient in the video frame, robustness to changes in lighting conditions, automated identification of suitable time periods and regions of interest from which vital signs can be estimated. We carried out a clinical study to evaluate the accuracy and the proportion of time that heart rate and respiratory rate can be estimated from preterm infants using only a video camera in a clinical environment, without interfering with regular patient care. A total of 426.

View Article and Find Full Text PDF

Unlabelled: Non-contact vital sign monitoring enables the estimation of vital signs, such as heart rate, respiratory rate and oxygen saturation (SpO), by measuring subtle color changes on the skin surface using a video camera. For patients in a hospital ward, the main challenges in the development of continuous and robust non-contact monitoring techniques are the identification of time periods and the segmentation of skin regions of interest (ROIs) from which vital signs can be estimated. We propose a deep learning framework to tackle these challenges.

View Article and Find Full Text PDF

Video recording is now ubiquitous in the study of animal behavior, but its analysis on a large scale is prohibited by the time and resources needed to manually process large volumes of data. We present a deep convolutional neural network (CNN) approach that provides a fully automated pipeline for face detection, tracking, and recognition of wild chimpanzees from long-term video records. In a 14-year dataset yielding 10 million face images from 23 individuals over 50 hours of footage, we obtained an overall accuracy of 92.

View Article and Find Full Text PDF

The objective of this work is automatic labelling of characters in TV video and movies, given weak supervisory information provided by an aligned transcript. We make five contributions: (i) a new strategy for obtaining stronger supervisory information from aligned transcripts; (ii) an explicit model for classifying background characters, based on their face-tracks; (iii) employing new ConvNet based face features, and (iv) a novel approach for labelling all face tracks jointly using linear programming. Each of these contributions delivers a boost in performance, and we demonstrate this on standard benchmarks using tracks provided by authors of prior work.

View Article and Find Full Text PDF

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss.

View Article and Find Full Text PDF
From Images to 3D Shape Attributes.

IEEE Trans Pattern Anal Mach Intell

January 2019

Our goal in this paper is to investigate properties of 3D shape that can be determined from a single image. We define 3D shape attributes-generic properties of the shape that capture curvature, contact and occupied space. Our first objective is to infer these 3D shape attributes from a single image.

View Article and Find Full Text PDF

Automated time-lapse cameras can facilitate reliable and consistent monitoring of wild animal populations. In this report, data from 73,802 images taken by 15 different Penguin Watch cameras are presented, capturing the dynamics of penguin (Spheniscidae; Pygoscelis spp.) breeding colonies across the Antarctic Peninsula, South Shetland Islands and South Georgia (03/2012 to 01/2014).

View Article and Find Full Text PDF

The crystallization of solidifying Al-Cu alloys over a wide range of conditions was studied in situ by synchrotron x-ray radiography, and the data were analyzed using a computer vision algorithm trained using machine learning. The effect of cooling rate and solute concentration on nucleation undercooling, crystal formation rate, and crystal growth rate was measured automatically for thousands of separate crystals, which was impossible to achieve manually. Nucleation undercooling distributions confirmed the efficiency of extrinsic grain refiners and gave support to the widely assumed free growth model of heterogeneous nucleation.

View Article and Find Full Text PDF

Methods for aligning 3D fetal neurosonography images must be robust to (i) intensity variations, (ii) anatomical and age-specific differences within the fetal population, and (iii) the variations in fetal position. To this end, we propose a multi-task fully convolutional neural network (FCN) architecture to address the problem of 3D fetal brain localization, structural segmentation, and alignment to a referential coordinate system. Instead of treating these tasks as independent problems, we optimize the network by simultaneously learning features shared within the input data pertaining to the correlated tasks, and later branching out into task-specific output streams.

View Article and Find Full Text PDF

The objective of this work is to automatically produce radiological gradings of spinal lumbar MRIs and also localize the predicted pathologies. We show that this can be achieved via a Convolutional Neural Network (CNN) framework that takes intervertebral disc volumes as inputs and is trained only on disc-specific class labels. Our contributions are: (i) a CNN architecture that predicts multiple gradings at once, and we propose variants of the architecture including using 3D convolutions; (ii) showing that this architecture can be trained using a multi-task loss function without requiring segmentation level annotation; and (iii) a localization method that clearly shows pathological regions in the disc volumes.

View Article and Find Full Text PDF

Study Design: Investigation of the automation of radiological features from magnetic resonance images (MRIs) of the lumbar spine.

Objective: To automate the process of grading lumbar intervertebral discs and vertebral bodies from MRIs. MR imaging is the most common imaging technique used in investigating low back pain (LBP).

View Article and Find Full Text PDF

We consider the design of an image representation that embeds and aggregates a set of local descriptors into a single vector. Popular representations of this kind include the bag-of-visual-words, the Fisher vector and the VLAD. When two such image representations are compared with the dot-product, the image-to-image similarity can be interpreted as a match kernel.

View Article and Find Full Text PDF

The objective of this work is to learn descriptors suitable for the sparse feature detectors used in viewpoint invariant matching. We make a number of novel contributions towards this goal. First, it is shown that learning the pooling regions for the descriptor can be formulated as a convex optimisation problem selecting the regions using sparsity.

View Article and Find Full Text PDF