Publications on ViCoS Lab

1st Workshop on Maritime Computer Vision (MaCVi) 2023: Challenge Results

Mon, 01 Jan 0001 00:00:00 +0000

The 1st Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi

2nd Workshop on Maritime Computer Vision (MaCVi) 2024: Challenge Results

Mon, 01 Jan 0001 00:00:00 +0000

The 2nd Workshop on Maritime Computer Vision (MaCVi) 2024 addresses maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicles (USV). Three challenges categories are considered: (i) UAV-based Maritime Object Tracking with Re-identification, (ii) USV-based Maritime Obstacle Segmentation and De- tection, (iii) USV-based Maritime Boat Tracking. The USV-based Maritime Obstacle Segmentation and Detec- tion features three sub-challenges, including a new em- bedded challenge addressing efficicent inference on real- world embedded devices. This report offers a comprehen- sive overview of the findings from the challenges. We pro- vide both statistical and qualitative analyses, evaluating trends from over 195 submissions. All datasets, evaluation code, and the leaderboard are available to the public at https://macvi.org/workshop/macvi24.

A basic cognitive system for interactive continuous learning of visual concepts

Mon, 01 Jan 0001 00:00:00 +0000

Interactive continuous learning is an important characteristic of a cognitive agent that is supposed to operate and evolve in an everchanging environment. In this paper we present representations and mechanisms that are necessary for continuous learning of visual concepts in dialogue with a tutor. We present an approach for modelling beliefs stemming from multiple modalities and we show how these beliefs are created by processing visual and linguistic information and how they are used for learning. We also present a system that exploits these representations and mechanisms, and demonstrate these principles in the case of learning about object colours and basic shapes in dialogue with the tutor.

A basic cognitive system for interactive learning of simple visual concepts

Mon, 01 Jan 0001 00:00:00 +0000

In this work we present a system and underlying representations and mechanisms for continuous learning of visual concepts in dialogue with a human tutor.

A Bayes-Spectral-Entropy-Based Measure of Camera Focus Using a Discrete Cosine Transform

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we present a novel measure of camera focus based on the Bayes spectral entropy of an image spectrum. In order to estimate the degree of focus, the image is divided into non-overlapping subimages of 8 by 8 pixels. Next, sharpness values are calculated separately for each sub-image and their mean is taken as a measure of the overall focus. The sub-image spectra are obtained by an 8×8 discrete cosine transform (DCT). Comparisons were made against four well-known measures that were chosen as reference, on images captured with a standard visible-light camera and a thermal camera. The proposed measure outperformed the reference measures by exhibiting a wider working range and a smaller failure rate. To assess its robustness to noise, additional tests were conducted with noisy images.

A Computer Vision Integration Model for a Multi-modal Cognitive System

Mon, 01 Jan 0001 00:00:00 +0000

We present a general method for integrating visual components into a multi-modal cognitive system. The integration is very generic and can combine an arbitrary set of modalities. We illustrate our integration approach with a specific instantiation of the architecture schema that focuses on integration of vision and language: a cognitive system able to collaborate with a human, learn and display some understanding of its surroundings. As examples of cross-modal interaction we describe mechanisms for clarification and visual learning.

A Detect-and-Verify Paradigm for Low-Shot Counting - DAVE

Mon, 01 Jan 0001 00:00:00 +0000

Low-shot counters estimate the number of objects corresponding to a selected category, based on only few or no exemplars annotated in the image. The current state-of-the-art estimates the total counts as the sum over the object location density map, but does not provide individual object locations and sizes, which are crucial for many applications. This is addressed by detection-based counters, which, however fall behind in the total count accuracy. Furthermore, both approaches tend to overestimate the counts in the presence of other object classes due to many false positives. We propose DAVE, a low-shot counter based on a detect-and-verify paradigm, that avoids the aforementioned issues by first generating a high-recall detection set and then verifying the detections to identify and remove the outliers. This jointly increases the recall and precision, leading to accurate counts. DAVE outperforms the top density-based counters by ~20% in the total count MAE, it outperforms the most recent detection-based counter by ~20% in detection quality and sets a new state-of-the-art in zero-shot as well as text-prompt-based counting.

A Discriminative Single-Shot Segmentation Network for Visual Object Tracking

Mon, 01 Jan 0001 00:00:00 +0000

Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but are restricted to bounding box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker – D3S2, which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties, one invariant to a broad range of transformations, including non-rigid deformations, the other assuming a rigid object to simultaneously achieve robust online target segmentation. The overall tracking reliability is further increased by decoupling the object and feature scale estimation. Without per-dataset finetuning, and trained only for segmentation as the primary output, D3S2 outperforms all published trackers on the recent short-term tracking benchmark VOT2020 and performs very close to the state-of-the-art trackers on the GOT-10k, TrackingNet, OTB100 and LaSoT. D3S2 outperforms the leading segmentation tracker SiamMask on video object segmentation benchmarks and performs on par with top video object segmentation algorithms.

A Distractor-Aware Memory for Visual Object Tracking with SAM2

Mon, 01 Jan 0001 00:00:00 +0000

Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While already achieving top performance on many benchmarks, it was the recent release of SAM2 that placed memory-based trackers into focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly addresses the segmentation accuracy as well as tracking robustness. The resulting tracker is denoted as SAM2.1++. We also propose a new distractor-distilled DiDi dataset to study the distractor problem better. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them. The code and the new dataset will be available on https://github.com/jovanavidenovic/DAM4SAM.

A Framework for Robust and Incremental Self-Localization

Mon, 01 Jan 0001 00:00:00 +0000

In this contribution we present a framework for an embodied robotic system that is capable of appearance-based self-localization. Specifically, we concentrate on the issues of robustness, flexibility, and scalability of the system. The framework presented is based on a panoramic eigenspace model of the environment. Its main feature is that it allows for simultaneous localization and map building using an incremental learning algorithm. Further, both the learning and the training processes are designed in a way to achieve robustness and adaptability to changes in the environment.

A framework for visual-context-aware object detection in still images

Mon, 01 Jan 0001 00:00:00 +0000

Visual context provides cues about an object’s presence, position and size within the observed scene, which should be used to increase the performance of object detection techniques. However, in computer vision, object detectors typically ignore this information. We therefore present a framework for visual-context-aware object detection. Methods for extracting visual contextual information from still images are proposed, which are then used to calculate a prior for object detection. The concept is based on a sparse coding of contextual features, which are based on geometry and texture. In addition, bottom-up saliency and object co-occurrences are exploited, to define auxiliary visual context. To integrate the individual contextual cues with a local appearance-based object detector, a fully probabilistic framework is established. In contrast to other methods, our integration is based on modeling the underlying conditional probabilities between the different cues, which is done via kernel density estimation. This integration is a crucial part of the framework which is demonstrated within the detailed evaluation. Our method is evaluated using a novel demanding image data set and compared to a state-of-the-art method for context-aware object detection. An in-depth analysis is given discussing the contributions of the individual contextual cues and the limitations of visual context for object detection.

A graphical model for rapid obstacle image-map estimation from unmanned surface vehicles

Mon, 01 Jan 0001 00:00:00 +0000

A hierarchical dynamic model for tracking in sports

Mon, 01 Jan 0001 00:00:00 +0000

Dynamic models play a crucial role in tracking algorithms. In particle filters, for example, proper modelling of the target dynamics can help achieving the desired tracking accuracy using only a small number of particles and thus reducing the computa- tional complexity of the tracker. We propose a novel hierarchical model for tracking players in sports by combining a conservative and a liberal dynamic model to better describe the player’s dynamics. We show how parameters of the model can be estimated from prior knowledge about the players dynamics. The proposed dynamic model was compared to a widely used model and resulted in better performance in terms of estimating position and prediction.

A hierarchy of cognitive maps from panoramic images

Mon, 01 Jan 0001 00:00:00 +0000

This paper presents a computational model which implements formation of cognitive maps based on panoramic images captured during the exploration phase. The resulting map consists of “place cells” and topological relations between them. The formation of the cognitive map is based on the model introduced by Hafner. The use of panoramic images as inputs would result in high computational complexity of the simulation, therefore we propose to use the PCA (Principal Component Analysis) method to reduce the dimension of the input space. A physical force model is applied to extend the relatively sparse topological map with metric information. Both the computational model and the physical force model try to mimic functions performed in the mammalian brain.

A Local-motion-based probabilistic model for visual tracking

Mon, 01 Jan 0001 00:00:00 +0000

Color-based tracking is prone to failure in situations where visually similar targets are moving in a close proximity or occlude each other. To deal with the ambiguities in the visual information, we propose an additional color-independent visual model based on the target’s local motion. This model is calculated from the optical flow induced by the target in consecutive images. By modifying a color-based particle filter to account for the target’s local motion, the combined color/local-motion-based tracker is constructed. We compare the combined tracker to a purely color-based tracker on a challenging dataset from hand tracking, surveillance and sports. The experiments show that the proposed local-motion model largely resolves situations when the target is occluded by, or moves in front of, a visually similar object.

A Long-Term Discriminative Single Shot Segmentation Tracker

Mon, 01 Jan 0001 00:00:00 +0000

State-of-the-art long-term visual object tracking methods are limited to predict target position as an axis-aligned bounding box. Segmentation-based trackers exist, however they do not address long-term disappearances of the target. We propose a long-term discriminative single shot segmentation tracker – D3SLT, which addresses the above shortcomings. The previously developed short-term D3S tracker is upgraded with a global re-detection module, based on an image-wide discriminative correlation filter response and Gaussian motion model. An online learned confidence estimation module is employed for robust estimation target disappearance. Additional backtracking module enables recovery from tracking failures and further improves tracking performance. D3SLT performs close to the state-of-the-art long-term trackers on the bou-nding box based VOT-LT2021 Challenge, achieving F-score of 0.667, while additionally outputting segmentation masks.

A Low-Shot Object Counting Network With Iterative Prototype Adaptation

Mon, 01 Jan 0001 00:00:00 +0000

We consider low-shot counting of arbitrary semantic categories in the image using only few annotated exemplars (few-shot) or no exemplars (no-shot). The standard few-shot pipeline follows extraction of appearance queries from exemplars and matching them with image features to infer the object counts. Existing methods extract queries by feature pooling, but neglect the shape information (e.g., size and aspect), which leads to a reduced object localization accuracy and count estimates. We propose a Low-shot Object Counting network with iterative prototype Adaptation (LOCA). Our main contribution is the new object prototype extraction module, which iteratively fuses the exemplar shape and appearance queries with image features. The module is easily adapted to zero-shot scenario, enabling LOCA to cover the entire spectrum of low-shot counting problems. LOCA outperforms all recent state-of-the-art methods on FSC147 benchmark by 20-30% in RMSE on one-shot and few-shot and achieves state-of-the-art on zero-shot scenarios, while demonstrating better generalization capabilities.

A modular toolkit for visual tracking performance evaluation

Mon, 01 Jan 0001 00:00:00 +0000

We present a modular software package for conducting single-target visual object tracking experiments and analyzing results. Our software supports many of the common usage patterns in visual tracking evaluation out of the box, but is also modular and allows various extensions. Users are able to integrate existing implementations of visual tracking algorithms with little additional effort using a standardized and flexible communication protocol. The software has been the technical backbone of the VOT Challenge initiative for many years and has grown and evolved with the competitions that it supported. We present its current state and the capabilities of the package and conclude with some plans for future development.

A New Dataset and a Distractor-Aware Architecture for Transparent Object Tracking

Mon, 01 Jan 0001 00:00:00 +0000

Performance of modern trackers degrades substantially on transparent objects compared to opaque objects. This is largely due to two distinct reasons. Transparent objects are unique in that their appearance is directly affected by the background. Furthermore, transparent object scenes often contain many visually similar objects (distractors), which often lead to tracking failure. However, development of modern tracking architectures requires large training sets, which do not exist in transparent object tracking. We present two contributions addressing the aforementioned issues. We propose the first transparent object tracking training dataset Trans2k that consists of over 2k sequences with 104,343 images overall, annotated by bounding boxes and segmentation masks. Standard trackers trained on this dataset consistently improve by up to 16%. Our second contribution is a new distractor-aware transparent object tracker (DiTra) that treats localization accuracy and target identification as separate tasks and implements them by a novel architecture. DiTra sets a new state-of-the-art in transparent object tracking and generalizes well to opaque objects.

A Novel Performance Evaluation Methodology for Single-Target Trackers

Mon, 01 Jan 0001 00:00:00 +0000

This paper addresses the problem of single-target tracker performance evaluation. We consider the performance measures, the dataset and the evaluation system to be the most important components of tracker evaluation and propose requirements for each of them. The requirements are the basis of a new evaluation methodology that aims at a simple and easily interpretable tracker comparison. The ranking-based methodology addresses tracker equivalence in terms of statistical significance and practical differences. A fully-annotated dataset with per-frame annotations with several visual attributes is introduced. The diversity of its visual properties is maximized in a novel way by clustering a large number of videos according to their visual attributes. This makes it the most sophistically constructed and annotated dataset to date. A multi-platform evaluation system allowing easy integration of third-party trackers is presented as well. The proposed evaluation methodology was tested on the VOT2014 challenge on the new dataset and 38 trackers, making it the largest benchmark to date. Most of the tested trackers are indeed state-of-the-art since they outperform the standard baselines, resulting in a highly-challenging benchmark. An exhaustive analysis of the dataset from the perspective of tracking difficulty is carried out. To facilitate tracker comparison a new performance visualization technique is proposed.

A Novel Unified Architecture for Low-Shot Counting by Detection and Segmentation

Mon, 01 Jan 0001 00:00:00 +0000

Low-shot object counters estimate the number of objects in an image using few or no annotated exemplars. Objects are localized by matching them to prototypes, which are constructed by unsupervised image-wide object appearance aggregation. Due to potentially diverse object appearances, the existing approaches often lead to overgeneralization and false positive detections. Furthermore, the best-performing methods train object localization by a surrogate loss, that predicts a unit Gaussian at each object center. This loss is sensitive to annotation error, hyperparameters and does not directly optimize the detection task, leading to suboptimal counts. We introduce GeCo, a novel low-shot counter that achieves accurate object detection, segmentation, and count estimation in a unified architecture. GeCo robustly generalizes the prototypes across objects appearances through a novel dense object query formulation. In addition, a novel counting loss is proposed, that directly optimizes the detection task and avoids the issues of the standard surrogate loss. GeCo surpasses the leading few-shot detection-based counters by 25% in the total count MAE, achieves superior detection accuracy and sets a new solid state-of-the-art result across all low-shot counting setups. The code will be available on GitHub.

A Robust PCA algorithm for building representations from panoramic images

Mon, 01 Jan 0001 00:00:00 +0000

We present an artifficial cognitive system for learning visual concepts. It comprises of vision, communication and manipulation sub- systems, which provide visual input, enable verbal and non-verbal communication with a tutor and allow interaction with a given scene. The main goal is to learn associations between automatically extracted visual features and words that describe the scene in an open-ended, continuous manner. In particular, we address the problem of cross-modal learning of visual properties and spatial relations. We introduce and analyse several learning modes requiring diffeerent levels of tutor supervision.

A segmentation-based approach for polyp counting in the wild

Mon, 01 Jan 0001 00:00:00 +0000

We address the problem of jellyfish polyp counting in underwater images. Modern methods utilize convolutional neural networks for feature extraction and work in two stages. First, hypothetical regions are proposed at potential locations, the features of the regions are extracted and classified according to the contained object. Such methods typically require a dense grid for region proposals, explicitly test various scales and are prone to failure in densely populated regions. We propose a segmentation-based polyp counter – SegCo. A convolutional neural network is trained to produce locally-circular segmentation masks on the polyps, which are then detected by localizing circularly symmetric areas in the segmented image. Detection stage is effcient and avoids a greedy search over position and scales. SegCo outperforms the current state-of-the-art object detector RetinaNet and the recent specialized polyp detection method PoCo by 2% and 24% in F-score, respectively, and sets a new state-of-the-art in polyp detection.

A system approach to interactive learning of visual concepts

Mon, 01 Jan 0001 00:00:00 +0000

In this work we present a system and underlying mechanisms for continuous learning of visual concepts in dialogue with a human.

A System for Continuous Learning of Visual Concepts

Mon, 01 Jan 0001 00:00:00 +0000

We present an artifficial cognitive system for learning visual concepts. It comprises of vision, communication and manipulation sub- systems, which provide visual input, enable verbal and non-verbal com munication with a tutor and allow interaction with a given scene. The main goal is to learn associations between automatically extracted visual features and words that describe the scene in an open-ended, continuous manner. In particular, we address the problem of cross-modal learning of visual properties and spatial relations. We introduce and analyse several learning modes requiring different levels of tutor supervision.

A system for interactive learning in dialogue with a tutor

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we present representations and mechanisms that facilitate continuous learning of visual concepts in dialogue with a tutor and show the implemented robot system. We present how beliefs about the world are created by processing visual and linguistic information and show how they are used for planning system behaviour with the aim at satisfying its internal drive – to extend its knowledge. The system facilitates different kinds of learning initiated by the human tutor or by the system itself. We demonstrate these principles in the case of learning about object colours and basic shapes.

A system for learning basic object affordances using a self-organizing map

Mon, 01 Jan 0001 00:00:00 +0000

When a cognitive system encounters particular objects, it needs to know what effect each of its possible actions will have on the state of each of those objects in order to be able to make effective decisions and achieve its goals. Moreover, it should be able to generalize effectively so that when it encounters novel objects, it is able to estimate what effect its actions will have on them based on its experiences with previously encountered similar objects. This idea is encapsulated by the term “affordance”, e.g. “a ball affords being rolled to the right when pushed from the left.” In this paper, we discuss the development of a cognitive vision platform that uses a robotic arm to interact with household objects in an attempt to learn some of their basic affordance properties. We outline the various sensor and effector module competencies that were needed to achieve this and describe an experiment that uses a self-organizing map to integrate these modalities in a working affordance learning system.

A Template-Based Multi-Player Action Recognition of the Basketball Game

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we present a method for fully automatic trajectory based analysis of basketball game in the form of large and small scale modelling of the game. The large-scale game model is obtained by dividing the game into several game phases. Every game phase is then individually modelled using mixture of Gaussian distributions. The Expectation-Maximization algorithm is used to determine the parameters of the Gaussian distributions. On the other hand, the small-scale modelling of the game deals with specific basketball actions which can be defined in the form of action templates that are used by the basketball experts to pass their instructions to the players. For the recognition purposes we define the basic game elements which are the building blocks of the more complex game actions. These elements are then used to semantically describe the observed basketball actions and the templates. To establish if the observed action corresponds to the template, the similarity of descriptions is calculated using Levenstein distance measure. Experiments show that the proposed method could become a powerful tool for the recognition of various basketball actions.

A Trajectory-Based Analysis of Coordinated Team Activity in a Basketball Game

Mon, 01 Jan 0001 00:00:00 +0000

A Two-Stage Dynamic Model for Visual Tracking

Mon, 01 Jan 0001 00:00:00 +0000

We propose a new dynamic model which can be used within blob trackers to track the target’s center of gravity. A strong point of the model is that it is designed to track a variety of motions which are usually encountered in applications such as pedestrian tracking, hand tracking and sports. We call the dynamic model a two-stage dynamic model due to its particular structure, which is a composition of two models: a liberal model and a conservative model. The liberal model allows larger perturbations in the target’s dynamics and is able to account for motions in between the random-walk dynamics and the nearly-constant-velocity dynamics. On the other hand, the conservative model assumes smaller perturbations and is used to further constrain the liberal model to the target’s current dynamics. We implement the two-stage dynamic model in a two-stage probabilistic tracker based on the particle filter and apply it to two separate examples of blob tracking: (i) tracking entire persons and (ii) tracking of a person’s hands. Experiments show that, in comparison to the widely used models, the proposed two-stage dynamic model allows tracking with smaller number of particles in the particle filter (e.g., 25 particles), while achieving smaller errors in the state estimation and a smaller failure rate. The results suggest that the improved performance comes from the model’s ability to actively adapt to the target’s motion during tracking.

A Visualization and User Interface Framework for Heterogeneous Distributed Environments

Mon, 01 Jan 0001 00:00:00 +0000

Systems that require complex computations are frequently implemented in a distributed manner. Such systems are often split into components where each component is employed to perform a specific type of processing. The components of a system may be implemented in different programming languages because some languages are more suited for expressing and solving certain kinds of problems. The user of the system must have a way to monitor the state of individual components and also to modify their execution parameters through a user interface while the system is running. The distributed execution and programming language diversity represent a problem for the development of graphic user interfaces. In this paper we describe a framework in which a server provides two types of services to the components of a distributed system. First it manages visualization objects provided by individual components and combines and displays those objects in various views. Second, it displays and executes graphic user interface objects defined at runtime by the components and communicates with the components when changes occur in the user interface or in the internal state of the components. The framework was successfully used in a distributed robotic environment.

A water-obstacle separation and refinement network for unmanned surface vehicles

Mon, 01 Jan 0001 00:00:00 +0000

Obstacle detection by semantic segmentation shows a great promise for autonomous navigation in unmanned surface vehicles (USV). However, existing methods suffer from poor estimation of the water edge in presence of visual ambiguities, poor detection of small obstacles and high false-positive rate on water reflections and wakes. We propose a new deep encoder-decoder architecture, a water-obstacle separation and refinement network (WaSR), to address these issues. Detection and water edge accuracy are improved by a novel decoder that gradually fuses inertial information from IMU with the visual features from the encoder. In addition, a novel loss function is designed to increase the separation between water and obstacle features early on in the network. Subsequently, the capacity of the remaining layers in the decoder is better utilised, leading to a significant reduction in false positives and increased true positives. Experimental results show that WaSR outperforms the current state-of-the-art by a large margin, yielding a 14% increase in F-measure over the second-best method.

A web-service for object detection using hierarchical models

Mon, 01 Jan 0001 00:00:00 +0000

This paper proposes an architecture for an object detection system suitable for a web-service running distributed on a cluster of machines. We build on top of a recently proposed architecture for distributed visual recognition system and extend it with the object detection algorithm. As sliding-window techniques are computationally unsuitable for web-services we rely on models based on state-of-the-art hierarchical compositions for the object detection algorithm. We provide implementation details for running hierarchical models on top of a distributed platform and propose an additional hypothesis verification step to reduce many false-positives that are common in hierarchical models. For a verification we rely on a state-of-the-art descriptor extracted from the hierarchical structure and use a support vector machine for object classification. We evaluate the system on a cluster of 80 workers and show a response time of around 10 seconds at throughput of around 60 requests per minute.

About different active learning approaches for acquiring categorical knowledge

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we address the problem of acquiring categorical knowledge from the active learning perspective. We describe and implement several teacher and learnerdrivenapproaches that require different levels of teacher competencies and consider different types of knowledge for selection of training samples. The experimental results show that the active learning approach outperforms the passive one and that the adaptation of the learning process to the learners knowledge significantly improves the learning performance.

Acquiring range images of objects with non-uniform reflectance using high dynamic scale radiance maps

Mon, 01 Jan 0001 00:00:00 +0000

Active learning with teacher-learner mutuality

Mon, 01 Jan 0001 00:00:00 +0000

In active learning, the basic objective is to reach a desired performance of some learning algorithm with as little training instances as possible. The reason behind is that labeling of training instances may be expensive with respect to the amount of time and intellectual effort of a human annotator. We propose a new approach for active learning, called “mutual active learning”, which helps the artificial intelligent learner to pose questions to his human teacher, which are as clear and as understandable as possible. Such learning appears to be more reliable and successful than basic active learning.

Adaptive Dynamic Window Approach for Local Navigation

Mon, 01 Jan 0001 00:00:00 +0000

Local navigation is an essential ability of any mobile robot working in a real-world environment. One of the most commonly used methods for local navigation is the Dynamic Window Approach (DWA), which heavily depends on the settings of the parameters in its cost function. Since the optimal choice of the parameters depends on the environment that may significantly vary and change at any time, the parameters should be chosen dynamically in a data-driven way. To cope with this problem, we propose a novel deep convolutional neural network, which dynamically predicts these parameters considering the sensor readings. The network is trained using a state-of-the art reinforcement learning algorithm. In this way, we combine the power of data-driven learning and the dynamic model of the robot, enabling adaptation to the current environment as well as guaranteeing collision-free movement and smooth trajectories of the mobile robot. The experimental results show that the proposed method outperforms the DWA method as well as its recent extension.

Adding discriminative power to a generative hierarchical compositional model using histograms of compositions

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we identify two types of problems with excessive feature sharing and the lack of discriminative learning in hierarchical compositional models: (a) similar category misclassifications and (b) phantom detections in background objects. We propose to overcome those issues by fully utilizing a discriminative features already present in the generative models of hierarchical compositions. We introduce descriptor called Histogram of Compositions to capture the information important for improving discriminative power and use it with a classifier to learn distinctive features important for successful discrimination. The generative model of hierarchical compositions is combined with the discriminative descriptor by performing hypothesis verification of detections produced by the hierarchical compositional model. We evaluate proposed descriptor on five datasets and show to improve the misclassification rate between similar categories as well as the misclassification rate of phantom detections on backgrounds. Additionally, we compare our approach against a state-of-the-art convolutional neural network and show to outperform it under significant occlusions.

Adding discriminative power to hierarchical compositional models for object class detection

Mon, 01 Jan 0001 00:00:00 +0000

In recent years, hierarchical compositional models have been shown to possess many appealing properties for the object class detection such as coping with potentially large number of object categories. The reason is that they encode categories by hierarchical vocabularies of parts which are shared among the categories. On the downside, the sharing and purely reconstructive nature causes problems when categorizing visually-similar categories and separating them from the background. In this paper we propose a novel approach that preserves the appealing properties of the generative hierarchical models, while at the same time improves their discrimination properties. We achieve this by introducing a network of discriminative nodes on top of the existing generative hierarchy. The discriminative nodes are sparse linear combinations of activated generative parts. We show in the experiments that the discriminative nodes consistently improve a state-of-the-art hierarchical compositional model. Results show that our approach considers only a fraction of all nodes in the vocabulary (less than $10%$) which also makes the system computationally efficient.

Aktivno učenje in vzajemnost med učiteljem in učencem

Mon, 01 Jan 0001 00:00:00 +0000

Osnovni cilj aktivnega učenja je doseči želeno uspešnost danega učnega algoritma s čim manjšim številom učnih primerov. Vzrok te težnje je v dejstvu, da je označevanje učnih primerov običajno drago zaradi količine časa in umskega napora človeškega označevalca. Vendar pa ima aktivno učenje pomanjkljivosti, kot so neupoštevanje stopnje učiteljevega poznavanja problema in pomanjkanje mehanizma za zagotavljanje razumljivosti aktivno izbranih učnih primerov. V tem članku predlagamo nov pristop k aktivnemu učenju, t. i. “vzajemno aktivno učenje”, ki umetnemu inteligentnemu učencu pomaga, da svojemu učitelju zastavi kar se da jasna in razumljiva vprašanja. Tovrstno učenje se izkaže za bolj zanesljivo in uspešno v primerjavi z osnovnim aktivnim učenjem.

Aktivno učenje z mešanimi oznakami za detekcijo površinskih napak z globokimi nevronskimi mrežami

Mon, 01 Jan 0001 00:00:00 +0000

This paper investigates active learning strategies for mixed supervision in surface defect detection, where we search for a minimal set of samples selected for more accurate manual segmentation. We explore several approaches for sample selection based on entropy, margin sampling, and least confidence and apply them to a mixed supervision method, SegDecNet. We additionally explore extending active learning with probability calibration and equal sampling by categories to improve the robustness. Active learning approaches are evaluated on the KSDD2 dataset and compared against random sampling and a related purpose-built method for active learning in surface defect detection. We demonstrate that the least confidence method with the proposed extensions an outperform random sampling and other methods, achieving the same result as fully annotated dataset while requiring only a third of the fully annotated samples.

An adaptive coupled-layer visual model for robust visual tracking

Mon, 01 Jan 0001 00:00:00 +0000

This paper addresses the problem of tracking objects which undergo rapid and significant appearance changes. We propose a novel coupled-layer visual model that combines the target’s global and local appearance. The local layer in this model is a set of local patches that geometrically constrain the changes in the target’s appearance. This layer probabilistically adapts to the target’s geometric deformation, while its structure is updated by removing and adding the local patches. The addition of the patches is constrained by the global layer that probabilistically models target’s global visual properties such as color, shape and apparent local motion. The global visual properties are updated during tracking using the stable patches from the local layer. By this coupled constraint paradigm between the adaptation of the global and the local layer, we achieve a more robust tracking through significant appearance changes. Indeed, the experimental results on challenging sequences confirm that our tracker outperforms the related state-of-the-art trackers by having smaller failure rate as well as better accuracy.

An alternative way to calibrate ubisense real-time location system via multi-camera calibration methods

Mon, 01 Jan 0001 00:00:00 +0000

Ubisense Real-Time Location System is considered. The approach is based on capturing the raw angles of arrival and projecting them into virtual image plane, as if sensors were perspective cameras. The extrinsic parameters (position and orientation) of sensors are then obtained by calibration of virtual perspective cameras using multicamera calibration methods. An application considered in the paper is rapid deployment of Ubisense system for tracking in sports. Survey points can be easily determined from the standard markings on the court floor, which makes calibration from survey points coordinates more convenient than measuring sensor positions, which is prerequisite for standard Ubisense system calibration.

An Analysis Of Basketball Players' Movements In The Slovenian Basketball League Play-Offs Using The Sagit Tracking System

Mon, 01 Jan 0001 00:00:00 +0000

An integrated system for interactive continuous learning of categorical knowledge

Mon, 01 Jan 0001 00:00:00 +0000

This article presents an integrated robot system capable of interactive learning in dialogue with a human. Such a system needs to have several competencies and must be able to process different types of representations. In this article, we describe a collection of mechanisms that enable integration of heterogeneous competencies in a principled way. Central to our design is the creation of beliefs from visual and linguistic information, and the use of these beliefs for planning system behaviour to satisfy internal drives. The system is able to detect gaps in its knowledge and to plan and execute actions that provide information needed to fill these gaps. We propose a hierarchy of mechanisms which are capable of engaging in different kinds of learning interactions, e.g. those initiated by a tutor or by the system itself. We present the theory these mechanisms are build upon and an instantiation of this theory in the form of an integrated robot system. We demonstrate the operation of the system in the case of learning conceptual models of objects and their visual properties.

An integrated system for interactive continuous learning of categorical knowledge

Mon, 01 Jan 0001 00:00:00 +0000

This article presents an integrated robot system capable of interactive learning in dialogue with a human. Such a system needs to have several competencies and must be able to process dierent types of representations. In this article we describe a collection of mechanisms that enable integration of heterogeneous competencies in a principled way. Central to our design is the creation of beliefs from visual and linguistic information, and the use of these beliefs for planning system behaviour to satisfy internal drives. The system is able to detect gaps in its knowledge and to plan and execute actions that provide information needed to ll these gaps. We propose a hierarchy of mechanisms which are capable of engaging in dierent kinds of learning interactions, e.g. those initiated by a tutor or by the system itself. We present the theory these mechanisms are build upon and an instantiation of this theory in the form of an integrated robot system. We demonstrate the operation of the system in the case of learning conceptual models of objects and their visual properties.

Analiza robustnosti globokih nenadzorovanih metod za detekcijo vizualnih anomalij

Mon, 01 Jan 0001 00:00:00 +0000

Unsupervised generative methods have recently attracted significant attention in the field of industrial visual anomaly detection, mainly owing to their ability to learn from non anomalous data withouth requiring anomalous samples and pixel-level labels, which are costly to obtain. An assumption that anomalous data are always correctly identified and consequently removed from the training set underlies all of the generative methods. In practice, however, correctly identifying every single anomalous image can either be very costly to do or it can not be done at all due to the nature of the problem. In this paper, we analyze how robust some of the recently proposed generative methods for anomaly detection are, by introducing anomalous data in the training process. Our analysis covers 3 methods and 4 datasets with 8 categories in total, and we conclude that while some of the methods are more robust than others, introducing a minor percentage of anomalous data in the training set does not significantly deteriorate the performance.

Analysis of multi-agent activity using Petri nets

Mon, 01 Jan 0001 00:00:00 +0000

This paper presents the use of Place/Transition Petri Nets (PNs) for the recognition and evaluation of complex multi-agent activities. The PNs were built automatically from the activity templates that are routinely used by experts to encode domain-specific knowledge. The PNs were built in such a way that they encoded the complex temporal relations between the individual activity actions. We extended the original PN formalism to handle the propagation of evidence using net tokens. The evaluation of the spatial and temporal properties of the actions was carried out using trajectory-based action detectors and probabilistic models of the action durations. The presented approach was evaluated using several examples of real basketball activities. The obtained experimental results suggest that this approach can be used to determine the type of activity that a team has performed as well as the stage at which the activity ended.

Anomalous Sound Detection by Feature-Level Anomaly Simulation

Mon, 01 Jan 0001 00:00:00 +0000

Recently a growing number of works focus on machine defect detection from anomalous audio patterns. The datasets for the machine audio domain are scarce and recent methods that perform well on benchmarks such as DCASE2020 Task 2, rely on auxiliary information such as annotated data from other training classes in the domain to extract information that can be used in deep-learning classification-based anomaly detection approaches. However, in practical scenarios, annotated data from the same domain may not be readily available so annotation-free methods that can learn appropriate audio representations from unannotated data are needed. We propose AudDSR, a simulation-based anomaly detection method that learns to detect anomalies without additional annotated data and instead focuses on a discrete feature space sampling method for an anomaly simulation process. AudDSR outperforms competing methods that do not rely on annotated data on the DCASE2020 anomalous sound detection benchmark and even matches the performance of some methods that utilize additional annotation information.

AnomalyVFM - Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

Mon, 01 Jan 0001 00:00:00 +0000

Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision–language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by significant 3.3 percentage points. Project Page

Appearance-based localization using CCA

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we present an appearance-based approach to mobile robot localization based on Canonical Correlation Analysis. The main idea is to learn the relation between the appearances of the environment from a number of training locations and coordinates of these locations using CCA and then to use this knowledge to estimate the position of the robot in the localization stage. We present results of several experiments, which show that this approach is faster and less demanding in terms of space than traditional PCA-based approach, however in its standard form it yields in general inferior localization results.

Application of Temporal Convolutional Neural Network for the Classification of Crops on SENTINEL-2 Time Series

Mon, 01 Jan 0001 00:00:00 +0000

Application of the HIDRA2 deep-learning model for sea level forecasting along the Estonian coast of the Baltic Sea

Mon, 01 Jan 0001 00:00:00 +0000

Approximating Distributions Through Mixtures of Gaussians

Mon, 01 Jan 0001 00:00:00 +0000

Automated detection and segmentation of cracks in concrete surfaces using joined segmentation and classification deep neural network

Mon, 01 Jan 0001 00:00:00 +0000

Automated quality control of pavement and concrete surfaces is essential for maintaining structural integrity and consistency in the construction and infrastructure industries. This paper presents a novel deep learning model designed for automated quality control of these surfaces during both construction and maintenance phases. The model employs per-pixel segmentation and per-image classification, integrating both local and broader context information. Additionally, we utilize the classification results to improve segmentation during both training and inference stages. We evaluated the proposed model on a publicly available dataset containing more than 7,000 images of pavement and concrete cracks. The model achieved a Dice score of 81% and an intersection-over-union of 71%, surpassing publicly available state-of-the-art methods by at least 6-7 percentage points. An ablation study confirms that leveraging classification information enhances overall segmentation performance. Furthermore, our model is computationally efficient, processing over 30 FPS for 512x512 images, making it suitable for real-time applications on medium-resolution images. Upon acceptance, both the code and the corrected dataset ground truths will be made publicly available.

Automatic Evaluation of Organized Basketball Activity

Mon, 01 Jan 0001 00:00:00 +0000

In this article the trajectory-based evaluation of multi-player basketball activity is addressed. The organized basketball activity consists of a set of key elements and their temporal relations. The activity evaluation is performed by analyzing individually each of them and the final reasoning about the activity is achieved using the Bayesian network. The network structure is obtained automatically from the activity template which is a standard tool used by the basketball experts. The experimental results suggest that our approach can successfully evaluate the quality of the observed activity.

Automatic fruit recognition using computer vision

Mon, 01 Jan 0001 00:00:00 +0000

Osrednja tema diplomske naloge je bila analiza primernosti razlicnih algoritmov racunalniskega vida za problem razpoznavanja sadja. Sadje nudi zahtevno domeno za razpoznavanje zaradi svoje raznovrstnosti med sadezi istega razreda, podobnosti sadezev razlicnih razredov in samega stevila razlicnih sadezev. Za uspesno razpoznavanje sadja je bilo potrebno slike opisati z dobrim atributnim zapisom. Informacije o barvi, teksturi, velikosti in obliki sadezev so bile zajete s pomocjo uveljavljenih opisnikov. Klasikacija slik na podlagi atributnega zapisa pridobljenega s pomocjo teh opisnikov je potekala z ze uveljavljenimi klasikacijskimi metodami s podrocja strojnega ucenja. Za uspesnost klasikacijskih metod je bilo potrebno pridobiti veliko in dobro zbirko slik sadja. Ker taksna javno dostopna zbirka slik sadja ne obstoja, jo je bilo potrebno zajeti. Na podlagi analize rezultatov v diplomskem delu je bil zgrajen priporocilni sistem za razpoznavanje sadja, ki je na zahtevni zajeti zbirki slik dosegel kar 85% uspesnost.

Avtomatsko modeliranje 3-dimenzionalnih veèbarvnih predmetov z uporabo globinskega senzorja

Mon, 01 Jan 0001 00:00:00 +0000

V magistrski nalogi je opisan postopek avtomatske gradnje 3-D modelov resniènih predmetov iz globinskih slik. Za zajemanje globinskih slik uporabljamo globinski senzor, ki deluje na principu aktivne triangulacije z uporabo kodirane svetlobe. Razvili smo nov pristop za gradnjo slik z dinamiènim obsegom intenzitetnih vrednosti iz veèjega števila navadnih intenzitetnih slik, posnetih ob razliènih osvetljenostih. Z uporabo slik z dinamiènim intenzitetnim obsegom lahko uspešno izraèunamo tudi globinske slike predmetov z nehomogenimi odbojnimi lastnostmi. Iz globinske slike nato izraèunamo 3-D koordinate toèk, ki ležijo na površini predmeta in so vidne iz zornega kota kamere. S triangulacijo površine med dobljenimi toèkami zgradimo 2.5-D model, ki ga nato poenostavimo, za bolj realistièen izgled pa nanj nalepimo teksturo. Na koncu še združimo veè 2.5-D modelov istega predmeta v enoten 3-D model.

Back To The Drawing Board: Rethinking Scene-Level Sketch-Based Image Retrieval

Mon, 01 Jan 0001 00:00:00 +0000

The goal of Scene-level Sketch-Based Image Retrieval is to retrieve natural images matching the overall semantics and spatial layout of a free-hand sketch. Unlike prior work focused on architectural augmentations of retrieval models, we emphasize the inherent ambiguity and noise present in real-world sketches. This insight motivates a training objective that is explicitly designed to be robust to sketch variability. We show that with an appropriate combination of pre-training, encoder architecture, and loss formulation, it is possible to achieve state-of-the-art performance without the introduction of additional complexity. Extensive experiments on a challenging FS-COCO and widely-used SketchyCOCO datasets confirm the effectiveness of our approach and underline the critical role of training design in cross-modal retrieval tasks, as well as the need to improve the evaluation scenarios of scene-level SBIR.

Bayes Spectral Entropy-Based Measure of Camera Focus

Mon, 01 Jan 0001 00:00:00 +0000

Be the Change You Want to See: Revisiting Remote Sensing Change Detection Practices

Mon, 01 Jan 0001 00:00:00 +0000

Remote sensing change detection aims to localize semantic changes between images of the same location captured at different times. In the past few years, newer methods have attributed enhanced performance to the additions of new and complex components to existing architectures. Most fail to measure the performance contribution of fundamental design choices such as backbone selection, pre-training strategies, and training configurations. We claim that such fundamental design choices often improve performance even more significantly than the addition of new architectural components. Due to that, we systematically revisit the design space of change detection models and analyse the full potential of a well-optimised baseline. We identify a set of fundamental design choices that benefit both new and existing architectures. Leveraging this insight, we demonstrate that when carefully designed, even an architecturally simple model can match or surpass state-of-the-art performance on six challenging change detection datasets. Our best practices generalise beyond our architecture and also offer performance improvements when applied to related methods, indicating that the space of fundamental design choices has been underexplored. Our guidelines and architecture provide a strong foundation for future methods, emphasizing that optimizing core components is just as important as architectural novelty in advancing change detection performance.

Beyond monthly composites: maximizing information retention in satellite image time series for forest stand classification

Mon, 01 Jan 0001 00:00:00 +0000

This study investigates the effectiveness of data pre-processing and classifier selection in forest stand classification using Satellite Image Time Series (SITS). We compare the performance of Random Forest (RF) and Light Gradient Boosting Machine (LightGBM) on monthly composites and dense time series. While the monthly RF achieves an average accuracy of 74.1%, the use of LightGBM results in lower performance on monthly composites. Our approach, which utilizes synthetic bands generated based on the available Sentinel−2 SITS, improved RF performance by 13.2 percentage points, exceeding the improvement observed when using 10-day composites. This highlights the loss of information that occurs when using composites. LightGBM improved the results by an additional 1.9 percentage points. However, without additional pre-processing, LightGBM can use the raw SITS and outperform these results with an F1 score of 0.906. The generated map was further improved by using margin values to highlight uncertainties and mask areas of uncertainty. Overall, while monthly composites provide a good starting point, the best results are obtained with raw SITS, which allows efficient processing for larger regions without additional pre-processing.

Beyond standard benchmarks: Parameterizing performance evaluation in visual object tracking

Mon, 01 Jan 0001 00:00:00 +0000

Object-to-camera motion produces a variety of apparent motion patterns that significantly affect performance of short-term visual trackers. Despite being crucial for designing robust trackers, their influence is poorly explored in standard benchmarks due to weakly defined, biased and overlapping attribute annotations. In this paper we propose to go beyond pre-recorded benchmarks with post-hoc annotations by presenting an approach that utilizes omnidirectional videos to generate realistic, consistently annotated, short-term tracking scenarios with exactly parameterized motion patterns. We have created an evaluation system, constructed a fully annotated dataset of omnidirectional videos and generators for typical motion patterns. We provide an in-depth analysis of major tracking paradigms which is complementary to the standard benchmarks and confirms the expressiveness of our evaluation approach.

Binding and Cross-modal Learning in Markov Logic Networks

Mon, 01 Jan 0001 00:00:00 +0000

Binding the ability to combine two or more modal representations of the same entity into a single shared representation is vital for every cognitive system operating in a complex environment. In order to successfully adapt to changes in an dynamic environment the binding mechanism has to be supplemented with cross-modal learning. In this paper we define the problems of high-level binding and cross-modal learning. By these definitions we model a binding mechanism and a cross-modal learner in Markov logic network and test the system on a synthetic object database.

Binding and Cross-modal Learning in Markov Logic Networks

Mon, 01 Jan 0001 00:00:00 +0000

Binding – the ability to combine two or more modal representations of the same entity into a single shared representation is vital for every cognitive system operating in a complex environment. In order to successfully adapt to changes in an dynamic environment the binding mechanism has to be supplemented with cross-modal learning. In this paper we define the problems of high-level binding and cross-modal learning. By these definitions we model a binding mechanism and a cross-modal learner in Markov logic network and test the system on a synthetic object database.

Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

Mon, 01 Jan 0001 00:00:00 +0000

Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student’s pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources.

Categorial Perception

Mon, 01 Jan 0001 00:00:00 +0000

CDTB: A Color and Depth Visual Object Tracking Dataset and Benchmark

Mon, 01 Jan 0001 00:00:00 +0000

A long-term visual object tracking performance evaluation methodology and a benchmark are proposed. Performance measures are designed by following a long-term tracking definition to maximize the analysis probing strength. The new measures outperform existing ones in interpretation potential and in better distinguishing between different tracking behaviors. We show that these measures generalize the short-term performance measures, thus linking the two tracking problems. Furthermore, the new measures are highly robust to temporal annotation sparsity and allow annotation of sequences hundreds of times longer than in the current datasets without increasing manual annotation labor. A new challenging dataset of carefully selected sequences with many target disappearances is proposed. A new tracking taxonomy is proposed to position trackers on the short-term/long-term spectrum. The benchmark contains an extensive evaluation of the largest number of long-term tackers and comparison to state-of-the-art short-term trackers. We analyze the influence of tracking architecture implementations to long-term performance and explore various re-detection strategies as well as influence of visual model update strategies to long-term tracking drift. The methodology is integrated in the VOT toolkit to automate experimental analysis and benchmarking and to facilitate future development of long-term trackers.

Center Direction Network for Grasping Point Localization on Cloths

Mon, 01 Jan 0001 00:00:00 +0000

Object grasping is a fundamental challenge in robotics and computer vision, critical for advancing robotic manipulation capabilities. Deformable objects, like fabrics and cloths, pose additional challenges due to their non-rigid nature. In this work, we introduce CeDiRNet-3DoF, a deep-learning model for grasp point detection, with a particular focus on cloth objects. CeDiRNet-3DoF employs center direction regression alongside a localization network, attaining first place in the perception task of ICRA 2023’s Cloth Manipulation Challenge. Recognizing the lack of standardized benchmarks in the literature that hinder effective method comparison, we present the ViCoS {Towel} Dataset. This extensive benchmark dataset comprises 8,000 real and 12,000 synthetic images, serving as a robust resource for training and evaluating contemporary data-driven deep-learning approaches. Extensive evaluation revealed CeDiRNet-3DoF’s robustness in real-world performance, outperforming state-of-the-art methods, including the latest transformer-based models. Our work bridges a crucial gap, offering a robust solution and benchmark for cloth grasping in computer vision and robotics. Code and dataset are available at: https://github.com/vicoslab/CeDiRNet-3DoF

Cheating Depth: Enhancing 3D Surface Anomaly Detection via Depth Simulation

Mon, 01 Jan 0001 00:00:00 +0000

RGB-based surface anomaly detection methods have advanced significantly. However, certain surface anomalies remain practically invisible in RGB alone, necessitating the incorporation of 3D information. Existing approaches that employ point-cloud backbones suffer from suboptimal representations and reduced applicability due to slow processing. Re-training RGB backbones, designed for faster dense input processing, on industrial depth datasets is hindered by the limited availability of sufficiently large datasets. We make several contributions to address these challenges. (i) We propose a novel Depth-Aware Discrete Autoencoder (DADA) architecture, that enables learning a general discrete latent space that jointly models RGB and 3D data for 3D surface anomaly detection. (ii) We tackle the lack of diverse industrial depth datasets by introducing a simulation process for learning informative depth features in the depth encoder. (iii) We propose a new surface anomaly detection method 3DSR, which outperforms all existing state-of-the-art on the challenging MVTec3D anomaly detection benchmark, both in terms of accuracy and processing speed. The experimental results validate the effectiveness and efficiency of our approach, highlighting the potential of utilizing depth information for improved surface anomaly detection.

Closed-world tracking of multiple interacting targets for indoor-sports applications

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we present an efficient algorithm for tracking multiple players during indoor sports matches. A sports match can be considered as a semi-controlled environment for which a set of closed-world assumptions regarding the visual as well as the dynamical properties of the players and the court can be derived. These assumptions are then used in the context of particle filtering to arrive at a computationally fast, closed-world, multi-player tracker. The proposed tracker is based on multiple, single-player trackers, which are combined using a closed-world assumption about the interactions among players. With regard to the visual properties, the robustness of the tracker is achieved by deriving a novel sports-domain-specific likelihood function and employing a novel background-elimination scheme. The restrictions on the player’s dynamics are enforced by employing a novel form of local smoothing. This smoothing renders the tracking more robust and reduces the computational complexity of the tracker. We evaluated the proposed closed-world, multi-player tracker on a challenging data set. In comparison with several similar trackers that did not utilize all of the closed-world assumptions, the proposed tracker produced better estimates of position and prediction as well as reducing the number of failures.

Co-segmentation for visual object tracking

Mon, 01 Jan 0001 00:00:00 +0000

Cognitive Systems

Mon, 01 Jan 0001 00:00:00 +0000

Combining Reconstructive and Discriminative Subspace Methods for Robust Classification and Regression by Subsampling

Mon, 01 Jan 0001 00:00:00 +0000

Linear subspace methods that provide sufficient reconstruction of the data such as PCA offer an efficient way of dealing with missing pixels, outliers, and occlusions that often appear in the visual data. Discriminative methods, such as LDA and CCA, which on the other hand, are better suited for classification and regression tasks, are highly sensitive to corrupted data. We present a theoretical framework for achieving best of both types of methods: an approach that combines the discrimination power of discriminative methods with the reconstruction property of reconstructive methods which enables one to work on subsets of pixels in images, to efficiently detect and reject the outliers. The proposed approach is therefore capable of robust classification/regression with a high-breakdown point. The theoretical results are demonstrated on several computer vision tasks showing that the proposed approach significantly outperforms the standard discriminative methods in the case of missing pixels and images containing occlusions and outliers.

Comparing different learning approaches in categorical knowledge acquisition

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we address the problem of acquiring categorical knowledge from the active learning perspective. We describe and implement several teacher and learner-driven approaches that require different levels of teacher competencies and consider different types of knowledge for selection of training samples. The experimental results show that the active learning approach outperforms the passive one and that the adaptation of the learning process to the learner’s knowledge significantly improves the learning performance.

Computer vision - CVWW '04 : proceedings of the 9th Computer Vision Winter Workshop

Mon, 01 Jan 0001 00:00:00 +0000

Conservative visual learning for object detection with minimal hand labeling effort

Mon, 01 Jan 0001 00:00:00 +0000

We present a novel framework for unsupervised training of an object detection system. The basic idea is to (1) exploit a huge amount of unlabeled video data by being very conservative in selecting training examples; and (2) to start with a very simple object detection system and using generative and discriminative classifiers in an iterative co- training fashion arriving at a better object detector. We demonstrate the framework on a surveillance task where we learn a person detector. We start with a simple moving object classiffier and proceed with a robust PCA (on shape and appearance) as a generative classiffier which in turn generates a training set for a discriminative AdaBoost classiffier. The results obtained by AdaBoost are again filtered by PCA which produces an even better training set. We demonstrate that by using this approach we avoid hand labeling training data and still achieve a state of the art detection rate.

Context awareness for object detection

Mon, 01 Jan 0001 00:00:00 +0000

A wide range of algorithms have been proposed to detect objects in still images. However, most of the current approaches are purely based on local appearance and ignore the context in which these objects are embedded. This paper proposes a general approach to extract, learn and use contextual information from images to increase the performance of classical object detection methods. The important properties of the proposed approach are that it can be combined with any existing object detection method and it provides a general framework not limited to one specific object category.

Context Driven Focus of Attention for Object Detection

Mon, 01 Jan 0001 00:00:00 +0000

Context plays an important role in general scene perception. In particular, it can provide cues about an object’s location within an image. In computer vision, object detectors typically ignore this information. We tackle this problem by presenting a concept of how to extract and learn contextual information from examples. This context is then used to calculate a focus of attention, that represents a prior for object detection. State-of-the-art local appearance-based object detection methods are then applied on selected parts of the image only. We demonstrate the performance of this approach on the task of pedestrian detection in urban scenes using a demanding image database. Results show that context awareness provides complementary information over pure local appearance-based processing. In addition, it cuts down the search complexity and increases the robustness of object detection.

Continuous Learning of Simple Visual Concepts using Incremental Kernel Density Estimation

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we propose a method for continuous learning of simple visual concepts. The method continuously associates words describing observed scenes with automatically extracted visual features. Since in our setting every sample is labelled with multiple concept labels, and there are no negative examples, reconstructive representations of the incoming data are used. The associated features are modelled with kernel density probability distribution estimates, which are built incrementally. The proposed approach is applied to the learning of object properties and spatial relations.

Correcting decalibration of stereo cameras in self-driving vehicles

Mon, 01 Jan 0001 00:00:00 +0000

CRITER 1.0: a coarse reconstruction with iterative refinement network for sparse spatio-temporal satellite data

Mon, 01 Jan 0001 00:00:00 +0000

Satellite observations of sea surface temperature (SST) are essential for accurate weather forecasting and climate modeling. However, these data often suffer from incomplete coverage due to cloud obstruction and limited satellite swath width, which requires development of dense reconstruction algorithms. The current state of the art struggles to accurately recover high-frequency variability, particularly in SST gradients in ocean fronts, eddies, and filaments, which are crucial for downstream processing and predictive tasks. To address this challenge, we propose a novel two-stage method CRITER (Coarse Reconstruction with ITerative Refinement Network), which consists of two stages. First, it reconstructs low-frequency SST components utilizing a Vision Transformer-based model, leveraging global spatio-temporal correlations in the available observations. Second, a UNet type of network iteratively refines the estimate by recovering high-frequency details. Extensive analysis on datasets from the Mediterranean, Adriatic, and Atlantic seas demonstrates CRITER’s superior performance over the current state of the art. Specifically, CRITER achieves up to 44 % lower reconstruction errors of the missing values and over 80 % lower reconstruction errors of the observed values compared to the state of the art.

Cross-modal learning

Mon, 01 Jan 0001 00:00:00 +0000

Cross-modal learning refers to any kind of learning that involves information obtained from more than one modality. In the literature the term modality typically refers to a sensory modality, also known as stimulus modality. A stimulus modality provides information obtained from a particular sensorial input, for example visual, auditory, olfactory, or kinesthetic information. Examples from artificial cognitive systems (“robots”) include also information about detected range (by sonar or laser range-finders), movement (by odometry sensors), or motor state (by proprioceptive sensors). We adopt here the notion of modality that includes both the sensorial data, and further interpretations of that data within the modality. For example, from a pair of (depth-calibrated) images, a cloud of points in 3-dimensional space can be computed. We obtain both types of data (the image data, and the 3D points) from the same visual sensor. At the same time, they differ in what information they provide. We consider information sources derived from sensorial data as derived modalities that by themselves can be involved again in cross-modal learning.

D3S - A Discriminative Single Shot Segmentation Tracker

Mon, 01 Jan 0001 00:00:00 +0000

Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but are restricted to bounding box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker – D3S, which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties, one invariant to a broad range of transformations, including non-rigid deformations, the other assuming a rigid object to simultaneously achieve high robustness and online target segmentation. Without per-dataset finetuning and trained only for segmentation as the primary output, D3S outperforms all trackers on VOT2016, VOT2018 and GOT-10k benchmarks and performs close to the state-of-the-art trackers on the TrackingNet. D3S outperforms the leading segmentation tracker SiamMask on video object segmentation benchmarks and performs on par with top video object segmentation algorithms, while running an order of magnitude faster, close to real-time.

DAL: A Deep Depth-Aware Long-term Tracker

Mon, 01 Jan 0001 00:00:00 +0000

Dana36: A Multi-Camera Image Dataset for Object Identification in Surveillance Scenarios

Mon, 01 Jan 0001 00:00:00 +0000

We present a novel dataset for evaluation of object matching and recognition methods in surveillance scenarios. Dataset consists of more than 23,000 images, depicting 15 persons and nine vehicles. A ground truth data - the identity of each person or vehicle - is provided, along with the coordinates of the bounding box in the full camera image. The dataset was acquired from 36 stationary camera views using a variety of surveillance cameras with resolutions ranging from standard VGA to three megapixel. 27 cameras observed the persons and vehicles in an outdoor environment, while the remaining nine observed the same persons indoors. The activity of persons was planned in advance, they drive the cars to the parking lot, exit the cars and walk around the building, through the main entrance, and up the stairs, towards the first floor of the building. The intended use of the dataset is performance evaluation of computer vision methods that aim to (re)identify people and objects from many different viewpoints in different environments and under variable conditions. Due to variety of camera locations, vantage points and resolutions, the dataset provides means to adjust the difficulty of the identification task in a controlled and documented manner. An interface for easy use of dataset within Matlab is provided as well, and the data is complemented by baseline results using a basic color histogram-based descriptor. While the cropped images of persons and vehicles represent the primary data in our dataset, we also provide full-frame images and a set of tracklets for each object as a courtesy to the dataset users.

Deep Learning for Large-Scale Traffic-Sign Detection and Recognition

Mon, 01 Jan 0001 00:00:00 +0000

Automatic detection and recognition of traffic signs plays a crucial role in management of the traffic-sign inventory. It provides accurate and timely way to manage traffic-sign inventory with a minimal human effort. In the computer vision community the recognition and detection of traffic signs is a well-researched problem. A vast majority of existing approaches perform well on traffic signs needed for advanced drivers-assistance and autonomous systems. However, this represents a relatively small number of all traffic signs (around 50 categories out of several hundred) and performance on the remaining set of traffic signs, which are required to eliminate the manual labor in traffic-sign inventory management, remains an open question. In this paper, we address the issue of detecting and recognizing a large number of traffic-sign categories suitable for automating traffic-sign inventory management. We adopt a convolutional neural network (CNN) approach, the Mask R-CNN, to address the full pipeline of detection and recognition with automatic end-to-end learning. We propose several improvements that are evaluated on the detection of traffic signs and result in an improved overall performance. This approach is applied to detection of 200 traffic-sign categories represented in our novel dataset. Results are reported on highly challenging traffic-sign categories that have not yet been considered in previous works. We provide comprehensive analysis of the deep learning method for the detection of traffic signs with large intra-category appearance variation and show below 3% error rates with the proposed approach, which is sufficient for deployment in practical applications of traffic-sign inventory management.

Deep reinforcement learning for map-less goal-driven robot navigation

Mon, 01 Jan 0001 00:00:00 +0000

Mobile robots that operate in real-world environments need to be able to safely navigate their surroundings. Obstacle avoidance and path planning are crucial capabilities for achieving autonomy of such systems. However, for new or dynamic environments, navigation methods that rely on an explicit map of the environment can be impractical or even impossible to use. We present a new local navigation method for steering the robot to global goals without relying on an explicit map of the environment. The proposed navigation model is trained in a deep reinforcement learning framework based on Advantage Actor–Critic method and is able to directly translate robot observations to movement commands. We evaluate and compare the proposed navigation method with standard map-based approaches on several navigation scenarios in simulation and demonstrate that our method is able to navigate the robot also without the map or when the map gets corrupted, while the standard approaches fail. We also show that our method can be directly transferred to a real robot.

Deep-learning transformer-based sea level modeling ensemble for the Adriatic basin

Mon, 01 Jan 0001 00:00:00 +0000

Storm surges and coastal floods are persistent threats to civil and economic safety in the Northern Adriatic. Meteorologically induced sea level signal is, however, often difficult to forecast deterministically due to the resonant character of the Adriatic basin. A standard solution is therefore resorting to ensembles of numerical ocean models, which are numerically expensive. In recent years, deep-learning-based methods have shown significant potential for numerically cheap alternatives. This is the venue followed in our work. We propose a new deep-learning transformer-based architecture HIDRA-T, a continuation of our recent model HIDRA2 (Rus et al., GMD 2023), which outperformed both state-of-the-art deep-learning network design HIDRA1 and two state-of-the-art numerical ocean models (a NEMO engine and a SCHISM ocean modeling system). HIDRA-T is our latest attempt at sea level forecasting, employing novel transformer-based atmospheric and sea level encoders. Transformers are designed for sequential data, and in HIDRA-T we use self-attention blocks to extract features from the atmospheric data firstly by tokenizing over spatial dimension, then over temporal dimension. HIDRA-T was trained on surface wind and pressure fields from the ECMWF atmospheric ensemble and on Koper tide gauge observations. On an independent and challenging test set, HIDRA-T outperforms all other models, reducing previous best mean absolute forecast error in storm events of HIDRA2 by 2.6 %.

Deep-learning-based computer vision system for surface-defect detection

Mon, 01 Jan 0001 00:00:00 +0000

Automating optical-inspection systems using machine learning has become an interesting and promising area of research. In particular, the deep-learning approaches have shown a very high and direct impact on the application domain of visual inspection. This paper presents a complete inspection system for automated quality control of a specific industrial product. Both hardware and software part of the system are described, with machine vision used for image acquisition and pre-processing followed by a segmentation-based deep-learning model used for surface-defect detection. The deep-learning model is compared with the state-of-the-art commercial software, showing that the proposed approach outperforms the related method on the specific domain of surface-crack detection. Experiments are performed on a real-world quality-control case and demonstrate that the deep-learning model can be successfully used even when only 33 defective training samples are available. This makes the deep-learning method practical for use in industry where the number of available defective samples is limited.

Deformable Parts Correlation Filters for Robust Visual Tracking

Mon, 01 Jan 0001 00:00:00 +0000

Deformable parts models show a great potential in tracking by principally addressing non-rigid object deformations and self occlusions, but according to recent benchmarks, they often lag behind the holistic approaches. The reason is that potentially large number of degrees of freedom have to be estimated for object localization and simplifications of the constellation topology are often assumed to make the inference tractable. We present a new formulation of the constellation model with correlation filters that treats the geometric and visual constraints within a single convex cost function and derive a highly efficient optimization for MAP inference of a fully-connected constellation. We propose a tracker that models the object at two levels of detail. The coarse level corresponds a root correlation filter and a novel color model for approximate object localization, while the mid-level representation is composed of the new deformable constellation of correlation filters that refine the object location. The resulting tracker is rigorously analyzed on a highly challenging OTB, VOT2014 and VOT2015 benchmarks, exhibits a state-of-the-art performance and runs in real-time.

Demonstracijska celica za prikaz globokega učenja v praktičnih aplikacijah

Mon, 01 Jan 0001 00:00:00 +0000

V zadnjih letih so metode globokega učenja postale ključno orodje za reševanje raznolikih praktičnih izzivov. Kljub temu pa potencial takih metod pogosto ostaja slabo razumljiv širši javnosti zaradi pogostega ločevanja razvoja in demonstracije algoritmov od dejanskih praktičnih problemov, ki jih algoritmi naslavljajo. V tem članku predstavljamo demonstracijsko celico, ki združuje strojno in programsko opremo ter algoritme globokega učenja, omogočajoč enostavno prikazovanje delovanja teh metod v različnih aplikativnih domenah. Celica vključuje kamere, grafični vmesnik in pet demonstracijskih programov, ki demonstrirajo klasifikacijo lesenih desk, detekcijo površinskih anomalij, štetje polipov, detekcijo prometnih znakov in detekcijo vogalov tekstilnih izdelkov. Implementiran modularni pristop omogoča enostavno integracijo različnih algoritmov globokega učenja. Sistem omogoča boljše razumevanje in uporabo teh metod v praktičnih scenarijih ter prispeva k razvoju inovativnih rešitev na področju globokega učenja.

Dense Center-Direction Regression for Object Counting and Localization with Point Supervision

Mon, 01 Jan 0001 00:00:00 +0000

Object counting and localization problems are commonly addressed with point supervised learning, which allows the use of less labor-intensive point annotations. However, learning based on point annotations poses challenges due to the high imbalance between the sets of annotated and unannotated pixels, which is often treated with Gaussian smoothing of point annotations and focal loss. However, these approaches still focus on the pixels in the immediate vicinity of the point annotations and exploit the rest of the data only indirectly. In this work, we propose a novel approach termed CeDiRNet for point-supervised learning hat uses a dense regression of directions pointing owards the nearest object centers, i.e. center-directions. This provides greater support for each center point arising from many surrounding pixels pointing towards the object center. We propose a formulation of center-directions that allows the problem to be split into the domain-specific dense regression of center-directions and the final localization task based on a small, lightweight, and domain-agnostic localization network that can be trained with synthetic data completely independent of the target domain. We demonstrate the performance of the proposed method on six different datasets for object counting and localization, and show that it outperforms the existing state-of-the-art methods. Keywords: Point-Supervision, Object Counting, Object Localization, Center-Point Prediction, Center-Direction Regression, CeDiRNet

Depth Fingerprinting for Obstacle Tracking using 3D Point Cloud

Mon, 01 Jan 0001 00:00:00 +0000

We present a method for automatic detection and tracking of obstacles on water surface that uses solely the point cloud obtained from the surroundings of the unmanned surface vehicle (USV). For this purpose, we use a calibrated pair of stereo cameras, affixed to the mast at the front of the USV. Reliable obstacle tracking in outdoor environment is a difficult task, but unlike the monocular approaches, our framework offloads a large part of the problem onto the method that provides a point cloud. In absence of other visual features, our method introduces \emph{depth fingerprint}, a histogram-like feature obtained from the point cloud of an object. The method has been evaluated on the yet unreleased MODD2 dataset and shows promising results, with the depth fingerprinting significantly outperforming tracking based solely on optimal assignment weighted by geometrical distance between object detections (Munkres algorithm). The proposed method is capable of running in real time on board of a small-sized USV.

Detection of surface defects on pharmaceutical solid oral dosage forms with convolutional neural networks

Mon, 01 Jan 0001 00:00:00 +0000

Deep-learning-based approaches have proven to outperform other approaches in various computer vision tasks, making application-focused machine learning a promising area of research in automated visual inspection. In this work, we apply deep learning to the challenging real-world problem domain of automated visual inspection of pharmaceutical products. We focus on investigating whether compact network architectures, adhering to performance, resource, and accuracy requirements, are suitable for usage in the pharmaceutical visual inspection domain. We propose a compact and efficient convolutional neural network architecture design for segmentation and scoring of surface defects, which we evaluate on challenging real-world datasets from the pharmaceutical product-inspection domain. In comparison with other related segmentation approaches, we achieve state-of-the-art performance in terms of defect detection as well as real-time computational efficiency. Compared to the nearest best-performing architecture we achieve state-of-the-art performance with merely 3% of the parameter count, an approximately 8-fold increase in inference speed, and increased surface defect detection performance.

Detekcija napak na površinah z uporabo anotiranih slik in globokim učenjem

Mon, 01 Jan 0001 00:00:00 +0000

Automated surface anomaly detection using machine learn-ing has become an interesting area of research with a very high direct impact to the application domain of visual inspection. Deep learning approaches seem to be very appropriate for enabling to teach inspection systems detecting surface anomalies by showing them a number of exemplar images. In this paper we present and analyze a deep learning architecture for segmentation of surface anomalies upgraded with a simple classification function that differentiates between images of faulty and defect free surfaces. The preliminary results show that the approach is very promising and that the deep learning paradigm is appropriate to be applied in the domain of automated visual inspection.

Detekcija ovir iz 3D oblaka točk za potrebe avtonomne plovbe

Mon, 01 Jan 0001 00:00:00 +0000

Detekcija površinskih napak na oblačilih za reciklažo z uporabo nadzorovanih metod globokega učenja

Mon, 01 Jan 0001 00:00:00 +0000

Efficient sorting of used garments is essential for textile recycling in the circular economy. Surface defect detection, such as identifying stains or tears, enables automated classification of items for reuse or recycling. In this paper, we focus on the problem of detecting surface defects on second-hand clothing using supervised deep learning methods. We present an analysis of our two previously proposed general-purpose surface defect detection models (SegDecNet and SuperSimpleNet) along with four modern backbone image classification architectures (ConvNeXt, ViT, Swin, and DINO). For evaluation, we curate a tailored binary classification dataset derived from the real-world garment dataset, including over 12000 annotated clothing images. Our results show that SuperSimpleNet significantly outperforms other methods, achieving an average precision of 72%, while highlight ing the inherent challenges of this task due to garment variability and subtle or occluded defects.

Detekcija točkovnih horizontalnih prometnih znakov, Tehnično poročilo, TR-LUVSS-17/02

Mon, 01 Jan 0001 00:00:00 +0000

Detekcija, lokalizacija in identifikacija oseb z več kamerami ter mapami značilnic

Mon, 01 Jan 0001 00:00:00 +0000

V clanku je predstavljen sistem za detekcijo, lokalizacijo in identifikacijo oseb v posameznih trenutkih, brez casovnega filtriranja, ki je prisotno v vecini tovrstnih sistemov. Glavni cilj predstavljenega pristopa je odpravljanje katastrofalnih napak, ki onemogocajo popolnoma samodejno obdelavo realisticno dolgih video posnetkov. Sistem temelji na zlivanju (fuziji) vec šibkih znacilnic, zapisanih v obliki map znacilnic, zlivanje pa je izvedeno s pomocjo enega ali vec naucenih razvršcevalnikov.

Dimensionality Reduction for Distributed Vision Systems Using Random Projection

Mon, 01 Jan 0001 00:00:00 +0000

Dimensionality reduction is an important issue in the context of distributed vision systems. Processing of dimensionality reduced data requires far less network resources (e.g., storage space, network bandwidth) than processing of original data. In this paper we explore the performance of the random projection method for distributed smart cameras. In our tests, random projection is compared to principal component analysis in terms of recognition efficiency (i.e., object recognition). The results obtained on the COIL-20 image data set show good performance of the random projection in comparison to the principal component analysis, which requires distribution of a subspace and therefore consumes more resources of the network. This indicates that random projection method can elegantly solve the problem of subspace distribution in embedded and distributed vision systems. Moreover, even without explicit orthogonalization or normalization of random projection transformation subspace, the method achieves good object recognition efficiency.

Discriminative Correlation Filter Tracker with Channel and Spatial Reliability

Mon, 01 Jan 0001 00:00:00 +0000

Short-term tracking is an open and challenging problem for which discriminative correlation filters (DCF) have shown excellent performance. We introduce the channel and spatial reliability concepts to DCF tracking and provide a learning algorithm for its efficient and seamless integration in the filter update and the tracking process. The spatial reliability map adjusts the filter support to the part of the object suitable for tracking. This both allows to enlarge the search region and improves tracking of non-rectangular objects. Reliability scores reflect channel-wise quality of the learned filters and are used as feature weighting coefficients in localization. Experimentally, with only two simple standard feature sets, HoGs and Colornames, the novel CSR-DCF method – DCF with Channel and Spatial Reliability – achieves state-of-the-art results on VOT 2016, VOT 2015 and OTB100. The CSR-DCF runs close to real-time on a CPU.

Discriminative Correlation Filter with Channel and Spatial Reliability

Mon, 01 Jan 0001 00:00:00 +0000

Short-term tracking is an open and challenging problem for which discriminative correlation filters (DCF) have shown excellent performance. We introduce the channel and spatial reliability concepts to DCF tracking and provide a novel learning algorithm for its efficient and seamless integration in the filter update and the tracking process. The spatial reliability map adjusts the filter support to the part of the object suitable for tracking. This both allows to enlarge the search region and improves tracking of non-rectangular objects. Reliability scores reflect channel-wise quality of the learned filters and are used as feature weighting coefficients in localization. Experimentally, with only two simple standard features, HoGs and Colornames, the novel CSR-DCF method – DCF with Channel and Spatial Reliability – achieves state-of-the-art results on VOT 2016, VOT 2015 and OTB100. The CSR-DCF runs in real-time on a CPU.

Domain-specific adaptations for region proposals

Mon, 01 Jan 0001 00:00:00 +0000

In this work we propose a novel approach towards the detection of all traffic sign boards. We propose to employ state-of-the-art region proposals as the first step to reduce the initial search space and provide a way to use a strong classifier for a fine-grade classification. We evaluate multiple region proposals on the domain of traffic sign detection and further propose various domain-specific adaptations to improve their performance. We show that edgeboxes with domain-specific learning and re-scoring based on trained shape information are able to significantly outperform remaining methods on German Traffic Sign Database. Furthermore, we show they achieve higher rate of recall with high-quality regions at the lower number of regions than the remaining methods.

DRAEM -- A discriminatively trained reconstruction embedding for surface anomaly detection

Mon, 01 Jan 0001 00:00:00 +0000

Visual surface anomaly detection aims to detect local image regions that significantly deviate from normal appearance. Recent surface anomaly detection methods rely on generative models to accurately reconstruct the normal areas and to fail on anomalies. These methods are trained only on anomaly-free images, and often require hand-crafted post-processing steps to localize the anomalies, which prohibits optimizing the feature extraction for maximal detection capability. In addition to reconstructive approach, we cast surface anomaly detection primarily as a discriminative problem and propose a discriminatively trained reconstruction anomaly embedding model (DRAEM). The proposed method learns a joint representation of an anomalous image and its anomaly-free reconstruction, while simultaneously learning a decision boundary between normal and anomalous examples. The method enables direct anomaly localization without the need for additional complicated post-processing of the network output and can be trained using simple and general anomaly simulations. On the challenging MVTec anomaly detection dataset, DRAEM outperforms the current state-of-the-art unsupervised methods by a large margin and even delivers detection performance close to the fully-supervised methods on the widely used DAGM surface-defect detection dataset, while substantially outperforming them in localization accuracy.

DSR – A Dual Subspace Re-Projection Network for Surface Anomaly Detection

Mon, 01 Jan 0001 00:00:00 +0000

The state-of-the-art in discriminative unsupervised surface anomaly detection relies on external datasets for synthesizing anomaly-augmented training images. Such approaches are prone to failure on near-in-distribution anomalies since these are difficult to be synthesized realistically due to their similarity to anomaly-free regions. We propose an architecture based on quantized feature space representation with dual decoders, DSR, that avoids the image-level anomaly synthesis requirement. Without making any assumptions about the visual properties of anomalies, DSR generates the anomalies at the feature level by sampling the learned quantized feature space, which allows a controlled generation of near-in-distribution anomalies. DSR achieves state-of-the-art results on the KSDD2 and MVTec anomaly detection datasets. The experiments on the challenging real-world KSDD2 dataset show that DSR significantly outperforms other unsupervised surface anomaly detection methods, improving the previous top-performing methods by 10% AP in anomaly detection and 35% AP in anomaly localization.

Effects of rule changes on physical demands and shot characteristics of elite-standard men’s squash and implications for training

Mon, 01 Jan 0001 00:00:00 +0000

Efficient Dimensionality Reduction Using Random Projection

Mon, 01 Jan 0001 00:00:00 +0000

Dimensionality reduction techniques are especially important in the context of embedded vision systems. A promising dimensionality reduction method for a use in such systems is the random projection. In this paper we explore the performance of therandom projection method, which can be easily used in embedded cameras. Random projection is compared to Principal Component Analysis in the terms of recognition efficiency on the COIL-20 image data set. Results show surprisingly good performance of the random projection in comparison to the principal component analysis even without explicit orthogonalization or normalization of transformation subspace. These results support the use of random projection in our hierarchical feature-distribution scheme in visual-sensor networks, where random projection elegantly solves the problem of shared subspace distribution.

Efficient Feature Distribution for Object Matching in Visual-Sensor Networks

Mon, 01 Jan 0001 00:00:00 +0000

In this paper, we propose a framework of hierarchical feature distribution for object matching in a network of visual sensors. In our approach, we hierarchically distribute the information in such a way that each individual node maintains only a small amount of information about the objects seen by the network. Nevertheless, this amount is sufficient to efficiently route queries through the network without any degradation of the matching performance. A set of requirements that have to be fulfilled by the object-matching method to be used in such a framework is defined. We provide examples of mapping four well-known, object-matching methods to a hierarchical feature-distribution scheme. The proposed approach was tested on a standard COIL-100 image database and in a basic surveillance scenario using our own distributed network simulator. The results show that the amount of data transmitted through the network can be significantly reduced in comparison to naive feature-distribution schemes such as flooding.

Efficient spring system optimization for part-based visual tracking

Mon, 01 Jan 0001 00:00:00 +0000

Part-based trackers typically use visual and geometric constraints to find the most optimal positions of the parts in the constellation. Recently, spring systems was successfully applied to model these constraints. In this paper we propose an optimization method developed for multi-dimensional spring systems, which can be integrated in the part-based tracking model. The experimental analysis shows that our optimization method outperforms theconjugated gradient descend optimization in terms of convergence speed, accuracy and numerical stability.

Empirical evaluation of feature selection methods in classification

Mon, 01 Jan 0001 00:00:00 +0000

End-to-end training of a two-stage neural network for defect detection

Mon, 01 Jan 0001 00:00:00 +0000

Segmentation-based, two-stage neural network has shown excellent results in the surface defect detection, enabling the network to learn from a relatively small number of samples. In this work, we introduce end-to-end training of the two-stage network together with several extensions to the training process, which reduce the amount of training time and improve the results on the surface defect detection tasks. To enable end-toend training we carefully balance the contributions of both the segmentation and the classification loss throughout the learning. We adjust the gradient flow from the classification into the segmentation network in order to prevent the unstable features from corrupting the learning. As an additional extension to the learning, we propose frequency-of-use sampling scheme of negative samples to address the issue of over- and under-sampling of images during the training, while we employ the distance transform algorithm on the region-based segmentation masks as weights for positive pixels, giving greater importance to areas with higher probability of presence of defect without requiring a detailed annotation. We demonstrate the performance of the end-to-end training scheme and the proposed extensions on three defect detection datasets—DAGM, KolektorSDD and Severstal Steel defect dataset— where we show state-of-the-art results. On the DAGM and the KolektorSDD we demonstrate 100% detection rate, therefore completely solving the datasets. Additional ablation study performed on all three datasets quantitatively demonstrates the contribution to the overall result improvements for each of the proposed extensions.

Entropy Based Measure of Camera Focus

Mon, 01 Jan 0001 00:00:00 +0000

A new measure for assessing camera focusing via recorded image is presented in this paper. The proposed measure bases on calculating entropy in image frequency domain, and we call it frequency domain entropy or FDE. First an intuitive explanation of measure is presented, and next tests for some classical properties that such measure should meet are conducted and commented.

Evaluating multi-class learning strategies in a generative hierarchical framework for object detection.

Mon, 01 Jan 0001 00:00:00 +0000

eWaSR — An Embedded-Compute-Ready Maritime Obstacle Detection Network

Mon, 01 Jan 0001 00:00:00 +0000

Maritime obstacle detection is critical for safe navigation of autonomous surface vehicles (ASVs). While the accuracy of image-based detection methods has advanced substantially, their computational and memory requirements prohibit deployment on embedded devices. In this paper, we analyze the current best-performing maritime obstacle detection network, WaSR. Based on the analysis, we then propose replacements for the most computationally intensive stages and propose its embedded-compute-ready variant, eWaSR. In particular, the new design follows the most recent advancements of transformer-based lightweight networks. eWaSR achieves comparable detection results to state-of-the-art WaSR with only a 0.52% F1 score performance drop and outperforms other state-of-the-art embedded-ready architectures by over 9.74% in F1 score. On a standard GPU, eWaSR runs 10× faster than the original WaSR (115 FPS vs. 11 FPS). Tests on a real embedded sensor OAK-D show that, while WaSR cannot run due to memory restrictions, eWaSR runs comfortably at 5.5 FPS. This makes eWaSR the first practical embedded-compute-ready maritime obstacle detection network. The source code and trained eWaSR models are publicly available.

Exploring levels of stereo fusion for obstacle detection in marine environment

Mon, 01 Jan 0001 00:00:00 +0000

Fast image-based obstacle detection from unmanned surface vehicles

Mon, 01 Jan 0001 00:00:00 +0000

Obstacle detection plays an important role in unmanned surface vehicles (USV). The USVs operate in highly diverse environments in which an obstacle may be a floating piece of wood, a scuba diver, a pier, or a part of a shoreline, which presents a significant challenge to continuous detection from images taken onboard. This paper addresses the problem of online detection by constrained unsupervised segmentation. To this end, a new graphical model is proposed that affords a fast and continuous obstacle image-map estimation from a single video stream captured onboard a USV. The model accounts for the semantic structure of marine environment as observed from USV by imposing weak structural constraints. A Markov random field framework is adopted and a highly efficient algorithm for simultaneous optimization of model parameters and segmentation mask estimation is derived. Our approach does not require computationally intensive extraction of texture features and comfortably runs in real-time. The algorithm is tested on a new, challenging, dataset for segmentation and obstacle detection in marine environments, which is the largest annotated dataset of its kind. Results on this dataset show that our model outperforms the related approaches, while requiring a fraction of computational effort.

Fast Spatially Regularized Correlation Filter Tracker

Mon, 01 Jan 0001 00:00:00 +0000

Discriminative correlation filters (DCF) have attracted significant attention of the tracking community. Standard formulation of the DCF affords a closed form solution, but is not robust and constrained to learning and detection using a relatively small search region. Spatial regularization was proposed to address learning from larger regions. But this prohibits a closed form solution and leads to an iterative optimization with significant computational load, resulting in slow model learning and tracking. We propose to reformulate the spatially regularized filter cost function such that it offers an efficient optimization. This significantly speeds up the tracker (approximately 14 times) and results in real-time tracking at the same or better accuracy.

Filtering out nondiscriminative keypoints by geometry based keypoint constellations

Mon, 01 Jan 0001 00:00:00 +0000

Keypoint-based object detection typically utilizes the nearest neighbour matching technique in order to mach discriminative and reject nondiscriminative keypoints. A detected keypoint is found to be nondiscriminative if it is similar enough to more than one model keypoint. This strategy does not always prove efficient, especially in cases where objects consist of repeating patterns, such as letters in logotypes, where potentially useful keypoints can get rejected. In this paper we propose a geometry-based approach for filtering out nondiscriminative keypoints. Our approach is not affected by repeating patterns and filters out non discriminative keypoints by means of prelearned geometry constraints. We evaluate our proposed method on a challenging dataset depicting logotypes in real-world environments under strong illumination and viewpoint changes.

Formalization of different learning strategies in a continuous learning framework

Mon, 01 Jan 0001 00:00:00 +0000

While the ability to learn on its own is an important feature of a learning agent, another, equally important feature is ability to interact with its environment and to learn in an interaction with other cognitive agents and humans. In this paper we analyze such interactive learning and define several learning strategies requiring different levels of tutor involvement and robot autonomy. We propose a new formal model for describing the learning strategies. The formalism takes into account different levels and types of communication between the robot and the tutor and different actions that can be undertaken. We also propose appropriate performance measures and show the experimental results of the evaluation of the proposed learning strategies.

FuCoLoT - A Fully-Correlational Long-Term Tracker

Mon, 01 Jan 0001 00:00:00 +0000

A Fully Correlational Long-term Tracker (FuCoLoT) exploits the novel DCF constrained filter learning method to design a detector that is able to re-detect the target in the whole image efficiently. FuCoLoT maintains several correlation filters trained on different time scales that act as the detector components. A novel mechanism based on the correlation response is used for tracking failure estimation. FuCoLoT achieves state-of-the-art results on standard short-term benchmarks and it outperforms the current best-performing tracker on the long-term UAV20L benchmark by over 19%. It has an order of magnitude smaller memory footprint than its best-performing competitors and runs at 15fps in a single CPU thread.

Fully supervised and point-supervised ship detection using center prediction, LUVSS-2021-11

Mon, 01 Jan 0001 00:00:00 +0000

In monitoring of maritime environment the detection of ships from aerial or satellite images is a common task. Although many fully supervised object detection methods can achieve excellent result on this domain, such methods remain limited by the amount of labeling required to create the training images. In this technical report, we explore novel methods for fully and weakly supervised learning of ship detector from satellite images. We propose a novel dense prediction method for object detection that can be used in fully supervised learning mode to achieve state-of-the-art results, while further modification allows for learning on weakly labeled data such as point-supervision. Point-supervision, where only as single point/pixel on object is known, can be applied to fully automated the learning of ship detection method by using openly available satellite images and known positions of ships from the database of global ship tracking AIS. This makes methods that can be trained from point-supervision highly suitable for ship detection domain.

Fusion of non-visual modalities into the probabilistic occupancy map framework for person localization

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we investigate the possibilities for fusion of non-visual sensor modalities into state-of-the-art visionbased framework for person detection and localization, the Probabilistic Occupancy Map (POM), with the aim of improving the frame-by-frame localization results in a realistic (cluttered) indoor environment. We point out the aspects that need to be considered when fusing non-visual sensor information into POM and provide a mathematical model for it. We demonstrate the proposed fusion method on the example of multi-camera and radio-based person localization setup. The performance of both systems is evaluated, showing their strengths and weaknesses. We show that localization results may be significantly improved by fusing the information from the radio-based system into the camera-based POM framework using the proposed model.

Fusion of Non-Visual Modalities Into the Probabilistic Occupancy Map Framework for Person Localization

Mon, 01 Jan 0001 00:00:00 +0000

In the recent years, the problem of person detection and localization has received much attention, with two strong areas of application being surveillance/security and tracking of players in sports. Different solutions based on different sensor modalities have been proposed, and recently sensor fusion has gained prominence as a paradigm for overcoming the limitations of the individual sensor modalities. We investigate the possibilities for fusion of additional, nonvisual, sensor modalities into state-of-the-art vision-based framework for person detection and localization, the Probabilistic Occupancy Map(POM), with the aim of improving the localization results in realistic, cluttered, indoor environment. We point out the aspects that need to be considered when fusing an additional sensor information into POM and provide a possible mathematical model for it. Finally, we experimentally demonstrate the proposed fusion on the example of person localization in a cluttered environment.The performance of a system comprising visual cameras and POM and a radio-based localization system is experimentally evaluated, showing their strengths and weaknesses. We then improve the localization results by fusing the information from the radio-based system into POM using the proposed model. Index Terms—sensor fusion, Probabilistic Occupancy Map, multi-camera, radio, person localization.

Guided Video Object Segmentation by Tracking

Mon, 01 Jan 0001 00:00:00 +0000

The paper presents Guided video object segmentation by tracking (gVOST) method for a human-in-the-loop video object segmentation which significantly reduces the manual annotation effort. The method is designed for an interactive object segmentation in a wide range of videos with a minimal user input. User to iteratively selects and annotates a small set of anchor frames by just a few clicks on the object border. The segmentation then is propagated to intermediate frames. Experiments show that gVOST performs well on diverse and challenging videos used in visual object tracking (VOT2020 dataset) where it achieves an IoU of 73% at only 5% of the user annotated frames. This shortens the annotation time by 98% compared to the brute force approach. gVOST outperforms the state-of-the-art interactive video object segmentation methods on the VOT2020 dataset and performs comparably on a less diverse DAVIS video object segmentation dataset.

Hallucinating Hidden Obstacles for Unmanned Surface Vehicles Using a Compositional Model

Mon, 01 Jan 0001 00:00:00 +0000

The water environment in which unmanned surface vehicles (USVs) navigate presents many unique challenges. One of these is the risk of encountering obstacles that are (partially) submerged and therefore poorly visible. Therefore, their extent cannot be determined directly from available above-water sensor data. On the other hand, it is well known that human skippers are able to safely navigate boats around obstacles even without underwater sensors and only with the help of their expertise. In this paper, we describe initial work on extending the USV obstacle detection to include such functionality using a compositional model. To learn to hallucinate the extent of obstacles with a minimum of learning effort, we exploit the nature of obstacles (people in kayaks, canoes, and on paddleboards) that are visible most of the time, but not always. We evaluate the impact of such hallucinations on USV safety and maneuverability, and suggest additional cases where such hallucinations can be used to improve USV safety.

Hand pointing detection system for tabletop visual human-machine interaction

Mon, 01 Jan 0001 00:00:00 +0000

HIDRA 1.0: deep-learning-based ensemble sea level forecasting in the northern Adriatic

Mon, 01 Jan 0001 00:00:00 +0000

Interactions between atmospheric forcing, topographic constraints to air and water flow, and resonant character of the basin make sea level modelling in the Adriatic a challenging problem. In this study we present an ensemble deep-neural-network-based sea level forecasting method HIDRA, which outperforms our set-up of the general ocean circulation model ensemble (NEMO v3.6) for all forecast lead times and at a minuscule fraction of the numerical cost (order of 2×10−6). HIDRA exhibits larger bias but lower RMSE than our set-up of NEMO over most of the residual sea level bins. It introduces a trainable atmospheric spatial encoder and employs fusion of atmospheric and sea level features into a self-contained network which enables discriminative feature learning. HIDRA architecture building blocks are experimentally analysed in detail and compared to alternative approaches. Results show the importance of sea level input for forecast lead times below 24 h and the importance of atmospheric input for longer lead times. The best performance is achieved by considering the input as the total sea level, split into disjoint sets of tidal and residual signals. This enables HIDRA to optimize the prediction fidelity with respect to atmospheric forcing while compensating for the errors in the tidal model. HIDRA is trained and analysed on a 10-year (2006–2016) time series of atmospheric surface fields from a single member of ECMWF atmospheric ensemble. In the testing phase, both HIDRA and NEMO ensemble systems are forced by the ECMWF atmospheric ensemble. Their performance is evaluated on a 1-year (2019) hourly time series from a tide gauge in Koper (Slovenia). Spectral and continuous wavelet analysis of the forecasts at the semi-diurnal frequency (12 h)−1 and at the ground-state basin seiche frequency (21.5 h)−1 is performed. The energy at the basin seiche in the HIDRA forecast is close to that observed, while our set-up of NEMO underestimates it. Analyses of the January 2015 and November 2019 storm surges indicate that HIDRA has learned to mimic the timing and amplitude of basin seiches.

HIDRA-D: deep-learning model for dense sea level forecasting using sparse altimetry and tide gauge data

Mon, 01 Jan 0001 00:00:00 +0000

This paper introduces HIDRA-D, a novel deep-learning model for basin scale dense (gridded) sea level prediction using sparse satellite altimetry and in situ tide gauge data. Accurate sea level prediction is crucial for coastal risk management, marine operations, and sustainable development. While traditional numerical ocean models are computationally expensive, especially for probabilistic forecasts over many ensemble members, HIDRA-D offers a faster, numerically cheaper, observation-driven alternative. Unlike previous HIDRA models (HIDRA1, HIDRA2 and HIDRA3) that focused on point predictions at tide gauges, HIDRA-D provides dense, two-dimensional, gridded sea level forecasts. The core innovation lies in a new algorithm that effectively leverages sparse and unevenly distributed satellite altimetry data in combination with tide gauge observations, to learn the complex basin-scale dynamics of sea level. HIDRA-D achieves this by integrating a HIDRA3 module for point predictions at tide gauges with a novel Dense decoder module, which generates low-frequency spatial components of the sea level field in the Fourier domain, whose Fourier inverse is an hourly sea level forecast over a 3 d horizon. When comparing 3 d forecasts against satellite absolute dynamic topography (ADT) data in the Adriatic, HIDRA-D achieves a 28.0 % reduction in mean absolute error relative to the NEMO general circulation model. However, while HIDRA-D performs well in open waters, leave-one-out cross-validation at tide gauges indicates limitations in areas with complex bathymetry, such as the Neretva estuary located in a narrow bay, and in regions with sparse satellite ADT data, like the northern Adriatic. Importantly, the model shows robustness to spatially-limited tide gauge coverage, maintaining acceptable performance even when trained using data from distant stations. This suggests its potential for broader applicability in areas with limited in situ observations.

HIDRA-T – A Transformer-Based Sea Level Forecasting Method

Mon, 01 Jan 0001 00:00:00 +0000

Sea surface height forecasting is critical for timely prediction of coastal flooding and mitigation of is impact on coastal comminities. Traditional numerical ocean models are limited in terms of computational cost and accuracy, while deep learning models have shown promising results in this area. However, there is still a need for more accurate and efficient deep learning architectures for sea level and storm surge modeling. In this context, we propose a new deep-learning architecture HIDRA-T for sea level and storm tide modeling, which is based on transformers and outperforms both state-of-the-art deep-learning network designs HIDRA1 and HIDRA2 and two state-of-the-art numerical ocean models (a NEMO engine with sea level data assimilation and a SCHISM ocean modeling system), over all sea level bins and all forecast lead times. Compared to its predecessor HIDRA2, HIDRA-T employs novel transformer-based atmospheric and sea level encoders, as well as a novel feature fusion and regression block. HIDRA-T was trained on surface wind and pressure fields from ECMWF atmospheric ensemble and on Koper tide gauge observations. Compared to other models, a consistent superior performance over all other models is observed in the extreme tail of the sea level distribution.

HIDRA2: deep-learning ensemble sea level and storm tide forecasting in the presence of seiches – the case of the northern Adriatic

Mon, 01 Jan 0001 00:00:00 +0000

We propose a new deep-learning architecture HIDRA2 for sea level and storm tide modeling, which is extremely fast to train and apply and outperforms both our previous network design HIDRA1 and two state-of-the-art numerical ocean models (a NEMO engine with sea level data assimilation and a SCHISM ocean modeling system), over all sea level bins and all forecast lead times. The architecture of HIDRA2 employs novel atmospheric, tidal and sea surface height (SSH) feature encoders as well as a novel feature fusion and SSH regression block. HIDRA2 was trained on surface wind and pressure fields from a single member of the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric ensemble and on Koper tide gauge observations. An extensive ablation study was performed to estimate the individual importance of input encoders and data streams. Compared to HIDRA1, the overall mean absolute forecast error is reduced by 13 %, while in storm events it is lower by an even larger margin of 25 %. Consistent superior performance over HIDRA1 as well as over general circulation models is observed in both tails of the sea level distribution: low tail forecasting is relevant for marine traffic scheduling to ports of the northern Adriatic, while high tail accuracy helps coastal flood response. To assign model errors to specific frequency bands covering diurnal and semi-diurnal tides and the two lowest basin seiches, spectral decomposition of sea levels during several historic storms is performed. HIDRA2 accurately predicts amplitudes and temporal phases of the Adriatic basin seiches, which is an important forecasting benefit due to the high sensitivity of the Adriatic storm tide level to the temporal lag between peak tide and peak seiche.

HIDRA2: deep-learning ensemble sea level and storm tide forecasting in the presence of seiches – the case of the northern Adriatic

Mon, 01 Jan 0001 00:00:00 +0000

We propose a new deep-learning architecture HIDRA2 for sea level and storm tide modeling, which is extremely fast to train and apply and outperforms both our previous network design HIDRA1 and two state-of-the-art numerical ocean models (a NEMO engine with sea level data assimilation and a SCHISM ocean modeling system), over all sea level bins and all forecast lead times. The architecture of HIDRA2 employs novel atmospheric, tidal and sea surface height (SSH) feature encoders as well as a novel feature fusion and SSH regression block. HIDRA2 was trained on surface wind and pressure fields from a single member of the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric ensemble and on Koper tide gauge observations. An extensive ablation study was performed to estimate the individual importance of input encoders and data streams. Compared to HIDRA1, the overall mean absolute forecast error is reduced by 13 %, while in storm events it is lower by an even larger margin of 25 %. Consistent superior performance over HIDRA1 as well as over general circulation models is observed in both tails of the sea level distribution: low tail forecasting is relevant for marine traffic scheduling to ports of the northern Adriatic, while high tail accuracy helps coastal flood response. Power spectrum analysis indicates that HIDRA2 most accurately represents the energy density peak centered on the ground state sea surface eigenmode (seiche) and comes a close second to SCHISM in the energy band of the first excited eigenmode. To assign model errors to specific frequency bands covering diurnal and semi-diurnal tides and the two lowest basin seiches, spectral decomposition of sea levels during several historic storms is performed. HIDRA2 accurately predicts amplitudes and temporal phases of the Adriatic basin seiches, which is an important forecasting benefit due to the high sensitivity of the Adriatic storm tide level to the temporal lag between peak tide and peak seiche.

HIDRA3: a deep-learning model for multipoint ensemble sea level forecasting in the presence of tide gauge sensor failures

Mon, 01 Jan 0001 00:00:00 +0000

Accurate modeling of sea level and storm surge dynamics with several days of temporal horizons is essential for effective coastal flood responses and the protection of coastal communities and economies. The classical approach to this challenge involves computationally intensive ocean models that typically calculate sea levels relative to the geoid, which must then be correlated with local tide gauge observations of sea surface height (SSH). A recently proposed deep-learning model, HIDRA2 (HIgh-performance Deep tidal Residual estimation method using Atmospheric data, version 2), avoids numerical simulations while delivering competitive forecasts. Its forecast accuracy depends on the availability of a sufficiently long history of recorded SSH observations used in training. This makes HIDRA2 less reliable for locations with less abundant SSH training data. Furthermore, since the inference requires immediate past SSH measurements as input, forecasts cannot be made during temporary tide gauge failures. We address the aforementioned issues using a new architecture, HIDRA3, that considers observations from multiple locations, shares the geophysical encoder across the locations, and constructs a joint latent state that is decoded into forecasts at individual locations. The new architecture brings several benefits: (i) it improves training at locations with scarce historical SSH data, (ii) it enables predictions even at locations with sensor failures, and (iii) it reliably estimates prediction uncertainties. HIDRA3 is evaluated by jointly training on 11 tide gauge locations along the Adriatic. Results show that HIDRA3 outperforms HIDRA2 and the Mediterranean basin Nucleus for European Modelling of the Ocean (NEMO) setup of the Copernicus Marine Environment Monitoring Service (CMEMS) by ∼ 15 % and ∼ 13 % mean absolute error (MAE) reductions at high SSH values, creating a solid new state of the art. The forecasting skill does not deteriorate even in the case of simultaneous failure of multiple sensors in the basin or when predicting solely from the tide gauges far outside the Rossby radius of a failed sensor. Furthermore, HIDRA3 shows remarkable performance with substantially smaller amounts of training data compared with HIDRA2, making it appropriate for sea level forecasting in basins with high regional variability in the available tide gauge data.

HIDRA3: A Robust Deep-Learning Model for Multi-Point Sea-Surface Height and Storm Surges Forecasting

Mon, 01 Jan 0001 00:00:00 +0000

Accurate forecasting of storm surges and extreme sea levels is crucial for mitigating coastal flooding and safeguarding communities. While recent advancements have seen machine learning models surpass state-of-the-art physics-based numerical models in sea surface height (SSH) prediction, challenges persist, particularly in areas with limited SSH measurement history and instances of sensor failures. In this study, we developed HIDRA3, a novel deep-learning approach designed to address these challenges by jointly predicting SSH at multiple locations, allowing the training even in the presence of data scarcity and enabling predictions at locations with sensor failures. Compared to the state-of-the-art model HIDRA2 and the numerical model NEMO, HIDRA3 demonstrates notable improvements, achieving, on average, 5.0% lower Mean Absolute Error (MAE) and 11.3% lower MAE on extreme sea surface heights.

HIDRA3: A Robust Deep-Learning Model for Multi-Point Sea-Surface Height Forecasting

Mon, 01 Jan 0001 00:00:00 +0000

Accurate sea surface height (SSH) forecasting is crucial for predicting coastal flooding and protecting communities. Recently, state-of-the-art physics-based numerical models have been outperformed by machine learning models, which rely on atmospheric forecasts and the immediate past measurements obtained from the prediction location. The reliance on past measurements brings several drawbacks. While the atmospheric training data is abundantly available, some locations have only a short history of SSH measurement, which limits the training quality. Furthermore, predictions cannot be made in cases of sensor failure even at locations with abundant past training data. To address these issues, we introduce a new deep learning method HIDRA3, that jointly predicts SSH at multiple locations. This allows improved training even in the presence of data scarcity at some locations and enables making predictions at locations with failed sensors. HIDRA3 surpasses the state-of-the-art model HIDRA2 and the numerical model NEMO, on average obtaining a 5.0% lower Mean Absolute Error (MAE) and an 11.3% lower MAE on extreme sea surface heights.

Hierarchical Feature Encoding for Object Recognition in Visual Sensor Networks

Mon, 01 Jan 0001 00:00:00 +0000

Hierarchical Spatial Model for 2D Range Data Based Room Categorization

Mon, 01 Jan 0001 00:00:00 +0000

The next generation service robots are expected to co-exist with humans in their homes. Such a mobile robot requires an efficient representation of space, which should be compact and expressive, for effective operation in real-world environments. In this paper we present a novel approach for 2D ground-plan-like laser-range-data-based room categorization that builds on a compositional hierarchical representation of space, and show how an additional abstraction layer, whose parts are formed by merging partial views of the environment followed by graph extraction, can achieve improved categorization performance. A new algorithm is presented that finds a dictionary of exemplar elements from a multi-category set, based on the affinity measure defined among pairs of elements. This algorithm is used for part selection in new layer construction. Room categorization experiments have been performed on a challenging publicly available dataset, which has been extended in this work. State-of-the-art results were obtained by achieving the most balanced performance over all categories.

Hierarchical statistical learning of generic parts of object structure

Mon, 01 Jan 0001 00:00:00 +0000

High-Dimensional Feature Matching: Employing the Concept of Meaningful Nearest Neighbors

Mon, 01 Jan 0001 00:00:00 +0000

Matching of high-dimensional features using nearest neighbors search is an important part of image matching methods which are based on local invariant features. In this work we highlight effects pertinent to high-dimensional spaces that are significant for matching, yet have not been explicitly accounted for in previous work. In our approach, we require every nearest neighbor to be meaningful, that is, sufficiently close to a query feature such that it is an outlier to a background feature distribution. We estimate the background feature distribution from the extended neighborhood of a query feature given by its k nearest neighbors. Based on the concept of meaningful nearest neighbors, we develop a novel high-dimensional feature matching method and evaluate its performance by conducting image matching on two challenging image data sets. A superior performance in terms of accuracy is shown in comparison to several state-of-the-art approaches. Additionally, to make search for k nearest neighbors more efficient, we develop a novel approximate nearest neighbors search method based on sparse coding with an overcomplete basis set that provides a ten-fold speed-up over an exhaustive search even for high dimensional spaces and retains excellent approximation to an exact nearest neighbors search.

Histogram of oriented gradients and region covariance descriptor in hierarchical feature-distribution scheme

Mon, 01 Jan 0001 00:00:00 +0000

Hierarchical feature-distribution scheme is a recently proposed framework for distribution of features in visual-sensor networks. It is intended for tasks, where one needs to establish a correspondence between two objects, seen by different cameras at different occasions. In visual-sensor networks, such pair of cameras may be very distant in network terms. Therefore, the hierarchical scheme results in significant reduction of network traffic, compared to naive approaches, which rely on flooding. In this paper we explore the performance of two state-of-the-art feature descriptors (histogram of oriented gradients and region covariance descriptor) in such featuredistribution scheme. Both methods are compared inthe terms of network load on the COIL-100 data set. Results show that even state-of-the-art feature descriptors benefit from hierarchical feature-distribution scheme.

Histograms of optical flow for efficient representation of body motion

Mon, 01 Jan 0001 00:00:00 +0000

How Computer Vision can help in Outdoor Positioning

Mon, 01 Jan 0001 00:00:00 +0000

Localization technologies have been an important focus in ubiquitous computing. This paper explores an underrepresented area, namely computer vision technology, for outdoor positioning. More specifically we explore two modes of positioning in a challenging real world scenario: single snapshot based positioning, improved by a novel highdimensional feature matching method, and continuous positioning enabled by combination of snapshot and incremental positioning. Quite interestingly, vision enables localization accuracies comparable to GPS. Furthermore the paper also analyzes and compares possibilities offered by the combination of different subsets of positioning technologies such as WiFi, GPS and dead reckoning in the same real world scenario as for vision based positioning.

Hypothesis verification with histogram of compositions improves object detection of hierarchical models

Mon, 01 Jan 0001 00:00:00 +0000

This paper focuses on applying and evaluating the additional hypothesis verification step for the detections of learnthierarchy-of-parts (LHOP) method. The applied method reduces the problem of false positives that are a common problem of hierarchical methods specifically in highly textured or cluttered images. We use a Histogram of Compositions (HoC) with a Support Vector Machine in hypothesis verification step. Using HoC descriptor ensures that the additional computation cost is as minimal as possible since HoC descriptor shares the LHOP tree structure. We evaluate the method on the ETHZ Shape Classes dataset and show that our method outperforms the original baseline LHOP method by around 5 percent.

Implementacija CONDENSATION Algoritma v domeni zaprtega sveta

Mon, 01 Jan 0001 00:00:00 +0000

People tracking in general is a challenging task and over the last two decades various computer vision algorithms dealing with this problem were proposed. Given a highly unpredictable nature of human motion stochastic based approaches such as CONDENSATION introduced by M. Issard and A. Blake in 1998, gained a lot of popularity among researchers in this field. In this paper we present an implementation of CONDENSATION algorithm for tracking people in sports. Since sport games usually take place in semicontrolled environments a closed world assumption, introduced by S.S. Intille and A.. Bobick in 1995 has been adopted. We present an architecture of such condensation based tracking algorithm within a closed world domain and show some results.

Improvements of the Adriatic Deep-Learning Sea Level Modeling Network HIDRA

Mon, 01 Jan 0001 00:00:00 +0000

Improving Traffic Sign Detection with Temporal Information

Mon, 01 Jan 0001 00:00:00 +0000

Traffic sign detection is a frequently addressed research and application problem, and many solutions to this problem have been proposed. A vast majority of the proposed approaches perform traffic sign detection on individual images, although a video recordings are often available. In this paper, we propose a method that exploits also the temporal information in image sequences. We propose a three-stage traffic sign detection approach. Traffic signs are first detected on individual images. In the second stage, visual tracking is used to track these initial detections to generate multiple detection hypotheses. These hypotheses are finally integrated and refined detections are obtained. We evaluate the proposed approach by detecting 91 traffic sign categories in a video sequence of more than 18.000 frames. Results show that the traffic signs are better localized and detected with a higher accuracy, which is very beneficial for applications such as maintenance of the traffic sign records.

Improving vision-based obstacle detection on USV using inertial sensor

Mon, 01 Jan 0001 00:00:00 +0000

We present a new semantic segmentation algorithm for obstacle detection in unmanned surface vehicles. The novelty lies in the graphical model that incorporates boat tilt measurements from the on-board inertial measurement unit (IMU). The IMU readings are used to estimate the location of horizon line in the image, and automatically adjusts the priors in the probabilistic semantic segmentation algorithm. We derive the necessary horizon projection equations, an efficient optimization algorithm for the proposed graphical model, and a practical IMU-camera-USV calibration. A new challenging dataset, which is the largest multi-sensor dataset of its kind, is constructed. Results show that the proposed algorithm significantly outperforms state of the art, with 32% improvement in water-edge detection accuracy, an over 15 % reduction of false positive rate, an over 70 % reduction of false negative rate, and an over 55 % increase of true positive rate, while running in real-time on a single core in Matlab.

Increased complexity of low-level structures improves histograms of compositions

Mon, 01 Jan 0001 00:00:00 +0000

While low-level visual features, such as histogram of oriented gradients (HOG), have been successfully used for object detection and categorization, we have been able to improve upon their performance by introducing histogram of compositions (HoC) in our previous work. In this paper we propose an extended version of HoC descriptor that uses additional layers from hierarchical model. We experimentally show that extended HoC surpasses the performance of the original descriptor by approximately 5% as additional layer provides higher complexity of compositions. Furthermore, with additional layer we show to produce competitive results to original HoC descriptor combined with HOG and can even further increase performance by adding HOG on top of HoC with additional layer.

Incremental and robust learning of subspace representations

Mon, 01 Jan 0001 00:00:00 +0000

Learning is a fundamental capability of any cognitive system. To enable efficient operation of a cognitive agent in a real-world environment, visual learning has to be a continuous and robust process. In this article, we present a method for subspace learning, which takes these considerations into account. We present an incremental method, which sequentially updates the principal subspace considering weighted influence of individual images as well as individual pixels within an image. We further extend this approach to enable determination of consistencies in the input data and imputation of the inconsistent values using the previously acquired knowledge, resulting in a novel method for incremental, weighted, and robust subspace learning. We demonstrate the effectiveness of the proposed concept in several experiments on learning of object and background representations.

Incremental approach to robust learning of eigenspaces

Mon, 01 Jan 0001 00:00:00 +0000

The standard PCA approach to visual learning of representations is intrinsically non-robust and usually performed in a batch mode, which is inadmissible in a real-world on-line scenario. In this paper we propose a novel method for robust and incremental learning of eigenspaces. The method sequentially updates the representation using the previously acquired knowledge for determining consistencies and discarding inconsistencies in the input images. We show the experimental results, which demonstrate the advantages and disadvantages of the proposed approach.

Incremental LDA learning by combining reconstructive and discriminative approaches

Mon, 01 Jan 0001 00:00:00 +0000

Incremental subspace methods have proven to enable efficient training if large amounts of training data have to be processed or if not all data is available in advance. In this paper we focus on incremental LDA learning which provides good classification results while it assures a compact data representation. In contrast to existing incremental LDA methods we additionally consider reconstructive information when incrementally building the LDA subspace. Hence, we get a more flexible representation that is capable to adapt to new data. Moreover, this allows to add new instances to existing classes as well as to add new classes. The experimental results show that the proposed approach outperforms other incremental LDA methods even approaching classification results obtained by batch learning.

Incremental learning with Gaussian mixture models

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we propose a new incremental estimation of Gaussian mixture models which can be used for applications of online learning. Our approach allows for adding new samples incrementally as well as removing parts of the mixture by the process of unlearning. Low complexity of the mixtures is maintained through a novel compression algorithm. In contrast to the existing approaches, our approach does not require fine-tuning parameters for a specific application, we do not assume specific forms of the target distributions and temporal constraints are not assumed on the observed data. The strength of the proposed approach is demonstrated with an example of online estimation of a complex distribution, an example of unlearning, and with an interactive learning of basic visual concepts.

Incremental PCA for On-line Visual Learning and Recognition

Mon, 01 Jan 0001 00:00:00 +0000

The methods for visual learning that compute a space of eigenvectors by Principal Component Analysis (PCA) traditionally require a batch computation step. Since this leads to potential problems when dealing with large sets of images, several incremental methods for the computation of the eigenvectors have been introduced. However, such learning cannot be considered as an on-line process, since all the images are retained until the final step of computation of space of eigenvectors, when their coefficients in this subspace are computed. In this paper we propose a method that allows for simultaneous learning and recognition. We show that we can keep only the coefficients of the learned images and discard the actual images and still are able to build a model of appearance that is fast to compute and open-ended. We performed extensive experimental testing which showed that the recognition rate and reconstruction accuracy are comparable to those obtained by the batch method.

Integrating Visual Context and Object Detection within a Probabilistic Framework

Mon, 01 Jan 0001 00:00:00 +0000

Visual context provides cues about an object’s presence, position and size within an observed scene, which are used to increase the performance of object detection techniques. However, state-of-the-art methods for context aware object detection could decrease the initial performance. We discuss the reasons for failure and propose a concept that overcomes these limitations, by introducing a novel technique for integrating visual context and object detection. Therefore, we apply the prior probability function of an object detector, that maps the detector’s output to probabilities. Together, with an appropriate contextual weighting, a probabilistic framework is established. In addition, we present an extension to state-of-the-art methods to learn scale-dependent visual context information and show how this increases the initial performance. The standard methods and our proposed extensions are compared on a novel, demanding image data set. Results show that visual context facilitates object detection methods.

Integration of Computer Vision Components into a Multi-modal Cognitive System

Mon, 01 Jan 0001 00:00:00 +0000

We present a general method for integrating visual components into a multi-modal cognitive system. The integration is very generic and can work with an arbitrary set of other modalities. We illustrate our integration approach with a specific instantiation of the architecture schema that focuses on integration of vision and language: a cognitive system able to collaborate with a human, learn and display some understanding of its surroundings. As examples of cross-modal interaction we describe mechanisms for clarification and visual learning.

Interactive learning and cross-modal binding - a combined approach

Mon, 01 Jan 0001 00:00:00 +0000

Interaktiven sistem za kontinuirano učenje vizualnih konceptov

Mon, 01 Jan 0001 00:00:00 +0000

We present an artifficial cognitive system for learning visual concepts. It comprises of vision, communication and manipulation subsystems, which provide visual input, enable verbal and non-verbal communication with a tutor and allow interaction with a given scene. The main goal is to learn associations between automatically extracted visual features and words that describe the scene in an open-ended, continuous manner. In particular, we address the problem of cross-modal learning of visual properties and spatial relations and analyse several learning modes requiring different levels of tutor supervision.

Is my new tracker really better than yours?

Mon, 01 Jan 0001 00:00:00 +0000

The problem of visual tracking evaluation is sporting an abundance of performance measures, which are used by various authors, and largely suffers from lack of consensus about which measures should be preferred. This is hampering the cross-paper tracker comparison and faster advancement of the field. In this paper we provide a critical analysis of the popular measures and evaluate them experimentally by a large-scale tracking experiment. We also analyze various visualizations of the performance measures. We show that several measures are equivalent from the point of information they provide for tracker comparison and, crucially, that some are more brittle than the others. Based on our analysis we narrow down the specter of measures to only a few complementary ones, thus pushing towards homogenization of the tracker evaluation methodology.

Izvedba algoritma računalniškega vida na omrežni kameri

Mon, 01 Jan 0001 00:00:00 +0000

Joint calibration of a multimodal sensor system for autonomous vehicles

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal sensor systems require precise calibration if they are to be used in the field. Due to the difficulty of obtaining the corresponding features from different modalities, the calibration of such systems is an open problem. We present a systematic approach for calibrating a set of cameras with different modalities (RGB, thermal, polarization, and dual-spectrum near infrared) with regard to a LiDAR sensor using a planar calibration target. Firstly, a method for calibrating a single camera with regard to the LiDAR sensor is proposed. The method is usable with any modality, as long as the calibration pattern is detected. A methodology for establishing a parallax-aware pixel mapping between different camera modalities is then presented. Such a mapping can then be used to transfer annotations, features, and results between highly differing camera modalities to facilitate feature extraction and deep detection and segmentation methods.

Karhunen-Loeve Transform of a Set of Rotated Templates

Mon, 01 Jan 0001 00:00:00 +0000

We propose a novel method for efficiently calculating the eigenvectors of uniformly rotated images of a set of templates. As we show, the images can be optimally approximated by a linear series of eigenvectors which can be calculated without actually decomposing the sample covariance matrix.

Keep DRÆMing: Discriminative 3D anomaly detection through anomaly simulation

Mon, 01 Jan 0001 00:00:00 +0000

Recent surface anomaly detection methods rely on pretrained backbone networks for efficient anomaly detection. On standard RGB anomaly detection benchmarks these methods achieve excellent results but fail on 3D anomaly detection due to a lack of pretrained backbones that suit this domain. Additionally, there is a lack of industrial depth data that would enable the backbone network training that could be used in 3D anomaly detection models. Discriminative anomaly detection methods do not require pretrained networks and are trained using simulated anomalies. The process of simulating anomalies that fit the domain of industrial depth data is not trivial and is necessary for training discriminative methods. We propose a novel 3D anomaly simulation process that follows the natural characteristics of industrial depth data and generates diverse deformations, making it suitable for training discriminative anomaly detection methods. We demonstrate its effectiveness by adapting the DRÆM method to work on 3D anomaly detection, thus obtaining 3DRÆM, a strong discriminative 3D anomaly detection model. The proposed approach achieves excellent results on the MVTec3D anomaly detection benchmark where it achieves state-of-the-art results on both 3D and RGB+3D problem setups, significantly outperforming competing methods.

Knowledge gap detection for interactive learning of categorical knowledge

Mon, 01 Jan 0001 00:00:00 +0000

In interactive machine learning the process of labeling training instances and introducing them to the learner may be expensive in terms of human effort and time. In this paper we present different strategies for detecting gaps in the learner’s knowledge and communicating these gaps to the teacher. These strategies are considered from the viewpoint of extrospective and introspective behavior of the learner – this new perspective is also the main contribution of our paper. The experimental results indicate that the analyzed strategies are successful in reducing the number of training instances required to reach the needed recognition rate. Such a facilitation may be an important step towards the broader use of interactive autonomous systems.

LaRS: A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark

Mon, 01 Jan 0001 00:00:00 +0000

The progress in maritime obstacle detection is hindered by the lack of a diverse dataset that adequately captures the complexity of general maritime environments. We present the first maritime panoptic obstacle detection benchmark LaRS, featuring scenes from Lakes, Rivers and Seas. Our major contribution is the new dataset, which boasts the largest diversity in recording locations, scene types, obstacle classes, and acquisition conditions among the related datasets. LaRS is composed of over 4000 per-pixel labeled key frames with nine preceding frames to allow utilization of the temporal texture, amounting to over 40k frames. Each key frame is annotated with 8 thing, 3 stuff classes and 19 global scene attributes. We report the results of 27 semantic and panoptic segmentation methods, along with several performance insights and future research directions. To enable objective evaluation, we have implemented an online evaluation server. The LaRS dataset, evaluation toolkit and benchmark are publicly available at: https://lojzezust.github.io/lars-dataset

Learning Contextual Rules for Priming Object Categories in Images

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we introduce and exploit the concept of contextual rules in the field of object detection. These rules are defined as associations between different object likelihood maps and are learned from given examples. The contextual rules can be used to prime regions where a target object category occurs in an image given areas of other object categories. The principal idea is to locate several basic object categories in an image and then use this information to infer object likelihood maps for other object categories. The proposed framework itself is general and not limited to specific object categories. For demonstrating our approach, we use likely occurrences of pedestrians and windows in urban scenes, extracted by a technique employing visual context, and use them to prime for shop logos.

Learning Hierarchical Compositional Representations of Object Structure

Mon, 01 Jan 0001 00:00:00 +0000

Learning hierarchical representations of object categories for robot vision

Mon, 01 Jan 0001 00:00:00 +0000

This paper presents our recently developed approach to constructing a hierarchical representation of visual input that aims to enable recognition and detection of a large number of object categories. Inspired by the principles of efficient indexing, robust matching, and ideas of compositionality, our approach learns a hierarchy of spatially flexible compositions, i.e. parts, in an unsupervised, statistics-driven manner. Starting with simple, frequent features, we learn the statistically most significant compositions (parts composed of parts), which consequently define the next layer. Parts are learned sequentially, layer after layer, optimally adjusting to the visual data. Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. Higher layers of the hierarchy, on the other hand, are constructed by using specific categories, achieving a category representation with a small number of highly generalizable parts that gained their structural flexibility through composition within the hierarchy. Built in this way, new categories can be efficiently and continuously added to the system by adding a small number of parts only in the higher layers. The approach is demonstrated on a large collection of images and a variety of object categories.

Learning hierarchical representations of object categories for robot vision.

Mon, 01 Jan 0001 00:00:00 +0000

Learning Maritime Obstacle Detection from Weak Annotations by Scaffolding

Mon, 01 Jan 0001 00:00:00 +0000

Coastal water autonomous boats rely on robust perception methods for obstacle detection and timely collision avoidance. The current state-of-the-art is based on deep segmentation networks trained on large datasets. Per-pixel ground truth labeling of such datasets, however, is labor-intensive and expensive. We observe that far less information is required for practical obstacle avoidance – the location of water edge on static obstacles like shore and approximate location and bounds of dynamic obstacles in the water is sufficient to plan a reaction. We propose a new scaffolding learning regime (SLR) that allows training obstacle detection segmentation networks only from such weak annotations, thus significantly reducing the cost of ground-truth labeling. Experiments show that maritime obstacle segmentation networks trained using SLR substantially outperform the same networks trained with dense ground truth labels. Thus accuracy is not sacrificed for labelling simplicity but is in fact improved, which is a remarkable result.

Learning part-based spatial models for laser-vision-based room categorization

Mon, 01 Jan 0001 00:00:00 +0000

Room categorization, i.e., recognizing the functionality of a never before seen room, is a crucial capability for a household mobile robot. We present a new approach for room categorization that is based on 2D laser range data. The method is based on a novel spatial model consisting of mid-level parts that are built on top of a low-level part-based representation. The approach is then fused with a vision-based method for room categorization, which is also based on a spatial model consisting of mid-level visual-parts. In addition, we propose a new discriminative dictionary learning technique that is applied for part-dictionary selection in both laser-based and vision-based modalities. Finally, we present a comparative analysis between laser-based, vision-based, and laser-vision-fusion-based approaches in a uniform part-based framework that is evaluated on a large dataset with several categories of rooms from the domestic environments.

Learning statistically relevant edge structure improves low-level visual descriptors

Mon, 01 Jan 0001 00:00:00 +0000

Over the recent years, low-level visual descriptors, among which the most popular is the histogram of oriented gradients (HOG), have shown excellent performance in object detection and categorization. We form a hypothesis that the low-level image descriptors can be improved by learning the statistically relevant edge structures from natural images. We validate this hypothesis by introducing a new descriptor called the histogram of compositions (HoC). HoC exploits a learnt vocabulary of parts from a state-of-the-art hierarchical compositional model. Furthermore, we show that HoC is a complementary descriptor to HOG. We experimentally compare our descriptor to the popular HOG descriptor on the task of object categorization. We have observed approximately 4% improved categorization performance of HoC over HOG at lower dimensionality of the descriptor. Furthermore, in comparison to HOG, we show a categorization improvement of approximately 11% when combining HOG with the proposed HoC.

Learning visual context for object detection

Mon, 01 Jan 0001 00:00:00 +0000

Kontekst ima pomembno vlogo pri splošnem zaznavanju prizorov, saj zagotavlja dodatno informacijo o možnih lokacijah objektov v slikah. Detektorji objektov, ki se uporabljajo v računalniškem vidu, tovrstne informacijo običajno ne izkoristijo. V članku bomo zato predstavili koncept, kako se lahko kontekstualne informacije naučimo iz primerov slik prizorov. To informacijo bomo uporabili za izračun kontekstnega polja, ki predstavlja apriorno informacijo za detekcijo objektov glede na možne lokacije. Detekcija objektov, ki temelji na lokalnem videzu, je potem selektivno uporabljena le na nekaterih delih slike. Predlagano metodo smo preizkusili na primerih detekcije pešcev, avtomobilov, in oken, pri čemer smo uporabili zahtevne podatkovne zbirke slik urbanih okolij. Rezultati so pokazali, da kontekstualna informacija dopolnjuje lokalno informacijo na podlagi videza, ter tako zmanjša kompleksnost iskanja in poveča robustnost detekcije predmetov. Prednost predlagane metode je tudi v tem, da je učenje kontekstualnih konfiguracij za različne kategorije objektov neodvisno od specifičnih modelov za posamezne naloge.

Learning with Weak Annotations for Robust Maritime Obstacle Detection

Mon, 01 Jan 0001 00:00:00 +0000

Robust maritime obstacle detection is critical for safe navigation of autonomous boats and timely collision avoidance. The current state-of-the-art is based on deep segmentation networks trained on large datasets. However, per-pixel ground truth labeling of such datasets is labor-intensive and expensive. We propose a new scaffolding learning regime (SLR) that leverages weak annotations consisting of water edges, the horizon location, and obstacle bounding boxes to train segmentation-based obstacle detection networks, thereby reducing the required ground truth labeling effort by a factor of twenty. SLR trains an initial model from weak annotations and then alternates between re-estimating the segmentation pseudo-labels and improving the network parameters. Experiments show that maritime obstacle segmentation networks trained using SLR on weak annotations not only match but outperform the same networks trained with dense ground truth labels, which is a remarkable result. In addition to the increased accuracy, SLR also increases domain generalization and can be used for domain adaptation with a low manual annotation load. The SLR code and pre-trained models are freely available online.

Lokalizacija in ocenjevanje lege predmeta v treh prostostnih stopnjah s središčnimi smernimi vektorji

Mon, 01 Jan 0001 00:00:00 +0000

In this paper, we propose an approach to localize and estimate the pose of objects in three degrees of freedom (3-DOF). Our method is based on point localization combined with regression of the orientation angle for each detected object. We extend existing point localization method to estimate the orientation of all detected objects in an image. The orientation regression is parameterized with trigonometric functions, similar to the direction to the object center. We evaluate our method on the proposed screw dataset, composed of a training set containing synthetic images with photorealistic appearance and a test set containing real images of screws. Compared to the state-of-the-art 6-DOF position estimation method applied to the 3-DOF problem, our approach achieves comparable results at a significantly lower computational cost.

Low-Cost Open-Source Robotic Platform for Education

Mon, 01 Jan 0001 00:00:00 +0000

This article describes an open-source robotic manipulator platform aimed at different levels of STEM education and popularization. It presents the hardware that was used to make a suitable low-cost low-weight manipulator and an evaluation of its capabilities, as well as the software components that were developed to make the platform accessible at different levels of education and in various usage scenarios. Finally, the results of a comprehensive user evaluation study spanning over several years are presented. The system was tested in several different educational scenarios, ranging from a summer school for primary-school students to a university-level course. The results of the study show that the introduction of the system into the educational process improves the motivation as well as the acquired knowledge of the participants.

Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation

Mon, 01 Jan 0001 00:00:00 +0000

Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP’s region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available at: this URL.

Mixed supervision for surface-defect detection: from weakly to fully supervised learning

Mon, 01 Jan 0001 00:00:00 +0000

Deep-learning methods have recently started being employed for addressing surface-defect detection problems in industrial quality control. However, with a large amount of data needed for learning, often requiring high-precision labels, many industrial problems cannot be easily solved, or the cost of the solutions would significantly increase due to the annotation requirements. In this work, we relax heavy requirements of fully supervised learning methods and reduce the need for highly detailed annotations. By proposing a deep-learning architecture, we explore the use of annotations of different details ranging from weak (image-level) labels through mixed supervision to full (pixel-level) annotations on the task of surface-defect detection. The proposed end-to-end architecture is composed of two sub-networks yielding defect segmentation and classification results. The proposed method is evaluated on several datasets for industrial quality inspection: KolektorSDD, DAGM and Severstal Steel Defect. We also present a new dataset termed KolektorSDD2 with over 3000 images containing several types of defects, obtained while addressing a real-world industrial problem. We demonstrate state-of-the-art results on all four datasets. The proposed method outperforms all related approaches in fully supervised settings and also outperforms weakly-supervised methods when only image-level labels are available. We also show that mixed supervision with only a handful of fully annotated samples added to weakly labelled training images can result in performance comparable to the fully supervised model’s performance but at a significantly lower annotation cost.

Mobile Robot Localization under Varying Illumination

Mon, 01 Jan 0001 00:00:00 +0000

Methods for mobile robot localization that use eigenspaces of panoramic snapshots of the environment are in general sensitive to changes in the illumination of the environment. Therefore, we propose an approach which achieves a reliable localization under severe illumination conditions. The method uses gradient filtering of the eigenspace. After testing the approach on images obtained by a mobile robot, we show that it outperforms the standard eigenspace-based recognition method.

Mobile Robot Localization using an Incremental Eigenspace Model

Mon, 01 Jan 0001 00:00:00 +0000

When using appearance-based recognition for self-localization of mobile robots, the images obtained during the exploration of the environment need to be efficiently stored in the memory. PCA offers means for representing the images in a low-dimensional subspace, which allows for efficient matching and recognition. For active exploration it is necessary to use an incremental method for the computation of the subspace. We propose to use an incremental PCA algorithm with the updating of partial image representations in a way that allows the robot to discard the acquired images immediately after the update. Such a model is open-ended, meaning that we can easily update it with new images. We show that the performance of the proposed method is comparable to the performance of the batch method in terms of compression, computational cost and the precision of localization. We also show that by applying the repetitive learning, the subspace converges to that constructed with the batch method.

Mobile Robots : New Research

Mon, 01 Jan 0001 00:00:00 +0000

In this paper a global vision scheme for estimation of positions and orientations of mobile robots is presented. It is applied to robot soccer application which is a fast dynamic game and therefore needs an efficient and robust vision system implemented. General applicability of the vision system can be found in other robot applications such as mobile transport robots in production, warehouses, attendant robots, fast vision tracking of targets of interest and entertainment robotics. Basic operation of the vision system is divided into two steps. In the first, the incoming image is scanned and pixels are classified into a finite number of classes. At the same time, a segmentation algorithm is used to find corresponding regions belonging to one of the classes. In the second step, all the regions are examined. Selection of the ones that are a part of the observed object is made by means of simple logic procedures. The novelty is focused on optimization of the processing time needed to finish the estimation of possible object positions. Better results of the vision system are achieved by implementing camera calibration and shading correction algorithm. The former corrects camera lens distortion, while the latter increases robustness to irregular illumination conditions.

Modeling binding and cross-modal learning in Markov logic networks

Mon, 01 Jan 0001 00:00:00 +0000

Binding - the ability to combine two or more modal representations of the same entity into a single shared representation - is vital for every cognitive system operating in a complex environment. In order to successfully adapt to changes in a dynamic environment the binding mechanism has to be supplemented with cross-modal learning. In this paper we define the problems of high-level binding and cross-modal learning. By these definitions we model a binding mechanism in a Markov logic network and define its role in a cognitive architecture. We evaluate a prototype binding system off-line, using three different inference methods.

MODS--A USV-Oriented Object Detection and Obstacle Segmentation Benchmark

Mon, 01 Jan 0001 00:00:00 +0000

Small-sized unmanned surface vehicles (USV) are coastal water devices with a broad range of applications such as environmental control and surveillance. A crucial capability for autonomous operation is obstacle detection for timely reaction and collision avoidance, which has been recently explored in the context of camera-based visual scene interpretation. Owing to curated datasets, substantial advances in scene interpretation have been made in a related field of unmanned ground vehicles. However, the current maritime datasets do not adequately capture the complexity of real-world USV scenes and the evaluation protocols are not standardised, which makes cross-paper comparison of different methods difficult and hinders the progress. To address these issues, we introduce a new obstacle detection benchmark MODS, which considers two major perception tasks: maritime object detection and the more general maritime obstacle segmentation. We present a new diverse maritime evaluation dataset containing approximately 81k stereo images synchronized with an on-board IMU, with over 60k objects annotated. We propose a new obstacle segmentation performance evaluation protocol that reflects the detection accuracy in a way meaningful for practical USV navigation. Nineteen recent state-of-the-art object detection and obstacle segmentation methods are evaluated using the proposed protocol, creating a benchmark to facilitate development of the field. The proposed dataset, as well as evaluation routines, are made publicly available at vicos.si/resources.

Multi-camera and radio fusion for person localization in a cluttered environment

Mon, 01 Jan 0001 00:00:00 +0000

We investigate the problem of person localization in a cluttered environment. We evaluate the performance of an Ultra-Wideband radio localization system and a multi-camera system based on the Probabilistic Occupancy Map algorithm. After demonstrating the strengths and weaknesses of both systems, we improve the localization results by fusing both the radio and the visual information within the Probabilistic Occupancy Map framework. This is done by treating the radio modality as an additional independent sensory input that contributes to a given cell’s occupancy likelihood.

Multi-camera and radio fusion for person localization in a cluttered environment

Mon, 01 Jan 0001 00:00:00 +0000

Multi-modal Obstacle Avoidance in USVs via Anomaly Detection and Cascaded Datasets

Mon, 01 Jan 0001 00:00:00 +0000

We introduce a novel strategy for obstacle avoidance in aqua- tic settings, using anomaly detection for quick deployment of autonomous water vehicles in limited geographic areas. The unmanned surface vehi- cle (USV) is initially manually navigated to collect training data. The learning phase involves three steps: learning imaging modality specifics, learning the obstacle-free environment using collected data, and setting obstacle detector sensitivity with images containing water obstacles. This approach, which we call cascaded datasets, works with different image modalities and environments without extensive marine-specific data. Re- sults are demonstrated with LWIR and RGB images from river missions.

Multi-modal tracking by identification

Mon, 01 Jan 0001 00:00:00 +0000

In this paper, we demonstrate, by performing quantitative evaluation, the benefit of tracking by identification over state-of-the-art identification by tracking. We evaluate four localization and tracking systems: a commercial localization system based on radio technology, a state-ofthe- art computer-vision algorithm that uses multiple calibrated cameras to perform identification by tracking, and two multi-modal tracking-by-identification systems that have been developed in our laboratory. We briefly describe all four systems and evaluation metric, and present evaluation on a challenging indoor dataset.

Multi-touch surface based on RGBD camera

Mon, 01 Jan 0001 00:00:00 +0000

The popularity of interactive surfaces is increasing because of their natural and intuitive usage. Adding 3D multi-point interaction capabilities to an arbitrary surface creates numerous additional possibilities in fields ranging from marketing to medicine. Interactive tables are nowadays present in numerous museums, schools and companies. With the advent of low-cost RGBD cameras, thee-dimensional surfaces are slowly emerging as well, attracting even more attention. This paper presents an affordable system for 3D human-computer interaction using a RGBD camera that is capable of detecting and tracking user’s fingertips in 3D space. The system is evaluated in terms of accuracy, response time, CPU usage, and user experience. The results of the evaluation show that such low-cost systems are already a viable alternative to other multi-touch technologies and also present interesting new ways of interaction with a surface-based interfaces.

Multi-Year Time Series Transfer Learning: Application of Early Crop Classification

Mon, 01 Jan 0001 00:00:00 +0000

Crop classification is an important task in remote sensing with many applications, such as estimating yields, detecting crop diseases and pests, and ensuring food security. In this study, we combined knowledge from remote sensing, machine learning, and agriculture to investigate the application of transfer learning with a transformer model for variable length satellite image time series (SITS). The objective was to produce a map of agricultural land, reduce required interventions, and limit in-field visits. Specifically, we aimed to provide reliable agricultural land class predictions in a timely manner and quantify the necessary amount of reference parcels to achieve these outcomes. Our dataset consisted of Sentinel-2 satellite imagery and reference crop labels for Slovenia spanning over years 2019, 2020, and 2021. We evaluated adaptability through fine-tuning in a real-world scenario of early crop classification with limited up-to-date reference data. The base model trained on a different year achieved an average F1 score of 82.5% for the target year without having a reference from the target year. To increase accuracy with a new model trained from scratch, an average of 48,000 samples are required in the target year. Using transfer learning, the pre-trained models can be efficiently adapted to an unknown year, requiring less than 0.3% (1500) samples from the dataset. Building on this, we show that transfer learning can outperform the baseline in the context of early classification with only 9% of the data after 210 days in the year.

Multiple interacting targets tracking with application to team sports

Mon, 01 Jan 0001 00:00:00 +0000

The interest in the field of computer aided analysis of sport events is ever growing and the ability of tracking objects during a sport event has become an elementary task for nearly every sport analysis system. We present in this paper a color based probabilistic tracker that is suitable for tracking players on the playground during a sport game. Since the players are being tracked in their natural environment, and this environment is subjected to certain rules of the game, we use the concept of closed worlds, to model the scene context and thus improve the reliability of tracking.

Multivariate Online Kernel Density Estimation

Mon, 01 Jan 0001 00:00:00 +0000

We propose an approach for online kernel density estimation (KDE) which enables building probability density functions from data by observing only a single data-point at a time. The method maintains a non-parametric model of the data itself and uses this model to calculate the corresponding KDE. We propose an new automatic bandwidth selection rule, which can be computed directly from the non-parametric model of the data. Low complexity of the model is maintained through a novel compression and refinement scheme. We compare the online KDE to some state-of-the-art batch KDEs on examples of estimating distributions and on an example of classification. The results show that the online KDE generally achieves comparable performance to the batch approaches, while producing models with lower complexity and allowing online updating using only a single observation at a time.

Multivariate Online Kernel Density Estimation with Gaussian Kernels

Mon, 01 Jan 0001 00:00:00 +0000

We propose a novel approach to online estimation of probability density functions, which is based on kernel density estimation (KDE). The method maintains and updates a non-parametric model of the observed data, from which the KDE can be calculated. We propose an online bandwidth estimation approach and a compression/revitalization scheme which maintains the KDE’s complexity low. We compare the proposed online KDE to the state-of-the-art approaches on examples of estimating stationary and non-stationary distributions, and on examples of classification. The results show that the online KDE outperforms or achieves a comparable performance to the state-of-the-art and produces models with a significantly lower complexity while allowing online adaptation.

MVL Lab5: Multi-modal Indoor Person Localization Dataset

Mon, 01 Jan 0001 00:00:00 +0000

This technical report describes MVL Lab5, a multi-modal indoor person localization dataset. The dataset contains a sequence of video frames obtained from four calibrated and time-synchronized video cameras and location event data stream from a commercially-available radio-based localization system. The scenario involves five individuals walking around a realistically cluttered room. Provided calibration data and ground truth annotations enable evaluation of person detection, localization and identification approaches. These can be either purely computer-vision based, or based on fusion of video and radio information. This document is intended as the primary documentation source for the dataset, presenting its availability, acquisition procedure, and organization. The structure and format of data is described in detail, along with documentation for bundled Matlab code and examples of its use.

Nadgradnja mere AUC pri analizi klasifikatorjev s krivuljami ROC

Mon, 01 Jan 0001 00:00:00 +0000

Mera AUC, ki se uporablja na področju vrednotenja klasifikatorjev in predstavlja eno glavnih orodij analize ROC, ima določene pomanjkljivosti. Ne upošteva namreč vrednosti točkovnih ocen (angl. scores) primerov, temveč le njihovo razvrstitev. Posledica tega dejstva je njena nezanesljivost pri ocenjevanju množic primerov, pri katerih so razlike med točkovnimi ocenami primerov zanemarljive. Slabost mere AUC pa je tudi njena neinformativnost pri medsebojnem primerjanju množic, ki vsebujejo enako število napak.

Raziskovalci so iz teh razlogov predlagali izboljšave mere AUC, ki upoštevajo tudi vrednosti točkovnih ocen. V tem delu obravnavamo štiri tovrstne mere. Ugotovljeno pa je bilo, da tudi te izpeljanke ne odpravijo vseh slabosti oz. celo vpeljejo nove. Pri njih se namreč lahko pojavi neprimeren vpliv lastnosti obravnavanih množic primerov na obnašanje teh različic.

Napredne metode računalniškega vida za avtonomno navigacijo robotskega plovila

Mon, 01 Jan 0001 00:00:00 +0000

The aim of our project is development of computer vision algorithms for autonomous navigation of a sea vessel by means of image segmentation and stabilization, long-term tracking, inference of 3D structure from motion, and horizon detectio

Non-sequential Multi-view Detection, Localization and Identification of People Using Multi-modal Feature Maps

Mon, 01 Jan 0001 00:00:00 +0000

O klasifikaciji slik v ne-enolično določljive razrede

Mon, 01 Jan 0001 00:00:00 +0000

Image classification is one of the most basic and frequently addressed computer vision tasks. The usual formulation of this tasks requires classification of an image into the one of several possible classes. The most common metric for measuring the classifier’s performance is classification accuracy, defined as a percentage of correctly classified images. However, such formalisation of the classification problem relies on a strong assumption that for every image a category is uniquely identifiable and assigned by the domain expert. In this paper we address scenarios where this assumption does not hold. In particular, we present an analysis of the results obtained by the convolutional neural network and twelve participants who were tasked to classify the images of planks into eight classes and discuss the label ambiguity problem.

Object Tracking by Reconstruction with View-Specific Discriminative Correlation Filters

Mon, 01 Jan 0001 00:00:00 +0000

Standard RGB-D trackers treat the target as an inherently 2D structure, which makes modelling appearance changes related even to simple out-of-plane rotation highly challenging. We address this limitation by proposing a novel long-term RGB-D tracker - Object Tracking by Reconstruction (OTR). The tracker performs online 3D target reconstruction to facilitate robust learning of a set of view-specific discriminative correlation filters (DCFs). The 3D reconstruction supports two performance-enhancing features: (i) generation of accurate spatial support for constrained DCF learning from its 2D projection and (ii) point cloud based estimation of 3D pose change for selection and storage of view-specific DCFs which are used to robustly localize the target after out-of-view rotation or heavy occlusion. Extensive evaluation of OTR on the challenging Princeton RGB-D tracking and STC Benchmarks shows it outperforms the state-of-the-art by a large margin.

ObjectCore - Efficient Few-shot Logical Anomaly Detection using Object Representations

Mon, 01 Jan 0001 00:00:00 +0000

Anomaly Detection is an important problem in industrial processes. Two new subfields have recently emerged: logical anomaly detection and few-shot anomaly detection. The combined task, few-shot logical anomaly detection, has proven exceptionally difficult and highly important for industrial processes. Few-shot methods use suboptimal representations to model composition information necessary for detecting logical anomalies, and previous full-shot methods require a large training set. To solve both problems, we propose ObjectCore, a few-shot logical anomaly detection model that captures the composition information from only a few images without any category-specific information. The composition information of an image is modelled as a collection of object representations. Logical anomalies are detected using bipartite matching between object representations in the test image and object representations in the most similar support image. ObjectCore significantly improves over state-of-the-art methods on two standard benchmarks for few-shot logical anomaly detection, MVTec LOCO and CAD-SD, attaining an image-level AUROC of 80.8% and 96.5%, respectively, in the 4-shot setting. Code

Observing Human Motion Using Far-Infrared (FLIR) Camera -- Some Preliminary Studies

Mon, 01 Jan 0001 00:00:00 +0000

Far infrared imaging technology is becoming an interesting choice for many civilian uses. We explored the potential of using far infrared camera for human motion analysis, especially from the viewpoint of possible automated image and video analysis. In this article, we present the main characteristics of far infrared imagery that should be of interest to computer vision researchers and seek to eliminate some common misunderstandings about the far infrared imagery which may influence the choice of far infrared technology over other alternatives. We provide images that illustrate the problems and advances of using the far infrared imaging technology, especially for the purpose of observing humans.

Obstacle Detection for USVs by Joint Stereo-View Semantic Segmentation

Mon, 01 Jan 0001 00:00:00 +0000

We propose a stereo-based obstacle detection approach for unmanned surface vehicles. Obstacle detection is cast as a scene semantic segmentation problem in which pixels are assigned a probability of belonging to water or non-water regions. We extend a single-view model to a stereo system by adding a constraint which prefers consistent class labels assignment to pixels in the left and right camera images corresponding to the same parts of a 3D scene. Our approach jointly fits a semantic model to both images, leading to an improved class-label posterior map from which obstacles and water edge are extracted. In overall F-measure, our approach outperforms the current state-of-the-art monocular approach by 0.495, a monocular CNN by 0.798 and their stereo extensions by 0.059 and 0.515, respectively on the task of obstacle detection while running real-time on a single CPU.

Obstacle Tracking for Unmanned Surface Vessels using 3D Point Cloud

Mon, 01 Jan 0001 00:00:00 +0000

We present a method for detecting and tracking waterborne obstacles from an unmanned surface vehicle (USV) for the purpose of short-term obstacle avoidance. A stereo camera system provides a point cloud of the scene in front of the vehicle. The water surface is estimated by fitting a plane to the point cloud and outlying points are further processed to find potential obstacles. We propose a new plane fitting algorithm for water surface detection that applies a fast approximate semantic segmentation to filter the point cloud and utilizes an external IMU reading to constrain the plane orientation. A novel histogram-like depth appearance model is proposed to keep track of the identity of the detected obstacles through time and to filter out false detections, which negatively impact vehicle’s automatic guidance system. The improved plane fitting algorithm and the temporal verification using depth fingerprints result in notable improvement on the challenging MODD2 dataset, by significantly reducing the amount of false positive detections. The proposed method is able to run in real time on board of a small-sized USV, which was used to acquire the MODD2 dataset as well.

Obtaining high dynamic scale radiance maps by varying illumination intensity

Mon, 01 Jan 0001 00:00:00 +0000

Od računalniškega vida k umetnemu spoznavnemu vidu

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we briefly describe an emerging new scientfic field of cognitive vision. We briefly present some main characteristics of the mature and recogni- sed field of computer vision and show some motiva- tion for its natural development into cognitive vision. We underline some characteristics of cognitive vision systems, which make them different from the classi- cal machine and computer vision systems. We also present some typical applications of cognitive vision systems and cognitive systems in general and indicate several possibilities for their employment.

On-line conservative learning for person detection

Mon, 01 Jan 0001 00:00:00 +0000

We present a novel on-line conservative learning framework for an object detection system. All algorithms operate in an on-line mode, in particular we also present a novel on-line AdaBoost method. The basic idea is to exploit a huge amount of unlabeled video data by being very conservative in selecting training examples and to start with a very simple object detection system and using reconstructive and discriminative classifiers in an iterative co-training fashion to arrive at increasingly better object detectors. We demonstrate the framework on a surveillance task where we learn person detectors that are tested on two surveillance video sequences. We start with a simple moving object classifier and proceed with incremental PCA (on shape and appearance) as a reconstructive classifier which in turn generates a training set for a discriminative on-line AdaBoost classifier.

Online Discriminative Kernel Density Estimation

Mon, 01 Jan 0001 00:00:00 +0000

We propose a new method for online estimation of probabilistic discriminative models. The method is based on the recently proposed online Kernel Density Estimation \mbox(oKDE) framework which produces Gaussian mixture models and allows adaptation using only a single data point at a time. The oKDE builds reconstructive models from the data, and we extend it to take into account the interclass discrimination through a new distance function between the classifiers. We arrive at an online discriminative Kernel Density Estimators \mbox(odKDE). We compare the odKDE to oKDE, batch state-of-the-art KDEs and support vector machine (SVM) on a standard database. The odKDE achieves comparable classification performance to that of best batch KDEs and SVM, while allowing online adaptation, and produces models of lower complexity than the oKDE.

Online Discriminative Kernel Density Estimator With Gaussian Kernels

Mon, 01 Jan 0001 00:00:00 +0000

We propose a new method for a supervised online estimation of probabilistic discriminative models for classification tasks. The method estimates the class distributions from a stream of data in form of Gaussian mixture models (GMM). The reconstructive updates of the distributions are based on the recently proposed online Kernel Density Estimator (oKDE). We maintain the number of components in the model low by compressing the GMMs from time to time. We propose a new cost function that measures loss of interclass discrimination during compression, thus guiding the compression towards simpler models that still retain discriminative properties. The resulting classifier thus independently updates the GMM of each class, but these GMMs interact during their compression through the proposed cost function. We call the proposed method the online discriminative Kernel Density Estimator (odKDE). We compare the odKDE to oKDE, batch state-of-the-art KDEs and batch/incremental support vector machines (SVM) on the publicly-available datasets. The odKDE achieves comparable classification performance to that of best batch KDEs and SVM, while allowing online adaptation from large datasets, and produces models of lower complexity than the oKDE.

Online Kernel Density Estimation For Interactive Learning

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we propose a Gaussian-kernel-based online kernel density estimation which can be used for applications of online probability density estimation and online learning. Our approach generates a Gaussian mixture model of the observed data and allows online adaptation from positive examples as well as from the negative examples. The adaptation from the negative examples is realized by a novel concept of unlearning in mixture models. Low complexity of the mixtures is maintained through a novel compression algorithm. In contrast to the existing approaches, our approach does not require fine-tuning parameters for a specific application, we do not assume specific forms of the target distributions and temporal constraints are not assumed on the observed data. The strength of the proposed approach is demonstrated with examples of online estimation of complex distributions, an example of unlearning, and with an interactive learning of basic visual concepts.

Open-source robotic manipulator and sensory platform

Mon, 01 Jan 0001 00:00:00 +0000

We present an open-source robotic platform for educational use that integrates multiple levels of interaction through the use of additional vision sensor. The environment can be used in virtual, augmented-reality and real-robot modes, enabling smooth transition from a virtual robot manipulator to a real one. We describe the main aspects of our platform that ensure low production costs and encourage openness of both its hardware and software. The main goal of of our work was to create a viable low-cost robotic manipulator platform alternative for the university level courses in intelligent robotics, however, the application domain is very broad.

Optimization framework for learning a hierarchical shape vocabulary for object class detection.

Mon, 01 Jan 0001 00:00:00 +0000

Panoramic Eigenimages for Spatial Localisation

Mon, 01 Jan 0001 00:00:00 +0000

Recent biological evidence suggests that position and orientation can be estimated from an adequately compressed set of environment snapshots and their relationships. In this paper we present a pure appearance-based localisation method using an eigenspace representation of panoramic images. We first review several types of rotational invariant representation of panoramic images in terms of their efficiency for an eigenspace-based localisation problem. Then, for each set of images an eigenspace from 25 location snapshots is built and analyzed. We evaluated simple localisation of images not included in the training set. The results show good prospects for the panoramic eigenspace approach.

Panoramic Volumes for Robot Localization

Mon, 01 Jan 0001 00:00:00 +0000

We propose a method for visual robot localization using a panoramic image volume as the representation from which we can generate views from virtual viewpoints and match them to the current view. We use a geometric image-based rendering formalism in combination with a subspace representation of images, which allows us to synthesize views at arbitrary virtual viewpoints from a compact low-dimensional representation.

PanSR: An Object-Centric Mask Transformer for Panoptic Segmentation

Mon, 01 Jan 0001 00:00:00 +0000

Panoptic segmentation is a fundamental task in computer vision and a crucial component for perception in autonomous vehicles. Recent mask-transformer-based methods achieve impressive performance on standard benchmarks but face significant challenges with small objects, crowded scenes and scenes exhibiting a wide range of object scales. We identify several fundamental shortcomings of the current approaches: (i) the query proposal generation process is biased towards larger objects, resulting in missed smaller objects, (ii) initially well-localized queries may drift to other objects, resulting in missed detections, (iii) spatially well-separated instances may be merged into a single mask causing inconsistent and false scene interpretations. To address these issues, we rethink the individual components of the network and its supervision, and propose a novel method for panoptic segmentation PanSR. PanSR effectively mitigates instance merging, enhances small-object detection and increases performance in crowded scenes, delivering a notable +3.4 PQ improvement over state-of-the-art on the challenging LaRS benchmark, while reaching state-of-the-art performance on Cityscapes. URL

Parametric Eigenspace Representations of Panoramic Images

Mon, 01 Jan 0001 00:00:00 +0000

This paper describes a novel approach for robot localization using a view-based representation with panoramic images. We propose to use a representation based on a complex basis of eigenvectors. We demonstrate that this results in a speed up of building the eigenspace and in a fast and accurate localization.

Part-Based Room Categorization for Household Service Robots

Mon, 01 Jan 0001 00:00:00 +0000

A service robot that operates in a previously-unseen home environment should be able to recognize the functionality of the rooms it visits, such as a living room, a bathroom, etc. We present a novel part-based model and an approach for room categorization using data obtained from a visual sensor. Images are represented with sets of unordered parts that are obtained by object-agnostic region proposals, and encoded using state-of-the-art image descriptor extractor — a convolutional neural network (CNN). An approach is proposed that learns category-specific discriminative parts for the part-based model. The proposed approach was compared to the state-of-the-art CNN trained specifically for place recognition. Experimental results show that the proposed approach outperforms the holistic CNN by being robust to image degradation, such as occlusions, modifications of image scaling, and aspect changes. In addition, we report non-negligible annotation errors and image duplicates in a popular dataset for place categorization and discuss annotation ambiguities.

Performance Evaluation Methodology for Long-Term Single Object Tracking

Mon, 01 Jan 0001 00:00:00 +0000

A long-term visual object tracking performance evaluation methodology and a benchmark are proposed. Performance measures are designed by following a long-term tracking definition to maximize the analysis probing strength. The new measures outperform existing ones in interpretation potential and in better distinguishing between different tracking behaviors. We show that these measures generalize the short-term performance measures, thus linking the two tracking problems. Furthermore, the new measures are highly robust to temporal annotation sparsity and allow annotation of sequences hundreds of times longer than in the current datasets without increasing manual annotation labor. A new challenging dataset of carefully selected sequences with many target disappearances is proposed. A new tracking taxonomy is proposed to position trackers on the short-term/long-term spectrum. The benchmark contains an extensive evaluation of the largest number of long-term trackers and comparison to state-of-the-art short-term trackers. We analyze the influence of tracking architecture implementations to long-term performance and explore various re-detection strategies as well as influence of visual model update strategies to long-term tracking drift. The methodology is integrated in the VOT toolkit to automate experimental analysis and benchmarking and to facilitate future development of long-term trackers.

Physics-Based Modelling of Human Motion using Kalman Filter and Collision Avoidance Algorithm

Mon, 01 Jan 0001 00:00:00 +0000

The paper deals with the problem of computer vision based multi-Peršon motion tracking, which in many cases suffers from lack of discriminating features of observed Peršons. To solve this problem, a physics based model of human motion is proposed, which includes intertial forces of the Peršons by the means of the Kalman filter, and the cylindrical envelopes, which produce collision avoiding forces when observed Peršons come to close proximity. We tested the proposed method on two sequences, one from squash match, and the other from the basketball play and found out that the number of tracker mistakes significantly decreased.

Pregled programskih orodij za globoko učenje z vidika uporabe v industrijskih aplikacijah

Mon, 01 Jan 0001 00:00:00 +0000

Globoko učenje je prineslo revolucionarne spremembe na področju računalniškega vida in si utira svojo pot tudi na področje industrijskega strojnega vida. V tem članku predstavljamo šest najbolj poznanih orodij za delo z globokimi arhitekturami: Caffe, Torch, Theano, MatConvNet, TensorFlow in Keras. Predstavili bomo njihove glavne značilnosti tako z vidika razvoja kot integracije v industrijske aplikacije.

Prepletanje umetne inteligence in fizike pri napovedovanju obalnih poplav

Mon, 01 Jan 0001 00:00:00 +0000

Podnebne spremembe prek številnih mehanizmov povzročajo dvig srednje gladine globalnih oceanov, kar velja tudi za Slovensko morje. Modelske projekcije rasti globalne gladine morja predvidevajo, da bo do leta 2050 srednja gladina morja v Tržaškem zalivu najverjetneje narasla za 30 do 50 centimetrov, do konca stoletja pa za 40 do 100 cm. To pomeni, da naj bi do srede stoletja pogostost poplav narasla 10 do 20-krat, do konca stoletja pa naj bi bile poplave tudi do dvestokrat bolj pogoste. Napovedovanje poplav je zaradi specifike jadranskega bazena izredno zahtevno, saj vključuje simulacijo razvoja atmofserskefa modela in modela morja. V članku razložimo pristop k napovedovanju z globoko nevronsko mrežo, ki dosega ali presega natančnost napovedi fizikalnega modela.

Probabilistic Combination of Visual Context Based Attention and Object Detection

Mon, 01 Jan 0001 00:00:00 +0000

Visual context provides cues about an object’s presence, position and size within the observed scene, which are used to increase the performance of object detection techniques. However, state-of-the-art methods for context aware object detection could decrease the initial performance. We discuss the reasons for failure and propose a concept that overcomes these limitations. Therefore, we introduce the prior probability function of an object detector, that maps the detector’s output to probabilities. Together, with an appropriate contextual weighting a probabilistic framework is established. In addition, we present an extension to state-of-the-art methods to learn scale-dependent visual context information and show how this increases the initial performance. The standard methods and our proposed extensions are compared on a novel demanding image data set.

Probabilistic tracking using optical flow to resolve color ambiguities

Mon, 01 Jan 0001 00:00:00 +0000

Color-based tracking is prone to failure in situations where visually similar targets are moving in close proximity to each other. To deal with the ambiguities in color information we propose an additional color-independent feature based on the target’s local motion, which is calculated from the optical flow induced by the target in consecutive images. By modifying a color-based particle filter to account for the target’s local-motion, the hybrid color/local-motion-based tracker is constructed. The hybrid tracker was compared to a purely color-based tracker on a challenging data-set that involved near-collisions and complete occlusions between visually similar Peršons. The optical flow was estimated using a robust and a nonrobust method. The experiments show that even if a nonrobust method is used to estimate the optical flow, the local-motion feature largely resolves ambiguities caused by the visual similarity between Peršons.

Prototipi značilk za adaptivno zaznavanje ovir na vodni površini

Mon, 01 Jan 0001 00:00:00 +0000

Unmanned surface vehicles (USV) rely on robust perception methods for obstacle detection. Current segmentation-based state-of-the-art methods lack the desired robustness and generalization capabilities required to adapt to new situations. To address this, we design WaSR-AD, a network with an explicit adaptation capability based on class prototypes. Initial prototypes are extracted during training and adapted during inference in an online fashion. The adapted prototypes are used to enrich the image features with additional adaptive context. Evaluation on the MODS benchmark reveals that such explicit adaptation of the prototypes significantly improves the detection performance, achieving 14% lower water segmentation error and 3.6% F1-score increase inside the critical 15m danger-zone area around the boat, with a negligible cost in inference time.

Quality of region proposals in traffic sign detection and recognition

Mon, 01 Jan 0001 00:00:00 +0000

Range image acquisition of objects with non-uniform albedo using structured light range sensor

Mon, 01 Jan 0001 00:00:00 +0000

Razlike v opravljeni poti in povprečni hitrosti gibanja med različnimi tipi košarkarjev

Mon, 01 Jan 0001 00:00:00 +0000

V članku obravnavamo problem obremenitve košarkarjev na tekmah. Osnovni cilj raziskave je ugotavljanje intenzivnosti in obsega gibanja košarkarjev s pomočjo merilnega sistema SAGIT. Gre za razmeroma novo tehnologijo, ki temelji na metodah računalniškega vida in omogoča avtomatsko pridobivanje podatkov iz video posnetkov tekem. S pomočjo omenjenega sistema smo ugotavljali opravljeno pot in povprečno hitrost gibanja košarkarjev na treh tekmah končnice državnega prvenstva Slovenije za člane med ekipama Union Olimpija in Geoplin Slovan v sezoni 2004/05. Omenjene parametre smo ugotavljali za skupno 22 košarkarjev, ki so v posameznem polčasu tekme igrali vsaj 200 sekund. Glede na to, da v košarki poznamo več različnih tipov igralcev, ki imajo različne vloge v igri, smo opravljeno pot in povprečno hitrost gibanja izračunali za tri osnovne tipe igralcev (branilce, krila in centre) in s pomočjo enosmerne analize variance ugotavljali razlike med njimi. Ugotovili smo, da v aktivnem delu igre (ko ura za merjenje igralnega časa teče) v enem polčasu oz. 20 minutah igralci v povprečju opravijo pot dolgo 2227 metrov, v pasivnem delu pa še dodatnih 920 metrov. Povprečna hitrost gibanja igralcev v aktivnem delu igre znaša 1,84 m/s. Kar se tiče posameznih tipov igralcev lahko ugotovimo, da v aktivni fazi igre najdaljšo pot v povprečju opravijo branilci (2300 m), sledijo jim krila (2246 m) in nato centri (2118 m). Razlike med posameznimi tipi igralcev so statistično značilne na nivoju 1% napake. Enako velja tudi za povprečno hitrost gibanja, pri čemer se branilci gibljejo s povprečno hitrostjo 1,92 m/s, krila 1,87 m/s, centri pa 1,74 m/s.

Recognition of Multi-Agent Activities with Petri Nets

Mon, 01 Jan 0001 00:00:00 +0000

Reconstruction by inpainting for visual anomaly detection

Mon, 01 Jan 0001 00:00:00 +0000

Visual anomaly detection addresses the problem of classification or localization of regions in an image that deviate from their normal appearance. A popular approach trains an auto-encoder on anomaly-free images and performs anomaly detection by calculating the difference between the input and the reconstructed image. This approach assumes that the auto-encoder will be unable to accurately reconstruct anomalous regions. But in practice neural networks generalize well even to anomalies and reconstruct them sufficiently well, thus reducing the detection capabilities. Accurate reconstruction is far less likely if the anomaly pixels were not visible to the auto-encoder. We thus cast anomaly detection as a self-supervised reconstruction-by-inpainting problem. Our approach (RIAD) randomly removes partial image regions and reconstructs the image from partial inpaintings, thus addressing the drawbacks of auto-enocoding methods. RIAD is extensively evaluated on several benchmarks and sets a new state-of-the art on a recent highly challenging anomaly detection benchmark.

Relevance Determination for Learning Vector Quantization using the Fisher Criterion Score

Mon, 01 Jan 0001 00:00:00 +0000

Two new feature relevance determination algorithms are proposed for learning vector quanti- zation. The algorithms exploit the positioning of the prototype vectors in the input feature space to esti- mate Fisher criterion scores for the input dimensions during training. These scores are used to form online estimates of weighting factors for an adaptive metric that accounts for dimensional relevance with respect to classifier output. The methods offer theoretical advantages over previously proposed LVQ relevance determination techniques based on gradient descent, as well as performance advantages as demonstrated in experiments on various datasets including a visual dataset from a cognitive robotics object affordance learning experiment.

Robust and efficient vision system for group of cooperating mobile robots with application to soccer robots

Mon, 01 Jan 0001 00:00:00 +0000

Robust continuous subspace learning and recognition

Mon, 01 Jan 0001 00:00:00 +0000

Robust estimation of canonical correlation coefficients

Mon, 01 Jan 0001 00:00:00 +0000

Canonical Correlation Analysis is well suited for regression tasks in appearance-based approach to modelling of objects and scenes. However, since it relies on the standard projection it is inherently non-robust. In this paper we propose to embed the estimation of CCA coefficients in an augmented PCA space, which enables detection of outliers and preserves regression-relevant information enabling robust estimation of canonical correlation coefficients.

Robust Localization using an Omnidirectional Appearance-based Subspace Model of Environment

Mon, 01 Jan 0001 00:00:00 +0000

Appearance-based visual learning and recognition techniques that are based on models derived from a training set of 2D images are being widely used in computer vision applications. In robotics, they have received most attention in visual servoing and navigation. In this paper we discuss a framework for visual self-localization of mobile robots using a parametric model built from panoramic snapshots of the environment. In particular, we propose solutions to the problems related to robustness against occlusions and invariance to the rotation of the sensor. Our principal contribution is an ``eigenspace of spinning-images’’, i.e., a model of the environment which successfully exploits some of the specific properties of panoramic images in order to efficiently calculate the optimal subspace in terms of principal components analysis (PCA) of a set of training snapshots without actually decomposing the covariance matrix. By integrating a robust recover-and-select algorithm for the computation of image parameters we achieve reliable localization even in the case when the input images are partly occluded or noisy. In this way, the robot is capable of localizing itself in realistic environments.

Robust Localization using Eigenspace of Spinning-Images

Mon, 01 Jan 0001 00:00:00 +0000

Under in-plane rotations of a panoramic camera, the information content of a panoramic image is, in general, preserved. However, different representations that can be derived have important implications on further processing, e.g. for appearance-based localisation. We discuss several approaches based on different representations that have been proposed and evaluate them from different points-of-view, in particular, we argue that most of them are not suitable for robust localization under partially occluded views. In this paper we propose a representation-eigenspace of spinning-images-which enables a straightforward application of the robust estimation of eigenimage coefficients which is directly related to the localization.

Robust Localization using Panoramic View-Based Recognition

Mon, 01 Jan 0001 00:00:00 +0000

The results of earlier studies on the possibility of spatial localization from panoramic images have shown good prospects for view-based methods. The major advantages of these methods are a wide field-of-view, capability of modeling cluttered environments, and flexibility in the learning phase. The redundant information captured in similar views is efficiently handled by the eigenspace approach. However, the standard approaches are sensitive to noise and occlusion. We present a method of view-based localization in a robust framework that solves these problems to a large degree. Experimental results on a large set of real panoramic images demonstrate the effectiveness of the approach and the level of achieved robustness.

Robust Recognition and Pose Determination of 3-D Objects Using Range Images in Eigenspace Approach

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we propose a robust method for recognition and pose determination of 3-D objects using range images in the eigenspace approach. Instead of computing the coefficients by a projection of the data onto the eigenimages, we determine the coefficients by solving a set of linear equations in a robust manner. The method efficiently overcomes the problem of missing pixels, noise and occlusions in range images. The results show that the proposed method outperforms the standard one in recognition and pose determination.

Robust subspace approaches to visual learning and recognition

Mon, 01 Jan 0001 00:00:00 +0000

In the real world, visual learning is supposed to be a robust and continuous process. All available visual data is not equally important; in the case of occlusions or other undesirable intrusions in the field of view some visual data can even be misleading. Human visual system treats visual data selectively and builds efficient representations of observed objects and scenes even in non-ideal conditions. Furthermore, these representations can afterwards be updated with newly acquired information, thus adapting to the changing world. In this dissertation we study these premises and propose several methods, which introduce similar principles in the machine visual learning and recognition as well. We approach visual learning by the appearance-based modeling of objects and scenes. Models are built using principal component analysis (PCA), which has several shortcomings with respect to the premises mentioned above. In order to overcome these shortcomings, we propose several extensions of the standard PCA. PCA-based learning is traditionally performed in a batch mode, thus requiring all training images to be given in advance. Since this is not admissible in the framework of continuous learning, we propose an incremental method, which processes images sequentially one by one and updates the representation at each step accordingly. Each image can be discarded immediately after the model is updated, which makes the method perfectly well suited for real on-line scenarios. In addition, in the standard PCA approach all pixels of an image receive equal treatment. Also, all training images have equal influence on the estimation of principal subspace. In this dissertation, we present a generalized PCA approach, which estimates principal axes and principal components considering weighted pixels and images. We further extend this weighted approach into a method for learning from incomplete data, which builds the model of an object even when the part of input data is missing. Images of objects and scenes are not always ideal and as such they may contain various deceptive additions like reflections or occlusions. PCA in its standard form is intrinsically non-robust to such non-gaussian noise. Several methods for robust recognition have already been proposed, however robust learning has been tackled very rarely. In the dissertation we introduce a novel approach to the robust subspace learning. The proposed batch and incremental methods detect inconsistencies in the training images and build the representations from consistent data only. As a result, the obtained models are more robust and efficient enabling more reliable visual learning and recognition even when the learning conditions are not ideal. In the dissertation we derive all the methods mentioned above and present suitable algorithms. We also experimentally evaluate all the proposed algorithms on different image domains and determine the applicability of the methods in different scenarios.

Robust Visual Tracking using an Adaptive Coupled-layer Visual Model

Mon, 01 Jan 0001 00:00:00 +0000

This paper addresses the problem of tracking objects which undergo rapid and significant appearance changes. We propose a novel coupled-layer visual model that combines the target’s global and local appearance by interlacing two layers. The local layer in this model is a set of local patches that geometrically constrain the changes in the target’s appearance. This layer probabilistically adapts to the target’s geometric deformation, while its structure is updated by removing and adding the local patches. The addition of these patches is constrained by the global layer that probabilistically models target’s global visual properties such as color, shape and apparent local motion. The global visual properties are updated during tracking using the stable patches from the local layer. By this coupled constraint paradigm between the adaptation of the global and the local layer, we achieve a more robust tracking through significant appearance changes. We experimentally compare our tracker to eleven state-of-the-art trackers. The experimental results on challenging sequences confirm that our tracker outperforms the related trackers in many cases by having smaller failure rate as well as better accuracy. Furthermore, the parameter analysis shows that our tracker is stable over a range of parameter values.

Robust visual tracking using template anchors

Mon, 01 Jan 0001 00:00:00 +0000

Deformable part models exhibit excellent performance in tracking non-rigidly deforming targets, but are usually outperformed by holistic models when the target does not deform or in the presence of uncertain visual data. The reason is that part-based models require estimation of a larger number of parameters compared to holistic models and since the updating process is self-supervised, the errors in parameter estimation are amplified with time, leading to a faster accuracy reduction than in holistic models. On the other hand, the robustness of part-based trackers is generally greater than in holistic trackers. We address the problem of self-supervised estimation of a large number of parameters by introducing controlled graduation in estimation of the free parameters. We propose decomposing the visual model into several sub-models, each describing the target at a different level of detail. The sub-models interact during target localization and, depending on the visual uncertainty, serve for cross-sub-model supervised updating. A new tracker is proposed based on this model which exhibits the qualities of part-based as well as holistic models. The tracker is tested on the highly-challenging VOT2013 and VOT2014 benchmarks, outperforming the state-of-the-art.

Robustno vizualno učenje na podlagi podprostorov

Mon, 01 Jan 0001 00:00:00 +0000

Vizualno uèenje, tj. uèenje iz vizualnih podatkov, mora biti robusten in kontinuiran proces. Vsi razpoložljivi vizualni podatki namreè niso enako pomembni; v primeru prekrivanj in drugih nezaželenih motenj v vidnem polju so lahko nekateri celo zavajajoèi. Èloveški vizualni sistem obravnava vizualne podatke selektivno in zgradi uèinkovite predstavitve opazovanih predmetov in prizorov tudi v slabih pogojih. Te predstavitve lahko nato še posodablja z na novo pridobljenimi informacijami in jih tako prilagaja spremembam. V tem èlanku predstavljamo veè metod, ki uvedejo podobne principe tudi na podroèje strojnega vizualnega uèenja in razpoznavanja. Vizualno uèenje je realizirano z modeliranjem osnovanim na direktnem videzu predmetov in prizorov. Gradnja modelov temelji na metodi glavnih komponent (PCA), ki pa ima v svoji standardni izvedbi pomanjkljivosti, ki onemogoèajo uveljavitev prej omenjenih naèel. Za premostitev teh pomanjkljivosti smo predlagali veè razširitev standardne metode glavnih komponent, tj. metode za inkrementalno, uteženo in robustno uèenje. Predlagane metode smo tudi ovrednotili na razliènih slikovnih domenah. Iz rezultatov je razvidna uporabnost metod za vizualno uèenje in razpoznavanje v razliènih primerih.

ROC analysis of classifiers in machine learning : a survey

Mon, 01 Jan 0001 00:00:00 +0000

The use of ROC (Receiver Operating Characteristics) analysis as a tool for evaluating the performance of classification models in machine learning has been increasing in the last decade. Among the most notable advances in this area are the extension of two-class ROC analysis to the multi-class case as well as the employment of ROC analysis in cost-sensitive learning. Methods now exist which take instance-varying costs into account. The purpose of our paper is to present a survey of this field with the aim of gathering important achievements in one place. In the paper, we present application areas of the ROC analysis in machine learning, describe its problems and challenges and provide a summarized list of alternative approaches to ROC analysis. In addition to presented theory, we also provide a couple of examples intended to illustrate the described approaches.

Room Categorization Based on a Hierarchical Representation of Space

Mon, 01 Jan 0001 00:00:00 +0000

For successful operation in real-world environments, a mobile robot requires an effective spatial model. The model should be compact, should possess large expressive power and should scale well with respect to the number of modelled categories. In this paper we propose a new compositional hierarchical representation of space that is based on learning statistically significant observations, in terms of the frequency of occurrence of various shapes in the environment. We have focused on a two-dimensional space, since many robots perceive their surroundings in two dimensions with the use of a laser range finder or sonar. We also propose a new low-level image descriptor, by which we demonstrate the performance of our representation in the context of a room categorization problem. Using only the lower layers of the hierarchy, we obtain state-of-the-art categorization results in two different experimental scenarios. We also present a large, freely available, dataset, which is intended for room categorization experiments based on data obtained with a laser range finder.

Room Classification using a Hierarchical Representation of Space

Mon, 01 Jan 0001 00:00:00 +0000

SALAD -- Semantics-Aware Logical Anomaly Detection

Mon, 01 Jan 0001 00:00:00 +0000

Recent surface anomaly detection methods excel at identifying structural anomalies, such as dents and scratches, but struggle with logical anomalies, such as irregular or missing object components. The best-performing logical anomaly detection approaches rely on aggregated pretrained features or handcrafted descriptors (most often derived from composition maps), which discard spatial and semantic information, leading to suboptimal performance. We propose SALAD, a semantics-aware discriminative logical anomaly detection method that incorporates a newly proposed composition branch to explicitly model the distribution of object composition maps, consequently learning important semantic relationships. Additionally, we introduce a novel procedure for extracting composition maps that requires no hand-made labels or category-specific information, in contrast to previous methods. By effectively modelling the composition map distribution, SALAD significantly improves upon state-of-the-art methods on the standard benchmark for logical anomaly detection, MVTec LOCO, achieving an impressive image-level AUROC of 96.1%. URL

Segmentation-Based Deep-Learning Approach for Surface-Defect Detection

Mon, 01 Jan 0001 00:00:00 +0000

Automated surface-anomaly detection using machine learning has become an interesting and promising area of research, with a very high and direct impact on the application domain of visual inspection. Deep-learning methods have become the most suitable approaches for this task. They allow the inspection system to learn to detect the surface anomaly by simply showing it a number of exemplar images. This paper presents a segmentation-based deep-learning architecture that is designed for the detection and segmentation of surface anomalies and is demonstrated on a specific domain of surface-crack detection. The design of the architecture enables the model to be trained using a small number of samples, which is an important requirement for practical applications. The proposed model is compared with the related deep-learning methods, including the state-of-the-art commercial software, showing that the proposed approach outperforms the related methods on the specific domain of surface-crack detection. The large number of experiments also shed light on the required precision of the annotation, the number of required training samples and on the required computational cost. Experiments are performed on a newly created dataset based on a real-world quality control case and demonstrates that the proposed approach is able to learn on a small number of defected surfaces, using only approximately 25-30 defective training samples, instead of hundreds or thousands, which is usually the case in deep-learning applications. This makes the deep-learning method practical for use in industry where the number of available defective samples is limited. The dataset is also made publicly available to encourage the development and evaluation of new methods for surface-defect detection.

Sekvenčne Monte Carlo metode za sledenje oseb v računalniškem vidu

Mon, 01 Jan 0001 00:00:00 +0000

People tracking is a part of a broad domain of computer vision, that has received a great attention from researchers over the last twenty years. An interesting aspect of the problem of tracking originates from the field of control theory and considers the object being tracked as a dynamical system with a hidden state, of which only the current measurements are available and observed. The classical methods that were used in the past to tackle this problem employed Kalman filters and their derivatives. These generally assume a Gaussian linear dynamical and measurement model, assumptions, which are usually too restrictive for the majority of natural processes. In the late 90’s, the advances in the sequential Monte Carlo methods on various fields of science gave rise to a family of methods that effectively deal with problems of this kind. Their main advantage over the Kalman filter is that they do not impose as restrictive assumptions and can be relatively easily implemented. In computer vision, the sequential Monte Carlo methods, also known as particle filters, became extremely popular with the introduction of the Condensation algorithm. Since then, a body of literature has been published regarding these methods. This thesis is dedicated to the problem of tracking people by means of sequential Monte Carlo methods, application of which is demonstrated on a system for tracking players in team sports. We first consider the problem of tracking in the context of statistical estimation and present the main parts of the Monte Carlo solutions. The well known Condensation algorithm, which comprises the central part of all the trackers presented here, is introduced as a sequential Monte Carlo method and a simple algorithm to track one player is presented. By considering a team sport in the context of a closed world, a set of assumptions that depicts a typical match is derived. Following these assumptions, a more robust single-player tracker is developed and then extended to the case of multiple players. Finally, two variants of trackers for tracking multiple players in the closed worlds are presented. A number of experiments are reported to evaluate the performance of the trackers and based on the results, the most suitable multi-player tracker is chosen. We also point out some guidelines for future development of the application for tracking multiple players.

Selecting features for object detection using an AdaBoost-compatible evaluation function

Mon, 01 Jan 0001 00:00:00 +0000

This paper addresses the problem of selecting features in a visual object detection setup where a detection algorithm is applied to an input image represented by a set of features. The set of features to be employed in the test stage is prepared in two training-stage steps. In the first step, a feature extraction algorithm produces a (possibly large) initial set of features. In the second step, on which this paper focuses, the initial set is reduced using a selection procedure. The proposed selection procedure is based on a novel evaluation function that measures the utility of individual features for a certain detection task. Owing to its design, the evaluation function can be seamlessly embedded into an AdaBoost selection framework. The developed selection procedure is integrated with state-of-the-art feature extraction and object detection methods. The presented system was tested on five challenging detection setups. In three of them, a fairly high detection accuracy was effected by as few as six features selected out of several hundred initial candidates.

Self-Supervised Cross-Modal Online Learning of Basic Object Affordances for Developmental Robotic Systems

Mon, 01 Jan 0001 00:00:00 +0000

For a developmental robotic system to function successfully in the real world, it is important that it be able to form its own internal representations of affordance classes based on observable regularities in sensory data. Usually successful classifiers are built using labeled training data, but it is not always realistic to assume that labels are available in a developmental robotics setting. There does, however, exist an advantage in this setting that can help circumvent the absence of labels: co-occurrence of correlated data across separate sensory modalities over time. The main contribution of this paper is an online classifier training algorithm based on Kohonenâ?s learning vector quantization (LVQ) that, by taking advantage of this co- occurrence information, does not require labels during training, either dynamically generated or otherwise. We evaluate the algorithm in experiments involving a robotic arm that interacts with various household objects on a table surface where camera systems extract features for two separate visual modalities. It is shown to improve its ability to classify the affordances of novel objects over time, coming close to the performance of equivalent fully-supervised algorithms.

Self-understanding and self-extension: a systems and representational approach

Mon, 01 Jan 0001 00:00:00 +0000

There are many different approaches to building a system that can engage in autonomous mental development. In this paper we present an approach based on what we term \em self-understanding, by which we mean the use of explicit representation of and reasoning about what a system does and doesn’t know, and how that understanding changes under action. We present a coherent architecture and a set of representations used in two robot systems that exhibit a limited degree of autonomous mental development, what we term \em self-extension. The contributions include: representations of gaps and uncertainty for specific kinds of knowledge, and a motivational and planning system for setting and achieving learning goals.

Similarity-based cross-layered hierarchical representation for object categorization.

Mon, 01 Jan 0001 00:00:00 +0000

Sledenje objektov s kvadrokopterjem z gibljivo kamero

Mon, 01 Jan 0001 00:00:00 +0000

Sledenje objektov v robotskem nogometu

Mon, 01 Jan 0001 00:00:00 +0000

Sledenje objektov v robotskem nogometu

Mon, 01 Jan 0001 00:00:00 +0000

Robotski nogomet je visoko tehnološki šport, ki ga je leta 1995 na korejskem tehnološkem inštitutu razvil profesor Jong-Hwan Kim kot večnamensko okolje za učenje in testiranje aplikacij analize slik, umetne inteligence, senzorjev, komunikacij itd. V zadnjih osmih letih je robotski nogomet doživel velik razmah tako v sklopu zabavne elektronike kot v sklopu testiranja in razvoja novih tehnologij. Danes obstajata dve mednarodni zvezi robotskega nogometa in sicer Robocup in FIRA (Federation of International Robot Association). Vsaka izmed obeh zvez organizira ločena tekmovanja v različnih kategorijah, kategorija pa določa lastnosti izvedbe tekme in sicer od čistih simulacij na računalniku preko mikrorobotov do humanoidnih robotov. Na Fakulteti za Elektrotehniko v Ljubljani so se z robotskim nogometom kategorije MiroSot pričeli ukvarjati leta 2000, in ga poimenovali Robobrc. Robobrc deluje v dveh različicah kategorije MiroSot, ki se razlikujeta le v številu igralcev in dimenzijah igrišča. V prvi različici vključuje vsak tim po tri igralce (igra treh igralcev ali mala liga), v drugi različici pa po pet (igra petih igralcev ali srednja liga). S prehodom iz igre treh na igro petih igralcev se je pojavila potreba po učinkovitem sledilniku, ki bi ločil večje število barv in sledil desetim robotkom in žogici v realnem času. V diplomski nalogi je obravnavana aplikacija sledenja robotkov pri tekmah Robobrca, zato bomo najprej v uvodu predstavili le tista pravila obeh različic igre, ki so pomembna za nadzorni sistem računalniškega vida. V nadaljevanju bomo predstavili če področje računalniškega vida in sledenja objektov, kjer bomo podali kratek pregled literature o sledenju v športu in robotskem nogometu.

Sledenje več igralcev v športnih igrah na podlagi vizualne informacije

Mon, 01 Jan 0001 00:00:00 +0000

V članku je predstavljen sledilnik za sledenje več igralcev v dvoranskih športnih igrah, kot sta rokomet in košarka, na podlagi vizualne informacije, pridobljene s kamero, nameščeno nad igriščem. Sledenje posameznega igralca je postavljeno v kontekst Bayesovega filtriranja za rekurzivno ocenjevanje a posteriori porazdelitve stanja tarče in temelji na metodah filtrov z delci. V članku sta obdelana dva glavna dela sledilnika: sledilnik za sledenje posameznega igralca in mehanizem za sledenje več vizualno podobnih igralcev. V okviru slednjega mehanizma predlagamo originalno rešitev, kjer sliko v vsakem časovnem koraku razdelimo v take neprekrivajoče se regije, da vsaka vsebuje le po enega igralca, ter tako dosežemo poenostavitev problema sledenja več tarč, kadar med vizualno podobnimi tarčami prihaja do trkov. Predlagani sledilnik smo primerjali z nerobustnim sledilnikom, ki ni vseboval mehanizma za obvladovanje situacij, ko med tarčami prihaja do trkov. Ugotovili smo, da predlagani mehanizem zmanjša število potrebnih intervencij operaterja in tako omogoča robustno in hitro obdelavo velike količine videopodatkov.

Spatially-Adaptive Filter Units for Compact and Efficient Deep Neural Networks

Mon, 01 Jan 0001 00:00:00 +0000

Convolutional neural networks excel in a number of computer vision tasks. One of their most crucial architectural elements is the effective receptive field size, which has to be manually set to accommodate a specific task. Standard solutions involve large kernels, down/up-sampling and dilated convolutions. These require testing a variety of dilation and down/up-sampling factors and result in non-compact networks and large number of parameters. We address this issue by proposing a new convolution filter composed of displaced aggregation units (DAU). DAUs learn spatial displacements and adapt the receptive field sizes of individual convolution filters to a given problem, thus reducing the need for hand-crafted modifications. DAUs provide a seamless substitution of convolutional filters in existing state-of-the-art architectures, which we demonstrate on AlexNet, ResNet50, ResNet101, DeepLab and SRN-DeblurNet. The benefits of this design are demonstrated on a variety of computer vision tasks and datasets, such as image classification (ILSVRC 2012), semantic segmentation (PASCAL VOC 2011, Cityscape) and blind image de-blurring (GOPRO). Results show that DAUs efficiently allocate parameters resulting in up to 4× more compact networks in terms of the number of parameters at similar or better performance.

Spatially-Adaptive Filter Units for Deep Neural Networks

Mon, 01 Jan 0001 00:00:00 +0000

Classical deep convolutional networks increase receptive field size by either gradual resolution reduction or application of hand-crafted dilated convolutions to prevent increase in the number of parameters. In this paper we propose a novel displaced aggregation unit (DAU) that does not require hand-crafting. In contrast to classical filters with units (pixels) placed on a fixed regular grid, the displacement of the DAUs are learned, which enables filters to spatially-adapt their receptive field to a given problem. We extensively demonstrate the strength of DAUs on a classification and semantic segmentation tasks. Compared to ConvNets with regular filter, ConvNets with DAUs achieve comparable performance at faster convergence and up to 3-times reduction in parameters. Furthermore, DAUs allow us to study deep networks from novel perspectives. We study spatial distributions of DAU filters and analyze the number of parameters allocated for spatial coverage in a filter.

Stereo obstacle detection for unmanned surface vehicles by IMU-assisted semantic segmentation

Mon, 01 Jan 0001 00:00:00 +0000

A new obstacle detection algorithm for unmanned surface vehicles (USVs) is presented. A state-of-the-art graphical model for semantic segmentation is extended to incorporate boat pitch and roll measurements from the on-board inertial measurement unit (IMU), and a stereo verification algorithm that consolidates tentative detections obtained from the segmentation is proposed. The IMU readings are used to estimate the location of horizon line in the image, which automatically adjusts the priors in the probabilistic semantic segmentation model. We derive the equations for projecting the horizon into images, propose an efficient optimization algorithm for the extended graphical model, and offer a practical IMU–camera–USV calibration procedure. Using an USV equipped with multiple synchronized sensors, we captured a new challenging multi-modal dataset, and annotated its images with water edge and obstacles. Experimental results show that the proposed algorithm significantly outperforms the state of the art, with nearly 30% improvement in water-edge detection accuracy, an over 21% reduction of false positive rate, an almost 60% reduction of false negative rate, and an over 65% increase of true positive rate, while its Matlab implementation runs in real-time.

Superpixel Segmentation for Robust Visual Tracking

Mon, 01 Jan 0001 00:00:00 +0000

Tailgating Detection Using Histograms of Optical Flow

Mon, 01 Jan 0001 00:00:00 +0000

Teaching Intelligent Robotics with a Low-Cost Mobile Robot Platform

Mon, 01 Jan 0001 00:00:00 +0000

In this short paper we present the requirements and implementation of a mobile robot platform to be used for teaching intelligent robotic classes. We report our experience of using the platform in university courses and various extracurricular activities.

Teaching with open-source robotic manipulator

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we present and evaluate the usage of an open-source robotic manipulator platform, that we have developed, in the context of various educational scenarios that we have conducted. The system was tested in multiple diverse learning scenarios, ranging from a summer school for primary-school students, to the course at the university level study. We show that the introduction of the system in the educational process improves the motivation as well as acquired knowledge of the participants.

Temporal Context for Robust Maritime Obstacle Detection

Mon, 01 Jan 0001 00:00:00 +0000

Robust maritime obstacle detection is essential for fully autonomous unmanned surface vehicles (USVs). The currently widely adopted segmentation-based obstacle detection methods are prone to misclassification of object reflections and sun glitter as obstacles, producing many false positive detections, effectively rendering the methods impractical for USV navigation. However, water-turbulence-induced temporal appearance changes on object reflections are very distinctive from the appearance dynamics of true objects. We harness this property to design WaSR-T, a novel maritime obstacle detection network, that extracts the temporal context from a sequence of recent frames to reduce ambiguity. By learning the local temporal characteristics of object reflection on the water surface, WaSR-T substantially improves obstacle detection accuracy in the presence of reflections and glitter. Compared with existing single-frame methods, WaSR-T reduces the number of false-positive detections by 41% overall and by over 53% within the danger zone of the boat, while preserving a high recall, and achieving new state-of-the-art performance on the challenging MODS maritime obstacle detection benchmark.

Temporal Segmentation of Group Motion using Gaussian Mixture Models

Mon, 01 Jan 0001 00:00:00 +0000

This paper presents a new trajectory-based approach for probabilistic temporal segmentation of team sports. The probabilistic game model is applied to the player-trajectory data in order to segment individual game instants into one of the three game phases (offensive game, defensive game and time-outs) and a nonlinear or Gaussian smoothing kernel is used to enforce the temporal continuity of the game. The presented approach is compared to the Support Vector Machine (SVM) classifier on three basketball and three handball matches. The obtained results suggest that our approach is general and robust and as such could be applied to various team sports. It can handle unusual game situations such as player exclusions, substitution or injuries which may happen during the game.

Testing computer vision algorithms over World Wide Web

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we explore different possibilities of using the Internet for making algorithms publicly available. We describe how to build an interactive client/server application which uses World Wide Web for communication. The client program is a Java applet. The server program works on the server as a CGI program which is started by the HTTP server on the demand of the client. The data transfered between the client and the server program passes also through the HTTP server as the HTTP protocol is used for data transfer. A stand-alone program for image segmentation was transformed into the Java-client/CGI-server application, which can now be used as a service on the World Wide Web.

The Eighth Visual Object Tracking VOT2020 Challenge Results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking challenge VOT2020 is the eighth annual tracker benchmarking activity organized by the VOT initiative. Results of 58 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in the recent years. The VOT2020 challenge was composed of five sub-challenges focusing on different tracking domains: (i) VOT-ST2020 challenge focused on short-term tracking in RGB, (ii) VOT-RT2020 challenge focused on real-time’ short-term tracking in RGB, (iii) VOT-LT2020 focused on long-term tracking namely coping with target disappearance and reappearance, (iv) VOT-RGBT2020 challenge focused on short-term tracking in RGB and thermal imagery and (v) VOT-RGBD2020 challenge focused on long-term tracking in RGB and depth imagery. Only the VOT-ST2020 datasets were refreshed. A significant novelty is introduction of a new VOT short-term tracking evaluation methodology, and introduction of segmentation ground truth in the VOT-ST2020 challenge – bounding boxes will no longer be used in the VOT-ST challenges. A new VOT Python toolkit that implements all these novelites was introduced. Performance of the tested trackers typically by far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.

The MaSTr1325 dataset for training deep USV obstacle detection models

Mon, 01 Jan 0001 00:00:00 +0000

The progress of obstacle detection via semantic segmentation on unmanned surface vehicles (USVs) has been significantly lagging behind the developments in the related field of autonomous cars. The reason is the lack of large curated training datasets from USV domain required for development of data-hungry deep CNNs. This paper addresses this issue by presenting MaSTr1325, a marine semantic segmentation training dataset tailored for development of obstacle detection methods in small-sized coastal USVs. The dataset contains 1325 diverse images captured over a two year span with a real USV, covering a range of realistic conditions encountered in a coastal surveillance task. The images are per-pixel semantically labeled. The dataset exceeds previous attempts in this domain in size, scene complexity and domain realism. In addition, a dataset augmentation protocol is proposed to address slight appearance differences of the images in the training set and those in deployment. The accompanying experimental evaluation provides a detailed analysis of popular deep architectures, annotation accuracy and influence of the training set size. MaSTr1325 will be released to reaserch community to facilitate progress in obstacle detection for USVs.

The Ninth Visual Object Tracking VOT2021 Challenge Results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking challenge VOT2021 is the ninth annual tracker benchmarking activity organized by the VOT initiative. Results of 71 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2021 challenge was composed of four sub-challenges focusing on different tracking domains: (i) VOT-ST2021 challenge focused on short-term tracking in RGB, (ii) VOT-RT2021 challenge focused on ``real-time’’ short-term tracking in RGB, (iii) VOT-LT2021 focused on long-term tracking, namely coping with target disappearance and reappearance and (iv) VOT-RGBD2021 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2021 dataset was refreshed, while VOT-RGBD2021 introduces a training dataset and sequestered dataset for winner identification. The source code for most of the trackers, the datasets, the evaluation kit and the results along with the source code for most trackers are publicly available at the challenge website.

The Second Visual Object Tracking Segmentation VOTS2024 Challenge Results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking Segmentation VOTS2024 challenge is the twelfth annual tracker benchmarking activity of the VOT initiative. This challenge consolidates the new tracking setup proposed in VOTS2023, which merges short-term and long-term as well as single-target and multiple-target tracking with segmentation masks as the only target location specification. Two sub-challenges are considered. The VOTS2024 standard challenge, focusing on classical objects and the VOTSt2024, which considers objects undergoing a topological transformation. Both challenges use the same performance evaluation methodology. Results of 28 submissions are presented and analyzed. A leaderboard, with participating trackers details, the source code, the datasets, and the evaluation kit are publicly available on the website https://www.votchallenge.net/vots2024/.

The Seventh Visual Object Tracking VOT2019 Challenge Results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking challenge VOT2019 is the seventh annual tracker benchmarking activity organized by the VOT initiative. Results of 81 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in the recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis as well as the standard VOT methodology for long-term tracking analysis. The VOT2019 challenge was composed of five challenges focusing on different tracking domains: (i) VOTST2019 challenge focused on short-term tracking in RGB, (ii) VOT-RT2019 challenge focused on “real-time” shortterm tracking in RGB, (iii) VOT-LT2019 focused on longterm tracking namely coping with target disappearance and reappearance. Two new challenges have been introduced: (iv) VOT-RGBT2019 challenge focused on short-term tracking in RGB and thermal imagery and (v) VOT-RGBD2019 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2019, VOT-RT2019 and VOT-LT2019 datasets were refreshed while new datasets were introduced for VOT-RGBT2019 and VOT-RGBD2019. The VOT toolkit has been updated to support both standard shortterm, long-term tracking and tracking with multi-channel imagery. Performance of the tested trackers typically by far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.

The sixth Visual Object Tracking VOT2018 challenge results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative. Results of over eighty trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in the recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis and a \real-time" experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. A long-term tracking subchallenge has been introduced to the set of standard VOT sub-challenges. The new subchallenge focuses on long-term tracking properties, namely coping with target disappearance and reappearance. A new dataset has been compiled and a performance evaluation methodology that focuses on long-term tracking capabilities has been adopted. The VOT toolkit has been updated to support both standard short-term and the new longterm tracking subchallenges. Performance of the tested trackers typically by far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.

The Tenth Visual Object Tracking VOT2022 Challenge Results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking challenge VOT2022 is the tenth annual tracker benchmarking activity organized by the VOT initiative. Results of 93 entries are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2022 challenge was composed of seven sub-challenges focusing on different tracking domains: (i) VOT-STs2022 challenge focused on short-term tracking in RGB by segmentation, (ii) VOT-STb2022 challenge focused on short-term tracking in RGB by bounding boxes, (iii) VOT-RTs2022 challenge focused on real-time'' short-term tracking in RGB by segmentation, (iv) VOT-RTb2022 challenge focused on real-time’’ short-term tracking in RGB by bounding boxes, (v) VOT-LT2022 focused on long-term tracking, namely coping with target disappearance and reappearance, (vi) VOT-RGBD2022 challenge focused on short-term tracking in RGB and depth imagery, and (vii) VOT-D2022 challenge focused on short-term tracking in depth-only imagery. New datasets were introduced in VOT-LT2022 and VOT-RGBD2022, VOT-ST2022 dataset was refreshed, and a training dataset was introduced for VOT-LT2022. The source code for most of the trackers, the datasets, the evaluation kit and the results are publicly available at the challenge website.

The Thermal Infrared Visual Object Tracking VOT-TIR2015 Challenge Results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking VOT2013 challenge results

Mon, 01 Jan 0001 00:00:00 +0000

Visual tracking has attracted a significant attention in the last few decades. The recent surge in the number of publications on tracking-related problems have made it almost impossible to follow the developments in the field. One of the reasons is that there is a lack of commonly accepted annotated data-sets and standardized evaluation protocols that would allow objective comparison of different tracking methods. To address this issue, the Visual Object Tracking (VOT) workshop was organized in conjunction with ICCV2013. Researchers from academia as well as industry were invited to participate in the first VOT2013 challenge which aimed at single-object visual trackers that do not apply pre-learned models of object appearance (model-free). Presented here is the VOT2013 benchmark dataset for evaluation of single-object visual trackers as well as the results obtained by the trackers competing in the challenge. In contrast to related attempts in tracker benchmarking, the dataset is labeled per-frame by visual attributes that indicate occlusion, illumination change, motion change, size change and camera motion, offering a more systematic comparison of the trackers. Furthermore, we have designed an automated system for performing and evaluating the experiments. We present the evaluation protocol of the VOT2013 challenge and the results of a comparison of 27 trackers on the benchmark dataset. The dataset, the evaluation tools and the tracker rankings are publicly available from the challenge website\footnote{http://votchallenge.net}.

The Visual Object Tracking VOT2014 challenge results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking VOT2015 challenge results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 62 trackers are presented. The number of tested trackers makes VOT 2015 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2015 challenge that go beyond its VOT2014 predecessor are: (i) a new VOT2015 dataset twice as large as in VOT2014 with full annotation of targets by rotated bounding boxes and per-frame attribute, (ii) extensions of the VOT2014 evaluation methodology by introduction of a new performance measure. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.

The Visual Object Tracking VOT2016 challenge results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply prelearned models of object appearance. Results of 70 trackers are presented, with a large number of trackers being published at major computer vision conferences and journals in the recent years. The number of tested state-of-the-art trackers makes the VOT 2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. The VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.

The Visual Object Tracking VOT2017 Challenge Results

Mon, 01 Jan 0001 00:00:00 +0000

The Visual Object Tracking challenge VOT2017 is the fifth annual tracker benchmarking activity organized by the VOT initiative. Results of 51 trackers are presented; many are state-of-the-art published at major computer vision conferences or journals in recent years. The evaluation included the standard VOT and other popular methodologies and a new “real-time” experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. Performance of the tested trackers typically by far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The VOT2017 goes beyond its predecessors by (i) improving the VOT public dataset and introducing a separate VOT2017 sequestered dataset, (ii) introducing a realtime tracking experiment and (iii) releasing a redesigned toolkit that supports complex experiments. The dataset, the evaluation kit and the results are publicly available at the challenge website.

The VOT2013 challenge: overview and additional results

Mon, 01 Jan 0001 00:00:00 +0000

Towards a large-scale category detection with a distributed hierarchical compositional model

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we evaluate a visual object detection system implemented on a distributed processing platform, presented in our previous work, with the goal of assessing the scalability of the system to a large-scale category detection. While state-of-the-art detection methods based on sliding windows may not be capable of scaling to a higher number of categories, we provide initial evidence that using a hierarchical compositional method called learned-hierarchy-of-parts (LHOP) may be capable of scaling to a higher number of categories. We show with the library trained on an MPEG-7 Shape database that the method is capable of scaling from a system with 5 categories and 6 second averaged response time to a system with 70 categories and averaged response time of 27 seconds.

Towards an Integrated Robot with Multiple Cognitive Functions

Mon, 01 Jan 0001 00:00:00 +0000

We present integration mechanisms for combining heterogeneous components in a situated information processing system, illustrated by a cognitive robot able to collaborate with a human and display some understanding of its surroundings. These mechanisms include an architectural schema that encourages parallel and incremental information processing, and a method for binding information from distinct representations that when faced with rapid change in the world can maintain a coherent, though distributed, view of it. Provisional results are demonstrated in a robot combining vision, manipulation, language, planning and reasoning capabilities interacting with a human and manipulable objects.

Towards automated scyphistoma census in underwater imagery : a useful research and monitoring too

Mon, 01 Jan 0001 00:00:00 +0000

Towards automated scyphistoma census in underwater imagery: a useful research and monitoring tool

Mon, 01 Jan 0001 00:00:00 +0000

Manual annotation and counting of entities in underwater photographs is common in many branches of marine biology. With a marked increase of jellyfish populations worldwide, understanding the dynamics of the polyp (scyphistoma) stage of their life-cycle is becoming increasingly important. In-situ studies of polyp population dynamics are scarce due to small size of the polyps and tedious manual work required to annotate and count large numbers of items in underwater photographs. We devised an experiment which shows a large variance between human annotators, as well as in annotations made by the same annotator. We have tackled this problem, which is present in many areas of marine biology, by developing a method for automated detection and counting. Our polyp counter (PoCo) uses a two-stage approach with a fast detector (Aggregated Channel Features) and a precise classifier consisting of a pre-trained Convolutional Neural Network and a Support Vector Machine. PoCo was tested on a year-long image dataset and performed with accuracy comparable to human annotators but with 70-fold reduction in time. The algorithm can be used in many marine biology applications, vastly reducing the amount of manual labor and enabling processing of much larger datasets. The source code is freely available on GitHub.

Towards commoditized smart-camera design

Mon, 01 Jan 0001 00:00:00 +0000

We propose a set of design principles for a cost-effective embedded smart camera. Our aim is to alleviate the shortcomings of the existing designs, such as excessive reliance on battery power and wireless networking, over-emphasized focus on specific use cases, and use of specialized technologies. In our opinion, these shortcomings prevent widespread commercialization and adoption of embedded smart cameras, especially in the context of visual-sensor networks. The proposed principles lead to a distinctively different design, which relies on commoditized, standardized and widely-available components, tools and knowledge. As an example of using these principles in practice, we present a smart camera, which is inexpensive, easy to build and support, capable of high-speed communication and enables rapid transfer of computer-vision algorithms to the embedded world.

Towards Deep Compositional Networks

Mon, 01 Jan 0001 00:00:00 +0000

Hierarchical feature learning based on convolutional neural networks (CNN) has recently shown significant potential in various computer vision tasks. While allowing high-quality discriminative feature learning, the downside of CNNs is the lack of explicit structure in features, which often leads to overfitting, absence of reconstruction from partial observations and limited generative abilities. Explicit structure is inherent in hierarchical compositional models, however, these lack the ability to optimize a well-defined cost function. We propose a novel analytic model of a basic unit in a layered hierarchical model with both explicit compositional structure and a well-defined discriminative cost function. Our experiments on two datasets show that the proposed compositional model performs on a par with standard CNNs on discriminative tasks, while, due to explicit modeling of the structure in the feature units, affording a straight-forward visualization of parts and faster inference due to separability of the units.

Towards fast and efficient methods for tracking players in sports

Mon, 01 Jan 0001 00:00:00 +0000

An efficient algorithm for tracking a single player in a sporting match is presented in this paper. The sporting event is considered as a semi-controlled environment for which a set of closed-world assumptions regarding the visual as well as dynamical properties is derived. We show how these assumptions can be used in the context of particle filtering to arrive at a computationally-fast and reliable tracker. The proposed tracker was evaluated on a demanding data set. When compared to several similar trackers that did not utilize all of the closed-world assumptions, the proposed tracker, on average, resulted in a better performance regarding the failure rate as well as position and prediction estimation.

Towards fast lighting condition inference for augmented reality

Mon, 01 Jan 0001 00:00:00 +0000

Towards hierarchical representation of space

Mon, 01 Jan 0001 00:00:00 +0000

Various robotic systems, performing efficient navigation, localization and place recognition in their surrounding environments, have already been developed. These systems posess a representation of space that is based on some engineered knowledge. There is still no such system that would know about the structure of space in general, and whose knowledge would be obtained by learning. We believe that people learn about properties of space through interaction with the environment. Therefore, since people perform really well in the spatial related tasks, we expect that a robotic system that would obtain such knowledge would also perform better. With this in mind, we are developing an algorithm for learning a compositional hierarchical representation of space that is based on statistically significant observations. For now, we have focused on a two dimensional space, since many robots perceive their surroundings in two dimensions with the use of a laser range finder or a sonar. In this paper we evaluate our early work on this topic through room categorization problem. Based on the lower layers of the hierarchy, we obtained encouraging classification results with three different types of rooms.

Towards large-scale traffic sign detection and recognition

Mon, 01 Jan 0001 00:00:00 +0000

Recognition of traffic signs is a well researched field in the computer vision community, with several commercial applications already available. However, a vast majority of existing approaches focuses on recognition of a relatively small number of traffic sign categories (about 50 or less). In this paper, we adopt a convolutional neural network (CNN) approach, i.e., the Faster R-CNN, to address the full pipeline of detection and recognition of more than 100 traffic sign categories, depicted in our novel dataset that was acquired on Slovenian roads. We report promising results on highly challenging traffic sign categories that have not yet been considered in previous works and we provide useful insights for CNN training.

Towards Learning Basic Object Affordances from Object Properties

Mon, 01 Jan 0001 00:00:00 +0000

The capacity for learning to recognize and exploit environmental affordances is an important consideration for the design of current and future developmental robotic systems. We present a system that uses a robotic arm, camera systems and self-organizing maps to learn basic affordances of objects.

Towards on-the fly multi-modal sensor calibration

Mon, 01 Jan 0001 00:00:00 +0000

The robustness of autonomous vehicles can be significantly improved by using multiple sensor modalities. In addition to standard color cameras and less frequently used thermal, multispectral and polarization cameras, LIDAR and RADAR are most often used sensors, and are largely complementary to image sensors. However, the spatial calibration of such a system can be extremely challenging due to the difficulties in obtaining corresponding features from different modalities, as well as the inevitable parallax arising from different sensor positions. In this paper, we present a comprehensive strategy for calibrating such a system using a multi-modal target, and illustrate how such a strategy could be upgraded to an fully automatic, target-less calibration that would rely on features of the scene itself to align at least small sensor offsets from the calibrated position. We find that a high-level understanding of the scene is ideal for this task, as this way we can identify characteristic points for spatial alignment of sensor data of different modalities.

Towards Probabilistic Online Discriminative Models

Mon, 01 Jan 0001 00:00:00 +0000

Towards Scalable Representations of Visual Categories: Learning a Hierarchy of parts.

Mon, 01 Jan 0001 00:00:00 +0000

Towards the deep learning recognition of cultivated terraces based on Lidar data: The case of Slovenia

Mon, 01 Jan 0001 00:00:00 +0000

Tracking and Segmentation of Transparent Objects

Mon, 01 Jan 0001 00:00:00 +0000

Transparent object tracking is a challenging, recently introduced, problem. Existing methods predict target location as a bounding box, which is often only a poor approximation of actual location. Segmentation mask is a more accurate prediction, but benchmarks for evaluating tracking and segmentation performance of transparent objects does not exist. In this paper we address this drawback by introducing a new dataset for tracking and segmentation of transparent objects. In particular we sparsely re-annotate the existing bounding box TOTB dataset with ground-truth segmentation masks. A comprehensive analysis demonstrates that existing segmentation methods perform surprisingly well on this task indicating good design generalization and potential for transparent object tracking tasks. In addition, we show that existing bounding box trackers can be easily transformed into segmentation trackers using modern mask refinement methods.

Tracking by Identification Using Computer Vision and Radio

Mon, 01 Jan 0001 00:00:00 +0000

We present a novel system for detection, localization and tracking of multiple people, which fuses a multi-view computer vision approach with a radio-based localization system. The proposed fusion combines the best of both worlds, excellent computer-vision-based localization, and strong identity information provided by the radio system, and is therefore able to perform tracking by identification, which makes it impervious to propagated identity switches. We present comprehensive methodology for evaluation of systems that perform person localization in world coordinate system and use it to evaluate the proposed system as well as its components. Experimental results on a challenging indoor dataset, which involves multiple people walking around a realistically cluttered room, confirm that proposed fusion of both systems significantly outperforms its individual components. Compared to the radio-based system, it achieves better localization results, while at the same time it successfully prevents propagation of identity switches that occur in pure computer-vision-based tracking.

Tracking Non-Rigid Objects by Combining Local and Global Visual Model

Mon, 01 Jan 0001 00:00:00 +0000

We present an appearance-based tracker which hierarchically combines a global and a local visual model in two layers. The bottom layer contains the local part of the visual model and consists of a set of sub-trackers, each of them observing only a local aspect of the object. The top layer constrains and focuses the movement of individual sub-tracker by accounting for the global part of the model - the spatial relations between the trackers. The visual model is updated by modifying the spatial relations and by reinitializing the sub-trackers which do not follow the target. By reinitializing a single or a small number of sub-trackers the tracker can adapt only a part of its visual model to the new appearance of the object. This makes the tracker less vulnerable to drifting. The implementation of the two-layered tracker that uses a SSD template matching for the sub-trackers is presented and tested on a demanding data set of non-rigid objects.

Tracking people in video data using probabilistic models

Mon, 01 Jan 0001 00:00:00 +0000

Traffic sign classification with batch and on-line linear support vector machines

Mon, 01 Jan 0001 00:00:00 +0000

This paper presents a comprehensive benchmark of several feature types and colorspace representations on the task of traffic sign classification. We focus on linear Support Vector Machine classifiers, and test several multi-class formulations, as well as a formulation that allows on-line training and updates. Experiments on two standard traffic sign classification datasets show that despite their relative simplicity, these classifiers offer competitive performance, and ultimately allow design of a flexible classification system in the context of application for automatic maintenance of traffic signalization inventory.

Trans2k: Unlocking the Power of Deep Models for Transparent Object Tracking

Mon, 01 Jan 0001 00:00:00 +0000

Visual object tracking has focused predominantly on opaque objects, while transparent object tracking received very little attention. Motivated by the uniqueness of transparent objects in that their appearance is directly affected by the background, the first dedicated evaluation dataset has emerged recently. We contribute to this effort by proposing the first transparent object tracking training dataset Trans2k that consists of over 2k sequences with 104,343 images overall, annotated by bounding boxes and segmentation masks. Noting that transparent objects can be realistically rendered by modern renderers, we quantify domain-specific attributes and render the dataset containing visual attributes and tracking situations not covered in the existing object training datasets. We observe a consistent performance boost (up to 16%) across a diverse set of modern tracking architectures when trained using Trans2k, and show insights not previously possible due to the lack of appropriate training sets. The dataset and the rendering engine will be publicly released to unlock the power of modern learning-based trackers and foster new designs in transparent object tracking.

TransFusion - A Transparency-Based Diffusion Model for Anomaly Detection

Mon, 01 Jan 0001 00:00:00 +0000

Surface anomaly detection is a vital component in manufacturing inspection. Current discriminative methods follow a two-stage architecture composed of a reconstructive network followed by a discriminative network that relies on the reconstruction output. Currently used reconstructive networks often produce poor reconstructions that either still contain anomalies or lack details in anomaly-free regions. Discriminative methods are robust to some reconstructive network failures, suggesting that the discriminative network learns a strong normal appearance signal that the reconstructive networks miss. We reformulate the two-stage architecture into a single-stage iterative process that allows the exchange of information between the reconstruction and localization. We propose a novel transparency-based diffusion process where the transparency of anomalous regions is progressively increased, restoring their normal appearance accurately while maintaining the appearance of anomaly-free regions using localization cues of previous steps. We implement the proposed process as TRANSparency DifFUSION (TransFusion), a novel discriminative anomaly detection method that achieves state-of-the-art performance on both the VisA and the MVTec AD datasets, with an image-level AUROC of 98.5% and 99.2%, respectively. Code: https://github.com/MaticFuc/ECCV_TransFusion

TraX: The visual Tracking eXchange Protocol and Library

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we address the problem of developing on-line visual tracking algorithms. We present a specialized communication protocol that serves as a bridge between a tracker implementation and utilizing application. It decouples development of algorithms and application, encouraging re-usability. The primary use case is algorithm evaluation where the protocol facilitates more complex evaluation scenarios that are used nowadays thus pushing forward the field of visual tracking. We present a reference implementation of the protocol that makes it easy to use in several popular programming languages and discuss where the protocol is already used and some usage scenarios that we envision for the future.

TraX: Visual Tracking eXchange Protocol

Mon, 01 Jan 0001 00:00:00 +0000

This report motivates the TraX communication protocol as well as specifies its first iteration. TraX protocol is a simple protocol that was designed to make automatic evaluation of visual tracking algorithms quick, easy and independent of the choice of programming language, availability of source code or even the target operating system. It integrates with existing tracker implementations with little additional work and enables communication with external evaluation tools in order to perform objective evaluation experiments. In addition to the protocol specification we provide a reference implementation of the protocol in several popular programming languages that makes the protocol even easier to use.

Understanding Convolutional Neural Networks for Object Recognition

Mon, 01 Jan 0001 00:00:00 +0000

Since deep learning originates from the field of computer vision this talk we will focus more closely on deep learning approaches for computer vision problems. We will focus on convolutional neural networks (CNN or ConvNet), how they work, what makes them particularly useful for computer vision problems, what are the important “tricks” that makes them work that well (ReLU, dropout, batch norm …), and what can visualization of feature tell us about CNNs. The talk will start with the basics of deep learning (gradient descent and back-propagation) so no prior knowledge is needed but some knowledge of mathematics (statistics and derivatives) could be useful for properly understanding more advanced “tricks”. At the end of the talk we will also look at the method being developed at ViCoS lab in UL FRI that tries to advance CNNs by combining them with compositional hierarchies and improve the understanding of features.

Unsupervised Learning of a Hierarchy of Topological Maps using Omnidirectional Images

Mon, 01 Jan 0001 00:00:00 +0000

This paper presents a novel appearance-based method for path-based map learning by a mobile robot equipped with an omnidirectional camera. In particular, we focus on an unsupervised construction of topological maps, which provide an abstraction of the environment in terms of visual aspects. An unsupervised clustering algorithm is used to represent the images in multiple subspaces, forming thus a sensory grounded representation of the environment’s appearance. By introducing transitional fields between clusters we are able to obtain a partitioning of the image set into distinctive visual aspects. By abstracting the low-level sensory data we are able to efficiently reconstruct the overall topological layout of the covered path. After the high level topology is estimated, we repeat the procedure on the level of visual aspects to obtain local topological maps. We demonstrate how the resulting representation can be used for modeling indoor and outdoor environments, how it successfully detects previously visited locations and how it can be used for the estimation of the current visual aspect and the retrieval of the relative position within the current visual aspect.

Unsupervised Learning of Basic Object Affordances from Object Properties

Mon, 01 Jan 0001 00:00:00 +0000

Affordance learning has, in recent years, been generating heightened interest in both the cognitive vision and developmental robotics communities. In this paper we describe the development of a system that uses a robotic arm to interact with household objects on a table surface while observing the interactions using camera systems. Various computer vision methods are used to derive, firstly, object property features from intensity images and range data gathered before interaction and, subsequently, result features derived from video sequences gathered during and after interaction. We propose a novel affordance learning algorithm that automatically discretizes the result feature space in an unsupervised manner to form affordance classes that are then used as labels to train a supervised classifier in the object property feature space. This classifier may then be used to predict affordance classes, grounded in the result space, of novel objects based on object property observations.

Uporaba lokalnih značilnic v aplikacijah spoznavnega vida za urbana okolja (Local features in cognitive vision applications for urban environments)

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we present a performance evaluation of MSER and Hessian-Affine local feature detectors in a typical use case of cognitive vision applications in urban environments. By using a wide baseline stereo matching approach we try to find camera motion between a user image and images stored in a database. Running this application on test images twice while only changing the underlying local feature type has shown that the MSER local feature detector outperforms the Hessian-Affine detector. Additionally, we have shown that local features can perform well in cognitive vision applications for urban environments.

Using discriminative analysis for improving hierarchical compositional models

Mon, 01 Jan 0001 00:00:00 +0000

In this paper we propose a method to extract discriminative information from a generative model produced by a compositional hierarchical approach. We present discriminative information as a score computed from a weighted summation of the activation vector. We base the activation vector on individual activations of features from a parse tree of the detection. We utilize the score to reduce false positive detections by removing generative models with poor discriminative information from the vocabulary and by thresholding the detections with low discriminative score. We evaluate our approach on the ETHZ Shape Classes database where we show a reduction in the number of false positives and a decrees in detection time without reducing the detection rate.

Vegetation segmentation for boosting performance of MSER feature detector

Mon, 01 Jan 0001 00:00:00 +0000

In this paper, we present a new application of image segmentation algorithms and an adaptation of the image segmentation method of Tavakoli et al. to the problem of vegetation segmentation. While the traditional goal of image segmentation is to provide a figure/ground segmentation for object recognition or semantic segmentation to assist humans, we propose to use image segmentation in order to boost performance of local invariant feature detectors. In particular, we analyze the performance of MSER feature detector and we show that we can prune all features detected on vegetation to gain a 67% speed-up while accuracy of image matching does not decrease. The image segmentation method of Tavakoli et al. that we adapt to the problem of vegetation segmentation is based on singular value decomposition (SVD) of local image patches, where the sum of the smaller singular values describes the high frequency part of the patch. The results of the automatic segmentation of vegetation show that the average overlap between manual and automatic vegetation segmentation is 33% and that the automatic procedure for vegetation segmentation can prune 25% of MSER features, resulting in 33% faster image retrieval.

ViCoS Eye - a webservice for visual object categorization

Mon, 01 Jan 0001 00:00:00 +0000

In our paper we present an architecture for a system capable of providing back-end support for webservice by running a variety of computer vision algorithms distributed across a cluster of machines. We divide the architecture into learning, real-time processing and a request handling for web-service. We implement learning in MapReduce domain with Hadoop jobs, while we implement real-time processing as a Storm application. An additional website and Android application front-end are implemented as part of web-service to provide user interface. We evaluate the system on our own cluster and show that the system running on a cluster of our size can learn Caltech-101 dataset in 40 minutes while real-time processing can achieve response time of 2 seconds, which is adequate for multitude of online applications.

ViCoS Eye - Spletna storitev za kategorizacijo vizualnih objektov

Mon, 01 Jan 0001 00:00:00 +0000

V članku predstavimo arhitekturo sistema za spletno storitev, ki omogoča poganjanje naprednih algoritmov računalniškega vida porazdeljenih preko večjega števila računalnikov. Arhitekturno sistem ločimo na učenje, tokovno procesiranje v realnem času in uporabniški vmesnik za spletno storitev. Učenje implementiramo v domeni MapReduce s pomočjo Hadoop poslov, medtem ko implementiramo realno-časovno procesiranje kot aplikacijo na sistemu Storm. Kot spletni vmesnik za končnega uporabnika dodatno implementiramo tudi spletno stran in Android aplikacijo. Sistem testiramo na naši gruči računalnikov in pokažemo, da se lahko slike iz podatkovne zbirke Caltech-101 naučimo v 40 minutah, medtem ko lahko tokovno procesiranje v realnem času obdela posamezno vhodno zahtevo v manj kot dveh sekundah.

Video segmentation of water scenes using semi supervised learning

Mon, 01 Jan 0001 00:00:00 +0000

Obstacle detection is a crucial component in unmanned surface vehicles to prevent collisions and unnecessary stopping due to false detections. Autonomous vessels are a relatively unexplored area in comparison to autonomous ground vehicles, thus there are much fewer densely annotated datasets for training modern obstacle detectors. Since manual acquisition of ground truth segmentation data is time-consuming and expensive, a viable alternative is training with minimal supervision to evaluate unsupervised domain adaptation methods, trained on a labeled source dataset and an un-labeled target dataset. Four modern adaptation methods are tested (Intra-domain adaptation, Fourier domain adaptation, Instance matching and Bidirectional learning) for training the semantic segmentation network WaSR, which is currently the state-of-the-art for maritime obstacle detection. We consider the original WaSR as well as a modified version. The Fourier domain adaptation applied to a modified WaSR version outperforms the non-adapted original WaSR by 6.3% in F-measure.

Video-Based Ski Jump Style Scoring from Pose Trajectory

Mon, 01 Jan 0001 00:00:00 +0000

Ski jumping is one of the oldest winter sports and takes also part in the Winter Olympics from the very start in 1924. One of the components of the final score, which is used for ranking the competitors, is the style score, given by five judges. The goal of this work was to develop a prototype for automatic style scoring from videos. As the main source of information, the proposed approach uses the detected locations of the ski jumper body parts and his skis to capture a full-body movement through the entire ski jump. We extended a method for human pose estimation from images to detect also the tips and the tails of the skies and adapted it to the domain of ski jumping. We proposed a method to utilize the detected trajectories along with the scores given by real judges to build a model for predicting the style scores. The experimental results obtained on the data that we had available show that the proposed computer-vision-based system for automatic style scoring achieves an error comparable to the error of real judges.

Visual Detection of Business Cards: Key-Point Correspondences Filtering

Mon, 01 Jan 0001 00:00:00 +0000

This study explores a coarse localization of a planar object using interest key-points and RANSAC algorithm. The method is employed as part of an application for the detection and recognition of a business card being waved in front of a camera. Localization follows the method of Vincent and Laganiere where RANSAC algorithm is used to find homography between two images that contain a dominant planar regions. RANSAC algorithm and a method for finding planar objects in two consecutive frames are presented in detail, with additional key-point stability over multiple frames being employed for the removal of background key-points. We evaluate the method on four business cards, two non-textured and two textured, and show to significantly reduce background key-points with the dominant planar object. We also show to completely remove background key-points when correspondences are matched on every fifth frame and when each key-point is required to be visible for at least 15 frames.

Visual Detection of Business Cards: Segmentation

Mon, 01 Jan 0001 00:00:00 +0000

This study explores Graph Cut method for the segmentation of business cards captured from a sequence of images. The process of segmentation is needed as a part of an application for detection and recognition of business cards. We explore Graph Cut method in detail, and show it can be applied to business card segmentation. We show how in Graph Cut method foreground and background regions can be successfully initialized using business card key-point detector from our previous study. Furthermore, we show how a sequence of images can be used to improve foreground/background initialization and how by merging multiple frames final segmentation can be improved. We demonstrate our proposed approach on a set of four business cards sequences, using two textured and two nontextured business cards.

Visual Detection of Business Cards: Study of Interest Key-point Detectors

Mon, 01 Jan 0001 00:00:00 +0000

This study examines the use of interest key-point detector for the application of detecting and recognizing business cards that are slowly waved in front of the camera. We focus on two interest point detectors Scale Invariant Feature Transform and Maximally Stable Extremal Regions. Both are presented in detail with main emphasis on SIFT detector. Characteristics of both are examined with respect to the detection of business cards and SIFT is selected as suitable detector for this problem. A stability of SIFT key-points is also experimentally evaluated on a newly created database of business cards for this purpose.

Visual Information Abstraction For Interactive Robot Learning

Mon, 01 Jan 0001 00:00:00 +0000

Semantic visual perception for knowledge acquisition plays an important role in human cognition, as well as in the learning process of any cognitive robot. In this paper, we present a visual information abstraction mechanism designed for continuously learning robotic systems. We generate spatial information in the scene by considering plane estimation and stereo line detection coherently within a unified probabilistic framework, and show how spaces of interest (SOIs) are generated and segmented using the spatial information. We also demonstrate how the existence of SOIs is validated in the long-term learning process. The proposed mechanism facilitates robust visual information abstraction which is a requirement for continuous interactive learning. Experiments demonstrate that with the refined spatial information, our approach provides accurate and plausible representation of visual objects.

Visual object tracking performance measures revisited

Mon, 01 Jan 0001 00:00:00 +0000

The problem of visual tracking evaluation is sporting a large variety of performance measures, and largely suffers from lack of consensus about which measures should be used in experiments. This makes the cross-paper tracker comparison difficult. Furthermore, as some measures may be less effective than others, the tracking results may be skewed or biased towards particular tracking aspects. In this paper we revisit the popular performance measures and tracker performance visualizations and analyze them theoretically and experimentally. We show that several measures are equivalent from the point of information they provide for tracker comparison and, crucially, that some are more brittle than the others. Based on our analysis we narrow down the set of potential measures to only two complementary ones, describing accuracy and robustness, thus pushing towards homogenization of the tracker evaluation methodology. These two measures can be intuitively interpreted and visualized and have been employed by the recent Visual Object Tracking (VOT) challenges as the foundation for the evaluation methodology.

Visual re-identification across large, distributed camera networks

Mon, 01 Jan 0001 00:00:00 +0000

We propose a holistic approach to the problem of re-identification in an environment of distributed smart cameras. We model the re-identification process in a distributed camera network as a distributed multi-class classifier, composed of spatially distributed binary classifiers. We treat the problem of re-identification as an open-world problem, and address novelty detection and forgetting. As there are many tradeoffs in design and operation of such a system, we propose a set of evaluation measures to be used in addition to the recognition performance. The proposed concept is illustrated and evaluated on a new many-camera surveillance dataset and SAIVT-SoftBio dataset.

Vrednotenje učinkovitosti Kalmanovega filtra pri sledenju ljudi

Mon, 01 Jan 0001 00:00:00 +0000

Kalman filtering (KF) is a standard technique for estimating position and uncertainty of a moving object based on noisy measurements and knowledge of object dynamics. In this paper we apply the Kalman filter algorithm to estimate the motion parameters (position and speed) of a moving Peršon from a video stream. To assess the efficiency of KF tracking various experiments with and without KF were performed. The results showed that modeling of a Peršon motion and measurement noise using KF algorithm can considerably improve the tracking performance in cases of human interactions and occlusions.

WaSR -- A Water Segmentation and Refinement Maritime Obstacle Detection Network

Mon, 01 Jan 0001 00:00:00 +0000

Obstacle detection using semantic segmentation has become an established approach in autonomous vehicles. However, existing segmentation methods, primarily developed for ground vehicles, are inadequate in an aquatic environment as they produce many false positive (FP) detections in the presence of water reflections and wakes. We propose a novel deep encoder-decoder architecture, a water segmentation and refinement (WaSR) network, specifically designed for the marine environment to address these issues. A deep encoder based on ResNet101 with atrous convolutions enables the extraction of rich visual features, while a novel decoder gradually fuses them with inertial information from the inertial measurement unit (IMU). The inertial information greatly improves the segmentation accuracy of the water component in the presence of visual ambiguities, such as fog on the horizon. Furthermore, a novel loss function for semantic separation is proposed to enforce the separation of different semantic components to increase the robustness of the segmentation. We investigate different loss variants and observe a significant reduction in false positives and an increase in true positives (TP). Experimental results show that WaSR outperforms the current state-of-the-art by approximately 4% in F1-score on a challenging USV dataset. WaSR shows remarkable generalization capabilities and outperforms the state of the art by over 24% in F1 score on a strict domain generalization experiment.

Weighted and robust incremental method for subspace learning

Mon, 01 Jan 0001 00:00:00 +0000

Weighted and robust learning of subspace representations

Mon, 01 Jan 0001 00:00:00 +0000

A reliable system for visual learning and recognition should enable a selective treatment of individual parts of input data and should successfully deal with noise and occlusions. These requirements are not satisfactorily met when visual learning is approached by appearance-based modeling of objects and scenes using the traditional PCA approach. In this paper we extend standard PCA approach to overcome these shortcomings. We first present a weighted version of PCA, which, unlike the standard approach, considers individual pixels and images selectively, depending on the corresponding weights. Then we propose a robust PCA method for obtaining a consistent subspace representation in the presence of outlying pixels in the training images. The method is based on the EM algorithm for estimation of principal subspaces in the presence of missing data. We demonstrate the efficiency of the proposed methods in a number of experiments.

Weighted Incremental Subspace Learning

Mon, 01 Jan 0001 00:00:00 +0000

Wide-angle camera distortions and non-uniform illumination in mobile robot tracking

Mon, 01 Jan 0001 00:00:00 +0000

In this paper some fundamentals and solutions to accompanying problems in vision system design for mobile robot tracking are presented. The main topics are correction of camera lens distortion and compensation of non-uniform illumination. Both correction methods contribute to vision system performance if implemented in the appropriate manner. Their applicability is demonstrated by applying them to vision for robot soccer. The lens correction method successfully corrects the distortion caused by the camera lens, thus achieving a more accurate and precise estimation of object position. The illumination compensation improves robustness to irregular and non-uniform illumination that is nearly always present in real conditions.

Zaznavanje terasiranih pokrajin kot semantična segmentacija digitalnega modela višin

Mon, 01 Jan 0001 00:00:00 +0000