<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Paper on ViCoS Lab</title>
    <link>/publications/by-type/paper/</link>
    <description>Recent content in Paper on ViCoS Lab</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <atom:link href="/publications/by-type/paper/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>1st Workshop on Maritime Computer Vision (MaCVi) 2023: Challenge Results</title>
      <link>/publications/kiefer20231st/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kiefer20231st/</guid>
      <description>&lt;p&gt;The 1st Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at &lt;a href=&#34;https://seadronessee.cs.uni-tuebingen.de/macvi&#34;&gt;https://seadronessee.cs.uni-tuebingen.de/macvi&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>2nd Workshop on Maritime Computer Vision (MaCVi) 2024: Challenge Results</title>
      <link>/publications/kiefer20242nd/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kiefer20242nd/</guid>
      <description>&lt;p&gt;The 2nd Workshop on Maritime Computer Vision (MaCVi) 2024 addresses maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicles (USV). Three challenge categories are considered: (i) UAV-based Maritime Object Tracking with Re-identification, (ii) USV-based Maritime Obstacle Segmentation and Detection, (iii) USV-based Maritime Boat Tracking. The USV-based Maritime Obstacle Segmentation and Detection challenge features three sub-challenges, including a new embedded challenge addressing efficient inference on real-world embedded devices. This report offers a comprehensive overview of the findings from the challenges. We provide both statistical and qualitative analyses, evaluating trends from over 195 submissions. All datasets, evaluation code, and the leaderboard are available to the public at &lt;a href=&#34;https://macvi.org/workshop/macvi24&#34;&gt;https://macvi.org/workshop/macvi24&lt;/a&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A basic cognitive system for interactive continuous learning of visual concepts</title>
      <link>/publications/skocaj2010a-basic/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2010a-basic/</guid>
      <description>&lt;p&gt;Interactive continuous learning is an important characteristic of a cognitive agent that is supposed to operate and evolve in an ever-changing environment. In this paper we present representations and mechanisms that are necessary for continuous learning of visual concepts in dialogue with a tutor. We present an approach for modelling beliefs stemming from multiple modalities and we show how these beliefs are created by processing visual and linguistic information and how they are used for learning. We also present a system that exploits these representations and mechanisms, and demonstrate these principles in the case of learning about object colours and basic shapes in dialogue with the tutor.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A basic cognitive system for interactive learning of simple visual concepts</title>
      <link>/publications/skocaj2010a-basic-cognitive/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2010a-basic-cognitive/</guid>
      <description>&lt;p&gt;In this work we present a system and underlying representations and mechanisms for continuous learning of visual concepts in dialogue with a human tutor.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A Computer Vision Integration Model for a Multi-modal Cognitive System</title>
      <link>/publications/vrecko2009a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/vrecko2009a/</guid>
      <description>&lt;p&gt;We present a general method for integrating visual components into a multi-modal cognitive system. The integration is very generic and can combine an arbitrary set of modalities. We illustrate our integration approach with a specific instantiation of the architecture schema that focuses on integration of vision and language: a cognitive system able to collaborate with a human, learn and display some understanding of its surroundings. As examples of cross-modal interaction we describe mechanisms for clarification and visual learning.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A Detect-and-Verify Paradigm for Low-Shot Counting - DAVE</title>
      <link>/publications/pelhan2024a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/pelhan2024a/</guid>
      <description>&lt;p&gt;Low-shot counters estimate the number of objects corresponding to a selected category, based on only a few or no exemplars annotated in the image. The current state-of-the-art estimates the total counts as the sum over the object location density map, but does not provide individual object locations and sizes, which are crucial for many applications. This is addressed by detection-based counters, which, however, fall behind in total count accuracy. Furthermore, both approaches tend to overestimate the counts in the presence of other object classes due to many false positives. We propose DAVE, a low-shot counter based on a detect-and-verify paradigm that avoids the aforementioned issues by first generating a high-recall detection set and then verifying the detections to identify and remove the outliers. This jointly increases the recall and precision, leading to accurate counts. DAVE outperforms the top density-based counters by ~20% in the total count MAE, it outperforms the most recent detection-based counter by ~20% in detection quality and sets a new state-of-the-art in zero-shot as well as text-prompt-based counting.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A Distractor-Aware Memory for Visual Object Tracking with SAM2</title>
      <link>/publications/videnovic2025a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/videnovic2025a/</guid>
      <description>&lt;p&gt;Memory-based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While already achieving top performance on many benchmarks, it was the recent release of SAM2 that placed memory-based trackers into focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor-aware memory model for SAM2 and an introspection-based update strategy that jointly addresses the segmentation accuracy as well as tracking robustness. The resulting tracker is denoted as SAM2.1++. We also propose a new distractor-distilled DiDi dataset to study the distractor problem better. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state-of-the-art on six of them. The code and the new dataset will be available on &lt;a href=&#34;https://github.com/jovanavidenovic/DAM4SAM&#34;&gt;https://github.com/jovanavidenovic/DAM4SAM&lt;/a&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A Framework for Robust and Incremental Self-Localization</title>
      <link>/publications/jogan2003a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/jogan2003a/</guid>
      <description>&lt;p&gt;In this contribution we present a framework for an embodied robotic system that is capable of appearance-based self-localization. Specifically, we concentrate on the issues of robustness, flexibility, and scalability of the system. The framework presented is based on a panoramic eigenspace model of the environment. Its main feature is that it allows for simultaneous localization and map building using an incremental learning algorithm. Further, both the learning and the training processes are designed in a way to achieve robustness and adaptability to changes in the environment.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A graphical model for rapid obstacle image-map estimation from unmanned surface vehicles</title>
      <link>/publications/kristan2014a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2014a/</guid>
      <description></description>
    </item>
    <item>
      <title>A hierarchical dynamic model for tracking in sports</title>
      <link>/publications/kristan2007a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2007a/</guid>
      <description>&lt;p&gt;Dynamic models play a crucial role in tracking algorithms. In particle filters, for example, proper modelling of the target dynamics can help achieve the desired tracking accuracy using only a small number of particles and thus reduce the computational complexity of the tracker. We propose a novel hierarchical model for tracking players in sports by combining a conservative and a liberal dynamic model to better describe the player&amp;rsquo;s dynamics. We show how parameters of the model can be estimated from prior knowledge about the player&amp;rsquo;s dynamics. The proposed dynamic model was compared to a widely used model and resulted in better performance in terms of position estimation and prediction.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A hierarchy of cognitive maps from panoramic images</title>
      <link>/publications/stimec2005a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/stimec2005a/</guid>
      <description>&lt;p&gt;This paper presents a computational model which implements formation of cognitive maps based on panoramic images captured during the exploration phase. The resulting map consists of “place cells” and topological relations between them. The formation of the cognitive map is based on the model introduced by Hafner. The use of panoramic images as inputs would result in high computational complexity of the simulation, therefore we propose to use the PCA (Principal Component Analysis) method to reduce the dimension of the input space. A physical force model is applied to extend the relatively sparse topological map with metric information. Both the computational model and the physical force model try to mimic functions performed in the mammalian brain.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A Long-Term Discriminative Single Shot Segmentation Tracker</title>
      <link>/publications/dzubur2022a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/dzubur2022a/</guid>
      <description>&lt;p&gt;State-of-the-art long-term visual object tracking methods are limited to predicting the target position as an axis-aligned bounding box. Segmentation-based trackers exist, however they do not address long-term disappearances of the target. We propose a long-term discriminative single shot segmentation tracker &amp;ndash; D3SLT, which addresses the above shortcomings. The previously developed short-term D3S tracker is upgraded with a global re-detection module, based on an image-wide discriminative correlation filter response and a Gaussian motion model. An online learned confidence estimation module is employed for robust estimation of target disappearance. An additional backtracking module enables recovery from tracking failures and further improves tracking performance. D3SLT performs close to the state-of-the-art long-term trackers on the bounding-box-based VOT-LT2021 Challenge, achieving an F-score of 0.667, while additionally outputting segmentation masks.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A Low-Shot Object Counting Network With Iterative Prototype Adaptation</title>
      <link>/publications/djukic2023a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/djukic2023a/</guid>
      <description>&lt;p&gt;We consider low-shot counting of arbitrary semantic categories in the image using only a few annotated exemplars (few-shot) or no exemplars (no-shot). The standard few-shot pipeline follows extraction of appearance queries from exemplars and matching them with image features to infer the object counts. Existing methods extract queries by feature pooling, but neglect the shape information (e.g., size and aspect), which leads to reduced object localization accuracy and count estimates. We propose a Low-shot Object Counting network with iterative prototype Adaptation (LOCA). Our main contribution is the new object prototype extraction module, which iteratively fuses the exemplar shape and appearance queries with image features. The module is easily adapted to the zero-shot scenario, enabling LOCA to cover the entire spectrum of low-shot counting problems. LOCA outperforms all recent state-of-the-art methods on the FSC147 benchmark by 20-30% in RMSE on one-shot and few-shot and achieves state-of-the-art performance on zero-shot scenarios, while demonstrating better generalization capabilities.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A Novel Unified Architecture for Low-Shot Counting by Detection and Segmentation</title>
      <link>/publications/pelhan2024a-novel/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/pelhan2024a-novel/</guid>
      <description>&lt;p&gt;Low-shot object counters estimate the number of objects in an image using few or no annotated exemplars. Objects are localized by matching them to prototypes, which are constructed by unsupervised image-wide object appearance aggregation. Due to potentially diverse object appearances, the existing approaches often lead to overgeneralization and false positive detections. Furthermore, the best-performing methods train object localization by a surrogate loss that predicts a unit Gaussian at each object center. This loss is sensitive to annotation error and hyperparameters and does not directly optimize the detection task, leading to suboptimal counts. We introduce GeCo, a novel low-shot counter that achieves accurate object detection, segmentation, and count estimation in a unified architecture. GeCo robustly generalizes the prototypes across object appearances through a novel dense object query formulation. In addition, a novel counting loss is proposed that directly optimizes the detection task and avoids the issues of the standard surrogate loss. GeCo surpasses the leading few-shot detection-based counters by 25% in the total count MAE, achieves superior detection accuracy and sets a new solid state-of-the-art result across all low-shot counting setups. The code will be available on GitHub.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A Robust PCA algorithm for building representations from panoramic images</title>
      <link>/publications/skocaj2002a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2002a/</guid>
      <description></description>
    </item>
    <item>
      <title>A system approach to interactive learning of visual concepts</title>
      <link>/publications/skocaj2010a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2010a/</guid>
      <description>&lt;p&gt;In this work we present a system and underlying mechanisms for continuous learning of visual concepts in dialogue with a human.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A System for Continuous Learning of Visual Concepts</title>
      <link>/publications/skocaj2007a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2007a/</guid>
      <description>&lt;p&gt;We present an artificial cognitive system for learning visual concepts. It comprises vision, communication and manipulation subsystems, which provide visual input, enable verbal and non-verbal communication with a tutor and allow interaction with a given scene. The main goal is to learn associations between automatically extracted visual features and words that describe the scene in an open-ended, continuous manner. In particular, we address the problem of cross-modal learning of visual properties and spatial relations. We introduce and analyse several learning modes requiring different levels of tutor supervision.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A system for interactive learning in dialogue with a tutor</title>
      <link>/publications/skocaj2011a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2011a/</guid>
      <description>&lt;p&gt;In this paper we present representations and mechanisms that facilitate continuous learning of visual concepts in dialogue with a tutor and show the implemented robot system. We present how beliefs about the world are created by processing visual and linguistic information and show how they are used for planning system behaviour with the aim of satisfying its internal drive &amp;ndash; to extend its knowledge. The system facilitates different kinds of learning initiated by the human tutor or by the system itself. We demonstrate these principles in the case of learning about object colours and basic shapes.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A system for learning basic object affordances using a self-organizing map</title>
      <link>/publications/ridge2008a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/ridge2008a/</guid>
      <description>&lt;p&gt;When a cognitive system encounters particular objects, it needs to know what effect each of its possible actions will have on the state of each of those objects in order to be able to make effective decisions and achieve its goals. Moreover, it should be able to generalize effectively so that when it encounters novel objects, it is able to estimate what effect its actions will have on them based on its experiences with previously encountered similar objects. This idea is encapsulated by the term “affordance”, e.g. “a ball affords being rolled to the right when pushed from the left.” In this paper, we discuss the development of a cognitive vision platform that uses a robotic arm to interact with household objects in an attempt to learn some of their basic affordance properties. We outline the various sensor and effector module competencies that were needed to achieve this and describe an experiment that uses a self-organizing map to integrate these modalities in a working affordance learning system.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A Template-Based Multi-Player Action Recognition of the Basketball Game</title>
      <link>/publications/perse2006a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/perse2006a/</guid>
      <description>&lt;p&gt;In this paper we present a method for fully automatic trajectory-based analysis of a basketball game in the form of large- and small-scale modelling of the game. The large-scale game model is obtained by dividing the game into several game phases. Every game phase is then individually modelled using a mixture of Gaussian distributions. The Expectation-Maximization algorithm is used to determine the parameters of the Gaussian distributions. On the other hand, the small-scale modelling of the game deals with specific basketball actions which can be defined in the form of action templates that are used by basketball experts to pass their instructions to the players. For recognition purposes we define the basic game elements which are the building blocks of the more complex game actions. These elements are then used to semantically describe the observed basketball actions and the templates. To establish whether an observed action corresponds to a template, the similarity of descriptions is calculated using the Levenshtein distance measure. Experiments show that the proposed method could become a powerful tool for the recognition of various basketball actions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A Visualization and User Interface Framework for Heterogeneous Distributed Environments</title>
      <link>/publications/mahnic2012a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/mahnic2012a/</guid>
      <description>&lt;p&gt;Systems that require complex computations are frequently implemented in a distributed manner. Such systems are often split into components where each component is employed to perform a specific type of processing. The components of a system may be implemented in different programming languages because some languages are more suited for expressing and solving certain kinds of problems. The user of the system must have a way to monitor the state of individual components and also to modify their execution parameters through a user interface while the system is running. The distributed execution and programming language diversity represent a problem for the development of graphic user interfaces. In this paper we describe a framework in which a server provides two types of services to the components of a distributed system. First it manages visualization objects provided by individual components and combines and displays those objects in various views. Second, it displays and executes graphic user interface objects defined at runtime by the components and communicates with the components when changes occur in the user interface or in the internal state of the components. The framework was successfully used in a distributed robotic environment.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A water-obstacle separation and refinement network for unmanned surface vehicles</title>
      <link>/publications/bovcon2020a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/bovcon2020a/</guid>
      <description>&lt;p&gt;Obstacle detection by semantic segmentation shows great promise for autonomous navigation in unmanned surface vehicles (USV). However, existing methods suffer from poor estimation of the water edge in the presence of visual ambiguities, poor detection of small obstacles and a high false-positive rate on water reflections and wakes. We propose a new deep encoder-decoder architecture, a water-obstacle separation and refinement network (WaSR), to address these issues. Detection and water edge accuracy are improved by a novel decoder that gradually fuses inertial information from an IMU with the visual features from the encoder. In addition, a novel loss function is designed to increase the separation between water and obstacle features early on in the network. Subsequently, the capacity of the remaining layers in the decoder is better utilised, leading to a significant reduction in false positives and increased true positives. Experimental results show that WaSR outperforms the current state-of-the-art by a large margin, yielding a 14% increase in F-measure over the second-best method.&lt;/p&gt;</description>
    </item>
    <item>
      <title>A web-service for object detection using hierarchical models</title>
      <link>/publications/tabernik2013a/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2013a/</guid>
      <description>&lt;p&gt;This paper proposes an architecture for an object detection system suitable for a web-service running distributed on a cluster of machines. We build on top of a recently proposed architecture for a distributed visual recognition system and extend it with the object detection algorithm. As sliding-window techniques are computationally unsuitable for web-services, we rely on models based on state-of-the-art hierarchical compositions for the object detection algorithm. We provide implementation details for running hierarchical models on top of a distributed platform and propose an additional hypothesis verification step to reduce the many false positives that are common in hierarchical models. For verification we rely on a state-of-the-art descriptor extracted from the hierarchical structure and use a support vector machine for object classification. We evaluate the system on a cluster of 80 workers and show a response time of around 10 seconds at a throughput of around 60 requests per minute.&lt;/p&gt;</description>
    </item>
    <item>
      <title>About different active learning approaches for acquiring categorical knowledge</title>
      <link>/publications/skocaj2011about/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2011about/</guid>
      <description>&lt;p&gt;In this paper we address the problem of acquiring categorical knowledge from the active learning perspective. We describe and implement several teacher- and learner-driven approaches that require different levels of teacher competencies and consider different types of knowledge for the selection of training samples. The experimental results show that the active learning approach outperforms the passive one and that the adaptation of the learning process to the learner&amp;rsquo;s knowledge significantly improves the learning performance.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Acquiring range images of objects with non-uniform reflectance using high dynamic scale radiance maps</title>
      <link>/publications/skocaj2000acquiring/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2000acquiring/</guid>
      <description></description>
    </item>
    <item>
      <title>Active learning with teacher-learner mutuality</title>
      <link>/publications/majnik2013active/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/majnik2013active/</guid>
      <description>&lt;p&gt;In active learning, the basic objective is to reach a desired performance of some learning algorithm with as few training instances as possible. The reason is that labeling of training instances may be expensive with respect to the amount of time and intellectual effort of a human annotator. We propose a new approach to active learning, called &amp;ldquo;mutual active learning&amp;rdquo;, which helps the artificial learner to pose questions to its human teacher that are as clear and as understandable as possible. Such learning appears to be more reliable and successful than basic active learning.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Adaptive Dynamic Window Approach for Local Navigation</title>
      <link>/publications/dobrevski2020adaptive/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/dobrevski2020adaptive/</guid>
      <description>&lt;p&gt;Local navigation is an essential ability of any mobile robot working in a real-world environment. One of the most commonly used methods for local navigation is the Dynamic Window Approach (DWA), which heavily depends on the settings of the parameters in its cost function. Since the optimal choice of the parameters depends on the environment, which may significantly vary and change at any time, the parameters should be chosen dynamically in a data-driven way. To cope with this problem, we propose a novel deep convolutional neural network, which dynamically predicts these parameters considering the sensor readings. The network is trained using a state-of-the-art reinforcement learning algorithm. In this way, we combine the power of data-driven learning and the dynamic model of the robot, enabling adaptation to the current environment as well as guaranteeing collision-free movement and smooth trajectories of the mobile robot. The experimental results show that the proposed method outperforms the DWA method as well as its recent extension.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Adding discriminative power to hierarchical compositional models for object class detection</title>
      <link>/publications/kristan2013adding/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2013adding/</guid>
      <description>&lt;p&gt;In recent years, hierarchical compositional models have been shown to possess many appealing properties for object class detection, such as coping with a potentially large number of object categories. The reason is that they encode categories by hierarchical vocabularies of parts which are shared among the categories. On the downside, the sharing and purely reconstructive nature cause problems when categorizing visually similar categories and separating them from the background. In this paper we propose a novel approach that preserves the appealing properties of the generative hierarchical models, while at the same time improving their discrimination properties. We achieve this by introducing a network of discriminative nodes on top of the existing generative hierarchy. The discriminative nodes are sparse linear combinations of activated generative parts. We show in the experiments that the discriminative nodes consistently improve a state-of-the-art hierarchical compositional model. Results show that our approach considers only a fraction of all nodes in the vocabulary (less than 10%), which also makes the system computationally efficient.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Aktivno učenje z mešanimi oznakami za detekcijo površinskih napak z globokimi nevronskimi mrežami</title>
      <link>/publications/tabernik2024aktivno/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2024aktivno/</guid>
      <description>&lt;p&gt;This paper investigates active learning strategies for mixed supervision in surface defect detection, where we search for a minimal set of samples selected for more accurate manual segmentation. We explore several approaches for sample selection based on entropy, margin sampling, and least confidence, and apply them to a mixed supervision method, SegDecNet. We additionally explore extending active learning with probability calibration and equal sampling by categories to improve the robustness. Active learning approaches are evaluated on the KSDD2 dataset and compared against random sampling and a related purpose-built method for active learning in surface defect detection. We demonstrate that the least confidence method with the proposed extensions can outperform random sampling and other methods, achieving the same result as the fully annotated dataset while requiring only a third of the fully annotated samples.&lt;/p&gt;</description>
    </item>
    <item>
      <title>An adaptive coupled-layer visual model for robust visual tracking</title>
      <link>/publications/cehovin2011an/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cehovin2011an/</guid>
      <description>&lt;p&gt;This paper addresses the problem of tracking objects which undergo rapid and significant appearance changes. We propose a novel coupled-layer visual model that combines the target&amp;rsquo;s global and local appearance. The local layer in this model is a set of local patches that geometrically constrain the changes in the target&amp;rsquo;s appearance. This layer probabilistically adapts to the target&amp;rsquo;s geometric deformation, while its structure is updated by removing and adding local patches. The addition of the patches is constrained by the global layer that probabilistically models the target&amp;rsquo;s global visual properties such as color, shape and apparent local motion. The global visual properties are updated during tracking using the stable patches from the local layer. Through this coupled constraint paradigm between the adaptation of the global and local layers, we achieve more robust tracking through significant appearance changes. Indeed, the experimental results on challenging sequences confirm that our tracker outperforms the related state-of-the-art trackers with a smaller failure rate as well as better accuracy.&lt;/p&gt;</description>
    </item>
    <item>
      <title>An alternative way to calibrate ubisense real-time location system via multi-camera calibration methods</title>
      <link>/publications/mandeljc2010an/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/mandeljc2010an/</guid>
      <description>&lt;p&gt;The Ubisense Real-Time Location System is considered. The approach is based on capturing the raw angles of arrival and projecting them onto a virtual image plane, as if the sensors were perspective cameras. The extrinsic parameters (position and orientation) of the sensors are then obtained by calibrating the virtual perspective cameras using multi-camera calibration methods. The application considered in the paper is rapid deployment of the Ubisense system for tracking in sports. Survey points can be easily determined from the standard markings on the court floor, which makes calibration from survey-point coordinates more convenient than measuring sensor positions, which is a prerequisite for standard Ubisense system calibration.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Analiza robustnosti globokih nenadzorovanih metod za detekcijo vizualnih anomalij</title>
      <link>/publications/bozic2021analiza/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/bozic2021analiza/</guid>
      <description>&lt;p&gt;Unsupervised generative methods have recently attracted significant attention in the field of industrial visual anomaly detection, mainly owing to their ability to learn from non-anomalous data without requiring anomalous samples and pixel-level labels, which are costly to obtain. An assumption that anomalous data are always correctly identified and consequently removed from the training set underlies all of the generative methods. In practice, however, correctly identifying every single anomalous image can either be very costly or impossible due to the nature of the problem. In this paper, we analyze how robust some of the recently proposed generative methods for anomaly detection are by introducing anomalous data into the training process. Our analysis covers 3 methods and 4 datasets with 8 categories in total, and we conclude that while some of the methods are more robust than others, introducing a minor percentage of anomalous data in the training set does not significantly deteriorate the performance.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Anomalous Sound Detection by Feature-Level Anomaly Simulation</title>
      <link>/publications/zavrtanik2024anomalous/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/zavrtanik2024anomalous/</guid>
      <description>&lt;p&gt;Recently, a growing number of works have focused on machine defect detection from anomalous audio patterns. Datasets for the machine audio domain are scarce, and recent methods that perform well on benchmarks such as DCASE2020 Task 2 rely on auxiliary information, such as annotated data from other training classes in the domain, to extract information that can be used in deep-learning classification-based anomaly detection approaches. However, in practical scenarios, annotated data from the same domain may not be readily available, so annotation-free methods that can learn appropriate audio representations from unannotated data are needed. We propose AudDSR, a simulation-based anomaly detection method that learns to detect anomalies without additional annotated data and instead focuses on a discrete feature space sampling method for an anomaly simulation process. AudDSR outperforms competing methods that do not rely on annotated data on the DCASE2020 anomalous sound detection benchmark and even matches the performance of some methods that utilize additional annotation information.&lt;/p&gt;</description>
    </item>
    <item>
      <title>AnomalyVFM - Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors</title>
      <link>/publications/fucka2026anomalyvfm/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/fucka2026anomalyvfm/</guid>
      <description>&lt;p&gt;Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision–language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points. &lt;a href=&#34;https://maticfuc.github.io/anomaly_vfm/&#34;&gt;Project Page&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Appearance-based localization using CCA</title>
      <link>/publications/skocaj2004appearance-based/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2004appearance-based/</guid>
      <description>&lt;p&gt;In this paper we present an appearance-based approach to mobile robot localization based on Canonical Correlation Analysis. The main idea is to learn, using CCA, the relation between the appearances of the environment at a number of training locations and the coordinates of these locations, and then to use this knowledge to estimate the position of the robot in the localization stage. We present results of several experiments, which show that this approach is faster and less demanding in terms of space than the traditional PCA-based approach; however, in its standard form it generally yields inferior localization results.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Application of Temporal Convolutional Neural Network for the Classification of Crops on SENTINEL-2 Time Series</title>
      <link>/publications/racic2020application/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/racic2020application/</guid>
      <description></description>
    </item>
    <item>
      <title>Automatic Evaluation of Organized Basketball Activity</title>
      <link>/publications/perse2007automatic/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/perse2007automatic/</guid>
      <description>&lt;p&gt;In this article the trajectory-based evaluation of multi-player basketball activity is addressed. The organized basketball activity consists of a set of key elements and their temporal relations. The activity evaluation is performed by analyzing each of them individually, and the final reasoning about the activity is achieved using a Bayesian network. The network structure is obtained automatically from the activity template, which is a standard tool used by basketball experts. The experimental results suggest that our approach can successfully evaluate the quality of the observed activity.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Back To The Drawing Board: Rethinking Scene-Level Sketch-Based Image Retrieval</title>
      <link>/publications/demic2025back/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/demic2025back/</guid>
      <description>&lt;p&gt;The goal of Scene-level Sketch-Based Image Retrieval is to retrieve natural images matching the overall semantics and spatial layout of a free-hand sketch. Unlike prior work focused on architectural augmentations of retrieval models, we emphasize the inherent ambiguity and noise present in real-world sketches. This insight motivates a training objective that is explicitly designed to be robust to sketch variability. We show that with an appropriate combination of pre-training, encoder architecture, and loss formulation, it is possible to achieve state-of-the-art performance without the introduction of additional complexity. Extensive experiments on the challenging FS-COCO and widely used SketchyCOCO datasets confirm the effectiveness of our approach and underline the critical role of training design in cross-modal retrieval tasks, as well as the need to improve the evaluation scenarios of scene-level SBIR.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Bayes Spectral Entropy-Based Measure of Camera Focus</title>
      <link>/publications/kristan2005bayes/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2005bayes/</guid>
      <description></description>
    </item>
    <item>
      <title>Beyond standard benchmarks: Parameterizing performance evaluation in visual object tracking</title>
      <link>/publications/cehovin2017beyond/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cehovin2017beyond/</guid>
      <description>&lt;p&gt;Object-to-camera motion produces a variety of apparent motion patterns that significantly affect performance of short-term visual trackers. Despite being crucial for designing robust trackers, their influence is poorly explored in standard benchmarks due to weakly defined, biased and overlapping attribute annotations. In this paper we propose to go beyond pre-recorded benchmarks with post-hoc annotations by presenting an approach that utilizes omnidirectional videos to generate realistic, consistently annotated, short-term tracking scenarios with exactly parameterized motion patterns. We have created an evaluation system, constructed a fully annotated dataset of omnidirectional videos and generators for typical motion patterns. We provide an in-depth analysis of major tracking paradigms which is complementary to the standard benchmarks and confirms the expressiveness of our evaluation approach.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Binding and Cross-modal Learning in Markov Logic Networks</title>
      <link>/publications/vrecko2010binding/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/vrecko2010binding/</guid>
      <description>&lt;p&gt;Binding  the ability to combine two or more modal representations of the same entity into a single shared representation is vital for every cognitive system operating in a complex environment. In order to successfully adapt to changes in an dynamic environment the binding mechanism has to be supplemented with cross-modal learning. In this paper we define the problems of high-level binding and cross-modal learning. By these definitions we model a binding mechanism and a cross-modal learner in Markov logic network and test the system on a synthetic object database.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Binding and Cross-modal Learning in Markov Logic Networks</title>
      <link>/publications/vrecko2011binding/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/vrecko2011binding/</guid>
      <description>&lt;p&gt;Binding &amp;ndash; the ability to combine two or more modal representations of the same entity into a single shared representation &amp;ndash; is vital for every cognitive system operating in a complex environment. In order to successfully adapt to changes in a dynamic environment, the binding mechanism has to be supplemented with cross-modal learning. In this paper we define the problems of high-level binding and cross-modal learning. Based on these definitions we model a binding mechanism and a cross-modal learner in a Markov logic network and test the system on a synthetic object database.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation</title>
      <link>/publications/wolf2026brewing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/wolf2026brewing/</guid>
      <description>&lt;p&gt;Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student&amp;rsquo;s pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources.&lt;/p&gt;</description>
    </item>
    <item>
      <title>CDTB: A Color and Depth Visual Object Tracking Dataset and Benchmark</title>
      <link>/publications/lukezic2019cdtb/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/lukezic2019cdtb/</guid>
      <description>&lt;p&gt;A long-term visual object tracking performance evaluation methodology and a benchmark are proposed. Performance measures are designed by following a long-term tracking definition to maximize the analysis probing strength. The new measures outperform existing ones in interpretation potential and in better distinguishing between different tracking behaviors. We show that these measures generalize the short-term performance measures, thus linking the two tracking problems. Furthermore, the new measures are highly robust to temporal annotation sparsity and allow annotation of sequences hundreds of times longer than in the current datasets without increasing manual annotation labor. A new challenging dataset of carefully selected sequences with many target disappearances is proposed. A new tracking taxonomy is proposed to position trackers on the short-term/long-term spectrum. The benchmark contains an extensive evaluation of the largest number of long-term trackers and a comparison to state-of-the-art short-term trackers. We analyze the influence of tracking architecture implementations on long-term performance and explore various re-detection strategies as well as the influence of visual model update strategies on long-term tracking drift. The methodology is integrated in the VOT toolkit to automate experimental analysis and benchmarking and to facilitate future development of long-term trackers.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Cheating Depth: Enhancing 3D Surface Anomaly Detection via Depth Simulation</title>
      <link>/publications/zavrtanik2024cheating/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/zavrtanik2024cheating/</guid>
      <description>&lt;p&gt;RGB-based surface anomaly detection methods have advanced significantly. However, certain surface anomalies remain practically invisible in RGB alone, necessitating the incorporation of 3D information. Existing approaches that employ point-cloud backbones suffer from suboptimal representations and reduced applicability due to slow processing. Re-training RGB backbones, designed for faster dense input processing, on industrial depth datasets is hindered by the limited availability of sufficiently large datasets. We make several contributions to address these challenges. (i) We propose a novel Depth-Aware Discrete Autoencoder (DADA) architecture, that enables learning a general discrete latent space that jointly models RGB and 3D data for 3D surface anomaly detection. (ii) We tackle the lack of diverse industrial depth datasets by introducing a simulation process for learning informative depth features in the depth encoder. (iii) We propose a new surface anomaly detection method 3DSR, which outperforms all existing state-of-the-art on the challenging MVTec3D anomaly detection benchmark, both in terms of accuracy and processing speed. The experimental results validate the effectiveness and efficiency of our approach, highlighting the potential of utilizing depth information for improved surface anomaly detection.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Co-segmentation for visual object tracking</title>
      <link>/publications/cehovin2020co-segmentation/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cehovin2020co-segmentation/</guid>
      <description></description>
    </item>
    <item>
      <title>Cognitive Systems</title>
      <link>/publications/skocaj2009cognitive/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2009cognitive/</guid>
      <description></description>
    </item>
    <item>
      <title>Comparing different learning approaches in categorical knowledge acquisition</title>
      <link>/publications/skocaj2012comparing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2012comparing/</guid>
      <description>&lt;p&gt;In this paper we address the problem of acquiring categorical knowledge from the active learning perspective. We describe and implement several teacher and learner-driven approaches that require different levels of teacher competencies and consider different types of knowledge for selection of training samples. The experimental results show that the active learning approach outperforms the passive one and that the adaptation of the learning process to the learner’s knowledge significantly improves the learning performance.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Conservative visual learning for object detection with minimal hand labeling effort</title>
      <link>/publications/roth2005conservative/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/roth2005conservative/</guid>
      <description>&lt;p&gt;We present a novel framework for unsupervised training of an object detection system. The basic idea is to (1) exploit a huge amount of unlabeled video data by being very conservative in selecting training examples; and (2) start with a very simple object detection system and use generative and discriminative classifiers in an iterative co-training fashion, arriving at a better object detector. We demonstrate the framework on a surveillance task where we learn a person detector. We start with a simple moving object classifier and proceed with a robust PCA (on shape and appearance) as a generative classifier, which in turn generates a training set for a discriminative AdaBoost classifier. The results obtained by AdaBoost are again filtered by PCA, which produces an even better training set. We demonstrate that by using this approach we avoid hand labeling training data and still achieve a state-of-the-art detection rate.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Context awareness for object detection</title>
      <link>/publications/perko2007context/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/perko2007context/</guid>
      <description>&lt;p&gt;A wide range of algorithms have been proposed to detect objects in still images. However, most of the current approaches are purely based on local appearance and ignore the context in which these objects are embedded. This paper proposes a general approach to extract, learn and use contextual information from images to increase the performance of classical object detection methods. The important properties of the proposed approach are that it can be combined with any existing object detection method and it provides a general framework not limited to one specific object category.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Continuous Learning of Simple Visual Concepts using Incremental Kernel Density Estimation</title>
      <link>/publications/skocaj2008continuous/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2008continuous/</guid>
      <description>&lt;p&gt;In this paper we propose a method for continuous learning of simple visual concepts. The method continuously associates words describing observed scenes with automatically extracted visual features. Since in our setting every sample is labelled with multiple concept labels, and there are no negative examples, reconstructive representations of the incoming data are used. The associated features are modelled with kernel density probability distribution estimates, which are built incrementally. The proposed approach is applied to the learning of object properties and spatial relations.&lt;/p&gt;</description>
    </item>
    <item>
      <title>D3S - A Discriminative Single Shot Segmentation Tracker</title>
      <link>/publications/lukezic2020d3s/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/lukezic2020d3s/</guid>
      <description>&lt;p&gt;Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but are restricted to bounding box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker &amp;ndash; D3S, which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties, one invariant to a broad range of transformations, including non-rigid deformations, the other assuming a rigid object, to simultaneously achieve high robustness and online target segmentation. Without per-dataset finetuning and trained only for segmentation as the primary output, D3S outperforms all trackers on the VOT2016, VOT2018 and GOT-10k benchmarks and performs close to the state-of-the-art trackers on TrackingNet. D3S outperforms the leading segmentation tracker SiamMask on video object segmentation benchmarks and performs on par with top video object segmentation algorithms, while running an order of magnitude faster, close to real-time.&lt;/p&gt;</description>
    </item>
    <item>
      <title>DAL: A Deep Depth-Aware Long-term Tracker</title>
      <link>/publications/qian2020dal/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/qian2020dal/</guid>
      <description></description>
    </item>
    <item>
      <title>Dana36: A Multi-Camera Image Dataset for Object Identification in Surveillance Scenarios</title>
      <link>/publications/pers2012dana36/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/pers2012dana36/</guid>
      <description>&lt;p&gt;We present a novel dataset for the evaluation of object matching and recognition methods in surveillance scenarios. The dataset consists of more than 23,000 images, depicting 15 persons and nine vehicles. Ground truth data - the identity of each person or vehicle - is provided, along with the coordinates of the bounding box in the full camera image. The dataset was acquired from 36 stationary camera views using a variety of surveillance cameras with resolutions ranging from standard VGA to three megapixels. 27 cameras observed the persons and vehicles in an outdoor environment, while the remaining nine observed the same persons indoors. The activity of the persons was planned in advance: they drove the cars to the parking lot, exited the cars and walked around the building, through the main entrance, and up the stairs towards the first floor of the building. The intended use of the dataset is performance evaluation of computer vision methods that aim to (re)identify people and objects from many different viewpoints, in different environments and under variable conditions. Due to the variety of camera locations, vantage points and resolutions, the dataset provides means to adjust the difficulty of the identification task in a controlled and documented manner. An interface for easy use of the dataset within Matlab is provided as well, and the data is complemented by baseline results using a basic color histogram-based descriptor. While the cropped images of persons and vehicles represent the primary data in our dataset, we also provide full-frame images and a set of tracklets for each object as a courtesy to the dataset users.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Deep-learning-based computer vision system for surface-defect detection</title>
      <link>/publications/tabernik2019deep-learning-based/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2019deep-learning-based/</guid>
      <description>&lt;p&gt;Automating optical-inspection systems using machine learning has become an interesting and promising area of research. In particular, the deep-learning approaches have shown a very high and direct impact on the application domain of visual inspection. This paper presents a complete inspection system for automated quality control of a specific industrial product. Both hardware and software part of the system are described, with machine vision used for image acquisition and pre-processing followed by a segmentation-based deep-learning model used for surface-defect detection. The deep-learning model is compared with the state-of-the-art commercial software, showing that the proposed approach outperforms the related method on the specific domain of surface-crack detection. Experiments are performed on a real-world quality-control case and demonstrate that the deep-learning model can be successfully used even when only 33 defective training samples are available. This makes the deep-learning method practical for use in industry where the number of available defective samples is limited.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Demonstracijska celica za  prikaz globokega učenja v  praktičnih aplikacijah</title>
      <link>/publications/tabernik2024demonstracijska/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2024demonstracijska/</guid>
      <description>&lt;p&gt;In recent years, deep learning methods have become a key tool for solving diverse practical challenges. Nevertheless, the potential of such methods often remains poorly understood by the general public, because the development and demonstration of algorithms is frequently separated from the actual practical problems the algorithms address. In this paper we present a demonstration cell that combines hardware, software and deep learning algorithms, enabling easy demonstration of these methods in various application domains. The cell includes cameras, a graphical interface and five demonstration programs, which demonstrate the classification of wooden boards, detection of surface anomalies, polyp counting, traffic sign detection and detection of corners of textile products. The implemented modular approach enables easy integration of various deep learning algorithms. The system facilitates better understanding and use of these methods in practical scenarios and contributes to the development of innovative solutions in the field of deep learning.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Depth Fingerprinting for Obstacle Tracking using 3D Point Cloud </title>
      <link>/publications/muhovic2018depth/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/muhovic2018depth/</guid>
      <description>&lt;p&gt;We present a method for automatic detection and tracking of obstacles on the water surface that uses solely the point cloud obtained from the surroundings of the unmanned surface vehicle (USV). For this purpose, we use a calibrated pair of stereo cameras, affixed to the mast at the front of the USV. Reliable obstacle tracking in an outdoor environment is a difficult task, but unlike the monocular approaches, our framework offloads a large part of the problem onto the method that provides a point cloud. In the absence of other visual features, our method introduces the &lt;em&gt;depth fingerprint&lt;/em&gt;, a histogram-like feature obtained from the point cloud of an object. The method has been evaluated on the yet unreleased MODD2 dataset and shows promising results, with depth fingerprinting significantly outperforming tracking based solely on optimal assignment weighted by the geometrical distance between object detections (Munkres algorithm). The proposed method is capable of running in real time on board a small-sized USV.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Detekcija napak na površinah z uporabo anotiranih slik in globokim učenjem</title>
      <link>/publications/tabernik2017detekcija/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2017detekcija/</guid>
      <description>&lt;p&gt;Automated surface anomaly detection using machine learning has become an interesting area of research with a very high direct impact on the application domain of visual inspection. Deep learning approaches seem to be very appropriate for teaching inspection systems to detect surface anomalies by showing them a number of exemplar images. In this paper we present and analyze a deep learning architecture for the segmentation of surface anomalies, upgraded with a simple classification function that differentiates between images of faulty and defect-free surfaces. The preliminary results show that the approach is very promising and that the deep learning paradigm is appropriate for the domain of automated visual inspection.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Detekcija ovir iz 3D oblaka točk za potrebe avtonomne plovbe</title>
      <link>/publications/muhovic2017detekcija-ovir/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/muhovic2017detekcija-ovir/</guid>
      <description></description>
    </item>
    <item>
      <title>Detekcija površinskih napak na oblačilih za reciklažo z uporabo nadzorovanih metod globokega učenja</title>
      <link>/publications/tabernik2025erk/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2025erk/</guid>
      <description>&lt;p&gt;Efficient sorting of used garments is essential for textile recycling in the circular economy. Surface defect detection, such as identifying stains or tears, enables automated classification of items for reuse or recycling. In this paper, we focus on the problem of detecting surface defects on second-hand clothing using supervised deep learning methods. We present an analysis of our two previously proposed general-purpose surface defect detection models (SegDecNet and SuperSimpleNet) along with four modern backbone image classification architectures (ConvNeXt, ViT, Swin, and DINO). For evaluation, we curate a tailored binary classification dataset derived from the real-world garment dataset, including over 12,000 annotated clothing images. Our results show that SuperSimpleNet significantly outperforms the other methods, achieving an average precision of 72%, while highlighting the inherent challenges of this task due to garment variability and subtle or occluded defects.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Detekcija, lokalizacija in identifikacija oseb z več kamerami ter mapami značilnic</title>
      <link>/publications/mandeljc2012detekcija/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/mandeljc2012detekcija/</guid>
      <description>&lt;p&gt;This paper presents a system for detection, localization and identification of persons at individual time instants, without the temporal filtering present in most systems of this kind. The main goal of the presented approach is to eliminate catastrophic errors that prevent fully automatic processing of realistically long video recordings. The system is based on the fusion of several weak features, encoded in the form of feature maps, with the fusion performed by one or more trained classifiers.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Dimensionality Reduction for Distributed Vision Systems Using Random Projection</title>
      <link>/publications/sulic2010dimensionality/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/sulic2010dimensionality/</guid>
      <description>&lt;p&gt;Dimensionality reduction is an important issue in the context of distributed vision systems. Processing of dimensionality-reduced data requires far fewer network resources (e.g., storage space, network bandwidth) than processing of the original data. In this paper we explore the performance of the random projection method for distributed smart cameras. In our tests, random projection is compared to principal component analysis in terms of recognition efficiency (i.e., object recognition). The results obtained on the COIL-20 image data set show good performance of random projection in comparison to principal component analysis, which requires distribution of a subspace and therefore consumes more resources of the network. This indicates that the random projection method can elegantly solve the problem of subspace distribution in embedded and distributed vision systems. Moreover, even without explicit orthogonalization or normalization of the random projection transformation subspace, the method achieves good object recognition efficiency.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Discriminative Correlation Filter with Channel and Spatial Reliability</title>
      <link>/publications/lukezic2017discriminative/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/lukezic2017discriminative/</guid>
      <description>&lt;p&gt;Short-term tracking is an open and challenging problem for which discriminative correlation filters (DCF) have shown excellent performance. We introduce the channel and spatial reliability concepts to DCF tracking and provide a novel learning algorithm for their efficient and seamless integration in the filter update and the tracking process. The spatial reliability map adjusts the filter support to the part of the object suitable for tracking. This both allows enlarging the search region and improves tracking of non-rectangular objects. Reliability scores reflect the channel-wise quality of the learned filters and are used as feature weighting coefficients in localization. Experimentally, with only two simple standard features, HoGs and Colornames, the novel CSR-DCF method &amp;ndash; DCF with Channel and Spatial Reliability &amp;ndash; achieves state-of-the-art results on VOT 2016, VOT 2015 and OTB100. The CSR-DCF runs in real-time on a CPU.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Domain-specific adaptations for region proposals</title>
      <link>/publications/tabernik2015domain-specific/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2015domain-specific/</guid>
      <description>&lt;p&gt;In this work we propose a novel approach to the detection of all traffic sign boards. We propose to employ state-of-the-art region proposals as a first step to reduce the initial search space and provide a way to use a strong classifier for fine-grained classification. We evaluate multiple region proposal methods on the domain of traffic sign detection and further propose various domain-specific adaptations to improve their performance. We show that edge boxes with domain-specific learning and re-scoring based on trained shape information are able to significantly outperform the remaining methods on the German Traffic Sign Database. Furthermore, we show they achieve a higher recall with high-quality regions at a lower number of regions than the remaining methods.&lt;/p&gt;</description>
    </item>
    <item>
      <title>DRAEM -- A discriminatively trained reconstruction embedding for surface anomaly detection</title>
      <link>/publications/zavrtanik2021draem/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/zavrtanik2021draem/</guid>
      <description>&lt;p&gt;Visual surface anomaly detection aims to detect local image regions that significantly deviate from normal appearance. Recent surface anomaly detection methods rely on generative models to accurately reconstruct the normal areas and to fail on anomalies. These methods are trained only on anomaly-free images, and often require hand-crafted post-processing steps to localize the anomalies, which prohibits optimizing the feature extraction for maximal detection capability. In addition to the reconstructive approach, we cast surface anomaly detection primarily as a discriminative problem and propose a discriminatively trained reconstruction anomaly embedding model (DRAEM). The proposed method learns a joint representation of an anomalous image and its anomaly-free reconstruction, while simultaneously learning a decision boundary between normal and anomalous examples. The method enables direct anomaly localization without the need for additional complicated post-processing of the network output and can be trained using simple and general anomaly simulations. On the challenging MVTec anomaly detection dataset, DRAEM outperforms the current state-of-the-art unsupervised methods by a large margin and even delivers detection performance close to the fully-supervised methods on the widely used DAGM surface-defect detection dataset, while substantially outperforming them in localization accuracy.&lt;/p&gt;</description>
    </item>
    <item>
      <title>DSR – A Dual Subspace Re-Projection Network for Surface Anomaly Detection</title>
      <link>/publications/zavrtanik2022dsr/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/zavrtanik2022dsr/</guid>
      <description>&lt;p&gt;The state-of-the-art in discriminative unsupervised surface anomaly detection relies on external datasets for synthesizing anomaly-augmented training images. Such approaches are prone to failure on near-in-distribution anomalies, since these are difficult to synthesize realistically due to their similarity to anomaly-free regions. We propose an architecture based on a quantized feature space representation with dual decoders, DSR, that avoids the image-level anomaly synthesis requirement. Without making any assumptions about the visual properties of anomalies, DSR generates the anomalies at the feature level by sampling the learned quantized feature space, which allows a controlled generation of near-in-distribution anomalies. DSR achieves state-of-the-art results on the KSDD2 and MVTec anomaly detection datasets. The experiments on the challenging real-world KSDD2 dataset show that DSR significantly outperforms other unsupervised surface anomaly detection methods, improving the previous top-performing methods by 10% AP in anomaly detection and 35% AP in anomaly localization.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Efficient Dimensionality Reduction Using Random Projection</title>
      <link>/publications/sulic2010efficient/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/sulic2010efficient/</guid>
      <description>&lt;p&gt;Dimensionality reduction techniques are especially important in the context of embedded vision systems. A promising dimensionality reduction method for use in such systems is random projection. In this paper we explore the performance of the random projection method, which can be easily used in embedded cameras. Random projection is compared to principal component analysis in terms of recognition efficiency on the COIL-20 image data set. Results show surprisingly good performance of random projection in comparison to principal component analysis, even without explicit orthogonalization or normalization of the transformation subspace. These results support the use of random projection in our hierarchical feature-distribution scheme in visual-sensor networks, where random projection elegantly solves the problem of shared subspace distribution.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Efficient spring system optimization for part-based visual tracking</title>
      <link>/publications/lukezic2015efficient/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/lukezic2015efficient/</guid>
      <description>&lt;p&gt;Part-based trackers typically use visual and geometric constraints to find the optimal positions of the parts in the constellation. Recently, spring systems were successfully applied to model these constraints. In this paper we propose an optimization method developed for multi-dimensional spring systems, which can be integrated into the part-based tracking model. The experimental analysis shows that our optimization method outperforms the conjugate gradient descent optimization in terms of convergence speed, accuracy and numerical stability.&lt;/p&gt;</description>
    </item>
    <item>
      <title>End-to-end training of a two-stage neural network for defect detection</title>
      <link>/publications/bozic2020end-to-end/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/bozic2020end-to-end/</guid>
      <description>&lt;p&gt;A segmentation-based, two-stage neural network has shown excellent results in surface defect detection, enabling the network to learn from a relatively small number of samples. In this work, we introduce end-to-end training of the two-stage network together with several extensions to the training process, which reduce the amount of training time and improve the results on surface defect detection tasks. To enable end-to-end training, we carefully balance the contributions of both the segmentation and the classification loss throughout the learning. We adjust the gradient flow from the classification into the segmentation network in order to prevent the unstable features from corrupting the learning. As an additional extension to the learning, we propose a frequency-of-use sampling scheme for negative samples to address the issue of over- and under-sampling of images during training, while we employ the distance transform algorithm on the region-based segmentation masks as weights for positive pixels, giving greater importance to areas with a higher probability of the presence of a defect without requiring detailed annotation. We demonstrate the performance of the end-to-end training scheme and the proposed extensions on three defect detection datasets—DAGM, KolektorSDD and the Severstal Steel defect dataset—where we show state-of-the-art results. On DAGM and KolektorSDD we demonstrate a 100% detection rate, therefore completely solving these datasets. An additional ablation study performed on all three datasets quantitatively demonstrates the contribution of each of the proposed extensions to the overall result improvements.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Entropy Based Measure of Camera Focus</title>
      <link>/publications/kristan2004entropy/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2004entropy/</guid>
      <description>&lt;p&gt;A new measure for assessing camera focus from the recorded image is presented in this paper. The proposed measure is based on calculating entropy in the image frequency domain, and we call it the frequency domain entropy, or FDE. First, an intuitive explanation of the measure is presented; next, tests for some classical properties that such a measure should meet are conducted and commented on.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Evaluating multi-class learning strategies in a generative hierarchical framework for object detection.</title>
      <link>/publications/fidler2009evaluating/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/fidler2009evaluating/</guid>
      <description></description>
    </item>
    <item>
      <title>Fast Spatially Regularized Correlation Filter Tracker</title>
      <link>/publications/lukezic2018fast/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/lukezic2018fast/</guid>
      <description>&lt;p&gt;Discriminative correlation filters (DCF) have attracted significant attention from the tracking community. The standard formulation of the DCF affords a closed-form solution, but is not robust and is constrained to learning and detection using a relatively small search region. Spatial regularization was proposed to address learning from larger regions, but it prohibits a closed-form solution and leads to an iterative optimization with a significant computational load, resulting in slow model learning and tracking. We propose to reformulate the spatially regularized filter cost function such that it offers an efficient optimization. This significantly speeds up the tracker (approximately 14 times) and results in real-time tracking at the same or better accuracy.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Filtering out nondiscriminative keypoints by geometry based keypoint constellations</title>
      <link>/publications/racki2015filtering/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/racki2015filtering/</guid>
      <description>&lt;p&gt;Keypoint-based object detection typically utilizes the nearest neighbour matching technique in order to match discriminative and reject nondiscriminative keypoints. A detected keypoint is considered nondiscriminative if it is similar enough to more than one model keypoint. This strategy does not always prove efficient, especially in cases where objects consist of repeating patterns, such as letters in logotypes, where potentially useful keypoints can get rejected. In this paper we propose a geometry-based approach for filtering out nondiscriminative keypoints. Our approach is not affected by repeating patterns and filters out nondiscriminative keypoints by means of pre-learned geometry constraints. We evaluate the proposed method on a challenging dataset depicting logotypes in real-world environments under strong illumination and viewpoint changes.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Formalization of different learning strategies in a continuous learning framework</title>
      <link>/publications/skocaj2009formalization/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2009formalization/</guid>
      <description>&lt;p&gt;While the ability to learn on its own is an important feature of a learning agent, another, equally important feature is the ability to interact with its environment and to learn in interaction with other cognitive agents and humans. In this paper we analyze such interactive learning and define several learning strategies requiring different levels of tutor involvement and robot autonomy. We propose a new formal model for describing the learning strategies. The formalism takes into account different levels and types of communication between the robot and the tutor and different actions that can be undertaken. We also propose appropriate performance measures and show the experimental results of the evaluation of the proposed learning strategies.&lt;/p&gt;</description>
    </item>
    <item>
      <title>FuCoLoT - A Fully-Correlational Long-Term Tracker</title>
      <link>/publications/lukezic2018fucolot/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/lukezic2018fucolot/</guid>
      <description>&lt;p&gt;A Fully Correlational Long-term Tracker (FuCoLoT) exploits the novel DCF constrained filter learning method to design a detector that is able to re-detect the target in the whole image efficiently. FuCoLoT maintains several correlation filters trained on different time scales that act as the detector components. A novel mechanism based on the correlation response is used for tracking failure estimation. FuCoLoT achieves state-of-the-art results on standard short-term benchmarks and it outperforms the current best-performing tracker on the long-term  UAV20L benchmark by over 19%. It has an order of magnitude smaller memory footprint than its best-performing competitors and runs at 15fps in a single CPU thread.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Fusion of non-visual modalities into the probabilistic occupancy map framework for person localization</title>
      <link>/publications/mandeljc2011fusion-of/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/mandeljc2011fusion-of/</guid>
      <description>&lt;p&gt;In this paper we investigate the possibilities for fusion of non-visual sensor modalities into a state-of-the-art vision-based framework for person detection and localization, the Probabilistic Occupancy Map (POM), with the aim of improving the frame-by-frame localization results in a realistic (cluttered) indoor environment. We point out the aspects that need to be considered when fusing non-visual sensor information into POM and provide a mathematical model for it. We demonstrate the proposed fusion method on the example of a multi-camera and radio-based person localization setup. The performance of both systems is evaluated, showing their strengths and weaknesses. We show that localization results may be significantly improved by fusing the information from the radio-based system into the camera-based POM framework using the proposed model.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Fusion of Non-Visual Modalities Into the Probabilistic Occupancy Map Framework for Person Localization</title>
      <link>/publications/mandeljc2011fusion/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/mandeljc2011fusion/</guid>
      <description>&lt;p&gt;In recent years, the problem of person detection and localization has received much attention, with two strong areas of application being surveillance/security and tracking of players in sports. Different solutions based on different sensor modalities have been proposed, and recently sensor fusion has gained prominence as a paradigm for overcoming the limitations of the individual sensor modalities. We investigate the possibilities for fusion of additional, non-visual, sensor modalities into a state-of-the-art vision-based framework for person detection and localization, the Probabilistic Occupancy Map (POM), with the aim of improving the localization results in a realistic, cluttered, indoor environment. We point out the aspects that need to be considered when fusing additional sensor information into POM and provide a possible mathematical model for it. Finally, we experimentally demonstrate the proposed fusion on the example of person localization in a cluttered environment. The performance of a system comprising visual cameras and POM and a radio-based localization system is experimentally evaluated, showing their strengths and weaknesses. We then improve the localization results by fusing the information from the radio-based system into POM using the proposed model.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Hallucinating Hidden Obstacles for Unmanned Surface Vehicles Using a Compositional Model</title>
      <link>/publications/muhovic2023hallucinating/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/muhovic2023hallucinating/</guid>
      <description>&lt;p&gt;The water environment in which unmanned surface vehicles (USVs) navigate presents many unique challenges. One of these is the risk of encountering obstacles that are (partially) submerged and therefore poorly visible, so their extent cannot be determined directly from the available above-water sensor data. On the other hand, it is well known that human skippers are able to safely navigate boats around obstacles even without underwater sensors, with only the help of their expertise. In this paper, we describe initial work on extending USV obstacle detection to include such functionality using a compositional model. To learn to hallucinate the extent of obstacles with a minimum of learning effort, we exploit the nature of obstacles (people in kayaks, canoes, and on paddleboards) that are visible most of the time, but not always. We evaluate the impact of such hallucinations on USV safety and maneuverability, and suggest additional cases where such hallucinations can be used to improve USV safety.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Hand pointing detection system for tabletop visual human-machine interaction</title>
      <link>/publications/rulic2008hand/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/rulic2008hand/</guid>
      <description></description>
    </item>
    <item>
      <title>HIDRA-T – A Transformer-Based Sea Level Forecasting Method</title>
      <link>/publications/rus2023hidra-t/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/rus2023hidra-t/</guid>
      <description>&lt;p&gt;Sea surface height forecasting is critical for timely prediction of coastal flooding and mitigation of its impact on coastal communities. Traditional numerical ocean models are limited in terms of computational cost and accuracy, while deep learning models have shown promising results in this area. However, there is still a need for more accurate and efficient deep learning architectures for sea level and storm surge modeling. In this context, we propose a new deep-learning architecture for sea level and storm tide modeling, HIDRA-T, which is based on transformers and outperforms both state-of-the-art deep-learning network designs HIDRA1 and HIDRA2 and two state-of-the-art numerical ocean models (a NEMO engine with sea level data assimilation and the SCHISM ocean modeling system), over all sea level bins and all forecast lead times. Compared to its predecessor HIDRA2, HIDRA-T employs novel transformer-based atmospheric and sea level encoders, as well as a novel feature fusion and regression block. HIDRA-T was trained on surface wind and pressure fields from the ECMWF atmospheric ensemble and on Koper tide gauge observations. A consistently superior performance over all other models is observed in the extreme tail of the sea level distribution.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Hierarchical Feature Encoding for Object Recognition in Visual Sensor Networks</title>
      <link>/publications/sulic2008hierarchical/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/sulic2008hierarchical/</guid>
      <description></description>
    </item>
    <item>
      <title>Hierarchical Spatial Model for 2D Range Data Based Room Categorization</title>
      <link>/publications/ursic2016hierarchical/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/ursic2016hierarchical/</guid>
      <description>&lt;p&gt;The next generation service robots are expected to co-exist with humans in their homes. Such a mobile robot requires an efficient representation of space, which should be compact and expressive, for effective operation in real-world environments. In this paper we present a novel approach for 2D ground-plan-like laser-range-data-based room categorization that builds on a compositional hierarchical representation of space, and show how an additional abstraction layer, whose parts are formed by merging partial views of the environment followed by graph extraction, can achieve improved categorization performance. A new algorithm is presented that finds a dictionary of exemplar elements from a multi-category set, based on the affinity measure defined among pairs of elements. This algorithm is used for part selection in new layer construction. Room categorization experiments have been performed on a challenging publicly available dataset, which has been extended in this work. State-of-the-art results were obtained by achieving the most balanced performance over all categories.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Hierarchical statistical learning of generic parts of object structure</title>
      <link>/publications/fidler2006hierarchical/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/fidler2006hierarchical/</guid>
      <description></description>
    </item>
    <item>
      <title>High-Dimensional Feature Matching: Employing the Concept of Meaningful Nearest Neighbors</title>
      <link>/publications/omercevic2007high-dimensional/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/omercevic2007high-dimensional/</guid>
      <description>&lt;p&gt;Matching of high-dimensional features using nearest neighbors search is an important part of image matching methods which are based on local invariant features. In this work we highlight effects pertinent to high-dimensional spaces that are significant for matching, yet have not been explicitly accounted for in previous work. In our approach, we require every nearest neighbor to be meaningful, that is, sufficiently close to a query feature such that it is an outlier to a background feature distribution. We estimate the background feature distribution from the extended neighborhood of a query feature given by its k nearest neighbors. Based on the concept of meaningful nearest neighbors, we develop a novel high-dimensional feature matching method and evaluate its performance by conducting image matching on two challenging image data sets. A superior performance in terms of accuracy is shown in comparison to several state-of-the-art approaches. Additionally, to make search for k nearest neighbors more efficient, we develop a novel approximate nearest neighbors search method based on sparse coding with an overcomplete basis set that provides a ten-fold speed-up over an exhaustive search even for high dimensional spaces and retains excellent approximation to an exact nearest neighbors search.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Histogram of oriented gradients and region covariance descriptor in hierarchical feature-distribution scheme</title>
      <link>/publications/sulic2010histogram/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/sulic2010histogram/</guid>
      <description>&lt;p&gt;The hierarchical feature-distribution scheme is a recently proposed framework for the distribution of features in visual-sensor networks. It is intended for tasks where one needs to establish a correspondence between two objects seen by different cameras at different occasions. In visual-sensor networks, such a pair of cameras may be very distant in network terms. Therefore, the hierarchical scheme results in a significant reduction of network traffic compared to naive approaches, which rely on flooding. In this paper we explore the performance of two state-of-the-art feature descriptors (the histogram of oriented gradients and the region covariance descriptor) in such a feature-distribution scheme. Both methods are compared in terms of network load on the COIL-100 data set. Results show that even state-of-the-art feature descriptors benefit from the hierarchical feature-distribution scheme.&lt;/p&gt;</description>
    </item>
    <item>
      <title>How Computer Vision can help in Outdoor Positioning</title>
      <link>/publications/steinhoff2007how/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/steinhoff2007how/</guid>
      <description>&lt;p&gt;Localization technologies have been an important focus in ubiquitous computing. This paper explores an underrepresented area, namely computer vision technology, for outdoor positioning. More specifically, we explore two modes of positioning in a challenging real-world scenario: single-snapshot-based positioning, improved by a novel high-dimensional feature matching method, and continuous positioning, enabled by a combination of snapshot and incremental positioning. Quite interestingly, vision enables localization accuracies comparable to GPS. Furthermore, the paper also analyzes and compares the possibilities offered by the combination of different subsets of positioning technologies such as WiFi, GPS and dead reckoning in the same real-world scenario as for vision-based positioning.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Hypothesis verification with histogram of compositions improves object detection of hierarchical models</title>
      <link>/publications/tabernik2013hypothesis/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2013hypothesis/</guid>
      <description>&lt;p&gt;This paper focuses on applying and evaluating an additional hypothesis verification step for the detections of the learnt-hierarchy-of-parts (LHOP) method. The applied method reduces the problem of false positives, which are a common problem of hierarchical methods, specifically in highly textured or cluttered images. We use a Histogram of Compositions (HoC) with a Support Vector Machine in the hypothesis verification step. Using the HoC descriptor ensures that the additional computational cost is minimal, since the HoC descriptor shares the LHOP tree structure. We evaluate the method on the ETHZ Shape Classes dataset and show that our method outperforms the original baseline LHOP method by around 5 percent.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Implementacija CONDENSATION Algoritma v domeni zaprtega sveta</title>
      <link>/publications/kristan2004implementacija/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2004implementacija/</guid>
      <description>&lt;p&gt;People tracking is in general a challenging task, and over the last two decades various computer vision algorithms dealing with this problem have been proposed. Given the highly unpredictable nature of human motion, stochastic approaches such as CONDENSATION, introduced by M. Isard and A. Blake in 1998, have gained considerable popularity among researchers in this field. In this paper we present an implementation of the CONDENSATION algorithm for tracking people in sports. Since sport games usually take place in semi-controlled environments, the closed-world assumption, introduced by S. S. Intille and A. Bobick in 1995, has been adopted. We present the architecture of such a CONDENSATION-based tracking algorithm within a closed-world domain and show some results.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Improving Traffic Sign Detection with Temporal Information</title>
      <link>/publications/tabernik2018improving/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2018improving/</guid>
      <description>&lt;p&gt;Traffic sign detection is a frequently addressed research and application problem, and many solutions to it have been proposed. The vast majority of proposed approaches perform traffic sign detection on individual images, although video recordings are often available. In this paper, we propose a method that also exploits the temporal information in image sequences. We propose a three-stage traffic sign detection approach. Traffic signs are first detected on individual images. In the second stage, visual tracking is used to track these initial detections and generate multiple detection hypotheses. These hypotheses are finally integrated and refined detections are obtained. We evaluate the proposed approach by detecting 91 traffic sign categories in a video sequence of more than 18,000 frames. Results show that the traffic signs are better localized and detected with higher accuracy, which is very beneficial for applications such as maintenance of traffic sign records.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Improving vision-based obstacle detection on USV using inertial sensor</title>
      <link>/publications/bovcon2017improving/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/bovcon2017improving/</guid>
      <description>&lt;p&gt;We present a new semantic segmentation algorithm for obstacle detection in unmanned surface vehicles. The novelty lies in the graphical model that incorporates boat tilt measurements from the on-board inertial measurement unit (IMU). The IMU readings are used to estimate the location of the horizon line in the image and to automatically adjust the priors in the probabilistic semantic segmentation algorithm. We derive the necessary horizon projection equations, an efficient optimization algorithm for the proposed graphical model, and a practical IMU-camera-USV calibration. A new challenging dataset, the largest multi-sensor dataset of its kind, is constructed. Results show that the proposed algorithm significantly outperforms the state of the art, with a 32% improvement in water-edge detection accuracy, an over 15% reduction of the false positive rate, an over 70% reduction of the false negative rate, and an over 55% increase of the true positive rate, while running in real time on a single core in Matlab.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Increased complexity of low-level structures improves histograms of compositions</title>
      <link>/publications/tabernik2012increased/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2012increased/</guid>
      <description>&lt;p&gt;While low-level visual features, such as the histogram of oriented gradients (HOG), have been successfully used for object detection and categorization, we have been able to improve upon their performance by introducing the histogram of compositions (HoC) in our previous work. In this paper we propose an extended version of the HoC descriptor that uses additional layers from a hierarchical model. We experimentally show that the extended HoC surpasses the performance of the original descriptor by approximately 5%, as the additional layer provides higher complexity of compositions. Furthermore, we show that HoC with the additional layer produces results competitive with the original HoC descriptor combined with HOG, and that performance can be increased even further by adding HOG on top of HoC with the additional layer.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Incremental approach to robust learning of eigenspaces</title>
      <link>/publications/skocaj2002incremental/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2002incremental/</guid>
      <description>&lt;p&gt;The standard PCA approach to visual learning of representations is intrinsically non-robust and usually performed in a batch mode, which is inadmissible in a real-world on-line scenario. In this paper we propose a novel method for robust and incremental learning of eigenspaces. The method sequentially updates the representation, using the previously acquired knowledge to determine consistencies and discard inconsistencies in the input images. We present experimental results that demonstrate the advantages and disadvantages of the proposed approach.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Incremental LDA learning by combining reconstructive and discriminative approaches</title>
      <link>/publications/uray2007incremental/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/uray2007incremental/</guid>
      <description>&lt;p&gt;Incremental subspace methods have proven to enable efficient training when large amounts of training data have to be processed or when not all data is available in advance. In this paper we focus on incremental LDA learning, which provides good classification results while assuring a compact data representation. In contrast to existing incremental LDA methods, we additionally consider reconstructive information when incrementally building the LDA subspace. Hence, we obtain a more flexible representation that is capable of adapting to new data. Moreover, this allows adding new instances to existing classes as well as adding new classes. The experimental results show that the proposed approach outperforms other incremental LDA methods, even approaching the classification results obtained by batch learning.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Incremental learning with Gaussian mixture models</title>
      <link>/publications/kristan2008incremental/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2008incremental/</guid>
      <description>&lt;p&gt;In this paper we propose a new incremental estimation of Gaussian mixture models which can be used for applications of online learning. Our approach allows adding new samples incrementally as well as removing parts of the mixture through a process of unlearning. Low complexity of the mixtures is maintained through a novel compression algorithm. In contrast to existing approaches, our approach does not require fine-tuning of parameters for a specific application, does not assume specific forms of the target distributions, and imposes no temporal constraints on the observed data. The strength of the proposed approach is demonstrated with an example of online estimation of a complex distribution, an example of unlearning, and with interactive learning of basic visual concepts.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Incremental PCA for On-line Visual Learning and Recognition</title>
      <link>/publications/artac2002incremental/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/artac2002incremental/</guid>
      <description>&lt;p&gt;The methods for visual learning that compute a space of eigenvectors by Principal Component Analysis (PCA) traditionally require a batch computation step. Since this leads to potential problems when dealing with large sets of images, several incremental methods for the computation of the eigenvectors have been introduced. However, such learning cannot be considered an on-line process, since all the images are retained until the final step of computing the space of eigenvectors, when their coefficients in this subspace are computed. In this paper we propose a method that allows for simultaneous learning and recognition. We show that we can keep only the coefficients of the learned images, discard the actual images, and still be able to build a model of appearance that is fast to compute and open-ended. We performed extensive experimental testing, which showed that the recognition rate and reconstruction accuracy are comparable to those obtained by the batch method.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Integration of Computer Vision Components into a Multi-modal Cognitive System</title>
      <link>/publications/vrecko2009integration/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/vrecko2009integration/</guid>
      <description>&lt;p&gt;We present a general method for integrating visual components into a multi-modal cognitive system. The integration is very generic and can work with an arbitrary set of other modalities. We illustrate our integration approach with a specific instantiation of the architecture schema that focuses on integration of vision and language: a cognitive system able to collaborate with a human, learn and display some understanding of its surroundings. As examples of cross-modal interaction we describe mechanisms for clarification and visual learning.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Interactive learning and cross-modal binding - a combined approach</title>
      <link>/publications/jacobsson2007interactive/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/jacobsson2007interactive/</guid>
      <description></description>
    </item>
    <item>
      <title>Interaktiven sistem za kontinuirano učenje vizualnih konceptov</title>
      <link>/publications/skocaj2007interaktiven/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2007interaktiven/</guid>
      <description>&lt;p&gt;We present an artificial cognitive system for learning visual concepts. It comprises vision, communication and manipulation subsystems, which provide visual input, enable verbal and non-verbal communication with a tutor, and allow interaction with a given scene. The main goal is to learn associations between automatically extracted visual features and words that describe the scene in an open-ended, continuous manner. In particular, we address the problem of cross-modal learning of visual properties and spatial relations, and analyse several learning modes requiring different levels of tutor supervision.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Is my new tracker really better than yours?</title>
      <link>/publications/cehovin2014is/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cehovin2014is/</guid>
      <description>&lt;p&gt;The field of visual tracking evaluation features an abundance of performance measures, which are used by various authors, and largely suffers from a lack of consensus about which measures should be preferred. This hampers cross-paper tracker comparison and slows the advancement of the field. In this paper we provide a critical analysis of the popular measures and evaluate them experimentally in a large-scale tracking experiment. We also analyze various visualizations of the performance measures. We show that several measures are equivalent in terms of the information they provide for tracker comparison and, crucially, that some are more brittle than others. Based on our analysis we narrow down the spectrum of measures to only a few complementary ones, thus pushing towards homogenization of the tracker evaluation methodology.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Izvedba algoritma računalniškega vida na omrežni kameri</title>
      <link>/publications/sulic2008izvedba/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/sulic2008izvedba/</guid>
      <description></description>
    </item>
    <item>
      <title>Knowledge gap detection for interactive learning of categorical knowledge</title>
      <link>/publications/majnik2013knowledge/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/majnik2013knowledge/</guid>
      <description>&lt;p&gt;In interactive machine learning the process of labeling training instances and introducing them to the learner may be expensive in terms of human effort and time. In this paper we present different strategies for detecting gaps in the learner&amp;rsquo;s knowledge and communicating these gaps to the teacher. These strategies are considered from the viewpoint of extrospective and introspective behavior of the learner &amp;ndash; this new perspective is also the main contribution of our paper. The experimental results indicate that the analyzed strategies are successful in reducing the number of training instances required to reach the needed recognition rate. Such a facilitation may be an important step towards the broader use of interactive autonomous systems.&lt;/p&gt;</description>
    </item>
    <item>
      <title>LaRS: A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark</title>
      <link>/publications/zust2023lars/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/zust2023lars/</guid>
      <description>&lt;p&gt;The progress in maritime obstacle detection is hindered by the lack of a diverse dataset that adequately captures the complexity of general maritime environments. We present the first maritime panoptic obstacle detection benchmark LaRS, featuring scenes from Lakes, Rivers and Seas. Our major contribution is the new dataset, which boasts the largest diversity in recording locations, scene types, obstacle classes and acquisition conditions among related datasets. LaRS is composed of over 4000 per-pixel labeled key frames, each with nine preceding frames to allow utilization of temporal texture, amounting to over 40k frames. Each key frame is annotated with 8 thing classes, 3 stuff classes and 19 global scene attributes. We report the results of 27 semantic and panoptic segmentation methods, along with several performance insights and future research directions. To enable objective evaluation, we have implemented an online evaluation server. The LaRS dataset, evaluation toolkit and benchmark are publicly available at: &lt;a href=&#34;https://lojzezust.github.io/lars-dataset&#34;&gt;https://lojzezust.github.io/lars-dataset&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Learning Contextual Rules for Priming Object Categories in Images</title>
      <link>/publications/perko2009learning/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/perko2009learning/</guid>
      <description>&lt;p&gt;In this paper we introduce and exploit the concept of contextual rules in the field of object detection. These rules are defined as associations between different object likelihood maps and are learned from given examples. The contextual rules can be used to prime regions where a target object category occurs in an image given areas of other object categories. The principal idea is to locate several basic object categories in an image and then use this information to infer object likelihood maps for other object categories. The proposed framework itself is general and not limited to specific object categories. For demonstrating our approach, we use likely occurrences of pedestrians and windows in urban scenes, extracted by a technique employing visual context, and use them to prime for shop logos.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Learning hierarchical representations of object categories for robot vision.</title>
      <link>/publications/fidler2007learning/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/fidler2007learning/</guid>
      <description></description>
    </item>
    <item>
      <title>Learning Maritime Obstacle Detection from Weak Annotations by Scaffolding</title>
      <link>/publications/zust2022learning/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/zust2022learning/</guid>
      <description>&lt;p&gt;Coastal water autonomous boats rely on robust perception methods for obstacle detection and timely collision avoidance. The current state of the art is based on deep segmentation networks trained on large datasets. Per-pixel ground-truth labeling of such datasets, however, is labor-intensive and expensive. We observe that far less information is required for practical obstacle avoidance &amp;ndash; the location of the water edge on static obstacles like the shore, and the approximate location and bounds of dynamic obstacles in the water, is sufficient to plan a reaction.&#xA;We propose a new scaffolding learning regime (SLR) that allows training obstacle detection segmentation networks only from such weak annotations, thus significantly reducing the cost of ground-truth labeling. Experiments show that maritime obstacle segmentation networks trained using SLR substantially outperform the same networks trained with dense ground-truth labels. Accuracy is thus not sacrificed for labeling simplicity but is in fact improved, which is a remarkable result.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Learning statistically relevant edge structure improves low-level visual descriptors</title>
      <link>/publications/tabernik2012learning/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2012learning/</guid>
      <description>&lt;p&gt;Over the recent years, low-level visual descriptors, among which the most popular is the histogram of oriented gradients (HOG), have shown excellent performance in object detection and categorization. We form a hypothesis that the low-level image descriptors can be improved by learning the statistically relevant edge structures from natural images. We validate this hypothesis by introducing a new descriptor called the histogram of compositions (HoC). HoC exploits a learnt vocabulary of parts from a state-of-the-art hierarchical compositional model. Furthermore, we show that HoC is a complementary descriptor to HOG. We experimentally compare our descriptor to the popular HOG descriptor on the task of object categorization. We have observed approximately 4% improved categorization performance of HoC over HOG at lower dimensionality of the descriptor. Furthermore, in comparison to HOG, we show a categorization improvement of approximately 11% when combining HOG with the proposed HoC.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Learning visual context for object detection</title>
      <link>/publications/perko2007learning/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/perko2007learning/</guid>
      <description>&lt;p&gt;Context plays an important role in general scene perception, as it provides additional information about possible object locations in images. Object detectors used in computer vision typically do not exploit this information. In this paper we therefore present a concept for learning contextual information from example images of scenes. This information is used to compute a context field, which serves as prior information on possible object locations for detection. Object detection based on local appearance is then applied selectively, only to certain parts of the image. We evaluated the proposed method on the detection of pedestrians, cars and windows, using challenging datasets of urban scenes. The results show that contextual information complements local appearance-based information, reducing search complexity and increasing the robustness of object detection. A further advantage of the proposed method is that learning contextual configurations for different object categories is independent of task-specific models.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Lokalizacija in ocenjevanje lege predmeta v treh prostostnih stopnjah s središčnimi smernimi vektorji</title>
      <link>/publications/tabernik2023lokalizacija/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2023lokalizacija/</guid>
      <description>&lt;p&gt;In this paper, we propose an approach to localize and estimate the pose of objects in three degrees of freedom (3-DOF). Our method is based on point localization combined with regression of the orientation angle for each detected object. We extend an existing point localization method to estimate the orientation of all detected objects in an image. The orientation regression is parameterized with trigonometric functions, similarly to the direction to the object center. We evaluate our method on the proposed screw dataset, composed of a training set containing synthetic images with photorealistic appearance and a test set containing real images of screws. Compared to a state-of-the-art 6-DOF pose estimation method applied to the 3-DOF problem, our approach achieves comparable results at a significantly lower computational cost.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation</title>
      <link>/publications/kosmurov2026/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kosmurov2026/</guid>
      <description>&lt;p&gt;Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP&amp;rsquo;s region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available at &lt;a href=&#34;https://github.com/nickormushev/OVRCOAT&#34;&gt;https://github.com/nickormushev/OVRCOAT&lt;/a&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Mobile Robot Localization under Varying Illumination</title>
      <link>/publications/jogan2002mobile/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/jogan2002mobile/</guid>
      <description>&lt;p&gt;Methods for mobile robot localization that use eigenspaces of panoramic snapshots of the environment are in general sensitive to changes in the illumination of the environment. Therefore, we propose an approach which achieves a reliable localization under severe illumination conditions. The method uses gradient filtering of the eigenspace. After testing the approach on images obtained by a mobile robot, we show that it outperforms the standard eigenspace-based recognition method.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Mobile Robot Localization using an Incremental Eigenspace Model</title>
      <link>/publications/artac2002mobile/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/artac2002mobile/</guid>
      <description>&lt;p&gt;When using appearance-based recognition for self-localization of mobile robots, the images obtained during the exploration of the environment need to be efficiently stored in memory. PCA offers a means of representing the images in a low-dimensional subspace, which allows for efficient matching and recognition. For active exploration it is necessary to use an incremental method for the computation of the subspace. We propose to use an incremental PCA algorithm with updating of partial image representations in a way that allows the robot to discard the acquired images immediately after the update. Such a model is open-ended, meaning that we can easily update it with new images. We show that the performance of the proposed method is comparable to that of the batch method in terms of compression, computational cost and precision of localization. We also show that by applying repetitive learning, the subspace converges to the one constructed with the batch method.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Mobile Robots : New Research</title>
      <link>/publications/klancar2006mobile/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/klancar2006mobile/</guid>
      <description>&lt;p&gt;In this paper a global vision scheme for estimating the positions and orientations of mobile robots is presented. It is applied to robot soccer, a fast dynamic game that therefore requires an efficient and robust vision system. The vision system is also generally applicable to other robot applications, such as mobile transport robots in production and warehouses, attendant robots, fast visual tracking of targets of interest, and entertainment robotics. The basic operation of the vision system is divided into two steps. In the first, the incoming image is scanned and pixels are classified into a finite number of classes. At the same time, a segmentation algorithm is used to find the corresponding regions belonging to one of the classes. In the second step, all the regions are examined, and those that are part of the observed object are selected by means of simple logic procedures. The novelty lies in the optimization of the processing time needed to estimate possible object positions. Better results of the vision system are achieved by implementing camera calibration and a shading correction algorithm: the former corrects camera lens distortion, while the latter increases robustness to irregular illumination conditions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Multi-camera and radio fusion for person localization in a cluttered environment</title>
      <link>/publications/mandeljc2011multi-camera/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/mandeljc2011multi-camera/</guid>
      <description>&lt;p&gt;We investigate the problem of person localization in a cluttered environment. We evaluate the performance of an Ultra-Wideband radio localization system and a multi-camera system based on the Probabilistic Occupancy Map algorithm. After demonstrating the strengths and weaknesses of both systems, we improve the localization results by fusing both the radio and the visual information within the Probabilistic Occupancy Map framework. This is done by treating the radio modality as an additional independent sensory input that contributes to a given cell’s occupancy likelihood.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Multi-modal Obstacle Avoidance in USVs via Anomaly Detection and Cascaded Datasets</title>
      <link>/publications/cvenkel2023multi-modal/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cvenkel2023multi-modal/</guid>
      <description>&lt;p&gt;We introduce a novel strategy for obstacle avoidance in aquatic settings, using anomaly detection for quick deployment of autonomous water vehicles in limited geographic areas. The unmanned surface vehicle (USV) is initially manually navigated to collect training data. The learning phase involves three steps: learning imaging modality specifics, learning the obstacle-free environment using collected data, and setting obstacle detector sensitivity with images containing water obstacles. This approach, which we call cascaded datasets, works with different image modalities and environments without extensive marine-specific data. Results are demonstrated with LWIR and RGB images from river missions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Multi-modal tracking by identification</title>
      <link>/publications/mandeljc2012multi-modal/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/mandeljc2012multi-modal/</guid>
      <description>&lt;p&gt;In this paper, we demonstrate, through quantitative evaluation, the benefit of tracking by identification over state-of-the-art identification by tracking. We evaluate four localization and tracking systems: a commercial localization system based on radio technology, a state-of-the-art computer-vision algorithm that uses multiple calibrated cameras to perform identification by tracking, and two multi-modal tracking-by-identification systems that have been developed in our laboratory. We briefly describe all four systems and the evaluation metric, and present an evaluation on a challenging indoor dataset.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Multi-touch surface based on RGBD camera</title>
      <link>/publications/istenic2014multi-touch/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/istenic2014multi-touch/</guid>
      <description>&lt;p&gt;The popularity of interactive surfaces is increasing because of their natural and intuitive usage. Adding 3D multi-point interaction capabilities to an arbitrary surface creates numerous additional possibilities in fields ranging from marketing to medicine. Interactive tables are nowadays present in numerous museums, schools and companies. With the advent of low-cost RGBD cameras, three-dimensional surfaces are slowly emerging as well, attracting even more attention. This paper presents an affordable system for 3D human-computer interaction using an RGBD camera that is capable of detecting and tracking the user&amp;rsquo;s fingertips in 3D space. The system is evaluated in terms of accuracy, response time, CPU usage and user experience. The results of the evaluation show that such low-cost systems are already a viable alternative to other multi-touch technologies and also enable interesting new ways of interacting with surface-based interfaces.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Multiple interacting targets tracking with application to team sports</title>
      <link>/publications/kristan2005multiple/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2005multiple/</guid>
      <description>&lt;p&gt;The interest in the field of computer-aided analysis of sport events is ever growing, and the ability to track objects during a sport event has become an elementary task for nearly every sport analysis system. In this paper we present a color-based probabilistic tracker that is suitable for tracking players on the playground during a sport game. Since the players are tracked in their natural environment, and this environment is subject to certain rules of the game, we use the concept of closed worlds to model the scene context and thus improve the reliability of tracking.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Multivariate Online Kernel Density Estimation</title>
      <link>/publications/kristan2010multivariate/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2010multivariate/</guid>
      <description>&lt;p&gt;We propose an approach for online kernel density estimation (KDE) which enables building probability density functions from data by observing only a single data-point at a time. The method maintains a non-parametric model of the data itself and uses this model to calculate the corresponding KDE. We propose a new automatic bandwidth selection rule, which can be computed directly from the non-parametric model of the data. Low complexity of the model is maintained through a novel compression and refinement scheme. We compare the online KDE to some state-of-the-art batch KDEs on examples of estimating distributions and on an example of classification. The results show that the online KDE generally achieves comparable performance to the batch approaches, while producing models with lower complexity and allowing online updating using only a single observation at a time.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Napredne metode računalniškega vida za avtonomno navigacijo robotskega plovila</title>
      <link>/publications/dimitriev2014napredne/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/dimitriev2014napredne/</guid>
      <description>&lt;p&gt;The aim of our project is the development of computer vision algorithms for autonomous navigation of a sea vessel by means of image segmentation and stabilization, long-term tracking, inference of 3D structure from motion, and horizon detection.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Non-sequential Multi-view Detection, Localization and Identification of People Using Multi-modal Feature Maps</title>
      <link>/publications/mandeljc2012non-sequential/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/mandeljc2012non-sequential/</guid>
      <description></description>
    </item>
    <item>
      <title>O klasifikaciji slik v ne-enolično določljive razrede</title>
      <link>/publications/muhovic2020o/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/muhovic2020o/</guid>
      <description>&lt;p&gt;Image classification is one of the most basic and frequently addressed computer vision tasks. The usual formulation of this task requires classification of an image into one of several possible classes. The most common metric for measuring a classifier&amp;rsquo;s performance is classification accuracy, defined as the percentage of correctly classified images. However, such a formalisation of the classification problem relies on a strong assumption that for every image a category is uniquely identifiable and assigned by the domain expert. In this paper we address scenarios where this assumption does not hold. In particular, we present an analysis of the results obtained by a convolutional neural network and twelve participants who were tasked with classifying images of planks into eight classes, and discuss the label ambiguity problem.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Object Tracking by Reconstruction with View-Specific Discriminative Correlation Filters</title>
      <link>/publications/kart2019object/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kart2019object/</guid>
      <description>&lt;p&gt;Standard RGB-D trackers treat the target as an inherently 2D structure, which makes modelling appearance changes related even to simple out-of-plane rotation highly challenging. We address this limitation by proposing a novel long-term RGB-D tracker - Object Tracking by Reconstruction (OTR). The tracker performs online 3D target reconstruction to facilitate robust learning of a set of view-specific discriminative correlation filters (DCFs). The 3D reconstruction supports two performance-enhancing features: (i) generation of accurate spatial support for constrained DCF learning from its 2D projection and (ii) point cloud based estimation of 3D pose change for selection and storage of view-specific DCFs which are used to robustly localize the target after out-of-view rotation or heavy occlusion. Extensive evaluation of OTR on the challenging Princeton RGB-D tracking and STC Benchmarks shows it outperforms the state-of-the-art by a large margin.&lt;/p&gt;</description>
    </item>
    <item>
      <title>ObjectCore - Efficient Few-shot Logical Anomaly Detection using Object Representations</title>
      <link>/publications/fucka2026objectcore/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/fucka2026objectcore/</guid>
      <description>&lt;p&gt;Anomaly Detection is an important problem in industrial processes. Two new subfields have recently emerged: logical anomaly detection and few-shot anomaly detection. The combined task, few-shot logical anomaly detection, has proven exceptionally difficult and highly important for industrial processes. Few-shot methods use suboptimal representations to model composition information necessary for detecting logical anomalies, and previous full-shot methods require a large training set. To solve both problems, we propose ObjectCore, a few-shot logical anomaly detection model that captures the composition information from only a few images without any category-specific information. The composition information of an image is modelled as a collection of object representations. Logical anomalies are detected using bipartite matching between object representations in the test image and object representations in the most similar support image. ObjectCore significantly improves over state-of-the-art methods on two standard benchmarks for few-shot logical anomaly detection, MVTec LOCO and CAD-SD, attaining an image-level AUROC of 80.8% and 96.5%, respectively, in the 4-shot setting. &lt;a href=&#34;https://github.com/MaticFuc/ObjectCore&#34;&gt;Code&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Observing Human Motion Using Far-Infrared (FLIR) Camera -- Some Preliminary Studies</title>
      <link>/publications/pers2004observing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/pers2004observing/</guid>
      <description>&lt;p&gt;Far infrared imaging technology is becoming an interesting choice for many civilian uses. We explored the potential of using a far infrared camera for human motion analysis, especially from the viewpoint of possible automated image and video analysis. In this article, we present the main characteristics of far infrared imagery that should be of interest to computer vision researchers and seek to eliminate some common misunderstandings about far infrared imagery which may influence the choice of far infrared technology over other alternatives. We provide images that illustrate the problems and advantages of using far infrared imaging technology, especially for the purpose of observing humans.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Obstacle Detection for USVs by Joint Stereo-View Semantic Segmentation</title>
      <link>/publications/bovcon2018obstacle/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/bovcon2018obstacle/</guid>
      <description>&lt;p&gt;We propose a stereo-based obstacle detection approach for unmanned surface vehicles. Obstacle detection is cast as a scene semantic segmentation problem in which pixels are assigned a probability of belonging to water or non-water regions. We extend a single-view model to a stereo system by adding a constraint which prefers consistent class-label assignment to pixels in the left and right camera images corresponding to the same parts of a 3D scene. Our approach jointly fits a semantic model to both images, leading to an improved class-label posterior map from which obstacles and the water edge are extracted. In overall F-measure on the task of obstacle detection, our approach outperforms the current state-of-the-art monocular approach by 0.495, a monocular CNN by 0.798, and their stereo extensions by 0.059 and 0.515, respectively, while running in real-time on a single CPU.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Obtaining high dynamic scale radiance maps by varying illumination intensity</title>
      <link>/publications/skocaj2000obtaining/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2000obtaining/</guid>
      <description></description>
    </item>
    <item>
      <title>Od računalniškega vida k umetnemu spoznavnemu vidu</title>
      <link>/publications/skocaj2005od/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2005od/</guid>
      <description>&lt;p&gt;In this paper we briefly describe the emerging new scientific field of cognitive vision. We briefly present some main characteristics of the mature and recognised field of computer vision and show some motivation for its natural development into cognitive vision. We underline some characteristics of cognitive vision systems which make them different from classical machine and computer vision systems. We also present some typical applications of cognitive vision systems and cognitive systems in general and indicate several possibilities for their employment.&lt;/p&gt;</description>
    </item>
    <item>
      <title>On-line conservative learning for person detection</title>
      <link>/publications/roth2005on-line/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/roth2005on-line/</guid>
      <description>&lt;p&gt;We present a novel on-line conservative learning framework for an object detection system. All algorithms operate in an on-line mode; in particular, we also present a novel on-line AdaBoost method. The basic idea is to exploit a huge amount of unlabeled video data by being very conservative in selecting training examples: we start with a very simple object detection system and use reconstructive and discriminative classifiers in an iterative co-training fashion to arrive at increasingly better object detectors. We demonstrate the framework on a surveillance task where we learn person detectors that are tested on two surveillance video sequences. We start with a simple moving object classifier and proceed with incremental PCA (on shape and appearance) as a reconstructive classifier, which in turn generates a training set for a discriminative on-line AdaBoost classifier.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Online Discriminative Kernel Density Estimation</title>
      <link>/publications/kristan2010online/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2010online/</guid>
      <description>&lt;p&gt;We propose a new method for online estimation of probabilistic discriminative models. The method is based on the recently proposed online kernel density estimation (oKDE) framework, which produces Gaussian mixture models and allows adaptation using only a single data point at a time. The oKDE builds reconstructive models from the data, and we extend it to take into account the interclass discrimination through a new distance function between the classifiers. We arrive at an online discriminative kernel density estimator (odKDE). We compare the odKDE to the oKDE, batch state-of-the-art KDEs and a support vector machine (SVM) on a standard database. The odKDE achieves classification performance comparable to that of the best batch KDEs and the SVM, while allowing online adaptation, and produces models of lower complexity than the oKDE.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Open-source robotic manipulator and sensory platform</title>
      <link>/publications/cehovin2017open-source/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cehovin2017open-source/</guid>
      <description>&lt;p&gt;We present an open-source robotic platform for educational use that integrates multiple levels of interaction through the use of an additional vision sensor. The environment can be used in virtual, augmented-reality and real-robot modes, enabling a smooth transition from a virtual robot manipulator to a real one. We describe the main aspects of our platform that ensure low production costs and encourage openness of both its hardware and software. The main goal of our work was to create a viable low-cost robotic manipulator platform alternative for university-level courses in intelligent robotics; however, the application domain is very broad.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Optimization framework for learning a hierarchical shape vocabulary for object class detection.</title>
      <link>/publications/fidler2009optimization/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/fidler2009optimization/</guid>
      <description></description>
    </item>
    <item>
      <title>Panoramic Eigenimages for Spatial Localisation</title>
      <link>/publications/jogan1999panoramic/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/jogan1999panoramic/</guid>
      <description>&lt;p&gt;Recent biological evidence suggests that position and orientation can be estimated from an adequately compressed set of environment snapshots and their relationships. In this paper we present a pure appearance-based localisation method using an eigenspace representation of panoramic images. We first review several types of rotation-invariant representations of panoramic images in terms of their efficiency for an eigenspace-based localisation problem. Then, for each set of images, an eigenspace from 25 location snapshots is built and analyzed. We evaluated simple localisation of images not included in the training set. The results show good prospects for the panoramic eigenspace approach.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Panoramic Volumes for Robot Localization</title>
      <link>/publications/artac2005panoramic/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/artac2005panoramic/</guid>
      <description>&lt;p&gt;We propose a method for visual robot localization using a panoramic image volume as the representation from which we can generate views from virtual viewpoints and match them to the current view. We use a geometric image-based rendering formalism in combination with a subspace representation of images, which allows us to synthesize views at arbitrary virtual viewpoints from a compact low-dimensional representation.&lt;/p&gt;</description>
    </item>
    <item>
      <title>PanSR: An Object-Centric Mask Transformer for Panoptic Segmentation</title>
      <link>/publications/zust2026_tits/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/zust2026_tits/</guid>
      <description>&lt;p&gt;Panoptic segmentation is a fundamental task in computer vision and a crucial component for perception in autonomous vehicles. Recent mask-transformer-based methods achieve impressive performance on standard benchmarks but face significant challenges with small objects, crowded scenes and scenes exhibiting a wide range of object scales. We identify several fundamental shortcomings of the current approaches: (i) the query proposal generation process is biased towards larger objects, resulting in missed smaller objects, (ii) initially well-localized queries may drift to other objects, resulting in missed detections, (iii) spatially well-separated instances may be merged into a single mask causing inconsistent and false scene interpretations. To address these issues, we rethink the individual components of the network and its supervision, and propose a novel method for panoptic segmentation PanSR. PanSR effectively mitigates instance merging, enhances small-object detection and increases performance in crowded scenes, delivering a notable +3.4 PQ improvement over state-of-the-art on the challenging LaRS benchmark, while reaching state-of-the-art performance on Cityscapes. &lt;a href=&#34;https://github.com/lojzezust/PanSR&#34;&gt;URL&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Parametric Eigenspace Representations of Panoramic Images</title>
      <link>/publications/jogan2001parametric/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/jogan2001parametric/</guid>
      <description>&lt;p&gt;This paper describes a novel approach for robot localization using a view-based representation with panoramic images. We propose to use a representation based on a complex basis of eigenvectors. We demonstrate that this results in a speed up of building the eigenspace and in a fast and accurate localization.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Part-Based Room Categorization for Household Service Robots</title>
      <link>/publications/ursic2016part-based/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/ursic2016part-based/</guid>
      <description>&lt;p&gt;A service robot that operates in a previously-unseen home environment should be able to recognize the functionality of the rooms it visits, such as a living room, a bathroom, etc. We present a novel part-based model and an approach for room categorization using data obtained from a visual sensor. Images are represented with sets of unordered parts that are obtained by object-agnostic region proposals, and encoded using a state-of-the-art image descriptor extractor, a convolutional neural network (CNN). An approach is proposed that learns category-specific discriminative parts for the part-based model. The proposed approach was compared to a state-of-the-art CNN trained specifically for place recognition. Experimental results show that the proposed approach outperforms the holistic CNN by being robust to image degradation, such as occlusions, modifications of image scaling, and aspect changes. In addition, we report non-negligible annotation errors and image duplicates in a popular dataset for place categorization and discuss annotation ambiguities.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Physics-Based Modelling of Human Motion using Kalman Filter and Collision Avoidance Algorithm</title>
      <link>/publications/perse2005physics-based/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/perse2005physics-based/</guid>
      <description>&lt;p&gt;The paper deals with the problem of computer-vision-based multi-person motion tracking, which in many cases suffers from a lack of discriminating features of the observed persons. To solve this problem, a physics-based model of human motion is proposed, which includes inertial forces of the persons by means of the Kalman filter, and cylindrical envelopes, which produce collision-avoiding forces when observed persons come into close proximity. We tested the proposed method on two sequences, one from a squash match and the other from a basketball game, and found that the number of tracker mistakes significantly decreased.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Pregled programskih orodij za globoko učenje z vidika uporabe v industrijskih aplikacijah</title>
      <link>/publications/tabernik2017pregled/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2017pregled/</guid>
      <description>&lt;p&gt;Deep learning has brought revolutionary changes to the field of computer vision and is also making its way into industrial machine vision. In this article we present six of the best-known tools for working with deep architectures: Caffe, Torch, Theano, MatConvNet, TensorFlow and Keras. We present their main characteristics both from the development perspective and from the perspective of integration into industrial applications.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Probabilistic Combination of Visual Context Based Attention and Object Detection</title>
      <link>/publications/perko2008probabilistic/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/perko2008probabilistic/</guid>
      <description>&lt;p&gt;Visual context provides cues about an object&amp;rsquo;s presence, position and size within the observed scene, which are used to increase the performance of object detection techniques. However, state-of-the-art methods for context-aware object detection can decrease the initial performance. We discuss the reasons for failure and propose a concept that overcomes these limitations. To this end, we introduce the prior probability function of an object detector, which maps the detector&amp;rsquo;s output to probabilities. Together with an appropriate contextual weighting, a probabilistic framework is established. In addition, we present an extension to state-of-the-art methods to learn scale-dependent visual context information and show how this increases the initial performance. The standard methods and our proposed extensions are compared on a novel, demanding image data set.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Probabilistic tracking using optical flow to resolve color ambiguities</title>
      <link>/publications/kristan2007probabilistic/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2007probabilistic/</guid>
      <description>&lt;p&gt;Color-based tracking is prone to failure in situations where visually similar targets are moving in close proximity to each other. To deal with the ambiguities in color information, we propose an additional color-independent feature based on the target&amp;rsquo;s local motion, which is calculated from the optical flow induced by the target in consecutive images. By modifying a color-based particle filter to account for the target&amp;rsquo;s local motion, a hybrid color/local-motion-based tracker is constructed. The hybrid tracker was compared to a purely color-based tracker on a challenging data set that involved near-collisions and complete occlusions between visually similar persons. The optical flow was estimated using a robust and a nonrobust method. The experiments show that even if a nonrobust method is used to estimate the optical flow, the local-motion feature largely resolves ambiguities caused by the visual similarity between persons.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Prototipi značilk za adaptivno zaznavanje ovir na vodni površini</title>
      <link>/publications/zust2022prototipi/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/zust2022prototipi/</guid>
      <description>&lt;p&gt;Unmanned surface vehicles (USV) rely on robust perception methods for obstacle detection. Current segmentation-based state-of-the-art methods lack the desired robustness and generalization capabilities required to adapt to new situations. To address this, we design WaSR-AD, a network with an explicit adaptation capability based on class prototypes. Initial prototypes are extracted during training and adapted during inference in an online fashion. The adapted prototypes are used to enrich the image features with additional adaptive context. Evaluation on the MODS benchmark reveals that such explicit adaptation of the prototypes significantly improves the detection performance, achieving a 14% lower water segmentation error and a 3.6% F1-score increase inside the critical 15 m danger-zone area around the boat, with a negligible cost in inference time.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Quality of region proposals in traffic sign detection and recognition</title>
      <link>/publications/tabernik2015quality/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2015quality/</guid>
      <description></description>
    </item>
    <item>
      <title>Range image acquisition of objects with non-uniform albedo using structured light range sensor</title>
      <link>/publications/skocaj2000range/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2000range/</guid>
      <description></description>
    </item>
    <item>
      <title>Razlike v opravljeni poti in povprečni hitrosti gibanja med različnimi tipi košarkarjev</title>
      <link>/publications/erculj2007razlike/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/erculj2007razlike/</guid>
      <description>&lt;p&gt;In this article we address the problem of basketball players&amp;rsquo; workload during matches. The main goal of the study was to determine the intensity and extent of player movement using the SAGIT measurement system, a relatively new technology based on computer vision methods that enables automatic data acquisition from video recordings of matches. Using this system, we measured the distance covered and the average movement speed of players in three playoff matches of the Slovenian national championship between Union Olimpija and Geoplin Slovan in the 2004/05 season. These parameters were determined for a total of 22 players who played at least 200 seconds in a given half. Since basketball distinguishes several player types with different roles in the game, we computed the distance covered and the average movement speed for the three basic player types (guards, forwards and centers) and used one-way analysis of variance to determine the differences between them. We found that in the active part of the game (while the game clock is running) players cover on average 2227 meters per half (20 minutes), with an additional 920 meters in the passive part. The average movement speed in the active part of the game is 1.84 m/s. Regarding individual player types, in the active phase of the game guards cover the longest distance on average (2300 m), followed by forwards (2246 m) and centers (2118 m). The differences between player types are statistically significant at the 1% level. The same holds for average movement speed: guards move at an average speed of 1.92 m/s, forwards at 1.87 m/s, and centers at 1.74 m/s.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Recognition of Multi-Agent Activities with Petri Nets</title>
      <link>/publications/perse2008recognition/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/perse2008recognition/</guid>
      <description></description>
    </item>
    <item>
      <title>Relevance Determination for Learning Vector Quantization using the Fisher Criterion Score</title>
      <link>/publications/ridge2012relevance/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/ridge2012relevance/</guid>
      <description>&lt;p&gt;Two new feature relevance determination algorithms are proposed for learning vector quantization. The algorithms exploit the positioning of the prototype vectors in the input feature space to estimate Fisher criterion scores for the input dimensions during training. These scores are used to form online estimates of weighting factors for an adaptive metric that accounts for dimensional relevance with respect to classifier output. The methods offer theoretical advantages over previously proposed LVQ relevance determination techniques based on gradient descent, as well as performance advantages as demonstrated in experiments on various datasets including a visual dataset from a cognitive robotics object affordance learning experiment.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Robust continuous subspace learning and recognition</title>
      <link>/publications/skocaj2002robust/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2002robust/</guid>
      <description></description>
    </item>
    <item>
      <title>Robust estimation of canonical correlation coefficients</title>
      <link>/publications/skocaj2004robust/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2004robust/</guid>
      <description>&lt;p&gt;Canonical Correlation Analysis is well suited for regression tasks in the appearance-based approach to modelling of objects and scenes. However, since it relies on the standard projection, it is inherently non-robust. In this paper we propose to embed the estimation of CCA coefficients in an augmented PCA space, which enables detection of outliers and preserves regression-relevant information, enabling robust estimation of canonical correlation coefficients.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Robust Localization using Eigenspace of Spinning-Images</title>
      <link>/publications/jogan2000robust-localization/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/jogan2000robust-localization/</guid>
      <description>&lt;p&gt;Under in-plane rotations of a panoramic camera, the information content of a panoramic image is, in general, preserved. However, the different representations that can be derived have important implications on further processing, e.g. for appearance-based localisation. We discuss several approaches based on different representations that have been proposed and evaluate them from different points of view; in particular, we argue that most of them are not suitable for robust localization under partially occluded views. In this paper we propose a representation, the eigenspace of spinning-images, which enables a straightforward application of the robust estimation of eigenimage coefficients that is directly related to the localization.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Robust Localization using Panoramic View-Based Recognition</title>
      <link>/publications/jogan2000robust/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/jogan2000robust/</guid>
      <description>&lt;p&gt;The results of earlier studies on the possibility of spatial localization from panoramic images have shown good prospects for view-based methods. The major advantages of these methods are a wide field-of-view, capability of modeling cluttered environments, and flexibility in the learning phase. The redundant information captured in similar views is efficiently handled by the eigenspace approach. However, the standard approaches are sensitive to noise and occlusion. We present a method of view-based localization in a robust framework that solves these problems to a large degree. Experimental results on a large set of real panoramic images demonstrate the effectiveness of the approach and the level of achieved robustness.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Robust Recognition and Pose Determination of 3-D Objects Using Range Images in Eigenspace Approach</title>
      <link>/publications/skocaj2001robust/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2001robust/</guid>
      <description>&lt;p&gt;In this paper we propose a robust method for recognition and pose determination of 3-D objects using range images in the eigenspace approach. Instead of computing the coefficients by a projection of the data onto the eigenimages, we determine the coefficients by solving a set of linear equations in a robust manner. The method efficiently overcomes the problem of missing pixels, noise and occlusions in range images. The results show that the proposed method outperforms the standard one in recognition and pose determination.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Robust visual tracking using template anchors</title>
      <link>/publications/cehovin2016robust/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cehovin2016robust/</guid>
      <description>&lt;p&gt;Deformable part models exhibit excellent performance in tracking non-rigidly deforming targets, but are usually outperformed by holistic models when the target does not deform or in the presence of uncertain visual data. The reason is that part-based models require estimation of a larger number of parameters compared to holistic models and since the updating process is self-supervised, the errors in parameter estimation are amplified with time, leading to a faster accuracy reduction than in holistic models. On the other hand, the robustness of part-based trackers is generally greater than in holistic trackers. We address the problem of self-supervised estimation of a large number of parameters by introducing controlled graduation in estimation of the free parameters. We propose decomposing the visual model into several sub-models, each describing the target at a different level of detail. The sub-models interact during target localization and, depending on the visual uncertainty, serve for cross-sub-model supervised updating. A new tracker is proposed based on this model which exhibits the qualities of part-based as well as holistic models. The tracker is tested on the highly-challenging VOT2013 and VOT2014 benchmarks, outperforming the state-of-the-art.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Room Classification using a Hierarchical Representation of Space</title>
      <link>/publications/ursic2012room/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/ursic2012room/</guid>
      <description></description>
    </item>
    <item>
      <title>SALAD -- Semantics-Aware Logical Anomaly Detection</title>
      <link>/publications/fucka2025salad/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/fucka2025salad/</guid>
      <description>&lt;p&gt;Recent surface anomaly detection methods excel at identifying structural anomalies, such as dents and scratches, but struggle with logical anomalies, such as irregular or missing object components. The best-performing logical anomaly detection approaches rely on aggregated pretrained features or handcrafted descriptors (most often derived from composition maps), which discard spatial and semantic information, leading to suboptimal performance. We propose SALAD, a semantics-aware discriminative logical anomaly detection method that incorporates a newly proposed composition branch to explicitly model the distribution of object composition maps, consequently learning important semantic relationships. Additionally, we introduce a novel procedure for extracting composition maps that requires no hand-made labels or category-specific information, in contrast to previous methods. By effectively modelling the composition map distribution, SALAD significantly improves upon state-of-the-art methods on the standard benchmark for logical anomaly detection, MVTec LOCO, achieving an impressive image-level AUROC of 96.1%. &lt;a href=&#34;https://arxiv.org/abs/2509.02101&#34;&gt;URL&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Self-Supervised Cross-Modal Online Learning of Basic Object Affordances for Developmental Robotic Systems</title>
      <link>/publications/ridge2010self-supervised/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/ridge2010self-supervised/</guid>
      <description>&lt;p&gt;For a developmental robotic system to function successfully in the real world, it is important that it be able to form its own internal representations of affordance classes based on observable regularities in sensory data. Usually, successful classifiers are built using labeled training data, but it is not always realistic to assume that labels are available in a developmental robotics setting. There does, however, exist an advantage in this setting that can help circumvent the absence of labels: co-occurrence of correlated data across separate sensory modalities over time. The main contribution of this paper is an online classifier training algorithm based on Kohonen's learning vector quantization (LVQ) that, by taking advantage of this co-occurrence information, does not require labels during training, either dynamically generated or otherwise. We evaluate the algorithm in experiments involving a robotic arm that interacts with various household objects on a table surface where camera systems extract features for two separate visual modalities. It is shown to improve its ability to classify the affordances of novel objects over time, coming close to the performance of equivalent fully-supervised algorithms.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Similarity-based cross-layered hierarchical representation for object categorization.</title>
      <link>/publications/fidler2008similarity-based/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/fidler2008similarity-based/</guid>
      <description></description>
    </item>
    <item>
      <title>Sledenje objektov s kvadrokopterjem z gibljivo kamero</title>
      <link>/publications/muhovic2017sledenje/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/muhovic2017sledenje/</guid>
      <description></description>
    </item>
    <item>
      <title>Spatially-Adaptive Filter Units for Deep Neural Networks</title>
      <link>/publications/tabernik2018spatially-adaptive/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2018spatially-adaptive/</guid>
      <description>&lt;p&gt;Classical deep convolutional networks increase receptive field size by either gradual resolution reduction or application of hand-crafted dilated convolutions to prevent an increase in the number of parameters. In this paper we propose a novel displaced aggregation unit (DAU) that does not require hand-crafting. In contrast to classical filters with units (pixels) placed on a fixed regular grid, the displacements of the DAUs are learned, which enables filters to spatially adapt their receptive field to a given problem. We extensively demonstrate the strength of DAUs on classification and semantic segmentation tasks. Compared to ConvNets with regular filters, ConvNets with DAUs achieve comparable performance at faster convergence and with up to a three-fold reduction in parameters. Furthermore, DAUs allow us to study deep networks from novel perspectives. We study the spatial distributions of DAU filters and analyze the number of parameters allocated for spatial coverage in a filter.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Superpixel Segmentation for Robust Visual Tracking</title>
      <link>/publications/cehovin2013superpixel/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cehovin2013superpixel/</guid>
      <description></description>
    </item>
    <item>
      <title>Tailgating Detection Using Histograms of Optical Flow</title>
      <link>/publications/pers2007tailgating/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/pers2007tailgating/</guid>
      <description></description>
    </item>
    <item>
      <title>Teaching Intelligent Robotics with a Low-Cost Mobile Robot Platform</title>
      <link>/publications/cehovin2015teaching/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cehovin2015teaching/</guid>
      <description>&lt;p&gt;In this short paper we present the requirements and implementation of a mobile robot platform to be used for teaching intelligent robotics classes. We report our experience of using the platform in university courses and various extracurricular activities.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Teaching with open-source robotic manipulator</title>
      <link>/publications/cehovin2018teaching/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cehovin2018teaching/</guid>
      <description>&lt;p&gt;In this paper we present and evaluate the use of an open-source robotic manipulator platform that we have developed, in the context of various educational scenarios that we have conducted. The system was tested in multiple diverse learning scenarios, ranging from a summer school for primary-school students to a university-level course. We show that introducing the system into the educational process improves both the motivation and the acquired knowledge of the participants.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Temporal Context for Robust Maritime Obstacle Detection</title>
      <link>/publications/zust2022temporal/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/zust2022temporal/</guid>
      <description>&lt;p&gt;Robust maritime obstacle detection is essential for fully autonomous unmanned surface vehicles (USVs). The currently widely adopted segmentation-based obstacle detection methods are prone to misclassification of object reflections and sun glitter as obstacles, producing many false positive detections, effectively rendering the methods impractical for USV navigation. However, water-turbulence-induced temporal appearance changes on object reflections are very distinctive from the appearance dynamics of true objects. We harness this property to design WaSR-T, a novel maritime obstacle detection network, that extracts the temporal context from a sequence of recent frames to reduce ambiguity. By learning the local temporal characteristics of object reflection on the water surface, WaSR-T substantially improves obstacle detection accuracy in the presence of reflections and glitter. Compared with existing single-frame methods, WaSR-T reduces the number of false-positive detections by 41% overall and by over 53% within the danger zone of the boat, while preserving a high recall, and achieving new state-of-the-art performance on the challenging MODS maritime obstacle detection benchmark.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Temporal Segmentation of Group Motion using Gaussian Mixture Models</title>
      <link>/publications/perse2008temporal/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/perse2008temporal/</guid>
      <description>&lt;p&gt;This paper presents a new trajectory-based approach for probabilistic temporal segmentation of team sports. The probabilistic game model is applied to the player-trajectory data in order to segment individual game instants into one of the three game phases (offensive game, defensive game and time-outs) and a nonlinear or Gaussian smoothing kernel is used to enforce the temporal continuity of the game. The presented approach is compared to the Support Vector Machine (SVM) classifier on three basketball and three handball matches. The obtained results suggest that our approach is general and robust and as such could be applied to various team sports. It can handle unusual game situations such as player exclusions, substitution or injuries which may happen during the game.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Testing computer vision algorithms over World Wide Web</title>
      <link>/publications/skocaj1997testing/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj1997testing/</guid>
      <description>&lt;p&gt;In this paper we explore different possibilities of using the Internet for making algorithms publicly available. We describe how to build an interactive client/server application which uses the World Wide Web for communication. The client program is a Java applet. The server program runs on the server as a CGI program, started by the HTTP server on demand from the client. The data transferred between the client and the server program also passes through the HTTP server, as the HTTP protocol is used for data transfer. A stand-alone program for image segmentation was transformed into a Java-client/CGI-server application, which can now be used as a service on the World Wide Web.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Eighth Visual Object Tracking VOT2020 Challenge Results</title>
      <link>/publications/kristan2020the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2020the/</guid>
      <description>&lt;p&gt;The Visual Object Tracking challenge VOT2020 is the eighth annual tracker benchmarking activity organized by the VOT initiative. Results of 58 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2020 challenge was composed of five sub-challenges focusing on different tracking domains: (i) VOT-ST2020 challenge focused on short-term tracking in RGB, (ii) VOT-RT2020 challenge focused on &amp;quot;real-time&amp;quot; short-term tracking in RGB, (iii) VOT-LT2020 focused on long-term tracking, namely coping with target disappearance and reappearance, (iv) VOT-RGBT2020 challenge focused on short-term tracking in RGB and thermal imagery and (v) VOT-RGBD2020 challenge focused on long-term tracking in RGB and depth imagery. Only the VOT-ST2020 datasets were refreshed. A significant novelty is the introduction of a new VOT short-term tracking evaluation methodology, and the introduction of segmentation ground truth in the VOT-ST2020 challenge &amp;ndash; bounding boxes will no longer be used in the VOT-ST challenges. A new VOT Python toolkit that implements all these novelties was introduced. The performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The MaSTr1325 dataset for training deep USV obstacle detection models</title>
      <link>/publications/bovcon2019the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/bovcon2019the/</guid>
      <description>&lt;p&gt;The progress of obstacle detection via semantic segmentation on unmanned surface vehicles (USVs) has been significantly lagging behind the developments in the related field of autonomous cars. The reason is the lack of large curated training datasets from the USV domain required for the development of data-hungry deep CNNs. This paper addresses this issue by presenting MaSTr1325, a marine semantic segmentation training dataset tailored for the development of obstacle detection methods in small-sized coastal USVs. The dataset contains 1325 diverse images captured over a two-year span with a real USV, covering a range of realistic conditions encountered in a coastal surveillance task. The images are per-pixel semantically labeled. The dataset exceeds previous attempts in this domain in size, scene complexity and domain realism. In addition, a dataset augmentation protocol is proposed to address slight appearance differences between the images in the training set and those in deployment. The accompanying experimental evaluation provides a detailed analysis of popular deep architectures, annotation accuracy and the influence of the training set size. MaSTr1325 will be released to the research community to facilitate progress in obstacle detection for USVs.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Ninth Visual Object Tracking VOT2021 Challenge Results</title>
      <link>/publications/kristan2021the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2021the/</guid>
      <description>&lt;p&gt;The Visual Object Tracking challenge VOT2021 is the ninth annual tracker benchmarking activity organized by the VOT initiative. Results of 71 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2021 challenge was composed of four sub-challenges focusing on different tracking domains: (i) VOT-ST2021 challenge focused on short-term tracking in RGB, (ii) VOT-RT2021 challenge focused on &amp;quot;real-time&amp;quot; short-term tracking in RGB, (iii) VOT-LT2021 focused on long-term tracking, namely coping with target disappearance and reappearance and (iv) VOT-RGBD2021 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2021 dataset was refreshed, while VOT-RGBD2021 introduces a training dataset and a sequestered dataset for winner identification. The source code for most of the trackers, the datasets, the evaluation kit and the results are publicly available at the challenge website.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Second Visual Object Tracking Segmentation VOTS2024 Challenge Results</title>
      <link>/publications/kristan2024the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2024the/</guid>
      <description>&lt;p&gt;The Visual Object Tracking Segmentation VOTS2024 challenge is the twelfth annual tracker benchmarking activity of the VOT initiative. This challenge consolidates the new tracking setup proposed in VOTS2023, which merges short-term and long-term as well as single-target and multiple-target tracking, with segmentation masks as the only target location specification. Two sub-challenges were considered: the VOTS2024 standard challenge, focusing on classical objects, and the VOTSt2024 challenge, which considers objects undergoing a topological transformation. Both challenges use the same performance evaluation methodology. Results of 28 submissions are presented and analyzed. A leaderboard with participating tracker details, the source code, the datasets, and the evaluation kit are publicly available on the website &lt;a href=&#34;https://www.votchallenge.net/vots2024/&#34;&gt;https://www.votchallenge.net/vots2024/&lt;/a&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Seventh Visual Object Tracking VOT2019 Challenge Results</title>
      <link>/publications/kristan2019the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2019the/</guid>
      <description>&lt;p&gt;The Visual Object Tracking challenge VOT2019 is the seventh annual tracker benchmarking activity organized by the VOT initiative. Results of 81 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis as well as the standard VOT methodology for long-term tracking analysis. The VOT2019 challenge was composed of five challenges focusing on different tracking domains: (i) VOT-ST2019 challenge focused on short-term tracking in RGB, (ii) VOT-RT2019 challenge focused on &amp;quot;real-time&amp;quot; short-term tracking in RGB, (iii) VOT-LT2019 focused on long-term tracking, namely coping with target disappearance and reappearance. Two new challenges have been introduced: (iv) VOT-RGBT2019 challenge focused on short-term tracking in RGB and thermal imagery and (v) VOT-RGBD2019 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2019, VOT-RT2019 and VOT-LT2019 datasets were refreshed, while new datasets were introduced for VOT-RGBT2019 and VOT-RGBD2019. The VOT toolkit has been updated to support standard short-term and long-term tracking as well as tracking with multi-channel imagery. The performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The sixth Visual Object Tracking VOT2018 challenge results</title>
      <link>/publications/kristan2018the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2018the/</guid>
      <description>&lt;p&gt;The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative. Results of over eighty trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis and a &amp;quot;real-time&amp;quot; experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. A long-term tracking sub-challenge has been introduced to the set of standard VOT sub-challenges. The new sub-challenge focuses on long-term tracking properties, namely coping with target disappearance and reappearance. A new dataset has been compiled and a performance evaluation methodology that focuses on long-term tracking capabilities has been adopted. The VOT toolkit has been updated to support both the standard short-term and the new long-term tracking sub-challenges. The performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Tenth Visual Object Tracking VOT2022 Challenge Results</title>
      <link>/publications/kristan2022the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2022the/</guid>
      <description>&lt;p&gt;The Visual Object Tracking challenge VOT2022 is the tenth annual tracker benchmarking activity organized by the VOT initiative. Results of 93 entries are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2022 challenge was composed of seven sub-challenges focusing on different tracking domains: (i) VOT-STs2022 challenge focused on short-term tracking in RGB by segmentation, (ii) VOT-STb2022 challenge focused on short-term tracking in RGB by bounding boxes, (iii) VOT-RTs2022 challenge focused on &amp;quot;real-time&amp;quot; short-term tracking in RGB by segmentation, (iv) VOT-RTb2022 challenge focused on &amp;quot;real-time&amp;quot; short-term tracking in RGB by bounding boxes, (v) VOT-LT2022 focused on long-term tracking, namely coping with target disappearance and reappearance, (vi) VOT-RGBD2022 challenge focused on short-term tracking in RGB and depth imagery, and (vii) VOT-D2022 challenge focused on short-term tracking in depth-only imagery. New datasets were introduced in VOT-LT2022 and VOT-RGBD2022, the VOT-ST2022 dataset was refreshed, and a training dataset was introduced for VOT-LT2022. The source code for most of the trackers, the datasets, the evaluation kit and the results are publicly available at the challenge website.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Thermal Infrared Visual Object Tracking VOT-TIR2015 Challenge Results</title>
      <link>/publications/felsberg2015the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/felsberg2015the/</guid>
      <description></description>
    </item>
    <item>
      <title>The Visual Object Tracking VOT2013 challenge results</title>
      <link>/publications/kristan2013the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2013the/</guid>
      <description>&lt;p&gt;Visual tracking has attracted significant attention in the last few decades. The recent surge in the number of publications on tracking-related problems has made it almost impossible to follow the developments in the field. One of the reasons is that there is a lack of commonly accepted annotated datasets and standardized evaluation protocols that would allow objective comparison of different tracking methods. To address this issue, the Visual Object Tracking (VOT) workshop was organized in conjunction with ICCV2013. Researchers from academia as well as industry were invited to participate in the first VOT2013 challenge, which aimed at single-object visual trackers that do not apply pre-learned models of object appearance (model-free). Presented here is the VOT2013 benchmark dataset for evaluation of single-object visual trackers as well as the results obtained by the trackers competing in the challenge. In contrast to related attempts in tracker benchmarking, the dataset is labeled per-frame by visual attributes that indicate occlusion, illumination change, motion change, size change and camera motion, offering a more systematic comparison of the trackers. Furthermore, we have designed an automated system for performing and evaluating the experiments. We present the evaluation protocol of the VOT2013 challenge and the results of a comparison of 27 trackers on the benchmark dataset. The dataset, the evaluation tools and the tracker rankings are publicly available from the challenge website (http://votchallenge.net).&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Visual Object Tracking VOT2014 challenge results</title>
      <link>/publications/kristan2014the-visual/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2014the-visual/</guid>
      <description></description>
    </item>
    <item>
      <title>The Visual Object Tracking VOT2015 challenge results</title>
      <link>/publications/kristan2015the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2015the/</guid>
      <description>&lt;p&gt;The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 62 trackers are presented. The number of tested trackers makes VOT2015 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2015 challenge that go beyond its VOT2014 predecessor are: (i) a new VOT2015 dataset twice as large as the VOT2014 dataset, with full annotation of targets by rotated bounding boxes and per-frame attributes, and (ii) an extension of the VOT2014 evaluation methodology by the introduction of a new performance measure. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Visual Object Tracking VOT2016 challenge results</title>
      <link>/publications/kristan2016the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2016the/</guid>
      <description>&lt;p&gt;The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers having been published at major computer vision conferences and journals in recent years. The number of tested state-of-the-art trackers makes VOT2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Visual Object Tracking VOT2017 Challenge Results</title>
      <link>/publications/kristan2017the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2017the/</guid>
      <description>&lt;p&gt;The Visual Object Tracking challenge VOT2017 is the fifth annual tracker benchmarking activity organized by the VOT initiative. Results of 51 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popular methodologies, and a new &amp;quot;real-time&amp;quot; experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. The performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. VOT2017 goes beyond its predecessors by (i) improving the VOT public dataset and introducing a separate VOT2017 sequestered dataset, (ii) introducing a real-time tracking experiment and (iii) releasing a redesigned toolkit that supports complex experiments. The dataset, the evaluation kit and the results are publicly available at the challenge website.&lt;/p&gt;</description>
    </item>
    <item>
      <title>The VOT2013 challenge: overview and additional results</title>
      <link>/publications/kristan2014the/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2014the/</guid>
      <description>&lt;p&gt;Visual tracking has attracted significant attention in the last few decades. The recent surge in the number of publications on tracking-related problems has made it almost impossible to follow the developments in the field. One of the reasons is that there is a lack of commonly accepted annotated datasets and standardized evaluation protocols that would allow objective comparison of different tracking methods. To address this issue, the Visual Object Tracking (VOT) challenge and workshop was organized in conjunction with ICCV2013. Researchers from academia as well as industry were invited to participate in the first VOT2013 challenge, which aimed at single-object visual trackers that do not apply pre-learned models of object appearance (model-free). In this paper we provide an overview of the VOT2013 challenge, point out its main results and document additional, previously unpublished experiments and results.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Towards a large-scale category detection with a distributed hierarchical compositional model</title>
      <link>/publications/tabernik2014towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2014towards/</guid>
      <description>&lt;p&gt;In this paper we evaluate a visual object detection system implemented on a distributed processing platform, presented in our previous work, with the goal of assessing the scalability of the system to large-scale category detection. While state-of-the-art detection methods based on sliding windows may not be capable of scaling to a higher number of categories, we provide initial evidence that a hierarchical compositional method called learned-hierarchy-of-parts (LHOP) may scale to a higher number of categories. We show, with the library trained on the MPEG-7 Shape database, that the method is capable of scaling from a system with 5 categories and an average response time of 6 seconds to a system with 70 categories and an average response time of 27 seconds.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Towards an Integrated Robot with Multiple Cognitive Functions</title>
      <link>/publications/hawes2007towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/hawes2007towards/</guid>
      <description>&lt;p&gt;We present integration mechanisms for combining heterogeneous components in a situated information processing system, illustrated by a cognitive robot able to collaborate with a human and display some understanding of its surroundings. These mechanisms include an architectural schema that encourages parallel and incremental information processing, and a method for binding information from distinct representations that, when faced with rapid change in the world, can maintain a coherent, though distributed, view of it. Provisional results are demonstrated in a robot combining vision, manipulation, language, planning and reasoning capabilities while interacting with a human and manipulable objects.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Towards Deep Compositional Networks</title>
      <link>/publications/tabernik2016towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2016towards/</guid>
      <description>&lt;p&gt;Hierarchical feature learning based on convolutional neural networks (CNN) has recently shown significant potential in various computer vision tasks. While allowing high-quality discriminative feature learning, the downside of CNNs is the lack of explicit structure in features, which often leads to overfitting, absence of reconstruction from partial observations and limited generative abilities. Explicit structure is inherent in hierarchical compositional models; however, these lack the ability to optimize a well-defined cost function. We propose a novel analytic model of a basic unit in a layered hierarchical model with both explicit compositional structure and a well-defined discriminative cost function. Our experiments on two datasets show that the proposed compositional model performs on a par with standard CNNs on discriminative tasks, while, due to explicit modeling of the structure in the feature units, affording straightforward visualization of parts and faster inference due to the separability of the units.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Towards fast and efficient methods for tracking players in sports</title>
      <link>/publications/kristan2006towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2006towards/</guid>
      <description>&lt;p&gt;An efficient algorithm for tracking a single player in a sporting match is presented in this paper. The sporting event is considered a semi-controlled environment, for which a set of closed-world assumptions regarding the visual as well as dynamical properties is derived. We show how these assumptions can be used in the context of particle filtering to arrive at a computationally fast and reliable tracker. The proposed tracker was evaluated on a demanding data set. When compared to several similar trackers that did not utilize all of the closed-world assumptions, the proposed tracker, on average, achieved better performance in terms of failure rate as well as position and prediction estimation.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Towards fast lighting condition inference for augmented reality</title>
      <link>/publications/modic2022towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/modic2022towards/</guid>
      <description></description>
    </item>
    <item>
      <title>Towards hierarchical representation of space</title>
      <link>/publications/ursic2011towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/ursic2011towards/</guid>
      <description>&lt;p&gt;Various robotic systems, performing efficient navigation, localization and place recognition in their surrounding environments, have already been developed. These systems possess a representation of space that is based on some engineered knowledge. There is still no system that would know about the structure of space in general, and whose knowledge would be obtained by learning. We believe that people learn about the properties of space through interaction with the environment. Therefore, since people perform very well in spatial tasks, we expect that a robotic system that obtained such knowledge would also perform better. With this in mind, we are developing an algorithm for learning a compositional hierarchical representation of space that is based on statistically significant observations. For now, we have focused on two-dimensional space, since many robots perceive their surroundings in two dimensions with the use of a laser range finder or a sonar. In this paper we evaluate our early work on this topic through a room categorization problem. Based on the lower layers of the hierarchy, we obtained encouraging classification results for three different types of rooms.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Towards large-scale traffic sign detection and recognition</title>
      <link>/publications/ursic2017towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/ursic2017towards/</guid>
      <description>&lt;p&gt;Recognition of traffic signs is a well-researched field in the computer vision community, with several commercial applications already available. However, the vast majority of existing approaches focuses on recognition of a relatively small number of traffic sign categories (about 50 or less). In this paper, we adopt a convolutional neural network (CNN) approach, i.e., the Faster R-CNN, to address the full pipeline of detection and recognition of more than 100 traffic sign categories, depicted in our novel dataset that was acquired on Slovenian roads. We report promising results on highly challenging traffic sign categories that have not yet been considered in previous works and we provide useful insights for CNN training.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Towards Learning Basic Object Affordances from Object Properties</title>
      <link>/publications/ridge2008towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/ridge2008towards/</guid>
      <description>&lt;p&gt;The capacity for learning to recognize and exploit environmental affordances is an important consideration for the design of current and future developmental robotic systems. We present a system that uses a robotic arm, camera systems and self-organizing maps to learn basic affordances of objects.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Towards on-the-fly multi-modal sensor calibration</title>
      <link>/publications/muhovic2022towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/muhovic2022towards/</guid>
      <description>&lt;p&gt;The robustness of autonomous vehicles can be significantly improved by using multiple sensor modalities. In addition to standard color cameras and the less frequently used thermal, multispectral and polarization cameras, LIDAR and RADAR are the most often used sensors, and are largely complementary to image sensors. However, the spatial calibration of such a system can be extremely challenging due to the difficulties in obtaining corresponding features from different modalities, as well as the inevitable parallax arising from different sensor positions. In this paper, we present a comprehensive strategy for calibrating such a system using a multi-modal target, and illustrate how such a strategy could be upgraded to a fully automatic, target-less calibration that would rely on features of the scene itself to align at least small sensor offsets from the calibrated position. We find that a high-level understanding of the scene is ideal for this task, as this way we can identify characteristic points for spatial alignment of sensor data of different modalities.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Towards Probabilistic Online Discriminative Models</title>
      <link>/publications/kristan2009towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/kristan2009towards/</guid>
      <description></description>
    </item>
    <item>
      <title>Towards Scalable Representations of Visual Categories: Learning a Hierarchy of parts.</title>
      <link>/publications/fidler2007towards/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/fidler2007towards/</guid>
      <description></description>
    </item>
    <item>
      <title>Tracking and Segmentation of Transparent Objects</title>
      <link>/publications/lukezic2024tracking/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/lukezic2024tracking/</guid>
      <description>&lt;p&gt;Transparent object tracking is a challenging, recently introduced problem. Existing methods predict the target location as a bounding box, which is often only a poor approximation of the actual location. A segmentation mask is a more accurate prediction, but benchmarks for evaluating the tracking and segmentation performance of transparent objects do not exist. In this paper we address this drawback by introducing a new dataset for tracking and segmentation of transparent objects. In particular, we sparsely re-annotate the existing bounding-box TOTB dataset with ground-truth segmentation masks. A comprehensive analysis demonstrates that existing segmentation methods perform surprisingly well on this task, indicating good design generalization and potential for transparent object tracking tasks. In addition, we show that existing bounding-box trackers can be easily transformed into segmentation trackers using modern mask refinement methods.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Tracking Non-Rigid Objects by Combining Local and Global Visual Model</title>
      <link>/publications/cehovin2009tracking/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cehovin2009tracking/</guid>
      <description>&lt;p&gt;We present an appearance-based tracker which hierarchically combines a global and a local visual model in two layers. The bottom layer contains the local part of the visual model and consists of a set of sub-trackers, each of them observing only a local aspect of the object. The top layer constrains and focuses the movement of the individual sub-trackers by accounting for the global part of the model: the spatial relations between the trackers. The visual model is updated by modifying the spatial relations and by reinitializing the sub-trackers which do not follow the target. By reinitializing a single or a small number of sub-trackers, the tracker can adapt only a part of its visual model to the new appearance of the object. This makes the tracker less vulnerable to drifting. The implementation of the two-layered tracker that uses SSD template matching for the sub-trackers is presented and tested on a demanding data set of non-rigid objects.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Traffic sign classification with batch and on-line linear support vector machines</title>
      <link>/publications/mandeljc2015traffic/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/mandeljc2015traffic/</guid>
      <description>&lt;p&gt;This paper presents a comprehensive benchmark of several feature types and colorspace representations on the task of traffic sign classification. We focus on linear Support Vector Machine classifiers, and test several multi-class formulations, as well as a formulation that allows on-line training and updates. Experiments on two standard traffic sign classification datasets show that despite their relative simplicity, these classifiers offer competitive performance, and ultimately allow design of a flexible classification system in the context of application for automatic maintenance of traffic signalization inventory.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Trans2k: Unlocking the Power of Deep Models for Transparent Object Tracking</title>
      <link>/publications/lukezic2022trans2k/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/lukezic2022trans2k/</guid>
      <description>&lt;p&gt;Visual object tracking has focused predominantly on opaque objects, while transparent object tracking has received very little attention. Motivated by the uniqueness of transparent objects in that their appearance is directly affected by the background, the first dedicated evaluation dataset has emerged recently. We contribute to this effort by proposing the first transparent object tracking training dataset, Trans2k, that consists of over 2k sequences with 104,343 images overall, annotated with bounding boxes and segmentation masks. Noting that transparent objects can be realistically rendered by modern renderers, we quantify domain-specific attributes and render the dataset containing visual attributes and tracking situations not covered in the existing object training datasets. We observe a consistent performance boost (up to 16%) across a diverse set of modern tracking architectures when trained using Trans2k, and show insights not previously possible due to the lack of appropriate training sets. The dataset and the rendering engine will be publicly released to unlock the power of modern learning-based trackers and foster new designs in transparent object tracking.&lt;/p&gt;</description>
    </item>
    <item>
      <title>TransFusion - A Transparency-Based Diffusion Model for Anomaly Detection</title>
      <link>/publications/fucka2024transfusion/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/fucka2024transfusion/</guid>
      <description>&lt;p&gt;Surface anomaly detection is a vital component in manufacturing inspection. Current discriminative methods follow a two-stage architecture composed of a reconstructive network followed by a discriminative network that relies on the reconstruction output. Currently used reconstructive networks often produce poor reconstructions that either still contain anomalies or lack details in anomaly-free regions. Discriminative methods are robust to some reconstructive network failures, suggesting that the discriminative network learns a strong normal appearance signal that the reconstructive networks miss. We reformulate the two-stage architecture into a single-stage iterative process that allows the exchange of information between the reconstruction and localization. We propose a novel transparency-based diffusion process where the transparency of anomalous regions is progressively increased, restoring their normal appearance accurately while maintaining the appearance of anomaly-free regions using localization cues of previous steps. We implement the proposed process as TRANSparency DifFUSION (TransFusion), a novel discriminative anomaly detection method that achieves state-of-the-art performance on both the VisA and the MVTec AD datasets, with an image-level AUROC of 98.5% and 99.2%, respectively. Code: &lt;a href=&#34;https://github.com/MaticFuc/ECCV_TransFusion&#34;&gt;https://github.com/MaticFuc/ECCV_TransFusion&lt;/a&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Unsupervised Learning of Basic Object Affordances from Object Properties</title>
      <link>/publications/ridge2009unsupervised/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/ridge2009unsupervised/</guid>
      <description>&lt;p&gt;Affordance learning has, in recent years, been generating heightened interest in both the cognitive vision and developmental robotics communities. In this paper we describe the development of a system that uses a robotic arm to interact with household objects on a table surface while observing the interactions using camera systems. Various computer vision methods are used to derive, firstly, object property features from intensity images and range data gathered before interaction and, subsequently, result features derived from video sequences gathered during and after interaction. We propose a novel affordance learning algorithm that automatically discretizes the result feature space in an unsupervised manner to form affordance classes that are then used as labels to train a supervised classifier in the object property feature space. This classifier may then be used to predict affordance classes, grounded in the result space, of novel objects based on object property observations.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Uporaba lokalnih značilnic v aplikacijah spoznavnega vida za urbana okolja (Local features in cognitive vision applications for urban environments)</title>
      <link>/publications/omercevic2006uporaba/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/omercevic2006uporaba/</guid>
      <description>&lt;p&gt;In this paper we present a performance evaluation of MSER and Hessian-Affine local feature detectors in a typical use case of cognitive vision applications in urban environments. By using a wide baseline stereo matching approach we try to find camera motion between a user image and images stored in a database. Running this application on test images twice while only changing the underlying local feature type has shown that the MSER local feature detector outperforms the Hessian-Affine detector. Additionally, we have shown that local features can perform well in cognitive vision applications for urban environments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Using discriminative analysis for improving hierarchical compositional models</title>
      <link>/publications/tabernik2014using/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2014using/</guid>
      <description>&lt;p&gt;In this paper we propose a method to extract discriminative information from a generative model produced by a compositional hierarchical approach. We present discriminative information as a score computed from a weighted summation of the activation vector. We base the activation vector on individual activations of features from a parse tree of the detection. We utilize the score to reduce false positive detections by removing generative models with poor discriminative information from the vocabulary and by thresholding the detections with a low discriminative score. We evaluate our approach on the ETHZ Shape Classes database, where we show a reduction in the number of false positives and a decrease in detection time without reducing the detection rate.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Vegetation segmentation for boosting performance of MSER feature detector</title>
      <link>/publications/omercevic2008vegetation/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/omercevic2008vegetation/</guid>
      <description>&lt;p&gt;In this paper, we present a new application of image segmentation algorithms and an adaptation of the image segmentation method of Tavakoli et al. to the problem of vegetation segmentation. While the traditional goal of image segmentation is to provide a figure/ground segmentation for object recognition or semantic segmentation to assist humans, we propose to use image segmentation in order to boost performance of local invariant feature detectors. In particular, we analyze the performance of MSER feature detector and we show that we can prune all features detected on vegetation to gain a 67% speed-up while accuracy of image matching does not decrease. The image segmentation method of Tavakoli et al. that we adapt to the problem of vegetation segmentation is based on singular value decomposition (SVD) of local image patches, where the sum of the smaller singular values describes the high frequency part of the patch. The results of the automatic segmentation of vegetation show that the average overlap between manual and automatic vegetation segmentation is 33% and that the automatic procedure for vegetation segmentation can prune 25% of MSER features, resulting in 33% faster image retrieval.&lt;/p&gt;</description>
    </item>
    <item>
      <title>ViCoS Eye - a webservice for visual object categorization</title>
      <link>/publications/tabernik2013vicos/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2013vicos/</guid>
      <description>&lt;p&gt;In our paper we present an architecture for a system capable of providing back-end support for a web service by running a variety of computer vision algorithms distributed across a cluster of machines. We divide the architecture into learning, real-time processing and request handling for the web service. We implement learning in the MapReduce domain with Hadoop jobs, while we implement real-time processing as a Storm application. An additional website and an Android application front-end are implemented as part of the web service to provide the user interface. We evaluate the system on our own cluster and show that the system running on a cluster of our size can learn the Caltech-101 dataset in 40 minutes, while real-time processing can achieve a response time of 2 seconds, which is adequate for a multitude of online applications.&lt;/p&gt;</description>
    </item>
    <item>
      <title>ViCoS Eye - Spletna storitev za kategorizacijo vizualnih objektov</title>
      <link>/publications/tabernik2013vicos-eye/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/tabernik2013vicos-eye/</guid>
      <description>&lt;p&gt;In this paper we present the architecture of a system for a web service that enables running advanced computer vision algorithms distributed across a larger number of computers. Architecturally, we divide the system into learning, real-time stream processing and a user interface for the web service. We implement learning in the MapReduce domain using Hadoop jobs, while we implement real-time processing as an application on the Storm system. As a web front-end for the end user, we additionally implement a website and an Android application. We test the system on our cluster of computers and show that the images from the Caltech-101 dataset can be learned in 40 minutes, while real-time stream processing can handle an individual input request in less than two seconds.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Video segmentation of water scenes using semi supervised learning</title>
      <link>/publications/cesnik2021video/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/cesnik2021video/</guid>
      <description>&lt;p&gt;Obstacle detection is a crucial component in unmanned surface vehicles to prevent collisions and unnecessary stopping due to false detections. Autonomous vessels are a relatively unexplored area in comparison to autonomous ground vehicles, thus there are far fewer densely annotated datasets for training modern obstacle detectors. Since manual acquisition of ground-truth segmentation data is time-consuming and expensive, a viable alternative is training with minimal supervision; we therefore evaluate unsupervised domain adaptation methods, trained on a labeled source dataset and an unlabeled target dataset. Four modern adaptation methods are tested (Intra-domain adaptation, Fourier domain adaptation, Instance matching and Bidirectional learning) for training the semantic segmentation network WaSR, which is currently the state-of-the-art for maritime obstacle detection. We consider the original WaSR as well as a modified version. The Fourier domain adaptation applied to a modified WaSR version outperforms the non-adapted original WaSR by 6.3% in F-measure.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Video-Based Ski Jump Style Scoring from Pose Trajectory</title>
      <link>/publications/stepec2022video-based/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/stepec2022video-based/</guid>
      <description>&lt;p&gt;Ski jumping is one of the oldest winter sports and has been part of the Winter Olympics from their very start in 1924. One of the components of the final score, which is used for ranking the competitors, is the style score, given by five judges. The goal of this work was to develop a prototype for automatic style scoring from videos. As the main source of information, the proposed approach uses the detected locations of the ski jumper's body parts and skis to capture the full-body movement through the entire ski jump. We extended a method for human pose estimation from images to also detect the tips and the tails of the skis and adapted it to the domain of ski jumping. We proposed a method to utilize the detected trajectories along with the scores given by real judges to build a model for predicting the style scores. The experimental results obtained on the data that we had available show that the proposed computer-vision-based system for automatic style scoring achieves an error comparable to the error of real judges.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Visual Information Abstraction For Interactive Robot Learning</title>
      <link>/publications/zhou2011visual/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/zhou2011visual/</guid>
      <description>&lt;p&gt;Semantic visual perception for knowledge acquisition plays an important role in human cognition, as well as in the learning process of any cognitive robot. In this paper, we present a visual information abstraction mechanism designed for continuously learning robotic systems. We generate spatial information in the scene by considering plane estimation and stereo line detection coherently within a unified probabilistic framework, and show how spaces of interest (SOIs) are generated and segmented using the spatial information. We also demonstrate how the existence of SOIs is validated in the long-term learning process. The proposed mechanism facilitates robust visual information abstraction which is a requirement for continuous interactive learning. Experiments demonstrate that with the refined spatial information, our approach provides accurate and plausible representation of visual objects.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Vrednotenje učinkovitosti Kalmanovega filtra pri sledenju ljudi</title>
      <link>/publications/perse2004vrednotenje/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/perse2004vrednotenje/</guid>
      <description>&lt;p&gt;Kalman filtering (KF) is a standard technique for estimating position and uncertainty of a moving object based on noisy measurements and knowledge of object dynamics. In this paper we apply the Kalman filter algorithm to estimate the motion parameters (position and speed) of a moving person from a video stream. To assess the efficiency of KF tracking, various experiments with and without KF were performed. The results showed that modeling of a person's motion and measurement noise using the KF algorithm can considerably improve the tracking performance in cases of human interactions and occlusions.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Weighted and robust incremental method for subspace learning</title>
      <link>/publications/skocaj2003weighted/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2003weighted/</guid>
      <description>&lt;p&gt;In this paper we present an appearance-based approach to mobile robot localization based on Canonical Correlation Analysis. The main idea is to learn the relation between the appearances of the environment from a number of training locations and coordinates of these locations using CCA and then to use this knowledge to estimate the position of the robot in the localization stage. We present results of several experiments, which show that this approach is faster and less demanding in terms of space than traditional PCA-based approach, however in its standard form it yields in general inferior localization results.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Weighted Incremental Subspace Learning</title>
      <link>/publications/skocaj2002weighted/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/skocaj2002weighted/</guid>
      <description></description>
    </item>
    <item>
      <title>Zaznavanje terasiranih pokrajin kot semantična segmentacija digitalnega modela višin</title>
      <link>/publications/glusic2021zaznavanje/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      <guid>/publications/glusic2021zaznavanje/</guid>
      <description></description>
    </item>
  </channel>
</rss>
