Projects at the Chair for Information Technology

One of the striking differences between current reinforcement learning algorithms and early human learning is that animals and infants appear to explore their environments with autonomous purpose, in a manner appropriate to their current level of skill. We propose to investigate exploration, the learning of multiple skills, and the construction of higher-level representations together: learning multiple skills facilitates the construction of higher-level representations, and when an agent proposes a new higher-level representation, it then has a natural autonomous motivation to explore in order to improve, extend, and validate the representation it has proposed. The main objective of this work package is to develop algorithms and underlying theory for autonomously motivated exploration and skill acquisition. First, we will develop appropriate notions of intrinsic reward and autonomously motivated exploration, accompanied by behavioural algorithms derived from these notions. Second, we will develop skill-based representations of complex domains which allow previously learned skills to be composed into new skills.

1. Autonomously motivated exploration. We will develop and evaluate models for intrinsic motivation and curiosity-driven exploration with respect to their ability to affect the future behaviour of the learning agent. Useful autonomous exploration should enable the agent to quickly adapt to new tasks and to perform them efficiently. We will formally link appropriate notions of intrinsic reward to the overall behaviour of the agent; a minimal illustration of one such reward is sketched after item 2 below.

2. Skill acquisition. Skill acquisition allows an agent to perform more complex tasks by composing previously learned skills. A skill is the agent’s ability to perform a certain subtask or reach a certain state in a reliable manner. We propose that an intelligent agent should construct a skill-based understanding of its environment, in the sense that it should have (and/or build) a representation of what it knows about how to achieve goals and which skills to apply. We will develop algorithms for constructing such skill-based representations of the environment; the second sketch below illustrates one possible form such a representation could take.
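
As a minimal illustration of item 1 (an assumption on our part, not the specific notion of intrinsic reward to be developed in this project), the sketch below uses a simple count-based novelty bonus: rarely visited states yield a high intrinsic reward, so an agent maximizing this bonus is driven to explore.

```python
from collections import defaultdict
import math

class CountBasedCuriosity:
    """Illustrative intrinsic reward: a novelty bonus that decays with visit counts.

    States are assumed to be hashable; the bonus beta / sqrt(N(s)) is large for
    rarely visited states, so an agent maximizing it explores autonomously.
    """

    def __init__(self, beta=1.0):
        self.beta = beta
        self.visit_counts = defaultdict(int)

    def intrinsic_reward(self, state):
        # Count the visit, then return a bonus that shrinks as the state becomes familiar.
        self.visit_counts[state] += 1
        return self.beta / math.sqrt(self.visit_counts[state])
```

In a full agent, such an intrinsic reward would typically be added to (or substituted for) the external reward before applying a standard reinforcement learning algorithm.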
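
For item 2, one way to make the notions of a skill and of skill composition concrete (purely illustrative; the class and function names are our own) is to represent a skill by an initiation condition, a sub-policy, and a termination condition, and to compose skills by chaining them:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Skill:
    """A skill: a sub-policy that can be started when applicable and runs until it terminates."""
    name: str
    can_start: Callable[[Any], bool]   # initiation condition on the current state
    policy: Callable[[Any], Any]       # maps state -> action while the skill is active
    is_done: Callable[[Any], bool]     # termination condition (e.g. the subtask's goal is reached)

def chain(first: Skill, second: Skill) -> Skill:
    """Compose two skills into a new one: run `first` until it terminates, then run `second`."""
    active = {"skill": first}  # mutable cell so the closures below can switch skills

    def policy(state):
        if active["skill"] is first and first.is_done(state):
            active["skill"] = second
        return active["skill"].policy(state)

    return Skill(
        name=f"{first.name}->{second.name}",
        can_start=first.can_start,
        policy=policy,
        is_done=lambda state: active["skill"] is second and second.is_done(state),
    )
```

A skill-based representation of the environment could then record, for each subgoal, which such skills are known to reach it reliably.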

Exploration vs. exploitation trade-offs in query by example searches

In the query by example search paradigm the user is steering the search by providing relevance feedback on the examples presented by the search engine. The objective of a search strategy is to provide the user with satisfying results as quickly as possible. One possible strategy is to select the examples with the highest likelihood of satisfying the user’s query. But this is not necessarily optimal if more information about the user’s query can be obtained from feedback on different examples. Thus, a search strategy faces an exploration–exploitation dilemma: should it present an example for which informative feedback is to be expected, or should it present an example which is more likely to match the user’s query? This dilemma is even more severe when the current hypothesis about the user’s query favors examples in the vicinity of past relevant examples but ignores possibly more relevant examples in an unexplored part of the search space.

A similar dilemma exists for a web content provider who needs to display content which is interesting to the user. The system may either choose to display content which is likely to be interesting given the current statistics (exploitation), or it may display content for which statistics are insufficient (exploration).

As a basic scenario for the exploration–exploitation dilemma, the bandit problem has been studied in the statistics community and more recently also in the machine learning community. While algorithms for the basic bandit problem are well developed, only a few algorithms exist for more involved scenarios. In the context of this project, such scenarios need to deal with available side information as well as with delayed feedback. Whereas the basic bandit problem adequately models choices (e.g. of examples to be presented to the user) which are independent of each other, extensions of the bandit problem are needed to model choices which can be related to each other through their descriptions and comparison metrics. Descriptions of the possible choices provide side information which helps the system to improve its performance.
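
To make the role of side information concrete, the following sketch (a standard contextual-bandit rule in the style of LinUCB, given here only as an illustration rather than as the algorithm to be developed in this project) describes each possible choice by a feature vector and selects the choice with the highest optimistic reward estimate:

```python
import numpy as np

class LinUCB:
    """Upper-confidence-bound rule for bandits with side information (LinUCB-style sketch).

    Each choice (e.g. an example to present to the user) is described by a feature
    vector x; the expected reward is modelled as a linear function of x, and the
    choice with the highest optimistic estimate is selected.
    """

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha          # width of the confidence bonus
        self.A = np.eye(dim)        # regularized Gram matrix of observed features
        self.b = np.zeros(dim)      # accumulated reward-weighted features

    def select(self, candidate_features):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b      # current least-squares estimate of the reward model
        scores = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x) for x in candidate_features]
        return int(np.argmax(scores))

    def update(self, chosen_features, reward):
        self.A += np.outer(chosen_features, chosen_features)
        self.b += reward * np.asarray(chosen_features, dtype=float)
```

In the search setting, the user's relevance feedback would play the role of the reward.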

If a search continues for several steps, the benefit of an example chosen by the system is not immediately revealed but can be assessed only when the user finishes her search. Thus, mechanisms for dealing with such delayed feedback are required. In particular, reinforcement learning is a suitable approach in this respect.
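
A minimal sketch of this delayed-feedback setting, under simplifying assumptions of our own (a single scalar outcome at the end of the search, discounted Monte-Carlo style credit assignment):

```python
def assign_delayed_credit(shown_examples, final_outcome, discount=0.95):
    """Propagate a search-final outcome back to every example shown during the session.

    `shown_examples` lists the presented examples from earliest to latest; each one
    receives the discounted final outcome as its return, so earlier choices get less
    credit for a success that is only observed at the end of the search.
    """
    horizon = len(shown_examples)
    return [
        (example, final_outcome * discount ** (horizon - 1 - step))
        for step, example in enumerate(shown_examples)
    ]
```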

We will develop and provide algorithms for dealing with the exploration–exploitation dilemma to improve results for query by example search. Building on our previous work on dealing with side information and on reinforcement learning, we will make algorithms available which are tailored to the specific needs of searches with relevance feedback.

Learning for Cognitive Vision

The main focus of this project part can be summarized as an attempt to create more autonomous learning algorithms for vision systems. We want to decrease the amount of human intervention necessary for successful learning: selection and labelling of training images, segmentation by hand, annotation of videos, etc. The motivation to focus research in this direction is twofold. First, human intervention is expensive, and large-scale vision systems seem feasible only if the need for such intervention can be reduced. Second, learning in biological cognitive systems seems to be rather autonomous, with only very weak external supervision or feedback. Several interrelated research tasks will investigate approaches towards such more autonomous cognitive vision systems:

  1. The conservative learning framework is a bootstrap approach to learning where the performance of different but communicating learning algorithms is improved through co-training with minimal human intervention.
  2. Complementing conservative learning, active learning requests labelling only for the most informative examples. We will extend our boosting algorithm for object categorization to active learning.
  3. As a main tool for conservative learning and active learning we will research online boosting (see the sketch after this list).
  4. As a more technical goal we want to reduce the amount of manual tuning for vision algorithms and instead use learning to improve the performance of vision components.
  5. Finally, we will tackle the very ambitious goal of aggregating visual primitives into combined primitives to obtain short descriptions of scenes and objects.
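
The following sketch illustrates the kind of online boosting referred to in item 3, in the spirit of Oza and Russell's online boosting; the weak-learner interface (partial_fit, predict) is an assumption made for illustration, not a commitment to a particular implementation. Each incoming example is passed to the weak learners in sequence with a Poisson-sampled weight that grows whenever earlier learners misclassify it:

```python
import numpy as np

class OnlineBooster:
    """Simplified online boosting sketch in the spirit of Oza & Russell (2001).

    Weak learners are assumed to expose partial_fit(x, y) and predict(x). Each incoming
    example is passed to the learners in sequence with a Poisson-sampled weight that
    grows whenever earlier learners misclassify it, so later learners focus on hard cases.
    """

    def __init__(self, weak_learners, seed=None):
        self.learners = weak_learners
        self.correct = np.ones(len(weak_learners))  # weighted count of correct predictions
        self.wrong = np.ones(len(weak_learners))    # weighted count of mistakes
        self.rng = np.random.default_rng(seed)

    def update(self, x, y):
        lam = 1.0
        for m, learner in enumerate(self.learners):
            for _ in range(self.rng.poisson(lam)):   # emulate an example weight of lam
                learner.partial_fit(x, y)
            if learner.predict(x) == y:
                self.correct[m] += lam
                lam *= (self.correct[m] + self.wrong[m]) / (2.0 * self.correct[m])
            else:
                self.wrong[m] += lam
                lam *= (self.correct[m] + self.wrong[m]) / (2.0 * self.wrong[m])

    def predict(self, x):
        # Weighted vote: each learner is weighted by the log-odds of its estimated accuracy.
        votes = {}
        for m, learner in enumerate(self.learners):
            eps = self.wrong[m] / (self.correct[m] + self.wrong[m])
            weight = np.log((1.0 - eps) / eps)
            label = learner.predict(x)
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get)
```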

Online performance of reinforcement learning with internal reward functions

We consider reinforcement learning under the paradigm of online learning, where the objective is good performance during the whole learning process. This is in contrast to the typical analysis of reinforcement learning, where one is interested in eventually learning a near-optimal strategy. We will conduct a mathematically rigorous analysis of reinforcement learning under this alternative paradigm and expect novel and efficient learning algorithms as a result.
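
One standard way to make "good performance during the whole learning process" precise, stated here only as a common formulation we expect to build on, is the regret after T steps with respect to the average reward ρ* of an optimal strategy,

\Delta(T) = T\,\rho^{*} - \sum_{t=1}^{T} r_t ,

where r_t is the reward collected at step t. A good online learner keeps \Delta(T) small, i.e. sublinear in T, whereas an algorithm that only converges to a near-optimal strategy in the limit may still incur large regret early in the learning process.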

We believe that for intelligent interfaces the proposed online paradigm provides significant benefits, as such an interface would deliver reasonable performance even early in the training process.

The starting point for our analysis will be the method of upper confidence bounds, which has already proven very effective for simplified versions of reinforcement learning. To carry the analysis over to realistic problems with large or continuous state spaces, we will estimate the utility of states by value function approximation through kernel regression. Kernel regression is a well-founded function approximation method related to support vector machines and holds significant promise for reinforcement learning.
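
As a sketch of what value function approximation by kernel regression could look like (a Nadaraya-Watson estimator with parameter names of our own choosing, not the specific estimator to be developed here), the value of a query state is estimated as a kernel-weighted average of the returns observed from nearby visited states:

```python
import numpy as np

def kernel_value_estimate(query_state, visited_states, observed_returns, bandwidth=1.0):
    """Nadaraya-Watson kernel regression estimate of a state's value.

    The value of `query_state` is the Gaussian-kernel-weighted average of the returns
    observed from previously visited states, so nearby experience generalizes smoothly
    to states that have never been seen before.
    """
    states = np.atleast_2d(np.asarray(visited_states, dtype=float))
    returns = np.asarray(observed_returns, dtype=float)
    sq_dists = np.sum((states - np.asarray(query_state, dtype=float)) ** 2, axis=1)
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return float(weights @ returns / weights.sum())
```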

Finally, we are interested in methods for reinforcement learning where little or no external reinforcement is provided to the learning agent. Since useful external rewards are often hard to come by, we will investigate the creation of internal reward functions which drive the consolidation and extension of learned knowledge, mimicking cognitive behaviour.

Sequential Forecasting and Partial Feedback: Applications to Machine Learning