From Satisficing to Optimization in Reinforcement Learning
FWF project PAT6918624 (2025-2027)

Project leader: Ronald Ortner
Department für Mathematik und Informationstechnologie
Lehrstuhl für Informationstechnologie
Montanuniversität Leoben
Franz-Josef-Straße 18
A-8700 Leoben
Tel.: +43 3842 402-1503
Fax: +43 3842 402-1502
E-mail: ronald.ortner(at)unileoben.ac.at
PostDoc Position
For the FWF project "Reinforcement Learning: From Satisficing to Optimization" we are looking for a PostDoc with a strong background in mathematics (such as probability theory, statistics, optimization, or operations research) and an interest in reinforcement learning theory and the analysis of algorithms. Highly qualified PhD candidates may be considered as well.
The position is for 1.5 to 2 years, beginning from the negotiable project start in 2025, and will be filled as soon as a suitable candidate has been found. The gross salary is about 4,900 EUR per month (about 3,700 EUR net after taxes and insurance contributions) and includes benefits such as health insurance.
The University of Leoben is one of Austria's three universities of technology. Leoben is a small but pleasant town in the mountains; Austria's second-largest city, Graz, is within commuting distance (45 min by train), and the capital Vienna is not too far either (90 min by car, 2 h by train).
Applications should include a CV, up to three recent publications, and contact information for up to three senior academics who are willing to serve as references.
Please send applications as well as further enquiries to the project leader Ronald Ortner (ronald.ortner@unileoben.ac.at).
About the project
Reinforcement learning (RL) has been successful in applications, but theory has not been able to guarantee the reliability and robustness of the algorithms used. One reason is that RL theory focuses on optimization, while practical RL problems are task-oriented, so that optimality plays no role. We aim at a restart of RL theory by replacing the optimality paradigm with a criterion based on satisficing, which will ease the development and analysis of algorithms. In the precursor project we were able to give first regret bounds in the bandit as well as in the general MDP setting that, unlike classic regret bounds, are independent of the horizon (a rough sketch of the satisficing criterion is given after the list below). Now we are interested in
- optimizing the parameters in these bounds,
- obtaining corresponding lower bounds,
- taking a closer look at the regime between satisficing and optimization, and
- alternative performance measures that, unlike regret, are not worst-case.
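As a rough illustration of the satisficing criterion (a common formalization given here as an assumption, not necessarily the exact definition used in the project): in a multi-armed bandit with arm means \mu_a and optimal mean \mu^* = \max_a \mu_a, the classic regret after T steps compares the chosen arms a_1, ..., a_T to the optimum,

    R(T) = \sum_{t=1}^{T} (\mu^* - \mu_{a_t}),

which in general grows with the horizon T. Satisficing instead fixes a satisfaction level S and only penalizes steps in which the chosen arm falls short of S,

    R_S(T) = \sum_{t=1}^{T} (S - \mu_{a_t})^+,

so that whenever some arm satisfies \mu_a >= S, a learner that eventually settles on such an arm incurs no further satisficing regret, which is how bounds independent of the horizon become possible.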