University of California San Diego
        
        ****************************
Math 278C - Optimization and Data Science Seminar
Mengdi Wang
Princeton University
Regret Bounds of Model-Based Reinforcement Learning
Abstract:
We discuss some recent results on model-based methods for online reinforcement learning (RL). The goal of online RL is to adaptively explore an unknown environment and learn to act with provable regret bounds. In particular, we focus on finite-horizon episodic RL where the unknown transition law belongs to a generic family of models. We propose a model-based 'value-targeted regression' RL algorithm that is based on the optimism principle: in each episode, the set of models that are 'consistent' with the data collected is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting values, as determined by the last value estimate, along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret, for an arbitrary family of transition models, using the notion of the so-called eluder dimension proposed by Russo & Van Roy (2014).
Host: Jiawang Nie
April 14, 2021
3:00 PM
Meeting ID: 982 9781 6626 Password: 278csp21
****************************

